# Privacy-Preserving Similarity Search
A Python package for privacy-preserving similarity search over large DataFrames that contain PII (Personally Identifiable Information). Typical uses are finding duplicate customers, matching similar purchase histories, and entity resolution, with privacy protection applied to sensitive columns before records are embedded and indexed.
## Features
- **Multiple Privacy Modes**: Differential Privacy, Homomorphic Encryption, Secure Hashing
- **Scalable Search**: Built on FAISS for billion-scale vector similarity search
- **Advanced Embeddings**: Deep learning-based embeddings using Sentence Transformers
- **Smart Blocking**: LSH and clustering-based candidate generation
- **Flexible API**: Easy-to-use interface for DataFrame operations
- **Research-Backed**: Builds on published industry and academic research (see Research Background)
## Architecture
The package implements a multi-layered architecture:
1. **Privacy Protection Layer**: Transforms sensitive data using DP, HE, or secure hashing
2. **Embedding Generation**: Converts text and structured data into dense vector representations
3. **Blocking/Filtering**: Reduces search space using LSH or clustering techniques
4. **Similarity Search**: FAISS-based nearest neighbor search, either exact (Flat) or approximate (HNSW, IVF)
5. **Post-Processing**: Refinement and deduplication
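
To make the blocking step (layer 3) concrete, the sketch below finds candidate pairs with MinHash signatures and banding. It is a minimal, standard-library-only illustration and not the package's implementation: the function names (`shingles`, `minhash_signature`, `lsh_candidate_pairs`) and the band and row settings are assumptions, and it hashes raw strings for readability, whereas the package is described as blocking on privacy-protected vectors.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, n=2):
    """Return the set of lowercase character n-grams of `text`."""
    t = text.lower()
    return {t[i:i + n] for i in range(len(t) - n + 1)}


def minhash_signature(items, num_hashes=16):
    """Return one minimum hash value per seeded hash function."""
    return [
        min(
            int.from_bytes(
                hashlib.sha256(f"{seed}:{item}".encode()).digest()[:8], "big"
            )
            for item in items
        )
        for seed in range(num_hashes)
    ]


def lsh_candidate_pairs(records, bands=8, rows=2):
    """Return ID pairs whose signatures agree on at least one band."""
    buckets = defaultdict(set)
    for rec_id, text in records.items():
        sig = minhash_signature(shingles(text), bands * rows)
        for b in range(bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, band)].add(rec_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical records: only pairs sharing a bucket are compared later.
records = {1: "123 Main St", 2: "123 main st", 3: "zzzz qqqq"}
print(lsh_candidate_pairs(records))
```

More bands with fewer rows per band make the filter more permissive (higher recall, more candidates); fewer bands with more rows make it stricter.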
### Component Diagram
```mermaid
graph TB
    subgraph Input["Input Layer"]
        DF[DataFrame with PII]
    end

    subgraph Privacy["Privacy Protection Layer"]
        DP[Differential Privacy<br/>DP-MinHash, DP-OPH]
        HE[Homomorphic Encryption<br/>Secure Inner Products]
        SH[Secure Hashing<br/>Bloom Filters, k-Anonymity]
    end

    subgraph Embeddings["Embedding Generation Layer"]
        TE[Text Embeddings<br/>Sentence Transformers]
        PII[PII Tokenizer<br/>Name/Email/Address]
        NF[Numeric Features<br/>Scaling, Encoding]
    end

    subgraph Blocking["Blocking/Filtering Layer"]
        LSH[LSH Blocking<br/>Random Projection, MinHash]
        CLUSTER[Clustering<br/>K-Means, Canopy]
        DYNAMIC[Dynamic Bucketing<br/>Adaptive Radius]
    end

    subgraph Search["Similarity Search Layer"]
        FLAT[FAISS Flat<br/>Exact Search]
        HNSW[FAISS HNSW<br/>Graph-based ANN]
        IVF[FAISS IVF<br/>Quantized Search]
    end

    subgraph Output["Output Layer"]
        RESULTS[Search Results<br/>Top-k Neighbors]
        DUPES[Duplicate Groups<br/>Connected Components]
    end

    DF --> Privacy
    DP --> Embeddings
    HE --> Embeddings
    SH --> Embeddings

    Embeddings --> TE
    Embeddings --> PII
    Embeddings --> NF

    TE --> Blocking
    PII --> Blocking
    NF --> Blocking

    LSH --> Search
    CLUSTER --> Search
    DYNAMIC --> Search

    FLAT --> Output
    HNSW --> Output
    IVF --> Output

    RESULTS --> USER[User Application]
    DUPES --> USER

    style Privacy fill:#e1f5ff
    style Embeddings fill:#fff3cd
    style Blocking fill:#d4edda
    style Search fill:#f8d7da
    style Output fill:#d1ecf1
```
### Data Flow
```mermaid
sequenceDiagram
    participant U as User
    participant API as Core API
    participant P as Privacy Layer
    participant E as Embeddings
    participant B as Blocking
    participant S as FAISS Search

    U->>API: fit(df, sensitive_columns)
    API->>P: apply_privacy(sensitive_data)
    P-->>API: protected_data
    API->>E: generate_embeddings(data)
    E-->>API: vectors
    API->>B: create_blocks(vectors)
    B-->>API: candidate_sets
    API->>S: build_index(vectors)
    S-->>API: index
    API-->>U: fitted_model

    U->>API: search(query_df, k=10)
    API->>P: apply_privacy(query_data)
    P-->>API: protected_query
    API->>E: generate_embeddings(query)
    E-->>API: query_vectors
    API->>B: filter_candidates(query_vectors)
    B-->>API: candidate_ids
    API->>S: search(query_vectors, k)
    S-->>API: neighbors, distances
    API-->>U: results_df
```
## Installation
### From PyPI (Recommended)
```bash
pip install privacy-similarity
```
### From Source
```bash
git clone https://github.com/alexandernicholson/python-similarity.git
cd python-similarity
pip install -r requirements.txt
pip install -e .
```
### GPU Support
For GPU acceleration (5-10x faster on large datasets):
```bash
pip install privacy-similarity
pip install faiss-gpu
```
## Quick Start
```python
from privacy_similarity import PrivacyPreservingSimilaritySearch
import pandas as pd

# Create sample DataFrame with customer data
df = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'name': ['John Smith', 'Jon Smith', 'Jane Doe', 'John A. Smith', 'Alice Johnson'],
    'email': ['john@example.com', 'jon@example.com', 'jane@example.com', 'jsmith@example.com', 'alice@example.com'],
    'address': ['123 Main St', '123 Main Street', '456 Oak Ave', '123 Main St.', '789 Pine Rd'],
    'interests': ['sports, technology', 'tech, sports', 'reading, cooking', 'technology, sports', 'cooking, travel']
})

# Initialize with privacy and search parameters
searcher = PrivacyPreservingSimilaritySearch(
    privacy_mode='differential_privacy',  # or 'homomorphic', 'secure_hashing'
    epsilon=1.0,  # Differential privacy parameter
    embedding_model='sentence-transformers/all-MiniLM-L6-v2',
    index_type='HNSW',  # or 'IVF-HNSW', 'IVF-PQ' for larger datasets
    use_gpu=False
)

# Fit the model on your data
searcher.fit(
    df,
    sensitive_columns=['name', 'email', 'address'],
    embedding_columns=['interests'],
    id_column='customer_id'
)

# Find duplicates
duplicates = searcher.find_duplicates(threshold=0.85)
print(f"Found {len(duplicates)} duplicate groups")

# Search for similar records
query_df = pd.DataFrame({
    'name': ['Jonathan Smith'],
    'email': ['j.smith@example.com'],
    'address': ['123 Main Street'],
    'interests': ['sports and tech']
})

results = searcher.search(query_df, k=3, similarity_threshold=0.7)
print(results)
```
## Privacy Modes
### Differential Privacy (Recommended)
- **Overhead**: 1.5-2x
- **Use Case**: Statistical privacy guarantees for analytics
- **Parameters**: `epsilon` (privacy budget, lower = more private)
```python
searcher = PrivacyPreservingSimilaritySearch(
    privacy_mode='differential_privacy',
    epsilon=1.0  # Typical range: 0.1 (high privacy) to 10.0 (low privacy)
)
```
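
To show how `epsilon` trades privacy against accuracy, here is a minimal sketch of randomized response on a bit vector, where each bit is reported truthfully with probability e^ε / (1 + e^ε). This mechanism is swapped in for illustration only; the package's DP-MinHash and DP-OPH mechanisms are not reproduced here, and `keep_probability` and `randomized_response` are hypothetical names. The guarantee is per bit, so the privacy cost of a whole vector adds up across its bits.

```python
import math
import random


def keep_probability(epsilon):
    """Probability of reporting a bit truthfully under randomized response."""
    return math.exp(epsilon) / (1.0 + math.exp(epsilon))


def randomized_response(bits, epsilon, rng):
    """Flip each bit independently with probability 1 - keep_probability."""
    p_keep = keep_probability(epsilon)
    return [b if rng.random() < p_keep else 1 - b for b in bits]


rng = random.Random(0)  # seeded only so the demo is repeatable
bits = [1, 0, 1, 1, 0, 0, 1, 0] * 32
for eps in (0.1, 1.0, 10.0):
    noisy = randomized_response(bits, eps, rng)
    flipped = sum(a != b for a, b in zip(bits, noisy))
    print(f"epsilon={eps}: keep probability {keep_probability(eps):.3f}, "
          f"{flipped}/{len(bits)} bits flipped")
```

At small `epsilon` the keep probability approaches 0.5, so the output is close to random noise; at large `epsilon` almost every bit is reported unchanged.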
### Homomorphic Encryption
- **Overhead**: 10-100x
- **Use Case**: Cryptographic guarantees for sensitive data
- **Parameters**: `encryption_key_size`
```python
searcher = PrivacyPreservingSimilaritySearch(
    privacy_mode='homomorphic',
    encryption_key_size=2048
)
```
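
The idea behind a secure inner product can be sketched with a toy additively homomorphic scheme. The example below uses textbook Paillier with tiny primes, chosen here purely for illustration: this README does not say which scheme the package uses, the primes are far too small to be secure, and all function names are hypothetical. It computes the inner product of an encrypted integer vector with a plaintext weight vector without decrypting the individual entries.

```python
import math
import random


def generate_keypair(p=1789, q=1861):
    """Toy Paillier keys from two small primes (insecure, demo only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # valid shortcut because the generator is g = n + 1
    return n, (lam, mu)


def encrypt(n, m, rng):
    """Encrypt integer 0 <= m < n as (1 + m*n) * r^n mod n^2."""
    n_sq = n * n
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return ((1 + m * n) % n_sq) * pow(r, n, n_sq) % n_sq


def decrypt(n, private_key, c):
    """Recover m from ciphertext c using L(x) = (x - 1) // n."""
    lam, mu = private_key
    x = pow(c, lam, n * n)
    return ((x - 1) // n) * mu % n


def encrypted_inner_product(n, enc_vector, weights):
    """Return an encryption of sum(w_i * x_i) from encryptions of x_i."""
    n_sq = n * n
    acc = 1  # the value 1 is a trivial encryption of 0
    for c, w in zip(enc_vector, weights):
        # Raising a ciphertext to w multiplies the plaintext by w;
        # multiplying ciphertexts adds their plaintexts.
        acc = acc * pow(c, w, n_sq) % n_sq
    return acc


rng = random.Random(0)  # demo only; real code needs a CSPRNG such as `secrets`
n, private_key = generate_keypair()
stored = [encrypt(n, x, rng) for x in [3, 1, 4]]
result = encrypted_inner_product(n, stored, [2, 7, 1])
print(decrypt(n, private_key, result))  # 2*3 + 7*1 + 1*4 = 17
```

Every homomorphic operation is a modular exponentiation on numbers twice the key size, which is where the large overhead of this mode comes from.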
### Secure Hashing
- **Overhead**: About 1x (negligible)
- **Use Case**: Internal use, or data that is already public
- **Parameters**: `salt` (a secret random string mixed into every hash)
```python
searcher = PrivacyPreservingSimilaritySearch(
    privacy_mode='secure_hashing',
    salt='your-random-salt-string'
)
```
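
As a rough picture of how salted hashing can still support fuzzy matching, the sketch below encodes a string as a Bloom filter of keyed bigram hashes and compares two encodings with the Dice coefficient. It is an assumption-laden illustration rather than the package's code: `bigrams`, `bloom_encode`, and `dice_similarity` are hypothetical names, and the filter size and hash count are arbitrary.

```python
import hashlib
import hmac


def bigrams(text):
    """Return the padded, lowercase character bigrams of `text`."""
    t = f"_{text.strip().lower()}_"
    return {t[i:i + 2] for i in range(len(t) - 1)}


def bloom_encode(text, salt, num_bits=256, num_hashes=4):
    """Return the set of bit positions set by the salted bigrams of `text`."""
    positions = set()
    for gram in bigrams(text):
        for k in range(num_hashes):
            digest = hmac.new(
                salt.encode(), f"{k}:{gram}".encode(), hashlib.sha256
            ).digest()
            positions.add(int.from_bytes(digest[:8], "big") % num_bits)
    return positions


def dice_similarity(a, b):
    """Dice coefficient between two sets of bit positions."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


salt = "your-random-salt-string"
john = bloom_encode("John Smith", salt)
jon = bloom_encode("Jon Smith", salt)
alice = bloom_encode("Alice Johnson", salt)
print(f"John vs Jon:   {dice_similarity(john, jon):.2f}")
print(f"John vs Alice: {dice_similarity(john, alice):.2f}")
```

Encodings are only comparable when they were produced with the same salt, and anyone who learns the salt can test guesses against the stored hashes, which is why this mode offers the weakest protection of the three.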
## Index Types
- **HNSW**: Best for <10M records, excellent accuracy and speed
- **IVF-HNSW**: Best for 10M-1B records, balanced performance
- **IVF-PQ**: Best for 1B+ records with memory constraints
## Performance Characteristics
| Index Type | Dataset Size | QPS | Recall | Memory |
|-----------|--------------|-----|--------|---------|
| HNSW | <10M | 10^4-10^5 | >95% | High |
| IVF-HNSW | 10M-1B | 10^3-10^4 | >90% | Medium |
| IVF-PQ | 1B+ | 10^2-10^3 | >85% | Low |
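
The size thresholds in the table can be turned into a small helper for picking `index_type`. This is a sketch of the table's guidance only: `choose_index_type` is a hypothetical function, not part of the package, and memory limits or recall targets may justify a different choice.

```python
def choose_index_type(n_records):
    """Map a record count to the index type suggested by the table above."""
    if n_records < 10_000_000:
        return "HNSW"      # <10M: highest recall, highest memory use
    if n_records < 1_000_000_000:
        return "IVF-HNSW"  # 10M-1B: balanced
    return "IVF-PQ"        # 1B+: lowest memory, lowest recall


for size in (50_000, 25_000_000, 2_000_000_000):
    print(size, choose_index_type(size))
```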
## Use Cases
### Customer Deduplication
```python
# Find duplicate customer records
duplicates = searcher.find_duplicates(
    threshold=0.9,
    max_cluster_size=100
)

# Get detailed match information
for group in duplicates:
    print(f"Duplicate group: {group['ids']}")
    print(f"Confidence: {group['similarity']}")
```
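
The architecture diagram labels duplicate groups as connected components. A minimal sketch of that grouping step, assuming pairwise similarity scores are already available, is a union-find over every pair at or above the threshold. `group_duplicates` is a hypothetical helper and returns plain ID lists, not the `ids`/`similarity` dictionaries shown above.

```python
def group_duplicates(pairs, threshold):
    """Group IDs linked by any chain of pairs with similarity >= threshold.

    `pairs` is an iterable of (id_a, id_b, similarity) tuples.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b, sim in pairs:
        if sim >= threshold:
            parent[find(a)] = find(b)  # merge the two components

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), []).append(node)
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)


# Hypothetical scores: 1-2 and 2-4 match, so 1, 2, and 4 form one group.
scored_pairs = [(1, 2, 0.95), (2, 4, 0.91), (3, 5, 0.40)]
print(group_duplicates(scored_pairs, threshold=0.9))
```

Because membership is transitive, a group can contain records that are not directly similar to each other, which is one reason to cap group size with an option such as `max_cluster_size`.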
### Similar Customer Discovery
```python
# Find customers with similar interests or purchase history
similar = searcher.search(
    query_df,
    k=10,
    similarity_threshold=0.75,
    return_distances=True
)
```
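
To clarify what `k` and `similarity_threshold` control, here is a brute-force sketch of thresholded top-k search by cosine similarity, which is the exact computation that approximate indexes try to reproduce quickly. The `cosine` and `top_k` functions and the toy vectors are illustrative assumptions, not the package's API.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query, vectors, k, similarity_threshold=0.0):
    """Return up to k (id, similarity) pairs at or above the threshold."""
    scored = [(cosine(query, vec), rec_id) for rec_id, vec in vectors.items()]
    scored = [(s, i) for s, i in scored if s >= similarity_threshold]
    scored.sort(key=lambda t: (-t[0], t[1]))  # best first, ties by id
    return [(i, s) for s, i in scored[:k]]


vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
print(top_k([1.0, 0.0], vectors, k=2, similarity_threshold=0.5))
```

The threshold is applied before truncation to `k`, so a query can return fewer than `k` rows when few records are similar enough.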
### Privacy-Preserving Analytics
```python
# Perform analytics on sensitive data without exposing PII
searcher = PrivacyPreservingSimilaritySearch(
    privacy_mode='differential_privacy',
    epsilon=0.1  # High privacy
)
```
## Advanced Features
### Custom Embeddings
```python
# Use your own embedding model
from sentence_transformers import SentenceTransformer

custom_model = SentenceTransformer('your-model-name')
searcher = PrivacyPreservingSimilaritySearch(
    embedding_model=custom_model
)
```
### Batch Processing
```python
# Process large datasets in batches
searcher.fit_batch(
    df,
    batch_size=10000,
    n_jobs=-1  # Use all CPU cores
)
```
### Incremental Updates
```python
# Add new records to existing index
searcher.add_records(new_df)
```
## Research Background
This package is built on state-of-the-art research from:
- **Meta/Facebook**: FAISS library for billion-scale vector search
- **Amazon**: Semantic product search and entity resolution
- **Airbnb**: Real-time personalization using embeddings
- **Academic Research**: Differential privacy, homomorphic encryption, LSH
Key papers implemented:
- FAISS: A library for efficient similarity search (Meta AI)
- Differential Privacy for MinHash and LSH
- Privacy-Preserving Text Embeddings with Homomorphic Encryption
- Neural LSH for Entity Blocking
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## License
MIT License
## Citation
If you use this package in your research, please cite:
```bibtex
@software{privacy_similarity,
title={Privacy-Preserving Similarity Search},
author={Alexander Nicholson},
year={2025},
url={https://github.com/alexandernicholson/python-similarity}
}
```
Raw data
{
"_id": null,
"home_page": "https://github.com/alexandernicholson/python-similarity",
"name": "privacy-similarity",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": null,
"keywords": "privacy, similarity-search, differential-privacy, pii, embeddings, faiss, deduplication",
"author": "Alexander Nicholson",
"author_email": null,
"download_url": "https://files.pythonhosted.org/packages/7d/85/6d79194cb33c773f42d885893d30e3152c6f5d4725059a529b68b56aee9d/privacy_similarity-0.1.4.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": null,
"summary": "Privacy-preserving similarity search for massive DataFrames with PII",
"version": "0.1.4",
"project_urls": {
"Changelog": "https://github.com/alexandernicholson/python-similarity/blob/main/CHANGELOG.md",
"Documentation": "https://github.com/alexandernicholson/python-similarity/tree/main/docs",
"Homepage": "https://github.com/alexandernicholson/python-similarity",
"Issues": "https://github.com/alexandernicholson/python-similarity/issues",
"Repository": "https://github.com/alexandernicholson/python-similarity"
},
"split_keywords": [
"privacy",
" similarity-search",
" differential-privacy",
" pii",
" embeddings",
" faiss",
" deduplication"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "8ba632d0fd1ca04cb327d50ffbec42498e1a1ee231bdcf35a7c29d1de402d27e",
"md5": "1e099d9f7d4738de89f13503837e80b2",
"sha256": "04dfca899d0b4aa07f0229f51889b93aac08d948a797acf29e5e44ae657b691b"
},
"downloads": -1,
"filename": "privacy_similarity-0.1.4-py3-none-any.whl",
"has_sig": false,
"md5_digest": "1e099d9f7d4738de89f13503837e80b2",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 42204,
"upload_time": "2025-10-21T23:51:09",
"upload_time_iso_8601": "2025-10-21T23:51:09.312695Z",
"url": "https://files.pythonhosted.org/packages/8b/a6/32d0fd1ca04cb327d50ffbec42498e1a1ee231bdcf35a7c29d1de402d27e/privacy_similarity-0.1.4-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "7d856d79194cb33c773f42d885893d30e3152c6f5d4725059a529b68b56aee9d",
"md5": "33166845d7c815df7ffd671593852036",
"sha256": "1646ed84e0119833226129d7f929ddc2235f1edaed977ba634fad659ead09458"
},
"downloads": -1,
"filename": "privacy_similarity-0.1.4.tar.gz",
"has_sig": false,
"md5_digest": "33166845d7c815df7ffd671593852036",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 95745,
"upload_time": "2025-10-21T23:51:10",
"upload_time_iso_8601": "2025-10-21T23:51:10.285629Z",
"url": "https://files.pythonhosted.org/packages/7d/85/6d79194cb33c773f42d885893d30e3152c6f5d4725059a529b68b56aee9d/privacy_similarity-0.1.4.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-10-21 23:51:10",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "alexandernicholson",
"github_project": "python-similarity",
"github_not_found": true,
"lcname": "privacy-similarity"
}