# pl-fuzzy-frame-match
High-performance fuzzy matching for Polars DataFrames that intelligently combines exact fuzzy matching with approximate joins for optimal performance on datasets of any size.
## 🚀 Key Innovation: Hybrid Matching Approach
This library automatically selects the best matching strategy based on your data:
- **Small datasets (< 100M comparisons)**: Uses exact fuzzy matching with full cross-join
- **Large datasets (≥ 100M comparisons)**: Automatically switches to **approximate nearest neighbor joins** using `polars-simed`
- **Intelligent optimization**: Pre-filters candidates using approximate methods, then applies exact fuzzy scoring
This hybrid approach means you get:
- ✅ **Best-in-class performance** regardless of data size
- ✅ **High accuracy** with configurable similarity thresholds
- ✅ **Memory efficiency** through chunked processing
- ✅ **No manual optimization needed** - the library handles it automatically
## Features
- 🚀 **Dual-Mode Performance**: Combines exact fuzzy matching with approximate joins
- 🎯 **Multiple Algorithms**: Support for Levenshtein, Jaro, Jaro-Winkler, Hamming, Damerau-Levenshtein, and Indel
- 🔧 **Smart Optimization**: Automatic query optimization based on data uniqueness and size
- 💾 **Memory Efficient**: Chunked processing and intelligent caching for massive datasets
- 🔄 **Incremental Matching**: Support for multi-column fuzzy matching with result filtering
- ⚡ **Automatic Strategy Selection**: No configuration needed - automatically picks the fastest approach
## Installation
```bash
pip install pl-fuzzy-frame-match
```
Or using Poetry:
```bash
poetry add pl-fuzzy-frame-match
```
## Performance Benchmarks
Performance comparison on commodity hardware (M3 Mac, 36GB RAM):
| Dataset Size | Cartesian Product | Standard Cross-Join Fuzzy Match | Automatic Selection | Speedup |
|--------------|------------------|---------------------------------|-------------------|---------|
| 500 × 400 | 200K | 0.04s | 0.03s | 1.3x |
| 3K × 2K | 6M | 0.39s | 0.39s | 1x |
| 10K × 8K | 80M | 18.67s | 18.79s | 1x |
| 15K × 10K | 150M | 40.82s | 1.45s | **28x** |
| 40K × 30K | 1.2B | 363.50s | 4.75s | **76x** |
| 400K × 10K | 4B | Skipped* | 34.52s | **∞** |
*Skipped due to prohibitive runtime
**Key Observations:**
- **Small to Medium datasets** (< 100M): Automatic selection uses standard cross join for optimal speed and accuracy
- **Large datasets** (≥ 100M): Automatic selection pre-filters candidates with an approximate join, then applies exact fuzzy scoring to the reduced set
- **Memory efficiency**: Can handle billions of potential comparisons without running out of memory
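The 100M cutoff visible in the table is a simple function of the two row counts. As a minimal sketch (not the library's internal API; `pick_strategy` and `APPROX_THRESHOLD` are hypothetical names), you can predict which path a given pair of frames would take:

```python
# Hypothetical helper mirroring the documented 100M-comparison cutoff.
APPROX_THRESHOLD = 100_000_000  # 100M potential comparisons

def pick_strategy(left_rows: int, right_rows: int) -> str:
    """Return which strategy the hybrid approach would pick."""
    cartesian = left_rows * right_rows
    return "approximate" if cartesian >= APPROX_THRESHOLD else "exact"

# Row counts taken from the benchmark table:
print(pick_strategy(10_000, 8_000))   # 80M comparisons  -> "exact"
print(pick_strategy(15_000, 10_000))  # 150M comparisons -> "approximate"
```

This is why the speedup in the table jumps sharply between the 80M and 150M rows: the strategy switches, not the hardware.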
## Quick Start
```python
import polars as pl
from pl_fuzzy_frame_match import fuzzy_match_dfs, FuzzyMapping
# Create sample dataframes
left_df = pl.DataFrame({
    "name": ["John Smith", "Jane Doe", "Bob Johnson"],
    "id": [1, 2, 3]
}).lazy()

right_df = pl.DataFrame({
    "customer": ["Jon Smith", "Jane Does", "Robert Johnson"],
    "customer_id": [101, 102, 103]
}).lazy()

# Define the fuzzy matching configuration
fuzzy_maps = [
    FuzzyMapping(
        left_col="name",
        right_col="customer",
        threshold_score=80.0,  # 80% similarity threshold
        fuzzy_type="levenshtein"
    )
]

# Perform fuzzy matching (any standard logging.Logger instance works)
import logging
logger = logging.getLogger(__name__)

result = fuzzy_match_dfs(
    left_df=left_df,
    right_df=right_df,
    fuzzy_maps=fuzzy_maps,
    logger=logger
)

print(result)
```
## Advanced Usage
### Multiple Column Matching
```python
# Match on multiple columns with different algorithms
fuzzy_maps = [
    FuzzyMapping(
        left_col="name",
        right_col="customer_name",
        threshold_score=85.0,
        fuzzy_type="jaro_winkler"
    ),
    FuzzyMapping(
        left_col="address",
        right_col="customer_address",
        threshold_score=75.0,
        fuzzy_type="levenshtein"
    )
]
result = fuzzy_match_dfs(left_df, right_df, fuzzy_maps, logger)
```
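Conceptually, when several mappings are supplied, a row pair only survives if it clears every column's threshold. The sketch below illustrates that filtering idea in plain Python; it is one plausible interpretation for illustration, not the library's actual implementation, and the score values are made up:

```python
# Toy illustration: each candidate pair carries one score per mapped
# column; a pair is kept only if all columns meet their thresholds.
pairs = [
    {"name": 92.0, "address": 78.0},  # passes both thresholds
    {"name": 88.0, "address": 60.0},  # fails the address threshold
]
thresholds = {"name": 85.0, "address": 75.0}

kept = [p for p in pairs
        if all(p[col] >= t for col, t in thresholds.items())]
print(kept)  # only the first pair survives
```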
### Supported Algorithms
- **levenshtein**: Edit distance between two strings
- **jaro**: Jaro similarity
- **jaro_winkler**: Jaro-Winkler similarity (good for name matching)
- **hamming**: Hamming distance (requires equal length strings)
- **damerau_levenshtein**: Like Levenshtein but includes transpositions
- **indel**: Insertion/deletion distance
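To make the threshold semantics concrete, here is a plain-Python normalized Levenshtein similarity showing how an 80% `threshold_score` behaves. This is illustrative only; the library computes scores through polars-distance, not this function:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def similarity_pct(a: str, b: str) -> float:
    """Normalize edit distance to a 0-100 similarity score."""
    longest = max(len(a), len(b)) or 1
    return 100.0 * (1 - levenshtein(a, b) / longest)

print(round(similarity_pct("John Smith", "Jon Smith"), 1))  # 90.0
```

At an 80% threshold, "John Smith" and "Jon Smith" (one deletion across ten characters) match comfortably.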
## How It Works: The Best of Both Worlds
The library intelligently combines two approaches based on your data size:
### For Regular Datasets (< 100M potential matches)
1. **Preprocessing**: Analyzes column uniqueness to optimize join strategy
2. **Cross Join**: Creates all possible combinations
3. **Exact Scoring**: Calculates precise similarity scores using your chosen algorithm
4. **Filtering**: Returns only matches above the threshold
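The four steps above can be sketched as a toy pipeline (cross join, score, filter), using `difflib` as a stand-in similarity function. The real library expresses this with Polars expressions and polars-distance rather than Python loops:

```python
from difflib import SequenceMatcher
from itertools import product

left = ["John Smith", "Jane Doe"]
right = ["Jon Smith", "Jane Does", "Robert Johnson"]

def score(a: str, b: str) -> float:
    """Stand-in similarity score on a 0-100 scale."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()

threshold = 80.0
matches = [(a, b, round(score(a, b), 1))
           for a, b in product(left, right)  # step 2: all combinations
           if score(a, b) >= threshold]      # steps 3-4: score + filter
print(matches)
```

The cost of this path is the full cartesian product, which is why it is only used below the 100M-comparison cutoff.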
### For Large Datasets (≥ 100M potential matches)
1. **Approximate Candidate Selection**: Uses `polars-simed` to quickly find likely matches
2. **Chunked Processing**: Processes large datasets in memory-efficient chunks
3. **Reduced Comparisons**: Only scores the most promising pairs instead of all combinations
4. **Final Scoring**: Applies exact fuzzy matching to the reduced candidate set
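The core idea of step 1 is candidate reduction: only score pairs that are likely to match. A heavily simplified sketch uses a cheap blocking key (here, the first character) in place of a real approximate-similarity join; `polars-simed` does something far more robust, and unlike this toy it would not miss pairs such as "Bob Johnson" / "Robert Johnson":

```python
from collections import defaultdict

left = ["John Smith", "Jane Doe", "Bob Johnson"]
right = ["Jon Smith", "Jane Does", "Robert Johnson"]

# Group the right side by a cheap blocking key.
blocks = defaultdict(list)
for name in right:
    blocks[name[0].lower()].append(name)

# Only pairs sharing a block are ever scored.
candidates = [(a, b) for a in left for b in blocks[a[0].lower()]]
print(len(candidates), "pairs instead of", len(left) * len(right))
```

Even this crude scheme scores 4 pairs instead of 9; on a 1B-comparison problem, a good approximate join shrinks the candidate set by orders of magnitude before exact scoring runs.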
### The Magic: Automatic Strategy Selection
```python
# The library automatically determines the best approach:
if cartesian_product_size >= 100_000_000 and has_polars_simed:
    # Use approximate join for initial candidate selection
    # This reduces a 1B-comparison problem to ~1M comparisons
    use_approximate_matching()
else:
    # Use a traditional cross join for smaller datasets
    use_exact_matching()
```
This means you can use the same API whether matching 1,000 or 100 million records!
## Performance Tips
- **Large dataset matching**: Install `polars-simed` to enable approximate matching:
  ```bash
  pip install polars-simed
  ```
- **Optimal threshold**: Start with higher thresholds (80-90%) for better performance
- **Column selection**: Use columns with high uniqueness for better candidate reduction
- **Algorithm choice**:
- `jaro_winkler`: Best for names and short strings
- `levenshtein`: Best for general text and typos
- `damerau_levenshtein`: Best when transpositions are common
- **Memory management**: The library automatically chunks large datasets, but you can monitor memory usage with logging
## Requirements
- Python >= 3.10, < 4.0
- Polars >= 1.8.2, < 2.0.0
- polars-distance ~= 0.4.3
- polars-simed >= 0.3.4 (optional, for large datasets)
## License
MIT License - see LICENSE file for details
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## Acknowledgments
Built on top of the excellent [Polars](https://github.com/pola-rs/polars) DataFrame library and [polars-distance](https://github.com/ion-elgreco/polars-distance) for string similarity calculations.