# pl-fuzzy-frame-match
High-performance fuzzy matching for Polars DataFrames that intelligently combines exact fuzzy matching with approximate joins for optimal performance on datasets of any size.
## 🚀 Key Innovation: Hybrid Matching Approach
This library automatically selects the best matching strategy based on your data:
- **Small datasets (< 100M comparisons)**: Uses exact fuzzy matching with full cross-join
- **Large datasets (≥ 100M comparisons)**: Automatically switches to **approximate nearest neighbor joins** using `polars-simed`
- **Intelligent optimization**: Pre-filters candidates using approximate methods, then applies exact fuzzy scoring
This hybrid approach means you get:
- ✅ **Best-in-class performance** regardless of data size
- ✅ **High accuracy** with configurable similarity thresholds
- ✅ **Memory efficiency** through chunked processing
- ✅ **No manual optimization needed** - the library handles it automatically
## Features
- 🚀 **Dual-Mode Performance**: Combines exact fuzzy matching with approximate joins
- 🎯 **Multiple Algorithms**: Support for Levenshtein, Jaro, Jaro-Winkler, Hamming, Damerau-Levenshtein, and Indel
- 🔧 **Smart Optimization**: Automatic query optimization based on data uniqueness and size
- 💾 **Memory Efficient**: Chunked processing and intelligent caching for massive datasets
- 🔄 **Incremental Matching**: Support for multi-column fuzzy matching with result filtering
- ⚡ **Automatic Strategy Selection**: No configuration needed - automatically picks the fastest approach
## Installation
```bash
pip install pl-fuzzy-frame-match
```
Or using Poetry:
```bash
poetry add pl-fuzzy-frame-match
```
## Performance Benchmarks
Performance comparison on commodity hardware (M3 Mac, 36GB RAM):
| Dataset Size | Cartesian Product | Standard Cross-Join Fuzzy Match | Automatic Selection | Speedup |
|--------------|------------------|---------------------------------|-------------------|---------|
| 500 × 400 | 200K | 0.04s | 0.03s | 1.3x |
| 3K × 2K | 6M | 0.39s | 0.39s | 1x |
| 10K × 8K | 80M | 18.67s | 18.79s | 1x |
| 15K × 10K | 150M | 40.82s | 1.45s | **28x** |
| 40K × 30K | 1.2B | 363.50s | 4.75s | **76x** |
| 400K × 10K | 4B | Skipped* | 34.52s | **∞** |
*Skipped due to prohibitive runtime
**Key Observations:**
- **Small to Medium datasets** (< 100M): Automatic selection uses standard cross join for optimal speed and accuracy
- **Large datasets** (≥ 100M): Automatic selection pre-filters candidates with an approximate join, then applies exact fuzzy scoring to the reduced set
- **Memory efficiency**: Can handle billions of potential comparisons without running out of memory
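The 100M cutoff visible in the table is a simple function of the two row counts. As a minimal sketch (not the library's internal API; `pick_strategy` and `APPROX_THRESHOLD` are hypothetical names), you can predict which path a given pair of frames would take:

```python
# Hypothetical helper mirroring the documented 100M-comparison cutoff.
APPROX_THRESHOLD = 100_000_000  # 100M potential comparisons

def pick_strategy(left_rows: int, right_rows: int) -> str:
    """Return which strategy the hybrid approach would pick."""
    cartesian = left_rows * right_rows
    return "approximate" if cartesian >= APPROX_THRESHOLD else "exact"

# Row counts taken from the benchmark table:
print(pick_strategy(10_000, 8_000))   # 80M comparisons  -> "exact"
print(pick_strategy(15_000, 10_000))  # 150M comparisons -> "approximate"
```

This is why the speedup in the table jumps sharply between the 80M and 150M rows: the strategy switches, not the hardware.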
## Quick Start
```python
import polars as pl
from pl_fuzzy_frame_match import fuzzy_match_dfs, FuzzyMapping
# Create sample dataframes
left_df = pl.DataFrame({
    "name": ["John Smith", "Jane Doe", "Bob Johnson"],
    "id": [1, 2, 3]
}).lazy()

right_df = pl.DataFrame({
    "customer": ["Jon Smith", "Jane Does", "Robert Johnson"],
    "customer_id": [101, 102, 103]
}).lazy()

# Define the fuzzy matching configuration
fuzzy_maps = [
    FuzzyMapping(
        left_col="name",
        right_col="customer",
        threshold_score=80.0,  # 80% similarity threshold
        fuzzy_type="levenshtein"
    )
]

# Perform fuzzy matching (any standard logging.Logger instance works)
import logging
logger = logging.getLogger(__name__)

result = fuzzy_match_dfs(
    left_df=left_df,
    right_df=right_df,
    fuzzy_maps=fuzzy_maps,
    logger=logger
)

print(result)
```
## Advanced Usage
### Multiple Column Matching
```python
# Match on multiple columns with different algorithms
fuzzy_maps = [
    FuzzyMapping(
        left_col="name",
        right_col="customer_name",
        threshold_score=85.0,
        fuzzy_type="jaro_winkler"
    ),
    FuzzyMapping(
        left_col="address",
        right_col="customer_address",
        threshold_score=75.0,
        fuzzy_type="levenshtein"
    )
]
result = fuzzy_match_dfs(left_df, right_df, fuzzy_maps, logger)
```
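Conceptually, when several mappings are supplied, a row pair only survives if it clears every column's threshold. The sketch below illustrates that filtering idea in plain Python; it is one plausible interpretation for illustration, not the library's actual implementation, and the score values are made up:

```python
# Toy illustration: each candidate pair carries one score per mapped
# column; a pair is kept only if all columns meet their thresholds.
pairs = [
    {"name": 92.0, "address": 78.0},  # passes both thresholds
    {"name": 88.0, "address": 60.0},  # fails the address threshold
]
thresholds = {"name": 85.0, "address": 75.0}

kept = [p for p in pairs
        if all(p[col] >= t for col, t in thresholds.items())]
print(kept)  # only the first pair survives
```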
### Supported Algorithms
- **levenshtein**: Edit distance between two strings
- **jaro**: Jaro similarity
- **jaro_winkler**: Jaro-Winkler similarity (good for name matching)
- **hamming**: Hamming distance (requires equal length strings)
- **damerau_levenshtein**: Like Levenshtein but includes transpositions
- **indel**: Insertion/deletion distance
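To make the threshold semantics concrete, here is a plain-Python normalized Levenshtein similarity showing how an 80% `threshold_score` behaves. This is illustrative only; the library computes scores through polars-distance, not this function:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def similarity_pct(a: str, b: str) -> float:
    """Normalize edit distance to a 0-100 similarity score."""
    longest = max(len(a), len(b)) or 1
    return 100.0 * (1 - levenshtein(a, b) / longest)

print(round(similarity_pct("John Smith", "Jon Smith"), 1))  # 90.0
```

At an 80% threshold, "John Smith" and "Jon Smith" (one deletion across ten characters) match comfortably.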
## How It Works: The Best of Both Worlds
The library intelligently combines two approaches based on your data size:
### For Regular Datasets (< 100M potential matches)
1. **Preprocessing**: Analyzes column uniqueness to optimize join strategy
2. **Cross Join**: Creates all possible combinations
3. **Exact Scoring**: Calculates precise similarity scores using your chosen algorithm
4. **Filtering**: Returns only matches above the threshold
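The four steps above can be sketched as a toy pipeline (cross join, score, filter), using `difflib` as a stand-in similarity function. The real library expresses this with Polars expressions and polars-distance rather than Python loops:

```python
from difflib import SequenceMatcher
from itertools import product

left = ["John Smith", "Jane Doe"]
right = ["Jon Smith", "Jane Does", "Robert Johnson"]

def score(a: str, b: str) -> float:
    """Stand-in similarity score on a 0-100 scale."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()

threshold = 80.0
matches = [(a, b, round(score(a, b), 1))
           for a, b in product(left, right)  # step 2: all combinations
           if score(a, b) >= threshold]      # steps 3-4: score + filter
print(matches)
```

The cost of this path is the full cartesian product, which is why it is only used below the 100M-comparison cutoff.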
### For Large Datasets (≥ 100M potential matches)
1. **Approximate Candidate Selection**: Uses `polars-simed` to quickly find likely matches
2. **Chunked Processing**: Processes large datasets in memory-efficient chunks
3. **Reduced Comparisons**: Only scores the most promising pairs instead of all combinations
4. **Final Scoring**: Applies exact fuzzy matching to the reduced candidate set
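The core idea of step 1 is candidate reduction: only score pairs that are likely to match. A heavily simplified sketch uses a cheap blocking key (here, the first character) in place of a real approximate-similarity join; `polars-simed` does something far more robust, and unlike this toy it would not miss pairs such as "Bob Johnson" / "Robert Johnson":

```python
from collections import defaultdict

left = ["John Smith", "Jane Doe", "Bob Johnson"]
right = ["Jon Smith", "Jane Does", "Robert Johnson"]

# Group the right side by a cheap blocking key.
blocks = defaultdict(list)
for name in right:
    blocks[name[0].lower()].append(name)

# Only pairs sharing a block are ever scored.
candidates = [(a, b) for a in left for b in blocks[a[0].lower()]]
print(len(candidates), "pairs instead of", len(left) * len(right))
```

Even this crude scheme scores 4 pairs instead of 9; on a 1B-comparison problem, a good approximate join shrinks the candidate set by orders of magnitude before exact scoring runs.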
### The Magic: Automatic Strategy Selection
```python
# The library automatically determines the best approach:
if cartesian_product_size >= 100_000_000 and has_polars_simed:
    # Use approximate join for initial candidate selection
    # This reduces a 1B-comparison problem to ~1M comparisons
    use_approximate_matching()
else:
    # Use a traditional cross join for smaller datasets
    use_exact_matching()
```
This means you can use the same API whether matching 1,000 or 100 million records!
## Performance Tips
- **Large dataset matching**: Install `polars-simed` to enable approximate matching:
  ```bash
  pip install polars-simed
  ```
- **Optimal threshold**: Start with higher thresholds (80-90%) for better performance
- **Column selection**: Use columns with high uniqueness for better candidate reduction
- **Algorithm choice**:
- `jaro_winkler`: Best for names and short strings
- `levenshtein`: Best for general text and typos
- `damerau_levenshtein`: Best when transpositions are common
- **Memory management**: The library automatically chunks large datasets, but you can monitor memory usage with logging
## Requirements
- Python >= 3.10, < 4.0
- Polars >= 1.8.2, < 2.0.0
- polars-distance ~= 0.4.3
- polars-simed >= 0.3.4 (optional, for large datasets)
## License
MIT License - see LICENSE file for details
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## Acknowledgments
Built on top of the excellent [Polars](https://github.com/pola-rs/polars) DataFrame library and [polars-distance](https://github.com/ion-elgreco/polars-distance) for string similarity calculations.