blockingpy


| Field | Value |
|---|---|
| Name | blockingpy |
| Version | 0.1.4 |
| Home page | https://github.com/T-Strojny/BlockingPy |
| Summary | Blocking records for record linkage and data deduplication based on ANN algorithms. |
| Upload time | 2024-12-23 13:04:59 |
| Author | Tymoteusz Strojny |
| Requires Python | <4.0,>=3.10 |
| License | MIT |
| Keywords | record-linkage, deduplication, ann, blocking, data-matching |

[![License](https://img.shields.io/github/license/T-Strojny/BlockingPy)](https://github.com/T-Strojny/BlockingPy/blob/main/LICENSE)
[![Project Status: Active – The project has reached a stable, usable state and is being actively developed.](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active)
[![Python version](https://img.shields.io/badge/python-3.10%2B-blue)](https://www.python.org/downloads/)
[![Code Coverage](https://img.shields.io/codecov/c/github/T-Strojny/BlockingPy)](https://codecov.io/gh/T-Strojny/BlockingPy)\
[![PyPI version](https://img.shields.io/pypi/v/blockingpy.svg)](https://pypi.org/project/blockingpy/) 
[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)
[![Tests](https://github.com/T-Strojny/BlockingPy/actions/workflows/run_tests.yml/badge.svg)](https://github.com/T-Strojny/BlockingPy/actions/workflows/run_tests.yml)
[![GitHub last commit](https://img.shields.io/github/last-commit/T-Strojny/BlockingPy)](https://github.com/T-Strojny/BlockingPy/commits/main)
[![Documentation Status](https://readthedocs.org/projects/blockingpy/badge/?version=latest)](https://blockingpy.readthedocs.io/en/latest/?badge=latest)


# BlockingPy

BlockingPy is a Python package that implements efficient blocking methods for record linkage and data deduplication using Approximate Nearest Neighbor (ANN) algorithms. It is based on the R [blocking](https://github.com/ncn-foreigners/blocking) package.

## Purpose

When performing record linkage or deduplication on large datasets, comparing all possible record pairs becomes computationally infeasible. Blocking helps reduce the comparison space by identifying candidate record pairs that are likely to match, using efficient approximate nearest neighbor search algorithms.
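
To make the scale of the problem concrete, the sketch below counts how many pairs an exhaustive comparison would need. The helper names `dedup_pairs` and `linkage_pairs` are hypothetical and are not part of BlockingPy; they only use the standard formulas for unordered pairs within one dataset and cross pairs between two datasets.

```python
from math import comb


def dedup_pairs(n: int) -> int:
    """Unordered record pairs within a single dataset of n records."""
    return comb(n, 2)


def linkage_pairs(n: int, m: int) -> int:
    """Cross-dataset pairs between datasets of n and m records."""
    return n * m


# The pair count grows quadratically with the number of records.
for n in (1_000, 100_000, 10_000_000):
    print(f"{n:>12,} records -> {dedup_pairs(n):,} pairs")

# Sizes matching the small examples in Basic Usage (8 and 3 records).
print(dedup_pairs(8), linkage_pairs(8, 3))
```

Blocking aims to replace this full set of pairs with a much smaller set of candidate pairs.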

## Installation

BlockingPy requires Python 3.10 or later. Install it with pip:
```bash
pip install blockingpy
```
or, for example, with Poetry:

```bash
poetry add blockingpy
```
### Note
You may need to install the mlpack system library beforehand:
```bash
sudo apt-get install -y libmlpack-dev # on Linux
brew install mlpack # on macOS
```
## Basic Usage
### Record Linkage
```python
from blockingpy.blocker import Blocker
import pandas as pd

# Example data for record linkage
x = pd.DataFrame({
    "txt": [
            "johnsmith",
            "smithjohn",
            "smiithhjohn",
            "smithjohnny",
            "montypython",
            "pythonmonty",
            "errmontypython",
            "monty",
        ]})

y = pd.DataFrame({
    "txt": [
            "montypython",
            "smithjohn",
            "other",
        ]})

# Initialize blocker instance
blocker = Blocker()

# Perform blocking with the default ANN algorithm (FAISS)
block_result = blocker.block(x=x['txt'], y=y['txt'])
```
Printing `block_result` shows:

- The method used (`faiss` - refers to Facebook AI Similarity Search)
- Number of blocks created (`3` in this case)
- Number of columns (features) used for blocking (intersecting n-grams generated from both datasets, `17` in this example)
- Reduction ratio, i.e. the share of comparison pairs eliminated by blocking (here `0.8750`, which means blocking reduces the number of comparisons by 87.5%).
```python
print(block_result)
# ========================================================
# Blocking based on the faiss method.
# Number of blocks: 3
# Number of columns used for blocking: 17
# Reduction ratio: 0.8750
# ========================================================
# Distribution of the size of the blocks:
# Block Size | Number of Blocks
#          1 | 3
```
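
The column count can be reproduced by hand. The sketch below assumes the features are character 2-grams (overlapping pairs of adjacent characters). That tokenization is an assumption made for illustration rather than a statement about BlockingPy's internals, although it does yield the `17` reported above. The helper `char_ngrams` is hypothetical.

```python
def char_ngrams(text: str, n: int = 2) -> set[str]:
    """Return the set of overlapping character n-grams of `text`."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


x = ["johnsmith", "smithjohn", "smiithhjohn", "smithjohnny",
     "montypython", "pythonmonty", "errmontypython", "monty"]
y = ["montypython", "smithjohn", "other"]

# Vocabulary of each dataset: the union of the n-grams of all its records.
x_vocab = set().union(*(char_ngrams(s) for s in x))
y_vocab = set().union(*(char_ngrams(s) for s in y))

# Only n-grams present in both datasets are kept as blocking features.
shared = x_vocab & y_vocab
print(len(shared))  # 17
```

Applied to `x` alone, the same function yields 25 distinct 2-grams.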
Printing `block_result.result` shows the results table, which contains:

- row numbers from the original data,
- block number (integers),
- distance (from the ANN algorithm).

```python
print(block_result.result)
#    x  y  block  dist
# 0  4  0      0   0.0
# 1  1  1      1   0.0
# 2  7  2      2   6.0
```
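
The row numbers in this table are zero-based positions in `x` and `y`, so they can be mapped back to the original records. The sketch below copies the three rows from the printed output into a plain list instead of calling BlockingPy. It also recomputes the reduction ratio, assuming the common definition of one minus the share of all possible pairs that remain as candidates; that definition is an assumption here, but it reproduces the reported `0.8750`.

```python
x = ["johnsmith", "smithjohn", "smiithhjohn", "smithjohnny",
     "montypython", "pythonmonty", "errmontypython", "monty"]
y = ["montypython", "smithjohn", "other"]

# (x index, y index, block, dist) rows copied from the printed table above.
pairs = [(4, 0, 0, 0.0), (1, 1, 1, 0.0), (7, 2, 2, 6.0)]

# Look up the records behind each candidate pair.
for xi, yi, block, dist in pairs:
    print(f"block {block}: {x[xi]!r} <-> {y[yi]!r} (dist={dist})")

# Reduction ratio: share of all x-by-y comparisons that blocking avoids.
total_pairs = len(x) * len(y)
reduction_ratio = 1 - len(pairs) / total_pairs
print(f"{reduction_ratio:.4f}")  # 0.8750
```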
### Deduplication
To perform deduplication, pass only the previously created DataFrame `x` to the `block()` method.
```python
dedup_result = blocker.block(x = x['txt'])
```
```python
print(dedup_result)
# ========================================================
# Blocking based on the faiss method.
# Number of blocks: 2
# Number of columns used for blocking: 25
# Reduction ratio: 0.5714
# ========================================================
# Distribution of the size of the blocks:
# Block Size | Number of Blocks
#          3 | 2
```
```python
print(dedup_result.result)
#    x  y  block  dist
# 0  0  1      0   2.0
# 1  1  2      0   2.0
# 2  1  3      0   2.0
# 3  4  5      1   2.0
# 4  4  6      1   3.0
# 5  4  7      1   6.0
```
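
In the deduplication table, both `x` and `y` refer to rows of the same dataset, and rows sharing a block number form one group of candidate duplicates. The sketch below rebuilds those groups from the printed rows and recomputes the reduction ratio, assuming it is defined as one minus the number of within-block pairs divided by the number of all pairs. That definition is an assumption, but it reproduces the reported `0.5714`.

```python
from collections import defaultdict
from math import comb

# (x index, y index, block) rows copied from the printed table above.
rows = [(0, 1, 0), (1, 2, 0), (1, 3, 0), (4, 5, 1), (4, 6, 1), (4, 7, 1)]

# Collect every record index that appears in each block.
members: dict[int, set[int]] = defaultdict(set)
for xi, yi, block in rows:
    members[block].update((xi, yi))

for block, records in sorted(members.items()):
    print(block, sorted(records))

# Compare only records inside the same block instead of all pairs.
n_records = 8
within_block_pairs = sum(comb(len(r), 2) for r in members.values())
reduction_ratio = 1 - within_block_pairs / comb(n_records, 2)
print(f"{reduction_ratio:.4f}")  # 0.5714
```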
## Features
- Multiple ANN algorithms available:
    - [FAISS](https://github.com/facebookresearch/faiss) (Facebook AI Similarity Search)
    - [Voyager](https://github.com/spotify/voyager) (Spotify)
    - [HNSW](https://github.com/nmslib/hnswlib) (Hierarchical Navigable Small World)
    - [MLPACK](https://github.com/mlpack/mlpack) (both LSH and k-d tree)
    - [NND](https://github.com/lmcinnes/pynndescent) (Nearest Neighbor Descent)
    - [Annoy](https://github.com/spotify/annoy) (Spotify)

- Multiple distance metrics such as:
    - Euclidean
    - Cosine
    - Inner Product
    
    and more...
- Comprehensive algorithm parameters customization with `control_ann` and `control_txt`
- Support for already created Document-Term-Matrices (as `np.ndarray` or `csr_matrix`)
- Support for both record linkage and deduplication
- Evaluation metrics when true blocks are known
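
As a rough illustration of how the listed metrics differ, the sketch below computes the inner product, squared Euclidean distance, and cosine similarity between character 2-gram count vectors of two example strings. The vectorization and the helper `bigram_counts` are assumptions made for this example; which exact quantity each ANN backend reports as `dist` is not specified here.

```python
from collections import Counter
from math import sqrt


def bigram_counts(text: str) -> Counter:
    """Count overlapping character 2-grams in `text`."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


a = bigram_counts("johnsmith")
b = bigram_counts("smithjohn")

# Align both count vectors on a shared, sorted vocabulary.
vocab = sorted(set(a) | set(b))
u = [a[g] for g in vocab]
v = [b[g] for g in vocab]

inner = sum(p * q for p, q in zip(u, v))
sq_euclidean = sum((p - q) ** 2 for p, q in zip(u, v))
cosine_sim = inner / (sqrt(sum(p * p for p in u)) * sqrt(sum(q * q for q in v)))

print(inner, sq_euclidean, round(cosine_sim, 3))  # 7 2 0.875
```

The two strings share seven of their eight 2-grams, so they are close under all three measures.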

You can find detailed information about BlockingPy in the [documentation](https://blockingpy.readthedocs.io/en/latest/).

## Disclaimer
BlockingPy is still under development, so the API and features may change, and bugs or errors can occur.

## License
BlockingPy is released under the [MIT license](https://github.com/T-Strojny/BlockingPy/blob/main/LICENSE).

## Third Party
BlockingPy benefits from many open-source packages, such as [Faiss](https://github.com/facebookresearch/faiss) and [Annoy](https://github.com/spotify/annoy). For detailed information, see the [third party notice](https://github.com/T-Strojny/BlockingPy/blob/main/THIRD_PARTY).

## Contributing

Please see [CONTRIBUTING.md](https://github.com/T-Strojny/BlockingPy/blob/main/CONTRIBUTING.md) for more information.

## Citation
TODO ?

## Acknowledgements
This package is based on the R [blocking](https://github.com/ncn-foreigners/blocking/tree/main) package developed by [BERENZ](https://github.com/BERENZ). Special thanks to the original author for his foundational work in this area.


            
