blockingpy-gpu

- Name: blockingpy-gpu
- Version: 0.2.3
- Summary: BlockingPy meta package (GPU)
- Upload time: 2025-08-30 18:50:41
- Requires Python: <3.13,>=3.10
- License: MIT
- Keywords: ann, blocking, data-matching, deduplication, record-linkage

[![License](https://img.shields.io/github/license/ncn-foreigners/BlockingPy)](https://github.com/ncn-foreigners/BlockingPy/blob/main/LICENSE)
[![Project Status: Active – The project has reached a stable, usable state and is being actively developed.](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active)
[![Python version](https://img.shields.io/badge/python-3.10%2B-blue)](https://www.python.org/downloads/)
[![codecov](https://codecov.io/gh/ncn-foreigners/BlockingPy/graph/badge.svg?token=BF41O220NY)](https://codecov.io/gh/ncn-foreigners/BlockingPy)
[![PyPI version](https://img.shields.io/pypi/v/blockingpy.svg)](https://pypi.org/project/blockingpy/) 
[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)
[![Tests](https://github.com/ncn-foreigners/BlockingPy/actions/workflows/run_tests.yml/badge.svg)](https://github.com/ncn-foreigners/BlockingPy/actions/workflows/run_tests.yml)\
[![GitHub last commit](https://img.shields.io/github/last-commit/ncn-foreigners/BlockingPy)](https://github.com/ncn-foreigners/BlockingPy/commits/main)
[![Documentation Status](https://readthedocs.org/projects/blockingpy/badge/?version=latest)](https://blockingpy.readthedocs.io/en/latest/?badge=latest)
![PyPI Downloads](https://img.shields.io/pypi/dm/blockingpy)
[![PyPI (GPU)](https://img.shields.io/pypi/v/blockingpy-gpu.svg?label=blockingpy-gpu)](https://pypi.org/project/blockingpy-gpu/)
![CUDA ≥12.4](https://img.shields.io/badge/CUDA-%E2%89%A5%2012.4-76b900)

# BlockingPy

BlockingPy is a Python package that implements efficient blocking methods for record linkage and data deduplication using Approximate Nearest Neighbor (ANN) algorithms. It is based on the R [blocking](https://github.com/ncn-foreigners/blocking) package.


Additionally, **GPU** acceleration is available via `blockingpy-gpu` ([FAISS-GPU](https://github.com/facebookresearch/faiss/wiki/Running-on-GPUs)).


## Purpose

When performing record linkage or deduplication on large datasets, comparing all possible record pairs becomes computationally infeasible. Blocking helps reduce the comparison space by identifying candidate record pairs that are likely to match, using efficient approximate nearest neighbor search algorithms.

## Installation

BlockingPy requires Python 3.10 or later. Install it with pip:
```bash
pip install blockingpy
```
or, for example, with Poetry:

```bash
poetry add blockingpy
```
### Note
You may need to run the following beforehand:
```bash
sudo apt-get install -y libmlpack-dev # on Linux
brew install mlpack # on macOS
```
For the GPU version, see [GPU Support](#gpu-support) or the [docs](https://blockingpy.readthedocs.io/en/latest/gpu/index.html).

## Basic Usage
### Record Linkage
```python
from blockingpy import Blocker
import pandas as pd

# Example data for record linkage
x = pd.DataFrame({
    "txt": [
            "johnsmith",
            "smithjohn",
            "smiithhjohn",
            "smithjohnny",
            "montypython",
            "pythonmonty",
            "errmontypython",
            "monty",
        ]})

y = pd.DataFrame({
    "txt": [
            "montypython",
            "smithjohn",
            "other",
        ]})

# Initialize blocker instance
blocker = Blocker()

# Perform blocking with default ANN : FAISS
block_result = blocker.block(x = x['txt'], y = y['txt'])
```
Printing `block_result` shows:

- The method used (`faiss` - refers to Facebook AI Similarity Search)
- Number of blocks created (`3` in this case)
- Number of columns (features) used for blocking (intersecting n-grams generated from both datasets, `17` in this example)
- Reduction ratio, i.e. the share of comparison pairs eliminated by blocking (here `0.8750`, which means blocking removes 87.5% of the 8 × 3 = 24 possible pairs).
```python
print(block_result)
# ========================================================
# Blocking based on the faiss method.
# Number of blocks: 3
# Number of columns created for blocking: 17
# Reduction ratio: 0.8750
# ========================================================
# Distribution of the size of the blocks:
# Block Size | Number of Blocks
#          2 | 3  
```
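The two numbers in this summary can be checked by hand. The sketch below is not BlockingPy's implementation; it assumes character 2-grams as the shingles (an assumption that happens to reproduce the 17 columns reported above) and computes the reduction ratio as one minus the share of candidate pairs among all possible pairs.

```python
# Toy data copied from the record linkage example above.
x = ["johnsmith", "smithjohn", "smiithhjohn", "smithjohnny",
     "montypython", "pythonmonty", "errmontypython", "monty"]
y = ["montypython", "smithjohn", "other"]


def shingles(text, n=2):
    """Return the set of overlapping character n-grams of `text`."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


# Vocabulary of each dataset (n=2 is an assumption, not a documented default).
vocab_x = set().union(*(shingles(s) for s in x))
vocab_y = set().union(*(shingles(s) for s in y))
shared = vocab_x & vocab_y  # n-grams present in both datasets

# Three blocks of size 2 give one candidate pair each.
candidate_pairs = 3
all_pairs = len(x) * len(y)  # 8 * 3 = 24
reduction_ratio = 1 - candidate_pairs / all_pairs

print(len(shared))       # 17
print(reduction_ratio)   # 0.875
```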
By printing `block_result.result` we can take a look at the results table containing:

- row numbers from the original data,
- block number (integers),
- distance (from the ANN algorithm).

```python
print(block_result.result)
#    x  y  block      dist
# 0  4  0      0  0.000000
# 1  1  1      1  0.000000
# 2  6  2      2  0.607768
```
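To read the table, it can help to map the indices back to the original strings. The sketch below hard-codes the rows printed above and assumes that `x` and `y` are zero-based positions in the input data; it does not call BlockingPy.

```python
# Toy data copied from the record linkage example above.
x = ["johnsmith", "smithjohn", "smiithhjohn", "smithjohnny",
     "montypython", "pythonmonty", "errmontypython", "monty"]
y = ["montypython", "smithjohn", "other"]

# Rows copied from the printed table: (x index, y index, block, dist).
rows = [
    (4, 0, 0, 0.000000),
    (1, 1, 1, 0.000000),
    (6, 2, 2, 0.607768),
]

# Look up the text behind each candidate pair.
pairs = [(x[i], y[j], dist) for i, j, _block, dist in rows]

for left, right, dist in pairs:
    print(f"{left!r} <-> {right!r} (dist={dist:.6f})")
```

The two exact matches have distance 0, while `"other"` is paired with its nearest neighbour at a noticeably larger distance.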
### Deduplication
We can perform deduplication by passing only the previously created `x` data to the `block()` method.
```python
dedup_result = blocker.block(x = x['txt'])
```
```python
print(dedup_result)
# ========================================================
# Blocking based on the faiss method.
# Number of blocks: 2
# Number of columns created for blocking: 25
# Reduction ratio: 0.571429
# ========================================================
# Distribution of the size of the blocks:
# Block Size | Number of Blocks
#          4 | 2 
```
```python
print(dedup_result.result)
#    x  y  block      dist
# 0  1  0      0  0.125000
# 1  3  1      0  0.105573
# 2  1  2      0  0.105573
# 3  5  4      1  0.083333
# 4  4  6      1  0.105573
# 5  5  7      1  0.278312
```
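The block sizes and the reduction ratio can be recovered from the pairs in this table. The following sketch hard-codes the printed rows, assumes the indices are zero-based record positions, and treats every pair of records inside a block as a candidate; it is an illustration, not the library's code.

```python
from collections import defaultdict
from math import comb

# Rows copied from the printed table: (x index, y index, block).
rows = [(1, 0, 0), (3, 1, 0), (1, 2, 0), (5, 4, 1), (4, 6, 1), (5, 7, 1)]

# Collect the records that belong to each block.
blocks = defaultdict(set)
for xi, yi, block in rows:
    blocks[block].update((xi, yi))

n_records = 8  # length of the input data
within_block_pairs = sum(comb(len(members), 2) for members in blocks.values())
reduction_ratio = 1 - within_block_pairs / comb(n_records, 2)

print({b: sorted(m) for b, m in blocks.items()})  # {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}
print(round(reduction_ratio, 6))                  # 0.571429
```

Two blocks of four records leave 6 + 6 = 12 of the 28 possible pairs, which matches the reported ratio.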
## Features
- Multiple ANN implementations available:
    - [FAISS](https://github.com/facebookresearch/faiss) (Facebook AI Similarity Search) (`lsh`, `hnsw`, `flat`)
    - [Voyager](https://github.com/spotify/voyager) (Spotify)
    - [HNSW](https://github.com/nmslib/hnswlib) (Hierarchical Navigable Small World)
    - [MLPACK](https://github.com/mlpack/mlpack) (both LSH and k-d tree)
    - [NND](https://github.com/lmcinnes/pynndescent) (Nearest Neighbor Descent)
    - [Annoy](https://github.com/spotify/annoy) (Spotify)

- Multiple distance metrics such as:
    - Euclidean
    - Cosine
    - Inner Product
    
    and more...
- Support for both shingle-based and embedding-based text representation
- Comprehensive algorithm parameters customization with `control_ann` and `control_txt`
- Support for already created Document-Term-Matrices (as `np.ndarray` or `csr_matrix`)
- Support for both record linkage and deduplication
- Evaluation metrics when true blocks are known
- GPU support for fast blocking of large datasets using GPU-accelerated indexes from [FAISS](https://github.com/facebookresearch/faiss/wiki/Running-on-GPUs)

You can find detailed information about BlockingPy in the [documentation](https://blockingpy.readthedocs.io/en/latest/).

## GPU Support
`BlockingPy` can process large datasets on the GPU using [`faiss_gpu`](https://github.com/facebookresearch/faiss/wiki/Running-on-GPUs) algorithms. The available GPU indexes are `Flat`, `IVF`, `IVFPQ`, and `CAGRA`. `blockingpy-gpu` also includes all CPU indexes except the **mlpack** backends.

### Prerequisites
- OS: Linux or Windows 11 with WSL2 (Ubuntu)  
- Python: 3.10  
- GPU: NVIDIA with a driver supporting CUDA ≥ 12.4  
- Tools: conda/mamba + pip 

### Install

PyPI wheels do not provide CUDA-enabled FAISS. You must install FAISS-GPU via conda/mamba, then install `blockingpy-gpu` with pip.

```bash
# 1) Env
mamba create -n blockingpy-gpu python=3.10 -y
conda activate blockingpy-gpu

# 2) Install FAISS GPU (nightly cuVS build) - this version was tested
mamba install -c pytorch/label/nightly \
  faiss-gpu-cuvs=1.11.0=py3.10_ha3bacd1_55_cuda12.4.0_nightly -y

# 3) Install BlockingPy and the rest of deps with pip (or poetry, uv etc.)
pip install blockingpy-gpu
```
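After installation, a quick check that the FAISS build in the environment can see a GPU may save debugging time. The sketch below is an assumption-laden helper, not part of BlockingPy: `count_faiss_gpus` is a hypothetical name, and it relies only on `faiss.get_num_gpus()`, falling back to 0 when FAISS or that function is unavailable.

```python
def count_faiss_gpus() -> int:
    """Return the number of GPUs visible to FAISS, or 0 if unavailable."""
    try:
        import faiss  # installed via conda/mamba in step 2
    except ImportError:
        return 0
    # CPU-only builds may not expose get_num_gpus, so look it up defensively.
    get_num_gpus = getattr(faiss, "get_num_gpus", None)
    return int(get_num_gpus()) if callable(get_num_gpus) else 0


print(f"FAISS sees {count_faiss_gpus()} GPU(s)")
```

A result of 0 suggests that a CPU-only FAISS build is active or that the driver is not visible inside the environment.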

## Example Datasets

BlockingPy comes with built-in example datasets:

- Census-Cis dataset created by Paula McLeod, Dick Heasman and Ian Forbes, ONS,
    for the ESSnet DI on-the-job training course, Southampton,
    25-28 January 2011

- Deduplication dataset taken from the [RecordLinkage](https://cran.r-project.org/package=RecordLinkage) R package developed by Murat Sariyar
    and Andreas Borg. The package is licensed under GPL-3. The dataset is also known as [RLdata10000](https://www.rdocumentation.org/packages/RecordLinkage/versions/0.4-12.4/topics/RLdata).


## License
BlockingPy is released under the [MIT license](https://github.com/ncn-foreigners/BlockingPy/blob/main/LICENSE).

## Third Party
BlockingPy benefits from many open-source packages such as [Faiss](https://github.com/facebookresearch/faiss) or [Annoy](https://github.com/spotify/annoy). For detailed information see [third party notice](https://github.com/ncn-foreigners/BlockingPy/blob/main/THIRD_PARTY).

## Contributing

Please see [CONTRIBUTING.md](https://github.com/ncn-foreigners/BlockingPy/blob/main/CONTRIBUTING.md) for more information.

## Code of Conduct
See the [Code of Conduct](https://github.com/ncn-foreigners/BlockingPy/blob/main/CODE_OF_CONDUCT.md).

## Acknowledgements
This package is based on the R [blocking](https://github.com/ncn-foreigners/blocking/tree/main) package developed by [BERENZ](https://github.com/BERENZ).

## Funding

Work on this package is supported by the National Science Centre, OPUS 20 grant no. 2020/39/B/HS4/00941 (Towards census-like statistics for foreign-born populations -- quality, data integration and estimation).

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "blockingpy-gpu",
    "maintainer": null,
    "docs_url": null,
    "requires_python": "<3.13,>=3.10",
    "maintainer_email": null,
    "keywords": "ANN, blocking, data-matching, deduplication, record-linkage",
    "author": null,
    "author_email": "Tymoteusz Strojny <tymek.strojny@gmail.com>, Maciej Ber\u0119sewicz <maciej.beresewicz@ue.poznan.pl>",
    "download_url": "https://files.pythonhosted.org/packages/2d/72/d9209db19e20f696a67e0f2675fe2e556e553df866d1099cc8aca7bc732f/blockingpy_gpu-0.2.3.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "BlockingPy meta package (GPU)",
    "version": "0.2.3",
    "project_urls": {
        "Documentation": "https://blockingpy.readthedocs.io/en/latest/",
        "Funding": "https://www.ncn.gov.pl",
        "Issues": "https://github.com/ncn-foreigners/BlockingPy/issues",
        "Repository": "https://github.com/ncn-foreigners/BlockingPy"
    },
    "split_keywords": [
        "ann",
        " blocking",
        " data-matching",
        " deduplication",
        " record-linkage"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "d6752ebe34ac7fa01900d14ab45e4543c70a4be6b6d31b2b7c03618e4eb37ee1",
                "md5": "e6ca21bb0c488753deae39d7c790c8ff",
                "sha256": "94de8a3335694d27de0406c467aa1675870212af77d3ab765540197db144a6da"
            },
            "downloads": -1,
            "filename": "blockingpy_gpu-0.2.3-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "e6ca21bb0c488753deae39d7c790c8ff",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<3.13,>=3.10",
            "size": 8342,
            "upload_time": "2025-08-30T18:50:37",
            "upload_time_iso_8601": "2025-08-30T18:50:37.816348Z",
            "url": "https://files.pythonhosted.org/packages/d6/75/2ebe34ac7fa01900d14ab45e4543c70a4be6b6d31b2b7c03618e4eb37ee1/blockingpy_gpu-0.2.3-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "2d72d9209db19e20f696a67e0f2675fe2e556e553df866d1099cc8aca7bc732f",
                "md5": "e307012176cf0c80265955361908b2e7",
                "sha256": "a7af28b7b94701da4b1753853b4598722120a9bc4ca0ed3f2074e5f13cf56b6e"
            },
            "downloads": -1,
            "filename": "blockingpy_gpu-0.2.3.tar.gz",
            "has_sig": false,
            "md5_digest": "e307012176cf0c80265955361908b2e7",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<3.13,>=3.10",
            "size": 4684,
            "upload_time": "2025-08-30T18:50:41",
            "upload_time_iso_8601": "2025-08-30T18:50:41.827395Z",
            "url": "https://files.pythonhosted.org/packages/2d/72/d9209db19e20f696a67e0f2675fe2e556e553df866d1099cc8aca7bc732f/blockingpy_gpu-0.2.3.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-08-30 18:50:41",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "ncn-foreigners",
    "github_project": "BlockingPy",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "blockingpy-gpu"
}
        