matchain

Name	matchain
Version	0.1.2
Summary	Record linkage - simple, flexible, efficient.
upload_time	2023-11-03 16:50:26
docs_url	None
author	Andreas Eibeck
requires_python	>=3.8.16
license	BSD-3-Clause
keywords	record linkage instance matching

# MatChain: Simple, Flexible, Efficient

MatChain is an experimental package designed for record linkage. Record linkage is the process of matching records that correspond to the same real-world entity in two or more datasets. This process typically includes several steps, such as blocking and the final matching decision, with a wide range of methods available, including probabilistic, rule-based, and machine learning approaches.

MatChain was created with three core objectives in mind: simplicity, flexibility, and efficiency. It focuses on unsupervised approaches to minimize manual efforts, allows for customization of matching steps, and offers fast and resource-efficient implementations.

MatChain leverages libraries like Pandas, NumPy, and SciPy for [vectorized data handling](https://ipython-books.github.io/45-understanding-the-internals-of-numpy-to-avoid-unnecessary-array-copying/), [advanced indexing](https://numpy.org/doc/stable/user/basics.indexing.html#advanced-indexing), and support for [sparse matrices](https://docs.scipy.org/doc/scipy/tutorial/sparse.html).
It also uses scikit-learn and SentenceTransformers to convert strings into sparse vectors and dense vectors, respectively. This makes it possible to perform blocking as an approximate nearest neighbour search in the resulting set of vectors, using libraries like [NMSLIB](https://github.com/nmslib/nmslib) and [Faiss](https://github.com/facebookresearch/faiss).

The currently published version of MatChain exclusively provides [AutoCal](https://como.ceb.cam.ac.uk/preprints/293/) as the matching algorithm. AutoCal is an unsupervised method initially designed for instance matching with [Ontomatch](https://github.com/cambridge-cares/TheWorldAvatar/tree/main/Agents/OntoMatchAgent) in the context of [The World Avatar](https://theworldavatar.io/).
MatChain's implementation is designed for speed and low resource use, and it allows AutoCal to be combined with various procedures for blocking and computing similarity scores.

## Installation

MatChain requires Python 3.8.16 or higher and can be installed with pip:

``` console
pip install matchain
```

However, this only installs PyTorch's CPU version. If you want to use the GPU version, you need to install it separately:

``` console
pip install torch==2.0.1+cu118 torchvision==0.15.2+cu118 torchaudio==2.0.2+cu118 --extra-index-url https://download.pytorch.org/whl/cu118
```
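To check which PyTorch build ended up in the environment, a small helper like the following can be used. This is a minimal sketch and not part of MatChain: the function name `cuda_available` is hypothetical, and the only PyTorch call it relies on is `torch.cuda.is_available()`.

``` python
def cuda_available() -> bool:
    """Return True if PyTorch is installed and can see a CUDA device."""
    try:
        import torch  # imported lazily so the check also works without PyTorch
    except ImportError:
        return False
    return bool(torch.cuda.is_available())


if __name__ == "__main__":
    print("CUDA available:", cuda_available())
```

If this prints `False` after installing the GPU wheels, the CPU build is probably still the one being imported.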

## Basic Example Using the API

In this example, we demonstrate how to match two datasets, denoted as A and B, based on columns with the same names: "year", "title", "authors", and "venue". You can run this example in the accompanying notebook [run_matchain_api.ipynb](https://github.com/ae3000/matchain/blob/main/notebooks/run_matchain_api.ipynb), which explains MatChain's API in detail, including how to specify parameters.

First, we read the data into two pandas DataFrames and use them to initialize an instance of the class ```MatChain```.

``` python
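import pandas as pd
import matchain.api

# Load the two datasets to be matched (tables A and B of DBLP-ACM).
data_dir = './data/Structured/DBLP-ACM'
dfa = pd.read_csv(f'{data_dir}/tableA.csv')
dfb = pd.read_csv(f'{data_dir}/tableB.csv')

mat = matchain.api.MatChain(dfa, dfb)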
```

Next, we specify one or more similarity functions for each matching column with the ```property``` method. These similarity functions calculate scores between 0 and 1 for pairs of column values. In this example, we use ```equal``` for the integer-valued "year" column, which returns 1 if two years are equal and 0 otherwise. For each of the remaining string-valued columns, we apply ```shingle_tfidf```, which generates a sparse vector for each string based on its shingles (character-level n-grams) and computes the cosine similarity between the sparse vectors of each pair of strings:

``` python
mat.property('year', simfct='equal')
mat.property('title', simfct='shingle_tfidf')
mat.property('authors', simfct='shingle_tfidf')
mat.property('venue', simfct='shingle_tfidf')
```
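To make the two kinds of similarity function more concrete, here is a standard-library-only sketch of the idea. It is an illustration under stated assumptions, not MatChain's implementation: the function names (`shingles`, `tfidf_vectors`, `cosine`, `equal_similarity`), the shingle size of 3, and the smoothed IDF formula are all choices made for this example.

``` python
import math
from collections import Counter
from typing import Dict, List


def shingles(text: str, n: int = 3) -> List[str]:
    """Return the character n-grams (shingles) of a lower-cased string."""
    text = text.lower()
    if len(text) <= n:
        return [text]
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(strings: List[str], n: int = 3) -> List[Dict[str, float]]:
    """Build one sparse TF-IDF vector (dict: shingle -> weight) per string."""
    counts = [Counter(shingles(s, n)) for s in strings]
    doc_freq = Counter(sh for c in counts for sh in c)  # strings containing each shingle
    total = len(strings)
    vectors = []
    for c in counts:
        # Smoothed inverse document frequency: rarer shingles get larger weights.
        vectors.append({sh: tf * (math.log((1 + total) / (1 + doc_freq[sh])) + 1)
                        for sh, tf in c.items()})
    return vectors


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity of two sparse vectors; lies in [0, 1] for non-negative weights."""
    dot = sum(w * v.get(sh, 0.0) for sh, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def equal_similarity(a, b) -> float:
    """Return 1.0 if both values are equal and 0.0 otherwise."""
    return 1.0 if a == b else 0.0


titles = [
    "Efficient query processing in databases",
    "Efficient query procesing in database",   # same paper, with typos
    "A survey of graph neural networks",       # unrelated paper
]
vecs = tfidf_vectors(titles)
sim_same = cosine(vecs[0], vecs[1])
sim_other = cosine(vecs[0], vecs[2])
print(f"near-duplicate titles: {sim_same:.2f}, unrelated titles: {sim_other:.2f}")
print("year score:", equal_similarity(2003, 2003))
```

Because shingles overlap heavily even when a word is misspelled, the near-duplicate titles score much higher than the unrelated pair, which is what makes this representation useful for noisy string columns.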

As the total number of record pairs grows with the product of the numbers of records in datasets A and B, classifying every pair as matching or non-matching becomes computationally expensive for large datasets. Blocking reduces the number of candidate pairs while discarding only a small fraction of the true matching pairs. The following line specifies three columns to use for blocking. By default, MatChain uses the library [sparse_dot_topn](https://github.com/ing-bank/sparse_dot_topn) to perform blocking as a nearest neighbour search on the same shingle vectors mentioned earlier:

``` python
mat.blocking(blocking_props=['title', 'authors', 'venue'])
```
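The effect of blocking can be illustrated with a small brute-force sketch that keeps, for each record in A, only its most similar records in B as candidate pairs. This is an assumption-laden toy, not what MatChain or sparse_dot_topn do internally: it swaps in Jaccard similarity on shingle sets for the TF-IDF cosine similarity, it still scores every pair (real libraries avoid this with sparse matrix products or index structures), and the names `shingle_set`, `jaccard`, and `top_n_blocking` are hypothetical.

``` python
import heapq
from typing import List, Set, Tuple


def shingle_set(text: str, n: int = 3) -> Set[str]:
    """Return the set of character n-grams of a lower-cased string."""
    text = text.lower()
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two sets (a stand-in for cosine similarity)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def top_n_blocking(values_a: List[str], values_b: List[str],
                   n_top: int = 1) -> List[Tuple[int, int]]:
    """For each value in A, keep the n_top most similar values in B as candidates."""
    sets_b = [shingle_set(v) for v in values_b]
    candidates = []
    for i, value in enumerate(values_a):
        set_a = shingle_set(value)
        scored = ((jaccard(set_a, set_b), j) for j, set_b in enumerate(sets_b))
        for score, j in heapq.nlargest(n_top, scored):
            if score > 0:  # drop pairs without any shared shingle
                candidates.append((i, j))
    return candidates


titles_a = ["Efficient query processing", "Graph neural networks",
            "Record linkage at scale"]
titles_b = ["Graph neural network survey", "Scalable record linkage",
            "Eficient query procesing", "Quantum computing basics"]

candidates = top_n_blocking(titles_a, titles_b, n_top=1)
print(f"{len(candidates)} candidate pairs instead of {len(titles_a) * len(titles_b)}")
print(candidates)
```

Only the surviving candidate pairs need to be scored by the similarity functions and passed to the matching algorithm, which is where the savings come from on large datasets.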

Finally, we call ```autocal``` to execute the matching algorithm AutoCal and ```predict``` to get the predicted matching pairs:

``` python
mat.autocal()
predicted_matches = mat.predict()
```


## Configuration File

While the example above demonstrates how to use MatChain's API to match two datasets, an alternative and streamlined approach is to utilize a configuration file. This method allows us to specify datasets, matching chains, and parameters in a separate file:

``` console
python matchain --config ./config/mccommands.yaml
```

For more detailed information about configuration options, run the notebook [run_matchain_config.ipynb](https://github.com/ae3000/matchain/blob/main/notebooks/run_matchain_config.ipynb).



## Datasets

The data subdirectory includes pairs of example datasets and ground truth data for evaluating MatChain's performance. These datasets cover several domains: restaurants, bibliography, products, and power plants. Four of them originate from [this paper](https://dbs.uni-leipzig.de/files/research/publications/2010-9/pdf/EvaluationOfEntityResolutionApproaches_vldb2010_CameraReady.pdf) and were downloaded from the [DeepMatcher Data Repository](https://github.com/anhaidgroup/deepmatcher/blob/master/Datasets.md). Two additional dataset pairs are related to the power plant domain and were originally used for [AutoCal](https://como.ceb.cam.ac.uk/preprints/293/).
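Evaluating predicted matches against ground truth usually comes down to precision, recall, and F1. The sketch below shows the arithmetic under one assumption: both the predictions and the ground truth are available as sets of (index in A, index in B) pairs. How MatChain represents its predictions and how the ground truth files are laid out is not covered here, and the function name `precision_recall_f1` is hypothetical.

``` python
from typing import Set, Tuple

Pair = Tuple[int, int]


def precision_recall_f1(predicted: Set[Pair], truth: Set[Pair]) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 for predicted matching pairs."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data: four predicted pairs, three true pairs, two of them in common.
predicted = {(0, 2), (1, 0), (2, 1), (3, 3)}
truth = {(0, 2), (1, 0), (2, 4)}

precision, recall, f1 = precision_recall_f1(predicted, truth)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here two of the four predictions are correct (precision 2/4) and two of the three true pairs are found (recall 2/3), giving an F1 of 4/7.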

Raw data

{
    "_id": null,
    "home_page": "",
    "name": "matchain",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.8.16",
    "maintainer_email": "",
    "keywords": "record linkage,instance matching",
    "author": "Andreas Eibeck",
    "author_email": "",
    "download_url": "https://files.pythonhosted.org/packages/8a/59/1f0e6c5c79999b66dea71d57c4cab18c71c7ff9b5d4e404dac41f07d1a31/matchain-0.1.2.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "BSD-3-Clause",
    "summary": "Record linkage - simple, flexible, efficient.",
    "version": "0.1.2",
    "project_urls": {
        "repository": "https://github.com/ae3000/matchain"
    },
    "split_keywords": [
        "record linkage",
        "instance matching"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "252fde2c55a11b8647c570a366a3cfd3ec3879a0376527d868dbbe8397a8ea4b",
                "md5": "2b9bfa290205ccb9fd1dd77a9201067d",
                "sha256": "6145064dc343a3c609b93fbee9786ca851c574578441898a14cfe6620ca63492"
            },
            "downloads": -1,
            "filename": "matchain-0.1.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "2b9bfa290205ccb9fd1dd77a9201067d",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8.16",
            "size": 62916,
            "upload_time": "2023-11-03T16:50:24",
            "upload_time_iso_8601": "2023-11-03T16:50:24.954264Z",
            "url": "https://files.pythonhosted.org/packages/25/2f/de2c55a11b8647c570a366a3cfd3ec3879a0376527d868dbbe8397a8ea4b/matchain-0.1.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "8a591f0e6c5c79999b66dea71d57c4cab18c71c7ff9b5d4e404dac41f07d1a31",
                "md5": "f249e5c1feb91113ac7d26cd1a2107be",
                "sha256": "2830cd944c7346d81a8262f790da865625bf2719230958c85b81254b35c129f4"
            },
            "downloads": -1,
            "filename": "matchain-0.1.2.tar.gz",
            "has_sig": false,
            "md5_digest": "f249e5c1feb91113ac7d26cd1a2107be",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8.16",
            "size": 58551,
            "upload_time": "2023-11-03T16:50:26",
            "upload_time_iso_8601": "2023-11-03T16:50:26.185791Z",
            "url": "https://files.pythonhosted.org/packages/8a/59/1f0e6c5c79999b66dea71d57c4cab18c71c7ff9b5d4e404dac41f07d1a31/matchain-0.1.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-11-03 16:50:26",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "ae3000",
    "github_project": "matchain",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "matchain"
}
