# Name Match
## About the Project
Tool for probabilistically linking the records of individual entities (e.g. people) within and across datasets.
The code was originally developed for linking records in criminal justice-related datasets (arrests, victimizations, city programs, school records, etc.) using at least first name, last name, date of birth, and age (some missingness in DOB and age is tolerated). If available, other data fields like middle initial, race, gender, address, and zipcode can be included to strengthen the quality of the match.
Project Link: https://urban-labs.github.io/namematch/
## Getting Started
### Installation
```
pip install namematch
```
Name Match has been tested using Python 3.7 and 3.8, on both linux and Windows systems. Note, Name Match will not currently work using Python 3.9 on Windows because of the dependency on NMSLIB.
### Reference
* [Examples](https://github.com/urban-labs/namematch/tree/main/examples)
* [End-to-end tutorial](https://github.com/urban-labs/namematch/blob/main/examples/end_to_end_tutorial.ipynb)
* [Overview + usage docs](https://urban-labs.github.io/namematch/about.html)
* [Algorithm docs](https://urban-labs.github.io/namematch/algorithm.html)
* [Developer docs](https://urban-labs.github.io/namematch/api.html)
### Requirements of the input data
Name Match links records by learning a supervised machine learning model that is then used to predict the likelihood that two records "match" (refer to the same person or entity). To build this model the algorithm needs training data with ground-truth "match" or "non-match" labels. In other words, it needs a way of generating a set of record pairs where it knows whether or not the records should be linked. Fortunately, if a subset of the records being input into Name Match already have a unique identifier like Social Securuity Number (SSN) or Fingerprint ID, Name Match is able to generate the training data it needs.
To see an example of this, say you are linking two datasets: dataset A and dataset B. People in dataset A can show up multiple times and can be uniquely identified via SSN. People in dataset B cannot be uniquely identified by any existing data field (hence the reason for using Name Match). If John (SSN 123) has two records in dataset A, we have found an example of two records that we know are a match. If Jane (SSN 456) also has a record in dataset A, we have found an example of two records that we know are NOT a match (Jane's record and either of John's records). Already we are on our way to building a training dataset for the Name Match model to learn from.
To facilitate the above process and make using Name Match possible, **a portion of the input data must meet the following criteria**:
* Already have a unique person or entity identifier that can be used to link records (e.g. SSN or Fingerprint ID)
* Be granular enough that some people or entities appear multiple times (e.g. the same person being arrested two or three times)
* Contain inconsistencies in identifying fields like name and date of birth (e.g. arrested once as John Browne and once as Jonathan Brown)
## Usage
### Package usage
```python
config = {
'data_files': {
'datasetA': {
'filepath' : '../preprocessed_data/datasetA.csv',
'record_id_col' : 'record_id'
},
'datasetB': {
'filepath' : '../preprocessed_data/datasetB.csv',
'record_id_col' : 'record_num'
}
},
'variables': [
{
'name' : 'first_name',
'compare_type' : 'String',
'datasetA' : 'first_name',
'datasetB' : 'fname',
}, {
'name' : 'last_name',
'compare_type' : 'String',
'datasetA' : 'last_name',
'datasetB' : 'lname',
}, {
'name' : 'dob',
'compare_type' : 'Date',
'datasetA' : 'date_of_birth',
'datasetB' : 'dob',
}, {
'name' : 'social_security_number',
'compare_type' : 'UniqueID',
'datasetA' : 'ssn',
'datasetB' : ''
}
]
}
nm = NameMatcher(config=config)
nm.run()
```
See `examples/end_to_end_tutorial.ipynb` or `examples/python_usage/link_data.py` for a full runnable example.
### Command line tool usage
```
cd examples/command_line_usage/
namematch --config-file=config.yaml --output-dir=nm_output --cluster-constraints-file=constraints.py run
```
For more details, please checkout [`examples/command_line_usage/README.md`](examples/command_line_usage/README.md).
## Roadmap
See the [open issues](https://github.com/urban-labs/namematch/issues) for a list of proposed features (and known issues).
## Contributing
All contributions -- to code, documentation, tests, examples, etc. -- are greatly appreciated. For more detailed information, see CONTRIBUTING.md.
1. Fork the project
2. Create your feature branch (git checkout -b some-feature)
3. Commit your changes (git commit -m 'Add some amazing feature')
4. Push to the branch (git push origin some-feature)
5. Open a pull request
## License
Distributed under the GNU Affero General Public License v3.0 license. See LICENSE for more information.
## Team
Melissa McNeill, UChicago Crime and Education Labs
Eddie Tzu-Yun Lin, UChicago Crime and Education Labs
Zubin Jelveh, University of Maryland
## Citation
If you use Name Match in an academic work, please give this citation:
Zubin Jelveh, Melissa McNeill, and Tzu-Yun Lin. 2022. Name Match. https://github.com/urban-labs/namematch.
Raw data
{
"_id": null,
"home_page": "https://github.com/urban-labs/namematch",
"name": "namematch",
"maintainer": "",
"docs_url": null,
"requires_python": "",
"maintainer_email": "",
"keywords": "record linkage",
"author": "University of Chicago Crime Lab",
"author_email": "mmcneill@uchicago.edu, tlin2@uchicago.edu, zjelveh@umd.edu",
"download_url": "https://files.pythonhosted.org/packages/a3/f7/612ea73ee241a7d9c88168ab234f788e1878e3519aa2313bab06c6a8a452/namematch-1.2.1.tar.gz",
"platform": null,
"description": "# Name Match\n\n## About the Project\n\nTool for probabilistically linking the records of individual entities (e.g. people) within and across datasets.\n\nThe code was originally developed for linking records in criminal justice-related datasets (arrests, victimizations, city programs, school records, etc.) using at least first name, last name, date of birth, and age (some missingness in DOB and age is tolerated). If available, other data fields like middle initial, race, gender, address, and zipcode can be included to strengthen the quality of the match.\n\nProject Link:\u00a0https://urban-labs.github.io/namematch/\n\n## Getting Started\n\n### Installation\n```\npip install namematch \n```\n\nName Match has been tested using Python 3.7 and 3.8, on both linux and Windows systems. Note, Name Match will not currently work using Python 3.9 on Windows because of the dependency on NMSLIB.\n\n\n### Reference\n\n* [Examples](https://github.com/urban-labs/namematch/tree/main/examples)\n* [End-to-end tutorial](https://github.com/urban-labs/namematch/blob/main/examples/end_to_end_tutorial.ipynb)\n* [Overview + usage docs](https://urban-labs.github.io/namematch/about.html)\n* [Algorithm docs](https://urban-labs.github.io/namematch/algorithm.html)\n* [Developer docs](https://urban-labs.github.io/namematch/api.html)\n\n\n### Requirements of the input data\n\nName Match links records by learning a supervised machine learning model that is then used to predict the likelihood that two records \"match\" (refer to the same person or entity). To build this model the algorithm needs training data with ground-truth \"match\" or \"non-match\" labels. In other words, it needs a way of generating a set of record pairs where it knows whether or not the records should be linked. Fortunately, if a subset of the records being input into Name Match already have a unique identifier like Social Securuity Number (SSN) or Fingerprint ID, Name Match is able to generate the training data it needs. \n\nTo see an example of this, say you are linking two datasets: dataset A and dataset B. People in dataset A can show up multiple times and can be uniquely identified via SSN. People in dataset B cannot be uniquely identified by any existing data field (hence the reason for using Name Match). If John (SSN 123) has two records in dataset A, we have found an example of two records that we know are a match. If Jane (SSN 456) also has a record in dataset A, we have found an example of two records that we know are NOT a match (Jane's record and either of John's records). Already we are on our way to building a training dataset for the Name Match model to learn from.\n\nTo facilitate the above process and make using Name Match possible, **a portion of the input data must meet the following criteria**: \n* Already have a unique person or entity identifier that can be used to link records (e.g. SSN or Fingerprint ID)\n* Be granular enough that some people or entities appear multiple times (e.g. the same person being arrested two or three times)\n* Contain inconsistencies in identifying fields like name and date of birth (e.g. arrested once as John Browne and once as Jonathan Brown)\n\n\n## Usage\n\n### Package usage\n\n```python\nconfig = {\n \n 'data_files': {\n 'datasetA': {\n 'filepath' : '../preprocessed_data/datasetA.csv',\n 'record_id_col' : 'record_id'\n },\n 'datasetB': {\n 'filepath' : '../preprocessed_data/datasetB.csv',\n 'record_id_col' : 'record_num'\n } \n },\n \n 'variables': [\n {\n 'name' : 'first_name',\n 'compare_type' : 'String',\n 'datasetA' : 'first_name',\n 'datasetB' : 'fname',\n }, {\n 'name' : 'last_name',\n 'compare_type' : 'String',\n 'datasetA' : 'last_name',\n 'datasetB' : 'lname',\n }, {\n 'name' : 'dob',\n 'compare_type' : 'Date',\n 'datasetA' : 'date_of_birth',\n 'datasetB' : 'dob',\n }, {\n 'name' : 'social_security_number',\n 'compare_type' : 'UniqueID', \n 'datasetA' : 'ssn',\n 'datasetB' : ''\n }\n ]\n}\n\nnm = NameMatcher(config=config)\nnm.run()\n```\n\nSee `examples/end_to_end_tutorial.ipynb` or `examples/python_usage/link_data.py` for a full runnable example.\n\n\n### Command line tool usage\n\n```\ncd examples/command_line_usage/\nnamematch --config-file=config.yaml --output-dir=nm_output --cluster-constraints-file=constraints.py run\n```\n\nFor more details, please checkout [`examples/command_line_usage/README.md`](examples/command_line_usage/README.md).\n\n\n## Roadmap\n\nSee the [open issues](https://github.com/urban-labs/namematch/issues) for a list of proposed features (and known issues).\n\n## Contributing\n\nAll contributions -- to code, documentation, tests, examples, etc. -- are\u00a0greatly appreciated. For more detailed information, see CONTRIBUTING.md.\n1. Fork the project\n2. Create your feature branch (git checkout -b some-feature)\n3. Commit your changes (git commit -m 'Add some amazing feature')\n4. Push to the branch (git push origin some-feature)\n5. Open a pull request\n\n## License\n\nDistributed under the GNU Affero General Public License v3.0 license. See\u00a0LICENSE\u00a0for more information.\n\n## Team\n\nMelissa McNeill, UChicago Crime and Education Labs\n\nEddie Tzu-Yun Lin, UChicago Crime and Education Labs\n\nZubin Jelveh, University of Maryland\n\n## Citation\n\nIf you use Name Match in an academic work, please give this citation:\n\nZubin Jelveh, Melissa McNeill, and Tzu-Yun Lin. 2022. Name Match.\u00a0https://github.com/urban-labs/namematch.\n",
"bugtrack_url": null,
"license": "AGPL-3.0",
"summary": "Tool for probabilistically linking the records of individual entities (e.g. people) within and across datasets",
"version": "1.2.1",
"split_keywords": [
"record",
"linkage"
],
"urls": [
{
"comment_text": "",
"digests": {
"md5": "4e40804e86800269cdca5a2e539736f1",
"sha256": "77297cba7f4ec0215b8540eb161f6d58ea3fdb0676a300a63a5fe8ad38c72057"
},
"downloads": -1,
"filename": "namematch-1.2.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "4e40804e86800269cdca5a2e539736f1",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 4963448,
"upload_time": "2022-11-30T15:57:55",
"upload_time_iso_8601": "2022-11-30T15:57:55.924181Z",
"url": "https://files.pythonhosted.org/packages/94/ec/6acb875c55fa44884bd6419044e4be5f82e4bda29d3735a931b61859538d/namematch-1.2.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"md5": "4d91d11584d0823515b3e04da8aa9e67",
"sha256": "39828efe26da16133f0c0b5b0566af742b19fdd4e3aca55cfe23fafbcf2f8f86"
},
"downloads": -1,
"filename": "namematch-1.2.1.tar.gz",
"has_sig": false,
"md5_digest": "4d91d11584d0823515b3e04da8aa9e67",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 4948943,
"upload_time": "2022-11-30T15:58:01",
"upload_time_iso_8601": "2022-11-30T15:58:01.708199Z",
"url": "https://files.pythonhosted.org/packages/a3/f7/612ea73ee241a7d9c88168ab234f788e1878e3519aa2313bab06c6a8a452/namematch-1.2.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2022-11-30 15:58:01",
"github": true,
"gitlab": false,
"bitbucket": false,
"github_user": "urban-labs",
"github_project": "namematch",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "namematch"
}