<p align="center">
<img src="https://github.com/Christopher-Thornton/hmni/blob/master/nametag.png?raw=true" alt="logo" />
</p>
# HMNI
![GitHub](https://img.shields.io/github/license/Christopher-Thornton/hmni)
![PyPI](https://img.shields.io/pypi/v/hmni)
![PyPI - Python Version](https://img.shields.io/pypi/pyversions/hmni)
[![Documentation Status](https://readthedocs.org/projects/hmni/badge/?version=latest)](https://hmni.readthedocs.io/en/latest/?badge=latest)
![PyPI - Downloads](https://img.shields.io/pypi/dm/hmni)
![GitHub repo size](https://img.shields.io/github/repo-size/Christopher-Thornton/hmni)
Fuzzy name matching with machine learning. Perform common fuzzy name matching tasks including similarity scoring, record linkage, deduplication and normalization.
HMNI is trained on an internationally transliterated Latin first-name dataset and prioritizes precision over recall.
| Model | Accuracy | Precision | Recall | F1-Score
|-------------|-----------|-----------|-----------|-----------
| HMNI-Latin | 0.9393 | 0.9255 | 0.7548 | 0.8315
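The F1-score in the table is the harmonic mean of precision and recall, so it can be re-derived from the other two columns. The short check below does that with the reported values; the variable names are illustrative and not part of HMNI.

```python
# Values copied from the HMNI-Latin row of the table above
precision = 0.9255
recall = 0.7548

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(round(f1, 4))  # 0.8315
```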
For an introduction to the methodology and research behind HMNI, please refer to my [blog post](https://towardsdatascience.com/fuzzy-name-matching-with-machine-learning-f09895dce7b4).
## Requirements
### Python 3.5–3.8
- tensorflow
- scikit-learn
- fuzzywuzzy
- abydos
- unidecode
## QUICK USAGE GUIDE
## Installation
Install with pip from PyPI:
```bash
pip install hmni
```
#### Initialize a Matcher Object
```python
import hmni
matcher = hmni.Matcher(model='latin')
```
#### Single Pair Similarity
```python
matcher.similarity('Alan', 'Al')
# 0.6838303319889133
matcher.similarity('Alan', 'Al', prob=False)
# 1
matcher.similarity('Alan Turing', 'Al Turing', surname_first=False)
# 0.6838303319889133
```
#### Record Linkage
```python
import pandas as pd
df1 = pd.DataFrame({'name': ['Al', 'Mark', 'James', 'Harold']})
df2 = pd.DataFrame({'name': ['Mark', 'Alan', 'James', 'Harold']})
merged = matcher.fuzzymerge(df1, df2, how='left', on='name')
```
#### Name Deduplication and Normalization
```python
names_list = ['Alan', 'Al', 'Al', 'James']
matcher.dedupe(names_list, keep='longest')
# ['Alan', 'James']
matcher.dedupe(names_list, keep='frequent')
# ['Al', 'James']
matcher.dedupe(names_list, keep='longest', replace=True)
# ['Alan', 'Alan', 'Alan', 'James']
```
## Matcher Parameters
> **hmni.Matcher**(model='latin', prefilter=True, allow_alt_surname=True, allow_initials=True, allow_missing_components=True)
* **model** *(str)* -- HMNI statistical model (latin by default)
* **prefilter** *(bool)* -- Should the matcher prefilter unlikely candidates (True by default)
* **allow_alt_surname** *(bool)* -- Should the matcher consider phonetically matching surnames, *e.g. Smith, Schmidt* (True by default)
* **allow_initials** *(bool)* -- Should the matcher consider names with initials (True by default)
* **allow_missing_components** *(bool)* -- Should the matcher consider names with missing components (True by default)
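To make the idea of phonetically matching surnames concrete, here is a minimal sketch using American Soundex. This is only an illustration: the README does not say which phonetic algorithm HMNI applies, so Soundex is a swapped-in stand-in, and the `soundex` function is a hypothetical helper rather than part of the HMNI API.

```python
# Letter-to-digit table for American Soundex
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return the 4-character American Soundex code of a name."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # vowels reset the previous code to ""
    return (result + "000")[:4]


# Both surnames from the example above share one code
print(soundex('Smith'), soundex('Schmidt'))  # S530 S530
```

Two surnames with the same code would be treated as phonetic alternatives under this sketch; a real matcher may use a different or more refined encoding.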
## Matcher Methods
> **similarity**(name_a, name_b, prob=True, threshold=0.5, surname_first=False)
* **name_a** *(str)* -- First name for comparison
* **name_b** *(str)* -- Second name for comparison
* **prob** *(bool)* -- If True, return the predicted match probability; otherwise return a binary class label (True by default)
* **threshold** *(float)* -- Prediction probability threshold for positive match (0.5 by default)
* **surname_first** *(bool)* -- If name strings start with surname (False by default)
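The relationship between `prob` and `threshold` can be pictured as a simple cut-off on the predicted probability. The helper below is a hypothetical sketch, not HMNI's code; in particular, whether a score exactly equal to the threshold counts as a match is an assumption.

```python
def to_label(probability, threshold=0.5):
    """Turn a match probability into a binary class label (1 = match).

    Hypothetical helper: the >= comparison is an assumption about how
    a score exactly at the threshold is treated.
    """
    return 1 if probability >= threshold else 0


score = 0.6838303319889133  # probability shown for ('Alan', 'Al') in the usage guide
print(to_label(score))                 # 1
print(to_label(score, threshold=0.8))  # 0
```

Raising the threshold trades recall for precision: fewer pairs are labeled as matches, but those that are carry higher predicted probabilities.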
> **fuzzymerge**(df1, df2, how='inner', on=None, left_on=None, right_on=None, indicator=False, limit=1, threshold=0.5, allow_exact_matches=True, surname_first=False)
* **df1** *(pandas DataFrame or named Series)* -- First/Left object to merge with
* **df2** *(pandas DataFrame or named Series)* -- Second/Right object to merge with
* **how** *(str)* -- Type of merge to be performed
  * `inner` (default): Use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys
  * `left`: Use only keys from left frame, similar to a SQL left outer join; preserve key order
  * `right`: Use only keys from right frame, similar to a SQL right outer join; preserve key order
  * `outer`: Use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically
* **on** *(label or list)* -- Column or index level names to join on. These must be found in both DataFrames
* **left_on** *(label or list)* -- Column or index level names to join on in the left DataFrame
* **right_on** *(label or list)* -- Column or index level names to join on in the right DataFrame
* **indicator** *(bool)* -- If True, adds a column called `_merge` to the output DataFrame with information on the source of each row (False by default)
* **limit** *(int)* -- Top number of name matches to consider (1 by default)
* **threshold** *(float)* -- Prediction probability threshold for positive match (0.5 by default)
* **allow_exact_matches** *(bool)* -- If True allow merging on exact name matches, else do not consider exact matches (True by default)
* **surname_first** *(bool)* -- If name strings start with surname (False by default)
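How `how='left'`, `threshold` and `limit` interact can be sketched in plain Python. The example below is an assumption-laden illustration: it uses `difflib.SequenceMatcher` from the standard library as a stand-in scorer instead of HMNI's model, and `string_score` and `fuzzy_left_join` are hypothetical names, not HMNI functions.

```python
from difflib import SequenceMatcher


def string_score(a, b):
    # Stand-in scorer: plain character similarity, NOT HMNI's model
    return SequenceMatcher(None, a, b).ratio()


def fuzzy_left_join(left, right, threshold=0.5, limit=1):
    """Pair each left name with up to `limit` right names scoring >= threshold."""
    rows = []
    for name in left:
        scored = sorted(((string_score(name, cand), cand) for cand in right),
                        reverse=True)
        matches = [cand for score, cand in scored if score >= threshold][:limit]
        # A left join keeps unmatched left names, paired with None
        rows.extend((name, cand) for cand in (matches or [None]))
    return rows


left = ['Al', 'Mark', 'James', 'Harold']
right = ['Mark', 'Alan', 'James', 'Harold']
print(fuzzy_left_join(left, right))
# [('Al', 'Alan'), ('Mark', 'Mark'), ('James', 'James'), ('Harold', 'Harold')]
```

With a higher `limit`, a left name could be paired with several right names, producing more than one output row for that name.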
> **dedupe**(names, threshold=0.5, keep='longest', reverse=True, limit=3, replace=False, surname_first=False)
* **names** *(list)* -- List of names to dedupe
* **threshold** *(float)* -- Prediction probability threshold for positive match (0.5 by default)
* **keep** *(str)* -- Specifies method for keeping one of multiple alternative names
  * `longest` (default): Keeps longest name
  * `frequent`: Keeps most frequent name in names list
* **reverse** *(bool)* -- If True, sort matches in descending order, else ascending (True by default)
* **limit** *(int)* -- Top number of name matches to consider (3 by default)
* **replace** *(bool)* -- If True return normalized name list, else return deduplicated name list (False by default)
* **surname_first** *(bool)* -- If name strings start with surname (False by default)
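The effect of `keep` and `replace` can be sketched once names have been grouped into clusters of matches. In the example below the grouping is supplied by hand as an assumption (in HMNI it would come from the model), and `pick_canonical` and `dedupe_groups` are hypothetical helpers that only mirror the behaviour shown in the usage guide.

```python
from collections import Counter


def pick_canonical(group, keep='longest'):
    """Choose one representative from a group of names judged to match."""
    if keep == 'longest':
        return max(group, key=len)
    if keep == 'frequent':
        return Counter(group).most_common(1)[0][0]
    raise ValueError("keep must be 'longest' or 'frequent'")


def dedupe_groups(names, groups, keep='longest', replace=False):
    """`groups` is an assumed, precomputed list of matching-name clusters."""
    canonical = {name: pick_canonical(g, keep) for g in groups for name in g}
    if replace:
        return [canonical[n] for n in names]  # normalized list, same length
    seen = []
    for n in names:
        if canonical[n] not in seen:
            seen.append(canonical[n])
    return seen  # one representative per cluster


names_list = ['Alan', 'Al', 'Al', 'James']
groups = [['Alan', 'Al', 'Al'], ['James']]  # hand-made clusters (assumption)
print(dedupe_groups(names_list, groups, keep='longest'))
# ['Alan', 'James']
print(dedupe_groups(names_list, groups, keep='frequent'))
# ['Al', 'James']
print(dedupe_groups(names_list, groups, keep='longest', replace=True))
# ['Alan', 'Alan', 'Alan', 'James']
```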
> **assign_similarity**(name_a, name_b, score)
* **name_a** *(str)* -- First name for similarity score assignment
* **name_b** *(str)* -- Second name for similarity score assignment
* **score** *(float)* -- Assigned similarity score for pair of names
## Contributing
Pull requests are welcome.
For developers wishing to build a model for Latin or non-Latin writing systems (Chinese, Cyrillic, Arabic),
Jupyter notebooks for building models with similar methods are shared in the `dev` folder.
## License
[MIT](https://choosealicense.com/licenses/mit/)
Raw data
{
"_id": null,
"home_page": "https://github.com/Christopher-Thornton/hmni",
"name": "hmni",
"maintainer": "",
"docs_url": null,
"requires_python": "",
"maintainer_email": "",
"keywords": "fuzzy-matching,natural-language-processing,nlp,machine-learning,data-science,python,artificial-intelligence,ai",
"author": "Christopher Thornton",
"author_email": "christopher_thornton@outlook.com",
"download_url": "https://files.pythonhosted.org/packages/9b/86/0c5b4406c666ad73feef5d8dab012f04a6e4f1a31eeab60b32f530cb4fb7/hmni-0.1.8.tar.gz",
"platform": "",
"bugtrack_url": null,
"license": "MIT",
"summary": "Fuzzy Name Matching with Machine Learning",
"version": "0.1.8",
"project_urls": {
"Download": "https://github.com/Christopher-Thornton/hmni/archive/v0.1.8.zip",
"Homepage": "https://github.com/Christopher-Thornton/hmni"
},
"split_keywords": [
"fuzzy-matching",
"natural-language-processing",
"nlp",
"machine-learning",
"data-science",
"python",
"artificial-intelligence",
"ai"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "6010ef662c2d9d01f2fc5b13c0a779259b12b3b916bcde6686841496bcd665a4",
"md5": "0928b37676cc7a061114a37ba30f45ca",
"sha256": "f262fa842b0d7a6c7e1b5ae643c32e9ac9f61e33e79c7bab1396b7e5ef8aac36"
},
"downloads": -1,
"filename": "hmni-0.1.8-py3-none-any.whl",
"has_sig": false,
"md5_digest": "0928b37676cc7a061114a37ba30f45ca",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 22189171,
"upload_time": "2020-09-14T05:27:30",
"upload_time_iso_8601": "2020-09-14T05:27:30.991272Z",
"url": "https://files.pythonhosted.org/packages/60/10/ef662c2d9d01f2fc5b13c0a779259b12b3b916bcde6686841496bcd665a4/hmni-0.1.8-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "9b860c5b4406c666ad73feef5d8dab012f04a6e4f1a31eeab60b32f530cb4fb7",
"md5": "290e9969eab1fa04649507ed026d4cca",
"sha256": "7d2d339c6848a509ac5bf99b6f925e1eee4cd858bec1d8233c87f375d9ed0063"
},
"downloads": -1,
"filename": "hmni-0.1.8.tar.gz",
"has_sig": false,
"md5_digest": "290e9969eab1fa04649507ed026d4cca",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 22188710,
"upload_time": "2020-09-14T05:27:36",
"upload_time_iso_8601": "2020-09-14T05:27:36.323599Z",
"url": "https://files.pythonhosted.org/packages/9b/86/0c5b4406c666ad73feef5d8dab012f04a6e4f1a31eeab60b32f530cb4fb7/hmni-0.1.8.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2020-09-14 05:27:36",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "Christopher-Thornton",
"github_project": "hmni",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"requirements": [],
"lcname": "hmni"
}