hmni

* **Name** -- hmni
* **Version** -- 0.1.8
* **Home page** -- https://github.com/Christopher-Thornton/hmni
* **Summary** -- Fuzzy Name Matching with Machine Learning
* **Upload time** -- 2020-09-14 05:27:36
* **Author** -- Christopher Thornton
* **License** -- MIT
* **Keywords** -- fuzzy-matching, natural-language-processing, nlp, machine-learning, data-science, python, artificial-intelligence, ai
* **Requirements** -- No requirements were recorded.

<p align="center">
  <img src="https://github.com/Christopher-Thornton/hmni/blob/master/nametag.png?raw=true" alt="logo" />
</p>

# HMNI

![GitHub](https://img.shields.io/github/license/Christopher-Thornton/hmni)
![PyPI](https://img.shields.io/pypi/v/hmni)
![PyPI - Python Version](https://img.shields.io/pypi/pyversions/hmni)
[![Documentation Status](https://readthedocs.org/projects/hmni/badge/?version=latest)](https://hmni.readthedocs.io/en/latest/?badge=latest)
![PyPI - Downloads](https://img.shields.io/pypi/dm/hmni)
![GitHub repo size](https://img.shields.io/github/repo-size/Christopher-Thornton/hmni)

Fuzzy name matching with machine learning. Perform common fuzzy name matching tasks including similarity scoring, record linkage, deduplication and normalization.

HMNI is trained on an internationally transliterated dataset of Latin-script first names, and it prioritizes precision over recall.

| Model      | Accuracy | Precision | Recall | F1-Score |
|------------|----------|-----------|--------|----------|
| HMNI-Latin | 0.9393   | 0.9255    | 0.7548 | 0.8315   |
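
The F1-score is the harmonic mean of precision and recall, so the last column can be checked against the two before it. A quick sanity check, using only the figures reported in the table:

```python
# Figures copied from the HMNI-Latin row of the table.
precision = 0.9255
recall = 0.7548

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(round(f1, 4))  # 0.8315, which agrees with the reported F1-Score
```

The gap between the two inputs (0.9255 versus 0.7548) is what it means for precision to take priority: the model misses some true matches in exchange for fewer false ones.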

For an introduction to the methodology and research behind HMNI, please refer to my [blog post](https://towardsdatascience.com/fuzzy-name-matching-with-machine-learning-f09895dce7b4).

## Requirements
### Python 3.5–3.8
-  tensorflow
-  scikit-learn
-  fuzzywuzzy
-  abydos
-  unidecode

## Quick Usage Guide
### Installation
Install from PyPI using pip:
```bash
pip install hmni
```
#### Initialize a Matcher Object
```python
import hmni
matcher = hmni.Matcher(model='latin')
```
#### Single Pair Similarity
```python
matcher.similarity('Alan', 'Al')
# 0.6838303319889133

matcher.similarity('Alan', 'Al', prob=False)
# 1

matcher.similarity('Alan Turing', 'Al Turing', surname_first=False)
# 0.6838303319889133
```
#### Record Linkage
```python
import pandas as pd

df1 = pd.DataFrame({'name': ['Al', 'Mark', 'James', 'Harold']})
df2 = pd.DataFrame({'name': ['Mark', 'Alan', 'James', 'Harold']})

merged = matcher.fuzzymerge(df1, df2, how='left', on='name')
```
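For contrast, here is what an exact left join produces on the same names. This is a plain-Python sketch that uses neither pandas nor HMNI; `exact_left_join` is a hypothetical helper written only for this illustration.

```python
left = ['Al', 'Mark', 'James', 'Harold']
right = ['Mark', 'Alan', 'James', 'Harold']


def exact_left_join(left_names, right_names):
    """Pair each left name with an identical right name, or None if absent."""
    right_set = set(right_names)
    return [(name, name if name in right_set else None) for name in left_names]


pairs = exact_left_join(left, right)
for left_name, right_name in pairs:
    print(left_name, '->', right_name)
```

Under exact matching `'Al'` has no partner. A fuzzy merge is meant to close that gap by linking `'Al'` to `'Alan'` when their similarity score clears the threshold.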
#### Name Deduplication and Normalization
```python
names_list = ['Alan', 'Al', 'Al', 'James']

matcher.dedupe(names_list, keep='longest')
# ['Alan', 'James']

matcher.dedupe(names_list, keep='frequent')
# ['Al', 'James']

matcher.dedupe(names_list, keep='longest', replace=True)
# ['Alan', 'Alan', 'Alan', 'James']
```
## Matcher Parameters
> **hmni.Matcher**(model='latin', prefilter=True, allow_alt_surname=True, allow_initials=True, allow_missing_components=True)
* **model** *(str)* -- HMNI statistical model (latin by default)
* **prefilter** *(bool)* -- Whether the matcher prefilters unlikely candidates (True by default)
* **allow_alt_surname** *(bool)* -- Whether the matcher considers phonetically matching surnames, *e.g. Smith, Schmidt* (True by default)
* **allow_initials** *(bool)* -- Whether the matcher considers names with initials (True by default)
* **allow_missing_components** *(bool)* -- Whether the matcher considers names with missing components (True by default)
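
To make the idea behind `allow_initials` concrete, the sketch below shows one way a name component written as an initial could be compared with a full component. It is an assumption about what "names with initials" means, not HMNI's actual logic; `is_initial` and `initial_compatible` are hypothetical helpers.

```python
def is_initial(component):
    """True for a single letter, optionally followed by a period (e.g. 'A' or 'A.')."""
    stripped = component.rstrip('.')
    return len(stripped) == 1 and stripped.isalpha()


def initial_compatible(component_a, component_b):
    """True if at least one component is an initial and both start with the same letter."""
    a, b = component_a.strip(), component_b.strip()
    if not a or not b:
        return False
    if is_initial(a) or is_initial(b):
        return a[0].lower() == b[0].lower()
    return False


print(initial_compatible('A.', 'Alan'))   # initial agrees with the first letter
print(initial_compatible('A', 'Bob'))     # initial disagrees
print(initial_compatible('Al', 'Alan'))   # neither component is an initial
```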

## Matcher Methods
> **similarity**(name_a, name_b, prob=True, threshold=0.5, surname_first=False)
* **name_a** *(str)* -- First name for comparison
* **name_b** *(str)* -- Second name for comparison
* **prob** *(bool)* -- If True, return a predicted probability; otherwise return a binary class label (True by default)
* **threshold** *(float)* -- Prediction probability threshold for positive match (0.5 by default)
* **surname_first** *(bool)* -- If name strings start with surname (False by default)
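
The relationship between `prob` and `threshold` can be sketched as a simple cut-off on the predicted probability. The helper `to_label` is hypothetical, and the use of `>=` rather than `>` at the boundary is an assumption; the comparison inside HMNI may differ.

```python
def to_label(probability, threshold=0.5):
    """Convert a match probability into a binary label (1 = match, 0 = no match)."""
    # Assumption: a probability exactly equal to the threshold counts as a match.
    return 1 if probability >= threshold else 0


# Probability shown for ('Alan', 'Al') in the usage guide.
score = 0.6838303319889133

print(to_label(score))                 # default threshold of 0.5
print(to_label(score, threshold=0.8))  # a stricter threshold rejects the pair
```

Raising the threshold trades recall for precision: fewer pairs are labeled as matches, but those that are carry higher scores.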

> **fuzzymerge**(df1, df2, how='inner', on=None, left_on=None, right_on=None, indicator=False, limit=1, threshold=0.5, allow_exact_matches=True, surname_first=False)
* **df1** *(pandas DataFrame or named Series)* -- First/Left object to merge with
* **df2** *(pandas DataFrame or named Series)* -- Second/Right object to merge with
* **how** *(str)* -- Type of merge to be performed
    * `inner` (default): Use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys
    * `left`: Use only keys from left frame, similar to a SQL left outer join; preserve key order
    * `right`: Use only keys from right frame, similar to a SQL right outer join; preserve key order
    * `outer`: Use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically
* **on** *(label or list)* -- Column or index level names to join on. These must be found in both DataFrames
* **left_on** *(label or list)* -- Column or index level names to join on in the left DataFrame
* **right_on** *(label or list)* -- Column or index level names to join on in the right DataFrame
* **indicator** *(bool)* -- If True, adds a column to output DataFrame called `_merge` with information on the source of each row (False by default)
* **limit** *(int)* -- Top number of name matches to consider (1 by default)
* **threshold** *(float)* -- Prediction probability threshold for positive match (0.5 by default)
* **allow_exact_matches** *(bool)* -- If True allow merging on exact name matches, else do not consider exact matches (True by default)
* **surname_first** *(bool)* -- If name strings start with surname (False by default)
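
The sketch below illustrates how `limit`, `threshold` and `allow_exact_matches` could interact in a left-style fuzzy join. It is not HMNI's implementation: the scorer is a stand-in built on `difflib.SequenceMatcher` from the standard library instead of the HMNI model, plain lists replace DataFrames, and `fuzzy_left_join` is a hypothetical helper.

```python
from difflib import SequenceMatcher


def stand_in_score(a, b):
    """Stand-in similarity in [0, 1] based on difflib; NOT the HMNI model."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_left_join(left, right, limit=1, threshold=0.5,
                    allow_exact_matches=True, score=stand_in_score):
    """For each left name, keep up to `limit` right names scoring >= threshold."""
    rows = []
    for name in left:
        # Optionally exclude identical strings from the candidate pool.
        candidates = [
            (score(name, other), other)
            for other in right
            if allow_exact_matches or other != name
        ]
        # Best-scoring candidates first.
        candidates.sort(key=lambda pair: pair[0], reverse=True)
        matches = [other for s, other in candidates if s >= threshold][:limit]
        if matches:
            rows.extend((name, match) for match in matches)
        else:
            # A left join keeps unmatched left names, paired with None.
            rows.append((name, None))
    return rows


left = ['Al', 'Mark', 'James', 'Harold']
right = ['Mark', 'Alan', 'James', 'Harold']

for row in fuzzy_left_join(left, right):
    print(row)
```

With the stand-in scorer, `'Al'` is paired with `'Alan'` and the other names pair with themselves. Passing `allow_exact_matches=False` removes those identical pairs from consideration, which is useful when only variant spellings are of interest.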

> **dedupe**(names, threshold=0.5, keep='longest', reverse=True, limit=3, replace=False, surname_first=False)
* **names** *(list)* -- List of names to dedupe
* **threshold** *(float)* -- Prediction probability threshold for positive match (0.5 by default)
* **keep** *(str)* -- Specifies method for keeping one of multiple alternative names 
    * `longest` (default): Keeps longest name
    * `frequent`: Keeps most frequent name in names list
* **reverse** *(bool)* -- If True, sort matches in descending order, else ascending (True by default)
* **limit** *(int)* -- Top number of name matches to consider (3 by default)
* **replace** *(bool)* -- If True return normalized name list, else return deduplicated name list (False by default) 
* **surname_first** *(bool)* -- If name strings start with surname (False by default)
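
The `keep` and `replace` options can be illustrated on a group of names that has already been judged to refer to the same name. The sketch assumes the grouping step is done (in HMNI that would rely on the similarity model); `pick_canonical` is a hypothetical helper, not part of the library.

```python
from collections import Counter


def pick_canonical(group, keep='longest'):
    """Choose one representative from a group of equivalent names."""
    if keep == 'longest':
        return max(group, key=len)
    if keep == 'frequent':
        return Counter(group).most_common(1)[0][0]
    raise ValueError("keep must be 'longest' or 'frequent'")


names_list = ['Alan', 'Al', 'Al', 'James']
group = ['Alan', 'Al', 'Al']  # assumed to be matched to each other already

print(pick_canonical(group, keep='longest'))   # Alan
print(pick_canonical(group, keep='frequent'))  # Al

# replace=True style normalization: every member of the group becomes the canonical name.
canonical = pick_canonical(group, keep='longest')
normalized = [canonical if name in group else name for name in names_list]
print(normalized)  # ['Alan', 'Alan', 'Alan', 'James']
```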

> **assign_similarity**(name_a, name_b, score)
* **name_a** *(str)* -- First name for similarity score assignment
* **name_b** *(str)* -- Second name for similarity score assignment
* **score** *(float)* -- Assigned similarity score for pair of names
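
Conceptually, assigning a similarity amounts to recording a manual override that takes precedence over a predicted score for one pair of names. The sketch below shows one way such a lookup table could work. How HMNI stores assigned scores is not documented here, so the class `ScoreOverrides`, its case-insensitive keys and its order-independent pairs are all assumptions.

```python
class ScoreOverrides:
    """Hypothetical table of manually assigned pair scores."""

    def __init__(self):
        self._scores = {}

    @staticmethod
    def _key(name_a, name_b):
        # Assumption: pairs are case-insensitive and order-independent.
        return frozenset((name_a.strip().lower(), name_b.strip().lower()))

    def assign(self, name_a, name_b, score):
        """Record a fixed score for this pair of names."""
        self._scores[self._key(name_a, name_b)] = float(score)

    def lookup(self, name_a, name_b, default=None):
        """Return the assigned score, or `default` if none was assigned."""
        return self._scores.get(self._key(name_a, name_b), default)


overrides = ScoreOverrides()
overrides.assign('Al', 'Albert', 0.0)     # force this pair to be a non-match

print(overrides.lookup('albert', 'Al'))   # the assigned score, regardless of order or case
print(overrides.lookup('Al', 'Alan'))     # None: fall back to the model's prediction
```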

## Contributing
Pull requests are welcome.
For developers who want to build a model for Latin or non-Latin writing systems (Chinese, Cyrillic, Arabic),
Jupyter notebooks for building models with similar methods are shared in the `dev` folder.

## License
[MIT](https://choosealicense.com/licenses/mit/)



            

Raw data

{
    "_id": null,
    "home_page": "https://github.com/Christopher-Thornton/hmni",
    "name": "hmni",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "keywords": "fuzzy-matching,natural-language-processing,nlp,machine-learning,data-science,python,artificial-intelligence,ai",
    "author": "Christopher Thornton",
    "author_email": "christopher_thornton@outlook.com",
    "download_url": "https://files.pythonhosted.org/packages/9b/86/0c5b4406c666ad73feef5d8dab012f04a6e4f1a31eeab60b32f530cb4fb7/hmni-0.1.8.tar.gz",
    "platform": "",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Fuzzy Name Matching with Machine Learning",
    "version": "0.1.8",
    "project_urls": {
        "Download": "https://github.com/Christopher-Thornton/hmni/archive/v0.1.8.zip",
        "Homepage": "https://github.com/Christopher-Thornton/hmni"
    },
    "split_keywords": [
        "fuzzy-matching",
        "natural-language-processing",
        "nlp",
        "machine-learning",
        "data-science",
        "python",
        "artificial-intelligence",
        "ai"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "6010ef662c2d9d01f2fc5b13c0a779259b12b3b916bcde6686841496bcd665a4",
                "md5": "0928b37676cc7a061114a37ba30f45ca",
                "sha256": "f262fa842b0d7a6c7e1b5ae643c32e9ac9f61e33e79c7bab1396b7e5ef8aac36"
            },
            "downloads": -1,
            "filename": "hmni-0.1.8-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "0928b37676cc7a061114a37ba30f45ca",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 22189171,
            "upload_time": "2020-09-14T05:27:30",
            "upload_time_iso_8601": "2020-09-14T05:27:30.991272Z",
            "url": "https://files.pythonhosted.org/packages/60/10/ef662c2d9d01f2fc5b13c0a779259b12b3b916bcde6686841496bcd665a4/hmni-0.1.8-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "9b860c5b4406c666ad73feef5d8dab012f04a6e4f1a31eeab60b32f530cb4fb7",
                "md5": "290e9969eab1fa04649507ed026d4cca",
                "sha256": "7d2d339c6848a509ac5bf99b6f925e1eee4cd858bec1d8233c87f375d9ed0063"
            },
            "downloads": -1,
            "filename": "hmni-0.1.8.tar.gz",
            "has_sig": false,
            "md5_digest": "290e9969eab1fa04649507ed026d4cca",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 22188710,
            "upload_time": "2020-09-14T05:27:36",
            "upload_time_iso_8601": "2020-09-14T05:27:36.323599Z",
            "url": "https://files.pythonhosted.org/packages/9b/86/0c5b4406c666ad73feef5d8dab012f04a6e4f1a31eeab60b32f530cb4fb7/hmni-0.1.8.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2020-09-14 05:27:36",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "Christopher-Thornton",
    "github_project": "hmni",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [],
    "lcname": "hmni"
}
        