Name: ntec
Version: 0.1.0
Summary: Ethnicity classification based on names.
Upload time: 2023-02-08 09:32:33
Requires Python: >=3.7
License: MIT License. Copyright (c) 2023 Matthias Niggli. Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Keywords: name classification, ethnicity classification
# ntec

[![PyPI version](https://badge.fury.io/py/ntec.svg)](https://badge.fury.io/py/ntec)

## 1. What is ntec?

ntec is short for '**N**ame **T**o **E**thnicity **C**lassification' and is a Python-based framework for ethnicity classification based on people's names. It was first introduced and used in my paper [Moving On - Investigating Inventors' Ethnic Origins Using Supervised learning](https://academic.oup.com/joeg/advance-article/doi/10.1093/jeg/lbad001/7010698?utm_source=authortollfreelink&utm_campaign=joeg&utm_medium=email&guestAccessKey=431e97d4-c455-49ab-9019-d622f648c6d5), where you can find methodological details of the main classifier and its training data. In short, ntec builds on a trained artificial neural network that uses a name's letters to predict its ethnic origin.

## 2. Installation

### Step 1: Install tensorflow

ntec runs on top of keras and tensorflow (version 2.5 at the time of development). Hence, these packages have to be installed first. The [developer homepage](https://www.tensorflow.org/install) provides the details for the installation process (using a virtual environment is recommended). Afterwards, verify the installation of tensorflow as suggested by the developers:

```python
import tensorflow as tf
print(tf.reduce_sum(tf.random.normal([1000, 1000])))
```

### Step 2: Install ntec
ntec can be installed via pip:

```bash
pip install ntec
```
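
As an optional sanity check after both installation steps, the following minimal sketch reports which versions are installed. It only uses the standard library module `importlib.metadata` (available from Python 3.8 onwards); the helper name `installed_version` is made up for this example and is not part of ntec.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# report the two distributions this README asks you to install
for package in ("tensorflow", "ntec"):
    print(package, installed_version(package) or "not installed")
```

If tensorflow is reported as missing, repeat Step 1 inside the same (virtual) environment before installing ntec.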

## 3. Quick start

ntec currently builds on the classifier presented in the article [Moving On - Investigating Inventors' Ethnic Origins Using Supervised learning](https://doi.org/10.1093/jeg/lbad001), which is accordingly labelled the 'joeg' classifier. You can load this classifier and check its parameters and recognized ethnic origin classes using the code below.

```python
import ntec

# load the classifier
classifier = ntec.Classifier("joeg") # initialize the 'joeg' classifier

# check the parameters of the classifier
classifier.params["seq_max"]
# 30
classifier.params["n_chars"]
# 27
classifier.params["char_dict"]
# ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', ' ']

# check the ethnic origin classes recognized by the classifier
classifier.classes
# {'0': 'AngloSaxon', '1': 'Arabic', '2': 'Balkans', '3': 'Chinese', '4': 'East-European', '5': 'French', '6': 'German', '7': 'Hispanic-Iberian', '8': 'Indian', '9': 'Italian', '10': 'Japanese', '11': 'Korean', '12': 'Persian', '13': 'Scandinavian', '14': 'Slavic-Russian', '15': 'South-East-Asian', '16': 'Turkish', '17': 'Dutch'}
```

As specified by `classifier.params["char_dict"]`, the 'joeg' classifier only accepts names that consist entirely of lowercase ASCII letters and whitespace. Hence, any name whose ethnic origin is to be predicted by the chosen classifier has to be lowercased and cleaned of all other characters (e.g., digits, punctuation and non-ASCII letters) first. ntec offers the function `clean_name()` for this task, which is demonstrated by the sample code below.

```python
# single name cleaning:
cleaned_name = ntec.clean_name(name = "ruud van niste'\\lrooy") # '\\' is an escaped backslash
print(cleaned_name)
# 'ruud van nistelrooy'

# multiple name cleaning:
# define example names (including some non-ASCII characters, uppercase letters, digits and punctuation):
names = [
    "tony aDams", "Mustapha Hadji", "sinisa mihajlovic",
    "Sun Jihai", "ruud van niste'\\lrooy", "tomasz rosicky ",
    "didier d_eschamps", "oliver kahn", "gabriel batistuta",
    "Sunil Chhetri", "paolo maldini3", "Shunsuke naka\\mura",
    "Ji-Sung Park", "ali daei xasfdhkljhasdfhakghfiugasfd", "henrik larsson",
    "Andrey Arshavin", "Teerasil Dangda", "hakan sükür",
]
cleaned_names = [ntec.clean_name(name) for name in names]
print(cleaned_names)
# ['tony adams', 'mustapha hadji', 'sinisa mihajlovic', 'sun jihai', 'ruud van nistelrooy', 
# 'tomasz rosicky', 'didier deschamps', 'oliver kahn', 'gabriel batistuta', 'sunil chhetri',
# 'paolo maldini', 'shunsuke nakamura', 'jisung park', 'ali daei xasfdhkljhasdfhakghfiugasfd',
# 'henrik larsson', 'andrey arshavin', 'teerasil dangda', 'hakan sukur']
```
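
To make the cleaning step more concrete, here is a rough standard-library sketch that reproduces the example outputs above. It is an approximation written for illustration, not ntec's actual implementation of `clean_name()`; the function name `clean_name_sketch` and its exact rules (Unicode decomposition, lowercasing, dropping everything except `a-z` and spaces) are assumptions.

```python
import re
import unicodedata


def clean_name_sketch(name):
    """Approximate name cleaning (NOT ntec's own code): lowercase ASCII letters and spaces only."""
    # split accented letters into base letter + combining mark, e.g. 'ü' -> 'u' + diaeresis
    decomposed = unicodedata.normalize("NFKD", name)
    # drop every non-ASCII code point, which removes the combining marks
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # lowercase, then remove digits, punctuation and any other character outside a-z and space
    letters = re.sub(r"[^a-z ]", "", ascii_only.lower())
    # collapse repeated whitespace and strip leading/trailing spaces
    return " ".join(letters.split())


print(clean_name_sketch("hakan sükür"))            # hakan sukur
print(clean_name_sketch("Ji-Sung Park"))           # jisung park
print(clean_name_sketch("ruud van niste'\\lrooy"))  # ruud van nistelrooy
```

Note that letters without a Unicode decomposition (for example 'ø' or 'ß') are simply dropped by this sketch; how ntec treats them is not covered here.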

After cleaning, names must be encoded into a form that the classifier can process. This encoding step is performed using the classifier's `classifier.encode_name()` method as shown below. First, the method ensures that the length of an input name does not exceed the classifier's `classifier.params["seq_max"]` attribute (i.e., longer names such as "ali daei xasfdhkljhasdfhakghfiugasfd" in the example are cut to this length). Second, it transforms the cleaned name according to the `classifier.params` attribute into a 2D `numpy.array` of shape `(seq_max, n_chars + 1)`.

```python
import numpy as np

# single name encoding:
encoded_name = classifier.encode_name(name = cleaned_name)
encoded_name.shape
# (30, 28)

# multiple name encoding
encoded_names = np.array([classifier.encode_name(name) for name in cleaned_names])
encoded_names.shape
# (18, 30, 28)
```
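
The two encoding steps can be illustrated with a small pure-Python sketch that produces a matrix of the same shape. This is not ntec's `encode_name()` code: the function name `encode_name_sketch` is hypothetical, and the meaning of the extra `+ 1` column (used here to flag padded positions after the end of the name) is an assumption made for illustration only.

```python
import string

CHAR_DICT = list(string.ascii_lowercase) + [" "]  # the 27 characters listed in params["char_dict"]
SEQ_MAX = 30                                      # value of params["seq_max"] shown above


def encode_name_sketch(name, char_dict=CHAR_DICT, seq_max=SEQ_MAX):
    """One-hot encode a cleaned name into seq_max rows of len(char_dict) + 1 columns."""
    index = {char: i for i, char in enumerate(char_dict)}
    pad_col = len(char_dict)  # ASSUMPTION: the extra column marks padding positions
    rows = []
    for pos in range(seq_max):  # positions beyond seq_max are never read (truncation)
        row = [0] * (len(char_dict) + 1)
        if pos < len(name):
            row[index[name[pos]]] = 1  # raises KeyError if the name was not cleaned first
        else:
            row[pad_col] = 1
        rows.append(row)
    return rows


encoded = encode_name_sketch("ruud van nistelrooy")
print(len(encoded), len(encoded[0]))  # 30 28
```

Each row contains exactly one `1`, so a name becomes a fixed-size grid regardless of its length, which is what allows several names to be stacked into a single array.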

Subsequently, the encoded names can be sent to the classifier, which then predicts the corresponding ethnic origins.

```python
import pandas as pd

# single name origin prediction:
origin_pred = classifier.predict_origins(x = encoded_name, output = "classes")
pd.concat([pd.Series(cleaned_name, name = "cleaned_name"), origin_pred], axis = 1)
#           cleaned_name ethnic_origin
# 0  ruud van nistelrooy         Dutch

# multiple name origin prediction:
origin_pred = classifier.predict_origins(np.array(encoded_names), output = "classes")
pd.concat([pd.Series(cleaned_names, name = "cleaned_name"), origin_pred], axis = 1)
#                             cleaned_name     ethnic_origin
# 0                             tony adams        AngloSaxon
# 1                         mustapha hadji            Arabic
# 2                      sinisa mihajlovic           Balkans
# 3                              sun jihai           Chinese
# 4                    ruud van nistelrooy             Dutch
# 5                         tomasz rosicky     East-European
# 6                       didier deschamps            French
# 7                            oliver kahn            German
# 8                      gabriel batistuta  Hispanic-Iberian
# 9                          sunil chhetri            Indian
# 10                         paolo maldini           Italian
# 11                     shunsuke nakamura          Japanese
# 12                           jisung park            Korean
# 13  ali daei xasfdhkljhasdfhakghfiugasfd           Persian
# 14                        henrik larsson      Scandinavian
# 15                       andrey arshavin    Slavic-Russian
# 16                       teerasil dangda  South-East-Asian
# 17                           hakan sukur           Turkish
```
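
Conceptually, a class label such as those above can be obtained by picking the highest-scoring class and looking it up in `classifier.classes`. The sketch below shows that lookup in plain Python. It assumes the network yields one score per class, ordered by the string keys of `classifier.classes`; that ordering, the helper name `probs_to_origin` and the example scores are assumptions for illustration and not taken from ntec.

```python
# class mapping copied from the `classifier.classes` output shown earlier
CLASSES = {
    "0": "AngloSaxon", "1": "Arabic", "2": "Balkans", "3": "Chinese",
    "4": "East-European", "5": "French", "6": "German", "7": "Hispanic-Iberian",
    "8": "Indian", "9": "Italian", "10": "Japanese", "11": "Korean",
    "12": "Persian", "13": "Scandinavian", "14": "Slavic-Russian",
    "15": "South-East-Asian", "16": "Turkish", "17": "Dutch",
}


def probs_to_origin(probs, classes=CLASSES):
    """Map a list of per-class scores to the label of the highest-scoring class."""
    if len(probs) != len(classes):
        raise ValueError("expected one score per class")
    best = max(range(len(probs)), key=probs.__getitem__)  # index of the largest score
    return classes[str(best)]  # the mapping uses string keys


# made-up scores for a single name: most of the mass on index 17
example_probs = [0.01] * 17 + [0.83]
print(probs_to_origin(example_probs))  # Dutch
```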

## Contact
If you have any feedback or questions, please contact me at matthias.niggli@gmx.ch

## Citation
Please cite appropriately as:

Matthias Niggli (2023), ‘Moving On’—investigating inventors’ ethnic origins using supervised learning, *Journal of Economic Geography*, lbad001, [https://doi.org/10.1093/jeg/lbad001](https://doi.org/10.1093/jeg/lbad001).

BibTeX:
```
@article{niggli2023,
    author = {Niggli, Matthias},
    title = "{‘Moving On’—investigating inventors’ ethnic origins using supervised learning}",
    journal = {Journal of Economic Geography},
    year = {2023},
    month = {01},
    abstract = "{Patent data provides rich information about technical inventions, but does not disclose the ethnic origin of inventors. In this article, I use supervised learning techniques to infer this information. To do so, I construct a dataset of 96′777 labeled names and train an artificial recurrent neural network with long short-term memory (LSTM) to predict ethnic origins based on names. The trained network achieves an overall performance of 91.4\% across 18 ethnic origins. I use this model to predict and investigate the ethnic origins of 2.68 million inventors and provide novel descriptive evidence regarding their ethnic origin composition over time and across countries and technological fields. The global ethnic origin composition has become more diverse over the last decades, which was mostly due to a relative increase of Asian origin inventors. Furthermore, the prevalence of foreign-origin inventors is especially high in the USA, but has also increased in other high-income economies. This increase was mainly driven by an inflow of non-Western inventors into emerging high-technology fields for the USA, but not for other high-income countries.}",
    issn = {1468-2702},
    doi = {10.1093/jeg/lbad001},
    url = {https://doi.org/10.1093/jeg/lbad001},
    note = {lbad001},
    eprint = {https://academic.oup.com/joeg/advance-article-pdf/doi/10.1093/jeg/lbad001/48958974/lbad001.pdf},
}
```

            

Raw data

{
    "_id": null,
    "home_page": "",
    "name": "ntec",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.7",
    "maintainer_email": "",
    "keywords": "name classification,ethnicity classification",
    "author": "",
    "author_email": "Matthias Nigggli <matthiasniggli@gmx.ch>",
    "download_url": "https://files.pythonhosted.org/packages/23/41/06e5029c58b3a30c589cd782fe434acafb1c60efcbc019c09c5a44170dc8/ntec-0.1.0.tar.gz",
    "platform": null,
    "description": "# ntec\r\n\r\n[![PyPI version](https://badge.fury.io/py/ntec.svg)](https://badge.fury.io/py/ntec)\r\n\r\n## 1. What is ntec?\r\n\r\nntec is short for '**N**ame **T**o **E**thnicity **C**lassification' and is a Python-based framework for ethnicity classification based on peoples' names. It was first introduced and used in my paper [Moving On - Investigating Inventors' Ethnic Origins Using Supervised learning](https://academic.oup.com/joeg/advance-article/doi/10.1093/jeg/lbad001/7010698?utm_source=authortollfreelink&utm_campaign=joeg&utm_medium=email&guestAccessKey=431e97d4-c455-49ab-9019-d622f648c6d5), where you can find methodological details of the main classifier and its training data. In short, ntec builds on a trained artificial neural network that uses a name's letters to predict its ethnic origin.\r\n\r\n## 2. Installation\r\n\r\n### Step 1: Install tensorflow\r\n\r\nntec runs on top of keras and tensorflow (version 2.5 at the time of development). Hence, these packages have to be installed first. The [developer homepage](https://www.tensorflow.org/install) provides the details for the installation process (using a virtual environment is recommended). Afterwards, verify the installation of tensorflow as suggested by the devleopers:\r\n\r\n```python\r\nimport tensorflow as tf\r\nprint(tf.reduce_sum(tf.random.normal([1000, 1000])))\r\n```\r\n\r\n### Step 2: Install ntec\r\nntec can be installed via pip:\r\n\r\n```bash\r\npip install ntec\r\n```\r\n\r\n## 3. Quick start\r\n\r\nntec currently builds on the classifier presented in the article [Moving On - Investigating Inventors' Ethnic Origins Using Supervised learning](xxxx), which is labelled accordingly as the 'joeg'-classifier. 
You can load this classifier, check its parameters and recognized ethnic origin classes using the code below.\r\n\r\n```python\r\nimport ntec\r\n\r\n# load the classifier\r\nclassifier = ntec.Classifier(\"joeg\") # initialize the 'joeg' classifier\r\n\r\n# check the parameters of the classifier\r\nclassifier.params[\"seq_max\"]\r\n# 30\r\nclassifier.params[\"n_chars\"]\r\n# 27\r\nclassifier.params[\"char_dict\"]\r\n# ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', ' ']\r\n\r\n# check the ethnic origin classes recognized by the classifier\r\nclassifier.classes\r\n# {'0': 'AngloSaxon', '1': 'Arabic', '2': 'Balkans', '3': 'Chinese', '4': 'East-European', '5': 'French', '6': 'German', '7': 'Hispanic-Iberian', '8': 'Indian', '9': 'Italian', '10': 'Japanese', '11': 'Korean', '12': 'Persian', '13': 'Scandinavian', '14': 'Slavic-Russian', '15': 'South-East-Asian', '16': 'Turkish', '17': 'Dutch'}\r\n```\r\n\r\nAs specified by the `classifier.params[\"char_dict\"]` attribute, the 'joeg' classifier only accepts names that entirely consist of either ASCII lowercased letters or whitespace. Hence, any name whose ethnic origins are to be precited by the chosen classifier has to be cleaned of non-ASCII letters first. 
ntec offers the function `clean_name()` for this task, which is demonstarted by the sample code below.\r\n\r\n```python\r\n# single name cleaning:\r\ncleaned_name = ntec.clean_name(name = \"ruud van niste'\\lrooy\")\r\nprint(cleaned_name)\r\n# 'ruud van nistelrooy'\r\n\r\n# multiple name cleaning:\r\n# define example names (including some non-ascii characters, uppercase letters and digits):\r\nnames = [\r\n    \"tony aDams\", \"Mustapha Hadji\", \"sinisa mihajlovic\", \r\n    \"Sun Jihai\", \"ruud van niste'\\lrooy\", \"tomasz rosicky \", \r\n    \"didier d_eschamps\", \"oliver kahn\", \"gabriel batistuta\", \r\n    \"Sunil Chhetri\", \"paolo maldini3\", \"Shunsuke naka\\\\mura\", \r\n    \"Ji-Sung Park\", \"ali daei xasfdhkljhasdfhakghfiugasfd\", \"henrik larsson\", \r\n    \"Andrey Arshavin\", \"Teerasil Dangda\", \"hakan s\u00fck\u00fcr\",  \r\n     ]\r\ncleaned_names = [ntec.clean_name(name) for name in names]\r\nprint(cleaned_names)\r\n# ['tony adams', 'mustapha hadji', 'sinisa mihajlovic', 'sun jihai', 'ruud van nistelrooy', \r\n# 'tomasz rosicky', 'didier deschamps', 'oliver kahn', 'gabriel batistuta', 'sunil chhetri',\r\n# 'paolo maldini', 'shunsuke nakamura', 'jisung park', 'ali daei xasfdhkljhasdfhakghfiugasfd',\r\n# 'henrik larsson', 'andrey arshavin', 'teerasil dangda', 'hakan sukur']\r\n```\r\n\r\nAfter cleaning, names must be encoded to a form that can be processed by the classifier. This encoding step is performed using the classifier's `classifier.encode_name()`-method as shown below. First, the `classifier.encode_name()` method automatically ensures that the lenghth of an input name does not surpass the classifier's `classifier.params[\"seq_max\"]` attribute (i.e., longer names such as \"ali daei xasfdhkljhasdfhakghfiugasfd\" in the example will be cut to this length). 
Second, it then transforms a the cleaned name according to the `classifier.params` attribute to a 2D-numpy.array of shape `(seq_max`, `n_chars + 1)`.\r\n\r\n```python\r\nimport numpy as np\r\n\r\n# single name encoding:\r\nencoded_name = classifier.encode_name(name = cleaned_name)\r\nencoded_name.shape\r\n# (30, 28)\r\n\r\n# multiple name encoding\r\nencoded_names = np.array([classifier.encode_name(name) for name in cleaned_names])\r\nencoded_names.shape\r\n# (18, 30, 28)\r\n```\r\n\r\nSubsequetly, the encoded names can be sent to the classifier, which then predicts corresponding ethnic origins.\r\n\r\n```python\r\nimport pandas as pd\r\n\r\n# single name origin prediction:\r\norigin_pred = classifier.predict_origins(x = encoded_name, output = \"classes\")\r\npd.concat([pd.Series(cleaned_name, name = \"cleaned_name\"), origin_pred], axis = 1)\r\n#           cleaned_name ethnic_origin\r\n# 0  ruud van nistelrooy         Dutch\r\n\r\n# multiple name origin prediction:\r\norigin_pred = classifier.predict_origins(np.array(encoded_names), output = \"classes\")\r\npd.concat([pd.Series(cleaned_names, name = \"cleaned_name\"), origin_pred], axis = 1)\r\n#                             cleaned_name     ethnic_origin\r\n# 0                             tony adams        AngloSaxon\r\n# 1                         mustapha hadji            Arabic\r\n# 2                      sinisa mihajlovic           Balkans\r\n# 3                              sun jihai           Chinese\r\n# 4                    ruud van nistelrooy             Dutch\r\n# 5                         tomasz rosicky     East-European\r\n# 6                       didier deschamps            French\r\n# 7                            oliver kahn            German\r\n# 8                      gabriel batistuta  Hispanic-Iberian\r\n# 9                          sunil chhetri            Indian\r\n# 10                         paolo maldini           Italian\r\n# 11                     shunsuke nakamura          Japanese\r\n# 
12                           jisung park            Korean\r\n# 13  ali daei xasfdhkljhasdfhakghfiugasfd           Persian\r\n# 14                        henrik larsson      Scandinavian\r\n# 15                       andrey arshavin    Slavic-Russian\r\n# 16                       teerasil dangda  South-East-Asian\r\n# 17                           hakan sukur           Turkish\r\n```\r\n\r\n## Contact\r\nIf you have any feedback or questions, please contact me at matthias.niggli@gmx.ch\r\n\r\n## Citation\r\nPlease cite appropriately as:\r\n\r\nMatthias Niggli (2023), \u2018Moving On\u2019\u2014investigating inventors\u2019 ethnic origins using supervised learning, *Journal of Economic Geography*, lbad001, [https://doi.org/10.1093/jeg/lbad001](https://doi.org/10.1093/jeg/lbad001).\r\n\r\nBibTex:\r\n```\r\n@article{niggli2023,\r\n    author = {Niggli, Matthias},\r\n    title = \"{\u2018Moving On\u2019\u2014investigating inventors\u2019 ethnic origins using supervised learning}\",\r\n    journal = {Journal of Economic Geography},\r\n    year = {2023},\r\n    month = {01},\r\n    abstract = \"{Patent data provides rich information about technical inventions, but does not disclose the ethnic origin of inventors. In this article, I use supervised learning techniques to infer this information. To do so, I construct a dataset of 96\u2032777 labeled names and train an artificial recurrent neural network with long short-term memory (LSTM) to predict ethnic origins based on names. The trained network achieves an overall performance of 91.4\\\\% across 18 ethnic origins. I use this model to predict and investigate the ethnic origins of 2.68 million inventors and provide novel descriptive evidence regarding their ethnic origin composition over time and across countries and technological fields. The global ethnic origin composition has become more diverse over the last decades, which was mostly due to a relative increase of Asian origin inventors. 
Furthermore, the prevalence of foreign-origin inventors is especially high in the USA, but has also increased in other high-income economies. This increase was mainly driven by an inflow of non-Western inventors into emerging high-technology fields for the USA, but not for other high-income countries.}\",\r\n    issn = {1468-2702},\r\n    doi = {10.1093/jeg/lbad001},\r\n    url = {https://doi.org/10.1093/jeg/lbad001},\r\n    note = {lbad001},\r\n    eprint = {https://academic.oup.com/joeg/advance-article-pdf/doi/10.1093/jeg/lbad001/48958974/lbad001.pdf},\r\n}\r\n```\r\n",
    "bugtrack_url": null,
    "license": "MIT License  Copyright (c) 2023 Matthias Niggli  Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:  The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.  THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. ",
    "summary": "Ethnicity classification based on names.",
    "version": "0.1.0",
    "split_keywords": [
        "name classification",
        "ethnicity classification"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "cd7d85fdbeef3f79321d0771f9fbed5fb8f5cfc6d27bb12b6e063206a1fd6296",
                "md5": "e4308b14bae5bd20885f94d084cc2049",
                "sha256": "b17bada889fee50200f51ca2a2e842bbbe0b2f6df1c789debce1374bfb48dda1"
            },
            "downloads": -1,
            "filename": "ntec-0.1.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "e4308b14bae5bd20885f94d084cc2049",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.7",
            "size": 7360650,
            "upload_time": "2023-02-08T09:32:28",
            "upload_time_iso_8601": "2023-02-08T09:32:28.125666Z",
            "url": "https://files.pythonhosted.org/packages/cd/7d/85fdbeef3f79321d0771f9fbed5fb8f5cfc6d27bb12b6e063206a1fd6296/ntec-0.1.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "234106e5029c58b3a30c589cd782fe434acafb1c60efcbc019c09c5a44170dc8",
                "md5": "23a4c1e7bc6c176a2c56b2c5c3fe14b1",
                "sha256": "c4214e72f321d2e8074d0d0bb22c547dccffcc99f137b2f0af52be8dbf357780"
            },
            "downloads": -1,
            "filename": "ntec-0.1.0.tar.gz",
            "has_sig": false,
            "md5_digest": "23a4c1e7bc6c176a2c56b2c5c3fe14b1",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7",
            "size": 7365212,
            "upload_time": "2023-02-08T09:32:33",
            "upload_time_iso_8601": "2023-02-08T09:32:33.687461Z",
            "url": "https://files.pythonhosted.org/packages/23/41/06e5029c58b3a30c589cd782fe434acafb1c60efcbc019c09c5a44170dc8/ntec-0.1.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-02-08 09:32:33",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "lcname": "ntec"
}
        