ethnicolr


- Name: ethnicolr
- Version: 0.18.4
- Summary: Predict Race/Ethnicity Based on Sequence of Characters in Names
- Upload time: 2025-09-01 13:17:38
- Requires Python: >=3.9, <3.13
- Keywords: race, ethnicity, names, demographics, machine-learning, nlp

## ethnicolr: Predict Race and Ethnicity From Name

![PyPI Authenticated](https://notarypy.soodoku.workers.dev/badge/ethnicolr/0.18.4/ethnicolr-0.18.4-py3-none-any.whl)
![Test Badge](https://github.com/appeler/ethnicolr/workflows/test/badge.svg)
[![PyPI version](https://img.shields.io/pypi/v/ethnicolr.svg)](https://pypi.python.org/pypi/ethnicolr)
[![Anaconda version](https://anaconda.org/soodoku/ethnicolr/badges/version.svg)](https://anaconda.org/soodoku/ethnicolr/)
[![PePy Downloads](https://static.pepy.tech/badge/ethnicolr)](https://www.pepy.tech/projects/ethnicolr)

We use US census data, Florida voter registration data, and Wikipedia
data collected by Skiena and colleagues to predict race and ethnicity
from first and last name, or from last name alone. The granularity of
the prediction depends on the dataset. For instance, Skiena et al.'s
Wikipedia data is at the ethnic group level, while the census data we
use in the model distinguishes only between Non-Hispanic Whites,
Non-Hispanic Blacks, Asians, and Hispanics (the raw data has the
additional categories of Native Americans and Bi-racial).

### New Package With New Models in Pytorch

[https://github.com/appeler/ethnicolr2](https://github.com/appeler/ethnicolr2)

### Streamlit App

[https://ethnicolr.streamlit.app/](https://ethnicolr.streamlit.app/)

### Caveats and Notes

If you picked a person at random with the last name 'Smith' in the US
in 2010 and asked us to guess this person's race (as measured by the
census), the best guess would be based on what is available from the
aggregated Census file. It is the Bayes optimal solution. So what are
last-name-only predictive models good for? A few things: imputing race
and ethnicity for last names that are not in the census file, inferring
race and ethnicity in years other than the census years (if some
assumptions hold), inferring the race of people in other countries (if
some assumptions hold), etc. The biggest benefit comes in cases where
both the first name and last name are known.
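
To make the "best guess" concrete: given the aggregated census shares for a surname, the optimal guess is the category with the largest share. The sketch below uses the 2000 census figures that appear in the `census_ln` example output in this README. The dictionary layout and the `best_guess` helper are illustrative assumptions and are not part of ethnicolr.

```python
# Aggregated shares (percent) per surname, copied from the census_ln
# example output in this README (2000 census).
CENSUS_2000 = {
    "smith": {"pctwhite": 73.35, "pctblack": 22.22, "pctapi": 0.40,
              "pctaian": 0.85, "pct2prace": 1.63, "pcthispanic": 1.56},
    "zhang": {"pctwhite": 0.61, "pctblack": 0.09, "pctapi": 98.16,
              "pctaian": 0.02, "pct2prace": 0.96, "pcthispanic": 0.16},
    "jackson": {"pctwhite": 41.93, "pctblack": 53.02, "pctapi": 0.31,
                "pctaian": 1.04, "pct2prace": 2.18, "pcthispanic": 1.53},
}


def best_guess(last_name):
    """Return (category, share) with the largest census share, or None."""
    shares = CENSUS_2000.get(last_name.strip().lower())
    if shares is None:
        return None  # surname absent from the table: a lookup cannot help
    category = max(shares, key=shares.get)
    return category, shares[category]


for name in ["Smith", "zhang", "Jackson", "Okonkwo"]:
    print(name, best_guess(name))
```

The `None` case is where a character-sequence model is useful: it can still produce a prediction for a surname that the table has never seen.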

### Install

We strongly recommend installing ethnicolr inside a Python virtual
environment (see the [venv
documentation](https://docs.python.org/3/library/venv.html#creating-virtual-environments)).
```bash
pip install ethnicolr
```

Notes:

> - The models are run and verified on TensorFlow 2.x using Python 3.10
>   through 3.12
> - If you install on Windows, Theano installation typically needs
>   admin. privileges on the shell.

### Jupyter Quickstart

```bash
pip install ethnicolr jupyter
ethnicolr_download_models
jupyter notebook ethnicolr/examples
```

Open one of the example notebooks and run the cells to see the package in
action.

## General API

To see the available command line options for any function, type
`<function-name> --help`:

```bash
$ census_ln --help
usage: census_ln [-h] [-y {2000,2010}] [-o OUTPUT] -l LAST input

Appends Census columns by last name

positional arguments:
  input                 Input file

optional arguments:
  -h, --help            show this help message and exit
  -y {2000,2010}, --year {2000,2010}
                        Year of Census data (default=2000)
  -o OUTPUT, --output OUTPUT
                        Output file with Census data columns
  -l LAST, --last LAST  Name of the column containing the last name
```

### Cleaning Names

The prediction models work best when first and last names contain only
alphabetic characters. Before calling the CLI or Python APIs, strip out
titles (e.g., *Dr*, *Hon.*), middle names, suffixes, punctuation, and
non-ASCII characters. The `pred_wiki_name` command automatically
normalizes names by removing diacritics and extraneous characters. If
the tool still skips entries, check that the first and last name columns
are not empty after cleaning.
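
One way to do this cleaning with the standard library is sketched below. The `clean_name_part` helper and the title and suffix lists are hypothetical and deliberately short; they are not part of ethnicolr and would need extending for real data.

```python
import re
import unicodedata

# Hypothetical lists; extend them for your own data.
TITLES = {"dr", "hon", "mr", "mrs", "ms", "prof"}
SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}
DROP = TITLES | SUFFIXES


def clean_name_part(raw):
    """Return an ASCII, letters-only version of a first or last name."""
    # Decompose accented characters, then drop the non-ASCII accent marks.
    text = unicodedata.normalize("NFKD", raw)
    text = text.encode("ascii", "ignore").decode("ascii")
    # Join apostrophe-separated parts (O'Brien -> OBrien).
    text = text.replace("'", "")
    # Turn remaining punctuation and digits into spaces.
    text = re.sub(r"[^A-Za-z\s]", " ", text)
    # Drop titles and suffixes, collapse repeated whitespace.
    tokens = [t for t in text.split() if t.lower() not in DROP]
    return " ".join(tokens)


for raw in ["Dr. José", "Müller, Jr.", "  O'Brien ", "123"]:
    print(repr(raw), "->", repr(clean_name_part(raw)))
```

An empty string in the result signals a row that should be reviewed or dropped before prediction. Middle names are not handled here; keeping only the first token of a cleaned first name is one simple option.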

## Examples

To append census data from 2010 to a [file with column header in the
first row](ethnicolr/data/input-with-header.csv),
specify the column name carrying last names using the `-l` option, keeping the rest the same:

```bash
census_ln -y 2010 -o output-census2010.csv -l last_name input-with-header.csv
```
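
Conceptually, appending census columns is a lookup join on the normalized last name. The sketch below mimics that step with the standard library on an in-memory CSV. The tiny lookup table reuses two rows of 2010 figures shown in this README; the three-column subset and the `append_census` helper are illustrative assumptions, not the package's implementation.

```python
import csv
import io

# Toy stand-in for the 2010 census surname table (two rows from this README).
CENSUS_2010 = {
    "SMITH": {"pctwhite": "70.9", "pctblack": "23.11", "pctapi": "0.5"},
    "ZHANG": {"pctwhite": "0.99", "pctblack": "0.16", "pctapi": "98.06"},
}
CENSUS_COLS = ["pctwhite", "pctblack", "pctapi"]


def append_census(in_file, out_file, last_col):
    """Copy rows from in_file to out_file, adding census columns by last name."""
    reader = csv.DictReader(in_file)
    writer = csv.DictWriter(out_file, fieldnames=reader.fieldnames + CENSUS_COLS)
    writer.writeheader()
    for row in reader:
        key = row[last_col].strip().upper()   # normalize before the lookup
        match = CENSUS_2010.get(key, {})      # unmatched names stay blank
        row.update({col: match.get(col, "") for col in CENSUS_COLS})
        writer.writerow(row)


source = io.StringIO("first_name,last_name\njohn, smith \nsimon,Zhang\nada,Qwertz\n")
result = io.StringIO()
append_census(source, result, "last_name")
print(result.getvalue())
```

Rows whose last name is missing from the table are kept with empty census columns, which makes unmatched names easy to count afterwards.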

To predict race/ethnicity using the [Wikipedia full name
model](ethnicolr/models/ethnicolr_keras_lstm_wiki_name.ipynb), specify the column names of the last name and first name using the
`-l` and `-f` flags respectively.

```bash
pred_wiki_name -o output-wiki-pred-race.csv -l last_name -f first_name input-with-header.csv
```

## Functions

We expose several functions, each of which takes either a pandas
DataFrame or a CSV.

- **census_ln(df, lname_col, year=2000)**

  - What it does:
    - Removes extra space.
    - For names in the [census file](ethnicolr/data/census), appends the
      probability that the name belongs to each race/ethnicity category.

  - Parameters:
    - **df** : *{DataFrame, csv}* pandas DataFrame or CSV file containing
      the names of the individuals whose race is to be inferred
    - **lname_col** : *{string}* name of the column containing the last
      name
    - **year** : *{2000, 2010}, default=2000* year of census to use

  - Output: Appends the following columns to the pandas DataFrame or CSV:
    pctwhite, pctblack, pctapi, pctaian, pct2prace, pcthispanic. See
    [here](https://github.com/appeler/ethnicolr/blob/master/ethnicolr/data/census/census_2000.pdf) for what the column names mean.

  ```python
  >>> import pandas as pd

  >>> from ethnicolr import census_ln, pred_census_ln

  >>> names = [{'name': 'smith'},
  ...         {'name': 'zhang'},
  ...         {'name': 'jackson'}]

  >>> df = pd.DataFrame(names)

  >>> df
        name
  0    smith
  1    zhang
  2  jackson

  >>> census_ln(df, 'name')
        name pctwhite pctblack pctapi pctaian pct2prace pcthispanic
  0    smith    73.35    22.22   0.40    0.85      1.63        1.56
  1    zhang     0.61     0.09  98.16    0.02      0.96        0.16
  2  jackson    41.93    53.02   0.31    1.04      2.18        1.53
  ```

- **pred_census_ln(df, lname_col, year=2000, num_iter=100,
  conf_int=1.0)**

  - What it does:
    - Removes extra space.
    - Uses the [last name census 2000
      model](ethnicolr/models/ethnicolr_keras_lstm_census2000_ln.ipynb) or [last name census 2010
      model](ethnicolr/models/ethnicolr_keras_lstm_census2010_ln.ipynb) to predict race and ethnicity.

  - Parameters:
    - **df** : *{DataFrame, csv}* pandas DataFrame or CSV file containing
      the names of the individuals whose race is to be inferred
    - **lname_col** : *{string}* name of the column containing the last
      name
    - **year** : *{2000, 2010}, default=2000* year of census to use
    - **num_iter** : *int, default=100* number of iterations used to
      calculate uncertainty in the model
    - **conf_int** : *float, default=1.0* confidence interval for the
      predicted class

  - Output: Appends the following columns to the pandas DataFrame or
    CSV: race (white, black, api, or hispanic), api (predicted
    probability of Asian), black, hispanic, white. For each category, it
    provides the mean, standard error, and lower and upper bounds of the
    confidence interval.

  *(Using the same dataframe from example above)*

  ```python
  >>> census_ln(df, 'name')
        name pctwhite pctblack pctapi pctaian pct2prace pcthispanic
  0    smith    73.35    22.22   0.40    0.85      1.63        1.56
  1    zhang     0.61     0.09  98.16    0.02      0.96        0.16
  2  jackson    41.93    53.02   0.31    1.04      2.18        1.53

  >>> census_ln(df, 'name', 2010)
        name   race pctwhite pctblack pctapi pctaian pct2prace pcthispanic
  0    smith  white     70.9    23.11    0.5    0.89      2.19         2.4
  1    zhang    api     0.99     0.16  98.06    0.02      0.62        0.15
  2  jackson  black    39.89    53.04   0.39    1.06      3.12         2.5

  >>> pred_census_ln(df, 'name')
        name   race       api     black  hispanic     white
  0    smith  white  0.002019  0.247235  0.014485  0.736260
  1    zhang    api  0.997807  0.000149  0.000470  0.001574
  2  jackson  black  0.002797  0.528193  0.014605  0.454405
  ```
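
The `num_iter` and `conf_int` parameters control how the per-class summary columns are produced. One way to read them, offered as an assumption rather than a description of the package internals: the model is evaluated `num_iter` times with some source of randomness, and each class probability is summarized by its mean, its spread, and the bounds of the central `conf_int` share of the draws. The sketch uses simulated draws and a hypothetical `summarize` helper.

```python
import random
import statistics


def summarize(draws, conf_int=1.0):
    """Summarize repeated probability draws for one class.

    Returns the mean, the standard deviation, and the bounds of the
    central `conf_int` fraction of the draws (conf_int=1.0 gives the
    minimum and maximum).
    """
    ordered = sorted(draws)
    tail = (1.0 - conf_int) / 2.0
    last = len(ordered) - 1
    lower = ordered[int(round(tail * last))]
    upper = ordered[int(round((1.0 - tail) * last))]
    return {
        "mean": statistics.fmean(ordered),
        "std": statistics.stdev(ordered),
        "lb": lower,
        "ub": upper,
    }


random.seed(0)
# Simulated stand-in for num_iter=100 stochastic evaluations of one class.
draws = [min(1.0, max(0.0, random.gauss(0.74, 0.05))) for _ in range(100)]
print(summarize(draws, conf_int=0.9))
```

Under this reading, the default `conf_int=1.0` reports the full range of the draws, and smaller values such as `0.9` trim the tails.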

- **pred_wiki_ln(df, lname_col, num_iter=100, conf_int=1.0)**

  - What it does:
    - Removes extra space.
    - Uses the [last name wiki
      model](ethnicolr/models/ethnicolr_keras_lstm_wiki_ln.ipynb) to predict the race and ethnicity.

  - Parameters:
    - **df** : *{DataFrame, csv}* pandas DataFrame or CSV file containing
      the names of the individuals whose race is to be inferred
    - **lname_col** : *{string}* name of the column containing the last
      name
    - **num_iter** : *int, default=100* number of iterations used to
      calculate uncertainty in the model
    - **conf_int** : *float, default=1.0* confidence interval for the
      predicted class

  - Output: Appends the following columns to the pandas DataFrame or
    CSV: race (categorical variable: the category with the highest
    probability). For each of the categories listed below, it provides
    the mean, standard error, and lower and upper bounds of the
    confidence interval.

  ```text
  "Asian,GreaterEastAsian,EastAsian",
  "Asian,GreaterEastAsian,Japanese", "Asian,IndianSubContinent",
  "GreaterAfrican,Africans", "GreaterAfrican,Muslim",
  "GreaterEuropean,British","GreaterEuropean,EastEuropean",
  "GreaterEuropean,Jewish","GreaterEuropean,WestEuropean,French",
  "GreaterEuropean,WestEuropean,Germanic","GreaterEuropean,WestEuropean,Hispanic",
  "GreaterEuropean,WestEuropean,Italian","GreaterEuropean,WestEuropean,Nordic".
  ```

  ```python
  >>> import pandas as pd

  >>> names = [
  ...             {"last": "smith", "first": "john", "true_race": "GreaterEuropean,British"},
  ...             {
  ...                 "last": "zhang",
  ...                 "first": "simon",
  ...                 "true_race": "Asian,GreaterEastAsian,EastAsian",
  ...             },
  ...         ]
  >>> df = pd.DataFrame(names)

  >>> from ethnicolr import pred_wiki_ln, pred_wiki_name

  >>> odf = pred_wiki_ln(df,'last', conf_int=0.9)
  ['Asian,GreaterEastAsian,EastAsian', 'Asian,GreaterEastAsian,Japanese', 'Asian,IndianSubContinent', 'GreaterAfrican,Africans', 'GreaterAfrican,Muslim', 'GreaterEuropean,British', 'GreaterEuropean,EastEuropean', 'GreaterEuropean,Jewish', 'GreaterEuropean,WestEuropean,French', 'GreaterEuropean,WestEuropean,Germanic', 'GreaterEuropean,WestEuropean,Hispanic', 'GreaterEuropean,WestEuropean,Italian', 'GreaterEuropean,WestEuropean,Nordic']

  >>> odf
     last  first                         true_race  ...  GreaterEuropean,WestEuropean,Nordic_lb  GreaterEuropean,WestEuropean,Nordic_ub                              race
  0  Smith   john           GreaterEuropean,British                               0.016103  ...                                 0.014135                                0.007382                                0.048828           GreaterEuropean,British
  1  Zhang  simon  Asian,GreaterEastAsian,EastAsian                               0.863391  ...                                 0.017452                                0.001844                                0.027252  Asian,GreaterEastAsian,EastAsian

  [2 rows x 56 columns]

  >>> odf.iloc[0, :8]
  last                                                       Smith
  first                                                       john
  true_race                                GreaterEuropean,British
  Asian,GreaterEastAsian,EastAsian_mean                   0.016103
  Asian,GreaterEastAsian,EastAsian_std                    0.009735
  Asian,GreaterEastAsian,EastAsian_lb                     0.005873
  Asian,GreaterEastAsian,EastAsian_ub                     0.034637
  Asian,GreaterEastAsian,Japanese_mean                    0.003814
  Name: 0, dtype: object
  ```
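
The Wikipedia labels are hierarchical: each comma-separated label runs from the broadest group to the most specific one. If a coarser grouping is enough, the per-category probabilities can be summed up to the top level. The sketch below uses made-up probabilities and a hypothetical `collapse_to_top_level` helper; only the label strings come from the category list above.

```python
from collections import defaultdict


def collapse_to_top_level(probs):
    """Sum fine-grained probabilities by the first label component."""
    totals = defaultdict(float)
    for label, p in probs.items():
        top = label.split(",")[0]  # e.g. "GreaterEuropean"
        totals[top] += p
    return dict(totals)


# Made-up probabilities for illustration (they sum to 1).
example = {
    "Asian,GreaterEastAsian,EastAsian": 0.02,
    "Asian,IndianSubContinent": 0.03,
    "GreaterAfrican,Muslim": 0.05,
    "GreaterEuropean,British": 0.60,
    "GreaterEuropean,WestEuropean,French": 0.20,
    "GreaterEuropean,WestEuropean,Germanic": 0.10,
}
coarse = collapse_to_top_level(example)
print(max(coarse, key=coarse.get), coarse)
```

For a row returned by `pred_wiki_ln`, the input dictionary would be built from the `_mean` columns with the suffix stripped.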

- **pred_wiki_name(df, lname_col, fname_col, num_iter=100, conf_int=1.0)**

  - What it does:
    - Removes extra space.
    - Uses the [full name wiki
      model](ethnicolr/models/ethnicolr_keras_lstm_wiki_name.ipynb) to predict the race and ethnicity.

  - Parameters:
    - **df** : *{DataFrame, csv}* pandas DataFrame or CSV file containing
      the names of the individuals whose race is to be inferred
    - **lname_col** : *{string}* name of the column containing the last
      name
    - **fname_col** : *{string}* name of the column containing the first
      name
    - **num_iter** : *int, default=100* number of iterations used to
      calculate uncertainty of predictions
    - **conf_int** : *float, default=1.0* confidence interval

  - Output: Appends the following columns to the pandas DataFrame or
    CSV: race (categorical variable: the category with the highest
    probability), "Asian,GreaterEastAsian,EastAsian",
    "Asian,GreaterEastAsian,Japanese", "Asian,IndianSubContinent",
    "GreaterAfrican,Africans", "GreaterAfrican,Muslim",
    "GreaterEuropean,British", "GreaterEuropean,EastEuropean",
    "GreaterEuropean,Jewish", "GreaterEuropean,WestEuropean,French",
    "GreaterEuropean,WestEuropean,Germanic", "GreaterEuropean,WestEuropean,Hispanic",
    "GreaterEuropean,WestEuropean,Italian", "GreaterEuropean,WestEuropean,Nordic".
    For each category, it provides the mean, standard error, and lower
    and upper bounds of the confidence interval.

  *(Using the same dataframe from example above)*

  ```python
  >>> odf = pred_wiki_name(df,'last', 'first', conf_int=0.9)
  ['Asian,GreaterEastAsian,EastAsian', 'Asian,GreaterEastAsian,Japanese', 'Asian,IndianSubContinent', 'GreaterAfrican,Africans', 'GreaterAfrican,Muslim', 'GreaterEuropean,British', 'GreaterEuropean,EastEuropean', 'GreaterEuropean,Jewish', 'GreaterEuropean,WestEuropean,French', 'GreaterEuropean,WestEuropean,Germanic', 'GreaterEuropean,WestEuropean,Hispanic', 'GreaterEuropean,WestEuropean,Italian', 'GreaterEuropean,WestEuropean,Nordic']

  >>> odf
     last  first                         true_race       __name  Asian,GreaterEastAsian,EastAsian_mean  ...  GreaterEuropean,WestEuropean,Nordic_mean  GreaterEuropean,WestEuropean,Nordic_std  GreaterEuropean,WestEuropean,Nordic_lb  GreaterEuropean,WestEuropean,Nordic_ub                              race
  0  Smith   john           GreaterEuropean,British   Smith John                               0.004111  ...                                  0.006246                                 0.004760                                0.001048                                0.016288           GreaterEuropean,British
  1  Zhang  simon  Asian,GreaterEastAsian,EastAsian  Zhang Simon                               0.944203  ...                                  0.000793                                 0.002557                                0.000019                                0.002470  Asian,GreaterEastAsian,EastAsian

  [2 rows x 57 columns]

  >>> odf.iloc[0,:8]
  last                                                       Smith
  first                                                       john
  true_race                                GreaterEuropean,British
  __name                                                Smith John
  Asian,GreaterEastAsian,EastAsian_mean                   0.004111
  Asian,GreaterEastAsian,EastAsian_std                    0.002929
  Asian,GreaterEastAsian,EastAsian_lb                     0.001356
  Asian,GreaterEastAsian,EastAsian_ub                     0.010571
  Name: 0, dtype: object
  ```

- **pred_fl_reg_ln(df, lname_col, num_iter=100, conf_int=1.0)**

  - What it does:
    - Removes extra space.
    - Uses the [last name FL registration
      model](ethnicolr/models/ethnicolr_keras_lstm_fl_voter_ln.ipynb) to predict the race and ethnicity.

  - Parameters:
    - **df** : *{DataFrame, csv}* pandas DataFrame or CSV file containing
      the names of the individuals whose race is to be inferred
    - **lname_col** : *{string}* name of the column containing the last
      name
    - **num_iter** : *int, default=100* number of iterations used to
      calculate the uncertainty
    - **conf_int** : *float, default=1.0* confidence interval

  - Output: Appends the following columns to the pandas DataFrame or
    CSV: race (the category with the highest probability: asian,
    hispanic, nh_black, or nh_white). For each category, it provides
    the mean, standard error, and lower and upper bounds of the
    confidence interval (e.g., asian_mean, asian_std, asian_lb,
    asian_ub).

  ```python
  >>> import pandas as pd

  >>> names = [
  ...             {"last": "sawyer", "first": "john", "true_race": "nh_white"},
  ...             {"last": "torres", "first": "raul", "true_race": "hispanic"},
  ...         ]

  >>> df = pd.DataFrame(names)

  >>> from ethnicolr import pred_fl_reg_ln, pred_fl_reg_name, pred_fl_reg_ln_five_cat, pred_fl_reg_name_five_cat

  >>> odf = pred_fl_reg_ln(df, 'last', conf_int=0.9)
  ['asian', 'hispanic', 'nh_black', 'nh_white']

  >>> odf
     last first true_race  asian_mean  asian_std  asian_lb  asian_ub  hispanic_mean  hispanic_std  hispanic_lb  hispanic_ub  nh_black_mean  nh_black_std  nh_black_lb  nh_black_ub  nh_white_mean  nh_white_std  nh_white_lb  nh_white_ub      race
  0  Sawyer  john  nh_white    0.009859   0.006819  0.005338  0.019673       0.021488      0.004602     0.014802     0.030148       0.180929      0.052784     0.105756     0.270238       0.787724      0.051082     0.705290     0.860286  nh_white
  1  Torres  raul  hispanic    0.006463   0.001985  0.003915  0.010146       0.878119      0.021998     0.839274     0.909151       0.013118      0.005002     0.007364     0.021633       0.102300      0.017828     0.075911     0.130929  hispanic

  [2 rows x 20 columns]

  >>> odf.iloc[0]
  last               Sawyer
  first                john
  true_race        nh_white
  asian_mean       0.009859
  asian_std        0.006819
  asian_lb         0.005338
  asian_ub         0.019673
  hispanic_mean    0.021488
  hispanic_std     0.004602
  hispanic_lb      0.014802
  hispanic_ub      0.030148
  nh_black_mean    0.180929
  nh_black_std     0.052784
  nh_black_lb      0.105756
  nh_black_ub      0.270238
  nh_white_mean    0.787724
  nh_white_std     0.051082
  nh_white_lb       0.70529
  nh_white_ub      0.860286
  race             nh_white
  Name: 0, dtype: object
  ```

- **pred_fl_reg_name(df, lname_col, fname_col, num_iter=100, conf_int=1.0)**

  - What it does:
    - Removes extra space.
    - Uses the [full name FL
      model](ethnicolr/models/ethnicolr_keras_lstm_fl_voter_name.ipynb) to predict the race and ethnicity.

  - Parameters:
    - **df** : *{DataFrame, csv}* pandas DataFrame or CSV file containing
      the names of the individuals whose race is to be inferred
    - **lname_col** : *{string}* name of the column containing the last
      name
    - **fname_col** : *{string}* name of the column containing the first
      name
    - **num_iter** : *int, default=100* number of iterations used to
      calculate the uncertainty
    - **conf_int** : *float, default=1.0* confidence interval for the
      predicted class

  - Output: Appends the following columns to the pandas DataFrame or
    CSV: race (the category with the highest probability: asian,
    hispanic, nh_black, or nh_white). For each category, it provides
    the mean, standard error, and lower and upper bounds of the
    confidence interval (e.g., asian_mean, asian_std, asian_lb,
    asian_ub).

  *(Using the same dataframe from example above)*

  ```python
  >>> odf = pred_fl_reg_name(df, 'last', 'first', conf_int=0.9)
  ['asian', 'hispanic', 'nh_black', 'nh_white']

  >>> odf
     last first true_race  asian_mean  asian_std  asian_lb  asian_ub  hispanic_mean  hispanic_std  hispanic_lb  hispanic_ub  nh_black_mean  nh_black_std  nh_black_lb  nh_black_ub  nh_white_mean  nh_white_std  nh_white_lb  nh_white_ub      race
  0  Sawyer  john  nh_white    0.001534   0.000850  0.000636  0.002691       0.006818      0.002557     0.003684     0.011660       0.028068      0.015095     0.011488     0.055149       0.963581      0.015738     0.935445     0.983224  nh_white
  1  Torres  raul  hispanic    0.005791   0.002906  0.002446  0.011748       0.890561      0.029581     0.841328     0.937706       0.011397      0.004682     0.005829     0.020796       0.092251      0.026675     0.049868     0.139210  hispanic

  >>> odf.iloc[1]
  last               Torres
  first                raul
  true_race        hispanic
  asian_mean       0.005791
  asian_std        0.002906
  asian_lb         0.002446
  asian_ub         0.011748
  hispanic_mean    0.890561
  hispanic_std     0.029581
  hispanic_lb      0.841328
  hispanic_ub      0.937706
  nh_black_mean    0.011397
  nh_black_std     0.004682
  nh_black_lb      0.005829
  nh_black_ub      0.020796
  nh_white_mean    0.092251
  nh_white_std     0.026675
  nh_white_lb      0.049868
  nh_white_ub       0.13921
  race             hispanic
  Name: 1, dtype: object
  ```

- **pred_fl_reg_ln_five_cat(df, lname_col, num_iter=100, conf_int=1.0)**

  - What it does:
    - Removes extra space.
    - Uses the [last name FL registration
      model](ethnicolr/models/ethnicolr_keras_lstm_fl_voter_ln_five_cat.ipynb) to predict the race and ethnicity.

  - Parameters:
    - **df** : *{DataFrame, csv}* pandas DataFrame or CSV file containing
      the names of the individuals whose race is to be inferred
    - **lname_col** : *{string, list, int}* name or location of the
      column containing the last name
    - **num_iter** : *int, default=100* number of iterations used to
      calculate uncertainty
    - **conf_int** : *float, default=1.0* confidence interval

  - Output: Appends the following columns to the pandas DataFrame or
    CSV: race (the category with the highest probability: asian,
    hispanic, nh_black, nh_white, or other). For each category, it
    provides the mean, standard error, and lower and upper bounds of the
    confidence interval (e.g., other_mean, other_std, other_lb,
    other_ub).

  *(Using the same dataframe from example above)*

  ```python
  >>> odf = pred_fl_reg_ln_five_cat(df,'last')
  ['asian', 'hispanic', 'nh_black', 'nh_white', 'other']

  >>> odf
     last first true_race  asian_mean  asian_std  asian_lb  asian_ub  hispanic_mean  hispanic_std  ...  nh_white_mean  nh_white_std  nh_white_lb  nh_white_ub  other_mean  other_std  other_lb  other_ub      race
  0  Sawyer  john  nh_white    0.100038   0.020539  0.073266  0.143334       0.044263      0.013077  ...       0.376639      0.048289     0.296989     0.452834    0.248466   0.021040  0.219721  0.283785  nh_white
  1  Torres  raul  hispanic    0.062390   0.021863  0.033837  0.103737       0.774414      0.043238  ...       0.030393      0.009591     0.019713     0.046483    0.117761   0.019524  0.089418  0.150615  hispanic

  [2 rows x 24 columns]

  >>> odf.iloc[0]
  last               Sawyer
  first                john
  true_race        nh_white
  asian_mean       0.100038
  asian_std        0.020539
  asian_lb         0.073266
  asian_ub         0.143334
  hispanic_mean    0.044263
  hispanic_std     0.013077
  hispanic_lb       0.02476
  hispanic_ub      0.067965
  nh_black_mean    0.230593
  nh_black_std     0.063948
  nh_black_lb      0.130577
  nh_black_ub      0.343513
  nh_white_mean    0.376639
  nh_white_std     0.048289
  nh_white_lb      0.296989
  nh_white_ub      0.452834
  other_mean       0.248466
  other_std         0.02104
  other_lb         0.219721
  other_ub         0.283785
  race             nh_white
  Name: 0, dtype: object
  ```
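
The lower and upper bounds make it possible to flag predictions whose top class is not clearly separated from the rest. The rule below (the top class's lower bound must exceed every other class's upper bound) is one hypothetical screening choice, not something the package applies. The numbers are the Sawyer rows printed in the four-category and five-category last-name examples in this README, which were run with different `conf_int` values.

```python
def is_clear_winner(bounds):
    """bounds maps class -> (lb, ub).

    True if the class with the highest lower bound sits entirely above
    the interval of every other class.
    """
    top = max(bounds, key=lambda c: bounds[c][0])
    top_lb = bounds[top][0]
    return all(ub < top_lb for c, (lb, ub) in bounds.items() if c != top)


# Sawyer, four-category model (conf_int=0.9 in the example).
four_cat = {
    "asian": (0.005338, 0.019673),
    "hispanic": (0.014802, 0.030148),
    "nh_black": (0.105756, 0.270238),
    "nh_white": (0.705290, 0.860286),
}
# Sawyer, five-category model (default conf_int in the example).
five_cat = {
    "asian": (0.073266, 0.143334),
    "hispanic": (0.02476, 0.067965),
    "nh_black": (0.130577, 0.343513),
    "nh_white": (0.296989, 0.452834),
    "other": (0.219721, 0.283785),
}

print("four-category clear winner:", is_clear_winner(four_cat))
print("five-category clear winner:", is_clear_winner(five_cat))
```

Both models pick `nh_white` for this name, but only the four-category intervals are disjoint; in the five-category output the `nh_black` interval overlaps the `nh_white` one.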

- **pred_fl_reg_name_five_cat(df, lname_col, fname_col, num_iter=100, conf_int=1.0)**

  - What it does:
    - Removes extra space.
    - Uses the [full name FL
      model](ethnicolr/models/ethnicolr_keras_lstm_fl_voter_ln_five_cat.ipynb) to predict the race and ethnicity.

  - Parameters:
    - **df** : *{DataFrame, csv}* pandas DataFrame or CSV file containing
      the names of the individuals whose race is to be inferred
    - **lname_col** : *{string}* name of the column containing the last
      name
    - **fname_col** : *{string}* name of the column containing the first
      name
    - **num_iter** : *int, default=100* number of iterations used to
      calculate uncertainty
    - **conf_int** : *float, default=1.0* confidence interval

  - Output: Appends the following columns to the pandas DataFrame or
    CSV: race (the category with the highest probability: asian,
    hispanic, nh_black, nh_white, or other). For each category, it
    provides the mean, standard error, and lower and upper bounds of the
    confidence interval (e.g., other_mean, other_std, other_lb,
    other_ub).

  *(Using the same dataframe from example above)*

  ```python
  >>> odf = pred_fl_reg_name_five_cat(df, 'last','first')
  ['asian', 'hispanic', 'nh_black', 'nh_white', 'other']

  >>> odf
     last first true_race  asian_mean  asian_std  asian_lb  asian_ub  hispanic_mean  hispanic_std  ...  nh_white_mean  nh_white_std  nh_white_lb  nh_white_ub  other_mean  other_std  other_lb  other_ub      race
  0  Sawyer  john  nh_white    0.039310   0.011657  0.025982  0.059719       0.019737      0.005813  ...       0.650306      0.059327     0.553913     0.733201    0.192242   0.021004  0.160185  0.226063  nh_white
  1  Torres  raul  hispanic    0.020086   0.011765  0.008240  0.041741       0.899110      0.042237  ...       0.019073      0.009901     0.010166     0.040081    0.055774   0.017897  0.036245  0.088741  hispanic

  [2 rows x 24 columns]

  >>> odf.iloc[1]
  last               Torres
  first                raul
  true_race        hispanic
  asian_mean       0.020086
  asian_std        0.011765
  asian_lb          0.00824
  asian_ub         0.041741
  hispanic_mean     0.89911
  hispanic_std     0.042237
  hispanic_lb      0.823799
  hispanic_ub      0.937612
  nh_black_mean    0.005956
  nh_black_std     0.006528
  nh_black_lb      0.002686
  nh_black_ub      0.010134
  nh_white_mean    0.019073
  nh_white_std     0.009901
  nh_white_lb      0.010166
  nh_white_ub      0.040081
  other_mean       0.055774
  other_std        0.017897
  other_lb         0.036245
  other_ub         0.088741
  race             hispanic
  Name: 1, dtype: object
  ```

- **pred_nc_reg_name(df, lname_col, fname_col, num_iter=100, conf_int=1.0)**

  - What it does:
    - Removes extra space.
    - Uses the [full name NC
      model](ethnicolr/models/ethnicolr_keras_lstm_nc_12_cat_model.ipynb) to predict the race and ethnicity.

  - Parameters:
    - **df** : *{DataFrame, csv}* pandas DataFrame or CSV file containing
      the names of the individuals whose race is to be inferred
    - **lname_col** : *{string}* name of the column containing the last
      name
    - **fname_col** : *{string}* name of the column containing the first
      name
    - **num_iter** : *int, default=100* number of iterations used to
      calculate uncertainty
    - **conf_int** : *float, default=1.0* confidence interval

  - Output: Appends the following columns to the pandas DataFrame or
    CSV: race + ethnicity. The codebook is
    [here](https://github.com/appeler/nc_race_ethnicity). For each race, it will provide the mean, standard error,
    lower & upper bound of confidence interval

  ```python
  >>> import pandas as pd

  >>> names = [
  ...             {"last": "hernandez", "first": "hector", "true_race": "HL+O"},
  ...             {"last": "zhang", "first": "simon", "true_race": "NL+A"},
  ...         ]

  >>> df = pd.DataFrame(names)

  >>> from ethnicolr import pred_nc_reg_name

  >>> odf = pred_nc_reg_name(df, 'last','first', conf_int=0.9)
  ['HL+A', 'HL+B', 'HL+I', 'HL+M', 'HL+O', 'HL+W', 'NL+A', 'NL+B', 'NL+I', 'NL+M', 'NL+O', 'NL+W']

  >>> odf
        last   first true_race            __name     HL+A_mean  HL+A_std       HL+A_lb       HL+A_ub     HL+B_mean  HL+B_std       HL+B_lb       HL+B_ub  HL+I_mean  ...     NL+M_mean  NL+M_std       NL+M_lb       NL+M_ub  NL+O_mean  NL+O_std   NL+O_lb   NL+O_ub  NL+W_mean  NL+W_std   NL+W_lb   NL+W_ub  race
  0  hernandez  hector      HL+O  Hernandez Hector  2.727371e-13       0.0  2.727372e-13  2.727372e-13  6.542178e-04       0.0  6.542183e-04  6.542183e-04   0.000032  ...  7.863581e-06       0.0  7.863589e-06  7.863589e-06   0.184513       0.0  0.184514  0.184514   0.001256       0.0  0.001256  0.001256  HL+O
  1      zhang   simon      NL+A       Zhang Simon  1.985421e-06       0.0  1.985423e-06  1.985423e-06  8.708256e-09       0.0  8.708265e-09  8.708265e-09   0.000049  ...  1.446786e-07       0.0  1.446784e-07  1.446784e-07   0.003238       0.0  0.003238  0.003238   0.000154       0.0  0.000154  0.000154  NL+A

  [2 rows x 53 columns]

  >>> odf.iloc[0]
  last                hernandez
  first                  hector
  true_race                HL+O
  __name       Hernandez Hector
  HL+A_mean                 0.0
  HL+A_std                  0.0
  HL+A_lb                   0.0
  HL+A_ub                   0.0
  HL+B_mean            0.000654
  HL+B_std                  0.0
  HL+B_lb              0.000654
  HL+B_ub              0.000654
  HL+I_mean            0.000032
  HL+I_std                  0.0
  HL+I_lb              0.000032
  HL+I_ub              0.000032
  HL+M_mean            0.000541
  HL+M_std                  0.0
  HL+M_lb              0.000541
  HL+M_ub              0.000541
  HL+O_mean             0.58944
  HL+O_std                  0.0
  HL+O_lb               0.58944
  HL+O_ub               0.58944
  HL+W_mean            0.221309
  HL+W_std                  0.0
  HL+W_lb              0.221309
  HL+W_ub              0.221309
  NL+A_mean            0.000044
  NL+A_std                  0.0
  NL+A_lb              0.000044
  NL+A_ub              0.000044
  NL+B_mean            0.002199
  NL+B_std                  0.0
  NL+B_lb              0.002199
  NL+B_ub              0.002199
  NL+I_mean            0.000004
  NL+I_std                  0.0
  NL+I_lb              0.000004
  NL+I_ub              0.000004
  NL+M_mean            0.000008
  NL+M_std                  0.0
  NL+M_lb              0.000008
  NL+M_ub              0.000008
  NL+O_mean            0.184513
  NL+O_std                  0.0
  NL+O_lb              0.184514
  NL+O_ub              0.184514
  NL+W_mean            0.001256
  NL+W_std                  0.0
  NL+W_lb              0.001256
  NL+W_ub              0.001256
  race                     HL+O
  Name: 0, dtype: object
  ```
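
The wide output is easier to inspect once the `*_mean` columns are pulled out. The following is a minimal sketch, assuming the output keeps the `<category>_mean` naming shown above. The small DataFrame is hand-built from a few of the printed values for the first row rather than produced by the model, and `top_category` is a hypothetical helper, not part of ethnicolr.

```python
import pandas as pd

# Hand-built stand-in for a few columns of `odf`
# (values copied from the printed first row above).
odf = pd.DataFrame(
    {
        "last": ["hernandez"],
        "first": ["hector"],
        "true_race": ["HL+O"],
        "HL+O_mean": [0.58944],
        "HL+W_mean": [0.221309],
        "NL+O_mean": [0.184513],
        "race": ["HL+O"],
    }
)


def top_category(df: pd.DataFrame) -> pd.Series:
    """Return, for each row, the category whose `_mean` column is largest."""
    means = df.filter(regex=r"_mean$")
    # idxmax gives the column label; strip the suffix to recover the category.
    return means.idxmax(axis=1).str.replace("_mean", "", regex=False)


odf["top"] = top_category(odf)

# Share of rows where the predicted label matches the known label.
match_rate = (odf["race"] == odf["true_race"]).mean()
```

Comparing `top` with `race` is a quick sanity check that the `race` column is the category with the highest mean probability.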

### Application

To illustrate how the package can be used, we impute the race of the
campaign contributors recorded by the FEC for the years 2000 and 2010 and
tally campaign contributions by race.

- [Contrib 2000/2010 using
  census_ln](ethnicolr/examples/ethnicolr_app_contrib20xx-census_ln.ipynb)
- [Contrib 2000/2010 using
  pred_census_ln](ethnicolr/examples/ethnicolr_app_contrib20xx.ipynb)
- [Contrib 2000/2010 using
  pred_fl_reg_name](ethnicolr/examples/ethnicolr_app_contrib20xx-fl_reg.ipynb)
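
The tallying step can be sketched as follows. This is a toy illustration, not the notebooks' code: the `amount` column and every value in the table are invented, and the probability columns only mimic the shape of `pred_census_ln` output.

```python
import pandas as pd

# Toy contributions table; `amount` and all values are invented for illustration.
contrib = pd.DataFrame(
    {
        "name": ["smith", "zhang", "jackson"],
        "amount": [100.0, 250.0, 50.0],
        "race": ["white", "api", "black"],
        "api": [0.0, 1.0, 0.0],
        "black": [0.25, 0.0, 0.75],
        "hispanic": [0.0, 0.0, 0.0],
        "white": [0.75, 0.0, 0.25],
    }
)

# Hard assignment: credit each contribution to the most likely category.
hard_tally = contrib.groupby("race")["amount"].sum()

# Soft assignment: split each contribution in proportion to the probabilities.
prob_cols = ["api", "black", "hispanic", "white"]
soft_tally = contrib[prob_cols].mul(contrib["amount"], axis=0).sum()
```

The soft tally keeps the overall total unchanged while carrying the uncertainty in each prediction into the per-category sums. Which of the two is appropriate depends on the analysis.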
  
Data on the race of everyone in the [DIME
data](https://data.stanford.edu/dime) is posted
[here](http://dx.doi.org/10.7910/DVN/M5K7VR). The
underlying Python scripts are posted
[here](https://github.com/appeler/dime_race).

### Data

We use the data on race by last name from the [2000
census](http://www.census.gov/topics/population/genealogy/data/2000_surnames.html) and [2010
census](http://www.census.gov/topics/population/genealogy/data/2010_surnames.html), the [Wikipedia data](ethnicolr/data/wiki/) collected by Skiena and colleagues, and the Florida voter
registration data from early 2017.

- [Census](ethnicolr/data/census/)
- [The Wikipedia dataset](ethnicolr/data/wiki/)
- [Florida voter registration
  database](http://dx.doi.org/10.7910/DVN/UBIG3F)

### Evaluation

1.  SCAN Health Plan, a Medicare Advantage plan that serves over 200,000
    members throughout California, used the software to better assess
    racial health disparities among the people they serve. They had
    racial data on only about 47% of their members, so they used the
    package to infer the race of the remaining 53%. On the data they had
    labels for, they found 0.9 AUC and 83% accuracy for the last name model.
    
2.  Evaluation on NC Data:
    [https://github.com/appeler/nc_race_ethnicity](https://github.com/appeler/nc_race_ethnicity)
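
A check of the kind described in the first item, accuracy and AUC on records that already carry labels, can be sketched with pandas alone. The labels and scores below are invented, `one_vs_rest_auc` is a hypothetical helper, and nothing here reproduces SCAN's actual procedure.

```python
import pandas as pd


def one_vs_rest_auc(y_true: pd.Series, score: pd.Series, positive: str) -> float:
    """AUC for `positive` vs. all other labels via the rank-sum (Mann-Whitney) formula."""
    is_pos = y_true == positive
    n_pos, n_neg = int(is_pos.sum()), int((~is_pos).sum())
    ranks = score.rank(method="average")  # tied scores share the average rank
    return (ranks[is_pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Invented labelled subset: true label, predicted label, and P(white).
labelled = pd.DataFrame(
    {
        "true_race": ["white", "white", "black", "api"],
        "race": ["white", "black", "black", "api"],
        "white": [0.9, 0.2, 0.3, 0.1],
    }
)

# Fraction of rows where the predicted category equals the known one.
accuracy = (labelled["race"] == labelled["true_race"]).mean()

# How well P(white) separates white from non-white records.
auc_white = one_vs_rest_auc(labelled["true_race"], labelled["white"], "white")
```

Accuracy depends only on the top predicted category, while AUC uses the full probability column, so the two can diverge when the probabilities are poorly ranked.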
    
### Authors

Suriyan Laohaprapanon and Gaurav Sood

### Contributor Code of Conduct

The project welcomes contributions from everyone! In fact, it depends on
them. To maintain this welcoming atmosphere and to collaborate in a fun
and productive way, we expect contributors to the project to abide by
the [Contributor Code of
Conduct](http://contributor-covenant.org/version/1/0/0/).


## License

The package is released under the [MIT
License](https://opensource.org/licenses/MIT).


## 🔗 Adjacent Repositories

- [appeler/ethnicolr2](https://github.com/appeler/ethnicolr2) — Ethnicolr implementation with new models in pytorch
- [appeler/ethnicolor](https://github.com/appeler/ethnicolor) — Race and Ethnicity based on name using data from census, voter reg. files, etc.
- [appeler/instate](https://github.com/appeler/instate) — instate: predict the state of residence from last name using the indian electoral rolls
- [appeler/search_names](https://github.com/appeler/search_names) — Search a long list of names (patterns) in a large text corpus systematically and quickly
- [appeler/nc_race_ethnicity](https://github.com/appeler/nc_race_ethnicity) — Evaluation of some of the ethnicolr models on the NC Voter Registration Data + New Models Based on NC Voter Registration Data.

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "ethnicolr",
    "maintainer": null,
    "docs_url": null,
    "requires_python": "<3.13,>=3.9",
    "maintainer_email": null,
    "keywords": "race, ethnicity, names, demographics, machine-learning, nlp",
    "author": null,
    "author_email": "Gaurav Sood <contact@gsood.com>, Suriyan Laohaprapanon <suriyant@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/58/9a/6a648bd3a221b8efefe51118aa76ab813fc5d7a29c0eb1cadc653df0c0c8/ethnicolr-0.18.4.tar.gz",
    "platform": null,
    "description": "## ethnicolr: Predict Race and Ethnicity From Name\n\n![PyPI Authenicated](https://notarypy.soodoku.workers.dev/badge/ethnicolr/0.18.4/ethnicolr-0.18.4-py3-none-any.whl)\n![Test Badge](https://github.com/appeler/ethnicolr/workflows/test/badge.svg)\n[![PyPI version](https://img.shields.io/pypi/v/ethnicolr.svg)](https://pypi.python.org/pypi/ethnicolr)\n[![Anaconda version](https://anaconda.org/soodoku/ethnicolr/badges/version.svg)](https://anaconda.org/soodoku/ethnicolr/)\n[![PePy Downloads](https://static.pepy.tech/badge/ethnicolr)](https://www.pepy.tech/projects/ethnicolr)\n\nWe exploit the US census data, the Florida voting registration data, and\nthe Wikipedia data collected by Skiena and colleagues to predict race\nand ethnicity based on first and last name or just the last name. The\ngranularity at which we predict the race depends on the dataset. For\ninstance, Skiena et al.\\' Wikipedia data is at the ethnic group level,\nwhile the census data we use in the model (the raw data has additional\ncategories of Native Americans and Bi-racial) merely categorizes between\nNon-Hispanic Whites, Non-Hispanic Blacks, Asians, and Hispanics.\n\n### New Package With New Models in Pytorch\n\n[https://github.com/appeler/ethnicolr2](https://github.com/appeler/ethnicolr2)\n\n### Streamlit App\n\n[https://ethnicolr.streamlit.app/](https://ethnicolr.streamlit.app/)\n\n### Caveats and Notes\n\nIf you picked a person at random with the last name \\'Smith\\' in the US\nin 2010 and asked us to guess this person\\'s race (as measured by the\ncensus), the best guess would be based on what is available from the\naggregated Census file. It is the Bayes Optimal Solution. So what good\nare last-name-only predictive models for? 
A few things\\-\\--if you want\nto impute race and ethnicity for last names that are not in the census\nfile, infer the race and ethnicity in different years than when the\ncensus was conducted (if some assumptions hold), infer the race of\npeople in different countries (if some assumptions hold), etc. The\nbiggest benefit comes in cases where both the first name and last name\nare known.\n\n### Install\n\nWe strongly recommend installing ethnicolr inside a Python virtual\nenvironment (see [venv\ndocumentation](https://docs.python.org/3/library/venv.html#creating-virtual-environments))\n\n```bash\npip install ethnicolr\n```\n\nNotes:\n\n> - The models are run and verified on TensorFlow 2.x using Python 3.10\n>   through 3.12\n> - If you install on Windows, Theano installation typically needs\n>   admin. privileges on the shell.\n\n### Jupyter Quickstart\n\n```bash\npip install ethnicolr jupyter\nethnicolr_download_models\njupyter notebook ethnicolr/examples\n```\n\nOpen one of the example notebooks and run the cells to see the package in\naction.\n\n## General API\n\nTo see the available command line options for any function, please type\nin [`<function-name>`]` `[`--help`]\n\n```python\n# census_ln --help\nusage: census_ln [-h] [-y {2000,2010}] [-o OUTPUT] -l LAST input\n\nAppends Census columns by last name\n\npositional arguments:\n  input                 Input file\n\noptional arguments:\n  -h, --help            show this help message and exit\n  -y {2000,2010}, --year {2000,2010}\n                        Year of Census data (default=2000)\n  -o OUTPUT, --output OUTPUT\n                        Output file with Census data columns\n  -l LAST, --last LAST  Name of the column containing the last name\n```\n\n### Cleaning Names\n\nThe prediction models work best when first and last names contain only\nalphabetic characters. Before calling the CLI or Python APIs, strip out\ntitles (e.g., *Dr*, *Hon.*), middle names, suffixes, punctuation and\nnon\\-ASCII characters. 
The `pred_wiki_name` command automatically\nnormalizes names by removing diacritics and extraneous characters. If\nthe tool still skips entries, check that the first and last name columns\nare not empty after cleaning.\n\n## Examples\n\nTo append census data from 2010 to a [file with column header in the\nfirst row](ethnicolr/data/input-with-header.csv),\nspecify the column name carrying last names using the [`-l`] option, keeping the rest the same:\n\n```bash\ncensus_ln -y 2010 -o output-census2010.csv -l last_name input-with-header.csv\n```\n\nTo predict race/ethnicity using [Wikipedia full name\nmodel](ethnicolr/models/ethnicolr_keras_lstm_wiki_name.ipynb), specify the column name of last name and first name by using\n[`-l`] and [`-f`]\nflags respectively.\n\n```bash\npred_wiki_name -o output-wiki-pred-race.csv -l last_name -f first_name input-with-header.csv\n```\n\n## Functions\n\nWe expose several functions, each of which either takes a pandas DataFrame\nor a CSV.\n\n- **census_ln(df, lname_col, year=2000)**\n  - What it does:\n    - Removes extra space\n    - For names in the [census file](ethnicolr/data/census), it appends relevant data of what probability the name\n      provided is of a certain race/ethnicity\n\n> -----------------------------------------------------------------------------\n>   Parameters   \u00a0\n>   ------------ ----------------------------------------------------------------\n>   \u00a0            **df** : *{DataFrame, csv}* Pandas dataframe of CSV file\n>                contains the names of the individual to be inferred\n>\n>   \u00a0            **lname_col** : *{string}* name of the column containing the\n>                last name\n>\n>   \u00a0            **Year** : *{2000, 2010}, default=2000* year of census to use\n>   -----------------------------------------------------------------------------\n\n- Output: Appends the following columns to the pandas DataFrame or CSV:\n  pctwhite, pctblack, pctapi, pctaian, pct2prace, 
pcthispanic. See\n  [here](https://github.com/appeler/ethnicolr/blob/master/ethnicolr/data/census/census_2000.pdf) for what the column names mean.\n\n  ``` literal-block\n  >>> import pandas as pd\n\n  >>> from ethnicolr import census_ln, pred_census_ln\n\n  >>> names = [{'name': 'smith'},\n  ...         {'name': 'zhang'},\n  ...         {'name': 'jackson'}]\n\n  >>> df = pd.DataFrame(names)\n\n  >>> df\n        name\n  0    smith\n  1    zhang\n  2  jackson\n\n  >>> census_ln(df, 'name')\n        name pctwhite pctblack pctapi pctaian pct2prace pcthispanic\n  0    smith    73.35    22.22   0.40    0.85      1.63        1.56\n  1    zhang     0.61     0.09  98.16    0.02      0.96        0.16\n  2  jackson    41.93    53.02   0.31    1.04      2.18        1.53\n  ```\n\n- **pred_census_ln(df, lname_col, year=2000, num_iter=100,\n  conf_int=1.0)**\n\n  - What it does:\n    - Removes extra space.\n    - Uses the [last name census 2000\n      model](ethnicolr/models/ethnicolr_keras_lstm_census2000_ln.ipynb) or [last name census 2010\n      model](ethnicolr/models/ethnicolr_keras_lstm_census2010_ln.ipynb) to predict race and ethnicity.\n\n    ----------------------------------------------------------------------------\n    Parameters   \u00a0\n    ------------ ---------------------------------------------------------------\n    \u00a0            **df** : *{DataFrame, csv}* Pandas dataframe of CSV file\n                 contains the names of the individual to be inferred\n\n    \u00a0            **namecol** : *{string}* name of the column containing the last\n                 name\n\n    \u00a0            **year** : *{2000, 2010}, default=2000* year of census to use\n\n    \u00a0            **num_iter** : *int, default=100* number of iterations to\n                 calculate uncertainty in model\n\n    \u00a0            **conf_int** : *float, default=1.0* confidence interval in\n                 predicted class\n    
----------------------------------------------------------------------------\n\n  - Output: Appends the following columns to the pandas DataFrame or\n    CSV: race (white, black, asian, or hispanic), api (percentage chance\n    asian), black, hispanic, white. For each race it will provide the\n    mean, standard error, lower & upper bound of confidence interval\n\n  *(Using the same dataframe from example above)*\n\n  ```python\n  >>> census_ln(df, 'name')\n        name pctwhite pctblack pctapi pctaian pct2prace pcthispanic\n  0    smith    73.35    22.22   0.40    0.85      1.63        1.56\n  1    zhang     0.61     0.09  98.16    0.02      0.96        0.16\n  2  jackson    41.93    53.02   0.31    1.04      2.18        1.53\n\n  >>> census_ln(df, 'name', 2010)\n        name   race pctwhite pctblack pctapi pctaian pct2prace pcthispanic\n  0    smith  white     70.9    23.11    0.5    0.89      2.19         2.4\n  1    zhang    api     0.99     0.16  98.06    0.02      0.62        0.15\n  2  jackson  black    39.89    53.04   0.39    1.06      3.12         2.5\n\n  >>> pred_census_ln(df, 'name')\n        name   race       api     black  hispanic     white\n  0    smith  white  0.002019  0.247235  0.014485  0.736260\n  1    zhang    api  0.997807  0.000149  0.000470  0.001574\n  2  jackson  black  0.002797  0.528193  0.014605  0.454405\n  ```\n\n- **pred_wiki_ln( df, lname_col, num_iter=100, conf_int=1.0)**\n\n  - What it does:\n    - Removes extra space.\n    - Uses the [last name wiki\n      model](ethnicolr/models/ethnicolr_keras_lstm_wiki_ln.ipynb) to predict the race and ethnicity.\n\n    ----------------------------------------------------------------------------\n    Parameters   \u00a0\n    ------------ ---------------------------------------------------------------\n    \u00a0            **df** : *{DataFrame, csv}* Pandas dataframe of CSV file\n                 contains the names of the individual to be inferred\n\n    \u00a0            **lname_col** : 
*{string}* name of the column containing the\n                 last name\n\n    \u00a0            **num_iter** : *int, default=100* number of iterations to\n                 calculate uncertainty in model\n\n    \u00a0            **conf_int** : *float, default=1.0* confidence interval in\n                 predicted class\n    ----------------------------------------------------------------------------\n\n  - Output: Appends the following columns to the pandas DataFrame or\n    CSV: race (categorical variable \\-\\-- category with the highest\n    probability). For each race it will provide the mean, standard\n    error, lower & upper bound of confidence interval\n\n  ``` literal-block\n  \"Asian,GreaterEastAsian,EastAsian\",\n  \"Asian,GreaterEastAsian,Japanese\", \"Asian,IndianSubContinent\",\n  \"GreaterAfrican,Africans\", \"GreaterAfrican,Muslim\",\n  \"GreaterEuropean,British\",\"GreaterEuropean,EastEuropean\",\n  \"GreaterEuropean,Jewish\",\"GreaterEuropean,WestEuropean,French\",\n  \"GreaterEuropean,WestEuropean,Germanic\",\"GreaterEuropean,WestEuropean,Hispanic\",\n  \"GreaterEuropean,WestEuropean,Italian\",\"GreaterEuropean,WestEuropean,Nordic\".\n  ```\n\n  ```python\n  >>> import pandas as pd\n\n  >>> names = [\n  ...             {\"last\": \"smith\", \"first\": \"john\", \"true_race\": \"GreaterEuropean,British\"},\n  ...             {\n  ...                 \"last\": \"zhang\",\n  ...                 \"first\": \"simon\",\n  ...                 \"true_race\": \"Asian,GreaterEastAsian,EastAsian\",\n  ...             },\n  ...         
]\n  >>> df = pd.DataFrame(names)\n\n  >>> from ethnicolr import pred_wiki_ln, pred_wiki_name\n\n  >>> odf = pred_wiki_ln(df,'last', conf_int=0.9)\n  ['Asian,GreaterEastAsian,EastAsian', 'Asian,GreaterEastAsian,Japanese', 'Asian,IndianSubContinent', 'GreaterAfrican,Africans', 'GreaterAfrican,Muslim', 'GreaterEuropean,British', 'GreaterEuropean,EastEuropean', 'GreaterEuropean,Jewish', 'GreaterEuropean,WestEuropean,French', 'GreaterEuropean,WestEuropean,Germanic', 'GreaterEuropean,WestEuropean,Hispanic', 'GreaterEuropean,WestEuropean,Italian', 'GreaterEuropean,WestEuropean,Nordic']\n\n  >>> odf\n     last  first                         true_race  ...  GreaterEuropean,WestEuropean,Nordic_lb  GreaterEuropean,WestEuropean,Nordic_ub                              race\n  0  Smith   john           GreaterEuropean,British                               0.016103  ...                                 0.014135                                0.007382                                0.048828           GreaterEuropean,British\n  1  Zhang  simon  Asian,GreaterEastAsian,EastAsian                               0.863391  ...                                 
0.017452                                0.001844                                0.027252  Asian,GreaterEastAsian,EastAsian\n\n  [2 rows x 56 columns]\n\n  >>> odf.iloc[0, :8]\n  last                                                       Smith\n  first                                                       john\n  true_race                                GreaterEuropean,British\n  Asian,GreaterEastAsian,EastAsian_mean                   0.016103\n  Asian,GreaterEastAsian,EastAsian_std                    0.009735\n  Asian,GreaterEastAsian,EastAsian_lb                     0.005873\n  Asian,GreaterEastAsian,EastAsian_ub                     0.034637\n  Asian,GreaterEastAsian,Japanese_mean                    0.003814\n  Name: 0, dtype: object\n  ```\n\n- **pred_wiki_name(df, namecol, num_iter=100, conf_int=1.0)**\n\n  - What it does:\n    - Removes extra space.\n    - Uses the [full name wiki\n      model](ethnicolr/models/ethnicolr_keras_lstm_wiki_name.ipynb) to predict the race and ethnicity.\n\n    ----------------------------------------------------------------------------\n    Parameters   \u00a0\n    ------------ ---------------------------------------------------------------\n    \u00a0            **df** : *{DataFrame, csv}* Pandas dataframe of CSV file\n                 contains the names of the individual to be inferred\n\n    \u00a0            **namecol** : *{string}* name of the column containing the\n                 name.\n\n    \u00a0            **num_iter** : *int, default=100* number of iterations to\n                 calculate uncertainty of predictions\n\n    \u00a0            **conf_int** : *float, default=1.0* confidence interval\n    ----------------------------------------------------------------------------\n\n  - Output: Appends the following columns to the pandas DataFrame or\n    CSV: race (categorical variable\\-\\--category with the highest\n    probability), \\\"Asian,GreaterEastAsian,EastAsian\\\",\n    \\\"Asian,GreaterEastAsian,Japanese\\\", 
\\\"Asian,IndianSubContinent\\\",\n    \\\"GreaterAfrican,Africans\\\", \\\"GreaterAfrican,Muslim\\\",\n    \\\"GreaterEuropean,British\\\",\\\"GreaterEuropean,EastEuropean\\\",\n    \\\"GreaterEuropean,Jewish\\\",\\\"GreaterEuropean,WestEuropean,French\\\",\n    \\\"GreaterEuropean,WestEuropean,Germanic\\\",\\\"GreaterEuropean,WestEuropean,Hispanic\\\",\n    \\\"GreaterEuropean,WestEuropean,Italian\\\",\\\"GreaterEuropean,WestEuropean,Nordic\\\".\n    For each race it will provide the mean, standard error, lower &\n    upper bound of confidence interval\n\n  *(Using the same dataframe from example above)*\n\n  ``` literal-block\n  >>> odf = pred_wiki_name(df,'last', 'first', conf_int=0.9)\n  ['Asian,GreaterEastAsian,EastAsian', 'Asian,GreaterEastAsian,Japanese', 'Asian,IndianSubContinent', 'GreaterAfrican,Africans', 'GreaterAfrican,Muslim', 'GreaterEuropean,British', 'GreaterEuropean,EastEuropean', 'GreaterEuropean,Jewish', 'GreaterEuropean,WestEuropean,French', 'GreaterEuropean,WestEuropean,Germanic', 'GreaterEuropean,WestEuropean,Hispanic', 'GreaterEuropean,WestEuropean,Italian', 'GreaterEuropean,WestEuropean,Nordic']\n\n  >>> odf\n     last  first                         true_race       __name  Asian,GreaterEastAsian,EastAsian_mean  ...  GreaterEuropean,WestEuropean,Nordic_mean  GreaterEuropean,WestEuropean,Nordic_std  GreaterEuropean,WestEuropean,Nordic_lb  GreaterEuropean,WestEuropean,Nordic_ub                              race\n  0  Smith   john           GreaterEuropean,British   Smith John                               0.004111  ...                                  0.006246                                 0.004760                                0.001048                                0.016288           GreaterEuropean,British\n  1  Zhang  simon  Asian,GreaterEastAsian,EastAsian  Zhang Simon                               0.944203  ...                                  
0.000793                                 0.002557                                0.000019                                0.002470  Asian,GreaterEastAsian,EastAsian\n\n  [2 rows x 57 columns]\n\n  >>> odf.iloc[0,:8]\n  last                                                       Smith\n  first                                                       john\n  true_race                                GreaterEuropean,British\n  __name                                                Smith John\n  Asian,GreaterEastAsian,EastAsian_mean                   0.004111\n  Asian,GreaterEastAsian,EastAsian_std                    0.002929\n  Asian,GreaterEastAsian,EastAsian_lb                     0.001356\n  Asian,GreaterEastAsian,EastAsian_ub                     0.010571\n  Name: 0, dtype: object\n  ```\n\n- **pred_fl_reg_ln(df, lname_col, num_iter=100, conf_int=1.0)**\n\n  - What does it do?:\n    - Removes extra space, if there.\n    - Uses the [last name FL registration\n      model](ethnicolr/models/ethnicolr_keras_lstm_fl_voter_ln.ipynb) to predict the race and ethnicity.\n\n    ----------------------------------------------------------------------------\n    Parameters   \u00a0\n    ------------ ---------------------------------------------------------------\n    \u00a0            **df** : *{DataFrame, csv}* Pandas dataframe of CSV file\n                 contains the names of the individual to be inferred\n\n    \u00a0            **lname_col** : *{string}* name of the column containing the\n                 last name\n\n    \u00a0            **num_iter** : *int, default=100* number of iterations to\n                 calculate the uncertainty\n\n    \u00a0            **conf_int** : *float, default=1.0* confidence interval\n    ----------------------------------------------------------------------------\n\n  - Output: Appends the following columns to the pandas DataFrame or\n    CSV: race (white, black, asian, or Hispanic), asian (percentage\n    chance Asian), Hispanic, nh_black, 
nh_white. For each race, it will\n    provide the mean, standard error, lower & upper bound of confidence\n    interval\n\n  ```python\n  >>> import pandas as pd\n\n  >>> names = [\n  ...             {\"last\": \"sawyer\", \"first\": \"john\", \"true_race\": \"nh_white\"},\n  ...             {\"last\": \"torres\", \"first\": \"raul\", \"true_race\": \"hispanic\"},\n  ...         ]\n\n  >>> df = pd.DataFrame(names)\n\n  >>> from ethnicolr import pred_fl_reg_ln, pred_fl_reg_name, pred_fl_reg_ln_five_cat, pred_fl_reg_name_five_cat\n\n  >>> odf = pred_fl_reg_ln(df, 'last', conf_int=0.9)\n  ['asian', 'hispanic', 'nh_black', 'nh_white']\n\n  >>> odf\n     last first true_race  asian_mean  asian_std  asian_lb  asian_ub  hispanic_mean  hispanic_std  hispanic_lb  hispanic_ub  nh_black_mean  nh_black_std  nh_black_lb  nh_black_ub  nh_white_mean  nh_white_std  nh_white_lb  nh_white_ub      race\n  0  Sawyer  john  nh_white    0.009859   0.006819  0.005338  0.019673       0.021488      0.004602     0.014802     0.030148       0.180929      0.052784     0.105756     0.270238       0.787724      0.051082     0.705290     0.860286  nh_white\n  1  Torres  raul  hispanic    0.006463   0.001985  0.003915  0.010146       0.878119      0.021998     0.839274     0.909151       0.013118      0.005002     0.007364     0.021633       0.102300      0.017828     0.075911     0.130929  hispanic\n\n  [2 rows x 20 columns]\n\n  >>> odf.iloc[0]\n  last               Sawyer\n  first                john\n  true_race        nh_white\n  asian_mean       0.009859\n  asian_std        0.006819\n  asian_lb         0.005338\n  asian_ub         0.019673\n  hispanic_mean    0.021488\n  hispanic_std     0.004602\n  hispanic_lb      0.014802\n  hispanic_ub      0.030148\n  nh_black_mean    0.180929\n  nh_black_std     0.052784\n  nh_black_lb      0.105756\n  nh_black_ub      0.270238\n  nh_white_mean    0.787724\n  nh_white_std     0.051082\n  nh_white_lb       0.70529\n  nh_white_ub      0.860286\n  race   
          nh_white\n  Name: 0, dtype: object\n  ```\n\n- **pred_fl_reg_name(df, lname_col, num_iter=100, conf_int=1.0)**\n\n  - What it does:\n    - Removes extra space.\n    - Uses the [full name FL\n      model](ethnicolr/models/ethnicolr_keras_lstm_fl_voter_name.ipynb) to predict the race and ethnicity.\n\n    ----------------------------------------------------------------------------\n    Parameters   \u00a0\n    ------------ ---------------------------------------------------------------\n    \u00a0            **df** : *{DataFrame, csv}* Pandas dataframe of CSV file\n                 contains the names of the individual to be inferred\n\n    \u00a0            **namecol** : *{list}* name of the column containing the name.\n\n    \u00a0            **num_iter** : *int, default=100* number of iterations to\n                 calculate the uncertainty\n\n    \u00a0            **conf_int** : *float, default=1.0* confidence interval in\n                 predicted class\n    ----------------------------------------------------------------------------\n\n  - Output: Appends the following columns to the pandas DataFrame or\n    CSV: race (white, black, asian, or Hispanic), asian (percentage\n    chance Asian), Hispanic, nh_black, nh_white. 
For each race, it will\n    provide the mean, standard error, lower & upper bound of confidence\n    interval\n\n  *(Using the same dataframe from example above)*\n\n  ``` literal-block\n  >>> odf = pred_fl_reg_name(df, 'last', 'first', conf_int=0.9)\n  ['asian', 'hispanic', 'nh_black', 'nh_white']\n\n  >>> odf\n     last first true_race  asian_mean  asian_std  asian_lb  asian_ub  hispanic_mean  hispanic_std  hispanic_lb  hispanic_ub  nh_black_mean  nh_black_std  nh_black_lb  nh_black_ub  nh_white_mean  nh_white_std  nh_white_lb  nh_white_ub      race\n  0  Sawyer  john  nh_white    0.001534   0.000850  0.000636  0.002691       0.006818      0.002557     0.003684     0.011660       0.028068      0.015095     0.011488     0.055149       0.963581      0.015738     0.935445     0.983224  nh_white\n  1  Torres  raul  hispanic    0.005791   0.002906  0.002446  0.011748       0.890561      0.029581     0.841328     0.937706       0.011397      0.004682     0.005829     0.020796       0.092251      0.026675     0.049868     0.139210  hispanic\n\n  >>> odf.iloc[1]\n  last               Torres\n  first                raul\n  true_race        hispanic\n  asian_mean       0.005791\n  asian_std        0.002906\n  asian_lb         0.002446\n  asian_ub         0.011748\n  hispanic_mean    0.890561\n  hispanic_std     0.029581\n  hispanic_lb      0.841328\n  hispanic_ub      0.937706\n  nh_black_mean    0.011397\n  nh_black_std     0.004682\n  nh_black_lb      0.005829\n  nh_black_ub      0.020796\n  nh_white_mean    0.092251\n  nh_white_std     0.026675\n  nh_white_lb      0.049868\n  nh_white_ub       0.13921\n  race             hispanic\n  Name: 1, dtype: object\n  ```\n\n- **pred_fl_reg_ln_five_cat(df, namecol, num_iter=100, conf_int=1.0)**\n\n  - What does it do?:\n    - Removes extra space, if there.\n    - Uses the [last name FL registration\n      model](ethnicolr/models/ethnicolr_keras_lstm_fl_voter_ln_five_cat.ipynb) to predict the race and ethnicity.\n\n    
----------------------------------------------------------------------------\n    Parameters   \u00a0\n    ------------ ---------------------------------------------------------------\n    \u00a0            **df** : *{DataFrame, csv}* Pandas dataframe of CSV file\n                 contains the names of the individual to be inferred\n\n    \u00a0            **lname_col** : *{string, list, int}* name of location of the\n                 column containing the last name\n\n    \u00a0            **num_iter** : *int, default=100* number of iterations to\n                 calculate uncertainty\n\n    \u00a0            **conf_int** : *float, default=1.0* confidence interval\n    ----------------------------------------------------------------------------\n\n  - Output: Appends the following columns to the pandas DataFrame or\n    CSV: race (white, black, asian, Hispanic or other), asian\n    (percentage chance Asian), hispanic, nh_black, nh_white, other. For\n    each race, it will provide the mean, standard error, lower & upper\n    bound of confidence interval\n\n  *(Using the same dataframe from example above)*\n\n  ```python\n  >>> odf = pred_fl_reg_ln_five_cat(df,'last')\n  ['asian', 'hispanic', 'nh_black', 'nh_white', 'other']\n\n  >>> odf\n     last first true_race  asian_mean  asian_std  asian_lb  asian_ub  hispanic_mean  hispanic_std  ...  nh_white_mean  nh_white_std  nh_white_lb  nh_white_ub  other_mean  other_std  other_lb  other_ub      race\n  0  Sawyer  john  nh_white    0.100038   0.020539  0.073266  0.143334       0.044263      0.013077  ...       0.376639      0.048289     0.296989     0.452834    0.248466   0.021040  0.219721  0.283785  nh_white\n  1  Torres  raul  hispanic    0.062390   0.021863  0.033837  0.103737       0.774414      0.043238  ...       
0.030393      0.009591     0.019713     0.046483    0.117761   0.019524  0.089418  0.150615  hispanic\n\n  [2 rows x 24 columns]\n\n  >>> odf.iloc[0]\n  last               Sawyer\n  first                john\n  true_race        nh_white\n  asian_mean       0.100038\n  asian_std        0.020539\n  asian_lb         0.073266\n  asian_ub         0.143334\n  hispanic_mean    0.044263\n  hispanic_std     0.013077\n  hispanic_lb       0.02476\n  hispanic_ub      0.067965\n  nh_black_mean    0.230593\n  nh_black_std     0.063948\n  nh_black_lb      0.130577\n  nh_black_ub      0.343513\n  nh_white_mean    0.376639\n  nh_white_std     0.048289\n  nh_white_lb      0.296989\n  nh_white_ub      0.452834\n  other_mean       0.248466\n  other_std         0.02104\n  other_lb         0.219721\n  other_ub         0.283785\n  race             nh_white\n  Name: 0, dtype: object\n  ```\n\n- **pred_fl_reg_name_five_cat(df, namecol, num_iter=100, conf_int=1.0)**\n\n  - What it does:\n    - Removes extra space.\n    - Uses the [full name FL\n      model](ethnicolr/models/ethnicolr_keras_lstm_fl_voter_ln_five_cat.ipynb) to predict the race and ethnicity.\n\n    ----------------------------------------------------------------------------\n    Parameters   \u00a0\n    ------------ ---------------------------------------------------------------\n    \u00a0            **df** : *{DataFrame, csv}* Pandas dataframe of CSV file\n                 contains the names of the individual to be inferred\n\n    \u00a0            **namecol** : *{string, list}* string or list of the name or\n                 location of the column containing the first name, last name.\n\n    \u00a0            **num_iter** : *int, default=100* number of iterations to\n                 calculate uncertainty\n\n    \u00a0            **conf_int** : *float, default=1.0* confidence interval\n    ----------------------------------------------------------------------------\n\n  - Output: Appends the following columns to the pandas 
DataFrame or\n    CSV: race (white, black, asian, Hispanic, or other), asian\n    (percentage chance Asian), hispanic, nh_black, nh_white, other. For\n    each race, it will provide the mean, standard error, lower & upper\n    bound of confidence interval\n\n  *(Using the same dataframe from example above)*\n\n  ```python\n  >>> odf = pred_fl_reg_name_five_cat(df, 'last','first')\n  ['asian', 'hispanic', 'nh_black', 'nh_white', 'other']\n\n  >>> odf\n     last first true_race  asian_mean  asian_std  asian_lb  asian_ub  hispanic_mean  hispanic_std  ...  nh_white_mean  nh_white_std  nh_white_lb  nh_white_ub  other_mean  other_std  other_lb  other_ub      race\n  0  Sawyer  john  nh_white    0.039310   0.011657  0.025982  0.059719       0.019737      0.005813  ...       0.650306      0.059327     0.553913     0.733201    0.192242   0.021004  0.160185  0.226063  nh_white\n  1  Torres  raul  hispanic    0.020086   0.011765  0.008240  0.041741       0.899110      0.042237  ...       0.019073      0.009901     0.010166     0.040081    0.055774   0.017897  0.036245  0.088741  hispanic\n\n  [2 rows x 24 columns]\n\n  >>> odf.iloc[1]\n  last               Torres\n  first                raul\n  true_race        hispanic\n  asian_mean       0.020086\n  asian_std        0.011765\n  asian_lb          0.00824\n  asian_ub         0.041741\n  hispanic_mean     0.89911\n  hispanic_std     0.042237\n  hispanic_lb      0.823799\n  hispanic_ub      0.937612\n  nh_black_mean    0.005956\n  nh_black_std     0.006528\n  nh_black_lb      0.002686\n  nh_black_ub      0.010134\n  nh_white_mean    0.019073\n  nh_white_std     0.009901\n  nh_white_lb      0.010166\n  nh_white_ub      0.040081\n  other_mean       0.055774\n  other_std        0.017897\n  other_lb         0.036245\n  other_ub         0.088741\n  race             hispanic\n  Name: 1, dtype: object\n  ```\n\n- **pred_nc_reg_name(df, namecol, num_iter=100, conf_int=1.0)**\n\n  - What it does:\n    - Removes extra space.\n    - Uses 
the [full name NC\n      model](ethnicolr/models/ethnicolr_keras_lstm_nc_12_cat_model.ipynb) to predict the race and ethnicity.\n\n    ----------------------------------------------------------------------------\n    Parameters   \u00a0\n    ------------ ---------------------------------------------------------------\n    \u00a0            **df** : *{DataFrame, csv}* Pandas dataframe of CSV file\n                 contains the names of the individual to be inferred\n\n    \u00a0            **namecol** : *{string, list}* string or list of the name or\n                 location of the column containing the first name and last name.\n\n    \u00a0            **num_iter** : *int, default=100* number of iterations to\n                 calculate uncertainty\n\n    \u00a0            **conf_int** : *float, default=1.0* confidence interval\n    ----------------------------------------------------------------------------\n\n  - Output: Appends the following columns to the pandas DataFrame or\n    CSV: race + ethnicity. The codebook is\n    [here](https://github.com/appeler/nc_race_ethnicity). For each race, it will provide the mean, standard error,\n    lower & upper bound of confidence interval\n\n  ```python\n  >>> import pandas as pd\n\n  >>> names = [\n  ...             {\"last\": \"hernandez\", \"first\": \"hector\", \"true_race\": \"HL+O\"},\n  ...             {\"last\": \"zhang\", \"first\": \"simon\", \"true_race\": \"NL+A\"},\n  ...         ]\n\n  >>> df = pd.DataFrame(names)\n\n  >>> from ethnicolr import pred_nc_reg_name\n\n  >>> odf = pred_nc_reg_name(df, 'last','first', conf_int=0.9)\n  ['HL+A', 'HL+B', 'HL+I', 'HL+M', 'HL+O', 'HL+W', 'NL+A', 'NL+B', 'NL+I', 'NL+M', 'NL+O', 'NL+W']\n\n  >>> odf\n        last   first true_race            __name     HL+A_mean  HL+A_std       HL+A_lb       HL+A_ub     HL+B_mean  HL+B_std       HL+B_lb       HL+B_ub  HL+I_mean  ...     
NL+M_mean  NL+M_std       NL+M_lb       NL+M_ub  NL+O_mean  NL+O_std   NL+O_lb   NL+O_ub  NL+W_mean  NL+W_std   NL+W_lb   NL+W_ub  race\n  0  hernandez  hector      HL+O  Hernandez Hector  2.727371e-13       0.0  2.727372e-13  2.727372e-13  6.542178e-04       0.0  6.542183e-04  6.542183e-04   0.000032  ...  7.863581e-06       0.0  7.863589e-06  7.863589e-06   0.184513       0.0  0.184514  0.184514   0.001256       0.0  0.001256  0.001256  HL+O\n  1      zhang   simon      NL+A       Zhang Simon  1.985421e-06       0.0  1.985423e-06  1.985423e-06  8.708256e-09       0.0  8.708265e-09  8.708265e-09   0.000049  ...  1.446786e-07       0.0  1.446784e-07  1.446784e-07   0.003238       0.0  0.003238  0.003238   0.000154       0.0  0.000154  0.000154  NL+A\n\n  [2 rows x 53 columns]\n\n  >>> odf.iloc[0]\n  last                hernandez\n  first                  hector\n  true_race                HL+O\n  __name       Hernandez Hector\n  HL+A_mean                 0.0\n  HL+A_std                  0.0\n  HL+A_lb                   0.0\n  HL+A_ub                   0.0\n  HL+B_mean            0.000654\n  HL+B_std                  0.0\n  HL+B_lb              0.000654\n  HL+B_ub              0.000654\n  HL+I_mean            0.000032\n  HL+I_std                  0.0\n  HL+I_lb              0.000032\n  HL+I_ub              0.000032\n  HL+M_mean            0.000541\n  HL+M_std                  0.0\n  HL+M_lb              0.000541\n  HL+M_ub              0.000541\n  HL+O_mean             0.58944\n  HL+O_std                  0.0\n  HL+O_lb               0.58944\n  HL+O_ub               0.58944\n  HL+W_mean            0.221309\n  HL+W_std                  0.0\n  HL+W_lb              0.221309\n  HL+W_ub              0.221309\n  NL+A_mean            0.000044\n  NL+A_std                  0.0\n  NL+A_lb              0.000044\n  NL+A_ub              0.000044\n  NL+B_mean            0.002199\n  NL+B_std                  0.0\n  NL+B_lb              0.002199\n  NL+B_ub              0.002199\n  
NL+I_mean            0.000004\n  NL+I_std                  0.0\n  NL+I_lb              0.000004\n  NL+I_ub              0.000004\n  NL+M_mean            0.000008\n  NL+M_std                  0.0\n  NL+M_lb              0.000008\n  NL+M_ub              0.000008\n  NL+O_mean            0.184513\n  NL+O_std                  0.0\n  NL+O_lb              0.184514\n  NL+O_ub              0.184514\n  NL+W_mean            0.001256\n  NL+W_std                  0.0\n  NL+W_lb              0.001256\n  NL+W_ub              0.001256\n  race                     HL+O\n  Name: 0, dtype: object\n  ```\n\n### Application\n\nTo illustrate how the package can be used, we impute the race of the\ncampaign contributors recorded by FEC for the years 2000 and 2010 and\ntally campaign contributions by race.\n\n- [Contrib 2000/2010 using\n  census_ln](ethnicolr/examples/ethnicolr_app_contrib20xx-census_ln.ipynb)\n- [Contrib 2000/2010 using\n  pred_census_ln](ethnicolr/examples/ethnicolr_app_contrib20xx.ipynb)\n- [Contrib 2000/2010 using\n  pred_fl_reg_name](ethnicolr/examples/ethnicolr_app_contrib20xx-fl_reg.ipynb)\n  \nData on race of all the people in the [DIME\ndata](https://data.stanford.edu/dime) is posted\n[here](http://dx.doi.org/10.7910/DVN/M5K7VR). The\nunderlying Python scripts are posted\n[here](https://github.com/appeler/dime_race)\n# Data\n\nIn particular, we utilize the last-name\\--race data from the [2000\ncensus](http://www.census.gov/topics/population/genealogy/data/2000_surnames.html) and [2010\ncensus](http://www.census.gov/topics/population/genealogy/data/2010_surnames.html), the [Wikipedia data](ethnicolr/data/wiki/) collected by Skiena and colleagues, and the Florida voter\nregistration data from early 2017.\n\n- [Census](ethnicolr/data/census/)\n- [The Wikipedia dataset](ethnicolr/data/wiki/)\n- [Florida voter registration\n  database](http://dx.doi.org/10.7910/DVN/UBIG3F)\n\n### Evaluation\n\n1.  
SCAN Health Plan, a Medicare Advantage plan that serves over 200,000\n    members throughout California, used the software to better assess\n    racial disparities in health among the people they serve. They only\n    had racial data on about 47% of their members, so they used it to infer\n    the race of the remaining 53%. On the data they had labels for, they\n    found an AUC of 0.9 and 83% accuracy for the last name model.\n    \n2.  Evaluation on NC Data:\n    [https://github.com/appeler/nc_race_ethnicity](https://github.com/appeler/nc_race_ethnicity)\n    \n### Authors\n\nSuriyan Laohaprapanon and Gaurav Sood\n\n### Contributor Code of Conduct\n\nThe project welcomes contributions from everyone! In fact, it depends on\nit. To maintain this welcoming atmosphere and to collaborate in a fun\nand productive way, we expect contributors to the project to abide by\nthe [Contributor Code of\nConduct](http://contributor-covenant.org/version/1/0/0/).\n\n\n## License\n\nThe package is released under the [MIT\nLicense](https://opensource.org/licenses/MIT).\n\n\n## \ud83d\udd17 Adjacent Repositories\n\n- [appeler/ethnicolr2](https://github.com/appeler/ethnicolr2) \u2014 Ethnicolr implementation with new models in pytorch\n- [appeler/ethnicolor](https://github.com/appeler/ethnicolor) \u2014 Race and Ethnicity based on name using data from census, voter reg. files, etc.\n- [appeler/instate](https://github.com/appeler/instate) \u2014 instate: predict the state of residence from last name using the indian electoral rolls\n- [appeler/search_names](https://github.com/appeler/search_names) \u2014 Search a long list of names (patterns) in a large text corpus systematically and quickly\n- [appeler/nc_race_ethnicity](https://github.com/appeler/nc_race_ethnicity) \u2014 Evaluation of some of the ethnicolr models on the NC Voter Registration Data + New Models Based on NC Voter Registration Data.\n",
    "bugtrack_url": null,
    "license": null,
    "summary": "Predict Race/Ethnicity Based on Sequence of Characters in Names",
    "version": "0.18.4",
    "project_urls": {
        "Bug Reports": "https://github.com/appeler/ethnicolr/issues",
        "Documentation": "https://github.com/appeler/ethnicolr#readme",
        "Homepage": "https://github.com/appeler/ethnicolr",
        "Source Code": "https://github.com/appeler/ethnicolr"
    },
    "split_keywords": [
        "race",
        " ethnicity",
        " names",
        " demographics",
        " machine-learning",
        " nlp"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "825dd2816db972bf34347e6caaf6a9f90c97a1640fd219b5d9c5a8da8933cedb",
                "md5": "5b364c4318d88053163b0fcd4e92c38e",
                "sha256": "d9b12dfe267273d109b1bf3e0c6992eda0e9ff1629d16ecab010d2d843e49474"
            },
            "downloads": -1,
            "filename": "ethnicolr-0.18.4-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "5b364c4318d88053163b0fcd4e92c38e",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<3.13,>=3.9",
            "size": 60304055,
            "upload_time": "2025-09-01T13:17:34",
            "upload_time_iso_8601": "2025-09-01T13:17:34.620960Z",
            "url": "https://files.pythonhosted.org/packages/82/5d/d2816db972bf34347e6caaf6a9f90c97a1640fd219b5d9c5a8da8933cedb/ethnicolr-0.18.4-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "589a6a648bd3a221b8efefe51118aa76ab813fc5d7a29c0eb1cadc653df0c0c8",
                "md5": "16a6929835cd470f85930e6df085aa94",
                "sha256": "18d02cd9cd658692ce9aa3f602dee12c5f9c78df9351d2a6daf1b92f0faee3b4"
            },
            "downloads": -1,
            "filename": "ethnicolr-0.18.4.tar.gz",
            "has_sig": false,
            "md5_digest": "16a6929835cd470f85930e6df085aa94",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<3.13,>=3.9",
            "size": 60119785,
            "upload_time": "2025-09-01T13:17:38",
            "upload_time_iso_8601": "2025-09-01T13:17:38.252464Z",
            "url": "https://files.pythonhosted.org/packages/58/9a/6a648bd3a221b8efefe51118aa76ab813fc5d7a29c0eb1cadc653df0c0c8/ethnicolr-0.18.4.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-09-01 13:17:38",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "appeler",
    "github_project": "ethnicolr",
    "travis_ci": true,
    "coveralls": false,
    "github_actions": true,
    "lcname": "ethnicolr"
}
        