ethnicolr: Predict Race and Ethnicity From Name
----------------------------------------------------

:Name: ethnicolr
:Version: 0.9.6
:Home page: https://github.com/appeler/ethnicolr
:Summary: Predict Race/Ethnicity Based on Sequence of Characters in the Name
:Upload time: 2023-04-17 17:51:54
:Author: Suriyan Laohaprapanon, Gaurav Sood, Bashar Naji
:License: MIT
:Keywords: race, ethnicity, names

.. image:: https://github.com/appeler/ethnicolr/workflows/test/badge.svg
    :target: https://github.com/appeler/ethnicolr/actions?query=workflow%3Atest
.. image:: https://img.shields.io/pypi/v/ethnicolr.svg
    :target: https://pypi.python.org/pypi/ethnicolr
.. image:: https://anaconda.org/soodoku/ethnicolr/badges/version.svg
    :target: https://anaconda.org/soodoku/ethnicolr/
.. image:: https://pepy.tech/badge/ethnicolr
    :target: https://pepy.tech/project/ethnicolr

We use US census data, Florida voter registration data, and 
the Wikipedia data collected by Skiena and colleagues to predict race
and ethnicity from the first and last name, or from the last name alone. The 
granularity of the predictions depends on the dataset. For instance, 
Skiena et al.'s Wikipedia data is at the ethnic-group level, while the 
census data we use in the model distinguishes only among Non-Hispanic Whites, 
Non-Hispanic Blacks, Asians, and Hispanics (the raw data has additional 
categories for Native Americans and Bi-racial).

Streamlit
-----------
Streamlit App: https://appeler-ethnicolr-streamlitstreamlit-app-qek30c.streamlit.app/

Caveats and Notes
-----------------------

If you picked a person at random with the last name 'Smith' in the US in 2010 and asked us to guess this person's race (as measured by the census), the best guess would be based on the aggregated Census file. That is the Bayes optimal solution. So what are last-name-only predictive models good for? A few things: imputing race and ethnicity for last names that are not in the census file, inferring race and ethnicity in years other than the census years (if some assumptions hold), inferring the race of people in other countries (if some assumptions hold), and so on. The biggest benefit comes when both the first name and the last name are known.
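
As a rough illustration of the lookup baseline described above, the sketch below picks the most likely category for a last name from aggregated census shares. The shares are the 2010 values shown in the ``census_ln`` example output later in this README; the helper ``best_guess`` is hypothetical and is not part of ethnicolr.

```python
# Aggregated 2010 census shares (percent) for three last names,
# copied from the census_ln example output in this README.
CENSUS_2010 = {
    "smith": {"pctwhite": 70.9, "pctblack": 23.11, "pctapi": 0.5,
              "pctaian": 0.89, "pct2prace": 2.19, "pcthispanic": 2.4},
    "zhang": {"pctwhite": 0.99, "pctblack": 0.16, "pctapi": 98.06,
              "pctaian": 0.02, "pct2prace": 0.62, "pcthispanic": 0.15},
    "jackson": {"pctwhite": 39.89, "pctblack": 53.04, "pctapi": 0.39,
                "pctaian": 1.06, "pct2prace": 3.12, "pcthispanic": 2.5},
}


def best_guess(last_name, table=CENSUS_2010):
    """Return the category with the largest share, or None if the name is absent."""
    shares = table.get(last_name.strip().lower())
    if shares is None:
        # The lookup has nothing to say about unseen names;
        # this is where a predictive model becomes useful.
        return None
    return max(shares, key=shares.get)


for name in ("smith", "zhang", "jackson", "not-in-the-file"):
    print(name, best_guess(name))
```

A name missing from the table returns ``None``, which is the case the predictive models are meant to cover.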

Install
----------

We strongly recommend installing ``ethnicolr`` inside a Python virtual environment
(see the `venv documentation <https://docs.python.org/3/library/venv.html#creating-virtual-environments>`__).

::

    pip install ethnicolr

Or 

::
   
   conda install -c soodoku ethnicolr 

Notes:

 - The models are run and verified on TensorFlow 2.x using Python 3.7 and 3.8.
 - On Windows, installing Theano typically requires a shell with administrator privileges.

General API
------------------

To see the available command line options for any function, run 
``<function-name> --help``:

::

   # census_ln --help
   usage: census_ln [-h] [-y {2000,2010}] [-o OUTPUT] -l LAST input

   Appends Census columns by last name

   positional arguments:
     input                 Input file

   optional arguments:
     -h, --help            show this help message and exit
     -y {2000,2010}, --year {2000,2010}
                           Year of Census data (default=2000)
     -o OUTPUT, --output OUTPUT
                           Output file with Census data columns
     -l LAST, --last LAST  Name of the column containing the last name
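
The help text above maps naturally onto an ``argparse`` declaration. The sketch below is an assumption about how such a parser could be written, not the package's actual source; only the flags, choices, and help strings are taken from the help output.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the census_ln help text shown above."""
    parser = argparse.ArgumentParser(
        prog="census_ln",
        description="Appends Census columns by last name",
    )
    parser.add_argument("input", help="Input file")
    parser.add_argument("-y", "--year", type=int, choices=[2000, 2010],
                        default=2000, help="Year of Census data (default=2000)")
    parser.add_argument("-o", "--output",
                        help="Output file with Census data columns")
    parser.add_argument("-l", "--last", required=True,
                        help="Name of the column containing the last name")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch
# can be run anywhere.
args = build_parser().parse_args(
    ["-y", "2010", "-o", "output-census2010.csv", "-l", "last_name",
     "input-with-header.csv"]
)
print(args.year, args.output, args.last, args.input)
```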


Examples
----------

To append 2010 census data to a `file with a column header in the first row <ethnicolr/data/input-with-header.csv>`__, pass the name of the column that holds the last names with the ``-l`` option:

::

   census_ln -y 2010 -o output-census2010.csv -l last_name input-with-header.csv   


To predict race/ethnicity using the `Wikipedia full name model <ethnicolr/models/ethnicolr_keras_lstm_wiki_name.ipynb>`__, pass the names of the last-name and first-name columns with the ``-l`` and ``-f`` flags, respectively:

::

   pred_wiki_name -o output-wiki-pred-race.csv -l last_name -f first_name input-with-header.csv
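
Both commands expect an input CSV whose first row is a header. A minimal sketch of producing such a file with the standard library follows. The rows are illustrative, and the column names ``last_name`` and ``first_name`` simply match the ``-l`` and ``-f`` values used in the examples above.

```python
import csv
import os
import tempfile

# Illustrative rows; any names would do.
rows = [
    {"last_name": "smith", "first_name": "john"},
    {"last_name": "zhang", "first_name": "simon"},
]

# Write to a temporary directory so the sketch does not clutter the
# working directory.
path = os.path.join(tempfile.mkdtemp(), "input-with-header.csv")
with open(path, "w", newline="", encoding="utf-8") as handle:
    writer = csv.DictWriter(handle, fieldnames=["last_name", "first_name"])
    writer.writeheader()   # the header row the CLI options refer to
    writer.writerows(rows)

# Read the header back to confirm the column names.
with open(path, newline="", encoding="utf-8") as handle:
    header = next(csv.reader(handle))
print(header)
```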


Functions
----------

We expose several functions, each of which takes either a pandas DataFrame or a
CSV.

- **census\_ln(df, lname_col, year=2000)**

  -  What it does:

     - Removes extra space
     - For names in the `census file <ethnicolr/data/census>`__, it appends 
       the percentage of people with that last name in each race/ethnicity category

 +------------+--------------------------------------------------------------------------------------------------------------------------+
 | Parameters |                                                                                                                          |
 +============+==========================================================================================================================+
 |            | **df** : *{DataFrame, csv}* pandas DataFrame or CSV file with the names of the individuals to be inferred                |
 +------------+--------------------------------------------------------------------------------------------------------------------------+
 |            | **lname_col** : *{string}* name of the column containing the last name                                                   |
 +------------+--------------------------------------------------------------------------------------------------------------------------+
 |            | **year** : *{2000, 2010}, default=2000* year of census to use                                                            |
 +------------+--------------------------------------------------------------------------------------------------------------------------+


-  Output: Appends the following columns to the pandas DataFrame or CSV: 
   pctwhite, pctblack, pctapi, pctaian, pct2prace, pcthispanic. 
   See `here <https://github.com/appeler/ethnicolr/blob/master/ethnicolr/data/census/census_2000.pdf>`__ 
   for what the column names mean.

   ::

      >>> import pandas as pd

      >>> from ethnicolr import census_ln, pred_census_ln

      >>> names = [{'name': 'smith'},
      ...         {'name': 'zhang'},
      ...         {'name': 'jackson'}]

      >>> df = pd.DataFrame(names)

      >>> df
            name
      0    smith
      1    zhang
      2  jackson

      >>> census_ln(df, 'name')
            name pctwhite pctblack pctapi pctaian pct2prace pcthispanic
      0    smith    73.35    22.22   0.40    0.85      1.63        1.56
      1    zhang     0.61     0.09  98.16    0.02      0.96        0.16
      2  jackson    41.93    53.02   0.31    1.04      2.18        1.53
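
The two steps listed for ``census_ln`` (clean the name, then append the matching census columns) amount to a left join on the last name. The sketch below shows that idea with the standard library only. ethnicolr itself works on pandas DataFrames; the helper ``append_census_columns`` and the case-insensitive matching are assumptions made for this illustration.

```python
import csv
import io

# 2000 census shares for the three names in the example above.
CENSUS_2000_CSV = """name,pctwhite,pctblack,pctapi,pctaian,pct2prace,pcthispanic
smith,73.35,22.22,0.40,0.85,1.63,1.56
zhang,0.61,0.09,98.16,0.02,0.96,0.16
jackson,41.93,53.02,0.31,1.04,2.18,1.53
"""
COLUMNS = ["pctwhite", "pctblack", "pctapi", "pctaian", "pct2prace", "pcthispanic"]


def load_table(text):
    """Index the census rows by last name."""
    return {row["name"]: row for row in csv.DictReader(io.StringIO(text))}


def append_census_columns(records, lname_col, table):
    """Left-join census columns onto each record; unmatched names get None."""
    joined = []
    for record in records:
        # Collapse repeated whitespace and ignore case before matching.
        key = " ".join(record[lname_col].split()).lower()
        match = table.get(key)
        merged = dict(record)
        for column in COLUMNS:
            merged[column] = float(match[column]) if match else None
        joined.append(merged)
    return joined


table = load_table(CENSUS_2000_CSV)
result = append_census_columns(
    [{"name": "  Smith "}, {"name": "unlisted"}], "name", table
)
for row in result:
    print(row)
```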


-  **pred\_census\_ln(df, lname_col, year=2000, num\_iter=100, conf\_int=1.0)**

   -  What it does:

      -  Removes extra space.
      -  Uses the `last name census 2000 
         model <ethnicolr/models/ethnicolr_keras_lstm_census2000_ln.ipynb>`__ or 
         `last name census 2010 model <ethnicolr/models/ethnicolr_keras_lstm_census2010_ln.ipynb>`__ 
         to predict the race and ethnicity.


   +--------------+---------------------------------------------------------------------------------------------------------------------+
   | Parameters   |                                                                                                                     |
   +==============+=====================================================================================================================+
   |              | **df** : *{DataFrame, csv}* pandas DataFrame or CSV file with the names of the individuals to be inferred           |
   +--------------+---------------------------------------------------------------------------------------------------------------------+
   |              | **lname_col** : *{string}* name of the column containing the last name                                              |
   +--------------+---------------------------------------------------------------------------------------------------------------------+
   |              | **year** : *{2000, 2010}, default=2000* year of census to use                                                       |
   +--------------+---------------------------------------------------------------------------------------------------------------------+
   |              | **num\_iter** : *int, default=100* number of iterations to calculate uncertainty in model                           |
   +--------------+---------------------------------------------------------------------------------------------------------------------+
   |              | **conf\_int** : *float, default=1.0* confidence interval in predicted class                                         |
   +--------------+---------------------------------------------------------------------------------------------------------------------+


   -  Output: Appends the following columns to the pandas DataFrame or CSV:
      race (api, black, hispanic, or white), api (probability of being
      Asian), black, hispanic, white. For each category, it provides the
      mean, standard error, and lower and upper bounds of the confidence interval.

   *(Using the same dataframe from example above)*
   ::

         >>> census_ln(df, 'name')
               name pctwhite pctblack pctapi pctaian pct2prace pcthispanic
         0    smith    73.35    22.22   0.40    0.85      1.63        1.56
         1    zhang     0.61     0.09  98.16    0.02      0.96        0.16
         2  jackson    41.93    53.02   0.31    1.04      2.18        1.53

         >>> census_ln(df, 'name', 2010)
               name   race pctwhite pctblack pctapi pctaian pct2prace pcthispanic
         0    smith  white     70.9    23.11    0.5    0.89      2.19         2.4
         1    zhang    api     0.99     0.16  98.06    0.02      0.62        0.15
         2  jackson  black    39.89    53.04   0.39    1.06      3.12         2.5

         >>> pred_census_ln(df, 'name')
               name   race       api     black  hispanic     white
         0    smith  white  0.002019  0.247235  0.014485  0.736260
         1    zhang    api  0.997807  0.000149  0.000470  0.001574
         2  jackson  black  0.002797  0.528193  0.014605  0.454405


-  **pred\_wiki\_ln(df, lname_col, num\_iter=100, conf\_int=1.0)**

   -  What it does:

      -  Removes extra space.
      -  Uses the `last name wiki
         model <ethnicolr/models/ethnicolr_keras_lstm_wiki_ln.ipynb>`__ to
         predict the race and ethnicity.


   +--------------+---------------------------------------------------------------------------------------------------------------------+
   | Parameters   |                                                                                                                     |
   +==============+=====================================================================================================================+
   |              | **df** : *{DataFrame, csv}* pandas DataFrame or CSV file with the names of the individuals to be inferred           |
   +--------------+---------------------------------------------------------------------------------------------------------------------+
   |              | **lname_col** : *{string}* name of the column containing the last name                                              |
   +--------------+---------------------------------------------------------------------------------------------------------------------+
   |              | **num\_iter** : *int, default=100* number of iterations to calculate uncertainty in model                           |
   +--------------+---------------------------------------------------------------------------------------------------------------------+
   |              | **conf\_int** : *float, default=1.0* confidence interval in predicted class                                         |
   +--------------+---------------------------------------------------------------------------------------------------------------------+


   -  Output: Appends the following columns to the pandas DataFrame or CSV:
      race (categorical variable; the category with the highest probability)
      and, for each of the categories listed below, the mean, standard error,
      and lower and upper bounds of the confidence interval (columns suffixed
      ``_mean``, ``_std``, ``_lb``, and ``_ub``).

   ::

      "Asian,GreaterEastAsian,EastAsian",
      "Asian,GreaterEastAsian,Japanese", "Asian,IndianSubContinent",
      "GreaterAfrican,Africans", "GreaterAfrican,Muslim",
      "GreaterEuropean,British","GreaterEuropean,EastEuropean",
      "GreaterEuropean,Jewish","GreaterEuropean,WestEuropean,French",
      "GreaterEuropean,WestEuropean,Germanic","GreaterEuropean,WestEuropean,Hispanic",
      "GreaterEuropean,WestEuropean,Italian","GreaterEuropean,WestEuropean,Nordic".

   ::

      >>> import pandas as pd

      >>> names = [
      ...             {"last": "smith", "first": "john", "true_race": "GreaterEuropean,British"},
      ...             {
      ...                 "last": "zhang",
      ...                 "first": "simon",
      ...                 "true_race": "Asian,GreaterEastAsian,EastAsian",
      ...             },
      ...         ]
      >>> df = pd.DataFrame(names)

      >>> from ethnicolr import pred_wiki_ln, pred_wiki_name

      >>> odf = pred_wiki_ln(df,'last', conf_int=0.9)
      ['Asian,GreaterEastAsian,EastAsian', 'Asian,GreaterEastAsian,Japanese', 'Asian,IndianSubContinent', 'GreaterAfrican,Africans', 'GreaterAfrican,Muslim', 'GreaterEuropean,British', 'GreaterEuropean,EastEuropean', 'GreaterEuropean,Jewish', 'GreaterEuropean,WestEuropean,French', 'GreaterEuropean,WestEuropean,Germanic', 'GreaterEuropean,WestEuropean,Hispanic', 'GreaterEuropean,WestEuropean,Italian', 'GreaterEuropean,WestEuropean,Nordic']
      
      >>> odf
         last  first                         true_race  ...  GreaterEuropean,WestEuropean,Nordic_lb  GreaterEuropean,WestEuropean,Nordic_ub                              race
      0  Smith   john           GreaterEuropean,British                               0.016103  ...                                 0.014135                                0.007382                                0.048828           GreaterEuropean,British
      1  Zhang  simon  Asian,GreaterEastAsian,EastAsian                               0.863391  ...                                 0.017452                                0.001844                                0.027252  Asian,GreaterEastAsian,EastAsian

      [2 rows x 56 columns]
      
      >>> odf.iloc[0, :8]
      last                                                       Smith
      first                                                       john
      true_race                                GreaterEuropean,British
      Asian,GreaterEastAsian,EastAsian_mean                   0.016103
      Asian,GreaterEastAsian,EastAsian_std                    0.009735
      Asian,GreaterEastAsian,EastAsian_lb                     0.005873
      Asian,GreaterEastAsian,EastAsian_ub                     0.034637
      Asian,GreaterEastAsian,Japanese_mean                    0.003814
      Name: 0, dtype: object
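
The Wikipedia categories are comma-separated paths such as ``GreaterEuropean,WestEuropean,French``, so probabilities can be summed up to a coarser level when the fine-grained label is not needed. The sketch below shows one way to do that. The helpers ``top_level`` and ``roll_up`` are hypothetical, and the probabilities are made up for illustration rather than taken from a model.

```python
from collections import defaultdict


def top_level(label):
    """Return the first component of a comma-separated category path."""
    return label.split(",")[0]


def roll_up(probabilities):
    """Sum fine-grained probabilities into their top-level groups."""
    totals = defaultdict(float)
    for label, probability in probabilities.items():
        totals[top_level(label)] += probability
    return dict(totals)


# Made-up probabilities for a single name, for illustration only.
example = {
    "Asian,GreaterEastAsian,EastAsian": 0.05,
    "GreaterAfrican,Africans": 0.10,
    "GreaterEuropean,British": 0.60,
    "GreaterEuropean,WestEuropean,French": 0.25,
}

grouped = roll_up(example)
print(grouped)
```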




-  **pred\_wiki\_name(df, lname_col, fname_col, num\_iter=100, conf\_int=1.0)**

   -  What it does:

      -  Removes extra space.
      -  Uses the `full name wiki
         model <ethnicolr/models/ethnicolr_keras_lstm_wiki_name.ipynb>`__
         to predict the race and ethnicity.

   +--------------+----------------------------------------------------------------------------------------------------------------+
   | Parameters   |                                                                                                                |
   +==============+================================================================================================================+
   |              | **df** : *{DataFrame, csv}* pandas DataFrame or CSV file with the names of the individuals to be inferred      |
   +--------------+----------------------------------------------------------------------------------------------------------------+
   |              | **lname_col**, **fname_col** : *{string}* names of the columns containing the last and first names             |
   +--------------+----------------------------------------------------------------------------------------------------------------+
   |              | **num\_iter** : *int, default=100* number of iterations to calculate uncertainty of predictions                |
   +--------------+----------------------------------------------------------------------------------------------------------------+
   |              | **conf\_int** : *float, default=1.0* confidence interval                                                       |
   +--------------+----------------------------------------------------------------------------------------------------------------+



   -  Output: Appends the following columns to the pandas DataFrame or CSV:
      race (categorical variable; the category with the highest probability),
      "Asian,GreaterEastAsian,EastAsian",
      "Asian,GreaterEastAsian,Japanese", "Asian,IndianSubContinent",
      "GreaterAfrican,Africans", "GreaterAfrican,Muslim",
      "GreaterEuropean,British","GreaterEuropean,EastEuropean",
      "GreaterEuropean,Jewish","GreaterEuropean,WestEuropean,French",
      "GreaterEuropean,WestEuropean,Germanic","GreaterEuropean,WestEuropean,Hispanic",
      "GreaterEuropean,WestEuropean,Italian","GreaterEuropean,WestEuropean,Nordic".
      For each category, it provides the mean, standard error, and lower and
      upper bounds of the confidence interval.

   *(Using the same dataframe from example above)*
   ::

      >>> odf = pred_wiki_name(df,'last', 'first', conf_int=0.9)
      ['Asian,GreaterEastAsian,EastAsian', 'Asian,GreaterEastAsian,Japanese', 'Asian,IndianSubContinent', 'GreaterAfrican,Africans', 'GreaterAfrican,Muslim', 'GreaterEuropean,British', 'GreaterEuropean,EastEuropean', 'GreaterEuropean,Jewish', 'GreaterEuropean,WestEuropean,French', 'GreaterEuropean,WestEuropean,Germanic', 'GreaterEuropean,WestEuropean,Hispanic', 'GreaterEuropean,WestEuropean,Italian', 'GreaterEuropean,WestEuropean,Nordic']

      >>> odf
         last  first                         true_race       __name  Asian,GreaterEastAsian,EastAsian_mean  ...  GreaterEuropean,WestEuropean,Nordic_mean  GreaterEuropean,WestEuropean,Nordic_std  GreaterEuropean,WestEuropean,Nordic_lb  GreaterEuropean,WestEuropean,Nordic_ub                              race
      0  Smith   john           GreaterEuropean,British   Smith John                               0.004111  ...                                  0.006246                                 0.004760                                0.001048                                0.016288           GreaterEuropean,British
      1  Zhang  simon  Asian,GreaterEastAsian,EastAsian  Zhang Simon                               0.944203  ...                                  0.000793                                 0.002557                                0.000019                                0.002470  Asian,GreaterEastAsian,EastAsian

      [2 rows x 57 columns]

      >>> odf.iloc[0,:8]
      last                                                       Smith
      first                                                       john
      true_race                                GreaterEuropean,British
      __name                                                Smith John
      Asian,GreaterEastAsian,EastAsian_mean                   0.004111
      Asian,GreaterEastAsian,EastAsian_std                    0.002929
      Asian,GreaterEastAsian,EastAsian_lb                     0.001356
      Asian,GreaterEastAsian,EastAsian_ub                     0.010571
      Name: 0, dtype: object
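
In the output above, the ``__name`` column holds the last and first name joined into one string (``Smith John`` for the inputs ``smith`` and ``john``). The sketch below reproduces that transformation. It is inferred from the printed output only, so the exact cleaning steps inside ethnicolr may differ; ``build_full_name`` is a hypothetical helper.

```python
def build_full_name(last, first):
    """Join last and first name, trimming spaces and title-casing each word."""
    # Collapse any repeated or surrounding whitespace in each part.
    parts = [" ".join(part.split()) for part in (last, first)]
    return " ".join(parts).title()


print(build_full_name("smith", " john "))
print(build_full_name("zhang", "simon"))
```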


-  **pred\_fl\_reg\_ln(df, lname_col, num\_iter=100, conf\_int=1.0)**

   -  What it does:

      -  Removes extra space.
      -  Uses the `last name FL registration
         model <ethnicolr/models/ethnicolr_keras_lstm_fl_voter_ln.ipynb>`__
         to predict the race and ethnicity.

   +--------------+---------------------------------------------------------------------------------------------------------------------+
   | Parameters   |                                                                                                                     |
   +==============+=====================================================================================================================+
   |              | **df** : *{DataFrame, csv}* pandas DataFrame or CSV file with the names of the individuals to be inferred           |
   +--------------+---------------------------------------------------------------------------------------------------------------------+
   |              | **lname_col** : *{string}* name of the column containing the last name                                              |
   +--------------+---------------------------------------------------------------------------------------------------------------------+
   |              | **num\_iter** : *int, default=100* number of iterations to calculate the uncertainty                                |
   +--------------+---------------------------------------------------------------------------------------------------------------------+
   |              | **conf\_int** : *float, default=1.0* confidence interval                                                            |
   +--------------+---------------------------------------------------------------------------------------------------------------------+



   -  Output: Appends the following columns to the pandas DataFrame or CSV:
      race (asian, hispanic, nh\_black, or nh\_white), asian (probability of
      being Asian), hispanic, nh\_black, nh\_white. For each category, it
      provides the mean, standard error, and lower and upper bounds of the
      confidence interval.

   ::

      >>> import pandas as pd

      >>> names = [
      ...             {"last": "sawyer", "first": "john", "true_race": "nh_white"},
      ...             {"last": "torres", "first": "raul", "true_race": "hispanic"},
      ...         ]
      
      >>> df = pd.DataFrame(names)

      >>> from ethnicolr import pred_fl_reg_ln, pred_fl_reg_name, pred_fl_reg_ln_five_cat, pred_fl_reg_name_five_cat

      >>> odf = pred_fl_reg_ln(df, 'last', conf_int=0.9)
      ['asian', 'hispanic', 'nh_black', 'nh_white']

      >>> odf
         last first true_race  asian_mean  asian_std  asian_lb  asian_ub  hispanic_mean  hispanic_std  hispanic_lb  hispanic_ub  nh_black_mean  nh_black_std  nh_black_lb  nh_black_ub  nh_white_mean  nh_white_std  nh_white_lb  nh_white_ub      race
      0  Sawyer  john  nh_white    0.009859   0.006819  0.005338  0.019673       0.021488      0.004602     0.014802     0.030148       0.180929      0.052784     0.105756     0.270238       0.787724      0.051082     0.705290     0.860286  nh_white
      1  Torres  raul  hispanic    0.006463   0.001985  0.003915  0.010146       0.878119      0.021998     0.839274     0.909151       0.013118      0.005002     0.007364     0.021633       0.102300      0.017828     0.075911     0.130929  hispanic

      [2 rows x 20 columns]

      >>> odf.iloc[0]
      last               Sawyer
      first                john
      true_race        nh_white
      asian_mean       0.009859
      asian_std        0.006819
      asian_lb         0.005338
      asian_ub         0.019673
      hispanic_mean    0.021488
      hispanic_std     0.004602
      hispanic_lb      0.014802
      hispanic_ub      0.030148
      nh_black_mean    0.180929
      nh_black_std     0.052784
      nh_black_lb      0.105756
      nh_black_ub      0.270238
      nh_white_mean    0.787724
      nh_white_std     0.051082
      nh_white_lb       0.70529
      nh_white_ub      0.860286
      race             nh_white
      Name: 0, dtype: object
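
The ``_mean``, ``_std``, ``_lb``, and ``_ub`` columns summarize ``num_iter`` repeated predictions per name. The sketch below shows one plausible way to compute such a summary from a list of sampled probabilities. It assumes that the bounds are the percentiles at ``(0.5 - conf_int / 2)`` and ``(0.5 + conf_int / 2)`` and that the spread is a sample standard deviation; both are assumptions, not a description of the package's internals, and the samples are made up.

```python
import statistics


def percentile(sorted_values, q):
    """Linear-interpolation percentile for q in [0, 100]."""
    position = (len(sorted_values) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * fraction


def summarize(samples, conf_int=1.0):
    """Summarize repeated probability estimates for one name and one category."""
    ordered = sorted(samples)
    lower_q = (0.5 - conf_int / 2) * 100
    upper_q = (0.5 + conf_int / 2) * 100
    return {
        "mean": statistics.mean(ordered),
        "std": statistics.stdev(ordered),   # needs at least two samples
        "lb": percentile(ordered, lower_q),
        "ub": percentile(ordered, upper_q),
    }


# Made-up samples standing in for repeated model predictions.
samples = [0.70, 0.72, 0.74, 0.76, 0.78]
full = summarize(samples, conf_int=1.0)   # bounds are the min and max
half = summarize(samples, conf_int=0.5)   # bounds are the quartiles
print(full)
print(half)
```

Under these assumptions, ``conf_int=1.0`` returns the full range of the sampled values, and smaller values narrow the interval.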


-  **pred\_fl\_reg\_name(df, lname_col, fname_col, num\_iter=100, conf\_int=1.0)**

   -  What it does:

      -  Removes extra space.
      -  Uses the `full name FL
         model <ethnicolr/models/ethnicolr_keras_lstm_fl_voter_name.ipynb>`__
         to predict the race and ethnicity.

   +--------------+-------------------------------------------------------------------------------------------------------------------+
   | Parameters   |                                                                                                                   |
   +==============+===================================================================================================================+
   |              | **df** : *{DataFrame, csv}* pandas DataFrame or CSV file with the names of the individuals to be inferred         |
   +--------------+-------------------------------------------------------------------------------------------------------------------+
   |              | **lname_col**, **fname_col** : *{string}* names of the columns containing the last and first names                |
   +--------------+-------------------------------------------------------------------------------------------------------------------+
   |              | **num\_iter** : *int, default=100* number of iterations to calculate the uncertainty                              |
   +--------------+-------------------------------------------------------------------------------------------------------------------+
   |              | **conf\_int** : *float, default=1.0* confidence interval in predicted class                                       |
   +--------------+-------------------------------------------------------------------------------------------------------------------+


   -  Output: Appends the following columns to the pandas DataFrame or CSV:
      race (asian, hispanic, nh\_black, or nh\_white), asian (probability of
      being Asian), hispanic, nh\_black, nh\_white. For each category, it
      provides the mean, standard error, and lower and upper bounds of the
      confidence interval.

   
   *(Using the same dataframe from example above)*
   ::

      >>> odf = pred_fl_reg_name(df, 'last', 'first', conf_int=0.9)
      ['asian', 'hispanic', 'nh_black', 'nh_white']

      >>> odf
         last first true_race  asian_mean  asian_std  asian_lb  asian_ub  hispanic_mean  hispanic_std  hispanic_lb  hispanic_ub  nh_black_mean  nh_black_std  nh_black_lb  nh_black_ub  nh_white_mean  nh_white_std  nh_white_lb  nh_white_ub      race
      0  Sawyer  john  nh_white    0.001534   0.000850  0.000636  0.002691       0.006818      0.002557     0.003684     0.011660       0.028068      0.015095     0.011488     0.055149       0.963581      0.015738     0.935445     0.983224  nh_white
      1  Torres  raul  hispanic    0.005791   0.002906  0.002446  0.011748       0.890561      0.029581     0.841328     0.937706       0.011397      0.004682     0.005829     0.020796       0.092251      0.026675     0.049868     0.139210  hispanic

      >>> odf.iloc[1]
      last               Torres
      first                raul
      true_race        hispanic
      asian_mean       0.005791
      asian_std        0.002906
      asian_lb         0.002446
      asian_ub         0.011748
      hispanic_mean    0.890561
      hispanic_std     0.029581
      hispanic_lb      0.841328
      hispanic_ub      0.937706
      nh_black_mean    0.011397
      nh_black_std     0.004682
      nh_black_lb      0.005829
      nh_black_ub      0.020796
      nh_white_mean    0.092251
      nh_white_std     0.026675
      nh_white_lb      0.049868
      nh_white_ub       0.13921
      race             hispanic
      Name: 1, dtype: object


-  **pred\_fl\_reg\_ln\_five\_cat(df, lname_col, num\_iter=100, conf\_int=1.0)**

   -  What it does:

      -  Removes extra space.
      -  Uses the `last name FL registration
         model <ethnicolr/models/ethnicolr_keras_lstm_fl_voter_ln_five_cat.ipynb>`__
         to predict the race and ethnicity.

   +--------------+---------------------------------------------------------------------------------------------------------------------+
   | Parameters   |                                                                                                                     |
   +==============+=====================================================================================================================+
   |              | **df** : *{DataFrame, csv}* pandas DataFrame or CSV file with the names of the individuals to be inferred           |
   +--------------+---------------------------------------------------------------------------------------------------------------------+
   |              | **lname_col** : *{string, list, int}* name or location of the column containing the last name                       |
   +--------------+---------------------------------------------------------------------------------------------------------------------+
   |              | **num\_iter** : *int, default=100* number of iterations to calculate uncertainty                                    |
   +--------------+---------------------------------------------------------------------------------------------------------------------+
   |              | **conf\_int** : *float, default=1.0* confidence interval                                                            |
   +--------------+---------------------------------------------------------------------------------------------------------------------+


   -  Output: Appends the following columns to the pandas DataFrame or CSV:
      race (asian, hispanic, nh\_black, nh\_white, or other), asian
      (probability of being Asian), hispanic, nh\_black, nh\_white, other.
      For each category, it provides the mean, standard error, and lower
      and upper bounds of the confidence interval.

   *(Using the same dataframe from example above)*
   ::

      >>> odf = pred_fl_reg_ln_five_cat(df,'last')
      ['asian', 'hispanic', 'nh_black', 'nh_white', 'other']

      >>> odf
         last first true_race  asian_mean  asian_std  asian_lb  asian_ub  hispanic_mean  hispanic_std  ...  nh_white_mean  nh_white_std  nh_white_lb  nh_white_ub  other_mean  other_std  other_lb  other_ub      race
      0  Sawyer  john  nh_white    0.100038   0.020539  0.073266  0.143334       0.044263      0.013077  ...       0.376639      0.048289     0.296989     0.452834    0.248466   0.021040  0.219721  0.283785  nh_white
      1  Torres  raul  hispanic    0.062390   0.021863  0.033837  0.103737       0.774414      0.043238  ...       0.030393      0.009591     0.019713     0.046483    0.117761   0.019524  0.089418  0.150615  hispanic

      [2 rows x 24 columns]

      >>> odf.iloc[0]
      last               Sawyer
      first                john
      true_race        nh_white
      asian_mean       0.100038
      asian_std        0.020539
      asian_lb         0.073266
      asian_ub         0.143334
      hispanic_mean    0.044263
      hispanic_std     0.013077
      hispanic_lb       0.02476
      hispanic_ub      0.067965
      nh_black_mean    0.230593
      nh_black_std     0.063948
      nh_black_lb      0.130577
      nh_black_ub      0.343513
      nh_white_mean    0.376639
      nh_white_std     0.048289
      nh_white_lb      0.296989
      nh_white_ub      0.452834
      other_mean       0.248466
      other_std         0.02104
      other_lb         0.219721
      other_ub         0.283785
      race             nh_white
      Name: 0, dtype: object
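
The lower and upper bounds can be used to judge how firmly a predicted category stands apart from the alternatives. The sketch below applies one simple rule to the two ``Sawyer`` rows printed in this README: a prediction counts as well separated when the top category's lower bound exceeds the upper bound of every other category. The rule and the helper names are assumptions for illustration, not something ethnicolr computes.

```python
# Values copied from the pred_fl_reg_ln_five_cat output above (row 0).
sawyer_five_cat = {
    "asian_mean": 0.100038, "asian_lb": 0.073266, "asian_ub": 0.143334,
    "hispanic_mean": 0.044263, "hispanic_lb": 0.02476, "hispanic_ub": 0.067965,
    "nh_black_mean": 0.230593, "nh_black_lb": 0.130577, "nh_black_ub": 0.343513,
    "nh_white_mean": 0.376639, "nh_white_lb": 0.296989, "nh_white_ub": 0.452834,
    "other_mean": 0.248466, "other_lb": 0.219721, "other_ub": 0.283785,
}

# Values copied from the four-category pred_fl_reg_ln output (row 0).
sawyer_four_cat = {
    "asian_mean": 0.009859, "asian_lb": 0.005338, "asian_ub": 0.019673,
    "hispanic_mean": 0.021488, "hispanic_lb": 0.014802, "hispanic_ub": 0.030148,
    "nh_black_mean": 0.180929, "nh_black_lb": 0.105756, "nh_black_ub": 0.270238,
    "nh_white_mean": 0.787724, "nh_white_lb": 0.70529, "nh_white_ub": 0.860286,
}


def top_category(row, categories):
    """Return the category with the highest mean probability."""
    return max(categories, key=lambda c: row[f"{c}_mean"])


def is_well_separated(row, categories):
    """True if the top category's lower bound beats every rival's upper bound."""
    top = top_category(row, categories)
    highest_rival_ub = max(row[f"{c}_ub"] for c in categories if c != top)
    return row[f"{top}_lb"] > highest_rival_ub


five = ["asian", "hispanic", "nh_black", "nh_white", "other"]
four = ["asian", "hispanic", "nh_black", "nh_white"]
print(top_category(sawyer_five_cat, five), is_well_separated(sawyer_five_cat, five))
print(top_category(sawyer_four_cat, four), is_well_separated(sawyer_four_cat, four))
```

Note that the two rows were produced with different ``conf_int`` settings (the default of 1.0 for the five-category call and 0.9 for the four-category call), so the comparison is illustrative only.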


-  **pred\_fl\_reg\_name\_five\_cat(df, lname_col, fname_col, num\_iter=100, conf\_int=1.0)**

   -  What it does:

      -  Removes extra space.
      -  Uses the `full name FL
         model <ethnicolr/models/ethnicolr_keras_lstm_fl_voter_name_five_cat.ipynb>`__
         to predict the race and ethnicity.

   +--------------+---------------------------------------------------------------------------------------------------------------------------------+
   | Parameters   |                                                                                                                                 |
   +==============+=================================================================================================================================+
   |              | **df** : *{DataFrame, csv}* Pandas DataFrame or CSV file containing the names of the individuals to be classified               |
   +--------------+---------------------------------------------------------------------------------------------------------------------------------+
   |              | **namecol** : *{string, list}* string or list giving the name or location of the columns containing the first and last names    |
   +--------------+---------------------------------------------------------------------------------------------------------------------------------+
   |              | **num\_iter** : *int, default=100* number of iterations to calculate uncertainty                                                |
   +--------------+---------------------------------------------------------------------------------------------------------------------------------+
   |              | **conf\_int** : *float, default=1.0* confidence interval                                                                        |
   +--------------+---------------------------------------------------------------------------------------------------------------------------------+


   -  Output: Appends the following columns to the pandas DataFrame or CSV:
      race (the predicted category: asian, hispanic, nh\_black, nh\_white,
      or other) and, for each of those categories, the mean, standard
      error, and lower and upper bounds of the confidence interval
      (columns suffixed ``_mean``, ``_std``, ``_lb``, and ``_ub``).

   *(Using the same dataframe from example above)*
   ::

      >>> odf = pred_fl_reg_name_five_cat(df, 'last','first')
      ['asian', 'hispanic', 'nh_black', 'nh_white', 'other']

      >>> odf
         last first true_race  asian_mean  asian_std  asian_lb  asian_ub  hispanic_mean  hispanic_std  ...  nh_white_mean  nh_white_std  nh_white_lb  nh_white_ub  other_mean  other_std  other_lb  other_ub      race
      0  Sawyer  john  nh_white    0.039310   0.011657  0.025982  0.059719       0.019737      0.005813  ...       0.650306      0.059327     0.553913     0.733201    0.192242   0.021004  0.160185  0.226063  nh_white
      1  Torres  raul  hispanic    0.020086   0.011765  0.008240  0.041741       0.899110      0.042237  ...       0.019073      0.009901     0.010166     0.040081    0.055774   0.017897  0.036245  0.088741  hispanic

      [2 rows x 24 columns]

      >>> odf.iloc[1]
      last               Torres
      first                raul
      true_race        hispanic
      asian_mean       0.020086
      asian_std        0.011765
      asian_lb          0.00824
      asian_ub         0.041741
      hispanic_mean     0.89911
      hispanic_std     0.042237
      hispanic_lb      0.823799
      hispanic_ub      0.937612
      nh_black_mean    0.005956
      nh_black_std     0.006528
      nh_black_lb      0.002686
      nh_black_ub      0.010134
      nh_white_mean    0.019073
      nh_white_std     0.009901
      nh_white_lb      0.010166
      nh_white_ub      0.040081
      other_mean       0.055774
      other_std        0.017897
      other_lb         0.036245
      other_ub         0.088741
      race             hispanic
      Name: 1, dtype: object
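
   The ``_lb`` and ``_ub`` columns give a rough sense of how uncertain each prediction is. A minimal sketch, assuming only pandas: it rebuilds a few columns of the output above by hand (so it runs without the model) and uses a hypothetical helper, ``interval_width``, that is not part of ethnicolr, to compute the width of the interval around the predicted category.

   ::

      import pandas as pd

      # Subset of the output above, copied by hand. Sawyer's hispanic bounds
      # are elided in the printout, so they are left missing here.
      odf = pd.DataFrame(
          {
              "last": ["Sawyer", "Torres"],
              "hispanic_lb": [None, 0.823799],
              "hispanic_ub": [None, 0.937612],
              "nh_white_lb": [0.553913, 0.010166],
              "nh_white_ub": [0.733201, 0.040081],
              "race": ["nh_white", "hispanic"],
          }
      )


      def interval_width(row):
          """Width of the interval for the row's predicted category."""
          category = row["race"]
          return row[f"{category}_ub"] - row[f"{category}_lb"]


      odf["race_width"] = odf.apply(interval_width, axis=1)
      print(odf[["last", "race", "race_width"]])

   A wide interval relative to the mean is one possible signal that a prediction deserves a closer look.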


-  **pred\_nc\_reg\_name(df, namecol, num\_iter=100, conf\_int=1.0)**

   -  What it does:

      -  Removes extra space.
      -  Uses the `full name NC
         model <ethnicolr/models/ethnicolr_keras_lstm_nc_12_cat_model.ipynb>`__
         to predict the race and ethnicity.

   +--------------+-----------------------------------------------------------------------------------------------------------------------------------+
   | Parameters   |                                                                                                                                   |
   +==============+===================================================================================================================================+
   |              | **df** : *{DataFrame, csv}* Pandas DataFrame or CSV file containing the names of the individuals to be classified                 |
   +--------------+-----------------------------------------------------------------------------------------------------------------------------------+
   |              | **namecol** : *{string, list}* string or list giving the name or location of the columns containing the first and last names      |
   +--------------+-----------------------------------------------------------------------------------------------------------------------------------+
   |              | **num\_iter** : *int, default=100* number of iterations to calculate uncertainty                                                  |
   +--------------+-----------------------------------------------------------------------------------------------------------------------------------+
   |              | **conf\_int** : *float, default=1.0* confidence interval                                                                          |
   +--------------+-----------------------------------------------------------------------------------------------------------------------------------+


   -  Output: Appends the following columns to the pandas DataFrame or CSV:
      race + ethnicity. The codebook is
      `here <https://github.com/appeler/nc_race_ethnicity>`__. For each
      category, it provides the mean, standard error, and lower and upper
      bounds of the confidence interval.

   ::

      >>> import pandas as pd

      >>> names = [
      ...             {"last": "hernandez", "first": "hector", "true_race": "HL+O"},
      ...             {"last": "zhang", "first": "simon", "true_race": "NL+A"},
      ...         ]

      >>> df = pd.DataFrame(names)

      >>> from ethnicolr import pred_nc_reg_name

      >>> odf = pred_nc_reg_name(df, 'last','first', conf_int=0.9)
      ['HL+A', 'HL+B', 'HL+I', 'HL+M', 'HL+O', 'HL+W', 'NL+A', 'NL+B', 'NL+I', 'NL+M', 'NL+O', 'NL+W']

      >>> odf
            last   first true_race            __name     HL+A_mean  HL+A_std       HL+A_lb       HL+A_ub     HL+B_mean  HL+B_std       HL+B_lb       HL+B_ub  HL+I_mean  ...     NL+M_mean  NL+M_std       NL+M_lb       NL+M_ub  NL+O_mean  NL+O_std   NL+O_lb   NL+O_ub  NL+W_mean  NL+W_std   NL+W_lb   NL+W_ub  race
      0  hernandez  hector      HL+O  Hernandez Hector  2.727371e-13       0.0  2.727372e-13  2.727372e-13  6.542178e-04       0.0  6.542183e-04  6.542183e-04   0.000032  ...  7.863581e-06       0.0  7.863589e-06  7.863589e-06   0.184513       0.0  0.184514  0.184514   0.001256       0.0  0.001256  0.001256  HL+O
      1      zhang   simon      NL+A       Zhang Simon  1.985421e-06       0.0  1.985423e-06  1.985423e-06  8.708256e-09       0.0  8.708265e-09  8.708265e-09   0.000049  ...  1.446786e-07       0.0  1.446784e-07  1.446784e-07   0.003238       0.0  0.003238  0.003238   0.000154       0.0  0.000154  0.000154  NL+A

      [2 rows x 53 columns]

      >>> odf.iloc[0]
      last                hernandez
      first                  hector
      true_race                HL+O
      __name       Hernandez Hector
      HL+A_mean                 0.0
      HL+A_std                  0.0
      HL+A_lb                   0.0
      HL+A_ub                   0.0
      HL+B_mean            0.000654
      HL+B_std                  0.0
      HL+B_lb              0.000654
      HL+B_ub              0.000654
      HL+I_mean            0.000032
      HL+I_std                  0.0
      HL+I_lb              0.000032
      HL+I_ub              0.000032
      HL+M_mean            0.000541
      HL+M_std                  0.0
      HL+M_lb              0.000541
      HL+M_ub              0.000541
      HL+O_mean             0.58944
      HL+O_std                  0.0
      HL+O_lb               0.58944
      HL+O_ub               0.58944
      HL+W_mean            0.221309
      HL+W_std                  0.0
      HL+W_lb              0.221309
      HL+W_ub              0.221309
      NL+A_mean            0.000044
      NL+A_std                  0.0
      NL+A_lb              0.000044
      NL+A_ub              0.000044
      NL+B_mean            0.002199
      NL+B_std                  0.0
      NL+B_lb              0.002199
      NL+B_ub              0.002199
      NL+I_mean            0.000004
      NL+I_std                  0.0
      NL+I_lb              0.000004
      NL+I_ub              0.000004
      NL+M_mean            0.000008
      NL+M_std                  0.0
      NL+M_lb              0.000008
      NL+M_ub              0.000008
      NL+O_mean            0.184513
      NL+O_std                  0.0
      NL+O_lb              0.184514
      NL+O_ub              0.184514
      NL+W_mean            0.001256
      NL+W_std                  0.0
      NL+W_lb              0.001256
      NL+W_ub              0.001256
      race                     HL+O
      Name: 0, dtype: object
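
   Each label joins two codes with ``+``. A minimal sketch of separating them with pandas, assuming (from the label format, not from the codebook itself) that the part before ``+`` is the ethnicity code and the part after it is the race code; the column names ``ethnicity_code`` and ``race_code`` are illustrative:

   ::

      import pandas as pd

      # Predicted labels as they appear in the race column above.
      labels = pd.Series(["HL+O", "NL+A"], name="race")

      # Split once on the literal "+" into two columns.
      codes = labels.str.split("+", n=1, expand=True)
      codes.columns = ["ethnicity_code", "race_code"]
      print(codes)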



Application
--------------

To illustrate how the package can be used, we impute the race of the campaign contributors recorded by the FEC for the years 2000 and 2010 and tally campaign contributions by race.

- `Contrib 2000/2010 using census_ln <ethnicolr/examples/ethnicolr_app_contrib20xx-census_ln.ipynb>`__
- `Contrib 2000/2010 using pred_census_ln <ethnicolr/examples/ethnicolr_app_contrib20xx.ipynb>`__
- `Contrib 2000/2010 using pred_fl_reg_name <ethnicolr/examples/ethnicolr_app_contrib20xx-fl_reg.ipynb>`__
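
A minimal sketch of the tallying step, assuming a DataFrame that already has an imputed ``race`` column (for example, from one of the prediction functions above) and a hypothetical ``amount`` column; the values are made up for illustration and are not FEC data:

::

   import pandas as pd

   # Made-up contributions with an already imputed race column.
   contrib = pd.DataFrame(
       {
           "amount": [100.0, 250.0, 50.0],
           "race": ["nh_white", "hispanic", "nh_white"],
       }
   )

   # Total contributions per imputed category.
   totals = contrib.groupby("race")["amount"].sum()
   print(totals)

Weighting each contribution by the ``_mean`` probabilities instead of the hard label is an alternative that carries the model's uncertainty into the totals.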

Data on the race of all the people in the `DIME data <https://data.stanford.edu/dime>`__ are posted `here <http://dx.doi.org/10.7910/DVN/M5K7VR>`__. The underlying Python scripts are posted `here <https://github.com/appeler/dime_race>`__.

Data
----------

We use the last-name--race data from the `2000
census <http://www.census.gov/topics/population/genealogy/data/2000_surnames.html>`__
and `2010
census <http://www.census.gov/topics/population/genealogy/data/2010_surnames.html>`__,
the `Wikipedia data <ethnicolr/data/wiki/>`__ collected by Skiena and colleagues,
and the Florida voter registration data from early 2017.

-  `Census <ethnicolr/data/census/>`__
-  `The Wikipedia dataset <ethnicolr/data/wiki/>`__
-  `Florida voter registration database <http://dx.doi.org/10.7910/DVN/UBIG3F>`__

Evaluation
------------------------------------------
1. SCAN Health Plan, a Medicare Advantage plan that serves over 200,000 members throughout California, used the software to better assess racial disparities in health among the people they serve. They had racial data on only about 47% of their members, so they used the software to impute the race of the remaining 53%. On the data they had labels for, they found an AUC of 0.9 and 83% accuracy for the last name model.

2. Evaluation on NC Data: https://github.com/appeler/nc_race_ethnicity
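
The accuracy figure in item 1 is the share of labeled records where the predicted category matches the known one. A minimal sketch of that calculation, assuming an output DataFrame with the known label in ``true_race`` (missing where the label is unknown) and the predicted label in ``race``; the rows are made up for illustration:

::

   import pandas as pd

   odf = pd.DataFrame(
       {
           "true_race": ["nh_white", "hispanic", None, "asian"],
           "race": ["nh_white", "hispanic", "nh_black", "nh_white"],
       }
   )

   # Keep only rows with a known label, then compare label and prediction.
   labeled = odf.dropna(subset=["true_race"])
   accuracy = (labeled["true_race"] == labeled["race"]).mean()
   print(f"accuracy on labeled rows: {accuracy:.2f}")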

Authors
----------

Suriyan Laohaprapanon, Gaurav Sood and Bashar Naji

Contributor Code of Conduct
---------------------------------

The project welcomes contributions from everyone! In fact, it depends on
it. To maintain this welcoming atmosphere, and to collaborate in a fun
and productive way, we expect contributors to the project to abide by
the `Contributor Code of
Conduct <http://contributor-covenant.org/version/1/0/0/>`__.

License
----------

The package is released under the `MIT
License <https://opensource.org/licenses/MIT>`__.

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/appeler/ethnicolr",
    "name": "ethnicolr",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "keywords": "race ethnicity names",
    "author": "Suriyan Laohaprapanon, Gaurav Sood, Bashar Naji",
    "author_email": "suriyant@gmail.com, gsood07@gmail.com, balkuwai@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/f7/05/073f62a75773d4f67ab3e86079d9e0ecc0ca5200a4164c8e4a4fc1395496/ethnicolr-0.9.6.tar.gz",
    "platform": null,
    "description": "ethnicolr: Predict Race and Ethnicity From Name\n----------------------------------------------------\n\n.. image:: https://github.com/appeler/ethnicolr/workflows/test/badge.svg\n    :target: https://github.com/appeler/ethnicolr/actions?query=workflow%3Atest\n.. image:: https://img.shields.io/pypi/v/ethnicolr.svg\n    :target: https://pypi.python.org/pypi/ethnicolr\n.. image:: https://anaconda.org/soodoku/ethnicolr/badges/version.svg\n    :target: https://anaconda.org/soodoku/ethnicolr/\n.. image:: https://pepy.tech/badge/ethnicolr\n    :target: https://pepy.tech/project/ethnicolr\n\nWe exploit the US census data, the Florida voting registration data, and \nthe Wikipedia data collected by Skiena and colleagues, to predict race\nand ethnicity based on first and last name or just the last name. The granularity \nat which we predict the race depends on the dataset. For instance, \nSkiena et al.' Wikipedia data is at the ethnic group level, while the \ncensus data we use in the model (the raw data has additional categories of \nNative Americans and Bi-racial) merely categorizes between Non-Hispanic Whites, \nNon-Hispanic Blacks, Asians, and Hispanics.\n\nStreamlit\n-----------\nStreamlit App: https://appeler-ethnicolr-streamlitstreamlit-app-qek30c.streamlit.app/\n\nCaveats and Notes\n-----------------------\n\nIf you picked a person at random with the last name 'Smith' in the US in 2010 and asked us to guess this person's race (as measured by the census), the best guess would be based on what is available from the aggregated Census file. It is the Bayes Optimal Solution. So what good are last-name-only predictive models for? A few things---if you want to impute race and ethnicity for last names that are not in the census file, infer the race and ethnicity in different years than when the census was conducted (if some assumptions hold), infer the race of people in different countries (if some assumptions hold), etc. 
The biggest benefit comes in cases where both the first name and last name are known.\n\nInstall\n----------\n\nWe strongly recommend installing `ethnicolor` inside a Python virtual environment\n(see `venv documentation <https://docs.python.org/3/library/venv.html#creating-virtual-environments>`__)\n\n::\n\n    pip install ethnicolr\n\nOr \n\n::\n   \n   conda install -c soodoku ethnicolr \n\nNotes:\n\n - The models are run and verified on TensorFlow 2.x using Python 3.7 and 3.8.\n - If you install on Windows, Theano installation typically needs admin. privileges on the shell.\n\nGeneral API\n------------------\n\nTo see the available command line options for any function, please type in \n``<function-name> --help``\n\n::\n\n   # census_ln --help\n   usage: census_ln [-h] [-y {2000,2010}] [-o OUTPUT] -l LAST input\n\n   Appends Census columns by last name\n\n   positional arguments:\n     input                 Input file\n\n   optional arguments:\n     -h, --help            show this help message and exit\n     -y {2000,2010}, --year {2000,2010}\n                           Year of Census data (default=2000)\n     -o OUTPUT, --output OUTPUT\n                           Output file with Census data columns\n     -l LAST, --last LAST  Name of the column containing the last name\n\n\nExamples\n----------\n\nTo append census data from 2010 to a `file with column header in the first row <ethnicolr/data/input-with-header.csv>`__, specify the column name carrying last names using the ``-l`` option, keeping the rest the same:\n\n::\n\n   census_ln -y 2010 -o output-census2010.csv -l last_name input-with-header.csv   \n\n\nTo predict race/ethnicity using `Wikipedia full name model <ethnicolr/models/ethnicolr_keras_lstm_wiki_name.ipynb>`__, specify the column name of last name and first name by using ``-l`` and ``-f`` flags respectively.\n\n::\n\n   pred_wiki_name -o output-wiki-pred-race.csv -l last_name -f first_name input-with-header.csv\n\n\nFunctions\n----------\n\nWe 
expose 6 functions, each of which either take a pandas DataFrame or a\nCSV.\n\n- **census\\_ln(df, lname_col, year=2000)**\n\n  -  What it does:\n\n     - Removes extra space\n     - For names in the `census file <ethnicolr/data/census>`__, it appends \n       relevant data of what probability the name provided is of a certain race/ethnicity\n\n +------------+--------------------------------------------------------------------------------------------------------------------------+\n | Parameters |                                                                                                                          |\n +============+==========================================================================================================================+\n |            | **df** : *{DataFrame, csv}* Pandas dataframe of CSV file contains the names of the individual to be inferred             |\n +------------+--------------------------------------------------------------------------------------------------------------------------+\n |            | **lname_col** : *{string}* name of the column containing the last name                                                   |\n +------------+--------------------------------------------------------------------------------------------------------------------------+\n |            | **Year** : *{2000, 2010}, default=2000* year of census to use                                                            |\n +------------+--------------------------------------------------------------------------------------------------------------------------+\n\n\n-  Output: Appends the following columns to the pandas DataFrame or CSV: \n   pctwhite, pctblack, pctapi, pctaian, pct2prace, pcthispanic. 
\n   See `here <https://github.com/appeler/ethnicolr/blob/master/ethnicolr/data/census/census_2000.pdf>`__ \n   for what the column names mean.\n\n   ::\n\n      >>> import pandas as pd\n\n      >>> from ethnicolr import census_ln, pred_census_ln\n\n      >>> names = [{'name': 'smith'},\n      ...         {'name': 'zhang'},\n      ...         {'name': 'jackson'}]\n\n      >>> df = pd.DataFrame(names)\n\n      >>> df\n            name\n      0    smith\n      1    zhang\n      2  jackson\n\n      >>> census_ln(df, 'name')\n            name pctwhite pctblack pctapi pctaian pct2prace pcthispanic\n      0    smith    73.35    22.22   0.40    0.85      1.63        1.56\n      1    zhang     0.61     0.09  98.16    0.02      0.96        0.16\n      2  jackson    41.93    53.02   0.31    1.04      2.18        1.53\n\n\n-  **pred\\_census\\_ln(df, lname_col, year=2000, num\\_iter=100, conf\\_int=1.0)**\n\n   -  What it does:\n\n      -  Removes extra space.\n      -  Uses the `last name census 2000 \n         model <ethnicolr/models/ethnicolr_keras_lstm_census2000_ln.ipynb>`__ or \n         `last name census 2010 model <ethnicolr/models/ethnicolr_keras_lstm_census2010_ln.ipynb>`__ \n         to predict the race and ethnicity.\n\n\n   +--------------+---------------------------------------------------------------------------------------------------------------------+\n   | Parameters   |                                                                                                                     |\n   +==============+=====================================================================================================================+\n   |              | **df** : *{DataFrame, csv}* Pandas dataframe of CSV file contains the names of the individual to be inferred        |\n   +--------------+---------------------------------------------------------------------------------------------------------------------+\n   |              | **namecol** : *{string}* name of the 
column containing the last name                                                |\n   +--------------+---------------------------------------------------------------------------------------------------------------------+\n   |              | **year** : *{2000, 2010}, default=2000* year of census to use                                                       |\n   +--------------+---------------------------------------------------------------------------------------------------------------------+\n   |              | **num\\_iter** : *int, default=100* number of iterations to calculate uncertainty in model                           |\n   +--------------+---------------------------------------------------------------------------------------------------------------------+\n   |              | **conf\\_int** : *float, default=1.0* confidence interval in predicted class                                         |\n   +--------------+---------------------------------------------------------------------------------------------------------------------+\n\n\n   -  Output: Appends the following columns to the pandas DataFrame or CSV:\n      race (white, black, asian, or hispanic), api (percentage chance\n      asian), black, hispanic, white. 
For each race it will provide the\n      mean, standard error, lower & upper bound of confidence interval\n\n   *(Using the same dataframe from example above)*\n   ::\n\n         >>> census_ln(df, 'name')\n               name pctwhite pctblack pctapi pctaian pct2prace pcthispanic\n         0    smith    73.35    22.22   0.40    0.85      1.63        1.56\n         1    zhang     0.61     0.09  98.16    0.02      0.96        0.16\n         2  jackson    41.93    53.02   0.31    1.04      2.18        1.53\n\n         >>> census_ln(df, 'name', 2010)\n               name   race pctwhite pctblack pctapi pctaian pct2prace pcthispanic\n         0    smith  white     70.9    23.11    0.5    0.89      2.19         2.4\n         1    zhang    api     0.99     0.16  98.06    0.02      0.62        0.15\n         2  jackson  black    39.89    53.04   0.39    1.06      3.12         2.5\n\n         >>> pred_census_ln(df, 'name')\n               name   race       api     black  hispanic     white\n         0    smith  white  0.002019  0.247235  0.014485  0.736260\n         1    zhang    api  0.997807  0.000149  0.000470  0.001574\n         2  jackson  black  0.002797  0.528193  0.014605  0.454405\n\n\n-  **pred\\_wiki\\_ln( df, lname_col, num\\_iter=100, conf\\_int=1.0)**\n\n   -  What it does:\n\n      -  Removes extra space.\n      -  Uses the `last name wiki\n         model <ethnicolr/models/ethnicolr_keras_lstm_wiki_ln.ipynb>`__ to\n         predict the race and ethnicity.\n\n\n   +--------------+---------------------------------------------------------------------------------------------------------------------+\n   | Parameters   |                                                                                                                     |\n   +==============+=====================================================================================================================+\n   |              | **df** : *{DataFrame, csv}* Pandas dataframe of CSV file contains the 
names of the individual to be inferred        |\n   +--------------+---------------------------------------------------------------------------------------------------------------------+\n   |              | **lname_col** : *{string}* name of the column containing the last name                                              |\n   +--------------+---------------------------------------------------------------------------------------------------------------------+\n   |              | **num\\_iter** : *int, default=100* number of iterations to calculate uncertainty in model                           |\n   +--------------+---------------------------------------------------------------------------------------------------------------------+\n   |              | **conf\\_int** : *float, default=1.0* confidence interval in predicted class                                         |\n   +--------------+---------------------------------------------------------------------------------------------------------------------+\n\n\n   -  Output: Appends the following columns to the pandas DataFrame or CSV:\n      race (categorical variable --- category with the highest probability). \n      For each race it will provide the mean, standard error, lower & upper\n      bound of confidence interval\n      \n   ::\n\n      \"Asian,GreaterEastAsian,EastAsian\",\n      \"Asian,GreaterEastAsian,Japanese\", \"Asian,IndianSubContinent\",\n      \"GreaterAfrican,Africans\", \"GreaterAfrican,Muslim\",\n      \"GreaterEuropean,British\",\"GreaterEuropean,EastEuropean\",\n      \"GreaterEuropean,Jewish\",\"GreaterEuropean,WestEuropean,French\",\n      \"GreaterEuropean,WestEuropean,Germanic\",\"GreaterEuropean,WestEuropean,Hispanic\",\n      \"GreaterEuropean,WestEuropean,Italian\",\"GreaterEuropean,WestEuropean,Nordic\".\n\n   ::\n\n      >>> import pandas as pd\n\n      >>> names = [\n      ...             
{\"last\": \"smith\", \"first\": \"john\", \"true_race\": \"GreaterEuropean,British\"},\n      ...             {\n      ...                 \"last\": \"zhang\",\n      ...                 \"first\": \"simon\",\n      ...                 \"true_race\": \"Asian,GreaterEastAsian,EastAsian\",\n      ...             },\n      ...         ]\n      >>> df = pd.DataFrame(names)\n\n      >>> from ethnicolr import pred_wiki_ln, pred_wiki_name\n\n      >>> odf = pred_wiki_ln(df,'last', conf_int=0.9)\n      ['Asian,GreaterEastAsian,EastAsian', 'Asian,GreaterEastAsian,Japanese', 'Asian,IndianSubContinent', 'GreaterAfrican,Africans', 'GreaterAfrican,Muslim', 'GreaterEuropean,British', 'GreaterEuropean,EastEuropean', 'GreaterEuropean,Jewish', 'GreaterEuropean,WestEuropean,French', 'GreaterEuropean,WestEuropean,Germanic', 'GreaterEuropean,WestEuropean,Hispanic', 'GreaterEuropean,WestEuropean,Italian', 'GreaterEuropean,WestEuropean,Nordic']\n      \n      >>> odf\n         last  first                         true_race  ...  GreaterEuropean,WestEuropean,Nordic_lb  GreaterEuropean,WestEuropean,Nordic_ub                              race\n      0  Smith   john           GreaterEuropean,British                               0.016103  ...                                 0.014135                                0.007382                                0.048828           GreaterEuropean,British\n      1  Zhang  simon  Asian,GreaterEastAsian,EastAsian                               0.863391  ...                                 
0.017452                                0.001844                                0.027252  Asian,GreaterEastAsian,EastAsian\n\n      [2 rows x 56 columns]\n      \n      >>> odf.iloc[0, :8]\n      last                                                       Smith\n      first                                                       john\n      true_race                                GreaterEuropean,British\n      Asian,GreaterEastAsian,EastAsian_mean                   0.016103\n      Asian,GreaterEastAsian,EastAsian_std                    0.009735\n      Asian,GreaterEastAsian,EastAsian_lb                     0.005873\n      Asian,GreaterEastAsian,EastAsian_ub                     0.034637\n      Asian,GreaterEastAsian,Japanese_mean                    0.003814\n      Name: 0, dtype: object\n\n\n\n\n-  **pred\\_wiki\\_name(df, namecol, num\\_iter=100, conf\\_int=1.0)**\n\n   -  What it does:\n\n      -  Removes extra space.\n      -  Uses the `full name wiki\n         model <ethnicolr/models/ethnicolr_keras_lstm_wiki_name.ipynb>`__\n         to predict the race and ethnicity.\n\n   +--------------+----------------------------------------------------------------------------------------------------------------+\n   | Parameters   |                                                                                                                |\n   +==============+================================================================================================================+\n   |              | **df** : *{DataFrame, csv}* Pandas dataframe of CSV file contains the names of the individual to be inferred   |\n   +--------------+----------------------------------------------------------------------------------------------------------------+\n   |              | **namecol** : *{string}* name of the column containing the name.                                               
|\n   +--------------+----------------------------------------------------------------------------------------------------------------+\n   |              | **num\\_iter** : *int, default=100* number of iterations to calculate uncertainty of predictions                |\n   +--------------+----------------------------------------------------------------------------------------------------------------+\n   |              | **conf\\_int** : *float, default=1.0* confidence interval                                                       |\n   +--------------+----------------------------------------------------------------------------------------------------------------+\n\n\n\n   -  Output: Appends the following columns to the pandas DataFrame or CSV:\n      race (categorical variable---category with the highest probability),\n      \"Asian,GreaterEastAsian,EastAsian\",\n      \"Asian,GreaterEastAsian,Japanese\", \"Asian,IndianSubContinent\",\n      \"GreaterAfrican,Africans\", \"GreaterAfrican,Muslim\",\n      \"GreaterEuropean,British\",\"GreaterEuropean,EastEuropean\",\n      \"GreaterEuropean,Jewish\",\"GreaterEuropean,WestEuropean,French\",\n      \"GreaterEuropean,WestEuropean,Germanic\",\"GreaterEuropean,WestEuropean,Hispanic\",\n      \"GreaterEuropean,WestEuropean,Italian\",\"GreaterEuropean,WestEuropean,Nordic\".\n      For each race it will provide the mean, standard error, lower & upper\n      bound of confidence interval\n\n   *(Using the same dataframe from example above)*\n   ::\n\n      >>> odf = pred_wiki_name(df,'last', 'first', conf_int=0.9)\n      ['Asian,GreaterEastAsian,EastAsian', 'Asian,GreaterEastAsian,Japanese', 'Asian,IndianSubContinent', 'GreaterAfrican,Africans', 'GreaterAfrican,Muslim', 'GreaterEuropean,British', 'GreaterEuropean,EastEuropean', 'GreaterEuropean,Jewish', 'GreaterEuropean,WestEuropean,French', 'GreaterEuropean,WestEuropean,Germanic', 'GreaterEuropean,WestEuropean,Hispanic', 'GreaterEuropean,WestEuropean,Italian', 
'GreaterEuropean,WestEuropean,Nordic']\n\n      >>> odf\n         last  first                         true_race       __name  Asian,GreaterEastAsian,EastAsian_mean  ...  GreaterEuropean,WestEuropean,Nordic_mean  GreaterEuropean,WestEuropean,Nordic_std  GreaterEuropean,WestEuropean,Nordic_lb  GreaterEuropean,WestEuropean,Nordic_ub                              race\n      0  Smith   john           GreaterEuropean,British   Smith John                               0.004111  ...                                  0.006246                                 0.004760                                0.001048                                0.016288           GreaterEuropean,British\n      1  Zhang  simon  Asian,GreaterEastAsian,EastAsian  Zhang Simon                               0.944203  ...                                  0.000793                                 0.002557                                0.000019                                0.002470  Asian,GreaterEastAsian,EastAsian\n\n      [2 rows x 57 columns]\n\n      >>> odf.iloc[0,:8]\n      last                                                       Smith\n      first                                                       john\n      true_race                                GreaterEuropean,British\n      __name                                                Smith John\n      Asian,GreaterEastAsian,EastAsian_mean                   0.004111\n      Asian,GreaterEastAsian,EastAsian_std                    0.002929\n      Asian,GreaterEastAsian,EastAsian_lb                     0.001356\n      Asian,GreaterEastAsian,EastAsian_ub                     0.010571\n      Name: 0, dtype: object\n\n\n-  **pred\\_fl\\_reg\\_ln(df, lname_col, num\\_iter=100, conf\\_int=1.0)**\n\n   -  What it does?:\n\n      -  Removes extra space, if there.\n      -  Uses the `last name FL registration\n         model <ethnicolr/models/ethnicolr_keras_lstm_fl_voter_ln.ipynb>`__\n         to predict the race and ethnicity.\n\n   
+--------------+---------------------------------------------------------------------------------------------------------------------+\n   | Parameters   |                                                                                                                     |\n   +==============+=====================================================================================================================+\n   |              | **df** : *{DataFrame, csv}* Pandas dataframe or CSV file listing the names of the individuals to be inferred        |\n   +--------------+---------------------------------------------------------------------------------------------------------------------+\n   |              | **lname_col** : *{string}* name of the column containing the last name                                              |\n   +--------------+---------------------------------------------------------------------------------------------------------------------+\n   |              | **num\\_iter** : *int, default=100* number of iterations to calculate the uncertainty                                |\n   +--------------+---------------------------------------------------------------------------------------------------------------------+\n   |              | **conf\\_int** : *float, default=1.0* confidence interval                                                            |\n   +--------------+---------------------------------------------------------------------------------------------------------------------+\n\n\n\n   -  Output: Appends the following columns to the pandas DataFrame or CSV:\n      race (white, black, asian, or hispanic), asian (percentage chance\n      Asian), hispanic, nh\\_black, nh\\_white. For each race, it provides\n      the mean, standard error, and lower and upper bounds of the confidence\n      interval.\n\n   ::\n\n      >>> import pandas as pd\n\n      >>> names = [\n      ...             
{\"last\": \"sawyer\", \"first\": \"john\", \"true_race\": \"nh_white\"},\n      ...             {\"last\": \"torres\", \"first\": \"raul\", \"true_race\": \"hispanic\"},\n      ...         ]\n      \n      >>> df = pd.DataFrame(names)\n\n      >>> from ethnicolr import pred_fl_reg_ln, pred_fl_reg_name, pred_fl_reg_ln_five_cat, pred_fl_reg_name_five_cat\n\n      >>> odf = pred_fl_reg_ln(df, 'last', conf_int=0.9)\n      ['asian', 'hispanic', 'nh_black', 'nh_white']\n\n      >>> odf\n         last first true_race  asian_mean  asian_std  asian_lb  asian_ub  hispanic_mean  hispanic_std  hispanic_lb  hispanic_ub  nh_black_mean  nh_black_std  nh_black_lb  nh_black_ub  nh_white_mean  nh_white_std  nh_white_lb  nh_white_ub      race\n      0  Sawyer  john  nh_white    0.009859   0.006819  0.005338  0.019673       0.021488      0.004602     0.014802     0.030148       0.180929      0.052784     0.105756     0.270238       0.787724      0.051082     0.705290     0.860286  nh_white\n      1  Torres  raul  hispanic    0.006463   0.001985  0.003915  0.010146       0.878119      0.021998     0.839274     0.909151       0.013118      0.005002     0.007364     0.021633       0.102300      0.017828     0.075911     0.130929  hispanic\n\n      [2 rows x 20 columns]\n\n      >>> odf.iloc[0]\n      last               Sawyer\n      first                john\n      true_race        nh_white\n      asian_mean       0.009859\n      asian_std        0.006819\n      asian_lb         0.005338\n      asian_ub         0.019673\n      hispanic_mean    0.021488\n      hispanic_std     0.004602\n      hispanic_lb      0.014802\n      hispanic_ub      0.030148\n      nh_black_mean    0.180929\n      nh_black_std     0.052784\n      nh_black_lb      0.105756\n      nh_black_ub      0.270238\n      nh_white_mean    0.787724\n      nh_white_std     0.051082\n      nh_white_lb       0.70529\n      nh_white_ub      0.860286\n      race             nh_white\n      Name: 0, dtype: object\n\n\n-  
**pred\\_fl\\_reg\\_name(df, lname_col, fname_col, num\\_iter=100, conf\\_int=1.0)**\n\n   -  What it does:\n\n      -  Removes extra space.\n      -  Uses the `full name FL\n         model <ethnicolr/models/ethnicolr_keras_lstm_fl_voter_name.ipynb>`__\n         to predict the race and ethnicity.\n\n   +--------------+-------------------------------------------------------------------------------------------------------------------+\n   | Parameters   |                                                                                                                   |\n   +==============+===================================================================================================================+\n   |              | **df** : *{DataFrame, csv}* Pandas dataframe or CSV file listing the names of the individuals to be inferred      |\n   +--------------+-------------------------------------------------------------------------------------------------------------------+\n   |              | **namecol** : *{list}* name of the column containing the name.                                                    
|\n   +--------------+-------------------------------------------------------------------------------------------------------------------+\n   |              | **num\\_iter** : *int, default=100* number of iterations to calculate the uncertainty                              |\n   +--------------+-------------------------------------------------------------------------------------------------------------------+\n   |              | **conf\\_int** : *float, default=1.0* confidence interval in predicted class                                       |\n   +--------------+-------------------------------------------------------------------------------------------------------------------+\n\n\n   -  Output: Appends the following columns to the pandas DataFrame or CSV:\n      race (white, black, asian, or hispanic), asian (percentage chance\n      Asian), hispanic, nh\\_black, nh\\_white. For each race it will provide\n      the mean, standard error, lower & upper bound of confidence interval\n\n   \n   *(Using the same dataframe from example above)*\n   ::\n\n      >>> odf = pred_fl_reg_name(df, 'last', 'first', conf_int=0.9)\n      ['asian', 'hispanic', 'nh_black', 'nh_white']\n\n      >>> odf\n         last first true_race  asian_mean  asian_std  asian_lb  asian_ub  hispanic_mean  hispanic_std  hispanic_lb  hispanic_ub  nh_black_mean  nh_black_std  nh_black_lb  nh_black_ub  nh_white_mean  nh_white_std  nh_white_lb  nh_white_ub      race\n      0  Sawyer  john  nh_white    0.001534   0.000850  0.000636  0.002691       0.006818      0.002557     0.003684     0.011660       0.028068      0.015095     0.011488     0.055149       0.963581      0.015738     0.935445     0.983224  nh_white\n      1  Torres  raul  hispanic    0.005791   0.002906  0.002446  0.011748       0.890561      0.029581     0.841328     0.937706       0.011397      0.004682     0.005829     0.020796       0.092251      0.026675     0.049868     0.139210  hispanic\n\n      >>> odf.iloc[1]\n      last        
       Torres\n      first                raul\n      true_race        hispanic\n      asian_mean       0.005791\n      asian_std        0.002906\n      asian_lb         0.002446\n      asian_ub         0.011748\n      hispanic_mean    0.890561\n      hispanic_std     0.029581\n      hispanic_lb      0.841328\n      hispanic_ub      0.937706\n      nh_black_mean    0.011397\n      nh_black_std     0.004682\n      nh_black_lb      0.005829\n      nh_black_ub      0.020796\n      nh_white_mean    0.092251\n      nh_white_std     0.026675\n      nh_white_lb      0.049868\n      nh_white_ub       0.13921\n      race             hispanic\n      Name: 1, dtype: object\n\n\n-  **pred\\_fl\\_reg\\_ln\\_five\\_cat(df, lname_col, num\\_iter=100, conf\\_int=1.0)**\n\n   -  What it does:\n\n      -  Removes extra space.\n      -  Uses the `last name FL registration\n         model <ethnicolr/models/ethnicolr_keras_lstm_fl_voter_ln_five_cat.ipynb>`__\n         to predict the race and ethnicity.\n\n   +--------------+---------------------------------------------------------------------------------------------------------------------+\n   | Parameters   |                                                                                                                     |\n   +==============+=====================================================================================================================+\n   |              | **df** : *{DataFrame, csv}* Pandas dataframe or CSV file listing the names of the individuals to be inferred        |\n   +--------------+---------------------------------------------------------------------------------------------------------------------+\n   |              | **lname_col** : *{string, list, int}* name or location of the column containing the last name                       |\n   +--------------+---------------------------------------------------------------------------------------------------------------------+\n   |              
| **num\\_iter** : *int, default=100* number of iterations to calculate uncertainty                                    |\n   +--------------+---------------------------------------------------------------------------------------------------------------------+\n   |              | **conf\\_int** : *float, default=1.0* confidence interval                                                            |\n   +--------------+---------------------------------------------------------------------------------------------------------------------+\n\n\n   -  Output: Appends the following columns to the pandas DataFrame or CSV:\n      race (white, black, asian, hispanic or other), asian (percentage\n      chance Asian), hispanic, nh\\_black, nh\\_white, other. For each race\n      it will provide the mean, standard error, lower & upper bound of\n      confidence interval\n\n   *(Using the same dataframe from example above)*\n   ::\n\n      >>> odf = pred_fl_reg_ln_five_cat(df,'last')\n      ['asian', 'hispanic', 'nh_black', 'nh_white', 'other']\n\n      >>> odf\n         last first true_race  asian_mean  asian_std  asian_lb  asian_ub  hispanic_mean  hispanic_std  ...  nh_white_mean  nh_white_std  nh_white_lb  nh_white_ub  other_mean  other_std  other_lb  other_ub      race\n      0  Sawyer  john  nh_white    0.100038   0.020539  0.073266  0.143334       0.044263      0.013077  ...       0.376639      0.048289     0.296989     0.452834    0.248466   0.021040  0.219721  0.283785  nh_white\n      1  Torres  raul  hispanic    0.062390   0.021863  0.033837  0.103737       0.774414      0.043238  ...       
0.030393      0.009591     0.019713     0.046483    0.117761   0.019524  0.089418  0.150615  hispanic\n\n      [2 rows x 24 columns]\n\n      >>> odf.iloc[0]\n      last               Sawyer\n      first                john\n      true_race        nh_white\n      asian_mean       0.100038\n      asian_std        0.020539\n      asian_lb         0.073266\n      asian_ub         0.143334\n      hispanic_mean    0.044263\n      hispanic_std     0.013077\n      hispanic_lb       0.02476\n      hispanic_ub      0.067965\n      nh_black_mean    0.230593\n      nh_black_std     0.063948\n      nh_black_lb      0.130577\n      nh_black_ub      0.343513\n      nh_white_mean    0.376639\n      nh_white_std     0.048289\n      nh_white_lb      0.296989\n      nh_white_ub      0.452834\n      other_mean       0.248466\n      other_std         0.02104\n      other_lb         0.219721\n      other_ub         0.283785\n      race             nh_white\n      Name: 0, dtype: object\n\n\n-  **pred\\_fl\\_reg\\_name\\_five\\_cat(df, namecol, num\\_iter=100, conf\\_int=1.0)**\n\n   -  What it does:\n\n      -  Removes extra space.\n      -  Uses the `full name FL\n         model <ethnicolr/models/ethnicolr_keras_lstm_fl_voter_ln_five_cat.ipynb>`__\n         to predict the race and ethnicity.\n\n   +--------------+---------------------------------------------------------------------------------------------------------------------------------+\n   | Parameters   |                                                                                                                                 |\n   +==============+=================================================================================================================================+\n   |              | **df** : *{DataFrame, csv}* Pandas dataframe or CSV file listing the names of the individuals to be inferred                    |\n   
+--------------+---------------------------------------------------------------------------------------------------------------------------------+\n   |              | **namecol** : *{string, list}* string or list of the name or location of the column containing the first name, last name.       |\n   +--------------+---------------------------------------------------------------------------------------------------------------------------------+\n   |              | **num\\_iter** : *int, default=100* number of iterations to calculate uncertainty                                                |\n   +--------------+---------------------------------------------------------------------------------------------------------------------------------+\n   |              | **conf\\_int** : *float, default=1.0* confidence interval                                                                        |\n   +--------------+---------------------------------------------------------------------------------------------------------------------------------+\n\n\n   -  Output: Appends the following columns to the pandas DataFrame or CSV:\n      race (white, black, asian, hispanic, or other), asian (percentage\n      chance Asian), hispanic, nh\\_black, nh\\_white, other. For each race\n      it will provide the mean, standard error, lower & upper bound of\n      confidence interval\n\n   *(Using the same dataframe from example above)*\n   ::\n\n      >>> odf = pred_fl_reg_name_five_cat(df, 'last','first')\n      ['asian', 'hispanic', 'nh_black', 'nh_white', 'other']\n\n      >>> odf\n         last first true_race  asian_mean  asian_std  asian_lb  asian_ub  hispanic_mean  hispanic_std  ...  nh_white_mean  nh_white_std  nh_white_lb  nh_white_ub  other_mean  other_std  other_lb  other_ub      race\n      0  Sawyer  john  nh_white    0.039310   0.011657  0.025982  0.059719       0.019737      0.005813  ...       
0.650306      0.059327     0.553913     0.733201    0.192242   0.021004  0.160185  0.226063  nh_white\n      1  Torres  raul  hispanic    0.020086   0.011765  0.008240  0.041741       0.899110      0.042237  ...       0.019073      0.009901     0.010166     0.040081    0.055774   0.017897  0.036245  0.088741  hispanic\n\n      [2 rows x 24 columns]\n\n      >>> odf.iloc[1]\n      last               Torres\n      first                raul\n      true_race        hispanic\n      asian_mean       0.020086\n      asian_std        0.011765\n      asian_lb          0.00824\n      asian_ub         0.041741\n      hispanic_mean     0.89911\n      hispanic_std     0.042237\n      hispanic_lb      0.823799\n      hispanic_ub      0.937612\n      nh_black_mean    0.005956\n      nh_black_std     0.006528\n      nh_black_lb      0.002686\n      nh_black_ub      0.010134\n      nh_white_mean    0.019073\n      nh_white_std     0.009901\n      nh_white_lb      0.010166\n      nh_white_ub      0.040081\n      other_mean       0.055774\n      other_std        0.017897\n      other_lb         0.036245\n      other_ub         0.088741\n      race             hispanic\n      Name: 1, dtype: object\n\n\n-  **pred\\_nc\\_reg\\_name(df, namecol, num\\_iter=100, conf\\_int=1.0)**\n\n   -  What it does:\n\n      -  Removes extra space.\n      -  Uses the `full name NC\n         model <ethnicolr/models/ethnicolr_keras_lstm_nc_12_cat_model.ipynb>`__\n         to predict the race and ethnicity.\n\n   +--------------+-----------------------------------------------------------------------------------------------------------------------------------+\n   | Parameters   |                                                                                                                                   |\n   +==============+===================================================================================================================================+\n   |              | **df** : *{DataFrame, 
csv}* Pandas dataframe or CSV file listing the names of the individuals to be inferred                      |\n   +--------------+-----------------------------------------------------------------------------------------------------------------------------------+\n   |              | **namecol** : *{string, list}* string or list of the name or location of the column containing the first name, last name.         |\n   +--------------+-----------------------------------------------------------------------------------------------------------------------------------+\n   |              | **num\\_iter** : *int, default=100* number of iterations to calculate uncertainty                                                  |\n   +--------------+-----------------------------------------------------------------------------------------------------------------------------------+\n   |              | **conf\\_int** : *float, default=1.0* confidence interval                                                                          |\n   +--------------+-----------------------------------------------------------------------------------------------------------------------------------+\n\n\n   -  Output: Appends the following columns to the pandas DataFrame or CSV:\n      race + ethnicity. The codebook is\n      `here <https://github.com/appeler/nc_race_ethnicity>`__. For each\n      race, it provides the mean, standard error, and lower and upper bounds\n      of the confidence interval.\n\n   ::\n\n      >>> import pandas as pd\n\n      >>> names = [\n      ...             {\"last\": \"hernandez\", \"first\": \"hector\", \"true_race\": \"HL+O\"},\n      ...             {\"last\": \"zhang\", \"first\": \"simon\", \"true_race\": \"NL+A\"},\n      ...         
]\n\n      >>> df = pd.DataFrame(names)\n\n      >>> from ethnicolr import pred_nc_reg_name\n\n      >>> odf = pred_nc_reg_name(df, 'last','first', conf_int=0.9)\n      ['HL+A', 'HL+B', 'HL+I', 'HL+M', 'HL+O', 'HL+W', 'NL+A', 'NL+B', 'NL+I', 'NL+M', 'NL+O', 'NL+W']\n\n      >>> odf\n            last   first true_race            __name     HL+A_mean  HL+A_std       HL+A_lb       HL+A_ub     HL+B_mean  HL+B_std       HL+B_lb       HL+B_ub  HL+I_mean  ...     NL+M_mean  NL+M_std       NL+M_lb       NL+M_ub  NL+O_mean  NL+O_std   NL+O_lb   NL+O_ub  NL+W_mean  NL+W_std   NL+W_lb   NL+W_ub  race\n      0  hernandez  hector      HL+O  Hernandez Hector  2.727371e-13       0.0  2.727372e-13  2.727372e-13  6.542178e-04       0.0  6.542183e-04  6.542183e-04   0.000032  ...  7.863581e-06       0.0  7.863589e-06  7.863589e-06   0.184513       0.0  0.184514  0.184514   0.001256       0.0  0.001256  0.001256  HL+O\n      1      zhang   simon      NL+A       Zhang Simon  1.985421e-06       0.0  1.985423e-06  1.985423e-06  8.708256e-09       0.0  8.708265e-09  8.708265e-09   0.000049  ...  
1.446786e-07       0.0  1.446784e-07  1.446784e-07   0.003238       0.0  0.003238  0.003238   0.000154       0.0  0.000154  0.000154  NL+A\n\n      [2 rows x 53 columns]\n\n      >>> odf.iloc[0]\n      last                hernandez\n      first                  hector\n      true_race                HL+O\n      __name       Hernandez Hector\n      HL+A_mean                 0.0\n      HL+A_std                  0.0\n      HL+A_lb                   0.0\n      HL+A_ub                   0.0\n      HL+B_mean            0.000654\n      HL+B_std                  0.0\n      HL+B_lb              0.000654\n      HL+B_ub              0.000654\n      HL+I_mean            0.000032\n      HL+I_std                  0.0\n      HL+I_lb              0.000032\n      HL+I_ub              0.000032\n      HL+M_mean            0.000541\n      HL+M_std                  0.0\n      HL+M_lb              0.000541\n      HL+M_ub              0.000541\n      HL+O_mean             0.58944\n      HL+O_std                  0.0\n      HL+O_lb               0.58944\n      HL+O_ub               0.58944\n      HL+W_mean            0.221309\n      HL+W_std                  0.0\n      HL+W_lb              0.221309\n      HL+W_ub              0.221309\n      NL+A_mean            0.000044\n      NL+A_std                  0.0\n      NL+A_lb              0.000044\n      NL+A_ub              0.000044\n      NL+B_mean            0.002199\n      NL+B_std                  0.0\n      NL+B_lb              0.002199\n      NL+B_ub              0.002199\n      NL+I_mean            0.000004\n      NL+I_std                  0.0\n      NL+I_lb              0.000004\n      NL+I_ub              0.000004\n      NL+M_mean            0.000008\n      NL+M_std                  0.0\n      NL+M_lb              0.000008\n      NL+M_ub              0.000008\n      NL+O_mean            0.184513\n      NL+O_std                  0.0\n      NL+O_lb              0.184514\n      NL+O_ub              0.184514\n      NL+W_mean            
0.001256\n      NL+W_std                  0.0\n      NL+W_lb              0.001256\n      NL+W_ub              0.001256\n      race                     HL+O\n      Name: 0, dtype: object\n\n\n\nApplication\n--------------\n\nTo illustrate how the package can be used, we impute the race of the campaign contributors recorded by the FEC for the years 2000 and 2010 and tally campaign contributions by race.\n\n- `Contrib 2000/2010 using census_ln <ethnicolr/examples/ethnicolr_app_contrib20xx-census_ln.ipynb>`__\n- `Contrib 2000/2010 using pred_census_ln <ethnicolr/examples/ethnicolr_app_contrib20xx.ipynb>`__\n- `Contrib 2000/2010 using pred_fl_reg_name <ethnicolr/examples/ethnicolr_app_contrib20xx-fl_reg.ipynb>`__\n\nData on the race of all the people in the `DIME data <https://data.stanford.edu/dime>`__ is posted `here <http://dx.doi.org/10.7910/DVN/M5K7VR>`__. The underlying Python scripts are posted `here <https://github.com/appeler/dime_race>`__.\n\nData\n----------\n\nWe use the last-name--race data from the `2000\ncensus <http://www.census.gov/topics/population/genealogy/data/2000_surnames.html>`__\nand `2010\ncensus <http://www.census.gov/topics/population/genealogy/data/2010_surnames.html>`__,\nthe `Wikipedia data <ethnicolr/data/wiki/>`__ collected by Skiena and colleagues,\nand the Florida voter registration data from early 2017.\n\n-  `Census <ethnicolr/data/census/>`__\n-  `The Wikipedia dataset <ethnicolr/data/wiki/>`__\n-  `Florida voter registration database <http://dx.doi.org/10.7910/DVN/UBIG3F>`__\n\nEvaluation\n------------------------------------------\n\n1. SCAN Health Plan, a Medicare Advantage plan that serves over 200,000 members throughout California, used the software to better assess racial disparities in health among the people they serve. They had racial data on only about 47% of their members, so they used it to infer the race of the remaining 53%. 
On the data they had labels for, the last-name model achieved an AUC of 0.9 and 83% accuracy.\n\n2. Evaluation on NC Data: https://github.com/appeler/nc_race_ethnicity\n\nAuthors\n----------\n\nSuriyan Laohaprapanon, Gaurav Sood, and Bashar Naji\n\nContributor Code of Conduct\n---------------------------------\n\nThe project welcomes contributions from everyone! In fact, it depends on\nthem. To maintain this welcoming atmosphere, and to collaborate in a fun\nand productive way, we expect contributors to the project to abide by\nthe `Contributor Code of\nConduct <http://contributor-covenant.org/version/1/0/0/>`__.\n\nLicense\n----------\n\nThe package is released under the `MIT\nLicense <https://opensource.org/licenses/MIT>`__.\n",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Predict Race/Ethnicity Based on Sequence of Characters in the Name",
    "version": "0.9.6",
    "split_keywords": [
        "race",
        "ethnicity",
        "names"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "ae254bbf5c1ee190ab631af0ba9a626d1abcc5ef21e1293f858eec27a7a471cb",
                "md5": "cdcf4d2052886c274fa19ad8af342c06",
                "sha256": "aad1cae80d5cab5853ec9e8141d7cc9ce0b681aa51705184bbf78a71faa00615"
            },
            "downloads": -1,
            "filename": "ethnicolr-0.9.6-py2.py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "cdcf4d2052886c274fa19ad8af342c06",
            "packagetype": "bdist_wheel",
            "python_version": "py2.py3",
            "requires_python": null,
            "size": 39539011,
            "upload_time": "2023-04-17T17:51:49",
            "upload_time_iso_8601": "2023-04-17T17:51:49.822034Z",
            "url": "https://files.pythonhosted.org/packages/ae/25/4bbf5c1ee190ab631af0ba9a626d1abcc5ef21e1293f858eec27a7a471cb/ethnicolr-0.9.6-py2.py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "f705073f62a75773d4f67ab3e86079d9e0ecc0ca5200a4164c8e4a4fc1395496",
                "md5": "daf50e10eaae7a8f654b67070c8d3117",
                "sha256": "f00dfcb3cdc95032b828335c46d04ce26e40cf77d3fd6f2d8f1a61d823d1c212"
            },
            "downloads": -1,
            "filename": "ethnicolr-0.9.6.tar.gz",
            "has_sig": false,
            "md5_digest": "daf50e10eaae7a8f654b67070c8d3117",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 39351332,
            "upload_time": "2023-04-17T17:51:54",
            "upload_time_iso_8601": "2023-04-17T17:51:54.204148Z",
            "url": "https://files.pythonhosted.org/packages/f7/05/073f62a75773d4f67ab3e86079d9e0ecc0ca5200a4164c8e4a4fc1395496/ethnicolr-0.9.6.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-04-17 17:51:54",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "github_user": "appeler",
    "github_project": "ethnicolr",
    "travis_ci": true,
    "coveralls": false,
    "github_actions": true,
    "requirements": [],
    "tox": true,
    "lcname": "ethnicolr"
}
        