Zest Race Predictor
===================
[![Documentation Status](https://readthedocs.org/projects/zrp-docs/badge/?version=latest)](https://zrp-docs.readthedocs.io/en/latest/?badge=latest)
[![image](https://badge.fury.io/py/zrp.svg)](https://badge.fury.io/py/zrp)
[![image](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/zestai/zrp/HEAD)
[![image](https://img.shields.io/pypi/dm/zrp.svg?label=PyPI%20downloads)](https://pypi.org/project/zrp/)
Zest Race Predictor (ZRP) is an open-source machine learning algorithm
that estimates the race/ethnicity of an individual using only their full
name and home address as inputs. ZRP improves upon the most widely used
racial and ethnic data estimation method, Bayesian Improved Surname
Geocoding (BISG), developed by RAND Corporation in 2009.
ZRP was built using ML techniques such as gradient boosting and trained
on voter data from the southeastern U.S. It was then validated on a
national sample using adjusted tract-level American Community Survey
(ACS) data. (Model training procedures are provided.)
***Compared to BISG, ZRP identified:***
* 25% more African-Americans as African-American
* 35% fewer African-Americans as non-African American
* 60% fewer Whites as non-White
ZRP can be used to analyze racial equity and outcomes in critical
spheres such as health care, financial services, criminal justice, or
anywhere there's a need to impute the race or ethnicity of a population
dataset. (Usage examples are included.) The financial services industry,
for example, has struggled for years to achieve more equitable outcomes
amid charges of discrimination in lending practices.
Zest AI began developing ZRP in 2020 to improve the accuracy of our
clients' fair lending analyses by using more data and better math. We
believe ZRP can greatly improve our understanding of the disparate
impact and disparate treatment of protected-status borrowers. Armed with
a better understanding of the disparities that exist in our financial
system, we can highlight inequities and create a roadmap to improve
equity in access to finance.
Notes
=====
This is a preliminary version and implementation of the ZRP tool.
We're committed to improving both the algorithm and the
documentation, and we hope that government agencies, lenders, citizen data
scientists, and other interested parties will help us improve the model.
Details of the model development process can be found in the [model
development documentation](./model_report.rst).
Install
=======
Installation requires an internet connection. The package has been tested on Python 3.7.7 and will likely work with other 3.7.x releases.
Note: Because of the size and number of lookup tables necessary for the zrp
package, a full installation requires 3 GB of available disk space.
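
Before installing, it can help to confirm that the target drive has enough room. The following is a minimal sketch using only the Python standard library; the helper name `has_enough_space` and the 3 GiB threshold are illustrative assumptions based on the note above, not part of the zrp package.

```python
import shutil

# Roughly 3 GB, per the installation note (assumed threshold, not a zrp constant).
REQUIRED_BYTES = 3 * 1024**3


def has_enough_space(path=".", required_bytes=REQUIRED_BYTES):
    """Return True if the filesystem holding `path` has enough free space."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_bytes


if __name__ == "__main__":
    print("Enough free space for ZRP:", has_enough_space())
```

Point `path` at the directory that will hold your virtual environment, since that is where the lookup tables are downloaded.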
### Setting up your virtual environment
We recommend installing zrp
inside a [python virtual
environment](https://docs.python.org/3/library/venv.html#creating-virtual-environments).
Run the following to build your virtual environment:
python3 -m venv /path/to/new/virtual/environment
Activate your virtual environment:
source /path/to/new/virtual/environment/bin/activate
Ex.:
python -m venv /Users/joejones/Documents/ZestAI/zrpvenv
source /Users/joejones/Documents/ZestAI/zrpvenv/bin/activate
### General Installation
pip install zrp
After installing via pip, you need to download the lookup tables and
pipelines using the following command:
python -m zrp download
If you're experiencing issues with installation, please consult our [troubleshooting help](https://github.com/zestai/zrp/blob/main/troubleshooting_help.rst#manually-installing-lookup-tables-and-pipeline-files) page.
### Advanced Installation
*Required only if you are processing the data from scratch instead of using existing ZRP data.*
#### Unix-like systems
pip install fiona
pip install zrp
After installing via pip, you need to download the lookup tables and
pipelines using the following command:
python -m zrp download
#### Windows
pip install pipwin
pipwin install gdal
pipwin install fiona
pip install zrp
After installing via pip, you need to download the lookup tables and
pipelines using the following command:
python -m zrp download
If you're experiencing issues with installation, please consult our [troubleshooting help](https://github.com/zestai/zrp/blob/main/troubleshooting_help.rst#manually-installing-lookup-tables-and-pipeline-files) page.
Data
====
### Training Data
The models available in this package were trained on voter registration
data from the states of Florida, Georgia, and North Carolina. Summary
statistics on these datasets and on the additional datasets used for validation
can be found
[here](https://github.com/zestai/zrp/blob/main/dataset_statistics.txt).
***Consult the following to download state voter registration data:***
* [North Carolina](https://www.ncsbe.gov/results-data/voter-registration-data)
* [Florida](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/UBIG3F)
* [Alabama](https://www.alabamainteractive.org/sos/voter/voterWelcome.action)
* [South Carolina](https://www.scvotes.gov/sale-voter-registration-lists)
* [Georgia](https://sos.ga.gov/index.php/elections/order_voter_registration_lists_and_files)
* [Louisiana](https://www.sos.la.gov/ElectionsAndVoting/BecomeACandidate/PurchaseVoterLists/Pages/default.aspx)
### American Community Survey (ACS) Data
The US Census Bureau explains that "the American Community Survey (ACS)
is an ongoing survey that provides data every year -- giving
communities the current information they need to plan investments and
services. The ACS covers a broad range of topics about social, economic,
demographic, and housing characteristics of the U.S. population. The
5-year estimates from the ACS are "period" estimates that represent
data collected over a period of time. The primary advantage of using
multiyear estimates is the increased statistical reliability of the data
for less populated areas and small population subgroups. The 5-year
estimates are available for all geographies down to the block group
level." (US Census Bureau. "American Community Survey 5-Year Data
(2009-2019)." Census.gov, 8 Dec. 2021,
<https://www.census.gov/data/developers/data-sets/acs-5year.html>.)
ACS data is available in 1-year or 5-year spans. The 5-year ACS data is the most
comprehensive and is available at more granular geographic levels than the
1-year data, so it is used in this work.
## Model Development and Feature Documentation
Details of the model development process can be found in the [model
development documentation](./model_report.rst). Human-readable
feature definitions and feature importances can be found
[here](https://github.com/zestai/zrp/tree/main/zrp/modeling#feature-definitions).
## Usage and Examples
To get started with ZRP, first ensure that the download is complete (as
described above) and that xgboost version 1.0.2 is installed (`xgboost == 1.0.2`).
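
One way to confirm the pinned xgboost version before running ZRP is to query the installed package metadata. This is a minimal sketch, not part of zrp: the helper names are illustrative, and on Python 3.7 it assumes the `importlib_metadata` backport is available because `importlib.metadata` only joined the standard library in Python 3.8.

```python
try:
    from importlib import metadata  # standard library on Python 3.8+
except ImportError:
    import importlib_metadata as metadata  # backport, assumed installed on 3.7

EXPECTED_XGBOOST = "1.0.2"  # version stated in this README


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def xgboost_ok():
    """Return True only if xgboost is installed at the expected version."""
    return installed_version("xgboost") == EXPECTED_XGBOOST


if __name__ == "__main__":
    print("xgboost version:", installed_version("xgboost"))
    print("matches expected:", xgboost_ok())
```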
Check out the guides in the
[examples](https://github.com/zestai/zrp/tree/main/examples) folder.
Clone the repo to obtain the example notebooks and data; these
are not included in the pip-installable package. If you're experiencing
issues, first consult our [troubleshooting help
guide](https://github.com/zestai/zrp/blob/main/troubleshooting_help.rst).
[Here](https://mybinder.org/v2/gh/zestai/zrp/HEAD), we additionally
provide an interactive virtual environment, via Binder, with ZRP
installed. Once you open this link and are taken to the JupyterLab
environment, open a terminal and run the following:
python -m zrp download
Next, we present the primary ways you'll use ZRP.
### ZRP Predictions
**Summary of commands:**
>>> from zrp import ZRP
>>> zest_race_predictor = ZRP()
>>> zest_race_predictor.fit()
>>> zrp_output = zest_race_predictor.transform(input_dataframe)
**Breaking down key commands:**
>>> zest_race_predictor = ZRP()
- **ZRP(pipe\_path=None, support\_files\_path=\"data/processed\",
key=\"ZEST\_KEY\", first\_name=\"first\_name\",
middle\_name=\"middle\_name\", last\_name=\"last\_name\",
house\_number=\"house\_number\",
street\_address=\"street\_address\", city=\"city\", state=\"state\",
zip\_code=\"zip\_code\", race=\'race\', proxy=\"probs\",
census\_tract=None, street\_address\_2=None, name\_prefix=None,
name\_suffix=None, na\_values=None, file\_path=None, geocode=True,
bisg=True, readout=True, n\_jobs=49, year=\"2019\", span=\"5\",
runname=\"test\")**
- What it does:
- Prepares data to generate race & ethnicity proxies
You can find parameter descriptions in the [ZRP
class](https://github.com/zestai/zrp/blob/main/zrp/zrp.py) and its
[parent
class](https://github.com/zestai/zrp/blob/main/zrp/prepare/base.py).
```
>>> zrp_output = zest_race_predictor.transform(input_dataframe)
```
- **zest\_race\_predictor.transform(df)**
- What it does:
- Processes input data and generates ZRP proxy predictions.
- Attempts to predict at the block group level, then the census
tract level, then the zip code level, depending on the level at
which ACS data is found. If no geographic-level data can be found,
the BISG proxy is computed instead. If BISG cannot be computed
either, no prediction is returned.
> -----------------------------------------------------------------------------
> Parameters
> ------------ ----------------------------------------------------------------
> df : {DataFrame} Pandas dataframe containing input data
> (see below for necessary columns)
>
> -----------------------------------------------------------------------------
Input data, **df**, into the prediction/modeling pipeline **MUST**
contain the following columns: first name, middle name, last name, house
number, street address (street name), city, state, zip code, and zest
key. Consult our [troubleshooting help
guide](https://github.com/zestai/zrp/blob/main/troubleshooting_help.rst) to
ensure your input data is in the correct format.
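
The column requirement above can be checked before calling `transform`. The sketch below assumes the default column names shown in the `ZRP(...)` signature (for example `first_name` and `ZEST_KEY`); the helper `missing_columns` is hypothetical and not part of the zrp package.

```python
# Default column names taken from the ZRP(...) signature shown earlier.
REQUIRED_COLUMNS = [
    "first_name",
    "middle_name",
    "last_name",
    "house_number",
    "street_address",
    "city",
    "state",
    "zip_code",
    "ZEST_KEY",
]


def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent, in required order."""
    present = set(columns)
    return [name for name in required if name not in present]


# Illustrative header that lacks three of the required columns.
example_columns = ["first_name", "last_name", "city", "state", "zip_code", "ZEST_KEY"]
print(missing_columns(example_columns))
```

With a pandas dataframe, passing `input_dataframe.columns` to the helper works the same way. If you renamed columns through the constructor arguments, adjust `REQUIRED_COLUMNS` to match.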
- Output: A dataframe with the following columns: AAPI, AIAN, BLACK,
HISPANIC, WHITE, source\_block\_group, source\_census\_tract,
source\_zip\_code, and source\_bisg:
>>> zrp_output
=========== =========== =========== =========== =========== =========== ===================== ====================== ==================
AAPI AIAN BLACK HISPANIC WHITE source_block_group source_census_tract source_zip_code
=========== =========== =========== =========== =========== =========== ===================== ====================== ==================
ZEST_KEY
10 0.021916 0.021960 0.004889 0.012153 0.939082 1.0 0.0 0.0
100 0.009462 0.013033 0.003875 0.008469 0.965162 1.0 0.0 0.0
103 0.107332 0.000674 0.000584 0.021980 0.869429 1.0 0.0 0.0
106 0.177411 0.015208 0.003767 0.041668 0.761946 1.0 0.0 0.0
109 0.000541 0.000416 0.000376 0.000932 0.997736 1.0 0.0 0.0
... ... ... ... ... ... ... ... ...
556 NaN NaN NaN NaN NaN 0.0 0.0 0.0
557 NaN NaN NaN NaN NaN 0.0 0.0 0.0
=========== =========== =========== =========== =========== =========== ===================== ====================== ==================
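
The sample output can be read row by row: the five class columns hold proxy probabilities, and the `source_*` columns appear to flag which geographic level produced them. The sketch below is a hedged illustration using plain dictionaries. Treating a flag value of 1.0 as "this level was used", and taking the highest-probability class as a label, are both assumptions, and `summarize_row` is a hypothetical helper.

```python
import math

RACE_COLUMNS = ["AAPI", "AIAN", "BLACK", "HISPANIC", "WHITE"]
# Source flags as they appear in the sample output above.
SOURCE_COLUMNS = ["source_block_group", "source_census_tract", "source_zip_code"]


def summarize_row(row):
    """Return (most_likely_class, source_level) for one output row (a dict).

    Both values are None when the row carries no prediction (all NaN).
    """
    probs = {c: row[c] for c in RACE_COLUMNS if not math.isnan(row[c])}
    label = max(probs, key=probs.get) if probs else None
    # Assumption: a flag of 1.0 marks the level that supplied the prediction.
    source = next((c for c in SOURCE_COLUMNS if row.get(c) == 1.0), None)
    return label, source


# Rows copied from the sample output (ZEST_KEY 10 and 556).
row_10 = {"AAPI": 0.021916, "AIAN": 0.021960, "BLACK": 0.004889,
          "HISPANIC": 0.012153, "WHITE": 0.939082,
          "source_block_group": 1.0, "source_census_tract": 0.0,
          "source_zip_code": 0.0}
nan = float("nan")
row_556 = {"AAPI": nan, "AIAN": nan, "BLACK": nan, "HISPANIC": nan, "WHITE": nan,
           "source_block_group": 0.0, "source_census_tract": 0.0,
           "source_zip_code": 0.0}

print(summarize_row(row_10))
print(summarize_row(row_556))
```

Rows such as 556 and 557, where every probability is NaN, are the "no prediction returned" case described earlier.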
One of the parameters of the [parent
class](https://github.com/zestai/zrp/blob/main/zrp/prepare/base.py) that
ZRP() inherits from is `file_path`. This parameter lets you specify
where the `artifacts/` folder is written during a ZRP run.
Once the run is complete, the `artifacts/` folder will contain the
output race/ethnicity proxies and additional logs documenting the
validity of the input data. `file_path` **need not** be specified. If it is
not defined, the `artifacts/` folder will be placed in the same
directory as the script running zrp. Subsequent runs will, however,
overwrite the files in `artifacts/`; providing a unique directory path
for `file_path` avoids this.
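
One simple way to give each run its own `file_path` is to create a timestamped directory first. This is a sketch under stated assumptions: `make_run_dir` and the `zrp_run` prefix are illustrative, and the `ZRP(file_path=...)` call is left as a comment because it requires zrp and its lookup tables to be installed.

```python
import tempfile
from datetime import datetime
from pathlib import Path


def make_run_dir(base_dir, prefix="zrp_run"):
    """Create and return a new, uniquely named directory under `base_dir`."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S_%f")
    run_dir = Path(base_dir) / f"{prefix}_{stamp}"
    # exist_ok=False makes an accidental reuse of a directory fail loudly.
    run_dir.mkdir(parents=True, exist_ok=False)
    return run_dir


# A temporary directory stands in for your project directory here.
base = tempfile.mkdtemp()
run_dir = make_run_dir(base)
print("artifacts will be written under:", run_dir)

# With zrp installed, the path would then be passed along, for example:
# zest_race_predictor = ZRP(file_path=str(run_dir))
```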
### ZRP Build
**Summary of commands:**
>>> from zrp.modeling import ZRP_Build
>>> zest_race_predictor_builder = ZRP_Build('/path/to/desired/output/directory')
>>> zest_race_predictor_builder.fit()
>>> zrp_build_output = zest_race_predictor_builder.transform(input_training_data)
**Breaking down key commands:**
>>> zest_race_predictor_builder = ZRP_Build('/path/to/desired/output/directory')
- **ZRP\_Build(file\_path, zrp\_model\_name = \'zrp\_0\',
zrp\_model\_source =\'ct\')**
- What it does:
- Prepares the class that builds the new custom ZRP model.
> -----------------------------------------------------------------------------
> Parameters
> ------------ ----------------------------------------------------------------
> file_path : {str} The path where pipeline, model, and
> supporting data are saved.
>
> zrp_model_name : {str} Name of zrp_model.
>
> zrp_model_source : {str} Indicates the source of
> zrp_modeling data to use.
> -----------------------------------------------------------------------------
>
> You can find more detailed parameter descriptions in the [ZRP\_Build
> class](https://github.com/zestai/zrp/blob/main/zrp/modeling/pipeline_builder.py).
> ZRP\_Build() also inherits initialization parameters from its [parent
> class](https://github.com/zestai/zrp/blob/main/zrp/prepare/base.py).
>>> zrp_build_output = zest_race_predictor_builder.transform(input_training_data)
- **zest\_race\_predictor\_builder.transform(df)**
- What it does:
- Builds a new custom ZRP model trained on user-supplied data
that meets the standard ZRP requirements, including name,
address, and race.
- Produces a custom model-pipeline. The pipeline, model, and
supporting data are saved automatically to
`~/data/experiments/model_source/data/` in the support
files path defined.
- The class assumes data is not broken into train and test
sets, performs this split itself, and outputs predictions on
the test set.
> -----------------------------------------------------------------------------
> Parameters
> ------------ ----------------------------------------------------------------
> df : {DataFrame} Pandas dataframe containing input data
> (see below for necessary columns)
>
> -----------------------------------------------------------------------------
Input data, **df**, into this pipeline **MUST** contain the following
columns: first name, middle name, last name, house number, street
address (street name), city, state, zip code, zest key, and race.
Consult our [troubleshooting help
guide](https://github.com/zestai/zrp/blob/main/troubleshooting_help.rst) to
ensure your input data is in the correct format.
- Output: A dictionary of race & ethnicity probabilities and labels.
As mentioned in the ZRP Predictions section above, once the run is complete,
the `artifacts/` folder will contain the output race/ethnicity
proxies and additional logs documenting the validity of the input data.
As before, `file_path` **need not** be specified, but providing
a unique directory path for `file_path` will avoid overwriting the
`artifacts/` folder. When running ZRP Build, however,
`artifacts/` also contains the processed test and train data, the trained
model, and the pipeline.
### Additional Runs of Your Custom Model
After running ZRP\_Build(), you can reuse your custom model just as
you would run the packaged model. You only need to specify the path to the
generated model and pipelines. This is the same path,
'/path/to/desired/output/directory', that you defined previously when
running ZRP\_Build() in the example above; we call it 'pipe\_path'.
Thus, you would run:
>>> from zrp import ZRP
>>> zest_race_predictor = ZRP(pipe_path='/path/to/desired/output/directory')
>>> zest_race_predictor.fit()
>>> zrp_output = zest_race_predictor.transform(input_dataframe)
Validation
==========
The models included in this package were trained on publicly available
voter registration data and validated multiple times: on holdout sets
of voter registration data and on a national sample of PPP loan
forgiveness data. The results were consistent across tests: 20-30% more
African Americans correctly identified as African American, and 60%
fewer Whites identified as people of color, compared with the status
quo BISG method.
To see our validation analysis with Alabama voter registration data,
please check out [this
notebook](https://github.com/zestai/zrp/blob/main/examples/analysis/Alabama_Case_Study.md).
Performance on the national PPP loan forgiveness dataset was as follows
(comparing ZRP softmax with the BISG method):
*African American*
| Statistic | BISG | ZRP | Pct. Diff |
|---------------------|-------|-------|-----------|
| True Positive Rate | 0.571 | 0.700 | +23% (F) |
| True Negative Rate | 0.954 | 0.961 | +01% (F) |
| False Positive Rate | 0.046 | 0.039 | -15% (F) |
| False Negative Rate | 0.429 | 0.300 | -30% (F) |
*Asian American and Pacific Islander*
| Statistic | BISG | ZRP | Pct. Diff |
|---------------------|-------|-------|-----------|
| True Positive Rate | 0.683 | 0.777 | +14% (F) |
| True Negative Rate | 0.982 | 0.977 | -01% (U) |
| False Positive Rate | 0.018 | 0.023 | +28% (U) |
| False Negative Rate | 0.317 | 0.223 | -30% (F) |
*Non-White Hispanic*
| Statistic | BISG | ZRP | Pct. Diff |
|---------------------|-------|-------|-----------|
| True Positive Rate | 0.599 | 0.711 | +19% (F) |
| True Negative Rate | 0.979 | 0.973 | -01% (U) |
| False Positive Rate | 0.021 | 0.027 | +29% (U) |
| False Negative Rate | 0.401 | 0.289 | -28% (F) |
*White, Non-Hispanic*
| Statistic | BISG | ZRP | Pct. Diff |
|---------------------|-------|-------|-----------|
| True Positive Rate | 0.758 | 0.906 | +19% (F) |
| True Negative Rate | 0.758 | 0.741 | -02% (U) |
| False Positive Rate | 0.242 | 0.259 | +07% (U) |
| False Negative Rate | 0.241 | 0.094 | -61% (F) |
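
The "Pct. Diff" column appears to be the relative change from the BISG value to the ZRP value. The sketch below reproduces the African American rows under that assumption. Reading (F) and (U) as "favorable" and "unfavorable" is also an assumption, since the tables do not define the flags, and both helper functions are illustrative.

```python
def pct_diff(bisg, zrp):
    """Relative change from the BISG value to the ZRP value, in percent."""
    return (zrp - bisg) / bisg * 100


def is_favorable(statistic, diff):
    """Higher is better for true rates; lower is better for false rates."""
    higher_is_better = statistic in ("True Positive Rate", "True Negative Rate")
    return diff > 0 if higher_is_better else diff < 0


# (BISG, ZRP) pairs copied from the African American table above.
african_american = {
    "True Positive Rate": (0.571, 0.700),
    "True Negative Rate": (0.954, 0.961),
    "False Positive Rate": (0.046, 0.039),
    "False Negative Rate": (0.429, 0.300),
}

for name, (bisg, zrp) in african_american.items():
    diff = pct_diff(bisg, zrp)
    flag = "F" if is_favorable(name, diff) else "U"
    print(f"{name}: {diff:+.0f}% ({flag})")
```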
Authors
=======
> - [Kasey
> Matthews](https://www.linkedin.com/in/kasey-matthews-datadriven/)
> (Zest AI Lead)
> - [Piotr Zak](https://www.linkedin.com/in/piotr-zak-datadriven/) (Algomine)
> - [Austin Li](https://www.linkedin.com/in/austinwli/) (Harvard T4SG)
> - [Christien
> Williams](https://www.linkedin.com/in/christienwilliams/) (Schmidt
> Futures)
> - [Sean Kamkar](https://www.linkedin.com/in/sean-kamkar/) (Zest AI)
> - [Jay Budzik](https://www.linkedin.com/in/jaybudzik/) (Zest AI)
Contributing
============
Contributions are encouraged! For small bug fixes and minor
improvements, feel free to just open a PR. For larger changes, please
open an issue first so that other contributors can discuss your plan,
avoid duplicated work, and ensure it aligns with the goals of the
project. Be sure to also follow the [Code of
Conduct](https://github.com/zestai/zrp/blob/main/CODE_OF_CONDUCT.md).
Thanks!
Maintainers
-----------
Maintainers should additionally consult our documentation on
[releasing](https://github.com/zestai/zrp/blob/main/releasing.rst).
Follow the steps there to push new releases to PyPI and GitHub releases.
We publish new GitHub releases to ensure that the
pipelines and lookup tables required to download and
use the package are consistently up to date.
Wishlist
========
Support for the following capabilities is planned:
- add multiracial classification output support
- national validation datasets and validation partners
- pointers to additional training data
- add support for gender and other protected bases
License
=======
The package is released under the [Apache-2.0
License](https://opensource.org/licenses/Apache-2.0).
Results and Feedback
====================
Have you generated interesting results with the tool and want to share them, or
do you have other feedback? Get in touch via <abetterway@zest.ai>.
Raw data
{
"_id": null,
"home_page": "https://github.com/zestai/zrp",
"name": "zrp",
"maintainer": null,
"docs_url": null,
"requires_python": null,
"maintainer_email": null,
"keywords": "race ethnicity names address acs geocode",
"author": "Kasey Matthews et al.",
"author_email": "abetterway@zest.ai",
"download_url": "https://files.pythonhosted.org/packages/fe/da/2861c1c117230b0614f01c985d14092b0307553f384389f3ee2c6a37a7d7/zrp-0.4.0.tar.gz",
"platform": null,
"description": "Zest Race Predictor\n===================\n\n[![Documentation Status](https://readthedocs.org/projects/zrp-docs/badge/?version=latest)](https://zrp-docs.readthedocs.io/en/latest/?badge=latest)\n[![image](https://badge.fury.io/py/zrp.svg)](https://badge.fury.io/py/zrp)\n[![image](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/zestai/zrp/HEAD)\n[![image](https://img.shields.io/pypi/dm/zrp.svg?label=PyPI%20downloads)](https://pypi.org/project/zrp/)\n\nZest Race Predictor (ZRP) is an open-source machine learning algorithm\nthat estimates the race/ethnicity of an individual using only their full\nname and home address as inputs. ZRP improves upon the most widely used\nracial and ethnic data estimation method, Bayesian Improved Surname\nGeocoding (BISG), developed by RAND Corporation in 2009.\n\nZRP was built using ML techniques such as gradient boosting and trained\non voter data from the southeastern U.S. It was then validated on a\nnational sample using adjusted tract-level American Community Survey\n(ACS) data. (Model training procedures are provided.)\n\n***Compared to BISG, ZRP correctly identified:***\n\n* 25% more African-Americans as African-American\n* 35% fewer African-Americans as non-African American\n* 60% fewer Whites as non-White\n\nZRP can be used to analyze racial equity and outcomes in critical\nspheres such as health care, financial services, criminal justice, or\nanywhere there's a need to impute the race or ethnicity of a population\ndataset. (Usage examples are included.) The financial services industry,\nfor example, has struggled for years to achieve more equitable outcomes\namid charges of discrimination in lending practices.\n\nZest AI began developing ZRP in 2020 to improve the accuracy of our\nclients' fair lending analyses by using more data and better math. We\nbelieve ZRP can greatly improve our understanding of the disparate\nimpact and disparate treatment of protected-status borrowers. 
Armed with\na better understanding of the disparities that exist in our financial\nsystem, we can highlight inequities and create a roadmap to improve\nequity in access to finance.\n\nNotes\n=====\n\nThis is the preliminary version and implementation of the ZRP tool.\nWe\\'re dedicated to continue improving both the algorithm and\ndocumentation and hope that government agencies, lenders, citizen data\nscientists and other interested parties will help us improve the model.\nDetails of the model development process can be found in the [model\ndevelopment documentation](./model_report.rst)\n\nInstall\n=======\n\nInstall requires an internet connection. The package has been tested on python 3.7.7, but should likely work with 3.7.X.\n\nNote: Due to the size and number of lookup tables necesary for the zrp\npackage, total installation requires 3 GB of available space.\n\n### Setting up your virtual environment\n\nWe recommend installing zrp\ninside a [python virtual\nenvironment](https://docs.python.org/3/library/venv.html#creating-virtual-environments).\n\nRun the following to build your virtual envrionment:\n\n python3 -m venv /path/to/new/virtual/environment\n\nActivate your virtual environment:\n\n source /path/to/new/virtual/environment/bin/activate\n \n\nEx.:\n\n python -m venv /Users/joejones/Documents/ZestAI/zrpvenv\n source /Users/joejones/Documents/ZestAI/zrpvenv/bin/activate\n \n### General Installation\n\n pip install zrp\n\nAfter installing via pip, you need to download the lookup tables and\npipelines using the following command: :\n\n python -m zrp download \n \nIf you're experiencing issues with installation, please consult our [troubleshooting help](https://github.com/zestai/zrp/blob/main/troubleshooting_help.rst#manually-installing-lookup-tables-and-pipeline-files) page.\n\n### Advanced Installation\n\n*Required only if processing the data from scratch instead of using existing ZRP data\n\n#### Unix-like systems\n\n pip install fiona\n pip install 
zrp\n\nAfter installing via pip, you need to download the lookup tables and\npipelines using the following command: :\n\n python -m zrp download\n\n#### Windows\n\n pip install pipwin\n pipwin install gdal\n pipwin install fiona\n\n pip install zrp\n\nAfter installing via pip, you need to download the lookup tables and\npipelines using the following command: :\n\n python -m zrp download\n\nIf you're experiencing issues with installation, please consult our [troubleshooting help](https://github.com/zestai/zrp/blob/main/troubleshooting_help.rst#manually-installing-lookup-tables-and-pipeline-files) page.\n\nData\n====\n\n### Training Data\n\nThe models available in this package were trained on voter registration\ndata from the states of Florida , Georgia, and North Carolina. Summary\nstatistics on these datasets and additional datasets used as validation\ncan be found\n[here](https://github.com/zestai/zrp/blob/main/dataset_statistics.txt) .\n\n***Consult the following to download state voter registration data:***\n\n* [North Carolina](https://www.ncsbe.gov/results-data/voter-registration-data)\n* [Florida](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/UBIG3F)\n* [Alabama](https://www.alabamainteractive.org/sos/voter/voterWelcome.action)\n* [South Carolina](https://www.scvotes.gov/sale-voter-registration-lists)\n* [Georgia](https://sos.ga.gov/index.php/elections/order_voter_registration_lists_and_files)\n* [Louisiana](https://www.sos.la.gov/ElectionsAndVoting/BecomeACandidate/PurchaseVoterLists/Pages/default.aspx)\n\n### American Community Survey (ACS) Data:\n\nThe US Census Bureau details that, \\\"the American Community Survey (ACS)\nis an ongoing survey that provides data every year \\-- giving\ncommunities the current information they need to plan investments and\nservices. The ACS covers a broad range of topics about social, economic,\ndemographic, and housing characteristics of the U.S. population. 
The\n5-year estimates from the ACS are \\\"period\\\" estimates that represent\ndata collected over a period of time. The primary advantage of using\nmultiyear estimates is the increased statistical reliability of the data\nfor less populated areas and small population subgroups. The 5-year\nestimates are available for all geographies down to the block group\nlevel.\\\" ( Bureau, US Census. \"American Community Survey 5-Year Data\n(2009-2019).\" Census.gov, 8 Dec. 2021,\n<https://www.census.gov/data/developers/data-sets/acs-5year.html>. )\n\nACS data is available in 1 or 5 year spans. The 5yr ACS data is the most\ncomprehensive & is available at more granular levels than 1yr data. It\nis thus used in this work.\n\n## Model Development and Feature Documentation\n\nDetails of the model development process can be found in the [model\ndevelopment documentation](./model_report.rst) . Details of the human\nreadable feature definitions as well as feature importances can be found\n[here](https://github.com/zestai/zrp/tree/main/zrp/modeling#feature-definitions).\n\n## Usage and Examples\n\nTo get started using the ZRP, first ensure the download is complete (as\ndescribed above) and xgboost == 1.0.2\n\nCheck out the guides in the\n[examples](https://github.com/zestai/zrp/tree/main/examples) folder.\nClone the repo in order to obtain the example notebooks and data; this\nis not provided in the pip installable package. If you\\'re experiencing\nissues, first consult our [troubleshooting help\nguide](https://github.com/zestai/zrp/blob/main/troubleshooting_help.rst) .\n\n[Here](https://mybinder.org/v2/gh/zestai/zrp/HEAD), we additionally\nprovide an interactive virtual environment, via Binder, with ZRP\ninstalled. 
Once you open this link and are taken to the JupyterLab\nenvironment, open up a terminal and run the following: :\n\n python -m zrp download\n\nNext, we present the primary ways you\\'ll use ZRP.\n\n### ZRP Predictions\n\n**Summary of commands:** :\n\n >>> from zrp import ZRP\n >>> zest_race_predictor = ZRP()\n >>> zest_race_predictor.fit()\n >>> zrp_output = zest_race_predictor.transform(input_dataframe)\n\n**Breaking down key commands** :\n\n >>> zest_race_predictor = ZRP()\n\n- **ZRP(pipe\\_path=None, support\\_files\\_path=\\\"data/processed\\\",\n key=\\\"ZEST\\_KEY\\\", first\\_name=\\\"first\\_name\\\",\n middle\\_name=\\\"middle\\_name\\\", last\\_name=\\\"last\\_name\\\",\n house\\_number=\\\"house\\_number\\\",\n street\\_address=\\\"street\\_address\\\", city=\\\"city\\\", state=\\\"state\\\",\n zip\\_code=\\\"zip\\_code\\\", race=\\'race\\', proxy=\\\"probs\\\",\n census\\_tract=None, street\\_address\\_2=None, name\\_prefix=None,\n name\\_suffix=None, na\\_values=None, file\\_path=None, geocode=True,\n bisg=True, readout=True, n\\_jobs=49, year=\\\"2019\\\", span=\\\"5\\\",\n runname=\\\"test\\\")**\n\n - What it does:\n - Prepares data to generate race & ethnicity proxies\n\n You can find parameter descriptions in the [ZRP\n class](https://github.com/zestai/zrp/blob/main/zrp/zrp.py) and it\\'s\n [parent\n class](https://github.com/zestai/zrp/blob/main/zrp/prepare/base.py).\n\n```\n >>> zrp_output = zest_race_predictor.transform(input_dataframe)\n```\n\n- **zest\\_race\\_predictor.transform(df)**\n - What it does:\n - Processes input data and generates ZRP proxy predictions.\n - Attempts to predict on block group, then census tract, then\n zip code based on which level ACS data is found for. If Geo\n level data is unattainable, the BISG proxy is computed. 
No\n prediction returned if BISG cannot be computed either.\n\n> -----------------------------------------------------------------------------\n> Parameters \n> ------------ ----------------------------------------------------------------\n> df : {DataFrame} Pandas dataframe containing input data\n> (see below for necessary columns)\n>\n> -----------------------------------------------------------------------------\n\nInput data, **df**, into the prediction/modeling pipeline **MUST**\ncontain the following columns: first name, middle name, last name, house\nnumber, street address (street name), city, state, zip code, and zest\nkey. Consult our [troubleshooting help\nguide](https://github.com/zestai/zrp/blob/main/troubleshooting_help.rst) to\nensure your input data is the correct format.\n\n- Output: A dataframe with the following columns: AAPI AIAN BLACK\n HISPANIC WHITE source\\_block\\_group source\\_zip\\_code source\\_bisg :\n\n >>> zrp_output\n\n =========== =========== =========== =========== =========== =========== ===================== ====================== ================== \n AAPI AIAN BLACK HISPANIC WHITE source_block_group source_census_tract source_zip_code \n =========== =========== =========== =========== =========== =========== ===================== ====================== ================== \n ZEST_KEY \n 10 0.021916 0.021960 0.004889 0.012153 0.939082 1.0 0.0 0.0 \n 100 0.009462 0.013033 0.003875 0.008469 0.965162 1.0 0.0 0.0 \n 103 0.107332 0.000674 0.000584 0.021980 0.869429 1.0 0.0 0.0 \n 106 0.177411 0.015208 0.003767 0.041668 0.761946 1.0 0.0 0.0 \n 109 0.000541 0.000416 0.000376 0.000932 0.997736 1.0 0.0 0.0 \n ... ... ... ... ... ... ... ... ... 
\n 556 NaN NaN NaN NaN NaN 0.0 0.0 0.0 \n 557 NaN NaN NaN NaN NaN 0.0 0.0 0.0 \n =========== =========== =========== =========== =========== =========== ===================== ====================== ================== \n\nOne of the parameters to the [parent\nclass](https://github.com/zestai/zrp/blob/main/zrp/prepare/base.py) that\nZRP() inherits from is `file_path`. This parameter allows you to specify\nwhere the `artifacts/` folder is outputted during the run of the ZRP.\nOnce the run is complete, the `artifacts/` folder will contain the\noutputted race/ethnicity proxies and additional logs documenting the\nvalidity of input data. `file_path` **need not** be specified. If it is\nnot defined, the `artifacts/` folder will be placed in the same\ndirectory of the script running zrp. Subsequent runs will, however,\noverwrite the files in `artifacts/`; providing a unique directory path\nfor `file_path` will avoid this.\n\nZRP Build\n---------\n\n**Summary of commands** :\n\n >>> from zrp.modeling import ZRP_Build\n >>> zest_race_predictor_builder = ZRP_Build('/path/to/desired/output/directory')\n >>> zest_race_predictor_builder.fit()\n >>> zrp_build_output = zest_race_predictor_builder.transform(input_training_data)\n\n**Breaking down key commands** :\n\n >>> zest_race_predictor_builder = ZRP_Build('/path/to/desired/output/directory')\n\n- **ZRP\\_Build(file\\_path, zrp\\_model\\_name = \\'zrp\\_0\\',\n zrp\\_model\\_source =\\'ct\\')**\n - What it does:\n - Prepares the class that builds the new custom ZRP model.\n\n> -----------------------------------------------------------------------------\n> Parameters \n> ------------ ----------------------------------------------------------------\n> file_path : {str} The path where pipeline, model, and\n> supporting data are saved.\n>\n> zrp_model_name : {str} Name of zrp_model.\n>\n> zrp_model_source : {str} Indicates the source of\n> zrp_modeling data to use.\n> 
-----------------------------------------------------------------------------\n>\n> You can find more detailed parameter descriptions in the [ZRP\\_Build\n> class](https://github.com/zestai/zrp/blob/main/zrp/modeling/pipeline_builder.py).\n> ZRP\\_Build() also inherits initializing parameters from its [parent\n> class](https://github.com/zestai/zrp/blob/main/zrp/prepare/base.py).\n\n >>> zrp_build_output = zest_race_predictor_builder.transform(input_training_data)\n\n- **zest\\_race\\_predictor\\_builder.transform(df)**\n - What it does:\n - Builds a new custom ZRP model trained on user input data\n when supplied with standard ZRP requirements including name,\n address, and race.\n - Produces a custom model-pipeline. The pipeline, model, and\n supporting data are saved automatically to\n \\\"\\~/data/experiments/model\\_source/data/\\\" in the support\n files path defined.\n - The class assumes data is not broken into train and test\n sets, performs this split itself, and outputs predictions on\n the test set.\n\n> -----------------------------------------------------------------------------\n> Parameters \n> ------------ ----------------------------------------------------------------\n> df : {DataFrame} Pandas dataframe containing input data\n> (see below for necessary columns)\n>\n> -----------------------------------------------------------------------------\n\nInput data, **df**, into this pipeline **MUST** contain the following\ncolumns: first name, middle name, last name, house number, street\naddress (street name), city, state, zip code, zest key, and race.\nConsult our [troubleshooting help\nguide](https://github.com/zestai/zrp/blob/main/troubleshooting_help.rst) to\nensure your input data is in the correct format.\n\n- Output: A dictionary of race & ethnicity probabilities and labels.\n\nAs mentioned in the ZRP Predict section above, once the run is complete,\nthe `artifacts/` folder will contain the output race/ethnicity\nproxies and additional logs 
documenting the validity of input data.\nSimilarly, `file_path` **need not** be specified, but providing\na unique directory path for `file_path` will avoid overwriting the\n`artifacts/` folder. When running ZRP Build, however,\n`artifacts/` also contains the processed test and train data, trained\nmodel, and pipeline.\n\n### Additional Runs of Your Custom Model\n\nAfter having run ZRP\\_Build() you can re-use your custom model just like\nyou run the packaged model. All you must do is specify the path to the\ngenerated model and pipelines (this path is the same path as\n\\'/path/to/desired/output/directory\\' that you defined previously when\nrunning ZRP\\_Build() in the example above; we call this \\'pipe\\_path\\').\nThus, you would run:\n\n >>> from zrp import ZRP\n >>> zest_race_predictor = ZRP('pipe_path')\n >>> zest_race_predictor.fit()\n >>> zrp_output = zest_race_predictor.transform(input_dataframe)\n\nValidation\n==========\n\nThe models included in this package were trained on publicly available\nvoter registration data and validated multiple times: on holdout sets\nof voter registration data and on a national sample of PPP loan\nforgiveness data. The results were consistent across tests: 20-30% more\nAfrican Americans correctly identified as African American, and 60%\nfewer whites identified as people of color as compared with the status\nquo BISG method.\n\nTo see our validation analysis with Alabama voter registration data,\nplease check out [this\nnotebook](https://github.com/zestai/zrp/blob/main/examples/analysis/Alabama_Case_Study.md).\n\nPerformance on the national PPP loan forgiveness dataset was as follows\n(comparing ZRP softmax with the BISG method):\n\n*African American*\n\n| Statistic | BISG | ZRP | Pct. 
Diff |\n|---------------------|-------|-------|-----------|\n| True Positive Rate | 0.571 | 0.700 | +23% (F) |\n| True Negative Rate | 0.954 | 0.961 | +01% (F) |\n| False Positive Rate | 0.046 | 0.039 | -15% (F) |\n| False Negative Rate | 0.429 | 0.300 | -30% (F) |\n\n*Asian American and Pacific Islander*\n\n| Statistic | BISG | ZRP | Pct. Diff |\n|---------------------|-------|-------|-----------|\n| True Positive Rate | 0.683 | 0.777 | +14% (F) |\n| True Negative Rate | 0.982 | 0.977 | -01% (U) |\n| False Positive Rate | 0.018 | 0.023 | +28% (U) |\n| False Negative Rate | 0.317 | 0.223 | -30% (F) |\n\n*Non-White Hispanic*\n\n| Statistic | BISG | ZRP | Pct. Diff |\n|---------------------|-------|-------|-----------|\n| True Positive Rate | 0.599 | 0.711 | +19% (F) |\n| True Negative Rate | 0.979 | 0.973 | -01% (U) |\n| False Positive Rate | 0.021 | 0.027 | +29% (U) |\n| False Negative Rate | 0.401 | 0.289 | -28% (F) |\n\n*White, Non-Hispanic*\n\n| Statistic | BISG | ZRP | Pct. Diff |\n|---------------------|-------|-------|-----------|\n| True Positive Rate | 0.758 | 0.906 | +19% (F) |\n| True Negative Rate | 0.758 | 0.741 | -02% (U) |\n| False Positive Rate | 0.242 | 0.259 | +07% (U) |\n| False Negative Rate | 0.241 | 0.094 | -61% (F) |\n\nAuthors\n=======\n\n> - [Kasey\n> Matthews](https://www.linkedin.com/in/kasey-matthews-datadriven/)\n> (Zest AI Lead)\n> - [Piotr Zak](https://www.linkedin.com/in/piotr-zak-datadriven/) (Algomine)\n> - [Austin Li](https://www.linkedin.com/in/austinwli/) (Harvard T4SG)\n> - [Christien\n> Williams](https://www.linkedin.com/in/christienwilliams/) (Schmidt\n> Futures)\n> - [Sean Kamkar](https://www.linkedin.com/in/sean-kamkar/) (Zest AI)\n> - [Jay Budzik](https://www.linkedin.com/in/jaybudzik/) (Zest AI)\n\nContributing\n============\n\nContributions are encouraged! For small bug fixes and minor\nimprovements, feel free to just open a PR. 
For larger changes, please\nopen an issue first so that other contributors can discuss your plan,\navoid duplicated work, and ensure it aligns with the goals of the\nproject. Be sure to also follow the [Code of\nConduct](https://github.com/zestai/zrp/blob/main/CODE_OF_CONDUCT.md).\nThanks!\n\nMaintainers\n-----------\n\nMaintainers should additionally consult our documentation on\n[releasing](https://github.com/zestai/zrp/blob/main/releasing.rst).\nFollow the steps there to push new releases to PyPI and GitHub releases.\nWith respect to GitHub releases, we provide new releases to ensure\nrelevant pipelines and lookup tables requisite for package download and\nuse are consistently up to date.\n\nWishlist\n========\n\nSupport for the following capabilities is planned:\n\n- add multiracial classification output support\n- national validation datasets and validation partners\n- pointers to additional training data\n- add support for gender and other protected bases\n\nLicense\n=======\n\nThe package is released under the [Apache-2.0\nLicense](https://opensource.org/licenses/Apache-2.0).\n\nResults and Feedback\n====================\n\nHave you generated interesting results with the tool, or do you have other\nfeedback to share? Get in touch via <abetterway@zest.ai>.\n",
"bugtrack_url": null,
"license": "Apache-2.0",
"summary": "The Zest Race Predictor tool predicts race/ethnicity using a name and address as inputs.",
"version": "0.4.0",
"project_urls": {
"Homepage": "https://github.com/zestai/zrp"
},
"split_keywords": [
"race",
"ethnicity",
"names",
"address",
"acs",
"geocode"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "1b1063d9486acd262deac83ea5418dd8d8c6cbc6c68833533f816bacc57a6156",
"md5": "03279f3c76a938d60a05dabfab062c46",
"sha256": "cb88d56a39fe47161df6634b29d9cbe6a6f694595262b5858563481f0f1f7118"
},
"downloads": -1,
"filename": "zrp-0.4.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "03279f3c76a938d60a05dabfab062c46",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 7283754,
"upload_time": "2024-10-04T22:53:55",
"upload_time_iso_8601": "2024-10-04T22:53:55.932562Z",
"url": "https://files.pythonhosted.org/packages/1b/10/63d9486acd262deac83ea5418dd8d8c6cbc6c68833533f816bacc57a6156/zrp-0.4.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "feda2861c1c117230b0614f01c985d14092b0307553f384389f3ee2c6a37a7d7",
"md5": "3ee58d31e4812448a7c82637bac55131",
"sha256": "15ac4c1eebe272b700b36f56e2d956f6d0f7f63583c63a8880d93658c7c16af8"
},
"downloads": -1,
"filename": "zrp-0.4.0.tar.gz",
"has_sig": false,
"md5_digest": "3ee58d31e4812448a7c82637bac55131",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 7188909,
"upload_time": "2024-10-04T22:53:57",
"upload_time_iso_8601": "2024-10-04T22:53:57.983227Z",
"url": "https://files.pythonhosted.org/packages/fe/da/2861c1c117230b0614f01c985d14092b0307553f384389f3ee2c6a37a7d7/zrp-0.4.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-10-04 22:53:57",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "zestai",
"github_project": "zrp",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"requirements": [
{
"name": "category_encoders",
"specs": [
[
"==",
"2.3.0"
]
]
},
{
"name": "CensusData",
"specs": [
[
"==",
"1.15"
]
]
},
{
"name": "feature_engine",
"specs": [
[
"==",
"1.2.0"
]
]
},
{
"name": "fastparquet",
"specs": []
},
{
"name": "joblib",
"specs": [
[
"==",
"1.2.0"
]
]
},
{
"name": "numpy",
"specs": []
},
{
"name": "pandas",
"specs": [
[
"==",
"1.2.5"
]
]
},
{
"name": "plac",
"specs": [
[
"==",
"1.3.4"
]
]
},
{
"name": "pyarrow",
"specs": [
[
"==",
"7.0.0"
]
]
},
{
"name": "pycm",
"specs": [
[
"==",
"3.3"
]
]
},
{
"name": "scikit_learn",
"specs": [
[
"==",
"1.0.2"
]
]
},
{
"name": "surgeo",
"specs": [
[
"==",
"1.1.2"
]
]
},
{
"name": "tqdm",
"specs": [
[
"==",
"4.46.0"
]
]
},
{
"name": "xgboost",
"specs": [
[
"==",
"1.0.2"
]
]
}
],
"lcname": "zrp"
}