Name | outkast |
Version | 1.0.0 |
home_page | None |
Summary | Infer Caste from Indian Names |
upload_time | 2025-10-07 20:44:01 |
maintainer | None |
docs_url | None |
author | None |
requires_python | >=3.11 |
license | None |
keywords | caste, names, india |
outkast: estimate caste by last name, year, and state
-----------------------------------------------------
.. image:: https://github.com/appeler/outkast/actions/workflows/ci.yml/badge.svg
    :target: https://github.com/appeler/outkast/actions/workflows/ci.yml
.. image:: https://img.shields.io/pypi/v/outkast.svg
    :target: https://pypi.python.org/pypi/outkast
.. image:: https://pepy.tech/badge/outkast
    :target: https://pepy.tech/project/outkast
.. image:: https://img.shields.io/badge/docs-github.io-blue
    :target: https://appeler.github.io/outkast/
Using data on more than 140M Indians across 19 states from the `Socio-Economic Caste Census <https://github.com/in-rolls/secc>`__ (parsed data `here <https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/LIIBNB>`__), we estimate the proportions of `scheduled caste, scheduled tribe, and other` for a given last name, year, and state.
Why?
====
We provide this package so that people can assess, highlight, and fight unfairness.
How is the underlying data produced?
====================================
1. The `script <outkast/data/secc/01_download_secc.ipynb>`__ downloads the `clean version <https://github.com/in-rolls/secc>`__ of the SECC posted `here <https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/LIIBNB>`__.
2. `Produce base data frame <outkast/data/secc/02_clean_secc_recode.ipynb>`__ and `infer last names <outkast/data/secc/03_outkast_dataset_state.ipynb>`__, applying these filters:
   * remove names with non-alphabetical characters
   * remove records with missing last names
   * remove last names shorter than 2 characters
   * remove rows with birth_date < 1900
   * keep only last names shared by at least 1000 households
3. `Group by last name, state, and year <outkast/data/secc/03_outkast_dataset_state.ipynb>`__ and produce the `underlying data <outkast/data/secc/secc_all_state_year_ln_outkast.csv.gz>`__
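The cleaning and grouping steps above can be sketched in plain Python. This is a minimal illustration and not the notebooks' actual code: the record fields (`last_name`, `state`, `birth_year`, `category`), the helper names, the lowercasing of names, and the toy `min_count` threshold are all assumptions made for the example, and the household threshold is approximated by counting records.

```python
from collections import Counter, defaultdict

# Hypothetical record layout; the real SECC columns may differ.
RECORDS = [
    {"last_name": "Patel", "state": "gujarat", "birth_year": 1980, "category": "other"},
    {"last_name": "patel ", "state": "gujarat", "birth_year": 1980, "category": "st"},
    {"last_name": "Lal", "state": "bihar", "birth_year": 1975, "category": "sc"},
    {"last_name": "K", "state": "bihar", "birth_year": 1975, "category": "sc"},       # too short
    {"last_name": "R4m", "state": "bihar", "birth_year": 1975, "category": "sc"},     # non-alphabetical
    {"last_name": None, "state": "bihar", "birth_year": 1975, "category": "sc"},      # missing
    {"last_name": "Lal", "state": "bihar", "birth_year": 1890, "category": "other"},  # before 1900
]


def clean(records):
    """Yield records that pass the filters listed above, with normalized names."""
    for rec in records:
        # Assumption: names are compared case-insensitively with spaces trimmed.
        name = (rec["last_name"] or "").strip().lower()
        if len(name) < 2 or not name.isalpha():
            continue
        if rec["birth_year"] < 1900:
            continue
        yield {**rec, "last_name": name}


def aggregate(records, min_count=1):
    """Count sc/st/other per (last name, state, birth year) for frequent names."""
    cleaned = list(clean(records))
    name_totals = Counter(rec["last_name"] for rec in cleaned)
    counts = defaultdict(lambda: {"n_sc": 0, "n_st": 0, "n_other": 0})
    for rec in cleaned:
        if name_totals[rec["last_name"]] < min_count:
            continue  # the README uses a threshold of 1000 households
        key = (rec["last_name"], rec["state"], rec["birth_year"])
        counts[key]["n_" + rec["category"]] += 1
    return dict(counts)


table = aggregate(RECORDS, min_count=2)
print(table)
```

With the toy threshold of 2, only `patel` survives, because one of the two `lal` records is dropped by the birth-year filter.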
Base Classifier
~~~~~~~~~~~~~~~
We start with a base model for last\_name that gives the Bayes
optimal solution: the proportion of `SC, ST, and Other` among people with that last name.
We also provide a series of base models for cases where the state of
residence is known.
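To make the base model concrete, the sketch below assumes that each proportion is simply that category's share of the people with a given last name, which is consistent with the counts and proportions printed in the example output in this README. The function names are illustrative and not part of the `outkast` API. Picking the category with the largest share is the prediction that minimizes the expected misclassification rate.

```python
def proportions(n_sc, n_st, n_other):
    """Return each category's share of the total count for one last name."""
    total = n_sc + n_st + n_other
    if total == 0:
        raise ValueError("no records for this last name")
    return {
        "prop_sc": n_sc / total,
        "prop_st": n_st / total,
        "prop_other": n_other / total,
    }


def most_likely(props):
    """Return the key of the category with the largest proportion."""
    return max(props, key=props.get)


# Counts for 'patel' taken from the example output in this README.
patel = proportions(5681, 112302, 631393)
print({key: round(value, 6) for key, value in patel.items()})
print(most_likely(patel))
```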
Installation
~~~~~~~~~~~~
We strongly recommend installing `outkast` inside a Python virtual environment (see the `venv documentation <https://docs.python.org/3/library/venv.html#creating-virtual-environments>`__).
::
    pip install outkast
Usage
~~~~~
::
    usage: secc_caste [-h] -l LAST_NAME
                      [-s {arunachal pradesh,assam,bihar,chhattisgarh,gujarat,haryana,kerala,madhya pradesh,maharashtra,mizoram,odisha,nagaland,punjab,rajasthan,sikkim,tamilnadu,uttar pradesh,uttarakhand,west bengal}]
                      [-y YEAR] [-o OUTPUT]
                      input
    Appends SECC 2011 data columns for sc, st, and other by last name
    positional arguments:
      input                 Input file
    optional arguments:
      -h, --help            show this help message and exit
      -l LAST_NAME, --last-name LAST_NAME
                            Name or index location of column contains the last
                            name
      -s {arunachal pradesh,assam,bihar,chhattisgarh,gujarat,haryana,kerala,madhya pradesh,maharashtra,mizoram,odisha,nagaland,punjab,rajasthan,sikkim,tamilnadu,uttar pradesh,uttarakhand,west bengal}, --state {arunachal pradesh,assam,bihar,chhattisgarh,gujarat,haryana,kerala,madhya pradesh,maharashtra,mizoram,odisha,nagaland,punjab,rajasthan,sikkim,tamilnadu,uttar pradesh,uttarakhand,west bengal}
                            State name of SECC data (default=all)
      -y YEAR, --year YEAR  Birth year in SECC data (default=all)
      -o OUTPUT, --output OUTPUT
                            Output file with SECC data columns
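As a rough illustration of the options above, the following sketch writes a small input file and assembles a matching `secc_caste` command line. It assumes the input file is a CSV with a header row (the usage text only says "Input file"); the file names, the `last_name` column, and the `build_command` helper are hypothetical. The command is printed rather than executed.

```python
import csv
import shlex
import tempfile
from pathlib import Path


def build_command(input_csv, name_column, output_csv, state=None, year=None):
    """Assemble a secc_caste argument list that mirrors the usage text."""
    cmd = ["secc_caste", "-l", name_column, "-o", str(output_csv)]
    if state is not None:
        cmd += ["-s", state]
    if year is not None:
        cmd += ["-y", str(year)]
    cmd.append(str(input_csv))  # positional input file goes last
    return cmd


# Write a tiny input file (assumed format: CSV with a header row).
workdir = Path(tempfile.mkdtemp())
input_csv = workdir / "names.csv"
with input_csv.open("w", newline="") as handle:
    writer = csv.writer(handle)
    writer.writerow(["last_name"])
    writer.writerows([["patel"], ["zala"], ["lal"], ["agarwal"]])

cmd = build_command(
    input_csv, "last_name", workdir / "names_secc.csv", state="gujarat", year=1990
)

# Print a shell-safe version of the command instead of running it.
print(shlex.join(cmd))
```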
Using outkast
~~~~~~~~~~~~~
::
    >>> import pandas as pd
    >>> from outkast import secc_caste
    >>>
    >>> names = [{'name': 'patel'},
    ...          {'name': 'zala'},
    ...          {'name': 'lal'},
    ...          {'name': 'agarwal'}]
    >>>
    >>> df = pd.DataFrame(names)
    >>>
    >>> secc_caste(df, 'name')
          name    n_sc    n_st  n_other   prop_sc   prop_st  prop_other
    0    patel    5681  112302   631393  0.007581  0.149861    0.842558
    1     zala     667      14    34550  0.018932  0.000397    0.980670
    2      lal  703595  241846  1314224  0.311371  0.107027    0.581601
    3  agarwal      39      12     4375  0.008812  0.002711    0.988477
    >>>
    >>> help(secc_caste)
    Help on method secc_caste in module outkast.secc_caste_ln:
    secc_caste(df, namecol, state=None, year=None) method of builtins.type instance
        Appends additional columns from SECC data to the input DataFrame
        based on the last name.
        Removes extra space. Checks if the name is the SECC data.
        If it is, outputs data from that row.
        Args:
            df (:obj:`DataFrame`): Pandas DataFrame containing the last name
                column.
            namecol (str or int): Column's name or location of the name in
                DataFrame.
            state (str): The state name of SECC data to be used.
                (default is None for all states)
            year (int): The year of SECC data to be used.
                (default is None for all years)
        Returns:
            DataFrame: Pandas DataFrame with additional columns:-
                'n_sc', 'n_st', 'n_other',
                'prop_sc', 'prop_st', 'prop_other' by last name
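The `state` and `year` arguments in the signature above narrow the lookup. The sketch below wraps the call with a small validation step; it assumes the Python API accepts the same lowercase state names that the command-line `-s` option lists, which the help text does not confirm. `SECC_STATES`, `normalize_state`, and `caste_by_state` are hypothetical helpers, not part of `outkast`.

```python
# State names copied from the command-line usage text in this README.
SECC_STATES = (
    "arunachal pradesh", "assam", "bihar", "chhattisgarh", "gujarat",
    "haryana", "kerala", "madhya pradesh", "maharashtra", "mizoram",
    "odisha", "nagaland", "punjab", "rajasthan", "sikkim", "tamilnadu",
    "uttar pradesh", "uttarakhand", "west bengal",
)


def normalize_state(state):
    """Lowercase and collapse whitespace, then check against the known states."""
    if state is None:
        return None  # None means "all states" per the help text
    cleaned = " ".join(state.lower().split())
    if cleaned not in SECC_STATES:
        raise ValueError(f"unknown SECC state: {state!r}")
    return cleaned


def caste_by_state(df, namecol, state=None, year=None):
    """Call secc_caste with a validated state name (assumed to match the CLI)."""
    # Imported lazily so the validation helper works without outkast installed.
    from outkast import secc_caste

    return secc_caste(df, namecol, state=normalize_state(state), year=year)


print(normalize_state("  Gujarat "))
```

With `outkast` and pandas installed, a call such as `caste_by_state(df, 'name', state='Gujarat', year=1990)` would pass `state='gujarat'` through to `secc_caste`. Note that the listed spelling is `tamilnadu`, so `Tamil Nadu` would be rejected by this check.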
Authors
~~~~~~~
Suriyan Laohaprapanon and Gaurav Sood
License
~~~~~~~
The package is released under the `MIT
License <https://opensource.org/licenses/MIT>`__.
Raw data
{
"_id": null,
"home_page": null,
"name": "outkast",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.11",
"maintainer_email": null,
"keywords": "caste, names, india",
"author": null,
"author_email": "Gaurav Sood <gsood07@gmail.com>, Suriyan Laohaprapanon <suriyant@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/02/09/29a84ec6da4a4ab571a901b741489a045de1f1194b38fad58b23ffd378c1/outkast-1.0.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": null,
"summary": "Infer Caste from Indian Names",
"version": "1.0.0",
"project_urls": {
"Homepage": "https://github.com/appeler/outkast",
"Issues": "https://github.com/appeler/outkast/issues",
"Repository": "https://github.com/appeler/outkast"
},
"split_keywords": [
"caste",
" names",
" india"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "8bd8a91f2e8ff6c8cc4246248702dc047f2cded31f69b6a61e83253f5baab374",
"md5": "59322040eedc33197b9dd6a8b0c9a4a6",
"sha256": "f757e00eb252c71e64ae83ab82908955624d78c39e030de6e96a86864a2eac42"
},
"downloads": -1,
"filename": "outkast-1.0.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "59322040eedc33197b9dd6a8b0c9a4a6",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.11",
"size": 8604713,
"upload_time": "2025-10-07T20:43:59",
"upload_time_iso_8601": "2025-10-07T20:43:59.137497Z",
"url": "https://files.pythonhosted.org/packages/8b/d8/a91f2e8ff6c8cc4246248702dc047f2cded31f69b6a61e83253f5baab374/outkast-1.0.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "020929a84ec6da4a4ab571a901b741489a045de1f1194b38fad58b23ffd378c1",
"md5": "16376e8908b2bf3b26a111c90b05d8a6",
"sha256": "09c3c2490ac0f1ddb186015ba07a04999dbf510a36b8902c5c104befb1aa8038"
},
"downloads": -1,
"filename": "outkast-1.0.0.tar.gz",
"has_sig": false,
"md5_digest": "16376e8908b2bf3b26a111c90b05d8a6",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.11",
"size": 8607844,
"upload_time": "2025-10-07T20:44:01",
"upload_time_iso_8601": "2025-10-07T20:44:01.469934Z",
"url": "https://files.pythonhosted.org/packages/02/09/29a84ec6da4a4ab571a901b741489a045de1f1194b38fad58b23ffd378c1/outkast-1.0.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-10-07 20:44:01",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "appeler",
"github_project": "outkast",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "outkast"
}