outkast: estimate caste by last name, year, and state
-----------------------------------------------------

:Name: outkast
:Version: 1.0.0
:Summary: Infer Caste from Indian Names
:Upload time: 2025-10-07 20:44:01
:Requires Python: >=3.11
:Keywords: caste, names, india

.. image:: https://github.com/appeler/outkast/actions/workflows/ci.yml/badge.svg
    :target: https://github.com/appeler/outkast/actions/workflows/ci.yml
.. image:: https://img.shields.io/pypi/v/outkast.svg
    :target: https://pypi.python.org/pypi/outkast
.. image:: https://pepy.tech/badge/outkast
    :target: https://pepy.tech/project/outkast
.. image:: https://img.shields.io/badge/docs-github.io-blue
    :target: https://appeler.github.io/outkast/


Using data on more than 140M Indians across 19 states from the `Socio-Economic Caste Census <https://github.com/in-rolls/secc>`__ (parsed data `here <https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/LIIBNB>`__), we estimate the proportion of people who are scheduled caste (SC), scheduled tribe (ST), and other for a given last name, birth year, and state.

Why?
====

We provide this package so that people can assess, highlight, and fight unfairness.

How is the underlying data produced?
====================================

1. The `script <outkast/data/secc/01_download_secc.ipynb>`__ downloads the `clean version <https://github.com/in-rolls/secc>`__ of the SECC posted `here <https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/LIIBNB>`__.

2. `Produce base data frame <outkast/data/secc/02_clean_secc_recode.ipynb>`__ and `infer last names <outkast/data/secc/03_outkast_dataset_state.ipynb>`__, applying the following filters:

   * remove names with non-alphabetical characters
   * remove records with missing last names
   * remove last names shorter than 2 characters
   * remove rows with ``birth_date`` before 1900
   * keep only last names shared by at least 1,000 households

3. `Group by last name, state, and year <outkast/data/secc/03_outkast_dataset_state.ipynb>`__ and produce the `underlying data <outkast/data/secc/secc_all_state_year_ln_outkast.csv.gz>`__
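
The steps above can be sketched in a few lines of plain Python. This is only an illustration, not the code in the notebooks: the record layout, the category labels, and the ``MIN_SHARED`` threshold are assumptions made for the toy data, and the threshold counts records rather than households.

```python
from collections import Counter, defaultdict

# Stand-in for the 1,000-household threshold; lowered so the toy data survives.
MIN_SHARED = 2

# Hypothetical records: (last_name, state, birth_year, category)
records = [
    ("patel", "gujarat", 1975, "other"),
    ("patel", "gujarat", 1975, "st"),
    ("patel", "gujarat", 1980, "other"),
    ("lal", "bihar", 1960, "sc"),
    ("lal", "bihar", 1960, "other"),
    ("k", "kerala", 1990, "other"),        # dropped: shorter than 2 characters
    ("d'souza", "kerala", 1985, "other"),  # dropped: non-alphabetical character
    ("", "assam", 1970, "st"),             # dropped: missing last name
    ("rao", "odisha", 1890, "other"),      # dropped: birth year before 1900
    ("zala", "gujarat", 1982, "other"),    # dropped later: name is too rare
]


def keep(last_name, birth_year):
    """Apply the per-record filters from step 2."""
    return len(last_name) >= 2 and last_name.isalpha() and birth_year >= 1900


clean = [r for r in records if keep(r[0], r[2])]

# Keep only last names that occur often enough.
name_freq = Counter(r[0] for r in clean)
clean = [r for r in clean if name_freq[r[0]] >= MIN_SHARED]

# Step 3: count categories per (last name, state, year).
groups = defaultdict(Counter)
for last_name, state, year, category in clean:
    groups[(last_name, state, year)][category] += 1

table = [
    {"last_name": name, "state": state_name, "year": year,
     "n_sc": counts["sc"], "n_st": counts["st"], "n_other": counts["other"]}
    for (name, state_name, year), counts in sorted(groups.items())
]

for row in table:
    print(row)
```

Each output row holds the counts for one last name, state, and year combination, which is the shape of data the lookup functions rely on.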

Base Classifier
~~~~~~~~~~~~~~~

We start with a base model for last\_name that gives the Bayes
optimal solution: the proportion of `SC, ST, and Other` among people with that last name.
We also provide a series of base models for cases where the state of
residence is known.
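
As a rough sketch of what the base model amounts to, the proportions are the category counts divided by their total, and under 0-1 loss the Bayes optimal guess is the category with the largest proportion. The function names below are hypothetical and are not part of the package; the counts are the ``patel`` row from the usage example in this README.

```python
def category_proportions(n_sc, n_st, n_other):
    """Return the share of each category among people with a given last name."""
    total = n_sc + n_st + n_other
    if total == 0:
        raise ValueError("no records for this last name")
    return {"sc": n_sc / total, "st": n_st / total, "other": n_other / total}


def most_likely_category(props):
    """Pick the category with the highest estimated proportion."""
    return max(props, key=props.get)


# Counts for 'patel' taken from the usage example in this README.
props = category_proportions(n_sc=5681, n_st=112302, n_other=631393)
print({name: round(value, 6) for name, value in props.items()})
print(most_likely_category(props))  # 'other'
```

The proportions are usually more informative than the single most likely category, since many last names are spread across categories.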

Installation
~~~~~~~~~~~~

We strongly recommend installing ``outkast`` inside a Python virtual environment (see `venv documentation <https://docs.python.org/3/library/venv.html#creating-virtual-environments>`__).

::

    pip install outkast


Usage
~~~~~

::

    usage: secc_caste [-h] -l LAST_NAME
                    [-s {arunachal pradesh,assam,bihar,chhattisgarh,gujarat,haryana,kerala,madhya pradesh,maharashtra,mizoram,odisha,nagaland,punjab,rajasthan,sikkim,tamilnadu,uttar pradesh,uttarakhand,west bengal}]
                    [-y YEAR] [-o OUTPUT]
                    input

    Appends SECC 2011 data columns for sc, st, and other by last name

    positional arguments:
    input                 Input file

    optional arguments:
    -h, --help            show this help message and exit
    -l LAST_NAME, --last-name LAST_NAME
                            Name or index location of column contains the last
                            name
    -s {arunachal pradesh,assam,bihar,chhattisgarh,gujarat,haryana,kerala,madhya pradesh,maharashtra,mizoram,odisha,nagaland,punjab,rajasthan,sikkim,tamilnadu,uttar pradesh,uttarakhand,west bengal}, --state {arunachal pradesh,assam,bihar,chhattisgarh,gujarat,haryana,kerala,madhya pradesh,maharashtra,mizoram,odisha,nagaland,punjab,rajasthan,sikkim,tamilnadu,uttar pradesh,uttarakhand,west bengal}
                            State name of SECC data (default=all)
    -y YEAR, --year YEAR  Birth year in SECC data (default=all)
    -o OUTPUT, --output OUTPUT
                            Output file with SECC data columns
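
A minimal sketch of preparing an input file and assembling a command line from the flags listed above. The help text does not state the input format, so the CSV layout, the column name ``name``, the file names, and the year ``1980`` are all assumptions for illustration. The command is only printed here, not executed.

```python
import csv
import shlex
import tempfile
from pathlib import Path

# Hypothetical working directory and file names.
workdir = Path(tempfile.mkdtemp())
input_path = workdir / "names.csv"
output_path = workdir / "names_with_secc.csv"

# Write a small input file with one last-name column called "name" (assumed CSV).
with input_path.open("w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["name"])
    writer.writerows([["patel"], ["zala"], ["lal"], ["agarwal"]])

# Build the command from the documented flags: -l, -s, -y, -o, then the input file.
cmd = [
    "secc_caste",
    "-l", "name",
    "-s", "gujarat",
    "-y", "1980",
    "-o", str(output_path),
    str(input_path),
]
print(shlex.join(cmd))

# To run it, with outkast installed:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Leaving out ``-s`` and ``-y`` should fall back to the documented defaults, which cover all states and all years.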



Using outkast
~~~~~~~~~~~~~

::

    >>> import pandas as pd
    >>> from outkast import secc_caste
    >>>
    >>> names = [{'name': 'patel'},
    ...             {'name': 'zala'},
    ...             {'name': 'lal'},
    ...             {'name': 'agarwal'}]
    >>>
    >>> df = pd.DataFrame(names)
    >>>
    >>> secc_caste(df, 'name')
        name    n_sc    n_st  n_other   prop_sc   prop_st  prop_other
    0    patel    5681  112302   631393  0.007581  0.149861    0.842558
    1     zala     667      14    34550  0.018932  0.000397    0.980670
    2      lal  703595  241846  1314224  0.311371  0.107027    0.581601
    3  agarwal      39      12     4375  0.008812  0.002711    0.988477


    >>>
    >>> help(secc_caste)
    Help on method secc_caste in module outkast.secc_caste_ln:

    secc_caste(df, namecol, state=None, year=None) method of builtins.type instance
        Appends additional columns from SECC data to the input DataFrame
        based on the last name.

        Removes extra space. Checks if the name is the SECC data.
        If it is, outputs data from that row.

        Args:
            df (:obj:`DataFrame`): Pandas DataFrame containing the last name
                column.
            namecol (str or int): Column's name or location of the name in
                DataFrame.
            state (str): The state name of SECC data to be used.
                (default is None for all states)
            year (int): The year of SECC data to be used.
                (default is None for all years)

        Returns:
            DataFrame: Pandas DataFrame with additional columns:-
                'n_sc', 'n_st', 'n_other',
                'prop_sc', 'prop_st', 'prop_other' by last name
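
The ``state`` and ``year`` arguments in the signature above can narrow the lookup. The following is a sketch under assumptions: the state value comes from the list of choices in the command line help, while ``1980`` is only an illustrative year that may not be present for every name. The guard skips the lookup when ``pandas`` or ``outkast`` is not installed.

```python
import importlib.util

# Only attempt the lookup when both libraries can be imported.
have_deps = all(
    importlib.util.find_spec(module) is not None
    for module in ("pandas", "outkast")
)

if have_deps:
    import pandas as pd
    from outkast import secc_caste

    df = pd.DataFrame({"name": ["patel", "zala"]})

    # Restrict the estimates to one state and one birth year (illustrative values).
    result = secc_caste(df, "name", state="gujarat", year=1980)
    print(result)
else:
    print("pandas and outkast are required to run this example")
```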


Authors
~~~~~~~

Suriyan Laohaprapanon and Gaurav Sood

License
~~~~~~~

The package is released under the `MIT
License <https://opensource.org/licenses/MIT>`__.

            
