instate

Name	instate
Version	0.1.7
home_page	https://github.com/appeler/instate
Summary	Instate: predict the state of residence from last name
upload_time	2024-08-19 00:16:57
maintainer	None
docs_url	None
author	Atul Dhingra, Gaurav Sood and Rajashekar Chintalapati
requires_python	None
license	MIT
keywords	predict the state of residence from last name
VCS
bugtrack_url
requirements	pandas torch typing pytest Levenshtein
Travis-CI	No Travis.
coveralls test coverage	No coveralls.

instate: predict the state of residence from last name
=============================================================

.. image:: https://img.shields.io/pypi/v/instate.svg
    :target: https://pypi.python.org/pypi/instate
.. image:: https://readthedocs.org/projects/instate/badge/?version=latest
    :target: http://instate.readthedocs.io/en/latest/?badge=latest
    :alt: Documentation Status
.. image:: https://static.pepy.tech/badge/instate
    :target: https://pepy.tech/project/instate


Using the Indian electoral rolls data (2017), we provide a Python package that takes the last name of a person and gives its distribution across states.
This package can also predict the spoken language of the person based on the last name.

Potential Use Cases
---------------------
India has 22 official languages. Serving such a diverse language base is a challenge for businesses and surveyors. When a business has access only to a person's last name, and no other data that would let it model the person's spoken language, the distribution of last names across states is the best available signal.

Dataset
---------
Refer to `lastname_langs_india.csv.tar.gz <https://github.com/appeler/instate/blob/main/instate/data/lastname_langs_india.csv.tar.gz>`__ for the dataset used to predict or look up the spoken language based on the last name.

Refer to `lastname_langs_india_top3.csv.tar.gz <https://github.com/appeler/instate/blob/main/instate/data/lastname_langs_india_top3.csv.tar.gz>`__ for the dataset used to predict the top-3 spoken languages based on the last name. An LSTM model has been trained on this dataset to predict the top-3 spoken languages.

Refer to `notebooks <https://github.com/appeler/instate/tree/main/instate/notebooks>`__ for the notebooks that were used to prepare the above datasets and train the models.
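
The datasets above ship as gzip-compressed tar archives that wrap a CSV file. A minimal sketch of reading such an archive with the standard library follows. The helper name ``read_csv_from_targz`` and the column names in the demo are hypothetical and are not part of instate or of the real dataset schema.

```python
import csv
import io
import tarfile


def read_csv_from_targz(fileobj):
    """Return the rows of the first .csv member of a .tar.gz archive as dicts."""
    with tarfile.open(fileobj=fileobj, mode="r:gz") as archive:
        for member in archive.getmembers():
            if member.isfile() and member.name.endswith(".csv"):
                raw = archive.extractfile(member)
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                return list(csv.DictReader(text))
    raise ValueError("no .csv member found in the archive")


# Build a tiny in-memory archive so the sketch runs without downloading anything.
# The column names below are hypothetical, not the real dataset schema.
payload = b"last_name,lang\nsood,hindi\n"
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
    info = tarfile.TarInfo(name="sample.csv")
    info.size = len(payload)
    archive.addfile(info, io.BytesIO(payload))
buffer.seek(0)

rows = read_csv_from_targz(buffer)
print(rows)
```

For a downloaded file, open it in binary mode and pass the file object to the same helper.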

Web UI
--------------
Streamlit app: https://appeler-instate-streamlitstreamlit-app-e39m4c.streamlit.app/

Installation
-------------
We strongly recommend installing ``instate`` inside a Python virtual environment
(see `venv documentation <https://docs.python.org/3/library/venv.html#creating-virtual-environments>`__).

::

    pip install instate

Examples
--------
::

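  import pandas as pd
  from instate import last_state

  # "last_name" stands for the column holding the last names; adjust it to match your CSV
  last_dat = pd.read_csv("last_dat.csv")
  last_state_dat = last_state(last_dat, "last_name")
  print(last_state_dat)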

API
----------

instate exposes 5 functions. 

- **last_state**

    - takes a pandas DataFrame and the name of the column holding the last names, and returns a DataFrame with 31 additional columns, one for each state for which we have data.

::
    
    import pandas as pd
    from instate import last_state
    df = pd.DataFrame({'last_name': ['Dhingra', 'Sood', 'Gowda']})
    last_state(df, "last_name").iloc[:, : 5]
        
        last_name   __last_name andaman     andhra      arunachal
    0   Dhingra     dhingra     0.001737    0.000744    0.000000
    1   Sood        sood        0.000258    0.002492    0.000043
    2   Gowda       gowda       0.000000    0.528533    0.000000
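
The state columns above read as proportions. A minimal sketch of how such a row could be derived from raw counts is shown below, assuming each value is the share of electoral-roll entries with that last name found in a given state. The function ``state_distribution`` and the counts are made up for illustration and do not come from the package.

```python
def state_distribution(counts):
    """Convert raw per-state counts for one last name into proportions."""
    total = sum(counts.values())
    if total == 0:
        # A name that never occurs gets an all-zero row instead of a division error.
        return {state: 0.0 for state in counts}
    return {state: n / total for state, n in counts.items()}


# Hypothetical counts for a single last name.
counts = {"andhra": 30, "delhi": 60, "punjab": 10}
dist = state_distribution(counts)
print(dist)
```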

- **pred_last_state**
    
    - takes a pandas DataFrame and the name of the column holding the last names, and returns a DataFrame with 1 more column (pred_state) holding the top-3 predictions from the GRU model.

::
    
    import pandas as pd
    from instate import pred_last_state
    df = pd.DataFrame({'last_name': ['Dhingra', 'Sood', 'Gowda']})
    pred_last_state(df, "last_name")

        last_name   pred_state
    0   dhingra     [Daman and Diu, Andaman and Nicobar Islands, Puducherry]
    1   sood        [Meghalaya, Chandigarh, Punjab]
    2   gowda       [Puducherry, Nagaland, Daman and Diu]
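
A top-3 list like the one in ``pred_state`` can be obtained from any per-state score vector by keeping the three highest-scoring labels. The sketch below illustrates only that selection step. It does not reproduce the GRU model, and the scores and the helper ``top_k_labels`` are hypothetical.

```python
import heapq


def top_k_labels(scores, k=3):
    """Return the k labels with the highest scores, best first."""
    return heapq.nlargest(k, scores, key=scores.get)


# Hypothetical model scores for one last name.
scores = {"Punjab": 0.41, "Chandigarh": 0.22, "Meghalaya": 0.05, "Delhi": 0.19}
top3 = top_k_labels(scores)
print(top3)
```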

- **state_to_lang**

    - takes a pandas DataFrame and the name of the column holding the state, and appends census mappings from state to languages.

::

    import pandas as pd
    from instate import last_state, state_to_lang
    df = pd.DataFrame({'last_name': ['dhingra', 'sood', 'gowda']})
    state_last = last_state(df, "last_name")
    # keep only the state columns, then pick the most likely state per row
    small_state = state_last.loc[:, "andaman":"utt"]
    state_last["modal_state"] = small_state.idxmax(axis = 1)
    state_to_lang(state_last, "modal_state")[["last_name", "modal_state", "official_languages"]]

        last_name   modal_state official_languages
    0   dhingra     delhi       Hindi, English
    1   sood        punjab      Punjabi
    2   gowda       andhra      Telugu
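
The same idea in plain Python, as a sketch: pick the state with the largest share and look it up in a state-to-language table. The three mapping entries are taken from the output above; the distribution values and the helper names ``modal_state`` and ``modal_language`` are hypothetical and not part of instate.

```python
# State-to-language entries copied from the example output above.
OFFICIAL_LANGUAGES = {
    "delhi": "Hindi, English",
    "punjab": "Punjabi",
    "andhra": "Telugu",
}


def modal_state(dist):
    """Return the state with the highest share in a name's distribution."""
    return max(dist, key=dist.get)


def modal_language(dist, mapping=OFFICIAL_LANGUAGES):
    """Map the modal state to its languages, or None if the state is unmapped."""
    return mapping.get(modal_state(dist))


# Hypothetical distribution for one last name.
dist_example = {"andhra": 0.53, "delhi": 0.02, "punjab": 0.01}
print(modal_state(dist_example), modal_language(dist_example))
```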


- **lookup_lang**

    - takes a pandas DataFrame and the name of the column holding the last names, and returns a DataFrame with 1 more column (predicted_lang), reflecting the most spoken language in the state. This method finds the nearest names and then looks them up in the dataset to find the most spoken language.

::
    
    import pandas as pd
    from instate import lookup_lang
    df = pd.DataFrame({'last_name': ['sood', 'chintalapati']})
    lookup_lang(df, "last_name")

          last_name predicted_lang
    0          sood          hindi
    1  chintalapati         telugu
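
A nearest-name lookup can be sketched with edit distance. The package lists Levenshtein among its requirements, but its exact matching rule is not spelled out here, so the code below is an assumption-laden illustration rather than the package's method. The two table entries come from the output above; ``edit_distance`` and ``nearest_lang`` are hypothetical helpers.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def nearest_lang(name, table):
    """Return the language of the exact match, else of the closest known name."""
    key = name.lower()
    if key in table:
        return table[key]
    closest = min(table, key=lambda known: edit_distance(key, known))
    return table[closest]


# Name-to-language entries copied from the example output above.
table = {"sood": "hindi", "chintalapati": "telugu"}
print(nearest_lang("Chintalapaty", table))
```

A real lookup would likely also cap the allowed distance so that very dissimilar names are left unmatched.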

- **predict_lang**

    - takes a pandas DataFrame and the name of the column holding the last names, and returns a DataFrame with 1 more column (predicted_lang) holding the top-3 predicted languages. Unlike the lookup, this method predicts the languages from the name itself using the trained model.

::
    
    import pandas as pd
    from instate import predict_lang
    df = pd.DataFrame({'last_name': ['sood', 'chintalapati']})
    predict_lang(df, "last_name")

          last_name predicted_lang
    0          sood   [hindi, punjabi, urdu]
    1  chintalapati  [telugu, urdu, chenchu]
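
Sequence models such as the LSTM mentioned earlier typically consume names as fixed-length sequences of character indices. The sketch below shows one common encoding scheme. The vocabulary, the padding and unknown indices, and ``max_len`` are assumptions for illustration and may differ from what the package's model actually uses.

```python
import string

PAD, UNK = 0, 1  # hypothetical reserved indices for padding and unknown characters
CHAR_TO_INDEX = {ch: i + 2 for i, ch in enumerate(string.ascii_lowercase)}


def encode_name(name, max_len=12):
    """Lowercase a name, map characters to indices, then truncate or pad to max_len."""
    ids = [CHAR_TO_INDEX.get(ch, UNK) for ch in name.lower()[:max_len]]
    return ids + [PAD] * (max_len - len(ids))


encoded = encode_name("Sood", max_len=6)
print(encoded)
```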

Data
----

The underlying data for the package can be accessed at: https://doi.org/10.7910/DVN/ZXMVTJ

Evaluation
----------

The state prediction model has a top-3 accuracy of 85.3% on `unseen names <https://github.com/appeler/instate/blob/main/instate/models/model_dnn_gpu.ipynb>`__. The KNN model also does quite well; see the details `here <https://github.com/appeler/instate/blob/main/instate/models/KNN_cosine_distance_simple_avg_modal_state.ipynb>`__.
The name-to-language lookup has an accuracy of 67.9%.
The name-to-language model prediction has an accuracy of 72.2%.
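
Top-3 accuracy counts a prediction as correct when the true label appears anywhere among the three returned candidates. A minimal sketch of that metric follows. The prediction lists are taken from the ``predict_lang`` example above, while the true labels and the helper ``top_k_accuracy`` are hypothetical and unrelated to the reported figures.

```python
def top_k_accuracy(predictions, truths, k=3):
    """Fraction of items whose true label is among the first k predictions."""
    if len(predictions) != len(truths):
        raise ValueError("predictions and truths must have the same length")
    if not truths:
        return 0.0
    hits = sum(truth in preds[:k] for preds, truth in zip(predictions, truths))
    return hits / len(truths)


predictions = [["hindi", "punjabi", "urdu"], ["telugu", "urdu", "chenchu"]]
truths = ["punjabi", "tamil"]  # hypothetical ground truth
accuracy = top_k_accuracy(predictions, truths)
print(accuracy)
```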

Authors
-------

Atul Dhingra, Gaurav Sood and Rajashekar Chintalapati

Contributor Code of Conduct
---------------------------------

The project welcomes contributions from everyone! In fact, it depends on
it. To maintain this welcoming atmosphere, and to collaborate in a fun
and productive way, we expect contributors to the project to abide by
the `Contributor Code of
Conduct <http://contributor-covenant.org/version/1/0/0/>`__.

License
----------

The package is released under the `MIT
License <https://opensource.org/licenses/MIT>`__.

Raw data

{
    "_id": null,
    "home_page": "https://github.com/appeler/instate",
    "name": "instate",
    "maintainer": null,
    "docs_url": null,
    "requires_python": null,
    "maintainer_email": null,
    "keywords": "predict the state of residence from last name",
    "author": "Atul Dhingra, Gaurav Sood and Rajashekar Chintalapati",
    "author_email": "dhingra.atul92@gmail.com, gsood07@gmail.com, rajshekar.ch@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/1c/30/fb3760c6e0f03341c0efd83ab3533d5197eb98e9e8a8d2c0d12af604e8c7/instate-0.1.7.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Instate: predict the state of residence from last name",
    "version": "0.1.7",
    "project_urls": {
        "Homepage": "https://github.com/appeler/instate"
    },
    "split_keywords": [
        "predict",
        "the",
        "state",
        "of",
        "residence",
        "from",
        "last",
        "name"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "2289689a9b915579d101ab5305de7824e9effea8afa68f91ddba1aed4fa8517c",
                "md5": "e9bd96439ed302ce6cdabe33ec85f870",
                "sha256": "3f875317682db298fcd7fe44684be296e4d1560e85a9ea93bc2d63e7afc60d2f"
            },
            "downloads": -1,
            "filename": "instate-0.1.7-py2.py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "e9bd96439ed302ce6cdabe33ec85f870",
            "packagetype": "bdist_wheel",
            "python_version": "py2.py3",
            "requires_python": null,
            "size": 7889447,
            "upload_time": "2024-08-19T00:16:54",
            "upload_time_iso_8601": "2024-08-19T00:16:54.304997Z",
            "url": "https://files.pythonhosted.org/packages/22/89/689a9b915579d101ab5305de7824e9effea8afa68f91ddba1aed4fa8517c/instate-0.1.7-py2.py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "1c30fb3760c6e0f03341c0efd83ab3533d5197eb98e9e8a8d2c0d12af604e8c7",
                "md5": "2b988d914c55ebffc594c92b6e6877f1",
                "sha256": "33a8a1f666b76f3d244e59453350da1a95e663438d8573965aedb3377b1f5b0a"
            },
            "downloads": -1,
            "filename": "instate-0.1.7.tar.gz",
            "has_sig": false,
            "md5_digest": "2b988d914c55ebffc594c92b6e6877f1",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 7893551,
            "upload_time": "2024-08-19T00:16:57",
            "upload_time_iso_8601": "2024-08-19T00:16:57.159531Z",
            "url": "https://files.pythonhosted.org/packages/1c/30/fb3760c6e0f03341c0efd83ab3533d5197eb98e9e8a8d2c0d12af604e8c7/instate-0.1.7.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-08-19 00:16:57",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "appeler",
    "github_project": "instate",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [
        {
            "name": "pandas",
            "specs": []
        },
        {
            "name": "torch",
            "specs": [
                [
                    "==",
                    "1.13.1"
                ]
            ]
        },
        {
            "name": "typing",
            "specs": []
        },
        {
            "name": "pytest",
            "specs": []
        },
        {
            "name": "Levenshtein",
            "specs": []
        }
    ],
    "tox": true,
    "lcname": "instate"
}
