twitter-demographer

Name	twitter-demographer
Version	0.2.1
home_page	https://github.com/MilaNLProc/twitter-demographer
Summary	Twitter Demographer
upload_time	2023-01-10 19:12:02
maintainer
docs_url	None
author	Federico Bianchi
requires_python	>=3.6
license	MIT license
keywords	twitter_demographer
VCS
bugtrack_url
requirements	pandas sklearn scipy tqdm tweepy liwc empath transformers torch m3inference appdirs geocoder datasets contextualized-topic-models umap hdbscan
Travis-CI	No Travis.
coveralls test coverage	No coveralls.

===================
Twitter Demographer
===================


.. image:: https://img.shields.io/pypi/v/twitter-demographer.svg
        :target: https://pypi.python.org/pypi/twitter-demographer

.. image:: https://github.com/MilaNLProc/twitter-demographer/workflows/Python%20package/badge.svg
        :target: https://github.com/MilaNLProc/twitter-demographer/actions

.. image:: https://readthedocs.org/projects/twitter-demographer/badge/?version=latest
        :target: https://twitter-demographer.readthedocs.io/en/latest/?version=latest
        :alt: Documentation Status

.. image:: https://colab.research.google.com/assets/colab-badge.svg
    :target: https://colab.research.google.com/drive/1nk532mQS1MDAu_J3FpVTxPg21C5r44SE?usp=sharing
    :alt: Open In Colab


Twitter Demographer provides a simple API to enrich your Twitter data with additional variables such as sentiment, user location,
gender, and age. The tool is extensible: you can add your own components to the system.


.. image:: https://raw.githubusercontent.com/MilaNLProc/twitter-demographer/main/img/twitter-demographer.gif
   :width: 600pt


* Free software: MIT license
* Documentation: https://twitter-demographer.readthedocs.io.

**Note:** the API is still under development (e.g., there is a lot of logging going on behind the scenes). Feel free to
suggest improvements or submit PRs! We are also working on improving the documentation and adding more examples.

If you find this useful, please remember to cite the following paper:

.. code-block:: bibtex

    @article{bianchi2022twitter,
      title={Twitter-Demographer: A Flow-based Tool to Enrich Twitter Data},
      author={Bianchi, Federico and Cutrona, Vincenzo and Hovy, Dirk},
      journal={EMNLP},
      year={2022}
    }



Features
--------

Starting from a simple set of tweet IDs, Twitter Demographer lets you rehydrate them (retrieve the full tweets) and add
further variables to your dataset.

No component is mandatory. The tool is designed to be modular, so you decide what to add and what to remove.

Here is an example: you have a set of tweet IDs (from English speakers) and you want to:

+ reconstruct the original tweets
+ disambiguate the location of the users
+ predict the sentiment of the tweet.

With this library, this takes only a few lines of code.

.. code-block:: python

    from twitter_demographer.twitter_demographer import Demographer
    from twitter_demographer.components import Rehydrate
    from twitter_demographer.geolocation.nominatim import NominatimDecoder
    from twitter_demographer.classification.transformers import HuggingFaceClassifier
    import pandas as pd

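    # Placeholder: replace with your own Twitter API bearer token
    BEARER_TOKEN = "YOUR_BEARER_TOKEN"

    demo = Demographer()

    data = pd.DataFrame({"tweet_ids": ["1477976329710673921", "1467887350084689928", "1467887352647462912", "1290664307370360834", "1465284810696445952"]})

    # 1) rehydrate the tweets, 2) geolocate the users, 3) classify the sentiment
    component_one = Rehydrate(BEARER_TOKEN)
    component_two = NominatimDecoder()
    component_three = HuggingFaceClassifier("cardiffnlp/twitter-roberta-base-sentiment")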


    demo.add_component(component_one)
    demo.add_component(component_two)
    demo.add_component(component_three)

    print(demo.infer(data))

.. code-block:: text

                                             screen_name                created_at  ... geo_location_address cardiffnlp/twitter-roberta-base-sentiment
    1  ef51346744a099e011ff135f7b223186d4dab4d38bb1d8... 2021-12-06 16:03:10+00:00  ...                Milan                                         1
    4  146effc0d60c026197afe2404c4ee35dfb07c7aeb33720... 2021-11-29 11:41:37+00:00  ...                Milan                                         2
    2  ef51346744a099e011ff135f7b223186d4dab4d38bb1d8... 2021-12-06 16:03:11+00:00  ...                Milan                                         1
    0  241b67c6c698a70b18533ea7d4196e6b8f8eafd39afc6a... 2022-01-03 12:13:11+00:00  ...               Zurich                                         2
    3  df94741e2317dc8bfca7506f575ba3bd9a83deabfd9eec... 2020-08-04 15:02:04+00:00  ...            Viganello                                         2

Note that you still need to register for a Twitter developer account and for GeoNames to use the respective services.
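
Once the pipeline has run, the enriched rows can be summarized in aggregate. The sketch below is only an illustration and is not part of the library: the helper ``sentiment_by_location`` is hypothetical, the rows mirror the sample output above (as you might obtain with pandas' ``DataFrame.to_dict("records")``), and it assumes the integers in the sentiment column are the class indices of ``cardiffnlp/twitter-roberta-base-sentiment`` (0 negative, 1 neutral, 2 positive).

.. code-block:: python

    from collections import Counter

    # Assumption: the classifier column holds the model's class indices
    SENTIMENT_NAMES = {0: "negative", 1: "neutral", 2: "positive"}
    SENTIMENT_COLUMN = "cardiffnlp/twitter-roberta-base-sentiment"

    def sentiment_by_location(rows):
        """Count (location, sentiment) pairs, skipping rows with missing values."""
        counts = Counter()
        for row in rows:
            location = row.get("geo_location_address")
            sentiment = SENTIMENT_NAMES.get(row.get(SENTIMENT_COLUMN))
            # Skip rows where geolocation or classification produced nothing usable
            if not isinstance(location, str) or sentiment is None:
                continue
            counts[(location, sentiment)] += 1
        return counts

    # Rows copied from the sample output above (only the two relevant columns)
    rows = [
        {"geo_location_address": "Milan", SENTIMENT_COLUMN: 1},
        {"geo_location_address": "Milan", SENTIMENT_COLUMN: 2},
        {"geo_location_address": "Milan", SENTIMENT_COLUMN: 1},
        {"geo_location_address": "Zurich", SENTIMENT_COLUMN: 2},
        {"geo_location_address": "Viganello", SENTIMENT_COLUMN: 2},
    ]

    summary = sentiment_by_location(rows)
    for (location, sentiment), n in sorted(summary.items()):
        print(location, sentiment, n)

Working on counts per group, rather than on individual rows, fits the aggregate style of analysis the tool is meant for.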

Privacy Matters
---------------

Following the recommendations of the EU's General Data Protection Regulation, we implement a variety of measures to ensure pseudo-anonymity by design. Twitter Demographer provides several built-in measures to remove identifying information and protect user privacy:

+ removing identifiers
+ unidirectional hashing
+ aggregate label swapping.

These measures do not compromise the value of aggregate analysis, but they allow for fairer use of the data.
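
To make the three measures more concrete, here is a minimal sketch using only the Python standard library. It is not the library's implementation: the function names, the ``user_id`` field, the optional salt, and the pairwise reading of "label swapping" (exchanging labels between randomly chosen records so that the overall label counts stay unchanged) are all assumptions made for illustration.

.. code-block:: python

    import hashlib
    import random

    def hash_identifier(value, salt=""):
        """One-way SHA-256 digest of an identifier, returned as a hex string."""
        return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()

    def pseudonymize(records, id_field="screen_name", drop_fields=("user_id",), salt=""):
        """Drop direct identifiers and replace the remaining one with its hash."""
        cleaned = []
        for record in records:
            out = {k: v for k, v in record.items() if k not in drop_fields}
            out[id_field] = hash_identifier(out[id_field], salt)
            cleaned.append(out)
        return cleaned

    def swap_labels(labels, fraction=0.1, seed=0):
        """Exchange labels between random pairs of positions.

        Roughly `fraction` of the positions are involved; the multiset of
        labels (and thus any aggregate count) is unchanged.
        """
        rng = random.Random(seed)
        swapped = list(labels)
        n_pairs = int(len(swapped) * fraction / 2)
        positions = rng.sample(range(len(swapped)), 2 * n_pairs)
        for i, j in zip(positions[::2], positions[1::2]):
            swapped[i], swapped[j] = swapped[j], swapped[i]
        return swapped

    # Hypothetical records, for illustration only
    records = [
        {"user_id": 1, "screen_name": "alice", "label": "a"},
        {"user_id": 2, "screen_name": "bob", "label": "b"},
    ]
    cleaned = pseudonymize(records, salt="project-secret")

The same user always maps to the same digest, so per-user grouping still works after hashing, while the original screen name cannot be read back from the dataset.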

Extending
---------

The library is also extensible. Say you want to run a custom classifier on some Twitter data you have: for example, you might want to
detect the sentiment of the data using your own model.

.. code-block:: python

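    # Assumed import path (Rehydrate is imported from this module above);
    # adjust if Component and not_null live elsewhere in your version.
    from twitter_demographer.components import Component, not_null

    class YourClassifier(Component):
        def __init__(self, model):
            self.model = model
            super().__init__()

        def inputs(self):
            return ["text"]

        def outputs(self):
            return ["my_classifier"]

        # The not_null decorator skips records that have None in the given field
        @not_null("text")
        def infer(self, data):
            # Use the model stored on the instance
            return {"my_classifier": self.model.predict(data["text"])}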
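
To give an intuition for what a decorator such as ``not_null`` does, the following is a hypothetical stand-in, not the library's actual code. It assumes the data is a plain dictionary of equally long lists: records whose field is ``None`` are left out before the wrapped ``infer`` runs, and their outputs are filled with ``None`` afterwards. The names ``skip_null`` and ``UpperCase`` are invented for this sketch.

.. code-block:: python

    import functools

    def skip_null(field):
        """Hypothetical decorator: run infer only on records where `field` is set."""
        def decorator(infer):
            @functools.wraps(infer)
            def wrapper(self, data):
                n_records = len(data[field])
                # Positions of the records that actually have a value
                keep = [i for i, value in enumerate(data[field]) if value is not None]
                subset = {name: [values[i] for i in keep] for name, values in data.items()}
                partial = infer(self, subset)
                # Re-align the outputs with the original records
                full = {}
                for name, values in partial.items():
                    column = [None] * n_records
                    for i, value in zip(keep, values):
                        column[i] = value
                    full[name] = column
                return full
            return wrapper
        return decorator

    class UpperCase:
        """Toy component that would fail on None without the decorator."""

        @skip_null("text")
        def infer(self, data):
            return {"upper": [text.upper() for text in data["text"]]}

    result = UpperCase().infer({"text": ["hi", None, "yo"]})
    print(result)

The component's own ``infer`` stays free of missing-value checks, which keeps custom components short.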

Components
----------

Twitter Demographer is built from components that can be chained together into pipelines. For example, the
``GeoNamesDecoder``, which predicts the location of a user from a string of text, looks like this:

.. code-block:: python

    import geocoder

    class GeoNamesDecoder(Component):

        def __init__(self, key):
            super().__init__()
            self.key = key

        def outputs(self):
            return ["geo_location_country", "geo_location_address"]

        def inputs(self):
            return ["location"]

        @not_null("location")
        def infer(self, data):
            geo = self.initialize_return_dict()
            # Query GeoNames once per free-text location string
            for val in data["location"]:
                g = geocoder.geonames(val, key=self.key)
                geo["geo_location_country"].append(g.country)
                geo["geo_location_address"].append(g.address)
            return geo
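
The way such components chain together can be illustrated with a toy pipeline. Everything below is a simplified sketch under stated assumptions, not the ``Demographer`` implementation: ``ToyComponent``, ``ToyPipeline``, ``Lowercase`` and ``TokenCount`` are hypothetical names, and the data is a plain dictionary of lists rather than a DataFrame. Each component declares the fields it needs and the fields it produces, and the pipeline checks the former before merging the latter.

.. code-block:: python

    class ToyComponent:
        """Hypothetical base class: declares required inputs and produced outputs."""

        def inputs(self):
            return []

        def outputs(self):
            return []

        def infer(self, data):
            raise NotImplementedError


    class Lowercase(ToyComponent):
        def inputs(self):
            return ["text"]

        def outputs(self):
            return ["text_lower"]

        def infer(self, data):
            return {"text_lower": [t.lower() for t in data["text"]]}


    class TokenCount(ToyComponent):
        def inputs(self):
            return ["text_lower"]

        def outputs(self):
            return ["n_tokens"]

        def infer(self, data):
            return {"n_tokens": [len(t.split()) for t in data["text_lower"]]}


    class ToyPipeline:
        def __init__(self):
            self.components = []

        def add_component(self, component):
            self.components.append(component)

        def infer(self, data):
            # Work on a copy so the caller's data is left untouched
            data = {name: list(values) for name, values in data.items()}
            for component in self.components:
                missing = [f for f in component.inputs() if f not in data]
                if missing:
                    raise KeyError(f"{type(component).__name__} is missing inputs: {missing}")
                result = component.infer(data)
                # Add the declared outputs as new columns
                for name in component.outputs():
                    data[name] = result[name]
            return data


    pipeline = ToyPipeline()
    pipeline.add_component(Lowercase())
    pipeline.add_component(TokenCount())
    enriched = pipeline.infer({"text": ["Hello World", "Ciao"]})
    print(enriched)

Because each step only reads declared inputs, components can be added, removed, or reordered as long as every input is produced by an earlier step.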

Current Components
------------------

The project and the components are still under development and we are working on introducing novel pipelines to support
different use-cases.

You can see the components currently integrated in the system `here <https://twitter-demographer.readthedocs.io/en/latest/components.html>`__.

+------------------------------+-------------------------------------------------+
| Name                         |  Tool                                           |
+==============================+=================================================+
| Geolocation                  |  GeoNames, OpenStreetMap                        |
+------------------------------+-------------------------------------------------+
| HateSpeech                   |  Perspective API                                |
+------------------------------+-------------------------------------------------+
| Classification               |  Support for all HuggingFace Classifiers        |
+------------------------------+-------------------------------------------------+
| Demographics                 | M3Inference, FairFace (Coming Soon)             |
+------------------------------+-------------------------------------------------+
| Topic Modeling               | Contextualized Topic Modeling                   |
+------------------------------+-------------------------------------------------+


Limitations and Ethical Considerations
--------------------------------------

Twitter Demographer does not come without limitations.
Some of these are related to the precision of the components used; for example, the GeoNames decoder can fail to disambiguate a location, even though it has been adopted by other researchers and services. Likewise, the topic modeling pipeline can be affected by the number of tweets used to train the model and by other training issues (for example, fixing random seeds can yield suboptimal solutions).

The tool wraps the M3 API for age and gender prediction. However, its gender predictions are binary (male or female) and thus give a stereotyped representation of gender. Our intent is not to make normative claims about gender, as this is far from our beliefs. Twitter Demographer allows using other, more flexible tools. The API needs both the text and the user profile picture associated with a tweet to make inferences; for that reason, the tool has to include this information in the dataset during pipeline execution. While this information is public (e.g., user profile pictures), the final dataset also contains inferred information that may not be publicly available (e.g., the gender or age of the user). We cannot completely prevent misuse of this capability, but we have taken steps to substantially reduce the risk and promote privacy by design.

Inferring user attributes carries the risk of privacy violations. We follow the definitions and recommendations of the European Union's General Data Protection Regulation for algorithmic pseudo-anonymity. We implement several measures to break a direct mapping between attributes and identifiable users without reducing the generalizability of aggregate findings on the data.
Our measures follow the GDPR definition of a "motivated intruder", i.e., undoing our privacy protection measures requires "significant effort". However, given enough determination and resources, a bad actor might still be able to circumvent or reverse-engineer these measures. This is true independently of Twitter Demographer, though, as existing tools could be used more easily to achieve those goals.
Using the tool provides practitioners with a reasonable way to protect anonymity.

Credits
-------

This package was created with Cookiecutter_ and the `audreyr/cookiecutter-pypackage`_ project template.

.. _Cookiecutter: https://github.com/audreyr/cookiecutter
.. _`audreyr/cookiecutter-pypackage`: https://github.com/audreyr/cookiecutter-pypackage


=======
History
=======

0.1.0 (2021-12-16)
------------------

* First release on PyPI.

Raw data

{
    "_id": null,
    "home_page": "https://github.com/MilaNLProc/twitter-demographer",
    "name": "twitter-demographer",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.6",
    "maintainer_email": "",
    "keywords": "twitter_demographer",
    "author": "Federico Bianchi",
    "author_email": "f.bianchi@unibocconi.it",
    "download_url": "https://files.pythonhosted.org/packages/0f/20/df0d66abcdfedb61423ab5ef8349d8d9bb509e8c370b23d2436f42ccb101/twitter_demographer-0.2.1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT license",
    "summary": "Twitter Demographer",
    "version": "0.2.1",
    "split_keywords": [
        "twitter_demographer"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "779fbd6d92a1ba6375c86bf11a418a43939f5a98f09d909567be340bee7d7301",
                "md5": "d82c2d621fb2f77e43efc5a3a0b28a1c",
                "sha256": "e68d7bb21c1b34a623166023b073c39f79073fb7241c36ceb0b03097e587648a"
            },
            "downloads": -1,
            "filename": "twitter_demographer-0.2.1-py2.py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "d82c2d621fb2f77e43efc5a3a0b28a1c",
            "packagetype": "bdist_wheel",
            "python_version": "py2.py3",
            "requires_python": ">=3.6",
            "size": 20572,
            "upload_time": "2023-01-10T19:12:01",
            "upload_time_iso_8601": "2023-01-10T19:12:01.042332Z",
            "url": "https://files.pythonhosted.org/packages/77/9f/bd6d92a1ba6375c86bf11a418a43939f5a98f09d909567be340bee7d7301/twitter_demographer-0.2.1-py2.py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "0f20df0d66abcdfedb61423ab5ef8349d8d9bb509e8c370b23d2436f42ccb101",
                "md5": "3722c4d15cba985ae6d7b0ec5fa0d97e",
                "sha256": "bd2a08bc56cecf730bddd0a9afe8a03db134f5734a60386b3880570d669d4c3e"
            },
            "downloads": -1,
            "filename": "twitter_demographer-0.2.1.tar.gz",
            "has_sig": false,
            "md5_digest": "3722c4d15cba985ae6d7b0ec5fa0d97e",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.6",
            "size": 26090,
            "upload_time": "2023-01-10T19:12:02",
            "upload_time_iso_8601": "2023-01-10T19:12:02.304638Z",
            "url": "https://files.pythonhosted.org/packages/0f/20/df0d66abcdfedb61423ab5ef8349d8d9bb509e8c370b23d2436f42ccb101/twitter_demographer-0.2.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-01-10 19:12:02",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "github_user": "MilaNLProc",
    "github_project": "twitter-demographer",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [
        {
            "name": "pandas",
            "specs": []
        },
        {
            "name": "sklearn",
            "specs": []
        },
        {
            "name": "scipy",
            "specs": []
        },
        {
            "name": "tqdm",
            "specs": []
        },
        {
            "name": "tweepy",
            "specs": []
        },
        {
            "name": "liwc",
            "specs": [
                [
                    "==",
                    "0.5.0"
                ]
            ]
        },
        {
            "name": "empath",
            "specs": []
        },
        {
            "name": "transformers",
            "specs": [
                [
                    ">=",
                    "4.19.2"
                ]
            ]
        },
        {
            "name": "torch",
            "specs": [
                [
                    "==",
                    "1.9.0"
                ]
            ]
        },
        {
            "name": "m3inference",
            "specs": [
                [
                    "==",
                    "1.1.5"
                ]
            ]
        },
        {
            "name": "appdirs",
            "specs": [
                [
                    "==",
                    "1.4.4"
                ]
            ]
        },
        {
            "name": "geocoder",
            "specs": []
        },
        {
            "name": "datasets",
            "specs": []
        },
        {
            "name": "contextualized-topic-models",
            "specs": []
        },
        {
            "name": "umap",
            "specs": []
        },
        {
            "name": "hdbscan",
            "specs": []
        }
    ],
    "lcname": "twitter-demographer"
}
