disaggregators


| Field | Value |
| --- | --- |
| Name | disaggregators |
| Version | 0.1.2 |
| home_page | https://github.com/NimaBoscarino/disaggregators |
| Summary | HuggingFace community-driven open-source library for dataset disaggregation |
| upload_time | 2022-12-12 16:56:25 |
| maintainer | |
| docs_url | None |
| author | HuggingFace Inc. |
| requires_python | >=3.7.0 |
| license | Apache 2.0 |
| keywords | machine learning evaluate evaluation disaggregation |
| VCS | |
| bugtrack_url | |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |

            <p align="center">
    <br>
    <img alt="Hugging Face Disaggregators" src="https://user-images.githubusercontent.com/6765188/206785111-b7724be3-6460-4092-9561-9fc2cd522320.png" width="400"/>
    <br>
</p>

<p align="center">
    <a href="https://huggingface.co/spaces/society-ethics/disaggregators">
        <img alt="Hugging Face Spaces demo" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face%20Spaces-Demo-blue">
    </a>
    <a href="https://github.com/huggingface/disaggregators/releases">
        <img alt="GitHub release" src="https://img.shields.io/github/release/huggingface/disaggregators.svg">
    </a>
</p>

> ⚠️ Please note: This library is in early development, and the disaggregation modules that are included are proofs of concept that are _not_ production-ready. Additionally, all APIs are subject to breaking changes at any time before a 1.0.0 release. Rigorously tested versions of the included modules will be released in the future, so stay tuned. [We'd love your feedback in the meantime!](https://github.com/huggingface/disaggregators/discussions/23)

The `disaggregators` library allows you to easily add new features to your datasets to enable disaggregated data exploration and disaggregated model evaluation. `disaggregators` is preloaded with disaggregation modules for text data, with image modules coming soon!

This library is intended to be used with [🤗 Datasets](https://github.com/huggingface/datasets), but should work with any other "mappable" interface to a dataset. 
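
To make "mappable" concrete, here is a minimal sketch that does not use `disaggregators` at all. It assumes a disaggregator behaves like a callable that takes one row (a dict) and returns a dict of new columns. `pronoun_flags`, `map_rows`, and the column names are hypothetical stand-ins for illustration, not part of the library.

```python
import re

def pronoun_flags(row, column="text"):
    """Hypothetical stand-in for a disaggregator: one row in, dict of new columns out."""
    words = set(re.findall(r"[a-z]+", row[column].lower()))
    return {
        "pronoun.she_her": bool(words & {"she", "her", "hers"}),
        "pronoun.he_him": bool(words & {"he", "him", "his"}),
        "pronoun.they_them": bool(words & {"they", "them", "their"}),
    }

def map_rows(rows, fn):
    """Hypothetical helper: merge fn's output into a copy of each row."""
    return [{**row, **fn(row)} for row in rows]

rows = [{"text": "They went to the park."}, {"text": "She called him."}]
mapped = map_rows(rows, pronoun_flags)
```

Any container that can apply a row-level function and merge the returned dict back into the row should fit the same pattern.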

## Requirements and Installation

`disaggregators` has been tested on Python 3.8, 3.9, and 3.10.

`pip install disaggregators` will fetch the latest release from PyPI.

Note that some disaggregation modules require extra dependencies, such as spaCy models, which may need to be installed manually. If these dependencies aren't installed, `disaggregators` will tell you how to install them.

To install directly from this GitHub repo, use the following command:
```shell
pip install git+https://github.com/huggingface/disaggregators.git
```

## Usage

You will likely want to use 🤗 Datasets with `disaggregators`.

```shell
pip install datasets
```

The snippet below loads the IMDB dataset from the Hugging Face Hub and initializes a disaggregator for "pronoun" that will run on the IMDB dataset's "text" column. If you would like to run multiple disaggregations, you can pass a list to the `Disaggregator` constructor (e.g. `Disaggregator(["pronoun", "gender"], column="text")`). We then use the 🤗 Datasets `map` method to apply the disaggregation to the dataset.

```python
from disaggregators import Disaggregator
from datasets import load_dataset

dataset = load_dataset("imdb", split="train")
disaggregator = Disaggregator("pronoun", column="text")

ds = dataset.map(disaggregator)  # New boolean columns are added for she/her, he/him, and they/them
```

The resulting dataset can now be used for data exploration and disaggregated model evaluation.
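
As a rough illustration of disaggregated evaluation, the sketch below computes accuracy separately for each boolean group column. It uses plain dicts rather than a real dataset; the `label` and `prediction` keys, the group column names, and the `disaggregated_accuracy` helper are all assumptions made for this example.

```python
from collections import defaultdict

def disaggregated_accuracy(rows, group_columns, label_key="label", pred_key="prediction"):
    """Accuracy over the rows where each boolean group column is True."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        for col in group_columns:
            if row[col]:
                total[col] += 1
                correct[col] += int(row[label_key] == row[pred_key])
    # Groups with no matching rows are omitted rather than reported as 0.
    return {col: correct[col] / total[col] for col in group_columns if total[col]}

rows = [
    {"label": 1, "prediction": 1, "she_her": True, "he_him": False},
    {"label": 0, "prediction": 1, "she_her": True, "he_him": True},
    {"label": 0, "prediction": 0, "she_her": False, "he_him": True},
    {"label": 1, "prediction": 1, "she_her": False, "he_him": True},
]
scores = disaggregated_accuracy(rows, ["she_her", "he_him"])
```

A row can belong to several groups at once, so the per-group counts need not sum to the dataset size.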

You can also run disaggregations on Pandas DataFrames with `.apply` and `.merge`:

```python
from disaggregators import Disaggregator
import pandas as pd
df = pd.DataFrame({"text": ["They went to the park."]})

disaggregator = Disaggregator("pronoun", column="text")

new_cols = df.apply(disaggregator, axis=1)
df = pd.merge(df, pd.json_normalize(new_cols), left_index=True, right_index=True)
```

### Available Disaggregation Modules

The following modules are currently available:

- `"age"`
- `"gender"`
- `"pronoun"`
- `"religion"`
- `"continent"`

Note that `disaggregators` is in active development, and that these (and future) modules are subject to changing interfaces and implementations at any time before a `1.0.0` release. Each module provides its own method for overriding the default configuration, with the general interface documented below.

### Module Configurations

Modules may make certain variables and functionality configurable. If you'd like to configure a module, import the module, its labels, and its config class. Then, override the labels and set the configuration as needed while instantiating the module. Once instantiated, you can pass the module to the `Disaggregator`. The example below shows this with the `Age` module.

```python
from disaggregators import Disaggregator
from disaggregators.disaggregation_modules.age import Age, AgeLabels, AgeConfig

class MeSHAgeLabels(AgeLabels):
    INFANT = "infant"
    CHILD_PRESCHOOL = "child_preschool"
    CHILD = "child"
    ADOLESCENT = "adolescent"
    ADULT = "adult"
    MIDDLE_AGED = "middle_aged"
    AGED = "aged"
    AGED_80_OVER = "aged_80_over"

age = Age(
    config=AgeConfig(
        labels=MeSHAgeLabels,
        ages=[list(MeSHAgeLabels)],
        breakpoints=[0, 2, 5, 12, 18, 44, 64, 79]
    ),
    column="question"
)

disaggregator = Disaggregator([age, "gender"], column="question")
```
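
The example above pairs eight labels with eight breakpoints. One plausible reading, sketched below, is that each breakpoint is the inclusive lower bound of the label at the same position. This is an assumption for illustration only: the boundary semantics of the real `Age` module are not documented here, and `age_to_label` is a hypothetical helper, not a library function.

```python
from bisect import bisect_right

LABELS = ["infant", "child_preschool", "child", "adolescent",
          "adult", "middle_aged", "aged", "aged_80_over"]
BREAKPOINTS = [0, 2, 5, 12, 18, 44, 64, 79]

def age_to_label(age, breakpoints=BREAKPOINTS, labels=LABELS):
    """Assumed semantics: breakpoints[i] is the inclusive lower bound of labels[i]."""
    if len(breakpoints) != len(labels):
        raise ValueError("expected one breakpoint per label")
    if age < breakpoints[0]:
        raise ValueError(f"age {age} is below the first breakpoint")
    # bisect_right counts the breakpoints <= age; the last of those owns the age.
    return labels[bisect_right(breakpoints, age) - 1]
```

Whatever the library's real rule is, keeping the breakpoints sorted and aligned one-to-one with the labels is what makes this kind of lookup unambiguous.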

### Custom Modules

Custom modules can be created by extending the `CustomDisaggregator` class. Every custom module must define `labels` and a `module_id`, and must implement a `__call__` method that takes a row and returns a dictionary mapping each label to a value.

```python
from disaggregators import Disaggregator, DisaggregationModuleLabels, CustomDisaggregator

class TabsSpacesLabels(DisaggregationModuleLabels):
    TABS = "tabs"
    SPACES = "spaces"

class TabsSpaces(CustomDisaggregator):
    module_id = "tabs_spaces"
    labels = TabsSpacesLabels

    def __call__(self, row, *args, **kwargs):
        if "\t" in row[self.column]:
            return {self.labels.TABS: True, self.labels.SPACES: False}
        else:
            return {self.labels.TABS: False, self.labels.SPACES: True}

disaggregator = Disaggregator(TabsSpaces, column="text")
```

## Development

Development requirements can be installed with `pip install ".[dev]"` (the quotes keep shells such as zsh from interpreting the brackets). See the `Makefile` for useful targets, such as code-quality checks and running tests.

To run tests locally across multiple Python versions (3.8, 3.9, and 3.10), ensure that you have all the Python versions available and then run `nox -r`. Note that this is quite slow, so it's only worth doing to double-check your code before you open a Pull Request.

## Contact

Nima Boscarino – `nima <at> huggingface <dot> co`

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/NimaBoscarino/disaggregators",
    "name": "disaggregators",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.7.0",
    "maintainer_email": "",
    "keywords": "machine learning evaluate evaluation disaggregation",
    "author": "HuggingFace Inc.",
    "author_email": "nima@huggingface.co",
    "download_url": "https://files.pythonhosted.org/packages/2a/a9/631b13b95997c2986c1e67aa889f6cc355001b7a92b31f0938c24e81fd1d/disaggregators-0.1.2.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "Apache 2.0",
    "summary": "HuggingFace community-driven open-source library for dataset disaggregation",
    "version": "0.1.2",
    "split_keywords": [
        "machine",
        "learning",
        "evaluate",
        "evaluation",
        "disaggregation"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "md5": "a6d3be9ae8405cabb0e9070eb0662ce6",
                "sha256": "c77d8fcf568e7d6776a1bdf44509a04f5554bb468d6baf74ad2fd848d9a45450"
            },
            "downloads": -1,
            "filename": "disaggregators-0.1.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "a6d3be9ae8405cabb0e9070eb0662ce6",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.7.0",
            "size": 16413,
            "upload_time": "2022-12-12T16:56:23",
            "upload_time_iso_8601": "2022-12-12T16:56:23.154023Z",
            "url": "https://files.pythonhosted.org/packages/c6/f4/4e7dadf21e7c6deebebe596b40cb0931b475888f44b190182fde9c0abbbe/disaggregators-0.1.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "md5": "39d896d440773c19086b2f2fc82a6866",
                "sha256": "7ceb4e7a33a9accd1d3d2162861f8e8b882fb212eff30ec3858f227f26c5a7cb"
            },
            "downloads": -1,
            "filename": "disaggregators-0.1.2.tar.gz",
            "has_sig": false,
            "md5_digest": "39d896d440773c19086b2f2fc82a6866",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7.0",
            "size": 17600,
            "upload_time": "2022-12-12T16:56:25",
            "upload_time_iso_8601": "2022-12-12T16:56:25.636920Z",
            "url": "https://files.pythonhosted.org/packages/2a/a9/631b13b95997c2986c1e67aa889f6cc355001b7a92b31f0938c24e81fd1d/disaggregators-0.1.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2022-12-12 16:56:25",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "github_user": "NimaBoscarino",
    "github_project": "disaggregators",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "disaggregators"
}
        