datalabs

Name: datalabs
Version: 0.4.15
Summary: Datalabs
Home page: https://github.com/expressai/datalab
Author: expressai
License: Apache 2.0
Keywords: dataset
Upload time: 2022-12-22 02:11:35
Requirements: No requirements were recorded.
Travis CI: none
Coveralls test coverage: none
            <p align="center">
    <br>
    <img src="./docs/Resources/figs/readme_logo.png" width="400"/>
    <br>
  <a href="https://github.com/expressai/DataLab/blob/main/LICENSE"><img alt="License" src="https://img.shields.io/github/license/expressai/DataLab" /></a>
  <a href="https://github.com/expressai/DataLab/stargazers"><img alt="GitHub stars" src="https://img.shields.io/github/stars/expressai/DataLab" /></a>
  <a href="https://pypi.org/project/datalabs/"><img alt="PyPI" src="https://img.shields.io/pypi/v/datalabs" /></a>
  <a href=".github/workflows/ci.yml"><img alt="Integration Tests" src="https://github.com/expressai/DataLab/actions/workflows/ci.yml/badge.svg?event=push" /></a>
</p>


[DataLab](http://datalab.nlpedia.ai/) is a unified platform that allows NLP researchers to perform a variety of data-related tasks in an efficient and easy-to-use manner. In particular, DataLab supports the following functionalities:
<p align="center"> 
<img src="./docs/Resources/figs/datalab_overview.png" width="300"/>
 </p> 

* **Data Diagnostics**: DataLab allows for analysis and understanding of data to uncover undesirable traits such as hate speech, gender bias, or label imbalance.
* **Operation Standardization**: DataLab provides and standardizes a large number of data processing operations, including aggregating, preprocessing, featurizing, editing and prompting operations.
* **Data Search**: DataLab provides a semantic dataset search tool to help identify appropriate datasets given a textual description of an idea.
* **Global Analysis**: DataLab provides tools to perform global analyses over a variety of datasets.
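As a concrete illustration of the first point, one common diagnostic, label imbalance, amounts to tallying label frequencies. The sketch below uses plain Python rather than the DataLab API, and `label_balance` is a hypothetical helper name:

```python
from collections import Counter

def label_balance(labels):
    """Return each label's share of the dataset (hypothetical helper)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# A skewed toy dataset: 3 positive samples vs. 1 negative
shares = label_balance(["pos", "pos", "pos", "neg"])
print(shares)  # {'pos': 0.75, 'neg': 0.25}
```

A share far from `1 / num_labels` for any class signals the kind of imbalance a diagnostic tool would flag.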


 

## Installation
DataLab can be installed from PyPI:
```bash
pip install --upgrade pip
pip install datalabs
python -m nltk.downloader omw-1.4 # to support more feature calculation
```
or from source:
```bash
# This is suitable for SDK developers
pip install --upgrade pip
git clone git@github.com:ExpressAI/DataLab.git
cd DataLab
pip install -e .[dev]
python -m nltk.downloader omw-1.4 # to support more feature calculation
```
By adding `[dev]`, some [extra libraries](https://github.com/ExpressAI/DataLab/blob/03f69e5424859e3e9dbcbb487d3e1ce3de45a599/setup.py#L66) will be installed, such as `pre-commit`.



#### Code Quality Checks
If you would like to contribute to DataLab, checking code style and quality before your pull
request is highly recommended. This project expects three types of checks: (1) black,
(2) flake8, and (3) isort.

You can run these checks in two ways:

##### Manually (suitable for developers using GitHub Desktop)
```shell
git init .          # pre-commit requires an initialized git repository
pre-commit install
pre-commit run --all-files
```
where `pre-commit run --all-files` is equivalent to
```shell
pre-commit run black   # (equivalent to python -m black .)
pre-commit run isort   # (equivalent to isort .)
pre-commit run flake8  # (equivalent to flake8)
```
Notably, `black` and `isort` fix code style automatically, while `flake8` only reports
issues, which must then be fixed manually.



##### Automatically (suitable for developers using Git CLI)

```shell
git init .          # pre-commit requires an initialized git repository
pre-commit install
git commit -m "your update message"
```
The `git commit` command will automatically trigger `pre-commit run --all-files`.
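Under the hood, `pre-commit` reads its hooks from a `.pre-commit-config.yaml` file at the repository root. A minimal sketch of such a config wiring up the three checks might look like this (the `rev` pins below are illustrative; the actual file shipped with DataLab may differ):

```yaml
repos:
  - repo: https://github.com/psf/black
    rev: 22.3.0        # illustrative pin
    hooks:
      - id: black
  - repo: https://github.com/PyCQA/isort
    rev: 5.10.1        # illustrative pin
    hooks:
      - id: isort
  - repo: https://github.com/PyCQA/flake8
    rev: 4.0.1         # illustrative pin
    hooks:
      - id: flake8
```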

## Using DataLab
Below are several simple examples showcasing the usage of DataLab.

You can also view documentation:
* [**Online Tutorial**](https://expressai.github.io/DataLab/)
* [**Adding Datasets to the SDK**](docs/SDK/add_new_datasets_into_sdk.md)
* [**Adding a new Task to the SDK**](docs/SDK/add_new_task_schema.md)


```python
# pip install datalabs
from datalabs import load_dataset
dataset = load_dataset("ag_news")


# Preprocessing operation
from preprocess import *
res = dataset["test"].apply(lower)
print(next(res))

# Featurizing operation
from featurize import *
res = dataset["test"].apply(get_text_length) # get length
print(next(res))

res = dataset["test"].apply(get_entities_spacy) # get entity
print(next(res))

# Editing/Transformation operation
from edit import *
res = dataset["test"].apply(change_person_name) # change person name
print(next(res))

# Prompting operation
from prompt import *
res = dataset["test"].apply(template_tc1)
print(next(res))

# Aggregating operation
from aggregate.text_classification import *
res = dataset["test"].apply(get_statistics)
```
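To make the operation abstraction above concrete: a featurizing operation is essentially a function mapping one sample (a dict of fields) to a dict of computed features. The sketch below is a plain-Python stand-in for something like `get_text_length`, not the actual DataLab implementation, and `text_length_features` is a hypothetical name:

```python
# Hypothetical stand-in for a featurizing operation such as get_text_length.
# A real DataLab operation wraps a function like this so it can be passed
# to dataset.apply(...).
def text_length_features(sample):
    text = sample["text"]
    return {
        "length": len(text.split()),  # token count via whitespace split
        "char_length": len(text),     # raw character count
    }

sample = {"text": "DataLab makes dataset analysis easy", "label": 0}
print(text_length_features(sample))  # {'length': 5, 'char_length': 35}
```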
 

## Acknowledgment
DataLab originated as a fork of the awesome [Huggingface Datasets](https://github.com/huggingface/datasets) and [TensorFlow Datasets](https://github.com/tensorflow/datasets). We sincerely thank the Huggingface/TensorFlow Datasets teams for building these amazing libraries. More details on the differences between DataLab and these libraries can be found in the project documentation.
We thank Antonis Anastasopoulos for sharing the mapping data between countries and languages, and thank Alissa Ostapenko, Yulia Tsvetkov, Jie Fu, Ziyun Xu, Hiroaki Hayashi, and Zhengfu He for useful discussion and suggestions for the first version.






            
