ml-datasets


- Name: ml-datasets
- Version: 0.2.0
- Home page: https://github.com/explosion/ml-datasets
- Summary: Machine Learning dataset loaders
- Upload time: 2021-01-31 02:36:39
- Author: Explosion
- Requires Python: >=3.6
- License: MIT

<a href="https://explosion.ai"><img src="https://explosion.ai/assets/img/logo.svg" width="125" height="125" align="right" /></a>

# Machine learning dataset loaders for testing and examples

Loaders for various machine learning datasets, intended for tests and example scripts.
These loaders were previously available in `thinc.extra.datasets`.

[![PyPi Version](https://img.shields.io/pypi/v/ml-datasets.svg?style=flat-square&logo=pypi&logoColor=white)](https://pypi.python.org/pypi/ml-datasets)

## Setup and installation

The package can be installed via pip:

```bash
pip install ml-datasets
```

## Loaders

Loaders can be imported directly or used via their string name (which is useful if they're set via command line arguments). Some loaders may take arguments – see the source for details.

```python
# Import directly
from ml_datasets import imdb
train_data, dev_data = imdb()
```

```python
# Load via registry
from ml_datasets import loaders
imdb_loader = loaders.get("imdb")
train_data, dev_data = imdb_loader()
```
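
Because a loader can be looked up by its string name, a script can expose the dataset choice as a command line option. The following is a minimal sketch rather than part of the package: the `--dataset` flag and the `parse_args` and `load_dataset` helpers are hypothetical names, and only the `loaders.get` lookup is taken from the example above.

```python
import argparse


def parse_args(argv=None):
    """Parse a hypothetical --dataset option naming a registered loader."""
    parser = argparse.ArgumentParser(description="Load a dataset by name")
    parser.add_argument("--dataset", default="imdb", help="Loader ID, e.g. imdb")
    return parser.parse_args(argv)


def load_dataset(name):
    """Look up the loader by its string name and call it without arguments."""
    # Imported here so that argument parsing works without the package installed.
    from ml_datasets import loaders

    loader = loaders.get(name)
    return loader()


args = parse_args(["--dataset", "imdb"])  # pass None to read sys.argv instead
print(f"Selected loader: {args.dataset}")
# data = load_dataset(args.dataset)  # may download the data, so left commented out
```

Keeping the lookup in its own function means the same script can serve any loader ID listed in the tables below, as long as that loader can be called without arguments.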

### Available loaders

#### NLP datasets

| ID / Function        | Description                                  | NLP task                                  | From URL |
| -------------------- | -------------------------------------------- | ----------------------------------------- | :------: |
| `imdb`               | IMDB sentiment dataset                       | Binary classification: sentiment analysis |    ✓     |
| `dbpedia`            | DBPedia ontology dataset                     | Multi-class single-label classification   |    ✓     |
| `cmu`                | CMU movie genres dataset                     | Multi-class, multi-label classification   |    ✓     |
| `quora_questions`    | Duplicate Quora questions dataset            | Detecting duplicate questions             |    ✓     |
| `reuters`            | Reuters dataset (texts not included)         | Multi-class multi-label classification    |    ✓     |
| `snli`               | Stanford Natural Language Inference corpus   | Recognizing textual entailment            |    ✓     |
| `stack_exchange`     | Stack Exchange dataset                       | Question Answering                        |          |
| `ud_ancora_pos_tags` | Universal Dependencies Spanish AnCora corpus | POS tagging                               |    ✓     |
| `ud_ewtb_pos_tags`   | Universal Dependencies English EWT corpus    | POS tagging                               |    ✓     |
| `wikiner`            | WikiNER data                                 | Named entity recognition                  |          |

#### Other ML datasets

| ID / Function | Description | ML task           | From URL |
| ------------- | ----------- | ----------------- | :------: |
| `mnist`       | MNIST data  | Image recognition |    ✓     |

### Dataset details

#### IMDB

Each instance contains the text of a movie review, and a sentiment expressed as `0` or `1`.

```python
import ml_datasets

train_data, dev_data = ml_datasets.imdb()
for text, annot in train_data[0:5]:
    print(f"Review: {text}")
    print(f"Sentiment: {annot}")
```

- Download URL: [http://ai.stanford.edu/~amaas/data/sentiment/](http://ai.stanford.edu/~amaas/data/sentiment/)
- Citation: [Andrew L. Maas et al., 2011](https://www.aclweb.org/anthology/P11-1015/)

| Property            | Training         | Dev              |
| ------------------- | ---------------- | ---------------- |
| # Instances         | 25000            | 25000            |
| Label values        | {`0`, `1`}       | {`0`, `1`}       |
| Labels per instance | Single           | Single           |
| Label distribution  | Balanced (50/50) | Balanced (50/50) |
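
The label distribution in the table can be checked by counting labels over the loaded pairs. This is a small sketch under the assumption that each instance is a `(text, label)` pair, as in the loop above; `label_counts` is a hypothetical helper and `toy_data` is a hand-written stand-in for `train_data`, not real IMDB data.

```python
from collections import Counter


def label_counts(data):
    """Count how often each label occurs in a list of (text, label) pairs."""
    return Counter(label for _text, label in data)


# Toy stand-in for train_data; the real split holds far more instances.
toy_data = [("Great film", 1), ("Dull plot", 0), ("Loved it", 1), ("Too long", 0)]
counts = label_counts(toy_data)
print(counts)
```

On a balanced split, both labels should end up with the same count.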

#### DBPedia

Each instance contains an ontological description and a classification into one of 14 distinct labels.

```python
import ml_datasets

train_data, dev_data = ml_datasets.dbpedia()
for text, annot in train_data[0:5]:
    print(f"Text: {text}")
    print(f"Category: {annot}")
```

- Download URL: [Via fast.ai](https://course.fast.ai/datasets)
- Original citation: [Xiang Zhang et al., 2015](https://arxiv.org/abs/1509.01626)

| Property            | Training | Dev      |
| ------------------- | -------- | -------- |
| # Instances         | 560000   | 70000    |
| Label values        | `1`-`14` | `1`-`14` |
| Labels per instance | Single   | Single   |
| Label distribution  | Balanced | Balanced |
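
Since the label values run from `1` to `14`, code that expects zero-based class indices needs a mapping first. The sketch below builds that mapping from the labels actually present in the data, so it does not assume a particular label type; `build_label_index` is a hypothetical helper and `toy_data` is an invented stand-in for `train_data`.

```python
def build_label_index(data):
    """Map each distinct label to a zero-based index, in sorted label order."""
    labels = sorted({label for _text, label in data})
    return {label: index for index, label in enumerate(labels)}


# Toy stand-in for train_data, using three of the fourteen label values.
toy_data = [("first description", 1), ("second description", 9), ("third description", 12)]
label_index = build_label_index(toy_data)
print(label_index)
```

Building the index from the training split and reusing it for the dev split keeps the class indices consistent between the two.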

#### CMU

Each instance contains a movie description, and a classification into a list of appropriate genres.

```python
import ml_datasets

train_data, dev_data = ml_datasets.cmu()
for text, annot in train_data[0:5]:
    print(f"Text: {text}")
    print(f"Genres: {annot}")
```

- Download URL: [http://www.cs.cmu.edu/~ark/personas/](http://www.cs.cmu.edu/~ark/personas/)
- Original citation: [David Bamman et al., 2013](https://www.aclweb.org/anthology/P13-1035/)

| Property            | Training                                                                                      | Dev |
| ------------------- | --------------------------------------------------------------------------------------------- | --- |
| # Instances         | 41793                                                                                         | 0   |
| Label values        | 363 different genres                                                                          | -   |
| Labels per instance | Multiple                                                                                      | -   |
| Label distribution  | Imbalanced: 147 labels with less than 20 examples, while `Drama` occurs more than 19000 times | -   |
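
Because the dev split listed above is empty, an evaluation set has to be carved out of the training data. One possible approach, sketched here with a hypothetical `split_train_dev` helper and invented toy data, is a seeded shuffle followed by a fixed-fraction split.

```python
import random


def split_train_dev(data, dev_fraction=0.2, seed=0):
    """Shuffle a copy of the data and split off a dev portion."""
    if not 0.0 < dev_fraction < 1.0:
        raise ValueError("dev_fraction must be between 0 and 1")
    shuffled = list(data)
    # A dedicated Random instance keeps the split reproducible for a given seed.
    random.Random(seed).shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_fraction)
    return shuffled[n_dev:], shuffled[:n_dev]


# Toy stand-in for train_data: (text, list of genres) pairs.
toy_data = [(f"Plot {i}", ["Drama"]) for i in range(10)]
train, dev = split_train_dev(toy_data, dev_fraction=0.2)
print(len(train), len(dev))
```

A plain random split ignores the label imbalance, so rare genres may be missing from the dev portion.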

#### Quora

Each instance contains two Quora questions and a label indicating whether they are duplicates (`0`: no, `1`: yes).
The ground-truth labels are noisy and are not guaranteed to be correct.

```python
import ml_datasets

train_data, dev_data = ml_datasets.quora_questions()
for questions, annot in train_data[0:50]:
    q1, q2 = questions
    print(f"Question 1: {q1}")
    print(f"Question 2: {q2}")
    print(f"Duplicate: {annot}")
```

- Download URL: [http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv](http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv)
- Original citation: [Kornél Csernai et al., 2017](https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs)

| Property            | Training                  | Dev                       |
| ------------------- | ------------------------- | ------------------------- |
| # Instances         | 363859                    | 40429                     |
| Label values        | {`0`, `1`}                | {`0`, `1`}                |
| Labels per instance | Single                    | Single                    |
| Label distribution  | Imbalanced: 63% label `0` | Imbalanced: 63% label `0` |
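
With 63% of the instances labelled `0`, a model that always predicts `0` already reaches about 63% accuracy, which makes a useful reference point. The sketch below computes that majority-class baseline; `majority_baseline_accuracy` is a hypothetical helper and the question pairs are invented toy data in the `((q1, q2), label)` shape used above.

```python
from collections import Counter


def majority_baseline_accuracy(data):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(label for _questions, label in data)
    _label, count = counts.most_common(1)[0]
    return count / len(data)


# Toy stand-in for dev_data.
toy_data = [
    (("How do I learn Python?", "What is the best way to learn Python?"), 1),
    (("What is AI?", "How tall is Mount Everest?"), 0),
    (("Is tea healthy?", "Who wrote Hamlet?"), 0),
]
baseline = majority_baseline_accuracy(toy_data)
print(f"Majority baseline accuracy: {baseline:.2f}")
```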

### Registering loaders

Loaders can be registered externally using the `loaders` registry as a decorator. For example:

```python
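import ml_datasets

# Calling the registry with a name returns a decorator that registers the function
@ml_datasets.loaders("my_custom_loader")
def my_custom_loader():
    # load_some_data is a placeholder for your own loading code
    return load_some_data()

assert "my_custom_loader" in ml_datasets.loaders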
```
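
As a fuller illustration, a custom loader might read local files and return a `(train, dev)` pair like the built-in loaders shown earlier. Everything in this sketch is an assumption made for the example: the two-column TSV layout, the `read_labelled_tsv` and `register_my_loader` helpers, and the `my_tsv_loader` name. Only the decorator call on the `loaders` registry is taken from the snippet above.

```python
import csv
import os
import tempfile


def read_labelled_tsv(path):
    """Read (text, label) pairs from a two-column, tab-separated file."""
    with open(path, encoding="utf8", newline="") as file_:
        return [(row[0], row[1]) for row in csv.reader(file_, delimiter="\t")]


def register_my_loader(train_path, dev_path):
    """Register a hypothetical loader that returns (train, dev) lists."""
    import ml_datasets  # imported lazily; requires the package to be installed

    @ml_datasets.loaders("my_tsv_loader")
    def my_tsv_loader():
        return read_labelled_tsv(train_path), read_labelled_tsv(dev_path)


# Demonstrate the reading step on a temporary toy file.
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy.tsv")
    with open(toy_path, "w", encoding="utf8", newline="") as file_:
        file_.write("Great film\t1\nDull plot\t0\n")
    rows = read_labelled_tsv(toy_path)
print(rows)
```

Labels read this way are strings, so convert them if your training code expects integers.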



            
