# datamaestro

- Version: 1.1.0
- Summary: Dataset management command line and API
- Home page: https://github.com/experimaestro/datamaestro
- Author: Benjamin Piwowarski
- Requires Python: >=3.8
- License: GPL-3
- Keywords: dataset, manager
- Upload time: 2024-03-06 14:09:42
[![PyPI version](https://badge.fury.io/py/datamaestro.svg)](https://badge.fury.io/py/datamaestro) [![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white)](https://github.com/pre-commit/pre-commit) [![DOI](https://zenodo.org/badge/4573876.svg)](https://zenodo.org/badge/latestdoi/4573876)



# Introduction

Full documentation can be found at http://datamaestro.rtfd.io

This project groups utilities to deal with the numerous and heterogeneous datasets available on the Web. It aims to be:

1. a reference for available resources, listing datasets
1. a tool to automatically download and process resources (when freely available)
1. an integration point with the [experimaestro](http://experimaestro-python.rtfd.io/) experiment manager
1. (planned) a tool to copy data from one computer to another

Each dataset is uniquely identified by a qualified name such as `com.lecun.mnist`, which is usually the reversed domain name of the website associated with the dataset.

The main repository only handles very generic processing (downloading, basic pre-processing and data types). Plugins can then be registered to provide access to domain-specific datasets.



## List of repositories

- [Information Retrieval](https://github.com/bpiwowar/experimaestro-ir) [![PyPI version](https://badge.fury.io/py/experimaestro-ir.svg)](https://badge.fury.io/py/experimaestro-ir)

- [NLP and information access related datasets](https://github.com/experimaestro/datamaestro_text) [![PyPI version](https://badge.fury.io/py/datamaestro-text.svg)](https://badge.fury.io/py/datamaestro-text) \
  Natural Language Processing (e.g. Sentiment101) and information access (e.g. TREC) datasets

- [Image-related datasets](https://github.com/experimaestro/datamaestro_image) [![PyPI version](https://badge.fury.io/py/datamaestro-image.svg)](https://badge.fury.io/py/datamaestro-image) \
  Image-related datasets (e.g. MNIST)

- [Machine learning](https://github.com/experimaestro/datamaestro_ml) [![PyPI version](https://badge.fury.io/py/datamaestro-ml.svg)](https://badge.fury.io/py/datamaestro-ml) \
  Generic machine learning datasets


# Command line interface (CLI)


The command line interface allows you to interact with the datasets. The commands are listed below; help can be obtained by typing `datamaestro COMMAND --help`:

- `search` search datasets by name, tags and/or tasks
- `download` download files (when accessible on the Internet) or ask for a download path otherwise
- `prepare` download dataset files and output a JSON document containing paths and other dataset information
- `repositories` list the available repositories
- `orphans` list data directories that do not correspond to any registered dataset (and allow cleaning them up)
- `create-dataset` create a dataset definition


# Example (CLI)

## Retrieve and download

The command line interface can automatically download the different resources. Datamaestro extensions can provide additional processing tools.

```bash
$ datamaestro search tag:image
[image] com.lecun.mnist

$ datamaestro prepare com.lecun.mnist
INFO:root:Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz into /home/bpiwowar/datamaestro/data/image/com/lecun/mnist/t10k-labels-idx1-ubyte
INFO:root:Transforming file
INFO:root:Created file /home/bpiwowar/datamaestro/data/image/com/lecun/mnist/t10k-labels-idx1-ubyte
INFO:root:Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz into /home/bpiwowar/datamaestro/data/image/com/lecun/mnist/t10k-images-idx3-ubyte
INFO:root:Transforming file
INFO:root:Created file /home/bpiwowar/datamaestro/data/image/com/lecun/mnist/t10k-images-idx3-ubyte
INFO:root:Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz into /home/bpiwowar/datamaestro/data/image/com/lecun/mnist/train-labels-idx1-ubyte
INFO:root:Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz: 32.8kB [00:00, 92.1kB/s]
INFO:root:Transforming file
INFO:root:Created file /home/bpiwowar/datamaestro/data/image/com/lecun/mnist/train-labels-idx1-ubyte
INFO:root:Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz into /home/bpiwowar/datamaestro/data/image/com/lecun/mnist/train-images-idx3-ubyte
INFO:root:Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz: 9.92MB [00:00, 10.6MB/s]
INFO:root:Transforming file
INFO:root:Created file /home/bpiwowar/datamaestro/data/image/com/lecun/mnist/train-images-idx3-ubyte
...JSON...
```

The previous command also outputs a JSON document on standard output:
```json
{
  "train": {
    "images": {
      "path": ".../data/image/com/lecun/mnist/train_images.idx"
    },
    "labels": {
      "path": ".../data/image/com/lecun/mnist/train_labels.idx"
    }
  },
  "test": {
    "images": {
      "path": ".../data/image/com/lecun/mnist/test_images.idx"
    },
    "labels": {
      "path": ".../data/image/com/lecun/mnist/test_labels.idx"
    }
  },
  "id": "com.lecun.mnist"
}
```
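
Since only the JSON document is written to standard output, the result of `prepare` can be consumed by another program. The snippet below is a minimal sketch; it assumes the MNIST layout shown above and that, as is usual with Python logging, the `INFO` messages go to standard error:

```python
# Minimal sketch: consume the JSON emitted by `datamaestro prepare` from
# another process. Assumes the INFO log messages go to standard error and
# only the JSON document is written to standard output.
import json
import subprocess

result = subprocess.run(
    ["datamaestro", "prepare", "com.lecun.mnist"],
    check=True,
    capture_output=True,
    text=True,
)
info = json.loads(result.stdout)
print(info["id"])                       # com.lecun.mnist
print(info["train"]["images"]["path"])  # path of the training images file
```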

For Python users this is even more convenient, since the IDX format is supported directly:

```python
In [1]: from datamaestro import prepare_dataset
In [2]: ds = prepare_dataset("com.lecun.mnist")
In [3]: ds.train.images.data().dtype, ds.train.images.data().shape
Out[3]: (dtype('uint8'), (60000, 28, 28))
```
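
Since `data()` returns a NumPy array, the prepared dataset can be fed directly into any NumPy-based pipeline. The sketch below assumes that the labels expose the same `data()` accessor as the images (both are IDX-backed, see the dataset definition below):

```python
# Minimal sketch: prepare MNIST and massage it for a NumPy-based pipeline.
# Assumes the labels expose the same data() accessor as the images.
import numpy as np
from datamaestro import prepare_dataset

ds = prepare_dataset("com.lecun.mnist")
images = ds.train.images.data().astype(np.float32) / 255.0  # scale pixels to [0, 1]
labels = ds.train.labels.data()
images = images.reshape(len(images), -1)  # flatten 28x28 images into 784-dim vectors
print(images.shape, labels.shape)
```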


## Python definition of datasets

Each dataset (or a set of related datasets) is described in Python using a mix of declarative
and imperative statements. This makes it possible to quickly define how to download a dataset using the
datamaestro declarative API; the imperative part is used when creating the JSON output,
and is integrated with [experimaestro](http://experimaestro.github.io/experimaestro-python).

Its syntax is described in the [documentation](https://datamaestro.readthedocs.io).


For MNIST, this corresponds to:

```python
from datamaestro_image.data import ImageClassification, LabelledImages, Base, IDXImage
from datamaestro.download.single import filedownloader
from datamaestro.definitions import argument, datatasks, datatags, dataset
from datamaestro.data.tensor import IDX


@filedownloader("train_images.idx", "http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz")
@filedownloader("train_labels.idx", "http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz")
@filedownloader("test_images.idx", "http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz")
@filedownloader("test_labels.idx", "http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz")
@dataset(
  ImageClassification,
  url="http://yann.lecun.com/exdb/mnist/",
)
def MNIST(train_images, train_labels, test_images, test_labels):
  """The MNIST database

  The MNIST database of handwritten digits, available from this page, has a
  training set of 60,000 examples, and a test set of 10,000 examples. It is a
  subset of a larger set available from NIST. The digits have been
  size-normalized and centered in a fixed-size image.
  """
  return {
    "train": LabelledImages(
      images=IDXImage(path=train_images),
      labels=IDX(path=train_labels)
    ),
    "test": LabelledImages(
      images=IDXImage(path=test_images),
      labels=IDX(path=test_labels)
    ),
  }
```
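
In this definition, each `@filedownloader` declares a resource to download, and the local path of the downloaded file is handed to the decorated function through the parameter of the matching name (`train_images`, `train_labels`, `test_images`, `test_labels`); the function body then only assembles the data objects that make up the JSON output shown above.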

# 0.8.0

- Integration with other repositories: abstracting away the notion of dataset
- Repository prefix
- Set sub-datasets IDs automatically

# 0.7.3

- Updates for new experimaestro (0.8.5)
- Search types with "type:..."

# 0.6.17

- Allow remote access through rpyc

# 0.6.9

- `version` command

            
