[![PyPI version](https://badge.fury.io/py/datamaestro.svg)](https://badge.fury.io/py/datamaestro) [![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white)](https://github.com/pre-commit/pre-commit) [![DOI](https://zenodo.org/badge/4573876.svg)](https://zenodo.org/badge/latestdoi/4573876)
# Introduction
Full documentation can be found at http://datamaestro.rtfd.io
This project aims at grouping utilities to deal with the numerous and heterogeneous datasets present on the Web. It aims
at being
1. a reference for available resources, listing datasets
1. a tool to automatically download and process resources (when freely available)
1. an integration point with the [experimaestro](http://experimaestro-python.rtfd.io/) experiment manager
1. (planned) a tool for copying data from one computer to another
Each dataset is uniquely identified by a qualified name such as `com.lecun.mnist`, which is usually built from the reversed domain name of the website associated with the dataset.
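
To make the naming convention concrete, here is a minimal sketch of how such a qualified name could be derived from a domain name. The helper `qualified_name` is hypothetical and is not part of the datamaestro API; it only illustrates the reversed-domain idea.

```python
def qualified_name(domain: str, dataset: str) -> str:
    """Reverse the dot-separated labels of `domain` and append `dataset`.

    Hypothetical helper for illustration only (not a datamaestro function).
    """
    labels = [label for label in domain.lower().split(".") if label]
    return ".".join(list(reversed(labels)) + [dataset])


print(qualified_name("lecun.com", "mnist"))  # com.lecun.mnist
```
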
The main repository only deals with very generic processing (downloading, basic pre-processing and data types). Plugins that provide access to domain-specific datasets can then be registered.
## List of repositories
- [Information Retrieval](https://github.com/bpiwowar/experimaestro-ir) [![PyPI version](https://badge.fury.io/py/experimaestro-ir.svg)](https://badge.fury.io/py/experimaestro-ir)
- [NLP and information access related dataset](https://github.com/experimaestro/datamaestro_text) [![PyPI version](https://badge.fury.io/py/datamaestro-text.svg)](https://badge.fury.io/py/datamaestro-text) \
Natural Language Processing (e.g. Sentiment101) and Information access (e.g. TREC) datasets
- [image-related dataset](https://github.com/experimaestro/datamaestro_image) [![PyPI version](https://badge.fury.io/py/datamaestro-image.svg)](https://badge.fury.io/py/datamaestro-image)
Image related datasets (e.g. MNIST)
- [machine learning](https://github.com/experimaestro/datamaestro_ml) [![PyPI version](https://badge.fury.io/py/datamaestro-ml.svg)](https://badge.fury.io/py/datamaestro-ml)\
Generic machine learning datasets
# Command line interface (CLI)
The command line interface lets you interact with the datasets. The commands are listed below; help for each can be displayed by typing `datamaestro COMMAND --help`:
- `search` searches datasets by name, tags and/or tasks
- `download` downloads files (if accessible on the Internet) or asks for the download path otherwise
- `prepare` downloads dataset files and outputs a JSON document containing paths and other dataset information
- `repositories` lists the available repositories
- `orphans` lists data directories that do not correspond to any registered dataset (and allows cleaning them up)
- `create-dataset` creates a dataset definition
# Example (CLI)
## Retrieve and download
The command line interface can automatically download the different resources. Datamaestro extensions can provide additional processing tools.
```bash
$ datamaestro search tag:image
[image] com.lecun.mnist
$ datamaestro prepare com.lecun.mnist
INFO:root:Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz into /home/bpiwowar/datamaestro/data/image/com/lecun/mnist/t10k-labels-idx1-ubyte
INFO:root:Transforming file
INFO:root:Created file /home/bpiwowar/datamaestro/data/image/com/lecun/mnist/t10k-labels-idx1-ubyte
INFO:root:Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz into /home/bpiwowar/datamaestro/data/image/com/lecun/mnist/t10k-images-idx3-ubyte
INFO:root:Transforming file
INFO:root:Created file /home/bpiwowar/datamaestro/data/image/com/lecun/mnist/t10k-images-idx3-ubyte
INFO:root:Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz into /home/bpiwowar/datamaestro/data/image/com/lecun/mnist/train-labels-idx1-ubyte
INFO:root:Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz: 32.8kB [00:00, 92.1kB/s] INFO:root:Transforming file
INFO:root:Created file /home/bpiwowar/datamaestro/data/image/com/lecun/mnist/train-labels-idx1-ubyte
INFO:root:Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz into /home/bpiwowar/datamaestro/data/image/com/lecun/mnist/train-images-idx3-ubyte
INFO:root:Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz: 9.92MB [00:00, 10.6MB/s]
INFO:root:Transforming file
INFO:root:Created file /home/bpiwowar/datamaestro/data/image/com/lecun/mnist/train-images-idx3-ubyte
...JSON...
```
The previous command also prints a JSON document on standard output:
```json
{
  "train": {
    "images": {
      "path": ".../data/image/com/lecun/mnist/train_images.idx"
    },
    "labels": {
      "path": ".../data/image/com/lecun/mnist/train_labels.idx"
    }
  },
  "test": {
    "images": {
      "path": ".../data/image/com/lecun/mnist/test_images.idx"
    },
    "labels": {
      "path": ".../data/image/com/lecun/mnist/test_labels.idx"
    }
  },
  "id": "com.lecun.mnist"
}
```
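
As a rough sketch of how a script might consume this output, the snippet below parses a copy of the JSON shown above with the standard `json` module and collects the file paths. The helper `dataset_paths` is an assumption made for illustration, and the `...` path prefixes are kept exactly as elided in the example.

```python
import json

# Copy of the sample output above; the "..." prefixes are elided in the source.
SAMPLE = """
{
  "train": {
    "images": {"path": ".../data/image/com/lecun/mnist/train_images.idx"},
    "labels": {"path": ".../data/image/com/lecun/mnist/train_labels.idx"}
  },
  "test": {
    "images": {"path": ".../data/image/com/lecun/mnist/test_images.idx"},
    "labels": {"path": ".../data/image/com/lecun/mnist/test_labels.idx"}
  },
  "id": "com.lecun.mnist"
}
"""


def dataset_paths(info: dict) -> dict:
    """Map "split/part" to its path for every nested entry holding a "path" key.

    Hypothetical helper for illustration only (not a datamaestro function).
    """
    paths = {}
    for split, parts in info.items():
        if not isinstance(parts, dict):
            continue  # skip scalar fields such as "id"
        for part, entry in parts.items():
            if isinstance(entry, dict) and "path" in entry:
                paths[f"{split}/{part}"] = entry["path"]
    return paths


info = json.loads(SAMPLE)
paths = dataset_paths(info)
for key, path in sorted(paths.items()):
    print(key, path)
```
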
For those using Python, access is even simpler since the IDX format is supported:
```python
In [1]: from datamaestro import prepare_dataset
In [2]: ds = prepare_dataset("com.lecun.mnist")
In [3]: ds.train.images.data().dtype, ds.train.images.data().shape
Out[3]: (dtype('uint8'), (60000, 28, 28))
```
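
For readers unfamiliar with IDX, the following sketch decodes an IDX header using only the standard library. It follows the published IDX layout (two zero bytes, a type code, the number of dimensions, then one big-endian 32-bit size per dimension) and is independent of how datamaestro itself reads these files; `parse_idx_header` is a name chosen for this example.

```python
import struct

# Type codes defined by the IDX format.
IDX_TYPES = {
    0x08: "uint8",
    0x09: "int8",
    0x0B: "int16",
    0x0C: "int32",
    0x0D: "float32",
    0x0E: "float64",
}


def parse_idx_header(raw: bytes):
    """Return (dtype name, shape, data offset) for the IDX data in `raw`."""
    zero, type_code, ndim = struct.unpack(">HBB", raw[:4])
    if zero != 0:
        raise ValueError("not an IDX file: the first two bytes must be zero")
    if type_code not in IDX_TYPES:
        raise ValueError(f"unknown IDX type code: {type_code:#x}")
    offset = 4 + 4 * ndim  # the data starts right after the dimension sizes
    shape = struct.unpack(f">{ndim}I", raw[4:offset])
    return IDX_TYPES[type_code], shape, offset


# Header built by hand to mirror the MNIST training images shown above.
header = struct.pack(">HBB3I", 0, 0x08, 3, 60000, 28, 28)
print(parse_idx_header(header))  # ('uint8', (60000, 28, 28), 16)
```
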
## Python definition of datasets
Each dataset (or a set of related datasets) is described in Python using a mix of declarative
and imperative statements. The declarative part of the datamaestro API makes it quick to define how to download a dataset;
the imperative part is used when creating the JSON output,
and is integrated with [experimaestro](http://experimaestro.github.io/experimaestro-python).
Its syntax is described in the [documentation](https://datamaestro.readthedocs.io).
For MNIST, this corresponds to:
```python
from datamaestro_image.data import ImageClassification, LabelledImages, Base, IDXImage
from datamaestro.download.single import filedownloader
from datamaestro.definitions import argument, datatasks, datatags, dataset
from datamaestro.data.tensor import IDX
@filedownloader("train_images.idx", "http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz")
@filedownloader("train_labels.idx", "http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz")
@filedownloader("test_images.idx", "http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz")
@filedownloader("test_labels.idx", "http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz")
@dataset(
    ImageClassification,
    url="http://yann.lecun.com/exdb/mnist/",
)
def MNIST(train_images, train_labels, test_images, test_labels):
    """The MNIST database

    The MNIST database of handwritten digits, available from this page, has a
    training set of 60,000 examples, and a test set of 10,000 examples. It is a
    subset of a larger set available from NIST. The digits have been
    size-normalized and centered in a fixed-size image.
    """
    return {
        "train": LabelledImages(
            images=IDXImage(path=train_images),
            labels=IDX(path=train_labels)
        ),
        "test": LabelledImages(
            images=IDXImage(path=test_images),
            labels=IDX(path=test_labels)
        ),
    }
```
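
The split between declarative and imperative parts can be illustrated with a toy, self-contained sketch. Everything in it is an assumption made for illustration: the `resource` decorator, the `prepare` function and the `example.org` URLs are hypothetical and do not reflect datamaestro's actual implementation. The decorators only record what should be fetched (declarative part), while the function body assembles the result from the resolved paths (imperative part).

```python
def resource(arg_name: str, filename: str, url: str):
    """Toy decorator: record a resource to fetch, without downloading anything."""
    def wrap(fn):
        # Copy so that stacked decorators accumulate entries on the function.
        resources = dict(getattr(fn, "resources", {}))
        resources[arg_name] = (filename, url)
        fn.resources = resources
        return fn
    return wrap


# Declarative part: which files the dataset needs (hypothetical URLs).
@resource("train_images", "train_images.idx", "http://example.org/train-images.gz")
@resource("train_labels", "train_labels.idx", "http://example.org/train-labels.gz")
def toy_dataset(train_images, train_labels):
    # Imperative part: build the structure returned to the caller.
    return {"train": {"images": train_images, "labels": train_labels}}


def prepare(fn, root: str):
    """Resolve each recorded resource to a local path and call the definition."""
    paths = {name: f"{root}/{filename}" for name, (filename, _url) in fn.resources.items()}
    return fn(**paths)


print(prepare(toy_dataset, "/data"))
```

A real implementation would download and transform each file before calling the definition; the sketch skips that step to stay runnable offline.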
# 0.8.0
- Integration with other repositories: abstracting away the notion of dataset
- Repository prefix
- Set sub-datasets IDs automatically
# 0.7.3
- Updates for new experimaestro (0.8.5)
- Search types with "type:..."
# 0.6.17
- Allow remote access through rpyc
# 0.6.9
- `version` command
Raw data
{
"_id": null,
"home_page": "https://github.com/experimaestro/datamaestro",
"name": "datamaestro",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": null,
"keywords": "dataset manager",
"author": "Benjamin Piwowarski",
"author_email": "benjamin@piwowarski.fr",
"download_url": "https://files.pythonhosted.org/packages/3a/92/235a92d7bd17f54116488a3d060f09cd455d28e905ff29af6ef5d3ea2e98/datamaestro-1.2.1.tar.gz",
"platform": "any",
"bugtrack_url": null,
"license": "GPL-3",
"summary": "\"Dataset management command line and API\"",
"version": "1.2.1",
"project_urls": {
"Homepage": "https://github.com/experimaestro/datamaestro"
},
"split_keywords": [
"dataset",
"manager"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "b7b1d92328ecdc6cbee415aa74b05404543330d5cb5d8879cb850631021b482b",
"md5": "d7c5d7218e915b4b6e66e1a24727018c",
"sha256": "5f68c78773ec7bd3ce45ee463677a336e2f1c0817a70539b126e4a700c35a75c"
},
"downloads": -1,
"filename": "datamaestro-1.2.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "d7c5d7218e915b4b6e66e1a24727018c",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 61721,
"upload_time": "2024-05-31T10:35:17",
"upload_time_iso_8601": "2024-05-31T10:35:17.747014Z",
"url": "https://files.pythonhosted.org/packages/b7/b1/d92328ecdc6cbee415aa74b05404543330d5cb5d8879cb850631021b482b/datamaestro-1.2.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "3a92235a92d7bd17f54116488a3d060f09cd455d28e905ff29af6ef5d3ea2e98",
"md5": "4c5f7c35046a1ce3dce6fb429dfc7555",
"sha256": "64bc11426cbdd2df7b56fdfbb1691385daab97910e09d19022830c9d4b278462"
},
"downloads": -1,
"filename": "datamaestro-1.2.1.tar.gz",
"has_sig": false,
"md5_digest": "4c5f7c35046a1ce3dce6fb429dfc7555",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 62057,
"upload_time": "2024-05-31T10:35:20",
"upload_time_iso_8601": "2024-05-31T10:35:20.251951Z",
"url": "https://files.pythonhosted.org/packages/3a/92/235a92d7bd17f54116488a3d060f09cd455d28e905ff29af6ef5d3ea2e98/datamaestro-1.2.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-05-31 10:35:20",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "experimaestro",
"github_project": "datamaestro",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [],
"tox": true,
"lcname": "datamaestro"
}