tarzan


Name: tarzan
Version: 0.1.0
Summary: high-level IO for tar based dataset
upload_time: 2024-02-03 07:45:42
requires_python: >=3.8
keywords: datasets, tar
requirements: No requirements were recorded.
# Tarzan

Tar is a high-performance, streamable format that is widely used in the deep learning community
(e.g. [TorchData](https://github.com/pytorch/data), [WebDataset](https://github.com/webdataset/webdataset)).
Separately, [TFDS](https://www.tensorflow.org/datasets/add_dataset)-style dataset builder APIs give users a high-level
interface for building their own datasets, and the same approach has been adopted
by [HuggingFace](https://huggingface.co/docs/datasets/main/en/image_dataset#loading-script).

Why not connect the two? Tarzan provides a minimal high-level API for building your own Tar-based datasets. It
also maps nested features directly onto the directory structure inside the Tar file, so you can peek into a Tar file
without extracting it.

## Installation

```bash
pip install tarzan
```

## Quick Start

1. Define your dataset info, which describes the dataset structure and any metadata.
```python
from tarzan.info import DatasetInfo
from tarzan.features import Features, Text, Scalar, Tensor, Audio

info = DatasetInfo(
    description="A fake dataset",
    features=Features({
        'single': Text(),
        'nested_list': [Scalar('int32')],
        'nested_dict': {
            'inner': Tensor(shape=(None, 3), dtype='float32'),
        },
        'complex': [{
            'inner_1': Text(),
            'inner_2': Audio(sample_rate=16000),
        }]
    }),
    metadata={
        'version': '1.0.0'
    }
)
```

2. Write your data to Tar files with `ShardWriter`.
```python
from tarzan.writers import ShardWriter

# `info` is the DatasetInfo defined in step 1.
# 5 samples with max_count=2 are split into 3 shards (2 + 2 + 1).
with ShardWriter('data_dir', info, max_count=2) as writer:
    for _ in range(5):
        writer.write({
            'single': 'hello',
            'nested_list': [1, 2, 3],
            'nested_dict': {
                'inner': [[1, 2, 3], [4, 5, 6]]
            },
            'complex': [{
                'inner_1': 'world',
                'inner_2': 'audio.wav'
            }]
        })
```
The structure of the `data_dir` is as follows:
```text
data_dir
├── 00000.tar
├── 00001.tar
├── 00002.tar
└── dataset_info.json
```
`max_count` and `max_size` cap the number of samples and the size of each shard, respectively. Here `max_count` is
set to 2, so the 5 samples are split into 3 shards (2 + 2 + 1).
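
The shard count follows from simple arithmetic. The sketch below is not part of Tarzan: `plan_shards` is a hypothetical helper that only reproduces the count-based split, assuming the five-digit zero-padded file names seen in the listing above and ignoring `max_size`.

```python
import math
from typing import List, Tuple


def plan_shards(num_samples: int, max_count: int) -> List[Tuple[str, int]]:
    """Return (shard_name, sample_count) pairs for a count-limited split."""
    num_shards = math.ceil(num_samples / max_count)
    plan = []
    for shard_idx in range(num_shards):
        start = shard_idx * max_count
        # The last shard holds whatever is left over.
        count = min(max_count, num_samples - start)
        # Assumption: five-digit zero padding, as in the listing above.
        plan.append((f"{shard_idx:05d}.tar", count))
    return plan


shards = plan_shards(num_samples=5, max_count=2)
for name, count in shards:
    print(name, count)
```

With 5 samples and `max_count=2`, the plan contains three entries, the last holding a single sample.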
`dataset_info.json` is a JSON file serialized from `info`; the reader relies on it to load the data later.
```bash
cat data_dir/dataset_info.json
```
```json
{
  "description": "A fake dataset",
  "file_list": [
    "00000.tar",
    "00001.tar",
    "00002.tar"
  ],
  "features": {
    "single": {
      "_type": "Text"
    },
    "nested_list": [
      {
        "shape": [],
        "dtype": "int32",
        "_type": "Scalar"
      }
    ],
    "nested_dict": {
      "inner": {
        "shape": [
          null,
          3
        ],
        "dtype": "float32",
        "_type": "Tensor"
      }
    },
    "complex": [
      {
        "inner_1": {
          "_type": "Text"
        },
        "inner_2": {
          "shape": [
            null
          ],
          "dtype": "float32",
          "_type": "Audio",
          "sample_rate": 16000
        }
      }
    ]
  },
  "metadata": {
    "version": "1.0.0"
  }
}
```
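
Because the file is plain JSON, it can be inspected with the standard library alone. The following is a rough sketch rather than Tarzan's own loader: `leaf_features` is a hypothetical helper, and it assumes that any object carrying a `_type` key is a leaf feature and that a one-element list describes a sequence of that element, as the dump above suggests.

```python
import json

# The "features" portion of the dataset_info.json shown above.
FEATURES_JSON = """
{
  "single": {"_type": "Text"},
  "nested_list": [{"shape": [], "dtype": "int32", "_type": "Scalar"}],
  "nested_dict": {
    "inner": {"shape": [null, 3], "dtype": "float32", "_type": "Tensor"}
  },
  "complex": [
    {
      "inner_1": {"_type": "Text"},
      "inner_2": {"shape": [null], "dtype": "float32", "_type": "Audio",
                  "sample_rate": 16000}
    }
  ]
}
"""


def leaf_features(node, prefix=""):
    """Yield (path, type_name) for every leaf in the feature tree."""
    if isinstance(node, dict) and "_type" in node:
        # Assumption: a dict carrying "_type" is a leaf feature.
        yield prefix, node["_type"]
    elif isinstance(node, dict):
        for key, child in node.items():
            yield from leaf_features(child, f"{prefix}/{key}" if prefix else key)
    elif isinstance(node, list):
        # Assumption: a one-element list describes a sequence; "*" stands for any index.
        yield from leaf_features(node[0], f"{prefix}/*")


features = json.loads(FEATURES_JSON)
leaves = dict(leaf_features(features))
for path, type_name in leaves.items():
    print(f"{path}: {type_name}")
```

Each printed path uses `/` as a separator, which lines up with how the samples are laid out inside the shards.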
You can peek into a Tar file without extracting it. Its layout mirrors the nested feature structure, with one top-level directory per sample index.
```bash
tree data_dir/00000.tar
```
```text
.
├── 0
│   ├── complex
│   │   └── 0
│   │       ├── inner_1
│   │       └── inner_2
│   ├── nested_dict
│   │   └── inner
│   ├── nested_list
│   │   ├── 0
│   │   ├── 1
│   │   └── 2
│   └── single
└── 1
    ├── complex
    │   └── 0
    │       ├── inner_1
    │       └── inner_2
    ├── nested_dict
    │   └── inner
    ├── nested_list
    │   ├── 0
    │   ├── 1
    │   └── 2
    └── single
```
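
To make the mapping concrete, here is a small standard-library sketch of how a nested example could be flattened into member paths like the ones above and listed back without extraction. It is an illustration under assumptions, not Tarzan's serialization: `flatten` and `write_samples` are hypothetical helpers, leaves are stored as plain UTF-8 text, and tensor or audio leaves are left out because telling them apart from nested lists requires the feature schema.

```python
import io
import tarfile


def flatten(value, prefix):
    """Yield (member_path, leaf_value) pairs for a nested example."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from flatten(child, f"{prefix}/{key}")
    elif isinstance(value, list):
        for idx, child in enumerate(value):
            yield from flatten(child, f"{prefix}/{idx}")
    else:
        yield prefix, value


def write_samples(samples):
    """Write samples into an in-memory tar, one top-level directory per sample."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for sample_idx, sample in enumerate(samples):
            for path, leaf in flatten(sample, str(sample_idx)):
                payload = str(leaf).encode("utf-8")  # placeholder encoding
                member = tarfile.TarInfo(name=path)
                member.size = len(payload)
                tar.addfile(member, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


example = {
    "single": "hello",
    "nested_list": [1, 2, 3],
    "complex": [{"inner_1": "world"}],
}
archive = write_samples([example, example])

# List member names without extracting anything.
with tarfile.open(fileobj=archive, mode="r") as tar:
    names = tar.getnames()

for name in names:
    print(name)
```

The same `tarfile.open(...).getnames()` call works on a shard on disk if you pass its path as the first argument instead of `fileobj`.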
3. Read the dataset with `TarReader`.
```python
from tarzan.readers import TarReader

reader = TarReader.from_dataset_info('data_dir/dataset_info.json')

for tar_name, idx, example in reader:
    print(tar_name, idx, example)
```
```text
data_dir/00000.tar 0 {'nested_dict': {'inner': array([[1., 2., 3.],
       [4., 5., 6.]], dtype=float32)}, 'single': 'hello', 'complex': [{'inner_1': 'world', 'inner_2': <tarzan.features.audio.AudioDecoder object at 0x7fb8903443d0>}], 'nested_list': [array(1, dtype=int32), array(2, dtype=int32), array(3, dtype=int32)]}
...
```
Note that the `Audio` feature is returned as a lazy `AudioDecoder` object, which avoids reading large audio data until it is needed.
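
The general idea behind lazy decoding can be sketched with the standard library. The class below is a hypothetical stand-in and does not reflect the actual `AudioDecoder` interface: `LazyWav`, `frames`, and `make_wav` are names invented for this illustration. It keeps the raw bytes and only parses them on first access.

```python
import io
import wave


class LazyWav:
    """Hypothetical lazy decoder; not tarzan's AudioDecoder."""

    def __init__(self, raw_bytes: bytes):
        self._raw = raw_bytes    # cheap: only keeps a reference to the bytes
        self._frames = None      # filled in on first access
        self.decode_calls = 0    # counts how often decoding actually ran

    def frames(self) -> bytes:
        """Decode the PCM frames on first use and cache the result."""
        if self._frames is None:
            self.decode_calls += 1
            with wave.open(io.BytesIO(self._raw), "rb") as wav:
                self._frames = wav.readframes(wav.getnframes())
        return self._frames


def make_wav(num_frames: int, sample_rate: int = 16000) -> bytes:
    """Build a silent mono 16-bit WAV in memory for the demo."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * num_frames)
    return buffer.getvalue()


audio = LazyWav(make_wav(num_frames=160))
calls_before = audio.decode_calls   # nothing has been decoded yet
pcm = audio.frames()                # first access triggers decoding
audio.frames()                      # second access reuses the cached frames
print(calls_before, len(pcm), audio.decode_calls)
```

Iterating over a dataset and touching only text fields would never pay the decoding cost for the audio in such a design.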

            

Raw data

            {
    "_id": null,
    "home_page": "",
    "name": "tarzan",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": "",
    "keywords": "datasets,tar",
    "author": "",
    "author_email": "Yuchao Zhang <418121364@qq.com>",
    "download_url": "https://files.pythonhosted.org/packages/e3/ce/8bf72720140df49ed684ab9c46ba5d3fb24dd952524078ab3dadee7e704d/tarzan-0.1.0.tar.gz",
    "platform": null,
    "description": "# Tarzan\n\nTar, as a high performance streamable format, has been widely used in the DL community\n(e.g. [TorchData](https://github.com/pytorch/data), [WebDataset](https://github.com/webdataset/webdataset)).\n[TFDS](https://www.tensorflow.org/datasets/add_dataset)-like dataset builder API provides a high-level interface for\nusers to build their own datasets, and is also adopted\nby [HuggingFace](https://huggingface.co/docs/datasets/main/en/image_dataset#loading-script).\n\nWhy not connect the two? Tarzan provides a minimal high-level API to help users build their own Tar-based datasets. It\nalso maps well between nested feature and Tar file structure to let you peek into the Tar file without extracting it.\n\n## Installation\n\n```bash\npip install tarzan\n```\n\n## Quick Start\n\n1. Define your dataset info, which describes the dataset structure and any metadata.\n```python\nfrom tarzan.info import DatasetInfo\nfrom tarzan.features import Features, Text, Scalar, Tensor, Audio\n\ninfo = DatasetInfo(\n   description=\"A fake dataset\",\n   features=Features({\n       'single': Text(),\n       'nested_list': [Scalar('int32')],\n       'nested_dict': {\n           'inner': Tensor(shape=(None, 3), dtype='float32'),\n       },\n       'complex': [{\n           'inner_1': Text(),\n           'inner_2': Audio(sample_rate=16000),\n       }]\n   }),\n   metadata={\n       'version': '1.0.0'\n   }\n)\n```\n\n2. 
Write your data to Tar files with `ShardWriter`.\n```python\nfrom tarzan.writers import ShardWriter \nwith ShardWriter('data_dir', info, max_count=2) as writer:\n   for i in range(5):\n      writer.write({\n          'single': 'hello',\n          'nested_list': [1, 2, 3],\n          'nested_dict': {\n              'inner': [[1, 2, 3], [4, 5, 6]]\n          },\n          'complex': [{\n              'inner_1': 'world',\n              'inner_2': 'audio.wav'\n          }]\n      })\n```\nThe structure of the `data_dir` is as follows:\n```text\ndata_dir\n\u251c\u2500\u2500 00000.tar\n\u251c\u2500\u2500 00001.tar\n\u251c\u2500\u2500 00002.tar\n\u2514\u2500\u2500 dataset_info.json\n```\n`max_count` and `max_size` control the maximum number of samples and the maximum size of each shard. Here we set the\n`max_count` to 2 to create 3 shards.\n`dataset_info.json` is a json file serialized from `info, which we rely on to read the data later.\n```bash\ncat data_dir/dataset_info.json\n```\n```json\n{\n  \"description\": \"A fake dataset\",\n  \"file_list\": [\n    \"00000.tar\",\n    \"00000.tar\",\n    \"00001.tar\",\n    \"00002.tar\"\n  ],\n  \"features\": {\n    \"single\": {\n      \"_type\": \"Text\"\n    },\n    \"nested_list\": [\n      {\n        \"shape\": [],\n        \"dtype\": \"int32\",\n        \"_type\": \"Scalar\"\n      }\n    ],\n    \"nested_dict\": {\n      \"inner\": {\n        \"shape\": [\n          null,\n          3\n        ],\n        \"dtype\": \"float32\",\n        \"_type\": \"Tensor\"\n      }\n    },\n    \"complex\": [\n      {\n        \"inner_1\": {\n          \"_type\": \"Text\"\n        },\n        \"inner_2\": {\n          \"shape\": [\n            null\n          ],\n          \"dtype\": \"float32\",\n          \"_type\": \"Audio\",\n          \"sample_rate\": 16000\n        }\n      }\n    ]\n  },\n  \"metadata\": {\n    \"version\": \"1.0.0\"\n  }\n}\n```\nYou can peek the tar file without extracting it and it should map well to the 
nested feature structure.\n```bash\ntree data_dir/00000.tar\n```\n```text\n.\n\u251c\u2500\u2500 0\n\u2502\u00a0\u00a0 \u251c\u2500\u2500 complex\n\u2502\u00a0\u00a0 \u2502\u00a0\u00a0 \u2514\u2500\u2500 0\n\u2502\u00a0\u00a0 \u2502\u00a0\u00a0     \u251c\u2500\u2500 inner_1\n\u2502\u00a0\u00a0 \u2502\u00a0\u00a0     \u2514\u2500\u2500 inner_2\n\u2502\u00a0\u00a0 \u251c\u2500\u2500 nested_dict\n\u2502\u00a0\u00a0 \u2502\u00a0\u00a0 \u2514\u2500\u2500 inner\n\u2502\u00a0\u00a0 \u251c\u2500\u2500 nested_list\n\u2502\u00a0\u00a0 \u2502\u00a0\u00a0 \u251c\u2500\u2500 0\n\u2502\u00a0\u00a0 \u2502\u00a0\u00a0 \u251c\u2500\u2500 1\n\u2502\u00a0\u00a0 \u2502\u00a0\u00a0 \u2514\u2500\u2500 2\n\u2502\u00a0\u00a0 \u2514\u2500\u2500 single\n\u2514\u2500\u2500 1\n    \u251c\u2500\u2500 complex\n    \u2502\u00a0\u00a0 \u2514\u2500\u2500 0\n    \u2502\u00a0\u00a0     \u251c\u2500\u2500 inner_1\n    \u2502\u00a0\u00a0     \u2514\u2500\u2500 inner_2\n    \u251c\u2500\u2500 nested_dict\n    \u2502\u00a0\u00a0 \u2514\u2500\u2500 inner\n    \u251c\u2500\u2500 nested_list\n    \u2502\u00a0\u00a0 \u251c\u2500\u2500 0\n    \u2502\u00a0\u00a0 \u251c\u2500\u2500 1\n    \u2502\u00a0\u00a0 \u2514\u2500\u2500 2\n    \u2514\u2500\u2500 single\n```\n3.Read the dataset with `TarReader`\n```python\nfrom tarzan.readers import TarReader\nreader = TarReader.from_dataset_info('data_dir/dataset_info.json')\n\nfor tar_name, idx, example in reader:\n    print(tar_name, idx, example)\n```\n```text\ndata_dir/00000.tar 0 {'nested_dict': {'inner': array([[1., 2., 3.],\n       [4., 5., 6.]], dtype=float32)}, 'single': 'hello', 'complex': [{'inner_1': 'world', 'inner_2': <tarzan.features.audio.AudioDecoder object at 0x7fb8903443d0>}], 'nested_list': [array(1, dtype=int32), array(2, dtype=int32), array(3, dtype=int32)]}\n...\n```\nNote that the `Audio` feature is returned as a lazy read object `AudioDecoder` to avoid unnecessary read for large audio.\n",
    "bugtrack_url": null,
    "license": "",
    "summary": "high-level IO for tar based dataset",
    "version": "0.1.0",
    "project_urls": null,
    "split_keywords": [
        "datasets",
        "tar"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "99d7a71cebeabed08faca09ee81e4ea7023d8ca5ffd022e5dc6a589dcfe6f287",
                "md5": "27cbeccad72acd0ea7f2590018f02423",
                "sha256": "3b46e2def0a1dfe717068737be628a7fc991ca00478dfb6cb3192fc4268e075f"
            },
            "downloads": -1,
            "filename": "tarzan-0.1.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "27cbeccad72acd0ea7f2590018f02423",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 19381,
            "upload_time": "2024-02-03T07:45:39",
            "upload_time_iso_8601": "2024-02-03T07:45:39.637850Z",
            "url": "https://files.pythonhosted.org/packages/99/d7/a71cebeabed08faca09ee81e4ea7023d8ca5ffd022e5dc6a589dcfe6f287/tarzan-0.1.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "e3ce8bf72720140df49ed684ab9c46ba5d3fb24dd952524078ab3dadee7e704d",
                "md5": "a438acad7e8c5940e43b98f3812110b0",
                "sha256": "c42467aff5d61fdfea8d0ac48067ff96d3e0e813763588775a112e493deffb69"
            },
            "downloads": -1,
            "filename": "tarzan-0.1.0.tar.gz",
            "has_sig": false,
            "md5_digest": "a438acad7e8c5940e43b98f3812110b0",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 15598,
            "upload_time": "2024-02-03T07:45:42",
            "upload_time_iso_8601": "2024-02-03T07:45:42.565384Z",
            "url": "https://files.pythonhosted.org/packages/e3/ce/8bf72720140df49ed684ab9c46ba5d3fb24dd952524078ab3dadee7e704d/tarzan-0.1.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-02-03 07:45:42",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "tarzan"
}
        