| Field | Value |
| --- | --- |
| Name | tarzan |
| Version | 0.1.0 |
| Summary | high-level IO for tar based dataset |
| upload_time | 2024-02-03 07:45:42 |
| docs_url | None |
| requires_python | >=3.8 |
| keywords | datasets, tar |
| requirements | No requirements were recorded. |
# Tarzan
Tar, as a high-performance streamable format, is widely used in the DL community
(e.g. [TorchData](https://github.com/pytorch/data), [WebDataset](https://github.com/webdataset/webdataset)).
A [TFDS](https://www.tensorflow.org/datasets/add_dataset)-like dataset builder API provides a high-level interface for
users to build their own datasets, and has also been adopted
by [HuggingFace](https://huggingface.co/docs/datasets/main/en/image_dataset#loading-script).
Why not connect the two? Tarzan provides a minimal high-level API to help users build their own Tar-based datasets. It
also maps the nested feature structure onto the Tar file structure, so you can peek into a Tar file without extracting it.
## Installation
```bash
pip install tarzan
```
## Quick Start
1. Define your dataset info, which describes the dataset structure and any metadata.
```python
from tarzan.info import DatasetInfo
from tarzan.features import Features, Text, Scalar, Tensor, Audio

info = DatasetInfo(
    description="A fake dataset",
    features=Features({
        'single': Text(),
        'nested_list': [Scalar('int32')],
        'nested_dict': {
            'inner': Tensor(shape=(None, 3), dtype='float32'),
        },
        'complex': [{
            'inner_1': Text(),
            'inner_2': Audio(sample_rate=16000),
        }]
    }),
    metadata={
        'version': '1.0.0'
    }
)
```
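This nested feature structure is what later determines the member paths inside each tar shard. As a rough illustration of that mapping (plain Python, not tarzan's actual serialization code), flattening a nested example into per-member paths looks like this; the byte-string placeholders stand in for serialized tensor/audio leaves, which tarzan keeps whole according to the feature spec:

```python
def flatten(example, prefix=''):
    # Dicts recurse by key, lists by index; anything else is a leaf
    # that would become one tar member. (In tarzan itself, the feature
    # spec decides what counts as a leaf -- e.g. a Tensor value stays whole.)
    if isinstance(example, dict):
        for key, value in example.items():
            yield from flatten(value, f'{prefix}{key}/')
    elif isinstance(example, list):
        for i, value in enumerate(example):
            yield from flatten(value, f'{prefix}{i}/')
    else:
        yield prefix.rstrip('/'), example

example = {
    'single': 'hello',
    'nested_list': [1, 2, 3],
    'nested_dict': {'inner': b'<serialized tensor>'},
    'complex': [{'inner_1': 'world', 'inner_2': b'<wav bytes>'}],
}
paths = [path for path, _ in flatten(example, '0/')]
print(paths)
# ['0/single', '0/nested_list/0', '0/nested_list/1', '0/nested_list/2',
#  '0/nested_dict/inner', '0/complex/0/inner_1', '0/complex/0/inner_2']
```

These paths match the member layout shown by the tree listing further down.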
2. Write your data to Tar files with `ShardWriter`.
```python
from tarzan.writers import ShardWriter

with ShardWriter('data_dir', info, max_count=2) as writer:
    for i in range(5):
        writer.write({
            'single': 'hello',
            'nested_list': [1, 2, 3],
            'nested_dict': {
                'inner': [[1, 2, 3], [4, 5, 6]]
            },
            'complex': [{
                'inner_1': 'world',
                'inner_2': 'audio.wav'
            }]
        })
```
The structure of the `data_dir` is as follows:
```text
data_dir
├── 00000.tar
├── 00001.tar
├── 00002.tar
└── dataset_info.json
```
`max_count` and `max_size` control the maximum number of samples and the maximum size of each shard, respectively. Here we set `max_count` to 2, so the 5 samples are split into 3 shards.
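The shard count follows from a simple ceiling division, independent of tarzan itself:

```python
import math

def num_shards(n_samples, max_count):
    # Each shard holds at most max_count samples, so the number of
    # shards is the ceiling of n_samples / max_count.
    return math.ceil(n_samples / max_count)

print(num_shards(5, 2))  # -> 3
```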
`dataset_info.json` is a JSON file serialized from `info`; we rely on it to read the data back later.
```bash
cat data_dir/dataset_info.json
```
```json
{
  "description": "A fake dataset",
  "file_list": [
    "00000.tar",
    "00001.tar",
    "00002.tar"
  ],
  "features": {
    "single": {
      "_type": "Text"
    },
    "nested_list": [
      {
        "shape": [],
        "dtype": "int32",
        "_type": "Scalar"
      }
    ],
    "nested_dict": {
      "inner": {
        "shape": [
          null,
          3
        ],
        "dtype": "float32",
        "_type": "Tensor"
      }
    },
    "complex": [
      {
        "inner_1": {
          "_type": "Text"
        },
        "inner_2": {
          "shape": [
            null
          ],
          "dtype": "float32",
          "_type": "Audio",
          "sample_rate": 16000
        }
      }
    ]
  },
  "metadata": {
    "version": "1.0.0"
  }
}
```
You can peek into a tar file without extracting it, and its layout maps directly to the nested feature structure.
```bash
tree data_dir/00000.tar
```
```text
.
├── 0
│ ├── complex
│ │ └── 0
│ │ ├── inner_1
│ │ └── inner_2
│ ├── nested_dict
│ │ └── inner
│ ├── nested_list
│ │ ├── 0
│ │ ├── 1
│ │ └── 2
│ └── single
└── 1
├── complex
│ └── 0
│ ├── inner_1
│ └── inner_2
├── nested_dict
│ └── inner
├── nested_list
│ ├── 0
│ ├── 1
│ └── 2
└── single
```
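If your `tree` cannot look inside tar archives, the same peek is possible with Python's standard-library `tarfile`, which lists member names without extracting anything. The snippet below builds a tiny in-memory tar so it is self-contained; for real data you would open `data_dir/00000.tar` instead:

```python
import io
import tarfile

# Build a small in-memory tar (stands in for data_dir/00000.tar).
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode='w') as tar:
    for name in ['0/single', '0/nested_list/0', '0/nested_dict/inner']:
        data = b'payload'
        member = tarfile.TarInfo(name=name)
        member.size = len(data)
        tar.addfile(member, io.BytesIO(data))

# List member names without extracting anything.
buf.seek(0)
with tarfile.open(fileobj=buf, mode='r') as tar:
    for member in tar.getmembers():
        print(member.name)
```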
3. Read the dataset with `TarReader`.
```python
from tarzan.readers import TarReader
reader = TarReader.from_dataset_info('data_dir/dataset_info.json')
for tar_name, idx, example in reader:
print(tar_name, idx, example)
```
```text
data_dir/00000.tar 0 {'nested_dict': {'inner': array([[1., 2., 3.],
[4., 5., 6.]], dtype=float32)}, 'single': 'hello', 'complex': [{'inner_1': 'world', 'inner_2': <tarzan.features.audio.AudioDecoder object at 0x7fb8903443d0>}], 'nested_list': [array(1, dtype=int32), array(2, dtype=int32), array(3, dtype=int32)]}
...
```
Note that the `Audio` feature is returned as a lazy-read `AudioDecoder` object, which avoids unnecessary reads for large audio files.