numpy2tfrecord

Name	numpy2tfrecord
Version	0.0.3
home_page	https://github.com/yonetaniryo/numpy2tfrecord
Summary	Convert a collection of numpy data to tfrecord
upload_time	2023-03-26 07:38:42
maintainer
docs_url	None
author	Ryo Yonetani
requires_python
license	MIT License
keywords	numpy tfrecord tensorflow
VCS
bugtrack_url
requirements	No requirements were recorded.
Travis-CI	No Travis.
coveralls test coverage	No coveralls.

# numpy2tfrecord

Simple helper library to convert numpy data to tfrecord and build a tensorflow dataset.

## Installation
```sh
$ git clone git@github.com:yonetaniryo/numpy2tfrecord.git
$ cd numpy2tfrecord
$ pip install .
```
Or simply install it with pip:
```sh
$ pip install numpy2tfrecord
```


## How to use
### Convert a collection of numpy data to tfrecord

You can convert samples represented in the form of a `dict` to `tf.train.Example` and save them as a tfrecord.
```python
import numpy as np
from numpy2tfrecord import Numpy2TFRecordConverter

with Numpy2TFRecordConverter("test.tfrecord") as converter:
    x = np.arange(100).reshape(10, 10).astype(np.float32)  # float array
    y = np.arange(100).reshape(10, 10).astype(np.int64)  # int array
    a = 5  # int
    b = 0.3  # float
    sample = {"x": x, "y": y, "a": a, "b": b}
    converter.convert_sample(sample)  # convert data sample
```
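
A `tf.train.Example` stores each feature as a flat list of floats, 64-bit integers, or bytes, so arrays have to be flattened and their dtype mapped to one of those kinds. The sketch below illustrates that mapping with NumPy only. `describe_feature` is a hypothetical helper written for illustration; it is not part of `numpy2tfrecord`, and how the library handles shapes internally is not shown here.

```python
import numpy as np


def describe_feature(value):
    """Return (feature kind, flat values, original shape) for a scalar or array.

    Hypothetical helper: it only mirrors the float/int split of tf.train.Feature.
    """
    arr = np.asarray(value)
    if np.issubdtype(arr.dtype, np.floating):
        kind = "float_list"
    elif np.issubdtype(arr.dtype, np.integer):
        kind = "int64_list"
    else:
        raise TypeError(f"unsupported dtype: {arr.dtype}")
    # Features are stored flat, so the shape has to be tracked separately.
    return kind, arr.reshape(-1).tolist(), arr.shape


sample = {
    "x": np.arange(4).reshape(2, 2).astype(np.float32),  # float array
    "a": 5,  # int
    "b": 0.3,  # float
}
for key, value in sample.items():
    print(key, describe_feature(value))
```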

You can also convert a `list` of samples at once using `convert_list`.
```python
with Numpy2TFRecordConverter("test.tfrecord") as converter:
    samples = [
        {
            "x": np.random.rand(64).astype(np.float32),
            "y": np.random.randint(0, 10),
        }
        for _ in range(32)
    ]  # list of 32 samples

    converter.convert_list(samples)
```

Or convert a batch of samples at once using `convert_batch`.
```python
with Numpy2TFRecordConverter("test.tfrecord") as converter:
    samples = {
        "x": np.random.rand(32, 64).astype(np.float32),
        "y": np.random.randint(0, 10, size=32).astype(np.int64),
    }  # batch of 32 samples

    converter.convert_batch(samples)
```
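
The list and batch layouts carry the same information: a batch is a `dict` of arrays whose first axis indexes the samples, while a list holds one `dict` per sample. A minimal sketch of the correspondence, assuming the first axis is the sample axis (`batch_to_list` is a hypothetical helper, not a library function):

```python
import numpy as np


def batch_to_list(batch):
    """Split a dict of arrays into a list of per-sample dicts along axis 0."""
    sizes = {len(v) for v in batch.values()}
    if len(sizes) != 1:
        raise ValueError("all arrays must share the same first dimension")
    n = sizes.pop()
    return [{k: v[i] for k, v in batch.items()} for i in range(n)]


batch = {
    "x": np.random.rand(32, 64).astype(np.float32),
    "y": np.random.randint(0, 10, size=32).astype(np.int64),
}
samples = batch_to_list(batch)
print(len(samples), samples[0]["x"].shape)
```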

So what are the advantages of `Numpy2TFRecordConverter` over `tf.data.Dataset.from_tensor_slices`?
Simply put, `tf.data.Dataset.from_tensor_slices` requires all the samples that will make up the dataset to be in memory at once.
With `Numpy2TFRecordConverter`, by contrast, you can add samples to the tfrecord sequentially without having to load all of them into memory beforehand.
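
As a sketch of that sequential pattern, the generator below produces one chunk at a time, so only a single chunk is held in memory while it is written. `iter_chunks` and `write_chunks` are hypothetical helpers, and the random data stands in for whatever source you actually read from; the only library call assumed is the `convert_batch` method shown above.

```python
import numpy as np


def iter_chunks(num_chunks, chunk_size, dim, seed=0):
    """Yield one batch dict at a time (random stand-in data)."""
    rng = np.random.default_rng(seed)
    for _ in range(num_chunks):
        yield {
            "x": rng.random((chunk_size, dim)).astype(np.float32),
            "y": rng.integers(0, 10, size=chunk_size).astype(np.int64),
        }


def write_chunks(converter, chunks):
    """Append every chunk to an open converter; return the number of samples written."""
    total = 0
    for chunk in chunks:
        converter.convert_batch(chunk)  # each chunk can be discarded after this call
        total += len(chunk["y"])
    return total
```

Inside a `with Numpy2TFRecordConverter("stream.tfrecord") as converter:` block, this would be called as `write_chunks(converter, iter_chunks(10, 32, 64))`.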



### Build a tensorflow dataset from tfrecord
Once samples are stored in a tfrecord, they can be streamed using `tf.data.TFRecordDataset`.

```python
from numpy2tfrecord import build_dataset_from_tfrecord

dataset = build_dataset_from_tfrecord("test.tfrecord")
```

The dataset can then be used directly in a training loop.

```python
for batch in dataset.as_numpy_iterator():
    x, y = batch.values()
    ...
```
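
`batch.values()` unpacks fields in the dictionary's iteration order, so indexing by key can be a safer habit when a record has several fields. A minimal sketch with a stand-in element: the `element` dict below is made up to mirror the `x`/`y` samples written earlier and is not produced by the library, and `unpack` is a hypothetical helper.

```python
import numpy as np


def unpack(element, keys=("x", "y")):
    """Pick fields by name so the result does not depend on key order."""
    return tuple(element[k] for k in keys)


# Stand-in for one element yielded by dataset.as_numpy_iterator()
element = {"y": np.int64(3), "x": np.zeros(64, dtype=np.float32)}

x, y = unpack(element)
print(x.shape, y)
```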

### Speeding up PyTorch data loading with `numpy2tfrecord`!
https://gist.github.com/yonetaniryo/c1780e58b841f30150c45233d3fe6d01

```python
import os
import time

import numpy as np
from numpy2tfrecord import Numpy2TFRecordConverter, build_dataset_from_tfrecord
import torch
from torchvision import datasets, transforms

dataset = datasets.MNIST(".", download=True, transform=transforms.ToTensor())

# convert the raw MNIST images and labels to a tfrecord in one batch
with Numpy2TFRecordConverter("mnist.tfrecord") as converter:
    converter.convert_batch(
        {
            "x": dataset.data.numpy().astype(np.int64),
            "y": dataset.targets.numpy().astype(np.int64),
        }
    )

torch_loader = torch.utils.data.DataLoader(dataset, batch_size=32, pin_memory=True, num_workers=os.cpu_count())
tic = time.time()
for e in range(5):
    for batch in torch_loader:
        x, y = batch
elapsed = time.time() - tic
print(f"elapsed time with pytorch dataloader: {elapsed:0.2f} sec for 5 epochs")

tf_loader = build_dataset_from_tfrecord("mnist.tfrecord").batch(32).prefetch(1)
tic = time.time()
for e in range(5):
    for batch in tf_loader.as_numpy_iterator():
        x, y = batch.values()
elapsed = time.time() - tic
print(f"elapsed time with tf dataloader: {elapsed:0.2f} sec for 5 epochs")
```

⬇️

```
elapsed time with pytorch dataloader: 41.10 sec for 5 epochs
elapsed time with tf dataloader: 17.34 sec for 5 epochs
```
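
Both measurements follow the same pattern, so the timing code can be factored into one helper. The sketch below uses a hypothetical `time_epochs` function and a plain `range` as a stand-in loader; with the loaders above it would be called as `time_epochs(lambda: torch_loader)` and `time_epochs(tf_loader.as_numpy_iterator)`. Timings depend on hardware and on the number of workers, so the figures reported above reflect a single run.

```python
import time


def time_epochs(make_iterator, epochs=5):
    """Iterate `epochs` times over a fresh iterator; return (seconds, batches seen)."""
    n_batches = 0
    tic = time.perf_counter()
    for _ in range(epochs):
        for _batch in make_iterator():
            n_batches += 1
    return time.perf_counter() - tic, n_batches


# Stand-in loader: any zero-argument callable returning an iterable works.
elapsed, n = time_epochs(lambda: iter(range(1000)), epochs=5)
print(f"{n} batches in {elapsed:0.4f} sec")
```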

Raw data

{
    "_id": null,
    "home_page": "https://github.com/yonetaniryo/numpy2tfrecord",
    "name": "numpy2tfrecord",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "keywords": "numpy,tfrecord,tensorflow",
    "author": "Ryo Yonetani",
    "author_email": "yonetani.vision@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/4e/0b/919950e84385fa697966ef54683b3b4d981a206f4063c478517477263c67/numpy2tfrecord-0.0.3.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT License",
    "summary": "Convert a collection of numpy data to tfrecord",
    "version": "0.0.3",
    "split_keywords": [
        "numpy",
        "tfrecord",
        "tensorflow"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "8465265df14bfda999f279f34070b58a0f38df56cf2079206193082f29baf32d",
                "md5": "0fd4dc0b35258a617895d58e104a6f80",
                "sha256": "e21e3507f92c3c5e90633fe8483d466543c16b754f062269575aa460f2090ed7"
            },
            "downloads": -1,
            "filename": "numpy2tfrecord-0.0.3-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "0fd4dc0b35258a617895d58e104a6f80",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 5331,
            "upload_time": "2023-03-26T07:38:40",
            "upload_time_iso_8601": "2023-03-26T07:38:40.064456Z",
            "url": "https://files.pythonhosted.org/packages/84/65/265df14bfda999f279f34070b58a0f38df56cf2079206193082f29baf32d/numpy2tfrecord-0.0.3-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "4e0b919950e84385fa697966ef54683b3b4d981a206f4063c478517477263c67",
                "md5": "7a899e98c894a8c67703416498dfc375",
                "sha256": "fa44db6cc26677f3886ef1c5dc0bda13f3cf390247907388ee62acf12035f111"
            },
            "downloads": -1,
            "filename": "numpy2tfrecord-0.0.3.tar.gz",
            "has_sig": false,
            "md5_digest": "7a899e98c894a8c67703416498dfc375",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 5120,
            "upload_time": "2023-03-26T07:38:42",
            "upload_time_iso_8601": "2023-03-26T07:38:42.169900Z",
            "url": "https://files.pythonhosted.org/packages/4e/0b/919950e84385fa697966ef54683b3b4d981a206f4063c478517477263c67/numpy2tfrecord-0.0.3.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-03-26 07:38:42",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "github_user": "yonetaniryo",
    "github_project": "numpy2tfrecord",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "numpy2tfrecord"
}

Ryo Yonetani