# numpy2tfrecord
Simple helper library to convert numpy data to tfrecord and build a tensorflow dataset.
## Installation
```sh
$ git clone git@github.com:yonetaniryo/numpy2tfrecord.git
$ cd numpy2tfrecord
$ pip install .
```
or simply using pip:
```sh
$ pip install numpy2tfrecord
```
## How to use
### Convert a collection of numpy data to tfrecord
You can convert samples represented in the form of a `dict` to `tf.train.Example` and save them as a tfrecord.
```python
import numpy as np
from numpy2tfrecord import Numpy2TFRecordConverter

with Numpy2TFRecordConverter("test.tfrecord") as converter:
    x = np.arange(100).reshape(10, 10).astype(np.float32)  # float array
    y = np.arange(100).reshape(10, 10).astype(np.int64)  # int array
    a = 5  # int
    b = 0.3  # float
    sample = {"x": x, "y": y, "a": a, "b": b}
    converter.convert_sample(sample)  # convert data sample
```
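To make the `dict` to `tf.train.Example` step more concrete, here is a minimal sketch built only on TensorFlow's public protobuf classes. It is an illustration, not the library's actual implementation: `to_feature` and `to_example` are hypothetical helper names, and this sketch flattens arrays without recording their shapes.

```python
import numpy as np
import tensorflow as tf


def to_feature(value):
    """Wrap an int/float scalar or array in a tf.train.Feature (hypothetical helper)."""
    arr = np.asarray(value)
    flat = arr.ravel().tolist()  # protobuf lists are flat, so the shape is dropped here
    if np.issubdtype(arr.dtype, np.integer):
        return tf.train.Feature(int64_list=tf.train.Int64List(value=flat))
    if np.issubdtype(arr.dtype, np.floating):
        return tf.train.Feature(float_list=tf.train.FloatList(value=flat))
    raise TypeError(f"unsupported dtype: {arr.dtype}")


def to_example(sample):
    """Convert a dict of numpy data into a tf.train.Example (hypothetical helper)."""
    features = {key: to_feature(value) for key, value in sample.items()}
    return tf.train.Example(features=tf.train.Features(feature=features))


example = to_example({"x": np.arange(4, dtype=np.float32).reshape(2, 2), "a": 5})
serialized = example.SerializeToString()  # bytes that a TFRecord writer would store
```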
You can also convert a `list` of samples at once using `convert_list`.
```python
with Numpy2TFRecordConverter("test.tfrecord") as converter:
    samples = [
        {
            "x": np.random.rand(64).astype(np.float32),
            "y": np.random.randint(0, 10),
        }
        for _ in range(32)
    ]  # list of 32 samples

    converter.convert_list(samples)
```
Or a batch of samples at once using `convert_batch`.
```python
with Numpy2TFRecordConverter("test.tfrecord") as converter:
    samples = {
        "x": np.random.rand(32, 64).astype(np.float32),
        "y": np.random.randint(0, 10, size=32).astype(np.int64),
    }  # batch of 32 samples

    converter.convert_batch(samples)
```
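The `list` layout and the batch layout describe the same data. As an illustration, a small helper can turn a list of per-sample dicts into a single batched dict by stacking along a new leading axis. The name `stack_samples` is hypothetical and not part of `numpy2tfrecord`, and the sketch assumes every sample has the same keys and shapes.

```python
import numpy as np


def stack_samples(samples):
    """Stack a list of per-sample dicts into one dict of batched arrays."""
    keys = samples[0].keys()
    return {key: np.stack([np.asarray(s[key]) for s in samples]) for key in keys}


samples = [
    {"x": np.random.rand(64).astype(np.float32), "y": np.random.randint(0, 10)}
    for _ in range(32)
]
batch = stack_samples(samples)
print(batch["x"].shape, batch["y"].shape)
```

The resulting `batch` has the same structure as the dict passed to `convert_batch`: one array per key, with the sample index as the first axis.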
So what are the advantages of `Numpy2TFRecordConverter` over `tf.data.Dataset.from_tensor_slices`?
Simply put, `tf.data.Dataset.from_tensor_slices` requires all the samples that will make up the dataset to be in memory at once.
`Numpy2TFRecordConverter`, on the other hand, lets you append samples to the tfrecord sequentially, without having to load all of them into memory beforehand.
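A minimal sketch of this streaming pattern: the samples come from a generator, so only one of them needs to exist in memory at a time. The helper names `iter_samples` and `write_streaming` are hypothetical; the only assumption about the converter is the `convert_sample` method shown earlier.

```python
import numpy as np


def iter_samples(num_samples, dim=64, seed=0):
    """Yield samples one by one instead of materializing them all."""
    rng = np.random.default_rng(seed)
    for _ in range(num_samples):
        yield {
            "x": rng.random(dim, dtype=np.float32),
            "y": int(rng.integers(0, 10)),
        }


def write_streaming(converter, samples):
    """Feed samples to the converter one at a time; return how many were written."""
    count = 0
    for sample in samples:
        converter.convert_sample(sample)
        count += 1
    return count
```

With the real converter, this would be called inside the `with Numpy2TFRecordConverter(...) as converter:` block, for example as `write_streaming(converter, iter_samples(100_000))`.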
### Build a tensorflow dataset from tfrecord
Samples once stored in the tfrecord can be streamed using `tf.data.TFRecordDataset`.
```python
from numpy2tfrecord import build_dataset_from_tfrecord
dataset = build_dataset_from_tfrecord("test.tfrecord")
```
The dataset can then be iterated over directly in a training loop.
```python
for batch in dataset.as_numpy_iterator():
    x, y = batch.values()
    ...
```
### Speeding up PyTorch data loading with `numpy2tfrecord`!
https://gist.github.com/yonetaniryo/c1780e58b841f30150c45233d3fe6d01
```python
import os
import time

import numpy as np
from numpy2tfrecord import Numpy2TFRecordConverter, build_dataset_from_tfrecord
import torch
from torchvision import datasets, transforms

dataset = datasets.MNIST(".", download=True, transform=transforms.ToTensor())

# convert to tfrecord
with Numpy2TFRecordConverter("mnist.tfrecord") as converter:
    converter.convert_batch({"x": dataset.data.numpy().astype(np.int64),
                             "y": dataset.targets.numpy().astype(np.int64)})

# time 5 epochs with the PyTorch DataLoader
torch_loader = torch.utils.data.DataLoader(dataset, batch_size=32, pin_memory=True, num_workers=os.cpu_count())
tic = time.time()
for e in range(5):
    for batch in torch_loader:
        x, y = batch
elapsed = time.time() - tic
print(f"elapsed time with pytorch dataloader: {elapsed:0.2f} sec for 5 epochs")

# time 5 epochs with the tf.data pipeline built from the tfrecord
tf_loader = build_dataset_from_tfrecord("mnist.tfrecord").batch(32).prefetch(1)
tic = time.time()
for e in range(5):
    for batch in tf_loader.as_numpy_iterator():
        x, y = batch.values()
elapsed = time.time() - tic
print(f"elapsed time with tf dataloader: {elapsed:0.2f} sec for 5 epochs")
```
⬇️
```
elapsed time with pytorch dataloader: 41.10 sec for 5 epochs
elapsed time with tf dataloader: 17.34 sec for 5 epochs
```
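Both timing loops in the benchmark follow the same pattern, so the measurement could be factored into a small helper. This is only a sketch: `time_epochs` is a hypothetical name, and `make_iterator` stands for any zero-argument callable that returns a fresh iterator over batches (for example `lambda: iter(torch_loader)` or `tf_loader.as_numpy_iterator`).

```python
import time


def time_epochs(make_iterator, num_epochs=5):
    """Return the wall-clock seconds needed to iterate num_epochs times."""
    tic = time.perf_counter()
    for _ in range(num_epochs):
        for _batch in make_iterator():
            pass  # only data loading is measured, no training step
    return time.perf_counter() - tic


# Stand-in iterator so the sketch runs without torch or tensorflow.
elapsed = time_epochs(lambda: iter(range(1000)), num_epochs=5)
print(f"elapsed time: {elapsed:0.2f} sec for 5 epochs")
```

For the numbers reported above, the ratio 41.10 / 17.34 works out to roughly a 2.4x difference in loading time.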
Raw data
{
"_id": null,
"home_page": "https://github.com/yonetaniryo/numpy2tfrecord",
"name": "numpy2tfrecord",
"maintainer": "",
"docs_url": null,
"requires_python": "",
"maintainer_email": "",
"keywords": "numpy,tfrecord,tensorflow",
"author": "Ryo Yonetani",
"author_email": "yonetani.vision@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/4e/0b/919950e84385fa697966ef54683b3b4d981a206f4063c478517477263c67/numpy2tfrecord-0.0.3.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT License",
"summary": "Convert a collection of numpy data to tfrecord",
"version": "0.0.3",
"split_keywords": [
"numpy",
"tfrecord",
"tensorflow"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "8465265df14bfda999f279f34070b58a0f38df56cf2079206193082f29baf32d",
"md5": "0fd4dc0b35258a617895d58e104a6f80",
"sha256": "e21e3507f92c3c5e90633fe8483d466543c16b754f062269575aa460f2090ed7"
},
"downloads": -1,
"filename": "numpy2tfrecord-0.0.3-py3-none-any.whl",
"has_sig": false,
"md5_digest": "0fd4dc0b35258a617895d58e104a6f80",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 5331,
"upload_time": "2023-03-26T07:38:40",
"upload_time_iso_8601": "2023-03-26T07:38:40.064456Z",
"url": "https://files.pythonhosted.org/packages/84/65/265df14bfda999f279f34070b58a0f38df56cf2079206193082f29baf32d/numpy2tfrecord-0.0.3-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "4e0b919950e84385fa697966ef54683b3b4d981a206f4063c478517477263c67",
"md5": "7a899e98c894a8c67703416498dfc375",
"sha256": "fa44db6cc26677f3886ef1c5dc0bda13f3cf390247907388ee62acf12035f111"
},
"downloads": -1,
"filename": "numpy2tfrecord-0.0.3.tar.gz",
"has_sig": false,
"md5_digest": "7a899e98c894a8c67703416498dfc375",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 5120,
"upload_time": "2023-03-26T07:38:42",
"upload_time_iso_8601": "2023-03-26T07:38:42.169900Z",
"url": "https://files.pythonhosted.org/packages/4e/0b/919950e84385fa697966ef54683b3b4d981a206f4063c478517477263c67/numpy2tfrecord-0.0.3.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-03-26 07:38:42",
"github": true,
"gitlab": false,
"bitbucket": false,
"github_user": "yonetaniryo",
"github_project": "numpy2tfrecord",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "numpy2tfrecord"
}