tfr-reader

Name: tfr-reader
Version: 0.9.0
Summary: Tensorflow Record Reader with Random Access
Author: Krzysztof Kolasinski
Keywords: dataframe, tfrecords
Repository: https://github.com/kmkolasinski/tfrecords-reader
Upload time: 2025-11-01 13:14:53
# tfrecords-reader

Fast TensorFlow TFRecords reader for Python with random access and Google Storage streaming support.

```bash
pip install "tfr-reader"
# + Image classification dataset support
pip install "tfr-reader[datasets]"
# + Google Storage support
pip install "tfr-reader[google]"
# + All optional features
pip install "tfr-reader[datasets,google]"
```

## General Information
* No **TensorFlow** dependency - this library implements a custom TFRecord reader
* Protobuf is not required - this library ships a Cython decoder for TFRecord files
* Compressed TFRecord files are supported
* Fast random access to TFRecords, i.e. you can read any example from the dataset without
  reading the whole dataset, e.g.
    ```python
    import tfr_reader as tfr
    tfrds = tfr.TFRecordDatasetReader("/path/to/directory/with/tfrecords")
    example = tfrds[42]
    image_bytes: bytes = example["image/encoded"].value[0]
    ```

## Installation


* Base installation with minimum requirements:
    ```bash
    pip install "git+https://github.com/kmkolasinski/tfrecords-reader.git"
    ```
* For image classification dataset features (requires numpy, cython, opencv):
    ```bash
    pip install "git+https://github.com/kmkolasinski/tfrecords-reader.git#egg=tfr-reader[datasets]"
    pip install ".[datasets]"
    ```
* For extra Google Storage Cloud support use:
    ```bash
    pip install "git+https://github.com/kmkolasinski/tfrecords-reader.git#egg=tfr-reader[google]"
    ```

## Quick Start

```python
import tensorflow_datasets as tfds
import tfr_reader as tfr
from PIL import Image
import ipyplot

dataset, dataset_info = tfds.load('oxford_flowers102', split='train', with_info=True)

def index_fn(feature: tfr.Feature):
    label = feature["label"].value[0]
    return {
        "label": label,
        "name": dataset_info.features["label"].int2str(label)
    }

tfrds = tfr.load_from_directory(
    dataset_info.data_dir,
    # indexing options, not required if index is already created
    filepattern="*.tfrecord*",
    index_fn=index_fn,
    override=True, # override the index if it exists
)

# example selection using polars SQL query API
rows, examples = tfrds.select("select * from index where name ~ 'rose' limit 10")
assert examples == tfrds[rows["_row_id"]]

samples, names = [], []
for k, example in enumerate(examples):
    image = Image.open(example["image"].bytes_io[0]).resize((224, 224))
    names.append(rows["name"][k])
    samples.append(image)

ipyplot.plot_images(samples, names)
```
![demo](resources/quickstart.png)


## Usage

### Dataset Inspection
The `inspect_dataset_example` function lets you inspect the dataset: it returns a sample example
together with its feature types.
```python
import tfr_reader as tfr
dataset_dir = "/path/to/directory/with/tfrecords"
example, types = tfr.inspect_dataset_example(dataset_dir)
types
>>> Out[1]:
[{'key': 'label', 'type': 'int64_list', 'length': 1},
 {'key': 'name', 'type': 'bytes_list', 'length': 1},
 {'key': 'image_id', 'type': 'bytes_list', 'length': 1},
 {'key': 'image', 'type': 'bytes_list', 'length': 1}]
```

### Dataset Indexing
Create an index of the dataset for fast access. The index has one row per example and records
the TFRecord file name together with the byte range of that example inside the file, plus any
columns you choose to index. It is built by reading the dataset and parsing the examples, and it
is saved in the `dataset_dir` directory. Use the `indexed_cols_fn` function to specify the columns
you want to index; it should return a dictionary with column names as keys and column values as values.

> [!NOTE]
> Indexing operation works only for local files, remote files are not supported.


```python
import tfr_reader as tfr
dataset_dir = "/path/to/directory/with/tfrecords"

def indexed_cols_fn(feature):
    return {
        "label": feature["label"].value[0],
        "name": feature["name"].value[0].decode(),
        "image_id": feature["image/id"].value[0].decode(),
    }

tfrds = tfr.TFRecordDatasetReader.build_index_from_dataset_dir(dataset_dir, indexed_cols_fn)

tfrds.index_df[:5]
>>> Out[2]:
shape: (5, 6)
┌───────────────────┬────────────────┬──────────────┬──────┬───────┬────────────┐
│ tfrecord_filename ┆ tfrecord_start ┆ tfrecord_end ┆ name ┆ label ┆ image_id   │
│ ---               ┆ ---            ┆ ---          ┆ ---  ┆ ---   ┆ ---        │
│ str               ┆ i64            ┆ i64          ┆ str  ┆ i64   ┆ str        │
╞═══════════════════╪════════════════╪══════════════╪══════╪═══════╪════════════╡
│ demo.tfrecord     ┆ 0              ┆ 79           ┆ cat  ┆ 1     ┆ image-id-0 │
│ demo.tfrecord     ┆ 79             ┆ 158          ┆ dog  ┆ 0     ┆ image-id-1 │
│ demo.tfrecord     ┆ 158            ┆ 237          ┆ cat  ┆ 1     ┆ image-id-2 │
│ demo.tfrecord     ┆ 237            ┆ 316          ┆ dog  ┆ 0     ┆ image-id-3 │
│ demo.tfrecord     ┆ 316            ┆ 395          ┆ cat  ┆ 1     ┆ image-id-4 │
└───────────────────┴────────────────┴──────────────┴──────┴───────┴────────────┘
```
Explanation of the index format (the sketch after this list shows how the byte offsets enable random access):
* **tfrecord_filename**: name of the tfrecord file
* **tfrecord_start**: start byte position of the example in the tfrecord file
* **tfrecord_end**: end byte position of the example in the tfrecord file
* other columns: indexed columns from the dataset with `indexed_cols_fn` function
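
This layout is what makes random access cheap: a reader only has to seek to `tfrecord_start` and read
`tfrecord_end - tfrecord_start` bytes. Below is a minimal sketch using only the standard library; it is
for illustration rather than the library's internal API, and it assumes the offsets span the full framed
record (length header, CRCs, and payload).

```python
import struct

def read_raw_example(path: str, start: int, end: int) -> bytes:
    """Read the serialized tf.train.Example stored between two byte offsets.

    TFRecord framing: [length: 8 bytes][length CRC: 4 bytes][payload][payload CRC: 4 bytes].
    """
    with open(path, "rb") as f:
        f.seek(start)
        framed = f.read(end - start)
    (length,) = struct.unpack("<Q", framed[:8])  # little-endian uint64 payload length
    return framed[12 : 12 + length]  # skip the 8-byte length and its 4-byte CRC
```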

### Dataset Reading

```python
import tfr_reader as tfr

tfrds = tfr.TFRecordDatasetReader("/path/to/directory/with/tfrecords")
# assume that the dataset is indexed already
tfrds = tfr.TFRecordDatasetReader("gs://bucket/path/to/directory/with/tfrecords")
# selection API
selected_df, examples = tfrds.select("SELECT * FROM index WHERE name = 'cat' LIMIT 20")
# custom selection
selected_df = tfrds.index_df.sample(5)
examples = tfrds.load_records(selected_df)
# indexing API
for i in range(len(tfrds)):
    example = tfrds[i]
    # assuming image is encoded as bytes at key "image/encoded"
    image_bytes = example["image/encoded"].value[0]
    # label is encoded as int64 at key "label"
    label = example["label"].value[0]
```
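
The printed index above looks like a polars DataFrame, so the same selection can also be expressed with
the polars expression API instead of SQL. A hedged sketch under that assumption:

```python
import polars as pl

# Equivalent of the SQL query above, filtering the index directly
# (assumes tfrds.index_df is a polars DataFrame, as the printed output suggests).
selected_df = tfrds.index_df.filter(pl.col("name") == "cat").head(20)
examples = tfrds.load_records(selected_df)
```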


### Image Classification Dataset

For efficient batch processing of image classification datasets, you can use the `TFRecordsImageDataset` class, which provides:
* Multi-threaded data loading
* Batch processing with configurable batch size
* Image preprocessing (resizing)
* Shuffling and prefetching
* File interleaving for better data distribution

```python
from tfr_reader.datasets import TFRecordsImageDataset
from tqdm import tqdm

tfrecord_paths = ["/path/to/train.tfrecord"]

dataset = TFRecordsImageDataset(
    tfrecord_paths=tfrecord_paths,
    input_size=(320, 320),  # (height, width)
    batch_size=128,
    num_threads=6,
    shuffle=True,
    interleave_files=True,
    repeat=-1,
    prefetch=2,
)

# Iterate through the dataset
for images, labels in tqdm(dataset, total=len(dataset) // 128):
    # images: numpy array of shape (batch_size, height, width, channels)
    # labels: numpy array of shape (batch_size,)
    pass
```
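
With `repeat=-1` the dataset cycles indefinitely, so a training loop has to bound the number of steps
itself. A small sketch of one pass over the data; the normalization and the commented-out training call
are illustrative assumptions, not part of the library:

```python
import numpy as np

batch_size = 128
steps_per_epoch = len(dataset) // batch_size

for step, (images, labels) in enumerate(dataset):
    x = images.astype(np.float32) / 255.0  # scale pixels to [0, 1] (assumes uint8 images)
    # model.train_on_batch(x, labels)      # hypothetical training call
    if step + 1 >= steps_per_epoch:
        break
```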

### Custom Protobuf Decoder for TFRecord files

If protobuf is not installed, or if it falls back to the old and slow pure-Python decoder,
this library will use its own specialized protobuf decoder written in Cython.
To explicitly choose the decoder for TFRecord files, run:
```python
import tfr_reader as tfr
# to use custom protobuf decoder
tfr.set_decoder_type("cython")
# to use default protobuf decoder
tfr.set_decoder_type("protobuf")
```

            
