splitdataloader


Name: splitdataloader
Version: 0.0.1
Summary: A simple library to load training data from HDD or SSD
Upload time: 2023-12-09 11:37:13
Requires Python: >=3.8
License: MIT License, Copyright (c) 2023 Vinay
Keywords: deep-learning, datasets
Requirements: none recorded
# split-data-loader

This package contains simple utility functions for writing and reading
data (typically machine-learning training data) using multiple files.

There are two extremes when dealing with data storage:

1. All the data in a single file - this is good for sequential access,
   but shuffling the data while reading is cumbersome.
2. Each frame in its own file - this creates too many tiny files
   and is difficult to scale.

This library takes an intermediate approach. The entire dataset is split and
stored in multiple files (e.g. N = 128) called bins. This allows easy shuffling
of data and parallel processing when required.

The library also maintains an index file that keeps track of the order and
location of each packet. This allows index-based random lookup of any input
packet, distributed across all the bin files.
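
To make the layout concrete, here is a minimal sketch of the general
bin-plus-index idea. The round-robin bin assignment and the plain-text index
format below are illustrative assumptions, not this library's actual on-disk
format:

```python
import os
from typing import Iterable, List, Tuple

def write_binned(target_dir: str, packets: Iterable[bytes], splits: int) -> None:
    # Spread packets over `splits` bin files, recording where each one lands
    os.makedirs(target_dir, exist_ok=True)
    bins = [open(os.path.join(target_dir, f"bin_{b}.dat"), "wb") for b in range(splits)]
    index: List[Tuple[int, int, int]] = []  # (bin_id, offset, size)
    for i, packet in enumerate(packets):
        b = i % splits  # round-robin assignment (assumption)
        index.append((b, bins[b].tell(), len(packet)))
        bins[b].write(packet)
    for f in bins:
        f.close()
    with open(os.path.join(target_dir, "index.txt"), "w") as f:
        for entry in index:
            f.write("{} {} {}\n".format(*entry))

def read_packet(target_dir: str, index: List[Tuple[int, int, int]], i: int) -> bytes:
    # Random lookup: the index says which bin to open and where to seek
    b, offset, size = index[i]
    with open(os.path.join(target_dir, f"bin_{b}.dat"), "rb") as f:
        f.seek(offset)
        return f.read(size)
```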


## Writing Data
Use `write_split_data` to write data to a target directory.

```python
from typing import Iterable

from splitdataloader import write_split_data

def example_writer() -> None:
    # Get the data source (some_source is a placeholder for
    # any producer of bytes packets)
    data_source: Iterable[bytes] = some_source()
    target_dir = "tmp/training_data"
    write_split_data(target_dir, data_source, splits=128)
```
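
Note that `write_split_data` takes an iterable of `bytes`, so each packet must
be serialized first. One way to do that (an assumption for illustration, not a
requirement of the library) is to pickle each sample:

```python
import pickle
from typing import Iterable, Tuple

from splitdataloader import write_split_data

def pickled_samples(samples: Iterable[Tuple[list, int]]) -> Iterable[bytes]:
    # Turn each (features, label) sample into one bytes packet
    for sample in samples:
        yield pickle.dumps(sample)

samples = [([0.0, 1.0], 0), ([1.0, 0.0], 1)]
write_split_data("tmp/training_data", pickled_samples(samples), splits=4)
```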

## Reading Data
This is the main purpose of the library. The class `SplitDataLoader` handles
the loading of data.
It supports the following:
1. Getting the length using `len()`
2. Random indexing using `[]`
3. Bin-wise data iteration, with support for shuffling

```python
from splitdataloader import SplitDataLoader

def example_loader() -> None:
    # Point the loader at the directory written earlier
    data_dir = "tmp/training_data"
    loader = SplitDataLoader(data_dir)
    # Supports len()
    print(len(loader))
    # Supports indexing
    data = loader[2]
    # Supports bin-wise iteration, with optional shuffling
    for data in loader.iterate_binwise(shuffle=True):
        do_something(data)  # placeholder for your own processing
```
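
Because the loader supports `len()` and indexing, a globally shuffled epoch can
also be built on top of it. A minimal sketch, assuming packets were pickled as
in the hypothetical writer above:

```python
import pickle
import random

from splitdataloader import SplitDataLoader

def shuffled_epoch(data_dir: str) -> None:
    loader = SplitDataLoader(data_dir)
    indices = list(range(len(loader)))
    random.shuffle(indices)  # global shuffle via random indexing
    for i in indices:
        packet = loader[i]
        features, label = pickle.loads(packet)  # train on (features, label) here
```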

## Multiprocessing Queue Based Iterator
If loading takes too much time, it is usually a good idea to run it in a
separate process. If the code that produces the batches can be written as a
generator function, `splitdataloader.MpQItr` can handle the loading: data is
loaded into an internal queue by a worker process while the main process
consumes it.

```python
from typing import Iterator

from splitdataloader import MpQItr

# A tuple, class, or whatever represents one batch
class BatchType:
    ...

# A generator function that produces the batches
def batch_generator(data_dir: str) -> Iterator[BatchType]:
    ...

def batch_wise_processing(data_dir: str) -> None:
    # Multiprocessing queue based iterator: the generator runs in a
    # separate process and fills an internal queue
    queued_batch_iterator = MpQItr[BatchType](
        batch_generator,  # the generator function
        data_dir,  # args/kwargs forwarded to the generator function
    )
    for batch in queued_batch_iterator:
        do_something_with(batch)  # placeholder for your own processing
```
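
For instance, the bin-wise iterator from the previous section can be wrapped in
a generator and handed to `MpQItr`, so bins are read in a worker process while
the main process consumes batches. This sketch assumes `iterate_binwise` yields
individual packets, bin by bin; the batching helper is illustrative:

```python
from typing import Iterator, List

from splitdataloader import MpQItr, SplitDataLoader

def batches_of_packets(data_dir: str, batch_size: int) -> Iterator[List[bytes]]:
    # Group packets into fixed-size batches as they stream out of the bins
    loader = SplitDataLoader(data_dir)
    batch: List[bytes] = []
    for packet in loader.iterate_binwise(shuffle=True):
        batch.append(packet)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch

def train(data_dir: str) -> None:
    # Batches are prepared in a separate process and pulled from a queue
    for batch in MpQItr[List[bytes]](batches_of_packets, data_dir, 32):
        ...  # consume the batch in the main process
```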

            

Raw data

```json
{
    "_id": null,
    "home_page": "",
    "name": "splitdataloader",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": "",
    "keywords": "deep-learning,datasets",
    "author": "",
    "author_email": "Vinay Krishnan <nk.vinay@zohomail.in>",
    "download_url": "https://files.pythonhosted.org/packages/e2/56/ab33539cbbb9cb1d230c386bc53b3640f7f8589567cdf7a7e956e2f9edee/splitdataloader-0.0.1.tar.gz",
    "platform": null,
    "description": "# split-data-loader\n\nThis package contains simple utility functions related to writing and reading\ndata (typically related to machine learning) using multiple files.\n\nThere are mainly two extremes when dealing with data.\n\n1. All the data in a single file - This is good for sequential access,\n   but can be cumbersome to shuffle data when reading.\n2. Each frame in its own file - This creates too many tiny files\n   and can be difficult to scale.\n\nThis library uses an intermediate approach. The entire dataset is split and\nstored in multiple files (eg: N = 128) called bins. It allows easy shuffling of\ndata and parallel processing when required.\n\nThis library also uses an index file to keep track of the order and location of\neach packet. It allows index based random lookup of all the input packets,\ndistributed among all the bin files.\n\n\n## Writing Data\nUse `write_split_data` to write data to a target directory.\n\n```python\nfrom splitdataloader import write_split_data\n\ndef example_writer(...):\n    # Get the data source\n    data_source: Iterable[bytes] = some_source()\n    target_dir = \"tmp/training_data\"\n    write_split_data(target_dir, data_source, splits=128)\n```\n\n## Reading Data\nThis is the main objective of this library. The class `SplitDataLoader` handles\nthe loading of data.\nIt supports the following:\n1. Getting length using `len()`\n2. Random indexing using `[]`\n3. Data iteration (binwise), with support for shuffling\n\n```python\nfrom splitdataloader import SplitDataLoader\n\ndef example_loader(...):\n    # Get the data source\n    data_dir = \"tmp/training_data\"\n    loader = SplitDataLoader(data_dir)\n    # Supports len()\n    print(len(loader))\n    # Supports indexing\n    data  = loader[2]\n    # Supports iteration\n    for data in loader.iterate_binwise(shuffle=True):\n        do_something(data)\n```\n\n## Multiprocessing Queue Based Iterator\nIf the loading takes too much time, it is probably a good idea to run the\nloading part in a separate process. If it is possible to refactor the entity\nthat produces the batches as a generator, `splitdataloader.MpQItr` can\nbe used to handle loading. Data will be loaded to an internal queue while\nit is being processed in the main process.\n\n```python\nfrom splitdataloader import MpQItr\n\n# a tuple, class, or whatever that handles the batch\nclass BatchType:\n    ...\n\n# a generator function that produces the batches\ndef batch_generator(...) -> Iterator[BatchType]:\n    ...\n\ndef batch_wise_processing(...):\n    # Multi-processing queue based iterator\n    queued_batch_iterator = MpQItr[BatchType](\n        batch_generator,  # the generator function\n        args... # args or kwargs to the generator function\n    )\n    for batch in queued_batch_iterator:\n        do_something_with(batch)\n```\n",
    "bugtrack_url": null,
    "license": "MIT License  Copyright (c) 2023 Vinay  Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:  The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.  THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. ",
    "summary": "A simple library to load training data from hdd or ssd",
    "version": "0.0.1",
    "project_urls": {
        "Homepage": "https://github.com/charstorm/split-data-loader/"
    },
    "split_keywords": [
        "deep-learning",
        "datasets"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "2ee638f0b463aa13bc51715f633835b77a93dbafafe0c3691cfd8986b388e4a3",
                "md5": "74cb654fb2f7ef886b36bc1de62b9a82",
                "sha256": "f8a15eff56dafbff567c44c6946de897f7270eb7186fa6d1368b10926f5b03e9"
            },
            "downloads": -1,
            "filename": "splitdataloader-0.0.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "74cb654fb2f7ef886b36bc1de62b9a82",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 7382,
            "upload_time": "2023-12-09T11:37:11",
            "upload_time_iso_8601": "2023-12-09T11:37:11.176865Z",
            "url": "https://files.pythonhosted.org/packages/2e/e6/38f0b463aa13bc51715f633835b77a93dbafafe0c3691cfd8986b388e4a3/splitdataloader-0.0.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "e256ab33539cbbb9cb1d230c386bc53b3640f7f8589567cdf7a7e956e2f9edee",
                "md5": "46e3ca9540ebe43dd3c95f38152dfe9f",
                "sha256": "ba8980a6c165080d0d55ea5aba2eb888197d95465583f3947c9725c207502546"
            },
            "downloads": -1,
            "filename": "splitdataloader-0.0.1.tar.gz",
            "has_sig": false,
            "md5_digest": "46e3ca9540ebe43dd3c95f38152dfe9f",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 6695,
            "upload_time": "2023-12-09T11:37:13",
            "upload_time_iso_8601": "2023-12-09T11:37:13.308743Z",
            "url": "https://files.pythonhosted.org/packages/e2/56/ab33539cbbb9cb1d230c386bc53b3640f7f8589567cdf7a7e956e2f9edee/splitdataloader-0.0.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-12-09 11:37:13",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "charstorm",
    "github_project": "split-data-loader",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "splitdataloader"
}
```