Name | light-dataloader |
Version | 1.0.7 |
home_page | None |
Summary | Faster DataLoader for datasets that are fully loaded into memory as tensors. |
upload_time | 2024-12-28 15:52:19 |
maintainer | None |
docs_url | None |
author | None |
requires_python | >=3.9 |
license | MIT License Copyright (c) 2024 inikishev Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. |
keywords | dataloader |
VCS | |
bugtrack_url | |
requirements | No requirements were recorded. |
Travis-CI | No Travis. |
coveralls test coverage | No coveralls. |
TensorDataLoader - A faster dataloader for datasets that are fully loaded into memory.
On my laptop, the pytorch DataLoader is 9 times slower at loading CIFAR10 preloaded into memory, with random shuffling, tested across all batch sizes from 1 to 1000.
![image](https://github.com/user-attachments/assets/deabc451-e3fd-4702-81db-c282f9d2695e)
Here is how much time the whole benchmark took for different dataloaders:
```
my laptop:
pytorch DataLoader with pin_memory 146.8673715000623 sec.
pytorch DataLoader 113.20603140027379 sec.
LightDataLoader 112.37881010014098 sec.
TensorDataLoader memory_efficient 21.554916899913223 sec.
TensorLoader 17.700561700039543 sec.
TensorDataLoader 14.947468700091122 sec.
google colab:
pytorch DataLoader 97.84741502100019 sec.
LightDataLoader 97.33544923200111 sec.
pytorch DataLoader with pin_memory 91.82473706000007 sec.
TensorLoader 67.40266070800055 sec.
TensorDataLoader 62.62979004000067 sec.
TensorDataLoader memory_efficient 24.25830095599804 sec.
```
TensorLoader is another library that I just found that does the same thing :D <https://github.com/zhb2000/tensorloader>
I noticed that the pytorch dataloader was slow while benchmarking on mnist1d: even though my dataset was fully loaded into memory, dataloading took most of the training time (mnist1d training is REALLY quick because the dataset is small enough to be preloaded straight to the GPU).
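For reference, a timing loop along the lines below can run a similar comparison. This is only a sketch, not the author's benchmark script; the synthetic CIFAR10-sized tensors, the chosen batch sizes, and the single-pass timing are assumptions.

```py
import time
import torch
from torch.utils.data import DataLoader, TensorDataset
from light_dataloader import TensorDataLoader

# stand-in for a dataset fully preloaded into memory (CIFAR10-sized images and labels)
images = torch.randn(50_000, 3, 32, 32)
labels = torch.randint(0, 10, (50_000,))

def time_loader(loader):
    start = time.perf_counter()
    for _ in loader:  # iterate only; no training step
        pass
    return time.perf_counter() - start

for batch_size in (1, 64, 256, 1000):
    torch_loader = DataLoader(TensorDataset(images, labels), batch_size=batch_size, shuffle=True)
    tensor_loader = TensorDataLoader([images, labels], batch_size=batch_size, shuffle=True)
    print(batch_size, time_loader(torch_loader), time_loader(tensor_loader))
```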
# installation
```
pip install light-dataloader
```
# TensorDataLoader
This dataloader is constructed similarly to `torch.utils.data.TensorDataset`.
Stack all of your samples into one or more tensors that share the same size along the first dimension.
For example:
```py
import torch, torchvision
from torchvision.transforms import v2
loader = v2.Compose([v2.ToImage(), v2.ToDtype(torch.float32, scale=True)])  # converts PIL images to float tensors
cifar = torchvision.datasets.CIFAR10('cifar10', transform=loader, download=True)
stacked_images = torch.stack([image for image, _ in cifar])
stacked_labels = torch.tensor([label for _, label in cifar])
```
If you pass a single tensor, the dataloader will yield tensors. If you pass a sequence of one or more tensors, the dataloader will yield lists of tensors.
```py
# passing a list
from light_dataloader import TensorDataLoader
dataloader = TensorDataLoader([stacked_images, stacked_labels], batch_size=128, shuffle=True)
for images, labels in dataloader:
    ...

# passing a tensor
dataloader = TensorDataLoader(stacked_images, batch_size=128, shuffle=True)
for tensor in dataloader:
    ...
```
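Because TensorDataLoader only slices the pre-stacked tensors, the stacked tensors can presumably live on the GPU as well; the mnist1d remark above suggests the author preloads straight to the GPU. This is an assumption rather than documented behaviour:

```py
# assumption: batches stay on whatever device the stacked tensors live on
device = 'cuda' if torch.cuda.is_available() else 'cpu'
gpu_loader = TensorDataLoader([stacked_images.to(device), stacked_labels.to(device)],
                              batch_size=128, shuffle=True)
for images, labels in gpu_loader:
    ...  # no host-to-device copies needed inside the training loop
```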
# LightDataLoader
LightDataLoader is a very lightweight version of the normal pytorch dataloader: it works the same way and collates the dataset into batches. On a dataset that is fully preloaded into memory it is slightly faster than the normal pytorch dataloader at batch sizes under 64, but it lacks many features. You might consider it when the dataset is just small enough to fit into memory, but too big to run the `torch.stack` operations needed for TensorDataLoader.
```py
import torch, torchvision
from torchvision.transforms import v2
from light_dataloader import LightDataLoader

loader = v2.Compose([v2.ToImage(), v2.ToDtype(torch.float32, scale=True),
                     v2.Normalize((0.4914, 0.4822, 0.4465), (0.247, 0.243, 0.261))])
cifar = torchvision.datasets.CIFAR10('cifar10', transform=loader, download=True)

# usage is the same as torch.utils.data.DataLoader
# and like the pytorch dataloader, it converts everything into tensors and collates the batch
dataloader = LightDataLoader(cifar, batch_size=128, shuffle=True)
for images, labels in dataloader:
    ...
```
# Other
### memory_efficient option
When shuffling at the start of each epoch, TensorDataLoader temporarily uses twice the memory of the tensors that were passed to it. With `memory_efficient=True` it usually becomes slightly slower, but it doesn't use any additional memory. However, as the benchmark above shows, `memory_efficient=True` was actually much faster than `False` on google colab.
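For example, a minimal sketch reusing the stacked CIFAR10 tensors from above and assuming `memory_efficient` is passed to the constructor like the other options:

```py
from light_dataloader import TensorDataLoader

# avoids keeping a temporary second copy of the stacked tensors during shuffling
dataloader = TensorDataLoader([stacked_images, stacked_labels], batch_size=128,
                              shuffle=True, memory_efficient=True)
for images, labels in dataloader:
    ...
```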
### reproducibility
Both TensorDataLoader and LightDataLoader accept a `seed` argument. It is `None` by default; set it to any integer and that integer will be used as the seed for random shuffling, ensuring reproducible results.
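For example (assuming `seed` is passed to the constructor like the other options):

```py
# the same integer seed gives the same shuffling order on every run
dataloader = TensorDataLoader([stacked_images, stacked_labels], batch_size=128,
                              shuffle=True, seed=0)
```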
Raw data
{
"_id": null,
"home_page": null,
"name": "light-dataloader",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": null,
"keywords": "dataloader",
"author": null,
"author_email": "Ivan Nikishev <nkshv2@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/2a/e6/11d9faadde1726568067a4b4b1ed585c8b3638cc6a4fe3e1117a2d17e97a/light_dataloader-1.0.7.tar.gz",
"platform": null,
"description": "TensorDataLoader - A faster dataloader for datasets that are fully loaded into memory.\n\nOn my laptop pytorch dataloader is 9 times slower at dataloading CIFAR10 preloaded into memory, with random shuffling, and tested with all batch sizes from 1 to 1000.\n![image](https://github.com/user-attachments/assets/deabc451-e3fd-4702-81db-c282f9d2695e)\n\nHere is how much time the whole benchmark took for different dataloaders:\n\n```\nmy laptop:\n pytorch DataLoader with pin_memory 146.8673715000623 sec.\n pytorch DataLoader 113.20603140027379 sec.\n LightDataLoader 112.37881010014098 sec.\n TensorDataLoader memory_efficient 21.554916899913223 sec.\n TensorLoader 17.700561700039543 sec.\n TensorDataLoader 14.947468700091122 sec.\n\ngoogle colab:\n pytorch DataLoader 97.84741502100019 sec.\n LightDataLoader 97.33544923200111 sec.\n pytorch DataLoader with pin_memory 91.82473706000007 sec.\n TensorLoader 67.40266070800055 sec.\n TensorDataLoader 62.62979004000067 sec.\n TensorDataLoader memory_efficient 24.25830095599804 sec.\n```\n\nTensorLoader is another library that I just found that does the same thing :D <https://github.com/zhb2000/tensorloader>\n\nI found that pytorch dataloader is slow when benchmarking stuff on mnist1d, and despite my dataset being fully loaded into memory, dataloading took most of the training time (mnist1d training is REALLY quick because it is small enough to be preloaded straight to GPU).\n\n# installation\n\n```\npip install light-dataloader\n```\n\n# TensorDataLoader\n\nThis dataloader is created similarly to torch.utils.data.TensorDataset.\n\nStack all of your samples into one or multiple tensors that have the same size of the first dimension.\n\nFor example:\n\n```py\ncifar = torchvision.datasets.CIFAR10('cifar10', transform = loader, download=True)\nstacked_images = torch.stack([i[0] for i in cifar])\nstacked_labels = torch.tensor([i[1] for i in cifar])\n```\n\nIf you pass a single tensor, the dataloader will yield tensors. If you pass a sequence of one or more tensors, the dataloader will yield lists of tensors.\n\n```py\n# passing a list\nfrom light_dataloader import TensorDataLoader\ndataloader = TensorDataLoader([stacked_images, stacked_labels], batch_size = 128, shuffle = True)\nfor images, labels in dataloader:\n ...\n\n# passing a tensor\ndataloader = TensorDataLoader(stacked_images, batch_size = 128, shuffle = True)\nfor tensor in dataloader:\n ...\n```\n\n# LightDataLoader\n\nLightDataLoader is a very lightweight version of normal pytorch dataloader, it functions in the same way and collates the dataset. On a dataset that is fully preloaded into memory, compared to normal pytorch dataloader it is slightly faster with batch size under 64, but lacks many features. 
The reason you might consider this is when the dataset is just big enough to fit into memory, but too big to run `torch.stack` operations to use TensorDataLoader.\n\n```py\nfrom light_dataloader import LightDataLoader\n\nloader = v2.Compose([v2.ToImage(), v2.ToDtype(torch.float32), v2.Normalize(0.4914, 0.4822, 0.4465), (0.247, 0.243, 0.261)])\ncifar = torchvision.datasets.CIFAR10('cifar10', transform = loader, download=True)\n\n# usage is the same as torch.utils.data.DataLoader\n# and like pytorch dataloader, it converts everything into tensors and collates the batch\ndataloader = LightDataLoader(cifar, batch_size = 128, shuffle = True)\nfor images, labels in dataloader:\n ...\n```\n\n# Other\n\n### memory_efficient option\n\nDuring shuffling at the start of each epoch, TensorDataLoader has to use 2 times the memory of whatever tensors were passed to it. With `memory_efficient=True` it usually becomes slightly slower, but doesn't use any additional memory. However as I found out when benchmarking, `memory_efficient=True` is actually much faster then False when on google colab.\n\n### reproducibility\n\nBoth TensorDataLoader and LightDataLoader accept `seed` argument. It is None by default, but if you set it to any integer, that integer will be used as seed for random shuffling, ensuring reproducible results.\n",
"bugtrack_url": null,
"license": "MIT License Copyright (c) 2024 inikishev Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. ",
"summary": "Faster DataLoader for datasets that are fully loaded into memory as tensors.",
"version": "1.0.7",
"project_urls": {
"Homepage": "https://github.com/inikishev/light-dataloader",
"Issues": "https://github.com/inikishev/light-dataloader/isses",
"Repository": "https://github.com/inikishev/light-dataloader"
},
"split_keywords": [
"dataloader"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "e0f84f034aad7af83c7b9785eee5b1c87f3eae2f1d05b18e15c63927e50a54be",
"md5": "1d69721217c7612132b6da088814723f",
"sha256": "492b184fd6dd279d4b8448da1726da12a30bff9b94f728ccea59df8f1c1e0e12"
},
"downloads": -1,
"filename": "light_dataloader-1.0.7-py3-none-any.whl",
"has_sig": false,
"md5_digest": "1d69721217c7612132b6da088814723f",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9",
"size": 6534,
"upload_time": "2024-12-28T15:52:17",
"upload_time_iso_8601": "2024-12-28T15:52:17.667204Z",
"url": "https://files.pythonhosted.org/packages/e0/f8/4f034aad7af83c7b9785eee5b1c87f3eae2f1d05b18e15c63927e50a54be/light_dataloader-1.0.7-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "2ae611d9faadde1726568067a4b4b1ed585c8b3638cc6a4fe3e1117a2d17e97a",
"md5": "223e8999570115855c1e00956988a550",
"sha256": "be1b55fba9e2eb7f524a24c045b8c2bf3532fb5a0810430e60edf0a78705ab4a"
},
"downloads": -1,
"filename": "light_dataloader-1.0.7.tar.gz",
"has_sig": false,
"md5_digest": "223e8999570115855c1e00956988a550",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9",
"size": 5630,
"upload_time": "2024-12-28T15:52:19",
"upload_time_iso_8601": "2024-12-28T15:52:19.977539Z",
"url": "https://files.pythonhosted.org/packages/2a/e6/11d9faadde1726568067a4b4b1ed585c8b3638cc6a4fe3e1117a2d17e97a/light_dataloader-1.0.7.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-12-28 15:52:19",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "inikishev",
"github_project": "light-dataloader",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "light-dataloader"
}