torchdatasets-nightly

Name	torchdatasets-nightly
Version	1711929801
home_page	https://github.com/szymonmaszke/torchdatasets
Summary	PyTorch based library focused on data processing and input pipelines in general.
upload_time	2024-04-01 00:03:38
maintainer	None
docs_url	None
author	Szymon Maszke
requires_python	>=3.6
license	MIT
keywords	pytorch torch data datasets map cache memory disk apply database
requirements	No requirements were recorded.
Travis-CI	No Travis.

## Package renamed to torchdatasets!

<img align="left" width="256" height="256" src="https://github.com/szymonmaszke/torchdatasets/blob/master/assets/logos/medium.png">

* Use `map`, `apply`, `reduce` or `filter` directly on `Dataset` objects
* `cache` data in RAM/disk or via your own method (partial caching supported)
* Full support for PyTorch's [`Dataset`](https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset) and [`IterableDataset`](https://pytorch.org/docs/stable/data.html#torch.utils.data.IterableDataset)
* General `torchdatasets.maps` like `Flatten` or `Select`
* Extensible interface (your own cache methods, cache modifiers, maps etc.)
* Useful `torchdatasets.datasets` classes designed for general tasks (e.g. file reading)
* Support for `torchvision` datasets (e.g. `ImageFolder`, `MNIST`, `CIFAR10`) via `td.datasets.WrapDataset`
* Minimal overhead (single call to `super().__init__()`)
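
The `map` and `cache` bullets above can be pictured with a small pure-Python sketch. `TinyDataset` is a hypothetical class written only for illustration; it is not how `torchdatasets` is implemented and it does not use the library's API. As a simplification, it caches the fully mapped sample regardless of where `cache()` appears in the chain.

```python
from typing import Any, Callable, Dict, List, Sequence


class TinyDataset:
    """Hypothetical stand-in for a mappable, cacheable dataset (illustration only)."""

    def __init__(self, items: Sequence[Any]):
        self._items = items
        self._maps: List[Callable[[Any], Any]] = []
        self._cache: Dict[int, Any] = {}
        self._use_cache = False

    def map(self, function: Callable[[Any], Any]) -> "TinyDataset":
        # Record the function; it runs lazily when a sample is requested.
        self._maps.append(function)
        return self

    def cache(self) -> "TinyDataset":
        # Remember every sample after its first computation.
        self._use_cache = True
        return self

    def __len__(self) -> int:
        return len(self._items)

    def __getitem__(self, index: int) -> Any:
        if self._use_cache and index in self._cache:
            return self._cache[index]
        sample = self._items[index]
        for function in self._maps:
            sample = function(sample)
        if self._use_cache:
            self._cache[index] = sample
        return sample


calls = []


def square(x):
    calls.append(x)  # track how often the mapped function actually runs
    return x * x


dataset = TinyDataset([1, 2, 3]).map(square).cache()
first_pass = [dataset[i] for i in range(len(dataset))]
second_pass = [dataset[i] for i in range(len(dataset))]  # served from the cache
```

After both passes, `calls` holds each input only once, because the second pass never re-runs `square`.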

| Version | Docs | Tests | Coverage | Style | PyPI | Python | PyTorch | Docker | Roadmap |
|---------|------|-------|----------|-------|------|--------|---------|--------|---------|
| [![Version](https://img.shields.io/static/v1?label=&message=0.2.0&color=377EF0&style=for-the-badge)](https://github.com/szymonmaszke/torchdatasets/releases) | [![Documentation](https://img.shields.io/static/v1?label=&message=docs&color=EE4C2C&style=for-the-badge)](https://szymonmaszke.github.io/torchdatasets/)  | ![Tests](https://github.com/szymonmaszke/torchdatasets/workflows/test/badge.svg) | ![Coverage](https://img.shields.io/codecov/c/github/szymonmaszke/torchdatasets?label=%20&logo=codecov&style=for-the-badge) | [![codebeat](https://img.shields.io/static/v1?label=&message=CB&color=27A8E0&style=for-the-badge)](https://codebeat.co/projects/github-com-szymonmaszke-torchdatasets-master) | [![PyPI](https://img.shields.io/static/v1?label=&message=PyPI&color=377EF0&style=for-the-badge)](https://pypi.org/project/torchdatasets/) | [![Python](https://img.shields.io/static/v1?label=&message=3.6&color=377EF0&style=for-the-badge&logo=python&logoColor=F8C63D)](https://www.python.org/) | [![PyTorch](https://img.shields.io/static/v1?label=&message=>=1.2.0&color=EE4C2C&style=for-the-badge)](https://pytorch.org/) | [![Docker](https://img.shields.io/static/v1?label=&message=docker&color=309cef&style=for-the-badge)](https://hub.docker.com/r/szymonmaszke/torchdatasets) | [![Roadmap](https://img.shields.io/static/v1?label=&message=roadmap&color=009688&style=for-the-badge)](https://github.com/szymonmaszke/torchdatasets/blob/master/ROADMAP.md) |

# :bulb: Examples

__Check documentation here:__
[https://szymonmaszke.github.io/torchdatasets](https://szymonmaszke.github.io/torchdatasets)

## General example

- Create an image dataset, convert it to tensors, cache it, and concatenate it with smoothed labels:

```python
import pathlib

import torchdatasets as td
import torchvision
from PIL import Image

class Images(td.Dataset): # Different inheritance
    def __init__(self, path: str):
        super().__init__() # This is the only change
        self.files = list(pathlib.Path(path).glob("*"))

    def __getitem__(self, index):
        return Image.open(self.files[index])

    def __len__(self):
        return len(self.files)


images = Images("./data").map(torchvision.transforms.ToTensor()).cache()
```

You can concatenate the `images` dataset above with another one (say `labels`) and iterate over both as usual:

```python
for data, label in images | labels:
    # Do whatever you want with your data
    pass
```
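
How an expression like `images | labels` can yield `(data, label)` pairs is easy to picture with Python's `__or__` operator hook. The sketch below is only an illustration under assumptions: the class `Paired` is hypothetical, and stopping at the shortest source (like `zip`) is a choice made here, not a statement about the library's behavior.

```python
from typing import Any, Sequence, Tuple


class Paired:
    """Hypothetical dataset wrapper whose `|` operator pairs samples index by index."""

    def __init__(self, *sources: Sequence[Any]):
        self.sources = sources

    def __or__(self, other: Sequence[Any]) -> "Paired":
        # `a | b` builds a new dataset that yields (a[i], b[i]) tuples.
        return Paired(*self.sources, other)

    def __len__(self) -> int:
        # Stop at the shortest source, like zip().
        return min(len(source) for source in self.sources)

    def __getitem__(self, index: int) -> Tuple[Any, ...]:
        # Raising IndexError lets a plain `for` loop know where to stop.
        if not 0 <= index < len(self):
            raise IndexError(index)
        return tuple(source[index] for source in self.sources)


images = Paired(["img0", "img1", "img2"])
labels = [0, 1, 0]

pairs = [(data, label) for data, label in images | labels]
```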

- Cache the first `1000` samples in memory and save the rest on disk in the `./cache` folder:

```python
images = (
    Images("./data").map(torchvision.transforms.ToTensor())
    # First 1000 samples in memory
    .cache(td.modifiers.UpToIndex(1000, td.cachers.Memory()))
    # Samples from 1000 to the end saved with Pickle on disk
    .cache(td.modifiers.FromIndex(1000, td.cachers.Pickle("./cache")))
    # You can define your own cachers, modifiers, see docs
)
```
To see what else you can do, please check the [**torchdatasets documentation**](https://szymonmaszke.github.io/torchdatasets/).
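
The partial caching example above routes samples to different storage depending on their index. A rough pure-Python sketch of that idea follows. `MemoryCacher`, `PickleCacher` and `cached_fetch` are hypothetical names invented for this illustration, and the assumption that indices below the boundary go to memory follows the "first 1000 samples" comment rather than the library's source.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Callable, Dict


class MemoryCacher:
    """Hypothetical in-memory cacher: a dict keyed by sample index."""

    def __init__(self):
        self.store: Dict[int, Any] = {}

    def __contains__(self, index: int) -> bool:
        return index in self.store

    def __getitem__(self, index: int) -> Any:
        return self.store[index]

    def __setitem__(self, index: int, sample: Any) -> None:
        self.store[index] = sample


class PickleCacher:
    """Hypothetical on-disk cacher: one pickle file per sample index."""

    def __init__(self, folder: str):
        self.folder = Path(folder)
        self.folder.mkdir(parents=True, exist_ok=True)

    def _path(self, index: int) -> Path:
        return self.folder / f"{index}.pkl"

    def __contains__(self, index: int) -> bool:
        return self._path(index).exists()

    def __getitem__(self, index: int) -> Any:
        with self._path(index).open("rb") as handle:
            return pickle.load(handle)

    def __setitem__(self, index: int, sample: Any) -> None:
        with self._path(index).open("wb") as handle:
            pickle.dump(sample, handle)


def cached_fetch(index: int, compute: Callable[[int], Any], memory, disk, boundary: int) -> Any:
    # Indices below `boundary` go to memory, the rest to disk.
    cacher = memory if index < boundary else disk
    if index not in cacher:
        cacher[index] = compute(index)
    return cacher[index]


with tempfile.TemporaryDirectory() as folder:
    memory, disk = MemoryCacher(), PickleCacher(folder)
    values = [cached_fetch(i, lambda i: i * 10, memory, disk, boundary=2) for i in range(4)]
    in_memory = sorted(memory.store)
    on_disk = sorted(int(path.stem) for path in Path(folder).glob("*.pkl"))
```

Both cachers expose the same three methods, so the routing function does not need to know where a sample ends up.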

## Integration with `torchvision`

With `torchdatasets` you can split `torchvision` datasets and apply augmentation
only to the training split:

```python
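import torch
import torchvision

import torchdatasets as td

# Wrap torchvision dataset with WrapDataset
dataset = td.datasets.WrapDataset(torchvision.datasets.ImageFolder("./images"))

# Split dataset; the test split takes the remainder so the lengths sum to len(dataset)
train_length = int(0.6 * len(dataset))
validation_length = int(0.2 * len(dataset))
test_length = len(dataset) - train_length - validation_length
train_dataset, validation_dataset, test_dataset = torch.utils.data.random_split(
    dataset,
    (train_length, validation_length, test_length),
)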

# Apply torchvision mappings ONLY to train dataset
train_dataset.map(
    td.maps.To(
        torchvision.transforms.Compose(
            [
                torchvision.transforms.RandomResizedCrop(224),
                torchvision.transforms.RandomHorizontalFlip(),
                torchvision.transforms.ToTensor(),
                torchvision.transforms.Normalize(
                    mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]
                ),
            ]
        )
    ),
    # Apply the transformation to element 0 of each sample (the image);
    # element 1 is the label
    0,
)
```

Note that `td.datasets.WrapDataset` works with any existing `torch.utils.data.Dataset`
instance, giving it additional caching and mapping capabilities!
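
The per-element mapping used in the example above (transform element `0`, leave the label untouched) can be sketched in plain Python. `apply_to` is a hypothetical helper written for this illustration, not the implementation of `td.maps.To`, and `str.upper` merely stands in for an image transform.

```python
from typing import Any, Callable, Tuple


def apply_to(
    function: Callable[[Any], Any], *indices: int
) -> Callable[[Tuple[Any, ...]], Tuple[Any, ...]]:
    """Hypothetical helper: apply `function` only to the chosen positions of a sample."""

    def mapper(sample: Tuple[Any, ...]) -> Tuple[Any, ...]:
        return tuple(
            function(element) if position in indices else element
            for position, element in enumerate(sample)
        )

    return mapper


augment_image_only = apply_to(str.upper, 0)  # str.upper stands in for an image transform
sample = ("image-bytes", 3)  # (image, label) pair, the shape ImageFolder yields
result = augment_image_only(sample)
```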

# :wrench: Installation

## :snake: [pip](<https://pypi.org/project/torchdatasets/>)

### Latest release:

```shell
pip install --user torchdatasets
```

### Nightly:

```shell
pip install --user torchdatasets-nightly
```

## :whale2: [Docker](https://hub.docker.com/r/szymonmaszke/torchdatasets)

__CPU standalone__ and various versions of __GPU enabled__ images are available
at [dockerhub](https://hub.docker.com/r/szymonmaszke/torchdatasets/tags).

For CPU quickstart, issue:

```shell
docker pull szymonmaszke/torchdatasets:18.04
```

Nightly builds are also available; just prefix the tag with `nightly_`. If you are going for a `GPU` image, make sure you have
[nvidia/docker](https://github.com/NVIDIA/nvidia-docker) installed and its runtime set.

# :question: Contributing

If you find an issue, or think some functionality would be useful to others and fits this library, please [open a new Issue](https://help.github.com/en/articles/creating-an-issue) or [create a Pull Request](https://help.github.com/en/articles/creating-a-pull-request-from-a-fork).

To get an overview of things one can do to help this project, see the [Roadmap](https://github.com/szymonmaszke/torchdatasets/blob/master/ROADMAP.md).

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/szymonmaszke/torchdatasets",
    "name": "torchdatasets-nightly",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.6",
    "maintainer_email": null,
    "keywords": "pytorch torch data datasets map cache memory disk apply database",
    "author": "Szymon Maszke",
    "author_email": "szymon.maszke@protonmail.com",
    "download_url": "https://files.pythonhosted.org/packages/ef/c5/3ee6e924ca65bdc9fba87c8e0eb28ca5f1befa72d19a589cf6091f86b476/torchdatasets-nightly-1711929801.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "PyTorch based library focused on data processing and input pipelines in general.",
    "version": "1711929801",
    "project_urls": {
        "Documentation": "https://szymonmaszke.github.io/torchdatasets/#torchdatasets",
        "Homepage": "https://github.com/szymonmaszke/torchdatasets",
        "Issues": "https://github.com/szymonmaszke/torchdatasets/issues?q=is%3Aissue+is%3Aopen+sort%3Aupdated-desc",
        "Website": "https://szymonmaszke.github.io/torchdatasets"
    },
    "split_keywords": [
        "pytorch",
        "torch",
        "data",
        "datasets",
        "map",
        "cache",
        "memory",
        "disk",
        "apply",
        "database"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "d1cc2c2794ff82b7a4a407f4d49dd07546827873cc3bea70c8ec4e337e9cca9e",
                "md5": "fabeed6d5eba9e63b4283175a57b3aa4",
                "sha256": "bd0f3b5a7a3790f671dc7add38d8ba9b4396eda68cbd9a1ef0a39ba48f248028"
            },
            "downloads": -1,
            "filename": "torchdatasets_nightly-1711929801-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "fabeed6d5eba9e63b4283175a57b3aa4",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.6",
            "size": 29672,
            "upload_time": "2024-04-01T00:03:31",
            "upload_time_iso_8601": "2024-04-01T00:03:31.075478Z",
            "url": "https://files.pythonhosted.org/packages/d1/cc/2c2794ff82b7a4a407f4d49dd07546827873cc3bea70c8ec4e337e9cca9e/torchdatasets_nightly-1711929801-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "efc53ee6e924ca65bdc9fba87c8e0eb28ca5f1befa72d19a589cf6091f86b476",
                "md5": "7b0ec21e8773a98c0fdb37f903a0aace",
                "sha256": "f97d1931c2b9e96d324d75f85688d745fb8ba5c6d535cb243ac18124d96a86d6"
            },
            "downloads": -1,
            "filename": "torchdatasets-nightly-1711929801.tar.gz",
            "has_sig": false,
            "md5_digest": "7b0ec21e8773a98c0fdb37f903a0aace",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.6",
            "size": 24957,
            "upload_time": "2024-04-01T00:03:38",
            "upload_time_iso_8601": "2024-04-01T00:03:38.685452Z",
            "url": "https://files.pythonhosted.org/packages/ef/c5/3ee6e924ca65bdc9fba87c8e0eb28ca5f1befa72d19a589cf6091f86b476/torchdatasets-nightly-1711929801.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-04-01 00:03:38",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "szymonmaszke",
    "github_project": "torchdatasets",
    "travis_ci": false,
    "coveralls": true,
    "github_actions": true,
    "tox": true,
    "lcname": "torchdatasets-nightly"
}
