==================
Pytorch Datastream
==================

.. image:: https://badge.fury.io/py/pytorch-datastream.svg
    :target: https://badge.fury.io/py/pytorch-datastream

.. image:: https://img.shields.io/pypi/pyversions/pytorch-datastream.svg
    :target: https://pypi.python.org/pypi/pytorch-datastream

.. image:: https://readthedocs.org/projects/pytorch-datastream/badge/?version=latest
    :target: https://pytorch-datastream.readthedocs.io/en/latest/?badge=latest

.. image:: https://img.shields.io/pypi/l/pytorch-datastream.svg
    :target: https://pypi.python.org/pypi/pytorch-datastream

This is a simple library for creating readable dataset pipelines and
reusing best practices for issues such as imbalanced datasets. There are
just two components to keep track of: ``Dataset`` and ``Datastream``.

``Dataset`` is a simple mapping between an index and an example. It provides
pipelining of functions in a readable syntax originally adapted from
tensorflow 2's ``tf.data.Dataset``.

``Datastream`` combines a ``Dataset`` and a sampler into a stream of examples.
It provides a simple solution to oversampling / stratification, weighted
sampling, and finally converting to a ``torch.utils.data.DataLoader``.
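
To give a feel for the two components, here is a minimal sketch going from a
plain list of examples to a ``torch.utils.data.DataLoader``. The example data
and batch size are arbitrary, and wrapping the ``Dataset`` with the plain
``Datastream`` constructor is assumed here; see the documentation for the
exact API.

.. code-block:: python

    from datastream import Dataset, Datastream

    # A Dataset is an indexed mapping, so any subscriptable works as a source.
    dataset = (
        Dataset.from_subscriptable([1, 2, 3, 4])
        .map(lambda number: number * 2)
    )
    assert dataset[0] == 2

    # Wrap the Dataset in a Datastream to sample it and get batches.
    data_loader = Datastream(dataset).data_loader(batch_size=2)
    batch = next(iter(data_loader))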

Install
=======

.. code-block::

    poetry add pytorch-datastream

Or, for the old-timers:

.. code-block::

    pip install pytorch-datastream

Usage
=====

The list below showcases functions that are useful in most standard and
non-standard cases. It is not exhaustive; see the
`documentation <https://pytorch-datastream.readthedocs.io/en/latest/>`_ for
a more complete description of the API and its usage. A brief sketch using
``Dataset.from_dataframe`` follows the list.

.. code-block:: python

    Dataset.from_subscriptable
    Dataset.from_dataframe
    Dataset
        .map
        .subset
        .split
        .cache
        .with_columns

    Datastream.merge
    Datastream.zip
    Datastream
        .map
        .data_loader
        .zip_index
        .update_weights_
        .update_example_weight_
        .weight
        .state_dict
        .load_state_dict
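
For instance, ``Dataset.from_dataframe`` builds a dataset from a pandas
DataFrame, where ``map`` receives each row. The sketch below is illustrative
only: the ``image_path`` and ``label`` columns are made-up placeholders.

.. code-block:: python

    import pandas as pd
    from datastream import Dataset

    df = pd.DataFrame(dict(
        image_path=["images/a.jpg", "images/b.jpg"],
        label=["cat", "dog"],
    ))

    # Each example starts out as a DataFrame row; map it into the structure
    # your training loop expects.
    dataset = (
        Dataset.from_dataframe(df)
        .map(lambda row: dict(
            image_path=row["image_path"],
            label=row["label"],
        ))
    )

    print(len(dataset), dataset[0]["label"])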

Simple image dataset example
----------------------------

Here's a basic example of loading images from a directory:

.. code-block:: python

    from datastream import Dataset
    from pathlib import Path
    from PIL import Image

    # Assuming images are in a directory structure like:
    # images/
    #   class1/
    #     image1.jpg
    #     image2.jpg
    #   class2/
    #     image3.jpg
    #     image4.jpg

    image_dir = Path("images")
    image_paths = list(image_dir.glob("**/*.jpg"))

    dataset = (
        Dataset.from_paths(
            image_paths,
            pattern=r".*/(?P<class_name>\w+)/(?P<image_name>\w+)\.jpg",
        )
        .map(lambda row: dict(
            image=Image.open(row["path"]),
            class_name=row["class_name"],
            image_name=row["image_name"],
        ))
    )

    # Access an item from the dataset
    first_item = dataset[0]
    print(f"Class: {first_item['class_name']}, Image name: {first_item['image_name']}")

Merge / stratify / oversample datastreams
-----------------------------------------

The fruit datastreams given below repeatedly yield the string of their fruit
type. Merging with weights oversamples the apple datastream relative to the
others.

.. code-block:: python

    >>> datastream = Datastream.merge([
    ...     (apple_datastream, 2),
    ...     (pear_datastream, 1),
    ...     (banana_datastream, 1),
    ... ])
    >>> next(iter(datastream.data_loader(batch_size=8)))
    ['apple', 'apple', 'pear', 'banana', 'apple', 'apple', 'pear', 'banana']
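
The fruit datastreams above are left undefined. One way they might be
constructed, assuming each one wraps a tiny single-example ``Dataset`` via the
plain ``Datastream`` constructor, is:

.. code-block:: python

    from datastream import Dataset, Datastream

    apple_datastream = Datastream(Dataset.from_subscriptable(["apple"]))
    pear_datastream = Datastream(Dataset.from_subscriptable(["pear"]))
    banana_datastream = Datastream(Dataset.from_subscriptable(["banana"]))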

Zip independently sampled datastreams
-------------------------------------

The fruit datastreams given below repeatedly yield the string of their fruit
type. Zipping samples each datastream independently and returns tuples of
examples.

.. code-block:: python

    >>> datastream = Datastream.zip([
    ...     apple_datastream,
    ...     Datastream.merge([pear_datastream, banana_datastream]),
    ... ])
    >>> next(iter(datastream.data_loader(batch_size=4)))
    [('apple', 'pear'), ('apple', 'banana'), ('apple', 'pear'), ('apple', 'banana')]

More usage examples
-------------------

See the `documentation <https://pytorch-datastream.readthedocs.io/en/latest/>`_
for more usage examples.