==================
Pytorch Datastream
==================

.. image:: https://badge.fury.io/py/pytorch-datastream.svg
    :target: https://badge.fury.io/py/pytorch-datastream

.. image:: https://img.shields.io/pypi/pyversions/pytorch-datastream.svg
    :target: https://pypi.python.org/pypi/pytorch-datastream

.. image:: https://readthedocs.org/projects/pytorch-datastream/badge/?version=latest
    :target: https://pytorch-datastream.readthedocs.io/en/latest/?badge=latest

.. image:: https://img.shields.io/pypi/l/pytorch-datastream.svg
    :target: https://pypi.python.org/pypi/pytorch-datastream

This is a simple library for creating readable dataset pipelines and
reusing best practices for issues such as imbalanced datasets. There are
just two components to keep track of: ``Dataset`` and ``Datastream``.

``Dataset`` is a simple mapping between an index and an example. It provides
pipelining of functions in a readable syntax originally adapted from
tensorflow 2's ``tf.data.Dataset``.

``Datastream`` combines a ``Dataset`` and a sampler into a stream of examples.
It provides a simple solution to oversampling / stratification, weighted
sampling, and finally converting to a ``torch.utils.data.DataLoader``.
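
To give a feel for the two components, here is a minimal sketch going from a
plain list of examples to a ``torch.utils.data.DataLoader``. The example data
and batch size are arbitrary, and wrapping the ``Dataset`` with the plain
``Datastream`` constructor is assumed here; see the documentation for the
exact API.

.. code-block:: python

    from datastream import Dataset, Datastream

    # A Dataset is an indexed mapping, so any subscriptable works as a source.
    dataset = (
        Dataset.from_subscriptable([1, 2, 3, 4])
        .map(lambda number: number * 2)
    )
    assert dataset[0] == 2

    # Wrap the Dataset in a Datastream to sample it and get batches.
    data_loader = Datastream(dataset).data_loader(batch_size=2)
    batch = next(iter(data_loader))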

Install
=======

.. code-block::

    poetry add pytorch-datastream

Or, for the old-timers:

.. code-block::

    pip install pytorch-datastream

Usage
=====

The list below showcases functions that are useful in most standard and
non-standard cases. It is not exhaustive; see the
`documentation <https://pytorch-datastream.readthedocs.io/en/latest/>`_ for
a more complete description of the API and its usage. A brief sketch using
``Dataset.from_dataframe`` follows the list.

.. code-block:: python

    Dataset.from_subscriptable
    Dataset.from_dataframe
    Dataset
        .map
        .subset
        .split
        .cache
        .with_columns

    Datastream.merge
    Datastream.zip
    Datastream
        .map
        .data_loader
        .zip_index
        .update_weights_
        .update_example_weight_
        .weight
        .state_dict
        .load_state_dict
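
For instance, ``Dataset.from_dataframe`` builds a dataset from a pandas
DataFrame, where ``map`` receives each row. The sketch below is illustrative
only: the ``image_path`` and ``label`` columns are made-up placeholders.

.. code-block:: python

    import pandas as pd
    from datastream import Dataset

    df = pd.DataFrame(dict(
        image_path=["images/a.jpg", "images/b.jpg"],
        label=["cat", "dog"],
    ))

    # Each example starts out as a DataFrame row; map it into the structure
    # your training loop expects.
    dataset = (
        Dataset.from_dataframe(df)
        .map(lambda row: dict(
            image_path=row["image_path"],
            label=row["label"],
        ))
    )

    print(len(dataset), dataset[0]["label"])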

Simple image dataset example
----------------------------

Here's a basic example of loading images from a directory:

.. code-block:: python

    from datastream import Dataset
    from pathlib import Path
    from PIL import Image

    # Assuming images are in a directory structure like:
    # images/
    #   class1/
    #     image1.jpg
    #     image2.jpg
    #   class2/
    #     image3.jpg
    #     image4.jpg

    image_dir = Path("images")
    image_paths = list(image_dir.glob("**/*.jpg"))

    dataset = (
        Dataset.from_paths(
            image_paths,
            pattern=r".*/(?P<class_name>\w+)/(?P<image_name>\w+)\.jpg",
        )
        .map(lambda row: dict(
            image=Image.open(row["path"]),
            class_name=row["class_name"],
            image_name=row["image_name"],
        ))
    )

    # Access an item from the dataset
    first_item = dataset[0]
    print(f"Class: {first_item['class_name']}, Image name: {first_item['image_name']}")

Merge / stratify / oversample datastreams
-----------------------------------------

The fruit datastreams given below repeatedly yield the string of their fruit
type. Merging with weights oversamples the apple datastream relative to the
others.

.. code-block:: python

    >>> datastream = Datastream.merge([
    ...     (apple_datastream, 2),
    ...     (pear_datastream, 1),
    ...     (banana_datastream, 1),
    ... ])
    >>> next(iter(datastream.data_loader(batch_size=8)))
    ['apple', 'apple', 'pear', 'banana', 'apple', 'apple', 'pear', 'banana']
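
The fruit datastreams above are left undefined. One way they might be
constructed, assuming each one wraps a tiny single-example ``Dataset`` via the
plain ``Datastream`` constructor, is:

.. code-block:: python

    from datastream import Dataset, Datastream

    apple_datastream = Datastream(Dataset.from_subscriptable(["apple"]))
    pear_datastream = Datastream(Dataset.from_subscriptable(["pear"]))
    banana_datastream = Datastream(Dataset.from_subscriptable(["banana"]))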

Zip independently sampled datastreams
-------------------------------------

The fruit datastreams given below repeatedly yield the string of their fruit
type. Zipping samples each datastream independently and returns tuples of
examples.

.. code-block:: python

    >>> datastream = Datastream.zip([
    ...     apple_datastream,
    ...     Datastream.merge([pear_datastream, banana_datastream]),
    ... ])
    >>> next(iter(datastream.data_loader(batch_size=4)))
    [('apple', 'pear'), ('apple', 'banana'), ('apple', 'pear'), ('apple', 'banana')]

More usage examples
-------------------

See the `documentation <https://pytorch-datastream.readthedocs.io/en/latest/>`_
for more usage examples.