# 🍋 `pipelime`
[![Documentation Status](https://readthedocs.org/projects/pipelime-python/badge/?version=latest)](https://pipelime-python.readthedocs.io/en/latest/?badge=latest)
[![PyPI version](https://badge.fury.io/py/pipelime-python.svg)](https://badge.fury.io/py/pipelime-python)
<img src="docs/_static/pipelime_banner.png?raw=true" width="100%"/>
*If life gives you lemons, use `pipelime`.*
Welcome to **pipelime**, a Swiss Army knife for data processing!
`pipelime` is a full-fledged **framework** for **data science**: read your datasets,
manipulate them, and write them back to disk or upload them to a remote data lake.
Then build up your **dataflow** with Piper and manage the configuration with Choixe.
Finally, **embed** your custom commands into the `pipelime` workspace, so they act both as dataflow nodes and as an advanced command line interface.
Maybe too much for you? No worries: `pipelime` is **modular**, so you can take just what you need:
- **data processing scripts**: use the powerful `SamplesSequence` to create your own data processing pipelines with a simple and intuitive API. Parallelization works out of the box, and you can easily serialize your pipelines to yaml/json. Integrations with popular frameworks, e.g., [pytorch](https://pytorch.org/), are also provided.
- **easy dataflow**: `Piper` can manage and execute directed acyclic graphs (DAGs), giving back feedback on the progress through sockets or custom callbacks.
- **configuration management**: `Choixe` is a simple and intuitive mini scripting language designed to ease the creation of configuration files with the help of variables, symbol importing, for loops, switch statements, parameter sweeps and more.
- **command line interface**: `pipelime` removes all the boilerplate code needed to create a beautiful CLI for your scripts and packages. You focus on *what matters* and we provide input parsing, advanced interfaces for complex arguments, automatic help generation, and configuration management. Also, any `PipelimeCommand` can be used as a node in a dataflow for free!
- **pydantic tools**: most of the classes in `pipelime` derive from [`pydantic.BaseModel`](https://docs.pydantic.dev/), so we have built some useful tools to, e.g., inspect their structure, auto-generate human-friendly documentation and more (including a TUI to help you write input data to [deserialize](https://docs.pydantic.dev/usage/models/#helper-functions) any pydantic model); see the sketch below.
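
Since `pipelime` objects are plain pydantic models, they can be built from dictionaries or parsed configuration files with pydantic's standard helper functions. A minimal sketch with a hypothetical user-defined model (not a `pipelime` class), assuming the pydantic v1 API (`parse_obj`):

```python
from pydantic import BaseModel


class TrainConfig(BaseModel):
    # hypothetical configuration model, for illustration only
    learning_rate: float = 1e-3
    epochs: int = 10


# deserialize from plain data, e.g., loaded from a yaml/json file
cfg = TrainConfig.parse_obj({"learning_rate": 0.01, "epochs": 5})
print(cfg.epochs)  # 5
```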
---
## Installation
Install `pipelime` using pip:
```
pip install pipelime-python
```
To be able to *draw* the dataflow graphs, you need the `draw` variant:
```
pip install pipelime-python[draw]
```
> **Warning**
>
> The `draw` variant needs `Graphviz` (<https://www.graphviz.org/>) installed on your system.
> On Ubuntu/Debian, you can install it with:
>
> ```
> sudo apt-get install graphviz graphviz-dev
> ```
>
> Alternatively, you can use `conda`:
>
> ```
> conda install --channel conda-forge pygraphviz
> ```
>
> Please see the full installation options at https://github.com/pygraphviz/pygraphviz/blob/main/INSTALL.txt
## Basic Usage
### Underfolder Format
The **Underfolder** format is the preferred `pipelime` dataset format, i.e., a flexible way to
model and store a generic dataset on the **filesystem**.
![](https://github.com/eyecan-ai/pipelime-python/blob/main/docs/images/underfolder.png?raw=true)
An Underfolder **dataset** is a collection of samples. A **sample** is a collection of items.
An **item** is a unitary block of data, e.g., a multi-channel image, a python object,
a dictionary and more.
Any valid underfolder dataset must contain a subfolder named `data` with samples
and items. Also, *globally shared* items can be stored in the root folder.
Items are named using the following naming convention:
![](https://github.com/eyecan-ai/pipelime-python/blob/main/docs/images/naming.png?raw=true)
Where:
* `$ID` is the sample index; it must be a unique integer for each sample.
* `$ITEM` is the item name.
* `$EXT` is the item extension.
We currently support many common file formats and others can be added by users:
* `.png`, `.jpeg/.jpg/.jfif/.jpe`, `.bmp` for images
* `.tiff/.tif` for multi-page images and multi-dimensional numpy arrays
* `.yaml/.yml`, `.json` and `.toml/.tml` for metadata
* `.txt` for numpy 2D matrix notation
* `.npy` for general numpy arrays
* `.pkl/.pickle` for picklable python objects
* `.bin` for generic binary data
Root files follow the same convention, but they lack the sample identifier part, i.e., `$ITEM.$EXT`. An example layout is sketched below.
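
For example, a small Underfolder dataset with two samples, each holding an `image` and a `label` item, plus a shared `camera` root item, might look like this (hypothetical file names, for illustration only):

```
example_dataset/            # dataset root
├── camera.yml              # root item "camera", shared by all samples
└── data/                   # per-sample items
    ├── 000_image.png       # sample 0, item "image"
    ├── 000_label.txt       # sample 0, item "label"
    ├── 001_image.png       # sample 1, item "image"
    └── 001_label.txt       # sample 1, item "label"
```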
### Reading an Underfolder Dataset
`pipelime` provides an intuitive interface to read, manipulate and write Underfolder datasets.
No complex signatures, weird object iterators, or boilerplate code: you just need a `SamplesSequence`:
```python
from pipelime.sequences import SamplesSequence
# Read an underfolder dataset with a single line of code
dataset = SamplesSequence.from_underfolder('tests/sample_data/datasets/underfolder_minimnist')
# A dataset behaves like a Sequence
print(len(dataset)) # the number of samples
sample = dataset[4] # get the fifth sample
# A sample is a mapping
print(len(sample)) # the number of items
print(set(sample.keys())) # the items' keys
# An item is an object wrapping the actual data
image_item = sample["image"] # get the "image" item from the sample
print(type(image_item)) # <class 'pipelime.items.image_item.PngImageItem'>
image = image_item() # actually loads the data from disk (it may come from a remote storage as well)
print(type(image)) # <class 'numpy.ndarray'>
```
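
Since a `SamplesSequence` behaves like a standard Python sequence, the usual idioms work as expected. A minimal sketch, continuing from the `dataset` object above:

```python
# iterate over all samples; each image is loaded lazily when the item is called
for sample in dataset:
    image = sample["image"]()
    print(image.shape)
```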
### Writing an Underfolder Dataset
You can **write** a dataset by calling the associated operation:
```python
# Attach a "write" operation to the dataset
dataset = dataset.to_underfolder('/tmp/my_output_dataset')
# Now run over all the samples
dataset.run()
# You can easily spawn multiple processes if needed
dataset.run(num_workers=4)
```
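
Putting it all together, a full read-and-write round trip takes just a few lines. A minimal sketch using only the calls shown above, with hypothetical input/output paths:

```python
from pipelime.sequences import SamplesSequence

# read a dataset, attach a write operation, then execute over all samples
dataset = SamplesSequence.from_underfolder("path/to/input_dataset")
dataset = dataset.to_underfolder("path/to/output_dataset")
dataset.run(num_workers=4)  # optionally parallelized
```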