# 🍋 `pipelime`
[![Documentation Status](https://readthedocs.org/projects/pipelime-python/badge/?version=latest)](https://pipelime-python.readthedocs.io/en/latest/?badge=latest)
[![PyPI version](https://badge.fury.io/py/pipelime-python.svg)](https://badge.fury.io/py/pipelime-python)
<img src="docs/_static/pipelime_banner.png?raw=true" width="100%"/>
*If life gives you lemons, use `pipelime`.*
Welcome to **pipelime**, a Swiss Army knife for data processing!
`pipelime` is a full-fledged **framework** for **data science**: read your datasets,
manipulate them, and write them back to disk or upload them to a remote data lake.
Then build up your **dataflow** with Piper and manage the configuration with Choixe.
Finally, **embed** your custom commands into the `pipelime` workspace, so they act both as dataflow nodes and as an advanced command line interface.
Maybe too much for you? No worries: `pipelime` is **modular**, so you can take just what you need:
- **data processing scripts**: use the powerful `SamplesSequence` to create your own data processing pipelines with a simple and intuitive API. Parallelization works out of the box, and you can easily serialize your pipelines to yaml/json. Integrations with popular frameworks, e.g., [pytorch](https://pytorch.org/), are also provided.
- **easy dataflow**: `Piper` can manage and execute directed acyclic graphs (DAGs), giving back feedback on the progress through sockets or custom callbacks.
- **configuration management**: `Choixe` is a simple and intuitive mini scripting language designed to ease the creation of configuration files with the help of variables, symbol importing, for loops, switch statements, parameter sweeps and more.
- **command line interface**: `pipelime` removes all the boilerplate code needed to create a beautiful CLI for your scripts and packages. You focus on *what matters* and we provide input parsing, advanced interfaces for complex arguments, automatic help generation, and configuration management. Also, any `PipelimeCommand` can be used as a node in a dataflow for free!
- **pydantic tools**: most of the classes in `pipelime` derive from [`pydantic.BaseModel`](https://docs.pydantic.dev/), so we have built some useful tools to, e.g., inspect their structure, auto-generate human-friendly documentation and more (including a TUI to help you write input data to [deserialize](https://docs.pydantic.dev/usage/models/#helper-functions) any pydantic model); see the sketch below.
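
Since `pipelime` objects are plain pydantic models, they can be built from dictionaries or parsed configuration files with pydantic's standard helper functions. A minimal sketch with a hypothetical user-defined model (not a `pipelime` class), assuming the pydantic v1 API (`parse_obj`):

```python
from pydantic import BaseModel


class TrainConfig(BaseModel):
    # hypothetical configuration model, for illustration only
    learning_rate: float = 1e-3
    epochs: int = 10


# deserialize from plain data, e.g., loaded from a yaml/json file
cfg = TrainConfig.parse_obj({"learning_rate": 0.01, "epochs": 5})
print(cfg.epochs)  # 5
```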
---
## Installation
Install `pipelime` using pip:
```
pip install pipelime-python
```
To be able to *draw* the dataflow graphs, you need the `draw` variant:
```
pip install pipelime-python[draw]
```
> **Warning**
>
> The `draw` variant needs `Graphviz` (<https://www.graphviz.org/>) installed on your system.
> On Ubuntu/Debian, you can install it with:
>
> ```
> sudo apt-get install graphviz graphviz-dev
> ```
>
> Alternatively, you can use `conda`:
>
> ```
> conda install --channel conda-forge pygraphviz
> ```
>
> Please see the full installation options at https://github.com/pygraphviz/pygraphviz/blob/main/INSTALL.txt
## Basic Usage
### Underfolder Format
The **Underfolder** format is the preferred `pipelime` dataset format, i.e., a flexible way to
model and store a generic dataset on the **filesystem**.
![](https://github.com/eyecan-ai/pipelime-python/blob/main/docs/images/underfolder.png?raw=true)
An Underfolder **dataset** is a collection of samples. A **sample** is a collection of items.
An **item** is a unitary block of data, e.g., a multi-channel image, a python object,
a dictionary and more.
Any valid underfolder dataset must contain a subfolder named `data` with samples
and items. Also, *globally shared* items can be stored in the root folder.
Items are named using the following naming convention:
![](https://github.com/eyecan-ai/pipelime-python/blob/main/docs/images/naming.png?raw=true)
Where:
* `$ID` is the sample index; it must be a unique integer for each sample.
* `$ITEM` is the item name.
* `$EXT` is the item extension.
We currently support many common file formats and others can be added by users:
* `.png`, `.jpeg/.jpg/.jfif/.jpe`, `.bmp` for images
* `.tiff/.tif` for multi-page images and multi-dimensional numpy arrays
* `.yaml/.yml`, `.json` and `.toml/.tml` for metadata
* `.txt` for numpy 2D matrix notation
* `.npy` for general numpy arrays
* `.pkl/.pickle` for picklable python objects
* `.bin` for generic binary data
Root files follow the same convention, but they lack the sample identifier part, i.e., `$ITEM.$EXT`. An example layout is sketched below.
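
For example, a small Underfolder dataset with two samples, each holding an `image` and a `label` item, plus a shared `camera` root item, might look like this (hypothetical file names, for illustration only):

```
example_dataset/            # dataset root
├── camera.yml              # root item "camera", shared by all samples
└── data/                   # per-sample items
    ├── 000_image.png       # sample 0, item "image"
    ├── 000_label.txt       # sample 0, item "label"
    ├── 001_image.png       # sample 1, item "image"
    └── 001_label.txt       # sample 1, item "label"
```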
### Reading an Underfolder Dataset
`pipelime` provides an intuitive interface to read, manipulate and write Underfolder datasets.
No complex signatures, weird object iterators, or boilerplate code: you just need a `SamplesSequence`:
```python
from pipelime.sequences import SamplesSequence
# Read an underfolder dataset with a single line of code
dataset = SamplesSequence.from_underfolder('tests/sample_data/datasets/underfolder_minimnist')
# A dataset behaves like a Sequence
print(len(dataset)) # the number of samples
sample = dataset[4] # get the fifth sample
# A sample is a mapping
print(len(sample)) # the number of items
print(set(sample.keys())) # the items' keys
# An item is an object wrapping the actual data
image_item = sample["image"] # get the "image" item from the sample
print(type(image_item)) # <class 'pipelime.items.image_item.PngImageItem'>
image = image_item() # actually loads the data from disk (it may come from a remote storage as well)
print(type(image)) # <class 'numpy.ndarray'>
```
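
Since a `SamplesSequence` behaves like a standard Python sequence, the usual idioms work as expected. A minimal sketch, continuing from the `dataset` object above:

```python
# iterate over all samples; each image is loaded lazily when the item is called
for sample in dataset:
    image = sample["image"]()
    print(image.shape)
```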
### Writing an Underfolder Dataset
You can **write** a dataset by calling the associated operation:
```python
# Attach a "write" operation to the dataset
dataset = dataset.to_underfolder('/tmp/my_output_dataset')
# Now run over all the samples
dataset.run()
# You can easily spawn multiple processes if needed
dataset.run(num_workers=4)
```
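
Putting it all together, a full read-and-write round trip takes just a few lines. A minimal sketch using only the calls shown above, with hypothetical input/output paths:

```python
from pipelime.sequences import SamplesSequence

# read a dataset, attach a write operation, then execute over all samples
dataset = SamplesSequence.from_underfolder("path/to/input_dataset")
dataset = dataset.to_underfolder("path/to/output_dataset")
dataset.run(num_workers=4)  # optionally parallelized
```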