rats-processors

- Name: rats-processors
- Version: 0.1.3
- Home page: https://github.com/microsoft/rats/
- Summary: Rats Processors
- Upload time: 2024-08-02 16:17:11
- Requires Python: <4.0,>=3.10
- License: MIT
- Keywords: pipelines, machine learning, research
# rats-processors

A package for creating and composing pipelines through a high-level API, where _processors_
(classes or unbound methods) are mapped to _pipeline nodes_, _node ports_ are inferred from the
processors' signatures, and _edges_ are created by connecting node input and output ports.
Pipelines defined this way are immutable objects that can be reused and composed into larger
pipelines.

## Example

In your python project or Jupyter notebook, you can compose a pipeline as follows:

```python
from typing import NamedTuple

from pathlib import Path

from sklearn.base import BaseEstimator
import pandas as pd
from rats.processors import task, pipeline, Pipeline, PipelineContainer


class DataOut(NamedTuple):
    data: pd.DataFrame


class ModelOut(NamedTuple):
    model: BaseEstimator


class MyContainer(PipelineContainer):
    @task
    def load_data(self, fname: Path) -> DataOut:
        return DataOut(data=pd.read_csv(fname))

    @task
    def train_model(self, data: pd.DataFrame) -> ModelOut:
        # Placeholder: a real task would fit an estimator on `data`.
        return ModelOut(model=BaseEstimator())

    @pipeline
    def my_pipeline(self) -> Pipeline:
        load_data = self.load_data()
        train_model = self.train_model()
        return self.combine(
            pipelines=[load_data, train_model],
            dependencies=(train_model.inputs.data << load_data.outputs.data),
        )
```

The above example helps with modularization, and with moving exploratory code from notebooks
into more permanent code.

The example above already illustrates several important concepts:
* `rats.processors.PipelineContainer`: wires code up modularly, i.e., one container organizes and
connects tasks and pipelines.
* `rats.processors.ux.Pipeline`: a data structure that represents a computation graph, i.e., a
directed acyclic graph (DAG) of operations.
* `rats.processors.task`: a decorator that defines a computational task, which we refer to as a
_processor_, and registers it with the container. The decorated method returns a
`rats.processors.ux.Pipeline`, a (single-node) _pipeline_; see the sketch after this list.
* `rats.processors.pipeline`: a decorator that registers a `rats.processors.ux.Pipeline`,
which can be a combination of other pipelines, or any method that returns a pipeline.
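
For instance, the node ports of a single-node pipeline come directly from the processor's
signature: parameter names become input ports, and the fields of the returned `NamedTuple` become
output ports. A minimal sketch, reusing the names from the example above (the `inputs`/`outputs`
attribute access is an assumption based on the `combine` call shown earlier):

```python
container = MyContainer()

load_data = container.load_data()      # single-node pipeline
train_model = container.train_model()  # single-node pipeline

# Ports mirror the processor signatures:
#   load_data.inputs.fname     <- parameter `fname: Path`
#   load_data.outputs.data     <- field `DataOut.data`
#   train_model.inputs.data    <- parameter `data: pd.DataFrame`
#   train_model.outputs.model  <- field `ModelOut.model`
dependency = train_model.inputs.data << load_data.outputs.data
```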


Note that to create a pipeline, you first create _tasks_ (_processors_) and then combine them into
larger _pipelines_: `MyContainer.load_data` and `MyContainer.train_model` are _processors_
wrapped by the `task` decorator, and `MyContainer.my_pipeline` is a _pipeline_ wrapped by the
`pipeline` decorator.
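
Because pipelines are immutable values, a registered pipeline can itself be reused as a building
block. Below is a sketch of composing `my_pipeline` into a larger pipeline with a hypothetical
`evaluate` task; the `EvalOut` type, the `evaluate` task, and the assumption that the combined
pipeline exposes `train_model`'s unconnected `model` output are all illustrative, not part of the
package:

```python
class EvalOut(NamedTuple):
    score: float


class MyExtendedContainer(MyContainer):
    @task
    def evaluate(self, model: BaseEstimator) -> EvalOut:
        # Hypothetical scoring logic; a real task would evaluate the model.
        return EvalOut(score=0.0)

    @pipeline
    def train_and_evaluate(self) -> Pipeline:
        training = self.my_pipeline()  # reuse the pipeline defined above
        evaluate = self.evaluate()
        return self.combine(
            pipelines=[training, evaluate],
            dependencies=(evaluate.inputs.model << training.outputs.model),
        )
```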


To run `my_pipeline`, you can do the following:

```python
from rats.apps import autoid, NotebookApp


app = NotebookApp()
app.add(MyContainer())  # add a container to the notebook app
p = app.get(autoid(MyContainer.my_pipeline))  # get a pipeline by its id
app.draw(p)
app.run(p, inputs={"fname": "data.csv"})
```
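
The keys of the `inputs` mapping are the pipeline's unbound input ports: after `combine`, `fname`
is the only port with no incoming edge, so it surfaces as an input of `my_pipeline`. A small
variation on the call above, passing an explicit `Path` to match `load_data`'s annotation
(whether a plain string would be coerced is an assumption here):

```python
from pathlib import Path

# `fname` comes from `load_data`'s signature; an explicit Path matches
# the `fname: Path` annotation rather than relying on coercion.
app.run(p, inputs={"fname": Path("data.csv")})
```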


## Concepts

| Concept   | Description                                       |
|-----------|---------------------------------------------------|
| Pipelines | DAGs organizing computation tasks                 |
|           | Orchestrated in run environments                  |
|           | Can be displayed as figures                       |
| Tasks     | Entry points for the computation process          |
|           | Accept dynamic inputs/outputs                     |
| Combined  | Compose tasks & pipelines into more complex DAGs  |
|           | Dependencies assigned between node ports          |

## Features

| Feature     | Description                                         |                                              |
|-------------|-----------------------------------------------------|----------------------------------------------|
| Modular     | Steps become independent; plug & play               | ![modular](docs/figures/modular.jpg)         |
| Distributed | Uses required resources (Spark or GPUs)             | ![distributed](docs/figures/distributed.png) |
| Graph-based | Can operate on the DAG; enables meta-pipelines      | ![graph-based](docs/figures/graph-based.png) |
| Reusable    | Every pipeline is shareable, allowing collaboration | ![reusable](docs/figures/reusable.png)       |

## Goals

* Flexibility: multiple data sources; multiple ML frameworks (PyTorch, sklearn, ...); etc.
* Scalability: both data and compute.
* Usability: everyone should be able to author components and share them.
* Reproducibility: tracking and recreating results.


            
