# rats-processors
A package for creating and composing pipelines with a high-level API, where _processors_ (classes
or unbound methods) are mapped to _pipeline nodes_, _node ports_ are inferred from the
_processors'_ signatures, and _edges_ are created by connecting input and output _node ports_.
Pipelines defined this way are immutable objects that can be reused and composed into larger
pipelines.
## Example
In your Python project or Jupyter notebook, you can compose a pipeline as follows:
```python
from pathlib import Path
from typing import NamedTuple

import pandas as pd
from sklearn.base import BaseEstimator

from rats.processors import task, pipeline, Pipeline, PipelineContainer


class DataOut(NamedTuple):
    data: pd.DataFrame


class ModelOut(NamedTuple):
    model: BaseEstimator


class MyContainer(PipelineContainer):
    @task
    def load_data(self, fname: Path) -> DataOut:
        return DataOut(data=pd.read_csv(fname))

    @task
    def train_model(self, data: pd.DataFrame) -> ModelOut:
        # Placeholder: fit a real estimator on `data` here.
        return ModelOut(model=BaseEstimator())

    @pipeline
    def my_pipeline(self) -> Pipeline:
        load_data = self.load_data()
        train_model = self.train_model()
        return self.combine(
            pipelines=[load_data, train_model],
            dependencies=(train_model.inputs.data << load_data.outputs.data),
        )
```
This example helps with modularization and with moving exploratory notebook code into more
permanent, reusable code.
It already illustrates several important concepts:
* `rats.processors.PipelineContainer`: wires up code modularly, i.e., a single container
organizes and connects tasks and pipelines.
* `rats.processors.ux.Pipeline`: a data structure that represents a computation graph, i.e., a
directed acyclic graph (DAG) of operations.
* `rats.processors.task`: a decorator that defines a computational task, which we refer to as a
_processor_, and registers it into the container. The decorated method returns a
`rats.processors.ux.Pipeline`, a (single-node) _pipeline_, whose ports are inferred from the
method's signature (see the sketch after this list).
* `rats.processors.pipeline`: a decorator that registers a `rats.processors.ux.Pipeline`,
which can be a combination of other pipelines, or any method that returns a pipeline.
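
To make the port inference concrete, here is a minimal sketch of a task with one input and two
outputs. It assumes, consistent with the example above, that method parameters become input ports
and the fields of the returned `NamedTuple` become output ports; `SplitOut`, `SplitContainer`,
and `split` are illustrative names, not part of the library.

```python
from typing import NamedTuple

import pandas as pd

from rats.processors import task, PipelineContainer


class SplitOut(NamedTuple):
    train: pd.DataFrame
    test: pd.DataFrame


class SplitContainer(PipelineContainer):
    # The signature drives the ports: `data` becomes the single input port,
    # and the `SplitOut` fields (`train`, `test`) become two output ports.
    @task
    def split(self, data: pd.DataFrame) -> SplitOut:
        cut = int(len(data) * 0.8)  # hold out the last 20% of rows
        return SplitOut(train=data.iloc[:cut], test=data.iloc[cut:])
```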
Note that to create a pipeline, you first create _tasks_ (_processors_) and then combine them into
larger _pipelines_: `MyContainer.load_data` and `MyContainer.train_model` are _processors_
wrapped by the `task` decorator, and `MyContainer.my_pipeline` is a _pipeline_ wrapped by the
`pipeline` decorator. The sketch below shows how the same pattern extends to more than two tasks.
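
The following sketch combines three tasks, reusing the illustrative `split` task from above. It
assumes that tasks from both containers can live on one container (here via inheritance) and that
`dependencies` accepts a tuple of `<<` expressions; neither is confirmed by this README.

```python
class ExtendedContainer(MyContainer, SplitContainer):
    @pipeline
    def extended_pipeline(self) -> Pipeline:
        load = self.load_data()
        split = self.split()
        train = self.train_model()
        return self.combine(
            pipelines=[load, split, train],
            # Assumption: several edges can be passed as a tuple of
            # `<<` expressions.
            dependencies=(
                split.inputs.data << load.outputs.data,
                train.inputs.data << split.outputs.train,
            ),
        )
```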
To run the above pipeline, you can do the following. Note that `fname` is the only input not
satisfied by a dependency inside the pipeline, so it is supplied at run time:
```python
from rats.apps import autoid, NotebookApp
app = NotebookApp()
app.add(MyContainer()) # add a container to the notebook app
p = app.get(autoid(MyContainer.my_pipeline)) # get a pipeline by its id
app.draw(p)
app.run(p, inputs={"fname": "data.csv"})
```
## Concepts
| Concept   | Description                                                                                  |
|-----------|----------------------------------------------------------------------------------------------|
| Pipelines | DAGs organizing computation tasks; orchestrated in run environments; displayable as figures   |
| Tasks     | Entry points for computation; accept dynamic inputs/outputs                                   |
| Combined  | Compose tasks & pipelines into more complex DAGs via dependency assignment                    |
## Features
| Feature     | Description                                          |
|-------------|------------------------------------------------------|
| Modular     | Steps become independent; plug & play                |
| Distributed | Uses required resources (e.g., Spark or GPUs)        |
| Graph-based | Can operate on the DAG; enables meta-pipelines       |
| Reusable    | Every pipeline is shareable, enabling collaboration  |
## Goals
* Flexibility: multiple data sources; multiple ML frameworks (pytorch, sklearn, ...), etc.
* Scalability: both data and compute.
* Usability: everyone should be able to author components and share them.
* Reproducibility: tracking and recreating results.