.. |Codecov| image:: https://img.shields.io/codecov/c/github/SIMPLE-DVS/rain
   :alt: Codecov
   :target: https://app.codecov.io/gh/SIMPLE-DVS/rain

.. |License| image:: https://img.shields.io/badge/License-GPLv3-blue.svg

|Codecov| |License|

====
Rain
====
.. this is a comment, insert badge here
   .. image:: https://img.shields.io/pypi/v/rain.svg
      :target: https://pypi.python.org/pypi/rain
   .. image:: https://img.shields.io/travis/SIMPLE-DVS/rain.svg
      :target: https://travis-ci.com/SIMPLE-DVS/rain

What is it?
-----------

Rain is a Python library that supports data scientists during the development of data pipelines,
here called Dataflows, in a rapid and easy way following a declarative approach.
In particular, it helps in data preparation/engineering, where data are processed,
and in data analysis, which consists in defining the most suitable learning algorithm.

Rain contains a collection of nodes that abstract functions of the main Python ML
libraries, such as Scikit-learn, Pandas and PySpark. The ability to combine multiple Python
libraries, together with the possibility to define new nodes or add support for other libraries,
are Rain's main strengths. The library currently contains several nodes implementing
Anomaly Detection strategies.

Dataflow
--------

A DataFlow represents a Directed Acyclic Graph (DAG). Since a DataFlow
may be executed on a remote machine, the acyclicity of the graph must be
ensured to avoid deadlocks.

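The acyclicity check can be performed with a standard topological sort. The following is a minimal sketch of that idea using only the standard library; it is an illustration, not Rain's actual implementation:

```python
from collections import deque


def is_acyclic(nodes, edges):
    """Kahn's algorithm: return True if the directed graph has no cycle.

    nodes: iterable of node ids; edges: iterable of (src, dst) pairs.
    """
    successors = {n: [] for n in nodes}
    indegree = {n: 0 for n in nodes}
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1

    # Repeatedly remove nodes that have no remaining incoming edges.
    ready = deque(n for n, d in indegree.items() if d == 0)
    visited = 0
    while ready:
        node = ready.popleft()
        visited += 1
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)

    # Nodes on a cycle never reach indegree 0, so they are never visited.
    return visited == len(indegree)
```

If `is_acyclic` returns False, at least one edge would create a cycle and the graph cannot be scheduled for execution.
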
Nodes can be added to the DataFlow and connected to each other by edges.
A node can be seen as a meta-function: a combination of several methods of a particular ML library
embedded in Rain that provides one or more functionalities (for instance, a Pandas node/meta-function
could compute the mean of a column and then round it to a given number of decimals).

Edges connect meta-function outputs to meta-function inputs according to a specific semantics.
In general, an output can be connected to an input if and only if their types match
(semantic verification). Moreover, an output can have one or more outgoing edges, while an input
can have at most one incoming edge.

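The two rules above (matching types, and at most one incoming edge per input) can be sketched in a few lines of plain Python. The `Port` class and `connect` function below are hypothetical names for illustration only, not part of Rain's API:

```python
class Port:
    """A named, typed input or output variable of a node (hypothetical sketch)."""

    def __init__(self, name, dtype):
        self.name = name
        self.dtype = dtype
        self.incoming = None  # inputs: at most one incoming edge
        self.outgoing = []    # outputs: any number of outgoing edges


def connect(output, input_):
    """Create an edge only if the semantic rules hold."""
    if output.dtype != input_.dtype:
        raise TypeError(f"type mismatch: {output.dtype} cannot feed {input_.dtype}")
    if input_.incoming is not None:
        raise ValueError(f"input '{input_.name}' already has an incoming edge")
    output.outgoing.append(input_)
    input_.incoming = output
```

A second `connect` call targeting the same input would raise, enforcing the at-most-one-incoming-edge rule, while the same output can be connected any number of times.
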
The library also contains the so-called executors to run the Dataflow. Currently there are the
Local executor, where the computation is performed on a single local machine, and the Spark
executor, which harnesses an Apache Spark cluster. A DataFlow runs on a single device, since
the data transformed by each node are passed directly to the following ones.

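Having both a Local and a Spark executor suggests a common interface behind them. The classes below are a hypothetical sketch of that design (the `Executor.run` signature and the callable-node assumption are illustrative, not Rain's actual API):

```python
from abc import ABC, abstractmethod


class Executor(ABC):
    """Hypothetical common interface shared by all executors."""

    @abstractmethod
    def run(self, nodes):
        """Execute the nodes and return their results."""


class LocalExecutor(Executor):
    """Runs every node in-process on the local machine."""

    def run(self, nodes):
        # nodes are assumed to be callables, already in topological order
        return [node() for node in nodes]
```

A Spark-backed executor would implement the same `run` contract but submit the work to a cluster instead of calling the nodes in-process.
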
Installation
------------

The library can be used stand-alone from Python simply by installing it.

To install Rain, run this command in your terminal (the preferred way to install the most recent stable release):

.. code-block:: console

    $ pip install git+https://github.com/SIMPLE-DVS/rain.git

It is also possible to install Rain with all the optional dependencies by running the following command:

.. code-block:: console

    $ pip install "rain[full] @ git+https://github.com/SIMPLE-DVS/rain"

If you don't have `pip`_ installed, this `Python installation guide`_ can guide
you through the process.
.. _pip: https://pip.pypa.io
.. _Python installation guide: http://docs.python-guide.org/en/latest/starting/installation/
Furthermore, the tool comes with a back-end that leverages the library and exposes its
functionality to a GUI, which eases the use of the library itself.

QuickStart
----------

Here we provide a simple Python script in which Rain is used and a Dataflow is configured::

    import rain

    df = rain.DataFlow("df1", executor=rain.LocalExecutor())

    csv_loader = rain.PandasCSVLoader("load", path="./iris.csv")
    filter_col = rain.PandasColumnsFiltering("filter", column_indexes=[0, 1])
    writer = rain.PandasCSVWriter("write", path="./new_iris.csv")

    df.add_edges(
        csv_loader @ "dataset" > filter_col @ "dataset",
        filter_col @ "transformed_dataset" > writer @ "dataset"
    )

    df.execute()

In the above script we:

- first import the library;
- instantiate a Dataflow (with id *"df1"*, referenced as *df*) passing a Local Executor, meaning
  that the Dataflow will be executed on the local machine that runs the script;
- instantiate 3 nodes (*csv_loader*, *filter_col*, *writer*):

  - the first one loads the *"iris.csv"* file stored in the root directory, containing the Iris
    dataset, using the node PandasCSVLoader;
  - the second node filters some columns using a PandasColumnsFiltering with its parameter
    *column_indexes*;
  - the last one saves the transformed dataset in a new file called *"new_iris.csv"* using the node
    PandasCSVWriter;

- create 2 edges to link the 3 nodes:

  - the *dataset* output variable of the node *csv_loader* is sent to the *dataset* input
    variable of the node *filter_col*;
  - the *transformed_dataset* output of *filter_col* is then sent to the *dataset* input of the
    node *writer*;

- finally call the *execute* method of the Dataflow *df*; this way, running the script produces
  the expected result.

In general, to use the library you perform the following steps:

- create a Dataflow, specifying the type of executor;
- define all the nodes with the desired parameters to achieve your ML task;
- define the edges that link the nodes, using the specific semantics:

  - **>** is the symbol used to create an edge: on the left you specify the output of
    the source node, while on the right you specify the input of the destination node;
  - **@** is the symbol used to access an input/output variable of a node: on the left you
    specify the variable name of the node, while on the right you specify the name of
    the output/input variable of the source/destination node;

- execute the Dataflow and run the script.
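The ``@``/``>`` edge syntax can be realised in plain Python through operator overloading; since ``@`` binds more tightly than ``>``, an expression like ``a @ "out" > b @ "in"`` naturally evaluates both port references before building the edge. The classes below are a simplified illustration of that mechanism, not Rain's actual implementation:

```python
class Node:
    """A named node; @ selects one of its input/output variables."""

    def __init__(self, name):
        self.name = name

    def __matmul__(self, port):  # node @ "port"
        return PortRef(self, port)


class PortRef:
    """A reference to one input/output variable of a node."""

    def __init__(self, node, port):
        self.node = node
        self.port = port

    def __gt__(self, other):  # source ref > destination ref
        # @ binds tighter than >, so both sides are already PortRefs here
        return (self.node.name, self.port, other.node.name, other.port)


edge = Node("load") @ "dataset" > Node("filter") @ "dataset"
```

Here the resulting edge is just a tuple for clarity; a real implementation would build an edge object that the Dataflow stores and validates.
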
More information about Rain usage, the edges' semantics and all the available executors is available `here`_.
A complete description of all the available nodes, with their
behavior, accepted parameters, inputs and outputs, is available at this `link`_.

.. _link: https://rain-library.readthedocs.io/en/latest/rain.nodes.html
.. _here: https://rain-library.readthedocs.io/en/latest/usage.html

Full Documentation
------------------

To build the full documentation, follow these steps:

- Install Sphinx and the Sphinx theme specified in the requirements_dev.txt file, or
  install all the requirements listed in that file (suggested choice).
- From the main directory, cd into the 'docs' directory:

.. code-block:: console

    $ cd docs

Then run the 'make.bat singlehtml' file on Windows, or run the command:

.. code-block:: console

    $ sphinx-build . ./_build

The _build directory will contain the HTML files; open the index.html file to read the full documentation.

Authors
-------

* Alessandro Antinori, Rosario Capparuccia, Riccardo Coltrinari, Flavio Corradini, Marco Piangerelli,
  Barbara Re, Marco Scarpetta, Luca Mozzoni, Vincenzo Nucci

Raw data
{
"_id": null,
"home_page": "https://bitbucket.org/proslabteam/rain_unicam",
"name": "rain-dm",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.6",
"maintainer_email": null,
"keywords": "rain",
"author": "Universit\u00e0 degli Studi di Camerino",
"author_email": null,
"download_url": null,
"platform": null,
"bugtrack_url": null,
"license": "GNU General Public License",
"summary": "Rain library.",
"version": "1.1736337104.25989",
"project_urls": {
"Homepage": "https://bitbucket.org/proslabteam/rain_unicam"
},
"split_keywords": [
"rain"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "2c00fa9f7ef6216a25e80f676e0125292a8dfccf097279a580a0e520ad66935e",
"md5": "bb6ff8b45439911d884a21521ad4334b",
"sha256": "6063d65d0760e34888d19c058c00bfb5a187ff5ced50388cc7c50e4451997627"
},
"downloads": -1,
"filename": "rain_dm-1.1736337104.25989-py2.py3-none-any.whl",
"has_sig": false,
"md5_digest": "bb6ff8b45439911d884a21521ad4334b",
"packagetype": "bdist_wheel",
"python_version": "py2.py3",
"requires_python": ">=3.6",
"size": 96517,
"upload_time": "2025-01-08T11:51:45",
"upload_time_iso_8601": "2025-01-08T11:51:45.340481Z",
"url": "https://files.pythonhosted.org/packages/2c/00/fa9f7ef6216a25e80f676e0125292a8dfccf097279a580a0e520ad66935e/rain_dm-1.1736337104.25989-py2.py3-none-any.whl",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-01-08 11:51:45",
"github": false,
"gitlab": false,
"bitbucket": true,
"codeberg": false,
"bitbucket_user": "proslabteam",
"bitbucket_project": "rain_unicam",
"lcname": "rain-dm"
}