# stream-d4py
## New dispel4py (stream-d4py) streaming workflow repository
dispel4py is a free and open-source Python library for describing abstract stream-based workflows for distributed data-intensive applications. It enables users to focus on their scientific methods, avoiding distracting details and retaining flexibility over the computing infrastructure they use. It delivers mappings to diverse computing infrastructures, including cloud technologies, HPC architectures and specialised data-intensive machines, to move seamlessly into production with large-scale data loads. The dispel4py system maps workflows dynamically onto multiple enactment systems, and supports parallel processing on distributed memory systems with MPI and shared memory systems with multiprocessing, without users having to modify their workflows.
## Dependencies
This version of dispel4py has been tested with Python *3.10*
For earlier versions of dispel4py compatible with Python <3.10 ( e.g *2.7.5*, *2.7.2*, *2.6.6* and Python *3.4.3*, *3.6*, *3.7*) we recommend to go [here](https://gitlab.com/project-dare/dispel4py).
The dependencies required for running dispel4py are listed in the requirements.txt file.
You will also need the following installed on your system:
- If using the MPI mapping, please install [mpi4py](http://mpi4py.scipy.org/)
## Installation
The easiest way to install dispel4py is via pip (https://pypi.python.org/pypi/pip):
```
pip install stream-d4py
```
Or you can install the latest development from github [https://github.com/StreamingFlow/d4py.git](https://github.com/StreamingFlow/d4py.git) and follow this instructions:
- Clone the git repository
- Make sure that `redis` and the `mpi4py` Python package are installed on your system
- It is optional but recommended to create a virtual environment for dispel4py. Please refer to instructions bellow for setting it up with Conda.
- Run the dispel4py setup script: `python setup.py install`
- Run dispel4py using one of the following commands:
- `dispel4py <mapping name> <workflow file> <args>`, OR
- `python -m dispel4py.new.processor <mapping module> <workflow module> <args>`
- See "Examples" section bellow for more details
### Conda Environment
For installing for development with a conda environment, please run the following commands in your terminal.
1. `conda create --name stream-d4py_env python=3.10`
2. `conda activate stream-d4py_env`
3. `https://github.com/StreamingFlow/stream-d4py.git`
4. `cd dispel4py`
5. `conda install -c conda-forge mpi4py mpich` OR `pip install mpi4py` (Linux)
6. `python setup.py install`
OR just simply do these:
1. `conda create --name stream-d4py_env python=3.10`
2. `conda activate stream-d4py_env`
3. `conda install -c conda-forge mpi4py mpich` OR `pip install mpi4py` (Linux)
4. `pip install stream-d4py`
### Known Issues
1. Multiprocessing (multi) does not seem to work properly in MacOS (M1 chip).See bellow:
```
File "/Users/...../anaconda3/envs/.../lib/python3.10/multiprocessing/spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
AttributeError: 'TestProducer' object has no attribute 'simple_logger'
```
For those users, we do recommend to user our Docker file to create an image, and later a container.
2. You might have to use the following command to install mpi in your MacOS laptop:
```
conda install -c conda-forge mpi4py mpich
```
In Linux enviroments to install mpi you can use:
```
pip install mpi4py
```
3. For the mpi mapping, we need to indicate **twice** the number of processes, using twice the -n flag -- one at te beginning and one at the end --:
```
mpiexec -n 10 dispel4py mpi dispel4py.examples.graph_testing.pipeline_test -i 20 -n 10
```
4. In some enviroments, you might need these flags for the mpi mapping:
```
--allow-run-as-root --oversubscribe
```
5. When running workflows with **mpi mapping** you may encounter messages like `Read -1, expected 56295, errno = 1`. There's no need for concern; these messages are typical and do not indicate a problem. Rest assured, your workflow is still running as expected.
### Docker
The Dockerfile in the dispel4py root directory installs dispel4py and mpi4py.
```
docker build . -t mydispel4py
```
Note: If you want to re-built an image without cache, use this flag: `--no-cache`
Start a Docker container with the dispel4py image in interactive mode with a bash shell:
```
docker run -it mydispel4py /bin/bash
```
## Mappings
The mappings of dispel4py refer to the connections between the processing elements (PEs) in a dataflow graph. Dispel4py is a Python library used for specifying and executing data-intensive workflows. In a dataflow graph, each PE represents a processing step, and the mappings define how data flows between the PEs during execution. These mappings ensure that data is correctly routed and processed through the dataflow, enabling efficient and parallel execution of tasks. We currently support the following ones:
- **Sequential**
- "simple": it executes dataflow graphs sequentially on a single process, suitable for small-scale data processing tasks.
- **Parallel**:
- **Fixed fixed workload distribution - support stateful and stateless PEs:**
- "mpi": it distributes dataflow graph computations across multiple nodes (distributed memory) using the **Message Passing Interface (MPI)**.
- "multi": it runs multiple instances of a dataflow graph concurrently using **multiprocessing Python library**, offering parallel processing on a single machine.
- "zmq_multi": it runs multiple instances of a dataflow graph concurrently using **ZMQ library**, offering parallel processing on a single machine.
- "redis" : it runs multiple instances of a dataflow graph concurrently using **Redis library**.
- **Dynamic workfload distribution - support only stateless PEs**
- "dyn_multi": it runs multiple instances of a dataflow graph concurrently using **multiprocessing Python library**. Worload assigned dynamically (but no autoscaling).
- "dyn_auto_multi": same as above, but allows autoscaling. We can indicate the number of threads to use.
- "dyn_redis": it runs multiple instances of a dataflow graph concurrently using **Redis library**. Workload assigned dynamically (but no autocasling).
- "dyn_auto_redis": same as above, but allows autoscaling. We can indicate the number of threads to use.
- **Hybrid workload distribution - supports stateful and stateless PEs**
- "hybrid_redis": it runs multiple instances of a dataflow graph concurrently using **Redis library**. Hybrid approach for workloads: Stafeless PEs assigned dynamically, while Stateful PEs are assigned from the begining.
## Examples
[This directory](https://github.com/StreamingFlow/d4py/tree/main/dispel4py/examples/graph_testing) contains a collection of dispel4py workflows used for testing and validating the functionalities and behavior of dataflow graphs. These workflows are primarily used for testing purposes and ensure that the different mappings (e.g., simple, MPI, Storm) and various features of dispel4py work as expected. They help in verifying the correctness and efficiency of dataflow graphs during development and maintenance of the dispel4py library
For more complex "real-world" examples for specific scientific domains, such as seismology, please go to [this repository](https://github.com/StreamingFlow/d4py_workflows)
### Pipeline_test
For each mapping we have always two options: either to use the `dispel4py` command ; or use `python -m` command.
##### Simple mapping
```shell
dispel4py simple dispel4py.examples.graph_testing.pipeline_test -i 10
```
OR
```shell
python -m dispel4py.new.processor dispel4py.new.simple_process dispel4py.examples.graph_testing.pipeline_test -i 10
```
##### Multi mapping
```shell
dispel4py multi dispel4py.examples.graph_testing.pipeline_test -i 10 -n 6
```
OR
```shell
python -m dispel4py.new.processor dispel4py.new.multi_process dispel4py.examples.graph_testing.pipeline_test -i 10 -n 6
```
##### MPI mapping
```shell
mpiexec -n 10 dispel4py mpi dispel4py.examples.graph_testing.pipeline_test -i 20 -n 10
```
OR
```shell
mpiexec -n 10 python -m dispel4py.new.processor dispel4py.new.mpi_process dispel4py.examples.graph_testing.pipeline_test -i 20 -n 10
```
Remember that you might to use the `--allow-run-as-root --oversubscribe` flags for some enviroments:
```shell
mpiexec -n 10 --allow-run-as-root --oversubscribe dispel4py mpi dispel4py.examples.graph_testing.pipeline_test -i 20 -n 10
```
#### Redis mapping
Note: In another tab, we need to have REDIS working in background:
In Tab 1:
```shell
redis-server
```
In Tab 2:
```shell
dispel4py redis dispel4py.examples.graph_testing.word_count -ri localhost -n 4 -i 10
```
OR
```shell
python -m dispel4py.new.processor dispel4py.new.dynamic_redis dispel4py.examples.graph_testing.word_count -ri localhost -n 4 -i 10
```
**Note**: You can have just one tab, running redis-server in the background: `redis-server &`
#### Hibrid Redis with two stateful workflows
*Note 1*: This mapping also uses multiprocessing (appart from redis) - therefore you might have issues with MacOS (M1 chip). For this mapping, we recommed to use our Docker container.
*Note 2*: You need to have redis-server running. Either in a separete tab, or in the same tab, but in background.
###### Split and Merge workflow
```shell
python -m dispel4py.new.processor hybrid_redis dispel4py.examples.graph_testing.split_merge -i 100 -n 10
```
OR
```shell
dispel4py hybrid_redis dispel4py.examples.graph_testing.split_merge -i 100 -n 10
```
###### All to one stateful workflow
```shell
python -m dispel4py.new.processor hybrid_redis dispel4py.examples.graph_testing.grouping_alltoone_stateful -i 100 -n 10
```
OR
```shell
dispel4py hybrid_redis dispel4py.examples.graph_testing.grouping_alltoone_stateful -i 100 -n 10
```
## Google Colab Testing
Notebook for [testing dispel4py (stream-d4py) in Google Col](https://colab.research.google.com/drive/1rSkwBgu42YG3o2AAHweVkJPuqVZF62te?usp=sharing)
Raw data
{
"_id": null,
"home_page": "https://github.com/StreamingFlow/d4py/",
"name": "stream-d4py",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.10",
"maintainer_email": "",
"keywords": "updated dispel4py dispel workflows processing elements data intensive",
"author": "Rosa Filgueira and Amy Krauser",
"author_email": "rosa.filgueira.vicente@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/31/14/5df4826aa69084007b57ebbbec7a67c361b7110bb82bf2e070ca3edac555/stream_d4py-2.9.tar.gz",
"platform": null,
"description": "# stream-d4py\n\n## New dispel4py (stream-d4py) streaming workflow repository \n\ndispel4py is a free and open-source Python library for describing abstract stream-based workflows for distributed data-intensive applications. It enables users to focus on their scientific methods, avoiding distracting details and retaining flexibility over the computing infrastructure they use. It delivers mappings to diverse computing infrastructures, including cloud technologies, HPC architectures and specialised data-intensive machines, to move seamlessly into production with large-scale data loads. The dispel4py system maps workflows dynamically onto multiple enactment systems, and supports parallel processing on distributed memory systems with MPI and shared memory systems with multiprocessing, without users having to modify their workflows.\n\n\n## Dependencies\n\nThis version of dispel4py has been tested with Python *3.10*\n\nFor earlier versions of dispel4py compatible with Python <3.10 ( e.g *2.7.5*, *2.7.2*, *2.6.6* and Python *3.4.3*, *3.6*, *3.7*) we recommend to go [here](https://gitlab.com/project-dare/dispel4py).\n\nThe dependencies required for running dispel4py are listed in the requirements.txt file.\n\nYou will also need the following installed on your system:\n\n- If using the MPI mapping, please install [mpi4py](http://mpi4py.scipy.org/)\n\n\n## Installation\n\nThe easiest way to install dispel4py is via pip (https://pypi.python.org/pypi/pip):\n\n```\npip install stream-d4py\n```\n\nOr you can install the latest development from github [https://github.com/StreamingFlow/d4py.git](https://github.com/StreamingFlow/d4py.git) and follow this instructions:\n\n- Clone the git repository\n- Make sure that `redis` and the `mpi4py` Python package are installed on your system\n- It is optional but recommended to create a virtual environment for dispel4py. Please refer to instructions bellow for setting it up with Conda.\n- Run the dispel4py setup script: `python setup.py install`\n- Run dispel4py using one of the following commands:\n - `dispel4py <mapping name> <workflow file> <args>`, OR\n - `python -m dispel4py.new.processor <mapping module> <workflow module> <args>`\n- See \"Examples\" section bellow for more details\n\n### Conda Environment\n\nFor installing for development with a conda environment, please run the following commands in your terminal.\n\n1. `conda create --name stream-d4py_env python=3.10`\n2. `conda activate stream-d4py_env`\n3. `https://github.com/StreamingFlow/stream-d4py.git`\n4. `cd dispel4py`\n5. `conda install -c conda-forge mpi4py mpich` OR `pip install mpi4py` (Linux)\n6. `python setup.py install`\n\nOR just simply do these:\n\n1. `conda create --name stream-d4py_env python=3.10`\n2. `conda activate stream-d4py_env`\n3. `conda install -c conda-forge mpi4py mpich` OR `pip install mpi4py` (Linux)\n4. `pip install stream-d4py`\n\n\n### Known Issues\n\n1. Multiprocessing (multi) does not seem to work properly in MacOS (M1 chip).See bellow:\n\n\n```\nFile \"/Users/...../anaconda3/envs/.../lib/python3.10/multiprocessing/spawn.py\", line 126, in _main\n self = reduction.pickle.load(from_parent)\nAttributeError: 'TestProducer' object has no attribute 'simple_logger'\n```\n\nFor those users, we do recommend to user our Docker file to create an image, and later a container.\n\n2. You might have to use the following command to install mpi in your MacOS laptop:\n```\nconda install -c conda-forge mpi4py mpich\n```\n In Linux enviroments to install mpi you can use:\n```\npip install mpi4py\n```\n\n3. For the mpi mapping, we need to indicate **twice** the number of processes, using twice the -n flag -- one at te beginning and one at the end --:\n\n```\nmpiexec -n 10 dispel4py mpi dispel4py.examples.graph_testing.pipeline_test -i 20 -n 10\n```\n\n4. In some enviroments, you might need these flags for the mpi mapping: \n\n```\n--allow-run-as-root --oversubscribe \n```\n\n5. When running workflows with **mpi mapping** you may encounter messages like `Read -1, expected 56295, errno = 1`. There's no need for concern; these messages are typical and do not indicate a problem. Rest assured, your workflow is still running as expected.\n\n\n### Docker\n\nThe Dockerfile in the dispel4py root directory installs dispel4py and mpi4py.\n\n```\ndocker build . -t mydispel4py\n```\n\nNote: If you want to re-built an image without cache, use this flag: `--no-cache`\n\nStart a Docker container with the dispel4py image in interactive mode with a bash shell:\n\n```\ndocker run -it mydispel4py /bin/bash\n```\n\n## Mappings\n\nThe mappings of dispel4py refer to the connections between the processing elements (PEs) in a dataflow graph. Dispel4py is a Python library used for specifying and executing data-intensive workflows. In a dataflow graph, each PE represents a processing step, and the mappings define how data flows between the PEs during execution. These mappings ensure that data is correctly routed and processed through the dataflow, enabling efficient and parallel execution of tasks. We currently support the following ones:\n\n- **Sequential**\n - \"simple\": it executes dataflow graphs sequentially on a single process, suitable for small-scale data processing tasks. \n- **Parallel**: \n - **Fixed fixed workload distribution - support stateful and stateless PEs:**\n \t- \"mpi\": it distributes dataflow graph computations across multiple nodes (distributed memory) using the **Message Passing Interface (MPI)**. \n \t- \"multi\": it runs multiple instances of a dataflow graph concurrently using **multiprocessing Python library**, offering parallel processing on a single machine. \n \t- \"zmq_multi\": it runs multiple instances of a dataflow graph concurrently using **ZMQ library**, offering parallel processing on a single machine.\n - \"redis\" : it runs multiple instances of a dataflow graph concurrently using **Redis library**. \n - **Dynamic workfload distribution - support only stateless PEs** \n - \"dyn_multi\": it runs multiple instances of a dataflow graph concurrently using **multiprocessing Python library**. Worload assigned dynamically (but no autoscaling). \n - \"dyn_auto_multi\": same as above, but allows autoscaling. We can indicate the number of threads to use.\n - \"dyn_redis\": it runs multiple instances of a dataflow graph concurrently using **Redis library**. Workload assigned dynamically (but no autocasling). \n - \"dyn_auto_redis\": same as above, but allows autoscaling. We can indicate the number of threads to use.\n - **Hybrid workload distribution - supports stateful and stateless PEs**\n - \"hybrid_redis\": it runs multiple instances of a dataflow graph concurrently using **Redis library**. Hybrid approach for workloads: Stafeless PEs assigned dynamically, while Stateful PEs are assigned from the begining.\n\n\n## Examples\n\n[This directory](https://github.com/StreamingFlow/d4py/tree/main/dispel4py/examples/graph_testing) contains a collection of dispel4py workflows used for testing and validating the functionalities and behavior of dataflow graphs. These workflows are primarily used for testing purposes and ensure that the different mappings (e.g., simple, MPI, Storm) and various features of dispel4py work as expected. They help in verifying the correctness and efficiency of dataflow graphs during development and maintenance of the dispel4py library\n\nFor more complex \"real-world\" examples for specific scientific domains, such as seismology, please go to [this repository](https://github.com/StreamingFlow/d4py_workflows)\n\n\n### Pipeline_test\n\nFor each mapping we have always two options: either to use the `dispel4py` command ; or use `python -m` command. \n\n##### Simple mapping\n```shell\ndispel4py simple dispel4py.examples.graph_testing.pipeline_test -i 10 \n```\nOR \n\n```shell\npython -m dispel4py.new.processor dispel4py.new.simple_process dispel4py.examples.graph_testing.pipeline_test -i 10\n```\n\n##### Multi mapping\n```shell\ndispel4py multi dispel4py.examples.graph_testing.pipeline_test -i 10 -n 6\n```\nOR \n\n```shell\npython -m dispel4py.new.processor dispel4py.new.multi_process dispel4py.examples.graph_testing.pipeline_test -i 10 -n 6\n```\n\n##### MPI mapping\n```shell\nmpiexec -n 10 dispel4py mpi dispel4py.examples.graph_testing.pipeline_test -i 20 -n 10\n```\nOR \n\n```shell\nmpiexec -n 10 python -m dispel4py.new.processor dispel4py.new.mpi_process dispel4py.examples.graph_testing.pipeline_test -i 20 -n 10\n```\n\nRemember that you might to use the `--allow-run-as-root --oversubscribe` flags for some enviroments:\n\n```shell\nmpiexec -n 10 --allow-run-as-root --oversubscribe dispel4py mpi dispel4py.examples.graph_testing.pipeline_test -i 20 -n 10\n```\n#### Redis mapping\n\nNote: In another tab, we need to have REDIS working in background:\n\nIn Tab 1: \n\n```shell\nredis-server\n```\n\nIn Tab 2: \n\n```shell\ndispel4py redis dispel4py.examples.graph_testing.word_count -ri localhost -n 4 -i 10\n```\n\nOR\n\n```shell\npython -m dispel4py.new.processor dispel4py.new.dynamic_redis dispel4py.examples.graph_testing.word_count -ri localhost -n 4 -i 10\n```\n\n**Note**: You can have just one tab, running redis-server in the background: `redis-server &` \n\n\n#### Hibrid Redis with two stateful workflows \n\n*Note 1*: This mapping also uses multiprocessing (appart from redis) - therefore you might have issues with MacOS (M1 chip). For this mapping, we recommed to use our Docker container. \n\n*Note 2*: You need to have redis-server running. Either in a separete tab, or in the same tab, but in background. \n\n###### Split and Merge workflow\n\n```shell\npython -m dispel4py.new.processor hybrid_redis dispel4py.examples.graph_testing.split_merge -i 100 -n 10\n```\nOR\n\n```shell\ndispel4py hybrid_redis dispel4py.examples.graph_testing.split_merge -i 100 -n 10\n```\n\n###### All to one stateful workflow\n```shell\npython -m dispel4py.new.processor hybrid_redis dispel4py.examples.graph_testing.grouping_alltoone_stateful -i 100 -n 10\n```\nOR\n```shell\ndispel4py hybrid_redis dispel4py.examples.graph_testing.grouping_alltoone_stateful -i 100 -n 10\n```\n\n## Google Colab Testing\n\nNotebook for [testing dispel4py (stream-d4py) in Google Col](https://colab.research.google.com/drive/1rSkwBgu42YG3o2AAHweVkJPuqVZF62te?usp=sharing)\n\n\n\n",
"bugtrack_url": null,
"license": "Apache 2",
"summary": "dispel4py is a free and open-source Python library for describing abstract stream-based workflows for distributed data-intensive applications.",
"version": "2.9",
"project_urls": {
"Homepage": "https://github.com/StreamingFlow/d4py/"
},
"split_keywords": [
"updated",
"dispel4py",
"dispel",
"workflows",
"processing",
"elements",
"data",
"intensive"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "f89fe6d5b7f17d5791af011f279c5679ee72c4eb1d2494900d2701f342d52396",
"md5": "cfd7106ae83a57fe69ee8baea1d5d899",
"sha256": "4df9d3a8fd94f3a3dc5de517c50c9e6a934ea10b8c1e725b198c9bef50719c31"
},
"downloads": -1,
"filename": "stream_d4py-2.9-py3-none-any.whl",
"has_sig": false,
"md5_digest": "cfd7106ae83a57fe69ee8baea1d5d899",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.10",
"size": 121988,
"upload_time": "2024-01-12T16:12:50",
"upload_time_iso_8601": "2024-01-12T16:12:50.119176Z",
"url": "https://files.pythonhosted.org/packages/f8/9f/e6d5b7f17d5791af011f279c5679ee72c4eb1d2494900d2701f342d52396/stream_d4py-2.9-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "31145df4826aa69084007b57ebbbec7a67c361b7110bb82bf2e070ca3edac555",
"md5": "ee7d5fc09a3b72f94dd89d6c80443552",
"sha256": "c48ad9855ce9049003eea7b5138126c784372f8e3e6af9fb744a5597b5b9803b"
},
"downloads": -1,
"filename": "stream_d4py-2.9.tar.gz",
"has_sig": false,
"md5_digest": "ee7d5fc09a3b72f94dd89d6c80443552",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.10",
"size": 85582,
"upload_time": "2024-01-12T16:12:52",
"upload_time_iso_8601": "2024-01-12T16:12:52.155626Z",
"url": "https://files.pythonhosted.org/packages/31/14/5df4826aa69084007b57ebbbec7a67c361b7110bb82bf2e070ca3edac555/stream_d4py-2.9.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-01-12 16:12:52",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "StreamingFlow",
"github_project": "d4py",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"requirements": [],
"lcname": "stream-d4py"
}