flowcept

Name: flowcept
Version: 0.2.10
Home page: https://github.com/ORNL/flowcept
Summary: FlowCept is a runtime data integration system that empowers any data processing system to capture and query workflow provenance data using data observability, requiring minimal or no changes in the target system code. It seamlessly integrates data from multiple workflows, enabling users to comprehend complex, heterogeneous, and large-scale data from various sources in federated environments.
Upload time: 2024-02-28 17:48:12
Author: Oak Ridge National Laboratory
Requires Python: >=3.8
License: MIT
Keywords: ai, ml, machine-learning, provenance, lineage, responsible-ai, databases, big-data, tensorboard, data-integration, scientific-workflows, dask, reproducibility, workflows, parallel-processing, model-management, mlflow, data-analytics
Requirements: PyYAML, redis, psutil, py-cpuinfo, pymongo, Werkzeug, flask, requests, flask_restful, pandas

[![Build](https://github.com/ORNL/flowcept/actions/workflows/create-release-n-publish.yml/badge.svg)](https://github.com/ORNL/flowcept/actions/workflows/create-release-n-publish.yml)
[![PyPI](https://badge.fury.io/py/flowcept.svg)](https://pypi.org/project/flowcept)
[![Tests](https://github.com/ORNL/flowcept/actions/workflows/run-tests.yml/badge.svg)](https://github.com/ORNL/flowcept/actions/workflows/run-tests.yml)
[![Code Formatting](https://github.com/ORNL/flowcept/actions/workflows/code-formatting.yml/badge.svg)](https://github.com/ORNL/flowcept/actions/workflows/code-formatting.yml)
[![License: MIT](https://img.shields.io/github/license/ORNL/flowcept)](LICENSE)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)

# FlowCept

FlowCept is a runtime data integration system that empowers any data processing system to capture and query workflow 
provenance data using data observability, requiring minimal or no changes in the target system code. It seamlessly integrates data from multiple workflows, enabling users to comprehend complex, heterogeneous, and large-scale data from various sources in federated environments.

FlowCept is intended to address scenarios where multiple workflows in a science campaign or in an enterprise run and generate 
important data to be analyzed in an integrated manner. Since these workflows may use different data manipulation tools (e.g., provenance or lineage capture tools, database systems, performance profiling tools) or may be executed on
different parallel computing systems (e.g., Dask, Spark, Workflow Management Systems), its key differentiator is the 
capability to seamlessly and automatically integrate data from various workflows using data observability.
It builds an integrated data view at runtime, enabling end-to-end exploratory data analysis and monitoring.
It follows the [W3C PROV](https://www.w3.org/TR/prov-overview/) recommendations for its data schema.
It does not require changes to user code or systems (i.e., no instrumentation). 
All users need to do is create adapters for their systems or tools, if one is not yet available. 

Currently, FlowCept provides adapters for: [Dask](https://www.dask.org/), [MLFlow](https://mlflow.org/), [TensorBoard](https://www.tensorflow.org/tensorboard), and [Zambeze](https://github.com/ORNL/zambeze). 

See the [Jupyter Notebooks](notebooks) for utilization examples.

See the [Contributing](CONTRIBUTING.md) file for guidelines on contributing new adapters. Note that the codebase may use the
term 'plugin' as a synonym for adapter; future releases should standardize on the term 'adapter'.


## Install and Setup

1. Install FlowCept: 

`pip install .[full]` in this directory (or `pip install flowcept[full]`).

For convenience, this installs the dependencies for all adapters, including ones you may not use. To avoid that, you can
install only the extras you need by listing the adapter keys, e.g., `pip install .[dask]` or
`pip install .[adapter_key1,adapter_key2]`.
See [extra_requirements](extra_requirements) if you want to install the dependencies individually.
 
2. Start MongoDB and Redis:

To take full advantage of FlowCept, you need to run Redis, which serves as FlowCept's message queue system, and MongoDB, FlowCept's main database system.
The easiest way to start both is with the [docker-compose file](deployment/compose.yml).
You only need RabbitMQ if you also want to observe Zambeze messages.
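For example, assuming Docker Compose is available on your machine, the services can be brought up in the background like this (the exact service names come from the compose file):

```shell
# Start MongoDB and Redis using the bundled compose file
docker compose -f deployment/compose.yml up -d

# Verify that the containers are up
docker compose -f deployment/compose.yml ps
```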

3. Define the settings (e.g., routes and ports) accordingly in the [settings.yaml](resources/settings.yaml) file.

4. Start the observation using the Controller API, as shown in the [Jupyter Notebooks](notebooks).

5. To use FlowCept's Query API, see utilization examples in the notebooks.


## Performance Tuning for Performance Evaluation

In the settings.yaml file, the following variables might impact interception performance:

```yaml
main_redis:
  buffer_size: 50
  insertion_buffer_time_secs: 5

plugin:
  enrich_messages: false
```

Other variables may also impact performance, depending on the adapter. For instance, in Dask, timestamp creation by workers adds interception overhead.
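To illustrate how `buffer_size` and `insertion_buffer_time_secs` interact, here is a minimal sketch (not FlowCept's actual implementation) of a buffered inserter: intercepted messages accumulate and are flushed in bulk when either the buffer fills up or the time window elapses. Larger buffers and windows mean fewer, bigger bulk inserts, at the cost of staler data.

```python
import time

class BufferedInserter:
    """Illustrative sketch of time/size-bounded buffering. Names mirror the
    settings.yaml keys, but this is not FlowCept's internal code."""

    def __init__(self, buffer_size=50, insertion_buffer_time_secs=5, sink=None):
        self.buffer_size = buffer_size
        self.window = insertion_buffer_time_secs
        self.sink = sink if sink is not None else []  # stands in for MongoDB
        self._buffer = []
        self._last_flush = time.monotonic()

    def intercept(self, message):
        self._buffer.append(message)
        # Flush when the buffer is full or the time window has elapsed.
        if (len(self._buffer) >= self.buffer_size
                or time.monotonic() - self._last_flush >= self.window):
            self.flush()

    def flush(self):
        if self._buffer:
            self.sink.append(list(self._buffer))  # one bulk insert
            self._buffer.clear()
        self._last_flush = time.monotonic()

inserter = BufferedInserter(buffer_size=3, insertion_buffer_time_secs=60)
for i in range(7):
    inserter.intercept({"task_id": i})
inserter.flush()  # drain whatever is left
# 7 messages with buffer_size=3 yield bulk inserts of sizes 3, 3, 1
print([len(batch) for batch in inserter.sink])  # → [3, 3, 1]
```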

## See also

- [Zambeze Repository](https://github.com/ORNL/zambeze)

## Cite us

If you use FlowCept in your research, please consider citing our paper.

```
Towards Lightweight Data Integration using Multi-workflow Provenance and Data Observability
R. Souza, T. Skluzacek, S. Wilkinson, M. Ziatdinov, and R. da Silva
19th IEEE International Conference on e-Science, 2023.
```

**Bibtex:**

```latex
@inproceedings{souza2023towards,  
  author = {Souza, Renan and Skluzacek, Tyler J and Wilkinson, Sean R and Ziatdinov, Maxim and da Silva, Rafael Ferreira},
  booktitle = {IEEE International Conference on e-Science},
  doi = {10.1109/e-Science58273.2023.10254822},
  link = {https://doi.org/10.1109/e-Science58273.2023.10254822},
  pdf = {https://arxiv.org/pdf/2308.09004.pdf},
  title = {Towards Lightweight Data Integration using Multi-workflow Provenance and Data Observability},
  year = {2023}
}

```

## Disclaimer & Get in Touch

Please note that this is research software. We encourage you to give it a try and use it with your own stack. We
are continuously improving the documentation and adding more examples and notebooks, but the documentation does not yet
cover the whole system. If you are interested in using FlowCept in your own scientific
project, we can give you a jump start if you reach out to us. Feel free to [create an issue](https://github.com/ORNL/flowcept/issues/new), 
[create a new discussion thread](https://github.com/ORNL/flowcept/discussions/new/choose), or drop us an email (we trust you'll find a way to reach out to us :wink: ).

## Acknowledgement

This research uses resources of the Oak Ridge Leadership Computing Facility 
at the Oak Ridge National Laboratory, which is supported by the Office of 
Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.
