data-prep-toolkit-flows


Namedata-prep-toolkit-flows JSON
Version 0.2.0 PyPI version JSON
download
home_pageNone
SummaryData Preparation Toolkit Library for creation and execution of ttansformers flows
upload_time2024-07-09 18:48:23
maintainerNone
docs_urlNone
authorNone
requires_python>=3.10
licenseApache-2.0
keywords data data preprocessing data preparation llm generative ai fine-tuning llmapps
VCS
bugtrack_url
requirements No requirements were recorded.
Travis-CI No Travis.
coveralls test coverage No coveralls.
            # Flows Data Processing Library

This is a framework for combining and local execution of DatPrepKit -transforms_.
The large-scale execution of transformers is based on use of 
[KubeFlow Pipelines](https://www.kubeflow.org/docs/components/pipelines/v1/introduction/) and 
[KubeRay](https://docs.ray.io/en/latest/cluster/kubernetes/index.html) on  big Kubernetes clusters. 
The project provides two example of "super" KFP workflows, 
[one](../../kfp/superworkflows/ray/kfp_v1/superworkflow_dedups_sample_wf.py) that combines '_exact dedup_', 
'_document identification_' and '_fuzzy dedup_'. Another 
[workflow](../../kfp/superworkflows/ray/kfp_v1/superworkflow_code_sample_wf.py) demonstrates processing of programming 
code. This workflow starts from _transformation of the code to parquet files_, after that it executes '_exact dedup_', 
"document identification_', '_fuzzy dedup_', 'programming language select_', '_code quality_', 'malware' transformers. 
The workflow finishes with '_tokenization'  transformers.

However, sometimes developers or data scientists would like to execute a set of transformers locally. This can be 
during the development process or due to the size of the processed data sets.

This package demonstrates two options for how this can be done.

## Data Processing Flows

[Flow](./src/data_processing_flows/flow.py) iis a Python representation of a workflow definition. It defines a set of 
steps that should be executed and a set of global parameters that are common to all steps. Each step can overwrite 
its corresponding parameters. To provide a "real" data-flow, _Flow_ automatically connects the input of each step to 
the output of the previous one. The global parameter set defines only the entire _Flow_ input and output, 
which are set to the first and last steps, respectively.

Currently, _Flow_ supports pure Python local transformers and Ray local transformers. Different transformer types can be 
part of the same _Flow_. Other transformer types will be added later.

### Flow creation
We provide two options of _Flow_ creation: programmatically or by [flow_loader](./src/data_processing_flows/flow_loader.py) 
from a JSON file. _Flow_ JSON schema is defined in [flow_schema.json](./src/data_processing_flows/flow_schema.json).

The [examples](./Examples) directory demonstrates creation of a simple _Flow_ with 3 steps: transformation of PDF files 
into parquet files, document identification and noop transformation. When the pdf and noop transformations are pure 
python transformers, and the document identification is a Ray local transformation.
You can see the JSON _Flow_ definition at [flow_example.json](./Examples/flow_example.json) and its execution at 
[run_example.py](./Examples/run_example.py). The file [flow_example.py](./Examples/flow_example.py) does the same 
programmatically.

### Flow execution from a Jupyter notebook
The [wdu jupyter notebook](./Examples/wdu.ipynb) is an example of a flow with one step of WDU transform. We can run this by running the following commands from [flows](.) directory:
```
make venv   # need to run only during the first execution
. ./venv/bin/activate
pip install jupyter
pip install ipykernel

# execute the Jupyter
python -m ipykernel install --user --name=venv --display-name="Python venv"
jupyter notebook

```
The notebook exists in [wdu.ipynb](./Examples/wdu.ipynb)

## KFP Local Pipelines
KFPv2 added an option to execute components and pipelines locally, see [Execute KFP pipelines locally
Overview ](https://www.kubeflow.org/docs/components/pipelines/user-guides/core-functions/execute-kfp-pipelines-locally/). 
Depending on the user's knowledge and preferences, this feature can be another option for executing workflows locally.  

KFP supports two local runner types which indicate how and where executed components should be executed: _DockerRunner_ 
and _SubprocessRunner_.
DockerRunner is the recommended option, because it executes each task in a separate container.
It offers the strongest form of local runtime environment isolation; is most faithful to the remote runtime environment, 
but might require a prebuilt docker image. 

The files [local_subprocess_wdu_docId_noop_super_pipeline.py](./Examples/local_subprocess_wdu_docId_noop_super_pipeline.py) 
and [local_docker_wdu_docId_noop_super_pipeline.py](./Examples/local_docker_wdu_docId_noop_super_pipeline.py) 
demonstrate a KFP local definition of the same workflow using _SubprocessRunner_ and _DockerRunner_, respectively.

**_Note:_** in order to execute the transformation of PDF files into parquet files you should be connected to the IBM 
Intranet with the "TUNNELAL" VPN

## Next Steps
- Extend support to S3 data sources
- Support Spark transformers
- Support isolated virtual environments by executing _FlowSteps_ in subprocesses.
- Investigate more KFP local opportunities

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "data-prep-toolkit-flows",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": null,
    "keywords": "data, data preprocessing, data preparation, llm, generative, ai, fine-tuning, llmapps",
    "author": null,
    "author_email": "Alexey Roytman <aroytman@il.ibm.com>, Mohammad Nassar <Mohammad.Nassar@ibm.com>, Revital Eres <eres@il.ibm.com>",
    "download_url": "https://files.pythonhosted.org/packages/73/e7/3dcb33197417ddce791d0630839c2e9e96ff11edce405baa6a91903fbce2/data_prep_toolkit_flows-0.2.0.tar.gz",
    "platform": null,
    "description": "# Flows Data Processing Library\n\nThis is a framework for combining and local execution of DatPrepKit -transforms_.\nThe large-scale execution of transformers is based on use of \n[KubeFlow Pipelines](https://www.kubeflow.org/docs/components/pipelines/v1/introduction/) and \n[KubeRay](https://docs.ray.io/en/latest/cluster/kubernetes/index.html) on  big Kubernetes clusters. \nThe project provides two example of \"super\" KFP workflows, \n[one](../../kfp/superworkflows/ray/kfp_v1/superworkflow_dedups_sample_wf.py) that combines '_exact dedup_', \n'_document identification_' and '_fuzzy dedup_'. Another \n[workflow](../../kfp/superworkflows/ray/kfp_v1/superworkflow_code_sample_wf.py) demonstrates processing of programming \ncode. This workflow starts from _transformation of the code to parquet files_, after that it executes '_exact dedup_', \n\"document identification_', '_fuzzy dedup_', 'programming language select_', '_code quality_', 'malware' transformers. \nThe workflow finishes with '_tokenization'  transformers.\n\nHowever, sometimes developers or data scientists would like to execute a set of transformers locally. This can be \nduring the development process or due to the size of the processed data sets.\n\nThis package demonstrates two options for how this can be done.\n\n## Data Processing Flows\n\n[Flow](./src/data_processing_flows/flow.py) iis a Python representation of a workflow definition. It defines a set of \nsteps that should be executed and a set of global parameters that are common to all steps. Each step can overwrite \nits corresponding parameters. To provide a \"real\" data-flow, _Flow_ automatically connects the input of each step to \nthe output of the previous one. The global parameter set defines only the entire _Flow_ input and output, \nwhich are set to the first and last steps, respectively.\n\nCurrently, _Flow_ supports pure Python local transformers and Ray local transformers. Different transformer types can be \npart of the same _Flow_. Other transformer types will be added later.\n\n### Flow creation\nWe provide two options of _Flow_ creation: programmatically or by [flow_loader](./src/data_processing_flows/flow_loader.py) \nfrom a JSON file. _Flow_ JSON schema is defined in [flow_schema.json](./src/data_processing_flows/flow_schema.json).\n\nThe [examples](./Examples) directory demonstrates creation of a simple _Flow_ with 3 steps: transformation of PDF files \ninto parquet files, document identification and noop transformation. When the pdf and noop transformations are pure \npython transformers, and the document identification is a Ray local transformation.\nYou can see the JSON _Flow_ definition at [flow_example.json](./Examples/flow_example.json) and its execution at \n[run_example.py](./Examples/run_example.py). The file [flow_example.py](./Examples/flow_example.py) does the same \nprogrammatically.\n\n### Flow execution from a Jupyter notebook\nThe [wdu jupyter notebook](./Examples/wdu.ipynb) is an example of a flow with one step of WDU transform. We can run this by running the following commands from [flows](.) directory:\n```\nmake venv   # need to run only during the first execution\n. ./venv/bin/activate\npip install jupyter\npip install ipykernel\n\n# execute the Jupyter\npython -m ipykernel install --user --name=venv --display-name=\"Python venv\"\njupyter notebook\n\n```\nThe notebook exists in [wdu.ipynb](./Examples/wdu.ipynb)\n\n## KFP Local Pipelines\nKFPv2 added an option to execute components and pipelines locally, see [Execute KFP pipelines locally\nOverview ](https://www.kubeflow.org/docs/components/pipelines/user-guides/core-functions/execute-kfp-pipelines-locally/). \nDepending on the user's knowledge and preferences, this feature can be another option for executing workflows locally.  \n\nKFP supports two local runner types which indicate how and where executed components should be executed: _DockerRunner_ \nand _SubprocessRunner_.\nDockerRunner is the recommended option, because it executes each task in a separate container.\nIt offers the strongest form of local runtime environment isolation; is most faithful to the remote runtime environment, \nbut might require a prebuilt docker image. \n\nThe files [local_subprocess_wdu_docId_noop_super_pipeline.py](./Examples/local_subprocess_wdu_docId_noop_super_pipeline.py) \nand [local_docker_wdu_docId_noop_super_pipeline.py](./Examples/local_docker_wdu_docId_noop_super_pipeline.py) \ndemonstrate a KFP local definition of the same workflow using _SubprocessRunner_ and _DockerRunner_, respectively.\n\n**_Note:_** in order to execute the transformation of PDF files into parquet files you should be connected to the IBM \nIntranet with the \"TUNNELAL\" VPN\n\n## Next Steps\n- Extend support to S3 data sources\n- Support Spark transformers\n- Support isolated virtual environments by executing _FlowSteps_ in subprocesses.\n- Investigate more KFP local opportunities\n",
    "bugtrack_url": null,
    "license": "Apache-2.0",
    "summary": "Data Preparation Toolkit Library for creation and execution of ttansformers flows",
    "version": "0.2.0",
    "project_urls": null,
    "split_keywords": [
        "data",
        " data preprocessing",
        " data preparation",
        " llm",
        " generative",
        " ai",
        " fine-tuning",
        " llmapps"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "91b7eae435e407c100bc991002f48e17649226cbfd346951eb3d6ca09741e608",
                "md5": "827ce69c61b580f0850db89ad2c493c2",
                "sha256": "d58fb50a29d62e9a2e6e2d3baf23a2636403db67a6297a7a4ab495c0c567a58a"
            },
            "downloads": -1,
            "filename": "data_prep_toolkit_flows-0.2.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "827ce69c61b580f0850db89ad2c493c2",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10",
            "size": 7755,
            "upload_time": "2024-07-09T18:48:20",
            "upload_time_iso_8601": "2024-07-09T18:48:20.144027Z",
            "url": "https://files.pythonhosted.org/packages/91/b7/eae435e407c100bc991002f48e17649226cbfd346951eb3d6ca09741e608/data_prep_toolkit_flows-0.2.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "73e73dcb33197417ddce791d0630839c2e9e96ff11edce405baa6a91903fbce2",
                "md5": "616788052138826d9381b797e5aea02c",
                "sha256": "33bb5606f46972c8f0cb7de0b3eed8e9fcf5b989c7c1cbe9eabcf365fb63c0bd"
            },
            "downloads": -1,
            "filename": "data_prep_toolkit_flows-0.2.0.tar.gz",
            "has_sig": false,
            "md5_digest": "616788052138826d9381b797e5aea02c",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 2295325,
            "upload_time": "2024-07-09T18:48:23",
            "upload_time_iso_8601": "2024-07-09T18:48:23.991885Z",
            "url": "https://files.pythonhosted.org/packages/73/e7/3dcb33197417ddce791d0630839c2e9e96ff11edce405baa6a91903fbce2/data_prep_toolkit_flows-0.2.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-07-09 18:48:23",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "data-prep-toolkit-flows"
}
        
Elapsed time: 1.45102s