# DataFlow
<div align="center">
<img src="./static/images/Face.jpg">
[📖 Documentation](https://OpenDCAI.github.io/DataFlow-Doc/) | [License](https://github.com/OpenDCAI/DataFlow/blob/main/LICENSE) | [GitHub](https://github.com/OpenDCAI/DataFlow) | [Contributors](https://github.com/OpenDCAI/DataFlow/graphs/contributors) | [DeepWiki](https://deepwiki.com/OpenDCAI/DataFlow)
🎉 If you like our project, please give us a star ⭐ on GitHub to get the latest updates.
**Beginner-friendly learning resources (continuously updated)**: 🎬 [DataFlow Video Tutorials](https://space.bilibili.com/3546929239689711?spm_id_from=333.337.0.0); 📚 [DataFlow Written Tutorials](https://wcny4qa9krto.feishu.cn/wiki/I9tbw2qnBi0lEakmmAGclTysnFd)
[简体中文](./README-zh.md) | English
</div>
https://github.com/user-attachments/assets/19742159-cfe0-42a6-9d3d-152466d2d588
## 📰 1. News
🎉 [2025-06-28] We’re excited to announce that DataFlow, our Data-centric AI system, is now released! Stay tuned for future updates.
## 🔍 2. Overview
<img src="./static/images/dataflow_framework.jpg">
DataFlow is a data preparation and training system designed to **parse, generate, process, and evaluate** high-quality data from noisy sources (PDFs, plain text, low-quality QA pairs), thereby improving the performance of large language models (LLMs) in specific domains through targeted training (pre-training, supervised fine-tuning, RL training) or through RAG backed by knowledge-base cleaning. **DataFlow has been empirically validated to improve domain-oriented LLM performance in fields such as healthcare, finance, and law.**
Specifically, we construct diverse `operators` leveraging rule-based methods, deep learning models, LLMs, and LLM APIs. These operators are systematically integrated into distinct `pipelines`, collectively forming the comprehensive `DataFlow system`. Additionally, we provide an intelligent `DataFlow-agent` capable of dynamically assembling new `pipelines` by recombining existing `operators` on demand.
<!-- Text: input is noisy data; an LLM produces QA pairs (mainly for reinforcement learning)
NL2SQL: reverse-construct SQL QA pairs
Reasoning: questions are short; build long chain-of-thought, plus category and difficulty labels (via an LLM)
Agentic RAG: QA in, QA out; keeps questions that cannot be answered without introducing external information
Knowledge Base Cleaning: PDFs, tables + document text in; a high-quality knowledge base out
Dataflow-agent: uses an agent to automatically synthesize pipelines by orchestrating existing operators -->
## 🛠️ 3. Operators Functionality
### 🔧 3.1 How Operators Work
DataFlow adopts a modular operator design philosophy, building flexible data processing pipelines by combining different types of operators. As the basic unit of data processing, an operator receives structured data input (e.g., in json/jsonl/csv format) and, after processing, outputs higher-quality data. For a detailed guide on using operators, please refer to the [Operator Documentation](https://opendcai.github.io/DataFlow-Doc/en/guide/text_evaluation_operators/).
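As a minimal illustration of the operator idea (the record field names here are assumptions for this sketch, not DataFlow's actual schema or API), an operator can be thought of as a function mapping a list of JSONL records to a filtered or enriched list:

```python
import json

def keep_long_answers(records, min_len=20):
    """Toy filter 'operator': keep QA records whose answer is at least min_len characters."""
    return [r for r in records if len(r.get("answer", "")) >= min_len]

# One record per line, as in a .jsonl file.
raw = [
    '{"question": "What is DataFlow?", "answer": "A data preparation and training system for LLM-centric data."}',
    '{"question": "Hmm?", "answer": "Yes."}',
]
records = [json.loads(line) for line in raw]
filtered = keep_long_answers(records)
print(len(filtered))  # the short-answer record is dropped
```

Real DataFlow operators go well beyond such rules (model- and LLM-based scoring, generation, and rewriting), but the input/output contract is the same shape: structured records in, structured records out.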

### 📊 3.2 Operator Classification System
In the DataFlow framework, operators are divided into three core categories based on their functional characteristics:
| Operator Type | Quantity | Main Function |
|---|---|---|
| **Generic Operators** | 80+ | Covers general functions for text evaluation, processing, and synthesis |
| **Domain-Specific Operators** | 40+ | Specialized processing for specific domains (e.g., medical, financial, legal) |
| **Evaluation Operators** | 20+ | Comprehensively evaluates data quality across six dimensions |
## 🛠️ 4. Pipelines Functionality
### 🔧 4.1 Ready-to-Use Pipelines
Current pipelines in DataFlow are as follows:
- [📝 **Text Pipeline**](https://opendcai.github.io/DataFlow-Doc/en/guide/textpipeline): Mines question-answer pairs from large-scale plain-text data (mostly crawled from the Internet) for use in SFT and RL training.
  - [[HuggingFace🤗 demo input & output for **Text Pipeline**]](https://huggingface.co/datasets/Open-Dataflow/dataflow-demo-Text)
- [🧠 **Reasoning Pipeline**](https://opendcai.github.io/DataFlow-Doc/en/guide/reasoningpipeline/#_2-question-handling): Enhances existing question–answer pairs with (1) extended chain-of-thought, (2) category classification, and (3) difficulty estimation.
  - [[HuggingFace🤗 demo input & output for **Reasoning Pipeline**]](https://huggingface.co/datasets/Open-Dataflow/dataflow-demo-Reasonning)
- [🗃️ **Text2SQL Pipeline**](https://opendcai.github.io/DataFlow-Doc/en/guide/text2sqlpipeline/): Translates natural language questions into SQL queries, supplemented with explanations, chain-of-thought reasoning, and contextual schema information.
  - [[HuggingFace🤗 demo input & output for **Text2SQL Pipeline**]](https://huggingface.co/datasets/Open-Dataflow/dataflow-demo-Text2SQL)
- [📚 **Knowledge Base Cleaning Pipeline**](https://opendcai.github.io/DataFlow-Doc/en/guide/r51ooua8/): Extracts and structures knowledge from unorganized sources such as tables, PDFs, and Word documents into usable entries for downstream RAG or QA-pair generation.
- [🤖 **Agentic RAG Pipeline**](https://opendcai.github.io/DataFlow-Doc/en/guide/agenticrag_pipeline/): Identifies and extracts QA pairs from existing QA datasets or knowledge bases that require external knowledge to answer, for use in downstream training of Agentic RAG tasks.
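Pipelines like the ones above emit QA records that feed directly into training. As a hedged sketch of that hand-off (the `question`/`answer` field names and the chat-message layout are assumptions for illustration; the exact schema a given trainer expects varies), converting QA JSONL into a chat-style SFT format looks like:

```python
import json

def qa_to_sft(jsonl_text):
    """Convert QA records (assumed 'question'/'answer' fields) into a
    chat-message SFT format commonly accepted by fine-tuning frameworks."""
    examples = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines between records
        rec = json.loads(line)
        examples.append({
            "messages": [
                {"role": "user", "content": rec["question"]},
                {"role": "assistant", "content": rec["answer"]},
            ]
        })
    return examples

demo = '{"question": "Translate to SQL: count users", "answer": "SELECT COUNT(*) FROM users;"}'
print(qa_to_sft(demo)[0]["messages"][0]["role"])  # -> user
```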
### ⚙️ 4.2 Flexible Operator Pipelines
In this framework, operators are categorized into Fundamental Operators, Generic Operators, Domain-Specific Operators, and Evaluation Operators, supporting both data processing and evaluation. Please refer to the [documentation](https://OpenDCAI.github.io/DataFlow-Doc/) for details.
### 🤖 4.3 Agent Guided Pipelines
<!-- Building on top of this, we also provide the -->
- **DataFlow Agent**: An intelligent assistant that performs data analysis, writes custom `operators`, and automatically orchestrates them into `pipelines` based on specific task objectives.
  - [[HuggingFace🤗 demo input & output for **DataFlow Agent**]](https://huggingface.co/datasets/Open-Dataflow/dataflow-demo-Agent)
## ⚡ 5. Quick Start
### 🛠️ 5.1 Environment Setup and Installation
Please use the following commands for environment setup and installation👇
```shell
conda create -n dataflow python=3.10
conda activate dataflow
pip install open-dataflow
```
If you want to use your own GPU for local inference, please use:
```shell
pip install "open-dataflow[vllm]"
```
> DataFlow requires Python >= 3.10.
After installation, run the following command to check whether DataFlow was installed correctly:
```shell
dataflow -v
```
If installed correctly, you should see:
```log
open-dataflow codebase version: 1.0.0
Checking for updates...
Local version: 1.0.0
PyPI newest version: 1.0.0
You are using the latest version: 1.0.0.
```
### 🚀 5.2 Using the Gradio Web Interface
DataFlow provides two interactive web interfaces to help you use operators, pipelines, and agents:
#### 5.2.1 DataFlow Operators Interface
Launch the DataFlow operator interface to test and visualize all operators and pipelines:
```bash
dataflow webui
```
This command will start an interactive web interface, allowing you to visualize and flexibly use all operators and pipelines.
#### 5.2.2 DataFlow Agent Interface
Launch the DataFlow agent interface for operator authoring and pipeline design:
```bash
dataflow webui agent
```
This command will start the DataFlow-Agent interface, providing automated operator authoring and pipeline recommendation services.
https://github.com/user-attachments/assets/fda1ad47-a9f3-447a-b5c0-cf4c9ad64763
### 🌐 5.3 ADP Intelligent Data Platform
Beyond the local Gradio interface, **DataFlow** is also available as a fully-managed SaaS solution on the **ADP Intelligent Data Platform**.
[**ADP**](https://adp.originhub.tech) is an end-to-end system by OriginHub, designed to help enterprises accelerate the development of custom Agents and Models by integrating Large Language Models (LLMs) with private data.
#### Core Capabilities:
* 🤖 **Automated Data Preparation**: Leverage DataFlow for full-process automation of your data workflows.
* 📚 **Unified Knowledge System**: Integrate and manage large-scale, multimodal knowledge bases.
* 🤝 **Intelligent Collaboration**: Build and orchestrate powerful multi-agent systems.
* 🗄️ **AI-Native Database**: Manage the full lifecycle of your multimodal data with a purpose-built AI database.
<p align="center">
<a href="https://adp.originhub.tech/login">
<img src="./static/images/ADP.jpg" alt="ADP Platform Interface" width="75%">
</a>
</p>
#### Get Started for Free
👉 **[Sign up now to claim your free compute credits!](https://adp.originhub.tech)**
### 📖 5.4 Reference Project Documentation
For detailed **usage instructions** and **getting started guide**, please visit our [Documentation](https://OpenDCAI.github.io/DataFlow-Doc/).
## 🧪 6. Experimental Results
For detailed experiment settings, please visit our documentation.
### 📝 6.1 Text Pipeline
#### 6.1.1 Pre-training data filter pipeline
The `pre-training data processing pipeline` was applied to randomly sampled data from the RedPajama dataset, resulting in a final data retention rate of 13.65%. The analysis results using `QuratingScorer` are shown in the figure. As can be seen, the filtered pretraining data significantly outperforms the original data across four scoring dimensions: writing style, requirement for expert knowledge, factual content, and educational value. This demonstrates the effectiveness of the DataFlow pretraining data processing.
<div align="center">
<img src="./static/images/text-pretrain.png" width="60%">
</div>
#### 6.1.2 SFT data filter pipeline
We filtered 3k records from the `alpaca` dataset and compared them against 3k randomly selected records from `alpaca` by fine-tuning Qwen2.5-7B on each. Results are:
<div align="center">
<img src="./static/images/text-sft.png" width="60%">
</div>
### 🧠 6.2 Reasoning Pipeline
We verify our Reasoning Pipeline via SFT on Qwen2.5-32B-Instruct using data synthesized by the pipeline, generating 1k and 5k SFT data pairs. Results are:
<div align="center">
<img src="./static/images/reasoning_performance.png" width="60%">
</div>
### 🗃️ 6.3 Text2SQL Pipeline
We fine-tuned the Qwen2.5-Coder-7B-Instruct model using both Supervised Fine-tuning (SFT) and Reinforcement Learning (RL), with data constructed via the DataFlow-Text2SQL Pipeline. Results are:
<div align="center">
<img src="./static/images/text2sql.png" width="60%">
</div>
## 📄 7. Publications
Our team has published the following papers that form core components of the DataFlow system:
| Paper Title | DataFlow Component | Venue | Year |
|-------------|-------------------|-------|------|
| [MM-Verify: Enhancing Multimodal Reasoning with Chain-of-Thought Verification](https://arxiv.org/pdf/2502.13383) | Multimodal reasoning verification framework for data processing and evaluation | ACL | 2025 |
| [Efficient Pretraining Data Selection for Language Models via Multi-Actor Collaboration](https://arxiv.org/pdf/2410.08102) | Multi-actor collaborative data selection mechanism for enhanced data filtering and processing | ACL | 2025 |
**Contributing Institutions**:
<img src="./static/logo/pku.png" alt="PKU" height="30"/>
<img src="./static/logo/hkust.png" alt="HKUST" height="30"/>
<img src="./static/logo/CAS.png" alt="CAS" height="30"/>
<img src="./static/logo/shanghai_ailab.png" alt="Shanghai AI Lab" height="30"/>
<img src="./static/logo/baichuan.png" alt="Baichuan" height="30"/>
<img src="./static/logo/ant_group.png" alt="Ant Group" height="30"/>
## 💐 8. Acknowledgements
We sincerely appreciate [MinerU](https://github.com/opendatalab/MinerU)'s outstanding contribution, particularly its robust text extraction capabilities from PDFs and documents, which greatly facilitates data loading.
## 🤝 9. Community & Support
Join the DataFlow open-source community to ask questions, share ideas, and collaborate with other developers!
• 📮 [GitHub Issues](../../issues): Report bugs or suggest features
• 🔧 [GitHub Pull Requests](../../pulls): Contribute code improvements
• 💬 Join our community groups to connect with us and other contributors!
<div align="center">
<img src="./static/images/community_en.jpg" width="60%">
</div>
## 📜 10. Citation
If you use DataFlow in your research, please cite us:
```bibtex
@misc{dataflow2025,
author = {DataFlow Develop Team},
title = {DataFlow: A Unified Framework for Data-Centric AI},
year = {2025},
howpublished = {\url{https://github.com/OpenDCAI/DataFlow}},
note = {Accessed: 2025-07-08}
}
```
## 📊 11. Statistics
<div align="center">
<a href="https://star-history.com/#OpenDCAI/DataFlow&Date">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://api.star-history.com/svg?repos=OpenDCAI/DataFlow&type=Date&theme=dark" />
<source media="(prefers-color-scheme: light)" srcset="https://api.star-history.com/svg?repos=OpenDCAI/DataFlow&type=Date" />
<img alt="Star History Chart" src="https://api.star-history.com/svg?repos=OpenDCAI/DataFlow&type=Date" style="width:50%;" />
</picture>
</a>
</div>
---
<div align="center">
<sub>
Connect with the
<a href="https://zwt233.github.io/" target="_blank"><strong>PKU-DCAI Research Team</strong></a>
on Xiaohongshu: <strong>26133106768</strong>
</sub>
</div>