# DataFlow
<div align="center">
<img src="./static/images/Face.jpg">
[📖 Documentation](https://OpenDCAI.github.io/DataFlow-Doc/) | [License](https://github.com/OpenDCAI/DataFlow/blob/main/LICENSE) | [GitHub](https://github.com/OpenDCAI/DataFlow) | [Contributors](https://github.com/OpenDCAI/DataFlow/graphs/contributors) | [DeepWiki](https://deepwiki.com/OpenDCAI/DataFlow)
🎉 If you like our project, please give us a star ⭐ on GitHub to get the latest updates.
**Beginner-friendly learning resources (continuously updated)**: 🎬 [DataFlow Video Tutorials](https://space.bilibili.com/3546929239689711?spm_id_from=333.337.0.0); 📚 [DataFlow Written Tutorials](https://wcny4qa9krto.feishu.cn/wiki/I9tbw2qnBi0lEakmmAGclTysnFd)
[简体中文](./README-zh.md) | English
</div>
https://github.com/user-attachments/assets/19742159-cfe0-42a6-9d3d-152466d2d588
## 📰 1. News
🎉 [2025-06-28] We’re excited to announce that DataFlow, our Data-centric AI system, is now released! Stay tuned for future updates.
## 🔍 2. Overview
<img src="./static/images/dataflow_framework.jpg">
DataFlow is a data preparation and training system designed to **parse, generate, process, and evaluate** high-quality data from noisy sources (PDFs, plain text, low-quality QA pairs), thereby improving the performance of large language models (LLMs) in specific domains through targeted training (pre-training, supervised fine-tuning, RL training) or through RAG backed by knowledge-base cleaning. **DataFlow has been empirically validated to improve domain-oriented LLM performance in fields such as healthcare, finance, and law.**
Specifically, we construct diverse `operators` leveraging rule-based methods, deep learning models, LLMs, and LLM APIs. These operators are systematically integrated into distinct `pipelines`, collectively forming the comprehensive `DataFlow system`. Additionally, we provide an intelligent `DataFlow-agent` capable of dynamically assembling new `pipelines` by recombining existing `operators` on demand.
<!-- Text: input is noisy data; an LLM produces QA pairs (mainly for reinforcement learning)
NL2SQL: reverse-construct SQL QA pairs
Reasoning: questions are short; build long chain-of-thought, plus category and difficulty labels (via an LLM)
Agentic RAG: QA in, QA out; keeps questions that cannot be answered without introducing external information
Knowledge Base Cleaning: PDFs, tables + document text in; a high-quality knowledge base out
Dataflow-agent: uses an agent to automatically synthesize pipelines by orchestrating existing operators -->
## 🛠️ 3. Operators Functionality
### 🔧 3.1 How Operators Work
DataFlow adopts a modular operator design philosophy, building flexible data processing pipelines by combining different types of operators. As the basic unit of data processing, an operator receives structured data input (e.g., in json/jsonl/csv format) and, after processing, outputs higher-quality data. For a detailed guide on using operators, please refer to the [Operator Documentation](https://opendcai.github.io/DataFlow-Doc/en/guide/text_evaluation_operators/).
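As a minimal illustration of the operator idea (the record field names here are assumptions for this sketch, not DataFlow's actual schema or API), an operator can be thought of as a function mapping a list of JSONL records to a filtered or enriched list:

```python
import json

def keep_long_answers(records, min_len=20):
    """Toy filter 'operator': keep QA records whose answer is at least min_len characters."""
    return [r for r in records if len(r.get("answer", "")) >= min_len]

# One record per line, as in a .jsonl file.
raw = [
    '{"question": "What is DataFlow?", "answer": "A data preparation and training system for LLM-centric data."}',
    '{"question": "Hmm?", "answer": "Yes."}',
]
records = [json.loads(line) for line in raw]
filtered = keep_long_answers(records)
print(len(filtered))  # the short-answer record is dropped
```

Real DataFlow operators go well beyond such rules (model- and LLM-based scoring, generation, and rewriting), but the input/output contract is the same shape: structured records in, structured records out.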

### 📊 3.2 Operator Classification System
In the DataFlow framework, operators are divided into three core categories based on their functional characteristics:
| Operator Type | Quantity | Main Function |
|---|---|---|
| **Generic Operators** | 80+ | Covers general functions for text evaluation, processing, and synthesis |
| **Domain-Specific Operators** | 40+ | Specialized processing for specific domains (e.g., medical, financial, legal) |
| **Evaluation Operators** | 20+ | Comprehensively evaluates data quality across six dimensions |
## 🛠️ 4. Pipelines Functionality
### 🔧 4.1 Ready-to-Use Pipelines
Current pipelines in DataFlow are as follows:
- [📝 **Text Pipeline**](https://opendcai.github.io/DataFlow-Doc/en/guide/textpipeline): Mines question-answer pairs from large-scale plain-text data (mostly crawled from the Internet) for use in SFT and RL training.
  - [[HuggingFace🤗 demo input & output for **Text Pipeline**]](https://huggingface.co/datasets/Open-Dataflow/dataflow-demo-Text)
- [🧠 **Reasoning Pipeline**](https://opendcai.github.io/DataFlow-Doc/en/guide/reasoningpipeline/#_2-question-handling): Enhances existing question–answer pairs with (1) extended chain-of-thought, (2) category classification, and (3) difficulty estimation.
  - [[HuggingFace🤗 demo input & output for **Reasoning Pipeline**]](https://huggingface.co/datasets/Open-Dataflow/dataflow-demo-Reasonning)
- [🗃️ **Text2SQL Pipeline**](https://opendcai.github.io/DataFlow-Doc/en/guide/text2sqlpipeline/): Translates natural language questions into SQL queries, supplemented with explanations, chain-of-thought reasoning, and contextual schema information.
  - [[HuggingFace🤗 demo input & output for **Text2SQL Pipeline**]](https://huggingface.co/datasets/Open-Dataflow/dataflow-demo-Text2SQL)
- [📚 **Knowledge Base Cleaning Pipeline**](https://opendcai.github.io/DataFlow-Doc/en/guide/r51ooua8/): Extracts and structures knowledge from unorganized sources such as tables, PDFs, and Word documents into usable entries for downstream RAG or QA-pair generation.
- [🤖 **Agentic RAG Pipeline**](https://opendcai.github.io/DataFlow-Doc/en/guide/agenticrag_pipeline/): Identifies and extracts QA pairs from existing QA datasets or knowledge bases that require external knowledge to answer, for use in downstream training of Agentic RAG tasks.
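Pipelines like the ones above emit QA records that feed directly into training. As a hedged sketch of that hand-off (the `question`/`answer` field names and the chat-message layout are assumptions for illustration; the exact schema a given trainer expects varies), converting QA JSONL into a chat-style SFT format looks like:

```python
import json

def qa_to_sft(jsonl_text):
    """Convert QA records (assumed 'question'/'answer' fields) into a
    chat-message SFT format commonly accepted by fine-tuning frameworks."""
    examples = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines between records
        rec = json.loads(line)
        examples.append({
            "messages": [
                {"role": "user", "content": rec["question"]},
                {"role": "assistant", "content": rec["answer"]},
            ]
        })
    return examples

demo = '{"question": "Translate to SQL: count users", "answer": "SELECT COUNT(*) FROM users;"}'
print(qa_to_sft(demo)[0]["messages"][0]["role"])  # -> user
```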
### ⚙️ 4.2 Flexible Operator Pipelines
In this framework, operators are categorized into Fundamental Operators, Generic Operators, Domain-Specific Operators, and Evaluation Operators, supporting both data processing and evaluation. Please refer to the [documentation](https://OpenDCAI.github.io/DataFlow-Doc/) for details.
### 🤖 4.3 Agent Guided Pipelines
<!-- Building on top of this, we also provide the -->
- **DataFlow Agent**: An intelligent assistant that performs data analysis, writes custom `operators`, and automatically orchestrates them into `pipelines` based on specific task objectives.
  - [[HuggingFace🤗 demo input & output for **DataFlow Agent**]](https://huggingface.co/datasets/Open-Dataflow/dataflow-demo-Agent)
## ⚡ 5. Quick Start
### 🛠️ 5.1 Environment Setup and Installation
Please use the following commands for environment setup and installation👇
```shell
conda create -n dataflow python=3.10
conda activate dataflow
pip install open-dataflow
```
If you want to use your own GPU for local inference, please use:
```shell
pip install "open-dataflow[vllm]"
```
> DataFlow requires Python >= 3.10.
After installation, run the following command to check whether DataFlow was installed correctly:
```shell
dataflow -v
```
If installed correctly, you should see:
```log
open-dataflow codebase version: 1.0.0
Checking for updates...
Local version: 1.0.0
PyPI newest version: 1.0.0
You are using the latest version: 1.0.0.
```
### 🚀 5.2 Using the Gradio Web Interface
DataFlow provides two interactive web interfaces to help you use operators, pipelines, and agents:
#### 5.2.1 DataFlow Operators Interface
Launch the DataFlow operator interface to test and visualize all operators and pipelines:
```bash
dataflow webui
```
This command will start an interactive web interface, allowing you to visualize and flexibly use all operators and pipelines.
#### 5.2.2 DataFlow Agent Interface
Launch the DataFlow agent interface for operator authoring and pipeline design:
```bash
dataflow webui agent
```
This command will start the DataFlow-Agent interface, providing automated operator authoring and pipeline recommendation services.
https://github.com/user-attachments/assets/fda1ad47-a9f3-447a-b5c0-cf4c9ad64763
### 🌐 5.3 ADP Intelligent Data Platform
Beyond the local Gradio interface, **DataFlow** is also available as a fully-managed SaaS solution on the **ADP Intelligent Data Platform**.
[**ADP**](https://adp.originhub.tech) is an end-to-end system by OriginHub, designed to help enterprises accelerate the development of custom Agents and Models by integrating Large Language Models (LLMs) with private data.
#### Core Capabilities:
* 🤖 **Automated Data Preparation**: Leverage DataFlow for full-process automation of your data workflows.
* 📚 **Unified Knowledge System**: Integrate and manage large-scale, multimodal knowledge bases.
* 🤝 **Intelligent Collaboration**: Build and orchestrate powerful multi-agent systems.
* 🗄️ **AI-Native Database**: Manage the full lifecycle of your multimodal data with a purpose-built AI database.
<p align="center">
<a href="https://adp.originhub.tech/login">
<img src="./static/images/ADP.jpg" alt="ADP Platform Interface" width="75%">
</a>
</p>
#### Get Started for Free
👉 **[Sign up now to claim your free compute credits!](https://adp.originhub.tech)**
### 📖 5.4 Reference Project Documentation
For detailed **usage instructions** and **getting started guide**, please visit our [Documentation](https://OpenDCAI.github.io/DataFlow-Doc/).
## 🧪 6. Experimental Results
For detailed experiment settings, please visit our documentation.
### 📝 6.1 Text Pipeline
#### 6.1.1 Pre-training data filter pipeline
The `pre-training data processing pipeline` was applied to randomly sampled data from the RedPajama dataset, resulting in a final data retention rate of 13.65%. The analysis results using `QuratingScorer` are shown in the figure. As can be seen, the filtered pretraining data significantly outperforms the original data across four scoring dimensions: writing style, requirement for expert knowledge, factual content, and educational value. This demonstrates the effectiveness of the DataFlow pretraining data processing.
<div align="center">
<img src="./static/images/text-pretrain.png" width="60%">
</div>
#### 6.1.2 SFT data filter pipeline
We filtered 3k records from the `alpaca` dataset and compared them against 3k randomly selected records from `alpaca` by fine-tuning Qwen2.5-7B on each. Results are:
<div align="center">
<img src="./static/images/text-sft.png" width="60%">
</div>
### 🧠 6.2 Reasoning Pipeline
We verify our Reasoning Pipeline via SFT on Qwen2.5-32B-Instruct using data synthesized by the pipeline, generating 1k and 5k SFT data pairs. Results are:
<div align="center">
<img src="./static/images/reasoning_performance.png" width="60%">
</div>
### 🗃️ 6.3 Text2SQL Pipeline
We fine-tuned the Qwen2.5-Coder-7B-Instruct model using both Supervised Fine-tuning (SFT) and Reinforcement Learning (RL), with data constructed via the DataFlow-Text2SQL Pipeline. Results are:
<div align="center">
<img src="./static/images/text2sql.png" width="60%">
</div>
## 📄 7. Publications
Our team has published the following papers that form core components of the DataFlow system:
| Paper Title | DataFlow Component | Venue | Year |
|-------------|-------------------|-------|------|
| [MM-Verify: Enhancing Multimodal Reasoning with Chain-of-Thought Verification](https://arxiv.org/pdf/2502.13383) | Multimodal reasoning verification framework for data processing and evaluation | ACL | 2025 |
| [Efficient Pretraining Data Selection for Language Models via Multi-Actor Collaboration](https://arxiv.org/pdf/2410.08102) | Multi-actor collaborative data selection mechanism for enhanced data filtering and processing | ACL | 2025 |
**Contributing Institutions**:
<img src="./static/logo/pku.png" alt="PKU" height="30"/>
<img src="./static/logo/hkust.png" alt="HKUST" height="30"/>
<img src="./static/logo/CAS.png" alt="CAS" height="30"/>
<img src="./static/logo/shanghai_ailab.png" alt="Shanghai AI Lab" height="30"/>
<img src="./static/logo/baichuan.png" alt="Baichuan" height="30"/>
<img src="./static/logo/ant_group.png" alt="Ant Group" height="30"/>
## 💐 8. Acknowledgements
We sincerely appreciate [MinerU](https://github.com/opendatalab/MinerU)'s outstanding contribution, particularly its robust text extraction capabilities from PDFs and documents, which greatly facilitates data loading.
## 🤝 9. Community & Support
Join the DataFlow open-source community to ask questions, share ideas, and collaborate with other developers!
• 📮 [GitHub Issues](../../issues): Report bugs or suggest features
• 🔧 [GitHub Pull Requests](../../pulls): Contribute code improvements
• 💬 Join our community groups to connect with us and other contributors!
<div align="center">
<img src="./static/images/community_en.jpg" width="60%">
</div>
## 📜 10. Citation
If you use DataFlow in your research, please cite us:
```bibtex
@misc{dataflow2025,
author = {DataFlow Develop Team},
title = {DataFlow: A Unified Framework for Data-Centric AI},
year = {2025},
howpublished = {\url{https://github.com/OpenDCAI/DataFlow}},
note = {Accessed: 2025-07-08}
}
```
## 📊 11. Statistics
<div align="center">
<a href="https://star-history.com/#OpenDCAI/DataFlow&Date">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://api.star-history.com/svg?repos=OpenDCAI/DataFlow&type=Date&theme=dark" />
<source media="(prefers-color-scheme: light)" srcset="https://api.star-history.com/svg?repos=OpenDCAI/DataFlow&type=Date" />
<img alt="Star History Chart" src="https://api.star-history.com/svg?repos=OpenDCAI/DataFlow&type=Date" style="width:50%;" />
</picture>
</a>
</div>
---
<div align="center">
<sub>
Connect with the
<a href="https://zwt233.github.io/" target="_blank"><strong>PKU-DCAI Research Team</strong></a>
on Xiaohongshu: <strong>26133106768</strong>
</sub>
</div>