# DataFlow
<div align="center">
<img src="./static/images/Face.jpg">
🎉 If you like our project, please give us a star ⭐ on GitHub to receive the latest updates.
[简体中文](./README-zh.md) | English
**[🚀 Features](#Features) • [⚡ Quick Start](#Quick_Start) • [📖 Documentation](https://OpenDCAI.github.io/DataFlow-Doc/) • [🧪 Experiments](#Experiments)**
</div>
https://github.com/user-attachments/assets/05e047a5-99bb-4043-bc71-2b5ccdab2126
## 📰 1. News
🎉 [2025-06-28] We’re excited to announce that DataFlow, our Data-centric AI system, is now released! Stay tuned for future updates.
## 🔍 2. Overview
<img src="./static/images/dataflow_framework.jpg">
DataFlow is a data preparation and training system designed to **parse, generate, process, and evaluate** high-quality data from noisy sources (PDF, plain text, low-quality QA), thereby improving the performance of large language models (LLMs) in specific domains through targeted training (pre-training, supervised fine-tuning, RL training) or through RAG with knowledge base cleaning. **DataFlow has been empirically validated to improve the performance of domain-oriented LLMs in fields such as healthcare, finance, and law.**
Specifically, we construct diverse `operators` that leverage rule-based methods, deep learning models, LLMs, and LLM APIs. These operators are systematically integrated into distinct `pipelines`, which together form the `DataFlow system`. Additionally, we develop an intelligent `DataFlow-agent` capable of dynamically assembling new `pipelines` by recombining existing `operators` on demand.
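To make the relationship between operators and pipelines concrete, here is a minimal, self-contained sketch in plain Python. It is not DataFlow's actual API: the `Pipeline` class, the `Operator` type alias, and the two toy operators are hypothetical names chosen only for illustration.

```python
from typing import Callable, Iterable

# Hypothetical types for this sketch (not DataFlow's API):
# a record is a plain dict, and an operator maps a list of records
# to a new list of records.
Record = dict
Operator = Callable[[list[Record]], list[Record]]


def normalize_whitespace(records: list[Record]) -> list[Record]:
    """Rule-based refining operator: collapse runs of whitespace."""
    return [{**r, "text": " ".join(r["text"].split())} for r in records]


def drop_short_texts(min_chars: int) -> Operator:
    """Build a rule-based filtering operator with a length threshold."""
    def run(records: list[Record]) -> list[Record]:
        return [r for r in records if len(r["text"]) >= min_chars]
    return run


class Pipeline:
    """Applies a fixed sequence of operators, in order."""

    def __init__(self, operators: Iterable[Operator]):
        self.operators = list(operators)

    def run(self, records: list[Record]) -> list[Record]:
        for operator in self.operators:
            records = operator(records)
        return records


pipeline = Pipeline([normalize_whitespace, drop_short_texts(min_chars=10)])
cleaned = pipeline.run([
    {"text": "ok"},
    {"text": "DataFlow   cleans \n noisy text."},
])
print(cleaned)  # [{'text': 'DataFlow cleans noisy text.'}]
```

Because every operator shares the same input and output shape, a new pipeline is just a different ordering or selection of existing operators, which is the property an agent can exploit when recombining them.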
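<!-- Text: input is poor-quality data; an LLM outputs QA pairs (mainly for reinforcement learning)
NL2SQL: construct SQL QA pairs in reverse
Reasoning: questions are short; build long-chain CoT; whether there is a category; whether there is a difficulty level (via an LLM)
Agentic RAG: input is QA, output is QA. These cannot be solved without extra information, which must be brought in
Knowledge Base Cleaning: PDF, tables + doc text as input; output is a high-quality knowledge base
Dataflow-agent: use an Agent to automatically synthesize pipelines by orchestrating existing operators. -->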
## 🛠️ 3. Pipelines Functionality
### 🔧 3.1 Ready-to-Use Pipelines
DataFlow currently provides the following pipelines:
- 📝 **Text Pipeline**: Mines question-answer pairs from large-scale plain-text data (mostly crawled from the Internet) for use in SFT and RL training.
  - [[HuggingFace🤗 demo input & output for **Text Pipeline**]](https://huggingface.co/datasets/Open-Dataflow/dataflow-demo-Text)
- 🧠 **Reasoning Pipeline**: Enhances existing question-answer pairs with (1) extended chain-of-thought, (2) category classification, and (3) difficulty estimation.
  - [[HuggingFace🤗 demo input & output for **Reasoning Pipeline**]](https://huggingface.co/datasets/Open-Dataflow/dataflow-demo-Reasonning)
- 🗃️ **Text2SQL Pipeline**: Translates natural language questions into SQL queries, supplemented with explanations, chain-of-thought reasoning, and contextual schema information.
  - [[HuggingFace🤗 demo input & output for **Text2SQL Pipeline**]](https://huggingface.co/datasets/Open-Dataflow/dataflow-demo-Text2SQL)
- 📚 **Knowledge Base Cleaning Pipeline**: Extracts and structures knowledge from unorganized sources such as tables, PDFs, and Word documents into usable entries for downstream RAG or QA pair generation.
- 🤖 **Agentic RAG Pipeline**: Identifies and extracts QA pairs that require external knowledge to answer from existing QA datasets or knowledge bases, for use in downstream training on Agentic RAG tasks.
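As an illustration of what the Reasoning Pipeline adds to a question-answer pair, the sketch below models the input and the enriched output as dataclasses. The class names, the field names (`chain_of_thought`, `category`, `difficulty`), and the 1 to 10 difficulty scale are assumptions made for this example; the linked demo datasets show the real input and output.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class QAPair:
    """Input: an existing question-answer pair."""
    question: str
    answer: str


@dataclass
class EnrichedQAPair(QAPair):
    """Output: the same pair plus the three additions listed above."""
    chain_of_thought: str  # (1) extended reasoning
    category: str          # (2) classification label
    difficulty: int        # (3) estimated difficulty (hypothetical 1-10 scale)


def enrich(pair: QAPair, chain_of_thought: str, category: str,
           difficulty: int) -> EnrichedQAPair:
    """Attach the extra fields, validating the hypothetical scale."""
    if not 1 <= difficulty <= 10:
        raise ValueError("difficulty must be between 1 and 10")
    return EnrichedQAPair(pair.question, pair.answer,
                          chain_of_thought, category, difficulty)


record = enrich(
    QAPair(question="What is 12 * 12?", answer="144"),
    chain_of_thought="12 * 12 = 12 * 10 + 12 * 2 = 120 + 24 = 144.",
    category="arithmetic",
    difficulty=1,
)
print(json.dumps(asdict(record), indent=2))
```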
### ⚙️ 3.2 Flexible Operator Pipelines
Operators in this framework fall into several categories, including Fundamental Operators, Generic Operators, Domain-Specific Operators, and Evaluation Operators, which together support data processing and evaluation. Please refer to the [documentation](https://OpenDCAI.github.io/DataFlow-Doc/) for details.
### 🤖 3.3 Agent Guided Pipelines
<!-- Building on top of this, we also provide the -->
- **DataFlow Agent**: An intelligent assistant that performs data analysis, writes custom `operators`, and automatically orchestrates them into `pipelines` based on specific task objectives.
  - [[HuggingFace🤗 demo input & output for **DataFlow Agent**]](https://huggingface.co/datasets/Open-Dataflow/dataflow-demo-Agent)
<!-- ### 3.1 Text Pipeline
 -->
## ⚡ 4. Quick Start
To set up the environment and install DataFlow, run the following commands 👇
```shell
conda create -n dataflow python=3.10
conda activate dataflow
pip install open-dataflow
```
To run inference locally on your own GPU, install DataFlow with the `vllm` extra:
```shell
pip install "open-dataflow[vllm]"
```
> DataFlow supports Python >= 3.10.

Run the following command to verify the installation:
```shell
dataflow -v
```
You should see output similar to the following:
```log
open-dataflow codebase version: 1.0.0
Checking for updates...
Local version: 1.0.0
PyPI newest version: 1.0.0
You are using the latest version: 1.0.0.
```
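The same check can be scripted. The sketch below uses only the Python standard library (`importlib.metadata`) to look up the installed version of the `open-dataflow` distribution and to confirm that the interpreter meets the Python >= 3.10 requirement. The helper names `installed_version` and `python_is_supported` are made up for this example and are not part of DataFlow.

```python
import sys
from importlib.metadata import PackageNotFoundError, version
from typing import Optional, Tuple


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


def python_is_supported(current: Optional[Tuple[int, int]] = None,
                        minimum: Tuple[int, int] = (3, 10)) -> bool:
    """Check a (major, minor) version against the documented minimum."""
    if current is None:
        current = tuple(sys.version_info[:2])
    return current >= minimum


found = installed_version("open-dataflow")
if found is None:
    print("open-dataflow is not installed in this environment")
else:
    print(f"open-dataflow {found} is installed")

if not python_is_supported():
    print("Warning: this interpreter is older than Python 3.10")
```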
For **Quick-Start** and **Guide**, please visit our [Documentation](https://OpenDCAI.github.io/DataFlow-Doc/).
## 🧪 5. Experimental Results
For detailed experimental settings, please see our documentation.
### 📝 5.1 Text Pipeline
#### 5.1.1 Pre-training data filter pipeline
The `pre-training data processing pipeline` was applied to data randomly sampled from the RedPajama dataset, with a final data retention rate of 13.65%. The analysis results from `QuratingScorer` are shown in the figure. The filtered pre-training data scores significantly higher than the original data on all four dimensions: writing style, requirement for expert knowledge, factual content, and educational value. This demonstrates the effectiveness of DataFlow's pre-training data processing.
<div align="center">
<img src="./static/images/text-pretrain.png" width="60%">
</div>
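For reference, the retention rate is the share of sampled documents that survive every filter. The sketch below shows the arithmetic with a toy threshold filter. The scores, the threshold, and the function names are invented for illustration and do not reproduce the actual pipeline or its `QuratingScorer` results.

```python
def retention_rate(kept: int, total: int) -> float:
    """Percentage of documents that remain after filtering."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * kept / total


def filter_by_score(scored_docs, threshold):
    """Keep the documents whose (hypothetical) quality score passes."""
    return [doc for doc, score in scored_docs if score >= threshold]


# Toy data: four documents with made-up quality scores.
scored = [("doc-a", 0.91), ("doc-b", 0.12), ("doc-c", 0.77), ("doc-d", 0.05)]
kept = filter_by_score(scored, threshold=0.5)
print(kept, retention_rate(len(kept), len(scored)))  # ['doc-a', 'doc-c'] 50.0

# A 13.65% retention rate means, for example, 1,365 documents kept
# out of every 10,000 sampled (illustrative counts, not the real ones).
print(f"{retention_rate(1365, 10_000):.2f}%")  # 13.65%
```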
#### 5.1.2 SFT data filter pipeline
We filtered 3k records from the `alpaca` dataset and compared them with 3k randomly selected records from the same dataset by training Qwen2.5-7B on each set. The results are:
<div align="center">
<img src="./static/images/text-sft.png" width="60%">
</div>
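The comparison above pairs a quality-filtered subset with a random baseline of the same size. A minimal sketch of how two such 3k subsets could be drawn is shown below. The per-record `score` field, the toy pool, and the top-k selection rule are assumptions for illustration, not the filters the SFT pipeline actually applies.

```python
import random
from statistics import mean


def select_top_k(records, k, key="score"):
    """Filtered subset: the k records with the highest score."""
    return sorted(records, key=lambda r: r[key], reverse=True)[:k]


def select_random_k(records, k, seed=0):
    """Baseline subset: k records sampled uniformly without replacement."""
    return random.Random(seed).sample(records, k)


# Toy pool of records with made-up quality scores in [0, 1).
rng = random.Random(42)
pool = [{"id": i, "score": rng.random()} for i in range(10_000)]

filtered = select_top_k(pool, k=3_000)
baseline = select_random_k(pool, k=3_000)

print("filtered mean score:", round(mean(r["score"] for r in filtered), 3))
print("baseline mean score:", round(mean(r["score"] for r in baseline), 3))
```

Fixing the seed keeps the random baseline reproducible, so both subsets can be regenerated for a fair rerun of the comparison.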
### 🧠 5.2 Reasoning Pipeline
We validate the Reasoning Pipeline by running SFT on Qwen2.5-32B-Instruct with data synthesized by the pipeline. We generated 1k and 5k SFT data pairs. The results are:
<div align="center">
<img src="./static/images/reasoning_performance.png" width="60%">
</div>
### 🗃️ 5.3 Text2SQL Pipeline
We fine-tuned the Qwen2.5-Coder-14B model on the Bird dataset using both Supervised Fine-tuning (SFT) and Reinforcement Learning (RL), with data constructed via the DataFlow-Text2SQL Pipeline. The results are:
<div align="center">
<img src="./static/images/text2sql.png" width="60%">
</div>
## 💐 6. Acknowledgements
We sincerely appreciate [MinerU](https://github.com/opendatalab/MinerU)'s outstanding contribution, particularly its robust text extraction capabilities for PDFs and documents, which greatly facilitate data loading.
## 🤝 7. Community & Support
Join the DataFlow open-source community to ask questions, share ideas, and collaborate with other developers!
- 📮 [GitHub Issues](../../issues): Report bugs or suggest features
- 🔧 [GitHub Pull Requests](../../pulls): Contribute code improvements
- 💬 Join our community groups to connect with us and other contributors!
<div align="center">
<img src="./static/images/community_en.jpg" width="60%">
</div>
## 📜 8. Citation
If you use DataFlow in your research, please cite it as follows:
```bibtex
@misc{dataflow2025,
author = {DataFlow Develop Team},
title = {DataFlow: A Unified Framework for Data-Centric AI},
year = {2025},
howpublished = {\url{https://github.com/OpenDCAI/DataFlow}},
note = {Accessed: 2025-07-08}
}
```
## 📊 9. Statistics
<div align="center">
<a href="https://star-history.com/#OpenDCAI/DataFlow&Date">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://api.star-history.com/svg?repos=OpenDCAI/DataFlow&type=Date&theme=dark" />
<source media="(prefers-color-scheme: light)" srcset="https://api.star-history.com/svg?repos=OpenDCAI/DataFlow&type=Date" />
<img alt="Star History Chart" src="https://api.star-history.com/svg?repos=OpenDCAI/DataFlow&type=Date" style="width:50%;" />
</picture>
</a>
</div>
---
<div align="center">
<sub>
Connect with the
<a href="https://zwt233.github.io/" target="_blank"><strong>PKU-DCAI Research Team</strong></a>
on Xiaohongshu: <strong>26133106768</strong>
</sub>
</div>
Raw data
{
"_id": null,
"home_page": null,
"name": "open-dataflow",
"maintainer": null,
"docs_url": null,
"requires_python": "<4,>=3.7",
"maintainer_email": null,
"keywords": "AI, artificial intelligence",
"author": null,
"author_email": "Hao Liang <hao.liang@stu.pku.edu.cn>, Xiaochen Ma <xiaochen.ma.cs@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/32/d0/a5c5216e50aec8b9fc85a7733e2e173909d1937b039d42ef274490bb2eca/open_dataflow-1.0.4.tar.gz",
"platform": null,
"description": "# DataFlow\n\n<div align=\"center\">\n <img src=\"./static/images/Face.jpg\">\n\n\n[](https://OpenDCAI.github.io/DataFlow-Doc/)\n[](https://github.com/OpenDCAI/DataFlow/blob/main/LICENSE)\n[](https://github.com/OpenDCAI/DataFlow)\n[](https://github.com/OpenDCAI/DataFlow/issues)\n[](https://github.com/OpenDCAI/DataFlow/graphs/contributors)\n[](https://github.com/OpenDCAI/DataFlow)\n\n<!-- [](https://github.com/OpenDCAI/DataFlow/commits/main/) -->\n\n\ud83c\udf89 If you like our project, please give us a star \u2b50 on GitHub for the latest update.\n\n[\u7b80\u4f53\u4e2d\u6587](./README-zh.md) | English\n\n\n**[\ud83d\ude80 Features](#Features) \u2022 [\u26a1 Quick Start](#Quick_Start) \u2022 [\ud83d\udcd6 Documentation](https://OpenDCAI.github.io/DataFlow-Doc/) \u2022 [\ud83e\uddea Experiments](#Experiments)**\n\n</div>\n\nhttps://github.com/user-attachments/assets/05e047a5-99bb-4043-bc71-2b5ccdab2126\n\n## \ud83d\udcf0 1. News\n\ud83c\udf89 [2025-06-28] We\u2019re excited to announce that DataFlow, our Data-centric AI system, is now released! Stay tuned for future updates.\n\n## \ud83d\udd0d 2. Overview\n\n <img src=\"./static/images/dataflow_framework.jpg\">\n\nDataFlow is a data preparation and training system designed to\u00a0**parse, generate, process and evaluate**\u00a0high-quality data from noisy sources (PDF, plain-text, low-quality QA), thereby improving the performance of large language models (LLMs) in specific domains through targeted training (Pre-training, Supervised Fine-tuing, RL training) or RAG using knowledge base cleaning. **DataFlow has been empirically validated to improve domain-oriented LLM's performance in fields such as healthcare, finance, and law.**\n\nSpecifically, we constructing diverse\u00a0`operators`\u00a0leveraging rule-based methods, deep learning models, LLMs, and LLM APIs. 
These operators are systematically integrated into distinct\u00a0`pipelines`, collectively forming the comprehensive\u00a0`DataFlow\u00a0system`. Additionally, we develop an intelligent\u00a0`DataFlow-agent`\u00a0capable of dynamically assembling new\u00a0`pipelines`\u00a0by recombining existing\u00a0`operators`\u00a0on demand.\n\n\n\n<!-- Text: \u8f93\u5165\u662f\u70c2\u6570\u636e \u901a\u8fc7\u5927\u6a21\u578b \u8f93\u51faQA \uff08\u4e3b\u8981\u662f\u5f3a\u5316\u5b66\u4e60\uff09\nNL2SQL: \u53cd\u5411\u6784\u9020SQL QA\nReasonning\uff1aQuestion\u5f88\u77ed\uff0c\u6784\u5efa\u957f\u94feCOT \uff0c\u662f\u5426\u6709category\uff0c\u662f\u5426\u6709\u96be\u5ea6\uff08\u901a\u8fc7\u5927\u6a21\u578b\uff09\nAgentic RAG: \u8f93\u5165QA\uff0c\u51fa\u6765\u662f QA\u3002\u6ca1\u6709\u989d\u5916\u4fe1\u606f\u89e3\u51b3\u4e0d\u4e86\uff0c\u5fc5\u987b\u8981\u5f15\u5165\nKnowlege Base Cleaning: PDF\uff0c\u8868\u683c+doc text\u8f93\u5165\uff0c\u8f93\u51fa\u662f\u9ad8\u8d28\u91cf\u77e5\u8bc6\u5e93\nDataflow-agent: \u7528Agent\u81ea\u52a8\u5408\u6210pipeline\u3002\u7f16\u6392\u5df2\u6709\u7b97\u5b50\u3002 -->\n\n## \ud83d\udee0\ufe0f 3. 
Pipelines Functionality\n### \ud83d\udd27 3.1 Ready-to-Use PipeLines\nCurrent Pipelines in Dataflow are as follows:\n- \ud83d\udcdd **Text Pipeline**: Mine question-answer pairs from large-scale plain-text data (mostly crawed from InterNet) for use in SFT and RL training.\n - \n - [[HuggingFace\ud83e\udd17 demo input & output for **Text Pipeline**]](https://huggingface.co/datasets/Open-Dataflow/dataflow-demo-Text)\n- \ud83e\udde0 **Reasoning Pipeline**: Enhances existing question\u2013answer pairs with (1) extended chain-of-thought, (2) category classification, and (3) difficulty estimation.\n - \n - [[HuggingFace\ud83e\udd17 demo input & output for **Reasoning Pipeline**]](https://huggingface.co/datasets/Open-Dataflow/dataflow-demo-Reasonning)\n- \ud83d\uddc3\ufe0f **Text2SQL Pipeline**: Translates natural language questions into SQL queries, supplemented with explanations, chain-of-thought reasoning, and contextual schema information.\n - \n - [[HuggingFace\ud83e\udd17 demo input & output for **Text2SQL Pipeline**]](https://huggingface.co/datasets/Open-Dataflow/dataflow-demo-Text2SQL)\n- \ud83d\udcda **Knowlege Base Cleaning Pipeline**: Extract and structure knowledge from unorganized sources like tables, PDFs, and Word documents into usable entries for downstream RAG or QA pair generation.\n - \n- \ud83e\udd16 **Agentic RAG Pipeline**: Identify and extract QA pairs from existing QA datasets or knowledge bases that require external knowledge to answer, for use in downstream training of Agnetic RAG tasks.\n - \n### \u2699\ufe0f 3.2 Flexible Operator PipeLines\nIn this framework, operators are categorized into Fundamental Operators, Generic Operators, Domain-Specific Operators, and Evaluation Operators, etc., supporting data processing and evaluation functionalities. 
Please refer to the [documentation](https://OpenDCAI.github.io/DataFlow-Doc/) for details.\n\n### \ud83e\udd16 3.3 Agent Guided Pipelines\n<!-- Building on top of this, we also provide the -->\n- **DataFlow Agent**: An intelligent assistant that performs data analysis, writes custom `operators`, and automatically orchestrates them into `pipelines` based on specific task objectives.\n\n - \n - [[HuggingFace\ud83e\udd17 demo input & output for **DataFlow Agent**]](https://huggingface.co/datasets/Open-Dataflow/dataflow-demo-Agent)\n\n<!-- ### 3.1 Text Pipeline\n -->\n\n## \u26a1 4. Quick Start\nFor environment setup and installation, please using the following commands\ud83d\udc47\n\n```shell\nconda create -n dataflow python=3.10 \nconda activate dataflow\n\npip install open-dataflow\n```\nIf you want to use your own GPU to inference locally, please use:\n```shell\npip install open-dataflow[vllm]\n```\n> Dataflow supports Python>=3.10\n\nYou can use follwing command to check if installed correctly:\n```shell\ndataflow -v\n```\n\nYou are expected to see following outputs:\n```log\nopen-dataflow codebase version: 1.0.0\n Checking for updates...\n Local version: 1.0.0\n PyPI newest version: 1.0.0\nYou are using the latest version: 1.0.0.\n```\n\nFor **Quick-Start** and **Guide**, please visit our [Documentation](https://OpenDCAI.github.io/DataFlow-Doc/). \n\n[](https://OpenDCAI.github.io/DataFlow-Doc/)\n\n\n## \ud83e\uddea 5. Experimental Results\nFor Detailed Experiments setting, please visit our documentation.\n\n\n### \ud83d\udcdd 5.1 Text PipeLine\n\n#### 5.1.1 Pre-training data filter pipeline\nThe `pre-training data processing pipeline` was applied to randomly sampled data from the RedPajama dataset, resulting in a final data retention rate of 13.65%. The analysis results using `QuratingScorer` are shown in the figure. 
As can be seen, the filtered pretraining data significantly outperforms the original data across four scoring dimensions: writing style, requirement for expert knowledge, factual content, and educational value. This demonstrates the effectiveness of the DataFlow pretraining data processing.\n\n<div align=\"center\">\n <img src=\"./static/images/text-pretrain.png\" width=\"60%\">\n</div>\n\n#### 5.1.2 SFT data filter pipeline\nWe filted 3k record from `alpaca` dataset and compare it with radom selected 3k data from `alpaca` dataset by training it on Qwen2.5-7B. Results are:\n\n<div align=\"center\">\n <img src=\"./static/images/text-sft.png\" width=\"60%\">\n</div>\n\n### \ud83e\udde0 5.2 Reasoning Pipeline\n\nWe verify our reasoning pipeline by SFT on a Qwen2.5-32B-Instruct with Reasoning Pipeline synsthized data. We generated 1k and 5k SFT data pairs. Results are: \n\n<div align=\"center\">\n <img src=\"./static/images/reasoning_performance.png\" width=\"60%\">\n</div>\n\n### \ud83d\uddc3\ufe0f 5.3 Text2SQL PipeLine\nWe fine-tuned the Qwen2.5-Coder-14B model on the Bird dataset using both Supervised Fine-tuning (SFT) and Reinforcement Learning (RL), with data constructed via the DataFlow-Text2SQL Pipeline. Results are:\n\n<div align=\"center\">\n <img src=\"./static/images/text2sql.png\" width=\"60%\">\n</div>\n\n## \ud83d\udc90 6. Acknowledgements\nWe sincerely appreciate [MinerU](https://github.com/opendatalab/MinerU)'s outstanding contribution, particularly its robust text extraction capabilities from PDFs and documents, which greatly facilitates data loading.\n\n## \ud83e\udd1d 7. 
Community & Support\nJoin the DataFlow open-source community to ask questions, share ideas, and collaborate with other developers!\n\n\u2022\t\ud83d\udcee [GitHub Issues](../../issues): Report bugs or suggest features\n \n\u2022\t\ud83d\udd27 [GitHub Pull Requests](../../pulls): Contribute code improvements\n\n\u2022\t\ud83d\udcac Join our community groups to connect with us and other contributors!\n \n<div align=\"center\">\n <img src=\"./static/images/community_en.jpg\" width=\"60%\">\n</div>\n\n## \ud83d\udcdc 8. Citation\nIf you use DataFlow in your research, feel free to give us a cite.\n```bibtex\n@misc{dataflow2025,\n author = {DataFlow Develop Team},\n title = {DataFlow: A Unified Framework for Data-Centric AI},\n year = {2025},\n howpublished = {\\url{https://github.com/OpenDCAI/DataFlow}},\n note = {Accessed: 2025-07-08}\n}\n```\n\n## \ud83d\udcca 9. Statistics\n<div align=\"center\">\n <a href=\"https://star-history.com/#OpenDCAI/DataFlow&Date\">\n <picture>\n <source media=\"(prefers-color-scheme: dark)\" srcset=\"https://api.star-history.com/svg?repos=OpenDCAI/DataFlow&type=Date&theme=dark\" />\n <source media=\"(prefers-color-scheme: light)\" srcset=\"https://api.star-history.com/svg?repos=OpenDCAI/DataFlow&type=Date\" />\n <img alt=\"Star History Chart\" src=\"https://api.star-history.com/svg?repos=OpenDCAI/DataFlow&type=Date\" style=\"width:50%;\" />\n </picture>\n </a>\n \n</div>\n\n---\n<div align=\"center\">\n <sub>\n Connect with the \n <a href=\"https://zwt233.github.io/\" target=\"_blank\"><strong>PKU-DCAI Research Team</strong></a> \n on Xiaohongshu: <strong>26133106768</strong>\n </sub>\n</div>\n",
"bugtrack_url": null,
"license": "Apache-2.0",
"summary": "Modern Data Centric AI system for Large Language Models",
"version": "1.0.4",
"project_urls": {
"Bug Reports": "https://github.com/Open-DataFlow/DataFlow/issues",
"Documentation": "https://open-dataflow.github.io/DataFlow-Doc/",
"Github": "https://github.com/Open-DataFlow/DataFlow"
},
"split_keywords": [
"ai",
" artificial intelligence"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "4e372370fa451d83eaa62ac37e05c883b9d940e0af58e96e973e14c0d4eca1ce",
"md5": "f5b5bb483e6556d881aa91870a533bc2",
"sha256": "0f5fdec0bf5884f275849fcafe140ac4eac92e37ed2761e89f270d3ff3051dcc"
},
"downloads": -1,
"filename": "open_dataflow-1.0.4-py3-none-any.whl",
"has_sig": false,
"md5_digest": "f5b5bb483e6556d881aa91870a533bc2",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": "<4,>=3.7",
"size": 1682198,
"upload_time": "2025-07-15T14:52:48",
"upload_time_iso_8601": "2025-07-15T14:52:48.426723Z",
"url": "https://files.pythonhosted.org/packages/4e/37/2370fa451d83eaa62ac37e05c883b9d940e0af58e96e973e14c0d4eca1ce/open_dataflow-1.0.4-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "32d0a5c5216e50aec8b9fc85a7733e2e173909d1937b039d42ef274490bb2eca",
"md5": "f810d3cecd118d237524ae1c8762c5a3",
"sha256": "c5c077f91c003ed43e95c0b58ad96e3e4cf9f3b2e36a5950ccd8900c6c22ea5d"
},
"downloads": -1,
"filename": "open_dataflow-1.0.4.tar.gz",
"has_sig": false,
"md5_digest": "f810d3cecd118d237524ae1c8762c5a3",
"packagetype": "sdist",
"python_version": "source",
"requires_python": "<4,>=3.7",
"size": 1525820,
"upload_time": "2025-07-15T14:52:50",
"upload_time_iso_8601": "2025-07-15T14:52:50.002571Z",
"url": "https://files.pythonhosted.org/packages/32/d0/a5c5216e50aec8b9fc85a7733e2e173909d1937b039d42ef274490bb2eca/open_dataflow-1.0.4.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-07-15 14:52:50",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "Open-DataFlow",
"github_project": "DataFlow",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [
{
"name": "datasets",
"specs": [
[
"<=",
"3.2"
]
]
},
{
"name": "numpy",
"specs": [
[
"<",
"2.0.0"
]
]
},
{
"name": "scipy",
"specs": []
},
{
"name": "torch",
"specs": []
},
{
"name": "torchvision",
"specs": []
},
{
"name": "torchaudio",
"specs": []
},
{
"name": "tqdm",
"specs": []
},
{
"name": "transformers",
"specs": [
[
"<=",
"4.51.3"
]
]
},
{
"name": "aisuite",
"specs": []
},
{
"name": "math_verify",
"specs": []
},
{
"name": "word2number",
"specs": []
},
{
"name": "accelerate",
"specs": []
},
{
"name": "rapidfuzz",
"specs": []
},
{
"name": "colorlog",
"specs": []
},
{
"name": "appdirs",
"specs": []
},
{
"name": "datasketch",
"specs": []
},
{
"name": "modelscope",
"specs": []
},
{
"name": "addict",
"specs": []
},
{
"name": "pytest",
"specs": []
},
{
"name": "rich",
"specs": []
},
{
"name": "docstring_parser",
"specs": []
},
{
"name": "pydantic",
"specs": []
},
{
"name": "nltk",
"specs": []
},
{
"name": "colorama",
"specs": []
},
{
"name": "func_timeout",
"specs": []
},
{
"name": "sqlglot",
"specs": []
},
{
"name": "pymysql",
"specs": []
},
{
"name": "fasttext-wheel",
"specs": []
},
{
"name": "kenlm",
"specs": []
},
{
"name": "langkit",
"specs": []
},
{
"name": "openai",
"specs": []
},
{
"name": "sentencepiece",
"specs": []
},
{
"name": "datasketch",
"specs": []
},
{
"name": "presidio_analyzer",
"specs": []
},
{
"name": "presidio_anonymizer",
"specs": []
},
{
"name": "vendi-score",
"specs": [
[
"==",
"0.0.3"
]
]
},
{
"name": "google-api-core",
"specs": []
},
{
"name": "google-api-python-client",
"specs": []
},
{
"name": "evaluate",
"specs": []
},
{
"name": "contractions",
"specs": []
},
{
"name": "symspellpy",
"specs": []
},
{
"name": "simhash",
"specs": []
},
{
"name": "chonkie",
"specs": []
},
{
"name": "trafilatura",
"specs": []
},
{
"name": "lxml_html_clean",
"specs": []
},
{
"name": "cloudpickle",
"specs": []
},
{
"name": "fastapi",
"specs": []
},
{
"name": "httpx",
"specs": []
},
{
"name": "pandas",
"specs": []
},
{
"name": "psutil",
"specs": []
},
{
"name": "pyfiglet",
"specs": []
},
{
"name": "pyyaml",
"specs": []
},
{
"name": "requests",
"specs": []
},
{
"name": "termcolor",
"specs": []
},
{
"name": "uvicorn",
"specs": []
}
],
"lcname": "open-dataflow"
}