# DataMax
<div align="center">
[中文](README_zh.md) | **English**
[PyPI version](https://badge.fury.io/py/pydatamax) [Python](https://www.python.org/downloads/) [License: MIT](https://opensource.org/licenses/MIT)
</div>
A powerful multi-format file parsing, data cleaning, and AI annotation toolkit built for modern Python applications.
## ✨ Key Features
- 🔄 **Multi-format Support**: PDF, DOCX/DOC, PPT/PPTX, XLS/XLSX, HTML, EPUB, TXT, images, and more
- 🧹 **Intelligent Cleaning**: Advanced data cleaning with anomaly detection, privacy protection, and text filtering
- 🤖 **AI Annotation**: LLM-powered automatic annotation and QA generation
- ⚡ **High Performance**: Efficient batch processing with caching and parallel execution
- 🎯 **Developer Friendly**: Modern SDK design with type hints, configuration management, and comprehensive error handling
- ☁️ **Cloud Ready**: Built-in support for OSS, MinIO, and other cloud storage providers
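When working with mixed inputs, it can help to sort file paths by extension before handing them to a parser. The sketch below is only an illustration: `ASSUMED_EXTENSIONS` and `split_by_extension` are hypothetical names, and the extension set is inferred from the feature list above rather than taken from DataMax itself, so it is deliberately incomplete ("and more" is not covered).

```python
from pathlib import Path

# Assumption: extensions inferred from the feature list above, not from DataMax itself.
ASSUMED_EXTENSIONS = {
    ".pdf", ".docx", ".doc", ".ppt", ".pptx",
    ".xls", ".xlsx", ".html", ".epub", ".txt",
}


def split_by_extension(paths):
    """Partition paths into (accepted, rejected) by case-insensitive file suffix."""
    accepted, rejected = [], []
    for raw in paths:
        # Path.suffix returns "" when the name has no extension.
        target = accepted if Path(raw).suffix.lower() in ASSUMED_EXTENSIONS else rejected
        target.append(raw)
    return accepted, rejected


if __name__ == "__main__":
    ok, skipped = split_by_extension(["report.PDF", "table.xlsx", "blob.bin"])
    print("accepted:", ok)
    print("skipped:", skipped)
```

Files in the rejected list are not necessarily unsupported; they simply fall outside the assumed set and may deserve a manual check.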
## 🚀 Quick Start
### Install
```bash
pip install pydatamax
```
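To confirm that the installation succeeded, one option is to query the installed distribution through the standard library. This is a minimal sketch: `installed_version` is a hypothetical helper, not part of DataMax, and it only relies on `importlib.metadata`.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str = "pydatamax") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found:
        print(f"pydatamax version: {found}")
    else:
        print("pydatamax is not installed in this environment")
```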
### Examples
```python
from datamax import DataMax

# Input files and LLM labeling settings (replace the placeholders with real values)
FILE_PATHS = ["/your/file/path/1.md", "/your/file/path/2.doc", "/your/file/path/3.xlsx"]
LABEL_LLM_API_KEY = "YOUR_API_KEY"
LABEL_LLM_BASE_URL = "YOUR_BASE_URL"
LABEL_LLM_MODEL_NAME = "YOUR_MODEL_NAME"
LLM_TRAIN_OUTPUT_FILE_NAME = "train"

# Initialize the client with the files to parse
client = DataMax(file_path=FILE_PATHS)

# Generate pre-labels; returns a list of trainable QA pairs
qa_list = client.get_pre_label(
    api_key=LABEL_LLM_API_KEY,
    base_url=LABEL_LLM_BASE_URL,
    model_name=LABEL_LLM_MODEL_NAME,
    question_number=10,
    max_workers=5,
)

# Save the labeled data
client.save_label_data(qa_list, LLM_TRAIN_OUTPUT_FILE_NAME)
```
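The example above hardcodes the API key for brevity. One possible alternative is to read the LLM settings from environment variables. The following is a sketch under assumptions: `LabelConfig` and `load_label_config` are hypothetical helpers, and the environment variable names simply mirror the constants in the example; DataMax itself does not define them.

```python
import os
from dataclasses import dataclass

# Assumption: variable names mirror the constants in the example above.
REQUIRED_VARS = ("LABEL_LLM_API_KEY", "LABEL_LLM_BASE_URL", "LABEL_LLM_MODEL_NAME")


@dataclass(frozen=True)
class LabelConfig:
    api_key: str
    base_url: str
    model_name: str


def load_label_config(env=None) -> LabelConfig:
    """Build a LabelConfig from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    # Treat unset and empty values the same way: both count as missing.
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return LabelConfig(
        api_key=env["LABEL_LLM_API_KEY"],
        base_url=env["LABEL_LLM_BASE_URL"],
        model_name=env["LABEL_LLM_MODEL_NAME"],
    )
```

The resulting fields can then be passed as `api_key`, `base_url`, and `model_name` to `client.get_pre_label(...)`, keeping secrets out of source control.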
## 🤝 Contributing
Issues and Pull Requests are welcome!
## 📄 License
This project is licensed under the [MIT License](LICENSE).
## 📞 Contact Us
- 📧 Email: cy.kron@foxmail.com
- 🐛 Issues: [GitHub Issues](https://github.com/Hi-Dolphin/datamax/issues)
- 📚 Documentation: [Project Homepage](https://github.com/Hi-Dolphin/datamax)
- 💬 WeChat Group: <br><img src='wechat.jpg' width=300>
---
⭐ If this project helps you, please give us a star!
## Raw data
{
"_id": null,
"home_page": "https://github.com/Hi-Dolphin/datamax",
"name": "pydatamax",
"maintainer": "DataMax Team",
"docs_url": null,
"requires_python": ">=3.10",
"maintainer_email": "DataMax Team <cy.kron@foxmail.com>",
"keywords": "crawler, scraping, data-processing, arxiv, web-scraping, data-extraction, parsing, async, cli, framework, academic-papers, research, automation, data-collection, file-conversion, document-processing",
"author": "ccy",
"author_email": "DataMax Team <cy.kron@foxmail.com>",
"download_url": "https://files.pythonhosted.org/packages/9a/e0/fcd26eeea6dd730a19d7816e82e5fe65df3fce2d9a384937ac09c421a344/pydatamax-0.2.0.tar.gz",
"platform": "any",
"description": "# DataMax\n\n<div align=\"center\">\n\n[\u4e2d\u6587](README_zh.md) | **English**\n\n[](https://badge.fury.io/py/pydatamax) [](https://www.python.org/downloads/) [](https://opensource.org/licenses/MIT)\n\n</div>\n\nA powerful multi-format file parsing, data cleaning, and AI annotation toolkit built for modern Python applications.\n\n## \u2728 Key Features\n\n- \ud83d\udd04 **Multi-format Support**: PDF, DOCX/DOC, PPT/PPTX, XLS/XLSX, HTML, EPUB, TXT, images, and more\n- \ud83e\uddf9 **Intelligent Cleaning**: Advanced data cleaning with anomaly detection, privacy protection, and text filtering\n- \ud83e\udd16 **AI Annotation**: LLM-powered automatic annotation and QA generation\n- \u26a1 **High Performance**: Efficient batch processing with caching and parallel execution\n- \ud83c\udfaf **Developer Friendly**: Modern SDK design with type hints, configuration management, and comprehensive error handling\n- \u2601\ufe0f **Cloud Ready**: Built-in support for OSS, MinIO, and other cloud storage providers\n\n## \ud83d\ude80 Quick Start\n\n### Install\n\n```bash\npip install pydatamax\n```\n\n### Examples\n\n```python\nfrom datamax import DataMax\n\n# prepare info\nFILE_PATHS = [\"/your/file/path/1.md\", \"/your/file/path/2.doc\", \"/your/file/path/3.xlsx\"]\nLABEL_LLM_API_KEY = \"YOUR_API_KEY\"\nLABEL_LLM_BASE_URL = \"YOUR_BASE_URL\"\nLABEL_LLM_MODEL_NAME = \"YOUR_MODEL_NAME\"\nLLM_TRAIN_OUTPUT_FILE_NAME = \"train\"\n\n# init client\nclient = DataMax(file_path=FILE_PATHS)\n\n# get pre label. 
return trainable qa list\nqa_list = client.get_pre_label(\n api_key=LABEL_LLM_API_KEY,\n base_url=LABEL_LLM_BASE_URL,\n model_name=LABEL_LLM_MODEL_NAME,\n question_number=10,\n max_workers=5)\n\n# save label data\nclient.save_label_data(qa_list, LLM_TRAIN_OUTPUT_FILE_NAME)\n```\n\n\n## \ud83e\udd1d Contributing\n\nIssues and Pull Requests are welcome!\n\n## \ud83d\udcc4 License\n\nThis project is licensed under the [MIT License](LICENSE).\n\n## \ud83d\udcde Contact Us\n\n- \ud83d\udce7 Email: cy.kron@foxmail.com\n- \ud83d\udc1b Issues: [GitHub Issues](https://github.com/Hi-Dolphin/datamax/issues)\n- \ud83d\udcda Documentation: [Project Homepage](https://github.com/Hi-Dolphin/datamax)\n- \ud83d\udcac Wechat Group: <br><img src='wechat.jpg' width=300>\n---\n\n\u2b50 If this project helps you, please give us a star!\n",
"bugtrack_url": null,
"license": "MIT",
"summary": "Advanced Data Crawling and Processing Framework",
"version": "0.2.0",
"project_urls": {
"Bug Reports": "https://github.com/Hi-Dolphin/datamax/issues",
"Documentation": "https://github.com/Hi-Dolphin/datamax/docs",
"Homepage": "https://github.com/Hi-Dolphin/datamax",
"Repository": "https://github.com/Hi-Dolphin/datamax",
"Source": "https://github.com/Hi-Dolphin/datamax"
},
"split_keywords": [
"crawler",
" scraping",
" data-processing",
" arxiv",
" web-scraping",
" data-extraction",
" parsing",
" async",
" cli",
" framework",
" academic-papers",
" research",
" automation",
" data-collection",
" file-conversion",
" document-processing"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "41cf166de21835db9dd74aa368982a2fdd572ecdc893ec279b6e8eed94e8e1e6",
"md5": "89761221d89a65c5cf6fd4afbcd7db81",
"sha256": "da42b95524336378539b03db0f300e3763a9f1901fec3d51325cf11f2172931c"
},
"downloads": -1,
"filename": "pydatamax-0.2.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "89761221d89a65c5cf6fd4afbcd7db81",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.10",
"size": 5071,
"upload_time": "2025-09-03T17:39:41",
"upload_time_iso_8601": "2025-09-03T17:39:41.171479Z",
"url": "https://files.pythonhosted.org/packages/41/cf/166de21835db9dd74aa368982a2fdd572ecdc893ec279b6e8eed94e8e1e6/pydatamax-0.2.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "9ae0fcd26eeea6dd730a19d7816e82e5fe65df3fce2d9a384937ac09c421a344",
"md5": "1edc6c86c7e4ad896b57c0969776fea9",
"sha256": "5a51e26feb96e2d6041372b16e363bcfabf321bcaa78a78c2be396a9ec5f8522"
},
"downloads": -1,
"filename": "pydatamax-0.2.0.tar.gz",
"has_sig": false,
"md5_digest": "1edc6c86c7e4ad896b57c0969776fea9",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.10",
"size": 177492,
"upload_time": "2025-09-03T17:39:42",
"upload_time_iso_8601": "2025-09-03T17:39:42.263776Z",
"url": "https://files.pythonhosted.org/packages/9a/e0/fcd26eeea6dd730a19d7816e82e5fe65df3fce2d9a384937ac09c421a344/pydatamax-0.2.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-09-03 17:39:42",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "Hi-Dolphin",
"github_project": "datamax",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [
{
"name": "oss2",
"specs": [
[
"<",
"3.0.0"
],
[
">=",
"2.19.1"
]
]
},
{
"name": "aliyun-python-sdk-core",
"specs": [
[
">=",
"2.16.0"
],
[
"<",
"3.0.0"
]
]
},
{
"name": "aliyun-python-sdk-kms",
"specs": [
[
"<",
"3.0.0"
],
[
">=",
"2.16.5"
]
]
},
{
"name": "crcmod",
"specs": [
[
"<",
"2.0.0"
],
[
">=",
"1.7"
]
]
},
{
"name": "langdetect",
"specs": [
[
">=",
"1.0.9"
],
[
"<",
"2.0.0"
]
]
},
{
"name": "loguru",
"specs": [
[
">=",
"0.7.3"
],
[
"<",
"1.0.0"
]
]
},
{
"name": "python-docx",
"specs": [
[
"<",
"2.0.0"
],
[
">=",
"1.1.2"
]
]
},
{
"name": "python-dotenv",
"specs": [
[
"<",
"2.0.0"
],
[
">=",
"1.1.0"
]
]
},
{
"name": "pymupdf",
"specs": [
[
"<",
"2.0.0"
],
[
">=",
"1.26.0"
]
]
},
{
"name": "pypdf",
"specs": [
[
">=",
"5.5.0"
],
[
"<",
"6.0.0"
]
]
},
{
"name": "openpyxl",
"specs": [
[
">=",
"3.1.5"
],
[
"<",
"4.0.0"
]
]
},
{
"name": "pandas",
"specs": [
[
">=",
"2.2.3"
],
[
"<",
"3.0.0"
]
]
},
{
"name": "numpy",
"specs": [
[
">=",
"2.2.6"
],
[
"<",
"3.0.0"
]
]
},
{
"name": "requests",
"specs": [
[
"<",
"3.0.0"
],
[
">=",
"2.32.3"
]
]
},
{
"name": "tqdm",
"specs": [
[
">=",
"4.67.1"
],
[
"<",
"5.0.0"
]
]
},
{
"name": "pydantic",
"specs": [
[
"<",
"3.0.0"
],
[
">=",
"2.11.5"
]
]
},
{
"name": "pydantic-settings",
"specs": [
[
">=",
"2.9.1"
],
[
"<",
"3.0.0"
]
]
},
{
"name": "python-magic",
"specs": [
[
"<",
"1.0.0"
],
[
">=",
"0.4.27"
]
]
},
{
"name": "PyYAML",
"specs": [
[
">=",
"6.0.2"
],
[
"<",
"7.0.0"
]
]
},
{
"name": "Pillow",
"specs": [
[
">=",
"11.2.1"
],
[
"<",
"12.0.0"
]
]
},
{
"name": "packaging",
"specs": [
[
"<",
"25.0"
],
[
">=",
"24.2"
]
]
},
{
"name": "beautifulsoup4",
"specs": [
[
">=",
"4.13.4"
],
[
"<",
"5.0.0"
]
]
},
{
"name": "minio",
"specs": [
[
"<",
"8.0.0"
],
[
">=",
"7.2.15"
]
]
},
{
"name": "openai",
"specs": [
[
"<",
"2.0.0"
],
[
">=",
"1.82.0"
]
]
},
{
"name": "jionlp",
"specs": [
[
"<",
"2.0.0"
],
[
">=",
"1.5.23"
]
]
},
{
"name": "chardet",
"specs": [
[
"<",
"6.0.0"
],
[
">=",
"5.2.0"
]
]
},
{
"name": "olefile",
"specs": [
[
">=",
"0.46"
]
]
},
{
"name": "python-pptx",
"specs": [
[
">=",
"1.0.2"
],
[
"<",
"2.0.0"
]
]
},
{
"name": "tiktoken",
"specs": [
[
"<",
"1.0.0"
],
[
">=",
"0.9.0"
]
]
},
{
"name": "markitdown",
"specs": [
[
"<",
"1.0.0"
],
[
">=",
"0.1.1"
]
]
},
{
"name": "xlrd",
"specs": [
[
">=",
"2.0.1"
],
[
"<",
"3.0.0"
]
]
},
{
"name": "tabulate",
"specs": [
[
"<",
"1.0.0"
],
[
">=",
"0.9.0"
]
]
},
{
"name": "unstructured",
"specs": [
[
"<",
"1.0.0"
],
[
">=",
"0.17.2"
]
]
},
{
"name": "markdown",
"specs": [
[
"<",
"4.0.0"
],
[
">=",
"3.8"
]
]
},
{
"name": "langchain",
"specs": [
[
"<",
"1.0.0"
],
[
">=",
"0.3.0"
]
]
},
{
"name": "langchain-community",
"specs": [
[
"<",
"1.0.0"
],
[
">=",
"0.3.0"
]
]
},
{
"name": "langchain-text-splitters",
"specs": [
[
"<",
"1.0.0"
],
[
">=",
"0.3.0"
]
]
},
{
"name": "ebooklib",
"specs": [
[
"==",
"0.19"
]
]
},
{
"name": "setuptools",
"specs": []
},
{
"name": "transformers",
"specs": [
[
"==",
"4.53.1"
]
]
},
{
"name": "pytest",
"specs": [
[
">=",
"7.0.0"
]
]
},
{
"name": "pytest-asyncio",
"specs": [
[
">=",
"0.21.0"
]
]
},
{
"name": "pytest-cov",
"specs": [
[
">=",
"4.0.0"
]
]
},
{
"name": "pytest-mock",
"specs": [
[
">=",
"3.10.0"
]
]
},
{
"name": "pytest-timeout",
"specs": [
[
">=",
"2.1.0"
]
]
}
],
"tox": true,
"lcname": "pydatamax"
}
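The `requirements` array above stores each constraint as an `[operator, version]` pair. As a rough sketch of how that structure could be turned into pip-style requirement lines, the snippet below uses a hypothetical helper, `to_requirement_lines`, and a small inline sample copied from the raw data; it assumes the JSON keeps the shape shown here.

```python
import json


def to_requirement_lines(raw: dict) -> list:
    """Render each entry of raw["requirements"] as a pip-style requirement string."""
    lines = []
    for req in raw.get("requirements", []):
        # Each spec is an [operator, version] pair, e.g. ["<", "3.0.0"].
        constraints = ",".join(f"{op}{ver}" for op, ver in req.get("specs", []))
        lines.append(f"{req['name']}{constraints}")
    return lines


# Sample entries copied from the raw data above.
sample = json.loads("""
{"requirements": [
  {"name": "oss2", "specs": [["<", "3.0.0"], [">=", "2.19.1"]]},
  {"name": "setuptools", "specs": []},
  {"name": "ebooklib", "specs": [["==", "0.19"]]}
]}
""")

for line in to_requirement_lines(sample):
    print(line)
```

Entries with an empty `specs` list, such as `setuptools`, come out as a bare package name with no version constraint.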