pydatamax


Name: pydatamax
Version: 0.2.0
home_page: https://github.com/Hi-Dolphin/datamax
Summary: Advanced Data Crawling and Processing Framework
upload_time: 2025-09-03 17:39:42
maintainer: DataMax Team
docs_url: None
author: ccy
requires_python: >=3.10
license: MIT
keywords: crawler, scraping, data-processing, arxiv, web-scraping, data-extraction, parsing, async, cli, framework, academic-papers, research, automation, data-collection, file-conversion, document-processing
requirements: oss2, aliyun-python-sdk-core, aliyun-python-sdk-kms, crcmod, langdetect, loguru, python-docx, python-dotenv, pymupdf, pypdf, openpyxl, pandas, numpy, requests, tqdm, pydantic, pydantic-settings, python-magic, PyYAML, Pillow, packaging, beautifulsoup4, minio, openai, jionlp, chardet, olefile, python-pptx, tiktoken, markitdown, xlrd, tabulate, unstructured, markdown, langchain, langchain-community, langchain-text-splitters, ebooklib, setuptools, transformers, pytest, pytest-asyncio, pytest-cov, pytest-mock, pytest-timeout
Travis-CI: No Travis.
coveralls test coverage: No coveralls.

# DataMax

<div align="center">

[中文](README_zh.md) | **English**

[![PyPI version](https://badge.fury.io/py/pydatamax.svg)](https://badge.fury.io/py/pydatamax) [![Python](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/) [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

</div>

A powerful multi-format file parsing, data cleaning, and AI annotation toolkit built for modern Python applications.

## ✨ Key Features

- 🔄 **Multi-format Support**: PDF, DOCX/DOC, PPT/PPTX, XLS/XLSX, HTML, EPUB, TXT, images, and more
- 🧹 **Intelligent Cleaning**: Advanced data cleaning with anomaly detection, privacy protection, and text filtering
- 🤖 **AI Annotation**: LLM-powered automatic annotation and QA generation
- ⚡ **High Performance**: Efficient batch processing with caching and parallel execution
- 🎯 **Developer Friendly**: Modern SDK design with type hints, configuration management, and comprehensive error handling
- ☁️ **Cloud Ready**: Built-in support for OSS, MinIO, and other cloud storage providers

## 🚀 Quick Start

### Install

```bash
pip install pydatamax
```
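
To confirm that the install succeeded and that the interpreter meets the `>=3.10` requirement listed in the package metadata, a check along the following lines may help. This is a minimal sketch using only the standard library; the helper name `check_environment` and its parameters are hypothetical and not part of DataMax.

```python
import sys
from importlib import metadata
from typing import Optional, Tuple

MIN_PYTHON = (3, 10)  # taken from the package's requires_python ">=3.10"


def check_environment(
    dist_name: str = "pydatamax",
    python_version: Tuple[int, int] = sys.version_info[:2],
) -> Optional[str]:
    """Return the installed version of dist_name, or None if it is not installed.

    Raises RuntimeError when python_version is older than MIN_PYTHON.
    (Hypothetical helper for illustration; not a DataMax API.)
    """
    if python_version < MIN_PYTHON:
        raise RuntimeError(f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ is required")
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print(check_environment() or "pydatamax is not installed")
```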

### Examples

```python
from datamax import DataMax

# Input files and LLM settings; replace the placeholders with your own values.
FILE_PATHS = ["/your/file/path/1.md", "/your/file/path/2.doc", "/your/file/path/3.xlsx"]
LABEL_LLM_API_KEY = "YOUR_API_KEY"
LABEL_LLM_BASE_URL = "YOUR_BASE_URL"
LABEL_LLM_MODEL_NAME = "YOUR_MODEL_NAME"
LLM_TRAIN_OUTPUT_FILE_NAME = "train"

# Initialize the client with the files to parse.
client = DataMax(file_path=FILE_PATHS)

# Pre-label the files with the LLM; returns a list of QA pairs suitable for training.
qa_list = client.get_pre_label(
    api_key=LABEL_LLM_API_KEY,
    base_url=LABEL_LLM_BASE_URL,
    model_name=LABEL_LLM_MODEL_NAME,
    question_number=10,
    max_workers=5,
)

# Save the labeled data under the given output file name.
client.save_label_data(qa_list, LLM_TRAIN_OUTPUT_FILE_NAME)
```
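
The example hard-codes the API key as a string. One common alternative is to read the LLM settings from environment variables. The sketch below does this with the standard library only; the environment variable names and the helper `load_llm_settings` are assumptions for illustration, not names defined by DataMax.

```python
import os
from typing import Dict, Mapping


def load_llm_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Collect the three LLM settings from environment variables.

    Hypothetical helper: the variable names below are illustrative choices.
    The dictionary keys match the keyword arguments used in the example.
    """
    names = {
        "api_key": "LABEL_LLM_API_KEY",
        "base_url": "LABEL_LLM_BASE_URL",
        "model_name": "LABEL_LLM_MODEL_NAME",
    }
    # Treat unset and empty variables the same way: both count as missing.
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {arg: env[var] for arg, var in names.items()}
```

The returned dictionary can then be unpacked into the call from the example, for instance `client.get_pre_label(**load_llm_settings(), question_number=10, max_workers=5)`.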


## 🀝 Contributing

Issues and Pull Requests are welcome!

## 📄 License

This project is licensed under the [MIT License](LICENSE).

## 📞 Contact Us

- 📧 Email: cy.kron@foxmail.com
- 🐛 Issues: [GitHub Issues](https://github.com/Hi-Dolphin/datamax/issues)
- 📚 Documentation: [Project Homepage](https://github.com/Hi-Dolphin/datamax)
- 💬 WeChat Group: <br><img src='wechat.jpg' width=300>

---

⭐ If this project helps you, please give us a star!

            

Raw data

{
    "_id": null,
    "home_page": "https://github.com/Hi-Dolphin/datamax",
    "name": "pydatamax",
    "maintainer": "DataMax Team",
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": "DataMax Team <cy.kron@foxmail.com>",
    "keywords": "crawler, scraping, data-processing, arxiv, web-scraping, data-extraction, parsing, async, cli, framework, academic-papers, research, automation, data-collection, file-conversion, document-processing",
    "author": "ccy",
    "author_email": "DataMax Team <cy.kron@foxmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/9a/e0/fcd26eeea6dd730a19d7816e82e5fe65df3fce2d9a384937ac09c421a344/pydatamax-0.2.0.tar.gz",
    "platform": "any",
    "description": "# DataMax\n\n<div align=\"center\">\n\n[\u4e2d\u6587](README_zh.md) | **English**\n\n[![PyPI version](https://badge.fury.io/py/pydatamax.svg)](https://badge.fury.io/py/pydatamax) [![Python](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/) [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n\n</div>\n\nA powerful multi-format file parsing, data cleaning, and AI annotation toolkit built for modern Python applications.\n\n## \u2728 Key Features\n\n- \ud83d\udd04 **Multi-format Support**: PDF, DOCX/DOC, PPT/PPTX, XLS/XLSX, HTML, EPUB, TXT, images, and more\n- \ud83e\uddf9 **Intelligent Cleaning**: Advanced data cleaning with anomaly detection, privacy protection, and text filtering\n- \ud83e\udd16 **AI Annotation**: LLM-powered automatic annotation and QA generation\n- \u26a1 **High Performance**: Efficient batch processing with caching and parallel execution\n- \ud83c\udfaf **Developer Friendly**: Modern SDK design with type hints, configuration management, and comprehensive error handling\n- \u2601\ufe0f **Cloud Ready**: Built-in support for OSS, MinIO, and other cloud storage providers\n\n## \ud83d\ude80 Quick Start\n\n### Install\n\n```bash\npip install pydatamax\n```\n\n### Examples\n\n```python\nfrom datamax import DataMax\n\n# prepare info\nFILE_PATHS = [\"/your/file/path/1.md\", \"/your/file/path/2.doc\", \"/your/file/path/3.xlsx\"]\nLABEL_LLM_API_KEY = \"YOUR_API_KEY\"\nLABEL_LLM_BASE_URL = \"YOUR_BASE_URL\"\nLABEL_LLM_MODEL_NAME = \"YOUR_MODEL_NAME\"\nLLM_TRAIN_OUTPUT_FILE_NAME = \"train\"\n\n# init client\nclient = DataMax(file_path=FILE_PATHS)\n\n# get pre label. return trainable qa list\nqa_list = client.get_pre_label(\n    api_key=LABEL_LLM_API_KEY,\n    base_url=LABEL_LLM_BASE_URL,\n    model_name=LABEL_LLM_MODEL_NAME,\n    question_number=10,\n    max_workers=5)\n\n# save label data\nclient.save_label_data(qa_list, LLM_TRAIN_OUTPUT_FILE_NAME)\n```\n\n\n## \ud83e\udd1d Contributing\n\nIssues and Pull Requests are welcome!\n\n## \ud83d\udcc4 License\n\nThis project is licensed under the [MIT License](LICENSE).\n\n## \ud83d\udcde Contact Us\n\n- \ud83d\udce7 Email: cy.kron@foxmail.com\n- \ud83d\udc1b Issues: [GitHub Issues](https://github.com/Hi-Dolphin/datamax/issues)\n- \ud83d\udcda Documentation: [Project Homepage](https://github.com/Hi-Dolphin/datamax)\n- \ud83d\udcac Wechat Group: <br><img src='wechat.jpg' width=300>\n---\n\n\u2b50 If this project helps you, please give us a star!\n",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Advanced Data Crawling and Processing Framework",
    "version": "0.2.0",
    "project_urls": {
        "Bug Reports": "https://github.com/Hi-Dolphin/datamax/issues",
        "Documentation": "https://github.com/Hi-Dolphin/datamax/docs",
        "Homepage": "https://github.com/Hi-Dolphin/datamax",
        "Repository": "https://github.com/Hi-Dolphin/datamax",
        "Source": "https://github.com/Hi-Dolphin/datamax"
    },
    "split_keywords": [
        "crawler",
        " scraping",
        " data-processing",
        " arxiv",
        " web-scraping",
        " data-extraction",
        " parsing",
        " async",
        " cli",
        " framework",
        " academic-papers",
        " research",
        " automation",
        " data-collection",
        " file-conversion",
        " document-processing"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "41cf166de21835db9dd74aa368982a2fdd572ecdc893ec279b6e8eed94e8e1e6",
                "md5": "89761221d89a65c5cf6fd4afbcd7db81",
                "sha256": "da42b95524336378539b03db0f300e3763a9f1901fec3d51325cf11f2172931c"
            },
            "downloads": -1,
            "filename": "pydatamax-0.2.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "89761221d89a65c5cf6fd4afbcd7db81",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10",
            "size": 5071,
            "upload_time": "2025-09-03T17:39:41",
            "upload_time_iso_8601": "2025-09-03T17:39:41.171479Z",
            "url": "https://files.pythonhosted.org/packages/41/cf/166de21835db9dd74aa368982a2fdd572ecdc893ec279b6e8eed94e8e1e6/pydatamax-0.2.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "9ae0fcd26eeea6dd730a19d7816e82e5fe65df3fce2d9a384937ac09c421a344",
                "md5": "1edc6c86c7e4ad896b57c0969776fea9",
                "sha256": "5a51e26feb96e2d6041372b16e363bcfabf321bcaa78a78c2be396a9ec5f8522"
            },
            "downloads": -1,
            "filename": "pydatamax-0.2.0.tar.gz",
            "has_sig": false,
            "md5_digest": "1edc6c86c7e4ad896b57c0969776fea9",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 177492,
            "upload_time": "2025-09-03T17:39:42",
            "upload_time_iso_8601": "2025-09-03T17:39:42.263776Z",
            "url": "https://files.pythonhosted.org/packages/9a/e0/fcd26eeea6dd730a19d7816e82e5fe65df3fce2d9a384937ac09c421a344/pydatamax-0.2.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-09-03 17:39:42",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "Hi-Dolphin",
    "github_project": "datamax",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [
        {
            "name": "oss2",
            "specs": [
                [
                    "<",
                    "3.0.0"
                ],
                [
                    ">=",
                    "2.19.1"
                ]
            ]
        },
        {
            "name": "aliyun-python-sdk-core",
            "specs": [
                [
                    ">=",
                    "2.16.0"
                ],
                [
                    "<",
                    "3.0.0"
                ]
            ]
        },
        {
            "name": "aliyun-python-sdk-kms",
            "specs": [
                [
                    "<",
                    "3.0.0"
                ],
                [
                    ">=",
                    "2.16.5"
                ]
            ]
        },
        {
            "name": "crcmod",
            "specs": [
                [
                    "<",
                    "2.0.0"
                ],
                [
                    ">=",
                    "1.7"
                ]
            ]
        },
        {
            "name": "langdetect",
            "specs": [
                [
                    ">=",
                    "1.0.9"
                ],
                [
                    "<",
                    "2.0.0"
                ]
            ]
        },
        {
            "name": "loguru",
            "specs": [
                [
                    ">=",
                    "0.7.3"
                ],
                [
                    "<",
                    "1.0.0"
                ]
            ]
        },
        {
            "name": "python-docx",
            "specs": [
                [
                    "<",
                    "2.0.0"
                ],
                [
                    ">=",
                    "1.1.2"
                ]
            ]
        },
        {
            "name": "python-dotenv",
            "specs": [
                [
                    "<",
                    "2.0.0"
                ],
                [
                    ">=",
                    "1.1.0"
                ]
            ]
        },
        {
            "name": "pymupdf",
            "specs": [
                [
                    "<",
                    "2.0.0"
                ],
                [
                    ">=",
                    "1.26.0"
                ]
            ]
        },
        {
            "name": "pypdf",
            "specs": [
                [
                    ">=",
                    "5.5.0"
                ],
                [
                    "<",
                    "6.0.0"
                ]
            ]
        },
        {
            "name": "openpyxl",
            "specs": [
                [
                    ">=",
                    "3.1.5"
                ],
                [
                    "<",
                    "4.0.0"
                ]
            ]
        },
        {
            "name": "pandas",
            "specs": [
                [
                    ">=",
                    "2.2.3"
                ],
                [
                    "<",
                    "3.0.0"
                ]
            ]
        },
        {
            "name": "numpy",
            "specs": [
                [
                    ">=",
                    "2.2.6"
                ],
                [
                    "<",
                    "3.0.0"
                ]
            ]
        },
        {
            "name": "requests",
            "specs": [
                [
                    "<",
                    "3.0.0"
                ],
                [
                    ">=",
                    "2.32.3"
                ]
            ]
        },
        {
            "name": "tqdm",
            "specs": [
                [
                    ">=",
                    "4.67.1"
                ],
                [
                    "<",
                    "5.0.0"
                ]
            ]
        },
        {
            "name": "pydantic",
            "specs": [
                [
                    "<",
                    "3.0.0"
                ],
                [
                    ">=",
                    "2.11.5"
                ]
            ]
        },
        {
            "name": "pydantic-settings",
            "specs": [
                [
                    ">=",
                    "2.9.1"
                ],
                [
                    "<",
                    "3.0.0"
                ]
            ]
        },
        {
            "name": "python-magic",
            "specs": [
                [
                    "<",
                    "1.0.0"
                ],
                [
                    ">=",
                    "0.4.27"
                ]
            ]
        },
        {
            "name": "PyYAML",
            "specs": [
                [
                    ">=",
                    "6.0.2"
                ],
                [
                    "<",
                    "7.0.0"
                ]
            ]
        },
        {
            "name": "Pillow",
            "specs": [
                [
                    ">=",
                    "11.2.1"
                ],
                [
                    "<",
                    "12.0.0"
                ]
            ]
        },
        {
            "name": "packaging",
            "specs": [
                [
                    "<",
                    "25.0"
                ],
                [
                    ">=",
                    "24.2"
                ]
            ]
        },
        {
            "name": "beautifulsoup4",
            "specs": [
                [
                    ">=",
                    "4.13.4"
                ],
                [
                    "<",
                    "5.0.0"
                ]
            ]
        },
        {
            "name": "minio",
            "specs": [
                [
                    "<",
                    "8.0.0"
                ],
                [
                    ">=",
                    "7.2.15"
                ]
            ]
        },
        {
            "name": "openai",
            "specs": [
                [
                    "<",
                    "2.0.0"
                ],
                [
                    ">=",
                    "1.82.0"
                ]
            ]
        },
        {
            "name": "jionlp",
            "specs": [
                [
                    "<",
                    "2.0.0"
                ],
                [
                    ">=",
                    "1.5.23"
                ]
            ]
        },
        {
            "name": "chardet",
            "specs": [
                [
                    "<",
                    "6.0.0"
                ],
                [
                    ">=",
                    "5.2.0"
                ]
            ]
        },
        {
            "name": "olefile",
            "specs": [
                [
                    ">=",
                    "0.46"
                ]
            ]
        },
        {
            "name": "python-pptx",
            "specs": [
                [
                    ">=",
                    "1.0.2"
                ],
                [
                    "<",
                    "2.0.0"
                ]
            ]
        },
        {
            "name": "tiktoken",
            "specs": [
                [
                    "<",
                    "1.0.0"
                ],
                [
                    ">=",
                    "0.9.0"
                ]
            ]
        },
        {
            "name": "markitdown",
            "specs": [
                [
                    "<",
                    "1.0.0"
                ],
                [
                    ">=",
                    "0.1.1"
                ]
            ]
        },
        {
            "name": "xlrd",
            "specs": [
                [
                    ">=",
                    "2.0.1"
                ],
                [
                    "<",
                    "3.0.0"
                ]
            ]
        },
        {
            "name": "tabulate",
            "specs": [
                [
                    "<",
                    "1.0.0"
                ],
                [
                    ">=",
                    "0.9.0"
                ]
            ]
        },
        {
            "name": "unstructured",
            "specs": [
                [
                    "<",
                    "1.0.0"
                ],
                [
                    ">=",
                    "0.17.2"
                ]
            ]
        },
        {
            "name": "markdown",
            "specs": [
                [
                    "<",
                    "4.0.0"
                ],
                [
                    ">=",
                    "3.8"
                ]
            ]
        },
        {
            "name": "langchain",
            "specs": [
                [
                    "<",
                    "1.0.0"
                ],
                [
                    ">=",
                    "0.3.0"
                ]
            ]
        },
        {
            "name": "langchain-community",
            "specs": [
                [
                    "<",
                    "1.0.0"
                ],
                [
                    ">=",
                    "0.3.0"
                ]
            ]
        },
        {
            "name": "langchain-text-splitters",
            "specs": [
                [
                    "<",
                    "1.0.0"
                ],
                [
                    ">=",
                    "0.3.0"
                ]
            ]
        },
        {
            "name": "ebooklib",
            "specs": [
                [
                    "==",
                    "0.19"
                ]
            ]
        },
        {
            "name": "setuptools",
            "specs": []
        },
        {
            "name": "transformers",
            "specs": [
                [
                    "==",
                    "4.53.1"
                ]
            ]
        },
        {
            "name": "pytest",
            "specs": [
                [
                    ">=",
                    "7.0.0"
                ]
            ]
        },
        {
            "name": "pytest-asyncio",
            "specs": [
                [
                    ">=",
                    "0.21.0"
                ]
            ]
        },
        {
            "name": "pytest-cov",
            "specs": [
                [
                    ">=",
                    "4.0.0"
                ]
            ]
        },
        {
            "name": "pytest-mock",
            "specs": [
                [
                    ">=",
                    "3.10.0"
                ]
            ]
        },
        {
            "name": "pytest-timeout",
            "specs": [
                [
                    ">=",
                    "2.1.0"
                ]
            ]
        }
    ],
    "tox": true,
    "lcname": "pydatamax"
}
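
Each entry in the `requirements` array of the raw data pairs a package name with a list of `[operator, version]` constraints. As a sketch of how that structure maps back to pip-style requirement strings, the hypothetical helper below joins each entry into one line. It illustrates the data layout only and is not code from the package or the index.

```python
from typing import Any, Dict


def to_requirement_line(entry: Dict[str, Any]) -> str:
    """Turn {"name": ..., "specs": [[op, version], ...]} into "name<op><version>,...".

    Hypothetical helper for illustration; an empty "specs" list yields the bare name.
    """
    specs = ",".join(f"{op}{version}" for op, version in entry["specs"])
    return f"{entry['name']}{specs}"


# Two entries copied from the raw data.
requirements = [
    {"name": "oss2", "specs": [["<", "3.0.0"], [">=", "2.19.1"]]},
    {"name": "setuptools", "specs": []},
]

for entry in requirements:
    print(to_requirement_line(entry))
# prints:
# oss2<3.0.0,>=2.19.1
# setuptools
```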
        