# Data Processing
## Main Features of the Current Version
Data Processing processes data supplied through MinIO, databases, Web APIs, and similar sources. The supported data types are:
- txt
- json
- doc
- html
- excel
- csv
- pdf
- markdown
- ppt
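As a rough illustration of how the data types above could be told apart, the sketch below maps file extensions to the type names in the list. The mapping and the `detect_data_type` helper are hypothetical and are not taken from the project; the extensions chosen for each type (for example `.xlsx` for excel) are assumptions.

```python
from pathlib import Path

# Hypothetical mapping from file extension to the data types listed above.
# The extensions per type are assumptions made for this sketch.
EXTENSION_TO_TYPE = {
    ".txt": "txt",
    ".json": "json",
    ".doc": "doc",
    ".docx": "doc",
    ".html": "html",
    ".htm": "html",
    ".xls": "excel",
    ".xlsx": "excel",
    ".csv": "csv",
    ".pdf": "pdf",
    ".md": "markdown",
    ".markdown": "markdown",
    ".ppt": "ppt",
    ".pptx": "ppt",
}


def detect_data_type(filename):
    """Return the data type for a file name, or None if it is not recognized."""
    suffix = Path(filename).suffix.lower()
    return EXTENSION_TO_TYPE.get(suffix)


if __name__ == "__main__":
    for name in ["report.PDF", "notes.md", "archive.zip"]:
        print(name, "->", detect_data_type(name))
```

Unrecognized extensions return `None`, so a caller can decide whether to skip the file or report it.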
### Text Processing in the Current Version
For text data, processing currently includes: cleaning abnormal data, filtering, de-duplication, and anonymization.
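The four steps named above can be sketched with the standard library alone. This is a minimal illustration, not the project's implementation: the function names, the minimum-length filter, exact-match de-duplication, and email masking as the anonymization rule are all assumptions chosen for the example.

```python
import re
import unicodedata

# Assumed rules for this sketch only; the real project may use different ones.
CONTROL_RE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def clean(text):
    """Clean abnormal data: normalize Unicode, drop control characters, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = CONTROL_RE.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


def keep(text, min_chars=5):
    """Filter: keep only records with at least min_chars characters."""
    return len(text) >= min_chars


def anonymize(text):
    """Anonymize: mask email addresses with a placeholder."""
    return EMAIL_RE.sub("<EMAIL>", text)


def process(records, min_chars=5):
    """Run cleaning, filtering, exact de-duplication, and anonymization in order."""
    seen = set()
    result = []
    for raw in records:
        text = clean(raw)
        if not keep(text, min_chars):
            continue
        if text in seen:  # exact duplicate of an earlier cleaned record
            continue
        seen.add(text)
        result.append(anonymize(text))
    return result


sample = [
    "Hello,\x00   world!",
    "Hello, world!",
    "ok",
    "Contact me at alice@example.com today.",
]

if __name__ == "__main__":
    print(process(sample))
    # → ['Hello, world!', 'Contact me at <EMAIL> today.']
```

De-duplication runs on the cleaned text, so records that differ only in whitespace or control characters count as duplicates.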
## Design
![Design](../../docs/images/data-process.drawio.png)
## Local Development
### Software Requirements
Before setting up the local data-process environment, please make sure the following software is installed:
- Python 3.10.x
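A quick way to confirm the interpreter matches this requirement is a small check like the one below. It is only a sketch; the `is_supported` helper is hypothetical and not part of the project.

```python
import sys


def is_supported(version_info=sys.version_info):
    """Return True if the interpreter is a Python 3.10.x release."""
    return tuple(version_info[:2]) == (3, 10)


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    if is_supported():
        print(f"Python {current} matches the 3.10.x requirement.")
    else:
        print(f"Python {current} found; this README asks for 3.10.x.")
```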
### Environment Setup
Install the Python dependencies listed in the `requirements.txt` file.
### Running
Run the `server.py` file in the `src` directory.
# isort
isort is a tool that sorts the imports in your Python code alphabetically and groups them into sections. It helps keep import order consistent and clean.
## Install
```shell
pip install isort
```
## Sort a file
```shell
isort src/server.py
```
## Sort a directory
```shell
isort .
```
Raw data
{
"_id": null,
"home_page": "https://github.com/kubeagi/arcadia",
"name": "a-data-processing",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.9.0,<3.12",
"maintainer_email": "",
"keywords": "PDF WORD WEB parsing preprocessing",
"author": "ggservice007",
"author_email": "ggservice007@126.com",
"download_url": "https://files.pythonhosted.org/packages/30/70/001f4d1841f58cb92d82478c450a8ae0f21712ebdb93d45ca3f9ad6c3a5f/a-data-processing-0.0.1.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "",
"summary": "A library that prepares raw documents for downstream ML tasks.",
"version": "0.0.1",
"project_urls": {
"Homepage": "https://github.com/kubeagi/arcadia"
},
"split_keywords": [
"pdf",
"word",
"web",
"parsing",
"preprocessing"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "902484ecf0ab0a70ea980e2decdfe055021eb5b4086e99bb8df0d5da905f2601",
"md5": "d377e56c6410d49f4bc01b9fc7745376",
"sha256": "7b17845d30a734266a7ced56d0625404de65b5b91391d14ec7d2e45b577153a5"
},
"downloads": -1,
"filename": "a_data_processing-0.0.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "d377e56c6410d49f4bc01b9fc7745376",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9.0,<3.12",
"size": 2094,
"upload_time": "2024-02-02T06:44:31",
"upload_time_iso_8601": "2024-02-02T06:44:31.362338Z",
"url": "https://files.pythonhosted.org/packages/90/24/84ecf0ab0a70ea980e2decdfe055021eb5b4086e99bb8df0d5da905f2601/a_data_processing-0.0.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "3070001f4d1841f58cb92d82478c450a8ae0f21712ebdb93d45ca3f9ad6c3a5f",
"md5": "ee67abe21e7989f1511716fbe83024dd",
"sha256": "6be65c32a4e8ba62324fb12b19c121d692a623c40fe417caed744a73a9af4a0d"
},
"downloads": -1,
"filename": "a-data-processing-0.0.1.tar.gz",
"has_sig": false,
"md5_digest": "ee67abe21e7989f1511716fbe83024dd",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9.0,<3.12",
"size": 2798,
"upload_time": "2024-02-02T06:44:32",
"upload_time_iso_8601": "2024-02-02T06:44:32.767492Z",
"url": "https://files.pythonhosted.org/packages/30/70/001f4d1841f58cb92d82478c450a8ae0f21712ebdb93d45ca3f9ad6c3a5f/a-data-processing-0.0.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-02-02 06:44:32",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "kubeagi",
"github_project": "arcadia",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "a-data-processing"
}