a-data-processing

- Name: a-data-processing
- Version: 0.0.1
- Home page: https://github.com/kubeagi/arcadia
- Summary: A library that prepares raw documents for downstream ML tasks.
- Upload time: 2024-02-02 06:44:32
- Author: ggservice007
- Requires Python: >=3.9.0,<3.12
- Keywords: pdf, word, web, parsing, preprocessing

# Data Processing

## Current Version Main Features

Data Processing handles data accessed through MinIO, databases, Web APIs, and similar sources. The supported data types include:
- txt
- json  
- doc
- html
- excel
- csv
- pdf
- markdown
- ppt
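
As an illustration of how the types listed above might be told apart, here is a minimal sketch that maps a file suffix to one of those type labels. The mapping and the `detect_type` helper are assumptions made for this example; the project's actual detection logic is not described here.

```python
from pathlib import Path
from typing import Optional

# Hypothetical mapping from file suffix to the type labels listed above.
# The choice of suffixes (e.g. ".docx" -> "doc") is an assumption.
SUFFIX_TO_TYPE = {
    ".txt": "txt",
    ".json": "json",
    ".doc": "doc",
    ".docx": "doc",
    ".html": "html",
    ".htm": "html",
    ".xls": "excel",
    ".xlsx": "excel",
    ".csv": "csv",
    ".pdf": "pdf",
    ".md": "markdown",
    ".ppt": "ppt",
    ".pptx": "ppt",
}


def detect_type(filename: str) -> Optional[str]:
    """Return the type label for a file name, or None if it is not supported."""
    return SUFFIX_TO_TYPE.get(Path(filename).suffix.lower())
```

Returning `None` for unknown suffixes lets the caller decide whether to skip the file or report an error.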

### Current Text Type Processing  

Text processing consists of four steps: cleaning abnormal data, filtering, de-duplication, and anonymization.
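
The four steps can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the project's implementation: the function names, the control-character rule used for "abnormal data", the minimum-word filter, and the email-masking rule are all hypothetical.

```python
import re
from typing import Iterable, List

# Assumption: "abnormal data" means non-printable control characters.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")
# Assumption: anonymization masks email addresses only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def clean(text: str) -> str:
    """Remove control characters and collapse runs of whitespace."""
    text = CONTROL_CHARS.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


def keep(text: str, min_words: int = 3) -> bool:
    """Filter rule: keep records with at least `min_words` words."""
    return len(text.split()) >= min_words


def anonymize(text: str) -> str:
    """Replace email addresses with a placeholder token."""
    return EMAIL.sub("[EMAIL]", text)


def process(records: Iterable[str]) -> List[str]:
    """Clean, filter, de-duplicate, then anonymize, preserving input order."""
    seen = set()
    result = []
    for raw in records:
        text = clean(raw)
        if not keep(text):
            continue
        if text in seen:  # exact-match de-duplication on the cleaned text
            continue
        seen.add(text)
        result.append(anonymize(text))
    return result
```

De-duplication runs on the cleaned text so that records differing only in whitespace or control characters count as the same record.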

## Design

![Design](../../docs/images/data-process.drawio.png)

## Local Development
### Software Requirements

Before setting up the local data-process environment, please make sure the following software is installed:

- Python 3.10.x
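
To confirm that the local interpreter is suitable before installing anything, a check like the following can help. It is a sketch: the bounds are taken from the `requires_python` value in the package metadata for this release (`>=3.9.0,<3.12`), and the `is_supported` helper is a hypothetical name introduced for this example.

```python
import sys
from typing import Tuple

# Bounds assumed from the package metadata: requires_python ">=3.9.0,<3.12".
MIN_VERSION = (3, 9, 0)
MAX_VERSION_EXCLUSIVE = (3, 12)


def is_supported(version: Tuple[int, ...]) -> bool:
    """Return True if a (major, minor, micro) version falls within the bounds."""
    return MIN_VERSION <= version[:3] and version[:2] < MAX_VERSION_EXCLUSIVE


if __name__ == "__main__":
    current = tuple(sys.version_info[:3])
    status = "supported" if is_supported(current) else "not supported"
    print("Python", ".".join(map(str, current)), "is", status)
```

Python 3.10.x, the version recommended above, falls inside this range.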

### Environment Setup

Install the Python dependencies listed in `requirements.txt`, for example with `pip install -r requirements.txt`.

### Running

Run the `server.py` file in the `src` directory, for example with `python src/server.py`.

# isort
isort is a tool that sorts the imports in your Python code alphabetically and groups them into sections. It helps keep import order consistent and clean.

## Install
```shell
pip install isort
```

## isort a file
```shell
isort src/server.py
```

## isort a directory
```shell
isort .
```


            
