pygestor


Name: pygestor
Version: 0.2.1
Home page: https://github.com/rlsn/Pygestor
Summary: A tool for dataset ingestion and management.
Upload time: 2024-08-14 13:36:34
Maintainer: None
Docs URL: None
Author: Yumo Wang
Requires Python: >=3.11
License: Copyright (c) 2024 rlsn Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Keywords: dataset, ingestion, management, machine learning
Requirements: No requirements were recorded.
# Pygestor
[![Python application](https://github.com/rlsn/Ingestor/actions/workflows/python-app.yml/badge.svg)](https://github.com/rlsn/Ingestor/actions/workflows/python-app.yml)
[![Publish Python Package](https://github.com/rlsn/Pygestor/actions/workflows/python-publish.yml/badge.svg)](https://github.com/rlsn/Pygestor/actions/workflows/python-publish.yml)
![GitHub deployments](https://img.shields.io/github/deployments/rlsn/Pygestor/pypi)
![GitHub Release](https://img.shields.io/github/v/release/rlsn/Pygestor)
![PyPI - Version](https://img.shields.io/pypi/v/pygestor)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

A data interface designed to seamlessly acquire, organize, and manage diverse datasets. It offers AI researchers a one-line downloader and data loader for quick access to data, while providing a scalable, easily managed system for future dataset acquisition.
<img src="imgs/ui.png" width="95%" />


## Key Features
- Dataset Acquisition & Usage:
     - Support for downloading and loading datasets with a simple one-line command.
     - Automatic handling of subsets and partitions for efficient data storage and access.
     - Support for batched dataset loading.
     - Support for adding new datasets via URL with minimal effort.

- Data Organization:
    - Three-level data organization structure: dataset, subset, and partition.
    - Support for both local and network file systems for data storage.
    - Efficient handling of large files by storing data in partitions.

- Web Interface:
    - A web UI for intuitive data management and analysis.
    - Support for viewing schemas, metadata, and data samples.
    - Ability to download and remove one subset or multiple partitions in one go.
    - Support for data searching and sorting.
    - Ability to generate code snippets for quick access to datasets.
    - Support for creating and deleting metadata for new datasets.

<img src="imgs/snippet.png" width="45%" /> <img src="imgs/preview.png" width="45%" />

## Quick Start
### Installation
Install from a source checkout:
```
pip install -r requirements.txt
```
or from PyPI:
```
pip install pygestor
```
The module can be used through a web UI, terminal commands, or Python APIs (which expose the most functionality). For an introduction to the Python APIs, refer to [this notebook](notebooks/api_demo.ipynb).

### Configurations
Edit [`confs/system.conf`](confs/system.conf) to change the default system settings. In particular, set `data_dir` to the desired data storage location, either a local path or a cloud NFS.

### Run GUI
```
python .\run-gui.py
```

For a usage guide on the CLI, refer to [docs/cli_usage.md](docs/cli_usage.md).

### Download Dataset
Datasets can be downloaded via the web UI or using the API. Run the following example script to download the '20231101.en' subset from [wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia) and the first 10 Parquet files from [wikimedia/wit_base](https://huggingface.co/datasets/wikimedia/wit_base):
```
python .\examples\download_example.py
```
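
For reference, the same kind of files can also be fetched directly with `huggingface_hub` (listed under Dependencies below), which is what makes Hugging Face datasets convenient to ingest. The following is a minimal sketch under an assumed repository layout and hypothetical target path, not pygestor's own code:

```python
# Illustrative only: pull one subset's Parquet shards straight from the Hugging Face Hub.
# pygestor's pipelines wrap this kind of download (see examples/download_example.py).
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="wikimedia/wikipedia",
    repo_type="dataset",
    allow_patterns=["20231101.en/*.parquet"],  # only the '20231101.en' subset (layout assumed)
    local_dir="data/wikimedia/wikipedia",      # hypothetical storage location
)
```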

## Adding a New Dataset

New datasets can be added using predefined ingestion and processing pipelines. For example, the [HuggingFaceParquet](pygestor/datasets/hf_parquet.py) pipeline can be used to ingest Parquet datasets from Hugging Face. It is recommended to use the WebUI for this process. In the "Add New" menu, fill in the dataset name, URL, and pipeline name to retrieve and save the metadata of the new dataset. For example:

- Dataset Name: facebook/multilingual_librispeech
- Dataset URL: https://huggingface.co/datasets/facebook/multilingual_librispeech
- Pipeline: HuggingFaceParquet

If a custom pipeline is required for datasets that don't fit the general pipelines, you will need to add a new pipeline to [pygestor/datasets](pygestor/datasets) that defines how to organize, download, and process the data. You can follow the example provided in [pygestor/datasets/wikipedia.py](pygestor/datasets/wikipedia.py). Ensure that the pipeline name matches your desired dataset name. After that, update the metadata by running 
```
python cli.py -init -d <new_dataset_name>
```
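
The actual pipeline interface is defined by the existing modules in [pygestor/datasets](pygestor/datasets), so the best starting point is to copy [wikipedia.py](pygestor/datasets/wikipedia.py). Purely as a hypothetical illustration of the responsibilities such a pipeline covers (all names below are assumed, not the real base-class contract):

```python
# Hypothetical skeleton only -- the real contract is whatever pygestor/datasets/wikipedia.py implements.
class MyCustomDataset:
    name = "my_custom_dataset"  # keep this in sync with the name passed to `cli.py -init -d`

    def list_subsets(self):
        # Enumerate subsets (version/language/split/annotation/...) for this dataset.
        return ["train", "test"]

    def download(self, subset: str, out_dir: str) -> None:
        # Fetch the raw files for one subset into out_dir, split into partitions.
        ...

    def process(self, partition_path: str):
        # Parse one partition into a table for downstream consumers.
        ...
```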

## Technical Details
### Storage
The data is stored in a file storage system and organized into three levels: dataset, subset (distinguished by version, language, class, split, annotation, etc.), and partition (splitting large files into smaller chunks for memory efficiency), as follows:

```
dataset_A
├── subset_a
│   ├── partition_1
│   └── partition_2
└── subset_b
    ├── partition_1
    └── partition_2
...
```
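
As a small, purely illustrative example of walking this layout on disk (the storage root is hypothetical and mirrors the tree above):

```python
# Illustrative only: enumerate datasets, subsets, and partitions under the storage root.
from pathlib import Path

data_dir = Path("data")  # hypothetical storage root (data_dir in confs/system.conf)
for dataset in sorted(p for p in data_dir.iterdir() if p.is_dir()):
    for subset in sorted(p for p in dataset.iterdir() if p.is_dir()):
        partitions = sorted(p.name for p in subset.iterdir())
        print(f"{dataset.name}/{subset.name}: {len(partitions)} partitions")
```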
File storage is used for its cost efficiency, scalability, and ease of management compared to other types of storage.

Dataset info and storage status are tracked in a metadata file, `metadata.json`, for efficient reference and updates.

### Dependencies
- python >= 3.11
- huggingface_hub: Provides native support for datasets hosted on Hugging Face, making it an ideal library for downloading.
- pyarrow: Used to read and write compressed Parquet files, a columnar file format designed for efficient data storage and retrieval.
- pandas: Used to structure the dataset into tabular form for downstream data consumers. It provides a handy API for data manipulation and access, as well as chunking and datatype adjustments for memory efficiency (see the sketch after this list).
- nicegui (optional): Used to serve the web UI frontend.
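
As a rough illustration of how these two libraries combine to read a Parquet partition in memory-friendly batches (the file path and batch size are hypothetical, and this is not pygestor's own loader):

```python
# Illustrative only: stream one Parquet partition in batches instead of loading it whole.
import pyarrow.parquet as pq

pf = pq.ParquetFile("dataset_A/subset_a/partition_1.parquet")  # hypothetical partition path
for batch in pf.iter_batches(batch_size=10_000):
    df = batch.to_pandas()  # hand each chunk to downstream consumers as a DataFrame
    print(len(df), list(df.columns))
```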

## Dataset Expansion
For a proposed management process to handle future dataset expansions, refer to [docs/dataset_expansion.md](docs/dataset_expansion.md).

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/rlsn/Pygestor",
    "name": "pygestor",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.11",
    "maintainer_email": "Yumo Wang <yumo1996@gmail.com>",
    "keywords": "dataset, ingestion, management, machine learning",
    "author": "Yumo Wang",
    "author_email": "Yumo Wang <yumo1996@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/e5/3d/339c76f1461a170f6e0e4a9df16535b92e942f12ecce41e474e1227fa1ef/pygestor-0.2.1.tar.gz",
    "platform": null,
    "description": "# Pygestor\n[![Python application](https://github.com/rlsn/Ingestor/actions/workflows/python-app.yml/badge.svg)](https://github.com/rlsn/Ingestor/actions/workflows/python-app.yml)\n[![Publish Python Package](https://github.com/rlsn/Pygestor/actions/workflows/python-publish.yml/badge.svg)](https://github.com/rlsn/Pygestor/actions/workflows/python-publish.yml)\n![GitHub deployments](https://img.shields.io/github/deployments/rlsn/Pygestor/pypi)\n![GitHub Release](https://img.shields.io/github/v/release/rlsn/Pygestor)\n![PyPI - Version](https://img.shields.io/pypi/v/pygestor)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n\nA data interface designed to seamlessly acquire, organize, and manage diverse datasets, offering AI researchers a one-line downloader and data-loader for quick access to data, while providing a scalable and easily manageable system for future dataset acquisition.\n<img src = \"imgs/ui.png\" width =\"95%\" />\n\n\n## Key Features\n- Dataset Acquisition & Usage:\n     - Support for downloading and loading datasets with a simple one-line command.\n     - Automatic handling of subsets and partitions for efficient data storage and access.\n     - Support dataset batched loading.\n     - Adding new datasets via URL with minimal effort\n\n- Data Organization:\n    - Three-level data organization structure: dataset, subset, and partition.\n    - Support for both local and network file systems for data storage.\n    - Efficient handling of large files by storing data in partitions.\n\n- Web Interface\n    - Introduced a web UI for intuitive data management and analysis.\n    - Support for viewing schema, metadata and data samples.\n    - Ability to download and remove one subset or multiple partitions in one go.\n    - Support for data searching and sorting.\n    - Ability to generate code snippets for quick access to datasets.\n    - Support for creating and deleting metadata for new datasets.\n\n <img src = \"imgs/snippet.png\" width =\"45%\" /> <img src = \"imgs/preview.png\" width =\"45%\" />\n\n## Quick Start\n### Installation\n```\npip install -r requirements.txt\n```\nor\n```\npip install pygestor\n```\nThe module can be used with a webUI, terminal commands or Python APIs (more functionalities). For Python APIs introductions please refer to [this notebook](notebooks/api_demo.ipynb).\n\n### Configurations\nEdit [`confs/system.conf`](confs/system.conf) to change the default system settings. In particular, set `data_dir` to the desired data storage location, either a local path or a cloud NFS.\n\n### Run GUI\n```\npython .\\run-gui.py\n```\n\nFor a usage guide on the CLI, refer to [docs/cli_usage.md](docs/cli_usage.md)\n\n### Download Dataset\nDatasets can be downloaded via the WebUI or using the API. Run the following example script to download '20231101.en' subset from [wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia), and the first 10 parquet files from [wikimedia/wit_base](https://huggingface.co/datasets/wikimedia/wit_base)\n```\npython .\\examples\\download_example.py\n```\n\n## Adding a New Dataset\n\nNew datasets can be added using predefined ingestion and processing pipelines. For example, the [HuggingFaceParquet](pygestor/datasets/hf_parquet.py) pipeline can be used to ingest Parquet datasets from Hugging Face. It is recommended to use the WebUI for this process. 
In the \"Add New\" menu, fill in the dataset name, URL, and pipeline name to retrieve and save the metadata of the new dataset. For example:\n\n- Dataset Name: facebook/multilingual_librispeech\n- Dataset URL: https://huggingface.co/datasets/facebook/multilingual_librispeech\n- Pipeline: HuggingFaceParquet\n\nIf a custom pipeline is required for datasets that don't fit the general pipelines, you will need to add a new pipeline to [pygestor/datasets](pygestor/datasets) that defines how to organize, download, and process the data. You can follow the example provided in [pygestor/datasets/wikipedia.py](pygestor/datasets/wikipedia.py). Ensure that the pipeline name matches your desired dataset name. After that, update the metadata by running \n```\npython cli.py -init -d <new_dataset_name>\n```\n\n## Technical Details\n### Storage\nThe data is stored in a file storage system and organized into three levels: dataset, subset (distinguished by version, language, class, split, annotation, etc.), and partition (splitting large files into smaller chunks for memory efficiency), as follows:\n\n```\ndataset_A\n\u251c\u2500\u2500 subset_a\n\u2502   \u251c\u2500\u2500 partition_1\n\u2502   \u2514\u2500\u2500 partition_2\n\u2514\u2500\u2500 subset_b\n    \u251c\u2500\u2500 partition_1\n    \u2514\u2500\u2500 partition_2\n...\n```\nFile storage is used for its comparatively high cost efficiency, scalability, and ease of management compared to other types of storage.\n\nThe dataset info and storage status is tracked by a metadata file `metadata.json` for efficient reference and update.\n\n### Dependencies\n- python >= 3.11\n- huggingface_hub: Provides native support for datasets hosted on Hugging Face, making it an ideal library for downloading.\n- pyarrow: Used to compress and extract parquet files, a data file format designed for efficient data storage and retrieval.\n- pandas: Used to structure the dataset info tabular form for downstream data consumers. It provides a handy API for data manipulation and access, as well as chunking and datatype adjustments for memory efficiency.\n- nicegui (optional): Used to serve webUI frontend\n\n## Dataset Expansion\nFor a proposed management process to handle future dataset expansions, refer to [docs/dataset_expansion.md](docs/dataset_expansion.md).\n",
    "bugtrack_url": null,
    "license": "Copyright (c) 2024 rlsn  Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:  The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.  THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.",
    "summary": "A tool for dataset ingestion and management.",
    "version": "0.2.1",
    "project_urls": {
        "Bug Tracker": "https://github.com/rlsn/Ingestor/issues",
        "Changelog": "https://github.com/rlsn/Ingestor",
        "Documentation": "https://github.com/rlsn/Ingestor",
        "Homepage": "https://github.com/rlsn/Ingestor",
        "Repository": "https://github.com/rlsn/Ingestor.git"
    },
    "split_keywords": [
        "dataset",
        " ingestion",
        " management",
        " machine learning"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "70c29f206e0ebf28601b93cb7bec4036037b22646bad28e8f1c029b8c1e86020",
                "md5": "d2be5857b5289bb5a5d9b016bd280c44",
                "sha256": "cb77fc93b3e6914ed4b287549a75504ff2dc3570af57b74d092248c4ed548677"
            },
            "downloads": -1,
            "filename": "pygestor-0.2.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "d2be5857b5289bb5a5d9b016bd280c44",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.11",
            "size": 22994,
            "upload_time": "2024-08-14T13:36:33",
            "upload_time_iso_8601": "2024-08-14T13:36:33.077962Z",
            "url": "https://files.pythonhosted.org/packages/70/c2/9f206e0ebf28601b93cb7bec4036037b22646bad28e8f1c029b8c1e86020/pygestor-0.2.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "e53d339c76f1461a170f6e0e4a9df16535b92e942f12ecce41e474e1227fa1ef",
                "md5": "c6746133afa1dbd90bad8874489bc57e",
                "sha256": "69796fe41b392efe4d88a230e123e2bcbc72f50daba9994ba41264b63be96f4d"
            },
            "downloads": -1,
            "filename": "pygestor-0.2.1.tar.gz",
            "has_sig": false,
            "md5_digest": "c6746133afa1dbd90bad8874489bc57e",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.11",
            "size": 20572,
            "upload_time": "2024-08-14T13:36:34",
            "upload_time_iso_8601": "2024-08-14T13:36:34.317946Z",
            "url": "https://files.pythonhosted.org/packages/e5/3d/339c76f1461a170f6e0e4a9df16535b92e942f12ecce41e474e1227fa1ef/pygestor-0.2.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-08-14 13:36:34",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "rlsn",
    "github_project": "Pygestor",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [],
    "lcname": "pygestor"
}
        