vision-data-curation


- Name: vision-data-curation
- Version: 0.0.1.dev7
- Summary: A pipeline for curating and sanitizing large-scale image datasets.
- Upload time: 2025-09-06 09:53:23
- Author: Ofer Hasson
- Requires Python: >=3.11
- Keywords: image-processing, data-curation, computer-vision, pytorch, deep-learning

# Vision Data Curation (VDC)

A lightweight framework for cleaning, filtering, and sampling large-scale image datasets.
Built for computer vision researchers and practitioners who want higher-quality data with less manual effort.

## Status

This project is in early development. Most features are functional, but APIs may still change.

- Implemented Features: Input validation, Example-based filtering, Aesthetic filtering, NSFW filtering, Hierarchical sampling
- Features in Progress: Duplicate removal, Rotation correction

Feedback and contributions are welcome.

## Features

VDC provides modular tools for dataset cleanup:

- **Input validation** - detect corrupt files, invalid formats, low resolution, or extreme aspect ratios
- **Example-based filtering** - remove images similar to a set of unwanted examples
- **Image quality filtering** - remove images based on aesthetic score or NSFW classification
- **Duplicate removal** (in progress) - identify and remove near-duplicate images from your dataset
- **Hierarchical K-Means sampling** - select diverse, representative subsets from large datasets
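
As an illustration of the idea behind example-based filtering, the sketch below compares image embeddings against a set of unwanted examples using cosine similarity. It is not VDC's implementation: the function names, the `0.9` threshold, and the toy three-dimensional embeddings are all assumptions made for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def filter_by_examples(embeddings, unwanted, threshold=0.9):
    """Split image names into (kept, removed).

    An image is removed when its embedding is at least `threshold`
    similar to any unwanted example. The threshold is an illustrative
    value, not a VDC default.
    """
    kept, removed = [], []
    for name, vector in embeddings.items():
        best = max((cosine_similarity(vector, u) for u in unwanted), default=0.0)
        (removed if best >= threshold else kept).append(name)
    return kept, removed


# Hypothetical toy embeddings; real ones would come from a vision model.
embeddings = {
    "cat.jpg": [1.0, 0.0, 0.0],
    "dog.jpg": [0.0, 1.0, 0.0],
    "banner_ad.jpg": [0.0, 0.1, 1.0],
}
unwanted = [[0.0, 0.0, 1.0]]

kept, removed = filter_by_examples(embeddings, unwanted)
print("kept:", kept)
print("removed:", removed)
```

Here only `banner_ad.jpg` lies close to the unwanted example, so it is the only image removed.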

**Coming soon:**

- Rotation correction (correct 90°/180°/270° orientation errors)

## The Curation Pipeline

```mermaid
flowchart LR
    A[Raw<br/>Dataset] --> V[Validation]
    V --> R[Rotation*]
    R --> D[Dedup*]
    D --> E[Example<br/>Filter]
    E --> Q[Quality Filter<br/>Aesthetic/NSFW]
    Q --> S[Cluster-based<br/>Sampling]
    S --> F[Curated<br/>Dataset]

    U[Unwanted<br/>Examples] --> E
```

Note: steps marked with * are a work in progress (WIP).
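
To make the flow of the diagram concrete, here is a minimal sketch of a pipeline as an ordered list of stages, where each stage takes a list of image records and returns the ones that survive. The record fields (`width`, `height`, `aesthetic`), the thresholds, and the function names are hypothetical and chosen for illustration; VDC itself runs each step as a separate script.

```python
def validate(records, min_side=64, max_aspect=4.0):
    """Drop records that are too small or have an extreme aspect ratio."""
    kept = []
    for record in records:
        short = min(record["width"], record["height"])
        long = max(record["width"], record["height"])
        if short < min_side:
            continue  # low resolution (also guards the division below)
        if long / short > max_aspect:
            continue  # extreme aspect ratio
        kept.append(record)
    return kept


def quality_filter(records, min_score=0.5):
    """Keep records whose (assumed) aesthetic score is high enough."""
    return [r for r in records if r["aesthetic"] >= min_score]


def run_pipeline(records, stages):
    """Apply each stage in order and report how many records remain."""
    for stage in stages:
        before = len(records)
        records = stage(records)
        print(f"{stage.__name__}: {before} -> {len(records)}")
    return records


# Hypothetical records; in practice these would be built from real files.
records = [
    {"path": "a.jpg", "width": 640, "height": 480, "aesthetic": 0.8},
    {"path": "b.jpg", "width": 32, "height": 32, "aesthetic": 0.9},
    {"path": "c.jpg", "width": 2000, "height": 100, "aesthetic": 0.7},
    {"path": "d.jpg", "width": 800, "height": 600, "aesthetic": 0.2},
]

curated = run_pipeline(records, [validate, quality_filter])
print([r["path"] for r in curated])
```

Keeping every stage to the same "records in, records out" shape means stages can be reordered or skipped without changing the others, which mirrors the modular design described above.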

## Installation

### From PyPI

```sh
pip install vision-data-curation
```

### From Source

```sh
git clone https://gitlab.com/birder/vision-data-curation.git
cd vision-data-curation
pip install -e .
```

When working from the project root, you can run the scripts and use the configuration files as if the package were fully installed.

## Usage

Each step is a script under `vdc.scripts`.

Examples:

```sh
# Remove corrupt/invalid images
python -m vdc.scripts.sanitize_images data/raw_images/

# Filter based on "Unwanted examples"
python -m vdc.scripts.filter_by_examples data/embeddings.csv --examples bad_examples.csv
```

- Run `python -m vdc.scripts` to see available scripts
- Run `python -m vdc.scripts.<script> --help` for options
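
If you prefer to drive the steps from Python rather than a shell, one possible approach is to wrap the same `python -m vdc.scripts.<script>` invocations with `subprocess`. The helper names `build_command` and `run_step` are hypothetical, and the sketch defaults to a dry run that only prints the command, so it can be tried without VDC installed.

```python
import shlex
import subprocess
import sys


def build_command(script, *args):
    """Build the argument list for `python -m vdc.scripts.<script> ...`."""
    return [sys.executable, "-m", f"vdc.scripts.{script}", *map(str, args)]


def run_step(script, *args, dry_run=True):
    """Print the command (dry run) or execute it and raise on failure."""
    command = build_command(script, *args)
    if dry_run:
        print(shlex.join(command))
        return None
    return subprocess.run(command, check=True)


# The same two steps as in the shell examples above.
run_step("sanitize_images", "data/raw_images/")
run_step("filter_by_examples", "data/embeddings.csv", "--examples", "bad_examples.csv")
```

Pass `dry_run=False` to actually execute a step; `check=True` makes a non-zero exit code raise `subprocess.CalledProcessError`, so a failed step stops the sequence.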

Configuration:

- Default settings live in [vdc/conf/config.json](https://gitlab.com/birder/vision-data-curation/-/blob/main/vdc/conf/config.json)
- A `config.json` in your project root will take precedence (or pass `--config` to any script)
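
The lookup order described above can be sketched as follows. This is an assumption-laden illustration, not VDC's loader: the function name `resolve_config` is hypothetical, the sketch assumes an explicit `--config` path wins over a project-root `config.json`, and it does not attempt to show whether VDC merges files or replaces the defaults wholesale. The `source` key exists only for this demo.

```python
import json
import tempfile
from pathlib import Path


def resolve_config(default_path, project_root, explicit=None):
    """Pick the config file to load.

    Assumed order: explicit path, then <project_root>/config.json,
    then the packaged default.
    """
    if explicit is not None:
        return Path(explicit)
    candidate = Path(project_root) / "config.json"
    if candidate.is_file():
        return candidate
    return Path(default_path)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    default = root / "defaults.json"  # stand-in for vdc/conf/config.json
    default.write_text(json.dumps({"source": "default"}))
    project = root / "project"

    # No project-level config yet: the default is chosen.
    first = json.loads(resolve_config(default, project).read_text())["source"]

    # Once a project-level config.json exists, it takes precedence.
    project.mkdir()
    (project / "config.json").write_text(json.dumps({"source": "project"}))
    second = json.loads(resolve_config(default, project).read_text())["source"]

    # An explicit path (as with --config) is assumed to win over both.
    third = resolve_config(default, project, explicit=default).name

print(first, second, third)
```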

## License

This project is licensed under the Apache-2.0 License - see the [LICENSE](https://gitlab.com/birder/vision-data-curation/blob/main/LICENSE) file for details.

            
