# Vision Data Curation (VDC)
A lightweight framework for cleaning, filtering, and sampling large-scale image datasets.
Built for computer vision researchers and practitioners who want higher-quality data with less manual effort.
## Status
This project is in early development. Most features are functional, but APIs may still change.
- Implemented Features: Input validation, Example-based filtering, Aesthetic filtering, NSFW filtering, Hierarchical sampling
- Features in Progress: Duplicate removal, Rotation correction
Feedback and contributions are welcome.
## Features
VDC provides modular tools for dataset cleanup:
- **Input validation** - detect corrupt files, invalid formats, low resolution, or extreme aspect ratios
- **Example-based filtering** - remove images similar to a set of unwanted examples
- **Image quality filtering** - remove images based on aesthetic score or NSFW classification
- **Duplicate removal** (in progress) - identify and remove near-duplicate images from your dataset
- **Hierarchical K-Means sampling** - select diverse, representative subsets from large datasets
**Coming soon:**
- Rotation correction (correct 90°/180°/270° orientation errors)
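As a rough illustration of the input validation checks listed above, the sketch below flags low-resolution images and extreme aspect ratios from width and height alone. It is not VDC's implementation: `SizeLimits`, `size_problems`, and both thresholds are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SizeLimits:
    # Hypothetical thresholds for illustration; not VDC's defaults.
    min_side: int = 64
    max_aspect_ratio: float = 4.0


def size_problems(width: int, height: int, limits: SizeLimits = SizeLimits()) -> list[str]:
    """Return the reasons an image of the given size would be rejected."""
    if width <= 0 or height <= 0:
        return ["invalid dimensions"]

    problems = []
    if min(width, height) < limits.min_side:
        problems.append("low resolution")

    # Aspect ratio as long side over short side, so it is always >= 1.
    if max(width, height) / min(width, height) > limits.max_aspect_ratio:
        problems.append("extreme aspect ratio")

    return problems


print(size_problems(1920, 1080))  # []
print(size_problems(2000, 50))    # ['low resolution', 'extreme aspect ratio']
```

Returning a list of reasons rather than a boolean makes it easy to report why each file was rejected.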
## The Curation Pipeline
```mermaid
flowchart LR
A[Raw<br/>Dataset] --> V[Validation]
V --> R[Rotation*]
R --> D[Dedup]
D --> E[Example<br/>Filter]
E --> Q[Quality Filter<br/>Aesthetic/NSFW]
Q --> S[Cluster-based<br/>Sampling]
S --> F[Curated<br/>Dataset]
U[Unwanted<br/>Examples] --> E
```
Note: steps marked with * are work in progress.
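To give a feel for the final sampling step, here is a minimal sketch of cluster-balanced sampling. It assumes cluster assignments already exist (for example from K-Means over image embeddings) and uses a single level of clusters, so it is a simplification and not VDC's hierarchical K-Means sampler. The function name `sample_per_cluster` and the file names are made up for the example.

```python
import random
from collections import defaultdict


def sample_per_cluster(cluster_of: dict[str, int], per_cluster: int, seed: int = 0) -> list[str]:
    """Pick up to `per_cluster` items from every cluster.

    `cluster_of` maps an image path to its cluster id (assumed to come from
    an earlier clustering step).
    """
    members = defaultdict(list)
    for path, cluster_id in cluster_of.items():
        members[cluster_id].append(path)

    rng = random.Random(seed)
    selected = []
    for cluster_id in sorted(members):
        # Sort first so the seed alone determines the result.
        paths = sorted(members[cluster_id])
        selected.extend(rng.sample(paths, min(per_cluster, len(paths))))

    return selected


# Cluster 0 is heavily over-represented; cluster 1 is rare.
assignments = {f"cat_{i}.jpg": 0 for i in range(8)} | {"owl_0.jpg": 1, "owl_1.jpg": 1}
subset = sample_per_cluster(assignments, per_cluster=2)
print(len(subset))  # 4
```

Sampling per cluster instead of uniformly over all images keeps rare clusters represented in the curated subset.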
## Installation
VDC requires Python 3.11 or later.
### From PyPI
```sh
pip install vision-data-curation
```
### From Source
```sh
git clone https://gitlab.com/birder/vision-data-curation.git
cd vision-data-curation
pip install -e .
```
When you work directly from the project root, scripts and configuration files behave as if the package were fully installed.
## Usage
Each step is a script under `vdc.scripts`.
Examples:
```sh
# Remove corrupt/invalid images
python -m vdc.scripts.sanitize_images data/raw_images/
# Filter based on "Unwanted examples"
python -m vdc.scripts.filter_by_examples data/embeddings.csv --examples bad_examples.csv
```
- Run `python -m vdc.scripts` to see available scripts
- Run `python -m vdc.scripts.<script> --help` for options
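Conceptually, example-based filtering compares each image embedding against the embeddings of the unwanted examples and drops anything too similar. The sketch below shows that idea with cosine similarity on in-memory vectors. It is an assumption about the general technique, not the logic of `filter_by_examples`; the function names, the `0.9` threshold, and the toy vectors are hypothetical, and the CSV layout the script expects is not covered here.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def split_by_similarity(
    embeddings: dict[str, list[float]],
    unwanted: list[list[float]],
    threshold: float = 0.9,  # hypothetical cut-off, tune per dataset
) -> tuple[list[str], list[str]]:
    """Return (kept, removed) image names.

    An image is removed when it is at least `threshold` similar to any
    unwanted example.
    """
    kept, removed = [], []
    for name, vector in embeddings.items():
        if any(cosine_similarity(vector, bad) >= threshold for bad in unwanted):
            removed.append(name)
        else:
            kept.append(name)
    return kept, removed


embeddings = {
    "a.jpg": [1.0, 0.0],
    "b.jpg": [0.0, 1.0],
    "c.jpg": [0.9, 0.1],
}
unwanted = [[1.0, 0.05]]
kept, removed = split_by_similarity(embeddings, unwanted)
print(kept)     # ['b.jpg']
print(removed)  # ['a.jpg', 'c.jpg']
```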
Configuration:
- Default settings live in [vdc/conf/config.json](https://gitlab.com/birder/vision-data-curation/-/blob/main/vdc/conf/config.json)
- A `config.json` in your project root takes precedence over the defaults; alternatively, pass `--config` to any script to point at a specific file
## License
This project is licensed under the Apache-2.0 License - see the [LICENSE](https://gitlab.com/birder/vision-data-curation/blob/main/LICENSE) file for details.
## Package metadata
- **Name** - vision-data-curation
- **Version** - 0.0.1.dev7
- **Summary** - A pipeline for curating and sanitizing large-scale image datasets.
- **Author** - Ofer Hasson
- **Requires Python** - >=3.11
- **Keywords** - image-processing, data-curation, computer-vision, pytorch, deep-learning
- **Homepage** - https://gitlab.com/birder/vision-data-curation
- **Issues** - https://gitlab.com/birder/vision-data-curation/-/issues
- **Uploaded** - 2025-09-06
- **Distributions** - `vision_data_curation-0.0.1.dev7-py3-none-any.whl` (wheel, 64580 bytes), `vision_data_curation-0.0.1.dev7.tar.gz` (sdist, 51968 bytes)