# smartdownsample
**Fast image downsampling for large camera trap animal crop datasets**
`smartdownsample` helps select representative subsets of camera trap images, particularly centered animal crops. In many machine learning workflows, majority classes may contain hundreds of thousands of images. These often need to be downsampled for processing efficiency or dataset balance, but without losing valuable variation.
An ideal solution would retain only truly distinct images and exclude near-duplicates, but that is computationally expensive for large datasets. This package offers a practical compromise: fast downsampling that preserves diversity with minimal computation, reducing processing time from hours or days to minutes.
If you need mathematically optimal results, this tool is not the right fit. If you want a reasonably intelligent selection that is more effective than random sampling, `smartdownsample` is designed for you.
## Installation
```bash
pip install smartdownsample
```
## Usage
```python
from smartdownsample import sample_diverse

# List of image paths
my_image_list = [
    "path/to/img1.jpg",
    "path/to/img2.jpg",
    "path/to/img3.jpg",
    # ...
]

# Basic usage: select 1000 images from the list
selected = sample_diverse(
    image_paths=my_image_list,
    target_count=1000,
)
```
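In practice the path list is usually built from a directory tree rather than typed by hand. The helper below is a minimal sketch of one way to do that with the standard library; `collect_image_paths` and `IMAGE_SUFFIXES` are hypothetical names that are not part of `smartdownsample`, and the suffix set is an assumption to adjust for your data.

```python
from pathlib import Path

# Assumption: file extensions that count as images in your dataset.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def collect_image_paths(root):
    """Return all image files below `root` in a stable, sorted order."""
    root = Path(root)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
```

The returned list of `Path` objects can be passed as `image_paths`, which accepts both `str` and `Path` entries.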
## Parameters
| Parameter | Default | Description |
|-----------|---------|-------------|
| `image_paths` | Required | List of image file paths (str or Path objects) |
| `target_count` | Required | Exact number of images to select |
| `hash_size` | `8` | Perceptual hash size (8 recommended) |
| `n_workers` | `4` | Number of parallel workers for hash computation |
| `show_progress` | `True` | Display progress bars during processing |
| `random_seed` | `42` | Random seed for reproducible bucket selection |
| `show_summary` | `True` | Print bucket statistics and distribution summary |
| `show_distribution` | `False` | Show bucket distribution bar chart |
| `show_thumbnails` | `False` | Show 5×5 thumbnail grids for each bucket |
## How it works
The algorithm balances speed and diversity in four steps, followed by two optional visualization steps:
1. **Feature extraction**
   Each image is reduced to a compact set of visual features:
   - DHash (`2 bits`) → structure/edges
   - AHash (`1 bit`) → brightness/contrast
   - Color variance (`1 bit`) → grayscale vs. color
   - Overall brightness (`1 bit`) → dark vs. bright
   - Average color (`2 bits`, 4 categories) → dominant scene color (red/green/blue/neutral)

   This gives at most 128 theoretical buckets (2×2×2×2×2×4). Typical datasets produce 16–80 buckets, depending on their diversity.

   Examples of resulting groups:
   - Dark grayscale (night IR)
   - Bright blue snow scenes
   - Color forest images with mixed poses
2. **Bucket grouping**
   Images are assigned to similarity buckets based on these features.
3. **Selection across buckets**
   - Ensure at least one image per bucket (diversity first)
   - Fill the remaining quota proportionally from larger buckets
4. **Within-bucket selection**
   - Images within each bucket keep their natural folder-structure order
   - Locations, deployments, and sequences stay together in order
   - Take every stride-th image until the quota is met, giving a systematic sample across time and space
5. **Optionally show distribution chart**
- Vertical bar chart of kept vs. excluded images per bucket
<img src="https://github.com/PetervanLunteren/EcoAssist-metadata/blob/main/smartdown-sample/bar.png" width="50%">
6. **Optionally show thumbnail grids**
- 5×5 grids of the first 25 images from each bucket, for quick visual review
<img src="https://github.com/PetervanLunteren/EcoAssist-metadata/blob/main/smartdown-sample/grid.png" width="50%">
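Steps 1 to 4 above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the package's actual implementation: all function names are hypothetical, the feature values are taken as given rather than computed from pixels, the bit layout in `bucket_key` simply mirrors the 2×2×2×2×2×4 count, proportional filling uses largest-remainder rounding, and "every stride-th image" is approximated by evenly spaced indices.

```python
from collections import defaultdict

# Assumption: the four dominant-color categories named in step 1.
DOMINANT_COLORS = ("red", "green", "blue", "neutral")


def bucket_key(dhash_bits, ahash_bit, is_color, is_bright, dominant):
    """Step 1 (packing only): combine coarse features into an int in range(128)."""
    key = dhash_bits                  # 2 bits, values 0..3
    key = key * 2 + ahash_bit         # 1 bit
    key = key * 2 + int(is_color)     # 1 bit
    key = key * 2 + int(is_bright)    # 1 bit
    return key * 4 + DOMINANT_COLORS.index(dominant)  # 4 categories


def group_into_buckets(paths, keys):
    """Step 2: group paths by key, keeping each bucket in sorted path order."""
    buckets = defaultdict(list)
    for path, key in zip(paths, keys):
        buckets[key].append(path)
    return {key: sorted(members) for key, members in buckets.items()}


def allocate_quotas(bucket_sizes, target_count):
    """Step 3: one image per bucket, then share the rest by bucket size.

    Assumes len(bucket_sizes) <= target_count <= sum(bucket_sizes.values()).
    """
    quotas = {key: 1 for key in bucket_sizes}
    extra = target_count - len(quotas)
    # Images in each bucket that the guaranteed first pick does not cover.
    spare = {key: size - 1 for key, size in bucket_sizes.items()}
    total_spare = sum(spare.values())
    if extra <= 0 or total_spare == 0:
        return quotas
    for key in spare:
        quotas[key] += extra * spare[key] // total_spare  # share, rounded down
    # Hand out what rounding left over, largest fractional remainder first.
    leftover = target_count - sum(quotas.values())
    by_remainder = sorted(
        spare, key=lambda k: (extra * spare[k]) % total_spare, reverse=True
    )
    for key in by_remainder[:leftover]:
        quotas[key] += 1
    return quotas


def take_evenly_spaced(members, quota):
    """Step 4: pick `quota` items at a regular stride through an ordered list."""
    n = len(members)
    if quota >= n:
        return list(members)
    return [members[i * n // quota] for i in range(quota)]


def sample_from_buckets(paths, keys, target_count):
    """Run steps 2 to 4 on paths whose bucket keys are already known."""
    buckets = group_into_buckets(paths, keys)
    quotas = allocate_quotas({k: len(v) for k, v in buckets.items()}, target_count)
    selected = []
    for key in sorted(buckets):
        selected.extend(take_evenly_spaced(buckets[key], quotas[key]))
    return selected


# Toy data: two sites, ten images each; the first three per site are "night" shots.
night = bucket_key(0, 0, False, False, "neutral")
day = bucket_key(1, 1, True, True, "green")
paths, keys = [], []
for site in range(2):
    for i in range(10):
        paths.append(f"site{site}/img{i:03d}.jpg")
        keys.append(night if i < 3 else day)

subset = sample_from_buckets(paths, keys, target_count=5)
```

Because each bucket is sorted by path before the stride is applied, the picks are spread across both sites instead of clustering at the start of the first folder.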
## License
MIT License – see LICENSE file.