| Name | anhaltai-gbif-downloader |
|------|--------------------------|
| Version | 2025.8.1 |
| Summary | This project automatically downloads taxon-specific images from the GBIF API (https://techdocs.gbif.org/en/openapi/), processes them, and stores both images and metadata in a taxonomically organized structure in a MinIO (https://www.min.io/) bucket. |
| upload_time | 2025-08-28 13:17:02 |
| requires_python | >=3.12 |
| keywords | gbif, image, downloader |
# 🌳 GBIF Image Downloader
This project automatically downloads taxon-specific images from the [GBIF API](https://techdocs.gbif.org/en/openapi/),
processes them, and stores both images and metadata in a taxonomically organized structure in a
[MinIO](https://www.min.io/) bucket.
---
## Features
- Loads Latin taxon names from `.csv` or `.xlsx` files
- Resolves `taxonKeys` automatically via the GBIF API
- Downloads associated media (images) from GBIF
- Stores metadata and images in a taxonomic folder structure in MinIO
- Optionally processes only new GBIF occurrences (`crawl_new_entries`)
- Multithreading for parallel processing and uploads
- Logging directly to MinIO
---
## Installation

Install dependencies via:

```bash
pip install -r requirements.txt
```

---

## Usage
### 1. Prepare your input file
Create a `.csv` or `.xlsx` file with at least the following column:
| latin_name |
|-----------------|
| Quercus robur |
| Fagus sylvatica |
### 2. Adjust your configuration
Edit the file `config/config.yaml` to set your MinIO connection, output paths, and processing options.
A typical configuration looks like this:
```yaml
minio:
  bucket: meinewaldki-gbif  # Name of your MinIO bucket
  endpoint: s3.anhalt.ai    # MinIO/S3 endpoint URL
  secure: true              # Use HTTPS (true/false)
  cert_check: true          # Check SSL certificates (true/false)

paths:
  output: gbif-test/                              # Output directory for images and metadata
  tree_list_input_path: data/tree_list.xlsx       # Path to your input taxon list
  processed_tree_list_path: data/species_key.csv  # Path for the processed taxonKey list
  log_dir: logs/                                  # Directory for log files

query_params:
  mediaType: StillImage  # Only download images
  limit: 100             # Number of records per API call
  offset: 0              # Start offset

options:
  already_preprocessed: True  # Set False to process the taxon list again
  crawl_new_entries: False    # Only process new occurrences if True
  max_threads: 10             # Number of parallel threads for downloads/uploads
```
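As a minimal sketch of how the `options` section above might be normalized into typed values before use (the field names mirror the config; the helper and its defaults are illustrative assumptions, not the package's actual code):

```python
# Hedged sketch: coerce the README's `options` section into typed values.
# Defaults match the example config above; the function name is hypothetical.
def normalize_options(options: dict) -> dict:
    return {
        "already_preprocessed": bool(options.get("already_preprocessed", True)),
        "crawl_new_entries": bool(options.get("crawl_new_entries", False)),
        "max_threads": int(options.get("max_threads", 10)),
    }
```

Coercing early like this keeps the rest of the pipeline free of string/boolean ambiguity from hand-edited YAML.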
#### Query Parameters for GBIF API URL
The parameters used to build the GBIF API request URL are defined in the `query_params` section of your
`config/config.yaml`. These parameters control which records are fetched from the GBIF API.
**Supported parameters:**
- `mediaType` (e.g. `StillImage`): Only download records with images.
- `taxonKey`: The taxon key.
- `datasetKey`: Filter by dataset.
- `country`: Filter by country code (e.g. `DE` for Germany).
- `hasCoordinate`: Only records with coordinates (`true` or `false`).
- `year`, `month`: Filter by year or month of occurrence.
- `basisOfRecord`: Type of record (e.g. `HUMAN_OBSERVATION`).
- `recordedBy`: Filter by collector/observer.
- `institutionCode`, `collectionCode`: Filter by institution or collection.
- `limit`: Number of records per API call (pagination, max. 300).
- `offset`: Start offset for pagination.
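The `limit`/`offset` pair implements standard GBIF-style pagination. A minimal sketch of that loop, with `fetch` standing in for an HTTP call to the occurrence-search endpoint (the GBIF API returns a JSON body containing `results` and `endOfRecords`):

```python
# Hedged sketch of limit/offset pagination as described above.
# `fetch` is an injected stand-in for the real HTTP request; it must return
# a dict with "results" (a list) and "endOfRecords" (a bool), as GBIF does.
def paginate(fetch, limit=100, offset=0):
    while True:
        page = fetch(limit=limit, offset=offset)
        yield from page["results"]
        if page["endOfRecords"]:
            break
        offset += limit
```

Injecting `fetch` keeps the pagination logic testable without network access.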
**How it works:**
- All parameters in `query_params` are automatically validated at startup.
- Only the above parameters are allowed. Invalid parameters will cause the program to stop with an error.
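The startup validation described above can be sketched as a whitelist check (a minimal illustration under the README's rules, not the package's actual implementation):

```python
# Hedged sketch: reject any query parameter not documented above, and
# enforce the GBIF pagination cap of 300 records per call.
ALLOWED_QUERY_PARAMS = {
    "mediaType", "taxonKey", "datasetKey", "country", "hasCoordinate",
    "year", "month", "basisOfRecord", "recordedBy",
    "institutionCode", "collectionCode", "limit", "offset",
}

def validate_query_params(params: dict) -> None:
    unknown = set(params) - ALLOWED_QUERY_PARAMS
    if unknown:
        raise ValueError(f"Unsupported query_params: {sorted(unknown)}")
    if "limit" in params and not 0 < int(params["limit"]) <= 300:
        raise ValueError("limit must be between 1 and 300")
```

Failing fast here means a typo in `config.yaml` surfaces at startup rather than as a silent empty result set.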
### 3. Process taxonKey list and resolve taxonKeys
```python
from anhaltai.gbif_downloader.tree_list_processor import TreeListProcessor

processor = TreeListProcessor(
    input_path="data/tree_list.xlsx",
    sheet_name="Gehölzarten",
    taxon="speciesKey",
)
processor.process_tree_list(output_path="data/species_key.csv")
```
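Under the hood, resolving a Latin name to a key comes down to a call against the public GBIF species-match endpoint. A self-contained sketch using only the standard library (an illustration of the lookup, not the `TreeListProcessor` internals; `species_match_url` and `resolve_species_key` are hypothetical helper names):

```python
# Hedged sketch: map a Latin name to a GBIF speciesKey via the public
# species-match endpoint (https://api.gbif.org/v1/species/match).
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def species_match_url(latin_name: str) -> str:
    # Build the match URL for a Latin name, e.g. "Quercus robur".
    return "https://api.gbif.org/v1/species/match?" + urlencode({"name": latin_name})

def resolve_species_key(latin_name: str):
    # Performs a network request; returns the speciesKey or None if unmatched.
    with urlopen(species_match_url(latin_name), timeout=30) as resp:
        return json.load(resp).get("speciesKey")
```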
### 4. Download media and metadata from GBIF
Run the main program:
```bash
PYTHONPATH=src python3 src/gbif_extractor/main.py
```
### Notes

- MinIO credentials must be set in `.env`; see `.env-example` for the required format.
- Log files are automatically uploaded to MinIO.
- Parallel processing and uploads are controlled by a configurable thread limit (`max_threads`).
- Semaphores are used to limit the number of concurrent upload threads to MinIO.
- If `crawl_new_entries` is set to `True`, the program skips already-known occurrences and processes only new ones.
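The semaphore pattern mentioned above can be sketched as follows (a minimal illustration: `upload` is a stand-in for the real MinIO upload call, and `bounded_upload_all` is a hypothetical name, not the package's API):

```python
# Hedged sketch: cap concurrent uploads with a BoundedSemaphore, mirroring
# the thread-limiting behavior described in the notes above.
import threading
from concurrent.futures import ThreadPoolExecutor

def bounded_upload_all(items, upload, max_threads=10):
    gate = threading.BoundedSemaphore(max_threads)

    def worker(item):
        with gate:  # at most `max_threads` uploads run at once
            return upload(item)

    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        return list(pool.map(worker, items))
```

Bounding concurrency this way keeps memory and connection use predictable when thousands of images are uploaded in parallel.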