anhaltai-gbif-downloader


Name: anhaltai-gbif-downloader
Version: 2025.8.1
Summary: This project automatically downloads taxon-specific images from the GBIF API (https://techdocs.gbif.org/en/openapi/), processes them, and stores both images and metadata in a taxonomically organized structure in a MinIO (https://www.min.io/) bucket.
Author email: AnhaltAI <hello@anhalt.ai>
Upload time: 2025-08-28 13:17:02
Requires Python: >=3.12
Keywords: GBIF, image, downloader
# 🌳 GBIF Image Downloader

This project automatically downloads taxon-specific images from the [GBIF API](https://techdocs.gbif.org/en/openapi/),
processes them, and stores both images and metadata in a taxonomically organized structure in a
[MinIO](https://www.min.io/) bucket.

---

## Features

- Loads Latin taxon names from `.csv` or `.xlsx` files
- Resolves `taxonKeys` automatically via the GBIF API
- Downloads associated media (images) from GBIF
- Stores metadata and images in a taxonomic folder structure in MinIO
- Optionally processes only new GBIF occurrences (`crawl_new_entries`)
- Multithreading for parallel processing and uploads
- Logging directly to MinIO

---

## Installation

Install dependencies via:

```bash
pip install -r requirements.txt
```

---

## Usage

### 1. Prepare your input file

Create a `.csv` or `.xlsx` file with at least the following column:

| latin_name      |
|-----------------|
| Quercus robur   |
| Fagus sylvatica |
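
Such a file can be produced with the standard library's `csv` module; a minimal sketch (the path `data/tree_list.csv` is illustrative and should match `tree_list_input_path` in your config):

```python
import csv
from pathlib import Path

# Create a minimal input file with the required "latin_name" column.
Path("data").mkdir(exist_ok=True)
with open("data/tree_list.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["latin_name"])
    writer.writerows([["Quercus robur"], ["Fagus sylvatica"]])
```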

### 2. Adjust your configuration

Edit the file `config/config.yaml` to set your MinIO connection, output paths, and processing options.  
A typical configuration looks like this:

```yaml
minio:
  bucket: meinewaldki-gbif         # Name of your MinIO bucket
  endpoint: s3.anhalt.ai           # MinIO/S3 endpoint URL
  secure: true                     # Use HTTPS (true/false)
  cert_check: true                 # Check SSL certificates (true/false)

paths:
  output: gbif-test/               # Output directory for images and metadata
  tree_list_input_path: data/tree_list.xlsx   # Path to your input taxon list
  processed_tree_list_path: data/species_key.csv # Path for the processed taxonKey list
  log_dir: logs/                   # Directory for log files

query_params:
  mediaType: StillImage            # Only download images
  limit: 100                       # Number of records per API call
  offset: 0                        # Start offset

options:
  already_preprocessed: True       # Set False to process the taxon list again
  crawl_new_entries: False         # Only process new occurrences if True
  max_threads: 10                  # Number of parallel threads for downloads/uploads
```
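
The YAML above parses into a nested dictionary (for example with PyYAML's `yaml.safe_load`). A sketch of reading settings from the parsed structure, with the parsed values inlined here so the example is self-contained:

```python
# Parsed form of the config above (as yaml.safe_load would return it),
# inlined here for illustration.
config = {
    "minio": {"bucket": "meinewaldki-gbif", "endpoint": "s3.anhalt.ai",
              "secure": True, "cert_check": True},
    "query_params": {"mediaType": "StillImage", "limit": 100, "offset": 0},
    "options": {"already_preprocessed": True, "crawl_new_entries": False,
                "max_threads": 10},
}

max_threads = config["options"]["max_threads"]  # caps parallel downloads/uploads
endpoint = config["minio"]["endpoint"]          # MinIO/S3 endpoint to connect to
```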

#### Query Parameters for GBIF API URL

The parameters used to build the GBIF API request URL are defined in the `query_params` section of your
`config/config.yaml`. These parameters control which records are fetched from the GBIF API.

**Supported parameters:**

- `mediaType` (e.g. `StillImage`): Only download records with images.
- `taxonKey`: The taxon key.
- `datasetKey`: Filter by dataset.
- `country`: Filter by country code (e.g. `DE` for Germany).
- `hasCoordinate`: Only records with coordinates (`true` or `false`).
- `year`, `month`: Filter by year or month of occurrence.
- `basisOfRecord`: Type of record (e.g. `HUMAN_OBSERVATION`).
- `recordedBy`: Filter by collector/observer.
- `institutionCode`, `collectionCode`: Filter by institution or collection.
- `limit`: Number of records per API call (pagination, max. 300).
- `offset`: Start offset for pagination.

**How it works:**

- All parameters in `query_params` are automatically validated at startup.
- Only the above parameters are allowed. Invalid parameters will cause the program to stop with an error.
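
The validation and URL construction described above can be sketched as follows. This is an illustrative reimplementation, not the project's actual code; it targets GBIF's public occurrence-search endpoint (`https://api.gbif.org/v1/occurrence/search`), and the `build_search_url` helper is a hypothetical name:

```python
from urllib.parse import urlencode

# Parameters accepted in the query_params section (see the list above).
ALLOWED_PARAMS = {
    "mediaType", "taxonKey", "datasetKey", "country", "hasCoordinate",
    "year", "month", "basisOfRecord", "recordedBy",
    "institutionCode", "collectionCode", "limit", "offset",
}

def build_search_url(query_params: dict) -> str:
    """Validate query_params and build a GBIF occurrence-search URL."""
    unknown = set(query_params) - ALLOWED_PARAMS
    if unknown:
        # Mirrors the startup behavior: invalid parameters abort the run.
        raise ValueError(f"Unsupported GBIF query parameters: {sorted(unknown)}")
    return "https://api.gbif.org/v1/occurrence/search?" + urlencode(query_params)

url = build_search_url({"mediaType": "StillImage", "limit": 100, "offset": 0})
```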

### 3. Process taxonKey list and resolve taxonKeys

```python
from anhaltai.gbif_downloader.tree_list_processor import TreeListProcessor

processor = TreeListProcessor(input_path="data/tree_list.xlsx",
                              sheet_name="Gehölzarten", taxon="speciesKey")
processor.process_tree_list(output_path="data/species_key.csv")
```

### 4. Download media and metadata from GBIF

Run the main program:

```bash
PYTHONPATH=src python3 src/gbif_extractor/main.py
```

### Notes

- MinIO credentials must be set in `.env`; see `.env-example` for the required format.
- Log files are automatically uploaded to MinIO.
- Parallel processing and uploads are controlled by a configurable thread limit.
- Semaphores are used in this project to control the number of concurrent threads
  during uploads to MinIO.
- If `crawl_new_entries` is set to `True`, the program skips already-processed occurrences and fetches only new ones.
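
The semaphore-based thread limiting mentioned above can be sketched like this (illustrative only, not the project's actual upload code; the `upload` function stands in for a MinIO `put_object` call):

```python
import threading

MAX_THREADS = 10  # corresponds to options.max_threads in config.yaml
upload_slots = threading.BoundedSemaphore(MAX_THREADS)
results = []

def upload(object_name: str) -> None:
    # The semaphore blocks while MAX_THREADS uploads are already in flight,
    # so at most MAX_THREADS run concurrently.
    with upload_slots:
        results.append(object_name)  # placeholder for the actual MinIO upload

threads = [threading.Thread(target=upload, args=(f"img_{i}.jpg",))
           for i in range(25)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```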
