udata-hydra

Name: udata-hydra
Version: 1.0.1
Summary: Async crawler and datalake service for data.gouv.fr
Author: Opendata Team
Requires Python: >=3.9,<4.0
License: MIT
Upload time: 2023-01-04 06:16:46
# udata-hydra 🦀

`udata-hydra` is an async metadata crawler for [data.gouv.fr](https://www.data.gouv.fr).

URLs are crawled via _aiohttp_, catalog and crawled metadata are stored in a _PostgreSQL_ database.

## CLI

### Create database structure

Install the udata-hydra dependencies and CLI:
`poetry install`

`poetry run udata-hydra migrate`

### Load (UPSERT) latest catalog version from data.gouv.fr

`udata-hydra load-catalog`

## Crawler

`udata-hydra-crawl`

It crawls the catalog continuously, according to the config set in `config.py`.

`BATCH_SIZE` URLs are queued at each loop run.

The crawler starts with URLs that have never been checked, then proceeds with URLs last crawled more than the `SINCE` interval ago. It then waits until something changes (catalog or time).

There's a per-domain backoff mechanism. The crawler waits when, for a given domain in a given batch, more than `BACKOFF_NB_REQ` requests have been made within a period of `BACKOFF_PERIOD` seconds. It retries until the backoff is lifted.

If a URL matches one of the `EXCLUDED_PATTERNS`, it will never be checked.
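The scheduling rules above can be sketched as follows. This is a simplified illustration, not udata-hydra's actual code; the excluded pattern is a hypothetical value.

```python
import re
import time
from collections import defaultdict, deque
from typing import Optional

# Illustrative values mirroring the config names described above.
EXCLUDED_PATTERNS = [r"\.zip$"]  # hypothetical pattern
BACKOFF_NB_REQ = 180             # max requests per domain per window
BACKOFF_PERIOD = 180             # window size, in seconds

_requests = defaultdict(deque)   # domain -> timestamps of recent requests

def is_excluded(url: str) -> bool:
    """A URL matching any excluded pattern is never checked."""
    return any(re.search(p, url) for p in EXCLUDED_PATTERNS)

def record_request(domain: str, now: Optional[float] = None) -> None:
    """Remember that a request was made to this domain."""
    _requests[domain].append(time.monotonic() if now is None else now)

def in_backoff(domain: str, now: Optional[float] = None) -> bool:
    """True when the domain exceeded BACKOFF_NB_REQ requests in BACKOFF_PERIOD."""
    if now is None:
        now = time.monotonic()
    window = _requests[domain]
    # Drop requests that fell outside the sliding window.
    while window and now - window[0] > BACKOFF_PERIOD:
        window.popleft()
    return len(window) >= BACKOFF_NB_REQ
```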

## Worker

A job queuing system is used to process long-running tasks. Launch the worker with the following command:

`poetry run rq worker -c udata_hydra.worker`

## API

### Run

```
poetry install
poetry run adev runserver udata_hydra/app.py
```

### Get latest check

Works with `?url={url}` and `?resource_id={resource_id}`.

```
$ curl -s "http://localhost:8000/api/checks/latest/?url=http://opendata-sig.saintdenis.re/datasets/661e19974bcc48849bbff7c9637c5c28_1.csv" | json_pp
{
   "status" : 200,
   "catalog_id" : 64148,
   "deleted" : false,
   "error" : null,
   "created_at" : "2021-02-06T12:19:08.203055",
   "response_time" : 0.830198049545288,
   "url" : "http://opendata-sig.saintdenis.re/datasets/661e19974bcc48849bbff7c9637c5c28_1.csv",
   "domain" : "opendata-sig.saintdenis.re",
   "timeout" : false,
   "id" : 114750,
   "dataset_id" : "5c34944606e3e73d4a551889",
   "resource_id" : "b3678c59-5b35-43ad-9379-fce29e5b56fe",
   "headers" : {
      "content-disposition" : "attachment; filename=\"xn--Dlimitation_des_cantons-bcc.csv\"",
      "server" : "openresty",
      "x-amz-meta-cachetime" : "191",
      "last-modified" : "Wed, 29 Apr 2020 02:19:04 GMT",
      "content-encoding" : "gzip",
      "content-type" : "text/csv",
      "cache-control" : "must-revalidate",
      "etag" : "\"20415964703d9ccc4815d7126aa3a6d8\"",
      "content-length" : "207",
      "date" : "Sat, 06 Feb 2021 12:19:08 GMT",
      "x-amz-meta-contentlastmodified" : "2018-11-19T09:38:28.490Z",
      "connection" : "keep-alive",
      "vary" : "Accept-Encoding"
   }
}
```
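When scripting against the API, a small helper can build these lookup URLs. `latest_check_url` is a hypothetical name, not part of udata-hydra:

```python
from urllib.parse import urlencode

def latest_check_url(base, url=None, resource_id=None):
    """Build a /api/checks/latest/ lookup URL from exactly one parameter."""
    if (url is None) == (resource_id is None):
        raise ValueError("pass exactly one of url or resource_id")
    params = {"url": url} if url is not None else {"resource_id": resource_id}
    return "{}/api/checks/latest/?{}".format(base.rstrip("/"), urlencode(params))
```

Assuming the `requests` library is available, the check above could then be fetched with `requests.get(latest_check_url("http://localhost:8000", resource_id="b3678c59-5b35-43ad-9379-fce29e5b56fe")).json()`.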

### Get all checks for a URL or resource

Works with `?url={url}` and `?resource_id={resource_id}`.

```
$ curl -s "http://localhost:8000/api/checks/all/?url=http://www.drees.sante.gouv.fr/IMG/xls/er864.xls" | json_pp
[
   {
      "domain" : "www.drees.sante.gouv.fr",
      "dataset_id" : "53d6eadba3a72954d9dd62f5",
      "timeout" : false,
      "deleted" : false,
      "response_time" : null,
      "error" : "Cannot connect to host www.drees.sante.gouv.fr:443 ssl:True [SSLCertVerificationError: (1, \"[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'www.drees.sante.gouv.fr'. (_ssl.c:1122)\")]",
      "catalog_id" : 232112,
      "url" : "http://www.drees.sante.gouv.fr/IMG/xls/er864.xls",
      "headers" : {},
      "id" : 165107,
      "created_at" : "2021-02-06T14:32:47.675854",
      "resource_id" : "93dfd449-9d26-4bb0-a6a9-ee49b1b8a4d7",
      "status" : null
   },
   {
      "timeout" : false,
      "deleted" : false,
      "response_time" : null,
      "error" : "Cannot connect to host www.drees.sante.gouv.fr:443 ssl:True [SSLCertVerificationError: (1, \"[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'www.drees.sante.gouv.fr'. (_ssl.c:1122)\")]",
      "domain" : "www.drees.sante.gouv.fr",
      "dataset_id" : "53d6eadba3a72954d9dd62f5",
      "created_at" : "2020-12-24T17:06:58.158125",
      "resource_id" : "93dfd449-9d26-4bb0-a6a9-ee49b1b8a4d7",
      "status" : null,
      "catalog_id" : 232112,
      "url" : "http://www.drees.sante.gouv.fr/IMG/xls/er864.xls",
      "headers" : {},
      "id" : 65092
   }
]
```

### Get crawling status

```
$ curl -s "http://localhost:8000/api/status/crawler/" | json_pp
{
   "fresh_checks_percentage" : 0.4,
   "pending_checks" : 142153,
   "total" : 142687,
   "fresh_checks" : 534,
   "checks_percentage" : 0.4
}
```

### Get worker status

```
$ curl -s "http://localhost:8000/api/status/worker/" | json_pp
{
   "queued" : {
      "default" : 0,
      "high" : 825,
      "low" : 655
   }
}
```
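As a quick illustration, the `queued` payload above can be aggregated with a tiny helper (hypothetical, not part of udata-hydra):

```python
def total_queued(status: dict) -> int:
    """Sum pending jobs across all rq queues in a /api/status/worker/ payload."""
    return sum(status["queued"].values())
```

With the sample response above, this gives 0 + 825 + 655 = 1480.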

### Get crawling stats

```
$ curl -s "http://localhost:8000/api/stats/" | json_pp
{
   "status" : [
      {
         "count" : 525,
         "percentage" : 98.3,
         "label" : "ok"
      },
      {
         "label" : "error",
         "percentage" : 1.3,
         "count" : 7
      },
      {
         "label" : "timeout",
         "percentage" : 0.4,
         "count" : 2
      }
   ],
   "status_codes" : [
      {
         "code" : 200,
         "count" : 413,
         "percentage" : 78.7
      },
      {
         "code" : 501,
         "percentage" : 12.4,
         "count" : 65
      },
      {
         "percentage" : 6.1,
         "count" : 32,
         "code" : 404
      },
      {
         "code" : 500,
         "percentage" : 2.7,
         "count" : 14
      },
      {
         "code" : 502,
         "count" : 1,
         "percentage" : 0.2
      }
   ]
}
```

## Using Webhook integration

**Set the config values**

Create a `config.toml` in the directory where your service and commands are launched, or point the `HYDRA_SETTINGS` environment variable at a TOML file. `config.toml` (or its equivalent) overrides values from `udata_hydra/config_default.toml`; look there for the values that can or need to be defined.

```toml
UDATA_URI = "https://dev.local:7000/api/2"
UDATA_URI_API_KEY = "example.api.key"
SENTRY_DSN = "https://{my-sentry-dsn}"
```

The webhook integration sends HTTP messages to `udata` when resources are analyzed or checked to fill resources extras.

Regarding analysis, there is a phase called "change detection". It tries to guess whether a resource has been modified based on several criteria:
- harvest modified date in catalog
- content-length and last-modified headers
- checksum comparison over time
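A minimal sketch of such a priority-ordered check follows. This is illustrative only: of the returned labels, only `harvest-resource-metadata` appears in the payload below; the others are hypothetical.

```python
from typing import Optional

def detect_last_modified(harvest_modified: Optional[str],
                         last_modified_header: Optional[str],
                         previous_checksum: Optional[str],
                         current_checksum: Optional[str]) -> Optional[str]:
    """Pick a last-modified detection method, in priority order (sketch)."""
    if harvest_modified:
        # Harvest modified date recorded in the catalog.
        return "harvest-resource-metadata"
    if last_modified_header:
        # HTTP last-modified header from the check.
        return "last-modified-header"
    if previous_checksum and current_checksum and previous_checksum != current_checksum:
        # Content checksum changed between two checks.
        return "checksum-comparison"
    return None
```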

The payload should look something like:

```json
{
   "analysis:filesize": 91661,
   "analysis:mime-type": "application/zip",
   "analysis:checksum": "bef1de04601dedaf2d127418759b16915ba083be",
   "analysis:last-modified-at": "2022-11-27T23:00:54.762000",
   "analysis:last-modified-detection": "harvest-resource-metadata"
}
```

## Development

### docker-compose

Multiple docker-compose files are provided:
- a minimal `docker-compose.yml` with PostgreSQL
- `docker-compose.broker.yml` adds a Redis broker
- `docker-compose.test.yml` launches a test DB, needed to run tests

NB: you can launch compose from multiple files like this: `docker-compose -f docker-compose.yml -f docker-compose.test.yml up`

### Logging & Debugging

The log level can be adjusted with the `LOG_LEVEL` environment variable.
For example, to set the log level to `DEBUG` when initializing the database, use `LOG_LEVEL="DEBUG" udata-hydra init_db`.

### Writing a migration

1. Add a file named `migrations/{YYYYMMDD}_{from}_up_{to}.sql` containing the SQL needed to perform the migration. `from` is the previous revision (eg `rev1`), `to` the revision you're aiming at (eg `rev2`)
2. Modify the latest revision (eg `rev2`) in `migrations/_LATEST_REVISION`
3. `udata-hydra migrate` will use the info from `_LATEST_REVISION` to upgrade to `rev2`. You can also specify `udata-hydra migrate --revision rev2`
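For example, adding a hypothetical migration from `rev2` to `rev3` could look like this (the date, revision names, and SQL are all illustrative):

```shell
mkdir -p migrations
# Write the migration SQL; table and column names are hypothetical.
cat > migrations/20230104_rev2_up_rev3.sql <<'SQL'
ALTER TABLE catalog ADD COLUMN example_col TEXT;
SQL
# Point _LATEST_REVISION at the new revision.
echo "rev3" > migrations/_LATEST_REVISION
# Then apply it (requires a configured database):
# udata-hydra migrate --revision rev3
```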

## Deployment

Three services need to be deployed for the full stack to run:
- worker
- api / app
- crawler

Refer to each section to learn how to launch them. The only differences from dev to prod are:
- use `HYDRA_SETTINGS` env var to point to your custom `config.toml`
- use `HYDRA_APP_SOCKET_PATH` to configure where aiohttp should listen for a [reverse proxy connection (eg nginx)](https://docs.aiohttp.org/en/stable/deployment.html#nginx-configuration) and use `udata-hydra-app` to launch the app server
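A minimal nginx fragment for such a setup might look like the following, following the aiohttp deployment pattern linked above. The socket path and server name are hypothetical; the socket must match whatever `HYDRA_APP_SOCKET_PATH` is set to.

```nginx
upstream hydra_app {
    # Must match HYDRA_APP_SOCKET_PATH (hypothetical value)
    server unix:/tmp/hydra.sock fail_timeout=0;
}

server {
    listen 80;
    server_name hydra.example.com;  # hypothetical

    location / {
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_pass http://hydra_app;
    }
}
```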

            
