| Field | Value |
| --- | --- |
| Name | udata-hydra |
| Version | 1.0.1 |
| Summary | Async crawler and datalake service for data.gouv.fr |
| Author | Opendata Team |
| License | MIT |
| Requires Python | >=3.9,<4.0 |
| Docs URL | None |
| Upload time | 2023-01-04 06:16:46 |
# udata-hydra 🦀
`udata-hydra` is an async metadata crawler for [data.gouv.fr](https://www.data.gouv.fr).
URLs are crawled via _aiohttp_, catalog and crawled metadata are stored in a _PostgreSQL_ database.
## CLI
### Create database structure
Install the udata-hydra dependencies and CLI:
`poetry install`
`poetry run udata-hydra migrate`
### Load (UPSERT) latest catalog version from data.gouv.fr
`udata-hydra load-catalog`
## Crawler
`udata-hydra-crawl`
It crawls the catalog continuously, according to the config set in `config.py`.
`BATCH_SIZE` URLs are queued at each loop run.
The crawler starts with URLs that have never been checked, then proceeds with URLs last crawled more than `SINCE` ago. It then waits until something changes (the catalog or time).
There is a per-domain backoff mechanism: when, for a given domain in a given batch, more than `BACKOFF_NB_REQ` requests are made within a period of `BACKOFF_PERIOD` seconds, the crawler waits and retries until the backoff is lifted.
If a URL matches one of the `EXCLUDED_PATTERNS`, it is never checked.
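The per-domain backoff described above can be sketched as a sliding-window counter. This is an illustrative sketch only; the class name and default values are assumptions, not the actual crawler internals:

```python
import time
from collections import defaultdict, deque

BACKOFF_NB_REQ = 180   # illustrative defaults; the real values live in the config
BACKOFF_PERIOD = 360   # seconds

class DomainBackoff:
    """Sliding-window, per-domain request limiter (sketch)."""

    def __init__(self, nb_req=BACKOFF_NB_REQ, period=BACKOFF_PERIOD):
        self.nb_req, self.period = nb_req, period
        self.hits = defaultdict(deque)  # domain -> timestamps of recent requests

    def check(self, domain, now=None):
        """Return True if the domain is in backoff (caller should wait and retry)."""
        now = time.monotonic() if now is None else now
        window = self.hits[domain]
        # drop requests that fell out of the backoff period
        while window and now - window[0] > self.period:
            window.popleft()
        if len(window) >= self.nb_req:
            return True   # too many requests in the window: back off
        window.append(now)
        return False
```

A crawler loop would call `check(domain)` before each request and requeue the URL when it returns `True`.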
## Worker
A job queuing system is used to process long-running tasks. Launch the worker with the following command:
`poetry run rq worker -c udata_hydra.worker`
## API
### Run
```
poetry install
poetry run adev runserver udata_hydra/app.py
```
### Get latest check
Works with `?url={url}` and `?resource_id={resource_id}`.
```
$ curl -s "http://localhost:8000/api/checks/latest/?url=http://opendata-sig.saintdenis.re/datasets/661e19974bcc48849bbff7c9637c5c28_1.csv" | json_pp
{
"status" : 200,
"catalog_id" : 64148,
"deleted" : false,
"error" : null,
"created_at" : "2021-02-06T12:19:08.203055",
"response_time" : 0.830198049545288,
"url" : "http://opendata-sig.saintdenis.re/datasets/661e19974bcc48849bbff7c9637c5c28_1.csv",
"domain" : "opendata-sig.saintdenis.re",
"timeout" : false,
"id" : 114750,
"dataset_id" : "5c34944606e3e73d4a551889",
"resource_id" : "b3678c59-5b35-43ad-9379-fce29e5b56fe",
"headers" : {
"content-disposition" : "attachment; filename=\"xn--Dlimitation_des_cantons-bcc.csv\"",
"server" : "openresty",
"x-amz-meta-cachetime" : "191",
"last-modified" : "Wed, 29 Apr 2020 02:19:04 GMT",
"content-encoding" : "gzip",
"content-type" : "text/csv",
"cache-control" : "must-revalidate",
"etag" : "\"20415964703d9ccc4815d7126aa3a6d8\"",
"content-length" : "207",
"date" : "Sat, 06 Feb 2021 12:19:08 GMT",
"x-amz-meta-contentlastmodified" : "2018-11-19T09:38:28.490Z",
"connection" : "keep-alive",
"vary" : "Accept-Encoding"
}
}
```
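From Python, the same endpoint can be queried by composing the URL with either parameter. A minimal sketch; `BASE` assumes the local dev server and the helper name is hypothetical:

```python
from urllib.parse import urlencode

BASE = "http://localhost:8000/api/checks/latest/"  # assumed local dev server

def latest_check_url(url=None, resource_id=None):
    """Build the query string for the latest-check endpoint (either parameter works)."""
    if url is not None:
        return f"{BASE}?{urlencode({'url': url})}"
    return f"{BASE}?{urlencode({'resource_id': resource_id})}"
```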
### Get all checks for a URL or resource
Works with `?url={url}` and `?resource_id={resource_id}`.
```
$ curl -s "http://localhost:8000/api/checks/all/?url=http://www.drees.sante.gouv.fr/IMG/xls/er864.xls" | json_pp
[
{
"domain" : "www.drees.sante.gouv.fr",
"dataset_id" : "53d6eadba3a72954d9dd62f5",
"timeout" : false,
"deleted" : false,
"response_time" : null,
"error" : "Cannot connect to host www.drees.sante.gouv.fr:443 ssl:True [SSLCertVerificationError: (1, \"[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'www.drees.sante.gouv.fr'. (_ssl.c:1122)\")]",
"catalog_id" : 232112,
"url" : "http://www.drees.sante.gouv.fr/IMG/xls/er864.xls",
"headers" : {},
"id" : 165107,
"created_at" : "2021-02-06T14:32:47.675854",
"resource_id" : "93dfd449-9d26-4bb0-a6a9-ee49b1b8a4d7",
"status" : null
},
{
"timeout" : false,
"deleted" : false,
"response_time" : null,
"error" : "Cannot connect to host www.drees.sante.gouv.fr:443 ssl:True [SSLCertVerificationError: (1, \"[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'www.drees.sante.gouv.fr'. (_ssl.c:1122)\")]",
"domain" : "www.drees.sante.gouv.fr",
"dataset_id" : "53d6eadba3a72954d9dd62f5",
"created_at" : "2020-12-24T17:06:58.158125",
"resource_id" : "93dfd449-9d26-4bb0-a6a9-ee49b1b8a4d7",
"status" : null,
"catalog_id" : 232112,
"url" : "http://www.drees.sante.gouv.fr/IMG/xls/er864.xls",
"headers" : {},
"id" : 65092
}
]
```
### Get crawling status
```
$ curl -s "http://localhost:8000/api/status/crawler/" | json_pp
{
"fresh_checks_percentage" : 0.4,
"pending_checks" : 142153,
"total" : 142687,
"fresh_checks" : 534,
"checks_percentage" : 0.4
}
```
### Get worker status
```
$ curl -s "http://localhost:8000/api/status/worker/" | json_pp
{
"queued" : {
"default" : 0,
"high" : 825,
"low" : 655
}
}
```
### Get crawling stats
```
$ curl -s "http://localhost:8000/api/stats/" | json_pp
{
"status" : [
{
"count" : 525,
"percentage" : 98.3,
"label" : "ok"
},
{
"label" : "error",
"percentage" : 1.3,
"count" : 7
},
{
"label" : "timeout",
"percentage" : 0.4,
"count" : 2
}
],
"status_codes" : [
{
"code" : 200,
"count" : 413,
"percentage" : 78.7
},
{
"code" : 501,
"percentage" : 12.4,
"count" : 65
},
{
"percentage" : 6.1,
"count" : 32,
"code" : 404
},
{
"code" : 500,
"percentage" : 2.7,
"count" : 14
},
{
"code" : 502,
"count" : 1,
"percentage" : 0.2
}
]
}
```
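The `percentage` fields above are simple proportions of the total check count. A minimal sketch of such an aggregation (hypothetical, not the actual server code):

```python
def stats_breakdown(counts):
    """Turn raw per-label counts into count/percentage rows like the API returns."""
    total = sum(counts.values())
    return [
        {"label": label, "count": n, "percentage": round(100 * n / total, 1)}
        for label, n in sorted(counts.items(), key=lambda kv: -kv[1])
    ]
```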
## Using Webhook integration
**Set the config values**
Create a `config.toml` where your service and commands are launched, or point the `HYDRA_SETTINGS` environment variable at a TOML file. `config.toml` (or its equivalent) overrides values from `udata_hydra/config_default.toml`; look there for the values that can or need to be defined.
```toml
UDATA_URI = "https://dev.local:7000/api/2"
UDATA_URI_API_KEY = "example.api.key"
SENTRY_DSN = "https://{my-sentry-dsn}"
```
The webhook integration sends HTTP messages to `udata` when resources are analyzed or checked, to fill the resources' extras.
During analysis there is a phase called "change detection", which tries to guess whether a resource has been modified based on several criteria:
- harvest modified date in catalog
- content-length and last-modified headers
- checksum comparison over time
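Those criteria can be sketched as an ordered fallback chain. All names and return labels below (except `harvest-resource-metadata`, which appears in the payload) are illustrative, not the actual udata-hydra internals:

```python
import hashlib

def detect_last_modified(harvest_date, old_headers, new_headers, old_checksum, new_body):
    """Guess whether/when a resource changed, mirroring the criteria above (sketch).

    Returns a (last_modified, detection_method) pair; either may be None.
    """
    # 1. a harvest modified date from the catalog wins if present
    if harvest_date:
        return harvest_date, "harvest-resource-metadata"
    # 2. fall back to HTTP headers
    if new_headers.get("last-modified"):
        return new_headers["last-modified"], "last-modified-header"
    if old_headers.get("content-length") != new_headers.get("content-length"):
        return None, "content-length-changed"
    # 3. checksum comparison over time
    new_checksum = hashlib.sha1(new_body).hexdigest()
    if old_checksum and new_checksum != old_checksum:
        return None, "checksum-changed"
    return None, None
```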
The payload should look something like:
```json
{
"analysis:filesize": 91661,
"analysis:mime-type": "application/zip",
"analysis:checksum": "bef1de04601dedaf2d127418759b16915ba083be",
"analysis:last-modified-at": "2022-11-27T23:00:54.762000",
    "analysis:last-modified-detection": "harvest-resource-metadata"
}
```
## Development
### docker-compose
Multiple docker-compose files are provided:
- a minimal `docker-compose.yml` with PostgreSQL
- `docker-compose.broker.yml` adds a Redis broker
- `docker-compose.test.yml` launches a test DB, needed to run tests
NB: you can launch compose from multiple files like this: `docker-compose -f docker-compose.yml -f docker-compose.test.yml up`
### Logging & Debugging
The log level can be adjusted via the `LOG_LEVEL` environment variable.
For example, to set the log level to `DEBUG` when initializing the database: `LOG_LEVEL="DEBUG" udata-hydra init_db`.
### Writing a migration
1. Add a file named `migrations/{YYYYMMDD}_{from}_up_{to}.sql` containing the SQL needed to perform the migration. `from` is the previous revision (eg `rev1`), `to` the revision you are aiming at (eg `rev2`)
2. Set the latest revision (eg `rev2`) in `migrations/_LATEST_REVISION`
3. `udata-hydra migrate` uses the info from `_LATEST_REVISION` to upgrade to `rev2`. You can also specify it explicitly: `udata-hydra migrate --revision rev2`
## Deployment
Three services need to be deployed for the full stack to run:
- worker
- api / app
- crawler
Refer to each section to learn how to launch them. The only differences from dev to prod are:
- use `HYDRA_SETTINGS` env var to point to your custom `config.toml`
- use `HYDRA_APP_SOCKET_PATH` to configure where aiohttp should listen to a [reverse proxy connection (eg nginx)](https://docs.aiohttp.org/en/stable/deployment.html#nginx-configuration) and use `udata-hydra-app` to launch the app server