udata-hydra

Name: udata-hydra
Version: 1.0.1
Summary: Async crawler and datalake service for data.gouv.fr
Author: Opendata Team
Requires Python: >=3.9,<4.0
License: MIT
Upload time: 2023-01-04 06:16:46
# udata-hydra 🦀

`udata-hydra` is an async metadata crawler for [data.gouv.fr](https://www.data.gouv.fr).

URLs are crawled via _aiohttp_, catalog and crawled metadata are stored in a _PostgreSQL_ database.

## CLI

### Create database structure

Install the udata-hydra dependencies and CLI:
`poetry install`

`poetry run udata-hydra migrate`

### Load (UPSERT) latest catalog version from data.gouv.fr

`udata-hydra load-catalog`

## Crawler

`udata-hydra-crawl`

It crawls the catalog continuously, according to the config set in `config.py`.

`BATCH_SIZE` URLs are queued at each loop run.

The crawler starts with URLs that have never been checked, then proceeds with URLs last crawled more than the `SINCE` interval ago. It then waits until something changes (catalog or time).

There's a per-domain backoff mechanism. The crawler waits when, for a given domain in a given batch, more than `BACKOFF_NB_REQ` requests have been made within a period of `BACKOFF_PERIOD` seconds. It retries until the backoff is lifted.

If a URL matches one of the `EXCLUDED_PATTERNS`, it will never be checked.
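The scheduling rules above can be sketched as follows. This is a simplified illustration, not udata-hydra's actual code; the excluded pattern is a hypothetical value.

```python
import re
import time
from collections import defaultdict, deque
from typing import Optional

# Illustrative values mirroring the config names described above.
EXCLUDED_PATTERNS = [r"\.zip$"]  # hypothetical pattern
BACKOFF_NB_REQ = 180             # max requests per domain per window
BACKOFF_PERIOD = 180             # window size, in seconds

_requests = defaultdict(deque)   # domain -> timestamps of recent requests

def is_excluded(url: str) -> bool:
    """A URL matching any excluded pattern is never checked."""
    return any(re.search(p, url) for p in EXCLUDED_PATTERNS)

def record_request(domain: str, now: Optional[float] = None) -> None:
    """Remember that a request was made to this domain."""
    _requests[domain].append(time.monotonic() if now is None else now)

def in_backoff(domain: str, now: Optional[float] = None) -> bool:
    """True when the domain exceeded BACKOFF_NB_REQ requests in BACKOFF_PERIOD."""
    if now is None:
        now = time.monotonic()
    window = _requests[domain]
    # Drop requests that fell outside the sliding window.
    while window and now - window[0] > BACKOFF_PERIOD:
        window.popleft()
    return len(window) >= BACKOFF_NB_REQ
```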

## Worker

A job queuing system is used to process long-running tasks. Launch the worker with the following command:

`poetry run rq worker -c udata_hydra.worker`

## API

### Run

```
poetry install
poetry run adev runserver udata_hydra/app.py
```

### Get latest check

Works with `?url={url}` and `?resource_id={resource_id}`.

```
$ curl -s "http://localhost:8000/api/checks/latest/?url=http://opendata-sig.saintdenis.re/datasets/661e19974bcc48849bbff7c9637c5c28_1.csv" | json_pp
{
   "status" : 200,
   "catalog_id" : 64148,
   "deleted" : false,
   "error" : null,
   "created_at" : "2021-02-06T12:19:08.203055",
   "response_time" : 0.830198049545288,
   "url" : "http://opendata-sig.saintdenis.re/datasets/661e19974bcc48849bbff7c9637c5c28_1.csv",
   "domain" : "opendata-sig.saintdenis.re",
   "timeout" : false,
   "id" : 114750,
   "dataset_id" : "5c34944606e3e73d4a551889",
   "resource_id" : "b3678c59-5b35-43ad-9379-fce29e5b56fe",
   "headers" : {
      "content-disposition" : "attachment; filename=\"xn--Dlimitation_des_cantons-bcc.csv\"",
      "server" : "openresty",
      "x-amz-meta-cachetime" : "191",
      "last-modified" : "Wed, 29 Apr 2020 02:19:04 GMT",
      "content-encoding" : "gzip",
      "content-type" : "text/csv",
      "cache-control" : "must-revalidate",
      "etag" : "\"20415964703d9ccc4815d7126aa3a6d8\"",
      "content-length" : "207",
      "date" : "Sat, 06 Feb 2021 12:19:08 GMT",
      "x-amz-meta-contentlastmodified" : "2018-11-19T09:38:28.490Z",
      "connection" : "keep-alive",
      "vary" : "Accept-Encoding"
   }
}
```
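When scripting against the API, a small helper can build these lookup URLs. `latest_check_url` is a hypothetical name, not part of udata-hydra:

```python
from urllib.parse import urlencode

def latest_check_url(base, url=None, resource_id=None):
    """Build a /api/checks/latest/ lookup URL from exactly one parameter."""
    if (url is None) == (resource_id is None):
        raise ValueError("pass exactly one of url or resource_id")
    params = {"url": url} if url is not None else {"resource_id": resource_id}
    return "{}/api/checks/latest/?{}".format(base.rstrip("/"), urlencode(params))
```

Assuming the `requests` library is available, the check above could then be fetched with `requests.get(latest_check_url("http://localhost:8000", resource_id="b3678c59-5b35-43ad-9379-fce29e5b56fe")).json()`.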

### Get all checks for a URL or resource

Works with `?url={url}` and `?resource_id={resource_id}`.

```
$ curl -s "http://localhost:8000/api/checks/all/?url=http://www.drees.sante.gouv.fr/IMG/xls/er864.xls" | json_pp
[
   {
      "domain" : "www.drees.sante.gouv.fr",
      "dataset_id" : "53d6eadba3a72954d9dd62f5",
      "timeout" : false,
      "deleted" : false,
      "response_time" : null,
      "error" : "Cannot connect to host www.drees.sante.gouv.fr:443 ssl:True [SSLCertVerificationError: (1, \"[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'www.drees.sante.gouv.fr'. (_ssl.c:1122)\")]",
      "catalog_id" : 232112,
      "url" : "http://www.drees.sante.gouv.fr/IMG/xls/er864.xls",
      "headers" : {},
      "id" : 165107,
      "created_at" : "2021-02-06T14:32:47.675854",
      "resource_id" : "93dfd449-9d26-4bb0-a6a9-ee49b1b8a4d7",
      "status" : null
   },
   {
      "timeout" : false,
      "deleted" : false,
      "response_time" : null,
      "error" : "Cannot connect to host www.drees.sante.gouv.fr:443 ssl:True [SSLCertVerificationError: (1, \"[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'www.drees.sante.gouv.fr'. (_ssl.c:1122)\")]",
      "domain" : "www.drees.sante.gouv.fr",
      "dataset_id" : "53d6eadba3a72954d9dd62f5",
      "created_at" : "2020-12-24T17:06:58.158125",
      "resource_id" : "93dfd449-9d26-4bb0-a6a9-ee49b1b8a4d7",
      "status" : null,
      "catalog_id" : 232112,
      "url" : "http://www.drees.sante.gouv.fr/IMG/xls/er864.xls",
      "headers" : {},
      "id" : 65092
   }
]
```

### Get crawling status

```
$ curl -s "http://localhost:8000/api/status/crawler/" | json_pp
{
   "fresh_checks_percentage" : 0.4,
   "pending_checks" : 142153,
   "total" : 142687,
   "fresh_checks" : 534,
   "checks_percentage" : 0.4
}
```

### Get worker status

```
$ curl -s "http://localhost:8000/api/status/worker/" | json_pp
{
   "queued" : {
      "default" : 0,
      "high" : 825,
      "low" : 655
   }
}
```
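As a quick illustration, the `queued` payload above can be aggregated with a tiny helper (hypothetical, not part of udata-hydra):

```python
def total_queued(status: dict) -> int:
    """Sum pending jobs across all rq queues in a /api/status/worker/ payload."""
    return sum(status["queued"].values())
```

With the sample response above, this gives 0 + 825 + 655 = 1480.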

### Get crawling stats

```
$ curl -s "http://localhost:8000/api/stats/" | json_pp
{
   "status" : [
      {
         "count" : 525,
         "percentage" : 98.3,
         "label" : "ok"
      },
      {
         "label" : "error",
         "percentage" : 1.3,
         "count" : 7
      },
      {
         "label" : "timeout",
         "percentage" : 0.4,
         "count" : 2
      }
   ],
   "status_codes" : [
      {
         "code" : 200,
         "count" : 413,
         "percentage" : 78.7
      },
      {
         "code" : 501,
         "percentage" : 12.4,
         "count" : 65
      },
      {
         "percentage" : 6.1,
         "count" : 32,
         "code" : 404
      },
      {
         "code" : 500,
         "percentage" : 2.7,
         "count" : 14
      },
      {
         "code" : 502,
         "count" : 1,
         "percentage" : 0.2
      }
   ]
}
```

## Using Webhook integration

**Set the config values**

Create a `config.toml` in the directory where your service and commands are launched, or point the `HYDRA_SETTINGS` environment variable at a TOML file. `config.toml` (or its equivalent) overrides values from `udata_hydra/config_default.toml`; look there for the values that can or need to be defined.

```toml
UDATA_URI = "https://dev.local:7000/api/2"
UDATA_URI_API_KEY = "example.api.key"
SENTRY_DSN = "https://{my-sentry-dsn}"
```

The webhook integration sends HTTP messages to `udata` when resources are analyzed or checked to fill resources extras.

Regarding analysis, there is a phase called "change detection". It tries to guess whether a resource has been modified based on several criteria:
- harvest modified date in catalog
- content-length and last-modified headers
- checksum comparison over time
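A minimal sketch of such a priority-ordered check follows. This is illustrative only: of the returned labels, only `harvest-resource-metadata` appears in the payload below; the others are hypothetical.

```python
from typing import Optional

def detect_last_modified(harvest_modified: Optional[str],
                         last_modified_header: Optional[str],
                         previous_checksum: Optional[str],
                         current_checksum: Optional[str]) -> Optional[str]:
    """Pick a last-modified detection method, in priority order (sketch)."""
    if harvest_modified:
        # Harvest modified date recorded in the catalog.
        return "harvest-resource-metadata"
    if last_modified_header:
        # HTTP last-modified header from the check.
        return "last-modified-header"
    if previous_checksum and current_checksum and previous_checksum != current_checksum:
        # Content checksum changed between two checks.
        return "checksum-comparison"
    return None
```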

The payload should look something like:

```json
{
   "analysis:filesize": 91661,
   "analysis:mime-type": "application/zip",
   "analysis:checksum": "bef1de04601dedaf2d127418759b16915ba083be",
   "analysis:last-modified-at": "2022-11-27T23:00:54.762000",
   "analysis:last-modified-detection": "harvest-resource-metadata"
}
```

## Development

### docker-compose

Multiple docker-compose files are provided:
- a minimal `docker-compose.yml` with PostgreSQL
- `docker-compose.broker.yml` adds a Redis broker
- `docker-compose.test.yml` launches a test DB, needed to run tests

NB: you can launch compose from multiple files like this: `docker-compose -f docker-compose.yml -f docker-compose.test.yml up`

### Logging & Debugging

The log level can be adjusted with the `LOG_LEVEL` environment variable.
For example, to set the log level to `DEBUG` when initializing the database, use `LOG_LEVEL="DEBUG" udata-hydra init_db`.

### Writing a migration

1. Add a file named `migrations/{YYYYMMDD}_{from}_up_{to}.sql` containing the SQL needed to perform the migration. `from` is the previous revision (eg `rev1`), `to` the revision you're aiming at (eg `rev2`)
2. Modify the latest revision (eg `rev2`) in `migrations/_LATEST_REVISION`
3. `udata-hydra migrate` will use the info from `_LATEST_REVISION` to upgrade to `rev2`. You can also specify `udata-hydra migrate --revision rev2`
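For example, adding a hypothetical migration from `rev2` to `rev3` could look like this (the date, revision names, and SQL are all illustrative):

```shell
mkdir -p migrations
# Write the migration SQL; table and column names are hypothetical.
cat > migrations/20230104_rev2_up_rev3.sql <<'SQL'
ALTER TABLE catalog ADD COLUMN example_col TEXT;
SQL
# Point _LATEST_REVISION at the new revision.
echo "rev3" > migrations/_LATEST_REVISION
# Then apply it (requires a configured database):
# udata-hydra migrate --revision rev3
```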

## Deployment

Three services need to be deployed for the full stack to run:
- worker
- api / app
- crawler

Refer to each section to learn how to launch them. The only differences from dev to prod are:
- use `HYDRA_SETTINGS` env var to point to your custom `config.toml`
- use `HYDRA_APP_SOCKET_PATH` to configure where aiohttp should listen for a [reverse proxy connection (eg nginx)](https://docs.aiohttp.org/en/stable/deployment.html#nginx-configuration) and use `udata-hydra-app` to launch the app server
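A minimal nginx fragment for such a setup might look like the following, following the aiohttp deployment pattern linked above. The socket path and server name are hypothetical; the socket must match whatever `HYDRA_APP_SOCKET_PATH` is set to.

```nginx
upstream hydra_app {
    # Must match HYDRA_APP_SOCKET_PATH (hypothetical value)
    server unix:/tmp/hydra.sock fail_timeout=0;
}

server {
    listen 80;
    server_name hydra.example.com;  # hypothetical

    location / {
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_pass http://hydra_app;
    }
}
```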

            
