[![MIT license](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)
# Data Harvesting
This repository contains harvesters and aggregators for linked data, as well as tools around them.
This software harvests small subgraphs exposed by certain sources on the web
and enriches them so that they can be combined into a single, larger linked data graph.
This software was written for, and is currently mainly deployed as, part of the backend of the unified Helmholtz Information and Data Exchange (unHIDE) project by the Helmholtz Metadata Collaboration (HMC). Its purpose is to create
a knowledge graph for the Helmholtz Association that makes it possible to monitor, check, and enrich metadata as well as to
identify gaps and needs.
Contributions of any kind are always welcome!
## Approach:
We set up data pipelines for selected data providers that expose linked metadata, and complement that metadata by combining it with other sources. For the unHIDE project, this data is annotated with schema.org semantics and serialized mainly as JSON-LD.
Data pipelines contain code to execute harvesting from a local to a global level.
They are exposed through a command line interface (CLI), so they are easy to integrate into a cron job and can be used to stream data into a data ecosystem at regular time intervals.
Data harvester pipelines so far:
- gitlab pipeline: harvests all public projects in Helmholtz GitLab instances and extracts and complements their codemeta.jsonld files. (todo: extend to GitHub)
- sitemap pipeline: extracts JSON-LD metadata from a data provider via its sitemap, which contains links to the data entries and the dates they were last updated.
- oai pmh pipeline: extracts metadata from a data provider via its OAI-PMH endpoints, which provide a list of entries and the dates they were last updated. This pipeline uses a converter from Dublin Core to schema.org, since many providers provide only Dublin Core so far.
- datacite pipeline: extracts JSON-LD metadata from datacite.org that is connected to a given organization identifier.
- scholix pipeline (todo): extracts links and related resources for a list of given PIDs of any kind.
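
To illustrate the idea behind the sitemap pipeline, here is a minimal sketch that reads a sitemap and keeps only the entries updated since a given date. It is not the code used in this package: the function name `entries_updated_since` and the `example.org` URLs are hypothetical, and only the Python standard library is used.

```python
from datetime import datetime, timezone
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def entries_updated_since(sitemap_xml: str, since: datetime) -> list:
    """Return the URLs whose <lastmod> is on or after `since`.

    Hypothetical helper. Entries without <lastmod> are kept, because
    their age is unknown and skipping them could miss updates.
    """
    root = ET.fromstring(sitemap_xml)
    urls = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = url.findtext("sm:lastmod", default="", namespaces=SITEMAP_NS).strip()
        if not loc:
            continue
        if lastmod:
            # fromisoformat() in Python 3.9 does not accept a trailing "Z".
            modified = datetime.fromisoformat(lastmod.replace("Z", "+00:00"))
            if modified.tzinfo is None:
                # Date-only values carry no time zone; assume UTC.
                modified = modified.replace(tzinfo=timezone.utc)
            if modified < since:
                continue
        urls.append(loc)
    return urls


# Made-up sitemap with two entries.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/record/1</loc><lastmod>2024-01-15</lastmod></url>
  <url><loc>https://example.org/record/2</loc><lastmod>2024-06-01T12:00:00Z</lastmod></url>
</urlset>"""

recent = entries_updated_since(SAMPLE, datetime(2024, 3, 1, tzinfo=timezone.utc))
print(recent)
```

Filtering on the last-modified date is what makes repeated runs on a time interval cheap: only entries that changed since the previous run need to be fetched again.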
Besides the harvesters, there are aggregators, which let you specify how linked data should be processed while tracking the provenance of the processing in a reversible way. This is done by storing graph updates, so-called patches, for each subgraph. These updates can then also be applied directly to a graph database. Processing changes can be provided as SPARQL updates or through Python functions with a specific interface.
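
The idea of a reversible patch can be sketched with plain Python sets of triples. This is only an illustration under simplifying assumptions: the `Patch` class and the helper functions below are hypothetical and do not reflect the patch format or interface actually used by this package.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Patch:
    """Hypothetical patch: the triples a processing step added and removed."""
    added: frozenset
    removed: frozenset


def diff(before: set, after: set) -> Patch:
    """Record the difference between a subgraph before and after processing."""
    return Patch(added=frozenset(after - before), removed=frozenset(before - after))


def apply_patch(graph: set, patch: Patch) -> set:
    """Replay a stored patch on a graph."""
    return (graph - patch.removed) | patch.added


def revert_patch(graph: set, patch: Patch) -> set:
    """Undo a patch by swapping the roles of added and removed triples."""
    return (graph - patch.added) | patch.removed


# A tiny made-up subgraph as (subject, predicate, object) triples.
original = {
    ("ex:dataset1", "schema:name", "Test data"),
    ("ex:dataset1", "schema:license", "MIT"),
}
# Pretend a processing step replaced the license string with a URL.
processed = {
    ("ex:dataset1", "schema:name", "Test data"),
    ("ex:dataset1", "schema:license", "https://spdx.org/licenses/MIT"),
}

patch = diff(original, processed)
restored = revert_patch(apply_patch(original, patch), patch)
print(restored == original)
```

Because the patch stores both the added and the removed triples, the same record serves as provenance, as an update that can be replayed elsewhere, and as the information needed to undo the change.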
All harvesters and aggregators read from a single config file (for an example, see [configs/config.yaml](https://codebase.helmholtz.cloud/hmc/hmc-public/unhide/data_harvesting/-/blob/dev/data_harvesting/configs/config.yaml)), which contains all sources and specific operations.
## Documentation:
Currently, documentation exists only in the code. In the future, it will live under the docs folder and be hosted online.
## Installation
```
git clone git@codebase.helmholtz.cloud:hmc/hmc-public/unhide/data_harvesting.git
cd data_harvesting
pip install .
```
As a developer, install in editable mode with:
```
pip install -e .
```
You can also set up the project using poetry instead of pip:
```
poetry install --with dev
```
The individual pipelines have further dependencies outside of Python.
For example, the gitlab pipeline relies on codemeta-harvester (https://github.com/proycon/codemeta-harvester).
## How to use this
For examples, look at the `examples` folder. The tests in the `tests` folder may also provide some insight.
Once installed, the package also provides a command line interface (CLI) called `hmc-unhide`. For example, you can execute the gitlab pipeline via:
```
hmc-unhide harvester run --name gitlab --out ~/work/data/gitlab_pipeline
```
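
If you prefer to trigger a harvester from a Python script, for instance one that a cron job calls, a thin wrapper around the CLI could look like the sketch below. The function names are hypothetical, and the wrapper uses only the `harvester run` subcommand and the `--name` and `--out` options shown above; it assumes `hmc-unhide` is available on the `PATH`.

```python
import shlex
import subprocess
from pathlib import Path


def build_harvest_command(name: str, out_dir: str) -> list:
    """Build the argument list for `hmc-unhide harvester run` (hypothetical helper)."""
    # Expand "~" ourselves, because no shell is involved when the command runs.
    out = Path(out_dir).expanduser()
    return ["hmc-unhide", "harvester", "run", "--name", name, "--out", str(out)]


def run_harvester(name: str, out_dir: str) -> int:
    """Run one harvester pipeline and return the exit code of the CLI."""
    cmd = build_harvest_command(name, out_dir)
    print("Running:", shlex.join(cmd))
    # check=False leaves it to the caller to react to a non-zero exit code.
    return subprocess.run(cmd, check=False).returncode


# Example call, commented out so that importing this file does not start a harvest:
# exit_code = run_harvester("gitlab", "~/work/data/gitlab_pipeline")
```

Passing the arguments as a list avoids shell quoting problems with output paths that contain spaces.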
Furthermore, the CLI exposes other utilities, for example to convert linked data files
into different formats.
You can also use the CLI to register two pipelines and then run them in parallel. Don't forget to set your Prefect server URL.
```
# register the data pipeline, use any config or out folder path
hmc-unhide pipeline register --config configs/config.yaml --out /opt/data
# register the hifis pipeline
hmc-unhide stats register
```
## License
The software is distributed under the terms and conditions of the MIT license which is specified in the `LICENSE` file.
## Acknowledgement
This project was supported by the Helmholtz Metadata Collaboration (HMC), an incubator-platform of the Helmholtz Association within the framework of the Information and Data Science strategic initiative.
Raw data
{
"_id": null,
"home_page": "https://codebase.helmholtz.cloud/hmc/hmc-public/unhide/data_harvesting",
"name": "data-harvesting",
"maintainer": "Jens Br\u00f6der",
"docs_url": null,
"requires_python": "<4.0,>=3.9",
"maintainer_email": "j.broeder@fz-juelich.de",
"keywords": "unhide, Helmholtz association, data mining, HMC, metadata, data publications, software publication, RSE, FAIR, linked data, knowledge graph, json-ld, schema.org, restruct",
"author": "Jens Br\u00f6der",
"author_email": "j.broeder@fz-juelich.de",
"download_url": "https://files.pythonhosted.org/packages/fd/22/24187004a3dcf00f4ab89ee2ebc92353bc62f4343456e318332f08f69a8d/data_harvesting-2.0.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Set of tools to harvest, process and uplift (meta)data from metadata providers within the Helmholtz association to be included in the Helmholtz Knowledge Graph (Helmholtz-KG).",
"version": "2.0.0",
"project_urls": {
"Homepage": "https://codebase.helmholtz.cloud/hmc/hmc-public/unhide/data_harvesting",
"Repository": "https://codebase.helmholtz.cloud/hmc/hmc-public/unhide/data_harvesting.git"
},
"split_keywords": [
"unhide",
" helmholtz association",
" data mining",
" hmc",
" metadata",
" data publications",
" software publication",
" rse",
" fair",
" linked data",
" knowledge graph",
" json-ld",
" schema.org",
" restruct"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "6df3a79f29f96a47503a62dd38ba2c0cc17e9a96787c8333df0a05bc187c28d3",
"md5": "ce223d7730c65f41592402cfbeb38619",
"sha256": "a34c2bf99c5ee66bbbccc99016a8a8030f63e0915916f0c1f00cd44062f0a821"
},
"downloads": -1,
"filename": "data_harvesting-2.0.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "ce223d7730c65f41592402cfbeb38619",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": "<4.0,>=3.9",
"size": 671730,
"upload_time": "2024-07-09T10:09:30",
"upload_time_iso_8601": "2024-07-09T10:09:30.613099Z",
"url": "https://files.pythonhosted.org/packages/6d/f3/a79f29f96a47503a62dd38ba2c0cc17e9a96787c8333df0a05bc187c28d3/data_harvesting-2.0.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "fd2224187004a3dcf00f4ab89ee2ebc92353bc62f4343456e318332f08f69a8d",
"md5": "c4b76acb9363dd480b1ab466746e4556",
"sha256": "e805a16022d22f25c53fa516b3d72d48df0278a0689e221f2ae838b1b0ff2437"
},
"downloads": -1,
"filename": "data_harvesting-2.0.0.tar.gz",
"has_sig": false,
"md5_digest": "c4b76acb9363dd480b1ab466746e4556",
"packagetype": "sdist",
"python_version": "source",
"requires_python": "<4.0,>=3.9",
"size": 628049,
"upload_time": "2024-07-09T10:09:32",
"upload_time_iso_8601": "2024-07-09T10:09:32.592946Z",
"url": "https://files.pythonhosted.org/packages/fd/22/24187004a3dcf00f4ab89ee2ebc92353bc62f4343456e318332f08f69a8d/data_harvesting-2.0.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-07-09 10:09:32",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "data-harvesting"
}