[<img src="https://img.shields.io/badge/powered%20by-OpenCitations-%239931FC?labelColor=2D22DE" />](http://opencitations.net) [![Python package](https://github.com/opencitations/index/actions/workflows/python-package.yml/badge.svg?branch=farm_revision)](https://github.com/opencitations/index/actions/workflows/python-package.yml)
# OpenCitations: Preprocess
This software preprocesses data dumps from different data sources before they are ingested into OpenCitations.
Its aim is to facilitate data parsing and extraction in the OpenCitations Meta and OpenCitations Index processes.
Preprocessing is not a mandatory step of data ingestion in OpenCitations. However, it is recommended when:
<ol>
<li>A substantial part of the bibliographic entities represented in the dump comes without citation data</li>
<li>The dump content is redundant with respect to the scope of OpenCitations (e.g. duplicated citations retrievable both as addressed and received citations)</li>
<li>The dump consists of a single large file that is too heavy to be processed all at once</li>
<li>A substantial part of the data provided is not relevant to the scope of OpenCitations (e.g. discipline-specific and content-related metadata)</li>
</ol>
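
As an illustration of the last two cases, the following is a minimal sketch of splitting a large dump into smaller files while dropping fields that are out of scope. It is not the API of this package: it assumes the dump is a JSON Lines file (one record per line), and the names `split_dump`, `keep_relevant`, and the field names used in the usage note are hypothetical.

```python
import json
from itertools import islice
from pathlib import Path
from typing import Iterable, List


def keep_relevant(record: dict, fields: Iterable[str]) -> dict:
    """Return a copy of the record restricted to the given fields (hypothetical filter)."""
    return {key: record[key] for key in fields if key in record}


def split_dump(src: Path, out_dir: Path, chunk_size: int, fields: Iterable[str]) -> List[Path]:
    """Split a JSON Lines dump into files holding at most chunk_size lines each."""
    out_dir.mkdir(parents=True, exist_ok=True)
    fields = list(fields)
    written: List[Path] = []
    with src.open(encoding="utf-8") as lines:
        while True:
            # Read lazily, so the whole dump is never held in memory at once.
            chunk = list(islice(lines, chunk_size))
            if not chunk:
                break
            target = out_dir / f"chunk_{len(written):05d}.jsonl"
            with target.open("w", encoding="utf-8") as out:
                for line in chunk:
                    if line.strip():  # skip blank lines
                        record = keep_relevant(json.loads(line), fields)
                        out.write(json.dumps(record) + "\n")
            written.append(target)
    return written
```

For example, `split_dump(Path("dump.jsonl"), Path("out"), 100000, ["id", "cites"])` would write files named `chunk_00000.jsonl`, `chunk_00001.jsonl`, and so on, each keeping only the assumed `id` and `cites` fields of every record.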
### Mandatory
- Python 3.8+
### Start the tests
```console
$ python -m unittest discover -s ./preprocessing/test -p "*.py"
```
## License
OpenCitations Preprocess is released under the [ISC License](LICENSE).
Raw data
{
"_id": null,
"home_page": "https://github.com/opencitations/preprocess",
"name": "oc-preprocessing",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": "",
"keywords": "preprocessing data dumps",
"author": "OpenCitations authors",
"author_email": "OpenCitations authors <contact@opencitations.net>",
"download_url": "https://files.pythonhosted.org/packages/1b/02/de710bec42e662155c89650d8bafc9b9ca2d40f7fe127ac7b220435d528b/oc_preprocessing-0.0.5.tar.gz",
"platform": null,
"description": "[<img src=\"https://img.shields.io/badge/powered%20by-OpenCitations-%239931FC?labelColor=2D22DE\" />](http://opencitations.net) [![Python package](https://github.com/opencitations/index/actions/workflows/python-package.yml/badge.svg?branch=farm_revision)](https://github.com/opencitations/index/actions/workflows/python-package.yml)\n# OpenCitations: Preprocess\n\nThis software is meant to preprocess data dumps to be ingested in OpenCitations, provided by different data sources. \nThe aim of the software is that of preprocessing data dumps in order to facilitate data parsing and extraction in OpenCitations Meta and OpenCitation Index processes. \nNote that preprocessing is not a mandatory step of data ingestion in OpenCitations. However, preprocessing is suggested when:\n<ol>\n<li>A consistent part of the bibliographic entities represented in the dump come without citation data</li>\n<li>The dump content is redundant with respect to OpenCitations scopes (e.g.: duplicated citations retrievable both as addressed and received citations)</li>\n<li>The dump consists of a unique big file, and it is too heavy to be processed all at once</li>\n<li>A consistent part of the data provided is not relevant with respect to OpenCitations scopes (e.g.: discipline-specific and content-related metadata) </li>\n</ol>\n\n\n### Mandatory\n- Python 3.8+\n\n### Start the tests\n```console\n$ python -m unittest discover -s ./preprocessing/test -p \"*.py\"\n```\n\n## License\nOpenCitations Index is released under the [ISC License](LICENSE).\n",
"bugtrack_url": null,
"license": "BSD",
"summary": "This package is meant to preprocess OpenCitations source dumps so to make them easily usable in OpenCitations main processes, by deleting unused information, splitting big files, and validating identifiers",
"version": "0.0.5",
"split_keywords": [
"preprocessing",
"data",
"dumps"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "0fec632c6ac97235fc0b360fa215144071cf2dd977d3ec47da3da9f62f0db721",
"md5": "ef4cfa776753c24b846e839a86792a96",
"sha256": "c011eef4e1253c03445a4140ab7c111d6f0b9dd02c885987e0052ebadd2e2aca"
},
"downloads": -1,
"filename": "oc_preprocessing-0.0.5-py3-none-any.whl",
"has_sig": false,
"md5_digest": "ef4cfa776753c24b846e839a86792a96",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 32404,
"upload_time": "2023-04-12T23:06:49",
"upload_time_iso_8601": "2023-04-12T23:06:49.677310Z",
"url": "https://files.pythonhosted.org/packages/0f/ec/632c6ac97235fc0b360fa215144071cf2dd977d3ec47da3da9f62f0db721/oc_preprocessing-0.0.5-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "1b02de710bec42e662155c89650d8bafc9b9ca2d40f7fe127ac7b220435d528b",
"md5": "d5c96cadacbf593e757f731081cf9cce",
"sha256": "59e1c08f1f71ba96c9ae8cf662cde49bbe5c0d1386d318c3171564e57b164fba"
},
"downloads": -1,
"filename": "oc_preprocessing-0.0.5.tar.gz",
"has_sig": false,
"md5_digest": "d5c96cadacbf593e757f731081cf9cce",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 23023,
"upload_time": "2023-04-12T23:06:51",
"upload_time_iso_8601": "2023-04-12T23:06:51.768330Z",
"url": "https://files.pythonhosted.org/packages/1b/02/de710bec42e662155c89650d8bafc9b9ca2d40f7fe127ac7b220435d528b/oc_preprocessing-0.0.5.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-04-12 23:06:51",
"github": true,
"gitlab": false,
"bitbucket": false,
"github_user": "opencitations",
"github_project": "preprocess",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [],
"lcname": "oc-preprocessing"
}