<p align="center">
<img src="https://raw.githubusercontent.com/neuml/paperetl/master/logo.png"/>
</p>
<p align="center">
<b>ETL processes for medical and scientific papers</b>
</p>
<p align="center">
<a href="https://github.com/neuml/paperetl/releases">
<img src="https://img.shields.io/github/release/neuml/paperetl.svg?style=flat&color=success" alt="Version"/>
</a>
<a href="https://github.com/neuml/paperetl/releases">
<img src="https://img.shields.io/github/release-date/neuml/paperetl.svg?style=flat&color=blue" alt="GitHub Release Date"/>
</a>
<a href="https://github.com/neuml/paperetl/issues">
<img src="https://img.shields.io/github/issues/neuml/paperetl.svg?style=flat&color=success" alt="GitHub issues"/>
</a>
<a href="https://github.com/neuml/paperetl">
<img src="https://img.shields.io/github/last-commit/neuml/paperetl.svg?style=flat&color=blue" alt="GitHub last commit"/>
</a>
<a href="https://github.com/neuml/paperetl/actions?query=workflow%3Abuild">
<img src="https://github.com/neuml/paperetl/workflows/build/badge.svg" alt="Build Status"/>
</a>
<a href="https://coveralls.io/github/neuml/paperetl?branch=master">
<img src="https://img.shields.io/coverallsCoverage/github/neuml/paperetl" alt="Coverage Status">
</a>
</p>
-------------------------------------------------------------------------------------------------------------------------------------------------------
`paperetl` is an ETL library for processing medical and scientific papers.

`paperetl` supports the following sources:
- Full PDF articles
- [PubMed XML](https://pubmed.ncbi.nlm.nih.gov/download/)
- [ArXiv XML](https://info.arxiv.org/help/api/basics.html)
- [Text Encoding Initiative (TEI) XML](https://grobid.readthedocs.io/en/latest/TEI-encoding-of-results/)
- CSV with article metadata

`paperetl` supports the following datastores for parsed articles:
- SQLite
- JSON files
- YAML files

Additional optional datastores are available:
- Elasticsearch
## Installation
The easiest way to install is via pip and PyPI:
```
pip install paperetl
```
Python 3.10+ is supported. Using a Python [virtual environment](https://docs.python.org/3/library/venv.html) is recommended.
`paperetl` can also be installed directly from GitHub to access the latest, unreleased features.
```
pip install git+https://github.com/neuml/paperetl
```
### Additional dependencies
PDF parsing requires a running GROBID instance, which is assumed to be running locally on the ETL server. GROBID is only necessary for PDF files.
- [GROBID install instructions](https://grobid.readthedocs.io/en/latest/Install-Grobid/)
- [GROBID start service](https://grobid.readthedocs.io/en/latest/Grobid-service/)

_Note: In some cases, the GROBID engine pool can be exhausted, resulting in a 503 error. This can be fixed by increasing `concurrency` and/or `poolMaxWait` in the [GROBID configuration file](https://grobid.readthedocs.io/en/latest/Configuration/#service-configuration)._
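
Before processing PDF files, it can help to confirm that GROBID is reachable. The following is a minimal sketch, not part of `paperetl`: the function name `grobid_is_alive` is hypothetical, and the base URL `http://localhost:8070` and the `/api/isalive` health check path are assumed to match a default GROBID service setup. Adjust both if the local configuration differs.

```python
import urllib.request


def grobid_is_alive(base_url="http://localhost:8070", timeout=5):
    """Return True if a GROBID service answers its health check at base_url.

    Assumption: the service exposes GET /api/isalive, as a default GROBID
    install is expected to. This helper is illustrative, not a paperetl API.
    """
    url = f"{base_url.rstrip('/')}/api/isalive"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers refused connections, timeouts and HTTP error statuses such as 503
        return False
```

Calling `grobid_is_alive()` before starting a PDF load gives an early, readable failure instead of an error partway through the run.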
### Docker
A Dockerfile with commands to install `paperetl`, all dependencies and scripts is available in this repository.
```
wget https://raw.githubusercontent.com/neuml/paperetl/master/docker/Dockerfile
docker build -t paperetl -f Dockerfile .
docker run --name paperetl --rm -it paperetl
```
This will bring up a `paperetl` command shell. Standard Docker commands can be used to copy files over, or commands can be run directly in the shell to retrieve input content.
## Examples
### Notebooks
| Notebook | Description | |
|:----------|:-------------|------:|
| [Introducing paperetl](https://github.com/neuml/paperetl/blob/master/examples/01_Introducing_paperetl.ipynb) | Overview of the functionality provided by `paperetl` | [Open in Colab](https://colab.research.google.com/github/neuml/paperetl/blob/master/examples/01_Introducing_paperetl.ipynb) |
### Load Articles into SQLite
The following example shows how to use `paperetl` to load a set of medical/scientific articles into a SQLite database.
1. Download the desired medical/scientific articles into a local directory. For this example, it is assumed the articles are in a directory named `paperetl/data`.
2. Build the database
```
python -m paperetl.file paperetl/data paperetl/models
```
Once complete, there will be an `articles.sqlite` file in `paperetl/models`.
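
To check the result, the database can be inspected with Python's built-in `sqlite3` module. The sketch below makes no assumption about the schema `paperetl` creates: it reads the table names from `sqlite_master` and counts the rows in each. The helper name `table_counts` is hypothetical.

```python
import sqlite3


def table_counts(path):
    """Return {table name: row count} for every table in a SQLite file."""
    connection = sqlite3.connect(path)
    try:
        cursor = connection.cursor()
        cursor.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
        names = [row[0] for row in cursor.fetchall()]

        counts = {}
        for name in names:
            # Table names come from the database itself; quote them as identifiers
            quoted = '"' + name.replace('"', '""') + '"'
            counts[name] = cursor.execute(
                f"SELECT COUNT(*) FROM {quoted}"
            ).fetchone()[0]
        return counts
    finally:
        connection.close()
```

For the example above, `table_counts("paperetl/models/articles.sqlite")` lists each table with its row count. Note that `sqlite3.connect` creates an empty file when the path does not exist, so an empty result may simply mean the path is wrong.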
### Load into Elasticsearch
Elasticsearch is an optional datastore, installed via the `elasticsearch` extra.
```
pip install paperetl[elasticsearch]
```
This example assumes Elasticsearch is running locally. Change the URL to point to a remote server as appropriate.
```
python -m paperetl.file paperetl/data http://localhost:9200
```
Once complete, there will be an articles index in Elasticsearch with the metadata and full text stored.
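
One way to confirm the load is to ask Elasticsearch how many documents the index holds. This is a sketch under stated assumptions: it uses the standard Elasticsearch `_count` endpoint over plain HTTP without authentication, and the helper names `count_url` and `count_documents` are hypothetical. A secured cluster would need credentials and HTTPS, which are not shown here.

```python
import json
import urllib.request


def count_url(base_url, index="articles"):
    """Build the URL of the Elasticsearch _count endpoint for an index."""
    return f"{base_url.rstrip('/')}/{index}/_count"


def count_documents(base_url, index="articles", timeout=10):
    """Return the number of documents Elasticsearch reports for an index.

    Assumption: the server is reachable without authentication.
    """
    with urllib.request.urlopen(count_url(base_url, index), timeout=timeout) as response:
        return json.load(response)["count"]
```

With a local server, `count_documents("http://localhost:9200")` returns the document count for the `articles` index.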
### Convert articles to JSON/YAML
`paperetl` can also be used to convert articles into JSON or YAML files. This is useful if the data is to be fed into another system or for manual inspection/debugging of a single file.
JSON:
```
python -m paperetl.file paperetl/data json://paperetl/json
```
YAML:
```
python -m paperetl.file paperetl/data yaml://paperetl/yaml
```
Converted files will be stored in `paperetl/(json|yaml)`.
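
For manual inspection, the converted JSON files can be read back with the standard library. The sketch below assumes each output file has a `.json` extension and holds a single JSON document. It makes no assumption about the fields `paperetl` writes and only reports the top-level keys it finds. The helper name `summarize_json_files` is hypothetical.

```python
import json
from pathlib import Path


def summarize_json_files(directory):
    """Map each *.json file name in directory to its sorted top-level keys."""
    summary = {}
    for path in sorted(Path(directory).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            document = json.load(handle)
        # Only dictionaries have keys; anything else is reported by type name
        if isinstance(document, dict):
            summary[path.name] = sorted(document)
        else:
            summary[path.name] = type(document).__name__
    return summary
```

For the example above, `summarize_json_files("paperetl/json")` gives a quick overview of what each converted article contains.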