# Chemistry Paper Parser
Convert HTML/XML Chemistry/Material Science articles into plain text.
[![made-with-python](https://img.shields.io/badge/Made%20with-Python-1f425f.svg?color=purple)](https://www.python.org/)
[![Maintenance](https://img.shields.io/badge/Maintained%3F-yes-green.svg)](https://github.com/Yinghao-Li/chemdocparsing)
[![PyPI version](https://badge.fury.io/py/ChemistryPaperParser.svg)](https://badge.fury.io/py/ChemistryPaperParser)
---
## 1. Install
### Requirements
The current version of Chemistry Paper Parser is built for `Python >= 3.9`.
Please check `requirements.txt` for other dependencies.
### Install package
Chemistry Paper Parser is hosted on [pypi](https://pypi.org/).
You can simply install it with
```bash
pip install ChemistryPaperParser
```
Once installed, you can import the package as `chempp` in `Python`:
```python
from chempp import parse_html, parse_xml
html_article, _ = parse_html(path_to_my_local_html)
xml_article, _ = parse_xml(path_to_my_local_xml)
```
### Supported publishers:
Currently, Chemistry Paper Parser supports the following publishers and file types.
| Publisher | Supports HTML | Supports XML |
|-----------|---------------|--------------|
| [RSC](https://pubs.rsc.org/) | ✓ | ✗ |
| [Springer](https://www.springer.com/us) | ✓ | ✗ |
| [Nature](https://www.nature.com/) | ✓ | ✗ |
| [Wiley](https://onlinelibrary.wiley.com/) | ✓ | ✗ |
| [AIP](https://pubs.aip.org/) | ✓ | ✗ |
| [ACS](https://pubs.acs.org/) | ✓ | ✓ |
| [Elsevier](https://www.elsevier.com/) | ✓ | ✓ |
| [AAAS (Science)](https://www.science.org/journals) | ✓ | ✗ |
In addition, table parsing is not supported for all publishers.
For figures, only captions will be parsed and saved in the current version.
## 2. Example
The open-access ACS article [Toland et al. (2023)](https://pubs.acs.org/doi/10.1021/acs.jpca.3c05870#) is used here as an example to demonstrate the article parsing process.
The offline file is provided at `./examples/Toland.et.al.2023.html`.
For online HTML files, you can either download the html files manually and load it as demonstrated below, or use the provided `chempp.crawler.load_online_html` function (requires external dependencies).
To parse the example article, you can try the following example in your shell.
```bash
PYTHONPATH="." python ./examples/process_articles.py --input_dir ./examples/ --output_dir ./output/ --output_format pt
```
The `--input_dir` argument can either be the file path or a directory. If it is a directory, the program will try to read and parse all `html` and `xml` files in the folder.
`--output_format` defines the output format of the parse file.
`pt` will retain all structural information within the [`Article`](https://github.com/Yinghao-Li/ChemistryHTMLPaperParser/blob/087cf01fb0a0b44008e3ac987ba4e77e2d9f8d3c/chempp/article/article.py#L57) class.
`jsonl` saves the file as a Doccano-compatible jsonl file for easy annotation.
`html` saves the file as a simplified HTML for easy demonstration of the annotated sentences and tokens.
It also is a good way to present the quality of the parsed article.
Notice that [`./examples/process_articles.py`](./examples/process_articles.py) is only an incomplete demonstration of `chempp` APIs and their usage.
The notebook [`./examples/example.ipynb`](./examples/example.ipynb) demonstrates the structure of the parsed `Article` object and some possible use cases.
You can find more details regarding Chemistry Article Parser and its application in my [blog](https://yinghao-li.github.io/posts/2023/07/material-ie/).
I'll provide more comprehensive API introduction if needed in the future.
## 3. Known issues
Due to the variety of HTML/XML documents, not all document can be successfully parsed.
It would be helpful for our improvement if you can report the failed cases in the Issue section.
- HTML highlighting sometimes may fail when multiple entities start at the same position due to incorrect text span alignment.
- May fail to extract sections from Elsevier when section ids are `s[\d]+` instead of `sec[\d]+`, as mentioned in [this issue](https://github.com/Yinghao-Li/ChemistryHTMLPaperParser/issues/2).
- Fails to extract abstracts from RSC due to updated HTML format, as mentioned in [this issue](https://github.com/Yinghao-Li/ChemistryHTMLPaperParser/issues/1).
## Citation
Please consider citing the following article if your find our package useful.
Although not mentioned at all, Chemistry Paper Parser is still a part of this project.
```
@article{toland.2023.accelerated.scheme,
author = {Toland, Aubrey and Tran, Huan and Chen, Lihua and Li, Yinghao and Zhang, Chao and Gutekunst, Will and Ramprasad, Rampi},
title = {Accelerated Scheme to Predict Ring-Opening Polymerization Enthalpy: Simulation-Experimental Data Fusion and Multitask Machine Learning},
journal = {The Journal of Physical Chemistry A},
volume = {127},
number = {50},
pages = {10709-10716},
year = {2023},
doi = {10.1021/acs.jpca.3c05870},
note ={PMID: 38055927},
URL = {https://doi.org/10.1021/acs.jpca.3c05870},
eprint = {https://doi.org/10.1021/acs.jpca.3c05870}
}
```
Raw data
{
"_id": null,
"home_page": "https://github.com/Yinghao-Li/ChemistryHTMLPaperParser",
"name": "ChemistryPaperParser",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": null,
"keywords": "data-mining natural-language-processing nlp parser chemistry",
"author": "Yinghao Li",
"author_email": "yinghaoli@gatech.edu",
"download_url": null,
"platform": null,
"description": "# Chemistry Paper Parser\nConvert HTML/XML Chemistry/Material Science articles into plain text.\n\n[![made-with-python](https://img.shields.io/badge/Made%20with-Python-1f425f.svg?color=purple)](https://www.python.org/)\n[![Maintenance](https://img.shields.io/badge/Maintained%3F-yes-green.svg)](https://github.com/Yinghao-Li/chemdocparsing)\n[![PyPI version](https://badge.fury.io/py/ChemistryPaperParser.svg)](https://badge.fury.io/py/ChemistryPaperParser)\n---\n\n## 1. Install\n\n### Requirements\nThe current version of Chemistry Paper Parser is built for `Python >= 3.9`.\nPlease check `requirements.txt` for other dependencies.\n\n### Install package\n\nChemistry Paper Parser is hosted on [pypi](https://pypi.org/).\nYou can simply install it with\n```bash\npip install ChemistryPaperParser\n```\n\nOnce installed, you can import the package as `chempp` in `Python`:\n\n```python\nfrom chempp import parse_html, parse_xml\n\nhtml_article, _ = parse_html(path_to_my_local_html)\nxml_article, _ = parse_xml(path_to_my_local_xml)\n```\n\n### Supported publishers:\n\nCurrently, Chemistry Paper Parser supports the following publishers and file types.\n\n| Publisher | Supports HTML | Supports XML |\n|-----------|---------------|--------------|\n| [RSC](https://pubs.rsc.org/) | \u2713 | \u2717 |\n| [Springer](https://www.springer.com/us) | \u2713 | \u2717 |\n| [Nature](https://www.nature.com/) | \u2713 | \u2717 |\n| [Wiley](https://onlinelibrary.wiley.com/) | \u2713 | \u2717 |\n| [AIP](https://pubs.aip.org/) | \u2713 | \u2717 |\n| [ACS](https://pubs.acs.org/) | \u2713 | \u2713 |\n| [Elsevier](https://www.elsevier.com/) | \u2713 | \u2713 |\n| [AAAS (Science)](https://www.science.org/journals) | \u2713 | \u2717 |\n\nIn addition, table parsing is not supported for all publishers.\n\nFor figures, only captions will be parsed and saved in the current version.\n\n## 2. Example\n\nThe open-access ACS article [Toland et al. (2023)](https://pubs.acs.org/doi/10.1021/acs.jpca.3c05870#) is used here as an example to demonstrate the article parsing process.\nThe offline file is provided at `./examples/Toland.et.al.2023.html`.\nFor online HTML files, you can either download the html files manually and load it as demonstrated below, or use the provided `chempp.crawler.load_online_html` function (requires external dependencies).\n\nTo parse the example article, you can try the following example in your shell.\n```bash\nPYTHONPATH=\".\" python ./examples/process_articles.py --input_dir ./examples/ --output_dir ./output/ --output_format pt\n```\nThe `--input_dir` argument can either be the file path or a directory. If it is a directory, the program will try to read and parse all `html` and `xml` files in the folder.\n`--output_format` defines the output format of the parse file.\n`pt` will retain all structural information within the [`Article`](https://github.com/Yinghao-Li/ChemistryHTMLPaperParser/blob/087cf01fb0a0b44008e3ac987ba4e77e2d9f8d3c/chempp/article/article.py#L57) class.\n`jsonl` saves the file as a Doccano-compatible jsonl file for easy annotation.\n`html` saves the file as a simplified HTML for easy demonstration of the annotated sentences and tokens.\nIt also is a good way to present the quality of the parsed article.\n\nNotice that [`./examples/process_articles.py`](./examples/process_articles.py) is only an incomplete demonstration of `chempp` APIs and their usage.\nThe notebook [`./examples/example.ipynb`](./examples/example.ipynb) demonstrates the structure of the parsed `Article` object and some possible use cases.\nYou can find more details regarding Chemistry Article Parser and its application in my [blog](https://yinghao-li.github.io/posts/2023/07/material-ie/).\nI'll provide more comprehensive API introduction if needed in the future.\n\n\n## 3. Known issues\n\nDue to the variety of HTML/XML documents, not all document can be successfully parsed.\nIt would be helpful for our improvement if you can report the failed cases in the Issue section.\n\n\n- HTML highlighting sometimes may fail when multiple entities start at the same position due to incorrect text span alignment.\n- May fail to extract sections from Elsevier when section ids are `s[\\d]+` instead of `sec[\\d]+`, as mentioned in [this issue](https://github.com/Yinghao-Li/ChemistryHTMLPaperParser/issues/2).\n- Fails to extract abstracts from RSC due to updated HTML format, as mentioned in [this issue](https://github.com/Yinghao-Li/ChemistryHTMLPaperParser/issues/1).\n\n## Citation\n\nPlease consider citing the following article if your find our package useful.\nAlthough not mentioned at all, Chemistry Paper Parser is still a part of this project.\n```\n@article{toland.2023.accelerated.scheme,\n author = {Toland, Aubrey and Tran, Huan and Chen, Lihua and Li, Yinghao and Zhang, Chao and Gutekunst, Will and Ramprasad, Rampi},\n title = {Accelerated Scheme to Predict Ring-Opening Polymerization Enthalpy: Simulation-Experimental Data Fusion and Multitask Machine Learning},\n journal = {The Journal of Physical Chemistry A},\n volume = {127},\n number = {50},\n pages = {10709-10716},\n year = {2023},\n doi = {10.1021/acs.jpca.3c05870},\n note ={PMID: 38055927},\n URL = {https://doi.org/10.1021/acs.jpca.3c05870},\n eprint = {https://doi.org/10.1021/acs.jpca.3c05870}\n}\n```\n",
"bugtrack_url": null,
"license": "MIT",
"summary": "Parsing HTML chemistry papers from certain publishers into plain text",
"version": "0.1.1",
"project_urls": {
"Homepage": "https://github.com/Yinghao-Li/ChemistryHTMLPaperParser"
},
"split_keywords": [
"data-mining",
"natural-language-processing",
"nlp",
"parser",
"chemistry"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "fa1db8cebfebdeb45e9d5b9dc4b0733cae3b87432b4071b28bcd8e1b63b43450",
"md5": "bd1a42fb580745225ff281306c71af0c",
"sha256": "a384bee770ae05249cde9442ecbdc2e7f811eff489838e0eca6274340de41c73"
},
"downloads": -1,
"filename": "ChemistryPaperParser-0.1.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "bd1a42fb580745225ff281306c71af0c",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9",
"size": 27986,
"upload_time": "2024-06-15T23:25:08",
"upload_time_iso_8601": "2024-06-15T23:25:08.651958Z",
"url": "https://files.pythonhosted.org/packages/fa/1d/b8cebfebdeb45e9d5b9dc4b0733cae3b87432b4071b28bcd8e1b63b43450/ChemistryPaperParser-0.1.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-06-15 23:25:08",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "Yinghao-Li",
"github_project": "ChemistryHTMLPaperParser",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"requirements": [
{
"name": "numpy",
"specs": [
[
"~=",
"1.19.2"
]
]
},
{
"name": "dataclasses",
"specs": [
[
"~=",
"0.8"
]
]
},
{
"name": "beautifulsoup4",
"specs": [
[
"~=",
"4.10.0"
]
]
},
{
"name": "SeqLbToolkit",
"specs": [
[
"~=",
"0.5.5"
]
]
},
{
"name": "html5lib",
"specs": [
[
"~=",
"1.1"
]
]
},
{
"name": "spacy_alignments",
"specs": [
[
"~=",
"0.9.1"
]
]
},
{
"name": "pytextspan",
"specs": [
[
"~=",
"0.5.7"
]
]
}
],
"lcname": "chemistrypaperparser"
}