ChemistryPaperParser

- Name: ChemistryPaperParser
- Version: 0.1.1
- Home page: https://github.com/Yinghao-Li/ChemistryHTMLPaperParser
- Summary: Parsing HTML chemistry papers from certain publishers into plain text
- Upload time: 2024-06-15 23:25:08
- Author: Yinghao Li
- Requires Python: >=3.9
- License: MIT
- Keywords: data-mining, natural-language-processing, nlp, parser, chemistry
- Requirements: numpy, dataclasses, beautifulsoup4, SeqLbToolkit, html5lib, spacy_alignments, pytextspan

# Chemistry Paper Parser
Convert HTML/XML Chemistry/Material Science articles into plain text.

[![made-with-python](https://img.shields.io/badge/Made%20with-Python-1f425f.svg?color=purple)](https://www.python.org/)
[![Maintenance](https://img.shields.io/badge/Maintained%3F-yes-green.svg)](https://github.com/Yinghao-Li/chemdocparsing)
[![PyPI version](https://badge.fury.io/py/ChemistryPaperParser.svg)](https://badge.fury.io/py/ChemistryPaperParser)

---

## 1. Install

### Requirements
The current version of Chemistry Paper Parser is built for `Python >= 3.9`.
Please check `requirements.txt` for other dependencies.

### Install package

Chemistry Paper Parser is hosted on [PyPI](https://pypi.org/).
Install it with:
```bash
pip install ChemistryPaperParser
```

Once installed, you can import the package as `chempp` in `Python`:

```python
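from chempp import parse_html, parse_xml

# Paths to locally saved article files (placeholders; replace with your own)
path_to_my_local_html = "./examples/Toland.et.al.2023.html"
path_to_my_local_xml = "path/to/article.xml"

# The first returned value is the parsed article; the second is discarded here
html_article, _ = parse_html(path_to_my_local_html)
xml_article, _ = parse_xml(path_to_my_local_xml)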
```
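
When a folder mixes both file types, the parser can be chosen from the file extension. The following is a minimal sketch under stated assumptions, not part of `chempp`: `parse_article` and its `parsers` argument are hypothetical names, and the parser functions are passed in so the dispatch logic can be tried without the package installed.

```python
from pathlib import Path
from typing import Any, Callable, Dict


def parse_article(path: str, parsers: Dict[str, Callable[[str], Any]]) -> Any:
    """Dispatch a file to a parser chosen by its extension (hypothetical helper)."""
    # Compare extensions case-insensitively, so ".HTML" and ".html" are treated alike.
    suffix = Path(path).suffix.lower()
    if suffix not in parsers:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}")
    return parsers[suffix](path)
```

With the package installed, the mapping would presumably be `{".html": parse_html, ".xml": parse_xml}`, and the returned value can be unpacked in the same way as in the snippet above.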

### Supported publishers:

Currently, Chemistry Paper Parser supports the following publishers and file types.

| Publisher | Supports HTML | Supports XML |
|-----------|---------------|--------------|
| [RSC](https://pubs.rsc.org/) | ✓ | ✗ |
| [Springer](https://www.springer.com/us) | ✓ | ✗ |
| [Nature](https://www.nature.com/) | ✓ | ✗ |
| [Wiley](https://onlinelibrary.wiley.com/) | ✓ | ✗ |
| [AIP](https://pubs.aip.org/) | ✓ | ✗ |
| [ACS](https://pubs.acs.org/) | ✓ | ✓ |
| [Elsevier](https://www.elsevier.com/) | ✓ | ✓ |
| [AAAS (Science)](https://www.science.org/journals) | ✓ | ✗ |

Note that table parsing is not available for every publisher.

For figures, the current version parses and saves only the captions.
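
For scripts that need to check support before attempting a parse, the table above can be transcribed into a lookup. This is only an illustrative sketch: `SUPPORT` and `is_supported` are hypothetical names that do not exist in `chempp`, and the values simply mirror the table for this version.

```python
# Support matrix transcribed from the publisher table (True = supported).
SUPPORT = {
    "RSC": {"html": True, "xml": False},
    "Springer": {"html": True, "xml": False},
    "Nature": {"html": True, "xml": False},
    "Wiley": {"html": True, "xml": False},
    "AIP": {"html": True, "xml": False},
    "ACS": {"html": True, "xml": True},
    "Elsevier": {"html": True, "xml": True},
    "AAAS (Science)": {"html": True, "xml": False},
}


def is_supported(publisher: str, file_type: str) -> bool:
    """Return True if the table lists `file_type` as supported for `publisher`."""
    formats = SUPPORT.get(publisher)
    # Unknown publishers and unknown file types both count as unsupported.
    return bool(formats and formats.get(file_type.lower(), False))
```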

## 2. Example

The open-access ACS article [Toland et al. (2023)](https://pubs.acs.org/doi/10.1021/acs.jpca.3c05870#) is used here as an example to demonstrate the article parsing process.
The offline file is provided at `./examples/Toland.et.al.2023.html`.
For online articles, you can either download the HTML file manually and load it as demonstrated below, or use the provided `chempp.crawler.load_online_html` function (requires external dependencies).

To parse the example article, run the following command in your shell:
```bash
PYTHONPATH="." python ./examples/process_articles.py --input_dir ./examples/ --output_dir ./output/ --output_format pt
```
The `--input_dir` argument can be either a file path or a directory. If it is a directory, the program tries to read and parse all `html` and `xml` files in that folder.
`--output_format` defines the format of the parsed output file:

- `pt` retains all structural information within the [`Article`](https://github.com/Yinghao-Li/ChemistryHTMLPaperParser/blob/087cf01fb0a0b44008e3ac987ba4e77e2d9f8d3c/chempp/article/article.py#L57) class.
- `jsonl` saves the file as a Doccano-compatible JSONL file for easy annotation.
- `html` saves the file as simplified HTML that displays the annotated sentences and tokens. It is also a good way to inspect the quality of the parsed article.
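
The file-or-directory behaviour of `--input_dir` described above can be sketched as follows. This is an assumption-based illustration rather than the code in `process_articles.py`: `collect_article_files` is a hypothetical name, and the scan is deliberately non-recursive because the source does not say whether subfolders are searched.

```python
from pathlib import Path
from typing import List


def collect_article_files(input_path: str) -> List[Path]:
    """Return the file itself, or the HTML/XML files directly inside a directory."""
    path = Path(input_path)
    if path.is_file():
        return [path]
    # Non-recursive scan; the extension check is case-insensitive.
    return sorted(
        p for p in path.iterdir()
        if p.is_file() and p.suffix.lower() in {".html", ".xml"}
    )
```

Sorting the result keeps the processing order stable between runs, which makes logs easier to compare.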

Note that [`./examples/process_articles.py`](./examples/process_articles.py) is only an incomplete demonstration of the `chempp` APIs and their usage.
The notebook [`./examples/example.ipynb`](./examples/example.ipynb) demonstrates the structure of the parsed `Article` object and some possible use cases.
You can find more details about Chemistry Paper Parser and its applications in my [blog](https://yinghao-li.github.io/posts/2023/07/material-ie/).
I'll provide a more comprehensive API introduction in the future if needed.


## 3. Known issues

Due to the variety of HTML/XML documents, not all documents can be parsed successfully.
Reporting failed cases in the Issues section would help us improve the package.

- HTML highlighting may sometimes fail when multiple entities start at the same position, due to incorrect text span alignment.
- Section extraction may fail for Elsevier articles when section ids are `s[\d]+` instead of `sec[\d]+`, as mentioned in [this issue](https://github.com/Yinghao-Li/ChemistryHTMLPaperParser/issues/2).
- Abstract extraction fails for RSC articles because of an updated HTML format, as mentioned in [this issue](https://github.com/Yinghao-Li/ChemistryHTMLPaperParser/issues/1).
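
To illustrate the Elsevier section-id mismatch, a single pattern can accept both id styles. This is a hedged sketch of the idea only, not the fix adopted in the repository; `SECTION_ID` and `is_section_id` are hypothetical names.

```python
import re

# Accepts both id styles named in the issue: "sec3" and "s3".
SECTION_ID = re.compile(r"s(?:ec)?\d+")


def is_section_id(element_id: str) -> bool:
    """Return True if the whole id is `s` or `sec` followed by one or more digits."""
    return SECTION_ID.fullmatch(element_id) is not None
```

Using `fullmatch` rather than `search` avoids accepting ids that merely contain such a substring, for example `section1` or `abs12`.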

## Citation

Please consider citing the following article if you find our package useful.
Although it is not mentioned in the article, Chemistry Paper Parser is a part of that project.
```
@article{toland.2023.accelerated.scheme,
  author = {Toland, Aubrey and Tran, Huan and Chen, Lihua and Li, Yinghao and Zhang, Chao and Gutekunst, Will and Ramprasad, Rampi},
  title = {Accelerated Scheme to Predict Ring-Opening Polymerization Enthalpy: Simulation-Experimental Data Fusion and Multitask Machine Learning},
  journal = {The Journal of Physical Chemistry A},
  volume = {127},
  number = {50},
  pages = {10709-10716},
  year = {2023},
  doi = {10.1021/acs.jpca.3c05870},
  note ={PMID: 38055927},
  URL = {https://doi.org/10.1021/acs.jpca.3c05870},
  eprint = {https://doi.org/10.1021/acs.jpca.3c05870}
}
```

            
