pubmed-parser


* Name: pubmed-parser
* Version: 0.4.0
* Home page: https://github.com/titipata/pubmed_parser
* Summary: A python parser for Pubmed Open-Access Subset and MEDLINE XML repository
* Upload time: 2024-04-13 12:22:47
* Author: Titipat Achakulvisut
* License: MIT (c) 2015 - 2019 Titipat Achakulvisut, Daniel E. Acuna
* Keywords: python, medline, pubmed, biomedical corpus, natural language processing
* Requirements: No requirements were recorded.

# Pubmed Parser: A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset

[![License](https://img.shields.io/badge/License-MIT-blue.svg)](https://github.com/titipata/pubmed_parser/blob/master/LICENSE) [![DOI](https://joss.theoj.org/papers/10.21105/joss.01979/status.svg)](https://doi.org/10.21105/joss.01979)
 [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.3660006.svg)](https://doi.org/10.5281/zenodo.3660006) [![Build Status](https://travis-ci.com/titipata/pubmed_parser.svg?branch=master)](https://travis-ci.com/titipata/pubmed_parser)

Pubmed Parser is a Python library for parsing the [PubMed Open-Access (OA) subset](http://www.ncbi.nlm.nih.gov/pmc/tools/ftp/), [MEDLINE XML](https://www.nlm.nih.gov/bsd/licensee/) repositories, and [Entrez Programming Utilities (E-utils)](https://eutils.ncbi.nlm.nih.gov/). It uses the `lxml` library to parse this information into a Python dictionary which can be easily used for research, such as in text mining and natural language processing pipelines.

For available APIs and details about the dataset, please see our [wiki page](https://github.com/titipata/pubmed_parser/wiki) or [documentation page](http://titipata.github.io/pubmed_parser/). Below, we list some of the core functionalities and code examples.

## Available Parsers

* The `path` provided to a function can be the path to a compressed or uncompressed XML file. We provide example files in the [`data`](data/) folder.
* For website parsing, pause between requests. Please see the [copyright notice](https://www.ncbi.nlm.nih.gov/pmc/about/copyright/#copy-PMC) because your IP can get blocked if you try to download in bulk.

Below, we list available parsers from `pubmed_parser`.

  * [Parse PubMed OA XML information](#parse-pubmed-oa-xml-information)
  * [Parse PubMed OA citation references](#parse-pubmed-oa-citation-references)
  * [Parse PubMed OA images and captions](#parse-pubmed-oa-images-and-captions)
  * [Parse PubMed OA Paragraph](#parse-pubmed-oa-paragraph)
  * [Parse PubMed OA Table [WIP]](#parse-pubmed-oa-table-wip)
  * [Parse MEDLINE XML](#parse-medline-xml)
  * [Parse MEDLINE Grant ID](#parse-medline-grant-id)
  * [Parse MEDLINE XML from eutils website](#parse-medline-xml-from-eutils-website)
  * [Parse MEDLINE XML citations from website](#parse-medline-xml-citations-from-website)
  * [Parse Outgoing XML citations from website](#parse-outgoing-xml-citations-from-website)

### Parse PubMed OA XML information

`parse_pubmed_xml` is a simple parser for the PubMed Open Access Subset. Give it an XML path or string, and it returns a dictionary with the following information:

* `full_title` : article's title
* `abstract` : abstract
* `journal` : Journal name
* `pmid` : PubMed ID
* `pmc` : PubMed Central ID
* `doi` : DOI of the article
* `publisher_id` : publisher ID
* `author_list` : list of authors with affiliation keys in the following format

``` python
[['last_name_1', 'first_name_1', 'aff_key_1'],
 ['last_name_1', 'first_name_1', 'aff_key_2'],
 ['last_name_2', 'first_name_2', 'aff_key_1'], ...]
```

* `affiliation_list` : list of affiliation keys and affiliation strings in the following format

``` python
[['aff_key_1', 'affiliation_1'],
 ['aff_key_2', 'affiliation_2'], ...]
```

* `publication_year` : publication year
* `subjects` : subjects listed in the article, separated by semicolons. Sometimes, it only contains the type of the article, such as a research article, review proceedings, etc.

``` python
import pubmed_parser as pp
dict_out = pp.parse_pubmed_xml(path)
```
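Because `author_list` repeats an author once per affiliation key, it can help to resolve those keys against `affiliation_list`. The following is a minimal sketch that only assumes the two list formats documented above; the helper name `authors_with_affiliations` is hypothetical and not part of `pubmed_parser`, and the sample dictionary stands in for real parser output.

``` python
from collections import defaultdict

# Hypothetical sample shaped like the documented output of parse_pubmed_xml
dict_out = {
    "author_list": [
        ["Dennehy", "John J", "I1"],
        ["Dennehy", "John J", "I2"],
        ["Wang", "Ing-Nang", "I1"],
    ],
    "affiliation_list": [
        ["I1", "Department of Biological Sciences, ..."],
        ["I2", "Biology Department, Queens College, and the Graduate Center ..."],
    ],
}


def authors_with_affiliations(parsed):
    """Map each (last_name, first_name) pair to its affiliation strings."""
    # affiliation_list is a list of [key, string] pairs, so dict() accepts it
    aff_by_key = dict(parsed["affiliation_list"])
    result = defaultdict(list)
    for last_name, first_name, aff_key in parsed["author_list"]:
        # Fall back to the raw key if no affiliation string is listed for it
        result[(last_name, first_name)].append(aff_by_key.get(aff_key, aff_key))
    return dict(result)


author_affiliations = authors_with_affiliations(dict_out)
```

Keying on the name pair keeps one entry per author while preserving the order in which affiliations appear.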

### Parse PubMed OA citation references

The function `parse_pubmed_references` processes a PubMed Open Access XML file and returns a list of dictionaries, one for each reference the article cites. Each dictionary has the following keys:

* `pmid` : PubMed ID of the article
* `pmc` : PubMed Central ID of the article
* `article_title` : title of cited article
* `journal` : journal name
* `journal_type` : type of journal
* `pmid_cited` : PubMed ID of the article that the article cites
* `doi_cited` : DOI of the article that the article cites
* `year` : publication year as it appears in the reference (may include a letter suffix, e.g. 2007a)

``` python
dicts_out = pp.parse_pubmed_references(path) # return list of dictionary
```
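Since `year` is kept as it appears in the reference, a small clean-up step is usually needed before counting or sorting. The sketch below works on hand-written records shaped like the keys listed above; the PMIDs are made-up placeholders and `clean_year` is a hypothetical helper, not a library function.

``` python
import re
from collections import Counter

# Hypothetical records shaped like the documented output of parse_pubmed_references
dicts_out = [
    {"pmid": "21810267", "pmid_cited": "111", "year": "2007a"},
    {"pmid": "21810267", "pmid_cited": "", "year": "2007b"},
    {"pmid": "21810267", "pmid_cited": "222", "year": "1999"},
]


def clean_year(year):
    """Return the first four-digit year as an int, or None if there is none."""
    match = re.search(r"\d{4}", year or "")
    return int(match.group()) if match else None


# Keep only references that carry a PubMed ID
cited_pmids = [d["pmid_cited"] for d in dicts_out if d.get("pmid_cited")]

# Count references per cleaned publication year
citations_per_year = Counter(clean_year(d.get("year")) for d in dicts_out)
```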

### Parse PubMed OA images and captions

The function `parse_pubmed_caption` parses image captions from a given path to an XML file. It returns a reference index that you can use to refer back to the actual images. The function returns a list of dictionaries, each with the following keys:

* `pmid` : PubMed ID
* `pmc` : PubMed Central ID
* `fig_caption` : string of caption
* `fig_id` : reference ID for the figure (used to refer to it within the XML article)
* `fig_label` : label of the figure
* `graphic_ref` : reference to image file name provided from Pubmed OA

``` python
dicts_out = pp.parse_pubmed_caption(path) # return list of dictionary
```

### Parse PubMed OA Paragraph

If you are interested in parsing the text surrounding a citation, the library provides that functionality too. You can use `parse_pubmed_paragraph` to parse text and reference PMIDs. This function returns a list of dictionaries, where each entry has the following keys:

* `pmid` : PubMed ID
* `pmc` : PubMed Central ID
* `text` : full text of the paragraph
* `reference_ids` : list of reference codes within that paragraph. These IDs can be merged with the output from `parse_pubmed_references`.
* `section` : section of paragraph (e.g. Background, Discussion, Appendix, etc.)

``` python
dicts_out = pp.parse_pubmed_paragraph('data/6605965a.nxml', all_paragraph=False)
```
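One common follow-up is to regroup the paragraph records by `section`. This is a rough sketch on hand-written records that use only the keys listed above; the reference codes (`B1`, `B2`, `B3`) and texts are invented for illustration.

``` python
from collections import defaultdict

# Hypothetical records shaped like the documented output of parse_pubmed_paragraph
dicts_out = [
    {"section": "Background", "text": "First paragraph.", "reference_ids": ["B1"]},
    {"section": "Background", "text": "Second paragraph.", "reference_ids": []},
    {"section": "Discussion", "text": "Third paragraph.", "reference_ids": ["B2", "B3"]},
]

text_by_section = defaultdict(list)
refs_by_section = defaultdict(set)
for paragraph in dicts_out:
    text_by_section[paragraph["section"]].append(paragraph["text"])
    refs_by_section[paragraph["section"]].update(paragraph["reference_ids"])

# Join the paragraphs of each section into a single string
section_text = {
    section: "\n\n".join(texts) for section, texts in text_by_section.items()
}
```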

### Parse PubMed OA Table [WIP]

You can use `parse_pubmed_table` to parse tables from an XML file. This function returns a list of dictionaries, each with the following keys:

* `pmid` : PubMed ID
* `pmc` : PubMed Central ID
* `caption` : caption of the table
* `label` : label of the table
* `table_columns` : list of column names
* `table_values` : list of values inside the table
* `table_xml` : raw XML text of the table (returned if `return_xml=True`)

``` python
dicts_out = pp.parse_pubmed_table('data/6605965a.nxml', return_xml=False)
```
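To work with a parsed table row by row, the column names can be zipped with the values. The sketch below assumes that each entry of `table_values` is one row whose items line up with `table_columns`; that layout is an assumption, not something stated above, so check it against real output first. The record and the helper `table_to_rows` are hypothetical.

``` python
# Hypothetical record; assumes each entry of table_values is one table row
table = {
    "label": "Table 1",
    "caption": "Example caption",
    "table_columns": ["Strain", "Lysis time (min)"],
    "table_values": [["A", "45"], ["B", "52"]],
}


def table_to_rows(table):
    """Turn one parsed table into a list of {column_name: value} dictionaries."""
    columns = table["table_columns"]
    # zip stops at the shorter sequence, so ragged rows lose their extra cells
    return [dict(zip(columns, row)) for row in table["table_values"]]


rows = table_to_rows(table)
```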

### Parse MEDLINE XML

MEDLINE XML has a different XML format than PubMed Open Access. The structure of the XML files is described in the [MEDLINE/PubMed DTD](https://www.nlm.nih.gov/databases/dtd/). You can use the function `parse_medline_xml` to parse that format. This function returns a list of dictionaries, where each element contains:

* `pmid` : PubMed ID
* `pmc` : PubMed Central ID
* `doi` : DOI
* `other_id` : Other IDs found, each separated by `;`
* `title` : title of the article
* `abstract` : abstract of the article
* `authors` : authors, each separated by `;`
* `mesh_terms` : list of MeSH terms with corresponding MeSH ID, each separated by `;` e.g. `'D000161:Acoustic Stimulation; D000328:Adult; ...'`
* `publication_types` : list of publication types, each separated by `;` e.g. `'D016428:Journal Article'`
* `keywords` : list of keywords, each separated by `;`
* `chemical_list` : list of chemical terms, each separated by `;`
* `pubdate` : Publication date. Defaults to year information only.
* `journal` : journal of the given paper
* `medline_ta` : abbreviation of the journal name
* `nlm_unique_id` : NLM unique identification
* `issn_linking` : ISSN linkage, typically used to link with the Web of Science dataset
* `country` : country extracted from the journal information field
* `reference` : string of PMIDs, each separated by `;`, or list of references made to the article
* `delete` : boolean; if `False`, the paper got updated, so you might have two XMLs for the same paper. You can delete the record of a deleted paper because it got updated.
* `languages` : list of languages, separated by `;`
* `vernacular_title` : vernacular title. Defaults to an empty string when not available.

``` python
dicts_out = pp.parse_medline_xml('data/medline16n0902.xml.gz',
                                 year_info_only=False,
                                 nlm_category=False,
                                 author_list=False,
                                 reference_list=False) # return list of dictionary
```

To extract month and day information from PubDate, set `year_info_only=False`. Structured abstracts can also be parsed, and the `nlm_category` argument controls how each section or label is displayed.
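Several of the fields above are `;`-separated strings, so a small splitting step is often the first thing done with a record. The following sketch uses the `mesh_terms` format shown in the key list; `split_field` and `parse_mesh_terms` are hypothetical helpers, and the record is a hand-written stand-in for one element of the parser output.

``` python
def split_field(value, sep=";"):
    """Split a separator-delimited string into stripped, non-empty items."""
    return [item.strip() for item in value.split(sep) if item.strip()]


def parse_mesh_terms(mesh_terms):
    """Turn 'ID:Term; ID:Term' into a list of (mesh_id, term) tuples."""
    pairs = []
    for item in split_field(mesh_terms):
        # partition splits on the first colon only, so colons in the term survive
        mesh_id, _, term = item.partition(":")
        pairs.append((mesh_id.strip(), term.strip()))
    return pairs


# Hypothetical record shaped like one element returned by parse_medline_xml
record = {
    "mesh_terms": "D000161:Acoustic Stimulation; D000328:Adult",
    "keywords": "",
}

mesh = parse_mesh_terms(record["mesh_terms"])
keywords = split_field(record["keywords"])
```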

### Parse MEDLINE Grant ID

Use `parse_medline_grant_id` to parse MEDLINE grant IDs from an XML file. This will return a list of dictionaries, each containing

* `pmid` : PubMed ID
* `grant_id` : Grant ID
* `grant_acronym` : Acronym of grant
* `country` : Country the grant funding comes from
* `agency` : Grant agency

If no Grant ID is found, it will return `None`.
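Because the result can be `None`, downstream code should guard against it before iterating. Here is a minimal sketch that groups grant IDs by agency; the helper `grants_by_agency` is hypothetical and all grant values in the sample are invented placeholders.

``` python
from collections import defaultdict


def grants_by_agency(grants):
    """Group grant IDs by agency, treating a None result as 'no grants'."""
    grouped = defaultdict(list)
    for grant in grants or []:
        grouped[grant["agency"]].append(grant["grant_id"])
    return dict(grouped)


# Hypothetical records using the keys documented above
sample_grants = [
    {"pmid": "1", "grant_id": "G-001", "grant_acronym": "XX",
     "country": "Country A", "agency": "Agency A"},
    {"pmid": "1", "grant_id": "G-002", "grant_acronym": "YY",
     "country": "Country A", "agency": "Agency A"},
    {"pmid": "2", "grant_id": "G-003", "grant_acronym": "ZZ",
     "country": "Country B", "agency": "Agency B"},
]

by_agency = grants_by_agency(sample_grants)
no_grants = grants_by_agency(None)
```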

### Parse MEDLINE XML from eutils website

You can use Pubmed Parser to parse an XML file from [E-Utilities](http://www.ncbi.nlm.nih.gov/books/NBK25501/) using `parse_xml_web`. For this function, you provide a single `pmid` as input and get a dictionary with the following keys:

* `title` : title
* `abstract` : abstract
* `journal` : journal
* `affiliation` : affiliation of first author
* `authors` : string of authors, separated by `;`
* `year` : Publication year
* `keywords` : keywords or MeSH terms of the article

``` python
dict_out = pp.parse_xml_web(pmid, save_xml=False)
```

### Parse MEDLINE XML citations from website

The function `parse_citation_web` allows you to get the citations to a given PubMed ID or PubMed Central ID. This will return a dictionary which contains the following keys

* `pmc` : PubMed Central ID
* `pmid` : PubMed ID
* `doi` : DOI of the article
* `n_citations` : number of citations for given articles
* `pmc_cited` : list of PMCs that cite the given PMC

``` python
dict_out = pp.parse_citation_web(doc_id, id_type='PMC')
```

### Parse Outgoing XML citations from website

The function `parse_outgoing_citation_web` allows you to get the articles a given article cites, given a PubMed ID or PubMed Central ID. This will return a dictionary which contains the following keys

* `n_citations` : number of cited articles
* `doc_id` : the document identifier given
* `id_type` : the type of identifier given. Either `'PMID'` or `'PMC'`
* `pmid_cited` : list of PMIDs cited by the article

``` python
dict_out = pp.parse_outgoing_citation_web(doc_id, id_type='PMID')
```

Identifiers should be passed as strings. PubMed Central IDs are the default, and should be passed as strings *without* the `'PMC'` prefix. If no citations are found, or if no article matching `doc_id` is found in the indicated database, it will return `None`.
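When calling the web parsers for many identifiers, the advice to pause between requests and the possible `None` result can be handled in one small loop. The sketch below is one way to do it: `fetch_with_pause` and `fake_fetch` are hypothetical names, the one-second default is an arbitrary assumption rather than an official rate limit, and `fake_fetch` stands in for a real call such as `pp.parse_outgoing_citation_web` so that the example runs offline.

``` python
import time


def fetch_with_pause(doc_ids, fetch, pause_seconds=1.0):
    """Call fetch(doc_id) for each ID, sleeping between calls and skipping None."""
    results = {}
    for index, doc_id in enumerate(doc_ids):
        if index > 0:
            # Pause before every request except the first one
            time.sleep(pause_seconds)
        result = fetch(doc_id)
        if result is not None:
            results[doc_id] = result
    return results


def fake_fetch(doc_id):
    """Offline stand-in for a web parser; returns None for the ID '0'."""
    if doc_id == "0":
        return None
    return {"doc_id": doc_id, "id_type": "PMID", "n_citations": 0, "pmid_cited": []}


results = fetch_with_pause(["1", "0", "2"], fake_fetch, pause_seconds=0.01)
```

Passing the fetch function as an argument keeps the pacing logic independent of which web parser is used.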

## Installation

You can install the most up-to-date version of the package directly from the repository

``` bash
pip install git+https://github.com/titipata/pubmed_parser.git
```

or install the latest release from [PyPI](https://pypi.org/project/pubmed-parser/) using

``` bash
pip install pubmed-parser
```

or clone the repository and install using `pip`

``` bash
git clone https://github.com/titipata/pubmed_parser
pip install ./pubmed_parser
```

You can test your installation by running `pytest --cov=pubmed_parser tests/ --verbose`
in the root of the repository.

## Example snippet to parse PubMed OA dataset

An example usage is shown as follows

``` python
import pubmed_parser as pp
path_xml = pp.list_xml_path('data') # list all xml paths under directory
pubmed_dict = pp.parse_pubmed_xml(path_xml[0]) # dictionary output
print(pubmed_dict)

# Example output:
{'abstract': u"Background Despite identical genotypes and ...",
 'affiliation_list':
  [['I1', 'Department of Biological Sciences, ...'],
   ['I2', 'Biology Department, Queens College, and the Graduate Center ...']],
 'author_list':
  [['Dennehy', 'John J', 'I1'],
   ['Dennehy', 'John J', 'I2'],
   ['Wang', 'Ing-Nang', 'I1']],
 'full_title': u'Factors influencing lysis time stochasticity in bacteriophage \u03bb',
 'journal': 'BMC Microbiology',
 'pmc': '3166277',
 'pmid': '21810267',
 'publication_year': '2011',
 'publisher_id': '1471-2180-11-174',
 'subjects': 'Research Article'}
```

## Example Usage with PySpark

This is a snippet to parse the entire PubMed Open Access subset using [PySpark 2.1](https://spark.apache.org/docs/latest/api/python/index.html)

``` python
import os
import pubmed_parser as pp
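from pyspark.sql import Row, SparkSession

# Create or reuse a Spark session (the PySpark shell already provides `spark`)
spark = SparkSession.builder.getOrCreate()
path_all = pp.list_xml_path('/path/to/xml/folder/')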
path_rdd = spark.sparkContext.parallelize(path_all, numSlices=10000)
parse_results_rdd = path_rdd.map(lambda x: Row(file_name=os.path.basename(x),
                                               **pp.parse_pubmed_xml(x)))
pubmed_oa_df = parse_results_rdd.toDF() # Spark dataframe
pubmed_oa_df_sel = pubmed_oa_df[['full_title', 'abstract', 'doi',
                                 'file_name', 'pmc', 'pmid',
                                 'publication_year', 'publisher_id',
                                 'journal', 'subjects']] # select columns
pubmed_oa_df_sel.write.parquet('pubmed_oa.parquet', mode='overwrite') # write dataframe
```

See [scripts](https://github.com/titipata/pubmed_parser/tree/master/scripts)
folder for more information.

## Core Members

* [Titipat Achakulvisut](http://titipata.github.io)
* [Daniel E. Acuna](http://scienceofscience.org/about)

and [contributors](https://github.com/titipata/pubmed_parser/graphs/contributors)

## Dependencies

* [lxml](http://lxml.de/)
* [unidecode](https://pypi.python.org/pypi/Unidecode)
* [requests](http://docs.python-requests.org/en/master/)

## Citation

If you use Pubmed Parser, please cite it from [JOSS](https://joss.theoj.org/papers/10.21105/joss.01979) as follows

> Achakulvisut et al., (2020). Pubmed Parser: A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset XML Dataset. Journal of Open Source Software, 5(46), 1979, https://doi.org/10.21105/joss.01979

or using BibTeX

```
@article{Achakulvisut2020,
  doi = {10.21105/joss.01979},
  url = {https://doi.org/10.21105/joss.01979},
  year = {2020},
  publisher = {The Open Journal},
  volume = {5},
  number = {46},
  pages = {1979},
  author = {Titipat Achakulvisut and Daniel Acuna and Konrad Kording},
  title = {Pubmed Parser: A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset XML Dataset},
  journal = {Journal of Open Source Software}
}
```

## Contributions

We welcome contributions from anyone who would like to improve Pubmed Parser. You can create [GitHub issues](https://github.com/titipata/pubmed_parser/issues) to discuss questions or issues relating to the repository. We suggest reading our [Contributing Guidelines](https://github.com/titipata/pubmed_parser/blob/master/CONTRIBUTING.md) before creating issues, reporting bugs, or making a contribution to the repository.

## Acknowledgement

This package is developed in [Konrad Kording's Lab](http://kordinglab.com/) at the University of Pennsylvania. We would like to thank reviewers and the editor from [JOSS](https://joss.readthedocs.io/en/latest/) including [`tleonardi`](https://github.com/tleonardi), [`timClicks`](https://github.com/timClicks), and [`majensen`](https://github.com/majensen). They made our repository much better!

## License

MIT License Copyright (c) 2015-2020 Titipat Achakulvisut, Daniel E. Acuna

            
