pubmedparser2


- Name: pubmedparser2
- Version: 2.0.6
- Home page: https://gitlab.com/net-synergy/pubmedparser
- Summary: Download and parse pubmed publication data
- Upload time: 2024-01-21 01:11:30
- Maintainer: David Connell
- Author: David Connell
- Requires Python: `>=3.9,<4.0`
- License: MIT
- Keywords: publication, network, medline, pubmed, references

Read XML files and pull out selected values. The values to collect are
determined by paths listed in a [structure file](#structure-file). The
structure file also defines a key, which associates each value with its
parent element, and names, which determine the file each value is
written to.

Input files can be gzipped or uncompressed XML, or the XML can be read
from standard input.

For more information on PubMed's XML files see
[pubmed_190101.dtd](https://dtd.nlm.nih.gov/ncbi/pubmed/doc/out/190101/index.html).

Usage:

``` python
import pubmedparser
import pubmedparser.ftp

# Download data
files = pubmedparser.ftp.download(range(1, 6))

# Read XML files using a YAML file to describe what data to collect.
data_dir = "file_example"
structure_file = "example/structure.yml"
results = pubmedparser.read_xml(files, structure_file, data_dir)
```

See [the example
file](https://gitlab.com/net-synergy/pubmedparser/-/blob/master/example/creating_graphs.py)
for more options.

In Python, the structure file can also be replaced with a dictionary of
dictionaries.

Or, as a CLI:

``` bash
xml_read --cache-dir=cache --structure-file=structure.yml \
    data/*.xml.gz
```

## Installing with pip

``` bash
pip install pubmedparser2
```

## Building python package

Requires `zlib`.

Clone the repository and change into its directory. Then use [poetry](https://python-poetry.org/docs) to install the dependencies.

``` bash
poetry install
```

Then run the make command:

``` bash
make python
```

# Structure file

The structure file is a YAML file containing key-value pairs that map
names to tags and paths. There are two required keys: `root` and `key`.
`root` provides the top-level tag; for the PubMed files this is
`PubmedArticleSet`.

``` yaml
root: "/PubmedArticleSet"
```

The leading `/` is not strictly required, as the program ignores it, but
it is used to conform to
[xpath](https://en.wikipedia.org/wiki/XPath) syntax (although this
program does not handle all cases of `xpath`).
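
As a rough illustration of why the leading `/` is optional, the sketch
below splits a path into tag names and discards empty segments.
`split_path` is a hypothetical helper written for this example, not part
of pubmedparser, and the real parser may tokenize paths differently.

``` python
def split_path(path: str) -> list[str]:
    """Split an XPath-like string into tag names, dropping empty segments."""
    return [segment for segment in path.split("/") if segment]


# Both spellings yield the same sequence of tags.
with_slash = split_path("/PubmedArticle/MedlineCitation/PMID")
without_slash = split_path("PubmedArticle/MedlineCitation/PMID")
print(with_slash)
print(with_slash == without_slash)
```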

Only tags below the root tag will be considered and the parsing will
terminate once the program has left the root of the tree.

`key` is a reference tag. In the PubMed case, all data is relative to a
publication, so the key should identify the publication the values are
linked to. The `PMID` tag is a suitable candidate.

``` yaml
key: "/PubmedArticle/MedlineCitation/PMID"
```

After `root`, all paths are taken as relative to the root node.

The other name-value pairs in the file determine which additional items
to collect. These can be a simple name and path, like the key:

``` yaml
Language: "/PubmedArticle/MedlineCitation/Article/Language"
Keywords: "/PubmedArticle/MedlineCitation/KeywordList/Keyword"
```

Or they can use a hierarchical representation to get multiple values
below a child. This is mainly used to handle lists of items where there
is an indefinite number of items below the list.

``` yaml
Author: {
  root: "/PubmedArticle/MedlineCitation/Article/AuthorList",
  key: "/Author/auto_index",
  LastName: "/Author/LastName",
  ForeName: "/Author/ForeName",
  Affiliation: "/Author/AffiliationInfo/Affiliation",
  Orcid: "/Author/Identifier/[@Source='ORCID']"
}
```

Here, all paths are relative to the sub-structure's `root` path, which
is in turn relative to the parent structure's `root`. The sub-structure
follows the same rules as the parent structure, so it needs both a
`root` and a `key` name-value pair. The results of searching each path
are written to separate files. Each file gets a column for the parent
key and a column for the child key. In this case, each element of an
author is linked by an author key, which is related to the publication
they authored through the parent key.

The main parser is called recursively to parse this structure, so it is
worth thinking about what the root should be given that the parser will
be called with that root. If, instead of stopping at `/AuthorList`,
`/Author` were added to the end of the root, the parser would be called
once per author rather than once per author list, and every author
would get the index 0.

There are a number of additional syntax constructs to note in the above
example. The key uses the special name `auto_index`. Since there is no
author ID in the XML data, an index is used to count the authors in the
order they appear. The index resets for each publication and starts at
0. Treating `auto_index` as the tail of a path lets you control when the
indexing occurs: the index is incremented whenever the parser hits an
`/Author` tag.
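
The indexing behavior described above can be approximated in plain
Python. The sketch below uses the standard library's `ElementTree`
rather than pubmedparser, on a made-up and heavily trimmed XML sample.
The column order (parent key, child key, value) is an assumption made
for illustration.

``` python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed input with two publications.
XML = """
<PubmedArticleSet>
  <PubmedArticle><MedlineCitation>
    <PMID>111</PMID>
    <Article><AuthorList>
      <Author><LastName>Curie</LastName></Author>
      <Author><LastName>Noether</LastName></Author>
    </AuthorList></Article>
  </MedlineCitation></PubmedArticle>
  <PubmedArticle><MedlineCitation>
    <PMID>222</PMID>
    <Article><AuthorList>
      <Author><LastName>Franklin</LastName></Author>
    </AuthorList></Article>
  </MedlineCitation></PubmedArticle>
</PubmedArticleSet>
"""


def last_name_rows(xml_text):
    """Return (PMID, author index, last name) rows for every author."""
    rows = []
    root = ET.fromstring(xml_text)
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext("MedlineCitation/PMID")
        authors = article.findall("MedlineCitation/Article/AuthorList/Author")
        # enumerate restarts at 0 for each publication, mimicking auto_index.
        for index, author in enumerate(authors):
            rows.append((pmid, index, author.findtext("LastName")))
    return rows


rows = last_name_rows(XML)
for row in rows:
    print("\t".join(str(value) for value in row))
```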

In addition to the `auto_index` key, there is a second special index
name, `condensed`.

``` yaml
Reference: {
  root: "/PubmedArticle/PubmedData/ReferenceList/Reference/ArticleIdList",
  key: "/condensed",
  PMID: "/ArticleId/[@IdType='pubmed']",
  DOI: "/ArticleId/[@IdType='doi']"
}
```

In the case of `condensed`, instead of writing the results to separate
files, they will be printed as columns in the same file, so the
sub-structure does not need an additional key. If any of the elements
are missing, they will be left blank. For example, if the parser does
not find a PubMed ID for a given reference, the row will look like
`"%s\t\t%s"`, where the first string contains the parent key (the
`PMID` of the publication citing this reference) and the second string
contains the reference's `DOI`.
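
To make the row layout concrete, here is a minimal sketch of how such a
tab-separated row could be assembled. `condensed_row` is a hypothetical
helper for this example only, and the sample identifiers are made up.

``` python
def condensed_row(parent_key, values):
    """Join the parent key and each value with tabs; None becomes an empty field."""
    fields = [parent_key] + ["" if value is None else value for value in values]
    return "\t".join(fields)


# A reference with a DOI but no PubMed ID leaves the middle column empty.
row = condensed_row("12345", [None, "10.1000/example"])
print(repr(row))
```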

The `/[@attribute='value']` syntax at the end of a path tells the parser
to collect an element only if it has the attribute and the attribute's
value matches the supplied value. Similarly, the `/@attribute` syntax
tells the parser to collect the value of the attribute `attribute` along
with the element's value; both values are then written to the output
file. Currently only a single attribute can be specified.
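
For readers more familiar with Python's XML tools, the two attribute
forms correspond roughly to the following `ElementTree` operations. This
is an analogy on a made-up XML fragment, not pubmedparser's own
implementation, and the order in which attribute and text are paired is
an assumption.

``` python
import xml.etree.ElementTree as ET

# Hypothetical fragment containing the elements used in the paths above.
XML = """
<Record>
  <ArticleIdList>
    <ArticleId IdType="pubmed">12345</ArticleId>
    <ArticleId IdType="doi">10.1000/example</ArticleId>
  </ArticleIdList>
  <Chemical><NameOfSubstance UI="D000001">Calcimycin</NameOfSubstance></Chemical>
</Record>
"""
record = ET.fromstring(XML)

# Analogue of "/ArticleId/[@IdType='doi']": keep only matching elements.
dois = [el.text for el in record.iter("ArticleId") if el.get("IdType") == "doi"]

# Analogue of "/NameOfSubstance/@UI": collect the attribute with the text.
chemicals = [(el.get("UI"), el.text) for el in record.iter("NameOfSubstance")]

print(dois)
print(chemicals)
```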

Lastly, there is a special syntax for writing condensed sub-structures:

``` yaml
Date: "/PubmedArticle/MedlineCitation/Article/Journal/JournalIssue/PubDate/{Year,Month,Day}"
```

The `{child,child,child}` syntax allows you to select multiple children
at the same level to be printed to a single file. This is useful when
multiple children make up a single piece of information (e.g. the
publication date).
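
One way to think about the brace syntax is as shorthand for several
sibling paths. The sketch below expands such a path into its individual
child paths. `expand_children` is a hypothetical helper for
illustration, and the parser itself may handle braces differently.

``` python
import re


def expand_children(path):
    """Expand a trailing {a,b,c} group into one path per child."""
    match = re.fullmatch(r"(.*)/\{([^{}]+)\}", path)
    if match is None:
        # No brace group at the end: the path stands for itself.
        return [path]
    parent, children = match.groups()
    return [f"{parent}/{child.strip()}" for child in children.split(",")]


date_paths = expand_children(
    "/PubmedArticle/MedlineCitation/Article/Journal/JournalIssue/PubDate/{Year,Month,Day}"
)
for date_path in date_paths:
    print(date_path)
```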

A similar example structure file can be found in the example directory
of this project at
[example/structure.yml](./example/structure.yml).

# Structure dictionary

The structure of the XML data to read can also be described as a Python
dictionary of dictionaries.

The form is similar to the file:

``` python
structure = {
    "root": "//PubmedArticleSet",
    "key": "/PubmedArticle/MedlineCitation/PMID",
    "DOI": "/PubmedArticle/PubmedData/ArticleIdList/ArticleId/[@IdType='doi']",
    "Date": "/PubmedArticle/MedlineCitation/Article/Journal/JournalIssue/PubDate/{Year,Month,Day}",
    "Journal": "/PubmedArticle/MedlineCitation/Article/Journal/{Title,ISOAbbreviation}",
    "Language": "/PubmedArticle/MedlineCitation/Article/Language",
    "Author": {
        "root": "/PubmedArticle/MedlineCitation/Article/AuthorList",
        "key": "/Author/auto_index",
        "LastName": "/Author/LastName",
        "ForeName": "/Author/ForeName",
        "Affiliation": "/Author/AffiliationInfo/Affiliation",
        "Orcid": "/Author/Identifier/[@Source='ORCID']",
    },
    "Grant": {
        "root": "/PubmedArticle/MedlineCitation/Article/GrantList",
        "key": "/Grant/auto_index",
        "ID": "/Grant/GrantID",
        "Agency": "/Grant/Agency",
    },
    "Chemical": "/PubmedArticle/MedlineCitation/ChemicalList/Chemical/NameOfSubstance/@UI",
    "Qualifier": "/PubmedArticle/MedlineCitation/MeshHeadingList/MeshHeading/QualifierName/@UI",
    "Descriptor": "/PubmedArticle/MedlineCitation/MeshHeadingList/MeshHeading/DescriptorName/@UI",
    "Keywords": "/PubmedArticle/MedlineCitation/KeywordList/Keyword",
    "Reference": {
        "root": (
            "/PubmedArticle/PubmedData/ReferenceList/Reference/ArticleIdList"
        ),
        "key": "/condensed",
        "PMID": "/ArticleId/[@IdType='pubmed']",
        "DOI": "/ArticleId/[@IdType='doi']",
    },
}
```

This can then be passed to `pubmedparser.read_xml` in place of the
structure file.
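
Because every structure and sub-structure needs both a `root` and a
`key`, it can be handy to check a dictionary before handing it to the
parser. The helper below is a hypothetical sketch, not something
pubmedparser provides, and it only checks for the presence of the two
required names.

``` python
def missing_required(structure, name="top level"):
    """Return messages for every (sub-)structure lacking `root` or `key`."""
    problems = []
    for required in ("root", "key"):
        if required not in structure:
            problems.append(f"{name}: missing '{required}'")
    # Nested dictionaries are sub-structures and follow the same rules.
    for child_name, value in structure.items():
        if isinstance(value, dict):
            problems.extend(missing_required(value, child_name))
    return problems


good = {
    "root": "/PubmedArticleSet",
    "key": "/PubmedArticle/MedlineCitation/PMID",
    "Author": {
        "root": "/PubmedArticle/MedlineCitation/Article/AuthorList",
        "key": "/Author/auto_index",
        "LastName": "/Author/LastName",
    },
}
# Same structure, but the Author sub-structure has lost its key.
bad = {**good, "Author": {k: v for k, v in good["Author"].items() if k != "key"}}

print(missing_required(good))
print(missing_required(bad))
```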

# Future goals

## Improve printing logic

Currently, values are printed as they are read in. Since the results for
the different paths are written to separate files, this shouldn't
matter, except in the case of the key. The key is not printed to its own
results file; instead, whatever key was last seen is printed as the key
for the value currently being printed. If the key is not the first
element read in the subtree, there will be a mismatch between value and
publication ID.

In the case of `PMID`, this is consistently the first element, so there
should not be a problem, but it could be one in other scenarios.
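
The mismatch can be reproduced with a small simulation. The sketch below
is a simplified Python model of the "last seen key" behavior described
above, using made-up event streams; it is not the parser's actual code.

``` python
def pair_with_last_key(events, key_tag):
    """Emit (key, tag, value) rows as events stream in, using the last seen key."""
    last_key = None
    rows = []
    for tag, value in events:
        if tag == key_tag:
            last_key = value
        else:
            rows.append((last_key, tag, value))
    return rows


# Key first in each record: values line up with their own record.
key_first = [("PMID", "111"), ("Language", "eng"),
             ("PMID", "222"), ("Language", "fre")]
# Key last in each record: each value is paired with the previous record's key.
key_last = [("Language", "eng"), ("PMID", "111"),
            ("Language", "fre"), ("PMID", "222")]

aligned = pair_with_last_key(key_first, "PMID")
shifted = pair_with_last_key(key_last, "PMID")
print(aligned)
print(shifted)
```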

## Error handling

After refactoring the code, I have started adding some error handling
code, but it has not been applied consistently. Ideally, the default
behavior will be for functions to return error codes, with an
error-checking macro testing that the result was not an error. I would
also like to add a set of error strings that would be printed depending
on the error code, and possibly use a structure to represent errors so
that the erroring function could supply an additional string along with
the error.

Better error handling like this could also allow the Python package to
write its own error-handling function in the C API, overriding the
default error mechanism to use Python-level errors. This would be done
by testing whether an error handler function was defined: if so, the
error-checking macro would use that function; otherwise it would fall
back to a default function.
