# easy-entrez
![tests](https://github.com/krassowski/easy-entrez/workflows/tests/badge.svg)
![CodeQL](https://github.com/krassowski/easy-entrez/workflows/CodeQL/badge.svg)
[![Documentation Status](https://readthedocs.org/projects/easy-entrez/badge/?version=latest)](https://easy-entrez.readthedocs.io/en/latest/?badge=latest)
[![DOI](https://zenodo.org/badge/272182307.svg)](https://zenodo.org/badge/latestdoi/272182307)
Python REST API for Entrez E-Utilities, aiming to be easy to use and reliable.
Easy-entrez:
- makes common tasks easy thanks to simple Pythonic API,
- is typed and integrates well with mypy,
- is tested on Windows, Mac and Linux across Python 3.7, 3.8, 3.9, 3.10 and 3.11
- is limited in scope, allowing to focus on the reliability of the core code,
- does not use the stateful API as it is [error-prone](https://gitlab.com/ncbipy/entrezpy/-/issues/7) as seen on example of the alternative *entrezpy*.
### Examples
```python
from easy_entrez import EntrezAPI
entrez_api = EntrezAPI(
'your-tool-name',
'e@mail.com',
# optional
return_type='json'
)
# find up to 10 000 results for cancer in human
result = entrez_api.search('cancer AND human[organism]', max_results=10_000)
# data will be populated with JSON or XML (depending on the `return_type` value)
result.data
```
See more in the [Demo notebook](./Demo.ipynb) and [documentation](https://easy-entrez.readthedocs.io/en/latest).
For a real-world example (i.e. used for [this publication](https://www.frontiersin.org/articles/10.3389/fgene.2020.610798/full)) see notebooks in [multi-omics-state-of-the-field](https://github.com/krassowski/multi-omics-state-of-the-field) repository.
#### Fetching genes for a variant from dbSNP
Fetch the SNP record for `rs6311`:
```python
rs6311 = entrez_api.fetch(['rs6311'], max_results=1, database='snp').data[0]
rs6311
```
Display the result:
```python
from easy_entrez.parsing import xml_to_string
print(xml_to_string(rs6311))
```
Find the gene names for `rs6311`:
```python
namespaces = {'ns0': 'https://www.ncbi.nlm.nih.gov/SNP/docsum'}
genes = [
name.text
for name in rs6311.findall('.//ns0:GENE_E/ns0:NAME', namespaces)
]
print(genes)
```
> `['HTR2A']`
Fetch data for multiple variants at once:
```python
result = entrez_api.fetch(['rs6311', 'rs662138'], max_results=10, database='snp')
gene_names = {
'rs' + document_summary.get('uid'): [
element.text
for element in document_summary.findall('.//ns0:GENE_E/ns0:NAME', namespaces)
]
for document_summary in result.data
}
print(gene_names)
```
> `{'rs6311': ['HTR2A'], 'rs662138': ['SLC22A1']}`
#### Obtaining the chromosomal position from SNP rsID number
```python
from pandas import DataFrame
result = entrez_api.fetch(['rs6311', 'rs662138'], max_results=10, database='snp')
variant_positions = DataFrame([
{
'id': 'rs' + document_summary.get('uid'),
'chromosome': chromosome,
'position': position
}
for document_summary in result.data
for chrom_and_position in document_summary.findall('.//ns0:CHRPOS', namespaces)
for chromosome, position in [chrom_and_position.text.split(':')]
])
variant_positions
```
> | | id | chromosome | position |
> |---:|:---------|-------------:|-----------:|
> | 0 | rs6311 | 13 | 46897343 |
> | 1 | rs662138 | 6 | 160143444 |
#### Converting full variation/mutation data to tabular format
Parsing utilities can quickly extract the data to a `VariantSet` object
holding pandas `DataFrame`s with coordinates and alternative alleles frequencies:
```python
from easy_entrez.parsing import parse_dbsnp_variants
variants = parse_dbsnp_variants(result)
variants
```
> `<VariantSet with 2 variants>`
To get the coordinates:
```python
variants.coordinates
```
> | rs_id | ref | alts | chrom | pos | chrom_prev | pos_prev | consequence |
> |:---------|:------|:-------|--------:|----------:|-------------:|-----------:|:-----------------------------------------------------------------------------|
>| rs6311 | C | A,T | 13 | 46897343 | 13 | 47471478 | upstream_transcript_variant,intron_variant,genic_upstream_transcript_variant |
>| rs662138 | C | G | 6 | 160143444 | 6 | 160564476 | intron_variant |
For frequencies:
```python
variants.alt_frequencies.head(5) # using head to only display first 5 for brevity
```
> | | rs_id | allele | source_frequency | total_count | study | count |
> |---:|:--------|:---------|-------------------:|--------------:|:------------|----------:|
> | 0 | rs6311 | T | 0.44349 | 2221 | 1000Genomes | 984.991 |
> | 1 | rs6311 | T | 0.411261 | 1585 | ALSPAC | 651.849 |
> | 2 | rs6311 | T | 0.331696 | 1486 | Estonian | 492.9 |
> | 3 | rs6311 | T | 0.35 | 14 | GENOME_DK | 4.9 |
> | 4 | rs6311 | T | 0.402529 | 56309 | GnomAD | 22666 |
#### Obtaining the SNP rs ID number from chromosomal position
You can use the query string directly:
```python
results = entrez_api.search(
'13[CHROMOSOME] AND human[ORGANISM] AND 31873085[POSITION]',
database='snp',
max_results=10
)
print(results.data['esearchresult']['idlist'])
```
> `['59296319', '17076752', '7336701', '4']`
Or pass a dictionary (no validation of arguments is performed, `AND` conjunction is used):
```python
results = entrez_api.search(
dict(chromosome=13, organism='human', position=31873085),
database='snp',
max_results=10
)
print(results.data['esearchresult']['idlist'])
```
> `['59296319', '17076752', '7336701', '4']`
The base position should use the latest genome assembly (GRCh38 at the time of writing);
you can use the position in previous assembly coordinates by replacing `POSITION` with `POSITION_GRCH37`.
For more information of the arguments accepted by the SNP database see the [entrez help page](https://www.ncbi.nlm.nih.gov/snp/docs/entrez_help/) on NCBI website.
#### Obtaining amino acids change information for variants in given range
First we search for dbSNP rs identifiers for variants in given region:
```python
dbsnp_ids = (
entrez_api
.search(
'12[CHROMOSOME] AND human[ORGANISM] AND 21178600:21178720[POSITION]',
database='snp',
max_results=100
)
.data
['esearchresult']
['idlist']
)
```
Then fetch the variant data for identifiers:
```python
variant_data = entrez_api.fetch(
['rs' + rs_id for rs_id in dbsnp_ids],
max_results=10,
database='snp'
)
```
And parse the data, extracting the HGVS out of summary:
```python
from easy_entrez.parsing import parse_dbsnp_variants
from pandas import Series
def select_protein_hgvs(items):
return [
[sequence, hgvs]
for entry in items
for sequence, hgvs in [entry.split(':')]
if hgvs.startswith('p.')
]
protein_hgvs = (
parse_dbsnp_variants(variant_data)
.summary
.HGVS
.apply(select_protein_hgvs)
.explode()
.dropna()
.apply(Series)
.rename(columns={0: 'sequence', 1: 'hgvs'})
)
protein_hgvs.head()
```
> | rs_id | sequence | hgvs |
> |:-------------|:------------|:------------|
> | rs1940853486 | NP_006437.3 | p.Gly203Ter |
> | rs1940853414 | NP_006437.3 | p.Glu202Gly |
> | rs1940853378 | NP_006437.3 | p.Glu202Lys |
> | rs1940853299 | NP_006437.3 | p.Lys201Thr |
> | rs1940852987 | NP_006437.3 | p.Asp198Glu |
#### Fetching more than 10 000 entries
Use `in_batches_of` method to fetch more than 10k entries (e.g. `variant_ids`):
```python
snps_result = (
entrez.api
.in_batches_of(1_000)
.fetch(variant_ids, max_results=5_000, database='snp')
)
```
The result is a dictionary with keys being identifiers used in each batch (because the Entrez API does not always return the indentifiers back) and values representing the result. You can use `parse_dbsnp_variants` directly on this dictionary.
#### Find PubMed ID from DOI
When searching GWAS catalog PMID is needed over DOI. You can covert one to the other using:
```python
def doi_term(doi: str) -> str:
"""Prepare DOI for PubMed search"""
doi = (
doi
.replace('http://', 'https://')
.replace('https://doi.org/', '')
)
return f'"{doi}"[Publisher ID]'
result = entrez_api.search(
doi_term('https://doi.org/10.3389/fcell.2021.626821'),
database='pubmed',
max_results=1
)
result.data['esearchresult']['idlist']
```
> `['33834021']`
### Installation
Requires Python 3.6+. Install with:
```bash
pip install easy-entrez
```
If you wish to enable (optional, tqdm-based) progress bars use:
```bash
pip install easy-entrez[with_progress_bars]
```
If you wish to enable (optional, pandas-based) parsing utilities use:
```bash
pip install easy-entrez[with_parsing_utils]
```
### Alternatives
You might want to try:
- [biopython.Entrez](https://biopython.org/docs/1.74/api/Bio.Entrez.html) - biopython is a heavy dependency, but probably good choice if you already use it
- [pubmedpy](https://github.com/dhimmel/pubmedpy) - provides interesting utilities for parsing the responses
- [entrez](https://github.com/jordibc/entrez) - appears to have a comparable scope but quite different API
- [entrezpy](https://gitlab.com/ncbipy/entrezpy) - this one did not work well for me (hence this package), but may have improved since
Raw data
{
"_id": null,
"home_page": "https://github.com/krassowski/easy-entrez",
"name": "easy-entrez",
"maintainer": "",
"docs_url": null,
"requires_python": "",
"maintainer_email": "",
"keywords": "entrez,pubmed,e-utilities,ncbi,rest,api,dbsnp,literature,mining",
"author": "Michal Krassowski",
"author_email": "krassowski.michal+pypi@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/ba/68/933e2495789257a29c0245cf5bafcb4808a369e700b75515d90a0c138ce6/easy_entrez-0.3.7.tar.gz",
"platform": null,
"description": "# easy-entrez\n\n![tests](https://github.com/krassowski/easy-entrez/workflows/tests/badge.svg)\n![CodeQL](https://github.com/krassowski/easy-entrez/workflows/CodeQL/badge.svg)\n[![Documentation Status](https://readthedocs.org/projects/easy-entrez/badge/?version=latest)](https://easy-entrez.readthedocs.io/en/latest/?badge=latest)\n[![DOI](https://zenodo.org/badge/272182307.svg)](https://zenodo.org/badge/latestdoi/272182307)\n\nPython REST API for Entrez E-Utilities, aiming to be easy to use and reliable.\n\nEasy-entrez:\n\n- makes common tasks easy thanks to simple Pythonic API,\n- is typed and integrates well with mypy,\n- is tested on Windows, Mac and Linux across Python 3.7, 3.8, 3.9, 3.10 and 3.11\n- is limited in scope, allowing to focus on the reliability of the core code,\n- does not use the stateful API as it is [error-prone](https://gitlab.com/ncbipy/entrezpy/-/issues/7) as seen on example of the alternative *entrezpy*.\n\n\n### Examples\n\n```python\nfrom easy_entrez import EntrezAPI\n\nentrez_api = EntrezAPI(\n 'your-tool-name',\n 'e@mail.com',\n # optional\n return_type='json'\n)\n\n# find up to 10 000 results for cancer in human\nresult = entrez_api.search('cancer AND human[organism]', max_results=10_000)\n\n# data will be populated with JSON or XML (depending on the `return_type` value)\nresult.data\n```\n\nSee more in the [Demo notebook](./Demo.ipynb) and [documentation](https://easy-entrez.readthedocs.io/en/latest).\n\nFor a real-world example (i.e. used for [this publication](https://www.frontiersin.org/articles/10.3389/fgene.2020.610798/full)) see notebooks in [multi-omics-state-of-the-field](https://github.com/krassowski/multi-omics-state-of-the-field) repository.\n\n#### Fetching genes for a variant from dbSNP\n\nFetch the SNP record for `rs6311`:\n\n```python\nrs6311 = entrez_api.fetch(['rs6311'], max_results=1, database='snp').data[0]\nrs6311\n```\n\nDisplay the result:\n\n```python\nfrom easy_entrez.parsing import xml_to_string\n\nprint(xml_to_string(rs6311))\n```\n\nFind the gene names for `rs6311`:\n\n```python\nnamespaces = {'ns0': 'https://www.ncbi.nlm.nih.gov/SNP/docsum'}\ngenes = [\n name.text\n for name in rs6311.findall('.//ns0:GENE_E/ns0:NAME', namespaces)\n]\nprint(genes)\n```\n\n> `['HTR2A']`\n\nFetch data for multiple variants at once:\n\n```python\nresult = entrez_api.fetch(['rs6311', 'rs662138'], max_results=10, database='snp')\ngene_names = {\n 'rs' + document_summary.get('uid'): [\n element.text\n for element in document_summary.findall('.//ns0:GENE_E/ns0:NAME', namespaces)\n ]\n for document_summary in result.data\n}\nprint(gene_names)\n```\n\n> `{'rs6311': ['HTR2A'], 'rs662138': ['SLC22A1']}`\n\n#### Obtaining the chromosomal position from SNP rsID number\n\n```python\nfrom pandas import DataFrame\n\nresult = entrez_api.fetch(['rs6311', 'rs662138'], max_results=10, database='snp')\n\nvariant_positions = DataFrame([\n {\n 'id': 'rs' + document_summary.get('uid'),\n 'chromosome': chromosome,\n 'position': position\n }\n for document_summary in result.data\n for chrom_and_position in document_summary.findall('.//ns0:CHRPOS', namespaces)\n for chromosome, position in [chrom_and_position.text.split(':')]\n])\n\nvariant_positions\n```\n\n> | | id | chromosome | position |\n> |---:|:---------|-------------:|-----------:|\n> | 0 | rs6311 | 13 | 46897343 |\n> | 1 | rs662138 | 6 | 160143444 |\n\n\n#### Converting full variation/mutation data to tabular format\n\nParsing utilities can quickly extract the data to a `VariantSet` object\nholding pandas `DataFrame`s with coordinates and alternative alleles frequencies:\n\n```python\nfrom easy_entrez.parsing import parse_dbsnp_variants\n\nvariants = parse_dbsnp_variants(result)\nvariants\n```\n\n> `<VariantSet with 2 variants>`\n\nTo get the coordinates:\n\n```python\nvariants.coordinates\n```\n\n> | rs_id | ref | alts | chrom | pos | chrom_prev | pos_prev | consequence |\n> |:---------|:------|:-------|--------:|----------:|-------------:|-----------:|:-----------------------------------------------------------------------------|\n>| rs6311 | C | A,T | 13 | 46897343 | 13 | 47471478 | upstream_transcript_variant,intron_variant,genic_upstream_transcript_variant |\n>| rs662138 | C | G | 6 | 160143444 | 6 | 160564476 | intron_variant |\n\nFor frequencies:\n\n```python\nvariants.alt_frequencies.head(5) # using head to only display first 5 for brevity\n```\n\n> | | rs_id | allele | source_frequency | total_count | study | count |\n> |---:|:--------|:---------|-------------------:|--------------:|:------------|----------:|\n> | 0 | rs6311 | T | 0.44349 | 2221 | 1000Genomes | 984.991 |\n> | 1 | rs6311 | T | 0.411261 | 1585 | ALSPAC | 651.849 |\n> | 2 | rs6311 | T | 0.331696 | 1486 | Estonian | 492.9 |\n> | 3 | rs6311 | T | 0.35 | 14 | GENOME_DK | 4.9 |\n> | 4 | rs6311 | T | 0.402529 | 56309 | GnomAD | 22666 |\n\n\n#### Obtaining the SNP rs ID number from chromosomal position\n\nYou can use the query string directly:\n\n```python\nresults = entrez_api.search(\n '13[CHROMOSOME] AND human[ORGANISM] AND 31873085[POSITION]',\n database='snp',\n max_results=10\n)\nprint(results.data['esearchresult']['idlist'])\n```\n\n> `['59296319', '17076752', '7336701', '4']`\n\nOr pass a dictionary (no validation of arguments is performed, `AND` conjunction is used):\n\n```python\nresults = entrez_api.search(\n dict(chromosome=13, organism='human', position=31873085),\n database='snp',\n max_results=10\n)\nprint(results.data['esearchresult']['idlist'])\n```\n\n> `['59296319', '17076752', '7336701', '4']`\n\nThe base position should use the latest genome assembly (GRCh38 at the time of writing);\nyou can use the position in previous assembly coordinates by replacing `POSITION` with `POSITION_GRCH37`.\nFor more information of the arguments accepted by the SNP database see the [entrez help page](https://www.ncbi.nlm.nih.gov/snp/docs/entrez_help/) on NCBI website.\n\n#### Obtaining amino acids change information for variants in given range\n\nFirst we search for dbSNP rs identifiers for variants in given region:\n\n```python\ndbsnp_ids = (\n entrez_api\n .search(\n '12[CHROMOSOME] AND human[ORGANISM] AND 21178600:21178720[POSITION]',\n database='snp',\n max_results=100\n )\n .data\n ['esearchresult']\n ['idlist']\n)\n```\n\nThen fetch the variant data for identifiers:\n\n```python\nvariant_data = entrez_api.fetch(\n ['rs' + rs_id for rs_id in dbsnp_ids],\n max_results=10,\n database='snp'\n)\n```\n\nAnd parse the data, extracting the HGVS out of summary:\n\n```python\nfrom easy_entrez.parsing import parse_dbsnp_variants\nfrom pandas import Series\n\n\ndef select_protein_hgvs(items):\n return [\n [sequence, hgvs]\n for entry in items\n for sequence, hgvs in [entry.split(':')]\n if hgvs.startswith('p.')\n ]\n\n\nprotein_hgvs = (\n parse_dbsnp_variants(variant_data)\n .summary\n .HGVS\n .apply(select_protein_hgvs)\n .explode()\n .dropna()\n .apply(Series)\n .rename(columns={0: 'sequence', 1: 'hgvs'})\n)\nprotein_hgvs.head()\n```\n\n> | rs_id | sequence | hgvs |\n> |:-------------|:------------|:------------|\n> | rs1940853486 | NP_006437.3 | p.Gly203Ter |\n> | rs1940853414 | NP_006437.3 | p.Glu202Gly |\n> | rs1940853378 | NP_006437.3 | p.Glu202Lys |\n> | rs1940853299 | NP_006437.3 | p.Lys201Thr |\n> | rs1940852987 | NP_006437.3 | p.Asp198Glu |\n\n#### Fetching more than 10 000 entries\n\nUse `in_batches_of` method to fetch more than 10k entries (e.g. `variant_ids`):\n\n```python\nsnps_result = (\n entrez.api\n .in_batches_of(1_000)\n .fetch(variant_ids, max_results=5_000, database='snp')\n)\n```\n\nThe result is a dictionary with keys being identifiers used in each batch (because the Entrez API does not always return the indentifiers back) and values representing the result. You can use `parse_dbsnp_variants` directly on this dictionary.\n\n#### Find PubMed ID from DOI\n\nWhen searching GWAS catalog PMID is needed over DOI. You can covert one to the other using:\n\n```python\ndef doi_term(doi: str) -> str:\n \"\"\"Prepare DOI for PubMed search\"\"\"\n doi = (\n doi\n .replace('http://', 'https://')\n .replace('https://doi.org/', '')\n )\n return f'\"{doi}\"[Publisher ID]'\n\n\nresult = entrez_api.search(\n doi_term('https://doi.org/10.3389/fcell.2021.626821'),\n database='pubmed',\n max_results=1\n)\nresult.data['esearchresult']['idlist']\n```\n\n> `['33834021']`\n\n### Installation\n\nRequires Python 3.6+. Install with:\n\n\n```bash\npip install easy-entrez\n```\n\nIf you wish to enable (optional, tqdm-based) progress bars use:\n\n```bash\npip install easy-entrez[with_progress_bars]\n```\n\nIf you wish to enable (optional, pandas-based) parsing utilities use:\n\n```bash\npip install easy-entrez[with_parsing_utils]\n```\n\n### Alternatives\n\nYou might want to try:\n\n- [biopython.Entrez](https://biopython.org/docs/1.74/api/Bio.Entrez.html) - biopython is a heavy dependency, but probably good choice if you already use it\n- [pubmedpy](https://github.com/dhimmel/pubmedpy) - provides interesting utilities for parsing the responses\n- [entrez](https://github.com/jordibc/entrez) - appears to have a comparable scope but quite different API\n- [entrezpy](https://gitlab.com/ncbipy/entrezpy) - this one did not work well for me (hence this package), but may have improved since\n",
"bugtrack_url": null,
"license": "MIT",
"summary": "Python REST API for Entrez E-Utilities: stateless, easy to use, reliable.",
"version": "0.3.7",
"project_urls": {
"Homepage": "https://github.com/krassowski/easy-entrez"
},
"split_keywords": [
"entrez",
"pubmed",
"e-utilities",
"ncbi",
"rest",
"api",
"dbsnp",
"literature",
"mining"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "a0c58e4f76cc61c397cc39c8a8c41f4fb302e158c2ae1295f053afe7ab519207",
"md5": "36357201609f1c52380fc6722f2d0a00",
"sha256": "ceab7173e114f629fa05779164fd057e7fe2410c53b4a0153a05d535244399cc"
},
"downloads": -1,
"filename": "easy_entrez-0.3.7-py3-none-any.whl",
"has_sig": false,
"md5_digest": "36357201609f1c52380fc6722f2d0a00",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 21588,
"upload_time": "2023-10-12T13:57:51",
"upload_time_iso_8601": "2023-10-12T13:57:51.822528Z",
"url": "https://files.pythonhosted.org/packages/a0/c5/8e4f76cc61c397cc39c8a8c41f4fb302e158c2ae1295f053afe7ab519207/easy_entrez-0.3.7-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "ba68933e2495789257a29c0245cf5bafcb4808a369e700b75515d90a0c138ce6",
"md5": "998c0485464c4a3c3e0d640d7ac37ed4",
"sha256": "50aa3434eeaaedb7cac2a9aeed1d6ef8387b24b3536759754bee53632d7fb2a4"
},
"downloads": -1,
"filename": "easy_entrez-0.3.7.tar.gz",
"has_sig": false,
"md5_digest": "998c0485464c4a3c3e0d640d7ac37ed4",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 29203,
"upload_time": "2023-10-12T13:57:53",
"upload_time_iso_8601": "2023-10-12T13:57:53.306404Z",
"url": "https://files.pythonhosted.org/packages/ba/68/933e2495789257a29c0245cf5bafcb4808a369e700b75515d90a0c138ce6/easy_entrez-0.3.7.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-10-12 13:57:53",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "krassowski",
"github_project": "easy-entrez",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "easy-entrez"
}