pubmed-api


| Field | Value |
|---|---|
| Name | pubmed-api |
| Version | 2.1.3 |
| Home page | https://github.com/shivam221098/pubmed-api-v1 |
| Summary | Runs PubMed search strings over pubmed API using a batch logic |
| Upload time | 2024-08-19 16:36:00 |
| Author | Shivam Singh |
| Maintainer | None |
| Docs URL | None |
| Requires Python | None |
| License | None |
| Requirements | No requirements were recorded. |
# Problem Statement

Following recent changes to PubMed API policy, a single search can no longer return more than 10k results, so the results for broad search strings are capped.

# Solution #1
1. `edirect` - a command-line utility that can fetch the PMIDs for a search string when its `esearch` and `efetch` commands are piped together.

# Solution #2
1. Run the PubMed API on the search string restricted to date ranges. However, sometimes even a single day has more than 10k PMIDs, so the cap can still apply.
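
To make the date-range idea concrete, here is a minimal sketch that only builds the request parameters for one query per day and sends nothing. The helper name `daily_windows` is hypothetical, and the parameter names (`datetype`, `mindate`, `maxdate`, `retmax`) are the ones documented for the E-utilities ESearch endpoint; treat the whole thing as an illustration, not as part of this package.

```python
from datetime import date, timedelta
from typing import Dict, Iterator


def daily_windows(term: str, first_day: date, last_day: date) -> Iterator[Dict[str, object]]:
    """Yield one ESearch parameter dict per day (hypothetical helper)."""
    day = first_day
    while day <= last_day:
        stamp = day.strftime("%Y/%m/%d")
        yield {
            "db": "pubmed",
            "term": term,
            "datetype": "pdat",  # filter on publication date
            "mindate": stamp,
            "maxdate": stamp,
            "retmax": 10000,
        }
        day += timedelta(days=1)


windows = list(daily_windows('"parkinson\'s disease"', date(2024, 1, 1), date(2024, 1, 3)))
```

Each window is still subject to the cap, which is why a single busy day can defeat this approach.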

# Solution #3

Instead of using `edirect` or date ranges, let's divide the results of a search string into ranges of PMIDs.

#### Algorithm
1. Run the original search string over the PubMed API using `sort=pub_date`. The response contains a batch of PMIDs and the total count of matching PMIDs in the PubMed database.
2. Because we used `sort=pub_date`, the results are sorted from most recent to oldest, so the most recent PMID is returned in the very first API call. Take the maximum PMID from this batch of up to 10K PMIDs, since it is the most recent one.
3. If the number of PMIDs returned (counted from the response) is at least the total count reported in the API response, e.g. `<Count>47851</Count>`, then nothing was capped and this API call already holds the required result.
4. If `Step 3` does not hold and the current call is the first API call, divide the maximum PMID from `Step 2` by `2`. This gives two halves: the first from `1` to `max_pmid / 2` and the second from `max_pmid / 2` to `max_pmid`.
5. If `Step 3` does not hold and the current call is the 2nd, 3rd or any later API call, divide `end` by `2` to make two halves: the first from `1` to `end / 2` and the second from `end / 2` to `end`.
6. Now, change the original search string as follows:

Suppose `max_pmid = 15000` or `end = 15000`. (`end` is `null` when the first API call starts. It is initialised with `max_pmid / 2` during the 1st API call and used in all subsequent recursive API calls.)

**NOTE: `max_pmid` is used only in the first API call. All subsequent calls just halve it again and again.**

Suppose our search string is `"human immunodeficiency virus (hiv)"`. The two new search strings will then be
`"human immunodeficiency virus (hiv)" AND 1:7500[UID]` and `"human immunodeficiency virus (hiv)" AND 7500:15000[UID]`.

For the 1st half, set `start` = `1` and `end` = `7500`.

For the 2nd half, set `start` = `7500` and `end` = `15000`.

Now run these two strings over the API and repeat `Step 3` for each new string.
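
The string rewriting in this example can be sketched in a few lines. The function name `halve_query` is hypothetical and is not part of the package's API; it only reproduces the two search strings shown above.

```python
from typing import Tuple


def halve_query(term: str, end: int) -> Tuple[str, str]:
    """Return the two range-restricted search strings for 1..end (hypothetical helper)."""
    mid = end // 2  # integer halving, so 15000 -> 7500
    first = f"{term} AND 1:{mid}[UID]"
    second = f"{term} AND {mid}:{end}[UID]"
    return first, second


first_half, second_half = halve_query('"human immunodeficiency virus (hiv)"', 15000)
print(first_half)
print(second_half)
```

As in the description above, the boundary PMID `7500` falls in both halves, so a caller collecting the results would want to de-duplicate them, for example by storing PMIDs in a set.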

This way we can grab all PMIDs through the API itself, without switching to another tool.
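
The completeness check from `Step 3` and the maximum PMID from `Step 2` can be sketched as follows. The XML below is a hand-written sample modelled on the `Count` and `IdList` elements of an ESearch response, not real output, and the function name `summarize` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written sample, shortened to three IDs for readability (an assumption,
# not a real API response).
SAMPLE_XML = """<eSearchResult>
  <Count>47851</Count>
  <RetMax>3</RetMax>
  <IdList>
    <Id>39000003</Id>
    <Id>39000002</Id>
    <Id>39000001</Id>
  </IdList>
</eSearchResult>"""


def summarize(xml_text: str) -> dict:
    """Extract the total count, the returned PMIDs and whether the batch is complete."""
    root = ET.fromstring(xml_text)
    total = int(root.findtext("Count"))
    pmids = [int(node.text) for node in root.iter("Id")]
    return {
        "total": total,
        "returned": len(pmids),
        "max_pmid": max(pmids) if pmids else None,
        "complete": len(pmids) >= total,  # Step 3: nothing was capped
    }


summary = summarize(SAMPLE_XML)
```

Here `summary["complete"]` is false because the response reports far more matches than it returned, which is the situation where the range has to be split.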

Notice that each set is divided into two halves recursively. This means that the PMIDs in a range `1` to `x` can be fetched in one API call once the range is small enough, and `x` can shrink all the way down to `1`. This is the same concept as date ranges, except that the sets are halved using the PMID itself. It avoids the problem of a single day having more than 10k PMIDs, and it's far simpler than using `edirect` on the terminal.
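
The recursion can be tried out without touching the network. The sketch below replaces the API with an in-memory fake that caps every response at `LIMIT` results. All names are hypothetical, and it splits each range at its midpoint into two non-overlapping halves, which is a simplification of the scheme described above rather than the package's actual implementation.

```python
from typing import Callable, Iterable, List, Set, Tuple

LIMIT = 5  # stand-in for PubMed's 10k cap

Search = Callable[[int, int], Tuple[int, List[int]]]


def make_fake_search(all_ids: Iterable[int], limit: int = LIMIT) -> Search:
    """Build a fake search that returns (total_count, capped newest-first IDs)."""
    ids = list(all_ids)

    def search(start: int, end: int) -> Tuple[int, List[int]]:
        hits = sorted((i for i in ids if start <= i <= end), reverse=True)
        return len(hits), hits[:limit]

    return search


def collect(search: Search, start: int, end: int) -> Set[int]:
    """Recursively halve the ID range until every sub-range fits under the cap."""
    total, ids = search(start, end)
    if len(ids) >= total or start >= end:
        return set(ids)
    mid = (start + end) // 2
    return collect(search, start, mid) | collect(search, mid + 1, end)


def fetch_all(search: Search, upper: int = 10**9) -> Set[int]:
    """First call finds the maximum ID; later calls only halve the range."""
    total, ids = search(1, upper)
    if len(ids) >= total:
        return set(ids)
    return collect(search, 1, max(ids))


fake_ids = list(range(3, 200, 7))
found = fetch_all(make_fake_search(fake_ids))
print(len(found) == len(fake_ids))
```

Because IDs are unique integers, a range holding a single ID can never exceed the cap, so the recursion always terminates.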

The above process takes only `41 Seconds` for `106K` PMIDs.

#### `edirect` itself uses the PubMed API to get its results, so how can it get past the cap if it is using the same API? 🤔
#### It actually uses the same concept described above to get the results from the API. 🎉


# Usage
### To get PMIDs for a search string
```python
from pubmed_api.pubmed import PubMedAPI


pa = PubMedAPI()  # instantiate once and reuse it for any number of search terms
# pa = PubMedAPI(10)  # pass any number; it tells the API to fetch only the last 10 years of PMIDs
# pa.__PUBMED_LIMIT__ = 100_000  # update if the PubMed limit changes

search_terms = [
    '"parkinson\'s disease"',
    '"human immunodeficiency virus (hiv)"'
]

for term in search_terms:
    result = pa.extract(term)  # result is a ResultSet object
    print(result.pmids)  # list of PMIDs (list)
    print(result.record_count)  # number of PMIDs (int)
```

            
