# PMCGrab - From PubMed Central ID to AI-Ready JSON in Seconds
[PyPI](https://pypi.org/project/pmcgrab/) · [Documentation](https://rajdeepmondaldotcom.github.io/pmcgrab/) · [CI](https://github.com/rajdeepmondaldotcom/pmcgrab/actions) · [License](https://github.com/rajdeepmondaldotcom/pmcgrab/blob/main/LICENSE)
Every AI workflow that touches biomedical literature hits the same wall:
1. **Download** PMC XML hoping it’s “structured.”
2. **Fight** nested tags, footnotes, figure refs, and half-broken links.
3. **Hope** your regex didn’t blow away the Methods section you actually need.
That wall steals hours from **RAG pipelines, knowledge-graph builds, LLM fine-tuning, and any other downstream AI task**.
**PMCGrab knocks it down.** Feed the tool a list of PMC IDs and get back clean, section-aware JSON you can drop straight into a vector DB or LLM prompt.
---
## The Hidden Cost of “I’ll Just Parse It Myself”
| Task | Manual / ad-hoc | **PMCGrab** |
| --------------------------- | ----------------------- | ------------------------------ |
| Install dependencies | 5–10 min | **≈ 2 s** (`uv add pmcgrab`) |
| Convert one article to JSON | 15–30 min | **≈ 3 s** |
| Capture every IMRaD section | Hope & regex | **98 % detection accuracy\*** |
| Parallel processing | Bash loops & temp files | `--workers N` flag |
| Edge-case maintenance | Yours forever | **200+ tests**, active upkeep |

\* _Evaluated on 7,500 PMC papers used in a disease-specific knowledge-graph pipeline._
At \$50/hour, hand-parsing 100 papers at 15–30 minutes each burns **\$1,000+**.
PMCGrab does the same job for \$0, within minutes, so you can focus on _using_ the information instead of extracting it.
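
The arithmetic behind that figure can be checked in a few lines. This is only a back-of-the-envelope sketch: `manual_cost` is a hypothetical helper, and the inputs are the 15–30 minutes per article and \$50/hour quoted above.

```python
def manual_cost(papers: int, minutes_per_paper: float, hourly_rate: float) -> float:
    """Labor cost of hand-parsing `papers` articles at a fixed hourly rate."""
    hours = papers * minutes_per_paper / 60
    return hours * hourly_rate


# 100 papers at the low and high ends of the manual estimate.
low = manual_cost(100, 15, 50)   # 25 hours of work
high = manual_cost(100, 30, 50)  # 50 hours of work
print(f"${low:,.0f} to ${high:,.0f}")  # prints "$1,250 to $2,500"
```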
---
## Quick Install
Install via **uv** (make sure `uv` itself is up to date first):
```bash
uv add pmcgrab
```
Python ≥ 3.10 required.
---
## Ways to Use
### 1 · Python API
```python
from pmcgrab.application.processing import process_single_pmc

# Fetch and parse one article; the ID is the numeric part of the PMC ID, as a string.
article = process_single_pmc("7114487")
print(article)
```
(Use the numeric part of the PMC ID only: `7114487`, not `PMC7114487`.)
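
If your IDs arrive in mixed form, a small normalizer keeps that rule in one place. This is a sketch, not part of PMCGrab: `numeric_pmc_id` is a hypothetical helper, and it assumes an ID is an optional `PMC` prefix followed by digits.

```python
import re


def numeric_pmc_id(raw: str) -> str:
    """Return the digits of a PMC ID, dropping an optional 'PMC' prefix."""
    match = re.fullmatch(r"\s*(?:PMC)?(\d+)\s*", raw, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"not a PMC ID: {raw!r}")
    return match.group(1)


print(numeric_pmc_id("PMC7114487"))  # prints "7114487"
```

The result can then be passed to `process_single_pmc` as shown above.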
---
## Output Example
```json
{
"pmc_id": "7114487",
"title": "Machine learning approaches in cancer research",
"abstract": "…",
"body": {
"Introduction": "…",
"Methods": "…",
"Results": "…",
"Discussion": "…"
},
"authors": [...],
"journal": "Nature Medicine"
}
```
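
To show how that shape maps onto a vector store, here is a minimal sketch that splits each section into size-capped chunks tagged with their section name. It assumes the output is a plain dict laid out like the example above, with each `body` value being a string; `section_chunks` and `max_chars` are hypothetical names, not PMCGrab API.

```python
from typing import Iterator


def section_chunks(article: dict, max_chars: int = 1000) -> Iterator[dict]:
    """Yield {'pmc_id', 'section', 'text'} records, packing paragraphs up to max_chars."""
    for section, text in article.get("body", {}).items():
        paragraphs = [p.strip() for p in text.split("\n") if p.strip()]
        buffer = ""
        for para in paragraphs:
            # Start a new chunk when adding this paragraph would exceed the cap.
            if buffer and len(buffer) + len(para) + 1 > max_chars:
                yield {"pmc_id": article["pmc_id"], "section": section, "text": buffer}
                buffer = para
            else:
                buffer = f"{buffer}\n{para}" if buffer else para
        if buffer:
            yield {"pmc_id": article["pmc_id"], "section": section, "text": buffer}


# Stand-in data in the documented shape (assumed, not fetched from PMC).
example = {"pmc_id": "7114487", "body": {"Methods": "Step one.\nStep two."}}
for chunk in section_chunks(example, max_chars=12):
    print(chunk["section"], "|", chunk["text"])
```

Keeping the section name on every record lets you filter retrieval by section (for example, Methods only) before ranking.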
---
## Context Engineering: Why This Matters for LLMs
Large-language-model performance lives or dies on **context quality**: the snippets you retrieve and feed back into the model.
- **RAG pipelines** need precise, de-duplicated passages to ground answers.
- **Knowledge-graph population** demands reliable section boundaries (e.g., Methods vs. Results) to classify triples accurately.
- **Fine-tuning & few-shot prompting** work best with noise-free, domain-specific examples.
PMCGrab _is_ a context-engineering tool: it converts messy XML into **clean, section-aware, UTF-8 JSON** that slots directly into embeddings, vector stores, or prompt templates. No preprocessing gymnastics, no guessing where the Methods section starts, no hallucinations from half-garbled text. Better input → better retrieval → better answers.
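
As one illustration of the prompt-template case, the sketch below pulls only the requested sections into a prompt under a character budget. It assumes the dict layout from the Output Example; `build_prompt`, `sections`, and `budget` are hypothetical names chosen for this example.

```python
def build_prompt(
    article: dict,
    question: str,
    sections: tuple[str, ...] = ("Methods", "Results"),
    budget: int = 4000,
) -> str:
    """Assemble a prompt from selected sections, truncated to `budget` characters."""
    parts = []
    remaining = budget
    for name in sections:
        text = article.get("body", {}).get(name, "")
        if not text or remaining <= 0:
            continue  # skip missing sections and stop adding once the budget is spent
        snippet = text[:remaining]
        remaining -= len(snippet)
        parts.append(f"## {name}\n{snippet}")
    context = "\n\n".join(parts)
    header = f"Source: PMC{article['pmc_id']} - {article.get('title', '')}"
    return f"{header}\n\n{context}\n\nQuestion: {question}"


# Stand-in data in the documented shape (assumed, not fetched from PMC).
demo = {
    "pmc_id": "7114487",
    "title": "Machine learning approaches in cancer research",
    "body": {"Introduction": "Background.", "Methods": "We trained a model.", "Results": "It worked."},
}
print(build_prompt(demo, "Which model was trained?"))
```

A character budget is a crude stand-in for a token budget; swap in your tokenizer's count if you need exact limits.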
---
## Why PMCGrab Beats Home-Grown Scripts
1. **Section-Aware Parsing**
Detects IMRaD plus custom subsections like _Statistical Analysis_, which is crucial for accurate retrieval scoring.
2. **Resilient XML Cleaning**
Removes cross-refs and figure stubs without dropping scientific content, preserving token-level fidelity for embeddings.
3. **True Concurrency**
`--workers` fans out across all CPU cores; automatic email rotation respects NCBI rate limits so large harvests don’t get throttled.
4. **Modern Python Stack**
Type-safe (`mypy`), linted (`ruff`), CI-checked on Ubuntu, macOS, and Windows.
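
The fan-out idea in item 3 can be approximated from the Python API with a thread pool. This is a sketch under assumptions, not how `--workers` is implemented: `fetch_many` is a hypothetical helper, it takes the fetch function as a parameter (you could pass `process_single_pmc`), and it does no email rotation or rate limiting of its own.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable


def fetch_many(
    pmc_ids: list[str],
    fetch: Callable[[str], dict],
    workers: int = 4,
) -> tuple[dict[str, dict], dict[str, Exception]]:
    """Run `fetch` over IDs in parallel; return (results, failures) keyed by ID."""
    results: dict[str, dict] = {}
    failures: dict[str, Exception] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(fetch, pid): pid for pid in pmc_ids}
        for future in as_completed(futures):
            pid = futures[future]
            try:
                results[pid] = future.result()
            except Exception as exc:  # keep going; one bad article should not stop the batch
                failures[pid] = exc
    return results, failures


def fake_fetch(pid: str) -> dict:
    """Stand-in for a real fetcher so the example runs offline."""
    if not pid.isdigit():
        raise ValueError(f"bad id: {pid}")
    return {"pmc_id": pid}


ok, failed = fetch_many(["7114487", "oops"], fake_fetch, workers=2)
print(sorted(ok), sorted(failed))
```

Threads suit this workload because fetching is network-bound; keep `workers` modest when calling a real fetcher so you stay within NCBI's request limits.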
---
## Proof at a Glance
| Metric | Value |
| --------------------------- | ------------------ |
| Unit tests | **218** |
| Branch coverage | **95 %** |
| Section detection accuracy | **98 %** |
| Median parse time / article | **3.1 s** |
| Largest batch processed | **7,500 articles** |
---
## Promise to You
If PMCGrab doesn’t save you hours on day one, delete it, no questions asked.
Once you see clean JSON in seconds, you’ll never fight PMC XML again.
---
## Install Now & Ship Real Results
```bash
uv add pmcgrab
```
Stop paying the **XML tax**. Start engineering context and building AI products that matter.
Raw data
{
"_id": null,
"home_page": null,
"name": "pmcgrab",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.10",
"maintainer_email": "Rajdeep Mondal <rajdeep@rajdeepmondal.com>",
"keywords": "ai, bioinformatics, entrez, fast, llm, modern, ncbi, pmc, pubmed, rag, research papers, retrieval augmented generation, scientific literature, text mining, uv, xml parsing",
"author": null,
"author_email": "Rajdeep Mondal <rajdeep@rajdeepmondal.com>",
"download_url": "https://files.pythonhosted.org/packages/22/53/1826c19b5368fb61986ae77930002d177f84f8777e2556fece7a9cb5c7fb/pmcgrab-0.5.8.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": null,
"summary": "AI-ready retrieval and parsing of PubMed Central articles for RAG applications. Install with uv for best performance.",
"version": "0.5.8",
"project_urls": {
"Bug Reports": "https://github.com/rajdeepmondaldotcom/pmcgrab/issues",
"Changelog": "https://github.com/rajdeepmondaldotcom/pmcgrab/releases",
"Documentation": "https://github.com/rajdeepmondaldotcom/pmcgrab#readme",
"Download": "https://pypi.org/project/pmcgrab/",
"Homepage": "https://github.com/rajdeepmondaldotcom/pmcgrab",
"Issues": "https://github.com/rajdeepmondaldotcom/pmcgrab/issues",
"Repository": "https://github.com/rajdeepmondaldotcom/pmcgrab.git",
"Source Code": "https://github.com/rajdeepmondaldotcom/pmcgrab"
},
"split_keywords": [
"ai",
" bioinformatics",
" entrez",
" fast",
" llm",
" modern",
" ncbi",
" pmc",
" pubmed",
" rag",
" research papers",
" retrieval augmented generation",
" scientific literature",
" text mining",
" uv",
" xml parsing"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "fe24a60b244656f125b26b0d9c9563e03f279006f235f4c6a5c410d0fb2a2302",
"md5": "e6147e81311ca382059f11c95c675ac8",
"sha256": "42cd1788bc82804667d63d8d9e5667b67ba27cf39c34f6b106b61f69826de620"
},
"downloads": -1,
"filename": "pmcgrab-0.5.8-py3-none-any.whl",
"has_sig": false,
"md5_digest": "e6147e81311ca382059f11c95c675ac8",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.10",
"size": 109322,
"upload_time": "2025-08-05T18:24:46",
"upload_time_iso_8601": "2025-08-05T18:24:46.085886Z",
"url": "https://files.pythonhosted.org/packages/fe/24/a60b244656f125b26b0d9c9563e03f279006f235f4c6a5c410d0fb2a2302/pmcgrab-0.5.8-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "22531826c19b5368fb61986ae77930002d177f84f8777e2556fece7a9cb5c7fb",
"md5": "9268b86631a63a56b687bf4dad003cd6",
"sha256": "551a9ecbd2f6b73a27217fdaf84ccaec6ac0288107c4b06744f077848dc334ee"
},
"downloads": -1,
"filename": "pmcgrab-0.5.8.tar.gz",
"has_sig": false,
"md5_digest": "9268b86631a63a56b687bf4dad003cd6",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.10",
"size": 107842,
"upload_time": "2025-08-05T18:24:47",
"upload_time_iso_8601": "2025-08-05T18:24:47.566985Z",
"url": "https://files.pythonhosted.org/packages/22/53/1826c19b5368fb61986ae77930002d177f84f8777e2556fece7a9cb5c7fb/pmcgrab-0.5.8.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-08-05 18:24:47",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "rajdeepmondaldotcom",
"github_project": "pmcgrab",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "pmcgrab"
}