# PMCGrab — From PubMed Central ID to AI-Ready JSON in Seconds
Every AI workflow that touches biomedical literature hits the same wall:
1. **Download** PMC XML hoping it’s “structured.”
2. **Fight** nested tags, footnotes, figure refs, and half-broken links.
3. **Hope** your regex didn’t blow away the Methods section you actually need.
That wall steals hours from **RAG pipelines, knowledge-graph builds, LLM fine-tuning—any downstream AI task**.
**PMCGrab knocks it down.** Feed the tool a list of PMC IDs and get back clean, section-aware JSON you can drop straight into a vector DB or LLM prompt.
---
## The Hidden Cost of “I’ll Just Parse It Myself”
| Task | Manual / ad-hoc | **PMCGrab** |
| --------------------------- | ----------------------- | ------------------------------ |
| Install dependencies | 5–10 min | **≈ 2 s** (`uv add pmcgrab`) |
| Convert one article to JSON | 15–30 min | **≈ 3 s** |
| Capture every IMRaD section | Hope & regex | **98 % detection accuracy\*** |
| Parallel processing | Bash loops & temp files | `--workers N` flag |
| Edge-case maintenance | Yours forever | **200+ tests**, active upkeep |
\* **_Evaluated on 7,500 PMC papers used in a disease-specific knowledge-graph pipeline._**
At \$50/hour and 15–30 minutes per article, hand-parsing 100 papers burns **\$1,250–\$2,500** in labor.
At roughly 3 s per article, PMCGrab gets through the same 100 papers in about five minutes for \$0, so you can focus on _using_ the information instead of extracting it.
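The cost estimate follows directly from the per-article figures in the table. As a rough sketch (the function and constant names are illustrative, and the 3 s figure is the table's approximation, not a guarantee), the arithmetic looks like this:

```python
# Back-of-the-envelope cost of hand-parsing, using the figures from the table.
HOURLY_RATE_USD = 50
PAPERS = 100
MANUAL_MINUTES = (15, 30)  # per-article range for manual conversion
PMCGRAB_SECONDS = 3        # approximate per-article time quoted for PMCGrab


def manual_cost_usd(papers: int, minutes_per_paper: float, rate: float) -> float:
    """Labor cost of converting `papers` articles by hand."""
    return papers * minutes_per_paper / 60 * rate


low, high = (manual_cost_usd(PAPERS, m, HOURLY_RATE_USD) for m in MANUAL_MINUTES)
pmcgrab_minutes = PAPERS * PMCGRAB_SECONDS / 60

print(f"manual:  ${low:,.0f} to ${high:,.0f}")
print(f"pmcgrab: about {pmcgrab_minutes:.0f} min of sequential run time")
```

Running with several workers would shorten the wall-clock time further, but the sequential figure is already enough to make the comparison.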
---
## Quick Install
```bash
uv add pmcgrab        # fastest; run inside a uv-managed project
pip install pmcgrab   # alternative with plain pip
```
Python ≥ 3.10 required.
---
## Two Ways to Use
### 1 · Python API
```python
from pmcgrab.application.processing import process_single_pmc
article = process_single_pmc("7114487")
print(article)
```
### 2 · Command Line
```bash
uv run python -m pmcgrab --pmcids 7114487 3084273 --workers 4
# → writes pmc_output/PMC7114487.json, PMC3084273.json
```
(Pass only the numeric part of each PMC ID: `7114487`, not `PMC7114487`.)
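If your ID list comes from a spreadsheet or a citation manager, some entries will probably carry the `PMC` prefix. A small helper can strip it before the IDs reach PMCGrab. This is a sketch, not part of the PMCGrab API; `numeric_pmcid` is a hypothetical name.

```python
import re

# Optional "PMC" prefix (any case), then digits, with surrounding whitespace allowed.
_PMCID_RE = re.compile(r"^\s*(?:PMC)?(\d+)\s*$", re.IGNORECASE)


def numeric_pmcid(raw: str) -> str:
    """Return the numeric part of a PMC ID, or raise ValueError if it is malformed."""
    match = _PMCID_RE.match(raw)
    if match is None:
        raise ValueError(f"not a PMC ID: {raw!r}")
    return match.group(1)


ids = [numeric_pmcid(x) for x in ["PMC7114487", "pmc3084273", " 7114487 "]]
print(ids)
```

Failing loudly on malformed input is a deliberate choice here: a silently skipped ID is harder to notice in a large batch than an exception at the start.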
---
## Output Example
```json
{
"pmc_id": "7114487",
"title": "Machine learning approaches in cancer research",
"abstract": "…",
"body": {
"Introduction": "…",
"Methods": "…",
"Results": "…",
"Discussion": "…"
},
"authors": [...],
"journal": "Nature Medicine"
}
```
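Because each section arrives under its own key, splitting an article into retrieval-sized chunks takes only a few lines. The sketch below assumes the shape shown above (`body` maps section names to plain text); the `section_chunks` name, the record fields, and the 1,000-character limit are illustrative choices, not PMCGrab features.

```python
from typing import Iterator


def section_chunks(article: dict, max_chars: int = 1000) -> Iterator[dict]:
    """Yield chunk records tagged with the PMC ID and section name.

    Paragraphs are packed greedily up to max_chars; a single paragraph longer
    than max_chars is emitted whole rather than split mid-sentence.
    """
    pmc_id = article["pmc_id"]
    for section, text in article.get("body", {}).items():
        paragraphs = [p.strip() for p in text.split("\n") if p.strip()]
        buffer, index = "", 0
        for para in paragraphs:
            # Flush the buffer when adding this paragraph would overflow it.
            if buffer and len(buffer) + 1 + len(para) > max_chars:
                yield {"id": f"{pmc_id}:{section}:{index}", "section": section, "text": buffer}
                buffer, index = "", index + 1
            buffer = f"{buffer}\n{para}" if buffer else para
        if buffer:
            yield {"id": f"{pmc_id}:{section}:{index}", "section": section, "text": buffer}


# Tiny stand-in article in the same shape as the example output.
sample = {
    "pmc_id": "7114487",
    "body": {
        "Methods": "First paragraph.\nSecond paragraph.",
        "Results": "One finding.",
    },
}
chunks = list(section_chunks(sample, max_chars=20))
for chunk in chunks:
    print(chunk["id"], "|", chunk["text"])
```

Keeping the section name on every record lets a retriever filter or re-weight by section (for example, preferring Methods for protocol questions) without re-parsing anything.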
---
## Context Engineering: Why This Matters for LLMs
Large-language-model performance lives or dies on **context quality**—the snippets you retrieve and feed back into the model:
- **RAG pipelines** need precise, de-duplicated passages to ground answers.
- **Knowledge-graph population** demands reliable section boundaries (e.g., Methods vs. Results) to classify triples accurately.
- **Fine-tuning & few-shot prompting** work best with noise-free, domain-specific examples.
PMCGrab _is_ a context-engineering tool: it converts messy XML into **clean, section-aware, UTF-8 JSON** that slots directly into embedding pipelines, vector stores, or prompt templates. No preprocessing gymnastics, no guessing where the Methods section starts, and far less half-garbled text for the model to hallucinate from. Better input → better retrieval → better answers.
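To make the prompt-template case concrete, here is a minimal sketch that assembles a grounded prompt from selected sections. It assumes the JSON shape from the output example; `build_prompt`, the default section choice, and the template wording are all hypothetical.

```python
def build_prompt(
    article: dict,
    question: str,
    sections: tuple[str, ...] = ("Methods", "Results"),
) -> str:
    """Assemble a prompt that contains only the requested sections."""
    body = article.get("body", {})
    # Skip sections the article does not have instead of failing.
    parts = [f"## {name}\n{body[name]}" for name in sections if name in body]
    context = "\n\n".join(parts)
    return (
        f"Source: PMC{article['pmc_id']} ({article.get('title', 'untitled')})\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer using only the text above."
    )


sample = {
    "pmc_id": "7114487",
    "title": "Machine learning approaches in cancer research",
    "body": {
        "Introduction": "Background text.",
        "Methods": "Methods text.",
        "Results": "Results text.",
    },
}
prompt = build_prompt(sample, "Which models were compared?")
print(prompt)
```

Selecting sections by name is only possible because the boundaries are already explicit in the JSON; with raw XML the same step would need its own parser.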
---
## Why PMCGrab Beats Home-Grown Scripts
1. **Section-Aware Parsing**
Detects IMRaD plus custom subsections like _Statistical Analysis_—crucial for accurate retrieval scoring.
2. **Resilient XML Cleaning**
Removes cross-refs and figure stubs without dropping scientific content, preserving token-level fidelity for embeddings.
3. **True Concurrency**
`--workers` fans out across all CPU cores; automatic email rotation respects NCBI rate limits so large harvests don’t get throttled.
4. **Modern Python Stack**
Type-safe (`mypy`), linted (`ruff`), CI-checked on Ubuntu, macOS, and Windows.
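The fan-out and email rotation described in point 3 can be pictured with a short standard-library sketch. This is not PMCGrab's implementation: `fetch_article` is a stand-in that does no network I/O, the addresses are placeholders, the third ID is a dummy, and a thread pool is used here purely for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import cycle

# Hypothetical contact addresses to rotate through, one per request.
EMAILS = ["a@example.org", "b@example.org"]


def fetch_article(pmcid: str, email: str) -> dict:
    """Stand-in for a network fetch plus parse; returns a tiny record."""
    return {"pmc_id": pmcid, "requested_as": email}


def fetch_all(pmcids: list[str], workers: int = 4) -> list[dict]:
    # Pair each ID with the next address in round-robin order before fanning out.
    jobs = list(zip(pmcids, cycle(EMAILS)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order even if calls finish out of order.
        return list(pool.map(lambda job: fetch_article(*job), jobs))


results = fetch_all(["7114487", "3084273", "0000001"])  # last ID is a placeholder
for record in results:
    print(record)
```

Assigning addresses before the pool starts keeps the rotation deterministic, which makes a batch easier to debug than sharing one iterator across threads.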
---
## Proof at a Glance
| Metric | Value |
| --------------------------- | ------------------ |
| Unit tests | **218** |
| Branch coverage | **95 %** |
| Section detection accuracy | **98 %** |
| Median parse time / article | **3.1 s** |
| Largest batch processed | **7,500 articles** |
---
## Promise to You
If PMCGrab doesn’t save you hours on day one, delete it—no questions asked.
Once you see clean JSON in seconds, you’ll never fight PMC XML again.
---
## Install Now & Ship Real Results
```bash
uv add pmcgrab
```
Stop paying the **XML tax**. Start engineering context—and building AI products that matter.
## Raw data
{
"_id": null,
"home_page": null,
"name": "pmcgrab",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.10",
"maintainer_email": "Rajdeep Mondal <rajdeep@rajdeepmondal.com>",
"keywords": "ai, bioinformatics, entrez, fast, llm, modern, ncbi, pmc, pubmed, rag, research papers, retrieval augmented generation, scientific literature, text mining, uv, xml parsing",
"author": null,
"author_email": "Rajdeep Mondal <rajdeep@rajdeepmondal.com>",
"download_url": "https://files.pythonhosted.org/packages/dd/d7/76c5bf9fe62fad9f794df6a1639b0b3e0fcfca76b5ac6bcf65ba58b77ff6/pmcgrab-0.5.4.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": null,
"summary": "AI-ready retrieval and parsing of PubMed Central articles for RAG applications. Install with uv for best performance.",
"version": "0.5.4",
"project_urls": {
"Bug Reports": "https://github.com/rajdeepmondaldotcom/pmcgrab/issues",
"Changelog": "https://github.com/rajdeepmondaldotcom/pmcgrab/releases",
"Documentation": "https://github.com/rajdeepmondaldotcom/pmcgrab#readme",
"Download": "https://pypi.org/project/pmcgrab/",
"Homepage": "https://github.com/rajdeepmondaldotcom/pmcgrab",
"Issues": "https://github.com/rajdeepmondaldotcom/pmcgrab/issues",
"Repository": "https://github.com/rajdeepmondaldotcom/pmcgrab.git",
"Source Code": "https://github.com/rajdeepmondaldotcom/pmcgrab"
},
"split_keywords": [
"ai",
" bioinformatics",
" entrez",
" fast",
" llm",
" modern",
" ncbi",
" pmc",
" pubmed",
" rag",
" research papers",
" retrieval augmented generation",
" scientific literature",
" text mining",
" uv",
" xml parsing"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "abd82c6b131ffaea4ff426feb64cf5ed9e5a8c6ddfadec674e02fbb1f555c2fa",
"md5": "af09a92bbbf726e93fd65509ce9f4a7c",
"sha256": "a70ad723a546df6c2bd07a5b4b11de0b6e68d772f73f9e5b29871186be603a5f"
},
"downloads": -1,
"filename": "pmcgrab-0.5.4-py3-none-any.whl",
"has_sig": false,
"md5_digest": "af09a92bbbf726e93fd65509ce9f4a7c",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.10",
"size": 109352,
"upload_time": "2025-08-01T07:43:53",
"upload_time_iso_8601": "2025-08-01T07:43:53.417966Z",
"url": "https://files.pythonhosted.org/packages/ab/d8/2c6b131ffaea4ff426feb64cf5ed9e5a8c6ddfadec674e02fbb1f555c2fa/pmcgrab-0.5.4-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "ddd776c5bf9fe62fad9f794df6a1639b0b3e0fcfca76b5ac6bcf65ba58b77ff6",
"md5": "295108f64ba390805013c837c50baee2",
"sha256": "f1b4027cb51f91aeaff67971f8a10b432022bdee4e73787047fa17959521216a"
},
"downloads": -1,
"filename": "pmcgrab-0.5.4.tar.gz",
"has_sig": false,
"md5_digest": "295108f64ba390805013c837c50baee2",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.10",
"size": 105303,
"upload_time": "2025-08-01T07:43:54",
"upload_time_iso_8601": "2025-08-01T07:43:54.449761Z",
"url": "https://files.pythonhosted.org/packages/dd/d7/76c5bf9fe62fad9f794df6a1639b0b3e0fcfca76b5ac6bcf65ba58b77ff6/pmcgrab-0.5.4.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-08-01 07:43:54",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "rajdeepmondaldotcom",
"github_project": "pmcgrab",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "pmcgrab"
}