pmcgrab

Name	pmcgrab
Version	0.5.8
home_page	None
Summary	AI-ready retrieval and parsing of PubMed Central articles for RAG applications. Install with uv for best performance.
upload_time	2025-08-05 18:24:47
maintainer	None
docs_url	None
author	None
requires_python	>=3.10
license	None
keywords	ai bioinformatics entrez fast llm modern ncbi pmc pubmed rag research papers retrieval augmented generation scientific literature text mining uv xml parsing

# PMCGrab - From PubMed Central ID to AI-Ready JSON in Seconds

[![PyPI](https://img.shields.io/pypi/v/pmcgrab.svg)](https://pypi.org/project/pmcgrab/) [![Python](https://img.shields.io/pypi/pyversions/pmcgrab.svg)](https://pypi.org/project/pmcgrab/) [![Docs](https://img.shields.io/badge/docs-mkdocs-blue.svg)](https://rajdeepmondaldotcom.github.io/pmcgrab/) [![CI](https://github.com/rajdeepmondaldotcom/pmcgrab/workflows/CI/badge.svg)](https://github.com/rajdeepmondaldotcom/pmcgrab/actions) [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://github.com/rajdeepmondaldotcom/pmcgrab/blob/main/LICENSE)

Every AI workflow that touches biomedical literature hits the same wall:

1. **Download** PMC XML hoping it’s “structured.”
2. **Fight** nested tags, footnotes, figure refs, and half-broken links.
3. **Hope** your regex didn’t blow away the Methods section you actually need.

That wall steals hours from **RAG pipelines, knowledge-graph builds, LLM fine-tuning, and any other downstream AI task**.
**PMCGrab knocks it down.** Feed the tool a list of PMC IDs and get back clean, section-aware JSON you can drop straight into a vector DB or LLM prompt.

---

## The Hidden Cost of “I’ll Just Parse It Myself”

| Task                        | Manual / ad-hoc         | **PMCGrab**                    |
| --------------------------- | ----------------------- | ------------------------------ |
| Install dependencies        | 5–10 min                | **≈ 2 s** (`uv add pmcgrab`)   |
| Convert one article to JSON | 15–30 min               | **≈ 3 s**                      |
| Capture every IMRaD section | Hope & regex            | **98 % detection accuracy\***  |
| Parallel processing         | Bash loops & temp files | `--workers N` flag             |
| Edge-case maintenance       | Yours forever           | **200+ tests**, active upkeep  |

\* _Evaluated on 7,500 PMC papers used in a disease-specific knowledge-graph pipeline._

At \$50/hour, hand-parsing 100 papers burns **\$1,000+**.
PMCGrab does the same job for \$0 within minutes, so you can focus on _using_ the information instead of extracting it.
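
The dollar figure follows from the table's own numbers. As a rough check, the sketch below multiplies the quoted hourly rate by the 15 to 30 minutes per article listed for manual conversion; the function name is illustrative only.

```python
# Back-of-the-envelope cost of hand-parsing, using the figures quoted above.
HOURLY_RATE_USD = 50          # "$50/hour"
PAPERS = 100                  # "100 papers"
MINUTES_PER_PAPER = (15, 30)  # 15 to 30 min per article, from the table


def manual_cost_usd(papers: int, minutes_per_paper: float, hourly_rate: float) -> float:
    """Cost of converting `papers` articles by hand at a given pace."""
    return papers * minutes_per_paper / 60 * hourly_rate


low, high = (manual_cost_usd(PAPERS, m, HOURLY_RATE_USD) for m in MINUTES_PER_PAPER)
print(f"${low:,.0f} to ${high:,.0f}")
```

That works out to roughly \$1,250 to \$2,500 for 100 papers, consistent with the "\$1,000+" estimate.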

---

## Quick Install

Install via **uv** (make sure `uv` itself is up to date first):

```bash
uv add pmcgrab
```

Python ≥ 3.10 required.

---

## Ways to Use

### 1 · Python API

```python
from pmcgrab.application.processing import process_single_pmc

article = process_single_pmc("7114487")
print(article)
```

(Use the numeric part of the PMC ID only.)
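
If your IDs arrive with the `PMC` prefix, a small helper can strip it before the call. This is a minimal sketch: `normalize_pmc_id` is a hypothetical helper written for this example, not part of PMCGrab.

```python
import re


def normalize_pmc_id(raw: str) -> str:
    """Return the numeric part of a PMC identifier ('PMC7114487' -> '7114487').

    Hypothetical helper for illustration; PMCGrab does not ship it.
    """
    match = re.fullmatch(r"\s*(?:PMC)?(\d+)\s*", raw, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"not a PMC ID: {raw!r}")
    return match.group(1)


print(normalize_pmc_id("PMC7114487"))
```

The result can then be passed on, for example `process_single_pmc(normalize_pmc_id("PMC7114487"))`.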

---

## Output Example

```json
{
  "pmc_id": "7114487",
  "title": "Machine learning approaches in cancer research",
  "abstract": "…",
  "body": {
    "Introduction": "…",
    "Methods": "…",
    "Results": "…",
    "Discussion": "…"
  },
  "authors": [...],
  "journal": "Nature Medicine"
}
```
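
To load output like this into a vector DB, one option is to emit one record per body section. The sketch below assumes the article is available as a plain `dict` shaped like the example above (the actual return type of `process_single_pmc` may differ); `article_to_chunks` and the record fields are illustrative names, not part of PMCGrab.

```python
from typing import Any


def article_to_chunks(article: dict[str, Any]) -> list[dict[str, str]]:
    """Turn one article dict into one record per non-empty body section.

    Assumes the keys shown in the output example: pmc_id, title, body.
    """
    chunks = []
    for section, text in article.get("body", {}).items():
        if not isinstance(text, str) or not text.strip():
            continue  # skip empty or non-text sections
        chunks.append({
            "id": f"{article['pmc_id']}:{section}",
            "pmc_id": article["pmc_id"],
            "title": article.get("title", ""),
            "section": section,
            "text": text.strip(),
        })
    return chunks


example = {
    "pmc_id": "7114487",
    "title": "Machine learning approaches in cancer research",
    "body": {"Introduction": "Intro text.", "Methods": "Methods text.", "Results": ""},
}
for chunk in article_to_chunks(example):
    print(chunk["id"], "->", chunk["text"])
```

Keeping the section name on every record lets you filter retrieval by section (for example, Methods only) at query time.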

---

## Context Engineering: Why This Matters for LLMs

Large-language-model performance lives or dies on **context quality**, meaning the snippets you retrieve and feed back into the model:

- **RAG pipelines** need precise, de-duplicated passages to ground answers.
- **Knowledge-graph population** demands reliable section boundaries (e.g., Methods vs. Results) to classify triples accurately.
- **Fine-tuning & few-shot prompting** work best with noise-free, domain-specific examples.

PMCGrab _is_ a context-engineering tool: it converts messy XML into **clean, section-aware, UTF-8 JSON** that slots directly into embeddings, vector stores, or prompt templates. No preprocessing gymnastics, no guessing where the Methods section starts, and less risk of hallucinations caused by half-garbled text. Better input → better retrieval → better answers.
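
As an illustration of the de-duplication step mentioned above, here is a minimal sketch that drops exact repeats after whitespace and case normalization. It is a generic technique applied downstream of the JSON, not something PMCGrab is documented to do; `dedupe_passages` is a hypothetical name.

```python
import hashlib
import re


def dedupe_passages(passages: list[str]) -> list[str]:
    """Keep the first occurrence of each passage, ignoring case and spacing."""
    seen = set()
    unique = []
    for passage in passages:
        normalized = re.sub(r"\s+", " ", passage).strip().lower()
        if not normalized:
            continue  # drop empty passages
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(passage.strip())
    return unique


print(dedupe_passages(["Cells were  cultured.", "cells were cultured.", "New result."]))
```

Near-duplicates with different wording would need a fuzzier comparison, such as embedding similarity, which is outside the scope of this sketch.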

---

## Why PMCGrab Beats Home-Grown Scripts

1. **Section-Aware Parsing**
   Detects IMRaD plus custom subsections like _Statistical Analysis_, which is crucial for accurate retrieval scoring.

2. **Resilient XML Cleaning**
   Removes cross-refs and figure stubs without dropping scientific content, preserving token-level fidelity for embeddings.

3. **True Concurrency**
   `--workers` fans out across all CPU cores; automatic email rotation respects NCBI rate limits so large harvests don’t get throttled.

4. **Modern Python Stack**
   Type-safe (`mypy`), linted (`ruff`), CI-checked on Ubuntu, macOS, and Windows.
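
From the Python API, a comparable fan-out can be sketched with the standard library. This is not how `--workers` is implemented; it is a generic thread-pool pattern, and `process_many` is a hypothetical helper that takes the per-article function as a parameter so it can wrap `process_single_pmc` or any stand-in.

```python
from collections.abc import Callable, Iterable
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Any


def process_many(
    pmc_ids: Iterable[str],
    process_fn: Callable[[str], Any],
    max_workers: int = 4,
) -> tuple[dict[str, Any], dict[str, Exception]]:
    """Run process_fn over many IDs in parallel; collect results and errors."""
    results: dict[str, Any] = {}
    errors: dict[str, Exception] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_fn, pmc_id): pmc_id for pmc_id in pmc_ids}
        for future in as_completed(futures):
            pmc_id = futures[future]
            try:
                results[pmc_id] = future.result()
            except Exception as exc:  # one failed article should not stop the batch
                errors[pmc_id] = exc
    return results, errors
```

A call might look like `results, errors = process_many(["7114487"], process_single_pmc)`. Keep `max_workers` modest: with this hand-rolled pattern, staying within NCBI rate limits is your responsibility.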

---

## Proof at a Glance

| Metric                      | Value              |
| --------------------------- | ------------------ |
| Unit tests                  | **218**            |
| Branch coverage             | **95 %**           |
| Section detection accuracy  | **98 %**           |
| Median parse time / article | **3.1 s**          |
| Largest batch processed     | **7,500 articles** |

---

## Promise to you

If PMCGrab doesn’t save you hours on day one, delete it, no questions asked.
Once you see clean JSON in seconds, you’ll never fight PMC XML again.

---

## Install Now & Ship Real Results

```bash
uv add pmcgrab
```

Stop paying the **XML tax**. Start engineering context and building AI products that matter.

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "pmcgrab",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": "Rajdeep Mondal <rajdeep@rajdeepmondal.com>",
    "keywords": "ai, bioinformatics, entrez, fast, llm, modern, ncbi, pmc, pubmed, rag, research papers, retrieval augmented generation, scientific literature, text mining, uv, xml parsing",
    "author": null,
    "author_email": "Rajdeep Mondal <rajdeep@rajdeepmondal.com>",
    "download_url": "https://files.pythonhosted.org/packages/22/53/1826c19b5368fb61986ae77930002d177f84f8777e2556fece7a9cb5c7fb/pmcgrab-0.5.8.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "AI-ready retrieval and parsing of PubMed Central articles for RAG applications. Install with uv for best performance.",
    "version": "0.5.8",
    "project_urls": {
        "Bug Reports": "https://github.com/rajdeepmondaldotcom/pmcgrab/issues",
        "Changelog": "https://github.com/rajdeepmondaldotcom/pmcgrab/releases",
        "Documentation": "https://github.com/rajdeepmondaldotcom/pmcgrab#readme",
        "Download": "https://pypi.org/project/pmcgrab/",
        "Homepage": "https://github.com/rajdeepmondaldotcom/pmcgrab",
        "Issues": "https://github.com/rajdeepmondaldotcom/pmcgrab/issues",
        "Repository": "https://github.com/rajdeepmondaldotcom/pmcgrab.git",
        "Source Code": "https://github.com/rajdeepmondaldotcom/pmcgrab"
    },
    "split_keywords": [
        "ai",
        " bioinformatics",
        " entrez",
        " fast",
        " llm",
        " modern",
        " ncbi",
        " pmc",
        " pubmed",
        " rag",
        " research papers",
        " retrieval augmented generation",
        " scientific literature",
        " text mining",
        " uv",
        " xml parsing"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "fe24a60b244656f125b26b0d9c9563e03f279006f235f4c6a5c410d0fb2a2302",
                "md5": "e6147e81311ca382059f11c95c675ac8",
                "sha256": "42cd1788bc82804667d63d8d9e5667b67ba27cf39c34f6b106b61f69826de620"
            },
            "downloads": -1,
            "filename": "pmcgrab-0.5.8-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "e6147e81311ca382059f11c95c675ac8",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10",
            "size": 109322,
            "upload_time": "2025-08-05T18:24:46",
            "upload_time_iso_8601": "2025-08-05T18:24:46.085886Z",
            "url": "https://files.pythonhosted.org/packages/fe/24/a60b244656f125b26b0d9c9563e03f279006f235f4c6a5c410d0fb2a2302/pmcgrab-0.5.8-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "22531826c19b5368fb61986ae77930002d177f84f8777e2556fece7a9cb5c7fb",
                "md5": "9268b86631a63a56b687bf4dad003cd6",
                "sha256": "551a9ecbd2f6b73a27217fdaf84ccaec6ac0288107c4b06744f077848dc334ee"
            },
            "downloads": -1,
            "filename": "pmcgrab-0.5.8.tar.gz",
            "has_sig": false,
            "md5_digest": "9268b86631a63a56b687bf4dad003cd6",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 107842,
            "upload_time": "2025-08-05T18:24:47",
            "upload_time_iso_8601": "2025-08-05T18:24:47.566985Z",
            "url": "https://files.pythonhosted.org/packages/22/53/1826c19b5368fb61986ae77930002d177f84f8777e2556fece7a9cb5c7fb/pmcgrab-0.5.8.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-08-05 18:24:47",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "rajdeepmondaldotcom",
    "github_project": "pmcgrab",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "pmcgrab"
}
