pmcgrab


| Field           | Value |
| --------------- | ----- |
| Name            | pmcgrab |
| Version         | 0.5.4 |
| Summary         | AI-ready retrieval and parsing of PubMed Central articles for RAG applications. Install with uv for best performance. |
| Upload time     | 2025-08-01 07:43:54 |
| Requires Python | >=3.10 |
| Keywords        | ai, bioinformatics, entrez, fast, llm, modern, ncbi, pmc, pubmed, rag, research papers, retrieval augmented generation, scientific literature, text mining, uv, xml parsing |

# PMCGrab: From PubMed Central ID to AI-Ready JSON in Seconds

[![PyPI](https://img.shields.io/pypi/v/pmcgrab.svg)](https://pypi.org/project/pmcgrab/) [![Python](https://img.shields.io/pypi/pyversions/pmcgrab.svg)](https://pypi.org/project/pmcgrab/) [![Docs](https://img.shields.io/badge/docs-mkdocs-blue.svg)](https://rajdeepmondaldotcom.github.io/pmcgrab/) [![CI](https://github.com/rajdeepmondaldotcom/pmcgrab/workflows/CI/badge.svg)](https://github.com/rajdeepmondaldotcom/pmcgrab/actions) [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://github.com/rajdeepmondaldotcom/pmcgrab/blob/main/LICENSE)

Every AI workflow that touches biomedical literature hits the same wall:

1. **Download** PMC XML hoping it’s “structured.”
2. **Fight** nested tags, footnotes, figure refs, and half-broken links.
3. **Hope** your regex didn’t blow away the Methods section you actually need.

That wall steals hours from **RAG pipelines, knowledge-graph builds, LLM fine-tuning—any downstream AI task**.
**PMCGrab knocks it down.** Feed the tool a list of PMC IDs and get back clean, section-aware JSON you can drop straight into a vector DB or LLM prompt.

---

## The Hidden Cost of “I’ll Just Parse It Myself”

| Task                        | Manual / ad-hoc         | **PMCGrab**                    |
| --------------------------- | ----------------------- | ------------------------------ |
| Install dependencies        | 5–10 min                | **≈ 2 s** (`uv add pmcgrab`)   |
| Convert one article to JSON | 15–30 min               | **≈ 3 s**                      |
| Capture every IMRaD section | Hope & regex            | **98 % detection accuracy\***  |
| Parallel processing         | Bash loops & temp files | `--workers N` flag             |
| Edge-case maintenance       | Yours forever           | **200+ tests**, active upkeep  |

\* _Evaluated on 7,500 PMC papers used in a disease-specific knowledge-graph pipeline._

At \$50/hour and 15–30 minutes per article, hand-parsing 100 papers costs roughly **\$1,250–\$2,500**.
PMCGrab does the same job for \$0—within minutes—so you can focus on _using_ the information instead of extracting it.
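
The cost estimate can be reproduced from the figures in the comparison table. The snippet below is only a back-of-the-envelope sketch: the hourly rate and the 15 to 30 minute range come from this README, while the function name is made up for illustration.

```python
# Back-of-the-envelope cost of hand-parsing, using the README's own figures.
HOURLY_RATE_USD = 50             # rate quoted in the text
MINUTES_PER_ARTICLE = (15, 30)   # manual range from the comparison table
N_ARTICLES = 100


def manual_cost_usd(n_articles: int, minutes_per_article: float, hourly_rate: float) -> float:
    """Cost of parsing n_articles by hand at the given pace and rate."""
    return n_articles * minutes_per_article / 60 * hourly_rate


low = manual_cost_usd(N_ARTICLES, MINUTES_PER_ARTICLE[0], HOURLY_RATE_USD)
high = manual_cost_usd(N_ARTICLES, MINUTES_PER_ARTICLE[1], HOURLY_RATE_USD)
print(f"${low:,.0f} to ${high:,.0f}")  # → $1,250 to $2,500
```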

---

## Quick Install

```bash
uv add pmcgrab          # fastest
pip install pmcgrab     # alternative if you are not using uv
```

Python ≥ 3.10 required.

---

## Two Ways to Use

### 1 · Python API

```python
from pmcgrab.application.processing import process_single_pmc

article = process_single_pmc("7114487")
print(article)
```

### 2 · Command Line

```bash
uv run python -m pmcgrab --pmcids 7114487 3084273 --workers 4
# → writes pmc_output/PMC7114487.json, PMC3084273.json
```

(Use the numeric part of the PMC ID only.)
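
If your IDs arrive with the `PMC` prefix (for example, copied from article URLs), a small helper can strip it before you pass them on. This is a minimal sketch and not part of PMCGrab; `normalize_pmcid` is a hypothetical name.

```python
import re


def normalize_pmcid(raw: str) -> str:
    """Return the numeric part of a PMC ID, e.g. 'PMC7114487' -> '7114487'."""
    match = re.fullmatch(r"\s*(?:PMC)?(\d+)\s*", raw, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"not a PMC ID: {raw!r}")
    return match.group(1)


ids = [normalize_pmcid(x) for x in ["PMC7114487", "pmc3084273", " 7114487 "]]
print(ids)  # → ['7114487', '3084273', '7114487']
```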

---

## Output Example

```json
{
  "pmc_id": "7114487",
  "title": "Machine learning approaches in cancer research",
  "abstract": "…",
  "body": {
    "Introduction": "…",
    "Methods": "…",
    "Results": "…",
    "Discussion": "…"
  },
  "authors": [...],
  "journal": "Nature Medicine"
}
```
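
Before indexing an article, it can be worth checking which IMRaD sections actually came through. The sketch below assumes the shape shown in the example above (a `body` mapping of section title to text) and assumes titles match the four names exactly; real articles may use variants such as "Materials and Methods", so treat the result as a hint. `missing_imrad_sections` is a hypothetical helper, not a PMCGrab function.

```python
IMRAD = ("Introduction", "Methods", "Results", "Discussion")


def missing_imrad_sections(article: dict) -> list:
    """List IMRaD sections that are absent or empty in article['body']."""
    body = article.get("body") or {}
    present = {
        title.strip().lower()
        for title, text in body.items()
        if isinstance(text, str) and text.strip()
    }
    return [name for name in IMRAD if name.lower() not in present]


# Placeholder article in the documented shape; the texts are dummy values.
sample = {
    "pmc_id": "7114487",
    "body": {"Introduction": "Intro text.", "Methods": "Methods text.", "Results": ""},
}
print(missing_imrad_sections(sample))  # → ['Results', 'Discussion']
```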

---

## Context Engineering: Why This Matters for LLMs

Large-language-model performance lives or dies on **context quality**—the snippets you retrieve and feed back into the model:

- **RAG pipelines** need precise, de-duplicated passages to ground answers.
- **Knowledge-graph population** demands reliable section boundaries (e.g., Methods vs. Results) to classify triples accurately.
- **Fine-tuning & few-shot prompting** work best with noise-free, domain-specific examples.

PMCGrab _is_ a context-engineering tool: it converts messy XML into **clean, section-aware, UTF-8 JSON** that slots directly into embeddings, vector stores, or prompt templates. No preprocessing gymnastics, no guessing where the Methods section starts, and less risk of hallucinations caused by half-garbled text. Better input → better retrieval → better answers.
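
As an illustration of how section-aware JSON feeds a retrieval pipeline, the sketch below splits each section into overlapping word windows and tags every passage with its section name. It assumes `body` maps section titles to plain strings, as in the Output Example; the function names, window size, and overlap are illustrative choices, not anything PMCGrab prescribes.

```python
def chunk_words(text: str, max_words: int = 200, overlap: int = 20) -> list:
    """Split text into windows of at most max_words words, sharing `overlap` words."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be >= 0 and smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def article_to_passages(article: dict, max_words: int = 200, overlap: int = 20) -> list:
    """Turn one article into de-duplicated passages tagged with their section."""
    passages = []
    seen = set()
    for section, text in (article.get("body") or {}).items():
        if not isinstance(text, str):
            continue  # assumption: section bodies are plain strings
        for i, chunk in enumerate(chunk_words(text, max_words, overlap)):
            if chunk in seen:
                continue  # drop exact duplicate passages
            seen.add(chunk)
            passages.append({
                "id": f"{article.get('pmc_id')}:{section}:{i}",
                "pmc_id": article.get("pmc_id"),
                "section": section,
                "text": chunk,
            })
    return passages


# Dummy article: ten placeholder words in a single section.
demo = {"pmc_id": "7114487", "body": {"Methods": " ".join(f"w{i}" for i in range(10))}}
for passage in article_to_passages(demo, max_words=4, overlap=1):
    print(passage["id"], "|", passage["text"])
```

Keeping the section name on every passage lets a retriever filter or re-rank by section (for example, preferring Methods passages for protocol questions).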

---

## Why PMCGrab Beats Home-Grown Scripts

1. **Section-Aware Parsing**
   Detects IMRaD plus custom subsections like _Statistical Analysis_—crucial for accurate retrieval scoring.

2. **Resilient XML Cleaning**
   Removes cross-refs and figure stubs without dropping scientific content, preserving token-level fidelity for embeddings.

3. **True Concurrency**
   `--workers` fans out across all CPU cores; automatic email rotation respects NCBI rate limits so large harvests don’t get throttled.

4. **Modern Python Stack**
   Type-safe (`mypy`), linted (`ruff`), CI-checked on Ubuntu, macOS, and Windows.
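
For readers who want to see what a worker fan-out looks like from the Python side, here is a generic sketch built on the standard library's `ThreadPoolExecutor`. It is not PMCGrab's implementation: `process_many` and `fake_process` are hypothetical names, and the stub stands in for a real per-article call. In practice you might pass `process_single_pmc` as `process_one`, assuming it handles one ID per call.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Iterable, Optional


def process_many(
    pmc_ids: Iterable[str],
    process_one: Callable[[str], Optional[dict]],
    workers: int = 4,
):
    """Run process_one over pmc_ids in a thread pool; return (results, failures)."""
    results = {}
    failures = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(process_one, pid): pid for pid in pmc_ids}
        for future in as_completed(futures):
            pid = futures[future]
            try:
                article = future.result()
            except Exception as exc:  # keep one bad ID from aborting the batch
                failures[pid] = repr(exc)
                continue
            if article is None:
                failures[pid] = "no article returned"
            else:
                results[pid] = article
    return results, failures


def fake_process(pmc_id: str) -> Optional[dict]:
    """Stand-in for a real fetch-and-parse call (hypothetical)."""
    if pmc_id == "0":
        raise ValueError("unknown ID")
    return {"pmc_id": pmc_id, "body": {}}


results, failures = process_many(["7114487", "3084273", "0"], fake_process, workers=2)
print(sorted(results), failures)
```

Collecting failures separately, rather than raising on the first error, keeps a long harvest running and leaves a list of IDs to retry.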

---

## Proof at a Glance

| Metric                      | Value              |
| --------------------------- | ------------------ |
| Unit tests                  | **218**            |
| Branch coverage             | **95 %**           |
| Section detection accuracy  | **98 %**           |
| Median parse time / article | **3.1 s**          |
| Largest batch processed     | **7,500 articles** |

---

## Promise to You

If PMCGrab doesn’t save you hours on day one, delete it—no questions asked.
Once you see clean JSON in seconds, you’ll never fight PMC XML again.

---

## Install Now & Ship Real Results

```bash
uv add pmcgrab
```

Stop paying the **XML tax**. Start engineering context—and building AI products that matter.

            

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "pmcgrab",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": "Rajdeep Mondal <rajdeep@rajdeepmondal.com>",
    "keywords": "ai, bioinformatics, entrez, fast, llm, modern, ncbi, pmc, pubmed, rag, research papers, retrieval augmented generation, scientific literature, text mining, uv, xml parsing",
    "author": null,
    "author_email": "Rajdeep Mondal <rajdeep@rajdeepmondal.com>",
    "download_url": "https://files.pythonhosted.org/packages/dd/d7/76c5bf9fe62fad9f794df6a1639b0b3e0fcfca76b5ac6bcf65ba58b77ff6/pmcgrab-0.5.4.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "AI-ready retrieval and parsing of PubMed Central articles for RAG applications. Install with uv for best performance.",
    "version": "0.5.4",
    "project_urls": {
        "Bug Reports": "https://github.com/rajdeepmondaldotcom/pmcgrab/issues",
        "Changelog": "https://github.com/rajdeepmondaldotcom/pmcgrab/releases",
        "Documentation": "https://github.com/rajdeepmondaldotcom/pmcgrab#readme",
        "Download": "https://pypi.org/project/pmcgrab/",
        "Homepage": "https://github.com/rajdeepmondaldotcom/pmcgrab",
        "Issues": "https://github.com/rajdeepmondaldotcom/pmcgrab/issues",
        "Repository": "https://github.com/rajdeepmondaldotcom/pmcgrab.git",
        "Source Code": "https://github.com/rajdeepmondaldotcom/pmcgrab"
    },
    "split_keywords": [
        "ai",
        " bioinformatics",
        " entrez",
        " fast",
        " llm",
        " modern",
        " ncbi",
        " pmc",
        " pubmed",
        " rag",
        " research papers",
        " retrieval augmented generation",
        " scientific literature",
        " text mining",
        " uv",
        " xml parsing"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "abd82c6b131ffaea4ff426feb64cf5ed9e5a8c6ddfadec674e02fbb1f555c2fa",
                "md5": "af09a92bbbf726e93fd65509ce9f4a7c",
                "sha256": "a70ad723a546df6c2bd07a5b4b11de0b6e68d772f73f9e5b29871186be603a5f"
            },
            "downloads": -1,
            "filename": "pmcgrab-0.5.4-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "af09a92bbbf726e93fd65509ce9f4a7c",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10",
            "size": 109352,
            "upload_time": "2025-08-01T07:43:53",
            "upload_time_iso_8601": "2025-08-01T07:43:53.417966Z",
            "url": "https://files.pythonhosted.org/packages/ab/d8/2c6b131ffaea4ff426feb64cf5ed9e5a8c6ddfadec674e02fbb1f555c2fa/pmcgrab-0.5.4-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "ddd776c5bf9fe62fad9f794df6a1639b0b3e0fcfca76b5ac6bcf65ba58b77ff6",
                "md5": "295108f64ba390805013c837c50baee2",
                "sha256": "f1b4027cb51f91aeaff67971f8a10b432022bdee4e73787047fa17959521216a"
            },
            "downloads": -1,
            "filename": "pmcgrab-0.5.4.tar.gz",
            "has_sig": false,
            "md5_digest": "295108f64ba390805013c837c50baee2",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 105303,
            "upload_time": "2025-08-01T07:43:54",
            "upload_time_iso_8601": "2025-08-01T07:43:54.449761Z",
            "url": "https://files.pythonhosted.org/packages/dd/d7/76c5bf9fe62fad9f794df6a1639b0b3e0fcfca76b5ac6bcf65ba58b77ff6/pmcgrab-0.5.4.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-08-01 07:43:54",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "rajdeepmondaldotcom",
    "github_project": "pmcgrab",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "pmcgrab"
}