pdf2llm


| Field | Value |
|-------|-------|
| Name | pdf2llm |
| Version | 0.1.1 |
| Summary | Extract PDF content optimized for Large Language Model (LLM) consumption |
| Upload time | 2025-08-01 22:56:44 |
| Requires Python | >=3.12 |
| License | MIT |
| Keywords | document-processing, extraction, llm, markdown, pdf |

# pdf2llm - PDF to LLM Context Extractor

Extract content from PDFs in a format optimized for Large Language Model (LLM) consumption.

## Features

- **Multiple formats**: Extract as Markdown (preserves structure) or plain text
- **Image extraction**: Automatically extracts and saves images with configurable DPI
- **Table preservation**: Maintains table structure in Markdown format
- **Page boundaries**: Optional page markers for maintaining document structure
- **Batch processing**: Process multiple PDFs at once
- **Organized output**: Clean directory structure for each PDF
- **Structure analysis**: Analyze PDFs before extraction
- **Token estimation**: Get token counts for LLM context planning

## Installation

```bash
# Install from PyPI
pip install pdf2llm
# or
uv pip install pdf2llm

# Or install from source
git clone https://github.com/yourusername/pdf2llm.git
cd pdf2llm
uv sync
```

## Usage

### Command Line Interface

```bash
# Basic extraction
uv run ./pdf2llm document.pdf

# Extract to specific directory
uv run ./pdf2llm document.pdf -o extracted_docs/

# Batch process multiple PDFs
uv run ./pdf2llm *.pdf -o zoning_docs/

# Extract as plain text without images
uv run ./pdf2llm document.pdf --format text --no-images

# Analyze PDF structure only
uv run ./pdf2llm document.pdf --analyze-only

# High quality image extraction
uv run ./pdf2llm document.pdf --dpi 300

# Get JSON output for integration
uv run ./pdf2llm document.pdf --json

# Set token limit warning
uv run ./pdf2llm document.pdf --token-limit 4000
```

### Python API

```python
from pathlib import Path

from pdf_utils import PDFExtractor

# Create extractor
extractor = PDFExtractor(
    output_dir=Path("extracted"),
    image_format="png",
    dpi=150
)

# Extract single PDF
result = extractor.extract(
    Path("document.pdf"),
    output_format="markdown"
)

print(f"Tokens: {result.token_estimate}")
print(f"Pages: {result.page_count}")
print(f"Has images: {result.has_images}")
print(f"Has tables: {result.has_tables}")

# Save to file
output_path = extractor.save_extraction(result, Path("document.pdf"))

# Batch extraction
pdf_files = list(Path("pdfs/").glob("*.pdf"))
results = extractor.batch_extract(pdf_files)
```
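
The result attributes printed above can also feed a quick batch report. The sketch below is not part of pdf2llm: `summarize_results` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins. It assumes only that `batch_extract` returns an iterable of objects exposing `token_estimate` and `page_count`, which this README does not confirm.

```python
from types import SimpleNamespace


def summarize_results(results, token_limit=None):
    """Aggregate per-document stats; hypothetical helper, not a pdf2llm API."""
    summary = {"documents": 0, "pages": 0, "tokens": 0, "over_limit": []}
    for index, result in enumerate(results):
        summary["documents"] += 1
        summary["pages"] += result.page_count
        summary["tokens"] += result.token_estimate
        # Record the position of any document that exceeds the limit.
        if token_limit is not None and result.token_estimate > token_limit:
            summary["over_limit"].append(index)
    return summary


# Stand-in objects that mimic the attributes shown in the example above.
fake_results = [
    SimpleNamespace(token_estimate=1200, page_count=3, has_images=False, has_tables=True),
    SimpleNamespace(token_estimate=5200, page_count=14, has_images=True, has_tables=False),
]
report = summarize_results(fake_results, token_limit=4000)
print(report)
```

With real extraction results, the indices in `over_limit` point at the documents that may need splitting before they fit a model's context window.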

## Output Structure

```
extracted/
├── document_name/
│   ├── content.md         # Extracted content
│   └── images/           # Extracted images (if any)
│       ├── page-1-0.png
│       └── page-2-0.png
└── another_document/
    ├── content.md
    └── images/
```
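
Given this layout, downstream code can walk the output directory and collect each document's text and images. The following is a minimal sketch using only the standard library; `load_extracted` is a hypothetical helper, and it assumes the `content.md` file name and `images/` subfolder shown above.

```python
from pathlib import Path
import tempfile


def load_extracted(root):
    """Map each document folder name to its Markdown text and image paths."""
    documents = {}
    for doc_dir in sorted(Path(root).iterdir()):
        content_file = doc_dir / "content.md"
        if not doc_dir.is_dir() or not content_file.is_file():
            continue  # Skip anything that does not match the layout above.
        images_dir = doc_dir / "images"
        images = sorted(images_dir.glob("*")) if images_dir.is_dir() else []
        documents[doc_dir.name] = {
            "content": content_file.read_text(encoding="utf-8"),
            "images": images,
        }
    return documents


# Demo against a throwaway directory that mimics the layout.
with tempfile.TemporaryDirectory() as tmp:
    doc = Path(tmp) / "document_name"
    (doc / "images").mkdir(parents=True)
    (doc / "content.md").write_text("# Title\n", encoding="utf-8")
    (doc / "images" / "page-1-0.png").write_bytes(b"")
    loaded = load_extracted(tmp)
    print(sorted(loaded), len(loaded["document_name"]["images"]))
```

In practice you would call `load_extracted("extracted")`, or pass whatever directory was given to `-o`.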

## Use Cases

### Zoning Documents Analysis
```bash
# Extract all zoning PDFs with high-quality images
uv run ./pdf2llm zoning_*.pdf -o zoning_analysis/ --dpi 300
```

Then load the extracted Markdown in your Python code and pass it to your LLM client:

```python
# Read the extracted Markdown for one document
with open("zoning_analysis/zoning_code_2024/content.md", "r", encoding="utf-8") as f:
    content = f.read()

# `llm` is a placeholder for whichever LLM client you use
response = llm.chat(
    messages=[{
        "role": "system",
        "content": f"You are analyzing zoning documents. Document: {content}"
    }, {
        "role": "user",
        "content": "What are the setback requirements for R-1 zones?"
    }]
)
```

### Document Q&A System
```bash
# Process all documents
uv run ./pdf2llm documents/*.pdf -o knowledge_base/

# Check token counts
uv run ./pdf2llm documents/*.pdf --json | jq '.token_estimate'
```
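
When a document's token count exceeds what your model accepts, a Q&A system usually needs to split the extracted Markdown into smaller pieces. The sketch below greedily packs paragraphs under an approximate token budget. It is an illustration only: `approx_tokens` and `chunk_markdown` are hypothetical helpers, and the "about 4 characters per token" heuristic is a rough rule of thumb for English text, not necessarily how pdf2llm computes `token_estimate`.

```python
def approx_tokens(text):
    """Rough estimate: about 4 characters per token (an assumption)."""
    return max(1, len(text) // 4)


def chunk_markdown(text, max_tokens):
    """Greedily pack blank-line-separated paragraphs into chunks."""
    chunks, current, current_tokens = [], [], 0
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        tokens = approx_tokens(paragraph)
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and current_tokens + tokens > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(paragraph)
        current_tokens += tokens
    if current:
        chunks.append("\n\n".join(current))
    return chunks


sample = "A" * 40 + "\n\n" + "B" * 40 + "\n\n" + "C" * 40
chunks = chunk_markdown(sample, max_tokens=20)
print(len(chunks))
```

A single paragraph larger than the budget still becomes its own oversized chunk here; a production splitter would need to break such paragraphs further.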

### Research Paper Analysis
```bash
# Extract with tables and figures
uv run ./pdf2llm research_paper.pdf --dpi 200

# Extract text only for quick analysis
uv run ./pdf2llm research_paper.pdf --format text --no-images
```

## CLI Options

| Option | Description | Default |
|--------|-------------|---------|
| `-o, --output-dir` | Output directory | `extracted/` |
| `--format` | Output format (markdown, text, both) | `markdown` |
| `--no-images` | Skip image extraction | False |
| `--image-format` | Image format (png, jpg, jpeg) | `png` |
| `--dpi` | DPI for image extraction | `150` |
| `--no-page-chunks` | Disable page boundary markers | False |
| `--analyze-only` | Only analyze structure | False |
| `--quiet` | Minimal output | False |
| `--json` | JSON output | False |
| `--token-limit` | Warn if exceeds limit | None |
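
If you drive the CLI from a script, the options above map naturally onto an argument list. The sketch below is one possible wrapper, not something shipped with pdf2llm: `build_command` and its parameter names are hypothetical, and it uses only flags listed in the table together with the `uv run ./pdf2llm` invocation from the usage examples.

```python
import subprocess


def build_command(pdf_paths, output_dir=None, fmt=None, dpi=None,
                  images=True, token_limit=None, json_output=False):
    """Assemble a pdf2llm argument list from the documented options."""
    command = ["uv", "run", "./pdf2llm", *map(str, pdf_paths)]
    if output_dir is not None:
        command += ["-o", str(output_dir)]
    if fmt is not None:
        command += ["--format", fmt]
    if dpi is not None:
        command += ["--dpi", str(dpi)]
    if not images:
        command.append("--no-images")
    if token_limit is not None:
        command += ["--token-limit", str(token_limit)]
    if json_output:
        command.append("--json")
    return command


command = build_command(["document.pdf"], output_dir="extracted_docs/", dpi=300)
print(command)

# To actually run it (requires a pdf2llm checkout and uv on PATH):
# subprocess.run(command, check=True)
```

Passing the arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces.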

## Package Structure

```
pdf_utils/
├── core/
│   └── extractor.py      # Core extraction logic
├── cli/
│   └── main.py          # CLI interface
└── __init__.py         # Package exports
```

## Requirements

- Python 3.12+
- uv (for dependency management)
- Dependencies managed in `pyproject.toml`

## License

MIT License
            
