| Field | Value |
|-------|-------|
| Name | pdf2llm |
| Version | 0.1.1 |
| home_page | None |
| Summary | Extract PDF content optimized for Large Language Model (LLM) consumption |
| upload_time | 2025-08-01 22:56:44 |
| maintainer | None |
| docs_url | None |
| author | None |
| requires_python | >=3.12 |
| license | MIT |
| keywords | document-processing, extraction, llm, markdown, pdf |
# pdf2llm - PDF to LLM Context Extractor
Extract content from PDFs in a format optimized for Large Language Model (LLM) consumption.
## Features
- **Multiple formats**: Extract as Markdown (preserves structure) or plain text
- **Image extraction**: Automatically extracts and saves images with configurable DPI
- **Table preservation**: Maintains table structure in Markdown format
- **Page boundaries**: Optional page markers for maintaining document structure
- **Batch processing**: Process multiple PDFs at once
- **Organized output**: Clean directory structure for each PDF
- **Structure analysis**: Analyze PDFs before extraction
- **Token estimation**: Get token counts for LLM context planning
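
Token estimates are mainly useful for checking whether extracted text will fit a model's context window. The README does not say how pdf2llm computes its estimate, so the sketch below uses a common rule of thumb (roughly four characters per token for English text) purely as an illustration. `estimate_tokens` and `fits_context` are hypothetical helpers, not part of the package.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate: character count divided by an assumed ratio.

    The four-characters-per-token ratio is a rule of thumb for English text,
    not necessarily the method pdf2llm itself uses.
    """
    if chars_per_token <= 0:
        raise ValueError("chars_per_token must be positive")
    return math.ceil(len(text) / chars_per_token)


def fits_context(text: str, token_limit: int) -> bool:
    """Return True when the estimated token count is within the limit."""
    return estimate_tokens(text) <= token_limit


if __name__ == "__main__":
    sample = "Setbacks in R-1 zones are measured from the property line."
    print(estimate_tokens(sample), fits_context(sample, token_limit=4000))
```

For an exact count, use the tokenizer that belongs to the target model; a character-based ratio only gives a planning figure.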
## Installation
```bash
# Install from PyPI
pip install pdf2llm
# or
uv pip install pdf2llm
# Or install from source
git clone https://github.com/yourusername/pdf2llm.git
cd pdf2llm
uv sync
```
## Usage
### Command Line Interface
```bash
# Basic extraction
uv run ./pdf2llm document.pdf
# Extract to specific directory
uv run ./pdf2llm document.pdf -o extracted_docs/
# Batch process multiple PDFs
uv run ./pdf2llm *.pdf -o zoning_docs/
# Extract as plain text without images
uv run ./pdf2llm document.pdf --format text --no-images
# Analyze PDF structure only
uv run ./pdf2llm document.pdf --analyze-only
# High quality image extraction
uv run ./pdf2llm document.pdf --dpi 300
# Get JSON output for integration
uv run ./pdf2llm document.pdf --json
# Set token limit warning
uv run ./pdf2llm document.pdf --token-limit 4000
```
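
When the CLI is driven from another program, it can help to assemble the argument list in one place. The following is a minimal sketch that uses only the flags shown above; `build_command` and `run_extraction` are hypothetical helper names, and the `uv run ./pdf2llm` invocation assumes you are running from a source checkout as in the examples.

```python
import subprocess
from pathlib import Path
from typing import List, Optional


def build_command(
    pdf: Path,
    output_dir: Optional[Path] = None,
    fmt: str = "markdown",
    dpi: Optional[int] = None,
    images: bool = True,
    json_output: bool = False,
) -> List[str]:
    """Assemble an argv list using only flags shown in the README."""
    cmd = ["uv", "run", "./pdf2llm", str(pdf)]
    if output_dir is not None:
        cmd += ["-o", str(output_dir)]
    if fmt != "markdown":  # markdown is the documented default
        cmd += ["--format", fmt]
    if dpi is not None:
        cmd += ["--dpi", str(dpi)]
    if not images:
        cmd.append("--no-images")
    if json_output:
        cmd.append("--json")
    return cmd


def run_extraction(pdf: Path, **options) -> subprocess.CompletedProcess:
    """Run the CLI and capture its output; raises if the exit code is non-zero."""
    return subprocess.run(
        build_command(pdf, **options),
        capture_output=True,
        text=True,
        check=True,
    )


if __name__ == "__main__":
    print(build_command(Path("document.pdf"), fmt="text", images=False))
```

Passing the arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces.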
### Python API
```python
from pathlib import Path

from pdf_utils import PDFExtractor

# Create extractor
extractor = PDFExtractor(
    output_dir=Path("extracted"),
    image_format="png",
    dpi=150,
)

# Extract single PDF
result = extractor.extract(
    Path("document.pdf"),
    output_format="markdown",
)

print(f"Tokens: {result.token_estimate}")
print(f"Pages: {result.page_count}")
print(f"Has images: {result.has_images}")
print(f"Has tables: {result.has_tables}")

# Save to file
output_path = extractor.save_extraction(result, Path("document.pdf"))

# Batch extraction
pdf_files = list(Path("pdfs/").glob("*.pdf"))
results = extractor.batch_extract(pdf_files)
```
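
The `token_estimate` attribute shown above can be used to decide which documents fit into a single prompt. The sketch below assumes `batch_extract` returns a list of result objects with that attribute, which the README implies but does not state. `StubResult` is a stand-in so the example runs without the package, and `select_within_budget` is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class StubResult:
    """Stand-in exposing attributes the README prints, plus a hypothetical name."""
    name: str
    token_estimate: int
    page_count: int


def select_within_budget(results, token_budget: int):
    """Greedily keep results, smallest first, while the running total fits."""
    selected, total = [], 0
    for result in sorted(results, key=lambda r: r.token_estimate):
        if total + result.token_estimate > token_budget:
            break  # every remaining result is at least this large
        selected.append(result)
        total += result.token_estimate
    return selected, total


if __name__ == "__main__":
    docs = [
        StubResult("zoning_code", 3000, 10),
        StubResult("appendix", 500, 2),
        StubResult("map_legend", 1200, 4),
    ]
    chosen, used = select_within_budget(docs, token_budget=2000)
    print([d.name for d in chosen], used)
```

Smallest-first selection maximizes the number of documents included; a relevance-ordered selection may suit a Q&A workload better.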
## Output Structure
```
extracted/
├── document_name/
│   ├── content.md          # Extracted content
│   └── images/             # Extracted images (if any)
│       ├── page-1-0.png
│       └── page-2-0.png
└── another_document/
    ├── content.md
    └── images/
```
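
Given the layout above, downstream code can discover extracted documents by walking the output directory. This is a minimal sketch that relies only on the `content.md` and `images/` names shown in the tree; `index_extractions` is a hypothetical helper, and the demo builds a throwaway directory rather than real pdf2llm output.

```python
import tempfile
from pathlib import Path
from typing import Dict


def index_extractions(root: Path) -> Dict[str, dict]:
    """Map each document directory to its content file and image paths."""
    index = {}
    for doc_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        content = doc_dir / "content.md"
        if not content.is_file():
            continue  # not an extraction directory
        images_dir = doc_dir / "images"
        images = (
            sorted(p for p in images_dir.iterdir() if p.is_file())
            if images_dir.is_dir()
            else []
        )
        index[doc_dir.name] = {"content": content, "images": images}
    return index


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "document_name" / "images").mkdir(parents=True)
        (root / "document_name" / "content.md").write_text("# Title\n")
        (root / "document_name" / "images" / "page-1-0.png").write_bytes(b"")
        for name, entry in index_extractions(root).items():
            print(name, entry["content"].name, len(entry["images"]))
```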
## Use Cases
### Zoning Documents Analysis
```bash
# Extract all zoning PDFs with high-quality images
uv run ./pdf2llm zoning_*.pdf -o zoning_analysis/ --dpi 300
```
Then in your Python code:
```python
# Load the extracted Markdown
with open("zoning_analysis/zoning_code_2024/content.md", "r", encoding="utf-8") as f:
    content = f.read()

# Use with your LLM (`llm` stands in for whichever client library you use)
response = llm.chat(
    messages=[{
        "role": "system",
        "content": f"You are analyzing zoning documents. Document: {content}"
    }, {
        "role": "user",
        "content": "What are the setback requirements for R-1 zones?"
    }]
)
```
### Document Q&A System
```bash
# Process all documents
uv run ./pdf2llm documents/*.pdf -o knowledge_base/
# Check token counts
uv run ./pdf2llm documents/*.pdf --json | jq '.token_estimate'
```
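
For a Q&A system, an extracted `content.md` is often larger than a single prompt allows, so it usually has to be split before retrieval. The sketch below packs blank-line-separated paragraphs into chunks under a character budget. It is one simple approach, not a feature of pdf2llm, and `chunk_paragraphs` is a hypothetical helper.

```python
from typing import List


def chunk_paragraphs(text: str, max_chars: int = 8000) -> List[str]:
    """Pack blank-line-separated paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole in its own chunk.
    """
    chunks, current, size = [], [], 0
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        added = len(para) + (2 if current else 0)  # 2 accounts for the joiner
        if current and size + added > max_chars:
            # Close the current chunk and start a new one with this paragraph
            chunks.append("\n\n".join(current))
            current, size = [], 0
            added = len(para)
        current.append(para)
        size += added
    if current:
        chunks.append("\n\n".join(current))
    return chunks


if __name__ == "__main__":
    demo = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
    for i, chunk in enumerate(chunk_paragraphs(demo, max_chars=40)):
        print(i, repr(chunk))
```

Splitting on paragraph boundaries keeps sentences intact; Markdown tables spanning several lines stay together as long as they contain no blank lines.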
### Research Paper Analysis
```bash
# Extract with tables and figures
uv run ./pdf2llm research_paper.pdf --dpi 200
# Extract text only for quick analysis
uv run ./pdf2llm research_paper.pdf --format text --no-images
```
## CLI Options
| Option | Description | Default |
|--------|-------------|---------|
| `-o, --output-dir` | Output directory | `extracted/` |
| `--format` | Output format (markdown, text, both) | `markdown` |
| `--no-images` | Skip image extraction | False |
| `--image-format` | Image format (png, jpg, jpeg) | `png` |
| `--dpi` | DPI for image extraction | `150` |
| `--no-page-chunks` | Disable page boundary markers | False |
| `--analyze-only` | Only analyze structure | False |
| `--quiet` | Minimal output | False |
| `--json` | JSON output | False |
| `--token-limit` | Warn if exceeds limit | None |
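
To make the defaults in the table concrete, here is how the same options could be declared with `argparse`. This is a reconstruction from the table only; the actual parser in `cli/main.py` may use a different library, different help text, or extra validation, and `build_parser` is a hypothetical name.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented options with the defaults listed in the table."""
    parser = argparse.ArgumentParser(prog="pdf2llm")
    parser.add_argument("pdfs", nargs="+", type=Path, help="PDF files to process")
    parser.add_argument("-o", "--output-dir", type=Path, default=Path("extracted/"))
    parser.add_argument("--format", choices=["markdown", "text", "both"], default="markdown")
    parser.add_argument("--no-images", action="store_true")
    parser.add_argument("--image-format", choices=["png", "jpg", "jpeg"], default="png")
    parser.add_argument("--dpi", type=int, default=150)
    parser.add_argument("--no-page-chunks", action="store_true")
    parser.add_argument("--analyze-only", action="store_true")
    parser.add_argument("--quiet", action="store_true")
    parser.add_argument("--json", action="store_true")
    parser.add_argument("--token-limit", type=int, default=None)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["document.pdf", "--dpi", "300"])
    print(args.pdfs, args.output_dir, args.format, args.dpi)
```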
## Package Structure
```
pdf_utils/
├── core/
│   └── extractor.py    # Core extraction logic
├── cli/
│   └── main.py         # CLI interface
└── __init__.py         # Package exports
```
## Requirements
- Python 3.12+
- uv (for dependency management)
- Dependencies managed in `pyproject.toml`
## License
MIT License
Raw data
{
"_id": null,
"home_page": null,
"name": "pdf2llm",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.12",
"maintainer_email": null,
"keywords": "document-processing, extraction, llm, markdown, pdf",
"author": null,
"author_email": "Your Name <your.email@example.com>",
"download_url": "https://files.pythonhosted.org/packages/72/4d/2ba84e7a832a4c445819421bc6881e94138d0385a540d5718c4ed8b835db/pdf2llm-0.1.1.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Extract PDF content optimized for Large Language Model (LLM) consumption",
"version": "0.1.1",
"project_urls": {
"Bug Reports": "https://github.com/yourusername/pdf2llm/issues",
"Homepage": "https://github.com/yourusername/pdf2llm",
"Source": "https://github.com/yourusername/pdf2llm"
},
"split_keywords": [
"document-processing",
" extraction",
" llm",
" markdown",
" pdf"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "f76b022097fab890a88612b7073d797fc03f52e2a84c20ece7df78016d08cfb6",
"md5": "d2a78b82b8f295ac9006e2d56853efe0",
"sha256": "6a0bc2d76b43779da0474c662fa5fa61345a4210545d508bc3b1d34ee12bbeeb"
},
"downloads": -1,
"filename": "pdf2llm-0.1.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "d2a78b82b8f295ac9006e2d56853efe0",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.12",
"size": 11731,
"upload_time": "2025-08-01T22:56:43",
"upload_time_iso_8601": "2025-08-01T22:56:43.288033Z",
"url": "https://files.pythonhosted.org/packages/f7/6b/022097fab890a88612b7073d797fc03f52e2a84c20ece7df78016d08cfb6/pdf2llm-0.1.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "724d2ba84e7a832a4c445819421bc6881e94138d0385a540d5718c4ed8b835db",
"md5": "79eadd4d72945441e113568da17245ef",
"sha256": "f963cfd0433de670ecbbaa2ff57db85da7fdc8908e928ab6914d22214b6e97ae"
},
"downloads": -1,
"filename": "pdf2llm-0.1.1.tar.gz",
"has_sig": false,
"md5_digest": "79eadd4d72945441e113568da17245ef",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.12",
"size": 7533,
"upload_time": "2025-08-01T22:56:44",
"upload_time_iso_8601": "2025-08-01T22:56:44.298656Z",
"url": "https://files.pythonhosted.org/packages/72/4d/2ba84e7a832a4c445819421bc6881e94138d0385a540d5718c4ed8b835db/pdf2llm-0.1.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-08-01 22:56:44",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "yourusername",
"github_project": "pdf2llm",
"github_not_found": true,
"lcname": "pdf2llm"
}