# PDF Chunker Library v2.0
A production-ready Python library for intelligently chunking PDF documents using sophisticated font analysis, enhanced content filtering, and strategic header detection.
## 🚀 Features
- **Strategic Header Chunking**: Advanced font-size analysis with frequency-based header selection
- **Enhanced Meaning Detection**: NLP-assisted content analysis with metadata-pattern filtering
- **Multi-Level Processing**: Undersized → Oversized → Hierarchical sub-chunking pipeline
- **Robust Content Filtering**: Removes document metadata, page markers, and meaningless fragments
- **Smart Chunk Processing**: Intelligent merging of meaningful short chunks
- **Professional Summarization**: Extractive summaries with rich metadata output
- **Dual Usage Modes**: Simple convenience methods AND advanced custom processing
- **Multiple Output Formats**: JSON, CSV, and custom formats with rich metadata
## 📦 Installation
### Basic Installation
```bash
pip install PyMuPDF pypdf
```
### With Enhanced NLP (recommended)
```bash
pip install PyMuPDF pypdf spacy
python -m spacy download en_core_web_sm
```
### Development Installation
```bash
pip install -e .[dev,nlp]
```
## 🎯 Quick Start - Two Approaches
### 🟢 Approach 1: Simple Convenience (Recommended for Most Users)
Perfect for: Quick prototyping, standard use cases, minimal configuration
```python
from pdf_chunker_for_rag.chunk_creator import CleanHybridPDFChunker
# Initialize, then process and save in a single call
chunker = CleanHybridPDFChunker()
output_file = chunker.process_and_save('document.pdf')
print(f"✅ Chunks saved to: {output_file}")
```
**Run the example:**
```bash
cd examples/
python simple_usage.py
```
**What you get:**
- Automatic header detection and chunking
- JSON output with metadata
- Multiple format options (JSON/CSV)
- Error handling and validation
### 🔵 Approach 2: Advanced Custom Processing
Perfect for: Custom applications, data analysis, integration with other systems
```python
from pdf_chunker_for_rag.chunk_creator import CleanHybridPDFChunker

# Get raw chunk data for custom processing
chunker = CleanHybridPDFChunker()
chunks, headers = chunker.strategic_header_chunking('document.pdf')

# Now you have direct access to chunk data
for chunk in chunks:
    topic = chunk['topic']
    content = chunk['content']
    word_count = chunk['word_count']
    # Your custom logic here...

# Save however you want
import json
with open('my_chunks.json', 'w') as f:
    json.dump({'chunks': chunks}, f, indent=2)
```
**Run the example:**
```bash
cd examples/
python advanced_usage.py
```
**What you get:**
- Direct access to chunk data and headers
- Custom filtering and analysis
- Multiple output formats with custom metadata
- Advanced statistics and reporting
## Quick Start: Custom Chunk Size
```python
from pdf_chunker_for_rag import CleanHybridPDFChunker
# Initialize the production chunker
chunker = CleanHybridPDFChunker()
# Process PDF with strategic header chunking
chunks, headers = chunker.strategic_header_chunking(
    pdf_path="your_document.pdf",
    target_words_per_chunk=200
)
print(f"✅ Created {len(chunks)} structured chunks")
print(f"📊 Average chunk size: {sum(c.get('word_count', 0) for c in chunks) // len(chunks)} words")
# Access chunk data
for chunk in chunks:
    print(f"📖 {chunk['topic']} ({chunk['word_count']} words)")
    print(f"📋 {chunk['summary']}")
    print()
```
## Advanced Usage
```python
from pdf_chunker_for_rag import PDFChunker, ChunkingConfig, SummarizationMethod
# Custom configuration
config = ChunkingConfig(
    target_words_per_chunk=300,
    min_header_occurrences=2,
    oversized_threshold=600,
    critical_threshold=1000,
    min_meaningful_words=30,
    summarization_method=SummarizationMethod.EXTRACTIVE
)
chunker = PDFChunker(config)
result = chunker.chunk_pdf("your_document.pdf")
```
## Key Classes
### PDFChunker
Main interface for PDF chunking operations.
**Methods:**
- `chunk_pdf(pdf_path)`: Complete chunking process
- `detect_headers(pdf_path)`: Header detection only
- `extract_text(pdf_path)`: Text extraction only
- `get_font_analysis(pdf_path)`: Font analysis only
### ChunkingConfig
Configuration for chunking behavior.
**Parameters:**
- `target_words_per_chunk`: Target words per chunk (default: 200)
- `min_header_occurrences`: Minimum header occurrences for selection (default: 3)
- `font_size_tolerance`: Tolerance for font size grouping (default: 2.0)
- `oversized_threshold`: Word count threshold for oversized chunks (default: 500)
- `critical_threshold`: Critical threshold requiring forced splitting (default: 800)
- `min_meaningful_words`: Minimum words for meaningful chunks (default: 50)
### Data Structures
**ChunkData**: Represents a processed chunk
- `chunk_id`: Unique identifier
- `topic`: Header/topic text
- `content`: Chunk content
- `word_count`: Number of words
- `summary`: Generated summary
- `parent_chunk_info`: Information about parent chunk (for split chunks)
**HeaderData**: Represents a detected header
- `text`: Header text
- `font_size`: Font size in points
- `page`: Page number
- `is_bold`: Whether header is bold
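The multi-PDF example below accesses `result.chunks` and `result.average_chunk_size` on the object returned by `chunk_pdf`; here is a minimal sketch of walking these structures, assuming the chunks in that result are `ChunkData` instances exposing the fields listed above:

```python
from pdf_chunker_for_rag import PDFChunker

chunker = PDFChunker()
result = chunker.chunk_pdf("document.pdf")

# Walk the ChunkData fields listed above (attribute access assumed)
for chunk in result.chunks:
    print(chunk.chunk_id, chunk.topic, chunk.word_count)
    if chunk.parent_chunk_info:
        print("  split from parent:", chunk.parent_chunk_info)
```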
## Processing Pipeline
1. **Font Analysis**: Analyze document fonts and determine normal text size
2. **Header Detection**: Identify potential headers based on font size
3. **Strategic Selection**: Select optimal header level using frequency analysis
4. **Text Extraction**: Extract text with proper reading order
5. **Chunk Creation**: Create initial chunks based on headers
6. **Content Filtering**: Remove meaningless content and merge short meaningful chunks
7. **Summarization**: Generate summaries for all chunks
8. **Oversized Processing**: Handle large chunks through sub-header detection or forced splitting
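The first two stages can be inspected on their own through the `PDFChunker` methods listed above, while `chunk_pdf` runs the full pipeline end to end; a minimal sketch, assuming `detect_headers` returns the list of detected `HeaderData` entries:

```python
from pdf_chunker_for_rag import PDFChunker

chunker = PDFChunker()

# Stages 1-2: font analysis and header detection, inspected in isolation
font_info = chunker.get_font_analysis("document.pdf")
headers = chunker.detect_headers("document.pdf")
print(f"{len(headers)} header candidates above {font_info['min_header_threshold']:.1f}pt")

# Stages 3-8 run inside the complete chunking process
result = chunker.chunk_pdf("document.pdf")
```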
## Content Quality Features
### Meaningless Content Detection
- Version numbers and dates
- Page markers and formatting artifacts
- Low meaningful word ratios
- Incomplete sentences and titles
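These rules are internal to the library, but as a rough illustration (not the library's actual patterns), metadata-style lines of this kind can be caught with a few regular expressions:

```python
import re

# Illustrative patterns only -- not the library's internal rules
METADATA_PATTERNS = [
    r"v?\d+(\.\d+)+",             # version numbers such as "v2.0.1"
    r"page\s+\d+(\s+of\s+\d+)?",  # page markers such as "Page 3 of 12"
    r"\d{1,2}/\d{1,2}/\d{2,4}",   # bare dates such as "04/08/2025"
]

def looks_like_metadata(line: str) -> bool:
    text = line.strip().lower()
    return any(re.fullmatch(pattern, text) for pattern in METADATA_PATTERNS)
```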
### Smart Merging
- Preserves short but meaningful content
- Forward-direction merging with adjacent chunks
- Maintains topic coherence
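As a hedged sketch (illustrative only, not the library's implementation), forward-direction merging over the dict-style chunks shown earlier could look like this:

```python
def merge_short_chunks(chunks, min_meaningful_words=50):
    """Fold short-but-meaningful chunks forward into the following chunk (illustrative)."""
    merged, carry = [], None
    for chunk in chunks:
        if carry is not None:
            # Prepend the short chunk so reading order and topic flow are preserved
            chunk = {
                'topic': carry['topic'],
                'content': carry['content'] + '\n\n' + chunk['content'],
                'word_count': carry['word_count'] + chunk['word_count'],
            }
            carry = None
        if chunk['word_count'] < min_meaningful_words:
            carry = chunk  # too short on its own: merge forward
        else:
            merged.append(chunk)
    if carry is not None:
        merged.append(carry)  # last chunk had nothing to merge into
    return merged
```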
### NLP-Enhanced Analysis (with spaCy)
- Sentence structure analysis
- Named entity recognition
- Vocabulary diversity scoring
- Professional content detection
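As one example, vocabulary diversity scoring could be approximated by a type-token ratio over spaCy lemmas (a sketch of the idea, not the library's actual metric):

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def vocabulary_diversity(text: str) -> float:
    """Type-token ratio over alphabetic lemmas (illustrative metric)."""
    doc = nlp(text)
    lemmas = [token.lemma_.lower() for token in doc if token.is_alpha]
    return len(set(lemmas)) / len(lemmas) if lemmas else 0.0
```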
## Library Architecture
```
pdf_chunker_for_rag/
├── core/ # Core types and main chunker class
├── analysis/ # Font analysis and header detection
├── filtering/ # Content quality filtering and merging
├── processing/ # Summarization and oversized chunk handling
└── utils/ # Text extraction and utility functions
```
## Examples
### Processing Multiple PDFs
```python
import os
from pdf_chunker_for_rag import PDFChunker
chunker = PDFChunker()
results = {}
for filename in os.listdir("pdfs/"):
    if filename.endswith(".pdf"):
        pdf_path = os.path.join("pdfs", filename)
        results[filename] = chunker.chunk_pdf(pdf_path)
# Analyze results
for filename, result in results.items():
    print(f"{filename}: {len(result.chunks)} chunks, "
          f"avg {result.average_chunk_size:.0f} words")
```
### Custom Content Filtering
```python
from pdf_chunker_for_rag.filtering import ContentFilter
# Create custom filter
content_filter = ContentFilter(min_meaningful_words=30)

# Check if content is meaningful
is_meaningful = content_filter.has_meaningful_sentence_structure("Your text here")
is_meaningless = content_filter.is_meaningless_content("Your text here")
```
### Font Analysis Only
```python
from pdf_chunker_for_rag.analysis import FontAnalyzer
analyzer = FontAnalyzer()
font_info = analyzer.analyze_document_fonts("document.pdf")
print(f"Normal text size: {font_info['normal_font_size']:.1f}pt")
print(f"Header threshold: {font_info['min_header_threshold']:.1f}pt")
print(f"Unique font sizes: {len(font_info['all_font_sizes'])}")
```
## Requirements
- Python 3.8+
- PyMuPDF (fitz) >= 1.20.0
- pypdf >= 3.0.0
- spaCy >= 3.4.0 (optional, for enhanced NLP features)
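Because spaCy is optional, code that layers on the NLP features can guard the import and fall back to the heuristic filters; a minimal sketch:

```python
try:
    import spacy
    nlp = spacy.load("en_core_web_sm")  # raises OSError if the model is missing
except (ImportError, OSError):
    nlp = None  # NLP features unavailable; heuristic filtering still works
```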
## License
MIT License - see LICENSE file for details.
## Contributing
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request
## Changelog
### Version 1.0.0
- Initial release
- Complete modular architecture
- Font-based header detection
- Content quality filtering
- Smart chunk merging
- Multiple summarization methods
- Oversized chunk processing