pdf-chunker-for-rag

Name: pdf-chunker-for-rag
Version: 2.0.0
Summary: Production-ready PDF chunking library with intelligent content filtering and strategic header detection
Home page: https://github.com/your-username/chunk_creation
Author: AI Assistant
Requires Python: >=3.8
Uploaded: 2025-08-04 03:52:02
# PDF Chunker Library v2.0

A production-ready Python library for intelligently chunking PDF documents using sophisticated font analysis, enhanced content filtering, and strategic header detection.

## 🚀 Features

- **Strategic Header Chunking**: Advanced font-size analysis with frequency-based header selection
- **Enhanced Meaning Detection**: AI-powered content analysis with metadata pattern filtering  
- **Multi-Level Processing**: Undersized → Oversized → Hierarchical sub-chunking pipeline
- **Robust Content Filtering**: Removes document metadata, page markers, and meaningless fragments
- **Smart Chunk Processing**: Intelligent merging of meaningful short chunks
- **Professional Summarization**: Extractive summaries with rich metadata output
- **Dual Usage Modes**: Simple convenience methods AND advanced custom processing
- **Multiple Output Formats**: JSON, CSV, and custom formats with rich metadata

## 📦 Installation

### Basic Installation
```bash
pip install PyMuPDF pypdf
```

## 🎯 Quick Start - Two Approaches

### 🟢 Approach 1: Simple Convenience (Recommended for Most Users)

Perfect for quick prototyping, standard use cases, and minimal configuration.

```python
from pdf_chunker_for_rag.chunk_creator import CleanHybridPDFChunker

# Initialize, then process and save in one call
chunker = CleanHybridPDFChunker()
output_file = chunker.process_and_save('document.pdf')
print(f"✅ Chunks saved to: {output_file}")
```

**Run the example:**
```bash
cd examples/
python simple_usage.py
```

**What you get:**
- Automatic header detection and chunking
- JSON output with metadata
- Multiple format options (JSON/CSV)
- Error handling and validation

### 🔵 Approach 2: Advanced Custom Processing

Perfect for custom applications, data analysis, and integration with other systems.

```python
from pdf_chunker_for_rag.chunk_creator import CleanHybridPDFChunker

# Get raw chunk data for custom processing
chunker = CleanHybridPDFChunker()
chunks, headers = chunker.strategic_header_chunking('document.pdf')

# Now you have direct access to chunk data
for chunk in chunks:
    topic = chunk['topic']
    content = chunk['content'] 
    word_count = chunk['word_count']
    # Your custom logic here...

# Save however you want
import json
with open('my_chunks.json', 'w') as f:
    json.dump({'chunks': chunks}, f, indent=2)
```

**Run the example:**
```bash
cd examples/
python advanced_usage.py
```

**What you get:**
- Direct access to chunk data and headers
- Custom filtering and analysis
- Multiple output formats with custom metadata
- Advanced statistics and reporting
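
As a sketch of the kind of custom reporting Approach 2 enables — the chunk dicts below are hypothetical sample data in the shape shown above, not real library output:

```python
# Sample chunk dicts in the shape returned by strategic_header_chunking;
# the values here are illustrative only.
chunks = [
    {"topic": "Introduction", "content": "An overview of the system.", "word_count": 180},
    {"topic": "Methods", "content": "Details of the approach.", "word_count": 420},
    {"topic": "Results", "content": "Key findings.", "word_count": 95},
]

def chunk_stats(chunks):
    """Simple reporting over chunk word counts."""
    counts = [c["word_count"] for c in chunks]
    return {
        "total_chunks": len(counts),
        "total_words": sum(counts),
        "avg_words": sum(counts) / len(counts),
        "largest_topic": max(chunks, key=lambda c: c["word_count"])["topic"],
    }

print(chunk_stats(chunks))
```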

### Installation with Enhanced NLP (Recommended)
```bash
pip install PyMuPDF pypdf spacy
python -m spacy download en_core_web_sm
```

### Development Installation
```bash
pip install -e ".[dev,nlp]"
```

## Quick Start

```python
from pdf_chunker_for_rag import CleanHybridPDFChunker

# Initialize the production chunker
chunker = CleanHybridPDFChunker()

# Process PDF with strategic header chunking
chunks, headers = chunker.strategic_header_chunking(
    pdf_path="your_document.pdf",
    target_words_per_chunk=200
)

print(f"✅ Created {len(chunks)} structured chunks")
print(f"📊 Average chunk size: {sum(c.get('word_count', 0) for c in chunks) // len(chunks)} words")

# Access chunk data
for chunk in chunks:
    print(f"📖 {chunk['topic']} ({chunk['word_count']} words)")
    print(f"📋 {chunk['summary']}")
    print()
```

## Advanced Usage

```python
from pdf_chunker_for_rag import PDFChunker, ChunkingConfig, SummarizationMethod

# Custom configuration
config = ChunkingConfig(
    target_words_per_chunk=300,
    min_header_occurrences=2,
    oversized_threshold=600,
    critical_threshold=1000,
    min_meaningful_words=30,
    summarization_method=SummarizationMethod.EXTRACTIVE
)

chunker = PDFChunker(config)
result = chunker.chunk_pdf("your_document.pdf")
```

## Key Classes

### PDFChunker
Main interface for PDF chunking operations.

**Methods:**
- `chunk_pdf(pdf_path)`: Complete chunking process
- `detect_headers(pdf_path)`: Header detection only
- `extract_text(pdf_path)`: Text extraction only
- `get_font_analysis(pdf_path)`: Font analysis only

### ChunkingConfig
Configuration for chunking behavior.

**Parameters:**
- `target_words_per_chunk`: Target words per chunk (default: 200)
- `min_header_occurrences`: Minimum header occurrences for selection (default: 3)
- `font_size_tolerance`: Tolerance for font size grouping (default: 2.0)
- `oversized_threshold`: Word count threshold for oversized chunks (default: 500)
- `critical_threshold`: Critical threshold requiring forced splitting (default: 800)
- `min_meaningful_words`: Minimum words for meaningful chunks (default: 50)

### Data Structures

**ChunkData**: Represents a processed chunk
- `chunk_id`: Unique identifier
- `topic`: Header/topic text
- `content`: Chunk content
- `word_count`: Number of words
- `summary`: Generated summary
- `parent_chunk_info`: Information about parent chunk (for split chunks)

**HeaderData**: Represents a detected header
- `text`: Header text
- `font_size`: Font size in points
- `page`: Page number
- `is_bold`: Whether header is bold
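
The field lists above can be pictured as plain dataclasses. This is a sketch of the documented fields only, not the library's actual class definitions:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class HeaderData:
    text: str        # header text
    font_size: float # font size in points
    page: int        # page number
    is_bold: bool    # whether the header is bold

@dataclass
class ChunkData:
    chunk_id: str
    topic: str       # header/topic text
    content: str
    word_count: int
    summary: str = ""
    parent_chunk_info: Optional[dict] = None  # set for split chunks

header = HeaderData(text="Introduction", font_size=16.0, page=1, is_bold=True)
chunk = ChunkData(chunk_id="chunk_001", topic=header.text,
                  content="Some text.", word_count=2)
```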

## Processing Pipeline

1. **Font Analysis**: Analyze document fonts and determine normal text size
2. **Header Detection**: Identify potential headers based on font size
3. **Strategic Selection**: Select optimal header level using frequency analysis
4. **Text Extraction**: Extract text with proper reading order
5. **Chunk Creation**: Create initial chunks based on headers
6. **Content Filtering**: Remove meaningless content and merge short meaningful chunks
7. **Summarization**: Generate summaries for all chunks
8. **Oversized Processing**: Handle large chunks through sub-header detection or forced splitting
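
The eight steps above amount to a linear pipeline that threads document state through each stage. A minimal sketch with hypothetical stage functions (the real stages live in the library's `analysis`, `filtering`, and `processing` modules):

```python
# Hypothetical stage functions; the names illustrate pipeline order,
# not the library's real API.
def run_pipeline(pdf_path, stages):
    """Thread a working state dict through each stage in order."""
    state = {"pdf_path": pdf_path}
    for name, stage in stages:
        state = stage(state)
        state.setdefault("log", []).append(name)  # record completed stages
    return state

stages = [
    ("font_analysis", lambda s: {**s, "normal_size": 11.0}),
    ("header_detection", lambda s: {**s, "headers": ["Intro", "Methods"]}),
    ("chunk_creation", lambda s: {**s, "chunks": [{"topic": h} for h in s["headers"]]}),
]
result = run_pipeline("document.pdf", stages)
```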

## Content Quality Features

### Meaningless Content Detection
- Version numbers and dates
- Page markers and formatting artifacts
- Low meaningful word ratios
- Incomplete sentences and titles
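
A rough illustration of this kind of detection using simple regex heuristics — the library's actual filter is more sophisticated:

```python
import re

# Illustrative metadata patterns only, not the library's real rule set.
METADATA_PATTERNS = [
    re.compile(r"^\s*(v(ersion)?\s*)?\d+(\.\d+)+\s*$", re.IGNORECASE),  # version numbers
    re.compile(r"^\s*page\s+\d+(\s+of\s+\d+)?\s*$", re.IGNORECASE),     # page markers
    re.compile(r"^\s*\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\s*$"),               # dates
]

def looks_meaningless(text: str, min_words: int = 5) -> bool:
    stripped = text.strip()
    if any(p.match(stripped) for p in METADATA_PATTERNS):
        return True
    words = re.findall(r"[A-Za-z]{2,}", stripped)
    return len(words) < min_words  # too few real words to be a chunk
```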

### Smart Merging
- Preserves short but meaningful content
- Forward-direction merging with adjacent chunks
- Maintains topic coherence
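
Forward-direction merging can be sketched as a single pass that carries any too-short chunk into the next one. This is a simplified model using the chunk-dict shape shown earlier; the library's merge logic also checks meaningfulness:

```python
def merge_short_chunks(chunks, min_words=50):
    """Merge each too-short chunk forward into the next chunk."""
    merged = []
    carry = None  # short chunk waiting to be merged forward
    for chunk in chunks:
        if carry is not None:
            chunk = {
                "topic": carry["topic"],  # keep the earlier topic (a design choice)
                "content": carry["content"] + "\n\n" + chunk["content"],
                "word_count": carry["word_count"] + chunk["word_count"],
            }
            carry = None
        if chunk["word_count"] < min_words:
            carry = chunk
        else:
            merged.append(chunk)
    if carry is not None:  # trailing short chunk: nothing ahead, keep it
        merged.append(carry)
    return merged

sample = [
    {"topic": "Note", "content": "Short preamble.", "word_count": 10},
    {"topic": "Body", "content": "A long enough section...", "word_count": 120},
]
print(merge_short_chunks(sample))
```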

### NLP-Enhanced Analysis (with spaCy)
- Sentence structure analysis
- Named entity recognition
- Vocabulary diversity scoring
- Professional content detection
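
One ingredient, vocabulary diversity, can be approximated without spaCy as a type-token ratio; the library's spaCy-based analysis adds parsing and entity recognition on top of ideas like this:

```python
def vocabulary_diversity(text: str) -> float:
    """Type-token ratio: unique alphabetic tokens over total tokens."""
    tokens = [t.lower() for t in text.split() if t.isalpha()]
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)
```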

## Library Architecture

```
pdf_chunker_for_rag/
├── core/           # Core types and main chunker class
├── analysis/       # Font analysis and header detection
├── filtering/      # Content quality filtering and merging
├── processing/     # Summarization and oversized chunk handling
└── utils/          # Text extraction and utility functions
```

## Examples

### Processing Multiple PDFs
```python
import os
from pdf_chunker_for_rag import PDFChunker

chunker = PDFChunker()
results = {}

for filename in os.listdir("pdfs/"):
    if filename.endswith(".pdf"):
        pdf_path = os.path.join("pdfs", filename)
        results[filename] = chunker.chunk_pdf(pdf_path)

# Analyze results
for filename, result in results.items():
    print(f"{filename}: {len(result.chunks)} chunks, "
          f"avg {result.average_chunk_size:.0f} words")
```

### Custom Content Filtering
```python
from pdf_chunker_for_rag.filtering import ContentFilter

# Create a custom filter (avoid shadowing the built-in `filter`)
content_filter = ContentFilter(min_meaningful_words=30)

# Check if content is meaningful
is_meaningful = content_filter.has_meaningful_sentence_structure("Your text here")
is_meaningless = content_filter.is_meaningless_content("Your text here")
```

### Font Analysis Only
```python
from pdf_chunker_for_rag.analysis import FontAnalyzer

analyzer = FontAnalyzer()
font_info = analyzer.analyze_document_fonts("document.pdf")

print(f"Normal text size: {font_info['normal_font_size']:.1f}pt")
print(f"Header threshold: {font_info['min_header_threshold']:.1f}pt")
print(f"Unique font sizes: {len(font_info['all_font_sizes'])}")
```

## Requirements

- Python 3.8+
- PyMuPDF (fitz) >= 1.20.0
- pypdf >= 3.0.0
- spaCy >= 3.4.0 (optional, for enhanced NLP features)

## License

MIT License - see LICENSE file for details.

## Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request

## Changelog

### Version 1.0.0
- Initial release
- Complete modular architecture
- Font-based header detection
- Content quality filtering
- Smart chunk merging
- Multiple summarization methods
- Oversized chunk processing
