# PDF Header Detector
A Python library that detects headers in PDF files by analyzing font sizes and inferring the heading hierarchy.
## Features
- **Intelligent Header Detection**: Automatically detects headers based on font size analysis
- **Font Hierarchy Analysis**: Analyzes document font patterns to identify header levels
- **Auto-Threshold Detection**: Automatically determines optimal font size thresholds
- **Body Text Filtering**: Intelligently filters out body text from header candidates
- **JSON Output**: Export detected headers as structured JSON data
- **Command Line Interface**: Easy-to-use CLI for batch processing
- **Extensible Design**: Clean API for integration into other applications
## Installation
### From Source
```bash
# Clone the repository
git clone https://github.com/yourusername/pdf-header-detector.git
cd pdf-header-detector
# Install the package
pip install .
# Or install in development mode
pip install -e .
```
### Using pip
```bash
pip install pdf-header-detector
```
## Dependencies
- Python 3.7+
- PyMuPDF (fitz) >= 1.23.0
## Quick Start
### As a Library
```python
from pdf_header_detector import CleanHybridPDFChunker
# Create detector instance
detector = CleanHybridPDFChunker()
# Detect headers (auto-detect font size threshold)
headers = detector.detect_headers_by_font_size("document.pdf")
# Print results
for header in headers:
    print(f"Page {header['page']}: {header['header']} (Level {header['header_level']})")
# Get JSON output
json_output = detector.get_headers_json("document.pdf")
print(json_output)
# Save to file
detector.save_headers_to_json(headers, "output_headers.json")
```
### Command Line Interface
```bash
# Basic usage - auto-detect headers
pdf-header-detector document.pdf
# Specify minimum font size
pdf-header-detector document.pdf --min-size 14
# Save to specific output file
pdf-header-detector document.pdf --output my_headers.json
# Show detailed font analysis
pdf-header-detector document.pdf --analyze-fonts
# Verbose output
pdf-header-detector document.pdf --verbose
# Short command alias
phd document.pdf
```
## API Reference
### CleanHybridPDFChunker Class
The main class for PDF header detection.
#### Methods
##### `detect_headers_by_font_size(pdf_path, min_size=None)`
Detect headers in a PDF using font size analysis.
**Parameters:**
- `pdf_path` (str): Path to the PDF file
- `min_size` (float, optional): Minimum font size for headers (auto-detected if None)
**Returns:**
- `List[Dict]`: List of detected headers with metadata
**Example:**
```python
headers = detector.detect_headers_by_font_size("document.pdf", min_size=12.0)
```
##### `get_headers_json(pdf_path, min_size=None)`
Get headers as a JSON string.
**Parameters:**
- `pdf_path` (str): Path to the PDF file
- `min_size` (float, optional): Minimum font size for headers
**Returns:**
- `str`: JSON string representation of headers
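Since the return value is a plain JSON string, it can be loaded back into Python with the standard `json` module. The sketch below assumes the string encodes the list described under Output Format; `json_output` here is a hand-written stand-in, not real library output.

```python
import json
from collections import defaultdict

# Stand-in for the string returned by get_headers_json(); the shape is
# assumed to match the structure shown under "Output Format".
json_output = """
[
  {"header": "Introduction", "header_level_name": "level 1", "page": 1, "header_level": 1},
  {"header": "Background", "header_level_name": "level 2", "page": 2, "header_level": 2}
]
"""

headers = json.loads(json_output)

# Group header titles by their numeric hierarchy level.
by_level = defaultdict(list)
for item in headers:
    by_level[item["header_level"]].append(item["header"])

for level in sorted(by_level):
    print(level, by_level[level])
```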
##### `save_headers_to_json(headers, output_file)`
Save headers to a JSON file.
**Parameters:**
- `headers` (List[Dict]): Headers to save
- `output_file` (str): Output file path
**Returns:**
- `bool`: True if successful, False otherwise
##### `get_font_analysis(pdf_path)`
Get detailed font analysis without detecting headers.
**Parameters:**
- `pdf_path` (str): Path to the PDF file
**Returns:**
- `Dict`: Font analysis results
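The exact keys of the returned dictionary are not listed here. To illustrate the kind of data a font analysis works from, the following sketch tallies characters per font size using PyMuPDF's `page.get_text("dict")`. The helpers `iter_spans` and `size_histogram` are hypothetical names used only for this example; they are not part of this library's API.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def iter_spans(pdf_path: str) -> Iterator[Tuple[str, float, int]]:
    """Yield (text, font_size, page_number) for every non-empty text span."""
    import fitz  # PyMuPDF; imported lazily so size_histogram works without it

    with fitz.open(pdf_path) as doc:
        for page_number, page in enumerate(doc, start=1):
            for block in page.get_text("dict")["blocks"]:
                # Image blocks carry no "lines" key, so default to an empty list.
                for line in block.get("lines", []):
                    for span in line["spans"]:
                        text = span["text"].strip()
                        if text:
                            yield text, round(span["size"], 1), page_number


def size_histogram(spans: Iterable[Tuple[str, float, int]]) -> Counter:
    """Count characters per font size; the largest count is likely body text."""
    histogram: Counter = Counter()
    for text, size, _page in spans:
        histogram[size] += len(text)
    return histogram
```

Calling `size_histogram(iter_spans("document.pdf")).most_common(3)` would then list the three font sizes that carry the most text.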
## Output Format
Headers are returned as a list of dictionaries with the following structure:
```json
[
  {
    "header": "Introduction",
    "header_level_name": "level 1",
    "page": 1,
    "header_level": 1
  },
  {
    "header": "Background",
    "header_level_name": "level 2",
    "page": 2,
    "header_level": 2
  }
]
```
### Fields
- `header` (str): The header text
- `header_level_name` (str): Human-readable level name (e.g., "level 1")
- `page` (int): Page number where the header appears
- `header_level` (int): Numeric hierarchy level (1 = highest, 2 = second, etc.)
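As one way to consume these fields, the sketch below renders a list of header dictionaries as an indented outline. `format_outline` is a hypothetical helper written for this example, and the sample data mirrors the structure shown above.

```python
from typing import Dict, List


def format_outline(headers: List[Dict]) -> str:
    """Render header dicts as an indented outline, one header per line."""
    lines = []
    for item in headers:
        # Indent two spaces for each level below the top one.
        indent = "  " * (item["header_level"] - 1)
        lines.append(f"{indent}{item['header']} (p. {item['page']})")
    return "\n".join(lines)


sample = [
    {"header": "Introduction", "header_level_name": "level 1", "page": 1, "header_level": 1},
    {"header": "Background", "header_level_name": "level 2", "page": 2, "header_level": 2},
]
print(format_outline(sample))
```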
## Algorithm Overview
The library detects headers in six steps:
1. **Font Analysis**: Scans the entire document to analyze font size distribution
2. **Body Text Detection**: Identifies the most common font sizes (likely body text)
3. **Threshold Calculation**: Automatically determines optimal header detection threshold
4. **Header Filtering**: Applies heuristics to filter out false positives
5. **Hierarchy Assignment**: Groups similar font sizes into hierarchy levels
6. **Duplicate Removal**: Removes duplicate headers that appear multiple times
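The steps above can be sketched on plain data, without reading a PDF. This is a minimal illustration and not the library's actual implementation: the `margin` above body size, the 100-character cutoff, the "must contain a letter" rule, and the `tolerance` used to group sizes are all assumptions chosen for the example.

```python
from collections import Counter
from typing import Dict, List, Tuple

Span = Tuple[str, float, int]  # (text, font_size, page)


def detect_headers(
    spans: List[Span], margin: float = 1.0, tolerance: float = 0.5
) -> List[Dict]:
    if not spans:
        return []

    # Steps 1-2: the font size carrying the most characters is treated as body text.
    chars_per_size: Counter = Counter()
    for text, size, _page in spans:
        chars_per_size[size] += len(text)
    body_size = chars_per_size.most_common(1)[0][0]

    # Step 3 (assumed rule): headers are at least `margin` points above body text.
    threshold = body_size + margin

    # Step 4 (assumed heuristics): skip long runs and text without any letters.
    candidates = [
        (text, size, page)
        for text, size, page in spans
        if size >= threshold
        and len(text) <= 100
        and any(ch.isalpha() for ch in text)
    ]

    # Step 5: walk sizes from largest to smallest; a size within `tolerance`
    # of the current group's largest size joins that group's level.
    level_by_size: Dict[float, int] = {}
    group_top, level = None, 0
    for size in sorted({s for _t, s, _p in candidates}, reverse=True):
        if group_top is None or group_top - size > tolerance:
            group_top, level = size, level + 1
        level_by_size[size] = level

    # Step 6: keep only the first occurrence of each header text.
    seen = set()
    headers = []
    for text, size, page in candidates:
        if text in seen:
            continue
        seen.add(text)
        lvl = level_by_size[size]
        headers.append(
            {
                "header": text,
                "header_level_name": f"level {lvl}",
                "page": page,
                "header_level": lvl,
            }
        )
    return headers
```

Weighting sizes by character count rather than by span count keeps a document with many short headings from having a heading size mistaken for body text.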
## Advanced Usage
### Custom Font Size Threshold
```python
# Use a specific minimum font size
headers = detector.detect_headers_by_font_size("document.pdf", min_size=16.0)
```
### Font Analysis Only
```python
# Get detailed font analysis without header detection
analysis = detector.get_font_analysis("document.pdf")
print(f"Body text size: {analysis['body_text_size']}pt")
print(f"Header levels: {analysis['total_levels']}")
```
### Error Handling
```python
try:
    headers = detector.detect_headers_by_font_size("document.pdf")
except FileNotFoundError:
    print("PDF file not found")
except Exception as e:
    print(f"Error processing PDF: {e}")
```
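When processing many files, the same pattern can be applied per file so that one bad PDF does not stop the run. The sketch below is a hypothetical helper, not part of the library: `detect_in_directory` accepts any callable that maps a path to a list of header dictionaries, so the detector's method can be passed in directly.

```python
from pathlib import Path
from typing import Callable, Dict, List, Tuple


def detect_in_directory(
    pdf_dir: str, detect: Callable[[str], List[Dict]]
) -> Tuple[Dict[str, List[Dict]], Dict[str, str]]:
    """Run `detect` on every PDF in a directory, collecting failures separately."""
    results: Dict[str, List[Dict]] = {}
    failures: Dict[str, str] = {}
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        try:
            results[pdf_path.name] = detect(str(pdf_path))
        except Exception as exc:
            # Record the error message and continue with the next file.
            failures[pdf_path.name] = str(exc)
    return results, failures
```

For example, `results, failures = detect_in_directory("reports", detector.detect_headers_by_font_size)` would process every PDF in a folder named `reports` (an illustrative path).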
## Contributing
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
### Development Setup
```bash
# Clone the repository
git clone https://github.com/yourusername/pdf-header-detector.git
cd pdf-header-detector
# Install in development mode with dev dependencies
pip install -e ".[dev]"
# Run tests
pytest
# Run linting
flake8 pdf_header_detector/
black pdf_header_detector/
# Type checking
mypy pdf_header_detector/
```
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Changelog
### Version 1.0.0
- Initial release
- Intelligent header detection with font analysis
- Command line interface
- JSON output support
- Auto-threshold detection
- Body text filtering
## Support
- Create an issue on [GitHub Issues](https://github.com/yourusername/pdf-header-detector/issues)
- Check the documentation for more examples
- Review the test files for usage patterns
## Acknowledgments
- Built with [PyMuPDF](https://pymupdf.readthedocs.io/) for PDF processing
- Inspired by the need for better document structure analysis
Raw data
{
"_id": null,
"home_page": "https://github.com/yourusername/pdf-header-detector",
"name": "pdf-header-detector",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.7",
"maintainer_email": null,
"keywords": "pdf header detection font analysis document processing",
"author": "Assistant",
"author_email": "assistant@example.com",
"download_url": "https://files.pythonhosted.org/packages/f9/f6/e83e4802ed38caf8fb3f5a4e459e7083b50fe82f3c4f82404dc3c0d95ab2/pdf_header_detector-1.0.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": null,
"summary": "A library for intelligent PDF header detection using font analysis",
"version": "1.0.0",
"project_urls": {
"Homepage": "https://github.com/yourusername/pdf-header-detector"
},
"split_keywords": [
"pdf",
"header",
"detection",
"font",
"analysis",
"document",
"processing"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "6a8d48abf82cba4c5707b0d21a8d72117e7c01313de9039f494d280a7639b015",
"md5": "eb5eb2a727cd3cc3431ea6e849329fcb",
"sha256": "ec04e4ce9a66a9062555f7fc99536a0d61480ad2025e2ec396789908e3bd9655"
},
"downloads": -1,
"filename": "pdf_header_detector-1.0.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "eb5eb2a727cd3cc3431ea6e849329fcb",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.7",
"size": 13224,
"upload_time": "2025-07-13T16:21:32",
"upload_time_iso_8601": "2025-07-13T16:21:32.505605Z",
"url": "https://files.pythonhosted.org/packages/6a/8d/48abf82cba4c5707b0d21a8d72117e7c01313de9039f494d280a7639b015/pdf_header_detector-1.0.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "f9f6e83e4802ed38caf8fb3f5a4e459e7083b50fe82f3c4f82404dc3c0d95ab2",
"md5": "44cbd74333f0ba7aa8bb58bf006fe962",
"sha256": "71b919cff21ade8b70e161c49bac74214cb4df939d41bcd7bf74960efb6dd4ae"
},
"downloads": -1,
"filename": "pdf_header_detector-1.0.0.tar.gz",
"has_sig": false,
"md5_digest": "44cbd74333f0ba7aa8bb58bf006fe962",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.7",
"size": 14423,
"upload_time": "2025-07-13T16:21:33",
"upload_time_iso_8601": "2025-07-13T16:21:33.548020Z",
"url": "https://files.pythonhosted.org/packages/f9/f6/e83e4802ed38caf8fb3f5a4e459e7083b50fe82f3c4f82404dc3c0d95ab2/pdf_header_detector-1.0.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-07-13 16:21:33",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "yourusername",
"github_project": "pdf-header-detector",
"github_not_found": true,
"lcname": "pdf-header-detector"
}