pdf-header-detector


- Name: pdf-header-detector
- Version: 1.0.0
- Home page: https://github.com/yourusername/pdf-header-detector
- Summary: A library for intelligent PDF header detection using font analysis
- Upload time: 2025-07-13 16:21:33
- Author: Assistant
- Requires Python: >=3.7
- License: none recorded
- Keywords: pdf, header, detection, font, analysis, document, processing
- Requirements: none recorded

# PDF Header Detector

A Python library that detects headers in PDF files by analyzing font sizes and assigning each header a hierarchy level.

## Features

- **Intelligent Header Detection**: Automatically detects headers based on font size analysis
- **Font Hierarchy Analysis**: Analyzes document font patterns to identify header levels
- **Auto-Threshold Detection**: Automatically determines optimal font size thresholds
- **Body Text Filtering**: Intelligently filters out body text from header candidates
- **JSON Output**: Export detected headers as structured JSON data
- **Command Line Interface**: Easy-to-use CLI for batch processing
- **Extensible Design**: Clean API for integration into other applications

## Installation

### From Source

```bash
# Clone the repository
git clone https://github.com/yourusername/pdf-header-detector.git
cd pdf-header-detector

# Install the package
pip install .

# Or install in development mode
pip install -e .
```

### From PyPI

```bash
pip install pdf-header-detector
```

## Dependencies

- Python 3.7+
- PyMuPDF (fitz) >= 1.23.0

## Quick Start

### As a Library

```python
from pdf_header_detector import CleanHybridPDFChunker

# Create detector instance
detector = CleanHybridPDFChunker()

# Detect headers (auto-detect font size threshold)
headers = detector.detect_headers_by_font_size("document.pdf")

# Print results
for header in headers:
    print(f"Page {header['page']}: {header['header']} (Level {header['header_level']})")

# Get JSON output
json_output = detector.get_headers_json("document.pdf")
print(json_output)

# Save to file
detector.save_headers_to_json(headers, "output_headers.json")
```

### Command Line Interface

```bash
# Basic usage - auto-detect headers
pdf-header-detector document.pdf

# Specify minimum font size
pdf-header-detector document.pdf --min-size 14

# Save to specific output file
pdf-header-detector document.pdf --output my_headers.json

# Show detailed font analysis
pdf-header-detector document.pdf --analyze-fonts

# Verbose output
pdf-header-detector document.pdf --verbose

# Short command alias
phd document.pdf
```
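
For batch processing, the CLI can be driven from a short script. The sketch below relies only on the `--output` flag shown above. The helper names `build_command` and `run_batch`, the `_headers.json` suffix, and the assumption that the CLI exits with a non-zero status on failure are all illustrative and not confirmed by this README.

```python
import subprocess
from pathlib import Path
from typing import List


def build_command(pdf: Path, out_dir: Path) -> List[str]:
    """Build the argv list for one PDF (hypothetical helper)."""
    # The output file name pattern is an arbitrary choice for this sketch.
    output = out_dir / f"{pdf.stem}_headers.json"
    return ["pdf-header-detector", str(pdf), "--output", str(output)]


def run_batch(in_dir: Path, out_dir: Path) -> List[Path]:
    """Run the CLI on every *.pdf in in_dir and return the PDFs that failed."""
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for pdf in sorted(in_dir.glob("*.pdf")):
        result = subprocess.run(build_command(pdf, out_dir))
        # Assumption: a non-zero exit status signals a failed run.
        if result.returncode != 0:
            failed.append(pdf)
    return failed


if __name__ == "__main__":
    # "pdfs" and "headers" are placeholder directory names.
    for path in run_batch(Path("pdfs"), Path("headers")):
        print(f"Failed: {path}")
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.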

## API Reference

### CleanHybridPDFChunker Class

The main class for PDF header detection.

#### Methods

##### `detect_headers_by_font_size(pdf_path, min_size=None)`

Detect headers in a PDF using font size analysis.

**Parameters:**
- `pdf_path` (str): Path to the PDF file
- `min_size` (float, optional): Minimum font size for headers (auto-detected if None)

**Returns:**
- `List[Dict]`: List of detected headers with metadata

**Example:**
```python
headers = detector.detect_headers_by_font_size("document.pdf", min_size=12.0)
```

##### `get_headers_json(pdf_path, min_size=None)`

Get headers as a JSON string.

**Parameters:**
- `pdf_path` (str): Path to the PDF file
- `min_size` (float, optional): Minimum font size for headers

**Returns:**
- `str`: JSON string representation of headers

##### `save_headers_to_json(headers, output_file)`

Save headers to a JSON file.

**Parameters:**
- `headers` (List[Dict]): Headers to save
- `output_file` (str): Output file path

**Returns:**
- `bool`: True if successful, False otherwise

##### `get_font_analysis(pdf_path)`

Get detailed font analysis without detecting headers.

**Parameters:**
- `pdf_path` (str): Path to the PDF file

**Returns:**
- `Dict`: Font analysis results

## Output Format

Headers are returned as a list of dictionaries with the following structure:

```json
[
  {
    "header": "Introduction",
    "header_level_name": "level 1",
    "page": 1,
    "header_level": 1
  },
  {
    "header": "Background",
    "header_level_name": "level 2",
    "page": 2,
    "header_level": 2
  }
]
```

### Fields

- `header` (str): The header text
- `header_level_name` (str): Human-readable level name (e.g., "level 1")
- `page` (int): Page number where the header appears
- `header_level` (int): Numeric hierarchy level (1 = highest, 2 = second, etc.)
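
Because every entry carries a numeric `header_level`, the list can be turned into an indented outline with a few lines of code. The sketch below uses only the fields documented above. `format_outline` is a hypothetical helper, not part of the library.

```python
from typing import Dict, List


def format_outline(headers: List[Dict], indent: str = "  ") -> str:
    """Render detected headers as an indented, plain-text outline."""
    lines = []
    for h in headers:
        # Level 1 is not indented; each deeper level adds one indent step.
        depth = max(h["header_level"] - 1, 0)
        lines.append(f"{indent * depth}{h['header']} (p. {h['page']})")
    return "\n".join(lines)


# Sample data copied from the Output Format example.
sample_headers = [
    {"header": "Introduction", "header_level_name": "level 1",
     "page": 1, "header_level": 1},
    {"header": "Background", "header_level_name": "level 2",
     "page": 2, "header_level": 2},
]

print(format_outline(sample_headers))
```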

## Algorithm Overview

The library detects headers in six steps:

1. **Font Analysis**: Scans the entire document to analyze font size distribution
2. **Body Text Detection**: Identifies the most common font sizes (likely body text)
3. **Threshold Calculation**: Automatically determines optimal header detection threshold
4. **Header Filtering**: Applies heuristics to filter out false positives
5. **Hierarchy Assignment**: Groups similar font sizes into hierarchy levels
6. **Duplicate Removal**: Removes duplicate headers that appear multiple times
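
The steps above can be sketched in plain Python. This is an illustrative re-creation under stated assumptions, not the library's actual implementation. The input is a list of `(text, font_size, page)` tuples rather than a PDF. The character-count weighting, the 1 pt margin above body size, the 120-character length limit, and the rounding of sizes to one decimal place are all invented for the example.

```python
from collections import Counter
from typing import Dict, List, Tuple

Span = Tuple[str, float, int]  # (text, font size in points, page number)


def detect_headers(spans: List[Span], margin: float = 1.0,
                   max_len: int = 120) -> List[Dict]:
    """Hypothetical sketch of the six steps; all heuristics are assumptions."""
    if not spans:
        return []

    # Steps 1-2: build a font size distribution weighted by character count
    # and treat the most heavily used size as body text.
    weights: Counter = Counter()
    for text, size, _page in spans:
        weights[round(size, 1)] += len(text)
    body_size = weights.most_common(1)[0][0]

    # Step 3: anything at least `margin` points above body text may be a header.
    threshold = body_size + margin

    # Step 4: keep short, non-empty spans at or above the threshold.
    candidates = []
    for text, size, page in spans:
        text, size = text.strip(), round(size, 1)
        if size >= threshold and 0 < len(text) <= max_len:
            candidates.append((text, size, page))

    # Step 5: the largest remaining size becomes level 1, the next level 2, ...
    sizes = sorted({size for _text, size, _page in candidates}, reverse=True)
    level_of = {size: i + 1 for i, size in enumerate(sizes)}

    # Step 6: drop repeats of the same text at the same level, keeping the first.
    seen = set()
    headers = []
    for text, size, page in candidates:
        level = level_of[size]
        if (text, level) in seen:
            continue
        seen.add((text, level))
        headers.append({
            "header": text,
            "header_level_name": f"level {level}",
            "page": page,
            "header_level": level,
        })
    return headers


sample_spans = [
    ("Introduction", 18.0, 1),
    ("This paragraph is ordinary body text set at eleven points.", 11.0, 1),
    ("Background", 14.0, 2),
    ("More body text follows the second heading on this page.", 11.0, 2),
    ("Introduction", 18.0, 3),  # repeated text, e.g. a running header
]

for header in detect_headers(sample_spans):
    print(header)
```

Weighting by character count rather than span count keeps a document with many short headings from having a heading size mistaken for body text.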

## Advanced Usage

### Custom Font Size Threshold

```python
# Use a specific minimum font size
headers = detector.detect_headers_by_font_size("document.pdf", min_size=16.0)
```

### Font Analysis Only

```python
# Get detailed font analysis without header detection
analysis = detector.get_font_analysis("document.pdf")
print(f"Body text size: {analysis['body_text_size']}pt")
print(f"Header levels: {analysis['total_levels']}")
```

### Error Handling

```python
try:
    headers = detector.detect_headers_by_font_size("document.pdf")
except FileNotFoundError:
    print("PDF file not found")
except Exception as e:
    print(f"Error processing PDF: {e}")
```

## Contributing

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

### Development Setup

```bash
# Clone the repository
git clone https://github.com/yourusername/pdf-header-detector.git
cd pdf-header-detector

# Install in development mode with dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run linting
flake8 pdf_header_detector/
black pdf_header_detector/

# Type checking
mypy pdf_header_detector/
```

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Changelog

### Version 1.0.0
- Initial release
- Intelligent header detection with font analysis
- Command line interface
- JSON output support
- Auto-threshold detection
- Body text filtering

## Support

- Create an issue on [GitHub Issues](https://github.com/yourusername/pdf-header-detector/issues)
- Check the documentation for more examples
- Review the test files for usage patterns

## Acknowledgments

- Built with [PyMuPDF](https://pymupdf.readthedocs.io/) for PDF processing
- Inspired by the need for better document structure analysis

            
