# pyDocExtractor

**A Python library for converting documents (PDF, DOCX, XLSX) to Markdown with multiple precision levels.**

Built with **Hexagonal Architecture** for maximum testability, flexibility, and maintainability.

[![CI](https://github.com/AminiTech/pyDocExtractor/actions/workflows/ci.yml/badge.svg)](https://github.com/AminiTech/pyDocExtractor/actions/workflows/ci.yml)
[![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)
[![Coverage](https://img.shields.io/badge/coverage-87%25-brightgreen.svg)](https://github.com/AminiTech/pyDocExtractor)
[![Tests](https://img.shields.io/badge/tests-260%2B%20passing-brightgreen.svg)](https://github.com/AminiTech/pyDocExtractor)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Code style: ruff](https://img.shields.io/badge/code%20style-ruff-000000.svg)](https://github.com/astral-sh/ruff)

## Features

- **4 Precision Levels** - Choose between speed and quality
- **LLM Image Description** - Automatic AI-powered image descriptions using OpenAI-compatible APIs
- **Hexagonal Architecture** - Clean separation of concerns with Protocol-based ports
- **Automatic Selection** - Smart converter selection based on file characteristics
- **Quality Scoring** - 0-1.0 quality scores for converted content
- **Fallback Chain** - Automatic fallback if preferred converter fails
- **Multi-Format Support** - PDF, DOCX, XLSX, XLS
- **CLI & Python API** - Use as command-line tool or library
- **Dependency Injection** - Easy testing and customization
- **Extras Model** - Install only what you need

## Installation

### Basic Installation

Note: The package is not available on public PyPI; configure your package manager to use CodeArtifact before installing.

```bash
pip install pydocextractor
```

Note: If you encounter NumPy compatibility issues, try installing in a clean virtual environment using `uv venv` for better dependency management.

### With Specific Extractors

Note: DOCX support is provided by the docling `[doc]` extra, not `[docx]`.

```bash
# PDF support only
pip install pydocextractor[pdf]

# DOCX support (docling)
pip install pydocextractor[doc]

# Excel support
pip install pydocextractor[xlsx]

# CLI tools
pip install pydocextractor[cli]

# LLM support for image descriptions
pip install pydocextractor[llm]

# Everything
pip install pydocextractor[all]
```

### Install from GitHub Repository

If you want to use the latest version directly from the repository:

Note: You need credentials to access the GitHub repository: either an SSH key configured for `git@github.com`, or a Personal Access Token for HTTPS access.

#### Using uv (Recommended)

```bash
# Install with all features
uv venv
uv pip install "pydocextractor[all] @ git+https://github.com/AminiTech/pyDocExtractor.git"
```

```bash
# Install with specific extras
uv venv
uv pip install "pydocextractor[pdf,cli] @ git+https://github.com/AminiTech/pyDocExtractor.git"
```

```bash
# Install specific branch or tag
uv venv
uv pip install "pydocextractor[all] @ git+https://github.com/AminiTech/pyDocExtractor.git@main"
uv pip install "pydocextractor[all] @ git+https://github.com/AminiTech/pyDocExtractor.git@v1.0.0"
```

```bash
# Install in editable mode for local changes
git clone https://github.com/AminiTech/pyDocExtractor.git
cd pyDocExtractor
uv venv
uv pip install -e ".[all]"
```

#### Using pip

```bash
# Install with all features
pip install "git+https://github.com/AminiTech/pyDocExtractor.git#egg=pydocextractor[all]"

# Install with specific extras
pip install "git+https://github.com/AminiTech/pyDocExtractor.git#egg=pydocextractor[pdf,cli]"

# Install in editable mode
git clone https://github.com/AminiTech/pyDocExtractor.git
cd pyDocExtractor
pip install -e ".[all]"
```

#### Using requirements.txt

Add to your `requirements.txt`:

```txt
# Latest from main branch with all features
pydocextractor[all] @ git+https://github.com/AminiTech/pyDocExtractor.git

# Specific version/tag
pydocextractor[all] @ git+https://github.com/AminiTech/pyDocExtractor.git@v1.0.0

# With specific extras only
pydocextractor[pdf,xlsx] @ git+https://github.com/AminiTech/pyDocExtractor.git
```

Then install:

```bash
# With uv
uv pip install -r requirements.txt

# With pip
pip install -r requirements.txt
```

#### Using pyproject.toml

Add to your `pyproject.toml`:

```toml
[project]
dependencies = [
    "pydocextractor[all] @ git+https://github.com/AminiTech/pyDocExtractor.git",
]

# Or specify extras
dependencies = [
    "pydocextractor[pdf,xlsx,cli] @ git+https://github.com/AminiTech/pyDocExtractor.git@main",
]
```

Then install:

```bash
# With uv (recommended)
uv sync

# With pip
pip install .
```

### Development Setup

For contributors and developers who want to modify the library:

**Prerequisites:** Install [just](https://github.com/casey/just) command runner:
```bash
# macOS
brew install just

# Linux
curl --proto '=https' --tlsv1.2 -sSf https://just.systems/install.sh | bash -s -- --to /usr/local/bin

# Windows
scoop install just
# or: choco install just
```

**Setup:**
```bash
# Clone repository
git clone https://github.com/AminiTech/pyDocExtractor.git
cd pyDocExtractor

# Bootstrap development environment (installs all dependencies)
just bootstrap

# Run tests
just test

# Run quality checks
just check

# See all available commands
just --list
```

Note: All `just` commands are tested and working. Run `just guard` for architecture validation and `just check` for quality checks. Comprehensive testing shows 18/18 test categories passing.

## Quick Start

### CLI Usage

```bash
# Convert a document
pydocextractor convert document.pdf

# Specify precision level (1-4)
pydocextractor convert document.pdf --level 2

# Custom output file
pydocextractor convert document.pdf -o output.md

# Show quality score
pydocextractor convert document.pdf --show-score

# Batch convert directory with pattern matching (includes timing info)
pydocextractor batch input_dir/ output_dir/

# Batch convert with custom pattern
pydocextractor batch input_dir/ output_dir/ --pattern "*.pdf"

# Batch convert DOCX files with highest quality
pydocextractor batch input_dir/ output_dir/ --pattern "*.docx" --level 4

# Check converter status
pydocextractor status

# Document info
pydocextractor info document.pdf

# Note: Image descriptions require LLM configuration (see LLM Image Description section)
```

### Python API (Hexagonal)

```python
from pathlib import Path
from pydocextractor.domain.models import Document, PrecisionLevel
from pydocextractor.factory import create_converter_service

# Create service (uses dependency injection)
service = create_converter_service()

# Load document
file_path = Path("document.pdf")
doc = Document(
    bytes=file_path.read_bytes(),
    mime="application/pdf",
    size_bytes=file_path.stat().st_size,
    precision=PrecisionLevel.BALANCED,
    filename=file_path.name,
)

# Convert to Markdown
result = service.convert_to_markdown(doc)

# Access results
print(result.text)              # Markdown text
print(result.quality_score)     # Quality score (0.0-1.0)
print(result.metadata)          # Additional metadata
```

### Using from Another Python Program

If you're using pyDocExtractor as a library in your Python application:

```python
from pathlib import Path
from pydocextractor import (
    Document,
    PrecisionLevel,
    create_converter_service,
    get_available_extractors,
)

# Create service using factory (recommended)
# Automatically loads LLM config from config.env or .env if present
service = create_converter_service()

# Load your document
file_path = Path("your_document.pdf")
doc = Document(
    bytes=file_path.read_bytes(),
    mime="application/pdf",
    size_bytes=file_path.stat().st_size,
    precision=PrecisionLevel.BALANCED,
    filename=file_path.name,
)

# Convert to Markdown
# If LLM is configured, images will be automatically described
result = service.convert_to_markdown(doc)

# Access results
markdown_text = result.text
quality = result.quality_score
extractor_used = result.metadata.get("extractor")

# Check if images were described
has_image_descriptions = "Image Description" in markdown_text

# For CSV/Excel files, use tabular template
excel_doc = Document(
    bytes=Path("data.xlsx").read_bytes(),
    mime="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    size_bytes=Path("data.xlsx").stat().st_size,
    precision=PrecisionLevel.HIGHEST_QUALITY,
    filename="data.xlsx",
)
result = service.convert_to_markdown(excel_doc, template_name="tabular")
```

### Custom Configuration (Advanced)

For advanced users who need custom service configuration:

```python
from pathlib import Path

from pydocextractor.app.service import ConverterService
from pydocextractor.factory import create_converter_service
from pydocextractor.infra.policy.heuristics import DefaultPolicy
from pydocextractor.infra.scoring.default_scorer import DefaultQualityScorer
from pydocextractor.infra.templates.engines import Jinja2TemplateEngine

# Create components manually
policy = DefaultPolicy()  # Auto-discovers all available extractors
template_engine = Jinja2TemplateEngine()
quality_scorer = DefaultQualityScorer()

# Assemble service with custom components
service = ConverterService(
    policy=policy,
    template_engine=template_engine,
    quality_scorer=quality_scorer,
)

# Use with custom template directory
custom_templates = Path("my_custom_templates/")
service_with_custom_templates = create_converter_service(template_dir=custom_templates)

# Use custom service
result = service.convert_to_markdown(doc, template_name="simple")
```

## Precision Levels

| Level | Name              | Speed          | Quality | Use Case                    | Table Statistics |
|-------|-------------------|----------------|---------|----------------------------|------------------|
| 1     | FASTEST           | ⚡ 0.1s - 4.2s  | Basic   | Large files, quick preview | ❌ No            |
| 2     | BALANCED (default)| ⚙️ 0.5s - 35.7s | Good    | General purpose            | ✅ Yes           |
| 3     | TABLE_OPTIMIZED   | 🐌 1.2s - 120s  | High    | Complex tables             | ✅ Yes           |
| 4     | HIGHEST_QUALITY   | 🐢 45s - 3600s  | Maximum | Small files, archival      | ✅ Yes           |

Note: The timing ranges above are measured, not theoretical. Large files (>20MB) automatically use Level 1 for speed; small files (<2MB) automatically use Level 4 for quality.

**Important Notes:**
- **Level 1 (FASTEST)** prioritizes speed and does not detect tables or generate statistics. Use Level 2+ if you need table analysis.
- **Level 2-4** automatically detect tables and generate comprehensive statistics including min/max/mean/std for numerical data and frequency distributions for categorical data.

**Automatic Selection:**
- Small files (<2MB) → Level 4 (Docling)
- Large files (>20MB) → Level 1 (ChunkedParallel)
- Files with tables → Level 3 (PDFPlumber)
- Default → Level 2 (PyMuPDF4LLM)

## Template System

pyDocExtractor uses **Jinja2 templates** to control how extracted content is formatted into Markdown. This gives you complete control over the output structure and presentation.

### Built-in Templates

| Template | Purpose | Best For |
|----------|---------|----------|
| **simple** | Minimal formatting, just content | Quick conversions, plain text |
| **default** | Enhanced formatting with metadata | PDF/DOCX documents, structured output |
| **tabular** | Specialized for data with statistics | CSV/Excel files, spreadsheets |

### Using Templates

**In Code:**
```python
from pydocextractor import create_converter_service

service = create_converter_service()

# Use specific template
result = service.convert_to_markdown(doc, template_name="simple")

# Use tabular template for CSV/Excel
result = service.convert_to_markdown(doc, template_name="tabular")

# List available templates
templates = service.list_available_templates()
```

Note: Template files use the `.j2` extension (standard Jinja2 convention); pass the base name (e.g. `simple`) as `template_name`. Requesting a non-existent template raises an error rather than failing silently.

**In CLI:**
```bash
# Use specific template
pydocextractor convert document.pdf --template simple

# Templates auto-selected for CSV/Excel
pydocextractor convert data.xlsx  # Uses tabular template automatically
```

### Creating Custom Templates

Create a Jinja2 template file in `src/pydocextractor/infra/templates/templates/`:

```jinja2
{#- my_custom.j2 -#}
# {{ metadata.filename }}

{% for block in blocks %}
{{ block.content }}

{% endfor %}

---
*Quality: {{ quality_score }}*
```

**Available Context Variables:**
- `blocks` - List of content blocks (text, tables, images)
- `metadata` - Document metadata (filename, extractor, stats)
- `quality_score` - Quality score (0.0-1.0)
- `has_tables` - Boolean indicating tables present
- `has_images` - Boolean indicating images present
- `page_count` - Number of pages (if applicable)

### Custom Template Directory

```python
from pathlib import Path
from pydocextractor import create_converter_service

# Use templates from custom directory
custom_dir = Path("my_templates/")
service = create_converter_service(template_dir=custom_dir)

result = service.convert_to_markdown(doc, template_name="my_custom")
```

**For detailed information**, see **[docs/TEMPLATES.md](docs/TEMPLATES.md)** for:
- Complete template context reference
- Advanced Jinja2 techniques
- Template filters and macros
- Best practices and examples

## CSV and Excel Support

pyDocExtractor includes specialized extractors for tabular data with rich statistical analysis:

Note: PSV/TSV files are not supported by default. Only CSV files with comma delimiters are supported. For PSV/TSV, consider converting to CSV first.

### Features

- **Multi-sheet Excel support** - Process XLSX/XLS files with multiple sheets
- **Delimiter handling** - Parses comma-delimited CSV automatically (TSV/PSV not supported)
- **Statistical summaries** - Min/max/mean for numeric columns, mode for categorical
- **Type inference** - Automatic detection of numerical vs categorical columns
- **Tabular template** - Dedicated Markdown template with formatted tables

### Usage

```python
from pathlib import Path
from pydocextractor import Document, PrecisionLevel, create_converter_service

service = create_converter_service()

# Excel file
excel_doc = Document(
    bytes=Path("Sales_2025.xlsx").read_bytes(),
    mime="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    size_bytes=Path("Sales_2025.xlsx").stat().st_size,
    precision=PrecisionLevel.HIGHEST_QUALITY,
    filename="Sales_2025.xlsx",
)
result = service.convert_to_markdown(excel_doc, template_name="tabular")

# CSV file
csv_doc = Document(
    bytes=Path("customers.csv").read_bytes(),
    mime="text/csv",
    size_bytes=Path("customers.csv").stat().st_size,
    precision=PrecisionLevel.BALANCED,
    filename="customers.csv",
)
result = service.convert_to_markdown(csv_doc, template_name="tabular")
```

### CLI Usage

The CLI automatically selects the tabular template for CSV/Excel files:

```bash
# Excel - auto-selects tabular template
pydocextractor convert Sales_2025.xlsx

# CSV - auto-selects tabular template
pydocextractor convert customers.csv

# Force specific template
pydocextractor convert data.xlsx --template simple
```

### Output Example

The tabular template generates structured Markdown with:
- YAML frontmatter with file metadata
- Per-sheet summary tables
- Column statistics (min/max/mean/mode)
- Data type classification
- Quality score
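The per-column statistics above can be sketched with the standard library (illustrative only; the library computes these with pandas, and the function name here is hypothetical):

```python
from statistics import mean, mode

def summarize_column(values):
    """Illustrative per-column summary: numeric columns get min/max/mean,
    categorical columns get their most frequent value (mode)."""
    if values and all(isinstance(v, (int, float)) for v in values):
        return {"min": min(values), "max": max(values), "mean": mean(values)}
    return {"mode": mode(values)}
```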

## LLM Image Description

pyDocExtractor can automatically describe images in documents using OpenAI-compatible multimodal LLMs. This feature provides context-aware descriptions by analyzing images alongside the surrounding text.

### Features

- **Context-Aware Descriptions** - LLM receives the previous 100 lines of text as context
- **Multi-Format Support** - Works with images in PDF and DOCX files
- **Automatic Image Resizing** - Images resized to 1024x1024 with aspect ratio preservation
- **Cost Control** - Configurable limit on images per document (default: 5)
- **Graceful Degradation** - System works normally if LLM is not configured
- **Multi-Level Support** - Works with all extractors (Docling, PyMuPDF4LLM, PDFPlumber)

Note: LLM features have been verified against real API calls (tested with the OpenAI API). Image descriptions require documents that actually contain images; the per-document image limit (default: 5) keeps costs bounded.

### Installation

Install the LLM extra to enable image description:

```bash
pip install pydocextractor[llm]
```

This installs:
- `httpx` - Synchronous HTTP client for API calls
- `python-dotenv` - Environment configuration
- `pillow` - Image processing and resizing

### Configuration

Create a `config.env` file in your project directory:

```bash
# Enable LLM image description
LLM_ENABLED=true

# OpenAI API (or compatible endpoint)
LLM_API_URL=https://api.openai.com/v1/chat/completions
LLM_API_KEY=your-api-key-here

# Model configuration
LLM_MODEL_NAME=gpt-4o-mini

# Optional settings
LLM_MAX_IMAGES=5           # Max images per document (cost control)
LLM_CONTEXT_LINES=100      # Lines of context to provide
LLM_IMAGE_SIZE=1024        # Image resize dimension (1024x1024)
LLM_TIMEOUT=30             # Request timeout in seconds
LLM_MAX_RETRIES=3          # Retry attempts on failure
```

**Important:** Add `config.env` to your `.gitignore` to avoid committing API keys:

```bash
echo "config.env" >> .gitignore
```

### Usage in Python

When using pyDocExtractor as a library, the LLM configuration is automatically loaded from `config.env` (or `.env`) in the current directory:

```python
from pathlib import Path
from pydocextractor import Document, PrecisionLevel, create_converter_service

# Service automatically loads LLM config from config.env
service = create_converter_service()

# Convert document with images
doc_path = Path("document_with_images.pdf")
doc = Document(
    bytes=doc_path.read_bytes(),
    mime="application/pdf",
    size_bytes=doc_path.stat().st_size,
    precision=PrecisionLevel.BALANCED,
    filename=doc_path.name,
)

# Images will be automatically described if LLM is configured
result = service.convert_to_markdown(doc)

# Check for image descriptions in output
if "Image Description" in result.text:
    print("Images were described by LLM")
```

### Manual Configuration

You can also provide LLM configuration programmatically:

```python
from pydocextractor import create_converter_service
from pydocextractor.domain.config import LLMConfig

# Create LLM configuration
llm_config = LLMConfig(
    api_url="https://api.openai.com/v1/chat/completions",
    api_key="your-api-key",
    model_name="gpt-4o-mini",
    enabled=True,
    max_images_per_document=5,
    context_lines=100,
    image_size=1024,
    timeout_seconds=30,
    max_retries=3,
)

# Create service with LLM config (auto_load_llm=False to skip env loading)
service = create_converter_service(
    llm_config=llm_config,
    auto_load_llm=False,
)

# Use service normally - images will be described
result = service.convert_to_markdown(doc)
```

### Disabling LLM

To disable LLM features:

**Option 1:** Set `LLM_ENABLED=false` in config.env
**Option 2:** Remove config.env entirely
**Option 3:** Don't install the `[llm]` extra

The system will work normally without LLM, just without image descriptions.

### OpenAI-Compatible APIs

The LLM feature works with any OpenAI-compatible API endpoint:

**OpenAI:**
```bash
LLM_API_URL=https://api.openai.com/v1/chat/completions
LLM_MODEL_NAME=gpt-4o-mini  # or gpt-4o, gpt-4-vision-preview
```

**Azure OpenAI:**
```bash
LLM_API_URL=https://your-resource.openai.azure.com/openai/deployments/your-deployment/chat/completions?api-version=2024-02-15-preview
LLM_MODEL_NAME=gpt-4o
```

**Local/Self-Hosted (e.g., Ollama, LM Studio):**
```bash
LLM_API_URL=http://localhost:11434/v1/chat/completions
LLM_MODEL_NAME=llava:13b
```

### Output Example

When LLM is enabled, images in documents are automatically described:

```markdown
## Document Content

Some text content here...

<!-- image -->

**Image Description**: The image shows an architectural diagram depicting
a three-tier system architecture with a web server layer, application
server layer, and database layer. The diagram illustrates the flow of
requests through load balancers and connection to backend services.

More text content following the image...
```

### Cost Considerations

- **Max Images Limit**: Set `LLM_MAX_IMAGES` to control costs (default: 5 per document)
- **Image Resizing**: Images automatically resized to 1024x1024 to reduce token usage
- **Retry Logic**: Failed API calls retry up to 3 times with exponential backoff
- **Fallback**: If LLM call fails, document processing continues without description
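The retry timing implied by "exponential backoff" can be sketched as a delay schedule (illustrative; the library's actual base delay and internal retry code are not documented here):

```python
def backoff_delays(max_retries=3, base_seconds=1.0):
    """Exponential backoff schedule: each retry waits twice as long as
    the previous one, e.g. 1s, 2s, 4s for the default 3 retries."""
    return [base_seconds * (2 ** attempt) for attempt in range(max_retries)]
```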

### Technical Details

**How it works:**

1. Extractors detect images in documents (PDF/DOCX)
2. Raw image data is extracted and stored in Block objects
3. Images are resized to 1024x1024 (white padding, aspect ratio preserved)
4. Image + previous 100 lines of text sent to LLM
5. LLM generates contextual description
6. Description inserted into markdown output
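The resize in step 3 reduces to simple geometry: scale the image so its longer side fits the 1024-pixel canvas, then center it with padding. A pure-arithmetic sketch (the library does this with Pillow; this helper is hypothetical):

```python
def letterbox_geometry(width, height, target=1024):
    """Compute the scaled size and centering offsets for fitting an image
    onto a target x target canvas while preserving aspect ratio."""
    scale = target / max(width, height)
    new_w, new_h = round(width * scale), round(height * scale)
    # Offsets center the scaled image; the remaining border is white padding.
    return new_w, new_h, (target - new_w) // 2, (target - new_h) // 2
```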

**Extractor Support:**

All three PDF extractors support image extraction when LLM is enabled:

- **Docling (Level 4)**: Extracts images from PDF and DOCX files
- **PyMuPDF4LLM (Level 2)**: Extracts images from PDF files
- **PDFPlumber (Level 3)**: Extracts images from PDF files

## Architecture

pyDocExtractor follows **Hexagonal Architecture** (Ports and Adapters pattern) for clean separation of concerns:

```mermaid
graph TB
    subgraph Infrastructure["🔌 Infrastructure Layer (Adapters)"]
        subgraph Extractors["PDF Extractors"]
            E1[ChunkedParallelExtractor<br/>Level 1: FASTEST<br/>📄 PDF - Parallel Processing]
            E2[PyMuPDF4LLMExtractor<br/>Level 2: BALANCED<br/>📄 PDF - LLM Optimized]
            E3[PDFPlumberExtractor<br/>Level 3: TABLE_OPTIMIZED<br/>📄 PDF - Table Extraction]
            E4[DoclingExtractor<br/>Level 4: HIGHEST_QUALITY<br/>📄 PDF/DOCX/Excel]
        end

        subgraph TabularExtractors["Tabular Data Extractors"]
            E5[PandasCSVExtractor<br/>Level 4: HIGHEST_QUALITY<br/>📊 CSV with Statistics]
            E6[PandasExcelExtractor<br/>Level 4: HIGHEST_QUALITY<br/>📊 Excel Multi-Sheet]
        end

        subgraph Policy["Selection Policy"]
            P1[DefaultPolicy<br/>📋 Smart Selection Logic<br/>Auto-discovers extractors<br/>Builds fallback chains]
        end

        subgraph Templates["Template Rendering"]
            T1[Jinja2TemplateEngine<br/>📝 Markdown Generation<br/>Custom filters & templates]
        end

        subgraph Scoring["Quality Scoring"]
            S1[DefaultQualityScorer<br/>⭐ 0.0-1.0 Quality Score<br/>Content/Structure/Format]
        end
    end

    subgraph Application["⚙️ Application Layer (Orchestration)"]
        CS[ConverterService<br/>🎯 Coordinates Workflow<br/>Fallback chains<br/>Quality scoring]
    end

    subgraph Domain["🎯 Domain Layer (Pure Business Logic)"]
        subgraph Models["Immutable Models"]
            M1[Document<br/>Input file + metadata]
            M2[NormalizedDoc<br/>Vendor-agnostic format]
            M3[Block<br/>Content units]
            M4[Markdown<br/>Final output]
        end

        subgraph Ports["Ports - Protocols"]
            PR1[Extractor Protocol<br/>extract/supports/is_available]
            PR2[Policy Protocol<br/>choose_extractors]
            PR3[TemplateEngine Protocol<br/>render/list_templates]
            PR4[QualityScorer Protocol<br/>calculate_score]
        end

        subgraph Rules["Pure Functions"]
            R1[quality_score<br/>Quality calculation]
            R2[calculate_document_hash<br/>Deduplication]
            R3[normalize_blocks<br/>Block processing]
        end
    end

    %% Protocol Implementation Relationships
    E1 -.implements.-> PR1
    E2 -.implements.-> PR1
    E3 -.implements.-> PR1
    E4 -.implements.-> PR1
    E5 -.implements.-> PR1
    E6 -.implements.-> PR1
    P1 -.implements.-> PR2
    T1 -.implements.-> PR3
    S1 -.implements.-> PR4

    %% Service Dependencies (only uses Protocols)
    CS -->|uses| PR2
    CS -->|uses| PR1
    CS -->|uses| PR3
    CS -->|uses| PR4

    %% Domain Dependencies
    PR1 -->|depends on| M1
    PR1 -->|depends on| M2
    PR2 -->|depends on| M1
    CS -->|creates| M4

    %% Styling
    classDef infrastructure fill:#e1f5ff,stroke:#01579b,stroke-width:2px
    classDef application fill:#fff9c4,stroke:#f57f17,stroke-width:2px
    classDef domain fill:#f3e5f5,stroke:#4a148c,stroke-width:2px

    class E1,E2,E3,E4,E5,E6,P1,T1,S1 infrastructure
    class CS application
    class M1,M2,M3,M4,PR1,PR2,PR3,PR4,R1,R2,R3 domain
```

### Why Hexagonal Architecture?

**Clean Separation:**
- **Domain Layer**: Business rules (Document, Block models) with zero dependencies
- **Application Layer**: Orchestrates conversion workflow using ports
- **Infrastructure Layer**: Concrete implementations (PDF extractors, templates)

**Key Benefits:**

1. **Testability**
   - Domain & app layers testable without real extractors
   - 260+ tests with 87% coverage
   - BDD scenarios for behavior validation

2. **Flexibility**
   - Swap extractors: Use Docling instead of PyMuPDF4LLM
   - Change templates: Different Markdown formats
   - Replace scoring: Custom quality algorithms

3. **Maintainability**
   - Boundaries enforced by import-linter
   - Strict mypy type checking
   - Clear dependency flow

4. **Extensibility**
   - Add new extractors by implementing `Extractor` Protocol
   - No changes needed to domain or application layers
   - Example: PandasCSV added without touching core logic

### Architecture Layers in Detail

#### 🎯 Domain Layer (`src/pydocextractor/domain/`)

The innermost layer containing pure business logic with **zero external dependencies**.

**Components:**

**1. Models (`models.py`)** - Immutable dataclasses representing core business entities:
- `Document`: Input file representation (bytes, MIME type, precision level)
- `Block`: Single unit of extracted content (text, table, image, etc.)
- `NormalizedDoc`: Vendor-agnostic intermediate format containing blocks
- `Markdown`: Final output with quality score and metadata
- `ExtractionResult`: Result of an extraction attempt (success/failure)
- `TemplateContext`: Template rendering context
- `PrecisionLevel`: Enum (1=FASTEST, 2=BALANCED, 3=TABLE_OPTIMIZED, 4=HIGHEST_QUALITY)
- `BlockType`: Enum (TEXT, TABLE, IMAGE, HEADER, LIST, CODE, METADATA)
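The documented precision values map cleanly onto a standard `IntEnum`; a mirror of the listed levels (a sketch, not the library's own class):

```python
from enum import IntEnum

class PrecisionLevel(IntEnum):
    """Mirror of the documented precision levels."""
    FASTEST = 1
    BALANCED = 2
    TABLE_OPTIMIZED = 3
    HIGHEST_QUALITY = 4
```

Because the members are integers, they compare naturally against the level numbers used throughout this README.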

**2. Ports (`ports.py`)** - Protocol definitions (interfaces):
- `Extractor`: Extract content from documents
- `Policy`: Choose which extractor to use
- `TemplateEngine`: Render markdown from normalized docs
- `QualityScorer`: Calculate quality scores
- `DocumentValidator`: Validate documents
- `TableProfiler`: Analyze tabular data
- `Cache`: Caching operations

**3. Rules (`rules.py`)** - Pure functions for business logic:
- `quality_score()`: Calculate 0.0-1.0 quality score
- `calculate_document_hash()`: Generate content hashes for deduplication
- `hint_has_tables()`: Detect if document likely contains tables
- `normalize_blocks()`: Clean and deduplicate blocks
- `merge_text_blocks()`: Merge consecutive text blocks
- `validate_precision_level()`: Validate precision levels
- `estimate_processing_time()`: Estimate conversion duration
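A deduplication hash like `calculate_document_hash()` can be sketched as a digest over the raw document bytes (illustrative; the library's exact algorithm is not documented here):

```python
import hashlib

def calculate_document_hash(data: bytes) -> str:
    """Content hash for deduplication: identical bytes always yield
    the same digest, so repeat documents can be detected cheaply."""
    return hashlib.sha256(data).hexdigest()
```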

**4. Errors (`errors.py`)** - Domain exception hierarchy:
- `DomainError`: Base exception
- `ConversionFailed`: All extractors failed
- `RecoverableError`: Extractor failed, try fallback
- `UnsupportedFormat`: No extractor available
- `ValidationError`: Invalid document
- `ExtractionError`: Extraction process error
- `TemplateError`: Template rendering error

**Architecture Rule:** Domain layer must NOT import from `app` or `infra` layers. This is enforced by `import-linter`.

#### ⚙️ Application Layer (`src/pydocextractor/app/`)

Orchestrates the conversion workflow using domain ports.

**ConverterService (`service.py`):**

The main orchestration service that coordinates the entire conversion process.

**Dependencies (injected via constructor):**
- `policy: Policy` - Chooses which extractor to use
- `template_engine: TemplateEngine` - Renders markdown
- `quality_scorer: QualityScorer | None` - Calculates quality score
- `table_profilers: Sequence[TableProfiler]` - Analyzes tabular data

**Key Methods:**
- `convert_to_markdown(doc, template_name, allow_fallback)`: Main conversion entry point
- `convert_with_specific_extractor(doc, extractor_name, template_name)`: Force specific extractor
- `list_available_templates()`: List available markdown templates
- `get_supported_formats()`: List supported MIME types

**Conversion Workflow:**
1. Validate document
2. Ask policy to choose extractors (ordered by preference)
3. Try extractors in order until one succeeds
4. Apply table profilers if configured
5. Render markdown using template engine
6. Calculate quality score
7. Return `Markdown` result
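The fallback behavior in step 3 can be sketched as a loop over the policy's ordered list, with plain callables standing in for `Extractor` implementations (a sketch, not the service's actual code):

```python
def try_extractors(extractors, data):
    """Try extractors in preference order and return the first
    successful result; raise only when every extractor has failed."""
    errors = []
    for extract in extractors:
        try:
            result = extract(data)
            if result is not None:
                return result
        except Exception as exc:  # a RecoverableError in the real library
            errors.append(exc)
    raise RuntimeError(f"all {len(extractors)} extractors failed")
```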

**Architecture Rule:** Application layer depends ONLY on domain layer (models + ports). Never imports concrete infrastructure classes.

#### 🔌 Infrastructure Layer (`src/pydocextractor/infra/`)

Concrete implementations of domain ports.

**1. Extractors (`infra/extractors/`)** - 6 implementations of `Extractor` Protocol:

| Extractor | Level | Library | MIME Types | Specialization |
|-----------|-------|---------|------------|----------------|
| `ChunkedParallelExtractor` | 1 (FASTEST) | PyMuPDF | `application/pdf` | Parallel page processing for speed |
| `PyMuPDF4LLMExtractor` | 2 (BALANCED) | pymupdf4llm | `application/pdf` | LLM-optimized extraction (default) |
| `PDFPlumberExtractor` | 3 (TABLE_OPTIMIZED) | pdfplumber | `application/pdf` | Superior table extraction |
| `DoclingExtractor` | 4 (HIGHEST_QUALITY) | Docling | `application/pdf`, DOCX, Excel | Comprehensive layout analysis |
| `PandasCSVExtractor` | 4 (HIGHEST_QUALITY) | pandas | `text/csv` | CSV with column statistics |
| `PandasExcelExtractor` | 4 (HIGHEST_QUALITY) | pandas | Excel (XLS/XLSX) | Multi-sheet with rich metadata |

All extractors implement:
- `extract(data: bytes, precision: PrecisionLevel) -> ExtractionResult`
- `supports(mime: str) -> bool`
- `is_available() -> bool` (checks if dependencies installed)
- Properties: `name`, `precision_level`
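A new extractor only has to match this shape. A hypothetical plain-text extractor with the same interface (the real Protocol returns an `ExtractionResult`; a dict stands in for it here):

```python
class PlainTextExtractor:
    """Hypothetical extractor matching the Extractor Protocol's shape."""
    name = "plaintext"
    precision_level = 1

    def supports(self, mime: str) -> bool:
        return mime == "text/plain"

    def is_available(self) -> bool:
        return True  # no optional dependencies to check

    def extract(self, data: bytes, precision) -> dict:
        # Stand-in for ExtractionResult: decode bytes into one text block.
        return {"success": True, "blocks": [data.decode("utf-8", "replace")]}
```

Because `ConverterService` depends only on the Protocol, such a class can be plugged in without touching the domain or application layers.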

**2. Policy (`infra/policy/heuristics.py`)** - `DefaultPolicy`:

Smart extractor selection logic:

**Selection Strategy:**
- **CSV files** → `PandasCSVExtractor`
- **Excel files** → `PandasExcelExtractor`
- **DOCX files** → `DoclingExtractor`
- **PDF files** (by characteristics):
  - Size > 20MB → `ChunkedParallelExtractor` (Level 1)
  - Size < 2MB → `DoclingExtractor` (Level 4)
  - Has tables → `PDFPlumberExtractor` (Level 3)
  - Default → `PyMuPDF4LLMExtractor` (Level 2)

**Fallback Chain:** If preferred extractor fails, tries: Level 2 → Level 1 → Level 3 → Level 4
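The selection strategy above can be restated as a single decision function (illustrative; this is not `DefaultPolicy`'s actual implementation):

```python
MB = 1024 * 1024

def choose_extractor(mime: str, size_bytes: int, has_tables: bool = False) -> str:
    """Pick an extractor name using the documented heuristics."""
    if mime == "text/csv":
        return "PandasCSVExtractor"
    if "spreadsheetml" in mime or mime == "application/vnd.ms-excel":
        return "PandasExcelExtractor"
    if "wordprocessingml" in mime:
        return "DoclingExtractor"
    # PDF heuristics, checked in order of priority
    if size_bytes > 20 * MB:
        return "ChunkedParallelExtractor"   # Level 1: speed for large files
    if size_bytes < 2 * MB:
        return "DoclingExtractor"           # Level 4: quality for small files
    if has_tables:
        return "PDFPlumberExtractor"        # Level 3: table extraction
    return "PyMuPDF4LLMExtractor"           # Level 2: default
```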

**3. Templates (`infra/templates/engines.py`)** - `Jinja2TemplateEngine`:

Markdown rendering using Jinja2:
- Default templates in `infra/templates/templates/`
- Built-in templates: `simple.j2`, `tabular.j2`
- Custom filters: `word_count`, `char_count`
- Supports custom template directories

**4. Scoring (`infra/scoring/default_scorer.py`)** - `DefaultQualityScorer`:

Calculates a 0.0-1.0 quality score based on:
- **Content Length (25%)**: Document has substantial text
- **Structure (30%)**: Presence of headers and tables
- **Text Quality (25%)**: Average block length and word count
- **Formatting (20%)**: Line structure and markdown formatting
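The weighting above reduces to a clamped weighted sum. The component names below are assumptions for illustration; the actual implementation in `default_scorer.py` computes each dimension from the rendered Markdown:

```python
WEIGHTS = {
    "content_length": 0.25,
    "structure": 0.30,
    "text_quality": 0.25,
    "formatting": 0.20,
}


def combine(components: dict[str, float]) -> float:
    """Weighted sum of per-dimension scores, each clamped to [0.0, 1.0]."""
    return sum(
        weight * min(max(components.get(name, 0.0), 0.0), 1.0)
        for name, weight in WEIGHTS.items()
    )


# A document with everything but formatting loses exactly that 20% weight.
print(round(combine({"content_length": 1.0, "structure": 1.0,
                     "text_quality": 1.0, "formatting": 0.0}), 2))  # → 0.8
```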

**5. Factory (`factory.py`)** - Dependency Injection:

Creates fully configured services:
```python
def create_converter_service(template_dir=None) -> ConverterService:
    # Auto-discovers all available extractors
    policy = DefaultPolicy()
    template_engine = Jinja2TemplateEngine(template_dir)
    quality_scorer = DefaultQualityScorer()

    return ConverterService(
        policy=policy,
        template_engine=template_engine,
        quality_scorer=quality_scorer,
    )
```

Helper functions:
- `get_available_extractors()`: Lists all installed extractors
- `get_extractor_by_level(level)`: Gets specific extractor by precision level

**Graceful Degradation:** If optional dependencies are missing, the affected extractors are excluded, but the library still works with the ones that are available.
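The idea behind this behaves like an import probe: check whether the backing library is importable without actually importing it. A minimal sketch (the library's individual `is_available()` implementations may differ):

```python
import importlib.util


def dependency_available(module_name: str) -> bool:
    """True if an optional dependency can be imported, without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Extractors whose backing library is missing are simply left out of discovery.
available = [name for name in ("json", "definitely_not_installed_xyz")
             if dependency_available(name)]
print(available)  # → ['json']
```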

### Conversion Flow

The following diagram shows how a document flows through the system:

```mermaid
sequenceDiagram
    participant User
    participant Service as ConverterService
    participant Policy as DefaultPolicy
    participant Extractor as Selected Extractor
    participant Template as Jinja2TemplateEngine
    participant Scorer as QualityScorer

    User->>Service: convert_to_markdown(doc)
    Service->>Policy: choose_extractors(doc)
    Policy-->>Service: [extractor_list]

    loop For each extractor (with fallback)
        Service->>Extractor: extract(data, precision)
        alt Extraction Success
            Extractor-->>Service: ExtractionResult(success=True)
            Note over Service: Break loop
        else Extraction Failed
            Extractor-->>Service: ExtractionResult(success=False)
            Note over Service: Try next extractor
        end
    end

    Service->>Template: render(blocks, metadata)
    Template-->>Service: Markdown text

    Service->>Scorer: calculate_quality(markdown)
    Scorer-->>Service: quality_score (0.0-1.0)

    Service-->>User: Markdown(text, score, metadata)
```

### Dependency Injection Flow

This diagram shows how components are created and wired together:

```mermaid
graph TB
    subgraph Factory["🏭 Factory (Composition Root)"]
        F[create_converter_service]
    end

    subgraph Creation["Component Creation"]
        F -->|1. Create| P[DefaultPolicy]
        F -->|2. Create| TE[Jinja2TemplateEngine]
        F -->|3. Create| QS[DefaultQualityScorer]
        F -->|4. Inject & Assemble| CS[ConverterService]

        P -->|discovers| E1[ChunkedParallelExtractor]
        P -->|discovers| E2[PyMuPDF4LLMExtractor]
        P -->|discovers| E3[PDFPlumberExtractor]
        P -->|discovers| E4[DoclingExtractor]
        P -->|discovers| E5[PandasCSVExtractor]
        P -->|discovers| E6[PandasExcelExtractor]
    end

    subgraph Runtime["Runtime (Protocol-based)"]
        CS -->|uses Protocol| PP[Policy Protocol]
        CS -->|uses Protocol| EP[Extractor Protocol]
        CS -->|uses Protocol| TEP[TemplateEngine Protocol]
        CS -->|uses Protocol| QSP[QualityScorer Protocol]

        PP -.implemented by.-> P
        EP -.implemented by.-> E1
        EP -.implemented by.-> E2
        EP -.implemented by.-> E3
        EP -.implemented by.-> E4
        EP -.implemented by.-> E5
        EP -.implemented by.-> E6
        TEP -.implemented by.-> TE
        QSP -.implemented by.-> QS
    end

    subgraph UserCode["User Code"]
        U[Your Application]
        U -->|calls| F
        F -->|returns| CS2[Fully Configured<br/>ConverterService]
        CS2 -->|ready to use| U
    end

    classDef factory fill:#fff9c4,stroke:#f57f17,stroke-width:2px
    classDef creation fill:#e1f5ff,stroke:#01579b,stroke-width:2px
    classDef runtime fill:#f3e5f5,stroke:#4a148c,stroke-width:2px
    classDef user fill:#e8f5e9,stroke:#1b5e20,stroke-width:2px

    class F factory
    class P,TE,QS,CS,E1,E2,E3,E4,E5,E6 creation
    class CS,PP,EP,TEP,QSP runtime
    class U,CS2 user
```

**Key Principles:**

1. **Factory Pattern**: `create_converter_service()` is the composition root
2. **Protocol-Based**: Service depends on protocols, not concrete types
3. **Auto-Discovery**: Policy automatically finds all available extractors
4. **Graceful Degradation**: Missing dependencies = extractor excluded
5. **Testability**: Easy to inject mocks/fakes for testing

### Practical Architecture Example

```python
# Domain Layer - Pure business logic
from pydocextractor.domain.models import Document, PrecisionLevel
from pydocextractor.domain.ports import Extractor  # Protocol

# Infrastructure Layer - Concrete implementations
from pydocextractor.infra.extractors.pymupdf4llm_adapter import PyMuPDF4LLMExtractor
from pydocextractor.infra.policy.heuristics import DefaultPolicy

# Application Layer - Orchestration
from pydocextractor.app.service import ConverterService

# Dependency Injection in action
extractor: Extractor = PyMuPDF4LLMExtractor()  # Depends on Protocol
policy = DefaultPolicy()  # Chooses which extractor
service = ConverterService(policy=policy, ...)

# Usage - clean and simple
doc = Document(bytes=..., mime="application/pdf", ...)
result = service.convert_to_markdown(doc)
```

### Architecture Validation

The architecture is continuously validated:

```bash
just guard      # Enforces layer boundaries with import-linter
just typecheck  # Validates Protocol compliance with mypy --strict
just test       # Ensures all layers work together
just check      # Run all quality checks (format, lint, types, guard)
```

**Learn More:** See the detailed Architecture section below for:
- Layer responsibilities and rules
- Protocol definitions
- Component relationships
- Dependency injection patterns

## Development Workflow

### Installation Commands

```bash
# Development setup
just bootstrap        # Install all dev dependencies
just install          # Install package in editable mode with all extras
just install-dev      # Install package (minimal, no optional deps)
just install-prod     # Install for production (non-editable)
```

### Code Quality

```bash
just fmt              # Format code with ruff
just lint             # Lint code with ruff
just fix              # Auto-fix linting issues
just typecheck        # Type check with mypy
just guard            # Verify architectural boundaries
just check            # Run all quality checks (fmt, lint, types, guard)
```

### Testing

```bash
just test             # Run all tests
just test-unit        # Domain + app tests (fast)
just test-adapters    # Infrastructure tests
just test-contract    # Protocol compliance tests
just test-bdd         # BDD tests
just test-integration # Integration tests
just test-cov         # With coverage report
just coverage-check   # Verify 70% coverage threshold
```

### Utilities

```bash
just build            # Build package distribution
just clean            # Remove build artifacts and cache
just layers           # Show architecture layers
just stats            # Project statistics
```

### Workflows

```bash
just ci               # Full CI pipeline locally
just pre-commit       # Pre-commit checks (fmt + check + test)
```

## Testing

Comprehensive test suite following hexagonal architecture:

### Test Structure

```
tests/
├── unit/              # Pure unit tests (no infrastructure)
│   ├── domain/       # Domain layer tests
│   └── app/          # Application layer tests (mocked)
├── adapters/         # Infrastructure adapter tests
├── contract/         # Protocol compliance tests
├── integration/      # End-to-end tests
└── bdd/              # BDD tests with pytest-bdd
    ├── features/     # Gherkin scenarios
    └── steps/        # Step definitions
```

### Running Tests

```bash
# All tests
just test

# By category
just test-unit           # Unit tests only
just test-adapters       # Adapter tests
just test-bdd            # BDD tests
just test-integration    # Integration tests

# With coverage
just test-cov            # Generate coverage report
just test-unit-coverage  # Unit tests with coverage

# Check coverage threshold (70%)
just coverage-check
```

### BDD Tests

Behavior-Driven Development tests using Gherkin:

```gherkin
Scenario: Convert a text-based PDF to Markdown
  Given I have a PDF file "Company_Handbook.pdf"
  When I submit the file for extraction
  Then the service produces a Markdown document
  And a content ID is generated and returned
```

See [tests/bdd/README.md](tests/bdd/README.md) for BDD documentation.

## Extending the Library

### Adding a Custom Extractor

Implement the `Extractor` Protocol to add support for new file formats:

```python
from pydocextractor.domain.ports import ExtractionResult
from pydocextractor.domain.models import (
    PrecisionLevel,
    NormalizedDoc,
    Block,
    BlockType,
)

class MyCustomExtractor:
    """Custom extractor implementing Extractor Protocol."""

    @property
    def name(self) -> str:
        return "MyCustomExtractor"

    @property
    def precision_level(self) -> PrecisionLevel:
        return PrecisionLevel.HIGHEST_QUALITY

    def is_available(self) -> bool:
        # Check if required dependencies are installed
        try:
            import my_custom_library
            return True
        except ImportError:
            return False

    def supports(self, mime: str) -> bool:
        return mime == "application/custom"

    def extract(self, data: bytes, precision: PrecisionLevel) -> ExtractionResult:
        import time
        start = time.time()

        try:
            # Your extraction logic here
            extracted_text = self._extract_content(data)

            blocks = (
                Block(type=BlockType.TEXT, content=extracted_text),
            )
            ndoc = NormalizedDoc(blocks=blocks, source_mime="application/custom")

            return ExtractionResult(
                success=True,
                normalized_doc=ndoc,  # Note: 'normalized_doc', not 'ndoc'
                extractor_name=self.name,
                processing_time_seconds=time.time() - start,
            )
        except Exception as e:
            return ExtractionResult(
                success=False,
                normalized_doc=None,
                error=str(e),
                extractor_name=self.name,
                processing_time_seconds=time.time() - start,
            )

    def _extract_content(self, data: bytes) -> str:
        # Implement your extraction logic
        return "Extracted text from custom format"

# Using custom extractor (not directly injectable into DefaultPolicy)
# You would need to create a custom policy or use the extractor directly:
from pydocextractor.domain.models import Document

extractor = MyCustomExtractor()
if extractor.is_available() and extractor.supports("application/custom"):
    doc = Document(
        bytes=b"...",
        mime="application/custom",
        size_bytes=100,
        precision=PrecisionLevel.HIGHEST_QUALITY,
    )
    result = extractor.extract(doc.bytes, doc.precision)
```

**Note:** The current `DefaultPolicy` hardcodes extractors. To use custom extractors in the service, you would need to create a custom policy implementation.
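A hypothetical custom policy could look like the following. This sketch assumes the `Policy` port exposes `choose_extractors(doc)` returning an ordered extractor list (as the conversion-flow diagram shows); the stub types stand in for the real domain models:

```python
from dataclasses import dataclass


@dataclass
class StubDoc:
    """Stand-in for the domain Document; only the field the policy needs."""
    mime: str


class StubExtractor:
    """Stand-in extractor exposing the two methods the policy consults."""

    def __init__(self, name: str, mimes: set[str], available: bool = True):
        self.name = name
        self._mimes = mimes
        self._available = available

    def is_available(self) -> bool:
        return self._available

    def supports(self, mime: str) -> bool:
        return mime in self._mimes


class CustomFormatPolicy:
    """Prefer a custom extractor for its MIME type, then fall back to the rest."""

    def __init__(self, custom, fallbacks):
        self._custom = custom
        self._fallbacks = list(fallbacks)

    def choose_extractors(self, doc):
        chain = []
        if self._custom.is_available() and self._custom.supports(doc.mime):
            chain.append(self._custom)
        chain.extend(e for e in self._fallbacks
                     if e.is_available() and e.supports(doc.mime))
        return chain


custom = StubExtractor("MyCustomExtractor", {"application/custom"})
pdf = StubExtractor("PyMuPDF4LLMExtractor", {"application/pdf"})
policy = CustomFormatPolicy(custom, [pdf])
print([e.name for e in policy.choose_extractors(StubDoc("application/custom"))])
# → ['MyCustomExtractor']
```

With the real domain types in place of the stubs, such a policy can be injected into `ConverterService` like any other `Policy` implementation.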

### Custom Template

Create a Jinja2 template in `templates/`:

```jinja2
{# my_template.j2 #}
# {{ metadata.filename }}

{% for block in blocks %}
{{ block.content }}

{% endfor %}

---
Quality: {{ quality_score }}
Extractor: {{ metadata.extractor }}
```

Use it:

```python
result = service.convert_to_markdown(doc, template_name="my_template")
```

## Project Structure

```
pyDocExtractor/
├── src/pydocextractor/
│   ├── domain/              # Pure domain layer
│   │   ├── models.py        # Immutable dataclasses
│   │   ├── ports.py         # Protocol definitions
│   │   ├── rules.py         # Pure functions
│   │   └── errors.py        # Domain exceptions
│   ├── app/                 # Application layer
│   │   └── service.py       # ConverterService
│   ├── infra/               # Infrastructure layer
│   │   ├── extractors/      # 6 extractor adapters
│   │   ├── policy/          # Selection logic
│   │   ├── templates/       # Jinja2 templates
│   │   └── scoring/         # Quality scoring
│   ├── factory.py           # Dependency injection
│   └── cli.py               # CLI interface
├── tests/                   # Hexagonal test suite
├── test_documents/          # Real test documents
└── pyproject.toml           # Project configuration
```

## Configuration

### pyproject.toml

- **Strict mypy**: Enforced on domain layer
- **Ruff linting**: Per-layer rules
- **Import linter**: Enforces architectural boundaries
- **Extras model**: Optional dependencies

### Architectural Boundaries

Enforced via `import-linter`:

```ini
[importlinter:contract:domain-independence]
# Domain MUST NOT import from app or infra
type = forbidden
source_modules = pydocextractor.domain
forbidden_modules = pydocextractor.infra, pydocextractor.app
```

Verify with:
```bash
just guard
```

## Performance

Benchmarks on typical documents:

| Document Size | Level 1 | Level 2 | Level 3 | Level 4 |
|--------------|---------|---------|---------|---------|
| 200 KB       | 0.1s    | 0.5s    | 1.2s    | 45s     |
| 1.2 MB       | 0.3s    | 2.1s    | 5.2s    | 180s    |
| 3.2 MB       | 1.8s    | 8.4s    | 25.1s   | 900s    |
| 13 MB        | 4.2s    | 35.7s   | 120s    | 3600s   |

## Quality Scoring

Documents are scored 0.0-1.0 based on:

- **Content Length** (25%): Substantial extracted text
- **Structure** (30%): Headings and paragraphs
- **Text Quality** (25%): Average block length
- **Formatting** (20%): Lists and tables

```python
result = service.convert_to_markdown(doc)
if result.quality_score > 0.8:
    print("High quality conversion")
```

## Contributing

We welcome contributions! Whether you're fixing bugs, adding new features, or improving documentation, your help is appreciated.

### Quick Start for Contributors

```bash
# Clone and setup
git clone https://github.com/AminiTech/pyDocExtractor.git
cd pyDocExtractor
just bootstrap

# Make your changes, then verify
just fmt           # Format code
just check         # Run quality checks
just test          # Run tests
just guard         # Verify architecture
```

### Common Contribution Scenarios

#### Adding Support for a New Document Type

1. Create a new extractor in `src/pydocextractor/infra/extractors/`
2. Implement the `Extractor` Protocol (see [docs/CONTRIBUTING.md](docs/CONTRIBUTING.md#how-to-add-support-for-a-new-document-type))
3. Update the selection policy in `src/pydocextractor/infra/policy/heuristics.py`
4. Add tests and sample documents

#### Creating a Custom Template

1. Add a Jinja2 template to `src/pydocextractor/infra/templates/templates/`
2. Use available context variables: `blocks`, `metadata`, `quality_score`
3. Test with various document types

#### Modifying Quality Scoring

1. Create a new scorer in `src/pydocextractor/infra/scoring/`
2. Implement the `QualityScorer` Protocol
3. Inject via `ConverterService` constructor
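As a sketch, a drop-in scorer only needs the method the service calls. The method name `calculate_quality` follows the conversion-flow diagram above, and the length-only heuristic here is purely illustrative:

```python
class LengthOnlyScorer:
    """Illustrative QualityScorer: rewards content length and nothing else."""

    def calculate_quality(self, markdown: str) -> float:
        # Saturate at 2000 characters, clamping the score to [0.0, 1.0].
        return min(len(markdown) / 2000.0, 1.0)


scorer = LengthOnlyScorer()
print(scorer.calculate_quality("x" * 500))  # → 0.25
```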

### Architecture Guidelines

pyDocExtractor follows **Hexagonal Architecture**:

- **Domain Layer** (`src/pydocextractor/domain/`) - Pure business logic, no external dependencies
- **Application Layer** (`src/pydocextractor/app/`) - Orchestrates workflows using domain ports
- **Infrastructure Layer** (`src/pydocextractor/infra/`) - Concrete implementations (extractors, templates, etc.)

**Rule:** Domain layer must NOT import from `app` or `infra` layers (enforced by `import-linter`).

### For Detailed Information

See **[docs/CONTRIBUTING.md](docs/CONTRIBUTING.md)** for comprehensive guides on:

- Project structure and what each folder does
- Step-by-step guide to add new document type support
- How to create custom templates and quality scorers
- Testing guidelines and best practices
- Code quality standards
- Pull request process

### Questions?

- **Issues**: [GitHub Issues](https://github.com/AminiTech/pyDocExtractor/issues)
- **Discussions**: [GitHub Discussions](https://github.com/AminiTech/pyDocExtractor/discussions)

## Documentation

- **[docs/CONTRIBUTING.md](docs/CONTRIBUTING.md)** - How to contribute to the project
- **[docs/CONTRIBUTING_GUIDE.md](docs/CONTRIBUTING_GUIDE.md)** - Detailed contribution guide with architecture reference
- **[docs/TEMPLATES.md](docs/TEMPLATES.md)** - Template system guide with Jinja2 examples
- **[tests/bdd/README.md](tests/bdd/README.md)** - BDD testing guide
- See the Architecture section above for hexagonal architecture details

## License

MIT License - see [LICENSE](LICENSE) for details.

## Credits

Extracted from the [Amini Ingestion KGraph](https://github.com/AminiTech/amini-ingestion-kgraph) project.

Built with:
- [PyMuPDF](https://github.com/pymupdf/PyMuPDF) - Fast PDF processing
- [pymupdf4llm](https://github.com/pymupdf/pymupdf4llm) - LLM-optimized extraction
- [pdfplumber](https://github.com/jsvine/pdfplumber) - Table extraction
- [Docling](https://github.com/DS4SD/docling) - Highest quality conversion
- [pytest-bdd](https://github.com/pytest-dev/pytest-bdd) - BDD testing

## Support

- **Issues**: [GitHub Issues](https://github.com/AminiTech/pyDocExtractor/issues)
- **Documentation**: [GitHub Wiki](https://github.com/AminiTech/pyDocExtractor/wiki)
- **Discussions**: [GitHub Discussions](https://github.com/AminiTech/pyDocExtractor/discussions)

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "pydocextractor",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.11",
    "maintainer_email": null,
    "keywords": "conversion, document, extractor, markdown, pdf",
    "author": null,
    "author_email": "Leonardo Araujo <leo@amini.ai>",
    "download_url": "https://files.pythonhosted.org/packages/f5/1f/f562777d05f51203574b7289a58fc6f4a3ceece4bf517ec1edb6edfbb2e4/pydocextractor-0.1.1.tar.gz",
    "platform": null,
    "description": "# pyDocExtractor\n\n**A Python library for converting documents (PDF, DOCX, XLSX) to Markdown with multiple precision levels.**\n\nBuilt with **Hexagonal Architecture** for maximum testability, flexibility, and maintainability.\n\n[![CI](https://github.com/AminiTech/pyDocExtractor/actions/workflows/ci.yml/badge.svg)](https://github.com/AminiTech/pyDocExtractor/actions/workflows/ci.yml)\n[![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)\n[![Coverage](https://img.shields.io/badge/coverage-87%25-brightgreen.svg)](https://github.com/AminiTech/pyDocExtractor)\n[![Tests](https://img.shields.io/badge/tests-260%2B%20passing-brightgreen.svg)](https://github.com/AminiTech/pyDocExtractor)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n[![Code style: ruff](https://img.shields.io/badge/code%20style-ruff-000000.svg)](https://github.com/astral-sh/ruff)\n\n## Features\n\n- **4 Precision Levels** - Choose between speed and quality\n- **LLM Image Description** - Automatic AI-powered image descriptions using OpenAI-compatible APIs\n- **Hexagonal Architecture** - Clean separation of concerns with Protocol-based ports\n- **Automatic Selection** - Smart converter selection based on file characteristics\n- **Quality Scoring** - 0-1.0 quality scores for converted content\n- **Fallback Chain** - Automatic fallback if preferred converter fails\n- **Multi-Format Support** - PDF, DOCX, XLSX, XLS\n- **CLI & Python API** - Use as command-line tool or library\n- **Dependency Injection** - Easy testing and customization\n- **Extras Model** - Install only what you need\n\n## Installation\n\n### Basic Installation\n\nNote: Make sure to use CodeArtifact since the package is not available on PyPi.\n\n```bash\npip install pydocextractor\n```\n\nNote: If you encounter NumPy compatibility issues, try installing in a clean virtual environment using `uv venv` for better 
dependency management.\n\n### With Specific Extractors\n\nNote: DOCX support actually requires docling `[doc]` extra, not `[docx]`\n\n```bash\n# PDF support only\npip install pydocextractor[pdf]\n\n# DOCX support\npip install pydocextractor[docx]\n\n# Excel support\npip install pydocextractor[xlsx]\n\n# CLI tools\npip install pydocextractor[cli]\n\n# LLM support for image descriptions\npip install pydocextractor[llm]\n\n# Everything\npip install pydocextractor[all]\n```\n\n### Install from GitHub Repository\n\nIf you want to use the latest version directly from the repository:\n\nNote: Make sure to have the correct credentials installed on your machine to be able to access the Github repository. You need either SSH key configured for `git@github.com` access, OR Personal Access Token for HTTPS access.\n\n#### Using uv (Recommended)\n\n```bash\n# Install with all features\nuv venv\nuv pip install \"pydocextractor[all] @ git+https://github.com/AminiTech/pyDocExtractor.git\"\n```\n\n```bash\n# Install with specific extras\nuv venv\nuv pip install \"pydocextractor[pdf,cli] @ git+https://github.com/AminiTech/pyDocExtractor.git\"\n```\n\n```bash\n# Install specific branch or tag\nuv venv\nuv pip install \"pydocextractor[all] @ git+https://github.com/AminiTech/pyDocExtractor.git@main\"\nuv pip install \"pydocextractor[all] @ git+https://github.com/AminiTech/pyDocExtractor.git@v1.0.0\"\n```\n\n```bash\n# Install in editable mode for local changes\ngit clone https://github.com/AminiTech/pyDocExtractor.git\ncd pyDocExtractor\nuv venv\nuv pip install -e \".[all]\"\n```\n\n#### Using pip\n\nNote: Make sure to have the correct credentials installed on your machine to be able to access the Github repository. 
You need either SSH key configured for `git@github.com` access, OR Personal Access Token for HTTPS access.\n\n```bash\n# Install with all features\npip install \"git+https://github.com/AminiTech/pyDocExtractor.git#egg=pydocextractor[all]\"\n\n# Install with specific extras\npip install \"git+https://github.com/AminiTech/pyDocExtractor.git#egg=pydocextractor[pdf,cli]\"\n\n# Install in editable mode\ngit clone https://github.com/AminiTech/pyDocExtractor.git\ncd pyDocExtractor\npip install -e \".[all]\"\n```\n\n#### Using requirements.txt\n\nAdd to your `requirements.txt`:\n\nNote: Make sure to have the correct credentials installed on your machine to be able to access the Github repository. You need either SSH key configured for `git@github.com` access, OR Personal Access Token for HTTPS access.\n\n```txt\n# Latest from main branch with all features\npydocextractor[all] @ git+https://github.com/AminiTech/pyDocExtractor.git\n\n# Specific version/tag\npydocextractor[all] @ git+https://github.com/AminiTech/pyDocExtractor.git@v1.0.0\n\n# With specific extras only\npydocextractor[pdf,xlsx] @ git+https://github.com/AminiTech/pyDocExtractor.git\n```\n\nThen install:\n\n```bash\n# With uv\nuv pip install -r requirements.txt\n\n# With pip\npip install -r requirements.txt\n```\n\n#### Using pyproject.toml\n\nAdd to your `pyproject.toml`:\n\n```toml\n[project]\ndependencies = [\n    \"pydocextractor[all] @ git+https://github.com/AminiTech/pyDocExtractor.git\",\n]\n\n# Or specify extras\ndependencies = [\n    \"pydocextractor[pdf,xlsx,cli] @ git+https://github.com/AminiTech/pyDocExtractor.git@main\",\n]\n```\n\nThen install:\n\n```bash\n# With uv (recommended)\nuv sync\n\n# With pip\npip install .\n```\n\n### Development Setup\n\nFor contributors and developers who want to modify the library:\n\n**Prerequisites:** Install [just](https://github.com/casey/just) command runner:\n```bash\n# macOS\nbrew install just\n\n# Linux\ncurl --proto '=https' --tlsv1.2 -sSf 
https://just.systems/install.sh | bash -s -- --to /usr/local/bin\n\n# Windows\nscoop install just\n# or: choco install just\n```\n\n**Setup:**\n```bash\n# Clone repository\ngit clone https://github.com/AminiTech/pyDocExtractor.git\ncd pyDocExtractor\n\n# Bootstrap development environment (installs all dependencies)\njust bootstrap\n\n# Run tests\njust test\n\n# Run quality checks\njust check\n\n# See all available commands\njust --list\n```\n\nNote: All `just` commands tested and working. Architecture validation with `just guard`. Quality checks with `just check`. Comprehensive testing shows 18/18 test categories passed (100% success rate).\n\n## Quick Start\n\n### CLI Usage\n\n```bash\n# Convert a document\npydocextractor convert document.pdf\n\n# Specify precision level (1-4)\npydocextractor convert document.pdf --level 2\n\n# Custom output file\npydocextractor convert document.pdf -o output.md\n\n# Show quality score\npydocextractor convert document.pdf --show-score\n\n# Batch convert directory with pattern matching (includes timing info)\npydocextractor batch input_dir/ output_dir/\n\n# Batch convert with custom pattern\npydocextractor batch input_dir/ output_dir/ --pattern \"*.pdf\"\n\n# Batch convert DOCX files with highest quality\npydocextractor batch input_dir/ output_dir/ --pattern \"*.docx\" --level 4\n\n# Check converter status\npydocextractor status\n\n# Document info\npydocextractor info document.pdf\n\n# Note: Image descriptions require LLM configuration (see LLM Image Description section)\n```\n\n### Python API (Hexagonal)\n\n```python\nfrom pathlib import Path\nfrom pydocextractor.domain.models import Document, PrecisionLevel\nfrom pydocextractor.factory import create_converter_service\n\n# Create service (uses dependency injection)\nservice = create_converter_service()\n\n# Load document\nfile_path = Path(\"document.pdf\")\ndoc = Document(\n    bytes=file_path.read_bytes(),\n    mime=\"application/pdf\",\n    size_bytes=file_path.stat().st_size,\n 
   precision=PrecisionLevel.BALANCED,\n    filename=file_path.name,\n)\n\n# Convert to Markdown\nresult = service.convert_to_markdown(doc)\n\n# Access results\nprint(result.text)              # Markdown text\nprint(result.quality_score)     # Quality score (0.0-1.0)\nprint(result.metadata)          # Additional metadata\n```\n\n### Using from Another Python Program\n\nIf you're using pyDocExtractor as a library in your Python application:\n\n```python\nfrom pathlib import Path\nfrom pydocextractor import (\n    Document,\n    PrecisionLevel,\n    create_converter_service,\n    get_available_extractors,\n)\n\n# Create service using factory (recommended)\n# Automatically loads LLM config from config.env or .env if present\nservice = create_converter_service()\n\n# Load your document\nfile_path = Path(\"your_document.pdf\")\ndoc = Document(\n    bytes=file_path.read_bytes(),\n    mime=\"application/pdf\",\n    size_bytes=file_path.stat().st_size,\n    precision=PrecisionLevel.BALANCED,\n    filename=file_path.name,\n)\n\n# Convert to Markdown\n# If LLM is configured, images will be automatically described\nresult = service.convert_to_markdown(doc)\n\n# Access results\nmarkdown_text = result.text\nquality = result.quality_score\nextractor_used = result.metadata.get(\"extractor\")\n\n# Check if images were described\nhas_image_descriptions = \"Image Description\" in markdown_text\n\n# For CSV/Excel files, use tabular template\nexcel_doc = Document(\n    bytes=Path(\"data.xlsx\").read_bytes(),\n    mime=\"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet\",\n    size_bytes=Path(\"data.xlsx\").stat().st_size,\n    precision=PrecisionLevel.HIGHEST_QUALITY,\n    filename=\"data.xlsx\",\n)\nresult = service.convert_to_markdown(excel_doc, template_name=\"tabular\")\n```\n\n### Custom Configuration (Advanced)\n\nFor advanced users who need custom service configuration:\n\n```python\nfrom pydocextractor.app.service import ConverterService\nfrom 
pydocextractor.infra.policy.heuristics import DefaultPolicy\nfrom pydocextractor.infra.templates.engines import Jinja2TemplateEngine\nfrom pydocextractor.infra.scoring.default_scorer import DefaultQualityScorer\nfrom pathlib import Path\n\n# Create components manually\npolicy = DefaultPolicy()  # Auto-discovers all available extractors\ntemplate_engine = Jinja2TemplateEngine()\nquality_scorer = DefaultQualityScorer()\n\n# Assemble service with custom components\nservice = ConverterService(\n    policy=policy,\n    template_engine=template_engine,\n    quality_scorer=quality_scorer,\n)\n\n# Use with custom template directory\ncustom_templates = Path(\"my_custom_templates/\")\nservice_with_custom_templates = create_converter_service(template_dir=custom_templates)\n\n# Use custom service\nresult = service.convert_to_markdown(doc, template_name=\"simple\")\n```\n\n## Precision Levels\n\n| Level | Name              | Speed          | Quality | Use Case                    | Table Statistics |\n|-------|-------------------|----------------|---------|----------------------------|------------------|\n| 1     | FASTEST           | \u26a1 0.1s - 4.2s  | Basic   | Large files, quick preview | \u274c No            |\n| 2     | BALANCED (default)| \u2699\ufe0f 0.5s - 35.7s | Good    | General purpose            | \u2705 Yes           |\n| 3     | TABLE_OPTIMIZED   | \ud83d\udc0c 1.2s - 120s  | High    | Complex tables             | \u2705 Yes           |\n| 4     | HIGHEST_QUALITY   | \ud83d\udc22 45s - 3600s  | Maximum | Small files, archival      | \u2705 Yes           |\n\nNote: Performance expectations are realistic. Large files (>20MB) automatically use Level 1 for speed. Small files (<2MB) automatically use Level 4 for quality.\n\n**Important Notes:**\n- **Level 1 (FASTEST)** prioritizes speed and does not detect tables or generate statistics. 
Use Level 2+ if you need table analysis.\n- **Level 2-4** automatically detect tables and generate comprehensive statistics including min/max/mean/std for numerical data and frequency distributions for categorical data.\n\n**Automatic Selection:**\n- Small files (<2MB) \u2192 Level 4 (Docling)\n- Large files (>20MB) \u2192 Level 1 (ChunkedParallel)\n- Files with tables \u2192 Level 3 (PDFPlumber)\n- Default \u2192 Level 2 (PyMuPDF4LLM)\n\n## Template System\n\npyDocExtractor uses **Jinja2 templates** to control how extracted content is formatted into Markdown. This gives you complete control over the output structure and presentation.\n\n### Built-in Templates\n\n| Template | Purpose | Best For |\n|----------|---------|----------|\n| **simple** | Minimal formatting, just content | Quick conversions, plain text |\n| **default** | Enhanced formatting with metadata | PDF/DOCX documents, structured output |\n| **tabular** | Specialized for data with statistics | CSV/Excel files, spreadsheets |\n\n### Using Templates\n\n**In Code:**\n```python\nfrom pydocextractor import create_converter_service\n\nservice = create_converter_service()\n\n# Use specific template\nresult = service.convert_to_markdown(doc, template_name=\"simple\")\n\n# Use tabular template for CSV/Excel\nresult = service.convert_to_markdown(doc, template_name=\"tabular\")\n\n# List available templates\ntemplates = service.list_available_templates()\n```\n\nNote: Template names include `.j2` extension (standard Jinja2 convention). 
Non-existent templates cause extraction failure (prevents silent errors).\n\n**In CLI:**\n```bash\n# Use specific template\npydocextractor convert document.pdf --template simple\n\n# Templates auto-selected for CSV/Excel\npydocextractor convert data.xlsx  # Uses tabular template automatically\n```\n\n### Creating Custom Templates\n\nCreate a Jinja2 template file in `src/pydocextractor/infra/templates/templates/`:\n\n```jinja2\n{#- my_custom.j2 -#}\n# {{ metadata.filename }}\n\n{% for block in blocks %}\n{{ block.content }}\n\n{% endfor %}\n\n---\n*Quality: {{ quality_score }}*\n```\n\n**Available Context Variables:**\n- `blocks` - List of content blocks (text, tables, images)\n- `metadata` - Document metadata (filename, extractor, stats)\n- `quality_score` - Quality score (0.0-1.0)\n- `has_tables` - Boolean indicating tables present\n- `has_images` - Boolean indicating images present\n- `page_count` - Number of pages (if applicable)\n\n### Custom Template Directory\n\n```python\nfrom pathlib import Path\nfrom pydocextractor import create_converter_service\n\n# Use templates from custom directory\ncustom_dir = Path(\"my_templates/\")\nservice = create_converter_service(template_dir=custom_dir)\n\nresult = service.convert_to_markdown(doc, template_name=\"my_custom\")\n```\n\n**For detailed information**, see **[docs/TEMPLATES.md](docs/TEMPLATES.md)** for:\n- Complete template context reference\n- Advanced Jinja2 techniques\n- Template filters and macros\n- Best practices and examples\n\n## CSV and Excel Support\n\npyDocExtractor includes specialized extractors for tabular data with rich statistical analysis:\n\nNote: PSV/TSV files are not supported by default. Only CSV files with comma delimiters are supported. 
For PSV/TSV, consider converting to CSV first.\n\n### Features\n\n- **Multi-sheet Excel support** - Process XLSX/XLS files with multiple sheets\n- **Comma-delimited CSV parsing** - Parses standard CSV automatically (TSV/PSV not supported)\n- **Statistical summaries** - Min/max/mean for numeric columns, mode for categorical\n- **Type inference** - Automatic detection of numerical vs categorical columns\n- **Tabular template** - Dedicated Markdown template with formatted tables\n\n### Usage\n\n```python\nfrom pathlib import Path\nfrom pydocextractor import Document, PrecisionLevel, create_converter_service\n\nservice = create_converter_service()\n\n# Excel file\nexcel_doc = Document(\n    bytes=Path(\"Sales_2025.xlsx\").read_bytes(),\n    mime=\"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet\",\n    size_bytes=Path(\"Sales_2025.xlsx\").stat().st_size,\n    precision=PrecisionLevel.HIGHEST_QUALITY,\n    filename=\"Sales_2025.xlsx\",\n)\nresult = service.convert_to_markdown(excel_doc, template_name=\"tabular\")\n\n# CSV file\ncsv_doc = Document(\n    bytes=Path(\"customers.csv\").read_bytes(),\n    mime=\"text/csv\",\n    size_bytes=Path(\"customers.csv\").stat().st_size,\n    precision=PrecisionLevel.BALANCED,\n    filename=\"customers.csv\",\n)\nresult = service.convert_to_markdown(csv_doc, template_name=\"tabular\")\n```\n\n### CLI Usage\n\nThe CLI automatically selects the tabular template for CSV/Excel files:\n\n```bash\n# Excel - auto-selects tabular template\npydocextractor convert Sales_2025.xlsx\n\n# CSV - auto-selects tabular template\npydocextractor convert customers.csv\n\n# Force specific template\npydocextractor convert data.xlsx --template simple\n```\n\n### Output Example\n\nThe tabular template generates structured Markdown with:\n- YAML frontmatter with file metadata\n- Per-sheet summary tables\n- Column statistics (min/max/mean/mode)\n- Data type classification\n- Quality score\n\n## LLM Image Description\n\npyDocExtractor can automatically 
describe images in documents using OpenAI-compatible multimodal LLMs. This feature provides context-aware descriptions by analyzing images alongside the surrounding text.\n\n### Features\n\n- **Context-Aware Descriptions** - LLM receives the previous 100 lines of text as context\n- **Multi-Format Support** - Works with images in PDF and DOCX files\n- **Automatic Image Resizing** - Images resized to 1024x1024 with aspect ratio preservation\n- **Cost Control** - Configurable limit on images per document (default: 5)\n- **Graceful Degradation** - System works normally if LLM is not configured\n- **Multi-Level Support** - Works with all extractors (Docling, PyMuPDF4LLM, PDFPlumber)\n\nNote: Image description makes real API calls (verified against the OpenAI API) and requires documents that actually contain images.\n\n### Installation\n\nInstall the LLM extra to enable image description:\n\n```bash\npip install pydocextractor[llm]\n```\n\nThis installs:\n- `httpx` - Synchronous HTTP client for API calls\n- `python-dotenv` - Environment configuration\n- `pillow` - Image processing and resizing\n\n### Configuration\n\nCreate a `config.env` file in your project directory:\n\n```bash\n# Enable LLM image description\nLLM_ENABLED=true\n\n# OpenAI API (or compatible endpoint)\nLLM_API_URL=https://api.openai.com/v1/chat/completions\nLLM_API_KEY=your-api-key-here\n\n# Model configuration\nLLM_MODEL_NAME=gpt-4o-mini\n\n# Optional settings\nLLM_MAX_IMAGES=5           # Max images per document (cost control)\nLLM_CONTEXT_LINES=100      # Lines of context to provide\nLLM_IMAGE_SIZE=1024        # Image resize dimension (1024x1024)\nLLM_TIMEOUT=30             # Request timeout in seconds\nLLM_MAX_RETRIES=3          # Retry attempts on failure\n```\n\n**Important:** Add `config.env` to your `.gitignore` to avoid committing API keys:\n\n```bash\necho \"config.env\" >> .gitignore\n```\n\n### Usage in Python\n\nWhen using 
pyDocExtractor as a library, the LLM configuration is automatically loaded from `config.env` (or `.env`) in the current directory:\n\n```python\nfrom pathlib import Path\nfrom pydocextractor import Document, PrecisionLevel, create_converter_service\n\n# Service automatically loads LLM config from config.env\nservice = create_converter_service()\n\n# Convert document with images\ndoc_path = Path(\"document_with_images.pdf\")\ndoc = Document(\n    bytes=doc_path.read_bytes(),\n    mime=\"application/pdf\",\n    size_bytes=doc_path.stat().st_size,\n    precision=PrecisionLevel.BALANCED,\n    filename=doc_path.name,\n)\n\n# Images will be automatically described if LLM is configured\nresult = service.convert_to_markdown(doc)\n\n# Check for image descriptions in output\nif \"Image Description\" in result.text:\n    print(\"Images were described by LLM\")\n```\n\n### Manual Configuration\n\nYou can also provide LLM configuration programmatically:\n\n```python\nfrom pydocextractor import create_converter_service\nfrom pydocextractor.domain.config import LLMConfig\n\n# Create LLM configuration\nllm_config = LLMConfig(\n    api_url=\"https://api.openai.com/v1/chat/completions\",\n    api_key=\"your-api-key\",\n    model_name=\"gpt-4o-mini\",\n    enabled=True,\n    max_images_per_document=5,\n    context_lines=100,\n    image_size=1024,\n    timeout_seconds=30,\n    max_retries=3,\n)\n\n# Create service with LLM config (auto_load_llm=False to skip env loading)\nservice = create_converter_service(\n    llm_config=llm_config,\n    auto_load_llm=False,\n)\n\n# Use service normally - images will be described\nresult = service.convert_to_markdown(doc)\n```\n\n### Disabling LLM\n\nTo disable LLM features:\n\n**Option 1:** Set `LLM_ENABLED=false` in config.env\n**Option 2:** Remove config.env entirely\n**Option 3:** Don't install the `[llm]` extra\n\nThe system will work normally without LLM, just without image descriptions.\n\n### OpenAI-Compatible APIs\n\nThe LLM feature works 
with any OpenAI-compatible API endpoint:\n\n**OpenAI:**\n```bash\nLLM_API_URL=https://api.openai.com/v1/chat/completions\nLLM_MODEL_NAME=gpt-4o-mini  # or gpt-4o, gpt-4-vision-preview\n```\n\n**Azure OpenAI:**\n```bash\nLLM_API_URL=https://your-resource.openai.azure.com/openai/deployments/your-deployment/chat/completions?api-version=2024-02-15-preview\nLLM_MODEL_NAME=gpt-4o\n```\n\n**Local/Self-Hosted (e.g., Ollama, LM Studio):**\n```bash\nLLM_API_URL=http://localhost:11434/v1/chat/completions\nLLM_MODEL_NAME=llava:13b\n```\n\n### Output Example\n\nWhen LLM is enabled, images in documents are automatically described:\n\n```markdown\n## Document Content\n\nSome text content here...\n\n<!-- image -->\n\n**Image Description**: The image shows an architectural diagram depicting\na three-tier system architecture with a web server layer, application\nserver layer, and database layer. The diagram illustrates the flow of\nrequests through load balancers and connections to backend services.\n\nMore text content following the image...\n```\n\n### Cost Considerations\n\n- **Max Images Limit**: Set `LLM_MAX_IMAGES` to control costs (default: 5 per document)\n- **Image Resizing**: Images automatically resized to 1024x1024 to reduce token usage\n- **Retry Logic**: Failed API calls retry up to 3 times with exponential backoff\n- **Fallback**: If LLM call fails, document processing continues without description\n\n### Technical Details\n\n**How it works:**\n\n1. Extractors detect images in documents (PDF/DOCX)\n2. Raw image data is extracted and stored in Block objects\n3. Images are resized to 1024x1024 (white padding, aspect ratio preserved)\n4. Image + previous 100 lines of text sent to LLM\n5. LLM generates contextual description\n6. 
Description inserted into markdown output\n\n**Extractor Support:**\n\nAll three PDF extractors support image extraction when LLM is enabled:\n\n- **Docling (Level 4)**: Extracts images from PDF and DOCX files\n- **PyMuPDF4LLM (Level 2)**: Extracts images from PDF files\n- **PDFPlumber (Level 3)**: Extracts images from PDF files\n\n## Architecture\n\npyDocExtractor follows **Hexagonal Architecture** (Ports and Adapters pattern) for clean separation of concerns:\n\n```mermaid\ngraph TB\n    subgraph Infrastructure[\"\ud83d\udd0c Infrastructure Layer (Adapters)\"]\n        subgraph Extractors[\"PDF Extractors\"]\n            E1[ChunkedParallelExtractor<br/>Level 1: FASTEST<br/>\ud83d\udcc4 PDF - Parallel Processing]\n            E2[PyMuPDF4LLMExtractor<br/>Level 2: BALANCED<br/>\ud83d\udcc4 PDF - LLM Optimized]\n            E3[PDFPlumberExtractor<br/>Level 3: TABLE_OPTIMIZED<br/>\ud83d\udcc4 PDF - Table Extraction]\n            E4[DoclingExtractor<br/>Level 4: HIGHEST_QUALITY<br/>\ud83d\udcc4 PDF/DOCX/Excel]\n        end\n\n        subgraph TabularExtractors[\"Tabular Data Extractors\"]\n            E5[PandasCSVExtractor<br/>Level 4: HIGHEST_QUALITY<br/>\ud83d\udcca CSV with Statistics]\n            E6[PandasExcelExtractor<br/>Level 4: HIGHEST_QUALITY<br/>\ud83d\udcca Excel Multi-Sheet]\n        end\n\n        subgraph Policy[\"Selection Policy\"]\n            P1[DefaultPolicy<br/>\ud83d\udccb Smart Selection Logic<br/>Auto-discovers extractors<br/>Builds fallback chains]\n        end\n\n        subgraph Templates[\"Template Rendering\"]\n            T1[Jinja2TemplateEngine<br/>\ud83d\udcdd Markdown Generation<br/>Custom filters & templates]\n        end\n\n        subgraph Scoring[\"Quality Scoring\"]\n            S1[DefaultQualityScorer<br/>\u2b50 0.0-1.0 Quality Score<br/>Content/Structure/Format]\n        end\n    end\n\n    subgraph Application[\"\u2699\ufe0f Application Layer (Orchestration)\"]\n        CS[ConverterService<br/>\ud83c\udfaf Coordinates 
Workflow<br/>Fallback chains<br/>Quality scoring]\n    end\n\n    subgraph Domain[\"\ud83c\udfaf Domain Layer (Pure Business Logic)\"]\n        subgraph Models[\"Immutable Models\"]\n            M1[Document<br/>Input file + metadata]\n            M2[NormalizedDoc<br/>Vendor-agnostic format]\n            M3[Block<br/>Content units]\n            M4[Markdown<br/>Final output]\n        end\n\n        subgraph Ports[\"Ports - Protocols\"]\n            PR1[Extractor Protocol<br/>extract/supports/is_available]\n            PR2[Policy Protocol<br/>choose_extractors]\n            PR3[TemplateEngine Protocol<br/>render/list_templates]\n            PR4[QualityScorer Protocol<br/>calculate_score]\n        end\n\n        subgraph Rules[\"Pure Functions\"]\n            R1[quality_score<br/>Quality calculation]\n            R2[calculate_document_hash<br/>Deduplication]\n            R3[normalize_blocks<br/>Block processing]\n        end\n    end\n\n    %% Protocol Implementation Relationships\n    E1 -.implements.-> PR1\n    E2 -.implements.-> PR1\n    E3 -.implements.-> PR1\n    E4 -.implements.-> PR1\n    E5 -.implements.-> PR1\n    E6 -.implements.-> PR1\n    P1 -.implements.-> PR2\n    T1 -.implements.-> PR3\n    S1 -.implements.-> PR4\n\n    %% Service Dependencies (only uses Protocols)\n    CS -->|uses| PR2\n    CS -->|uses| PR1\n    CS -->|uses| PR3\n    CS -->|uses| PR4\n\n    %% Domain Dependencies\n    PR1 -->|depends on| M1\n    PR1 -->|depends on| M2\n    PR2 -->|depends on| M1\n    CS -->|creates| M4\n\n    %% Styling\n    classDef infrastructure fill:#e1f5ff,stroke:#01579b,stroke-width:2px\n    classDef application fill:#fff9c4,stroke:#f57f17,stroke-width:2px\n    classDef domain fill:#f3e5f5,stroke:#4a148c,stroke-width:2px\n\n    class E1,E2,E3,E4,E5,E6,P1,T1,S1 infrastructure\n    class CS application\n    class M1,M2,M3,M4,PR1,PR2,PR3,PR4,R1,R2,R3 domain\n```\n\n### Why Hexagonal Architecture?\n\n**Clean Separation:**\n- **Domain Layer**: Business rules (Document, 
Block models) with zero dependencies\n- **Application Layer**: Orchestrates conversion workflow using ports\n- **Infrastructure Layer**: Concrete implementations (PDF extractors, templates)\n\n**Key Benefits:**\n\n1. **Testability**\n   - Domain & app layers testable without real extractors\n   - 260+ tests with 87% coverage\n   - BDD scenarios for behavior validation\n\n2. **Flexibility**\n   - Swap extractors: Use Docling instead of PyMuPDF4LLM\n   - Change templates: Different Markdown formats\n   - Replace scoring: Custom quality algorithms\n\n3. **Maintainability**\n   - Boundaries enforced by import-linter\n   - Strict mypy type checking\n   - Clear dependency flow\n\n4. **Extensibility**\n   - Add new extractors by implementing `Extractor` Protocol\n   - No changes needed to domain or application layers\n   - Example: PandasCSV added without touching core logic\n\n### Architecture Layers in Detail\n\n#### \ud83c\udfaf Domain Layer (`src/pydocextractor/domain/`)\n\nThe innermost layer containing pure business logic with **zero external dependencies**.\n\n**Components:**\n\n**1. Models (`models.py`)** - Immutable dataclasses representing core business entities:\n- `Document`: Input file representation (bytes, MIME type, precision level)\n- `Block`: Single unit of extracted content (text, table, image, etc.)\n- `NormalizedDoc`: Vendor-agnostic intermediate format containing blocks\n- `Markdown`: Final output with quality score and metadata\n- `ExtractionResult`: Result of an extraction attempt (success/failure)\n- `TemplateContext`: Template rendering context\n- `PrecisionLevel`: Enum (1=FASTEST, 2=BALANCED, 3=TABLE_OPTIMIZED, 4=HIGHEST_QUALITY)\n- `BlockType`: Enum (TEXT, TABLE, IMAGE, HEADER, LIST, CODE, METADATA)\n\n**2. 
Ports (`ports.py`)** - Protocol definitions (interfaces):\n- `Extractor`: Extract content from documents\n- `Policy`: Choose which extractor to use\n- `TemplateEngine`: Render markdown from normalized docs\n- `QualityScorer`: Calculate quality scores\n- `DocumentValidator`: Validate documents\n- `TableProfiler`: Analyze tabular data\n- `Cache`: Caching operations\n\n**3. Rules (`rules.py`)** - Pure functions for business logic:\n- `quality_score()`: Calculate 0.0-1.0 quality score\n- `calculate_document_hash()`: Generate content hashes for deduplication\n- `hint_has_tables()`: Detect if document likely contains tables\n- `normalize_blocks()`: Clean and deduplicate blocks\n- `merge_text_blocks()`: Merge consecutive text blocks\n- `validate_precision_level()`: Validate precision levels\n- `estimate_processing_time()`: Estimate conversion duration\n\n**4. Errors (`errors.py`)** - Domain exception hierarchy:\n- `DomainError`: Base exception\n- `ConversionFailed`: All extractors failed\n- `RecoverableError`: Extractor failed, try fallback\n- `UnsupportedFormat`: No extractor available\n- `ValidationError`: Invalid document\n- `ExtractionError`: Extraction process error\n- `TemplateError`: Template rendering error\n\n**Architecture Rule:** Domain layer must NOT import from `app` or `infra` layers. 
This is enforced by `import-linter`.\n\n#### \u2699\ufe0f Application Layer (`src/pydocextractor/app/`)\n\nOrchestrates the conversion workflow using domain ports.\n\n**ConverterService (`service.py`):**\n\nThe main orchestration service that coordinates the entire conversion process.\n\n**Dependencies (injected via constructor):**\n- `policy: Policy` - Chooses which extractor to use\n- `template_engine: TemplateEngine` - Renders markdown\n- `quality_scorer: QualityScorer | None` - Calculates quality score\n- `table_profilers: Sequence[TableProfiler]` - Analyzes tabular data\n\n**Key Methods:**\n- `convert_to_markdown(doc, template_name, allow_fallback)`: Main conversion entry point\n- `convert_with_specific_extractor(doc, extractor_name, template_name)`: Force specific extractor\n- `list_available_templates()`: List available markdown templates\n- `get_supported_formats()`: List supported MIME types\n\n**Conversion Workflow:**\n1. Validate document\n2. Ask policy to choose extractors (ordered by preference)\n3. Try extractors in order until one succeeds\n4. Apply table profilers if configured\n5. Render markdown using template engine\n6. Calculate quality score\n7. Return `Markdown` result\n\n**Architecture Rule:** Application layer depends ONLY on domain layer (models + ports). Never imports concrete infrastructure classes.\n\n#### \ud83d\udd0c Infrastructure Layer (`src/pydocextractor/infra/`)\n\nConcrete implementations of domain ports.\n\n**1. 
Extractors (`infra/extractors/`)** - 6 implementations of `Extractor` Protocol:\n\n| Extractor | Level | Library | MIME Types | Specialization |\n|-----------|-------|---------|------------|----------------|\n| `ChunkedParallelExtractor` | 1 (FASTEST) | PyMuPDF | `application/pdf` | Parallel page processing for speed |\n| `PyMuPDF4LLMExtractor` | 2 (BALANCED) | pymupdf4llm | `application/pdf` | LLM-optimized extraction (default) |\n| `PDFPlumberExtractor` | 3 (TABLE_OPTIMIZED) | pdfplumber | `application/pdf` | Superior table extraction |\n| `DoclingExtractor` | 4 (HIGHEST_QUALITY) | Docling | `application/pdf`, DOCX, Excel | Comprehensive layout analysis |\n| `PandasCSVExtractor` | 4 (HIGHEST_QUALITY) | pandas | `text/csv` | CSV with column statistics |\n| `PandasExcelExtractor` | 4 (HIGHEST_QUALITY) | pandas | Excel (XLS/XLSX) | Multi-sheet with rich metadata |\n\nAll extractors implement:\n- `extract(data: bytes, precision: PrecisionLevel) -> ExtractionResult`\n- `supports(mime: str) -> bool`\n- `is_available() -> bool` (checks if dependencies installed)\n- Properties: `name`, `precision_level`\n\n**2. Policy (`infra/policy/heuristics.py`)** - `DefaultPolicy`:\n\nSmart extractor selection logic:\n\n**Selection Strategy:**\n- **CSV files** \u2192 `PandasCSVExtractor`\n- **Excel files** \u2192 `PandasExcelExtractor`\n- **DOCX files** \u2192 `DoclingExtractor`\n- **PDF files** (by characteristics):\n  - Size > 20MB \u2192 `ChunkedParallelExtractor` (Level 1)\n  - Size < 2MB \u2192 `DoclingExtractor` (Level 4)\n  - Has tables \u2192 `PDFPlumberExtractor` (Level 3)\n  - Default \u2192 `PyMuPDF4LLMExtractor` (Level 2)\n\n**Fallback Chain:** If preferred extractor fails, tries: Level 2 \u2192 Level 1 \u2192 Level 3 \u2192 Level 4\n\n**3. 
Templates (`infra/templates/engines.py`)** - `Jinja2TemplateEngine`:\n\nMarkdown rendering using Jinja2:\n- Default templates in `infra/templates/templates/`\n- Built-in templates: `simple.j2`, `default.j2`, `tabular.j2`\n- Custom filters: `word_count`, `char_count`\n- Supports custom template directories\n\n**4. Scoring (`infra/scoring/default_scorer.py`)** - `DefaultQualityScorer`:\n\nCalculates 0.0-1.0 quality score based on:\n- **Content Length (25%)**: Document has substantial text\n- **Structure (30%)**: Presence of headers and tables\n- **Text Quality (25%)**: Average block length and word count\n- **Formatting (20%)**: Line structure and markdown formatting\n\n**5. Factory (`factory.py`)** - Dependency Injection:\n\nCreates fully configured services:\n```python\ndef create_converter_service(template_dir=None) -> ConverterService:\n    # Auto-discovers all available extractors\n    policy = DefaultPolicy()\n    template_engine = Jinja2TemplateEngine(template_dir)\n    quality_scorer = DefaultQualityScorer()\n\n    return ConverterService(\n        policy=policy,\n        template_engine=template_engine,\n        quality_scorer=quality_scorer,\n    )\n```\n\nHelper functions:\n- `get_available_extractors()`: Lists all installed extractors\n- `get_extractor_by_level(level)`: Gets specific extractor by precision level\n\n**Graceful Degradation:** If optional dependencies are missing, the affected extractors are excluded but the library still works with the available ones.\n\n### Conversion Flow\n\nThe following diagram shows how a document flows through the system:\n\n```mermaid\nsequenceDiagram\n    participant User\n    participant Service as ConverterService\n    participant Policy as DefaultPolicy\n    participant Extractor as Selected Extractor\n    participant Template as Jinja2TemplateEngine\n    participant Scorer as QualityScorer\n\n    User->>Service: convert_to_markdown(doc)\n    Service->>Policy: choose_extractors(doc)\n    Policy-->>Service: [extractor_list]\n\n    loop For each 
extractor (with fallback)\n        Service->>Extractor: extract(data, precision)\n        alt Extraction Success\n            Extractor-->>Service: ExtractionResult(success=True)\n            Note over Service: Break loop\n        else Extraction Failed\n            Extractor-->>Service: ExtractionResult(success=False)\n            Note over Service: Try next extractor\n        end\n    end\n\n    Service->>Template: render(blocks, metadata)\n    Template-->>Service: Markdown text\n\n    Service->>Scorer: calculate_quality(markdown)\n    Scorer-->>Service: quality_score (0.0-1.0)\n\n    Service-->>User: Markdown(text, score, metadata)\n```\n\n### Dependency Injection Flow\n\nThis diagram shows how components are created and wired together:\n\n```mermaid\ngraph TB\n    subgraph Factory[\"\ud83c\udfed Factory (Composition Root)\"]\n        F[create_converter_service]\n    end\n\n    subgraph Creation[\"Component Creation\"]\n        F -->|1. Create| P[DefaultPolicy]\n        F -->|2. Create| TE[Jinja2TemplateEngine]\n        F -->|3. Create| QS[DefaultQualityScorer]\n        F -->|4. 
Inject & Assemble| CS[ConverterService]\n\n        P -->|discovers| E1[ChunkedParallelExtractor]\n        P -->|discovers| E2[PyMuPDF4LLMExtractor]\n        P -->|discovers| E3[PDFPlumberExtractor]\n        P -->|discovers| E4[DoclingExtractor]\n        P -->|discovers| E5[PandasCSVExtractor]\n        P -->|discovers| E6[PandasExcelExtractor]\n    end\n\n    subgraph Runtime[\"Runtime (Protocol-based)\"]\n        CS -->|uses Protocol| PP[Policy Protocol]\n        CS -->|uses Protocol| EP[Extractor Protocol]\n        CS -->|uses Protocol| TEP[TemplateEngine Protocol]\n        CS -->|uses Protocol| QSP[QualityScorer Protocol]\n\n        PP -.implemented by.-> P\n        EP -.implemented by.-> E1\n        EP -.implemented by.-> E2\n        EP -.implemented by.-> E3\n        EP -.implemented by.-> E4\n        EP -.implemented by.-> E5\n        EP -.implemented by.-> E6\n        TEP -.implemented by.-> TE\n        QSP -.implemented by.-> QS\n    end\n\n    subgraph UserCode[\"User Code\"]\n        U[Your Application]\n        U -->|calls| F\n        F -->|returns| CS2[Fully Configured<br/>ConverterService]\n        CS2 -->|ready to use| U\n    end\n\n    classDef factory fill:#fff9c4,stroke:#f57f17,stroke-width:2px\n    classDef creation fill:#e1f5ff,stroke:#01579b,stroke-width:2px\n    classDef runtime fill:#f3e5f5,stroke:#4a148c,stroke-width:2px\n    classDef user fill:#e8f5e9,stroke:#1b5e20,stroke-width:2px\n\n    class F factory\n    class P,TE,QS,CS,E1,E2,E3,E4,E5,E6 creation\n    class CS,PP,EP,TEP,QSP runtime\n    class U,CS2 user\n```\n\n**Key Principles:**\n\n1. **Factory Pattern**: `create_converter_service()` is the composition root\n2. **Protocol-Based**: Service depends on protocols, not concrete types\n3. **Auto-Discovery**: Policy automatically finds all available extractors\n4. **Graceful Degradation**: Missing dependencies = extractor excluded\n5. 
**Testability**: Easy to inject mocks/fakes for testing\n\n### Practical Architecture Example\n\n```python\n# Domain Layer - Pure business logic\nfrom pydocextractor.domain.models import Document, PrecisionLevel\nfrom pydocextractor.domain.ports import Extractor  # Protocol\n\n# Infrastructure Layer - Concrete implementations\nfrom pydocextractor.infra.extractors.pymupdf4llm_adapter import PyMuPDF4LLMExtractor\nfrom pydocextractor.infra.policy.heuristics import DefaultPolicy\n\n# Application Layer - Orchestration\nfrom pydocextractor.app.service import ConverterService\n\n# Dependency Injection in action\nextractor: Extractor = PyMuPDF4LLMExtractor()  # Depends on Protocol\npolicy = DefaultPolicy()  # Chooses which extractor\nservice = ConverterService(policy=policy, ...)\n\n# Usage - clean and simple\ndoc = Document(bytes=..., mime=\"application/pdf\", ...)\nresult = service.convert_to_markdown(doc)\n```\n\n### Architecture Validation\n\nThe architecture is continuously validated:\n\n```bash\njust guard      # Enforces layer boundaries with import-linter\njust typecheck  # Validates Protocol compliance with mypy --strict\njust test       # Ensures all layers work together\njust check      # Run all quality checks (format, lint, types, guard)\n```\n\n**Learn More:** See the detailed Architecture section below for:\n- Layer responsibilities and rules\n- Protocol definitions\n- Component relationships\n- Dependency injection patterns\n\n## Development Workflow\n\n### Installation Commands\n\n```bash\n# Development setup\njust bootstrap        # Install all dev dependencies\njust install          # Install package in editable mode with all extras\njust install-dev      # Install package (minimal, no optional deps)\njust install-prod     # Install for production (non-editable)\n```\n\n### Code Quality\n\n```bash\njust fmt              # Format code with ruff\njust lint             # Lint code with ruff\njust fix              # Auto-fix linting issues\njust typecheck    
    # Type check with mypy\njust guard            # Verify architectural boundaries\njust check            # Run all quality checks (fmt, lint, types, guard)\n```\n\n### Testing\n\n```bash\njust test             # Run all tests\njust test-unit        # Domain + app tests (fast)\njust test-adapters    # Infrastructure tests\njust test-contract    # Protocol compliance tests\njust test-bdd         # BDD tests\njust test-integration # Integration tests\njust test-cov         # With coverage report\njust coverage-check   # Verify 70% coverage threshold\n```\n\n### Utilities\n\n```bash\njust build            # Build package distribution\njust clean            # Remove build artifacts and cache\njust layers           # Show architecture layers\njust stats            # Project statistics\n```\n\n### Workflows\n\n```bash\njust ci               # Full CI pipeline locally\njust pre-commit       # Pre-commit checks (fmt + check + test)\n```\n\n## Testing\n\nComprehensive test suite following hexagonal architecture:\n\n### Test Structure\n\n```\ntests/\n\u251c\u2500\u2500 unit/              # Pure unit tests (no infrastructure)\n\u2502   \u251c\u2500\u2500 domain/       # Domain layer tests\n\u2502   \u2514\u2500\u2500 app/          # Application layer tests (mocked)\n\u251c\u2500\u2500 adapters/         # Infrastructure adapter tests\n\u251c\u2500\u2500 contract/         # Protocol compliance tests\n\u251c\u2500\u2500 integration/      # End-to-end tests\n\u2514\u2500\u2500 bdd/              # BDD tests with pytest-bdd\n    \u251c\u2500\u2500 features/     # Gherkin scenarios\n    \u2514\u2500\u2500 steps/        # Step definitions\n```\n\n### Running Tests\n\n```bash\n# All tests\njust test\n\n# By category\njust test-unit           # Unit tests only\njust test-adapters       # Adapter tests\njust test-bdd            # BDD tests\njust test-integration    # Integration tests\n\n# With coverage\njust test-cov            # Generate coverage report\njust test-unit-coverage  # 
Unit tests with coverage\n\n# Check coverage threshold (70%)\njust coverage-check\n```\n\n### BDD Tests\n\nBehavior-Driven Development tests using Gherkin:\n\n```gherkin\nScenario: Convert a text-based PDF to Markdown\n  Given I have a PDF file \"Company_Handbook.pdf\"\n  When I submit the file for extraction\n  Then the service produces a Markdown document\n  And a content ID is generated and returned\n```\n\nSee [tests/bdd/README.md](tests/bdd/README.md) for BDD documentation.\n\n## Extending the Library\n\n### Adding a Custom Extractor\n\nImplement the `Extractor` Protocol to add support for new file formats:\n\n```python\nfrom pydocextractor.domain.ports import ExtractionResult\nfrom pydocextractor.domain.models import (\n    PrecisionLevel,\n    NormalizedDoc,\n    Block,\n    BlockType,\n)\n\nclass MyCustomExtractor:\n    \"\"\"Custom extractor implementing Extractor Protocol.\"\"\"\n\n    @property\n    def name(self) -> str:\n        return \"MyCustomExtractor\"\n\n    @property\n    def precision_level(self) -> PrecisionLevel:\n        return PrecisionLevel.HIGHEST_QUALITY\n\n    def is_available(self) -> bool:\n        # Check if required dependencies are installed\n        try:\n            import my_custom_library\n            return True\n        except ImportError:\n            return False\n\n    def supports(self, mime: str) -> bool:\n        return mime == \"application/custom\"\n\n    def extract(self, data: bytes, precision: PrecisionLevel) -> ExtractionResult:\n        import time\n        start = time.time()\n\n        try:\n            # Your extraction logic here\n            extracted_text = self._extract_content(data)\n\n            blocks = (\n                Block(type=BlockType.TEXT, content=extracted_text),\n            )\n            ndoc = NormalizedDoc(blocks=blocks, source_mime=\"application/custom\")\n\n            return ExtractionResult(\n                success=True,\n                normalized_doc=ndoc,  # Note: 
'normalized_doc', not 'ndoc'\n                extractor_name=self.name,\n                processing_time_seconds=time.time() - start,\n            )\n        except Exception as e:\n            return ExtractionResult(\n                success=False,\n                normalized_doc=None,\n                error=str(e),\n                extractor_name=self.name,\n                processing_time_seconds=time.time() - start,\n            )\n\n    def _extract_content(self, data: bytes) -> str:\n        # Implement your extraction logic\n        return \"Extracted text from custom format\"\n\n# Using custom extractor (not directly injectable into DefaultPolicy)\n# You would need to create a custom policy or use the extractor directly:\nfrom pydocextractor.domain.models import Document\n\nextractor = MyCustomExtractor()\nif extractor.is_available() and extractor.supports(\"application/custom\"):\n    doc = Document(\n        bytes=b\"...\",\n        mime=\"application/custom\",\n        size_bytes=100,\n        precision=PrecisionLevel.HIGHEST_QUALITY,\n    )\n    result = extractor.extract(doc.bytes, doc.precision)\n```\n\n**Note:** The current `DefaultPolicy` hardcodes extractors. 
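Because the `Policy` port is just a Protocol, one hedged sketch of a replacement policy might look like this (hypothetical class; the exact `choose_extractors` signature should be checked against `pydocextractor.domain.ports`):

```python
# Hypothetical sketch of a policy that prefers a custom extractor.
# Assumes the Policy protocol's choose_extractors(doc) returns an ordered
# list of extractor instances, tried in sequence by ConverterService.

class CustomFirstPolicy:
    def __init__(self, custom_extractor, fallback_policy):
        self._custom = custom_extractor
        self._fallback = fallback_policy

    def choose_extractors(self, doc):
        # Start from the default fallback chain...
        chain = list(self._fallback.choose_extractors(doc))
        # ...and put the custom extractor first when it can handle this MIME type.
        if self._custom.is_available() and self._custom.supports(doc.mime):
            chain.insert(0, self._custom)
        return chain
```

A `ConverterService` built with such a policy would try the custom extractor first and fall back to the usual chain if it fails.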
To use custom extractors in the service, you would need to create a custom policy implementation.\n\n### Custom Template\n\nCreate a Jinja2 template in `templates/`:\n\n```jinja2\n{# my_template.j2 #}\n# {{ metadata.filename }}\n\n{% for block in blocks %}\n{{ block.content }}\n\n{% endfor %}\n\n---\nQuality: {{ quality_score }}\nExtractor: {{ metadata.extractor }}\n```\n\nUse it:\n\n```python\nresult = service.convert_to_markdown(doc, template_name=\"my_template\")\n```\n\n## Project Structure\n\n```\npyDocExtractor/\n\u251c\u2500\u2500 src/pydocextractor/\n\u2502   \u251c\u2500\u2500 domain/              # Pure domain layer\n\u2502   \u2502   \u251c\u2500\u2500 models.py        # Immutable dataclasses\n\u2502   \u2502   \u251c\u2500\u2500 ports.py         # Protocol definitions\n\u2502   \u2502   \u251c\u2500\u2500 rules.py         # Pure functions\n\u2502   \u2502   \u2514\u2500\u2500 errors.py        # Domain exceptions\n\u2502   \u251c\u2500\u2500 app/                 # Application layer\n\u2502   \u2502   \u2514\u2500\u2500 service.py       # ConverterService\n\u2502   \u251c\u2500\u2500 infra/               # Infrastructure layer\n\u2502   \u2502   \u251c\u2500\u2500 extractors/      # 6 extractor adapters\n\u2502   \u2502   \u251c\u2500\u2500 policy/          # Selection logic\n\u2502   \u2502   \u251c\u2500\u2500 templates/       # Jinja2 templates\n\u2502   \u2502   \u2514\u2500\u2500 scoring/         # Quality scoring\n\u2502   \u251c\u2500\u2500 factory.py           # Dependency injection\n\u2502   \u2514\u2500\u2500 cli.py               # CLI interface\n\u251c\u2500\u2500 tests/                   # Hexagonal test suite\n\u251c\u2500\u2500 test_documents/          # Real test documents\n\u2514\u2500\u2500 pyproject.toml           # Project configuration\n```\n\n## Configuration\n\n### pyproject.toml\n\n- **Strict mypy**: Enforced on domain layer\n- **Ruff linting**: Per-layer rules\n- **Import linter**: Enforces architectural boundaries\n- **Extras 
### Architectural Boundaries

Enforced via `import-linter`:

```ini
[importlinter:contract:domain-independence]
# Domain MUST NOT import from app or infra
type = forbidden
source_modules = pydocextractor.domain
forbidden_modules = pydocextractor.infra, pydocextractor.app
```

Verify with:

```bash
just guard
```

## Performance

Benchmarks on typical documents:

| Document Size | Level 1 | Level 2 | Level 3 | Level 4 |
|--------------|---------|---------|---------|---------|
| 200 KB       | 0.1s    | 0.5s    | 1.2s    | 45s     |
| 1.2 MB       | 0.3s    | 2.1s    | 5.2s    | 180s    |
| 3.2 MB       | 1.8s    | 8.4s    | 25.1s   | 900s    |
| 13 MB        | 4.2s    | 35.7s   | 120s    | 3600s   |

## Quality Scoring

Documents are scored 0.0-1.0 based on:

- **Content Length** (25%): Substantial extracted text
- **Structure** (30%): Headings and paragraphs
- **Text Quality** (25%): Average block length
- **Formatting** (20%): Lists and tables

```python
result = service.convert_to_markdown(doc)
if result.quality_score > 0.8:
    print("High quality conversion")
```

## Contributing

We welcome contributions! Whether you're fixing bugs, adding features, or improving documentation, your help is appreciated.

### Quick Start for Contributors

```bash
# Clone and set up
git clone https://github.com/AminiTech/pyDocExtractor.git
cd pyDocExtractor
just bootstrap

# Make your changes, then verify
just fmt           # Format code
just check         # Run quality checks
just test          # Run tests
just guard         # Verify architecture
```

### Common Contribution Scenarios

#### Adding Support for a New Document Type

1. Create a new extractor in `src/pydocextractor/infra/extractors/`
2. Implement the `Extractor` Protocol (see [docs/CONTRIBUTING.md](docs/CONTRIBUTING.md#how-to-add-support-for-a-new-document-type))
3. Update the selection policy in `src/pydocextractor/infra/policy/heuristics.py`
4. Add tests and sample documents
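As a sketch of steps 1-2, a minimal extractor for a hypothetical `text/csv` document type could look like the following. It is self-contained and illustrative only; the method names mirror the custom-extractor example earlier in this README rather than a verified copy of the `Extractor` Protocol:

```python
import csv
import io


class CsvExtractor:
    """Illustrative extractor: turns CSV bytes into a Markdown table."""

    name = "csv"

    def supports(self, mime: str) -> bool:
        return mime == "text/csv"

    def extract(self, data: bytes) -> str:
        rows = list(csv.reader(io.StringIO(data.decode("utf-8"))))
        if not rows:
            return ""
        header, *body = rows
        lines = [
            "| " + " | ".join(header) + " |",
            "| " + " | ".join("---" for _ in header) + " |",
        ]
        lines += ["| " + " | ".join(row) + " |" for row in body]
        return "\n".join(lines)


extractor = CsvExtractor()
markdown = extractor.extract(b"name,size\na.pdf,200")
# markdown is a three-line Markdown table: header, separator, one data row
```

The real Protocol also involves `ExtractionResult`, timing, and error handling, as shown in the custom-extractor example above.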
#### Creating a Custom Template

1. Add a Jinja2 template to `src/pydocextractor/infra/templates/templates/`
2. Use the available context variables: `blocks`, `metadata`, `quality_score`
3. Test with various document types

#### Modifying Quality Scoring

1. Create a new scorer in `src/pydocextractor/infra/scoring/`
2. Implement the `QualityScorer` Protocol
3. Inject it via the `ConverterService` constructor

### Architecture Guidelines

pyDocExtractor follows **Hexagonal Architecture**:

- **Domain Layer** (`src/pydocextractor/domain/`) - Pure business logic, no external dependencies
- **Application Layer** (`src/pydocextractor/app/`) - Orchestrates workflows using domain ports
- **Infrastructure Layer** (`src/pydocextractor/infra/`) - Concrete implementations (extractors, templates, etc.)

**Rule:** The domain layer must NOT import from the `app` or `infra` layers (enforced by `import-linter`).

### For Detailed Information

See **[docs/CONTRIBUTING.md](docs/CONTRIBUTING.md)** for comprehensive guides on:

- Project structure and what each folder does
- A step-by-step guide to adding new document type support
- How to create custom templates and quality scorers
- Testing guidelines and best practices
- Code quality standards
- The pull request process

### Questions?

- **Issues**: [GitHub Issues](https://github.com/AminiTech/pyDocExtractor/issues)
- **Discussions**: [GitHub Discussions](https://github.com/AminiTech/pyDocExtractor/discussions)

## Documentation

- **[docs/CONTRIBUTING.md](docs/CONTRIBUTING.md)** - How to contribute to the project
- **[docs/CONTRIBUTING_GUIDE.md](docs/CONTRIBUTING_GUIDE.md)** - Detailed contribution guide with architecture reference
- **[docs/TEMPLATES.md](docs/TEMPLATES.md)** - Template system guide with Jinja2 examples
- **[tests/bdd/README.md](tests/bdd/README.md)** - BDD testing guide
- See the Architecture section above for hexagonal architecture details
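To make the Quality Scoring weights concrete (25% content length, 30% structure, 25% text quality, 20% formatting), a custom scorer could combine per-component scores in [0.0, 1.0] as a weighted sum. This is an illustrative sketch, not the shipped scorer; `WeightedScorer`, its `score` method, and the component keys are assumed names:

```python
class WeightedScorer:
    """Combine component scores (each in [0.0, 1.0]) using the README's weights."""

    WEIGHTS = {
        "content_length": 0.25,
        "structure": 0.30,
        "text_quality": 0.25,
        "formatting": 0.20,
    }

    def score(self, components: dict) -> float:
        # Missing components score 0.0, keeping the result in [0.0, 1.0].
        return sum(
            weight * components.get(name, 0.0)
            for name, weight in self.WEIGHTS.items()
        )


scorer = WeightedScorer()
total = scorer.score({
    "content_length": 1.0,
    "structure": 0.5,
    "text_quality": 1.0,
    "formatting": 0.0,
})
# 0.25*1.0 + 0.30*0.5 + 0.25*1.0 + 0.20*0.0 = 0.65
```

A scorer like this would then be injected through the `ConverterService` constructor, as described under "Modifying Quality Scoring" above.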
## License

MIT License - see [LICENSE](LICENSE) for details.

## Credits

Extracted from the [Amini Ingestion KGraph](https://github.com/AminiTech/amini-ingestion-kgraph) project.

Built with:

- [PyMuPDF](https://github.com/pymupdf/PyMuPDF) - Fast PDF processing
- [pymupdf4llm](https://github.com/pymupdf/pymupdf4llm) - LLM-optimized extraction
- [pdfplumber](https://github.com/jsvine/pdfplumber) - Table extraction
- [Docling](https://github.com/DS4SD/docling) - Highest-quality conversion
- [pytest-bdd](https://github.com/pytest-dev/pytest-bdd) - BDD testing

## Support

- **Issues**: [GitHub Issues](https://github.com/AminiTech/pyDocExtractor/issues)
- **Documentation**: [GitHub Wiki](https://github.com/AminiTech/pyDocExtractor/wiki)
- **Discussions**: [GitHub Discussions](https://github.com/AminiTech/pyDocExtractor/discussions)
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "A Python library for converting documents (PDF, DOCX, XLSX) to Markdown with multiple precision levels",
    "version": "0.1.1",
    "project_urls": null,
    "split_keywords": [
        "conversion",
        " document",
        " extractor",
        " markdown",
        " pdf"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "4712049171331b7d5258a081741a82fb477dc60473ab39633edf1842bb4f481d",
                "md5": "46c150160fbdf773d2219c5165d554f5",
                "sha256": "98f3303231917dc2af2d2f046a10829ffaa451f2ce592d64c86d91dfca437558"
            },
            "downloads": -1,
            "filename": "pydocextractor-0.1.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "46c150160fbdf773d2219c5165d554f5",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.11",
            "size": 77331,
            "upload_time": "2025-10-29T11:29:57",
            "upload_time_iso_8601": "2025-10-29T11:29:57.724378Z",
            "url": "https://files.pythonhosted.org/packages/47/12/049171331b7d5258a081741a82fb477dc60473ab39633edf1842bb4f481d/pydocextractor-0.1.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "f51ff562777d05f51203574b7289a58fc6f4a3ceece4bf517ec1edb6edfbb2e4",
                "md5": "1e0a73982c53b14236d1cc003ce40e2b",
                "sha256": "ae2b01e28ddc2923e8f3be23009621d98a555872e3973aad20a15d487912083f"
            },
            "downloads": -1,
            "filename": "pydocextractor-0.1.1.tar.gz",
            "has_sig": false,
            "md5_digest": "1e0a73982c53b14236d1cc003ce40e2b",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.11",
            "size": 538494,
            "upload_time": "2025-10-29T11:30:01",
            "upload_time_iso_8601": "2025-10-29T11:30:01.069305Z",
            "url": "https://files.pythonhosted.org/packages/f5/1f/f562777d05f51203574b7289a58fc6f4a3ceece4bf517ec1edb6edfbb2e4/pydocextractor-0.1.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-10-29 11:30:01",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "pydocextractor"
}
        