pdf-image-extract-annotate

Name	pdf-image-extract-annotate
Version	1.0.0
home_page	https://github.com/thijshakkenberg/pdf-image-extract-annotate
Summary	Extract images from PDFs and create annotated versions with watermarks
upload_time	2025-10-21 12:45:31
maintainer	None
docs_url	None
author	Thijs Hakkenberg
requires_python	>=3.8
license	None
keywords	pdf image extraction watermark annotation
requirements	No requirements were recorded.

# PDFImageExtractAnnotate

A Python package for extracting images from PDF documents and creating annotated versions with watermarks showing the extracted image filenames.

## Features

- **Image Extraction**: Extract all images from PDF documents with configurable filters
- **Page-based Organization**: Images are organized by page number for easy reference
- **Watermark Annotation**: Add watermarks to the original PDF showing extracted image filenames
- **Flexible Filtering**: Filter images by dimensions, file size, or relative compression
- **Azure Blob Storage Support**: Optional support for storing images in Azure Blob Storage
- **Customizable Watermarks**: Configure font size, color, background, and text format

## Installation

### From PyPI

```bash
pip install pdf-image-extract-annotate
```

### From Source

```bash
git clone https://github.com/thijshakkenbergecolab/pdf-image-extract-annotate
cd pdf-image-extract-annotate
pip install -e .
```

### With Azure Support

```bash
pip install "pdf-image-extract-annotate[azure]"
```

## Quick Start

### Basic Image Extraction

```python
from pathlib import Path
from pdf_image_extract_annotate import PDFImageExtractor, ExtractionConfig

# Configure extraction
config = ExtractionConfig(
    output_dir="extracted_images",
    dim_limit=50,  # Minimum dimension in pixels
    abs_size=1000  # Minimum file size in bytes
)

# Extract images
extractor = PDFImageExtractor(config)
result = extractor.extract_all_images(Path("document.pdf"))

print(f"Extracted {result['images_extracted']} images")
print(f"Saved to: {result['output_directory']}")
```

### Extract and Watermark PDF

```python
from pathlib import Path
from pdf_image_extract_annotate import PDFImageWatermarker, WatermarkConfig

# Configure watermark appearance
watermark_config = WatermarkConfig(
    font_size=10,
    font_color=(1.0, 0.0, 0.0),  # Red text
    background_color=(1.0, 1.0, 1.0, 0.7),  # Semi-transparent white
    text_format="filename"  # Show just the filename
)

# Process PDF
watermarker = PDFImageWatermarker(
    pdf_path=Path("document.pdf"),
    watermark_config=watermark_config
)

result = watermarker.process_pdf_with_watermarks()

# Save the annotated PDF
result.output_pdf.save("annotated_document.pdf")
result.output_pdf.close()

print(f"Extracted {result.images_extracted} images")
print(f"Watermarked {result.images_watermarked} images")
```

## Configuration Options

### ExtractionConfig

- `output_dir` (str): Directory to save extracted images
- `dim_limit` (int): Minimum image dimension in pixels; smaller images are skipped (0 = no limit)
- `rel_size` (float): Minimum ratio of stored image size to uncompressed size (0.0-1.0, 0 = no limit)
- `abs_size` (int): Minimum image file size in bytes (0 = no limit)
- `blob_connection_string` (str, optional): Azure Blob Storage connection string
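
The three size filters can be pictured with a small standalone sketch. This is not the package's implementation: `passes_filters` and its parameters are hypothetical names, and it assumes that each threshold is an inclusive minimum and that `0` disables the corresponding check.

```python
def passes_filters(
    width: int,
    height: int,
    stored_bytes: int,
    uncompressed_bytes: int,
    dim_limit: int = 0,
    rel_size: float = 0.0,
    abs_size: int = 0,
) -> bool:
    """Hypothetical helper: True if an image clears all enabled thresholds."""
    # Dimension filter: the smaller side must reach dim_limit pixels.
    if dim_limit and min(width, height) < dim_limit:
        return False
    # Absolute size filter: the stored image must be at least abs_size bytes.
    if abs_size and stored_bytes < abs_size:
        return False
    # Relative size filter: stored size divided by uncompressed size
    # must reach rel_size.
    if rel_size and uncompressed_bytes > 0:
        if stored_bytes / uncompressed_bytes < rel_size:
            return False
    return True


# A 300x250 RGB image: 225,000 bytes uncompressed, stored in 60,000 bytes.
print(passes_filters(300, 250, 60_000, 225_000, dim_limit=200, abs_size=50_000))  # True
print(passes_filters(300, 250, 60_000, 225_000, rel_size=0.5))  # False
```

In this sketch a heavily compressed image fails the `rel_size` check even when it is large on the page, which is why the relative filter is best combined with the other two rather than used alone.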

### WatermarkConfig

- `font_size` (int): Font size for watermark text
- `font_color` (tuple): RGB text color, each component in the range 0.0-1.0
- `background_color` (tuple): RGBA background color, each component in the range 0.0-1.0 (the fourth value sets opacity)
- `text_format` (str): Format for watermark text ("filename", "filepath", or "custom")
- `padding` (int): Padding around text in pixels
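
As a rough illustration of what the three `text_format` values could produce, the sketch below maps an extracted image path to a watermark label. The function `watermark_label` and its `custom_text` parameter are hypothetical and are not part of the package; in particular, how the package handles `"custom"` is an assumption here.

```python
from pathlib import Path


def watermark_label(
    image_path: Path,
    text_format: str = "filename",
    custom_text: str = "",
) -> str:
    """Hypothetical helper: build the label text for one extracted image."""
    if text_format == "filename":
        # Just the final path component, e.g. "img00001.png".
        return image_path.name
    if text_format == "filepath":
        # The full path as given, with forward slashes on every platform.
        return image_path.as_posix()
    if text_format == "custom":
        # Assumption: the caller supplies the label text directly.
        return custom_text
    raise ValueError(f"Unknown text_format: {text_format!r}")


image = Path("extracted_images/images/page_1/img00001.png")
print(watermark_label(image))              # img00001.png
print(watermark_label(image, "filepath"))  # extracted_images/images/page_1/img00001.png
```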

## Output Structure

Images are organized using a page-based structure:

```
output_dir/
├── images/
│   ├── page_1/
│   │   ├── img00001.png
│   │   └── img00002.jpg
│   ├── page_2/
│   │   └── img00003.png
│   └── page_N/
│       └── imgXXXXX.ext
└── annotated_pdf.pdf  (if using watermarker)
```
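
Given the layout above, extracted files can be collected per page with the standard library alone. The following is a minimal sketch that assumes the `images/page_<N>/` directory naming shown in the tree; `images_by_page` is a hypothetical helper, not a function exported by the package.

```python
from pathlib import Path
from typing import Dict, List


def images_by_page(output_dir: Path) -> Dict[int, List[Path]]:
    """Hypothetical helper: map page number -> sorted image files."""
    pages: Dict[int, List[Path]] = {}
    for page_dir in (output_dir / "images").glob("page_*"):
        # Directory names look like "page_12"; keep only numeric suffixes.
        suffix = page_dir.name[len("page_"):]
        if page_dir.is_dir() and suffix.isdigit():
            pages[int(suffix)] = sorted(
                p for p in page_dir.iterdir() if p.is_file()
            )
    # Return pages in ascending numeric order (page_2 before page_10).
    return dict(sorted(pages.items()))


if __name__ == "__main__":
    for page, files in images_by_page(Path("extracted_images")).items():
        print(page, [f.name for f in files])
```

Sorting on the parsed integer rather than the directory name avoids the lexicographic ordering in which `page_10` would come before `page_2`.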

## Advanced Usage

### Using with Azure Blob Storage

```python
from pathlib import Path
from pdf_image_extract_annotate import PDFImageExtractor, ExtractionConfig

config = ExtractionConfig(
    output_dir="my-container",
    blob_connection_string="DefaultEndpointsProtocol=https;..."
)

extractor = PDFImageExtractor(config)
result = extractor.extract_all_images(Path("document.pdf"))
```
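
Connection strings contain account keys, so it is usually safer to read them from the environment than to hard-code them. The sketch below uses only the standard library; `connection_string_from_env` is a hypothetical helper, and the variable name `AZURE_STORAGE_CONNECTION_STRING` is a common convention rather than something this package is known to read on its own.

```python
import os


def connection_string_from_env(
    var_name: str = "AZURE_STORAGE_CONNECTION_STRING",
) -> str:
    """Hypothetical helper: fetch the connection string or fail loudly."""
    value = os.environ.get(var_name, "").strip()
    if not value:
        raise RuntimeError(
            f"Set {var_name} before enabling Azure Blob Storage output"
        )
    return value


# Intended use with the config shown above (not executed here):
# config = ExtractionConfig(
#     output_dir="my-container",
#     blob_connection_string=connection_string_from_env(),
# )
```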

### Custom Image Filtering

```python
from pdf_image_extract_annotate import PDFImageExtractor, ExtractionConfig

# Only extract large, high-quality images
config = ExtractionConfig(
    output_dir="high_quality_images",
    dim_limit=200,      # At least 200px in smallest dimension
    rel_size=0.5,       # At least 50% of uncompressed size
    abs_size=50000      # At least 50KB
)
```

### Dependency Injection in Larger Projects

```python
from pathlib import Path
from pdf_image_extract_annotate import PDFImageWatermarker, ExtractionConfig, WatermarkConfig

class DocumentProcessor:
    def __init__(self, extraction_config: ExtractionConfig, watermark_config: WatermarkConfig):
        self.extraction_config = extraction_config
        self.watermark_config = watermark_config

    def process_document(self, pdf_path: Path):
        watermarker = PDFImageWatermarker(
            pdf_path=pdf_path,
            extraction_config=self.extraction_config,
            watermark_config=self.watermark_config
        )
        return watermarker.process_pdf_with_watermarks()
```

## API Reference

### Classes

- `PDFImageExtractor`: Core image extraction functionality
- `PDFImageWatermarker`: Extended extractor with watermarking capabilities
- `ExtractionConfig`: Configuration for image extraction
- `WatermarkConfig`: Configuration for watermark appearance
- `ImageMetadata`: Metadata for extracted images
- `ImageWatermarkEntry`: Entry for images with watermark information
- `WatermarkResult`: Result of the PDF watermarking process

## Requirements

- Python 3.11+ (the published package metadata declares `requires_python >=3.8`)
- PyMuPDF >= 1.23.0
- pydantic >= 2.0.0
- azure-storage-blob >= 12.0.0 (optional, for Azure support)

## Development

### Setting up development environment

```bash
# Clone the repository
git clone https://github.com/thijshakkenbergecolab/pdf-image-extract-annotate
cd pdf-image-extract-annotate

# Install in development mode with dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Format code
black pdf_image_extract_annotate tests

# Type checking
mypy pdf_image_extract_annotate
```

### Running Tests

```bash
# Run all tests
pytest

# Run with coverage
pytest --cov=pdf_image_extract_annotate

# Run specific test file
pytest tests/test_extractor.py
```

## License

MIT License - see LICENSE file for details

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request

## Support

For issues, questions, or suggestions, please open an issue on GitHub.

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/thijshakkenberg/pdf-image-extract-annotate",
    "name": "pdf-image-extract-annotate",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "pdf, image, extraction, watermark, annotation",
    "author": "Thijs Hakkenberg",
    "author_email": "Thijs Hakkenberg <thijs.hakkenberg@ecolab.com>",
    "download_url": "https://files.pythonhosted.org/packages/88/ad/b356371f75ed4e0bcee06d27da2e89f3e9e11de00f3af341fdceffad96c0/pdf_image_extract_annotate-1.0.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "Extract images from PDFs and create annotated versions with watermarks",
    "version": "1.0.0",
    "project_urls": {
        "Homepage": "https://github.com/thijshakkenberg/pdf-image-extract-annotate"
    },
    "split_keywords": [
        "pdf",
        " image",
        " extraction",
        " watermark",
        " annotation"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "dbf94049d465e73c9593e953838f1235f6d7213529e14adafeb0945f28988c76",
                "md5": "047a7dd424c501e5a3f023b60bd2cb33",
                "sha256": "d4a68dd65b961c2d4470026808a64532b96a78206525b122699b761167861afa"
            },
            "downloads": -1,
            "filename": "pdf_image_extract_annotate-1.0.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "047a7dd424c501e5a3f023b60bd2cb33",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 24731,
            "upload_time": "2025-10-21T12:45:29",
            "upload_time_iso_8601": "2025-10-21T12:45:29.881175Z",
            "url": "https://files.pythonhosted.org/packages/db/f9/4049d465e73c9593e953838f1235f6d7213529e14adafeb0945f28988c76/pdf_image_extract_annotate-1.0.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "88adb356371f75ed4e0bcee06d27da2e89f3e9e11de00f3af341fdceffad96c0",
                "md5": "d62509f097acd17ba390ac8e69488933",
                "sha256": "61783c6d8b0bd305ae6e4cb65fda801b81d228f3011bf10bf173eab0f311415e"
            },
            "downloads": -1,
            "filename": "pdf_image_extract_annotate-1.0.0.tar.gz",
            "has_sig": false,
            "md5_digest": "d62509f097acd17ba390ac8e69488933",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 21496,
            "upload_time": "2025-10-21T12:45:31",
            "upload_time_iso_8601": "2025-10-21T12:45:31.329153Z",
            "url": "https://files.pythonhosted.org/packages/88/ad/b356371f75ed4e0bcee06d27da2e89f3e9e11de00f3af341fdceffad96c0/pdf_image_extract_annotate-1.0.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-10-21 12:45:31",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "thijshakkenberg",
    "github_project": "pdf-image-extract-annotate",
    "github_not_found": true,
    "lcname": "pdf-image-extract-annotate"
}
