# PDFImageExtractAnnotate
A Python package for extracting images from PDF documents and creating annotated versions with watermarks showing the extracted image filenames.
## Features
- **Image Extraction**: Extract all images from PDF documents with configurable filters
- **Page-based Organization**: Images are organized by page number for easy reference
- **Watermark Annotation**: Add watermarks to the original PDF showing extracted image filenames
- **Flexible Filtering**: Filter images by dimensions, file size, or relative compression
- **Azure Blob Storage Support**: Optional support for storing images in Azure Blob Storage
- **Customizable Watermarks**: Configure font size, color, background, and text format
## Installation
### From PyPI
```bash
pip install pdf-image-extract-annotate
```
### From Source
```bash
git clone https://github.com/thijshakkenbergecolab/pdf-image-extract-annotate
cd pdf-image-extract-annotate
pip install -e .
```
### With Azure Support
```bash
# Quoted so that shells such as zsh do not expand the square brackets
pip install "pdf-image-extract-annotate[azure]"
```
## Quick Start
### Basic Image Extraction
```python
from pathlib import Path
from pdf_image_extract_annotate import PDFImageExtractor, ExtractionConfig
# Configure extraction
config = ExtractionConfig(
    output_dir="extracted_images",
    dim_limit=50,   # Minimum dimension in pixels
    abs_size=1000,  # Minimum file size in bytes
)
# Extract images
extractor = PDFImageExtractor(config)
result = extractor.extract_all_images(Path("document.pdf"))
print(f"Extracted {result['images_extracted']} images")
print(f"Saved to: {result['output_directory']}")
```
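To run the same extraction over a folder of PDFs, a thin loop around the extractor is enough. The sketch below is not part of the package: `extract_from_folder` is a hypothetical helper, and it assumes only that the callable passed in returns a mapping with an `images_extracted` key, as `extract_all_images` does in the example above.

```python
from pathlib import Path
from typing import Callable, Dict, Mapping


def extract_from_folder(
    folder: Path,
    extract: Callable[[Path], Mapping],
) -> Dict[str, int]:
    """Hypothetical helper: call `extract` on every *.pdf directly inside `folder`.

    Returns a dict mapping each PDF file name to its extracted image count,
    read from the result's 'images_extracted' key.
    """
    counts: Dict[str, int] = {}
    # Sort so the processing order is stable from run to run.
    for pdf_path in sorted(folder.glob("*.pdf")):
        result = extract(pdf_path)
        counts[pdf_path.name] = result["images_extracted"]
    return counts


# Intended use with the extractor from the example above:
# counts = extract_from_folder(Path("pdfs"), extractor.extract_all_images)
```

Passing the bound method as a callable keeps the helper independent of the package, so it can be exercised with a stub in tests.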
### Extract and Watermark PDF
```python
from pathlib import Path
from pdf_image_extract_annotate import PDFImageWatermarker, WatermarkConfig
# Configure watermark appearance
watermark_config = WatermarkConfig(
    font_size=10,
    font_color=(1.0, 0.0, 0.0),             # Red text
    background_color=(1.0, 1.0, 1.0, 0.7),  # Semi-transparent white
    text_format="filename",                 # Show just the filename
)
# Process PDF
watermarker = PDFImageWatermarker(
    pdf_path=Path("document.pdf"),
    watermark_config=watermark_config,
)
result = watermarker.process_pdf_with_watermarks()
# Save the annotated PDF
result.output_pdf.save("annotated_document.pdf")
result.output_pdf.close()
print(f"Extracted {result.images_extracted} images")
print(f"Watermarked {result.images_watermarked} images")
```
## Configuration Options
### ExtractionConfig
- `output_dir` (str): Directory to save extracted images
- `dim_limit` (int): Minimum dimension filter (0 = no limit)
- `rel_size` (float): Relative size filter (0.0-1.0, 0 = no limit)
- `abs_size` (int): Absolute size filter in bytes (0 = no limit)
- `blob_connection_string` (str, optional): Azure Blob Storage connection string
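
The three filters are easiest to read as independent minimum thresholds. The sketch below illustrates that reading; it is not the package's implementation. `ImageStats` and `passes_filters` are hypothetical names, and the exact boundary behaviour (inclusive minimums, `0` disabling a filter) is an assumption based on the option descriptions above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageStats:
    """Hypothetical summary of one embedded image (not part of the package API)."""
    width: int         # pixels
    height: int        # pixels
    stored_bytes: int  # size of the image data as stored in the PDF
    raw_bytes: int     # estimated uncompressed size, e.g. width * height * channels


def passes_filters(img: ImageStats, dim_limit: int = 0,
                   rel_size: float = 0.0, abs_size: int = 0) -> bool:
    """Return True if the image meets every enabled threshold.

    Assumed semantics: each threshold is a minimum, and 0 disables it.
    """
    # dim_limit: the smaller of width and height must reach the limit.
    if dim_limit and min(img.width, img.height) < dim_limit:
        return False
    # abs_size: the stored image data must be at least this many bytes.
    if abs_size and img.stored_bytes < abs_size:
        return False
    # rel_size: stored size relative to the uncompressed size.
    if rel_size and img.raw_bytes > 0 and img.stored_bytes / img.raw_bytes < rel_size:
        return False
    return True


icon = ImageStats(width=32, height=32, stored_bytes=900, raw_bytes=32 * 32 * 3)
photo = ImageStats(width=800, height=600, stored_bytes=120_000, raw_bytes=800 * 600 * 3)

print(passes_filters(icon, dim_limit=50, abs_size=1000))   # False: 32 px is below dim_limit
print(passes_filters(photo, dim_limit=50, abs_size=1000))  # True: both thresholds are met
```

Under these assumptions, a heavily compressed photo can pass `dim_limit` and `abs_size` and still be rejected by a high `rel_size`, because its stored size is a small fraction of its uncompressed size.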
### WatermarkConfig
- `font_size` (int): Font size for watermark text
- `font_color` (tuple): RGB color values (0.0-1.0)
- `background_color` (tuple): RGBA background color
- `text_format` (str): Format for watermark text ("filename", "filepath", or "custom")
- `padding` (int): Padding around text in pixels
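
As a rough illustration of what the three `text_format` values could produce, the sketch below maps an image path to a label. It is a guess at the behaviour, not the package's code: `watermark_label` and its `custom_text` parameter are hypothetical, and how the package actually sources the text for `"custom"` is not documented here.

```python
from pathlib import Path
from typing import Optional


def watermark_label(image_path: Path, text_format: str = "filename",
                    custom_text: Optional[str] = None) -> str:
    """Hypothetical helper: choose the text shown in a watermark."""
    if text_format == "filename":
        # Just the final path component, e.g. "img00001.png".
        return image_path.name
    if text_format == "filepath":
        # The whole path, with forward slashes on every platform.
        return image_path.as_posix()
    if text_format == "custom":
        # Assumption: the caller supplies the text for the custom format.
        if custom_text is None:
            raise ValueError("custom_text is required when text_format='custom'")
        return custom_text
    raise ValueError(f"unknown text_format: {text_format!r}")


path = Path("images/page_1/img00001.png")
print(watermark_label(path))              # img00001.png
print(watermark_label(path, "filepath"))  # images/page_1/img00001.png
```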
## Output Structure
Images are organized using a page-based structure:
```
output_dir/
├── images/
│   ├── page_1/
│   │   ├── img00001.png
│   │   └── img00002.jpg
│   ├── page_2/
│   │   └── img00003.png
│   └── page_N/
│       └── imgXXXXX.ext
└── annotated_pdf.pdf (if using watermarker)
```
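
The layout above can be navigated with the standard library alone. The following sketch uses two hypothetical helpers, `image_path` and `images_by_page`, which are not part of the package. It assumes the `page_<number>` folder names and five-digit `img` numbering shown in the tree.

```python
from pathlib import Path
from typing import Dict, List


def image_path(output_dir: Path, page_number: int, image_index: int, ext: str) -> Path:
    """Build the expected location of one extracted image."""
    return output_dir / "images" / f"page_{page_number}" / f"img{image_index:05d}.{ext}"


def images_by_page(output_dir: Path) -> Dict[int, List[Path]]:
    """Group the files found under output_dir/images by page number."""
    pages: Dict[int, List[Path]] = {}
    for page_dir in (output_dir / "images").glob("page_*"):
        number = page_dir.name.split("_", 1)[1]
        # Skip anything that is not a page_<number> directory.
        if not page_dir.is_dir() or not number.isdigit():
            continue
        pages[int(number)] = sorted(p for p in page_dir.iterdir() if p.is_file())
    return pages


print(image_path(Path("output_dir"), 1, 1, "png").as_posix())
# output_dir/images/page_1/img00001.png
```

`images_by_page(Path("extracted_images"))` would then return a dictionary such as `{1: [...], 2: [...]}`, which is convenient when matching extracted files back to the pages of the annotated PDF.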
## Advanced Usage
### Using with Azure Blob Storage
```python
from pathlib import Path
from pdf_image_extract_annotate import PDFImageExtractor, ExtractionConfig
config = ExtractionConfig(
    output_dir="my-container",
    blob_connection_string="DefaultEndpointsProtocol=https;...",
)
extractor = PDFImageExtractor(config)
result = extractor.extract_all_images(Path("document.pdf"))
```
### Custom Image Filtering
```python
from pdf_image_extract_annotate import PDFImageExtractor, ExtractionConfig
# Only extract large, high-quality images
config = ExtractionConfig(
    output_dir="high_quality_images",
    dim_limit=200,   # At least 200px in smallest dimension
    rel_size=0.5,    # At least 50% of uncompressed size
    abs_size=50000,  # At least 50KB
)
```
### Dependency Injection in Larger Projects
```python
from pathlib import Path
from pdf_image_extract_annotate import PDFImageWatermarker, ExtractionConfig, WatermarkConfig
class DocumentProcessor:
    def __init__(self, extraction_config: ExtractionConfig, watermark_config: WatermarkConfig):
        self.extraction_config = extraction_config
        self.watermark_config = watermark_config

    def process_document(self, pdf_path: Path):
        watermarker = PDFImageWatermarker(
            pdf_path=pdf_path,
            extraction_config=self.extraction_config,
            watermark_config=self.watermark_config,
        )
        return watermarker.process_pdf_with_watermarks()
```
## API Reference
### Classes
- `PDFImageExtractor`: Core image extraction functionality
- `PDFImageWatermarker`: Extended extractor with watermarking capabilities
- `ExtractionConfig`: Configuration for image extraction
- `WatermarkConfig`: Configuration for watermark appearance
- `ImageMetadata`: Metadata for extracted images
- `ImageWatermarkEntry`: Entry for images with watermark information
- `WatermarkResult`: Result of the PDF watermarking process
## Requirements
- Python 3.11+
- PyMuPDF >= 1.23.0
- pydantic >= 2.0.0
- azure-storage-blob >= 12.0.0 (optional, for Azure support)
## Development
### Setting up development environment
```bash
# Clone the repository
git clone https://github.com/thijshakkenbergecolab/pdf-image-extract-annotate
cd pdf-image-extract-annotate
# Install in development mode with dev dependencies
pip install -e ".[dev]"
# Run tests
pytest
# Format code
black pdf_image_extract_annotate tests
# Type checking
mypy pdf_image_extract_annotate
```
### Running Tests
```bash
# Run all tests
pytest
# Run with coverage
pytest --cov=pdf_image_extract_annotate
# Run specific test file
pytest tests/test_extractor.py
```
## License
MIT License - see LICENSE file for details
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
1. Fork the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request
## Support
For issues, questions, or suggestions, please open an issue on GitHub.
Raw data
{
"_id": null,
"home_page": "https://github.com/thijshakkenberg/pdf-image-extract-annotate",
"name": "pdf-image-extract-annotate",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": null,
"keywords": "pdf, image, extraction, watermark, annotation",
"author": "Thijs Hakkenberg",
"author_email": "Thijs Hakkenberg <thijs.hakkenberg@ecolab.com>",
"download_url": "https://files.pythonhosted.org/packages/88/ad/b356371f75ed4e0bcee06d27da2e89f3e9e11de00f3af341fdceffad96c0/pdf_image_extract_annotate-1.0.0.tar.gz",
"platform": null,
"description": "# PDFImageExtractAnnotate\n\nA Python package for extracting images from PDF documents and creating annotated versions with watermarks showing the extracted image filenames.\n\n## Features\n\n- **Image Extraction**: Extract all images from PDF documents with configurable filters\n- **Page-based Organization**: Images are organized by page number for easy reference\n- **Watermark Annotation**: Add watermarks to the original PDF showing extracted image filenames\n- **Flexible Filtering**: Filter images by dimensions, file size, or relative compression\n- **Azure Blob Storage Support**: Optional support for storing images in Azure Blob Storage\n- **Customizable Watermarks**: Configure font size, color, background, and text format\n\n## Installation\n\n### From PyPI (when published)\n\n```bash\npip install pdf-image-extract-annotate\n```\n\n### From Source\n\n```bash\ngit clone https://github.com/thijshakkenbergecolab/pdf-image-extract-annotate\ncd pdf-image-extract-annotate\npip install -e .\n```\n\n### With Azure Support\n\n```bash\npip install pdf-image-extract-annotate[azure]\n```\n\n## Quick Start\n\n### Basic Image Extraction\n\n```python\nfrom pathlib import Path\nfrom pdf_image_extract_annotate import PDFImageExtractor, ExtractionConfig\n\n# Configure extraction\nconfig = ExtractionConfig(\n output_dir=\"extracted_images\",\n dim_limit=50, # Minimum dimension in pixels\n abs_size=1000 # Minimum file size in bytes\n)\n\n# Extract images\nextractor = PDFImageExtractor(config)\nresult = extractor.extract_all_images(Path(\"document.pdf\"))\n\nprint(f\"Extracted {result['images_extracted']} images\")\nprint(f\"Saved to: {result['output_directory']}\")\n```\n\n### Extract and Watermark PDF\n\n```python\nfrom pathlib import Path\nfrom pdf_image_extract_annotate import PDFImageWatermarker, WatermarkConfig\n\n# Configure watermark appearance\nwatermark_config = WatermarkConfig(\n font_size=10,\n font_color=(1.0, 0.0, 0.0), # Red text\n 
background_color=(1.0, 1.0, 1.0, 0.7), # Semi-transparent white\n text_format=\"filename\" # Show just the filename\n)\n\n# Process PDF\nwatermarker = PDFImageWatermarker(\n pdf_path=Path(\"document.pdf\"),\n watermark_config=watermark_config\n)\n\nresult = watermarker.process_pdf_with_watermarks()\n\n# Save the annotated PDF\nresult.output_pdf.save(\"annotated_document.pdf\")\nresult.output_pdf.close()\n\nprint(f\"Extracted {result.images_extracted} images\")\nprint(f\"Watermarked {result.images_watermarked} images\")\n```\n\n## Configuration Options\n\n### ExtractionConfig\n\n- `output_dir` (str): Directory to save extracted images\n- `dim_limit` (int): Minimum dimension filter (0 = no limit)\n- `rel_size` (float): Relative size filter (0.0-1.0, 0 = no limit)\n- `abs_size` (int): Absolute size filter in bytes (0 = no limit)\n- `blob_connection_string` (str, optional): Azure Blob Storage connection string\n\n### WatermarkConfig\n\n- `font_size` (int): Font size for watermark text\n- `font_color` (tuple): RGB color values (0.0-1.0)\n- `background_color` (tuple): RGBA background color\n- `text_format` (str): Format for watermark text (\"filename\", \"filepath\", or \"custom\")\n- `padding` (int): Padding around text in pixels\n\n## Output Structure\n\nImages are organized using a page-based structure:\n\n```\noutput_dir/\n\u251c\u2500\u2500 images/\n\u2502 \u251c\u2500\u2500 page_1/\n\u2502 \u2502 \u251c\u2500\u2500 img00001.png\n\u2502 \u2502 \u2514\u2500\u2500 img00002.jpg\n\u2502 \u251c\u2500\u2500 page_2/\n\u2502 \u2502 \u2514\u2500\u2500 img00003.png\n\u2502 \u2514\u2500\u2500 page_N/\n\u2502 \u2514\u2500\u2500 imgXXXXX.ext\n\u2514\u2500\u2500 annotated_pdf.pdf (if using watermarker)\n```\n\n## Advanced Usage\n\n### Using with Azure Blob Storage\n\n```python\nfrom pdf_image_extract_annotate import PDFImageExtractor, ExtractionConfig\n\nconfig = ExtractionConfig(\n output_dir=\"my-container\",\n 
blob_connection_string=\"DefaultEndpointsProtocol=https;...\"\n)\n\nextractor = PDFImageExtractor(config)\nresult = extractor.extract_all_images(Path(\"document.pdf\"))\n```\n\n### Custom Image Filtering\n\n```python\nfrom pdf_image_extract_annotate import PDFImageExtractor, ExtractionConfig\n\n# Only extract large, high-quality images\nconfig = ExtractionConfig(\n output_dir=\"high_quality_images\",\n dim_limit=200, # At least 200px in smallest dimension\n rel_size=0.5, # At least 50% of uncompressed size\n abs_size=50000 # At least 50KB\n)\n```\n\n### Dependency Injection in Larger Projects\n\n```python\nfrom pathlib import Path\nfrom pdf_image_extract_annotate import PDFImageWatermarker, ExtractionConfig, WatermarkConfig\n\nclass DocumentProcessor:\n def __init__(self, extraction_config: ExtractionConfig, watermark_config: WatermarkConfig):\n self.extraction_config = extraction_config\n self.watermark_config = watermark_config\n\n def process_document(self, pdf_path: Path):\n watermarker = PDFImageWatermarker(\n pdf_path=pdf_path,\n extraction_config=self.extraction_config,\n watermark_config=self.watermark_config\n )\n return watermarker.process_pdf_with_watermarks()\n```\n\n## API Reference\n\n### Classes\n\n- `PDFImageExtractor`: Core image extraction functionality\n- `PDFImageWatermarker`: Extended extractor with watermarking capabilities\n- `ExtractionConfig`: Configuration for image extraction\n- `WatermarkConfig`: Configuration for watermark appearance\n- `ImageMetadata`: Metadata for extracted images\n- `ImageWatermarkEntry`: Entry for images with watermark information\n- `WatermarkResult`: Result of the PDF watermarking process\n\n## Requirements\n\n- Python 3.11+\n- PyMuPDF >= 1.23.0\n- pydantic >= 2.0.0\n- azure-storage-blob >= 12.0.0 (optional, for Azure support)\n\n## Development\n\n### Setting up development environment\n\n```bash\n# Clone the repository\ngit clone https://github.com/thijshakkenbergecolab/pdf-image-extract-annotate\ncd 
pdf-image-extract-annotate\n\n# Install in development mode with dev dependencies\npip install -e \".[dev]\"\n\n# Run tests\npytest\n\n# Format code\nblack pdf_image_extract_annotate tests\n\n# Type checking\nmypy pdf_image_extract_annotate\n```\n\n### Running Tests\n\n```bash\n# Run all tests\npytest\n\n# Run with coverage\npytest --cov=pdf_image_extract_annotate\n\n# Run specific test file\npytest tests/test_extractor.py\n```\n\n## License\n\nMIT License - see LICENSE file for details\n\n## Contributing\n\nContributions are welcome! Please feel free to submit a Pull Request.\n\n1. Fork the repository\n2. Create your feature branch (`git checkout -b feature/AmazingFeature`)\n3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)\n4. Push to the branch (`git push origin feature/AmazingFeature`)\n5. Open a Pull Request\n\n## Support\n\nFor issues, questions, or suggestions, please open an issue on GitHub.\n",
"bugtrack_url": null,
"license": null,
"summary": "Extract images from PDFs and create annotated versions with watermarks",
"version": "1.0.0",
"project_urls": {
"Homepage": "https://github.com/thijshakkenberg/pdf-image-extract-annotate"
},
"split_keywords": [
"pdf",
" image",
" extraction",
" watermark",
" annotation"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "dbf94049d465e73c9593e953838f1235f6d7213529e14adafeb0945f28988c76",
"md5": "047a7dd424c501e5a3f023b60bd2cb33",
"sha256": "d4a68dd65b961c2d4470026808a64532b96a78206525b122699b761167861afa"
},
"downloads": -1,
"filename": "pdf_image_extract_annotate-1.0.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "047a7dd424c501e5a3f023b60bd2cb33",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 24731,
"upload_time": "2025-10-21T12:45:29",
"upload_time_iso_8601": "2025-10-21T12:45:29.881175Z",
"url": "https://files.pythonhosted.org/packages/db/f9/4049d465e73c9593e953838f1235f6d7213529e14adafeb0945f28988c76/pdf_image_extract_annotate-1.0.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "88adb356371f75ed4e0bcee06d27da2e89f3e9e11de00f3af341fdceffad96c0",
"md5": "d62509f097acd17ba390ac8e69488933",
"sha256": "61783c6d8b0bd305ae6e4cb65fda801b81d228f3011bf10bf173eab0f311415e"
},
"downloads": -1,
"filename": "pdf_image_extract_annotate-1.0.0.tar.gz",
"has_sig": false,
"md5_digest": "d62509f097acd17ba390ac8e69488933",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 21496,
"upload_time": "2025-10-21T12:45:31",
"upload_time_iso_8601": "2025-10-21T12:45:31.329153Z",
"url": "https://files.pythonhosted.org/packages/88/ad/b356371f75ed4e0bcee06d27da2e89f3e9e11de00f3af341fdceffad96c0/pdf_image_extract_annotate-1.0.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-10-21 12:45:31",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "thijshakkenberg",
"github_project": "pdf-image-extract-annotate",
"github_not_found": true,
"lcname": "pdf-image-extract-annotate"
}