# DeepSeek OCR Encoder
A lightweight, flexible encoder for vision tasks built on DeepSeek-OCR. This package provides an optimized, memory-lean encoder that combines a SAM-base vision backbone with CLIP for efficient vision token generation.
## Features
- 🚀 **Optimized Performance**: Leverages CUDA graphs, torch.compile, and memory-efficient techniques
- 💾 **Memory Efficient**: Automatically removes unused model components to save RAM/VRAM
- 🎯 **Easy to Use**: Simple API - just import and encode
- ⚡ **Fast Inference**: Support for BF16, channels_last memory layout, and optional CUDA graph capture
- 🔧 **Flexible**: Configurable device, dtype, and optimization settings
- 📄 **PDF Support**: Encode multi-page PDF documents with automatic page-to-image conversion
## About DeepSeek-OCR
This encoder is based on [DeepSeek-OCR](https://github.com/deepseek-ai/DeepSeek-OCR), a state-of-the-art vision-language model designed for optical character recognition and document understanding. The recent paper ["DeepSeek-OCR: Contexts Optical Compression"](https://arxiv.org/html/2510.18234v1) (arXiv:2510.18234v1) introduces innovative optical compression techniques for long text contexts using vision tokens.
**Key highlights from the paper:**
- 📊 **High Precision OCR**: Achieves up to ~97% OCR precision at less than 10× compression
- 🗜️ **Efficient Compression**: Maintains ~60% precision even at 20× compression ratios
- 📈 **Strong Benchmark Results**: Significant improvements on OmniDocBench
- ⚡ **High-Throughput Data Generation**: Enables efficient processing of large document datasets
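For intuition about these ratios: a 1024×1024 page yields 256 vision tokens from this encoder, so 10× compression corresponds to representing roughly 2,560 text tokens' worth of content in those 256 tokens (an illustrative calculation, not a figure from the paper).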
This encoder package provides an optimized implementation for extracting vision tokens from the DeepSeek-OCR model, making it easy to integrate into your own applications.
## Installation
```bash
uv add deepseek-ocr-encoder
```
Or install from source:
```bash
git clone https://github.com/dwojcik92/deepseek-ocr-encoder.git
cd deepseek-ocr-encoder
uv pip install .
```
**Important:** This package requires `transformers>=4.30.0,<4.48.0`. If you have a newer version already installed, you may need to downgrade:
```bash
uv pip install 'transformers>=4.30.0,<4.48.0'
```
## Quick Start
### Simple One-Line Initialization (Recommended)
```python
from deepseek_ocr_encoder import DeepSeekOCREncoder

# One-line initialization - automatically handles device, dtype, and model loading
encoder = DeepSeekOCREncoder.from_pretrained("deepseek-ai/DeepSeek-OCR")

# Encode an image
vision_tokens = encoder("your_image.png")
# Returns: torch.Tensor of shape [1, N, 1024] where N=256 for 1024x1024 input
```
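A quick sanity check on the returned tensor (shape and default dtype as documented in this README):
```python
print(vision_tokens.shape)  # torch.Size([1, 256, 1024]) for a 1024x1024 input
print(vision_tokens.dtype)  # torch.bfloat16 on CUDA (the documented default)
```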
### Advanced Usage with Manual Model Loading
If you need more control over the model loading process:
```python
from transformers import AutoModel
import torch
from deepseek_ocr_encoder import DeepSeekOCREncoder
from PIL import Image

# Load the base DeepSeek-OCR model
model_name = "deepseek-ai/DeepSeek-OCR"
model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    use_safetensors=True,
    torch_dtype=torch.bfloat16,
    attn_implementation="eager",
)
model = model.eval().to("cuda", dtype=torch.bfloat16)

# Create the optimized encoder
encoder = DeepSeekOCREncoder(
    full_model=model,
    device="cuda",
    dtype=torch.bfloat16,
    freeze=True,
    eager_to_device=True,
    precompute_pos_for_1024=True,
    use_compile=False,  # Set True for PyTorch 2.3+ with extra fusion
)

# Optional: Capture CUDA graph for even faster inference
encoder.capture_cudagraph(batch_size=1, H=1024, W=1024)

# Encode an image
image_path = "your_image.png"
vision_tokens = encoder.encode(image_path)
# Returns: torch.Tensor of shape [1, N, 1024] where N=256 for 1024x1024 input

# Or use with PIL Image
img = Image.open(image_path).convert("RGB")
vision_tokens = encoder(img)  # Shorthand for encoder.encode(img)

# Encode a PDF document (multi-page support)
pdf_path = "document.pdf"
vision_tokens_list = encoder.encode(pdf_path)
# Returns: List of torch.Tensor, one per page, each of shape [1, N, 1024]

# Process each page
for page_num, page_tokens in enumerate(vision_tokens_list):
    print(f"Page {page_num + 1}: {page_tokens.shape}")
```
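For context on the PDF path: the package depends on PyMuPDF, and page-to-image conversion generally looks like the sketch below. This is an illustrative helper, not this package's internal API; the function name and DPI choice are assumptions.
```python
import fitz  # PyMuPDF
from PIL import Image

def pdf_to_images(pdf_path: str, dpi: int = 144) -> list[Image.Image]:
    """Rasterize each PDF page to an RGB PIL image (illustrative helper)."""
    images = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            pix = page.get_pixmap(dpi=dpi)  # render page at the requested resolution
            images.append(Image.frombytes("RGB", (pix.width, pix.height), pix.samples))
    return images
```
Each such image can then be fed to `encoder(img)` individually, which matches the list-of-tensors return type for PDFs.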
## API Reference
### DeepSeekOCREncoder
The main encoder class that wraps the DeepSeek-OCR model for efficient vision token extraction.
#### Class Methods
##### `from_pretrained(model_name_or_path: str, **kwargs) -> DeepSeekOCREncoder`
**(Recommended)** Load a DeepSeek-OCR model and wrap it with the optimized encoder in one line.
**Parameters:**
- `model_name_or_path` (str, required): Model identifier from Hugging Face Hub (e.g., "deepseek-ai/DeepSeek-OCR") or path to a local checkpoint
- `device` (Optional[Union[str, torch.device]]): Target device (default: auto-detect cuda if available, else cpu)
- `dtype` (Optional[torch.dtype]): Data type for computation (default: bfloat16 on cuda, float32 on cpu)
- `freeze` (bool): Whether to freeze encoder parameters (default: True)
- `eager_to_device` (bool): Move model to device immediately (default: True)
- `precompute_pos_for_1024` (bool): Pre-compute position embeddings for 1024x1024 input (default: True)
- `use_compile` (bool): Enable torch.compile for better performance (requires PyTorch 2.3+, default: False)
- `trust_remote_code` (bool): Whether to trust remote code when loading model (default: True)
- `use_safetensors` (bool): Whether to use safetensors format (default: True)
- `attn_implementation` (str): Attention implementation to use (default: "eager")
- `**model_kwargs`: Additional keyword arguments passed to AutoModel.from_pretrained()
**Returns:**
- Initialized `DeepSeekOCREncoder` ready for inference
**Example:**
```python
# Simple usage
encoder = DeepSeekOCREncoder.from_pretrained("deepseek-ai/DeepSeek-OCR")

# With custom device/dtype
encoder = DeepSeekOCREncoder.from_pretrained(
    "deepseek-ai/DeepSeek-OCR",
    device="cpu",
    dtype=torch.float32,
)

# From local checkpoint
encoder = DeepSeekOCREncoder.from_pretrained("./my-finetuned-model")
```
#### Instance Methods
##### `encode(image: Union[Image.Image, str, os.PathLike]) -> Union[torch.Tensor, List[torch.Tensor]]`
Encode an image or PDF into vision tokens.
**Parameters:**
- `image`: PIL Image, path to an RGB image file, or path to a PDF file
**Returns:**
- For single images: Vision tokens tensor of shape `[1, N, 1024]` where N=256 for 1024×1024 input
- For PDFs: List of vision token tensors, one per page, each of shape `[1, N, 1024]`
**Example:**
```python
# Single image
tokens = encoder.encode("image.png")  # Returns torch.Tensor

# Multi-page PDF
tokens_list = encoder.encode("document.pdf")  # Returns List[torch.Tensor]
for page_tokens in tokens_list:
    print(f"Page shape: {page_tokens.shape}")
```
##### `capture_cudagraph(batch_size: int = 1, H: int = 1024, W: int = 1024)`
Capture a CUDA graph for optimized steady-state inference. Call this once after initialization to enable CUDA graph acceleration.
**Parameters:**
- `batch_size`: Batch size for the graph (default: 1)
- `H`: Input height (default: 1024)
- `W`: Input width (default: 1024)
**Raises:**
- `RuntimeError`: If device is not CUDA
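For readers unfamiliar with CUDA graphs: capture records the kernel launches for one fixed input shape, then replays them with near-zero launch overhead, which is why `capture_cudagraph` fixes `batch_size`, `H`, and `W` up front. A minimal sketch of the general PyTorch pattern (a generic technique, not this package's internals; `model` is a placeholder):
```python
import torch

model = torch.nn.Conv2d(3, 8, 3).cuda().eval()
static_input = torch.randn(1, 3, 1024, 1024, device="cuda")

# Warm up on a side stream before capture
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture one forward pass into a graph
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_output = model(static_input)

# Steady state: copy new data into the static buffer and replay
static_input.copy_(torch.randn(1, 3, 1024, 1024, device="cuda"))
g.replay()  # static_output now holds the new result
```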
##### `__call__(image: Union[Image.Image, str, os.PathLike]) -> Union[torch.Tensor, List[torch.Tensor]]`
Convenience method, equivalent to `encode()`. Supports both single images and multi-page PDFs.
## Custom Preprocessing Hooks
The encoder now supports configurable preprocessing, allowing you to customize the image preprocessing pipeline without forking the codebase. This is useful for:
- Using native image resolutions
- Applying domain-specific preprocessing (medical images, documents, etc.)
- Reusing existing preprocessing pipelines
- Fine-tuning preprocessing parameters
### Basic Examples
#### Custom Resize Dimensions
```python
# Use 512x512 instead of default 1024x1024
encoder = DeepSeekOCREncoder.from_pretrained(
    "deepseek-ai/DeepSeek-OCR",
    resize_size=(512, 512),
)

# Keep native resolution (no resizing)
encoder = DeepSeekOCREncoder.from_pretrained(
    "deepseek-ai/DeepSeek-OCR",
    resize_size=None,
)

# Use non-square dimensions
encoder = DeepSeekOCREncoder.from_pretrained(
    "deepseek-ai/DeepSeek-OCR",
    resize_size=(768, 1024),  # (height, width)
)
```
#### Custom Normalization
```python
# Use ImageNet normalization instead of CLIP
encoder = DeepSeekOCREncoder.from_pretrained(
    "deepseek-ai/DeepSeek-OCR",
    normalization_mean=(0.485, 0.456, 0.406),
    normalization_std=(0.229, 0.224, 0.225),
)
```
#### Custom Interpolation Mode
```python
from torchvision import transforms

# Use LANCZOS for higher quality
encoder = DeepSeekOCREncoder.from_pretrained(
    "deepseek-ai/DeepSeek-OCR",
    resize_interpolation=transforms.InterpolationMode.LANCZOS,
    resize_antialias=True,
)
```
### Advanced: Custom Preprocessing Transform
For full control, provide your own preprocessing function:
```python
from torchvision import transforms
from PIL import Image
import torch

def my_preprocessing(img: Image.Image) -> torch.Tensor:
    """Custom preprocessing with domain-specific augmentations."""
    transform = transforms.Compose([
        transforms.Resize((1024, 1024)),
        transforms.ColorJitter(brightness=0.1, contrast=0.2),
        transforms.ToTensor(),
        transforms.Normalize(
            mean=(0.48145466, 0.4578275, 0.40821073),
            std=(0.26862954, 0.26130258, 0.27577711),
        ),
    ])
    return transform(img)

encoder = DeepSeekOCREncoder.from_pretrained(
    "deepseek-ai/DeepSeek-OCR",
    preprocessing_transform=my_preprocessing,
)
```
### Pre-processed Tensor Input
If you need to preprocess images externally (e.g., in a batched data pipeline):
```python
# Create encoder that accepts pre-processed tensors
encoder = DeepSeekOCREncoder.from_pretrained(
    "deepseek-ai/DeepSeek-OCR",
    skip_default_preprocessing=True,
)

# Your external preprocessing
img = Image.open("image.jpg").convert("RGB")
preprocessed = my_external_pipeline(img)  # Returns torch.Tensor [C, H, W]

# Encode the pre-processed tensor
tokens = encoder._encode_single_image(preprocessed)
```
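`my_external_pipeline` above is a placeholder. A minimal stand-in with torchvision that reproduces the documented defaults (1024×1024 bicubic resize plus CLIP normalization) might look like:
```python
from torchvision import transforms

my_external_pipeline = transforms.Compose([
    transforms.Resize(
        (1024, 1024),
        interpolation=transforms.InterpolationMode.BICUBIC,
        antialias=True,
    ),
    transforms.ToTensor(),
    transforms.Normalize(  # CLIP normalization constants
        mean=(0.48145466, 0.4578275, 0.40821073),
        std=(0.26862954, 0.26130258, 0.27577711),
    ),
])
```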
### Preprocessing Parameters
When using `from_pretrained()` or the constructor, you can configure:
- `preprocessing_transform`: Custom callable that takes PIL Image and returns torch.Tensor (overrides all other settings)
- `resize_size`: Target size (int or tuple). Default: (1024, 1024). Set to None for native resolution
- `resize_interpolation`: Interpolation mode (default: `BICUBIC`)
- `resize_antialias`: Enable antialiasing during resize (default: True)
- `normalization_mean`: RGB mean values (default: CLIP normalization)
- `normalization_std`: RGB std values (default: CLIP normalization)
- `skip_default_preprocessing`: If True, accept only pre-processed tensors (default: False)
See `examples/custom_preprocessing.py` for more detailed examples.
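These parameters compose. A sketch combining several of them, using only kwargs documented in the list above (the specific values are arbitrary illustrations):
```python
from torchvision import transforms
from deepseek_ocr_encoder import DeepSeekOCREncoder

encoder = DeepSeekOCREncoder.from_pretrained(
    "deepseek-ai/DeepSeek-OCR",
    resize_size=(768, 768),                  # int or (height, width)
    resize_interpolation=transforms.InterpolationMode.BICUBIC,
    resize_antialias=True,
    normalization_mean=(0.48145466, 0.4578275, 0.40821073),  # CLIP defaults
    normalization_std=(0.26862954, 0.26130258, 0.27577711),
)
```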
## Architecture
The encoder implements the following pipeline:
1. **SAM-base encoder** with built-in conv compressor → `[B, 1024, Hs, Ws]`
2. **Flatten** spatial dimensions → `[B, N, 1024]` where N = Hs × Ws
3. **Add CLIP 2D positional embeddings** (without CLS token)
4. **CLIP pre-layernorm + transformer**
5. **Residual connection**: returns `tokens + CLIP(tokens)`
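A minimal sketch of steps 2-5 in tensor terms (shapes taken from the pipeline above; the modules here are placeholders, not the package's internals):
```python
import torch

B, C, Hs, Ws = 1, 1024, 16, 16            # SAM output: [B, 1024, Hs, Ws]
sam_features = torch.randn(B, C, Hs, Ws)

# Step 2: flatten spatial dims -> [B, N, 1024] with N = Hs * Ws = 256
tokens = sam_features.flatten(2).transpose(1, 2)

# Step 3: add CLIP 2D positional embeddings (CLS position dropped)
pos_embed = torch.randn(1, Hs * Ws, C)    # stand-in for the cached embeddings
tokens = tokens + pos_embed

# Steps 4-5: CLIP pre-layernorm + transformer with a residual connection
clip_block = torch.nn.TransformerEncoderLayer(d_model=C, nhead=8, batch_first=True)
output = tokens + clip_block(tokens)      # returns tokens + CLIP(tokens)
print(output.shape)                       # torch.Size([1, 256, 1024])
```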
## Performance Optimizations
This encoder includes several optimizations:
- **Memory layout**: Uses `channels_last` format for conv-heavy operations
- **Precision**: BF16 computation for faster inference on modern GPUs
- **CUDA Graphs**: Optional graph capture for minimal kernel launch overhead
- **torch.compile**: Optional compilation for kernel fusion (PyTorch 2.3+)
- **Memory cleanup**: Removes unused model components (text decoder, LM head, etc.)
- **Position embedding caching**: Pre-computes and caches position embeddings
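A hedged illustration of the first two items, which are generic PyTorch techniques rather than anything specific to this package:
```python
import torch

conv = torch.nn.Conv2d(3, 64, 3).cuda().to(memory_format=torch.channels_last)
x = torch.randn(1, 3, 1024, 1024, device="cuda").to(memory_format=torch.channels_last)

# BF16 autocast plus channels_last lets conv kernels take faster NHWC paths
with torch.autocast("cuda", dtype=torch.bfloat16):
    y = conv(x)
print(y.is_contiguous(memory_format=torch.channels_last))  # True
```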
## Requirements
- Python ≥ 3.12
- PyTorch ≥ 2.0.0
- torchvision ≥ 0.15.0
- **transformers ≥ 4.30.0, < 4.48.0** (see [Troubleshooting](#troubleshooting) for details)
- Pillow ≥ 9.0.0
- PyMuPDF ≥ 1.23.0 (for PDF support)
## Troubleshooting
### ImportError: cannot import name 'LlamaFlashAttention2'
If you encounter this error, it's caused by incompatible transformers versions. The `LlamaFlashAttention2` class was removed in transformers 4.48.0+.
**Solution:**
```bash
uv pip install 'transformers>=4.30.0,<4.48.0'
```
The DeepSeek-OCR model's remote code references `LlamaFlashAttention2`, which exists only in transformers 4.30.0 through 4.47.x; the attention refactor in 4.48.0 removed the class, hence the version pin above.
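To confirm which version is active in your environment before debugging further:
```python
import transformers
print(transformers.__version__)  # should fall in the 4.30.0-4.47.x range
```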
## Development
```bash
# Install with dev dependencies
uv pip install -e ".[dev]"
# Run tests
pytest
```
## License
MIT
## Citation
If you use this encoder in your research, please cite the DeepSeek-OCR paper:
```bibtex
@article{deepseek-ocr-compression,
  title={DeepSeek-OCR: Contexts Optical Compression},
  author={DeepSeek-AI},
  journal={arXiv preprint arXiv:2510.18234},
  year={2025}
}
```
## Resources
- 📄 **Paper**: [DeepSeek-OCR: Contexts Optical Compression](https://arxiv.org/html/2510.18234v1) (arXiv:2510.18234v1)
- 💻 **Official Repository**: [DeepSeek-OCR on GitHub](https://github.com/deepseek-ai/DeepSeek-OCR)
- 🤗 **Model**: [deepseek-ai/DeepSeek-OCR on Hugging Face](https://huggingface.co/deepseek-ai/DeepSeek-OCR)
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.