# HashubDocApp Python SDK
Professional Python SDK for the HashubDocApp API - Advanced OCR, document conversion, and text extraction service.
## ✨ Features
- 🚀 **Fast OCR**: Quick text extraction with 76+ language support
- 🧠 **Smart OCR**: High-quality OCR with layout preservation
- 📄 **Document Conversion**: Office documents (Word, Excel) and HTML to Markdown/Text
- 🔄 **Batch Processing**: Process multiple files with intelligent categorization
- 🌍 **Multi-language**: Support for 76+ languages with ISO 639-1 codes
- 🎨 **Image Enhancement**: 11 pre-configured enhancement presets
- 📊 **Progress Tracking**: Real-time progress bars and status monitoring
- ⚡ **Rate Limiting**: Built-in API throttling protection
## 🚀 Quick Start
### Installation
```bash
pip install hashub-docapp
```
### Basic Usage
```python
from hashub_docapp import DocAppClient
# Initialize client
client = DocAppClient("your_api_key_here")
# Fast OCR - Quick text extraction
text = client.convert_fast("document.pdf", language="en")
print(text)
# Smart OCR - High-quality with layout preservation
markdown = client.convert_smart("document.pdf")
print(markdown)
```
## 📖 Core Methods
### `convert_fast()`
Fast OCR for quick text extraction with language support.
```python
def convert_fast(
    file_or_image: Union[str, Path],
    output: str = "markdown",
    language: str = "en",
    enhancement: Optional[str] = None,
    return_type: ReturnType = "content",
    save_to: Optional[Union[str, Path]] = None,
    show_progress: bool = True,
    timeout: int = 300
) -> Union[str, Path]
```
**Parameters:**
- `file_or_image`: Path to PDF or image file
- `output`: Output format ("markdown", "txt", "json")
- `language`: Language code (ISO 639-1 like "en", "tr", "de")
- `enhancement`: Image enhancement preset (optional)
- `return_type`: "content" (default), "url", or "file"
- `save_to`: File path when return_type="file"
- `show_progress`: Show progress bar (default: True)
- `timeout`: Maximum wait time in seconds (default: 300)
**Examples:**
```python
# Basic fast OCR
text = client.convert_fast("scan.pdf")
# With Turkish language
text = client.convert_fast("document.pdf", language="tr")
# With enhancement for low-quality scans
text = client.convert_fast("scan.pdf", enhancement="scan_low_dpi")
# Save to file
client.convert_fast("document.pdf", return_type="file", save_to="output.txt")
```
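When saving with `return_type="file"`, it can help to derive the `save_to` path from the chosen `output` format so that the file extension matches the content. The following is a minimal sketch under stated assumptions: the extension mapping is a local convention rather than something the SDK prescribes, and `output_path_for` is a hypothetical helper, not part of `hashub_docapp`.

```python
from pathlib import Path
from typing import Union

# Assumed mapping from the documented output formats to file extensions.
# The SDK does not prescribe these; adjust them to your own conventions.
_EXTENSIONS = {"markdown": ".md", "txt": ".txt", "json": ".json"}


def output_path_for(source: Union[str, Path], output: str = "markdown") -> Path:
    """Return a save_to path next to `source` with an extension matching `output`."""
    try:
        suffix = _EXTENSIONS[output]
    except KeyError:
        raise ValueError(f"unsupported output format: {output!r}") from None
    return Path(source).with_suffix(suffix)
```

The result can be passed straight through, for example `client.convert_fast("scan.pdf", output="txt", return_type="file", save_to=output_path_for("scan.pdf", "txt"))`.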
### `convert_smart()`
High-quality OCR with layout preservation and structure detection.
```python
def convert_smart(
    file_or_image: Union[str, Path],
    output: str = "markdown",
    return_type: ReturnType = "content",
    save_to: Optional[Union[str, Path]] = None,
    show_progress: bool = True,
    timeout: int = 300
) -> Union[str, Path]
```
**Parameters:**
- `file_or_image`: Path to PDF or image file
- `output`: Output format ("markdown", "txt", "json")
- `return_type`: "content" (default), "url", or "file"
- `save_to`: File path when return_type="file"
- `show_progress`: Show progress bar (default: True)
- `timeout`: Maximum wait time in seconds (default: 300)
**Examples:**
```python
# Smart OCR with layout preservation
markdown = client.convert_smart("complex_document.pdf")
# Save as file
client.convert_smart("document.pdf", return_type="file", save_to="output.md")
# Different output format
json_data = client.convert_smart("document.pdf", output="json")
```
## 🌍 Language Support
The SDK supports 76+ languages with ISO 639-1 codes:
```python
from hashub_docapp.languages import LanguageHelper
# List all supported languages
languages = LanguageHelper.list_languages()
print(f"Supported languages: {len(languages)}")
# Get language info
turkish_info = LanguageHelper.get_language_info("tr")
print(turkish_info)  # {'english': 'Turkish', 'native': 'Türkçe', 'iso': 'tr', 'api_code': 'lang_tur_tr'}
# Use with convert_fast
text = client.convert_fast("document.pdf", language="tr") # Turkish
text = client.convert_fast("document.pdf", language="de") # German
text = client.convert_fast("document.pdf", language="zh") # Chinese
```
**Popular Language Codes:**
- `en` - English
- `tr` - Turkish
- `de` - German
- `fr` - French
- `es` - Spanish
- `zh` - Chinese (Simplified)
- `ar` - Arabic
- `ru` - Russian
- `ja` - Japanese
- `ko` - Korean
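User input often arrives as a locale tag such as `en-US` rather than a bare two-letter code. The sketch below reduces such tags to the ISO 639-1 part before passing them to `convert_fast()`. It is an illustration only: `normalize_language` is a hypothetical helper, and the supported set is limited to the popular codes listed above. A real application would more likely build that set from `LanguageHelper.list_languages()`.

```python
# Codes taken from the "Popular Language Codes" list above. This is an
# assumption-limited subset, not the SDK's full list of 76+ languages.
POPULAR_CODES = frozenset({"en", "tr", "de", "fr", "es", "zh", "ar", "ru", "ja", "ko"})


def normalize_language(tag: str, supported=POPULAR_CODES) -> str:
    """Reduce a locale tag such as 'en-US' or 'TR_tr' to a two-letter code."""
    code = tag.strip().replace("_", "-").split("-", 1)[0].lower()
    if code not in supported:
        raise ValueError(f"unsupported language code: {tag!r}")
    return code
```

For example, `client.convert_fast("document.pdf", language=normalize_language("de-AT"))` would send `"de"`.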
## 🎨 Image Enhancement Presets
The SDK includes 11 pre-configured enhancement presets for different document types:
```python
# Enhancement presets (use with convert_fast)
client.convert_fast("scan.pdf", enhancement="document_crisp") # Clean documents
client.convert_fast("scan.pdf", enhancement="scan_low_dpi") # Low quality scans
client.convert_fast("scan.pdf", enhancement="camera_shadow") # Phone photos
client.convert_fast("scan.pdf", enhancement="photocopy_faded") # Faded copies
client.convert_fast("scan.pdf", enhancement="inverted_scan") # Inverted colors
client.convert_fast("scan.pdf", enhancement="noisy_dots") # Noisy artifacts
client.convert_fast("scan.pdf", enhancement="tables_fine") # Tables and grids
client.convert_fast("scan.pdf", enhancement="receipt_thermal") # Receipts
client.convert_fast("scan.pdf", enhancement="newspaper_moire") # Newspapers
client.convert_fast("scan.pdf", enhancement="fax_low_quality") # Fax documents
client.convert_fast("scan.pdf", enhancement="blueprint") # Technical drawings
```
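Because presets are passed as plain strings, a typo may only surface once the request reaches the API. A small local check can catch it earlier. This is a sketch, assuming the 11 names shown above are the complete set; `check_enhancement` is a hypothetical helper and not part of the SDK.

```python
from typing import Optional

# The 11 preset names listed above.
ENHANCEMENT_PRESETS = (
    "document_crisp", "scan_low_dpi", "camera_shadow", "photocopy_faded",
    "inverted_scan", "noisy_dots", "tables_fine", "receipt_thermal",
    "newspaper_moire", "fax_low_quality", "blueprint",
)


def check_enhancement(name: Optional[str]) -> Optional[str]:
    """Return `name` unchanged if it is None or a known preset, else raise."""
    if name is not None and name not in ENHANCEMENT_PRESETS:
        raise ValueError(
            f"unknown enhancement preset {name!r}; expected one of {ENHANCEMENT_PRESETS}"
        )
    return name
```

It can wrap the argument directly, as in `client.convert_fast("scan.pdf", enhancement=check_enhancement("scan_low_dpi"))`.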
## 📄 Document Conversion
### `convert_doc()`
Convert Word, Excel, and other office documents.
```python
def convert_doc(
    path: Union[str, Path],
    output: str = "markdown",
    return_type: ReturnType = "content",
    save_to: Optional[Union[str, Path]] = None,
    options: Optional[Dict[str, Any]] = None
) -> Union[str, Path]
```
**Examples:**
```python
# Convert Word document to Markdown
markdown = client.convert_doc("document.docx")
# Convert Excel to text
text = client.convert_doc("spreadsheet.xlsx", output="txt")
# Save to file
client.convert_doc("presentation.pptx", return_type="file", save_to="output.md")
```
### `convert_html_string()`
Convert HTML string content to other formats.
```python
def convert_html_string(
    html_content: str,
    output: str = "markdown",
    return_type: ReturnType = "content",
    save_to: Optional[Union[str, Path]] = None,
    options: Optional[Dict[str, Any]] = None
) -> Union[str, Path]
```
**Examples:**
```python
html = "<h1>Title</h1><p>Content</p>"
markdown = client.convert_html_string(html)
```
## 🔄 Batch Processing
### `batch_convert_smart()`
Smart batch processing with automatic file categorization.
```python
def batch_convert_smart(
    directory: Union[str, Path],
    save_to: Union[str, Path],
    output_format: str = "txt",
    recursive: bool = True,
    show_progress: bool = True,
    max_workers: int = 3,
    timeout: int = 600
) -> Dict[str, Any]
```
**Example:**
```python
# Process all files in directory intelligently
results = client.batch_convert_smart(
    directory="./documents",
    save_to="./output",
    output_format="markdown"
)
print(f"Processed {results['processed_count']} files")
print(f"Success: {results['success_count']}, Failed: {results['failed_count']}")
```
### `batch_convert_fast()`
Fast batch OCR for images and PDFs.
```python
def batch_convert_fast(
    directory: Union[str, Path],
    save_to: Union[str, Path],
    language: str = "en",
    enhancement: Optional[str] = None,
    output_format: str = "txt",
    recursive: bool = True,
    show_progress: bool = True,
    max_workers: int = 5,
    timeout: int = 300
) -> Dict[str, Any]
```
### `batch_convert_auto()`
Automatic processing mode selection based on file types.
```python
def batch_convert_auto(
    directory: Union[str, Path],
    save_to: Union[str, Path],
    language: str = "en",
    enhancement: Optional[str] = None,
    output_format: str = "txt",
    recursive: bool = True,
    show_progress: bool = True,
    max_workers: int = 4,
    timeout: int = 900
) -> Dict[str, Any]
```
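To give a feel for what selection by file type can look like, here is a sketch that maps a file extension to one of the single-file methods. It is not the SDK's logic: the real categorization rules are not documented in this README, the extension groups below are assumptions drawn from the file types used in the examples, and `pick_method` is a hypothetical helper.

```python
from pathlib import Path
from typing import Union

# Assumed extension groups; the SDK's real categorization may differ.
OFFICE_EXTENSIONS = {".docx", ".xlsx", ".pptx"}
OCR_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg"}


def pick_method(path: Union[str, Path]) -> str:
    """Name the client method that would plausibly handle `path`."""
    suffix = Path(path).suffix.lower()
    if suffix in OFFICE_EXTENSIONS:
        return "convert_doc"
    if suffix in OCR_EXTENSIONS:
        return "convert_fast"
    raise ValueError(f"no conversion method for {suffix or 'files without an extension'}")
```

A caller could dispatch with `getattr(client, pick_method(path))(path)`, swapping `convert_fast` for `convert_smart` where layout matters.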
## 📊 Return Types
The SDK supports three return types for conversion methods:
### 1. Content (Default)
```python
text = client.convert_fast("doc.pdf", return_type="content")
print(text) # Direct text content
```
### 2. URL
```python
url = client.convert_fast("doc.pdf", return_type="url")
print(url) # Download URL for the result
```
### 3. File
```python
path = client.convert_fast(
    "doc.pdf",
    return_type="file",
    save_to="output.txt"
)
print(path) # Path to saved file
```
## 🛠️ Job Management
### `get_status()`
Check job status.
```python
status = client.get_status(job_id)
print(f"Status: {status['status']}")
print(f"Progress: {status.get('progress', 0)}%")
```
### `wait()`
Wait for job completion with polling.
```python
final_status = client.wait(job_id, interval=2.0, timeout=300)
```
### `get_result()`
Get completed job result.
```python
result = client.get_result(job_id)
print(result['content']) # The extracted/converted text
```
### `cancel()`
Cancel a running job.
```python
client.cancel(job_id)
```
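`wait()` takes an `interval` and a `timeout`, which suggests a poll-until-done loop. The following sketches that general pattern and is not the SDK's implementation: the terminal status names `"completed"`, `"failed"`, and `"cancelled"` are assumptions, and `poll_until_done` is a hypothetical helper that accepts any status callable, for example `lambda: client.get_status(job_id)`.

```python
import time
from typing import Any, Callable, Dict

# Assumed terminal states; check the API reference for the real values.
TERMINAL_STATES = {"completed", "failed", "cancelled"}


def poll_until_done(
    get_status: Callable[[], Dict[str, Any]],
    interval: float = 2.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, Any]:
    """Call `get_status` every `interval` seconds until a terminal state or timeout."""
    deadline = clock() + timeout
    while True:
        status = get_status()
        if status.get("status") in TERMINAL_STATES:
            return status
        # Give up if the next poll would land past the deadline.
        if clock() + interval > deadline:
            raise TimeoutError(f"job still {status.get('status')!r} after {timeout}s")
        sleep(interval)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised in tests without real waiting.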
## 🔧 Configuration
### Environment Variables
```bash
export HASHUB_API_KEY="your_api_key_here"
```
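This README does not state whether `DocAppClient` reads `HASHUB_API_KEY` on its own, so a cautious approach is to read the variable yourself and pass the key explicitly. A minimal sketch; `load_api_key` is a hypothetical helper, not part of the SDK:

```python
import os
from typing import Mapping


def load_api_key(env: Mapping[str, str] = os.environ, name: str = "HASHUB_API_KEY") -> str:
    """Return the API key from the environment, failing early if it is unset or blank."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before creating the client")
    return key
```

The client can then be created with `client = DocAppClient(load_api_key())`, which keeps the key out of source control.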
### Client Configuration
```python
client = DocAppClient(
    api_key="your_api_key",
    base_url="https://doc.hashub.dev/api/v1",  # Default
    timeout=(30, 120),                         # (connect, read) timeout
    max_retries=3,                             # Max retry attempts
    rate_limit_delay=2.0                       # Min delay between requests
)
```
## 🎯 Usage Examples
### Basic OCR
```python
from hashub_docapp import DocAppClient
client = DocAppClient("your_api_key")
# Extract text from PDF
text = client.convert_fast("invoice.pdf", language="en")
print(text)
# High-quality OCR with layout
markdown = client.convert_smart("complex_document.pdf")
print(markdown)
```
### Multi-language Processing
```python
# Process documents in different languages
documents = [
    ("english_doc.pdf", "en"),
    ("turkish_doc.pdf", "tr"),
    ("german_doc.pdf", "de"),
    ("chinese_doc.pdf", "zh")
]
for doc_path, lang in documents:
    text = client.convert_fast(doc_path, language=lang)
    print(f"{lang}: {text[:100]}...")
```
### Enhanced Image Processing
```python
# Process different types of scanned documents
scan_types = {
    "old_book.pdf": "scan_low_dpi",
    "phone_photo.jpg": "camera_shadow",
    "faded_copy.pdf": "photocopy_faded",
    "receipt.jpg": "receipt_thermal",
    "technical_drawing.pdf": "blueprint"
}
for file_path, enhancement in scan_types.items():
    text = client.convert_fast(
        file_path,
        enhancement=enhancement,
        language="en"
    )
    print(f"Processed {file_path} with {enhancement}")
```
### Batch Processing Example
```python
# Process entire directory
results = client.batch_convert_auto(
    directory="./input_docs",
    save_to="./output",
    output_format="markdown",
    show_progress=True
)
print(f"✅ Processed {results['success_count']} files successfully")
for file_result in results['results']:
    if file_result['status'] == 'success':
        print(f"  📄 {file_result['source_file']} -> {file_result['output_file']}")
```
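After a batch run it is usually the failures that need attention. The sketch below collects them from the result dictionary. It relies on the keys used in the example above (`results`, `status`, `source_file`) and assumes that failed entries carry `source_file` as well, which this README does not confirm; `failed_files` is a hypothetical helper.

```python
from typing import Any, Dict, List


def failed_files(results: Dict[str, Any]) -> List[str]:
    """Return the source paths of entries whose status is not 'success'."""
    return [
        entry["source_file"]
        for entry in results.get("results", [])
        if entry.get("status") != "success"
    ]
```

The returned paths can be fed back into a single-file call such as `convert_smart()` for a retry with a longer `timeout`.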
## 🛡️ Error Handling
```python
from hashub_docapp import DocAppClient
from hashub_docapp.exceptions import (
    AuthenticationError,
    RateLimitError,
    ProcessingError,
    ValidationError
)
client = DocAppClient("your_api_key")
try:
    result = client.convert_fast("document.pdf")
    print(result)
except AuthenticationError:
    print("❌ Invalid API key")
except RateLimitError:
    print("⏳ Rate limit exceeded, wait and retry")
except ProcessingError as e:
    print(f"💥 Processing failed: {e}")
except ValidationError as e:
    print(f"📝 Validation error: {e}")
except FileNotFoundError:
    print("📁 File not found")
```
## 🔄 Rate Limiting
The SDK includes built-in rate limiting to prevent API throttling:
- **Default delay**: 2 seconds between requests
- **Automatic retry**: Failed requests are retried with exponential backoff
- **Progress tracking**: Polls job status with appropriate intervals
```python
# Configure rate limiting
client = DocAppClient(
    api_key="your_key",
    rate_limit_delay=3.0,  # 3 second delay between requests
    max_retries=5          # Retry failed requests up to 5 times
)
```
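To make the retry behaviour concrete, here is a generic exponential-backoff sketch. It illustrates the general technique only: the SDK's actual schedule (base delay, multiplier, jitter) is not documented here, and `backoff_delays` and `call_with_retry` are hypothetical names.

```python
import time
from typing import Callable, List, Tuple, Type, TypeVar

T = TypeVar("T")


def backoff_delays(base: float, retries: int, factor: float = 2.0) -> List[float]:
    """Delays before each retry: base, base*factor, base*factor**2, ..."""
    return [base * factor ** attempt for attempt in range(retries)]


def call_with_retry(
    func: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    base: float = 2.0,
    retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `func`, sleeping with growing delays after each listed exception."""
    for delay in backoff_delays(base, retries):
        try:
            return func()
        except retry_on:
            sleep(delay)
    return func()  # final attempt; any exception propagates to the caller
```

Application code that wants its own retry layer could call `call_with_retry(lambda: client.convert_fast("document.pdf"), (RateLimitError,))`, with `RateLimitError` imported from `hashub_docapp.exceptions`.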
## 📈 Performance Tips
1. **Use appropriate modes**:
   - `convert_fast()` for simple text extraction with language support
   - `convert_smart()` for complex layouts and formatting
2. **Batch processing**:
   - Use batch methods for multiple files
   - Adjust `max_workers` based on your API limits
3. **Language specification**:
   - Always specify the correct `language` for better accuracy
   - Use ISO 639-1 codes (`"en"`, `"tr"`, `"de"`)
4. **Enhancement presets**:
   - Choose the right preset for your document type
   - Experiment with different presets for optimal results
## 🐛 Troubleshooting
### Common Issues
**1. 404 Errors**
```python
# Ensure correct base URL
client = DocAppClient(
    api_key="your_key",
    base_url="https://doc.hashub.dev/api/v1"
)
```
**2. Rate Limiting**
```python
# Increase delay between requests
client = DocAppClient(
    api_key="your_key",
    rate_limit_delay=3.0
)
```
**3. Timeout Issues**
```python
# Increase timeout for large files
result = client.convert_smart("large_file.pdf", timeout=600)
```
**4. Language Errors**
```python
# Check supported languages
from hashub_docapp.languages import LanguageHelper
languages = LanguageHelper.list_languages()
print([lang['iso'] for lang in languages])
```
## 📊 API Method Summary
| Method | Purpose | Key Parameters | Returns |
|--------|---------|----------------|---------|
| `convert_fast()` | Fast OCR | file_or_image, language, enhancement | str/Path |
| `convert_smart()` | Smart OCR | file_or_image, output | str/Path |
| `convert_doc()` | Office docs | path, output | str/Path |
| `convert_html_string()` | HTML conversion | html_content, output | str/Path |
| `batch_convert_smart()` | Smart batch | directory, save_to | Dict |
| `batch_convert_fast()` | Fast batch | directory, save_to, language | Dict |
| `batch_convert_auto()` | Auto batch | directory, save_to | Dict |
## 📄 License
MIT License - see [LICENSE](LICENSE) file for details.
## 🤝 Support
- **Documentation**: [HashubDocApp Docs](https://doc.hashub.dev)
- **API Reference**: [API Documentation](https://doc.hashub.dev/api)
- **Support**: [Contact Support](mailto:support@hashub.dev)
---
**Made with ❤️ by the Hashub Team**