hashub-docapp

Name	hashub-docapp
Version	1.0.0
home_page	https://github.com/hasanbahadir/hashub-doc-sdk
Summary	Professional Python SDK for the HashubDocApp API - Advanced OCR, document conversion, and text extraction service
upload_time	2025-08-15 12:09:58
maintainer	None
docs_url	None
author	Hashub Team
requires_python	>=3.8
license	MIT
keywords	ocr pdf document conversion text-extraction image-processing api-client hashub batch-processing
requirements	requests urllib3 typing-extensions

# HashubDocApp Python SDK

[![Python](https://img.shields.io/badge/python-3.8+-blue.svg)](https://python.org)
[![License](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE)
[![Status](https://img.shields.io/badge/status-production-brightgreen.svg)](https://github.com)

Professional Python SDK for the HashubDocApp API - Advanced OCR, document conversion, and text extraction service.

## ✨ Features

- 🚀 **Fast OCR**: Quick text extraction with 76+ language support
- 🧠 **Smart OCR**: High-quality OCR with layout preservation
- 📄 **Document Conversion**: Office documents (Word, Excel) and HTML to Markdown/Text
- 🔄 **Batch Processing**: Process multiple files with intelligent categorization
- 🌍 **Multi-language**: Support for 76+ languages with ISO 639-1 codes
- 🎨 **Image Enhancement**: 11 pre-configured enhancement presets
- 📊 **Progress Tracking**: Real-time progress bars and status monitoring
- ⚡ **Rate Limiting**: Built-in API throttling protection

## 🚀 Quick Start

### Installation

```bash
pip install hashub-docapp
```

### Basic Usage

```python
from hashub_docapp import DocAppClient

# Initialize client
client = DocAppClient("your_api_key_here")

# Fast OCR - Quick text extraction
text = client.convert_fast("document.pdf", language="en")
print(text)

# Smart OCR - High-quality with layout preservation  
markdown = client.convert_smart("document.pdf")
print(markdown)
```

## 📖 Core Methods

### `convert_fast()`
Fast OCR for quick text extraction with language support.

```python
def convert_fast(
    file_or_image: Union[str, Path], 
    output: str = "markdown",
    language: str = "en",
    enhancement: Optional[str] = None,
    return_type: ReturnType = "content",
    save_to: Optional[Union[str, Path]] = None,
    show_progress: bool = True,
    timeout: int = 300
) -> Union[str, Path]
```

**Parameters:**
- `file_or_image`: Path to PDF or image file
- `output`: Output format ("markdown", "txt", "json")
- `language`: Language code (ISO 639-1 like "en", "tr", "de")
- `enhancement`: Image enhancement preset (optional)
- `return_type`: "content" (default), "url", or "file"
- `save_to`: File path when return_type="file"
- `show_progress`: Show progress bar (default: True)
- `timeout`: Maximum wait time in seconds (default: 300)

**Examples:**
```python
# Basic fast OCR
text = client.convert_fast("scan.pdf")

# With Turkish language
text = client.convert_fast("document.pdf", language="tr")

# With enhancement for low-quality scans
text = client.convert_fast("scan.pdf", enhancement="scan_low_dpi")

# Save plain text to a file (output format matches the .txt extension)
client.convert_fast("document.pdf", output="txt", return_type="file", save_to="output.txt")
```

### `convert_smart()`
High-quality OCR with layout preservation and structure detection.

```python
def convert_smart(
    file_or_image: Union[str, Path], 
    output: str = "markdown",
    return_type: ReturnType = "content",
    save_to: Optional[Union[str, Path]] = None,
    show_progress: bool = True,
    timeout: int = 300
) -> Union[str, Path]
```

**Parameters:**
- `file_or_image`: Path to PDF or image file
- `output`: Output format ("markdown", "txt", "json")
- `return_type`: "content" (default), "url", or "file"
- `save_to`: File path when return_type="file"
- `show_progress`: Show progress bar (default: True)
- `timeout`: Maximum wait time in seconds (default: 300)

**Examples:**
```python
# Smart OCR with layout preservation
markdown = client.convert_smart("complex_document.pdf")

# Save as file
client.convert_smart("document.pdf", return_type="file", save_to="output.md")

# Different output format
json_data = client.convert_smart("document.pdf", output="json")
```

## 🌍 Language Support

The SDK supports 76+ languages with ISO 639-1 codes:

```python
from hashub_docapp.languages import LanguageHelper

# List all supported languages
languages = LanguageHelper.list_languages()
print(f"Supported languages: {len(languages)}")

# Get language info
turkish_info = LanguageHelper.get_language_info("tr")
print(turkish_info)  # {'english': 'Turkish', 'native': 'Türkçe', 'iso': 'tr', 'api_code': 'lang_tur_tr'}

# Use with convert_fast
text = client.convert_fast("document.pdf", language="tr")  # Turkish
text = client.convert_fast("document.pdf", language="de")  # German
text = client.convert_fast("document.pdf", language="zh")  # Chinese
```

**Popular Language Codes:**
- `en` - English
- `tr` - Turkish  
- `de` - German
- `fr` - French
- `es` - Spanish
- `zh` - Chinese (Simplified)
- `ar` - Arabic
- `ru` - Russian
- `ja` - Japanese
- `ko` - Korean
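
When language codes come from user input or file metadata, it can help to normalize them before calling `convert_fast()`. The sketch below is not part of the SDK: `resolve_language` is a hypothetical helper, and it assumes `LanguageHelper.list_languages()` yields dicts that carry an `iso` key. The sample list is hand-written for illustration.

```python
def resolve_language(code, languages, default="en"):
    """Return a normalized ISO 639-1 code if supported, otherwise the default.

    `languages` is assumed to be an iterable of dicts with an 'iso' key,
    which is the shape this README suggests for list_languages().
    """
    supported = {entry["iso"] for entry in languages}
    normalized = code.strip().lower()
    return normalized if normalized in supported else default


# Hand-written sample data, not real SDK output.
sample = [
    {"english": "English", "iso": "en"},
    {"english": "Turkish", "iso": "tr"},
]
print(resolve_language(" TR ", sample))  # tr
print(resolve_language("xx", sample))    # en
```

Falling back silently to a default is a design choice; raising an error instead may suit pipelines where a wrong language would degrade OCR quality unnoticed.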

## 🎨 Image Enhancement Presets

The SDK includes 11 pre-configured enhancement presets for different document types:

```python
# Enhancement presets (use with convert_fast)
client.convert_fast("scan.pdf", enhancement="document_crisp")     # Clean documents
client.convert_fast("scan.pdf", enhancement="scan_low_dpi")       # Low quality scans
client.convert_fast("scan.pdf", enhancement="camera_shadow")      # Phone photos
client.convert_fast("scan.pdf", enhancement="photocopy_faded")    # Faded copies
client.convert_fast("scan.pdf", enhancement="inverted_scan")      # Inverted colors
client.convert_fast("scan.pdf", enhancement="noisy_dots")         # Noisy artifacts
client.convert_fast("scan.pdf", enhancement="tables_fine")        # Tables and grids
client.convert_fast("scan.pdf", enhancement="receipt_thermal")    # Receipts
client.convert_fast("scan.pdf", enhancement="newspaper_moire")    # Newspapers
client.convert_fast("scan.pdf", enhancement="fax_low_quality")    # Fax documents
client.convert_fast("scan.pdf", enhancement="blueprint")          # Technical drawings
```
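
A typo in a preset name would otherwise only surface once the request reaches the API. As a minimal sketch, the preset names listed above can be checked locally first. `ENHANCEMENT_PRESETS` and `check_preset` are hypothetical names introduced here, not SDK members.

```python
import difflib

# Preset names copied from the list above.
ENHANCEMENT_PRESETS = (
    "document_crisp", "scan_low_dpi", "camera_shadow", "photocopy_faded",
    "inverted_scan", "noisy_dots", "tables_fine", "receipt_thermal",
    "newspaper_moire", "fax_low_quality", "blueprint",
)


def check_preset(name):
    """Return name unchanged if it is a known preset, else raise ValueError."""
    if name in ENHANCEMENT_PRESETS:
        return name
    # Suggest the closest known preset when one is similar enough.
    close = difflib.get_close_matches(name, ENHANCEMENT_PRESETS, n=1)
    hint = f" Did you mean {close[0]!r}?" if close else ""
    raise ValueError(f"Unknown enhancement preset {name!r}.{hint}")
```

The validated name can then be passed on, for example `client.convert_fast("scan.pdf", enhancement=check_preset("blueprint"))`.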

## 📄 Document Conversion

### `convert_doc()`
Convert Word, Excel, and other office documents.

```python
def convert_doc(
    path: Union[str, Path], 
    output: str = "markdown",
    return_type: ReturnType = "content",
    save_to: Optional[Union[str, Path]] = None,
    options: Optional[Dict[str, Any]] = None
) -> Union[str, Path]
```

**Examples:**
```python
# Convert Word document to Markdown
markdown = client.convert_doc("document.docx")

# Convert Excel to text
text = client.convert_doc("spreadsheet.xlsx", output="txt")

# Save to file
client.convert_doc("presentation.pptx", return_type="file", save_to="output.md")
```

### `convert_html_string()`
Convert HTML string content to other formats.

```python
def convert_html_string(
    html_content: str, 
    output: str = "markdown",
    return_type: ReturnType = "content",
    save_to: Optional[Union[str, Path]] = None,
    options: Optional[Dict[str, Any]] = None
) -> Union[str, Path]
```

**Examples:**
```python
html = "<h1>Title</h1><p>Content</p>"
markdown = client.convert_html_string(html)
```

## 🔄 Batch Processing

### `batch_convert_smart()`
Smart batch processing with automatic file categorization.

```python
def batch_convert_smart(
    directory: Union[str, Path],
    save_to: Union[str, Path],
    output_format: str = "txt",
    recursive: bool = True,
    show_progress: bool = True,
    max_workers: int = 3,
    timeout: int = 600
) -> Dict[str, Any]
```

**Example:**
```python
# Process all files in directory intelligently
results = client.batch_convert_smart(
    directory="./documents",
    save_to="./output",
    output_format="markdown"
)

print(f"Processed {results['processed_count']} files")
print(f"Success: {results['success_count']}, Failed: {results['failed_count']}")
```

### `batch_convert_fast()`
Fast batch OCR for images and PDFs.

```python
def batch_convert_fast(
    directory: Union[str, Path],
    save_to: Union[str, Path],
    language: str = "en",
    enhancement: Optional[str] = None,
    output_format: str = "txt",
    recursive: bool = True,
    show_progress: bool = True,
    max_workers: int = 5,
    timeout: int = 300
) -> Dict[str, Any]
```
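
This README gives no example for `batch_convert_fast()`, so here is a hedged sketch of a call that uses only the parameters from the signature above. `ocr_scans` is a hypothetical wrapper, and the `success_count` and `failed_count` keys are assumed to match the `batch_convert_smart()` result shown earlier.

```python
def ocr_scans(client, source_dir, output_dir, language="en"):
    """Run fast batch OCR and return (succeeded, failed) counts."""
    results = client.batch_convert_fast(
        directory=source_dir,
        save_to=output_dir,
        language=language,
        enhancement="scan_low_dpi",  # one of the presets listed in this README
        output_format="txt",
        max_workers=2,               # below the default of 5, to go easy on the API
    )
    # Assumed result keys, mirroring the batch_convert_smart() example.
    return results["success_count"], results["failed_count"]
```

With a real client this would be called as `ocr_scans(client, "./scans", "./text", language="tr")`.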

### `batch_convert_auto()`
Automatic processing mode selection based on file types.

```python
def batch_convert_auto(
    directory: Union[str, Path],
    save_to: Union[str, Path],
    language: str = "en",
    enhancement: Optional[str] = None,
    output_format: str = "txt",
    recursive: bool = True,
    show_progress: bool = True,
    max_workers: int = 4,
    timeout: int = 900
) -> Dict[str, Any]
```

## 📊 Return Types

The SDK supports three return types for conversion methods:

### 1. Content (Default)
```python
text = client.convert_fast("doc.pdf", return_type="content")
print(text)  # Direct text content
```

### 2. URL
```python
url = client.convert_fast("doc.pdf", return_type="url") 
print(url)   # Download URL for the result
```

### 3. File
```python
path = client.convert_fast(
    "doc.pdf",
    output="txt",
    return_type="file",
    save_to="output.txt"
)
print(path)  # Path to saved file
```
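
With `return_type="file"`, the `save_to` path and the `output` format are chosen separately, so it is easy to write Markdown into a `.txt` file by accident. The sketch below derives the path from the format. The suffix mapping and the `output_path_for` helper are assumptions made for illustration, not SDK behavior.

```python
from pathlib import Path

# Assumed mapping from the `output` argument to a file suffix.
SUFFIX_FOR_OUTPUT = {"markdown": ".md", "txt": ".txt", "json": ".json"}


def output_path_for(source, output="markdown", out_dir="."):
    """Build a save_to path whose suffix matches the requested output format."""
    suffix = SUFFIX_FOR_OUTPUT[output]  # KeyError for an unknown format
    return Path(out_dir) / (Path(source).stem + suffix)


print(output_path_for("scans/invoice.pdf", "txt", "out"))
```

The result can be passed as `save_to=output_path_for(src, output)` alongside the same `output` value.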

## 🛠️ Job Management

### `get_status()`
Check job status.

```python
status = client.get_status(job_id)
print(f"Status: {status['status']}")
print(f"Progress: {status.get('progress', 0)}%")
```

### `wait()`
Wait for job completion with polling.

```python
final_status = client.wait(job_id, interval=2.0, timeout=300)
```
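
To make the polling behavior concrete, here is a rough sketch of such a loop built on `get_status()`. It is not the SDK's implementation: `poll_until_done` is a hypothetical function, and the terminal status names are assumptions that should be checked against the API reference.

```python
import time


def poll_until_done(client, job_id, interval=2.0, timeout=300.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until a terminal status or until timeout seconds pass."""
    # Assumed terminal status names; the real API may use different ones.
    terminal = {"completed", "failed", "cancelled"}
    deadline = clock() + timeout
    while True:
        status = client.get_status(job_id)
        if status["status"] in terminal:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"job {job_id} not finished after {timeout}s")
        sleep(interval)
```

Passing `sleep` and `clock` as parameters keeps the loop testable without real waiting; in normal use the defaults apply.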

### `get_result()`
Get completed job result.

```python
result = client.get_result(job_id)
print(result['content'])  # The extracted/converted text
```

### `cancel()`
Cancel a running job.

```python
client.cancel(job_id)
```

## 🔧 Configuration

### Environment Variables

```bash
export HASHUB_API_KEY="your_api_key_here"
```
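
This README does not say whether `DocAppClient` reads `HASHUB_API_KEY` on its own, so a cautious approach is to read the variable explicitly and pass it in. A minimal sketch, with `load_api_key` as a hypothetical helper:

```python
import os


def load_api_key(env=os.environ, name="HASHUB_API_KEY"):
    """Return the API key from the environment, or raise a clear error."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before creating the client")
    return key


# Usage (requires the SDK and a real key):
#   client = DocAppClient(load_api_key())
```

Failing early with a named variable tends to be easier to debug than an authentication error from the first request.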

### Client Configuration

```python
client = DocAppClient(
    api_key="your_api_key",
    base_url="https://doc.hashub.dev/api/v1",  # Default
    timeout=(30, 120),                         # (connect, read) timeout
    max_retries=3,                             # Max retry attempts
    rate_limit_delay=2.0                       # Min delay between requests
)
```

## 🎯 Usage Examples

### Basic OCR

```python
from hashub_docapp import DocAppClient

client = DocAppClient("your_api_key")

# Extract text from PDF
text = client.convert_fast("invoice.pdf", language="en")
print(text)

# High-quality OCR with layout
markdown = client.convert_smart("complex_document.pdf")
print(markdown)
```

### Multi-language Processing

```python
# Process documents in different languages
documents = [
    ("english_doc.pdf", "en"),
    ("turkish_doc.pdf", "tr"), 
    ("german_doc.pdf", "de"),
    ("chinese_doc.pdf", "zh")
]

for doc_path, lang in documents:
    text = client.convert_fast(doc_path, language=lang)
    print(f"{lang}: {text[:100]}...")
```

### Enhanced Image Processing

```python
# Process different types of scanned documents
scan_types = {
    "old_book.pdf": "scan_low_dpi",
    "phone_photo.jpg": "camera_shadow", 
    "faded_copy.pdf": "photocopy_faded",
    "receipt.jpg": "receipt_thermal",
    "technical_drawing.pdf": "blueprint"
}

for file_path, enhancement in scan_types.items():
    text = client.convert_fast(
        file_path, 
        enhancement=enhancement,
        language="en"
    )
    print(f"Processed {file_path} with {enhancement}")
```

### Batch Processing Example

```python
# Process entire directory
results = client.batch_convert_auto(
    directory="./input_docs",
    save_to="./output",
    output_format="markdown",
    show_progress=True
)

print(f"✅ Processed {results['success_count']} files successfully")
for file_result in results['results']:
    if file_result['status'] == 'success':
        print(f"  📄 {file_result['source_file']} -> {file_result['output_file']}")
```
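
The loop above prints only successes. For follow-up work, such as retrying failed files, it can be useful to split the per-file entries into two lists. The sketch below relies on the `results`, `status`, and `source_file` keys used in the example above; `summarize_batch` itself is a hypothetical helper.

```python
def summarize_batch(results):
    """Split per-file batch results into succeeded and failed source files."""
    succeeded, failed = [], []
    for item in results.get("results", []):
        bucket = succeeded if item.get("status") == "success" else failed
        bucket.append(item.get("source_file"))
    return succeeded, failed
```

Any status other than `"success"` is treated as a failure here, which is an assumption about the result format.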

## 🛡️ Error Handling

```python
from hashub_docapp import DocAppClient
from hashub_docapp.exceptions import (
    AuthenticationError, 
    RateLimitError, 
    ProcessingError,
    ValidationError
)

client = DocAppClient("your_api_key")

try:
    result = client.convert_fast("document.pdf")
    print(result)
    
except AuthenticationError:
    print("❌ Invalid API key")
    
except RateLimitError:
    print("⏳ Rate limit exceeded, wait and retry")
    
except ProcessingError as e:
    print(f"💥 Processing failed: {e}")
    
except ValidationError as e:
    print(f"📝 Validation error: {e}")
    
except FileNotFoundError:
    print("📁 File not found")
```
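
When converting many files in a loop, one failure should usually not abort the rest. The following sketch collects errors per file instead. `convert_all` is a hypothetical helper, and the exception types to tolerate are passed in rather than imported so that the block runs without the SDK installed.

```python
def convert_all(convert, paths, recoverable=(Exception,)):
    """Convert each path, collecting failures instead of stopping at the first."""
    converted, failed = {}, {}
    for path in paths:
        try:
            converted[path] = convert(path)
        except recoverable as exc:
            failed[path] = f"{type(exc).__name__}: {exc}"
    return converted, failed


# Usage (requires the SDK):
#   ok, bad = convert_all(
#       client.convert_fast,
#       ["a.pdf", "b.pdf"],
#       recoverable=(ProcessingError, ValidationError, FileNotFoundError),
#   )
```

Narrowing `recoverable` to per-file errors lets `AuthenticationError` propagate, since an invalid key would fail every file anyway.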

## 🔄 Rate Limiting

The SDK includes built-in rate limiting to prevent API throttling:

- **Default delay**: 2 seconds between requests
- **Automatic retry**: Failed requests are retried with exponential backoff
- **Progress tracking**: Polls job status with appropriate intervals

```python
# Configure rate limiting
client = DocAppClient(
    api_key="your_key",
    rate_limit_delay=3.0,  # 3 second delay between requests
    max_retries=5          # Retry failed requests up to 5 times
)
```
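
The exact backoff schedule the SDK uses is not documented here. As an illustration of what exponential backoff generally means, the sketch below computes the wait before each retry under assumed settings: a doubling factor and a 60-second cap. `backoff_delays` is a hypothetical function.

```python
def backoff_delays(base=2.0, retries=3, factor=2.0, cap=60.0):
    """Return the wait before each retry: base, base*factor, ... capped at cap."""
    return [min(cap, base * factor ** attempt) for attempt in range(retries)]


print(backoff_delays())        # [2.0, 4.0, 8.0]
print(backoff_delays(3.0, 5))  # [3.0, 6.0, 12.0, 24.0, 48.0]
```

Under these assumptions, `rate_limit_delay=3.0` with `max_retries=5` would add up to roughly a minute and a half of waiting in the worst case.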

## 📈 Performance Tips

1. **Use appropriate modes**:
   - `convert_fast()` for simple text extraction with language support
   - `convert_smart()` for complex layouts and formatting

2. **Batch processing**:
   - Use batch methods for multiple files
   - Adjust `max_workers` based on your API limits

3. **Language specification**:
   - Always specify the correct language for better accuracy
   - Use ISO 639-1 codes (`"en"`, `"tr"`, `"de"`)

4. **Enhancement presets**:
   - Choose the right preset for your document type
   - Experiment with different presets for optimal results
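
The first tip can be turned into a small routing function. This is a sketch under assumptions: the extension groups below are illustrative guesses, not a list of formats the service accepts, and `pick_method` is a hypothetical helper.

```python
from pathlib import Path

# Assumed extension groups; adjust to the formats you actually process.
OFFICE_EXTS = {".docx", ".xlsx", ".pptx"}
OCR_EXTS = {".pdf", ".png", ".jpg", ".jpeg", ".tiff"}


def pick_method(path, complex_layout=False):
    """Return the name of the client method suggested by the tips above."""
    ext = Path(path).suffix.lower()
    if ext in OFFICE_EXTS:
        return "convert_doc"
    if ext in OCR_EXTS:
        return "convert_smart" if complex_layout else "convert_fast"
    raise ValueError(f"No conversion method known for {ext!r} files")
```

The returned name can be resolved on a client with `getattr(client, pick_method("report.pdf"))("report.pdf")`.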

## 🐛 Troubleshooting

### Common Issues

**1. 404 Errors**
```python
# Ensure correct base URL
client = DocAppClient(
    api_key="your_key",
    base_url="https://doc.hashub.dev/api/v1"
)
```

**2. Rate Limiting**
```python
# Increase delay between requests
client = DocAppClient(
    api_key="your_key", 
    rate_limit_delay=3.0
)
```

**3. Timeout Issues**
```python
# Increase timeout for large files
result = client.convert_smart("large_file.pdf", timeout=600)
```

**4. Language Errors**
```python
# Check supported languages
from hashub_docapp.languages import LanguageHelper
languages = LanguageHelper.list_languages()
print([lang['iso'] for lang in languages])
```

## 📊 API Method Summary

| Method | Purpose | Key Parameters | Returns |
|--------|---------|----------------|---------|
| `convert_fast()` | Fast OCR | file_or_image, language, enhancement | str/Path |
| `convert_smart()` | Smart OCR | file_or_image, output | str/Path |
| `convert_doc()` | Office docs | path, output | str/Path |
| `convert_html_string()` | HTML conversion | html_content, output | str/Path |
| `batch_convert_smart()` | Smart batch | directory, save_to | Dict |
| `batch_convert_fast()` | Fast batch | directory, save_to, language | Dict |
| `batch_convert_auto()` | Auto batch | directory, save_to | Dict |

## 📄 License

MIT License - see [LICENSE](LICENSE) file for details.

## 🤝 Support

- **Documentation**: [HashubDocApp Docs](https://doc.hashub.dev)
- **API Reference**: [API Documentation](https://doc.hashub.dev/api)
- **Support**: [Contact Support](mailto:support@hashub.dev)

---

**Made with ❤️ by the Hashub Team**

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/hasanbahadir/hashub-doc-sdk",
    "name": "hashub-docapp",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": "Hashub Team <support@hashub.dev>",
    "keywords": "ocr, pdf, document, conversion, text-extraction, image-processing, api-client, hashub, batch-processing",
    "author": "Hashub Team",
    "author_email": "Hashub Team <support@hashub.dev>",
    "download_url": "https://files.pythonhosted.org/packages/39/4a/b5224c82db9e21791f331206fbebb83918bc2fda46227b9fdb4e2352069f/hashub_docapp-1.0.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Professional Python SDK for the HashubDocApp API - Advanced OCR, document conversion, and text extraction service",
    "version": "1.0.0",
    "project_urls": {
        "API Reference": "https://doc.hashub.dev/api",
        "Bug Reports": "https://github.com/hasanbahadir/hashub-doc-sdk/issues",
        "Documentation": "https://doc.hashub.dev",
        "Homepage": "https://github.com/hasanbahadir/hashub-doc-sdk",
        "Repository": "https://github.com/hasanbahadir/hashub-doc-sdk"
    },
    "split_keywords": [
        "ocr",
        " pdf",
        " document",
        " conversion",
        " text-extraction",
        " image-processing",
        " api-client",
        " hashub",
        " batch-processing"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "16ca9b289f51608d360bc9d31711d487e64e0e991d0032a458981e30c6580947",
                "md5": "3f0d68a6b4c868657c2d194fe0a34658",
                "sha256": "e1ede2bfdc5fabc0630da0edcda50b2ea7d2c031ef9c145d7867e86f46fe11e5"
            },
            "downloads": -1,
            "filename": "hashub_docapp-1.0.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "3f0d68a6b4c868657c2d194fe0a34658",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 39325,
            "upload_time": "2025-08-15T12:09:57",
            "upload_time_iso_8601": "2025-08-15T12:09:57.376482Z",
            "url": "https://files.pythonhosted.org/packages/16/ca/9b289f51608d360bc9d31711d487e64e0e991d0032a458981e30c6580947/hashub_docapp-1.0.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "394ab5224c82db9e21791f331206fbebb83918bc2fda46227b9fdb4e2352069f",
                "md5": "10d33f3179096ae0740bed5c0c2db50d",
                "sha256": "f5d0b5d50b32e0272a0a8dc4e7d1d57d379a956d8ab7ddf4a287453123eb8d5a"
            },
            "downloads": -1,
            "filename": "hashub_docapp-1.0.0.tar.gz",
            "has_sig": false,
            "md5_digest": "10d33f3179096ae0740bed5c0c2db50d",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 52664,
            "upload_time": "2025-08-15T12:09:58",
            "upload_time_iso_8601": "2025-08-15T12:09:58.921670Z",
            "url": "https://files.pythonhosted.org/packages/39/4a/b5224c82db9e21791f331206fbebb83918bc2fda46227b9fdb4e2352069f/hashub_docapp-1.0.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-08-15 12:09:58",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "hasanbahadir",
    "github_project": "hashub-doc-sdk",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [
        {
            "name": "requests",
            "specs": [
                [
                    ">=",
                    "2.25.0"
                ]
            ]
        },
        {
            "name": "urllib3",
            "specs": [
                [
                    ">=",
                    "1.26.0"
                ]
            ]
        },
        {
            "name": "typing-extensions",
            "specs": [
                [
                    ">=",
                    "3.7.4"
                ]
            ]
        }
    ],
    "lcname": "hashub-docapp"
}
