mdpo-llm

Name: mdpo-llm
Version: 0.1.0
Summary: Process Markdown documents with LLMs using PO files for efficient translation and refinement
Upload time: 2025-10-21 00:55:04
Requires Python: >=3.10
License: MIT
Keywords: document-processing, gettext, i18n, llm, localization, markdown, po, translation
Requirements: none recorded
# mdpo-llm

[![Python Version](https://img.shields.io/pypi/pyversions/mdpo-llm.svg)](https://pypi.org/project/mdpo-llm/)
[![PyPI Version](https://img.shields.io/pypi/v/mdpo-llm.svg)](https://pypi.org/project/mdpo-llm/)
[![License](https://img.shields.io/pypi/l/mdpo-llm.svg)](https://github.com/yourusername/mdpo-llm/blob/main/LICENSE)

Process Markdown documents with LLMs using PO files for efficient translation and refinement workflows.

## Features

- πŸ“ **Incremental Processing**: Only process changed content using GNU gettext PO files
- 🌍 **Multi-language Support**: Built-in support for English, Chinese, Japanese, and Korean
- πŸ”„ **Translation & Refinement**: Use LLMs for both translation and document refinement
- πŸ—οΈ **Structure Preservation**: Maintains perfect Markdown structure and formatting
- ⚑ **Concurrent Processing**: Process multiple blocks in parallel for speed
- πŸ”Œ **LLM Agnostic**: Implement your own LLM interface for any provider

## Why mdpo-llm?

Traditional approaches to translating or refining Markdown documents with LLMs require processing the entire file every time there's a change. **mdpo-llm** solves this by:

1. **Parsing** Markdown into semantic blocks (headings, paragraphs, code blocks, etc.)
2. **Tracking** each block's content and translation state using PO files
3. **Processing** only new or changed content through your LLM
4. **Preserving** the exact document structure when reconstructing the output

This means you can make small edits to a large document and only pay for processing the changed sections!
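
Concretely, each block becomes one gettext entry. The snippet below is illustrative only (the reference comments and metadata mdpo-llm actually writes are not documented in this README): an up-to-date entry carries its translation, while a fuzzy entry marks a block whose source changed and will be re-sent to the LLM on the next run.

```po
#: README.md
msgid "Only changed blocks are sent to the LLM."
msgstr "λ³€κ²½λœ λΈ”λ‘λ§Œ LLM으둜 μ „μ†‘λ©λ‹ˆλ‹€."

#: README.md
#, fuzzy
msgid "This paragraph was edited after the last run."
msgstr "이 문단은 λ§ˆμ§€λ§‰ μ‹€ν–‰ 이후에 μˆ˜μ •λ˜μ—ˆμŠ΅λ‹ˆλ‹€."
```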

## Installation

```bash
pip install mdpo-llm
```

## Quick Start

### 1. Implement the LLM Interface

```python
import openai  # assumes the OpenAI SDK is installed (pip install openai)

from mdpo_llm import LLMInterface, MdpoLLM, LanguageCode

class MyLLM(LLMInterface):
    def process(self, source_text: str) -> str:
        # Your LLM logic here (OpenAI, Anthropic, Google, local models, etc.)
        # Example with OpenAI:
        response = openai.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": "Translate to Korean"},
                {"role": "user", "content": source_text}
            ]
        )
        return response.choices[0].message.content
```
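
The interface is provider-agnostic; any SDK that can turn a block of Markdown into a processed block will do. As a rough sketch (not part of mdpo-llm; the model name and prompt are placeholders), the same class backed by Anthropic's SDK might look like this:

```python
import anthropic

from mdpo_llm import LLMInterface


class AnthropicLLM(LLMInterface):
    def __init__(self, model: str = "claude-3-5-sonnet-latest"):
        # Reads ANTHROPIC_API_KEY from the environment.
        self.client = anthropic.Anthropic()
        self.model = model

    def process(self, source_text: str) -> str:
        # One Markdown block in, one translated block out.
        message = self.client.messages.create(
            model=self.model,
            max_tokens=4096,
            system="Translate the following Markdown block to Korean. Preserve all formatting.",
            messages=[{"role": "user", "content": source_text}],
        )
        return message.content[0].text
```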

### 2. Process Your Documents

```python
from pathlib import Path

# Initialize with your LLM
llm = MyLLM()
processor = MdpoLLM(llm)

# Process a document
result = processor.process_document(
    source_path=Path("docs/README.md"),
    target_path=Path("docs/README_ko.md"),
    po_path=Path("translations/README.po"),
    inplace=False  # Set True for refinement instead of translation
)

print(f"Processed {result['blocks_count']} blocks")
print(f"Translation coverage: {result['coverage']['coverage_percentage']}%")
```

## Advanced Usage

### Language Detection

```python
from mdpo_llm import LanguageCode

# Detect languages in text
text = "Hello δΈ–η•Œ"
if LanguageCode.EN.in_text(text):
    print("Contains English")
if LanguageCode.CN.in_text(text):
    print("Contains Chinese")
```

### Custom Processing Configuration

```python
class CustomProcessor(MdpoLLM):
    # Skip processing certain block types
    SKIP_TYPES = ["hr", "code"]  # Don't process horizontal rules or code blocks
    
processor = CustomProcessor(llm)
```

### Working with PO Files

The PO files track translation state for each block:
- **Untranslated**: New content that needs processing
- **Translated**: Completed translations
- **Fuzzy**: Content that changed and needs re-processing
- **Obsolete**: Removed content (automatically cleaned up)

You can inspect PO files with standard gettext tools or any PO editor.
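
Because they are ordinary gettext catalogs, you can also query their state programmatically. A minimal sketch using polib (which mdpo-llm is built on); the path is just an example:

```python
import polib

po = polib.pofile("translations/README.po")

print(f"Translated:   {len(po.translated_entries())}")
print(f"Untranslated: {len(po.untranslated_entries())}")
print(f"Fuzzy:        {len(po.fuzzy_entries())}")
print(f"Obsolete:     {len(po.obsolete_entries())}")
print(f"Coverage:     {po.percent_translated()}%")
```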

## API Reference

### Core Classes

#### `MdpoLLM` (alias for `MarkdownProcessor`)

Main processor class that orchestrates the workflow.

**Methods:**
- `process_document(source_path, target_path, po_path, inplace=False)` - Process a document
- `get_translation_stats(source_path, po_path)` - Get processing statistics
- `export_report(source_path, po_path)` - Generate a detailed report
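
A rough usage sketch for the two inspection helpers above. The exact shape of what they return is not documented in this README, so the code simply prints whatever comes back rather than assuming specific keys:

```python
from pathlib import Path

stats = processor.get_translation_stats(
    source_path=Path("docs/README.md"),
    po_path=Path("translations/README.po"),
)
print(stats)  # inspect the returned structure before relying on specific fields

report = processor.export_report(
    source_path=Path("docs/README.md"),
    po_path=Path("translations/README.po"),
)
print(report)
```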

#### `LLMInterface`

Abstract base class for LLM implementations.

**Methods to implement:**
- `process(source_text: str) -> str` - Process text and return result

#### `LanguageCode`

Enum for supported languages with detection capabilities.

**Values:**
- `LanguageCode.EN` - English
- `LanguageCode.CN` - Chinese
- `LanguageCode.JP` - Japanese  
- `LanguageCode.KO` - Korean

**Methods:**
- `in_text(text: str) -> bool` - Check if language appears in text

## Examples

### Validation and Quality Control

```python
class ValidatingLLM(LLMInterface):
    def __init__(self, max_retries: int = 3):
        self.max_retries = max_retries
        
    def process(self, source_text: str) -> str:
        # Implement retry logic with validation
        for attempt in range(self.max_retries):
            result = self._call_llm(source_text)  # _call_llm wraps your actual provider call (not shown here)
            if self._validate(source_text, result):
                return result
            print(f"Validation failed, retry {attempt + 1}/{self.max_retries}")
        return result  # Return last attempt
    
    def _validate(self, source: str, result: str) -> bool:
        # Check structure preservation, length ratios, etc.
        return len(result) > len(source) * 0.5  # Example validation
```

See `examples/validation_example.py` for a complete implementation with detailed validation rules.

### Translation Workflow

```python
from mdpo_llm import MdpoLLM, LLMInterface
from pathlib import Path

class TranslationLLM(LLMInterface):
    def __init__(self, target_language):
        self.target_language = target_language
        
    def process(self, source_text: str) -> str:
        # Your translation logic
        return translate(source_text, self.target_language)

# Setup
llm = TranslationLLM("Korean")
processor = MdpoLLM(llm)

# First run - translates everything
processor.process_document(
    Path("README.md"),
    Path("README_ko.md"),
    Path("translations/README.po")
)

# Edit README.md (e.g., fix a typo)
# ...

# Second run - only translates changed paragraphs!
processor.process_document(
    Path("README.md"),
    Path("README_ko.md"),
    Path("translations/README.po")
)
```

### Document Refinement Workflow

```python
class RefinementLLM(LLMInterface):
    def process(self, source_text: str) -> str:
        # Improve clarity, fix grammar, etc.
        return refine_text(source_text)

llm = RefinementLLM()
processor = MdpoLLM(llm)

# Refine document in-place (same language)
processor.process_document(
    Path("draft.md"),
    Path("refined.md"),
    Path("refinements/draft.po"),
    inplace=True  # Indicates same-language refinement
)
```

## Development

### Setup Development Environment

```bash
# Clone the repository
git clone https://github.com/yourusername/mdpo-llm.git
cd mdpo-llm

# Install with development dependencies
pip install -e ".[dev]"
```

### Run Tests

```bash
pytest tests/
```

### Build Package

```bash
python -m build
```

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Acknowledgments

- Built on top of [polib](https://github.com/izimobil/polib) for PO file handling
- Inspired by traditional gettext localization workflows

            
