doc2mark


- Name: doc2mark
- Version: 0.4.1
- Home page: https://github.com/luisleo526/doc2mark
- Summary: Unified document processing with AI-powered OCR
- Upload time: 2025-10-07 05:09:42
- Author: HaoLiangWen
- Requires Python: >=3.8
- License: MIT
- Keywords: document-processing, ocr, pdf, docx, xlsx, pptx, ai, gpt-4, openai, langchain, document-extraction, text-extraction
# doc2mark

[![PyPI version](https://img.shields.io/pypi/v/doc2mark.svg)](https://pypi.org/project/doc2mark/)
[![Python](https://img.shields.io/pypi/pyversions/doc2mark.svg)](https://pypi.org/project/doc2mark/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

Turn any document into clean Markdown – in one line.

## Why doc2mark?

- Converts PDFs, DOCX/XLSX/PPTX, images, HTML, CSV/JSON, and more
- AI OCR for scans and screenshots (OpenAI)
- Preserves complex tables (merged cells, headers) and basic layout
- One simple API + CLI for single files or whole folders

## Install

```bash
pip install doc2mark[all]
```

## Try it in 30 seconds

```python
from doc2mark import UnifiedDocumentLoader

loader = UnifiedDocumentLoader(ocr_provider='openai')  # or ocr_provider=None
result = loader.load('sample_documents/sample_pdf.pdf', extract_images=True, ocr_images=True)
print(result.content)
```

CLI:

```bash
# single file → stdout
doc2mark sample_documents/sample_document.docx

# directory → save files (recursively)
doc2mark sample_documents -o output -r

# enable OCR with OpenAI
export OPENAI_API_KEY=sk-...        # Windows: set OPENAI_API_KEY=...
doc2mark sample_documents/sample_pdf.pdf --ocr openai --ocr-images
```
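
When the CLI is driven from a script, it can help to assemble the command in one place. The sketch below only uses the flags shown above (`-o`, `-r`, `--ocr`, `--ocr-images`); the helper names `build_doc2mark_cmd` and `run_doc2mark` and their parameters are illustrative and not part of doc2mark.

```python
import subprocess
from typing import List, Optional


def build_doc2mark_cmd(
    source: str,
    output_dir: Optional[str] = None,
    recursive: bool = False,
    ocr: Optional[str] = None,
    ocr_images: bool = False,
) -> List[str]:
    """Assemble a doc2mark CLI call using only the flags shown in this README."""
    cmd = ["doc2mark", source]
    if output_dir is not None:
        cmd += ["-o", output_dir]
    if recursive:
        cmd.append("-r")
    if ocr is not None:
        cmd += ["--ocr", ocr]
    if ocr_images:
        cmd.append("--ocr-images")
    return cmd


def run_doc2mark(cmd: List[str]) -> str:
    """Run the command and return its stdout; raises CalledProcessError on failure."""
    done = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return done.stdout
```

For example, `build_doc2mark_cmd('sample_documents', output_dir='output', recursive=True)` produces the same arguments as the directory example above. Passing a list rather than a shell string avoids quoting problems with paths that contain spaces.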

## Supported formats

- PDF • DOCX • XLSX • PPTX • Images (PNG/JPG/WEBP) • TXT/CSV/TSV/JSON/JSONL • HTML/XML/MD
- Legacy Office (DOC/XLS/PPT/RTF/PPS) via LibreOffice (optional)
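
Before handing a folder to the loader, a quick pre-filter can separate files that need LibreOffice from the rest. This is a minimal sketch built only from the extensions listed above; `classify` and the two sets are hypothetical names, and doc2mark itself may accept extensions that are not listed here.

```python
from pathlib import Path

# Extensions copied from the "Supported formats" list; nothing else is assumed.
NATIVE = {
    ".pdf", ".docx", ".xlsx", ".pptx",
    ".png", ".jpg", ".webp",
    ".txt", ".csv", ".tsv", ".json", ".jsonl",
    ".html", ".xml", ".md",
}
# Legacy Office formats that the list says go through LibreOffice.
LEGACY = {".doc", ".xls", ".ppt", ".rtf", ".pps"}


def classify(path: str) -> str:
    """Return 'native', 'legacy', or 'unknown' based on the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix in NATIVE:
        return "native"
    if suffix in LEGACY:
        return "legacy"
    return "unknown"
```

Files classified as `'legacy'` are the ones to check if LibreOffice is not installed.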

## Common recipes

```python
from doc2mark import UnifiedDocumentLoader

loader = UnifiedDocumentLoader(ocr_provider='openai')

# 1) Single file → Markdown string
print(loader.load('document.pdf').content)

# 2) Image with OCR
print(loader.load('screenshot.png', extract_images=True, ocr_images=True).content)

# 3) Batch a folder and save outputs
loader.batch_process(
    input_dir='documents/',
    output_dir='converted/',
    extract_images=True,
    ocr_images=True,
    show_progress=True,
    save_files=True
)
```
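
Recipe 1 prints the Markdown; writing it to disk for a single file takes only a few more lines. The sketch below relies on `loader.load(...)` returning an object with a `.content` string, as in the recipes above; `convert_to_file` and the `<stem>.md` naming are assumptions of this example, not doc2mark behavior.

```python
from pathlib import Path


def convert_to_file(loader, source: str, out_dir: str) -> Path:
    """Convert one document with loader.load() and write the Markdown into out_dir."""
    result = loader.load(source)
    # Name the output after the source file, e.g. report.pdf -> report.md
    out_path = Path(out_dir) / (Path(source).stem + ".md")
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(result.content, encoding="utf-8")
    return out_path


# Usage (requires doc2mark to be installed):
# from doc2mark import UnifiedDocumentLoader
# convert_to_file(UnifiedDocumentLoader(ocr_provider='openai'), 'document.pdf', 'converted/')
```

Taking the loader as a parameter keeps one configured instance reusable across many files.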

## OpenAI OCR (optional)

```bash
export OPENAI_API_KEY=your_key   # Windows: set OPENAI_API_KEY=your_key
```

```python
loader = UnifiedDocumentLoader(ocr_provider='openai')
# Need a cheaper model? Use model='gpt-4o-mini'
```
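
Because OCR is optional, a script can choose the provider based on whether a key is present. This sketch assumes that `ocr_provider=None` is accepted, as the `# or None` comment in the quick-start example suggests; `pick_ocr_provider` is a hypothetical helper.

```python
import os
from typing import Optional


def pick_ocr_provider() -> Optional[str]:
    """Return 'openai' when OPENAI_API_KEY is set and non-empty, otherwise None."""
    return "openai" if os.environ.get("OPENAI_API_KEY") else None


# Usage (requires doc2mark to be installed):
# from doc2mark import UnifiedDocumentLoader
# loader = UnifiedDocumentLoader(ocr_provider=pick_ocr_provider())
```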

Use OpenAI‑compatible endpoints (self‑hosted/offline VLM):

```python
# Example: point to an OpenAI‑compatible server (must support vision)
loader = UnifiedDocumentLoader(
    ocr_provider='openai',
    base_url='http://localhost:11434/v1',  # your OpenAI‑compatible endpoint
    api_key='your-key-or-any-string',      # some servers require a token
    model='gpt-4o-mini'
)
```
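
To avoid hard-coding the endpoint, the constructor arguments shown above can be collected from the environment. This is only a sketch: `OCR_BASE_URL` and `OCR_MODEL` are variable names invented for this example, and `endpoint_kwargs` is not part of doc2mark. Only the keyword names (`ocr_provider`, `base_url`, `api_key`, `model`) come from the example above.

```python
import os
from typing import Dict, Mapping


def endpoint_kwargs(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Collect UnifiedDocumentLoader keyword arguments from environment variables.

    OCR_BASE_URL and OCR_MODEL are hypothetical names chosen for this sketch.
    Unset or empty variables are skipped so the library defaults apply.
    """
    kwargs = {"ocr_provider": "openai"}
    if env.get("OCR_BASE_URL"):
        kwargs["base_url"] = env["OCR_BASE_URL"]
    if env.get("OPENAI_API_KEY"):
        kwargs["api_key"] = env["OPENAI_API_KEY"]
    if env.get("OCR_MODEL"):
        kwargs["model"] = env["OCR_MODEL"]
    return kwargs


# Usage (requires doc2mark to be installed):
# from doc2mark import UnifiedDocumentLoader
# loader = UnifiedDocumentLoader(**endpoint_kwargs())
```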

## Tips

- Pass `extract_images=True, ocr_images=True` to `load()` or `batch_process()` so that extracted images are run through OCR and returned as text
- `batch_process(..., save_files=True)` writes `.md` (and `.json` when requested)
- Sample files live in `sample_documents/` — perfect for a quick test
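
After a batch run, counting what landed in the output folder is a quick sanity check. The sketch below uses only the standard library and assumes nothing about doc2mark beyond the `.md` and `.json` outputs mentioned above; `summarize_outputs` is a hypothetical helper.

```python
from collections import Counter
from pathlib import Path


def summarize_outputs(output_dir: str) -> Counter:
    """Count files under output_dir (recursively) by lowercase suffix."""
    return Counter(
        p.suffix.lower() for p in Path(output_dir).rglob("*") if p.is_file()
    )


# Usage:
# counts = summarize_outputs('converted/')
# print(counts['.md'], 'Markdown files,', counts['.json'], 'JSON files')
```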

## License

MIT — see `LICENSE`.

            

Raw data

{
    "_id": null,
    "home_page": "https://github.com/luisleo526/doc2mark",
    "name": "doc2mark",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": "doc2mark Team <luisleo52655@gmail.com>",
    "keywords": "document-processing, ocr, pdf, docx, xlsx, pptx, ai, gpt-4, openai, langchain, document-extraction, text-extraction",
    "author": "HaoLiangWen",
    "author_email": "doc2mark Team <luisleo52655@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/d8/00/45473b9d8ed9c1e6269c0138a082ad6f57f1b0c109d4f3379294f0c3925c/doc2mark-0.4.1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Unified document processing with AI-powered OCR",
    "version": "0.4.1",
    "project_urls": {
        "Changelog": "https://github.com/luisleo526/doc2mark/blob/main/CHANGELOG.md",
        "Documentation": "https://doc2mark.readthedocs.io",
        "Homepage": "https://github.com/luisleo526/doc2mark",
        "Issues": "https://github.com/luisleo526/doc2mark/issues",
        "Repository": "https://github.com/luisleo526/doc2mark"
    },
    "split_keywords": [
        "document-processing",
        " ocr",
        " pdf",
        " docx",
        " xlsx",
        " pptx",
        " ai",
        " gpt-4",
        " openai",
        " langchain",
        " document-extraction",
        " text-extraction"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "b1ef82697ae00aa66bcf3409f3a25597f4623722e6f5d28e442c32a0e2a35654",
                "md5": "6265c8bbde1e5b2dc2cbba90fbc0b34d",
                "sha256": "ad7b5806abc7ddf0d8b3b585e0a582578f8395096ef04ba355d58c4b4f0330e8"
            },
            "downloads": -1,
            "filename": "doc2mark-0.4.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "6265c8bbde1e5b2dc2cbba90fbc0b34d",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 115232,
            "upload_time": "2025-10-07T05:09:40",
            "upload_time_iso_8601": "2025-10-07T05:09:40.488041Z",
            "url": "https://files.pythonhosted.org/packages/b1/ef/82697ae00aa66bcf3409f3a25597f4623722e6f5d28e442c32a0e2a35654/doc2mark-0.4.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "d80045473b9d8ed9c1e6269c0138a082ad6f57f1b0c109d4f3379294f0c3925c",
                "md5": "7cbf171190203bc023d65d0dfcb0e7dc",
                "sha256": "bd9098593cd730dfe6c1e74f664578637d5911f9e252c6518c25bb018a3aaf74"
            },
            "downloads": -1,
            "filename": "doc2mark-0.4.1.tar.gz",
            "has_sig": false,
            "md5_digest": "7cbf171190203bc023d65d0dfcb0e7dc",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 114964,
            "upload_time": "2025-10-07T05:09:42",
            "upload_time_iso_8601": "2025-10-07T05:09:42.272089Z",
            "url": "https://files.pythonhosted.org/packages/d8/00/45473b9d8ed9c1e6269c0138a082ad6f57f1b0c109d4f3379294f0c3925c/doc2mark-0.4.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-10-07 05:09:42",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "luisleo526",
    "github_project": "doc2mark",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "doc2mark"
}