# doc2mark
Turn any document into clean Markdown – in one line.
## Why doc2mark?
- Converts PDFs, DOCX/XLSX/PPTX, images, HTML, CSV/JSON, and more
- AI-powered OCR for scans and screenshots (via OpenAI)
- Preserves complex tables (merged cells, headers) and basic layout
- One simple API + CLI for single files or whole folders
## Install
```bash
pip install doc2mark[all]
```
## Try it in 30 seconds
```python
from doc2mark import UnifiedDocumentLoader
loader = UnifiedDocumentLoader(ocr_provider='openai')  # or ocr_provider=None to run without an OCR provider
result = loader.load('sample_documents/sample_pdf.pdf', extract_images=True, ocr_images=True)
print(result.content)
```
CLI:
```bash
# single file → stdout
doc2mark sample_documents/sample_document.docx
# directory → save files (recursively)
doc2mark sample_documents -o output -r
# enable OCR with OpenAI
export OPENAI_API_KEY=sk-... # Windows: set OPENAI_API_KEY=...
doc2mark sample_documents/sample_pdf.pdf --ocr openai --ocr-images
```
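If you drive the CLI from a Python script, it can help to assemble the argument list in one place. The sketch below uses only the flags shown above (`-o`, `-r`, `--ocr`, `--ocr-images`); the helper name `build_doc2mark_command` is hypothetical and not part of doc2mark.

```python
import shlex


def build_doc2mark_command(path, output_dir=None, recursive=False,
                           ocr=None, ocr_images=False):
    """Assemble a doc2mark CLI invocation as an argv list.

    Hypothetical helper: it only mirrors the flags used in the examples above.
    """
    cmd = ["doc2mark", str(path)]
    if output_dir is not None:
        cmd += ["-o", str(output_dir)]
    if recursive:
        cmd.append("-r")
    if ocr is not None:
        cmd += ["--ocr", ocr]
    if ocr_images:
        cmd.append("--ocr-images")
    return cmd


cmd = build_doc2mark_command("sample_documents", output_dir="output", recursive=True)
print(shlex.join(cmd))  # doc2mark sample_documents -o output -r
# To actually run it: subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems with paths that contain spaces.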
## Supported formats
- PDF • DOCX • XLSX • PPTX • Images (PNG/JPG/WEBP) • TXT/CSV/TSV/JSON/JSONL • HTML/XML/MD
- Legacy Office (DOC/XLS/PPT/RTF/PPS) via LibreOffice (optional)
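To check a folder before converting it, you can filter files by extension. This is a minimal sketch that assumes the format names above map one-to-one onto lowercase file suffixes; doc2mark itself may accept more variants (for example `.jpeg`), and the helpers `classify` and `find_convertible` are hypothetical, not part of the library.

```python
from pathlib import Path

# Suffixes derived from the list above (an assumption, not doc2mark's own table).
SUPPORTED_SUFFIXES = {
    ".pdf", ".docx", ".xlsx", ".pptx",
    ".png", ".jpg", ".webp",
    ".txt", ".csv", ".tsv", ".json", ".jsonl",
    ".html", ".xml", ".md",
}
# Legacy Office formats, which need LibreOffice installed.
LEGACY_SUFFIXES = {".doc", ".xls", ".ppt", ".rtf", ".pps"}


def classify(path):
    """Return 'native', 'legacy', or 'unsupported' based on the file suffix."""
    suffix = Path(path).suffix.lower()
    if suffix in SUPPORTED_SUFFIXES:
        return "native"
    if suffix in LEGACY_SUFFIXES:
        return "legacy"
    return "unsupported"


def find_convertible(root, include_legacy=False):
    """Recursively list files under `root` whose suffix looks convertible."""
    wanted = {"native", "legacy"} if include_legacy else {"native"}
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and classify(p) in wanted
    )
```

Set `include_legacy=True` only when LibreOffice is available on the machine.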
## Common recipes
```python
from doc2mark import UnifiedDocumentLoader
loader = UnifiedDocumentLoader(ocr_provider='openai')
# 1) Single file → Markdown string
print(loader.load('document.pdf').content)
# 2) Image with OCR
print(loader.load('screenshot.png', extract_images=True, ocr_images=True).content)
# 3) Batch a folder and save outputs
loader.batch_process(
    input_dir='documents/',
    output_dir='converted/',
    extract_images=True,
    ocr_images=True,
    show_progress=True,
    save_files=True
)
```
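When you want per-file control instead of `batch_process`, the single-file recipe can be wrapped in a loop. The sketch below relies only on `loader.load(path).content` as shown above; the helper `convert_many` is hypothetical, and it assumes a failed conversion raises an ordinary exception.

```python
from pathlib import Path


def convert_many(loader, paths, out_dir="converted"):
    """Convert each path with loader.load() and write <stem>.md into out_dir.

    Returns (written_paths, failures), where failures maps source path to
    the error text, so one bad file does not stop the run.
    """
    out_root = Path(out_dir)
    out_root.mkdir(parents=True, exist_ok=True)
    written, failed = [], {}
    for src in map(Path, paths):
        try:
            markdown = loader.load(str(src)).content
        except Exception as exc:  # record the failure and keep going
            failed[str(src)] = repr(exc)
            continue
        target = out_root / (src.stem + ".md")
        target.write_text(markdown, encoding="utf-8")
        written.append(target)
    return written, failed


# Usage (requires doc2mark to be installed):
#   from doc2mark import UnifiedDocumentLoader
#   written, failed = convert_many(UnifiedDocumentLoader(ocr_provider=None),
#                                  ["document.pdf", "slides.pptx"])
```

Note that two inputs with the same stem (say `report.pdf` and `report.docx`) would write to the same output file in this sketch.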
## OpenAI OCR (optional)
```bash
export OPENAI_API_KEY=your_key # Windows: set OPENAI_API_KEY=your_key
```
```python
loader = UnifiedDocumentLoader(ocr_provider='openai')
# Need a cheaper model? Use model='gpt-4o-mini'
```
To use an OpenAI-compatible endpoint (for example a self-hosted or offline vision-language model), pass `base_url`:
```python
# Example: point to an OpenAI‑compatible server (must support vision)
loader = UnifiedDocumentLoader(
    ocr_provider='openai',
    base_url='http://localhost:11434/v1',  # your OpenAI-compatible endpoint
    api_key='your-key-or-any-string',      # some servers require a token
    model='gpt-4o-mini'
)
```
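To keep endpoint settings out of your code, you could build the constructor arguments from environment variables. This is a sketch under stated assumptions: `OPENAI_API_KEY` comes from the section above, while `DOC2MARK_BASE_URL` and `DOC2MARK_MODEL` are hypothetical variable names invented for this example, and `loader_kwargs_from_env` is not part of doc2mark.

```python
import os


def loader_kwargs_from_env(env=None):
    """Build keyword arguments for UnifiedDocumentLoader from the environment.

    DOC2MARK_BASE_URL and DOC2MARK_MODEL are hypothetical names used only
    in this sketch; doc2mark does not read them itself.
    """
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY")
    base_url = env.get("DOC2MARK_BASE_URL")
    model = env.get("DOC2MARK_MODEL")

    # No key and no custom endpoint: run without an OCR provider.
    if not api_key and not base_url:
        return {"ocr_provider": None}

    kwargs = {"ocr_provider": "openai"}
    if base_url:
        kwargs["base_url"] = base_url
        # Some local servers accept any token, so fall back to a placeholder.
        kwargs["api_key"] = api_key or "not-needed"
    if model:
        kwargs["model"] = model
    return kwargs


# Usage (requires doc2mark to be installed):
#   from doc2mark import UnifiedDocumentLoader
#   loader = UnifiedDocumentLoader(**loader_kwargs_from_env())
```

With only `OPENAI_API_KEY` set, the helper returns just `ocr_provider='openai'`, matching the basic setup shown earlier.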
## Tips
- Use `extract_images=True, ocr_images=True` to extract embedded images and convert them to text with OCR
- `batch_process(..., save_files=True)` writes `.md` (and `.json` when requested)
- Sample files live in `sample_documents/` — perfect for a quick test
## License
MIT — see `LICENSE`.
Raw data
{
"_id": null,
"home_page": "https://github.com/luisleo526/doc2mark",
"name": "doc2mark",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": "doc2mark Team <luisleo52655@gmail.com>",
"keywords": "document-processing, ocr, pdf, docx, xlsx, pptx, ai, gpt-4, openai, langchain, document-extraction, text-extraction",
"author": "HaoLiangWen",
"author_email": "doc2mark Team <luisleo52655@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/d8/00/45473b9d8ed9c1e6269c0138a082ad6f57f1b0c109d4f3379294f0c3925c/doc2mark-0.4.1.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Unified document processing with AI-powered OCR",
"version": "0.4.1",
"project_urls": {
"Changelog": "https://github.com/luisleo526/doc2mark/blob/main/CHANGELOG.md",
"Documentation": "https://doc2mark.readthedocs.io",
"Homepage": "https://github.com/luisleo526/doc2mark",
"Issues": "https://github.com/luisleo526/doc2mark/issues",
"Repository": "https://github.com/luisleo526/doc2mark"
},
"split_keywords": [
"document-processing",
" ocr",
" pdf",
" docx",
" xlsx",
" pptx",
" ai",
" gpt-4",
" openai",
" langchain",
" document-extraction",
" text-extraction"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "b1ef82697ae00aa66bcf3409f3a25597f4623722e6f5d28e442c32a0e2a35654",
"md5": "6265c8bbde1e5b2dc2cbba90fbc0b34d",
"sha256": "ad7b5806abc7ddf0d8b3b585e0a582578f8395096ef04ba355d58c4b4f0330e8"
},
"downloads": -1,
"filename": "doc2mark-0.4.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "6265c8bbde1e5b2dc2cbba90fbc0b34d",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 115232,
"upload_time": "2025-10-07T05:09:40",
"upload_time_iso_8601": "2025-10-07T05:09:40.488041Z",
"url": "https://files.pythonhosted.org/packages/b1/ef/82697ae00aa66bcf3409f3a25597f4623722e6f5d28e442c32a0e2a35654/doc2mark-0.4.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "d80045473b9d8ed9c1e6269c0138a082ad6f57f1b0c109d4f3379294f0c3925c",
"md5": "7cbf171190203bc023d65d0dfcb0e7dc",
"sha256": "bd9098593cd730dfe6c1e74f664578637d5911f9e252c6518c25bb018a3aaf74"
},
"downloads": -1,
"filename": "doc2mark-0.4.1.tar.gz",
"has_sig": false,
"md5_digest": "7cbf171190203bc023d65d0dfcb0e7dc",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 114964,
"upload_time": "2025-10-07T05:09:42",
"upload_time_iso_8601": "2025-10-07T05:09:42.272089Z",
"url": "https://files.pythonhosted.org/packages/d8/00/45473b9d8ed9c1e6269c0138a082ad6f57f1b0c109d4f3379294f0c3925c/doc2mark-0.4.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-10-07 05:09:42",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "luisleo526",
"github_project": "doc2mark",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "doc2mark"
}