| Field | Value |
|-----------------|-------|
| Name | markitdown-pro |
| Version | 1.1.0 |
| home_page | None |
| Summary | A package that converts almost any file format to Markdown. |
| upload_time | 2025-08-23 15:28:17 |
| maintainer | None |
| docs_url | None |
| author | Developer |
| requires_python | >=3.12.2 |
| license | None |
| keywords | |
| VCS | |
| bugtrack_url | |
| requirements | No requirements were recorded. |
# MarkItDown-Pro
**MarkItDown-Pro** builds on the **[Microsoft MarkItDown repository](https://github.com/microsoft/markitdown)**, filling gaps and extending functionality with the **Azure Document Intelligence SDK**, **Unstructured.io**, and other Azure services and libraries. The result is a Python library and command-line tool that **converts diverse document formats into Markdown** with graceful fallbacks, including OCR support via GPT-4o-mini.
---
## Table of Contents
- [Folder Structure](#folder-structure)
- [Features & Highlights](#features--highlights)
- [How It Works](#how-it-works)
- [File-by-File Explanation](#file-by-file-explanation)
  - [Main Files](#main-files)
  - [Common Utils](#common-utils)
  - [Converters](#converters)
  - [Handlers](#handlers)
- [Installation](#installation)
- [Testing](#testing)
- [Usage](#usage)
  - [CLI Usage](#cli-usage)
  - [Programmatic Usage](#programmatic-usage)
- [FAQ](#faq)
---
## Folder Structure
A typical layout for **MarkItDown-Pro** might look like this:
```bash
markitdown-pro/
├── .env
├── README.md
├── requirements.txt
├── main.py
├── conversion_pipeline.py
├── common
│   └── utils.py
├── converters
│   ├── markitdown_wrapper.py
│   ├── azure_docint.py
│   ├── unstructured_wrapper.py
│   └── gpt4o_mini_vision.py
├── handlers
│   ├── pst_handler.py
│   ├── email_handler.py
│   ├── zip_handler.py
│   ├── audio_handler.py
│   └── pdf_handler.py
└── tests
    ├── data
    └── test_markitdownpro.py
```
| Folder/File | Description |
|----------------------------|---------------------------------------------------------------------------------------|
| **main.py** | Entry point for CLI usage; uses `argparse` to accept file paths. |
| **conversion_pipeline.py** | Orchestrates the fallback chain for converting documents to Markdown. |
| **common/** | Shared utility functions, e.g. for file detection, text cleanup, etc. |
| **converters/** | Contains modules for using various 3rd-party libraries or services to extract text. |
| **handlers/** | Specialized handlers for specific file types (PST, EML, ZIP, audio, PDF scanning). |
| **.env** | Environment variables (e.g., credentials for Azure GPT-4o-mini, Azure Doc Intelligence). |
| **requirements.txt** | Python dependencies needed to install and run this project. |
| **tests/test_markitdownpro.py** | Recursively scans `tests/data/` and attempts to convert each file using `convert_document_to_md`. |
| **README.md** | This documentation file, explaining usage and details of the project. |
---
## Features & Highlights
1. **MarkItDown with LLM**
   - Uses **MarkItDown** to convert documents to Markdown, optionally using an OpenAI LLM to caption images if **OPENAI_API_KEY** is set.
   - Checks automatically for `exiftool`, which is needed if you want EXIF metadata from your images.
2. **Whisper-Based Audio Transcription**
   - Converts audio files (`.mp3`, `.wav`, `.ogg`, etc.) into text using [OpenAI Whisper](https://github.com/openai/whisper).
   - Gracefully falls back if Whisper is not installed.
3. **PST Extraction**
   - Parses Outlook PST files with [`libratom`](https://github.com/rafproject/libratom), extracting emails and attachments recursively.
4. **Scanned PDF Detection & Concurrency**
   - Identifies PDFs that have no extractable text or that contain embedded images, and automatically performs OCR on each page with GPT-4o-mini.
   - Offers concurrent page-by-page OCR for faster performance.
5. **Fallback to Unstructured & Azure Document Intelligence**
   - If MarkItDown or a specialized handler fails or yields insufficient text, the pipeline tries the Unstructured.io library, which covers a broad range of file types.
   - If Unstructured also falls short, it tries Azure’s Document Intelligence to extract the textual layout.
6. **GPT-4 Vision (or GPT-4o-mini) for Images & OCR**
   - If an image or partially scanned PDF is detected, it can be passed to GPT-4o-mini for OCR.
   - Supports local images (base64-encoded) and remote image URLs (passed directly).
7. **Handles ZIP & EML**
   - **ZIP**: Unzips and processes each file inside, concatenating the results.
   - **EML**: Extracts the email text and attachments, and processes attachments recursively.
8. **Graceful LLM Handling**
   - If no **OPENAI_API_KEY** or GPT-4o-mini credentials are provided, LLM-based features are skipped and a warning is logged.
9. **Helper Methods for URL & Stream Conversion**
   - `convert_document_from_url(url, output_md)`
   - `convert_document_from_stream(stream, extension, output_md)`
   - `convert_document_to_md(local_path, output_md)`
10. **Easy-to-Extend Architecture**
    - Each file type has its own **handler**. Each text-extraction library has its own **converter**. The main pipeline provides a centralized fallback sequence.
11. **Environment-Driven Configuration**
    - Pulls API keys, endpoints, and paths from `.env` to keep secrets out of source code.
12. **Rich File Type Handling**
| Category | File Type(s) |
|-----------------------|-------------|
| PDF | .pdf |
| PowerPoint | .pot, .potm, .ppt, .pptm, .pptx |
| Word Processing | .abw, .doc, .docm, .docx, .dot, .dotm, .hwp, .zabw |
| Excel/Spreadsheet | .et, .fods, .uos1, .uos2, .wk2, .xls, .xlsb, .xlsm, .xlsx, .xlw |
| Images | .bmp, .gif, .heic, .jpeg, .jpg, .png, .prn, .svg, .tiff, .webp |
| Audio | .mp3, .wav, .ogg, .flac, .m4a, .aac, .wma, .webm, .opus |
| HTML | .htm, .html |
| Text-Based Formats | .csv, .json, .xml, .txt |
| ZIP Files | (Iterates over contents) |
| Email | .eml, .p7s |
| PST | .pst |
| EPUB | .epub |
| Markdown | .md |
| Org Mode | .org |
| Open Office | .odt, .sgl |
| Other | .eth, .mw, .pbd, .sdp, .uof, .web |
| Plain Text | .txt |
| reStructured Text | .rst |
| Rich Text | .rtf |
| StarOffice | .sxg |
| TSV | .tsv |
| Apple | .cwk, .mcw, .pages |
| Data Interchange | .dif |
| dBase | .dbf |
| Microsoft Office | .docx, .xlsx, .pptx |
| HEIF Image Format | .heif |
---
## How It Works
1. **Detect File Type**: The pipeline checks the file extension or general signature (`.pdf`, `.zip`, `.eml`, `.docx`, `.mp3`, etc.).
2. **Specialized Handlers**: If the file is PST, EML, ZIP, or audio, it’s handed off to a dedicated module that handles that format.
3. **MarkItDown**: For most generic document conversions, we first try [MarkItDown](https://github.com/microsoft/markitdown).
4. **Unstructured**: If MarkItDown fails or yields minimal text, we turn to [Unstructured.io](https://unstructured.io/) next.
   - **Why?** It's typically **cheaper** than Azure Document Intelligence, and can handle partial OCR scenarios (via Tesseract, PaddleOCR, etc., if you configure `OCR_AGENT`).
5. **Azure Document Intelligence**: If Unstructured also fails or yields minimal text, we try Azure Document Intelligence (prebuilt-layout).
6. **GPT-4o-mini**: As a final fallback or specifically for OCR on images/scanned pages.
7. **Saves** the extracted text to a `.md` file once any method returns sufficient content.
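The fallback order above can be sketched as a loop over converters. This is a minimal illustration under stated assumptions, not the package's actual code: `run_fallback_chain`, `MIN_CHARS`, and the stand-in converters are hypothetical names, and the threshold for "sufficient content" is invented for the example.

```python
from typing import Callable, Optional, Sequence

# Hypothetical threshold for "sufficient content"; the real value is not documented.
MIN_CHARS = 20

Converter = Callable[[str], Optional[str]]


def run_fallback_chain(path: str, converters: Sequence[Converter],
                       min_chars: int = MIN_CHARS) -> Optional[str]:
    """Return the first converter result that contains enough text."""
    for convert in converters:
        try:
            text = convert(path)
        except Exception:
            # A failing converter should not abort the chain.
            continue
        if text and len(text.strip()) >= min_chars:
            return text
    return None


# Stand-in converters for demonstration only.
def failing_converter(path: str) -> Optional[str]:
    raise RuntimeError("library not installed")


def sparse_converter(path: str) -> Optional[str]:
    return "   \n"


def working_converter(path: str) -> Optional[str]:
    return f"# Converted\n\nContents of {path} as Markdown."


result = run_fallback_chain(
    "report.pdf", [failing_converter, sparse_converter, working_converter]
)
print(result)
```

Ordering the list from cheapest to most expensive converter keeps paid services as a last resort, which matches the cost reasoning given for trying Unstructured before Azure.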
---
## File-by-File Explanation
### Main Files
- **`conversion_pipeline.py`**
  The core logic that orchestrates the fallback chain. Checks each handler or converter in a specific order. Once a successful conversion with enough text is found, it writes to `.md` and stops.
### Common Utils
- **`common/utils.py`**
  - **File Detection**: Contains helper functions like `is_pdf`, `is_audio`, `detect_extension`.
  - **Markdown Cleaning**: Functions like `clean_markdown()` and `ensure_minimum_content()` to tidy up text and ensure it’s not empty.
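As a rough idea of what such helpers could look like, here is a standard-library sketch. The function names are the ones mentioned above, but their signatures, return types, and behaviour are assumptions made for illustration; the audio extensions are copied from the file-type table.

```python
import re
from pathlib import Path

# Audio extensions as listed in the file-type table.
AUDIO_EXTENSIONS = {".mp3", ".wav", ".ogg", ".flac", ".m4a", ".aac", ".wma", ".webm", ".opus"}


def detect_extension(path: str) -> str:
    """Return the lowercase file extension, including the leading dot."""
    return Path(path).suffix.lower()


def is_pdf(path: str) -> bool:
    return detect_extension(path) == ".pdf"


def is_audio(path: str) -> bool:
    return detect_extension(path) in AUDIO_EXTENSIONS


def clean_markdown(text: str) -> str:
    """Strip trailing spaces and collapse runs of blank lines."""
    lines = [line.rstrip() for line in text.splitlines()]
    collapsed = re.sub(r"\n{3,}", "\n\n", "\n".join(lines))
    return collapsed.strip()


def ensure_minimum_content(text: str, min_chars: int = 10) -> bool:
    """Report whether the text has at least min_chars non-whitespace characters."""
    return len("".join(text.split())) >= min_chars


print(detect_extension("Report.PDF"), is_audio("talk.MP3"))
print(repr(clean_markdown("# Title  \n\n\n\nBody text \n")))
```

Detection here relies on the extension alone; checking file signatures, which the pipeline description also mentions, would need extra logic.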
### Converters
- **`converters/markitdown_wrapper.py`**
  - Wraps the [MarkItDown](https://github.com/microsoft/markitdown) library for docx/image extraction, EXIF reading, and optional LLM-based image captioning.
  - If MarkItDown is not installed, or fails, returns `None`.
- **`converters/azure_docint.py`**
  - Leverages Azure’s Document Intelligence (prebuilt-layout) to extract text from PDFs and other document types in Markdown format.
- **`converters/unstructured_wrapper.py`**
  - Uses the [Unstructured.io](https://www.unstructured.io/) library to parse documents. Useful for handling a broad range of less-common file types.
- **`converters/gpt4o_mini_vision.py`**
  - Uses GPT-4o-mini (Azure ChatOpenAI) for OCR tasks on **images** or **scanned PDFs**.
  - Offers **concurrent** or **simple** page-by-page approaches for PDFs.
  - Accepts **URL-based images** directly, or **local images** via Base64 encoding.
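The concurrent page-by-page approach mentioned for scanned PDFs can be illustrated with a thread pool. This is a sketch under assumptions: `ocr_pages_concurrently` and `fake_ocr` are hypothetical names, and the real converter would call the vision model where the stand-in simply decodes bytes.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def ocr_pages_concurrently(pages: List[bytes],
                           ocr_page: Callable[[bytes], str],
                           max_workers: int = 4) -> str:
    """OCR each page in a thread pool and join the results in page order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, even if pages finish out of order.
        texts = list(pool.map(ocr_page, pages))
    return "\n\n".join(t.strip() for t in texts if t.strip())


# Stand-in for a real vision-model call; it only decodes the bytes.
def fake_ocr(page_image: bytes) -> str:
    return page_image.decode("utf-8")


markdown = ocr_pages_concurrently([b"Page one", b"Page two", b"Page three"], fake_ocr)
print(markdown)
```

Threads suit this workload because each page request mostly waits on network I/O; `max_workers` would typically be tuned to the rate limits of the service being called.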
### Handlers
- **`handlers/pst_handler.py`**
  - Parses PST archives with [`libratom`](https://github.com/rafproject/libratom) and extracts emails + attachments. Calls back into the pipeline for each attachment.
- **`handlers/email_handler.py`**
  - Processes `.eml` files, extracting plain text, attachments, etc. Recursively processes attachments.
- **`handlers/zip_handler.py`**
  - Unzips files, recurses into the pipeline for each contained file, and concatenates all Markdown output.
- **`handlers/audio_handler.py`**
  - Uses [OpenAI Whisper](https://github.com/openai/whisper) to transcribe `.mp3`, `.wav`, `.ogg`, etc.
  - Caches the model in memory to speed up repeated use.
- **`handlers/pdf_handler.py`**
  - Utility to detect if a PDF is text-only, text+images, or fully scanned.
  - Coordinates with GPT-4o-mini for OCR if needed.
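The "unzip, recurse, concatenate" pattern described for the ZIP handler might look roughly like the following. This is a sketch, not the handler's real code: `convert_zip` and the `convert_file` callback are hypothetical, and the per-file heading is an invented convention for the example.

```python
import tempfile
import zipfile
from pathlib import Path
from typing import Callable, Optional


def convert_zip(zip_path: str, convert_file: Callable[[str], Optional[str]]) -> str:
    """Extract an archive to a temp dir and concatenate each file's Markdown."""
    sections = []
    with tempfile.TemporaryDirectory() as tmp_dir:
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(tmp_dir)
        for file_path in sorted(Path(tmp_dir).rglob("*")):
            if not file_path.is_file():
                continue
            # In the real pipeline this callback would be the main conversion entry point.
            markdown = convert_file(str(file_path))
            if markdown:
                relative = file_path.relative_to(tmp_dir).as_posix()
                sections.append(f"## {relative}\n\n{markdown}")
    return "\n\n".join(sections)


# Demonstration with a throwaway archive and a trivial "converter".
def read_text(path: str) -> Optional[str]:
    return Path(path).read_text(encoding="utf-8")


with tempfile.TemporaryDirectory() as work_dir:
    demo_zip = Path(work_dir) / "demo.zip"
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("a.txt", "alpha")
        archive.writestr("docs/b.txt", "beta")
    combined = convert_zip(str(demo_zip), read_text)

print(combined)
```

Passing the converter in as a callback keeps the handler free of a circular import on the pipeline module, and the same shape would work for email and PST attachments.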
---
## Installation
1. **Clone the Repo**
```bash
git clone https://github.com/YourName/markitdown-pro.git
cd markitdown-pro
```
2. **Create a Virtual Environment (recommended)**
```bash
python -m venv venv
source venv/bin/activate # or venv\Scripts\activate on Windows
```
3. **Install Dependencies**
```bash
pip install --upgrade pip
pip install -r requirements.txt
```
Note: You may also need system dependencies for libraries like PyMuPDF, libratom, etc.
4. **Set Up .env**
- Copy the sample `.env` to your root folder and fill in your Azure or OpenAI API keys. For example:
```bash
AZURE_DOCINTEL_ENDPOINT="https://<your-region>.api.cognitive.microsoft.com"
AZURE_DOCINTEL_KEY="YOUR_AZURE_KEY"
AZURE_OPENAI_API_KEY="your azure open ai key"
AZURE_OPENAI_API_VERSION="your azure open ai api version"
AZURE_OPENAI_ENDPOINT="your azure open ai endpoint"
AZURE_SPEECH_ENDPOINT="azure speech service endpoint - for audio conversion"
AZURE_SPEECH_KEY="azure speech service key - for audio conversion"
AZURE_SPEECH_REGION="azure speech service region - for audio conversion"
```
Make sure to source it or ensure python-dotenv can read it.
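To see which optional services are configured before running a conversion, a small check like the one below can help. It is a sketch under assumptions: the variable names come from the sample `.env` above, but the grouping into features and the `available_features` helper are hypothetical. It reads the process environment, so the `.env` file must already have been loaded.

```python
import os
from typing import Dict, Mapping

# Variable names come from the sample .env; the grouping into features is an assumption.
FEATURE_VARS = {
    "azure_document_intelligence": ("AZURE_DOCINTEL_ENDPOINT", "AZURE_DOCINTEL_KEY"),
    "azure_openai": ("AZURE_OPENAI_API_KEY", "AZURE_OPENAI_API_VERSION", "AZURE_OPENAI_ENDPOINT"),
    "azure_speech": ("AZURE_SPEECH_ENDPOINT", "AZURE_SPEECH_KEY", "AZURE_SPEECH_REGION"),
}


def available_features(env: Mapping[str, str] = os.environ) -> Dict[str, bool]:
    """Map each feature to True only when all of its variables are non-empty."""
    return {
        feature: all(env.get(name, "").strip() for name in names)
        for feature, names in FEATURE_VARS.items()
    }


for feature, enabled in available_features().items():
    print(f"{feature}: {'enabled' if enabled else 'skipped (missing credentials)'}")
```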
---
## Testing
We use **pytest** to run the test suite. The test script and sample data live in the `tests/` directory:
```bash
pytest tests/test_markitdownpro.py
```
---
## Usage
### CLI Usage
1. **Basic:**
```bash
python main.py /path/to/document.pdf
```
This produces `/path/to/document.md` if successful.
2. **Specify Output Path:**
```bash
python main.py /path/to/document.pst --output my_pst_output.md
```
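The two invocations above suggest an `argparse` setup along these lines. This is only a sketch of how `main.py` might parse its arguments: `build_parser` and `resolve_output` are hypothetical names, and the default of swapping the input suffix for `.md` is inferred from the basic example.

```python
import argparse
from pathlib import Path
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Convert a document to Markdown.")
    parser.add_argument("input_path", help="Path to the file to convert")
    parser.add_argument("--output", help="Where to write the Markdown (default: input path with .md)")
    return parser


def resolve_output(argv: Optional[List[str]] = None) -> Path:
    """Parse arguments and return the Markdown output path."""
    args = build_parser().parse_args(argv)
    if args.output:
        return Path(args.output)
    # No --output given: reuse the input location with a .md suffix.
    return Path(args.input_path).with_suffix(".md")


print(resolve_output(["/path/to/document.pdf"]))
print(resolve_output(["/path/to/document.pst", "--output", "my_pst_output.md"]))
```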
### Programmatic Usage
You can import and call the pipeline directly from your Python code:
```python
from conversion_pipeline import convert_document_to_md, convert_document_from_url
# 1) Local file example
md_text = convert_document_to_md("/path/to/my_file.pdf")
print("Extracted Markdown:", md_text)
# 2) URL example
md_from_url = convert_document_from_url("https://example.com/my_doc.docx", output_md="output_doc.md")
print("Output saved to output_doc.md")
```
---
## FAQ
1. **What if MarkItDown or Whisper is not installed?**
   The pipeline checks for each library’s availability. If a library is missing or fails, it gracefully moves on to the next fallback.
2. **Do I need Azure/OpenAI credentials?**
   - Azure: Yes, if you want to use Document Intelligence or GPT-4o-mini.
   - OpenAI: If you want MarkItDown’s LLM-based image captioning or are using Whisper from openai’s library, you need appropriate credentials or local models.
3. **How do I handle large PST files?**
   Large PSTs can be slow to process, especially if they contain many attachments. We parse them message-by-message, recursively handling attachments. For extremely large archives, you might want to increase concurrency or filter out attachments you don’t need.
4. **Does GPT-4o-mini require a publicly accessible image URL?**
   No. If you provide a local file path, the code base64-encodes it, which is ideal for truly local images. If you have a publicly hosted image, you can pass its URL directly.
5. **Why is Unstructured tried before Azure Doc Intelligence now?**
   We observed that **Unstructured** is typically **lower cost** to run (especially with Tesseract or local OCR) compared to Azure’s \$10 per 1,000 pages. So if MarkItDown fails, we try Unstructured next to potentially save cost. If that also fails, we move to Azure.
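The local-versus-URL image handling described in this FAQ can be sketched as follows. The helper `build_image_part` is hypothetical; the dictionary shape is the `image_url` content part used by OpenAI-style chat completion APIs, and whether the package builds exactly this payload is an assumption.

```python
import base64
import mimetypes
from pathlib import Path
from typing import Dict


def build_image_part(source: str) -> Dict[str, object]:
    """Build an image content part from a remote URL or a local file path."""
    if source.startswith(("http://", "https://")):
        url = source  # Remote images are passed through unchanged.
    else:
        # Local images are embedded as a base64 data URL.
        mime_type = mimetypes.guess_type(source)[0] or "application/octet-stream"
        encoded = base64.b64encode(Path(source).read_bytes()).decode("ascii")
        url = f"data:{mime_type};base64,{encoded}"
    return {"type": "image_url", "image_url": {"url": url}}


print(build_image_part("https://example.com/scan.png"))
```

Embedding local files as data URLs means the image never has to be hosted anywhere, at the cost of a larger request body.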
Raw data
{
"_id": null,
"home_page": null,
"name": "markitdown-pro",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.12.2",
"maintainer_email": null,
"keywords": null,
"author": "Developer",
"author_email": null,
"download_url": "https://files.pythonhosted.org/packages/99/85/26bf58754f419f0fb532c104d2af2e2b7a1b2642f583d0d1a9872f298f2e/markitdown_pro-1.1.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": null,
"summary": "A package that converts almost any file format to Markdown.",
"version": "1.1.0",
"project_urls": null,
"split_keywords": [],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "d1db5d1fe00045b6636ed4e5f647e2794e4703467c1ea18800e119109e6460e0",
"md5": "a8966eee6396e133e82c781ed7e36593",
"sha256": "0b952bb4bbfc2c768c093d1b79680a21d261fcdf0e0e00ccd0efa4445a456a5b"
},
"downloads": -1,
"filename": "markitdown_pro-1.1.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "a8966eee6396e133e82c781ed7e36593",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.12.2",
"size": 46040,
"upload_time": "2025-08-23T15:28:16",
"upload_time_iso_8601": "2025-08-23T15:28:16.651418Z",
"url": "https://files.pythonhosted.org/packages/d1/db/5d1fe00045b6636ed4e5f647e2794e4703467c1ea18800e119109e6460e0/markitdown_pro-1.1.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "998526bf58754f419f0fb532c104d2af2e2b7a1b2642f583d0d1a9872f298f2e",
"md5": "bcd64a348a1dc2e956ca2a0ee55537be",
"sha256": "6c7f54e58242475553a5945d33147642d591bce17040b547b51cba5cc58d27e8"
},
"downloads": -1,
"filename": "markitdown_pro-1.1.0.tar.gz",
"has_sig": false,
"md5_digest": "bcd64a348a1dc2e956ca2a0ee55537be",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.12.2",
"size": 39295,
"upload_time": "2025-08-23T15:28:17",
"upload_time_iso_8601": "2025-08-23T15:28:17.685445Z",
"url": "https://files.pythonhosted.org/packages/99/85/26bf58754f419f0fb532c104d2af2e2b7a1b2642f583d0d1a9872f298f2e/markitdown_pro-1.1.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-08-23 15:28:17",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "markitdown-pro"
}