markitdown-pro

Name	markitdown-pro JSON
Version	1.3.7 JSON
	download
home_page	None
Summary	A package that converts almost any file format to Markdown.
upload_time	2025-10-24 17:12:43
maintainer	None
docs_url	None
author	Developer
requires_python	>=3.12.2
license	None
keywords
VCS
bugtrack_url
requirements	No requirements were recorded.
Travis-CI	No Travis.
coveralls test coverage	No coveralls.

            # MarkItDown-Pro

**MarkItDown-Pro** is an **improvement** of the **[Microsoft MarkItDown repository](https://github.com/markitdown)**, enhancing gaps and extending functionality by leveraging **Azure Document Intelligence SDK**, **Unstructured.io**, and other Azure services and libraries. The result is a comprehensive Python library and command-line tool designed to **convert diverse document formats into Markdown** with graceful fallbacks, including OCR support via GPT-4o-mini.

---

## Table of Contents

- [Folder Structure](#folder-structure)
- [Features & Highlights](#features--highlights)
- [How It Works](#how-it-works)
- [File-by-File Explanation](#file-by-file-explanation)
  - [Main Files](#main-files)
  - [Common Utils](#common-utils)
  - [Converters](#converters)
  - [Handlers](#handlers)
- [Testing](#testing)
- [Usage & Examples](#usage--examples)
  - [CLI Usage](#cli-usage)
  - [Programmatic Usage](#programmatic-usage)
  - [Extra: Vector Database Chunking](#extra-vector-database-chunking)
- [Environment Variables](#environment-variables)
- [FAQ](#faq)

---

## Folder Structure

A typical layout for **MarkItDown-Pro** might look like this:
```bash
markitdown-pro/
├── .env
├── README.md
├── requirements.txt
├── main.py
├── conversion_pipeline.py
├── common
│   └── utils.py
├── converters
│   ├── markitdown_wrapper.py
│   ├── azure_docint.py
│   ├── unstructured_wrapper.py
│   └── gpt4o_mini_vision.py
├── handlers
│   ├── pst_handler.py
│   ├── email_handler.py
│   ├── zip_handler.py
│   ├── audio_handler.py
│   └── pdf_handler.py
└──  tests
    ├── data
    └── test.py
```

| Folder/File                 | Description                                                                           |
|----------------------------|---------------------------------------------------------------------------------------|
| **main.py**                | Entry point for CLI usage; uses `argparse` to accept file paths.                      |
| **conversion_pipeline.py** | Orchestrates the fallback chain for converting documents to Markdown.                |
| **common/**                | Shared utility functions, e.g. for file detection, text cleanup, etc.                 |
| **converters/**            | Contains modules for using various 3rd-party libraries or services to extract text.   |
| **handlers/**              | Specialized handlers for specific file types (PST, EML, ZIP, audio, PDF scanning).    |
| **.env**                   | Environment variables (e.g., credentials for Azure GPT-4o-mini, Azure Doc Intelligence). |
| **requirements.txt**       | Python dependencies needed to install and run this project.                           |
| **tests/test_markitdownpro.py**| Recursively scans /tests/data/ and attempts to convert each file using convert_document_to_md|
| **README.md**              | This documentation file, explaining usage and details of the project.                 |

---

## Features & Highlights

1. **MarkItDown with LLM**
   - Uses **MarkItDown** to convert documents to Markdown, optionally leveraging an OpenAI LLM to create image captions if you have an **OPENAI_API_KEY**.
   - Auto-checks for `exiftool` if you want EXIF metadata in your images.

2. **Whisper-Based Audio Transcription**
   - Converts audio files (`.mp3`, `.wav`, `.ogg`, etc.) into text using [OpenAI Whisper](https://github.com/openai/whisper).
   - Gracefully falls back if Whisper is not installed.

3. **PST Extraction**
   - Parses Outlook PST files with [`libratom`](https://github.com/rafproject/libratom), extracting emails and attachments recursively.

4. **Scanned PDF Detection & Concurrency**
   - Identifies PDFs with no text or embedded images, and automatically performs OCR on each page with GPT-4o-mini.
   - Offers concurrent page-by-page OCR for faster performance.

5. **Fallback to Azure Document Intelligence & Unstructured**
   - If standard MarkItDown or specialized handlers fail or yield insufficient text, it tries Azure’s Document Intelligence to extract textual layout.
   - Unstructured.io library for broad coverage of file types.

6. **GPT-4 Vision (or GPT-4o-mini) for Images & OCR**
   - If an image or partially scanned PDF is detected, we can pass it to GPT-4o-mini for OCR.
   - Supports local images (base64 encoding) or remote image URLs directly.

7. **Handles ZIP & EML**
   - **ZIP**: Unzips and processes each file inside, concatenating the results.
   - **EML**: Extracts email text, attachments, and processes attachments recursively.

8. **Graceful LLM Handling**
   - If no **OPENAI_API_KEY** or GPT-4o-mini credentials are provided, it simply skips LLM-based features, logging a warning.

9. **Helper Methods for URL & Stream Conversion**
   - `convert_document_from_url(url, output_md)`
   - `convert_document_from_stream(stream, extension, output_md)`
   - `convert_document_to_md(local_path, output_md)`

10. **Easy-to-Extend Architecture**
   Each file type has its own **handler**. Each text-extraction library has its own **converter**. The main pipeline provides a centralized fallback sequence.

11. **Environment-Driven Configuration**
   - Pulls API keys, endpoints, and paths from `.env` to keep secrets out of source code.

12. **Rich File Type Handling**

| Category              | File Type(s) |
|-----------------------|-------------|
| PDF                  | .pdf |
| PowerPoint           | .pot, .potm, .ppt, .pptm, .pptx |
| Word Processing      | .abw, .doc, .docm, .docx, .dot, .dotm, .hwp, .zabw |
| Excel/Spreadsheet    | .et, .fods, .uos1, .uos2, .wk2, .xls, .xlsb, .xlsm, .xlsx, .xlw |
| Images              | .bmp, .gif, .heic, .jpeg, .jpg, .png, .prn, .svg, .tiff, .webp |
| Audio               | .mp3, .wav, .ogg, .flac, .m4a, .aac, .wma, .webm, .opus |
| HTML                | .htm, .html |
| Text-Based Formats  | .csv, .json, .xml, .txt |
| ZIP Files           | (Iterates over contents) |
| Email               | .eml, .p7s |
| PST                 | .pst |
| EPUB                | .epub |
| Markdown            | .md |
| Org Mode            | .org |
| Open Office         | .odt, .sgl |
| Other              | .eth, .mw, .pbd, .sdp, .uof, .web |
| Plain Text          | .txt |
| reStructured Text   | .rst |
| Rich Text           | .rtf |
| StarOffice          | .sxg |
| TSV                 | .tsv |
| Apple               | .cwk, .mcw, .pages |
| Data Interchange    | .dif |
| dBase               | .dbf |
| Microsoft Office    | .docx, .xlsx, .pptx |
| HEIF Image Format   | .heif |


---

## How It Works

1. **Detect File Type**: The pipeline checks the file extension or general signature (`.pdf`, `.zip`, `.eml`, `.docx`, `.mp3`, etc.).
2. **Specialized Handlers**: If the file is PST, EML, ZIP, or audio, it’s handed off to a dedicated module that handles that format.
3. **MarkItDown**: For most generic document conversions, we first try [MarkItDown](https://github.com/markitdown).
4. **Unstructured**: If MarkItDown fails or yields minimal text, we turn to [Unstructured.io](https://unstructured.io/) next.
   - **Why?** It's typically **cheaper** than Azure Document Intelligence, and can handle partial OCR scenarios (via Tesseract, PaddleOCR, etc., if you configure `OCR_AGENT`).
5. **Azure Document Intelligence**: If Unstructured also fails or yields minimal text, we try Azure Document Intelligence (prebuilt-layout).
6. **GPT-4o-mini**: As a final fallback or specifically for OCR on images/scanned pages.
7. **Saves** the extracted text to a `.md` file once any method returns sufficient content.

---

## File-by-File Explanation

### Main Files

- **`conversion_pipeline.py`**
  The core logic that orchestrates the fallback chain. Checks each handler or converter in a specific order. Once a successful conversion with enough text is found, it writes to `.md` and stops.

### Common Utils

- **`common/utils.py`**
  - **File Detection**: Contains helper functions like `is_pdf`, `is_audio`, `detect_extension`.
  - **Markdown Cleaning**: Functions like `clean_markdown()` and `ensure_minimum_content()` to tidy up text and ensure it’s not empty.

### Converters

- **`converters/markitdown_wrapper.py`**
  - Wraps the [MarkItDown](https://github.com/markitdown) library for docx/image extraction, EXIF reading, and optional LLM-based image captioning.
  - If MarkItDown is not installed, or fails, returns `None`.

- **`converters/azure_docint.py`**
  - Leverages Azure’s Document Intelligence (prebuilt-layout) to extract text from PDFs and other document types in Markdown format.

- **`converters/unstructured_wrapper.py`**
  - Uses the [Unstructured.io](https://www.unstructured.io/) library to parse documents. Useful for handling broad, less-common file types.

- **`converters/gpt4o_mini_vision.py`**
  - Uses GPT-4o-mini (Azure ChatOpenAI) for OCR tasks on **images** or **scanned PDFs**.
  - **Concurrent** or **simple** page-by-page approaches for PDFs.
  - Can pass **URL-based images** or **local images** via Base64 encoding.

### Handlers

- **`handlers/pst_handler.py`**
  - Parses PST archives with [`libratom`](https://github.com/rafproject/libratom) and extracts emails + attachments. Calls back into the pipeline for each attachment.

- **`handlers/email_handler.py`**
  - Processes `.eml` files, extracting plain text, attachments, etc. Recursively processes attachments.

- **`handlers/zip_handler.py`**
  - Unzips files, recurses into the pipeline for each contained file, and concatenates all Markdown output.

- **`handlers/audio_handler.py`**
  - Uses [OpenAI Whisper](https://github.com/openai/whisper) to transcribe `.mp3`, `.wav`, `.ogg`, etc.
  - Caches the model in memory to speed up repeated use.

- **`handlers/pdf_handler.py`**
  - Utility to detect if a PDF is text-only, text+images, or fully scanned.
  - Coordinates with GPT-4o-mini for OCR if needed.

---

## Installation

1. **Clone the Repo**
   ```bash
   git clone https://github.com/YourName/markitdown-pro.git
   cd markitdown-pro
   ```
2. **Create a Virtual Environment (recommended)**
   ```bash
   python -m venv venv
   source venv/bin/activate  # or venv\Scripts\activate on Windows
   ```
3. **Create a Virtual Environment (recommended)**
   ```bash
   python -m venv venv
   source venv/bin/activate  # or venv\Scripts\activate on Windows
   ```
4. **Install Dependencies**
   ```bash
   pip install --upgrade pip
   pip install -r requirements.txt
   ```
  Note: You may also need system dependencies for libraries like PyMuPDF, libratom, etc.

5. **Set Up .env**

- Copy the sample .env to your root folder, and fill in your Azure or OpenAI API keys, etc. For example:
   ```bash
   AZURE_DOCINTEL_ENDPOINT="https://<your-region>.api.cognitive.microsoft.com"
   AZURE_DOCINTEL_KEY="YOUR_AZURE_KEY"
   AZURE_OPENAI_API_KEY="your azure open ai key"
   AZURE_OPENAI_API_VERSION="your azure open ai api version"
   AZURE_OPENAI_ENDPOINT="your azure open ai endpoint"
   AZURE_SPEECH_ENDPOINT="azure speech service endpoint - for audio conversion"
   AZURE_SPEECH_KEY="azure speech service key - for audio conversion"
   AZURE_SPEECH_REGION="azure speech service region - for audio conversion"
   ```
  Make sure to source it or ensure python-dotenv can read it.
---

## Testing

We use **pytest** for running our test suite. The test files and scripts are located in the `/tests` directory:
   ```bash
   pytest tests/test_markitdownpro.py
   ```

---

## Usage
### CLI Usage
1. **Basic:**
   ```bash
   python main.py /path/to/document.pdf
   ```
   This will produce /path/to/document.md if successful.

2. **Specify Output Path:**
   ```bash
   python main.py /path/to/document.pst --output my_pst_output.md
   ```
### Programmatic Usage
You can import and call the pipeline directly from your Python code:
   ```python
   from conversion_pipeline import convert_document_to_md, convert_document_from_url

# 1) Local file example
md_text = convert_document_to_md("/path/to/my_file.pdf")
print("Extracted Markdown:", md_text)

# 2) URL example
md_from_url = convert_document_from_url("https://example.com/my_doc.docx", output_md="output_doc.md")
print("Output saved to output_doc.md")
```
---

## FAQ
1. **What if MarkItDown or Whisper is not installed?**
  The pipeline checks for each library’s availability. If a library is missing or fails, it gracefully moves on to the next fallback.

2. **Do I need Azure/OpenAI credentials?**

  Azure: If you want to use Document Intelligence or GPT-4o-mini, yes.
  OpenAI: If you want MarkItDown’s LLM-based image captioning or are using Whisper from openai’s library, you need appropriate credentials or local models.
  How do I handle large PST files?
  Large PSTs can be slow to process, especially if they contain many attachments. We parse them message-by-message, recursively handling attachments. For extremely large archives, you might want to increase concurrency or filter out attachments you don’t need.

3. **Does GPT-4o-mini require a publicly accessible image URL?**

  If you provide a local file path, the code base64-encodes it. This is ideal for truly local images.
  If you have a publicly hosted image, you can pass its URL directly.

4. **Why is Unstructured tried before Azure Doc Intelligence now?**
  We observed that **Unstructured** is typically **lower cost** to run (especially with Tesseract or local OCR) compared to Azure’s \$10 per 1,000 pages. So if MarkItDown fails, we want to try Unstructured next to potentially save cost. If that also fails, we move to Azure.

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "markitdown-pro",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.12.2",
    "maintainer_email": null,
    "keywords": null,
    "author": "Developer",
    "author_email": null,
    "download_url": "https://files.pythonhosted.org/packages/0e/72/31025ecb8aff07446e4574950b0ad366c6624e5b0fcf5bafe90d395430c6/markitdown_pro-1.3.7.tar.gz",
    "platform": null,
    "description": "# MarkItDown-Pro\n\n**MarkItDown-Pro** is an **improvement** of the **[Microsoft MarkItDown repository](https://github.com/markitdown)**, enhancing gaps and extending functionality by leveraging **Azure Document Intelligence SDK**, **Unstructured.io**, and other Azure services and libraries. The result is a comprehensive Python library and command-line tool designed to **convert diverse document formats into Markdown** with graceful fallbacks, including OCR support via GPT-4o-mini.\n\n---\n\n## Table of Contents\n\n- [Folder Structure](#folder-structure)\n- [Features & Highlights](#features--highlights)\n- [How It Works](#how-it-works)\n- [File-by-File Explanation](#file-by-file-explanation)\n  - [Main Files](#main-files)\n  - [Common Utils](#common-utils)\n  - [Converters](#converters)\n  - [Handlers](#handlers)\n- [Testing](#testing)\n- [Usage & Examples](#usage--examples)\n  - [CLI Usage](#cli-usage)\n  - [Programmatic Usage](#programmatic-usage)\n  - [Extra: Vector Database Chunking](#extra-vector-database-chunking)\n- [Environment Variables](#environment-variables)\n- [FAQ](#faq)\n\n---\n\n## Folder Structure\n\nA typical layout for **MarkItDown-Pro** might look like this:\n```bash\nmarkitdown-pro/\n\u251c\u2500\u2500 .env\n\u251c\u2500\u2500 README.md\n\u251c\u2500\u2500 requirements.txt\n\u251c\u2500\u2500 main.py\n\u251c\u2500\u2500 conversion_pipeline.py\n\u251c\u2500\u2500 common\n\u2502   \u2514\u2500\u2500 utils.py\n\u251c\u2500\u2500 converters\n\u2502   \u251c\u2500\u2500 markitdown_wrapper.py\n\u2502   \u251c\u2500\u2500 azure_docint.py\n\u2502   \u251c\u2500\u2500 unstructured_wrapper.py\n\u2502   \u2514\u2500\u2500 gpt4o_mini_vision.py\n\u251c\u2500\u2500 handlers\n\u2502   \u251c\u2500\u2500 pst_handler.py\n\u2502   \u251c\u2500\u2500 email_handler.py\n\u2502   \u251c\u2500\u2500 zip_handler.py\n\u2502   \u251c\u2500\u2500 audio_handler.py\n\u2502   \u2514\u2500\u2500 pdf_handler.py\n\u2514\u2500\u2500  tests\n    \u251c\u2500\u2500 data\n    \u2514\u2500\u2500 test.py\n```\n\n| Folder/File                 | Description                                                                           |\n|----------------------------|---------------------------------------------------------------------------------------|\n| **main.py**                | Entry point for CLI usage; uses `argparse` to accept file paths.                      |\n| **conversion_pipeline.py** | Orchestrates the fallback chain for converting documents to Markdown.                |\n| **common/**                | Shared utility functions, e.g. for file detection, text cleanup, etc.                 |\n| **converters/**            | Contains modules for using various 3rd-party libraries or services to extract text.   |\n| **handlers/**              | Specialized handlers for specific file types (PST, EML, ZIP, audio, PDF scanning).    |\n| **.env**                   | Environment variables (e.g., credentials for Azure GPT-4o-mini, Azure Doc Intelligence). |\n| **requirements.txt**       | Python dependencies needed to install and run this project.                           |\n| **tests/test_markitdownpro.py**| Recursively scans /tests/data/ and attempts to convert each file using convert_document_to_md|\n| **README.md**              | This documentation file, explaining usage and details of the project.                 |\n\n---\n\n## Features & Highlights\n\n1. **MarkItDown with LLM**\n   - Uses **MarkItDown** to convert documents to Markdown, optionally leveraging an OpenAI LLM to create image captions if you have an **OPENAI_API_KEY**.\n   - Auto-checks for `exiftool` if you want EXIF metadata in your images.\n\n2. **Whisper-Based Audio Transcription**\n   - Converts audio files (`.mp3`, `.wav`, `.ogg`, etc.) into text using [OpenAI Whisper](https://github.com/openai/whisper).\n   - Gracefully falls back if Whisper is not installed.\n\n3. **PST Extraction**\n   - Parses Outlook PST files with [`libratom`](https://github.com/rafproject/libratom), extracting emails and attachments recursively.\n\n4. **Scanned PDF Detection & Concurrency**\n   - Identifies PDFs with no text or embedded images, and automatically performs OCR on each page with GPT-4o-mini.\n   - Offers concurrent page-by-page OCR for faster performance.\n\n5. **Fallback to Azure Document Intelligence & Unstructured**\n   - If standard MarkItDown or specialized handlers fail or yield insufficient text, it tries Azure\u2019s Document Intelligence to extract textual layout.\n   - Unstructured.io library for broad coverage of file types.\n\n6. **GPT-4 Vision (or GPT-4o-mini) for Images & OCR**\n   - If an image or partially scanned PDF is detected, we can pass it to GPT-4o-mini for OCR.\n   - Supports local images (base64 encoding) or remote image URLs directly.\n\n7. **Handles ZIP & EML**\n   - **ZIP**: Unzips and processes each file inside, concatenating the results.\n   - **EML**: Extracts email text, attachments, and processes attachments recursively.\n\n8. **Graceful LLM Handling**\n   - If no **OPENAI_API_KEY** or GPT-4o-mini credentials are provided, it simply skips LLM-based features, logging a warning.\n\n9. **Helper Methods for URL & Stream Conversion**\n   - `convert_document_from_url(url, output_md)`\n   - `convert_document_from_stream(stream, extension, output_md)`\n   - `convert_document_to_md(local_path, output_md)`\n\n10. **Easy-to-Extend Architecture**\n   Each file type has its own **handler**. Each text-extraction library has its own **converter**. The main pipeline provides a centralized fallback sequence.\n\n11. **Environment-Driven Configuration**\n   - Pulls API keys, endpoints, and paths from `.env` to keep secrets out of source code.\n\n12. **Rich File Type Handling**\n\n| Category              | File Type(s) |\n|-----------------------|-------------|\n| PDF                  | .pdf |\n| PowerPoint           | .pot, .potm, .ppt, .pptm, .pptx |\n| Word Processing      | .abw, .doc, .docm, .docx, .dot, .dotm, .hwp, .zabw |\n| Excel/Spreadsheet    | .et, .fods, .uos1, .uos2, .wk2, .xls, .xlsb, .xlsm, .xlsx, .xlw |\n| Images              | .bmp, .gif, .heic, .jpeg, .jpg, .png, .prn, .svg, .tiff, .webp |\n| Audio               | .mp3, .wav, .ogg, .flac, .m4a, .aac, .wma, .webm, .opus |\n| HTML                | .htm, .html |\n| Text-Based Formats  | .csv, .json, .xml, .txt |\n| ZIP Files           | (Iterates over contents) |\n| Email               | .eml, .p7s |\n| PST                 | .pst |\n| EPUB                | .epub |\n| Markdown            | .md |\n| Org Mode            | .org |\n| Open Office         | .odt, .sgl |\n| Other              | .eth, .mw, .pbd, .sdp, .uof, .web |\n| Plain Text          | .txt |\n| reStructured Text   | .rst |\n| Rich Text           | .rtf |\n| StarOffice          | .sxg |\n| TSV                 | .tsv |\n| Apple               | .cwk, .mcw, .pages |\n| Data Interchange    | .dif |\n| dBase               | .dbf |\n| Microsoft Office    | .docx, .xlsx, .pptx |\n| HEIF Image Format   | .heif |\n\n\n---\n\n## How It Works\n\n1. **Detect File Type**: The pipeline checks the file extension or general signature (`.pdf`, `.zip`, `.eml`, `.docx`, `.mp3`, etc.).\n2. **Specialized Handlers**: If the file is PST, EML, ZIP, or audio, it\u2019s handed off to a dedicated module that handles that format.\n3. **MarkItDown**: For most generic document conversions, we first try [MarkItDown](https://github.com/markitdown).\n4. **Unstructured**: If MarkItDown fails or yields minimal text, we turn to [Unstructured.io](https://unstructured.io/) next.\n   - **Why?** It's typically **cheaper** than Azure Document Intelligence, and can handle partial OCR scenarios (via Tesseract, PaddleOCR, etc., if you configure `OCR_AGENT`).\n5. **Azure Document Intelligence**: If Unstructured also fails or yields minimal text, we try Azure Document Intelligence (prebuilt-layout).\n6. **GPT-4o-mini**: As a final fallback or specifically for OCR on images/scanned pages.\n7. **Saves** the extracted text to a `.md` file once any method returns sufficient content.\n\n---\n\n## File-by-File Explanation\n\n### Main Files\n\n- **`conversion_pipeline.py`**\n  The core logic that orchestrates the fallback chain. Checks each handler or converter in a specific order. Once a successful conversion with enough text is found, it writes to `.md` and stops.\n\n### Common Utils\n\n- **`common/utils.py`**\n  - **File Detection**: Contains helper functions like `is_pdf`, `is_audio`, `detect_extension`.\n  - **Markdown Cleaning**: Functions like `clean_markdown()` and `ensure_minimum_content()` to tidy up text and ensure it\u2019s not empty.\n\n### Converters\n\n- **`converters/markitdown_wrapper.py`**\n  - Wraps the [MarkItDown](https://github.com/markitdown) library for docx/image extraction, EXIF reading, and optional LLM-based image captioning.\n  - If MarkItDown is not installed, or fails, returns `None`.\n\n- **`converters/azure_docint.py`**\n  - Leverages Azure\u2019s Document Intelligence (prebuilt-layout) to extract text from PDFs and other document types in Markdown format.\n\n- **`converters/unstructured_wrapper.py`**\n  - Uses the [Unstructured.io](https://www.unstructured.io/) library to parse documents. Useful for handling broad, less-common file types.\n\n- **`converters/gpt4o_mini_vision.py`**\n  - Uses GPT-4o-mini (Azure ChatOpenAI) for OCR tasks on **images** or **scanned PDFs**.\n  - **Concurrent** or **simple** page-by-page approaches for PDFs.\n  - Can pass **URL-based images** or **local images** via Base64 encoding.\n\n### Handlers\n\n- **`handlers/pst_handler.py`**\n  - Parses PST archives with [`libratom`](https://github.com/rafproject/libratom) and extracts emails + attachments. Calls back into the pipeline for each attachment.\n\n- **`handlers/email_handler.py`**\n  - Processes `.eml` files, extracting plain text, attachments, etc. Recursively processes attachments.\n\n- **`handlers/zip_handler.py`**\n  - Unzips files, recurses into the pipeline for each contained file, and concatenates all Markdown output.\n\n- **`handlers/audio_handler.py`**\n  - Uses [OpenAI Whisper](https://github.com/openai/whisper) to transcribe `.mp3`, `.wav`, `.ogg`, etc.\n  - Caches the model in memory to speed up repeated use.\n\n- **`handlers/pdf_handler.py`**\n  - Utility to detect if a PDF is text-only, text+images, or fully scanned.\n  - Coordinates with GPT-4o-mini for OCR if needed.\n\n---\n\n## Installation\n\n1. **Clone the Repo**\n   ```bash\n   git clone https://github.com/YourName/markitdown-pro.git\n   cd markitdown-pro\n   ```\n2. **Create a Virtual Environment (recommended)**\n   ```bash\n   python -m venv venv\n   source venv/bin/activate  # or venv\\Scripts\\activate on Windows\n   ```\n3. **Create a Virtual Environment (recommended)**\n   ```bash\n   python -m venv venv\n   source venv/bin/activate  # or venv\\Scripts\\activate on Windows\n   ```\n4. **Install Dependencies**\n   ```bash\n   pip install --upgrade pip\n   pip install -r requirements.txt\n   ```\n  Note: You may also need system dependencies for libraries like PyMuPDF, libratom, etc.\n\n5. **Set Up .env**\n\n- Copy the sample .env to your root folder, and fill in your Azure or OpenAI API keys, etc. For example:\n   ```bash\n   AZURE_DOCINTEL_ENDPOINT=\"https://<your-region>.api.cognitive.microsoft.com\"\n   AZURE_DOCINTEL_KEY=\"YOUR_AZURE_KEY\"\n   AZURE_OPENAI_API_KEY=\"your azure open ai key\"\n   AZURE_OPENAI_API_VERSION=\"your azure open ai api version\"\n   AZURE_OPENAI_ENDPOINT=\"your azure open ai endpoint\"\n   AZURE_SPEECH_ENDPOINT=\"azure speech service endpoint - for audio conversion\"\n   AZURE_SPEECH_KEY=\"azure speech service key - for audio conversion\"\n   AZURE_SPEECH_REGION=\"azure speech service region - for audio conversion\"\n   ```\n  Make sure to source it or ensure python-dotenv can read it.\n---\n\n## Testing\n\nWe use **pytest** for running our test suite. The test files and scripts are located in the `/tests` directory:\n   ```bash\n   pytest tests/test_markitdownpro.py\n   ```\n\n---\n\n## Usage\n### CLI Usage\n1. **Basic:**\n   ```bash\n   python main.py /path/to/document.pdf\n   ```\n   This will produce /path/to/document.md if successful.\n\n2. **Specify Output Path:**\n   ```bash\n   python main.py /path/to/document.pst --output my_pst_output.md\n   ```\n### Programmatic Usage\nYou can import and call the pipeline directly from your Python code:\n   ```python\n   from conversion_pipeline import convert_document_to_md, convert_document_from_url\n\n# 1) Local file example\nmd_text = convert_document_to_md(\"/path/to/my_file.pdf\")\nprint(\"Extracted Markdown:\", md_text)\n\n# 2) URL example\nmd_from_url = convert_document_from_url(\"https://example.com/my_doc.docx\", output_md=\"output_doc.md\")\nprint(\"Output saved to output_doc.md\")\n```\n---\n\n## FAQ\n1. **What if MarkItDown or Whisper is not installed?**\n  The pipeline checks for each library\u2019s availability. If a library is missing or fails, it gracefully moves on to the next fallback.\n\n2. **Do I need Azure/OpenAI credentials?**\n\n  Azure: If you want to use Document Intelligence or GPT-4o-mini, yes.\n  OpenAI: If you want MarkItDown\u2019s LLM-based image captioning or are using Whisper from openai\u2019s library, you need appropriate credentials or local models.\n  How do I handle large PST files?\n  Large PSTs can be slow to process, especially if they contain many attachments. We parse them message-by-message, recursively handling attachments. For extremely large archives, you might want to increase concurrency or filter out attachments you don\u2019t need.\n\n3. **Does GPT-4o-mini require a publicly accessible image URL?**\n\n  If you provide a local file path, the code base64-encodes it. This is ideal for truly local images.\n  If you have a publicly hosted image, you can pass its URL directly.\n\n4. **Why is Unstructured tried before Azure Doc Intelligence now?**\n  We observed that **Unstructured** is typically **lower cost** to run (especially with Tesseract or local OCR) compared to Azure\u2019s \\$10 per 1,000 pages. So if MarkItDown fails, we want to try Unstructured next to potentially save cost. If that also fails, we move to Azure.\n\n",
    "bugtrack_url": null,
    "license": null,
    "summary": "A package that converts almost any file format to Markdown.",
    "version": "1.3.7",
    "project_urls": null,
    "split_keywords": [],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "eb240862925210d8aca5960320729532dde1a5a9db490e2cd98bcff503906128",
                "md5": "30c68688f7ee81b78dbb7996abd67426",
                "sha256": "631cf858fcb8e78d521bb77b653c357269e89eb6e1de3bea9c1ba96033f069bd"
            },
            "downloads": -1,
            "filename": "markitdown_pro-1.3.7-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "30c68688f7ee81b78dbb7996abd67426",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.12.2",
            "size": 56232,
            "upload_time": "2025-10-24T17:12:41",
            "upload_time_iso_8601": "2025-10-24T17:12:41.826868Z",
            "url": "https://files.pythonhosted.org/packages/eb/24/0862925210d8aca5960320729532dde1a5a9db490e2cd98bcff503906128/markitdown_pro-1.3.7-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "0e7231025ecb8aff07446e4574950b0ad366c6624e5b0fcf5bafe90d395430c6",
                "md5": "979f1f59b5b6e0bba14f4b8e312e9a4b",
                "sha256": "fc2a9c7de853f957d494c43bd87a7f9f418626aa6e5395ab73440ba37a90a53e"
            },
            "downloads": -1,
            "filename": "markitdown_pro-1.3.7.tar.gz",
            "has_sig": false,
            "md5_digest": "979f1f59b5b6e0bba14f4b8e312e9a4b",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.12.2",
            "size": 49731,
            "upload_time": "2025-10-24T17:12:43",
            "upload_time_iso_8601": "2025-10-24T17:12:43.087947Z",
            "url": "https://files.pythonhosted.org/packages/0e/72/31025ecb8aff07446e4574950b0ad366c6624e5b0fcf5bafe90d395430c6/markitdown_pro-1.3.7.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-10-24 17:12:43",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "markitdown-pro"
}

Developer