md-server

Name	md-server
Version	0.1.2
home_page	None
Summary	HTTP API server for converting documents, web pages, and media to markdown
upload_time	2025-08-10 17:41:22
maintainer	None
docs_url	None
author	None
requires_python	>=3.10
license	None
keywords	markdown document-conversion api server fastapi pdf docx

# md-server

**Convert any document, webpage, or media file to markdown via HTTP API.**

[![CI](https://github.com/peteretelej/md-server/actions/workflows/ci.yml/badge.svg)](https://github.com/peteretelej/md-server/actions/workflows/ci.yml)
[![Coverage Status](https://coveralls.io/repos/github/peteretelej/md-server/badge.svg?branch=main)](https://coveralls.io/github/peteretelej/md-server?branch=main)
[![PyPI version](https://img.shields.io/pypi/v/md-server.svg)](https://pypi.org/project/md-server/)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Docker](https://img.shields.io/badge/docker-ghcr.io-blue)](https://github.com/peteretelej/md-server/pkgs/container/md-server)

md-server provides an HTTP API that accepts files, URLs, or raw content and converts them to markdown. It automatically detects input types, handles everything from PDFs and Office documents to YouTube videos, images, and web pages with JavaScript rendering, and requires zero configuration to get started. Under the hood, it uses Microsoft's MarkItDown for document conversion and Crawl4AI for intelligent web scraping.

## Quick Start

```bash
# Starts server at localhost:8080
uvx md-server

# Convert a file
curl -X POST localhost:8080/convert --data-binary @document.pdf

# Convert a URL
curl -X POST localhost:8080/convert \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'

# Convert HTML text
curl -X POST localhost:8080/convert \
  -H "Content-Type: application/json" \
  -d '{"text": "<h1>Title</h1><p>Content</p>", "mime_type": "text/html"}'
```

## Installation

### Using uvx (Recommended)

```bash
uvx md-server
```

### Using Docker

You can run md-server with Docker using the [md-server docker image](https://github.com/peteretelej/md-server/pkgs/container/md-server). The Docker image includes full browser support for JavaScript rendering.

```bash
docker run -p 127.0.0.1:8080:8080 ghcr.io/peteretelej/md-server
```

**Resource Requirements:**
- Memory: 1GB recommended (minimum 512MB)
- Storage: ~1.2GB image size
- Initial startup: 10-15 seconds (browser initialization)

## API

### `POST /convert`

Single endpoint that accepts multiple input types and automatically detects what you're sending.

#### Input Methods

```bash
# Binary file upload
curl -X POST localhost:8080/convert --data-binary @document.pdf

# Multipart form upload
curl -X POST localhost:8080/convert -F "file=@presentation.pptx"

# URL conversion
curl -X POST localhost:8080/convert \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'

# Base64 content
curl -X POST localhost:8080/convert \
  -H "Content-Type: application/json" \
  -d '{"content": "base64_encoded_file_here", "filename": "report.docx"}'

# Raw text
curl -X POST localhost:8080/convert \
  -H "Content-Type: application/json" \
  -d '{"text": "# Already Markdown\n\nBut might need cleaning"}'

# Text with specific format (HTML, XML, etc.)
curl -X POST localhost:8080/convert \
  -H "Content-Type: application/json" \
  -d '{"text": "<h1>HTML Title</h1><p>Convert HTML to markdown</p>", "mime_type": "text/html"}'
```
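
The Base64 variant above uses a placeholder string for `content`. As a minimal sketch of how that body could be produced from real bytes, the following uses only the Python standard library. The helper names `build_base64_request` and `send` are hypothetical, and the base URL assumes the default address from Quick Start.

```python
import base64
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # assumption: default address from Quick Start


def build_base64_request(data: bytes, filename: str) -> urllib.request.Request:
    """Build (but do not send) a POST /convert request carrying Base64 content."""
    payload = {
        # JSON cannot carry raw bytes, so the file is Base64-encoded to ASCII text
        "content": base64.b64encode(data).decode("ascii"),
        "filename": filename,
    }
    return urllib.request.Request(
        f"{BASE_URL}/convert",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

With a server running, `send(build_base64_request(open("report.docx", "rb").read(), "report.docx"))` would return the parsed JSON response.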

#### Response Format

```json
{
  "success": true,
  "markdown": "# Converted Content\n\nYour markdown here...",
  "metadata": {
    "source_type": "pdf",
    "source_size": 102400,
    "markdown_size": 8192,
    "conversion_time_ms": 245,
    "detected_format": "application/pdf"
  },
  "request_id": "req_550e8400-e29b-41d4-a716-446655440000"
}
```
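
A client may want typed access to these fields rather than raw dictionary lookups. The sketch below maps the response shape shown above onto a dataclass. `ConversionResult`, `parse_success`, and `size_ratio` are illustrative names, not part of md-server, and the code assumes every field in the sample is always present.

```python
from dataclasses import dataclass


@dataclass
class ConversionResult:
    markdown: str
    source_type: str
    source_size: int
    markdown_size: int
    conversion_time_ms: int
    request_id: str

    @property
    def size_ratio(self) -> float:
        """Markdown size relative to source size, as reported by the server."""
        return self.markdown_size / self.source_size if self.source_size else 0.0


def parse_success(body: dict) -> ConversionResult:
    """Convert a decoded success response into a ConversionResult."""
    if not body.get("success"):
        raise ValueError("response does not report success")
    meta = body["metadata"]
    return ConversionResult(
        markdown=body["markdown"],
        source_type=meta["source_type"],
        source_size=meta["source_size"],
        markdown_size=meta["markdown_size"],
        conversion_time_ms=meta["conversion_time_ms"],
        request_id=body["request_id"],
    )


# The sample response from this section, as a Python dict
sample = {
    "success": True,
    "markdown": "# Converted Content\n\nYour markdown here...",
    "metadata": {
        "source_type": "pdf",
        "source_size": 102400,
        "markdown_size": 8192,
        "conversion_time_ms": 245,
        "detected_format": "application/pdf",
    },
    "request_id": "req_550e8400-e29b-41d4-a716-446655440000",
}
result = parse_success(sample)
```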

#### Options

```json
{
  "url": "https://example.com",
  "options": {
    "js_rendering": true, // Use headless browser for JavaScript sites
    "extract_images": true, // Extract and link images
    "ocr_enabled": true, // OCR for scanned PDFs/images
    "preserve_formatting": true // Keep complex formatting
  }
}
```
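
The `//` comments in the block above are annotations only. Standard JSON has no comment syntax, so a request body should leave them out. One way to do that is to build the body in code. In this sketch, `build_url_payload` is a hypothetical helper. The source does not state server defaults, so options that are not passed are omitted from the body rather than guessed.

```python
import json
from typing import Optional

# Option names taken from the example above
ALLOWED_OPTIONS = {
    "js_rendering",
    "extract_images",
    "ocr_enabled",
    "preserve_formatting",
}


def build_url_payload(url: str, **options: Optional[bool]) -> str:
    """Return a comment-free JSON body for POST /convert with optional flags."""
    unknown = set(options) - ALLOWED_OPTIONS
    if unknown:
        raise ValueError(f"unknown options: {sorted(unknown)}")
    payload: dict = {"url": url}
    # Only send options the caller set explicitly
    chosen = {name: value for name, value in options.items() if value is not None}
    if chosen:
        payload["options"] = chosen
    return json.dumps(payload)


body = build_url_payload("https://example.com", js_rendering=True, ocr_enabled=None)
```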

### `GET /formats`

Returns supported formats and capabilities.

```bash
curl localhost:8080/formats
```

### `GET /health`

Health check endpoint.

```bash
curl localhost:8080/health
```
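
Because the Docker image can take 10-15 seconds to start, a client or deployment script may want to poll `/health` before sending work. The following is a rough sketch. It assumes that an HTTP 200 from `/health` means the server is ready, since the response body is not documented here. `http_probe` and `wait_until_healthy` are hypothetical names.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str = "http://localhost:8080/health") -> bool:
    """Return True if GET url answers with HTTP 200 (assumed to mean healthy)."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or HTTP error: not ready yet
        return False


def wait_until_healthy(
    probe: Callable[[], bool] = http_probe,
    timeout: float = 30.0,
    interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll probe() until it succeeds or timeout seconds have elapsed."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

Passing the probe and the sleep function as parameters keeps the loop easy to test without a running server.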

## Supported Formats

**Documents**: PDF, DOCX, XLSX, PPTX, ODT, ODS, ODP  
**Web**: HTML, URLs (with JavaScript rendering)  
**Images**: PNG, JPG, JPEG (with OCR)  
**Audio**: MP3, WAV (transcription)  
**Video**: YouTube URLs  
**Text**: TXT, MD, CSV, XML, JSON

## Advanced Usage

### Enhanced URL Conversion

**Docker deployments** include full browser support automatically - JavaScript rendering is enabled out of the box.

**Local installations** use MarkItDown for URL conversion by default. To enable **Crawl4AI** with JavaScript rendering:

```bash
uvx playwright install-deps
uvx playwright install chromium
```

When browsers are available, md-server automatically uses Crawl4AI for better handling of JavaScript-heavy sites, smart content extraction, and enhanced web crawling capabilities.

### Pipe from Other Commands

```bash
# Convert HTML from stdin
echo "<h1>Hello</h1>" | curl -X POST localhost:8080/convert \
  --data-binary @- \
  -H "Content-Type: text/html"

# Chain with other tools
pdftotext document.pdf - | curl -X POST localhost:8080/convert \
  --data-binary @-
```

### Python Client Example

```python
import requests

# Convert file
with open('document.pdf', 'rb') as f:
    response = requests.post(
        'http://localhost:8080/convert',
        data=f.read(),
        headers={'Content-Type': 'application/pdf'}
    )
    markdown = response.json()['markdown']

# Convert URL
response = requests.post(
    'http://localhost:8080/convert',
    json={'url': 'https://example.com'}
)
markdown = response.json()['markdown']
```

## Error Handling

Errors include actionable information:

```json
{
  "success": false,
  "error": {
    "code": "UNSUPPORTED_FORMAT",
    "message": "File format not supported",
    "details": {
      "detected_format": "application/x-rar",
      "supported_formats": ["pdf", "docx", "html", "..."]
    }
  },
  "request_id": "req_550e8400-e29b-41d4-a716-446655440000"
}
```
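
On the client side, the error envelope above can be turned into an exception so that callers do not have to check `success` by hand. This is a minimal sketch. `ConversionError` and `unwrap` are illustrative names, not part of md-server, and the code assumes only the fields shown in the example.

```python
class ConversionError(Exception):
    """Raised when a /convert response reports success == false."""

    def __init__(self, code: str, message: str, details: dict, request_id: str):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.details = details
        self.request_id = request_id


def unwrap(body: dict) -> str:
    """Return the markdown from a decoded response, or raise ConversionError."""
    if body.get("success"):
        return body["markdown"]
    error = body.get("error", {})
    raise ConversionError(
        code=error.get("code", "UNKNOWN"),
        message=error.get("message", ""),
        details=error.get("details", {}),
        request_id=body.get("request_id", ""),
    )


# The sample error response from this section, as a Python dict
sample_error = {
    "success": False,
    "error": {
        "code": "UNSUPPORTED_FORMAT",
        "message": "File format not supported",
        "details": {
            "detected_format": "application/x-rar",
            "supported_formats": ["pdf", "docx", "html", "..."],
        },
    },
    "request_id": "req_550e8400-e29b-41d4-a716-446655440000",
}
```

Keeping `request_id` on the exception makes it available when reporting a failed conversion.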

## Development

See [CONTRIBUTING.md](docs/CONTRIBUTING.md) for development setup, testing, and contribution guidelines.

## Powered By

This project makes use of these excellent tools:

[![Powered by Crawl4AI](https://raw.githubusercontent.com/unclecode/crawl4ai/main/docs/assets/powered-by-light.svg)](https://github.com/unclecode/crawl4ai) [![microsoft/markitdown](https://img.shields.io/badge/microsoft-MarkItDown-0078D4?style=for-the-badge&logo=microsoft)](https://github.com/microsoft/markitdown) [![Litestar Project](https://img.shields.io/badge/Litestar%20Org-%E2%AD%90%20Litestar-202235.svg?logo=python&labelColor=202235&color=edb641&logoColor=edb641)](https://github.com/litestar-org/litestar)

## License

[MIT](./LICENSE)

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "md-server",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": "Peter Etelej <peter@etelej.com>",
    "keywords": "markdown, document-conversion, api, server, fastapi, pdf, docx",
    "author": null,
    "author_email": "Peter Etelej <peter@etelej.com>",
    "download_url": "https://files.pythonhosted.org/packages/49/c0/989275cea90ccd66dc7129acfd79764eeb101d5eb474996d521506cd1857/md_server-0.1.2.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "HTTP API server for converting documents, web pages, and media to markdown",
    "version": "0.1.2",
    "project_urls": {
        "Documentation": "https://github.com/peteretelej/md-server#readme",
        "Homepage": "https://github.com/peteretelej/md-server",
        "Issues": "https://github.com/peteretelej/md-server/issues",
        "Repository": "https://github.com/peteretelej/md-server.git"
    },
    "split_keywords": [
        "markdown",
        " document-conversion",
        " api",
        " server",
        " fastapi",
        " pdf",
        " docx"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "3f917bdbd78e21a0ed21a1fc739028744fa63b2e744eb4a855c9876723452991",
                "md5": "d1854f6dea3b7999cbdae4c075aa09e7",
                "sha256": "4e945555408ecbda058de01992789e64572e05ffd94ff3f0264ed4e5eb9be895"
            },
            "downloads": -1,
            "filename": "md_server-0.1.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "d1854f6dea3b7999cbdae4c075aa09e7",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10",
            "size": 22638,
            "upload_time": "2025-08-10T17:41:21",
            "upload_time_iso_8601": "2025-08-10T17:41:21.630745Z",
            "url": "https://files.pythonhosted.org/packages/3f/91/7bdbd78e21a0ed21a1fc739028744fa63b2e744eb4a855c9876723452991/md_server-0.1.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "49c0989275cea90ccd66dc7129acfd79764eeb101d5eb474996d521506cd1857",
                "md5": "92ec6081ff4b1bd0fcb499a752bbf51b",
                "sha256": "f490133e1987e8d5ec78b5cb2e587233d7d1cf7c080441b647b2ce061c92ba6f"
            },
            "downloads": -1,
            "filename": "md_server-0.1.2.tar.gz",
            "has_sig": false,
            "md5_digest": "92ec6081ff4b1bd0fcb499a752bbf51b",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 21718,
            "upload_time": "2025-08-10T17:41:22",
            "upload_time_iso_8601": "2025-08-10T17:41:22.830720Z",
            "url": "https://files.pythonhosted.org/packages/49/c0/989275cea90ccd66dc7129acfd79764eeb101d5eb474996d521506cd1857/md_server-0.1.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-08-10 17:41:22",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "peteretelej",
    "github_project": "md-server#readme",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "md-server"
}
