markitdown-chunker

Name	markitdown-chunker
Version	0.1.0
home_page	https://github.com/Naveenkumarar/markitdown-chunker
Summary	Convert documents to markdown, chunk them intelligently, and export structured data
upload_time	2025-10-10 01:05:22
maintainer	None
docs_url	None
author	Naveen Kumar Rajarajan
requires_python	>=3.8
license	MIT
keywords	markdown converter chunker document-processing langchain markitdown
requirements	markitdown langchain langchain-text-splitters

# MarkitDown Chunker

A Python package that converts documents to markdown, splits the markdown into structure-aware chunks, and exports the chunks with their metadata as JSON. It is built as an add-on to the [markitdown](https://github.com/microsoft/markitdown) package and uses LangChain text splitters for chunking.

## ✨ Features

- 📄 **Multi-format Support**: Convert PDF, DOCX, PPTX, XLSX, HTML, RTF, ODT, and more to markdown
- 🖼️ **Image Extraction**: Automatically extract images from PDF, DOCX, PPTX files ([requires optional dependencies](docs/IMAGE_EXTRACTION.md))
- 🎨 **Image Summarization**: Optional AI-powered image descriptions for better context
- ✂️ **Smart Chunking**: Markdown-aware text splitting that respects document structure
- 📊 **Structured Export**: Export chunks with metadata to JSON format
- 🔧 **Flexible Pipeline**: Run individual steps or the complete pipeline as needed
- 🎯 **CLI & Python API**: Use from the command line or integrate into your Python applications

## 📦 Installation

### Basic Installation

```bash
pip install markitdown-chunker
```

### With Image Extraction Support

To extract images from PDF, DOCX, and PPTX files:

```bash
pip install "markitdown-chunker[images]"
```

See [Image Extraction Guide](docs/IMAGE_EXTRACTION.md) for details.

### From Source

```bash
git clone https://github.com/Naveenkumarar/markitdown-chunker.git
cd markitdown-chunker
pip install -e .
# Or with image support:
pip install -e ".[images]"
```

## 🚀 Quick Start

### Command Line Interface

```bash
# Convert, chunk, and export (full pipeline)
markitdown-chunker input.pdf output/

# Convert only
markitdown-chunker document.docx output/ --convert-only

# Chunk existing markdown
markitdown-chunker document.md output/ --chunk-only

# Custom chunk size and overlap
markitdown-chunker input.pdf output/ --chunk-size 2000 --overlap 400

# List supported formats
markitdown-chunker --list-formats
```

### Python API

#### Complete Pipeline

```python
from markitdown_chunker import MarkitDownProcessor

# Initialize processor with custom settings
processor = MarkitDownProcessor(
    chunk_size=1000,
    chunk_overlap=200,
    use_markdown_splitter=True
)

# Process a document (all steps)
result = processor.process(
    file_path="document.pdf",
    output_dir="output/"
)

print(f"Markdown saved to: {result['conversion']['markdown_path']}")
print(f"Created {len(result['chunking']['chunks'])} chunks")
print(f"JSON exported to: {result['export']['json_path']}")
```

#### Step-by-Step Processing

```python
from markitdown_chunker import MarkdownConverter, DocumentChunker, JSONExporter

# Step 1: Convert to Markdown
converter = MarkdownConverter()
conversion_result = converter.convert(
    file_path="document.pdf",
    output_dir="output/",
    save_images=True
)

# Step 2: Chunk the markdown
chunker = DocumentChunker(
    chunk_size=1000,
    chunk_overlap=200
)
chunks = chunker.chunk_file(
    markdown_path=conversion_result['markdown_path']
)

# Step 3: Export to JSON
exporter = JSONExporter()
json_path = exporter.export(
    chunks=chunks,
    output_path="output/chunks.json"
)
```

## 📚 Supported File Formats

- **Documents**: PDF, DOCX, DOC, RTF, ODT, TXT, MD
- **Presentations**: PPTX, PPT, ODP
- **Spreadsheets**: XLSX, XLS, ODS
- **Web**: HTML, HTM

> **Note**: Audio/video files (MP3, MP4, etc.) require ffmpeg. See [docs/FFMPEG_AUDIO.md](docs/FFMPEG_AUDIO.md) for details.
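
When scanning a folder, it can help to filter files by extension before handing them to the pipeline. The following is a minimal sketch that copies the extension list above into a lookup table; the `SUPPORTED` table and the `classify` helper are hypothetical names and not part of the package, and `markitdown-chunker --list-formats` remains the authoritative source.

```python
from pathlib import Path

# Extension table copied from the list above (illustrative, not exported by the package).
SUPPORTED = {
    "documents": {".pdf", ".docx", ".doc", ".rtf", ".odt", ".txt", ".md"},
    "presentations": {".pptx", ".ppt", ".odp"},
    "spreadsheets": {".xlsx", ".xls", ".ods"},
    "web": {".html", ".htm"},
}


def classify(path):
    """Return the format category for a path, or None if the extension is not listed."""
    suffix = Path(path).suffix.lower()
    for category, extensions in SUPPORTED.items():
        if suffix in extensions:
            return category
    return None


for name in ["Report.PDF", "slides.pptx", "clip.mp4"]:
    print(name, "->", classify(name))
```

The comparison is case-insensitive, so `Report.PDF` is treated like `report.pdf`.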

## 📂 Output Directory Structure

After processing a document, the output directory will contain:

```
output/
├── document.md                    # Converted markdown file
├── document_chunks.json           # Chunks with metadata and statistics
└── images/                        # Extracted images (if any)
    ├── page1_img1.png
    ├── page2_img1.jpg
    ├── page3_img1.png
    └── page3_img2.jpg
```

### Example Output Files

**`document.md`** - Markdown conversion with image references:
```markdown
# Document Title

Document content converted to markdown format...

## Extracted Images

![Page 1 Image 1](images/page1_img1.png)

![Page 2 Image 1](images/page2_img1.jpg)
```

**`document_chunks.json`** - Structured chunk data:
```json
{
  "source_info": {
    "source_file": "document.pdf",
    "markdown_file": "output/document.md",
    "images_dir": "output/images"
  },
  "chunks": [
    {
      "text": "Document content chunk...",
      "metadata": {
        "Header 1": "Introduction",
        "chunk_index": 0,
        "source_file": "output/document.md"
      }
    }
  ],
  "total_chunks": 42,
  "statistics": {
    "total_characters": 33852,
    "avg_chunk_size": 806.0,
    "min_chunk_size": 234,
    "max_chunk_size": 1000
  },
  "exported_at": "2025-10-10T10:30:45.123456"
}
```
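
The exported file is plain JSON, so it can be inspected with the standard library alone. The sketch below assumes the layout shown above (a top-level `chunks` list whose items carry a `text` field); the helper name `summarize_chunks` is hypothetical, and the demo writes a small hand-made file instead of reading real package output.

```python
import json
import tempfile
from pathlib import Path


def summarize_chunks(json_path):
    """Load an exported chunk file and recompute basic size statistics."""
    data = json.loads(Path(json_path).read_text(encoding="utf-8"))
    sizes = [len(chunk["text"]) for chunk in data["chunks"]]
    return {
        "total_chunks": len(sizes),
        "total_characters": sum(sizes),
        "min_chunk_size": min(sizes) if sizes else 0,
        "max_chunk_size": max(sizes) if sizes else 0,
    }


# Demo on a tiny hand-written file that mimics the documented layout.
sample = {
    "chunks": [
        {"text": "Intro text", "metadata": {"Header 1": "Introduction", "chunk_index": 0}},
        {"text": "More", "metadata": {"Header 1": "Introduction", "chunk_index": 1}},
    ],
    "total_chunks": 2,
}
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "document_chunks.json"
    path.write_text(json.dumps(sample), encoding="utf-8")
    summary = summarize_chunks(path)

print(summary)
```

Recomputing the figures this way is a quick sanity check against the `statistics` block stored in the file.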

**`images/`** - Extracted images with organized naming:
- PDF images: `page{N}_img{M}.{ext}` (e.g., `page1_img1.png`)
- DOCX images: `docx_img{N}.{ext}` (e.g., `docx_img1.jpg`)
- PPTX images: `slide{N}_img{M}.{ext}` (e.g., `slide1_img1.png`)

> 💡 **Tip**: The images directory is only created if the document contains images and `save_images=True` (default).
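
Because the file names follow a fixed scheme, they can be mapped back to their origin with a few regular expressions. This is a minimal sketch based only on the three patterns listed above; `parse_image_name` and the shape of the returned dictionary are assumptions made for illustration.

```python
import re

# Patterns follow the naming scheme listed above.
_PATTERNS = [
    ("pdf", re.compile(r"^page(\d+)_img(\d+)\.(\w+)$")),
    ("pptx", re.compile(r"^slide(\d+)_img(\d+)\.(\w+)$")),
    ("docx", re.compile(r"^docx_img(\d+)\.(\w+)$")),
]


def parse_image_name(filename):
    """Describe an extracted image, or return None if the name does not match."""
    for source, pattern in _PATTERNS:
        match = pattern.match(filename)
        if not match:
            continue
        groups = match.groups()
        if source == "docx":
            # DOCX names carry only an image index, no page or slide number.
            return {"source": source, "container": None,
                    "image": int(groups[0]), "ext": groups[1]}
        return {"source": source, "container": int(groups[0]),
                "image": int(groups[1]), "ext": groups[2]}
    return None


print(parse_image_name("page3_img2.jpg"))
print(parse_image_name("docx_img1.jpg"))
print(parse_image_name("notes.txt"))
```

Here `container` holds the page number for PDF images and the slide number for PPTX images.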

## 🎛️ Configuration Options

### Chunking Parameters

```python
processor = MarkitDownProcessor(
    chunk_size=1000,           # Maximum characters per chunk
    chunk_overlap=200,          # Overlap between consecutive chunks
    use_markdown_splitter=True, # Use markdown-aware splitting
    json_indent=2              # JSON formatting
)
```
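
To see how `chunk_size` and `chunk_overlap` interact, consider a plain fixed-window splitter. This is a deliberately simplified sketch and not the package's implementation: the package relies on LangChain text splitters, whose behavior is more involved. The function name `chunk_text` is hypothetical.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in the range [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


demo_chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2)
print(demo_chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so a larger overlap yields more chunks for the same text.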

### Processing Options

```python
result = processor.process(
    file_path="input.pdf",
    output_dir="output/",
    save_images=True,                    # Save extracted images
    include_image_summaries=False,       # Set True to add image summaries to chunks
    image_summarizer=my_summarizer_func, # Custom image summarizer (a callable you define)
    skip_conversion=False,               # Set True if the input is already markdown
    skip_chunking=False,                 # Set True to stop after conversion
    skip_export=False                    # Set True to skip the JSON export
)
```

## 🔬 Advanced Usage

### Custom Image Summarization

```python
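from markitdown_chunker import MarkitDownProcessor


def summarize_image(image_path: str) -> str:
    """Your custom image summarization logic."""
    # Example: call a vision model. `my_vision_model` is a placeholder
    # module name; substitute your own implementation.
    from my_vision_model import analyze_image
    return analyze_image(image_path)

processor = MarkitDownProcessor()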
result = processor.process(
    file_path="document.pdf",
    output_dir="output/",
    include_image_summaries=True,
    image_summarizer=summarize_image
)
```

### Batch Processing

```python
from markitdown_chunker import MarkitDownProcessor

processor = MarkitDownProcessor()

files = ["doc1.pdf", "doc2.docx", "doc3.pptx"]
results = processor.process_batch(
    file_paths=files,
    output_dir="output/"
)

for result in results:
    if "error" in result:
        print(f"Failed: {result['input_file']} - {result['error']}")
    else:
        print(f"Success: {result['input_file']}")
```
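
For larger batches it is often handy to separate failures from successes so the failed inputs can be retried. The sketch below assumes only the result shape used in the loop above (an `input_file` key, plus an `error` key on failure); `partition_results` is a hypothetical helper, and the sample list stands in for real `process_batch` output.

```python
def partition_results(results):
    """Split batch results into (succeeded, failed) based on the presence of an 'error' key."""
    succeeded, failed = [], []
    for result in results:
        (failed if "error" in result else succeeded).append(result)
    return succeeded, failed


# Hand-written stand-ins for the dictionaries shown in the loop above.
sample_results = [
    {"input_file": "doc1.pdf"},
    {"input_file": "doc2.docx", "error": "unsupported format"},
    {"input_file": "doc3.pptx"},
]

succeeded, failed = partition_results(sample_results)
retry_paths = [result["input_file"] for result in failed]
print(f"{len(succeeded)} succeeded, {len(failed)} failed; retry: {retry_paths}")
```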

### Individual Step Processing

```python
from markitdown_chunker import MarkitDownProcessor

processor = MarkitDownProcessor()

# Only convert to markdown
conversion = processor.convert_only(
    file_path="document.pdf",
    output_dir="output/"
)

# Only chunk existing markdown
chunks = processor.chunk_only(
    markdown_path="document.md"
)

# Only export chunks
processor.export_only(
    chunks=chunks,
    output_path="output/chunks.json",
    source_info={"source": "document.md"}
)
```

### Custom Markdown Header Splitting

```python
from markitdown_chunker import DocumentChunker

chunker = DocumentChunker(
    chunk_size=1000,
    chunk_overlap=200,
    use_markdown_splitter=True,
    headers_to_split_on=[
        ("#", "Title"),
        ("##", "Section"),
        ("###", "Subsection"),
        ("####", "Paragraph")
    ]
)

chunks = chunker.chunk_file("document.md")
```
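
The `(marker, label)` pairs control which headings start a new section and under which metadata key each heading's text is stored. The standard-library sketch below illustrates that idea only; it is not how the package or LangChain implements header splitting, `split_by_headers` is a hypothetical name, and it ignores edge cases such as `#` lines inside fenced code.

```python
def split_by_headers(markdown_text, headers_to_split_on):
    """Split markdown into sections and tag each with its enclosing headers."""
    # Try longer markers first so "##" is never mistaken for "#".
    markers = sorted(headers_to_split_on, key=lambda pair: len(pair[0]), reverse=True)
    sections = []
    active = {}          # header level -> (label, header text)
    current_lines = []

    def flush():
        text = "\n".join(current_lines).strip()
        if text:
            metadata = {label: title for _, (label, title) in sorted(active.items())}
            sections.append({"text": text, "metadata": metadata})
        current_lines.clear()

    for line in markdown_text.splitlines():
        stripped = line.strip()
        match = next(((m, lbl) for m, lbl in markers if stripped.startswith(m + " ")), None)
        if match is None:
            current_lines.append(line)
            continue
        flush()  # close the section that ran up to this header
        marker, label = match
        level = len(marker)
        # A new header replaces any header at the same or a deeper level.
        for deeper in [lvl for lvl in active if lvl >= level]:
            del active[deeper]
        active[level] = (label, stripped[len(marker):].strip())
    flush()
    return sections


doc = "# Guide\nIntro line\n## Setup\nInstall it\n## Usage\nRun it\n"
sections = split_by_headers(doc, [("#", "Title"), ("##", "Section")])
for section in sections:
    print(section)
```

Each section inherits the labels of every heading above it, which is why the custom labels (`Title`, `Section`, and so on) appear as metadata keys.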

## 📤 Output Format

### JSON Structure

```json
{
  "source_info": {
    "source_file": "document.pdf",
    "markdown_file": "output/document.md",
    "output_dir": "output/",
    "images_dir": "output/images"
  },
  "chunks": [
    {
      "text": "Chunk content here...",
      "metadata": {
        "Header 1": "Introduction",
        "Header 2": "Overview",
        "sub_chunk_index": 0,
        "total_sub_chunks": 1,
        "source_file": "output/document.md",
        "chunk_size_config": 1000,
        "chunk_overlap_config": 200
      }
    }
  ],
  "total_chunks": 42,
  "statistics": {
    "total_characters": 33852,
    "avg_chunk_size": 806.0,
    "min_chunk_size": 234,
    "max_chunk_size": 1000
  },
  "exported_at": "2025-10-09T10:30:45.123456"
}
```
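
The header keys in each chunk's `metadata` make it straightforward to regroup chunks by section after export. This is a small sketch that assumes the `Header 1` key shown above; `group_by_header` and the `"(no header)"` fallback label are illustrative choices, not part of the package.

```python
from collections import defaultdict


def group_by_header(chunks, key="Header 1"):
    """Group chunk texts by the value of one metadata key."""
    groups = defaultdict(list)
    for chunk in chunks:
        heading = chunk["metadata"].get(key, "(no header)")
        groups[heading].append(chunk["text"])
    return dict(groups)


# Hand-written chunks in the documented shape.
sample_chunks = [
    {"text": "Welcome.", "metadata": {"Header 1": "Introduction", "Header 2": "Overview"}},
    {"text": "Scope.", "metadata": {"Header 1": "Introduction"}},
    {"text": "Step one.", "metadata": {"Header 1": "Usage"}},
    {"text": "Loose text.", "metadata": {}},
]

grouped = group_by_header(sample_chunks)
for heading, texts in grouped.items():
    print(heading, "->", texts)
```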

## 🛠️ CLI Reference

```bash
markitdown-chunker [-h] [--convert-only | --chunk-only | --no-export]
                    [--chunk-size CHUNK_SIZE] [--overlap OVERLAP]
                    [--no-markdown-splitter] [--no-images]
                    [--include-image-summaries] [--json-indent JSON_INDENT]
                    [--list-formats] [--version] [-v]
                    input output

Positional Arguments:
  input                 Input file path
  output                Output directory

Optional Arguments:
  -h, --help            Show help message
  --convert-only        Only convert to markdown
  --chunk-only          Only chunk existing markdown
  --no-export           Skip JSON export
  --chunk-size SIZE     Maximum chunk size (default: 1000)
  --overlap SIZE        Chunk overlap (default: 200)
  --no-markdown-splitter Disable markdown-aware splitting
  --no-images           Don't save extracted images
  --include-image-summaries
                        Include image summaries in chunks
  --json-indent N       JSON indentation (default: 2)
  --list-formats        List supported formats
  --version             Show version
  -v, --verbose         Enable verbose output
```
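
The usage line above maps naturally onto Python's `argparse`. The sketch below shows one way such an interface can be declared, including the mutually exclusive mode flags; it is an illustration under that assumption and not the package's actual parser. `--list-formats` and `--version` are omitted because they need special handling to work without the positional arguments.

```python
import argparse


def build_parser():
    """Declare a parser that mirrors the documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(prog="markitdown-chunker")
    # The usage line shows these three as alternatives: [a | b | c].
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument("--convert-only", action="store_true")
    mode.add_argument("--chunk-only", action="store_true")
    mode.add_argument("--no-export", action="store_true")
    parser.add_argument("--chunk-size", type=int, default=1000)
    parser.add_argument("--overlap", type=int, default=200)
    parser.add_argument("--no-markdown-splitter", action="store_true")
    parser.add_argument("--no-images", action="store_true")
    parser.add_argument("--include-image-summaries", action="store_true")
    parser.add_argument("--json-indent", type=int, default=2)
    parser.add_argument("-v", "--verbose", action="store_true")
    parser.add_argument("input", help="Input file path")
    parser.add_argument("output", help="Output directory")
    return parser


args = build_parser().parse_args(
    ["input.pdf", "output/", "--chunk-size", "2000", "--overlap", "400"]
)
print(args.input, args.output, args.chunk_size, args.overlap)
```

With this declaration, passing both `--convert-only` and `--chunk-only` makes `argparse` exit with a usage error, which matches the `|` alternatives in the usage line.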

## 🧪 Development

### Setup Development Environment

```bash
git clone https://github.com/Naveenkumarar/markitdown-chunker.git
cd markitdown-chunker
pip install -e ".[dev]"
```

### Run Tests

```bash
pytest tests/
pytest --cov=markitdown_chunker tests/
```

### Code Formatting

```bash
black markitdown_chunker/
flake8 markitdown_chunker/
mypy markitdown_chunker/
```

## 🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🙏 Acknowledgments

- Built on top of [markitdown](https://github.com/microsoft/markitdown) by Microsoft
- Uses [LangChain](https://github.com/langchain-ai/langchain) text splitters for intelligent chunking

## 📞 Support

- 🐛 [Report a bug](https://github.com/Naveenkumarar/markitdown-chunker/issues)
- 💡 [Request a feature](https://github.com/Naveenkumarar/markitdown-chunker/issues)
- 📖 [Documentation](https://github.com/Naveenkumarar/markitdown-chunker)
- 🖼️ [Image Extraction Guide](docs/IMAGE_EXTRACTION.md)
- 🎵 [Audio/Video Processing Guide](docs/FFMPEG_AUDIO.md)

## 🗺️ Roadmap

- [ ] Support for more document formats
- [ ] Advanced chunking strategies (semantic, sentence-based)
- [ ] Integration with vector databases
- [ ] Web UI for document processing
- [ ] Cloud storage integration (S3, GCS, Azure)
- [ ] Parallel batch processing
- [ ] Custom output formats (CSV, Parquet, etc.)

---

Made with ❤️ by the MarkitDown Chunker community

Raw data

{
    "_id": null,
    "home_page": "https://github.com/Naveenkumarar/markitdown-chunker",
    "name": "markitdown-chunker",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "markdown, converter, chunker, document-processing, langchain, markitdown",
    "author": "Naveen Kumar Rajarajan",
    "author_email": "Naveen Kumar Rajarajan <rajarajannaveenkumar@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/6b/38/2597def5d2f3728fdff57443196bc46c4d3796df9fa2b80c5019feafd2e4/markitdown_chunker-0.1.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Convert documents to markdown, chunk them intelligently, and export structured data",
    "version": "0.1.0",
    "project_urls": {
        "Bug Reports": "https://github.com/Naveenkumarar/markitdown-chunker/issues",
        "Homepage": "https://github.com/Naveenkumarar/markitdown-chunker",
        "Source Code": "https://github.com/Naveenkumarar/markitdown-chunker"
    },
    "split_keywords": [
        "markdown",
        " converter",
        " chunker",
        " document-processing",
        " langchain",
        " markitdown"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "e20528702b3ff63d596a6e1f5b86f5157cdb871b0ade1fd94d5c60eb6cd99a06",
                "md5": "4d1ad673dc3d9cb73a00208140b53c41",
                "sha256": "0b39bc2dfb2c398af9262f1e0382950812b399060b1c8b6d30df2a5e80760ee8"
            },
            "downloads": -1,
            "filename": "markitdown_chunker-0.1.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "4d1ad673dc3d9cb73a00208140b53c41",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 36064,
            "upload_time": "2025-10-10T01:05:20",
            "upload_time_iso_8601": "2025-10-10T01:05:20.874592Z",
            "url": "https://files.pythonhosted.org/packages/e2/05/28702b3ff63d596a6e1f5b86f5157cdb871b0ade1fd94d5c60eb6cd99a06/markitdown_chunker-0.1.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "6b382597def5d2f3728fdff57443196bc46c4d3796df9fa2b80c5019feafd2e4",
                "md5": "2f62fc2b3403ca4bf1a52425951f9c9a",
                "sha256": "ab81de683514dd282c8264cf2a6b394ed8b3e102f7900da326f7378e892c2521"
            },
            "downloads": -1,
            "filename": "markitdown_chunker-0.1.0.tar.gz",
            "has_sig": false,
            "md5_digest": "2f62fc2b3403ca4bf1a52425951f9c9a",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 42752,
            "upload_time": "2025-10-10T01:05:22",
            "upload_time_iso_8601": "2025-10-10T01:05:22.303693Z",
            "url": "https://files.pythonhosted.org/packages/6b/38/2597def5d2f3728fdff57443196bc46c4d3796df9fa2b80c5019feafd2e4/markitdown_chunker-0.1.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-10-10 01:05:22",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "Naveenkumarar",
    "github_project": "markitdown-chunker",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [
        {
            "name": "markitdown",
            "specs": [
                [
                    ">=",
                    "0.0.1"
                ]
            ]
        },
        {
            "name": "langchain",
            "specs": [
                [
                    ">=",
                    "0.1.0"
                ]
            ]
        },
        {
            "name": "langchain-text-splitters",
            "specs": [
                [
                    ">=",
                    "0.0.1"
                ]
            ]
        }
    ],
    "lcname": "markitdown-chunker"
}
