# MarkitDown Chunker
A Python package that converts documents to Markdown, splits them into structure-aware chunks, and exports the chunks with metadata as JSON. It is built as an add-on to the [markitdown](https://github.com/microsoft/markitdown) package and uses LangChain text splitters for chunking.
## ✨ Features
- 📄 **Multi-format Support**: Convert PDF, DOCX, PPTX, XLSX, HTML, RTF, ODT, and more to markdown
- 🖼️ **Image Extraction**: Automatically extract images from PDF, DOCX, PPTX files ([requires optional dependencies](docs/IMAGE_EXTRACTION.md))
- 🎨 **Image Summarization**: Optional AI-powered image descriptions for better context
- ✂️ **Smart Chunking**: Markdown-aware text splitting that respects document structure
- 📊 **Structured Export**: Export chunks with metadata to JSON format
- 🔧 **Flexible Pipeline**: Run individual steps or complete pipeline as needed
- 🎯 **CLI & Python API**: Use from command line or integrate into your Python applications
## 📦 Installation
### Basic Installation
```bash
pip install markitdown-chunker
```
### With Image Extraction Support
To extract images from PDF, DOCX, and PPTX files:
```bash
pip install "markitdown-chunker[images]"
```
See [Image Extraction Guide](docs/IMAGE_EXTRACTION.md) for details.
### From Source
```bash
git clone https://github.com/Naveenkumarar/markitdown-chunker.git
cd markitdown-chunker
pip install -e .
# Or with image support:
pip install -e ".[images]"
```
## 🚀 Quick Start
### Command Line Interface
```bash
# Convert, chunk, and export (full pipeline)
markitdown-chunker input.pdf output/
# Convert only
markitdown-chunker document.docx output/ --convert-only
# Chunk existing markdown
markitdown-chunker document.md output/ --chunk-only
# Custom chunk size and overlap
markitdown-chunker input.pdf output/ --chunk-size 2000 --overlap 400
# List supported formats
markitdown-chunker --list-formats
```
### Python API
#### Complete Pipeline
```python
from markitdown_chunker import MarkitDownProcessor

# Initialize processor with custom settings
processor = MarkitDownProcessor(
    chunk_size=1000,
    chunk_overlap=200,
    use_markdown_splitter=True
)

# Process a document (all steps)
result = processor.process(
    file_path="document.pdf",
    output_dir="output/"
)

print(f"Markdown saved to: {result['conversion']['markdown_path']}")
print(f"Created {len(result['chunking']['chunks'])} chunks")
print(f"JSON exported to: {result['export']['json_path']}")
```
#### Step-by-Step Processing
```python
from markitdown_chunker import MarkdownConverter, DocumentChunker, JSONExporter

# Step 1: Convert to Markdown
converter = MarkdownConverter()
conversion_result = converter.convert(
    file_path="document.pdf",
    output_dir="output/",
    save_images=True
)

# Step 2: Chunk the markdown
chunker = DocumentChunker(
    chunk_size=1000,
    chunk_overlap=200
)
chunks = chunker.chunk_file(
    markdown_path=conversion_result['markdown_path']
)

# Step 3: Export to JSON
exporter = JSONExporter()
json_path = exporter.export(
    chunks=chunks,
    output_path="output/chunks.json"
)
```
## 📚 Supported File Formats
- **Documents**: PDF, DOCX, DOC, RTF, ODT, TXT, MD
- **Presentations**: PPTX, PPT, ODP
- **Spreadsheets**: XLSX, XLS, ODS
- **Web**: HTML, HTM
> **Note**: Audio/video files (MP3, MP4, etc.) require ffmpeg. See [docs/FFMPEG_AUDIO.md](docs/FFMPEG_AUDIO.md) for details.
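Before handing a mixed folder to the pipeline, it can help to separate files whose extensions appear in the list above from those that do not. The sketch below is a hypothetical helper, not part of the package: the extension set is copied from this README, and the real converter may accept more (run `markitdown-chunker --list-formats` for the authoritative list).

```python
from pathlib import Path

# Extensions copied from the format list in this README (an assumption:
# the installed package may support additional formats).
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".doc", ".rtf", ".odt", ".txt", ".md",  # documents
    ".pptx", ".ppt", ".odp",                                  # presentations
    ".xlsx", ".xls", ".ods",                                  # spreadsheets
    ".html", ".htm",                                          # web
}


def is_supported(path: str) -> bool:
    """Return True if the file extension appears in SUPPORTED_EXTENSIONS."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


def split_by_support(paths):
    """Partition paths into (supported, unsupported), preserving order."""
    supported = [p for p in paths if is_supported(p)]
    unsupported = [p for p in paths if not is_supported(p)]
    return supported, unsupported
```

The check is case-insensitive, so `report.PDF` is treated the same as `report.pdf`.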
## 📂 Output Directory Structure
After processing a document, the output directory will contain:
```
output/
├── document.md              # Converted markdown file
├── document_chunks.json     # Chunks with metadata and statistics
└── images/                  # Extracted images (if any)
    ├── page1_img1.png
    ├── page2_img1.jpg
    ├── page3_img1.png
    └── page3_img2.jpg
```
### Example Output Files
**`document.md`** - Markdown conversion with image references:
```markdown
# Document Title
Document content converted to markdown format...
## Extracted Images


```
**`document_chunks.json`** - Structured chunk data:
```json
{
  "source_info": {
    "source_file": "document.pdf",
    "markdown_file": "output/document.md",
    "images_dir": "output/images"
  },
  "chunks": [
    {
      "text": "Document content chunk...",
      "metadata": {
        "Header 1": "Introduction",
        "chunk_index": 0,
        "source_file": "output/document.md"
      }
    }
  ],
  "total_chunks": 42,
  "statistics": {
    "total_characters": 48392,
    "avg_chunk_size": 1152.19,
    "min_chunk_size": 234,
    "max_chunk_size": 1000
  },
  "exported_at": "2025-10-10T10:30:45.123456"
}
```
**`images/`** - Extracted images with organized naming:
- PDF images: `page{N}_img{M}.{ext}` (e.g., `page1_img1.png`)
- DOCX images: `docx_img{N}.{ext}` (e.g., `docx_img1.jpg`)
- PPTX images: `slide{N}_img{M}.{ext}` (e.g., `slide1_img1.png`)
> 💡 **Tip**: The images directory is only created if the document contains images and `save_images=True` (default).
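The naming patterns above make it possible to recover the page or slide number from a file name. The following is a minimal sketch using only the standard library; `parse_image_name` and the `container` key are illustrative names, not part of the package, and the regular expressions assume the three patterns listed above are exhaustive.

```python
import re
from typing import Optional

# One pattern per naming scheme documented above.
_PATTERNS = {
    "pdf": re.compile(r"^page(?P<container>\d+)_img(?P<index>\d+)\.(?P<ext>\w+)$"),
    "docx": re.compile(r"^docx_img(?P<index>\d+)\.(?P<ext>\w+)$"),
    "pptx": re.compile(r"^slide(?P<container>\d+)_img(?P<index>\d+)\.(?P<ext>\w+)$"),
}


def parse_image_name(filename: str) -> Optional[dict]:
    """Return source type, page/slide number, image index and extension.

    Returns None when the name matches none of the documented schemes.
    """
    for source, pattern in _PATTERNS.items():
        match = pattern.match(filename)
        if match:
            groups = match.groupdict()
            return {
                "source": source,
                # DOCX names carry no page or slide number.
                "container": int(groups["container"]) if "container" in groups else None,
                "index": int(groups["index"]),
                "ext": groups["ext"],
            }
    return None
```

This can be used, for example, to group extracted images by page before attaching them to chunks.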
## 🎛️ Configuration Options
### Chunking Parameters
```python
from markitdown_chunker import MarkitDownProcessor

processor = MarkitDownProcessor(
    chunk_size=1000,             # Maximum characters per chunk
    chunk_overlap=200,           # Overlap between consecutive chunks
    use_markdown_splitter=True,  # Use markdown-aware splitting
    json_indent=2                # Indentation of the exported JSON
)
```
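To build intuition for how `chunk_size` and `chunk_overlap` interact, here is a simplified fixed-offset splitter. It is an illustration only: the package delegates to LangChain text splitters, which prefer to break on separators such as paragraphs and headers, so real chunk boundaries will differ. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 200):
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks
```

With the defaults, each window advances by 800 characters, so a larger overlap produces more chunks for the same input.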
### Processing Options
```python
result = processor.process(
    file_path="input.pdf",
    output_dir="output/",
    save_images=True,                     # Save extracted images
    include_image_summaries=False,        # Set True to add image summaries to chunks
    image_summarizer=my_summarizer_func,  # Custom image summarizer (your own callable)
    skip_conversion=False,                # Set True if the input is already markdown
    skip_chunking=False,                  # Set True to convert only
    skip_export=False                     # Set True to skip the JSON export
)
```
## 🔬 Advanced Usage
### Custom Image Summarization
```python
from markitdown_chunker import MarkitDownProcessor

def summarize_image(image_path: str) -> str:
    """Your custom image summarization logic."""
    # Example: call a vision model (my_vision_model is a placeholder module)
    from my_vision_model import analyze_image
    return analyze_image(image_path)

processor = MarkitDownProcessor()
result = processor.process(
    file_path="document.pdf",
    output_dir="output/",
    include_image_summaries=True,
    image_summarizer=summarize_image
)
```
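If no vision model is available yet, a placeholder summarizer with the same `(image_path: str) -> str` shape can be used to test the pipeline end to end. The sketch below relies only on the standard library and describes the file, not its visual content; `describe_image_file` is a hypothetical name.

```python
from pathlib import Path


def describe_image_file(image_path: str) -> str:
    """Return a one-line description built from file metadata only."""
    path = Path(image_path)
    if not path.is_file():
        return f"Image {path.name} (file not found)"
    size_kb = path.stat().st_size / 1024
    # Derive a label such as "PNG" from the extension; no image decoding happens.
    kind = path.suffix.lstrip(".").upper() or "unknown"
    return f"Image {path.name}: {kind} file, {size_kb:.1f} KB"
```

It can then be passed as `image_summarizer=describe_image_file` together with `include_image_summaries=True`.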
### Batch Processing
```python
from markitdown_chunker import MarkitDownProcessor

processor = MarkitDownProcessor()

files = ["doc1.pdf", "doc2.docx", "doc3.pptx"]
results = processor.process_batch(
    file_paths=files,
    output_dir="output/"
)

for result in results:
    if "error" in result:
        print(f"Failed: {result['input_file']} - {result['error']}")
    else:
        print(f"Success: {result['input_file']}")
```
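For larger batches, it may be more convenient to collect the outcomes than to print them one by one. This sketch assumes only what the loop above shows: each result is a dictionary with an `input_file` key, plus an `error` key when processing failed. `summarize_batch` is a hypothetical helper.

```python
def summarize_batch(results):
    """Split batch results into successful files and a file-to-error mapping."""
    succeeded = [r["input_file"] for r in results if "error" not in r]
    failed = {r["input_file"]: r["error"] for r in results if "error" in r}
    return succeeded, failed
```

The `failed` mapping can be logged or used to retry only the files that did not process.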
### Individual Step Processing
```python
from markitdown_chunker import MarkitDownProcessor

processor = MarkitDownProcessor()

# Only convert to markdown
conversion = processor.convert_only(
    file_path="document.pdf",
    output_dir="output/"
)

# Only chunk existing markdown
chunks = processor.chunk_only(
    markdown_path="document.md"
)

# Only export chunks
processor.export_only(
    chunks=chunks,
    output_path="output/chunks.json",
    source_info={"source": "document.md"}
)
```
### Custom Markdown Header Splitting
```python
from markitdown_chunker import DocumentChunker
chunker = DocumentChunker(
    chunk_size=1000,
    chunk_overlap=200,
    use_markdown_splitter=True,
    headers_to_split_on=[
        ("#", "Title"),
        ("##", "Section"),
        ("###", "Subsection"),
        ("####", "Paragraph")
    ]
)
chunks = chunker.chunk_file("document.md")
```
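The `(marker, name)` pairs in `headers_to_split_on` decide which heading levels start a new section and under which metadata key the heading text is stored. The pure-Python sketch below illustrates that idea; it is a simplification and not LangChain's implementation (for instance, it does not skip `#` lines inside fenced code blocks), and `split_on_headers` is a hypothetical name.

```python
def split_on_headers(markdown: str, headers_to_split_on):
    """Group lines under their enclosing headers.

    Returns a list of {"text": ..., "metadata": {name: heading_text}} dicts.
    """
    # Check longer markers first so "##" is never mistaken for "#".
    markers = sorted(headers_to_split_on, key=lambda pair: len(pair[0]), reverse=True)
    sections, current_lines, active = [], [], {}  # active: level -> (name, title)

    def flush():
        text = "\n".join(current_lines).strip()
        if text:
            metadata = {name: title for _, (name, title) in sorted(active.items())}
            sections.append({"text": text, "metadata": metadata})
        current_lines.clear()

    for line in markdown.splitlines():
        for marker, name in markers:
            if line.startswith(marker + " "):
                flush()
                level = len(marker)
                # A new header closes every header at the same or a deeper level.
                for deeper in [lvl for lvl in active if lvl >= level]:
                    del active[deeper]
                active[level] = (name, line[len(marker):].strip())
                break
        else:
            current_lines.append(line)  # ordinary content line
    flush()
    return sections
```

Each section's metadata carries the full heading path, which is how a chunk deep in a document can still report its top-level title.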
## 📤 Output Format
### JSON Structure
```json
{
  "source_info": {
    "source_file": "document.pdf",
    "markdown_file": "output/document.md",
    "output_dir": "output/",
    "images_dir": "output/images"
  },
  "chunks": [
    {
      "text": "Chunk content here...",
      "metadata": {
        "Header 1": "Introduction",
        "Header 2": "Overview",
        "sub_chunk_index": 0,
        "total_sub_chunks": 1,
        "source_file": "output/document.md",
        "chunk_size_config": 1000,
        "chunk_overlap_config": 200
      }
    }
  ],
  "total_chunks": 42,
  "statistics": {
    "total_characters": 48392,
    "avg_chunk_size": 1152.19,
    "min_chunk_size": 234,
    "max_chunk_size": 1000
  },
  "exported_at": "2025-10-09T10:30:45.123456"
}
```
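An exported file with this structure can be consumed with the standard `json` module. The sketch below loads the `chunks` list and recomputes the size statistics from each chunk's `text` field. It assumes the statistics are measured in characters of `text`; the exporter may compute them differently, and both function names are hypothetical.

```python
import json


def load_chunks(json_path):
    """Read an exported file and return its list of chunk dictionaries."""
    with open(json_path, encoding="utf-8") as handle:
        return json.load(handle)["chunks"]


def chunk_statistics(chunks):
    """Recompute size statistics from the "text" field of each chunk."""
    sizes = [len(chunk["text"]) for chunk in chunks]
    if not sizes:
        return {
            "total_characters": 0,
            "avg_chunk_size": 0.0,
            "min_chunk_size": 0,
            "max_chunk_size": 0,
        }
    return {
        "total_characters": sum(sizes),
        "avg_chunk_size": round(sum(sizes) / len(sizes), 2),
        "min_chunk_size": min(sizes),
        "max_chunk_size": max(sizes),
    }
```

For example, `chunk_statistics(load_chunks("output/document_chunks.json"))` gives a quick sanity check on an export before loading it into a downstream system.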
## 🛠️ CLI Reference
```bash
markitdown-chunker [-h] [--convert-only | --chunk-only | --no-export]
                   [--chunk-size CHUNK_SIZE] [--overlap OVERLAP]
                   [--no-markdown-splitter] [--no-images]
                   [--include-image-summaries] [--json-indent JSON_INDENT]
                   [--list-formats] [--version] [-v]
                   input output

Positional Arguments:
  input                      Input file path
  output                     Output directory

Optional Arguments:
  -h, --help                 Show help message
  --convert-only             Only convert to markdown
  --chunk-only               Only chunk existing markdown
  --no-export                Skip JSON export
  --chunk-size SIZE          Maximum chunk size (default: 1000)
  --overlap SIZE             Chunk overlap (default: 200)
  --no-markdown-splitter     Disable markdown-aware splitting
  --no-images                Don't save extracted images
  --include-image-summaries  Add image summaries to chunks
  --json-indent N            JSON indentation (default: 2)
  --list-formats             List supported formats
  --version                  Show version
  -v, --verbose              Enable verbose output
```
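When the CLI has to be driven from a script, the flags above can be assembled into an argument list and passed to `subprocess`. This is a sketch under assumptions: `build_command` and `run_cli` are hypothetical helpers, only flags documented above are used, and it is assumed (not confirmed by this README) that the CLI exits with a non-zero code on failure.

```python
import subprocess


def build_command(input_path, output_dir, chunk_size=None, overlap=None, convert_only=False):
    """Assemble the argument list for the markitdown-chunker CLI."""
    command = ["markitdown-chunker", input_path, output_dir]
    if convert_only:
        command.append("--convert-only")
    if chunk_size is not None:
        command += ["--chunk-size", str(chunk_size)]
    if overlap is not None:
        command += ["--overlap", str(overlap)]
    return command


def run_cli(*args, **kwargs):
    """Run the CLI; raises CalledProcessError if it exits with a non-zero code."""
    return subprocess.run(build_command(*args, **kwargs), check=True)
```

Passing a list instead of a shell string avoids quoting problems with file names that contain spaces.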
## 🧪 Development
### Setup Development Environment
```bash
git clone https://github.com/yourusername/markitdown-chunker.git
cd markitdown-chunker
pip install -e ".[dev]"
```
### Run Tests
```bash
pytest tests/
pytest --cov=markitdown_chunker tests/
```
### Code Formatting
```bash
black markitdown_chunker/
flake8 markitdown_chunker/
mypy markitdown_chunker/
```
## 🤝 Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
1. Fork the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request
## 📄 License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## 🙏 Acknowledgments
- Built on top of [markitdown](https://github.com/microsoft/markitdown) by Microsoft
- Uses [LangChain](https://github.com/langchain-ai/langchain) text splitters for intelligent chunking
## 📞 Support
- 🐛 [Report a bug](https://github.com/Naveenkumarar/markitdown-chunker/issues)
- 💡 [Request a feature](https://github.com/Naveenkumarar/markitdown-chunker/issues)
- 📖 [Documentation](https://github.com/Naveenkumarar/markitdown-chunker)
- 🖼️ [Image Extraction Guide](docs/IMAGE_EXTRACTION.md)
- 🎵 [Audio/Video Processing Guide](docs/FFMPEG_AUDIO.md)
## 🗺️ Roadmap
- [ ] Support for more document formats
- [ ] Advanced chunking strategies (semantic, sentence-based)
- [ ] Integration with vector databases
- [ ] Web UI for document processing
- [ ] Cloud storage integration (S3, GCS, Azure)
- [ ] Parallel batch processing
- [ ] Custom output formats (CSV, Parquet, etc.)
---
Made with ❤️ by the MarkitDown Chunker community