Name | Version | Summary | Date |
pdf2markdown | 0.2.0 | Python library and CLI tool that leverages LLMs to convert technical PDF documents to well-structured Markdown | 2025-08-17 20:03:08 |
kreuzberg | 3.11.2 | Document intelligence framework for Python - Extract text, metadata, and structured data from diverse file formats | 2025-08-15 13:51:46 |
contextgem | 0.15.0 | Effortless LLM extraction from documents | 2025-08-13 22:25:52 |
inkognito | 0.1.0 | Privacy-first document processing FastMCP server with PII anonymization | 2025-08-13 17:45:52 |
ocr-detection | 0.1.2 | A Python library to detect whether PDF pages contain extractable text or are scanned images requiring OCR | 2025-08-13 04:29:13 |
qdrant-loader | 0.6.0 | A tool for collecting and vectorizing technical content from multiple sources and storing it in a QDrant vector database. | 2025-08-12 09:20:21 |
xml-analysis-framework | 1.4.4 | XML document analysis and preprocessing framework designed for AI/ML data pipelines | 2025-08-12 04:21:41 |
raggy | 0.3.5 | scraping stuff | 2025-08-11 14:49:05 |
docstrange | 1.1.3 | Extract and Convert PDF, Word, PowerPoint, Excel, images, URLs into multiple formats (Markdown, JSON, CSV, HTML) with intelligent content extraction and advanced OCR. | 2025-08-11 07:10:23 |
docling-analysis-framework | 1.1.0 | AI-ready analysis framework for PDF and Office documents using Docling for content extraction | 2025-07-29 14:34:10 |
document-data-extractor | 1.0.4 | Best open-source document to markdown extractor for LLM training data. Convert PDF, Word, PowerPoint, Excel, images, URLs to clean markdown, JSON, HTML locally. Alternative to Unstructured, Docling, Marker, MarkItDown, MinerU, PaddleOCR, Tesseract | 2025-07-29 08:25:56 |
aikitx | 1.0.0 | A comprehensive GUI toolkit for Large Language Models (LLMs) with GGUF support, document processing, email automation, and multi-backend inference | 2025-07-25 19:44:31 |
llm-data-converter | 2.2.0 | Best open-source document to markdown converter for LLM training data. Convert PDF, Word, PowerPoint, Excel, images, URLs to clean markdown, JSON, HTML locally. Alternative to Unstructured, Docling, Marker, MarkItDown, MinerU, PaddleOCR, Tesseract | 2025-07-25 13:32:07 |
llm-text-splitter | 0.2.0 | A lightweight, rule-based text splitter for LLM context window management, handles multiple file formats and enriches chunks with metadata. | 2025-07-24 12:21:01 |
mseep-kreuzberg | 3.8.2 | Document intelligence framework for Python - Extract text, metadata, and structured data from diverse file formats | 2025-07-17 03:32:28 |
pdf-splitter-cli | 0.1.1 | A modern command-line tool to split PDF files into smaller chunks with progress bars and automatic filename generation | 2025-07-17 01:37:12 |
pdf-ocr-processor | 2.0.3 | Advanced PDF OCR processing with AI-powered text extraction and selectable text overlays | 2025-07-11 21:11:24 |
ai-chunking | 0.1.4 | A powerful Python library for semantic document chunking and enrichment using AI | 2025-03-16 20:44:19 |
atai-pdf-tool | 0.1.0 | A tool for parsing and extracting text from PDF files with OCR capabilities | 2025-02-27 11:15:46 |
smart-llm-loader | 0.1.0 | A powerful PDF processing toolkit that seamlessly integrates with LLMs for intelligent document chunking and RAG applications. Features smart context-aware segmentation, multi-LLM support, and optimized content extraction for enhanced RAG performance. | 2025-02-14 12:42:55 |