ragparser

Name	ragparser
Version	1.0.3
home_page	https://github.com/shubham7995/ragparser
Summary	A comprehensive document parser for RAG applications with support for PDF, DOCX, PPTX, XLSX, and more
upload_time	2025-07-24 19:46:15
maintainer	None
docs_url	None
author	Shubham Shinde
requires_python	>=3.8
license	MIT License Copyright (c) 2025 Shubham Shinde Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
requirements	aiofiles PyMuPDF pdfplumber python-docx python-pptx openpyxl beautifulsoup4 lxml pytesseract Pillow langchain llama-index langdetect pytest

# RAG Parser

A comprehensive Python library for parsing documents into RAG-ready format. Supports PDF, DOCX, PPTX, XLSX, HTML, Markdown, and more with intelligent chunking strategies.

## 🚀 Features

- **Universal Document Parsing**: Support for PDF, DOCX, PPTX, XLSX, HTML, MD, CSV, JSON, and images
- **Intelligent Chunking**: Multiple strategies (Fixed, Semantic, Adaptive) optimized for RAG
- **Metadata Extraction**: Rich metadata including author, creation date, structure info
- **Content Structure Preservation**: Maintains headers, tables, images, and formatting context
- **Async Support**: Full async/await support for high-performance processing
- **RAG-Optimized Output**: Ready-to-embed chunks with proper citations and context
- **Framework Integration**: Built-in adapters for LangChain and LlamaIndex
- **Extensible Architecture**: Easy to add custom parsers and chunking strategies

## 📦 Installation

### Basic Installation
```bash
pip install ragparser
```

### With Specific Format Support
```bash
# PDF support
pip install ragparser[pdf]

# Office documents (DOCX, PPTX, XLSX)
pip install ragparser[office]

# HTML parsing
pip install ragparser[html]

# OCR for images
pip install ragparser[ocr]

# All formats
pip install ragparser[all]
```
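
To see which optional dependencies are already importable in the current environment, a check along these lines may help. The mapping from extras to modules is an assumption inferred from the package's requirements list rather than from its packaging metadata; `fitz` and `PIL` are the import names of PyMuPDF and Pillow, and `missing_modules` is a hypothetical helper.

```python
from importlib.util import find_spec

# Assumed mapping from each extra to the import names it would pull in.
# This is inferred from the requirements list, not read from the package.
EXTRA_MODULES = {
    "pdf": ["fitz", "pdfplumber"],
    "office": ["docx", "pptx", "openpyxl"],
    "html": ["bs4", "lxml"],
    "ocr": ["pytesseract", "PIL"],
}


def missing_modules(extra, modules=EXTRA_MODULES):
    """Return the import names for `extra` that cannot be found."""
    return [name for name in modules[extra] if find_spec(name) is None]


for extra in EXTRA_MODULES:
    missing = missing_modules(extra)
    status = "ok" if not missing else "missing " + ", ".join(missing)
    print(f"ragparser[{extra}]: {status}")
```

Note that the OCR extra would also need the Tesseract binary itself, which `pip` does not install.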

### Development Installation
```bash
git clone https://github.com/shubham7995/ragparser.git
cd ragparser
pip install -e ".[dev]"
```

## 🎯 Quick Start

### Basic Usage

```python
from ragparser import RagParser

# Initialize parser
parser = RagParser()

# Parse a document
result = parser.parse("document.pdf")

if result.success:
    document = result.document
    print(f"Extracted {len(document.content)} characters")
    print(f"Created {len(document.chunks)} chunks")
    print(f"Found {len(document.tables)} tables")
else:
    print(f"Error: {result.error}")
```

### Advanced Configuration

```python
from ragparser import RagParser, ParserConfig, ChunkingStrategy

# Custom configuration
config = ParserConfig(
    chunking_strategy=ChunkingStrategy.SEMANTIC,
    chunk_size=1000,
    chunk_overlap=200,
    extract_tables=True,
    extract_images=True,
    clean_text=True
)

parser = RagParser(config)
result = parser.parse("complex_document.pdf")
```

### Async Processing

```python
import asyncio
from ragparser import RagParser

async def process_documents():
    parser = RagParser()
    
    # Process single document
    result = await parser.parse_async("document.pdf")
    
    # Process multiple documents concurrently
    files = ["doc1.pdf", "doc2.docx", "doc3.pptx"]
    results = await parser.parse_multiple_async(files)
    
    for result in results:
        if result.success:
            print(f"Processed: {result.document.metadata.file_name}")

asyncio.run(process_documents())
```
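
When many files are parsed at once, it can be useful to cap how many run concurrently and to keep one failure from discarding the other results. The sketch below shows that pattern with plain `asyncio`. `gather_bounded` and `fake_parse` are hypothetical helpers; `fake_parse` only stands in for a call such as `parser.parse_async`, and nothing here describes how `parse_multiple_async` works internally.

```python
import asyncio


async def gather_bounded(items, worker, limit=4):
    """Run worker(item) for every item, at most `limit` at a time.

    Returns a list of (item, value, error) tuples in input order.
    """
    semaphore = asyncio.Semaphore(limit)

    async def run_one(item):
        async with semaphore:
            try:
                return item, await worker(item), None
            except Exception as exc:
                # Record the failure instead of cancelling the other tasks.
                return item, None, exc

    return await asyncio.gather(*(run_one(item) for item in items))


async def fake_parse(path):
    # Stand-in for a real parse call; fails on one made-up extension.
    await asyncio.sleep(0)
    if path.endswith(".bad"):
        raise ValueError(f"unsupported file: {path}")
    return len(path)


outcomes = asyncio.run(
    gather_bounded(["doc1.pdf", "doc2.docx", "x.bad"], fake_parse, limit=2)
)
for path, value, error in outcomes:
    print(path, value if error is None else f"failed: {error}")
```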

### Processing from Bytes

```python
from ragparser import RagParser

parser = RagParser()

# Parse document from bytes (e.g., from web upload)
with open("document.pdf", "rb") as f:
    data = f.read()

result = parser.parse_from_bytes(data, "document.pdf")
```

## 📚 Supported Formats

| Format | Extensions | Features |
|--------|------------|----------|
| **PDF** | `.pdf` | Text, images, tables, metadata, OCR |
| **Word** | `.docx` | Text, formatting, tables, images, comments |
| **PowerPoint** | `.pptx` | Slides, speaker notes, images, tables |
| **Excel** | `.xlsx` | Sheets, formulas, charts, named ranges |
| **HTML** | `.html`, `.htm` | Structure, links, images, tables |
| **Markdown** | `.md`, `.markdown` | Headers, code blocks, tables, links |
| **Text** | `.txt` | Plain text with encoding detection |
| **CSV** | `.csv` | Structured data with header detection |
| **JSON** | `.json` | Structured data parsing |
| **Images** | `.png`, `.jpg`, `.gif`, etc. | OCR text extraction |
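
The extensions in the table can be used to pre-filter files before handing them to the parser. The following is a minimal sketch: the format labels and the `detect_format` helper are illustrative, and only the image extensions named explicitly above are listed, since the table does not spell out the rest.

```python
from pathlib import Path

# Extensions copied from the table above; the labels are illustrative.
EXTENSIONS = {
    "pdf": {".pdf"},
    "word": {".docx"},
    "powerpoint": {".pptx"},
    "excel": {".xlsx"},
    "html": {".html", ".htm"},
    "markdown": {".md", ".markdown"},
    "text": {".txt"},
    "csv": {".csv"},
    "json": {".json"},
    "image": {".png", ".jpg", ".gif"},
}


def detect_format(filename):
    """Return the format label for a filename, or None if it is not listed."""
    suffix = Path(filename).suffix.lower()
    for label, extensions in EXTENSIONS.items():
        if suffix in extensions:
            return label
    return None


for name in ["Report.PDF", "notes.markdown", "archive.zip"]:
    print(name, detect_format(name))
```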

## 🔧 Chunking Strategies

### Fixed Chunking
```python
from ragparser import ParserConfig, ChunkingStrategy

config = ParserConfig(
    chunking_strategy=ChunkingStrategy.FIXED,
    chunk_size=1000,
    chunk_overlap=200
)
```
- Splits text into fixed-size chunks
- Preserves sentence boundaries
- Configurable overlap for context
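
As an illustration of the idea, fixed-size chunking with overlap might look roughly like the sketch below. It is not ragparser's implementation: `fixed_chunks` is a hypothetical function, sizes are counted in characters, and a "sentence boundary" is approximated as `.`, `!` or `?` followed by whitespace.

```python
import re


def fixed_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Split text into chunks of at most chunk_size characters.

    A chunk ends at a sentence boundary when one exists in the second half
    of its window; the next chunk starts chunk_overlap characters earlier.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Find the last sentence end inside the current window.
            window = text[start:end]
            ends = [m.end() for m in re.finditer(r"[.!?]\s", window)]
            if ends and ends[-1] > chunk_size // 2:
                end = start + ends[-1]
        chunks.append(text[start:end].strip())
        if end == len(text):
            break
        # Step back by the overlap, but always move forward at least one char.
        start = max(end - chunk_overlap, start + 1)
    return chunks


sample = "One two three. Four five six. Seven eight nine. Ten eleven twelve."
for chunk in fixed_chunks(sample, chunk_size=30, chunk_overlap=5):
    print(repr(chunk))
```

Because the overlap is measured in characters, a chunk may begin in the middle of a word; snapping the start to a word boundary would be a natural refinement.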

### Semantic Chunking
```python
config = ParserConfig(
    chunking_strategy=ChunkingStrategy.SEMANTIC,
    chunk_size=1000
)
```
- Groups content by semantic meaning
- Respects document structure (headers, paragraphs)
- Maintains topic coherence
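
One simple way to respect document structure is to pack whole paragraphs into a chunk and never let a chunk cross a header. The sketch below does only that for Markdown-like text; it is a structural heuristic, not embedding-based semantic grouping, and `structure_chunks` is a hypothetical function rather than part of ragparser.

```python
import re


def structure_chunks(text, chunk_size=1000):
    """Group paragraphs into chunks without crossing a header boundary."""
    chunks = []
    current = []      # paragraphs collected for the chunk being built
    current_len = 0   # length of "\n\n".join(current)

    def flush():
        nonlocal current, current_len
        if current:
            chunks.append("\n\n".join(current))
        current, current_len = [], 0

    for para in re.split(r"\n\s*\n", text.strip()):
        para = para.strip()
        if not para:
            continue
        joined_len = current_len + 2 + len(para) if current else len(para)
        # A header starts a new chunk; so does a paragraph that would overflow.
        if para.startswith("#") or joined_len > chunk_size:
            flush()
            joined_len = len(para)
        current.append(para)
        current_len = joined_len
    flush()
    return chunks


doc = "# A\n\npara one\n\npara two\n\n# B\n\npara three"
for chunk in structure_chunks(doc, chunk_size=1000):
    print(repr(chunk))
```

A single paragraph longer than `chunk_size` is kept whole here; a fuller version would fall back to a sentence-level split for that case.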

### Adaptive Chunking
```python
config = ParserConfig(
    chunking_strategy=ChunkingStrategy.ADAPTIVE,
    chunk_size=1000
)
```
- Dynamically adjusts chunk size based on content
- Optimizes for embedding model context windows
- Balances size and semantic coherence
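
To make "optimizes for embedding model context windows" concrete, one possible approach is to derive a character budget from a token limit. The sketch below assumes whitespace-separated words as a rough stand-in for tokens, which real tokenizers will not match exactly; `adaptive_chunk_size` and its parameters are hypothetical and do not describe how `ChunkingStrategy.ADAPTIVE` is implemented.

```python
def adaptive_chunk_size(text, max_tokens=512, min_chars=200, max_chars=4000):
    """Estimate a chunk size in characters that fits a token budget.

    Words are used as a crude proxy for tokens, so treat the result as a
    starting point rather than a guarantee.
    """
    words = text.split()
    if not words:
        return min_chars
    chars_per_word = len(text) / len(words)
    estimate = int(max_tokens * chars_per_word)
    # Clamp so that unusual text cannot produce tiny or enormous chunks.
    return max(min_chars, min(max_chars, estimate))


print(adaptive_chunk_size("short words in a row " * 50))
print(adaptive_chunk_size("internationalization considerations " * 50))
```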

## 🔍 Content Extraction

### Text and Structure
```python
# Access extracted content
document = result.document

# Full text content
print(document.content)

# Structured content blocks
for block in document.content_blocks:
    print(f"{block.block_type}: {block.content}")

# Chunked content ready for RAG
for chunk in document.chunks:
    print(f"Chunk {chunk.chunk_id}: {len(chunk.content)} chars")
```

### Tables and Data
```python
# Extract tables
for table in document.tables:
    print(f"Table with {len(table['data'])} rows")
    headers = table.get('headers', [])
    print(f"Headers: {headers}")
```
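
For embedding or downstream filtering, it is often convenient to turn each table row into a record. The sketch below assumes that `table['data']` is a list of rows, each a list of cell values, and that `headers` is a list of column names; the snippet above only shows that these keys exist, so the exact shapes are an assumption. `table_to_records` and `row_to_text` are hypothetical helpers.

```python
def table_to_records(table):
    """Turn {'headers': [...], 'data': [[...], ...]} into a list of dicts."""
    headers = table.get("headers", [])
    rows = table.get("data", [])
    if not headers:
        # Fall back to positional names when no header row was detected.
        width = max((len(row) for row in rows), default=0)
        headers = [f"col_{i}" for i in range(width)]
    return [dict(zip(headers, row)) for row in rows]


def row_to_text(record):
    """Render one record as a single line of text, e.g. for embedding."""
    return "; ".join(f"{key}: {value}" for key, value in record.items())


example = {"headers": ["name", "qty"], "data": [["bolt", 4], ["nut", 9]]}
for record in table_to_records(example):
    print(row_to_text(record))
```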

### Metadata
```python
meta = document.metadata
print(f"Title: {meta.title}")
print(f"Author: {meta.author}")
print(f"Pages: {meta.page_count}")
print(f"Words: {meta.word_count}")
```

## 🔗 Framework Integration

### LangChain Integration
```python
from ragparser import ParserConfig, ChunkingStrategy
from ragparser.integrations.langchain_adapter import RagParserLoader

# Use as a LangChain document loader
loader = RagParserLoader("documents/")
documents = loader.load()

# With custom config
config = ParserConfig(chunking_strategy=ChunkingStrategy.SEMANTIC)
loader = RagParserLoader("documents/", config=config)
documents = loader.load()
```

### LlamaIndex Integration
```python
from ragparser.integrations.llamaindex_adapter import RagParserReader

# Use as a LlamaIndex reader
reader = RagParserReader()
documents = reader.load_data("document.pdf")
```

## ⚙️ Configuration Options

### Parser Configuration
```python
from ragparser import ParserConfig, ChunkingStrategy

config = ParserConfig(
    # Chunking settings
    chunking_strategy=ChunkingStrategy.SEMANTIC,
    chunk_size=1000,
    chunk_overlap=200,
    
    # Content extraction
    extract_tables=True,
    extract_images=True,
    extract_metadata=True,
    extract_links=True,
    
    # Text processing
    clean_text=True,
    preserve_formatting=False,
    merge_paragraphs=True,
    
    # OCR settings
    enable_ocr=True,
    ocr_language="eng",
    ocr_confidence_threshold=0.7,
    
    # Performance
    max_file_size=100 * 1024 * 1024,  # 100MB
    timeout_seconds=300,
)
```
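
Some of these settings constrain each other: an overlap that is not smaller than the chunk size, for example, would leave no room for a chunk to advance. The sketch below shows one way to check such combinations up front. `ChunkSettings` is a hypothetical stand-in that mirrors a few field names from the example above; whether `ParserConfig` performs similar validation is not stated here.

```python
from dataclasses import dataclass


@dataclass
class ChunkSettings:
    """Hypothetical subset of the settings above, with sanity checks."""

    chunk_size: int = 1000
    chunk_overlap: int = 200
    ocr_confidence_threshold: float = 0.7
    max_file_size: int = 100 * 1024 * 1024  # 100MB

    def __post_init__(self):
        if self.chunk_size <= 0:
            raise ValueError("chunk_size must be positive")
        if not 0 <= self.chunk_overlap < self.chunk_size:
            raise ValueError("chunk_overlap must be in [0, chunk_size)")
        if not 0.0 <= self.ocr_confidence_threshold <= 1.0:
            raise ValueError("ocr_confidence_threshold must be in [0, 1]")
        if self.max_file_size <= 0:
            raise ValueError("max_file_size must be positive")

    @property
    def step(self):
        """Number of new characters each chunk contributes after overlap."""
        return self.chunk_size - self.chunk_overlap


settings = ChunkSettings()
print(settings.step)
```

With the values from the example, each chunk of 1000 characters adds 800 new characters, so overlap increases the number of chunks by roughly a quarter.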

### Runtime Configuration Updates
```python
parser = RagParser()

# Update specific settings
parser.update_config(
    chunk_size=1500,
    extract_tables=False
)

# Add custom settings
parser.update_config(
    custom_ocr_model="my_model",
    special_processing=True
)
```

## 🚀 Performance Features

- **Async Processing**: Non-blocking document processing
- **Concurrent Parsing**: Process multiple documents simultaneously
- **Memory Efficient**: Streaming for large files
- **Caching**: Avoid reprocessing identical content
- **Lazy Loading**: Only load parsers for formats you use
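
Caching by content is commonly done by hashing the raw bytes and reusing the stored result when the same hash is seen again. The sketch below illustrates that idea with an in-memory dictionary; `ContentCache` is a hypothetical class and says nothing about how ragparser's own cache is implemented.

```python
import hashlib


class ContentCache:
    """In-memory cache keyed by the SHA-256 digest of the input bytes."""

    def __init__(self):
        self._store = {}

    def get_or_compute(self, data, compute):
        """Return the cached result for `data`, computing it on first use."""
        key = hashlib.sha256(data).hexdigest()
        if key not in self._store:
            self._store[key] = compute(data)
        return self._store[key]


cache = ContentCache()
payload = b"same bytes, different filename"
first = cache.get_or_compute(payload, lambda data: len(data))
second = cache.get_or_compute(payload, lambda data: len(data))  # cache hit
print(first, second)
```

Keying on content rather than on the file path means that renamed or re-uploaded copies of the same file are still recognized.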

## 📊 Monitoring and Quality

### Processing Statistics
```python
result = parser.parse("document.pdf")

stats = result.processing_stats
print(f"Processing time: {stats['processing_time']:.2f}s")
print(f"File size: {stats['file_size']} bytes")
print(f"Chunks created: {stats['chunk_count']}")
```
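
When processing a batch, the per-document statistics can be rolled up into a summary. The sketch below uses the three keys shown above (`processing_time`, `file_size`, `chunk_count`) on plain dictionaries with made-up values; `summarize_stats` is a hypothetical helper.

```python
def summarize_stats(stats_list):
    """Aggregate per-document stats dictionaries into one summary."""
    total_time = sum(s["processing_time"] for s in stats_list)
    total_bytes = sum(s["file_size"] for s in stats_list)
    total_chunks = sum(s["chunk_count"] for s in stats_list)
    return {
        "documents": len(stats_list),
        "total_time": total_time,
        "total_bytes": total_bytes,
        "total_chunks": total_chunks,
        # Guard against division by zero for empty or instant batches.
        "bytes_per_second": total_bytes / total_time if total_time > 0 else 0.0,
    }


# Illustrative values, not measurements.
batch = [
    {"processing_time": 0.5, "file_size": 1000, "chunk_count": 4},
    {"processing_time": 1.5, "file_size": 3000, "chunk_count": 6},
]
print(summarize_stats(batch))
```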

### Quality Metrics
```python
document = result.document

# Content quality indicators
print(f"Quality score: {document.quality_score}")
print(f"Extraction notes: {document.extraction_notes}")

# Chunk quality
for chunk in document.chunks:
    print(f"Chunk tokens: {chunk.token_count}")
    print(f"Embedding ready: {chunk.embedding_ready}")
```

## 🧪 Testing

```bash
# Run all tests
pytest

# Run with coverage
pytest --cov=ragparser

# Run only fast tests
pytest -m "not slow"

# Run integration tests
pytest -m integration
```

## 🤝 Contributing

Contributions are welcome! Please see our [Contributing Guide](CONTRIBUTING.md) for details.

### Development Setup
```bash
# Clone repository
git clone https://github.com/shubham7995/ragparser.git
cd ragparser

# Install in development mode
pip install -e ".[dev]"

# Install pre-commit hooks
pre-commit install

# Run tests
pytest
```

### Adding New Parsers
```python
from ragparser.parsers.base import BaseParser
from ragparser.core.models import ParsedDocument, FileType

class MyCustomParser(BaseParser):
    def __init__(self):
        super().__init__()
        self.supported_formats = [FileType.CUSTOM]
    
    async def parse_async(self, file_path, config):
        # Implement parsing logic
        return ParsedDocument(...)
```
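
How a custom parser gets selected for a file is not shown above. One common pattern is a registry keyed by file extension, sketched below without any ragparser imports. Every name in it (`SketchParser`, `REGISTRY`, `register`, `parser_for`, the `.up` extension) is hypothetical, and the real `BaseParser` may be wired up differently.

```python
from abc import ABC, abstractmethod
from pathlib import Path

REGISTRY = {}  # maps a lowercase extension to a parser class


class SketchParser(ABC):
    """Hypothetical base class; not ragparser's BaseParser."""

    extensions = ()

    @abstractmethod
    async def parse_async(self, file_path, config=None):
        """Return the parsed representation of file_path."""


def register(parser_cls):
    """Class decorator that records a parser under each of its extensions."""
    for ext in parser_cls.extensions:
        REGISTRY[ext.lower()] = parser_cls
    return parser_cls


@register
class UpperTextParser(SketchParser):
    extensions = (".up",)

    async def parse_async(self, file_path, config=None):
        # Blocking read kept for brevity; a real parser might use aiofiles.
        text = Path(file_path).read_text(encoding="utf-8")
        return {"content": text.upper(), "source": str(file_path)}


def parser_for(file_path):
    """Instantiate the parser registered for the file's extension."""
    ext = Path(file_path).suffix.lower()
    parser_cls = REGISTRY.get(ext)
    if parser_cls is None:
        raise ValueError(f"no parser registered for {ext!r}")
    return parser_cls()


print(type(parser_for("notes.UP")).__name__)
```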

## 📝 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🔗 Links

- **GitHub**: https://github.com/shubham7995/ragparser
- **PyPI**: https://pypi.org/project/ragparser/
- **Documentation**: https://ragparser.readthedocs.io/
- **Issues**: https://github.com/shubham7995/ragparser/issues

## 🏷️ Keywords

`RAG`, `document parsing`, `PDF`, `DOCX`, `PPTX`, `XLSX`, `chunking`, `embedding`, `LangChain`, `LlamaIndex`, `async`, `OCR`, `metadata extraction`

---

Built with ❤️ for the RAG and LLM community

Raw data

{
    "_id": null,
    "home_page": "https://github.com/shubham7995/ragparser",
    "name": "ragparser",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": null,
    "author": "Shubham Shinde",
    "author_email": "Shubham Shinde <shubhamshinde7995@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/cb/50/2d8d5cbefc71f46ebdb277b22beb17795ef8d3c21f5e9061cde1518cff2c/ragparser-1.0.3.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT License\n        \n        Copyright (c) 2025 Shubham Shinde\n        \n        Permission is hereby granted, free of charge, to any person obtaining a copy\n        of this software and associated documentation files (the \"Software\"), to deal\n        in the Software without restriction, including without limitation the rights\n        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell\n        copies of the Software, and to permit persons to whom the Software is\n        furnished to do so, subject to the following conditions:\n        \n        The above copyright notice and this permission notice shall be included in all\n        copies or substantial portions of the Software.\n        \n        THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\n        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\n        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\n        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\n        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\n        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\n        SOFTWARE.\n        ",
    "summary": "A comprehensive document parser for RAG applications with support for PDF, DOCX, PPTX, XLSX, and more",
    "version": "1.0.3",
    "project_urls": {
        "Bug Tracker": "https://github.com/shubham7995/ragparser/issues",
        "Documentation": "https://github.com/shubham7995/ragparser#readme",
        "Homepage": "https://github.com/shubham7995/ragparser",
        "Repository": "https://github.com/shubham7995/ragparser"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "32d6150864b283c951893ca5388d4536b257ed27eab268498fc6787e2054f4ec",
                "md5": "97e35300600ee298b1ccaa89f6b9de61",
                "sha256": "17bade3ddf4a4917547b08385189c6ab3479e363c4023e82d4c266135dcb9701"
            },
            "downloads": -1,
            "filename": "ragparser-1.0.3-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "97e35300600ee298b1ccaa89f6b9de61",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 60675,
            "upload_time": "2025-07-24T19:46:14",
            "upload_time_iso_8601": "2025-07-24T19:46:14.355910Z",
            "url": "https://files.pythonhosted.org/packages/32/d6/150864b283c951893ca5388d4536b257ed27eab268498fc6787e2054f4ec/ragparser-1.0.3-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "cb502d8d5cbefc71f46ebdb277b22beb17795ef8d3c21f5e9061cde1518cff2c",
                "md5": "eaa92bbc76636704cdd66d0f83aec8d1",
                "sha256": "a566e2528de7248bcd9d4e8572efd42b1cc365e9001b2b721637510c4aee5c4d"
            },
            "downloads": -1,
            "filename": "ragparser-1.0.3.tar.gz",
            "has_sig": false,
            "md5_digest": "eaa92bbc76636704cdd66d0f83aec8d1",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 57301,
            "upload_time": "2025-07-24T19:46:15",
            "upload_time_iso_8601": "2025-07-24T19:46:15.927941Z",
            "url": "https://files.pythonhosted.org/packages/cb/50/2d8d5cbefc71f46ebdb277b22beb17795ef8d3c21f5e9061cde1518cff2c/ragparser-1.0.3.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-07-24 19:46:15",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "shubham7995",
    "github_project": "ragparser",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [
        {
            "name": "aiofiles",
            "specs": [
                [
                    ">=",
                    "0.8.0"
                ]
            ]
        },
        {
            "name": "PyMuPDF",
            "specs": [
                [
                    ">=",
                    "1.23.0"
                ]
            ]
        },
        {
            "name": "pdfplumber",
            "specs": [
                [
                    ">=",
                    "0.9.0"
                ]
            ]
        },
        {
            "name": "python-docx",
            "specs": [
                [
                    ">=",
                    "0.8.11"
                ]
            ]
        },
        {
            "name": "python-pptx",
            "specs": [
                [
                    ">=",
                    "0.6.21"
                ]
            ]
        },
        {
            "name": "openpyxl",
            "specs": [
                [
                    ">=",
                    "3.1.0"
                ]
            ]
        },
        {
            "name": "beautifulsoup4",
            "specs": [
                [
                    ">=",
                    "4.11.0"
                ]
            ]
        },
        {
            "name": "lxml",
            "specs": [
                [
                    ">=",
                    "4.9.0"
                ]
            ]
        },
        {
            "name": "pytesseract",
            "specs": [
                [
                    ">=",
                    "0.3.10"
                ]
            ]
        },
        {
            "name": "Pillow",
            "specs": [
                [
                    ">=",
                    "9.0.0"
                ]
            ]
        },
        {
            "name": "langchain",
            "specs": []
        },
        {
            "name": "llama-index",
            "specs": []
        },
        {
            "name": "langdetect",
            "specs": []
        },
        {
            "name": "pytest",
            "specs": []
        }
    ],
    "lcname": "ragparser"
}
