ceylon-rag

Name: ceylon-rag
Version: 0.3.1
Author: dewma
Requires-Python: <4.0,>=3.12
Uploaded: 2025-01-14 16:11:41
Requirements: none recorded
# Ceylon AI RAG Framework

A powerful, modular, and extensible Retrieval-Augmented Generation (RAG) framework built with Python, supporting multiple LLM providers, embedders, and document types.

## 🌟 Features

- **Multiple Document Types**: Support for various document formats including:
  - Text files (with extensive format support)
  - PDF documents
  - Images (with OCR capabilities)
  - Source code files

- **Flexible Architecture**:
  - Modular component design
  - Pluggable LLM providers (OpenAI, Ollama)
  - Extensible embedding providers
  - Vector store integration (LanceDB)

- **Advanced RAG Capabilities**:
  - Intelligent document chunking
  - Context-aware searching
  - Query expansion and reranking
  - Metadata enrichment
  - Source attribution

- **Specialized RAG Implementations**:
  - `FolderRAG`: Process and analyze entire directory structures
  - `CodeAnalysisRAG`: Specialized for source code understanding
  - `SimpleRAG`: Basic RAG implementation for text data
  - Support for custom RAG implementations

## 🚀 Getting Started

### Installation

```bash
# Install via pip
pip install ceylon-rag

# Or install from source
git clone https://github.com/ceylonai/ceylon-rag.git
cd ceylon-rag
pip install -e .
```

### Basic Usage

Here's a simple example using the framework:

```python
import asyncio
import os

from dotenv import load_dotenv
from ceylon_rag import SimpleRAG

async def main():
    # Load environment variables
    load_dotenv()

    # Configure the RAG system
    config = {
        "llm": {
            "type": "openai",
            "model_name": "gpt-4",
            "api_key": os.getenv("OPENAI_API_KEY")
        },
        "embedder": {
            "type": "openai",
            "model_name": "text-embedding-3-small",
            "api_key": os.getenv("OPENAI_API_KEY")
        },
        "vector_store": {
            "type": "lancedb",
            "db_path": "./data/lancedb",
            "table_name": "documents"
        }
    }

    # Initialize RAG
    rag = SimpleRAG(config)
    await rag.initialize()

    try:
        # Process your documents
        documents = await rag.process_documents("path/to/documents")
        
        # Query the system
        result = await rag.query("What are the main topics in these documents?")
        print(result.response)
        
    finally:
        await rag.close()

if __name__ == "__main__":
    asyncio.run(main())
```
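The call to `load_dotenv()` reads variables from a `.env` file in the working directory, so the example expects a file like the following (the key value shown is a placeholder, not a real credential):

```bash
OPENAI_API_KEY=sk-...
```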

## 🏗️ Architecture

### Core Components

1. **Document Loaders**
   - `TextLoader`: Handles text-based files
   - `PDFLoader`: Processes PDF documents
   - `ImageLoader`: Handles images with OCR
   - Extensible base class for custom loaders

2. **Embedders**
   - OpenAI embeddings support
   - Ollama embeddings support
   - Modular design for adding new providers

3. **LLM Providers**
   - OpenAI integration
   - Ollama integration
   - Async interface for all providers

4. **Vector Store**
   - LanceDB integration
   - Efficient vector similarity search
   - Metadata storage and retrieval

### Document Processing

The framework provides sophisticated document processing capabilities:

```python
# Example: Processing a code repository
from ceylon_rag import CodeAnalysisRAG

async def analyze_codebase():
    config = {
        "llm": {
            "type": "openai",
            "model_name": "gpt-4"
        },
        "embedder": {
            "type": "openai",
            "model_name": "text-embedding-3-small"
        },
        "vector_store": {
            "type": "lancedb",
            "db_path": "./data/lancedb",
            "table_name": "code_documents"
        },
        "chunk_size": 1000,
        "chunk_overlap": 200
    }

    rag = CodeAnalysisRAG(config)
    await rag.initialize()

    try:
        documents = await rag.process_codebase("./src")
        await rag.index_code(documents)

        result = await rag.analyze_code(
            "Explain the main architecture of this codebase"
        )
        print(result.response)
    finally:
        await rag.close()
```
Note that `CodeAnalysisRAG` must be imported from `ceylon_rag`, and `await rag.close()` should be called when processing is complete, as in the basic example above.

## 🔧 Advanced Configuration

### File Exclusions

Configure file exclusions using patterns:

```python
config = {
    # ... other config options ...
    "excluded_dirs": [
        "venv",
        "node_modules",
        ".git",
        "__pycache__"
    ],
    "excluded_files": [
        ".env",
        "package-lock.json"
    ],
    "excluded_extensions": [
        ".pyc",
        ".pyo",
        ".pyd"
    ],
    "ignore_file": ".ragignore"  # Similar to .gitignore
}
```
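The three exclusion lists compose into a single file-filtering predicate. The sketch below shows one plausible way to apply them; the helper name is hypothetical and mirrors the config keys above, while the framework's internal matching (and its `.ragignore` handling) may be more sophisticated:

```python
import fnmatch
from pathlib import PurePosixPath


def is_excluded(path: str, config: dict) -> bool:
    """Return True if a file path matches any exclusion rule in the config."""
    parts = PurePosixPath(path).parts
    name = parts[-1]
    # Any ancestor directory in excluded_dirs rules the file out.
    if any(d in parts[:-1] for d in config.get("excluded_dirs", [])):
        return True
    # File names support glob patterns via fnmatch.
    if any(fnmatch.fnmatch(name, pat) for pat in config.get("excluded_files", [])):
        return True
    # Extension matching is a plain suffix check.
    if any(name.endswith(ext) for ext in config.get("excluded_extensions", [])):
        return True
    return False
```

For instance, with the config above, `is_excluded("venv/lib/foo.py", config)` would be true (ancestor directory match), while `is_excluded("src/main.py", config)` would be false.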

### Chunking Configuration

Customize document chunking:

```python
config = {
    # ... other config options ...
    "chunk_size": 1000,  # Characters per chunk
    "chunk_overlap": 200,  # Overlap between chunks
}
```
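The two settings interact as in this character-based chunking sketch: each chunk starts `chunk_size - chunk_overlap` characters after the previous one, so consecutive chunks share `chunk_overlap` characters of context. This is a simplified stand-in — the framework's "intelligent" chunker likely also respects sentence or token boundaries:

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size chunks; consecutive chunks share chunk_overlap chars."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each chunk's start advances
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

With the defaults above, a 2,500-character document yields four chunks whose starts are 800 characters apart, the last one shorter than `chunk_size`.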

## 🤝 Contributing

Contributions are welcome! Please feel free to submit pull requests. For major changes, please open an issue first to discuss what you would like to change.

## 📄 License

[MIT License](LICENSE)

## 🙏 Acknowledgments

- OpenAI for GPT and embedding models
- Ollama for local LLM support
- LanceDB team for vector storage
- All contributors and users of the framework

---

## 📚 API Documentation

For detailed API documentation, please visit our [API Documentation](docs/api.md) page.

## 🔗 Links

- [GitHub Repository](https://github.com/ceylonai/ceylon-rag)
- [Issue Tracker](https://github.com/ceylonai/ceylon-rag/issues)
- [Documentation](https://ceylon-rag.readthedocs.io/)
            
