# context-bridge 0.1.0

- **Summary**: Unified Python package for RAG documentation workflows - Crawl, embed, store, and retrieve technical documentation for AI agents
- **Requires Python**: >=3.11
- **License**: MIT
- **Keywords**: ai-agents, crawling, documentation, embedding, mcp, pgvector, postgresql, rag, vector-search
- **Uploaded**: 2025-10-23 09:38:56

---

# Context Bridge 🌉

> **Unified Python package for RAG-powered documentation management** - Crawl, store, chunk, and retrieve technical documentation with vector + BM25 hybrid search.

[![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)

---

## 📋 Table of Contents

- [Overview](#-overview)
- [Features](#-features)
- [Architecture](#-architecture)
- [Installation](#-installation)
- [Quick Start](#-quick-start)
- [Configuration](#-configuration)
- [Usage](#-usage)
- [Database Schema](#-database-schema)
- [Development](#-development)
- [Technical Documentation](#-technical-documentation)
- [Contributing](#-contributing)
- [License](#-license)

---

## 🎯 Overview

**Context Bridge** is a standalone Python package designed to help AI agents, LLMs, and developers manage technical documentation with RAG (Retrieval-Augmented Generation) capabilities. It bridges the gap between scattered online documentation and AI-ready, searchable knowledge bases.

### What It Does

1. **Crawls** technical documentation from URLs using Crawl4AI
2. **Organizes** crawled pages into logical groups with size management
3. **Chunks** Markdown content intelligently while preserving structure
4. **Embeds** chunks using vector embeddings (Ollama/Gemini)
5. **Stores** everything in PostgreSQL using the `vector` and `vchord_bm25` extensions
6. **Searches** with hybrid vector + BM25 search for best results
7. **Serves** via MCP (Model Context Protocol) for AI agent integration
8. **Manages** through a Streamlit UI for human oversight

---

## ✨ Features

### Core Capabilities

- **🕷️ Smart Crawling**: Automatically detect and crawl documentation sites, sitemaps, and text files (see the sketch after this list)
- **📦 Intelligent Chunking**: Smart Markdown chunking that respects code blocks, paragraphs, and sentences
- **🔍 Hybrid Search**: Dual vector + BM25 search for superior retrieval accuracy
- **📚 Version Management**: Track multiple versions of the same documentation
- **🎯 Document Organization**: Manual page grouping with size constraints before chunking
- **⚡ High Performance**: PSQLPy for fast async PostgreSQL operations
- **🤖 AI-Ready**: MCP server for seamless AI agent integration
- **🎨 User-Friendly**: Streamlit UI for documentation management
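
As a rough illustration of the smart-crawling detection mentioned above, the routing might look like the following. The function name and rules here are illustrative assumptions, not the package's actual implementation:

```python
def detect_source_type(url: str) -> str:
    # Hypothetical routing for the Smart Crawling feature above.
    if url.endswith("sitemap.xml"):
        return "sitemap"    # parse the sitemap and crawl each listed URL
    if url.endswith(".txt"):
        return "text_file"  # fetch the file contents directly
    return "webpage"        # recursive crawl up to CRAWL_MAX_DEPTH
```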

### Technical Features

- **Vector Search**: Powered by the PostgreSQL `vector` extension
- **BM25 Full-Text Search**: Using the `vchord_bm25` extension
- **Async/Await**: Fully asynchronous operations for scalability
- **Configurable Embeddings**: Support for Ollama (local) and Google Gemini (cloud)
- **Type-Safe**: Pydantic models for configuration and data validation
- **Modular Design**: Clean separation of concerns (repositories, services, managers)

---

## 🏗️ Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                    Context Bridge Architecture               │
└─────────────────────────────────────────────────────────────┘

┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│  Streamlit   │    │ MCP Server   │    │  Python API  │
│      UI      │    │  (AI Agent)  │    │   (Direct)   │
└──────┬───────┘    └──────┬───────┘    └──────┬───────┘
       │                   │                   │
       └───────────────────┴───────────────────┘
                           │
                           ▼
       ┌───────────────────────────────────────┐
       │         Service Layer                 │
       │  - CrawlingService                    │
       │  - ChunkingService                    │
       │  - EmbeddingService                   │
       │  - SearchService                      │
       └───────────────┬───────────────────────┘
                       │
                       ▼
       ┌───────────────────────────────────────┐
       │       Repository Layer                │
       │  - DocumentRepository                 │
       │  - PageRepository                     │
       │  - GroupRepository                    │
       │  - ChunkRepository                    │
       └───────────────┬───────────────────────┘
                       │
                       ▼
       ┌───────────────────────────────────────┐
       │      PostgreSQL Manager               │
       │  - Connection Pooling                 │
       │  - Transaction Management             │
       └───────────────┬───────────────────────┘
                       │
                       ▼
       ┌───────────────────────────────────────┐
       │     PostgreSQL Database               │
       │  Extensions:                          │
       │  - vector (vector search)             │
       │  - vchord_bm25 (BM25 search)          │
       │  - pg_tokenizer (text tokenization)   │
       └───────────────────────────────────────┘

External Dependencies:
┌──────────────┐    ┌──────────────┐
│   Crawl4AI   │    │    Ollama    │
│  (Crawling)  │    │  or Gemini   │
│              │    │ (Embeddings) │
└──────────────┘    └──────────────┘
```

### Workflow

```
1. Crawl Documentation
   ↓
2. Store Raw Pages
   ↓
3. Manual Organization (Group Pages)
   ↓
4. Smart Chunking
   ↓
5. Generate Embeddings
   ↓
6. Store with Vector + BM25 Indexes
   ↓
7. Hybrid Search (Vector + BM25)
```

---

## 📦 Installation

### Prerequisites

- **Python 3.11+**
- **PostgreSQL 14+** with extensions:
  - `vector`
  - `vchord`
  - `pg_tokenizer`
  - `vchord_bm25`
- **Ollama** (for local embeddings) or **Google API Key** (for Gemini)

### Install from PyPI

```bash
pip install context-bridge
```

### Install with Optional Dependencies

```bash
# With Gemini support
pip install context-bridge[gemini]

# With MCP server
pip install context-bridge[mcp]

# With Streamlit UI
pip install context-bridge[ui]

# All features
pip install context-bridge[all]
```

### Running the Applications

**MCP Server:**
```bash
# Using the installed script
context-bridge-mcp

# Or run directly
python -m context_bridge_mcp
```

**Streamlit UI:**
```bash
# Using streamlit directly
streamlit run streamlit_app/app.py

# Or with uv
uv run streamlit run streamlit_app/app.py
```

### Install from Source

```bash
git clone https://github.com/yourusername/context-bridge.git
cd context-bridge
pip install -e .
```

---

## 🚀 Quick Start

### 1. Initialize Database

```bash
python -m context_bridge.database.init_databases
```

This will:
- Create required PostgreSQL extensions
- Create all necessary tables
- Set up vector and BM25 indexes

### 2. Basic Usage (Three Ways)

#### **Option A: Direct Python (Recommended for PyPI users)**

```python
import asyncio
from context_bridge import ContextBridge, Config

async def main():
    # Create config with your settings
    config = Config(
        postgres_host="localhost",
        postgres_password="your_secure_password",
        embedding_model="nomic-embed-text:latest"
    )
    
    # Use with context manager
    async with ContextBridge(config=config) as bridge:
        # Crawl documentation
        result = await bridge.crawl_documentation(
            name="Python Docs",
            version="3.11",
            source_url="https://docs.python.org/3/library/"
        )
        
        # Search documentation
        search_results = await bridge.search(
            query="async await tutorial",
            document_id=result.document_id
        )
        
        for hit in search_results[:3]:
            print(f"Score: {hit.score}, Content: {hit.content[:100]}...")

if __name__ == "__main__":
    asyncio.run(main())
```

#### **Option B: Environment Variables (Recommended for Docker/K8s)**

```bash
# Set environment variables
export POSTGRES_HOST=postgres
export POSTGRES_PASSWORD=secure_password
export OLLAMA_BASE_URL=http://ollama:11434
export EMBEDDING_MODEL=nomic-embed-text:latest
```

Or in `docker-compose.yml`:

```yaml
environment:
  - POSTGRES_HOST=postgres
  - POSTGRES_PASSWORD=secure_password
  - EMBEDDING_MODEL=nomic-embed-text:latest
```

```python
import asyncio
from context_bridge import ContextBridge

async def main():
    # Config automatically loaded from environment variables
    async with ContextBridge() as bridge:
        result = await bridge.crawl_documentation(
            name="Python Docs",
            version="3.11",
            source_url="https://docs.python.org/3/library/"
        )

asyncio.run(main())
```

#### **Option C: .env File (Convenient for local development)**

Create `.env` file (git-ignored):

```bash
# PostgreSQL Configuration
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
POSTGRES_USER=postgres
POSTGRES_PASSWORD=your_secure_password
POSTGRES_DB=context_bridge

# Ollama Configuration
OLLAMA_BASE_URL=http://localhost:11434
EMBEDDING_MODEL=nomic-embed-text:latest
VECTOR_DIMENSION=768

# Search Configuration
SIMILARITY_THRESHOLD=0.7
BM25_WEIGHT=0.3
VECTOR_WEIGHT=0.7

# Chunking Configuration
CHUNK_SIZE=2000
MIN_COMBINED_CONTENT_SIZE=100
MAX_COMBINED_CONTENT_SIZE=3500000

# Crawling Configuration
CRAWL_MAX_DEPTH=3
CRAWL_MAX_CONCURRENT=5
```

Then in your code:

```python
import asyncio
from context_bridge import ContextBridge

async def main():
    # Config automatically loaded from .env file (if python-dotenv is available)
    async with ContextBridge() as bridge:
        result = await bridge.crawl_documentation(...)

asyncio.run(main())
```

To use .env files in development, install with dev dependencies:
```bash
pip install context-bridge[dev]
```

---

## ⚙️ Configuration

The package uses Pydantic for type-safe configuration. Context Bridge supports three configuration methods:

### Configuration Methods (Priority Order)

1. **Direct Python instantiation** (recommended for packaged installs)
2. **Environment variables** (recommended for containers/CI)
3. **.env file** (convenient for local development only)
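
A minimal sketch of the precedence, assuming `Config` is the Pydantic model whose field names appear in the "Python API" column of the tables below:

```python
from context_bridge import Config

# Explicit keyword arguments take precedence over environment variables,
# which in turn take precedence over values read from a .env file.
config = Config(
    postgres_host="db.internal",              # wins over POSTGRES_HOST
    postgres_password="your_secure_password",
)
print(config.postgres_port)  # 5432 unless overridden elsewhere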

### Core Settings

| Setting | Default | Description | Python API |
|---------|---------|-------------|-----------|
| `POSTGRES_HOST` | `localhost` | PostgreSQL host | `postgres_host` |
| `POSTGRES_PORT` | `5432` | PostgreSQL port | `postgres_port` |
| `POSTGRES_USER` | `postgres` | PostgreSQL user | `postgres_user` |
| `POSTGRES_PASSWORD` | (empty) | PostgreSQL password (min 8 chars for production) | `postgres_password` |
| `POSTGRES_DB` | `context_bridge` | Database name | `postgres_db` |
| `DB_POOL_MAX` | `10` | Connection pool size | `postgres_max_pool_size` |
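
These settings combine into an ordinary PostgreSQL connection target. As a hedged illustration only (the pool manager may pass host, port, and user separately rather than build a DSN string):

```python
def build_dsn(config) -> str:
    # Illustrative only; shows how the core settings above fit together.
    return (
        f"postgres://{config.postgres_user}:{config.postgres_password}"
        f"@{config.postgres_host}:{config.postgres_port}/{config.postgres_db}"
    )
```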

### Embedding Settings

| Setting | Default | Description | Python API |
|---------|---------|-------------|-----------|
| `OLLAMA_BASE_URL` | `http://localhost:11434` | Ollama API URL | `ollama_base_url` |
| `EMBEDDING_MODEL` | `nomic-embed-text:latest` | Ollama model name | `embedding_model` |
| `VECTOR_DIMENSION` | `768` | Embedding vector dimension | `vector_dimension` |
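
Under the hood, an Ollama embedding call is a plain HTTP request; a self-contained sketch of the equivalent call using `httpx` directly (`EmbeddingService` presumably wraps something similar):

```python
import httpx

async def ollama_embed(
    text: str,
    base_url: str = "http://localhost:11434",
    model: str = "nomic-embed-text:latest",
) -> list[float]:
    # Ollama's embeddings endpoint returns {"embedding": [...]};
    # its length should match VECTOR_DIMENSION (768 for nomic-embed-text).
    async with httpx.AsyncClient() as client:
        resp = await client.post(
            f"{base_url}/api/embeddings",
            json={"model": model, "prompt": text},
        )
        resp.raise_for_status()
        return resp.json()["embedding"]
```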

### Search Settings

| Setting | Default | Description | Python API |
|---------|---------|-------------|-----------|
| `SIMILARITY_THRESHOLD` | `0.7` | Minimum similarity score | `similarity_threshold` |
| `BM25_WEIGHT` | `0.3` | BM25 weight in hybrid search | `bm25_weight` |
| `VECTOR_WEIGHT` | `0.7` | Vector weight in hybrid search | `vector_weight` |
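
The two weights suggest a linear blend of per-chunk scores. The exact fusion formula is internal to the package; the sketch below only illustrates how such weights typically combine:

```python
def hybrid_score(vector_score: float, bm25_score: float,
                 vector_weight: float = 0.7, bm25_weight: float = 0.3) -> float:
    # Weighted sum of normalized vector-similarity and BM25 scores;
    # SIMILARITY_THRESHOLD presumably filters out weak matches.
    return vector_weight * vector_score + bm25_weight * bm25_score
```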

### Chunking Settings

| Setting | Default | Description | Python API |
|---------|---------|-------------|-----------|
| `CHUNK_SIZE` | `2000` | Default chunk size (bytes) | `chunk_size` |
| `MIN_COMBINED_CONTENT_SIZE` | `100` | Minimum combined page size (bytes) | `min_combined_content_size` |
| `MAX_COMBINED_CONTENT_SIZE` | `3500000` | Maximum combined page size (bytes) | `max_combined_content_size` |
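
The min/max combined sizes bound the total content of a page group before it is chunked. A hypothetical helper (the name and byte-based sizing are assumptions) showing the semantics:

```python
def group_size_ok(page_contents: list[str],
                  min_size: int = 100,                   # MIN_COMBINED_CONTENT_SIZE
                  max_size: int = 3_500_000) -> bool:    # MAX_COMBINED_CONTENT_SIZE
    total = sum(len(c.encode("utf-8")) for c in page_contents)
    return min_size <= total <= max_size
```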

### Crawling Settings

| Setting | Default | Description | Python API |
|---------|---------|-------------|-----------|
| `CRAWL_MAX_DEPTH` | `3` | Maximum crawl depth | `crawl_max_depth` |
| `CRAWL_MAX_CONCURRENT` | `10` | Maximum concurrent crawl operations | `crawl_max_concurrent` |

---

## 📚 Usage

### Crawling Documentation

```python
from context_bridge.service.crawling_service import CrawlingService, CrawlConfig
from crawl4ai import AsyncWebCrawler

# Configure crawler
config = CrawlConfig(
    max_depth=3,          # How deep to follow links
    max_concurrent=10,    # Concurrent requests
    memory_threshold=70.0 # Memory usage threshold
)

service = CrawlingService(config)

# Crawl a documentation site
async with AsyncWebCrawler(verbose=True) as crawler:
    result = await service.crawl_webpage(
        crawler,
        "https://docs.example.com"
    )

# Access results
for crawl_result in result.results:
    print(f"URL: {crawl_result.url}")
    print(f"Content length: {len(crawl_result.markdown)}")
```

### Storing Documents

```python
import hashlib

from context_bridge.repositories.document_repository import DocumentRepository
from context_bridge.repositories.page_repository import PageRepository

# db_manager is an initialized PostgresManager (see database/postgres_manager.py)
async with db_manager.connection() as conn:
    doc_repo = DocumentRepository(conn)
    page_repo = PageRepository(conn)

    # Create a new document
    doc_id = await doc_repo.create(
        name="Python Documentation",
        version="3.11",
        source_url="https://docs.python.org/3/",
        description="Official Python 3.11 documentation"
    )

    # Store crawled pages; use a deterministic digest for content_hash
    # (the built-in hash() is process-salted and returns an int, not TEXT)
    for page in crawled_pages:
        await page_repo.create(
            document_id=doc_id,
            url=page.url,
            content=page.markdown,
            content_hash=hashlib.sha256(page.markdown.encode("utf-8")).hexdigest()
        )
```

### Organizing Pages into Groups

```python
from context_bridge.repositories.group_repository import GroupRepository

# User manually selects pages to group
page_ids = [1, 2, 3, 4, 5]

# Create a group
async with db_manager.connection() as conn:
    group_repo = GroupRepository(conn)
    
    group_id = await group_repo.create_group(
        document_id=doc_id,
        page_ids=page_ids,
        min_size=1000,   # Minimum total content size
        max_size=50000   # Maximum total content size
    )
```

### Chunking and Embedding

```python
from context_bridge.service.chunking_service import ChunkingService
from context_bridge.service.embedding import EmbeddingService

chunking_service = ChunkingService()
embedding_service = EmbeddingService(config)

# group_repo and chunk_repo are repositories bound to an open connection,
# as in the earlier examples. Get groups ready for chunking:
eligible_groups = await group_repo.get_eligible_groups(doc_id)

for group in eligible_groups:
    # Get combined content
    content = await group_repo.get_group_content(group.id)
    
    # Smart chunking
    chunks = chunking_service.smart_chunk_markdown(
        content,
        chunk_size=2000
    )
    
    # Generate embeddings and store
    for i, chunk_text in enumerate(chunks):
        embedding = await embedding_service.get_embedding(chunk_text)
        
        await chunk_repo.create(
            document_id=doc_id,
            group_id=group.id,
            chunk_index=i,
            content=chunk_text,
            embedding=embedding
        )
```

### Searching Documents

#### Find Documents by Query

```python
from context_bridge.repositories.document_repository import DocumentRepository

# Find relevant documents
documents = await doc_repo.find_by_query(
    query="python asyncio tutorial",
    limit=5
)

for doc in documents:
    print(f"{doc.name} (v{doc.version})")
```

#### Search Document Content (Hybrid Search)

```python
from context_bridge.repositories.chunk_repository import ChunkRepository

# Search within a specific document
chunks = await chunk_repo.hybrid_search(
    document_id=doc_id,
    version="3.11",
    query="async await examples",
    query_embedding=await embedding_service.get_embedding("async await examples"),
    limit=10,
    vector_weight=0.7,
    bm25_weight=0.3
)

for chunk in chunks:
    print(f"Score: {chunk.score}")
    print(f"Content: {chunk.content[:200]}...")
```

### Using the Streamlit UI

Context Bridge includes a full-featured web interface for managing documentation:

```bash
# Install with UI support
pip install context-bridge[ui]

# Run the Streamlit application
uv run streamlit run streamlit_app/app.py

# Or use the installed script
context-bridge-ui
```

**Features:**
- **Document Management**: View, search, and delete documents
- **Page Organization**: Select and group crawled pages for processing
- **Chunk Processing**: Convert page groups into searchable chunks
- **Hybrid Search**: Search across all documentation with advanced filtering

### Using the MCP Server

The Model Context Protocol server allows AI agents to interact with Context Bridge:

```bash
# Install with MCP support
pip install context-bridge[mcp]

# Run the MCP server
uv run python -m context_bridge_mcp

# Or use the installed script
context-bridge-mcp
```

**Available Tools:**
- `find_documents`: Search for documents by query
- `search_content`: Perform hybrid vector + BM25 search within specific documents

**Integration with AI Clients:**
The MCP server can be integrated with AI assistants like Claude Desktop for seamless documentation access.
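
As a concrete starting point, the sketch below connects to the server over stdio using the official `mcp` Python SDK and calls one of the tools. The client code is a generic MCP-SDK pattern, not something specific to this package:

```python
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    # Launch the installed server script as a stdio subprocess
    server = StdioServerParameters(command="context-bridge-mcp")
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])
            result = await session.call_tool(
                "find_documents", {"query": "python asyncio"}
            )
            print(result.content)

asyncio.run(main())
```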

For detailed usage instructions, see the [MCP Server Usage Guide](docs/guide/MCP_SERVER_USAGE.md).

---

## 🗄️ Database Schema

### Core Tables

```sql
-- Documents (versioned documentation)
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    name TEXT NOT NULL,
    version TEXT NOT NULL,
    source_url TEXT,
    description TEXT,
    metadata JSONB DEFAULT '{}'::jsonb,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW(),
    UNIQUE(name, version)
);

-- Pages (raw crawled content)
CREATE TABLE pages (
    id SERIAL PRIMARY KEY,
    document_id INTEGER NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
    url TEXT NOT NULL UNIQUE,
    content TEXT NOT NULL,
    content_hash TEXT NOT NULL,
    content_length INTEGER GENERATED ALWAYS AS (length(content)) STORED,
    crawled_at TIMESTAMPTZ DEFAULT NOW(),
    status TEXT DEFAULT 'pending' CHECK (status IN ('pending', 'processing', 'chunked', 'deleted')),
    group_id UUID, -- For future grouping feature
    metadata JSONB DEFAULT '{}'::jsonb
);

-- Chunks (embedded content)
CREATE TABLE chunks (
    id SERIAL PRIMARY KEY,
    document_id INTEGER NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
    chunk_index INTEGER NOT NULL,
    content TEXT NOT NULL,
    group_id UUID, -- For future grouping feature
    embedding VECTOR(768), -- Dimension must match config
    bm25_vector bm25vector, -- Auto-generated by trigger
    created_at TIMESTAMPTZ DEFAULT NOW(),
    UNIQUE(document_id, group_id, chunk_index)
);

-- Indexes
CREATE INDEX idx_pages_document ON pages(document_id);
CREATE INDEX idx_pages_status ON pages(status);
CREATE INDEX idx_pages_hash ON pages(content_hash);
CREATE INDEX idx_pages_group ON pages(group_id);
CREATE INDEX idx_chunks_document ON chunks(document_id);
CREATE INDEX idx_chunks_group ON chunks(group_id);
CREATE INDEX idx_chunks_vector ON chunks USING ivfflat(embedding vector_cosine_ops) WITH (lists = 100);
CREATE INDEX idx_chunks_bm25 ON chunks USING bm25(bm25_vector bm25_ops);
```

---

## 🛠️ Development

### Project Structure

```
context_bridge/               # Core package
├── __init__.py
├── config.py                 # Configuration management
├── core.py                   # Main ContextBridge API
├── database/
│   ├── init_databases.py     # Database initialization
│   └── postgres_manager.py   # Connection pool manager
├── schema/
│   └── extensions.sql        # PostgreSQL extensions & schema
├── repositories/             # Data access layer
│   ├── document_repository.py
│   ├── page_repository.py
│   ├── group_repository.py
│   └── chunk_repository.py
├── service/                  # Business logic layer
│   ├── crawling_service.py
│   ├── chunking_service.py
│   ├── embedding.py
│   ├── search_service.py
│   └── url_service.py

context_bridge_mcp/          # MCP Server (Model Context Protocol)
├── __init__.py
├── server.py                 # MCP server implementation
├── schemas.py                # Tool input/output schemas
└── __main__.py               # CLI entry point

streamlit_app/               # Streamlit Web UI
├── __init__.py
├── app.py                    # Main application
├── pages/                    # Multi-page navigation
│   ├── documents.py          # Document management
│   ├── crawled_pages.py      # Page management
│   └── search.py             # Search interface
├── components/               # Reusable UI components
├── utils/                    # UI utilities and helpers
└── README.md                 # UI-specific documentation

docs/                        # Documentation
├── guide/
│   └── MCP_SERVER_USAGE.md   # MCP server usage guide
├── plan/                    # Development plans
│   └── ui_and_mcp_implementation_plan.md
├── technical/               # Technical guides
│   ├── crawl4ai_complete_guide.md
│   ├── embedding_service.md
│   ├── psqlpy-complete-guide.md
│   ├── python_mcp_server_guide.md
│   ├── python-testing-guide.md
│   └── smart_chunk_markdown_algorithm.md
└── memory_templates.yaml    # Memory usage templates

tests/                       # Test suite
├── conftest.py
├── integration/
├── unit/
└── e2e/                     # End-to-end tests
    ├── conftest.py
    └── test_streamlit_ui.py
```

### Running Tests

```bash
# Install dev dependencies
pip install -e ".[dev]"

# Run all tests
pytest

# Run with coverage
pytest --cov=context_bridge --cov-report=html

# Run specific test file
pytest tests/test_chunking_service.py -v
```

### Code Quality

```bash
# Format code
black context_bridge tests

# Type checking
mypy context_bridge

# Linting
ruff check context_bridge
```

---

## 📖 Technical Documentation

Comprehensive technical guides are available in `docs/`:

### Testing & Quality Assurance
- **[UI Testing Report](docs/ui_testing_report.md)** - Comprehensive Playwright testing results and bug fixes
- **[MCP Server Usage Guide](docs/guide/MCP_SERVER_USAGE.md)** - How to use the MCP server with AI clients

### Technical Guides (`docs/technical/`)
- **[Crawl4AI Guide](docs/technical/crawl4ai_complete_guide.md)** - Complete crawling documentation
- **[Embedding Service](docs/technical/embedding_service.md)** - Ollama and Gemini embedding setup
- **[PSQLPy Guide](docs/technical/psqlpy-complete-guide.md)** - PostgreSQL driver usage
- **[MCP Server Guide](docs/technical/python_mcp_server_guide.md)** - MCP server implementation
- **[Testing Guide](docs/technical/python-testing-guide.md)** - Testing best practices
- **[Smart Chunking Algorithm](docs/technical/smart_chunk_markdown_algorithm.md)** - Chunking implementation

### Implementation Plans (`docs/plan/`)
- **[UI & MCP Implementation Plan](docs/plan/ui_and_mcp_implementation_plan.md)** - Development roadmap and progress

---

## 🤝 Contributing

Contributions are welcome! Please read our contributing guidelines and submit pull requests.

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

---

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

---

## 🙏 Acknowledgments

- **[Crawl4AI](https://github.com/unclecode/crawl4ai)** - High-performance web crawler
- **[PSQLPy](https://github.com/qaspen-python/psqlpy)** - Async PostgreSQL driver
- **[pgvector](https://github.com/pgvector/pgvector)** - Vector similarity search
- **[MCP](https://modelcontextprotocol.io/)** - Model Context Protocol

---

## 📧 Support

For questions, issues, or feature requests:
- **Issues**: [GitHub Issues](https://github.com/yourusername/context-bridge/issues)
- **Discussions**: [GitHub Discussions](https://github.com/yourusername/context-bridge/discussions)
- **Email**: your.email@example.com

---

**Built with ❤️ for AI agents and developers**