# Context Bridge 🌉
> **Unified Python package for RAG-powered documentation management** - Crawl, store, chunk, and retrieve technical documentation with vector + BM25 hybrid search.
[Python 3.11+](https://www.python.org/downloads/) · [License: MIT](https://opensource.org/licenses/MIT) · [Code style: black](https://github.com/psf/black)
---
## 📋 Table of Contents
- [Overview](#-overview)
- [Features](#-features)
- [Architecture](#-architecture)
- [Installation](#-installation)
- [Quick Start](#-quick-start)
- [Configuration](#-configuration)
- [Usage](#-usage)
- [Database Schema](#-database-schema)
- [Development](#-development)
- [Technical Documentation](#-technical-documentation)
- [License](#-license)
---
## 🎯 Overview
**Context Bridge** is a standalone Python package designed to help AI agents, LLMs, and developers manage technical documentation with RAG (Retrieval-Augmented Generation) capabilities. It bridges the gap between scattered online documentation and AI-ready, searchable knowledge bases.
### What It Does
1. **Crawls** technical documentation from URLs using Crawl4AI
2. **Organizes** crawled pages into logical groups with size management
3. **Chunks** Markdown content intelligently while preserving structure
4. **Embeds** chunks using vector embeddings (Ollama/Gemini)
5. **Stores** everything in PostgreSQL using the `vector` and `vchord_bm25` extensions
6. **Searches** with hybrid vector + BM25 search for best results
7. **Serves** via MCP (Model Context Protocol) for AI agent integration
8. **Manages** through a Streamlit UI for human oversight
---
## ✨ Features
### Core Capabilities
- **🕷️ Smart Crawling**: Automatically detect and crawl documentation sites, sitemaps, and text files
- **📦 Intelligent Chunking**: Smart Markdown chunking that respects code blocks, paragraphs, and sentences
- **🔍 Hybrid Search**: Dual vector + BM25 search for superior retrieval accuracy
- **📚 Version Management**: Track multiple versions of the same documentation
- **🎯 Document Organization**: Manual page grouping with size constraints before chunking
- **⚡ High Performance**: PSQLPy for fast async PostgreSQL operations
- **🤖 AI-Ready**: MCP server for seamless AI agent integration
- **🎨 User-Friendly**: Streamlit UI for documentation management
### Technical Features
- **Vector Search**: Powered by the `vector` (pgvector) extension
- **BM25 Full-Text Search**: Using the `vchord_bm25` extension
- **Async/Await**: Fully asynchronous operations for scalability
- **Configurable Embeddings**: Support for Ollama (local) and Google Gemini (cloud)
- **Type-Safe**: Pydantic models for configuration and data validation
- **Modular Design**: Clean separation of concerns (repositories, services, managers)
---
## 🏗️ Architecture
```
┌─────────────────────────────────────────────────────────────┐
│                 Context Bridge Architecture                 │
└─────────────────────────────────────────────────────────────┘

┌──────────────┐   ┌──────────────┐   ┌──────────────┐
│  Streamlit   │   │  MCP Server  │   │  Python API  │
│      UI      │   │  (AI Agent)  │   │   (Direct)   │
└──────┬───────┘   └──────┬───────┘   └──────┬───────┘
       │                  │                  │
       └──────────────────┴──────────────────┘
                          │
                          ▼
          ┌───────────────────────────────────────┐
          │            Service Layer              │
          │  - CrawlingService                    │
          │  - ChunkingService                    │
          │  - EmbeddingService                   │
          │  - SearchService                      │
          └───────────────┬───────────────────────┘
                          │
                          ▼
          ┌───────────────────────────────────────┐
          │           Repository Layer            │
          │  - DocumentRepository                 │
          │  - PageRepository                     │
          │  - GroupRepository                    │
          │  - ChunkRepository                    │
          └───────────────┬───────────────────────┘
                          │
                          ▼
          ┌───────────────────────────────────────┐
          │          PostgreSQL Manager           │
          │  - Connection Pooling                 │
          │  - Transaction Management             │
          └───────────────┬───────────────────────┘
                          │
                          ▼
          ┌───────────────────────────────────────┐
          │         PostgreSQL Database           │
          │  Extensions:                          │
          │  - vector (vector search)             │
          │  - vchord_bm25 (BM25 search)          │
          │  - pg_tokenizer (text tokenization)   │
          └───────────────────────────────────────┘

External Dependencies:
┌──────────────┐   ┌──────────────┐
│   Crawl4AI   │   │    Ollama    │
│  (Crawling)  │   │  or Gemini   │
│              │   │ (Embeddings) │
└──────────────┘   └──────────────┘
```
### Workflow
```
1. Crawl Documentation
↓
2. Store Raw Pages
↓
3. Manual Organization (Group Pages)
↓
4. Smart Chunking
↓
5. Generate Embeddings
↓
6. Store with Vector + BM25 Indexes
↓
7. Hybrid Search (Vector + BM25)
```
---
## 📦 Installation
### Prerequisites
- **Python 3.11+**
- **PostgreSQL 14+** with extensions:
- `vector`
- `vchord`
- `pg_tokenizer`
- `vchord_bm25`
- **Ollama** (for local embeddings) or **Google API Key** (for Gemini)
### Install from PyPI
```bash
pip install context-bridge
```
### Install with Optional Dependencies
```bash
# With Gemini support
pip install context-bridge[gemini]
# With MCP server
pip install context-bridge[mcp]
# With Streamlit UI
pip install context-bridge[ui]
# All features
pip install context-bridge[all]
```
### Running the Applications
**MCP Server:**
```bash
# Using the installed script
context-bridge-mcp
# Or run directly
python -m context_bridge_mcp
```
**Streamlit UI:**
```bash
# Using streamlit directly
streamlit run streamlit_app/app.py
# Or with uv
uv run streamlit run streamlit_app/app.py
```
### Install from Source
```bash
git clone https://github.com/yourusername/context-bridge.git
cd context-bridge
pip install -e .
```
---
## 🚀 Quick Start
### 1. Initialize Database
```bash
python -m context_bridge.database.init_databases
```
This will:
- Create required PostgreSQL extensions
- Create all necessary tables
- Set up vector and BM25 indexes
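To confirm the setup succeeded, you can check that the extensions exist. This is a minimal sketch using PSQLPy (the async driver Context Bridge is built on); the DSN and credentials are placeholders for your own environment:

```python
import asyncio
from psqlpy import ConnectionPool  # async PostgreSQL driver used by Context Bridge

async def check_extensions() -> None:
    # Illustrative DSN; point it at your own database
    pool = ConnectionPool(
        dsn="postgresql://postgres:your_secure_password@localhost:5432/context_bridge"
    )
    conn = await pool.connection()
    result = await conn.execute("SELECT extname FROM pg_extension")
    installed = {row["extname"] for row in result.result()}
    required = {"vector", "vchord", "pg_tokenizer", "vchord_bm25"}
    missing = required - installed
    print("Missing extensions:", missing) if missing else print("All required extensions installed.")

asyncio.run(check_extensions())
```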
### 2. Basic Usage (Three Ways)
#### **Option A: Direct Python (Recommended for PyPI users)**
```python
import asyncio
from context_bridge import ContextBridge, Config

async def main():
    # Create config with your settings
    config = Config(
        postgres_host="localhost",
        postgres_password="your_secure_password",
        embedding_model="nomic-embed-text:latest"
    )

    # Use with context manager
    async with ContextBridge(config=config) as bridge:
        # Crawl documentation
        result = await bridge.crawl_documentation(
            name="Python Docs",
            version="3.11",
            source_url="https://docs.python.org/3/library/"
        )

        # Search documentation
        search_results = await bridge.search(
            query="async await tutorial",
            document_id=result.document_id
        )

        for hit in search_results[:3]:
            print(f"Score: {hit.score}, Content: {hit.content[:100]}...")

if __name__ == "__main__":
    asyncio.run(main())
```
#### **Option B: Environment Variables (Recommended for Docker/K8s)**
```bash
# Set environment variables
export POSTGRES_HOST=postgres
export POSTGRES_PASSWORD=secure_password
export OLLAMA_BASE_URL=http://ollama:11434
export EMBEDDING_MODEL=nomic-embed-text:latest
# Or in docker-compose.yml
environment:
  - POSTGRES_HOST=postgres
  - POSTGRES_PASSWORD=secure_password
  - EMBEDDING_MODEL=nomic-embed-text:latest
```
```python
import asyncio
from context_bridge import ContextBridge

async def main():
    # Config automatically loaded from environment variables
    async with ContextBridge() as bridge:
        result = await bridge.crawl_documentation(
            name="Python Docs",
            version="3.11",
            source_url="https://docs.python.org/3/library/"
        )
```
#### **Option C: .env File (Convenient for local development)**
Create `.env` file (git-ignored):
```bash
# PostgreSQL Configuration
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
POSTGRES_USER=postgres
POSTGRES_PASSWORD=your_secure_password
POSTGRES_DB=context_bridge
# Ollama Configuration
OLLAMA_BASE_URL=http://localhost:11434
EMBEDDING_MODEL=nomic-embed-text:latest
VECTOR_DIMENSION=768
# Search Configuration
SIMILARITY_THRESHOLD=0.7
BM25_WEIGHT=0.3
VECTOR_WEIGHT=0.7
# Chunking Configuration
CHUNK_SIZE=2000
MIN_COMBINED_CONTENT_SIZE=100
MAX_COMBINED_CONTENT_SIZE=3500000
# Crawling Configuration
CRAWL_MAX_DEPTH=3
CRAWL_MAX_CONCURRENT=5
```
Then in your code:
```python
import asyncio
from context_bridge import ContextBridge

async def main():
    # Config automatically loaded from .env file (if python-dotenv is available)
    async with ContextBridge() as bridge:
        result = await bridge.crawl_documentation(...)
```
To use .env files in development, install with dev dependencies:
```bash
pip install context-bridge[dev]
```
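If you prefer to load the `.env` file explicitly instead of relying on automatic discovery, a small sketch using python-dotenv (included in the dev extras):

```python
from dotenv import load_dotenv  # provided by python-dotenv
from context_bridge import Config

load_dotenv()        # reads .env from the current directory into os.environ
config = Config()    # settings are then picked up from the environment, as described above
print(config.postgres_host, config.embedding_model)
```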
---
## ⚙️ Configuration
The package uses Pydantic for type-safe, validated configuration. Context Bridge supports three configuration methods:
### Configuration Methods (Priority Order)
1. **Direct Python instantiation** (recommended for packaged installs)
2. **Environment variables** (recommended for containers/CI)
3. **.env file** (convenient for local development only)
### Core Settings
| Setting | Default | Description | Python API |
|---------|---------|-------------|-----------|
| `POSTGRES_HOST` | `localhost` | PostgreSQL host | `postgres_host` |
| `POSTGRES_PORT` | `5432` | PostgreSQL port | `postgres_port` |
| `POSTGRES_USER` | `postgres` | PostgreSQL user | `postgres_user` |
| `POSTGRES_PASSWORD` | (empty) | PostgreSQL password (min 8 chars for prod) | `postgres_password` |
| `POSTGRES_DB` | `context_bridge` | Database name | `postgres_db` |
| `DB_POOL_MAX` | `10` | Connection pool size | `postgres_max_pool_size` |
### Embedding Settings
| Setting | Default | Description | Python API |
|---------|---------|-------------|-----------|
| `OLLAMA_BASE_URL` | `http://localhost:11434` | Ollama API URL | `ollama_base_url` |
| `EMBEDDING_MODEL` | `nomic-embed-text:latest` | Ollama model name | `embedding_model` |
| `VECTOR_DIMENSION` | `768` | Embedding vector dimension | `vector_dimension` |
### Search Settings
| Setting | Default | Description | Python API |
|---------|---------|-------------|-----------|
| `SIMILARITY_THRESHOLD` | `0.7` | Minimum similarity score | `similarity_threshold` |
| `BM25_WEIGHT` | `0.3` | BM25 weight in hybrid search | `bm25_weight` |
| `VECTOR_WEIGHT` | `0.7` | Vector weight in hybrid search | `vector_weight` |
### Chunking Settings
| Setting | Default | Description | Python API |
|---------|---------|-------------|-----------|
| `CHUNK_SIZE` | `2000` | Default chunk size (bytes) | `chunk_size` |
| `MIN_COMBINED_CONTENT_SIZE` | `100` | Minimum combined page size (bytes) | `min_combined_content_size` |
| `MAX_COMBINED_CONTENT_SIZE` | `3500000` | Maximum combined page size (bytes) | `max_combined_content_size` |
### Crawling Settings
| Setting | Default | Description | Python API |
|---------|---------|-------------|-----------|
| `CRAWL_MAX_DEPTH` | `3` | Maximum crawl depth | `crawl_max_depth` |
| `CRAWL_MAX_CONCURRENT` | `10` | Maximum concurrent crawl operations | `crawl_max_concurrent` |
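The environment variable names map one-to-one onto the `Config` field names in the "Python API" column, so the same settings can be passed directly in Python. A sketch using values from the tables above:

```python
from context_bridge import Config

# Field names correspond to the "Python API" column above;
# anything omitted falls back to the defaults listed in the tables.
config = Config(
    postgres_host="localhost",
    postgres_port=5432,
    postgres_password="your_secure_password",
    embedding_model="nomic-embed-text:latest",
    vector_dimension=768,
    similarity_threshold=0.7,
    bm25_weight=0.3,
    vector_weight=0.7,
    chunk_size=2000,
    crawl_max_depth=3,
    crawl_max_concurrent=10,
)
```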
---
## 📚 Usage
### Crawling Documentation
```python
from context_bridge.service.crawling_service import CrawlingService, CrawlConfig
from crawl4ai import AsyncWebCrawler

# Configure crawler
config = CrawlConfig(
    max_depth=3,            # How deep to follow links
    max_concurrent=10,      # Concurrent requests
    memory_threshold=70.0   # Memory usage threshold
)

service = CrawlingService(config)

# Crawl a documentation site
async with AsyncWebCrawler(verbose=True) as crawler:
    result = await service.crawl_webpage(
        crawler,
        "https://docs.example.com"
    )

# Access results
for crawl_result in result.results:
    print(f"URL: {crawl_result.url}")
    print(f"Content length: {len(crawl_result.markdown)}")
```
### Storing Documents
```python
from context_bridge.repositories.document_repository import DocumentRepository
from context_bridge.repositories.page_repository import PageRepository

# db_manager is the package's PostgreSQL connection manager
# (see context_bridge.database.postgres_manager)
async with db_manager.connection() as conn:
    doc_repo = DocumentRepository(conn)
    page_repo = PageRepository(conn)

    # Create a new document
    doc_id = await doc_repo.create(
        name="Python Documentation",
        version="3.11",
        source_url="https://docs.python.org/3/",
        description="Official Python 3.11 documentation"
    )

    # Store crawled pages
    for page in crawled_pages:
        await page_repo.create(
            document_id=doc_id,
            url=page.url,
            content=page.markdown,
            content_hash=hash(page.markdown)
        )
```
### Organizing Pages into Groups
```python
from context_bridge.repositories.group_repository import GroupRepository

# User manually selects pages to group
page_ids = [1, 2, 3, 4, 5]

# Create a group
async with db_manager.connection() as conn:
    group_repo = GroupRepository(conn)

    group_id = await group_repo.create_group(
        document_id=doc_id,
        page_ids=page_ids,
        min_size=1000,    # Minimum total content size
        max_size=50000    # Maximum total content size
    )
```
### Chunking and Embedding
```python
from context_bridge.service.chunking_service import ChunkingService
from context_bridge.service.embedding import EmbeddingService

chunking_service = ChunkingService()
embedding_service = EmbeddingService(config)

# Get groups ready for chunking
eligible_groups = await group_repo.get_eligible_groups(doc_id)

for group in eligible_groups:
    # Get combined content
    content = await group_repo.get_group_content(group.id)

    # Smart chunking
    chunks = chunking_service.smart_chunk_markdown(
        content,
        chunk_size=2000
    )

    # Generate embeddings and store
    for i, chunk_text in enumerate(chunks):
        embedding = await embedding_service.get_embedding(chunk_text)

        await chunk_repo.create(
            document_id=doc_id,
            group_id=group.id,
            chunk_index=i,
            content=chunk_text,
            embedding=embedding
        )
```
### Searching Documents
#### Find Documents by Query
```python
from context_bridge.repositories.document_repository import DocumentRepository

# Find relevant documents
documents = await doc_repo.find_by_query(
    query="python asyncio tutorial",
    limit=5
)

for doc in documents:
    print(f"{doc.name} (v{doc.version})")
```
#### Search Document Content (Hybrid Search)
```python
from context_bridge.repositories.chunk_repository import ChunkRepository

# Search within a specific document
chunks = await chunk_repo.hybrid_search(
    document_id=doc_id,
    version="3.11",
    query="async await examples",
    query_embedding=await embedding_service.get_embedding("async await examples"),
    limit=10,
    vector_weight=0.7,
    bm25_weight=0.3
)

for chunk in chunks:
    print(f"Score: {chunk.score}")
    print(f"Content: {chunk.content[:200]}...")
```
### Using the Streamlit UI
Context Bridge includes a full-featured web interface for managing documentation:
```bash
# Install with UI support
pip install context-bridge[ui]
# Run the Streamlit application
uv run streamlit run streamlit_app/app.py
# Or use the installed script
context-bridge-ui
```
**Features:**
- **Document Management**: View, search, and delete documents
- **Page Organization**: Select and group crawled pages for processing
- **Chunk Processing**: Convert page groups into searchable chunks
- **Hybrid Search**: Search across all documentation with advanced filtering
### Using the MCP Server
The Model Context Protocol server allows AI agents to interact with Context Bridge:
```bash
# Install with MCP support
pip install context-bridge[mcp]
# Run the MCP server
uv run python -m context_bridge_mcp
# Or use the installed script
context-bridge-mcp
```
**Available Tools:**
- `find_documents`: Search for documents by query
- `search_content`: Perform hybrid vector + BM25 search within specific documents
**Integration with AI Clients:**
The MCP server can be integrated with AI assistants like Claude Desktop for seamless documentation access.
For detailed usage instructions, see the [MCP Server Usage Guide](docs/guide/MCP_SERVER_USAGE.md).
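As an illustration of what a client interaction looks like, here is a minimal sketch using the official `mcp` Python SDK. The tool argument names (`query`) are assumptions based on the tool descriptions above; see the usage guide for the real schemas.

```python
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main() -> None:
    # Launch the installed server script over stdio
    server = StdioServerParameters(command="context-bridge-mcp")
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Argument names are illustrative; consult the MCP Server Usage Guide
            docs = await session.call_tool("find_documents", {"query": "python asyncio"})
            print(docs.content)

asyncio.run(main())
```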
---
## 🗄️ Database Schema
### Core Tables
```sql
-- Documents (versioned documentation)
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    name TEXT NOT NULL,
    version TEXT NOT NULL,
    source_url TEXT,
    description TEXT,
    metadata JSONB DEFAULT '{}'::jsonb,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW(),
    UNIQUE(name, version)
);

-- Pages (raw crawled content)
CREATE TABLE pages (
    id SERIAL PRIMARY KEY,
    document_id INTEGER NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
    url TEXT NOT NULL UNIQUE,
    content TEXT NOT NULL,
    content_hash TEXT NOT NULL,
    content_length INTEGER GENERATED ALWAYS AS (length(content)) STORED,
    crawled_at TIMESTAMPTZ DEFAULT NOW(),
    status TEXT DEFAULT 'pending' CHECK (status IN ('pending', 'processing', 'chunked', 'deleted')),
    group_id UUID,                 -- For future grouping feature
    metadata JSONB DEFAULT '{}'::jsonb
);

-- Chunks (embedded content)
CREATE TABLE chunks (
    id SERIAL PRIMARY KEY,
    document_id INTEGER NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
    chunk_index INTEGER NOT NULL,
    content TEXT NOT NULL,
    group_id UUID,                 -- For future grouping feature
    embedding VECTOR(768),         -- Dimension must match config
    bm25_vector bm25vector,        -- Auto-generated by trigger
    created_at TIMESTAMPTZ DEFAULT NOW(),
    UNIQUE(document_id, group_id, chunk_index)
);

-- Indexes
CREATE INDEX idx_pages_document ON pages(document_id);
CREATE INDEX idx_pages_status ON pages(status);
CREATE INDEX idx_pages_hash ON pages(content_hash);
CREATE INDEX idx_pages_group ON pages(group_id);
CREATE INDEX idx_chunks_document ON chunks(document_id);
CREATE INDEX idx_chunks_group ON chunks(group_id);
CREATE INDEX idx_chunks_vector ON chunks USING ivfflat(embedding vector_cosine_ops) WITH (lists = 100);
CREATE INDEX idx_chunks_bm25 ON chunks USING bm25(bm25_vector bm25_ops);
```
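The hybrid score returned by `hybrid_search` combines the two retrieval signals with the configured weights. Conceptually it looks like the simplified sketch below; the real implementation ranks rows in SQL using the `ivfflat` and `bm25` indexes defined above and may normalize raw scores differently.

```python
def hybrid_score(vector_similarity: float, bm25_score: float,
                 vector_weight: float = 0.7, bm25_weight: float = 0.3) -> float:
    """Weighted combination of vector similarity and (normalized) BM25 score.

    Illustrative only: the repository computes and ranks this inside PostgreSQL.
    """
    return vector_weight * vector_similarity + bm25_weight * bm25_score

# Example: a chunk with cosine similarity 0.82 and normalized BM25 score 0.40
print(hybrid_score(0.82, 0.40))  # 0.7 * 0.82 + 0.3 * 0.40 = 0.694
```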
---
## 🛠️ Development
### Project Structure
```
context_bridge/                    # Core package
├── __init__.py
├── config.py                      # Configuration management
├── core.py                        # Main ContextBridge API
├── database/
│   ├── init_databases.py          # Database initialization
│   └── postgres_manager.py        # Connection pool manager
├── schema/
│   └── extensions.sql             # PostgreSQL extensions & schema
├── repositories/                  # Data access layer
│   ├── document_repository.py
│   ├── page_repository.py
│   ├── group_repository.py
│   └── chunk_repository.py
└── service/                       # Business logic layer
    ├── crawling_service.py
    ├── chunking_service.py
    ├── embedding.py
    ├── search_service.py
    └── url_service.py

context_bridge_mcp/                # MCP Server (Model Context Protocol)
├── __init__.py
├── server.py                      # MCP server implementation
├── schemas.py                     # Tool input/output schemas
└── __main__.py                    # CLI entry point

streamlit_app/                     # Streamlit Web UI
├── __init__.py
├── app.py                         # Main application
├── pages/                         # Multi-page navigation
│   ├── documents.py               # Document management
│   ├── crawled_pages.py           # Page management
│   └── search.py                  # Search interface
├── components/                    # Reusable UI components
├── utils/                         # UI utilities and helpers
└── README.md                      # UI-specific documentation

docs/                              # Documentation
├── guide/
│   └── MCP_SERVER_USAGE.md        # MCP server usage guide
├── plan/                          # Development plans
│   └── ui_and_mcp_implementation_plan.md
├── technical/                     # Technical guides
│   ├── crawl4ai_complete_guide.md
│   ├── embedding_service.md
│   ├── psqlpy-complete-guide.md
│   ├── python_mcp_server_guide.md
│   ├── python-testing-guide.md
│   └── smart_chunk_markdown_algorithm.md
└── memory_templates.yaml          # Memory usage templates

tests/                             # Test suite
├── conftest.py
├── integration/
├── unit/
└── e2e/                           # End-to-end tests
    ├── conftest.py
    └── test_streamlit_ui.py
```
### Running Tests
```bash
# Install dev dependencies
pip install -e ".[dev]"
# Run all tests
pytest
# Run with coverage
pytest --cov=context_bridge --cov-report=html
# Run specific test file
pytest tests/test_chunking_service.py -v
```
### Code Quality
```bash
# Format code
black context_bridge tests
# Type checking
mypy context_bridge
# Linting
ruff check context_bridge
```
---
## 📖 Technical Documentation
Comprehensive technical guides are available in `docs/`:
### Testing & Quality Assurance
- **[UI Testing Report](docs/ui_testing_report.md)** - Comprehensive Playwright testing results and bug fixes
- **[MCP Server Usage Guide](docs/guide/MCP_SERVER_USAGE.md)** - How to use the MCP server with AI clients
### Technical Guides (`docs/technical/`)
- **[Crawl4AI Guide](docs/technical/crawl4ai_complete_guide.md)** - Complete crawling documentation
- **[Embedding Service](docs/technical/embedding_service.md)** - Ollama and Gemini embedding setup
- **[PSQLPy Guide](docs/technical/psqlpy-complete-guide.md)** - PostgreSQL driver usage
- **[MCP Server Guide](docs/technical/python_mcp_server_guide.md)** - MCP server implementation
- **[Testing Guide](docs/technical/python-testing-guide.md)** - Testing best practices
- **[Smart Chunking Algorithm](docs/technical/smart_chunk_markdown_algorithm.md)** - Chunking implementation
### Implementation Plans (`docs/plan/`)
- **[UI & MCP Implementation Plan](docs/plan/ui_and_mcp_implementation_plan.md)** - Development roadmap and progress
---
## 🤝 Contributing
Contributions are welcome! Please read our contributing guidelines and submit pull requests.
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
---
## 📄 License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
---
## 🙏 Acknowledgments
- **[Crawl4AI](https://github.com/unclecode/crawl4ai)** - High-performance web crawler
- **[PSQLPy](https://github.com/qaspen-python/psqlpy)** - Async PostgreSQL driver
- **[pgvector](https://github.com/pgvector/pgvector)** - Vector similarity search
- **[MCP](https://modelcontextprotocol.io/)** - Model Context Protocol
---
## 📧 Support
For questions, issues, or feature requests:
- **Issues**: [GitHub Issues](https://github.com/yourusername/context-bridge/issues)
- **Discussions**: [GitHub Discussions](https://github.com/yourusername/context-bridge/discussions)
- **Email**: your.email@example.com
---
**Built with ❤️ for AI agents and developers**