# RAG Retriever
A semantic search system that crawls websites, indexes content, and provides AI-powered search capabilities through an MCP server. Built with a modular architecture using OpenAI embeddings and a ChromaDB vector store.
## 🚀 Quick Start with AI Assistant
Let your AI coding assistant help you set up and use RAG Retriever:
**Setup**: Direct your AI assistant to [`SETUP_ASSISTANT_PROMPT.md`](SETUP_ASSISTANT_PROMPT.md)
**Usage**: Direct your AI assistant to [`USAGE_ASSISTANT_PROMPT.md`](USAGE_ASSISTANT_PROMPT.md)
**CLI Operations**: Direct your AI assistant to [`CLI_ASSISTANT_PROMPT.md`](CLI_ASSISTANT_PROMPT.md)
**Administration**: Direct your AI assistant to [`ADMIN_ASSISTANT_PROMPT.md`](ADMIN_ASSISTANT_PROMPT.md)
**Advanced Content**: Direct your AI assistant to [`ADVANCED_CONTENT_INGESTION_PROMPT.md`](ADVANCED_CONTENT_INGESTION_PROMPT.md)
**Troubleshooting**: Direct your AI assistant to [`TROUBLESHOOTING_ASSISTANT_PROMPT.md`](TROUBLESHOOTING_ASSISTANT_PROMPT.md)
**Quick Commands**: See [`QUICKSTART.md`](QUICKSTART.md) for copy-paste installation commands.
These prompts provide comprehensive instructions for your AI assistant to walk you through setup, usage, and troubleshooting without needing to read through documentation.
## What RAG Retriever Does
RAG Retriever enhances your AI coding workflows by providing:
- **Website Crawling**: Index documentation sites, blogs, and knowledge bases
- **Semantic Search**: Find relevant information using natural language queries
- **Collection Management**: Organize content into themed collections
- **MCP Integration**: Direct access from Claude Code and other AI assistants
- **Fast Processing**: Up to 20x faster crawling with the Crawl4AI option
- **Rich Metadata**: Extract titles, descriptions, and source attribution
## Key Features
### 🌐 Advanced Web Crawling
- **Playwright**: Reliable JavaScript-enabled crawling
- **Crawl4AI**: High-performance crawling with content filtering
- **Configurable depth**: Control how deep to crawl linked pages
- **Same-domain focus**: Automatically stays within target sites
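The depth and same-domain rules above can be sketched as a simple link filter. This is an illustrative example, not the crawler's actual implementation; `should_crawl` is a hypothetical helper:

```python
from urllib.parse import urlparse

def should_crawl(url: str, root_url: str, depth: int, max_depth: int) -> bool:
    """Hypothetical link filter mirroring the two rules above:
    stay on the root site's domain, and stop once the configured
    crawl depth is exceeded."""
    if depth > max_depth:
        return False
    return urlparse(url).netloc == urlparse(root_url).netloc

root = "https://docs.python.org/3/"
print(should_crawl("https://docs.python.org/3/library/", root, depth=1, max_depth=2))  # True
print(should_crawl("https://example.com/page", root, depth=1, max_depth=2))            # False
```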
### 🔍 Semantic Search
- **OpenAI Embeddings**: Uses text-embedding-3-large for high-quality vectors
- **Relevance Scoring**: Configurable similarity thresholds
- **Cross-Collection Search**: Search across all collections simultaneously
- **Source Attribution**: Track where information comes from
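Relevance scoring with a configurable threshold typically works like this: compute cosine similarity between the query vector and each document vector, drop anything below the threshold, and rank the rest best-first. A minimal sketch (not the actual implementation; the function names are illustrative):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Cosine similarity: dot product over the product of vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def filter_by_threshold(scored: list[tuple[str, float]], threshold: float = 0.3):
    # Keep only results whose similarity clears the threshold
    # (cf. the default_score_threshold setting), ranked best-first.
    return sorted((s for s in scored if s[1] >= threshold), key=lambda s: s[1], reverse=True)

results = [("doc-a", 0.82), ("doc-b", 0.12), ("doc-c", 0.45)]
print(filter_by_threshold(results))  # [('doc-a', 0.82), ('doc-c', 0.45)]
```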
### 📚 Collection Management
- **Named Collections**: Organize content by topic, project, or source
- **Metadata Tracking**: Creation dates, document counts, descriptions
- **Health Monitoring**: Audit collections for quality and freshness
- **Easy Cleanup**: Remove or rebuild collections as needed
### 🎯 Quality Management
- **Content Quality Assessment**: Systematic evaluation of indexed content
- **AI-Powered Quality Review**: Use AI to assess accuracy and completeness
- **Contradiction Detection**: Find conflicting information across collections
- **Relevance Monitoring**: Track search quality metrics over time
- **Best Practice Guidance**: Comprehensive collection organization strategies
### 🤖 AI Integration
- **MCP Server**: Direct integration with Claude Code
- **Custom Commands**: Pre-built workflows for common tasks
- **Tool Descriptions**: Clear interfaces for AI assistants
- **Permission Management**: Secure access controls
## MCP vs CLI Capabilities
### MCP Server (Claude Code Integration)
The MCP server provides **secure, AI-friendly access** to core functionality:
- **Web Crawling**: Index websites and documentation
- **Semantic Search**: Search across collections with relevance scoring
- **Collection Discovery**: List and explore available collections
- **Quality Assessment**: Audit content quality and system health
- **Intentionally Limited**: No administrative operations for security
### CLI (Full Administrative Control)
The command-line interface provides **complete system control**:
- **All MCP Capabilities**: Everything available through MCP server
- **Collection Management**: Delete collections, clean entire vector store
- **Advanced Content Ingestion**: Images, PDFs, GitHub repos, Confluence
- **Local File Processing**: Directory scanning, bulk operations
- **System Administration**: Configuration, maintenance, troubleshooting
- **Rich Output Options**: JSON, verbose logging, custom formatting
### Web UI (Visual Management)
The Streamlit-based web interface provides **intuitive visual control**:
- **Interactive Search**: Visual search interface with adjustable parameters
- **Collection Management**: View, delete, edit descriptions, compare collections
- **Content Discovery**: Web search and direct content indexing workflow
- **Visual Analytics**: Statistics, charts, and collection comparisons
- **User-Friendly**: No command-line knowledge required
- **Real-time Feedback**: Immediate visual confirmation of operations
### When to Use Each Interface
| Task | MCP Server | CLI | Web UI | Recommendation |
|------|------------|-----|---------|----------------|
| Search content | ✅ | ✅ | ✅ | **MCP** for AI workflows, **UI** for interactive exploration |
| Index websites | ✅ | ✅ | ✅ | **UI** for discovery workflow, **MCP** for AI integration |
| Delete collections | ❌ | ✅ | ✅ | **UI** for visual confirmation, **CLI** for scripting |
| Edit collection metadata | ❌ | ❌ | ✅ | **UI** only option |
| Visual analytics | ❌ | ❌ | ✅ | **UI** only option |
| Content discovery | ❌ | ❌ | ✅ | **UI** provides search → select → index workflow |
| Process local files | ❌ | ✅ | ❌ | **CLI** only option |
| Analyze images | ❌ | ✅ | ❌ | **CLI** only option |
| GitHub integration | ❌ | ✅ | ❌ | **CLI** only option |
| System maintenance | ❌ | ✅ | ❌ | **CLI** only option |
| AI assistant integration | ✅ | ❌ | ❌ | **MCP** designed for AI workflows |
| Visual collection comparison | ❌ | ❌ | ✅ | **UI** provides interactive charts |
## Available Claude Code Commands
Once configured as an MCP server, you can use:
### `/rag-list-collections`
Discover all available vector store collections with document counts and metadata.
### `/rag-search-knowledge "query [collection] [limit] [threshold]"`
Search indexed content using semantic similarity:
- `"python documentation"` - searches default collection
- `"python documentation python_docs"` - searches specific collection
- `"python documentation all"` - searches ALL collections
- `"error handling all 10 0.4"` - custom parameters
### `/rag-index-website "url [max_depth] [collection]"`
Crawl and index website content:
- `"https://docs.python.org"` - index with defaults
- `"https://docs.python.org 3"` - custom crawl depth
- `"https://docs.python.org 2 python_docs"` - custom depth and collection
### `/rag-audit-collections`
Review collection health, identify issues, and get maintenance recommendations.
### `/rag-assess-quality`
Systematically evaluate content quality, accuracy, and reliability to ensure high-quality search results.
### `/rag-manage-collections`
Administrative collection operations including deletion and cleanup (provides CLI commands).
### `/rag-ingest-content`
Guide through advanced content ingestion for local files, images, and enterprise systems.
### `/rag-cli-help`
Interactive CLI command builder and comprehensive help system.
## Web UI Interface
Launch the visual interface with: `rag-retriever --ui`
### Collection Management

*Comprehensive collection overview with statistics, metadata, and management actions*
### Collection Actions and Deletion

*Collection management interface showing edit description and delete collection options with visual confirmation*
### Interactive Knowledge Search

*Search indexed content with adjustable parameters (max results, score threshold) and explore results with metadata and expandable content*
### Collection Analytics and Comparison

*Side-by-side collection comparison with interactive charts showing document counts, chunks, and performance metrics*
### Content Discovery and Indexing Workflow

*Search the web, select relevant content, adjust crawl depth, and index directly into collections for a complete discovery-to-indexing workflow*
The Web UI excels at:
- **Content Discovery Workflow**: Search → Select → Adjust → Index new content in one seamless interface
- **Visual Collection Management**: View statistics, edit descriptions, delete collections with confirmation
- **Interactive Search**: Real-time parameter adjustment and visual exploration of indexed content
- **Collection Analytics**: Compare collections with interactive charts and performance metrics
- **Administrative Tasks**: User-friendly collection deletion and management operations
## How It Works
1. **Content Ingestion**: Web pages are crawled and processed into clean text
2. **Embedding Generation**: Text is converted to vectors using OpenAI's embedding models
3. **Vector Storage**: Embeddings are stored in ChromaDB with metadata
4. **Semantic Search**: Queries are embedded and matched against stored vectors
5. **Result Ranking**: Results are ranked by similarity and returned with sources
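The five steps above can be condensed into a runnable toy pipeline. The bag-of-words `embed` function below is a deliberate stand-in for OpenAI's embedding API, and the list-based `store` stands in for ChromaDB; everything here is illustrative, not the project's actual code:

```python
import math

VOCAB = ["python", "error", "handling", "crawl", "index", "search"]

def embed(text: str) -> list[float]:
    # Toy bag-of-words embedding standing in for OpenAI's
    # text-embedding-3-large; the real system calls the API here.
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(y * y for y in b)) or 1.0
    return dot / (na * nb)

# Steps 1-3: ingest, embed, and store chunks with metadata
# (ChromaDB handles storage in the real system).
store = []
for chunk, source in [("python error handling basics", "https://docs.python.org"),
                      ("how to crawl and index a site", "https://example.com/guide")]:
    store.append({"text": chunk, "vector": embed(chunk), "source": source})

# Steps 4-5: embed the query, rank by similarity, return with sources.
query_vec = embed("error handling in python")
ranked = sorted(store, key=lambda d: cosine(query_vec, d["vector"]), reverse=True)
print(ranked[0]["source"])  # https://docs.python.org
```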
## Architecture
### Layered Content Ingestion Architecture
```mermaid
flowchart TD
subgraph CS ["CONTENT SOURCES"]
subgraph WC ["Web Content"]
WC1["Playwright"]
WC2["Crawl4AI"]
WC3["Web Search"]
WC4["Discovery UI"]
end
subgraph LF ["Local Files"]
LF1["PDF Files"]
LF2["Markdown"]
LF3["Text Files"]
LF4["Directories"]
end
subgraph RM ["Rich Media"]
RM1["Images"]
RM2["Screenshots"]
RM3["Diagrams"]
RM4["OpenAI Vision"]
end
subgraph ES ["Enterprise Systems"]
ES1["GitHub Repos"]
ES2["Confluence Spaces"]
ES3["Private Repos"]
ES4["Branch Selection"]
end
end
subgraph PP ["PROCESSING PIPELINE"]
subgraph CC ["Content Cleaning"]
CC1["HTML Parsing"]
CC2["Text Extract"]
CC3["Format Normal"]
end
subgraph TC ["Text Chunking"]
TC1["Smart Splits"]
TC2["Overlap Mgmt"]
TC3["Size Control"]
end
subgraph EB ["Embedding"]
EB1["OpenAI API"]
EB2["Vector Gen"]
EB3["Batch Process"]
end
subgraph QA ["Quality Assessment"]
QA1["Relevance Scoring"]
QA2["Search Quality"]
QA3["Collection Auditing"]
end
end
subgraph SSE ["STORAGE & SEARCH ENGINE"]
subgraph CD ["ChromaDB"]
CD1["Vector Store"]
CD2["Persistence"]
CD3["Performance"]
end
subgraph COL ["Collections"]
COL1["Topic-based"]
COL2["Named Groups"]
COL3["Metadata"]
end
subgraph SS ["Semantic Search"]
SS1["Similarity"]
SS2["Thresholds"]
SS3["Cross-search"]
end
subgraph MS ["Metadata Store"]
MS1["Source Attribution"]
MS2["Timestamps"]
MS3["Descriptions"]
end
end
subgraph UI ["USER INTERFACES"]
subgraph WUI ["Web UI"]
WUI1["Discovery"]
WUI2["Visual Mgmt"]
WUI3["Interactive"]
end
subgraph CLI ["CLI"]
CLI1["Full Admin"]
CLI2["All Features"]
CLI3["Maintenance"]
end
subgraph MCP ["MCP Server"]
MCP1["Tool Provider"]
MCP2["Secure Ops"]
MCP3["FastMCP"]
end
subgraph AI ["AI Assistant Integ"]
AI1["Claude Code Cmds"]
AI2["AI Workflows"]
AI3["Assistant Commands"]
end
end
CS --> PP
PP --> SSE
SSE --> UI
```
### Technical Component Architecture
```mermaid
graph TB
subgraph RAG ["RAG RETRIEVER SYSTEM"]
subgraph INTERFACES ["USER INTERFACES"]
WEB["Streamlit Web UI<br/>(ui/app.py)<br/>• Discovery<br/>• Collections<br/>• Search"]
CLI_MOD["CLI Module<br/>(cli.py)<br/>• Full Control<br/>• Admin Ops<br/>• All Features<br/>• Maintenance"]
MCP_SRV["MCP Server<br/>(mcp/server.py)<br/>• FastMCP Framework<br/>• Tool Definitions<br/>• AI Integration<br/>• Claude Code Support"]
end
subgraph CORE ["CORE ENGINE"]
PROC["Content Processing<br/>(main.py)<br/>• URL Processing<br/>• Search Coordination<br/>• Orchestration"]
LOADERS["Document Loaders<br/>• LocalLoader<br/>• ImageLoader<br/>• GitHubLoader<br/>• ConfluenceLoader"]
SEARCH["Search Engine<br/>(searcher.py)<br/>• Semantic Search<br/>• Cross-collection<br/>• Score Ranking"]
end
subgraph DATA ["DATA LAYER"]
VECTOR["Vector Store<br/>(store.py)<br/>• ChromaDB<br/>• Collections<br/>• Metadata<br/>• Persistence"]
CRAWLERS["Web Crawlers<br/>(crawling/)<br/>• Playwright<br/>• Crawl4AI<br/>• ContentClean<br/>• URL Handling"]
CONFIG["Config System<br/>(config.py)<br/>• YAML Config<br/>• User Settings<br/>• API Keys<br/>• Validation"]
end
subgraph EXTERNAL ["EXTERNAL APIS"]
OPENAI["OpenAI API<br/>• Embeddings<br/>• Vision Model<br/>• Batch Process"]
SEARCH_API["Search APIs<br/>• Google Search<br/>• DuckDuckGo<br/>• Web Discovery"]
EXT_SYS["External Systems<br/>• GitHub API<br/>• Confluence<br/>• Git Repos"]
end
end
WEB --> PROC
CLI_MOD --> PROC
MCP_SRV --> PROC
PROC <--> LOADERS
PROC <--> SEARCH
LOADERS <--> SEARCH
CORE --> VECTOR
CORE --> CRAWLERS
CORE --> CONFIG
DATA --> OPENAI
DATA --> SEARCH_API
DATA --> EXT_SYS
```
## Use Cases
### Documentation Management
- Index official documentation sites
- Search for APIs, functions, and usage examples
- Maintain up-to-date development references
### Knowledge Bases
- Index company wikis and internal documentation
- Search for policies, procedures, and best practices
- Centralize organizational knowledge
### Research and Learning
- Index technical blogs and tutorials
- Search for specific topics and technologies
- Build personal knowledge repositories
### Project Documentation
- Index project-specific documentation
- Search for implementation patterns
- Maintain project knowledge bases
## Configuration
RAG Retriever is highly configurable through `config.yaml`:
```yaml
# Crawler selection
crawler:
  type: "crawl4ai" # or "playwright"

# Search settings
search:
  default_limit: 8
  default_score_threshold: 0.3

# Content processing
content:
  chunk_size: 2000
  chunk_overlap: 400

# API configuration
api:
  openai_api_key: sk-your-key-here
```
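The `chunk_size` and `chunk_overlap` settings suggest a sliding-window splitter: each chunk advances by `chunk_size - chunk_overlap`, so consecutive chunks share their boundary text and no sentence is lost at a cut point. A sketch under that assumption (character counts for illustration; the actual splitter may differ):

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    # Sliding-window chunking: each window starts chunk_overlap
    # characters before the previous one ended.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]

chunks = chunk_text("abcdefghij" * 3, chunk_size=10, chunk_overlap=2)
print(chunks[0], chunks[1][:2])  # abcdefghij ij
```

The last two characters of each chunk reappear at the start of the next, which is the overlap the config controls.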
## Requirements
- Python 3.10-3.12 (package metadata requires `>=3.10,<3.13`)
- OpenAI API key
- Git (for system functionality)
- ~500MB disk space for dependencies
## Installation
See [`QUICKSTART.md`](QUICKSTART.md) for exact installation commands, or use the AI assistant prompts for guided setup.
## Data Storage
Your content is stored locally in:
- **macOS/Linux**: `~/.local/share/rag-retriever/`
- **Windows**: `%LOCALAPPDATA%\rag-retriever\`
Collections persist between sessions and are automatically managed.
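The platform-specific paths above can be resolved programmatically; a sketch of the convention, not necessarily how rag-retriever itself computes the path:

```python
import os
import sys

def data_dir() -> str:
    # Resolve the storage locations listed above (illustrative only).
    if sys.platform == "win32":
        return os.path.join(os.environ["LOCALAPPDATA"], "rag-retriever")
    return os.path.expanduser("~/.local/share/rag-retriever")

print(data_dir())
```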
## Performance
- **Crawl4AI**: Up to 20x faster than the Playwright-based crawler
- **Embedding Caching**: Efficient vector storage and retrieval
- **Parallel Processing**: Concurrent indexing and search
- **Optimized Chunking**: Configurable content processing
## Security
- **Local Storage**: All data stored locally, no cloud dependencies
- **API Key Protection**: Secure configuration management
- **Permission Controls**: MCP server permission management
- **Source Tracking**: Complete audit trail of indexed content
## Contributing
RAG Retriever is open source and welcomes contributions. See the repository for guidelines.
## License
MIT License - see LICENSE file for details.
## Support
- **Documentation**: Use the AI assistant prompts for guidance
- **Issues**: Report bugs and request features via GitHub issues
- **Community**: Join discussions and share usage patterns
---
**Remember**: Use the AI assistant prompts above rather than reading through documentation. Your AI assistant can guide you through setup, usage, and troubleshooting much more effectively than traditional documentation!