rag-retriever

Name	rag-retriever
Version	0.4.1
home_page	None
Summary	A tool for crawling, indexing, and semantically searching web content with RAG capabilities
upload_time	2025-07-09 18:29:20
maintainer	None
docs_url	None
author	None
requires_python	<3.13,>=3.10
license	MIT
keywords	ai rag embeddings semantic-search web-crawler vector-store
requirements	No requirements were recorded.

# RAG Retriever

A semantic search system that crawls websites, indexes their content, and provides AI-powered search through an MCP server. It is built on a modular architecture that uses OpenAI embeddings and a ChromaDB vector store.

## 🚀 Quick Start with AI Assistant

Let your AI coding assistant help you set up and use RAG Retriever:

**Setup**: Direct your AI assistant to [`SETUP_ASSISTANT_PROMPT.md`](SETUP_ASSISTANT_PROMPT.md)  
**Usage**: Direct your AI assistant to [`USAGE_ASSISTANT_PROMPT.md`](USAGE_ASSISTANT_PROMPT.md)  
**CLI Operations**: Direct your AI assistant to [`CLI_ASSISTANT_PROMPT.md`](CLI_ASSISTANT_PROMPT.md)  
**Administration**: Direct your AI assistant to [`ADMIN_ASSISTANT_PROMPT.md`](ADMIN_ASSISTANT_PROMPT.md)  
**Advanced Content**: Direct your AI assistant to [`ADVANCED_CONTENT_INGESTION_PROMPT.md`](ADVANCED_CONTENT_INGESTION_PROMPT.md)  
**Troubleshooting**: Direct your AI assistant to [`TROUBLESHOOTING_ASSISTANT_PROMPT.md`](TROUBLESHOOTING_ASSISTANT_PROMPT.md)

**Quick Commands**: See [`QUICKSTART.md`](QUICKSTART.md) for copy-paste installation commands.

These prompts give your AI assistant comprehensive instructions for walking you through setup, usage, and troubleshooting, so you do not need to read through the documentation yourself.

## What RAG Retriever Does

RAG Retriever enhances your AI coding workflows by providing:

- **Website Crawling**: Index documentation sites, blogs, and knowledge bases
- **Semantic Search**: Find relevant information using natural language queries
- **Collection Management**: Organize content into themed collections
- **MCP Integration**: Direct access from Claude Code and other AI assistants
- **Fast Processing**: Up to 20x faster crawling with the Crawl4AI option
- **Rich Metadata**: Extract titles, descriptions, and source attribution

## Key Features

### 🌐 Advanced Web Crawling
- **Playwright**: Reliable JavaScript-enabled crawling
- **Crawl4AI**: High-performance crawling with content filtering
- **Configurable depth**: Control how deep to crawl linked pages
- **Same-domain focus**: Automatically stays within target sites
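
The depth and same-domain behavior listed above can be pictured as a breadth-first traversal. The following is a minimal sketch, not the package's crawler: `crawl_same_domain`, `get_links`, and the `SITE` link graph are hypothetical names invented for illustration, and a real crawler would fetch and parse pages rather than read a dictionary.

```python
from collections import deque
from urllib.parse import urlparse


def crawl_same_domain(start_url, get_links, max_depth=2):
    """Breadth-first crawl that never leaves the start URL's host.

    get_links is a hypothetical callable mapping a URL to the links
    found on that page; a real crawler would fetch and parse the page.
    """
    start_host = urlparse(start_url).netloc
    visited = [start_url]
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(url):
            if link in seen or urlparse(link).netloc != start_host:
                continue  # skip repeats and off-site links
            seen.add(link)
            visited.append(link)
            queue.append((link, depth + 1))
    return visited


# Toy link graph standing in for real pages (assumption for the demo).
SITE = {
    "https://docs.example.com/": [
        "https://docs.example.com/a",
        "https://other.example.org/x",
    ],
    "https://docs.example.com/a": ["https://docs.example.com/b"],
    "https://docs.example.com/b": ["https://docs.example.com/c"],
}

pages = crawl_same_domain(
    "https://docs.example.com/", lambda u: SITE.get(u, []), max_depth=2
)
print(pages)
```

With `max_depth=2`, the page at `/c` is never reached and the off-site link is ignored.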

### 🔍 Semantic Search
- **OpenAI Embeddings**: Uses text-embedding-3-large for high-quality vectors
- **Relevance Scoring**: Configurable similarity thresholds
- **Cross-Collection Search**: Search across all collections simultaneously
- **Source Attribution**: Track where information comes from
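
To make the threshold and cross-collection ideas concrete, here is a rough sketch of merging per-collection hits into one ranked list. It assumes that a higher score means a closer match and borrows the defaults (`limit=8`, `score_threshold=0.3`) from the `config.yaml` example in this README; `merge_results` and the sample data are hypothetical and do not reflect the package's internal API.

```python
def merge_results(results_by_collection, limit=8, score_threshold=0.3):
    """Merge per-collection hits into one ranked list.

    results_by_collection maps a collection name to (score, text) pairs.
    Assumption: a higher score means a closer match.
    """
    merged = [
        {"collection": name, "score": score, "text": text}
        for name, hits in results_by_collection.items()
        for score, text in hits
        if score >= score_threshold  # drop weak matches
    ]
    merged.sort(key=lambda hit: hit["score"], reverse=True)
    return merged[:limit]


# Made-up hits from two hypothetical collections.
hits = merge_results(
    {
        "python_docs": [(0.82, "try/except basics"), (0.21, "install guide")],
        "blog_posts": [(0.47, "error handling tips")],
    }
)
for hit in hits:
    print(hit["collection"], hit["score"], hit["text"])
```

The hit scored 0.21 falls below the threshold and is dropped; the remaining two are returned in descending score order, each tagged with its collection for source attribution.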

### 📚 Collection Management
- **Named Collections**: Organize content by topic, project, or source
- **Metadata Tracking**: Creation dates, document counts, descriptions
- **Health Monitoring**: Audit collections for quality and freshness
- **Easy Cleanup**: Remove or rebuild collections as needed

### 🎯 Quality Management
- **Content Quality Assessment**: Systematic evaluation of indexed content
- **AI-Powered Quality Review**: Use AI to assess accuracy and completeness
- **Contradiction Detection**: Find conflicting information across collections
- **Relevance Monitoring**: Track search quality metrics over time
- **Best Practice Guidance**: Comprehensive collection organization strategies

### 🤖 AI Integration
- **MCP Server**: Direct integration with Claude Code
- **Custom Commands**: Pre-built workflows for common tasks
- **Tool Descriptions**: Clear interfaces for AI assistants
- **Permission Management**: Secure access controls

## MCP vs CLI vs Web UI Capabilities

### MCP Server (Claude Code Integration)
The MCP server provides **secure, AI-friendly access** to core functionality:

- **Web Crawling**: Index websites and documentation
- **Semantic Search**: Search across collections with relevance scoring
- **Collection Discovery**: List and explore available collections
- **Quality Assessment**: Audit content quality and system health
- **Intentionally Limited**: Administrative operations are not exposed, for security reasons

### CLI (Full Administrative Control)
The command-line interface provides **complete system control**:

- **All MCP Capabilities**: Everything available through MCP server
- **Collection Management**: Delete collections, clean entire vector store
- **Advanced Content Ingestion**: Images, PDFs, GitHub repos, Confluence
- **Local File Processing**: Directory scanning, bulk operations
- **System Administration**: Configuration, maintenance, troubleshooting
- **Rich Output Options**: JSON, verbose logging, custom formatting

### Web UI (Visual Management)
The Streamlit-based web interface provides **intuitive visual control**:

- **Interactive Search**: Visual search interface with adjustable parameters
- **Collection Management**: View, delete, edit descriptions, compare collections
- **Content Discovery**: Web search and direct content indexing workflow
- **Visual Analytics**: Statistics, charts, and collection comparisons
- **User-Friendly**: No command-line knowledge required
- **Real-time Feedback**: Immediate visual confirmation of operations

### When to Use Each Interface

| Task | MCP Server | CLI | Web UI | Recommendation |
|------|------------|-----|---------|----------------|
| Search content | ✅ | ✅ | ✅ | **MCP** for AI workflows, **UI** for interactive exploration |
| Index websites | ✅ | ✅ | ✅ | **UI** for discovery workflow, **MCP** for AI integration |
| Delete collections | ❌ | ✅ | ✅ | **UI** for visual confirmation, **CLI** for scripting |
| Edit collection metadata | ❌ | ❌ | ✅ | **UI** only option |
| Visual analytics | ❌ | ❌ | ✅ | **UI** only option |
| Content discovery | ❌ | ❌ | ✅ | **UI** provides search → select → index workflow |
| Process local files | ❌ | ✅ | ❌ | **CLI** only option |
| Analyze images | ❌ | ✅ | ❌ | **CLI** only option |
| GitHub integration | ❌ | ✅ | ❌ | **CLI** only option |
| System maintenance | ❌ | ✅ | ❌ | **CLI** only option |
| AI assistant integration | ✅ | ❌ | ❌ | **MCP** designed for AI workflows |
| Visual collection comparison | ❌ | ❌ | ✅ | **UI** provides interactive charts |

## Available Claude Code Commands

Once configured as an MCP server, you can use:

### `/rag-list-collections`
Discover all available vector store collections with document counts and metadata.

### `/rag-search-knowledge "query [collection] [limit] [threshold]"`
Search indexed content using semantic similarity:
- `"python documentation"` - searches default collection
- `"python documentation python_docs"` - searches specific collection  
- `"python documentation all"` - searches ALL collections
- `"error handling all 10 0.4"` - custom parameters

### `/rag-index-website "url [max_depth] [collection]"`
Crawl and index website content:
- `"https://docs.python.org"` - index with defaults
- `"https://docs.python.org 3"` - custom crawl depth
- `"https://docs.python.org 2 python_docs"` - custom crawl depth and collection

### `/rag-audit-collections`
Review collection health, identify issues, and get maintenance recommendations.

### `/rag-assess-quality`
Systematically evaluate content quality, accuracy, and reliability to ensure high-quality search results.

### `/rag-manage-collections`
Administrative collection operations including deletion and cleanup (provides CLI commands).

### `/rag-ingest-content`
Guides you through advanced content ingestion for local files, images, and enterprise systems.

### `/rag-cli-help`
Interactive CLI command builder and comprehensive help system.

## Web UI

Launch the visual interface with: `rag-retriever --ui`

### Collection Management
![Collection Management](docs/images/rag-retriever-UI-collections.png)
*Comprehensive collection overview with statistics, metadata, and management actions*

### Collection Actions and Deletion
![Collection Actions](docs/images/rag-retreiver-UI-delete-collection.png)
*Collection management interface showing edit description and delete collection options with visual confirmation*

### Interactive Knowledge Search
![Search Interface](docs/images/rag-retriever-UI-search.png)
*Search indexed content with adjustable parameters (max results, score threshold) and explore results with metadata and expandable content*

### Collection Analytics and Comparison
![Collection Comparison](docs/images/rag-retriever-UI-compare-collections.png)
*Side-by-side collection comparison with interactive charts showing document counts, chunks, and performance metrics*

### Content Discovery and Indexing Workflow
![Content Discovery](docs/images/rag-retreiver-UI-discover-and-index-new-web-content.png)
*Search the web, select relevant content, adjust crawl depth, and index directly into collections - complete discovery-to-indexing workflow*

The Web UI excels at:
- **Content Discovery Workflow**: Search → Select → Adjust → Index new content in one seamless interface
- **Visual Collection Management**: View statistics, edit descriptions, delete collections with confirmation
- **Interactive Search**: Real-time parameter adjustment and visual exploration of indexed content
- **Collection Analytics**: Compare collections with interactive charts and performance metrics
- **Administrative Tasks**: User-friendly collection deletion and management operations

## How It Works

1. **Content Ingestion**: Web pages are crawled and processed into clean text
2. **Embedding Generation**: Text is converted to vectors using OpenAI's embedding models
3. **Vector Storage**: Embeddings are stored in ChromaDB with metadata
4. **Semantic Search**: Queries are embedded and matched against stored vectors
5. **Result Ranking**: Results are ranked by similarity and returned with sources
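
The five steps above can be sketched end to end in a few lines. This is a toy illustration only: the bag-of-words `embed` function stands in for OpenAI's embedding models, the in-memory `TinyStore` stands in for ChromaDB, and every name here is hypothetical rather than part of RAG Retriever.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class TinyStore:
    """In-memory stand-in for a vector store with source metadata."""

    def __init__(self):
        self.rows = []  # each row: (vector, text, source)

    def add(self, text, source):
        # Steps 2 and 3: embed the text and store it with its source.
        self.rows.append((embed(text), text, source))

    def search(self, query, limit=3):
        # Steps 4 and 5: embed the query, score every row, rank by similarity.
        q = embed(query)
        scored = [(cosine(q, vec), text, source) for vec, text, source in self.rows]
        scored.sort(key=lambda row: row[0], reverse=True)
        return scored[:limit]


store = TinyStore()
# Step 1 (ingestion) is assumed to have produced these clean text snippets.
store.add("Use try and except blocks for error handling in Python.",
          "https://example.com/errors")
store.add("Lists and dictionaries are built-in Python containers.",
          "https://example.com/containers")

top = store.search("python error handling", limit=1)[0]
print(top[2])
```

A real embedding model captures meaning rather than word overlap, so this sketch only shows the shape of the pipeline, not the quality of its results.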

## Architecture

### Layered Content Ingestion Architecture

```mermaid
flowchart TD
    subgraph CS ["CONTENT SOURCES"]
        subgraph WC ["Web Content"]
            WC1["Playwright"]
            WC2["Crawl4AI"] 
            WC3["Web Search"]
            WC4["Discovery UI"]
        end
        subgraph LF ["Local Files"]
            LF1["PDF Files"]
            LF2["Markdown"]
            LF3["Text Files"]
            LF4["Directories"]
        end
        subgraph RM ["Rich Media"]
            RM1["Images"]
            RM2["Screenshots"]
            RM3["Diagrams"]
            RM4["OpenAI Vision"]
        end
        subgraph ES ["Enterprise Systems"]
            ES1["GitHub Repos"]
            ES2["Confluence Spaces"]
            ES3["Private Repos"]
            ES4["Branch Selection"]
        end
    end
    
    subgraph PP ["PROCESSING PIPELINE"]
        subgraph CC ["Content Cleaning"]
            CC1["HTML Parsing"]
            CC2["Text Extract"]
            CC3["Format Normal"]
        end
        subgraph TC ["Text Chunking"]
            TC1["Smart Splits"]
            TC2["Overlap Mgmt"]
            TC3["Size Control"]
        end
        subgraph EB ["Embedding"]
            EB1["OpenAI API"]
            EB2["Vector Gen"]
            EB3["Batch Process"]
        end
        subgraph QA ["Quality Assessment"]
            QA1["Relevance Scoring"]
            QA2["Search Quality"]
            QA3["Collection Auditing"]
        end
    end
    
    subgraph SSE ["STORAGE & SEARCH ENGINE"]
        subgraph CD ["ChromaDB"]
            CD1["Vector Store"]
            CD2["Persistence"]
            CD3["Performance"]
        end
        subgraph COL ["Collections"]
            COL1["Topic-based"]
            COL2["Named Groups"]
            COL3["Metadata"]
        end
        subgraph SS ["Semantic Search"]
            SS1["Similarity"]
            SS2["Thresholds"]
            SS3["Cross-search"]
        end
        subgraph MS ["Metadata Store"]
            MS1["Source Attribution"]
            MS2["Timestamps"]
            MS3["Descriptions"]
        end
    end
    
    subgraph UI ["USER INTERFACES"]
        subgraph WUI ["Web UI"]
            WUI1["Discovery"]
            WUI2["Visual Mgmt"]
            WUI3["Interactive"]
        end
        subgraph CLI ["CLI"]
            CLI1["Full Admin"]
            CLI2["All Features"]
            CLI3["Maintenance"]
        end
        subgraph MCP ["MCP Server"]
            MCP1["Tool Provider"]
            MCP2["Secure Ops"]
            MCP3["FastMCP"]
        end
        subgraph AI ["AI Assistant Integ"]
            AI1["Claude Code Cmds"]
            AI2["AI Workflows"]
            AI3["Assistant Commands"]
        end
    end
    
    CS --> PP
    PP --> SSE
    SSE --> UI
```

### Technical Component Architecture

```mermaid
graph TB
    subgraph RAG ["RAG RETRIEVER SYSTEM"]
        subgraph INTERFACES ["USER INTERFACES"]
            WEB["Streamlit Web UI<br/>(ui/app.py)<br/>• Discovery<br/>• Collections<br/>• Search"]
            CLI_MOD["CLI Module<br/>(cli.py)<br/>• Full Control<br/>• Admin Ops<br/>• All Features<br/>• Maintenance"]
            MCP_SRV["MCP Server<br/>(mcp/server.py)<br/>• FastMCP Framework<br/>• Tool Definitions<br/>• AI Integration<br/>• Claude Code Support"]
        end
        
        subgraph CORE ["CORE ENGINE"]
            PROC["Content Processing<br/>(main.py)<br/>• URL Processing<br/>• Search Coordination<br/>• Orchestration"]
            LOADERS["Document Loaders<br/>• LocalLoader<br/>• ImageLoader<br/>• GitHubLoader<br/>• ConfluenceLoader"]
            SEARCH["Search Engine<br/>(searcher.py)<br/>• Semantic Search<br/>• Cross-collection<br/>• Score Ranking"]
        end
        
        subgraph DATA ["DATA LAYER"]
            VECTOR["Vector Store<br/>(store.py)<br/>• ChromaDB<br/>• Collections<br/>• Metadata<br/>• Persistence"]
            CRAWLERS["Web Crawlers<br/>(crawling/)<br/>• Playwright<br/>• Crawl4AI<br/>• ContentClean<br/>• URL Handling"]
            CONFIG["Config System<br/>(config.py)<br/>• YAML Config<br/>• User Settings<br/>• API Keys<br/>• Validation"]
        end
        
        subgraph EXTERNAL ["EXTERNAL APIS"]
            OPENAI["OpenAI API<br/>• Embeddings<br/>• Vision Model<br/>• Batch Process"]
            SEARCH_API["Search APIs<br/>• Google Search<br/>• DuckDuckGo<br/>• Web Discovery"]
            EXT_SYS["External Systems<br/>• GitHub API<br/>• Confluence<br/>• Git Repos"]
        end
    end
    
    WEB --> PROC
    CLI_MOD --> PROC
    MCP_SRV --> PROC
    
    PROC <--> LOADERS
    PROC <--> SEARCH
    LOADERS <--> SEARCH
    
    CORE --> VECTOR
    CORE --> CRAWLERS
    CORE --> CONFIG
    
    DATA --> OPENAI
    DATA --> SEARCH_API
    DATA --> EXT_SYS
```

## Use Cases

### Documentation Management
- Index official documentation sites
- Search for APIs, functions, and usage examples
- Maintain up-to-date development references

### Knowledge Bases
- Index company wikis and internal documentation
- Search for policies, procedures, and best practices
- Centralize organizational knowledge

### Research and Learning
- Index technical blogs and tutorials
- Search for specific topics and technologies
- Build personal knowledge repositories

### Project Documentation
- Index project-specific documentation
- Search for implementation patterns
- Maintain project knowledge bases

## Configuration

RAG Retriever is highly configurable through `config.yaml`:

```yaml
# Crawler selection
crawler:
  type: "crawl4ai"  # or "playwright"

# Search settings
search:
  default_limit: 8
  default_score_threshold: 0.3

# Content processing
content:
  chunk_size: 2000
  chunk_overlap: 400

# API configuration
api:
  openai_api_key: sk-your-key-here
```
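
To show how `chunk_size` and `chunk_overlap` interact, here is a minimal sliding-window sketch. It assumes both values are measured in characters, which this README does not state, and `chunk_text` is a hypothetical helper; the actual splitter may break on sentence or paragraph boundaries instead.

```python
def chunk_text(text, chunk_size=2000, chunk_overlap=400):
    """Split text into fixed-size windows that share chunk_overlap characters.

    Assumption: sizes are in characters, not tokens.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# 5000 characters of repeating alphabet as sample input.
sample = "".join(chr(97 + i % 26) for i in range(5000))
chunks = chunk_text(sample)
print([len(c) for c in chunks])
```

With the values shown above, each window advances 1600 characters, so the last 400 characters of one chunk reappear at the start of the next. That overlap keeps a sentence that straddles a boundary searchable from either side.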

## Requirements

- Python 3.10, 3.11, or 3.12 (`>=3.10,<3.13`)
- OpenAI API key
- Git (for system functionality)
- ~500MB disk space for dependencies

## Installation

See [`QUICKSTART.md`](QUICKSTART.md) for exact installation commands, or use the AI assistant prompts for guided setup.

## Data Storage

Your content is stored locally in:
- **macOS/Linux**: `~/.local/share/rag-retriever/`
- **Windows**: `%LOCALAPPDATA%\rag-retriever\`

Collections persist between sessions and are automatically managed.
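
If you need to locate this directory from a script, a small helper along these lines may be useful. It is a sketch that simply mirrors the two paths listed above; `default_data_dir` is a hypothetical name, the `AppData/Local` fallback is an assumption, and the package may resolve its paths differently.

```python
import os
import sys
from pathlib import Path


def default_data_dir(platform=None, env=None):
    """Return the documented data directory for the given platform.

    platform and env default to the running system; both are parameters
    so the function can be exercised without changing real settings.
    """
    platform = sys.platform if platform is None else platform
    env = os.environ if env is None else env
    if platform.startswith("win"):
        # Fallback when LOCALAPPDATA is unset is an assumption.
        base = env.get("LOCALAPPDATA", str(Path.home() / "AppData" / "Local"))
        return Path(base) / "rag-retriever"
    return Path.home() / ".local" / "share" / "rag-retriever"


print(default_data_dir())
```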

## Performance

- **Crawl4AI**: Up to 20x faster than traditional crawling
- **Embedding Caching**: Efficient vector storage and retrieval
- **Parallel Processing**: Concurrent indexing and search
- **Optimized Chunking**: Configurable content processing

## Security

- **Local Storage**: Indexed content and vectors are stored on your machine; note that text is sent to the OpenAI API to generate embeddings
- **API Key Protection**: Secure configuration management
- **Permission Controls**: MCP server permission management
- **Source Tracking**: Complete audit trail of indexed content

## Contributing

RAG Retriever is open source and welcomes contributions. See the repository for guidelines.

## License

MIT License - see LICENSE file for details.

## Support

- **Documentation**: Use the AI assistant prompts for guidance
- **Issues**: Report bugs and request features via GitHub issues
- **Community**: Join discussions and share usage patterns

---

**Remember**: Use the AI assistant prompts above rather than reading through documentation. Your AI assistant can guide you through setup, usage, and troubleshooting much more effectively than traditional documentation!

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "rag-retriever",
    "maintainer": null,
    "docs_url": null,
    "requires_python": "<3.13,>=3.10",
    "maintainer_email": null,
    "keywords": "ai, rag, embeddings, semantic-search, web-crawler, vector-store",
    "author": null,
    "author_email": "Tim Kitchens <codingthefuturewithai@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/3d/2d/42821402c6dc7b7540b38ffd3bcc0786f0c2748e758b704fd0cee537279f/rag_retriever-0.4.1.tar.gz",
    "platform": null,
    "description": "# RAG Retriever\n\nA semantic search system that crawls websites, indexes content, and provides AI-powered search capabilities through an MCP server. Built with modular architecture using OpenAI embeddings and ChromaDB vector store.\n\n## \ud83d\ude80 Quick Start with AI Assistant\n\nLet your AI coding assistant help you set up and use RAG Retriever:\n\n**Setup**: Direct your AI assistant to [`SETUP_ASSISTANT_PROMPT.md`](SETUP_ASSISTANT_PROMPT.md)  \n**Usage**: Direct your AI assistant to [`USAGE_ASSISTANT_PROMPT.md`](USAGE_ASSISTANT_PROMPT.md)  \n**CLI Operations**: Direct your AI assistant to [`CLI_ASSISTANT_PROMPT.md`](CLI_ASSISTANT_PROMPT.md)  \n**Administration**: Direct your AI assistant to [`ADMIN_ASSISTANT_PROMPT.md`](ADMIN_ASSISTANT_PROMPT.md)  \n**Advanced Content**: Direct your AI assistant to [`ADVANCED_CONTENT_INGESTION_PROMPT.md`](ADVANCED_CONTENT_INGESTION_PROMPT.md)  \n**Troubleshooting**: Direct your AI assistant to [`TROUBLESHOOTING_ASSISTANT_PROMPT.md`](TROUBLESHOOTING_ASSISTANT_PROMPT.md)\n\n**Quick Commands**: See [`QUICKSTART.md`](QUICKSTART.md) for copy-paste installation commands.\n\nThese prompts provide comprehensive instructions for your AI assistant to walk you through setup, usage, and troubleshooting without needing to read through documentation.\n\n## What RAG Retriever Does\n\nRAG Retriever enhances your AI coding workflows by providing:\n\n- **Website Crawling**: Index documentation sites, blogs, and knowledge bases\n- **Semantic Search**: Find relevant information using natural language queries\n- **Collection Management**: Organize content into themed collections\n- **MCP Integration**: Direct access from Claude Code and other AI assistants\n- **Fast Processing**: 20x faster crawling with Crawl4AI option\n- **Rich Metadata**: Extract titles, descriptions, and source attribution\n\n## Key Features\n\n### \ud83c\udf10 Advanced Web Crawling\n- **Playwright**: Reliable JavaScript-enabled crawling\n- **Crawl4AI**: 
High-performance crawling with content filtering\n- **Configurable depth**: Control how deep to crawl linked pages\n- **Same-domain focus**: Automatically stays within target sites\n\n### \ud83d\udd0d Semantic Search\n- **OpenAI Embeddings**: Uses text-embedding-3-large for high-quality vectors\n- **Relevance Scoring**: Configurable similarity thresholds\n- **Cross-Collection Search**: Search across all collections simultaneously\n- **Source Attribution**: Track where information comes from\n\n### \ud83d\udcda Collection Management\n- **Named Collections**: Organize content by topic, project, or source\n- **Metadata Tracking**: Creation dates, document counts, descriptions\n- **Health Monitoring**: Audit collections for quality and freshness\n- **Easy Cleanup**: Remove or rebuild collections as needed\n\n### \ud83c\udfaf Quality Management\n- **Content Quality Assessment**: Systematic evaluation of indexed content\n- **AI-Powered Quality Review**: Use AI to assess accuracy and completeness\n- **Contradiction Detection**: Find conflicting information across collections\n- **Relevance Monitoring**: Track search quality metrics over time\n- **Best Practice Guidance**: Comprehensive collection organization strategies\n\n### \ud83e\udd16 AI Integration\n- **MCP Server**: Direct integration with Claude Code\n- **Custom Commands**: Pre-built workflows for common tasks\n- **Tool Descriptions**: Clear interfaces for AI assistants\n- **Permission Management**: Secure access controls\n\n## MCP vs CLI Capabilities\n\n### MCP Server (Claude Code Integration)\nThe MCP server provides **secure, AI-friendly access** to core functionality:\n\n- **Web Crawling**: Index websites and documentation\n- **Semantic Search**: Search across collections with relevance scoring\n- **Collection Discovery**: List and explore available collections\n- **Quality Assessment**: Audit content quality and system health\n- **Intentionally Limited**: No administrative operations for security\n\n### CLI 
(Full Administrative Control)\nThe command-line interface provides **complete system control**:\n\n- **All MCP Capabilities**: Everything available through MCP server\n- **Collection Management**: Delete collections, clean entire vector store\n- **Advanced Content Ingestion**: Images, PDFs, GitHub repos, Confluence\n- **Local File Processing**: Directory scanning, bulk operations\n- **System Administration**: Configuration, maintenance, troubleshooting\n- **Rich Output Options**: JSON, verbose logging, custom formatting\n\n### Web UI (Visual Management)\nThe Streamlit-based web interface provides **intuitive visual control**:\n\n- **Interactive Search**: Visual search interface with adjustable parameters\n- **Collection Management**: View, delete, edit descriptions, compare collections\n- **Content Discovery**: Web search and direct content indexing workflow\n- **Visual Analytics**: Statistics, charts, and collection comparisons\n- **User-Friendly**: No command-line knowledge required\n- **Real-time Feedback**: Immediate visual confirmation of operations\n\n### When to Use Each Interface\n\n| Task | MCP Server | CLI | Web UI | Recommendation |\n|------|------------|-----|---------|----------------|\n| Search content | \u2705 | \u2705 | \u2705 | **MCP** for AI workflows, **UI** for interactive exploration |\n| Index websites | \u2705 | \u2705 | \u2705 | **UI** for discovery workflow, **MCP** for AI integration |\n| Delete collections | \u274c | \u2705 | \u2705 | **UI** for visual confirmation, **CLI** for scripting |\n| Edit collection metadata | \u274c | \u274c | \u2705 | **UI** only option |\n| Visual analytics | \u274c | \u274c | \u2705 | **UI** only option |\n| Content discovery | \u274c | \u274c | \u2705 | **UI** provides search \u2192 select \u2192 index workflow |\n| Process local files | \u274c | \u2705 | \u274c | **CLI** only option |\n| Analyze images | \u274c | \u2705 | \u274c | **CLI** only option |\n| GitHub integration | \u274c | \u2705 | \u274c | 
**CLI** only option |\n| System maintenance | \u274c | \u2705 | \u274c | **CLI** only option |\n| AI assistant integration | \u2705 | \u274c | \u274c | **MCP** designed for AI workflows |\n| Visual collection comparison | \u274c | \u274c | \u2705 | **UI** provides interactive charts |\n\n## Available Claude Code Commands\n\nOnce configured as an MCP server, you can use:\n\n### `/rag-list-collections`\nDiscover all available vector store collections with document counts and metadata.\n\n### `/rag-search-knowledge \"query [collection] [limit] [threshold]\"`\nSearch indexed content using semantic similarity:\n- `\"python documentation\"` - searches default collection\n- `\"python documentation python_docs\"` - searches specific collection  \n- `\"python documentation all\"` - searches ALL collections\n- `\"error handling all 10 0.4\"` - custom parameters\n\n### `/rag-index-website \"url [max_depth] [collection]\"`\nCrawl and index website content:\n- `\"https://docs.python.org\"` - index with defaults\n- `\"https://docs.python.org 3\"` - custom crawl depth\n- `\"https://docs.python.org python_docs 2\"` - custom collection\n\n### `/rag-audit-collections`\nReview collection health, identify issues, and get maintenance recommendations.\n\n### `/rag-assess-quality`\nSystematically evaluate content quality, accuracy, and reliability to ensure high-quality search results.\n\n### `/rag-manage-collections`\nAdministrative collection operations including deletion and cleanup (provides CLI commands).\n\n### `/rag-ingest-content`\nGuide through advanced content ingestion for local files, images, and enterprise systems.\n\n### `/rag-cli-help`\nInteractive CLI command builder and comprehensive help system.\n\n## Web UI Interface\n\nLaunch the visual interface with: `rag-retriever --ui`\n\n### Collection Management\n![Collection Management](docs/images/rag-retriever-UI-collections.png)\n*Comprehensive collection overview with statistics, metadata, and management actions*\n\n### 
Collection Actions and Deletion\n![Collection Actions](docs/images/rag-retreiver-UI-delete-collection.png)\n*Collection management interface showing edit description and delete collection options with visual confirmation*\n\n### Interactive Knowledge Search\n![Search Interface](docs/images/rag-retriever-UI-search.png)\n*Search indexed content with adjustable parameters (max results, score threshold) and explore results with metadata and expandable content*\n\n### Collection Analytics and Comparison\n![Collection Comparison](docs/images/rag-retriever-UI-compare-collections.png)\n*Side-by-side collection comparison with interactive charts showing document counts, chunks, and performance metrics*\n\n### Content Discovery and Indexing Workflow\n![Content Discovery](docs/images/rag-retreiver-UI-discover-and-index-new-web-content.png)\n*Search the web, select relevant content, adjust crawl depth, and index directly into collections - complete discovery-to-indexing workflow*\n\nThe Web UI excels at:\n- **Content Discovery Workflow**: Search \u2192 Select \u2192 Adjust \u2192 Index new content in one seamless interface\n- **Visual Collection Management**: View statistics, edit descriptions, delete collections with confirmation\n- **Interactive Search**: Real-time parameter adjustment and visual exploration of indexed content\n- **Collection Analytics**: Compare collections with interactive charts and performance metrics\n- **Administrative Tasks**: User-friendly collection deletion and management operations\n\n## How It Works\n\n1. **Content Ingestion**: Web pages are crawled and processed into clean text\n2. **Embedding Generation**: Text is converted to vectors using OpenAI's embedding models\n3. **Vector Storage**: Embeddings are stored in ChromaDB with metadata\n4. **Semantic Search**: Queries are embedded and matched against stored vectors\n5. 
**Result Ranking**: Results are ranked by similarity and returned with sources\n\n## Architecture\n\n### Layered Content Ingestion Architecture\n\n```mermaid\nflowchart TD\n    subgraph CS [\"CONTENT SOURCES\"]\n        subgraph WC [\"Web Content\"]\n            WC1[\"Playwright\"]\n            WC2[\"Crawl4AI\"] \n            WC3[\"Web Search\"]\n            WC4[\"Discovery UI\"]\n        end\n        subgraph LF [\"Local Files\"]\n            LF1[\"PDF Files\"]\n            LF2[\"Markdown\"]\n            LF3[\"Text Files\"]\n            LF4[\"Directories\"]\n        end\n        subgraph RM [\"Rich Media\"]\n            RM1[\"Images\"]\n            RM2[\"Screenshots\"]\n            RM3[\"Diagrams\"]\n            RM4[\"OpenAI Vision\"]\n        end\n        subgraph ES [\"Enterprise Systems\"]\n            ES1[\"GitHub Repos\"]\n            ES2[\"Confluence Spaces\"]\n            ES3[\"Private Repos\"]\n            ES4[\"Branch Selection\"]\n        end\n    end\n    \n    subgraph PP [\"PROCESSING PIPELINE\"]\n        subgraph CC [\"Content Cleaning\"]\n            CC1[\"HTML Parsing\"]\n            CC2[\"Text Extract\"]\n            CC3[\"Format Normal\"]\n        end\n        subgraph TC [\"Text Chunking\"]\n            TC1[\"Smart Splits\"]\n            TC2[\"Overlap Mgmt\"]\n            TC3[\"Size Control\"]\n        end\n        subgraph EB [\"Embedding\"]\n            EB1[\"OpenAI API\"]\n            EB2[\"Vector Gen\"]\n            EB3[\"Batch Process\"]\n        end\n        subgraph QA [\"Quality Assessment\"]\n            QA1[\"Relevance Scoring\"]\n            QA2[\"Search Quality\"]\n            QA3[\"Collection Auditing\"]\n        end\n    end\n    \n    subgraph SSE [\"STORAGE & SEARCH ENGINE\"]\n        subgraph CD [\"ChromaDB\"]\n            CD1[\"Vector Store\"]\n            CD2[\"Persistence\"]\n            CD3[\"Performance\"]\n        end\n        subgraph COL [\"Collections\"]\n            COL1[\"Topic-based\"]\n            COL2[\"Named 
Groups\"]\n            COL3[\"Metadata\"]\n        end\n        subgraph SS [\"Semantic Search\"]\n            SS1[\"Similarity\"]\n            SS2[\"Thresholds\"]\n            SS3[\"Cross-search\"]\n        end\n        subgraph MS [\"Metadata Store\"]\n            MS1[\"Source Attribution\"]\n            MS2[\"Timestamps\"]\n            MS3[\"Descriptions\"]\n        end\n    end\n    \n    subgraph UI [\"USER INTERFACES\"]\n        subgraph WUI [\"Web UI\"]\n            WUI1[\"Discovery\"]\n            WUI2[\"Visual Mgmt\"]\n            WUI3[\"Interactive\"]\n        end\n        subgraph CLI [\"CLI\"]\n            CLI1[\"Full Admin\"]\n            CLI2[\"All Features\"]\n            CLI3[\"Maintenance\"]\n        end\n        subgraph MCP [\"MCP Server\"]\n            MCP1[\"Tool Provider\"]\n            MCP2[\"Secure Ops\"]\n            MCP3[\"FastMCP\"]\n        end\n        subgraph AI [\"AI Assistant Integ\"]\n            AI1[\"Claude Code Cmds\"]\n            AI2[\"AI Workflows\"]\n            AI3[\"Assistant Commands\"]\n        end\n    end\n    \n    CS --> PP\n    PP --> SSE\n    SSE --> UI\n```\n\n### Technical Component Architecture\n\n```mermaid\ngraph TB\n    subgraph RAG [\"RAG RETRIEVER SYSTEM\"]\n        subgraph INTERFACES [\"USER INTERFACES\"]\n            WEB[\"Streamlit Web UI<br/>(ui/app.py)<br/>\u2022 Discovery<br/>\u2022 Collections<br/>\u2022 Search\"]\n            CLI_MOD[\"CLI Module<br/>(cli.py)<br/>\u2022 Full Control<br/>\u2022 Admin Ops<br/>\u2022 All Features<br/>\u2022 Maintenance\"]\n            MCP_SRV[\"MCP Server<br/>(mcp/server.py)<br/>\u2022 FastMCP Framework<br/>\u2022 Tool Definitions<br/>\u2022 AI Integration<br/>\u2022 Claude Code Support\"]\n        end\n        \n        subgraph CORE [\"CORE ENGINE\"]\n            PROC[\"Content Processing<br/>(main.py)<br/>\u2022 URL Processing<br/>\u2022 Search Coordination<br/>\u2022 Orchestration\"]\n            LOADERS[\"Document Loaders<br/>\u2022 LocalLoader<br/>\u2022 
ImageLoader<br/>\u2022 GitHubLoader<br/>\u2022 ConfluenceLoader\"]\n            SEARCH[\"Search Engine<br/>(searcher.py)<br/>\u2022 Semantic Search<br/>\u2022 Cross-collection<br/>\u2022 Score Ranking\"]\n        end\n        \n        subgraph DATA [\"DATA LAYER\"]\n            VECTOR[\"Vector Store<br/>(store.py)<br/>\u2022 ChromaDB<br/>\u2022 Collections<br/>\u2022 Metadata<br/>\u2022 Persistence\"]\n            CRAWLERS[\"Web Crawlers<br/>(crawling/)<br/>\u2022 Playwright<br/>\u2022 Crawl4AI<br/>\u2022 ContentClean<br/>\u2022 URL Handling\"]\n            CONFIG[\"Config System<br/>(config.py)<br/>\u2022 YAML Config<br/>\u2022 User Settings<br/>\u2022 API Keys<br/>\u2022 Validation\"]\n        end\n        \n        subgraph EXTERNAL [\"EXTERNAL APIS\"]\n            OPENAI[\"OpenAI API<br/>\u2022 Embeddings<br/>\u2022 Vision Model<br/>\u2022 Batch Process\"]\n            SEARCH_API[\"Search APIs<br/>\u2022 Google Search<br/>\u2022 DuckDuckGo<br/>\u2022 Web Discovery\"]\n            EXT_SYS[\"External Systems<br/>\u2022 GitHub API<br/>\u2022 Confluence<br/>\u2022 Git Repos\"]\n        end\n    end\n    \n    WEB --> PROC\n    CLI_MOD --> PROC\n    MCP_SRV --> PROC\n    \n    PROC <--> LOADERS\n    PROC <--> SEARCH\n    LOADERS <--> SEARCH\n    \n    CORE --> VECTOR\n    CORE --> CRAWLERS\n    CORE --> CONFIG\n    \n    DATA --> OPENAI\n    DATA --> SEARCH_API\n    DATA --> EXT_SYS\n```\n\n## Use Cases\n\n### Documentation Management\n- Index official documentation sites\n- Search for APIs, functions, and usage examples\n- Maintain up-to-date development references\n\n### Knowledge Bases\n- Index company wikis and internal documentation\n- Search for policies, procedures, and best practices\n- Centralize organizational knowledge\n\n### Research and Learning\n- Index technical blogs and tutorials\n- Search for specific topics and technologies\n- Build personal knowledge repositories\n\n### Project Documentation\n- Index project-specific documentation\n- Search for 
implementation patterns\n- Maintain project knowledge bases\n\n## Configuration\n\nRAG Retriever is highly configurable through `config.yaml`:\n\n```yaml\n# Crawler selection\ncrawler:\n  type: \"crawl4ai\"  # or \"playwright\"\n\n# Search settings\nsearch:\n  default_limit: 8\n  default_score_threshold: 0.3\n\n# Content processing\ncontent:\n  chunk_size: 2000\n  chunk_overlap: 400\n\n# API configuration\napi:\n  openai_api_key: sk-your-key-here\n```\n\n## Requirements\n\n- Python 3.10 to 3.12 (the package declares `requires_python` as `>=3.10,<3.13`)\n- OpenAI API key\n- Git (for system functionality)\n- ~500 MB disk space for dependencies\n\n## Installation\n\nSee [`QUICKSTART.md`](QUICKSTART.md) for exact installation commands, or use the AI assistant prompts for guided setup.\n\n## Data Storage\n\nYour content is stored locally in:\n- **macOS/Linux**: `~/.local/share/rag-retriever/`\n- **Windows**: `%LOCALAPPDATA%\\rag-retriever\\`\n\nCollections persist between sessions and are automatically managed.\n\n## Performance\n\n- **Crawl4AI**: Up to 20x faster than traditional crawling\n- **Embedding Caching**: Efficient vector storage and retrieval\n- **Parallel Processing**: Concurrent indexing and search\n- **Optimized Chunking**: Configurable content processing\n\n## Security\n\n- **Local Storage**: All data stored locally, no cloud dependencies\n- **API Key Protection**: Secure configuration management\n- **Permission Controls**: MCP server permission management\n- **Source Tracking**: Complete audit trail of indexed content\n\n## Contributing\n\nRAG Retriever is open source and welcomes contributions. See the repository for guidelines.\n\n## License\n\nMIT License - see LICENSE file for details.\n\n## Support\n\n- **Documentation**: Use the AI assistant prompts for guidance\n- **Issues**: Report bugs and request features via GitHub issues\n- **Community**: Join discussions and share usage patterns\n\n---\n\n**Remember**: Use the AI assistant prompts above rather than reading through documentation. 
Your AI assistant can guide you through setup, usage, and troubleshooting much more effectively than traditional documentation!\n",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "A tool for crawling, indexing, and semantically searching web content with RAG capabilities",
    "version": "0.4.1",
    "project_urls": null,
    "split_keywords": [
        "ai",
        " rag",
        " embeddings",
        " semantic-search",
        " web-crawler",
        " vector-store"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "69a103753f6d8df84f4ab37369d48e7f1c10bb1db3b153af243aaaa752a33e69",
                "md5": "10f60781424f3a35032ce929a1b88fcc",
                "sha256": "a7a346504a95e49155b855f2898e164731412d49f27074ed0a96527f03cdd9ad"
            },
            "downloads": -1,
            "filename": "rag_retriever-0.4.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "10f60781424f3a35032ce929a1b88fcc",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<3.13,>=3.10",
            "size": 86129,
            "upload_time": "2025-07-09T18:29:19",
            "upload_time_iso_8601": "2025-07-09T18:29:19.292488Z",
            "url": "https://files.pythonhosted.org/packages/69/a1/03753f6d8df84f4ab37369d48e7f1c10bb1db3b153af243aaaa752a33e69/rag_retriever-0.4.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "3d2d42821402c6dc7b7540b38ffd3bcc0786f0c2748e758b704fd0cee537279f",
                "md5": "976dcad7fdf3b0127a8a023d85f57eef",
                "sha256": "32c72d9af7f26eb82654264045559d7e702f98ad7e581e97c7acb29bee7464b2"
            },
            "downloads": -1,
            "filename": "rag_retriever-0.4.1.tar.gz",
            "has_sig": false,
            "md5_digest": "976dcad7fdf3b0127a8a023d85f57eef",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<3.13,>=3.10",
            "size": 81973,
            "upload_time": "2025-07-09T18:29:20",
            "upload_time_iso_8601": "2025-07-09T18:29:20.686876Z",
            "url": "https://files.pythonhosted.org/packages/3d/2d/42821402c6dc7b7540b38ffd3bcc0786f0c2748e758b704fd0cee537279f/rag_retriever-0.4.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-07-09 18:29:20",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "rag-retriever"
}
