# rag-retriever

- **Name**: rag-retriever
- **Version**: 0.3.1 (PyPI)
- **Summary**: A tool for crawling, indexing, and semantically searching web content with RAG capabilities
- **Upload time**: 2025-02-23 19:41:52
- **Requires Python**: >=3.10, <3.13
- **License**: MIT
- **Keywords**: ai, rag, embeddings, semantic-search, web-crawler, vector-store
# RAG Retriever

A Python application that loads and processes web pages, local documents, images, GitHub repositories, and Confluence spaces; indexes their content using embeddings; and enables semantic search queries. Built on a modular architecture using OpenAI embeddings and the Chroma vector store. Now featuring support for Anthropic's Model Context Protocol (MCP), enabling direct integration with AI assistants such as Cursor and Claude Desktop.

## What It Does

RAG Retriever enhances your AI coding assistant (like aider, Cursor, or Windsurf) by giving it access to:

- Documentation about new technologies and features
- Your organization's architecture decisions and coding standards
- Internal APIs and tools documentation
- GitHub repositories and their documentation
- Confluence spaces and documentation
- Visual content like architecture diagrams, UI mockups, and technical illustrations
- Any other knowledge that isn't part of the LLM's training data

This can provide new knowledge to your AI tools, prevent hallucinations, and ensure your AI assistant follows your team's practices.

> **💡 Note**: For detailed instructions on setting up and configuring your AI coding assistant with RAG Retriever, see our [AI Assistant Setup Guide](https://github.com/codingthefuturewithai/ai-assistant-instructions/blob/main/instructions/setup/ai-assistant-setup-guide.md).

## How It Works

RAG Retriever processes various types of content:

- Text documents and PDFs are chunked and embedded for semantic search
- Images are analyzed using AI vision models to generate detailed textual descriptions
- Web pages are crawled and their content is extracted
- GitHub repositories are indexed with their code and documentation
- Confluence spaces are indexed with their full content hierarchy

All content can be grouped into collections for organization and targeted searching. By default, searches are performed within the current collection, but you can also explicitly search across all collections using the `--search-all-collections` flag. For images, instead of returning the images themselves, it returns their AI-generated descriptions, making visual content searchable alongside your documentation.

[![Watch the video](https://img.youtube.com/vi/oQ6fSWUZYh0/0.jpg)](https://youtu.be/oQ6fSWUZYh0)

_RAG Retriever seamlessly integrating with aider, Cursor, and Windsurf to provide accurate, up-to-date information during development._

> **💡 Note**: While our examples focus on AI coding assistants, RAG Retriever can enhance any AI-powered development environment or tool that can execute command-line applications. Use it to augment IDEs, CLI tools, or any development workflow that needs reliable, up-to-date information.

## Why Do We Need Such Tools?

Modern AI coding assistants each implement their own way of loading external context from files and web sources. However, this creates several challenges:

- Knowledge remains siloed within each tool's ecosystem
- Support for different document types and sources varies widely
- Integration with enterprise knowledge bases (Confluence, Notion, etc.) is limited
- Each tool requires learning its unique context-loading mechanisms

RAG Retriever solves these challenges by:

1. Providing a unified knowledge repository that can ingest content from diverse sources
2. Offering both a command-line interface and a Model Context Protocol (MCP) server that work with any AI tool supporting shell commands or MCP integration
3. Supporting collections to organize and manage content effectively

> **💡 For a detailed discussion** of why centralized knowledge retrieval tools are crucial for AI-driven development, see our [Why RAG Retriever](docs/why-rag-retriever.md) guide.

## Prerequisites

### Core Requirements

- Python 3.10-3.12 (Download from [python.org](https://python.org))
- One of these package managers:

  - pipx (Recommended, install with one of these commands):

    ```bash
    # On MacOS
    brew install pipx

    # On Windows/Linux
    python -m pip install --user pipx
    ```

  - uv (Alternative, faster installation):

    ```bash
    # On MacOS
    brew install uv

    # On Windows/Linux
    curl -LsSf https://astral.sh/uv/install.sh | sh
    ```

---

> ### 🚀 Ready to Try It? Let's Go!
>
> **Get up and running in 10 minutes!**
>
> 1. Install RAG Retriever using one of these methods:
>
>    ```bash
>    # Using pipx (recommended for isolation)
>    pipx install rag-retriever
>
>    # Using uv (faster installation)
>    uv pip install --system rag-retriever
>    ```
>
> 2. Configure your AI coding assistant by following our [AI Assistant Setup Guide](https://github.com/codingthefuturewithai/ai-assistant-instructions/blob/main/instructions/setup/ai-assistant-setup-guide.md)
> 3. Start using RAG Retriever with your configured AI assistant!
>
> For detailed installation and configuration steps, see our [Getting Started Guide](docs/getting-started.md).

---

### Optional Dependencies

The following dependencies are only required for specific advanced PDF processing features:

- **MacOS**: `brew install tesseract`
- **Windows**: Install [Tesseract](https://github.com/UB-Mannheim/tesseract/wiki)

### System Requirements

The application uses Playwright with Chromium for web crawling:

- Chromium browser is automatically installed during package installation
- Sufficient disk space for Chromium (~200MB)
- Internet connection for initial setup and crawling

Note: The application automatically downloads and manages the Chromium installation.

## Installation

Install RAG Retriever as a standalone application:

```bash
pipx install rag-retriever
```

This will:

- Create an isolated environment for the application
- Install all required dependencies
- Install Chromium browser automatically
- Make the `rag-retriever` command available in your PATH

## How to Upgrade

To upgrade RAG Retriever to the latest version, use the same package manager you used for installation:

```bash
# If installed with pipx
pipx upgrade rag-retriever

# If installed with uv
uv pip install --system --upgrade rag-retriever
```

Both methods will:

- Upgrade the package to the latest available version
- Preserve your existing configuration and data
- Update any new dependencies automatically

After upgrading, it's recommended to reinitialize the configuration to ensure you have any new settings:

```bash
# Initialize configuration files
rag-retriever --init
```

This creates a configuration file at `~/.config/rag-retriever/config.yaml` (Unix/Mac) or `%APPDATA%\rag-retriever\config.yaml` (Windows).

### Setting up your API Key

Add your OpenAI API key to your configuration file:

```yaml
api:
  openai_api_key: "sk-your-api-key-here"
```

> **Security Note**: During installation, RAG Retriever automatically sets strict file permissions (600) on `config.yaml` to ensure it's only readable by you. This helps protect your API key.
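If you want to confirm those permissions yourself, a minimal standard-library check might look like this (the path passed in the comment is the documented Unix/Mac location; the function name is illustrative, not part of RAG Retriever):

```python
import os
import stat


def config_permissions(config_path):
    """Return the octal permission string of a file, e.g. '600'."""
    mode = stat.S_IMODE(os.stat(config_path).st_mode)
    return oct(mode)[2:]


# Example (Unix/Mac path from the docs):
# config_permissions(os.path.expanduser("~/.config/rag-retriever/config.yaml"))
```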

### Customizing Configuration

All settings are in `config.yaml`. For detailed information about all configuration options, best practices, and example configurations, see our [Configuration Guide](docs/configuration-guide.md).

### Data Storage

The vector store database and collections are stored at:

- Unix/Mac: `~/.local/share/rag-retriever/chromadb/`
- Windows: `%LOCALAPPDATA%\rag-retriever\chromadb\`

Each collection is stored in its own subdirectory, with collection metadata maintained in a central metadata file. This location is automatically managed by the application and should not be modified directly.
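The platform-specific paths above can be resolved programmatically; here is a small sketch using only the standard library, mirroring the documented locations (the helper name is illustrative):

```python
import os
import sys


def vector_store_dir():
    """Resolve the documented ChromaDB data directory for this platform."""
    if sys.platform == "win32":
        return os.path.join(os.environ["LOCALAPPDATA"], "rag-retriever", "chromadb")
    # Unix/Mac default documented above
    return os.path.expanduser(os.path.join("~", ".local", "share", "rag-retriever", "chromadb"))
```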

## Knowledge Management Web Interface

![RAG Retriever UI](docs/images/rag-retriever-UI-collections.png)

RAG Retriever includes a modern web interface for managing your knowledge store. The UI provides:

- Collection management with statistics and comparisons
- Semantic search with relevance scoring
- Interactive visualizations of collection metrics

To launch the UI:

```bash
# If installed via pipx or uv
rag-retriever --ui

# If running from local repository
python scripts/run_ui.py
```

For detailed instructions on using the interface, see our [RAG Retriever UI User Guide](docs/rag-retriever-ui-guide.md).

### Uninstallation

To completely remove RAG Retriever:

```bash
# Remove the application and its isolated environment
pipx uninstall rag-retriever

# Remove Playwright browsers
python -m playwright uninstall chromium

# Optional: Remove configuration and data files
# Unix/Mac:
rm -rf ~/.config/rag-retriever ~/.local/share/rag-retriever
# Windows (run in PowerShell):
Remove-Item -Recurse -Force "$env:APPDATA\rag-retriever"
Remove-Item -Recurse -Force "$env:LOCALAPPDATA\rag-retriever"
```

### Development Setup

If you want to contribute to RAG Retriever or modify the code:

```bash
# Clone the repository
git clone https://github.com/codingthefuturewithai/rag-retriever.git
cd rag-retriever

# Create and activate virtual environment
python -m venv venv
source venv/bin/activate  # Unix/Mac
venv\Scripts\activate     # Windows

# Install in editable mode
pip install -e .

# Initialize user configuration
./scripts/run-rag.sh --init  # Unix/Mac
scripts\run-rag.bat --init   # Windows
```

## Collections

RAG Retriever organizes content into collections, allowing you to:

- Group related content together (e.g., by project, team, or topic)
- Search within specific collections or across all collections
- Manage and clean up collections independently
- Track metadata like creation date, last modified, and document counts

### Collection Features

- **Default Collection**: All content goes to the 'default' collection unless specified otherwise
- **Collection Management**:
  - Create collections automatically by specifying a name
  - List all collections and their metadata
  - Delete specific collections while preserving others
  - Search within a specific collection or across all collections
- **Collection Metadata**:
  - Creation timestamp
  - Last modified timestamp
  - Document count
  - Total chunks processed
  - Optional description

### Using Collections

All commands support specifying a collection:

```bash
# List all collections
rag-retriever --list-collections

# Add content to a specific collection
rag-retriever --fetch https://example.com --collection docs
rag-retriever --ingest-file document.md --collection project-x
rag-retriever --github-repo https://github.com/org/repo.git --collection code-docs

# Search within a specific collection
rag-retriever --query "search term" --collection docs

# Search across all collections
rag-retriever --query "search term" --search-all-collections

# Delete a specific collection
rag-retriever --clean --collection old-docs
```

## Usage Examples

### Local Document Processing

```bash
# Process a single file
rag-retriever --ingest-file path/to/document.pdf

# Process all supported files in a directory
rag-retriever --ingest-directory path/to/docs/

# Enable OCR for scanned documents (update config.yaml first)
# Set in config.yaml:
# document_processing.pdf_settings.ocr_enabled: true
rag-retriever --ingest-file scanned-document.pdf

# Enable image extraction from PDFs (update config.yaml first)
# Set in config.yaml:
# document_processing.pdf_settings.extract_images: true
rag-retriever --ingest-file document-with-images.pdf
```

### Web Content Fetching

```bash
# Basic fetch
rag-retriever --fetch https://example.com

# With depth control (default: 2)
rag-retriever --fetch https://example.com --max-depth 2

# Enable verbose output
rag-retriever --fetch https://example.com --verbose
```

### Image Analysis

```bash
# Analyze and index a single image
rag-retriever --ingest-image diagrams/RAG-Retriever-architecture.png

# Process all images in a directory
rag-retriever --ingest-image-directory diagrams/system-design/

# Search for specific architectural details
rag-retriever --query "How does RAG Retriever handle different types of document processing in its architecture?"
rag-retriever --query "What components are responsible for vector storage in the RAG Retriever system?"

# Combine with other content in searches
rag-retriever --query "Compare the error handling approach shown in the RAG Retriever architecture with the approach used by the latest LangChain framework"
```

The image analysis feature uses AI vision models to create detailed descriptions of your visual content, making it searchable alongside your documentation. When you search, you'll receive relevant text descriptions of the images rather than the images themselves.

### Web Search

```bash
# Search the web using the default provider
rag-retriever --web-search "your search query"

# Control number of results
rag-retriever --web-search "your search query" --results 10
```

### Confluence Integration

RAG Retriever can load and index content directly from your Confluence spaces. To use this feature:

1. Configure your Confluence credentials in `~/.config/rag-retriever/config.yaml`:

```yaml
api:
  confluence:
    url: "https://your-domain.atlassian.net" # Your Confluence instance URL
    username: "your-email@example.com" # Your Confluence username/email
    api_token: "your-api-token" # API token from https://id.atlassian.com/manage-profile/security/api-tokens
    space_key: null # Optional: Default space to load from
    parent_id: null # Optional: Default parent page ID
    include_attachments: false # Whether to include attachments
    limit: 50 # Max pages per request
    max_pages: 1000 # Maximum total pages to load
```

2. Load content from Confluence:

```bash
# Load from configured default space
rag-retriever --confluence

# Load from specific space
rag-retriever --confluence --space-key TEAM

# Load from specific parent page
rag-retriever --confluence --parent-id 123456

# Load from specific space and parent
rag-retriever --confluence --space-key TEAM --parent-id 123456
```

The loaded content will be:

- Converted to markdown format
- Split into appropriate chunks
- Embedded and stored in your vector store
- Available for semantic search just like any other content

### Searching Content

```bash
# Basic search
rag-retriever --query "How do I configure logging?"

# Limit results
rag-retriever --query "deployment steps" --limit 5

# Set minimum relevance score
rag-retriever --query "error handling" --score-threshold 0.7

# Truncate content in results (full content is shown by default)
rag-retriever --query "database setup" --truncate

# Output in JSON format
rag-retriever --query "API endpoints" --json
```

### GitHub Repository Integration

```bash
# Load a GitHub repository
rag-retriever --github-repo https://github.com/username/repo.git

# Load a specific branch
rag-retriever --github-repo https://github.com/username/repo.git --branch main

# Load only specific file types
rag-retriever --github-repo https://github.com/username/repo.git --file-extensions .py .md .js
```

The GitHub loader:

- Clones repositories to a temporary directory
- Automatically cleans up after processing
- Supports branch selection
- Filters files by extension
- Excludes common non-documentation paths (`node_modules`, `__pycache__`, etc.)
- Enforces file size limits for better processing
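The filtering behavior described above can be sketched roughly as follows (an illustrative simplification, not the loader's actual code; the excluded-directory set is a hypothetical subset):

```python
import os

# Illustrative subset of excluded directories; the actual loader's list may differ.
EXCLUDED_DIRS = {"node_modules", "__pycache__", ".git"}


def should_index(path, allowed_extensions=None):
    """Decide whether a repository file would be indexed (simplified sketch)."""
    if EXCLUDED_DIRS & set(path.split("/")):
        return False
    if allowed_extensions:
        _, ext = os.path.splitext(path)
        return ext in allowed_extensions
    return True
```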

## Understanding Search Results

Search results include relevance scores based on cosine similarity:

- Scores range from 0 to 1, where 1 indicates perfect similarity
- Default threshold is 0.3 (configurable via `search.default_score_threshold`)
- Typical interpretation:
  - 0.7+: Very high relevance (nearly exact matches)
  - 0.6 - 0.7: High relevance
  - 0.5 - 0.6: Good relevance
  - 0.3 - 0.5: Moderate relevance
  - Below 0.3: Lower relevance
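Cosine similarity between two embedding vectors can be computed as follows (a standard-library sketch for intuition; the actual scoring happens inside the vector store):

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors.

    Ranges over [-1, 1] in general; embeddings of related text
    typically score well above 0, with near-duplicates close to 1.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```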

## Notes

- Uses OpenAI's text-embedding-3-large model for generating embeddings by default
- Content is automatically cleaned and structured during indexing
- Implements URL depth-based crawling control
- Vector store persists between runs unless explicitly deleted
- Uses cosine similarity for more intuitive relevance scoring
- Minimal output by default with `--verbose` flag for troubleshooting
- Full content display by default with `--truncate` option for brevity
- ⚠️ Changing chunk size/overlap settings after ingesting content may lead to inconsistent search results. Consider reprocessing existing content if these settings must be changed.
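The chunk size/overlap warning above can be illustrated with a toy fixed-size splitter (illustrative only, with hypothetical parameters; not the application's actual chunker):

```python
def chunk_text(text, chunk_size=20, overlap=5):
    """Split text into fixed-size chunks where neighbors share `overlap` characters."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]


# Changing chunk_size/overlap moves chunk boundaries, so chunks embedded under
# old settings will not align with chunks produced under new settings.
```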

## Contributing

Please read CONTRIBUTING.md for details on our code of conduct and the process for submitting pull requests.

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

### Command Line Options

Core options:

- `--version`: Show version information
- `--init`: Initialize user configuration files
- `--clean`: Clean (delete) the vector store
- `--verbose`: Enable verbose output for troubleshooting
- `--json`: Output results in JSON format

Content Search:

- `--query STRING`: Search query to find relevant content
- `--limit N`: Maximum number of results to return
- `--score-threshold N`: Minimum relevance score threshold
- `--truncate`: Truncate content in search results (default: show full content)

Web Content:

- `--fetch URL`: Fetch and index web content
- `--max-depth N`: Maximum depth for recursive URL loading (default: 2)
- `--web-search STRING`: Perform a web search (provider configurable; see Web Search Integration)
- `--results N`: Number of web search results (default: 5)

File Processing:

- `--ingest-file PATH`: Ingest a local file (supports .pdf, .md, .txt)
- `--ingest-directory PATH`: Ingest a directory of files

Image Processing:

- `--ingest-image PATH`: Path to an image file or URL to analyze and ingest
- `--ingest-image-directory PATH`: Path to a directory containing images to analyze and ingest

GitHub Integration:

- `--github-repo URL`: URL of the GitHub repository to load
- `--branch STRING`: Specific branch to load from the repository (default: main)
- `--file-extensions EXT [EXT ...]`: Specific file extensions to load (e.g., .py .md .js)

Confluence Integration:

- `--confluence`: Load from Confluence
- `--space-key STRING`: Confluence space key
- `--parent-id STRING`: Confluence parent page ID

## Web Search Integration

RAG Retriever supports multiple search providers for retrieving web content:

### Default Search Provider

By default, RAG Retriever will try to use Google Search if configured, falling back to DuckDuckGo if Google credentials are not available. You can change the default provider in your `config.yaml`:

```yaml
search:
  default_provider: "google" # or "duckduckgo"
```

### Using Google Search

To use Google's Programmable Search Engine:

1. Set up Google Search credentials (one of these methods):

   ```bash
   # Environment variables (recommended for development)
   export GOOGLE_API_KEY=your_api_key
   export GOOGLE_CSE_ID=your_cse_id

   # Or in config.yaml (recommended for permanent setup)
   search:
     google_search:
       api_key: "your_api_key"
       cse_id: "your_cse_id"
   ```

2. Use Google Search:

   ```bash
   # Use default provider (will use Google if configured)
   rag-retriever --web-search "your query"

   # Explicitly request Google
   rag-retriever --web-search "your query" --search-provider google
   ```

### Using DuckDuckGo Search

DuckDuckGo search is always available and requires no configuration:

```bash
# Use DuckDuckGo explicitly
rag-retriever --web-search "your query" --search-provider duckduckgo
```

### Provider Selection Behavior

- When no provider is specified:
  - Uses the default_provider from config
  - If Google is default but not configured, silently falls back to DuckDuckGo
- When Google is explicitly requested but not configured:
  - Shows error message suggesting to use DuckDuckGo
- When DuckDuckGo is requested:
  - Uses DuckDuckGo directly
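The selection rules above amount to a small decision function, sketched here for clarity (a model of the documented behavior, not the actual implementation):

```python
def pick_provider(requested=None, default="google", google_configured=False):
    """Apply the documented provider-selection rules (illustrative sketch)."""
    if requested == "duckduckgo":
        return "duckduckgo"
    if requested == "google":
        if not google_configured:
            raise ValueError("Google Search requested but not configured; try DuckDuckGo")
        return "google"
    # No explicit provider: use the default, silently falling back to
    # DuckDuckGo when Google is the default but has no credentials.
    if default == "google" and not google_configured:
        return "duckduckgo"
    return default
```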

For detailed configuration options and setup instructions, see our [Configuration Guide](docs/configuration-guide.md#search-provider-configuration).

## MCP Integration (Experimental)

RAG Retriever provides support for Anthropic's Model Context Protocol (MCP), enabling AI assistants to directly leverage its retrieval capabilities. The MCP integration currently offers a focused set of core features, with ongoing development to expand the available functionality.

### Currently Supported MCP Features

**Search Operations**

- Web search using DuckDuckGo

  - Customizable number of results
  - Results include titles, URLs, and snippets
  - Markdown-formatted output

- Vector store search
  - Semantic search across indexed content
  - Search within specific collections
  - Search across all collections simultaneously
  - Configurable result limits
  - Score threshold filtering
  - Full or partial content retrieval
  - Source attribution with collection information
  - Markdown-formatted output with relevance scores

**Content Processing**

- URL content processing
  - Fetch and ingest web content
  - Store content in specific collections
  - Automatically extract and clean text content
  - Store processed content in vector store for semantic search
  - Support for recursive crawling with depth control

### Server Modes

RAG Retriever's MCP server supports multiple operation modes:

1. **stdio Mode** (Default)

   - Used by Cursor and Claude Desktop integrations
   - Communicates via standard input/output
   - Configure as shown in the integration guides below

2. **SSE Mode**

   - Runs as a web server with Server-Sent Events
   - Useful for web-based integrations or development
   - Start with:

   ```bash
   python rag_retriever/mcp/server.py --port 3001
   ```

3. **Debug Mode**
   - Uses the MCP Inspector for testing and development
   - Helpful for debugging tool implementations
   - Start with:
   ```bash
   mcp dev rag_retriever/mcp/server.py
   ```

### Cursor Integration

1. First install RAG Retriever following the installation instructions above.

2. Get the full path to the MCP server:

```bash
which mcp-rag-retriever
```

This will output something like `/Users/<username>/.local/bin/mcp-rag-retriever`

3. Configure Cursor:
   - Open Cursor Settings > Features > MCP Servers
   - Click "+ Add New MCP Server"
   - Configure the server:
     - Name: rag-retriever
     - Type: stdio
     - Command: [paste the full path from step 2]

For more details about MCP configuration in Cursor, see the [Cursor MCP documentation](https://docs.cursor.com/context/model-context-protocol).

### Claude Desktop Integration

1. First install RAG Retriever following the installation instructions above.

2. Get the full path to the MCP server:

```bash
which mcp-rag-retriever
```

This will output something like `/Users/<username>/.local/bin/mcp-rag-retriever`

3. Configure Claude Desktop:
   - Open Claude menu > Settings > Developer > Edit Config
   - Add RAG Retriever to the MCP servers configuration:

```json
{
  "mcpServers": {
    "rag-retriever": {
      "command": "/Users/<username>/.local/bin/mcp-rag-retriever"
    }
  }
}
```

For more details, see the [Claude Desktop MCP documentation](https://modelcontextprotocol.io/quickstart/user#for-claude-desktop-users).

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "rag-retriever",
    "maintainer": null,
    "docs_url": null,
    "requires_python": "<3.13,>=3.10",
    "maintainer_email": null,
    "keywords": "ai, rag, embeddings, semantic-search, web-crawler, vector-store",
    "author": null,
    "author_email": "Tim Kitchens <codingthefuturewithai@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/c2/35/6488cabd2c74c95a73eb2ee37232283fcc52fcc6db2e9a2efddad104d9dc/rag_retriever-0.3.1.tar.gz",
    "platform": null,
    "description": "# RAG Retriever\n\nA Python application that loads and processes web pages, local documents, images, GitHub repositories, and Confluence spaces, indexing their content using embeddings, and enabling semantic search queries. Built with a modular architecture using OpenAI embeddings and Chroma vector store. Now featuring support for Anthropic's Model Context Protocol (MCP), enabling direct integration with AI assistants like Cursor and Claude Desktop.\n\n## What It Does\n\nRAG Retriever enhances your AI coding assistant (like aider, Cursor, or Windsurf) by giving it access to:\n\n- Documentation about new technologies and features\n- Your organization's architecture decisions and coding standards\n- Internal APIs and tools documentation\n- GitHub repositories and their documentation\n- Confluence spaces and documentation\n- Visual content like architecture diagrams, UI mockups, and technical illustrations\n- Any other knowledge that isn't part of the LLM's training data\n\nThis can help provide new knowledge to your AI tools, prevent hallucinations and ensure your AI assistant follows your team's practices.\n\n> **\ud83d\udca1 Note**: For detailed instructions on setting up and configuring your AI coding assistant with RAG Retriever, see our [AI Assistant Setup Guide](https://github.com/codingthefuturewithai/ai-assistant-instructions/blob/main/instructions/setup/ai-assistant-setup-guide.md).\n\n## How It Works\n\nRAG Retriever processes various types of content:\n\n- Text documents and PDFs are chunked and embedded for semantic search\n- Images are analyzed using AI vision models to generate detailed textual descriptions\n- Web pages are crawled and their content is extracted\n- GitHub repositories are indexed with their code and documentation\n- Confluence spaces are indexed with their full content hierarchy\n\nAll content can be organized into collections for better organization and targeted searching. 
By default, searches are performed within the current collection, but you can also explicitly search across all collections using the `--search-all-collections` flag. For images, instead of returning the images themselves, it returns their AI-generated descriptions, making visual content searchable alongside your documentation.\n\n[![Watch the video](https://img.youtube.com/vi/oQ6fSWUZYh0/0.jpg)](https://youtu.be/oQ6fSWUZYh0)\n\n_RAG Retriever seamlessly integrating with aider, Cursor, and Windsurf to provide accurate, up-to-date information during development._\n\n> **\ud83d\udca1 Note**: While our examples focus on AI coding assistants, RAG Retriever can enhance any AI-powered development environment or tool that can execute command-line applications. Use it to augment IDEs, CLI tools, or any development workflow that needs reliable, up-to-date information.\n\n## Why Do We Need Such Tools?\n\nModern AI coding assistants each implement their own way of loading external context from files and web sources. However, this creates several challenges:\n\n- Knowledge remains siloed within each tool's ecosystem\n- Support for different document types and sources varies widely\n- Integration with enterprise knowledge bases (Confluence, Notion, etc.) is limited\n- Each tool requires learning its unique context-loading mechanisms\n\nRAG Retriever solves these challenges by:\n\n1. Providing a unified knowledge repository that can ingest content from diverse sources\n2. Offering both a command-line interface and a Model Context Protocol (MCP) server that work with any AI tool supporting shell commands or MCP integration\n3. 
Supporting collections to organize and manage content effectively\n\n> **\ud83d\udca1 For a detailed discussion** of why centralized knowledge retrieval tools are crucial for AI-driven development, see our [Why RAG Retriever](docs/why-rag-retriever.md) guide.\n\n## Prerequisites\n\n### Core Requirements\n\n- Python 3.10-3.12 (Download from [python.org](https://python.org))\n- One of these package managers:\n\n  - pipx (Recommended, install with one of these commands):\n\n    ```bash\n    # On MacOS\n    brew install pipx\n\n    # On Windows/Linux\n    python -m pip install --user pipx\n    ```\n\n  - uv (Alternative, faster installation):\n\n    ```bash\n    # On MacOS\n    brew install uv\n\n    # On Windows/Linux\n    curl -LsSf https://astral.sh/uv/install.sh | sh\n    ```\n\n---\n\n> ### \ud83d\ude80 Ready to Try It? Let's Go!\n>\n> **Get up and running in 10 minutes!**\n>\n> 1. Install RAG Retriever using one of these methods:\n>\n>    ```bash\n>    # Using pipx (recommended for isolation)\n>    pipx install rag-retriever\n>\n>    # Using uv (faster installation)\n>    uv pip install --system rag-retriever\n>    ```\n>\n> 2. Configure your AI coding assistant by following our [AI Assistant Setup Guide](https://github.com/codingthefuturewithai/ai-assistant-instructions/blob/main/instructions/setup/ai-assistant-setup-guide.md)\n> 3. 
Start using RAG Retriever with your configured AI assistant!\n>\n> For detailed installation and configuration steps, see our [Getting Started Guide](docs/getting-started.md).\n\n---\n\n### Optional Dependencies\n\nThe following dependencies are only required for specific advanced PDF processing features:\n\n**MacOS**: `brew install tesseract`\n**Windows**: Install [Tesseract](https://github.com/UB-Mannheim/tesseract/wiki)\n\n### System Requirements\n\nThe application uses Playwright with Chromium for web crawling:\n\n- Chromium browser is automatically installed during package installation\n- Sufficient disk space for Chromium (~200MB)\n- Internet connection for initial setup and crawling\n\nNote: The application will automatically download and manage Chromium installation.\n\n## Installation\n\nInstall RAG Retriever as a standalone application:\n\n```bash\npipx install rag-retriever\n```\n\nThis will:\n\n- Create an isolated environment for the application\n- Install all required dependencies\n- Install Chromium browser automatically\n- Make the `rag-retriever` command available in your PATH\n\n## How to Upgrade\n\nTo upgrade RAG Retriever to the latest version, use the same package manager you used for installation:\n\n```bash\n# If installed with pipx\npipx upgrade rag-retriever\n\n# If installed with uv\nuv pip install --system --upgrade rag-retriever\n```\n\nBoth methods will:\n\n- Upgrade the package to the latest available version\n- Preserve your existing configuration and data\n- Update any new dependencies automatically\n\nAfter upgrading, it's recommended to reinitialize the configuration to ensure you have any new settings:\n\n```bash\n# Initialize configuration files\nrag-retriever --init\n```\n\nThis creates a configuration file at `~/.config/rag-retriever/config.yaml` (Unix/Mac) or `%APPDATA%\\rag-retriever\\config.yaml` (Windows)\n\n### Setting up your API Key\n\nAdd your OpenAI API key to your configuration file:\n\n```yaml\napi:\n  
openai_api_key: \"sk-your-api-key-here\"\n```\n\n> **Security Note**: During installation, RAG Retriever automatically sets strict file permissions (600) on `config.yaml` to ensure it's only readable by you. This helps protect your API key.\n\n### Customizing Configuration\n\nAll settings are in `config.yaml`. For detailed information about all configuration options, best practices, and example configurations, see our [Configuration Guide](docs/configuration-guide.md).\n\n### Data Storage\n\nThe vector store database and collections are stored at:\n\n- Unix/Mac: `~/.local/share/rag-retriever/chromadb/`\n- Windows: `%LOCALAPPDATA%\\rag-retriever\\chromadb/`\n\nEach collection is stored in its own subdirectory, with collection metadata maintained in a central metadata file. This location is automatically managed by the application and should not be modified directly.\n\n## Knowledge Management Web Interface\n\n![RAG Retriever UI](docs/images/rag-retriever-UI-collections.png)\n\nRAG Retriever includes a modern web interface intended to help you manage your knowledge store. 
The UI provides:

- Collection management with statistics and comparisons
- Semantic search with relevance scoring
- Interactive visualizations of collection metrics

To launch the UI:

```bash
# If installed via pipx or uv
rag-retriever --ui

# If running from a local repository
python scripts/run_ui.py
```

For detailed instructions on using the interface, see our [RAG Retriever UI User Guide](docs/rag-retriever-ui-guide.md).

### Uninstallation

To completely remove RAG Retriever:

```bash
# Remove the application and its isolated environment
pipx uninstall rag-retriever

# Remove Playwright browsers
python -m playwright uninstall chromium

# Optional: Remove configuration and data files
# Unix/Mac:
rm -rf ~/.config/rag-retriever ~/.local/share/rag-retriever
# Windows (run in PowerShell):
Remove-Item -Recurse -Force "$env:APPDATA\rag-retriever"
Remove-Item -Recurse -Force "$env:LOCALAPPDATA\rag-retriever"
```

### Development Setup

If you want to contribute to RAG Retriever or modify the code:

```bash
# Clone the repository
git clone https://github.com/codingthefuturewithai/rag-retriever.git
cd rag-retriever

# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate  # Unix/Mac
venv\Scripts\activate     # Windows

# Install in editable mode
pip install -e .

# Initialize user configuration
./scripts/run-rag.sh --init  # Unix/Mac
scripts\run-rag.bat --init   # Windows
```

## Collections

RAG Retriever organizes content into collections, allowing you to:

- Group related content together (e.g., by project, team, or topic)
- Search within specific collections or across all collections
- Manage and clean up collections independently
- Track metadata such as creation date, last modified time, and document counts

### Collection Features

- **Default Collection**: All content goes to the 'default' collection unless specified otherwise
- **Collection Management**:
  - Create collections automatically by specifying a name
  - List all collections and their metadata
  - Delete specific collections while preserving others
  - Search within a specific collection or across all collections
- **Collection Metadata**:
  - Creation timestamp
  - Last modified timestamp
  - Document count
  - Total chunks processed
  - Optional description

### Using Collections

All commands support specifying a collection:

```bash
# List all collections
rag-retriever --list-collections

# Add content to a specific collection
rag-retriever --fetch https://example.com --collection docs
rag-retriever --ingest-file document.md --collection project-x
rag-retriever --github-repo https://github.com/org/repo.git --collection code-docs

# Search within a specific collection
rag-retriever --query "search term" --collection docs

# Search across all collections
rag-retriever --query "search term" --search-all-collections

# Delete a specific collection
rag-retriever --clean --collection old-docs
```

## Usage Examples

### Local Document Processing

```bash
# Process a single file
rag-retriever --ingest-file path/to/document.pdf

# Process all supported files in a directory
rag-retriever --ingest-directory path/to/docs/

# Enable OCR for scanned documents (update config.yaml first)
# Set in config.yaml:
# document_processing.pdf_settings.ocr_enabled: true
rag-retriever --ingest-file scanned-document.pdf

# Enable image extraction from PDFs (update config.yaml first)
# Set in config.yaml:
# document_processing.pdf_settings.extract_images: true
rag-retriever --ingest-file document-with-images.pdf
```

### Web Content Fetching

```bash
# Basic fetch
rag-retriever --fetch https://example.com

# With depth control (default: 2)
rag-retriever --fetch https://example.com --max-depth 2

# Enable verbose output
rag-retriever --fetch https://example.com --verbose
```

### Image Analysis

```bash
# Analyze and index a single image
rag-retriever --ingest-image diagrams/RAG-Retriever-architecture.png

# Process all images in a directory
rag-retriever --ingest-image-directory diagrams/system-design/

# Search for specific architectural details
rag-retriever --query "How does RAG Retriever handle different types of document processing in its architecture?"
rag-retriever --query "What components are responsible for vector storage in the RAG Retriever system?"

# Combine with other content in searches
rag-retriever --query "Compare the error handling approach shown in the RAG Retriever architecture with the approach used by the latest LangChain framework"
```

The image analysis feature uses AI vision models to create detailed descriptions of your visual content, making it searchable alongside your documentation. When you search, you'll receive relevant text descriptions of the images rather than the images themselves.

### Web Search

```bash
# Search the web using DuckDuckGo
rag-retriever --web-search "your search query"

# Control the number of results
rag-retriever --web-search "your search query" --results 10
```

### Confluence Integration

RAG Retriever can load and index content directly from your Confluence spaces. To use this feature:

1. Configure your Confluence credentials in `~/.config/rag-retriever/config.yaml`:

```yaml
api:
  confluence:
    url: "https://your-domain.atlassian.net" # Your Confluence instance URL
    username: "your-email@example.com" # Your Confluence username/email
    api_token: "your-api-token" # API token from https://id.atlassian.com/manage-profile/security/api-tokens
    space_key: null # Optional: Default space to load from
    parent_id: null # Optional: Default parent page ID
    include_attachments: false # Whether to include attachments
    limit: 50 # Max pages per request
    max_pages: 1000 # Maximum total pages to load
```

2. Load content from Confluence:

```bash
# Load from the configured default space
rag-retriever --confluence

# Load from a specific space
rag-retriever --confluence --space-key TEAM

# Load from a specific parent page
rag-retriever --confluence --parent-id 123456

# Load from a specific space and parent
rag-retriever --confluence --space-key TEAM --parent-id 123456
```

The loaded content will be:

- Converted to markdown format
- Split into appropriate chunks
- Embedded and stored in your vector store
- Available for semantic search just like any other content

### Searching Content

```bash
# Basic search
rag-retriever --query "How do I configure logging?"

# Limit results
rag-retriever --query "deployment steps" --limit 5

# Set a minimum relevance score
rag-retriever --query "error handling" --score-threshold 0.7

# Get full content (default) or truncated
rag-retriever --query "database setup" --truncate

# Output in JSON format
rag-retriever --query "API endpoints" --json
```

### GitHub Repository Integration

```bash
# Load a GitHub repository
rag-retriever --github-repo https://github.com/username/repo.git

# Load a specific branch
rag-retriever --github-repo https://github.com/username/repo.git --branch main

# Load only specific file types
rag-retriever --github-repo https://github.com/username/repo.git --file-extensions .py .md .js
```

The GitHub loader:

- Clones repositories to a temporary directory
- Automatically cleans up after processing
- Supports branch selection
- Filters files by extension
- Excludes common non-documentation paths (`node_modules`, `__pycache__`, etc.)
- Enforces file size limits for better processing

## Understanding Search Results

Search results include relevance scores based on cosine similarity:

- Scores range from 0 to 1, where 1 indicates perfect similarity
- The default threshold is 0.3 (configurable via `search.default_score_threshold`)
- Typical interpretation:
  - 0.7+: Very high relevance (nearly exact matches)
  - 0.6-0.7: High relevance
  - 0.5-0.6: Good relevance
  - 0.3-0.5: Moderate relevance
  - Below 0.3: Lower relevance

## Notes

- Uses OpenAI's text-embedding-3-large model for generating embeddings by default
- Content is automatically cleaned and structured during indexing
- Implements URL depth-based crawling control
- The vector store persists between runs unless explicitly deleted
- Uses cosine similarity for more intuitive relevance scoring
- Minimal output by default, with a `--verbose` flag for troubleshooting
- Full content display by default, with a `--truncate` option for brevity
- ⚠️ Changing chunk size/overlap settings after ingesting content may lead to inconsistent search results. Consider reprocessing existing content if these settings must be changed.

## Contributing

Please read CONTRIBUTING.md for details on our code of conduct and the process for submitting pull requests.

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

### Command Line Options

Core options:

- `--version`: Show version information
- `--init`: Initialize user configuration files
- `--clean`: Clean (delete) the vector store
- `--verbose`: Enable verbose output for troubleshooting
- `--json`: Output results in JSON format

Content Search:

- `--query STRING`: Search query to find relevant content
- `--limit N`: Maximum number of results to return
- `--score-threshold N`: Minimum relevance score threshold
- `--truncate`: Truncate content in search results (default: show full content)

Web Content:

- `--fetch URL`: Fetch and index web content
- `--max-depth N`: Maximum depth for recursive URL loading (default: 2)
- `--web-search STRING`: Perform a DuckDuckGo web search
- `--results N`: Number of web search results (default: 5)

File Processing:

- `--ingest-file PATH`: Ingest a local file (supports .pdf, .md, .txt)
- `--ingest-directory PATH`: Ingest a directory of files

Image Processing:

- `--ingest-image PATH`: Path to an image file or URL to analyze and ingest
- `--ingest-image-directory PATH`: Path to a directory containing images to analyze and ingest

GitHub Integration:

- `--github-repo URL`: URL of the GitHub repository to load
- `--branch STRING`: Specific branch to load from the repository (default: main)
- `--file-extensions EXT [EXT ...]`: Specific file extensions to load (e.g., .py .md .js)

Confluence Integration:

- `--confluence`: Load from Confluence
- `--space-key STRING`: Confluence space key
- `--parent-id STRING`: Confluence parent page ID

## Web Search Integration

RAG Retriever supports multiple search providers for retrieving web content.

### Default Search Provider

By default, RAG Retriever will try to use Google Search if configured, falling back to DuckDuckGo if Google credentials are not available. You can change the default provider in your `config.yaml`:

```yaml
search:
  default_provider: "google" # or "duckduckgo"
```

### Using Google Search

To use Google's Programmable Search Engine:

1. Set up Google Search credentials (one of these methods):

   ```bash
   # Environment variables (recommended for development)
   export GOOGLE_API_KEY=your_api_key
   export GOOGLE_CSE_ID=your_cse_id
   ```

   Or in `config.yaml` (recommended for a permanent setup):

   ```yaml
   search:
     google_search:
       api_key: "your_api_key"
       cse_id: "your_cse_id"
   ```

2. Use Google Search:

   ```bash
   # Use the default provider (will use Google if configured)
   rag-retriever --web-search "your query"

   # Explicitly request Google
   rag-retriever --web-search "your query" --search-provider google
   ```

### Using DuckDuckGo Search

DuckDuckGo search is always available and requires no configuration:

```bash
# Use DuckDuckGo explicitly
rag-retriever --web-search "your query" --search-provider duckduckgo
```

### Provider Selection Behavior

- When no provider is specified:
  - Uses the `default_provider` from config
  - If Google is the default but not configured, silently falls back to DuckDuckGo
- When Google is explicitly requested but not configured:
  - Shows an error message suggesting DuckDuckGo
- When DuckDuckGo is requested:
  - Uses DuckDuckGo directly

For detailed configuration options and setup instructions, see our [Configuration Guide](docs/configuration-guide.md#search-provider-configuration).

## MCP Integration (Experimental)

RAG Retriever provides support for Anthropic's Model Context Protocol (MCP), enabling AI assistants to directly leverage its retrieval capabilities.
The MCP integration currently offers a focused set of core features, with ongoing development to expand the available functionality.

### Currently Supported MCP Features

**Search Operations**

- Web search using DuckDuckGo

  - Customizable number of results
  - Results include titles, URLs, and snippets
  - Markdown-formatted output

- Vector store search
  - Semantic search across indexed content
  - Search within specific collections
  - Search across all collections simultaneously
  - Configurable result limits
  - Score threshold filtering
  - Full or partial content retrieval
  - Source attribution with collection information
  - Markdown-formatted output with relevance scores

**Content Processing**

- URL content processing
  - Fetch and ingest web content
  - Store content in specific collections
  - Automatically extract and clean text content
  - Store processed content in the vector store for semantic search
  - Support for recursive crawling with depth control

### Server Modes

RAG Retriever's MCP server supports multiple operation modes:

1. **stdio mode** (default)

   - Used by the Cursor and Claude Desktop integrations
   - Communicates via standard input/output
   - Configure as shown in the integration guides below

2. **SSE mode**

   - Runs as a web server with Server-Sent Events
   - Useful for web-based integrations or development
   - Start with:

   ```bash
   python rag_retriever/mcp/server.py --port 3001
   ```

3. **Debug mode**

   - Uses the MCP Inspector for testing and development
   - Helpful for debugging tool implementations
   - Start with:

   ```bash
   mcp dev rag_retriever/mcp/server.py
   ```

### Cursor Integration

1. First, install RAG Retriever following the installation instructions above.

2. Get the full path to the MCP server:

```bash
which mcp-rag-retriever
```

This will output something like `/Users/<username>/.local/bin/mcp-rag-retriever`.

3. Configure Cursor:
   - Open Cursor Settings > Features > MCP Servers
   - Click "+ Add New MCP Server"
   - Configure the server:
     - Name: rag-retriever
     - Type: stdio
     - Command: [paste the full path from step 2]

For more details about MCP configuration in Cursor, see the [Cursor MCP documentation](https://docs.cursor.com/context/model-context-protocol).

### Claude Desktop Integration

1. First, install RAG Retriever following the installation instructions above.

2. Get the full path to the MCP server:

```bash
which mcp-rag-retriever
```

This will output something like `/Users/<username>/.local/bin/mcp-rag-retriever`.

3. Configure Claude Desktop:
   - Open Claude menu > Settings > Developer > Edit Config
   - Add RAG Retriever to the MCP servers configuration:

```json
{
  "mcpServers": {
    "rag-retriever": {
      "command": "/Users/<username>/.local/bin/mcp-rag-retriever"
    }
  }
}
```

For more details, see the [Claude Desktop MCP documentation](https://modelcontextprotocol.io/quickstart/user#for-claude-desktop-users).
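A common failure mode when hand-editing this config is malformed JSON, which causes the server entry to be ignored. A minimal sanity check using only Python's standard library (the `<username>` path is the example placeholder from above, not a real path):

```python
import json

# The mcpServers entry from the example above; the command path is
# machine-specific (use the output of `which mcp-rag-retriever`).
config_text = '''
{
  "mcpServers": {
    "rag-retriever": {
      "command": "/Users/<username>/.local/bin/mcp-rag-retriever"
    }
  }
}
'''

# json.loads raises json.JSONDecodeError if the file contains a typo
# (e.g., a trailing comma or missing quote)
config = json.loads(config_text)
server = config["mcpServers"]["rag-retriever"]
```

The same check can be run against the actual config file by reading it with `open()` before parsing.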
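As described in the Understanding Search Results section above, relevance scores are cosine similarities between the query embedding and each stored chunk's embedding. A toy sketch of that computation, using illustrative 3-dimensional vectors (real embedding vectors have far more dimensions):

```python
import math

def cosine_similarity(a, b):
    # dot(a, b) / (|a| * |b|); vectors pointing the same way score 1.0
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" for a query and a document chunk (not real model output)
query_vec = [0.2, 0.7, 0.1]
chunk_vec = [0.25, 0.6, 0.2]

score = cosine_similarity(query_vec, chunk_vec)
# A chunk is returned only when its score meets the configured
# threshold (default 0.3, via search.default_score_threshold)
relevant = score >= 0.3
```

This is only an illustration of the scoring math; the actual retrieval is handled internally by the Chroma vector store.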