# RAG Retriever
A Python application that loads and processes web pages, local documents, and Confluence spaces, indexes their content using embeddings, and enables semantic search queries. Built with a modular architecture using OpenAI embeddings and the Chroma vector store.
## What It Does
RAG Retriever enhances your AI coding assistant (like aider or Cursor) by giving it access to:
- Documentation about new technologies and features
- Your organization's architecture decisions and coding standards
- Internal APIs and tools documentation
- Confluence spaces and documentation
- Any other knowledge that isn't part of the LLM's training data
This helps prevent hallucinations and ensures your AI assistant follows your team's practices.
> **💡 Note**: While our examples focus on AI coding assistants, RAG Retriever can enhance any AI-powered development environment or tool that can execute command-line applications. Use it to augment IDEs, CLI tools, or any development workflow that needs reliable, up-to-date information.
## Prerequisites
### Core Requirements
- Python 3.10-3.12 (Download from [python.org](https://python.org))
- pipx (Install with one of these commands):
```bash
# On macOS
brew install pipx

# On Windows/Linux
python -m pip install --user pipx
```
### Optional Dependencies
The following dependencies are only required for specific advanced features:
#### OCR Support (Optional)
Required only for:
- Processing scanned documents
- Extracting text from images in PDFs
- Converting images to searchable text
**macOS**: `brew install tesseract`
**Windows**: Install [Tesseract](https://github.com/UB-Mannheim/tesseract/wiki)
#### Advanced PDF Processing (Optional)
Required only for:
- Complex PDF layouts
- Better table extraction
- Technical document processing
**macOS**: `brew install poppler`
**Windows**: Install [Poppler](https://github.com/oschwartz10612/poppler-windows/releases/)
The core functionality works without these dependencies, including:
- Basic PDF text extraction
- Markdown and text file processing
- Web content crawling
- Vector storage and search
### System Requirements
The application uses Playwright with Chromium for web crawling:
- Chromium browser is automatically installed during package installation
- Sufficient disk space for Chromium (~200MB)
- Internet connection for initial setup and crawling
Note: The application will automatically download and manage the Chromium installation.
---
### 🚀 Ready to Try It?
Head over to our [Getting Started Guide](docs/getting-started.md) for a quick setup that will get your AI assistant using the RAG Retriever in 5 minutes!
---
## Installation
Install RAG Retriever as a standalone application:
```bash
pipx install rag-retriever
```
### Windows-Specific Setup
On Windows, use PowerShell for installation and setup:
1. Open PowerShell as Administrator (right-click PowerShell and select "Run as Administrator")
2. Install pipx if not already installed:
```powershell
python -m pip install --user pipx
```
3. Install RAG Retriever:
```powershell
pipx install rag-retriever
```
4. Add pipx binaries to your PATH (requires Administrator privileges):
```powershell
pipx ensurepath
```
5. Close PowerShell and open a new regular (non-Administrator) PowerShell window
6. Initialize the configuration:
```powershell
rag-retriever --init
```
Note: If you get a "command not found" error after following these steps, try closing all PowerShell windows and opening a new one.
This will:
- Create an isolated environment for the application
- Install all required dependencies
- Install Chromium browser automatically
- Make the `rag-retriever` command available in your PATH
## How to Upgrade
To upgrade RAG Retriever to the latest version:
```bash
pipx upgrade rag-retriever
```
This will:
- Upgrade the package to the latest available version
- Preserve your existing configuration and data
- Update any new dependencies automatically
After installation, initialize the configuration:
```bash
# Initialize configuration files
rag-retriever --init
```
This creates:
- A configuration file at `~/.config/rag-retriever/config.yaml` (Unix/Mac) or `%APPDATA%\rag-retriever\config.yaml` (Windows)
- A `.env` file in the same directory for your OpenAI API key
### Setting up your API Key
Add your OpenAI API key to the `.env` file:
```bash
OPENAI_API_KEY=your-api-key-here
```
### Customizing Configuration
All settings are in `config.yaml`. For detailed information about all configuration options, best practices, and example configurations, see our [Configuration Guide](docs/configuration-guide.md).
Key configuration sections include:
```yaml
# Vector store settings
vector_store:
  embedding_model: "text-embedding-3-large"
  embedding_dimensions: 3072
  chunk_size: 1000
  chunk_overlap: 200

# Local document processing
document_processing:
  supported_extensions:
    - ".md"
    - ".txt"
    - ".pdf"
  pdf_settings:
    max_file_size_mb: 50
    extract_images: false
    ocr_enabled: false
    languages: ["eng"]
    strategy: "fast"
    mode: "elements"

# Search settings
search:
  default_limit: 8
  default_score_threshold: 0.3
```
### Data Storage
The vector store database is stored at:
- Unix/Mac: `~/.local/share/rag-retriever/chromadb/`
- Windows: `%LOCALAPPDATA%\rag-retriever\chromadb\`
This location is automatically managed by the application and should not be modified directly.
### Uninstallation
To completely remove RAG Retriever:
```bash
# Remove the application and its isolated environment
pipx uninstall rag-retriever
# Remove Playwright browsers
python -m playwright uninstall chromium
# Optional: Remove configuration and data files
# On Unix/Mac:
rm -rf ~/.config/rag-retriever ~/.local/share/rag-retriever
# On Windows (run in PowerShell):
# Open PowerShell and run:
Remove-Item -Recurse -Force "$env:APPDATA\rag-retriever"
Remove-Item -Recurse -Force "$env:LOCALAPPDATA\rag-retriever"
```
### Development Setup
If you want to contribute to RAG Retriever or modify the code:
```bash
# Clone the repository
git clone https://github.com/codingthefuturewithai/rag-retriever.git
cd rag-retriever
# Create and activate virtual environment
python -m venv venv
source venv/bin/activate # Unix/Mac
venv\Scripts\activate # Windows
# Install in editable mode
pip install -e .
# Initialize user configuration
./scripts/run-rag.sh --init # Unix/Mac
scripts\run-rag.bat --init # Windows
```
## Usage Examples
### Local Document Processing
```bash
# Process a single file
rag-retriever --ingest-file path/to/document.pdf
# Process all supported files in a directory
rag-retriever --ingest-directory path/to/docs/
# Enable OCR for scanned documents (update config.yaml first)
# Set in config.yaml:
# document_processing.pdf_settings.ocr_enabled: true
rag-retriever --ingest-file scanned-document.pdf
# Enable image extraction from PDFs (update config.yaml first)
# Set in config.yaml:
# document_processing.pdf_settings.extract_images: true
rag-retriever --ingest-file document-with-images.pdf
```
### Web Content Fetching
```bash
# Basic fetch
rag-retriever --fetch https://example.com
# With depth control
rag-retriever --fetch https://example.com --max-depth 2
# Minimal output mode
rag-retriever --fetch https://example.com --verbose false
```
### Confluence Integration
RAG Retriever can load and index content directly from your Confluence spaces. To use this feature:
1. Configure your Confluence credentials in `~/.config/rag-retriever/config.yaml`:
```yaml
api:
  confluence:
    url: "https://your-domain.atlassian.net" # Your Confluence instance URL
    username: "your-email@example.com" # Your Confluence username/email
    api_token: "your-api-token" # API token from https://id.atlassian.com/manage-profile/security/api-tokens
    space_key: null # Optional: Default space to load from
    parent_id: null # Optional: Default parent page ID
    include_attachments: false # Whether to include attachments
    limit: 50 # Max pages per request
    max_pages: 1000 # Maximum total pages to load
```
2. Load content from Confluence:
```bash
# Load from configured default space
rag-retriever --confluence
# Load from specific space
rag-retriever --confluence --space-key TEAM
# Load from specific parent page
rag-retriever --confluence --parent-id 123456
# Load from specific space and parent
rag-retriever --confluence --space-key TEAM --parent-id 123456
```
The loaded content will be:
- Converted to markdown format
- Split into appropriate chunks
- Embedded and stored in your vector store
- Available for semantic search just like any other content
### Searching Content
```bash
# Basic search
rag-retriever --query "How do I configure logging?"
# Limit results
rag-retriever --query "deployment steps" --limit 5
# Set minimum relevance score
rag-retriever --query "error handling" --score-threshold 0.7
# Truncate result content (full content is shown by default)
rag-retriever --query "database setup" --truncate
# Output in JSON format
rag-retriever --query "API endpoints" --json
```
## Configuration Options
The configuration file (`config.yaml`) is organized into several sections:
### Vector Store Settings
```yaml
vector_store:
  persist_directory: null # Set automatically to OS-specific path
  embedding_model: "text-embedding-3-large"
  embedding_dimensions: 3072
  chunk_size: 1000 # Size of text chunks for indexing
  chunk_overlap: 200 # Overlap between chunks
```
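As a simplified illustration of how `chunk_size` and `chunk_overlap` interact (this sketch is not the package's actual splitter), each chunk begins `chunk_size - chunk_overlap` characters after the previous one, so neighboring chunks share `chunk_overlap` characters of context:

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Fixed-size chunking with overlap: a simplified stand-in for the
    package's real splitter. Each chunk starts `chunk_size - chunk_overlap`
    characters after the previous one."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]
```

The overlap keeps a sentence that straddles a chunk boundary fully visible in at least one chunk, which is why changing these values after ingestion can make old and new chunks inconsistent.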
### Document Processing Settings
```yaml
document_processing:
  # Supported file extensions
  supported_extensions:
    - ".md"
    - ".txt"
    - ".pdf"

  # Patterns to exclude from processing
  excluded_patterns:
    - ".*"
    - "node_modules/**"
    - "__pycache__/**"
    - "*.pyc"
    - ".git/**"

  # Fallback encodings for text files
  encoding_fallbacks:
    - "utf-8"
    - "latin-1"
    - "cp1252"

  # PDF processing settings
  pdf_settings:
    max_file_size_mb: 50
    extract_images: false
    ocr_enabled: false
    languages: ["eng"]
    password: null
    strategy: "fast" # Options: fast, accurate
    mode: "elements" # Options: single_page, paged, elements
```
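The `encoding_fallbacks` idea can be sketched as follows (a minimal stand-in, not the package's code): try each configured encoding in order and return the first successful decode.

```python
from pathlib import Path

def read_text_with_fallbacks(path: Path,
                             encodings=("utf-8", "latin-1", "cp1252")) -> str:
    """Try each fallback encoding in order. Note that latin-1 accepts any
    byte sequence, so a list containing it will always decode eventually."""
    data = path.read_bytes()
    for enc in encodings:
        try:
            return data.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: decode with replacement characters rather than fail.
    return data.decode(encodings[0], errors="replace")
```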
### Content Processing Settings
```yaml
content:
  chunk_size: 2000
  chunk_overlap: 400
  # Text splitting separators (in order of preference)
  separators:
    - "\n## " # h2 headers (strongest break)
    - "\n### " # h3 headers
    - "\n#### " # h4 headers
    - "\n- " # bullet points
    - "\n• " # alternative bullet points
    - "\n\n" # paragraphs
    - ". " # sentences (weakest break)
```
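The separator list drives a priority-based split, in the spirit of recursive text splitters (this is an illustrative sketch, not the exact implementation): split on the strongest separator first, and only fall back to weaker ones for pieces that are still too long.

```python
SEPARATORS = ["\n## ", "\n### ", "\n#### ", "\n- ", "\n\n", ". "]

def split_by_priority(text: str, max_len: int, seps=None) -> list[str]:
    """Recursively split oversized text on the strongest available separator,
    keeping each separator attached to the section it introduces."""
    seps = SEPARATORS if seps is None else seps
    if len(text) <= max_len or not seps:
        return [text]
    head, *rest = seps
    out = []
    for i, piece in enumerate(text.split(head)):
        if i > 0:
            piece = head + piece  # re-attach the separator to its section
        out.extend(split_by_priority(piece, max_len, rest))
    return out
```

This ordering means a document is preferentially cut at section headers, then paragraphs, and only at sentence boundaries as a last resort, so chunks tend to align with logical units.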
### Search Settings
```yaml
search:
  default_limit: 8 # Default number of results
  default_score_threshold: 0.3 # Minimum relevance score
```
### Browser Settings (Web Crawling)
```yaml
browser:
  wait_time: 2 # Base wait time in seconds
  viewport:
    width: 1920
    height: 1080
  delays:
    before_request: [1, 3] # Min and max seconds
    after_load: [2, 4]
    after_dynamic: [1, 2]
  launch_options:
    headless: true
    channel: "chrome"
  context_options:
    bypass_csp: true
    java_script_enabled: true
```
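The `[min, max]` delay ranges suggest a randomized pause between crawler actions. A sketch of the idea (the function and dict names here are hypothetical, not the package's API):

```python
import random
import time

DELAYS = {"before_request": (1, 3), "after_load": (2, 4), "after_dynamic": (1, 2)}

def polite_pause(stage: str, delays=DELAYS, sleep=time.sleep) -> float:
    """Sleep for a random duration drawn uniformly from the configured range,
    which makes request timing less mechanical and easier on target servers."""
    lo, hi = delays[stage]
    wait = random.uniform(lo, hi)
    sleep(wait)
    return wait
```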
## Understanding Search Results
Search results include relevance scores based on cosine similarity:
- Scores range from 0 to 1, where 1 indicates perfect similarity
- Default threshold is 0.3 (configurable via `search.default_score_threshold`)
- Typical interpretation:
  - 0.7+: Very high relevance (nearly exact matches)
  - 0.6 - 0.7: High relevance
  - 0.5 - 0.6: Good relevance
  - 0.3 - 0.5: Moderate relevance
  - Below 0.3: Lower relevance
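To make the scoring concrete, cosine similarity and threshold filtering can be sketched as follows (the package delegates this to Chroma; the functions below are illustrative):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product of the two vectors divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def filter_by_score(results, threshold: float = 0.3):
    """Keep only (document, score) pairs at or above the relevance threshold."""
    return [(doc, score) for doc, score in results if score >= threshold]
```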
## Features
### Core Features (No Additional Dependencies)
- Web crawling and content extraction
- Basic PDF text extraction
- Markdown and text file processing
- Vector storage and semantic search
- Configuration management
- Basic document chunking and processing
### Advanced Features (Optional Dependencies Required)
- **OCR Processing** (Requires Tesseract):
  - Scanned document processing
  - Image text extraction
  - PDF image text extraction
- **Enhanced PDF Processing** (Requires Poppler):
  - Complex layout handling
  - Table extraction
  - Technical document processing
  - Better handling of multi-column layouts
All core features work without installing optional dependencies. Install optional dependencies only if you need their specific features.
For more detailed usage instructions and examples, please refer to the [local-document-loading.md](docs/local-document-loading.md) documentation.
## Project Structure
```
rag-retriever/
├── rag_retriever/   # Main package directory
│   ├── config/      # Configuration settings
│   ├── crawling/    # Web crawling functionality
│   ├── vectorstore/ # Vector storage operations
│   ├── search/      # Search functionality
│   └── utils/       # Utility functions
```
## Dependencies
Key dependencies include:
- `openai`: Embeddings generation (text-embedding-3-large model)
- `chromadb`: Vector store implementation with cosine similarity
- `playwright`: Browser automation for JavaScript content rendering
- `beautifulsoup4`: HTML parsing
- `python-dotenv`: Environment management
## Notes
- Uses OpenAI's text-embedding-3-large model for generating embeddings by default
- Content is automatically cleaned and structured during indexing
- Implements URL depth-based crawling control
- Vector store persists between runs unless explicitly deleted
- Uses cosine similarity for more intuitive relevance scoring
- Minimal output by default with `--verbose` flag for troubleshooting
- Full content display by default with `--truncate` option for brevity
- ⚠️ Changing chunk size/overlap settings after ingesting content may lead to inconsistent search results. Consider reprocessing existing content if these settings must be changed.
## Future Development
RAG Retriever is under active development with many planned improvements. We maintain a detailed roadmap of future enhancements in our [Future Features](docs/future-features.md) document, which outlines:
- Document lifecycle management improvements
- Integration with popular documentation platforms
- Vector store analysis and visualization
- Search quality enhancements
- Performance optimizations
While the current version is fully functional for core use cases, some limitations remain and will be addressed in future releases. Check the future features document for details on potential upcoming improvements.
## Contributing
Please read CONTRIBUTING.md for details on our code of conduct and the process for submitting pull requests.
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.