# Vector DB Query
<div align="center">
<h3>🚀 Semantic Search for Your Documents with AI Integration</h3>
<p>A powerful CLI tool that indexes your documents and enables natural language search with LLM integration via MCP</p>
</div>
## 🌟 Key Features
Vector DB Query is a comprehensive solution for building searchable knowledge bases from your documents:
### 📄 Enhanced Document Processing
- **40+ File Formats**: PDF, Word, Excel, PowerPoint, HTML, Markdown, JSON, XML, Images (with OCR), and more
- **OCR Support**: Extract text from images (PNG, JPG, TIFF, BMP) with configurable languages and confidence thresholds
- **Archive Support**: Process ZIP, TAR, and compressed archives automatically
- **Smart Chunking**: Multiple strategies including sliding window, semantic, and paragraph-based
- **Metadata Extraction**: Preserve document structure, authorship, dates, and custom tags
- **Format-Specific Processing**: Tailored extraction for each file type (formulas from Excel, speaker notes from PowerPoint, etc.)
### 🔍 Advanced Semantic Search
- Natural language queries with vector similarity
- Hybrid search combining keyword and semantic matching
- Advanced filtering by file type, date, score, and metadata
- Result reranking and highlighting
- Export results in multiple formats (JSON, CSV, Markdown)
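
As a rough illustration of what "vector similarity" means here, the sketch below ranks a few toy document vectors against a query vector by cosine similarity. It is not the tool's implementation (in practice Qdrant performs the similarity search); the vectors and the `cosine_similarity` and `rank` helpers are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def rank(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted by descending similarity."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional embeddings; real embeddings have hundreds of dimensions.
docs = {
    "auth.md": [0.9, 0.1, 0.0],
    "billing.md": [0.0, 0.2, 0.9],
    "login.md": [0.8, 0.3, 0.1],
}
results = rank([1.0, 0.0, 0.0], docs)
print([doc_id for doc_id, _ in results])
```
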
### 🎨 Rich Interactive CLI
- Beautiful terminal UI powered by Rich and Textual
- Visual file browser with real-time preview
- Interactive query builder with autocomplete
- Live progress tracking with detailed statistics
- Customizable themes and output formats
### 🤖 AI Integration
- MCP server for Claude and other AI assistants
- Secure API with JWT authentication
- Rate limiting and request monitoring
- Standardized tool interface for document operations
- Real-time processing feedback
### ⚙️ Flexible Configuration
- YAML-based configuration with environment overrides
- CLI commands for configuration management
- Support for multiple configuration profiles
- Validation and health checks
- Hot-reloading of settings
### 📊 Monitoring & Management
- Real-time monitoring dashboard (Streamlit)
- System metrics and resource usage tracking
- Processing queue management
- Log aggregation and analysis
- PM2 integration for process management
### ⚡ Performance & Scalability
- Parallel processing with configurable workers
- Memory-efficient chunking and streaming
- Smart caching system
- Connection pooling for database operations
- Batch processing optimization
### 🔗 Data Source Integration (New!)
- **Gmail Integration**: Sync emails via IMAP/OAuth2 with folder selection and filtering
- **Fireflies.ai Integration**: Automatic meeting transcript sync via API and webhooks
- **Google Drive Integration**: Search and sync Gemini transcripts and documents
- **Smart Deduplication**: Cross-source duplicate detection using content hashing
- **NLP Processing**: Entity extraction, sentiment analysis, and key phrase detection
- **Selective Processing**: Configurable filters for targeted content processing
- **Real-time Monitoring**: Dashboard integration for tracking sync status
- **Setup Wizard**: Interactive configuration for easy onboarding
## 📋 Requirements
- Python 3.9 or higher
- 4GB RAM minimum (8GB recommended)
- Qdrant vector database (local or cloud)
- API key for embeddings (Google, OpenAI, etc.)
- Optional: Tesseract for OCR support
- Optional: Docker for containerized deployment
## 🚀 Quick Start
### Installation
```bash
# Install from PyPI
pip install vector-db-query
# Or install from source
git clone https://github.com/your-org/vector-db-query.git
cd vector-db-query
pip install -e .
# Install with OCR support
pip install "vector-db-query[ocr]"  # quotes keep zsh from globbing the brackets
# Also install Tesseract:
# macOS: brew install tesseract
# Ubuntu: sudo apt-get install tesseract-ocr
# Windows: Download from https://github.com/UB-Mannheim/tesseract/wiki
# Install with all features
pip install "vector-db-query[all]"
# Install additional language packs for OCR
# Ubuntu/Debian:
sudo apt-get install tesseract-ocr-fra # French
sudo apt-get install tesseract-ocr-deu # German
sudo apt-get install tesseract-ocr-spa # Spanish
# macOS:
brew install tesseract-lang
```
### Setup
```bash
# 1. Start Qdrant (using Docker)
docker run -p 6333:6333 -v "$(pwd)/qdrant_storage:/qdrant/storage" qdrant/qdrant
# 2. Configure the application
vector-db-query config setup
# 3. Process your first documents
vector-db-query process ~/Documents/my-files --recursive
# 4. Search your documents
vector-db-query query "machine learning algorithms"
# 5. Or use interactive mode for the full experience
vector-db-query interactive start
```
## 📖 Usage
### Interactive Mode (Recommended)
The interactive mode provides a rich terminal interface:
```bash
vector-db-query interactive start
```
Features:
- 📁 Visual file browser with multi-format preview
- 🔍 Interactive query builder with AI suggestions
- 📊 Beautiful result viewer with syntax highlighting
- ⚙️ Settings editor with live validation
- 📚 Built-in tutorials and examples
- 🎯 Format-specific processing options
### Command Line Mode
#### Processing Documents
```bash
# Process all supported formats
vector-db-query process /path/to/documents --recursive
# Process specific formats only
vector-db-query process /path/to/docs --formats pdf,docx,xlsx
# Process with OCR for images
vector-db-query process /path/to/images --ocr --ocr-lang eng
# Show all supported formats
vector-db-query formats
# Check format support for specific files
vector-db-query formats /path/to/file.xyz
# Process only Excel and PowerPoint files
vector-db-query process /path/to/docs --formats xlsx,pptx --recursive
# Dry run to see what would be processed
vector-db-query process /path/to/docs --dry-run --verbose
```
#### Querying Documents
```bash
# Simple natural language query
vector-db-query query "explain the authentication flow"
# Advanced search with filters
vector-db-query query "Python async" --filter file_type=py --limit 20
# Hybrid search with keyword weight
vector-db-query query "API endpoints" --hybrid --keyword-weight 0.4
# Export results in different formats
vector-db-query query "documentation" --export results.json --format json
vector-db-query query "configuration" --export results.md --format markdown
# Show query statistics
vector-db-query query "machine learning" --stats
```
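
One common way to combine keyword and semantic scores is a weighted sum. The sketch below assumes that `--keyword-weight` is the weight placed on the keyword score and that both scores are already normalized to the range 0 to 1; the scoring actually used by the tool may differ, and `hybrid_score` is a hypothetical helper.

```python
def hybrid_score(semantic, keyword, keyword_weight=0.4):
    """Weighted sum of a semantic and a keyword score, both in [0, 1]."""
    if not 0.0 <= keyword_weight <= 1.0:
        raise ValueError("keyword_weight must be between 0 and 1")
    return (1.0 - keyword_weight) * semantic + keyword_weight * keyword


# (semantic score, keyword score) per document; illustrative values only.
candidates = {
    "api-reference.md": (0.70, 0.90),
    "design-notes.md": (0.80, 0.10),
}
ranked = sorted(
    candidates,
    key=lambda name: hybrid_score(*candidates[name]),
    reverse=True,
)
print(ranked)
```

With a keyword weight of 0.4, the document with the strong keyword match outranks the one with the slightly higher semantic score; lowering the weight toward 0 shifts the ranking back toward pure semantic similarity.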
#### Configuration Management
```bash
# Show current configuration
vector-db-query config show
vector-db-query config show --format table
vector-db-query config show --section document_processing
# Get/Set configuration values
vector-db-query config get document_processing.chunk_size
vector-db-query config set document_processing.chunk_size 2000 --type int
# Validate configuration
vector-db-query config validate
# Show supported file formats
vector-db-query config formats
# Add custom format
vector-db-query config add-format .custom
# Export/Import configuration
vector-db-query config export --output my-config.yaml
vector-db-query config load custom-config.yaml --merge
# Show environment variable mappings
vector-db-query config env
```
#### Monitoring and Management
```bash
# Start monitoring dashboard (requires monitoring dependencies)
vector-db-query monitor
# Or install monitoring dependencies first:
# pip install "vector-db-query[monitoring]"
# View system status
vector-db-query status
# View processing logs
vector-db-query logging show --tail 100
vector-db-query logging search "ERROR" --context 5
# Manage processes with PM2
./scripts/pm2-manage.sh start all
./scripts/pm2-manage.sh status
./scripts/pm2-manage.sh logs mcp-server
```
### MCP Server for AI Assistants
Enable AI assistants like Claude to search your documents:
```bash
# Initialize MCP configuration
vector-db-query mcp init
# Start MCP server
vector-db-query mcp start
# Create API client
vector-db-query mcp auth create-client "claude-assistant"
# Check server status
vector-db-query mcp status
# Test with sample query
vector-db-query mcp test --query "find Python examples"
```
The MCP server provides tools for:
- Searching documents with natural language
- Processing new files in real-time
- Getting collection statistics
- Managing the vector database
- Monitoring system health
## ⚙️ Configuration
The application uses a flexible YAML-based configuration system:
```yaml
# config.yaml example
app:
  name: "Vector DB Query System"
  log_level: "INFO"

document_processing:
  chunk_size: 1000
  chunk_overlap: 200
  max_file_size_mb: 100

  file_formats:
    documents: [".pdf", ".doc", ".docx", ".txt", ".md"]
    spreadsheets: [".xlsx", ".xls", ".csv"]
    images: [".png", ".jpg", ".jpeg", ".gif", ".bmp"]
    # ... more formats

  ocr:
    enabled: true
    language: "eng"
    confidence_threshold: 60.0

vector_db:
  host: "localhost"
  port: 6333
  collection_name: "documents"

embedding:
  model: "embedding-001"
  dimensions: 768

# ... more settings
```
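
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sliding-window chunker. It assumes both values are measured in characters, which the configuration above does not specify (they could equally be tokens); `sliding_window_chunks` is a hypothetical helper, not part of the package.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


chunks = sliding_window_chunks("x" * 2500)
print([len(c) for c in chunks])
```

A larger overlap preserves more context across chunk boundaries at the cost of storing and embedding more chunks.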
### Environment Variables
Override configuration with environment variables:
```bash
export VECTOR_DB_LOG_LEVEL=DEBUG
export QDRANT_HOST=remote-server.com
export QDRANT_PORT=6334
export EMBEDDING_MODEL=text-embedding-ada-002
export OCR_LANGUAGE=eng+fra+deu
export CHUNK_SIZE=1500
```
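
The override behavior can be pictured as a small mapping from environment variable names to dotted configuration paths. The mapping below is an assumption inferred from the examples above (the authoritative list is whatever `vector-db-query config env` prints), and `apply_env_overrides` is a hypothetical helper.

```python
import copy

# Assumed mapping: env var -> (dotted config path, type converter).
ENV_MAP = {
    "VECTOR_DB_LOG_LEVEL": ("app.log_level", str),
    "QDRANT_HOST": ("vector_db.host", str),
    "QDRANT_PORT": ("vector_db.port", int),
    "CHUNK_SIZE": ("document_processing.chunk_size", int),
}


def apply_env_overrides(config, environ):
    """Return a copy of config with values from environ applied on top."""
    merged = copy.deepcopy(config)
    for var, (path, convert) in ENV_MAP.items():
        if var not in environ:
            continue
        *parents, leaf = path.split(".")
        node = merged
        for key in parents:
            node = node.setdefault(key, {})  # create missing sections
        node[leaf] = convert(environ[var])
    return merged


base = {
    "app": {"log_level": "INFO"},
    "vector_db": {"host": "localhost", "port": 6333},
}
# In real use, pass os.environ instead of a literal dict.
env = {"QDRANT_PORT": "6334", "VECTOR_DB_LOG_LEVEL": "DEBUG"}
merged = apply_env_overrides(base, env)
print(merged)
```
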
## 🧩 Supported File Formats
### Documents
- **PDF** (`.pdf`) - Full text extraction with layout preservation
- **Microsoft Word** (`.doc`, `.docx`) - Text, tables, headers/footers, and comments
- **OpenDocument Text** (`.odt`) - ODT format support
- **Rich Text Format** (`.rtf`) - RTF document processing
- **Plain Text** (`.txt`, `.text`) - With encoding detection
- **Markdown** (`.md`, `.markdown`) - Preserves structure and formatting
### Spreadsheets
- **Microsoft Excel** (`.xlsx`, `.xls`) - Extracts:
  - Cell values and formulas
  - Comments and notes
  - Multiple sheets
  - Table structures
- **CSV** (`.csv`, `.tsv`) - Tabular data processing
- **OpenDocument Spreadsheet** (`.ods`) - ODS format support
### Presentations
- **Microsoft PowerPoint** (`.pptx`, `.ppt`) - Extracts:
  - Slide content and titles
  - Speaker notes
  - Table data
  - Slide numbers and structure
- **OpenDocument Presentation** (`.odp`) - ODP format support
### Email
- **Email Messages** (`.eml`) - Extracts:
  - Headers (From, To, Subject, Date)
  - Body content (text/HTML)
  - Attachments (processed recursively)
  - Thread detection
- **Mailbox** (`.mbox`) - Multi-message archive support
- **Outlook Message** (`.msg`) - MSG format support
### Web & Markup
- **HTML** (`.html`, `.htm`, `.xhtml`) - Features:
  - Script/style removal
  - Text extraction with structure
  - Link preservation
  - Optional markdown conversion
- **XML** (`.xml`) - Structured data extraction
### Configuration & Data
- **JSON** (`.json`) - Pretty-printed extraction
- **YAML** (`.yaml`, `.yml`) - Multi-document support
- **INI/Config** (`.ini`, `.cfg`, `.conf`) - Section-based extraction
- **TOML** (`.toml`) - TOML format support
- **Log Files** (`.log`) - Features:
  - Pattern extraction
  - Summary generation
  - Configurable line limits
### Images (with OCR)
Requires Tesseract installation:
- **PNG** (`.png`) - Lossless image format
- **JPEG** (`.jpg`, `.jpeg`) - Common photo format
- **TIFF** (`.tiff`, `.tif`) - Multi-page support
- **BMP** (`.bmp`) - Bitmap images
- **GIF** (`.gif`) - Graphics format
### Archives
- **ZIP** (`.zip`)
- **TAR** (`.tar`, `.tar.gz`, `.tar.bz2`, `.tar.xz`)
- **7-Zip** (`.7z`)
### Logs
- **Log Files** (`.log`)
## 🔧 Advanced Features
### OCR Configuration
```bash
# Install Tesseract
# macOS
brew install tesseract
# Ubuntu/Debian
sudo apt-get install tesseract-ocr
# Install additional languages
sudo apt-get install tesseract-ocr-fra tesseract-ocr-deu
# Configure OCR in vector-db-query
vector-db-query config set document_processing.ocr.enabled true
vector-db-query config set document_processing.ocr.language "eng"
vector-db-query config set document_processing.ocr.confidence_threshold 60.0
```
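
The `confidence_threshold` setting can be read as a filter on per-word OCR confidence. Tesseract reports word confidences on a 0 to 100 scale; the sketch below assumes that words below the threshold are dropped, which is one plausible behavior rather than a documented one. The input format and `filter_by_confidence` are hypothetical.

```python
def filter_by_confidence(words, threshold=60.0):
    """Keep OCR words whose confidence (0-100) meets the threshold."""
    return " ".join(text for text, conf in words if conf >= threshold)


# (word, confidence) pairs as an OCR engine might report them.
ocr_words = [("Invoice", 96.2), ("N0.", 41.5), ("2024-117", 88.0), ("~~", 12.3)]
print(filter_by_confidence(ocr_words))
```

Raising the threshold trades recall for cleaner text: fewer misread words reach the index, but faint or skewed text may be lost entirely.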
### Format-Specific Configuration
Configure processing behavior for each file format:
```yaml
# config/default.yaml
document_processing:
  format_settings:
    excel:
      extract_formulas: true
      extract_comments: true
      process_all_sheets: true
      max_rows_per_sheet: 10000

    powerpoint:
      extract_speaker_notes: true
      extract_slide_numbers: true
      include_master_slides: false

    email:
      extract_attachments: true
      thread_detection: true
      sanitize_content: true
      include_headers: true

    html:
      remove_scripts: true
      remove_styles: true
      convert_to_markdown: false
      preserve_links: true

    logs:
      summarize: true
      extract_patterns: true
      max_lines: 10000
```
Or use environment variables:
```bash
export VECTOR_DB_EXCEL_EXTRACT_FORMULAS=true
export VECTOR_DB_EXCEL_MAX_ROWS=5000
export VECTOR_DB_EMAIL_EXTRACT_ATTACHMENTS=true
export VECTOR_DB_HTML_CONVERT_MARKDOWN=true
export VECTOR_DB_LOG_SUMMARIZE=true
```
### Batch Processing
```python
# Python script for batch processing
from vector_db_query import DocumentProcessor

processor = DocumentProcessor(
    chunk_size=1500,
    chunk_overlap=300,
    parallel_workers=8,
)


# Process with progress callback
def on_progress(current, total, file_name):
    print(f"Processing {file_name}: {current}/{total}")


documents = processor.process_directory(
    "/path/to/documents",
    recursive=True,
    progress_callback=on_progress,
)
```
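
For readers curious how `parallel_workers` and `progress_callback` might fit together, the following is a generic sketch built on the standard library's `concurrent.futures`. It does not reflect the internals of `DocumentProcessor`; `process_file` and `process_all` are hypothetical stand-ins for real document parsing.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_file(path):
    """Stand-in for real parsing: return the path and its name length."""
    return path, len(path)


def process_all(paths, parallel_workers=4, progress_callback=None):
    """Process paths concurrently, reporting progress as each one finishes."""
    results = {}
    with ThreadPoolExecutor(max_workers=parallel_workers) as pool:
        futures = [pool.submit(process_file, p) for p in paths]
        for done, future in enumerate(as_completed(futures), start=1):
            path, value = future.result()
            results[path] = value
            if progress_callback is not None:
                progress_callback(done, len(paths), path)
    return results


seen = []
out = process_all(
    ["a.pdf", "notes.md"],
    parallel_workers=2,
    progress_callback=lambda cur, total, name: seen.append((cur, total)),
)
print(out)
```

Because the callback runs in the calling thread as futures complete, it can update a progress bar safely without extra locking.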
### Custom Embeddings
```python
# Use custom embedding models
from vector_db_query import DocumentProcessor, EmbeddingService

# Configure custom model
embedding_service = EmbeddingService(
    model="custom-model",
    api_key="your-api-key",
    dimensions=1536,
)

# Process with custom embeddings
processor = DocumentProcessor(
    embedding_service=embedding_service,
)
```
## 📊 Monitoring Dashboard
The built-in monitoring dashboard provides real-time insights:
```bash
# Start the dashboard
vector-db-query monitor start
# Access at http://localhost:8501
```
Features:
- System resource usage (CPU, Memory, Disk)
- Processing queue status
- Document processing statistics
- Error logs and alerts
- Performance metrics
## 🐳 Docker Support
Run everything in containers:
```bash
# Build the image
docker build -t vector-db-query .
# Run with docker-compose
docker-compose up -d
# Access services
# - API: http://localhost:5000
# - Qdrant: http://localhost:6333
# - Dashboard: http://localhost:8501
```
## 🧪 Testing
```bash
# Run all tests
pytest
# Run specific test categories
pytest tests/test_readers/
pytest tests/test_cli/
# Run with coverage
pytest --cov=vector_db_query
# Run integration tests
pytest tests/integration/ --integration
```
## 📚 Documentation
### Guides
- [Getting Started Guide](docs/getting-started.md)
- [File Formats Guide](docs/file-formats-guide.md) - Detailed information about all supported formats
- [Usage Examples](docs/usage-examples.md) - Practical examples for common use cases
- [Configuration Guide](docs/configuration-guide.md)
- [CLI Features Guide](docs/enhanced-cli-features.md)
### API Documentation
- [Document Readers API](docs/api/readers.md) - API reference for all document readers
- [Full API Reference](https://vector-db-query.readthedocs.io)
### Integration & Deployment
- [MCP Integration Guide](docs/mcp_integration_guide.md)
- [Monitoring Setup Guide](docs/monitoring-setup.md)
- [Monitoring System Guide](docs/monitoring-system.md)
## 🤝 Contributing
We welcome contributions! Please see our [Contributing Guide](CONTRIBUTING.md) for details.
## 📄 License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## 📊 Data Sources
The Data Sources feature automatically synchronizes content from multiple external sources into your vector database.
### Quick Start
```bash
# Run interactive setup wizard
vdq setup
# Or use quick start guide
vdq quickstart
# Start syncing data
vdq datasources sync
# Monitor sync status
vdq monitor
```
### Key Capabilities
#### Gmail Integration
- OAuth2 authentication for secure access
- Folder selection (INBOX, Sent, Drafts, etc.)
- Advanced filtering (sender whitelist/blacklist, patterns)
- Attachment processing
- Thread detection and grouping
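
Sender allow and deny lists with patterns can be expressed with shell-style wildcards. This sketch assumes that deny rules take precedence and that an empty allow list permits every sender; both are assumptions, as is the `sender_allowed` helper.

```python
from fnmatch import fnmatchcase


def sender_allowed(sender, allow=(), deny=()):
    """Decide whether to sync a message based on its sender address."""
    address = sender.lower()
    if any(fnmatchcase(address, pattern.lower()) for pattern in deny):
        return False  # deny rules win over allow rules
    if not allow:
        return True  # no allow list means everyone is allowed
    return any(fnmatchcase(address, pattern.lower()) for pattern in allow)


allow = ["*@example.com"]
deny = ["noreply@*"]
print(sender_allowed("Alice@example.com", allow, deny))
print(sender_allowed("noreply@example.com", allow, deny))
```
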
#### Fireflies.ai Integration
- API-based transcript sync
- Real-time webhook support
- Meeting duration and platform filters
- Speaker identification
- Automatic summary extraction
#### Google Drive Integration
- OAuth2 authentication
- Pattern-based file search (e.g., "Notes by Gemini")
- Folder-specific sync
- Shared drive support
- File type filtering
#### Advanced Processing
- **Deduplication**: Content-based hashing to prevent duplicates
- **NLP Analysis**: Extract entities, sentiment, and key phrases
- **Selective Processing**: Rule-based filtering system
- **Performance**: Parallel processing with rate limiting
- **Monitoring**: Real-time dashboard with metrics
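
Content-based deduplication is typically implemented by hashing normalized text and skipping anything already seen. The sketch below uses SHA-256 over whitespace-normalized, lowercased content; the normalization rules and the `Deduplicator` class are assumptions for illustration, not the package's API.

```python
import hashlib


def content_hash(text):
    """SHA-256 of the text after collapsing whitespace and lowercasing."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class Deduplicator:
    """Tracks content hashes across sources and flags repeats."""

    def __init__(self):
        self._seen = {}  # digest -> first source that supplied it

    def is_duplicate(self, source, text):
        """Return True if equivalent content was already registered."""
        digest = content_hash(text)
        if digest in self._seen:
            return True
        self._seen[digest] = source
        return False


dedup = Deduplicator()
first = dedup.is_duplicate("gmail", "Meeting notes:  Q3 planning")
second = dedup.is_duplicate("drive", "meeting notes: q3 planning\n")
print(first, second)
```

Hashing only catches exact matches after normalization; near-duplicates (for example, the same transcript with a different footer) would need a fuzzier technique.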
### Configuration
The system can be configured via:
- Interactive setup wizard: `vdq setup`
- Configuration file: `config/default.yaml`
- Environment variables for sensitive data
- Web UI through monitoring dashboard
### Documentation
- [Setup Wizard Guide](docs/setup-wizard.md)
- [Data Sources Operations Guide](docs/data-sources-operations.md)
- [Troubleshooting Guide](docs/troubleshooting-guide.md)
- [Deployment Guide](docs/deployment-guide.md)
- [Maintenance Procedures](docs/maintenance-procedures.md)
## Acknowledgments
- [Qdrant](https://qdrant.tech/) for the excellent vector database
- [Rich](https://github.com/Textualize/rich) for beautiful terminal formatting
- [Textual](https://github.com/Textualize/textual) for the interactive TUI
- [MCP](https://modelcontextprotocol.io/) for AI integration standards
- All our contributors and users!
---
<div align="center">
<p>Built with ❤️ by the Vector DB Query Team</p>
<p>
<a href="https://github.com/your-org/vector-db-query">GitHub</a> •
<a href="https://vector-db-query.readthedocs.io">Documentation</a> •
<a href="https://github.com/your-org/vector-db-query/issues">Issues</a>
</p>
</div>
[Maintenance Procedures](docs/maintenance-procedures.md)\n\n## \ud83d\ude4f Acknowledgments\n\n- [Qdrant](https://qdrant.tech/) for the excellent vector database\n- [Rich](https://github.com/Textualize/rich) for beautiful terminal formatting\n- [Textual](https://github.com/Textualize/textual) for the interactive TUI\n- [MCP](https://modelcontextprotocol.io/) for AI integration standards\n- All our contributors and users!\n\n---\n\n<div align=\"center\">\n <p>Built with \u2764\ufe0f by the Vector DB Query Team</p>\n <p>\n <a href=\"https://github.com/your-org/vector-db-query\">GitHub</a> \u2022\n <a href=\"https://vector-db-query.readthedocs.io\">Documentation</a> \u2022\n <a href=\"https://github.com/your-org/vector-db-query/issues\">Issues</a>\n </p>\n</div>\n",
"bugtrack_url": null,
"license": "MIT",
"summary": "CLI application for vector database queries using LLMs via MCP",
"version": "1.0.0",
"project_urls": {
"Changelog": "https://github.com/yourusername/vector-db-query/blob/main/CHANGELOG.md",
"Documentation": "https://vector-db-query.readthedocs.io",
"Homepage": "https://github.com/yourusername/vector-db-query",
"Issues": "https://github.com/yourusername/vector-db-query/issues",
"Repository": "https://github.com/yourusername/vector-db-query"
},
"split_keywords": [
"vector-database",
" llm",
" mcp",
" embeddings",
" qdrant"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "cc2b79ac973d6c9493e3044e34aad23d782419a3ae5a7cdfa7be79b2df0cd2e2",
"md5": "35ffba95127bef6f0754ae84e9ad6b41",
"sha256": "d56616c296c08898e2fba0258fbfa55e32a0beb72d7b0637a0ceb7d0c9b7a98e"
},
"downloads": -1,
"filename": "vector_db_query-1.0.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "35ffba95127bef6f0754ae84e9ad6b41",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9",
"size": 1047730,
"upload_time": "2025-08-04T02:33:41",
"upload_time_iso_8601": "2025-08-04T02:33:41.453702Z",
"url": "https://files.pythonhosted.org/packages/cc/2b/79ac973d6c9493e3044e34aad23d782419a3ae5a7cdfa7be79b2df0cd2e2/vector_db_query-1.0.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "e0ffcbb2b361a918b4c9126546a623cebd152e753e1160c537b4c85f29e9094a",
"md5": "efc5a4b8e9afa4e15afe172db0dcfc60",
"sha256": "bc94c496518e538bd0693fa345565906c9335eb49121b45331e1f1a394f8d2ac"
},
"downloads": -1,
"filename": "vector_db_query-1.0.0.tar.gz",
"has_sig": false,
"md5_digest": "efc5a4b8e9afa4e15afe172db0dcfc60",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9",
"size": 1118513,
"upload_time": "2025-08-04T02:33:46",
"upload_time_iso_8601": "2025-08-04T02:33:46.312387Z",
"url": "https://files.pythonhosted.org/packages/e0/ff/cbb2b361a918b4c9126546a623cebd152e753e1160c537b4c85f29e9094a/vector_db_query-1.0.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-08-04 02:33:46",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "your-org",
"github_project": "vector-db-query",
"github_not_found": true,
"lcname": "vector-db-query"
}