vector-db-query

- Name: vector-db-query
- Version: 1.0.0
- Home page: https://github.com/your-org/vector-db-query
- Summary: CLI application for vector database queries using LLMs via MCP
- Upload time: 2025-08-04 02:33:46
- Author: Vector DB Query Team
- Requires Python: >=3.9
- License: MIT
- Keywords: vector-database, llm, mcp, embeddings, qdrant

# Vector DB Query

<div align="center">
  <h3>🚀 Semantic Search for Your Documents with AI Integration</h3>
  <p>A powerful CLI tool that indexes your documents and enables natural language search with LLM integration via MCP</p>
  
  [![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
  [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
  [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
  [![Documentation Status](https://readthedocs.org/projects/vector-db-query/badge/?version=latest)](https://vector-db-query.readthedocs.io/en/latest/?badge=latest)
</div>

## 🌟 Key Features

Vector DB Query is a comprehensive solution for building searchable knowledge bases from your documents:

### 📄 Enhanced Document Processing
- **40+ File Formats**: PDF, Word, Excel, PowerPoint, HTML, Markdown, JSON, XML, Images (with OCR), and more
- **OCR Support**: Extract text from images (PNG, JPG, TIFF, BMP) with configurable languages and confidence thresholds
- **Archive Support**: Process ZIP, TAR, and compressed archives automatically
- **Smart Chunking**: Multiple strategies including sliding window, semantic, and paragraph-based
- **Metadata Extraction**: Preserve document structure, authorship, dates, and custom tags
- **Format-Specific Processing**: Tailored extraction for each file type (formulas from Excel, speaker notes from PowerPoint, etc.)

### 🔍 Advanced Semantic Search
- Natural language queries with vector similarity
- Hybrid search combining keyword and semantic matching
- Advanced filtering by file type, date, score, and metadata
- Result reranking and highlighting
- Export results in multiple formats (JSON, CSV, Markdown)

### 🎨 Rich Interactive CLI
- Beautiful terminal UI powered by Rich and Textual
- Visual file browser with real-time preview
- Interactive query builder with autocomplete
- Live progress tracking with detailed statistics
- Customizable themes and output formats

### 🤖 AI Integration
- MCP server for Claude and other AI assistants
- Secure API with JWT authentication
- Rate limiting and request monitoring
- Standardized tool interface for document operations
- Real-time processing feedback

### ⚙️ Flexible Configuration
- YAML-based configuration with environment overrides
- CLI commands for configuration management
- Support for multiple configuration profiles
- Validation and health checks
- Hot-reloading of settings

### 📊 Monitoring & Management
- Real-time monitoring dashboard (Streamlit)
- System metrics and resource usage tracking
- Processing queue management
- Log aggregation and analysis
- PM2 integration for process management

### ⚡ Performance & Scalability
- Parallel processing with configurable workers
- Memory-efficient chunking and streaming
- Smart caching system
- Connection pooling for database operations
- Batch processing optimization

### 🔗 Data Source Integration (New!)
- **Gmail Integration**: Sync emails via IMAP/OAuth2 with folder selection and filtering
- **Fireflies.ai Integration**: Automatic meeting transcript sync via API and webhooks
- **Google Drive Integration**: Search and sync Gemini transcripts and documents
- **Smart Deduplication**: Cross-source duplicate detection using content hashing
- **NLP Processing**: Entity extraction, sentiment analysis, and key phrase detection
- **Selective Processing**: Configurable filters for targeted content processing
- **Real-time Monitoring**: Dashboard integration for tracking sync status
- **Setup Wizard**: Interactive configuration for easy onboarding

## 📋 Requirements

- Python 3.9 or higher
- 4GB RAM minimum (8GB recommended)
- Qdrant vector database (local or cloud)
- API key for embeddings (Google, OpenAI, etc.)
- Optional: Tesseract for OCR support
- Optional: Docker for containerized deployment

## 🚀 Quick Start

### Installation

```bash
# Install from PyPI
pip install vector-db-query

# Or install from source
git clone https://github.com/your-org/vector-db-query.git
cd vector-db-query
pip install -e .

# Install with OCR support
pip install "vector-db-query[ocr]"  # quoted so shells such as zsh do not expand the brackets
# Also install Tesseract:
# macOS: brew install tesseract
# Ubuntu: sudo apt-get install tesseract-ocr
# Windows: Download from https://github.com/UB-Mannheim/tesseract/wiki

# Install with all features
pip install "vector-db-query[all]"

# Install additional language packs for OCR
# Ubuntu/Debian:
sudo apt-get install tesseract-ocr-fra  # French
sudo apt-get install tesseract-ocr-deu  # German
sudo apt-get install tesseract-ocr-spa  # Spanish

# macOS:
brew install tesseract-lang
```

### Setup

```bash
# 1. Start Qdrant (using Docker)
docker run -p 6333:6333 -v "$(pwd)/qdrant_storage:/qdrant/storage" qdrant/qdrant

# 2. Configure the application
vector-db-query config setup

# 3. Process your first documents
vector-db-query process ~/Documents/my-files --recursive

# 4. Search your documents
vector-db-query query "machine learning algorithms"

# 5. Or use interactive mode for the full experience
vector-db-query interactive start
```

## 📖 Usage

### Interactive Mode (Recommended)

The interactive mode provides a rich terminal interface:

```bash
vector-db-query interactive start
```

Features:
- 📁 Visual file browser with multi-format preview
- 🔍 Interactive query builder with AI suggestions
- 📊 Beautiful result viewer with syntax highlighting
- ⚙️ Settings editor with live validation
- 📚 Built-in tutorials and examples
- 🎯 Format-specific processing options

### Command Line Mode

#### Processing Documents

```bash
# Process all supported formats
vector-db-query process /path/to/documents --recursive

# Process specific formats only
vector-db-query process /path/to/docs --formats pdf,docx,xlsx

# Process with OCR for images
vector-db-query process /path/to/images --ocr --ocr-lang eng

# Show all supported formats
vector-db-query formats

# Check format support for specific files
vector-db-query formats /path/to/file.xyz

# Process only Excel and PowerPoint files
vector-db-query process /path/to/docs --formats xlsx,pptx --recursive

# Dry run to see what would be processed
vector-db-query process /path/to/docs --dry-run --verbose
```
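
The `--formats`, `--recursive`, and `--dry-run` options above amount to selecting files by extension before any processing happens. The following is a minimal sketch of that selection step, not the tool's actual implementation; `select_files` is a hypothetical helper name, and the assumption is that formats are matched case-insensitively on the file extension.

```python
from pathlib import Path


def select_files(root, formats, recursive=False):
    """Return files under `root` whose extension is listed in `formats`.

    Hypothetical helper: `formats` accepts entries with or without a
    leading dot ("pdf" or ".pdf"); matching ignores case.
    """
    wanted = {"." + fmt.lower().lstrip(".") for fmt in formats}
    pattern = "**/*" if recursive else "*"
    return sorted(
        path
        for path in Path(root).glob(pattern)
        if path.is_file() and path.suffix.lower() in wanted
    )
```

Printing the returned paths without processing them gives the equivalent of a dry run, for example `select_files("/path/to/docs", "pdf,docx,xlsx".split(","), recursive=True)`.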

#### Querying Documents

```bash
# Simple natural language query
vector-db-query query "explain the authentication flow"

# Advanced search with filters
vector-db-query query "Python async" --filter file_type=py --limit 20

# Hybrid search with keyword weight
vector-db-query query "API endpoints" --hybrid --keyword-weight 0.4

# Export results in different formats
vector-db-query query "documentation" --export results.json --format json
vector-db-query query "configuration" --export results.md --format markdown

# Show query statistics
vector-db-query query "machine learning" --stats
```
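
To give an intuition for `--hybrid --keyword-weight 0.4`, here is a minimal sketch of one common way to blend a keyword score with a semantic score: a weighted linear combination. This is an assumption about how such a weight is typically applied, not a description of this tool's internals; `hybrid_score` and `rank` are hypothetical names, and both input scores are assumed to be normalized to the range 0 to 1.

```python
def hybrid_score(semantic, keyword, keyword_weight=0.4):
    """Blend two scores in [0, 1]; the remaining weight goes to the semantic score."""
    if not 0.0 <= keyword_weight <= 1.0:
        raise ValueError("keyword_weight must be between 0 and 1")
    return keyword_weight * keyword + (1.0 - keyword_weight) * semantic


def rank(results, keyword_weight=0.4):
    """Sort (doc_id, semantic_score, keyword_score) tuples by blended score, best first."""
    scored = [
        (doc_id, hybrid_score(semantic, keyword, keyword_weight))
        for doc_id, semantic, keyword in results
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

With a weight of 0 the ranking is purely semantic; raising the weight lets exact keyword matches overtake results that are only semantically close.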

#### Configuration Management

```bash
# Show current configuration
vector-db-query config show
vector-db-query config show --format table
vector-db-query config show --section document_processing

# Get/Set configuration values
vector-db-query config get document_processing.chunk_size
vector-db-query config set document_processing.chunk_size 2000 --type int

# Validate configuration
vector-db-query config validate

# Show supported file formats
vector-db-query config formats

# Add custom format
vector-db-query config add-format .custom

# Export/Import configuration
vector-db-query config export --output my-config.yaml
vector-db-query config load custom-config.yaml --merge

# Show environment variable mappings
vector-db-query config env
```
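
Keys such as `document_processing.chunk_size` address a value inside the nested YAML structure. The sketch below shows how dotted keys can be resolved against a nested dictionary, with an optional cast that plays the role of `--type int`. The helper names `get_value` and `set_value` are hypothetical and only illustrate the idea.

```python
def get_value(config, dotted_key):
    """Walk a nested dict following the parts of a dotted key."""
    node = config
    for part in dotted_key.split("."):
        node = node[part]  # raises KeyError if the key does not exist
    return node


def set_value(config, dotted_key, value, cast=None):
    """Set a nested value, creating intermediate sections as needed."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for part in parents:
        node = node.setdefault(part, {})
    node[leaf] = cast(value) if cast is not None else value
```

For example, `set_value(cfg, "document_processing.chunk_size", "2000", cast=int)` stores the integer `2000` rather than the string typed on the command line.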

#### Monitoring and Management

```bash
# Start monitoring dashboard (requires monitoring dependencies)
vector-db-query monitor
# Or install monitoring dependencies first:
# pip install "vector-db-query[monitoring]"

# View system status
vector-db-query status

# View processing logs
vector-db-query logging show --tail 100
vector-db-query logging search "ERROR" --context 5

# Manage processes with PM2
./scripts/pm2-manage.sh start all
./scripts/pm2-manage.sh status
./scripts/pm2-manage.sh logs mcp-server
```

### MCP Server for AI Assistants

Enable AI assistants like Claude to search your documents:

```bash
# Initialize MCP configuration
vector-db-query mcp init

# Start MCP server
vector-db-query mcp start

# Create API client
vector-db-query mcp auth create-client "claude-assistant"

# Check server status
vector-db-query mcp status

# Test with sample query
vector-db-query mcp test --query "find Python examples"
```

The MCP server provides tools for:
- Searching documents with natural language
- Processing new files in real-time
- Getting collection statistics
- Managing the vector database
- Monitoring system health

## ⚙️ Configuration

The application uses a flexible YAML-based configuration system:

```yaml
# config.yaml example
app:
  name: "Vector DB Query System"
  log_level: "INFO"

document_processing:
  chunk_size: 1000
  chunk_overlap: 200
  max_file_size_mb: 100
  
  file_formats:
    documents: [".pdf", ".doc", ".docx", ".txt", ".md"]
    spreadsheets: [".xlsx", ".xls", ".csv"]
    images: [".png", ".jpg", ".jpeg", ".gif", ".bmp"]
    # ... more formats
  
  ocr:
    enabled: true
    language: "eng"
    confidence_threshold: 60.0

vector_db:
  host: "localhost"
  port: 6333
  collection_name: "documents"

embedding:
  model: "embedding-001"
  dimensions: 768

# ... more settings
```
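
To illustrate how `chunk_size` and `chunk_overlap` interact, here is a minimal sliding-window sketch. It assumes both values are measured in characters, which the configuration above does not state, and `sliding_window_chunks` is a hypothetical function name rather than part of the package.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Split text into windows of chunk_size that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks
```

With the defaults shown in the YAML, each window advances by 800 characters, so the last 200 characters of one chunk reappear at the start of the next.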

### Environment Variables

Override configuration with environment variables:

```bash
export VECTOR_DB_LOG_LEVEL=DEBUG
export QDRANT_HOST=remote-server.com
export QDRANT_PORT=6334
export EMBEDDING_MODEL=text-embedding-ada-002
export OCR_LANGUAGE=eng+fra+deu
export CHUNK_SIZE=1500
```
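
The sketch below shows one way environment variables like these can be layered over values loaded from YAML. The variable names come from the example above, but the mapping of each variable to a configuration key and type is an assumption made for illustration; `ENV_OVERRIDES` and `apply_env_overrides` are hypothetical names.

```python
import copy
import os

# Assumed mapping: environment variable -> (path into the config, type cast).
ENV_OVERRIDES = {
    "VECTOR_DB_LOG_LEVEL": (("app", "log_level"), str),
    "QDRANT_HOST": (("vector_db", "host"), str),
    "QDRANT_PORT": (("vector_db", "port"), int),
    "EMBEDDING_MODEL": (("embedding", "model"), str),
    "OCR_LANGUAGE": (("document_processing", "ocr", "language"), str),
    "CHUNK_SIZE": (("document_processing", "chunk_size"), int),
}


def apply_env_overrides(config, environ=None):
    """Return a copy of config with any set environment variables applied."""
    environ = os.environ if environ is None else environ
    merged = copy.deepcopy(config)  # leave the loaded YAML values untouched
    for name, (path, cast) in ENV_OVERRIDES.items():
        if name not in environ:
            continue
        node = merged
        for part in path[:-1]:
            node = node.setdefault(part, {})
        node[path[-1]] = cast(environ[name])
    return merged
```

Casting matters here because environment variables are always strings, while settings such as the port and chunk size are integers in the YAML file.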

## 🧩 Supported File Formats

### Documents
- **PDF** (`.pdf`) - Full text extraction with layout preservation
- **Microsoft Word** (`.doc`, `.docx`) - Text, tables, headers/footers, and comments
- **OpenDocument Text** (`.odt`) - ODT format support
- **Rich Text Format** (`.rtf`) - RTF document processing
- **Plain Text** (`.txt`, `.text`) - With encoding detection
- **Markdown** (`.md`, `.markdown`) - Preserves structure and formatting

### Spreadsheets
- **Microsoft Excel** (`.xlsx`, `.xls`) - Extracts:
  - Cell values and formulas
  - Comments and notes
  - Multiple sheets
  - Table structures
- **CSV** (`.csv`, `.tsv`) - Tabular data processing
- **OpenDocument Spreadsheet** (`.ods`) - ODS format support

### Presentations
- **Microsoft PowerPoint** (`.pptx`, `.ppt`) - Extracts:
  - Slide content and titles
  - Speaker notes
  - Table data
  - Slide numbers and structure
- **OpenDocument Presentation** (`.odp`) - ODP format support

### Email
- **Email Messages** (`.eml`) - Extracts:
  - Headers (From, To, Subject, Date)
  - Body content (text/HTML)
  - Attachments (processed recursively)
  - Thread detection
- **Mailbox** (`.mbox`) - Multi-message archive support
- **Outlook Message** (`.msg`) - MSG format support

### Web & Markup
- **HTML** (`.html`, `.htm`, `.xhtml`) - Features:
  - Script/style removal
  - Text extraction with structure
  - Link preservation
  - Optional markdown conversion
- **XML** (`.xml`) - Structured data extraction

### Configuration & Data
- **JSON** (`.json`) - Pretty-printed extraction
- **YAML** (`.yaml`, `.yml`) - Multi-document support
- **INI/Config** (`.ini`, `.cfg`, `.conf`) - Section-based extraction
- **TOML** (`.toml`) - TOML format support
- **Log Files** (`.log`) - Features:
  - Pattern extraction
  - Summary generation
  - Configurable line limits

### Images (with OCR)
Requires Tesseract installation:
- **PNG** (`.png`) - Lossless image format
- **JPEG** (`.jpg`, `.jpeg`) - Common photo format
- **TIFF** (`.tiff`, `.tif`) - Multi-page support
- **BMP** (`.bmp`) - Bitmap images
- **GIF** (`.gif`) - Graphics format

### Archives
- **ZIP** (`.zip`)
- **TAR** (`.tar`, `.tar.gz`, `.tar.bz2`, `.tar.xz`)
- **7-Zip** (`.7z`)

## 🔧 Advanced Features

### OCR Configuration

```bash
# Install Tesseract
# macOS
brew install tesseract

# Ubuntu/Debian
sudo apt-get install tesseract-ocr

# Install additional languages
sudo apt-get install tesseract-ocr-fra tesseract-ocr-deu

# Configure OCR in vector-db-query
vector-db-query config set document_processing.ocr.enabled true
vector-db-query config set document_processing.ocr.language "eng"
vector-db-query config set document_processing.ocr.confidence_threshold 60.0
```
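
As a rough illustration of what a `confidence_threshold` of 60.0 can mean, the sketch below drops recognized words whose confidence falls below the threshold. It assumes per-word confidences on a 0 to 100 scale, as Tesseract reports them, and it assumes the threshold is applied per word; the tool may apply it differently. `filter_by_confidence` is a hypothetical helper.

```python
def filter_by_confidence(words, threshold=60.0):
    """Join the recognized words whose confidence meets the threshold.

    `words` is an iterable of (text, confidence) pairs. Empty entries and
    entries with a negative confidence (non-text regions) are skipped.
    """
    kept = [
        text
        for text, confidence in words
        if text.strip() and confidence >= threshold
    ]
    return " ".join(kept)
```

Raising the threshold trades recall for cleaner text: fewer misread words reach the index, at the cost of losing some correctly recognized but low-confidence ones.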

### Format-Specific Configuration

Configure processing behavior for each file format:

```yaml
# config/default.yaml
document_processing:
  format_settings:
    excel:
      extract_formulas: true
      extract_comments: true
      process_all_sheets: true
      max_rows_per_sheet: 10000
    
    powerpoint:
      extract_speaker_notes: true
      extract_slide_numbers: true
      include_master_slides: false
    
    email:
      extract_attachments: true
      thread_detection: true
      sanitize_content: true
      include_headers: true
    
    html:
      remove_scripts: true
      remove_styles: true
      convert_to_markdown: false
      preserve_links: true
    
    logs:
      summarize: true
      extract_patterns: true
      max_lines: 10000
```

Or use environment variables:

```bash
export VECTOR_DB_EXCEL_EXTRACT_FORMULAS=true
export VECTOR_DB_EXCEL_MAX_ROWS=5000
export VECTOR_DB_EMAIL_EXTRACT_ATTACHMENTS=true
export VECTOR_DB_HTML_CONVERT_MARKDOWN=true
export VECTOR_DB_LOG_SUMMARIZE=true
```

### Batch Processing

```python
# Python script for batch processing
from vector_db_query import DocumentProcessor

processor = DocumentProcessor(
    chunk_size=1500,
    chunk_overlap=300,
    parallel_workers=8
)

# Process with progress callback
def on_progress(current, total, file_name):
    print(f"Processing {file_name}: {current}/{total}")

documents = processor.process_directory(
    "/path/to/documents",
    recursive=True,
    progress_callback=on_progress
)
```
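
For readers curious how a worker pool and a progress callback fit together, here is a minimal standard-library sketch. It is not the implementation behind `DocumentProcessor`; `process_all` and `handler` are hypothetical names, and a thread pool is assumed because document loading is usually I/O-bound.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_all(paths, handler, workers=8, progress_callback=None):
    """Run handler(path) for every path in parallel and report progress.

    The callback receives (completed_count, total, path), matching the
    on_progress signature used above.
    """
    paths = list(paths)
    total = len(paths)
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(handler, path): path for path in paths}
        for done, future in enumerate(as_completed(futures), start=1):
            path = futures[future]
            results[path] = future.result()  # re-raises any handler exception
            if progress_callback is not None:
                progress_callback(done, total, path)
    return results
```

Because results arrive in completion order, the callback sees a steadily increasing count even though files may finish out of order.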

### Custom Embeddings

```python
# Use custom embedding models
from vector_db_query import DocumentProcessor, EmbeddingService

# Configure custom model
embedding_service = EmbeddingService(
    model="custom-model",
    api_key="your-api-key",
    dimensions=1536
)

# Process with custom embeddings
processor = DocumentProcessor(
    embedding_service=embedding_service
)
```

## 📊 Monitoring Dashboard

The built-in monitoring dashboard provides real-time insights:

```bash
# Start the dashboard
vector-db-query monitor start

# Access at http://localhost:8501
```

Features:
- System resource usage (CPU, Memory, Disk)
- Processing queue status
- Document processing statistics
- Error logs and alerts
- Performance metrics

## 🐳 Docker Support

Run everything in containers:

```bash
# Build the image
docker build -t vector-db-query .

# Run with docker-compose
docker-compose up -d

# Access services
# - API: http://localhost:5000
# - Qdrant: http://localhost:6333
# - Dashboard: http://localhost:8501
```

## 🧪 Testing

```bash
# Run all tests
pytest

# Run specific test categories
pytest tests/test_readers/
pytest tests/test_cli/

# Run with coverage
pytest --cov=vector_db_query

# Run integration tests
pytest tests/integration/ --integration
```

## 📚 Documentation

### Guides
- [Getting Started Guide](docs/getting-started.md)
- [File Formats Guide](docs/file-formats-guide.md) - Detailed information about all supported formats
- [Usage Examples](docs/usage-examples.md) - Practical examples for common use cases
- [Configuration Guide](docs/configuration-guide.md)
- [CLI Features Guide](docs/enhanced-cli-features.md)

### API Documentation
- [Document Readers API](docs/api/readers.md) - API reference for all document readers
- [Full API Reference](https://vector-db-query.readthedocs.io)

### Integration & Deployment
- [MCP Integration Guide](docs/mcp_integration_guide.md)
- [Monitoring Setup Guide](docs/monitoring-setup.md)
- [Monitoring System Guide](docs/monitoring-system.md)

## 🤝 Contributing

We welcome contributions! Please see our [Contributing Guide](CONTRIBUTING.md) for details.

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 📊 Data Sources

The Data Sources feature enables automatic synchronization of content from multiple external sources into your vector database:

### Quick Start
```bash
# Run interactive setup wizard
vdq setup

# Or use quick start guide
vdq quickstart

# Start syncing data
vdq datasources sync

# Monitor sync status
vdq monitor
```

### Key Capabilities

#### Gmail Integration
- OAuth2 authentication for secure access
- Folder selection (INBOX, Sent, Drafts, etc.)
- Advanced filtering (sender whitelist/blacklist, patterns)
- Attachment processing
- Thread detection and grouping
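
The sender filtering described above can be pictured as a small allow/deny check. The sketch below is an assumption about how whitelist and blacklist patterns might combine (blacklist wins, and an empty whitelist allows everyone); `sender_allowed` is a hypothetical helper, and shell-style wildcards are used for the patterns.

```python
from fnmatch import fnmatch


def sender_allowed(sender, whitelist=(), blacklist=()):
    """Decide whether an email from `sender` should be synced.

    Patterns use shell-style wildcards (for example "*@example.com")
    and are compared case-insensitively.
    """
    sender = sender.lower()
    if any(fnmatch(sender, pattern.lower()) for pattern in blacklist):
        return False  # an explicit block always takes precedence
    if whitelist:
        return any(fnmatch(sender, pattern.lower()) for pattern in whitelist)
    return True  # no whitelist configured: accept everything not blocked
```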

#### Fireflies.ai Integration
- API-based transcript sync
- Real-time webhook support
- Meeting duration and platform filters
- Speaker identification
- Automatic summary extraction

#### Google Drive Integration
- OAuth2 authentication
- Pattern-based file search (e.g., "Notes by Gemini")
- Folder-specific sync
- Shared drive support
- File type filtering

#### Advanced Processing
- **Deduplication**: Content-based hashing to prevent duplicates
- **NLP Analysis**: Extract entities, sentiment, and key phrases
- **Selective Processing**: Rule-based filtering system
- **Performance**: Parallel processing with rate limiting
- **Monitoring**: Real-time dashboard with metrics
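
Content-based deduplication can be sketched with a hash over normalized text. The example below assumes SHA-256 and a simple normalization (collapsed whitespace, lowercase); the actual hashing scheme used by the tool is not documented here, and `content_hash` and `deduplicate` are hypothetical names.

```python
import hashlib


def content_hash(text):
    """Hash text after collapsing whitespace and lowercasing it."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(items):
    """Keep the first (source, text) pair for each distinct content hash."""
    seen = set()
    unique = []
    for source, text in items:
        digest = content_hash(text)
        if digest in seen:
            continue  # same content already arrived from another source
        seen.add(digest)
        unique.append((source, text))
    return unique
```

Hashing the content rather than the file name is what lets the same meeting notes be recognized whether they arrive through Gmail, Fireflies.ai, or Google Drive.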

### Configuration

The system can be configured via:
- Interactive setup wizard: `vdq setup`
- Configuration file: `config/default.yaml`
- Environment variables for sensitive data
- Web UI through monitoring dashboard

### Documentation
- [Setup Wizard Guide](docs/setup-wizard.md)
- [Data Sources Operations Guide](docs/data-sources-operations.md)
- [Troubleshooting Guide](docs/troubleshooting-guide.md)
- [Deployment Guide](docs/deployment-guide.md)
- [Maintenance Procedures](docs/maintenance-procedures.md)

## 🙏 Acknowledgments

- [Qdrant](https://qdrant.tech/) for the excellent vector database
- [Rich](https://github.com/Textualize/rich) for beautiful terminal formatting
- [Textual](https://github.com/Textualize/textual) for the interactive TUI
- [MCP](https://modelcontextprotocol.io/) for AI integration standards
- All our contributors and users!

---

<div align="center">
  <p>Built with ❤️ by the Vector DB Query Team</p>
  <p>
    <a href="https://github.com/your-org/vector-db-query">GitHub</a> •
    <a href="https://vector-db-query.readthedocs.io">Documentation</a> •
    <a href="https://github.com/your-org/vector-db-query/issues">Issues</a>
  </p>
</div>

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/your-org/vector-db-query",
    "name": "vector-db-query",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": null,
    "keywords": "vector-database, llm, mcp, embeddings, qdrant",
    "author": "Vector DB Query Team",
    "author_email": "Vector DB Query Team <team@example.com>",
    "download_url": "https://files.pythonhosted.org/packages/e0/ff/cbb2b361a918b4c9126546a623cebd152e753e1160c537b4c85f29e9094a/vector_db_query-1.0.0.tar.gz",
    "platform": null,
    "description": "# Vector DB Query\n\n<div align=\"center\">\n  <h3>\ud83d\ude80 Semantic Search for Your Documents with AI Integration</h3>\n  <p>A powerful CLI tool that indexes your documents and enables natural language search with LLM integration via MCP</p>\n  \n  [![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)\n  [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n  [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)\n  [![Documentation Status](https://readthedocs.org/projects/vector-db-query/badge/?version=latest)](https://vector-db-query.readthedocs.io/en/latest/?badge=latest)\n</div>\n\n## \ud83c\udf1f Key Features\n\nVector DB Query is a comprehensive solution for building searchable knowledge bases from your documents:\n\n### \ud83d\udcc4 Enhanced Document Processing\n- **40+ File Formats**: PDF, Word, Excel, PowerPoint, HTML, Markdown, JSON, XML, Images (with OCR), and more\n- **OCR Support**: Extract text from images (PNG, JPG, TIFF, BMP) with configurable languages and confidence thresholds\n- **Archive Support**: Process ZIP, TAR, and compressed archives automatically\n- **Smart Chunking**: Multiple strategies including sliding window, semantic, and paragraph-based\n- **Metadata Extraction**: Preserve document structure, authorship, dates, and custom tags\n- **Format-Specific Processing**: Tailored extraction for each file type (formulas from Excel, speaker notes from PowerPoint, etc.)\n\n### \ud83d\udd0d Advanced Semantic Search\n- Natural language queries with vector similarity\n- Hybrid search combining keyword and semantic matching\n- Advanced filtering by file type, date, score, and metadata\n- Result reranking and highlighting\n- Export results in multiple formats (JSON, CSV, Markdown)\n\n### \ud83c\udfa8 Rich Interactive CLI\n- Beautiful terminal UI powered by Rich and 
Textual\n- Visual file browser with real-time preview\n- Interactive query builder with autocomplete\n- Live progress tracking with detailed statistics\n- Customizable themes and output formats\n\n### \ud83e\udd16 AI Integration\n- MCP server for Claude and other AI assistants\n- Secure API with JWT authentication\n- Rate limiting and request monitoring\n- Standardized tool interface for document operations\n- Real-time processing feedback\n\n### \u2699\ufe0f Flexible Configuration\n- YAML-based configuration with environment overrides\n- CLI commands for configuration management\n- Support for multiple configuration profiles\n- Validation and health checks\n- Hot-reloading of settings\n\n### \ud83d\udcca Monitoring & Management\n- Real-time monitoring dashboard (Streamlit)\n- System metrics and resource usage tracking\n- Processing queue management\n- Log aggregation and analysis\n- PM2 integration for process management\n\n### \u26a1 Performance & Scalability\n- Parallel processing with configurable workers\n- Memory-efficient chunking and streaming\n- Smart caching system\n- Connection pooling for database operations\n- Batch processing optimization\n\n### \ud83d\udd17 Data Source Integration (New!)\n- **Gmail Integration**: Sync emails via IMAP/OAuth2 with folder selection and filtering\n- **Fireflies.ai Integration**: Automatic meeting transcript sync via API and webhooks\n- **Google Drive Integration**: Search and sync Gemini transcripts and documents\n- **Smart Deduplication**: Cross-source duplicate detection using content hashing\n- **NLP Processing**: Entity extraction, sentiment analysis, and key phrase detection\n- **Selective Processing**: Configurable filters for targeted content processing\n- **Real-time Monitoring**: Dashboard integration for tracking sync status\n- **Setup Wizard**: Interactive configuration for easy onboarding\n\n## \ud83d\udccb Requirements\n\n- Python 3.9 or higher\n- 4GB RAM minimum (8GB recommended)\n- Qdrant vector database 
(local or cloud)\n- API key for embeddings (Google, OpenAI, etc.)\n- Optional: Tesseract for OCR support\n- Optional: Docker for containerized deployment\n\n## \ud83d\ude80 Quick Start\n\n### Installation\n\n```bash\n# Install from PyPI\npip install vector-db-query\n\n# Or install from source\ngit clone https://github.com/your-org/vector-db-query.git\ncd vector-db-query\npip install -e .\n\n# Install with OCR support\npip install vector-db-query[ocr]\n# Also install Tesseract:\n# macOS: brew install tesseract\n# Ubuntu: sudo apt-get install tesseract-ocr\n# Windows: Download from https://github.com/UB-Mannheim/tesseract/wiki\n\n# Install with all features\npip install vector-db-query[all]\n\n# Install additional language packs for OCR\n# Ubuntu/Debian:\nsudo apt-get install tesseract-ocr-fra  # French\nsudo apt-get install tesseract-ocr-deu  # German\nsudo apt-get install tesseract-ocr-spa  # Spanish\n\n# macOS:\nbrew install tesseract-lang\n```\n\n### Setup\n\n```bash\n# 1. Start Qdrant (using Docker)\ndocker run -p 6333:6333 -v $(pwd)/qdrant_storage:/qdrant/storage qdrant/qdrant\n\n# 2. Configure the application\nvector-db-query config setup\n\n# 3. Process your first documents\nvector-db-query process ~/Documents/my-files --recursive\n\n# 4. Search your documents\nvector-db-query query \"machine learning algorithms\"\n\n# 5. 
Or use interactive mode for the full experience\nvector-db-query interactive start\n```\n\n## \ud83d\udcd6 Usage\n\n### Interactive Mode (Recommended)\n\nThe interactive mode provides a rich terminal interface:\n\n```bash\nvector-db-query interactive start\n```\n\nFeatures:\n- \ud83d\udcc1 Visual file browser with multi-format preview\n- \ud83d\udd0d Interactive query builder with AI suggestions\n- \ud83d\udcca Beautiful result viewer with syntax highlighting\n- \u2699\ufe0f Settings editor with live validation\n- \ud83d\udcda Built-in tutorials and examples\n- \ud83c\udfaf Format-specific processing options\n\n### Command Line Mode\n\n#### Processing Documents\n\n```bash\n# Process all supported formats\nvector-db-query process /path/to/documents --recursive\n\n# Process specific formats only\nvector-db-query process /path/to/docs --formats pdf,docx,xlsx\n\n# Process with OCR for images\nvector-db-query process /path/to/images --ocr --ocr-lang eng\n\n# Show all supported formats\nvector-db-query formats\n\n# Check format support for specific files\nvector-db-query formats /path/to/file.xyz\n\n# Process only Excel and PowerPoint files\nvector-db-query process /path/to/docs --formats xlsx,pptx --recursive\n\n# Dry run to see what would be processed\nvector-db-query process /path/to/docs --dry-run --verbose\n```\n\n#### Querying Documents\n\n```bash\n# Simple natural language query\nvector-db-query query \"explain the authentication flow\"\n\n# Advanced search with filters\nvector-db-query query \"Python async\" --filter file_type=py --limit 20\n\n# Hybrid search with keyword weight\nvector-db-query query \"API endpoints\" --hybrid --keyword-weight 0.4\n\n# Export results in different formats\nvector-db-query query \"documentation\" --export results.json --format json\nvector-db-query query \"configuration\" --export results.md --format markdown\n\n# Show query statistics\nvector-db-query query \"machine learning\" --stats\n```\n\n#### Configuration 
Management\n\n```bash\n# Show current configuration\nvector-db-query config show\nvector-db-query config show --format table\nvector-db-query config show --section document_processing\n\n# Get/Set configuration values\nvector-db-query config get document_processing.chunk_size\nvector-db-query config set document_processing.chunk_size 2000 --type int\n\n# Validate configuration\nvector-db-query config validate\n\n# Show supported file formats\nvector-db-query config formats\n\n# Add custom format\nvector-db-query config add-format .custom\n\n# Export/Import configuration\nvector-db-query config export --output my-config.yaml\nvector-db-query config load custom-config.yaml --merge\n\n# Show environment variable mappings\nvector-db-query config env\n```\n\n#### Monitoring and Management\n\n```bash\n# Start monitoring dashboard (requires monitoring dependencies)\nvector-db-query monitor\n# Or install monitoring dependencies first:\n# pip install vector-db-query[monitoring]\n\n# View system status\nvector-db-query status\n\n# View processing logs\nvector-db-query logging show --tail 100\nvector-db-query logging search \"ERROR\" --context 5\n\n# Manage processes with PM2\n./scripts/pm2-manage.sh start all\n./scripts/pm2-manage.sh status\n./scripts/pm2-manage.sh logs mcp-server\n```\n\n### MCP Server for AI Assistants\n\nEnable AI assistants like Claude to search your documents:\n\n```bash\n# Initialize MCP configuration\nvector-db-query mcp init\n\n# Start MCP server\nvector-db-query mcp start\n\n# Create API client\nvector-db-query mcp auth create-client \"claude-assistant\"\n\n# Check server status\nvector-db-query mcp status\n\n# Test with sample query\nvector-db-query mcp test --query \"find Python examples\"\n```\n\nThe MCP server provides tools for:\n- Searching documents with natural language\n- Processing new files in real-time\n- Getting collection statistics\n- Managing the vector database\n- Monitoring system health\n\n## \u2699\ufe0f Configuration\n\nThe 
application uses a flexible YAML-based configuration system:\n\n```yaml\n# config.yaml example\napp:\n  name: \"Vector DB Query System\"\n  log_level: \"INFO\"\n\ndocument_processing:\n  chunk_size: 1000\n  chunk_overlap: 200\n  max_file_size_mb: 100\n  \n  file_formats:\n    documents: [\".pdf\", \".doc\", \".docx\", \".txt\", \".md\"]\n    spreadsheets: [\".xlsx\", \".xls\", \".csv\"]\n    images: [\".png\", \".jpg\", \".jpeg\", \".gif\", \".bmp\"]\n    # ... more formats\n  \n  ocr:\n    enabled: true\n    language: \"eng\"\n    confidence_threshold: 60.0\n\nvector_db:\n  host: \"localhost\"\n  port: 6333\n  collection_name: \"documents\"\n\nembedding:\n  model: \"embedding-001\"\n  dimensions: 768\n\n# ... more settings\n```\n\n### Environment Variables\n\nOverride configuration with environment variables:\n\n```bash\nexport VECTOR_DB_LOG_LEVEL=DEBUG\nexport QDRANT_HOST=remote-server.com\nexport QDRANT_PORT=6334\nexport EMBEDDING_MODEL=text-embedding-ada-002\nexport OCR_LANGUAGE=eng+fra+deu\nexport CHUNK_SIZE=1500\n```\n\n## \ud83e\udde9 Supported File Formats\n\n### Documents\n- **PDF** (`.pdf`) - Full text extraction with layout preservation\n- **Microsoft Word** (`.doc`, `.docx`) - Text, tables, headers/footers, and comments\n- **OpenDocument Text** (`.odt`) - ODT format support\n- **Rich Text Format** (`.rtf`) - RTF document processing\n- **Plain Text** (`.txt`, `.text`) - With encoding detection\n- **Markdown** (`.md`, `.markdown`) - Preserves structure and formatting\n\n### Spreadsheets\n- **Microsoft Excel** (`.xlsx`, `.xls`) - Extracts:\n  - Cell values and formulas\n  - Comments and notes\n  - Multiple sheets\n  - Table structures\n- **CSV** (`.csv`, `.tsv`) - Tabular data processing\n- **OpenDocument Spreadsheet** (`.ods`) - ODS format support\n\n### Presentations\n- **Microsoft PowerPoint** (`.pptx`, `.ppt`) - Extracts:\n  - Slide content and titles\n  - Speaker notes\n  - Table data\n  - Slide numbers and structure\n- **OpenDocument Presentation** 
(`.odp`) - ODP format support\n\n### Email\n- **Email Messages** (`.eml`) - Extracts:\n  - Headers (From, To, Subject, Date)\n  - Body content (text/HTML)\n  - Attachments (processed recursively)\n  - Thread detection\n- **Mailbox** (`.mbox`) - Multi-message archive support\n- **Outlook Message** (`.msg`) - MSG format support\n\n### Web & Markup\n- **HTML** (`.html`, `.htm`, `.xhtml`) - Features:\n  - Script/style removal\n  - Text extraction with structure\n  - Link preservation\n  - Optional markdown conversion\n- **XML** (`.xml`) - Structured data extraction\n\n### Configuration & Data\n- **JSON** (`.json`) - Pretty-printed extraction\n- **YAML** (`.yaml`, `.yml`) - Multi-document support\n- **INI/Config** (`.ini`, `.cfg`, `.conf`) - Section-based extraction\n- **TOML** (`.toml`) - TOML format support\n- **Log Files** (`.log`) - Features:\n  - Pattern extraction\n  - Summary generation\n  - Configurable line limits\n\n### Images (with OCR)\nRequires Tesseract installation:\n- **PNG** (`.png`) - Lossless image format\n- **JPEG** (`.jpg`, `.jpeg`) - Common photo format\n- **TIFF** (`.tiff`, `.tif`) - Multi-page support\n- **BMP** (`.bmp`) - Bitmap images\n- **GIF** (`.gif`) - Graphics format\n\n### Archives\n- **ZIP** (`.zip`)\n- TAR (`.tar`, `.tar.gz`, `.tar.bz2`, `.tar.xz`)\n- 7-Zip (`.7z`)\n\n### Logs\n- Log Files (`.log`)\n\n## \ud83d\udd27 Advanced Features\n\n### OCR Configuration\n\n```bash\n# Install Tesseract\n# macOS\nbrew install tesseract\n\n# Ubuntu/Debian\nsudo apt-get install tesseract-ocr\n\n# Install additional languages\nsudo apt-get install tesseract-ocr-fra tesseract-ocr-deu\n\n# Configure OCR in vector-db-query\nvector-db-query config set document_processing.ocr.enabled true\nvector-db-query config set document_processing.ocr.language \"eng\"\nvector-db-query config set document_processing.ocr.confidence_threshold 60.0\n```\n\n### Format-Specific Configuration\n\nConfigure processing behavior for each file format:\n\n```yaml\n# 
config/default.yaml\ndocument_processing:\n  format_settings:\n    excel:\n      extract_formulas: true\n      extract_comments: true\n      process_all_sheets: true\n      max_rows_per_sheet: 10000\n    \n    powerpoint:\n      extract_speaker_notes: true\n      extract_slide_numbers: true\n      include_master_slides: false\n    \n    email:\n      extract_attachments: true\n      thread_detection: true\n      sanitize_content: true\n      include_headers: true\n    \n    html:\n      remove_scripts: true\n      remove_styles: true\n      convert_to_markdown: false\n      preserve_links: true\n    \n    logs:\n      summarize: true\n      extract_patterns: true\n      max_lines: 10000\n```\n\nOr use environment variables:\n\n```bash\nexport VECTOR_DB_EXCEL_EXTRACT_FORMULAS=true\nexport VECTOR_DB_EXCEL_MAX_ROWS=5000\nexport VECTOR_DB_EMAIL_EXTRACT_ATTACHMENTS=true\nexport VECTOR_DB_HTML_CONVERT_MARKDOWN=true\nexport VECTOR_DB_LOG_SUMMARIZE=true\n```\n\n### Batch Processing\n\n```python\n# Python script for batch processing\nfrom vector_db_query import DocumentProcessor\n\nprocessor = DocumentProcessor(\n    chunk_size=1500,\n    chunk_overlap=300,\n    parallel_workers=8\n)\n\n# Process with progress callback\ndef on_progress(current, total, file_name):\n    print(f\"Processing {file_name}: {current}/{total}\")\n\ndocuments = processor.process_directory(\n    \"/path/to/documents\",\n    recursive=True,\n    progress_callback=on_progress\n)\n```\n\n### Custom Embeddings\n\n```python\n# Use custom embedding models\nfrom vector_db_query import DocumentProcessor, EmbeddingService\n\n# Configure custom model\nembedding_service = EmbeddingService(\n    model=\"custom-model\",\n    api_key=\"your-api-key\",\n    dimensions=1536\n)\n\n# Process with custom embeddings\nprocessor = DocumentProcessor(\n    embedding_service=embedding_service\n)\n```\n\n## \ud83d\udcca Monitoring Dashboard\n\nThe built-in monitoring dashboard provides real-time insights:\n\n```bash\n# Start the 
dashboard\nvector-db-query monitor start\n\n# Access at http://localhost:8501\n```\n\nFeatures:\n- System resource usage (CPU, Memory, Disk)\n- Processing queue status\n- Document processing statistics\n- Error logs and alerts\n- Performance metrics\n\n## \ud83d\udc33 Docker Support\n\nRun everything in containers:\n\n```bash\n# Build the image\ndocker build -t vector-db-query .\n\n# Run with docker-compose\ndocker-compose up -d\n\n# Access services\n# - API: http://localhost:5000\n# - Qdrant: http://localhost:6333\n# - Dashboard: http://localhost:8501\n```\n\n## \ud83e\uddea Testing\n\n```bash\n# Run all tests\npytest\n\n# Run specific test categories\npytest tests/test_readers/\npytest tests/test_cli/\n\n# Run with coverage\npytest --cov=vector_db_query\n\n# Run integration tests\npytest tests/integration/ --integration\n```\n\n## \ud83d\udcda Documentation\n\n### Guides\n- [Getting Started Guide](docs/getting-started.md)\n- [File Formats Guide](docs/file-formats-guide.md) - Detailed information about all supported formats\n- [Usage Examples](docs/usage-examples.md) - Practical examples for common use cases\n- [Configuration Guide](docs/configuration-guide.md)\n- [CLI Features Guide](docs/enhanced-cli-features.md)\n\n### API Documentation\n- [Document Readers API](docs/api/readers.md) - API reference for all document readers\n- [Full API Reference](https://vector-db-query.readthedocs.io)\n\n### Integration & Deployment\n- [MCP Integration Guide](docs/mcp_integration_guide.md)\n- [Monitoring Setup Guide](docs/monitoring-setup.md)\n- [Monitoring System Guide](docs/monitoring-system.md)\n\n## \ud83e\udd1d Contributing\n\nWe welcome contributions! 
Please see our [Contributing Guide](CONTRIBUTING.md) for details.\n\n## \ud83d\udcc4 License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n\n## \ud83d\udcca Data Sources\n\nThe Data Sources feature enables automatic synchronization of content from multiple external sources into your vector database.\n\n### Quick Start\n\n```bash\n# Run the interactive setup wizard\nvdq setup\n\n# Or use the quick start guide\nvdq quickstart\n\n# Start syncing data\nvdq datasources sync\n\n# Monitor sync status\nvdq monitor\n```\n\n### Key Capabilities\n\n#### Gmail Integration\n- OAuth2 authentication for secure access\n- Folder selection (INBOX, Sent, Drafts, etc.)\n- Advanced filtering (sender whitelist/blacklist, patterns)\n- Attachment processing\n- Thread detection and grouping\n\n#### Fireflies.ai Integration\n- API-based transcript sync\n- Real-time webhook support\n- Meeting duration and platform filters\n- Speaker identification\n- Automatic summary extraction\n\n#### Google Drive Integration\n- OAuth2 authentication\n- Pattern-based file search (e.g., \"Notes by Gemini\")\n- Folder-specific sync\n- Shared drive support\n- File type filtering\n\n#### Advanced Processing\n- **Deduplication**: Content-based hashing to prevent duplicates\n- **NLP Analysis**: Extract entities, sentiment, and key phrases\n- **Selective Processing**: Rule-based filtering system\n- **Performance**: Parallel processing with rate limiting\n- **Monitoring**: Real-time dashboard with metrics\n\n### Configuration\n\nThe system can be configured via:\n- Interactive setup wizard: `vdq setup`\n- Configuration file: `config/default.yaml`\n- Environment variables for sensitive data\n- Web UI through the monitoring dashboard\n\n### Documentation\n- [Setup Wizard Guide](docs/setup-wizard.md)\n- [Data Sources Operations Guide](docs/data-sources-operations.md)\n- [Troubleshooting Guide](docs/troubleshooting-guide.md)\n- [Deployment Guide](docs/deployment-guide.md)\n- 
[Maintenance Procedures](docs/maintenance-procedures.md)\n\n## \ud83d\ude4f Acknowledgments\n\n- [Qdrant](https://qdrant.tech/) for the excellent vector database\n- [Rich](https://github.com/Textualize/rich) for beautiful terminal formatting\n- [Textual](https://github.com/Textualize/textual) for the interactive TUI\n- [MCP](https://modelcontextprotocol.io/) for AI integration standards\n- All our contributors and users!\n\n---\n\n<div align=\"center\">\n  <p>Built with \u2764\ufe0f by the Vector DB Query Team</p>\n  <p>\n    <a href=\"https://github.com/your-org/vector-db-query\">GitHub</a> \u2022\n    <a href=\"https://vector-db-query.readthedocs.io\">Documentation</a> \u2022\n    <a href=\"https://github.com/your-org/vector-db-query/issues\">Issues</a>\n  </p>\n</div>\n",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "CLI application for vector database queries using LLMs via MCP",
    "version": "1.0.0",
    "project_urls": {
        "Changelog": "https://github.com/yourusername/vector-db-query/blob/main/CHANGELOG.md",
        "Documentation": "https://vector-db-query.readthedocs.io",
        "Homepage": "https://github.com/yourusername/vector-db-query",
        "Issues": "https://github.com/yourusername/vector-db-query/issues",
        "Repository": "https://github.com/yourusername/vector-db-query"
    },
    "split_keywords": [
        "vector-database",
        " llm",
        " mcp",
        " embeddings",
        " qdrant"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "cc2b79ac973d6c9493e3044e34aad23d782419a3ae5a7cdfa7be79b2df0cd2e2",
                "md5": "35ffba95127bef6f0754ae84e9ad6b41",
                "sha256": "d56616c296c08898e2fba0258fbfa55e32a0beb72d7b0637a0ceb7d0c9b7a98e"
            },
            "downloads": -1,
            "filename": "vector_db_query-1.0.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "35ffba95127bef6f0754ae84e9ad6b41",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9",
            "size": 1047730,
            "upload_time": "2025-08-04T02:33:41",
            "upload_time_iso_8601": "2025-08-04T02:33:41.453702Z",
            "url": "https://files.pythonhosted.org/packages/cc/2b/79ac973d6c9493e3044e34aad23d782419a3ae5a7cdfa7be79b2df0cd2e2/vector_db_query-1.0.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "e0ffcbb2b361a918b4c9126546a623cebd152e753e1160c537b4c85f29e9094a",
                "md5": "efc5a4b8e9afa4e15afe172db0dcfc60",
                "sha256": "bc94c496518e538bd0693fa345565906c9335eb49121b45331e1f1a394f8d2ac"
            },
            "downloads": -1,
            "filename": "vector_db_query-1.0.0.tar.gz",
            "has_sig": false,
            "md5_digest": "efc5a4b8e9afa4e15afe172db0dcfc60",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 1118513,
            "upload_time": "2025-08-04T02:33:46",
            "upload_time_iso_8601": "2025-08-04T02:33:46.312387Z",
            "url": "https://files.pythonhosted.org/packages/e0/ff/cbb2b361a918b4c9126546a623cebd152e753e1160c537b4c85f29e9094a/vector_db_query-1.0.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-08-04 02:33:46",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "your-org",
    "github_project": "vector-db-query",
    "github_not_found": true,
    "lcname": "vector-db-query"
}
        