# grepctl - BigQuery Semantic Search Orchestrator
🚀 **One-command multimodal semantic search across your entire data lake using BigQuery ML and Google Cloud AI.**
## 📦 Installation
```bash
# Install from PyPI
pip install grepctl
```
## 🎯 Quick Start - From Zero to Search in One Command
```bash
# Complete setup with automatic data ingestion
grepctl init all --bucket your-bucket --auto-ingest
# Start searching immediately
grepctl search "find all mentions of machine learning"
```
That's it! The system automatically:
- ✅ Enables all required Google Cloud APIs
- ✅ Creates BigQuery dataset and tables
- ✅ Deploys Vertex AI embedding models
- ✅ Ingests all 8 data modalities from your GCS bucket
- ✅ Generates 768-dimensional embeddings
- ✅ Configures semantic search with VECTOR_SEARCH
## 📊 What is grepctl?
`grepctl` is a powerful command-line orchestration tool that transforms your Google Cloud Storage data lake into a searchable knowledge base. It provides a unified interface for searching across **8 different data types**:
- 📄 **Text & Markdown** - Direct content extraction
- 📑 **PDF Documents** - OCR with Document AI
- 🖼️ **Images** - Vision API analysis (labels, text, objects, faces)
- 🎵 **Audio Files** - Speech-to-Text transcription
- 🎬 **Video Files** - Video Intelligence analysis
- 📊 **JSON & CSV** - Structured data parsing
All searchable through semantic understanding, not just keywords!
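To make "semantic understanding" concrete, here is a minimal sketch of nearest-neighbour ranking over embeddings. It is an illustration only, not grepctl's implementation: the 3-dimensional vectors are toy stand-ins for real 768-dimensional embeddings, the file names are made up, and cosine distance is an assumed metric (BigQuery's `VECTOR_SEARCH` lets you choose the distance type).

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query, corpus, k=2):
    """Rank (doc_id, vector) pairs by distance to the query, closest first."""
    scored = [(doc_id, cosine_distance(query, vec)) for doc_id, vec in corpus]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Toy vectors (hypothetical); real embeddings have 768 dimensions.
corpus = [
    ("ml_paper.pdf", [0.9, 0.1, 0.0]),
    ("team_photo.jpg", [0.0, 0.2, 0.9]),
    ("training_log.txt", [0.7, 0.3, 0.1]),
]
query = [1.0, 0.0, 0.0]  # pretend embedding of "machine learning"

for doc_id, distance in top_k(query, corpus):
    print(f"{doc_id}: {distance:.3f}")
```

Documents of different modalities end up in the same vector space, which is why a single query can surface a PDF and a log file together.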
## 🏗️ Architecture Overview
```
┌─────────────────────────────────────────────────────────────┐
│ GCS DATA LAKE │
│ (Your Documents) │
│ 📄 Text 📑 PDF 🖼️ Images 🎵 Audio 🎬 Video 📊 Data │
└─────────────────────────┬───────────────────────────────────┘
│
┌─────▼─────┐
│ grepctl │ ← One command orchestration
└─────┬─────┘
│
┌─────────────────┼─────────────────┐
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Ingestion │ │ Google APIs │ │ Processing │
│ • 6 scripts │ │ • Vision │ │ • Extract │
│ • All types │ │ • Speech │ │ • Transform │
│ │ │ • Video │ │ • Enrich │
└──────┬───────┘ └──────┬───────┘ └──────┬───────┘
└─────────────────┼─────────────────┘
▼
┌─────────────────────┐
│ BigQuery Dataset │
│ search_corpus │
└─────────┬───────────┘
▼
┌─────────────────────┐
│ Vertex AI │
│ text-embedding-004 │
│ 768 dimensions │
└─────────┬───────────┘
▼
┌─────────────────────┐
│ Semantic Search │
│ VECTOR_SEARCH │
│ <1 second query │
└─────────────────────┘
```
## 🛠️ Installation & Setup
### Prerequisites
1. **Google Cloud Project** with billing enabled
2. **Python 3.11+**
3. **gcloud CLI** authenticated with appropriate permissions
### Install from PyPI
```bash
# Install the package
pip install grepctl
# Verify installation
grepctl --help
```
### Install from Source
```bash
# Clone repository
git clone https://github.com/yourusername/grepctl.git
cd grepctl
# Install with uv (recommended)
uv sync
uv run grepctl --help
# Or install with pip
pip install -e .
grepctl --help
```
### Complete System Setup
#### Option 1: Fully Automated (Recommended)
```bash
# One command does everything!
grepctl init all --bucket your-bucket --auto-ingest
# This single command:
# 1. Enables 7 Google Cloud APIs
# 2. Creates BigQuery dataset and 3 tables
# 3. Deploys 3 Vertex AI models
# 4. Ingests all files from GCS
# 5. Generates embeddings
# 6. Sets up semantic search
```
#### Option 2: Step-by-Step Control
```bash
# Enable APIs
grepctl apis enable --all
# Initialize BigQuery
grepctl init dataset
grepctl init models
# Ingest data
grepctl ingest all
# Generate embeddings
grepctl index update
# Start searching
grepctl search "your query"
```
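If you want to script the step-by-step flow, a small wrapper like the following may help. It is a sketch under two assumptions: `grepctl` is on your `PATH`, and it exits with a non-zero status when a step fails. The `run_setup` helper and its `dry_run` flag are hypothetical names introduced here, not part of grepctl.

```python
import subprocess

# The setup commands from the step-by-step section, in order.
SETUP_STEPS = [
    ["grepctl", "apis", "enable", "--all"],
    ["grepctl", "init", "dataset"],
    ["grepctl", "init", "models"],
    ["grepctl", "ingest", "all"],
    ["grepctl", "index", "update"],
]


def run_setup(steps=SETUP_STEPS, dry_run=False):
    """Run each step in order and stop at the first failure.

    Returns the steps that completed (or would run, in dry-run mode).
    """
    completed = []
    for step in steps:
        print("$ " + " ".join(step))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code.
            subprocess.run(step, check=True)
        completed.append(step)
    return completed


if __name__ == "__main__":
    run_setup(dry_run=True)  # switch to dry_run=False to actually execute
```

Stopping at the first failure keeps later steps (such as embedding generation) from running against a half-initialized dataset.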
## 🔍 Using the System
### Command Line Interface
```bash
# Search across all data
grepctl search "machine learning algorithms"
# Search specific modalities
grepctl search "error handling" -k 20 -m pdf -m markdown
# Check system status
grepctl status
# View all available commands
grepctl --help
```
### SQL Interface
```sql
-- Direct semantic search
WITH query_embedding AS (
SELECT ml_generate_embedding_result AS embedding
FROM ML.GENERATE_EMBEDDING(
MODEL `your-project.mmgrep.text_embedding_model`,
(SELECT 'machine learning' AS content),
STRUCT(TRUE AS flatten_json_output)
)
)
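-- VECTOR_SEARCH returns each matched row under `base`; a lower distance is a closer match
SELECT base.doc_id, base.source, base.text_content, distance AS score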
FROM VECTOR_SEARCH(
TABLE `your-project.mmgrep.search_corpus`,
'embedding',
(SELECT embedding FROM query_embedding),
top_k => 10
)
ORDER BY distance;
```
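The same query can be issued from Python with the `google-cloud-bigquery` client. The sketch below is a parameterized variant of the SQL above, not code shipped with grepctl: `build_search_sql` and `run_search` are hypothetical helpers, and the table and model names are the ones this README uses. `VECTOR_SEARCH` returns the matched row under `base`, so the columns are read from there.

```python
def build_search_sql(project, dataset="mmgrep", top_k=10):
    """Return a VECTOR_SEARCH query that takes the search text as @query_text."""
    top_k = int(top_k)
    if top_k < 1:
        raise ValueError("top_k must be a positive integer")
    return f"""
WITH query_embedding AS (
  SELECT ml_generate_embedding_result AS embedding
  FROM ML.GENERATE_EMBEDDING(
    MODEL `{project}.{dataset}.text_embedding_model`,
    (SELECT @query_text AS content),
    STRUCT(TRUE AS flatten_json_output)
  )
)
SELECT base.doc_id, base.source, base.text_content, distance AS score
FROM VECTOR_SEARCH(
  TABLE `{project}.{dataset}.search_corpus`,
  'embedding',
  (SELECT embedding FROM query_embedding),
  top_k => {top_k}
)
ORDER BY distance
"""


def run_search(project, text, top_k=10):
    """Execute the query; needs google-cloud-bigquery and valid credentials."""
    # Imported lazily so build_search_sql has no third-party dependency.
    from google.cloud import bigquery

    client = bigquery.Client(project=project)
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("query_text", "STRING", text)
        ]
    )
    sql = build_search_sql(project, top_k=top_k)
    return list(client.query(sql, job_config=job_config).result())
```

Passing the search text as a query parameter avoids quoting problems when the text contains apostrophes or backslashes.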
### Python API (When installed from source)
```python
from bq_semgrep.search.vector_search import SemanticSearch
from bq_semgrep.bigquery.connection import BigQueryClient
from bq_semgrep.config import load_config
# Load configuration
config = load_config()
client = BigQueryClient(config)
# Initialize searcher
searcher = SemanticSearch(client, config)
# Search across all modalities
results = searcher.search(
query="neural networks",
top_k=20,
source_filter=['pdf', 'images'],
use_rerank=True
)
```
## 📈 System Capabilities
### Current Status (Production Ready)
- ✅ **425+ documents** indexed across 8 modalities
- ✅ **768-dimensional embeddings** for semantic understanding
- ✅ **Sub-second query response** times
- ✅ **100% embedding coverage** for all documents
- ✅ **5 Google Cloud APIs** integrated
- ✅ **Auto-recovery** from embedding issues
### Supported Operations
| Operation | Command | Description |
|-----------|---------|-------------|
| **Setup** | `grepctl init all --bucket your-bucket --auto-ingest` | Complete one-command setup |
| **Ingest** | `grepctl ingest all` | Process all file types |
| **Index** | `grepctl index update` | Generate embeddings |
| **Fix** | `grepctl fix embeddings` | Auto-fix dimension issues |
| **Search** | `grepctl search "query"` | Semantic search |
| **Status** | `grepctl status` | System health check |
## 🧰 grepctl Commands
### Complete CLI Management
```bash
# System initialization
grepctl init all --bucket your-bucket --auto-ingest
# API management
grepctl apis enable --all
grepctl apis check
# Data ingestion
grepctl ingest pdf # Process PDFs
grepctl ingest images # Analyze images with Vision API
grepctl ingest audio # Transcribe audio
grepctl ingest video # Analyze videos
# Index management
grepctl index rebuild # Rebuild from scratch
grepctl index update # Update missing embeddings
grepctl index verify # Check embedding health
# Troubleshooting
grepctl fix embeddings # Fix dimension issues
grepctl fix stuck # Handle stuck processing
grepctl fix validate # Check data integrity
# Search
grepctl search "query" -k 20 -o json
```
### Configuration
grepctl uses `~/.grepctl.yaml` for configuration:
```yaml
project_id: your-project
dataset: mmgrep
bucket: your-bucket
location: US
batch_size: 100
chunk_size: 1000
```
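For scripts that need to read or validate this file, the keys above can be modelled as a small dataclass. This is a sketch, not grepctl's own loader: the `GrepctlConfig` class is hypothetical, and treating the sample values as defaults and `project_id` and `bucket` as required are assumptions.

```python
from dataclasses import dataclass, fields


@dataclass
class GrepctlConfig:
    project_id: str
    bucket: str
    dataset: str = "mmgrep"   # assumed default, taken from the sample file
    location: str = "US"      # assumed default
    batch_size: int = 100     # assumed default
    chunk_size: int = 1000    # assumed default

    def __post_init__(self):
        # Reject sizes that cannot describe a real batch or chunk.
        for name in ("batch_size", "chunk_size"):
            if int(getattr(self, name)) <= 0:
                raise ValueError(f"{name} must be positive")

    @classmethod
    def from_mapping(cls, data):
        """Build a config from a dict, rejecting keys this sketch does not know."""
        known = {f.name for f in fields(cls)}
        unknown = set(data) - known
        if unknown:
            raise ValueError(f"unknown config keys: {sorted(unknown)}")
        return cls(**data)


config = GrepctlConfig.from_mapping(
    {"project_id": "your-project", "bucket": "your-bucket"}
)
print(config)
```

With PyYAML installed, `yaml.safe_load` on the file's contents yields a mapping that can be passed straight to `from_mapping`.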
## 📊 Supported Data Types
| Modality | Extensions | Processing Method | Google API Used |
|----------|------------|-------------------|-----------------|
| Text | .txt, .log | Direct extraction | — |
| Markdown | .md | Markdown parsing | — |
| PDF | .pdf | OCR extraction | Document AI |
| Images | .jpg, .png, .gif | Visual analysis | Vision API |
| Audio | .mp3, .wav, .m4a | Transcription | Speech-to-Text |
| Video | .mp4, .avi, .mov | Frame + audio analysis | Video Intelligence |
| JSON | .json, .jsonl | Structured parsing | — |
| CSV | .csv, .tsv | Tabular analysis | — |
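The table above amounts to a lookup from file extension to modality. A minimal sketch of that lookup follows. The extensions are copied from the table; the modality labels are assumptions (some match the `-m` and `ingest` arguments shown elsewhere in this README, the rest are guesses), and `detect_modality` is a hypothetical helper.

```python
from pathlib import PurePosixPath

# Extensions per modality, copied from the table above.
EXTENSIONS = {
    "text": (".txt", ".log"),
    "markdown": (".md",),
    "pdf": (".pdf",),
    "images": (".jpg", ".png", ".gif"),
    "audio": (".mp3", ".wav", ".m4a"),
    "video": (".mp4", ".avi", ".mov"),
    "json": (".json", ".jsonl"),
    "csv": (".csv", ".tsv"),
}

# Invert to extension -> modality for constant-time lookups.
MODALITY_BY_EXTENSION = {
    ext: modality for modality, exts in EXTENSIONS.items() for ext in exts
}


def detect_modality(object_name):
    """Return the modality for a GCS object name, or None if unsupported."""
    suffix = PurePosixPath(object_name).suffix.lower()
    return MODALITY_BY_EXTENSION.get(suffix)


for name in ("reports/Q3.PDF", "calls/standup.m4a", "archive.tar.gz"):
    print(name, "->", detect_modality(name))
```

Lower-casing the suffix means `Q3.PDF` and `q3.pdf` are treated the same; anything not in the table returns `None` so callers can skip it explicitly.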
## 🚀 Advanced Features
### Multimodal Search
Search across all data types simultaneously:
```bash
# Find mentions across PDFs, images, and videos
grepctl search "quarterly revenue" -m pdf -m images -m video
```
### Automatic Processing
- **Vision API** extracts text, labels, and objects from images
- **Document AI** performs OCR on scanned PDFs
- **Speech-to-Text** transcribes audio with punctuation
- **Video Intelligence** analyzes frames and transcribes speech
### Error Recovery
```bash
# Automatic fix for common issues
grepctl fix embeddings # Fixes dimension mismatches
grepctl fix stuck # Clears stuck processing
```
## 📚 Documentation
- **[grepctl Documentation](grepctl_README.md)** - Complete grepctl usage guide
- **[Architecture Diagrams](visualize_architecture.py)** - System visualization
- **[Lessons Learned](lessons_learned.md)** - Implementation insights
- **[API Integration](api_integration_detail.png)** - Google Cloud API details
## 🔧 Troubleshooting
### Common Issues & Solutions
| Issue | Solution |
|-------|----------|
| "Permission denied" | Run `gcloud auth login` and ensure your account has the BigQuery Admin role |
| "Dataset not found" | Run `grepctl init dataset` |
| "Embedding dimension mismatch" | Run `grepctl fix embeddings` |
| "No search results" | Check `grepctl status` and run `grepctl index update` |
| "API not enabled" | Run `grepctl apis enable --all` |
### Quick Diagnostics
```bash
# Check everything
grepctl status
# Verify APIs
grepctl apis check
# Check embeddings
grepctl index verify
# Fix any issues
grepctl fix embeddings
```
## 🎯 Example Use Cases
1. **Code Search**: Find code patterns across repositories
2. **Document Discovery**: Search PDFs for specific topics
3. **Media Analysis**: Find content in images and videos
4. **Log Analysis**: Semantic search through log files
5. **Data Mining**: Query structured data semantically
## 📈 Performance
- **Ingestion**: ~50 docs/second for text
- **Embedding Generation**: ~20 docs/second
- **Search Latency**: <1 second for most queries
- **Storage**: ~500MB for 425+ documents
- **Embedding size**: 768 dimensions (text-embedding-004)
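These figures can be turned into a rough capacity estimate. The sketch below simply extrapolates them linearly, which is an assumption: the ingestion rate is quoted for text only, the two phases are assumed to run one after the other, and storage per document will vary with content. The `estimate` helper is hypothetical.

```python
# Figures quoted in the Performance list above.
INGEST_DOCS_PER_SEC = 50   # text ingestion
EMBED_DOCS_PER_SEC = 20    # embedding generation
STORAGE_MB = 500           # reported for the indexed corpus
INDEXED_DOCS = 425         # size of that corpus


def estimate(doc_count):
    """Linearly extrapolate time and storage for doc_count documents."""
    ingest_s = doc_count / INGEST_DOCS_PER_SEC
    embed_s = doc_count / EMBED_DOCS_PER_SEC
    return {
        "ingest_seconds": ingest_s,
        "embed_seconds": embed_s,
        "total_seconds": ingest_s + embed_s,
        "storage_mb": doc_count * STORAGE_MB / INDEXED_DOCS,
    }


for count in (425, 10_000):
    print(count, estimate(count))
```

At these rates embedding generation, not ingestion, dominates the total time, so batching or parallelising that phase is where tuning would pay off first.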
## 📦 Package Information
- **PyPI**: https://pypi.org/project/grepctl/
- **Version**: 0.1.0
- **Requirements**: Python 3.11+, Google Cloud Project
- **License**: MIT
## 🤝 Contributing
Contributions welcome! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.
### Development Setup
```bash
# Clone the repository
git clone https://github.com/yourusername/grepctl.git
cd grepctl
# Install in development mode with uv
uv sync
uv add --dev pytest black flake8
# Run tests
uv run pytest
# Format code
uv run black .
```
## 📄 License
MIT License - see [LICENSE](LICENSE) for details.
## 🙏 Acknowledgments
Built with:
- Google BigQuery ML
- Vertex AI (text-embedding-004)
- Google Cloud Vision, Document AI, Speech-to-Text, Video Intelligence APIs
- Python, Click, and Rich CLI libraries
## 📊 Citation
If you use grepctl in your research or project, please cite:
```bibtex
@software{grepctl2024,
title = {grepctl: One-Command Orchestration for Multimodal Semantic Search in BigQuery},
author = {Mulla, Gregory},
year = {2024},
url = {https://github.com/yourusername/grepctl},
version = {0.1.0}
}
```
---
**Ready to transform your data lake into a searchable knowledge base?**
```bash
pip install grepctl
grepctl init all --bucket your-bucket --auto-ingest
```
🎉 That's all it takes!