# Project Vectorizer
A powerful CLI tool that vectorizes codebases, stores them in a vector database, tracks changes, and serves them via MCP (Model Context Protocol) for AI agents like Claude, Codex, and others.
**Latest Version**: 0.1.2 | [Changelog](#changelog) | [GitHub](https://github.com/starkbaknet/project-vectorizer)
---
## 📋 Table of Contents
- [Features](#features)
- [Installation](#installation)
- [Quick Start](#quick-start)
- [Performance Optimization](#performance-optimization)
- [CLI Commands](#cli-commands)
- [Configuration](#configuration)
- [Search Features](#search-features)
- [MCP Server](#mcp-server)
- [Advanced Usage](#advanced-usage)
- [Troubleshooting](#troubleshooting)
- [Changelog](#changelog)
- [Contributing](#contributing)
---
## Features
### 🚀 Performance & Optimization
- **Auto-Optimized Config** - Auto-detect CPU cores and RAM for optimal settings (`--optimize`)
- **Max Resources Mode** - Use maximum system resources for fastest indexing (`--max-resources`)
- **Smart Incremental** - 60-70% faster indexing with intelligent change categorization
- **Git-Aware Indexing** - 80-90% faster by indexing only git-changed files
- **Parallel Processing** - Multi-threaded with auto-detected optimal worker count (up to 16 workers)
- **Memory Monitoring** - Real-time memory tracking with automatic garbage collection
- **Batch Optimization** - Memory-based batch size calculation for safe processing
### 🔍 Search & Indexing
- **Code Vectorization** - Parse and vectorize with sentence-transformers or OpenAI embeddings
- **Multi-Level Chunking** - Functions, classes, micro-chunks, and word-level chunks for precision
- **Enhanced Single-Word Search** - High-precision search for single keywords (0.8+ thresholds)
- **Semantic + Exact Search** - Combines semantic similarity with exact word matching
- **Adaptive Thresholds** - Automatically adjusts for optimal results
- **Multiple Languages** - 30+ languages (Python, JS, TS, Go, Rust, Java, C++, C, PHP, Ruby, Swift, Kotlin, and more)
### 🔄 Change Management
- **Git Integration** - Track changes via git commits with `index-git` command
- **Smart File Categorization** - Detects New, Modified, and Deleted files
- **Watch Mode** - Real-time monitoring with configurable debouncing (0.5-10s)
- **Incremental Updates** - Only re-index changed content
- **Hash-Based Detection** - SHA256 file hashing for accurate change detection
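Hash-based change detection can be sketched in a few lines: hash each file's bytes with SHA256 and compare against the digests stored from the previous run. This is an illustration of the technique, not the tool's internal API; the function names here are hypothetical.

```python
import hashlib
from pathlib import Path

def file_hash(path: Path) -> str:
    """Return the SHA256 hex digest of a file's contents."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            h.update(block)
    return h.hexdigest()

def changed_files(paths, previous_hashes):
    """Yield (path, digest) for files whose hash differs from the stored one."""
    for path in paths:
        digest = file_hash(path)
        if previous_hashes.get(str(path)) != digest:
            yield path, digest
```

Because the comparison is content-based, touching a file without changing it (e.g. `touch file.py`) does not trigger re-indexing.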
### 🌐 AI Integration
- **MCP Server** - Model Context Protocol for AI agents (Claude, Codex, etc.)
- **HTTP Fallback API** - RESTful endpoints when MCP is unavailable
- **Semantic Search** - Natural language queries for code discovery
- **File Operations** - Get content, list files, project statistics
### 🎨 User Experience
- **Clean Progress Output** - Single unified progress bar with timing information
- **Suppressed Library Logs** - No cluttered batch progress bars from dependencies
- **Timing Information** - Elapsed time for all operations (seconds or minutes+seconds)
- **Verbose Mode** - Optional detailed logging for debugging
- **Professional UI** - Rich terminal output with colors, panels, and formatting
- **Real-time Updates** - Live file names and status tags during indexing
### 💾 Database & Storage
- **ChromaDB Backend** - High-performance vector database
- **Fast HNSW Indexing** - Optimized similarity search algorithm
- **Scalable** - Handles 500K+ chunks efficiently
- **Single Database** - No external dependencies required
- **Custom Paths** - Configurable database location
---
## Installation
### From PyPI (Recommended)
```bash
# Install from PyPI
pip install project-vectorizer
# Verify installation
pv --version
```
### From Source
```bash
# Clone repository
git clone https://github.com/starkbaknet/project-vectorizer.git
cd project-vectorizer
# Install
pip install -e .
# Or with development dependencies
pip install -e ".[dev]"
```
---
## Quick Start
### 1. Initialize Your Project
```bash
# 🚀 Recommended: Auto-optimize based on your system (16 workers, 400 batch on 8-core/16GB RAM)
pv init /path/to/project --optimize
# Or with custom settings
pv init /path/to/project \
--name "My Project" \
--embedding-model "all-MiniLM-L6-v2" \
--chunk-size 256 \
--optimize
```
**Output:**
```
✓ Project initialized successfully!
Name: My Project
Path: /path/to/project
Model: all-MiniLM-L6-v2
Provider: sentence-transformers
Chunk Size: 256 tokens
Optimized Settings:
• Workers: 16
• Batch Size: 400
• Embedding Batch: 200
• Memory Monitoring: Enabled
• GC Interval: 100 files
```
### 2. Index Your Codebase
```bash
# 🚀 Recommended: First-time indexing with max resources (2-4x faster)
pv index /path/to/project --max-resources
# 🚀 Recommended: Smart incremental for updates (60-70% faster)
pv index /path/to/project --smart
# 🚀 Recommended: Git-aware for recent changes (80-90% faster)
pv index-git /path/to/project --since HEAD~5
# Standard full indexing
pv index /path/to/project
# Force re-index everything
pv index /path/to/project --force
# Combine for maximum performance
pv index /path/to/project --smart --max-resources
```
**Output:**
```
Using maximum system resources (optimized settings)...
• Workers: 16
• Batch Size: 400
• Embedding Batch: 200
Indexing examples/demo.py ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100%
╭────────────────── Indexing Complete ──────────────────╮
│ ✓ Indexing complete! │
│ │
│ Files indexed: 48/49 │
│ Total chunks: 9222 │
│ Model: all-MiniLM-L6-v2 │
│ Time taken: 2m 16s │
│ │
│ You can now search with: pv search . "your query" │
╰───────────────────────────────────────────────────────╯
```
### 3. Search Your Code
```bash
# Natural language search
pv search /path/to/project "authentication logic"
# Single-word searches work great (high precision)
pv search /path/to/project "async" --threshold 0.8
pv search /path/to/project "test" --threshold 0.9
# Multi-word queries (semantic search)
pv search /path/to/project "user login validation" --threshold 0.5
# Find specific constructs
pv search /path/to/project "class" --limit 10
```
**Output:**
```
Search Results for: authentication logic
Found 5 result(s) with threshold >= 0.5
╭─────────────────────── Result 1 ───────────────────────╮
│ src/auth/login.py │
│ Lines 45-67 | Similarity: 0.892 │
│ │
│ def authenticate_user(username: str, password: str): │
│ """ │
│ Authenticate user credentials against database │
│ Returns user object if valid, None otherwise │
│ """ │
│ ... │
╰────────────────────────────────────────────────────────╯
```
### 4. Start MCP Server
```bash
# Start server (default: localhost:8000)
pv serve /path/to/project
# Custom host/port
pv serve /path/to/project --host 0.0.0.0 --port 8080
```
### 5. Monitor Changes in Real-Time
```bash
# Watch for file changes (default 2s debounce)
pv sync /path/to/project --watch
# Fast feedback (0.5s)
pv sync /path/to/project --watch --debounce 0.5
# Slower systems (5s)
pv sync /path/to/project --watch --debounce 5.0
```
---
## Performance Optimization
### Understanding the Optimization Flags
#### `--optimize` (Permanent)
Use when **initializing** a new project. Detects your system and saves optimal settings.
```bash
pv init /path/to/project --optimize
```
**What it does:**
- Detects CPU cores → sets `max_workers` (e.g., 8 cores = 16 workers)
- Calculates RAM → sets safe `batch_size` (e.g., 16GB = 400 batch)
- Sets memory thresholds based on total RAM
- **Saves to config** - All future operations use these settings
**When to use:**
- ✅ New projects
- ✅ Want permanent optimization
- ✅ Same machine for all operations
- ✅ "Set and forget" approach
#### `--max-resources` (Temporary)
Use when **indexing** to temporarily boost performance without changing config.
```bash
pv index /path/to/project --max-resources
pv index-git /path/to/project --since HEAD~1 --max-resources
```
**What it does:**
- Detects system resources (same as --optimize)
- **Temporarily overrides** config for this operation only
- Original config unchanged
**When to use:**
- ✅ Existing project without optimization
- ✅ One-time heavy indexing
- ✅ CI/CD with dedicated resources
- ✅ Don't want to modify config
### Performance Benchmarks
**System**: 8-core CPU, 16GB RAM, SSD
| Mode | Files | Chunks | Time | Settings |
| ------------------ | --------- | ------ | ------ | --------------------- |
| Standard | 48 | 9222 | 4m 32s | 4 workers, 100 batch |
| --max-resources | 48 | 9222 | 2m 16s | 16 workers, 400 batch |
| Smart incremental | 5 changed | 412 | 24s | 16 workers, 400 batch |
| Git-aware (HEAD~1) | 3 changed | 287 | 15s | 16 workers, 400 batch |
**Key Findings:**
- `--max-resources`: **2x faster** for full indexing
- Smart incremental: **60-70% faster** than full reindex
- Git-aware: **80-90% faster** for recent changes
- Chunk size (128 vs 512): **No performance difference** (same ~2m 16s)
### System Resource Detection
**CPU Detection:**
```
Detected: 8 cores
Optimal workers: min(8 * 2, 16) = 16 workers
```
**Memory Detection:**
```
Total RAM: 16GB
Available RAM: 8GB
Safe batch size: 8GB * 0.5 * 100 = 400
Embedding batch: 400 * 0.5 = 200
GC interval: 100 files
```
**Memory Thresholds:**
```
32GB+ RAM → threshold: 50000
16-32GB → threshold: 20000
8-16GB → threshold: 10000
<8GB → threshold: 5000
```
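The detection formulas above can be expressed as a small pure function. This is a sketch mirroring the documented heuristics, not the tool's actual `Config.create_optimized()` implementation; in practice the inputs would come from `psutil.cpu_count()` and `psutil.virtual_memory()`.

```python
def optimized_settings(cores: int, total_gb: float, available_gb: float) -> dict:
    """Mirror the documented sizing heuristics for workers, batches, and thresholds."""
    workers = min(cores * 2, 16)                 # 8 cores -> 16 workers
    batch_size = int(available_gb * 0.5 * 100)   # 8 GB free -> 400
    embedding_batch = int(batch_size * 0.5)      # half the file batch

    if total_gb >= 32:
        threshold = 50_000
    elif total_gb >= 16:
        threshold = 20_000
    elif total_gb >= 8:
        threshold = 10_000
    else:
        threshold = 5_000

    return {
        "max_workers": workers,
        "batch_size": batch_size,
        "embedding_batch_size": embedding_batch,
        "memory_efficient_search_threshold": threshold,
    }

# With psutil installed, feed in live measurements:
#   import psutil
#   mem = psutil.virtual_memory()
#   settings = optimized_settings(psutil.cpu_count(logical=False) or 1,
#                                 mem.total / 1024**3, mem.available / 1024**3)
```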
### Best Practices
1. **Initialize with optimization**
```bash
pv init ~/my-project --optimize
```
2. **Use max resources for heavy operations**
```bash
pv index ~/my-project --force --max-resources
```
3. **Use smart mode for daily updates**
```bash
pv index ~/my-project --smart
```
4. **Use git-aware after pulling changes**
```bash
pv index-git ~/my-project --since HEAD~1
```
5. **Monitor memory with verbose mode**
```bash
pv index ~/my-project --max-resources --verbose
```
---
## CLI Commands
### Global Options
```bash
pv [OPTIONS] COMMAND [ARGS]
Options:
-v, --verbose Enable verbose output
--version Show version
--help Show help
```
### `pv init` - Initialize Project
Initialize a new project for vectorization.
```bash
pv init [OPTIONS] PROJECT_PATH
Options:
-n, --name TEXT Project name (default: directory name)
-m, --embedding-model TEXT Model name (default: all-MiniLM-L6-v2)
-p, --embedding-provider Provider: sentence-transformers | openai
-c, --chunk-size INT Chunk size in tokens (default: 256)
-o, --chunk-overlap INT Overlap in tokens (default: 32)
--optimize Auto-optimize based on system resources ⭐
```
**Examples:**
```bash
# Basic initialization
pv init /path/to/project
# With optimization (recommended)
pv init /path/to/project --optimize
# With OpenAI embeddings
export OPENAI_API_KEY="sk-..."
pv init /path/to/project \
--embedding-provider openai \
--embedding-model text-embedding-ada-002 \
--optimize
```
### `pv index` - Index Codebase
Index the codebase for searching.
```bash
pv index [OPTIONS] PROJECT_PATH
Options:
-i, --incremental Only index changed files
-s, --smart Smart incremental (categorized: new/modified/deleted) ⭐
-f, --force Force re-index all files
--max-resources Use maximum system resources ⭐
```
**Examples:**
```bash
# Full indexing with max resources
pv index /path/to/project --max-resources
# Smart incremental (fastest for updates)
pv index /path/to/project --smart
# Combine for maximum performance
pv index /path/to/project --smart --max-resources
# Force complete reindex
pv index /path/to/project --force
```
### `pv index-git` - Git-Aware Indexing
Index only files changed in git commits.
```bash
pv index-git [OPTIONS] PROJECT_PATH
Options:
-s, --since TEXT Git reference (default: HEAD~1)
--max-resources Use maximum system resources ⭐
```
**Examples:**
```bash
# Last commit
pv index-git /path/to/project --since HEAD~1
# Last 5 commits
pv index-git /path/to/project --since HEAD~5
# Since main branch
pv index-git /path/to/project --since main
# Since specific commit
pv index-git /path/to/project --since abc123def
# With max resources
pv index-git /path/to/project --since HEAD~10 --max-resources
```
**Use Cases:**
- After `git pull` - index only new changes
- Before code review - index PR changes
- CI/CD pipelines - index commit range
- After branch switch - index differences
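Under the hood, git-aware indexing boils down to asking git which files changed in the given range and feeding only those to the indexer. A minimal sketch of that step (the helper name is illustrative, not the tool's API):

```python
import subprocess
from pathlib import Path

def git_changed_files(repo: Path, since: str = "HEAD~1") -> list[str]:
    """List files changed between `since` and HEAD, as git reports them."""
    out = subprocess.run(
        ["git", "diff", "--name-only", f"{since}..HEAD"],
        cwd=repo, capture_output=True, text=True, check=True,
    )
    return [line for line in out.stdout.splitlines() if line.strip()]
```

The `--since` values shown above (`HEAD~5`, `main`, a commit SHA) all work because git resolves any valid reference on the left side of the range.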
### `pv search` - Search Code
Search through vectorized codebase.
```bash
pv search [OPTIONS] PROJECT_PATH QUERY
Options:
-l, --limit INT Number of results (default: 10)
-t, --threshold FLOAT Similarity threshold 0.0-1.0 (default: 0.3)
```
**Examples:**
```bash
# Natural language search
pv search /path/to/project "error handling in database connections"
# Single-word search (high threshold)
pv search /path/to/project "async" --threshold 0.9
# Find all tests
pv search /path/to/project "test" --limit 20 --threshold 0.8
# Broad semantic search (low threshold)
pv search /path/to/project "api authentication" --threshold 0.3
```
**Threshold Guide:**
- **0.8-0.95**: Single words, exact matches
- **0.5-0.7**: Multi-word phrases, semantic
- **0.3-0.5**: Complex queries, broad search
- **0.1-0.3**: Very broad, exploratory
### `pv sync` - Sync Changes / Watch Mode
Sync changes or watch for file modifications.
```bash
pv sync [OPTIONS] PROJECT_PATH
Options:
-w, --watch Watch for file changes
-d, --debounce FLOAT Debounce delay in seconds (default: 2.0)
```
**Examples:**
```bash
# One-time sync (smart incremental)
pv sync /path/to/project
# Watch mode with default debounce (2s)
pv sync /path/to/project --watch
# Fast feedback (0.5s)
pv sync /path/to/project --watch --debounce 0.5
# Slower systems (5s)
pv sync /path/to/project --watch --debounce 5.0
```
**Debounce Explained:**
- Waits X seconds after last file change before indexing
- Batches multiple rapid changes together
- Prevents redundant indexing when saving files repeatedly
- Reduces CPU usage during active development
**Recommended Values:**
- **0.5-1.0s**: Fast machines, need instant feedback
- **2.0s**: Balanced (default)
- **5.0-10.0s**: Slower machines, large codebases
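The debounce behavior described above is the classic trailing-edge pattern: each new file event resets a timer, and indexing only fires once the timer survives the full delay. A minimal sketch (not the tool's internals, which use a file watcher):

```python
import threading

class Debouncer:
    """Run `action` only after `delay` seconds of quiet; new triggers reset the timer."""

    def __init__(self, delay: float, action):
        self.delay = delay
        self.action = action
        self._timer = None
        self._lock = threading.Lock()

    def trigger(self):
        """Call this on every file-change event."""
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()   # a rapid burst of saves keeps resetting the clock
            self._timer = threading.Timer(self.delay, self.action)
            self._timer.start()
```

Saving a file five times in quick succession therefore produces a single indexing pass, which is exactly why watch mode stays cheap during active development.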
### `pv serve` - Start MCP Server
Start MCP server for AI agent integration.
```bash
pv serve [OPTIONS] PROJECT_PATH
Options:
-p, --port INT Port number (default: 8000)
-h, --host TEXT Host address (default: localhost)
```
**Examples:**
```bash
# Start server
pv serve /path/to/project
# Custom port
pv serve /path/to/project --port 8080
# Expose to network
pv serve /path/to/project --host 0.0.0.0 --port 8000
```
### `pv status` - Show Project Status
Show project status and statistics.
```bash
pv status PROJECT_PATH
```
**Output:**
```
╭────────────── Project Status ──────────────╮
│ Name my-project │
│ Path /path/to/project │
│ Embedding Model all-MiniLM-L6-v2 │
│ │
│ Total Files 49 │
│ Indexed Files 48 │
│ Total Chunks 9222 │
│ │
│ Git Branch main │
│ Last Updated 2025-10-13 12:15:42 │
│ Created 2025-10-10 09:30:15 │
╰────────────────────────────────────────────╯
```
---
## Configuration
### Config File Location
Configuration is stored at `<project>/.vectorizer/config.json`
### Full Configuration Reference
```json
{
"chromadb_path": null,
"embedding_model": "all-MiniLM-L6-v2",
"embedding_provider": "sentence-transformers",
"openai_api_key": null,
"chunk_size": 128,
"chunk_overlap": 32,
"max_file_size_mb": 10,
"included_extensions": [
".py",
".js",
".ts",
".jsx",
".tsx",
".go",
".rs",
".java",
".cpp",
".c",
".h",
".hpp",
".cs",
".php",
".rb",
".swift",
".kt",
".scala",
".clj",
".sh",
".bash",
".zsh",
".fish",
".ps1",
".bat",
".cmd",
".md",
".txt",
".rst",
".json",
".yaml",
".yml",
".toml",
".xml",
".html",
".css",
".scss",
".sql",
".graphql",
".proto"
],
"excluded_patterns": [
"node_modules/**",
".git/**",
"__pycache__/**",
"*.pyc",
".pytest_cache/**",
"venv/**",
"env/**",
".env/**",
"build/**",
"dist/**",
"*.egg-info/**",
".DS_Store",
"*.min.js",
"*.min.css"
],
"mcp_host": "localhost",
"mcp_port": 8000,
"log_level": "INFO",
"log_file": null,
"max_workers": 4,
"batch_size": 100,
"embedding_batch_size": 100,
"parallel_file_processing": true,
"memory_monitoring_enabled": true,
"memory_efficient_search_threshold": 10000,
"gc_interval": 100
}
```
### Key Settings Explained
**Embedding Settings:**
- `embedding_model`: Model for embeddings (all-MiniLM-L6-v2, text-embedding-ada-002, etc.)
- `embedding_provider`: "sentence-transformers" (local) or "openai" (API)
- `chunk_size`: Tokens per chunk (128 for precision, 512 for context)
- `chunk_overlap`: Overlap between chunks (16-32 recommended)
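To see how `chunk_size` and `chunk_overlap` interact, here is a simplified chunker over a generic token list. The real tool tokenizes with the embedding model's tokenizer and chunks at multiple levels; this sketch only shows the sliding-window arithmetic.

```python
def chunk_tokens(tokens: list[str], size: int = 256, overlap: int = 32) -> list[list[str]]:
    """Split a token stream into windows of `size` tokens that share `overlap` tokens."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap  # how far the window advances each time
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]
```

The overlap means a function signature split across a chunk boundary still appears whole in at least one chunk, at the cost of slightly more chunks overall.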
**Performance Settings:**
- `max_workers`: Parallel workers (auto-detected with --optimize)
- `batch_size`: Files per batch (auto-calculated with --optimize)
- `embedding_batch_size`: Embeddings per batch
- `parallel_file_processing`: Enable parallel processing (recommended: true)
**Memory Settings:**
- `memory_monitoring_enabled`: Monitor RAM usage (recommended: true)
- `memory_efficient_search_threshold`: Switch to streaming for large results
- `gc_interval`: Garbage collection frequency (files between GC)
**File Filtering:**
- `included_extensions`: File types to index
- `excluded_patterns`: Glob patterns to ignore
- `max_file_size_mb`: Skip files larger than this
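The three filtering settings combine into a simple per-file decision, which can be sketched with `fnmatch` against project-relative paths (the function here is illustrative, not the tool's API):

```python
from fnmatch import fnmatch

def should_index(rel_path: str, size_bytes: int,
                 included_extensions, excluded_patterns,
                 max_file_size_mb: int = 10) -> bool:
    """Apply include/exclude/size rules to one project-relative path."""
    if not any(rel_path.endswith(ext) for ext in included_extensions):
        return False
    if any(fnmatch(rel_path, pat) for pat in excluded_patterns):
        return False
    return size_bytes <= max_file_size_mb * 1024**2
```

Note that patterns are matched against the path relative to the project root, so `node_modules/**` excludes the whole tree regardless of where the project itself lives on disk.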
**Server Settings:**
- `mcp_host`: MCP server host
- `mcp_port`: MCP server port
- `log_level`: INFO, DEBUG, WARNING, ERROR
- `chromadb_path`: Custom ChromaDB location (optional)
### Environment Variables
Create `.env` file or export:
```bash
# OpenAI API Key (required for OpenAI embeddings)
export OPENAI_API_KEY="sk-..."
# Override config values
export EMBEDDING_PROVIDER="sentence-transformers"
export EMBEDDING_MODEL="all-MiniLM-L6-v2"
export CHUNK_SIZE="256"
export DEFAULT_SEARCH_THRESHOLD="0.3"
# Database
export CHROMADB_PATH="/custom/path/to/chromadb"
# Logging
export LOG_LEVEL="INFO"
export LOG_FILE="/var/log/vectorizer.log"
```
For complete list, see [docs/ENVIRONMENT.md](docs/ENVIRONMENT.md)
### Editing Configuration
```bash
# View current config
cat /path/to/project/.vectorizer/config.json
# Edit manually
nano /path/to/project/.vectorizer/config.json
# Or regenerate with optimization
pv init /path/to/project --optimize
```
---
## Search Features
### Single-Word Search
Optimized for high-precision single-keyword searches.
```bash
# Programming keywords
pv search /path/to/project "async" --threshold 0.9
pv search /path/to/project "test" --threshold 0.8
pv search /path/to/project "class" --threshold 0.9
pv search /path/to/project "import" --threshold 0.85
# Works great for finding specific constructs
pv search /path/to/project "def" --threshold 0.9 # Python functions
pv search /path/to/project "function" --threshold 0.9 # JS functions
pv search /path/to/project "catch" --threshold 0.8 # Error handling
```
**Features:**
- **Exact Word Matching**: Prioritizes exact word boundaries
- **Keyword Detection**: Special handling for programming keywords
- **Relevance Boosting**: Huge boost for exact matches
- **High Thresholds**: Reliable results even at 0.8-0.9+
### Multi-Word Search
Semantic search for phrases and concepts.
```bash
# Natural language
pv search /path/to/project "user authentication logic" --threshold 0.5
# Code patterns
pv search /path/to/project "error handling in database" --threshold 0.4
# Features
pv search /path/to/project "rate limiting middleware" --threshold 0.6
```
### Search Result Ranking
Results ranked by:
1. **Exact word matches** (highest priority)
2. **Content type** (micro/word chunks get boost)
3. **Partial matches** within larger words
4. **Semantic similarity** from embeddings
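The four signals can be folded into one sortable score. The weights below are illustrative only (the tool's actual boost values are internal); the point is the structure: embedding similarity as the base, with additive boosts for exact word-boundary matches, partial matches, and small chunk types.

```python
import re

def rank_score(query: str, chunk_text: str, similarity: float, chunk_type: str) -> float:
    """Combine the documented ranking signals (weights are illustrative)."""
    score = similarity                        # base: semantic similarity
    words = query.lower().split()
    text = chunk_text.lower()
    if all(re.search(rf"\b{re.escape(w)}\b", text) for w in words):
        score += 0.5                          # exact word-boundary match: big boost
    elif any(w in text for w in words):
        score += 0.1                          # partial match inside a larger word
    if chunk_type in {"micro", "word"}:
        score += 0.2                          # small chunks rank higher for keywords
    return score
```

This structure is why single-word searches stay reliable at thresholds of 0.8+: an exact keyword hit in a micro-chunk outranks a merely similar paragraph.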
### Recommended Thresholds by Query Type
| Query Type | Threshold | Example |
| -------------- | --------- | --------------------------------- |
| Single keyword | 0.7-0.95 | "async", "test", "class" |
| Two words | 0.5-0.8 | "error handling", "api routes" |
| Short phrase | 0.4-0.7 | "user login validation" |
| Complex query | 0.3-0.5 | "authentication with jwt tokens" |
| Exploratory | 0.1-0.3 | "machine learning model training" |
---
## MCP Server
### Starting the Server
```bash
# Default (localhost:8000)
pv serve /path/to/project
# Custom settings
pv serve /path/to/project --host 0.0.0.0 --port 8080
```
### Available MCP Tools
While the server is running, AI agents can use these tools:
1. **search_code** - Search vectorized codebase
```json
{
"query": "authentication logic",
"limit": 10,
"threshold": 0.5
}
```
2. **get_file_content** - Retrieve full file
```json
{
"file_path": "src/auth/login.py"
}
```
3. **list_files** - List all files
```json
{
"file_type": "py" // optional filter
}
```
4. **get_project_stats** - Get statistics
```json
{}
```
### HTTP Fallback API
If MCP is unavailable, HTTP endpoints are provided:
```bash
# Search
curl "http://localhost:8000/search?q=authentication&limit=5&threshold=0.5"
# Get file
curl "http://localhost:8000/file/src/auth/login.py"
# List files
curl "http://localhost:8000/files?type=py"
# Statistics
curl "http://localhost:8000/stats"
# Health check
curl "http://localhost:8000/health"
```
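The same endpoints can be called from Python with the standard library alone. This sketch assumes the server returns JSON (the exact response schema is not documented here, so treat the decoding step as an assumption):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def search_url(query: str, limit: int = 10, threshold: float = 0.5,
               base: str = "http://localhost:8000") -> str:
    """Build the fallback search endpoint URL."""
    return f"{base}/search?" + urlencode(
        {"q": query, "limit": limit, "threshold": threshold}
    )

def search(query: str, **kwargs) -> dict:
    """Query a running `pv serve` instance and decode the JSON response (assumed shape)."""
    with urlopen(search_url(query, **kwargs)) as resp:
        return json.load(resp)
```

Because `urlencode` handles quoting, multi-word queries like `"error handling"` are safe to pass straight through.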
### Use Cases
1. **AI Code Review**: Let Claude analyze your codebase semantically
2. **Intelligent Navigation**: Ask AI to find relevant code
3. **Documentation**: Generate docs from actual code
4. **Onboarding**: Help new devs understand codebase
5. **Refactoring**: Find similar patterns across project
---
## Advanced Usage
### Python API
#### Basic Usage
```python
import asyncio
from pathlib import Path
from project_vectorizer.core.config import Config
from project_vectorizer.core.project import ProjectManager
async def main():
# Initialize project
config = Config.create_optimized(
embedding_model="all-MiniLM-L6-v2",
chunk_size=256
)
project_path = Path("/path/to/project")
manager = ProjectManager(project_path, config)
# Initialize
await manager.initialize("My Project")
# Index
await manager.load()
await manager.index_all()
# Search
results = await manager.search("authentication", limit=10, threshold=0.5)
for result in results:
print(f"{result['file_path']}: {result['similarity']:.3f}")
asyncio.run(main())
```
#### Progress Tracking
```python
from rich.progress import Progress, BarColumn, TaskProgressColumn
async def index_with_progress(project_path):
config = Config.load_from_project(project_path)
manager = ProjectManager(project_path, config)
await manager.load()
with Progress() as progress:
task = progress.add_task("Indexing...", total=100)
def update_progress(current, total, description):
progress.update(task, completed=current, total=total, description=description)
manager.set_progress_callback(update_progress)
await manager.index_all()
```
#### Custom Resource Limits
```python
import psutil
async def adaptive_index(project_path):
"""Index with resources based on current load."""
cpu_percent = psutil.cpu_percent(interval=1)
if cpu_percent < 50: # System idle
config = Config.create_optimized()
else: # System busy
config = Config(max_workers=4, batch_size=100)
manager = ProjectManager(project_path, config)
await manager.load()
await manager.index_all()
```
### Chunk Size Optimization
The engine caps chunks at 128 tokens (see engine.py:35) to keep search precise; you can request larger sizes in the config, but the engine clamps them back to 128:
```bash
# Precision (default, forced max 128)
pv init /path/to/project --chunk-size 128
# More context (still capped at 128 by engine)
pv init /path/to/project --chunk-size 512
```
**Performance Note**: Chunk size has virtually no impact on indexing speed (~2m 16s for both the 128 and 512 settings). Choose based on search-quality needs:
- **128**: Better precision, exact matches
- **512**: More context, better understanding
### CI/CD Integration
```yaml
# .github/workflows/vectorize.yml
name: Vectorize Codebase
on:
push:
branches: [main]
jobs:
vectorize:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Setup Python
uses: actions/setup-python@v4
with:
python-version: "3.9"
- name: Install vectorizer
run: pip install project-vectorizer
- name: Initialize and index
run: |
pv init . --optimize --name "${{ github.repository }}"
pv index . --max-resources
- name: Test search
run: pv search . "test" --limit 5
```
### Custom File Filters
```json
{
"included_extensions": [".py", ".js", ".custom"],
"excluded_patterns": ["tests/**", "*.generated.js", "vendor/**", "*.min.*"]
}
```
### Watch Mode During Development
```bash
# Terminal 1: Watch mode
pv sync /path/to/project --watch --debounce 1.0
# Terminal 2: Make code changes
# Auto-indexes when you save
# Terminal 3: Search as you code
pv search /path/to/project "your new function" --threshold 0.5
```
---
## Troubleshooting
### Common Issues
#### 1. Slow Indexing
**Problem**: Indexing taking too long
**Solutions:**
```bash
# Use max resources
pv index /path/to/project --max-resources
# Use smart incremental for updates
pv index /path/to/project --smart
# Use git-aware for recent changes
pv index-git /path/to/project --since HEAD~1
# Check if optimization is working
pv index /path/to/project --max-resources --verbose
# Look for: "Workers: 16, Batch Size: 400"
```
#### 2. High Memory Usage
**Problem**: Process using too much RAM or getting killed
**Solutions:**
```bash
# Reduce batch size in config
{
"batch_size": 50,
"max_workers": 4
}
# Enable memory monitoring
{
"memory_monitoring_enabled": true,
"gc_interval": 50
}
# Use smaller chunks
pv init /path/to/project --chunk-size 128
```
#### 3. Poor Search Results
**Problem**: Search not finding relevant code
**Solutions:**
```bash
# Lower threshold for phrases
pv search /path/to/project "your query" --threshold 0.3
# Higher threshold for keywords
pv search /path/to/project "async" --threshold 0.9
# Use smaller chunk size for precision
# Edit config: "chunk_size": 128
# Ensure index is up to date
pv index /path/to/project --smart
```
#### 4. No Results for Single Words
**Problem**: Single-word searches return nothing
**Solutions:**
```bash
# Try lower threshold
pv search /path/to/project "yourword" --threshold 0.5
# Check if word exists
pv search /path/to/project "yourword" --threshold 0.1 --limit 1
# Reindex with smaller chunks
# Edit config: "chunk_size": 128
pv index /path/to/project --force
```
#### 5. Missing Recent Changes
**Problem**: Just-edited code not showing in search
**Solutions:**
```bash
# Run smart incremental
pv index /path/to/project --smart
# Or git-aware
pv index-git /path/to/project --since HEAD~1
# Check status
pv status /path/to/project
```
#### 6. psutil Not Found
**Problem**: Optimization not working
**Solution:**
```bash
# Install psutil
pip install psutil
# Verify
python -c "import psutil; print(f'CPUs: {psutil.cpu_count()}, RAM: {psutil.virtual_memory().available / 1024**3:.1f}GB')"
# Try again
pv init /path/to/project --optimize
```
### Debug Mode
```bash
# Enable verbose logging
pv --verbose index /path/to/project
# Check project status
pv status /path/to/project
# View config
cat /path/to/project/.vectorizer/config.json
# Check ChromaDB
ls -lh /path/to/project/.vectorizer/chromadb/
```
### Performance Debugging
```bash
# Time operations
time pv index /path/to/project
time pv index /path/to/project --max-resources
# Monitor resources during indexing
# Terminal 1:
pv index /path/to/project --max-resources
# Terminal 2:
htop # or top
# Should see high CPU across all cores
# Check memory warnings
pv index /path/to/project --max-resources --verbose
# Look for memory warnings
```
---
## Changelog
### [0.1.2] - 2025-10-13
#### Added
- **Optimized Config Generation** - `Config.create_optimized()` auto-detects CPU/RAM
- **Max Resources Flag** - `--max-resources` for temporary performance boost
- **psutil Integration** - Automatic system resource detection
- **Unified Progress Tracking** - Clean single-line progress bar
- **Library Progress Suppression** - No more cluttered batch progress bars
- **Timing Information** - All operations show elapsed time
- **Clean Terminal Output** - Professional UI with timing
#### Performance
- **2x faster** full indexing with --max-resources
- **60-70% faster** smart incremental updates
- **80-90% faster** git-aware indexing
#### Documentation
- Comprehensive documentation overhaul
- Consolidated all guides into main README
- Added CHANGELOG.md with version history
### [0.1.1] - 2025-10-12
- Enhanced single-word search with high precision
- Multi-level chunking (micro + word-level)
- Adaptive search thresholds
- Programming keyword detection
- Improved word matching and relevance boosting
### [0.1.0] - 2025-10-10
- Initial release
- Code vectorization
- Smart incremental indexing
- Git-aware indexing
- MCP server
- Watch mode
- ChromaDB backend
- 30+ language support
---
## Contributing
### Development Setup
```bash
# Clone repository
git clone https://github.com/starkbaknet/project-vectorizer.git
cd project-vectorizer
# Create virtual environment
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# Install with dev dependencies
pip install -e ".[dev]"
# Run tests
pytest
# Format code
black .
isort .
```
### Running Tests
```bash
# All tests
pytest
# With coverage
pytest --cov=project_vectorizer
# Specific test
pytest tests/test_config.py
# Verbose
pytest -v
```
See [docs/TESTING.md](docs/TESTING.md) for details.
### Publishing
See [docs/PUBLISHING.md](docs/PUBLISHING.md) for PyPI publishing guide.
### Contributing Guidelines
1. Fork repository
2. Create feature branch: `git checkout -b feature/amazing-feature`
3. Make changes and add tests
4. Ensure tests pass: `pytest`
5. Format code: `black . && isort .`
6. Commit: `git commit -m 'Add amazing feature'`
7. Push: `git push origin feature/amazing-feature`
8. Open Pull Request
---
## License
MIT License - see [LICENSE](LICENSE) file
---
## Additional Resources
- **GitHub**: https://github.com/starkbaknet/project-vectorizer
- **PyPI**: https://pypi.org/project/project-vectorizer/
- **Issues**: https://github.com/starkbaknet/project-vectorizer/issues
---
**Made with ❤️ by StarkBakNet**
_Vectorize your codebase. Empower your AI agents. Build better software._
\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 \u2713 Indexing complete! \u2502\n\u2502 \u2502\n\u2502 Files indexed: 48/49 \u2502\n\u2502 Total chunks: 9222 \u2502\n\u2502 Model: all-MiniLM-L6-v2 \u2502\n\u2502 Time taken: 2m 16s \u2502\n\u2502 \u2502\n\u2502 You can now search with: pv search . \"your query\" \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n```\n\n### 3. Search Your Code\n\n```bash\n# Natural language search\npv search /path/to/project \"authentication logic\"\n\n# Single-word searches work great (high precision)\npv search /path/to/project \"async\" --threshold 0.8\npv search /path/to/project \"test\" --threshold 0.9\n\n# Multi-word queries (semantic search)\npv search /path/to/project \"user login validation\" --threshold 0.5\n\n# Find specific constructs\npv search /path/to/project \"class\" --limit 10\n```\n\n**Output:**\n\n```\nSearch Results for: authentication logic\n\nFound 5 result(s) with threshold >= 0.5\n\n\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 Result 1 \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 src/auth/login.py \u2502\n\u2502 Lines 45-67 | Similarity: 0.892 \u2502\n\u2502 \u2502\n\u2502 def authenticate_user(username: str, password: str): \u2502\n\u2502 \"\"\" \u2502\n\u2502 Authenticate user credentials against database \u2502\n\u2502 Returns user object if valid, None otherwise \u2502\n\u2502 \"\"\" \u2502\n\u2502 ... 
\u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n```\n\n### 4. Start MCP Server\n\n```bash\n# Start server (default: localhost:8000)\npv serve /path/to/project\n\n# Custom host/port\npv serve /path/to/project --host 0.0.0.0 --port 8080\n```\n\n### 5. Monitor Changes in Real-Time\n\n```bash\n# Watch for file changes (default 2s debounce)\npv sync /path/to/project --watch\n\n# Fast feedback (0.5s)\npv sync /path/to/project --watch --debounce 0.5\n\n# Slower systems (5s)\npv sync /path/to/project --watch --debounce 5.0\n```\n\n---\n\n## Performance Optimization\n\n### Understanding the Optimization Flags\n\n#### `--optimize` (Permanent)\n\nUse when **initializing** a new project. Detects your system and saves optimal settings.\n\n```bash\npv init /path/to/project --optimize\n```\n\n**What it does:**\n\n- Detects CPU cores \u2192 sets `max_workers` (e.g., 8 cores = 16 workers)\n- Calculates RAM \u2192 sets safe `batch_size` (e.g., 16GB = 400 batch)\n- Sets memory thresholds based on total RAM\n- **Saves to config** - All future operations use these settings\n\n**When to use:**\n\n- \u2705 New projects\n- \u2705 Want permanent optimization\n- \u2705 Same machine for all operations\n- \u2705 \"Set and forget\" approach\n\n#### `--max-resources` (Temporary)\n\nUse when **indexing** to temporarily boost performance without changing config.\n\n```bash\npv index /path/to/project --max-resources\npv index-git /path/to/project --since HEAD~1 --max-resources\n```\n\n**What it does:**\n\n- Detects system resources (same as --optimize)\n- **Temporarily overrides** config for this operation only\n- Original config unchanged\n\n**When to use:**\n\n- \u2705 Existing project 
without optimization\n- \u2705 One-time heavy indexing\n- \u2705 CI/CD with dedicated resources\n- \u2705 Don't want to modify config\n\n### Performance Benchmarks\n\n**System**: 8-core CPU, 16GB RAM, SSD\n\n| Mode | Files | Chunks | Time | Settings |\n| ------------------ | --------- | ------ | ------ | --------------------- |\n| Standard | 48 | 9222 | 4m 32s | 4 workers, 100 batch |\n| --max-resources | 48 | 9222 | 2m 16s | 16 workers, 400 batch |\n| Smart incremental | 5 changed | 412 | 24s | 16 workers, 400 batch |\n| Git-aware (HEAD~1) | 3 changed | 287 | 15s | 16 workers, 400 batch |\n\n**Key Findings:**\n\n- `--max-resources`: **2x faster** for full indexing\n- Smart incremental: **60-70% faster** than full reindex\n- Git-aware: **80-90% faster** for recent changes\n- Chunk size (128 vs 512): **No performance difference** (same ~2m 16s)\n\n### System Resource Detection\n\n**CPU Detection:**\n\n```\nDetected: 8 cores\nOptimal workers: min(8 * 2, 16) = 16 workers\n```\n\n**Memory Detection:**\n\n```\nTotal RAM: 16GB\nAvailable RAM: 8GB\nSafe batch size: 8GB * 0.5 * 100 = 400\nEmbedding batch: 400 * 0.5 = 200\nGC interval: 100 files\n```\n\n**Memory Thresholds:**\n\n```\n32GB+ RAM \u2192 threshold: 50000\n16-32GB \u2192 threshold: 20000\n8-16GB \u2192 threshold: 10000\n<8GB \u2192 threshold: 5000\n```\n\n### Best Practices\n\n1. **Initialize with optimization**\n\n ```bash\n pv init ~/my-project --optimize\n ```\n\n2. **Use max resources for heavy operations**\n\n ```bash\n pv index ~/my-project --force --max-resources\n ```\n\n3. **Use smart mode for daily updates**\n\n ```bash\n pv index ~/my-project --smart\n ```\n\n4. **Use git-aware after pulling changes**\n\n ```bash\n pv index-git ~/my-project --since HEAD~1\n ```\n\n5. 
**Monitor memory with verbose mode**\n ```bash\n pv index ~/my-project --max-resources --verbose\n ```\n\n---\n\n## CLI Commands\n\n### Global Options\n\n```bash\npv [OPTIONS] COMMAND [ARGS]\n\nOptions:\n -v, --verbose Enable verbose output\n --version Show version\n --help Show help\n```\n\n### `pv init` - Initialize Project\n\nInitialize a new project for vectorization.\n\n```bash\npv init [OPTIONS] PROJECT_PATH\n\nOptions:\n -n, --name TEXT Project name (default: directory name)\n -m, --embedding-model TEXT Model name (default: all-MiniLM-L6-v2)\n -p, --embedding-provider Provider: sentence-transformers | openai\n -c, --chunk-size INT Chunk size in tokens (default: 256)\n -o, --chunk-overlap INT Overlap in tokens (default: 32)\n --optimize Auto-optimize based on system resources \u2b50\n```\n\n**Examples:**\n\n```bash\n# Basic initialization\npv init /path/to/project\n\n# With optimization (recommended)\npv init /path/to/project --optimize\n\n# With OpenAI embeddings\nexport OPENAI_API_KEY=\"sk-...\"\npv init /path/to/project \\\n --embedding-provider openai \\\n --embedding-model text-embedding-ada-002 \\\n --optimize\n```\n\n### `pv index` - Index Codebase\n\nIndex the codebase for searching.\n\n```bash\npv index [OPTIONS] PROJECT_PATH\n\nOptions:\n -i, --incremental Only index changed files\n -s, --smart Smart incremental (categorized: new/modified/deleted) \u2b50\n -f, --force Force re-index all files\n --max-resources Use maximum system resources \u2b50\n```\n\n**Examples:**\n\n```bash\n# Full indexing with max resources\npv index /path/to/project --max-resources\n\n# Smart incremental (fastest for updates)\npv index /path/to/project --smart\n\n# Combine for maximum performance\npv index /path/to/project --smart --max-resources\n\n# Force complete reindex\npv index /path/to/project --force\n```\n\n### `pv index-git` - Git-Aware Indexing\n\nIndex only files changed in git commits.\n\n```bash\npv index-git [OPTIONS] PROJECT_PATH\n\nOptions:\n -s, --since TEXT 
Git reference (default: HEAD~1)\n --max-resources Use maximum system resources \u2b50\n```\n\n**Examples:**\n\n```bash\n# Last commit\npv index-git /path/to/project --since HEAD~1\n\n# Last 5 commits\npv index-git /path/to/project --since HEAD~5\n\n# Since main branch\npv index-git /path/to/project --since main\n\n# Since specific commit\npv index-git /path/to/project --since abc123def\n\n# With max resources\npv index-git /path/to/project --since HEAD~10 --max-resources\n```\n\n**Use Cases:**\n\n- After `git pull` - index only new changes\n- Before code review - index PR changes\n- CI/CD pipelines - index commit range\n- After branch switch - index differences\n\n### `pv search` - Search Code\n\nSearch through vectorized codebase.\n\n```bash\npv search [OPTIONS] PROJECT_PATH QUERY\n\nOptions:\n -l, --limit INT Number of results (default: 10)\n -t, --threshold FLOAT Similarity threshold 0.0-1.0 (default: 0.3)\n```\n\n**Examples:**\n\n```bash\n# Natural language search\npv search /path/to/project \"error handling in database connections\"\n\n# Single-word search (high threshold)\npv search /path/to/project \"async\" --threshold 0.9\n\n# Find all tests\npv search /path/to/project \"test\" --limit 20 --threshold 0.8\n\n# Broad semantic search (low threshold)\npv search /path/to/project \"api authentication\" --threshold 0.3\n```\n\n**Threshold Guide:**\n\n- **0.8-0.95**: Single words, exact matches\n- **0.5-0.7**: Multi-word phrases, semantic\n- **0.3-0.5**: Complex queries, broad search\n- **0.1-0.3**: Very broad, exploratory\n\n### `pv sync` - Sync Changes / Watch Mode\n\nSync changes or watch for file modifications.\n\n```bash\npv sync [OPTIONS] PROJECT_PATH\n\nOptions:\n -w, --watch Watch for file changes\n -d, --debounce FLOAT Debounce delay in seconds (default: 2.0)\n```\n\n**Examples:**\n\n```bash\n# One-time sync (smart incremental)\npv sync /path/to/project\n\n# Watch mode with default debounce (2s)\npv sync /path/to/project --watch\n\n# Fast feedback 
(0.5s)\npv sync /path/to/project --watch --debounce 0.5\n\n# Slower systems (5s)\npv sync /path/to/project --watch --debounce 5.0\n```\n\n**Debounce Explained:**\n\n- Waits X seconds after last file change before indexing\n- Batches multiple rapid changes together\n- Prevents redundant indexing when saving files repeatedly\n- Reduces CPU usage during active development\n\n**Recommended Values:**\n\n- **0.5-1.0s**: Fast machines, need instant feedback\n- **2.0s**: Balanced (default)\n- **5.0-10.0s**: Slower machines, large codebases\n\n### `pv serve` - Start MCP Server\n\nStart MCP server for AI agent integration.\n\n```bash\npv serve [OPTIONS] PROJECT_PATH\n\nOptions:\n -p, --port INT Port number (default: 8000)\n -h, --host TEXT Host address (default: localhost)\n```\n\n**Examples:**\n\n```bash\n# Start server\npv serve /path/to/project\n\n# Custom port\npv serve /path/to/project --port 8080\n\n# Expose to network\npv serve /path/to/project --host 0.0.0.0 --port 8000\n```\n\n### `pv status` - Show Project Status\n\nShow project status and statistics.\n\n```bash\npv status PROJECT_PATH\n```\n\n**Output:**\n\n```\n\u256d\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500 Project Status \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256e\n\u2502 Name my-project \u2502\n\u2502 Path /path/to/project \u2502\n\u2502 Embedding Model all-MiniLM-L6-v2 \u2502\n\u2502 \u2502\n\u2502 Total Files 49 \u2502\n\u2502 Indexed Files 48 \u2502\n\u2502 Total Chunks 9222 \u2502\n\u2502 \u2502\n\u2502 Git Branch main \u2502\n\u2502 Last Updated 2025-10-13 12:15:42 \u2502\n\u2502 Created 2025-10-10 09:30:15 \u2502\n\u2570\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u256f\n```\n\n---\n\n## 
Configuration\n\n### Config File Location\n\nConfiguration is stored at `<project>/.vectorizer/config.json`\n\n### Full Configuration Reference\n\n```json\n{\n \"chromadb_path\": null,\n \"embedding_model\": \"all-MiniLM-L6-v2\",\n \"embedding_provider\": \"sentence-transformers\",\n \"openai_api_key\": null,\n \"chunk_size\": 128,\n \"chunk_overlap\": 32,\n \"max_file_size_mb\": 10,\n \"included_extensions\": [\n \".py\",\n \".js\",\n \".ts\",\n \".jsx\",\n \".tsx\",\n \".go\",\n \".rs\",\n \".java\",\n \".cpp\",\n \".c\",\n \".h\",\n \".hpp\",\n \".cs\",\n \".php\",\n \".rb\",\n \".swift\",\n \".kt\",\n \".scala\",\n \".clj\",\n \".sh\",\n \".bash\",\n \".zsh\",\n \".fish\",\n \".ps1\",\n \".bat\",\n \".cmd\",\n \".md\",\n \".txt\",\n \".rst\",\n \".json\",\n \".yaml\",\n \".yml\",\n \".toml\",\n \".xml\",\n \".html\",\n \".css\",\n \".scss\",\n \".sql\",\n \".graphql\",\n \".proto\"\n ],\n \"excluded_patterns\": [\n \"node_modules/**\",\n \".git/**\",\n \"__pycache__/**\",\n \"*.pyc\",\n \".pytest_cache/**\",\n \"venv/**\",\n \"env/**\",\n \".env/**\",\n \"build/**\",\n \"dist/**\",\n \"*.egg-info/**\",\n \".DS_Store\",\n \"*.min.js\",\n \"*.min.css\"\n ],\n \"mcp_host\": \"localhost\",\n \"mcp_port\": 8000,\n \"log_level\": \"INFO\",\n \"log_file\": null,\n \"max_workers\": 4,\n \"batch_size\": 100,\n \"embedding_batch_size\": 100,\n \"parallel_file_processing\": true,\n \"memory_monitoring_enabled\": true,\n \"memory_efficient_search_threshold\": 10000,\n \"gc_interval\": 100\n}\n```\n\n### Key Settings Explained\n\n**Embedding Settings:**\n\n- `embedding_model`: Model for embeddings (all-MiniLM-L6-v2, text-embedding-ada-002, etc.)\n- `embedding_provider`: \"sentence-transformers\" (local) or \"openai\" (API)\n- `chunk_size`: Tokens per chunk (128 for precision, 512 for context)\n- `chunk_overlap`: Overlap between chunks (16-32 recommended)\n\n**Performance Settings:**\n\n- `max_workers`: Parallel workers (auto-detected with --optimize)\n- `batch_size`: Files 
per batch (auto-calculated with --optimize)\n- `embedding_batch_size`: Embeddings per batch\n- `parallel_file_processing`: Enable parallel processing (recommended: true)\n\n**Memory Settings:**\n\n- `memory_monitoring_enabled`: Monitor RAM usage (recommended: true)\n- `memory_efficient_search_threshold`: Switch to streaming for large results\n- `gc_interval`: Garbage collection frequency (files between GC)\n\n**File Filtering:**\n\n- `included_extensions`: File types to index\n- `excluded_patterns`: Glob patterns to ignore\n- `max_file_size_mb`: Skip files larger than this\n\n**Server Settings:**\n\n- `mcp_host`: MCP server host\n- `mcp_port`: MCP server port\n- `log_level`: INFO, DEBUG, WARNING, ERROR\n- `chromadb_path`: Custom ChromaDB location (optional)\n\n### Environment Variables\n\nCreate `.env` file or export:\n\n```bash\n# OpenAI API Key (required for OpenAI embeddings)\nexport OPENAI_API_KEY=\"sk-...\"\n\n# Override config values\nexport EMBEDDING_PROVIDER=\"sentence-transformers\"\nexport EMBEDDING_MODEL=\"all-MiniLM-L6-v2\"\nexport CHUNK_SIZE=\"256\"\nexport DEFAULT_SEARCH_THRESHOLD=\"0.3\"\n\n# Database\nexport CHROMADB_PATH=\"/custom/path/to/chromadb\"\n\n# Logging\nexport LOG_LEVEL=\"INFO\"\nexport LOG_FILE=\"/var/log/vectorizer.log\"\n```\n\nFor complete list, see [docs/ENVIRONMENT.md](docs/ENVIRONMENT.md)\n\n### Editing Configuration\n\n```bash\n# View current config\ncat /path/to/project/.vectorizer/config.json\n\n# Edit manually\nnano /path/to/project/.vectorizer/config.json\n\n# Or regenerate with optimization\npv init /path/to/project --optimize\n```\n\n---\n\n## Search Features\n\n### Single-Word Search\n\nOptimized for high-precision single-keyword searches.\n\n```bash\n# Programming keywords\npv search /path/to/project \"async\" --threshold 0.9\npv search /path/to/project \"test\" --threshold 0.8\npv search /path/to/project \"class\" --threshold 0.9\npv search /path/to/project \"import\" --threshold 0.85\n\n# Works great for finding specific 
constructs\npv search /path/to/project \"def\" --threshold 0.9 # Python functions\npv search /path/to/project \"function\" --threshold 0.9 # JS functions\npv search /path/to/project \"catch\" --threshold 0.8 # Error handling\n```\n\n**Features:**\n\n- **Exact Word Matching**: Prioritizes exact word boundaries\n- **Keyword Detection**: Special handling for programming keywords\n- **Relevance Boosting**: Huge boost for exact matches\n- **High Thresholds**: Reliable results even at 0.8-0.9+\n\n### Multi-Word Search\n\nSemantic search for phrases and concepts.\n\n```bash\n# Natural language\npv search /path/to/project \"user authentication logic\" --threshold 0.5\n\n# Code patterns\npv search /path/to/project \"error handling in database\" --threshold 0.4\n\n# Features\npv search /path/to/project \"rate limiting middleware\" --threshold 0.6\n```\n\n### Search Result Ranking\n\nResults ranked by:\n\n1. **Exact word matches** (highest priority)\n2. **Content type** (micro/word chunks get boost)\n3. **Partial matches** within larger words\n4. **Semantic similarity** from embeddings\n\n### Recommended Thresholds by Query Type\n\n| Query Type | Threshold | Example |\n| -------------- | --------- | --------------------------------- |\n| Single keyword | 0.7-0.95 | \"async\", \"test\", \"class\" |\n| Two words | 0.5-0.8 | \"error handling\", \"api routes\" |\n| Short phrase | 0.4-0.7 | \"user login validation\" |\n| Complex query | 0.3-0.5 | \"authentication with jwt tokens\" |\n| Exploratory | 0.1-0.3 | \"machine learning model training\" |\n\n---\n\n## MCP Server\n\n### Starting the Server\n\n```bash\n# Default (localhost:8000)\npv serve /path/to/project\n\n# Custom settings\npv serve /path/to/project --host 0.0.0.0 --port 8080\n```\n\n### Available MCP Tools\n\nWhen running, AI agents can use these tools:\n\n1. **search_code** - Search vectorized codebase\n\n ```json\n {\n \"query\": \"authentication logic\",\n \"limit\": 10,\n \"threshold\": 0.5\n }\n ```\n\n2. 
**get_file_content** - Retrieve full file\n\n ```json\n {\n \"file_path\": \"src/auth/login.py\"\n }\n ```\n\n3. **list_files** - List all files\n\n ```json\n {\n \"file_type\": \"py\" // optional filter\n }\n ```\n\n4. **get_project_stats** - Get statistics\n ```json\n {}\n ```\n\n### HTTP Fallback API\n\nIf MCP unavailable, HTTP endpoints provided:\n\n```bash\n# Search\ncurl \"http://localhost:8000/search?q=authentication&limit=5&threshold=0.5\"\n\n# Get file\ncurl \"http://localhost:8000/file/src/auth/login.py\"\n\n# List files\ncurl \"http://localhost:8000/files?type=py\"\n\n# Statistics\ncurl \"http://localhost:8000/stats\"\n\n# Health check\ncurl \"http://localhost:8000/health\"\n```\n\n### Use Cases\n\n1. **AI Code Review**: Let Claude analyze your codebase semantically\n2. **Intelligent Navigation**: Ask AI to find relevant code\n3. **Documentation**: Generate docs from actual code\n4. **Onboarding**: Help new devs understand codebase\n5. **Refactoring**: Find similar patterns across project\n\n---\n\n## Advanced Usage\n\n### Python API\n\n#### Basic Usage\n\n```python\nimport asyncio\nfrom pathlib import Path\nfrom project_vectorizer.core.config import Config\nfrom project_vectorizer.core.project import ProjectManager\n\nasync def main():\n # Initialize project\n config = Config.create_optimized(\n embedding_model=\"all-MiniLM-L6-v2\",\n chunk_size=256\n )\n\n project_path = Path(\"/path/to/project\")\n manager = ProjectManager(project_path, config)\n\n # Initialize\n await manager.initialize(\"My Project\")\n\n # Index\n await manager.load()\n await manager.index_all()\n\n # Search\n results = await manager.search(\"authentication\", limit=10, threshold=0.5)\n for result in results:\n print(f\"{result['file_path']}: {result['similarity']:.3f}\")\n\nasyncio.run(main())\n```\n\n#### Progress Tracking\n\n```python\nfrom rich.progress import Progress, BarColumn, TaskProgressColumn\n\nasync def index_with_progress(project_path):\n config = 
Config.load_from_project(project_path)\n manager = ProjectManager(project_path, config)\n await manager.load()\n\n with Progress() as progress:\n task = progress.add_task(\"Indexing...\", total=100)\n\n def update_progress(current, total, description):\n progress.update(task, completed=current, total=total, description=description)\n\n manager.set_progress_callback(update_progress)\n await manager.index_all()\n```\n\n#### Custom Resource Limits\n\n```python\nimport psutil\n\nasync def adaptive_index(project_path):\n \"\"\"Index with resources based on current load.\"\"\"\n cpu_percent = psutil.cpu_percent(interval=1)\n\n if cpu_percent < 50: # System idle\n config = Config.create_optimized()\n else: # System busy\n config = Config(max_workers=4, batch_size=100)\n\n manager = ProjectManager(project_path, config)\n await manager.load()\n await manager.index_all()\n```\n\n### Chunk Size Optimization\n\nThe engine enforces a maximum of 128 tokens per chunk (see engine.py:35) for precision; larger configured sizes are accepted but still capped at 128:\n\n```bash\n# Precision (default, forced max 128)\npv init /path/to/project --chunk-size 128\n\n# More context (still capped at 128 by engine)\npv init /path/to/project --chunk-size 512\n```\n\n**Performance Note**: Chunk size has virtually NO impact on indexing speed (~2m 16s for both 128 and 512 tokens). Choose based on search quality needs:\n\n- **128**: Better precision, exact matches\n- **512**: More context, better understanding\n\n### CI/CD Integration\n\n```yaml\n# .github/workflows/vectorize.yml\nname: Vectorize Codebase\n\non:\n push:\n branches: [main]\n\njobs:\n vectorize:\n runs-on: ubuntu-latest\n steps:\n - uses: actions/checkout@v3\n\n - name: Setup Python\n uses: actions/setup-python@v4\n with:\n python-version: \"3.9\"\n\n - name: Install vectorizer\n run: pip install project-vectorizer\n\n - name: Initialize and index\n run: |\n pv init . --optimize --name \"${{ github.repository }}\"\n pv index . 
--max-resources\n\n - name: Test search\n run: pv search . \"test\" --limit 5\n```\n\n### Custom File Filters\n\n```json\n{\n \"included_extensions\": [\".py\", \".js\", \".custom\"],\n \"excluded_patterns\": [\"tests/**\", \"*.generated.js\", \"vendor/**\", \"*.min.*\"]\n}\n```\n\n### Watch Mode During Development\n\n```bash\n# Terminal 1: Watch mode\npv sync /path/to/project --watch --debounce 1.0\n\n# Terminal 2: Make code changes\n# Auto-indexes when you save\n\n# Terminal 3: Search as you code\npv search /path/to/project \"your new function\" --threshold 0.5\n```\n\n---\n\n## Troubleshooting\n\n### Common Issues\n\n#### 1. Slow Indexing\n\n**Problem**: Indexing taking too long\n\n**Solutions:**\n\n```bash\n# Use max resources\npv index /path/to/project --max-resources\n\n# Use smart incremental for updates\npv index /path/to/project --smart\n\n# Use git-aware for recent changes\npv index-git /path/to/project --since HEAD~1\n\n# Check if optimization is working\npv index /path/to/project --max-resources --verbose\n# Look for: \"Workers: 16, Batch Size: 400\"\n```\n\n#### 2. High Memory Usage\n\n**Problem**: Process using too much RAM or getting killed\n\n**Solutions:**\n\n```bash\n# Reduce batch size in config\n{\n \"batch_size\": 50,\n \"max_workers\": 4\n}\n\n# Enable memory monitoring\n{\n \"memory_monitoring_enabled\": true,\n \"gc_interval\": 50\n}\n\n# Use smaller chunks\npv init /path/to/project --chunk-size 128\n```\n\n#### 3. Poor Search Results\n\n**Problem**: Search not finding relevant code\n\n**Solutions:**\n\n```bash\n# Lower threshold for phrases\npv search /path/to/project \"your query\" --threshold 0.3\n\n# Higher threshold for keywords\npv search /path/to/project \"async\" --threshold 0.9\n\n# Use smaller chunk size for precision\n# Edit config: \"chunk_size\": 128\n\n# Ensure index is up to date\npv index /path/to/project --smart\n```\n\n#### 4. 
No Results for Single Words\n\n**Problem**: Single-word searches return nothing\n\n**Solutions:**\n\n```bash\n# Try lower threshold\npv search /path/to/project \"yourword\" --threshold 0.5\n\n# Check if word exists\npv search /path/to/project \"yourword\" --threshold 0.1 --limit 1\n\n# Reindex with smaller chunks\n# Edit config: \"chunk_size\": 128\npv index /path/to/project --force\n```\n\n#### 5. Missing Recent Changes\n\n**Problem**: Just-edited code not showing in search\n\n**Solutions:**\n\n```bash\n# Run smart incremental\npv index /path/to/project --smart\n\n# Or git-aware\npv index-git /path/to/project --since HEAD~1\n\n# Check status\npv status /path/to/project\n```\n\n#### 6. psutil Not Found\n\n**Problem**: Optimization not working\n\n**Solution:**\n\n```bash\n# Install psutil\npip install psutil\n\n# Verify\npython -c \"import psutil; print(f'CPUs: {psutil.cpu_count()}, RAM: {psutil.virtual_memory().available / 1024**3:.1f}GB')\"\n\n# Try again\npv init /path/to/project --optimize\n```\n\n### Debug Mode\n\n```bash\n# Enable verbose logging\npv --verbose index /path/to/project\n\n# Check project status\npv status /path/to/project\n\n# View config\ncat /path/to/project/.vectorizer/config.json\n\n# Check ChromaDB\nls -lh /path/to/project/.vectorizer/chromadb/\n```\n\n### Performance Debugging\n\n```bash\n# Time operations\ntime pv index /path/to/project\ntime pv index /path/to/project --max-resources\n\n# Monitor resources during indexing\n# Terminal 1:\npv index /path/to/project --max-resources\n\n# Terminal 2:\nhtop # or top\n# Should see high CPU across all cores\n\n# Check memory warnings\npv index /path/to/project --max-resources --verbose\n# Look for memory warnings\n```\n\n---\n\n## Changelog\n\n### [0.1.2] - 2025-10-13\n\n#### Added\n\n- **Optimized Config Generation** - `Config.create_optimized()` auto-detects CPU/RAM\n- **Max Resources Flag** - `--max-resources` for temporary performance boost\n- **psutil Integration** - Automatic system resource 
detection\n- **Unified Progress Tracking** - Clean single-line progress bar\n- **Library Progress Suppression** - No more cluttered batch progress bars\n- **Timing Information** - All operations show elapsed time\n- **Clean Terminal Output** - Professional UI with timing\n\n#### Performance\n\n- **2x faster** full indexing with --max-resources\n- **60-70% faster** smart incremental updates\n- **80-90% faster** git-aware indexing\n\n#### Documentation\n\n- Comprehensive documentation overhaul\n- Consolidated all guides into main README\n- Added CHANGELOG.md with version history\n\n### [0.1.1] - 2025-10-12\n\n- Enhanced single-word search with high precision\n- Multi-level chunking (micro + word-level)\n- Adaptive search thresholds\n- Programming keyword detection\n- Improved word matching and relevance boosting\n\n### [0.1.0] - 2025-10-10\n\n- Initial release\n- Code vectorization\n- Smart incremental indexing\n- Git-aware indexing\n- MCP server\n- Watch mode\n- ChromaDB backend\n- 30+ language support\n\n---\n\n## Contributing\n\n### Development Setup\n\n```bash\n# Clone repository\ngit clone https://github.com/starkbaknet/project-vectorizer.git\ncd project-vectorizer\n\n# Create virtual environment\npython -m venv venv\nsource venv/bin/activate # Windows: venv\\Scripts\\activate\n\n# Install with dev dependencies\npip install -e \".[dev]\"\n\n# Run tests\npytest\n\n# Format code\nblack .\nisort .\n```\n\n### Running Tests\n\n```bash\n# All tests\npytest\n\n# With coverage\npytest --cov=project_vectorizer\n\n# Specific test\npytest tests/test_config.py\n\n# Verbose\npytest -v\n```\n\nSee [docs/TESTING.md](docs/TESTING.md) for details.\n\n### Publishing\n\nSee [docs/PUBLISHING.md](docs/PUBLISHING.md) for PyPI publishing guide.\n\n### Contributing Guidelines\n\n1. Fork repository\n2. Create feature branch: `git checkout -b feature/amazing-feature`\n3. Make changes and add tests\n4. Ensure tests pass: `pytest`\n5. Format code: `black . && isort .`\n6. 
Commit: `git commit -m 'Add amazing feature'`\n7. Push: `git push origin feature/amazing-feature`\n8. Open Pull Request\n\n---\n\n## License\n\nMIT License - see [LICENSE](LICENSE) file\n\n---\n\n## Additional Resources\n\n- **GitHub**: https://github.com/starkbaknet/project-vectorizer\n- **PyPI**: https://pypi.org/project/project-vectorizer/\n- **Issues**: https://github.com/starkbaknet/project-vectorizer/issues\n\n---\n\n**Made with \u2764\ufe0f by StarkBakNet**\n\n_Vectorize your codebase. Empower your AI agents. Build better software._\n",