# REM Database
**Resources-Entities-Moments (REM):** High-performance embedded database for semantic search, graph queries, and structured data.
**Project Location:** `/Users/sirsh/code/percolation/percolate-rocks`
## Project Goals
Build a production-ready database combining:
- **Rust performance** - HNSW vector search (200x faster than a naive scan), native SQL execution (5-10x faster than Python)
- **Python ergonomics** - Pydantic models drive schemas, natural language queries
- **Zero impedance** - Pydantic `json_schema_extra` → automatic embeddings, indexing, validation
## Quick Start
### Installation
**From PyPI** (published v0.2.0):
```bash
pip install percolate-rocks
```
**From source**:
```bash
cd /Users/sirsh/code/percolation/percolate-rocks
maturin develop --release
```
See [`.release-notes/`](./.release-notes/) for release history.
### Building
This project supports two build modes:
**Python extension (default):**
```bash
# Build and install Python package
maturin develop
# Syntax check only (faster)
maturin develop --skip-install
# Note: cargo check/build with default features will fail - use maturin for the Python extension
```
**Standalone Rust library (no Python):**
```bash
# Build as pure Rust library (no Python bindings)
cargo check --lib --no-default-features
cargo build --lib --no-default-features --release
# Run tests without Python
cargo test --lib --no-default-features
# Use in other Rust projects
# Add to Cargo.toml:
# percolate-rocks = { version = "0.1", default-features = false }
```
### Testing cross-compilation locally
To ensure your local builds will work in CI/GitHub Actions, use Docker to replicate the CI environment:
```bash
# Test ARM64 Linux cross-compilation (what CI does)
docker run --rm -v "$(pwd):/workspace" -w /workspace rust:latest bash -c "
apt-get update && apt-get install -y gcc-aarch64-linux-gnu pkg-config libclang-dev
rustup target add aarch64-unknown-linux-gnu
cargo build --target aarch64-unknown-linux-gnu --release
"
```
**Why local builds might work but CI fails:**
| Environment | Rust Version | Target | OpenSSL | Why It Works |
|-------------|--------------|--------|---------|--------------|
| **Your Mac** | Latest (1.87+) | Native (aarch64-apple-darwin) | System OpenSSL (Homebrew) | Local system libs |
| **GitHub Actions** | Workflow-specified | Cross-compile (aarch64-unknown-linux-gnu) | Vendored (native-tls-vendored feature) | Must compile from source |
**Key differences:**
1. **Rust version**: Mac typically has latest via rustup, CI uses workflow-pinned version
2. **Cross-compilation**: Mac → Linux ARM64 requires vendored dependencies (no system libs available)
3. **Native TLS**: `reqwest` needs `native-tls-vendored` feature for cross-compilation
**If Docker build passes → CI will pass.** This is your local validation gate.
### Basic Workflow
Define your schema using Pydantic (in `models.py`):
```python
from pydantic import BaseModel, Field, ConfigDict

class Article(BaseModel):
    """Article resource for semantic search."""
    title: str = Field(description="Article title")
    content: str = Field(description="Full article content")
    category: str = Field(description="Content category")

    model_config = ConfigDict(
        json_schema_extra={
            "embedding_fields": ["content"],  # Auto-embed on insert
            "indexed_fields": ["category"],   # Fast WHERE queries
            "key_field": "title"              # Deterministic UUID
        }
    )
```
Use the CLI to work with your data:
```bash
# 1. Generate encryption key (for encryption at rest)
rem key-gen --password "strong_master_password"
# Generates Ed25519 key pair and stores encrypted at ~/.p8/keys/
# 2. Initialize database (defaults to ~/.p8/db/)
rem init
# Or specify custom path
rem init --path ./data
# With encryption at rest (optional)
rem init --path ./data --password "strong_master_password"
# 3. Register schema (JSON/YAML preferred, Python also supported)
rem schema add schema.json # Preferred: pure JSON Schema
rem schema add schema.yaml # Preferred: YAML format
rem schema add models.py::Article # Also supported: Pydantic model
# Or create from template
rem schema add --name my_docs --template resources # Clone resources schema
# 4. Batch upsert articles (single embedding API call)
cat articles.jsonl | rem insert articles --batch
# 5. Semantic search (HNSW index)
rem search "fast programming languages" --schema=articles --top-k=5
# 6. SQL queries (indexed)
rem query "SELECT * FROM articles WHERE category = 'programming'"
```
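The same workflow is available from Python. A minimal sketch, assuming the `rem_db` package (see Project Structure) exposes a `Database` class with `insert`, `search`, and `query` methods; the method names here are illustrative, not the confirmed API:

```python
from rem_db import Database  # package path from Project Structure; API names assumed
from models import Article   # the Pydantic schema defined above

db = Database("~/.p8/db")  # open (or create) the database

# Upsert one article; configured embedding fields are generated on insert
db.insert("articles", Article(
    title="Why Rust?",
    content="Rust offers memory safety without garbage collection...",
    category="programming",
).model_dump())

# Semantic search over the HNSW index
hits = db.search("fast programming languages", schema="articles", top_k=5)

# Indexed SQL query
rows = db.query("SELECT * FROM articles WHERE category = 'programming'")
```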
## CLI Commands
### Setup and Schema Management
| Command | Description | Example |
|---------|-------------|---------|
| `rem key-gen` | Generate encryption key pair (Ed25519) | `rem key-gen --password "strong_password"` (saves to `~/.p8/keys/`) |
| `rem init` | Initialize database (default: `~/.p8/db/`) | `rem init` or `rem init --path ./data` or `rem init --password "..."` (encryption at rest) |
| `rem schema add <file>` | Register schema (JSON/YAML preferred) | `rem schema add schema.json` or `rem schema add models.py::Article` |
| `rem schema add --name <name> --template <template>` | Create schema from built-in template | `rem schema add --name my_docs --template resources` |
| `rem schema list` | List registered schemas | `rem schema list` |
| `rem schema show <name>` | Show schema definition | `rem schema show articles` |
| `rem schema templates` | List available templates | `rem schema templates` |
**Schema template workflow:**
```bash
# List available templates
rem schema templates
# Output:
# Available schema templates:
# - resources: Chunked documents with embeddings (URI-based)
# - entities: Generic structured data (name-based)
# - agentlets: AI agent definitions (with tools/resources)
# - moments: Temporal classifications (time-range queries)
# Create new schema from template
rem schema add --name my_documents --template resources
# This creates and registers:
# - Schema name: my_documents
# - Clones all fields from resources template
# - Updates fully_qualified_name: "user.my_documents"
# - Updates short_name: "my_documents"
# - Preserves embedding/indexing configuration
# Customize the generated schema (optional)
rem schema show my_documents > my_documents.json
# Edit my_documents.json
rem schema add my_documents.json # Re-register with changes
# Or save to file without registering
rem schema add --name my_docs --template resources --output my_docs.yaml
# Edit my_docs.yaml
rem schema add my_docs.yaml # Register when ready
```
**Built-in templates:**
| Template | Use Case | Key Fields | Configuration |
|----------|----------|------------|---------------|
| `resources` | Documents, articles, PDFs | `name`, `content`, `uri`, `chunk_ordinal` | Embeds `content`, indexes `content_type`, key: `uri` |
| `entities` | Generic structured data | `name`, `key`, `properties` | Indexes `name`, key: `name` |
| `agentlets` | AI agent definitions | `description`, `tools`, `resources` | Embeds `description`, includes MCP config |
| `moments` | Temporal events | `name`, `start_time`, `end_time`, `classifications` | Indexes `start_time`, `end_time` |
**Example: Creating custom document schema**
```bash
# Start with resources template
rem schema add --name technical_docs --template resources --output technical_docs.yaml
# Edit technical_docs.yaml to add custom fields:
# - difficulty_level: enum["beginner", "intermediate", "advanced"]
# - language: string
# - code_examples: array[object]
# Register customized schema
rem schema add technical_docs.yaml
# Insert documents
cat docs.jsonl | rem insert technical_docs --batch
```
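The same customization can also be expressed as a Pydantic model instead of editing YAML. A sketch of the suggested fields on top of the base fields cloned from the `resources` template (the model itself is illustrative):

```python
from typing import Literal
from pydantic import BaseModel, Field, ConfigDict

class TechnicalDoc(BaseModel):
    """resources template plus the custom fields suggested above (illustrative)."""
    name: str = Field(description="Document name")
    content: str = Field(description="Document content")
    uri: str = Field(description="Source URI")
    difficulty_level: Literal["beginner", "intermediate", "advanced"]
    language: str = Field(description="Programming language covered")
    code_examples: list[dict] = Field(default_factory=list)

    model_config = ConfigDict(
        json_schema_extra={
            "embedding_fields": ["content"],  # as in the resources template
            "key_field": "uri",
        }
    )
```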
### Data Operations
| Command | Description | Example |
|---------|-------------|---------|
| `rem insert <table> <json>` | Insert entity | `rem insert articles '{"title": "..."}'` |
| `rem insert <table> --batch` | Batch insert from stdin | `cat data.jsonl \| rem insert articles --batch` |
| `rem ingest <file>` | Upload and chunk file | `rem ingest tutorial.pdf --schema=articles` |
| `rem get <uuid>` | Get entity by ID | `rem get 550e8400-...` |
| `rem lookup <key>` | Global key lookup | `rem lookup "Python Guide"` |
### Search and Queries
| Command | Description | Example |
|---------|-------------|---------|
| `rem search <query>` | Semantic search | `rem search "async programming" --schema=articles` |
| `rem query "<SQL>"` | SQL query | `rem query "SELECT * FROM articles WHERE category = 'tutorial'"` |
| `rem ask "<question>"` | Natural language query (executes) | `rem ask "show recent programming articles"` |
| `rem ask "<question>" --plan` | Show query plan without executing | `rem ask "show recent articles" --plan` |
| `rem traverse <uuid>` | Graph traversal | `rem traverse <id> --depth=2 --direction=out` |
**Natural language query examples:**
```bash
# Execute query immediately
rem ask "show recent programming articles"
# Output: Query results as JSON
# Show query plan without executing (LLM response only)
rem ask "show recent programming articles" --plan
# Output:
# {
# "confidence": 0.95,
# "query": "SELECT * FROM articles WHERE category = 'programming' ORDER BY created_at DESC LIMIT 10",
# "reasoning": "User wants recent articles filtered by programming category",
# "requires_search": false
# }
# Complex query with semantic search
rem ask "find articles about Rust performance optimization" --plan
# Output:
# {
# "confidence": 0.85,
# "query": "SEARCH articles 'Rust performance optimization' LIMIT 10",
# "reasoning": "Semantic search needed for conceptual similarity",
# "requires_search": true
# }
```
### Export and Analytics
| Command | Description | Example |
|---------|-------------|---------|
| `rem export <table>` | Export to Parquet | `rem export articles --output ./data.parquet` |
| `rem export --all` | Export all schemas | `rem export --all --output ./exports/` |
### REM Dreaming (Background Intelligence)
| Command | Description | Example |
|---------|-------------|---------|
| `rem dream` | Run dreaming with default lookback (24h) | `rem dream` |
| `rem dream --lookback-hours <N>` | Custom lookback window | `rem dream --lookback-hours 168` (weekly) |
| `rem dream --dry-run` | Show what would be generated | `rem dream --dry-run --verbose` |
| `rem dream --llm <model>` | Specify LLM provider | `rem dream --llm gpt-4-turbo` |
| `rem dream --start <date> --end <date>` | Specific date range | `rem dream --start "2025-10-20" --end "2025-10-25"` |
**REM Dreaming** uses LLMs to analyze your activity in the background and generate:
- **Moments**: Temporal classifications of what you were working on (with emotions, topics, outcomes)
- **Summaries**: Period recaps and key insights
- **Graph edges**: Automatic connections between related resources and sessions
- **Ontological maps**: Topic relationships and themes
See [`docs/rem-dreaming.md`](docs/rem-dreaming.md) for detailed documentation.
## Core System Schemas
REM Database includes three core schemas for tracking user activity:
### Sessions
**Purpose:** Track conversation sessions with AI agents.
**Key fields:**
- `id` (UUID) - Session identifier
- `case_id` (UUID) - Optional link to project/case
- `user_id` (string) - User identifier
- `metadata` (object) - Session context
**Schema:** [`schema/core/sessions.json`](schema/core/sessions.json)
### Messages
**Purpose:** Individual messages within sessions (user queries, AI responses, tool calls).
**Key fields:**
- `session_id` (UUID) - Parent session
- `role` (enum) - user | assistant | system | tool
- `content` (string) - Message content (embedded for search)
- `tool_calls` (array) - Tool invocations
- `trace_id`, `span_id` (string) - Observability
**Schema:** [`schema/core/messages.json`](schema/core/messages.json)
### Moments
**Purpose:** Temporal classifications generated by REM Dreaming.
**Key fields:**
- `name` (string) - Moment title
- `summary` (string) - Activity description
- `start_time`, `end_time` (datetime) - Time bounds
- `moment_type` (enum) - work_session | learning | planning | communication | reflection | creation
- `tags` (array) - Topic tags (e.g., ["rust", "database", "performance"])
- `emotion_tags` (array) - Emotion/tone tags (e.g., ["focused", "productive"])
- `people` (array) - People mentioned
- `resource_ids`, `session_ids` (arrays) - Related entities
**Schema:** [`schema/core/moments.json`](schema/core/moments.json)
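For orientation, a moment record conforming to these fields might look like this (all values illustrative):

```python
moment = {
    "name": "Deep work on HNSW tuning",
    "summary": "Benchmarked recall/latency trade-offs for the vector index.",
    "start_time": "2025-10-24T09:00:00Z",
    "end_time": "2025-10-24T11:30:00Z",
    "moment_type": "work_session",
    "tags": ["rust", "database", "performance"],
    "emotion_tags": ["focused", "productive"],
    "people": [],
    "resource_ids": ["550e8400-..."],
    "session_ids": [],
}
```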
**These schemas are registered automatically on `rem init`.**
## Peer Replication Testing
REM supports primary/replica replication via WAL and gRPC streaming.
### Terminal 1: Primary Node
```bash
# Start primary with WAL enabled
export P8_REPLICATION_MODE=primary
export P8_REPLICATION_PORT=50051
export P8_WAL_ENABLED=true
export P8_DB_PATH=./data/primary # Override default ~/.p8/db/
rem init
# Register schema (JSON/YAML preferred)
rem schema add schema.json
# Start replication server
rem serve --host 0.0.0.0 --port 50051
# Insert data (will be replicated)
rem insert articles '{"title": "Doc 1", "content": "Test replication", "category": "test"}'
# Check WAL status
rem replication wal-status
# Output:
# WAL sequence: 1
# Entries: 1
# Size: 512 bytes
```
### Terminal 2: Replica 1
```bash
# Start replica pointing to primary
export P8_REPLICATION_MODE=replica
export P8_PRIMARY_HOST=localhost:50051
export P8_DB_PATH=./data/replica1 # Override default ~/.p8/db/
rem init
# Connect and sync from primary
rem replicate --primary=localhost:50051 --follow
# Check replication status
rem replication status
# Output:
# Mode: replica
# Primary: localhost:50051
# WAL position: 1
# Lag: 2ms
# Status: synced
# Query replica (read-only)
rem query "SELECT * FROM articles"
# Output: Same data as primary
```
### Terminal 3: Replica 2
```bash
export P8_REPLICATION_MODE=replica
export P8_PRIMARY_HOST=localhost:50051
export P8_DB_PATH=./data/replica2 # Override default ~/.p8/db/
rem init
rem replicate --primary=localhost:50051 --follow
# Verify sync
rem query "SELECT COUNT(*) FROM articles"
# Output: 1
```
### Testing Failover
**Terminal 1: Simulate Primary Failure**
```bash
^C # Stop primary server
```
**Terminal 2: Replica Behavior During Outage**
```bash
# Replica continues serving reads
rem query "SELECT * FROM articles"
# Output: Cached data still available
# Check status
rem replication status
# Output:
# Status: disconnected
# Last sync: 45s ago
# Buffered writes: 0 (read-only)
```
**Terminal 1: Primary Restart**
```bash
# Restart primary and insert new data
rem serve --host 0.0.0.0 --port 50051
rem insert articles '{"title": "Doc 2", "content": "After restart", "category": "test"}'
```
**Terminal 2: Automatic Catchup**
```bash
# Replica auto-reconnects and syncs
rem replication status
# Output:
# Status: synced
# Catchup: completed (1 entry, 50ms)
# Lag: 3ms
# Verify new data
rem query "SELECT title FROM articles ORDER BY created_at DESC LIMIT 1"
# Output: Doc 2
```
## Key Implementation Conventions
### REM Principle
**Resources-Entities-Moments** is a unified data model, not separate storage:
- **Resources**: Chunked documents with embeddings → semantic search (HNSW)
- **Entities**: Structured data → SQL queries (indexed fields)
- **Moments**: Temporal classifications → time-range queries
All stored as **entities** in RocksDB. Conceptual distinction only.
### Pydantic-Driven Everything
Configuration flows from `json_schema_extra`:
NB: While metadata is supported in model-level config, individual fields can also carry properties such as `key_field` and `embedding_provider` in their own `json_schema_extra`; the field-level form is preferred.
```python
model_config = ConfigDict(
    json_schema_extra={
        "embedding_fields": ["content"],  # → Auto-embed on insert
        "indexed_fields": ["category"],   # → RocksDB index CF
        "key_field": "title"              # → Deterministic UUID
    }
)
```
NB: Schemas can also be defined in Rust via equivalent model structs or raw JSON Schema, but everything is driven by Pydantic-aware semantics of the JSON Schema format.
### Deterministic UUIDs (Idempotent Inserts)
NB: Default precedence is `uri` -> `key` -> `name`, unless a `key_field` is specified in the schema config (see table).
| Priority | Field | UUID Generation |
|----------|-------|-----------------|
| 1 | `uri` | `blake3(entity_type + uri + chunk_ordinal)` |
| 2 | `json_schema_extra.key_field` | `blake3(entity_type + value)` |
| 3 | `key` | `blake3(entity_type + key)` |
| 4 | `name` | `blake3(entity_type + name)` |
| 5 | (fallback) | `UUID::v4()` (random) |
Same key → same UUID → upsert semantics.
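A minimal sketch of the derivation, assuming the UUID is taken from the first 16 bytes of the blake3 digest and that the hash inputs are joined with a separator (both are assumptions about the exact layout):

```python
import uuid
from blake3 import blake3  # pip install blake3

def deterministic_id(entity_type: str, value: str) -> uuid.UUID:
    """Stable UUID from entity_type + key value, so re-inserts upsert."""
    digest = blake3(f"{entity_type}:{value}".encode()).digest()
    return uuid.UUID(bytes=digest[:16])  # first 16 bytes as UUID (assumed layout)

# Same inputs, same UUID: idempotent inserts
assert deterministic_id("articles", "Python Guide") == \
       deterministic_id("articles", "Python Guide")
```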
### System Fields (Always Auto-Added)
**Never** define these in Pydantic models; the database always adds them (illustrative record below):
- `id` (UUID) - Deterministic or random
- `entity_type` (string) - Schema/table name
- `created_at`, `modified_at`, `deleted_at` (ISO 8601) - Timestamps
- `edges` (array[string]) - Graph relationships
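Together with the user-defined fields, a stored `Article` record therefore looks roughly like this (values illustrative):

```python
stored = {
    # User-defined fields
    "title": "Why Rust?",
    "content": "Rust offers memory safety...",
    "category": "programming",
    # System fields added by the database
    "id": "550e8400-...",                 # deterministic (key_field = title)
    "entity_type": "articles",
    "created_at": "2025-10-26T21:04:15Z",
    "modified_at": "2025-10-26T21:04:15Z",
    "deleted_at": None,
    "edges": [],
}
```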
### Embedding Fields (Conditionally Added)
**Not system fields** - only added when configured:
- `embedding` (array[float32]) - Added if `embedding_fields` in `json_schema_extra`
- `embedding_alt` (array[float32]) - Added if `P8_ALT_EMBEDDING` environment variable set
```python
# Configuration that triggers embedding generation:
model_config = ConfigDict(
    json_schema_extra={
        "embedding_fields": ["content"],   # → Adds "embedding" field
        "embedding_provider": "default"    # → Uses P8_DEFAULT_EMBEDDING
    }
)
```
### Encryption at Rest
Optional encryption at rest using **Ed25519 key pairs** and **ChaCha20-Poly1305 AEAD**:
1. **Generate key pair** (one-time setup):
```bash
rem key-gen --password "strong_master_password"
# Stores encrypted key at ~/.p8/keys/private_key_encrypted
# Stores public key at ~/.p8/keys/public_key
```
2. **Initialize database with encryption**:
```bash
rem init --password "strong_master_password"
# All entity data encrypted before storage
# Transparent encryption/decryption on get/put
```
3. **Sharing across tenants** (future):
- Encrypt data with recipient's **public key** (X25519 ECDH)
- End-to-end encryption - even database admin cannot read shared data
4. **Device-to-device sync** (future):
- WAL entries encrypted before gRPC transmission
- Defense in depth: mTLS (transport) + encrypted WAL (application layer)
**Key security properties:**
- Private key **never leaves device unencrypted**
- Password-derived key using **Argon2** KDF
- **ChaCha20-Poly1305** AEAD for data encryption
- Public key stored unencrypted for sharing capabilities
See `docs/encryption-architecture.md` for complete design.
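A minimal sketch of the data-encryption path under these primitives, using the `argon2-cffi` and `cryptography` packages (cost parameters and the nonce-prefix layout are assumptions; the real design lives in the architecture doc):

```python
import os
from argon2.low_level import hash_secret_raw, Type
from cryptography.hazmat.primitives.ciphers.aead import ChaCha20Poly1305

def derive_key(password: str, salt: bytes) -> bytes:
    """Password -> 32-byte key via Argon2id (illustrative cost parameters)."""
    return hash_secret_raw(
        secret=password.encode(), salt=salt,
        time_cost=3, memory_cost=64 * 1024, parallelism=4,
        hash_len=32, type=Type.ID,
    )

def encrypt_entity(key: bytes, plaintext: bytes) -> bytes:
    """ChaCha20-Poly1305 AEAD; nonce prepended to ciphertext (assumed layout)."""
    nonce = os.urandom(12)
    return nonce + ChaCha20Poly1305(key).encrypt(nonce, plaintext, None)

def decrypt_entity(key: bytes, blob: bytes) -> bytes:
    nonce, ciphertext = blob[:12], blob[12:]
    return ChaCha20Poly1305(key).decrypt(nonce, ciphertext, None)
```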
### Column Families (Performance)
| CF | Purpose | Speedup vs Scan |
|----|---------|-----------------|
| `key_index` | Reverse key lookup | O(log n) vs O(n) |
| `edges` + `edges_reverse` | Bidirectional graph | 20x faster |
| `embeddings` (binary) | Vector storage | 3x compression |
| `indexes` | Indexed fields | 10-50x faster |
| `keys` | Encrypted tenant keys | - |
### HNSW Vector Index
Rust HNSW index provides **200x speedup** over naive Python scan:
- Python naive: ~1000ms for 1M documents
- Rust HNSW: ~5ms for 1M documents
This is the **primary reason** for Rust implementation.
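For context, the naive baseline behind that 200x figure is a brute-force scan over every stored vector, roughly (numpy sketch):

```python
import numpy as np

def naive_search(query: np.ndarray, vectors: np.ndarray, top_k: int = 5):
    """O(n*d) cosine-similarity scan; this is what HNSW's graph search replaces."""
    sims = vectors @ query / (
        np.linalg.norm(vectors, axis=1) * np.linalg.norm(query)
    )
    return np.argsort(-sims)[:top_k]  # indices of the top_k most similar docs
```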
## Performance Targets
| Operation | Target | Why Rust? |
|-----------|--------|-----------|
| Insert (no embedding) | < 1ms | RocksDB + zero-copy |
| Insert (with embedding) | < 50ms | Network-bound (OpenAI) |
| Get by ID | < 0.1ms | Single RocksDB get |
| Vector search (1M docs) | < 5ms | **HNSW (vs 1000ms naive)** |
| SQL query (indexed) | < 10ms | **Native execution (vs 50ms Python)** |
| Graph traversal (3 hops) | < 5ms | **Bidirectional CF (vs 100ms scan)** |
| Batch insert (1000 docs) | < 500ms | Batched embeddings |
| Parquet export (100k rows) | < 2s | **Parallel encoding** |
NB: We generally work in batches (batch upserts and batch embeddings); never make individual requests when batching is possible. A sketch of the batching discipline follows.
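A sketch of that discipline for a JSONL stream (the batch size and the batch-insert call are illustrative):

```python
import json
import sys
from itertools import islice

BATCH_SIZE = 256  # illustrative; tune to the embedding provider's limits

def batches(lines, size=BATCH_SIZE):
    """Group a JSONL stream into fixed-size batches of parsed records."""
    it = iter(lines)
    while chunk := list(islice(it, size)):
        yield [json.loads(line) for line in chunk]

# One embedding request and one upsert per batch, never per document
for batch in batches(sys.stdin):
    pass  # e.g. db.insert_batch("articles", batch)  (assumed batch API)
```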
## Environment Configuration
```bash
# Core
export P8_HOME=~/.p8
export P8_DB_PATH=$P8_HOME/db
# Embeddings
export P8_DEFAULT_EMBEDDING=local:all-MiniLM-L6-v2
export P8_OPENAI_API_KEY=sk-... # For OpenAI embeddings
# LLM (natural language queries)
export P8_DEFAULT_LLM=gpt-4.1
export P8_OPENAI_API_KEY=sk-...
# RocksDB tuning
export P8_ROCKSDB_WRITE_BUFFER_SIZE=67108864 # 64MB
export P8_ROCKSDB_MAX_BACKGROUND_JOBS=4
export P8_ROCKSDB_COMPRESSION=lz4
# Replication
export P8_REPLICATION_MODE=primary # or replica
export P8_PRIMARY_HOST=localhost:50051 # For replicas
export P8_WAL_ENABLED=true
```
See [CLAUDE.md](./CLAUDE.md) for full list.
## Project Structure
```
percolate-rocks/ # Clean implementation
├── Cargo.toml # Rust dependencies
├── pyproject.toml # Python package (maturin)
├── README.md # This file
├── CLAUDE.md # Implementation guide
│
├── src/ # Rust implementation (~3000 lines target)
│ ├── lib.rs # PyO3 module definition (30 lines)
│ │
│ ├── types/ # Core data types (120 lines)
│ │ ├── mod.rs # Re-exports
│ │ ├── entity.rs # Entity, Edge structs
│ │ ├── error.rs # Error types (thiserror)
│ │ └── result.rs # Type aliases
│ │
│ ├── storage/ # RocksDB wrapper (400 lines)
│ │ ├── mod.rs # Re-exports
│ │ ├── db.rs # Storage struct + open
│ │ ├── keys.rs # Key encoding functions
│ │ ├── batch.rs # Batch writer
│ │ ├── iterator.rs # Prefix iterator
│ │ └── column_families.rs # CF constants + setup
│ │
│ ├── index/ # Indexing layer (310 lines)
│ │ ├── mod.rs # Re-exports
│ │ ├── hnsw.rs # HNSW vector index
│ │ ├── fields.rs # Indexed fields
│ │ └── keys.rs # Key index (reverse lookup)
│ │
│ ├── query/ # Query execution (260 lines)
│ │ ├── mod.rs # Re-exports
│ │ ├── parser.rs # SQL parser
│ │ ├── executor.rs # Query executor
│ │ ├── predicates.rs # Predicate evaluation
│ │ └── planner.rs # Query planner
│ │
│ ├── embeddings/ # Embedding providers (200 lines)
│ │ ├── mod.rs # Re-exports
│ │ ├── provider.rs # Provider trait + factory
│ │ ├── local.rs # Local models (fastembed)
│ │ ├── openai.rs # OpenAI API client
│ │ └── batch.rs # Batch embedding operations
│ │
│ ├── schema/ # Schema validation (160 lines)
│ │ ├── mod.rs # Re-exports
│ │ ├── registry.rs # Schema registry
│ │ ├── validator.rs # JSON Schema validation
│ │ └── pydantic.rs # Pydantic json_schema_extra parser
│ │
│ ├── graph/ # Graph operations (130 lines)
│ │ ├── mod.rs # Re-exports
│ │ ├── edges.rs # Edge CRUD
│ │ └── traversal.rs # BFS/DFS traversal
│ │
│ ├── replication/ # Replication engine (400 lines)
│ │ ├── mod.rs # Re-exports
│ │ ├── wal.rs # Write-ahead log
│ │ ├── primary.rs # Primary node (gRPC server)
│ │ ├── replica.rs # Replica node (gRPC client)
│ │ ├── protocol.rs # gRPC protocol definitions
│ │ └── sync.rs # Sync state machine
│ │
│ ├── export/ # Export formats (200 lines)
│ │ ├── mod.rs # Re-exports
│ │ ├── parquet.rs # Parquet writer
│ │ ├── csv.rs # CSV writer
│ │ └── jsonl.rs # JSONL writer
│ │
│ ├── ingest/ # Document ingestion (180 lines)
│ │ ├── mod.rs # Re-exports
│ │ ├── chunker.rs # Document chunking
│ │ ├── pdf.rs # PDF parser
│ │ └── text.rs # Text chunking
│ │
│ ├── llm/ # LLM query builder (150 lines)
│ │ ├── mod.rs # Re-exports
│ │ ├── query_builder.rs # Natural language → SQL
│ │ └── planner.rs # Query plan generation
│ │
│ └── bindings/ # PyO3 Python bindings (300 lines)
│ ├── mod.rs # Re-exports
│ ├── database.rs # Database wrapper (main API)
│ ├── types.rs # Type conversions (Python ↔ Rust)
│ ├── errors.rs # Error conversions
│ └── async_ops.rs # Async operation wrappers
│
├── python/ # Python package (~800 lines target)
│ └── rem_db/
│ ├── __init__.py # Public API (thin wrapper over Rust)
│ ├── cli.py # Typer CLI (delegates to Rust)
│ ├── models.py # Built-in Pydantic schemas
│ └── async_api.py # Async wrapper utilities
│
└── tests/
├── rust/ # Rust integration tests
│ ├── test_crud.rs
│ ├── test_search.rs
│ ├── test_graph.rs
│ ├── test_replication.rs
│ └── test_export.rs
│
└── python/ # Python integration tests
├── test_api.py
├── test_cli.py
├── test_async.py
└── test_end_to_end.py
```
**Key Design Notes:**
1. **Rust Core (~3000 lines in ~40 files)**: All performance-critical operations in Rust
- Average 75 lines per file
- Max 150 lines per file
- Single responsibility per module
2. **Python Bindings (bindings/)**: Thin PyO3 layer
- Database wrapper exposes high-level API
- Type conversions between Python dict/list ↔ Rust structs
- Error conversions for Python exceptions
- Async operation wrappers (tokio → asyncio)
- **No business logic** - pure translation layer
3. **Python Package (python/)**: Minimal orchestration
- CLI delegates to Rust immediately
- Public API is thin wrapper (`db._rust_insert()`)
- Pydantic models define schemas, Rust validates/stores
- Async utilities for Python async/await ergonomics
4. **Replication Module**: Primary/replica peer replication
- WAL (write-ahead log) for durability
- gRPC streaming for real-time sync
- Automatic catchup after disconnection
- Read-only replica mode
5. **Export Module**: Analytics-friendly formats
- Parquet with ZSTD compression
- CSV for spreadsheets
- JSONL for streaming/batch processing
6. **LLM Module**: Natural language query interface
- Convert questions → SQL/SEARCH queries
- Query plan generation (`--plan` flag)
- Confidence scoring
7. **Test Organization**: Separation of unit and integration tests
**Rust Tests:**
- **Unit tests**: Inline with implementation using `#[cfg(test)]` modules
```rust
// src/storage/keys.rs
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_encode_entity_key() {
        let key = encode_entity_key(uuid);
        assert!(key.starts_with(b"entity:"));
    }
}
```
- **Integration tests**: In `tests/rust/` directory
- Test full workflows across modules
- Require actual RocksDB instance
- May be slower (acceptable up to 10s per test)
**Python Tests:**
- **Unit tests**: NOT APPLICABLE (Python layer is thin wrapper)
- **Integration tests**: In `tests/python/` directory
- Test PyO3 bindings (Python ↔ Rust type conversions)
- Test CLI commands end-to-end
- Test async/await ergonomics
- Require Rust library to be built
**Running Tests:**
```bash
# Rust unit tests (fast, inline with code)
cargo test --lib
# Rust integration tests (slower, requires RocksDB)
cargo test --test '*'
# Python integration tests (requires maturin build)
maturin develop
pytest tests/python/
# All tests
cargo test && pytest tests/python/
```
**Coverage Targets:**
- Rust: 80%+ coverage (critical path)
- Python: 90%+ coverage (thin wrapper, easy to test)
## Development
### Pre-Build Checks
```bash
# Check compilation (fast, no binary output)
cargo check
# Format check (without modifying files)
cargo fmt --check
# Linting with clippy
cargo clippy --all-targets --all-features
# Security audit (requires: cargo install cargo-audit)
cargo audit
# Check for outdated dependencies (requires: cargo install cargo-outdated)
cargo outdated
```
### Building
```bash
# Development build (unoptimized, fast compile)
cargo build
# Release build (optimized, slower compile)
cargo build --release
# Python extension development install (editable)
maturin develop
# Python extension release wheel
maturin build --release
```
### Testing
```bash
# Rust unit tests
cargo test
# Rust unit tests with output
cargo test -- --nocapture
# Python integration tests (requires maturin develop first)
pytest
# Python tests with verbose output
pytest -v
# Run specific test
cargo test test_name
```
### Code Quality
```bash
# Auto-format code
cargo fmt
# Run clippy linter
cargo clippy --all-targets
# Fix clippy warnings automatically (where possible)
cargo clippy --fix
# Check for unused dependencies
cargo machete # requires: cargo install cargo-machete
```
### Benchmarks
```bash
# Run all benchmarks
cargo bench
# Run specific benchmark
cargo bench vector_search
```
### Development Workflow
```bash
# 1. Make changes to Rust code
# 2. Check compilation
cargo check
# 3. Run tests
cargo test
# 4. Build Python extension
maturin develop
# 5. Test Python integration
pytest
```
## References
- **Specification**: See `db-specification-v0.md` in `-ref` folder
- **Python spike**: `../rem-db` (100% features, production-ready)
- **Old Rust spike**: `../percolate-rocks-ref` (~40% features)
- **Implementation guide**: [CLAUDE.md](./CLAUDE.md)