# REM Database
**Resources-Entities-Moments (REM):** High-performance embedded database for semantic search, graph queries, and structured data.
**Project Location:** `/Users/sirsh/code/percolation/percolate-rocks`
## Project Goals
Build a production-ready database combining:
- **Rust performance** - HNSW vector search (200x faster than a naive scan), native SQL execution (5-10x faster than Python)
- **Python ergonomics** - Pydantic models drive schemas, natural language queries
- **Zero impedance** - Pydantic `json_schema_extra` → automatic embeddings, indexing, validation
## Quick Start
### Installation
**From PyPI** (published v0.2.0):
```bash
pip install percolate-rocks
```
**From source**:
```bash
cd /Users/sirsh/code/percolation/percolate-rocks
maturin develop --release
```
See [`.release-notes/`](./.release-notes/) for release history.
### Building
This project supports two build modes:
**Python extension (default):**
```bash
# Build and install Python package
maturin develop
# Syntax check only (faster)
maturin develop --skip-install
# Note: cargo check/build with default features will fail - use maturin for the Python extension
```
**Standalone Rust library (no Python):**
```bash
# Build as pure Rust library (no Python bindings)
cargo check --lib --no-default-features
cargo build --lib --no-default-features --release
# Run tests without Python
cargo test --lib --no-default-features
# Use in other Rust projects
# Add to Cargo.toml:
# percolate-rocks = { version = "0.1", default-features = false }
```
### Testing cross-compilation locally
To ensure your local builds will work in CI/GitHub Actions, use Docker to replicate the CI environment:
```bash
# Test ARM64 Linux cross-compilation (what CI does)
docker run --rm -v "$(pwd):/workspace" -w /workspace rust:latest bash -c "
apt-get update && apt-get install -y gcc-aarch64-linux-gnu pkg-config libclang-dev
rustup target add aarch64-unknown-linux-gnu
cargo build --target aarch64-unknown-linux-gnu --release
"
```
**Why local builds might work but CI fails:**
| Environment | Rust Version | Target | OpenSSL | Why It Works |
|-------------|--------------|--------|---------|--------------|
| **Your Mac** | Latest (1.87+) | Native (aarch64-apple-darwin) | System OpenSSL (Homebrew) | Local system libs |
| **GitHub Actions** | Workflow-specified | Cross-compile (aarch64-unknown-linux-gnu) | Vendored (native-tls-vendored feature) | Must compile from source |
**Key differences:**
1. **Rust version**: Mac typically has latest via rustup, CI uses workflow-pinned version
2. **Cross-compilation**: Mac → Linux ARM64 requires vendored dependencies (no system libs available)
3. **Native TLS**: `reqwest` needs `native-tls-vendored` feature for cross-compilation
**If Docker build passes → CI will pass.** This is your local validation gate.
### Basic Workflow
Define your schema using Pydantic (in `models.py`):
```python
from pydantic import BaseModel, Field, ConfigDict

class Article(BaseModel):
    """Article resource for semantic search."""
    title: str = Field(description="Article title")
    content: str = Field(description="Full article content")
    category: str = Field(description="Content category")

    model_config = ConfigDict(
        json_schema_extra={
            "embedding_fields": ["content"],  # Auto-embed on insert
            "indexed_fields": ["category"],   # Fast WHERE queries
            "key_field": "title"              # Deterministic UUID
        }
    )
```
Use the CLI to work with your data:
```bash
# 1. Generate encryption key (for encryption at rest)
rem key-gen --password "strong_master_password"
# Generates Ed25519 key pair and stores encrypted at ~/.p8/keys/
# 2. Initialize database (defaults to ~/.p8/db/)
rem init
# Or specify custom path
rem init --path ./data
# With encryption at rest (optional)
rem init --path ./data --password "strong_master_password"
# 3. Register schema (JSON/YAML preferred, Python also supported)
rem schema add schema.json # Preferred: pure JSON Schema
rem schema add schema.yaml # Preferred: YAML format
rem schema add models.py::Article # Also supported: Pydantic model
# Or create from template
rem schema add --name my_docs --template resources # Clone resources schema
# 4. Batch upsert articles (single embedding API call)
cat articles.jsonl | rem insert articles --batch
# 5. Semantic search (HNSW index)
rem search "fast programming languages" --schema=articles --top-k=5
# 6. SQL queries (indexed)
rem query "SELECT * FROM articles WHERE category = 'programming'"
```
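The same workflow is available from Python. A minimal sketch, assuming the `rem_db` package (see Project Structure) exposes a `Database` class with `insert`, `search`, and `query` methods; the method names here are illustrative, not the confirmed API:

```python
from rem_db import Database  # package path from Project Structure; API names assumed
from models import Article   # the Pydantic schema defined above

db = Database("~/.p8/db")  # open (or create) the database

# Upsert one article; configured embedding fields are generated on insert
db.insert("articles", Article(
    title="Why Rust?",
    content="Rust offers memory safety without garbage collection...",
    category="programming",
).model_dump())

# Semantic search over the HNSW index
hits = db.search("fast programming languages", schema="articles", top_k=5)

# Indexed SQL query
rows = db.query("SELECT * FROM articles WHERE category = 'programming'")
```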
## CLI Commands
### Setup and Schema Management
| Command | Description | Example |
|---------|-------------|---------|
| `rem key-gen` | Generate encryption key pair (Ed25519) | `rem key-gen --password "strong_password"` (saves to `~/.p8/keys/`) |
| `rem init` | Initialize database (default: `~/.p8/db/`) | `rem init` or `rem init --path ./data` or `rem init --password "..."` (encryption at rest) |
| `rem schema add <file>` | Register schema (JSON/YAML preferred) | `rem schema add schema.json` or `rem schema add models.py::Article` |
| `rem schema add --name <name> --template <template>` | Create schema from built-in template | `rem schema add --name my_docs --template resources` |
| `rem schema list` | List registered schemas | `rem schema list` |
| `rem schema show <name>` | Show schema definition | `rem schema show articles` |
| `rem schema templates` | List available templates | `rem schema templates` |
**Schema template workflow:**
```bash
# List available templates
rem schema templates
# Output:
# Available schema templates:
# - resources: Chunked documents with embeddings (URI-based)
# - entities: Generic structured data (name-based)
# - agentlets: AI agent definitions (with tools/resources)
# - moments: Temporal classifications (time-range queries)
# Create new schema from template
rem schema add --name my_documents --template resources
# This creates and registers:
# - Schema name: my_documents
# - Clones all fields from resources template
# - Updates fully_qualified_name: "user.my_documents"
# - Updates short_name: "my_documents"
# - Preserves embedding/indexing configuration
# Customize the generated schema (optional)
rem schema show my_documents > my_documents.json
# Edit my_documents.json
rem schema add my_documents.json # Re-register with changes
# Or save to file without registering
rem schema add --name my_docs --template resources --output my_docs.yaml
# Edit my_docs.yaml
rem schema add my_docs.yaml # Register when ready
```
**Built-in templates:**
| Template | Use Case | Key Fields | Configuration |
|----------|----------|------------|---------------|
| `resources` | Documents, articles, PDFs | `name`, `content`, `uri`, `chunk_ordinal` | Embeds `content`, indexes `content_type`, key: `uri` |
| `entities` | Generic structured data | `name`, `key`, `properties` | Indexes `name`, key: `name` |
| `agentlets` | AI agent definitions | `description`, `tools`, `resources` | Embeds `description`, includes MCP config |
| `moments` | Temporal events | `name`, `start_time`, `end_time`, `classifications` | Indexes `start_time`, `end_time` |
**Example: Creating custom document schema**
```bash
# Start with resources template
rem schema add --name technical_docs --template resources --output technical_docs.yaml
# Edit technical_docs.yaml to add custom fields:
# - difficulty_level: enum["beginner", "intermediate", "advanced"]
# - language: string
# - code_examples: array[object]
# Register customized schema
rem schema add technical_docs.yaml
# Insert documents
cat docs.jsonl | rem insert technical_docs --batch
```
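The same customization can also be expressed as a Pydantic model instead of editing YAML. A sketch of the suggested fields on top of the base fields cloned from the `resources` template (the model itself is illustrative):

```python
from typing import Literal
from pydantic import BaseModel, Field, ConfigDict

class TechnicalDoc(BaseModel):
    """resources template plus the custom fields suggested above (illustrative)."""
    name: str = Field(description="Document name")
    content: str = Field(description="Document content")
    uri: str = Field(description="Source URI")
    difficulty_level: Literal["beginner", "intermediate", "advanced"]
    language: str = Field(description="Programming language covered")
    code_examples: list[dict] = Field(default_factory=list)

    model_config = ConfigDict(
        json_schema_extra={
            "embedding_fields": ["content"],  # as in the resources template
            "key_field": "uri",
        }
    )
```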
### Data Operations
| Command | Description | Example |
|---------|-------------|---------|
| `rem insert <table> <json>` | Insert entity | `rem insert articles '{"title": "..."}'` |
| `rem insert <table> --batch` | Batch insert from stdin | `cat data.jsonl \| rem insert articles --batch` |
| `rem ingest <file>` | Upload and chunk file | `rem ingest tutorial.pdf --schema=articles` |
| `rem get <uuid>` | Get entity by ID | `rem get 550e8400-...` |
| `rem lookup <key>` | Global key lookup | `rem lookup "Python Guide"` |
### Search and Queries
| Command | Description | Example |
|---------|-------------|---------|
| `rem search <query>` | Semantic search | `rem search "async programming" --schema=articles` |
| `rem query "<SQL>"` | SQL query | `rem query "SELECT * FROM articles WHERE category = 'tutorial'"` |
| `rem ask "<question>"` | Natural language query (executes) | `rem ask "show recent programming articles"` |
| `rem ask "<question>" --plan` | Show query plan without executing | `rem ask "show recent articles" --plan` |
| `rem traverse <uuid>` | Graph traversal | `rem traverse <id> --depth=2 --direction=out` |
**Natural language query examples:**
```bash
# Execute query immediately
rem ask "show recent programming articles"
# Output: Query results as JSON
# Show query plan without executing (LLM response only)
rem ask "show recent programming articles" --plan
# Output:
# {
# "confidence": 0.95,
# "query": "SELECT * FROM articles WHERE category = 'programming' ORDER BY created_at DESC LIMIT 10",
# "reasoning": "User wants recent articles filtered by programming category",
# "requires_search": false
# }
# Complex query with semantic search
rem ask "find articles about Rust performance optimization" --plan
# Output:
# {
# "confidence": 0.85,
# "query": "SEARCH articles 'Rust performance optimization' LIMIT 10",
# "reasoning": "Semantic search needed for conceptual similarity",
# "requires_search": true
# }
```
### Export and Analytics
| Command | Description | Example |
|---------|-------------|---------|
| `rem export <table>` | Export to Parquet | `rem export articles --output ./data.parquet` |
| `rem export --all` | Export all schemas | `rem export --all --output ./exports/` |
### REM Dreaming (Background Intelligence)
| Command | Description | Example |
|---------|-------------|---------|
| `rem dream` | Run dreaming with default lookback (24h) | `rem dream` |
| `rem dream --lookback-hours <N>` | Custom lookback window | `rem dream --lookback-hours 168` (weekly) |
| `rem dream --dry-run` | Show what would be generated | `rem dream --dry-run --verbose` |
| `rem dream --llm <model>` | Specify LLM provider | `rem dream --llm gpt-4-turbo` |
| `rem dream --start <date> --end <date>` | Specific date range | `rem dream --start "2025-10-20" --end "2025-10-25"` |
**REM Dreaming** uses LLMs to analyze your activity in the background and generate:
- **Moments**: Temporal classifications of what you were working on (with emotions, topics, outcomes)
- **Summaries**: Period recaps and key insights
- **Graph edges**: Automatic connections between related resources and sessions
- **Ontological maps**: Topic relationships and themes
See [`docs/rem-dreaming.md`](docs/rem-dreaming.md) for detailed documentation.
## Core System Schemas
REM Database includes three core schemas for tracking user activity:
### Sessions
**Purpose:** Track conversation sessions with AI agents.
**Key fields:**
- `id` (UUID) - Session identifier
- `case_id` (UUID) - Optional link to project/case
- `user_id` (string) - User identifier
- `metadata` (object) - Session context
**Schema:** [`schema/core/sessions.json`](schema/core/sessions.json)
### Messages
**Purpose:** Individual messages within sessions (user queries, AI responses, tool calls).
**Key fields:**
- `session_id` (UUID) - Parent session
- `role` (enum) - user | assistant | system | tool
- `content` (string) - Message content (embedded for search)
- `tool_calls` (array) - Tool invocations
- `trace_id`, `span_id` (string) - Observability
**Schema:** [`schema/core/messages.json`](schema/core/messages.json)
### Moments
**Purpose:** Temporal classifications generated by REM Dreaming.
**Key fields:**
- `name` (string) - Moment title
- `summary` (string) - Activity description
- `start_time`, `end_time` (datetime) - Time bounds
- `moment_type` (enum) - work_session | learning | planning | communication | reflection | creation
- `tags` (array) - Topic tags (e.g., ["rust", "database", "performance"])
- `emotion_tags` (array) - Emotion/tone tags (e.g., ["focused", "productive"])
- `people` (array) - People mentioned
- `resource_ids`, `session_ids` (arrays) - Related entities
**Schema:** [`schema/core/moments.json`](schema/core/moments.json)
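For orientation, a moment record conforming to these fields might look like this (all values illustrative):

```python
moment = {
    "name": "Deep work on HNSW tuning",
    "summary": "Benchmarked recall/latency trade-offs for the vector index.",
    "start_time": "2025-10-24T09:00:00Z",
    "end_time": "2025-10-24T11:30:00Z",
    "moment_type": "work_session",
    "tags": ["rust", "database", "performance"],
    "emotion_tags": ["focused", "productive"],
    "people": [],
    "resource_ids": ["550e8400-..."],
    "session_ids": [],
}
```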
**These schemas are registered automatically on `rem init`.**
## Peer Replication Testing
REM supports primary/replica replication via WAL and gRPC streaming.
### Terminal 1: Primary Node
```bash
# Start primary with WAL enabled
export P8_REPLICATION_MODE=primary
export P8_REPLICATION_PORT=50051
export P8_WAL_ENABLED=true
export P8_DB_PATH=./data/primary # Override default ~/.p8/db/
rem init
# Register schema (JSON/YAML preferred)
rem schema add schema.json
# Start replication server
rem serve --host 0.0.0.0 --port 50051
# Insert data (will be replicated)
rem insert articles '{"title": "Doc 1", "content": "Test replication", "category": "test"}'
# Check WAL status
rem replication wal-status
# Output:
# WAL sequence: 1
# Entries: 1
# Size: 512 bytes
```
### Terminal 2: Replica 1
```bash
# Start replica pointing to primary
export P8_REPLICATION_MODE=replica
export P8_PRIMARY_HOST=localhost:50051
export P8_DB_PATH=./data/replica1 # Override default ~/.p8/db/
rem init
# Connect and sync from primary
rem replicate --primary=localhost:50051 --follow
# Check replication status
rem replication status
# Output:
# Mode: replica
# Primary: localhost:50051
# WAL position: 1
# Lag: 2ms
# Status: synced
# Query replica (read-only)
rem query "SELECT * FROM articles"
# Output: Same data as primary
```
### Terminal 3: Replica 2
```bash
export P8_REPLICATION_MODE=replica
export P8_PRIMARY_HOST=localhost:50051
export P8_DB_PATH=./data/replica2 # Override default ~/.p8/db/
rem init
rem replicate --primary=localhost:50051 --follow
# Verify sync
rem query "SELECT COUNT(*) FROM articles"
# Output: 1
```
### Testing Failover
**Terminal 1: Simulate Primary Failure**
```bash
^C # Stop primary server
```
**Terminal 2: Replica Behavior During Outage**
```bash
# Replica continues serving reads
rem query "SELECT * FROM articles"
# Output: Cached data still available
# Check status
rem replication status
# Output:
# Status: disconnected
# Last sync: 45s ago
# Buffered writes: 0 (read-only)
```
**Terminal 1: Primary Restart**
```bash
# Restart primary and insert new data
rem serve --host 0.0.0.0 --port 50051
rem insert articles '{"title": "Doc 2", "content": "After restart", "category": "test"}'
```
**Terminal 2: Automatic Catchup**
```bash
# Replica auto-reconnects and syncs
rem replication status
# Output:
# Status: synced
# Catchup: completed (1 entry, 50ms)
# Lag: 3ms
# Verify new data
rem query "SELECT title FROM articles ORDER BY created_at DESC LIMIT 1"
# Output: Doc 2
```
## Key Implementation Conventions
### REM Principle
**Resources-Entities-Moments** is a unified data model, not separate storage:
- **Resources**: Chunked documents with embeddings → semantic search (HNSW)
- **Entities**: Structured data → SQL queries (indexed fields)
- **Moments**: Temporal classifications → time-range queries
All stored as **entities** in RocksDB. Conceptual distinction only.
### Pydantic-Driven Everything
Configuration flows from `json_schema_extra`:
NB: While metadata is supported in model-level config, individual fields can also carry properties such as `key_field` and `embedding_provider` in their own `json_schema_extra`; the field-level form is preferred.
```python
model_config = ConfigDict(
    json_schema_extra={
        "embedding_fields": ["content"],  # → Auto-embed on insert
        "indexed_fields": ["category"],   # → RocksDB index CF
        "key_field": "title"              # → Deterministic UUID
    }
)
```
NB: Schemas can also be defined in Rust via equivalent model structs or raw JSON Schema, but everything is driven by Pydantic-aware semantics of the JSON Schema format.
### Deterministic UUIDs (Idempotent Inserts)
NB: Default precedence is `uri` -> `key` -> `name`, unless a `key_field` is specified in the schema config (see table).
| Priority | Field | UUID Generation |
|----------|-------|-----------------|
| 1 | `uri` | `blake3(entity_type + uri + chunk_ordinal)` |
| 2 | `json_schema_extra.key_field` | `blake3(entity_type + value)` |
| 3 | `key` | `blake3(entity_type + key)` |
| 4 | `name` | `blake3(entity_type + name)` |
| 5 | (fallback) | `UUID::v4()` (random) |
Same key → same UUID → upsert semantics.
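A minimal sketch of the derivation, assuming the UUID is taken from the first 16 bytes of the blake3 digest and that the hash inputs are joined with a separator (both are assumptions about the exact layout):

```python
import uuid
from blake3 import blake3  # pip install blake3

def deterministic_id(entity_type: str, value: str) -> uuid.UUID:
    """Stable UUID from entity_type + key value, so re-inserts upsert."""
    digest = blake3(f"{entity_type}:{value}".encode()).digest()
    return uuid.UUID(bytes=digest[:16])  # first 16 bytes as UUID (assumed layout)

# Same inputs, same UUID: idempotent inserts
assert deterministic_id("articles", "Python Guide") == \
       deterministic_id("articles", "Python Guide")
```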
### System Fields (Always Auto-Added)
**Never** define these in Pydantic models; the database always adds them (illustrative record below):
- `id` (UUID) - Deterministic or random
- `entity_type` (string) - Schema/table name
- `created_at`, `modified_at`, `deleted_at` (ISO 8601) - Timestamps
- `edges` (array[string]) - Graph relationships
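Together with the user-defined fields, a stored `Article` record therefore looks roughly like this (values illustrative):

```python
stored = {
    # User-defined fields
    "title": "Why Rust?",
    "content": "Rust offers memory safety...",
    "category": "programming",
    # System fields added by the database
    "id": "550e8400-...",                 # deterministic (key_field = title)
    "entity_type": "articles",
    "created_at": "2025-10-26T21:04:15Z",
    "modified_at": "2025-10-26T21:04:15Z",
    "deleted_at": None,
    "edges": [],
}
```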
### Embedding Fields (Conditionally Added)
**Not system fields** - only added when configured:
- `embedding` (array[float32]) - Added if `embedding_fields` in `json_schema_extra`
- `embedding_alt` (array[float32]) - Added if `P8_ALT_EMBEDDING` environment variable set
```python
# Configuration that triggers embedding generation:
model_config = ConfigDict(
    json_schema_extra={
        "embedding_fields": ["content"],   # → Adds "embedding" field
        "embedding_provider": "default"    # → Uses P8_DEFAULT_EMBEDDING
    }
)
```
### Encryption at Rest
Optional encryption at rest using **Ed25519 key pairs** and **ChaCha20-Poly1305 AEAD**:
1. **Generate key pair** (one-time setup):
```bash
rem key-gen --password "strong_master_password"
# Stores encrypted key at ~/.p8/keys/private_key_encrypted
# Stores public key at ~/.p8/keys/public_key
```
2. **Initialize database with encryption**:
```bash
rem init --password "strong_master_password"
# All entity data encrypted before storage
# Transparent encryption/decryption on get/put
```
3. **Sharing across tenants** (future):
- Encrypt data with recipient's **public key** (X25519 ECDH)
- End-to-end encryption - even database admin cannot read shared data
4. **Device-to-device sync** (future):
- WAL entries encrypted before gRPC transmission
- Defense in depth: mTLS (transport) + encrypted WAL (application layer)
**Key security properties:**
- Private key **never leaves device unencrypted**
- Password-derived key using **Argon2** KDF
- **ChaCha20-Poly1305** AEAD for data encryption
- Public key stored unencrypted for sharing capabilities
See `docs/encryption-architecture.md` for complete design.
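A minimal sketch of the data-encryption path under these primitives, using the `argon2-cffi` and `cryptography` packages (cost parameters and the nonce-prefix layout are assumptions; the real design lives in the architecture doc):

```python
import os
from argon2.low_level import hash_secret_raw, Type
from cryptography.hazmat.primitives.ciphers.aead import ChaCha20Poly1305

def derive_key(password: str, salt: bytes) -> bytes:
    """Password -> 32-byte key via Argon2id (illustrative cost parameters)."""
    return hash_secret_raw(
        secret=password.encode(), salt=salt,
        time_cost=3, memory_cost=64 * 1024, parallelism=4,
        hash_len=32, type=Type.ID,
    )

def encrypt_entity(key: bytes, plaintext: bytes) -> bytes:
    """ChaCha20-Poly1305 AEAD; nonce prepended to ciphertext (assumed layout)."""
    nonce = os.urandom(12)
    return nonce + ChaCha20Poly1305(key).encrypt(nonce, plaintext, None)

def decrypt_entity(key: bytes, blob: bytes) -> bytes:
    nonce, ciphertext = blob[:12], blob[12:]
    return ChaCha20Poly1305(key).decrypt(nonce, ciphertext, None)
```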
### Column Families (Performance)
| CF | Purpose | Speedup vs Scan |
|----|---------|-----------------|
| `key_index` | Reverse key lookup | O(log n) vs O(n) |
| `edges` + `edges_reverse` | Bidirectional graph | 20x faster |
| `embeddings` (binary) | Vector storage | 3x compression |
| `indexes` | Indexed fields | 10-50x faster |
| `keys` | Encrypted tenant keys | - |
### HNSW Vector Index
Rust HNSW index provides **200x speedup** over naive Python scan:
- Python naive: ~1000ms for 1M documents
- Rust HNSW: ~5ms for 1M documents
This is the **primary reason** for Rust implementation.
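For context, the naive baseline behind that 200x figure is a brute-force scan over every stored vector, roughly (numpy sketch):

```python
import numpy as np

def naive_search(query: np.ndarray, vectors: np.ndarray, top_k: int = 5):
    """O(n*d) cosine-similarity scan; this is what HNSW's graph search replaces."""
    sims = vectors @ query / (
        np.linalg.norm(vectors, axis=1) * np.linalg.norm(query)
    )
    return np.argsort(-sims)[:top_k]  # indices of the top_k most similar docs
```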
## Performance Targets
| Operation | Target | Why Rust? |
|-----------|--------|-----------|
| Insert (no embedding) | < 1ms | RocksDB + zero-copy |
| Insert (with embedding) | < 50ms | Network-bound (OpenAI) |
| Get by ID | < 0.1ms | Single RocksDB get |
| Vector search (1M docs) | < 5ms | **HNSW (vs 1000ms naive)** |
| SQL query (indexed) | < 10ms | **Native execution (vs 50ms Python)** |
| Graph traversal (3 hops) | < 5ms | **Bidirectional CF (vs 100ms scan)** |
| Batch insert (1000 docs) | < 500ms | Batched embeddings |
| Parquet export (100k rows) | < 2s | **Parallel encoding** |
NB: We generally work in batches (batch upserts and batch embeddings); never make individual requests when batching is possible. A sketch of the batching discipline follows.
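A sketch of that discipline for a JSONL stream (the batch size and the batch-insert call are illustrative):

```python
import json
import sys
from itertools import islice

BATCH_SIZE = 256  # illustrative; tune to the embedding provider's limits

def batches(lines, size=BATCH_SIZE):
    """Group a JSONL stream into fixed-size batches of parsed records."""
    it = iter(lines)
    while chunk := list(islice(it, size)):
        yield [json.loads(line) for line in chunk]

# One embedding request and one upsert per batch, never per document
for batch in batches(sys.stdin):
    pass  # e.g. db.insert_batch("articles", batch)  (assumed batch API)
```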
## Environment Configuration
```bash
# Core
export P8_HOME=~/.p8
export P8_DB_PATH=$P8_HOME/db
# Embeddings
export P8_DEFAULT_EMBEDDING=local:all-MiniLM-L6-v2
export P8_OPENAI_API_KEY=sk-... # For OpenAI embeddings
# LLM (natural language queries)
export P8_DEFAULT_LLM=gpt-4.1
export P8_OPENAI_API_KEY=sk-...
# RocksDB tuning
export P8_ROCKSDB_WRITE_BUFFER_SIZE=67108864 # 64MB
export P8_ROCKSDB_MAX_BACKGROUND_JOBS=4
export P8_ROCKSDB_COMPRESSION=lz4
# Replication
export P8_REPLICATION_MODE=primary # or replica
export P8_PRIMARY_HOST=localhost:50051 # For replicas
export P8_WAL_ENABLED=true
```
See [CLAUDE.md](./CLAUDE.md) for full list.
## Project Structure
```
percolate-rocks/ # Clean implementation
├── Cargo.toml # Rust dependencies
├── pyproject.toml # Python package (maturin)
├── README.md # This file
├── CLAUDE.md # Implementation guide
│
├── src/ # Rust implementation (~3000 lines target)
│ ├── lib.rs # PyO3 module definition (30 lines)
│ │
│ ├── types/ # Core data types (120 lines)
│ │ ├── mod.rs # Re-exports
│ │ ├── entity.rs # Entity, Edge structs
│ │ ├── error.rs # Error types (thiserror)
│ │ └── result.rs # Type aliases
│ │
│ ├── storage/ # RocksDB wrapper (400 lines)
│ │ ├── mod.rs # Re-exports
│ │ ├── db.rs # Storage struct + open
│ │ ├── keys.rs # Key encoding functions
│ │ ├── batch.rs # Batch writer
│ │ ├── iterator.rs # Prefix iterator
│ │ └── column_families.rs # CF constants + setup
│ │
│ ├── index/ # Indexing layer (310 lines)
│ │ ├── mod.rs # Re-exports
│ │ ├── hnsw.rs # HNSW vector index
│ │ ├── fields.rs # Indexed fields
│ │ └── keys.rs # Key index (reverse lookup)
│ │
│ ├── query/ # Query execution (260 lines)
│ │ ├── mod.rs # Re-exports
│ │ ├── parser.rs # SQL parser
│ │ ├── executor.rs # Query executor
│ │ ├── predicates.rs # Predicate evaluation
│ │ └── planner.rs # Query planner
│ │
│ ├── embeddings/ # Embedding providers (200 lines)
│ │ ├── mod.rs # Re-exports
│ │ ├── provider.rs # Provider trait + factory
│ │ ├── local.rs # Local models (fastembed)
│ │ ├── openai.rs # OpenAI API client
│ │ └── batch.rs # Batch embedding operations
│ │
│ ├── schema/ # Schema validation (160 lines)
│ │ ├── mod.rs # Re-exports
│ │ ├── registry.rs # Schema registry
│ │ ├── validator.rs # JSON Schema validation
│ │ └── pydantic.rs # Pydantic json_schema_extra parser
│ │
│ ├── graph/ # Graph operations (130 lines)
│ │ ├── mod.rs # Re-exports
│ │ ├── edges.rs # Edge CRUD
│ │ └── traversal.rs # BFS/DFS traversal
│ │
│ ├── replication/ # Replication engine (400 lines)
│ │ ├── mod.rs # Re-exports
│ │ ├── wal.rs # Write-ahead log
│ │ ├── primary.rs # Primary node (gRPC server)
│ │ ├── replica.rs # Replica node (gRPC client)
│ │ ├── protocol.rs # gRPC protocol definitions
│ │ └── sync.rs # Sync state machine
│ │
│ ├── export/ # Export formats (200 lines)
│ │ ├── mod.rs # Re-exports
│ │ ├── parquet.rs # Parquet writer
│ │ ├── csv.rs # CSV writer
│ │ └── jsonl.rs # JSONL writer
│ │
│ ├── ingest/ # Document ingestion (180 lines)
│ │ ├── mod.rs # Re-exports
│ │ ├── chunker.rs # Document chunking
│ │ ├── pdf.rs # PDF parser
│ │ └── text.rs # Text chunking
│ │
│ ├── llm/ # LLM query builder (150 lines)
│ │ ├── mod.rs # Re-exports
│ │ ├── query_builder.rs # Natural language → SQL
│ │ └── planner.rs # Query plan generation
│ │
│ └── bindings/ # PyO3 Python bindings (300 lines)
│ ├── mod.rs # Re-exports
│ ├── database.rs # Database wrapper (main API)
│ ├── types.rs # Type conversions (Python ↔ Rust)
│ ├── errors.rs # Error conversions
│ └── async_ops.rs # Async operation wrappers
│
├── python/ # Python package (~800 lines target)
│ └── rem_db/
│ ├── __init__.py # Public API (thin wrapper over Rust)
│ ├── cli.py # Typer CLI (delegates to Rust)
│ ├── models.py # Built-in Pydantic schemas
│ └── async_api.py # Async wrapper utilities
│
└── tests/
├── rust/ # Rust integration tests
│ ├── test_crud.rs
│ ├── test_search.rs
│ ├── test_graph.rs
│ ├── test_replication.rs
│ └── test_export.rs
│
└── python/ # Python integration tests
├── test_api.py
├── test_cli.py
├── test_async.py
└── test_end_to_end.py
```
**Key Design Notes:**
1. **Rust Core (~3000 lines in ~40 files)**: All performance-critical operations in Rust
- Average 75 lines per file
- Max 150 lines per file
- Single responsibility per module
2. **Python Bindings (bindings/)**: Thin PyO3 layer
- Database wrapper exposes high-level API
- Type conversions between Python dict/list ↔ Rust structs
- Error conversions for Python exceptions
- Async operation wrappers (tokio → asyncio)
- **No business logic** - pure translation layer
3. **Python Package (python/)**: Minimal orchestration
- CLI delegates to Rust immediately
- Public API is thin wrapper (`db._rust_insert()`)
- Pydantic models define schemas, Rust validates/stores
- Async utilities for Python async/await ergonomics
4. **Replication Module**: Primary/replica peer replication
- WAL (write-ahead log) for durability
- gRPC streaming for real-time sync
- Automatic catchup after disconnection
- Read-only replica mode
5. **Export Module**: Analytics-friendly formats
- Parquet with ZSTD compression
- CSV for spreadsheets
- JSONL for streaming/batch processing
6. **LLM Module**: Natural language query interface
- Convert questions → SQL/SEARCH queries
- Query plan generation (`--plan` flag)
- Confidence scoring
7. **Test Organization**: Separation of unit and integration tests
**Rust Tests:**
- **Unit tests**: Inline with implementation using `#[cfg(test)]` modules
```rust
// src/storage/keys.rs
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_encode_entity_key() {
        let key = encode_entity_key(uuid);
        assert!(key.starts_with(b"entity:"));
    }
}
```
- **Integration tests**: In `tests/rust/` directory
- Test full workflows across modules
- Require actual RocksDB instance
- May be slower (acceptable up to 10s per test)
**Python Tests:**
- **Unit tests**: NOT APPLICABLE (Python layer is thin wrapper)
- **Integration tests**: In `tests/python/` directory
- Test PyO3 bindings (Python ↔ Rust type conversions)
- Test CLI commands end-to-end
- Test async/await ergonomics
- Require Rust library to be built
**Running Tests:**
```bash
# Rust unit tests (fast, inline with code)
cargo test --lib
# Rust integration tests (slower, requires RocksDB)
cargo test --test '*'
# Python integration tests (requires maturin build)
maturin develop
pytest tests/python/
# All tests
cargo test && pytest tests/python/
```
**Coverage Targets:**
- Rust: 80%+ coverage (critical path)
- Python: 90%+ coverage (thin wrapper, easy to test)
## Development
### Pre-Build Checks
```bash
# Check compilation (fast, no binary output)
cargo check
# Format check (without modifying files)
cargo fmt --check
# Linting with clippy
cargo clippy --all-targets --all-features
# Security audit (requires: cargo install cargo-audit)
cargo audit
# Check for outdated dependencies (requires: cargo install cargo-outdated)
cargo outdated
```
### Building
```bash
# Development build (unoptimized, fast compile)
cargo build
# Release build (optimized, slower compile)
cargo build --release
# Python extension development install (editable)
maturin develop
# Python extension release wheel
maturin build --release
```
### Testing
```bash
# Rust unit tests
cargo test
# Rust unit tests with output
cargo test -- --nocapture
# Python integration tests (requires maturin develop first)
pytest
# Python tests with verbose output
pytest -v
# Run specific test
cargo test test_name
```
### Code Quality
```bash
# Auto-format code
cargo fmt
# Run clippy linter
cargo clippy --all-targets
# Fix clippy warnings automatically (where possible)
cargo clippy --fix
# Check for unused dependencies
cargo machete # requires: cargo install cargo-machete
```
### Benchmarks
```bash
# Run all benchmarks
cargo bench
# Run specific benchmark
cargo bench vector_search
```
### Development Workflow
```bash
# 1. Make changes to Rust code
# 2. Check compilation
cargo check
# 3. Run tests
cargo test
# 4. Build Python extension
maturin develop
# 5. Test Python integration
pytest
```
## References
- **Specification**: See `db-specification-v0.md` in `-ref` folder
- **Python spike**: `../rem-db` (100% features, production-ready)
- **Old Rust spike**: `../percolate-rocks-ref` (~40% features)
- **Implementation guide**: [CLAUDE.md](./CLAUDE.md)