microvector

| Field           | Value                                                                                                                       |
| --------------- | --------------------------------------------------------------------------------------------------------------------------- |
| Name            | microvector                                                                                                                 |
| Version         | 0.3.0                                                                                                                       |
| Summary         | Lightweight local vector database with persistence to disk, supporting multiple similarity metrics and an easy-to-use API. |
| Author          | John Dagdelen, Logan Powell                                                                                                 |
| Homepage        | https://github.com/loganpowell/microvector                                                                                 |
| License         | MIT                                                                                                                         |
| Requires Python | >=3.10                                                                                                                      |
| Keywords        | automation, database, embeddings, llm, machine-learning, search, similarity, vector                                        |
| Upload time     | 2025-10-29 02:17:42                                                                                                         |
# microvector

Lightweight local vector database with persistence to disk, supporting multiple similarity metrics and an easy-to-use API.

> A refactor and repackaging of [HyperDB](https://github.com/jdagdelen/hyperDB/tree/main) optimized for CPU-only environments with improved type safety and developer experience.

## Features

- 🚀 **Simple API**: Clean, intuitive interface with `PartitionStore` pattern
- 💾 **Persistent Storage**: Automatically caches vector stores to `.pickle.gz` files
- 🔍 **Multiple Similarity Metrics**: Choose from cosine, dot product, Euclidean, or Derrida distance
- 🎯 **Type Safe**: Full type annotations with strict pyright compliance
- ⚡ **CPU Optimized**: Designed for CPU-only environments (no CUDA required)
- 🔄 **Flexible Caching**: Use persistent stores or create temporary in-memory collections
- 📦 **Easy Installation**: One-command setup with automatic PyTorch CPU configuration
- ✨ **Partition-level Operations**: Add, remove, and search documents through dedicated store objects

## Installation

```bash
pip install microvector
```

Or for development:

```bash
git clone https://github.com/loganpowell/microvector.git
cd microvector
uv sync
```

## Quick Start

```python
from microvector import Client

# Initialize the client
client = Client()

# Save a collection (by default, in-memory only)
store = client.save(
    partition="my_documents",
    collection=[
        {"text": "Python is a popular programming language", "category": "tech"},
        {"text": "Machine learning models learn from data", "category": "ai"},
        {"text": "The quick brown fox jumps over the lazy dog", "category": "example"},
    ],
    key="text"
)

# Search using the store
results = store.search(
    term="artificial intelligence and ML",
)

for result in results:
    print(f"Score: {result['similarity_score']:.4f} - {result['text']}")

# Add more documents (also in-memory by default)
store.add([
    {"text": "Deep learning uses neural networks", "category": "ai"}
])

# Search again with the updated store
results = store.search("neural networks", top_k=3)

# Persist to disk when ready
client.save(
    partition="my_documents",
    collection=store.to_dict(),
    cache=True  # Now save to disk
)
```

## API Reference

### Client

The main interface for creating and managing vector stores.

```python
Client(
    model_cache: str = "./.cached_models",
    vector_cache: str = "./.vector_cache",
    embedding_model: str = "avsolatorio/GIST-small-Embedding-v0",
    search_algo: str = "cosine"
)
```

**Parameters:**

| Parameter         | Description                                        | Default                                 |
| ----------------- | -------------------------------------------------- | --------------------------------------- |
| `model_cache`     | Directory for caching downloaded embedding models  | `"./.cached_models"`                    |
| `vector_cache`    | Directory for persisting vector stores             | `"./.vector_cache"`                     |
| `embedding_model` | HuggingFace model name for generating embeddings   | `"avsolatorio/GIST-small-Embedding-v0"` |
| `search_algo`     | `"cosine"`, `"dot"`, `"euclidean"`, or `"derrida"` | `"cosine"`                              |

**Note:** The `search_algo` setting is applied at the client level and to every partition created by that client. This ensures consistency and prevents issues with switching algorithms on already-normalized vectors (see "Using Different Algorithms" under Advanced Usage below).

### save()

Create or update a vector store and return a `PartitionStore` for operations.

```python
store = client.save(
    partition: str,
    collection: list[dict[str, Any]],
    key: str = "text",
    cache: bool = False,
    append: bool = False
) -> PartitionStore
```

**Parameters:**

| Parameter    | Description                                             | Default  |
| ------------ | ------------------------------------------------------- | -------- |
| `partition`  | Unique identifier for this vector store                 | -        |
| `collection` | List of documents (dictionaries) to vectorize           | -        |
| `key`        | Field name to use for embedding                         | `"text"` |
| `cache`      | If True, persist to disk; if False, keep in-memory only | `False`  |
| `append`     | If True, add to existing store; if False, replace       | `False`  |

**Returns:** `PartitionStore` object with methods for searching and managing documents

**Example:**

```python
store = client.save(
    partition="products",
    collection=[
        {"description": "Wireless headphones", "price": 99.99},
        {"description": "Smart watch", "price": 299.99},
    ],
    key="description",
    cache=True  # Persist to disk
)

print(f"Partition: {store.partition}")
print(f"Size: {store.size}")
print(f"Algorithm: {store.algo}")
```

### PartitionStore

A partition-specific interface for vector operations returned by `client.save()`.

**Attributes:**

| Attribute   | Type | Description                                                                      |
| ----------- | ---- | -------------------------------------------------------------------------------- |
| `partition` | str  | Name of the partition                                                            |
| `key`       | str  | Field being vectorized                                                           |
| `algo`      | str  | Similarity algorithm in use (`"cosine"`, `"dot"`, `"euclidean"`, or `"derrida"`) |
| `size`      | int  | Number of documents in the store (read-only property)                            |

**Methods:**

#### search()

Search this partition for similar documents.

```python
results = store.search(
    term: str,
    top_k: int = 5
) -> list[dict[str, Any]]
```

**Parameters:**

| Parameter | Description                         | Default |
| --------- | ----------------------------------- | ------- |
| `term`    | Search query string                 | -       |
| `top_k`   | Maximum number of results to return | `5`     |

**Returns:** List of documents with similarity scores

```python
[
    {
        "text": "Machine learning is awesome",
        "category": "ai",
        "similarity_score": 0.923
    },
    ...
]
```

**Example:**

```python
results = store.search("laptop computers", top_k=3)
for result in results:
    print(f"{result['similarity_score']:.3f} - {result['description']}")
```

#### add()

Add new documents to this partition.

```python
success = store.add(
    collection: list[dict[str, Any]] | dict[str, Any],
    cache: bool = False
) -> bool
```

**Parameters:**

| Parameter    | Description                                        | Default |
| ------------ | -------------------------------------------------- | ------- |
| `collection` | Single document or list of documents to add        | -       |
| `cache`      | If True, persist changes; if False, keep in-memory | `False` |

**Returns:** `True` if successful, `False` otherwise

**Example:**

```python
# Add a single document
store.add({"description": "USB-C cable", "price": 12.99})

# Add multiple documents
store.add([
    {"description": "Keyboard", "price": 79.99},
    {"description": "Mouse pad", "price": 19.99}
], cache=True)

print(f"Updated size: {store.size}")
```

#### remove()

Remove a document from this partition by index or content.

```python
success = store.remove(
    item: int | dict[str, Any],
    cache: bool = False
) -> bool
```

**Parameters:**

| Parameter | Description                                               | Default |
| --------- | --------------------------------------------------------- | ------- |
| `item`    | Document index (int) or document content (dict) to remove | -       |
| `cache`   | If True, persist changes; if False, keep in-memory        | `False` |

**Returns:** `True` if successful, `False` otherwise

**Example:**

```python
# Remove by index
store.remove(0)

# Remove by document content
store.remove({"description": "Wireless headphones", "price": 99.99})
```

#### delete()

Delete this partition's cache file from disk.

```python
success = store.delete() -> bool
```

**Returns:** `True` if successful, `False` otherwise

**Example:**

```python
if store.delete():
    print(f"Partition '{store.partition}' deleted from disk")
```

## Similarity Algorithms

| Algorithm   | Best For                          | Range                        |
| ----------- | --------------------------------- | ---------------------------- |
| `cosine`    | General text similarity (default) | -1 to 1 (higher is more similar; typically 0-1 for text embeddings) |
| `dot`       | When magnitude matters            | Unbounded                    |
| `euclidean` | Spatial distance                  | 0-∞ (lower is more similar)  |
| `derrida`   | Experimental alternative distance | 0-∞ (lower is more similar)  |
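
For intuition, here is a minimal NumPy sketch of what the first three metrics compute over a pair of embedding vectors. It is illustrative only, not microvector's internal implementation, and `derrida` is omitted because its exact formula is an experimental, library-specific detail:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Angle-based similarity; higher is more similar.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def dot_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Unbounded; grows with vector magnitude as well as alignment.
    return float(np.dot(a, b))

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    # Straight-line distance; lower is more similar.
    return float(np.linalg.norm(a - b))

a, b = np.array([1.0, 0.0]), np.array([0.7, 0.7])
print(cosine_similarity(a, b), dot_similarity(a, b), euclidean_distance(a, b))
```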

## Advanced Usage

### Using Different Algorithms

The similarity algorithm is set at the client level and applies to all partitions:

```python
# Create clients with different algorithms
cosine_client = Client(search_algo="cosine")
dot_client = Client(search_algo="dot")
euclidean_client = Client(search_algo="euclidean")

# Each client's partitions use its algorithm
cosine_store = cosine_client.save("docs_cosine", documents)
dot_store = dot_client.save("docs_dot", documents)

# Different algorithms, different results
cosine_results = cosine_store.search("query")
dot_results = dot_store.search("query")
```

**Why client-level?** Vectors are normalized based on the algorithm. Switching algorithms on existing vectors would produce incorrect results, so we lock the algorithm at creation time for consistency.
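
As a toy illustration of the pitfall, assuming (as is common for cosine indexing) that vectors are stored L2-normalized so ranking can be a plain dot product, scoring those pre-normalized vectors with a magnitude-sensitive metric silently discards information:

```python
import numpy as np

v = np.array([3.0, 4.0])
unit = v / np.linalg.norm(v)  # what a cosine-indexed store would keep

# A dot-product query against the normalized copy can no longer see
# the original magnitude: |v|^2 is 25.0, but the stored form yields 1.0.
print(np.dot(v, v))        # 25.0
print(np.dot(unit, unit))  # 1.0
```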

### Custom Embedding Models

Use any HuggingFace sentence-transformer model:

```python
client = Client(
    embedding_model="intfloat/e5-small-v2",
    search_algo="cosine"
)
```

### Nested Key Paths

Access nested fields using dot notation:

```python
collection = [
    {
        "product": {
            "name": "Laptop",
            "specs": {"cpu": "Intel i7"}
        }
    },
    {
        "product": {
            "name": "Mouse",
            "specs": {"cpu": None}
        }
    }
]

store = client.save(
    partition="products",
    collection=collection,
    key="product.name"  # Extract "Laptop", "Mouse" from nested structure
)

# Search works on the nested field
results = store.search("computer", top_k=1)
print(results[0]["product"]["name"])  # "Laptop"
```
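
Conceptually, dot-notation lookup just walks the dictionary one path segment at a time. Here is a minimal sketch of the idea; `get_nested` is a hypothetical helper for illustration, not part of the microvector API:

```python
from typing import Any

def get_nested(doc: dict[str, Any], path: str) -> Any:
    # Walk "product.name" -> doc["product"]["name"], returning None
    # if any segment is missing along the way.
    value: Any = doc
    for part in path.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value

print(get_nested({"product": {"name": "Laptop"}}, "product.name"))  # Laptop
```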

### Working with Multiple Partitions

Organize different datasets in separate partitions:

```python
# Create stores for different content types
news_store = client.save("news_articles", news_data, key="content")
review_store = client.save("product_reviews", review_data, key="review_text")
ticket_store = client.save("support_tickets", tickets, key="description")

# Search each independently
news_results = news_store.search("economy")
review_results = review_store.search("quality")
ticket_results = ticket_store.search("login issue")
```

### Incremental Updates

Add new documents to existing stores without replacing them:

```python
# Create initial store (in-memory by default)
store = client.save(
    partition="knowledge_base",
    collection=[
        {"text": "Python is a programming language"},
        {"text": "JavaScript runs in browsers"},
    ]
)

print(f"Initial size: {store.size}")  # 2

# Add more documents using the store's add() method
store.add([
    {"text": "TypeScript adds types to JavaScript"},
    {"text": "Rust is memory-safe"},
])

print(f"Updated size: {store.size}")  # 4

# Or use save() with append=True
client.save(
    partition="knowledge_base",
    collection=[{"text": "Go is designed for concurrency"}],
    append=True
)

# Persist all changes to disk when ready
store.add([], cache=True)  # Flush to disk
```

### Persistent Storage

Enable caching to persist vector stores to disk:

```python
# Save directly to disk
store = client.save(
    partition="permanent_docs",
    collection=documents,
    cache=True  # Persist immediately
)

# Add documents and persist
store.add(new_documents, cache=True)

# Remove documents and persist
store.remove(0, cache=True)

# Later, load from cache
loaded_store = client.save(
    partition="permanent_docs",
    collection=[],  # Empty collection loads from cache
    cache=True
)
```

### In-Memory Operations

Work with documents without persisting to disk (default behavior):

```python
# Create a temporary store (cache=False is the default)
temp_store = client.save(
    partition="temp_analysis",
    collection=documents
)

# Add documents without caching (default behavior)
temp_store.add(more_documents)

# Search as normal
results = temp_store.search("query")

# Since cache=False (default), changes aren't persisted
```

### Document Management

```python
# Create store (in-memory by default)
store = client.save("products", initial_products)

# Add new products (in-memory)
store.add([
    {"name": "New Product", "price": 49.99}
])

# Remove by index (in-memory)
store.remove(0)

# Remove by content (in-memory)
store.remove({"name": "Old Product", "price": 19.99})

# Check current size
print(f"Current inventory: {store.size} items")

# Persist changes to disk
store.add([], cache=True)

# Or delete entire partition from disk
if store.delete():
    print("Partition deleted")
```

## Development Setup

This project uses `uv` for dependency management and automatically configures CPU-only PyTorch.

### Quick Start

1. **Install dependencies:**

   ```bash
   uv sync
   ```

2. **Verify setup:**

   ```bash
   uv run python setup_dev.py
   ```

3. **Run tests:**

   ```bash
   uv run pytest
   ```

4. **Type checking:**
   ```bash
   uv run pyright
   ```

### What Gets Installed

- **PyTorch (CPU-only)**: Automatically from PyTorch CPU index
- **Transformers**: HuggingFace transformers library
- **Sentence Transformers**: For embedding generation
- **NumPy**: Numerical computing

No special flags or manual PyTorch installation needed - just `uv sync` and go!

## Performance Tips

1. **Reuse Client instances** - Model loading is expensive (see the sketch after this list)
2. **Use persistent caching** - Vector computation is cached automatically
3. **Batch your saves** - Save collections together when possible
4. **Choose the right algorithm** - Cosine is fastest for most use cases
5. **Adjust top_k** - Lower values are faster
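
For tips 1, 2, and 5 together, a minimal sketch of the pattern: one module-level `Client` shared across calls, with the partition loaded once from its disk cache. The `search_kb` helper and `"knowledge_base"` partition are hypothetical, and the sketch assumes the partition was previously persisted with `cache=True`:

```python
from typing import Any

from microvector import Client

# Construct once and reuse: loading the embedding model is the slow part.
client = Client()

# An empty collection with cache=True loads an existing partition from
# disk (see "Persistent Storage" above); do this once, not per query.
store = client.save(partition="knowledge_base", collection=[], cache=True)

def search_kb(query: str) -> list[dict[str, Any]]:
    # A modest top_k keeps each search cheap.
    return store.search(query, top_k=3)
```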

## Architecture

```
microvector/
├── main.py              # Client API
├── partition_store.py   # PartitionStore class for partition operations
├── store.py             # Vector storage and similarity search
├── cache.py             # Persistence layer
├── embed.py             # Embedding generation
├── search.py            # Search utilities
├── algos.py             # Similarity algorithms
└── utils.py             # Helper functions
```

## License

MIT License - see [LICENSE](LICENSE) file for details.

## Credits

Based on [HyperDB](https://github.com/jdagdelen/hyperDB) by John Dagdelen.
Refactored and maintained by Logan Powell.

            
