safe-store

Name: safe-store
Version: 3.1.1
Summary: Simple, concurrent SQLite-based vector store optimized for local RAG pipelines, with optional encryption.
Upload time: 2025-10-13 23:07:40
Requires Python: >=3.8
Keywords: concurrent, database, embedding, encryption, llm, local, rag, semantic search, sqlite, vector, webui
# safe_store: Transform Your Digital Chaos into a Queryable Knowledge Base

[![PyPI version](https://img.shields.io/pypi/v/safe_store.svg)](https://pypi.org/project/safe_store/)
[![PyPI license](https://img.shields.io/pypi/l/safe_store.svg)](https://github.com/ParisNeo/safe_store/blob/main/LICENSE)
[![PyPI pyversions](https://img.shields.io/pypi/pyversions/safe_store.svg)](https://pypi.org/project/safe_store/)

**`safe_store` is a Python library that turns your local folders of documents into a powerful, private, and intelligent knowledge base.** It achieves this by combining two powerful AI concepts into a single, seamless tool:

1.  **Deep Semantic Search:** It reads and *understands* the content of your files, allowing you to search by meaning and context, not just keywords.
2.  **AI-Powered Knowledge Graph:** It uses a Large Language Model (LLM) to automatically identify key entities (people, companies, concepts) and the relationships between them, building an interconnected web of your knowledge.

All of this happens entirely on your local machine, using a single, portable SQLite file. Your data never leaves your control.

---

## The Journey from Search to Understanding

`safe_store` is designed to grow with your needs. You can start with a simple, powerful RAG system in minutes, and then evolve it into a sophisticated knowledge engine.

### Level 1: Build a Powerful RAG System with Semantic Search
**The Foundation: Retrieval-Augmented Generation (RAG)**

RAG is the state-of-the-art technique for making Large Language Models (LLMs) answer questions about your private documents. The process is simple:
1.  **Retrieve:** Find the most relevant text chunks from your documents related to a user's query.
2.  **Augment:** Add those chunks as context to your prompt.
3.  **Generate:** Ask the LLM to generate an answer based *only* on the provided context.

`SafeStore` is the perfect tool for the "Retrieve" step. It uses vector embeddings to understand the *meaning* of your text, allowing you to find relevant passages even if they don't contain the exact keywords.

**Example: A Simple RAG Pipeline**

```python
import safe_store

# 1. Create a store. This will create a 'my_notes.db' file.
store = safe_store.SafeStore(db_path="my_notes.db", vectorizer_name="st")

# 2. Add your documents. add_document scans a folder and processes all
#    supported files.
with store:
    store.add_document("path/to/my_notes_and_articles/")

    # 3. Query the store to find context for your RAG prompt.
    user_query = "What were the main arguments about AI consciousness in my research?"
    context_chunks = store.query(user_query, top_k=3)

# 4. Build the prompt and send to your LLM.
context_text = "\n\n".join([chunk['chunk_text'] for chunk in context_chunks])
prompt = f"""
Based on the following context, please answer the user's question.
Do not use any external knowledge.

Context:
---
{context_text}
---

Question: {user_query}
"""

# result = my_llm_function(prompt) # Send to your LLM of choice
```
With just this, you have a powerful, private RAG system running on your local files.

### Level 2: Uncover Hidden Connections with a Knowledge Graph
**The Next Dimension: From Passages to a Web of Knowledge**

Semantic search is great for finding *relevant passages*, but it struggles with questions about *specific facts* and *relationships* scattered across multiple documents.

`GraphStore` complements this by building a structured knowledge graph of the key **instances** (like the person "Geoffrey Hinton") and their **relationships** (like `PIONEERED` the concept "Backpropagation"). This allows you to ask precise, factual questions.
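For instance, once the graph is built you can ask for a fact directly. Here is a minimal sketch reusing the `GraphStore` calls shown in the Quick Start below; it assumes `store` is an existing `SafeStore` and `llm_executor` is your LLM callback:

```python
from safe_store import GraphStore

# Assumed to exist already: `store` (a SafeStore) and `llm_executor`
# (an LLM callback), as set up in the Quick Start below.
ontology = "Extract People and Concepts. A Person can have PIONEERED a Concept."
graph_store = GraphStore(store=store, llm_executor_callback=llm_executor, ontology=ontology)

with graph_store:
    # Extract instances and relationships from all indexed documents,
    # then ask a precise, factual question against the graph.
    graph_store.build_graph_for_all_documents()
    result = graph_store.query_graph("Who pioneered backpropagation?", output_mode="graph_only")
```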

---

## Dynamic Vectorizer Discovery & Configuration

One of `safe_store`'s most powerful features is its ability to self-document. You don't need to guess which vectorizers are available or what parameters they need. You can discover everything at runtime.

This makes it easy to experiment with different embedding models and build interactive tools that guide users through the setup process.

### Step 1: Discovering Available Vectorizers

The `SafeStore.list_available_vectorizers()` class method scans the library for all built-in and custom vectorizers and returns their complete configuration metadata.

```python
import safe_store
import pprint

# Get a list of all available vectorizer configurations
available_vectorizers = safe_store.SafeStore.list_available_vectorizers()

# Pretty-print the result to see what's available
pprint.pprint(available_vectorizers)
```
This will produce a detailed output like this:
```
[{'author': 'ParisNeo',
  'class_name': 'CohereVectorizer',
  'creation_date': '2025-10-10',
  'description': "A vectorizer that uses Cohere's API...",
  'input_parameters': [{'default': 'embed-english-v3.0',
                        'description': 'The name of the Cohere embedding model...',
                        'mandatory': True,
                        'name': 'model'},
                       {'default': '',
                        'description': 'Your Cohere API key...',
                        'mandatory': False,
                        'name': 'api_key'},
                        ...],
  'last_update_date': '2025-10-10',
  'name': 'cohere',
  'title': 'Cohere Vectorizer'},
 {'author': 'ParisNeo',
  'class_name': 'OllamaVectorizer',
  'name': 'ollama',
  'title': 'Ollama Vectorizer',
  ...},
  ...
]
```

### Step 2: Listing Available Models for a Vectorizer

Once you know which vectorizer you want to use, you can ask `safe_store` what specific models it supports. This is especially useful for API-based or local server-based vectorizers like `ollama`, which can have many different models available.

```python
import safe_store

# Example: List all embedding models available from a running Ollama server
try:
    # This requires a running Ollama instance to succeed
    ollama_models = safe_store.SafeStore.list_models("ollama")
    print("Available Ollama embedding models:")
    for model in ollama_models:
        print(f"- {model}")
except Exception as e:
    print(f"Could not list Ollama models. Is the server running? Error: {e}")

```

### Step 3: Building an Interactive Configurator

You can use this metadata to create an interactive setup script, guiding the user to choose and configure their desired vectorizer on the fly.

**Full Interactive Example:**
Copy and run this script. It will guide you through selecting and configuring a vectorizer, then initialize `SafeStore` with your choices.

```python
# interactive_setup.py
import safe_store
import pprint

def interactive_vectorizer_setup():
    """
    An interactive CLI to guide the user through selecting and configuring a vectorizer.
    """
    print("--- Welcome to the safe_store Interactive Vectorizer Setup ---")
    
    # 1. List all available vectorizers
    vectorizers = safe_store.SafeStore.list_available_vectorizers()
    
    print("\nAvailable Vectorizers:")
    for i, vec in enumerate(vectorizers):
        print(f"  [{i+1}] {vec['name']} - {vec.get('title', 'No Title')}")

    # 2. Get user's choice
    choice = -1
    while choice < 0 or choice >= len(vectorizers):
        try:
            raw_choice = input(f"\nPlease select a vectorizer (1-{len(vectorizers)}): ")
            choice = int(raw_choice) - 1
            if not (0 <= choice < len(vectorizers)):
                print("Invalid selection. Please try again.")
        except ValueError:
            print("Please enter a number.")

    selected_vectorizer = vectorizers[choice]
    selected_name = selected_vectorizer['name']
    
    print(f"\nYou have selected: {selected_name}")
    print(f"Description: {selected_vectorizer.get('description', 'N/A').strip()}")

    # 3. Dynamically build the configuration dictionary
    vectorizer_config = {}
    print("\nPlease provide the following configuration values (press Enter to use default):")
    
    params = selected_vectorizer.get('input_parameters', [])
    if not params:
        print("This vectorizer requires no special configuration.")
    else:
        for param in params:
            param_name = param['name']
            description = param.get('description', 'No description.')
            default_value = param.get('default', None)
            
            prompt = f"- {param_name} ({description})"
            if default_value is not None:
                prompt += f" [default: {default_value}]: "
            else:
                prompt += ": "
                
            user_input = input(prompt)
            
            # Use user input if provided, otherwise use default
            final_value = user_input if user_input else default_value
            
            # Simple type conversion for demonstration (can be expanded)
            if final_value is not None:
                if param.get('type') == 'int':
                    vectorizer_config[param_name] = int(final_value)
                elif param.get('type') == 'dict':
                    # For simplicity, we don't parse dicts here, but a real app might use json.loads
                    vectorizer_config[param_name] = final_value
                else:
                    vectorizer_config[param_name] = str(final_value)

    # 4. Initialize SafeStore with the dynamically created configuration
    print("\n--- Configuration Complete ---")
    print(f"Vectorizer Name: '{selected_name}'")
    print("Vectorizer Config:")
    pprint.pprint(vectorizer_config)
    
    try:
        print("\nInitializing SafeStore with your configuration...")
        store = safe_store.SafeStore(
            db_path=f"{selected_name}_store.db",
            vectorizer_name=selected_name,
            vectorizer_config=vectorizer_config
        )
        print("\nāœ… SafeStore initialized successfully!")
        print(f"Database file is at: {selected_name}_store.db")
        store.close()
    except Exception as e:
        print(f"\nāŒ Failed to initialize SafeStore: {e}")


if __name__ == "__main__":
    interactive_vectorizer_setup()
```
This script demonstrates how the self-documenting nature of `safe_store` enables you to build powerful, user-friendly applications on top of it.

---

## Core Concepts for Advanced RAG

### Understanding Tokenization for Chunking
`safe_store` can chunk your documents based on character count (`character` strategy) or token count (`token` strategy). Using the `token` strategy is often more effective as it aligns better with how Large Language Models (LLMs) process text.

When you select `chunking_strategy='token'`, `safe_store` intelligently handles tokenization:

1.  **Vectorizer's Native Tokenizer:** If the chosen vectorizer (like a local `sentence-transformers` model) has its own tokenizer, `safe_store` will use it. This is the most accurate method, as the chunking tokens will perfectly match the vectorizer's tokens.

2.  **Fallback to `tiktoken`:** Some vectorizers, especially those accessed via an API (like OpenAI or Cohere), do not expose their tokenizer for local use. In these cases, `safe_store` uses `tiktoken` (specifically the `cl100k_base` encoding) as a reliable fallback. `tiktoken` is the tokenizer used by modern OpenAI models and provides a close approximation for many other models, so your chunks stay near the intended token size.

You can also specify a custom tokenizer during `SafeStore` initialization if you have specific needs.
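As a rough illustration, the sketch below counts tokens with the same `cl100k_base` encoding the fallback uses, then initializes a token-chunked store. The `chunk_size` and `chunk_overlap` keyword names are illustrative assumptions, not confirmed parts of the API:

```python
import tiktoken
import safe_store

# Count tokens the way the cl100k_base fallback would.
enc = tiktoken.get_encoding("cl100k_base")
sample = "Semantic search finds passages by meaning, not keywords."
print(f"{len(enc.encode(sample))} tokens")

# Hedged sketch: token-based chunking at initialization.
store = safe_store.SafeStore(
    db_path="tokens.db",
    vectorizer_name="st",
    chunking_strategy="token",  # strategy named in the docs above
    chunk_size=512,             # assumed name: target tokens per chunk
    chunk_overlap=64,           # assumed name: tokens shared between chunks
)
```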

### Enriching Your Data with Metadata
Metadata is extra information about your documents that provides crucial context. You can attach a dictionary of key-value pairs to any document you add to `safe_store`. This metadata is then automatically used to enrich the search results, leading to more accurate and context-aware answers in your RAG pipeline.

**How to Add Metadata:**
Simply pass a dictionary to the `metadata` parameter when adding content.

```python
# Example of adding a document with metadata
doc_info = {
    "title": "Quantum Entanglement in Nanostructures",
    "author": "Dr. Alice Smith",
    "year": 2024,
    "topic": "Quantum Physics"
}

with store:
    store.add_document(
        "path/to/research_paper.txt",
        metadata=doc_info
    )
```

**How Metadata is Used:**
When you perform a `query`, `safe_store` finds the most relevant text chunks. Before returning a chunk, it automatically prepends the metadata of the source document to the chunk's text.

This means the context you feed into your LLM is not just the raw text, but a richer snippet that looks like this:

```text
--- Document Context ---
Title: Quantum Entanglement in Nanostructures
Author: Dr. Alice Smith
Year: 2024
Topic: Quantum Physics
------------------------

...the actual text from the document chunk begins here, discussing the specifics of entanglement in nanostructures...
```
This "just-in-time" context injection dramatically improves the LLM's ability to understand the source and relevance of the information, leading to better-quality responses without any extra work on your part.

---

## šŸ Quick Start Guide

This example shows the end-to-end workflow: indexing a document, then building and querying a knowledge graph of its **instances** using a simple string-based ontology.

```python
import safe_store
from safe_store import GraphStore, LogLevel
from lollms_client import LollmsClient
from pathlib import Path
import shutil

# --- 0. Configuration & Cleanup ---
DB_FILE = "quickstart.db"
DOC_DIR = Path("temp_docs_qs")
if DOC_DIR.exists(): shutil.rmtree(DOC_DIR)
DOC_DIR.mkdir()
Path(DB_FILE).unlink(missing_ok=True)

# --- 1. LLM Executor & Sample Document ---
def llm_executor(prompt: str) -> str:
    try:
        client = LollmsClient()
        return client.generate_code(prompt, language="json", temperature=0.1) or ""
    except Exception as e:
        raise ConnectionError(f"LLM call failed: {e}")

doc_path = DOC_DIR / "doc.txt"
doc_path.write_text("Dr. Aris Thorne is the CEO of QuantumLeap AI, a firm in Geneva.")

# --- 2. Level 1: Semantic Search with SafeStore ---
print("--- LEVEL 1: SEMANTIC SEARCH ---")
store = safe_store.SafeStore(db_path=DB_FILE, vectorizer_name="st", log_level=LogLevel.INFO)
with store:
    store.add_document(doc_path)
    results = store.query("who leads the AI firm in Geneva?", top_k=1)
    print(f"Semantic search result: '{results['chunk_text']}'")

# --- 3. Level 2: Knowledge Graph with GraphStore ---
print("\n--- LEVEL 2: KNOWLEDGE GRAPH ---")
ontology = "Extract People and Companies. A Person can be a CEO_OF a Company."
try:
    graph_store = GraphStore(store=store, llm_executor_callback=llm_executor, ontology=ontology)
    with graph_store:
        graph_store.build_graph_for_all_documents()
        graph_result = graph_store.query_graph("Who is the CEO of QuantumLeap AI?", output_mode="graph_only")
        
        print("Graph query result:")
        for rel in graph_result.get('relationships', []):
            source = rel['source_node']['properties'].get('identifying_value')
            target = rel['target_node']['properties'].get('identifying_value')
            print(f"- Relationship: '{source}' --[{rel['type']}]--> '{target}'")
except ConnectionError as e:
    print(f"[SKIP] GraphStore part failed: {e}")

store.close()
```

---

## āš™ļø Installation

```bash
pip install safe-store
```
Install optional dependencies for the features you need:

```bash
# For Sentence Transformers (recommended for local use)
pip install safe-store[sentence-transformers]

# For API-based vectorizers
pip install safe-store[openai,ollama,cohere]

# For parsing PDF, DOCX, etc.
pip install safe-store[parsing]

# For encryption
pip install safe-store[encryption]

# To install everything:
pip install safe-store[all] 
```
---

## šŸ’” API Highlights

#### `SafeStore` (The Foundation)
*   `__init__(db_path, vectorizer_name, ...)`: Creates or loads a database. The vectorizer is locked in at creation.
*   `add_document(path, ...)`: Parses, chunks, vectorizes, and stores a document or an entire folder.
*   `query(query_text, top_k, ...)`: Performs a semantic search and returns the most relevant text chunks for your RAG pipeline.

#### `GraphStore` (The Intelligence Layer)
*   `__init__(store, llm_executor_callback, ontology)`: Creates the graph manager on an existing `SafeStore` instance.
*   `build_graph_for_all_documents()`: Scans documents and uses an LLM to build the knowledge graph based on your ontology.
*   `query_graph(natural_language_query, ...)`: Translates a question into a graph traversal, returning nodes, relationships, and/or the original source text.
*   `add_node(...)`, `add_relationship(...)`: Manually edit the graph to add your own expert knowledge.
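As a rough sketch of manual editing (every keyword name below is a hypothetical illustration, not a confirmed signature):

```python
# Hypothetical keyword names, for illustration only. `identifying_value`
# mirrors the property name seen in the Quick Start output.
with graph_store:
    person = graph_store.add_node(
        label="Person", properties={"identifying_value": "Ada Lovelace"}
    )
    concept = graph_store.add_node(
        label="Concept", properties={"identifying_value": "Analytical Engine"}
    )
    graph_store.add_relationship(source=person, target=concept, type="WORKED_ON")
```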

---

## šŸ¤ Contributing & License

Contributions are highly welcome! Please open an issue to discuss a new feature or submit a pull request on [GitHub](https://github.com/ParisNeo/safe_store).

Licensed under Apache 2.0. See [LICENSE](LICENSE).
            
