# Smart Semantic Cache
A high-performance, multi-layered semantic caching system that dramatically reduces LLM costs and latency through intelligent similarity-based response caching.
## Why Semantic Cache?
Traditional caching requires **exact** query matches. Semantic caching understands that "What's the capital of France?" and "Tell me France's capital city" should return the same cached result. This can reduce your LLM API costs by 60–90% in real applications.
## Key Features
* **Multi-Layer Intelligence**: Memory cache → SQLite → FAISS vector similarity
* **Lightning Fast**: Sub-millisecond memory lookups, <10ms semantic search
* **Configurable Similarity**: Fine-tune cache hit sensitivity (0.0–1.0)
* **Memory Efficient**: Optional FAISS quantization for large-scale deployments
* **Async Ready**: Full async support for high-throughput applications
* **Rich Metrics**: Comprehensive performance monitoring and analytics
* **Smart Eviction**: LRU-based cache management with intelligent cleanup
* **Production Ready**: Thread-safe, error-resilient, battle-tested
## Installation
```bash
pip install thinkcache
```
### Optional Dependencies
```bash
# For GPU acceleration (recommended for production)
pip install thinkcache[quantization]

# For development and testing
pip install thinkcache[dev]
```
## Quick Start
### Method 1: Global Cache Setup (Recommended)
```python
from thinkcache import ensure_semantic_cache
from langchain_openai import OpenAI
# Initialize the semantic cache globally: a one-line setup
ensure_semantic_cache(
    similarity_threshold=0.15,
    max_cache_size=1000
)

llm = OpenAI(temperature=0)

# The first call hits the LLM API; the paraphrases that follow should
# be served from the semantic cache.
response1 = llm.invoke("What is the capital of France?")
response2 = llm.invoke("Tell me the capital city of France")
response3 = llm.invoke("France's capital is?")
```
### Method 2: Direct Cache Usage
```python
from thinkcache import SemanticCache
from langchain.globals import set_llm_cache
cache = SemanticCache(
    database_path="./production_cache.db",
    faiss_index_path="./vector_cache",
    similarity_threshold=0.15,
    max_cache_size=5000,
    memory_cache_size=1000,
    enable_quantization=True
)
set_llm_cache(cache)
```
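Direct instantiation is the better fit when you need several independent caches (see **Multiple Cache Instances** below) or want explicit control over which cache LangChain uses globally.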
## Configuration Methods
### Global Configuration (Before First Use)
```python
from thinkcache import configure_semantic_cache, ensure_semantic_cache

# Set options before the cache is first created...
configure_semantic_cache(
    database_path="./my_cache.db",
    similarity_threshold=0.15,
    max_cache_size=2000
)

# ...then initialize it with those settings.
ensure_semantic_cache()
```
### Runtime Configuration
```python
from thinkcache import ensure_semantic_cache
cache = ensure_semantic_cache(
    similarity_threshold=0.2,
    database_path="./cache.db",
    faiss_index_path="./vectors",
    max_cache_size=1000,
    memory_cache_size=500,
    batch_size=20,
    enable_quantization=False
)
```
### Production Configuration
```python
from thinkcache import configure_semantic_cache
configure_semantic_cache(
    database_path="/var/cache/semantic/cache.db",
    faiss_index_path="/var/cache/semantic/vectors",
    similarity_threshold=0.15,
    max_cache_size=10000,
    memory_cache_size=2000,
    enable_quantization=True,
    batch_size=50
)
```
## Cache Management
### Getting Cache Instance
```python
from thinkcache import get_semantic_cache
cache = get_semantic_cache()
if cache:
    print("Cache is active and ready!")
else:
    print("No cache initialized yet")
```
### Resetting Cache
```python
from thinkcache import configure_semantic_cache, reset_semantic_cache

# Discard the current cache instance...
reset_semantic_cache()

# ...so it can be reconfigured from scratch.
configure_semantic_cache(similarity_threshold=0.1)
```
### Handling Already Initialized Cache
```python
from thinkcache import configure_semantic_cache
try:
    configure_semantic_cache(similarity_threshold=0.1)
except ValueError:
    # configure_semantic_cache() raises once a cache already exists;
    # reset first, then reconfigure.
    print("Cache already initialized!")
    from thinkcache import reset_semantic_cache
    reset_semantic_cache()
    configure_semantic_cache(similarity_threshold=0.1)
```
## Performance Monitoring
### Real-time Metrics
```python
from thinkcache import get_semantic_cache
cache = get_semantic_cache()
if cache:
    metrics = cache.get_metrics()

    print(f"Cache Hit Rate: {metrics['hit_rate']:.1%}")
    print(f"Total Requests: {metrics['total_requests']:,}")
    print(f"Memory Hits: {metrics['memory_hits']:,}")
    print(f"Semantic Hits: {metrics['semantic_hits']:,}")
    print(f"Avg Embedding Time: {metrics['avg_embedding_time']:.3f}s")
    print(f"Avg Search Time: {metrics['avg_search_time']:.3f}s")
```
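For ongoing visibility, you can poll these metrics on a timer. Below is a minimal sketch; the 60-second cadence and logger setup are assumptions, while `get_metrics()` and its keys are as documented above.

```python
import logging
import threading

from thinkcache import get_semantic_cache

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("thinkcache.metrics")

def log_metrics_periodically(interval_seconds: float = 60.0) -> None:
    """Log the hit rate on a fixed interval until the process exits."""
    cache = get_semantic_cache()
    if cache:
        metrics = cache.get_metrics()
        log.info("hit_rate=%.1f%% total_requests=%d",
                 metrics["hit_rate"] * 100, metrics["total_requests"])
    # Re-arm a daemon timer so the loop never blocks shutdown.
    timer = threading.Timer(interval_seconds, log_metrics_periodically,
                            args=(interval_seconds,))
    timer.daemon = True
    timer.start()

log_metrics_periodically()
```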
### Cache Cleanup
```python
from thinkcache import get_semantic_cache
cache = get_semantic_cache()
if cache:
    cache.clear_cache()
    print("All caches cleared!")
```
## Architecture Overview
```
Query → Memory Cache → SQLite Cache → Semantic Search → LLM API
  ↓          ↓              ↓               ↓              ↓
 <1ms      ~1-2ms         ~2-5ms         ~5-15ms      100-2000ms
```
### How It Works
1. **Memory Cache**: Lightning-fast LRU cache for recently accessed queries
2. **SQLite Cache**: Persistent exact-match cache with indexing
3. **Semantic Search**: FAISS-powered vector similarity search
4. **Embedding Cache**: Cached embeddings to avoid recomputation
5. **Smart Eviction**: Automatic cleanup based on usage patterns
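Putting the layers together, the lookup is a simple fall-through. The sketch below is illustrative only; the attribute and method names (`memory`, `sqlite`, `vectors`, `embed`) are hypothetical, not thinkcache's real internals.

```python
# Hypothetical sketch of the multi-layer lookup order; in practice
# SemanticCache implements LangChain's cache interface.
def lookup(cache, query: str):
    # Layer 1: in-process LRU cache, exact match (<1ms).
    if (hit := cache.memory.get(query)) is not None:
        return hit
    # Layer 2: persistent SQLite cache, exact match.
    if (hit := cache.sqlite.get(query)) is not None:
        cache.memory.put(query, hit)  # promote hot entries to memory
        return hit
    # Layer 3: FAISS similarity search over cached query embeddings.
    vector = cache.embed(query)  # embeddings themselves are cached
    neighbor, distance = cache.vectors.nearest(vector)
    if neighbor is not None and distance <= cache.similarity_threshold:
        return cache.sqlite.get(neighbor)
    # Miss on every layer: the caller falls through to the LLM API,
    # and the fresh response is written back to the cache layers.
    return None
```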
## Advanced Usage
### Async Operations
```python
import asyncio
from thinkcache import ensure_semantic_cache
from langchain_openai import OpenAI
ensure_semantic_cache()

async def cached_queries():
    llm = OpenAI()
    # Three phrasings of the same question; paraphrases benefit from
    # the semantic cache once an entry exists.
    tasks = [
        llm.ainvoke("Explain quantum computing"),
        llm.ainvoke("What is quantum computing?"),
        llm.ainvoke("Define quantum computing")
    ]
    results = await asyncio.gather(*tasks)
    return results

results = asyncio.run(cached_queries())
```
### Custom Similarity Thresholds
```python
from thinkcache import configure_semantic_cache
# These are alternatives; configure once, or call reset_semantic_cache()
# between changes. Lower thresholds match more strictly.
configure_semantic_cache(similarity_threshold=0.1)  # strict: near-duplicates only
configure_semantic_cache(similarity_threshold=0.4)  # loose: broader paraphrase matching
configure_semantic_cache(similarity_threshold=0.2)  # balanced starting point
```
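A practical way to pick a value is to replay a handful of known paraphrases and watch the semantic hit counter. A rough sketch using the documented metrics (the paraphrase list is only an example):

```python
from langchain_openai import OpenAI
from thinkcache import ensure_semantic_cache

cache = ensure_semantic_cache(similarity_threshold=0.2)
llm = OpenAI(temperature=0)

# The first query misses; the rephrasings should land as semantic hits.
paraphrases = [
    "What is the capital of France?",
    "Tell me France's capital city",
    "Name the capital of France",
]
for question in paraphrases:
    llm.invoke(question)

metrics = cache.get_metrics()
print(f"Semantic hits: {metrics['semantic_hits']}")
# Zero semantic hits: raise the threshold.
# Unrelated queries matching: lower it.
```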
### Multiple Cache Instances
```python
from thinkcache import SemanticCache
# Stricter matching for factual Q&A...
qa_cache = SemanticCache(
    database_path="./qa_cache.db",
    similarity_threshold=0.15
)

# ...looser matching for summarization requests.
summarization_cache = SemanticCache(
    database_path="./summary_cache.db",
    similarity_threshold=0.25
)
```
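LangChain routes calls through a single global cache, so activate whichever instance fits the current workload with `set_llm_cache()` (as in Method 2 above). Continuing the example:

```python
from langchain.globals import set_llm_cache

# Route Q&A traffic through the stricter cache...
set_llm_cache(qa_cache)

# ...and switch before running the summarization workload.
set_llm_cache(summarization_cache)
```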
## Complete Workflow Example
```python
from thinkcache import (
    configure_semantic_cache,
    ensure_semantic_cache,
    get_semantic_cache,
    reset_semantic_cache
)
from langchain_openai import OpenAI

# 1. Configure before first use.
configure_semantic_cache(
    similarity_threshold=0.2,
    max_cache_size=5000,
    enable_quantization=True
)

# 2. Initialize and activate the cache.
cache = ensure_semantic_cache()

# 3. Use the LLM as usual; responses are cached transparently.
llm = OpenAI(temperature=0)
response = llm.invoke("What is machine learning?")

# 4. Inspect performance.
metrics = cache.get_metrics()
print(f"Hit rate: {metrics['hit_rate']:.1%}")

# 5. Reset and reconfigure when needed.
reset_semantic_cache()
configure_semantic_cache(similarity_threshold=0.1)
```
## Configuration Reference
| Parameter | Default | Description |
| ---------------------- | ------------------------ | ---------------------------------------- |
| `database_path` | `.langchain.db` | SQLite database file path |
| `faiss_index_path` | `./semantic_cache_index` | FAISS vector index directory |
| `similarity_threshold` | `0.5` | Semantic similarity threshold (0.0–1.0) |
| `max_cache_size` | `1000` | Maximum entries in vector store |
| `memory_cache_size` | `100` | Maximum entries in memory cache |
| `batch_size` | `10` | Batch size for vector operations |
| `enable_quantization` | `False` | Enable FAISS quantization for efficiency |
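Every parameter is optional; anything you omit keeps the default listed above. For example:

```python
from thinkcache import ensure_semantic_cache

# Only the threshold is overridden; database_path, faiss_index_path,
# and the size limits keep their table defaults.
cache = ensure_semantic_cache(similarity_threshold=0.2)
```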
## Troubleshooting
**Cache not working?**
```python
from thinkcache import get_semantic_cache
cache = get_semantic_cache()
print(f"Cache active: {cache is not None}")
```
**Configuration errors?**
```python
# "Cache already initialized" errors: reset, then reconfigure.
from thinkcache import reset_semantic_cache, configure_semantic_cache
reset_semantic_cache()
configure_semantic_cache(similarity_threshold=0.3)
```
**Low hit rates?**
```python
# Raise the threshold so paraphrases match more loosely.
from thinkcache import reset_semantic_cache, configure_semantic_cache
reset_semantic_cache()
configure_semantic_cache(similarity_threshold=0.3)
```
**Memory issues?**
```python
from thinkcache import configure_semantic_cache
# Quantization shrinks the FAISS index footprint.
configure_semantic_cache(enable_quantization=True)
```
## Performance Tips
1. Start with a similarity threshold of 0.2
2. Use `configure_semantic_cache()` for production setups
3. Enable quantization for large caches
4. Use a larger memory cache in production
5. Monitor hit rates and adjust the threshold accordingly
6. Reset the cache regularly during development
## Requirements
* Python 3.8+
* Core: FAISS, HuggingFace Transformers, SQLite
* Optional: faiss-gpu (for GPU acceleration)
## License
MIT License – see [LICENSE](LICENSE) file for details.
## Changelog
### v0.1.1
* Multi-layer caching system
* FAISS integration with quantization
* Comprehensive metrics and monitoring
* Full async support