# KARA - Efficient RAG Knowledge Base Updates
[Build Status](https://github.com/mzakizadeh/kara/actions)
[PyPI](https://badge.fury.io/py/kara-toolkit)
[Ruff](https://github.com/astral-sh/ruff)
[License: CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)
<!-- [](https://pepy.tech/project/kara-toolkit) -->
> **KARA** stands for **Knowledge-Aware Reembedding Algorithm**. The word "Kara" (کارآ) also means "efficient" in Persian.
KARA is a Python library that efficiently updates knowledge bases by reducing unnecessary embedding operations. When documents change, KARA automatically identifies and reuses existing chunks, minimizing the need for new embeddings.
## How It Works
KARA formulates chunking as a graph optimization problem:
1. Creates a DAG where nodes are split positions and edges are potential chunks
2. Uses Dijkstra's algorithm to find optimal chunking paths
3. Automatically reuses existing chunks to minimize embedding costs
<!-- Typical efficiency gains: 70-90% fewer embeddings for document updates. -->
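The steps above can be sketched in a few lines. This is a toy illustration of the idea, not KARA's actual implementation: nodes are split positions, every candidate chunk is an edge, an edge costs nothing when its text already has an embedding, and Dijkstra's algorithm picks the chunking that needs the fewest new embeddings.

```python
import heapq

def cheapest_chunking(positions, existing_chunks, text):
    """Find the chunking that minimizes new embeddings (toy sketch).

    `positions` are the allowed split offsets into `text` (a DAG's nodes);
    each pair (i, j) with i < j is an edge whose chunk is text[positions[i]:positions[j]].
    Reusing an already-embedded chunk costs 0; a fresh embedding costs 1.
    """
    n = len(positions)
    dist = {0: 0}
    prev = {}
    heap = [(0, 0)]  # (cost so far, node index)
    while heap:
        cost, i = heapq.heappop(heap)
        if cost > dist.get(i, float("inf")):
            continue  # stale heap entry
        if i == n - 1:
            break  # reached the end of the document
        for j in range(i + 1, n):
            chunk = text[positions[i]:positions[j]]
            edge = 0 if chunk in existing_chunks else 1
            if cost + edge < dist.get(j, float("inf")):
                dist[j] = cost + edge
                prev[j] = i
                heapq.heappush(heap, (cost + edge, j))
    # Walk back through `prev` to recover the chosen chunk boundaries
    path = [n - 1]
    while path[-1] != 0:
        path.append(prev[path[-1]])
    path.reverse()
    return [text[positions[a]:positions[b]] for a, b in zip(path, path[1:])]
```

A real implementation would also penalize chunks that stray from the target size (which is what `imperfect_chunk_tolerance` tunes), but the shortest-path structure is the same.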
## Installation
```bash
pip install kara-toolkit
# With LangChain integration
pip install "kara-toolkit[langchain]"
```
## Key Parameters
| Parameter | Type | Default | Description |
|-----------------------------|--------------|---------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `imperfect_chunk_tolerance` | `int` | `9` | Controls the trade-off between reusing existing chunks and creating new, perfectly-sized ones.<br><br>- `0`: No tolerance; disables chunk reuse.<br>- `1`: Prefers one new chunk over two imperfect reused ones.<br>- `9`: Balanced default.<br>- `99+`: Maximizes reuse at the cost of less uniform chunk sizes. |
| `chunk_size` | `int` | `500` | Target size (in characters) for each text chunk. |
| `separators` | `List[str]` | `["\n\n", "\n", " "]` | List of strings used to split the text. If not provided, uses default separators from `RecursiveCharacterChunker`. |
## Quick Start
```python
from kara import KARAUpdater, RecursiveCharacterChunker

# Initialize the chunker and updater
chunker = RecursiveCharacterChunker(chunk_size=500)
updater = KARAUpdater(chunker=chunker, imperfect_chunk_tolerance=9)

# Process the initial documents
result = updater.create_knowledge_base(["Your document content..."])

# Update with new content - existing chunks are reused automatically
update_result = updater.update_knowledge_base(
    result.new_chunked_doc,
    ["Updated document content..."],
)

print(f"Efficiency: {update_result.efficiency_ratio:.1%}")
print(f"Chunks reused: {update_result.num_reused}")
```
## LangChain Integration
```python
from kara.integrations.langchain import KARATextSplitter
from langchain_core.documents import Document

# Use as a drop-in replacement for LangChain text splitters
splitter = KARATextSplitter(chunk_size=300, imperfect_chunk_tolerance=2)

docs = [Document(page_content="Your content...", metadata={"source": "file.pdf"})]
chunks = splitter.split_documents(docs)
```
## Examples
See [`examples/`](examples/) for complete usage examples.
## Roadmap to 1.0.0
- [ ] **100% Test Coverage** - Complete test suite with full coverage of the public API
- [ ] **Performance Benchmarks** - Real-world efficiency testing
- [ ] **Framework Support** - LlamaIndex, Haystack, and others
- [ ] **Complete Documentation** - API reference, guides, and examples
## License
CC BY 4.0 License - see [LICENSE](LICENSE) file for details.