kara-toolkit


Name: kara-toolkit
Version: 0.2.1
Summary: Knowledge-Aware Re-embedding Algorithm - Efficient RAG knowledge base updates
Author: Mahdi Zakizadeh <mzakizadeh.me@gmail.com>
Upload time: 2025-07-13 10:05:15
Requires Python: >=3.9
License: CC BY 4.0
Keywords: rag, embeddings, knowledge-base, nlp, langchain, llamaindex
Homepage: https://github.com/mzakizadeh/kara
Documentation: https://kara-toolkit.readthedocs.io
Bug Tracker: https://github.com/mzakizadeh/kara-toolkit/issues
# KARA - Efficient RAG Knowledge Base Updates

[![CI](https://github.com/mzakizadeh/kara/workflows/CI/badge.svg)](https://github.com/mzakizadeh/kara/actions)
[![PyPI version](https://badge.fury.io/py/kara-toolkit.svg)](https://badge.fury.io/py/kara-toolkit)
[![Code style: ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)
[![License: CC BY 4.0](https://img.shields.io/badge/License-CC%20BY%204.0-blue.svg)](https://creativecommons.org/licenses/by/4.0/)
<!-- [![Downloads](https://static.pepy.tech/badge/kara-toolkit)](https://pepy.tech/project/kara-toolkit) -->

> **KARA** stands for **Knowledge-Aware Re-embedding Algorithm**. The word "Kara" (کارآ) also means "efficient" in Persian.

KARA is a Python library that efficiently updates knowledge bases by reducing unnecessary embedding operations. When documents change, KARA automatically identifies and reuses existing chunks, minimizing the need for new embeddings.

## Installation

```bash
pip install kara-toolkit

# With LangChain integration
pip install "kara-toolkit[langchain]"
```

## Key Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `imperfect_chunk_tolerance` | `int` | `9` | Controls the trade-off between reusing existing chunks and creating new, perfectly sized ones.<br><br>- `0`: no tolerance for imperfect chunks; disables chunk reuse.<br>- `1`: prefers one new chunk over two imperfect ones.<br>- `9`: balanced default.<br>- `99+`: maximizes reuse at the cost of less uniform chunk sizes. |
| `chunk_size` | `int` | `500` | Target size (in characters) for each text chunk. |
| `separators` | `List[str]` | `["\n\n", "\n", " "]` | Strings used to split the text, tried in order. If not provided, the default separators from `RecursiveCharacterChunker` are used. |
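
For instance, a chunker tuned for Markdown sources might prefer heading and paragraph boundaries. This is a minimal sketch; it assumes `separators` is accepted by the `RecursiveCharacterChunker` constructor, which the table above implies but which we have not spelled out elsewhere:

```python
from kara import RecursiveCharacterChunker

# Try heading breaks first, then paragraphs, lines, and finally word breaks.
chunker = RecursiveCharacterChunker(
    chunk_size=500,
    separators=["\n## ", "\n\n", "\n", " "],
)
```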

## Quick Start

```python
from kara import KARAUpdater, RecursiveCharacterChunker

# Initialize
chunker = RecursiveCharacterChunker(chunk_size=500)
updater = KARAUpdater(chunker=chunker, imperfect_chunk_tolerance=9)

# Process initial documents
result = updater.create_knowledge_base(["Your document content..."])

# Update with new content - reuses existing chunks automatically
update_result = updater.update_knowledge_base(
    result.new_chunked_doc,
    ["Updated document content..."]
)

print(f"Efficiency: {update_result.efficiency_ratio:.1%}")
print(f"Chunks reused: {update_result.num_reused}")
```

## LangChain Integration

```python
from kara.integrations.langchain import KARATextSplitter
from langchain_core.documents import Document

# Use as a drop-in replacement for LangChain text splitters
splitter = KARATextSplitter(chunk_size=300, imperfect_chunk_tolerance=2)

docs = [Document(page_content="Your content...", metadata={"source": "file.pdf"})]
chunks = splitter.split_documents(docs)
```


## Examples

See [`examples/`](examples/) for complete usage examples.


## How It Works

KARA formulates chunking as a graph optimization problem:
1. Creates a DAG where nodes are split positions and edges are potential chunks
2. Uses Dijkstra's algorithm to find optimal chunking paths
3. Automatically reuses existing chunks to minimize embedding costs
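
The sketch below illustrates the general idea; it is not KARA's actual implementation. The cost model is our own guess at the semantics of `imperfect_chunk_tolerance`: reusing an existing chunk costs only its deviation from the target size, while a new chunk additionally pays `tolerance` for the fresh embedding it would require, so `0` removes any preference for reuse.

```python
import heapq
from typing import List, Set, Tuple

def cheapest_chunking(
    text: str,
    split_points: List[int],    # positions where a separator allows a cut
    existing_chunks: Set[str],  # chunk texts that are already embedded
    chunk_size: int = 500,
    tolerance: int = 9,
) -> List[str]:
    """Find a minimum-cost chunking via Dijkstra on a DAG of split positions."""
    points = sorted(set(split_points) | {0, len(text)})
    n = len(points)
    dist = [float("inf")] * n
    prev = [-1] * n
    dist[0] = 0.0
    heap: List[Tuple[float, int]] = [(0.0, 0)]
    while heap:
        d, i = heapq.heappop(heap)
        if d > dist[i]:
            continue  # stale heap entry
        for j in range(i + 1, n):
            chunk = text[points[i]:points[j]]
            if len(chunk) > 2 * chunk_size:
                break  # prune edges far beyond the target size
            deviation = abs(len(chunk) - chunk_size) / chunk_size
            # Reuse costs only its size drift; a new chunk also pays
            # `tolerance` for the embedding it would require.
            cost = deviation if chunk in existing_chunks else tolerance + deviation
            if d + cost < dist[j]:
                dist[j], prev[j] = d + cost, i
                heapq.heappush(heap, (dist[j], j))
    # Walk the optimal path back from the end of the text.
    chunks, j = [], n - 1
    while j > 0:
        chunks.append(text[points[prev[j]]:points[j]])
        j = prev[j]
    return chunks[::-1]
```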

<!-- Typical efficiency gains: 70-90% fewer embeddings for document updates. -->


## Limitations

While KARA provides significant efficiency improvements for knowledge base updates, there are some current limitations to be aware of:

- **Document Version Dependency**: The biggest limitation is that KARA needs the previous version of each document to identify reusable chunks. You may be able to avoid storing it separately by reconstructing the text from the chunks already saved in your vector store (see the sketch after this list), reducing storage overhead. Even with this requirement, the approach compares favorably with LangChain's indexing solution ([documented here](https://python.langchain.com/docs/how_to/indexing/)), which maintains a separate SQL database of chunk hashes yet is far less efficient.

- **Chunking Configuration Changes**: You likely cannot change the splitting configuration (chunk size, separator characters) between updates, since doing so can invalidate the previously computed optimal chunking. We have not yet tested how much such changes impact performance.

- **No Chunk Overlap Support**: We currently do not support overlapping chunks, but we are investigating whether this feature can be added in future versions.
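
As a rough illustration of the reconstruction idea mentioned above: because chunks do not overlap and together cover the document, concatenating them in order recovers the original text. The `source` and `chunk_index` metadata keys below are hypothetical, not part of KARA's API:

```python
from typing import Iterable, Mapping, Tuple

def reconstruct_document(
    entries: Iterable[Tuple[str, Mapping]],  # (chunk_text, metadata) pairs
    source: str,
) -> str:
    """Concatenate a document's non-overlapping chunks back into its text."""
    parts = sorted(
        (meta["chunk_index"], text)
        for text, meta in entries
        if meta.get("source") == source
    )
    return "".join(text for _, text in parts)
```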

## Roadmap to 1.0.0

- [ ] **100% Test Coverage** - Complete test suite with full coverage
- [ ] **Performance Benchmarks** - Real-world efficiency testing
- [ ] **Framework Support** - LlamaIndex, Haystack, and others
- [ ] **Complete Documentation** - API reference, guides, and examples
- [ ] **Token-Based Optimal Chunking** - Extend algorithm to support token-based chunking strategies

## License

CC BY 4.0 License - see [LICENSE](LICENSE) file for details.

            
