kara-toolkit

Name: kara-toolkit
Version: 0.2.0
Summary: Knowledge-Aware Re-embedding Algorithm - Efficient RAG knowledge base updates
Upload time: 2025-07-12 21:26:17
Requires Python: >=3.9
License: CC-BY License
Keywords: rag, embeddings, knowledge-base, nlp, langchain, llamaindex
# KARA - Efficient RAG Knowledge Base Updates

[![CI](https://github.com/mzakizadeh/kara/workflows/CI/badge.svg)](https://github.com/mzakizadeh/kara/actions)
[![PyPI version](https://badge.fury.io/py/kara-toolkit.svg)](https://badge.fury.io/py/kara-toolkit)
[![Code style: ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)
[![License: CC BY 4.0](https://img.shields.io/badge/License-CC%20BY%204.0-blue.svg)](https://creativecommons.org/licenses/by/4.0/)
<!-- [![Downloads](https://static.pepy.tech/badge/kara-toolkit)](https://pepy.tech/project/kara-toolkit) -->

> **KARA** stands for **Knowledge-Aware Reembedding Algorithm**. The word "Kara" (کارآ) also means "efficient" in Persian.

KARA is a Python library that makes knowledge base updates efficient by cutting unnecessary embedding operations. When documents change, it identifies chunks that are unchanged, reuses their existing embeddings, and re-embeds only new or modified content.

## How It Works

KARA formulates chunking as a graph optimization problem:
1. Creates a DAG where nodes are split positions and edges are potential chunks
2. Uses Dijkstra's algorithm to find optimal chunking paths
3. Automatically reuses existing chunks to minimize embedding costs
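The steps above can be sketched as a shortest-path search. The sketch below is illustrative only (the function name and cost model are assumptions, not KARA's actual code): nodes are split positions, every candidate chunk between two positions is an edge, reused chunks get zero weight, and each new chunk pays an embedding cost.

```python
# Illustrative sketch of DAG-based rechunking (not the library's code):
# nodes are split positions, edges are candidate chunks, and edge
# weights favor chunks whose text was already embedded.

import heapq

def rechunk(parts, existing, max_parts=3, reuse_cost=0.0, new_cost=1.0):
    """Pick a chunking of `parts` (list of text pieces) that reuses
    chunks from `existing` (a set of previously embedded chunk texts)."""
    n = len(parts)
    dist = [float("inf")] * (n + 1)
    back = [None] * (n + 1)
    dist[0] = 0.0
    heap = [(0.0, 0)]
    while heap:
        d, i = heapq.heappop(heap)
        if d > dist[i]:
            continue  # stale heap entry
        # Each edge i -> j is the chunk made of parts[i:j].
        for j in range(i + 1, min(i + max_parts, n) + 1):
            chunk = "".join(parts[i:j])
            w = reuse_cost if chunk in existing else new_cost
            if d + w < dist[j]:
                dist[j] = d + w
                back[j] = i
                heapq.heappush(heap, (d + w, j))
    # Walk the back-pointers to recover the chosen chunks.
    chunks, j = [], n
    while j > 0:
        i = back[j]
        chunks.append("".join(parts[i:j]))
        j = i
    return chunks[::-1]
```

In the real library, `KARAUpdater` performs this optimization internally; its cost model also accounts for chunk-size uniformity via `imperfect_chunk_tolerance`.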

<!-- Typical efficiency gains: 70-90% fewer embeddings for document updates. -->

## Installation

```bash
pip install kara-toolkit

# With LangChain integration
pip install "kara-toolkit[langchain]"
```

## Key Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `imperfect_chunk_tolerance` | `int` | `9` | Controls the trade-off between reusing existing chunks and creating new, perfectly sized ones.<br><br>- `0`: No tolerance; disables chunk reuse.<br>- `1`: Prefers one new chunk over two imperfect ones.<br>- `9`: Balanced default.<br>- `99+`: Maximizes reuse, less uniform sizes. |
| `chunk_size` | `int` | `500` | Target size (in characters) for each text chunk. |
| `separators` | `List[str]` | `["\n\n", "\n", " "]` | List of strings used to split the text. If not provided, uses default separators from `RecursiveCharacterChunker`. |
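One way to read the `imperfect_chunk_tolerance` rules of thumb above (a plausible cost model for illustration, not necessarily the library's exact formula): charge roughly `tolerance` units for embedding a fresh, well-sized chunk and 1 unit for each imperfectly sized reused chunk, then pick the cheaper plan.

```python
def cheaper_plan(num_imperfect_reused, tolerance):
    """Hypothetical helper illustrating the tolerance knob.

    Cost model (an assumption, for illustration only): embedding one
    fresh chunk costs `tolerance` units; keeping each imperfectly
    sized reused chunk costs 1 unit.
    """
    reuse_cost = num_imperfect_reused  # 1 unit per imperfect chunk kept
    new_cost = tolerance               # embedding one fresh chunk
    return "reuse" if reuse_cost < new_cost else "embed_new"
```

Under this reading, `tolerance=1` makes one new chunk beat two imperfect reused ones, `tolerance=0` makes new chunks free (disabling reuse), and large values make reuse win almost always.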

## Quick Start

```python
from kara import KARAUpdater, RecursiveCharacterChunker

# Initialize
chunker = RecursiveCharacterChunker(chunk_size=500)
updater = KARAUpdater(chunker=chunker, imperfect_chunk_tolerance=9)

# Process initial documents
result = updater.create_knowledge_base(["Your document content..."])

# Update with new content - reuses existing chunks automatically
update_result = updater.update_knowledge_base(
    result.new_chunked_doc,
    ["Updated document content..."]
)

print(f"Efficiency: {update_result.efficiency_ratio:.1%}")
print(f"Chunks reused: {update_result.num_reused}")
```

## LangChain Integration

```python
from kara.integrations.langchain import KARATextSplitter
from langchain_core.documents import Document

# Use as a drop-in replacement for LangChain text splitters
splitter = KARATextSplitter(chunk_size=300, imperfect_chunk_tolerance=2)

docs = [Document(page_content="Your content...", metadata={"source": "file.pdf"})]
chunks = splitter.split_documents(docs)
```


## Examples

See [`examples/`](examples/) for complete usage examples.


## Roadmap to 1.0.0

- [ ] **100% Test Coverage** - Complete test suite with full coverage
- [ ] **Performance Benchmarks** - Real-world efficiency testing
- [ ] **Framework Support** - LlamaIndex, Haystack, and others
- [ ] **Complete Documentation** - API reference, guides, and examples

## License

CC BY 4.0 License - see [LICENSE](LICENSE) file for details.

            
