chunklet 1.0.5

- **Summary:** A smart multilingual text chunker for LLMs, RAG, and beyond.
- **Upload time:** 2025-07-25 07:15:38
- **Requires Python:** >=3.9
- **Keywords:** nlp, chunking, text-splitting, llm, rag, ai, multilingual
- **Requirements:** mpire, loguru, sentence-splitter, sentsplit, slangid

# 📦 Chunklet: Smart Multilingual Text Chunker

![Version](https://img.shields.io/badge/version-1.0.5-blue)
![Stability](https://img.shields.io/badge/stability-stable-brightgreen)
![License: MIT](https://img.shields.io/badge/license-MIT-yellow)

> Chunk smarter, not harder — built for LLMs, RAG pipelines, and beyond.  
**Author:** Speedyk_005  
**Version:** 1.0.5 (🎉 first stable release)  
**License:** MIT

---

## 🚀 What’s New in v1.0.5 (Stable)

- ✅ **Stable Release:** v1.0.5 marks the first fully stable version after extensive refactoring.
- 🔄 **Multiple Refactor Steps:** Core code reorganized for clarity, maintainability, and performance.
- ➿ **True Clause-Level Overlap:** Overlap now occurs on natural clause boundaries (commas, semicolons, etc.) instead of just sentences, better preserving semantic flow (a rough sketch of the idea follows this list).
- 🛠️ **Improved Chunking Logic:** Enhanced fallback splitters and overlap calculations to handle edge cases gracefully.
- ⚡ **Optimized Batch Processing:** Parallel chunking now consistently respects token counters and offsets.
- 🧪 **Expanded Test Suite:** Comprehensive tests added for multilingual support, caching, and chunk correctness.
- 🧹 **Cleaner Output:** Logging filters and redundant docstrings removed to reduce noise during runs.

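Clause-level overlap, in rough form: the trailing clauses of each chunk are repeated at the head of the next one. The sketch below illustrates that idea only; it is not Chunklet's implementation, and the helper name and splitting regex are illustrative assumptions.

```python
import re

def clause_overlap_tail(prev_chunk: str, overlap_percent: float) -> str:
    """Illustrative only: return the trailing clauses of prev_chunk
    that would be prepended to the next chunk."""
    # Split on clause boundaries (commas, semicolons, ellipses), dropping empties.
    clauses = [c.strip() for c in re.split(r"[,;…]", prev_chunk) if c.strip()]
    keep = max(1, round(len(clauses) * overlap_percent / 100))
    return ", ".join(clauses[-keep:])

print(clause_overlap_tail("AI is advancing, ethics matter; build wisely.", 30))
# -> "build wisely."
```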
---

## 🔥 Why Chunklet?

| Feature | Why it’s elite |
|---------|----------------|
| ⛓️ **Hybrid Mode** | Combines token + sentence limits with guaranteed overlap — rare even in commercial stacks. |
| 🌐 **Multilingual Fallbacks** | CRF > Moses > Regex, with dynamic confidence detection. |
| ➿ **Clause-Level Overlap** | `overlap_percent` now operates at the **clause level**, preserving semantic flow across chunks using `, ; …` logic. |
| ⚡ **Parallel Batch Processing** | Multi-core acceleration with `mpire`. |
| ♻️ **LRU Caching** | Smart memoization via `functools.lru_cache`. |
| 🪄 **Pluggable Token Counters** | Swap in GPT-2, BPE, or your own tokenizer. |

---

## 🧩 Chunking Modes

Pick your flavor (a minimal usage sketch follows this list):

- `"sentence"` — chunk by sentence count only
- `"token"` — chunk by token count only
- `"hybrid"` — sentence + token thresholds respected with guaranteed overlap

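A minimal sketch of selecting a mode, reusing the `chunk()` call from the hybrid example below; the exact defaults (e.g. `overlap_percent`) are assumptions here:

```python
from chunklet import Chunklet

# Sentence mode caps chunks by sentence count only, so no token counter is needed.
chunker = Chunklet()
chunks = chunker.chunk(
    text="One sentence. Two sentences. Three sentences. Four sentences.",
    mode="sentence",
    max_sentences=2,
)
print(len(chunks))  # chunk count depends on defaults such as overlap_percent
```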
---

## 📦 Installation

```bash
pip install chunklet
```

```bash
git clone https://github.com/Speedyk-005/chunklet.git
cd chunklet
pip install -r requirements.txt
# or manually
pip install mpire loguru sentence-splitter sentsplit langid
```

---

## 💡 Example: Hybrid Mode

```python
from chunklet import Chunklet

def word_token_counter(text: str) -> int:
    return len(text.split())

chunker = Chunklet(verbose=True, use_cache=True, token_counter=word_token_counter)

sample = """
This is a long document about AI. It discusses neural networks and deep learning.
The future is exciting. Ethics must be considered. Let’s build wisely.
"""

chunks = chunker.chunk(
    text=sample,
    mode="hybrid",
    max_tokens=20,
    max_sentences=5,
    overlap_percent=30
)

for i, chunk in enumerate(chunks):
    print(f"--- Chunk {i+1} ---")
    print(chunk)
```

---

## 🌀 Batch Chunking (Parallel)

```python
texts = [
    "First document sentence. Second sentence.",
    "Another one. Slightly longer. A third one here.",
    "Final doc with multiple lines. Great for testing chunk overlap."
]

results = chunker.batch_chunk(
    texts=texts,
    mode="hybrid",
    max_tokens=15,
    max_sentences=4,
    overlap_percent=20,
    n_jobs=2
)

for i, doc_chunks in enumerate(results):
    print(f"\n## Document {i+1}")
    for j, chunk in enumerate(doc_chunks):
        print(f"Chunk {j+1}:\n{chunk}")
```

---

βš™οΈ GPT-2 Token Count Support
```
from transformers import GPT2TokenizerFast
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def gpt2_token_count(text: str) -> int:
    return len(tokenizer.encode(text))

chunker = Chunklet(token_counter=gpt2_token_count)
```
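
Any callable that maps a string to an `int` works as a token counter. As a further sketch (an alternative I am assuming, not something from Chunklet's docs), the same count via `tiktoken`:

```python
from chunklet import Chunklet
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("gpt2")  # the same GPT-2 BPE vocabulary

def tiktoken_gpt2_count(text: str) -> int:
    return len(enc.encode(text))

chunker = Chunklet(token_counter=tiktoken_gpt2_count)
```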

---

## 🧪 Planned Features

- [ ] PDF splitter with metadata
- [ ] Code splitting based on points of interest
- [ ] CLI interface with `--file`, `--mode`, `--overlap`, etc.
- [ ] Named chunking presets: `"all"`, `"random_gap"` for downstream control


---

## 🌍 Language Support (30+)

- CRF-based: en, fr, de, it, ru, zh, ja, ko, pt, tr, etc.
- Heuristic-based: es, nl, da, fi, no, sv, cs, hu, el, ro, etc.
- Fallback: all other languages via smart regex (a hedged routing sketch follows this list)

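How the CRF > Moses > Regex chain might route a document, as a hedged illustration only; `pick_splitter` and the language sets are assumptions lifted from the lists above, and `langid.classify` is the detector from the manual install command:

```python
import langid  # langid.classify returns (language_code, confidence_score)

CRF_LANGS = {"en", "fr", "de", "it", "ru", "zh", "ja", "ko", "pt", "tr"}
HEURISTIC_LANGS = {"es", "nl", "da", "fi", "no", "sv", "cs", "hu", "el", "ro"}

def pick_splitter(text: str) -> str:
    lang, confidence = langid.classify(text)
    if lang in CRF_LANGS:
        return f"CRF splitter (sentsplit) for '{lang}'"
    if lang in HEURISTIC_LANGS:
        return f"heuristic splitter (sentence-splitter) for '{lang}'"
    return "regex fallback splitter"

print(pick_splitter("The quick brown fox jumps over the lazy dog."))
```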

---

## 💡 Projects that inspire me

| Tool                      | Description                                                                                      |
|---------------------------|--------------------------------------------------------------------------------------------------|
| [**Semchunk**](https://github.com/cocktailpeanut/semchunk)  | Semantic-aware chunking using transformer embeddings.                  |
| [**CintraAI Code Chunker**](https://github.com/CintraAI/code-chunker) | AST-based code chunker for intelligent code splitting.                 |


---

## 🤝 Contributing

1. Fork this repo
2. Create a new feature branch
3. Code like a star
4. Submit a pull request


---

## 📜 License

> MIT License. Use freely, modify boldly, and credit the legend (me. Just kidding!)

            
