# 📦 Chunklet: Smart Multilingual Text Chunker



> Chunk smarter, not harder — built for LLMs, RAG pipelines, and beyond.

**Author:** Speedyk_005
**Version:** 1.0.5 (🎉 first stable release)
**License:** MIT

---
## 🚀 What's New in v1.0.5 (Stable)

- ✅ **Stable Release:** v1.0.5 marks the first fully stable version after extensive refactoring.
- 🔄 **Multiple Refactor Steps:** Core code reorganized for clarity, maintainability, and performance.
- ➿ **True Clause-Level Overlap:** Overlap now occurs on natural clause boundaries (commas, semicolons, etc.) instead of only sentence boundaries, better preserving semantic flow.
- 🛠️ **Improved Chunking Logic:** Enhanced fallback splitters and overlap calculations to handle edge cases gracefully.
- ⚡ **Optimized Batch Processing:** Parallel chunking now consistently respects token counters and offsets.
- 🧪 **Expanded Test Suite:** Comprehensive tests added for multilingual support, caching, and chunk correctness.
- 🧹 **Cleaner Output:** Logging filters and redundant docstrings removed to reduce noise during runs.

---
## 🔥 Why Chunklet?

Feature | Why it's elite
--------|----------------
⛓️ **Hybrid Mode** | Combines token + sentence limits with guaranteed overlap — rare even in commercial stacks.
🌐 **Multilingual Fallbacks** | CRF > Moses > Regex, with dynamic confidence detection.
➿ **Clause-Level Overlap** | `overlap_percent` now operates at the **clause level**, preserving semantic flow across chunks using `, ; …` logic (see the sketch below).
⚡ **Parallel Batch Processing** | Multi-core acceleration with `mpire`.
♻️ **LRU Caching** | Smart memoization via `functools.lru_cache`.
🪄 **Pluggable Token Counters** | Swap in GPT-2, BPE, or your own tokenizer.
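
To make the clause-level overlap idea concrete, here is a minimal, standalone sketch of the concept. It is **not** Chunklet's internal code: the regex, the `split_clauses` / `carry_clauses` helpers, and the 30% figure are illustrative assumptions only.

```python
import re

def split_clauses(text: str) -> list[str]:
    # Illustrative only: split on clause boundaries (commas, semicolons,
    # ellipses), keeping each delimiter attached to its clause.
    return [c.strip() for c in re.split(r"(?<=[,;…])\s+", text) if c.strip()]

def carry_clauses(prev_chunk: str, overlap_percent: float) -> str:
    # Carry the trailing N% of the previous chunk's clauses, so the next
    # chunk opens with shared context instead of a hard cut.
    clauses = split_clauses(prev_chunk)
    n_carry = max(1, round(len(clauses) * overlap_percent / 100))
    return " ".join(clauses[-n_carry:])

prev = "Neural nets learn features, layer by layer; training needs data, compute, and patience."
print(carry_clauses(prev, overlap_percent=30))
# -> "compute, and patience."  (the last two of five clauses at 30%)
```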
---
## 🧩 Chunking Modes

Pick your flavor (a quick comparison follows the list):

- `"sentence"` — chunk by sentence count only
- `"token"` — chunk by token count only
- `"hybrid"` — sentence + token thresholds respected, with guaranteed overlap
---
## 📦 Installation

```bash
pip install chunklet
```

Or install from source:

```bash
git clone https://github.com/Speedyk-005/chunklet.git
cd chunklet
pip install -r requirements.txt
# or install the dependencies manually
pip install mpire loguru sentence-splitter sentsplit langid
```
---
## 💡 Example: Hybrid Mode

```python
from chunklet import Chunklet

def word_token_counter(text: str) -> int:
    return len(text.split())

chunker = Chunklet(verbose=True, use_cache=True, token_counter=word_token_counter)

sample = """
This is a long document about AI. It discusses neural networks and deep learning.
The future is exciting. Ethics must be considered. Let's build wisely.
"""

chunks = chunker.chunk(
    text=sample,
    mode="hybrid",
    max_tokens=20,
    max_sentences=5,
    overlap_percent=30
)

for i, chunk in enumerate(chunks):
    print(f"--- Chunk {i+1} ---")
    print(chunk)
```
---
## 🌀 Batch Chunking (Parallel)

```python
texts = [
    "First document sentence. Second sentence.",
    "Another one. Slightly longer. A third one here.",
    "Final doc with multiple lines. Great for testing chunk overlap."
]

results = chunker.batch_chunk(
    texts=texts,
    mode="hybrid",
    max_tokens=15,
    max_sentences=4,
    overlap_percent=20,
    n_jobs=2
)

for i, doc_chunks in enumerate(results):
    print(f"\n## Document {i+1}")
    for j, chunk in enumerate(doc_chunks):
        print(f"Chunk {j+1}:\n{chunk}")
```
---
## ⚙️ GPT-2 Token Count Support

```python
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def gpt2_token_count(text: str) -> int:
    return len(tokenizer.encode(text))

chunker = Chunklet(token_counter=gpt2_token_count)
```
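
With the GPT-2 counter plugged in, `max_tokens` is enforced in real GPT-2 tokens rather than words. A short usage sketch, reusing the `chunk` signature from the hybrid example; whether `max_sentences` may be omitted in `"token"` mode is an assumption here.

```python
# Token-mode chunking: limits now count GPT-2 tokens, not whitespace words.
chunks = chunker.chunk(
    text="A long passage destined for a GPT-2 context window. " * 10,
    mode="token",
    max_tokens=50,
    overlap_percent=20,
)
print(f"{len(chunks)} chunks of <= 50 GPT-2 tokens each")
```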
---
## 🧪 Planned Features

- [ ] PDF splitter with metadata
- [ ] Code splitting based on interest points
- [ ] CLI interface with `--file`, `--mode`, `--overlap`, etc.
- [ ] Named chunking presets ("all", "random_gap") for downstream control
---
## 🌍 Language Support (30+)

- CRF-based: en, fr, de, it, ru, zh, ja, ko, pt, tr, etc.
- Heuristic-based: es, nl, da, fi, no, sv, cs, hu, el, ro, etc.
- Fallback: all other languages via smart regex
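
The routing could look roughly like the sketch below. This illustrates the CRF > Moses > Regex chain, not Chunklet's internal code: `langid.classify` is the real API of the bundled `langid` dependency, but the language sets and the splitter-selection logic are assumptions for illustration.

```python
import langid

def pick_splitter(text: str) -> str:
    # langid.classify returns (language_code, score); a real implementation
    # could also gate on the score for low-confidence inputs.
    lang, score = langid.classify(text)
    crf_langs = {"en", "fr", "de", "it", "ru", "zh", "ja", "ko", "pt", "tr"}
    heuristic_langs = {"es", "nl", "da", "fi", "no", "sv", "cs", "hu", "el", "ro"}
    if lang in crf_langs:
        return f"CRF splitter ({lang})"
    if lang in heuristic_langs:
        return f"Moses/heuristic splitter ({lang})"
    return "regex fallback"

print(pick_splitter("Bonjour tout le monde. Comment allez-vous ?"))
# Expected to route French to the CRF splitter.
```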
---
## 💡 Projects that inspire me

| Tool | Description |
|------|-------------|
| [**Semchunk**](https://github.com/cocktailpeanut/semchunk) | Semantic-aware chunking using transformer embeddings. |
| [**CintraAI Code Chunker**](https://github.com/CintraAI/code-chunker) | AST-based code chunker for intelligent code splitting. |
---
## 🤝 Contributing

1. Fork this repo
2. Create a new feature branch
3. Code like a star
4. Submit a pull request
---
## 📜 License

> MIT License. Use freely, modify boldly, and credit the legend (me. Just kidding!)