# đź“„ Semantic Document Chunking
This repository provides a method to convert documents into semantically meaningful chunks without relying on fixed chunk sizes or overlapping windows. Instead, it uses semantic chunking, dividing text based on meaning, topics, and the natural structure of the content to preserve contextual relevance.
---
## 🚀 Overview
Traditional chunking techniques split documents based solely on size or fixed length, often leading to fragmented and contextually inconsistent segments.
Our approach **first groups and splits content by semantic similarity**, then feeds those groups into a dynamic chunking strategy. The result is more meaningful, context-aware chunks with less redundant input, which can significantly reduce computational cost.
---
## 🔍 How It Works
### 📝 Sentence-wise Splitting
The document is first split into individual sentences or paragraphs, depending on the selected mode (`'sent'` or `'para'`).
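A minimal sketch of what this step could look like, assuming NLTK's `sent_tokenize` for sentence mode and blank-line splitting for paragraph mode (the package's internal tokenization may differ):
```
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt", quiet=True)  # tokenizer data; newer NLTK versions may also need "punkt_tab"

def split_units(text, mode="sent"):
    """Split a document into sentences ('sent') or paragraphs ('para')."""
    if mode == "para":
        # Treat blank lines as paragraph boundaries
        return [p.strip() for p in text.split("\n\n") if p.strip()]
    return sent_tokenize(text)

print(split_units("First sentence. Second one.\n\nA new paragraph.", mode="sent"))
```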
### đź”— Semantic Segregation
1. Calculate cosine similarity between sentences using a Sentence Transformer.
2. Group sentences whose similarity score exceeds the `threshold` (default `0.4`) into clusters.
3. Recursively repeat for ungrouped sentences until all are grouped semantically.
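As a rough illustration of steps 1-3 (not the package's exact implementation; the greedy seed-based grouping and the `BAAI/bge-base-en` model choice are assumptions based on the defaults shown later):
```
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("BAAI/bge-base-en")  # default embedding model from the usage example

def group_semantically(sentences, threshold=0.4):
    """Greedily cluster sentences whose cosine similarity to a seed sentence exceeds the threshold."""
    embeddings = model.encode(sentences, convert_to_tensor=True)
    ungrouped = list(range(len(sentences)))
    groups = []
    while ungrouped:
        seed = ungrouped.pop(0)        # first ungrouped sentence starts a new cluster
        cluster = [seed]
        remaining = []
        for idx in ungrouped:
            score = util.cos_sim(embeddings[seed], embeddings[idx]).item()
            if score > threshold:
                cluster.append(idx)    # similar enough: join the seed's cluster
            else:
                remaining.append(idx)  # left for the next pass (the recursive step)
        ungrouped = remaining
        groups.append(" ".join(sentences[i] for i in cluster))
    return groups
```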
---
## ⚙️ Dynamic Chunking with Retrieval Optimization
After semantic grouping:
- A **recursive character splitter** is applied with dynamic chunk sizing.
- The chunk size is computed as:

`chunk_size = length_of_document / N`

where `N` is a configurable parameter that determines granularity.

**By default, calling `.as_retriever()` uses semantic similarity to retrieve the top 4 most relevant chunks. Typically this retrieval, roughly one-fourth of the document at `N=16`, is enough to provide a meaningful response, depending on the question.**
### 🔢 Chunk Size Calculation Example
For a document of length 1200 and `N = 16`:

`chunk_size = 1200 / 16 = 75`

This yields chunks of roughly 75 characters, plus some overlap.
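As an illustrative sketch (not the package's internal code), the dynamic sizing could be wired up with LangChain's `RecursiveCharacterTextSplitter`; the `build_splitter` helper is hypothetical, and the example `N` and `overlap_ratio` values come from the `light` preset in the Chunk Usage Guide below:
```
from langchain.text_splitter import RecursiveCharacterTextSplitter  # newer versions: langchain_text_splitters

def build_splitter(document, n=16, overlap_ratio=0.15):
    """Derive chunk_size and chunk_overlap from document length and granularity N."""
    chunk_size = max(1, len(document) // n)           # e.g. 1200 // 16 = 75
    chunk_overlap = int(chunk_size * overlap_ratio)   # e.g. 75 * 0.15 -> 11 characters of overlap
    return RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)

document = "some long placeholder text " * 50         # ~1400 characters of dummy input
chunks = build_splitter(document, n=16, overlap_ratio=0.15).split_text(document)
```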
### đź’ˇ Chunk Usage Guide
Depending on the desired response length, vary how many chunks are used:
| Response Type        | Approx. Chunks Used                      | Chunking Config (`N`, `overlap_ratio`)     |
| -------------------- | ---------------------------------------- | ------------------------------------------ |
| Short Answer         | ~1/4 of total chunks (e.g., top 4 of 16) | `light` → `N=16`, `overlap_ratio=0.15`      |
| Moderate (Detailed)  | ~1/2 of total chunks (e.g., top 6 of 12) | `standard` → `N=12`, `overlap_ratio=0.25`   |
| Detailed Answer      | ~3/4 of total chunks (e.g., top 6 of 8)  | `deep` → `N=8`, `overlap_ratio=0.35`        |
| Very Detailed Answer | All chunks (~8 of 8)                     | `max_detail` → `N=8`, `overlap_ratio=0.45`  |
This balances context coverage and retrieval efficiency.
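To make the mapping concrete, here is a hypothetical preset table in code; the `DEPTH_PRESETS` dict and `retrieval_k` helper are illustrative, while `as_retriever(search_kwargs={"k": ...})` is standard LangChain vector-store usage rather than anything specific to this package:
```
# Hypothetical presets mirroring the Chunk Usage Guide table above
DEPTH_PRESETS = {
    "light":      {"N": 16, "overlap_ratio": 0.15, "chunk_fraction": 0.25},
    "standard":   {"N": 12, "overlap_ratio": 0.25, "chunk_fraction": 0.50},
    "deep":       {"N": 8,  "overlap_ratio": 0.35, "chunk_fraction": 0.75},
    "max_detail": {"N": 8,  "overlap_ratio": 0.45, "chunk_fraction": 1.00},
}

def retrieval_k(depth):
    """Number of chunks to retrieve, e.g. 'light' -> top 4 of 16."""
    preset = DEPTH_PRESETS[depth]
    return max(1, round(preset["N"] * preset["chunk_fraction"]))

# With a LangChain vector store this might become:
# retriever = vector_store.as_retriever(search_kwargs={"k": retrieval_k("standard")})
```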
---
### 🎯 Benefits
âś… Produces contextually relevant and semantically consistent chunks
âś… Saves computational and cost resources by minimizing redundant input
âś… Automatically adjusts chunk size and overlap based on document length and depth
---
### 📦 Installation
Make sure the required dependencies are installed:
```
pip install nltk sentence-transformers langchain
```
If needed, download NLTK tokenizers:
```
import nltk
nltk.download("punkt")
```
### đź§Ş Usage Example
```
from splitter import SemanticSplitter

splitter = SemanticSplitter(
    threshold=0.4,              # Semantic similarity threshold for splitting
    depth='standard',           # Options: 'light', 'standard', 'deep', 'max_detail'
    tokenization_mode='para',   # Options: 'para' (paragraph), 'sent' (sentence)
    model="BAAI/bge-base-en"    # Sentence embedding model (default: "BAAI/bge-base-en")
)

with open("path/to/your/document.txt", "r", encoding="utf-8") as f:
    document = f.read()

chunks = splitter.auto_split(document)

for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}:\n{chunk.page_content}\n")
```