# đź“„ Semantic Document Chunking
This repository provides a method to convert documents into semantically meaningful chunks without relying on fixed chunk sizes or overlapping windows. Instead, it uses semantic chunking, dividing text based on meaning, topics, and the natural structure of the content to preserve contextual relevance.
---
## 🚀 Overview
Traditional chunking techniques split documents based solely on size or fixed length, often leading to fragmented and contextually inconsistent segments.
Our approach **first groups and splits content by semantic similarity**, then feeds those groups into a dynamic chunking strategy. The result is more meaningful, context-aware chunks with less redundant input, which can significantly reduce computational cost.
---
## 🔍 How It Works
### 📝 Sentence-wise Splitting
The document is first split into individual sentences or paragraphs, depending on the selected mode (`'sent'` or `'para'`).
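A minimal sketch of what this step could look like, assuming NLTK's `sent_tokenize` for sentence mode and blank-line splitting for paragraph mode (the package's internal tokenization may differ):
```
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt", quiet=True)  # tokenizer data; newer NLTK versions may also need "punkt_tab"

def split_units(text, mode="sent"):
    """Split a document into sentences ('sent') or paragraphs ('para')."""
    if mode == "para":
        # Treat blank lines as paragraph boundaries
        return [p.strip() for p in text.split("\n\n") if p.strip()]
    return sent_tokenize(text)

print(split_units("First sentence. Second one.\n\nA new paragraph.", mode="sent"))
```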
### đź”— Semantic Segregation
1. Calculate cosine similarity between sentences using a Sentence Transformer.
2. Group sentences whose similarity score exceeds the `threshold` (default `0.4`) into clusters.
3. Recursively repeat for ungrouped sentences until all are grouped semantically.
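As a rough illustration of steps 1-3 (not the package's exact implementation; the greedy seed-based grouping and the `BAAI/bge-base-en` model choice are assumptions based on the defaults shown later):
```
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("BAAI/bge-base-en")  # default embedding model from the usage example

def group_semantically(sentences, threshold=0.4):
    """Greedily cluster sentences whose cosine similarity to a seed sentence exceeds the threshold."""
    embeddings = model.encode(sentences, convert_to_tensor=True)
    ungrouped = list(range(len(sentences)))
    groups = []
    while ungrouped:
        seed = ungrouped.pop(0)        # first ungrouped sentence starts a new cluster
        cluster = [seed]
        remaining = []
        for idx in ungrouped:
            score = util.cos_sim(embeddings[seed], embeddings[idx]).item()
            if score > threshold:
                cluster.append(idx)    # similar enough: join the seed's cluster
            else:
                remaining.append(idx)  # left for the next pass (the recursive step)
        ungrouped = remaining
        groups.append(" ".join(sentences[i] for i in cluster))
    return groups
```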
---
## ⚙️ Dynamic Chunking with Retrieval Optimization
After semantic grouping:
- A **recursive character splitter** is applied with dynamic chunk sizing.
- The chunk size is computed as:

`chunk_size = length_of_document / N`

where `N` is a configurable parameter that determines granularity.

**By default, calling `.as_retriever()` uses semantic similarity to retrieve the top 4 most relevant chunks. Typically this retrieval, roughly one-fourth of the document at `N=16`, is enough to provide a meaningful response, depending on the question.**
### 🔢 Chunk Size Calculation Example
For a document of length 1200 and `N = 16`:

`chunk_size = 1200 / 16 = 75`

This yields chunks of roughly 75 characters, plus some overlap.
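As an illustrative sketch (not the package's internal code), the dynamic sizing could be wired up with LangChain's `RecursiveCharacterTextSplitter`; the `build_splitter` helper is hypothetical, and the example `N` and `overlap_ratio` values come from the `light` preset in the Chunk Usage Guide below:
```
from langchain.text_splitter import RecursiveCharacterTextSplitter  # newer versions: langchain_text_splitters

def build_splitter(document, n=16, overlap_ratio=0.15):
    """Derive chunk_size and chunk_overlap from document length and granularity N."""
    chunk_size = max(1, len(document) // n)           # e.g. 1200 // 16 = 75
    chunk_overlap = int(chunk_size * overlap_ratio)   # e.g. 75 * 0.15 -> 11 characters of overlap
    return RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)

document = "some long placeholder text " * 50         # ~1400 characters of dummy input
chunks = build_splitter(document, n=16, overlap_ratio=0.15).split_text(document)
```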
### đź’ˇ Chunk Usage Guide
Depending on the desired response length, vary how many chunks are used:
| Response Type        | Approx. Chunks Used                      | Chunking Config (`N`, `overlap_ratio`)     |
| -------------------- | ---------------------------------------- | ------------------------------------------ |
| Short Answer         | ~1/4 of total chunks (e.g., top 4 of 16) | `light` → `N=16`, `overlap_ratio=0.15`      |
| Moderate (Detailed)  | ~1/2 of total chunks (e.g., top 6 of 12) | `standard` → `N=12`, `overlap_ratio=0.25`   |
| Detailed Answer      | ~3/4 of total chunks (e.g., top 6 of 8)  | `deep` → `N=8`, `overlap_ratio=0.35`        |
| Very Detailed Answer | All chunks (~8 of 8)                     | `max_detail` → `N=8`, `overlap_ratio=0.45`  |
This balances context coverage and retrieval efficiency.
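To make the mapping concrete, here is a hypothetical preset table in code; the `DEPTH_PRESETS` dict and `retrieval_k` helper are illustrative, while `as_retriever(search_kwargs={"k": ...})` is standard LangChain vector-store usage rather than anything specific to this package:
```
# Hypothetical presets mirroring the Chunk Usage Guide table above
DEPTH_PRESETS = {
    "light":      {"N": 16, "overlap_ratio": 0.15, "chunk_fraction": 0.25},
    "standard":   {"N": 12, "overlap_ratio": 0.25, "chunk_fraction": 0.50},
    "deep":       {"N": 8,  "overlap_ratio": 0.35, "chunk_fraction": 0.75},
    "max_detail": {"N": 8,  "overlap_ratio": 0.45, "chunk_fraction": 1.00},
}

def retrieval_k(depth):
    """Number of chunks to retrieve, e.g. 'light' -> top 4 of 16."""
    preset = DEPTH_PRESETS[depth]
    return max(1, round(preset["N"] * preset["chunk_fraction"]))

# With a LangChain vector store this might become:
# retriever = vector_store.as_retriever(search_kwargs={"k": retrieval_k("standard")})
```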
---
### 🎯 Benefits
âś… Produces contextually relevant and semantically consistent chunks
âś… Saves computational and cost resources by minimizing redundant input
âś… Automatically adjusts chunk size and overlap based on document length and depth
---
### 📦 Installation
Make sure the required dependencies are installed:
```
pip install nltk sentence-transformers langchain
```
If needed, download NLTK tokenizers:
```
import nltk
nltk.download("punkt")
```
### đź§Ş Usage Example
```
from splitter import SemanticSplitter

splitter = SemanticSplitter(
    threshold=0.4,              # Semantic similarity threshold for splitting
    depth='standard',           # Options: 'light', 'standard', 'deep', 'max_detail'
    tokenization_mode='para',   # Options: 'para' (paragraph), 'sent' (sentence)
    model="BAAI/bge-base-en"    # Sentence embedding model (default: "BAAI/bge-base-en")
)

with open("path/to/your/document.txt", "r", encoding="utf-8") as f:
    document = f.read()

chunks = splitter.auto_split(document)

for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}:\n{chunk.page_content}\n")
```