# semantic-chunker
This library is built on top of the [semantic-text-splitter](https://github.com/benbrandt/text-splitter)
library, which is written in Rust, and combines it with
the [tree-sitter-language-pack](https://github.com/Goldziher/tree-sitter-language-pack)
to enable code splitting.
Its main utility is providing a strongly typed interface to the underlying library and removing the need to
manage tree-sitter dependencies yourself.
## Installation
```shell
pip install semantic-chunker
```
Or to include the optional `tokenizers` dependency:
```shell
pip install semantic-chunker[tokenizers]
```
## Usage
Import the `get_chunker` function from the `semantic_chunker` module, and use it to get a chunker instance and chunk
content. You can chunk plain text:
```python
from semantic_chunker import get_chunker
plain_text = """
Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin
literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at Hampden-Sydney
College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage,
and going through the cites of the word in classical literature, discovered the undoubtable source: Lorem Ipsum
comes from sections 1.10.32 and 1.10.33 of "de Finibus Bonorum et Malorum" (The Extremes of Good and Evil) by
Cicero, written in 45 BC. This book is a treatise on the theory of ethics, very popular during the Renaissance.
The first line of Lorem Ipsum, "Lorem ipsum dolor sit amet..", comes from a line in section
"""
chunker = get_chunker(
    "gpt-3.5-turbo",
    chunking_type="text",  # required
    max_tokens=10,  # required
    trim=False,  # default True
    overlap=5,  # default 0
)
# Then use it to chunk a value into either a list of chunks that are up to the `max_tokens` length:
chunks = chunker.chunks(plain_text)  # list[str]
# Or a list of tuples containing the character offset and the chunk:
chunks_with_indices = chunker.chunk_with_indices(plain_text)  # list[tuple[int, str]]
```
Markdown:
```python
from semantic_chunker import get_chunker
markdown_text = """
# Lorem Ipsum Intro
Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature
from 45 BC, making it over 2000 years old.
Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin
words, consectetur, from a Lorem Ipsum passage, and going through the cites of the word in classical literature,
discovered the undoubtable source: Lorem Ipsum comes from sections 1.10.32 and 1.10.33 of "de Finibus Bonorum et Malorum"
(The Extremes of Good and Evil) by Cicero, written in 45 BC.
This book is a treatise on the theory of ethics, very popular during the Renaissance. The first line of Lorem Ipsum,
"Lorem ipsum dolor sit amet..", comes from a line in section.
"""
chunker = get_chunker(
    "gpt-3.5-turbo",
    chunking_type="markdown",  # required
    max_tokens=10,  # required
    trim=False,  # default True
    overlap=5,  # default 0
)
# Then use it to chunk a value into either a list of chunks that are up to the `max_tokens` length:
chunks = chunker.chunks(markdown_text)  # list[str]
# Or a list of tuples containing the character offset and the chunk:
chunks_with_indices = chunker.chunk_with_indices(markdown_text)  # list[tuple[int, str]]
```
Or code:
```python
from semantic_chunker import get_chunker
kotlin_snippet = """
import kotlin.random.Random
fun main() {
    val randomNumbers = IntArray(10) { Random.nextInt(1, 100) } // Generate an array of 10 random integers between 1 and 99
    println("Random numbers:")
    for (number in randomNumbers) {
        println(number) // Print each random number
    }
}
"""
chunker = get_chunker(
    "gpt-3.5-turbo",
    chunking_type="code",  # required
    max_tokens=10,  # required
    language="kotlin",  # required, only for code chunking, ignored otherwise
    trim=False,  # default True
    overlap=5,  # default 0
)
# Then use it to chunk a value into either a list of chunks that are up to the `max_tokens` length:
chunks = chunker.chunks(kotlin_snippet)  # list[str]
# Or a list of tuples containing the character offset and the chunk:
chunks_with_indices = chunker.chunk_with_indices(kotlin_snippet)  # list[tuple[int, str]]
```
The first argument to `get_chunker` is a required argument (not kwarg), which can be one of the following:
1. a tiktoken model string identifier (e.g. `gpt-3.5-turbo`).
2. a callback function that receives a text (string) and returns the number of tokens it contains (integer).
3. a `tokenizers.Tokenizer` instance (or an instance of a subclass thereof).
4. a file path to a tokenizer JSON file as a string (`"/path/to/tokenizer.json"`) or `Path`
instance (`Path("/path/to/tokenizer.json")`).
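As an illustration of option 2, the sketch below passes a plain Python callable as the token counter. The names `count_words` and `build_word_chunker` are hypothetical and exist only for this example, and counting whitespace-separated words is a rough stand-in for a real tokenizer.

```python
def count_words(text: str) -> int:
    """Approximate the token count by counting whitespace-separated words."""
    return len(text.split())


def build_word_chunker(max_tokens: int = 10):
    """Create a plain-text chunker that sizes chunks with `count_words`."""
    # Imported lazily so `count_words` can be used without the package installed.
    from semantic_chunker import get_chunker

    return get_chunker(count_words, chunking_type="text", max_tokens=max_tokens)
```

With the package installed, `build_word_chunker(20).chunks(plain_text)` would then return chunks sized by word count rather than by model tokens.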
The (**required**) kwarg `chunking_type` can be either `text`, `markdown` or `code`.
The (**required**) kwarg `max_tokens` is the maximum number of tokens in each chunk. This kwarg accepts either an
__integer__ or a __tuple__ of two integers (`tuple[int, int]`), which represents a min/max range within which the number
of tokens in each chunk should fall.
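To make the two accepted shapes of `max_tokens` concrete, here is a small sketch that checks a list of chunks against the upper limit for either form. `TokenLimit`, `upper_bound` and `oversized` are hypothetical helpers written for this example; they are not part of the library.

```python
from typing import Callable, Union

# Either a single maximum or a (min, max) range, mirroring the `max_tokens` kwarg.
TokenLimit = Union[int, tuple[int, int]]


def upper_bound(max_tokens: TokenLimit) -> int:
    """Return the maximum token count implied by `max_tokens`."""
    return max_tokens if isinstance(max_tokens, int) else max_tokens[1]


def oversized(chunks: list[str], counter: Callable[[str], int], max_tokens: TokenLimit) -> list[str]:
    """Return the chunks whose token count exceeds the upper limit."""
    limit = upper_bound(max_tokens)
    return [chunk for chunk in chunks if counter(chunk) > limit]


# Assumed usage, with the same counter passed to get_chunker and to the check:
# chunker = get_chunker(counter, chunking_type="text", max_tokens=(50, 100))
# assert not oversized(chunker.chunks(text), counter, (50, 100))
```

A check like this is mostly useful in tests, where the same counting function can be handed to both `get_chunker` and the assertion.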
If the `chunking_type` is `code`, the `language` kwarg is **required**. This kwarg should be a string representing the
language of the code to be split. The language should be one of the languages included in
the `tree-sitter-language-pack` library
([see here for a list](https://github.com/Goldziher/tree-sitter-language-pack)).
### Note on Types
The [semantic-text-splitter](https://github.com/benbrandt/text-splitter) library is used to split the text into chunks,
which it does very quickly. It has 3 types of splitters: `TextSplitter`, `MarkdownSplitter`, and `CodeSplitter`. This
library abstracts them behind a protocol type named `SemanticChunker`:
```python
from typing import Protocol
class SemanticChunker(Protocol):
    def chunks(self, content: str) -> list[str]:
        """Generate a list of chunks from a given text. Each chunk will be up to the `capacity`."""

    def chunk_with_indices(self, content: str) -> list[tuple[int, str]]:
        """Generate a list of chunks from a given text, along with their character offsets in the original text. Each chunk will be up to the `capacity`."""
```
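Because `SemanticChunker` is a protocol, any object with matching methods satisfies it structurally. The sketch below restates the protocol locally so that it runs on its own, and uses a hypothetical `FixedWidthChunker` stand-in (not part of this library) to show how code typed against the protocol can turn the `(offset, chunk)` tuples into character spans.

```python
from typing import Protocol


class SemanticChunker(Protocol):
    def chunks(self, content: str) -> list[str]: ...

    def chunk_with_indices(self, content: str) -> list[tuple[int, str]]: ...


class FixedWidthChunker:
    """Hypothetical stand-in that cuts text every `width` characters."""

    def __init__(self, width: int) -> None:
        self.width = width

    def chunks(self, content: str) -> list[str]:
        return [chunk for _, chunk in self.chunk_with_indices(content)]

    def chunk_with_indices(self, content: str) -> list[tuple[int, str]]:
        return [(i, content[i : i + self.width]) for i in range(0, len(content), self.width)]


def chunk_spans(chunker: SemanticChunker, content: str) -> list[tuple[int, int]]:
    """Return the (start, end) character span of every chunk in `content`."""
    return [(start, start + len(chunk)) for start, chunk in chunker.chunk_with_indices(content)]
```

Typing helpers such as `chunk_spans` against the protocol rather than a concrete splitter class keeps them usable with text, markdown and code chunkers alike, and makes them easy to exercise with a simple fake in tests.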
## Contribution
This library welcomes contributions. To contribute, please follow the steps below:
1. Fork and clone the repository.
2. Make changes and commit them (follow [conventional commits](https://www.conventionalcommits.org/en/v1.0.0/)).
3. Submit a PR.
Read below on how to develop locally:
### Prerequisites
- A compatible Python version.
- [pdm](https://github.com/pdm-project/pdm) installed.
- [pre-commit](https://pre-commit.com) installed.
### Setup
1. Inside the repository, install the dependencies with:
```shell
pdm install
```
This will create a virtual environment under the git-ignored `.venv` folder and install all the dependencies.
2. Install the pre-commit hooks:
```shell
pre-commit install && pre-commit install --hook-type commit-msg
```
This will install the pre-commit hooks that will run before every commit. This includes linters and formatters.
### Linting
To lint the codebase, run:
```shell
pdm run lint
```
### Testing
To run the tests, run:
```shell
pdm run test
```
### Updating Dependencies
To update the dependencies, run:
```shell
pdm update
```