# semchunk
<a href="https://pypi.org/project/semchunk/" alt="PyPI Version"><img src="https://img.shields.io/pypi/v/semchunk"></a> <a href="https://github.com/umarbutler/semchunk/actions/workflows/ci.yml" alt="Build Status"><img src="https://img.shields.io/github/actions/workflow/status/umarbutler/semchunk/ci.yml?branch=main"></a> <a href="https://app.codecov.io/gh/umarbutler/semchunk" alt="Code Coverage"><img src="https://img.shields.io/codecov/c/github/umarbutler/semchunk"></a> <!-- <a href="https://pypistats.org/packages/semchunk" alt="Downloads"><img src="https://img.shields.io/pypi/dm/semchunk"></a> -->
`semchunk` is a fast and lightweight pure Python library for splitting text into semantically meaningful chunks.
Owing to its complex yet highly efficient chunking algorithm, `semchunk` is both more semantically accurate than [`langchain.text_splitter.RecursiveCharacterTextSplitter`](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter) (see [How It Works 🔍](https://github.com/umarbutler/semchunk#how-it-works-)) and over 70% faster than [`semantic-text-splitter`](https://pypi.org/project/semantic-text-splitter/) (see the [Benchmarks 📊](https://github.com/umarbutler/semchunk#benchmarks-)).
## Installation 📦
`semchunk` may be installed with `pip`:
```bash
pip install semchunk
```
## Usage 👩‍💻
The code snippet below demonstrates how text can be chunked with `semchunk`:
```python
>>> import semchunk
>>> import tiktoken # `tiktoken` is not required but is used here to quickly count tokens.
>>> text = 'The quick brown fox jumps over the lazy dog.'
>>> chunk_size = 2 # A low chunk size is used here for demo purposes.
>>> encoder = tiktoken.encoding_for_model('gpt-4')
>>> token_counter = lambda text: len(encoder.encode(text)) # `token_counter` may be swapped out for any function capable of counting tokens.
>>> semchunk.chunk(text, chunk_size=chunk_size, token_counter=token_counter)
['The quick', 'brown fox', 'jumps over', 'the lazy', 'dog.']
```
### Chunk
```python
def chunk(
    text: str,
    chunk_size: int,
    token_counter: callable,
    memoize: bool = True
) -> list[str]
```
`chunk()` splits text into semantically meaningful chunks of a specified size as determined by the provided token counter.
`text` is the text to be chunked.
`chunk_size` is the maximum number of tokens a chunk may contain.
`token_counter` is a callable that takes a string and returns the number of tokens in it.
`memoize` flags whether to memoise the token counter. It defaults to `True`.
This function returns a list of chunks up to `chunk_size`-tokens-long, with any whitespace used to split the text removed.
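Any `str -> int` callable can serve as `token_counter`, from `len` itself to a full tokeniser. A trivial word-based counter (purely illustrative, not suited to production use) looks like this:

```python
# A purely illustrative token counter: treats whitespace-separated
# words as tokens. Swap in a real tokeniser (e.g. `tiktoken`) for
# accurate counts against an actual model's vocabulary.
def word_counter(text: str) -> int:
    return len(text.split())

word_counter('The quick brown fox')  # 4
```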
## How It Works 🔍
`semchunk` works by recursively splitting texts until all resulting chunks are equal to or less than a specified chunk size. In particular, it:
1. Splits text using the most semantically meaningful splitter possible;
1. Recursively splits the resulting chunks until a set of chunks equal to or less than the specified chunk size is produced;
1. Merges any chunks that are under the chunk size back together until the chunk size is reached; and
1. Reattaches any non-whitespace splitters to the ends of chunks (barring the final chunk) if doing so does not push chunks over the chunk size; otherwise, adds the splitters as their own chunks.
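The split-then-merge idea behind the steps above can be illustrated with a simplified, self-contained sketch. This is *not* semchunk's actual implementation: it splits only on whitespace and greedily merges pieces back together, with no recursion or splitter reattachment.

```python
def sketch_chunk(text: str, chunk_size: int, token_counter) -> list[str]:
    """Greedily merge whitespace-separated pieces into chunks that
    never exceed `chunk_size` tokens (simplified illustration only)."""
    chunks: list[str] = []
    current = ''
    for piece in text.split():
        candidate = f'{current} {piece}'.strip()
        if token_counter(candidate) <= chunk_size:
            current = candidate  # The piece fits; keep merging.
        else:
            if current:
                chunks.append(current)  # Emit the full chunk.
            current = piece
    if current:
        chunks.append(current)
    return chunks

words = lambda s: len(s.split())
sketch_chunk('The quick brown fox jumps over the lazy dog.', 2, words)
# ['The quick', 'brown fox', 'jumps over', 'the lazy', 'dog.']
```

Even this crude version reproduces the output of the usage example above; the real algorithm differs in that it recurses through progressively weaker splitters rather than splitting on whitespace alone.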
To ensure that chunks are as semantically meaningful as possible, `semchunk` uses the following splitters, in order of precedence:
1. The largest sequence of newlines (`\n`) and/or carriage returns (`\r`);
1. The largest sequence of tabs;
1. The largest sequence of whitespace characters (as defined by regex's `\s` character class);
1. Sentence terminators (`.`, `?`, `!` and `*`);
1. Clause separators (`;`, `,`, `(`, `)`, `[`, `]`, `“`, `”`, `‘`, `’`, `'`, `"` and `` ` ``);
1. Sentence interrupters (`:`, `—` and `…`);
1. Word joiners (`/`, `\`, `–`, `&` and `-`); and
1. All other characters.
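As an illustration of the first rule (a sketch of the idea, not semchunk's exact code), the largest run of newlines or carriage returns can be found with a regex and used as the split point, so that paragraph breaks outrank mere line breaks:

```python
import re

text = 'Paragraph one.\n\n\nParagraph two.\nStill paragraph two.'

# Find every run of newlines/carriage returns and pick the longest,
# so the split lands on the most semantically significant boundary.
runs = re.findall(r'[\r\n]+', text)
splitter = max(runs, key=len)  # '\n\n\n'
parts = text.split(splitter)
# ['Paragraph one.', 'Paragraph two.\nStill paragraph two.']
```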
`semchunk` also relies on memoization to cache the results of token counters and the `chunk()` function, thereby improving performance.
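To see why memoisation helps (a generic sketch, not semchunk's internals), consider that recursive splitting counts the tokens of many overlapping substrings repeatedly; wrapping the token counter with `functools.cache` makes repeat counts of the same string free:

```python
from functools import cache

calls = 0

@cache
def cached_counter(text: str) -> int:
    # Track how often the underlying counter actually runs.
    global calls
    calls += 1
    return len(text.split())

cached_counter('the lazy dog')
cached_counter('the lazy dog')  # Served from the cache; `calls` stays at 1.
```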
## Benchmarks 📊
On a desktop with a Ryzen 3600, 64 GB of RAM, Windows 11 and Python 3.12.0, it takes `semchunk` 24.41 seconds to split every sample in [NLTK's Gutenberg Corpus](https://www.nltk.org/howto/corpus.html#plaintext-corpora) into 512-token-long chunks (for context, the Corpus contains 18 texts and 3,001,260 tokens). By comparison, it takes [`semantic-text-splitter`](https://pypi.org/project/semantic-text-splitter/) 1 minute and 48.01 seconds to chunk the same texts into 512-token-long chunks — a difference of 77.35%.
The code used to benchmark `semchunk` and `semantic-text-splitter` is available [here](https://github.com/umarbutler/semchunk/blob/main/tests/bench.py).
## Licence 📄
This library is licensed under the [MIT License](https://github.com/umarbutler/semchunk/blob/main/LICENCE).