semchunk


Name: semchunk
Version: 0.2.2
Summary: A fast and lightweight pure Python library for splitting text into semantically meaningful chunks.
Author: Umar Butler
Upload time: 2024-02-06 02:08:09
Requires Python: >=3.9
Licence: MIT
Keywords: chunk, chunker, chunking, chunks, nlp, split, splits, splitter, splitting, text
# semchunk
<a href="https://pypi.org/project/semchunk/" alt="PyPI Version"><img src="https://img.shields.io/pypi/v/semchunk"></a> <a href="https://github.com/umarbutler/semchunk/actions/workflows/ci.yml" alt="Build Status"><img src="https://img.shields.io/github/actions/workflow/status/umarbutler/semchunk/ci.yml?branch=main"></a> <a href="https://app.codecov.io/gh/umarbutler/semchunk" alt="Code Coverage"><img src="https://img.shields.io/codecov/c/github/umarbutler/semchunk"></a> <!-- <a href="https://pypistats.org/packages/semchunk" alt="Downloads"><img src="https://img.shields.io/pypi/dm/semchunk"></a> -->

`semchunk` is a fast and lightweight pure Python library for splitting text into semantically meaningful chunks.

Owing to its complex yet highly efficient chunking algorithm, `semchunk` is both more semantically accurate than [`langchain.text_splitter.RecursiveCharacterTextSplitter`](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter) (see [How It Works 🔍](https://github.com/umarbutler/semchunk#how-it-works-)) and over 70% faster than [`semantic-text-splitter`](https://pypi.org/project/semantic-text-splitter/) (see [Benchmarks 📊](https://github.com/umarbutler/semchunk#benchmarks-)).

## Installation 📦
`semchunk` may be installed with `pip`:
```bash
pip install semchunk
```

## Usage 👩‍💻
The code snippet below demonstrates how text can be chunked with `semchunk`:
```python
>>> import semchunk
>>> import tiktoken # `tiktoken` is not required but is used here to quickly count tokens.
>>> text = 'The quick brown fox jumps over the lazy dog.'
>>> chunk_size = 2 # A low chunk size is used here for demo purposes.
>>> encoder = tiktoken.encoding_for_model('gpt-4')
>>> token_counter = lambda text: len(encoder.encode(text)) # `token_counter` may be swapped out for any function capable of counting tokens.
>>> semchunk.chunk(text, chunk_size=chunk_size, token_counter=token_counter)
['The quick', 'brown fox', 'jumps over', 'the lazy', 'dog.']
```

### Chunk
```python
def chunk(
    text: str,
    chunk_size: int,
    token_counter: callable,
    memoize: bool=True
) -> list[str]
```

`chunk()` splits text into semantically meaningful chunks of a specified size as determined by the provided token counter.

`text` is the text to be chunked.

`chunk_size` is the maximum number of tokens a chunk may contain.

`token_counter` is a callable that takes a string and returns the number of tokens in it.

`memoize` flags whether to memoise the token counter. It defaults to `True`.

This function returns a list of chunks up to `chunk_size` tokens long, with any whitespace used to split the text removed.
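Because `token_counter` is simply a callable mapping a string to an integer, lightweight stand-ins can be used when a full tokeniser like `tiktoken` is unnecessary. The counters below are illustrative examples only, not part of `semchunk`'s API:

```python
# Illustrative token counters; any callable mapping a string to an
# integer token count may be passed as `token_counter`.
word_counter = lambda text: len(text.split())  # naive whitespace word count
char_counter = lambda text: len(text)          # raw character count

print(word_counter('The quick brown fox'))  # 4
print(char_counter('abc'))                  # 3
```

A word-based counter will, of course, produce chunks whose true token counts differ from those of a model's tokeniser, so the counter should match whatever model will ultimately consume the chunks.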

## How It Works 🔍
`semchunk` works by recursively splitting texts until all resulting chunks are equal to or less than a specified chunk size. In particular, it:
1. Splits text using the most semantically meaningful splitter possible;
1. Recursively splits the resulting chunks until a set of chunks equal to or less than the specified chunk size is produced;
1. Merges any chunks that are under the chunk size back together until the chunk size is reached; and
1. Reattaches any non-whitespace splitters back to the ends of chunks barring the final chunk if doing so does not bring chunks over the chunk size, otherwise adds non-whitespace splitters as their own chunks.
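As a rough illustration, the split-and-merge loop above can be sketched in pure Python. The `naive_chunk` function below is a hypothetical, whitespace-only simplification written for this explanation; it is not `semchunk`'s actual implementation, which uses the full splitter hierarchy and reattaches non-whitespace splitters:

```python
def naive_chunk(text: str, chunk_size: int, token_counter) -> list[str]:
    """A whitespace-only sketch of recursive split-and-merge chunking."""
    if token_counter(text) <= chunk_size:
        return [text]
    parts = text.split()
    if len(parts) <= 1:  # Cannot split further; return the oversized piece.
        return [text]
    # Recursively split any part that still exceeds the chunk size.
    pieces = []
    for part in parts:
        pieces.extend(naive_chunk(part, chunk_size, token_counter))
    # Merge adjacent pieces back together, up to the chunk size.
    chunks, current = [], ''
    for piece in pieces:
        candidate = f'{current} {piece}' if current else piece
        if current and token_counter(candidate) > chunk_size:
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

words = lambda text: len(text.split())
print(naive_chunk('The quick brown fox jumps over the lazy dog.', 2, words))
# ['The quick', 'brown fox', 'jumps over', 'the lazy', 'dog.']
```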

To ensure that chunks are as semantically meaningful as possible, `semchunk` uses the following splitters, in order of precedence:
1. The largest sequence of newlines (`\n`) and/or carriage returns (`\r`);
1. The largest sequence of tabs;
1. The largest sequence of whitespace characters (as defined by regex's `\s` character class);
1. Sentence terminators (`.`, `?`, `!` and `*`);
1. Clause separators (`;`, `,`, `(`, `)`, `[`, `]`, `“`, `”`, `‘`, `’`, `'`, `"` and `` ` ``);
1. Sentence interrupters (`:`, `—` and `…`);
1. Word joiners (`/`, `\`, `–`, `&` and `-`); and
1. All other characters.
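The idea of trying splitters in order of precedence can be sketched as below. This uses a truncated set of patterns for brevity; the library's own logic (for instance, locating the *largest* run of a given whitespace character) is more involved:

```python
import re

# A truncated subset of the precedence list above, most semantic first.
SPLITTER_PATTERNS = [
    r'[\n\r]+',     # runs of newlines and/or carriage returns
    r'\t+',         # runs of tabs
    r'\s+',         # runs of other whitespace characters
    r'[.?!*]',      # sentence terminators
    r'[;,()\[\]]',  # a subset of the clause separators
]

def pick_splitter(text):
    """Return the first (most semantically meaningful) pattern found in `text`."""
    for pattern in SPLITTER_PATTERNS:
        if re.search(pattern, text):
            return pattern
    return None

print(pick_splitter('line one\nline two'))   # '[\n\r]+'
print(pick_splitter('one clause, another'))  # '\s+' (spaces outrank commas)
```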

`semchunk` also relies on memoization to cache the results of token counters and the `chunk()` function, thereby improving performance.
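As a generic illustration of this technique (not `semchunk`'s internal code), a token counter can be memoised with the standard library's `functools.lru_cache`, so that repeated counts of the same string cost only a single underlying call:

```python
from functools import lru_cache

calls = 0

@lru_cache(maxsize=None)
def cached_word_count(text: str) -> int:
    """A memoised token counter; repeated inputs are served from the cache."""
    global calls
    calls += 1
    return len(text.split())

cached_word_count('the same text')
cached_word_count('the same text')  # second call never re-counts
print(calls)  # 1
```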

## Benchmarks 📊
On a desktop with a Ryzen 3600, 64 GB of RAM, Windows 11 and Python 3.12.0, it takes `semchunk` 24.41 seconds to split every sample in [NLTK's Gutenberg Corpus](https://www.nltk.org/howto/corpus.html#plaintext-corpora) into 512-token-long chunks (for context, the Corpus contains 18 texts and 3,001,260 tokens). By comparison, it takes [`semantic-text-splitter`](https://pypi.org/project/semantic-text-splitter/) 1 minute and 48.01 seconds to chunk the same texts into 512-token-long chunks, making `semchunk` 77.35% faster.

The code used to benchmark `semchunk` and `semantic-text-splitter` is available [here](https://github.com/umarbutler/semchunk/blob/main/tests/bench.py).

## Licence 📄
This library is licensed under the [MIT License](https://github.com/umarbutler/semchunk/blob/main/LICENCE).
            
