semchunk


Name: semchunk
Version: 0.2.2
Summary: A fast and lightweight pure Python library for splitting text into semantically meaningful chunks.
Author: Umar Butler
Upload time: 2024-02-06 02:08:09
Requires Python: >=3.9
Licence: MIT
Keywords: chunk, chunker, chunking, chunks, nlp, split, splits, splitter, splitting, text
# semchunk
<a href="https://pypi.org/project/semchunk/" alt="PyPI Version"><img src="https://img.shields.io/pypi/v/semchunk"></a> <a href="https://github.com/umarbutler/semchunk/actions/workflows/ci.yml" alt="Build Status"><img src="https://img.shields.io/github/actions/workflow/status/umarbutler/semchunk/ci.yml?branch=main"></a> <a href="https://app.codecov.io/gh/umarbutler/semchunk" alt="Code Coverage"><img src="https://img.shields.io/codecov/c/github/umarbutler/semchunk"></a> <!-- <a href="https://pypistats.org/packages/semchunk" alt="Downloads"><img src="https://img.shields.io/pypi/dm/semchunk"></a> -->

`semchunk` is a fast and lightweight pure Python library for splitting text into semantically meaningful chunks.

Owing to its complex yet highly efficient chunking algorithm, `semchunk` is both more semantically accurate than [`langchain.text_splitter.RecursiveCharacterTextSplitter`](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter) (see [How It Works 🔍](https://github.com/umarbutler/semchunk#how-it-works-)) and over 70% faster than [`semantic-text-splitter`](https://pypi.org/project/semantic-text-splitter/) (see [Benchmarks 📊](https://github.com/umarbutler/semchunk#benchmarks-)).

## Installation 📦
`semchunk` may be installed with `pip`:
```bash
pip install semchunk
```

## Usage 👩‍💻
The code snippet below demonstrates how text can be chunked with `semchunk`:
```python
>>> import semchunk
>>> import tiktoken # `tiktoken` is not required but is used here to quickly count tokens.
>>> text = 'The quick brown fox jumps over the lazy dog.'
>>> chunk_size = 2 # A low chunk size is used here for demo purposes.
>>> encoder = tiktoken.encoding_for_model('gpt-4')
>>> token_counter = lambda text: len(encoder.encode(text)) # `token_counter` may be swapped out for any function capable of counting tokens.
>>> semchunk.chunk(text, chunk_size=chunk_size, token_counter=token_counter)
['The quick', 'brown fox', 'jumps over', 'the lazy', 'dog.']
```

### Chunk
```python
def chunk(
    text: str,
    chunk_size: int,
    token_counter: callable,
    memoize: bool=True
) -> list[str]
```

`chunk()` splits text into semantically meaningful chunks of a specified size as determined by the provided token counter.

`text` is the text to be chunked.

`chunk_size` is the maximum number of tokens a chunk may contain.

`token_counter` is a callable that takes a string and returns the number of tokens in it.

`memoize` flags whether to memoise the token counter. It defaults to `True`.

This function returns a list of chunks up to `chunk_size` tokens long, with any whitespace used to split the text removed.
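Because `token_counter` is simply a callable mapping a string to an integer, lightweight stand-ins can be used when a full tokeniser like `tiktoken` is unnecessary. The counters below are illustrative examples only, not part of `semchunk`'s API:

```python
# Illustrative token counters; any callable mapping a string to an
# integer token count may be passed as `token_counter`.
word_counter = lambda text: len(text.split())  # naive whitespace word count
char_counter = lambda text: len(text)          # raw character count

print(word_counter('The quick brown fox'))  # 4
print(char_counter('abc'))                  # 3
```

A word-based counter will, of course, produce chunks whose true token counts differ from those of a model's tokeniser, so the counter should match whatever model will ultimately consume the chunks.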

## How It Works 🔍
`semchunk` works by recursively splitting texts until all resulting chunks are equal to or less than a specified chunk size. In particular, it:
1. Splits text using the most semantically meaningful splitter possible;
1. Recursively splits the resulting chunks until a set of chunks equal to or less than the specified chunk size is produced;
1. Merges any chunks that are under the chunk size back together until the chunk size is reached; and
1. Reattaches any non-whitespace splitters back to the ends of chunks barring the final chunk if doing so does not bring chunks over the chunk size, otherwise adds non-whitespace splitters as their own chunks.
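As a rough illustration, the split-and-merge loop above can be sketched in pure Python. The `naive_chunk` function below is a hypothetical, whitespace-only simplification written for this explanation; it is not `semchunk`'s actual implementation, which uses the full splitter hierarchy and reattaches non-whitespace splitters:

```python
def naive_chunk(text: str, chunk_size: int, token_counter) -> list[str]:
    """A whitespace-only sketch of recursive split-and-merge chunking."""
    if token_counter(text) <= chunk_size:
        return [text]
    parts = text.split()
    if len(parts) <= 1:  # Cannot split further; return the oversized piece.
        return [text]
    # Recursively split any part that still exceeds the chunk size.
    pieces = []
    for part in parts:
        pieces.extend(naive_chunk(part, chunk_size, token_counter))
    # Merge adjacent pieces back together, up to the chunk size.
    chunks, current = [], ''
    for piece in pieces:
        candidate = f'{current} {piece}' if current else piece
        if current and token_counter(candidate) > chunk_size:
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

words = lambda text: len(text.split())
print(naive_chunk('The quick brown fox jumps over the lazy dog.', 2, words))
# ['The quick', 'brown fox', 'jumps over', 'the lazy', 'dog.']
```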

To ensure that chunks are as semantically meaningful as possible, `semchunk` uses the following splitters, in order of precedence:
1. The largest sequence of newlines (`\n`) and/or carriage returns (`\r`);
1. The largest sequence of tabs;
1. The largest sequence of whitespace characters (as defined by regex's `\s` character class);
1. Sentence terminators (`.`, `?`, `!` and `*`);
1. Clause separators (`;`, `,`, `(`, `)`, `[`, `]`, `“`, `”`, `‘`, `’`, `'`, `"` and `` ` ``);
1. Sentence interrupters (`:`, `—` and `…`);
1. Word joiners (`/`, `\`, `–`, `&` and `-`); and
1. All other characters.
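The idea of trying splitters in order of precedence can be sketched as below. This uses a truncated set of patterns for brevity; the library's own logic (for instance, locating the *largest* run of a given whitespace character) is more involved:

```python
import re

# A truncated subset of the precedence list above, most semantic first.
SPLITTER_PATTERNS = [
    r'[\n\r]+',     # runs of newlines and/or carriage returns
    r'\t+',         # runs of tabs
    r'\s+',         # runs of other whitespace characters
    r'[.?!*]',      # sentence terminators
    r'[;,()\[\]]',  # a subset of the clause separators
]

def pick_splitter(text):
    """Return the first (most semantically meaningful) pattern found in `text`."""
    for pattern in SPLITTER_PATTERNS:
        if re.search(pattern, text):
            return pattern
    return None

print(pick_splitter('line one\nline two'))   # '[\n\r]+'
print(pick_splitter('one clause, another'))  # '\s+' (spaces outrank commas)
```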

`semchunk` also relies on memoization to cache the results of token counters and the `chunk()` function, thereby improving performance.
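As a generic illustration of this technique (not `semchunk`'s internal code), a token counter can be memoised with the standard library's `functools.lru_cache`, so that repeated counts of the same string cost only a single underlying call:

```python
from functools import lru_cache

calls = 0

@lru_cache(maxsize=None)
def cached_word_count(text: str) -> int:
    """A memoised token counter; repeated inputs are served from the cache."""
    global calls
    calls += 1
    return len(text.split())

cached_word_count('the same text')
cached_word_count('the same text')  # second call never re-counts
print(calls)  # 1
```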

## Benchmarks 📊
On a desktop with a Ryzen 3600, 64 GB of RAM, Windows 11 and Python 3.12.0, it takes `semchunk` 24.41 seconds to split every sample in [NLTK's Gutenberg Corpus](https://www.nltk.org/howto/corpus.html#plaintext-corpora) into 512-token-long chunks (for context, the Corpus contains 18 texts and 3,001,260 tokens). By comparison, it takes [`semantic-text-splitter`](https://pypi.org/project/semantic-text-splitter/) 1 minute and 48.01 seconds to chunk the same texts into 512-token-long chunks, making `semchunk` 77.35% faster.

The code used to benchmark `semchunk` and `semantic-text-splitter` is available [here](https://github.com/umarbutler/semchunk/blob/main/tests/bench.py).

## Licence 📄
This library is licensed under the [MIT License](https://github.com/umarbutler/semchunk/blob/main/LICENCE).
            
