# Chunkipy

`chunkipy` is a library for segmenting long texts into smaller chunks, based on either a character or token count. With customizable chunk sizes and splitting strategies, it provides flexibility and control
for various text processing tasks.
## Motivation and Features
`chunkipy` was created to address a common need in Natural Language Processing (NLP): chunking text so that it does not exceed the input size of **neural networks** such as BERT. It can also be used for many other use cases.
The library offers some useful features:
- **Size estimation**: unlike other text chunking libraries, `chunkipy` lets you provide a size estimator function, so that chunks are built using the same counting function (e.g. a tokenizer) as the system that will consume them.
- **Split text into meaningful sentences**: as an optional configuration, `chunkipy`
avoids cutting sentences when creating chunks, and always tries to keep each sentence complete and syntactically correct.
This is achieved through sentence segmenter libraries, which use semantic models to cut text
into meaningful sentences.
- **Smart Overlapping**: `chunkipy` lets you define an `overlap_percentage` and create overlapping chunks to
preserve context across chunks.
- **Flexibility for text splitters**: `chunkipy` offers complete flexibility in choosing how to split, allowing users to define their own text splitting function or choose from a list of pre-defined text splitters.
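
To make the ideas above concrete, here is a minimal, library-independent sketch of chunking driven by a size estimator, a pluggable splitter, and an overlap percentage. It does not use or reproduce the `chunkipy` API: `word_count`, `split_sentences`, and `chunk_text` are hypothetical names invented for this illustration, and the naive regex splitter stands in for a real sentence segmenter. Refer to the documentation linked below for the actual classes and parameters.

```python
import re
from typing import Callable, List, Tuple


def word_count(text: str) -> int:
    """Hypothetical size estimator: counts whitespace-separated words."""
    return len(text.split())


def split_sentences(text: str) -> List[str]:
    """Hypothetical splitter: naive split after '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_text(
    text: str,
    chunk_size: int,
    size_estimator: Callable[[str], int] = word_count,
    splitter: Callable[[str], List[str]] = split_sentences,
    overlap_percentage: float = 0.0,
) -> List[str]:
    """Greedily pack text parts into chunks of at most `chunk_size` units.

    A single part larger than `chunk_size` is emitted as its own
    oversized chunk; this sketch does not split it further.
    """
    max_overlap = int(chunk_size * overlap_percentage)
    chunks: List[str] = []
    current: List[Tuple[str, int]] = []  # (part, estimated size)
    current_size = 0

    for part in splitter(text):
        part_size = size_estimator(part)
        if current and current_size + part_size > chunk_size:
            chunks.append(" ".join(p for p, _ in current))
            # Carry over trailing parts that fit in the overlap budget.
            carried: List[Tuple[str, int]] = []
            carried_size = 0
            for prev, prev_size in reversed(current):
                if carried_size + prev_size > max_overlap:
                    break
                carried.insert(0, (prev, prev_size))
                carried_size += prev_size
            # Drop carried parts from the front if the new part would not fit.
            while carried and carried_size + part_size > chunk_size:
                carried_size -= carried.pop(0)[1]
            current, current_size = carried, carried_size
        current.append((part, part_size))
        current_size += part_size

    if current:
        chunks.append(" ".join(p for p, _ in current))
    return chunks


text = "One two three. Four five six. Seven eight nine. Ten eleven twelve."
chunks = chunk_text(text, chunk_size=6, overlap_percentage=0.5)
for chunk in chunks:
    print(word_count(chunk), "|", chunk)
```

Swapping `word_count` for a tokenizer-based counter changes how sizes are measured without touching the packing logic, which is the point of separating the size estimator from the splitter.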
## Documentation
For **Installation**, **Usage**, and **API documentation**, please refer to the [documentation](https://gioelecrispo.github.io/chunkipy).
You can also check the [examples](https://github.com/gioelecrispo/chunkipy/tree/main/examples) directory for more usage scenarios.
## Contributing
If you find a bug or have a feature request, please open an issue on [GitHub](https://github.com/gioelecrispo/chunkipy/issues).
Contributions are welcome! Just fork the repository, create a new branch with your changes, and submit a pull request. Please make sure to write tests for your changes and to follow the [code style](https://www.python.org/dev/peps/pep-0008/).
### Development
To start developing chunkipy, it is recommended to:
1. Create a virtual environment (e.g. `python -m venv .venv`) and activate it
2. Install poetry via `pip install poetry`
3. Install the development dependencies via one of these commands:
```bash
poetry install # no extra dependencies
poetry install --extras spacy-splitter
poetry install --extras "openai-splitter openai-estimator" # multiple extras
poetry install --all-extras # all the extra dependencies
```
### Documentation
`chunkipy` relies on Python docstrings and `sphinx` for its documentation.
`sphinx-autosummary` is used to automatically generate documentation from code.
`sphinx-multiversion` is used to provide multi-version support, i.e. you can also browse the documentation for past versions.
This is handled via GitHub Actions, but you can reproduce it locally by installing the needed dependencies:
```bash
poetry install --only docs
```
and then by running the following command:
```bash
sphinx-multiversion docs/source docs/build/html
```
### Testing
We use `pytest` as the main testing framework.
You can install all the testing dependencies by running:
```bash
poetry install --with test
```
Once done, you can run all the unit tests (and check the coverage) with this command from the project folder:
```bash
pytest --cov=chunkipy --cov-report=term
```
## License
`chunkipy` is licensed under the [MIT License](https://opensource.org/licenses/MIT).
Raw data
{
"_id": null,
"home_page": "https://github.com/gioelecrispo/chunkipy",
"name": "chunkipy",
"maintainer": null,
"docs_url": null,
"requires_python": "<3.13,>=3.10",
"maintainer_email": null,
"keywords": "text, chunking, NLP, tokenization",
"author": "Gioele Crispo",
"author_email": "crispogioele@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/91/a8/41a38b397078a2e2f912f9d0441539a794f271d44bf4b4fb13695d5d10a9/chunkipy-1.0.0.post1.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Chunkipy is an easy-to-use library for chunking text based on the size estimator function you provide.",
"version": "1.0.0.post1",
"project_urls": {
"Homepage": "https://github.com/gioelecrispo/chunkipy",
"Repository": "https://github.com/gioelecrispo/chunkipy"
},
"split_keywords": [
"text",
" chunking",
" nlp",
" tokenization"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "5b88fceba9fde2b36de3ea30deedb8979fb3a270d80ac8c8e38727e7e2bf3cc2",
"md5": "d991a87fa2dab3b8e3cefdaf7818e5a2",
"sha256": "f6dc9071945839f92a2a60b674c04e772111d140a5642d5bb9358f412f4149fb"
},
"downloads": -1,
"filename": "chunkipy-1.0.0.post1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "d991a87fa2dab3b8e3cefdaf7818e5a2",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": "<3.13,>=3.10",
"size": 16318,
"upload_time": "2025-08-08T12:37:03",
"upload_time_iso_8601": "2025-08-08T12:37:03.203063Z",
"url": "https://files.pythonhosted.org/packages/5b/88/fceba9fde2b36de3ea30deedb8979fb3a270d80ac8c8e38727e7e2bf3cc2/chunkipy-1.0.0.post1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "91a841a38b397078a2e2f912f9d0441539a794f271d44bf4b4fb13695d5d10a9",
"md5": "69a260b4dc0e6959217c1d6bded92b73",
"sha256": "63590f4ef0acdc3a8b69e8876c5e0a625c6de11a4064088053ed3948a3f66f16"
},
"downloads": -1,
"filename": "chunkipy-1.0.0.post1.tar.gz",
"has_sig": false,
"md5_digest": "69a260b4dc0e6959217c1d6bded92b73",
"packagetype": "sdist",
"python_version": "source",
"requires_python": "<3.13,>=3.10",
"size": 11583,
"upload_time": "2025-08-08T12:37:03",
"upload_time_iso_8601": "2025-08-08T12:37:03.982327Z",
"url": "https://files.pythonhosted.org/packages/91/a8/41a38b397078a2e2f912f9d0441539a794f271d44bf4b4fb13695d5d10a9/chunkipy-1.0.0.post1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-08-08 12:37:03",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "gioelecrispo",
"github_project": "chunkipy",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "chunkipy"
}