chunkipy

- Name: chunkipy
- Version: 1.0.0.post1
- Home page: https://github.com/gioelecrispo/chunkipy
- Summary: Chunkipy is an easy-to-use library for chunking text based on the size estimator function you provide.
- Upload time: 2025-08-08 12:37:03
- Author: Gioele Crispo
- Requires Python: <3.13,>=3.10
- License: MIT
- Keywords: text, chunking, NLP, tokenization
# Chunkipy

![Python 3.10, 3.11, 3.12](https://img.shields.io/badge/python-3.10%2C%203.11%2C%203.12-blue.svg)
[![PyPI version](https://badge.fury.io/py/chunkipy.svg)](https://badge.fury.io/py/chunkipy)
[![codecov](https://codecov.io/gh/gioelecrispo/chunkipy/graph/badge.svg?token=2A7KQ87Q62)](https://codecov.io/gh/gioelecrispo/chunkipy)


`chunkipy` is a tool for segmenting long texts into smaller chunks, based on either a character or token count. With customizable chunk sizes and splitting strategies, it provides flexibility and control
for various text processing tasks.
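
For instance, a minimal character-based usage might look like the sketch below. This is illustrative only: the `TextChunker` name and its parameters are assumptions inferred from this README, not a verbatim copy of the documented API (see the documentation link below for the real interface).

```python
# Illustrative sketch only: the TextChunker name and parameters below are
# assumptions based on this README, not a verbatim copy of the documented API.
from chunkipy import TextChunker

text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. " * 50

chunker = TextChunker(chunk_size=300)  # hypothetical: at most 300 characters
for chunk in chunker.chunk(text):
    print(len(chunk), chunk[:40])
```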

## Motivation and Features
`chunkipy` was created to address a common need in Natural Language Processing (NLP): chunking text so that it does not exceed the input size of **neural networks** such as BERT. It can, however, be used in many other scenarios.

The library offers some useful features:

- **Size estimation**: unlike other text chunking libraries, `chunkipy` lets you provide a size estimator function, so that chunks are built taking into account the counting function (e.g. a tokenizer) that will later consume them (see the sketch after this list).
- **Split text into meaningful sentences**: as an optional configuration, `chunkipy` avoids cutting sentences when creating chunks,
  always trying to keep each sentence complete and syntactically correct.
  This is achieved through sentence segmenter libraries, which use semantic models to split text
  into meaningful sentences.
- **Smart overlapping**: `chunkipy` lets you define an `overlap_percentage` and create overlapping chunks that
  preserve context across chunk boundaries.
- **Flexibility for text splitters**: `chunkipy` offers complete flexibility in how text is split, allowing you to define your own text splitting function or choose from a list of pre-defined text splitters.
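
To make the size-estimation and overlapping features concrete, here is a hedged sketch of plugging in a token-based estimator. Only `overlap_percentage` is named above; the `chunk_size` and `size_estimator` keywords are assumptions, and `bert-base-uncased` with the `transformers` tokenizer is just a stand-in for whatever counting function your model uses. Check the documentation below for the actual signature.

```python
# Hedged sketch: keyword names other than `overlap_percentage` are assumptions,
# not confirmed API; see the project documentation for the real signature.
from chunkipy import TextChunker
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def bert_token_count(text: str) -> int:
    # Measure size exactly as the downstream BERT model would see it.
    return len(tokenizer.encode(text, add_special_tokens=False))

long_text = "Natural language processing models have input limits. " * 100

chunker = TextChunker(
    chunk_size=512,                   # maximum chunk size, in tokens
    size_estimator=bert_token_count,  # chunks are measured with this function
    overlap_percentage=0.1,           # 10% overlap to preserve context
)
chunks = chunker.chunk(long_text)
print(len(chunks), "chunks produced")
```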

## Documentation
For **Installation**, **Usage**, and **API documentation**, please refer to the [documentation](https://gioelecrispo.github.io/chunkipy).

You can also check the [examples](https://github.com/gioelecrispo/chunkipy/tree/main/examples) directory for more usage scenarios.


## Contributing
If you find a bug or have a feature request, please open an issue on [GitHub](https://github.com/gioelecrispo/chunkipy/issues).
Contributions are welcome! Just fork the repository, create a new branch with your changes, and submit a pull request. Please make sure to write tests for your changes and to follow the [code style](https://www.python.org/dev/peps/pep-0008/).


### Development 
To start developing chunkipy, it is recommended to: 

1. Create a virtual environment (e.g. `python -m venv .venv`) and activate it
2. Install poetry via `pip install poetry`
3. Install the development dependencies via one of these commands:

```bash
poetry install                                              # no extra dependencies
poetry install --extras spacy-splitter                      # a single extra
poetry install --extras "openai-splitter openai-estimator"  # multiple extras
poetry install --all-extras                                 # all the extras
```


### Documentation
`chunkipy` relies on Python docstrings and `sphinx` for its documentation.
`sphinx-autosummary` is used to automatically generate API documentation from the code.

`sphinx-multiversion` is used to provide multiversion support, i.e. you can also browse the documentation for past versions.
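
For reference, a `docs/source/conf.py` enabling both tools might contain the lines below. This is a generic Sphinx configuration sketch, not chunkipy's actual configuration.

```python
# Generic Sphinx configuration sketch (docs/source/conf.py); illustrative
# only, not chunkipy's actual conf.py.
extensions = [
    "sphinx.ext.autosummary",  # generate API pages from docstrings
    "sphinx_multiversion",     # build docs for multiple tags/branches
]
autosummary_generate = True

# sphinx-multiversion: which Git refs to build (example patterns).
smv_tag_whitelist = r"^v\d+\.\d+\.\d+$"
smv_branch_whitelist = r"^main$"
```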

This is handled via GitHub Actions, but you can reproduce it locally by installing the needed dependencies:

```bash
poetry install --only docs
```

and then by running the following command:
```bash
sphinx-multiversion docs/source docs/build/html
```


### Testing
We use `pytest` as our main testing framework.
You can install all the testing dependencies by running:

```bash
poetry install --with test
```

Once done, you can run all the unit tests (and check the coverage) with this command from the project folder:

```bash
pytest --cov=chunkipy --cov-report=term
```
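
As an illustration, a minimal unit test for the chunking behavior could look like this sketch; it reuses the hypothetical `TextChunker` API from the examples above.

```python
# tests/test_chunking.py -- illustrative sketch; the TextChunker API is the
# same assumption as in the usage examples above.
from chunkipy import TextChunker

def test_chunks_respect_size_limit():
    chunker = TextChunker(chunk_size=100)  # hypothetical: 100-character limit
    text = "A short sentence. " * 50
    for chunk in chunker.chunk(text):
        assert len(chunk) <= 100
```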


## License
`chunkipy` is licensed under the [MIT License](https://opensource.org/licenses/MIT).