context-aware-chunker

Name	context-aware-chunker
Version	0.0.2
home_page	https://github.com/wordlabs-io/context_aware_chunker
Summary	Context aware chunking using perplexity
upload_time	2024-02-26 05:17:05
maintainer
docs_url	None
author	Tanishk Kithannae
requires_python
license
keywords	python rag
VCS
bugtrack_url
requirements	No requirements were recorded.
Travis-CI	No Travis.
coveralls test coverage	No coveralls.

# Context Aware Chunker
When performing semantic search with vector similarity, one of the key decisions is the size of the chunks you index.

Chunk size affects the accuracy of the final result, the amount of contextual information retained at inference time, and the accuracy of retrieval.

One of the easiest ways to boost accuracy is to keep highly correlated information in a single atomic chunk rather than splitting it across several, since part of that information may otherwise be missed during semantic search.

## How does this package work?
The idea is quite simple. Language models are extremely good at knowing when two pieces of text belong together.

When they do, the perplexity remains low; when they don't, the perplexity is much higher.
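
To make the signal concrete, here is a minimal sketch of how perplexity follows from per-token probabilities. The probabilities below are made-up illustrative values, not output from this package or from any real model.

```python
import math

def perplexity(token_log_probs):
    """Perplexity is exp of the negative mean log-probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Hypothetical probabilities a model might assign to each token of a continuation.
related = [math.log(p) for p in (0.5, 0.4, 0.6)]       # continuation fits the context
unrelated = [math.log(p) for p in (0.05, 0.02, 0.1)]   # continuation does not fit

print(perplexity(related) < perplexity(unrelated))  # True
```

The more probable the model finds each token, the lower the perplexity, which is why related text scores lower than unrelated text.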

Based on this, we can merge two groups of text whenever the perplexity stays low, creating a chunk of highly correlated information.
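
The merge step could be sketched roughly as below. This is not the package's actual implementation: `merge_by_perplexity`, the `threshold` parameter, and `toy_score` are hypothetical names introduced for illustration, and `toy_score` is a word-overlap stand-in for a real language model's perplexity so that the example runs without downloading a model.

```python
from typing import Callable, List

def merge_by_perplexity(
    segments: List[str],
    score: Callable[[str, str], float],
    threshold: float,
) -> List[str]:
    """Greedily append each segment to the current chunk while its score stays below threshold."""
    chunks: List[str] = []
    for segment in segments:
        if chunks and score(chunks[-1], segment) < threshold:
            # Low score: the segment continues the current chunk.
            chunks[-1] = chunks[-1] + " " + segment
        else:
            # High score (or first segment): start a new chunk.
            chunks.append(segment)
    return chunks

def toy_score(context: str, candidate: str) -> float:
    """Stand-in for perplexity (assumption): lower when the two texts share words."""
    shared = set(context.lower().split()) & set(candidate.lower().split())
    return 1.0 / (1 + len(shared))

segments = [
    "Cats purr when content.",
    "Cats also knead blankets.",
    "Interest rates rose today.",
]
merged = merge_by_perplexity(segments, toy_score, threshold=0.75)
for chunk in merged:
    print(chunk)
```

In a real setup, `score` would be replaced by a function that asks a language model how surprising `candidate` is given `context`; the threshold would need tuning for that model.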

## Usage

> WARNING: This is an alpha release and is only suitable for testing, not for production.

### Installation
```
pip install context-aware-chunker
```
### Python Code
```python
from context_aware_chunker.chunking_models import T5ChunkerModel
from context_aware_chunker.text_splitter import SentenceSplitter

text = "<INSERT TEXT HERE>"

# Finds the relevant sentences in unstructured text.
splitter = SentenceSplitter()

# Decides which sentence segments to merge and which to keep separate.
# If you have more GPU power, you can try larger models.
chunking_agent = T5ChunkerModel('t5-small')

# merge_sentences sets how many sentences go into each split.
# The default is 1; try increasing it and compare the results.
split_content = splitter.split_text(text, merge_sentences=1)

chunks = chunking_agent.chunk(split_content)

for chunk in chunks:
    print(chunk)
```
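
To give a feel for what `merge_sentences` controls, here is a rough sketch of grouping sentences into splits of a fixed size. It is an assumption about the general idea, not the package's `SentenceSplitter` code: `split_and_group` is a hypothetical helper, and the regex-based sentence splitting is deliberately naive.

```python
import re
from typing import List

def split_and_group(text: str, merge_sentences: int = 1) -> List[str]:
    """Split text into sentences, then join every `merge_sentences` of them into one split."""
    if merge_sentences < 1:
        raise ValueError("merge_sentences must be at least 1")
    # Naive split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [
        " ".join(sentences[i:i + merge_sentences])
        for i in range(0, len(sentences), merge_sentences)
    ]

for split in split_and_group("A b. C d! E f? G.", merge_sentences=2):
    print(split)
```

Larger splits give the chunking model fewer, longer segments to compare, which is the trade-off to explore when increasing the value.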

Raw data

{
    "_id": null,
    "home_page": "https://github.com/wordlabs-io/context_aware_chunker",
    "name": "context-aware-chunker",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "keywords": "python,rag",
    "author": "Tanishk Kithannae",
    "author_email": "tanishk.kithannae@wordlabs.io",
    "download_url": "",
    "platform": null,
    "bugtrack_url": null,
    "license": "",
    "summary": "Context aware chunking using perplexity",
    "version": "0.0.2",
    "project_urls": {
        "Homepage": "https://github.com/wordlabs-io/context_aware_chunker"
    },
    "split_keywords": [
        "python",
        "rag"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "5b63d21afcf1e8518491356f6b5149d023566257fec3536a866db60d485c524b",
                "md5": "13b3f899e7155f2724c9d6fd7f30d8c7",
                "sha256": "63707c08968ba27d510ce5fb8c052b1f11b6005567b5aef03d5e1033a9980890"
            },
            "downloads": -1,
            "filename": "context_aware_chunker-0.0.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "13b3f899e7155f2724c9d6fd7f30d8c7",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 5186,
            "upload_time": "2024-02-26T05:17:05",
            "upload_time_iso_8601": "2024-02-26T05:17:05.016467Z",
            "url": "https://files.pythonhosted.org/packages/5b/63/d21afcf1e8518491356f6b5149d023566257fec3536a866db60d485c524b/context_aware_chunker-0.0.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-02-26 05:17:05",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "wordlabs-io",
    "github_project": "context_aware_chunker",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "context-aware-chunker"
}