# Chunk Norris
## Goal
This project aims to improve the method of chunking documents from various sources (HTML, PDFs, ...).
An optimized chunking method leads to smaller chunks, meaning:
- **Better relevance of chunks** (and thus easier identification of useful chunks through embedding cosine similarity)
- **Fewer errors** caused by chunks exceeding the API limit in terms of number of tokens
- **Fewer hallucinations** of generation models caused by superfluous information in the prompt
- **Reduced cost**, as the prompt is smaller
## Installation
Using PyPI, just run the following command:
```
pip install chunknorris
```
## Chunkers
The package features multiple ***chunkers*** that can be used independently depending on the type of document to process.
All chunkers follow a similar logic:
- Extract the table of contents (i.e. the headers)
- Build chunks using the text content of a section, with the titles of its parent sections placed on top
![](images/chunk_method.png)
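To illustrate, the header-based logic above might be sketched roughly as follows. This is a simplified stand-in (handling only ATX headers), not the package's actual implementation:

```py
import re

def naive_chunk(md_text: str) -> list[str]:
    """Split ATX-style markdown on headers, prepending parent titles to each chunk."""
    chunks: list[str] = []
    parents: dict[int, str] = {}  # header level -> most recent title line
    current_lines: list[str] = []

    def flush() -> None:
        if current_lines:
            titles = [parents[lvl] for lvl in sorted(parents)]
            chunks.append("\n".join(titles + current_lines).strip())

    for line in md_text.splitlines():
        m = re.match(r"^(#{1,6}) ", line)
        if m:
            flush()
            level = len(m.group(1))
            # a new header invalidates deeper (and same-level) parent titles
            parents = {lvl: t for lvl, t in parents.items() if lvl < level}
            parents[level] = line
            current_lines = []
        elif line.strip():
            current_lines.append(line)
    flush()
    return chunks
```

Each chunk thus carries the full chain of titles it belongs to, which helps embedding models place the content in context.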
### MarkdownChunkNorris
This chunker is meant to be used **on markdown-formatted text**.
Note: When calling the chunker, **you need to specify the header style** of your markdown text ([ATX or Setext](https://golem.ph.utexas.edu/~distler/maruku/markdown_syntax.html#header)). By default it assumes "Setext" heading style.
#### Usage
```py
from chunkers import MarkdownChunkNorris
text = """
# This is a header
This is a text
## This is another header
And another text
## With this final header
And this last text
"""
chunker = MarkdownChunkNorris()
header_style = "atx" # or "setext" depending on headers in your text
chunks = chunker(text, header_style=header_style)
```
### HTMLChunkNorris
This chunker is meant to be used **on html-formatted text**. Behind the scenes, it uses markdownify to convert the text to markdown with "setext"-style headers, then uses MarkdownChunkNorris to process it.
#### Usage
```py
from chunkers import HTMLChunkNorris
text = """
<h1>This is 1st level heading</h1>
<p>This is a test paragraph.</p>
<h2>This is 2nd level heading</h2>
<p>This is a test paragraph.</p>
<h2>This is another level heading</h2>
<p>This is another test paragraph.</p>
"""
hcn = HTMLChunkNorris()
chunks = hcn(text)
```
### Advanced usage of chunkers
Additionally, the chunkers can take a number of arguments allowing you to modify their behavior:
```py
from chunkers import MarkdownChunkNorris
mystring = "# header\nThis is a markdown string"
chunker = MarkdownChunkNorris() # or any other chunker
chunks = chunker(
mystring,
max_title_level_to_use="h3",
max_chunk_word_length=200,
link_placement="in_sentence",
max_chunk_tokens=8191,
chunk_tokens_exceeded_handling="split",
min_chunk_wordcount=15,
)
```
***max_title_level_to_use***
(str): The maximum (inclusive) level of headers taken into account for chunking. For example, if "h3" is set, then "h4" and "h5" titles won't be used. Must be a string of the form "hx", with x being the title level. Defaults to "h4".
***max_chunk_word_length***
(int): The maximum size (soft limit, in words) a chunk can be. Chunks bigger than this size will be split using lower-level headers, until no lower-level headers are available. Defaults to 200.
***link_placement***
(str): How the links should be handled. Defaults to "in_sentence".
Options:
- "remove": the link's text is kept but the URL is removed
- "end_of_chunk": adds a paragraph at the end of the chunk containing all the links
- "in_sentence": links are added in parentheses inside the sentence
***max_chunk_tokens***
(int): The hard maximum number of tokens a chunk can contain. Chunks exceeding this limit will be handled according to chunk_tokens_exceeded_handling. Defaults to 8191.
***chunk_tokens_exceeded_handling***
(str): How chunks bigger than max_chunk_tokens should be handled. Defaults to "raise_error".
Options:
- "raise_error": raises an error, indicating the chunk could not be split according to headers
- "split": splits the chunks arbitrarily so that each chunk is smaller than max_chunk_tokens
***min_chunk_wordcount***
(int): The minimum number of words a chunk must contain to be kept. Chunks with fewer words will be discarded. Defaults to 15.
### PDFChunkNorris
#TODO: