chunknorris

Name: chunknorris
Version: 0.0.1
Summary: A package for chunking documents from various formats
Upload time: 2024-05-23 13:14:07
Requires Python: >=3.10
Keywords: chunk, document, split, html, markdown, pdf, header
            # Chunk Norris

## Goal

This project aims to improve the chunking of documents from various sources (HTML, PDFs, ...).
An optimized chunking method should lead to smaller chunks, meaning:
- **Better relevancy of chunks** (and thus easier identification of useful chunks through embedding cosine similarity)
- **Fewer errors** caused by chunks exceeding the API limit on the number of tokens
- **Fewer hallucinations** from generation models, as superfluous information is kept out of the prompt
- **Reduced cost**, as the prompt is smaller

## Installation

From PyPI, just run the following command:

```bash
pip install chunknorris
```

## Chunkers

The package features multiple ***chunkers*** that can be used independently, depending on the type of document to process.

All chunkers follow a similar logic:
- Extract the table of contents (i.e. the headers)
- Build chunks from the text content of a part, with the titles of the parts it belongs to placed on top

![](images/chunk_method.png)
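
The two steps above can be sketched as follows. This is a minimal, hypothetical illustration of the idea (the function name `chunk_markdown` and all details are invented here, not the package's actual implementation): walk the text, track the stack of parent headers, and prepend those headers to each section's body.

```py
import re

def chunk_markdown(text: str) -> list[str]:
    """Toy sketch: split ATX-style markdown on headers and prepend
    each section's parent titles, mimicking the chunkers' logic."""
    chunks: list[str] = []
    stack: list[tuple[int, str]] = []  # (level, header line) above the current section
    body_lines: list[str] = []

    def flush() -> None:
        # Emit a chunk: parent titles on top, then the section's text
        if any(line.strip() for line in body_lines):
            titles = [title for _, title in stack]
            chunks.append("\n".join(titles + body_lines).strip())
        body_lines.clear()

    for line in text.splitlines():
        match = re.match(r"^(#{1,6})\s+\S", line)
        if match:
            flush()
            level = len(match.group(1))
            # Pop headers that are not parents of the new section
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, line))
        else:
            body_lines.append(line)
    flush()
    return chunks
```

Each chunk thus carries its full header path, which keeps it self-explanatory when retrieved in isolation.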

### MarkdownChunkNorris

This chunker is meant to be used **on markdown-formatted text**. 

Note: When calling the chunker, **you need to specify the header style** of your markdown text ([ATX or Setext](https://golem.ph.utexas.edu/~distler/maruku/markdown_syntax.html#header)). By default, it assumes the "Setext" heading style.

#### Usage

```py
from chunkers import MarkdownChunkNorris

text = """
# This is a header
This is a text
## This is another header
And another text
## With this final header
And this last text
"""
chunker = MarkdownChunkNorris()
header_style = "atx" # or "setext" depending on headers in your text
chunks = chunker(text, header_style=header_style)
```

### HTMLChunkNorris

This chunker is meant to be used **on html-formatted text**. Behind the scenes, it uses markdownify to convert the text to markdown with "Setext"-style headers, then uses MarkdownChunkNorris to process it.
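
To picture the conversion step, here is a minimal, hypothetical sketch using only the standard library (the class and function names are invented for illustration; the package itself relies on markdownify for a full conversion): `<h1>`/`<h2>` tags become Setext headers, underlined with `=` and `-` respectively.

```py
from html.parser import HTMLParser

class HeadingConverter(HTMLParser):
    """Toy sketch of the HTML -> markdown step: turn <h1>/<h2>
    into Setext-style headers and keep other text as-is."""
    def __init__(self) -> None:
        super().__init__()
        self.lines: list[str] = []
        self._tag: str | None = None

    def handle_starttag(self, tag, attrs) -> None:
        self._tag = tag

    def handle_data(self, data: str) -> None:
        text = data.strip()
        if not text:
            return
        if self._tag == "h1":
            self.lines += [text, "=" * len(text)]  # Setext level 1
        elif self._tag == "h2":
            self.lines += [text, "-" * len(text)]  # Setext level 2
        else:
            self.lines.append(text)

def html_to_setext(html: str) -> str:
    parser = HeadingConverter()
    parser.feed(html)
    return "\n".join(parser.lines)
```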

#### Usage

```py
from chunkers import HTMLChunkNorris

text = """
<h1>This is 1st level heading</h1>
<p>This is a test paragraph.</p>
<h2>This is 2nd level heading</h2>
<p>This is a test paragraph.</p>
<h2>This is another level heading</h2>
<p>This is another test paragraph.</p>
"""
hcn = HTMLChunkNorris()
chunks = hcn(text)
```

### Advanced usage of chunkers

Additionally, the chunkers accept a number of arguments that modify their behavior:

```py
from chunkers import MarkdownChunkNorris

mystring = "# header\nThis is a markdown string"

chunker = MarkdownChunkNorris() # or any other chunker
chunks = chunker(
    mystring,
    max_title_level_to_use="h3",
    max_chunk_word_length=200,
    link_placement="in_sentence",
    max_chunk_tokens=8191,
    chunk_tokens_exceeded_handling="split",
    min_chunk_wordcount=15,
    )
```

***max_title_level_to_use***
(str): The maximum level (inclusive) of headers taken into account for chunking. For example, if "h3" is set, then "h4" and "h5" titles won't be used. Must be a string of the form "hx", with x being the title level. Defaults to "h4".

***max_chunk_word_length***
(int): The maximum size (soft limit, in words) a chunk can be. Chunks bigger than this size will be split using lower-level headers, until no lower-level headers are available. Defaults to 200.

***link_placement***
(str): How links should be handled. Defaults to "in_sentence".
Options:
- "remove": text is kept but links are removed
- "end_of_chunk": adds a paragraph at the end of the chunk containing all the links
- "in_sentence": links are added in parentheses inside the sentence

***max_chunk_tokens***
(int): The hard maximum number of tokens a chunk can contain. Chunks exceeding this limit will be handled according to chunk_tokens_exceeded_handling. Defaults to 8191.

***chunk_tokens_exceeded_handling***
(str): How chunks bigger than max_chunk_tokens should be handled. Defaults to "raise_error".
Options:
- "raise_error": raises an error, indicating the chunk could not be split according to headers
- "split": splits the chunks arbitrarily so that each chunk is smaller than max_chunk_tokens

***min_chunk_wordcount***
(int): Minimum number of words for a chunk to be kept. Chunks with fewer words will be discarded. Defaults to 15.

### PDFChunkNorris

#TODO:

            
