chunknorris

Name	chunknorris
Version	1.1.4
home_page	None
Summary	A package for chunking documents from various formats
upload_time	2025-07-10 12:29:33
maintainer	None
docs_url	None
author	None
requires_python	>=3.10
license	None
keywords	chunk document split html markdown pdf header parsing rag
VCS
bugtrack_url
requirements	markdownify lxml html5lib termcolor PyMuPDF matplotlib pydantic pandas pyyaml openpyxl thefuzz tabulate nbformat mammoth
Travis-CI	No Travis.
coveralls test coverage	No coveralls.

# ChunkNorris

📒 [Documentation](https://wikit-ai.github.io/chunknorris/) | 🧪 [Testing app](https://huggingface.co/spaces/Wikit/chunknorris)


## Goal

This package aims to improve how documents from various sources (HTML, PDF, ...) are chunked.
In the context of Retrieval-Augmented Generation (RAG), an optimized chunking method can produce smaller, more focused chunks, which means:
- **Better relevancy of chunks** (and thus easier identification of useful chunks through embedding cosine similarity)
- **Fewer errors** caused by chunks exceeding the API's token limit
- **Fewer hallucinations** from generation models, since the prompt carries less superfluous information
- **Reduced cost**, as the prompt is smaller

## ⬇️ Installation

Install from PyPI by running the following command:
```pip install chunknorris```

## 🚀 Quick usage

You can invoke chunknorris directly on any **.md**, **.html** or **.pdf** file by running the following command in your terminal:

```chunknorris --filepath "path/to/myfile.pdf"```

See ``chunknorris -h`` for available options. Feel free to experiment 🧪!

## ⚙️ How it works

ChunkNorris relies on 3 components:
- **Parsers**: they handle the cleaning and formatting of your input document. You may use any parser suited to your needs (e.g. ``PdfParser`` for parsing PDF documents, ``MarkdownParser`` for parsing Markdown documents).
- **Chunkers**: they take the output of the parser and split it into chunks.
- **Pipelines**: they combine a parser and a chunker, so that you can get chunks directly from your input documents.

### Parsers

The role of parsers is to take a file or a string as input and output a clean, formatted string suited for a chunker. As of today, **each file type relies on a dedicated parser**. For example, you may use ``MarkdownParser`` for parsing Markdown files or strings, or ``PdfParser`` for parsing PDF files.

All parsers output a Markdown-formatted string. Markdown is a great format to use in RAG applications, as it is very well understood by LLMs.

### Chunkers

![](./docs/assets/chunking_method.png)

The role of chunkers is to process the output of parsers in order to obtain relevant chunks of the document. As of today, only ``MarkdownChunker`` is available. Used in conjunction with the parsers, it can process a variety of inputs.

The chunking strategy of chunkers is based on several principles:
- **Each chunk must carry homogeneous information.** To this end, chunkers use the document's headers to split it. This helps ensure that a specific piece of information is not split across multiple chunks.
- **Each chunk must keep contextual information.** A section of a document might lose its meaning if the reader has no knowledge of its context. Consequently, the headers of all parent sections are added at the top of the chunk.
- **All chunks must be of similar sizes.** When retrieving chunks relevant to a query, embedding models tend to be sensitive to chunk length. A chunk whose text is of similar length to the query is likely to get a high similarity score, while a longer chunk will see its similarity score decrease despite its relevancy. To prevent this, chunkers try to keep chunks of similar sizes whenever possible.
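
To make the first two principles concrete, here is a minimal, self-contained sketch of header-based chunking in which each chunk is prefixed with the headers of its parent sections. It is an illustration only, not ChunkNorris' actual implementation: the function name `chunk_by_headers` and its `max_level` parameter are hypothetical, and the sketch ignores size balancing and does not handle `#` lines inside fenced code blocks.

```python
import re

# Matches ATX-style markdown headers such as "## Title"
HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_headers(markdown: str, max_level: int = 4) -> list[str]:
    """Hypothetical helper: split on headers up to max_level and
    prefix every chunk with the headers of its parent sections."""
    chunks: list[str] = []
    path: list[tuple[int, str]] = []  # (level, header line) from root to current section
    body: list[str] = []  # lines of the section being accumulated

    def flush() -> None:
        # Emit the accumulated body, preceded by the current header path
        text = "\n".join(body).strip()
        if text:
            headers = "\n".join(line for _, line in path)
            chunks.append(f"{headers}\n\n{text}" if headers else text)
        body.clear()

    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match and len(match.group(1)) <= max_level:
            flush()
            level = len(match.group(1))
            # Drop headers that are not parents of the new section
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, line.strip()))
        else:
            # Regular text, or a header deeper than max_level
            body.append(line)
    flush()
    return chunks


doc = "# Guide\nIntro text.\n## Install\nRun pip.\n### Linux\nUse apt.\n## Usage\nCall it."
for chunk in chunk_by_headers(doc):
    print(chunk, end="\n---\n")
```

In this sketch, the "Use apt." section is emitted together with its three parent headers (`# Guide`, `## Install`, `### Linux`), so it stays interpretable when retrieved on its own.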


### Pipelines

Pipelines are the glue that **sticks together a parser and a chunker**. They use both to process documents and ensure consistent output quality.

## Usage

You may find more detailed examples in the [examples section](./docs/examples) of the repo. Nevertheless, here is a basic example to get you started, assuming you need to chunk Markdown files.

```py
from chunknorris.parsers import MarkdownParser
from chunknorris.chunkers import MarkdownChunker
from chunknorris.pipelines import BasePipeline

# Instantiate the components
parser = MarkdownParser()
chunker = MarkdownChunker()
pipeline = BasePipeline(parser, chunker)

# Get some chunks!
chunks = pipeline.chunk_file(filepath="myfile.md")

# Print or save:
for chunk in chunks:
    print(chunk.get_text())
pipeline.save_chunks(chunks)
```

The ``BasePipeline`` is rather simple: it passes the parser's output to the chunker. While this is enough in most cases, you may sometimes need more advanced strategies.

Feel free to experiment with various combinations, or even to implement your own pipeline that suits your needs!
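
As a rough illustration of the parser-then-chunker composition described above, the following self-contained sketch wires a toy parser to a toy chunker. None of these classes come from ChunkNorris: `Parser`, `Chunker`, `StripHtmlCommentsParser`, `ParagraphChunker` and `ToyPipeline` are hypothetical stand-ins, and their method names are assumptions, not the library's interfaces.

```python
import re
from dataclasses import dataclass
from typing import Protocol


class Parser(Protocol):
    """Hypothetical interface: turns raw input into a markdown string."""

    def parse(self, raw: str) -> str: ...


class Chunker(Protocol):
    """Hypothetical interface: turns a markdown string into chunks."""

    def chunk(self, markdown: str) -> list[str]: ...


class StripHtmlCommentsParser:
    """Toy parser: removes HTML comments and surrounding whitespace."""

    def parse(self, raw: str) -> str:
        return re.sub(r"<!--.*?-->", "", raw, flags=re.DOTALL).strip()


class ParagraphChunker:
    """Toy chunker: one chunk per blank-line-separated paragraph."""

    def chunk(self, markdown: str) -> list[str]:
        return [p.strip() for p in markdown.split("\n\n") if p.strip()]


@dataclass
class ToyPipeline:
    """Feeds the parser's output straight into the chunker."""

    parser: Parser
    chunker: Chunker

    def chunk_string(self, raw: str) -> list[str]:
        return self.chunker.chunk(self.parser.parse(raw))


pipeline = ToyPipeline(StripHtmlCommentsParser(), ParagraphChunker())
chunks = pipeline.chunk_string("<!-- draft -->\n# Title\n\nFirst paragraph.\n\nSecond paragraph.")
print(chunks)
```

A more advanced pipeline could add steps between the two calls, for example filtering or post-processing the parsed string before it is chunked.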


### Advanced usage

Additionally, the chunkers and parsers accept a number of arguments that modify their behavior. For example:

```py
import tiktoken
from chunknorris.chunkers import MarkdownChunker

chunker = MarkdownChunker(
    max_headers_to_use="h4",
    max_chunk_word_count=250,
    hard_max_chunk_word_count=400,
    min_chunk_word_count=15,
    hard_max_chunk_token_count=8000,
    tokenizer=tiktoken.encoding_for_model("text-embedding-3-large"),
)
```

***max_headers_to_use***
(str): The maximum header level (inclusive) to take into account for chunking. For example, if "h3" is set, then "h4" and "h5" headers won't be used. Must be a string of the form "hx", with x being the header level. Defaults to "h4".

***max_chunk_word_count***
(int): The maximum size (soft limit, in words) of a chunk. Chunks bigger than this size will be split further using lower-level headers, until no lower-level headers are available. Defaults to 200.

***hard_max_chunk_word_count***
(int): The hard limit on the number of words in a chunk. Chunks exceeding this limit will be split into subchunks at newlines, and ChunkNorris will try to balance the sizes of the resulting subchunks. It should be greater than max_chunk_word_count. Defaults to 400.

***min_chunk_word_count***
(int): The minimum number of words a chunk must have to be kept. Chunks with fewer words will be discarded. Defaults to 15.

***hard_max_chunk_token_count***
(int | None): The hard limit on the number of tokens in a chunk. It is applied after the word-based chunking. Every chunk bigger (in tokens) than this number will be split into subchunks. ChunkNorris will try to balance the sizes of the resulting subchunks, still splitting at newlines to avoid arbitrary cuts. If None, this parameter won't be used. If an int value is provided, then a tokenizer must be provided as well. Defaults to None.

***tokenizer***
(Any | None): The tokenizer used to count tokens. Can be any instance of a class that has an ``.encode()`` method that takes a string as input and returns a list of tokens. Must be provided if ``hard_max_chunk_token_count`` is set. Defaults to None.
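
To illustrate the ``.encode()`` contract and the idea of balanced, newline-based splitting under a token limit, here is a minimal sketch. It is not ChunkNorris' implementation: `WhitespaceTokenizer` and `split_balanced` are hypothetical names, the toy tokenizer simply counts whitespace-separated words, and the greedy strategy leaves a single line longer than the limit intact.

```python
import math


class WhitespaceTokenizer:
    """Toy tokenizer (hypothetical): one token per whitespace-separated word.

    It exposes the only thing the description above requires:
    an .encode() method that maps a string to a list of tokens.
    """

    def encode(self, text: str) -> list[str]:
        return text.split()


def split_balanced(text: str, max_tokens: int, tokenizer) -> list[str]:
    """Hypothetical helper: split text at newlines into subchunks of
    roughly equal token counts, each aiming to stay under max_tokens."""
    total = len(tokenizer.encode(text))
    if total <= max_tokens:
        return [text]

    # Aim for equally sized parts rather than filling each part to the limit
    n_parts = math.ceil(total / max_tokens)
    target = total / n_parts

    parts: list[str] = []
    current: list[str] = []
    count = 0
    for line in text.split("\n"):
        line_tokens = len(tokenizer.encode(line))
        # Close the current part if adding this line would overshoot the target
        if current and count + line_tokens > target:
            parts.append("\n".join(current))
            current, count = [], 0
        current.append(line)
        count += line_tokens
    parts.append("\n".join(current))
    return parts


tokenizer = WhitespaceTokenizer()
text = "\n".join(["a b c", "d e f", "g h i", "j k l"])
print(split_balanced(text, max_tokens=8, tokenizer=tokenizer))
```

With a limit of 8 tokens, the 12-token text is split into two 6-token parts instead of an 8-token part followed by a 4-token remainder. Any object with a compatible ``.encode()`` method, such as the tiktoken encoding used in the example above, could replace the toy tokenizer in this sketch.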

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "chunknorris",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": null,
    "keywords": "chunk, document, split, html, markdown, pdf, header, parsing, RAG",
    "author": null,
    "author_email": "Wikit <dev@wikit.ai>",
    "download_url": "https://files.pythonhosted.org/packages/08/47/7b48f72545e9c20c5920c7e46fc5b02f84caab8141e7d6504c2914444819/chunknorris-1.1.4.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "A package for chunking documents from various formats",
    "version": "1.1.4",
    "project_urls": {
        "Documentation": "https://wikit-ai.github.io/chunknorris/",
        "Homepage": "https://github.com/wikit-ai/chunknorris",
        "Issues": "https://github.com/wikit-ai/chunknorris/issues"
    },
    "split_keywords": [
        "chunk",
        " document",
        " split",
        " html",
        " markdown",
        " pdf",
        " header",
        " parsing",
        " rag"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "f555a25b6e32c7494d021dbdca5b571e5cb718bf715211b7833ed2ff1fcd6454",
                "md5": "2ee1865231abee5629183baf9d5c0255",
                "sha256": "58831c845f86d33ff4644ec57064eede1a7870855ab976249496e780de94b8df"
            },
            "downloads": -1,
            "filename": "chunknorris-1.1.4-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "2ee1865231abee5629183baf9d5c0255",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10",
            "size": 66579,
            "upload_time": "2025-07-10T12:29:32",
            "upload_time_iso_8601": "2025-07-10T12:29:32.235039Z",
            "url": "https://files.pythonhosted.org/packages/f5/55/a25b6e32c7494d021dbdca5b571e5cb718bf715211b7833ed2ff1fcd6454/chunknorris-1.1.4-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "08477b48f72545e9c20c5920c7e46fc5b02f84caab8141e7d6504c2914444819",
                "md5": "a948adc013206ed097bc96b479519c0a",
                "sha256": "2b7722e94d35997cd2b74042e5a0ba5a37b9e7c9cfe8146de691e2beda23cb02"
            },
            "downloads": -1,
            "filename": "chunknorris-1.1.4.tar.gz",
            "has_sig": false,
            "md5_digest": "a948adc013206ed097bc96b479519c0a",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 51647,
            "upload_time": "2025-07-10T12:29:33",
            "upload_time_iso_8601": "2025-07-10T12:29:33.610695Z",
            "url": "https://files.pythonhosted.org/packages/08/47/7b48f72545e9c20c5920c7e46fc5b02f84caab8141e7d6504c2914444819/chunknorris-1.1.4.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-07-10 12:29:33",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "wikit-ai",
    "github_project": "chunknorris",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [
        {
            "name": "markdownify",
            "specs": [
                [
                    "==",
                    "1.1.0"
                ]
            ]
        },
        {
            "name": "lxml",
            "specs": [
                [
                    "==",
                    "6.0.0"
                ]
            ]
        },
        {
            "name": "html5lib",
            "specs": [
                [
                    "==",
                    "1.1"
                ]
            ]
        },
        {
            "name": "termcolor",
            "specs": [
                [
                    "==",
                    "2.4"
                ]
            ]
        },
        {
            "name": "PyMuPDF",
            "specs": [
                [
                    "==",
                    "1.25.5"
                ]
            ]
        },
        {
            "name": "matplotlib",
            "specs": [
                [
                    "==",
                    "3.9.2"
                ]
            ]
        },
        {
            "name": "pydantic",
            "specs": [
                [
                    "<",
                    "3.0.0"
                ],
                [
                    ">=",
                    "2.0"
                ]
            ]
        },
        {
            "name": "pandas",
            "specs": [
                [
                    "==",
                    "2.2.3"
                ]
            ]
        },
        {
            "name": "pyyaml",
            "specs": [
                [
                    "==",
                    "6.0.2"
                ]
            ]
        },
        {
            "name": "openpyxl",
            "specs": [
                [
                    "==",
                    "3.1.5"
                ]
            ]
        },
        {
            "name": "thefuzz",
            "specs": [
                [
                    "==",
                    "0.22.1"
                ]
            ]
        },
        {
            "name": "tabulate",
            "specs": [
                [
                    "==",
                    "0.9.0"
                ]
            ]
        },
        {
            "name": "nbformat",
            "specs": [
                [
                    "==",
                    "5.10.4"
                ]
            ]
        },
        {
            "name": "mammoth",
            "specs": [
                [
                    "==",
                    "1.9.0"
                ]
            ]
        }
    ],
    "lcname": "chunknorris"
}
