chunkit


* Name: chunkit
* Version: 0.2.9
* Home page: https://github.com/hypergrok/chunkit
* Summary: Convert URLs and files into LLM-friendly markdown chunks
* Upload time: 2024-09-01 18:35:04
* Maintainer: None
* Docs URL: None
* Author: hypergrok
* Requires Python: None
* License: None
            <p align="center">
  <img src="https://raw.githubusercontent.com/hypergrok/chunkit/main/chn.png" alt="chunkit" width="200"/>
</p>

<div align="center">
  <a href="https://badge.fury.io/py/chunkit"><img src="https://badge.fury.io/py/chunkit.svg" alt="PyPI version" /></a>
  <a href="https://pepy.tech/project/chunkit"><img src="https://pepy.tech/badge/chunkit" alt="Downloads" /></a>
  <a href="https://www.gnu.org/licenses/gpl-3.0.html"><img src="https://img.shields.io/badge/License-GPL%20v3-blue.svg" alt="License: GPL v3" /></a>
</div>

<h3 align="center">Turn URLs into LLM-friendly markdown chunks</h3>

Chunkit scrapes webpages and converts them into markdown chunks for RAG applications.

### Quickstart

1) Install

```bash
pip install chunkit
```

2) Start chunking

```python
from chunkit import Chunker

# Initialize the Chunker
chunker = Chunker()

# Define URLs to process
urls = ["https://en.wikipedia.org/wiki/Chunking_(psychology)"]

# Process the URLs into markdown chunks
chunkified_urls = chunker.process(urls)

# Output the resulting chunks
for url in chunkified_urls:
    if url['success']:
        for chunk in url['chunks']:
            print(chunk)
```

<details>
  <summary>Example results for above Wikipedia page</summary>

#### Chunk 1
```markdown
### Chunking (psychology)

In cognitive psychology, **chunking** is a process by which small individual pieces of a set of information are bound together to create a meaningful whole later on in memory. The chunks, by which the information is grouped, are meant to improve short-term retention of the material, thus bypassing the limited capacity of working memory...
```
#### Chunk 2
```markdown
### Modality effect

A modality effect is present in chunking. That is, the mechanism used to convey the list of items to the individual affects how much "chunking" occurs. Experimentally, it has been found that auditory presentation results in a larger amount of grouping in the responses of individuals than visual presentation does...
```
#### Chunk 3
```markdown
### Memory training systems, mnemonic

Various kinds of memory training systems and mnemonics include training and drills in specially-designed recoding or chunking schemes. Such systems existed before Miller's paper, but there was no convenient term to describe the general strategy and no substantive and reliable research...
```
Etc.

</details>


### How most chunkers work

Most chunkers:

* Split content naively by word count, for example every 200 words with a 30-word overlap between consecutive chunks.
* This produces messy chunks that are noisy and padded with irrelevant surrounding text.
* Sentences are routinely cut in the middle, so meaning is lost at chunk boundaries.
* The result is poor LLM performance, with incorrect answers and hallucinations. A minimal sketch of this naive approach follows below.
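For reference, here is a minimal sketch of that fixed-size approach. The `naive_chunk` helper, its 200-word size, and its 30-word overlap are illustrative assumptions, not part of Chunkit.

```python
# Illustrative only: naive fixed-size chunking with overlap (NOT Chunkit's approach).
def naive_chunk(text: str, size: int = 200, overlap: int = 30) -> list[str]:
    words = text.split()
    step = size - overlap
    # Walk the word list in fixed steps; each chunk repeats the last `overlap`
    # words of the previous one, and sentence boundaries are ignored entirely.
    return [" ".join(words[start:start + size]) for start in range(0, len(words), step)]

sample = " ".join(f"word{i}" for i in range(1000))  # stand-in for any long document
chunks = naive_chunk(sample)
print(len(chunks), "chunks; boundaries ignore sentence and section structure")
```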

### Why Chunkit works better

Chunkit instead converts HTML to Markdown and then determines split points based on the most common header level.

This gives you better results because:

* Online content tends to be logically organized into sections delimited by headers.
* Chunking on those headers preserves the semantic structure of the page.
* Each chunk is a cleaner, semantically cohesive section of content. A sketch of the idea follows below.
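To make the idea concrete, here is a minimal sketch of header-based splitting on an already-converted Markdown string. It illustrates the technique only; `split_on_headers` is a hypothetical helper, not Chunkit's internal API.

```python
import re

def split_on_headers(markdown: str) -> list[str]:
    """Split Markdown into sections at its most common header level (illustrative sketch)."""
    # Collect the header prefixes (#, ##, ###, ...) that appear in the document.
    levels = re.findall(r"^(#{1,6}) ", markdown, flags=re.MULTILINE)
    if not levels:
        return [markdown]
    # Split right before each header of the most frequent level, keeping the
    # header line together with the section it introduces.
    level = max(set(levels), key=levels.count)
    parts = re.split(rf"^(?={re.escape(level)} )", markdown, flags=re.MULTILINE)
    return [part.strip() for part in parts if part.strip()]

doc = "### Chunking (psychology)\nIn cognitive psychology...\n\n### Modality effect\nA modality effect is present...\n"
for chunk in split_on_headers(doc):
    print(chunk, "\n---")
```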

### Supported filetypes

This free, open-source package primarily chunks webpages and HTML.

### License

This project is licensed under the GPL v3; see the [LICENSE](LICENSE) file for details.

### Contact

For questions or support, please open an issue. 

            
