| Field | Value |
| --- | --- |
| Name | llm-text-splitter |
| Version | 0.2.0 |
| Summary | A lightweight, rule-based text splitter for LLM context window management; handles multiple file formats and enriches chunks with metadata. |
| Upload time | 2025-07-24 12:21:01 |
| Requires Python | >=3.12 |
| Keywords | llm, text-splitter, chunking, rag, document-processing |
| Requirements | No requirements were recorded. |
# **LLM Text Splitter v0.2.0**

A lightweight, rule-based text splitter designed for preparing long documents for Large Language Model (LLM) context windows. It intelligently breaks down text into manageable chunks, prioritizing meaningful structural breaks (like paragraphs or lines) before resorting to arbitrary character limits.
## Key Features
* **All-in-One Installation:** Handles `.pdf`, `.docx`, `.html`, and plain text files out-of-the-box with a single installation.
* **Rich Metadata:** Each chunk is returned as a dictionary containing the text **content** and its **metadata** (e.g., source filename, path, chunk index), which is crucial for RAG (Retrieval-Augmented Generation) and source tracking.
* **Robust Recursive Splitting:** Employs a powerful recursive splitting strategy that prioritizes semantic boundaries (paragraphs, then lines, then sentences) before falling back to character splits.
* **Configurable Overlap:** Maintains context across hard splits with configurable character overlap.
* **Modular & Extensible:** Built with a clean `readers` architecture, making it easy to add support for new file types in the future.
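
The recursive strategy described above — try the largest semantic separator first, fall back to smaller ones, and only hard-split by characters as a last resort — can be sketched in a few lines of plain Python. This is an illustrative sketch of the general technique, not the library's actual implementation:

```python
# Illustrative sketch of rule-based recursive splitting (not the library's code).
# Try each separator from largest (paragraph) to smallest (sentence); only
# fall back to a hard character split when no separator fits within the limit.

def recursive_split(text, max_chars=100, separators=("\n\n", "\n", ". ")):
    if len(text) <= max_chars:
        return [text]
    for sep in separators:
        parts = text.split(sep)
        if len(parts) > 1:
            chunks, current = [], ""
            for part in parts:
                candidate = current + sep + part if current else part
                if len(candidate) <= max_chars:
                    current = candidate
                else:
                    if current:
                        chunks.append(current)
                    current = part
            if current:
                chunks.append(current)
            # Recurse in case a single part still exceeds the limit.
            return [c for chunk in chunks
                    for c in recursive_split(chunk, max_chars, separators)]
    # No separator helped: fall back to a hard character split.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
```

Paragraph boundaries survive whenever they fit the budget; only oversized runs with no usable separator get cut mid-stream.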
## Installation
You can install `llm-text-splitter` using pip:
```bash
pip install llm-text-splitter
```
## Usage
Here's how to use the `LLMTextSplitter` in your Python projects:
1. Splitting a File
```python
from llm_text_splitter import LLMTextSplitter

# Assume you have 'my_report.pdf' and 'my_notes.txt'

# Initialize the splitter with a target chunk size and overlap
splitter = LLMTextSplitter(max_chunk_chars=1000, overlap_chars=100)

try:
    # Process a PDF file
    pdf_chunks = splitter.split_file("my_report.pdf")
    print(f"Split 'my_report.pdf' into {len(pdf_chunks)} chunks.")

    # Each chunk is a dictionary with 'content' and 'metadata'
    print("\n--- First PDF Chunk ---")
    print("Content:", pdf_chunks[0]['content'][:200] + "...")  # Print first 200 chars
    print("Metadata:", pdf_chunks[0]['metadata'])

    print("\n" + "=" * 50 + "\n")

    # Process a plain text file
    txt_chunks = splitter.split_file("my_notes.txt")
    print(f"Split 'my_notes.txt' into {len(txt_chunks)} chunks.")

    print("\n--- First TXT Chunk ---")
    print("Content:", txt_chunks[0]['content'])
    print("Metadata:", txt_chunks[0]['metadata'])

except FileNotFoundError as e:
    print(e)
except Exception as e:
    print(f"An error occurred: {e}")
```
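Because every chunk carries its metadata, downstream RAG pipelines can trace retrieved text back to its source document. A minimal sketch of that idea, using plain dicts that mirror the `{'content': ..., 'metadata': ...}` chunk shape shown above (this helper is hypothetical, not part of the library's API):

```python
# Group chunk contents by their source for simple source tracking.
# Assumes the {'content': ..., 'metadata': ...} chunk shape shown above.
from collections import defaultdict

def group_by_source(chunks):
    grouped = defaultdict(list)
    for chunk in chunks:
        source = chunk["metadata"].get("source", "unknown")
        grouped[source].append(chunk["content"])
    return dict(grouped)
```

After retrieval, a lookup like this lets you cite exactly which files the answer came from.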
2. Splitting a Raw Text String
Use `split_text` if you already have your text content in a string variable.
```python
from llm_text_splitter import LLMTextSplitter
long_text = "This is the first paragraph. It contains multiple sentences.\n\nThis is the second paragraph. It is also quite long and will be chunked according to the recursive splitting rules to maintain semantic meaning where possible."
# Initialize splitter with a small chunk size for demonstration
splitter = LLMTextSplitter(max_chunk_chars=100, overlap_chars=15)
# Split the text string
chunks = splitter.split_text(long_text, base_metadata={"source": "manual_input"})
print(f"Split text into {len(chunks)} chunks:\n")
for chunk in chunks:
    print(f"Content: {chunk['content']}")
    print(f"Metadata: {chunk['metadata']}")
    print("-" * 20)
```
## Example Output
```text
Split text into 2 chunks:
Content: This is the first paragraph. It contains multiple sentences.
Metadata: {'source': 'manual_input', 'chunk_index': 0}
--------------------
Content: This is the second paragraph. It is also quite long and will be chunked according to the recursive splitting rules to maintain semantic meaning where possible.
Metadata: {'source': 'manual_input', 'chunk_index': 1}
--------------------
```
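When a hard character split is unavoidable, `overlap_chars` repeats the tail of each chunk at the start of the next, so a sentence cut at the boundary keeps some surrounding context. The mechanism can be sketched like this (an illustrative sketch of the overlap idea, not the library's implementation):

```python
# Illustrative hard character split with overlap: each chunk starts with the
# last `overlap` characters of the previous chunk.

def split_with_overlap(text, max_chars, overlap):
    chunks, i = [], 0
    step = max_chars - overlap  # advance by less than a full chunk
    while i < len(text):
        chunks.append(text[i:i + max_chars])
        if i + max_chars >= len(text):
            break
        i += step
    return chunks
```

For example, with `max_chars=6` and `overlap=2`, `"abcdefghij"` yields `["abcdef", "efghij"]`: the `"ef"` at the end of the first chunk reappears at the start of the second.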
## Raw data
{
"_id": null,
"home_page": null,
"name": "llm-text-splitter",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.12",
"maintainer_email": null,
"keywords": "LLM, text-splitter, chunking, RAG, document-processing",
"author": null,
"author_email": "Mohamed Elghobary <m.abdeltawab.elghobary@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/f1/af/778f6d18a2a5cef2379542ee05f30b5d49e9c21d51282b07b1f271b61e7b/llm_text_splitter-0.2.0.tar.gz",
"platform": null,
"description": "# **LLM Text Splitter v0.2.0**\n\n\nA lightweight, rule-based text splitter designed for preparing long documents for Large Language Model (LLM) context windows. It intelligently breaks down text into manageable chunks, prioritizing meaningful structural breaks (like paragraphs or lines) before resorting to arbitrary character limits.\n\n## Key Features\n\n* **All-in-One Installation:** Handles `.pdf`, `.docx`, `.html`, and plain text files out-of-the-box with a single installation.\n* **Rich Metadata:** Each chunk is returned as a dictionary containing the text **content** and its **metadata** (e.g., source filename, path, chunk index), which is crucial for RAG (Retrieval-Augmented Generation) and source tracking.\n* **Robust Recursive Splitting:** Employs a powerful recursive splitting strategy that prioritizes semantic boundaries (paragraphs, then lines, then sentences) before falling back to character splits.\n* **Configurable Overlap:** Maintains context across hard splits with configurable character overlap.\n* **Modular & Extensible:** Built with a clean `readers` architecture, making it easy to add support for new file types in the future.\n\n## Installation\n\nYou can install `llm-text-splitter` using pip:\n\n```bash\npip install llm-text-splitter\n```\n\n## Usage\nHere's how to use the LLMTextSplitter in your Python projects:\n\n1. 
Splitting a File\n\n```python\nfrom llm_text_splitter import LLMTextSplitter\n\n# Assume you have 'my_report.pdf' and 'my_notes.txt'\n\n# Initialize the splitter with a target chunk size and overlap\nsplitter = LLMTextSplitter(max_chunk_chars=1000, overlap_chars=100)\n\ntry:\n # Process a PDF file\n pdf_chunks = splitter.split_file(\"my_report.pdf\")\n print(f\"Split 'my_report.pdf' into {len(pdf_chunks)} chunks.\")\n\n # Each chunk is a dictionary with 'content' and 'metadata'\n print(\"\\n--- First PDF Chunk ---\")\n print(\"Content:\", pdf_chunks[0]['content'][:200] + \"...\") # Print first 200 chars\n print(\"Metadata:\", pdf_chunks[0]['metadata'])\n \n print(\"\\n\" + \"=\"*50 + \"\\n\")\n\n # Process a plain text file\n txt_chunks = splitter.split_file(\"my_notes.txt\")\n print(f\"Split 'my_notes.txt' into {len(txt_chunks)} chunks.\")\n\n print(\"\\n--- First TXT Chunk ---\")\n print(\"Content:\", txt_chunks[0]['content'])\n print(\"Metadata:\", txt_chunks[0]['metadata'])\n\nexcept FileNotFoundError as e:\n print(e)\nexcept Exception as e:\n print(f\"An error occurred: {e}\")\n```\n2. Splitting a Raw Text String\nUse split_text if you already have your text content in a string variable.\n\n```python\nfrom llm_text_splitter import LLMTextSplitter\n\nlong_text = \"This is the first paragraph. It contains multiple sentences.\\n\\nThis is the second paragraph. 
It is also quite long and will be chunked according to the recursive splitting rules to maintain semantic meaning where possible.\"\n\n# Initialize splitter with a small chunk size for demonstration\nsplitter = LLMTextSplitter(max_chunk_chars=100, overlap_chars=15)\n\n# Split the text string\nchunks = splitter.split_text(long_text, base_metadata={\"source\": \"manual_input\"})\n\nprint(f\"Split text into {len(chunks)} chunks:\\n\")\n\nfor chunk in chunks:\n print(f\"Content: {chunk['content']}\")\n print(f\"Metadata: {chunk['metadata']}\")\n print(\"-\" * 20)\n```\n## Example Output:\n```bash\nSplit text into 2 chunks:\n\nContent: This is the first paragraph. It contains multiple sentences.\nMetadata: {'source': 'manual_input', 'chunk_index': 0}\n--------------------\nContent: This is the second paragraph. It is also quite long and will be chunked according to the recursive splitting rules to maintain semantic meaning where possible.\nMetadata: {'source': 'manual_input', 'chunk_index': 1}\n--------------------\n```\n",
"bugtrack_url": null,
"license": null,
"summary": "A lightweight, rule-based text splitter for LLM context window management, handles multiple file formats and enriches chunks with metadata.",
"version": "0.2.0",
"project_urls": {
"Bug Tracker": "https://github.com/MohamedElghobary/llm_text_splitter/issues",
"Homepage": "https://github.com/MohamedElghobary/llm_text_splitter",
"Source Code": "https://github.com/MohamedElghobary/llm_text_splitter"
},
"split_keywords": [
"llm",
" text-splitter",
" chunking",
" rag",
" document-processing"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "e1a1d23c7bf21de67da3e2a099a76ff07ba52559934977988b46477663a0a639",
"md5": "0d56f1da1b20b3e3265ed3122e618920",
"sha256": "04cba0257ca0370fc0334f9062a4cf69d8b3ef0b9d3c103b7260a8fc49d27a9d"
},
"downloads": -1,
"filename": "llm_text_splitter-0.2.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "0d56f1da1b20b3e3265ed3122e618920",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.12",
"size": 7116,
"upload_time": "2025-07-24T12:21:00",
"upload_time_iso_8601": "2025-07-24T12:21:00.184774Z",
"url": "https://files.pythonhosted.org/packages/e1/a1/d23c7bf21de67da3e2a099a76ff07ba52559934977988b46477663a0a639/llm_text_splitter-0.2.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "f1af778f6d18a2a5cef2379542ee05f30b5d49e9c21d51282b07b1f271b61e7b",
"md5": "137546720ab7a38e8d69bd58851f117f",
"sha256": "5f5e9d9f648478fc18bc0fb94ddcc7014f0683c27d5a2288c25af1de4c7b62b7"
},
"downloads": -1,
"filename": "llm_text_splitter-0.2.0.tar.gz",
"has_sig": false,
"md5_digest": "137546720ab7a38e8d69bd58851f117f",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.12",
"size": 8180,
"upload_time": "2025-07-24T12:21:01",
"upload_time_iso_8601": "2025-07-24T12:21:01.668396Z",
"url": "https://files.pythonhosted.org/packages/f1/af/778f6d18a2a5cef2379542ee05f30b5d49e9c21d51282b07b1f271b61e7b/llm_text_splitter-0.2.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-07-24 12:21:01",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "MohamedElghobary",
"github_project": "llm_text_splitter",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "llm-text-splitter"
}