splitter-mr


Name: splitter-mr
Version: 1.0.1
Summary: A modular text splitting library.
Upload time: 2025-09-13 11:55:16
Requires Python: >=3.11
License: MIT License. Copyright (c) 2025 Andrés Herencia. Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
requirements accelerate aiohappyeyeballs aiohttp aiolimiter aiosignal annotated-types anyio appnope argon2-cffi argon2-cffi-bindings arrow asttokens async-lru attrs audioop-lts autoflake azure-ai-documentintelligence azure-core azure-identity babel backrefs beautifulsoup4 bert-score black bleach blis bracex build cachetools catalogue certifi cffi cfgv charset-normalizer click cloudpathlib cobble colorama coloredlogs comm confection contourpy coverage cramjam cryptography cssselect2 cycler cymem debugpy decorator defusedxml dill distlib distro docling docling-core docling-ibm-models docling-parse docutils easyocr et-xmlfile executing faker fastjsonschema fastparquet ffmpeg ffmpeg-downloader filelock filetype flake8 flatbuffers fonttools fqdn frozenlist fsspec ghp-import gitdb gitpython google-auth google-genai griffe grpcio h11 h2 hf-xet hpack httpcore httpx huggingface-hub humanfriendly hyperframe id identify idna imageio importlib-metadata iniconfig iprogress ipykernel ipython ipython-pygments-lexers ipywidgets isodate isoduration isort jaraco-classes jaraco-context jaraco-functools jedi jinja2 jiter joblib json5 jsonlines jsonpatch jsonpointer jsonref jsonschema jsonschema-specifications jupyter jupyter-client jupyter-console jupyter-core jupyter-events jupyter-lsp jupyter-server jupyter-server-terminals jupyterlab jupyterlab-pygments jupyterlab-server jupyterlab-widgets keyring kiwisolver langchain-core langchain-text-splitters langcodes langsmith language-data lark latex2mathml lazy-loader lxml magika mammoth marisa-trie markdown markdown-it-py markdownify markitdown marko markupsafe matplotlib matplotlib-inline mccabe mdurl mergedeep mistune mkdocs mkdocs-autorefs mkdocs-awesome-pages-plugin mkdocs-get-deps mkdocs-glightbox mkdocs-material mkdocs-material-extensions mkdocstrings mkdocstrings-python more-itertools mpire mpmath msal msal-extensions multidict multiprocess murmurhash mypy-extensions natsort nbclient nbconvert nbformat nest-asyncio networkx nh3 
ninja nltk nodeenv notebook notebook-shim numpy olefile onnxruntime openai opencv-python-headless openpyxl opentelemetry-api opentelemetry-sdk opentelemetry-semantic-conventions orjson packaging paginate pandas pandocfilters parso pastel pathspec pdfminer-six pdfplumber pexpect pillow pip platformdirs pluggy poethepoet polyfactory portalocker pre-commit preshed prometheus-client prompt-toolkit propcache protobuf psutil ptyprocess pure-eval pyasn1 pyasn1-modules pyclipper pycodestyle pycparser pydantic pydantic-core pydantic-settings pydub pyflakes pygments pyjwt pylatexenc pymdown-extensions pymupdf pypandoc pyparsing pypdf pypdfium2 pyproject-autoflake pyproject-hooks pytest pytest-cov python-bidi python-dateutil python-docx python-dotenv python-json-logger python-pptx pytz pyupgrade pyyaml pyyaml-env-tag pyzmq qdrant-client readme-renderer referencing regex reportlab requests requests-mock requests-toolbelt rfc3339-validator rfc3986 rfc3986-validator rfc3987-syntax rich rpds-py rsa rtree ruff safetensors scikit-image scikit-learn scipy selectolax semchunk send2trash sentence-transformers setuptools setuptools-scm shapely shellingham six smart-open smmap sniffio soupsieve spacy spacy-legacy spacy-loggers speechrecognition srsly stack-data standard-aifc standard-chunk svglib sympy tabulate tenacity terminado thinc threadpoolctl tifffile tiktoken tinycss2 tokenize-rt tokenizers toml torch torchvision tornado tqdm traitlets transformers twine typer types-python-dateutil typing-extensions typing-inspection tzdata uri-template urllib3 uv virtualenv voyageai wasabi watchdog wcmatch wcwidth weasel webcolors webencodings websocket-client websockets wheel widgetsnbextension wrapt xai-sdk xlrd xlsxwriter yarl youtube-transcript-api zipp zstandard
# **SplitterMR**

**SplitterMR** is a library for chunking data into convenient text blocks compatible with your LLM applications.

<img src="https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/docs/assets/splitter_mr_logo.svg#gh-light-mode-only" alt="SplitterMR logo" width=100%/>
<img src="https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/docs/assets/splitter_mr_logo_white.svg#gh-dark-mode-only" alt="SplitterMR logo" width=100%/>

> [!IMPORTANT] 
> 
> **Version 1.0.0 released: First Stable Release!**
> 
> **We are excited to announce the first stable release of SplitterMR (v1.0.0)!** Install it with the following command:
> 
> ```bash
> pip install splitter-mr
> ```
> 
> **Highlights:**
> 
> - 🚀 [**Stable API**](#core-install) consolidating all v0.x features.
> - 📖 **[Readers](https://andreshere00.github.io/Splitter_MR/api_reference/reader/):** Plug-and-play support for Vanilla, MarkItDown, and Docling, covering formats like text, Office, JSON/YAML, images, HTML, and more.
> - 🪓 **[Splitters](https://andreshere00.github.io/Splitter_MR/api_reference/splitter/):** Extensive library of split strategies, including character, word, sentence, paragraph, token, paged, row/column, JSON, semantic, HTML tag, header, and code splitters.
> - 🧠 **[Models](https://andreshere00.github.io/Splitter_MR/api_reference/model/):** Multimodal Vision-Language support for OpenAI, Azure, Grok, HuggingFace, Gemini, Claude, and more.
> - 🗺️ **[Embeddings](https://andreshere00.github.io/Splitter_MR/api_reference/embedding/):** Fully integrated embeddings from OpenAI, Azure, HuggingFace, Gemini, and Claude (via Voyage).
> - 🎛️ [**Extras system:**](#multiple-extras) Install the minimal core, or extend with `markitdown`, `docling`, `multimodal`, or `all` for a batteries-included setup.
> - 📚 **[Docs](https://andreshere00.github.io/Splitter_MR/):** New API reference, real executed notebook examples, and updated architecture diagrams.
> - 🔧 **Developer Experience:** CI/CD pipeline, PyPI publishing, pre-commit checks, and improved cleaning instructions.
> - 🐛 **Bugfixes:** Improved NLTK tokenizers, more robust splitters, and new utilities for HTML => Markdown conversion.
> 
> **Check out the updated documentation, new examples, and join us in making text splitting and document parsing easier than ever!**
>
> **Version 1.0.1 released: `KeywordSplitter`**
>
> This splitter divides text based on specific regex patterns or keywords. See the documentation [**here**](https://andreshere00.github.io/Splitter_MR/api_reference/splitter/#keywordsplitter).

## Features

### Different input formats

SplitterMR can read data from multiple sources and file types. To read files, it uses the Reader components, which inherit from an abstract base class, `BaseReader`. A reader lets you load a file as a properly formatted string, or convert it into another format (such as `markdown` or `json`).

Currently, three readers are supported: `VanillaReader`, `MarkItDownReader` and `DoclingReader`. These are the differences between the Reader components:

| **Reader**             | **Unstructured files & PDFs** | **MS Office suite files** | **Tabular data** | **Files with hierarchical schema** | **Image files** | **Markdown conversion** |
|------------------------|-------------------------------|---------------------------|------------------|------------------------------------|-----------------|-------------------------|
| [**`VanillaReader`**](https://andreshere00.github.io/Splitter_MR/api_reference/reader/#vanillareader)    | `txt`, `md`, `pdf` | `xlsx`, `docx`, `pptx` | `csv`, `tsv`, `parquet` | `json`, `yaml`, `html`, `xml` | `jpg`, `png`, `webp`, `gif` | Yes                     |
| [**`MarkItDownReader`**](https://andreshere00.github.io/Splitter_MR/api_reference/reader/#markitdownreader) | `txt`, `md`, `pdf` | `docx`, `xlsx`, `pptx` | `csv`, `tsv` | `json`, `html`, `xml`                    | `jpg`, `png`, `jpeg`        | Yes                     |
| [**`DoclingReader`**](https://andreshere00.github.io/Splitter_MR/api_reference/reader/#doclingreader)    | `txt`, `md`, `pdf` | `docx`, `xlsx`, `pptx` | –            | `html`, `xhtml`                 | `png`, `jpeg`, `tiff`, `bmp`, `webp` | Yes                     |

### Several splitting methods

SplitterMR allows you to split files in many different ways depending on your needs. The available splitting methods are described in the following table:

| Splitting Technique | Description |
| ------------------------- | -----------------------------|
| [**Character Splitter**](https://andreshere00.github.io/Splitter_MR/api_reference/splitter/#charactersplitter)    | Splits text into chunks based on a specified number of characters. Supports overlapping by character count or percentage. <br> **Parameters:** `chunk_size` (max chars per chunk), `chunk_overlap` (overlapping chars: int or %). <br> **Compatible with:** Text. |
| [**Word Splitter** ](https://andreshere00.github.io/Splitter_MR/api_reference/splitter/#wordsplitter)        | Splits text into chunks based on a specified number of words. Supports overlapping by word count or percentage. <br> **Parameters:** `chunk_size` (max words per chunk), `chunk_overlap` (overlapping words: int or %). <br> **Compatible with:** Text. |
| [**Sentence Splitter**](https://andreshere00.github.io/Splitter_MR/api_reference/splitter/#sentencesplitter)     | Splits text into chunks by a specified number of sentences. Allows overlap defined by a number or percentage of words from the end of the previous chunk. Customizable sentence separators (e.g., `.`, `!`, `?`). <br> **Parameters:** `chunk_size` (max sentences per chunk), `chunk_overlap` (overlapping words: int or %), `sentence_separators` (list of characters). <br> **Compatible with:** Text. |
| [**Paragraph Splitter**](https://andreshere00.github.io/Splitter_MR/api_reference/splitter/#paragraphsplitter)    | Splits text into chunks based on a specified number of paragraphs. Allows overlapping by word count or percentage, and customizable line breaks. <br> **Parameters:** `chunk_size` (max paragraphs per chunk), `chunk_overlap` (overlapping words: int or %), `line_break` (delimiter(s) for paragraphs). <br> **Compatible with:** Text. |
| [**Recursive Splitter**](https://andreshere00.github.io/Splitter_MR/api_reference/splitter/#recursivesplitter)    | Recursively splits text based on a hierarchy of separators (e.g., paragraph, sentence, word, character) until chunks reach a target size. Tries to preserve semantic units as long as possible. <br> **Parameters:** `chunk_size` (max chars per chunk), `chunk_overlap` (overlapping chars), `separators` (list of characters to split on, e.g., `["\n\n", "\n", " ", ""]`). <br> **Compatible with:** Text.                                                                                   |
| [**Keyword Splitter**](https://andreshere00.github.io/Splitter_MR/api_reference/splitter/#keywordsplitter) | Splits text into chunks around matches of specified keywords, using one or more regex patterns. Supports precise boundary control—matched keywords can be included `before`, `after`, `both` sides, or omitted from the split. Each keyword can have a custom name (via `dict`) for metadata counting. Secondary soft-wrapping by `chunk_size` is supported. <br> **Parameters:** `patterns` (list of regex patterns, or `dict` mapping names to patterns), `include_delimiters` (`"before"`, `"after"`, `"both"`, or `"none"`), `flags` (regex flags, e.g. `re.MULTILINE`), `chunk_size` (max chars per chunk, soft-wrapped). <br> **Compatible with:** Text. |
| [**Token Splitter**](https://andreshere00.github.io/Splitter_MR/api_reference/splitter/#tokensplitter)        | Splits text into chunks based on the number of tokens, using various tokenization models (e.g., tiktoken, spaCy, NLTK). Useful for ensuring chunks are compatible with LLM context limits. <br> **Parameters:** `chunk_size` (max tokens per chunk), `model_name` (tokenizer/model, e.g., `"tiktoken/cl100k_base"`, `"spacy/en_core_web_sm"`, `"nltk/punkt"`), `language` (for NLTK). <br> **Compatible with:** Text. |
| [**Paged Splitter**](https://andreshere00.github.io/Splitter_MR/api_reference/splitter/#pagedsplitter)        | Splits text by pages for documents that have page structure. Each chunk contains a specified number of pages, with optional word overlap. <br> **Parameters:** `num_pages` (pages per chunk), `chunk_overlap` (overlapping words). <br> **Compatible with:** Word, PDF, Excel, PowerPoint. |
| [**Row/Column Splitter**](https://andreshere00.github.io/Splitter_MR/api_reference/splitter/#rowcolumnsplitter)   | For tabular formats, splits data by a set number of rows or columns per chunk, with possible overlap. Row-based and column-based splitting are mutually exclusive. <br> **Parameters:** `num_rows`, `num_cols` (rows/columns per chunk), `overlap` (overlapping rows or columns). <br> **Compatible with:** Tabular formats (csv, tsv, parquet, flat json). |
| [**JSON Splitter**](https://andreshere00.github.io/Splitter_MR/api_reference/splitter/#recursivejsonsplitter)         | Recursively splits JSON documents into smaller sub-structures that preserve the original JSON schema. <br> **Parameters:** `max_chunk_size` (max chars per chunk), `min_chunk_size` (min chars per chunk). <br> **Compatible with:** JSON. |
| [**Semantic Splitter**](https://andreshere00.github.io/Splitter_MR/api_reference/splitter/#semanticsplitter)     | Splits text into chunks based on semantic similarity, using an embedding model and a max tokens parameter. Useful for meaningful semantic groupings. <br> **Parameters:** `embedding_model` (model for embeddings), `max_tokens` (max tokens per chunk). <br> **Compatible with:** Text. |
| [**HTML Tag Splitter**](https://andreshere00.github.io/Splitter_MR/api_reference/splitter/#htmltagsplitter)       | Splits HTML content based on a specified tag, or automatically detects the most frequent and shallowest tag if not specified. Each chunk is a complete HTML fragment for that tag. <br> **Parameters:** `chunk_size` (max chars per chunk), `tag` (HTML tag to split on, optional). <br> **Compatible with:** HTML. |
| [**Header Splitter**](https://andreshere00.github.io/Splitter_MR/api_reference/splitter/#headersplitter)       | Splits Markdown or HTML documents into chunks using header levels (e.g., `#`, `##`, or `<h1>`, `<h2>`). Uses configurable headers for chunking. <br> **Parameters:** `headers_to_split_on` (list of headers and semantic names), `chunk_size` (unused, for compatibility). <br> **Compatible with:** Markdown, HTML. |
| [**Code Splitter**](https://andreshere00.github.io/Splitter_MR/api_reference/splitter/#codesplitter)         | Splits source code files into programmatically meaningful chunks (functions, classes, methods, etc.), aware of the syntax of the specified programming language (e.g., Python, Java, Kotlin). Uses language-aware logic to avoid splitting inside code blocks. <br> **Parameters:** `chunk_size` (max chars per chunk), `language` (programming language as string, e.g., `"python"`, `"java"`). <br> **Compatible with:** Source code files (Python, Java, Kotlin, C++, JavaScript, Go, etc.). |
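
As one illustration of the techniques in the table, keyword-based splitting can be sketched with nothing but the standard `re` module. This is a simplified, library-independent sketch and not the `KeywordSplitter` implementation: the helper name `split_on_keywords` and its `keep_delimiter` flag are hypothetical, and the sample text is made up.

```python
import re


def split_on_keywords(text: str, patterns: list[str], keep_delimiter: bool = True) -> list[str]:
    """Hypothetical helper: cut `text` wherever one of the regex patterns matches.

    When `keep_delimiter` is True, the matched keyword opens the chunk that
    follows it; otherwise the keyword is dropped from the output.
    """
    # Wrap each pattern in a non-capturing group so alternation is safe.
    combined = re.compile("|".join(f"(?:{p})" for p in patterns))
    chunks: list[str] = []
    last = 0
    for match in combined.finditer(text):
        piece = text[last:match.start()].strip()
        if piece:
            chunks.append(piece)
        # Start the next chunk at the keyword, or just after it.
        last = match.start() if keep_delimiter else match.end()
    tail = text[last:].strip()
    if tail:
        chunks.append(tail)
    return chunks


sample = "Intro text. WARNING: disk almost full. NOTE: backup scheduled."
with_keywords = split_on_keywords(sample, [r"WARNING:", r"NOTE:"])
without_keywords = split_on_keywords(sample, [r"WARNING:", r"NOTE:"], keep_delimiter=False)
print(with_keywords)
print(without_keywords)
```

The library's own `include_delimiters` options may behave differently from this flag; consult the API reference for the exact semantics.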

## Architecture

![SplitterMR architecture diagram](https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/docs/assets/splitter_mr_architecture_diagram.svg#gh-light-mode-only)
![SplitterMR architecture diagram](https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/docs/assets/splitter_mr_architecture_diagram_white.svg#gh-dark-mode-only)

**SplitterMR** is designed around a modular pipeline that processes files from raw data all the way to chunked, LLM-ready text. There are four main components: Readers, Models, Splitters and Embedders.

- **Readers**
    - The **`BaseReader`** components read a file and optionally convert it to another format, so that a splitting strategy can then be applied.
    - Supported readers (e.g., **`VanillaReader`**, **`MarkItDownReader`**, **`DoclingReader`**) produce a `ReaderOutput` object containing:
        - **Text** content (in `markdown`, `text`, `json` or another format).
        - Document **metadata**.
        - **Conversion** method.
- **Models**
    - The **`BaseModel`** component is used to read non-text content using a Visual Language Model (VLM).
    - Supported models are `AzureOpenAI`, `OpenAI` and `Grok`, but more models will be available soon.
    - All the models have an `analyze_content` method, which returns the LLM response based on a prompt, the client and the model parameters.
- **Splitters**
    - The **`BaseSplitter`** components take the **`ReaderOutput`** text content and divide that text into meaningful chunks for LLM or other downstream use.
    - Splitter classes (e.g., **`CharacterSplitter`**, **`SentenceSplitter`**, **`RecursiveCharacterSplitter`**, etc.) allow flexible chunking strategies with optional overlap and rich configuration.
- **Embedders**
    - The **`BaseEmbedder`** components are used to encode the text into embeddings. These embeddings are used to split text by semantic similarity.
    - Supported models are `AzureOpenAI` and `OpenAI`, but more models will be available soon.
    - All the models have an `encode_text` method, which returns the embeddings based on a text, the client and the model parameters.
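
The Reader-to-Splitter hand-off described above can be sketched with plain dataclasses. Everything in this block is a hypothetical stand-in (`SketchReaderOutput`, `SketchSplitterOutput`, `read_text`, `split_paragraphs`); it only mirrors the general shape of the pipeline and is not how SplitterMR defines its classes.

```python
from dataclasses import dataclass, field
from uuid import uuid4


@dataclass
class SketchReaderOutput:
    """Hypothetical stand-in for a reader result: text plus metadata."""
    text: str
    document_name: str
    conversion_method: str = "txt"
    metadata: dict = field(default_factory=dict)


@dataclass
class SketchSplitterOutput:
    """Hypothetical stand-in for a splitter result: chunks plus their ids."""
    chunks: list[str]
    chunk_id: list[str]
    document_name: str
    split_method: str


def read_text(name: str, raw: str) -> SketchReaderOutput:
    """Reader stage: normalise raw content into text plus metadata."""
    return SketchReaderOutput(text=raw.strip(), document_name=name)


def split_paragraphs(doc: SketchReaderOutput) -> SketchSplitterOutput:
    """Splitter stage: one chunk per blank-line-separated paragraph."""
    chunks = [p.strip() for p in doc.text.split("\n\n") if p.strip()]
    return SketchSplitterOutput(
        chunks=chunks,
        chunk_id=[str(uuid4()) for _ in chunks],  # one unique id per chunk
        document_name=doc.document_name,
        split_method="paragraph_sketch",
    )


pipeline_result = split_paragraphs(
    read_text("notes.txt", "First paragraph.\n\nSecond paragraph.\n")
)
print(pipeline_result.chunks)
```

Keeping the reader result and the splitter result as separate objects means any splitter can consume the output of any reader, which is the property the modular design relies on.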

## How to install

The package is published on [PyPI](https://pypi.org/project/splitter-mr/).

By default, only the **core dependencies** are installed. If you need additional features (e.g., MarkItDown, Docling, multimodal processing), you can install the corresponding **extras**.

### Core install

Installs the basic text splitting and file parsing features (lightweight, fast install):

```bash
pip install splitter-mr
```

### Optional extras

| Extra            | Description                                                                                                           | Example install command                 |
| ---------------- | --------------------------------------------------------------------------------------------------------------------- | --------------------------------------- |
| **`markitdown`** | Adds [MarkItDown](https://github.com/microsoft/markitdown) support for rich-text document parsing (HTML, DOCX, etc.). | `pip install "splitter-mr[markitdown]"` |
| **`docling`**    | Adds [Docling](https://github.com/ibm/docling) support for high-quality PDF/document to Markdown conversion.          | `pip install "splitter-mr[docling]"`    |
| **`multimodal`** | Enables computer vision, OCR, and audio features — includes **PyTorch**, EasyOCR, OpenCV, Transformers, etc.          | `pip install "splitter-mr[multimodal]"` |
| **`azure`**      | Installs Azure AI SDKs for integrating with Azure Document Intelligence and other Azure AI services.                  | `pip install "splitter-mr[azure]"`      |
| **`all`**        | Installs **everything** above (MarkItDown + Docling + Multimodal + Azure). **Heavy install** (\~GBs).                 | `pip install "splitter-mr[all]"`        |

### Multiple extras

You can combine extras by separating them with commas:

```bash
pip install "splitter-mr[markitdown,docling]"
```
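
After installing, a quick environment check can confirm that the interpreter meets the Python requirement and show which optional backends are importable. This is a small sketch using only the standard library; the function name `check_environment` is hypothetical, and the assumption that the `markitdown` and `docling` extras expose top-level modules with those same names comes from the upstream packages, not from SplitterMR itself.

```python
import sys
from importlib.util import find_spec

MIN_PYTHON = (3, 11)  # minimum version stated for this library

# Assumed mapping from extra name to the top-level module it installs.
OPTIONAL_MODULES = {"markitdown": "markitdown", "docling": "docling"}


def check_environment() -> dict:
    """Report whether Python is recent enough and which extras are importable."""
    return {
        "python_ok": sys.version_info >= MIN_PYTHON,
        "extras": {
            extra: find_spec(module) is not None
            for extra, module in OPTIONAL_MODULES.items()
        },
    }


report = check_environment()
print(report)
```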

### Using other package managers

You can also install it with [`uv`](https://docs.astral.sh/uv/), [`conda`](https://anaconda.org/anaconda/conda) or [`poetry`](https://python-poetry.org/):

```bash
uv add splitter-mr
```

> [!NOTE]
>
> **Python 3.11 or greater** is required to use this library.

## How to use

### Read files

First, instantiate a reader, that is, a class that inherits from `BaseReader`, such as `VanillaReader`:

```python
from splitter_mr.reader import VanillaReader

reader = VanillaReader()
```

To read a file, pass its path to the `read()` method. If you use `DoclingReader` or `MarkItDownReader`, your files are automatically converted to Markdown text. The reader returns a `ReaderOutput` object with the following fields:

```python 
reader_output = reader.read('https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/data/lorem_ipsum.txt')
print(reader_output)
```
```python
text='Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vestibulum sit amet ultricies orci. Nullam et tellus dui.', 
document_name='lorem_ipsum.txt',
document_path='https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/data/lorem_ipsum.txt', 
document_id='732b9530-3e41-4a1a-a4ea-1d9d6fe815d3', 
conversion_method='txt', 
reader_method='vanilla', 
ocr_method=None, 
page_placeholder=None,
metadata={}
```

> [!NOTE]
> You can read from a URL, from a variable, or from a `file_path`. See the [Developer guide](https://andreshere00.github.io/Splitter_MR/api_reference/reader/).

### Split text

To split the text, first import the class that implements your desired splitting strategy (e.g., by characters, recursively, by headers, etc.). Then, create an instance of this class and call its `split` method, which is defined in the `BaseSplitter` class.

For example, the following splits the text by characters with a maximum chunk size of 50 and an overlap of 10 characters between consecutive chunks:

```python
from splitter_mr.splitter import CharacterSplitter

char_splitter = CharacterSplitter(chunk_size=50, chunk_overlap=10)
splitter_output = char_splitter.split(reader_output)
print(splitter_output)
```
```python
chunks=['Lorem ipsum dolor sit amet, consectetur adipiscing', 'adipiscing elit. Vestibulum sit amet ultricies orc', 'ricies orci. Nullam et tellus dui.'], 
chunk_id=['db454a9b-32aa-4fdc-9aab-8770cae99882', 'e67b427c-4bb0-4f28-96c2-7785f070d1c1', '6206a89d-efd1-4586-8889-95590a14645b'], 
document_name='lorem_ipsum.txt', 
document_path='https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/data/lorem_ipsum.txt', 
document_id='732b9530-3e41-4a1a-a4ea-1d9d6fe815d3', 
conversion_method='txt', 
reader_method='vanilla', 
ocr_method=None, 
split_method='character_splitter', 
split_params={'chunk_size': 50, 'chunk_overlap': 10}, 
metadata={}
```

The returned object is a `SplitterOutput` object, which provides all the information you need to further process your data. You can easily add custom metadata, and you have access to details such as the document name, path, and type. Each chunk is uniquely identified by a UUID, allowing for easy traceability throughout your LLM workflow.
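
To see where the overlap in the chunks above comes from, here is a library-independent sketch of fixed-size splitting with overlap. It is not `CharacterSplitter`'s actual implementation; the function name `split_by_characters` is hypothetical. Each chunk starts `chunk_size - chunk_overlap` characters after the previous one, so the last 10 characters of one chunk are repeated at the start of the next.

```python
def split_by_characters(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Hypothetical helper: fixed-size character windows with overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # distance between consecutive chunk starts
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


text = (
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit. "
    "Vestibulum sit amet ultricies orci. Nullam et tellus dui."
)
overlap_chunks = split_by_characters(text, chunk_size=50, chunk_overlap=10)
for chunk in overlap_chunks:
    print(repr(chunk))
```

With these parameters the windows start at offsets 0, 40 and 80, which yields the same three chunks shown in the output above.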

### Compatibility with vision tools for image processing and annotations

Pass a VLM model to any Reader via the `model` parameter:

```python
from splitter_mr.reader import VanillaReader
from splitter_mr.model.models import AzureOpenAIVisionModel

model = AzureOpenAIVisionModel()
reader = VanillaReader(model=model)
output = reader.read(file_path="data/lorem_ipsum.pdf")
print(output.text)
```

These VLMs can be used for captioning, annotation or text extraction. You can also steer how a file is processed by passing a `prompt` parameter to the `read` method of any class that inherits from `BaseReader`.

> [!NOTE]
> For more details, see the documentation [here](https://andreshere00.github.io/Splitter_MR/api_reference/model/).

## Updates

### Next features

- [ ] **NEW** Provide an MCP server to make queries about the chunked documents.
- [ ] Add examples on how to implement SplitterMR in RAGs, MCPs and Agentic RAGs.
- [ ] Add a method to read PDFs using Textract.
- [ ] Add a new `BaseVisionModel` class to support generic API-provided models.
- [ ] Add asynchronous methods for Splitters and Readers.
- [ ] Add batch methods to process several documents at once.
- [ ] Add support to read formulas.
- [ ] Add classic **OCR** models: `easyocr` and `pytesseract`.
- [ ] Add support to generate output in `markdown` for all data types in VanillaReader.
- [ ] Add methods to support Markdown, JSON and XML data types when returning output.

### Previously implemented (up to `v1.0.0`)

- [X] Add embedding model support.
    - [X] Add OpenAI embeddings model support.
    - [X] Add Azure OpenAI embeddings model support.
    - [X] Add HuggingFace embeddings model support.
    - [X] Add Gemini embeddings model support.
    - [X] Add Claude Anthropic embeddings model support.
- [X] Add Vision models:
    - [X] Add OpenAI vision model support.
    - [X] Add Azure OpenAI vision model support.
    - [X] Add Grok VLMs model support.
    - [X] Add HuggingFace VLMs model support.
    - [X] Add Gemini VLMs model support.
    - [X] Add Claude Anthropic VLMs model support.
- [X] Modularize library into several sub-libraries.
- [X] Implement a method to split by embedding similarity: `SemanticSplitter`.
- [X] Add new supported formats to be analyzed with OpenAI and AzureOpenAI models.
- [X] Add support to read images using `VanillaReader`. 
- [X] Add support to read `xlsx`, `docx` and `pptx` files using `VanillaReader`. 
- [X] Implement a method to split a document by pages (`PagedSplitter`).
- [X] Add support to read PDF as scanned pages.
- [X] Add support to change image placeholders.
- [X] Add support to change page placeholders.
- [X] Add Pydantic models to define Reader and Splitter outputs.

## Contact

If you want to collaborate, please contact me through any of the following channels:

- [My mail](mailto:andresherencia2000@gmail.com)
- [My LinkedIn](https://linkedin.com/in/andres-herencia)
- [PyPI package](https://pypi.org/project/splitter-mr/)

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "splitter-mr",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.11",
    "maintainer_email": null,
    "keywords": null,
    "author": null,
    "author_email": "Andr\u00e9s Herencia <andresherencia2000@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/f4/e6/571e0c9febd6bef2961e37185ae3931058c88e40222feb3af689c9a09299/splitter_mr-1.0.1.tar.gz",
    "platform": null,
    "description": "# **SplitterMR**\n\n**SplitterMR** is a library for chunking data into convenient text blocks compatible with your LLM applications.\n\n<img src=\"https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/docs/assets/splitter_mr_logo.svg#gh-light-mode-only\" alt=\"SplitterMR logo\" width=100%/>\n<img src=\"https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/docs/assets/splitter_mr_logo_white.svg#gh-dark-mode-only\" alt=\"SplitterMR logo\" width=100%/>\n\n> [!IMPORTANT] \n> \n> \"Version 1.0.0 released \u2013 First Stable Release!\"\n> \n> **We are excited to announce the first stable release of SplitterMR (v1.0.0)!** Install it with the following command:\n> \n> ```python\n> pip install splitter-mr\n> ```\n> \n> **Highlights:**\n> \n> - \ud83d\ude80 [**Stable API**](#core-install) consolidating all v0.x features.\n> - \ud83d\udcd6 **[Readers](https://andreshere00.github.io/Splitter_MR/api_reference/reader/):** Plug-and-play support for Vanilla, MarkItDown, and Docling, covering formats like text, Office, JSON/YAML, images, HTML, and more.\n> - \ud83e\ude93 **[Splitters](https://andreshere00.github.io/Splitter_MR/api_reference/splitter/):** Extensive library of split strategies, including character, word, sentence, paragraph, token, paged, row/column, JSON, semantic, HTML tag, header, and code splitters.\n> - \ud83e\udde0 **[Models](https://andreshere00.github.io/Splitter_MR/api_reference/model/):** Multimodal Vision-Language support for OpenAI, Azure, Grok, HuggingFace, Gemini, Claude, and more.\n> - \ud83d\uddfa\ufe0f **[Embeddings](https://andreshere00.github.io/Splitter_MR/api_reference/embedding/):** Fully integrated embeddings from OpenAI, Azure, HuggingFace, Gemini, and Claude (via Voyage).\n> - \ud83c\udf9b\ufe0f [**Extras system:**](#multiple-extras) Install the minimal core, or extend with `markitdown`, `docling`, `multimodal`, or `all` for a batteries-included setup.\n> - \ud83d\udcda 
**[Docs](https://andreshere00.github.io/Splitter_MR/):** New API reference, real executed notebook examples, and updated architecture diagrams.\n> - \ud83d\udd27 **Developer Experience:** CI/CD pipeline, PyPI publishing, pre-commit checks, and improved cleaning instructions.\n> - \ud83d\udc1b **Bugfixes:** Improved NLTK tokenizers, more robust splitters, and new utilities for HTML => Markdown conversion.\n> \n> **Check out the updated documentation, new examples, and join us in making text splitting and document parsing easier than ever!**\n>\n> **Version 1.0.1 released - `KeywordSplitter`\n>\n> This Splitter allows to divide text based on specific regex patterns or keywords. See documentation [**here**](https://andreshere00.github.io/Splitter_MR/api_reference/splitter/#keywordsplitter).\n\n## Features\n\n### Different input formats\n\nSplitterMR can read data from multiples sources and files. To read the files, it uses the Reader components, which inherits from a Base abstract class, `BaseReader`. This object allows you to read the files as a properly formatted string, or convert the files into another format (such as `markdown` or `json`). \n\nCurrently, there are supported three readers: `VanillaReader`, and `MarkItDownReader` and `DoclingReader`. 
These are the differences between each Reader component:\n\n| **Reader**             | **Unstructured files & PDFs** | **MS Office suite files** | **Tabular data** | **Files with hierarchical schema** | **Image files** | **Markdown conversion** |\n|------------------------|-------------------------------|---------------------------|------------------|------------------------------------|-----------------|-------------------------|\n| [**`VanillaReader`**](https://andreshere00.github.io/Splitter_MR/api_reference/reader/#vanillareader)    | `txt`, `md`, `pdf` | `xlsx`, `docx`, `pptx` | `csv`, `tsv`, `parquet` | `json`, `yaml`, `html`, `xml` | `jpg`, `png`, `webp`, `gif` | Yes                     |\n| [**`MarkItDownReader`**](https://andreshere00.github.io/Splitter_MR/api_reference/reader/#markitdownreader) | `txt`, `md`, `pdf` | `docx`, `xlsx`, `pptx` | `csv`, `tsv` | `json`, `html`, `xml`                    | `jpg`, `png`, `pneg`        | Yes                     |\n| [**`DoclingReader`**](https://andreshere00.github.io/Splitter_MR/api_reference/reader/#doclingreader)    | `txt`, `md`, `pdf` | `docx`, `xlsx`, `pptx` | \u2013            | `html`, `xhtml`                 | `png`, `jpeg`, `tiff`, `bmp`, `webp` | Yes                     |\n\n### Several splitting methods\n\nSplitterMR allows you to split files in many different ways depending on your needs. The available splitting methods are described in the following table:\n\n| Splitting Technique | Description |\n| ------------------------- | -----------------------------|\n| [**Character Splitter**](https://andreshere00.github.io/Splitter_MR/api_reference/splitter/#charactersplitter)    | Splits text into chunks based on a specified number of characters. Supports overlapping by character count or percentage. <br> **Parameters:** `chunk_size` (max chars per chunk), `chunk_overlap` (overlapping chars: int or %). <br> **Compatible with:** Text. 
|\n| [**Word Splitter** ](https://andreshere00.github.io/Splitter_MR/api_reference/splitter/#wordsplitter)        | Splits text into chunks based on a specified number of words. Supports overlapping by word count or percentage. <br> **Parameters:** `chunk_size` (max words per chunk), `chunk_overlap` (overlapping words: int or %). <br> **Compatible with:** Text. |\n| [**Sentence Splitter**](https://andreshere00.github.io/Splitter_MR/api_reference/splitter/#sentencesplitter)     | Splits text into chunks by a specified number of sentences. Allows overlap defined by a number or percentage of words from the end of the previous chunk. Customizable sentence separators (e.g., `.`, `!`, `?`). <br> **Parameters:** `chunk_size` (max sentences per chunk), `chunk_overlap` (overlapping words: int or %), `sentence_separators` (list of characters). <br> **Compatible with:** Text. |\n| [**Paragraph Splitter**](https://andreshere00.github.io/Splitter_MR/api_reference/splitter/#paragraphsplitter)    | Splits text into chunks based on a specified number of paragraphs. Allows overlapping by word count or percentage, and customizable line breaks. <br> **Parameters:** `chunk_size` (max paragraphs per chunk), `chunk_overlap` (overlapping words: int or %), `line_break` (delimiter(s) for paragraphs). <br> **Compatible with:** Text. |\n| [**Recursive Splitter**](https://andreshere00.github.io/Splitter_MR/api_reference/splitter/#recursivesplitter)    | Recursively splits text based on a hierarchy of separators (e.g., paragraph, sentence, word, character) until chunks reach a target size. Tries to preserve semantic units as long as possible. <br> **Parameters:** `chunk_size` (max chars per chunk), `chunk_overlap` (overlapping chars), `separators` (list of characters to split on, e.g., `[\"\\n\\n\", \"\\n\", \" \", \"\"]`). <br> **Compatible with:** Text.                                                                                   
|\n| [**Keyword Splitter**](https://andreshere00.github.io/Splitter_MR/api_reference/splitter/#keywordsplitter) | Splits text into chunks around matches of specified keywords, using one or more regex patterns. Supports precise boundary control\u2014matched keywords can be included `before`, `after`, `both` sides, or omitted from the split. Each keyword can have a custom name (via `dict`) for metadata counting. Secondary soft-wrapping by `chunk_size` is supported. <br> **Parameters:** `patterns` (list of regex patterns, or `dict` mapping names to patterns), `include_delimiters` (`\"before\"`, `\"after\"`, `\"both\"`, or `\"none\"`), `flags` (regex flags, e.g. `re.MULTILINE`), `chunk_size` (max chars per chunk, soft-wrapped). <br> **Compatible with:** Text. |\n| [**Token Splitter**](https://andreshere00.github.io/Splitter_MR/api_reference/splitter/#tokensplitter)        | Splits text into chunks based on the number of tokens, using various tokenization models (e.g., tiktoken, spaCy, NLTK). Useful for ensuring chunks are compatible with LLM context limits. <br> **Parameters:** `chunk_size` (max tokens per chunk), `model_name` (tokenizer/model, e.g., `\"tiktoken/cl100k_base\"`, `\"spacy/en_core_web_sm\"`, `\"nltk/punkt\"`), `language` (for NLTK). <br> **Compatible with:** Text. |\n| [**Paged Splitter**](https://andreshere00.github.io/Splitter_MR/api_reference/splitter/#pagedsplitter)        | Splits text by pages for documents that have page structure. Each chunk contains a specified number of pages, with optional word overlap. <br> **Parameters:** `num_pages` (pages per chunk), `chunk_overlap` (overlapping words). <br> **Compatible with:** Word, PDF, Excel, PowerPoint. |\n| [**Row/Column Splitter**](https://andreshere00.github.io/Splitter_MR/api_reference/splitter/#rowcolumnsplitter)   | For tabular formats, splits data by a set number of rows or columns per chunk, with possible overlap. Row-based and column-based splitting are mutually exclusive. 
<br> **Parameters:** `num_rows`, `num_cols` (rows/columns per chunk), `overlap` (overlapping rows or columns). <br> **Compatible with:** Tabular formats (csv, tsv, parquet, flat json). |\n| [**JSON Splitter**](https://andreshere00.github.io/Splitter_MR/api_reference/splitter/#recursivejsonsplitter)         | Recursively splits JSON documents into smaller sub-structures that preserve the original JSON schema. <br> **Parameters:** `max_chunk_size` (max chars per chunk), `min_chunk_size` (min chars per chunk). <br> **Compatible with:** JSON. |\n| [**Semantic Splitter**](https://andreshere00.github.io/Splitter_MR/api_reference/splitter/#semanticsplitter)     | Splits text into chunks based on semantic similarity, using an embedding model and a max tokens parameter. Useful for meaningful semantic groupings. <br> **Parameters:** `embedding_model` (model for embeddings), `max_tokens` (max tokens per chunk). <br> **Compatible with:** Text. |\n| [**HTML Tag Splitter**](https://andreshere00.github.io/Splitter_MR/api_reference/splitter/#htmltagsplitter)       | Splits HTML content based on a specified tag, or automatically detects the most frequent and shallowest tag if not specified. Each chunk is a complete HTML fragment for that tag. <br> **Parameters:** `chunk_size` (max chars per chunk), `tag` (HTML tag to split on, optional). <br> **Compatible with:** HTML. |\n| [**Header Splitter**](https://andreshere00.github.io/Splitter_MR/api_reference/splitter/#headersplitter)       | Splits Markdown or HTML documents into chunks using header levels (e.g., `#`, `##`, or `<h1>`, `<h2>`). Uses configurable headers for chunking. <br> **Parameters:** `headers_to_split_on` (list of headers and semantic names), `chunk_size` (unused, for compatibility). <br> **Compatible with:** Markdown, HTML. 
|\n| [**Code Splitter**](https://andreshere00.github.io/Splitter_MR/api_reference/splitter/#codesplitter)         | Splits source code files into programmatically meaningful chunks (functions, classes, methods, etc.), aware of the syntax of the specified programming language (e.g., Python, Java, Kotlin). Uses language-aware logic to avoid splitting inside code blocks. <br> **Parameters:** `chunk_size` (max chars per chunk), `language` (programming language as string, e.g., `\"python\"`, `\"java\"`). <br> **Compatible with:** Source code files (Python, Java, Kotlin, C++, JavaScript, Go, etc.). |\n\n## Architecture\n\n![SplitterMR architecture diagram](https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/docs/assets/splitter_mr_architecture_diagram.svg#gh-light-mode-only)\n![SplitterMR architecture diagram](https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/docs/assets/splitter_mr_architecture_diagram_white.svg#gh-dark-mode-only)\n\n**SplitterMR** is designed around a modular pipeline that processes files from raw data all the way to chunked, LLM-ready text. 
There are four main components: Readers, Models, Splitters and Embedders.\n\n- **Readers**\n    - The **`BaseReader`** components read a file and optionally convert it to another format before a splitting strategy is applied.\n    - Supported readers (e.g., **`VanillaReader`**, **`MarkItDownReader`**, **`DoclingReader`**) produce a `ReaderOutput` dictionary containing:\n        - **Text** content (in `markdown`, `text`, `json` or another format).\n        - Document **metadata**.\n        - **Conversion** method.\n- **Models**\n    - The **`BaseModel`** component is used to read non-text content using a Visual Language Model (VLM).\n    - Supported models are `AzureOpenAI`, `OpenAI` and `Grok`, but more models will be available soon.\n    - All the models have an `analyze_content` method which returns the LLM response based on a prompt, the client and the model parameters.\n- **Splitters**\n    - The **`BaseSplitter`** components take the **`ReaderOutput`** text content and divide that text into meaningful chunks for LLM or other downstream use.\n    - Splitter classes (e.g., **`CharacterSplitter`**, **`SentenceSplitter`**, **`RecursiveCharacterSplitter`**) allow flexible chunking strategies with optional overlap and rich configuration.\n- **Embedders**\n    - The **`BaseEmbedder`** components are used to encode the text into embeddings. These embeddings are used to split text by semantic similarity.\n    - Supported models are `AzureOpenAI` and `OpenAI`, but more models will be available soon.\n    - All the models have an `encode_text` method which returns the embeddings based on a text, the client and the model parameters.\n\n## How to install\n\nThe package is published on [PyPI](https://pypi.org/project/splitter-mr/).  \n\nBy default, only the **core dependencies** are installed. 
If you need additional features (e.g., MarkItDown, Docling, multimodal processing), you can install the corresponding **extras**.\n\n### Core install\n\nInstalls the basic text splitting and file parsing features (lightweight, fast install):\n\n```bash\npip install splitter-mr\n```\n\n### Optional extras\n\n| Extra            | Description                                                                                                           | Example install command                 |\n| ---------------- | --------------------------------------------------------------------------------------------------------------------- | --------------------------------------- |\n| **`markitdown`** | Adds [MarkItDown](https://github.com/microsoft/markitdown) support for rich-text document parsing (HTML, DOCX, etc.). | `pip install \"splitter-mr[markitdown]\"` |\n| **`docling`**    | Adds [Docling](https://github.com/ibm/docling) support for high-quality PDF/document to Markdown conversion.          | `pip install \"splitter-mr[docling]\"`    |\n| **`multimodal`** | Enables computer vision, OCR, and audio features \u2014 includes **PyTorch**, EasyOCR, OpenCV, Transformers, etc.          | `pip install \"splitter-mr[multimodal]\"` |\n| **`azure`**      | Installs Azure AI SDKs for integrating with Azure Document Intelligence and other Azure AI services.                  | `pip install \"splitter-mr[azure]\"`      |\n| **`all`**        | Installs **everything** above (MarkItDown + Docling + Multimodal + Azure). **Heavy install** (\\~GBs).                 
| `pip install \"splitter-mr[all]\"`        |\n\n### Multiple extras\n\nYou can combine extras by separating them with commas:\n\n```bash\npip install \"splitter-mr[markitdown,docling]\"\n```\n\n### Using other package managers\n\nYou can also install it with [`uv`](https://docs.astral.sh/uv/), [`conda`](https://anaconda.org/anaconda/conda) or [`poetry`](https://python-poetry.org/):\n\n```bash\nuv add splitter-mr\n```\n\n> [!NOTE]\n>\n> **Python 3.11 or greater** is required to use this library.\n\n## How to use\n\n### Read files\n\nFirst, instantiate a class that inherits from `BaseReader`, for example `VanillaReader`.\n\n```python\nfrom splitter_mr.reader import VanillaReader\n\nreader = VanillaReader()\n```\n\nTo read a file, pass its path to the `read()` method. If you use `DoclingReader` or `MarkItDownReader`, your files will be automatically converted to Markdown. The reader returns a `ReaderOutput` object, a dictionary with the following shape:\n\n```python\nreader_output = reader.read('https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/data/lorem_ipsum.txt')\nprint(reader_output)\n```\n```python\ntext='Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vestibulum sit amet ultricies orci. Nullam et tellus dui.', \ndocument_name='lorem_ipsum.txt',\ndocument_path='https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/data/lorem_ipsum.txt', \ndocument_id='732b9530-3e41-4a1a-a4ea-1d9d6fe815d3', \nconversion_method='txt', \nreader_method='vanilla', \nocr_method=None, \npage_placeholder=None,\nmetadata={}\n```\n\n> [!NOTE]\n> You can read from a URL, a variable or a `file_path`. See the [Developer guide](https://andreshere00.github.io/Splitter_MR/api_reference/reader/).\n\n### Split text\n\nTo split the text, first import the class that implements your desired splitting strategy (e.g., by characters, recursively, by headers). 
Then, create an instance of this class and call its `split` method, which is defined in the `BaseSplitter` class.\n\nFor example, we will split by characters with a maximum chunk size of 50 characters and an overlap of 10 characters between chunks:\n\n```python\nfrom splitter_mr.splitter import CharacterSplitter\n\nchar_splitter = CharacterSplitter(chunk_size=50, chunk_overlap=10)\nsplitter_output = char_splitter.split(reader_output)\nprint(splitter_output)\n```\n```python\nchunks=['Lorem ipsum dolor sit amet, consectetur adipiscing', 'adipiscing elit. Vestibulum sit amet ultricies orc', 'ricies orci. Nullam et tellus dui.'], \nchunk_id=['db454a9b-32aa-4fdc-9aab-8770cae99882', 'e67b427c-4bb0-4f28-96c2-7785f070d1c1', '6206a89d-efd1-4586-8889-95590a14645b'], \ndocument_name='lorem_ipsum.txt', \ndocument_path='https://raw.githubusercontent.com/andreshere00/Splitter_MR/refs/heads/main/data/lorem_ipsum.txt', \ndocument_id='732b9530-3e41-4a1a-a4ea-1d9d6fe815d3', \nconversion_method='txt', \nreader_method='vanilla', \nocr_method=None, \nsplit_method='character_splitter', \nsplit_params={'chunk_size': 50, 'chunk_overlap': 10}, \nmetadata={}\n```\n\nThe returned object is a `SplitterOutput` dataclass, which provides all the information you need to further process your data. You can easily add custom metadata, and you have access to details such as the document name, path, and type. Each chunk is uniquely identified by a UUID, allowing for easy traceability throughout your LLM workflow.\n\n### Compatibility with vision tools for image processing and annotations\n\nPass a VLM to any Reader via the `model` parameter:\n\n```python\nfrom splitter_mr.reader import VanillaReader\nfrom splitter_mr.model.models import AzureOpenAIVisionModel\n\nmodel = AzureOpenAIVisionModel()\nreader = VanillaReader(model=model)\noutput = reader.read(file_path=\"data/lorem_ipsum.pdf\")\nprint(output.text)\n```\n\nThese VLMs can be used for captioning, annotation or text extraction. 
You can also control how these models process files through the `prompt` parameter of the `read` method, available in every class that inherits from `BaseReader`. \n\n> [!NOTE]\n> For more details, see the documentation [here](https://andreshere00.github.io/Splitter_MR/api_reference/model/).\n\n## Updates\n\n### Next features\n\n- [ ] **NEW** Provide an MCP server to make queries about the chunked documents.\n- [ ] Add examples on how to implement SplitterMR in RAGs, MCPs and Agentic RAGs.\n- [ ] Add a method to read PDFs using Textract.\n- [ ] Add a new `BaseVisionModel` class to support generic API-provided models.\n- [ ] Add asynchronous methods for Splitters and Readers.\n- [ ] Add batch methods to process several documents at once.\n- [ ] Add support to read formulas.\n- [ ] Add classic **OCR** models: `easyocr` and `pytesseract`.\n- [ ] Add support to generate output in `markdown` for all data types in VanillaReader.\n- [ ] Add methods to support Markdown, JSON and XML data types when returning output.\n\n### Previously implemented (up to `v1.0.0`)\n\n- [X] Add embedding model support.\n    - [X] Add OpenAI embeddings model support.\n    - [X] Add Azure OpenAI embeddings model support.\n    - [X] Add HuggingFace embeddings model support.\n    - [X] Add Gemini embeddings model support.\n    - [X] Add Claude Anthropic embeddings model support.\n- [X] Add Vision models:\n    - [X] Add OpenAI vision model support.\n    - [X] Add Azure OpenAI embeddings model support.\n    - [X] Add Grok VLMs model support.\n    - [X] Add HuggingFace VLMs model support.\n    - [X] Add Gemini VLMs model support.\n    - [X] Add Claude Anthropic VLMs model support.\n- [X] Modularize library into several sub-libraries.\n- [X] Implement a method to split by embedding similarity: `SemanticSplitter`.\n- [X] Add new supported formats to be analyzed with OpenAI and AzureOpenAI models.\n- [X] Add support to read images using `VanillaReader`. 
\n- [X] Add support to read `xlsx`, `docx` and `pptx` files using `VanillaReader`. \n- [X] Implement a method to split a document by pages (`PagedSplitter`).\n- [X] Add support to read PDF as scanned pages.\n- [X] Add support to change image placeholders.\n- [X] Add support to change page placeholders.\n- [X] Add Pydantic models to define Reader and Splitter outputs.\n\n## Contact\n\nIf you want to collaborate, please contact me through any of the following channels: \n\n- [My mail](mailto:andresherencia2000@gmail.com)\n- [My LinkedIn](https://linkedin.com/in/andres-herencia)\n- [PyPI package](https://pypi.org/project/splitter-mr/)\n",
    "bugtrack_url": null,
    "license": "MIT License\n        \n        Copyright (c) 2025 Andr\u00e9s Herencia\n        \n        Permission is hereby granted, free of charge, to any person obtaining a copy\n        of this software and associated documentation files (the \"Software\"), to deal\n        in the Software without restriction, including without limitation the rights\n        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell\n        copies of the Software, and to permit persons to whom the Software is\n        furnished to do so, subject to the following conditions:\n        \n        The above copyright notice and this permission notice shall be included in all\n        copies or substantial portions of the Software.\n        \n        THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\n        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\n        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\n        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\n        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\n        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\n        SOFTWARE.\n        ",
    "summary": "A modular text splitting library.",
    "version": "1.0.1",
    "project_urls": {
        "homepage": "https://github.com/andreshere00/splitter_mr",
        "repository": "https://github.com/andreshere00/splitter_mr"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "3520c23e405fb4a2f01bf91a07f6c6a44bee64007cf9b900c05af491a2462b86",
                "md5": "851f65b632c3b4933f019ad39043fd9d",
                "sha256": "ab6489fb19770d3b6bc81796e5a7b5f2796eb5da7c8f6769b15cacb32287910e"
            },
            "downloads": -1,
            "filename": "splitter_mr-1.0.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "851f65b632c3b4933f019ad39043fd9d",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.11",
            "size": 108649,
            "upload_time": "2025-09-13T11:55:14",
            "upload_time_iso_8601": "2025-09-13T11:55:14.748694Z",
            "url": "https://files.pythonhosted.org/packages/35/20/c23e405fb4a2f01bf91a07f6c6a44bee64007cf9b900c05af491a2462b86/splitter_mr-1.0.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "f4e6571e0c9febd6bef2961e37185ae3931058c88e40222feb3af689c9a09299",
                "md5": "fa677b0f71420082ca3269f05daeca2f",
                "sha256": "f91407d2b20645e4a08a6d3dfcd55a027902ed121149f0eecf78cb79dfa25786"
            },
            "downloads": -1,
            "filename": "splitter_mr-1.0.1.tar.gz",
            "has_sig": false,
            "md5_digest": "fa677b0f71420082ca3269f05daeca2f",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.11",
            "size": 7714126,
            "upload_time": "2025-09-13T11:55:16",
            "upload_time_iso_8601": "2025-09-13T11:55:16.584623Z",
            "url": "https://files.pythonhosted.org/packages/f4/e6/571e0c9febd6bef2961e37185ae3931058c88e40222feb3af689c9a09299/splitter_mr-1.0.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-09-13 11:55:16",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "andreshere00",
    "github_project": "splitter_mr",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [
        {
            "name": "accelerate",
            "specs": [
                [
                    "==",
                    "1.10.1"
                ]
            ]
        },
        {
            "name": "aiohappyeyeballs",
            "specs": [
                [
                    "==",
                    "2.6.1"
                ]
            ]
        },
        {
            "name": "aiohttp",
            "specs": [
                [
                    "==",
                    "3.12.15"
                ]
            ]
        },
        {
            "name": "aiolimiter",
            "specs": [
                [
                    "==",
                    "1.2.1"
                ]
            ]
        },
        {
            "name": "aiosignal",
            "specs": [
                [
                    "==",
                    "1.4.0"
                ]
            ]
        },
        {
            "name": "annotated-types",
            "specs": [
                [
                    "==",
                    "0.7.0"
                ]
            ]
        },
        {
            "name": "anyio",
            "specs": [
                [
                    "==",
                    "4.10.0"
                ]
            ]
        },
        {
            "name": "appnope",
            "specs": [
                [
                    "==",
                    "0.1.4"
                ]
            ]
        },
        {
            "name": "argon2-cffi",
            "specs": [
                [
                    "==",
                    "25.1.0"
                ]
            ]
        },
        {
            "name": "argon2-cffi-bindings",
            "specs": [
                [
                    "==",
                    "25.1.0"
                ]
            ]
        },
        {
            "name": "arrow",
            "specs": [
                [
                    "==",
                    "1.3.0"
                ]
            ]
        },
        {
            "name": "asttokens",
            "specs": [
                [
                    "==",
                    "3.0.0"
                ]
            ]
        },
        {
            "name": "async-lru",
            "specs": [
                [
                    "==",
                    "2.0.5"
                ]
            ]
        },
        {
            "name": "attrs",
            "specs": [
                [
                    "==",
                    "25.3.0"
                ]
            ]
        },
        {
            "name": "audioop-lts",
            "specs": [
                [
                    "==",
                    "0.2.2"
                ]
            ]
        },
        {
            "name": "autoflake",
            "specs": [
                [
                    "==",
                    "2.3.1"
                ]
            ]
        },
        {
            "name": "azure-ai-documentintelligence",
            "specs": [
                [
                    "==",
                    "1.0.2"
                ]
            ]
        },
        {
            "name": "azure-core",
            "specs": [
                [
                    "==",
                    "1.35.0"
                ]
            ]
        },
        {
            "name": "azure-identity",
            "specs": [
                [
                    "==",
                    "1.24.0"
                ]
            ]
        },
        {
            "name": "babel",
            "specs": [
                [
                    "==",
                    "2.17.0"
                ]
            ]
        },
        {
            "name": "backrefs",
            "specs": [
                [
                    "==",
                    "5.9"
                ]
            ]
        },
        {
            "name": "beautifulsoup4",
            "specs": [
                [
                    "==",
                    "4.13.5"
                ]
            ]
        },
        {
            "name": "bert-score",
            "specs": [
                [
                    "==",
                    "0.3.13"
                ]
            ]
        },
        {
            "name": "black",
            "specs": [
                [
                    "==",
                    "25.1.0"
                ]
            ]
        },
        {
            "name": "bleach",
            "specs": [
                [
                    "==",
                    "6.2.0"
                ]
            ]
        },
        {
            "name": "blis",
            "specs": [
                [
                    "==",
                    "1.3.0"
                ]
            ]
        },
        {
            "name": "bracex",
            "specs": [
                [
                    "==",
                    "2.6"
                ]
            ]
        },
        {
            "name": "build",
            "specs": [
                [
                    "==",
                    "1.3.0"
                ]
            ]
        },
        {
            "name": "cachetools",
            "specs": [
                [
                    "==",
                    "5.5.2"
                ]
            ]
        },
        {
            "name": "catalogue",
            "specs": [
                [
                    "==",
                    "2.0.10"
                ]
            ]
        },
        {
            "name": "certifi",
            "specs": [
                [
                    "==",
                    "2025.8.3"
                ]
            ]
        },
        {
            "name": "cffi",
            "specs": [
                [
                    "==",
                    "1.17.1"
                ]
            ]
        },
        {
            "name": "cfgv",
            "specs": [
                [
                    "==",
                    "3.4.0"
                ]
            ]
        },
        {
            "name": "charset-normalizer",
            "specs": [
                [
                    "==",
                    "3.4.3"
                ]
            ]
        },
        {
            "name": "click",
            "specs": [
                [
                    "==",
                    "8.2.1"
                ]
            ]
        },
        {
            "name": "cloudpathlib",
            "specs": [
                [
                    "==",
                    "0.22.0"
                ]
            ]
        },
        {
            "name": "cobble",
            "specs": [
                [
                    "==",
                    "0.1.4"
                ]
            ]
        },
        {
            "name": "colorama",
            "specs": [
                [
                    "==",
                    "0.4.6"
                ]
            ]
        },
        {
            "name": "coloredlogs",
            "specs": [
                [
                    "==",
                    "15.0.1"
                ]
            ]
        },
        {
            "name": "comm",
            "specs": [
                [
                    "==",
                    "0.2.3"
                ]
            ]
        },
        {
            "name": "confection",
            "specs": [
                [
                    "==",
                    "0.1.5"
                ]
            ]
        },
        {
            "name": "contourpy",
            "specs": [
                [
                    "==",
                    "1.3.3"
                ]
            ]
        },
        {
            "name": "coverage",
            "specs": [
                [
                    "==",
                    "7.10.6"
                ]
            ]
        },
        {
            "name": "cramjam",
            "specs": [
                [
                    "==",
                    "2.11.0"
                ]
            ]
        },
        {
            "name": "cryptography",
            "specs": [
                [
                    "==",
                    "45.0.7"
                ]
            ]
        },
        {
            "name": "cssselect2",
            "specs": [
                [
                    "==",
                    "0.8.0"
                ]
            ]
        },
        {
            "name": "cycler",
            "specs": [
                [
                    "==",
                    "0.12.1"
                ]
            ]
        },
        {
            "name": "cymem",
            "specs": [
                [
                    "==",
                    "2.0.11"
                ]
            ]
        },
        {
            "name": "debugpy",
            "specs": [
                [
                    "==",
                    "1.8.16"
                ]
            ]
        },
        {
            "name": "decorator",
            "specs": [
                [
                    "==",
                    "5.2.1"
                ]
            ]
        },
        {
            "name": "defusedxml",
            "specs": [
                [
                    "==",
                    "0.7.1"
                ]
            ]
        },
        {
            "name": "dill",
            "specs": [
                [
                    "==",
                    "0.4.0"
                ]
            ]
        },
        {
            "name": "distlib",
            "specs": [
                [
                    "==",
                    "0.4.0"
                ]
            ]
        },
        {
            "name": "distro",
            "specs": [
                [
                    "==",
                    "1.9.0"
                ]
            ]
        },
        {
            "name": "docling",
            "specs": [
                [
                    "==",
                    "2.51.0"
                ]
            ]
        },
        {
            "name": "docling-core",
            "specs": [
                [
                    "==",
                    "2.47.0"
                ]
            ]
        },
        {
            "name": "docling-ibm-models",
            "specs": [
                [
                    "==",
                    "3.9.1"
                ]
            ]
        },
        {
            "name": "docling-parse",
            "specs": [
                [
                    "==",
                    "4.4.0"
                ]
            ]
        },
        {
            "name": "docutils",
            "specs": [
                [
                    "==",
                    "0.22"
                ]
            ]
        },
        {
            "name": "easyocr",
            "specs": [
                [
                    "==",
                    "1.7.2"
                ]
            ]
        },
        {
            "name": "et-xmlfile",
            "specs": [
                [
                    "==",
                    "2.0.0"
                ]
            ]
        },
        {
            "name": "executing",
            "specs": [
                [
                    "==",
                    "2.2.1"
                ]
            ]
        },
        {
            "name": "faker",
            "specs": [
                [
                    "==",
                    "37.6.0"
                ]
            ]
        },
        {
            "name": "fastjsonschema",
            "specs": [
                [
                    "==",
                    "2.21.2"
                ]
            ]
        },
        {
            "name": "fastparquet",
            "specs": [
                [
                    "==",
                    "2024.11.0"
                ]
            ]
        },
        {
            "name": "ffmpeg",
            "specs": [
                [
                    "==",
                    "1.4"
                ]
            ]
        },
        {
            "name": "ffmpeg-downloader",
            "specs": [
                [
                    "==",
                    "0.4.0"
                ]
            ]
        },
        {
            "name": "filelock",
            "specs": [
                [
                    "==",
                    "3.19.1"
                ]
            ]
        },
        {
            "name": "filetype",
            "specs": [
                [
                    "==",
                    "1.2.0"
                ]
            ]
        },
        {
            "name": "flake8",
            "specs": [
                [
                    "==",
                    "7.3.0"
                ]
            ]
        },
        {
            "name": "flatbuffers",
            "specs": [
                [
                    "==",
                    "25.2.10"
                ]
            ]
        },
        {
            "name": "fonttools",
            "specs": [
                [
                    "==",
                    "4.59.2"
                ]
            ]
        },
        {
            "name": "fqdn",
            "specs": [
                [
                    "==",
                    "1.5.1"
                ]
            ]
        },
        {
            "name": "frozenlist",
            "specs": [
                [
                    "==",
                    "1.7.0"
                ]
            ]
        },
        {
            "name": "fsspec",
            "specs": [
                [
                    "==",
                    "2025.9.0"
                ]
            ]
        },
        {
            "name": "ghp-import",
            "specs": [
                [
                    "==",
                    "2.1.0"
                ]
            ]
        },
        {
            "name": "gitdb",
            "specs": [
                [
                    "==",
                    "4.0.12"
                ]
            ]
        },
        {
            "name": "gitpython",
            "specs": [
                [
                    "==",
                    "3.1.45"
                ]
            ]
        },
        {
            "name": "google-auth",
            "specs": [
                [
                    "==",
                    "2.40.3"
                ]
            ]
        },
        {
            "name": "google-genai",
            "specs": [
                [
                    "==",
                    "1.33.0"
                ]
            ]
        },
        {
            "name": "griffe",
            "specs": [
                [
                    "==",
                    "1.14.0"
                ]
            ]
        },
        {
            "name": "grpcio",
            "specs": [
                [
                    "==",
                    "1.74.0"
                ]
            ]
        },
        {
            "name": "h11",
            "specs": [
                [
                    "==",
                    "0.16.0"
                ]
            ]
        },
        {
            "name": "h2",
            "specs": [
                [
                    "==",
                    "4.3.0"
                ]
            ]
        },
        {
            "name": "hf-xet",
            "specs": [
                [
                    "==",
                    "1.1.9"
                ]
            ]
        },
        {
            "name": "hpack",
            "specs": [
                [
                    "==",
                    "4.1.0"
                ]
            ]
        },
        {
            "name": "httpcore",
            "specs": [
                [
                    "==",
                    "1.0.9"
                ]
            ]
        },
        {
            "name": "httpx",
            "specs": [
                [
                    "==",
                    "0.28.1"
                ]
            ]
        },
        {
            "name": "huggingface-hub",
            "specs": [
                [
                    "==",
                    "0.34.4"
                ]
            ]
        },
        {
            "name": "humanfriendly",
            "specs": [
                [
                    "==",
                    "10.0"
                ]
            ]
        },
        {
            "name": "hyperframe",
            "specs": [
                [
                    "==",
                    "6.1.0"
                ]
            ]
        },
        {
            "name": "id",
            "specs": [
                [
                    "==",
                    "1.5.0"
                ]
            ]
        },
        {
            "name": "identify",
            "specs": [
                [
                    "==",
                    "2.6.14"
                ]
            ]
        },
        {
            "name": "idna",
            "specs": [
                [
                    "==",
                    "3.10"
                ]
            ]
        },
        {
            "name": "imageio",
            "specs": [
                [
                    "==",
                    "2.37.0"
                ]
            ]
        },
        {
            "name": "importlib-metadata",
            "specs": [
                [
                    "==",
                    "8.7.0"
                ]
            ]
        },
        {
            "name": "iniconfig",
            "specs": [
                [
                    "==",
                    "2.1.0"
                ]
            ]
        },
        {
            "name": "iprogress",
            "specs": [
                [
                    "==",
                    "0.4"
                ]
            ]
        },
        {
            "name": "ipykernel",
            "specs": [
                [
                    "==",
                    "6.30.1"
                ]
            ]
        },
        {
            "name": "ipython",
            "specs": [
                [
                    "==",
                    "9.5.0"
                ]
            ]
        },
        {
            "name": "ipython-pygments-lexers",
            "specs": [
                [
                    "==",
                    "1.1.1"
                ]
            ]
        },
        {
            "name": "ipywidgets",
            "specs": [
                [
                    "==",
                    "8.1.7"
                ]
            ]
        },
        {
            "name": "isodate",
            "specs": [
                [
                    "==",
                    "0.7.2"
                ]
            ]
        },
        {
            "name": "isoduration",
            "specs": [
                [
                    "==",
                    "20.11.0"
                ]
            ]
        },
        {
            "name": "isort",
            "specs": [
                [
                    "==",
                    "6.0.1"
                ]
            ]
        },
        {
            "name": "jaraco-classes",
            "specs": [
                [
                    "==",
                    "3.4.0"
                ]
            ]
        },
        {
            "name": "jaraco-context",
            "specs": [
                [
                    "==",
                    "6.0.1"
                ]
            ]
        },
        {
            "name": "jaraco-functools",
            "specs": [
                [
                    "==",
                    "4.3.0"
                ]
            ]
        },
        {
            "name": "jedi",
            "specs": [
                [
                    "==",
                    "0.19.2"
                ]
            ]
        },
        {
            "name": "jinja2",
            "specs": [
                [
                    "==",
                    "3.1.6"
                ]
            ]
        },
        {
            "name": "jiter",
            "specs": [
                [
                    "==",
                    "0.10.0"
                ]
            ]
        },
        {
            "name": "joblib",
            "specs": [
                [
                    "==",
                    "1.5.2"
                ]
            ]
        },
        {
            "name": "json5",
            "specs": [
                [
                    "==",
                    "0.12.1"
                ]
            ]
        },
        {
            "name": "jsonlines",
            "specs": [
                [
                    "==",
                    "3.1.0"
                ]
            ]
        },
        {
            "name": "jsonpatch",
            "specs": [
                [
                    "==",
                    "1.33"
                ]
            ]
        },
        {
            "name": "jsonpointer",
            "specs": [
                [
                    "==",
                    "3.0.0"
                ]
            ]
        },
        {
            "name": "jsonref",
            "specs": [
                [
                    "==",
                    "1.1.0"
                ]
            ]
        },
        {
            "name": "jsonschema",
            "specs": [
                [
                    "==",
                    "4.25.1"
                ]
            ]
        },
        {
            "name": "jsonschema-specifications",
            "specs": [
                [
                    "==",
                    "2025.4.1"
                ]
            ]
        },
        {
            "name": "jupyter",
            "specs": [
                [
                    "==",
                    "1.1.1"
                ]
            ]
        },
        {
            "name": "jupyter-client",
            "specs": [
                [
                    "==",
                    "8.6.3"
                ]
            ]
        },
        {
            "name": "jupyter-console",
            "specs": [
                [
                    "==",
                    "6.6.3"
                ]
            ]
        },
        {
            "name": "jupyter-core",
            "specs": [
                [
                    "==",
                    "5.8.1"
                ]
            ]
        },
        {
            "name": "jupyter-events",
            "specs": [
                [
                    "==",
                    "0.12.0"
                ]
            ]
        },
        {
            "name": "jupyter-lsp",
            "specs": [
                [
                    "==",
                    "2.3.0"
                ]
            ]
        },
        {
            "name": "jupyter-server",
            "specs": [
                [
                    "==",
                    "2.17.0"
                ]
            ]
        },
        {
            "name": "jupyter-server-terminals",
            "specs": [
                [
                    "==",
                    "0.5.3"
                ]
            ]
        },
        {
            "name": "jupyterlab",
            "specs": [
                [
                    "==",
                    "4.4.7"
                ]
            ]
        },
        {
            "name": "jupyterlab-pygments",
            "specs": [
                [
                    "==",
                    "0.3.0"
                ]
            ]
        },
        {
            "name": "jupyterlab-server",
            "specs": [
                [
                    "==",
                    "2.27.3"
                ]
            ]
        },
        {
            "name": "jupyterlab-widgets",
            "specs": [
                [
                    "==",
                    "3.0.15"
                ]
            ]
        },
        {
            "name": "keyring",
            "specs": [
                [
                    "==",
                    "25.6.0"
                ]
            ]
        },
        {
            "name": "kiwisolver",
            "specs": [
                [
                    "==",
                    "1.4.9"
                ]
            ]
        },
        {
            "name": "langchain-core",
            "specs": [
                [
                    "==",
                    "0.3.75"
                ]
            ]
        },
        {
            "name": "langchain-text-splitters",
            "specs": [
                [
                    "==",
                    "0.3.11"
                ]
            ]
        },
        {
            "name": "langcodes",
            "specs": [
                [
                    "==",
                    "3.5.0"
                ]
            ]
        },
        {
            "name": "langsmith",
            "specs": [
                [
                    "==",
                    "0.4.25"
                ]
            ]
        },
        {
            "name": "language-data",
            "specs": [
                [
                    "==",
                    "1.3.0"
                ]
            ]
        },
        {
            "name": "lark",
            "specs": [
                [
                    "==",
                    "1.2.2"
                ]
            ]
        },
        {
            "name": "latex2mathml",
            "specs": [
                [
                    "==",
                    "3.78.1"
                ]
            ]
        },
        {
            "name": "lazy-loader",
            "specs": [
                [
                    "==",
                    "0.4"
                ]
            ]
        },
        {
            "name": "lxml",
            "specs": [
                [
                    "==",
                    "5.4.0"
                ]
            ]
        },
        {
            "name": "magika",
            "specs": [
                [
                    "==",
                    "0.6.2"
                ]
            ]
        },
        {
            "name": "mammoth",
            "specs": [
                [
                    "==",
                    "1.10.0"
                ]
            ]
        },
        {
            "name": "marisa-trie",
            "specs": [
                [
                    "==",
                    "1.3.1"
                ]
            ]
        },
        {
            "name": "markdown",
            "specs": [
                [
                    "==",
                    "3.9"
                ]
            ]
        },
        {
            "name": "markdown-it-py",
            "specs": [
                [
                    "==",
                    "4.0.0"
                ]
            ]
        },
        {
            "name": "markdownify",
            "specs": [
                [
                    "==",
                    "1.2.0"
                ]
            ]
        },
        {
            "name": "markitdown",
            "specs": [
                [
                    "==",
                    "0.1.2"
                ]
            ]
        },
        {
            "name": "marko",
            "specs": [
                [
                    "==",
                    "2.2.0"
                ]
            ]
        },
        {
            "name": "markupsafe",
            "specs": [
                [
                    "==",
                    "3.0.2"
                ]
            ]
        },
        {
            "name": "matplotlib",
            "specs": [
                [
                    "==",
                    "3.10.6"
                ]
            ]
        },
        {
            "name": "matplotlib-inline",
            "specs": [
                [
                    "==",
                    "0.1.7"
                ]
            ]
        },
        {
            "name": "mccabe",
            "specs": [
                [
                    "==",
                    "0.7.0"
                ]
            ]
        },
        {
            "name": "mdurl",
            "specs": [
                [
                    "==",
                    "0.1.2"
                ]
            ]
        },
        {
            "name": "mergedeep",
            "specs": [
                [
                    "==",
                    "1.3.4"
                ]
            ]
        },
        {
            "name": "mistune",
            "specs": [
                [
                    "==",
                    "3.1.4"
                ]
            ]
        },
        {
            "name": "mkdocs",
            "specs": [
                [
                    "==",
                    "1.6.1"
                ]
            ]
        },
        {
            "name": "mkdocs-autorefs",
            "specs": [
                [
                    "==",
                    "1.4.3"
                ]
            ]
        },
        {
            "name": "mkdocs-awesome-pages-plugin",
            "specs": [
                [
                    "==",
                    "2.10.1"
                ]
            ]
        },
        {
            "name": "mkdocs-get-deps",
            "specs": [
                [
                    "==",
                    "0.2.0"
                ]
            ]
        },
        {
            "name": "mkdocs-glightbox",
            "specs": [
                [
                    "==",
                    "0.5.1"
                ]
            ]
        },
        {
            "name": "mkdocs-material",
            "specs": [
                [
                    "==",
                    "9.6.19"
                ]
            ]
        },
        {
            "name": "mkdocs-material-extensions",
            "specs": [
                [
                    "==",
                    "1.3.1"
                ]
            ]
        },
        {
            "name": "mkdocstrings",
            "specs": [
                [
                    "==",
                    "0.30.0"
                ]
            ]
        },
        {
            "name": "mkdocstrings-python",
            "specs": [
                [
                    "==",
                    "1.18.2"
                ]
            ]
        },
        {
            "name": "more-itertools",
            "specs": [
                [
                    "==",
                    "10.8.0"
                ]
            ]
        },
        {
            "name": "mpire",
            "specs": [
                [
                    "==",
                    "2.10.2"
                ]
            ]
        },
        {
            "name": "mpmath",
            "specs": [
                [
                    "==",
                    "1.3.0"
                ]
            ]
        },
        {
            "name": "msal",
            "specs": [
                [
                    "==",
                    "1.33.0"
                ]
            ]
        },
        {
            "name": "msal-extensions",
            "specs": [
                [
                    "==",
                    "1.3.1"
                ]
            ]
        },
        {
            "name": "multidict",
            "specs": [
                [
                    "==",
                    "6.6.4"
                ]
            ]
        },
        {
            "name": "multiprocess",
            "specs": [
                [
                    "==",
                    "0.70.18"
                ]
            ]
        },
        {
            "name": "murmurhash",
            "specs": [
                [
                    "==",
                    "1.0.13"
                ]
            ]
        },
        {
            "name": "mypy-extensions",
            "specs": [
                [
                    "==",
                    "1.1.0"
                ]
            ]
        },
        {
            "name": "natsort",
            "specs": [
                [
                    "==",
                    "8.4.0"
                ]
            ]
        },
        {
            "name": "nbclient",
            "specs": [
                [
                    "==",
                    "0.10.2"
                ]
            ]
        },
        {
            "name": "nbconvert",
            "specs": [
                [
                    "==",
                    "7.16.6"
                ]
            ]
        },
        {
            "name": "nbformat",
            "specs": [
                [
                    "==",
                    "5.10.4"
                ]
            ]
        },
        {
            "name": "nest-asyncio",
            "specs": [
                [
                    "==",
                    "1.6.0"
                ]
            ]
        },
        {
            "name": "networkx",
            "specs": [
                [
                    "==",
                    "3.5"
                ]
            ]
        },
        {
            "name": "nh3",
            "specs": [
                [
                    "==",
                    "0.3.0"
                ]
            ]
        },
        {
            "name": "ninja",
            "specs": [
                [
                    "==",
                    "1.13.0"
                ]
            ]
        },
        {
            "name": "nltk",
            "specs": [
                [
                    "==",
                    "3.9.1"
                ]
            ]
        },
        {
            "name": "nodeenv",
            "specs": [
                [
                    "==",
                    "1.9.1"
                ]
            ]
        },
        {
            "name": "notebook",
            "specs": [
                [
                    "==",
                    "7.4.5"
                ]
            ]
        },
        {
            "name": "notebook-shim",
            "specs": [
                [
                    "==",
                    "0.2.4"
                ]
            ]
        },
        {
            "name": "numpy",
            "specs": [
                [
                    "==",
                    "2.3.2"
                ]
            ]
        },
        {
            "name": "olefile",
            "specs": [
                [
                    "==",
                    "0.47"
                ]
            ]
        },
        {
            "name": "onnxruntime",
            "specs": [
                [
                    "==",
                    "1.22.1"
                ]
            ]
        },
        {
            "name": "openai",
            "specs": [
                [
                    "==",
                    "1.106.1"
                ]
            ]
        },
        {
            "name": "opencv-python-headless",
            "specs": [
                [
                    "==",
                    "4.11.0.86"
                ]
            ]
        },
        {
            "name": "openpyxl",
            "specs": [
                [
                    "==",
                    "3.1.5"
                ]
            ]
        },
        {
            "name": "opentelemetry-api",
            "specs": [
                [
                    "==",
                    "1.36.0"
                ]
            ]
        },
        {
            "name": "opentelemetry-sdk",
            "specs": [
                [
                    "==",
                    "1.36.0"
                ]
            ]
        },
        {
            "name": "opentelemetry-semantic-conventions",
            "specs": [
                [
                    "==",
                    "0.57b0"
                ]
            ]
        },
        {
            "name": "orjson",
            "specs": [
                [
                    "==",
                    "3.11.3"
                ]
            ]
        },
        {
            "name": "packaging",
            "specs": [
                [
                    "==",
                    "25.0"
                ]
            ]
        },
        {
            "name": "paginate",
            "specs": [
                [
                    "==",
                    "0.5.7"
                ]
            ]
        },
        {
            "name": "pandas",
            "specs": [
                [
                    "==",
                    "2.3.2"
                ]
            ]
        },
        {
            "name": "pandocfilters",
            "specs": [
                [
                    "==",
                    "1.5.1"
                ]
            ]
        },
        {
            "name": "parso",
            "specs": [
                [
                    "==",
                    "0.8.5"
                ]
            ]
        },
        {
            "name": "pastel",
            "specs": [
                [
                    "==",
                    "0.2.1"
                ]
            ]
        },
        {
            "name": "pathspec",
            "specs": [
                [
                    "==",
                    "0.12.1"
                ]
            ]
        },
        {
            "name": "pdfminer-six",
            "specs": [
                [
                    "==",
                    "20250506"
                ]
            ]
        },
        {
            "name": "pdfplumber",
            "specs": [
                [
                    "==",
                    "0.11.7"
                ]
            ]
        },
        {
            "name": "pexpect",
            "specs": [
                [
                    "==",
                    "4.9.0"
                ]
            ]
        },
        {
            "name": "pillow",
            "specs": [
                [
                    "==",
                    "11.3.0"
                ]
            ]
        },
        {
            "name": "pip",
            "specs": [
                [
                    "==",
                    "25.2"
                ]
            ]
        },
        {
            "name": "platformdirs",
            "specs": [
                [
                    "==",
                    "4.4.0"
                ]
            ]
        },
        {
            "name": "pluggy",
            "specs": [
                [
                    "==",
                    "1.6.0"
                ]
            ]
        },
        {
            "name": "poethepoet",
            "specs": [
                [
                    "==",
                    "0.37.0"
                ]
            ]
        },
        {
            "name": "polyfactory",
            "specs": [
                [
                    "==",
                    "2.22.2"
                ]
            ]
        },
        {
            "name": "portalocker",
            "specs": [
                [
                    "==",
                    "3.2.0"
                ]
            ]
        },
        {
            "name": "pre-commit",
            "specs": [
                [
                    "==",
                    "4.3.0"
                ]
            ]
        },
        {
            "name": "preshed",
            "specs": [
                [
                    "==",
                    "3.0.10"
                ]
            ]
        },
        {
            "name": "prometheus-client",
            "specs": [
                [
                    "==",
                    "0.22.1"
                ]
            ]
        },
        {
            "name": "prompt-toolkit",
            "specs": [
                [
                    "==",
                    "3.0.52"
                ]
            ]
        },
        {
            "name": "propcache",
            "specs": [
                [
                    "==",
                    "0.3.2"
                ]
            ]
        },
        {
            "name": "protobuf",
            "specs": [
                [
                    "==",
                    "6.32.0"
                ]
            ]
        },
        {
            "name": "psutil",
            "specs": [
                [
                    "==",
                    "7.0.0"
                ]
            ]
        },
        {
            "name": "ptyprocess",
            "specs": [
                [
                    "==",
                    "0.7.0"
                ]
            ]
        },
        {
            "name": "pure-eval",
            "specs": [
                [
                    "==",
                    "0.2.3"
                ]
            ]
        },
        {
            "name": "pyasn1",
            "specs": [
                [
                    "==",
                    "0.6.1"
                ]
            ]
        },
        {
            "name": "pyasn1-modules",
            "specs": [
                [
                    "==",
                    "0.4.2"
                ]
            ]
        },
        {
            "name": "pyclipper",
            "specs": [
                [
                    "==",
                    "1.3.0.post6"
                ]
            ]
        },
        {
            "name": "pycodestyle",
            "specs": [
                [
                    "==",
                    "2.14.0"
                ]
            ]
        },
        {
            "name": "pycparser",
            "specs": [
                [
                    "==",
                    "2.22"
                ]
            ]
        },
        {
            "name": "pydantic",
            "specs": [
                [
                    "==",
                    "2.11.7"
                ]
            ]
        },
        {
            "name": "pydantic-core",
            "specs": [
                [
                    "==",
                    "2.33.2"
                ]
            ]
        },
        {
            "name": "pydantic-settings",
            "specs": [
                [
                    "==",
                    "2.10.1"
                ]
            ]
        },
        {
            "name": "pydub",
            "specs": [
                [
                    "==",
                    "0.25.1"
                ]
            ]
        },
        {
            "name": "pyflakes",
            "specs": [
                [
                    "==",
                    "3.4.0"
                ]
            ]
        },
        {
            "name": "pygments",
            "specs": [
                [
                    "==",
                    "2.19.2"
                ]
            ]
        },
        {
            "name": "pyjwt",
            "specs": [
                [
                    "==",
                    "2.10.1"
                ]
            ]
        },
        {
            "name": "pylatexenc",
            "specs": [
                [
                    "==",
                    "2.10"
                ]
            ]
        },
        {
            "name": "pymdown-extensions",
            "specs": [
                [
                    "==",
                    "10.16.1"
                ]
            ]
        },
        {
            "name": "pymupdf",
            "specs": [
                [
                    "==",
                    "1.26.4"
                ]
            ]
        },
        {
            "name": "pypandoc",
            "specs": [
                [
                    "==",
                    "1.15"
                ]
            ]
        },
        {
            "name": "pyparsing",
            "specs": [
                [
                    "==",
                    "3.2.3"
                ]
            ]
        },
        {
            "name": "pypdf",
            "specs": [
                [
                    "==",
                    "6.0.0"
                ]
            ]
        },
        {
            "name": "pypdfium2",
            "specs": [
                [
                    "==",
                    "4.30.0"
                ]
            ]
        },
        {
            "name": "pyproject-autoflake",
            "specs": [
                [
                    "==",
                    "1.0.2"
                ]
            ]
        },
        {
            "name": "pyproject-hooks",
            "specs": [
                [
                    "==",
                    "1.2.0"
                ]
            ]
        },
        {
            "name": "pytest",
            "specs": [
                [
                    "==",
                    "8.4.2"
                ]
            ]
        },
        {
            "name": "pytest-cov",
            "specs": [
                [
                    "==",
                    "6.3.0"
                ]
            ]
        },
        {
            "name": "python-bidi",
            "specs": [
                [
                    "==",
                    "0.6.6"
                ]
            ]
        },
        {
            "name": "python-dateutil",
            "specs": [
                [
                    "==",
                    "2.9.0.post0"
                ]
            ]
        },
        {
            "name": "python-docx",
            "specs": [
                [
                    "==",
                    "1.2.0"
                ]
            ]
        },
        {
            "name": "python-dotenv",
            "specs": [
                [
                    "==",
                    "1.1.1"
                ]
            ]
        },
        {
            "name": "python-json-logger",
            "specs": [
                [
                    "==",
                    "3.3.0"
                ]
            ]
        },
        {
            "name": "python-pptx",
            "specs": [
                [
                    "==",
                    "1.0.2"
                ]
            ]
        },
        {
            "name": "pytz",
            "specs": [
                [
                    "==",
                    "2025.2"
                ]
            ]
        },
        {
            "name": "pyupgrade",
            "specs": [
                [
                    "==",
                    "3.20.0"
                ]
            ]
        },
        {
            "name": "pyyaml",
            "specs": [
                [
                    "==",
                    "6.0.2"
                ]
            ]
        },
        {
            "name": "pyyaml-env-tag",
            "specs": [
                [
                    "==",
                    "1.1"
                ]
            ]
        },
        {
            "name": "pyzmq",
            "specs": [
                [
                    "==",
                    "27.0.2"
                ]
            ]
        },
        {
            "name": "qdrant-client",
            "specs": [
                [
                    "==",
                    "1.15.1"
                ]
            ]
        },
        {
            "name": "readme-renderer",
            "specs": [
                [
                    "==",
                    "44.0"
                ]
            ]
        },
        {
            "name": "referencing",
            "specs": [
                [
                    "==",
                    "0.36.2"
                ]
            ]
        },
        {
            "name": "regex",
            "specs": [
                [
                    "==",
                    "2025.9.1"
                ]
            ]
        },
        {
            "name": "reportlab",
            "specs": [
                [
                    "==",
                    "4.4.3"
                ]
            ]
        },
        {
            "name": "requests",
            "specs": [
                [
                    "==",
                    "2.32.5"
                ]
            ]
        },
        {
            "name": "requests-mock",
            "specs": [
                [
                    "==",
                    "1.12.1"
                ]
            ]
        },
        {
            "name": "requests-toolbelt",
            "specs": [
                [
                    "==",
                    "1.0.0"
                ]
            ]
        },
        {
            "name": "rfc3339-validator",
            "specs": [
                [
                    "==",
                    "0.1.4"
                ]
            ]
        },
        {
            "name": "rfc3986",
            "specs": [
                [
                    "==",
                    "2.0.0"
                ]
            ]
        },
        {
            "name": "rfc3986-validator",
            "specs": [
                [
                    "==",
                    "0.1.1"
                ]
            ]
        },
        {
            "name": "rfc3987-syntax",
            "specs": [
                [
                    "==",
                    "1.1.0"
                ]
            ]
        },
        {
            "name": "rich",
            "specs": [
                [
                    "==",
                    "14.1.0"
                ]
            ]
        },
        {
            "name": "rpds-py",
            "specs": [
                [
                    "==",
                    "0.27.1"
                ]
            ]
        },
        {
            "name": "rsa",
            "specs": [
                [
                    "==",
                    "4.9.1"
                ]
            ]
        },
        {
            "name": "rtree",
            "specs": [
                [
                    "==",
                    "1.4.1"
                ]
            ]
        },
        {
            "name": "ruff",
            "specs": [
                [
                    "==",
                    "0.12.12"
                ]
            ]
        },
        {
            "name": "safetensors",
            "specs": [
                [
                    "==",
                    "0.6.2"
                ]
            ]
        },
        {
            "name": "scikit-image",
            "specs": [
                [
                    "==",
                    "0.25.2"
                ]
            ]
        },
        {
            "name": "scikit-learn",
            "specs": [
                [
                    "==",
                    "1.7.1"
                ]
            ]
        },
        {
            "name": "scipy",
            "specs": [
                [
                    "==",
                    "1.16.1"
                ]
            ]
        },
        {
            "name": "selectolax",
            "specs": [
                [
                    "==",
                    "0.3.29"
                ]
            ]
        },
        {
            "name": "semchunk",
            "specs": [
                [
                    "==",
                    "2.2.2"
                ]
            ]
        },
        {
            "name": "send2trash",
            "specs": [
                [
                    "==",
                    "1.8.3"
                ]
            ]
        },
        {
            "name": "sentence-transformers",
            "specs": [
                [
                    "==",
                    "5.1.0"
                ]
            ]
        },
        {
            "name": "setuptools",
            "specs": [
                [
                    "==",
                    "80.9.0"
                ]
            ]
        },
        {
            "name": "setuptools-scm",
            "specs": [
                [
                    "==",
                    "9.2.0"
                ]
            ]
        },
        {
            "name": "shapely",
            "specs": [
                [
                    "==",
                    "2.1.1"
                ]
            ]
        },
        {
            "name": "shellingham",
            "specs": [
                [
                    "==",
                    "1.5.4"
                ]
            ]
        },
        {
            "name": "six",
            "specs": [
                [
                    "==",
                    "1.17.0"
                ]
            ]
        },
        {
            "name": "smart-open",
            "specs": [
                [
                    "==",
                    "7.3.0.post1"
                ]
            ]
        },
        {
            "name": "smmap",
            "specs": [
                [
                    "==",
                    "5.0.2"
                ]
            ]
        },
        {
            "name": "sniffio",
            "specs": [
                [
                    "==",
                    "1.3.1"
                ]
            ]
        },
        {
            "name": "soupsieve",
            "specs": [
                [
                    "==",
                    "2.8"
                ]
            ]
        },
        {
            "name": "spacy",
            "specs": [
                [
                    "==",
                    "3.8.7"
                ]
            ]
        },
        {
            "name": "spacy-legacy",
            "specs": [
                [
                    "==",
                    "3.0.12"
                ]
            ]
        },
        {
            "name": "spacy-loggers",
            "specs": [
                [
                    "==",
                    "1.0.5"
                ]
            ]
        },
        {
            "name": "speechrecognition",
            "specs": [
                [
                    "==",
                    "3.14.3"
                ]
            ]
        },
        {
            "name": "srsly",
            "specs": [
                [
                    "==",
                    "2.5.1"
                ]
            ]
        },
        {
            "name": "stack-data",
            "specs": [
                [
                    "==",
                    "0.6.3"
                ]
            ]
        },
        {
            "name": "standard-aifc",
            "specs": [
                [
                    "==",
                    "3.13.0"
                ]
            ]
        },
        {
            "name": "standard-chunk",
            "specs": [
                [
                    "==",
                    "3.13.0"
                ]
            ]
        },
        {
            "name": "svglib",
            "specs": [
                [
                    "==",
                    "1.5.1"
                ]
            ]
        },
        {
            "name": "sympy",
            "specs": [
                [
                    "==",
                    "1.14.0"
                ]
            ]
        },
        {
            "name": "tabulate",
            "specs": [
                [
                    "==",
                    "0.9.0"
                ]
            ]
        },
        {
            "name": "tenacity",
            "specs": [
                [
                    "==",
                    "9.1.2"
                ]
            ]
        },
        {
            "name": "terminado",
            "specs": [
                [
                    "==",
                    "0.18.1"
                ]
            ]
        },
        {
            "name": "thinc",
            "specs": [
                [
                    "==",
                    "8.3.6"
                ]
            ]
        },
        {
            "name": "threadpoolctl",
            "specs": [
                [
                    "==",
                    "3.6.0"
                ]
            ]
        },
        {
            "name": "tifffile",
            "specs": [
                [
                    "==",
                    "2025.8.28"
                ]
            ]
        },
        {
            "name": "tiktoken",
            "specs": [
                [
                    "==",
                    "0.11.0"
                ]
            ]
        },
        {
            "name": "tinycss2",
            "specs": [
                [
                    "==",
                    "1.4.0"
                ]
            ]
        },
        {
            "name": "tokenize-rt",
            "specs": [
                [
                    "==",
                    "6.2.0"
                ]
            ]
        },
        {
            "name": "tokenizers",
            "specs": [
                [
                    "==",
                    "0.22.0"
                ]
            ]
        },
        {
            "name": "toml",
            "specs": [
                [
                    "==",
                    "0.10.2"
                ]
            ]
        },
        {
            "name": "torch",
            "specs": [
                [
                    "==",
                    "2.7.1"
                ]
            ]
        },
        {
            "name": "torchvision",
            "specs": [
                [
                    "==",
                    "0.22.1"
                ]
            ]
        },
        {
            "name": "tornado",
            "specs": [
                [
                    "==",
                    "6.5.2"
                ]
            ]
        },
        {
            "name": "tqdm",
            "specs": [
                [
                    "==",
                    "4.67.1"
                ]
            ]
        },
        {
            "name": "traitlets",
            "specs": [
                [
                    "==",
                    "5.14.3"
                ]
            ]
        },
        {
            "name": "transformers",
            "specs": [
                [
                    "==",
                    "4.56.1"
                ]
            ]
        },
        {
            "name": "twine",
            "specs": [
                [
                    "==",
                    "6.2.0"
                ]
            ]
        },
        {
            "name": "typer",
            "specs": [
                [
                    "==",
                    "0.16.1"
                ]
            ]
        },
        {
            "name": "types-python-dateutil",
            "specs": [
                [
                    "==",
                    "2.9.0.20250822"
                ]
            ]
        },
        {
            "name": "typing-extensions",
            "specs": [
                [
                    "==",
                    "4.15.0"
                ]
            ]
        },
        {
            "name": "typing-inspection",
            "specs": [
                [
                    "==",
                    "0.4.1"
                ]
            ]
        },
        {
            "name": "tzdata",
            "specs": [
                [
                    "==",
                    "2025.2"
                ]
            ]
        },
        {
            "name": "uri-template",
            "specs": [
                [
                    "==",
                    "1.3.0"
                ]
            ]
        },
        {
            "name": "urllib3",
            "specs": [
                [
                    "==",
                    "2.5.0"
                ]
            ]
        },
        {
            "name": "uv",
            "specs": [
                [
                    "==",
                    "0.8.15"
                ]
            ]
        },
        {
            "name": "virtualenv",
            "specs": [
                [
                    "==",
                    "20.34.0"
                ]
            ]
        },
        {
            "name": "voyageai",
            "specs": [
                [
                    "==",
                    "0.3.4"
                ]
            ]
        },
        {
            "name": "wasabi",
            "specs": [
                [
                    "==",
                    "1.1.3"
                ]
            ]
        },
        {
            "name": "watchdog",
            "specs": [
                [
                    "==",
                    "6.0.0"
                ]
            ]
        },
        {
            "name": "wcmatch",
            "specs": [
                [
                    "==",
                    "10.1"
                ]
            ]
        },
        {
            "name": "wcwidth",
            "specs": [
                [
                    "==",
                    "0.2.13"
                ]
            ]
        },
        {
            "name": "weasel",
            "specs": [
                [
                    "==",
                    "0.4.1"
                ]
            ]
        },
        {
            "name": "webcolors",
            "specs": [
                [
                    "==",
                    "24.11.1"
                ]
            ]
        },
        {
            "name": "webencodings",
            "specs": [
                [
                    "==",
                    "0.5.1"
                ]
            ]
        },
        {
            "name": "websocket-client",
            "specs": [
                [
                    "==",
                    "1.8.0"
                ]
            ]
        },
        {
            "name": "websockets",
            "specs": [
                [
                    "==",
                    "15.0.1"
                ]
            ]
        },
        {
            "name": "wheel",
            "specs": [
                [
                    "==",
                    "0.45.1"
                ]
            ]
        },
        {
            "name": "widgetsnbextension",
            "specs": [
                [
                    "==",
                    "4.0.14"
                ]
            ]
        },
        {
            "name": "wrapt",
            "specs": [
                [
                    "==",
                    "1.17.3"
                ]
            ]
        },
        {
            "name": "xai-sdk",
            "specs": [
                [
                    "==",
                    "1.1.0"
                ]
            ]
        },
        {
            "name": "xlrd",
            "specs": [
                [
                    "==",
                    "2.0.2"
                ]
            ]
        },
        {
            "name": "xlsxwriter",
            "specs": [
                [
                    "==",
                    "3.2.5"
                ]
            ]
        },
        {
            "name": "yarl",
            "specs": [
                [
                    "==",
                    "1.20.1"
                ]
            ]
        },
        {
            "name": "youtube-transcript-api",
            "specs": [
                [
                    "==",
                    "1.0.3"
                ]
            ]
        },
        {
            "name": "zipp",
            "specs": [
                [
                    "==",
                    "3.23.0"
                ]
            ]
        },
        {
            "name": "zstandard",
            "specs": [
                [
                    "==",
                    "0.24.0"
                ]
            ]
        }
    ],
    "lcname": "splitter-mr"
}
        