# Document Chunkers
A collection of helpers to process raw documents
## Overview
This library provides tools for chunking documents into manageable pieces such as paragraphs and sentences. It's
particularly useful for preprocessing text data for natural language processing (NLP) tasks.
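To illustrate the idea of chunking before diving into the library, here is a minimal, dependency-free sketch of a naive paragraph splitter (splitting on blank lines). This is only an illustration of the concept, not how this library is implemented — the real splitters below use trained models or rule-based segmenters.

```python
import re

def naive_paragraph_chunks(text: str) -> list[str]:
    """Split text into paragraphs on one or more blank lines."""
    # Split wherever a newline is followed by optional whitespace and another newline,
    # then strip surrounding whitespace and drop empty pieces.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

doc = "First paragraph.\n\nSecond paragraph. It has two sentences."
for paragraph in naive_paragraph_chunks(doc):
    print(paragraph)
```

Real-world text needs far more care than this (abbreviations, headings, markup), which is what the segmenters and file-format splitters below provide.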
- [Text Segmenters](#text-segmenters)
- [Supported Models](#supported-models)
- [Usage](#usage)
- [Example: Using SaT for Sentence Segmentation](#example-using-sat-for-sentence-segmentation)
- [Example: Using WtP for Paragraph Segmentation](#example-using-wtp-for-paragraph-segmentation)
- [Example: Using PySBD for Sentence Segmentation](#example-using-pysbd-for-sentence-segmentation)
- [File Formats](#file-formats)
- [Supported File Formats](#supported-file-formats)
- [Usage](#usage-1)
- [Example using MarkdownSentenceSplitter](#example-using-markdownsentencesplitter)
- [Example using MarkdownParagraphSplitter](#example-using-markdownparagraphsplitter)
- [Example using HTMLSentenceSplitter](#example-using-htmlsentencesplitter)
- [Example using HTMLParagraphSplitter](#example-using-htmlparagraphsplitter)
- [Example using PDFParagraphSplitter](#example-using-pdfparagraphsplitter)
## Text Segmenters

### Supported Models

- **SaT** — [Segment Any Text: A Universal Approach for Robust, Efficient and Adaptable Sentence Segmentation](https://arxiv.org/abs/2406.16678) by Markus Frohmann, Igor Sterner, Benjamin Minixhofer, Ivan Vulić and Markus Schedl (**state-of-the-art, encouraged**) - 85 languages
- **WtP** — [Where’s the Point? Self-Supervised Multilingual Punctuation-Agnostic Sentence Segmentation](https://aclanthology.org/2023.acl-long.398/) by Benjamin Minixhofer, Jonas Pfeiffer and Ivan Vulić - 85 languages
- **PySBD** — [PySBD: Pragmatic Sentence Boundary Disambiguation](https://arxiv.org/abs/2010.09657) by Nipun Sadvilkar and Mark Neumann (rule-based, **lightweight**) - 22 languages
### Usage
#### Example: Using SaT for Sentence Segmentation
```python
from ovos_document_chunkers import SaTSentenceSplitter
config = {"model": "sat-3l-sm", "use_cuda": False}
splitter = SaTSentenceSplitter(config)
text = "This is a sentence. And this is another one."
sentences = splitter.chunk(text)
for sentence in sentences:
    print(sentence)
```
#### Example: Using WtP for Paragraph Segmentation
```python
from ovos_document_chunkers import WtPParagraphSplitter
config = {"model": "wtp-bert-mini", "use_cuda": False}
splitter = WtPParagraphSplitter(config)
text = "This is a paragraph. It contains multiple sentences.\n\nThis is another paragraph."
paragraphs = splitter.chunk(text)
for paragraph in paragraphs:
    print(paragraph)
```
#### Example: Using PySBD for Sentence Segmentation
```python
from ovos_document_chunkers import PySBDSentenceSplitter
config = {"lang": "en"}
splitter = PySBDSentenceSplitter(config)
text = "This is a sentence. This is another one!"
sentences = splitter.chunk(text)
for sentence in sentences:
    print(sentence)
```
## File Formats
### Supported File Formats
| Type | Description | Class Name | Expected Input | File Extension |
|----------|--------------------------------------------------------------|---------------------------|-------------------------------------|----------------|
| Markdown | Splits Markdown text into sentences or paragraphs | MarkdownSentenceSplitter | String (url, path or Markdown text) | .md |
| | | MarkdownParagraphSplitter | String (url, path or Markdown text) | .md |
| HTML | Splits HTML text into sentences or paragraphs | HTMLSentenceSplitter | String (url, path or HTML text) | .html |
| | | HTMLParagraphSplitter | String (url, path or HTML text) | .html |
| PDF | Splits PDF documents into sentences or paragraphs | PDFSentenceSplitter | String (url or path to PDF file) | .pdf |
| | | PDFParagraphSplitter | String (url or path to PDF file) | .pdf |
| doc | Splits Microsoft doc documents into sentences or paragraphs | DOCSentenceSplitter | String (url or path to doc file) | .doc |
| | | DOCParagraphSplitter | String (url or path to doc file) | .doc |
| docx | Splits Microsoft docx documents into sentences or paragraphs | DOCxSentenceSplitter | String (url or path to docx file) | .docx |
| | | DOCxParagraphSplitter | String (url or path to docx file) | .docx |
### Usage
#### Example using MarkdownSentenceSplitter
```python
from ovos_document_chunkers.text.markdown import MarkdownSentenceSplitter
import requests
markdown_text = requests.get("https://github.com/OpenVoiceOS/ovos-core/raw/dev/README.md").text
sentence_splitter = MarkdownSentenceSplitter()
sentences = sentence_splitter.chunk(markdown_text)
print("Sentences:")
for sentence in sentences:
    print(sentence)
```
#### Example using MarkdownParagraphSplitter
```python
from ovos_document_chunkers.text.markdown import MarkdownParagraphSplitter
import requests
markdown_text = requests.get("https://github.com/OpenVoiceOS/ovos-core/raw/dev/README.md").text
paragraph_splitter = MarkdownParagraphSplitter()
paragraphs = paragraph_splitter.chunk(markdown_text)
print("\nParagraphs:")
for paragraph in paragraphs:
    print(paragraph)
```
#### Example using HTMLSentenceSplitter
```python
from ovos_document_chunkers import HTMLSentenceSplitter
import requests
html_text = requests.get("https://www.gofundme.com/f/openvoiceos").text
sentence_splitter = HTMLSentenceSplitter()
sentences = sentence_splitter.chunk(html_text)
print("Sentences:")
for sentence in sentences:
    print(sentence)
```
#### Example using HTMLParagraphSplitter
```python
from ovos_document_chunkers import HTMLParagraphSplitter
import requests
html_text = requests.get("https://www.gofundme.com/f/openvoiceos").text
paragraph_splitter = HTMLParagraphSplitter()
paragraphs = paragraph_splitter.chunk(html_text)
print("\nParagraphs:")
for paragraph in paragraphs:
    print(paragraph)
```
#### Example using PDFParagraphSplitter
```python
from ovos_document_chunkers import PDFParagraphSplitter
pdf_path = "/path/to/your/pdf/document.pdf"
paragraph_splitter = PDFParagraphSplitter()
paragraphs = paragraph_splitter.chunk(pdf_path)
print("\nParagraphs:")
for paragraph in paragraphs:
    print(paragraph)
```
## Credits

> This work was sponsored by VisioLab, part of [Royal Dutch Visio](https://visio.org/), the test, education, and research center in the field of (innovative) assistive technology for blind and visually impaired people and professionals. We explore (new) technological developments such as Voice, VR and AI and make the knowledge and expertise we gain available to everyone.