ovos-document-chunkers

Name: ovos-document-chunkers
Version: 0.1.1
Summary: A plugin for OVOS
Author: jarbasai
License: MIT
Keywords: ovos, openvoiceos, plugin
Upload time: 2025-07-13 17:55:38
Requirements: No requirements were recorded.
# Document Chunkers

A collection of helpers to process raw documents.

## Overview

This library provides tools for chunking documents into manageable pieces such as paragraphs and sentences. It's
particularly useful for preprocessing text data for natural language processing (NLP) tasks.

- [Text Segmenters](#text-segmenters)
    - [Supported Models](#supported-models)
    - [Usage](#usage)
        - [Example: Using SaT for Sentence Segmentation](#example-using-sat-for-sentence-segmentation)
        - [Example: Using WtP for Paragraph Segmentation](#example-using-wtp-for-paragraph-segmentation)
        - [Example: Using PySBD for Sentence Segmentation](#example-using-pysbd-for-sentence-segmentation)
- [File Formats](#file-formats)
    - [Supported File Formats](#supported-file-formats)
    - [Usage](#usage-1)
        - [Example using MarkdownSentenceSplitter](#example-using-markdownsentencesplitter)
        - [Example using MarkdownParagraphSplitter](#example-using-markdownparagraphsplitter)
        - [Example using HTMLSentenceSplitter](#example-using-htmlsentencesplitter)
        - [Example using HTMLParagraphSplitter](#example-using-htmlparagraphsplitter)
        - [Example using PDFParagraphSplitter](#example-using-pdfparagraphsplitter)

## Text Segmenters

![img.png](img.png)

### Supported Models

- **SaT**
  — [Segment Any Text: A Universal Approach for Robust, Efficient and Adaptable Sentence Segmentation](https://arxiv.org/abs/2406.16678)
  by Markus Frohmann, Igor Sterner, Benjamin Minixhofer, Ivan Vulić and Markus Schedl (**state-of-the-art, recommended**) - 85 languages
- **WtP**
  — [Where’s the Point? Self-Supervised Multilingual Punctuation-Agnostic Sentence Segmentation](https://aclanthology.org/2023.acl-long.398/)
  by Benjamin Minixhofer, Jonas Pfeiffer and Ivan Vulić - 85 languages
- **PySBD** — [PySBD: Pragmatic Sentence Boundary Disambiguation](https://arxiv.org/abs/2010.09657) by Nipun Sadvilkar and Mark Neumann (rule-based, **lightweight**) - 22 languages

### Usage

#### Example: Using SaT for Sentence Segmentation

```python
from ovos_document_chunkers import SaTSentenceSplitter

config = {"model": "sat-3l-sm", "use_cuda": False}
splitter = SaTSentenceSplitter(config)

text = "This is a sentence. And this is another one."
sentences = splitter.chunk(text)

for sentence in sentences:
    print(sentence)
```

#### Example: Using WtP for Paragraph Segmentation

```python
from ovos_document_chunkers import WtPParagraphSplitter

config = {"model": "wtp-bert-mini", "use_cuda": False}
splitter = WtPParagraphSplitter(config)

text = "This is a paragraph. It contains multiple sentences.\n\nThis is another paragraph."
paragraphs = splitter.chunk(text)

for paragraph in paragraphs:
    print(paragraph)
```

#### Example: Using PySBD for Sentence Segmentation

```python
from ovos_document_chunkers import PySBDSentenceSplitter

config = {"lang": "en"}
splitter = PySBDSentenceSplitter(config)

text = "This is a sentence. This is another one!"
sentences = splitter.chunk(text)

for sentence in sentences:
    print(sentence)
```
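
All three segmenters expose the same `chunk()` method, so they compose naturally. The following is a minimal sketch (reusing the classes and configs shown above) that first splits text into paragraphs with WtP and then splits each paragraph into sentences with SaT:

```python
from ovos_document_chunkers import SaTSentenceSplitter, WtPParagraphSplitter

# Same configs as in the examples above
paragraph_splitter = WtPParagraphSplitter({"model": "wtp-bert-mini", "use_cuda": False})
sentence_splitter = SaTSentenceSplitter({"model": "sat-3l-sm", "use_cuda": False})

text = "This is a paragraph. It has two sentences.\n\nThis is another paragraph."

# Split into paragraphs, then split each paragraph into sentences
for paragraph in paragraph_splitter.chunk(text):
    for sentence in sentence_splitter.chunk(paragraph):
        print(sentence)
```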

## File Formats

### Supported File Formats

| Type     | Description                                                  | Class Name                | Expected Input                      | File Extension |
|----------|--------------------------------------------------------------|---------------------------|-------------------------------------|----------------|
| Markdown | Splits Markdown text into sentences or paragraphs            | MarkdownSentenceSplitter  | String (url, path or Markdown text) | .md            |
|          |                                                              | MarkdownParagraphSplitter | String (url, path or Markdown text) | .md            |
| HTML     | Splits HTML text into sentences or paragraphs                | HTMLSentenceSplitter      | String (url, path or HTML text)     | .html          |
|          |                                                              | HTMLParagraphSplitter     | String (url, path or HTML text)     | .html          |
| PDF      | Splits PDF documents into sentences or paragraphs            | PDFSentenceSplitter       | String (url or path to PDF file)    | .pdf           |
|          |                                                              | PDFParagraphSplitter      | String (url or path to PDF file)    | .pdf           |
| doc      | Splits Microsoft Word .doc documents into sentences or paragraphs  | DOCSentenceSplitter       | String (url or path to doc file)    | .doc           |
|          |                                                              | DOCParagraphSplitter      | String (url or path to doc file)    | .doc           |
| docx     | Splits Microsoft Word .docx documents into sentences or paragraphs | DOCxSentenceSplitter      | String (url or path to docx file)   | .docx          |
|          |                                                              | DOCxParagraphSplitter     | String (url or path to docx file)   | .docx          |
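
Every splitter in the table exposes the same `chunk()` method, so a caller can pick a class by file extension. The helper below is a hypothetical sketch, not part of the library; it assumes the doc/docx splitters are importable from the package root like the HTML and PDF splitters, while the Markdown splitter is imported from its submodule as in the examples below.

```python
from pathlib import Path

# Top-level imports for the doc/docx splitters are assumed here,
# mirroring the HTML and PDF splitters used in the examples below.
from ovos_document_chunkers import (
    DOCParagraphSplitter,
    DOCxParagraphSplitter,
    HTMLParagraphSplitter,
    PDFParagraphSplitter,
)
from ovos_document_chunkers.text.markdown import MarkdownParagraphSplitter

# Hypothetical extension-to-splitter mapping built from the table above
PARAGRAPH_SPLITTERS = {
    ".md": MarkdownParagraphSplitter,
    ".html": HTMLParagraphSplitter,
    ".pdf": PDFParagraphSplitter,
    ".doc": DOCParagraphSplitter,
    ".docx": DOCxParagraphSplitter,
}


def split_paragraphs(path: str):
    """Chunk a document into paragraphs based on its file extension."""
    splitter_cls = PARAGRAPH_SPLITTERS[Path(path).suffix.lower()]
    return splitter_cls().chunk(path)
```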

### Usage

#### Example using MarkdownSentenceSplitter

```python
from ovos_document_chunkers.text.markdown import MarkdownSentenceSplitter
import requests

markdown_text = requests.get("https://github.com/OpenVoiceOS/ovos-core/raw/dev/README.md").text

sentence_splitter = MarkdownSentenceSplitter()
sentences = sentence_splitter.chunk(markdown_text)

print("Sentences:")
for sentence in sentences:
    print(sentence)
```

#### Example using MarkdownParagraphSplitter

```python
from ovos_document_chunkers.text.markdown import MarkdownParagraphSplitter
import requests

markdown_text = requests.get("https://github.com/OpenVoiceOS/ovos-core/raw/dev/README.md").text

paragraph_splitter = MarkdownParagraphSplitter()
paragraphs = paragraph_splitter.chunk(markdown_text)

print("\nParagraphs:")
for paragraph in paragraphs:
    print(paragraph)
```

#### Example using HTMLSentenceSplitter

```python
from ovos_document_chunkers import HTMLSentenceSplitter
import requests

html_text = requests.get("https://www.gofundme.com/f/openvoiceos").text

sentence_splitter = HTMLSentenceSplitter()
sentences = sentence_splitter.chunk(html_text)

print("Sentences:")
for sentence in sentences:
    print(sentence)
```

#### Example using HTMLParagraphSplitter

```python
from ovos_document_chunkers import HTMLParagraphSplitter
import requests

html_text = requests.get("https://www.gofundme.com/f/openvoiceos").text

paragraph_splitter = HTMLParagraphSplitter()
paragraphs = paragraph_splitter.chunk(html_text)

print("\nParagraphs:")
for paragraph in paragraphs:
    print(paragraph)
```

#### Example using PDFParagraphSplitter

```python
from ovos_document_chunkers import PDFParagraphSplitter

pdf_path = "/path/to/your/pdf/document.pdf"

paragraph_splitter = PDFParagraphSplitter()
paragraphs = paragraph_splitter.chunk(pdf_path)

print("\nParagraphs:")
for paragraph in paragraphs:
    print(paragraph)
```
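
The `.doc` and `.docx` splitters listed in the table follow the same pattern. A minimal sketch for `DOCxSentenceSplitter` is shown below; the file path is a placeholder, and the top-level import is assumed to mirror the PDF splitter above.

```python
# Import path assumed to mirror the PDF splitter above
from ovos_document_chunkers import DOCxSentenceSplitter

docx_path = "/path/to/your/docx/document.docx"  # placeholder path

sentence_splitter = DOCxSentenceSplitter()
sentences = sentence_splitter.chunk(docx_path)

print("Sentences:")
for sentence in sentences:
    print(sentence)
```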

## Credits

![image](https://github.com/user-attachments/assets/809588a2-32a2-406c-98c0-f88bf7753cb4)

> This work was sponsored by VisioLab, part of [Royal Dutch Visio](https://visio.org/), the center for testing, education, and research in the field of (innovative) assistive technology for blind and visually impaired people and professionals. We explore (new) technological developments such as Voice, VR, and AI and make the knowledge and expertise we gain available to everyone.

            
