# Real-Time Sentence Detection
Real-time processing and delivery of sentences from a continuous stream of characters or text chunks.
> **Hint:** *If you're interested in state-of-the-art voice solutions you might also want to **have a look at [Linguflex](https://github.com/KoljaB/Linguflex)**, the original project from which stream2sentence is spun off. It lets you control your environment by speaking and is one of the most capable and sophisticated open-source assistants currently available.*
## Table of Contents
- [Features](#features)
- [Installation](#installation)
- [Usage](#usage)
- [Configuration](#configuration)
- [Contributing](#contributing)
- [License](#license)
## Features
- Generates sentences from a stream of text in real time.
- Customizable to fine-tune the balance between speed and reliability.
- Option to clean the output by removing links and emojis from the detected sentences.
- Easy to configure and integrate.
## Installation
```bash
pip install stream2sentence
```
## Usage
Pass a generator of characters or text chunks to `generate_sentences()` to get a generator of sentences in return.
Here's a basic example:
```python
from stream2sentence import generate_sentences
# Dummy generator for demonstration
def dummy_generator():
yield "This is a sentence. And here's another! Yet, "
yield "there's more. This ends now."
for sentence in generate_sentences(dummy_generator()):
print(sentence)
```
This will output:
```
This is a sentence.
And here's another!
Yet, there's more.
This ends now.
```
One of the main use cases of this library is enabling fast text-to-speech synthesis from character feeds generated by large language models: it provides the fastest possible access to a complete sentence or sentence fragment (via the `quick_yield_single_sentence_fragment` flag), which can then be synthesized in real time. This usage is demonstrated in the `test_stream_from_llm.py` file in the `tests` directory.
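A minimal sketch of that pattern is shown below. `llm_token_stream()` and `synthesize()` are hypothetical stand-ins for your streaming LLM client and TTS engine, not part of this library:

```python
from stream2sentence import generate_sentences

def llm_token_stream():
    # Hypothetical stand-in for a streaming LLM response that
    # yields text chunks as they arrive from the model.
    for chunk in ["The weather ", "today is sunny. ", "Expect mild ", "temperatures tonight."]:
        yield chunk

def synthesize(text):
    # Hypothetical stand-in for a real-time TTS call.
    print(f"TTS <- {text}")

# quick_yield_single_sentence_fragment hands over the first usable
# fragment as early as possible, so audio playback can start sooner.
for sentence in generate_sentences(
    llm_token_stream(),
    quick_yield_single_sentence_fragment=True,
):
    synthesize(sentence)
```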
## Configuration
The `generate_sentences()` function offers various parameters to fine-tune its behavior:
### Core Parameters
- `generator: Iterator[str]`
- The primary input source, yielding chunks of text to be processed.
- Can be any iterator that emits text chunks of any size.
- `context_size: int = 12`
- Number of characters considered for sentence boundary detection.
- Larger values improve accuracy but may increase latency.
- Default: 12 characters
- `context_size_look_overhead: int = 12`
- Additional characters to examine beyond `context_size` for sentence splitting.
- Enhances sentence detection accuracy.
- Default: 12 characters
- `minimum_sentence_length: int = 10`
- Minimum character count for a text chunk to be considered a sentence.
- Shorter fragments are buffered until this threshold is met.
- Default: 10 characters
- `minimum_first_fragment_length: int = 10`
- Minimum character count required for the first sentence fragment.
- Ensures the initial output meets a specified length threshold.
- Default: 10 characters
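As an illustration of how these parameters trade latency against reliability, the sketch below shrinks the detection context and minimum lengths; the values are examples, not recommendations:

```python
from stream2sentence import generate_sentences

# Any iterator of text chunks works as input.
chunks = iter(["Short one. ", "A slightly longer ", "second sentence follows."])

for sentence in generate_sentences(
    chunks,
    context_size=8,                   # fewer characters inspected per boundary check
    context_size_look_overhead=8,     # smaller look-ahead beyond the context window
    minimum_sentence_length=6,        # accept shorter chunks as sentences
    minimum_first_fragment_length=6,  # allow a shorter first fragment
):
    print(sentence)
```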
### Yield Control
These parameters control how quickly and frequently the generator yields sentence fragments:
- `quick_yield_single_sentence_fragment: bool = False`
- When True, yields the first fragment of the first sentence as quickly as possible.
- Useful for getting immediate output in real-time applications like speech synthesis.
- Default: False
- `quick_yield_for_all_sentences: bool = False`
- When True, yields the first fragment of every sentence as quickly as possible.
- Extends the quick yield behavior to all sentences, not just the first one.
- Automatically sets `quick_yield_single_sentence_fragment` to True.
- Default: False
- `quick_yield_every_fragment: bool = False`
- When True, yields every fragment of every sentence as quickly as possible.
- Provides the most granular output, yielding fragments as soon as they're detected.
- Automatically sets both `quick_yield_for_all_sentences` and `quick_yield_single_sentence_fragment` to True.
- Default: False
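A short sketch of the most granular mode; `quick_yield_every_fragment=True` implies the other two flags, so fragments are handed over as soon as they are detected:

```python
from stream2sentence import generate_sentences

def chunks():
    yield "First sentence, with a comma in the middle. "
    yield "Second sentence here."

# Every detected fragment of every sentence is yielded as early as possible.
for fragment in generate_sentences(chunks(), quick_yield_every_fragment=True):
    print(repr(fragment))
```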
### Text Cleanup
- `cleanup_text_links: bool = False`
- When True, removes hyperlinks from the output sentences.
- Default: False
- `cleanup_text_emojis: bool = False`
- When True, removes emoji characters from the output sentences.
- Default: False
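A minimal sketch with both cleanup options enabled:

```python
from stream2sentence import generate_sentences

chunks = iter(["Check https://example.com for details 😀. ", "That is all."])

for sentence in generate_sentences(
    chunks,
    cleanup_text_links=True,   # strip hyperlinks from the yielded sentences
    cleanup_text_emojis=True,  # strip emoji characters as well
):
    print(sentence)
```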
### Tokenization
- `tokenize_sentences: Callable = None`
- Custom function for sentence tokenization.
- If None, uses the default tokenizer specified by `tokenizer`.
- Default: None
- `tokenizer: str = "nltk"`
- Specifies the tokenizer to use. Options: "nltk" or "stanza"
- Default: "nltk"
- `language: str = "en"`
- Language setting for the tokenizer.
- Use "en" for English or "multilingual" for Stanza tokenizer.
- Default: "en"
### Debugging and Fine-tuning
- `log_characters: bool = False`
- When True, logs each processed character to the console.
- Useful for debugging or monitoring real-time processing.
- Default: False
- `sentence_fragment_delimiters: str = ".?!;:,\n…)]}。-"`
- Characters considered as potential sentence fragment delimiters.
- Used for quick yielding of sentence fragments.
- Default: ".?!;:,\n…)]}。-"
- `full_sentence_delimiters: str = ".?!\n…。"`
- Characters considered as full sentence delimiters.
- Used for more definitive sentence boundary detection.
- Default: ".?!\n…。"
- `force_first_fragment_after_words: int = 15`
- Forces the yield of the first sentence fragment after this many words.
- Ensures timely output even with long opening sentences.
- Default: 15 words
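An illustrative sketch that narrows the fragment delimiter set and caps how long the first fragment can be held back; the specific values are examples, not recommendations:

```python
from stream2sentence import generate_sentences

chunks = iter([
    "A rather long opening sentence that keeps going and going before it finally ends. ",
    "Then a short one.",
])

for sentence in generate_sentences(
    chunks,
    log_characters=False,                 # set True to trace each processed character
    sentence_fragment_delimiters=".?!,",  # narrower delimiter set for quick fragment yields
    force_first_fragment_after_words=10,  # don't hold the first fragment past 10 words
):
    print(sentence)
```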
## Contributing
Any contributions you make are welcome and **greatly appreciated**.
1. **Fork** the Project.
2. **Create** your Feature Branch (`git checkout -b feature/AmazingFeature`).
3. **Commit** your Changes (`git commit -m 'Add some AmazingFeature'`).
4. **Push** to the Branch (`git push origin feature/AmazingFeature`).
5. **Open** a Pull Request.
## License
This project is licensed under the MIT License. For more details, see the [`LICENSE`](LICENSE) file.
---
Project created and maintained by [Kolja Beigel](https://github.com/KoljaB).