# Real-Time Sentence Detection
Real-time processing and delivery of sentences from a continuous stream of characters or text chunks.
> **Hint:** *If you're interested in state-of-the-art voice solutions, you might also want to **have a look at [Linguflex](https://github.com/KoljaB/Linguflex)**, the original project from which stream2sentence was spun off. It lets you control your environment by speaking and is one of the most capable and sophisticated open-source assistants currently available.*
## Table of Contents
- [Features](#features)
- [Installation](#installation)
- [Usage](#usage)
- [Configuration](#configuration)
- [Contributing](#contributing)
- [License](#license)
## Features
- Generates sentences from a stream of text in real-time.
- Customizable to fine-tune the balance between speed and reliability.
- Option to clean the output by removing links and emojis from the detected sentences.
- Easy to configure and integrate.
## Installation
```bash
pip install stream2sentence
```
## Usage
Pass a generator of characters or text chunks to `generate_sentences()` to get a generator of sentences in return.
Here's a basic example:
```python
from stream2sentence import generate_sentences

# Dummy generator for demonstration
def dummy_generator():
    yield "This is a sentence. And here's another! Yet, "
    yield "there's more. This ends now."

for sentence in generate_sentences(dummy_generator()):
    print(sentence)
```
This will output:
```
This is a sentence.
And here's another!
Yet, there's more.
This ends now.
```
One main use case of this library is to enable fast text-to-speech synthesis from character feeds generated by large language models: it provides the fastest possible access to a complete sentence or sentence fragment (via the `quick_yield_single_sentence_fragment` flag), which can then be synthesized in real time. This usage is demonstrated in the `test_stream_from_llm.py` file in the `tests` directory.
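For illustration, here is a minimal sketch of that pattern. The simulated token stream and the `print` call stand in for a real LLM client and TTS engine; only the `generate_sentences` call and its flag come from this library.

```python
from stream2sentence import generate_sentences

# Simulated LLM output; in practice this would be the chunk/delta
# iterator returned by your LLM client.
def llm_token_stream():
    for token in ["The ", "weather ", "today ", "is ", "sunny, ",
                  "so ", "let's ", "go ", "outside. ", "Bring ", "a ", "hat."]:
        yield token

# quick_yield_single_sentence_fragment hands over the first usable
# fragment as soon as possible, minimizing time to first audio.
for fragment in generate_sentences(
    llm_token_stream(),
    quick_yield_single_sentence_fragment=True,
):
    print("Synthesizing:", fragment)  # replace with a call to your TTS engine
```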
## Configuration
The `generate_sentences()` function offers various parameters to fine-tune its behavior:
### Core Parameters
- `generator: Iterator[str]`
- The primary input source, yielding chunks of text to be processed.
- Can be any iterator that emits text chunks of any size.
- `context_size: int = 12`
- Number of characters considered for sentence boundary detection.
- Larger values improve accuracy but may increase latency.
- Default: 12 characters
- `context_size_look_overhead: int = 12`
- Additional characters to examine beyond `context_size` for sentence splitting.
- Enhances sentence detection accuracy.
- Default: 12 characters
- `minimum_sentence_length: int = 10`
- Minimum character count for a text chunk to be considered a sentence.
- Shorter fragments are buffered until this threshold is met.
- Default: 10 characters
- `minimum_first_fragment_length: int = 10`
- Minimum character count required for the first sentence fragment.
- Ensures the initial output meets a specified length threshold.
- Default: 10 characters
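As a rough sketch of how these core parameters might be combined (the input text and the chosen values are only illustrative):

```python
from stream2sentence import generate_sentences

def chunk_stream():
    yield "Dr. Smith arrived at 10 a.m. and greeted everyone. "
    yield "Then the meeting began. It ran long."

# A larger context window gives boundary detection more characters to
# work with (e.g. around abbreviations like "Dr."), at the cost of
# slightly later yields; short candidates are buffered until they reach
# minimum_sentence_length characters.
for sentence in generate_sentences(
    chunk_stream(),
    context_size=20,
    context_size_look_overhead=20,
    minimum_sentence_length=15,
    minimum_first_fragment_length=15,
):
    print(sentence)
```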
### Yield Control
These parameters control how quickly and frequently the generator yields sentence fragments:
- `quick_yield_single_sentence_fragment: bool = False`
- When True, yields the first fragment of the first sentence as quickly as possible.
- Useful for getting immediate output in real-time applications like speech synthesis.
- Default: False
- `quick_yield_for_all_sentences: bool = False`
- When True, yields the first fragment of every sentence as quickly as possible.
- Extends the quick yield behavior to all sentences, not just the first one.
- Automatically sets `quick_yield_single_sentence_fragment` to True.
- Default: False
- `quick_yield_every_fragment: bool = False`
- When True, yields every fragment of every sentence as quickly as possible.
- Provides the most granular output, yielding fragments as soon as they're detected.
- Automatically sets both `quick_yield_for_all_sentences` and `quick_yield_single_sentence_fragment` to True.
- Default: False
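A minimal sketch of the most aggressive setting, `quick_yield_every_fragment=True` (the input text is only illustrative):

```python
from stream2sentence import generate_sentences

def chunk_stream():
    yield "First we check the sensors, then we calibrate them, "
    yield "and finally we log the results. After that we stop."

# With quick_yield_every_fragment enabled, output is emitted at fragment
# delimiters (commas, colons, etc.) instead of waiting for full
# sentences, trading sentence completeness for lower latency.
for fragment in generate_sentences(
    chunk_stream(),
    quick_yield_every_fragment=True,
):
    print(repr(fragment))
```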
### Text Cleanup
- `cleanup_text_links: bool = False`
- When True, removes hyperlinks from the output sentences.
- Default: False
- `cleanup_text_emojis: bool = False`
- When True, removes emoji characters from the output sentences.
- Default: False
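For example, both cleanup flags can be enabled together (illustrative input):

```python
from stream2sentence import generate_sentences

def noisy_stream():
    yield "Check out https://example.com for details 😀. "
    yield "The docs explain the rest."

# Links and emojis are stripped from the yielded sentences.
for sentence in generate_sentences(
    noisy_stream(),
    cleanup_text_links=True,
    cleanup_text_emojis=True,
):
    print(sentence)
```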
### Tokenization
- `tokenize_sentences: Callable = None`
- Custom function for sentence tokenization.
- If None, uses the default tokenizer specified by `tokenizer`.
- Default: None
- `tokenizer: str = "nltk"`
- Specifies the tokenizer to use. Options: "nltk" or "stanza"
- Default: "nltk"
- `language: str = "en"`
- Language setting for the tokenizer.
- Use "en" for English or "multilingual" for Stanza tokenizer.
- Default: "en"
### Debugging and Fine-tuning
- `log_characters: bool = False`
- When True, logs each processed character to the console.
- Useful for debugging or monitoring real-time processing.
- Default: False
- `sentence_fragment_delimiters: str = ".?!;:,\n…)]}。-"`
- Characters considered as potential sentence fragment delimiters.
- Used for quick yielding of sentence fragments.
- Default: ".?!;:,\n…)]}。-"
- `full_sentence_delimiters: str = ".?!\n…。"`
- Characters considered as full sentence delimiters.
- Used for more definitive sentence boundary detection.
- Default: ".?!\n…。"
- `force_first_fragment_after_words: int = 15`
- Forces the yield of the first sentence fragment after this many words.
- Ensures timely output even with long opening sentences.
- Default: 15 words
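A sketch exercising the debugging and timing parameters above. The values are illustrative, and enabling `quick_yield_single_sentence_fragment` alongside them is an assumption about typical usage, not a requirement stated here.

```python
from stream2sentence import generate_sentences

def chunk_stream():
    yield "This opening sentence keeps going and going without any "
    yield "punctuation for quite a while before it finally ends."

# log_characters echoes every processed character to the console, and
# force_first_fragment_after_words caps how long the first fragment can
# be withheld even if no delimiter has shown up yet.
for fragment in generate_sentences(
    chunk_stream(),
    log_characters=True,
    quick_yield_single_sentence_fragment=True,
    force_first_fragment_after_words=10,
):
    print(fragment)
```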
## Contributing
Any contributions you make are welcome and **greatly appreciated**.
1. **Fork** the Project.
2. **Create** your Feature Branch (`git checkout -b feature/AmazingFeature`).
3. **Commit** your Changes (`git commit -m 'Add some AmazingFeature'`).
4. **Push** to the Branch (`git push origin feature/AmazingFeature`).
5. **Open** a Pull Request.
## License
This project is licensed under the MIT License. For more details, see the [`LICENSE`](LICENSE) file.
---
Project created and maintained by [Kolja Beigel](https://github.com/KoljaB).