stream2sentence


- Name: stream2sentence
- Version: 0.2.3
- Home page: https://github.com/KoljaB/stream2sentence
- Summary: Real-time processing and delivery of sentences from a continuous stream of characters or text chunks.
- Upload time: 2024-03-21 19:21:02
- Author: Kolja Beigel
- Requires Python: >=3.6
- Keywords: realtime, text streaming, stream, sentence, sentence detection, sentence generation, tts, speech synthesis, nltk, text analysis, audio processing, boundary detection, sentence boundary detection

# Real-Time Sentence Detection

Real-time processing and delivery of sentences from a continuous stream of characters or text chunks.

## Table of Contents

- [Features](#features)
- [Installation](#installation)
- [Usage](#usage)
- [Configuration](#configuration)
- [Contributing](#contributing)
- [License](#license)

## Features

- Generates sentences from a stream of text in real-time.
- Customizable to fine-tune the balance between speed and reliability.
- Option to clean the output by removing links and emojis from the detected sentences.
- Easy to configure and integrate.

## Installation

```bash
pip install stream2sentence
```

## Usage

Pass a generator of characters or text chunks to `generate_sentences()` to get a generator of sentences in return.

Here's a basic example:

```python
from stream2sentence import generate_sentences

# Dummy generator for demonstration
def dummy_generator():
    yield "This is a sentence. And here's another! Yet, "
    yield "there's more. This ends now."

for sentence in generate_sentences(dummy_generator()):
    print(sentence)
```

This will output:
```
This is a sentence.
And here's another!
Yet, there's more.
This ends now.
```

One main use case of this library is to enable fast text-to-speech synthesis from character feeds generated by large language models. The library provides the fastest possible access to a complete sentence or, when the `quick_yield_single_sentence_fragment` flag is set, a sentence fragment, which can then be synthesized in real time. This usage is demonstrated in the `test_stream_from_llm.py` file in the `tests` directory.
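The LLM-to-speech flow described above can be sketched as follows. This is a minimal sketch under stated assumptions: `fake_llm_stream`, `speak_all`, and `demo` are hypothetical names introduced for illustration, and `print` stands in for a real text-to-speech engine. Only `generate_sentences()` and the `quick_yield_single_sentence_fragment` flag come from this library.

```python
import time
from typing import Callable, Iterable, Iterator


def fake_llm_stream(text: str, chunk_size: int = 4, delay: float = 0.0) -> Iterator[str]:
    """Hypothetical stand-in for an LLM token stream: yields small text chunks."""
    for start in range(0, len(text), chunk_size):
        if delay:
            time.sleep(delay)  # simulate generation latency between chunks
        yield text[start:start + chunk_size]


def speak_all(sentences: Iterable[str], synthesize: Callable[[str], None]) -> int:
    """Hand each sentence to a TTS callback as soon as it arrives; return the count."""
    count = 0
    for sentence in sentences:
        synthesize(sentence)
        count += 1
    return count


def demo() -> None:
    """Requires `pip install stream2sentence`; not executed automatically."""
    from stream2sentence import generate_sentences

    chunks = fake_llm_stream("Hello there, how are you today? I am fine. Thanks for asking!")
    # The flag asks for an early first fragment instead of waiting for a full sentence.
    sentences = generate_sentences(chunks, quick_yield_single_sentence_fragment=True)
    speak_all(sentences, synthesize=print)  # print stands in for a real TTS engine
```

Because `speak_all` consumes the sentence generator lazily, synthesis of the first sentence can start while later chunks are still being produced.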

## Configuration

The `generate_sentences()` function has the following parameters:

- `generator`: Input character generator.
  Iterator that emits chunks of text. These chunks can be of any size, and they'll be processed one by one to extract sentences from them. It forms the primary source from which the function reads and generates sentences.

- `context_size`: Context size for sentence detection.  
  This controls how much context is looked at to detect sentence boundaries. It determines the number of characters around a potential delimiter (like a period) that are considered when detecting sentence boundaries. A larger context size allows more reliable sentence boundary detection, but requires buffering more characters before emitting a sentence.  
  Default is 12 characters. Increasing this can help detect sentences more accurately, at the cost of added latency.

- `minimum_sentence_length`: Minimum length of a sentence to be detected.  
  Specifies the minimum number of characters a chunk of text should have before it's considered a potential sentence. This ensures that very short sequences of characters are not mistakenly identified as sentences. Shorter fragments are ignored and kept in the buffer.  
  Default is 10 characters. Increasing this avoids emitting very short sentence fragments, at the cost of potentially missing some sentences.

- `minimum_first_fragment_length`: The minimum number of characters required for the first sentence fragment before yielding.
  This parameter sets a threshold for the length of the initial fragment of text that the function will yield as a sentence. If the first chunk of text does not meet this length requirement, it will be buffered until additional text is received to meet or exceed this threshold. This is important for ensuring the first output is long enough, e.g. to ensure a quality synthesis for text-to-speech applications.
  Default is 10 characters. Set this according to the needs of the application, balancing between the immediacy of output and the completeness of the text fragment.
  
- `quick_yield_single_sentence_fragment`: Yield a sentence fragment quickly for real-time applications.
  When set to True, this option allows the function to quickly yield a sentence fragment as soon as it identifies a potential sentence delimiter, without waiting for further context. This is useful for applications like real-time speech synthesis where there's a need for immediate feedback even if the entire sentence isn't complete. 
  Default is False. Set to True for faster but potentially less accurate sentence yields.

- `cleanup_text_links`: Option to remove links from the output sentences.  
  When set to True, this option enables the function to identify and remove HTTP/HTTPS hyperlinks from the emitted output sentences. This helps clean up the output by avoiding unnecessary links.  
  Default is False. Set to True if links are not required in the output.

- `cleanup_text_emojis`: Option to remove emojis from the output sentences.  
  If True, any Unicode emoji characters are identified and removed from the emitted output sentences. This can help to clean up the output.  
  Default is False. Set to True if emojis are not required in the output.

- `tokenize_sentences`: (Optional) Function for sentence tokenization. Default is None.

- `tokenizer`: Specifies the tokenizer to use ('nltk' or 'stanza'). Default is 'nltk'.

- `language`: Language setting for the tokenizer ('en' or 'multilingual' for stanza). Default is 'en'.

- `sentence_fragment_delimiters`: Characters treated as sentence delimiters when yielding quick fragments.

- `force_first_fragment_after_words`: Forces the first sentence fragment to yield after a specified number of words. Default is 10 words.

- `log_characters`: Option to log characters to the console.
  When enabled, each character processed by the function is printed to the console. This is mainly useful for debugging, to observe the flow of characters through the function.
  Default is False. Set to True for a visual representation of the characters being processed, for example to print LLM output to the console while stream2sentence prepares the input for text-to-speech synthesis.
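To illustrate the trade-off behind `context_size` and `minimum_sentence_length`, here is a toy buffering splitter. It is a simplified sketch, not the library's implementation (which relies on a tokenizer such as NLTK or Stanza); `toy_sentences`, `_find_boundary`, and `DELIMITERS` are hypothetical names, and the rule "a delimiter followed by whitespace ends a sentence" is an assumption made for this example.

```python
from typing import Iterable, Iterator, Optional

DELIMITERS = ".!?"  # assumption: simplified delimiter set for this sketch


def _find_boundary(buffer: str, context_size: int, minimum_sentence_length: int) -> Optional[int]:
    """Return the index just past the first confirmed sentence end, or None."""
    lookahead = max(context_size, 1)  # need at least one character after the delimiter
    for index, char in enumerate(buffer):
        if char not in DELIMITERS:
            continue
        end = index + 1
        if end < minimum_sentence_length:
            continue  # too short to count as a sentence: keep scanning
        if len(buffer) - end < lookahead:
            return None  # not enough following context yet: keep buffering
        if buffer[end].isspace():
            return end  # delimiter followed by whitespace: treat as a boundary
    return None


def toy_sentences(chunks: Iterable[str], context_size: int = 12,
                  minimum_sentence_length: int = 10) -> Iterator[str]:
    """Yield sentences once enough trailing context has been buffered."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        while True:
            cut = _find_boundary(buffer, context_size, minimum_sentence_length)
            if cut is None:
                break
            yield buffer[:cut].strip()
            buffer = buffer[cut:].lstrip()
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains when the stream ends


chunks = ["This is a sentence. And here's another! Yet, ", "there's more. This ends now."]
result = list(toy_sentences(chunks))
```

In this sketch a sentence is only emitted once `context_size` characters have arrived after its delimiter, which is why a larger value adds latency, while the whitespace check keeps a string like `3.14` from being split.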


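The effect of `cleanup_text_links` and `cleanup_text_emojis` can be approximated with regular expressions. The patterns below are assumptions chosen for illustration and may differ from what the library uses internally; `clean_sentence`, `LINK_PATTERN`, and `EMOJI_PATTERN` are hypothetical names.

```python
import re

# Assumption: simplified patterns for illustration, not the library's own.
LINK_PATTERN = re.compile(r"https?://\S+")  # also swallows punctuation glued to the URL
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, supplemental symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F"                 # variation selector often attached to emojis
    "]+"
)


def clean_sentence(sentence: str, remove_links: bool = True, remove_emojis: bool = True) -> str:
    """Strip links and/or emojis, then collapse the whitespace left behind."""
    if remove_links:
        sentence = LINK_PATTERN.sub("", sentence)
    if remove_emojis:
        sentence = EMOJI_PATTERN.sub("", sentence)
    return " ".join(sentence.split())


cleaned = clean_sentence("See https://example.com for details \U0001F600 now.")
```

Cleaning each sentence after detection, as here, keeps links and emojis out of the text that is handed to a speech engine.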
## Contributing

Any contributions you make are welcome and **greatly appreciated**.

1. **Fork** the Project.
2. **Create** your Feature Branch (`git checkout -b feature/AmazingFeature`).
3. **Commit** your Changes (`git commit -m 'Add some AmazingFeature'`).
4. **Push** to the Branch (`git push origin feature/AmazingFeature`).
5. **Open** a Pull Request.

## License

This project is licensed under the MIT License. For more details, see the [`LICENSE`](LICENSE) file.

---

Project created and maintained by [Kolja Beigel](https://github.com/KoljaB).

            
