tsnorm

Name: tsnorm
Version: 1.1.0
Home page: https://github.com/NetherQuartz/TextForSpeechNormalizer
Summary: A library to put stress marks in Russian text
Upload time: 2023-07-31 00:33:42
Author: Vladimir Larkin
Requires Python: >=3.10
Keywords: nlp, accentuation, wiktionary, russian-language, setuptools, spacy
# Automatic accentuation for texts in Russian

Accentuation is a common task in speech-related fields such as text-to-speech, speech recognition, and language learning. This library marks stressed vowels in a given text using data from Wiktionary and syntactic analysis from [spaCy](https://github.com/explosion/spaCy).

### Installation
Python 3.10 and above is supported.
```bash
pip install tsnorm
```

### General usage
```Python
from tsnorm import Normalizer


normalizer = Normalizer(stress_mark=chr(0x301), stress_mark_pos="after")
normalizer("Словно куклой в час ночной теперь он может управлять тобой")

# Output: Сло́вно ку́клой в час ночно́й тепе́рь он мо́жет управля́ть тобо́й
```
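
Here `chr(0x301)` is U+0301 COMBINING ACUTE ACCENT, which is why it is placed *after* the vowel: the character combines with the preceding letter and renders as an accent above it. A small plain-Python snippet (independent of tsnorm) showing this, and how to strip the marks again:
```Python
import unicodedata

# U+0301 combines with the preceding character when rendered
print(unicodedata.name(chr(0x301)))  # COMBINING ACUTE ACCENT

# Decompose, then drop the combining accents to recover the plain text
stressed = "Сло́вно ку́клой в час ночно́й"
plain = "".join(ch for ch in unicodedata.normalize("NFD", stressed) if ch != chr(0x301))
print(plain)  # Словно куклой в час ночной
```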

### Change stress mark and its position
```Python
normalizer = Normalizer(stress_mark="+", stress_mark_pos="before")
normalizer("Трупы оживали, землю разрывали")

# Output: Тр+упы ожив+али, з+емлю разрыв+али
```

### Stress yo (Ё)
Since ё is almost always the stressed vowel in a Russian word, it is left unmarked by default; pass `stress_yo=True` to mark it as well.
```Python
normalizer = Normalizer(stress_mark="+", stress_mark_pos="before", stress_yo=True)
normalizer("Погаснет день, луна проснётся, и снова зверь во мне очнётся")

# Output: Пог+аснет день, лун+а просн+ётся, и сн+ова зверь во мне очн+ётся
```

### Stress monosyllabic words
Monosyllabic words are left unmarked by default; `stress_monosyllabic=True` stresses them too.
```Python
normalizer = Normalizer(stress_mark="+", stress_mark_pos="before", stress_monosyllabic=True)
normalizer("Панки грязи не боятся, кто устал — ушёл сдаваться!")

# Output: П+анки гр+язи н+е бо+ятся, кт+о уст+ал — ушёл сдав+аться!
```

### Change minimum length of words to be stressed
`min_word_len` sets the minimum length, in letters, for a word to receive a stress mark; shorter words are left untouched.
```Python
normalizer = Normalizer(stress_mark="+", stress_mark_pos="before", stress_monosyllabic=True)
normalizer("Разбежавшись, прыгну со скалы, вот я был и вот меня не стало")

# Output: Разбеж+авшись, пр+ыгну с+о скал+ы, в+от +я б+ыл +и в+от мен+я н+е ст+ало


normalizer = Normalizer(stress_mark="+", stress_mark_pos="before", stress_monosyllabic=True, min_word_len=2)
normalizer("Разбежавшись, прыгну со скалы, вот я был и вот меня не стало")

# Output: Разбеж+авшись, пр+ыгну с+о скал+ы, в+от я б+ыл и в+от мен+я н+е ст+ало


normalizer = Normalizer(stress_mark="+", stress_mark_pos="before", stress_monosyllabic=True, min_word_len=3)
normalizer("Разбежавшись, прыгну со скалы, вот я был и вот меня не стало")

# Output: Разбеж+авшись, пр+ыгну со скал+ы, в+от я б+ыл и в+от мен+я не ст+ало
```

### Expand the normalizer dictionary

```Python
from tsnorm import Normalizer, CustomDictionary, WordForm, Lemma, WordFormTags, LemmaPOS


normalizer = Normalizer("+", "before")

normalizer("Охотник Себастьян, что спал на чердаке")
# Output: Ох+отник Себастьян, что спал на чердак+е

dictionary = CustomDictionary(
    word_forms=[
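        # WordForm(form, stress position, tags, lemma); judging by the
        # output below, 7 is the zero-based index of the stressed "я"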
        WordForm("Себастьян", 7, WordFormTags(singular=True, nominative=True), "Себастьян")
    ],
    lemmas=[
        Lemma("Себастьян", LemmaPOS(PNOUN=True))
    ]
)

normalizer.update_dictionary(dictionary)

normalizer("Охотник Себастьян, что спал на чердаке")
# Output: Ох+отник Себасть+ян, что спал на чердак+е
```

It's also possible to pass a `CustomDictionary` at normalizer initialization:
```Python
normalizer = Normalizer("+", "before", custom_dictionary=dictionary)
```

To add custom words to the normalizer's dictionary, pass two lists to `CustomDictionary`:
1. a list of `WordForm` objects, one per word form, each carrying case, tense, and lemma information as well as the position of the stressed letter,
2. a list of `Lemma` objects, which record each lemma together with its part of speech.

Parts of speech for lemmas are configured with the `LemmaPOS` class, which holds [universal POS tags](https://universaldependencies.org/u/pos/).
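
Putting this together, adding another proper noun would presumably follow the same pattern as the Себастьян example above. The sketch below rests on two assumptions inferred from that example rather than documented API: the second `WordForm` argument is the zero-based index of the stressed letter, and `PNOUN` is the flag for proper nouns (it is what the example uses, although the linked inventory names the tag `PROPN`):
```Python
from tsnorm import CustomDictionary, WordForm, Lemma, WordFormTags, LemmaPOS

# A sketch following the Себастьян example; argument meanings are inferred
# from that example, not from documented API.
extra = CustomDictionary(
    word_forms=[
        # "Гэндальф": the stressed letter "э" sits at zero-based index 1
        WordForm("Гэндальф", 1, WordFormTags(singular=True, nominative=True), "Гэндальф")
    ],
    lemmas=[
        Lemma("Гэндальф", LemmaPOS(PNOUN=True))
    ]
)

normalizer.update_dictionary(extra)
```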

## Acknowledgement

This library is based on code by @einhornus from his [article](https://habr.com/ru/articles/575100/) on Habr.

            
