# Automatic accentuation for texts in Russian
Accentuation is a common task in speech-related fields such as text-to-speech, speech recognition and language learning. This library marks stressed vowels in Russian text using data from Wiktionary and syntactic analysis from [spaCy](https://github.com/explosion/spaCy).
### Installation
Python 3.10 through 3.12 is supported.
```bash
pip install tsnorm
```
### General usage
```Python
from tsnorm import Normalizer
normalizer = Normalizer(stress_mark=chr(0x301), stress_mark_pos="after")
normalizer("Словно куклой в час ночной теперь он может управлять тобой")
# Output: Сло́вно ку́клой в час ночно́й тепе́рь он мо́жет управля́ть тобо́й
```
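The stress mark used above, `chr(0x301)`, is the Unicode combining acute accent: placed *after* a vowel, it renders as an accent over that vowel, which is why `stress_mark_pos="after"` is used with it. A quick check with the standard library:

```python
import unicodedata

# U+0301 is a combining character: it attaches to the preceding letter.
mark = chr(0x301)
print(unicodedata.name(mark))  # COMBINING ACUTE ACCENT
print("о" + mark)              # renders as о́
```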
### Change stress mark and its position
```Python
normalizer = Normalizer(stress_mark="+", stress_mark_pos="before")
normalizer("Трупы оживали, землю разрывали")
# Output: Тр+упы ожив+али, з+емлю разрыв+али
```
### Stress yo (Ё)
```Python
normalizer = Normalizer(stress_mark="+", stress_mark_pos="before", stress_yo=True)
normalizer("Погаснет день, луна проснётся, и снова зверь во мне очнётся")
# Output: Пог+аснет день, лун+а просн+ётся, и сн+ова зверь во мне очн+ётся
```
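In Russian, ё is always the stressed vowel of its word, so marking it is redundant unless you want uniform output; `stress_yo=True` makes the mark explicit. The effect can be sketched as follows (a toy illustration, not the library's implementation):

```python
# Toy sketch: insert the mark before every "ё"/"Ё", mimicking
# stress_yo=True with stress_mark_pos="before".
def mark_yo(text: str, mark: str = "+") -> str:
    out = []
    for ch in text:
        if ch in "ёЁ":
            out.append(mark)
        out.append(ch)
    return "".join(out)

print(mark_yo("проснётся"))  # просн+ётся
```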
### Stress monosyllabic words
```Python
normalizer = Normalizer(stress_mark="+", stress_mark_pos="before", stress_monosyllabic=True)
normalizer("Панки грязи не боятся, кто устал — ушёл сдаваться!")
# Output: П+анки гр+язи н+е бо+ятся, кт+о уст+ал — ушёл сдав+аться!
```
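A word is monosyllabic when it contains exactly one vowel, so a check like the following is enough to find the words affected by `stress_monosyllabic` (an assumed heuristic, not the library's actual code):

```python
# Russian vowels, both cases; one vowel implies one syllable.
RUSSIAN_VOWELS = set("аеёиоуыэюяАЕЁИОУЫЭЮЯ")

def is_monosyllabic(word: str) -> bool:
    return sum(ch in RUSSIAN_VOWELS for ch in word) == 1

print(is_monosyllabic("час"))    # True
print(is_monosyllabic("панки"))  # False
```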
### Change minimum length of words to be stressed
```Python
normalizer = Normalizer(stress_mark="+", stress_mark_pos="before", stress_monosyllabic=True)
normalizer("Разбежавшись, прыгну со скалы, вот я был и вот меня не стало")
# Output: Разбеж+авшись, пр+ыгну с+о скал+ы, в+от +я б+ыл +и в+от мен+я н+е ст+ало
normalizer = Normalizer(stress_mark="+", stress_mark_pos="before", stress_monosyllabic=True, min_word_len=2)
normalizer("Разбежавшись, прыгну со скалы, вот я был и вот меня не стало")
# Output: Разбеж+авшись, пр+ыгну с+о скал+ы, в+от я б+ыл и в+от мен+я н+е ст+ало
normalizer = Normalizer(stress_mark="+", stress_mark_pos="before", stress_monosyllabic=True, min_word_len=3)
normalizer("Разбежавшись, прыгну со скалы, вот я был и вот меня не стало")
# Output: Разбеж+авшись, пр+ыгну со скал+ы, в+от я б+ыл и в+от мен+я не ст+ало
```
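Judging by the examples above, `min_word_len` skips any word shorter than the given length ("я" and "и" drop out at 2; "со" and "не" drop out at 3). That filter can be mimicked like this (a sketch under those assumed semantics):

```python
# Assumed semantics: words shorter than min_word_len are left unstressed.
def should_stress(word: str, min_word_len: int = 1) -> bool:
    return len(word) >= min_word_len

words = ["со", "я", "был", "и"]
print([w for w in words if should_stress(w, min_word_len=2)])  # ['со', 'был']
```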
### Expand normalizer dictionary
```Python
from tsnorm import Normalizer, CustomDictionary, WordForm, Lemma, WordFormTags, LemmaPOS
normalizer = Normalizer("+", "before")
normalizer("Охотник Себастьян, что спал на чердаке")
# Output: Ох+отник Себастьян, что спал на чердак+е
dictionary = CustomDictionary(
word_forms=[
WordForm("Себастьян", 7, WordFormTags(singular=True, nominative=True), "Себастьян")
],
lemmas=[
Lemma("Себастьян", LemmaPOS(PNOUN=True))
]
)
normalizer.update_dictionary(dictionary)
normalizer("Охотник Себастьян, что спал на чердаке")
# Output: Ох+отник Себасть+ян, что спал на чердак+е
```
It's also possible to pass a `CustomDictionary` at normalizer initialization:
```Python
normalizer = Normalizer("+", "before", custom_dictionary=dictionary)
```
To add custom words to the normalizer's dictionary, pass two lists to `CustomDictionary`:
1. a list of `WordForm` objects — the forms of each word, with case, tense and lemma information as well as the positions of the stressed letters;
2. a list of `Lemma` objects — records of lemmas with their parts of speech.
Parts of speech for lemmas are configured with the `LemmaPOS` class, which stores [universal POS tags](https://universaldependencies.org/u/pos/).
## Acknowledgement
This library is based on code by @einhornus from his [article](https://habr.com/ru/articles/575100/) on Habr.