ukrainian-word-stress

Name	ukrainian-word-stress
Version	1.1.0
home_page	https://github.com/lang-uk/ukrainian-word-stress
Summary	Find word stress for texts in Ukrainian
upload_time	2024-02-05 18:45:01
maintainer
docs_url	None
author	Oleksiy Syvokon
requires_python
license	MIT
keywords	ukrainian nlp word stress accents dictionary linguistics
VCS
bugtrack_url
requirements	stanza marisa-trie

Ukrainian word stress
=====================

Word stress is the emphasis we place on a particular syllable of a word as
we pronounce it, for example: ма́ма.

This package takes text in Ukrainian and adds a stress mark after each stressed
vowel. This is useful in speech synthesis applications and for preparing text
for language learners.
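
To make the output format concrete, here is a minimal sketch of what "a stress mark after the stressed vowel" means at the string level. It is only an illustration, not the package's implementation: the helper `mark_stress` and the vowel set are hypothetical names introduced here, and the caller is assumed to already know which syllable is stressed.

```python
# Hypothetical helper for illustration; not part of ukrainian-word-stress.
UKRAINIAN_VOWELS = set("аеєиіїоуюяАЕЄИІЇОУЮЯ")


def mark_stress(word: str, syllable: int, symbol: str = "\u00b4") -> str:
    """Insert `symbol` right after the vowel of the given 1-based syllable."""
    seen = 0
    for index, char in enumerate(word):
        if char in UKRAINIAN_VOWELS:
            seen += 1
            if seen == syllable:
                # The mark goes after the vowel, as in the package output.
                return word[: index + 1] + symbol + word[index + 1 :]
    raise ValueError(f"{word!r} has fewer than {syllable} syllables")


print(mark_stress("мама", 1))
print(mark_stress("платформу", 2, symbol="+"))
```

The package itself decides which syllable to mark; this sketch only shows where the symbol ends up once that decision is made.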


## Example


### From Python

```python
>>> from ukrainian_word_stress import Stressifier
>>> text = """Потяг зупинився, ми зійшли на платформу. Було тихо, широкі навскісні промені золотили повітря, заважаючи бачити речі такими, якими вони були. Третя по обіді. Жодноі живоі душі. Найкращий час для урочистих відвідин померлих. Взяли в привокзальному торбу вина, рушили вздовж колій, піщаною стежкою."""
>>> stressify = Stressifier()
>>> stressify(text)

'Потяг зупини´вся, ми зійшли´ на платфо´рму. Було´ ти´хо, широ´кі навскі´сні
про´мені золоти´ли пові´тря, заважа´ючи ба´чити ре´чі таки´ми, яки´ми вони´
були´. Тре´тя по обі´ді. Жодноі живоі душі´. Найкра´щий час для урочи´стих
відві´дин поме´рлих. Взя´ли в привокза´льному то´рбу вина, ру´шили вздовж
ко´лій, піща´ною сте´жкою.'

```

The `ukrainian_word_stress.Stressifier` class has optional arguments for
fine-grained configuration (see the sections below). For example:

```python
>>> from ukrainian_word_stress import Stressifier, StressSymbol
>>> stressify = Stressifier(stress_symbol=StressSymbol.CombiningAcuteAccent)
>>> stressify(text)

'Потяг зупини́вся, ми зійшли́ на платфо́рму. Було́ ти́хо, широ́кі навскі́сні про́мені
золоти́ли пові́тря, заважа́ючи ба́чити ре́чі таки́ми, яки́ми вони́ були́. Тре́тя по
обі́ді. Жодноі живоі душі́. Найкра́щий час для урочи́стих відві́дин поме́рлих. Взя́ли
в привокза́льному то́рбу вина, ру́шили вздовж ко́лій, піща́ною сте́жкою.'
```


### From the command line

```bash
$ echo 'Золоті яйця, але нема ні яйця' | ukrainian-word-stress
Золоті´ я´йця, але´ нема´ ні яйця´
```


## Setup

```bash
$ pip install ukrainian-word-stress
```

Note that the first call downloads around 500 MB of Stanza resources.
The default location for them is `~/stanza_resources`.
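
If you want to know in advance whether that download has already happened, a quick check of the default directory can help. The sketch below uses only the standard library; the function name `stanza_resources_present` is made up for this example, and it assumes the resources live in the default location mentioned above.

```python
from pathlib import Path


def stanza_resources_present(base: Path = Path.home() / "stanza_resources") -> bool:
    """Return True if the resources directory exists and is not empty."""
    # Assumption: resources are stored in the default location from this README.
    return base.is_dir() and any(base.iterdir())


if stanza_resources_present():
    print("Stanza resources directory found; the first call should not need a full download.")
else:
    print("No Stanza resources found; expect a large download on the first call.")
```

This only checks that the directory is non-empty. It does not verify that the Ukrainian models specifically are present.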


## Handling ambiguity

Some words have different pronunciations and meanings but share the same spelling.
These are so-called [heteronyms][1].

In most cases, this happens when the same spelling represents different
grammatical forms of a word (singular/plural, case). For example:

* блохи́ - genitive singular ("немає ані блохи́", "there is not a single flea")
* бло́хи - nominative plural ("повсюди були бло́хи", "there were fleas everywhere")

We handle this more or less correctly by running morphological and
part-of-speech (POS) analysis on the text with Stanza.

A much smaller category of heteronyms is where words have completely different meanings:

* а́тлас - an atlas (a collection of maps)
* атла́с - satin (a fabric)

Resolving this is much harder and sometimes impossible.

There's no ideal solution to heteronym ambiguity. We let you decide what to
do in such cases. Possible strategies are:

* `skip`: do not place stress at all (this is the default).

* `all`: return all possible options at once. This appears as multiple
  stress symbols in one word (за´мо´к).

* `first`: place the stress of the first match, with a high chance of being
  incorrect. Essentially, this is a random guess at the heteronym's meaning.

The strategy can be configured via the `--on-ambiguity` parameter of the
command-line utility. In Python, use the `on_ambiguity` parameter of the
`ukrainian_word_stress.Stressifier` class.
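
The difference between the three strategies can be sketched on a toy example. The code below is not how the package works internally: `apply_stress` is a hypothetical function, and the candidate positions are supplied by hand instead of coming from a dictionary and a Stanza parse. It only mirrors the behaviour described in the list above.

```python
ACUTE = "\u00b4"  # the default stress symbol described in this README


def apply_stress(word, positions, on_ambiguity="skip", symbol=ACUTE):
    """Mark stress in `word` given candidate stressed-vowel indexes.

    `positions` holds one character index per possible reading (hypothetical
    input; the real package derives this from its dictionary).
    """
    candidates = sorted(set(positions))
    if len(candidates) > 1:  # more than one reading: the word is ambiguous
        if on_ambiguity == "skip":
            return word
        if on_ambiguity == "first":
            candidates = [positions[0]]
        elif on_ambiguity != "all":
            raise ValueError(f"unknown strategy: {on_ambiguity!r}")
    # Insert from right to left so that earlier indexes stay valid.
    for position in sorted(candidates, reverse=True):
        word = word[: position + 1] + symbol + word[position + 1 :]
    return word


for strategy in ("skip", "all", "first"):
    # "замок" has two readings: stress on "а" (index 1) or on "о" (index 3).
    print(strategy, apply_stress("замок", [1, 3], on_ambiguity=strategy))
```

With `skip` the word is returned unchanged, with `all` both marks appear, and with `first` only the first listed reading is marked.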


## Stress mark symbols

By default, the Unicode Acute Accent symbol is used: “´” (U+00B4).

In print, the Combining Acute Accent (U+0301) is more common and visually less intrusive.
It can be turned on by passing `--symbol=combining` to the CLI utility,
or `stress_symbol=StressSymbol.CombiningAcuteAccent` to the `Stressifier` class.

Note that some platforms (Windows, for example) render it incorrectly.

You can also pass custom characters in place of these two:

```bash
$ echo 'олені небриті і не голені.' | ukrainian-word-stress --symbol +
о+лені небри+ті і не го+лені.

$ echo 'олені небриті і не голені.' | ukrainian-word-stress --symbol combining
о́лені небри́ті і не го́лені.
```
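
If text has already been processed with the default symbol, it can be converted afterwards with plain string replacement. The following is a small standard-library sketch; the function names are made up for this example and are not part of the package.

```python
ACUTE = "\u00b4"            # spacing acute accent, the default mark
COMBINING_ACUTE = "\u0301"  # combining acute accent


def to_combining(text: str) -> str:
    """Replace the default spacing mark with the combining accent."""
    return text.replace(ACUTE, COMBINING_ACUTE)


def strip_stress(text: str) -> str:
    """Remove both kinds of stress marks, restoring plain text."""
    return text.replace(ACUTE, "").replace(COMBINING_ACUTE, "")


marked = "о\u00b4лені небри\u00b4ті і не го\u00b4лені."
print(to_combining(marked))
print(strip_stress(marked))
```

Note that `strip_stress` removes every occurrence of these two characters, including any that were present in the input before stress marks were added.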


## Variative stress

Some words allow for multiple stress positions. For example,
по́милка and поми́лка are both acceptable. For such words we return
double stress:

```bash
$ echo помилка | ukrainian-word-stress
по´ми´лка
```
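
Downstream code such as a speech synthesizer usually needs exactly one stress per word. A double-stressed word can be expanded into its single-stress variants as sketched below; `stress_variants` is a hypothetical helper written for this example, assuming the default `´` symbol.

```python
ACUTE = "\u00b4"  # default stress symbol


def stress_variants(word: str, symbol: str = ACUTE) -> list[str]:
    """Return one copy of `word` per stress mark, each keeping a single mark."""
    marks = [index for index, char in enumerate(word) if char == symbol]
    variants = []
    for keep in marks:
        # Drop every stress symbol except the one at position `keep`.
        variants.append(
            "".join(
                char
                for index, char in enumerate(word)
                if char != symbol or index == keep
            )
        )
    return variants


print(stress_variants("по\u00b4ми\u00b4лка"))
```

For a word with a single mark the list contains just that word, and for a word with no marks it is empty.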




## Debugging and reporting issues

Use the `--verbose` switch to get info useful for debugging.

If you believe that you have found a bug, please open a [GitHub issue](https://github.com/lang-uk/ukrainian-word-stress/issues).

But first, make sure that the bug is not related to heteronym disambiguation.
For example, if you see that some word lacks an accent, add the `--on-ambiguity=all`
switch to see whether it is a heteronym. If the word in question then has
multiple accents, that's a heteronym, not a bug:

```bash
$ echo замок | ukrainian-word-stress --on-ambiguity=all
за´мо´к
```
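
Before rerunning with `--on-ambiguity=all`, it can help to list which words were left without a mark. The sketch below is a rough standard-library filter, not part of the package: it assumes, based on the examples in this README, that one-syllable words are never marked, so only words with two or more vowels and no stress symbol are reported. The function name and punctuation set are choices made for this example.

```python
VOWELS = set("аеєиіїоуюяАЕЄИІЇОУЮЯ")
STRESS_MARKS = {"\u00b4", "\u0301"}  # spacing and combining acute accents
PUNCTUATION = ".,;:!?\"'()«»"


def unstressed_candidates(text: str) -> list[str]:
    """List multi-syllable words in stressed output that carry no stress mark."""
    result = []
    for token in text.split():
        word = token.strip(PUNCTUATION)
        vowel_count = sum(char in VOWELS for char in word)
        has_mark = any(char in STRESS_MARKS for char in word)
        if vowel_count > 1 and not has_mark:
            result.append(word)
    return result


output = "Потяг зупини\u00b4вся, ми зійшли\u00b4 на платфо\u00b4рму."
print(unstressed_candidates(output))
```

Words reported this way are only candidates: they may be skipped heteronyms, or they may have been left unmarked for another reason.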


## More docs

* [Dictionary format](./docs/dictionary_format.md)


[1]: https://en.wikipedia.org/wiki/Heteronym_(linguistics)

Raw data

{
    "_id": null,
    "home_page": "https://github.com/lang-uk/ukrainian-word-stress",
    "name": "ukrainian-word-stress",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "keywords": "ukrainian nlp word stress accents dictionary linguistics",
    "author": "Oleksiy Syvokon",
    "author_email": "oleksiy.syvokon@gmail.com",
    "download_url": "",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Find word stress for texts in Ukrainian",
    "version": "1.1.0",
    "project_urls": {
        "Homepage": "https://github.com/lang-uk/ukrainian-word-stress"
    },
    "split_keywords": [
        "ukrainian",
        "nlp",
        "word",
        "stress",
        "accents",
        "dictionary",
        "linguistics"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "d2adbda6e30745c09d6a6036e906afb2027779da6fb0f4bb89c0a1b03a98821b",
                "md5": "e2a201233e25b0f2d92418d24b33e4ef",
                "sha256": "be6549ad9956530a8b13d0bd7a64d9ef9c41f688e63bb3d842e0769f828a06b0"
            },
            "downloads": -1,
            "filename": "ukrainian_word_stress-1.1.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "e2a201233e25b0f2d92418d24b33e4ef",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 7070056,
            "upload_time": "2024-02-05T18:45:01",
            "upload_time_iso_8601": "2024-02-05T18:45:01.913855Z",
            "url": "https://files.pythonhosted.org/packages/d2/ad/bda6e30745c09d6a6036e906afb2027779da6fb0f4bb89c0a1b03a98821b/ukrainian_word_stress-1.1.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-02-05 18:45:01",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "lang-uk",
    "github_project": "ukrainian-word-stress",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [
        {
            "name": "stanza",
            "specs": [
                [
                    "==",
                    "1.7.0"
                ]
            ]
        },
        {
            "name": "marisa-trie",
            "specs": [
                [
                    "==",
                    "1.1.0"
                ]
            ]
        }
    ],
    "lcname": "ukrainian-word-stress"
}
