gruut

Name	gruut
Version	2.4.0
home_page	https://github.com/rhasspy/gruut
Summary	A tokenizer, text cleaner, and phonemizer for many human languages.
upload_time	2024-07-03 15:40:55
maintainer	None
docs_url	None
author	Michael Hansen
requires_python	>=3.6
license	None
keywords
VCS
bugtrack_url
requirements	Babel dateparser gruut-ipa gruut_lang_en jsonlines networkx num2words numpy python-crfsuite
Travis-CI	No Travis.
coveralls test coverage	No coveralls.

# Gruut

A tokenizer, text cleaner, and [IPA](https://en.wikipedia.org/wiki/International_Phonetic_Alphabet) phonemizer for several human languages that supports [SSML](#ssml).

```python
from gruut import sentences

text = 'He wound it around the wound, saying "I read it was $10 to read."'

for sent in sentences(text, lang="en-us"):
    for word in sent:
        if word.phonemes:
            print(word.text, *word.phonemes)
```

which outputs:

```
He h ˈi
wound w ˈaʊ n d
it ˈɪ t
around ɚ ˈaʊ n d
the ð ə
wound w ˈu n d
, |
saying s ˈeɪ ɪ ŋ
I ˈaɪ
read ɹ ˈɛ d
it ˈɪ t
was w ə z
ten t ˈɛ n
dollars d ˈɑ l ɚ z
to t ə
read ɹ ˈi d
. ‖
```

Note that "wound" and "read" have different pronunciations when used in different (grammatical) contexts.

A [subset of SSML](#ssml) is also supported:

```python
from gruut import sentences

ssml_text = """<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
    xml:lang="en-US">
<s>Today at 4pm, 2/1/2000.</s>
<s xml:lang="it">Un mese fà, 2/1/2000.</s>
</speak>"""

for sent in sentences(ssml_text, ssml=True):
    for word in sent:
        if word.phonemes:
            print(sent.idx, word.lang, word.text, *word.phonemes)
```

with the output:

```
0 en-US Today t ə d ˈeɪ
0 en-US at ˈæ t
0 en-US four f ˈɔ ɹ
0 en-US P p ˈi
0 en-US M ˈɛ m
0 en-US , |
0 en-US February f ˈɛ b j u ˌɛ ɹ i
0 en-US first f ˈɚ s t
0 en-US , |
0 en-US two t ˈu
0 en-US thousand θ ˈaʊ z ə n d
0 en-US . ‖
1 it Un u n
1 it mese ˈm e s e
1 it fà f a
1 it , |
1 it due d j u
1 it gennaio d͡ʒ e n n ˈa j o
1 it duemila d u e ˈm i l a
1 it . ‖
```

See [the documentation](https://rhasspy.github.io/gruut/) for more details.

## Installation

```sh
pip install gruut
```

Languages besides English can be added during installation. For example, with French and Italian support:

```sh
pip install -f 'https://synesthesiam.github.io/prebuilt-apps/' gruut[fr,it]
```

The extra pip repo is needed for an updated [num2words fork](https://github.com/rhasspy/num2words) that includes support for more languages.

You may also [manually download language files](https://github.com/rhasspy/gruut/releases/latest) and put them in `$XDG_CONFIG_HOME/gruut/` (`$HOME/.config/gruut` by default).

gruut will look for language files in the directory `$XDG_CONFIG_HOME/gruut/<lang>/` if the corresponding Python package is not installed. Note that `<lang>` here is the **full** language name, e.g. `de-de` instead of just `de`. 
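
The lookup location described above can be sketched in a few lines of Python. This is only an illustration of the path convention; the helper name `gruut_lang_dir` is made up here and is not part of gruut's API.

```python
import os
from pathlib import Path


def gruut_lang_dir(lang: str) -> Path:
    """Return the directory where manually downloaded files for `lang` would go.

    Hypothetical helper: it only mirrors the path convention described above.
    """
    # XDG_CONFIG_HOME falls back to ~/.config when it is unset or empty
    config_home = os.environ.get("XDG_CONFIG_HOME") or str(Path.home() / ".config")
    return Path(config_home) / "gruut" / lang


# Use the full language name ("de-de"), not the short form ("de")
print(gruut_lang_dir("de-de"))
```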

## Supported Languages

gruut currently supports:

* Arabic (`ar`)
* Czech (`cs` or `cs-cz`)
* German (`de` or `de-de`)
* English (`en` or `en-us`)
* Spanish (`es` or `es-es`)
* Farsi/Persian (`fa`)
* French (`fr` or `fr-fr`)
* Italian (`it` or `it-it`)
* Luxembourgish (`lb`)
* Dutch (`nl`)
* Russian (`ru` or `ru-ru`)
* Swedish (`sv` or `sv-se`)
* Swahili (`sw`)

The goal is to support all of [voice2json's languages](https://github.com/synesthesiam/voice2json-profiles#supported-languages).

## Dependencies

* Python 3.7 or higher
* Linux
    * Tested on Debian Bullseye
* [num2words fork](https://github.com/rhasspy/num2words) and [Babel](https://pypi.org/project/Babel/)
    * Currency/number handling
    * num2words fork includes additional language support (Arabic, Farsi, Swedish, Swahili)
* gruut-ipa
    * [IPA](https://en.wikipedia.org/wiki/International_Phonetic_Alphabet) pronunciation manipulation
* [pycrfsuite](https://github.com/scrapinghub/python-crfsuite)
    * Part of speech tagging and grapheme to phoneme models
* [dateparser](https://pypi.org/project/dateparser/)
    * Date parsing for multiple languages

## Numbers, Dates, and More

`gruut` can automatically verbalize numbers, dates, and other expressions. This is done in a locale-aware manner for both parsing and verbalization, so "1/1/2020" may be interpreted as "M/D/Y" or "D/M/Y" depending on the word or sentence's language (e.g., `<s lang="...">`).
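
To see why the language matters, here is a minimal standard-library sketch of the same ambiguity. It does not use gruut's own parsing; the `DATE_FORMATS` table and `parse_date` helper are assumptions made for illustration. The two readings match the SSML example at the top of this README, where "2/1/2000" becomes "February first" in `en-US` and "due gennaio" in Italian.

```python
from datetime import date, datetime

# Hypothetical mapping from language to day/month order (illustration only)
DATE_FORMATS = {
    "en-us": "%m/%d/%Y",  # month first
    "it-it": "%d/%m/%Y",  # day first
}


def parse_date(text: str, lang: str) -> date:
    """Parse a numeric date using the field order assumed for `lang`."""
    return datetime.strptime(text, DATE_FORMATS[lang]).date()


raw = "2/1/2000"
us = parse_date(raw, "en-us")  # read as month 2, day 1
it = parse_date(raw, "it-it")  # read as day 2, month 1
print(us.isoformat(), it.isoformat())
```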

The following types of expressions can be automatically expanded into words by `gruut`:

* Numbers - "123" to "one hundred and twenty three" (disable with `verbalize_numbers=False` or `--no-numbers`)
    * Relies on `Babel` for parsing and `num2words` for verbalization
* Dates - "1/1/2020" to "January first, twenty twenty" (disable with `verbalize_dates=False` or `--no-dates`)
    * Relies on `dateparser` for parsing and both `Babel` and `num2words` for verbalization
* Currency - "$10" to "ten dollars" (disable with `verbalize_currency=False` or `--no-currency`)
    * Relies on `Babel` for parsing and both `Babel` and `num2words` for verbalization
* Times - "12:01am" to "twelve oh one A M" (disable with `verbalize_times=False` or `--no-times`)
    * English only
    * Relies on `num2words` for verbalization

## Command-Line Usage

The `gruut` module can be executed with `python3 -m gruut --language <LANGUAGE> <TEXT>` or with the `gruut` command (from `setup.py`).

The `gruut` command is line-oriented, consuming text and producing [JSONL](https://jsonlines.org/).
You will probably want to install [jq](https://stedolan.github.io/jq/) to manipulate the [JSONL](https://jsonlines.org/) output from `gruut`.

### Plain Text

Takes raw text and outputs [JSONL](https://jsonlines.org/) with cleaned words/tokens.

```sh
echo 'This, right here, is some "RAW" text!' \
   | gruut --language en-us \
   | jq --raw-output '.words[].text'
This
,
right
here
,
is
some
"
RAW
"
text
!
```

More information is available in the full JSON output:

```sh
gruut --language en-us 'More  text.' | jq .
```

Output:

```json
{
  "idx": 0,
  "text": "More text.",
  "text_with_ws": "More text.",
  "text_spoken": "More text",
  "par_idx": 0,
  "lang": "en-us",
  "voice": "",
  "words": [
    {
      "idx": 0,
      "text": "More",
      "text_with_ws": "More ",
      "leading_ws": "",
      "trailing_ws": " ",
      "sent_idx": 0,
      "par_idx": 0,
      "lang": "en-us",
      "voice": "",
      "pos": "JJR",
      "phonemes": [
        "m",
        "ˈɔ",
        "ɹ"
      ],
      "is_major_break": false,
      "is_minor_break": false,
      "is_punctuation": false,
      "is_break": false,
      "is_spoken": true,
      "pause_before_ms": 0,
      "pause_after_ms": 0
    },
    {
      "idx": 1,
      "text": "text",
      "text_with_ws": "text",
      "leading_ws": "",
      "trailing_ws": "",
      "sent_idx": 0,
      "par_idx": 0,
      "lang": "en-us",
      "voice": "",
      "pos": "NN",
      "phonemes": [
        "t",
        "ˈɛ",
        "k",
        "s",
        "t"
      ],
      "is_major_break": false,
      "is_minor_break": false,
      "is_punctuation": false,
      "is_break": false,
      "is_spoken": true,
      "pause_before_ms": 0,
      "pause_after_ms": 0
    },
    {
      "idx": 2,
      "text": ".",
      "text_with_ws": ".",
      "leading_ws": "",
      "trailing_ws": "",
      "sent_idx": 0,
      "par_idx": 0,
      "lang": "en-us",
      "voice": "",
      "pos": null,
      "phonemes": [
        "‖"
      ],
      "is_major_break": true,
      "is_minor_break": false,
      "is_punctuation": false,
      "is_break": true,
      "is_spoken": false,
      "pause_before_ms": 0,
      "pause_after_ms": 0
    }
  ],
  "pause_before_ms": 0,
  "pause_after_ms": 0
}
```

For the whole input line and each word, the `text` property contains the processed input text with normalized whitespace while `text_with_ws` retains the original whitespace. The `text_spoken` property only contains words that are spoken, so punctuation and breaks are excluded.

Within each word, there is:

* `idx` - zero-based index of the word in the sentence
* `sent_idx` - zero-based index of the sentence in the input text
* `pos` - part of speech tag (if available)
* `phonemes` - list of [IPA](https://en.wikipedia.org/wiki/International_Phonetic_Alphabet) phonemes for the word (if available)
* `is_minor_break` - `true` if "word" separates phrases (comma, semicolon, etc.)
* `is_major_break` - `true` if "word" separates sentences (period, question mark, etc.)
* `is_break` - `true` if "word" is a major or minor break
* `is_punctuation` - `true` if "word" is a surrounding punctuation mark (quote, bracket, etc.)
* `is_spoken` - `true` if not a break or punctuation
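
If you would rather post-process the output in Python than in `jq`, the fields listed above can be read with the standard `json` module. The sketch below works on one abridged record (only the fields it needs, with values copied from the output above) instead of calling the `gruut` command.

```python
import json

# One abridged JSONL record; values are taken from the full output above
line = (
    '{"idx": 0, "text": "More text.", "words": ['
    '{"text": "More", "phonemes": ["m", "ˈɔ", "ɹ"], "is_spoken": true},'
    '{"text": "text", "phonemes": ["t", "ˈɛ", "k", "s", "t"], "is_spoken": true},'
    '{"text": ".", "phonemes": ["‖"], "is_spoken": false}]}'
)

sentence = json.loads(line)

# Keep only words that are spoken (no breaks or punctuation)
spoken = [word for word in sentence["words"] if word["is_spoken"]]
for word in spoken:
    print(word["text"], " ".join(word["phonemes"]))

# Rebuilding the spoken text from the words gives the same value as `text_spoken`
spoken_text = " ".join(word["text"] for word in spoken)
print(spoken_text)
```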

See `python3 -m gruut --help` for more options.

### SSML

A subset of [SSML](https://www.w3.org/TR/speech-synthesis11/) is supported:

* `<speak>` - wrap around SSML text
    * `lang` - set language for document
* `<p>` - paragraph
    * `lang` - set language for paragraph
* `<s>` - sentence (disables automatic sentence breaking)
    * `lang` - set language for sentence
* `<w>` / `<token>` - word (disables automatic tokenization)
    * `lang` - set language for word
    * `role` - set word role (see [word roles](#word-roles))
* `<lang lang="...">` - set language of inner text
* `<voice name="...">` - set voice of inner text
* `<say-as interpret-as="">` - force interpretation of inner text
    * `interpret-as` one of "spell-out", "date", "number", "time", or "currency"
    * `format` - way to format text depending on `interpret-as`
        * number - one of "cardinal", "ordinal", "digits", "year"
        * date - string with "d" (cardinal day), "o" (ordinal day), "m" (month), or "y" (year)
* `<break time="">` - Pause for given amount of time
    * time - seconds ("123s") or milliseconds ("123ms")
* `<mark name="">` - User-defined mark (`marks_before` and `marks_after` attributes of words/sentences)
    * name - name of mark
* `<sub alias="">` - substitute `alias` for inner text
* `<phoneme ph="...">` - supply phonemes for inner text
    * `ph` - phonemes for each word of inner text, separated by whitespace
* `<lexicon id="...">` - inline or external pronunciation lexicon
    * `id` - unique id of lexicon (used in `<lookup ref="...">`)
    * `uri` - if empty or missing, lexicon is inline
    * One or more `<lexeme>` child elements with:
        * Optional `role="..."` ([word roles](#word-roles) separated by whitespace)
        * `<grapheme>WORD</grapheme>` - word text
        * `<phoneme>P H O N E M E S</phoneme>` - word pronunciation (phonemes separated by whitespace)
* `<lookup ref="...">` - use pronunciation lexicon for child elements
    * `ref` - id from a `<lexicon id="...">`

#### Word Roles

During phonemization, word roles are used to disambiguate pronunciations. Unless manually specified, a word's role is derived from its part of speech tag as `gruut:<TAG>`. For initialisms and `spell-out`, the role `gruut:letter` is used to indicate that e.g., "a" should be spoken as `/eɪ/` instead of `/ə/`.

For `en-us`, the following additional roles are available from the part-of-speech tagger:

* `gruut:CD` - number
* `gruut:DT` - determiner
* `gruut:IN` - preposition or subordinating conjunction 
* `gruut:JJ` - adjective
* `gruut:NN` - noun
* `gruut:PRP` - personal pronoun
* `gruut:RB` - adverb
* `gruut:VB` - verb
* `gruut:VBD` - verb (past tense)
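
The idea of role-based disambiguation can be sketched with a small role-keyed dictionary. This is not gruut's internal data structure: `LEXICON`, `role_from_pos`, and `lookup` are hypothetical names, the fallback rule is an assumption, and the tag assigned to each pronunciation of "wound" is assumed (the pronunciations themselves are copied from the example output at the top of this README).

```python
from typing import Dict, List, Optional

# Hypothetical role-keyed lexicon for illustration
LEXICON: Dict[str, Dict[str, List[str]]] = {
    "wound": {
        "gruut:VBD": ["w", "ˈaʊ", "n", "d"],  # past tense of "wind" (assumed tag)
        "gruut:NN": ["w", "ˈu", "n", "d"],  # noun (assumed tag)
    }
}


def role_from_pos(pos: Optional[str]) -> Optional[str]:
    """Derive a word role from a part-of-speech tag as gruut:<TAG>."""
    return f"gruut:{pos}" if pos else None


def lookup(word: str, pos: Optional[str]) -> Optional[List[str]]:
    """Pick the pronunciation matching the role, else the first one listed."""
    by_role = LEXICON.get(word.lower(), {})
    role = role_from_pos(pos)
    if role in by_role:
        return by_role[role]
    # Assumed fallback: first known pronunciation, or None for unknown words
    return next(iter(by_role.values()), None)


print(lookup("wound", "VBD"))
print(lookup("wound", "NN"))
```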

#### Inline Lexicons

Inline [pronunciation lexicons](https://www.w3.org/TR/2008/REC-pronunciation-lexicon-20081014/) are supported via the `<lexicon>` and `<lookup>` tags. gruut diverges slightly from the [SSML standard](https://www.w3.org/TR/speech-synthesis11/) here by allowing lexicons to be defined within the SSML document itself (`uri` is blank or missing). Additionally, the `id` attribute of the `<lexicon>` element can be left off to indicate a "default" inline lexicon that does not require a corresponding `<lookup>` tag.

For example, the following document will yield three different pronunciations for the word "tomato":

``` xml
<?xml version="1.0"?>
<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">

  <lexicon xml:id="test" alphabet="ipa">
    <lexeme>
      <grapheme>
        tomato
      </grapheme>
      <phoneme>
        <!-- Individual phonemes are separated by whitespace -->
        t ə m ˈɑ t oʊ
      </phoneme>
    </lexeme>
    <lexeme>
      <grapheme role="fake-role">
        tomato
      </grapheme>
      <phoneme>
        <!-- Made up pronunciation for fake word role -->
        t ə m ˈi t oʊ
      </phoneme>
    </lexeme>
  </lexicon>

  <w>tomato</w>
  <lookup ref="test">
    <w>tomato</w>
    <w role="fake-role">tomato</w>
  </lookup>
</speak>
```

The first "tomato" will be looked up in the U.S. English lexicon (`/t ə m ˈeɪ t oʊ/`). Within the `<lookup>` tag's scope, the second and third "tomato" words will be looked up in the inline lexicon. The third "tomato" word has a [role](#word-roles) attached (selecting a made-up pronunciation in this case).

Even further from the SSML standard, gruut allows you to leave off the `<lexicon>` id entirely. With no `id`, a `<lookup>` tag is no longer needed, allowing you to override the pronunciation of any word in the document: 

``` xml
<?xml version="1.0"?>
<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">

  <!-- No id means change all words without a lookup -->
  <lexicon>
    <lexeme>
      <grapheme>
        tomato
      </grapheme>
      <phoneme>
        t ə m ˈɑ t oʊ
      </phoneme>
    </lexeme>
  </lexicon>

  <w>tomato</w>
</speak>
```

This will yield a pronunciation of `/t ə m ˈɑ t oʊ/` for all instances of "tomato" in the document (unless they have a `<lookup>`).
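
To make the structure of an inline lexicon concrete, the sketch below reads one into a plain dictionary with `xml.etree.ElementTree`. It is not gruut's parser: the document is simplified (no SSML namespace declarations, no comments), and `load_lexicons` and the choice of an empty string as the key for a lexicon without an id are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# The predefined xml: prefix expands to this namespace when parsed
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Simplified version of the first example above
ssml = """<speak>
  <lexicon xml:id="test">
    <lexeme>
      <grapheme>tomato</grapheme>
      <phoneme>t ə m ˈɑ t oʊ</phoneme>
    </lexeme>
    <lexeme>
      <grapheme role="fake-role">tomato</grapheme>
      <phoneme>t ə m ˈi t oʊ</phoneme>
    </lexeme>
  </lexicon>
</speak>"""


def load_lexicons(root):
    """Map lexicon id -> {(word, role): [phonemes]}; a missing id is stored as ""."""
    lexicons = {}
    for lexicon in root.iter("lexicon"):
        lex_id = lexicon.get(XML_ID) or lexicon.get("id") or ""
        entries = lexicons.setdefault(lex_id, {})
        for lexeme in lexicon.iter("lexeme"):
            grapheme = lexeme.find("grapheme")
            phoneme = lexeme.find("phoneme")
            role = grapheme.get("role", "")
            # Phonemes are separated by whitespace
            entries[(grapheme.text.strip(), role)] = phoneme.text.split()
    return lexicons


lexicons = load_lexicons(ET.fromstring(ssml))
print(lexicons["test"][("tomato", "")])
print(lexicons["test"][("tomato", "fake-role")])
```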

## Intended Audience

gruut is useful for transforming raw text into phonetic pronunciations, similar to [phonemizer](https://github.com/bootphon/phonemizer). Unlike phonemizer, gruut looks up words in a pre-built lexicon (pronunciation dictionary) or guesses word pronunciations with a pre-trained grapheme-to-phoneme model. Phonemes for each language come from a [carefully chosen inventory](https://en.wikipedia.org/wiki/Template:Language_phonologies).
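
The lookup-then-guess strategy can be sketched as follows. The guesser here is a deliberately trivial placeholder (one symbol per letter) standing in for a trained grapheme-to-phoneme model; `phonemize` and `placeholder_guess` are hypothetical names, not gruut functions.

```python
from typing import Callable, Dict, List

Pronunciation = List[str]


def phonemize(
    word: str,
    lexicon: Dict[str, Pronunciation],
    guess: Callable[[str], Pronunciation],
) -> Pronunciation:
    """Prefer the lexicon entry; fall back to the guesser for unknown words."""
    key = word.lower()
    if key in lexicon:
        return lexicon[key]
    return guess(key)


def placeholder_guess(word: str) -> Pronunciation:
    # Stand-in for a pre-trained model: NOT a real pronunciation guess
    return list(word)


# Single entry copied from the JSON output earlier in this README
lexicon = {"text": ["t", "ˈɛ", "k", "s", "t"]}

print(phonemize("Text", lexicon, placeholder_guess))  # found in the lexicon
print(phonemize("gruut", lexicon, placeholder_guess))  # falls back to the guesser
```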

For each supported language, gruut includes:

* A word pronunciation lexicon built from open source data
    * See [pron_dict](https://github.com/Kyubyong/pron_dictionaries)
* A pre-trained grapheme-to-phoneme model for guessing word pronunciations

Some languages also include:

* A pre-trained part of speech tagger built from open source data:
    * See [universal dependencies](https://universaldependencies.org/)

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/rhasspy/gruut",
    "name": "gruut",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.6",
    "maintainer_email": null,
    "keywords": null,
    "author": "Michael Hansen",
    "author_email": "mike@rhasspy.org",
    "download_url": "https://files.pythonhosted.org/packages/fc/e1/6b5a01ef36b5341d5d0899401e4413594dfaa21f86cfc05be8efb25baf81/gruut-2.4.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "A tokenizer, text cleaner, and phonemizer for many human languages.",
    "version": "2.4.0",
    "project_urls": {
        "Homepage": "https://github.com/rhasspy/gruut"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "fce16b5a01ef36b5341d5d0899401e4413594dfaa21f86cfc05be8efb25baf81",
                "md5": "bd39118707abc1b256f296e4f7bf779a",
                "sha256": "a49f693266a3a1ab5a6bde77a8f560ef27712b4169b5a6b02e6a1a873342e19e"
            },
            "downloads": -1,
            "filename": "gruut-2.4.0.tar.gz",
            "has_sig": false,
            "md5_digest": "bd39118707abc1b256f296e4f7bf779a",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.6",
            "size": 85341,
            "upload_time": "2024-07-03T15:40:55",
            "upload_time_iso_8601": "2024-07-03T15:40:55.589073Z",
            "url": "https://files.pythonhosted.org/packages/fc/e1/6b5a01ef36b5341d5d0899401e4413594dfaa21f86cfc05be8efb25baf81/gruut-2.4.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-07-03 15:40:55",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "rhasspy",
    "github_project": "gruut",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [
        {
            "name": "Babel",
            "specs": [
                [
                    ">=",
                    "2.8.0"
                ],
                [
                    "<",
                    "3.0.0"
                ]
            ]
        },
        {
            "name": "dateparser",
            "specs": [
                [
                    "~=",
                    "1.1.1"
                ]
            ]
        },
        {
            "name": "gruut-ipa",
            "specs": [
                [
                    ">=",
                    "0.12.0"
                ],
                [
                    "<",
                    "1.0"
                ]
            ]
        },
        {
            "name": "gruut_lang_en",
            "specs": [
                [
                    "~=",
                    "2.0.0"
                ]
            ]
        },
        {
            "name": "jsonlines",
            "specs": [
                [
                    "~=",
                    "1.2.0"
                ]
            ]
        },
        {
            "name": "networkx",
            "specs": [
                [
                    ">=",
                    "2.5.0"
                ]
            ]
        },
        {
            "name": "num2words",
            "specs": [
                [
                    ">=",
                    "0.5.10"
                ],
                [
                    "<",
                    "1.0.0"
                ]
            ]
        },
        {
            "name": "numpy",
            "specs": [
                [
                    ">=",
                    "1.19.0"
                ]
            ]
        },
        {
            "name": "python-crfsuite",
            "specs": [
                [
                    "~=",
                    "0.9.7"
                ]
            ]
        }
    ],
    "tox": true,
    "lcname": "gruut"
}

Michael Hansen