# phonemes2ids
Flexible tool for assigning integer ids to phonemes.
Useful for text to speech or speech to text applications where text is phonemized, converted to an integer vector, and then used to train a network.
## Installation
phonemes2ids is available on PyPI:
```sh
pip install phonemes2ids
```
If installation was successful, you should be able to run:
```sh
phonemes2ids --version
```
## Learning Phoneme IDs
```sh
cat << EOF |
a b c
b a a b
EOF
phonemes2ids --write-phonemes phonemes.txt
```
which prints out:
```
0 1 2
1 0 0 1
```
Looking at phonemes.txt, we get:
```
0 a
1 b
2 c
```
Importantly, the assignment of ids to phonemes is deterministic. It is based on the sorted order of the final observed phoneme set. So changing the order of our input lines:
```sh
cat << EOF |
b a a b
a b c
EOF
phonemes2ids --write-phonemes phonemes.txt
```
does not change the ids associated with each phoneme:
```
1 0 0 1
0 1 2
```
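The sorted-order rule can be sketched in a few lines of Python. This only illustrates the behavior described above and is not the library's implementation; `learn_ids_sorted` is a hypothetical helper name.

```python
def learn_ids_sorted(lines):
    """Assign ids by sorted order of the observed phoneme set (illustrative only)."""
    observed = set()
    for line in lines:
        observed.update(line.split())  # whitespace-separated phonemes
    return {phoneme: idx for idx, phoneme in enumerate(sorted(observed))}


first = learn_ids_sorted(["a b c", "b a a b"])
second = learn_ids_sorted(["b a a b", "a b c"])
print(first)            # {'a': 0, 'b': 1, 'c': 2}
print(first == second)  # True
```

Because the ids depend only on the final set, the order of the input lines makes no difference.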
### Pre-Assigned Phonemes
We can pre-assign ids to some (or all) phonemes beforehand, allowing the rest to be learned:
```sh
echo '0 c' > assigned_phonemes.txt
cat << EOF |
b a a b
a b c
EOF
phonemes2ids --read-phonemes assigned_phonemes.txt \
--write-phonemes phonemes.txt
```
The output is now:
```
2 1 1 2
1 2 0
```
and phonemes.txt shows the new assignments:
```
0 c
1 a
2 b
```
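A rough sketch of how pre-assignment could combine with learning, assuming unassigned phonemes take the lowest unused ids in sorted order (this matches the example above, but the library's actual rule may differ). `learn_with_preassigned` is a hypothetical name.

```python
def learn_with_preassigned(lines, preassigned):
    """Keep pre-assigned ids, then give remaining phonemes the lowest free ids."""
    phoneme_to_id = dict(preassigned)
    used = set(phoneme_to_id.values())
    observed = {p for line in lines for p in line.split()}

    next_id = 0
    for phoneme in sorted(observed.difference(phoneme_to_id)):
        while next_id in used:  # skip ids that are already taken
            next_id += 1
        phoneme_to_id[phoneme] = next_id
        used.add(next_id)
    return phoneme_to_id


lines = ["b a a b", "a b c"]
table = learn_with_preassigned(lines, {"c": 0})
encoded = [[table[p] for p in line.split()] for line in lines]
print(table)    # {'c': 0, 'a': 1, 'b': 2}
print(encoded)  # [[2, 1, 1, 2], [1, 2, 0]]
```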
### Phoneme/Word Separators and Blanks
By default, phonemes are assumed not to be separated at all, and words are separated by whitespace. Let's specify instead that phonemes are separated by a '_' and words by a '|':
```sh
echo 'a|b|a_b|b_a' | phonemes2ids -p '_' -w '|'
0 1 0 1 1 0
```
where `a` is 0 and `b` is 1.
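The effect of the two separators can be pictured as a nested split. The sketch below is an assumption about the behavior, not the library's code; `split_words` is a hypothetical helper.

```python
def split_words(line, phoneme_sep="_", word_sep="|"):
    """Split a line into words, and each word into phonemes."""
    return [word.split(phoneme_sep) for word in line.split(word_sep)]


words = split_words("a|b|a_b|b_a")
phoneme_to_id = {"a": 0, "b": 1}
flat_ids = [phoneme_to_id[p] for word in words for p in word]
print(words)     # [['a'], ['b'], ['a', 'b'], ['b', 'a']]
print(flat_ids)  # [0, 1, 0, 1, 1, 0]
```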
Word separators become especially useful when you want to insert blank tokens between words:
```sh
echo '0 #' > assigned_phonemes.txt
echo 'a|b|a_b|b_a' | \
phonemes2ids -p '_' -w '|' --blank '#' \
--read-phonemes assigned_phonemes.txt
0 1 0 2 0 1 2 0 2 1 0
```
where `#` is 0, and `a` and `b` are 1 and 2 respectively. Note that `a_b` and `b_a` are surrounded by `#` (0) in the output because they are words.
Blank tokens are useful for some machine learning models, such as [GlowTTS](https://github.com/jaywalnut310/glow-tts/). They are inserted between words by default, including before the first word (change with `--no-blank-start`) and after the last word (change with `--no-blank-end`).
It's possible to include blank tokens between *every* phoneme (instead of just between words):
```sh
echo '0 #' > assigned_phonemes.txt
echo 'a|b|a_b|b_a' | \
phonemes2ids -p '_' -w '|' --blank '#' --blank-between tokens \
--read-phonemes assigned_phonemes.txt
0 1 0 2 0 1 0 2 0 2 0 1 0
```
Now every other phoneme/token in the output is blank (`#` = 0).
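Both blank modes can be reproduced with a small helper. This is a sketch under the assumption that blanks are placed between units (words or individual tokens) plus optionally at the start and end; `insert_blanks` and its parameter names are hypothetical and do not mirror the library's API.

```python
def insert_blanks(word_ids, blank_id, between="words",
                  blank_start=True, blank_end=True):
    """Insert blank_id between words or between every token."""
    if between == "tokens":
        units = [[i] for word in word_ids for i in word]  # one unit per token
    else:
        units = word_ids  # one unit per word

    out = [blank_id] if blank_start else []
    for n, unit in enumerate(units):
        if n > 0:
            out.append(blank_id)
        out.extend(unit)
    if blank_end:
        out.append(blank_id)
    return out


word_ids = [[1], [2], [1, 2], [2, 1]]  # a | b | a_b | b_a with a=1, b=2
by_words = insert_blanks(word_ids, blank_id=0)
by_tokens = insert_blanks(word_ids, blank_id=0, between="tokens")
print(by_words)   # [0, 1, 0, 2, 0, 1, 2, 0, 2, 1, 0]
print(by_tokens)  # [0, 1, 0, 2, 0, 1, 0, 2, 0, 2, 0, 1, 0]
```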
### Pad/BOS/EOS Symbols
It's common to include pad, bos (beginning of sentence), and eos (end of sentence) symbols. These typically occupy the first few phoneme ids, especially the pad symbol, which is almost always 0.
You can have bos/eos added automatically with `--auto-bos-eos`:
```sh
echo 'a b c' | \
phonemes2ids --auto-bos-eos \
--pad '_' --bos '^' --eos '$' \
--write-phonemes phonemes.txt
1 3 4 5 2
```
Looking at phonemes.txt, we can see that pad, bos, and eos are automatically assigned ids (in that order):
```
0 _
1 ^
2 $
3 a
4 b
5 c
```
The output from the first command (`1 3 4 5 2`) can now be interpreted as `^ a b c $`.
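As a rough illustration of the ordering shown above, the following sketch reserves ids for pad, bos, and eos first and then wraps each sentence. The helper names are hypothetical, and the library may handle edge cases differently.

```python
def build_table(phonemes, pad="_", bos="^", eos="$"):
    """Reserve ids for pad, bos, eos (in that order), then sorted phonemes."""
    table = {}
    for symbol in (pad, bos, eos):
        table[symbol] = len(table)
    for phoneme in sorted(set(phonemes)):
        table.setdefault(phoneme, len(table))
    return table


def encode_with_bos_eos(phonemes, table, bos="^", eos="$"):
    """Wrap the encoded sentence in bos/eos ids."""
    return [table[bos]] + [table[p] for p in phonemes] + [table[eos]]


sentence = "a b c".split()
table = build_table(sentence)
sentence_ids = encode_with_bos_eos(sentence, table)
print(table)         # {'_': 0, '^': 1, '$': 2, 'a': 3, 'b': 4, 'c': 5}
print(sentence_ids)  # [1, 3, 4, 5, 2]
```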
### Stress/Tone Separation
Depending on your use case, it may be important that stress markers and tones are separated into distinct phonemes.
By default, stress is assumed to be part of a phoneme:
```sh
echo "ˈa a cˌ c" | phonemes2ids -p ' '
3 0 2 1
```
Note that each phoneme has received a distinct id. To separate the IPA primary/secondary stress characters (U+02C8 and U+02CC):
```sh
echo "ˈa a cˌ c" | \
phonemes2ids -p ' ' --separate-stress \
--write-phonemes phonemes.txt
0 2 2 3 1 3
```
Looking in phonemes.txt, we can see that stress markers are assigned ids first:
```
0 ˈ
1 ˌ
2 a
3 c
```
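The stress split can be sketched as follows. This assumes stress markers are peeled off in place and given the first ids, as the listing above suggests; `separate_stress` is a hypothetical helper, not the library's function.

```python
STRESS_MARKERS = ["\u02c8", "\u02cc"]  # IPA primary and secondary stress


def separate_stress(phoneme):
    """Split stress markers out of a phoneme, keeping their position."""
    parts, current = [], ""
    for char in phoneme:
        if char in STRESS_MARKERS:
            if current:
                parts.append(current)
                current = ""
            parts.append(char)
        else:
            current += char
    if current:
        parts.append(current)
    return parts


tokens = [part for p in "ˈa a cˌ c".split() for part in separate_stress(p)]

# Stress markers get ids first, then the remaining tokens in sorted order.
table = {marker: idx for idx, marker in enumerate(STRESS_MARKERS)}
for token in sorted(set(tokens) - set(STRESS_MARKERS)):
    table[token] = len(table)

stress_ids = [table[t] for t in tokens]
print(stress_ids)  # [0, 2, 2, 3, 1, 3]
```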
Tones can also be separated, if desired. These are represented as digits (`[0-9]+`) that follow a phoneme:
```sh
echo 'a123 b45 c6' | \
phonemes2ids -p ' ' --separate-tones \
--write-phonemes phonemes.txt
3 0 4 1 5 2
```
The tones are given separate ids and placed after their corresponding phoneme (change with `--tone-before`):
```
0 123
1 45
2 6
3 a
4 b
5 c
```
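A minimal sketch of tone separation, assuming a tone is a trailing run of digits. The `tone_before` parameter is a hypothetical stand-in for the `--tone-before` flag, and in this example plain sorted order happens to reproduce the table above because digits sort before letters.

```python
import re

TONE_PATTERN = re.compile(r"^(.+?)([0-9]+)$")  # phoneme followed by digits


def separate_tone(phoneme, tone_before=False):
    """Split a trailing tone off a phoneme; return the phoneme alone if none."""
    match = TONE_PATTERN.match(phoneme)
    if match is None:
        return [phoneme]
    base, tone = match.groups()
    return [tone, base] if tone_before else [base, tone]


tone_tokens = [part for p in "a123 b45 c6".split() for part in separate_tone(p)]
tone_table = {t: i for i, t in enumerate(sorted(set(tone_tokens)))}
tone_ids = [tone_table[t] for t in tone_tokens]
print(tone_tokens)  # ['a', '123', 'b', '45', 'c', '6']
print(tone_ids)     # [3, 0, 4, 1, 5, 2]
```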
### Advanced Separation
Separating out parts of phonemes can be controlled further with the `--separate-graphemes` flag and `--separate <string>` option.
The `--separate-graphemes` flag will cause all Unicode characters to be decomposed into codepoints before being assigned ids:
```sh
echo 'ɑ̃' | \
phonemes2ids --separate-graphemes
0 1
```
where U+0251 (`ɑ`) is 0 and U+0303 (nasal) is 1.
Specifying the exact graphemes to separate out is done with `--separate <string>` (one or more times):
```sh
echo 'aː' | \
phonemes2ids --separate 'ː'
0 1
```
where `a` is 0 and `ː` is 1. If the separator occurs in the middle of a phoneme, the phoneme is split into three parts (before, separator, after):
```sh
echo 'aːb' | \
phonemes2ids --separate 'ː'
0 2 1
```
where `a` is 0, `b` is 1, and `ː` is 2.
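The before/separator/after split can be approximated with `re.split` and a capturing group. This is an illustrative sketch only; `separate` here is a hypothetical function, not the library's.

```python
import re


def separate(phoneme, separators):
    """Split a phoneme around any of the given separator strings, keeping them."""
    pattern = "(" + "|".join(re.escape(s) for s in separators) + ")"
    return [part for part in re.split(pattern, phoneme) if part]


end_parts = separate("aː", ["ː"])
middle_parts = separate("aːb", ["ː"])
middle_table = {p: i for i, p in enumerate(sorted(set(middle_parts)))}
middle_ids = [middle_table[p] for p in middle_parts]
print(end_parts)     # ['a', 'ː']
print(middle_parts)  # ['a', 'ː', 'b']
print(middle_ids)    # [0, 2, 1]
```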
### Punctuation Simplification
If you only care about short and long pauses in a sentence, the `--simple-punctuation` flag is for you! It replaces common punctuation symbols with either `,` (short pause) or `.` (long pause):
```sh
echo ', . : ; ! ?' | \
phonemes2ids --simple-punctuation
0 1 0 0 1 1
```
where `,` is 0 and `.` is 1. Use `--phoneme-map` for more control.
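The replacement amounts to a small lookup table. The sketch below covers only the six symbols from the example above; the full set of symbols the flag handles is not listed here, so treat `SIMPLE_PUNCTUATION` as an assumption.

```python
# Mapping inferred from the example output above (an assumption, not the full list).
SIMPLE_PUNCTUATION = {":": ",", ";": ",", "!": ".", "?": "."}


def simplify(phonemes):
    """Replace punctuation with ',' (short pause) or '.' (long pause)."""
    return [SIMPLE_PUNCTUATION.get(p, p) for p in phonemes]


simplified = simplify(", . : ; ! ?".split())
punct_table = {p: i for i, p in enumerate(sorted(set(simplified)))}
punct_ids = [punct_table[p] for p in simplified]
print(simplified)  # [',', '.', ',', ',', '.', '.']
print(punct_ids)   # [0, 1, 0, 0, 1, 1]
```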
---
## Phoneme Counts and Maps
The learning feature of phonemes2ids can be used to help you reduce your phoneme set. A typical workflow is:
1. Run `phonemes2ids` with `--write-phoneme-counts` on your input data
2. Look inside the phoneme counts file, and decide which phonemes should be re-mapped (usually ones with very few examples)
3. Create a phoneme map text file, where each line is `<FROM> <TO>` like `ʌ ə` (every occurrence of `ʌ` is replaced by `ə`)
   * The `<TO>` can be multiple phonemes like `aɪ a ɪ`, which breaks apart the diphthong
4. Re-run `phonemes2ids` with `--phoneme-map` and `--write-phonemes`
Make sure to keep your phoneme map with your phonemes.txt file!
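The workflow above can be sketched in plain Python: count phonemes, then apply a `<FROM> <TO>` map. The helper names and the tiny data set are hypothetical, and the library's own counts file format is not shown here.

```python
from collections import Counter


def count_phonemes(lines):
    """Count how often each whitespace-separated phoneme occurs."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def parse_map(lines):
    """Parse '<FROM> <TO...>' lines into {from: [to, ...]}."""
    phoneme_map = {}
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            phoneme_map[parts[0]] = parts[1:]
    return phoneme_map


def apply_map(phonemes, phoneme_map):
    """Replace each mapped phoneme with its (possibly multi-phoneme) target."""
    mapped = []
    for phoneme in phonemes:
        mapped.extend(phoneme_map.get(phoneme, [phoneme]))
    return mapped


counts = count_phonemes(["ə b ə", "ʌ b aɪ"])
phoneme_map = parse_map(["ʌ ə", "aɪ a ɪ"])
mapped = apply_map("ʌ b aɪ".split(), phoneme_map)
print(counts.most_common())  # rare phonemes appear at the end
print(mapped)                # ['ə', 'b', 'a', 'ɪ']
```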
---
## Converting Phonemes to IDs
Once you've figured out all of your settings, it's time to convert some input data! This will usually look something like:
```sh
phonemes2ids --read-phonemes phonemes.txt \
--phoneme-map map.txt \
[other settings] \
< input_phonemes.txt \
> output_ids.txt
```
where `phonemes.txt` contains your complete phoneme/id pairs from the learning phase, and `map.txt` has phoneme/phoneme pairs that you'd like to be automatically replaced.
Each line in the output file (`output_ids.txt`) will contain the ids of the corresponding line from the input file (`input_phonemes.txt`).
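For intuition, the conversion step can be pictured as a table lookup per line. The sketch assumes `phonemes.txt` uses the `<id> <phoneme>` layout shown in the earlier listings; `parse_phonemes` and `encode_lines` are hypothetical helpers, not part of the library.

```python
def parse_phonemes(lines):
    """Parse '<id> <phoneme>' lines into a phoneme -> id table."""
    phoneme_to_id = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        idx, phoneme = line.split(maxsplit=1)
        phoneme_to_id[phoneme] = int(idx)
    return phoneme_to_id


def encode_lines(lines, phoneme_to_id):
    """Produce one output line of ids per input line of phonemes."""
    return [
        " ".join(str(phoneme_to_id[p]) for p in line.split())
        for line in lines
    ]


id_table = parse_phonemes(["0 a", "1 b", "2 c"])
output_lines = encode_lines(["a b c", "b a a b"], id_table)
print(output_lines)  # ['0 1 2', '1 0 0 1']
```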
If your input file is delimited, you can keep extra information with each output line:
```sh
echo 's1|a b c' | phonemes2ids --csv
s1|a b c|0 1 2
```
The `--csv` flag indicates that the input data is delimited by '|' (change with `--csv-delimiter`). The final column of each row is assumed to be the input phonemes, and the ids are simply appended as a new column. This allows you to pass arbitrary metadata through to the output file.
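The pass-through behavior can be sketched as: split on the delimiter, encode the last column, and append the result. `append_ids_column` is a hypothetical helper that mirrors the example above.

```python
def append_ids_column(line, phoneme_to_id, delimiter="|"):
    """Append an ids column computed from the final (phonemes) column."""
    columns = line.rstrip("\n").split(delimiter)
    ids = [phoneme_to_id[p] for p in columns[-1].split()]
    columns.append(" ".join(str(i) for i in ids))
    return delimiter.join(columns)


row = append_ids_column("s1|a b c", {"a": 0, "b": 1, "c": 2})
print(row)  # s1|a b c|0 1 2
```

Any number of metadata columns can precede the phonemes, since only the last column is read.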
---
## Python Library
You can use phonemes2ids directly from Python:
```python
from phonemes2ids import phonemes2ids
word_phonemes = [["a"], ["b"], ["c"], ["b", "c", "a"]]
phoneme_to_id = {"a": 1, "b": 2, "c": 3}
ids = phonemes2ids(word_phonemes=word_phonemes, phoneme_to_id=phoneme_to_id)
assert ids == [1, 2, 3, 2, 3, 1]
```
See the docstrings for `phonemes2ids` and `learn_phoneme_ids` for more details.
Raw data
{
"_id": null,
"home_page": "https://github.com/rhasspy/phonemes2ids",
"name": "phonemes2ids",
"maintainer": "",
"docs_url": null,
"requires_python": "",
"maintainer_email": "",
"keywords": "",
"author": "Michael Hansen",
"author_email": "mike@rhasspy.org",
"download_url": "https://files.pythonhosted.org/packages/27/6f/0cf0746d02a356103b7c3a065dbc188d8b5092e58bb363d74aa668ee1ad1/phonemes2ids-1.2.2.tar.gz",
"platform": "",
"bugtrack_url": null,
"license": "",
"summary": "Convert phonemes to integer ids",
"version": "1.2.2",
"project_urls": {
"Homepage": "https://github.com/rhasspy/phonemes2ids"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "276f0cf0746d02a356103b7c3a065dbc188d8b5092e58bb363d74aa668ee1ad1",
"md5": "ade80bfa91c00615a512768fa4371585",
"sha256": "8e3e9e0723215c7187b56276bb053688a43851d8deb9948432e793262551c2ac"
},
"downloads": -1,
"filename": "phonemes2ids-1.2.2.tar.gz",
"has_sig": false,
"md5_digest": "ade80bfa91c00615a512768fa4371585",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 12370,
"upload_time": "2022-01-29T22:46:06",
"upload_time_iso_8601": "2022-01-29T22:46:06.526496Z",
"url": "https://files.pythonhosted.org/packages/27/6f/0cf0746d02a356103b7c3a065dbc188d8b5092e58bb363d74aa668ee1ad1/phonemes2ids-1.2.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2022-01-29 22:46:06",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "rhasspy",
"github_project": "phonemes2ids",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "phonemes2ids"
}