# ToJyutping
![pip](https://github.com/CanCLID/ToJyutping/workflows/Python%20Package/badge.svg)
粵語拼音自動標註工具 Cantonese Pronunciation Automatic Labeling Tool
## Installation
```sh
$ pip install ToJyutping
```
### In Other Languages
- [JavaScript/TypeScript (npm) Version](https://www.npmjs.com/package/to-jyutping) ([Repo](https://github.com/CanCLID/to-jyutping))
## Usage
For the 8 basic functions, examples are worth a thousand words:
```python
>>> import ToJyutping
>>> ToJyutping.get_jyutping_list('咁啱老世要求佢等陣要開會,剩低嘅嘢我會搞掂㗎喇。')
[('咁', 'gam3'), ('啱', 'ngaam1'), ('老', 'lou5'), ('世', 'sai3'), ('要', 'jiu1'), ('求', 'kau4'), ('佢', 'keoi5'), ('等', 'dang2'), ('陣', 'zan6'), ('要', 'jiu3'), ('開', 'hoi1'), ('會', 'wui2'), (',', None), ('剩', 'zing6'), ('低', 'dai1'), ('嘅', 'ge3'), ('嘢', 'je5'), ('我', 'ngo5'), ('會', 'wui5'), ('搞', 'gaau2'), ('掂', 'dim6'), ('㗎', 'gaa3'), ('喇', 'laa3'), ('。', None)]
>>> ToJyutping.get_jyutping('咁啱老世要求佢等陣要開會,剩低嘅嘢我會搞掂㗎喇。')
'咁(gam3)啱(ngaam1)老(lou5)世(sai3)要(jiu1)求(kau4)佢(keoi5)等(dang2)陣(zan6)要(jiu3)開(hoi1)會(wui2),剩(zing6)低(dai1)嘅(ge3)嘢(je5)我(ngo5)會(wui5)搞(gaau2)掂(dim6)㗎(gaa3)喇(laa3)。'
>>> ToJyutping.get_jyutping_text('咁啱老世要求佢等陣要開會,剩低嘅嘢我會搞掂㗎喇。')
'gam3 ngaam1 lou5 sai3 jiu1 kau4 keoi5 dang2 zan6 jiu3 hoi1 wui2, zing6 dai1 ge3 je5 ngo5 wui5 gaau2 dim6 gaa3 laa3.'
>>> ToJyutping.get_jyutping_candidates('咁啱老世要求佢等陣要開會,剩低嘅嘢我會搞掂㗎喇。')
[('咁', ['gam3', 'gam2', 'gam1', 'gam4']), ('啱', ['ngaam1', 'aam1', 'am1', 'ngam1']), ('老', ['lou5', 'lou2']), ('世', ['sai3', 'sai2']), ('要', ['jiu1', 'jiu3', 'jiu2']), ('求', ['kau4']), ('佢', ['keoi5', 'heoi5']), ('等', ['dang2']), ('陣', ['zan6', 'zan2']), ('要', ['jiu3', 'jiu2', 'jiu1']), ('開', ['hoi1']), ('會', ['wui2', 'wui5', 'wui6', 'wui3', 'kui2', 'kui3', 'kwui2']), (',', []), ('剩', ['zing6', 'sing6']), ('低', ['dai1']), ('嘅', ['ge3', 'ge2', 'koi2', 'koi3']), ('嘢', ['je5', 'e5']), ('我', ['ngo5', 'o5']), ('會', ['wui5', 'wui6', 'wui2', 'wui3', 'kui2', 'kui3', 'kwui2']), ('搞', ['gaau2']), ('掂', ['dim6', 'dim3', 'dim1']), ('㗎', ['gaa3', 'ga3', 'gaa2', 'gaa1', 'gaa4']), ('喇', ['laa3', 'laa1', 'laak3', 'laa5', 'laat3']), ('。', [])]
>>> ToJyutping.get_ipa_list('咁啱老世要求佢等陣要開會,剩低嘅嘢我會搞掂㗎喇。')
[('咁', 'kɐm˧'), ('啱', 'ŋaːm˥'), ('老', 'lou̯˩˧'), ('世', 'sɐi̯˧'), ('要', 'jiːu̯˥'), ('求', 'kʰɐu̯˨˩'), ('佢', 'kʰɵy̑˩˧'), ('等', 'tɐŋ˧˥'), ('陣', 't͡sɐn˨'), ('要', 'jiːu̯˧'), ('開', 'hɔːi̯˥'), ('會', 'wuːi̯˧˥'), (',', None), ('剩', 't͡seŋ˨'), ('低', 'tɐi̯˥'), ('嘅', 'kɛː˧'), ('嘢', 'jɛː˩˧'), ('我', 'ŋɔː˩˧'), ('會', 'wuːi̯˩˧'), ('搞', 'kaːu̯˧˥'), ('掂', 'tiːm˨'), ('㗎', 'kaː˧'), ('喇', 'laː˧'), ('。', None)]
>>> ToJyutping.get_ipa('咁啱老世要求佢等陣要開會,剩低嘅嘢我會搞掂㗎喇。')
'咁[kɐm˧]啱[ŋaːm˥]老[lou̯˩˧]世[sɐi̯˧]要[jiːu̯˥]求[kʰɐu̯˨˩]佢[kʰɵy̑˩˧]等[tɐŋ˧˥]陣[t͡sɐn˨]要[jiːu̯˧]開[hɔːi̯˥]會[wuːi̯˧˥],剩[t͡seŋ˨]低[tɐi̯˥]嘅[kɛː˧]嘢[jɛː˩˧]我[ŋɔː˩˧]會[wuːi̯˩˧]搞[kaːu̯˧˥]掂[tiːm˨]㗎[kaː˧]喇[laː˧]。'
>>> ToJyutping.get_ipa_text('咁啱老世要求佢等陣要開會,剩低嘅嘢我會搞掂㗎喇。')
'kɐm˧.ŋaːm˥.lou̯˩˧.sɐi̯˧.jiːu̯˥.kʰɐu̯˨˩.kʰɵy̑˩˧.tɐŋ˧˥.t͡sɐn˨.jiːu̯˧.hɔːi̯˥.wuːi̯˧˥ | t͡seŋ˨.tɐi̯˥.kɛː˧.jɛː˩˧.ŋɔː˩˧.wuːi̯˩˧.kaːu̯˧˥.tiːm˨.kaː˧.laː˧'
>>> ToJyutping.get_ipa_candidates('咁啱老世要求佢等陣要開會,剩低嘅嘢我會搞掂㗎喇。')
[('咁', ['kɐm˧', 'kɐm˧˥', 'kɐm˥', 'kɐm˨˩']), ('啱', ['ŋaːm˥', 'aːm˥', 'ɐm˥', 'ŋɐm˥']), ('老', ['lou̯˩˧', 'lou̯˧˥']), ('世', ['sɐi̯˧', 'sɐi̯˧˥']), ('要', ['jiːu̯˥', 'jiːu̯˧', 'jiːu̯˧˥']), ('求', ['kʰɐu̯˨˩']), ('佢', ['kʰɵy̑˩˧', 'hɵy̑˩˧']), ('等', ['tɐŋ˧˥']), ('陣', ['t͡sɐn˨', 't͡sɐn˧˥']), ('要', ['jiːu̯˧', 'jiːu̯˧˥', 'jiːu̯˥']), ('開', ['hɔːi̯˥']), ('會', ['wuːi̯˧˥', 'wuːi̯˩˧', 'wuːi̯˨', 'wuːi̯˧', 'kʰuːi̯˧˥', 'kʰuːi̯˧', 'kʷʰuːi̯˧˥']), (',', []), ('剩', ['t͡seŋ˨', 'seŋ˨']), ('低', ['tɐi̯˥']), ('嘅', ['kɛː˧', 'kɛː˧˥', 'kʰɔːi̯˧˥', 'kʰɔːi̯˧']), ('嘢', ['jɛː˩˧', 'ɛː˩˧']), ('我', ['ŋɔː˩˧', 'ɔː˩˧']), ('會', ['wuːi̯˩˧', 'wuːi̯˨', 'wuːi̯˧˥', 'wuːi̯˧', 'kʰuːi̯˧˥', 'kʰuːi̯˧', 'kʷʰuːi̯˧˥']), ('搞', ['kaːu̯˧˥']), ('掂', ['tiːm˨', 'tiːm˧', 'tiːm˥']), ('㗎', ['kaː˧', 'kɐ˧', 'kaː˧˥', 'kaː˥', 'kaː˨˩']), ('喇', ['laː˧', 'laː˥', 'laːk̚˧', 'laː˩˧', 'laːt̚˧']), ('。', [])]
```
For `get_jyutping_candidates` and `get_ipa_candidates`, pronunciations are sorted according to how likely they are to be correct in a sentence, with the first being the most likely.
Methods may also be imported individually:
```python
>>> from ToJyutping import get_jyutping_text
>>> get_jyutping_text('咁啱老世要求佢等陣要開會,剩低嘅嘢我會搞掂㗎喇。')
'gam3 ngaam1 lou5 sai3 jiu1 kau4 keoi5 dang2 zan6 jiu3 hoi1 wui2, zing6 dai1 ge3 je5 ngo5 wui5 gaau2 dim6 gaa3 laa3.'
```
In rare cases, the pronunciation of a single character can contain more than one syllable:
```python
>>> ToJyutping.get_jyutping_list('一瓩')
[('一', 'jat1'), ('瓩', 'cin1 ngaa5')]
>>> ToJyutping.get_ipa_list('一瓩')
[('一', 'jɐt̚˥'), ('瓩', 't͡sʰiːn˥.ŋaː˩˧')]
```
They are mostly dated ligature characters (合字) coined to represent units with SI prefixes.
## Custom Entries & Existing Entries Overriding or Exclusion
With an accuracy rate of 99%, the possibility of needing an adjustment is rare. However, Cantonese, like other varieties of Chinese, is mostly written in logographs, which means that homographs (同形詞) that are indistinguishable out of context can occur. Consider the following sentence:
> 上堂終於講到分數
In the above sentence, there are multiple possible pronunciations of 上, 到 and 分, and their meanings are different depending on how they are actually pronounced:
| Pronunciation | Meaning |
| --- | --- |
| soeng**5** tong4 zung1 jyu1 gong2 dou**3** fan**1** sou3 | Attending the lesson, it finally came to talk about scores.<br>_(Perhaps the scores weren’t available until today.)_ |
| soeng**5** tong4 zung1 jyu1 gong2 dou**3** fan**6** sou3 | Attending the lesson, it finally came to talk about fractions.<br>_(Perhaps the progress of the math class was slow.)_ |
| soeng**5** tong4 zung1 jyu1 gong2 dou**2** fan**1** sou3 | Attending the lesson, eventually it was able to talk about scores.<br>_(Perhaps the teacher wasn’t allowed to reveal the scores until today.)_ |
| soeng**5** tong4 zung1 jyu1 gong2 dou**2** fan**6** sou3 | Attending the lesson, eventually it was able to talk about fractions.<br>_(Perhaps the introduction to fractions requires some other concepts to be taught.)_ |
| soeng**6** tong4 zung1 jyu1 gong2 dou**3** fan**1** sou3 | The previous lesson finally came to talk about scores.<br>_(Perhaps the teacher just made the scores available right before the previous lesson.)_ |
| soeng**6** tong4 zung1 jyu1 gong2 dou**3** fan**6** sou3 | The previous lesson finally came to talk about fractions.<br>_(Perhaps the students just managed to catch up the progress in the math class.)_ |
| soeng**6** tong4 zung1 jyu1 gong2 dou**2** fan**1** sou3 | Eventually, it was able to talk about scores in the previous lesson.<br>_(Perhaps the teacher was finally allowed to reveal the scores in the previous lesson.)_ |
| soeng**6** tong4 zung1 jyu1 gong2 dou**2** fan**6** sou3 | Eventually, it was able to talk about fractions in the previous lesson.<br>_(Perhaps the teacher just finished teaching the other concepts required for learning fractions.)_ |
Thus, the library offers the ability to include custom entries and override or exclude built-in entries:
```python
>>> ToJyutping.get_jyutping_text('上堂終於講到分數')
'soeng5 tong4 zung1 jyu1 gong2 dou3 fan1 sou3'
>>> converter_lesson = ToJyutping.customize({'上堂': None, '分數': 'fan6 sou3'})
>>> converter_lesson.get_jyutping_text('上堂終於講到分數')
'soeng6 tong4 zung1 jyu1 gong2 dou3 fan6 sou3'
```
In the above example:
- By default, the library special-cases the pronunciation of 上堂 to “soeng5 tong4”. Setting `上堂` to `None` removes the special case and both 上 and 堂 now fallback to the their default pronunciations, which are “soeng6” and “tong4” respectively.
- By default, the library does not special-case 分數. Thus, the pronunciations of each individual characters, which in this case are “fan1” and “sou3”, are used. By including the entry `分數` and setting it to `fan6 sou3`, the converter outputs `fan6 sou3` when `分數` is encountered.
In general, setting any built-in entry to `None` fallbacks it to shorter matches and ultimately individual character pronunciations if there isn’t a match:
```python
>>> ToJyutping.get_jyutping_text('好學生')
'hou2 hok6 saang1'
>>> converter_studious = ToJyutping.customize({'好學生': None})
>>> converter_studious.get_jyutping_text('好學生')
'hou3 hok6 saang1' # Using shorter matches 好學 and 生
>>> converter_good_student = converter_studious.customize({'好學': None})
>>> converter_good_student.get_jyutping_text('好學生')
'hou2 hok6 saang1' # Using individual character pronunciations as it can’t be decomposed further
```
Converters can be chained without affecting each other:
```python
>>> converter_dou2 = converter_lesson.customize({'到': 'dou2'})
>>> converter_None = converter_lesson.customize({'到': None})
>>> converter_dou2.get_jyutping_text('上堂終於講到分數')
'soeng6 tong4 zung1 jyu1 gong2 dou2 fan6 sou3'
>>> converter_None.get_jyutping_text('上堂終於講到分數')
'soeng6 tong4 zung1 jyu1 gong2 […] fan6 sou3'
>>> ToJyutping.get_jyutping_text('上堂終於講到分數')
'soeng5 tong4 zung1 jyu1 gong2 dou3 fan1 sou3' # Also not affected
```
> [!WARNING]
> - This library only offers basic customization functionality. If there are longer built-in word entries, they aren’t overridden:
> ```python
> >>> converter_dou2.get_jyutping_text('笑到轆地')
> 'siu3 dou3 luk1 dei2'
>
> >>> converter_None.get_jyutping_text('笑到轆地')
> 'siu3 dou3 luk1 dei2'
>
> >>> converter_another_lesson = ToJyutping.customize({'上': None, '分': 'fan6'})
> >>> converter_another_lesson.get_jyutping_text('上堂終於講到分數')
> 'soeng5 tong4 zung1 jyu1 gong2 dou3 fan6 sou3'
> ```
> In the second example, their isn’t an entry for 分數, so 分 is patched successfully. However, this is not the case for 上 since the longer built-in entry 上堂 is prioritized.
> - The original pronunciations will be lost. If you are using `get_jyutping_candidates` or `get_ipa_candidates`, you will need to include the pronunciations manually:
> ```python
> >>> 到_original_pronunciations = ToJyutping.get_jyutping_candidates('到')
> >>> 到_original_pronunciations
> [('到', ['dou3', 'dou2'])]
> >>> converter_dou2_dou3 = converter_lesson.customize({'到': ['dou2', *到_original_pronunciations[0][1]]})
> >>> converter_dou2_dou3.get_jyutping_candidates('到')
> [('到', ['dou2', 'dou3'])]
> ```
> Notice how the library automatically deduplicates the values for you.
## Grapheme-to-Phoneme Conversion Function
Intended for machine learning purposes (especially text-to-speech and automatic speech recognition), a `g2p` function is provided to minimize the possibility of conversion problems due to lack of linguistic knowledge. It takes a string and outputs tuples of 3 integers (ranged from 8 to 94 inclusive) representing the **onset** (聲母), **rhyme** (韻母) and **tone** (聲調) of a syllable. Punctuations are included as singletons (1-tuples) and range from 1 to 7. They are detailed in the _[Punctuations](#punctuations)_ section below.
> [!NOTE]
> In this section, the word _“punctuation”_ may be delibrately written in the plural form to avoid confusion.
> [!IMPORTANT]
> In this section, **punctuations** include the unknown character filler, described in the _Punctuations_ section.
```python
>>> ToJyutping.g2p('咩話……你話上個月上堂學法文文法用咗 $50,000!?')
PhonemesList(
[(11, 46, 1), (22, 28, 2), (2,), (15, 47, 5), (22, 28, 6), (26, 84, 6), (17, 64, 3), (27, 92, 6), (26, 84, 5), (14, 69, 4), (23, 72, 6), (12, 35, 3), (11, 41, 2), (11, 41, 4), (12, 35, 3), (27, 78, 6), (24, 64, 2), (1,), (1,), (1,), (1,), (1,), (1,), (1,), (4,), (5,)],
segmentals=[11, 46, 22, 28, 2, 15, 47, 22, 28, 26, 84, 17, 64, 27, 92, 26, 84, 14, 69, 23, 72, 12, 35, 11, 41, 11, 41, 12, 35, 27, 78, 24, 64, 1, 1, 1, 1, 1, 1, 1, 4, 5],
tones=[1, 1, 2, 2, 0, 5, 5, 6, 6, 6, 6, 3, 3, 6, 6, 5, 5, 4, 4, 6, 6, 3, 3, 2, 2, 4, 4, 3, 3, 6, 6, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0],
lengths=[2, 2, 1, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
)
```
(Formatted manually, not exactly the same as what `repr` produces)
`PhonemesList` is a `list` with convenient attributes common to syllables handling (particularly in VITS2). `segmentals` is a `list` `property` with onsets, rhymes and punctuations included, while each element of `tones` (also a `list` `property`) gives the tone corresponding to each onset or rhyme, or `0` if the corresponding element is a punctuation. `lengths` (an ordinary `list`) suggests how many elements of `segmentals` or `tones` is each character of the original input correspond to. It is guaranteed that:
- `lengths` always sum up to the number of elements of both `segmentals` and `tones`; and
- The length of the input is always the same as that of `lengths`, such that they can be zipped nicely.
Note that the number of elements of `lengths` does not necessary match that of the original `PhonemesList`, since the input may contain polysyllabic characters, consecutive punctuations of the same category, or whitespaces.
> [!WARNING]
> Unlike the dynamic `segmentals` and `tones` properties, `lengths` is not recalculated even if the `PhonemesList` is modified, since it requires the original input to be taken into account. We do not recommend modifying `PhonemesList` either.
From the above example, you can see that the tone values are coincidently the same as some of the onsets, as it is a more common practice to separate tones into another sequence (this is what VITS2 expects, for example). If this is undesirable, pass `tone_same_seq=True` to output integers ranged from 8 up to 100:
(From now on, the attributes are not shown. Try them out and reveal them yourselves! However, for the case setting `tone_same_seq` to `True`, you probably don’t need them and just need to flatten the list.)
```python
>>> ToJyutping.g2p('咩話……你話上個月上堂學法文文法用咗 $50,000!?', tone_same_seq=True)
PhonemesList([(11, 46, 95), (22, 28, 96), (2,), (15, 47, 99), (22, 28, 100), (26, 84, 100), (17, 64, 97), (27, 92, 100), (26, 84, 99), (14, 69, 98), (23, 72, 100), (12, 35, 97), (11, 41, 96), (11, 41, 98), (12, 35, 97), (27, 78, 100), (24, 64, 96), (1,), (1,), (1,), (1,), (1,), (1,), (1,), (4,), (5,)])
```
Additionally, the second `offset` argument can be specified to shift the output values by a certain amount:
```python
>>> ToJyutping.g2p('咩話……你話上個月上堂學法文文法用咗 $50,000!?', offset=100)
PhonemesList([(103, 138, 101), (114, 120, 102), (2,), (107, 139, 105), (114, 120, 106), (118, 176, 106), (109, 156, 103), (119, 184, 106), (118, 176, 105), (106, 161, 104), (115, 164, 106), (104, 127, 103), (103, 133, 102), (103, 133, 104), (104, 127, 103), (119, 170, 106), (116, 156, 102), (1,), (1,), (1,), (1,), (1,), (1,), (1,), (4,), (5,)])
```
You may pass a triplet as well for shifting each element by different amounts:
```python
>>> ToJyutping.g2p('咩話……你話上個月上堂學法文文法用咗 $50,000!?', offset=(100, 200, 300))
PhonemesList([(103, 238, 301), (114, 220, 302), (2,), (107, 239, 305), (114, 220, 306), (118, 276, 306), (109, 256, 303), (119, 284, 306), (118, 276, 305), (106, 261, 304), (115, 264, 306), (104, 227, 303), (103, 233, 302), (103, 233, 304), (104, 227, 303), (119, 270, 306), (116, 256, 302), (1,), (1,), (1,), (1,), (1,), (1,), (1,), (4,), (5,)])
```
By default, `offset` shifts the onsets and rhymes by 8, which is one plus the number of elements in the default punctuation set described in the below section. The tones are **not** shifted unless `tone_same_seq` is set to `True`.
> [!WARNING]
> This is not the case if you specify `offset` as a single value. The onsets, rhymes and tones are **all shifted** by `offset` and both the `segmentals` and `tones` properties are affected if you provide a single integer. This may not be desirable unless `tone_same_seq` is set. To suppress shifting tones, pass `(amount, amount, 0)`. See below for an example.
### Punctuations
By default, ToJyutping classifies punctuations into 6 categories, `.`, `,`, `!`, `?`, `-` and `'`, and numbers them from 2 to 7. 1's (you may label it `…` for convenience) are used to mark unknown characters in the input string.
The reason for using a filler instead of raising an error is that we want to avoid data-driven errors. Strings in the dataset are known, but this is not the case for user input. If this is undesirable, you will need to look for `(1,)` in the output list and raise the error yourself.
Consecutive punctuations of the same category are collapsed into a single element. For example, both `……` and `......` become a single `(2,)`. However, this does not applies to the unknown character filler. This is because we want to maintain the length of the audio when there are multiple consecutive unknown characters.
You may supply `puncts_offset` to shift the punctuation IDs by a certain amount. For example, to swap the order of syllable IDs and punctuation IDs (i.e. make syllables range from 1 to 87 and punctuations range from 88 to 94):
```python
>>> ToJyutping.g2p('咩話……你話上個月上堂學法文文法用咗 $50,000!?', offset=(1, 1, 0), puncts_offset=88)
PhonemesList([(4, 39, 1), (15, 21, 2), (89,), (8, 40, 5), (15, 21, 6), (19, 77, 6), (10, 57, 3), (20, 85, 6), (19, 77, 5), (7, 62, 4), (16, 65, 6), (5, 28, 3), (4, 34, 2), (4, 34, 4), (5, 28, 3), (20, 71, 6), (17, 57, 2), (88,), (88,), (88,), (88,), (88,), (88,), (88,), (91,), (92,)])
```
The `puncts_offset` argument **defaults to 1**, since 0 is commonly used as the ID for padding. However, if `puncts_map` (detailed in the [Custom Punctuation Map](#custom-punctuation-map) section below) is provided, it’s defaulted to 0.
> [!WARNING]
> The `offset` and `puncts_offset` arguments do not affect each other. If you changed `puncts_offset`, be sure to modify `offset` as well, or the values will coincide. **Read the instructions in the above section** before doing so.
Although not recommended, to make the ID for padding and the unknown character filler the same:
```python
>>> ToJyutping.g2p('咩話……你話上個月上堂學法文文法用咗 $50,000!?', puncts_offset=0, offset=(7, 7, 0))
PhonemesList([(10, 45, 1), (21, 27, 2), (1,), (14, 46, 5), (21, 27, 6), (25, 83, 6), (16, 63, 3), (26, 91, 6), (25, 83, 5), (13, 68, 4), (22, 71, 6), (11, 34, 3), (10, 40, 2), (10, 40, 4), (11, 34, 3), (26, 77, 6), (23, 63, 2), (0,), (0,), (0,), (0,), (0,), (0,), (0,), (3,), (4,)])
```
This causes all values to be decreased by 1 except for the tones.
#### Decimal Numbers Detection
From the very first example in the _[g2p Conversion Function](#grapheme-to-phoneme-conversion-function)_ section, you can see that the thousand separator `,` is converted to `(1,)` instead of `(3,)`. In fact, the library automatically checks if `,` and `.`, etc. are between digits, prevents them from being treated as commas and treats them as if they were part of the number. As the library does not (yet) convert Arabic decimal numbers to their pronunciations, they are converted to `(1,)`, the same as the result of converting any of the digits. In addition, the library detects negative signs by checking whether `-` etc. are preceding a digit and no `-` or unknown characters precede them and converts them to `(1,)` instead of `(6,)`.
If this is not desired, you may suppress this behavior by passing `decimal_check=False` (Note the `(3,)` between the `(1,)`s):
```python
>>> ToJyutping.g2p('咩話……你話上個月上堂學法文文法用咗 $50,000!?', decimal_check=False)
PhonemesList([(11, 46, 1), (22, 28, 2), (2,), (15, 47, 5), (22, 28, 6), (26, 84, 6), (17, 64, 3), (27, 92, 6), (26, 84, 5), (14, 69, 4), (23, 72, 6), (12, 35, 3), (11, 41, 2), (11, 41, 4), (12, 35, 3), (27, 78, 6), (24, 64, 2), (1,), (1,), (1,), (3,), (1,), (1,), (1,), (4,), (5,)])
```
#### Extra Punctuations & Custom Punctuation Map
The default punctuation mapping is comprehensive and should cover most of the use cases. However, you may add your own punctuations and symbols or override the built-in mapping by the `extra_puncts` option:
```python
>>> ToJyutping.g2p('咩話……你話上個月上堂學法文文法用咗 $50,000!?', extra_puncts={'$': 8, '!': 9})
PhonemesList([(13, 48, 1), (24, 30, 2), (2,), (17, 49, 5), (24, 30, 6), (28, 86, 6), (19, 66, 3), (29, 94, 6), (28, 86, 5), (16, 71, 4), (25, 74, 6), (14, 37, 3), (13, 43, 2), (13, 43, 4), (14, 37, 3), (29, 80, 6), (26, 66, 2), (8,), (1,), (1,), (1,), (1,), (1,), (1,), (9,), (5,)])
```
> [!IMPORTANT]
> - You cannot override pronunciations of Chinese characters by `extra_puncts`. Use `customize` for that purpose.
> - If you want to override a built-in punctuation, you will need to specify all “variants” of it by yourself. For example, `!`, `︕`, `﹗` and `!` should map to the same ID.
If any of the values in `extra_puncts` are greater than 7, the `offset` parameter is automatically adjusted to shift onsets and rhymes (and tones if `tone_same_seq`) by one plus the largest value, but you may modify it to suit your needs. **Read the instructions in the _[g2p Conversion Function](#grapheme-to-phoneme-conversion-function)_ section** before doing so.
If you would like to use your own mapping for some reason, you may specify the `puncts_map` option:
```python
>>> ToJyutping.g2p('咩話……你話上個月上堂學法文文法用咗 $50,000!?', puncts_map={'…': 2, '$': 3, '!': 4, '?': 5}, unknown_id=1)
PhonemesList([(9, 44, 1), (20, 26, 2), (2,), (13, 45, 5), (20, 26, 6), (24, 82, 6), (15, 62, 3), (25, 90, 6), (24, 82, 5), (12, 67, 4), (21, 70, 6), (10, 33, 3), (9, 39, 2), (9, 39, 4), (10, 33, 3), (25, 76, 6), (22, 62, 2), (3,), (1,), (1,), (1,), (1,), (1,), (1,), (4,), (5,)])
```
> [!IMPORTANT]
> - You **must** specify `unknown_id` if `puncts_map` is provided.
> - You cannot override pronunciations of Chinese characters by `puncts_map`. Use `customize` for that purpose.
> - You will need to specify all “variants” of a punctuation by yourself. For example, `!`, `︕`, `﹗` and `!` should map to the same ID.
The `offset` parameter is automatically calculated for you. By default, it shifts onsets and rhymes (and tones if `tone_same_seq`) by one plus the maximum of `unknown_id` and all the values in `puncts_map`. However, you may modify it to suit your needs. Again, be sure to **read the instructions in the _[g2p Conversion Function](#grapheme-to-phoneme-conversion-function)_ section** before doing so.
The `unknown_id` parameter may be used alone or with the `extra_puncts` option as well, but we generally do not recommend doing so. The `offset` parameter is computed similarly if it is greater than 7.
## Helper
```python
>>> ToJyutping.jyutping2ipa('jat1')
'jɐt̚˥'
>>> ToJyutping.jyutping2ipa('cin1 ngaa5')
't͡sʰiːn˥.ŋaː˩˧'
```
Note that autocorrection is intentionally not included in this helper, and an error is thrown if strings like `jyt6` are passed into the function.
Punctuation is ignored in the helper.
Raw data
{
"_id": null,
"home_page": "https://github.com/CanCLID/ToJyutping",
"name": "ToJyutping",
"maintainer": null,
"docs_url": null,
"requires_python": "<4,>=3.8",
"maintainer_email": null,
"keywords": "cantonese chinese chinese-characters jyutping linguistics natural-language-processing nlp romanization simplified-chinese traditional-chinese",
"author": "Cantonese Computational Linguistics Infrastructure Development Workgroup",
"author_email": "support@jyutping.org",
"download_url": "https://files.pythonhosted.org/packages/ff/42/08ba0a43866a10504c2b74effc21b8f428474b551ab67b3145f444e14824/tojyutping-3.1.0.tar.gz",
"platform": null,
"description": "# ToJyutping\n\n![pip](https://github.com/CanCLID/ToJyutping/workflows/Python%20Package/badge.svg)\n\n\u7cb5\u8a9e\u62fc\u97f3\u81ea\u52d5\u6a19\u8a3b\u5de5\u5177 Cantonese Pronunciation Automatic Labeling Tool\n\n## Installation\n\n```sh\n$ pip install ToJyutping\n```\n\n### In Other Languages\n\n- [JavaScript/TypeScript (npm) Version](https://www.npmjs.com/package/to-jyutping) ([Repo](https://github.com/CanCLID/to-jyutping))\n\n## Usage\n\nFor the 8 basic functions, examples are worth a thousand words:\n\n```python\n>>> import ToJyutping\n\n>>> ToJyutping.get_jyutping_list('\u5481\u5571\u8001\u4e16\u8981\u6c42\u4f62\u7b49\u9663\u8981\u958b\u6703\uff0c\u5269\u4f4e\u5605\u5622\u6211\u6703\u641e\u6382\u35ce\u5587\u3002')\n[('\u5481', 'gam3'), ('\u5571', 'ngaam1'), ('\u8001', 'lou5'), ('\u4e16', 'sai3'), ('\u8981', 'jiu1'), ('\u6c42', 'kau4'), ('\u4f62', 'keoi5'), ('\u7b49', 'dang2'), ('\u9663', 'zan6'), ('\u8981', 'jiu3'), ('\u958b', 'hoi1'), ('\u6703', 'wui2'), ('\uff0c', None), ('\u5269', 'zing6'), ('\u4f4e', 'dai1'), ('\u5605', 'ge3'), ('\u5622', 'je5'), ('\u6211', 'ngo5'), ('\u6703', 'wui5'), ('\u641e', 'gaau2'), ('\u6382', 'dim6'), ('\u35ce', 'gaa3'), ('\u5587', 'laa3'), ('\u3002', None)]\n\n>>> ToJyutping.get_jyutping('\u5481\u5571\u8001\u4e16\u8981\u6c42\u4f62\u7b49\u9663\u8981\u958b\u6703\uff0c\u5269\u4f4e\u5605\u5622\u6211\u6703\u641e\u6382\u35ce\u5587\u3002')\n'\u5481(gam3)\u5571(ngaam1)\u8001(lou5)\u4e16(sai3)\u8981(jiu1)\u6c42(kau4)\u4f62(keoi5)\u7b49(dang2)\u9663(zan6)\u8981(jiu3)\u958b(hoi1)\u6703(wui2)\uff0c\u5269(zing6)\u4f4e(dai1)\u5605(ge3)\u5622(je5)\u6211(ngo5)\u6703(wui5)\u641e(gaau2)\u6382(dim6)\u35ce(gaa3)\u5587(laa3)\u3002'\n\n>>> ToJyutping.get_jyutping_text('\u5481\u5571\u8001\u4e16\u8981\u6c42\u4f62\u7b49\u9663\u8981\u958b\u6703\uff0c\u5269\u4f4e\u5605\u5622\u6211\u6703\u641e\u6382\u35ce\u5587\u3002')\n'gam3 ngaam1 lou5 sai3 jiu1 kau4 keoi5 dang2 zan6 jiu3 hoi1 wui2, zing6 dai1 ge3 je5 ngo5 wui5 gaau2 dim6 gaa3 laa3.'\n\n>>> ToJyutping.get_jyutping_candidates('\u5481\u5571\u8001\u4e16\u8981\u6c42\u4f62\u7b49\u9663\u8981\u958b\u6703\uff0c\u5269\u4f4e\u5605\u5622\u6211\u6703\u641e\u6382\u35ce\u5587\u3002')\n[('\u5481', ['gam3', 'gam2', 'gam1', 'gam4']), ('\u5571', ['ngaam1', 'aam1', 'am1', 'ngam1']), ('\u8001', ['lou5', 'lou2']), ('\u4e16', ['sai3', 'sai2']), ('\u8981', ['jiu1', 'jiu3', 'jiu2']), ('\u6c42', ['kau4']), ('\u4f62', ['keoi5', 'heoi5']), ('\u7b49', ['dang2']), ('\u9663', ['zan6', 'zan2']), ('\u8981', ['jiu3', 'jiu2', 'jiu1']), ('\u958b', ['hoi1']), ('\u6703', ['wui2', 'wui5', 'wui6', 'wui3', 'kui2', 'kui3', 'kwui2']), ('\uff0c', []), ('\u5269', ['zing6', 'sing6']), ('\u4f4e', ['dai1']), ('\u5605', ['ge3', 'ge2', 'koi2', 'koi3']), ('\u5622', ['je5', 'e5']), ('\u6211', ['ngo5', 'o5']), ('\u6703', ['wui5', 'wui6', 'wui2', 'wui3', 'kui2', 'kui3', 'kwui2']), ('\u641e', ['gaau2']), ('\u6382', ['dim6', 'dim3', 'dim1']), ('\u35ce', ['gaa3', 'ga3', 'gaa2', 'gaa1', 'gaa4']), ('\u5587', ['laa3', 'laa1', 'laak3', 'laa5', 'laat3']), ('\u3002', [])]\n\n>>> ToJyutping.get_ipa_list('\u5481\u5571\u8001\u4e16\u8981\u6c42\u4f62\u7b49\u9663\u8981\u958b\u6703\uff0c\u5269\u4f4e\u5605\u5622\u6211\u6703\u641e\u6382\u35ce\u5587\u3002')\n[('\u5481', 'k\u0250m\u02e7'), ('\u5571', '\u014ba\u02d0m\u02e5'), ('\u8001', 'lou\u032f\u02e9\u02e7'), ('\u4e16', 's\u0250i\u032f\u02e7'), ('\u8981', 'ji\u02d0u\u032f\u02e5'), ('\u6c42', 'k\u02b0\u0250u\u032f\u02e8\u02e9'), ('\u4f62', 'k\u02b0\u0275y\u0311\u02e9\u02e7'), ('\u7b49', 't\u0250\u014b\u02e7\u02e5'), ('\u9663', 't\u0361s\u0250n\u02e8'), ('\u8981', 'ji\u02d0u\u032f\u02e7'), ('\u958b', 'h\u0254\u02d0i\u032f\u02e5'), ('\u6703', 'wu\u02d0i\u032f\u02e7\u02e5'), ('\uff0c', None), ('\u5269', 't\u0361se\u014b\u02e8'), ('\u4f4e', 't\u0250i\u032f\u02e5'), ('\u5605', 'k\u025b\u02d0\u02e7'), ('\u5622', 'j\u025b\u02d0\u02e9\u02e7'), ('\u6211', '\u014b\u0254\u02d0\u02e9\u02e7'), ('\u6703', 'wu\u02d0i\u032f\u02e9\u02e7'), ('\u641e', 'ka\u02d0u\u032f\u02e7\u02e5'), ('\u6382', 'ti\u02d0m\u02e8'), ('\u35ce', 'ka\u02d0\u02e7'), ('\u5587', 'la\u02d0\u02e7'), ('\u3002', None)]\n\n>>> ToJyutping.get_ipa('\u5481\u5571\u8001\u4e16\u8981\u6c42\u4f62\u7b49\u9663\u8981\u958b\u6703\uff0c\u5269\u4f4e\u5605\u5622\u6211\u6703\u641e\u6382\u35ce\u5587\u3002')\n'\u5481[k\u0250m\u02e7]\u5571[\u014ba\u02d0m\u02e5]\u8001[lou\u032f\u02e9\u02e7]\u4e16[s\u0250i\u032f\u02e7]\u8981[ji\u02d0u\u032f\u02e5]\u6c42[k\u02b0\u0250u\u032f\u02e8\u02e9]\u4f62[k\u02b0\u0275y\u0311\u02e9\u02e7]\u7b49[t\u0250\u014b\u02e7\u02e5]\u9663[t\u0361s\u0250n\u02e8]\u8981[ji\u02d0u\u032f\u02e7]\u958b[h\u0254\u02d0i\u032f\u02e5]\u6703[wu\u02d0i\u032f\u02e7\u02e5]\uff0c\u5269[t\u0361se\u014b\u02e8]\u4f4e[t\u0250i\u032f\u02e5]\u5605[k\u025b\u02d0\u02e7]\u5622[j\u025b\u02d0\u02e9\u02e7]\u6211[\u014b\u0254\u02d0\u02e9\u02e7]\u6703[wu\u02d0i\u032f\u02e9\u02e7]\u641e[ka\u02d0u\u032f\u02e7\u02e5]\u6382[ti\u02d0m\u02e8]\u35ce[ka\u02d0\u02e7]\u5587[la\u02d0\u02e7]\u3002'\n\n>>> ToJyutping.get_ipa_text('\u5481\u5571\u8001\u4e16\u8981\u6c42\u4f62\u7b49\u9663\u8981\u958b\u6703\uff0c\u5269\u4f4e\u5605\u5622\u6211\u6703\u641e\u6382\u35ce\u5587\u3002')\n'k\u0250m\u02e7.\u014ba\u02d0m\u02e5.lou\u032f\u02e9\u02e7.s\u0250i\u032f\u02e7.ji\u02d0u\u032f\u02e5.k\u02b0\u0250u\u032f\u02e8\u02e9.k\u02b0\u0275y\u0311\u02e9\u02e7.t\u0250\u014b\u02e7\u02e5.t\u0361s\u0250n\u02e8.ji\u02d0u\u032f\u02e7.h\u0254\u02d0i\u032f\u02e5.wu\u02d0i\u032f\u02e7\u02e5 | t\u0361se\u014b\u02e8.t\u0250i\u032f\u02e5.k\u025b\u02d0\u02e7.j\u025b\u02d0\u02e9\u02e7.\u014b\u0254\u02d0\u02e9\u02e7.wu\u02d0i\u032f\u02e9\u02e7.ka\u02d0u\u032f\u02e7\u02e5.ti\u02d0m\u02e8.ka\u02d0\u02e7.la\u02d0\u02e7'\n\n>>> ToJyutping.get_ipa_candidates('\u5481\u5571\u8001\u4e16\u8981\u6c42\u4f62\u7b49\u9663\u8981\u958b\u6703\uff0c\u5269\u4f4e\u5605\u5622\u6211\u6703\u641e\u6382\u35ce\u5587\u3002')\n[('\u5481', ['k\u0250m\u02e7', 'k\u0250m\u02e7\u02e5', 'k\u0250m\u02e5', 'k\u0250m\u02e8\u02e9']), ('\u5571', ['\u014ba\u02d0m\u02e5', 'a\u02d0m\u02e5', '\u0250m\u02e5', '\u014b\u0250m\u02e5']), ('\u8001', ['lou\u032f\u02e9\u02e7', 'lou\u032f\u02e7\u02e5']), ('\u4e16', ['s\u0250i\u032f\u02e7', 's\u0250i\u032f\u02e7\u02e5']), ('\u8981', ['ji\u02d0u\u032f\u02e5', 'ji\u02d0u\u032f\u02e7', 'ji\u02d0u\u032f\u02e7\u02e5']), ('\u6c42', ['k\u02b0\u0250u\u032f\u02e8\u02e9']), ('\u4f62', ['k\u02b0\u0275y\u0311\u02e9\u02e7', 'h\u0275y\u0311\u02e9\u02e7']), ('\u7b49', ['t\u0250\u014b\u02e7\u02e5']), ('\u9663', ['t\u0361s\u0250n\u02e8', 't\u0361s\u0250n\u02e7\u02e5']), ('\u8981', ['ji\u02d0u\u032f\u02e7', 'ji\u02d0u\u032f\u02e7\u02e5', 'ji\u02d0u\u032f\u02e5']), ('\u958b', ['h\u0254\u02d0i\u032f\u02e5']), ('\u6703', ['wu\u02d0i\u032f\u02e7\u02e5', 'wu\u02d0i\u032f\u02e9\u02e7', 'wu\u02d0i\u032f\u02e8', 'wu\u02d0i\u032f\u02e7', 'k\u02b0u\u02d0i\u032f\u02e7\u02e5', 'k\u02b0u\u02d0i\u032f\u02e7', 'k\u02b7\u02b0u\u02d0i\u032f\u02e7\u02e5']), ('\uff0c', []), ('\u5269', ['t\u0361se\u014b\u02e8', 'se\u014b\u02e8']), ('\u4f4e', ['t\u0250i\u032f\u02e5']), ('\u5605', ['k\u025b\u02d0\u02e7', 'k\u025b\u02d0\u02e7\u02e5', 'k\u02b0\u0254\u02d0i\u032f\u02e7\u02e5', 'k\u02b0\u0254\u02d0i\u032f\u02e7']), ('\u5622', ['j\u025b\u02d0\u02e9\u02e7', '\u025b\u02d0\u02e9\u02e7']), ('\u6211', ['\u014b\u0254\u02d0\u02e9\u02e7', '\u0254\u02d0\u02e9\u02e7']), ('\u6703', ['wu\u02d0i\u032f\u02e9\u02e7', 'wu\u02d0i\u032f\u02e8', 'wu\u02d0i\u032f\u02e7\u02e5', 'wu\u02d0i\u032f\u02e7', 'k\u02b0u\u02d0i\u032f\u02e7\u02e5', 'k\u02b0u\u02d0i\u032f\u02e7', 'k\u02b7\u02b0u\u02d0i\u032f\u02e7\u02e5']), ('\u641e', ['ka\u02d0u\u032f\u02e7\u02e5']), ('\u6382', ['ti\u02d0m\u02e8', 'ti\u02d0m\u02e7', 'ti\u02d0m\u02e5']), ('\u35ce', ['ka\u02d0\u02e7', 'k\u0250\u02e7', 'ka\u02d0\u02e7\u02e5', 'ka\u02d0\u02e5', 'ka\u02d0\u02e8\u02e9']), ('\u5587', ['la\u02d0\u02e7', 'la\u02d0\u02e5', 'la\u02d0k\u031a\u02e7', 'la\u02d0\u02e9\u02e7', 'la\u02d0t\u031a\u02e7']), ('\u3002', [])]\n```\n\nFor `get_jyutping_candidates` and `get_ipa_candidates`, pronunciations are sorted according to how likely they are to be correct in a sentence, with the first being the most likely.\n\nMethods may also be imported individually:\n\n```python\n>>> from ToJyutping import get_jyutping_text\n>>> get_jyutping_text('\u5481\u5571\u8001\u4e16\u8981\u6c42\u4f62\u7b49\u9663\u8981\u958b\u6703\uff0c\u5269\u4f4e\u5605\u5622\u6211\u6703\u641e\u6382\u35ce\u5587\u3002')\n'gam3 ngaam1 lou5 sai3 jiu1 kau4 keoi5 dang2 zan6 jiu3 hoi1 wui2, zing6 dai1 ge3 je5 ngo5 wui5 gaau2 dim6 gaa3 laa3.'\n```\n\nIn rare cases, the pronunciation of a single character can contain more than one syllable:\n\n```python\n>>> ToJyutping.get_jyutping_list('\u4e00\u74e9')\n[('\u4e00', 'jat1'), ('\u74e9', 'cin1 ngaa5')]\n>>> ToJyutping.get_ipa_list('\u4e00\u74e9')\n[('\u4e00', 'j\u0250t\u031a\u02e5'), ('\u74e9', 't\u0361s\u02b0i\u02d0n\u02e5.\u014ba\u02d0\u02e9\u02e7')]\n```\n\nThey are mostly dated ligature characters (\u5408\u5b57) coined to represent units with SI prefixes. \n\n## Custom Entries & Existing Entries Overriding or Exclusion\n\nWith an accuracy rate of 99%, the possibility of needing an adjustment is rare. However, Cantonese, like other varieties of Chinese, is mostly written in logographs, which means that homographs (\u540c\u5f62\u8a5e) that are indistinguishable out of context can occur. Consider the following sentence:\n\n> \u4e0a\u5802\u7d42\u65bc\u8b1b\u5230\u5206\u6578\n\nIn the above sentence, there are multiple possible pronunciations of \u4e0a, \u5230 and \u5206, and their meanings are different depending on how they are actually pronounced:\n\n| Pronunciation | Meaning |\n| --- | --- |\n| soeng**5** tong4 zung1 jyu1 gong2 dou**3** fan**1** sou3 | Attending the lesson, it finally came to talk about scores.<br>_(Perhaps the scores weren\u2019t available until today.)_ |\n| soeng**5** tong4 zung1 jyu1 gong2 dou**3** fan**6** sou3 | Attending the lesson, it finally came to talk about fractions.<br>_(Perhaps the progress of the math class was slow.)_ |\n| soeng**5** tong4 zung1 jyu1 gong2 dou**2** fan**1** sou3 | Attending the lesson, eventually it was able to talk about scores.<br>_(Perhaps the teacher wasn\u2019t allowed to reveal the scores until today.)_ |\n| soeng**5** tong4 zung1 jyu1 gong2 dou**2** fan**6** sou3 | Attending the lesson, eventually it was able to talk about fractions.<br>_(Perhaps the introduction to fractions requires some other concepts to be taught.)_ |\n| soeng**6** tong4 zung1 jyu1 gong2 dou**3** fan**1** sou3 | The previous lesson finally came to talk about scores.<br>_(Perhaps the teacher just made the scores available right before the previous lesson.)_ |\n| soeng**6** tong4 zung1 jyu1 gong2 dou**3** fan**6** sou3 | The previous lesson finally came to talk about fractions.<br>_(Perhaps the students just managed to catch up the progress in the math class.)_ |\n| soeng**6** tong4 zung1 jyu1 gong2 dou**2** fan**1** sou3 | Eventually, it was able to talk about scores in the previous lesson.<br>_(Perhaps the teacher was finally allowed to reveal the scores in the previous lesson.)_ |\n| soeng**6** tong4 zung1 jyu1 gong2 dou**2** fan**6** sou3 | Eventually, it was able to talk about fractions in the previous lesson.<br>_(Perhaps the teacher just finished teaching the other concepts required for learning fractions.)_ |\n\nThus, the library offers the ability to include custom entries and override or exclude built-in entries:\n\n```python\n>>> ToJyutping.get_jyutping_text('\u4e0a\u5802\u7d42\u65bc\u8b1b\u5230\u5206\u6578')\n'soeng5 tong4 zung1 jyu1 gong2 dou3 fan1 sou3'\n\n>>> converter_lesson = ToJyutping.customize({'\u4e0a\u5802': None, '\u5206\u6578': 'fan6 sou3'})\n>>> converter_lesson.get_jyutping_text('\u4e0a\u5802\u7d42\u65bc\u8b1b\u5230\u5206\u6578')\n'soeng6 tong4 zung1 jyu1 gong2 dou3 fan6 sou3'\n```\n\nIn the above example:\n\n- By default, the library special-cases the pronunciation of \u4e0a\u5802 to \u201csoeng5 tong4\u201d. Setting `\u4e0a\u5802` to `None` removes the special case and both \u4e0a and \u5802 now fallback to the their default pronunciations, which are \u201csoeng6\u201d and \u201ctong4\u201d respectively.\n- By default, the library does not special-case \u5206\u6578. Thus, the pronunciations of each individual characters, which in this case are \u201cfan1\u201d and \u201csou3\u201d, are used. By including the entry `\u5206\u6578` and setting it to `fan6 sou3`, the converter outputs `fan6 sou3` when `\u5206\u6578` is encountered.\n\nIn general, setting any built-in entry to `None` fallbacks it to shorter matches and ultimately individual character pronunciations if there isn\u2019t a match:\n\n```python\n>>> ToJyutping.get_jyutping_text('\u597d\u5b78\u751f')\n'hou2 hok6 saang1'\n\n>>> converter_studious = ToJyutping.customize({'\u597d\u5b78\u751f': None})\n>>> converter_studious.get_jyutping_text('\u597d\u5b78\u751f')\n'hou3 hok6 saang1' # Using shorter matches \u597d\u5b78 and \u751f\n\n>>> converter_good_student = converter_studious.customize({'\u597d\u5b78': None})\n>>> converter_good_student.get_jyutping_text('\u597d\u5b78\u751f')\n'hou2 hok6 saang1' # Using individual character pronunciations as it can\u2019t be decomposed further\n```\n\nConverters can be chained without affecting each other:\n\n```python\n>>> converter_dou2 = converter_lesson.customize({'\u5230': 'dou2'})\n>>> converter_None = converter_lesson.customize({'\u5230': None})\n\n>>> converter_dou2.get_jyutping_text('\u4e0a\u5802\u7d42\u65bc\u8b1b\u5230\u5206\u6578')\n'soeng6 tong4 zung1 jyu1 gong2 dou2 fan6 sou3'\n\n>>> converter_None.get_jyutping_text('\u4e0a\u5802\u7d42\u65bc\u8b1b\u5230\u5206\u6578')\n'soeng6 tong4 zung1 jyu1 gong2 [\u2026] fan6 sou3'\n\n>>> ToJyutping.get_jyutping_text('\u4e0a\u5802\u7d42\u65bc\u8b1b\u5230\u5206\u6578')\n'soeng5 tong4 zung1 jyu1 gong2 dou3 fan1 sou3' # Also not affected\n```\n\n> [!WARNING]\n> - This library only offers basic customization functionality. If there are longer built-in word entries, they aren\u2019t overridden:\n> ```python\n> >>> converter_dou2.get_jyutping_text('\u7b11\u5230\u8f46\u5730')\n> 'siu3 dou3 luk1 dei2'\n> \n> >>> converter_None.get_jyutping_text('\u7b11\u5230\u8f46\u5730')\n> 'siu3 dou3 luk1 dei2'\n> \n> >>> converter_another_lesson = ToJyutping.customize({'\u4e0a': None, '\u5206': 'fan6'})\n> >>> converter_another_lesson.get_jyutping_text('\u4e0a\u5802\u7d42\u65bc\u8b1b\u5230\u5206\u6578')\n> 'soeng5 tong4 zung1 jyu1 gong2 dou3 fan6 sou3'\n> ```\n> In the second example, their isn\u2019t an entry for \u5206\u6578, so \u5206 is patched successfully. However, this is not the case for \u4e0a since the longer built-in entry \u4e0a\u5802 is prioritized.\n> - The original pronunciations will be lost. If you are using `get_jyutping_candidates` or `get_ipa_candidates`, you will need to include the pronunciations manually:\n> ```python\n> >>> \u5230_original_pronunciations = ToJyutping.get_jyutping_candidates('\u5230')\n> >>> \u5230_original_pronunciations\n> [('\u5230', ['dou3', 'dou2'])]\n> >>> converter_dou2_dou3 = converter_lesson.customize({'\u5230': ['dou2', *\u5230_original_pronunciations[0][1]]})\n> >>> converter_dou2_dou3.get_jyutping_candidates('\u5230')\n> [('\u5230', ['dou2', 'dou3'])]\n> ```\n> Notice how the library automatically deduplicates the values for you.\n\n## Grapheme-to-Phoneme Conversion Function\n\nIntended for machine learning purposes (especially text-to-speech and automatic speech recognition), a `g2p` function is provided to minimize the possibility of conversion problems due to lack of linguistic knowledge. It takes a string and outputs tuples of 3 integers (ranged from 8 to 94 inclusive) representing the **onset** (\u8072\u6bcd), **rhyme** (\u97fb\u6bcd) and **tone** (\u8072\u8abf) of a syllable. Punctuations are included as singletons (1-tuples) and range from 1 to 7. They are detailed in the _[Punctuations](#punctuations)_ section below.\n\n> [!NOTE]\n> In this section, the word _\u201cpunctuation\u201d_ may be delibrately written in the plural form to avoid confusion.\n\n> [!IMPORTANT]\n> In this section, **punctuations** include the unknown character filler, described in the _Punctuations_ section.\n\n```python\n>>> ToJyutping.g2p('\u54a9\u8a71\u2026\u2026\u4f60\u8a71\u4e0a\u500b\u6708\u4e0a\u5802\u5b78\u6cd5\u6587\u6587\u6cd5\u7528\u5497 $50,000\uff01\uff1f')\nPhonemesList(\n [(11, 46, 1), (22, 28, 2), (2,), (15, 47, 5), (22, 28, 6), (26, 84, 6), (17, 64, 3), (27, 92, 6), (26, 84, 5), (14, 69, 4), (23, 72, 6), (12, 35, 3), (11, 41, 2), (11, 41, 4), (12, 35, 3), (27, 78, 6), (24, 64, 2), (1,), (1,), (1,), (1,), (1,), (1,), (1,), (4,), (5,)],\n segmentals=[11, 46, 22, 28, 2, 15, 47, 22, 28, 26, 84, 17, 64, 27, 92, 26, 84, 14, 69, 23, 72, 12, 35, 11, 41, 11, 41, 12, 35, 27, 78, 24, 64, 1, 1, 1, 1, 1, 1, 1, 4, 5],\n tones=[1, 1, 2, 2, 0, 5, 5, 6, 6, 6, 6, 3, 3, 6, 6, 5, 5, 4, 4, 6, 6, 3, 3, 2, 2, 4, 4, 3, 3, 6, 6, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0],\n lengths=[2, 2, 1, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]\n)\n```\n\n(Formatted manually, not exactly the same as what `repr` produces)\n\n`PhonemesList` is a `list` with convenient attributes common to syllables handling (particularly in VITS2). `segmentals` is a `list` `property` with onsets, rhymes and punctuations included, while each element of `tones` (also a `list` `property`) gives the tone corresponding to each onset or rhyme, or `0` if the corresponding element is a punctuation. `lengths` (an ordinary `list`) suggests how many elements of `segmentals` or `tones` is each character of the original input correspond to. It is guaranteed that:\n\n- `lengths` always sum up to the number of elements of both `segmentals` and `tones`; and\n- The length of the input is always the same as that of `lengths`, such that they can be zipped nicely.\n\nNote that the number of elements of `lengths` does not necessary match that of the original `PhonemesList`, since the input may contain polysyllabic characters, consecutive punctuations of the same category, or whitespaces.\n\n> [!WARNING]\n> Unlike the dynamic `segmentals` and `tones` properties, `lengths` is not recalculated even if the `PhonemesList` is modified, since it requires the original input to be taken into account. We do not recommend modifying `PhonemesList` either.\n\nFrom the above example, you can see that the tone values are coincidently the same as some of the onsets, as it is a more common practice to separate tones into another sequence (this is what VITS2 expects, for example). If this is undesirable, pass `tone_same_seq=True` to output integers ranged from 8 up to 100:\n\n(From now on, the attributes are not shown. Try them out and reveal them yourselves! However, for the case setting `tone_same_seq` to `True`, you probably don\u2019t need them and just need to flatten the list.)\n\n```python\n>>> ToJyutping.g2p('\u54a9\u8a71\u2026\u2026\u4f60\u8a71\u4e0a\u500b\u6708\u4e0a\u5802\u5b78\u6cd5\u6587\u6587\u6cd5\u7528\u5497 $50,000\uff01\uff1f', tone_same_seq=True)\nPhonemesList([(11, 46, 95), (22, 28, 96), (2,), (15, 47, 99), (22, 28, 100), (26, 84, 100), (17, 64, 97), (27, 92, 100), (26, 84, 99), (14, 69, 98), (23, 72, 100), (12, 35, 97), (11, 41, 96), (11, 41, 98), (12, 35, 97), (27, 78, 100), (24, 64, 96), (1,), (1,), (1,), (1,), (1,), (1,), (1,), (4,), (5,)])\n```\n\nAdditionally, the second `offset` argument can be specified to shift the output values by a certain amount:\n\n```python\n>>> ToJyutping.g2p('\u54a9\u8a71\u2026\u2026\u4f60\u8a71\u4e0a\u500b\u6708\u4e0a\u5802\u5b78\u6cd5\u6587\u6587\u6cd5\u7528\u5497 $50,000\uff01\uff1f', offset=100)\nPhonemesList([(103, 138, 101), (114, 120, 102), (2,), (107, 139, 105), (114, 120, 106), (118, 176, 106), (109, 156, 103), (119, 184, 106), (118, 176, 105), (106, 161, 104), (115, 164, 106), (104, 127, 103), (103, 133, 102), (103, 133, 104), (104, 127, 103), (119, 170, 106), (116, 156, 102), (1,), (1,), (1,), (1,), (1,), (1,), (1,), (4,), (5,)])\n```\n\nYou may pass a triplet as well for shifting each element by different amounts:\n\n```python\n>>> ToJyutping.g2p('\u54a9\u8a71\u2026\u2026\u4f60\u8a71\u4e0a\u500b\u6708\u4e0a\u5802\u5b78\u6cd5\u6587\u6587\u6cd5\u7528\u5497 $50,000\uff01\uff1f', offset=(100, 200, 300))\nPhonemesList([(103, 238, 301), (114, 220, 302), (2,), (107, 239, 305), (114, 220, 306), (118, 276, 306), (109, 256, 303), (119, 284, 306), (118, 276, 305), (106, 261, 304), (115, 264, 306), (104, 227, 303), (103, 233, 302), (103, 233, 304), (104, 227, 303), (119, 270, 306), (116, 256, 302), (1,), (1,), (1,), (1,), (1,), (1,), (1,), (4,), (5,)])\n```\n\nBy default, `offset` shifts the onsets and rhymes by 8, which is one plus the number of elements in the default punctuation set described in the below section. The tones are **not** shifted unless `tone_same_seq` is set to `True`.\n\n> [!WARNING]\n> This is not the case if you specify `offset` as a single value. The onsets, rhymes and tones are **all shifted** by `offset` and both the `segmentals` and `tones` properties are affected if you provide a single integer. This may not be desirable unless `tone_same_seq` is set. To suppress shifting tones, pass `(amount, amount, 0)`. See below for an example.\n\n### Punctuations\n\nBy default, ToJyutping classifies punctuations into 6 categories, `.`, `,`, `!`, `?`, `-` and `'`, and numbers them from 2 to 7. 1's (you may label it `\u2026` for convenience) are used to mark unknown characters in the input string.\n\nThe reason for using a filler instead of raising an error is that we want to avoid data-driven errors. Strings in the dataset are known, but this is not the case for user input. If this is undesirable, you will need to look for `(1,)` in the output list and raise the error yourself.\n\nConsecutive punctuations of the same category are collapsed into a single element. For example, both `\u2026\u2026` and `......` become a single `(2,)`. However, this does not applies to the unknown character filler. This is because we want to maintain the length of the audio when there are multiple consecutive unknown characters.\n\nYou may supply `puncts_offset` to shift the punctuation IDs by a certain amount. For example, to swap the order of syllable IDs and punctuation IDs (i.e. make syllables range from 1 to 87 and punctuations range from 88 to 94):\n\n```python\n>>> ToJyutping.g2p('\u54a9\u8a71\u2026\u2026\u4f60\u8a71\u4e0a\u500b\u6708\u4e0a\u5802\u5b78\u6cd5\u6587\u6587\u6cd5\u7528\u5497 $50,000\uff01\uff1f', offset=(1, 1, 0), puncts_offset=88)\nPhonemesList([(4, 39, 1), (15, 21, 2), (89,), (8, 40, 5), (15, 21, 6), (19, 77, 6), (10, 57, 3), (20, 85, 6), (19, 77, 5), (7, 62, 4), (16, 65, 6), (5, 28, 3), (4, 34, 2), (4, 34, 4), (5, 28, 3), (20, 71, 6), (17, 57, 2), (88,), (88,), (88,), (88,), (88,), (88,), (88,), (91,), (92,)])\n```\n\nThe `puncts_offset` argument **defaults to 1**, since 0 is commonly used as the ID for padding. However, if `puncts_map` (detailed in the [Custom Punctuation Map](#custom-punctuation-map) section below) is provided, it\u2019s defaulted to 0.\n\n> [!WARNING]\n> The `offset` and `puncts_offset` arguments do not affect each other. If you changed `puncts_offset`, be sure to modify `offset` as well, or the values will coincide. **Read the instructions in the above section** before doing so.\n\nAlthough not recommended, to make the ID for padding and the unknown character filler the same:\n\n```python\n>>> ToJyutping.g2p('\u54a9\u8a71\u2026\u2026\u4f60\u8a71\u4e0a\u500b\u6708\u4e0a\u5802\u5b78\u6cd5\u6587\u6587\u6cd5\u7528\u5497 $50,000\uff01\uff1f', puncts_offset=0, offset=(7, 7, 0))\nPhonemesList([(10, 45, 1), (21, 27, 2), (1,), (14, 46, 5), (21, 27, 6), (25, 83, 6), (16, 63, 3), (26, 91, 6), (25, 83, 5), (13, 68, 4), (22, 71, 6), (11, 34, 3), (10, 40, 2), (10, 40, 4), (11, 34, 3), (26, 77, 6), (23, 63, 2), (0,), (0,), (0,), (0,), (0,), (0,), (0,), (3,), (4,)])\n```\n\nThis causes all values to be decreased by 1 except for the tones.\n\n#### Decimal Numbers Detection\n\nFrom the very first example in the _[g2p Conversion Function](#grapheme-to-phoneme-conversion-function)_ section, you can see that the thousand separator `,` is converted to `(1,)` instead of `(3,)`. In fact, the library automatically checks if `,` and `.`, etc. are between digits, prevents them from being treated as commas and treats them as if they were part of the number. As the library does not (yet) convert Arabic decimal numbers to their pronunciations, they are converted to `(1,)`, the same as the result of converting any of the digits. In addition, the library detects negative signs by checking whether `-` etc. are preceding a digit and no `-` or unknown characters precede them and converts them to `(1,)` instead of `(6,)`.\n\nIf this is not desired, you may suppress this behavior by passing `decimal_check=False` (Note the `(3,)` between the `(1,)`s):\n\n```python\n>>> ToJyutping.g2p('\u54a9\u8a71\u2026\u2026\u4f60\u8a71\u4e0a\u500b\u6708\u4e0a\u5802\u5b78\u6cd5\u6587\u6587\u6cd5\u7528\u5497 $50,000\uff01\uff1f', decimal_check=False)\nPhonemesList([(11, 46, 1), (22, 28, 2), (2,), (15, 47, 5), (22, 28, 6), (26, 84, 6), (17, 64, 3), (27, 92, 6), (26, 84, 5), (14, 69, 4), (23, 72, 6), (12, 35, 3), (11, 41, 2), (11, 41, 4), (12, 35, 3), (27, 78, 6), (24, 64, 2), (1,), (1,), (1,), (3,), (1,), (1,), (1,), (4,), (5,)])\n```\n\n#### Extra Punctuations & Custom Punctuation Map\n\nThe default punctuation mapping is comprehensive and should cover most of the use cases. However, you may add your own punctuations and symbols or override the built-in mapping by the `extra_puncts` option:\n\n```python\n>>> ToJyutping.g2p('\u54a9\u8a71\u2026\u2026\u4f60\u8a71\u4e0a\u500b\u6708\u4e0a\u5802\u5b78\u6cd5\u6587\u6587\u6cd5\u7528\u5497 $50,000\uff01\uff1f', extra_puncts={'$': 8, '\uff01': 9})\nPhonemesList([(13, 48, 1), (24, 30, 2), (2,), (17, 49, 5), (24, 30, 6), (28, 86, 6), (19, 66, 3), (29, 94, 6), (28, 86, 5), (16, 71, 4), (25, 74, 6), (14, 37, 3), (13, 43, 2), (13, 43, 4), (14, 37, 3), (29, 80, 6), (26, 66, 2), (8,), (1,), (1,), (1,), (1,), (1,), (1,), (9,), (5,)])\n```\n\n> [!IMPORTANT]\n> - You cannot override pronunciations of Chinese characters by `extra_puncts`. Use `customize` for that purpose.\n> - If you want to override a built-in punctuation, you will need to specify all \u201cvariants\u201d of it by yourself. For example, `!`, `\ufe15`, `\ufe57` and `\uff01` should map to the same ID.\n\nIf any of the values in `extra_puncts` are greater than 7, the `offset` parameter is automatically adjusted to shift onsets and rhymes (and tones if `tone_same_seq`) by one plus the largest value, but you may modify it to suit your needs. **Read the instructions in the _[g2p Conversion Function](#grapheme-to-phoneme-conversion-function)_ section** before doing so.\n\nIf you would like to use your own mapping for some reason, you may specify the `puncts_map` option:\n\n```python\n>>> ToJyutping.g2p('\u54a9\u8a71\u2026\u2026\u4f60\u8a71\u4e0a\u500b\u6708\u4e0a\u5802\u5b78\u6cd5\u6587\u6587\u6cd5\u7528\u5497 $50,000\uff01\uff1f', puncts_map={'\u2026': 2, '$': 3, '\uff01': 4, '\uff1f': 5}, unknown_id=1)\nPhonemesList([(9, 44, 1), (20, 26, 2), (2,), (13, 45, 5), (20, 26, 6), (24, 82, 6), (15, 62, 3), (25, 90, 6), (24, 82, 5), (12, 67, 4), (21, 70, 6), (10, 33, 3), (9, 39, 2), (9, 39, 4), (10, 33, 3), (25, 76, 6), (22, 62, 2), (3,), (1,), (1,), (1,), (1,), (1,), (1,), (4,), (5,)])\n```\n\n> [!IMPORTANT]\n> - You **must** specify `unknown_id` if `puncts_map` is provided.\n> - You cannot override pronunciations of Chinese characters by `puncts_map`. Use `customize` for that purpose.\n> - You will need to specify all \u201cvariants\u201d of a punctuation by yourself. For example, `!`, `\ufe15`, `\ufe57` and `\uff01` should map to the same ID.\n\nThe `offset` parameter is automatically calculated for you. By default, it shifts onsets and rhymes (and tones if `tone_same_seq`) by one plus the maximum of `unknown_id` and all the values in `puncts_map`. However, you may modify it to suit your needs. Again, be sure to **read the instructions in the _[g2p Conversion Function](#grapheme-to-phoneme-conversion-function)_ section** before doing so.\n\nThe `unknown_id` parameter may be used alone or with the `extra_puncts` option as well, but we generally do not recommend doing so. The `offset` parameter is computed similarly if it is greater than 7.\n\n## Helper\n\n```python\n>>> ToJyutping.jyutping2ipa('jat1')\n'j\u0250t\u031a\u02e5'\n>>> ToJyutping.jyutping2ipa('cin1 ngaa5')\n't\u0361s\u02b0i\u02d0n\u02e5.\u014ba\u02d0\u02e9\u02e7'\n```\n\nNote that autocorrection is intentionally not included in this helper, and an error is thrown if strings like `jyt6` are passed into the function.\nPunctuation is ignored in the helper.\n",
"bugtrack_url": null,
"license": null,
"summary": "\u7cb5\u8a9e\u62fc\u97f3\u81ea\u52d5\u6a19\u8a3b\u5de5\u5177 Cantonese Pronunciation Automatic Labeling Tool",
"version": "3.1.0",
"project_urls": {
"Bug Reports": "https://github.com/CanCLID/ToJyutping/issues",
"Homepage": "https://github.com/CanCLID/ToJyutping",
"Source": "https://github.com/CanCLID/ToJyutping"
},
"split_keywords": [
"cantonese",
"chinese",
"chinese-characters",
"jyutping",
"linguistics",
"natural-language-processing",
"nlp",
"romanization",
"simplified-chinese",
"traditional-chinese"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "18364d05ac47c9630ed1e0089c0c0411786b13e76bd1610f196fd8b399705c05",
"md5": "341988741970f2ac608ee8bb5348949b",
"sha256": "d7cf56829de94859956f35dee0f25f35f73ffabd795e4bcd9f556aaaf0c3385e"
},
"downloads": -1,
"filename": "ToJyutping-3.1.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "341988741970f2ac608ee8bb5348949b",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": "<4,>=3.8",
"size": 294455,
"upload_time": "2024-08-22T04:41:37",
"upload_time_iso_8601": "2024-08-22T04:41:37.697954Z",
"url": "https://files.pythonhosted.org/packages/18/36/4d05ac47c9630ed1e0089c0c0411786b13e76bd1610f196fd8b399705c05/ToJyutping-3.1.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "ff4208ba0a43866a10504c2b74effc21b8f428474b551ab67b3145f444e14824",
"md5": "58d9d13d87e9f920108ad24cc44f8122",
"sha256": "1cd8e3afdfb9f1a48cfe78f815a9a8f651dcc595b9b6ecccaecfa6c1d3abd3c6"
},
"downloads": -1,
"filename": "tojyutping-3.1.0.tar.gz",
"has_sig": false,
"md5_digest": "58d9d13d87e9f920108ad24cc44f8122",
"packagetype": "sdist",
"python_version": "source",
"requires_python": "<4,>=3.8",
"size": 309115,
"upload_time": "2024-08-22T04:41:39",
"upload_time_iso_8601": "2024-08-22T04:41:39.516870Z",
"url": "https://files.pythonhosted.org/packages/ff/42/08ba0a43866a10504c2b74effc21b8f428474b551ab67b3145f444e14824/tojyutping-3.1.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-08-22 04:41:39",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "CanCLID",
"github_project": "ToJyutping",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "tojyutping"
}