taibun


| Field           | Value                                                                                                            |
| --------------- | ---------------------------------------------------------------------------------------------------------------- |
| Name            | taibun                                                                                                           |
| Version         | 1.1.7                                                                                                            |
| Home page       | https://github.com/andreihar/taibun                                                                              |
| Summary         | Taiwanese Hokkien Transliterator and Tokeniser                                                                   |
| Upload time     | 2024-08-31 20:25:01                                                                                              |
| Author          | Andrei Harbachov                                                                                                 |
| Requires Python | >=3.8                                                                                                            |
| License         | MIT                                                                                                              |
| Keywords        | python, taiwan, taiwanese, taigi, hokkien, romanization, transliteration, transliterator, tokenization, tokenizer |

[台語](readme/README-oan.md) | [國語](readme/README-cmn.md)



<!-- PROJECT LOGO -->
<br />
<div align="center">
  <a href="https://github.com/andreihar/taibun">
    <img src="https://github.com/andreihar/taibun/raw/main/readme/logo.png" alt="Logo" width="90" height="80">
  </a>
  
# Taibun



<!-- PROJECT SHIELDS -->
[![Contributions][contributions-badge]][contributions]
[![Live Demo][demo-badge]][demo]
[![Tests][tests-badge]][tests]
[![Release][release-badge]][release]
[![Licence][licence-badge]][licence]
[![LinkedIn][linkedin-badge]][linkedin]
[![Downloads][downloads-badge]][pypi]

**Taiwanese Hokkien Transliterator and Tokeniser**

It provides methods to customise transliteration and to retrieve information about Taiwanese Hokkien pronunciation.<br />
It also includes a word tokeniser for Taiwanese Hokkien.

[Report Bug][bug] •
[PyPI][pypi]

</div>



---



<!-- TABLE OF CONTENTS -->
<details open>
  <summary>Table of Contents</summary>
  <ol>
    <li><a href="#versions">Versions</a></li>
    <li><a href="#install">Install</a></li>
    <li>
      <a href="#usage">Usage</a>
      <ul>
        <li>
          <a href="#converter">Converter</a>
          <ul>
            <li><a href="#system">System</a></li>
            <li><a href="#dialect">Dialect</a></li>
            <li><a href="#format">Format</a></li>
            <li><a href="#delimiter">Delimiter</a></li>
            <li><a href="#sandhi">Sandhi</a></li>
            <li><a href="#punctuation">Punctuation</a></li>
            <li><a href="#convert-non-cjk">Convert non-CJK</a></li>
          </ul>
        </li>
        <li>
          <a href="#tokeniser">Tokeniser</a>
          <ul>
            <li><a href="#keep-original">Keep original</a></li>
          </ul>
        </li>
        <li><a href="#other-functions">Other Functions</a></li>
      </ul>
    </li>
    <li><a href="#example">Example</a></li>
    <li><a href="#data">Data</a></li>
    <li><a href="#acknowledgements">Acknowledgements</a></li>
    <li><a href="#licence">Licence</a></li>
  </ol>
</details>



<!-- OTHER VERSIONS -->
## Versions

[![JavaScript Version][js-badge]][js-link]



<!-- INSTALL -->
## Install

Taibun can be installed from [PyPI][pypi]:

```bash
$ pip install taibun
```



<!-- USAGE -->
## Usage

### Converter

The `Converter` class transliterates Chinese characters into the chosen transliteration system, using the parameters specified by the developer. It works with both Traditional and Simplified characters.

```python
# Constructor
c = Converter(system, dialect, format, delimiter, sandhi, punctuation, convert_non_cjk)

# Transliterate Chinese characters
c.get(input)
```

#### System

`system` String - system of transliteration.

* `Tailo` (default) - [Tâi-uân Lô-má-jī Phing-im Hong-àn][tailo-wiki]
* `POJ` - [Pe̍h-ōe-jī][poj-wiki]
* `Zhuyin` - [Taiwanese Phonetic Symbols][zhuyin-wiki]
* `TLPA` - [Taiwanese Language Phonetic Alphabet][tlpa-wiki]
* `Pingyim` - [Bbánlám Uē Pìngyīm Hōng'àn][pingyim-wiki]
* `Tongiong` - [Daī-ghî Tōng-iōng Pīng-im][tongiong-wiki]
* `IPA` - [International Phonetic Alphabet][ipa-wiki]

| text | Tailo   | POJ     | Zhuyin      | TLPA      | Pingyim | Tongiong | IPA         |
| ---- | ------- | ------- | ----------- | --------- | ------- | -------- | ----------- |
| 台灣 | Tâi-uân | Tâi-oân | ㄉㄞˊ ㄨㄢˊ | Tai5 uan5 | Dáiwán  | Tāi-uǎn  | Tai²⁵ uan²⁵ |

#### Dialect

`dialect` String - preferred pronunciation.

* `south` (default) - [Zhangzhou][zhangzhou-wiki]-leaning pronunciation
* `north` - [Quanzhou][quanzhou-wiki]-leaning pronunciation
* `singapore` - Quanzhou-leaning pronunciation with [Singaporean characteristics][singapore-wiki]

| text           | south                       | north                       | singapore                  |
| -------------- | --------------------------- | --------------------------- | -------------------------- |
| 五月節我啉咖啡 | Gōo-gue̍h-tseh guá lim ka-pi | Gōo-ge̍h-tsueh guá lim ka-pi | Gōo-ge̍h-tsueh uá lim ko-pi |

#### Format

`format` String - format in which tones will be represented in the converted sentence.

* `mark` (default) - uses diacritics for each syllable. Not available for TLPA
* `number` - adds a number representing the tone at the end of each syllable
* `strip` - removes all tone marking

| text | mark    | number    | strip   |
| ---- | ------- | --------- | ------- |
| 台灣 | Tâi-uân | Tai5-uan5 | Tai-uan |
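The relationship between the three formats can be sketched without Taibun itself. The helpers below are hypothetical (they are not part of Taibun's API) and assume Tailo text: one strips tone digits from `number` output, the other strips combining tone diacritics from `mark` output. The digit helper would also remove any ordinary numerals in the text, so treat it as an illustration only.

```python
import re
import unicodedata


def strip_tone_numbers(text: str) -> str:
    """Drop tone digits (1-9) from numbered romanisation. Hypothetical helper."""
    return re.sub(r"[1-9]", "", text)


def strip_tone_marks(text: str) -> str:
    """Drop combining tone diacritics from Tailo text. Hypothetical helper."""
    # Decompose so that each tone mark becomes a separate combining character
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" covers the combining marks used for Tailo tones
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", stripped)


print(strip_tone_numbers("Tai5-uan5"))  # Tai-uan
print(strip_tone_marks("Tâi-uân"))      # Tai-uan
```

Both calls reproduce the `strip` column of the table above from the `number` and `mark` columns respectively.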

#### Delimiter

`delimiter` String - sets the delimiter character that will be placed in between syllables of a word.

Default value depends on the chosen `system`:

* `'-'` - for `Tailo`, `POJ`, `Tongiong`
* `''` - for `Pingyim`
* `' '` - for `Zhuyin`, `TLPA`, `IPA`

| text | '-'     | ''     | ' '     |
| ---- | ------- | ------ | ------- |
| 台灣 | Tâi-uân | Tâiuân | Tâi uân |
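The system-dependent default can be pictured as a simple lookup. The sketch below is not Taibun's implementation; `DEFAULT_DELIMITERS`, `default_delimiter`, and `join_syllables` are hypothetical names that only mirror the defaults listed above.

```python
from typing import Optional, Sequence

# Documented defaults, keyed by transliteration system (hypothetical table)
DEFAULT_DELIMITERS = {
    "Tailo": "-", "POJ": "-", "Tongiong": "-",
    "Pingyim": "",
    "Zhuyin": " ", "TLPA": " ", "IPA": " ",
}


def default_delimiter(system: str) -> str:
    """Return the documented default delimiter for a system."""
    return DEFAULT_DELIMITERS[system]


def join_syllables(syllables: Sequence[str], system: str = "Tailo",
                   delimiter: Optional[str] = None) -> str:
    """Join the syllables of one word, falling back to the system default."""
    if delimiter is None:
        delimiter = default_delimiter(system)
    return delimiter.join(syllables)


print(join_syllables(["Tâi", "uân"]))                 # Tâi-uân
print(join_syllables(["Tâi", "uân"], delimiter=""))   # Tâiuân
print(join_syllables(["Tâi", "uân"], delimiter=" "))  # Tâi uân
```

An explicit `delimiter` always wins; the lookup is only consulted when none is given.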

#### Sandhi

`sandhi` String - applies the [sandhi rules of Taiwanese Hokkien][sandhi-wiki].

Since it's difficult to encode all sandhi rules, Taibun provides multiple modes for sandhi conversion to allow for customised sandhi handling.

* `none` - doesn't perform any tone sandhi
* `auto` - the closest approximation to fully correct Taiwanese tone sandhi, with proper handling of pronouns, suffixes, and words containing 仔
* `exc_last` - changes tone for every syllable except for the last one
* `incl_last` - changes tone for every syllable including the last one

Default value depends on the chosen `system`:

* `auto` - for `Tongiong`
* `none` - for `Tailo`, `POJ`, `Zhuyin`, `TLPA`, `Pingyim`, `IPA`

| text             | none                    | auto                   | exc_last               | incl_last              |
| ---------------- | ----------------------- | ---------------------- | ---------------------- | ---------------------- |
| 這是你的茶桌仔無 | Tse sī lí ê tê-toh-á bô | Tse sì li ē tē-to-á bô | Tsē sì li ē tē-tó-a bô | Tsē sì li ē tē-tó-a bō |

Sandhi rules also change depending on the dialect chosen.

| text | no sandhi | south   | north / singapore |
| ---- | --------- | ------- | ----------------- |
| 台灣 | Tâi-uân   | Tāi-uân | Tài-uân           |
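To make the `exc_last` and `incl_last` modes and the dialect difference concrete, here is a minimal sketch that works on a single word in numbered-tone form. It is not Taibun's implementation: the function names are hypothetical, the tone tables cover only the unchecked tones 1, 2, 3, 5 and 7 (checked tones 4 and 8 are left unchanged), and the `auto` mode is not modelled.

```python
# Sandhi changes for unchecked tones; tone 5 is where the dialects differ
SANDHI_SOUTH = {"1": "7", "7": "3", "3": "2", "2": "1", "5": "7"}
SANDHI_NORTH = {**SANDHI_SOUTH, "5": "3"}


def sandhi_syllable(syllable: str, dialect: str = "south") -> str:
    """Apply the tone change to one syllable such as 'Tai5'."""
    table = SANDHI_SOUTH if dialect == "south" else SANDHI_NORTH
    base, tone = syllable[:-1], syllable[-1:]
    # Syllables without a listed tone digit are returned unchanged
    return base + table.get(tone, tone)


def sandhi_word(word: str, mode: str = "exc_last", dialect: str = "south") -> str:
    """Apply sandhi to a hyphenated word according to the chosen mode."""
    if mode == "none":
        return word
    if mode not in ("exc_last", "incl_last"):
        raise ValueError(f"unsupported mode: {mode}")
    syllables = word.split("-")
    # Number of leading syllables whose tone changes
    count = len(syllables) if mode == "incl_last" else len(syllables) - 1
    changed = [sandhi_syllable(s, dialect) for s in syllables[:count]]
    return "-".join(changed + syllables[count:])


print(sandhi_word("Tai5-uan5"))                    # Tai7-uan5
print(sandhi_word("Tai5-uan5", dialect="north"))   # Tai3-uan5
print(sandhi_word("Tai5-uan5", mode="incl_last"))  # Tai7-uan7
```

Tone 7 corresponds to the macron in Tāi and tone 3 to the grave accent in Tài, which matches the dialect table above.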

#### Punctuation

`punctuation` String - defines how punctuation and sentence capitalisation are handled in the converted sentence.

* `format` (default) - converts Chinese-style punctuation to Latin-style punctuation and capitalises words at the beginning of each sentence
* `none` - preserves Chinese-style punctuation and doesn't capitalise words at the beginning of new sentences

| text                                                                           | format                                                                                            | none                                                                                                 |
| ------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------- |
| 這是臺南,簡稱「南」(白話字:Tâi-lâm;注音符號:ㄊㄞˊ ㄋㄢˊ,國語:Táinán)。 | Tse sī Tâi-lâm, kán-tshing "lâm" (Pe̍h-uē-jī: Tâi-lâm; tsù-im hû-hō: ㄊㄞˊ ㄋㄢˊ, kok-gí: Táinán). | tse sī Tâi-lâm,kán-tshing「lâm」(Pe̍h-uē-jī:Tâi-lâm;tsù-im hû-hō:ㄊㄞˊ ㄋㄢˊ,kok-gí:Táinán)。 |
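What the `format` mode does can be approximated in a few lines. The sketch below is an assumption-laden illustration, not Taibun's code: `PUNCTUATION_MAP` and `format_punctuation_sketch` are hypothetical names, the mapping covers only six fullwidth marks, and quotation marks and brackets are left out.

```python
import re

# Hypothetical mapping for illustration; Taibun's own table may differ
PUNCTUATION_MAP = str.maketrans({
    "\uff0c": ", ",  # fullwidth comma
    "\u3002": ". ",  # ideographic full stop
    "\uff01": "! ",  # fullwidth exclamation mark
    "\uff1f": "? ",  # fullwidth question mark
    "\uff1a": ": ",  # fullwidth colon
    "\uff1b": "; ",  # fullwidth semicolon
})


def format_punctuation_sketch(text: str) -> str:
    """Convert Chinese-style punctuation and capitalise sentence starts."""
    latin = text.translate(PUNCTUATION_MAP).strip()
    # Upper-case the first letter of the text and of every new sentence
    return re.sub(
        r"(^|[.!?]\s+)(\w)",
        lambda m: m.group(1) + m.group(2).upper(),
        latin,
    )


print(format_punctuation_sketch("li2 ho2\uff01li2 tsiah8-pa2 bue7\uff1f"))
# Li2 ho2! Li2 tsiah8-pa2 bue7?
```

The `none` mode corresponds to skipping both steps and returning the text untouched.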

#### Convert non-CJK

`convert_non_cjk` Boolean - defines whether or not to convert non-Chinese words. Can be used to convert Tailo to another romanisation system.

* `True` - convert non-Chinese character words
* `False` (default) - convert only Chinese character words

| text      | False                   | True                    |
| --------- | ----------------------- | ----------------------- |
| 我食pháng | ㆣㄨㄚˋ ㄐㄧㄚㆷ˙ pháng | ㆣㄨㄚˋ ㄐㄧㄚㆷ˙ ㄆㄤˋ |

### Tokeniser

`Tokeniser` class performs [NLTK wordpunct_tokenize][nltk-tokenize]-like tokenisation of a Taiwanese Hokkien sentence.

```python
# Constructor
t = Tokeniser(keep_original)

# Tokenise Taiwanese Hokkien sentence
t.tokenise(input)
```
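For comparison, the regular expression behind NLTK's `WordPunctTokenizer` is `\w+|[^\w\s]+`. The sketch below applies that pattern directly with Python's `re` module (the function name is hypothetical) to show why a plain pattern is not enough for Chinese text: it separates punctuation correctly but cannot find word boundaries inside an unbroken run of characters.

```python
import re

# Pattern used by NLTK's WordPunctTokenizer: runs of word characters,
# or runs of characters that are neither word characters nor whitespace
WORDPUNCT = re.compile(r"\w+|[^\w\s]+")


def wordpunct_sketch(text: str) -> list:
    """Split text into alphanumeric runs and punctuation runs."""
    return WORDPUNCT.findall(text)


print(wordpunct_sketch("Tai5-uan5, li2 ho2!"))
# ['Tai5', '-', 'uan5', ',', 'li2', 'ho2', '!']

print(wordpunct_sketch("太空朋友,恁好!"))
# ['太空朋友', ',', '恁好', '!']
```

The pattern keeps 太空朋友 as one token, whereas a tokeniser with word-level knowledge of Taiwanese Hokkien can split it into 太空 and 朋友.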

#### Keep original

`keep_original` Boolean - defines whether the original characters of the input are retained.

* `True` (default) - preserve original characters
* `False` - replace original characters with characters defined in the dataset

| text         | True                 | False                |
| ------------ | -------------------- | -------------------- |
| 臺灣火鸡肉饭 | ['臺灣', '火鸡肉饭'] | ['台灣', '火雞肉飯'] |

### Other Functions

Handy functions for NLP tasks in Taiwanese Hokkien.

`to_traditional` function converts input to the Traditional Chinese characters used in the dataset. It also accounts for variant forms of Traditional Chinese characters.

`to_simplified` function converts input to Simplified Chinese characters.

`is_cjk` function checks whether the input string consists entirely of Chinese characters.

```python
to_traditional(input)

to_simplified(input)

is_cjk(input)
```
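A check like `is_cjk` can be approximated with Unicode ranges. The sketch below is not Taibun's implementation: `is_cjk_sketch` is a hypothetical name, and it assumes that "Chinese characters" means the CJK Unified Ideographs block plus Extension A, which may be narrower or wider than what the library accepts.

```python
import re

# Assumption: CJK Unified Ideographs Extension A and the main block only
_CJK_ONLY = re.compile(r"[\u3400-\u4dbf\u4e00-\u9fff]+")


def is_cjk_sketch(text: str) -> bool:
    """Return True when every character of a non-empty string is in range."""
    return _CJK_ONLY.fullmatch(text) is not None


print(is_cjk_sketch("我食麭"))     # True
print(is_cjk_sketch("我食pháng"))  # False
```

In this sketch an empty string returns `False`, because `fullmatch` needs at least one character; whether Taibun treats empty input the same way is not stated here.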



<!-- EXAMPLE -->
## Example

```python
# Converter
from taibun import Converter

## System
c = Converter() # Tailo system default
c.get('先生講,學生恬恬聽。')
>> Sian-sinn kóng, ha̍k-sing tiām-tiām thiann.

c = Converter(system='Zhuyin')
c.get('先生講,學生恬恬聽。')
>> ㄒㄧㄢ ㄒㆪ ㄍㆲˋ, ㄏㄚㆶ˙ ㄒㄧㄥ ㄉㄧㆰ˫ ㄉㄧㆰ˫ ㄊㄧㆩ.

## Dialect
c = Converter() # south dialect default
c.get("我欲用箸食魚")
>> Guá beh īng tī tsia̍h hî

c = Converter(dialect='north')
c.get("我欲用箸食魚")
>> Guá bueh īng tū tsia̍h hû

c = Converter(dialect='singapore')
c.get("我欲用箸食魚")
>> Uá bueh ēng tū tsia̍h hû

## Format
c = Converter() # for Tailo, mark by default
c.get("生日快樂")
>> Senn-ji̍t khuài-lo̍k

c = Converter(format='number')
c.get("生日快樂")
>> Senn1-jit8 khuai3-lok8

c = Converter(format='strip')
c.get("生日快樂")
>> Senn-jit khuai-lok

## Delimiter
c = Converter(delimiter='')
c.get("先生講,學生恬恬聽。")
>> Siansinn kóng, ha̍ksing tiāmtiām thiann.

c = Converter(system='Pingyim', delimiter='-')
c.get("先生講,學生恬恬聽。")
>> Siān-snī gǒng, hág-sīng diâm-diâm tinā.

## Sandhi
c = Converter() # for Tailo, sandhi none by default
c.get("這是你的茶桌仔無")
>> Tse sī lí ê tê-toh-á bô

c = Converter(sandhi='auto')
c.get("這是你的茶桌仔無")
>> Tse sì li ē tē-to-á bô

c = Converter(sandhi='exc_last')
c.get("這是你的茶桌仔無")
>> Tsē sì li ē tē-tó-a bô

c = Converter(sandhi='incl_last')
c.get("這是你的茶桌仔無")
>> Tsē sì li ē tē-tó-a bō

## Punctuation
c = Converter() # format punctuation default
c.get("太空朋友,恁好!恁食飽未?")
>> Thài-khong pîng-iú, lín-hó! Lín tsia̍h-pá buē?

c = Converter(punctuation='none')
c.get("太空朋友,恁好!恁食飽未?")
>> thài-khong pîng-iú,lín-hó!lín tsia̍h-pá buē?

## Convert non-CJK
c = Converter(system='Zhuyin') # False convert_non_cjk default
c.get("我食pháng")
>> ㆣㄨㄚˋ ㄐㄧㄚㆷ˙ pháng

c = Converter(system='Zhuyin', convert_non_cjk=True)
c.get("我食pháng")
>> ㆣㄨㄚˋ ㄐㄧㄚㆷ˙ ㄆㄤˋ


# Tokeniser
from taibun import Tokeniser

t = Tokeniser()
t.tokenise("太空朋友,恁好!恁食飽未?")
>> ['太空', '朋友', ',', '恁好', '!', '恁', '食飽', '未', '?']

## Keep Original
t = Tokeniser() # True keep_original default
t.tokenise("爲啥物臺灣遮爾好?")
>> ['爲啥物', '臺灣', '遮爾', '好', '?']

t.tokenise("为啥物台湾遮尔好?")
>> ['为啥物', '台湾', '遮尔', '好', '?']

t = Tokeniser(False)
t.tokenise("爲啥物臺灣遮爾好?")
>> ['為啥物', '台灣', '遮爾', '好', '?']

t.tokenise("为啥物台湾遮尔好?")
>> ['為啥物', '台灣', '遮爾', '好', '?']


# Other Functions
from taibun import to_traditional, to_simplified, is_cjk

## to_traditional
to_traditional("我听无台语")
>> 我聽無台語

to_traditional("我爱这个个人台面")
>> 我愛這个個人檯面

to_traditional("爲啥物")
>> 為啥物

## to_simplified
to_simplified("我聽無台語")
>> 我听无台语

## is_cjk
is_cjk('我食麭')
>> True

is_cjk('我食pháng')
>> False
```



<!-- DATA -->
## Data

- [Taiwanese-Chinese Online Dictionary][online-dictionary] (via [ChhoeTaigi][data-via])
- [iTaigi Chinese-Taiwanese Comparison Dictionary][itaigi-dictionary] (via [ChhoeTaigi][data-via])



<!-- ACKNOWLEDGEMENTS -->
## Acknowledgements

- Samuel Jen ([GitHub][samuel-github] · [LinkedIn][samuel-linkedin]) - Taiwanese and Mandarin translation



<!-- LICENCE -->
## Licence

Taibun is MIT-licensed, so developers can do essentially whatever they want with it, as long as they include the original copyright and licence notice in any copies of the source code. Note that the data used by the package is covered by a different licence.

The data is licensed under [CC BY-SA 4.0][data-cc].



<!-- MARKDOWN LINKS -->
<!-- Badges and their links -->
[contributions]: https://github.com/andreihar/taibun/issues
[contributions-badge]: https://img.shields.io/badge/Contributions-Welcomed-be132d?style=for-the-badge&logo=github
[demo]: https://taibun.andreihar.com/
[demo-badge]: https://img.shields.io/badge/Live_Demo-222222?style=for-the-badge&logo=homeadvisor&logoColor=white
[tests]: https://github.com/andreihar/taibun/actions
[tests-badge]: https://img.shields.io/github/actions/workflow/status/andreihar/taibun/ci.yaml?style=for-the-badge&logo=github-actions&logoColor=ffffff
[release-badge]: https://img.shields.io/github/v/release/andreihar/taibun?color=38618c&style=for-the-badge
[release]: https://github.com/andreihar/taibun/releases
[licence-badge]: https://img.shields.io/github/license/andreihar/taibun?color=000000&style=for-the-badge
[licence]: LICENSE
[linkedin-badge]: https://img.shields.io/badge/LinkedIn-0077b5?style=for-the-badge&logo=linkedin&logoColor=ffffff
[linkedin]: https://www.linkedin.com/in/andreihar/
[js-badge]: https://img.shields.io/badge/JS_Version-f7df1e?style=for-the-badge&logo=javascript&logoColor=000000
[js-link]: https://github.com/andreihar/taibun.js
[downloads-badge]: https://img.shields.io/pypi/dm/taibun.svg?style=for-the-badge

<!-- Technical links -->
[pypi]: https://pypi.org/project/taibun
[bug]: https://github.com/andreihar/taibun/issues
[online-dictionary]: http://ip194097.ntcu.edu.tw/ungian/soannteng/chil/Taihoa.asp
[itaigi-dictionary]: https://itaigi.tw/
[data-via]: https://github.com/ChhoeTaigi/ChhoeTaigiDatabase
[data-cc]: https://creativecommons.org/licenses/by-sa/4.0/deed.en
[tailo-wiki]: https://en.wikipedia.org/wiki/T%C3%A2i-u%C3%A2n_L%C3%B4-m%C3%A1-j%C4%AB_Phing-im_Hong-%C3%A0n
[poj-wiki]: https://en.wikipedia.org/wiki/Pe%CC%8Dh-%C5%8De-j%C4%AB
[zhuyin-wiki]: https://en.wikipedia.org/wiki/Taiwanese_Phonetic_Symbols
[tlpa-wiki]: https://en.wikipedia.org/wiki/Taiwanese_Language_Phonetic_Alphabet
[pingyim-wiki]: https://en.wikipedia.org/wiki/Bb%C3%A1nl%C3%A1m_p%C3%ACngy%C4%ABm
[tongiong-wiki]: https://en.wikipedia.org/wiki/Da%C4%AB-gh%C3%AE_t%C5%8Dng-i%C5%8Dng_p%C4%ABng-im
[ipa-wiki]: https://en.wikipedia.org/wiki/International_Phonetic_Alphabet
[zhangzhou-wiki]: https://en.wikipedia.org/wiki/Zhangzhou_dialects
[quanzhou-wiki]: https://en.wikipedia.org/wiki/Quanzhou_dialects
[singapore-wiki]: https://en.wikipedia.org/wiki/Singaporean_Hokkien
[nltk-tokenize]: https://nltk.org/api/nltk.tokenize.html
[sandhi-wiki]: https://en.wikipedia.org/wiki/Taiwanese_Hokkien#Tone%20sandhi:~:text=thng%E2%9F%A9%20(%22soup%22).-,Tone%20sandhi,-%5Bedit%5D

<!-- Socials -->
[samuel-github]: https://github.com/SSSam
[samuel-linkedin]: https://www.linkedin.com/in/samuel-jen/

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/andreihar/taibun",
    "name": "taibun",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "python, taiwan, taiwanese, taigi, hokkien, romanization, transliteration, transliterator, tokenization, tokenizer",
    "author": "Andrei Harbachov",
    "author_email": "andrei.harbachov@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/93/52/e313c931ee7226b5f20dd4577be470808f7e4415b95414246e5167fb6a05/taibun-1.1.7.tar.gz",
    "platform": null,
    "description": "[\u53f0\u8a9e](readme/README-oan.md) | [\u570b\u8a9e](readme/README-cmn.md)\r\n\r\n\r\n\r\n<!-- PROJECT LOGO -->\r\n<br />\r\n<div align=\"center\">\r\n  <a href=\"https://github.com/andreihar/taibun\">\r\n    <img src=\"https://github.com/andreihar/taibun/raw/main/readme/logo.png\" alt=\"Logo\" width=\"90\" height=\"80\">\r\n  </a>\r\n  \r\n# Taibun\r\n\r\n\r\n\r\n<!-- PROJECT SHIELDS -->\r\n[![Contributions][contributions-badge]][contributions]\r\n[![Live Demo][demo-badge]][demo]\r\n[![Tests][tests-badge]][tests]\r\n[![Release][release-badge]][release]\r\n[![Licence][licence-badge]][licence]\r\n[![LinkedIn][linkedin-badge]][linkedin]\r\n[![Downloads][downloads-badge]][pypi]\r\n\r\n**Taiwanese Hokkien Transliterator and Tokeniser**\r\n\r\nIt has methods that allow to customise transliteration and retrieve any necessary information about Taiwanese Hokkien pronunciation.<br />\r\nIncludes word tokeniser for Taiwanese Hokkien.\r\n\r\n[Report Bug][bug] \u2022\r\n[PyPI][pypi]\r\n\r\n</div>\r\n\r\n\r\n\r\n---\r\n\r\n\r\n\r\n<!-- TABLE OF CONTENTS -->\r\n<details open>\r\n  <summary>Table of Contents</summary>\r\n  <ol>\r\n    <li><a href=\"#versions\">Versions</a></li>\r\n    <li><a href=\"#install\">Install</a></li>\r\n    <li>\r\n      <a href=\"#usage\">Usage</a>\r\n      <ul>\r\n        <li>\r\n          <a href=\"#converter\">Converter</a>\r\n          <ul>\r\n            <li><a href=\"#system\">System</a></li>\r\n            <li><a href=\"#dialect\">Dialect</a></li>\r\n            <li><a href=\"#format\">Format</a></li>\r\n            <li><a href=\"#delimiter\">Delimiter</a></li>\r\n            <li><a href=\"#sandhi\">Sandhi</a></li>\r\n            <li><a href=\"#punctuation\">Punctuation</a></li>\r\n            <li><a href=\"#convert-non-cjk\">Convert non-CJK</a></li>\r\n          </ul>\r\n        </li>\r\n        <li>\r\n          <a href=\"#tokeniser\">Tokeniser</a>\r\n          <ul>\r\n            <li><a href=\"#keep-original\">Keep 
original</a></li>\r\n          </ul>\r\n        </li>\r\n        <li><a href=\"#other-functions\">Other Functions</a></li>\r\n      </ul>\r\n    </li>\r\n    <li><a href=\"#example\">Example</a></li>\r\n    <li><a href=\"#data\">Data</a></li>\r\n    <li><a href=\"#acknowledgements\">Acknowledgements</a></li>\r\n    <li><a href=\"#licence\">Licence</a></li>\r\n  </ol>\r\n</details>\r\n\r\n\r\n\r\n<!-- OTHER VERSIONS -->\r\n## Versions\r\n\r\n[![JavaScript Version][js-badge]][js-link]\r\n\r\n\r\n\r\n<!-- INSTALL -->\r\n## Install\r\n\r\nTaibun can be installed from [pypi][pypi]\r\n\r\n```bash\r\n$ pip install taibun\r\n```\r\n\r\n\r\n\r\n<!-- USAGE -->\r\n## Usage\r\n\r\n### Converter\r\n\r\n`Converter` class transliterates the Chinese characters to the chosen transliteration system with parameters specified by the developer. Works for both Traditional and Simplified characters.\r\n\r\n```python\r\n# Constructor\r\nc = Converter(system, dialect, format, delimiter, sandhi, punctuation, convert_non_cjk)\r\n\r\n# Transliterate Chinese characters\r\nc.get(input)\r\n```\r\n\r\n#### System\r\n\r\n`system` String - system of transliteration.\r\n\r\n* `Tailo` (default) - [T\u00e2i-u\u00e2n L\u00f4-m\u00e1-j\u012b Phing-im Hong-\u00e0n][tailo-wiki]\r\n* `POJ` - [Pe\u030dh-\u014de-j\u012b][poj-wiki]\r\n* `Zhuyin` - [Taiwanese Phonetic Symbols][zhuyin-wiki]\r\n* `TLPA` - [Taiwanese Language Phonetic Alphabet][tlpa-wiki]\r\n* `Pingyim` - [Bb\u00e1nl\u00e1m U\u0113 P\u00ecngy\u012bm H\u014dng'\u00e0n][pingyim-wiki]\r\n* `Tongiong` - [Da\u012b-gh\u00ee T\u014dng-i\u014dng P\u012bng-im][tongiong-wiki]\r\n* `IPA` - [International Phonetic Alphabet][ipa-wiki]\r\n\r\n| text | Tailo   | POJ     | Zhuyin      | TLPA      | Pingyim | Tongiong | IPA         |\r\n| ---- | ------- | ------- | ----------- | --------- | ------- | -------- | ----------- |\r\n| \u53f0\u7063 | T\u00e2i-u\u00e2n | T\u00e2i-o\u00e2n | \u3109\u311e\u02ca \u3128\u3122\u02ca | Tai5 uan5 | D\u00e1iw\u00e1n  | 
T\u0101i-u\u01cen  | Tai\u00b2\u2075 uan\u00b2\u2075 |\r\n\r\n#### Dialect\r\n\r\n`dialect` String - preferred pronunciation.\r\n\r\n* `south` (default) - [Zhangzhou][zhangzhou-wiki]-leaning pronunciation\r\n* `north` - [Quanzhou][quanzhou-wiki]-leaning pronunciation\r\n* `singapore` - Quanzhou-leaning pronunciation with [Singaporean characteristics][singapore-wiki]\r\n\r\n| text           | south                       | north                       | singapore                  |\r\n| -------------- | --------------------------- | --------------------------- | -------------------------- |\r\n| \u4e94\u6708\u7bc0\u6211\u5549\u5496\u5561 | G\u014do-gue\u030dh-tseh gu\u00e1 lim ka-pi | G\u014do-ge\u030dh-tsueh gu\u00e1 lim ka-pi | G\u014do-ge\u030dh-tsueh u\u00e1 lim ko-pi |\r\n\r\n#### Format\r\n\r\n`format` String - format in which tones will be represented in the converted sentence.\r\n\r\n* `mark` (default) - uses diacritics for each syllable. Not available for TLPA\r\n* `number` - add a number which represents the tone at the end of the syllable\r\n* `strip` - removes any tone marking\r\n\r\n| text | mark    | number    | strip   |\r\n| ---- | ------- | --------- | ------- |\r\n| \u53f0\u7063 | T\u00e2i-u\u00e2n | Tai5-uan5 | Tai-uan |\r\n\r\n#### Delimiter\r\n\r\n`delimiter` String - sets the delimiter character that will be placed in between syllables of a word.\r\n\r\nDefault value depends on the chosen `system`:\r\n\r\n* `'-'` - for `Tailo`, `POJ`, `Tongiong`\r\n* `''` - for `Pingyim`\r\n* `' '` - for `Zhuyin`, `TLPA`, `IPA`\r\n\r\n| text | '-'     | ''     | ' '     |\r\n| ---- | ------- | ------ | ------- |\r\n| \u53f0\u7063 | T\u00e2i-u\u00e2n | T\u00e2iu\u00e2n | T\u00e2i u\u00e2n |\r\n\r\n#### Sandhi\r\n\r\n`sandhi` String - applies the [sandhi rules of Taiwanese Hokkien][sandhi-wiki].\r\n\r\nSince it's difficult to encode all sandhi rules, Taibun provides multiple modes for sandhi conversion to allow for customised sandhi handling.\r\n\r\n* `none` - 
doesn't perform any tone sandhi\r\n* `auto` - closest approximation to full correct tone sandhi of Taiwanese, with proper sandhi of pronouns, suffixes, and words with \u4ed4\r\n* `exc_last` - changes tone for every syllable except for the last one\r\n* `incl_last` - changes tone for every syllable including the last one\r\n\r\nDefault value depends on the chosen `system`:\r\n\r\n* `auto` - for `Tongiong`\r\n* `none` - for `Tailo`, `POJ`, `Zhuyin`, `TLPA`, `Pingyim`, `IPA`\r\n\r\n| text             | none                    | auto                   | exc_last               | incl_last              |\r\n| ---------------- | ----------------------- | ---------------------- | ---------------------- | ---------------------- |\r\n| \u9019\u662f\u4f60\u7684\u8336\u684c\u4ed4\u7121 | Tse s\u012b l\u00ed \u00ea t\u00ea-toh-\u00e1 b\u00f4 | Tse s\u00ec li \u0113 t\u0113-to-\u00e1 b\u00f4 | Ts\u0113 s\u00ec li \u0113 t\u0113-t\u00f3-a b\u00f4 | Ts\u0113 s\u00ec li \u0113 t\u0113-t\u00f3-a b\u014d |\r\n\r\nSandhi rules also change depending on the dialect chosen.\r\n\r\n| text | no sandhi | south   | north / singapore |\r\n| ---- | --------- | ------- | ----------------- |\r\n| \u53f0\u7063 | T\u00e2i-u\u00e2n   | T\u0101i-u\u00e2n | T\u00e0i-u\u00e2n           |\r\n\r\n#### Punctuation\r\n\r\n`punctuation` String\r\n\r\n* `format` (default) - converts Chinese-style punctuation to Latin-style punctuation and capitalises words at the beginning of each sentence\r\n* `none` - preserves Chinese-style punctuation and doesn't capitalise words at the beginning of new sentences\r\n\r\n| text                                                                           | format                                                                                            | none                                                                                                 |\r\n| ------------------------------------------------------------------------------ | 
------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------- |\r\n| \u9019\u662f\u81fa\u5357\uff0c\u7c21\u7a31\u300c\u5357\u300d\uff08\u767d\u8a71\u5b57\uff1aT\u00e2i-l\u00e2m\uff1b\u6ce8\u97f3\u7b26\u865f\uff1a\u310a\u311e\u02ca \u310b\u3122\u02ca\uff0c\u570b\u8a9e\uff1aT\u00e1in\u00e1n\uff09\u3002 | Tse s\u012b T\u00e2i-l\u00e2m, k\u00e1n-tshing \"l\u00e2m\" (Pe\u030dh-u\u0113-j\u012b: T\u00e2i-l\u00e2m; ts\u00f9-im h\u00fb-h\u014d: \u310a\u311e\u02ca \u310b\u3122\u02ca, kok-g\u00ed: T\u00e1in\u00e1n). | tse s\u012b T\u00e2i-l\u00e2m\uff0ck\u00e1n-tshing\u300cl\u00e2m\u300d\uff08Pe\u030dh-u\u0113-j\u012b\uff1aT\u00e2i-l\u00e2m\uff1bts\u00f9-im h\u00fb-h\u014d\uff1a\u310a\u311e\u02ca \u310b\u3122\u02ca\uff0ckok-g\u00ed\uff1aT\u00e1in\u00e1n\uff09\u3002 |\r\n\r\n#### Convert non-CJK\r\n\r\n`convert_non_cjk` Boolean - defines whether or not to convert non-Chinese words. 
Can be used to convert Tailo to another romanisation system.\r\n\r\n* `True` - convert non-Chinese character words\r\n* `False` (default) - convert only Chinese character words\r\n\r\n| text      | False                   | True                    |\r\n| --------- | ----------------------- | ----------------------- |\r\n| \u6211\u98dfph\u00e1ng | \u31a3\u3128\u311a\u02cb \u3110\u3127\u311a\u31b7\u02d9 ph\u00e1ng | \u31a3\u3128\u311a\u02cb \u3110\u3127\u311a\u31b7\u02d9 \u3106\u3124\u02cb |\r\n\r\n### Tokeniser\r\n\r\n`Tokeniser` class performs [NLTK wordpunct_tokenize][nltk-tokenize]-like tokenisation of a Taiwanese Hokkien sentence.\r\n\r\n```python\r\n# Constructor\r\nt = Tokeniser(keep_original)\r\n\r\n# Tokenise Taiwanese Hokkien sentence\r\nt.tokenise(input)\r\n```\r\n\r\n#### Keep original\r\n\r\n`keep_original` Boolean - defines whether the original characters of the input are retained.\r\n\r\n* `True` (default) - preserve original characters\r\n* `False` - replace original characters with characters defined in the dataset\r\n\r\n| text         | True                 | False                |\r\n| ------------ | -------------------- | -------------------- |\r\n| \u81fa\u7063\u706b\u9e21\u8089\u996d | ['\u81fa\u7063', '\u706b\u9e21\u8089\u996d'] | ['\u53f0\u7063', '\u706b\u96de\u8089\u98ef'] |\r\n\r\n### Other Functions\r\n\r\nHandy functions for NLP tasks in Taiwanese Hokkien.\r\n\r\n`to_traditional` function converts input to Traditional Chinese characters that are used in the dataset. 
Also accounts for different variants of Traditional Chinese characters.\r\n\r\n`to_simplified` function converts input to Simplified Chinese characters.\r\n\r\n`is_cjk` function checks whether the input string consists entirely of Chinese characters.\r\n\r\n```python\r\nto_traditional(input)\r\n\r\nto_simplified(input)\r\n\r\nis_cjk(input)\r\n```\r\n\r\n\r\n\r\n<!-- EXAMPLE -->\r\n## Example\r\n\r\n```python\r\n# Converter\r\nfrom taibun import Converter\r\n\r\n## System\r\nc = Converter() # Tailo system default\r\nc.get('\u5148\u751f\u8b1b\uff0c\u5b78\u751f\u606c\u606c\u807d\u3002')\r\n>> Sian-sinn k\u00f3ng, ha\u030dk-sing ti\u0101m-ti\u0101m thiann.\r\n\r\nc = Converter(system='Zhuyin')\r\nc.get('\u5148\u751f\u8b1b\uff0c\u5b78\u751f\u606c\u606c\u807d\u3002')\r\n>> \u3112\u3127\u3122 \u3112\u31aa \u310d\u31b2\u02cb, \u310f\u311a\u31b6\u02d9 \u3112\u3127\u3125 \u3109\u3127\u31b0\u02eb \u3109\u3127\u31b0\u02eb \u310a\u3127\u31a9.\r\n\r\n## Dialect\r\nc = Converter() # south dialect default\r\nc.get(\"\u6211\u6b32\u7528\u7bb8\u98df\u9b5a\")\r\n>> Gu\u00e1 beh \u012bng t\u012b tsia\u030dh h\u00ee\r\n\r\nc = Converter(dialect='north')\r\nc.get(\"\u6211\u6b32\u7528\u7bb8\u98df\u9b5a\")\r\n>> Gu\u00e1 bueh \u012bng t\u016b tsia\u030dh h\u00fb\r\n\r\nc = new Converter({ dialect: 'singapore' });\r\nc.get(\"\u6211\u6b32\u7528\u7bb8\u98df\u9b5a\");\r\n>> U\u00e1 bueh \u0113ng t\u016b tsia\u030dh h\u00fb\r\n\r\n## Format\r\nc = Converter() # for Tailo, mark by default\r\nc.get(\"\u751f\u65e5\u5feb\u6a02\")\r\n>> Senn-ji\u030dt khu\u00e0i-lo\u030dk\r\n\r\nc = Converter(format='number')\r\nc.get(\"\u751f\u65e5\u5feb\u6a02\")\r\n>> Senn1-jit8 khuai3-lok8\r\n\r\nc = Converter(format='strip')\r\nc.get(\"\u751f\u65e5\u5feb\u6a02\")\r\n>> Senn-jit khuai-lok\r\n\r\n## Delimiter\r\nc = Converter(delimiter='')\r\nc.get(\"\u5148\u751f\u8b1b\uff0c\u5b78\u751f\u606c\u606c\u807d\u3002\")\r\n>> Siansinn k\u00f3ng, ha\u030dksing ti\u0101mti\u0101m thiann.\r\n\r\nc = 
Converter(system='Pingyim', delimiter='-')\r\nc.get(\"\u5148\u751f\u8b1b\uff0c\u5b78\u751f\u606c\u606c\u807d\u3002\")\r\n>> Si\u0101n-sn\u012b g\u01d2ng, h\u00e1g-s\u012bng di\u00e2m-di\u00e2m tin\u0101.\r\n\r\n## Sandhi\r\nc = Converter() # for Tailo, sandhi none by default\r\nc.get(\"\u9019\u662f\u4f60\u7684\u8336\u684c\u4ed4\u7121\")\r\n>> Tse s\u012b l\u00ed \u00ea t\u00ea-toh-\u00e1 b\u00f4\r\n\r\nc = Converter(sandhi='auto')\r\nc.get(\"\u9019\u662f\u4f60\u7684\u8336\u684c\u4ed4\u7121\")\r\n>> Tse s\u00ec li \u0113 t\u0113-to-\u00e1 b\u00f4\r\n\r\nc = Converter(sandhi='exc_last')\r\nc.get(\"\u9019\u662f\u4f60\u7684\u8336\u684c\u4ed4\u7121\")\r\n>> Ts\u0113 s\u00ec li \u0113 t\u0113-t\u00f3-a b\u00f4\r\n\r\nc = Converter(sandhi='incl_last')\r\nc.get(\"\u9019\u662f\u4f60\u7684\u8336\u684c\u4ed4\u7121\")\r\n>> Ts\u0113 s\u00ec li \u0113 t\u0113-t\u00f3-a b\u014d\r\n\r\n## Punctuation\r\nc = Converter() # format punctuation default\r\nc.get(\"\u592a\u7a7a\u670b\u53cb\uff0c\u6041\u597d\uff01\u6041\u98df\u98fd\u672a\uff1f\")\r\n>> Th\u00e0i-khong p\u00eeng-i\u00fa, l\u00edn-h\u00f3! 
L\u00edn tsia\u030dh-p\u00e1 bu\u0113?\r\n\r\nc = Converter(punctuation='none')\r\nc.get(\"\u592a\u7a7a\u670b\u53cb\uff0c\u6041\u597d\uff01\u6041\u98df\u98fd\u672a\uff1f\")\r\n>> th\u00e0i-khong p\u00eeng-i\u00fa\uff0cl\u00edn-h\u00f3\uff01l\u00edn tsia\u030dh-p\u00e1 bu\u0113\uff1f\r\n\r\n## Convert non-CJK\r\nc = Converter(system='Zhuyin') # False convert_non_cjk default\r\nc.get(\"\u6211\u98dfph\u00e1ng\")\r\n>> \u31a3\u3128\u311a\u02cb \u3110\u3127\u311a\u31b7\u02d9 ph\u00e1ng\r\n\r\nc = Converter(system='Zhuyin', convert_non_cjk=True)\r\nc.get(\"\u6211\u98dfph\u00e1ng\")\r\n>> \u31a3\u3128\u311a\u02cb \u3110\u3127\u311a\u31b7\u02d9 \u3106\u3124\u02cb\r\n\r\n\r\n# Tokeniser\r\nfrom taibun import Tokeniser\r\n\r\nt = Tokeniser()\r\nt.tokenise(\"\u592a\u7a7a\u670b\u53cb\uff0c\u6041\u597d\uff01\u6041\u98df\u98fd\u672a\uff1f\")\r\n>> ['\u592a\u7a7a', '\u670b\u53cb', '\uff0c', '\u6041\u597d', '\uff01', '\u6041', '\u98df\u98fd', '\u672a', '\uff1f']\r\n\r\n## Keep Original\r\nt = Tokeniser() # True keep_original default\r\nt.tokenise(\"\u7232\u5565\u7269\u81fa\u7063\u906e\u723e\u597d\uff1f\")\r\n>> ['\u7232\u5565\u7269', '\u81fa\u7063', '\u906e\u723e', '\u597d', '\uff1f']\r\n\r\nt.tokenise(\"\u4e3a\u5565\u7269\u53f0\u6e7e\u906e\u5c14\u597d\uff1f\")\r\n>> ['\u4e3a\u5565\u7269', '\u53f0\u6e7e', '\u906e\u5c14', '\u597d', '\uff1f']\r\n\r\nt = Tokeniser(False)\r\nt.tokenise(\"\u7232\u5565\u7269\u81fa\u7063\u906e\u723e\u597d\uff1f\")\r\n>> ['\u70ba\u5565\u7269', '\u53f0\u7063', '\u906e\u723e', '\u597d', '\uff1f']\r\n\r\nt.tokenise(\"\u4e3a\u5565\u7269\u53f0\u6e7e\u906e\u5c14\u597d\uff1f\")\r\n>> ['\u70ba\u5565\u7269', '\u53f0\u7063', '\u906e\u723e', '\u597d', '\uff1f']\r\n\r\n\r\n# Other Functions\r\nfrom taibun import to_traditional, to_simplified, is_cjk\r\n\r\n## to_traditional\r\nto_traditional(\"\u6211\u542c\u65e0\u53f0\u8bed\")\r\n>> \u6211\u807d\u7121\u53f0\u8a9e\r\n\r\nto_traditional(\"\u6211\u7231\u8fd9\u4e2a\u4e2a\u4eba\u53f0\u9762\")\r\n>> 
\u6211\u611b\u9019\u4e2a\u500b\u4eba\u6aaf\u9762\r\n\r\nto_traditional(\"\u7232\u5565\u7269\")\r\n>> \u70ba\u5565\u7269\r\n\r\n## to_simplified\r\nto_simplified(\"\u6211\u807d\u7121\u53f0\u8a9e\")\r\n>> \u6211\u542c\u65e0\u53f0\u8bed\r\n\r\n## is_cjk\r\nis_cjk('\u6211\u98df\u9ead')\r\n>> True\r\n\r\nis_cjk('\u6211\u98dfph\u00e1ng')\r\n>> False\r\n```\r\n\r\n\r\n\r\n<!-- DATA -->\r\n## Data\r\n\r\n- [Taiwanese-Chinese Online Dictionary][online-dictionary] (via [ChhoeTaigi][data-via])\r\n- [iTaigi Chinese-Taiwanese Comparison Dictionary][itaigi-dictionary] (via [ChhoeTaigi][data-via])\r\n\r\n\r\n\r\n<!-- ACKNOWLEDGEMENTS -->\r\n## Acknowledgements\r\n\r\n- Samuel Jen ([Github][samuel-github] \u00b7 [LinkedIn][samuel-linkedin]) - Taiwanese and Mandarin translation\r\n\r\n\r\n\r\n<!-- LICENCE -->\r\n## Licence\r\n\r\nBecause Taibun is MIT-licensed, any developer can essentially do whatever they want with it as long as they include the original copyright and licence notice in any copies of the source code. 
Note that the data used by the package is licensed separately.\r\n\r\nThe data is licensed under [CC BY-SA 4.0][data-cc].\r\n\r\n\r\n\r\n<!-- MARKDOWN LINKS -->\r\n<!-- Badges and their links -->\r\n[contributions]: https://github.com/andreihar/taibun/issues\r\n[contributions-badge]: https://img.shields.io/badge/Contributions-Welcomed-be132d?style=for-the-badge&logo=github\r\n[demo]: https://taibun.andreihar.com/\r\n[demo-badge]: https://img.shields.io/badge/Live_Demo-222222?style=for-the-badge&logo=homeadvisor&logoColor=white\r\n[tests]: https://github.com/andreihar/taibun/actions\r\n[tests-badge]: https://img.shields.io/github/actions/workflow/status/andreihar/taibun/ci.yaml?style=for-the-badge&logo=github-actions&logoColor=ffffff\r\n[release-badge]: https://img.shields.io/github/v/release/andreihar/taibun?color=38618c&style=for-the-badge\r\n[release]: https://github.com/andreihar/taibun/releases\r\n[licence-badge]: https://img.shields.io/github/license/andreihar/taibun?color=000000&style=for-the-badge\r\n[licence]: LICENSE\r\n[linkedin-badge]: https://img.shields.io/badge/LinkedIn-0077b5?style=for-the-badge&logo=linkedin&logoColor=ffffff\r\n[linkedin]: https://www.linkedin.com/in/andreihar/\r\n[js-badge]: https://img.shields.io/badge/JS_Version-f7df1e?style=for-the-badge&logo=javascript&logoColor=000000\r\n[js-link]: https://github.com/andreihar/taibun.js\r\n[downloads-badge]: https://img.shields.io/pypi/dm/taibun.svg?style=for-the-badge\r\n\r\n<!-- Technical links -->\r\n[pypi]: https://pypi.org/project/taibun\r\n[bug]: https://github.com/andreihar/taibun/issues\r\n[online-dictionary]: http://ip194097.ntcu.edu.tw/ungian/soannteng/chil/Taihoa.asp\r\n[itaigi-dictionary]: https://itaigi.tw/\r\n[data-via]: https://github.com/ChhoeTaigi/ChhoeTaigiDatabase\r\n[data-cc]: https://creativecommons.org/licenses/by-sa/4.0/deed.en\r\n[tailo-wiki]: https://en.wikipedia.org/wiki/T%C3%A2i-u%C3%A2n_L%C3%B4-m%C3%A1-j%C4%AB_Phing-im_Hong-%C3%A0n\r\n[poj-wiki]: 
https://en.wikipedia.org/wiki/Pe%CC%8Dh-%C5%8De-j%C4%AB\r\n[zhuyin-wiki]: https://en.wikipedia.org/wiki/Taiwanese_Phonetic_Symbols\r\n[tlpa-wiki]: https://en.wikipedia.org/wiki/Taiwanese_Language_Phonetic_Alphabet\r\n[pingyim-wiki]: https://en.wikipedia.org/wiki/Bb%C3%A1nl%C3%A1m_p%C3%ACngy%C4%ABm\r\n[tongiong-wiki]: https://en.wikipedia.org/wiki/Da%C4%AB-gh%C3%AE_t%C5%8Dng-i%C5%8Dng_p%C4%ABng-im\r\n[ipa-wiki]: https://en.wikipedia.org/wiki/International_Phonetic_Alphabet\r\n[zhangzhou-wiki]: https://en.wikipedia.org/wiki/Zhangzhou_dialects\r\n[quanzhou-wiki]: https://en.wikipedia.org/wiki/Quanzhou_dialects\r\n[singapore-wiki]: https://en.wikipedia.org/wiki/Singaporean_Hokkien\r\n[nltk-tokenize]: https://nltk.org/api/nltk.tokenize.html\r\n[sandhi-wiki]: https://en.wikipedia.org/wiki/Taiwanese_Hokkien#Tone%20sandhi:~:text=thng%E2%9F%A9%20(%22soup%22).-,Tone%20sandhi,-%5Bedit%5D\r\n\r\n<!-- Socials -->\r\n[samuel-github]: https://github.com/SSSam\r\n[samuel-linkedin]: https://www.linkedin.com/in/samuel-jen/\r\n",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Taiwanese Hokkien Transliterator and Tokeniser",
    "version": "1.1.7",
    "project_urls": {
        "Homepage": "https://github.com/andreihar/taibun"
    },
    "split_keywords": [
        "python",
        " taiwan",
        " taiwanese",
        " taigi",
        " hokkien",
        " romanization",
        " transliteration",
        " transliterator",
        " tokenization",
        " tokenizer"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "7ab57164a1218cc9ff600e668f1fc8eedd3dba4287c0939a9e629d6edb82aa2d",
                "md5": "e0f68e5cbf92d74e264c143da84ac1a6",
                "sha256": "5b38335ea9550c3976d728c9d5e29d63eaa7326429d30eb737c1f681c2ab6108"
            },
            "downloads": -1,
            "filename": "taibun-1.1.7-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "e0f68e5cbf92d74e264c143da84ac1a6",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 563076,
            "upload_time": "2024-08-31T20:24:59",
            "upload_time_iso_8601": "2024-08-31T20:24:59.756250Z",
            "url": "https://files.pythonhosted.org/packages/7a/b5/7164a1218cc9ff600e668f1fc8eedd3dba4287c0939a9e629d6edb82aa2d/taibun-1.1.7-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "9352e313c931ee7226b5f20dd4577be470808f7e4415b95414246e5167fb6a05",
                "md5": "4a7aad5266ebe374980e86e6711e9937",
                "sha256": "136aa60861c1f50ab286a78445ef40eaed7460098daabcf10026074c6ea6e870"
            },
            "downloads": -1,
            "filename": "taibun-1.1.7.tar.gz",
            "has_sig": false,
            "md5_digest": "4a7aad5266ebe374980e86e6711e9937",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 611617,
            "upload_time": "2024-08-31T20:25:01",
            "upload_time_iso_8601": "2024-08-31T20:25:01.938421Z",
            "url": "https://files.pythonhosted.org/packages/93/52/e313c931ee7226b5f20dd4577be470808f7e4415b95414246e5167fb6a05/taibun-1.1.7.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-08-31 20:25:01",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "andreihar",
    "github_project": "taibun",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "taibun"
}
        