Thai Language Toolkit Project version 1.9.1
============================================
TLTK is a Python package designed for Thai language processing, which includes functionalities such as syllable and word segmentation, discourse unit segmentation, POS tagging, named entity recognition, grapheme-to-phoneme conversion, IPA transcription, romanization, and more. To use TLTK, you will need to have Python 3.6 or a more recent version installed. The project is an open-source software developed at Chulalongkorn University. As of version 1.2.2, the package license has been changed to the New BSD License (BSD-3-Clause).
Input: text must be Thai, encoded in UTF-8.
Updates:
--------
Version 1.9: TextClass(text) can classify both level and genre. Levels are categorized as L1 - L4 for Lower Elementary, Upper Elementary, Middle, and High School. Genres are categorized as Academic, Essay, Fiction, Institution, Law, Misc, Newspaper, Non-academic, Popular magazine, and Speech.
Version 1.8.1: Bug fixes for TNC3g_load().
Version 1.8: Two new modules have been introduced. TextClass(text) assesses the difficulty level of a text on the scale L1-L4 (Lower Elementary, Upper Elementary, Middle, and High School). txt2feat(text) generates a vector of 129 features, returned as a list of values and derived from the output of the TextAna module. Within these modules, dependency relations in 'wrd_deprel[deprel]' have been transformed into UD format, such as UDsubj, UDobj, UDnmod, and so on. The 129 features are used within the TextClass module to evaluate text difficulty.
Version 1.7: Introduced the `spoonerism(w)` module, which generates one or two spoonerisms from the input word `w`. This is achieved by swapping the first and last syllables, either a) preserving the initial consonant or b) preserving both the initial consonant and tone. The output is provided as a list of readings in Thai. Additionally, the dependency "sklearn" has been updated to "scikit-learn".
Version 1.6.8: Bug fixes have been made to the "TextAna" module.
Version 1.6.7: Bug fixes have been made to the "g2p" module.
Version 1.6.6 includes UDParser using MaltParser (https://www.maltparser.org/). To use this feature, please install MaltParser and add a line 'tltk.nlp.Maltparser_Path = "/path/to/maltparser-1.9.2"' in your code before using 'MaltParser' or 'MaltParser_wordlist'. The former requires text input while the latter requires a list of words. The UD tree generated by MaltParser is a dictionary with the following format: {'sentence': "ข้อความภาษาไทย", 'words': [{'id': nn, 'pos': POS, 'deprel': REL, 'head': HD_ID}, {...}, ...]}. You can use 'print_dtree' to print D-tree from the parsed result. Additionally, 'delrel' and 'SynDepth' have been added to the properties of 'TextAna' when the option 'UDParse="Malt"' is specified. By default, 'UDParse="none"'.
Version 1.6.5: This version includes bug fixes in the "SylAna" and "WordAna" modules, as well as a new module called "tltk.corpus.compound(x,y)".
Version 1.6.3: Bug fixes have been made to the "g2p" module, and some features have been modified in both "WordAna" and "TextAna" modules.
Version 1.6.2: Changes have been made to the text features in this version.
Version 1.6.1: This version includes new text features, an updated Word2Vec model using 'TNCc5model3.bin', a change from 'g2p_all' to 'th2ipa_all', and some bug fixes.
Version 1.6: The new feature in this version is 'TNC_tag', which allows you to mark up Thai text in XML format.
Version 1.5.8: This version includes the addition of average reduced frequency in the TextAna module.
Version 1.5.7: The SylAna module has been added, which is included in WordAna. The output is a list of syllable properties, which is added to the word property. Additionally, 'th2read(text)' has been added, which shows the pronunciation in Thai written forms.
Version 1.5: This version includes the addition of the WordAna and TextAna modules. The output of WordAna is an object with word properties.
Usage of the modules introduced in the versions above:
'res = tltk.nlp.TNC_tag(text,POS)' returns XML format of Thai texts as used in TNC. The POS option can be set to either "Y" or "N".
sp = tltk.nlp.SylAna(syl_form,syl_phone) => sp.form (syllable form), sp.phone (syllable sound), sp.char (number of characters in the syllable), sp.dead (indicates whether the syllable is dead or live, True/False), sp.initC (initial consonant form), sp.finalC (final consonant form), sp.vowel (vowel form), sp.tonemark (indicates the tone mark, เอก, โท, ตรี, จัตวา), sp.initPh (initial consonant sound), sp.finalPh (final consonant sound), sp.vowelPh (vowel sound), sp.tone (tone 1, 2, 3, 4, or 5), sp.leading (indicates whether the syllable is a leading syllable, True/False), sp.cluster (indicates whether the syllable has an initial cluster, True/False), sp.karan (number of characters marked with a karan marker)
wd = tltk.nlp.WordAna(w) => wd.form (word form), wd.phone (word sound), wd.char (number of characters in the word), wd.syl (number of syllables), wd.corrtone (number of tones that match the same tone marker), wd.corrfinal (number of final consonant sounds that match the final character -ก -ด -ง -น -ม -ย -ว), wd.karan (number of karan markers), wd.cluster (number of cluster consonants), wd.lead (number of leading consonants), wd.doubvowel (number of complex vowels), wd.syl_prop (a list of syllable properties)
res = tltk.nlp.TextAna(text, TextOption, WordOption) => a complex dictionary output describing the input text.
TextOption can be configured with one of the following values: "segmented," "edu," or "par." To segment the text with <p>, <s>, and | representing a new paragraph, space, and word segmentation, select "segmented." To apply TLTK EDU segmentation, choose "edu." To process the text as plain text format using "\\n" for paragraph separation, use "par."
WordOption can be set to "colloc" or "mm". If the text is not yet segmented, use "colloc" or "mm" to segment the text into words using TLTK.
### properties from SylAna
- form: syllable form
- phone: syllable sound
- char: number of characters in the syllable
- dead: True|False (indicates whether the syllable is dead or alive)
- initC: initial consonant
- finalC: final consonant
- vowel: vowel form
- tonemark: tone marker (values: 1, 2, 3, 4, 5)
- initPh: initial sound
- finalPh: final sound
- vowelPh: vowel sound
- tone: tone (values: 1, 2, 3, 4, 5)
- leading: True|False (indicates whether the syllable is a leading syllable, e.g., in สบาย, สห)
- cluster: True|False (indicates whether the syllable has a cluster consonant)
- karan: character(s) marked with karan
### properties from WordAna
- form: word form
- phone: word sound
- char: number of characters
- syl: number of syllables
- corrtone: number of correct tone markers (สามัญ, ่ เอก, ้ โท, ๊ ตรี, ๋ จัตวา) in both form and sound
- incorrtone: number of incorrect tone markers in both form and sound
- corrfinal: number of correct final consonants (-ก -ด -ง -น -ม -ย -ว)
- incorrfinal: number of incorrect final consonants (excluding -ก -ด -ง -น -ม -ย -ว)
- karan: number of karan markers
- cluster: number of cluster consonants
- lead: number of leading consonants
- doubvowel: number of double vowels
### properties from TextAna
- DesSpC: No. of spaces in a text
- DesChaC: No. of characters in a text
- DesSymbC: No. of symbols or special characters in a text
- DesPC: No. of paragraphs
- DesEduC: No. of edu units
- DesTotW: Total number of words in a text
- DesTotT: Total number of unique words (types) in a text
- DesEduL: Mean length of an edu unit (in words)
- DesEduLd: Standard deviation of edu length (in words)
- DesWrdL: Mean length of a word (in syllables)
- DesWrdLd: Standard deviation of word length (in syllables)
- DesPL: Mean length of a paragraph (in words)
- DesCorrToneC: Number of words with the correct tone form and tone sound
- DesInCorrToneC: Number of words with incorrect tone form and/or tone sound
- DesCorrFinalC: Number of words with correct final consonant (-ก -ด -ง -น -ม -ย -ว)
- DesInCorrFinalC: Number of words with incorrect final consonant (not -ก -ด -ง -น -ม -ย -ว)
- DesClusterC: Number of words with a consonant cluster
- DesLeadC: Number of words with a leading syllable (e.g. สบาย, สห)
- DesDoubVowelC: Number of words with a double vowel
- DesTNCt1C: No. of words in TNC tier1 50%
- DesTNCt2C: No. of words in TNC tier2 51-60%
- DesTNCt3C: No. of words in TNC tier3 61-70%
- DesTNCt4C: No. of words in TNC tier4 71-80%
- DesTTC1: No. of words in TTC level1
- DesTTC2: No. of words in TTC level2
- DesTTC3: No. of words in TTC level3
- DesTTC4: No. of words in TTC level4
- WrdCorrTone: ratio of words with the same tone form and phone
- WrdInCorrTone: ratio of words with different tone form and phone
- WrdCorrFinal: ratio of words with correct final consonant -ก -ด -ง -น -ม -ย -ว
- WrdInCorrFinal: ratio of words with final consonant not -ก -ด -ง -น -ม -ย -ว
- WrdKaran: ratio of words with a karan
- WrdCluster: ratio of words with a cluster
- WrdLead: ratio of words with a leading syllable
- WrdDoubVowel: ratio of words with a double vowel
- WrdNEl: ratio of named entity locations
- WrdNEo: ratio of named entity organizations
- WrdNEp: ratio of named entity persons
- WrdNeg: ratio of negations
- WrdTNCt1: relative frequency of words in TNC tier 1 (/1000 words)
- WrdTNCt2: relative frequency of words in TNC tier 2
- WrdTNCt3: relative frequency of words in TNC tier 3
- WrdTNCt4: relative frequency of words in TNC tier 4
- WrdTTC1: relative frequency of words in TTC level 1
- WrdTTC2: relative frequency of words in TTC level 2
- WrdTTC3: relative frequency of words in TTC level 3
- WrdTTC4: relative frequency of words in TTC level 4
- WrdC: mean of relative frequency of content words in TTC
- WrdF: mean of relative frequency of function words in TTC
- WrdCF: mean of relative frequency of content/function words in TTC
- WrdFrmSing: mean of relative frequency of single-word forms in TTC
- WrdFrmComp: mean of relative frequency of complex/compound word forms in TTC
- WrdFrmTran: mean of relative frequency of transliterated words in TTC
- WrdSemSimp: mean of relative frequency of simple words in TTC
- WrdSemTran: mean of relative frequency of transparent compound words in TTC
- WrdSemSemi: mean of relative frequency of words in between transparent and opaque compound words in TTC
- WrdSemOpaq: mean of relative frequency of opaque compound words in TTC
- WrdBaseM: mean of relative frequency of basic vocab from Ministry of Education
- WrdBaseT: mean of relative frequency of basic vocab from TTC & TNC < 2000
- WrdTfidf: average of TF-IDF of each word (calculated from TNC)
- WrdTncDisp: average of dispersion of each word (calculated from TNC)
- WrdTtcDisp: average of dispersion of each word (calculated from TTC)
- WrdArf: average of ARF (average reduced frequency) of each word in the text
- WrdNOUN: mean of relative frequency of words with POS=NOUN
- WrdVERB: mean of relative frequency of words with POS=VERB
- WrdADV: mean of relative frequency of words with POS=ADV
- WrdDET: mean of relative frequency of words with POS=DET
- WrdADJ: mean of relative frequency of words with POS=ADJ
- WrdADP: mean of relative frequency of words with POS=ADP
- WrdPUNCT: mean of relative frequency of words with POS=PUNCT
- WrdAUX: mean of relative frequency of words with POS=AUX
- WrdSYM: mean of relative frequency of words with POS=SYM
- WrdINTJ: mean of relative frequency of words with POS=INTJ
- WrdCCONJ: mean of relative frequency of words with POS=CCONJ
- WrdPROPN: mean of relative frequency of words with POS=PROPN
- WrdNUM: mean of relative frequency of words with POS=NUM
- WrdPART: mean of relative frequency of words with POS=PART
- WrdPRON: mean relative frequency of words with POS=PRON
- WrdSCONJ: mean relative frequency of words with POS=SCONJ
- LdvTTR: type-token ratio, which is the ratio of the number of unique words (types) to the total number of words (tokens) in a text
- CrfCNL: proportion of utterances having the same NOUN overlapped locally (yes or no)
- CrfCVL: proportion of utterances having the same VERB overlapped locally (yes or no)
- CrfCWL: proportion of utterances having the same content words overlapped locally (yes or no)
- CrfCTL: proportion of utterances having content words overlapped locally (measured by the number of overlapping tokens)
- wrd: dictionary where wrd[word] = freq, representing the frequency of each word in a text
- wrd_arf: dictionary where wrd_arf[word] = arf, representing the average reduced frequency of each word in a text
- wrd_deprel: dictionary where wrd_deprel[deprel] = freq, representing the frequency of each dependency relation (deprel) in a text
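As a rough illustration of how the count-based properties relate to each other, the sketch below recomputes the token count, type count, and type-token ratio from a `wrd`-style frequency dictionary. The helper name `type_token_ratio` and the sample frequencies are made up for illustration; TLTK's own computation may differ in details such as how spaces and symbols are counted.

```python
def type_token_ratio(wrd):
    """Return (tokens, types, ttr) for a {word: frequency} dictionary."""
    tokens = sum(wrd.values())                # total words, cf. DesTotW
    types = len(wrd)                          # unique words, cf. DesTotT
    ttr = types / tokens if tokens else 0.0   # cf. LdvTTR
    return tokens, types, ttr


# Hypothetical frequencies, shaped like the `wrd` property described above
sample = {'รัง': 3, 'นก': 2, 'บิน': 1}
tokens, types, ttr = type_token_ratio(sample)
print(tokens, types, ttr)
```

An empty dictionary yields a ratio of 0.0 instead of raising a division error, which is a choice made for this sketch only.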
Version 1.4 has been updated for gensim 4.0. Users can load a Thai corpus using Corpus(), then create a model using W2V_train() or D2V_train(), or load an existing model from W2V_load(Model_File). The pre-trained w2v model for TNC is TNCc5model2.bin. The model for EDU segmentation has been recompiled to work with the new library.
Version 1.3.8 has added spell_variants to generate all variation forms of the same pronunciation.
Version 1.3.6 has removed the "matplotlib" dependency and fixed an error with "ใคร".
More compound words have been added to the dictionary. Versions 1.1.3-1.1.5 contained many entries that were not words and had a few errors. Those entries have been removed in later versions.
The NER tagger model has been updated by using more named entity data from the AiforThai project.
tltk.nlp : basic tools for Thai language processing.
------------------------------------------------------
\>tltk.nlp.TextClass(text): classifies the text by level or genre. The defaults are TextOption="par", WordOption="colloc", UDParse="Malt", and Classifier="level". If the text is already word-segmented with "|", use WordOption="segmented". Two classifiers are available: "level" and "genre".
\>tltk.nlp.txt2feat(text, Option="name|value"): Returns a list of 129 feature values analyzed from the text. If Option="name", only a list of 129 feature names is returned.
\>tltk.nlp.spoonerism(word_or_phrase): Returns one or two "spoonerisms" derived from the input. For example, using `spoonerism('แขนเป็นฟอ')` will produce the spoonerism(s).
=>['คอ-เป็น-แฝน', 'ขอ-เป็น-แฟน']
\>tltk.nlp.TextAna(Text, UDParse="Malt"): Called this way, the function analyzes plain text by paragraph, segments words using the colloc approach, and uses MaltParser for UD parsing. The default options are TextOption="par", WordOption="colloc", and UDParse="none". If the input is already segmented with '|', use TextOption="segmented" and WordOption="segmented". If processing by EDU is preferred, set TextOption="edu". If no parsing is needed, leave UDParse="none".
=>output as a dict of text features described in TextAna
\>tltk.nlp.TextAna2json(Text, Filename, Options) functions similarly to the above, but the results are saved to a JSON file. The `Options` parameter includes a `Mode` which can be set to "write" or "append".
\>tltk.nlp.MaltParser(Text) e.g. print_dtree(tltk.nlp.MaltParser("เขานั่งดูหนังอยู่ที่บ้าน"))
=>
* 1:----เขา (PRON, nsubj - 2)
* 2:--นั่ง (VERB, root - 0)
* 3:----ดู (VERB, compound - 2)
* 4:------หนัง (NOUN, obj - 3)
* 5:------อยู่ (VERB, compound - 3)
* 6:----------ที่ (ADP, case - 7)
* 7:--------บ้าน (NOUN, obl - 5)
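The tree printed above can also be walked directly through the `head` links of the parsed dictionary. The sketch below computes the depth of every word from a hand-built dictionary that mirrors the example; it assumes `id` and `head` are integers and that the root has head 0, which the printed output suggests but the documentation does not state. The helper name `tree_depths` is hypothetical and not part of TLTK.

```python
def tree_depths(tree):
    """Map each word id to its depth in the dependency tree (root = 1)."""
    heads = {w['id']: w['head'] for w in tree['words']}
    depths = {}

    def depth(node):
        # Assumes a well-formed tree: no cycles, every head id is present.
        if node not in depths:
            parent = heads[node]
            depths[node] = 1 if parent == 0 else depth(parent) + 1
        return depths[node]

    for node in heads:
        depth(node)
    return depths


# Hand-built tree mirroring the print_dtree example (not produced by TLTK here)
tree = {
    'sentence': 'เขานั่งดูหนังอยู่ที่บ้าน',
    'words': [
        {'id': 1, 'pos': 'PRON', 'deprel': 'nsubj', 'head': 2},
        {'id': 2, 'pos': 'VERB', 'deprel': 'root', 'head': 0},
        {'id': 3, 'pos': 'VERB', 'deprel': 'compound', 'head': 2},
        {'id': 4, 'pos': 'NOUN', 'deprel': 'obj', 'head': 3},
        {'id': 5, 'pos': 'VERB', 'deprel': 'compound', 'head': 3},
        {'id': 6, 'pos': 'ADP', 'deprel': 'case', 'head': 7},
        {'id': 7, 'pos': 'NOUN', 'deprel': 'obl', 'head': 5},
    ],
}
depths = tree_depths(tree)
print(max(depths.values()))
```

In the printed tree each level of depth corresponds to two dashes of indentation, so the deepest word here is the one indented furthest.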
\>tltk.nlp.TNC_tag(Text,POSTagOption) e.g. tltk.nlp.TNC_tag('นายกรัฐมนตรีกล่าวกับคนขับรถประจำทางหลวงสายสองว่า อยากวิงวอนให้ใช้ความรอบคอบ',POS='Y')
=> '<w tran="naa0jok3rat3tha1mon0trii0" POS="NOUN">นายกรัฐมนตรี</w><w tran="klaaw1" POS="VERB">กล่าว</w><w tran="kap1" POS="ADP">กับ</w><w tran="khon0khap1rot3" POS="NOUN">คนขับรถ</w><w tran="pra1cam0" POS="NOUN">ประจำ</w><w tran="thaaN0luuaN4" POS="NOUN">ทางหลวง</w><w tran="saaj4" POS="NOUN">สาย</w><w tran="sOON4" POS="NUM">สอง</w><w tran="waa2" POS="SCONJ">ว่า</w><s/><w tran="jaak1" POS="VERB">อยาก</w><w tran="wiN0wOOn0" POS="VERB">วิงวอน</w><w tran="haj2" POS="SCONJ">ให้</w><w tran="chaj3" POS="VERB">ใช้</w><w tran="khwaam0" POS="NOUN">ความ</w><w tran="rOOp2khOOp2" POS="VERB">รอบคอบ</w><s/>'
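Because the markup returned by TNC_tag has no single root element, it needs a wrapper before a standard XML parser will accept it. The sketch below uses Python's `xml.etree.ElementTree` on a shortened copy of the output above; the helper name `parse_tnc_tag` and the wrapper tag `<text>` are choices made for this illustration, and it assumes the words contain no unescaped `&` or `<` characters.

```python
import xml.etree.ElementTree as ET


def parse_tnc_tag(xml_fragment):
    """Return (word, transcription, POS) tuples from TNC_tag-style markup."""
    # Wrap the fragment so that it becomes a well-formed XML document.
    root = ET.fromstring('<text>' + xml_fragment + '</text>')
    # <s/> elements (spaces) are skipped because only <w> is iterated.
    return [(w.text, w.get('tran'), w.get('POS')) for w in root.iter('w')]


# Shortened from the example output above
sample = ('<w tran="klaaw1" POS="VERB">กล่าว</w>'
          '<w tran="kap1" POS="ADP">กับ</w><s/>')
parsed = parse_tnc_tag(sample)
print(parsed)
```

If the POS attribute is absent from the markup, `w.get('POS')` simply returns `None` in this sketch.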
\>tltk.nlp.chunk(Text) : chunk parsing. The output includes markups for word segments (\|), elementary discourse units (\<u/\>), pos tags (/POS),and named entities (\<NEx\>...\</NEx\>), e.g. tltk.nlp.chunk("สำนักงานเขตจตุจักรชี้แจงว่า ได้นำป้ายประกาศเตือนปลิงไปปักตามแหล่งน้ำ ในเขตอำเภอเมือง จังหวัดอ่างทอง หลังจากนายสุกิจ อายุ 65 ปี ถูกปลิงกัดแล้วไม่ได้ไปพบแพทย์")
=> '<NEo\>สำนักงาน/NOUN|เขต/NOUN|จตุจักร/PROPN|</NEo\>ชี้แจง/VERB|ว่า/SCONJ|\<s/\>/PUNCT|ได้/AUX|นำ/VERB|ป้ายประกาศ/NOUN|เตือน/VERB|ปลิง/NOUN|ไป/VERB|ปัก/VERB|ตาม/ADP|แหล่งน้ำ/NOUN|\<u/\>ใน/ADP|<NEl\>เขต/NOUN|อำเภอ/NOUN|เมือง/NOUN|\<s/\>/PUNCT|จังหวัด/NOUN|อ่างทอง/PROPN|\</NEl\>\<u/\>หลังจาก/SCONJ|\<NEp\>นาย/NOUN|สุ/PROPN|กิจ/NOUN|\</NEp\>\<s/\>/PUNCT|อายุ/NOUN|\<u/\>65/NUM|\<s/\>/PUNCT|ปี/NOUN|\<u/\>ถูก/AUX|ปลิง/VERB|กัด/VERB|แล้ว/ADV|ไม่ได้/AUX|ไป/VERB|พบ/VERB|แพทย์/NOUN|\<u/\>'
\>tltk.nlp.segment(Text) : segment edu by marking <u\/> e.g. tltk.nlp.segment("แต่อาจเพราะนกกินปลีอกเหลืองเป็นพ่อแม่มือใหม่ รังที่ทำจึงไม่ค่อยแข็งแรง วันหนึ่งรังก็ฉีกเกือบขาดเป็นสองท่อนห้อยต่องแต่ง ผมพยายามหาอุปกรณ์มายึดรังกลับคืนรูปทรงเดิม ขณะที่แม่นกกินปลีอกเหลืองส่งเสียงโวยวายอยู่ใกล้ ๆ แต่สุดท้ายไม่สำเร็จ สองสามวันต่อมารังที่ช่วยซ่อมก็พังไป ไม่เห็นแม่นกบินกลับมาอีกเลย")
=>"แต่|อาจ|เพราะ|นกกินปลีอกเหลือง|เป็น|พ่อแม่|มือใหม่|<s/>|รัง|ที่|ทำ|จึง|ไม่ค่อย|แข็งแรง<u/>วัน|หนึ่ง|รัง|ก็|ฉีก|เกือบ|ขาด|เป็น|สอง|ท่อน|ห้อย|ต่องแต่ง<u/>ผม|พยายาม|หา|อุปกรณ์|มา|ยึด|รัง|กลับคืน|รูปทรง|เดิม<u/>ขณะที่|แม่|นกกินปลีอกเหลือง|ส่งเสียง|โวยวาย|อยู่|ใกล้|ๆ|<s/><u/>แต่|สุดท้าย|ไม่|สำเร็จ<u/>สอง|สาม|วัน|ต่อมา|รัง|ที่|ช่วย|ซ่อม|ก็|พัง|ไป<u/>ไม่|เห็น|แม่|นก|บิน|กลับ|มา|อีก|เลย<u/>"
\>tltk.nlp.ner_tag(Text) : The output includes markups for named entities (\<NEx\>...\</NEx\>), e.g. tltk.nlp.ner_tag("สำนักงานเขตจตุจักรชี้แจงว่า ได้นำป้ายประกาศเตือนปลิงไปปักตามแหล่งน้ำ ในเขตอำเภอเมือง จังหวัดอ่างทอง หลังจากนายสุกิจ อายุ 65 ปี ถูกปลิงกัดแล้วไม่ได้ไปพบแพทย์")
=> '\<NEo\>สำนักงานเขตจตุจักร\</NEo\>ชี้แจงว่า ได้นำป้ายประกาศเตือนปลิงไปปักตามแหล่งน้ำ ใน\<NEl\>เขตอำเภอเมือง จังหวัดอ่างทอง\</NEl\> หลังจาก\<NEp\>นายสุกิจ\</NEp\> อายุ 65 ปี ถูกปลิงกัดแล้วไม่ได้ไปพบแพทย์'
\>tltk.nlp.ner([(w,pos),....]) : module for named entity recognition (person, organization, location), e.g. tltk.nlp.ner([('สำนักงาน', 'NOUN'), ('เขต', 'NOUN'), ('จตุจักร', 'PROPN'), ('ชี้แจง', 'VERB'), ('ว่า', 'SCONJ'), ('\<s/\>', 'PUNCT')])
=> [('สำนักงาน', 'NOUN', 'B-O'), ('เขต', 'NOUN', 'I-O'), ('จตุจักร', 'PROPN', 'I-O'), ('ชี้แจง', 'VERB', 'O'), ('ว่า', 'SCONJ', 'O'), ('\<s/\>', 'PUNCT', 'O')]
Named entity recognition is based on the CRF model adapted from the http://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html tutorial. The model was trained on a corpus containing 170,000 named entities. The tags used for organizations are B-O and I-O, for persons are B-P and I-P, and for locations are B-L and I-L.
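The B-/I- tags described above can be folded back into whole entities with a single pass over the tagged list. The following is a minimal sketch, not part of TLTK: the helper name `group_entities` and the `LABELS` mapping are introduced here for illustration, and the input is the example output of `ner` shown above with the markdown escapes removed.

```python
LABELS = {'O': 'organization', 'P': 'person', 'L': 'location'}


def group_entities(tagged):
    """Collect (entity_text, entity_type) pairs from (word, pos, tag) triples."""
    entities, current, kind = [], [], None

    def flush():
        if current:
            entities.append((''.join(current), LABELS.get(kind, kind)))

    for word, _pos, tag in tagged:
        if tag.startswith('B-'):
            # A B- tag always opens a new entity.
            flush()
            current, kind = [word], tag[2:]
        elif tag.startswith('I-') and current and tag[2:] == kind:
            # An I- tag of the same type continues the open entity.
            current.append(word)
        else:
            # 'O' (outside) or an inconsistent I- tag closes any open entity.
            flush()
            current, kind = [], None
    flush()
    return entities


tagged = [('สำนักงาน', 'NOUN', 'B-O'), ('เขต', 'NOUN', 'I-O'),
          ('จตุจักร', 'PROPN', 'I-O'), ('ชี้แจง', 'VERB', 'O'),
          ('ว่า', 'SCONJ', 'O'), ('<s/>', 'PUNCT', 'O')]
entities = group_entities(tagged)
print(entities)
```

Note that the bare tag `O` means "outside any entity", while `B-O` and `I-O` mark an organization; the sketch distinguishes the two by the `B-`/`I-` prefix.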
\>tltk.nlp.pos_tag(Text,WordSegmentOption) : word segmentation and POS tagging (using nltk.tag.perceptron), e.g. tltk.nlp.pos_tag('โปรแกรมสำหรับใส่แท็กหมวดคำภาษาไทย วันนี้ใช้งานได้บ้างแล้ว')
=> [[('โปรแกรม', 'NOUN'), ('สำหรับ', 'ADP'), ('ใส่', 'VERB'), ('แท็ก', 'NOUN'), ('หมวดคำ', 'NOUN'), ('ภาษาไทย', 'PROPN'), ('\<s/\>', 'PUNCT')], [('วันนี้', 'NOUN'), ('ใช้งาน', 'VERB'), ('ได้', 'ADV'), ('บ้าง', 'ADV'), ('แล้ว', 'ADV'), ('\<s/\>', 'PUNCT')]]
The default word segmentation method used is "colloc" in the function word_segment(Text, "colloc"), but if the option is set to "mm", then the function word_segment(Text, "mm") will be used. The POS tag set used is based on the Universal POS tag set found at http://universaldependencies.org/u/pos/index.html.
The nltk.tag.perceptron model is used for POS tagging, which was trained on a POS-tagged subcorpus in TNC consisting of 148,000 words.
\>tltk.nlp.pos_tag_wordlist(WordLst) : Same as "tltk.nlp.pos_tag", but the input is a word list, [w1,w2,...]
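The nested list returned by pos_tag (one inner list per text segment) is easy to summarize with the standard library. The sketch below counts POS tags over the example output shown for pos_tag, with the markdown escapes removed; the helper name `pos_counts` is hypothetical, and skipping the `<s/>` marker is a choice made for this example.

```python
from collections import Counter


def pos_counts(tagged_segments):
    """Count POS tags over lists of (word, pos) pairs, ignoring <s/> markers."""
    return Counter(
        pos
        for segment in tagged_segments
        for word, pos in segment
        if word != '<s/>'
    )


tagged = [
    [('โปรแกรม', 'NOUN'), ('สำหรับ', 'ADP'), ('ใส่', 'VERB'), ('แท็ก', 'NOUN'),
     ('หมวดคำ', 'NOUN'), ('ภาษาไทย', 'PROPN'), ('<s/>', 'PUNCT')],
    [('วันนี้', 'NOUN'), ('ใช้งาน', 'VERB'), ('ได้', 'ADV'), ('บ้าง', 'ADV'),
     ('แล้ว', 'ADV'), ('<s/>', 'PUNCT')],
]
counts = pos_counts(tagged)
print(counts.most_common())
```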
\>tltk.nlp.segment(Text) : segment a paragraph into elementary discourse units (edu) marked with \<u/\> and segment words in each edu e.g. tltk.nlp.segment("แต่อาจเพราะนกกินปลีอกเหลืองเป็นพ่อแม่มือใหม่ รังที่ทำจึงไม่ค่อยแข็งแรง วันหนึ่งรังก็ฉีกเกือบขาดเป็นสองท่อนห้อยต่องแต่ง ผมพยายามหาอุปกรณ์มายึดรังกลับคืนรูปทรงเดิม ขณะที่แม่นกกินปลีอกเหลืองส่งเสียงโวยวายอยู่ใกล้ ๆ แต่สุดท้ายไม่สำเร็จ สองสามวันต่อมารังที่ช่วยซ่อมก็พังไป ไม่เห็นแม่นกบินกลับมาอีกเลย")
=> 'แต่|อาจ|เพราะ|นกกินปลีอกเหลือง|เป็น|พ่อแม่|มือใหม่|\<s/\>|รัง|ที่|ทำ|จึง|ไม่|ค่อย|แข็งแรง\<u/\>วัน|หนึ่ง|รัง|ก็|ฉีก|เกือบ|ขาด|เป็น|สอง|ท่อน|ห้อย|ต่องแต่ง\<u/\>ผม|พยายาม|หา|อุปกรณ์|มา|ยึด|รัง|กลับคืน|รูปทรง|เดิม\<u/\>ขณะ|ที่|แม่|นกกินปลีอกเหลือง|ส่งเสียง|โวยวาย|อยู่|ใกล้|ๆ\<u/\>แต่|สุดท้าย|ไม่|สำเร็จ|\<s/\>|สอง|สาม|วัน|ต่อ|มา|รัง|ที่|ช่วย|ซ่อม|ก็|พัง|ไป\<u/\>ไม่|เห็น|แม่|นก|บิน|กลับ|มา|อีก|เลย\<u/\>'
EDU segmentation is based on syllable input using a RandomForestClassifier model, which is trained on an EDU-segmented corpus (approx. 7,000 EDUs) created and used in Nalinee's thesis.
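The string returned by segment can be turned into a list of EDUs, each a list of words, by splitting on the two markers. This is a minimal sketch using a shortened piece of the output above with the markdown escapes removed; the helper name `split_edus` is not part of TLTK, and dropping the `<s/>` space markers is a choice made for this example.

```python
def split_edus(segmented):
    """Return a list of EDUs, each a list of words, from segment()-style output."""
    edus = []
    for chunk in segmented.split('<u/>'):
        # Drop empty pieces and the <s/> markers that stand for spaces.
        words = [w for w in chunk.split('|') if w and w != '<s/>']
        if words:
            edus.append(words)
    return edus


# Last two EDUs of the example output above
sample = ('แต่|สุดท้าย|ไม่|สำเร็จ|<s/>|สอง|สาม|วัน|ต่อ|มา|รัง|ที่|ช่วย|ซ่อม|ก็|พัง|ไป<u/>'
          'ไม่|เห็น|แม่|นก|บิน|กลับ|มา|อีก|เลย<u/>')
edus = split_edus(sample)
for words in edus:
    print(len(words), ' '.join(words))
```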
\>tltk.nlp.word_segment(Text,method='mm|ngram|colloc') : word segmentation using either maximum matching or ngram or maximum collocation approach. 'colloc' is used by default. Please note that the first run of ngram method would take a long time because TNC.3g will be loaded for ngram calculation. e.g.
\>tltk.nlp.word_segment('ผู้สื่อข่าวรายงานว่านายกรัฐมนตรีไม่มาทำงานที่ทำเนียบรัฐบาล')
=> 'ผู้สื่อข่าว|รายงาน|ว่า|นายกรัฐมนตรี|ไม่|มา|ทำงาน|ที่|ทำเนียบรัฐบาล|\<s/>'
\>tltk.nlp.syl_segment(Text) : syllable segmentation using 3gram statistics e.g. tltk.nlp.syl_segment('โปรแกรมสำหรับประมวลผลภาษาไทย')
=> 'โปร~แกรม~สำ~หรับ~ประ~มวล~ผล~ภา~ษา~ไทย\<s/>'
\>tltk.nlp.word_segment_nbest(Text, N) : returns the best N segmentations based on the minimum-word assumption, e.g. tltk.nlp.word_segment_nbest('คนขับรถประจำทางปรับอากาศ',10)
=> [['คนขับ|รถประจำทาง|ปรับอากาศ', 'คนขับรถ|ประจำทาง|ปรับอากาศ', 'คน|ขับ|รถประจำทาง|ปรับอากาศ', 'คน|ขับรถ|ประจำทาง|ปรับอากาศ', 'คนขับ|รถ|ประจำทาง|ปรับอากาศ', 'คนขับรถ|ประจำ|ทาง|ปรับอากาศ', 'คนขับ|รถประจำทาง|ปรับ|อากาศ', 'คนขับรถ|ประจำทาง|ปรับ|อากาศ', 'คน|ขับ|รถ|ประจำทาง|ปรับอากาศ', 'คนขับ|ร|ถ|ประจำทาง|ปรับอากาศ']]
\>tltk.nlp.g2p(Text) : return Word segments and pronunciations
e.g. tltk.nlp.g2p("สถาบันอุดมศึกษาไม่สามารถก้าวให้ทันการเปลี่ยนแปลงของตลาดแรงงาน")
=> "สถา~บัน~อุ~ดม~ศึก~ษา|ไม่|สา~มารถ|ก้าว|ให้|ทัน|การ|เปลี่ยน~แปลง|ของ|ตลาด~แรง~งาน\<tr/\>sa1'thaa4~ban0~?u1~dom0~sUk1~saa4|maj2|saa4~maat2|kaaw2|haj2|than0|kaan0|pliian1~plxxN0|khOON4|ta1'laat1~rxxN0~Naan0|\<s/\>"
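In the g2p output, the written forms and the transcriptions sit on either side of the `<tr/>` marker, and both sides use `|` for word boundaries. The sketch below pairs them up for a single-segment input; it uses a shortened piece of the output above with the markdown escapes removed, the helper name `align_g2p` is hypothetical, and inputs containing several space-separated segments may need extra handling that is not shown here.

```python
def align_g2p(output):
    """Pair each word with its transcription from g2p()-style output."""
    written, _, spoken = output.partition('<tr/>')
    words = [w for w in written.split('|') if w and w != '<s/>']
    prons = [p for p in spoken.split('|') if p and p != '<s/>']
    if len(words) != len(prons):
        raise ValueError('word and transcription counts differ')
    return list(zip(words, prons))


# Shortened from the example output above
sample = 'ไม่|สา~มารถ|ก้าว<tr/>maj2|saa4~maat2|kaaw2|<s/>'
pairs = align_g2p(sample)
for word, pron in pairs:
    print(word, pron)
```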
\>tltk.nlp.th2ipa(Text) : return Thai transcription in IPA forms
e.g. tltk.nlp.th2ipa("ลงแม่น้ำรอเดินไปหาปลา")
=> 'loŋ1 mɛː3.naːm4 rᴐː1 dɤːn1 paj1 haː5 plaː1 \<s/\>'
\>tltk.nlp.th2roman(Text) : return Thai romanization according to Royal Thai Institute guideline.
e.g. tltk.nlp.th2roman("คือเขาเดินเลยลงไปรอในแม่น้ำสะอาดไปหามะปราง")
=> 'khue khaw doen loei long pai ro nai maenam sa-at pai ha maprang \<s/>'
\>tltk.nlp.th2read(Text) : convert text into Thai reading forms, e.g. th2read('สามารถเขียนคำอ่านภาษาไทยได้')
=> 'สา-มาด-เขียน-คัม-อ่าน-พา-สา-ไท-ด้าย-'
\>tltk.nlp.th2ipa_all(Text) : return all transcriptions (IPA) as a list of tuple (syllable_list, transcription). Transcription is based on syllable reading rules. It could be different from th2ipa.
e.g. tltk.nlp.th2ipa_all("รอยกร่าง")
=> [('รอย~กร่าง', 'rᴐːj1.ka2.raːŋ2'), ('รอย~กร่าง', 'rᴐːj1.kraːŋ2'), ('รอ~ยก~ร่าง', 'rᴐː1.jok4.raːŋ3')]
\>tltk.nlp.spell_candidates(Word) : list of possible correct words using minimum edit distance, e.g. tltk.nlp.spell_candidates('รักษ')
=> ['รัก', 'ทักษ', 'รักษา', 'รักษ์']
\>tltk.nlp.spell_variants(Word, InDict="no|yes", Karan="exclude|include"):
This function returns a list of word variants with the same pronunciation as the input Word. The InDict parameter allows the option "yes" to save only words found in the dictionary, while the default option "no" includes all variants regardless of their dictionary status. The Karan parameter allows the option "include" to include words spelled with the karan character, while the default option "exclude" excludes them. For example, tltk.nlp.spell_variants('โควิด').
=> ['โฆวิธ', 'โฆวิต', 'โฆวิด', 'โฆวิท', 'โฆวิช', 'โฆวิจ', 'โฆวิส', 'โฆวิษ', 'โฆวิตร', 'โฆวิฒ', 'โฆวิฏ', 'โฆวิซ', 'โควิธ', 'โควิต', 'โควิด', 'โควิท', 'โควิช', 'โควิจ', 'โควิส', 'โควิษ', 'โควิตร', 'โควิฒ', 'โควิฏ', 'โควิซ']
Other defined functions in the package:
\>tltk.nlp.reset_thaidict() : clear dictionary content
\>tltk.nlp.read_thaidict(DictFile) : add a new dictionary e.g. tltk.nlp.read_thaidict('BEST.dict')
\>tltk.nlp.check_thaidict(Word) : check whether Word exists in the dictionary
tltk.corpus : basic tools for corpus enquiry
-----------------------------------------------
\>tltk.corpus.compound(w1, w2): Evaluates the similarity between combinations of w1 and w2, specifically w1-w2, w1-w1w2, and w2-w1w2. For instance, invoking `tltk.corpus.compound('กลัด','กลุ้ม')` indicates that 'กลัดกลุ้ม' is more similar to 'กลุ้ม'.
=>[(('กลุ้ม', 'กลัดกลุ้ม'), 0.42245594), (('กลัด', 'กลัดกลุ้ม'), 0.09066804), (('กลัด', 'กลุ้ม'), 0.0011619462)]
\>tltk.corpus.Corpus_build(DIR, filetype="xxx") creates a corpus as a list of paragraphs from files located in the directory specified by DIR. The default file type is .txt. However, it is important to note that the files must be pre-segmented into words, with each word separated by the | character, e.g. w1|w2|w3|w4 ....
\>tltk.corpus.Corpus() creates a corpus object that has three methods:
- x.frequency(Text): This method returns the frequency of a specific Text string in the corpus.
- x.dispersion(C): This method returns a dispersion plot for a given word list C in the corpus.
- x.totalword(C): This method returns the total number of words in the corpus that match a given word list C.
Here, C is the result created from Corpus_build.
\>C = tltk.corpus.Corpus_build('temp/data/')
\>corp = tltk.corpus.Corpus()
\>print(corp.frequency(C))
\> {'จังหวัด': 32, 'สมุทรสาคร': 16, 'เปิด': 3, 'ศูนย์': 13, 'ควบคุม': 13, 'แจ้ง': 16, .....}
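To make the expected input format concrete, the sketch below counts word frequencies over paragraphs that are already segmented with `|`, producing a dictionary of the same shape as the output above. It is a plain-Python illustration only, not TLTK's implementation: the helper name `word_frequencies` and the two sample paragraphs are invented, and it assumes the corpus is a list of paragraph strings.

```python
from collections import Counter


def word_frequencies(paragraphs):
    """Count words over paragraphs pre-segmented with the '|' character."""
    counts = Counter()
    for paragraph in paragraphs:
        counts.update(w for w in paragraph.split('|') if w)
    return dict(counts)


# Hypothetical pre-segmented paragraphs in the w1|w2|w3 format
paragraphs = ['จังหวัด|เปิด|ศูนย์', 'ศูนย์|ควบคุม']
freq = word_frequencies(paragraphs)
print(freq)
```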
\>tltk.corpus.Xwordlist() creates a comparison object that compares two word lists A and B generated from the Corp.frequency() method. The Corp object is created from Corpus().
Four comparison methods are defined in this object:
- onlyA(): This method returns the list of words that occur only in A.
- onlyB(): This method returns the list of words that occur only in B.
- intersect(): This method returns the list of words that occur in both A and B.
- union(): This method returns the list of words that occur in either A or B (or both).
Here, c1 and c2 are Corpus() objects, Xcomp is an Xwordlist() object, and parsA and parsB are corpora created with Corpus_build(...). The word lists A and B are the results of c1.frequency(parsA) and c2.frequency(parsB).
For example, Xcomp.onlyA(c1.frequency(parsA), c2.frequency(parsB)).
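The four comparisons correspond to ordinary set operations on the words of the two frequency dictionaries. The sketch below shows that correspondence in plain Python; it is not TLTK's code, the helper name `compare_wordlists` and the sample frequencies are invented, and the actual return types of the Xwordlist methods may differ.

```python
def compare_wordlists(freq_a, freq_b):
    """Compare the word sets of two {word: frequency} dictionaries."""
    a, b = set(freq_a), set(freq_b)
    return {
        'onlyA': sorted(a - b),       # words only in A
        'onlyB': sorted(b - a),       # words only in B
        'intersect': sorted(a & b),   # words in both A and B
        'union': sorted(a | b),       # words in A or B (or both)
    }


# Hypothetical frequency dictionaries
freq_a = {'จังหวัด': 32, 'เปิด': 3}
freq_b = {'เปิด': 5, 'ศูนย์': 13}
result = compare_wordlists(freq_a, freq_b)
print(result['onlyA'], result['onlyB'], result['intersect'])
```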
\>tltk.corpus.W2V_train(Corpus) creates a Word2Vec model. Input is a corpus created from Corpus_build.
\>tltk.corpus.D2V_train(Corpus) creates a Doc2Vec model. Input is a corpus created from Corpus_build.
\>tltk.corpus.TNC_load() loads TNC.3g by default. The file can be in the working directory or the TLTK package directory.
\>tltk.corpus.trigram_load(TRIGRAM) loads trigram data from another source saved in tab-delimited format "W1\tW2\tW3\tFreq", e.g. tltk.corpus.load3gram('TNC.3g'). 'TNC.3g' can be downloaded separately from the Thai National Corpus Project.
\>tltk.corpus.unigram(w1) returns the normalized frequency (frequency/million) of w1 from the corpus
\>tltk.corpus.bigram(w1,w2) return frequency/million of Bigram w1-w2 from the corpus e.g. tltk.corpus.bigram("หาย","ดี") => 2.331959592765809
\>tltk.corpus.trigram(w1,w2,w3) return frequency/million of Trigram w1-w2-w3 from the corpus
\>tltk.corpus.collocates(w, stat="chi2", direct="both", span=2, limit=10, minfq=1) ### return all collocates of w, STAT = {freq,mi,chi2} DIR={left,right,both} SPAN={1,2} The output is a list of tuples ((w1,w2), stat). e.g. tltk.corpus.collocates("วิ่ง",limit=5)
=> [(('วิ่ง', 'แจ้น'), 86633.93952758134), (('วิ่ง', 'ตื๋อ'), 77175.29122642518), (('วิ่ง', 'กระหืดกระหอบ'), 48598.79465339733), (('วิ่ง', 'ปรู๊ด'), 41111.63720974819), (('ลู่', 'วิ่ง'), 33990.56839021914)]
\>tltk.corpus.W2V_load(File) load w2v model created from gensim. If no file is given, file "TNCc5model3.bin" will be loaded.
\>tltk.corpus.w2v_load() loads the word2vec file "TNCc5model2.bin" by default. The file can be in the working directory or the TLTK package directory.
\>tltk.corpus.w2v_exist(w) check whether w has a vector representation e.g. tltk.corpus.w2v_exist("อาหาร") => True
\>tltk.corpus.w2v(w) return vector representation of w
\>tltk.corpus.similarity(w1,w2) e.g. tltk.corpus.similarity("อาหาร","อาหารว่าง") => 0.783551877546
\>tltk.corpus.similar_words(w, n=10, cutoff=0., score="n") e.g. tltk.corpus.similar_words("อาหาร",n=5, score="y")
=> [('อาหารว่าง', 0.7835519313812256), ('ของว่าง', 0.7366500496864319), ('ของหวาน', 0.703102707862854), ('เนื้อสัตว์', 0.6960341930389404), ('ผลไม้', 0.6641997694969177)]
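Scores like the ones above are typically cosine similarities between word vectors, which is what gensim computes for word2vec models. Assuming TLTK passes the comparison to gensim unchanged, the score can be reproduced from two vectors as sketched below. The toy two-dimensional vectors are invented; real vectors would come from `tltk.corpus.w2v(w)`.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors for illustration only
score = cosine_similarity([1.0, 0.0], [1.0, 1.0])
print(round(score, 4))
```

Vectors pointing in the same direction score 1.0 regardless of their length, which is why the measure reflects direction in the vector space and not magnitude.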
\>tltk.corpus.outofgroup([w1,w2,w3,...]) e.g. tltk.corpus.outofgroup(["น้ำ","อาหาร","ข้าว","รถยนต์","ผัก"]) => "รถยนต์"
\>tltk.corpus.analogy(w1,w2,w3,n=1) e.g. tltk.corpus.analogy('พ่อ','ผู้ชาย','แม่') => ['ผู้หญิง']
\>tltk.corpus.w2v_plot([w1,w2,w3,...]) => plot a scatter graph of w1-wn in two dimensions
\>tltk.corpus.w2v_compare_color([w1,w2,w3,...]) => visualize the components of vectors w1-wn in color
\>tltk.corpus.compound(w1,w2) => check a compound w1w2, whether w1 or w2 is similar to w1w2 e.g. tltk.corpus.compound('เล็ก','น้อย') => [(('เล็ก', 'น้อย'), 0.4533272), (('น้อย', 'เล็กน้อย'), 0.35492077), (('เล็ก', 'เล็กน้อย'), 0.24106339)]
Notes
-----
- The word segmentation method used is based on a maximum collocation approach, which is described in the publication "Collocation and Thai Word Segmentation" by W. Aroonmanakun (2002). This publication can be found in the Proceedings of the Fifth Symposium on Natural Language Processing & The Fifth Oriental COCOSDA Workshop, edited by Thanaruk Theeramunkong and Virach Sornlertlamvanich, and published by Sirindhorn International Institute of Technology in Pathumthani. The relevant pages are 68-75. Here is the link to the publication: http://pioneer.chula.ac.th/~awirote/ling/SNLP2002-0051c.pdf
- To segment Thai texts, you can use either tltk.nlp.word_segment(Text) or tltk.nlp.syl_segment(Text). The syllable segmentation method is based on a trigram model trained on a corpus of 3.1 million syllables. The input text should be a paragraph of Thai text that may contain English text. Spaces in the paragraph should be marked as "\<s/\>". Word boundaries are marked by "|", and syllable boundaries are marked by "~". Please note that the syllables represented here are written syllables. Some written syllables may be pronounced as two syllables. For example, "สกัด" is segmented here as one written syllable, but it is pronounced as two syllables "sa1-kat1".
- The process of determining words in a sentence is based on a combination of a dictionary and the maximum collocation strength between syllables. The standard dictionary includes many compounds and idioms, such as 'เตาไมโครเวฟ', 'ไฟฟ้ากระแสสลับ', 'ปีงบประมาณ', 'อุโมงค์ใต้ดิน', 'อาหารจานด่วน', 'ปูนขาวผสมพิเศษ', 'เต้นแร้งเต้นกา', etc. These will likely be segmented as one word. If your application requires the use of shortest meaningful words (i.e. 'รถ|โดยสาร', 'คน|ใช้', 'กลาง|คืน', 'ต้น|ไม้', as segmented in the BEST corpus), you can reset the default dictionary used in this package and load a new dictionary containing only simple words or the shortest meaningful words. To clear the default dictionary content, use "reset_thaidict()". To load a new dictionary, use "read_thaidict('DICT_FILE')". A file named 'BEST.dict' containing a list of words compiled from the BEST corpus is included in this package.
- The standard dictionary used in this package has more than 65,000 entries, including abbreviations and transliterations, compiled from various sources. Additionally, a list of 8,700 proper names such as country names, organization names, location names, animal names, plant names, food names, etc., has been added to the system's dictionary. Examples of such proper names include 'อุซเบกิสถาน', 'สำนักเลขาธิการนายกรัฐมนตรี', 'วัดใหญ่สุวรรณาราม', 'หนอนเจาะลำต้นข้าวโพด', and 'ปลาหมึกกระเทียมพริกไทย'.
- For segmenting a specific domain text, a specialized dictionary can be used by adding it to the existing dictionary before segmenting the text. This can be done by calling read_thaidict("SPECIALIZED_DICT"). Please note that the dictionary should be a text file in "utf-8" encoding, and each word should be on a separate line.
- 'Sentence segmentation' here is more precisely 'EDU segmentation': the process of breaking a paragraph into elementary discourse units (EDUs), which are usually clauses. It is based on a RandomForestClassifier model trained on an EDU-segmented corpus (8,100 EDUs) created and used in Nalinee's thesis (http://www.arts.chula.ac.th/~ling/thesis/2556MA-LING-Nalinee.pdf). The model has an accuracy of 97.8%. The rationale for using EDUs is discussed in [Aroonmanakun, W. 2007. Thoughts on Word and Sentence Segmentation in Thai. In Proceedings of the Seventh Symposium on Natural Language Processing, Dec 13-15, 2007, Pattaya, Thailand. 85-90.] and [Intasaw, N. and Aroonmanakun, W. 2013. Basic Principles for Segmenting Thai EDUs. In Proceedings of 27th Pacific Asia Conference on Language, Information, and Computation, pages 491-498, Nov 22-24, 2013, Taipei.].
- 'grapheme to phoneme' (g2p), as well as IPA transcription (th2ipa) and Thai romanization (th2roman) are based on the hybrid approach presented in the paper "A Unified Model of Thai Word Segmentation and Romanization". The Thai Royal Institute guideline for Thai romanization can be downloaded from "http://www.arts.chula.ac.th/~ling/tts/ThaiRoman.pdf", or "http://www.royin.go.th/?page_id=619". [Aroonmanakun, W., and W. Rivepiboon. 2004. A Unified Model of Thai Word Segmentation and Romanization. In Proceedings of The 18th Pacific Asia Conference on Language, Information and Computation, Dec 8-10, 2004, Tokyo, Japan. 205-214.] (http://www.aclweb.org/anthology/Y04-1021)
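The boundary markers and the dictionary file format described in the list above can be handled with plain string and file operations. The following is a minimal sketch, not part of TLTK: the helper names `split_words`, `split_syllables`, and `write_dict_file` are hypothetical, and the sample strings are hand-written to follow the marker conventions rather than copied from real TLTK output.

```python
from pathlib import Path

SPACE_MARK = "<s/>"  # marker that stands for a space in segmented output


def split_words(segmented, keep_spaces=False):
    """Split word-segmented text on '|' and drop empty pieces."""
    tokens = [t for t in segmented.split("|") if t]
    if not keep_spaces:
        tokens = [t for t in tokens if t != SPACE_MARK]
    return tokens


def split_syllables(segmented):
    """Split syllable-segmented text on '~' and drop empty pieces."""
    return [s for s in segmented.split("~") if s and s != SPACE_MARK]


def write_dict_file(words, path):
    """Write a word list as UTF-8 text, one word per line.

    This is the file layout the README describes for read_thaidict.
    Duplicates and blank entries are removed; the sorted list is returned.
    """
    unique = sorted({w.strip() for w in words if w.strip()})
    Path(path).write_text("\n".join(unique) + "\n", encoding="utf-8")
    return unique


# Hand-written sample in the documented marker style (not real TLTK output).
print(split_words("รถ|โดยสาร|<s/>|คน|ใช้"))
print(split_syllables("กลาง~คืน"))
```

With TLTK installed, a file produced by `write_dict_file` could then be passed to `read_thaidict`, after calling `reset_thaidict()` if the default entries should be discarded. Most functions in this README live under `tltk.nlp`, so `tltk.nlp.read_thaidict(...)` is the likely spelling, but that prefix is an assumption here.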
Remarks
-------
- A prototype of the UD Parser is implemented using MaltParser (https://www.maltparser.org/). To use MaltParser, it must be installed, and a line 'tltk.nlp.Maltparser_Path = "/path/to/maltparser-1.9.2"' should be added to your code. The UD tree generated by MaltParser is a dictionary with the following format: {'sentence': "ข้อความภาษาไทย", 'words': [{'id': nn, 'pos': POS, 'deprel': REL, 'head': HD_ID}, {...}, ...]}. The model is trained on 1,114 UD trees manually analyzed from a sample of TNC and is included as "thamalt.mco" in the TLTK package. Additional UD trees will be added in the future.
- The TNC Trigram data (TNC.3g) and TNC word2vec (TNCc5model3.bin) can be downloaded from the TNC website: http://www.arts.chula.ac.th/ling/tnc/searchtnc/.
- The "spell_candidates" module is modified from Peter Norvig's Python code, which can be found at http://norvig.com/spell-correct.html.
- The "w2v_compare_color" module is modified from http://chrisculy.net/lx/wordvectors/wvecs_visualization.html.
- The BEST corpus is a corpus released by NECTEC (https://www.nectec.or.th/corpus/).
- This project uses Universal POS tags. For more information, please see http://universaldependencies.org/u/pos/index.html and http://www.arts.chula.ac.th/~ling/contents/File/UD%20Annotation%20for%20Thai.pdf.
- pos_tag is based on the PerceptronTagger in the nltk.tag.perceptron module. It was trained on manually POS-tagged TNC data (approximately 148,000 words). The tagging accuracy is 91.68%. The NLTK PerceptronTagger is a port of the TextBlob Averaged Perceptron Tagger, which is described at https://explosion.ai/blog/part-of-speech-pos-tagger-in-python.
- The named entity recognition module is a CRF model adapted from a tutorial (http://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html). The model was trained using NER data from Sasiwimon's and Nutcha's theses (altogether 7,354 names in a corpus of 183,300 words) (http://pioneer.chula.ac.th/~awirote/Data-Nutcha.zip, http://pioneer.chula.ac.th/~awirote/Data-Sasiwimon.zip) and NER data from AIforThai (https://aiforthai.in.th/). Only valid NE files from AIforThai were used, and the total number of NEs is 170,076. The overall accuracy of the model is 88%; per-tag results are reported below.
============ =========== ======= ========= ========
tag          precision   recall  f1-score  support
------------ ----------- ------- --------- --------
B-L          0.56        0.48    0.52      27105
B-O          0.72        0.58    0.64      59613
B-P          0.82        0.83    0.83      83358
I-L          0.52        0.43    0.47      17859
I-O          0.67        0.59    0.63      67396
I-P          0.85        0.88    0.86      175069
O            0.92        0.94    0.93      1032377
------------ ----------- ------- --------- --------
accuracy                         0.88      1462777
macro avg    0.72        0.68    0.70      1462777
weighted avg 0.87        0.88    0.88      1462777
============ =========== ======= ========= ========
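The summary rows of the table follow from the per-tag rows. As a rough sketch (the helper names are hypothetical, and the inputs are the rounded values copied from the table, so agreement is only expected at two decimal places), the macro average is the unweighted mean over the seven tags, while the weighted average weights each tag by its support:

```python
# Per-tag rows copied from the table: tag -> (precision, recall, f1, support)
ROWS = {
    "B-L": (0.56, 0.48, 0.52, 27105),
    "B-O": (0.72, 0.58, 0.64, 59613),
    "B-P": (0.82, 0.83, 0.83, 83358),
    "I-L": (0.52, 0.43, 0.47, 17859),
    "I-O": (0.67, 0.59, 0.63, 67396),
    "I-P": (0.85, 0.88, 0.86, 175069),
    "O": (0.92, 0.94, 0.93, 1032377),
}
SUPPORT = 3  # index of the support column in each row


def macro_avg(rows, col):
    """Unweighted mean of one metric column over all tags."""
    return sum(r[col] for r in rows.values()) / len(rows)


def weighted_avg(rows, col):
    """Mean of one metric column, weighted by each tag's support."""
    total = sum(r[SUPPORT] for r in rows.values())
    return sum(r[col] * r[SUPPORT] for r in rows.values()) / total


def entity_count(rows):
    """Sum the support of B- tags (one B- tag opens each entity in BIO tagging)."""
    return sum(r[SUPPORT] for tag, r in rows.items() if tag.startswith("B-"))


for name, col in (("precision", 0), ("recall", 1), ("f1-score", 2)):
    print(f"{name}: macro={macro_avg(ROWS, col):.2f} "
          f"weighted={weighted_avg(ROWS, col):.2f}")
print("tokens:", sum(r[SUPPORT] for r in ROWS.values()))
print("entities:", entity_count(ROWS))
```

The support-weighted recall is by definition the overall token accuracy, which is why both read 0.88. The macro averages are lower because the location tags (B-L, I-L) score well below the dominant O tag but count equally in an unweighted mean.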
Use cases
---------
This package is free for commercial use. If you incorporate this package in your work, we would appreciate it if you let us know at awirote@chula.ac.th.
- BAS Web Services (https://clarin.phonetik.uni-muenchen.de/BASWebServices/interface) used TLTK for Thai grapheme-to-phoneme conversion in their project.
- Chubb Life Assurance Public Company Limited used TLTK for Thai transliteration.
- ThaiRomanizationSharp, a .NET project, wraps the Thai romanization in the Thai Language Toolkit Project to simplify its use in other .NET projects. https://github.com/dotnetthailand/ThaiRomanizationSharp
- Huawei, Consumer Cloud Service Asia Pacific Cloud Service Business Growth Dept. used TLTK for AppSearch processing for Thai.
- osml10n, a set of localization functions for OpenStreetMap data, used TLTK for Thai language transcription in cases where transcribed names are unavailable in the OpenStreetMap data itself. https://github.com/giggls/osml10n
Raw data
{
"_id": null,
"home_page": "http://pypi.python.org/pypi/tltk/",
"name": "tltk",
"maintainer": null,
"docs_url": null,
"requires_python": null,
"maintainer_email": null,
"keywords": "Thai language toolkit, Thai language processing, segmentation, pos tag, transcription, romanization",
"author": "Wirote Aroonmanakun",
"author_email": "awirote@chula.ac.th",
"download_url": "https://files.pythonhosted.org/packages/2d/dc/1b12c3eb2af01265001f0eeada78a25639f3472296784e0601ef2cddedd0/tltk-1.9.1.tar.gz",
"platform": null,
"description": "Thai Language Toolkit Project version 1.9.1\n============================================\n\nTLTK is a Python package designed for Thai language processing, which includes functionalities such as syllable and word segmentation, discourse unit segmentation, POS tagging, named entity recognition, grapheme-to-phoneme conversion, IPA transcription, romanization, and more. To use TLTK, you will need to have Python 3.6 or a more recent version installed. The project is an open-source software developed at Chulalongkorn University. As of version 1.2.2, the package license has been changed to the New BSD License (BSD-3-Clause).\n\nInput : must be utf8 Thai texts.\n\nUpdates:\n--------\nVersion 1.9: TextClass(text) can classify both level and genre. Levels are categorized as L1 - L4 for Lower Elementary, Upper Elementary, Middle, and High School. Genres are categorized as Academic, Essay, Fiction, Institution, Law, Misc, Newspaper, Non-academic, Popular magazine, Speech\n\nVersion 1.8.1: Bug fixes for TNC3g_load().\n\nVersion 1.8: Two new modules have been introduced: TextClass(text): This module is designed to assess the level of difficulty based on L1-L4 (Lower Elementary, Upper Elementary, Middle, and High School). It provides a mechanism for determining the text's complexity level. txt2feat(text): This module is introduced to generate a vector of 129 features, represented as a list of values. These features are derived from the output of the TextAna module. Within these modules, dependency relations in 'wrd_deprel[deprel]' have been transformed into UD format, such as UDsubj, UDobj, UDnmod, and so on. These 129 features are then generated and utilized within the TextClass module to evaluate text difficulty.\"\n\nVersion 1.7: Introduced the `spoonerism(w)` module, which generates one or two spoonerisms from the input word `w`. 
This is achieved by swapping the first and last syllables, either a) preserving the initial consonant or b) preserving both the initial consonant and tone. The output is provided as a list of readings in Thai. Additionally, the dependency \"sklearn\" has been updated to \"scikit-learn\".\n\nVersion 1.6.8: Bug fixes have been made to the \"TextAna\" module.\n\nVersion 1.6.7: Bug fixes have been made to the \"g2p\" module.\n\nVersion 1.6.6 includes UDParser using MaltParser (https://www.maltparser.org/). To use this feature, please install MaltParser and add a line 'tltk.nlp.Maltparser_Path = \"/path/to/maltparser-1.9.2\"' in your code before using 'MaltParser' or 'MaltParser_wordlist'. The former requires text input while the latter requires a list of words. The UD tree generated by MaltParser is a dictionary with the following format: {'sentence': \"\u0e02\u0e49\u0e2d\u0e04\u0e27\u0e32\u0e21\u0e20\u0e32\u0e29\u0e32\u0e44\u0e17\u0e22\", 'words': [{'id': nn, 'pos': POS, 'deprel': REL, 'head': HD_ID}, {...}, ...]}. You can use 'print_dtree' to print D-tree from the parsed result. Additionally, 'delrel' and 'SynDepth' have been added to the properties of 'TextAna' when the option 'UDParse=\"Malt\"' is specified. 
By default, 'UDParse=\"none\"'.\n\nVersion 1.6.5: This version includes bug fixes in the \"SylAna\" and \"WordAna\" modules, as well as a new module called \"tltk.corpus.compound(x,y)\".\n\nVersion 1.6.3: Bug fixes have been made to the \"g2p\" module, and some features have been modified in both \"WordAna\" and \"TextAna\" modules.\n\nVersion 1.6.2: Changes have been made to the text features in this version.\n\nVersion 1.6.1: This version includes new text features, an updated Word2Vec model using 'TNCc5model3.bin', a change from 'g2p_all' to 'th2ipa_all', and some bug fixes.\n\nVersion 1.6: The new feature in this version is 'TNC_tag', which allows you to mark up Thai text in XML format.\n\nVersion 1.5.8: This version includes the addition of average reduced frequency in the TextAna module.\n\nVersion 1.5.7: The SylAna module has been added, which is included in WordAna. The output is a list of syllable properties, which is added to the word property. Additionally, 'th2read(text)' has been added, which shows the pronunciation in Thai written forms.\n\nVersion 1.5: This version includes the addition of the WordAna and TextAna modules. The output of WordAna is an object with word properties.\n\nThe following line of code has also been mentioned:\n'res = tltk.nlp.TNC_tag(text,POS)' returns XML format of Thai texts as used in TNC. 
The POS option can be set to either \"Y\" or \"N\".\n\nsp = tltk.nlp.SylAna(syl_form,syl_phone) => sp.form (syllable form), sp.phone (syllable sound), sp.char (number of characters in the syllable), sp.dead (indicates whether the syllable is dead or live, True/False), sp.initC (initial consonant form), sp.finalC (final consonant form), sp.vowel (vowel form), sp.tonemark (indicates the tone mark, \u0e40\u0e2d\u0e01, \u0e42\u0e17, \u0e15\u0e23\u0e35, \u0e08\u0e31\u0e15\u0e27\u0e32), sp.initPh (initial consonant sound), sp.finalPh (final consonant sound), sp.vowelPh (vowel sound), sp.tone (tone 1, 2, 3, 4, or 5), sp.leading (indicates whether the syllable is a leading syllable, True/False), sp.cluster (indicates whether the syllable has an initial cluster, True/False), sp.karan (number of characters marked with a karan marker)\n\nwd = tltk.nlp.WordAna(w) => wd.form (word form), wd.phone (word sound), wd.char (number of characters in the word), wd.syl (number of syllables), wd.corrtone (number of tones that match the same tone marker), wd.corrfinal (number of final consonant sounds that match the final character -\u0e01 -\u0e14 -\u0e07 -\u0e19 -\u0e21 -\u0e22 -\u0e27), wd.karan (number of karan markers), wd.cluster (number of cluster consonants), wd.lead (number of leading consonants), wd.doubvowel (number of complex vowels), wd.syl_prop (a list of syllable properties)\n\nres = tltk.nlp.TextAna(text, TextOption, WordOption) => a complex dictionary output describing the input text.\n\nTextOption can be configured with one of the following values: \"segmented,\" \"edu,\" or \"par.\" To segment the text with <p>, <s>, and | representing a new paragraph, space, and word segmentation, select \"segmented.\" To apply TLTK EDU segmentation, choose \"edu.\" To process the text as plain text format using \"\\\\n\" for paragraph separation, use \"par.\"\n\nWordOption can be set to \"colloc\" or \"mm\". 
If the text is not yet segmented, use \"colloc\" or \"mm\" to segment the text into words using TLTK.\n\n### properties from SylAna \n\n- form: syllable form\n- phone: syllable sound\n- char: number of characters in the syllable\n- dead: True|False (indicates whether the syllable is dead or alive)\n- initC: initial consonant\n- finalC: final consonant\n- vowel: vowel form\n- tonemark: tone marker (values: 1, 2, 3, 4, 5)\n- initPh: initial sound\n- finalPh: final sound\n- vowelPh: vowel sound\n- tone: tone (values: 1, 2, 3, 4, 5)\n- leading: True|False (indicates whether the syllable is a leading syllable, e.g., in \u0e2a\u0e1a\u0e32\u0e22, \u0e2a\u0e2b)\n- cluster: True|False (indicates whether the syllable has a cluster consonant)\n- karan: character(s) marked with karan\n\n### properties from WordAna \n\n- form: word form\n- phone: word sound\n- char: number of characters\n- syl: number of syllables\n- corrtone: number of correct tone markers (\u0e2a\u0e32\u0e21\u0e31\u0e0d, \u0e48 \u0e40\u0e2d\u0e01, \u0e49 \u0e42\u0e17, \u0e4a \u0e15\u0e23\u0e35, \u0e4b \u0e08\u0e31\u0e15\u0e27\u0e32) in both form and sound\n- incorrtone: number of incorrect tone markers in both form and sound\n- corrfinal: number of correct final consonants (-\u0e01 -\u0e14 -\u0e07 -\u0e19 -\u0e21 -\u0e22 -\u0e27)\n- incorrfinal: number of incorrect final consonants (excluding -\u0e01 -\u0e14 -\u0e07 -\u0e19 -\u0e21 -\u0e22 -\u0e27)\n- karan: number of karan markers\n- cluster: number of cluster consonants\n- lead: number of leading consonants\n- doubvowel: number of double vowels\n\n### properties from TextAna \n\n- DesSpC: No. of spaces in a text\n- DesChaC: No. of characters in a text\n- DesSymbC: No. of symbols or special characters in a text\n- DesPC: No. of paragraphs\n- DesEduC: No. 
of edu units\n- DesTotW: Total number of words in a text\n- DesTotT: Total number of unique words (types) in a text\n- DesEduL: Mean length of an edu unit (in words)\n- DesEduLd: Standard deviation of edu length (in words)\n- DesWrdL: Mean length of a word (in syllables)\n- DesWrdLd: Standard deviation of word length (in syllables)\n- DesPL: Mean length of a paragraph (in words)\n- DesCorrToneC: Number of words with the correct tone form and tone sound\n- DesInCorrToneC: Number of words with incorrect tone form and/or tone sound\n- DesCorrFinalC: Number of words with correct final consonant (-\u0e01 -\u0e14 -\u0e07 -\u0e19 -\u0e21 -\u0e22 -\u0e27)\n- DesInCorrFinalC: Number of words with incorrect final consonant (not -\u0e01 -\u0e14 -\u0e07 -\u0e19 -\u0e21 -\u0e22 -\u0e27)\n- DesClusterC: Number of words with a consonant cluster\n- DesLeadC: Number of words with a leading syllable (e.g. \u0e2a\u0e1a\u0e32\u0e22, \u0e2a\u0e2b)\n- DesDoubVowelC: Number of words with a double vowel\n- DesTNCt1C: No. of words in TNC tier1 50%\n- DesTNCt2C: No. of words in TNC tier2 51-60%\n- DesTNCt3C: No. of words in TNC tier3 61-70%\n- DesTNCt4C: No. of words in TNC tier4 71-80%\n- DesTTC1: No. of words in TTC level1\n- DesTTC2: No. of words in TTC level2\n- DesTTC3: No. of words in TTC level3\n- DesTTC4: No. 
of words in TTC level4\n- WrdCorrTone: ratio of words with the same tone form and phone\n- WrdInCorrTone: ratio of words with different tone form and phone\n- WrdCorrFinal: ratio of words with correct final consonant -\u0e01 -\u0e14 -\u0e07 -\u0e19 -\u0e21 -\u0e22 -\u0e27\n- WrdInCorrFinal: ratio of words with final consonant not -\u0e01 -\u0e14 -\u0e07 -\u0e19 -\u0e21 -\u0e22 -\u0e27\n- WrdKaran: ratio of words with a karan\n- WrdCluster: ratio of words with a cluster\n- WrdLead: ratio of words with a leading syllable\n- WrdDoubVowel: ratio of words with a double vowel\n- WrdNEl: ratio of named entity locations\n- WrdNEo: ratio of named entity organizations\n- WrdNEp: ratio of named entity persons\n- WrdNeg: ratio of negations\n- WrdTNCt1: relative frequency of words in TNC tier 1 (/1000 words)\n- WrdTNCt2: relative frequency of words in TNC tier 2\n- WrdTNCt3: relative frequency of words in TNC tier 3\n- WrdTNCt4: relative frequency of words in TNC tier 4\n- WrdTTC1: relative frequency of words in TTC level 1\n- WrdTTC2: relative frequency of words in TTC level 2\n- WrdTTC3: relative frequency of words in TTC level 3\n- WrdTTC4: relative frequency of words in TTC level 4\n- WrdC: mean of relative frequency of content words in TTC\n- WrdF: mean of relative frequency of function words in TTC\n- WrdCF: mean of relative frequency of content/function words in TTC\n- WrdFrmSing: mean of relative frequency of single-word forms in TTC\n- WrdFrmComp: mean of relative frequency of complex/compound word forms in TTC\n- WrdFrmTran: mean of relative frequency of transliterated words in TTC\n- WrdSemSimp: mean of relative frequency of simple words in TTC\n- WrdSemTran: mean of relative frequency of transparent compound words in TTC\n- WrdSemSemi: mean of relative frequency of words in between transparent and opaque compound words in TTC\n- WrdSemOpaq: mean of relative frequency of opaque compound words in TTC\n- WrdBaseM: mean of relative frequency of basic vocab from Ministry 
of Education\n- WrdBaseT: mean of relative frequency of basic vocab from TTC & TNC < 2000\n- WrdTfidf: average of TF-IDF of each word (calculated from TNC)\n- WrdTncDisp: average of dispersion of each word (calculated from TNC)\n- WrdTtcDisp: average of dispersion of each word (calculated from TTC)\n- WrdArf: average of ARF (average reduced frequency) of each word in the text\n- WrdNOUN: mean of relative frequency of words with POS=NOUN\n- WrdVERB: mean of relative frequency of words with POS=VERB\n- WrdADV: mean of relative frequency of words with POS=ADV\n- WrdDET: mean of relative frequency of words with POS=DET\n- WrdADJ: mean of relative frequency of words with POS=ADJ\n- WrdADP: mean of relative frequency of words with POS=ADP\n- WrdPUNCT: mean of relative frequency of words with POS=PUNCT\n- WrdAUX: mean of relative frequency of words with POS=AUX\n- WrdSYM: mean of relative frequency of words with POS=SYM\n- WrdINTJ: mean of relative frequency of words with POS=INTJ\n- WrdCCONJ: mean of relative frequency of words with POS=CCONJ\n- WrdPROPN: mean of relative frequency of words with POS=PROPN\n- WrdNUM: mean of relative frequency of words with POS=NUM\n- WrdPART: mean of relative frequency of words with POS=PART\n- WrdPRON: mean relative frequency of words with POS=PRON\n- WrdSCONJ: mean relative frequency of words with POS=SCONJ\n- LdvTTR: type-token ratio, which is the ratio of the number of unique words (types) to the total number of words (tokens) in a text\n- CrfCNL: proportion of utterances having the same NOUN overlapped locally (yes or no)\n- CrfCVL: proportion of utterances having the same VERB overlapped locally (yes or no)\n- CrfCWL: proportion of utterances having the same content words overlapped locally (yes or no)\n- CrfCTL: proportion of utterances having content words overlapped locally (measured by the number of overlapping tokens)\n- wrd: dictionary where wrd[word] = freq, representing the frequency of each word in a text\n- wrd_arf: 
dictionary where wrd_arf[word] = arf, representing the average reduced frequency of each word in a text\n- wrd_deprel: dictionary where wrd_deprel[deprel] = freq, representing the frequency of each dependency relation (deprel) in a text\n\n\n\nVersion 1.4 has been updated for gensim 4.0. Users can load a Thai corpus using Corpus(), then create a model using W2V_train() or D2V_train(), or load an existing model from W2V_load(Model_File). The pre-trained w2v model for TNC is TNCc5model2.bin. The model for EDU segmentation has been recompiled to work with the new library.\n\nVersion 1.3.8 has added spell_variants to generate all variation forms of the same pronunciation.\n\nVersion 1.3.6 has removed the \"matplotlib\" dependency and fixed an error with \"\u0e43\u0e04\u0e23\".\n\nMore compound words have been added to the dictionary. Versions 1.1.3-1.1.5 contained many entries that were not words and had a few errors. Those entries have been removed in later versions.\n\nThe NER tagger model has been updated by using more named entity data from the AiforThai project.\n\n\ntltk.nlp : basic tools for Thai language processing.\n------------------------------------------------------\n\n\\>tltk.nlp.TextClass(text) By default, TextOption=\"par\",WordOption=\"colloc\", UDParse=\"Malt\", Classifier=\"level\" is set. If text is word segmented with \"|\", use WordOption=\"segmented\". Two classifiers are available \"level\" and \"genre\".\n\n\\>tltk.nlp.txt2feat(text, Option=\"name|value\"): Returns a list of 129 feature values analyzed from the text. If Option=\"name\", only a list of 129 feature names is returned.\n\n\\>tltk.nlp.spoonerism(word_or_phrase): Returns one or two \"spoonerisms\" derived from the input. 
For example, using `spoonerism('\u0e41\u0e02\u0e19\u0e40\u0e1b\u0e47\u0e19\u0e1f\u0e2d')` will produce the spoonerism(s).\n\n=>['\u0e04\u0e2d-\u0e40\u0e1b\u0e47\u0e19-\u0e41\u0e1d\u0e19', '\u0e02\u0e2d-\u0e40\u0e1b\u0e47\u0e19-\u0e41\u0e1f\u0e19']\n\n\\>tltk.nlp.TextAna(Text, UDParse=\"Malt\"): This function analyzes plain text by paragraph, segments words using the colloc approach, and employs MaltParse for UDParsing. The default options are TextOption=\"par\", WordOption=\"colloc\", and UDParse=\"none\". If the input is already segmented with '|', then use TextOption=\"segmented\" and WordOption=\"segmented\". If processing by EDU is preferred, set TextOption=\"edu\". If no parsing is needed, set UDParse=\"none\".\n\n=>output as a dict of text features described in TextAna\n\n\\>tltk.nlp.TextAna2json(Text, Filename, Options) functions similarly to the above, but the results are saved to a JSON file. The `Options` parameter includes a `Mode` which can be set to \"write\" or \"append\".\n\n\\>tltk.nlp.MaltParser(Text) e.g. print_dtree(tltk.nlp.MaltParser(\"\u0e40\u0e02\u0e32\u0e19\u0e31\u0e48\u0e07\u0e14\u0e39\u0e2b\u0e19\u0e31\u0e07\u0e2d\u0e22\u0e39\u0e48\u0e17\u0e35\u0e48\u0e1a\u0e49\u0e32\u0e19\"))\n\n=>\n\n* 1:----\u0e40\u0e02\u0e32 (PRON, nsubj - 2)\n* 2:--\u0e19\u0e31\u0e48\u0e07 (VERB, root - 0)\n* 3:----\u0e14\u0e39 (VERB, compound - 2)\n* 4:------\u0e2b\u0e19\u0e31\u0e07 (NOUN, obj - 3)\n* 5:------\u0e2d\u0e22\u0e39\u0e48 (VERB, compound - 3)\n* 6:----------\u0e17\u0e35\u0e48 (ADP, case - 7)\n* 7:--------\u0e1a\u0e49\u0e32\u0e19 (NOUN, obl - 5)\n\n\\>tltk.nlp.TNC_tag(Text,POSTagOption) e.g. 
tltk.nlp.TNC_tag('\u0e19\u0e32\u0e22\u0e01\u0e23\u0e31\u0e10\u0e21\u0e19\u0e15\u0e23\u0e35\u0e01\u0e25\u0e48\u0e32\u0e27\u0e01\u0e31\u0e1a\u0e04\u0e19\u0e02\u0e31\u0e1a\u0e23\u0e16\u0e1b\u0e23\u0e30\u0e08\u0e33\u0e17\u0e32\u0e07\u0e2b\u0e25\u0e27\u0e07\u0e2a\u0e32\u0e22\u0e2a\u0e2d\u0e07\u0e27\u0e48\u0e32 \u0e2d\u0e22\u0e32\u0e01\u0e27\u0e34\u0e07\u0e27\u0e2d\u0e19\u0e43\u0e2b\u0e49\u0e43\u0e0a\u0e49\u0e04\u0e27\u0e32\u0e21\u0e23\u0e2d\u0e1a\u0e04\u0e2d\u0e1a',POS='Y')\n\n=> '<w tran=\"naa0jok3rat3tha1mon0trii0\" POS=\"NOUN\">\u0e19\u0e32\u0e22\u0e01\u0e23\u0e31\u0e10\u0e21\u0e19\u0e15\u0e23\u0e35</w><w tran=\"klaaw1\" POS=\"VERB\">\u0e01\u0e25\u0e48\u0e32\u0e27</w><w tran=\"kap1\" POS=\"ADP\">\u0e01\u0e31\u0e1a</w><w tran=\"khon0khap1rot3\" POS=\"NOUN\">\u0e04\u0e19\u0e02\u0e31\u0e1a\u0e23\u0e16</w><w tran=\"pra1cam0\" POS=\"NOUN\">\u0e1b\u0e23\u0e30\u0e08\u0e33</w><w tran=\"thaaN0luuaN4\" POS=\"NOUN\">\u0e17\u0e32\u0e07\u0e2b\u0e25\u0e27\u0e07</w><w tran=\"saaj4\" POS=\"NOUN\">\u0e2a\u0e32\u0e22</w><w tran=\"sOON4\" POS=\"NUM\">\u0e2a\u0e2d\u0e07</w><w tran=\"waa2\" POS=\"SCONJ\">\u0e27\u0e48\u0e32</w><s/><w tran=\"jaak1\" POS=\"VERB\">\u0e2d\u0e22\u0e32\u0e01</w><w tran=\"wiN0wOOn0\" POS=\"VERB\">\u0e27\u0e34\u0e07\u0e27\u0e2d\u0e19</w><w tran=\"haj2\" POS=\"SCONJ\">\u0e43\u0e2b\u0e49</w><w tran=\"chaj3\" POS=\"VERB\">\u0e43\u0e0a\u0e49</w><w tran=\"khwaam0\" POS=\"NOUN\">\u0e04\u0e27\u0e32\u0e21</w><w tran=\"rOOp2khOOp2\" POS=\"VERB\">\u0e23\u0e2d\u0e1a\u0e04\u0e2d\u0e1a</w><s/>'\n\n\\>tltk.nlp.chunk(Text) : chunk parsing. The output includes markups for word segments (\\|), elementary discourse units (\\<u/\\>), pos tags (/POS),and named entities (\\<NEx\\>...\\</NEx\\>), e.g. 
tltk.nlp.chunk(\"\u0e2a\u0e33\u0e19\u0e31\u0e01\u0e07\u0e32\u0e19\u0e40\u0e02\u0e15\u0e08\u0e15\u0e38\u0e08\u0e31\u0e01\u0e23\u0e0a\u0e35\u0e49\u0e41\u0e08\u0e07\u0e27\u0e48\u0e32 \u0e44\u0e14\u0e49\u0e19\u0e33\u0e1b\u0e49\u0e32\u0e22\u0e1b\u0e23\u0e30\u0e01\u0e32\u0e28\u0e40\u0e15\u0e37\u0e2d\u0e19\u0e1b\u0e25\u0e34\u0e07\u0e44\u0e1b\u0e1b\u0e31\u0e01\u0e15\u0e32\u0e21\u0e41\u0e2b\u0e25\u0e48\u0e07\u0e19\u0e49\u0e33 \u0e43\u0e19\u0e40\u0e02\u0e15\u0e2d\u0e33\u0e40\u0e20\u0e2d\u0e40\u0e21\u0e37\u0e2d\u0e07 \u0e08\u0e31\u0e07\u0e2b\u0e27\u0e31\u0e14\u0e2d\u0e48\u0e32\u0e07\u0e17\u0e2d\u0e07 \u0e2b\u0e25\u0e31\u0e07\u0e08\u0e32\u0e01\u0e19\u0e32\u0e22\u0e2a\u0e38\u0e01\u0e34\u0e08 \u0e2d\u0e32\u0e22\u0e38 65 \u0e1b\u0e35 \u0e16\u0e39\u0e01\u0e1b\u0e25\u0e34\u0e07\u0e01\u0e31\u0e14\u0e41\u0e25\u0e49\u0e27\u0e44\u0e21\u0e48\u0e44\u0e14\u0e49\u0e44\u0e1b\u0e1e\u0e1a\u0e41\u0e1e\u0e17\u0e22\u0e4c\")\n\n=> '<NEo\\>\u0e2a\u0e33\u0e19\u0e31\u0e01\u0e07\u0e32\u0e19/NOUN|\u0e40\u0e02\u0e15/NOUN|\u0e08\u0e15\u0e38\u0e08\u0e31\u0e01\u0e23/PROPN|</NEo\\>\u0e0a\u0e35\u0e49\u0e41\u0e08\u0e07/VERB|\u0e27\u0e48\u0e32/SCONJ|\\<s/\\>/PUNCT|\u0e44\u0e14\u0e49/AUX|\u0e19\u0e33/VERB|\u0e1b\u0e49\u0e32\u0e22\u0e1b\u0e23\u0e30\u0e01\u0e32\u0e28/NOUN|\u0e40\u0e15\u0e37\u0e2d\u0e19/VERB|\u0e1b\u0e25\u0e34\u0e07/NOUN|\u0e44\u0e1b/VERB|\u0e1b\u0e31\u0e01/VERB|\u0e15\u0e32\u0e21/ADP|\u0e41\u0e2b\u0e25\u0e48\u0e07\u0e19\u0e49\u0e33/NOUN|\\<u/\\>\u0e43\u0e19/ADP|<NEl\\>\u0e40\u0e02\u0e15/NOUN|\u0e2d\u0e33\u0e40\u0e20\u0e2d/NOUN|\u0e40\u0e21\u0e37\u0e2d\u0e07/NOUN|\\<s/\\>/PUNCT|\u0e08\u0e31\u0e07\u0e2b\u0e27\u0e31\u0e14/NOUN|\u0e2d\u0e48\u0e32\u0e07\u0e17\u0e2d\u0e07/PROPN|\\</NEl\\>\\<u/\\>\u0e2b\u0e25\u0e31\u0e07\u0e08\u0e32\u0e01/SCONJ|\\<NEp\\>\u0e19\u0e32\u0e22/NOUN|\u0e2a\u0e38/PROPN|\u0e01\u0e34\u0e08/NOUN|\\</NEp\\>\\<s/\\>/PUNCT|\u0e2d\u0e32\u0e22\u0e38/NOUN|\\<u/\\>65/NUM|\\<s/\\>/PUNCT|\u0e1b\u0e35/NOUN|\\<u/\\>\u0e16\u0e39\u0e01/AUX|\u0e1b\u0e25\u0e34\u0e07/VERB|\u0e01\u0e31\u0e14/VERB
|\u0e41\u0e25\u0e49\u0e27/ADV|\u0e44\u0e21\u0e48\u0e44\u0e14\u0e49/AUX|\u0e44\u0e1b/VERB|\u0e1e\u0e1a/VERB|\u0e41\u0e1e\u0e17\u0e22\u0e4c/NOUN|\\<u/\\>'\n\n\\>tltk.nlp.segment(Text) : segment edu by marking <u\\/> e.g. tltk.nlp.segment(\"\u0e41\u0e15\u0e48\u0e2d\u0e32\u0e08\u0e40\u0e1e\u0e23\u0e32\u0e30\u0e19\u0e01\u0e01\u0e34\u0e19\u0e1b\u0e25\u0e35\u0e2d\u0e01\u0e40\u0e2b\u0e25\u0e37\u0e2d\u0e07\u0e40\u0e1b\u0e47\u0e19\u0e1e\u0e48\u0e2d\u0e41\u0e21\u0e48\u0e21\u0e37\u0e2d\u0e43\u0e2b\u0e21\u0e48\u2005\u0e23\u0e31\u0e07\u0e17\u0e35\u0e48\u0e17\u0e33\u0e08\u0e36\u0e07\u0e44\u0e21\u0e48\u0e04\u0e48\u0e2d\u0e22\u0e41\u0e02\u0e47\u0e07\u0e41\u0e23\u0e07 \u0e27\u0e31\u0e19\u0e2b\u0e19\u0e36\u0e48\u0e07\u0e23\u0e31\u0e07\u0e01\u0e47\u0e09\u0e35\u0e01\u0e40\u0e01\u0e37\u0e2d\u0e1a\u0e02\u0e32\u0e14\u0e40\u0e1b\u0e47\u0e19\u0e2a\u0e2d\u0e07\u0e17\u0e48\u0e2d\u0e19\u0e2b\u0e49\u0e2d\u0e22\u0e15\u0e48\u0e2d\u0e07\u0e41\u0e15\u0e48\u0e07 \u0e1c\u0e21\u0e1e\u0e22\u0e32\u0e22\u0e32\u0e21\u0e2b\u0e32\u0e2d\u0e38\u0e1b\u0e01\u0e23\u0e13\u0e4c\u0e21\u0e32\u0e22\u0e36\u0e14\u0e23\u0e31\u0e07\u0e01\u0e25\u0e31\u0e1a\u0e04\u0e37\u0e19\u0e23\u0e39\u0e1b\u0e17\u0e23\u0e07\u0e40\u0e14\u0e34\u0e21 \u0e02\u0e13\u0e30\u0e17\u0e35\u0e48\u0e41\u0e21\u0e48\u0e19\u0e01\u0e01\u0e34\u0e19\u0e1b\u0e25\u0e35\u0e2d\u0e01\u0e40\u0e2b\u0e25\u0e37\u0e2d\u0e07\u0e2a\u0e48\u0e07\u0e40\u0e2a\u0e35\u0e22\u0e07\u0e42\u0e27\u0e22\u0e27\u0e32\u0e22\u0e2d\u0e22\u0e39\u0e48\u0e43\u0e01\u0e25\u0e49 \u0e46 \u0e41\u0e15\u0e48\u0e2a\u0e38\u0e14\u0e17\u0e49\u0e32\u0e22\u0e44\u0e21\u0e48\u0e2a\u0e33\u0e40\u0e23\u0e47\u0e08\u2005\u0e2a\u0e2d\u0e07\u0e2a\u0e32\u0e21\u0e27\u0e31\u0e19\u0e15\u0e48\u0e2d\u0e21\u0e32\u0e23\u0e31\u0e07\u0e17\u0e35\u0e48\u0e0a\u0e48\u0e27\u0e22\u0e0b\u0e48\u0e2d\u0e21\u0e01\u0e47\u0e1e\u0e31\u0e07\u0e44\u0e1b 
\u0e44\u0e21\u0e48\u0e40\u0e2b\u0e47\u0e19\u0e41\u0e21\u0e48\u0e19\u0e01\u0e1a\u0e34\u0e19\u0e01\u0e25\u0e31\u0e1a\u0e21\u0e32\u0e2d\u0e35\u0e01\u0e40\u0e25\u0e22\")\n\n=>\"\u0e41\u0e15\u0e48|\u0e2d\u0e32\u0e08|\u0e40\u0e1e\u0e23\u0e32\u0e30|\u0e19\u0e01\u0e01\u0e34\u0e19\u0e1b\u0e25\u0e35\u0e2d\u0e01\u0e40\u0e2b\u0e25\u0e37\u0e2d\u0e07|\u0e40\u0e1b\u0e47\u0e19|\u0e1e\u0e48\u0e2d\u0e41\u0e21\u0e48|\u0e21\u0e37\u0e2d\u0e43\u0e2b\u0e21\u0e48|<s/>|\u0e23\u0e31\u0e07|\u0e17\u0e35\u0e48|\u0e17\u0e33|\u0e08\u0e36\u0e07|\u0e44\u0e21\u0e48\u0e04\u0e48\u0e2d\u0e22|\u0e41\u0e02\u0e47\u0e07\u0e41\u0e23\u0e07<u/>\u0e27\u0e31\u0e19|\u0e2b\u0e19\u0e36\u0e48\u0e07|\u0e23\u0e31\u0e07|\u0e01\u0e47|\u0e09\u0e35\u0e01|\u0e40\u0e01\u0e37\u0e2d\u0e1a|\u0e02\u0e32\u0e14|\u0e40\u0e1b\u0e47\u0e19|\u0e2a\u0e2d\u0e07|\u0e17\u0e48\u0e2d\u0e19|\u0e2b\u0e49\u0e2d\u0e22|\u0e15\u0e48\u0e2d\u0e07\u0e41\u0e15\u0e48\u0e07<u/>\u0e1c\u0e21|\u0e1e\u0e22\u0e32\u0e22\u0e32\u0e21|\u0e2b\u0e32|\u0e2d\u0e38\u0e1b\u0e01\u0e23\u0e13\u0e4c|\u0e21\u0e32|\u0e22\u0e36\u0e14|\u0e23\u0e31\u0e07|\u0e01\u0e25\u0e31\u0e1a\u0e04\u0e37\u0e19|\u0e23\u0e39\u0e1b\u0e17\u0e23\u0e07|\u0e40\u0e14\u0e34\u0e21<u/>\u0e02\u0e13\u0e30\u0e17\u0e35\u0e48|\u0e41\u0e21\u0e48|\u0e19\u0e01\u0e01\u0e34\u0e19\u0e1b\u0e25\u0e35\u0e2d\u0e01\u0e40\u0e2b\u0e25\u0e37\u0e2d\u0e07|\u0e2a\u0e48\u0e07\u0e40\u0e2a\u0e35\u0e22\u0e07|\u0e42\u0e27\u0e22\u0e27\u0e32\u0e22|\u0e2d\u0e22\u0e39\u0e48|\u0e43\u0e01\u0e25\u0e49|\u0e46|<s/><u/>\u0e41\u0e15\u0e48|\u0e2a\u0e38\u0e14\u0e17\u0e49\u0e32\u0e22|\u0e44\u0e21\u0e48|\u0e2a\u0e33\u0e40\u0e23\u0e47\u0e08<u/>\u0e2a\u0e2d\u0e07|\u0e2a\u0e32\u0e21|\u0e27\u0e31\u0e19|\u0e15\u0e48\u0e2d\u0e21\u0e32|\u0e23\u0e31\u0e07|\u0e17\u0e35\u0e48|\u0e0a\u0e48\u0e27\u0e22|\u0e0b\u0e48\u0e2d\u0e21|\u0e01\u0e47|\u0e1e\u0e31\u0e07|\u0e44\u0e1b<u/>\u0e44\u0e21\u0e48|\u0e40\u0e2b\u0e47\u0e19|\u0e41\u0e21\u0e48|\u0e19\u0e01|\u0e1a\u0e34\u0e19|\u0e01\u0e25\u0e31\u0e1a|\u0e21\u0e32|\u0e2d\u0e35\u0e01|\u0e40\u0e25\u0e22<u/>\"\n\n\
\>tltk.nlp.ner_tag(Text) : The output includes markups for named entities (\\<NEx\\>...\\</NEx\\>), e.g. tltk.nlp.ner_tag(\"\u0e2a\u0e33\u0e19\u0e31\u0e01\u0e07\u0e32\u0e19\u0e40\u0e02\u0e15\u0e08\u0e15\u0e38\u0e08\u0e31\u0e01\u0e23\u0e0a\u0e35\u0e49\u0e41\u0e08\u0e07\u0e27\u0e48\u0e32 \u0e44\u0e14\u0e49\u0e19\u0e33\u0e1b\u0e49\u0e32\u0e22\u0e1b\u0e23\u0e30\u0e01\u0e32\u0e28\u0e40\u0e15\u0e37\u0e2d\u0e19\u0e1b\u0e25\u0e34\u0e07\u0e44\u0e1b\u0e1b\u0e31\u0e01\u0e15\u0e32\u0e21\u0e41\u0e2b\u0e25\u0e48\u0e07\u0e19\u0e49\u0e33 \u0e43\u0e19\u0e40\u0e02\u0e15\u0e2d\u0e33\u0e40\u0e20\u0e2d\u0e40\u0e21\u0e37\u0e2d\u0e07 \u0e08\u0e31\u0e07\u0e2b\u0e27\u0e31\u0e14\u0e2d\u0e48\u0e32\u0e07\u0e17\u0e2d\u0e07 \u0e2b\u0e25\u0e31\u0e07\u0e08\u0e32\u0e01\u0e19\u0e32\u0e22\u0e2a\u0e38\u0e01\u0e34\u0e08 \u0e2d\u0e32\u0e22\u0e38 65 \u0e1b\u0e35 \u0e16\u0e39\u0e01\u0e1b\u0e25\u0e34\u0e07\u0e01\u0e31\u0e14\u0e41\u0e25\u0e49\u0e27\u0e44\u0e21\u0e48\u0e44\u0e14\u0e49\u0e44\u0e1b\u0e1e\u0e1a\u0e41\u0e1e\u0e17\u0e22\u0e4c\")\n\n=> '\\<NEo\\>\u0e2a\u0e33\u0e19\u0e31\u0e01\u0e07\u0e32\u0e19\u0e40\u0e02\u0e15\u0e08\u0e15\u0e38\u0e08\u0e31\u0e01\u0e23\\</NEo\\>\u0e0a\u0e35\u0e49\u0e41\u0e08\u0e07\u0e27\u0e48\u0e32 \u0e44\u0e14\u0e49\u0e19\u0e33\u0e1b\u0e49\u0e32\u0e22\u0e1b\u0e23\u0e30\u0e01\u0e32\u0e28\u0e40\u0e15\u0e37\u0e2d\u0e19\u0e1b\u0e25\u0e34\u0e07\u0e44\u0e1b\u0e1b\u0e31\u0e01\u0e15\u0e32\u0e21\u0e41\u0e2b\u0e25\u0e48\u0e07\u0e19\u0e49\u0e33 \u0e43\u0e19\\<NEl\\>\u0e40\u0e02\u0e15\u0e2d\u0e33\u0e40\u0e20\u0e2d\u0e40\u0e21\u0e37\u0e2d\u0e07 \u0e08\u0e31\u0e07\u0e2b\u0e27\u0e31\u0e14\u0e2d\u0e48\u0e32\u0e07\u0e17\u0e2d\u0e07\\</NEl\\> \u0e2b\u0e25\u0e31\u0e07\u0e08\u0e32\u0e01\\<NEp\\>\u0e19\u0e32\u0e22\u0e2a\u0e38\u0e01\u0e34\u0e08\\</NEp\\> \u0e2d\u0e32\u0e22\u0e38 65 \u0e1b\u0e35 
\u0e16\u0e39\u0e01\u0e1b\u0e25\u0e34\u0e07\u0e01\u0e31\u0e14\u0e41\u0e25\u0e49\u0e27\u0e44\u0e21\u0e48\u0e44\u0e14\u0e49\u0e44\u0e1b\u0e1e\u0e1a\u0e41\u0e1e\u0e17\u0e22\u0e4c'\n\n\\>tltk.nlp.ner([(w,pos),....]) : module for named entity recognition (person, organization, location), e.g. tltk.nlp.ner([('\u0e2a\u0e33\u0e19\u0e31\u0e01\u0e07\u0e32\u0e19', 'NOUN'), ('\u0e40\u0e02\u0e15', 'NOUN'), ('\u0e08\u0e15\u0e38\u0e08\u0e31\u0e01\u0e23', 'PROPN'), ('\u0e0a\u0e35\u0e49\u0e41\u0e08\u0e07', 'VERB'), ('\u0e27\u0e48\u0e32', 'SCONJ'), ('\\<s/\\>', 'PUNCT')])\n\n=> [('\u0e2a\u0e33\u0e19\u0e31\u0e01\u0e07\u0e32\u0e19', 'NOUN', 'B-O'), ('\u0e40\u0e02\u0e15', 'NOUN', 'I-O'), ('\u0e08\u0e15\u0e38\u0e08\u0e31\u0e01\u0e23', 'PROPN', 'I-O'), ('\u0e0a\u0e35\u0e49\u0e41\u0e08\u0e07', 'VERB', 'O'), ('\u0e27\u0e48\u0e32', 'SCONJ', 'O'), ('\\<s/\\>', 'PUNCT', 'O')]\nNamed entity recognition is based on the CRF model adapted from the http://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html tutorial. The model was trained on a corpus containing 170,000 named entities. The tags used for organizations are B-O and I-O, for persons are B-P and I-P, and for locations are B-L and I-L.\n\n\\>tltk.nlp.pos_tag(Text,WordSegmentOption) : word segmentation and POS tagging (using nltk.tag.perceptron), e.g. 
tltk.nlp.pos_tag('\u0e42\u0e1b\u0e23\u0e41\u0e01\u0e23\u0e21\u0e2a\u0e33\u0e2b\u0e23\u0e31\u0e1a\u0e43\u0e2a\u0e48\u0e41\u0e17\u0e47\u0e01\u0e2b\u0e21\u0e27\u0e14\u0e04\u0e33\u0e20\u0e32\u0e29\u0e32\u0e44\u0e17\u0e22 \u0e27\u0e31\u0e19\u0e19\u0e35\u0e49\u0e43\u0e0a\u0e49\u0e07\u0e32\u0e19\u0e44\u0e14\u0e49\u0e1a\u0e49\u0e32\u0e07\u0e41\u0e25\u0e49\u0e27') \n\n=> [[('\u0e42\u0e1b\u0e23\u0e41\u0e01\u0e23\u0e21', 'NOUN'), ('\u0e2a\u0e33\u0e2b\u0e23\u0e31\u0e1a', 'ADP'), ('\u0e43\u0e2a\u0e48', 'VERB'), ('\u0e41\u0e17\u0e47\u0e01', 'NOUN'), ('\u0e2b\u0e21\u0e27\u0e14\u0e04\u0e33', 'NOUN'), ('\u0e20\u0e32\u0e29\u0e32\u0e44\u0e17\u0e22', 'PROPN'), ('\\<s/\\>', 'PUNCT')], [('\u0e27\u0e31\u0e19\u0e19\u0e35\u0e49', 'NOUN'), ('\u0e43\u0e0a\u0e49\u0e07\u0e32\u0e19', 'VERB'), ('\u0e44\u0e14\u0e49', 'ADV'), ('\u0e1a\u0e49\u0e32\u0e07', 'ADV'), ('\u0e41\u0e25\u0e49\u0e27', 'ADV'), ('\\<s/\\>', 'PUNCT')]]\n\nThe default word segmentation method is \"colloc\", i.e. word_segment(Text, \"colloc\"); if the option is set to \"mm\", word_segment(Text, \"mm\") is used instead. The POS tag set is based on the Universal POS tag set found at http://universaldependencies.org/u/pos/index.html. \nThe nltk.tag.perceptron model used for POS tagging was trained on a POS-tagged subcorpus of TNC consisting of 148,000 words.\n\n\\>tltk.nlp.pos_tag_wordlist(WordLst) : Same as \"tltk.nlp.pos_tag\", but the input is a word list, [w1,w2,...]\n\n\\>tltk.nlp.segment(Text) : segment a paragraph into elementary discourse units (edu) marked with \\<u/\\> and segment words in each edu e.g. 
tltk.nlp.segment(\"\u0e41\u0e15\u0e48\u0e2d\u0e32\u0e08\u0e40\u0e1e\u0e23\u0e32\u0e30\u0e19\u0e01\u0e01\u0e34\u0e19\u0e1b\u0e25\u0e35\u0e2d\u0e01\u0e40\u0e2b\u0e25\u0e37\u0e2d\u0e07\u0e40\u0e1b\u0e47\u0e19\u0e1e\u0e48\u0e2d\u0e41\u0e21\u0e48\u0e21\u0e37\u0e2d\u0e43\u0e2b\u0e21\u0e48\u2005\u0e23\u0e31\u0e07\u0e17\u0e35\u0e48\u0e17\u0e33\u0e08\u0e36\u0e07\u0e44\u0e21\u0e48\u0e04\u0e48\u0e2d\u0e22\u0e41\u0e02\u0e47\u0e07\u0e41\u0e23\u0e07 \u0e27\u0e31\u0e19\u0e2b\u0e19\u0e36\u0e48\u0e07\u0e23\u0e31\u0e07\u0e01\u0e47\u0e09\u0e35\u0e01\u0e40\u0e01\u0e37\u0e2d\u0e1a\u0e02\u0e32\u0e14\u0e40\u0e1b\u0e47\u0e19\u0e2a\u0e2d\u0e07\u0e17\u0e48\u0e2d\u0e19\u0e2b\u0e49\u0e2d\u0e22\u0e15\u0e48\u0e2d\u0e07\u0e41\u0e15\u0e48\u0e07 \u0e1c\u0e21\u0e1e\u0e22\u0e32\u0e22\u0e32\u0e21\u0e2b\u0e32\u0e2d\u0e38\u0e1b\u0e01\u0e23\u0e13\u0e4c\u0e21\u0e32\u0e22\u0e36\u0e14\u0e23\u0e31\u0e07\u0e01\u0e25\u0e31\u0e1a\u0e04\u0e37\u0e19\u0e23\u0e39\u0e1b\u0e17\u0e23\u0e07\u0e40\u0e14\u0e34\u0e21 \u0e02\u0e13\u0e30\u0e17\u0e35\u0e48\u0e41\u0e21\u0e48\u0e19\u0e01\u0e01\u0e34\u0e19\u0e1b\u0e25\u0e35\u0e2d\u0e01\u0e40\u0e2b\u0e25\u0e37\u0e2d\u0e07\u0e2a\u0e48\u0e07\u0e40\u0e2a\u0e35\u0e22\u0e07\u0e42\u0e27\u0e22\u0e27\u0e32\u0e22\u0e2d\u0e22\u0e39\u0e48\u0e43\u0e01\u0e25\u0e49 \u0e46 \u0e41\u0e15\u0e48\u0e2a\u0e38\u0e14\u0e17\u0e49\u0e32\u0e22\u0e44\u0e21\u0e48\u0e2a\u0e33\u0e40\u0e23\u0e47\u0e08\u2005\u0e2a\u0e2d\u0e07\u0e2a\u0e32\u0e21\u0e27\u0e31\u0e19\u0e15\u0e48\u0e2d\u0e21\u0e32\u0e23\u0e31\u0e07\u0e17\u0e35\u0e48\u0e0a\u0e48\u0e27\u0e22\u0e0b\u0e48\u0e2d\u0e21\u0e01\u0e47\u0e1e\u0e31\u0e07\u0e44\u0e1b \u0e44\u0e21\u0e48\u0e40\u0e2b\u0e47\u0e19\u0e41\u0e21\u0e48\u0e19\u0e01\u0e1a\u0e34\u0e19\u0e01\u0e25\u0e31\u0e1a\u0e21\u0e32\u0e2d\u0e35\u0e01\u0e40\u0e25\u0e22\") \n\n=> 
'\u0e41\u0e15\u0e48|\u0e2d\u0e32\u0e08|\u0e40\u0e1e\u0e23\u0e32\u0e30|\u0e19\u0e01\u0e01\u0e34\u0e19\u0e1b\u0e25\u0e35\u0e2d\u0e01\u0e40\u0e2b\u0e25\u0e37\u0e2d\u0e07|\u0e40\u0e1b\u0e47\u0e19|\u0e1e\u0e48\u0e2d\u0e41\u0e21\u0e48|\u0e21\u0e37\u0e2d\u0e43\u0e2b\u0e21\u0e48|\\<s/\\>|\u0e23\u0e31\u0e07|\u0e17\u0e35\u0e48|\u0e17\u0e33|\u0e08\u0e36\u0e07|\u0e44\u0e21\u0e48|\u0e04\u0e48\u0e2d\u0e22|\u0e41\u0e02\u0e47\u0e07\u0e41\u0e23\u0e07\\<u/\\>\u0e27\u0e31\u0e19|\u0e2b\u0e19\u0e36\u0e48\u0e07|\u0e23\u0e31\u0e07|\u0e01\u0e47|\u0e09\u0e35\u0e01|\u0e40\u0e01\u0e37\u0e2d\u0e1a|\u0e02\u0e32\u0e14|\u0e40\u0e1b\u0e47\u0e19|\u0e2a\u0e2d\u0e07|\u0e17\u0e48\u0e2d\u0e19|\u0e2b\u0e49\u0e2d\u0e22|\u0e15\u0e48\u0e2d\u0e07\u0e41\u0e15\u0e48\u0e07\\<u/\\>\u0e1c\u0e21|\u0e1e\u0e22\u0e32\u0e22\u0e32\u0e21|\u0e2b\u0e32|\u0e2d\u0e38\u0e1b\u0e01\u0e23\u0e13\u0e4c|\u0e21\u0e32|\u0e22\u0e36\u0e14|\u0e23\u0e31\u0e07|\u0e01\u0e25\u0e31\u0e1a\u0e04\u0e37\u0e19|\u0e23\u0e39\u0e1b\u0e17\u0e23\u0e07|\u0e40\u0e14\u0e34\u0e21\\<u/\\>\u0e02\u0e13\u0e30|\u0e17\u0e35\u0e48|\u0e41\u0e21\u0e48|\u0e19\u0e01\u0e01\u0e34\u0e19\u0e1b\u0e25\u0e35\u0e2d\u0e01\u0e40\u0e2b\u0e25\u0e37\u0e2d\u0e07|\u0e2a\u0e48\u0e07\u0e40\u0e2a\u0e35\u0e22\u0e07|\u0e42\u0e27\u0e22\u0e27\u0e32\u0e22|\u0e2d\u0e22\u0e39\u0e48|\u0e43\u0e01\u0e25\u0e49|\u0e46\\<u/\\>\u0e41\u0e15\u0e48|\u0e2a\u0e38\u0e14\u0e17\u0e49\u0e32\u0e22|\u0e44\u0e21\u0e48|\u0e2a\u0e33\u0e40\u0e23\u0e47\u0e08|\\<s/\\>|\u0e2a\u0e2d\u0e07|\u0e2a\u0e32\u0e21|\u0e27\u0e31\u0e19|\u0e15\u0e48\u0e2d|\u0e21\u0e32|\u0e23\u0e31\u0e07|\u0e17\u0e35\u0e48|\u0e0a\u0e48\u0e27\u0e22|\u0e0b\u0e48\u0e2d\u0e21|\u0e01\u0e47|\u0e1e\u0e31\u0e07|\u0e44\u0e1b\\<u/\\>\u0e44\u0e21\u0e48|\u0e40\u0e2b\u0e47\u0e19|\u0e41\u0e21\u0e48|\u0e19\u0e01|\u0e1a\u0e34\u0e19|\u0e01\u0e25\u0e31\u0e1a|\u0e21\u0e32|\u0e2d\u0e35\u0e01|\u0e40\u0e25\u0e22\\<u/\\>' edu segmentation is based on syllable input using RandomForestClassifier model, which is trained on an edu-segmented corpus (approx. 
7,000 edus) created and used in Nalinee\\'s thesis \n\n\\>tltk.nlp.word_segment(Text,method='mm|ngram|colloc') : word segmentation using either maximum matching or ngram or maximum collocation approach. 'colloc' is used by default. Please note that the first run of ngram method would take a long time because TNC.3g will be loaded for ngram calculation. e.g. \n\n\\>tltk.nlp.word_segment('\u0e1c\u0e39\u0e49\u0e2a\u0e37\u0e48\u0e2d\u0e02\u0e48\u0e32\u0e27\u0e23\u0e32\u0e22\u0e07\u0e32\u0e19\u0e27\u0e48\u0e32\u0e19\u0e32\u0e22\u0e01\u0e23\u0e31\u0e10\u0e21\u0e19\u0e15\u0e23\u0e35\u0e44\u0e21\u0e48\u0e21\u0e32\u0e17\u0e33\u0e07\u0e32\u0e19\u0e17\u0e35\u0e48\u0e17\u0e33\u0e40\u0e19\u0e35\u0e22\u0e1a\u0e23\u0e31\u0e10\u0e1a\u0e32\u0e25')\n=> '\u0e1c\u0e39\u0e49\u0e2a\u0e37\u0e48\u0e2d\u0e02\u0e48\u0e32\u0e27|\u0e23\u0e32\u0e22\u0e07\u0e32\u0e19|\u0e27\u0e48\u0e32|\u0e19\u0e32\u0e22\u0e01\u0e23\u0e31\u0e10\u0e21\u0e19\u0e15\u0e23\u0e35|\u0e44\u0e21\u0e48|\u0e21\u0e32|\u0e17\u0e33\u0e07\u0e32\u0e19|\u0e17\u0e35\u0e48|\u0e17\u0e33\u0e40\u0e19\u0e35\u0e22\u0e1a\u0e23\u0e31\u0e10\u0e1a\u0e32\u0e25|\\<s/>'\n\n\\>tltk.nlp.syl_segment(Text) : syllable segmentation using 3gram statistics e.g. tltk.nlp.syl_segment('\u0e42\u0e1b\u0e23\u0e41\u0e01\u0e23\u0e21\u0e2a\u0e33\u0e2b\u0e23\u0e31\u0e1a\u0e1b\u0e23\u0e30\u0e21\u0e27\u0e25\u0e1c\u0e25\u0e20\u0e32\u0e29\u0e32\u0e44\u0e17\u0e22') \n\n=> '\u0e42\u0e1b\u0e23~\u0e41\u0e01\u0e23\u0e21~\u0e2a\u0e33~\u0e2b\u0e23\u0e31\u0e1a~\u0e1b\u0e23\u0e30~\u0e21\u0e27\u0e25~\u0e1c\u0e25~\u0e20\u0e32~\u0e29\u0e32~\u0e44\u0e17\u0e22\\<s/>'\n\n\\>tltk.nlp.word_segment_nbest(Text, N) : return the best N segmentations based on the assumption of minimum word approach. e.g. 
tltk.nlp.word_segment_nbest('\u0e04\u0e19\u0e02\u0e31\u0e1a\u0e23\u0e16\u0e1b\u0e23\u0e30\u0e08\u0e33\u0e17\u0e32\u0e07\u0e1b\u0e23\u0e31\u0e1a\u0e2d\u0e32\u0e01\u0e32\u0e28',10) \n\n=> [['\u0e04\u0e19\u0e02\u0e31\u0e1a|\u0e23\u0e16\u0e1b\u0e23\u0e30\u0e08\u0e33\u0e17\u0e32\u0e07|\u0e1b\u0e23\u0e31\u0e1a\u0e2d\u0e32\u0e01\u0e32\u0e28', '\u0e04\u0e19\u0e02\u0e31\u0e1a\u0e23\u0e16|\u0e1b\u0e23\u0e30\u0e08\u0e33\u0e17\u0e32\u0e07|\u0e1b\u0e23\u0e31\u0e1a\u0e2d\u0e32\u0e01\u0e32\u0e28', '\u0e04\u0e19|\u0e02\u0e31\u0e1a|\u0e23\u0e16\u0e1b\u0e23\u0e30\u0e08\u0e33\u0e17\u0e32\u0e07|\u0e1b\u0e23\u0e31\u0e1a\u0e2d\u0e32\u0e01\u0e32\u0e28', '\u0e04\u0e19|\u0e02\u0e31\u0e1a\u0e23\u0e16|\u0e1b\u0e23\u0e30\u0e08\u0e33\u0e17\u0e32\u0e07|\u0e1b\u0e23\u0e31\u0e1a\u0e2d\u0e32\u0e01\u0e32\u0e28', '\u0e04\u0e19\u0e02\u0e31\u0e1a|\u0e23\u0e16|\u0e1b\u0e23\u0e30\u0e08\u0e33\u0e17\u0e32\u0e07|\u0e1b\u0e23\u0e31\u0e1a\u0e2d\u0e32\u0e01\u0e32\u0e28', '\u0e04\u0e19\u0e02\u0e31\u0e1a\u0e23\u0e16|\u0e1b\u0e23\u0e30\u0e08\u0e33|\u0e17\u0e32\u0e07|\u0e1b\u0e23\u0e31\u0e1a\u0e2d\u0e32\u0e01\u0e32\u0e28', '\u0e04\u0e19\u0e02\u0e31\u0e1a|\u0e23\u0e16\u0e1b\u0e23\u0e30\u0e08\u0e33\u0e17\u0e32\u0e07|\u0e1b\u0e23\u0e31\u0e1a|\u0e2d\u0e32\u0e01\u0e32\u0e28', '\u0e04\u0e19\u0e02\u0e31\u0e1a\u0e23\u0e16|\u0e1b\u0e23\u0e30\u0e08\u0e33\u0e17\u0e32\u0e07|\u0e1b\u0e23\u0e31\u0e1a|\u0e2d\u0e32\u0e01\u0e32\u0e28', '\u0e04\u0e19|\u0e02\u0e31\u0e1a|\u0e23\u0e16|\u0e1b\u0e23\u0e30\u0e08\u0e33\u0e17\u0e32\u0e07|\u0e1b\u0e23\u0e31\u0e1a\u0e2d\u0e32\u0e01\u0e32\u0e28', '\u0e04\u0e19\u0e02\u0e31\u0e1a|\u0e23|\u0e16|\u0e1b\u0e23\u0e30\u0e08\u0e33\u0e17\u0e32\u0e07|\u0e1b\u0e23\u0e31\u0e1a\u0e2d\u0e32\u0e01\u0e32\u0e28']]\n\n\\>tltk.nlp.g2p(Text) : returns word segments and pronunciations\ne.g. 
tltk.nlp.g2p(\"\u0e2a\u0e16\u0e32\u0e1a\u0e31\u0e19\u0e2d\u0e38\u0e14\u0e21\u0e28\u0e36\u0e01\u0e29\u0e32\u0e44\u0e21\u0e48\u0e2a\u0e32\u0e21\u0e32\u0e23\u0e16\u0e01\u0e49\u0e32\u0e27\u0e43\u0e2b\u0e49\u0e17\u0e31\u0e19\u0e01\u0e32\u0e23\u0e40\u0e1b\u0e25\u0e35\u0e48\u0e22\u0e19\u0e41\u0e1b\u0e25\u0e07\u0e02\u0e2d\u0e07\u0e15\u0e25\u0e32\u0e14\u0e41\u0e23\u0e07\u0e07\u0e32\u0e19\") \n\n=> \"\u0e2a\u0e16\u0e32~\u0e1a\u0e31\u0e19~\u0e2d\u0e38~\u0e14\u0e21~\u0e28\u0e36\u0e01~\u0e29\u0e32|\u0e44\u0e21\u0e48|\u0e2a\u0e32~\u0e21\u0e32\u0e23\u0e16|\u0e01\u0e49\u0e32\u0e27|\u0e43\u0e2b\u0e49|\u0e17\u0e31\u0e19|\u0e01\u0e32\u0e23|\u0e40\u0e1b\u0e25\u0e35\u0e48\u0e22\u0e19~\u0e41\u0e1b\u0e25\u0e07|\u0e02\u0e2d\u0e07|\u0e15\u0e25\u0e32\u0e14~\u0e41\u0e23\u0e07~\u0e07\u0e32\u0e19\\<tr/\\>sa1'thaa4~ban0~?u1~dom0~sUk1~saa4|maj2|saa4~maat2|kaaw2|haj2|than0|kaan0|pliian1~plxxN0|khOON4|ta1'laat1~rxxN0~Naan0|\\<s/\\>\"\n\n\\>tltk.nlp.th2ipa(Text) : return Thai transcription in IPA forms\ne.g. tltk.nlp.th2ipa(\"\u0e25\u0e07\u0e41\u0e21\u0e48\u0e19\u0e49\u0e33\u0e23\u0e2d\u0e40\u0e14\u0e34\u0e19\u0e44\u0e1b\u0e2b\u0e32\u0e1b\u0e25\u0e32\") \n\n=> 'lo\u014b1 m\u025b\u02d03.na\u02d0m4 r\u1d10\u02d01 d\u0264\u02d0n1 paj1 ha\u02d05 pla\u02d01 \\<s/\\>'\n\n\\>tltk.nlp.th2roman(Text) : return Thai romanization according to Royal Thai Institute guideline.\n.e.g. tltk.nlp.th2roman(\"\u0e04\u0e37\u0e2d\u0e40\u0e02\u0e32\u0e40\u0e14\u0e34\u0e19\u0e40\u0e25\u0e22\u0e25\u0e07\u0e44\u0e1b\u0e23\u0e2d\u0e43\u0e19\u0e41\u0e21\u0e48\u0e19\u0e49\u0e33\u0e2a\u0e30\u0e2d\u0e32\u0e14\u0e44\u0e1b\u0e2b\u0e32\u0e21\u0e30\u0e1b\u0e23\u0e32\u0e07\") \n\n=> 'khue khaw doen loei long pai ro nai maenam sa-at pai ha maprang \\<s/>'\n\n\\>tltk.nlp.th2read(Text) : convert text into Thai reading forms, e.g. 
th2read('\u0e2a\u0e32\u0e21\u0e32\u0e23\u0e16\u0e40\u0e02\u0e35\u0e22\u0e19\u0e04\u0e33\u0e2d\u0e48\u0e32\u0e19\u0e20\u0e32\u0e29\u0e32\u0e44\u0e17\u0e22\u0e44\u0e14\u0e49') \n\n=> '\u0e2a\u0e32-\u0e21\u0e32\u0e14-\u0e40\u0e02\u0e35\u0e22\u0e19-\u0e04\u0e31\u0e21-\u0e2d\u0e48\u0e32\u0e19-\u0e1e\u0e32-\u0e2a\u0e32-\u0e44\u0e17-\u0e14\u0e49\u0e32\u0e22-'\n\n\\>tltk.nlp.th2ipa_all(Text) : return all transcriptions (IPA) as a list of tuple (syllable_list, transcription). Transcription is based on syllable reading rules. It could be different from th2ipa.\ne.g. tltk.nlp.th2ipa_all(\"\u0e23\u0e2d\u0e22\u0e01\u0e23\u0e48\u0e32\u0e07\") \n\n=> [('\u0e23\u0e2d\u0e22~\u0e01\u0e23\u0e48\u0e32\u0e07', 'r\u1d10\u02d0j1.ka2.ra\u02d0\u014b2'), ('\u0e23\u0e2d\u0e22~\u0e01\u0e23\u0e48\u0e32\u0e07', 'r\u1d10\u02d0j1.kra\u02d0\u014b2'), ('\u0e23\u0e2d~\u0e22\u0e01~\u0e23\u0e48\u0e32\u0e07', 'r\u1d10\u02d01.jok4.ra\u02d0\u014b3')]\n\n\\>tltk.nlp.spell_candidates(Word) : list of possible correct words using minimum edit distance, e.g. tltk.nlp.spell_candidates('\u0e23\u0e31\u0e01\u0e29')\n\n=> ['\u0e23\u0e31\u0e01', '\u0e17\u0e31\u0e01\u0e29', '\u0e23\u0e31\u0e01\u0e29\u0e32', '\u0e23\u0e31\u0e01\u0e29\u0e4c']\n\n\\>tltk.nlp.spell_variants(Word, InDict=\"no|yes\", Karan=\"exclude|include\"):\n\nThis function returns a list of word variants with the same pronunciation as the input Word. The InDict parameter allows the option \"yes\" to save only words found in the dictionary, while the default option \"no\" includes all variants regardless of their dictionary status. The Karan parameter allows the option \"include\" to include words spelled with the karan character, while the default option \"exclude\" excludes them. 
For example, tltk.nlp.spell_variants('\u0e42\u0e04\u0e27\u0e34\u0e14').\n\n=> ['\u0e42\u0e06\u0e27\u0e34\u0e18', '\u0e42\u0e06\u0e27\u0e34\u0e15', '\u0e42\u0e06\u0e27\u0e34\u0e14', '\u0e42\u0e06\u0e27\u0e34\u0e17', '\u0e42\u0e06\u0e27\u0e34\u0e0a', '\u0e42\u0e06\u0e27\u0e34\u0e08', '\u0e42\u0e06\u0e27\u0e34\u0e2a', '\u0e42\u0e06\u0e27\u0e34\u0e29', '\u0e42\u0e06\u0e27\u0e34\u0e15\u0e23', '\u0e42\u0e06\u0e27\u0e34\u0e12', '\u0e42\u0e06\u0e27\u0e34\u0e0f', '\u0e42\u0e06\u0e27\u0e34\u0e0b', '\u0e42\u0e04\u0e27\u0e34\u0e18', '\u0e42\u0e04\u0e27\u0e34\u0e15', '\u0e42\u0e04\u0e27\u0e34\u0e14', '\u0e42\u0e04\u0e27\u0e34\u0e17', '\u0e42\u0e04\u0e27\u0e34\u0e0a', '\u0e42\u0e04\u0e27\u0e34\u0e08', '\u0e42\u0e04\u0e27\u0e34\u0e2a', '\u0e42\u0e04\u0e27\u0e34\u0e29', '\u0e42\u0e04\u0e27\u0e34\u0e15\u0e23', '\u0e42\u0e04\u0e27\u0e34\u0e12', '\u0e42\u0e04\u0e27\u0e34\u0e0f', '\u0e42\u0e04\u0e27\u0e34\u0e0b']\n\nOther defined functions in the package:\n\\>tltk.nlp.reset_thaidict() : clear dictionary content\n\\>tltk.nlp.read_thaidict(DictFile) : add a new dictionary e.g. tltk.nlp.read_thaidict('BEST.dict')\n\\>tltk.nlp.check_thaidict(Word) : check whether Word exists in the dictionary\n\ntltk.corpus : basic tools for corpus enquiry\n-----------------------------------------------\n\n\\>tltk.corpus.compound(w1, w2): Evaluates the similarity between combinations of w1 and w2, specifically w1-w2, w1-w1w2, and w2-w1w2. 
For instance, invoking `tltk.corpus.compound('\u0e01\u0e25\u0e31\u0e14','\u0e01\u0e25\u0e38\u0e49\u0e21')` indicates that '\u0e01\u0e25\u0e31\u0e14\u0e01\u0e25\u0e38\u0e49\u0e21' is more similar to '\u0e01\u0e25\u0e38\u0e49\u0e21'.\n\n=>[(('\u0e01\u0e25\u0e38\u0e49\u0e21', '\u0e01\u0e25\u0e31\u0e14\u0e01\u0e25\u0e38\u0e49\u0e21'), 0.42245594), (('\u0e01\u0e25\u0e31\u0e14', '\u0e01\u0e25\u0e31\u0e14\u0e01\u0e25\u0e38\u0e49\u0e21'), 0.09066804), (('\u0e01\u0e25\u0e31\u0e14', '\u0e01\u0e25\u0e38\u0e49\u0e21'), 0.0011619462)]\n\n\\>tltk.corpus.Corpus_build(DIR, filetype=\"xxx\") creates a corpus as a list of paragraphs from files located in the directory specified by DIR. The default file type is .txt. The files must be pre-segmented into words, with each word separated by the | character, e.g. w1|w2|w3|w4 ....\n\n\\>tltk.corpus.Corpus() creates a corpus object that has three methods:\n\n- x.frequency(C): This method returns a dictionary mapping each word in the corpus C to its frequency.\n- x.dispersion(C): This method returns a dispersion plot for a given word list C in the corpus.\n- x.totalword(C): This method returns the total number of words in the corpus that match a given word list C.\n\nHere, C is the result created from Corpus_build.\n\n\\>C = tltk.corpus.Corpus_build('temp/data/')\n\n\\>corp = tltk.corpus.Corpus()\n\n\\>print(corp.frequency(C))\n\n=> {'\u0e08\u0e31\u0e07\u0e2b\u0e27\u0e31\u0e14': 32, '\u0e2a\u0e21\u0e38\u0e17\u0e23\u0e2a\u0e32\u0e04\u0e23': 16, '\u0e40\u0e1b\u0e34\u0e14': 3, '\u0e28\u0e39\u0e19\u0e22\u0e4c': 13, '\u0e04\u0e27\u0e1a\u0e04\u0e38\u0e21': 13, '\u0e41\u0e08\u0e49\u0e07': 16, .....}\n\n\\>tltk.corpus.Xwordlist() creates a comparison object that compares two word lists A and B generated from the Corp.frequency() method. 
The Corp object is created from Corpus().\n\nFour comparison methods are defined in this object:\n\n- onlyA(): This method returns the list of words that occur only in A.\n- onlyB(): This method returns the list of words that occur only in B.\n- intersect(): This method returns the list of words that occur in both A and B.\n- union(): This method returns the list of words that occur in either A or B (or both).\n\nHere, c1 and c2 are Corpus() objects, Xcomp is an Xwordlist() object, and parsA and parsB are paragraph lists created with Corpus_build(...).\n\nFor example, Xcomp.onlyA(c1.frequency(parsA), c2.frequency(parsB)).\n\n\\>tltk.corpus.W2V_train(Corpus) creates a Word2Vec model. The input is a corpus created from Corpus_build.\n\n\\>tltk.corpus.D2V_train(Corpus) creates a Doc2Vec model. The input is a corpus created from Corpus_build.\n\n\\>tltk.corpus.TNC_load() loads TNC.3g by default. The file can be in the working directory or the TLTK package directory.\n\n\\>tltk.corpus.trigram_load(TRIGRAM) loads trigram data from another source saved in tab-delimited format \"W1\\tW2\\tW3\\tFreq\" e.g. tltk.corpus.load3gram('TNC.3g') 'TNC.3g' can be downloaded separately from the Thai National Corpus Project.\n\n\\>tltk.corpus.unigram(w1) returns the normalized frequency (frequency/million) of w1 from the corpus\n\n\\>tltk.corpus.bigram(w1,w2) returns the frequency/million of the bigram w1-w2 from the corpus e.g. 
tltk.corpus.bigram(\"\u0e2b\u0e32\u0e22\",\"\u0e14\u0e35\") => 2.331959592765809\n\n\\>tltk.corpus.trigram(w1,w2,w3) returns the frequency/million of the trigram w1-w2-w3 from the corpus\n\n\\>tltk.corpus.collocates(w, stat=\"chi2\", direct=\"both\", span=2, limit=10, minfq=1) returns all collocates of w, where stat = {freq,mi,chi2}, direct = {left,right,both}, and span = {1,2}. The output is a list of tuples ((w1,w2), stat). e.g. tltk.corpus.collocates(\"\u0e27\u0e34\u0e48\u0e07\",limit=5) \n\n=> [(('\u0e27\u0e34\u0e48\u0e07', '\u0e41\u0e08\u0e49\u0e19'), 86633.93952758134), (('\u0e27\u0e34\u0e48\u0e07', '\u0e15\u0e37\u0e4b\u0e2d'), 77175.29122642518), (('\u0e27\u0e34\u0e48\u0e07', '\u0e01\u0e23\u0e30\u0e2b\u0e37\u0e14\u0e01\u0e23\u0e30\u0e2b\u0e2d\u0e1a'), 48598.79465339733), (('\u0e27\u0e34\u0e48\u0e07', '\u0e1b\u0e23\u0e39\u0e4a\u0e14'), 41111.63720974819), (('\u0e25\u0e39\u0e48', '\u0e27\u0e34\u0e48\u0e07'), 33990.56839021914)]\n\n\\>tltk.corpus.W2V_load(File) loads a w2v model created with gensim. If no file is given, the file \"TNCc5model3.bin\" will be loaded.\n\n\\>tltk.corpus.w2v_load() by default loads the word2vec file \"TNCc5model2.bin\". The file can be in the working directory or the TLTK package directory\n\n\\>tltk.corpus.w2v_exist(w) checks whether w has a vector representation e.g. tltk.corpus.w2v_exist(\"\u0e2d\u0e32\u0e2b\u0e32\u0e23\") => True\n\n\\>tltk.corpus.w2v(w) returns the vector representation of w\n\n\\>tltk.corpus.similarity(w1,w2) e.g. tltk.corpus.similarity(\"\u0e2d\u0e32\u0e2b\u0e32\u0e23\",\"\u0e2d\u0e32\u0e2b\u0e32\u0e23\u0e27\u0e48\u0e32\u0e07\") => 0.783551877546\n\n\\>tltk.corpus.similar_words(w, n=10, cutoff=0., score=\"n\") e.g. 
tltk.corpus.similar_words(\"\u0e2d\u0e32\u0e2b\u0e32\u0e23\",n=5, score=\"y\") \n\n=> [('\u0e2d\u0e32\u0e2b\u0e32\u0e23\u0e27\u0e48\u0e32\u0e07', 0.7835519313812256), ('\u0e02\u0e2d\u0e07\u0e27\u0e48\u0e32\u0e07', 0.7366500496864319), ('\u0e02\u0e2d\u0e07\u0e2b\u0e27\u0e32\u0e19', 0.703102707862854), ('\u0e40\u0e19\u0e37\u0e49\u0e2d\u0e2a\u0e31\u0e15\u0e27\u0e4c', 0.6960341930389404), ('\u0e1c\u0e25\u0e44\u0e21\u0e49', 0.6641997694969177)]\n\n\\>tltk.corpus.outofgroup([w1,w2,w3,...]) e.g. tltk.corpus.outofgroup([\"\u0e19\u0e49\u0e33\",\"\u0e2d\u0e32\u0e2b\u0e32\u0e23\",\"\u0e02\u0e49\u0e32\u0e27\",\"\u0e23\u0e16\u0e22\u0e19\u0e15\u0e4c\",\"\u0e1c\u0e31\u0e01\"]) => \"\u0e23\u0e16\u0e22\u0e19\u0e15\u0e4c\"\n\n\\>tltk.corpus.analogy(w1,w2,w3,n=1) e.g. tltk.corpus.analogy('\u0e1e\u0e48\u0e2d','\u0e1c\u0e39\u0e49\u0e0a\u0e32\u0e22','\u0e41\u0e21\u0e48') => ['\u0e1c\u0e39\u0e49\u0e2b\u0e0d\u0e34\u0e07'] \n\n\\>tltk.corpus.w2v_plot([w1,w2,w3,...]) => plot a scatter graph of w1-wn in two dimensions\n\n\\>tltk.corpus.w2v_compare_color([w1,w2,w3,...]) => visualize the components of vectors w1-wn in color\n\n\\>tltk.corpus.compound(w1,w2) => check a compound w1w2, whether w1 or w2 is similar to w1w2 e.g. tltk.corpus.compound('\u0e40\u0e25\u0e47\u0e01','\u0e19\u0e49\u0e2d\u0e22') => [(('\u0e40\u0e25\u0e47\u0e01', '\u0e19\u0e49\u0e2d\u0e22'), 0.4533272), (('\u0e19\u0e49\u0e2d\u0e22', '\u0e40\u0e25\u0e47\u0e01\u0e19\u0e49\u0e2d\u0e22'), 0.35492077), (('\u0e40\u0e25\u0e47\u0e01', '\u0e40\u0e25\u0e47\u0e01\u0e19\u0e49\u0e2d\u0e22'), 0.24106339)]\n\n\nNotes\n-----\n\n- The word segmentation method used is based on a maximum collocation approach, which is described in the publication \"Collocation and Thai Word Segmentation\" by W. Aroonmanakun (2002). 
This publication can be found in the Proceedings of the Fifth Symposium on Natural Language Processing & The Fifth Oriental COCOSDA Workshop, edited by Thanaruk Theeramunkong and Virach Sornlertlamvanich, and published by Sirindhorn International Institute of Technology in Pathumthani. The relevant pages are 68-75. Here is the link to the publication: http://pioneer.chula.ac.th/~awirote/ling/SNLP2002-0051c.pdf\n\n- To segment Thai texts, you can use either tltk.nlp.word_segment(Text) or tltk.nlp.syl_segment(Text). The syllable segmentation method is based on a trigram model trained on a corpus of 3.1 million syllables. The input text should be a paragraph of Thai text that may contain English text. Spaces in the paragraph should be marked as \"\\<s/\\>\". Word boundaries are marked by \"|\", and syllable boundaries are marked by \"~\". Please note that the syllables represented here are written syllables. Some written syllables may be pronounced as two syllables. For example, \"\u0e2a\u0e01\u0e31\u0e14\" is segmented here as one written syllable, but it is pronounced as two syllables \"sa1-kat1\".\n\n- The process of determining words in a sentence is based on a combination of a dictionary and the maximum collocation strength between syllables. The standard dictionary includes many compounds and idioms, such as '\u0e40\u0e15\u0e32\u0e44\u0e21\u0e42\u0e04\u0e23\u0e40\u0e27\u0e1f', '\u0e44\u0e1f\u0e1f\u0e49\u0e32\u0e01\u0e23\u0e30\u0e41\u0e2a\u0e2a\u0e25\u0e31\u0e1a', '\u0e1b\u0e35\u0e07\u0e1a\u0e1b\u0e23\u0e30\u0e21\u0e32\u0e13', '\u0e2d\u0e38\u0e42\u0e21\u0e07\u0e04\u0e4c\u0e43\u0e15\u0e49\u0e14\u0e34\u0e19', '\u0e2d\u0e32\u0e2b\u0e32\u0e23\u0e08\u0e32\u0e19\u0e14\u0e48\u0e27\u0e19', '\u0e1b\u0e39\u0e19\u0e02\u0e32\u0e27\u0e1c\u0e2a\u0e21\u0e1e\u0e34\u0e40\u0e28\u0e29', '\u0e40\u0e15\u0e49\u0e19\u0e41\u0e23\u0e49\u0e07\u0e40\u0e15\u0e49\u0e19\u0e01\u0e32', etc. These will likely be segmented as one word. 
If your application requires the use of shortest meaningful words (i.e. '\u0e23\u0e16|\u0e42\u0e14\u0e22\u0e2a\u0e32\u0e23', '\u0e04\u0e19|\u0e43\u0e0a\u0e49', '\u0e01\u0e25\u0e32\u0e07|\u0e04\u0e37\u0e19', '\u0e15\u0e49\u0e19|\u0e44\u0e21\u0e49', as segmented in the BEST corpus), you can reset the default dictionary used in this package and load a new dictionary containing only simple words or the shortest meaningful words. To clear the default dictionary content, use \"reset_thaidict()\". To load a new dictionary, use \"read_thaidict('DICT_FILE')\". A file named 'BEST.dict' containing a list of words compiled from the BEST corpus is included in this package. \n\n- The standard dictionary used in this package has more than 65,000 entries, including abbreviations and transliterations, compiled from various sources. Additionally, a list of 8,700 proper names such as country names, organization names, location names, animal names, plant names, food names, etc., has been added to the system's dictionary. Examples of such proper names include '\u0e2d\u0e38\u0e0b\u0e40\u0e1a\u0e01\u0e34\u0e2a\u0e16\u0e32\u0e19', '\u0e2a\u0e33\u0e19\u0e31\u0e01\u0e40\u0e25\u0e02\u0e32\u0e18\u0e34\u0e01\u0e32\u0e23\u0e19\u0e32\u0e22\u0e01\u0e23\u0e31\u0e10\u0e21\u0e19\u0e15\u0e23\u0e35', '\u0e27\u0e31\u0e14\u0e43\u0e2b\u0e0d\u0e48\u0e2a\u0e38\u0e27\u0e23\u0e23\u0e13\u0e32\u0e23\u0e32\u0e21', '\u0e2b\u0e19\u0e2d\u0e19\u0e40\u0e08\u0e32\u0e30\u0e25\u0e33\u0e15\u0e49\u0e19\u0e02\u0e49\u0e32\u0e27\u0e42\u0e1e\u0e14', and '\u0e1b\u0e25\u0e32\u0e2b\u0e21\u0e36\u0e01\u0e01\u0e23\u0e30\u0e40\u0e17\u0e35\u0e22\u0e21\u0e1e\u0e23\u0e34\u0e01\u0e44\u0e17\u0e22'.\n\n- For segmenting a specific domain text, a specialized dictionary can be used by adding it to the existing dictionary before segmenting the text. This can be done by calling read_thaidict(\"SPECIALIZED_DICT\"). 
Please note that the dictionary should be a text file in \"utf-8\" encoding, and each word should be on a separate line.\n\n- 'Sentence segmentation' or actually 'EDU segmentation' is a process of breaking a paragraph into chunks of discourse units, which are usually clauses. It is based on a RandomForestClassifier model, which is trained on an EDU-segmented corpus (8,100 EDUs) created and used in Nalinee's thesis (http://www.arts.chula.ac.th/~ling/thesis/2556MA-LING-Nalinee.pdf). The model has an accuracy of 97.8%. The reason behind using EDUs can be found in [Aroonmanakun, W. 2007. Thoughts on Word and Sentence Segmentation in Thai. In Proceedings of the Seventh Symposium on Natural Language Processing, Dec 13-15, 2007, Pattaya, Thailand. 85-90.] [Intasaw, N. and Aroonmanakun, W. 2013. Basic Principles for Segmenting Thai EDUs. in Proceedings of 27th Pacific Asia Conference on Language, Information, and Computation, pages 491-498, Nov 22-24, 2013, Taipei.].\n\n- 'grapheme to phoneme' (g2p), as well as IPA transcription (th2ipa) and Thai romanization (th2roman) are based on the hybrid approach presented in the paper \"A Unified Model of Thai Word Segmentation and Romanization\". The Thai Royal Institute guideline for Thai romanization can be downloaded from \"http://www.arts.chula.ac.th/~ling/tts/ThaiRoman.pdf\", or \"http://www.royin.go.th/?page_id=619\". [Aroonmanakun, W., and W. Rivepiboon. 2004. A Unified Model of Thai Word Segmentation and Romanization. In Proceedings of The 18th Pacific Asia Conference on Language, Information and Computation, Dec 8-10, 2004, Tokyo, Japan. 205-214.] (http://www.aclweb.org/anthology/Y04-1021)\n\n\nRemarks\n-------\n\n- A prototype of the UD Parser is implemented using MaltParser (https://www.maltparser.org/). To use MaltParser, it must be installed, and a line 'tltk.nlp.Maltparser_Path = \"/path/to/maltparser-1.9.2\"' should be added to your code. 
The UD tree generated by MaltParser is a dictionary with the following format: {'sentence': \"\u0e02\u0e49\u0e2d\u0e04\u0e27\u0e32\u0e21\u0e20\u0e32\u0e29\u0e32\u0e44\u0e17\u0e22\", 'words': [{'id': nn, 'pos': POS, 'deprel': REL, 'head': HD_ID}, {...}, ...]}. The model is trained on 1,114 UD trees manually analyzed from a sample of TNC and is included as \"thamalt.mco\" in the TLTK package. Additional UD trees will be added in the future.\n- The TNC Trigram data (TNC.3g) and TNC word2vec (TNCc5model3.bin) can be downloaded from the TNC website: http://www.arts.chula.ac.th/ling/tnc/searchtnc/.\n- The \"spell_candidates\" module is modified from Peter Norvig's Python code, which can be found at http://norvig.com/spell-correct.html.\n- The \"w2v_compare_color\" module is modified from http://chrisculy.net/lx/wordvectors/wvecs_visualization.html.\n- The BEST corpus is a corpus released by NECTEC (https://www.nectec.or.th/corpus/).\n- This project uses Universal POS tags. For more information, please see http://universaldependencies.org/u/pos/index.html and http://www.arts.chula.ac.th/~ling/contents/File/UD%20Annotation%20for%20Thai.pdf.\n- pos_tag is based on the PerceptronTagger in the nltk.tag.perceptron module. It was trained using TNC data that was manually pos-tagged (approximately 148,000 words). The accuracy of the pos-tagging is 91.68%. The NLTK PerceptronTagger is a port of the Textblob Averaged Perceptron Tagger, which can be found at https://explosion.ai/blog/part-of-speech-pos-tagger-in-python.\n- The named entity recognition module is a CRF model adapted from a tutorial (http://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html). The model was trained using NER data from Sasimimon's and Nutcha's theses (altogether 7,354 names in a corpus of 183,300 words) (http://pioneer.chula.ac.th/~awirote/Data-Nutcha.zip, http://pioneer.chula.ac.th/~awirote/Data-Sasiwimon.zip) and NER data from AIforThai (https://aiforthai.in.th/). 
Only valid NE files from AIforThai were used, and the total number of all NEs is 170,076. The accuracy of the model is reported below (88%).\n\n\n============ =========== ======= ========= ========\n tag precision recall f1-score support\n------------ ----------- ------- --------- --------\n B-L 0.56 0.48 0.52 27105\n B-O 0.72 0.58 0.64 59613\n B-P 0.82 0.83 0.83 83358\n I-L 0.52 0.43 0.47 17859\n I-O 0.67 0.59 0.63 67396\n I-P 0.85 0.88 0.86 175069\n O 0.92 0.94 0.93 1032377\n------------ ----------- ------- --------- --------\n accuracy 0.88 1462777\n macro avg 0.72 0.68 0.70 1462777\nweighted avg 0.87 0.88 0.88 1462777\n============ =========== ======= ========= ========\n\n\nUse cases\n---------\n\nThis package is free for commercial use. If you incorporate this package in your work, we would appreciate it if you inform us through awirote@chula.ac.th.\n\n- BAS Web Services (https://clarin.phonetik.uni-muenchen.de/BASWebServices/interface) used TLTK for Thai grapheme-to-phoneme conversion in their project.\n- Chubb Life Assurance Public Company Limited used TLTK for Thai transliteration.\n- The .NET project wraps Thai Romanization in the Thai Language Toolkit Project to simplify usage in other .NET projects. https://github.com/dotnetthailand/ThaiRomanizationSharp\n- Huawei, Consumer Cloud Service Asia Pacific Cloud Service Business Growth Dept. used TLTK for AppSearch processing for Thai.\n- osml10n, localization functions for Openstreetmap data used TLTK for thai language transcription in cases where transcripted names are unavailable in Openstreetmap data itself. https://github.com/giggls/osml10n\n\n\n\n",
"bugtrack_url": null,
"license": "BSD-3-Clause",
"summary": "Thai Language Toolkit",
"version": "1.9.1",
"project_urls": {
"Homepage": "http://pypi.python.org/pypi/tltk/"
},
"split_keywords": [
"thai language toolkit",
" thai language processing",
" segmentation",
" pos tag",
" transcription",
" romanization"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "f790718e236864b8b512e432b9cca315a8ccaf7e18ba661db95b4a8b56fa5072",
"md5": "3dba869948c595f6bf235e11bb3276fa",
"sha256": "ffa55006636a0d5bb8b4087a99767d8ca47e647d07d7a5c7b33a525c32857e6e"
},
"downloads": -1,
"filename": "tltk-1.9.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "3dba869948c595f6bf235e11bb3276fa",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 20184761,
"upload_time": "2024-09-09T16:02:46",
"upload_time_iso_8601": "2024-09-09T16:02:46.974489Z",
"url": "https://files.pythonhosted.org/packages/f7/90/718e236864b8b512e432b9cca315a8ccaf7e18ba661db95b4a8b56fa5072/tltk-1.9.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "2ddc1b12c3eb2af01265001f0eeada78a25639f3472296784e0601ef2cddedd0",
"md5": "c0489e9633b82e243f187a718eb5dea3",
"sha256": "589f2f3d8b89258d97379ae9d5e6744168705ed6029b47c01b39b89c3fddfd22"
},
"downloads": -1,
"filename": "tltk-1.9.1.tar.gz",
"has_sig": false,
"md5_digest": "c0489e9633b82e243f187a718eb5dea3",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 20048923,
"upload_time": "2024-09-09T16:02:54",
"upload_time_iso_8601": "2024-09-09T16:02:54.031155Z",
"url": "https://files.pythonhosted.org/packages/2d/dc/1b12c3eb2af01265001f0eeada78a25639f3472296784e0601ef2cddedd0/tltk-1.9.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-09-09 16:02:54",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "tltk"
}