# Hanzipy
<p align="center">
<a href="https://circleci.com/gh/Synkied/hanzipy">
<img src="https://circleci.com/gh/Synkied/hanzipy.svg?style=svg" alt="CircleCI Build Status"/>
</a>
</p>
Hanzipy is a Chinese character and NLP module for Chinese language processing in Python.
It is primarily written to provide a framework for Chinese language learners to explore Chinese.
It is a Python port of the hanzi library by nieldlr: https://github.com/nieldlr/hanzi
The following README is copied from the above repository and adapted for Python.
At present features include:
- Character decomposition into components
- Dictionary definition lookup using CC-CEDICT
- Phonetic Regularity Computation
- Example Word Calculations
Future features planned:
- Support for IDS data: https://github.com/cjkvi/cjkvi-ids
The decomposition data currently in use was generated by Gavin Grover:
http://groovy.codeplex.com/wikipage?title=cjk-decomp
## Install
```shell
pip install hanzipy
```
## How to use
### Initialize Hanzipy (required)
```python
# import decomposer
from hanzipy.decomposer import HanziDecomposer
decomposer = HanziDecomposer()
# import dictionary
from hanzipy.dictionary import HanziDictionary
dictionary = HanziDictionary()
```
Hanzipy has been separated into two modules for clarity.
The `decomposer` focuses on character or phrase decomposition.
The `dictionary` is, as the name implies, focused on providing dictionary entries and phrase examples.
### Hanzi Dictionary
#### dictionary.definition_lookup(character/word, script_type=None)
Returns a list of dictionary entries for the character or word. ```script_type``` is optional.
```script_type``` parameters:
- "simplified" - Simplified
- "traditional" - Traditional
```python
print(dictionary.definition_lookup("雪"))
[
{
"traditional": "雪",
"simplified": "雪",
"pinyin": "Xue3",
"definition": "surname Xue",
},
{
"traditional": "雪",
"simplified": "雪",
"pinyin": "xue3",
"definition": "snow/CL:場|场[chang2]/(literary) to wipe away (a humiliation etc)",
},
]
print(dictionary.definition_lookup("這", "traditional"))
[
{
"traditional": "這",
"simplified": "这",
"pinyin": "zhe4",
"definition": "this/these/(commonly pr. [zhei4] before a classifier, esp. in Beijing)",
}
]
```
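The `definition` field follows the CC-CEDICT convention of separating senses with `/`. A minimal sketch of splitting an entry into individual glosses follows; `split_glosses` is a hypothetical helper, not part of Hanzipy, and the sample entry is copied from the output above so the snippet runs without the package installed.

```python
def split_glosses(entry):
    """Split a CC-CEDICT style definition string into separate glosses."""
    parts = entry["definition"].split("/")
    return [part.strip() for part in parts if part.strip()]


# Sample entry copied from the definition_lookup("雪") output above.
entry = {
    "traditional": "雪",
    "simplified": "雪",
    "pinyin": "xue3",
    "definition": "snow/CL:場|场[chang2]/(literary) to wipe away (a humiliation etc)",
}

glosses = split_glosses(entry)
for gloss in glosses:
    print(gloss)
```

With a live install, the same helper could be applied to each entry returned by `dictionary.definition_lookup(...)`.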
#### dictionary.dictionary_search(characters, search_type=None)
Searches the dictionary based on input. ```search_type``` changes what data it returns. Defaults to None.
```search_type``` parameters:
- "only" - returns only entries made up of the characters specified. This is a way to find all compound words that can be built from those characters.
- None - returns all occurrences of the character.
```python
print(dictionary.dictionary_search("雪"))
[
{
"traditional": "一雪前恥",
"simplified": "一雪前耻",
"pinyin": "yi1 xue3 qian2 chi3",
"definition": "to wipe away a humiliation (idiom)",
},
{
"traditional": "下雪",
"simplified": "下雪",
"pinyin": "xia4 xue3",
"definition": "to snow",
},
{
"traditional": "伸雪",
"simplified": "伸雪",
"pinyin": "shen1 xue3",
"definition": "variant of 申雪[shen1 xue3]",
},
{
"traditional": "似雪",
"simplified": "似雪",
"pinyin": "si4 xue3",
"definition": "snowy",
},
{
"traditional": "冰天雪地",
"simplified": "冰天雪地",
"pinyin": "bing1 tian1 xue3 di4",
"definition": "a world of ice and snow",
},
{
"traditional": "冰雪",
"simplified": "冰雪",
"pinyin": "bing1 xue3",
"definition": "ice and snow",
},
{
"traditional": "冰雪皇后",
"simplified": "冰雪皇后",
"pinyin": "Bing1 xue3 Huang2 hou4",
"definition": "Dairy Queen (brand)",
},
{
"traditional": "冰雪聰明",
"simplified": "冰雪聪明",
"pinyin": "bing1 xue3 cong1 ming5",
"definition": "exceptionally intelligent (idiom)",
},
{
"traditional": "各人自掃門前雪,莫管他家瓦上霜",
"simplified": "各人自扫门前雪,莫管他家瓦上霜",
"pinyin": "ge4 ren2 zi4 sao3 men2 qian2 xue3 , mo4 guan3 ta1 jia1 wa3 shang4 shuang1",
"definition": "sweep the snow from your own door step, don't worry about the frost on your neighbor's roof (idiom)",
},
{
"traditional": "哈巴雪山",
"simplified": "哈巴雪山",
"pinyin": "Ha1 ba1 xue3 shan1",
"definition": "Mt Haba (Nakhi: golden flower), in Lijiang 麗江|丽江, northwest Yunnan",
},
{
"traditional": "單板滑雪",
"simplified": "单板滑雪",
"pinyin": "dan1 ban3 hua2 xue3",
"definition": "to snowboard",
},
{
"traditional": "報仇雪恥",
"simplified": "报仇雪耻",
"pinyin": "bao4 chou2 xue3 chi3",
"definition": "to take revenge and erase humiliation (idiom)",
},
{
"traditional": "報仇雪恨",
"simplified": "报仇雪恨",
"pinyin": "bao4 chou2 xue3 hen4",
"definition": "to take revenge and wipe out a grudge (idiom)",
},
]
[....] # Truncated for display purposes
print(dictionary.dictionary_search("心的小孩真", "only"))
[
{"traditional": "孩", "simplified": "孩", "pinyin": "hai2", "definition": "child"},
{
"traditional": "小",
"simplified": "小",
"pinyin": "xiao3",
"definition": "small/tiny/few/young",
},
{
"traditional": "小孩",
"simplified": "小孩",
"pinyin": "xiao3 hai2",
"definition": "child/CL:個|个[ge4]",
},
{
"traditional": "小小",
"simplified": "小小",
"pinyin": "xiao3 xiao3",
"definition": "very small/very few/very minor",
},
{
"traditional": "小心",
"simplified": "小心",
"pinyin": "xiao3 xin1",
"definition": "to be careful/to take care",
},
{
"traditional": "小的",
"simplified": "小的",
"pinyin": "xiao3 de5",
"definition": "I (when talking to a superior)",
},
{
"traditional": "心",
"simplified": "心",
"pinyin": "xin1",
"definition": "heart/mind/intention/center/core/CL:顆|颗[ke1],個|个[ge4]",
},
{
"traditional": "的",
"simplified": "的",
"pinyin": "de5",
"definition": "of/~'s (possessive particle)/(used after an attribute)/(used to form a nominal expression)/(used at the end of a declarative sentence for emphasis)/also pr. [di4] or [di5] in poetry and songs",
},
{
"traditional": "的",
"simplified": "的",
"pinyin": "di1",
"definition": "see 的士[di1 shi4]",
},
{
"traditional": "的",
"simplified": "的",
"pinyin": "di2",
"definition": "really and truly",
},
{"traditional": "的", "simplified": "的", "pinyin": "di4", "definition": "aim/clear"},
{
"traditional": "真",
"simplified": "真",
"pinyin": "zhen1",
"definition": "really/truly/indeed/real/true/genuine",
},
{
"traditional": "真心",
"simplified": "真心",
"pinyin": "zhen1 xin1",
"definition": "sincere/heartfelt/CL:片[pian4]",
},
{
"traditional": "真真",
"simplified": "真真",
"pinyin": "zhen1 zhen1",
"definition": "really/in fact/genuinely/scrupulously",
},
]
```
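Search results can be long, so it is often useful to narrow them down afterwards. The sketch below keeps only words of a given length; `words_of_length` is a hypothetical helper, and the entries are a small sample copied from the `dictionary_search("雪")` output above rather than a live call.

```python
def words_of_length(entries, length, script="simplified"):
    """Keep only entries whose headword has exactly `length` characters."""
    return [entry for entry in entries if len(entry[script]) == length]


# A few entries copied from the dictionary_search("雪") output above.
results = [
    {"traditional": "下雪", "simplified": "下雪",
     "pinyin": "xia4 xue3", "definition": "to snow"},
    {"traditional": "冰天雪地", "simplified": "冰天雪地",
     "pinyin": "bing1 tian1 xue3 di4", "definition": "a world of ice and snow"},
    {"traditional": "冰雪", "simplified": "冰雪",
     "pinyin": "bing1 xue3", "definition": "ice and snow"},
    {"traditional": "單板滑雪", "simplified": "单板滑雪",
     "pinyin": "dan1 ban3 hua2 xue3", "definition": "to snowboard"},
]

two_character_words = words_of_length(results, 2)
print([entry["simplified"] for entry in two_character_words])
```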
#### dictionary.get_examples(character)
This function runs a `dictionary_search()`, compares the results against the Leiden University corpus for vocabulary frequency, then sorts the dictionary entries into three categories, returned under the keys `high_frequency`, `mid_frequency` and `low_frequency`.
The frequency categories are determined relative to the frequency distribution of the `dictionary_search` results within the corpus.
```python
print(dictionary.get_examples("橄"))
{
"high_frequency": [
{
"traditional": "橄欖",
"simplified": "橄榄",
"pinyin": "gan3 lan3",
"definition": "Chinese olive/olive",
},
{
"traditional": "橄欖油",
"simplified": "橄榄油",
"pinyin": "gan3 lan3 you2",
"definition": "olive oil",
},
],
"mid_frequency": [
{
"traditional": "橄欖球",
"simplified": "橄榄球",
"pinyin": "gan3 lan3 qiu2",
"definition": "football played with oval-shaped ball (rugby, American football, Australian rules etc)",
},
{
"traditional": "橄欖綠",
"simplified": "橄榄绿",
"pinyin": "gan3 lan3 lu:4",
"definition": "olive-green (color)",
},
],
"low_frequency": [
{
"traditional": "橄欖枝",
"simplified": "橄榄枝",
"pinyin": "gan3 lan3 zhi1",
"definition": "olive branch/symbol of peace",
},
{
"traditional": "橄欖樹",
"simplified": "橄榄树",
"pinyin": "gan3 lan3 shu4",
"definition": "olive tree",
},
{
"traditional": "橄欖石",
"simplified": "橄榄石",
"pinyin": "gan3 lan3 shi2",
"definition": "olivine (rock-forming mineral magnesium-iron silicate (Mg,Fe)2SiO4)/peridot",
},
],
}
```
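As a rough illustration of consuming this structure, the following sketch flattens the three buckets into one list ordered from most to least frequent. `flatten_examples` is a hypothetical helper, and the data is a trimmed copy of the output above so that the snippet runs without the package installed.

```python
BUCKET_ORDER = ("high_frequency", "mid_frequency", "low_frequency")


def flatten_examples(examples, script="simplified"):
    """Return (bucket, word) pairs, most frequent bucket first."""
    return [
        (bucket, entry[script])
        for bucket in BUCKET_ORDER
        for entry in examples.get(bucket, [])
    ]


# Trimmed copy of the get_examples("橄") output above.
examples = {
    "high_frequency": [
        {"traditional": "橄欖", "simplified": "橄榄"},
        {"traditional": "橄欖油", "simplified": "橄榄油"},
    ],
    "mid_frequency": [
        {"traditional": "橄欖球", "simplified": "橄榄球"},
    ],
    "low_frequency": [
        {"traditional": "橄欖枝", "simplified": "橄榄枝"},
    ],
}

flat = flatten_examples(examples)
for bucket, word in flat:
    print(bucket, word)
```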
<!-- #### hanzi.segment(phrase) - NOT YET AVAILABLE
Returns an array of characters that are segmented based on a longest match lookup.
````python
print(hanzi.segment("我們都是陌生人。"))
["我們", "都", "是", "陌生人", "。"]
```` -->
#### dictionary.get_pinyin(character)
Returns all possible pinyin data for a character.
```python
print(dictionary.get_pinyin("的"))
["de5", "di1", "di2", "di4"]
```
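The readings use numbered pinyin, with the tone written as a trailing digit. A minimal sketch of separating syllable and tone is given below; `split_tone` is a hypothetical helper, and the list of readings is copied from the output above.

```python
def split_tone(syllable):
    """Split numbered pinyin such as "de5" into ("de", 5)."""
    if syllable and syllable[-1].isdigit():
        return syllable[:-1], int(syllable[-1])
    return syllable, None  # no tone digit present


# Readings copied from the get_pinyin("的") output above.
readings = ["de5", "di1", "di2", "di4"]

tones_by_syllable = {}
for reading in readings:
    base, tone = split_tone(reading)
    tones_by_syllable.setdefault(base, []).append(tone)

print(tones_by_syllable)
```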
#### dictionary.get_character_frequency(character)
Returns frequency data for a character based on the Jun Da corpus. The data is stored in simplified characters, but the function is script agnostic, so traditional and simplified input return the same data.
```python
print(dictionary.get_character_frequency("热"))
{
"number": 606,
"character": "热",
"count": "67051",
"percentage": "79.8453694124",
"pinyin": "re4",
"meaning": "heat/to heat up/fervent/hot (of weather)/warm up",
}
```
#### dictionary.get_character_in_frequency_list_by_position(position)
Gets a character based on its position in the frequency list. Positions only go up to 9933, the length of the Jun Da frequency list.
```python
print(dictionary.get_character_in_frequency_list_by_position(111))
{
"number": 111,
"character": "机",
"count": "339823",
"percentage": "43.7756134862",
"pinyin": "ji1",
"meaning": "machine/opportunity/secret",
}
```
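In the sample outputs, `count` and `percentage` are strings while `number` is an integer. A small sketch of converting them for numeric comparison follows; `parse_frequency` is a hypothetical helper, and both records are copied from the outputs above.

```python
def parse_frequency(record):
    """Return a copy of a frequency record with numeric count and percentage."""
    parsed = dict(record)
    parsed["count"] = int(record["count"])
    parsed["percentage"] = float(record["percentage"])
    return parsed


# Records copied from the two frequency examples above.
re_record = parse_frequency({
    "number": 606, "character": "热", "count": "67051",
    "percentage": "79.8453694124", "pinyin": "re4",
    "meaning": "heat/to heat up/fervent/hot (of weather)/warm up",
})
ji_record = parse_frequency({
    "number": 111, "character": "机", "count": "339823",
    "percentage": "43.7756134862", "pinyin": "ji1",
    "meaning": "machine/opportunity/secret",
})

more_frequent = max((re_record, ji_record), key=lambda record: record["count"])
print(more_frequent["character"])
```

In these two samples the `percentage` value grows with the rank number, which suggests a cumulative figure; that reading is an inference from the examples, not something the README states.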
#### dictionary.determine_phonetic_regularity(decomposition_object/character)
This function takes a decomposition object created by `decomposer.decompose()` or a character, then returns an object describing the phonetic regularity relationship between the character and each of its components.
Phonetic Regularity Scale:
- 0 = No regularity
- 1 = Exact Match (with tone)
- 2 = Syllable Match (without tone)
- 3 = Similar in Initial (alliterates)
- 4 = Similar in Final (rhymes)
The returned object is keyed by the possible pronunciations of the character. Some fields contain duplicate entries because the same component can appear at several decomposition levels. It is up to the developer to decide whether to use this data.
```python
print(dictionary.determine_phonetic_regularity("洋"))
{
"yang2": {
"character": "洋",
"component": ["氵", "羊", "羊", "氵", "羊", "羊"],
"phonetic_pinyin": [
"shui3",
"Yang2",
"yang2",
"shui3",
"Yang2",
"yang2",
],
"regularity": [0, 1, 1, 0, 1, 1],
}
}
```
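To make the scale concrete, here is a rough sketch of how two numbered-pinyin readings could be scored against it. This is an illustration only, not Hanzipy's implementation: `split_syllable` and `regularity` are hypothetical names, the comparison ignores case (as the `Yang2`/`yang2` result above suggests), and treating `y` and `w` as initials is a simplifying assumption.

```python
# Two-letter initials come first so "zh" is matched before "z".
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")


def split_syllable(pinyin):
    """Return (initial, final, tone) for numbered pinyin such as "yang2"."""
    text = pinyin.lower()
    tone = ""
    if text and text[-1].isdigit():
        text, tone = text[:-1], text[-1]
    for initial in INITIALS:
        if text.startswith(initial):
            return initial, text[len(initial):], tone
    return "", text, tone  # syllables such as "an" have no initial


def regularity(character_pinyin, component_pinyin):
    """Score two readings on the 0-4 scale listed above."""
    c_initial, c_final, c_tone = split_syllable(character_pinyin)
    p_initial, p_final, p_tone = split_syllable(component_pinyin)
    if (c_initial, c_final) == (p_initial, p_final):
        return 1 if c_tone == p_tone else 2
    if c_initial and c_initial == p_initial:
        return 3
    if c_final and c_final == p_final:
        return 4
    return 0


# Component readings taken from the 洋 example above.
scores = [regularity("yang2", reading) for reading in ("shui3", "Yang2", "yang2")]
print(scores)
```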
### Hanzi Decomposer
#### decomposer.decompose(character, decomposition_type=None)
Takes a Chinese character and returns an object with decomposition data. The decomposition type is optional; when it is omitted, all three levels are returned.
Decomposition levels:
- 1 - "Once" (decomposes the character only once)
- 2 - "Radical" (decomposes the character into its lowest radical components)
- 3 - "Graphical" (decomposes into the lowest forms, mostly strokes and small indivisible units)
```python
decomposition = decomposer.decompose("爱")
print(decomposition)
{
"character": "爱",
"once": ["No glyph available", "友"],
"radical": ["爫", "冖", "𠂇", "又"],
"graphical": ["爫", "冖", "𠂇", "㇇", "㇏"],
}
# Example of forced level decomposition
decomposition = decomposer.decompose("爱", 2)
print(decomposition)
{"character": "爱", "components": ["爫", "冖", "𠂇", "又"]}
```
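Some components have no glyph of their own and appear as the string `"No glyph available"`. The sketch below strips those placeholders before further processing; `drop_placeholders` is a hypothetical helper, and the decomposition is copied from the output above.

```python
PLACEHOLDER = "No glyph available"


def drop_placeholders(decomposition):
    """Return a copy of a decomposition without placeholder components."""
    cleaned = {}
    for key, value in decomposition.items():
        if isinstance(value, list):
            cleaned[key] = [part for part in value if part != PLACEHOLDER]
        else:
            cleaned[key] = value  # e.g. the "character" field
    return cleaned


# Copied from the decomposer.decompose("爱") output above.
decomposition = {
    "character": "爱",
    "once": ["No glyph available", "友"],
    "radical": ["爫", "冖", "𠂇", "又"],
    "graphical": ["爫", "冖", "𠂇", "㇇", "㇏"],
}

cleaned = drop_placeholders(decomposition)
print(cleaned["once"])
```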
#### decomposer.decompose_many(character string, type of decomposition)
A function that takes a string of characters and returns one object for all characters.
```python
decomposition = decomposer.decompose_many("爱橄黃")
print(decomposition)
{
"爱": {
"character": "爱",
"once": ["No glyph available", "友"],
"radical": ["爫", "冖", "𠂇", "又"],
"graphical": ["爫", "冖", "𠂇", "㇇", "㇏"],
},
"橄": {
"character": "橄",
"once": ["木", "敢"],
"radical": ["木", "No glyph available", "耳", "⺙"],
"graphical": ["一", "丨", "八", "匚", "二", "丨", "二", "丿", "一", "乂"],
},
"黃": {
"character": "黃",
"once": ["廿", "No glyph available"],
"radical": ["黃"],
"graphical": ["卄", "一", "一", "二", "丨", "凵", "八"],
},
}
```
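Because the result is keyed by character, it is straightforward to aggregate across characters. The following sketch counts how often each graphical component occurs; `count_components` is a hypothetical helper, and the data is a subset of the output above.

```python
from collections import Counter


def count_components(decompositions, level="graphical"):
    """Count component occurrences across all characters at one level."""
    counts = Counter()
    for data in decompositions.values():
        counts.update(data[level])
    return counts


# Subset of the decompose_many("爱橄黃") output above.
decompositions = {
    "橄": {
        "character": "橄",
        "graphical": ["一", "丨", "八", "匚", "二", "丨", "二", "丿", "一", "乂"],
    },
    "黃": {
        "character": "黃",
        "graphical": ["卄", "一", "一", "二", "丨", "凵", "八"],
    },
}

counts = count_components(decompositions)
print(counts.most_common(3))
```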
#### decomposer.component_exists(character/component)
Checks whether a component/character exists in the data. Returns a boolean.
```python
print(decomposer.component_exists("乂"))
True
print(decomposer.component_exists("$"))
False
```
#### decomposer.get_characters_with_component(component)
Returns an array of characters containing the given component. If a component has bound forms, such as 手 and 扌, they're considered the same, and all characters containing either form are returned.
NB: This feature is new. The data might not be one hundred percent correct or consistent.
```python
print(decomposer.get_characters_with_component("囗"))
["国","因","西","回","口","四","团","图","围","困","固","园","圆","圈","囚","圃","囤","囿","囡","囫","圜","囵","囹","圄","囝","圉","圊","釦"]
```
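A learner might want to restrict this list to characters that appear in a text they already know. A minimal sketch is given below; `known_characters_with_component` is a hypothetical helper, the candidate list is a shortened copy of the output above, and the `known_text` string is made up for illustration.

```python
def known_characters_with_component(candidates, known_text):
    """Keep candidates that also occur in known_text, preserving order."""
    known = set(known_text)
    return [character for character in candidates if character in known]


# First few characters from the get_characters_with_component("囗") output above.
candidates = ["国", "因", "西", "回", "口", "四", "团", "图"]
known_text = "我回国了"  # illustrative input, not from the README

matches = known_characters_with_component(candidates, known_text)
print(matches)
```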
#### decomposer.get_radical_meaning(radical)
Returns a short, usually one-word, meaning of a radical.
```python
print(decomposer.get_radical_meaning("氵"))
water
```
## Projects
Hanzipy is used in the following projects:
## Contributors
- [synkied (Author)](https://github.com/synkied)
## License
Hanzipy uses data from various sources:
- [CEDICT](http://cc-cedict.org/wiki/)
- [Gavin Grover's Decomposition Data](http://cjkdecomp.codeplex.com/license)
- [Leiden Word Frequency Data](http://lwc.daanvanesch.nl/legal.php)
- [Jun Da Character Frequency Data](http://lingua.mtsu.edu/chinese-computing/copyright.html)
Other data files are either generated by Hanzipy or are not in use at present in the software.