# kuro2sudachi
[![PyPi version](https://img.shields.io/pypi/v/kuro2sudachi.svg)](https://pypi.python.org/pypi/kuro2sudachi/)
![PyTest](https://github.com/po3rin/kuro2sudachi/workflows/PyTest/badge.svg)
[![](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)
kuro2sudachi converts a kuromoji user dictionary into a Sudachi user dictionary.
## Usage
```sh
$ pip install kuro2sudachi
$ kuro2sudachi kuromoji_dict.txt -o sudachi_user_dict.txt
```
## Custom POS conversion dict
You can override the POS conversion settings with a JSON config file.
```json
{
  "固有名詞": {
    "sudachi_pos": "名詞,固有名詞,一般,*,*,*",
    "left_id": 4786,
    "right_id": 4786,
    "cost": 5000
  },
  "名詞": {
    "sudachi_pos": "名詞,普通名詞,一般,*,*,*",
    "left_id": 5146,
    "right_id": 5146,
    "cost": 5000
  }
}
```
```sh
$ kuro2sudachi kuromoji_dict.txt -o sudachi_user_dict.txt -c kuro2sudachi.json
```
To skip entries with an unsupported POS or an invalid format instead of stopping with an error, use the `--ignore` flag.
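If the config is maintained alongside other Python tooling, it can be generated and sanity-checked before being passed to `-c`. The following is a minimal sketch, not part of kuro2sudachi: `check_config` and `write_config` are hypothetical helpers, and the assumption that all four keys from the example are required is mine.

```python
import json
from pathlib import Path

# Keys used by every entry in the example config (assumed here to be required).
REQUIRED_KEYS = {"sudachi_pos", "left_id", "right_id", "cost"}

# kuromoji POS label -> Sudachi settings, copied from the example config.
pos_config = {
    "固有名詞": {
        "sudachi_pos": "名詞,固有名詞,一般,*,*,*",
        "left_id": 4786,
        "right_id": 4786,
        "cost": 5000,
    },
    "名詞": {
        "sudachi_pos": "名詞,普通名詞,一般,*,*,*",
        "left_id": 5146,
        "right_id": 5146,
        "cost": 5000,
    },
}


def check_config(config: dict) -> None:
    """Hypothetical sanity check; not part of kuro2sudachi itself."""
    for label, entry in config.items():
        missing = REQUIRED_KEYS - set(entry)
        if missing:
            raise ValueError(f"{label}: missing keys {sorted(missing)}")
        # A Sudachi part-of-speech tag has six comma-separated levels.
        if len(entry["sudachi_pos"].split(",")) != 6:
            raise ValueError(f"{label}: sudachi_pos must have 6 fields")


def write_config(config: dict, path: str) -> None:
    """Validate the config and write it as UTF-8 JSON."""
    check_config(config)
    text = json.dumps(config, ensure_ascii=False, indent=2)
    Path(path).write_text(text + "\n", encoding="utf-8")
```

Calling `write_config(pos_config, "kuro2sudachi.json")` produces a file that can be passed to the CLI with `-c kuro2sudachi.json`. `ensure_ascii=False` keeps the Japanese POS labels readable in the file.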
## Dictionary type
You can choose the Sudachi dictionary used for tokenization with the `-s` option (default: `core`).
```sh
$ pip install sudachidict_full
$ kuro2sudachi kuromoji_dict.txt -o sudachi_user_dict.txt -s full
```
## Auto Splitting
kuro2sudachi supports auto splitting. Set `split_mode` and `unit_div_mode` for a POS in the config file:
```json
{
  "名詞": {
    "sudachi_pos": "名詞,普通名詞,一般,*,*,*",
    "left_id": 5146,
    "right_id": 5146,
    "cost": 5000,
    "split_mode": "C",
    "unit_div_mode": [
      "A", "B"
    ]
  }
}
```
The output includes unit division info.
```sh
$ cat kuromoji_dict.txt
融合たんぱく質,融合たんぱく質,融合たんぱく質,名詞
発作性心房細動,発作性心房細動,発作性心房細動,名詞
$ kuro2sudachi kuromoji_dict.txt -o sudachi_user_dict.txt -c kuro2sudachi.json --ignore
$ cat sudachi_user_dict.txt
融合たんぱく質,4786,4786,5000,融合たんぱく質,名詞,普通名詞,一般,*,*,*,,融合たんぱく質,*,C,*,660881/810248,*
発作性心房細動,4786,4786,5000,発作性心房細動,名詞,普通名詞,一般,*,*,*,,発作性心房細動,*,C,584006/434835/428494/619020,2756385/428494/619020,*
```
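To inspect the split information in the generated file, the relevant columns can be read back in Python. This is a minimal sketch under the assumption that each line follows the 18-column Sudachi user dictionary layout, with the split type in column 14 and the A-unit and B-unit split info in columns 15 and 16 (0-based). `SudachiEntry`, `parse_split`, and `parse_line` are hypothetical names, and the naive comma split does not handle quoted fields.

```python
from dataclasses import dataclass


@dataclass
class SudachiEntry:
    surface: str
    split_mode: str
    a_unit_split: list[str]
    b_unit_split: list[str]


def parse_split(field: str) -> list[str]:
    """Turn a '/'-separated split field into a list; '*' means no split info."""
    return [] if field in ("", "*") else field.split("/")


def parse_line(line: str) -> SudachiEntry:
    """Extract the surface form and split columns from one dictionary line."""
    cols = line.rstrip("\n").split(",")
    if len(cols) != 18:
        raise ValueError(f"expected 18 columns, got {len(cols)}")
    return SudachiEntry(
        surface=cols[0],
        split_mode=cols[14],
        a_unit_split=parse_split(cols[15]),
        b_unit_split=parse_split(cols[16]),
    )
```

Applied to the second output line above, `parse_line` yields split mode `C`, four A-unit references, and three B-unit references. For the first line the A-unit column is `*`, so the A-unit list is empty.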
## Splitting Words defined by kuromoji
Currently, the CLI does not support word splitting defined by kuromoji. Therefore, the split representation of kuromoji is ignored.
```
中咽頭ガン,中咽頭 ガン,チュウイントウ ガン,カスタム名詞
↓
中咽頭ガン,4786,4786,7000,中咽頭ガン,名詞,固有名詞,一般,*,*,*,チュウイントウガン,中咽頭ガン,*,*,*,*,*
```
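The effect on the example above can be sketched in Python: the segmentation column is dropped and the space-separated readings are joined into one reading. This only illustrates the behavior shown in the example and is not the library's actual code; `flatten_kuromoji_entry` is a hypothetical helper that assumes the four-column kuromoji format (surface, segmentation, readings, POS label).

```python
def flatten_kuromoji_entry(line: str) -> tuple[str, str, str]:
    """Return (surface, reading, pos_label) with kuromoji's segmentation removed."""
    surface, _segments, readings, pos_label = line.strip().split(",")
    # The segmentation column is ignored; the per-segment readings are
    # concatenated so the entry is treated as a single word.
    reading = "".join(readings.split())
    return surface, reading, pos_label
```

For the input line above this returns the surface `中咽頭ガン`, the joined reading `チュウイントウガン`, and the POS label `カスタム名詞`, which matches the surface and reading columns of the converted line.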
# For Developers
Run the tests:
```sh
$ poetry install
$ poetry run pytest
```
Run the kuro2sudachi command:
```sh
$ poetry run kuro2sudachi tests/kuromoji_dict_test.txt -o sudachi_user_dict.txt
```
Raw data
{
"_id": null,
"home_page": "http://github.com/po3rin/kuro2sudachi",
"name": "kuro2sudachi",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.11,<4.0",
"maintainer_email": "",
"keywords": "",
"author": "po3rin",
"author_email": "",
"download_url": "https://files.pythonhosted.org/packages/3a/74/13b0f5d12efdb38e9031205622444a50ce80cc7ac21911a010b294e6b0ce/kuro2sudachi-0.4.6.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "Apache-2.0",
"summary": "",
"version": "0.4.6",
"project_urls": {
"Homepage": "http://github.com/po3rin/kuro2sudachi",
"Repository": "http://github.com/po3rin/kuro2sudachi"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "f0a5ee7461dd34311a80a4cd89b1905e4ef470fad7df927679b84485687a3c24",
"md5": "06bb847352613bab3554fb9462fc13a2",
"sha256": "67f4143bb1f2c2017ebb07b2877956d5aebefda8d284ef358e10139558ed351a"
},
"downloads": -1,
"filename": "kuro2sudachi-0.4.6-py3-none-any.whl",
"has_sig": false,
"md5_digest": "06bb847352613bab3554fb9462fc13a2",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.11,<4.0",
"size": 8637,
"upload_time": "2023-07-16T15:52:37",
"upload_time_iso_8601": "2023-07-16T15:52:37.542703Z",
"url": "https://files.pythonhosted.org/packages/f0/a5/ee7461dd34311a80a4cd89b1905e4ef470fad7df927679b84485687a3c24/kuro2sudachi-0.4.6-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "3a7413b0f5d12efdb38e9031205622444a50ce80cc7ac21911a010b294e6b0ce",
"md5": "972028eccccf8ada4e60bfd6fbf86520",
"sha256": "a2191070a688d1ea2586c351379e8dab851abbca5748532424d5a8d5215e3f80"
},
"downloads": -1,
"filename": "kuro2sudachi-0.4.6.tar.gz",
"has_sig": false,
"md5_digest": "972028eccccf8ada4e60bfd6fbf86520",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.11,<4.0",
"size": 8055,
"upload_time": "2023-07-16T15:52:39",
"upload_time_iso_8601": "2023-07-16T15:52:39.683015Z",
"url": "https://files.pythonhosted.org/packages/3a/74/13b0f5d12efdb38e9031205622444a50ce80cc7ac21911a010b294e6b0ce/kuro2sudachi-0.4.6.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-07-16 15:52:39",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "po3rin",
"github_project": "kuro2sudachi",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "kuro2sudachi"
}