| Field | Value |
|---|---|
| Name | kuro2sudachi |
| Version | 0.4.6 |
| Home page | http://github.com/po3rin/kuro2sudachi |
| Upload time | 2023-07-16 15:52:39 |
| Author | po3rin |
| Requires Python | >=3.11,<4.0 |
| License | Apache-2.0 |

# kuro2sudachi

[![PyPi version](https://img.shields.io/pypi/v/kuro2sudachi.svg)](https://pypi.python.org/pypi/kuro2sudachi/)
![PyTest](https://github.com/po3rin/kuro2sudachi/workflows/PyTest/badge.svg)
[![](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)

kuro2sudachi converts a kuromoji user dictionary into a Sudachi user dictionary.

## Usage

```sh
$ pip install kuro2sudachi
$ kuro2sudachi kuromoji_dict.txt -o sudachi_user_dict.txt
```

## Custom POS conversion dictionary

You can override the default part-of-speech (POS) conversion settings by supplying a JSON file.

```json
{
    "固有名詞": {
        "sudachi_pos": "名詞,固有名詞,一般,*,*,*",
        "left_id": 4786,
        "right_id": 4786,
        "cost": 5000
    },
    "名詞": {
        "sudachi_pos": "名詞,普通名詞,一般,*,*,*",
        "left_id": 5146,
        "right_id": 5146,
        "cost": 5000
    }
}

```

```sh
$ kuro2sudachi kuromoji_dict.txt -o sudachi_user_dict.txt -c kuro2sudachi.json
```

If you want to skip entries with an unsupported POS or an invalid format instead of raising an error, use the `--ignore` flag.
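
The file passed with `-c` can also be generated from a script. The following is a minimal sketch, assuming that each entry carries the four keys shown in the example above and that `sudachi_pos` has six comma-separated fields; `write_pos_config` is a hypothetical helper and not part of kuro2sudachi.

```python
import json
from pathlib import Path

# Values copied from the JSON example above.
POS_CONFIG = {
    "固有名詞": {
        "sudachi_pos": "名詞,固有名詞,一般,*,*,*",
        "left_id": 4786,
        "right_id": 4786,
        "cost": 5000,
    },
    "名詞": {
        "sudachi_pos": "名詞,普通名詞,一般,*,*,*",
        "left_id": 5146,
        "right_id": 5146,
        "cost": 5000,
    },
}

# Assumption: these are the keys every entry needs (taken from the example).
REQUIRED_KEYS = {"sudachi_pos", "left_id", "right_id", "cost"}


def write_pos_config(path, config):
    """Hypothetical helper: sanity-check each entry, then write UTF-8 JSON."""
    for pos, entry in config.items():
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            raise ValueError(f"{pos}: missing keys {sorted(missing)}")
        if len(entry["sudachi_pos"].split(",")) != 6:
            raise ValueError(f"{pos}: sudachi_pos must have 6 fields")
    # ensure_ascii=False keeps the Japanese keys readable in the file.
    text = json.dumps(config, ensure_ascii=False, indent=4)
    Path(path).write_text(text, encoding="utf-8")


if __name__ == "__main__":
    write_pos_config("kuro2sudachi.json", POS_CONFIG)
```

Checking the entries before writing makes a typo in the config surface at generation time rather than during conversion.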

## Dictionary type

You can select the Sudachi dictionary used for tokenization with the `-s` option (default: `core`).

```sh
$ pip install sudachidict_full
$ kuro2sudachi kuromoji_dict.txt -o sudachi_user_dict.txt -s full
```

## Auto Splitting

kuro2sudachi supports auto splitting.

```json
{
    "名詞": {
        "sudachi_pos": "名詞,普通名詞,一般,*,*,*",
        "left_id": 5146,
        "right_id": 5146,
        "cost": 5000,
        "split_mode": "C",
        "unit_div_mode": [
            "A", "B"
        ]
    }
}
```

The output includes unit division info.

```sh
$ cat kuromoji_dict.txt
融合たんぱく質,融合たんぱく質,融合たんぱく質,名詞
発作性心房細動,発作性心房細動,発作性心房細動,名詞

$ kuro2sudachi kuromoji_dict.txt -o sudachi_user_dict.txt -c kuro2sudachi.json --ignore

$ cat sudachi_user_dict.txt
融合たんぱく質,4786,4786,5000,融合たんぱく質,名詞,普通名詞,一般,*,*,*,,融合たんぱく質,*,C,*,660881/810248,*
発作性心房細動,4786,4786,5000,発作性心房細動,名詞,普通名詞,一般,*,*,*,,発作性心房細動,*,C,584006/434835/428494/619020,2756385/428494/619020,*
```
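
To inspect the division info in the converted file, the lines can be parsed by column. The sketch below assumes the 18-column Sudachi user dictionary layout, in which column 14 holds the split type and columns 15 and 16 hold the A-unit and B-unit splits; `SudachiEntry` and `parse_sudachi_line` are hypothetical names used only for illustration.

```python
import csv
from typing import NamedTuple


class SudachiEntry(NamedTuple):
    surface: str
    left_id: int
    right_id: int
    cost: int
    pos: tuple[str, ...]
    reading: str
    normalized_form: str
    split_type: str
    split_a: list[str]
    split_b: list[str]


def _split_units(field: str) -> list[str]:
    """Return the '/'-separated unit references, or [] when the column is '*'."""
    return [] if field in ("*", "") else field.split("/")


def parse_sudachi_line(line: str) -> SudachiEntry:
    """Parse one line of the output file (assumed 18-column layout)."""
    row = next(csv.reader([line]))
    if len(row) != 18:
        raise ValueError(f"expected 18 columns, got {len(row)}")
    return SudachiEntry(
        surface=row[0],
        left_id=int(row[1]),
        right_id=int(row[2]),
        cost=int(row[3]),
        pos=tuple(row[5:11]),
        reading=row[11],
        normalized_form=row[12],
        split_type=row[14],
        split_a=_split_units(row[15]),
        split_b=_split_units(row[16]),
    )


if __name__ == "__main__":
    line = (
        "発作性心房細動,4786,4786,5000,発作性心房細動,名詞,普通名詞,一般,*,*,*,,"
        "発作性心房細動,*,C,584006/434835/428494/619020,2756385/428494/619020,*"
    )
    entry = parse_sudachi_line(line)
    print(entry.split_type, entry.split_a, entry.split_b)
```

The unit references are kept as strings rather than converted to integers, since the sketch makes no assumption about every form a reference can take.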

## Splitting Words defined by kuromoji

The CLI does not currently support the word segmentation defined in a kuromoji entry. The segmented form is ignored, and the entry is converted as a single word.

```
中咽頭ガン,中咽頭 ガン,チュウイントウ ガン,カスタム名詞
↓
中咽頭ガン,4786,4786,7000,中咽頭ガン,名詞,固有名詞,一般,*,*,*,チュウイントウガン,中咽頭ガン,*,*,*,*,*
```
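
The effect on a single entry can be sketched as follows. This is an illustration of the behavior shown in the example above, not the package's actual implementation; it assumes the four-field kuromoji user dictionary line (surface, segmentation, readings, POS), and `collapse_kuromoji_entry` is a hypothetical name.

```python
def collapse_kuromoji_entry(line: str) -> dict:
    """Drop the kuromoji segmentation and join the per-segment readings."""
    surface, segmentation, readings, pos = line.strip().split(",")
    return {
        "surface": surface,
        # The space-separated segments are what the conversion discards.
        "ignored_segments": segmentation.split(" "),
        # Per-segment readings are concatenated into one reading.
        "reading": readings.replace(" ", ""),
        "pos": pos,
    }


if __name__ == "__main__":
    entry = collapse_kuromoji_entry("中咽頭ガン,中咽頭 ガン,チュウイントウ ガン,カスタム名詞")
    print(entry["surface"], entry["reading"], entry["ignored_segments"])
```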

# For Developers

Run the tests:

```sh
$ poetry install
$ poetry run pytest
```

Run the kuro2sudachi command:

```sh
$ poetry run kuro2sudachi tests/kuromoji_dict_test.txt -o sudachi_user_dict.txt
```


            
