# yurenizer
This is a Japanese text normalizer that resolves spelling inconsistencies.
The Japanese README is here (日本語のREADMEはこちら):
https://github.com/sea-turt1e/yurenizer/blob/main/README_ja.md
## Overview
yurenizer is a tool for detecting and unifying variations in Japanese text notation.
For example, it can unify variations like "パソコン" (pasokon), "パーソナル・コンピュータ" (personal computer), and "パーソナルコンピュータ" into "パーソナルコンピューター".
These rules follow the [Sudachi Synonym Dictionary](https://github.com/WorksApplications/SudachiDict/blob/develop/docs/synonyms.md).
## Installation
```bash
pip install yurenizer
```
## Download Synonym Dictionary
```bash
curl -L -o /path/to/synonyms.txt https://raw.githubusercontent.com/WorksApplications/SudachiDict/refs/heads/develop/src/main/text/synonyms.txt
```
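If you prefer to download the dictionary from Python instead of curl, here is a minimal sketch using only the standard library (the destination path is a placeholder):
```python
from pathlib import Path
from urllib.request import urlretrieve

# Same file as the curl command above.
SYNONYMS_URL = (
    "https://raw.githubusercontent.com/WorksApplications/SudachiDict/"
    "refs/heads/develop/src/main/text/synonyms.txt"
)

def download_synonyms(dest: str = "synonyms.txt") -> Path:
    """Download the Sudachi synonym dictionary unless it already exists."""
    path = Path(dest)
    if not path.exists():
        urlretrieve(SYNONYMS_URL, path)
    return path
```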
## Usage
### Quick Start
```python
from yurenizer import SynonymNormalizer, NormalizerConfig
normalizer = SynonymNormalizer(synonym_file_path="path/to/synonyms.txt")
text = "「パソコン」は「パーソナルコンピュータ」の「synonym」で、「パーソナル・コンピュータ」と表記することもあります。"
print(normalizer.normalize(text))
# Output: 「パーソナルコンピューター」は「パーソナルコンピューター」の「シノニム」で、「パーソナルコンピューター」と表記することもあります。
```
### Customizing Settings
You can control normalization by passing a `NormalizerConfig` to the `normalize` method.
#### Example with Custom Settings
```python
from yurenizer import SynonymNormalizer, NormalizerConfig
normalizer = SynonymNormalizer(synonym_file_path="path/to/synonyms.txt")
text = "パソコンはパーソナルコンピュータの同義語です"
config = NormalizerConfig(
    taigen=True,
    yougen=False,
    expansion="from_another",
    other_language=False,
    alphabet=False,
    alphabetic_abbreviation=False,
    non_alphabetic_abbreviation=False,
    orthographic_variation=False,
    misspelling=False,
)
print(normalizer.normalize(text, config))
# Output: パソコンはパーソナルコンピュータの同義語です
# (the flags above disable the relevant normalizations, so the text is unchanged)
```
#### Configuration Details
- unify_level (default="lexeme"): Unification level. The default "lexeme" unifies entries that share a lexeme number; "word_form" unifies by word form number; "abbreviation" unifies by abbreviation number. (A sketch combining several of these flags follows this list.)
- taigen (default=True): Flag to include nouns in unification. Default is to include. Specify False to exclude.
- yougen (default=False): Flag to include conjugated words in unification. Default is to exclude. Specify True to include.
- expansion (default="from_another"): Synonym expansion control flag. Default only expands those with expansion control flag 0. Specify "ANY" to always expand.
- other_language (default=True): Flag to normalize non-Japanese languages to Japanese. Default is to normalize. Specify False to disable.
- alias (default=True): Flag to normalize aliases. Default is to normalize. Specify False to disable.
- old_name (default=True): Flag to normalize old names. Default is to normalize. Specify False to disable.
- misuse (default=True): Flag to normalize misused terms. Default is to normalize. Specify False to disable.
- alphabetic_abbreviation (default=True): Flag to normalize alphabetic abbreviations. Default is to normalize. Specify False to disable.
- non_alphabetic_abbreviation (default=True): Flag to normalize Japanese abbreviations. Default is to normalize. Specify False to disable.
- alphabet (default=True): Flag to normalize alphabet variations. Default is to normalize. Specify False to disable.
- orthographic_variation (default=True): Flag to normalize orthographic variations. Default is to normalize. Specify False to disable.
- misspelling (default=True): Flag to normalize misspellings. Default is to normalize. Specify False to disable.
- custom_synonym (default=True): Flag to use user-defined custom synonyms. Default is to use. Specify False to disable.
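As a concrete illustration of these flags, here is a minimal sketch of a conservative setup that unifies by word-form number and disables a few of the riskier rewrites (the input sentence and the file path are placeholders; the exact output depends on your dictionary version):
```python
from yurenizer import SynonymNormalizer, NormalizerConfig

normalizer = SynonymNormalizer(synonym_file_path="path/to/synonyms.txt")

# Unify by word-form number, keep noun unification, and leave
# conjugated words, suspected misspellings, and custom synonyms alone.
config = NormalizerConfig(
    unify_level="word_form",
    taigen=True,
    yougen=False,
    misspelling=False,
    custom_synonym=False,
)
print(normalizer.normalize("パソコンとPCは同じ意味です", config))
```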
## Specifying SudachiDict
The granularity of text segmentation depends on the SudachiDict edition. The default is "full", but you can specify "small" or "core" instead.
To use "small" or "core", install the corresponding package and pass it to the `SynonymNormalizer()` constructor:
```bash
pip install sudachidict_small
# or
pip install sudachidict_core
```
```python
normalizer = SynonymNormalizer(sudachi_dict="small")
# or
normalizer = SynonymNormalizer(sudachi_dict="core")
```
※ Please refer to the SudachiDict documentation for details.
## Custom Dictionary Specification
You can specify your own custom dictionary.
If the same word exists in both the custom dictionary and Sudachi synonym dictionary, the custom dictionary takes precedence.
### Custom Dictionary Format
Create a JSON file with the following format for your custom dictionary:
```json
{
"representative_word1": ["synonym1_1", "synonym1_2", ...],
"representative_word2": ["synonym2_1", "synonym2_2", ...],
...
}
```
#### Example
If you create a file like this, "幽白", "ゆうはく", and "幽☆遊☆白書" will be normalized to "幽遊白書":
```json
{
"幽遊白書": ["幽白", "ゆうはく", "幽☆遊☆白書"]
}
```
### How to Specify
```python
normalizer = SynonymNormalizer(custom_synonyms_file="path/to/custom_dict.json")
```
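For example, combining the custom dictionary above with the normalizer (a sketch; it assumes `custom_dict.json` contains the 幽遊白書 entry shown earlier):
```python
from yurenizer import SynonymNormalizer

normalizer = SynonymNormalizer(custom_synonyms_file="path/to/custom_dict.json")

# "幽白" is listed under "幽遊白書" in custom_dict.json,
# so it is rewritten to the representative word.
print(normalizer.normalize("幽白は名作です"))
# Expected output: 幽遊白書は名作です
```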
## License
This project is licensed under the [Apache License 2.0](LICENSE).
### Open Source Software Used
- [Sudachi Synonym Dictionary](https://github.com/WorksApplications/SudachiDict/blob/develop/docs/synonyms.md): Apache License 2.0
- [SudachiPy](https://github.com/WorksApplications/SudachiPy): Apache License 2.0
- [SudachiDict](https://github.com/WorksApplications/SudachiDict): Apache License 2.0
This library uses SudachiPy and its dictionary SudachiDict for morphological analysis. These are also distributed under the Apache License 2.0.
For detailed license information, please check the LICENSE files of each project:
- [Sudachi Synonym Dictionary LICENSE](https://github.com/WorksApplications/SudachiDict/blob/develop/LICENSE-2.0.txt)
※ Provided under the same license as the Sudachi dictionary.
- [SudachiPy LICENSE](https://github.com/WorksApplications/SudachiPy/blob/develop/LICENSE)
- [SudachiDict LICENSE](https://github.com/WorksApplications/SudachiDict/blob/develop/LICENSE-2.0.txt)