# Yosina Python
A Python port of the Yosina Japanese text transliteration library.
## Overview
Yosina is a library for Japanese text transliteration that provides various text normalization and conversion features commonly needed when processing Japanese text.
## Usage
```python
from yosina import make_transliterator, TransliterationRecipe
# Create a recipe with desired transformations
recipe = TransliterationRecipe(
    kanji_old_new=True,
    replace_spaces=True,
    replace_suspicious_hyphens_to_prolonged_sound_marks=True,
    replace_circled_or_squared_characters=True,
    replace_combined_characters=True,
    hira_kata="hira-to-kata",  # Convert hiragana to katakana
    replace_japanese_iteration_marks=True,  # Expand iteration marks
    to_fullwidth=True,
)
# Create the transliterator
transliterator = make_transliterator(recipe)
# Use it with various special characters
input_text = "①②③　ⒶⒷⒸ　㍿㍑㌠㋿"  # circled numbers, letters, ideographic space, combined characters
result = transliterator(input_text)
print(result)  # "（１）（２）（３）　（Ａ）（Ｂ）（Ｃ）　株式会社リットルサンチーム令和"
# Convert old kanji to new
old_kanji = "舊字體"
result = transliterator(old_kanji)
print(result) # "旧字体"
# Convert half-width katakana to full-width
half_width = "ﾃｽﾄﾓｼﾞﾚﾂ"
result = transliterator(half_width)
print(result) # "テストモジレツ"
# Demonstrate hiragana to katakana conversion with iteration marks
mixed_text = "学問のすゝめ"
result = transliterator(mixed_text)
print(result) # "学問ノススメ"
```
### Using Direct Configuration
```python
from yosina import make_transliterator
# Configure with direct transliterator configs
configs = [
    ("kanji-old-new", {}),
    ("spaces", {}),
    ("prolonged-sound-marks", {"replace_prolonged_marks_following_alnums": True}),
    ("circled-or-squared", {}),
    ("combined", {}),
    ("hira-kata", {"mode": "kata-to-hira"}),  # Convert katakana to hiragana
    ("japanese-iteration-marks", {}),  # Expand iteration marks like 々, ゝゞ, ヽヾ
]
transliterator = make_transliterator(configs)
# Example: katakana becomes hiragana and iteration marks are expanded
input_text = "カタカナでの時々の佐々木さん"
result = transliterator(input_text)
print(result) # "かたかなでの時時の佐佐木さん"
```
## Available Transliterators
### 1. **Circled or Squared** (`circled-or-squared`)
Converts circled or squared characters to their plain equivalents.
- Options: `templates` (custom rendering), `includeEmojis` (include emoji characters)
- Example: `①②③` → `(1)(2)(3)`, `㊙㊗` → `(秘)(祝)`
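
To make the idea of template-based rendering concrete, here is a minimal standard-library sketch. It is not Yosina's implementation: the function name `expand_circled` and its `template` parameter are hypothetical, and it only covers characters that Unicode tags with a `<circle>` compatibility decomposition.

```python
import unicodedata


def expand_circled(text: str, template: str = "({})") -> str:
    """Render circled characters through a template (illustrative sketch only)."""
    out = []
    for ch in text:
        # unicodedata.decomposition("①") returns "<circle> 0031"
        decomp = unicodedata.decomposition(ch)
        if decomp.startswith("<circle> "):
            # Rebuild the plain character(s) from the hex code points after the tag
            plain = "".join(chr(int(cp, 16)) for cp in decomp.split()[1:])
            out.append(template.format(plain))
        else:
            out.append(ch)
    return "".join(out)


print(expand_circled("①②③"))  # (1)(2)(3)
print(expand_circled("㊙㊗"))  # (秘)(祝)
```

Passing a different template, for example `"[{}]"`, changes only the wrapping, which is roughly what a `templates` option needs to control.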
### 2. **Combined** (`combined`)
Expands combined characters into their individual character sequences.
- Example: `㍻` (Heisei era) → `平成`, `㈱` → `(株)`
### 3. **Hiragana-Katakana Composition** (`hira-kata-composition`)
Combines decomposed hiragana and katakana into their composed equivalents.
- Options: `composeNonCombiningMarks` (compose non-combining marks)
- Example: `か + ゙` → `が`, `ヘ + ゜` → `ペ`
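
The composition step can be approximated with Unicode NFC normalization applied to one base/mark pair at a time. The sketch below is an assumption about how such a step could work, not the library's code; `compose_kana` and its `compose_non_combining_marks` flag are hypothetical names that mirror the option described above.

```python
import unicodedata

COMBINING_MARKS = {"\u3099", "\u309a"}  # combining dakuten / handakuten
SPACING_TO_COMBINING = {"\u309b": "\u3099", "\u309c": "\u309a"}  # ゛ and ゜


def compose_kana(text: str, compose_non_combining_marks: bool = False) -> str:
    """Merge a kana followed by a (han)dakuten mark into one character (sketch)."""
    out = []
    for ch in text:
        mark = None
        if ch in COMBINING_MARKS:
            mark = ch
        elif compose_non_combining_marks and ch in SPACING_TO_COMBINING:
            mark = SPACING_TO_COMBINING[ch]
        if mark is not None and out:
            composed = unicodedata.normalize("NFC", out[-1] + mark)
            if len(composed) == 1:  # a precomposed form exists
                out[-1] = composed
                continue
        out.append(ch)
    return "".join(out)


print(compose_kana("か\u3099"))  # が
print(compose_kana("ヘ\u309c", compose_non_combining_marks=True))  # ペ
```

Working pair by pair keeps the rest of the string untouched, whereas normalizing the whole input to NFC would also change unrelated characters.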
### 4. **Hiragana-Katakana** (`hira-kata`)
Converts between hiragana and katakana scripts bidirectionally.
- Options: `mode` ("hira-to-kata" or "kata-to-hira")
- Example: `ひらがな` → `ヒラガナ` (hira-to-kata)
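
The two scripts sit 0x60 code points apart in Unicode, so the core of the conversion can be sketched with simple arithmetic. This is an illustration under that assumption, not Yosina's implementation; the function names are hypothetical and special cases such as small or archaic kana outside the shared range are ignored.

```python
KANA_OFFSET = 0x60  # distance between U+3041 (ぁ) and U+30A1 (ァ)


def hira_to_kata(text: str) -> str:
    """Shift hiragana U+3041..U+3096 up into the katakana block."""
    return "".join(
        chr(ord(ch) + KANA_OFFSET) if "\u3041" <= ch <= "\u3096" else ch
        for ch in text
    )


def kata_to_hira(text: str) -> str:
    """Shift katakana U+30A1..U+30F6 down into the hiragana block."""
    return "".join(
        chr(ord(ch) - KANA_OFFSET) if "\u30a1" <= ch <= "\u30f6" else ch
        for ch in text
    )


print(hira_to_kata("ひらがな"))  # ヒラガナ
print(kata_to_hira("カタカナ"))  # かたかな
```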
### 5. **Hyphens** (`hyphens`)
Replaces various dash/hyphen symbols with common ones used in Japanese.
- Options: `precedence` (mapping priority order)
- Available mappings: "ascii", "jisx0201", "jisx0208_90", "jisx0208_90_windows", "jisx0208_verbatim"
- Example: `2019—2020` (em dash) → `2019-2020`
### 6. **Ideographic Annotations** (`ideographic-annotations`)
Replaces ideographic annotations used in traditional Chinese-to-Japanese translation.
- Example: `㆖㆘` → `上下`
### 7. **IVS-SVS Base** (`ivs-svs-base`)
Handles Ideographic and Standardized Variation Selectors.
- Options: `charset`, `mode` ("ivs-or-svs" or "base"), `preferSVS`, `dropSelectorsAltogether`
- Example: `葛󠄀` (葛 + IVS) → `葛`
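
As a rough illustration of the selector-removal direction shown in the example, the sketch below strips every variation selector from a string. It is a simplification and not the library's logic: `drop_variation_selectors` is a hypothetical name, and the function ignores `charset` and also removes selectors used for emoji presentation.

```python
def is_variation_selector(ch: str) -> bool:
    """True for U+FE00..U+FE0F (VS1-16) and U+E0100..U+E01EF (VS17-256)."""
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF


def drop_variation_selectors(text: str) -> str:
    """Return the text with all variation selectors removed (sketch)."""
    return "".join(ch for ch in text if not is_variation_selector(ch))


variant = "\u845b\U000e0100"  # 葛 followed by an ideographic variation selector
print(len(variant), len(drop_variation_selectors(variant)))  # 2 1
```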
### 8. **Japanese Iteration Marks** (`japanese-iteration-marks`)
Expands iteration marks by repeating the preceding character.
- Example: `時々` → `時時`, `いすゞ` → `いすず`
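
The expansion rule can be sketched in a few lines. This is a simplified illustration rather than Yosina's algorithm: the helper names are hypothetical, only the five marks 々, ゝ, ゞ, ヽ and ヾ are handled, and edge cases such as multi-character repeats or an already voiced preceding kana are not treated specially.

```python
import unicodedata


def _is_kanji(ch: str) -> bool:
    return "\u4e00" <= ch <= "\u9fff"  # CJK Unified Ideographs (basic block only)


def _is_kana(ch: str) -> bool:
    return "\u3041" <= ch <= "\u3096" or "\u30a1" <= ch <= "\u30fa"


def expand_iteration_marks(text: str) -> str:
    """Replace an iteration mark with a copy of the preceding character (sketch)."""
    out = []
    for ch in text:
        prev = out[-1] if out else ""
        if ch == "\u3005" and prev and _is_kanji(prev):  # 々
            out.append(prev)
        elif ch in ("\u309d", "\u30fd") and prev and _is_kana(prev):  # ゝ ヽ
            out.append(prev)
        elif ch in ("\u309e", "\u30fe") and prev and _is_kana(prev):  # ゞ ヾ
            # Voice the repeated kana by composing it with a combining dakuten
            voiced = unicodedata.normalize("NFC", prev + "\u3099")
            out.append(voiced if len(voiced) == 1 else ch)
        else:
            out.append(ch)
    return "".join(out)


print(expand_iteration_marks("時々"))  # 時時
print(expand_iteration_marks("いすゞ"))  # いすず
```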
### 9. **JIS X 0201 and Alike** (`jisx0201-and-alike`)
Handles half-width/full-width character conversion.
- Options: `fullwidthToHalfwidth`, `convertGL` (alphanumerics/symbols), `convertGR` (katakana), `u005cAsYenSign`
- Example: `ABC123` → `ＡＢＣ１２３`, `ｶﾀｶﾅ` → `カタカナ`
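
The half-width to full-width direction can be sketched with the standard library. The snippet is only an approximation of what such a conversion involves, not the library's code: `to_fullwidth_sketch` is a hypothetical name, printable ASCII is shifted by the fixed offset 0xFEE0, and half-width katakana runs are widened with NFKC so that a voiced mark merges with its base.

```python
import re
import unicodedata

FULLWIDTH_OFFSET = 0xFEE0  # distance between "!" (U+0021) and "！" (U+FF01)
HALFWIDTH_KANA_RUN = re.compile("[\uff61-\uff9f]+")


def to_fullwidth_sketch(text: str) -> str:
    """Widen printable ASCII and half-width katakana (illustrative sketch)."""
    widened = "".join(
        chr(ord(ch) + FULLWIDTH_OFFSET) if "\u0021" <= ch <= "\u007e" else ch
        for ch in text
    )
    # NFKC maps ｼ + ﾞ to シ + combining dakuten and then composes them into ジ
    return HALFWIDTH_KANA_RUN.sub(
        lambda m: unicodedata.normalize("NFKC", m.group()), widened
    )


print(to_fullwidth_sketch("ABC123"))  # ＡＢＣ１２３
print(to_fullwidth_sketch("ﾃｽﾄﾓｼﾞﾚﾂ"))  # テストモジレツ
```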
### 10. **Kanji Old-New** (`kanji-old-new`)
Converts old-style kanji (旧字体) to modern forms (新字体).
- Example: `舊字體の變換` → `旧字体の変換`
### 11. **Mathematical Alphanumerics** (`mathematical-alphanumerics`)
Normalizes mathematical alphanumeric symbols to plain ASCII.
- Example: `𝐀𝐁𝐂` (mathematical bold) → `ABC`
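
For intuition, the same effect can be approximated with NFKC normalization restricted to the Mathematical Alphanumeric Symbols block. This is a sketch under that assumption, not Yosina's mapping table; `normalize_math_alnum` is a hypothetical name, Greek symbols in the block fold to Greek letters rather than ASCII, and letterlike symbols outside the block (such as ℎ) are left alone.

```python
import unicodedata


def normalize_math_alnum(text: str) -> str:
    """Fold characters in U+1D400..U+1D7FF to their plain forms (sketch)."""
    return "".join(
        unicodedata.normalize("NFKC", ch) if 0x1D400 <= ord(ch) <= 0x1D7FF else ch
        for ch in text
    )


print(normalize_math_alnum("𝐀𝐁𝐂"))  # ABC
```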
### 12. **Prolonged Sound Marks** (`prolonged-sound-marks`)
Handles contextual conversion between hyphens and prolonged sound marks.
- Options: `skipAlreadyTransliteratedChars`, `allowProlongedHatsuon`, `allowProlongedSokuon`, `replaceProlongedMarksFollowingAlnums`
- Example: `イ−ハト−ヴォ` (with hyphen) → `イーハトーヴォ` (prolonged mark)
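
The contextual part of this rule is that a hyphen-like character is only rewritten when it follows kana. The sketch below shows that idea in its simplest form; it is not the library's implementation, `HYPHEN_LIKE` is an illustrative and incomplete set, and none of the listed options are modelled.

```python
PROLONGED_MARK = "\u30fc"  # ー
# Illustrative subset: hyphen-minus, hyphen, minus sign, en dash, em dash, fullwidth hyphen
HYPHEN_LIKE = {"-", "\u2010", "\u2212", "\u2013", "\u2014", "\uff0d"}


def _is_kana(ch: str) -> bool:
    return "\u3041" <= ch <= "\u3096" or "\u30a1" <= ch <= "\u30fa"


def fix_prolonged_marks(text: str) -> str:
    """Turn hyphen-like characters that follow kana into ー (sketch)."""
    out = []
    for ch in text:
        prev = out[-1] if out else ""
        if ch in HYPHEN_LIKE and prev and (_is_kana(prev) or prev == PROLONGED_MARK):
            out.append(PROLONGED_MARK)
        else:
            out.append(ch)
    return "".join(out)


print(fix_prolonged_marks("イ−ハト−ヴォ"))  # イーハトーヴォ
print(fix_prolonged_marks("2019-2020"))  # 2019-2020 (digits are not kana)
```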
### 13. **Radicals** (`radicals`)
Converts CJK radical characters to their corresponding ideographs.
- Example: `⾔⾨⾷` (Kangxi radicals) → `言門食`
### 14. **Spaces** (`spaces`)
Normalizes various Unicode space characters to standard ASCII space.
- Example: `A　B` (ideographic space) → `A B`
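
A rough standard-library equivalent is to replace every character in the Unicode "space separator" category (`Zs`) with U+0020. This is an assumption about the scope of the normalization, not the library's actual character list, and `normalize_spaces` is a hypothetical name.

```python
import unicodedata


def normalize_spaces(text: str) -> str:
    """Replace every Zs-category character with an ASCII space (sketch)."""
    return "".join(" " if unicodedata.category(ch) == "Zs" else ch for ch in text)


print(normalize_spaces("A\u3000B"))  # A B
```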
## Installation
```bash
# Install with uv
uv add yosina
# Install with pip
pip install yosina
```
## Development
This project uses [uv](https://github.com/astral-sh/uv) for dependency management.
```bash
# Code generation
python -m codegen
# Install development dependencies
uv sync --extra dev
# Run tests
uv run pytest
# Run linting
uv run ruff check .
# Run formatting
uv run ruff format .
# Run type checking
uv run pyright
```
## Requirements
- Python 3.10+
- typing-extensions
## License
MIT