yosina


| Field | Value |
| --- | --- |
| Name | yosina |
| Version | 0.1.0 |
| Summary | Japanese text transliteration library |
| Upload time | 2025-08-19 18:28:38 |
| Requires Python | >=3.10 |
| License | MIT |
| Keywords | japanese, normalization, text-processing, transliteration |

# Yosina Python

A Python port of the Yosina Japanese text transliteration library.

## Overview

Yosina is a library for Japanese text transliteration that provides various text normalization and conversion features commonly needed when processing Japanese text.

## Usage

```python
from yosina import make_transliterator, TransliterationRecipe

# Create a recipe with desired transformations
recipe = TransliterationRecipe(
    kanji_old_new=True,
    replace_spaces=True,
    replace_suspicious_hyphens_to_prolonged_sound_marks=True,
    replace_circled_or_squared_characters=True,
    replace_combined_characters=True,
    hira_kata="hira-to-kata",  # Convert hiragana to katakana
    replace_japanese_iteration_marks=True,  # Expand iteration marks
    to_fullwidth=True,
)

# Create the transliterator
transliterator = make_transliterator(recipe)

# Use it with various special characters
input_text = "①②③\u3000ⒶⒷⒸ\u3000㍿㍑㌠㋿"  # circled numbers and letters, ideographic spaces (U+3000), combined characters
result = transliterator(input_text)
print(result)  # "（１）（２）（３）　（Ａ）（Ｂ）（Ｃ）　株式会社リットルサンチーム令和"

# Convert old kanji to new
old_kanji = "舊字體"
result = transliterator(old_kanji)
print(result)  # "旧字体"

# Convert half-width katakana to full-width
half_width = "ﾃｽﾄﾓｼﾞﾚﾂ"
result = transliterator(half_width)
print(result)  # "テストモジレツ"

# Demonstrate hiragana to katakana conversion with iteration marks
mixed_text = "学問のすゝめ"
result = transliterator(mixed_text)
print(result)  # "学問ノススメ"
```

### Using Direct Configuration

```python
from yosina import make_transliterator

# Configure with direct transliterator configs
configs = [
    ("kanji-old-new", {}),
    ("spaces", {}),
    ("prolonged-sound-marks", {"replace_prolonged_marks_following_alnums": True}),
    ("circled-or-squared", {}),
    ("combined", {}),
    ("hira-kata", {"mode": "kata-to-hira"}),  # Convert katakana to hiragana
    ("japanese-iteration-marks", {}),  # Expand iteration marks like 々, ゝゞ, ヽヾ
]

transliterator = make_transliterator(configs)

# Example combining katakana-to-hiragana conversion with iteration-mark expansion
input_text = "カタカナでの時々の佐々木さん"
result = transliterator(input_text)
print(result)  # "かたかなでの時時の佐佐木さん"
```
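
The configuration list can be read as an ordered pipeline: each `(name, options)` pair selects one stage, and the stages presumably run in sequence. The sketch below illustrates that idea with the standard library only. It is not Yosina's implementation; `STAGES`, `build_pipeline`, and the two toy stages are hypothetical names introduced for illustration.

```python
from typing import Callable

Stage = Callable[[str], str]

# Tiny illustrative table: only old/new kanji pairs that appear in this README.
_OLD_TO_NEW = str.maketrans({"舊": "旧", "體": "体", "變": "変"})


def kanji_old_new_stage(options: dict) -> Stage:
    # Toy stage (assumption): replace a few old-style kanji with modern forms.
    return lambda text: text.translate(_OLD_TO_NEW)


def spaces_stage(options: dict) -> Stage:
    # Toy stage (assumption): map the ideographic space (U+3000) to ASCII space.
    return lambda text: text.replace("\u3000", " ")


# Hypothetical registry mapping stage names to stage factories.
STAGES: dict[str, Callable[[dict], Stage]] = {
    "kanji-old-new": kanji_old_new_stage,
    "spaces": spaces_stage,
}


def build_pipeline(configs: list[tuple[str, dict]]) -> Stage:
    # Instantiate every stage once, then apply them in list order.
    stages = [STAGES[name](options) for name, options in configs]

    def run(text: str) -> str:
        for stage in stages:
            text = stage(text)
        return text

    return run


pipeline = build_pipeline([("kanji-old-new", {}), ("spaces", {})])
print(pipeline("舊字體\u3000變換"))  # 旧字体 変換
```

Because stages run in list order, reordering the list can change the result when one stage produces characters that another stage consumes.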

## Available Transliterators

### 1. **Circled or Squared** (`circled-or-squared`)
Converts circled or squared characters to their plain equivalents.
- Options: `templates` (custom rendering), `includeEmojis` (include emoji characters)
- Example: `①②③` → `(1)(2)(3)`, `㊙㊗` → `(秘)(祝)`
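
To show where such a mapping can come from, here is a minimal sketch that relies on the `<circle>` compatibility decompositions in Python's `unicodedata` module. It is an assumption-laden illustration, not how Yosina builds its table; `expand_circled` is a hypothetical name, and squared characters and custom templates are not covered.

```python
import unicodedata


def expand_circled(text: str) -> str:
    out = []
    for ch in text:
        decomposition = unicodedata.decomposition(ch)
        if decomposition.startswith("<circle>"):
            # e.g. "<circle> 0031" -> code points after the tag -> "1"
            inner = "".join(chr(int(cp, 16)) for cp in decomposition.split()[1:])
            out.append(f"({inner})")
        else:
            out.append(ch)
    return "".join(out)


print(expand_circled("①②③"))  # (1)(2)(3)
print(expand_circled("㊙㊗"))  # (秘)(祝)
```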

### 2. **Combined** (`combined`)
Expands combined characters into their individual character sequences.
- Example: `㍻` (Heisei era) → `平成`, `㈱` → `(株)`
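
Both examples above happen to match Unicode compatibility decompositions, so the effect can be approximated with the standard library. The sketch below is only an approximation under that assumption (`expand_combined` is a hypothetical name), and Yosina's own table may differ in coverage and output.

```python
import unicodedata


def expand_combined(text: str) -> str:
    out = []
    for ch in text:
        # The decomposition string starts with a tag such as "<square>".
        tag = unicodedata.decomposition(ch).split(" ", 1)[0]
        if tag in ("<square>", "<compat>"):
            out.append(unicodedata.normalize("NFKC", ch))
        else:
            out.append(ch)
    return "".join(out)


print(expand_combined("㍻"))  # 平成
print(expand_combined("㈱"))  # (株)
```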

### 3. **Hiragana-Katakana Composition** (`hira-kata-composition`)
Combines decomposed hiragana and katakana (a base kana followed by a voicing mark) into their precomposed equivalents.
- Options: `composeNonCombiningMarks` (compose non-combining marks)
- Example: `か + ゙` → `が`, `ヘ + ゜` → `ペ`
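
The composition step resembles Unicode NFC normalization restricted to kana. A minimal sketch with the standard library follows. `compose_kana` and its `compose_non_combining_marks` parameter are hypothetical stand-ins for the option listed above; unlike a dedicated transliterator, this sketch normalizes the whole string and converts every spacing voicing mark, even a standalone one.

```python
import unicodedata

# Spacing (non-combining) voicing marks and their combining counterparts.
_SPACING_TO_COMBINING = str.maketrans({"\u309b": "\u3099", "\u309c": "\u309a"})


def compose_kana(text: str, compose_non_combining_marks: bool = False) -> str:
    if compose_non_combining_marks:
        # Turn U+309B / U+309C into U+3099 / U+309A so that NFC can compose them.
        text = text.translate(_SPACING_TO_COMBINING)
    return unicodedata.normalize("NFC", text)


print(compose_kana("\u304b\u3099"))  # が (か + combining dakuten)
print(compose_kana("\u30d8\u309c", compose_non_combining_marks=True))  # ペ
```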

### 4. **Hiragana-Katakana** (`hira-kata`)
Converts between hiragana and katakana scripts bidirectionally.
- Options: `mode` ("hira-to-kata" or "kata-to-hira")
- Example: `ひらがな` → `ヒラガナ` (hira-to-kata)
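
In Unicode, hiragana U+3041 to U+3096 and katakana U+30A1 to U+30F6 sit exactly 0x60 code points apart, so the basic conversion reduces to an offset. The sketch below shows only that core idea with hypothetical function names; it makes no claim about how Yosina handles edge cases such as katakana without hiragana counterparts.

```python
def hira_to_kata(text: str) -> str:
    # Shift hiragana up by 0x60; leave everything else untouched.
    return "".join(
        chr(ord(ch) + 0x60) if "\u3041" <= ch <= "\u3096" else ch for ch in text
    )


def kata_to_hira(text: str) -> str:
    # Shift katakana down by 0x60; leave everything else untouched.
    return "".join(
        chr(ord(ch) - 0x60) if "\u30a1" <= ch <= "\u30f6" else ch for ch in text
    )


print(hira_to_kata("ひらがな"))  # ヒラガナ
print(kata_to_hira("カタカナ"))  # かたかな
```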

### 5. **Hyphens** (`hyphens`)
Replaces various dash/hyphen symbols with common ones used in Japanese.
- Options: `precedence` (mapping priority order)
- Available mappings: "ascii", "jisx0201", "jisx0208_90", "jisx0208_90_windows", "jisx0208_verbatim"
- Example: `2019—2020` (em dash) → `2019-2020`
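
For the simplest case, mapping everything to the ASCII hyphen-minus, the behaviour can be sketched as a translation table. The character set below is a hypothetical subset chosen for illustration; it does not reproduce Yosina's mapping tables or its `precedence` logic.

```python
# Hypothetical subset of dash-like characters (assumption, not Yosina's table):
# hyphen, non-breaking hyphen, figure dash, en dash, em dash, horizontal bar,
# minus sign, fullwidth hyphen-minus.
_DASHES = "\u2010\u2011\u2012\u2013\u2014\u2015\u2212\uff0d"
_TO_ASCII_HYPHEN = str.maketrans({ch: "-" for ch in _DASHES})


def normalize_hyphens(text: str) -> str:
    return text.translate(_TO_ASCII_HYPHEN)


print(normalize_hyphens("2019\u20142020"))  # 2019-2020
```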

### 6. **Ideographic Annotations** (`ideographic-annotations`)
Replaces ideographic annotation marks (used in kanbun, the traditional Japanese way of reading classical Chinese) with their ordinary kanji equivalents.
- Example: `㆖㆘` → `上下`

### 7. **IVS-SVS Base** (`ivs-svs-base`)
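Handles Ideographic Variation Selectors (IVS) and Standardized Variation Selectors (SVS), either attaching them to kanji or reducing a sequence to its base character depending on `mode`.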
- Options: `charset`, `mode` ("ivs-or-svs" or "base"), `preferSVS`, `dropSelectorsAltogether`
- Example: `葛󠄀` (葛 + IVS) → `葛`
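
Reducing a variation sequence to its base character amounts to dropping the selector code points. The sketch below does exactly that and nothing more; `strip_variation_selectors` is a hypothetical name, and the charset-aware behaviour implied by the options above is not modelled.

```python
def strip_variation_selectors(text: str) -> str:
    def is_selector(ch: str) -> bool:
        cp = ord(ch)
        # VS1..VS16 (U+FE00..U+FE0F) and VS17..VS256 (U+E0100..U+E01EF).
        return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF

    return "".join(ch for ch in text if not is_selector(ch))


print(strip_variation_selectors("葛\U000E0100"))  # 葛
```

Note that U+FE0E and U+FE0F also act as text and emoji presentation selectors, so stripping them unconditionally affects emoji as well.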

### 8. **Japanese Iteration Marks** (`japanese-iteration-marks`)
Expands iteration marks by repeating the preceding character.
- Example: `時々` → `時時`, `いすゞ` → `いすず`
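
The expansion rule can be sketched as follows for three marks: `々` repeats a preceding kanji, `ゝ` repeats a preceding hiragana without voicing, and `ゞ` repeats it with voicing. This is a simplified illustration with hypothetical helper names; katakana marks and multi-character repetition are left out, and Yosina's actual rules may be stricter.

```python
import unicodedata


def _is_kanji(ch: str) -> bool:
    return "\u4e00" <= ch <= "\u9fff"


def _is_hiragana(ch: str) -> bool:
    return "\u3041" <= ch <= "\u3096"


def expand_iteration_marks(text: str) -> str:
    out: list[str] = []
    for ch in text:
        prev = out[-1] if out else ""
        if ch == "\u3005" and prev and _is_kanji(prev):  # 々
            out.append(prev)
        elif ch in ("\u309d", "\u309e") and prev and _is_hiragana(prev):  # ゝ ゞ
            # NFD splits a voiced kana into base + mark; keep only the base.
            base = unicodedata.normalize("NFD", prev)[0]
            if ch == "\u309e":
                voiced = unicodedata.normalize("NFC", base + "\u3099")
                # Keep the mark as-is when the kana has no voiced form.
                out.append(voiced if len(voiced) == 1 else ch)
            else:
                out.append(base)
        else:
            out.append(ch)
    return "".join(out)


print(expand_iteration_marks("時々"))  # 時時
print(expand_iteration_marks("いすゞ"))  # いすず
```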

### 9. **JIS X 0201 and Alike** (`jisx0201-and-alike`)
Handles half-width/full-width character conversion.
- Options: `fullwidthToHalfwidth`, `convertGL` (alphanumerics/symbols), `convertGR` (katakana), `u005cAsYenSign`
- Example: `ABC123` → `ＡＢＣ１２３`, `ｶﾀｶﾅ` → `カタカナ`
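
The half-width to full-width direction can be approximated with the standard library: printable ASCII maps to the fullwidth forms by adding 0xFEE0, and half-width katakana can be folded with NFKC, which also merges a following voicing mark. The function names are hypothetical, and the sketch ignores the `convertGL`, `convertGR`, and yen-sign options.

```python
import re
import unicodedata

# Runs of half-width katakana and related punctuation (U+FF61..U+FF9F).
_HALFWIDTH_KANA = re.compile("[\uff61-\uff9f]+")


def to_fullwidth_ascii(text: str) -> str:
    out = []
    for ch in text:
        if ch == " ":
            out.append("\u3000")  # ASCII space -> ideographic space
        elif "!" <= ch <= "~":
            out.append(chr(ord(ch) + 0xFEE0))  # U+0021..U+007E -> U+FF01..U+FF5E
        else:
            out.append(ch)
    return "".join(out)


def halfwidth_kana_to_fullwidth(text: str) -> str:
    # Normalize only the half-width runs so other characters stay untouched.
    return _HALFWIDTH_KANA.sub(
        lambda m: unicodedata.normalize("NFKC", m.group()), text
    )


print(to_fullwidth_ascii("ABC123"))
print(halfwidth_kana_to_fullwidth("\uff76\uff80\uff76\uff85"))  # カタカナ
```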

### 10. **Kanji Old-New** (`kanji-old-new`)
Converts old-style kanji (旧字体) to modern forms (新字体).
- Example: `舊字體の變換` → `旧字体の変換`

### 11. **Mathematical Alphanumerics** (`mathematical-alphanumerics`)
Normalizes mathematical alphanumeric symbols to plain ASCII.
- Example: `𝐀𝐁𝐂` (mathematical bold) → `ABC`
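
Characters in the Mathematical Alphanumeric Symbols block (U+1D400 to U+1D7FF) carry `<font>` compatibility decompositions, so NFKC recovers the plain letter or digit. A minimal sketch, assuming that block is the only target (`normalize_math_alnum` is a hypothetical name):

```python
import unicodedata


def normalize_math_alnum(text: str) -> str:
    # Apply NFKC only inside the Mathematical Alphanumeric Symbols block.
    return "".join(
        unicodedata.normalize("NFKC", ch) if 0x1D400 <= ord(ch) <= 0x1D7FF else ch
        for ch in text
    )


print(normalize_math_alnum("\U0001D400\U0001D401\U0001D402"))  # ABC
```

The same block also contains styled Greek letters, which NFKC maps to plain Greek rather than ASCII.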

### 12. **Prolonged Sound Marks** (`prolonged-sound-marks`)
Handles contextual conversion between hyphens and prolonged sound marks.
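- Options: `skipAlreadyTransliteratedChars`, `allowProlongedHatsuon`, `allowProlongedSokuon`, `replaceProlongedMarksFollowingAlnums` (the direct-configuration example above passes the last one in snake_case, as `replace_prolonged_marks_following_alnums`)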
- Example: `イ−ハト−ヴォ` (with hyphen) → `イーハトーヴォ` (prolonged mark)
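
The contextual part of this rule can be sketched as: a hyphen-like character becomes the prolonged sound mark `ー` only when the character before it is katakana. The set of hyphen-like characters and the function name below are assumptions for illustration; none of the listed options are modelled.

```python
# Hypothetical subset of characters that may stand in for a prolonged sound mark:
# hyphen-minus, hyphen, minus sign, fullwidth hyphen-minus, half-width mark.
_HYPHEN_LIKE = {"-", "\u2010", "\u2212", "\uff0d", "\uff70"}


def _is_katakana(ch: str) -> bool:
    return "\u30a1" <= ch <= "\u30fa" or ch == "\u30fc"


def fix_prolonged_marks(text: str) -> str:
    out: list[str] = []
    for ch in text:
        # Replace only when the previous output character is katakana.
        if ch in _HYPHEN_LIKE and out and _is_katakana(out[-1]):
            out.append("\u30fc")
        else:
            out.append(ch)
    return "".join(out)


print(fix_prolonged_marks("イ\u2212ハト\u2212ヴォ"))  # イーハトーヴォ
print(fix_prolonged_marks("2019-2020"))  # 2019-2020 (unchanged)
```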

### 13. **Radicals** (`radicals`)
Converts CJK radical characters to their corresponding ideographs.
- Example: `⾔⾨⾷` (Kangxi radicals) → `言門食`

### 14. **Spaces** (`spaces`)
Normalizes various Unicode space characters to standard ASCII space.
- Example: `A　B` (ideographic space, U+3000) → `A B`
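
One way to approximate this with the standard library is to treat every character in Unicode category `Zs` (space separator) as a space. This is an assumption about scope: Yosina's own list of space characters may be broader or narrower, and `normalize_spaces` is a hypothetical name.

```python
import unicodedata


def normalize_spaces(text: str) -> str:
    # Category "Zs" covers U+0020, U+00A0, U+2000..U+200A, U+3000, and others.
    return "".join(" " if unicodedata.category(ch) == "Zs" else ch for ch in text)


print(normalize_spaces("A\u3000B"))  # A B
```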

## Installation

```bash
# Install with uv
uv add yosina

# Install with pip
pip install yosina
```

## Development

This project uses [uv](https://github.com/astral-sh/uv) for dependency management.

```bash
# Code generation
python -m codegen

# Install development dependencies
uv sync --extra dev

# Run tests
uv run pytest

# Run linting
uv run ruff check .

# Run formatting
uv run ruff format .

# Run type checking
uv run pyright
```

## Requirements

- Python 3.10+
- typing-extensions

## License

MIT
            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "yosina",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": null,
    "keywords": "japanese, normalization, text-processing, transliteration",
    "author": null,
    "author_email": "Moriyoshi Koizumi <mozo@mozo.jp>",
    "download_url": "https://files.pythonhosted.org/packages/f7/1e/0b04a7f1c56132c279dded1ffbd2d5ecd310f1b60e01da8cd3a7513effa6/yosina-0.1.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Japanese text transliteration library",
    "version": "0.1.0",
    "project_urls": {
        "Homepage": "https://github.com/yosina-lib/yosina",
        "Issues": "https://github.com/yosina-lib/yosina/issues",
        "Repository": "https://github.com/yosina-lib/yosina"
    },
    "split_keywords": [
        "japanese",
        " normalization",
        " text-processing",
        " transliteration"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "e9978b8077dd6d8fc6ada4604b57ace7119c5196c9a697e1e17d09f07934a8fd",
                "md5": "1ca34b3c2f2368c2590fd299cb9aa621",
                "sha256": "dbd203dd39eae21bcb7625b642bef37b27916b6fe2e38373cf2e88843ebb0f83"
            },
            "downloads": -1,
            "filename": "yosina-0.1.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "1ca34b3c2f2368c2590fd299cb9aa621",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10",
            "size": 134029,
            "upload_time": "2025-08-19T18:28:37",
            "upload_time_iso_8601": "2025-08-19T18:28:37.167490Z",
            "url": "https://files.pythonhosted.org/packages/e9/97/8b8077dd6d8fc6ada4604b57ace7119c5196c9a697e1e17d09f07934a8fd/yosina-0.1.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "f71e0b04a7f1c56132c279dded1ffbd2d5ecd310f1b60e01da8cd3a7513effa6",
                "md5": "121c25a1581dc979fa223f5001196713",
                "sha256": "2a3320c147eabd0139e76c0a413996d64e16847a5e7fafdad60f7557041d77cb"
            },
            "downloads": -1,
            "filename": "yosina-0.1.0.tar.gz",
            "has_sig": false,
            "md5_digest": "121c25a1581dc979fa223f5001196713",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 352420,
            "upload_time": "2025-08-19T18:28:38",
            "upload_time_iso_8601": "2025-08-19T18:28:38.700183Z",
            "url": "https://files.pythonhosted.org/packages/f7/1e/0b04a7f1c56132c279dded1ffbd2d5ecd310f1b60e01da8cd3a7513effa6/yosina-0.1.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-08-19 18:28:38",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "yosina-lib",
    "github_project": "yosina",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "yosina"
}
        