soe-vinorm

Name: soe-vinorm
Version: 0.2.2
Summary: An effective text normalization tool for Vietnamese
Author email: Vinh Dang <quangvinh0842@gmail.com>
Upload time: 2025-09-07 00:37:33
Requires Python: >=3.8
Keywords: nlp, non-standard-words, speech, text-normalization, tts, vietnamese
# Soe Vinorm - Vietnamese Text Normalization Toolkit

Soe Vinorm is an effective and extensible toolkit for Vietnamese text normalization, designed for use in Text-to-Speech (TTS) and NLP pipelines. It detects and expands non-standard words (NSWs) such as numbers, dates, abbreviations, and more, converting them into their spoken forms. This project is based on the paper [Non-Standard Vietnamese Word Detection and Normalization for Text-to-Speech](https://arxiv.org/abs/2209.02971).

## Installation

### Option 1: Clone the repository (for development)
```bash
# Clone the repository
git clone https://github.com/vinhdq842/soe-vinorm.git
cd soe-vinorm

# Install dependencies including development dependencies (using uv)
uv sync --dev
```

### Option 2: Install from PyPI
```bash
# Install using uv
uv add soe-vinorm

# Or using pip
pip install soe-vinorm
```

### Option 3: Install from source
```bash
# Install directly from GitHub
uv pip install git+https://github.com/vinhdq842/soe-vinorm.git
```

## Usage

Basic usage

```python
from soe_vinorm import SoeNormalizer

normalizer = SoeNormalizer()
text = "Từ năm 2021 đến nay, đây là lần thứ 3 Bộ Công an xây dựng thông tư để quy định liên quan đến mẫu hộ chiếu, giấy thông hành."

# Single
result = normalizer.normalize(text)
print(result)
# Output: Từ năm hai nghìn không trăm hai mươi mốt đến nay , đây là lần thứ ba Bộ Công an xây dựng thông tư để quy định liên quan đến mẫu hộ chiếu , giấy thông hành .

# Batch
results = normalizer.batch_normalize([text] * 5, n_jobs=5)
print(results)
```

Quick function usage

```python
from soe_vinorm import normalize_text

text = "1kg dâu 25 quả, giá 700.000 - Trung bình 30.000đ/quả"
result = normalize_text(text)
print(result)
# Output: một ki lô gam dâu hai mươi lăm quả , giá bảy trăm nghìn - Trung bình ba mươi nghìn đồng trên quả
```

```python
from soe_vinorm import batch_normalize_texts

texts = [
    "Công trình cao 7,9 m bằng chất liệu đồng nguyên chất với trọng lượng gần 7 tấn, bệ tượng cao 3,6 m",
    "Một trường ĐH tại TP.HCM ghi nhận số nguyện vọng đăng ký vào trường tăng kỷ lục, với trên 178.000.",
    "Theo phương án của Ban Quản lý dự án đường sắt, Bộ Xây dựng, tuyến đường sắt đô thị Thủ Thiêm - Long Thành có chiều dài khoảng 42 km, thiết kế đường đôi, khổ đường 1.435 mm, tốc độ thiết kế 120 km/giờ.",
    "iPhone 16 Pro hiện có giá 999 USD cho phiên bản bộ nhớ 128 GB, và 1.099 USD cho bản 256 GB. Trong khi đó, mẫu 16 Pro Max có dung lượng khởi điểm 256 GB với giá 1.199 USD.",
]

# Process multiple texts in parallel (4 worker processes)
results = batch_normalize_texts(texts, n_jobs=4)

for original, normalized in zip(texts, results):
    print(f"Original: {original}")
    print(f"Normalized: {normalized}")
    print("-" * 50)
```

Output:
```
Original: Công trình cao 7,9 m bằng chất liệu đồng nguyên chất với trọng lượng gần 7 tấn, bệ tượng cao 3,6 m
Normalized: Công trình cao bảy phẩy chín mét bằng chất liệu đồng nguyên chất với trọng lượng gần bảy tấn , bệ tượng cao ba phẩy sáu mét
--------------------------------------------------
Original: Một trường ĐH tại TP.HCM ghi nhận số nguyện vọng đăng ký vào trường tăng kỷ lục, với trên 178.000.
Normalized: Một trường Đại học tại Thành phố Hồ Chí Minh ghi nhận số nguyện vọng đăng ký vào trường tăng kỷ lục , với trên một trăm bảy mươi tám nghìn .
--------------------------------------------------
Original: Theo phương án của Ban Quản lý dự án đường sắt, Bộ Xây dựng, tuyến đường sắt đô thị Thủ Thiêm - Long Thành có chiều dài khoảng 42 km, thiết kế đường đôi, khổ đường 1.435 mm, tốc độ thiết kế 120 km/giờ.
Normalized: Theo phương án của Ban Quản lý dự án đường sắt , Bộ Xây dựng , tuyến đường sắt đô thị Thủ Thiêm - Long Thành có chiều dài khoảng bốn mươi hai ki lô mét , thiết kế đường đôi , khổ đường một nghìn bốn trăm ba mươi lăm mi li mét , tốc độ thiết kế một trăm hai mươi ki lô mét trên giờ .
--------------------------------------------------
Original: iPhone 16 Pro hiện có giá 999 USD cho phiên bản bộ nhớ 128 GB, và 1.099 USD cho bản 256 GB. Trong khi đó, mẫu 16 Pro Max có dung lượng khởi điểm 256 GB với giá 1.199 USD.
Normalized: iPhone mười sáu Pro hiện có giá chín trăm chín mươi chín U Ét Đê cho phiên bản bộ nhớ một trăm hai mươi tám ghi ga bai , và một nghìn không trăm chín mươi chín U Ét Đê cho bản hai trăm năm mươi sáu ghi ga bai . Trong khi đó , mẫu mười sáu Pro Max có dung lượng khởi điểm hai trăm năm mươi sáu ghi ga bai với giá một nghìn một trăm chín mươi chín U Ét Đê .
--------------------------------------------------
```

## Approach: Two-stage normalization

### Preprocessing & tokenizing
- Extra spaces, ASCII art, emojis, HTML entities, unspoken words, etc., are removed.
- A regex-based tokenizer is then used to split the sentence into tokens.
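The preprocessing and tokenizing steps can be sketched roughly as follows. This is a minimal illustration, not the toolkit's actual implementation: the cleanup patterns and the token regex here are simplified assumptions.

```python
import re

def preprocess(text: str) -> str:
    """Strip a few classes of unspoken content (simplified sketch)."""
    text = re.sub(r"&[a-z]+;", " ", text)                 # HTML entities
    text = re.sub(r"[\U0001F300-\U0001FAFF]", " ", text)  # common emoji range
    return re.sub(r"\s+", " ", text).strip()              # collapse extra spaces

def tokenize(text: str) -> list[str]:
    """Split into tokens, keeping numbers (with . or , groups) whole."""
    return re.findall(r"\d+(?:[.,]\d+)*|\w+|[^\w\s]", text)

print(tokenize(preprocess("Giá   700.000đ/quả!")))
# ['Giá', '700.000', 'đ', '/', 'quả', '!']
```

Note the alternation order: the number pattern comes first so that `700.000` survives as one token instead of being split at the dot.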

### Stage 1: Non-standard word detection
- A sequence tagger is used to extract non-standard words (NSWs) and categorize them into 18 different types.
- These NSWs can later be verbalized properly according to their types.
- The sequence tagger can be any sequence labeling model; this implementation uses a Conditional Random Field due to limited data.
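To illustrate the tagging interface (tokens in, one NSW label per token out), here is a trivial rule-based stand-in. The real toolkit uses a trained CRF and an 18-type tagset; the label names and rules below are purely illustrative assumptions.

```python
import re

def tag_tokens(tokens: list[str]) -> list[str]:
    """Assign a (simplified) NSW type label to each token."""
    tags = []
    for tok in tokens:
        if re.fullmatch(r"\d{1,2}/\d{1,2}/\d{2,4}", tok):
            tags.append("DATE")
        elif re.fullmatch(r"\d+(?:[.,]\d+)*", tok):
            tags.append("NUMBER")
        elif tok.isupper() and len(tok) >= 2:
            tags.append("ABBREVIATION")
        else:
            tags.append("O")  # ordinary word, no expansion needed
    return tags

print(tag_tokens(["Từ", "năm", "2021", ",", "ĐH", "TP.HCM"]))
# ['O', 'O', 'NUMBER', 'O', 'ABBREVIATION', 'ABBREVIATION']
```

A statistical tagger replaces these hand-written rules with per-token feature extraction and learned transition scores, but the input/output contract stays the same.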

### Stage 2: Non-standard word normalization
- With the NSWs detected in **Stage 1** and their respective types, regex-based expanders are applied to produce normalized results.
- Each NSW type has its own dedicated expander.
- The normalized results are then inserted back into the original sentence, resulting in the desired normalized sentence.
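As a taste of what a number expander must handle, the sketch below spells out 0-99 in Vietnamese, including the irregular unit forms ("mốt" for 1 and "lăm" for 5 after "mươi") visible in the example outputs above. It is a simplified assumption of how such an expander could look, not the toolkit's actual code.

```python
DIGITS = ["không", "một", "hai", "ba", "bốn", "năm", "sáu", "bảy", "tám", "chín"]

def expand_two_digit(n: int) -> str:
    """Spell out 0-99 in Vietnamese, with the irregular unit forms."""
    if n < 10:
        return DIGITS[n]
    tens, units = divmod(n, 10)
    head = "mười" if tens == 1 else DIGITS[tens] + " mươi"
    if units == 0:
        return head
    if units == 1 and tens > 1:
        return head + " mốt"   # 21 -> "hai mươi mốt", not "hai mươi một"
    if units == 5:
        return head + " lăm"   # 25 -> "hai mươi lăm", not "hai mươi năm"
    return head + " " + DIGITS[units]

print(expand_two_digit(21))  # hai mươi mốt
print(expand_two_digit(25))  # hai mươi lăm
print(expand_two_digit(42))  # bốn mươi hai
```

The full expanders also cover thousands, decimals, dates, currencies, and units, each routed by the NSW type predicted in Stage 1.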

### Other details
- Foreign NSWs are currently kept as-is.
- Abbreviation NSWs
  - **v0.1**: Quantized PhoBERT model combined with a Vietnamese abbreviation dictionary.
  - **v0.2**: A small neural network, also incorporating a Vietnamese abbreviation dictionary.
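The dictionary-lookup part of abbreviation handling can be sketched as below. The entries shown are a hypothetical mini-dictionary; the real toolkit ships a much larger one and uses a model to choose among ambiguous expansions.

```python
# Hypothetical mini-dictionary for illustration only.
ABBREV_DICT = {
    "ĐH": "Đại học",
    "TP.HCM": "Thành phố Hồ Chí Minh",
    "UBND": "Ủy ban nhân dân",
}

def expand_abbreviation(token: str) -> str:
    """Look up an abbreviation; unknown tokens pass through unchanged."""
    return ABBREV_DICT.get(token, token)

print(expand_abbreviation("TP.HCM"))  # Thành phố Hồ Chí Minh
```

The hard part a plain dictionary cannot solve is ambiguity (one abbreviation, several plausible expansions), which is where the PhoBERT-based (v0.1) or small neural (v0.2) disambiguator comes in.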

## Testing
Run all tests with:
```bash
pytest tests
```

## Author
- Vinh Dang (<quangvinh0842@gmail.com>)

## License
MIT License

            
