soe-vinorm

Name: soe-vinorm
Version: 0.1.6
Summary: An effective text normalization tool for Vietnamese
Upload time: 2025-08-02 15:21:19
Requires Python: >=3.8
Keywords: nlp, non-standard-words, speech, text-normalization, tts, vietnamese
# Soe Vinorm - Vietnamese Text Normalization Toolkit

Soe Vinorm is an effective and extensible toolkit for Vietnamese text normalization, designed for use in Text-to-Speech (TTS) and NLP pipelines. It detects and expands non-standard words (NSWs) such as numbers, dates, abbreviations, and more, converting them into their spoken forms. This project is based on the paper [Non-Standard Vietnamese Word Detection and Normalization for Text-to-Speech](https://arxiv.org/abs/2209.02971).

## Installation

### Option 1: Clone the repository (for development)
```bash
# Clone the repository
git clone https://github.com/vinhdq842/soe-vinorm.git
cd soe-vinorm

# Install dependencies including development dependencies (using uv)
uv sync --dev
```

### Option 2: Install from PyPI
```bash
# Install using uv
uv add soe-vinorm

# Or using pip
pip install soe-vinorm
```

### Option 3: Install from source
```bash
# Install directly from GitHub
uv pip install git+https://github.com/vinhdq842/soe-vinorm.git
```

## Usage

```python
from soe_vinorm import SoeNormalizer

normalizer = SoeNormalizer()
text = 'Từ năm 2021 đến nay, đây là lần thứ 3 Bộ Công an xây dựng thông tư để quy định liên quan đến mẫu hộ chiếu, giấy thông hành.'

result = normalizer.normalize(text)
print(result)
# Output: Từ năm hai nghìn không trăm hai mươi mốt đến nay , đây là lần thứ ba Bộ Công an xây dựng thông tư để quy định liên quan đến mẫu hộ chiếu , giấy thông hành .
```

### Quick function usage
```python
from soe_vinorm import normalize_text

text = "1kg dâu 25 quả, giá 700.000 - Trung bình 30.000đ/quả"
result = normalize_text(text)
print(result)
# Output: một ki lô gam dâu hai mươi lăm quả , giá bảy trăm nghìn - Trung bình ba mươi nghìn đồng trên quả
```

### Batch processing
```python
from soe_vinorm import batch_normalize_texts

texts = [
    "Tôi có 123.456 đồng trong tài khoản",
    "ĐT Việt Nam giành HCV tại SEA Games 32",
    "Nhiệt độ hôm nay là 25°C, ngày 25/04/2014",
    "Tốc độ xe đạt 60km/h trên quãng đường 150km"
]

# Process multiple texts in parallel (4 worker processes)
results = batch_normalize_texts(texts, n_jobs=4)

for original, normalized in zip(texts, results):
    print(f"Original: {original}")
    print(f"Normalized: {normalized}")
    print("-" * 50)
```

Output:
```
Original: Tôi có 123.456 đồng trong tài khoản
Normalized: Tôi có một trăm hai mươi ba nghìn bốn trăm năm mươi sáu đồng trong tài khoản
--------------------------------------------------
Original: ĐT Việt Nam giành HCV tại SEA Games 32
Normalized: đội tuyển Việt Nam giành Huy chương vàng tại SEA Games ba mươi hai
--------------------------------------------------
Original: Nhiệt độ hôm nay là 25°C, ngày 25/04/2014
Normalized: Nhiệt độ hôm nay là hai mươi lăm độ xê , ngày hai mươi lăm tháng bốn năm hai nghìn không trăm mười bốn
--------------------------------------------------
Original: Tốc độ xe đạt 60km/h trên quãng đường 150km
Normalized: Tốc độ xe đạt sáu mươi ki lô mét trên giờ trên quãng đường một trăm năm mươi ki lô mét
--------------------------------------------------
```

## Approach: Two-stage normalization

### Preprocessing & tokenizing
- Extra spaces, ASCII art, emojis, HTML entities, unspoken words, etc. are removed.
- A regex-based tokenizer then splits the sentence into tokens.
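The tokenizing step above might look like the following minimal sketch. The pattern here is illustrative, not soe-vinorm's actual tokenizer; it keeps digit sequences with internal separators (dates, large numbers) together, while splitting words and punctuation apart.

```python
import re

# Hypothetical token pattern (soe-vinorm's real patterns may differ):
#  1) numbers with optional separators: 123.456, 25/04/2014, 60km would
#     first yield "60" here, then "km" as a word
#  2) words, including Vietnamese letters with diacritics (\w is Unicode-aware)
#  3) any remaining single punctuation mark
TOKEN_RE = re.compile(
    r"\d+(?:[./,:-]\d+)*"  # numbers, dates, times
    r"|\w+"                # words (Unicode letters and digits)
    r"|[^\w\s]"            # single punctuation characters
)


def tokenize(sentence):
    """Split a sentence into tokens, implicitly dropping extra whitespace."""
    return TOKEN_RE.findall(sentence)
```

For example, `tokenize("Nhiệt độ hôm nay là 25°C, ngày 25/04/2014")` keeps `25/04/2014` as a single token while separating `25`, `°`, and `C`.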

### Stage 1: Non-standard word detection
- A sequence tagger extracts non-standard words (NSWs) and categorizes them into one of 18 types.
- These NSWs can later be verbalized appropriately according to their types.
- The sequence tagger can be any sequence labeling model; this implementation uses a Conditional Random Field (CRF) due to the limited training data.
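To make the CRF step concrete, here is a sketch of the kind of per-token feature dictionaries such a tagger typically consumes. The feature names are illustrative, not soe-vinorm's actual feature set; a CRF library would map these dictionaries plus gold NSW-type labels to a trained tagger.

```python
def token_features(tokens, i):
    """Hand-crafted features for token i, in the dict form CRF toolkits expect."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_digit": tok.isdigit(),                     # pure numbers: "32"
        "has_digit": any(c.isdigit() for c in tok),    # mixed: "60km"
        "is_upper": tok.isupper(),                     # abbreviations: "HCV"
        "has_slash": "/" in tok,                       # dates, rates: "25/04/2014"
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }


sentence = ["ĐT", "Việt", "Nam", "giành", "HCV", "tại", "SEA", "Games", "32"]
X = [token_features(sentence, i) for i in range(len(sentence))]
```

Features like `is_upper` and `has_slash` give the model surface cues for abbreviation- and date-like NSWs even with little training data, which is one reason a feature-based CRF is a reasonable fit here.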

### Stage 2: Non-standard word normalization
- Regex-based expanders are applied to the NSWs detected in **Stage 1**, according to their respective types, to produce the normalized results.
- Each NSW type has its own dedicated expander.
- The normalized results are then inserted into the original sentence, resulting in the desired normalized sentence.
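As a toy illustration of a per-type expander, here is a cardinal-number expander limited to two-digit numbers. It is not soe-vinorm's implementation, but it shows the shape of a dedicated expander, including Vietnamese reading rules like "mốt" and "lăm" for trailing 1 and 5.

```python
UNITS = ["không", "một", "hai", "ba", "bốn", "năm", "sáu", "bảy", "tám", "chín"]


def expand_cardinal(nsw):
    """Expand a two-digit CARDINAL NSW into its Vietnamese spoken form."""
    n = int(nsw)
    if n < 10:
        return UNITS[n]
    tens, ones = divmod(n, 10)
    head = "mười" if tens == 1 else UNITS[tens] + " mươi"
    if ones == 0:
        return head
    if ones == 1 and tens > 1:
        return head + " mốt"   # 21 -> "hai mươi mốt", not "hai mươi một"
    if ones == 5:
        return head + " lăm"   # 25 -> "hai mươi lăm", not "hai mươi năm"
    return head + " " + UNITS[ones]
```

The real toolkit handles far more: thousands, decimals, dates, measure units, and so on, each behind its own expander.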

### Minor details
- *Foreign* NSWs are currently kept as is.
- *Abbreviation* NSWs are expanded using a language model (e.g., BERT) combined with a Vietnamese abbreviation dictionary.
- ...
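The abbreviation-expansion idea above can be sketched as follows. The dictionary entries and the scoring function are illustrative stand-ins: in the real toolkit, a BERT-style model would score each dictionary candidate in context, and the best-scoring expansion wins.

```python
# Tiny illustrative slice of an abbreviation dictionary (hypothetical entries).
ABBR_DICT = {
    "ĐT": ["đội tuyển", "điện thoại"],
    "HCV": ["huy chương vàng"],
}


def _toy_score(candidate, context):
    """Stand-in for a masked-LM plausibility score: count candidate words
    that also appear in the context (purely illustrative)."""
    return sum(word in context for word in candidate.split())


def expand_abbreviation(abbr, context, score_fn=_toy_score):
    """Pick the dictionary expansion of `abbr` that scores best in `context`."""
    candidates = ABBR_DICT.get(abbr)
    if not candidates:
        return abbr  # unknown abbreviation: keep as is
    return max(candidates, key=lambda c: score_fn(c, context))
```

So `"ĐT"` expands to "đội tuyển" in a sports context but could resolve to "điện thoại" elsewhere; the contextual scorer is what disambiguates.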


## Testing
Run all tests with:
```bash
pytest tests
```

## Author
- Vinh Dang (<quangvinh0842@gmail.com>)

## License
MIT License

            
