| Field | Value |
| --- | --- |
| Name | olaph |
| Version | 0.1.9 |
| Summary | A multilingual phonemizer combining lexica, NLP, and probabilistic scoring for improved phonemization accuracy. |
| upload_time | 2025-10-30 10:45:56 |
| requires_python | >=3.10 |
| license | MIT |
| keywords | phonemizer, text-to-speech, linguistics, nlp, multilingual |
# OLaPh — Optimal Language Phonemizer
**OLaPh (Optimal Language Phonemizer)** is a multilingual phonemization framework that converts text into phonemes. In evaluations on German and English, it achieves higher accuracy than comparable frameworks.
---
## Overview
Traditional phonemizers rely on simple rule-based mappings or lexicon lookups.
Neural and hybrid approaches improve generalization but still struggle with:
- Names and foreign words
- Abbreviations and acronyms
- Loanwords and compounds
- Ambiguous homographs
**OLaPh** tackles these challenges by combining:
- Extensive **language-specific dictionaries**
- **Abbreviation, number, and letter normalization**
- **Compound resolution with probabilistic scoring**
- **Cross-language handling**
- **NLP-based preprocessing** via [spaCy](https://spacy.io) and [Lingua](https://github.com/pemistahl/lingua-py)
Evaluations in **German** and **English** show improved accuracy and robustness over existing phonemizers, including on challenging multilingual datasets.
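To make the idea of abbreviation and letter normalization concrete, here is a minimal sketch that spells out all-caps acronyms letter by letter using a per-language letter-name dictionary. It is an illustration only, not OLaPh's implementation: the `LETTER_NAMES` table, the `spell_acronyms` function, and the acronym pattern are all hypothetical, and the table covers only a handful of letters.

```python
import re

# Hypothetical, deliberately tiny letter-name lexica (IPA).
# A real lexicon would cover the whole alphabet for each language.
LETTER_NAMES = {
    "en": {"A": "eɪ", "B": "biː", "M": "ɛm", "S": "ɛs", "U": "juː", "W": "ˈdʌbəljuː"},
    "de": {"A": "aː", "B": "beː", "M": "ɛm", "S": "ɛs", "U": "uː", "W": "veː"},
}

# Assumption: an acronym is a standalone run of 2 to 5 capital letters.
ACRONYM = re.compile(r"\b[A-Z]{2,5}\b")


def spell_acronyms(text: str, lang: str) -> str:
    """Replace each acronym with the space-separated names of its letters."""
    names = LETTER_NAMES[lang]

    def replace(match: re.Match) -> str:
        letters = match.group(0)
        # Leave the token untouched if any letter is missing from the lexicon.
        if not all(letter in names for letter in letters):
            return letters
        return " ".join(names[letter] for letter in letters)

    return ACRONYM.sub(replace, text)


print(spell_acronyms("BMW", "de"))
print(spell_acronyms("BMW", "en"))
```

The same acronym gets different letter names depending on the language passed in, which is why detecting the language of an acronym matters for the final pronunciation.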
---
## Features
- Multilingual phonemization (DE, EN, FR, ES)
- Abbreviation and letter pronunciation dictionaries
- Number normalization
- Cross-language acronym detection
- Compound splitting with probabilistic scoring
- Freely available lexica, derived from wiktionary.org, for research and development
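Compound splitting with probabilistic scoring can be sketched as a search over all ways to segment a word into lexicon entries, where each segmentation is scored by the summed log-frequencies of its parts. The following is a toy illustration under stated assumptions, not OLaPh's actual scoring function: the `TOY_LEXICON` frequencies are made up, `split_compound` and `split_penalty` are hypothetical names, and linking elements such as the German Fugen-s are ignored.

```python
import math

# Hypothetical toy lexicon: word -> relative frequency (made-up numbers).
TOY_LEXICON = {
    "haus": 0.020,
    "tür": 0.010,
    "schlüssel": 0.005,
    "haustür": 0.002,
}


def split_compound(word, lexicon, split_penalty=2.0):
    """Return (score, parts) for the best segmentation, or None if none exists.

    The score is the sum of log-frequencies of the parts, minus a fixed
    penalty for every split point, so splits into fewer parts are favoured.
    """
    word = word.lower()
    n = len(word)
    # best[i] holds (score, parts) for the best segmentation of word[:i].
    best = [None] * (n + 1)
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for start in range(end):
            if best[start] is None:
                continue  # word[:start] cannot be segmented
            piece = word[start:end]
            freq = lexicon.get(piece)
            if freq is None:
                continue  # piece is not a known word
            penalty = split_penalty if start > 0 else 0.0
            score = best[start][0] + math.log(freq) - penalty
            if best[end] is None or score > best[end][0]:
                best[end] = (score, best[start][1] + [piece])
    return best[n]


result = split_compound("Haustürschlüssel", TOY_LEXICON)
if result is not None:
    score, parts = result
    print(parts, round(score, 3))
```

With these made-up frequencies the two-part split `haustür` + `schlüssel` scores higher than the three-part split, because each extra part adds another log-probability term and another penalty. Once a word is split, each part can be looked up in the pronunciation lexicon separately.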
## Large Language Model
An LLM trained on OLaPh output is also available. It is a GemmaX 2B model trained on ~10M sentences from the FineWeb corpus that were phonemized with the OLaPh framework.
The model is available on [Hugging Face](https://huggingface.co/iisys-hof/olaph).
---
## Installation
### From PyPI
```bash
pip install olaph
python -m spacy download de_core_news_sm
python -m spacy download en_core_web_sm
python -m spacy download es_core_news_sm
python -m spacy download fr_core_news_sm
```
### From source
```bash
git clone https://github.com/iisys-hof/olaph.git
cd olaph
pip install -e .
python -m spacy download de_core_news_sm
python -m spacy download en_core_web_sm
python -m spacy download es_core_news_sm
python -m spacy download fr_core_news_sm
```
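After either installation route, it can be useful to confirm that the four spaCy models are present. spaCy models are installed as regular Python packages, so a minimal check, using only the standard library, can look them up by name. The helper name `missing_spacy_models` is hypothetical and not part of OLaPh or spaCy.

```python
import importlib.util

# The four models named in the installation commands above.
REQUIRED_MODELS = (
    "de_core_news_sm",
    "en_core_web_sm",
    "es_core_news_sm",
    "fr_core_news_sm",
)


def missing_spacy_models(models=REQUIRED_MODELS):
    """Return the model package names that cannot be found in this environment."""
    return [name for name in models if importlib.util.find_spec(name) is None]


for name in missing_spacy_models():
    # Print the command that would install each missing model.
    print(f"missing model, install with: python -m spacy download {name}")
```

If the loop prints nothing, all four model packages are importable in the current environment.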
## Example Usage
```python
from olaph import Olaph
phonemizer = Olaph()
output = phonemizer.phonemize_text("He ordered a Brezel and a beer in a tavern near München.", lang="en")
print(output)
```
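To phonemize text in several languages, the call shown above can be wrapped in a small helper. This is a sketch under assumptions: only `lang="en"` appears in the example above, so the code `"de"` is a guess based on the supported-language list, and `phonemize_samples` is a hypothetical helper rather than part of the OLaPh API. The helper accepts any object with a `phonemize_text(text, lang=...)` method, so it can be tried without OLaPh installed.

```python
from typing import Any, Dict

# Sample sentences keyed by language code.
# Assumption: "de" is accepted the same way as the documented "en".
SAMPLES = {
    "en": "He ordered a Brezel and a beer in a tavern near München.",
    "de": "Er bestellte eine Brezel und ein Bier in München.",
}


def phonemize_samples(phonemizer: Any, samples: Dict[str, str]) -> Dict[str, Any]:
    """Call phonemize_text once per language and collect the results by code."""
    return {
        lang: phonemizer.phonemize_text(text, lang=lang)
        for lang, text in samples.items()
    }


# With OLaPh installed, usage would look like this:
#   from olaph import Olaph
#   results = phonemize_samples(Olaph(), SAMPLES)
#   for lang, output in results.items():
#       print(lang, output)
```

Creating the `Olaph()` object once and reusing it for every sentence avoids repeating whatever setup the constructor performs.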
---
## Dependencies
- [spaCy](https://spacy.io)
- [Lingua](https://github.com/pemistahl/lingua-py)
- [num2words](https://github.com/savoirfairelinux/num2words)
- [inflect](https://github.com/jaraco/inflect)
---
## Research Summary
Phonemization, the conversion of text into phonemes, is a key step in text-to-speech. Traditional approaches use rule-based transformations and lexicon lookups, while more advanced methods apply preprocessing techniques or neural networks for improved accuracy on out-of-domain vocabulary. However, all systems struggle with names, loanwords, abbreviations, and homographs. This work presents OLaPh (Optimal Language Phonemizer), a framework that combines large lexica, multiple NLP techniques, and compound resolution with a probabilistic scoring function. Evaluations in German and English show improved accuracy over previous approaches, including on a challenging dataset. To further address unresolved cases, we train a large language model on OLaPh-generated data, which achieves even stronger generalization and performance. Together, the framework and LLM improve phonemization consistency and provide a freely available resource for future research.
---
## Citation
If you use OLaPh in academic work, please cite:
```bibtex
@misc{wirth2025olaphoptimallanguagephonemizer,
title={OLaPh: Optimal Language Phonemizer},
author={Johannes Wirth},
year={2025},
eprint={2509.20086},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2509.20086},
}
```
Raw data
{
"_id": null,
"home_page": null,
"name": "olaph",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.10",
"maintainer_email": null,
"keywords": "phonemizer, text-to-speech, linguistics, NLP, multilingual",
"author": null,
"author_email": "Johannes Wirth <johannes.wirth.3@iisys.de>",
"download_url": "https://files.pythonhosted.org/packages/66/28/e20ec54377811cf98305d99f0b565c07c4bcb6b19454b052ed6e8202b3aa/olaph-0.1.9.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "A multilingual phonemizer combining lexica, NLP, and probabilistic scoring for improved phonemization accuracy..",
"version": "0.1.9",
"project_urls": {
"Documentation": "https://github.com/iisys-hof/olaph#readme",
"Homepage": "https://github.com/iisys-hof/olaph",
"Issues": "https://github.com/iisys-hof/olaph/issues"
},
"split_keywords": [
"phonemizer",
" text-to-speech",
" linguistics",
" nlp",
" multilingual"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "4da9a4190de879e11f6882a7838201acff14ae55df9c16fbfe644ae7f395a157",
"md5": "0a6c582c95557c9385b7e1ae941bfc06",
"sha256": "2ec0a9f4ed7c235c781f8361fb7db4cc2ed5e2b2760f721f31c492387875c7ac"
},
"downloads": -1,
"filename": "olaph-0.1.9-py3-none-any.whl",
"has_sig": false,
"md5_digest": "0a6c582c95557c9385b7e1ae941bfc06",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.10",
"size": 29742486,
"upload_time": "2025-10-30T10:45:49",
"upload_time_iso_8601": "2025-10-30T10:45:49.723749Z",
"url": "https://files.pythonhosted.org/packages/4d/a9/a4190de879e11f6882a7838201acff14ae55df9c16fbfe644ae7f395a157/olaph-0.1.9-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "6628e20ec54377811cf98305d99f0b565c07c4bcb6b19454b052ed6e8202b3aa",
"md5": "919e9e37704bcd487ddb8f2f579221f8",
"sha256": "852ed6e370d98cad26eec1f175bcaaa4a57cbfdbf72de68c9d8f7c7e547ccdac"
},
"downloads": -1,
"filename": "olaph-0.1.9.tar.gz",
"has_sig": false,
"md5_digest": "919e9e37704bcd487ddb8f2f579221f8",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.10",
"size": 29499702,
"upload_time": "2025-10-30T10:45:56",
"upload_time_iso_8601": "2025-10-30T10:45:56.985863Z",
"url": "https://files.pythonhosted.org/packages/66/28/e20ec54377811cf98305d99f0b565c07c4bcb6b19454b052ed6e8202b3aa/olaph-0.1.9.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-10-30 10:45:56",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "iisys-hof",
"github_project": "olaph#readme",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "olaph"
}
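The `sha256` values in the raw data can be used to check a manually downloaded file. A minimal sketch using only the standard library follows; the function names `sha256_of` and `matches_digest` are hypothetical, and the file path in the usage comment assumes the sdist was saved to the current directory.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path, expected_hex):
    """Return True if the file's SHA-256 digest equals the expected hex string."""
    return sha256_of(path) == expected_hex.strip().lower()


# Usage, assuming olaph-0.1.9.tar.gz is in the current directory:
#   expected = "852ed6e370d98cad26eec1f175bcaaa4a57cbfdbf72de68c9d8f7c7e547ccdac"
#   print(matches_digest("olaph-0.1.9.tar.gz", expected))
```

Reading in chunks keeps memory use low, which matters here because both release files are close to 30 MB.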