olaph

Name	olaph
Version	0.1.9
home_page	None
Summary	A multilingual phonemizer combining lexica, NLP, and probabilistic scoring for improved phonemization accuracy.
upload_time	2025-10-30 10:45:56
maintainer	None
docs_url	None
author	None
requires_python	>=3.10
license	MIT
keywords	phonemizer text-to-speech linguistics nlp multilingual

# OLaPh: Optimal Language Phonemizer

[![PyPI version](https://img.shields.io/pypi/v/olaph.svg?logo=pypi)](https://pypi.org/project/olaph/)
[![Python versions](https://img.shields.io/pypi/pyversions/olaph.svg)](https://pypi.org/project/olaph/)
[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)

**OLaPh (Optimal Language Phonemizer)** is a multilingual phonemization framework that converts text into phonemes. In evaluations on German and English, it achieves higher accuracy than comparable frameworks.

---

## Overview

Traditional phonemizers rely on simple rule-based mappings or lexicon lookups.
Neural and hybrid approaches improve generalization but still struggle with:

- Names and foreign words
- Abbreviations and acronyms
- Loanwords and compounds
- Ambiguous homographs

**OLaPh** tackles these challenges by combining:

- Extensive **language-specific dictionaries**
- **Abbreviation, number, and letter normalization**
- **Compound resolution with probabilistic scoring**
- **Cross-language handling**
- **NLP-based preprocessing** via [spaCy](https://spacy.io) and [Lingua](https://github.com/pemistahl/lingua-py)
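
To make the idea of letter normalization and cross-language handling more concrete, here is a minimal sketch of spelling out an acronym with the letter names of its own language rather than those of the surrounding text. The tables `LETTER_NAMES` and `ACRONYM_LANGUAGE` and the function `spell_acronym` are hypothetical and exist only for this illustration; they are not OLaPh's actual lexica or API.

```python
# Toy letter-name tables in IPA; illustrative only, not OLaPh's lexica.
LETTER_NAMES = {
    "en": {"B": "biː", "F": "ɛf", "I": "aɪ", "M": "ɛm", "W": "ˈdʌbəljuː"},
    "de": {"B": "beː", "F": "ɛf", "I": "iː", "M": "ɛm", "W": "veː"},
}

# Hypothetical table recording which language's letter names an acronym uses.
ACRONYM_LANGUAGE = {"FBI": "en", "BMW": "de"}


def spell_acronym(token: str, text_lang: str) -> str:
    """Spell a token letter by letter, preferring the acronym's own language."""
    # Fall back to the language of the surrounding text for unknown acronyms.
    lang = ACRONYM_LANGUAGE.get(token, text_lang)
    names = LETTER_NAMES[lang]
    return " ".join(names[letter] for letter in token)


# "FBI" inside a German sentence keeps its English letter names.
print(spell_acronym("FBI", text_lang="de"))
# "BMW" inside an English sentence keeps its German letter names.
print(spell_acronym("BMW", text_lang="en"))
```

A real system would also need to decide whether a token is an acronym at all and which language it belongs to; the lookup table here simply stands in for that step.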

Evaluations in **German** and **English** show improved accuracy and robustness over existing phonemizers, including on challenging multilingual datasets.

---

## Features

- Multilingual phonemization (DE, EN, FR, ES)
- Abbreviation and letter pronunciation dictionaries
- Number normalization
- Cross-language acronym detection
- Compound splitting with probabilistic scoring
- Freely available lexica, derived from wiktionary.org, for research and development
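
As an illustration of compound splitting with probabilistic scoring, the sketch below picks the segmentation of a word whose parts have the highest summed log-probability. The toy `WORD_PROB` table, the unigram scoring, and the function `split_compound` are assumptions made for this example; OLaPh's actual scoring function may differ.

```python
import math
from functools import lru_cache

# Toy unigram probabilities; a real lexicon would be far larger.
WORD_PROB = {"haus": 0.02, "tür": 0.01, "schlüssel": 0.005, "haustür": 0.001}


def split_compound(word: str) -> tuple[list[str], float]:
    """Return the best-scoring split of `word` into lexicon entries.

    The score is the sum of log-probabilities of the parts. If no full
    segmentation exists, the result is ([], -inf).
    """
    word = word.lower()

    @lru_cache(maxsize=None)
    def best(start: int) -> tuple[tuple[str, ...], float]:
        if start == len(word):
            return (), 0.0
        best_parts, best_score = (), -math.inf
        for end in range(start + 1, len(word) + 1):
            piece = word[start:end]
            if piece not in WORD_PROB:
                continue
            rest_parts, rest_score = best(end)
            score = math.log(WORD_PROB[piece]) + rest_score
            if score > best_score:
                best_parts, best_score = (piece, *rest_parts), score
        return best_parts, best_score

    parts, score = best(0)
    return list(parts), score


parts, score = split_compound("Haustürschlüssel")
print(parts, round(score, 3))
```

With these toy numbers the two-part split `haustür` + `schlüssel` scores higher than the three-part split, because every additional part multiplies in another probability below one.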

## Large Language Model

An LLM trained on OLaPh output is also available. It is a GemmaX 2B model trained on ~10M sentences derived from the FineWeb corpus and phonemized with the OLaPh framework.

It is available on [Hugging Face](https://huggingface.co/iisys-hof/olaph).

---

## Installation

### From PyPI

```bash
pip install olaph
python -m spacy download de_core_news_sm
python -m spacy download en_core_web_sm
python -m spacy download es_core_news_sm
python -m spacy download fr_core_news_sm
```

### From source

```bash
git clone https://github.com/iisys-hof/olaph.git
cd olaph
pip install -e .
python -m spacy download de_core_news_sm
python -m spacy download en_core_web_sm
python -m spacy download es_core_news_sm
python -m spacy download fr_core_news_sm
```
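
To check which of the spaCy models listed above still need to be downloaded, a small helper like the following may be useful. It relies on the fact that `spacy download` installs each model as an importable Python package; the helper name `missing_models` is hypothetical and not part of OLaPh.

```python
from importlib.util import find_spec

# The four models named in the installation commands.
REQUIRED_MODELS = (
    "de_core_news_sm",
    "en_core_web_sm",
    "es_core_news_sm",
    "fr_core_news_sm",
)


def missing_models(models=REQUIRED_MODELS) -> list[str]:
    """Return the model packages that cannot be found in this environment."""
    return [name for name in models if find_spec(name) is None]


# Print the download command for every model that is not installed yet.
for name in missing_models():
    print(f"python -m spacy download {name}")
```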

## Example Usage

```python
from olaph import Olaph

phonemizer = Olaph()

output = phonemizer.phonemize_text("He ordered a Brezel and a beer in a tavern near München.", lang="en")

print(output)
```
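
For several sentences in different languages, the call shown above can be wrapped in a small helper. This is a sketch: `phonemize_batch` and the `Phonemizer` protocol are hypothetical names, and only `phonemize_text(text, lang=...)` is taken from the example above. Language codes other than `"en"` (such as `"de"`) are an assumption based on the supported languages listed under Features.

```python
from typing import Iterable, Protocol


class Phonemizer(Protocol):
    """Anything that offers phonemize_text(text, lang=...), such as Olaph."""

    def phonemize_text(self, text: str, lang: str): ...


def phonemize_batch(
    phonemizer: Phonemizer, sentences: Iterable[tuple[str, str]]
) -> list:
    """Phonemize (text, lang) pairs in order and collect the results."""
    return [phonemizer.phonemize_text(text, lang=lang) for text, lang in sentences]


# Usage with the real package (requires olaph and the spaCy models):
#
#   from olaph import Olaph
#   results = phonemize_batch(Olaph(), [
#       ("He ordered a Brezel and a beer.", "en"),
#       ("Er bestellte eine Brezel und ein Bier.", "de"),  # "de" is assumed
#   ])
```

Creating the `Olaph` instance once and reusing it for every sentence avoids repeating whatever setup the constructor performs.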

---

## Dependencies

- [spaCy](https://spacy.io)
- [Lingua](https://github.com/pemistahl/lingua-py)
- [num2words](https://github.com/savoirfairelinux/num2words)
- [inflect](https://github.com/jaraco/inflect)

---

## Research Summary

Phonemization, the conversion of text into phonemes, is a key step in text-to-speech. Traditional approaches use rule-based transformations and lexicon lookups, while more advanced methods apply preprocessing techniques or neural networks for improved accuracy on out-of-domain vocabulary. However, all systems struggle with names, loanwords, abbreviations, and homographs. This work presents OLaPh (Optimal Language Phonemizer), a framework that combines large lexica, multiple NLP techniques, and compound resolution with a probabilistic scoring function. Evaluations in German and English show improved accuracy over previous approaches, including on a challenging dataset. To further address unresolved cases, we train a large language model on OLaPh-generated data, which achieves even stronger generalization and performance. Together, the framework and LLM improve phonemization consistency and provide a freely available resource for future research.

---

## Citation

If you use OLaPh in academic work, please cite:

```bibtex
@misc{wirth2025olaphoptimallanguagephonemizer,
      title={OLaPh: Optimal Language Phonemizer},
      author={Johannes Wirth},
      year={2025},
      eprint={2509.20086},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.20086},
}
```

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "olaph",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": null,
    "keywords": "phonemizer, text-to-speech, linguistics, NLP, multilingual",
    "author": null,
    "author_email": "Johannes Wirth <johannes.wirth.3@iisys.de>",
    "download_url": "https://files.pythonhosted.org/packages/66/28/e20ec54377811cf98305d99f0b565c07c4bcb6b19454b052ed6e8202b3aa/olaph-0.1.9.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "A multilingual phonemizer combining lexica, NLP, and probabilistic scoring for improved phonemization accuracy..",
    "version": "0.1.9",
    "project_urls": {
        "Documentation": "https://github.com/iisys-hof/olaph#readme",
        "Homepage": "https://github.com/iisys-hof/olaph",
        "Issues": "https://github.com/iisys-hof/olaph/issues"
    },
    "split_keywords": [
        "phonemizer",
        " text-to-speech",
        " linguistics",
        " nlp",
        " multilingual"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "4da9a4190de879e11f6882a7838201acff14ae55df9c16fbfe644ae7f395a157",
                "md5": "0a6c582c95557c9385b7e1ae941bfc06",
                "sha256": "2ec0a9f4ed7c235c781f8361fb7db4cc2ed5e2b2760f721f31c492387875c7ac"
            },
            "downloads": -1,
            "filename": "olaph-0.1.9-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "0a6c582c95557c9385b7e1ae941bfc06",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10",
            "size": 29742486,
            "upload_time": "2025-10-30T10:45:49",
            "upload_time_iso_8601": "2025-10-30T10:45:49.723749Z",
            "url": "https://files.pythonhosted.org/packages/4d/a9/a4190de879e11f6882a7838201acff14ae55df9c16fbfe644ae7f395a157/olaph-0.1.9-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "6628e20ec54377811cf98305d99f0b565c07c4bcb6b19454b052ed6e8202b3aa",
                "md5": "919e9e37704bcd487ddb8f2f579221f8",
                "sha256": "852ed6e370d98cad26eec1f175bcaaa4a57cbfdbf72de68c9d8f7c7e547ccdac"
            },
            "downloads": -1,
            "filename": "olaph-0.1.9.tar.gz",
            "has_sig": false,
            "md5_digest": "919e9e37704bcd487ddb8f2f579221f8",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 29499702,
            "upload_time": "2025-10-30T10:45:56",
            "upload_time_iso_8601": "2025-10-30T10:45:56.985863Z",
            "url": "https://files.pythonhosted.org/packages/66/28/e20ec54377811cf98305d99f0b565c07c4bcb6b19454b052ed6e8202b3aa/olaph-0.1.9.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-10-30 10:45:56",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "iisys-hof",
    "github_project": "olaph#readme",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "olaph"
}
