# pl_itn
Inverse Text Normalization (ITN) is the NLP task of converting the spoken form of a phrase into its written form, for example:
```
one two three -> 1 2 3
```
[![pdm-managed](https://img.shields.io/badge/pdm-managed-blueviolet)](https://pdm.fming.dev)
`pl_itn` is an open-source Polish ITN Python library and REST API for practical applications.
This project is an implementation of [NeMo Inverse Text Normalization](https://arxiv.org/abs/2104.05055) for Polish.
## Table of contents
[Prerequisites](#prerequisites)\
[Setup](#setup)\
[Docker](#docker)\
[Usage](#usage)\
[gRPC service](#grpc-service)\
[Building custom grammars](#building-custom-grammars)\
[Documentation](#documentation)\
[Contributing](#contributing)\
[License](#license)\
[References](#references)
## Prerequisites
Required to build [pynini](https://pypi.org/project/pynini/):
- A standards-compliant C++17 compiler (GCC >= 7 or Clang >= 700)
- A compatible, recent version of OpenFst built with the `grm` extensions (see `deps/install_openfst.md`)
## Setup
Install the prerequisites first, especially OpenFst.
### Install from PyPI
```bash
pip install pl_itn
```
### Build from source
```bash
pip install .
```
### Editable install for development
```bash
pip install -e ".[dev]"
```
### Docker
To build a Docker image containing the pl_itn library, use the `pl_itn_lib.dockerfile` file.\
To build a Docker image with the gRPC service, use the `grpc_service.dockerfile` file.
```bash
docker build -t <IMAGE:TAG> -f <DOCKERFILE> .
```
## Usage
### Console app
```bash
usage: pl_itn [-h] (-t TEXT | -i) [--tagger TAGGER] [--verbalizer VERBALIZER] [--config CONFIG]
              [--log-level {debug,info}]

Inverse Text Normalization based on Finite State Transducers

options:
  -h, --help            show this help message and exit
  -t TEXT, --text TEXT  Input text
  -i, --interactive     If used, demo will process phrases from stdin interactively.
  --tagger TAGGER
  --verbalizer VERBALIZER
  --config CONFIG       Optionally provide yaml config with tagger and verbalizer paths.
  --log-level {debug,info}
                        return a step back value.
```
```bash
$ pl_itn -t "jest za pięć druga"
jest 01:55

$ pl_itn -t "drugi listopada dwa tysiące osiemnastego roku"
2 listopada 2018 roku
```
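The console app can also be driven from a script. The following is a minimal sketch, not part of pl_itn: the helper names `build_command` and `run_pl_itn` are hypothetical, only the `-t`, `--tagger` and `--verbalizer` flags from the help text above are used, and it assumes the normalized text is what the command prints to stdout.

```python
import subprocess
from typing import List, Optional


def build_command(text: str,
                  tagger: Optional[str] = None,
                  verbalizer: Optional[str] = None) -> List[str]:
    """Build the argv list for one `pl_itn -t ...` call (hypothetical helper)."""
    cmd = ["pl_itn", "-t", text]
    if tagger is not None:
        cmd += ["--tagger", tagger]
    if verbalizer is not None:
        cmd += ["--verbalizer", verbalizer]
    return cmd


def run_pl_itn(text: str,
               tagger: Optional[str] = None,
               verbalizer: Optional[str] = None) -> str:
    """Run the console app and return its stdout without the trailing newline.

    Assumes `pl_itn` is on PATH and writes the normalized text to stdout.
    Raises subprocess.CalledProcessError on a non-zero exit code.
    """
    result = subprocess.run(
        build_command(text, tagger, verbalizer),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

Passing the arguments as a list avoids shell quoting problems with Polish diacritics and spaces in the input phrase.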
### Python
```python
>>> from pl_itn import Normalizer
>>> normalizer = Normalizer()
>>> normalizer.normalize("za pięć dwunasta")
'11:55'
```
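To normalize several phrases, the same `normalize` method can be applied in a loop. A minimal sketch, assuming only the `Normalizer().normalize(str) -> str` call shown above; `normalize_many` is a hypothetical helper, not a library function, and it accepts any callable so it can be tried without the library installed.

```python
from typing import Callable, Iterable, List


def normalize_many(normalize: Callable[[str], str],
                   phrases: Iterable[str]) -> List[str]:
    """Apply `normalize` to every non-blank phrase, preserving input order."""
    results = []
    for phrase in phrases:
        stripped = phrase.strip()
        if stripped:  # skip empty or whitespace-only lines
            results.append(normalize(stripped))
    return results


# With pl_itn installed, the helper would be wired up like this (not run here):
#   from pl_itn import Normalizer
#   normalizer = Normalizer()
#   normalize_many(normalizer.normalize, ["za pięć dwunasta", "jest za pięć druga"])
```

Creating the `Normalizer` once and reusing it for all phrases avoids repeating whatever setup the constructor performs.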
### Docker
An existing Docker image containing the pl_itn library is required. For the build command, refer to the [Docker](#docker) section.
```bash
docker run --rm -it <IMAGE:TAG> --help
```
## gRPC Service
The gRPC service methods are described in the `grpc_service/pl_itn_api/api.proto` file. Running the service in a Docker container is the suggested approach; for the build command, refer to the [Docker](#docker) section.
Inside the container, the service listens on port 10010.
Example of building the image and starting the service:
```bash
docker build -t pl_itn_service:test -f grpc_service.dockerfile .
docker run -p 10010:10010 pl_itn_service:test
```
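After starting the container, it can be useful to confirm that the published port accepts connections before sending requests. The sketch below uses only the Python standard library and checks plain TCP reachability; it does not speak gRPC and says nothing about the methods in `api.proto`. The helper name `wait_for_port` and its defaults are assumptions made for illustration.

```python
import socket
import time


def wait_for_port(host: str, port: int,
                  timeout: float = 10.0, interval: float = 0.2) -> bool:
    """Return True once a TCP connection to host:port succeeds.

    Retries every `interval` seconds and returns False if no connection
    could be made before `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

With the `docker run -p 10010:10010` command above, the call would be `wait_for_port("localhost", 10010)`.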
## Building custom grammars
Custom grammars can be built with the `build_grammar/build_grammar.py` script.
Three demo grammars are available:
- undeclined (base-form) cardinal numbers (e.g. "jeden", "dwa", "trzy")
- declined cardinal numbers (e.g. "jednego", "dwóch", "trzech")
- ordinal numbers (e.g. "pierwszy", "druga", "trzecie")
Normalization types can be included in or excluded from the grammar through the config file, which defaults to `build_grammar/grammar_config.yaml`. The commented lines at the top of each example show the config values used for that build.
```bash
# cardinals_basic_forms: True
# cardinals_declined: True
# ordinals: True
$ python3 build_grammar/build_grammar.py --grammars-dir all
$ pl_itn \
--tagger all/tagger.fst \
--verbalizer all/verbalizer.fst \
-t "Jeden trzech piąta"
1 3 5
```
```bash
# cardinals_basic_forms: True
# cardinals_declined: False
# ordinals: True
$ python3 build_grammar/build_grammar.py --grammars-dir cardinals_basic_ordinals
$ pl_itn \
--tagger cardinals_basic_ordinals/tagger.fst \
--verbalizer cardinals_basic_ordinals/verbalizer.fst \
-t "Jeden trzech piąta"
1 trzech 5
```
```bash
# cardinals_basic_forms: True
# cardinals_declined: False
# ordinals: False
$ python3 build_grammar/build_grammar.py --grammars-dir only_basic_cardinals
$ pl_itn \
--tagger only_basic_cardinals/tagger.fst \
--verbalizer only_basic_cardinals/verbalizer.fst \
-t "Jeden trzech piąta"
1 trzech piąta
```
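The three configurations used above can also be generated programmatically. This is a minimal sketch, assuming the config file is a flat YAML mapping containing exactly the three boolean keys shown in the comments above; the actual file may hold more settings. The helper name `render_grammar_config` is hypothetical and not part of pl_itn.

```python
def render_grammar_config(cardinals_basic_forms: bool = True,
                          cardinals_declined: bool = True,
                          ordinals: bool = True) -> str:
    """Return YAML text with one `key: True/False` line per grammar type."""
    options = {
        "cardinals_basic_forms": cardinals_basic_forms,
        "cardinals_declined": cardinals_declined,
        "ordinals": ordinals,
    }
    # str(True) is "True", matching the spelling used in the examples above.
    return "".join(f"{key}: {value}\n" for key, value in options.items())


# Config for the last example: base-form cardinals only.
print(render_grammar_config(cardinals_declined=False, ordinals=False), end="")
```

Under the same assumption, the returned text would be written to `build_grammar/grammar_config.yaml` before running the build script.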
See [Documentation](#documentation) for more details.
## Documentation
## Contributing
## License
## References
- K. Gorman. 2016. Pynini: A Python library for weighted finite-state grammar compilation. In Proc. ACL Workshop on Statistical NLP and Weighted Automata, 75-80.
- Y. Zhang, E. Bakhturina, K. Gorman, and B. Ginsburg. 2021. NeMo Inverse Text Normalization: From Development To Production. arXiv:2104.05055.
Raw data
{
"_id": null,
"home_page": "",
"name": "pl-itn",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.7",
"maintainer_email": "",
"keywords": "fst itn text normalization polish",
"author": "mstopa, cansubmarinesswim",
"author_email": "",
"download_url": "https://files.pythonhosted.org/packages/09/22/e5165b2f4545c3dacc9e229d21a0d8d7b39915abbfc94df9a3897bdd37e2/pl_itn-0.1.0rc1.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "Apache-2.0",
"summary": "Polish FST Inverse Text Normalization",
"version": "0.1.0rc1",
"project_urls": {
"Documentation": "https://pl_itn.readthedocs.io/en/latest/",
"Repository": "https://github.com/mstopa/pl_itn"
},
"split_keywords": [
"fst",
"itn",
"text",
"normalization",
"polish"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "24d1ea94d3d52f57b9060a0ad2d7ee737055a2494e903103de81d35823c6612d",
"md5": "8fe5366e3aa22f9b1319fc2bca820f74",
"sha256": "c029b517e9e9810abbc7847d0388e17d0577c81a26ef3b2f9147b4d0cf590c0d"
},
"downloads": -1,
"filename": "pl_itn-0.1.0rc1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "8fe5366e3aa22f9b1319fc2bca820f74",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.7",
"size": 670008,
"upload_time": "2023-05-15T08:26:50",
"upload_time_iso_8601": "2023-05-15T08:26:50.596516Z",
"url": "https://files.pythonhosted.org/packages/24/d1/ea94d3d52f57b9060a0ad2d7ee737055a2494e903103de81d35823c6612d/pl_itn-0.1.0rc1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "0922e5165b2f4545c3dacc9e229d21a0d8d7b39915abbfc94df9a3897bdd37e2",
"md5": "4795aba0dc43ba4fb0418beea86b4cea",
"sha256": "6afd5609ea0e06bf95cd35a53a356dac38cfb4899963dc3315b3860ddb16da92"
},
"downloads": -1,
"filename": "pl_itn-0.1.0rc1.tar.gz",
"has_sig": false,
"md5_digest": "4795aba0dc43ba4fb0418beea86b4cea",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.7",
"size": 647241,
"upload_time": "2023-05-15T08:26:55",
"upload_time_iso_8601": "2023-05-15T08:26:55.129981Z",
"url": "https://files.pythonhosted.org/packages/09/22/e5165b2f4545c3dacc9e229d21a0d8d7b39915abbfc94df9a3897bdd37e2/pl_itn-0.1.0rc1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-05-15 08:26:55",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "mstopa",
"github_project": "pl_itn",
"github_not_found": true,
"lcname": "pl-itn"
}