turkic-translit


Name: turkic-translit
Version: 0.3.0
Summary: Deterministic Latin and IPA transliteration for Kazakh and Kyrgyz, plus tokenizer/glue scripts.
Upload time: 2025-07-17 13:24:18
Requires Python: >=3.9
Keywords: kazakh, kyrgyz, transliteration, ipa
---
title: Turkic Transliteration Demo
emoji: 🌖
colorFrom: green
colorTo: green
sdk: gradio
sdk_version: 5.29.0
app_file: app.py
pinned: false
license: apache-2.0
short_description: Transliteration of Kazakh & Kyrgyz into Latin and IPA
---

turkic\_transliterate
Deterministic Latin and IPA transliteration for Kazakh and Kyrgyz, plus helper utilities for tokenizer training and Russian-token filtering.

Quick install

1. Install Miniconda or Anaconda (recommended).
2. Clone the repo and create the environment:
   conda env create -f env.yml
3. Activate the environment:
   conda activate turkic
4. Run the verification tests:
   python -m pytest      (all tests should pass)

Python compatibility
• Works on CPython 3.10 and 3.11.
• CPython 3.12+ is supported everywhere except on Windows until official PyICU wheels are available; see “Windows & PyICU” below.

Package names
• Runtime import path:  `turkic_translit`
• Distributable name on PyPI:  `turkic-translit`
• Command-line entry point:  `turkic-translit`

## Developer Setup

For the simplest developer setup experience, run the setup script:

```bash
python scripts/setup_dev.py
```

This script will:
1. Install the package with all development dependencies
2. Set up PyICU on Windows automatically
3. Verify that development tools are working properly

### Manual Installation

Alternatively, install with pip:

```bash
pip install -e ".[dev,ui]"      # quoted so zsh does not expand the brackets; add ,winlid on Windows if you need fasttext-wheel
```

### Development Tools

#### Linux/macOS/Windows with GNU Make

If you have GNU Make installed, you can use the Makefile for common tasks:

```bash
make lint       # Run linting (ruff, black, mypy)
make format     # Auto-format code
make test       # Run tests
make web        # Launch the web UI
make help       # Show all available commands
```

#### Windows

**Option 1: Install GNU Make using Chocolatey (Recommended)**

Run the following from an elevated shell (admin privileges are required):

```powershell
# In an Admin PowerShell window
choco install make
```

After installation, you can use the same `make` commands as on Linux/macOS.

**Option 2: Use the PowerShell Script Alternative**

If you prefer not to install Chocolatey or GNU Make, use the PowerShell script:

```powershell
./scripts/run.ps1 lint       # Run linting
./scripts/run.ps1 format     # Auto-format code
./scripts/run.ps1 test       # Run tests
./scripts/run.ps1 web        # Launch the web UI
./scripts/run.ps1 help       # Show all available commands
```

Optional extras
dev   → black, ruff, pytest
ui    → gradio web demo
winlid (Windows only) → fasttext-wheel for language ID

Windows & PyICU

**Important:** Because PyPI does not allow a package to declare dependencies on direct URLs, the correct PyICU wheel for Windows cannot be installed automatically during `pip install`. After installing this package with pip, Windows users must run the helper script to install the appropriate PyICU wheel:

    turkic-pyicu-install

This script will download and install the correct PyICU wheel from Christoph Gohlke’s repository based on your Python version. See the script for details.
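
To check whether the helper is needed at all, a quick import probe is enough. The sketch below is not part of the package; `pyicu_status` is a hypothetical helper name, and the only fact it relies on is that PyICU installs a top-level module called `icu`.

```python
import importlib.util
import sys


def pyicu_status() -> str:
    """Return a short note on whether PyICU can be imported (hypothetical helper)."""
    # PyICU exposes its bindings as the top-level module `icu`.
    if importlib.util.find_spec("icu") is not None:
        return "PyICU is available."
    if sys.platform == "win32":
        # On Windows the README's helper script installs the wheel.
        return "PyICU is missing; run `turkic-pyicu-install`."
    return "PyICU is missing; install it with your usual package manager."


if __name__ == "__main__":
    print(pyicu_status())
```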

Command-line usage

    turkic-translit --lang kk --in text.txt --out_latin kk_lat.txt --ipa --out_ipa kk_ipa.txt --arabic --log-level debug

• --lang            kk or ky
• --ipa             emit IPA alongside Latin
• --arabic          also transliterate embedded Arabic script
• --benchmark       print throughput statistics
• --log-level       debug | info | warning | error | critical (default: info)
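
When the CLI is driven from another Python script, it can help to assemble the argument list in one place. The following is a minimal sketch that uses only the flags shown above; `build_translit_command` and `run_translit` are hypothetical names, not functions shipped with the package.

```python
import subprocess
from typing import List, Optional


def build_translit_command(
    lang: str,
    in_path: str,
    out_latin: str,
    out_ipa: Optional[str] = None,
    arabic: bool = False,
    log_level: str = "info",
) -> List[str]:
    """Assemble an argv list for the turkic-translit CLI (hypothetical helper)."""
    if lang not in ("kk", "ky"):
        raise ValueError("lang must be 'kk' or 'ky'")
    cmd = ["turkic-translit", "--lang", lang, "--in", in_path, "--out_latin", out_latin]
    if out_ipa is not None:
        # IPA output is requested together with its destination file.
        cmd += ["--ipa", "--out_ipa", out_ipa]
    if arabic:
        cmd.append("--arabic")
    cmd += ["--log-level", log_level]
    return cmd


def run_translit(**kwargs) -> None:
    """Run the CLI and raise CalledProcessError on a non-zero exit status."""
    subprocess.run(build_translit_command(**kwargs), check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.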

Logging
Logging is configured centrally and uses Rich for coloured output when it is installed; otherwise it falls back to the standard library's logging.
Set the `TURKIC_LOG_LEVEL` environment variable or pass `--log-level` to the CLI to change the verbosity.
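
The Rich-with-fallback pattern can be sketched as follows. This is an illustration rather than the package's actual setup code: the function name, the logger name, the message format, and the precedence (explicit argument, then `TURKIC_LOG_LEVEL`, then `info`) are all assumptions.

```python
import logging
import os
from typing import Optional


def setup_logging(level: Optional[str] = None) -> logging.Logger:
    """Configure a demo logger, preferring Rich when it is installed."""
    # Assumed precedence: explicit argument, then environment variable, then "info".
    name = (level or os.environ.get("TURKIC_LOG_LEVEL") or "info").upper()
    try:
        from rich.logging import RichHandler  # optional dependency

        handler: logging.Handler = RichHandler()
    except ImportError:
        # Plain standard-library handler when Rich is absent.
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))
    logger = logging.getLogger("turkic_translit_demo")  # hypothetical logger name
    logger.handlers.clear()
    logger.addHandler(handler)
    logger.setLevel(name)  # raises ValueError for an unknown level name
    logger.propagate = False
    return logger
```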

## Project Organization

The project is organized into the following directories:

- `src/turkic_translit/` - Core source code for the package
- `examples/` - Example scripts showing how to use the package
  - `examples/web/` - Web interface for demonstrating transliteration features
- `data/` - Sample data files and language resources
- `docs/` - Documentation and reference materials
- `scripts/` - Utility scripts for development and release
  - `scripts/release/` - Scripts for building and publishing packages
- `vendor/pyicu/` - Pre-built PyICU wheels for Windows
- `tests/` - Test suite for the package

## FastText Language Identification Model

This package uses the [FastText language identification model](https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin) (`lid.176.bin`) for Russian token filtering and language detection. **The model file is not included in the repository or pip package due to its large size.**

**Automatic Download:**
- When you use features that require language identification (such as Russian token filtering or the Gradio web demo), the package will automatically download `lid.176.bin` from the official Facebook AI public link if it is not already present.
- The file will be saved in the package directory on first use.

**No manual action is needed.** This ensures compatibility with pip installs, Hugging Face Spaces, and other cloud environments.

If you need to download the model manually, you can do so from:
https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
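
For readers who want to reproduce the download-on-first-use behaviour in their own scripts, here is a minimal sketch. It is not the package's implementation; `ensure_model`, the `.part` temporary file, and the injectable `fetch` parameter are assumptions made for illustration.

```python
import urllib.request
from pathlib import Path
from typing import Callable

MODEL_URL = "https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin"


def ensure_model(
    target: Path,
    url: str = MODEL_URL,
    fetch: Callable[[str, str], object] = urllib.request.urlretrieve,
) -> Path:
    """Download the model to `target` unless a non-empty file is already there."""
    if target.exists() and target.stat().st_size > 0:
        return target
    target.parent.mkdir(parents=True, exist_ok=True)
    # Download to a temporary name first so an interrupted transfer
    # never leaves a truncated file at the final path.
    tmp = target.with_suffix(target.suffix + ".part")
    fetch(url, str(tmp))
    tmp.replace(target)
    return target
```

Calling `ensure_model(Path("models/lid.176.bin"))` fetches the file once; later calls return immediately.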


## Using the Examples

Use the main entry point script to run examples:

```bash
python turkic_tools.py [command]
```

Available commands:
- `web` - Launch the Gradio web interface for real-time transliteration
- `demo` - Run the simple CLI demo
- `full-demo` - Run the comprehensive demo with multiple languages
- `help` - Display available commands

Tokenizer training example

    turkic-build-spm --input corpora/kk_lat.txt,corpora/ky_lat.txt --model_prefix spm/turkic12k --vocab_size 12000


Filtering Russian tokens from Uzbek

    cat uz_raw.txt | turkic-filter-russian --mode drop > uz_clean.txt

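
The idea behind a "drop" mode can be sketched in a few lines. This is only an illustration of token-level filtering, not the logic of `turkic-filter-russian` itself: `drop_russian_tokens` is a hypothetical name, whitespace tokenization is an assumption, and the language check is passed in as a predicate so the example runs without the FastText model.

```python
from typing import Callable, Iterable, Iterator


def drop_russian_tokens(
    lines: Iterable[str], is_russian: Callable[[str], bool]
) -> Iterator[str]:
    """Yield each line with the tokens flagged by `is_russian` removed."""
    for line in lines:
        kept = [tok for tok in line.split() if not is_russian(tok)]
        yield " ".join(kept)


# With the fasttext package and lid.176.bin available, a predicate could
# look like this (assumption: per-token prediction, top label only):
#
#   import fasttext
#   model = fasttext.load_model("lid.176.bin")
#   is_russian = lambda tok: model.predict(tok)[0][0] == "__label__ru"
```

Per-token language identification is noisy for short words, so a real filter would likely also look at the prediction confidence.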

Developer checklist
black .
ruff check .
pytest -q

All code is UTF-8-only; on Windows a BOM is written when piping to files to avoid encoding issues.
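
In Python, the BOM behaviour corresponds to the `utf-8-sig` codec. The sketch below shows one way to write and read files under that convention; the helper names are hypothetical and the platform check is an assumption about when the BOM is wanted.

```python
import sys
from pathlib import Path


def write_text_portable(path: Path, text: str) -> None:
    """Write UTF-8 text, prefixing a BOM on Windows only."""
    encoding = "utf-8-sig" if sys.platform == "win32" else "utf-8"
    path.write_text(text, encoding=encoding)


def read_text_portable(path: Path) -> str:
    """Read UTF-8 text; `utf-8-sig` strips a leading BOM if one is present."""
    return path.read_text(encoding="utf-8-sig")
```

Reading with `utf-8-sig` is safe for files with or without a BOM, so the reader does not need to know which platform produced the file.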

License
Apache-2.0

### Type-checking

```bash
pip install mypy
mypy --strict .
```

The included mypy.ini restricts analysis to the src/ tree and skips
build/, dist/, virtual-env and egg directories so duplicate-module
errors do not occur even if you build wheels locally.

            

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "turkic-translit",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": null,
    "keywords": "kazakh, kyrgyz, transliteration, ipa",
    "author": null,
    "author_email": "Austin Wagner <austinwagner@msn.com>",
    "download_url": "https://files.pythonhosted.org/packages/f8/6c/76f64ab5270ba7fbd628ed4df0835d3667bbcacf794427d9081efe5394c7/turkic_translit-0.3.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "Deterministic Latin and IPA transliteration for Kazakh, Kyrgyz, plus tokenizer/glue scripts.",
    "version": "0.3.0",
    "project_urls": null,
    "split_keywords": [
        "kazakh",
        " kyrgyz",
        " transliteration",
        " ipa"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "aeeb5604cbc41b916af1fda7ba6ed9a0f8bc823705fc860082f64295d32b82ee",
                "md5": "4ea2e6b564d57768705631ec2b5cd4a9",
                "sha256": "29d2215c280d5235532e4ec27b427a444999a7524ec013d47977308901a1e613"
            },
            "downloads": -1,
            "filename": "turkic_translit-0.3.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "4ea2e6b564d57768705631ec2b5cd4a9",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9",
            "size": 74349,
            "upload_time": "2025-07-17T13:24:17",
            "upload_time_iso_8601": "2025-07-17T13:24:17.850995Z",
            "url": "https://files.pythonhosted.org/packages/ae/eb/5604cbc41b916af1fda7ba6ed9a0f8bc823705fc860082f64295d32b82ee/turkic_translit-0.3.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "f86c76f64ab5270ba7fbd628ed4df0835d3667bbcacf794427d9081efe5394c7",
                "md5": "51fcced145a8d3247de01370c68c6659",
                "sha256": "388547ec69338ab21502c80a27cd9aeaea2f2c94195665a95e1eeac58fed098a"
            },
            "downloads": -1,
            "filename": "turkic_translit-0.3.0.tar.gz",
            "has_sig": false,
            "md5_digest": "51fcced145a8d3247de01370c68c6659",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 83289,
            "upload_time": "2025-07-17T13:24:18",
            "upload_time_iso_8601": "2025-07-17T13:24:18.822754Z",
            "url": "https://files.pythonhosted.org/packages/f8/6c/76f64ab5270ba7fbd628ed4df0835d3667bbcacf794427d9081efe5394c7/turkic_translit-0.3.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-07-17 13:24:18",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "turkic-translit"
}
        