uroman


Name: uroman
Version: 1.3.1.1
home_page: None
Summary: uroman is a universal romanizer. It converts text in any script to the standard Latin alphabet.
upload_time: 2024-06-28 06:03:34
maintainer: None
docs_url: None
author: None
requires_python: >=3.10
license:
  Copyright (C) 2015-2020 Ulf Hermjakob, USC Information Sciences Institute

  Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

  The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

  Any publication of projects using uroman shall acknowledge its use: "This project uses the universal romanizer software 'uroman' written by Ulf Hermjakob, USC Information Sciences Institute (2015-2020)".

  Bibliography: Ulf Hermjakob, Jonathan May, and Kevin Knight. 2018. Out-of-the-box universal romanization tool uroman. In Proceedings of the 56th Annual Meeting of Association for Computational Linguistics, Demo Track.

  THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
keywords: nlp, computational linguistics, machine translation, natural language processing, romanization, string similarity
requirements: No requirements were recorded.
Travis-CI: No Travis.
coveralls test coverage: No coveralls.
# uroman

*uroman* is a *universal romanizer*. It converts text in any script to the standard Latin alphabet.<br>
&nbsp;&nbsp;&nbsp;&nbsp;Example (Greek): Νεπάλ → Nepal<br>
&nbsp;&nbsp;&nbsp;&nbsp;Example (Hindi):&nbsp; नेपाल → nepaal<br>
&nbsp;&nbsp;&nbsp;&nbsp;Example (Urdu):&nbsp; نیپال → nypal<br>
&nbsp;&nbsp;&nbsp;&nbsp;Example (Chinese): 三万一 → 31000

* *uroman* enables the application of string-similarity metrics to texts from different scripts without the need for, and complexity of, an intermediate phonetic representation.
* *uroman* converts numbers written in various scripts to Western Arabic numerals.
* *uroman* uses m-to-n character mappings, context, and a user-provided language code (optional), i.e. *uroman* does not just replace characters one by one.
* *uroman* expects all input to be encoded in UTF-8.
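
To make the string-similarity point concrete, here is a minimal sketch that uses only Python's standard `difflib` module. It does not call *uroman*; the romanized strings are copied from the examples above, and the helper name `similarity` is an assumption made for illustration. The original Greek and Hindi spellings share no characters, so a character-based metric finds nothing to match, whereas the romanized forms overlap substantially.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive character-level similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Original spellings of "Nepal" and the romanizations listed in the examples above.
greek, hindi = 'Νεπάλ', 'नेपाल'
greek_rom, hindi_rom, urdu_rom = 'Nepal', 'nepaal', 'nypal'

print(similarity(greek, hindi))          # different scripts, no shared characters
print(similarity(greek_rom, hindi_rom))  # romanized Greek vs. romanized Hindi
print(similarity(greek_rom, urdu_rom))   # romanized Greek vs. romanized Urdu
```

Any other string metric (edit distance, n-gram overlap) could be substituted for `SequenceMatcher`; the choice here is only for convenience.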

New Python version: 1.3.1 (released on June 27, 2024)<br>
Last Perl version: 1.2.8 (released on April 23, 2021)<br>
Author: Ulf Hermjakob, USC Information Sciences Institute  

## (New) Python version

#### Installation
```bash
python3 -m pip install uroman
```

### Command Line Interface (CLI)
#### Examples

```bash
python3 -m uroman "Игорь Стравинский"
python3 -m uroman Игорь -l ukr
python3 -m uroman Ντέιβις Καπ -l ell
python3 -m uroman "\u03C0\u03B9" -d
python3 -m uroman -l hin -i mini-test/hin.txt
python3 -m uroman -l fas -i mini-test/fas.txt -o mini-test/fas-rom.jsonl -f edges
python3 -m uroman < mini-test/multi-script.txt > mini-test/multi-script.uroman.txt
python3 -m uroman -h
```

<b>Note:</b> Using the _uroman_ CLI for single strings can be useful for simple tests, 
but it is inefficient at scale because data resources are loaded every time. It is more efficient to romanize entire files or to use _uroman_ inside Python as shown further below.<br>
<b>Note:</b> The _mini-test_ directory is included in this release. 
Use the command &nbsp; <code>python3 -m uroman x --verbose</code> &nbsp; to find it.
You can compare your output mini-test/multi-script.uroman.txt with the reference output mini-test/multi-script.uroman-ref.txt.
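
The efficiency note above comes down to loading the romanization data once and reusing it. A minimal sketch of that pattern follows; the helper `romanize_all` is hypothetical (it is not part of the package) and only relies on the `romanize_string` method documented further below.

```python
from typing import Iterable, List, Optional


def romanize_all(romanizer, texts: Iterable[str],
                 lcode: Optional[str] = None) -> List[str]:
    """Romanize many strings with one already-loaded romanizer object."""
    results = []
    for text in texts:
        if lcode is None:
            results.append(romanizer.romanize_string(text))
        else:
            results.append(romanizer.romanize_string(text, lcode=lcode))
    return results


# Intended use with the package installed (not executed here):
#   import uroman as ur
#   romanizer = ur.Uroman()   # data resources are loaded once, here
#   print(romanize_all(romanizer, ['Игорь', 'Стравинский'], lcode='ukr'))
```

Calling the CLI once per string would instead repeat the data loading for every call.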

#### *uroman.py* &nbsp; Argument Structure Highlights 
<table>
  <tr><td><i>Direct inputs (zero&nbsp;or&nbsp;more)</i></td><td>such as ‘Игорь Стравинский’ and ‘Ντέιβις’ above.</td></tr>
  <tr><td>-l<br>--lcode</td><td>language code according to <a href="https://en.wikipedia.org/wiki/List_of_ISO_639-3_codes" target="_LCODE">ISO-639-3</a>, e.g. <i>-l ukr</i> for Ukrainian, <i>-l hin</i> for Hindi, <i>-l fas</i> for Persian</td></tr>
  <tr><td>-i<br>--input_filename</td><td>alternative:&nbsp;<i>stdin</i><br>Note: If both <i>direct inputs</i> and <i>input_filename</i> are given, the romanization results for <i>direct inputs</i> will be written to <i>stderr</i>.</td></tr>
  <tr><td width="200">-o<br><nobr>--output_filename</nobr></td><td>alternative: <i>stdout</i></td></tr>
  <tr><td>-f<br>--rom_format</td><td>Output format choices:
        <ul>
           <li> -f str &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; (best string, default, output format: string)
           <li> -f edges (best edges, includes offset information, output format: JSONL)
           <li> -f alts &nbsp;&nbsp;&nbsp;&nbsp; (lattice including alternative edges, output format: JSONL)
           <li> -f lattice (lattice including alternative and superseded edges, output format: JSONL)
        </ul></td></tr>
  <tr><td>-d<br>--decode_unicode</td><td>Decode Unicode escape sequences such as ‘\u03C0\u03B9’ to ‘πι’ which in turn will be romanized to ‘pi’. This is useful for input formats such as JSON.</td></tr>
  <tr><td>-h<br>--help</td><td>Use this option to see the full argument structure with all options.</td></tr>
</table>
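
To illustrate what the `-d` option describes, the sketch below decodes `\uXXXX` escape sequences in plain Python. It is an approximation for illustration only, not *uroman*'s own implementation; the function name `decode_unicode_escapes` is an assumption, and surrogate pairs are deliberately not handled.

```python
import re

# Matches a literal backslash, the letter u, and exactly four hex digits.
_ESCAPE = re.compile(r'\\u([0-9A-Fa-f]{4})')


def decode_unicode_escapes(text: str) -> str:
    """Replace each \\uXXXX escape with the character it denotes."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


print(decode_unicode_escapes(r'\u03C0\u03B9'))
```

Text without escape sequences passes through unchanged, so the helper is safe to apply to mixed input.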

### Using _uroman_ inside Python
#### Examples

```python
import uroman as ur

uroman = ur.Uroman()   # load uroman data (takes about a second or so)
print(uroman.romanize_string('Игорь Стравинский'))
print(uroman.romanize_string('Игорь', lcode='ukr'))
uroman.romanize_file(input_filename='mini-test/multi-script.txt',
                     output_filename='mini-test/multi-script.uroman.jsonl',
                     rom_format=ur.RomFormat.LATTICE)
```

#### Methods
__`uroman = ur.Uroman(data_dir)`__

This constructor method loads data needed for the romanization of different languages.
This constructor call might take about a second (real time) to load all of the romanization data, but it is necessary only once for multiple subsequent romanization calls.
<table>
  <tr><td>data_dir</td><td>data directory (optional, default: standard uroman data directory)</td></tr>
</table>

<hr>

__`uroman.romanize_string(s, lcode, rom_format)`__

This method takes a string <i>s</i> and returns its romanization in the format according to <i>rom_format</i>: a string (default), or a list of edges.
<table>
  <tr><td>s</td><td>string to be romanized, e.g. "ایران"</td></tr>
  <tr><td>lcode</td><td>language code, optional, a 3-letter code such as 'eng' for English (ISO-639-3)</td></tr>
  <tr><td>rom_format</td><td>Output format choices:
        <ul>
           <li> ur.RomFormat.STR &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(best string, default, output format: string)
           <li> ur.RomFormat.EDGES &nbsp;(best edges, includes offset information, output format: JSONL)
           <li> ur.RomFormat.ALTS &nbsp;&nbsp;&nbsp;&nbsp;(lattice including alternative edges, output format: JSONL)
           <li> ur.RomFormat.LATTICE (lattice including alternative and superseded edges, output format: JSONL)
        </ul></td></tr>
</table>

<hr>

__`uroman.romanize_file(input_filename, output_filename, lcode)`__

This method romanizes a file <i>input_filename</i> to <i>output_filename</i>.
<table>
  <tr><td>input_filename</td><td>default: stdin&nbsp;(for input_filename value of <i>None</i>)</td></tr>
  <tr><td width="200">output_filename</td><td>default: stdout&nbsp;(for output_filename value of <i>None</i>)</td></tr>
  <tr><td>lcode</td><td>language code (optional), a 3-letter code such as 'eng' for English (ISO-639-3)</td></tr>
</table>
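
When a JSONL output format is chosen, the output file holds one JSON value per line. The sketch below reads such a file back with the standard `json` module. The helper name `read_jsonl` is hypothetical, and because the record structure is not documented in this README, the code only parses each line and makes no assumption about field names.

```python
import json
from typing import Any, Iterator


def read_jsonl(path: str) -> Iterator[Any]:
    """Yield one parsed JSON value per non-empty line of a UTF-8 file."""
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


# Intended use on a file written by romanize_file (not executed here):
#   for record in read_jsonl('mini-test/fas-rom.jsonl'):
#       print(record)
```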

## Old Perl Version
<sup>Old Perl Version included on GitHub, but not included on PyPI.</sup>

### Usage
```bash
$ uroman.pl [-l <lang-code>] [--chart] [--no-cache] < STDIN
       where the optional <lang-code> is a 3-letter language code, e.g. ara, bel, bul, deu, ell, eng, fas,
            grc, heb, kaz, kir, lav, lit, mkd, mkd2, oss, pnt, pus, rus, srp, srp2, tur, uig, ukr, yid.
       --chart specifies chart output (in JSON format) to represent alternative romanizations.
       --no-cache disables caching.
```
### Examples
<sup>Note: Directories _text_ and _test_ are under _uroman_'s root directory on GitHub.</sup>
```bash
uroman.pl < text/zho.txt
uroman.pl -l tur < text/tur.txt
uroman.pl -l heb --chart < text/heb.txt
uroman.pl < test/multi-script.txt > test/multi-script.uroman-perl.txt
```

Identifying the input as Arabic, Belarusian, Bulgarian, English, German,
Ancient Greek, Modern Greek, Pontic Greek, Hebrew, Kazakh, Kyrgyz, Latvian,
Lithuanian, Macedonian, Ossetian, Persian, Russian, Serbian, Turkish, 
Ukrainian, Uyghur or Yiddish 
will improve romanization for those languages as some letters in those 
languages have different sound values from other languages using the same script 
(Arabic vs. Persian, Russian vs. Ukrainian, Hebrew vs. Yiddish).
Specifying a language code has no effect for other languages in this version.

### Bibliography
Ulf Hermjakob, Jonathan May, and Kevin Knight. 2018. Out-of-the-box universal romanization tool uroman. In Proceedings of the 56th Annual Meeting of Association for Computational Linguistics, Demo Track. ACL-2018 Best Demo Paper Award. [Paper in ACL Anthology](https://www.aclweb.org/anthology/P18-4003) | [Poster](https://www.isi.edu/~ulf/papers/poster-uroman-acl2018.pdf) | [BibTex](https://www.aclweb.org/anthology/P18-4003.bib)

### Change History

Changes in version 1.3.0
 * Added Python version.
 * Initial dedicated support for Coptic (Egypt); significantly improved support for Thai; improved support for Khmer, Tibetan and several Indian languages incl. better final schwa deletion.
 * Chinese fractions and percentages.
 * Various small improvements.

Changes in version 1.2.8
 * Updated to Unicode 13.0 (2021), which supports several new scripts (10% larger UnicodeData.txt).
 * Improved support for Georgian.
 * Preserve various symbols (as opposed to mapping to the symbols' names).
 * Various small improvements.

Changes in version 1.2.7
 * Improved support for Pashto.

Changes in version 1.2.6
 * Improved support for Ukrainian, Russian and Ogham (ancient Irish script).
 * Added support for English Braille.
 * Added alternative Romanization for Macedonian and Serbian (mkd2/srp2)
   reflecting a casual style that many native speakers of those languages use
   when writing text in Latin script, e.g. non-accented single letters (e.g. "s")
   rather than phonetically motivated combinations of letters (e.g. "sh").
 * When a line starts with "::lcode xyz ", the new uroman version will switch to
   that language for that line. This is used for the new reference test file.
 * Various small improvements.
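
The per-line `::lcode xyz ` convention from version 1.2.6 can be sketched as a small parser. This is an illustration under stated assumptions, not *uroman*'s code: the function name `split_lcode_prefix` and the regular expression are hypothetical, and the sketch assumes the prefix is followed by exactly one space.

```python
import re
from typing import Optional, Tuple

# "::lcode", one space, a language code without whitespace, one space, the rest.
_LCODE_PREFIX = re.compile(r'^::lcode (\S+) (.*)$', re.DOTALL)


def split_lcode_prefix(line: str,
                       default_lcode: Optional[str] = None) -> Tuple[Optional[str], str]:
    """Return (language code, remaining text) for one input line."""
    match = _LCODE_PREFIX.match(line)
    if match:
        return match.group(1), match.group(2)
    return default_lcode, line


print(split_lcode_prefix('::lcode ukr Игорь'))
print(split_lcode_prefix('Игорь', default_lcode='rus'))
```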

Changes in version 1.2.5
 * Improved support for Armenian and eight languages using Cyrillic scripts.
   -- For Serbian and Macedonian, which are often written in both Cyrillic
      and Latin scripts, uroman will map both official versions to the same
      romanized text, e.g. both "Ниш" and "Niš" will be mapped to "Nish" (which
      properly reflects the pronunciation of the city's name).
      For both Serbian and Macedonian, casual writers often use a simplified
      Latin form without diacritics, e.g. "s" to represent not only Cyrillic "с"
      and Latin "s", but also "ш" or "š", even if this conflates "s" and "sh" and
      other such pairs. The casual romanization can be simulated by using
      alternative uroman language codes "srp2" and "mkd2", which romanize
      both "Ниш" and "Niš" to "Nis" to reflect the casual Latin spelling.
 * Various small improvements.
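
The casual Latin spelling described above can be illustrated without *uroman*. The sketch below strips diacritics with Python's standard `unicodedata` module, so that "Niš" becomes "Nis". The helper name `strip_diacritics` is hypothetical, and this is not how the `srp2` and `mkd2` codes are implemented; it only shows the effect on Latin-script input.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_diacritics('Niš'))
```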

Changes in version 1.2.4
 * Fixed a bug that generated two empty lines for each empty line in cache mode.

Changes in version 1.2
 * Run-time improvement based on (1) token-based caching and (2) shortcut 
   romanization (identity) of ASCII strings for default 1-best (non-chart) 
   output. Speed-up by a factor of 10 for Bengali and Uyghur on medium and 
   large size texts.
 * Incremental improvements for Farsi, Amharic, Russian, Hebrew and related
   languages.
 * Richer lattice structure (more alternatives) for "Romanization" of English
   to support better matching to romanizations of other languages.
   Changes output only when --chart option is specified. No change in output for
   default 1-best output, which for ASCII characters is always the input string.
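
The two run-time ideas named for version 1.2, token-based caching and the identity shortcut for ASCII strings, can be sketched in a few lines of Python. This is an illustration of the general technique, not the Perl implementation: `make_cached_romanizer` and its `romanize_token` argument are hypothetical names, and tokens are assumed to be whitespace-separated.

```python
import re
from typing import Callable, Dict


def make_cached_romanizer(romanize_token: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap a per-token romanizer with a cache and an ASCII shortcut."""
    cache: Dict[str, str] = {}

    def romanize_line(line: str) -> str:
        out = []
        # The capturing group keeps the whitespace runs, so spacing is preserved.
        for part in re.split(r'(\s+)', line):
            if part == '' or part.isspace() or part.isascii():
                out.append(part)              # identity shortcut
            else:
                if part not in cache:         # romanize each distinct token once
                    cache[part] = romanize_token(part)
                out.append(cache[part])
        return ''.join(out)

    return romanize_line
```

A cache keyed on single tokens ignores context across token boundaries, which is one reason this remains a sketch rather than a drop-in replacement.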

Changes in version 1.1 (major upgrade)
 * Offers chart output (in JSON format) to represent alternative romanizations.
   * Location of first character is defined to be "line: 1, start:0, end:0".
 * Incremental improvements of Hebrew and Greek romanization; Chinese numbers.
 * Improved web-interface (now) at https://uhermjakob.github.io/uroman.html
   * Shows corresponding original and romanization text in red
     when hovering over a text segment.
   * Shows alternative romanizations when hovering over romanized text
     marked by dotted underline.
   * Added right-to-left script detection and improved display for right-to-left
     script text (as determined line by line).
   * On-page support for some scripts that are often not pre-installed on users'
     computers (Burmese, Egyptian, Klingon).

Changes in version 1.0 (major upgrade)
 * Upgraded principal internal data structure from string to lattice.
 * Improvements mostly in vowelization of South and Southeast Asian languages.
 * Vocalic 'r' more consistently treated as vowel (no additional vowel added).
 * Repetition signs (Japanese/Chinese/Thai/Khmer/Lao) are mapped to superscript 2.
 * Japanese Katakana middle dots now mapped to ASCII space.
 * Tibetan intersyllabic mark now mapped to middle dot (U+00B7).
 * Some corrections regarding analysis of Chinese numbers.
 * Many more foreign diacritics and punctuation marks dropped or mapped to ASCII.
 * Zero-width characters dropped, except line/sentence-initial byte order marks.
 * Spaces normalized to ASCII space.
 * Fixed bug that in some cases mapped signs (such as dagger or bullet) to their verbal descriptions.
 * Tested against previous version of uroman with a new uroman visual diff tool.
 * Almost an order of magnitude faster.

Changes in version 0.7 (minor upgrade)
 * Added script uroman-quick.pl for Arabic script languages, incl. Uyghur.
   Much faster, pre-caching mapping of Arabic to Latin characters, simple greedy processing.
   Will not convert material from non-Arabic blocks such as any (somewhat unusual) Cyrillic
   or Chinese characters in Uyghur texts.

Changes in version 0.6 (minor upgrade)
 * Added support for two modifier letter characters used in Uzbek:
   (1) character "ʻ" ("modifier letter turned comma", which modifies preceding "g" and "u" letters)
   (2) character "ʼ" ("modifier letter apostrophe", which Uzbek uses to mark a glottal stop).
   Both are now mapped to "'" (plain ASCII apostrophe).
 * Added support for Uyghur vowel characters such as "ې" (Arabic e) and "ۆ" (Arabic oe)
   even when they are not preceded by "ئ" (yeh with hamza above).
 * Added support for Arabic semicolon "؛", Arabic ligature forms for phrases such as "ﷺ"
   ("sallallahou alayhe wasallam" = "prayer of God be upon him and his family and peace")
 * Added robustness for Arabic letter presentation forms (initial/medial/final/isolated).
   However, it is strongly recommended to normalize any presentation form Arabic letters
   to their non-presentation form before calling uroman.
 * Added force flush directive ($|=1;).

Changes in version 0.5 (minor upgrade)
 * Improvements for Uyghur (make sure to use language option: -l uig)

Changes in version 0.4 (minor upgrade)
 * Improvements for Thai (special cases for vowel/consonant reordering, e.g. for "sara o"; dropped some aspiration 'h's)
 * Minor change for Arabic (added "alef+fathatan" = "an")

New features in version 0.3
 * Covers Mandarin (Chinese)
 * Improved romanization for numerous languages
 * Preserves capitalization (e.g. from Latin, Cyrillic, Greek scripts)
 * Maps from native digits to Western numbers
 * Faster for South Asian languages

### Other features
 * Web interface (old Perl): https://uhermjakob.github.io/uroman.html
 * Vowelization is provided when locally computable, e.g. for many South Asian languages and Tibetan.

### Limitations
 * The current version of uroman has a few limitations, some of which we plan to address in future versions.
   For Japanese, *uroman* currently romanizes hiragana and katakana as expected, but kanji are interpreted as Chinese characters and romanized as such. 
   For Egyptian hieroglyphs, only single-sound phonetic characters and numbers are currently romanized. 
   For Linear B, only phonetic syllabic characters are romanized. 
   For some other extinct scripts such as cuneiform, no romanization is provided.
 * A romanizer is not a full transliterator. For example, this version of
   uroman does not vowelize text that lacks explicit vowelization such as
   normal text in Arabic and Hebrew (without diacritics/points).

### Acknowledgments
Earlier versions of this tool were based upon work supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via contract # FA8650-17-C-9116, and by research sponsored by Air Force Research Laboratory (AFRL) under agreement number FA8750-19-1-1000. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, Air Force Laboratory, DARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "uroman",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": "Ulf Hermjakob <ulf@isi.edu>",
    "keywords": "NLP, computational linguistics, machine translation, natural language processing, romanization, string similarity",
    "author": null,
    "author_email": "Ulf Hermjakob <ulf@isi.edu>",
    "download_url": "https://files.pythonhosted.org/packages/73/03/7d23f79d9b259861c31437ed76007eb8dc6f6c419b709f5b2ef37d4fa7da/uroman-1.3.1.1.tar.gz",
    "platform": null
}
    "bugtrack_url": null,
    "license": "Copyright (C) 2015-2020 Ulf Hermjakob, USC Information Sciences Institute  Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:  The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.  Any publication of projects using uroman shall acknowledge its use: \"This project uses the universal romanizer software 'uroman' written by Ulf Hermjakob, USC Information Sciences Institute (2015-2020)\". Bibliography: Ulf Hermjakob, Jonathan May, and Kevin Knight. 2018. Out-of-the-box universal romanization tool uroman. In Proceedings of the 56th Annual Meeting of Association for Computational Linguistics, Demo Track.  THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. ",
    "summary": "uroman is a universal romanizer. It converts text in any script to the standard Latin alphabet.",
    "version": "1.3.1.1",
    "project_urls": {
        "Bug Tracker": "https://github.com/isi-nlp/uroman/issues",
        "Homepage": "https://github.com/isi-nlp/uroman",
        "Repository": "https://github.com/isi-nlp/uroman"
    },
    "split_keywords": [
        "nlp",
        " computational linguistics",
        " machine translation",
        " natural language processing",
        " romanization",
        " string similarity"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "78e143722c41eebab0592c6f83410e5e35edc1d6e333f44feb0a543bd38dba3e",
                "md5": "4f4c3e3196f094cd0ef5cb6fa01ffab3",
                "sha256": "394f965f7011fd56a84aca098a6c3b50082f365324f5d94c992852137918c8f5"
            },
            "downloads": -1,
            "filename": "uroman-1.3.1.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "4f4c3e3196f094cd0ef5cb6fa01ffab3",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10",
            "size": 930684,
            "upload_time": "2024-06-28T06:03:32",
            "upload_time_iso_8601": "2024-06-28T06:03:32.578466Z",
            "url": "https://files.pythonhosted.org/packages/78/e1/43722c41eebab0592c6f83410e5e35edc1d6e333f44feb0a543bd38dba3e/uroman-1.3.1.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "73037d23f79d9b259861c31437ed76007eb8dc6f6c419b709f5b2ef37d4fa7da",
                "md5": "150906cb3de3fae7185d0766e2c175ae",
                "sha256": "6aaf2d5265f24f15201cbbf92c86720b2b804ac53294ce43a3307fcd242387d5"
            },
            "downloads": -1,
            "filename": "uroman-1.3.1.1.tar.gz",
            "has_sig": false,
            "md5_digest": "150906cb3de3fae7185d0766e2c175ae",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 896697,
            "upload_time": "2024-06-28T06:03:34",
            "upload_time_iso_8601": "2024-06-28T06:03:34.868940Z",
            "url": "https://files.pythonhosted.org/packages/73/03/7d23f79d9b259861c31437ed76007eb8dc6f6c419b709f5b2ef37d4fa7da/uroman-1.3.1.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-06-28 06:03:34",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "isi-nlp",
    "github_project": "uroman",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [],
    "lcname": "uroman"
}
        