# GlotScript
- GlotScript-Tool: determines the script (writing system) of input text, reported as ISO 15924 codes.
- GlotScript-Resource: a resource listing the writing systems used for various languages.
## GlotScript Resource
What writing system is each language written in?
See [metadata folder](./metadata/).
## GlotScript Tool
Detect the script (writing system) of text based on ISO 15924.
- Unicode version: 15.0.0
- The codes were sourced from [Wikipedia ISO_15924](https://en.wikipedia.org/wiki/ISO_15924).
- Unicode ranges were extracted from [Unicode Character Database](https://www.unicode.org/Public/15.0.0/ucd/Scripts.txt).
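As a rough illustration of what extracting ranges from `Scripts.txt` involves, the sketch below parses a few lines in that file's layout (a code point or `start..end` range, a semicolon, the script name, then a `#` comment). It is not GlotScript's own loader: `SAMPLE`, `parse_scripts`, and `script_of` are hypothetical names chosen for this example. Note that `Scripts.txt` uses long script names such as `Latin`, so mapping them to four-letter ISO 15924 codes is a separate step that is not shown here.

```python
import re

# A few lines in the layout used by Scripts.txt.
SAMPLE = """\
# Comment lines and blank lines are ignored.
0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
00AA          ; Latin # Lo       FEMININE ORDINAL INDICATOR
3041..3096    ; Hiragana # Lo  [86] HIRAGANA LETTER SMALL A..HIRAGANA LETTER SMALL KE
"""

# start, optional "..end", then "; ScriptName"
LINE = re.compile(r"^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(\w+)")


def parse_scripts(text):
    """Return a list of (start, end, script_name) tuples, end inclusive."""
    ranges = []
    for line in text.splitlines():
        match = LINE.match(line)
        if match is None:  # comment or blank line
            continue
        start = int(match.group(1), 16)
        end = int(match.group(2), 16) if match.group(2) else start
        ranges.append((start, end, match.group(3)))
    return ranges


def script_of(ch, ranges):
    """Linear lookup; returns None when no range covers the character."""
    cp = ord(ch)
    for start, end, name in ranges:
        if start <= cp <= end:
            return name
    return None


ranges = parse_scripts(SAMPLE)
print(script_of("A", ranges))   # Latin
print(script_of("あ", ranges))  # Hiragana
print(script_of("1", ranges))   # None (not covered by the sample lines)
```

A linear scan is enough for a three-line sample; with the full file, a sorted list searched with `bisect` would be the more usual choice.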
### Special codes
- `Zinh` is the Unicode script property value for characters that may be used with multiple scripts and that inherit their script from the preceding base character. In some cases, we chose to assign parts of the `Zinh` range (e.g., ARABIC FATHATAN..ARABIC HAMZA BELOW, ARABIC LETTER SUPERSCRIPT ALEF) to a different block.
- `Zyyy` is the Unicode script property value for "Common" characters.
- `Zzzz` is the code for an "uncoded" script (Unicode script property value `Unknown`).
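To make the idea of inheritance concrete, here is a minimal sketch of how a `Zinh` character can take the script of the base character before it. It is a toy, not GlotScript's implementation: `TOY_RANGES`, `raw_script`, and `resolved_scripts` are hypothetical, the table covers only four ranges, and anything outside them is labelled `Zzzz` purely for simplicity.

```python
# Toy table of (start, end, code); only four ranges, chosen for the example.
TOY_RANGES = [
    (0x0041, 0x005A, "Latn"),  # A-Z
    (0x0061, 0x007A, "Latn"),  # a-z
    (0x03B1, 0x03C9, "Grek"),  # Greek small letters alpha..omega
    (0x0300, 0x036F, "Zinh"),  # Combining Diacritical Marks (Inherited)
]


def raw_script(ch):
    """Look up the code for one character; unknown characters get 'Zzzz'."""
    cp = ord(ch)
    for start, end, code in TOY_RANGES:
        if start <= cp <= end:
            return code
    return "Zzzz"


def resolved_scripts(text):
    """Give each character a code, letting Zinh inherit from the previous base."""
    result = []
    last_base = "Zzzz"  # used if the text starts with a combining mark
    for ch in text:
        code = raw_script(ch)
        if code == "Zinh":
            code = last_base
        else:
            last_base = code
        result.append(code)
    return result


print(resolved_scripts("e\u0301"))       # ['Latn', 'Latn']
print(resolved_scripts("\u03b1\u0301"))  # ['Grek', 'Grek']
```

The same combining acute accent (U+0301) resolves to `Latn` after `e` and to `Grek` after `α`, which is the behaviour the `Zinh` definition describes.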
### Install from pip
```bash
pip3 install GlotScript
```
### Install from git
```bash
pip3 install GlotScript@git+https://github.com/cisnlp/GlotScript
```
### Usage: Script Detection
```python
from GlotScript import get_script_predictor
sp = get_script_predictor()
```
Or import the ready-made predictor directly:
```python
from GlotScript import sp
```
```python
sp('これは日本人です')
>> ('Hira', 0.625, {'details': {'Hira': 0.625, 'Hani': 0.375}, 'tie': False, 'interval': 0.25})
```
```python
sp('This is Latin')[:2]
>> ('Latn', 1.0)
```
```python
sp('මේක සිංහල')[0]
>> 'Sinh'
```
```python
sp('𝄞𝄫 𒊕𒀸')
>> ('Xsux', 0.5, {'details': {'Xsux': 0.5, 'Zyyy': 0.5}, 'tie': True, 'interval': 0.0})
```
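The numbers in the outputs above are consistent with simple per-script character shares: `これは日本人です` has 5 Hiragana and 3 Han characters, giving 5/8 = 0.625 and 3/8 = 0.375, and `interval` matches the gap between the top two shares. The sketch below reproduces that arithmetic from raw counts. It is inferred from the examples only, so treat it as an assumption about the scoring rather than the library's code; `summarize` and `accept` are hypothetical helpers, and the value of `interval` for single-script text is a guess.

```python
def summarize(counts):
    """Turn {code: character_count} into a (code, score, info) tuple."""
    if not counts:
        raise ValueError("counts must not be empty")
    total = sum(counts.values())
    # Highest count first; equal counts keep their original order.
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    details = {code: n / total for code, n in ranked}
    scores = list(details.values())
    top_code, top_score = ranked[0][0], scores[0]
    runner_up = scores[1] if len(scores) > 1 else 0.0  # assumption
    interval = top_score - runner_up
    info = {"details": details, "tie": interval == 0.0, "interval": interval}
    return top_code, top_score, info


def accept(result, min_score=0.5):
    """Return the script code, or None when the result is a tie or too weak."""
    code, score, info = result
    if info["tie"] or score < min_score:
        return None
    return code


print(summarize({"Hira": 5, "Hani": 3}))
# ('Hira', 0.625, {'details': {'Hira': 0.625, 'Hani': 0.375}, 'tie': False, 'interval': 0.25})
print(accept(summarize({"Hira": 5, "Hani": 3})))  # Hira
print(accept(summarize({"Xsux": 2, "Zyyy": 2})))  # None
```

A helper like `accept` can be applied to the tuple returned by `sp(...)` in the same way, since it only relies on the tuple shape shown in the examples.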
### Usage: Script Separation
```python
from GlotScript import separate_script
```
```python
sent = "Hello Salut سلام 你好 こんにちは שלום مرحبا"
separate_script(sent)
>> {
"Latn":"Hello Salut ",
"Hebr":" שלום ",
"Arab":" سلام مرحبا",
"Hani":" 你好 ",
"Hira":" こんにちは "
}
```
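The returned dictionary keeps the surrounding spaces in each value, so some light post-processing is often useful. The sketch below works on a literal copy of the output shown above rather than calling the library; `cleaned` and `script_shares` are hypothetical helpers, not part of GlotScript.

```python
# Literal copy of the separate_script output shown above.
parts = {
    "Latn": "Hello Salut ",
    "Hebr": " שלום ",
    "Arab": " سلام مرحبا",
    "Hani": " 你好 ",
    "Hira": " こんにちは ",
}


def cleaned(parts):
    """Strip outer whitespace and collapse inner runs of whitespace."""
    return {code: " ".join(text.split()) for code, text in parts.items()}


def script_shares(parts):
    """Share of non-whitespace characters contributed by each script."""
    sizes = {
        code: sum(1 for ch in text if not ch.isspace())
        for code, text in parts.items()
    }
    total = sum(sizes.values())
    if total == 0:
        return {}
    return {code: n / total for code, n in sizes.items()}


print(cleaned(parts)["Latn"])  # Hello Salut
shares = script_shares(parts)
print(max(shares, key=shares.get))  # Latn
```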
### Exploring Unicode Blocks: Related Sources
<details>
<summary>Click to Expand</summary>
- [List of Unicode characters - Wikipedia](https://en.wikipedia.org/wiki/List_of_Unicode_characters)
- [Lightweight Plain-Text Editor for macOS - CotEditor](https://github.com/coteditor/CotEditor/blob/main/CotEditor/Sources/Unicode.UTF32.CodeUnit%2BBlockName.swift)
- [The Cygwin Terminal – terminal emulator for Cygwin, MSYS, and WSL - mintty](https://github.com/mintty/mintty/blob/master/src/scripts.t)
- [ISO_15924 Wikipedia](https://en.wikipedia.org/wiki/ISO_15924)
- [Unicode Character Database (Blocks) - Unicode](http://www.unicode.org/Public/4.1.0/ucd/Blocks.txt)
- [Unicode Character Database (Scripts) - Unicode](https://www.unicode.org/Public/15.0.0/ucd/Scripts.txt)
- [A free, web-based font editor, focusing on font design hobbyists - Glyphr-Studio-1](https://github.com/glyphr-studio/Glyphr-Studio-1/blob/master/dev/js/lib_unicode_blocks.js)
- [Kotlin - JetBrains](https://github.com/JetBrains/kotlin/blob/master/libraries/stdlib/native-wasm/src/kotlin/text/regex/AbstractCharClass.kt)
- [UNIX-like reverse engineering framework and command-line toolset - radare2](https://github.com/radareorg/radare2/blob/master/libr/util/utf8.c)
- [FreeOrion Game](https://github.com/freeorion/freeorion/blob/master/GG/src/UnicodeCharsets.cpp)
- [DOMinator - Firefox](https://github.com/wisec/DOMinator/blob/master/gfx/thebes/gfxFontUtils.cpp)
- [SHSans-derived CJK font family - glow-sans](https://github.com/welai/glow-sans/blob/master/src/utils/code-range.js)
- [Unicode Subset Bitfields - Microsoft](https://learn.microsoft.com/en-us/windows/win32/intl/unicode-subset-bitfields)
- [Stopes - FAIR NLLB FB](https://github.com/facebookresearch/stopes/blob/main/stopes/pipelines/monolingual/utils/predict_script.py)
- [Gradient Boosting on Decision Trees - catboost](https://github.com/catboost/catboost/blob/master/contrib/python/fonttools/fontTools/unicodedata/Blocks.py)
- [Blender](https://github.com/blender/blender/blob/main/source/blender/blenfont/intern/blf_glyph.cc)
- [Unicode Wikipedia](https://en.wikipedia.org/wiki/Unicode_block)
</details>
## Citation
If you use any part of this library in your research, please cite it using the following BibTeX entry.
```
@article{kargaran2023glotscript,
title = {GlotScript: A Resource and Tool for Low Resource Writing System Identification},
author = {Kargaran, Amir Hossein and Yvon, Fran{\c{c}}ois and Sch{\"u}tze, Hinrich},
year = 2023,
journal = {arXiv preprint arXiv:2309.13320}
}
```