Name | yajwiz |
Version | 0.10.4 |
home_page | None |
Summary | Klingon NLP toolkit |
upload_time | 2024-04-21 18:31:54 |
maintainer | None |
docs_url | None |
author | Iikka Hauhio |
requires_python | >=3.8 |
license | None |
requirements | No requirements were recorded. |
yajwI'
======
**yajwI'** is a Klingon NLP toolkit that includes basic tokenization, morphological analysis and POS tagging.
It heavily uses the `boQwI' dictionary <https://github.com/De7vID/klingon-assistant-data>`_.
Installation
------------
yajwI' requires Python 3.8 or newer.
It can be installed from PyPI::

    pip install yajwiz

Updating and using the boQwI' dictionary
----------------------------------------
When yajwI' is first imported, it downloads a copy of the boQwI' dictionary.
After that, the ``update_dictionary()`` function must be called whenever the dictionary needs to be updated.
The function checks for updates and installs them.
The downloaded dictionary can be accessed through the ``load_dictionary()`` function.
>>> import yajwiz
>>> yajwiz.update_dictionary()
>>> dictionary = yajwiz.load_dictionary()
>>> dictionary.version
'2021.03.18a'
Tokenization
------------
The library includes a very simple tokenizer that returns a list of ``(type, text)`` tuples.
>>> import yajwiz
>>> yajwiz.tokenize("Hegh neH chav qoH. qanchoHpa' qoH, Hegh qoH.")
[('WORD', 'Hegh'), ('SPACE', ' '), ('WORD', 'neH'), ('SPACE', ' '), ('WORD', 'chav'), ('SPACE', ' '), ('WORD', 'qoH'), ('PUNCT', '.'), ('SPACE', ' '), ('WORD', "qanchoHpa'"), ('SPACE', ' '), ('WORD', 'qoH'), ('PUNCT', ','), ('SPACE', ' '), ('WORD', 'Hegh'), ('SPACE', ' '), ('WORD', 'qoH'), ('PUNCT', '.')]
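Because the result is a plain list of ``(type, text)`` tuples, it can be post-processed with ordinary Python. The sketch below is illustrative only and not part of yajwI': it copies the first sentence of the output above into a literal so that it runs without the library, and the helpers ``words`` and ``detokenize`` are hypothetical names.

```python
# Token list copied from the yajwiz.tokenize output shown above
# (first sentence only), so the example runs without yajwiz installed.
tokens = [
    ("WORD", "Hegh"), ("SPACE", " "), ("WORD", "neH"), ("SPACE", " "),
    ("WORD", "chav"), ("SPACE", " "), ("WORD", "qoH"), ("PUNCT", "."),
]


def words(tokens):
    """Return only the text of WORD tokens (hypothetical helper)."""
    return [text for kind, text in tokens if kind == "WORD"]


def detokenize(tokens):
    """Rebuild a string by concatenating the text of every token."""
    return "".join(text for _, text in tokens)


print(words(tokens))
print(detokenize(tokens))
```

In the sample above the SPACE and PUNCT tokens are kept in the list, so concatenating every token's text gives back the input sentence.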
Morphological analysis
----------------------
The ``yajwiz.analyze`` function parses a word and returns a list of possible parses. Each parse is a dictionary that contains the lemma, the part of speech, the suffixes and other details.
>>> yajwiz.analyze("yInwI'")
[{'BOQWIZ_ID': 'yIn:n',
'BOQWIZ_POS': 'n:klcp1',
'LEMMA': 'yIn',
'PARTS': ['yIn:n', "-wI':n"],
'POS': 'N',
'SUFFIX': {'N4': "-wI'"},
'UNGRAMMATICAL': 'ILLEGAL PLURAL OR POSSESSIVE SUFFIX',
'WORD': "yInwI'",
'XPOS': 'N',
'XPOS_GSUFF': 'N'},
{'BOQWIZ_ID': 'yIn:v',
'BOQWIZ_POS': 'v:t_c,klcp1',
'LEMMA': 'yIn',
'PARTS': ['yIn:v', "-wI':v"],
'POS': 'V',
'SUFFIX': {'V9': "-wI'"},
'WORD': "yInwI'",
'XPOS': 'VT',
'XPOS_GSUFF': "VT.wI'"}]
Currently the analyzer is very permissive and accepts wrong plural and possessive suffixes (e.g. **yInwI'** instead of **yInwIj**). It tries to mark such errors with an ``'UNGRAMMATICAL'`` key, whose value in the example above is the string ``'ILLEGAL PLURAL OR POSSESSIVE SUFFIX'``. It detects the following errors:
- Using **-pu'**, **-wI'**, **-lI'**, etc. when the noun is not a person noun
- Using **-Du'** when the noun is not a body part
- Using **-vIS** without using **-taH**
- Using **-lu'** with an illegal verb prefix
- Using intransitive verbs with prefixes indicating object
- Using **-ghach** without any other verb suffix
- Using aspect suffix with **-jaj**
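A caller that wants only the parses without detected errors can filter on that key. The sketch below is not part of yajwI': it uses two parses abbreviated from the ``yajwiz.analyze("yInwI'")`` output above, the helper names are hypothetical, and it assumes that a parse with no detected error simply lacks the ``'UNGRAMMATICAL'`` key, as the second parse in the example does.

```python
# Two parses abbreviated from the yajwiz.analyze("yInwI'") output above.
parses = [
    {"WORD": "yInwI'", "LEMMA": "yIn", "POS": "N", "XPOS": "N",
     "UNGRAMMATICAL": "ILLEGAL PLURAL OR POSSESSIVE SUFFIX"},
    {"WORD": "yInwI'", "LEMMA": "yIn", "POS": "V", "XPOS": "VT"},
]


def grammatical_only(parses):
    """Keep parses that carry no 'UNGRAMMATICAL' key (hypothetical helper).

    Assumption: a missing key means that no error was detected.
    """
    return [p for p in parses if "UNGRAMMATICAL" not in p]


def problems(parses):
    """Collect the values stored under 'UNGRAMMATICAL', if any."""
    return [p["UNGRAMMATICAL"] for p in parses if "UNGRAMMATICAL" in p]


print([p["XPOS"] for p in grammatical_only(parses)])
print(problems(parses))
```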
There is also a simpler function, ``yajwiz.split_to_morphemes``, which returns a set of tuples of strings (usually the set contains only one tuple):
>>> yajwiz.split_to_morphemes("yInwI'")
{('yIn', "-wI'")}
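Each tuple lists the morphemes of one possible segmentation. As a small illustration that is not part of yajwI', the hypothetical helper below joins a tuple back into the written word. It assumes that hyphens occur only as affix markers, as in ``-wI'`` above.

```python
# Result copied from yajwiz.split_to_morphemes("yInwI'") above.
segmentations = {("yIn", "-wI'")}


def surface_form(morphemes):
    """Join morphemes into a word, dropping the hyphens that mark affixes.

    Assumption: hyphens appear only as affix markers, as in "-wI'".
    """
    return "".join(m.strip("-") for m in morphemes)


for morphemes in sorted(segmentations):
    print(morphemes, "->", surface_form(morphemes))
```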
List of Parts of Speech
.......................
=====  ===========
XPOS   Explanation
=====  ===========
VS     Stative verb
VT     Transitive verb
VI     Intransitive verb
VA     Transitive and intransitive verb
V?     Verb with unknown transitivity
NL     Person noun
NB     Body part noun
PRON   Pronoun (including **'Iv** and **nuq**: it is a noun that can function as a copula)
NUM    Number
N      Other noun
ADV    Adverb
EXCL   Exclamation
CONJ   Conjunction
QUES   Question word (other than **'Iv** and **nuq**)
UNK    Unknown
=====  ===========
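The table can be turned into a small lookup for use in scripts. The sketch below is not part of yajwI'. The dictionary is copied from the table, and the hypothetical ``describe_xpos`` helper assumes that tags such as ``VT.wI'`` and ``V?.pa'``, which appear in the outputs in this document, consist of a base tag from the table followed by a dot and a suffix name.

```python
# XPOS tags and explanations copied from the table above.
XPOS_EXPLANATIONS = {
    "VS": "Stative verb",
    "VT": "Transitive verb",
    "VI": "Intransitive verb",
    "VA": "Transitive and intransitive verb",
    "V?": "Verb with unknown transitivity",
    "NL": "Person noun",
    "NB": "Body part noun",
    "PRON": "Pronoun",
    "NUM": "Number",
    "N": "Other noun",
    "ADV": "Adverb",
    "EXCL": "Exclamation",
    "CONJ": "Conjunction",
    "QUES": "Question word",
    "UNK": "Unknown",
}


def describe_xpos(tag):
    """Explain an XPOS tag, ignoring anything after the first dot.

    Assumption: "VT.wI'" means base tag "VT" plus the suffix "-wI'".
    Tags that are not in the table are reported as "Unknown".
    """
    base = tag.split(".", 1)[0]
    return XPOS_EXPLANATIONS.get(base, XPOS_EXPLANATIONS["UNK"])


print(describe_xpos("VT"))
print(describe_xpos("V?.pa'"))
```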
Grammar checker
---------------
yajwI' can be used to find common grammar errors. You can either use the method ``yajwiz.grammar_check`` or the following command line interface:
.. code::

    python -m yajwiz.grammar_check file.txt

CoNLL-U files and POS tagger
----------------------------
CoNLL-U is a popular data format for storing annotated linguistic data.
yajwI' can generate CoNLL-U files filled with morphological information (it does not support dependency parsing).
Below is an example script that first parses a text without a trained POS tagger,
then trains a POS tagger on that output, and finally parses the text again with the tagger and saves the result to a CoNLL-U file.
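.. code:: python

    import yajwiz

    with open("prose-corpus.txt", "r") as f:
        text = f.read()

    # First pass: no tagger, so ambiguous words are left untagged.
    conllu = yajwiz.text_to_conllu(text)

    # Train a tagger on the output of the first pass.
    tagger = yajwiz.Tagger()
    tagger.train(yajwiz.conllu_to_tagged_list(conllu))

    # Second pass: the trained tagger picks a tag for ambiguous words.
    conllu = yajwiz.text_to_conllu(text, tagger)

    with open("prose-corpus.conllu", "w") as f:
        f.write(conllu)
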
Without a trained POS tagger, ambiguous words will be left without a tag:
.. code::

    # Hegh neH chav qoH.
    1 Hegh _ _ _ _ _ _ _ _
    2 neH _ _ _ _ _ _ _ _
    3 chav _ _ _ _ _ _ _ _
    4 qoH qoH NOUN N _ _ _ _ _
    5 . . PUNCT PUNCT _ _ _ _ _

    # qanchoHpa' qoH, Hegh qoH.
    1 qanchoHpa' qan VERB V?.pa' Person=3|ObjPerson=3,0 _ _ _ SuffixV3=-choH|SuffixV9=-pa'
    2 qoH qoH NOUN N _ _ _ _ _
    3 , , PUNCT PUNCT _ _ _ _ _
    4 Hegh _ _ _ _ _ _ _ _
    5 qoH qoH NOUN N _ _ _ _ _
    6 . . PUNCT PUNCT _ _ _ _ _

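The untagged words can be listed by reading the ten CoNLL-U columns and checking the UPOS column for ``_``. The sketch below is not part of yajwI' and uses only the standard library. The helper names are hypothetical, and the sample is the first sentence above written with tab separators, which is how CoNLL-U files delimit columns.

```python
# First sentence of the untagged output above, with tab-separated columns.
SAMPLE = "\n".join([
    "# Hegh neH chav qoH.",
    "1\tHegh\t_\t_\t_\t_\t_\t_\t_\t_",
    "2\tneH\t_\t_\t_\t_\t_\t_\t_\t_",
    "3\tchav\t_\t_\t_\t_\t_\t_\t_\t_",
    "4\tqoH\tqoH\tNOUN\tN\t_\t_\t_\t_\t_",
    "5\t.\t.\tPUNCT\tPUNCT\t_\t_\t_\t_\t_",
])

# The ten column names defined by the CoNLL-U format.
COLUMNS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS",
           "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]


def parse_rows(conllu):
    """Yield one dict per token line, skipping comments and blank lines."""
    for line in conllu.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        yield dict(zip(COLUMNS, line.split("\t")))


def untagged_forms(conllu):
    """Return the word forms whose UPOS column is still "_"."""
    return [row["FORM"] for row in parse_rows(conllu) if row["UPOS"] == "_"]


print(untagged_forms(SAMPLE))
```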
After training the tagger, it will take the "best guess" when deciding the POS.
.. code::

    # Hegh neH chav qoH.
    1 Hegh Hegh VERB VT Person=3|ObjPerson=3,0 _ _ _ _
    2 neH neH ADV ADV _ _ _ _ _
    3 chav chav VERB VT Person=3|ObjPerson=3,0 _ _ _ _
    4 qoH qoH NOUN N _ _ _ _ _
    5 . . PUNCT PUNCT _ _ _ _ _

    # qanchoHpa' qoH, Hegh qoH.
    1 qanchoHpa' qan VERB V?.pa' Person=3|ObjPerson=3,0 _ _ _ SuffixV3=-choH|SuffixV9=-pa'
    2 qoH qoH NOUN N _ _ _ _ _
    3 , , PUNCT PUNCT _ _ _ _ _
    4 Hegh Hegh VERB VT Person=3|ObjPerson=3,0 _ _ _ _
    5 qoH qoH NOUN N _ _ _ _ _
    6 . . PUNCT PUNCT _ _ _ _ _

In this example the tagger made a mistake: it classified the first **Hegh** as VT, although it should be N. I don't have a correctly tagged corpus, so evaluating the tagger is currently impossible. :(
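If a hand-tagged corpus became available, the tagger could be scored with per-token accuracy. The sketch below is not part of yajwI' and only illustrates the calculation on the first sentence. The gold tags are hypothetical: only the tag for the first **Hegh** comes from the text above, and the remaining tags are assumed to be correct as predicted.

```python
def accuracy(predicted, gold):
    """Fraction of positions where the predicted tag equals the gold tag."""
    if len(predicted) != len(gold):
        raise ValueError("tag sequences must have the same length")
    if not gold:
        raise ValueError("cannot score an empty sequence")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# XPOS tags from the tagged output above for "Hegh neH chav qoH."
predicted = ["VT", "ADV", "VT", "N", "PUNCT"]

# Hypothetical gold tags: only the first one is stated in the text (the first
# Hegh should be N); the others are assumed to match the prediction.
gold = ["N", "ADV", "VT", "N", "PUNCT"]

print(accuracy(predicted, gold))  # 4 of the 5 tags match
```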
Copyright
---------
yajwiz (c) 2020 Iikka Hauhio
This program uses the `boQwI' dictionary <https://github.com/De7vID/klingon-assistant-data>`_ (``data.json``), which is licensed under the Apache License 2.0.
The Python files are also licensed under the Apache License 2.0. See the LICENSE file for more details.
Raw data
{
"_id": null,
"home_page": null,
"name": "yajwiz",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": null,
"keywords": null,
"author": "Iikka Hauhio",
"author_email": "fergusq@kaivos.org",
"download_url": "https://files.pythonhosted.org/packages/97/e1/1f494f8833fdfdbba35aa62f81b04c910b5840479a1cf4759d123282b03a/yajwiz-0.10.4.tar.gz",
"platform": null,
"description": "yajwI'\n======\n\n**yajwI'** is a Klingon NLP toolkit that includes basic tokenization, morphological analysis and POS tagging.\n\nIt heavily uses the `boQwI' dictionary <https://github.com/De7vID/klingon-assistant-data>`_.\n\nInstallation\n------------\n\nyajwI' requires Python 3.8 or newer.\n\nIt can be installed from PyPI::\n\n pip install yajwiz\n\nUpdating and using the boQwI' dictionary\n----------------------------------------\n\nWhen yajwI' is first imported, it will download a copy of the boQwI' dictionary.\nAfter this the ``update_dictionary()`` function must be called whenever the dictionary needs to be updated.\nThe function will check for updates and install them.\n\nThe downloaded dictionary can be accessed through the ``load_dictionary()`` function.\n\n>>> import yajwiz\n>>> yajwiz.update_dictionary()\n>>> dictionary = yajwiz.load_dictionary()\n>>> dictionary.version\n'2021.03.18a'\n\nTokenization\n------------\n\nThe library includes very simple tokenization.\n\n>>> import yajwiz\n>>> yajwiz.tokenize(\"Hegh neH chav qoH. 
qanchoHpa' qoH, Hegh qoH.\")\n[('WORD', 'Hegh'), ('SPACE', ' '), ('WORD', 'neH'), ('SPACE', ' '), ('WORD', 'chav'), ('SPACE', ' '), ('WORD', 'qoH'), ('PUNCT', '.'), ('SPACE', ' '), ('WORD', \"qanchoHpa'\"), ('SPACE', ' '), ('WORD', 'qoH'), ('PUNCT', ','), ('SPACE', ' '), ('WORD', 'Hegh'), ('SPACE', ' '), ('WORD', 'qoH'), ('PUNCT', '.')]\n\n\nMorphological analysis\n----------------------\n\nThe ``yajwiz.analyze`` function parses a word and returns a list of possible parses and a lot of extra information.\n\n>>> yajwiz.analyze(\"yInwI'\")\n[{'BOQWIZ_ID': 'yIn:n',\n 'BOQWIZ_POS': 'n:klcp1',\n 'LEMMA': 'yIn',\n 'PARTS': ['yIn:n', \"-wI':n\"],\n 'POS': 'N',\n 'SUFFIX': {'N4': \"-wI'\"},\n 'UNGRAMMATICAL': 'ILLEGAL PLURAL OR POSSESSIVE SUFFIX',\n 'WORD': \"yInwI'\",\n 'XPOS': 'N',\n 'XPOS_GSUFF': 'N'},\n {'BOQWIZ_ID': 'yIn:v',\n 'BOQWIZ_POS': 'v:t_c,klcp1',\n 'LEMMA': 'yIn',\n 'PARTS': ['yIn:v', \"-wI':v\"],\n 'POS': 'V',\n 'SUFFIX': {'V9': \"-wI'\"},\n 'WORD': \"yInwI'\",\n 'XPOS': 'VT',\n 'XPOS_GSUFF': \"VT.wI'\"}]\n\nCurrently the analyzer is very permissive and does allow using wrong plurals and possessive suffixes (eg. **yInwI'** instead of **yInwIj**). It will try to mark this kind of errors with ``'UNGRAMMATICAL': True``. It detects the following errors:\n\n- Using **-pu'**, **-wI'**, **-lI'**, etc. 
when the noun is not a person noun\n- Using **-Du'** when the noun is not a body part\n- Using **-vIS** without using **-taH**\n- Using **-lu'** with an illegal verb prefix\n- Using intransitive verbs with prefixes indicating object\n- Using **-ghach** without any other verb suffix\n- Using aspect suffix with **-jaj**\n\nThere is also a simpler function ``yajwiz.split_to_morphemes``, that returns a set of tuples of strings (usually there will be only one tuple in the set):\n\n>>> yajwiz.split_to_morphemes(\"yInwI'\")\n{('yIn', \"-wI'\")}\n\nList of Parts of Speech\n.......................\n\n===== ===========\nXPOS Explanation\n===== ===========\nVS Stative verb\nVT Transitive verb\nVI Intransitive verb\nVA Transitive and intransitive verb\nV? Verb with unknown transitivity\nNL Person noun\nNB Body part noun\nPRON Pronoun (including **'Iv** and **nuq**: it is a noun that can function as a copula)\nNUM Number\nN Other noun\nADV Adverb\nEXCL Exclamation\nCONJ Conjunction\nQUES Question word (other than **'Iv** and **nuq**)\nUNK Unknown\n===== ===========\n\nGrammar checker\n---------------\n\nyajwI' can be used to find common grammar errors. You can either use the method ``yajwiz.grammar_check`` or the following command line interface:\n\n.. code::\n\n python -m yajwiz.grammar_check file.txt\n\nCONLL-U files and POS tagger\n----------------------------\n\nCONLL-U files are a popular data format for storing annotated linguistic data.\n\nyajwI' can generate CONLL-U files filled with morphological information (it does not support dependency parsing).\n\nBelow is an example script that first parses a text without a trained POS tagger,\nthen trains a POS tagger with it and finally parses the text with the tagger and saves the result to a CONLL-U file.\n\n.. 
code:: python\n\n import yajwiz\n\n with open(\"prose-corpus.txt\", \"r\") as f:\n text = f.read()\n\n conllu = yajwiz.text_to_conllu(text)\n\n tagger = yajwiz.Tagger()\n tagger.train(yajwiz.conllu_to_tagged_list(conllu))\n\n conllu = yajwiz.text_to_conllu(text, tagger)\n\n with open(\"prose-corpus.conllu\", \"w\") as f:\n f.write(conllu)\n\nWithout a trained POS tagger, ambiguous words will be left without a tag:\n\n.. code::\n\n # Hegh neH chav qoH.\n 1\tHegh\t_\t_\t_\t_\t_\t_\t_\t_\n 2\tneH\t_\t_\t_\t_\t_\t_\t_\t_\n 3\tchav\t_\t_\t_\t_\t_\t_\t_\t_\n 4\tqoH\tqoH\tNOUN\tN\t_\t_\t_\t_\t_\n 5\t.\t.\tPUNCT\tPUNCT\t_\t_\t_\t_\t_\n\n # qanchoHpa' qoH, Hegh qoH.\n 1\tqanchoHpa'\tqan\tVERB\tV?.pa'\tPerson=3|ObjPerson=3,0\t_\t_\t_\tSuffixV3=-choH|SuffixV9=-pa'\n 2\tqoH\tqoH\tNOUN\tN\t_\t_\t_\t_\t_\n 3\t,\t,\tPUNCT\tPUNCT\t_\t_\t_\t_\t_\n 4\tHegh\t_\t_\t_\t_\t_\t_\t_\t_\n 5\tqoH\tqoH\tNOUN\tN\t_\t_\t_\t_\t_\n 6\t.\t.\tPUNCT\tPUNCT\t_\t_\t_\t_\t_\n\nAfter training the tagger, it will take the \"best guess\" when deciding the POS.\n\n.. code::\n\n # Hegh neH chav qoH.\n 1\tHegh\tHegh\tVERB\tVT\tPerson=3|ObjPerson=3,0\t_\t_\t_\t_\n 2\tneH\tneH\tADV\tADV\t_\t_\t_\t_\t_\n 3\tchav\tchav\tVERB\tVT\tPerson=3|ObjPerson=3,0\t_\t_\t_\t_\n 4\tqoH\tqoH\tNOUN\tN\t_\t_\t_\t_\t_\n 5\t.\t.\tPUNCT\tPUNCT\t_\t_\t_\t_\t_\n\n # qanchoHpa' qoH, Hegh qoH.\n 1\tqanchoHpa'\tqan\tVERB\tV?.pa'\tPerson=3|ObjPerson=3,0\t_\t_\t_\tSuffixV3=-choH|SuffixV9=-pa'\n 2\tqoH\tqoH\tNOUN\tN\t_\t_\t_\t_\t_\n 3\t,\t,\tPUNCT\tPUNCT\t_\t_\t_\t_\t_\n 4\tHegh\tHegh\tVERB\tVT\tPerson=3|ObjPerson=3,0\t_\t_\t_\t_\n 5\tqoH\tqoH\tNOUN\tN\t_\t_\t_\t_\t_\n 6\t.\t.\tPUNCT\tPUNCT\t_\t_\t_\t_\t_\n\nIn this example the tagger made a mistake: it classified the first **Hegh** as VT, although it should be N. I don't have a correctly tagged corpus, so evaluating the tagger is currently impossible. 
:(\n\nCopyright\n---------\n\nyajwiz (c) 2020 Iikka Hauhio\n\nThis program a uses the `boQwI' dictionary <https://github.com/De7vID/klingon-assistant-data>`_ (``data.json``) that is licensed under the Apache License 2.0.\n\nThe Python files are also licensed under the Apache License 2.0. See the LICENSE file for more details.\n",
"bugtrack_url": null,
"license": null,
"summary": "Klingon NLP toolkit",
"version": "0.10.4",
"project_urls": null,
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "e1e67254e267a26955ea929b73ef7d701a40b63891dc9d248752ae35984743bf",
"md5": "28f8e3cd7259fd04583d06239914a7ee",
"sha256": "921066ca18a09ee77d57bbdf50c7fb4be3d58c06e794a6f84143672e33161f66"
},
"downloads": -1,
"filename": "yajwiz-0.10.4-py3-none-any.whl",
"has_sig": false,
"md5_digest": "28f8e3cd7259fd04583d06239914a7ee",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 2206439,
"upload_time": "2024-04-21T18:31:52",
"upload_time_iso_8601": "2024-04-21T18:31:52.311962Z",
"url": "https://files.pythonhosted.org/packages/e1/e6/7254e267a26955ea929b73ef7d701a40b63891dc9d248752ae35984743bf/yajwiz-0.10.4-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "97e11f494f8833fdfdbba35aa62f81b04c910b5840479a1cf4759d123282b03a",
"md5": "529079e6a4fd97529262148bf386d398",
"sha256": "b53a237cd7ebbd8aa4bca38c1283d5d48a8652c30f3cb3cccb57e9341628f8c0"
},
"downloads": -1,
"filename": "yajwiz-0.10.4.tar.gz",
"has_sig": false,
"md5_digest": "529079e6a4fd97529262148bf386d398",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 25688,
"upload_time": "2024-04-21T18:31:54",
"upload_time_iso_8601": "2024-04-21T18:31:54.894614Z",
"url": "https://files.pythonhosted.org/packages/97/e1/1f494f8833fdfdbba35aa62f81b04c910b5840479a1cf4759d123282b03a/yajwiz-0.10.4.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-04-21 18:31:54",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "yajwiz"
}