yajwI'
======

**yajwI'** is a Klingon NLP toolkit that includes basic tokenization, morphological analysis and POS tagging.

It heavily uses the `boQwI' dictionary <https://github.com/De7vID/klingon-assistant-data>`_.

Installation
------------

yajwI' requires Python 3.8 or newer.

It can be installed from PyPI::

    pip install yajwiz

Updating and using the boQwI' dictionary
----------------------------------------

When yajwI' is first imported, it downloads a copy of the boQwI' dictionary.
After that, the dictionary is updated only when the ``update_dictionary()`` function is called.
The function checks for updates and installs them.

The downloaded dictionary can be accessed through the ``load_dictionary()`` function.

>>> import yajwiz
>>> yajwiz.update_dictionary()
>>> dictionary = yajwiz.load_dictionary()
>>> dictionary.version
'2021.03.18a'

Tokenization
------------

The library includes very simple tokenization.

>>> import yajwiz
>>> yajwiz.tokenize("Hegh neH chav qoH. qanchoHpa' qoH, Hegh qoH.")
[('WORD', 'Hegh'), ('SPACE', ' '), ('WORD', 'neH'), ('SPACE', ' '), ('WORD', 'chav'), ('SPACE', ' '), ('WORD', 'qoH'), ('PUNCT', '.'), ('SPACE', ' '), ('WORD', "qanchoHpa'"), ('SPACE', ' '), ('WORD', 'qoH'), ('PUNCT', ','), ('SPACE', ' '), ('WORD', 'Hegh'), ('SPACE', ' '), ('WORD', 'qoH'), ('PUNCT', '.')]
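
Each token is a ``(type, text)`` tuple, so downstream code can filter by type. Below is a minimal sketch, assuming the token list has the shape shown above. The list is hard-coded here instead of being produced by ``yajwiz.tokenize``, and ``words_only`` is a hypothetical helper, not part of yajwI'.

.. code:: python

    # Token list copied from the example output above (first sentence only).
    tokens = [
        ("WORD", "Hegh"), ("SPACE", " "), ("WORD", "neH"), ("SPACE", " "),
        ("WORD", "chav"), ("SPACE", " "), ("WORD", "qoH"), ("PUNCT", "."),
    ]

    def words_only(tokens):
        """Return the text of every WORD token, dropping spaces and punctuation."""
        return [text for kind, text in tokens if kind == "WORD"]

    words = words_only(tokens)

    # In this example the SPACE tokens are kept, so joining all token
    # texts gives back the original sentence.
    original = "".join(text for _, text in tokens)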


Morphological analysis
----------------------

The ``yajwiz.analyze`` function parses a word and returns a list of possible parses, each of which is a dictionary with a lot of extra information.

>>> yajwiz.analyze("yInwI'")
[{'BOQWIZ_ID': 'yIn:n',
  'BOQWIZ_POS': 'n:klcp1',
  'LEMMA': 'yIn',
  'PARTS': ['yIn:n', "-wI':n"],
  'POS': 'N',
  'SUFFIX': {'N4': "-wI'"},
  'UNGRAMMATICAL': 'ILLEGAL PLURAL OR POSSESSIVE SUFFIX',
  'WORD': "yInwI'",
  'XPOS': 'N',
  'XPOS_GSUFF': 'N'},
 {'BOQWIZ_ID': 'yIn:v',
  'BOQWIZ_POS': 'v:t_c,klcp1',
  'LEMMA': 'yIn',
  'PARTS': ['yIn:v', "-wI':v"],
  'POS': 'V',
  'SUFFIX': {'V9': "-wI'"},
  'WORD': "yInwI'",
  'XPOS': 'VT',
  'XPOS_GSUFF': "VT.wI'"}]

Currently the analyzer is very permissive and accepts wrong plural and possessive suffixes (e.g. **yInwI'** instead of **yInwIj**). It tries to mark errors of this kind with an ``'UNGRAMMATICAL'`` key, whose value in the example above is a string describing the error. It detects the following errors:

- Using **-pu'**, **-wI'**, **-lI'**, etc. when the noun is not a person noun
- Using **-Du'** when the noun is not a body part
- Using **-vIS** without using **-taH**
- Using **-lu'** with an illegal verb prefix
- Using intransitive verbs with prefixes indicating object
- Using **-ghach** without any other verb suffix
- Using aspect suffix with **-jaj**
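
In the example output above, the flagged parse carries an ``'UNGRAMMATICAL'`` key and the acceptable parse has no such key, so flagged parses can be filtered out with a key check. Below is a minimal sketch, assuming that shape holds in general. The parses are abbreviated copies of the ``yInwI'`` output, and ``grammatical_parses`` is a hypothetical helper, not part of yajwI'.

.. code:: python

    # Abbreviated copies of the two parses shown for "yInwI'" above.
    parses = [
        {"LEMMA": "yIn", "POS": "N", "XPOS": "N",
         "UNGRAMMATICAL": "ILLEGAL PLURAL OR POSSESSIVE SUFFIX"},
        {"LEMMA": "yIn", "POS": "V", "XPOS": "VT"},
    ]

    def grammatical_parses(parses):
        """Keep only the parses that carry no 'UNGRAMMATICAL' marker."""
        return [p for p in parses if not p.get("UNGRAMMATICAL")]

    good = grammatical_parses(parses)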

There is also a simpler function, ``yajwiz.split_to_morphemes``, which returns a set of tuples of strings (usually there will be only one tuple in the set):

>>> yajwiz.split_to_morphemes("yInwI'")
{('yIn', "-wI'")}

List of Parts of Speech
.......................

===== ===========
XPOS  Explanation
===== ===========
VS    Stative verb
VT    Transitive verb
VI    Intransitive verb
VA    Transitive and intransitive verb
V?    Verb with unknown transitivity
NL    Person noun
NB    Body part noun
PRON  Pronoun, i.e. a noun that can function as a copula (including **'Iv** and **nuq**)
NUM   Number
N     Other noun
ADV   Adverb
EXCL  Exclamation
CONJ  Conjunction
QUES  Question word (other than **'Iv** and **nuq**)
UNK   Unknown
===== ===========
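
The table can be turned into a small lookup for code that needs readable tag names. Below is a minimal sketch. It assumes that suffixed tags such as ``VT.wI'`` and ``V?.pa'`` (seen in the example outputs) are formed as a base tag, a period, and a suffix. ``XPOS_EXPLANATIONS`` and ``describe_xpos`` are hypothetical names, not part of yajwI'.

.. code:: python

    # Explanations copied from the table above.
    XPOS_EXPLANATIONS = {
        "VS": "Stative verb",
        "VT": "Transitive verb",
        "VI": "Intransitive verb",
        "VA": "Transitive and intransitive verb",
        "V?": "Verb with unknown transitivity",
        "NL": "Person noun",
        "NB": "Body part noun",
        "PRON": "Pronoun",
        "NUM": "Number",
        "N": "Other noun",
        "ADV": "Adverb",
        "EXCL": "Exclamation",
        "CONJ": "Conjunction",
        "QUES": "Question word",
        "UNK": "Unknown",
    }

    def describe_xpos(tag):
        """Explain a tag, ignoring any ".suffix" part; None if not in the table."""
        base = tag.split(".", 1)[0]  # assumed format: "BASE" or "BASE.suffix"
        return XPOS_EXPLANATIONS.get(base)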

Grammar checker
---------------

yajwI' can be used to find common grammar errors. You can either use the method ``yajwiz.grammar_check`` or the following command line interface:

.. code::

    python -m yajwiz.grammar_check file.txt

CoNLL-U files and POS tagger
----------------------------

CoNLL-U files are a popular data format for storing annotated linguistic data.

yajwI' can generate CoNLL-U files filled with morphological information (it does not support dependency parsing).

Below is an example script that first parses a text without a trained POS tagger,
then trains a POS tagger on the result, and finally parses the text again with the tagger and saves the result to a CoNLL-U file.

.. code:: python

    import yajwiz

    with open("prose-corpus.txt", "r") as f:
        text = f.read()

    conllu = yajwiz.text_to_conllu(text)

    tagger = yajwiz.Tagger()
    tagger.train(yajwiz.conllu_to_tagged_list(conllu))

    conllu = yajwiz.text_to_conllu(text, tagger)

    with open("prose-corpus.conllu", "w") as f:
        f.write(conllu)

Without a trained POS tagger, ambiguous words will be left without a tag:

.. code::

    # Hegh neH chav qoH.
    1	Hegh	_	_	_	_	_	_	_	_
    2	neH	_	_	_	_	_	_	_	_
    3	chav	_	_	_	_	_	_	_	_
    4	qoH	qoH	NOUN	N	_	_	_	_	_
    5	.	.	PUNCT	PUNCT	_	_	_	_	_

    # qanchoHpa' qoH, Hegh qoH.
    1	qanchoHpa'	qan	VERB	V?.pa'	Person=3|ObjPerson=3,0	_	_	_	SuffixV3=-choH|SuffixV9=-pa'
    2	qoH	qoH	NOUN	N	_	_	_	_	_
    3	,	,	PUNCT	PUNCT	_	_	_	_	_
    4	Hegh	_	_	_	_	_	_	_	_
    5	qoH	qoH	NOUN	N	_	_	_	_	_
    6	.	.	PUNCT	PUNCT	_	_	_	_	_
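
Each token line has the ten tab-separated columns of the CoNLL-U format, and an underscore marks an empty field, so the untagged words can be listed with plain Python. Below is a minimal sketch that does not use yajwI' itself. ``parse_conllu`` is a hypothetical helper, and the sample is the first sentence of the output above.

.. code:: python

    # Column names defined by the CoNLL-U format.
    FIELDS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS",
              "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]

    def parse_conllu(text):
        """Turn CoNLL-U text into a list of {field: value} dicts, one per token."""
        rows = []
        for line in text.splitlines():
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and sentence comments
            rows.append(dict(zip(FIELDS, line.split("\t"))))
        return rows

    # First sentence of the untagged output above.
    sample = (
        "# Hegh neH chav qoH.\n"
        "1\tHegh\t_\t_\t_\t_\t_\t_\t_\t_\n"
        "2\tneH\t_\t_\t_\t_\t_\t_\t_\t_\n"
        "3\tchav\t_\t_\t_\t_\t_\t_\t_\t_\n"
        "4\tqoH\tqoH\tNOUN\tN\t_\t_\t_\t_\t_\n"
        "5\t.\t.\tPUNCT\tPUNCT\t_\t_\t_\t_\t_\n"
    )

    # Words whose UPOS column is still empty, i.e. left without a tag.
    untagged = [row["FORM"] for row in parse_conllu(sample) if row["UPOS"] == "_"]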

Once trained, the tagger makes a "best guess" when deciding the POS of an ambiguous word:

.. code::

    # Hegh neH chav qoH.
    1	Hegh	Hegh	VERB	VT	Person=3|ObjPerson=3,0	_	_	_	_
    2	neH	neH	ADV	ADV	_	_	_	_	_
    3	chav	chav	VERB	VT	Person=3|ObjPerson=3,0	_	_	_	_
    4	qoH	qoH	NOUN	N	_	_	_	_	_
    5	.	.	PUNCT	PUNCT	_	_	_	_	_

    # qanchoHpa' qoH, Hegh qoH.
    1	qanchoHpa'	qan	VERB	V?.pa'	Person=3|ObjPerson=3,0	_	_	_	SuffixV3=-choH|SuffixV9=-pa'
    2	qoH	qoH	NOUN	N	_	_	_	_	_
    3	,	,	PUNCT	PUNCT	_	_	_	_	_
    4	Hegh	Hegh	VERB	VT	Person=3|ObjPerson=3,0	_	_	_	_
    5	qoH	qoH	NOUN	N	_	_	_	_	_
    6	.	.	PUNCT	PUNCT	_	_	_	_	_

In this example the tagger made a mistake: it classified the first **Hegh** as VT, although it should be N. I don't have a correctly tagged corpus, so evaluating the tagger is currently impossible. :(

Copyright
---------

yajwiz (c) 2020 Iikka Hauhio

This program uses the `boQwI' dictionary <https://github.com/De7vID/klingon-assistant-data>`_ (``data.json``), which is licensed under the Apache License 2.0.

The Python files are also licensed under the Apache License 2.0. See the LICENSE file for more details.
