uralicNLP

Name	uralicNLP
Version	2.0.3
home_page	https://github.com/mikahama/uralicNLP
Summary	An NLP library for Uralic languages such as Finnish and Sami. Also supports Spanish, Arabic, Russian etc.
upload_time	2024-11-28 18:46:01
maintainer	None
docs_url	None
author	Mika Hämäläinen
requires_python	None
license	Apache-2.0 license
keywords	uralic languages nlp morphology hfst omorfi dictionary lexicography
requirements	No requirements were recorded.

<h1 align="center">UralicNLP</h1>
<p align="center">Natural language processing for many languages</p>

[![Updates](https://pyup.io/repos/github/mikahama/uralicNLP/shield.svg)](https://pyup.io/repos/github/mikahama/uralicNLP/)  [![Downloads](https://static.pepy.tech/badge/uralicnlp)](https://pepy.tech/project/uralicnlp) [![DOI](https://joss.theoj.org/papers/10.21105/joss.01345/status.svg)](https://doi.org/10.21105/joss.01345)


UralicNLP can produce **morphological analyses**, **generate morphological forms**, **lemmatize words** and **give lexical information** about words in Uralic and other languages. Supported languages include Finnish, Russian, German, English, Norwegian, Swedish, Arabic, Ingrian, Meadow & Eastern Mari, Votic, Olonets-Karelian, Erzya, Moksha, Hill Mari, Udmurt, Tundra Nenets, Komi-Permyak, North Sami, South Sami and Skolt Sami. Currently, UralicNLP uses stable builds for the supported languages.

[See the catalog of supported languages](http://models.uralicnlp.com/nightly/)

Some of the supported languages: 🇸🇦 🇪🇸 🇮🇹 🇵🇹 🇩🇪 🇫🇷 🇳🇱 🇬🇧 🇷🇺 🇫🇮 🇸🇪 🇳🇴 🇩🇰 🇱🇻 🇪🇪

Check out [**UralicGUI** - a graphical user interface for UralicNLP](https://github.com/mikahama/uralicGUI).

☕ Check out UralicNLP [official Java version](https://github.com/mikahama/uralicNLP-Java)

♯ Check out UralicNLP [official C# version](https://github.com/mikahama/uralicNLP.net)

## Installation

The library can be installed from [PyPI](https://pypi.python.org/pypi/uralicNLP/).

    pip install uralicNLP
   
If you want to use the Constraint Grammar features (*from uralicNLP.cg3 import Cg3*), you will also need to install VISL CG-3.
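Before importing the Constraint Grammar wrapper, it can help to check that both prerequisites are present. The following is a minimal sketch, assuming the VISL CG-3 executable is called `vislcg3` and is on the `PATH`; the helper name `check_cg_prerequisites` is hypothetical and not part of UralicNLP.

```python
import importlib.util
import shutil


def check_cg_prerequisites(binary="vislcg3"):
    """Report whether uralicNLP is importable and a CG-3 binary is on PATH.

    The binary name is an assumption; adjust it if your installation differs.
    """
    return {
        # find_spec returns None when the top-level package is not installed.
        "uralicNLP": importlib.util.find_spec("uralicNLP") is not None,
        # shutil.which returns None when the executable is not found on PATH.
        "cg3_binary": shutil.which(binary) is not None,
    }


status = check_cg_prerequisites()
for name, available in status.items():
    print(f"{name}: {'found' if available else 'missing'}")
```

If either entry is reported as missing, install the corresponding component before using the Constraint Grammar features.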

## Large language models (LLMs)

UralicNLP supports a wide range of LLMs and can even embed text in some endangered languages. [Check out LLMs](https://github.com/mikahama/uralicNLP/wiki/Large-Language-Models).

UralicNLP can cluster texts into semantically similar categories. [Learn more about clustering](https://github.com/mikahama/uralicNLP/wiki/Semantics).

## List supported languages
The API is under constant development, and new languages will be added to the nightly build system. For this reason, UralicNLP provides a function for listing the currently supported languages. It returns three-letter ISO codes for the languages.

    from uralicNLP import uralicApi
    uralicApi.supported_languages()
    >>{'cg': ['vot', 'lav', 'izh', 'rus', 'lut', 'fao', 'est', 'nob', 'ron', 'olo', 'bxr', 'hun', 'crk', 'chr', 'vep', 'deu', 'mrj', 'gle', 'sjd', 'nio', 'myv', 'som', 'sma', 'sms', 'smn', 'kal', 'bak', 'kca', 'otw', 'ciw', 'fkv', 'nds', 'kpv', 'sme', 'sje', 'evn', 'oji', 'ipk', 'fit', 'fin', 'mns', 'rmf', 'liv', 'cor', 'mdf', 'yrk', 'tat', 'smj'], 'dictionary': ['vot', 'lav', 'rus', 'est', 'nob', 'ron', 'olo', 'hun', 'koi', 'chr', 'deu', 'mrj', 'sjd', 'myv', 'som', 'sma', 'sms', 'smn', 'kal', 'fkv', 'mhr', 'kpv', 'sme', 'sje', 'hdn', 'fin', 'mns', 'mdf', 'vro', 'udm', 'smj'], 'morph': ['vot', 'lav', 'izh', 'rus', 'lut', 'fao', 'est', 'nob', 'swe', 'ron', 'eng', 'olo', 'bxr', 'hun', 'koi', 'crk', 'chr', 'vep', 'deu', 'mrj', 'ara', 'gle', 'sjd', 'nio', 'myv', 'som', 'sma', 'sms', 'smn', 'kal', 'bak', 'kca', 'otw', 'ciw', 'fkv', 'nds', 'mhr', 'kpv', 'sme', 'sje', 'evn', 'oji', 'ipk', 'fit', 'fin', 'mns', 'rmf', 'liv', 'cor', 'mdf', 'yrk', 'vro', 'udm', 'tat', 'smj']}

The *dictionary* key lists the languages supported by the lexical lookup, *morph* lists the languages that have morphological FSTs (finite-state transducers), and *cg* lists the languages that have a CG (Constraint Grammar).
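To find out which features are available for one language, the returned dictionary can be inverted. The sketch below works on an abbreviated sample copied from the output above rather than calling the API; the helper name `features_for` is hypothetical.

```python
# Abbreviated sample of the dictionary returned by uralicApi.supported_languages();
# the real result contains many more language codes.
supported = {
    "cg": ["vot", "fin", "myv", "sms"],
    "dictionary": ["vot", "fin", "myv", "sms", "udm"],
    "morph": ["vot", "fin", "myv", "sms", "udm", "eng", "ara"],
}


def features_for(language, supported):
    """Return the sorted feature keys whose code list contains the language."""
    return sorted(key for key, codes in supported.items() if language in codes)


print(features_for("fin", supported))  # ['cg', 'dictionary', 'morph']
print(features_for("eng", supported))  # ['morph']
```

In real use, replace the sample with the value returned by *uralicApi.supported_languages()*.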

## Download models 

On the command line:

    python -m uralicNLP.download --languages fin eng

From Python code:

    from uralicNLP import uralicApi
    uralicApi.download("fin")

When models are installed, the *generate()*, *analyze()* and *lemmatize()* methods automatically use them instead of the server-side API. [More information about the models](https://github.com/mikahama/uralicNLP/wiki/Models).

## Lemmatize words
A word form can be lemmatized with UralicNLP. This does not do any disambiguation but rather returns a list of all the possible lemmas.

    from uralicNLP import uralicApi
    uralicApi.lemmatize("вирев", "myv")
    >>['вирев', 'вирь']
    uralicApi.lemmatize("luutapiiri", "fin", word_boundaries=True)
    >>['luuta|piiri', 'luu|tapiiri']
  
The first call lemmatizes the word *вирев* in Erzya (myv). By default, a **descriptive** analyzer is used; pass *descriptive=False*, as in *uralicApi.lemmatize("вирев", "myv", descriptive=False)*, for a non-descriptive analyzer. If *word_boundaries* is set to True, the lemmatizer marks word boundaries with a |, as in the second call.
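The boundary marker makes it easy to recover the parts of a compound. The sketch below uses the Finnish output shown above as literal data instead of calling the API; `split_compound` is a hypothetical helper name.

```python
# Output copied from the lemmatize() example above (word_boundaries=True).
lemmas = ["luuta|piiri", "luu|tapiiri"]


def split_compound(lemma, boundary="|"):
    """Split a lemma into its compound parts at the boundary marker."""
    return lemma.split(boundary)


for lemma in lemmas:
    parts = split_compound(lemma)
    # Removing the marker gives back the plain lemma.
    print(lemma.replace("|", ""), "->", parts)
```

A lemma without a boundary marker comes back as a one-element list, so the same helper works for simple words.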

## Morphological analysis
Apart from just getting the lemmas, it's also possible to perform a complete morphological analysis.

    from uralicNLP import uralicApi
    uralicApi.analyze("voita", "fin")
    >>[['voi+N+Sg+Par', 0.0], ['voi+N+Pl+Par', 0.0], ['voitaa+V+Act+Imprt+Prs+ConNeg+Sg2', 0.0], ['voitaa+V+Act+Imprt+Sg2', 0.0], ['voitaa+V+Act+Ind+Prs+ConNeg', 0.0], ['voittaa+V+Act+Imprt+Prs+ConNeg+Sg2', 0.0], ['voittaa+V+Act+Imprt+Sg2', 0.0], ['voittaa+V+Act+Ind+Prs+ConNeg', 0.0], ['vuo+N+Pl+Par', 0.0]]
  
An example of analyzing the word *voita* in Finnish (fin). The default analyzer is **descriptive**. To use a normative analyzer instead, use *uralicApi.analyze("voita", "fin", descriptive=False)*.
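Each result is a pair of an analysis string and a weight. A minimal sketch for splitting the strings into a lemma and tags follows; it uses a subset of the output above as literal data and assumes that the lemma precedes the first `+` and that the first tag is the part of speech, which holds for this example but may not hold for every language or analyzer. The name `parse_analysis` is hypothetical.

```python
from collections import defaultdict

# Subset of the analyze() output shown above.
analyses = [
    ["voi+N+Sg+Par", 0.0],
    ["voitaa+V+Act+Imprt+Sg2", 0.0],
    ["voittaa+V+Act+Ind+Prs+ConNeg", 0.0],
    ["vuo+N+Pl+Par", 0.0],
]


def parse_analysis(analysis):
    """Split 'lemma+Tag1+Tag2' into (lemma, [tags])."""
    lemma, *tags = analysis.split("+")
    return lemma, tags


# Group the candidate lemmas by their first tag (assumed to be the POS).
lemmas_by_pos = defaultdict(set)
for analysis, _weight in analyses:
    lemma, tags = parse_analysis(analysis)
    pos = tags[0] if tags else None
    lemmas_by_pos[pos].add(lemma)

for pos, lemmas in lemmas_by_pos.items():
    print(pos, sorted(lemmas))
```

Grouping like this gives a quick view of how ambiguous a word form is before any disambiguation is applied.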

## Morphological generation
From a lemma and a morphological analysis, it's possible to generate the desired word form. 

    from uralicNLP import uralicApi
    uralicApi.generate("käsi+N+Sg+Par", "fin")
    >>[['kättä', 0.0]]
  
An example of generating the singular partitive form for the Finnish noun *käsi*. The result is *kättä*. The default generator is a **regular normative** generator. *uralicApi.generate("käsi+N+Sg+Par", "fin", dictionary_forms=True)* uses a normative dictionary generator and *uralicApi.generate("käsi+N+Sg+Par", "fin", descriptive=True)* a descriptive generator.
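The input to *generate()* is the lemma followed by its tags, joined with `+`. The sketch below builds such a string and extracts the surface forms from a result; the result is copied from the example above rather than produced by a live call, and `build_analysis` is a hypothetical helper name.

```python
def build_analysis(lemma, tags):
    """Join a lemma and its tags into the 'lemma+Tag1+Tag2' input format."""
    return "+".join([lemma, *tags])


query = build_analysis("käsi", ["N", "Sg", "Par"])
print(query)

# Result copied from the generate() example above: [form, weight] pairs.
generated = [["kättä", 0.0]]
forms = [form for form, _weight in generated]
print(forms)
```

The tag names and their order depend on the language model, so the tags passed here should match what *analyze()* returns for that language.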

## Morphological segmentation
UralicNLP makes it possible to split a word form into morphemes. Note that this does not work for all languages.

    from uralicNLP import uralicApi
    uralicApi.segment("luutapiirinikin", "fin")
    >>[['luu', 'tapiiri', 'ni', 'kin'], ['luuta', 'piiri', 'ni', 'kin']]

In the example, the word _luutapiirinikin_ has two possible interpretations, luu|tapiiri and luuta|piiri, and the segmentation is done for both.
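Each segmentation is a list of morphemes, so the alternatives can be printed side by side. The sketch below uses the output above as literal data; the check that the morphemes concatenate back to the word form holds for this example and is not a documented guarantee of the library.

```python
word = "luutapiirinikin"

# Output copied from the segment() example above.
segmentations = [
    ["luu", "tapiiri", "ni", "kin"],
    ["luuta", "piiri", "ni", "kin"],
]

hyphenated = []
for morphemes in segmentations:
    # Sanity check for this example: the pieces spell the original word.
    consistent = "".join(morphemes) == word
    hyphenated.append("-".join(morphemes))
    print(hyphenated[-1], "(consistent)" if consistent else "(differs from input)")
```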

## Disambiguation

This section has been moved to [UralicNLP wiki page on disambiguation](https://github.com/mikahama/uralicNLP/wiki/Disambiguation).

## Dictionaries

Learn more about dictionaries in [the wiki page on dictionaries](https://github.com/mikahama/uralicNLP/wiki/Dictionaries).

## Parsing UD CoNLL-U annotated TreeBank data

UralicNLP comes with tools for parsing and searching CoNLL-U formatted data. Please refer to [the Wiki for the UD parser documentation](https://github.com/mikahama/uralicNLP/wiki/UD-parser).


## Other functionalities

- [Machine Translation](https://github.com/mikahama/uralicNLP/wiki/Machine-Translation)
- [Finnish Dependency Parsing](https://github.com/mikahama/uralicNLP/wiki/Dependency-parsing)
- [ISO code to language name](https://github.com/mikahama/uralicNLP/wiki/uralicNLP.string_processing#iso_to_name)
- [Tokenization](https://github.com/mikahama/uralicNLP/wiki/Tokenization)

# Cite

If you use UralicNLP in an academic publication, please cite it as follows:

Hämäläinen, Mika. (2019). UralicNLP: An NLP Library for Uralic Languages. Journal of Open Source Software, 4(37), [1345]. https://doi.org/10.21105/joss.01345

    @article{uralicnlp_2019, 
        title={{UralicNLP}: An {NLP} Library for {U}ralic Languages},
        DOI={10.21105/joss.01345}, 
        journal={Journal of Open Source Software}, 
        author={Mika Hämäläinen}, 
        year={2019}, 
        volume={4},
        number={37},
        pages={1345}
    }

For citing the FSTs and CGs, see *uralicApi.model_info(language)*.

The FST and CG tools and dictionaries come mostly from the [GiellaLT repositories](https://github.com/giellalt) and [Apertium](https://github.com/apertium).

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/mikahama/uralicNLP",
    "name": "uralicNLP",
    "maintainer": null,
    "docs_url": null,
    "requires_python": null,
    "maintainer_email": null,
    "keywords": "Uralic languages, NLP, morphology, hfst, omorfi, dictionary, lexicography",
    "author": "Mika H\u00e4m\u00e4l\u00e4inen",
    "author_email": "mika@flyforpoints.com",
    "download_url": "https://files.pythonhosted.org/packages/16/f7/9abb9c414544bbc0c235a4ee226353597da0139e4fd826cc6699eb4d94c3/uralicnlp-2.0.3.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "Apache-2.0 license",
    "summary": "An NLP library for Uralic languages such as Finnish and Sami. Also supports Spanish, Arabic, Russian etc.",
    "version": "2.0.3",
    "project_urls": {
        "Bug Reports": "https://github.com/mikahama/uralicNLP/issues",
        "Developer": "https://mikakalevi.com/",
        "Homepage": "https://github.com/mikahama/uralicNLP",
        "Wiki": "https://github.com/mikahama/uralicNLP/wiki"
    },
    "split_keywords": [
        "uralic languages",
        " nlp",
        " morphology",
        " hfst",
        " omorfi",
        " dictionary",
        " lexicography"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "07748013bb6cef23c9209b6c4161cfc925f93330d273411cb829c3c030481ca1",
                "md5": "f509d825c11f39c19eda81542d202f66",
                "sha256": "3fd12b667fbe24639b927a6891b35f7176df72cd0e264f7e34e912f3b8b20980"
            },
            "downloads": -1,
            "filename": "uralicNLP-2.0.3-py2.py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "f509d825c11f39c19eda81542d202f66",
            "packagetype": "bdist_wheel",
            "python_version": "py2.py3",
            "requires_python": null,
            "size": 113914,
            "upload_time": "2024-11-28T18:46:08",
            "upload_time_iso_8601": "2024-11-28T18:46:08.947531Z",
            "url": "https://files.pythonhosted.org/packages/07/74/8013bb6cef23c9209b6c4161cfc925f93330d273411cb829c3c030481ca1/uralicNLP-2.0.3-py2.py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "16f79abb9c414544bbc0c235a4ee226353597da0139e4fd826cc6699eb4d94c3",
                "md5": "32d85e5b02c31fa6816c69da68d235af",
                "sha256": "31ce062cf32a0be263cfd0309072ea6f1eba4e3e88aad4e456cd583615732a2b"
            },
            "downloads": -1,
            "filename": "uralicnlp-2.0.3.tar.gz",
            "has_sig": false,
            "md5_digest": "32d85e5b02c31fa6816c69da68d235af",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 115392,
            "upload_time": "2024-11-28T18:46:01",
            "upload_time_iso_8601": "2024-11-28T18:46:01.191349Z",
            "url": "https://files.pythonhosted.org/packages/16/f7/9abb9c414544bbc0c235a4ee226353597da0139e4fd826cc6699eb4d94c3/uralicnlp-2.0.3.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-11-28 18:46:01",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "mikahama",
    "github_project": "uralicNLP",
    "travis_ci": true,
    "coveralls": true,
    "github_actions": true,
    "lcname": "uralicnlp"
}
