NlpToolkit-MorphologicalAnalysis-Cy


* Name: NlpToolkit-MorphologicalAnalysis-Cy
* Version: 1.0.28
* Home page: https://github.com/StarlangSoftware/TurkishMorphologicalAnalysis-Cy
* Summary: Turkish Morphological Analysis
* Upload time: 2022-12-07 12:08:50
* Author: olcaytaner

Morphological Analysis
============

## Morphology

In linguistics, the term morphology refers to the study of the internal structure of words. Each word is assumed to consist of one or more morphemes, which can be defined as the smallest linguistic unit having a particular meaning or grammatical function. One can come across morphologically simplex words, i.e. roots, as well as morphologically complex ones, such as compounds or affixed forms.

Batı-lı-laş-tır-ıl-ama-yan-lar-dan-mış-ız 
west-With-Make-Caus-Pass-Neg.Abil-Nom-Pl-Abl-Evid-A1Pl
‘It appears that we are among the ones that cannot be westernized.’

The morphemes that constitute a word combine in a (more or less) strict order. Most morphologically complex words have the structure "ROOT-SUFFIX1-SUFFIX2-...". Affixes are of two types: (i) derivational affixes, which change the meaning and sometimes also the grammatical category of the base they are attached to, and (ii) inflectional affixes, which serve particular grammatical functions. In general, derivational suffixes precede inflectional ones. The order of derivational suffixes is reflected in the meaning of the derived form. For instance, consider the combination of the noun göz ‘eye’ with the two derivational suffixes -lIK and -CI: even though the same three morphemes are used, the meaning of a word like gözcülük ‘scouting’ is clearly different from that of gözlükçü ‘optician’.
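
The göz example can be reproduced with a minimal sketch of suffix attachment. This is not the analyzer's transducer: the function `attach`, the lookup tables, and the handling of the archiphonemes `I`, `C`, and `K` are simplifying assumptions introduced here for illustration, and they cover only what the two suffixes -lIK and -CI need.

```python
# Illustrative sketch only; names and rules are assumptions, not the library's code.
VOWELS = "aeıioöuü"

# Four-way vowel harmony: the archiphoneme I surfaces as a high vowel
# chosen by the last vowel of the stem.
HIGH_VOWEL = {"a": "ı", "ı": "ı", "e": "i", "i": "i",
              "o": "u", "u": "u", "ö": "ü", "ü": "ü"}

# After a voiceless consonant, the archiphoneme C surfaces as ç, otherwise as c.
VOICELESS = set("fstkçşhp")


def attach(stem, suffix):
    """Attach a suffix written with archiphonemes (I, C, K) to a stem."""
    out = stem
    for ch in suffix:
        if ch == "I":
            last_vowel = next(c for c in reversed(out) if c in VOWELS)
            out += HIGH_VOWEL[last_vowel]
        elif ch == "C":
            out += "ç" if out[-1] in VOICELESS else "c"
        elif ch == "K":
            out += "k"  # word-final realization only
        else:
            out += ch
    return out


# Same root and suffixes, different order, different word.
scouting = attach(attach("göz", "CI"), "lIK")   # gözcülük
optician = attach(attach("göz", "lIK"), "CI")   # gözlükçü
print(scouting, optician)
```

The two calls differ only in the order in which the suffixes are attached, which is enough to yield two distinct surface forms with distinct meanings.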

## Dilbaz

Here we present a new morphological analyzer, which is:

* **Open:** The latest version of the source code, the lexicon, and the morphotactic rule engine are all available here.
* **Extendible:** One of the disadvantages of other morphological analyzers is that their lexicons are fixed or unmodifiable, which prevents adding new bare forms to the analyzer. In our morphological analyzer, the lexicon is in text form and is easily modifiable.
* **Fast:** Morphological analysis is one of the core components of any NLP process, so it must be very fast to handle huge corpora. Compared to other morphological analyzers, our analyzer is capable of analyzing hundreds of thousands of words per second, which makes it one of the fastest Turkish morphological analyzers available.

The morphological analyzer consists of five main components, namely, a lexicon, a finite state transducer, a rule engine for suffixation, a trie data structure, and a least recently used (LRU) cache.
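
To make the role of the LRU cache concrete, here is a minimal sketch of such a cache keyed by surface form. The class `LRUCache` and its methods are hypothetical names chosen for this example; the source does not describe how the analyzer's own cache is implemented.

```python
from collections import OrderedDict


class LRUCache:
    """Hypothetical least-recently-used cache mapping words to their analyses."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        """Return the cached value, or None on a miss."""
        if key not in self._items:
            return None
        # A hit makes the entry the most recently used one.
        self._items.move_to_end(key)
        return self._items[key]

    def put(self, key, value):
        """Store a value, evicting the least recently used entry if full."""
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)


cache = LRUCache(2)
cache.put("yarına", ["yarın+NOUN+A3SG+PNON+DAT"])
cache.put("doktora", ["doktor+NOUN+A3SG+PNON+DAT"])
cache.get("yarına")                      # refreshes "yarına"
cache.put("gidecekler", ["git+VERB+POS+FUT+A3PL"])
print(cache.get("doktora"))              # None: evicted as least recently used
```

Because frequent words recur constantly in large corpora, a cache of this kind lets repeated words skip the full analysis.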

In this analyzer, we assume all idiosyncratic information to be encoded in the lexicon. While phonologically conditioned allomorphy is handled by the transducer, other types of allomorphy, all exceptional forms to otherwise regular processes, and words formed through derivation (except for the few transparently compositional derivational suffixes) are considered to be included in the lexicon.

In our morphological analyzer, the finite state transducer is encoded in an XML file.

To handle irregularities and to accelerate the search for bare forms, we use a trie data structure in our morphological analyzer and store all words of our lexicon in it. For regular words, we store only the word itself in the trie, whereas for irregular words we store both the original form and some prefix of that word.
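
The following is a minimal sketch of how a trie can speed up the search for candidate bare forms: a single left-to-right pass over the surface form finds every lexicon entry that is a prefix of it. The classes `TrieNode` and `Trie` and the method `prefixes_of` are hypothetical names for this example, and the tiny word list stands in for the real lexicon.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # maps a character to the next node
        self.is_word = False  # True if a lexicon entry ends at this node


class Trie:
    """Hypothetical trie holding the bare forms of a lexicon."""

    def __init__(self):
        self.root = TrieNode()

    def add(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def prefixes_of(self, surface):
        """Return all stored words that are prefixes of the surface form."""
        found = []
        node = self.root
        for i, ch in enumerate(surface):
            node = node.children.get(ch)
            if node is None:
                break
            if node.is_word:
                found.append(surface[:i + 1])
        return found


trie = Trie()
for bare_form in ["yar", "yarı", "yarın", "doktor"]:
    trie.add(bare_form)

print(trie.prefixes_of("yarına"))  # ['yar', 'yarı', 'yarın']
```

Each candidate found this way can then be handed to the transducer, which decides whether the remaining characters form a valid suffix sequence.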

Video Lectures
============

[<img src="https://github.com/StarlangSoftware/TurkishMorphologicalAnalysis/blob/master/video1.jpg" width="50%">](https://youtu.be/KxguxpbgDQc)[<img src="https://github.com/StarlangSoftware/TurkishMorphologicalAnalysis/blob/master/video2.jpg" width="50%">](https://youtu.be/UMmA2LMkAkw)[<img src="https://github.com/StarlangSoftware/TurkishMorphologicalAnalysis/blob/master/video3.jpg" width="50%">](https://youtu.be/dP97ovMSSfE)[<img src="https://github.com/StarlangSoftware/TurkishMorphologicalAnalysis/blob/master/video4.jpg" width="50%">](https://youtu.be/Tgmy5tts_pY)

For Developers
============

You can also see the [Python](https://github.com/starlangsoftware/TurkishMorphologicalAnalysis-Py), [Java](https://github.com/starlangsoftware/TurkishMorphologicalAnalysis), [C++](https://github.com/starlangsoftware/TurkishMorphologicalAnalysis-CPP), [Swift](https://github.com/starlangsoftware/TurkishMorphologicalAnalysis-Swift), [Js](https://github.com/starlangsoftware/TurkishMorphologicalAnalysis-Js), or [C#](https://github.com/starlangsoftware/TurkishMorphologicalAnalysis-CS) repositories.

## Requirements

* [Python 3.7 or higher](#python)
* [Git](#git)

### Python 

To check if you have a compatible version of Python installed, use the following command:

    python -V
    
You can find the latest version of Python [here](https://www.python.org/downloads/).

### Git

Install the [latest version of Git](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git).

## Pip Install

	pip3 install NlpToolkit-MorphologicalAnalysis-Cy

## Download Code

To work on the code, create a fork from the GitHub page.
Then use Git to clone your fork to your local machine, for example with the following command on Ubuntu:

	git clone <your-fork-git-link>

A directory called TurkishMorphologicalAnalysis-Cy will be created. Alternatively, you can clone the original repository to explore the code:

	git clone https://github.com/starlangsoftware/TurkishMorphologicalAnalysis-Cy.git

## Open project with PyCharm IDE

Steps for opening the cloned project:

* Start the IDE
* Select **File | Open** from the main menu
* Choose the cloned `TurkishMorphologicalAnalysis-Cy` directory
* Select the open as project option

Detailed Description
============

+ [Creating FsmMorphologicalAnalyzer](#creating-fsmmorphologicalanalyzer)
+ [Word level morphological analysis](#word-level-morphological-analysis)
+ [Sentence level morphological analysis](#sentence-level-morphological-analysis)

## Creating FsmMorphologicalAnalyzer 

`FsmMorphologicalAnalyzer` provides Turkish morphological analysis. This class can be created as follows:

    from MorphologicalAnalysis.FsmMorphologicalAnalyzer import FsmMorphologicalAnalyzer
    fsm = FsmMorphologicalAnalyzer()
    
This generates a new `TxtDictionary` type dictionary from [`turkish_dictionary.txt`](https://github.com/olcaytaner/Dictionary/tree/master/src/main/resources) with fixed cache size 100000 and by using [`turkish_finite_state_machine.xml`](https://github.com/olcaytaner/MorphologicalAnalysis/tree/master/src/main/resources). 

Creating a morphological analyzer with a different cache size, dictionary, or finite state machine is also possible.

* With a different cache size:

        fsm = FsmMorphologicalAnalyzer(50000)

* Using a different dictionary:

        fsm = FsmMorphologicalAnalyzer("my_turkish_dictionary.txt")

* Specifying both the finite state machine and the dictionary:

        fsm = FsmMorphologicalAnalyzer("fsm.xml", "my_turkish_dictionary.txt")

* Passing a finite state machine and a cache size together with a `TxtDictionary` object:

        dictionary = TxtDictionary("my_turkish_dictionary.txt")
        fsm = FsmMorphologicalAnalyzer("fsm.xml", dictionary, 50000)

* Passing a different finite state machine together with a `TxtDictionary` object:

        dictionary = TxtDictionary("my_turkish_dictionary.txt", "my_turkish_misspelled.txt")
        fsm = FsmMorphologicalAnalyzer("fsm.xml", dictionary)

## Word level morphological analysis

For morphological analysis, the `morphologicalAnalysis(String word)` method of `FsmMorphologicalAnalyzer` is used. It returns an `FsmParseList` object.


    fsm = FsmMorphologicalAnalyzer()
    word = "yarına"
    fsmParseList = fsm.morphologicalAnalysis(word)
    for i in range(fsmParseList.size()):
        print(fsmParseList.getFsmParse(i).transitionList())
    
      
Output

    yar+NOUN+A3SG+P2SG+DAT
    yar+NOUN+A3SG+P3SG+DAT
    yarı+NOUN+A3SG+P2SG+DAT
    yarın+NOUN+A3SG+PNON+DAT
    
From `FsmParseList`, a single `FsmParse` can be obtained as follows:

    parse = fsmParseList.getFsmParse(0)
    print(parse.transitionList())  
    
Output    
    
    yar+NOUN+A3SG+P2SG+DAT
    
## Sentence level morphological analysis

For sentence-level analysis, the `morphologicalAnalysis(Sentence sentence)` method of `FsmMorphologicalAnalyzer` is used. It returns an array of `FsmParseList` objects (`FsmParseList[]`), one for each word of the sentence.

    fsm = FsmMorphologicalAnalyzer()
    sentence = Sentence("Yarın doktora gidecekler")
    parseLists = fsm.morphologicalAnalysis(sentence)
    for i in range(len(parseLists)):
        for j in range(parseLists[i].size()):
            parse = parseLists[i].getFsmParse(j)
            print(parse.transitionList())
        print("-----------------")
    
Output
    
    yar+NOUN+A3SG+P2SG+NOM
    yar+NOUN+A3SG+PNON+GEN
    yar+VERB+POS+IMP+A2PL
    yarı+NOUN+A3SG+P2SG+NOM
    yarın+NOUN+A3SG+PNON+NOM
    -----------------
    doktor+NOUN+A3SG+PNON+DAT
    doktora+NOUN+A3SG+PNON+NOM
    -----------------
    git+VERB+POS+FUT+A3PL
    git+VERB+POS^DB+NOUN+FUTPART+A3PL+PNON+NOM
    -----------------
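
The printed analyses are plain strings, so they can be post-processed without the library. The sketch below splits such a string into its root and its tag groups. The helper `split_parse` is a hypothetical name, and reading `^DB` as a derivation boundary that starts a new tag group is an assumption based on the format of the output above, not something the source states.

```python
def split_parse(parse):
    """Split an analysis string into (root, list of tag groups).

    Assumption: "^DB+" separates groups of tags, and the first item
    of the first group is the root.
    """
    groups = parse.split("^DB+")
    root, *first_tags = groups[0].split("+")
    tag_groups = [first_tags] + [group.split("+") for group in groups[1:]]
    return root, tag_groups


print(split_parse("yar+NOUN+A3SG+P2SG+DAT"))
# ('yar', [['NOUN', 'A3SG', 'P2SG', 'DAT']])

print(split_parse("git+VERB+POS^DB+NOUN+FUTPART+A3PL+PNON+NOM"))
# ('git', [['VERB', 'POS'], ['NOUN', 'FUTPART', 'A3PL', 'PNON', 'NOM']])
```

A structure like this makes it easy to filter analyses, for example by keeping only those whose first tag is `NOUN`.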

# Cite

    @inproceedings{yildiz-etal-2019-open,
        title = "An Open, Extendible, and Fast {T}urkish Morphological Analyzer",
        author = {Y{\i}ld{\i}z, Olcay Taner and
                  Avar, Beg{\"u}m and
                  Ercan, G{\"o}khan},
        booktitle = "Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)",
        month = sep,
        year = "2019",
        address = "Varna, Bulgaria",
        publisher = "INCOMA Ltd.",
        url = "https://www.aclweb.org/anthology/R19-1156",
        doi = "10.26615/978-954-452-056-4_156",
        pages = "1364--1372",
    }
            
