yakutmorph

Name	yakutmorph
Version	0.0.5
home_page	https://github.com/nicolascortegoso/yakutmorph
Summary	A morphological analyzer for Yakut language
upload_time	2024-06-15 17:30:10
maintainer	None
docs_url	None
author	Nicolas Cortegoso Vissio
requires_python	>=3.8
license	None
keywords	python morphology analyzer yakut sakha nlp
requirements	No requirements were recorded.

# yakutmorph

This Python library provides tools for performing morphological annotations on texts in the Yakut (Sakha) language. It includes:

- A tokenizer to divide a string into tokens.
- A morphological transducer to map between surface forms and analysis forms.
- A module to resolve ambiguity in morphological analysis within the context of a given sequence.

## Installation
 
Install `yakutmorph` with pip, Python's package installer:

```
pip install yakutmorph
```

## Basic Usage

For convenience, all three modules (tokenization, morphological analysis, and disambiguation) are implemented within the `YakutMorph` class, which provides a user-friendly interface.
This class follows a non-destructive approach, encapsulating the input string and subsequent processing steps as objects within a main `Parse` object:

```
>>> from yakutmorph.main import YakutMorph
>>> morphology = YakutMorph()
>>> parse = morphology.parse('мин атым Кэскил.')
>>> parse
Parse(мин атым Кэскил.)
```

### Parse


The property `text` retrieves the input string:

```
>>> parse.text
'мин атым Кэскил.'
```

The property `tokens` returns a list of `Token` objects:

```
>>> parse.tokens
[Token(мин), Token(атым), Token(Кэскил), Token(.)]
```


### Tokens

Tokens within a `Parse` object can be accessed by their index. For example:

```
>>> token = parse.tokens[0]
>>> token
Token(мин)
```

The property `pos` returns an integer representing the position of the token in the sequence (starting at 1):

```
>>> token.pos
1
```

The property `surface` retrieves the surface form of the token (as it appears originally in the input string):


```
>>> token.surface
'мин'
```

The property `type` returns the token classification provided by the tokenizer:


```
>>> token.type
'lowercase'
```

If the token corresponds to a Yakut word form, it also contains an `Analyses` object.


### Analyses (Possible Interpretations)

The `Analyses` object contains the transducer that performed the morphological analysis and wraps its outputs as a list of `Analysis` objects. A word form can be morphologically ambiguous and, therefore, have more than one interpretation.


```
>>> analyses = token.analyses
>>> analyses
Analyses(Fst(voc)=2)
```

In the example above, the object representation `Analyses(Fst(voc)=2)` shows that the surface form was processed by the morphological transducer `voc` and that it produced 2 analyses.
The transducer that performed the morphological analyses is found under the property `fst`:


```
>>> analyses.fst
Fst(voc)
```

The transducer output can be obtained with the property `output`. This returns a list of `Analysis` objects with possible interpretations:


```
>>> fst_output = analyses.output
>>> fst_output
[Analysis([Morph(мин), Morph(^N)]), Analysis([Morph(мин), Morph(^Pron)])]
```

### Analysis

Each `Analysis` object in the output list can be accessed by its index:

```
>>> output = fst_output[0]
>>> output
Analysis([Morph(мин), Morph(^N)])
```

The property `morphemes` returns a list of `Morph` objects representing the lexical root and the concatenated affixes:


```
>>> output.morphemes
[Morph(мин), Morph(^N)]
```


The property `root` returns just the Morph object that contains the lexical root:


```
>>> output.root
Morph(мин)
```

The property `infl_groups` retrieves a list of `InflGroup` objects:


```
>>> output.infl_groups
[InflGroup(1)]
```

### Inflectional Groups


Inflectional groups can be accessed by index:


```
>>> ig = output.infl_groups[0]
>>> ig
InflGroup(1)
```


The `InflGroup` object wraps a series of suffixes represented as `Morph` objects.

The property `pos` returns an integer representing the position of the inflectional group in the analysis:


```
>>> ig.pos
1
```

The property `affixes` is used to retrieve the list of `Morph` objects grouped within:


```
>>> ig.affixes
[Morph(^N)]
```

### Morphemes

The `Morph` objects are accessed by index:


```
>>> morph = ig.affixes[0]
>>> morph
Morph(^N)
```

A `Morph` object contains one of three things: a lexical root (`root`), a derivational affix (`db`), or an inflectional affix (`fl`).

The property `morpheme` gets the tag representation of the morpheme:


```
>>> morph.morpheme
'^N'
```


The property `type` returns the morpheme type:


```
>>> morph.type
'db'
```

The property `reference` returns a dictionary with mappings for the morpheme:


```
>>> morph.reference
{'UPOS': 'NOUN', 'XPOS': 'n', 'ref': 'noun', 'aper': 'n'}
```


## Processing Unknown Lexical Roots

The default morphological transducer analyzes and generates surface forms from an internal vocabulary containing lexical roots. However, it is impossible to list all roots that may appear in Yakut texts, especially given the expected presence of numerous loanwords from the Russian language.

A common practice to handle this issue is to provide auxiliary morphological transducers that increase coverage at the expense of outputting some spurious analyses.

To expand the capability of processing surface forms with minimal ambiguity, the `YakutMorph` class by default implements a three-stage morphological pipeline:

1. **Vocabulary-based transducer (labeled 'voc')**: Analyzes word forms using the lexical roots listed in the vocabulary.
2. **Syllable-based transducer (labeled 'syl')**: Operates on a set of Yakut syllables and accepts any valid concatenation of syllables in a Yakut root. It cannot analyze word forms that deviate from Yakut phonotactics.
3. **Affix-based transducer (labeled 'aff')**: Accepts any string consisting of a sequence of at least two characters of the Yakut alphabet. It can process loanwords.

In this pipeline, each transducer is consulted only if the previous one fails to process a given surface form. In the example below, the three surface forms were each processed by a different morphological transducer:

```
>>> from yakutmorph.main import YakutMorph
>>> morphology = YakutMorph()
>>> parse = morphology.parse('Мама Егора учуутал.')
>>> [token.analyses for token in parse.tokens if token.has_morph]
[Analyses(Fst(syl)=1), Analyses(Fst(aff)=3), Analyses(Fst(voc)=1)]
```
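
The fallback behaviour described above can be sketched in plain Python. This is only an illustration of the cascade idea, not the library's implementation: the function `analyse_with_fallback`, the toy analyzers, and their tiny vocabularies are all hypothetical stand-ins.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical analyzer type: maps a surface form to a list of analysis
# strings, where an empty list means "no analysis found".
Analyzer = Callable[[str], List[str]]


def analyse_with_fallback(
    surface: str, stages: Sequence[Tuple[str, Analyzer]]
) -> Tuple[str, List[str]]:
    """Return (label, analyses) from the first stage that yields a result."""
    for label, analyse in stages:
        analyses = analyse(surface)
        if analyses:  # stop at the first transducer that succeeds
            return label, analyses
    return "none", []


# Toy analyzers with made-up coverage, for illustration only.
def voc(surface: str) -> List[str]:
    return {"учуутал": ["учуутал^N"]}.get(surface, [])


def syl(surface: str) -> List[str]:
    return {"мама": ["мама^N"]}.get(surface, [])


def aff(surface: str) -> List[str]:
    # Accept any string of at least two characters.
    return [surface + "^N"] if len(surface) >= 2 else []


stages = [("voc", voc), ("syl", syl), ("aff", aff)]
labels = [analyse_with_fallback(w, stages)[0] for w in ["мама", "егора", "учуутал"]]
print(labels)
```

Ordering the stages from most to least restrictive keeps ambiguity low: the permissive stage only runs for forms that the stricter ones reject.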

For more details, please refer to the README.md file inside the `src` folder, which contains the source code for the morphological transducers.


## Morphological Ambiguity

Ambiguous analyses occur when the morphological transducer outputs more than one possible interpretation for a surface form. For example:

```
>>> token.analyses.output
[Analysis([Morph(мин), Morph(^N)]), Analysis([Morph(мин), Morph(^Pron)])]
```

The disambiguation module employs a neural model to select the most likely analysis for each surface form within the context of the token sequence. This process happens automatically when calling the `parse` method.

The most likely analysis is an `Analysis` object, which can be retrieved through the token's `morph` property:

```
>>> token.morph
Analysis([Morph(мин), Morph(^Pron)])
```

Under the hood, the disambiguation model sets the `idx_mla` property (index of the most likely analysis) inside the `Analyses` object. This property is an integer index into the output list that points to the selected `Analysis` object:

```
>>> token.analyses.idx_mla
1
```

This index can be set manually if needed. It is used internally to retrieve the `Analysis` object when accessing the `morph` property of the `Token` object:


```
>>> token.analyses.idx_mla = 0
>>> token.morph
Analysis([Morph(мин), Morph(^N)])
```
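
The relationship between `idx_mla` and `morph` can be pictured with a small stand-alone sketch. The classes `AnalysesSketch` and `TokenSketch` below are hypothetical and use plain strings instead of `Analysis` objects; they only show how an index-backed property picks up a manual change at once.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AnalysesSketch:
    """Hypothetical stand-in for an Analyses-like container."""
    output: List[str]
    idx_mla: int = 0  # index of the most likely analysis


@dataclass
class TokenSketch:
    """Hypothetical stand-in for a Token-like object."""
    surface: str
    analyses: AnalysesSketch

    @property
    def morph(self) -> str:
        # Look up the selected analysis on every access, so that a change
        # to idx_mla is reflected immediately.
        return self.analyses.output[self.analyses.idx_mla]


token_sketch = TokenSketch("мин", AnalysesSketch(["мин^N", "мин^Pron"], idx_mla=1))
first = token_sketch.morph
token_sketch.analyses.idx_mla = 0  # manual override
second = token_sketch.morph
print(first, second)
```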


## Independent Modules

The modules integrated into the `YakutMorph` class can be used independently by importing their respective classes. For example:


```
>>> from yakutmorph.tokenizers import YakutTokenizer
>>> tokenizer = YakutTokenizer()
>>> tokenizer.tokenize('Мин аатым Кэскил.')
[('Мин', 'title'), ('аатым', 'lowercase'), ('Кэскил', 'title'), ('.', 'period')]
```

They output Python native types instead of wrapping the results in the objects described above. For example:


```
>>> from yakutmorph.transducers import YakutTransducer
>>> transducer = YakutTransducer()
>>> transducer.analyse('аатым')
['аат^N+POSS.1SG']
>>> transducer.generate('аат^N+POSS.1SG')
['аатым']
```
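
Because the transducer returns plain strings, they can be post-processed with ordinary string operations. The helper below is a hypothetical sketch, not part of the library. It assumes an analysis has the shape `root^POS+TAG+...` with a single `^`, as in the example above; analyses with further `^`-prefixed tags would need a different split.

```python
from typing import List, Tuple


def split_analysis(analysis: str) -> Tuple[str, List[str]]:
    """Split 'root^POS+TAG...' into the root and its list of tags."""
    root, _, rest = analysis.partition("^")
    if not rest:
        return root, []  # no tags at all
    pos, *affixes = rest.split("+")
    # Restore the leading markers so the tags look like those in the reference.
    return root, ["^" + pos] + ["+" + affix for affix in affixes]


print(split_analysis("аат^N+POSS.1SG"))
```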

These modules also expect Python native types as input, so it's essential to ensure the correct types are provided. For example, the disambiguation model expects a list of analyses and returns another list containing the indices corresponding to the selected analyses (excluding the sequence's start and end symbols):


```
>>> from yakutmorph.disambiguation import YakutModel
>>> model = YakutModel()
>>> tags = [['<BOS>'], ['^N', '^Pron'], ['^N+POSS.1SG'], ['^N', '^PN'], ['<EOS>']]
>>> model.disambiguate(tags)
[1, 0, 1]
```
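
The returned indices can be mapped back to the candidate analyses with a few lines of Python. The helper `select_analyses` is hypothetical; it assumes, as in the example above, that the first and last entries of the input are the start and end symbols and that one index is returned per remaining token.

```python
from typing import List


def select_analyses(tags: List[List[str]], indices: List[int]) -> List[str]:
    """Pick the chosen analysis for each token, skipping <BOS> and <EOS>."""
    inner = tags[1:-1]  # drop the sequence start and end symbols
    if len(inner) != len(indices):
        raise ValueError("expected one index per token")
    return [options[i] for options, i in zip(inner, indices)]


tags = [['<BOS>'], ['^N', '^Pron'], ['^N+POSS.1SG'], ['^N', '^PN'], ['<EOS>']]
selected = select_analyses(tags, [1, 0, 1])
print(selected)
```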

## Analysis Output

The `mappers` module provides classes to convert the `Parse` object to a given format. For example:


```
>>> from yakutmorph.mappers import CoNLLU
>>> print(CoNLLU(parse))
text = Мин аатым Кэскил.
1       Мин     мин     PRON    pron    Case=Nom|Number=Sing|Person=1|PronType=Prs      _       _       мин^Pron
2       аатым   аат     NOUN    n       Case=Nom|Number=Sing|Number[psor]=Sing|Person[psor]=1   _       _       аат^N+POSS.1SG
3       Кэскил  кэскил  PROPN   propn   Case=Nom        _       _       кэскил^PN
4       .       .       PUNCT   punct   _       _       _       _
```
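
When consuming this output downstream, the feature column is often easier to handle as a dictionary. The sketch below relies only on the CoNLL-U convention that features are `Name=Value` pairs joined by `|`, with `_` marking an empty field; the function name `parse_feats` is an assumption made for illustration.

```python
from typing import Dict


def parse_feats(feats: str) -> Dict[str, str]:
    """Convert a CoNLL-U FEATS string into a dictionary ('_' means empty)."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


feats = parse_feats("Case=Nom|Number=Sing|Number[psor]=Sing|Person[psor]=1")
print(feats["Number[psor]"], feats["Person[psor]"])
```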


# Morphological Reference

The transducers were developed following the grammar: Ubryatova Y.I. (red.) Grammatika sovremennogo yakutskogo literaturnogo yazyka. Tom 1: Fonetika i morfologiya. Moskva: Nauka Print, 1982.

The analysis form for affixes attempts to conform to the markup identifiers for grammatical annotation listed on the Turkic Morpheme web portal: Institute of Applied Semiotics, 420111, Kazan, 36A Levo-Bulachnaya st., http://modmorph.turklang.net/en/annotation

## Default reference

The default `YakutTransducer` object, like each transducer in the morphological pipeline, includes a `YakutReference` object with references to the implemented tags:


```
>>> from yakutmorph.transducers import YakutTransducer
>>> transducer = YakutTransducer()
>>> tag_set = transducer.reference.get_tags()
>>> len(tag_set)
142
```


The method `get_tag` returns a series of mappings for a tag in the transducer. For example, `ref` retrieves a description for the morpheme from the grammar:


```
>>> mappings = transducer.reference.get_tag('+PL')
>>> mappings['ref']
'-лар (and allomorphs) forms the plural affix from various type of stems. The interrogative pronoun ким takes the special form нээх to form the plural, after which a regular plural affix can be used for emphasis [Ubryatova et al., §329].'
```

These include alternative tags to map to different formats:


```
>>> mappings['ud']
{'Number': 'Plur'}
```

**ATTENTION**: the default reference still needs to be tested and improved, and help from specialists in the Yakut language is very welcome.


### Modifying the default reference

The default reference can be manually edited as a normal dictionary object:

```
>>> mappings.update({'custom': 'plural affix'})
>>> mappings['custom']
'plural affix'
```

The `parse` method of the `YakutMorph` class applies the edited reference to the `Morph` objects it creates.


### Initializing a custom reference

Each transducer implements its own reference. This means that in a morphological pipeline with several transducers, each reference has to be edited separately. This can be avoided by injecting an edited `YakutReference` object when initializing `YakutMorph`:


```
>>> from yakutmorph.main import YakutMorph
>>> from yakutmorph.transducers import YakutMorphReference
>>> custom_reference = YakutMorphReference('my_reference.yaml')
>>> morphology = YakutMorph(reference=custom_reference)
```


### Loading a custom reference

The `YakutReference` object is loaded from a `yaml` file. The default reference is located at `yakutmorph/data/morph_reference.yaml`. It is possible to load a custom `yaml` file, as long as it follows this key-value structure:

```
general_type:
    affix_1:
        key_1: value_1
        key_2: value_2
    affix_2:
        key_1: value_1
        key_2: value_2
    ...
```
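
Before loading a custom file, it can help to check that its parsed content has the three-level nesting shown above. The checker below is a hypothetical sketch that works on an already parsed dictionary, for example the result of a YAML parser such as PyYAML's `yaml.safe_load`; it is not part of the library and checks nothing beyond the nesting.

```python
from typing import Any, List


def reference_problems(data: Any) -> List[str]:
    """List places where `data` departs from type -> affix -> key/value nesting."""
    if not isinstance(data, dict):
        return ["top level must be a mapping"]
    problems = []
    for general_type, affixes in data.items():
        if not isinstance(affixes, dict):
            problems.append(f"{general_type}: expected a mapping of affixes")
            continue
        for affix, mappings in affixes.items():
            if not isinstance(mappings, dict):
                problems.append(f"{general_type}.{affix}: expected key-value pairs")
    return problems


good = {"general_type": {"affix_1": {"key_1": "value_1", "key_2": "value_2"}}}
bad = {"general_type": {"affix_1": "value_1"}}
print(reference_problems(good))
print(reference_problems(bad))
```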


## Contact

The project is currently under development. If you would like to collaborate in the process, report an issue, or need assistance with using, implementing, or testing the morphology analyzer, please feel free to contact us.

In principle, the project could be adapted to work with other languages of the Turkic family.

Special thanks to:

- Helmut Schmid, for developing the SFST toolkit: https://www.cis.uni-muenchen.de/~schmid/tools/SFST/
- Gregor Middell, for the Python bindings: https://pypi.org/project/sfst-transduce/
