simplemma

- Name: simplemma
- Version: 1.1.1
- Summary: A lightweight toolkit for multilingual lemmatization and language detection.
- Upload time: 2024-08-08 12:20:45
- Requires Python: >=3.8
- License: MIT License
- Keywords: language detection, language identification, langid, lemmatization, lemmatizer, lemmatiser, nlp, tokenization, tokenizer
# Simplemma: a simple multilingual lemmatizer for Python

[![Python package](https://img.shields.io/pypi/v/simplemma.svg)](https://pypi.python.org/pypi/simplemma)
[![Python versions](https://img.shields.io/pypi/pyversions/simplemma.svg)](https://pypi.python.org/pypi/simplemma)
[![Code Coverage](https://img.shields.io/codecov/c/github/adbar/simplemma.svg)](https://codecov.io/gh/adbar/simplemma)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![Reference DOI: 10.5281/zenodo.4673264](https://img.shields.io/badge/DOI-10.5281%2Fzenodo.4673264-brightgreen)](https://doi.org/10.5281/zenodo.4673264)


## Purpose

[Lemmatization](https://en.wikipedia.org/wiki/Lemmatisation) is the
process of grouping together the inflected forms of a word so they can
be analysed as a single item, identified by the word's lemma, or
dictionary form. Unlike stemming, lemmatization outputs word units that
are still valid linguistic forms.

In modern natural language processing (NLP), this task is often
tackled indirectly by more complex systems encompassing a whole
processing pipeline. However, there is no straightforward way to
address lemmatization in Python, although this task can be crucial in
fields such as information retrieval and NLP.

*Simplemma* provides a simple and multilingual approach to looking up
base forms or lemmata. It may not be as powerful as full-fledged solutions
but it is generic, easy to install and straightforward to use. In
particular, it does not need morphosyntactic information and can process
a raw series of tokens or even a text with its built-in tokenizer. By
design it should be reasonably fast and work in a large majority of
cases, without being perfect.

With its comparatively small footprint it is especially useful when
speed and simplicity matter, in low-resource contexts, for educational
purposes, or as a baseline system for lemmatization and morphological
analysis.

Currently, 49 languages are partly or fully supported (see table below).


## Installation

The current library is written in pure Python with no dependencies:
`pip install simplemma`

- `pip3` where applicable
- `pip install -U simplemma` for updates
- `pip install git+https://github.com/adbar/simplemma` for the cutting-edge version

The last version supporting Python 3.6 and 3.7 is `simplemma==1.0.0`.


## Usage

### Word-by-word

Simplemma is used by selecting a language of interest and then applying
the data to a list of words.

``` python
>>> import simplemma
# get a word
>>> myword = 'masks'
# decide which language to use and apply it on a word form
>>> simplemma.lemmatize(myword, lang='en')
'mask'
# grab a list of tokens
>>> mytokens = ['Hier', 'sind', 'Vaccines']
>>> for token in mytokens:
...     print(simplemma.lemmatize(token, lang='de'))
hier
sein
Vaccines
# list comprehensions can be faster
>>> [simplemma.lemmatize(t, lang='de') for t in mytokens]
['hier', 'sein', 'Vaccines']
```


### Chaining languages

Chaining several languages can improve coverage; they are used in
sequence:

``` python
>>> from simplemma import lemmatize
>>> lemmatize('Vaccines', lang=('de', 'en'))
'vaccine'
>>> lemmatize('spaghettis', lang='it')
'spaghettis'
>>> lemmatize('spaghettis', lang=('it', 'fr'))
'spaghetti'
>>> lemmatize('spaghetti', lang=('it', 'fr'))
'spaghetto'
```


### Greedier decomposition

For certain languages a greedier decomposition is activated by default
as it can be beneficial, mostly due to a certain capacity to address
affixes in an unsupervised way. This can be triggered manually by
setting the `greedy` parameter to `True`.

This option also triggers a stronger reduction through an additional
iteration of the search algorithm, e.g. "angekündigten" →
"angekündigt" (standard) → "ankündigen" (greedy). In some cases it
may be closer to stemming than to lemmatization.

``` python
# same example as before, comes to this result in one step
>>> simplemma.lemmatize('spaghettis', lang=('it', 'fr'), greedy=True)
'spaghetto'
# German case described above
>>> simplemma.lemmatize('angekündigten', lang='de', greedy=True)
'ankündigen' # 2 steps: reduction to infinitive verb
>>> simplemma.lemmatize('angekündigten', lang='de', greedy=False)
'angekündigt' # 1 step: reduction to past participle
```


### is_known()

The additional function `is_known()` checks if a given word is present
in the language data:

``` python
>>> from simplemma import is_known
>>> is_known('spaghetti', lang='it')
True
```


### Tokenization

A simple tokenization function is provided for convenience:

``` python
>>> from simplemma import simple_tokenizer
>>> simple_tokenizer('Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.')
['Lorem', 'ipsum', 'dolor', 'sit', 'amet', ',', 'consectetur', 'adipiscing', 'elit', ',', 'sed', 'do', 'eiusmod', 'tempor', 'incididunt', 'ut', 'labore', 'et', 'dolore', 'magna', 'aliqua', '.']
# use iterator instead
>>> simple_tokenizer('Lorem ipsum dolor sit amet', iterate=True)
```

The functions `text_lemmatizer()` and `lemma_iterator()` chain
tokenization and lemmatization. They can take `greedy` (affecting
lemmatization) and `silent` (affecting errors and logging) as arguments:

``` python
>>> from simplemma import text_lemmatizer
>>> sentence = 'Sou o intervalo entre o que desejo ser e os outros me fizeram.'
>>> text_lemmatizer(sentence, lang='pt')
# caveat: desejo is also a noun, should be desejar here
['ser', 'o', 'intervalo', 'entre', 'o', 'que', 'desejo', 'ser', 'e', 'o', 'outro', 'me', 'fazer', '.']
# same principle, returns a generator and not a list
>>> from simplemma import lemma_iterator
>>> lemma_iterator(sentence, lang='pt')
```
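
For instance, both options can be passed together. The outputs are
omitted here since the greedy results may differ from the standard ones
shown above (a brief sketch reusing the `sentence` defined in the
previous snippet):

``` python
>>> from simplemma import text_lemmatizer
# greedy affects lemmatization, silent affects errors and logging
>>> text_lemmatizer(sentence, lang='pt', greedy=True, silent=True)
```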


### Caveats

``` python
# don't expect too much though
# this diminutive form isn't in the model data
>>> simplemma.lemmatize('spaghettini', lang='it')
'spaghettini' # should read 'spaghettino'
# the algorithm cannot choose between valid alternatives yet
>>> simplemma.lemmatize('son', lang='es')
'son' # valid common noun, but what about the verb form?
```

As the focus lies on overall coverage, some short frequent words
(typically pronouns and conjunctions) may need post-processing; this
generally concerns a few dozen tokens per language.
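
A minimal post-processing sketch, assuming a small hand-curated
override table (the mapping below is only illustrative, taking up the
`desejo`/`desejar` caveat from the Portuguese example above, and it
deliberately ignores context):

``` python
from simplemma import text_lemmatizer

# hypothetical hand-made overrides applied after lemmatization
overrides = {'desejo': 'desejar'}
sentence = 'Sou o intervalo entre o que desejo ser e os outros me fizeram.'
lemmas = [overrides.get(lemma, lemma) for lemma in text_lemmatizer(sentence, lang='pt')]
```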

The current absence of morphosyntactic information is an advantage in
terms of simplicity, but it is also an impassable frontier for
lemmatization accuracy, for example when it comes to disambiguating
between past participles and adjectives derived from verbs in Germanic
and Romance languages. In such cases, `simplemma` often leaves the
input word unchanged.

The greedy algorithm seldom produces invalid forms. It is designed to
work best in the low-frequency range, notably for compound words and
neologisms. Aggressive decomposition is only useful as a general
approach in the case of morphologically-rich languages, where it can
also act as a linguistically motivated stemmer.

Bug reports over the [issues
page](https://github.com/adbar/simplemma/issues) are welcome.


### Language detection

Language detection works by providing a text and a `lang` tuple
consisting of the languages of interest. Scores between 0 and 1 are
returned.

The `lang_detector()` function returns a list of language codes along
with their corresponding scores, appending "unk" for unknown or
out-of-vocabulary words. The proportion of words belonging to the
target language can also be obtained with the function
`in_target_language()`, which returns a single ratio.

``` python
# import necessary functions
>>> from simplemma import in_target_language, lang_detector
# language detection
>>> lang_detector('"Exoplaneta, též extrasolární planeta, je planeta obíhající kolem jiné hvězdy než kolem Slunce."', lang=("cs", "sk"))
[("cs", 0.75), ("sk", 0.125), ("unk", 0.25)]
# proportion of known words
>>> in_target_language("opera post physica posita (τὰ μετὰ τὰ φυσικά)", lang="la")
0.5
```

The `greedy` argument (called `extensive` in past software versions)
triggers the greedier decomposition algorithm described above, thus
extending word coverage and detection recall at the potential cost of
lower accuracy.
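
As a brief sketch (outputs omitted, since the exact scores depend on
the dictionary data and the chosen mode):

``` python
>>> from simplemma import lang_detector
>>> text = 'Exoplaneta, též extrasolární planeta, je planeta obíhající kolem jiné hvězdy.'
# standard decomposition
>>> lang_detector(text, lang=('cs', 'sk'))
# greedier decomposition, trading some accuracy for coverage
>>> lang_detector(text, lang=('cs', 'sk'), greedy=True)
```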


### Advanced usage via classes

The functions described above are suitable for simple usage, but you
can have more control by instantiating Simplemma classes and calling
their methods instead. Lemmatization is handled by the `Lemmatizer`
class, while language detection is handled by the `LanguageDetector`
class. These in turn rely on different lemmatization strategies, which
are implementations of the `LemmatizationStrategy` protocol. The
`DefaultStrategy` implementation uses a combination of different
strategies, one of which is `DictionaryLookupStrategy`. It looks up
tokens in a dictionary created by a `DictionaryFactory`.

For example, it is possible to conserve RAM by limiting the number of
cached language dictionaries (default: 8) by creating a custom
`DefaultDictionaryFactory` with a specific `cache_max_size` setting,
creating a `DefaultStrategy` using that factory, and then creating a
`Lemmatizer` and/or a `LanguageDetector` using that strategy:

``` python
# import necessary classes
>>> from simplemma import LanguageDetector, Lemmatizer
>>> from simplemma.strategies import DefaultStrategy
>>> from simplemma.strategies.dictionaries import DefaultDictionaryFactory

>>> LANG_CACHE_SIZE = 5  # how many language dictionaries to keep in memory at once (max)
>>> dictionary_factory = DefaultDictionaryFactory(cache_max_size=LANG_CACHE_SIZE)
>>> lemmatization_strategy = DefaultStrategy(dictionary_factory=dictionary_factory)

# lemmatize using the above customized strategy
>>> lemmatizer = Lemmatizer(lemmatization_strategy=lemmatization_strategy)
>>> lemmatizer.lemmatize('doughnuts', lang='en')
'doughnut'

# detect languages using the above customized strategy
>>> language_detector = LanguageDetector('la', lemmatization_strategy=lemmatization_strategy)
>>> language_detector.proportion_in_target_languages("opera post physica posita (τὰ μετὰ τὰ φυσικά)")
0.5
```

For more information see the
[extended documentation](https://adbar.github.io/simplemma/).


### Reducing memory usage

Simplemma provides an alternative solution for situations where low
memory usage and fast initialization time are more important than
lemmatization and language detection performance. This solution uses a
`DictionaryFactory` that employs a trie as its underlying data structure,
rather than a Python dict.

The `TrieDictionaryFactory` reduces memory usage by an average of
20x and initialization time by 100x, but this comes at the cost of
potentially reducing performance by 50% or more, depending on the
specific usage.

To use the `TrieDictionaryFactory` you have to install Simplemma with
the `marisa-trie` extra dependency (available from version 1.1.0):

```
pip install simplemma[marisa-trie]
```

Then you have to create a custom strategy using the
`TrieDictionaryFactory` and use that for `Lemmatizer` and
`LanguageDetector` instances:

``` python
>>> from simplemma import LanguageDetector, Lemmatizer
>>> from simplemma.strategies import DefaultStrategy
>>> from simplemma.strategies.dictionaries import TrieDictionaryFactory

>>> lemmatization_strategy = DefaultStrategy(dictionary_factory=TrieDictionaryFactory())

>>> lemmatizer = Lemmatizer(lemmatization_strategy=lemmatization_strategy)
>>> lemmatizer.lemmatize('doughnuts', lang='en')
'doughnut'

>>> language_detector = LanguageDetector('la', lemmatization_strategy=lemmatization_strategy)
>>> language_detector.proportion_in_target_languages("opera post physica posita (τὰ μετὰ τὰ φυσικά)")
0.5
```

While memory usage and initialization time with the
`TrieDictionaryFactory` are significantly lower than with the
`DefaultDictionaryFactory`, this only holds once the trie dictionaries
are available on disk. That is not the case the first time the
`TrieDictionaryFactory` is used, as Simplemma only ships the
dictionaries as Python dicts. The trie dictionaries are generated on
the fly the first time a language is used: this takes a few seconds and
uses about as much memory as loading the Python dict for that language.
On subsequent invocations the trie dictionaries are read from the
on-disk cache.

If the computer that is supposed to run Simplemma does not have enough
memory to generate the trie dictionaries, they can also be generated on
another computer with the same CPU architecture and copied over to the
cache directory.
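
One way to pre-generate the caches is to instantiate the trie-backed
strategy on a machine with enough memory and trigger a lookup for each
language of interest. This is a sketch, assuming that a single lookup
per language is enough to start the on-the-fly conversion described
above:

``` python
# warm the on-disk trie cache for selected languages
from simplemma import Lemmatizer
from simplemma.strategies import DefaultStrategy
from simplemma.strategies.dictionaries import TrieDictionaryFactory

strategy = DefaultStrategy(dictionary_factory=TrieDictionaryFactory())
lemmatizer = Lemmatizer(lemmatization_strategy=strategy)

for lang in ('de', 'en', 'fr'):  # languages of interest
    # the first lookup per language builds and caches the trie dictionary
    lemmatizer.lemmatize('test', lang=lang)
```

The resulting cache directory can then be copied to the target machine.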


## Supported languages

The following languages are available, identified by their [BCP 47
language tag](https://en.wikipedia.org/wiki/IETF_language_tag), which
typically corresponds to the [ISO 639-1 code](
https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes).
If no such code exists, an [ISO 639-3
code](https://en.wikipedia.org/wiki/List_of_ISO_639-3_codes) is
used instead.

Available languages (2022-01-20):


| Code | Language | Forms (10³) | Acc. | Comments |
| ---- | -------- | ----------- | ---- | -------- |
| `ast` | Asturian | 124 | | |
| `bg` | Bulgarian | 204 | | |
| `ca` | Catalan | 579 | | |
| `cs` | Czech | 187 | 0.89 | on UD CS-PDT |
| `cy` | Welsh | 360 | | |
| `da` | Danish | 554 | 0.92 | on UD DA-DDT, alternative: [lemmy](https://github.com/sorenlind/lemmy) |
| `de` | German | 675 | 0.95 | on UD DE-GSD, see also [German-NLP list](https://github.com/adbar/German-NLP#Lemmatization) |
| `el` | Greek | 181 | 0.88 | on UD EL-GDT |
| `en` | English | 131 | 0.94 | on UD EN-GUM, alternative: [LemmInflect](https://github.com/bjascob/LemmInflect) |
| `enm` | Middle English | 38 | | |
| `es` | Spanish | 665 | 0.95 | on UD ES-GSD |
| `et` | Estonian | 119 | | low coverage |
| `fa` | Persian | 12 | | experimental |
| `fi` | Finnish | 3,199 | | see [this benchmark](https://github.com/aajanki/finnish-pos-accuracy) |
| `fr` | French | 217 | 0.94 | on UD FR-GSD |
| `ga` | Irish | 372 | | |
| `gd` | Gaelic | 48 | | |
| `gl` | Galician | 384 | | |
| `gv` | Manx | 62 | | |
| `hbs` | Serbo-Croatian | 656 | | Croatian and Serbian lists to be added later |
| `hi` | Hindi | 58 | | experimental |
| `hu` | Hungarian | 458 | | |
| `hy` | Armenian | 246 | | |
| `id` | Indonesian | 17 | 0.91 | on UD ID-CSUI |
| `is` | Icelandic | 174 | | |
| `it` | Italian | 333 | 0.93 | on UD IT-ISDT |
| `ka` | Georgian | 65 | | |
| `la` | Latin | 843 | | |
| `lb` | Luxembourgish | 305 | | |
| `lt` | Lithuanian | 247 | | |
| `lv` | Latvian | 164 | | |
| `mk` | Macedonian | 56 | | |
| `ms` | Malay | 14 | | |
| `nb` | Norwegian (Bokmål) | 617 | | |
| `nl` | Dutch | 250 | 0.92 | on UD NL-Alpino |
| `nn` | Norwegian (Nynorsk) | 56 | | |
| `pl` | Polish | 3,211 | 0.91 | on UD PL-PDB |
| `pt` | Portuguese | 924 | 0.92 | on UD PT-GSD |
| `ro` | Romanian | 311 | | |
| `ru` | Russian | 595 | | alternative: [pymorphy2](https://github.com/kmike/pymorphy2/) |
| `se` | Northern Sámi | 113 | | |
| `sk` | Slovak | 818 | 0.92 | on UD SK-SNK |
| `sl` | Slovene | 136 | | |
| `sq` | Albanian | 35 | | |
| `sv` | Swedish | 658 | | alternative: [lemmy](https://github.com/sorenlind/lemmy) |
| `sw` | Swahili | 10 | | experimental |
| `tl` | Tagalog | 32 | | experimental |
| `tr` | Turkish | 1,232 | 0.89 | on UD TR-Boun |
| `uk` | Ukrainian | 370 | | alternative: [pymorphy2](https://github.com/kmike/pymorphy2/) |


Languages marked as having low coverage may be better suited to
language-specific libraries, but Simplemma can still provide limited
functionality. Where possible, open-source Python alternatives are
referenced.

*Experimental* mentions indicate that the language remains untested or
that there could be issues with the underlying data or lemmatization
process.

The scores are calculated on [Universal
Dependencies](https://universaldependencies.org/) treebanks, on single
word tokens (including some contractions but not merged prepositions);
they describe to what extent Simplemma can accurately map tokens to
their lemma form. See the `eval/` folder of the code repository for
more information.

This library is particularly relevant for the lemmatization of less
frequent words, a case whose performance is only incidentally captured
by the benchmark above. In some languages, a fixed number of words such
as pronouns can additionally be mapped by hand to enhance performance.


## Speed

The following orders of magnitude are provided for reference only and
were measured on an old laptop to establish a lower bound:

-   Tokenization: > 1 million tokens/sec
-   Lemmatization: > 250,000 words/sec

Using the most recent Python version (e.g. installed via `pyenv`) can
make the package run faster.
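
A rough way to reproduce such a measurement on your own hardware (a
sketch only; the figures vary considerably with machine, language and
input):

``` python
# crude lemmatization throughput estimate
import time

import simplemma

words = ['masks', 'doughnuts', 'vaccines', 'is'] * 25_000
start = time.perf_counter()
for word in words:
    simplemma.lemmatize(word, lang='en')
elapsed = time.perf_counter() - start
print(f'{len(words) / elapsed:,.0f} words/sec')
```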


## Roadmap

- [x] Add further lemmatization lists
- [ ] Grammatical categories as option
- [ ] Function as a meta-package?
- [ ] Integrate optional, more complex models?


## Credits and licenses

The software is licensed under the MIT license. For information on the
licenses of the linguistic information databases, see the `licenses` folder.

The surface lookups (non-greedy mode) rely on lemmatization lists derived
from the following sources, listed in order of relative importance:

-   [Lemmatization
    lists](https://github.com/michmech/lemmatization-lists) by Michal
    Měchura (Open Database License)
-   Wiktionary entries packaged by the [Kaikki
    project](https://kaikki.org/)
-   [FreeLing project](https://github.com/TALP-UPC/FreeLing)
-   [spaCy lookups
    data](https://github.com/explosion/spacy-lookups-data)
-   [Unimorph Project](https://unimorph.github.io/)
-   [Wikinflection
    corpus](https://github.com/lenakmeth/Wikinflection-Corpus) by Eleni
    Metheniti (CC BY 4.0 License)


## Contributions

This package was first created and published by Adrien Barbaresi.
It has since benefited from extensive refactoring by Juanjo Diaz (especially the new classes).
See the [full list of contributors](https://github.com/adbar/simplemma/graphs/contributors)
to the repository.

Feel free to contribute, notably by [filing
issues](https://github.com/adbar/simplemma/issues/) for feedback, bug
reports, or links to further lemmatization lists, rules and tests.

Contributions by pull request ought to follow these conventions: code
style with [black](https://github.com/psf/black), type hinting with
[mypy](https://github.com/python/mypy), and tests included with
[pytest](https://pytest.org).


## Other solutions

See lists: [German-NLP](https://github.com/adbar/German-NLP) and [other
awesome-NLP lists](https://github.com/adbar/German-NLP#More-lists).

For another approach in Python, see spaCy's
[edit tree lemmatizer](https://spacy.io/api/edittreelemmatizer).


## References

To cite this software:

[![Reference DOI: 10.5281/zenodo.4673264](https://img.shields.io/badge/DOI-10.5281%2Fzenodo.4673264-brightgreen)](https://doi.org/10.5281/zenodo.4673264)

Barbaresi A. (*year*). Simplemma: a simple multilingual lemmatizer for
Python [Computer software] (Version *version number*). Berlin,
Germany: Berlin-Brandenburg Academy of Sciences. Available from
<https://github.com/adbar/simplemma>, DOI: 10.5281/zenodo.4673264

This work draws from lexical analysis algorithms used in:

-   Barbaresi, A., & Hein, K. (2017). [Data-driven identification of
    German phrasal
    compounds](https://hal.archives-ouvertes.fr/hal-01575651/document).
    In International Conference on Text, Speech, and Dialogue, Springer,
    pp. 192-200.
-   Barbaresi, A. (2016). [An unsupervised morphological criterion for
    discriminating similar
    languages](https://aclanthology.org/W16-4827/). In 3rd Workshop on
    NLP for Similar Languages, Varieties and Dialects (VarDial 2016),
    Association for Computational Linguistics, pp. 212-220.
-   Barbaresi, A. (2016). [Bootstrapped OCR error detection for a
    less-resourced language
    variant](https://hal.archives-ouvertes.fr/hal-01371689/document). In
    13th Conference on Natural Language Processing (KONVENS 2016), pp.
    21-26.

            
