# Description
LemonTizer is a class that wraps the [spacy](https://spacy.io) library to build a lemmatizer for language learning applications. It automatically manages the installation and loading of all [languages](https://spacy.io/models) supported by spacy and provides various lemmatization options.
It is designed so that lemmatization can be enabled for multiple languages with the same amount of effort as enabling it for one, thus making community-made scripts more widely accessible.
(For those curious, "lemon tizer" is a pun on Tizer, the [Scottish soft drink](https://en.wikipedia.org/wiki/Tizer), which used to come in various fruit flavours.)
# Quickstart
First, install lemon-tizer using pip:
```bash
pip install lemon-tizer
```
Example of lemmatizing a single sentence:
```python
# Import class
from lemon_tizer import LemonTizer
# Initialise class
# Language should be a lower case two-letter code, see the "Supported languages" table for the list of abbreviations
# Model size depends on the availability of models, see https://spacy.io/models
# Normally, the sizes are "sm", "md" and "lg"
# Larger models are more accurate and support more features but require more storage space and may take longer to run
lemma = LemonTizer(language="en", model_size="lg")
# Lemmatize a test string and print the result
test_string = "I am going to the shops to buy a can of Tizer."
output = lemma.lemmatize_sentence(test_string)
print(output)
```
This would produce the following output:
```python
"""
Output:
[{'I': 'I'},
{'am': 'be'},
{'going': 'go'},
{'to': 'to'},
{'the': 'the'},
{'shops': 'shop'},
{'to': 'to'},
{'buy': 'buy'},
{'a': 'a'},
{'can': 'can'},
{'of': 'of'},
{'Tizer': 'Tizer'},
{'.': '.'}]
"""
```
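Because each token comes back as its own single-entry dictionary, it can be handy to flatten the result into `(token, lemma)` pairs. The following is a minimal sketch that uses the first few entries of the output above as literal data, so it runs without loading a model. `to_pairs` is a hypothetical helper name and is not part of lemon-tizer.

```python
def to_pairs(lemmatized: list[dict[str, str]]) -> list[tuple[str, str]]:
    """Flatten a list of {token: lemma} dictionaries into (token, lemma) tuples."""
    return [(token, lemma) for entry in lemmatized for token, lemma in entry.items()]


# First few entries of the output shown above, used here as sample data
sample = [{"I": "I"}, {"am": "be"}, {"going": "go"}, {"to": "to"}]

for token, lemma in to_pairs(sample):
    print(f"{token} -> {lemma}")
```

Keeping a list rather than merging everything into one dictionary preserves token order and repeated tokens (such as the two occurrences of "to" in the example sentence).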
# Script settings
You can also enable various settings to exclude punctuation, exclude common words, force the input or output to lower case, etc. A typical use case is a frequency analysis that counts the words in a text.
Example:
```python
# Import class
from lemon_tizer import LemonTizer
# Initialise class
lemma = LemonTizer(language="en", model_size="lg")
# Configure settings
lemma.set_lemma_settings(
    filter_out_non_alpha=True,
    filter_out_common=True,
    convert_input_to_lower=True,
    convert_output_to_lower=True,
    return_just_first_word_of_lemma=True,
)
# Lemmatize a test string and print the result
test_string = "I am going to the shops to buy a can of Tizer."
output = lemma.lemmatize_sentence(test_string)
print(output)
```
This would produce the following output:
```python
"""
Output:
[{'going': 'go'}, {'shops': 'shop'}, {'buy': 'buy'}, {'tizer': 'tizer'}]
"""
```
The options are:
| Boolean Variable | Explanation |
| ------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| filter_out_non_alpha | Will filter out lemmatizations that contain non-alpha characters. Useful for removing punctuation, etc. Note: lemmatizations with an apostrophe will also be filtered if this is set! |
| filter_out_common | Will filter out common words such as "the, and, she". Useful when doing frequency analysis. |
| convert_input_to_lower | Forces the input string to lowercase before lemmatization. This changes the behaviour of the algorithm, particularly in relation to the identification of proper nouns, and may be useful to increase accuracy in some languages. |
| convert_output_to_lower | Forces the lemmatization to be lower case. |
| return_just_first_word_of_lemma | Some lemmatizations will return multiple words for a given input token. Setting this to True will return just the first word. |
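As a rough illustration of the frequency-analysis use case, the sketch below counts lemmas in a result shaped like the filtered output above. It uses literal sample data rather than a loaded model. The `{'went': 'go'}` entry is added purely for illustration so that one lemma appears twice, and `lemma_frequencies` is a hypothetical helper name, not part of lemon-tizer.

```python
from collections import Counter


def lemma_frequencies(lemmatized: list[dict[str, str]]) -> Counter:
    """Count how often each lemma (the dictionary values) occurs."""
    return Counter(lemma for entry in lemmatized for lemma in entry.values())


# Shaped like the filtered output above; the last entry is illustrative only
sample = [
    {"going": "go"},
    {"shops": "shop"},
    {"buy": "buy"},
    {"tizer": "tizer"},
    {"went": "go"},
]

frequencies = lemma_frequencies(sample)
print(frequencies.most_common(1))  # [('go', 2)]
```

Counting the values rather than the keys is what groups inflected forms such as "going" and "went" under the same lemma.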
# Advanced Functions
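The `get_spacy_object` property returns the underlying spacy `Language` object, which has been initialised to the loaded model, should you wish to use functions not exposed by the wrapper. Because it is declared as a property in the reference below, access it on an instance without parentheses, e.g. `lemma.get_spacy_object`.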
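A minimal sketch of what this might look like in practice is shown below. `describe_tokens` is a hypothetical helper, not part of lemon-tizer. It relies only on standard spacy token attributes (`text`, `pos_`, `lemma_`) and assumes that the object passed in is the spacy `Language` object returned by the wrapper. The intended usage is left as comments because it needs lemon-tizer and a downloaded model.

```python
def describe_tokens(nlp, text: str) -> list[tuple[str, str, str]]:
    """Return (text, part-of-speech tag, lemma) for each token in `text`.

    `nlp` is expected to be a spacy Language object, for example the value of
    the `get_spacy_object` property on an initialised LemonTizer instance.
    """
    doc = nlp(text)
    return [(token.text, token.pos_, token.lemma_) for token in doc]


# Intended usage (assumes lemon-tizer and an English model are available):
#   from lemon_tizer import LemonTizer
#   lemma = LemonTizer(language="en", model_size="sm")
#   print(describe_tokens(lemma.get_spacy_object, "I am going to the shops."))
```

Part-of-speech tags are one example of information that spacy provides but that `lemmatize_sentence()` does not return.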
# Public Functions and Properties
```python
def init_model(language: str, model_size: str) -> None:
    """Loads model based upon specified language and model size.
    If model hasn't been downloaded, it will download it prior to the loading step.
    Also loads default settings for lemmatization.

    Args:
        language: Lower case two letter code matching language codes in https://spacy.io/models
        model_size: Lower case two letter code matching sm, md, lg, etc.
            in https://spacy.io/models
    """

def set_lemma_settings(filter_out_non_alpha: bool = False,
                       filter_out_common: bool = False,
                       convert_input_to_lower: bool = False,
                       convert_output_to_lower: bool = False,
                       return_just_first_word_of_lemma: bool = False) -> None:
    """Sets various settings for lemmatization

    Args:
        filter_out_non_alpha: (bool) Will filter out lemmatizations that contain non-alpha
            characters. Useful for removing punctuation, etc. Note: lemmatizations with an
            apostrophe will also be filtered if this is set!
        filter_out_common: (bool) Will filter out common words such as "the, and, she". Useful
            when doing frequency analysis.
        convert_input_to_lower: (bool) Forces the input string to lowercase. May be useful to
            increase accuracy in some languages.
        convert_output_to_lower: (bool) Optionally force the lemmatization to be lower case.
        return_just_first_word_of_lemma: (bool) Some lemmatizations will return multiple words
            for a given input token. Setting this to True will return just the first word.
    """

def lemmatize_sentence(input_str: str) -> list[dict[str, str]]:
    """Lemmatizes a sentence (can also be a word, paragraph, etc.)

    Returns:
        List of dictionaries, each with the original token as the key (str) and the
        lemmatized token as the value (str)

    Args:
        input_str: String containing the data to be lemmatized
    """

def find_model_name(language: str, model_size: str) -> str:
    """Looks up models compatible with the installed version of spacy, based upon language code
    and model size.

    Returns:
        spacy model name (str)

    Args:
        language: Lower case two letter code matching language codes in https://spacy.io/models
        model_size: Lower case two letter code matching sm, md, lg, etc.
            in https://spacy.io/models
    """

def download_model(model_name: str) -> None:
    """Downloads spacy model ("trained pipeline") to local storage

    Args:
        model_name: should match a model in the spacy documentation,
            see https://spacy.io/models

    Use the method is_model_installed() if you need to check if model has already been
    downloaded.

    Use the method find_model_name() to get available models based upon language and model size
    """

def get_available_models() -> list[str]:
    """Gets the list of available pre-trained models for the installed version of spacy

    Returns:
        List of strings with the names of spacy trained models
    """

def is_model_installed(model_name: str) -> bool:
    """
    Returns:
        True if model is found in local storage, otherwise False
    """

@property
def get_current_model_name() -> str:
    """
    Returns:
        Name of currently loaded model as a str
    """

@property
def get_spacy_object() -> spacy.language.Language:
    """
    Returns:
        The spacy Language object aka "model" for external processing
    """
```
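The docstrings above suggest that `find_model_name()`, `is_model_installed()` and `download_model()` can be combined to fetch a model ahead of time, for example before going offline. The sketch below shows one way this might be wired together. `ensure_model` is a hypothetical helper, and it assumes these functions are available as methods on a LemonTizer instance, which the stubs above do not state explicitly. According to its docstring, `init_model()` already downloads missing models, so this is only relevant when you want explicit control.

```python
def ensure_model(tizer, language: str, model_size: str) -> str:
    """Look up the model name and download the model only if it is missing.

    `tizer` is assumed to expose find_model_name(), is_model_installed() and
    download_model() with the signatures listed in the reference above.
    """
    model_name = tizer.find_model_name(language, model_size)
    if not tizer.is_model_installed(model_name):
        tizer.download_model(model_name)
    return model_name


# Intended usage (assumes lemon-tizer is installed; may trigger a download):
#   from lemon_tizer import LemonTizer
#   lemma = LemonTizer(language="en", model_size="sm")
#   print(ensure_model(lemma, "de", "sm"))
```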
# Supported languages
The supported languages are determined by the installed version of spacy, see here: [languages](https://spacy.io/models).
At the time of writing, the following languages are supported:
| Abbreviation | Language Name |
| ------------ | ---------------- |
| ca | Catalan |
| zh | Chinese |
| hr | Croatian |
| da | Danish |
| nl | Dutch |
| en | English |
| fi | Finnish |
| fr | French |
| de | German |
| el | Greek |
| it | Italian |
| ja | Japanese |
| ko | Korean |
| lt | Lithuanian |
| mk | Macedonian |
| xx | Multi-language |
| nb | Norwegian Bokmål |
| pl | Polish |
| pt | Portuguese |
| ro | Romanian |
| ru | Russian |
| sl | Slovenian |
| es | Spanish |
| sv | Swedish |
| uk | Ukrainian |
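Since the language is selected only by its abbreviation, handling several languages can be reduced to a loop over the codes in the table above. The following is a sketch under that assumption. `lemmatize_many` and the `make_lemmatizer` factory argument are hypothetical names, not part of lemon-tizer. The factory is passed in so that the model size can be chosen per project, as not every language necessarily offers every size.

```python
from typing import Callable


def lemmatize_many(texts: dict[str, str],
                   make_lemmatizer: Callable) -> dict[str, list[dict[str, str]]]:
    """Lemmatize one text per language code.

    `texts` maps a two-letter language code to the text in that language.
    `make_lemmatizer` takes a language code and returns an object that has a
    lemmatize_sentence() method, such as a LemonTizer instance.
    """
    results = {}
    for language, text in texts.items():
        lemmatizer = make_lemmatizer(language)
        results[language] = lemmatizer.lemmatize_sentence(text)
    return results


# Intended usage (assumes lemon-tizer is installed and "sm" models exist for both codes):
#   from lemon_tizer import LemonTizer
#   texts = {"en": "I am going to the shops.", "de": "Ich gehe einkaufen."}
#   print(lemmatize_many(texts, lambda code: LemonTizer(language=code, model_size="sm")))
```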
# Acknowledgements
Unless otherwise noted, all materials within this repository are Copyright (C) 2024 Jonathan Fox.