# **bgnlp**: Model-first approach to Bulgarian NLP
<a href="https://colab.research.google.com/drive/1etvcxad0f754pjSdjremDftq16o_oMTh?usp=sharing"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"></a>
[![Downloads](https://static.pepy.tech/personalized-badge/bgnlp?period=total&units=international_system&left_color=grey&right_color=blue&left_text=pip%20downloads)](https://pypi.org/project/bgnlp/)
```sh
pip install bgnlp
```
## Package features
- [Part-of-speech](#pos)
- [Lemmatization](#lemma)
- [Named Entity Recognition](#ner)
- [Keyword Extraction](#keywords)
- [Commatization](#comma)
> Note: the first time you run one of these operations, a model is downloaded, so that first run may take longer.
<a id="pos"></a>
### Part-of-speech (PoS) tagging
```python
from bgnlp import pos
print(pos("Това е библиотека за обработка на естествен език."))
```
```json
[{
    "word": "Това",
    "tag": "PDOsn",
    "bg_desc": "местоимение",
    "en_desc": "pronoun"
}, {
    "word": "е",
    "tag": "VLINr3s",
    "bg_desc": "глагол",
    "en_desc": "verb"
}, {
    "word": "библиотека",
    "tag": "NCFsof",
    "bg_desc": "съществително име",
    "en_desc": "noun"
}, {
    "word": "за",
    "tag": "R",
    "bg_desc": "предлог",
    "en_desc": "preposition"
}, {
    "word": "обработка",
    "tag": "NCFsof",
    "bg_desc": "съществително име",
    "en_desc": "noun"
}, {
    "word": "на",
    "tag": "R",
    "bg_desc": "предлог",
    "en_desc": "preposition"
}, {
    "word": "естествен",
    "tag": "Asmo",
    "bg_desc": "прилагателно име",
    "en_desc": "adjective"
}, {
    "word": "език",
    "tag": "NCMsom",
    "bg_desc": "съществително име",
    "en_desc": "noun"
}, {
    "word": ".",
    "tag": "U",
    "bg_desc": "препинателен знак",
    "en_desc": "punctuation"
}]
```
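The tagged tokens are easy to post-process with plain Python. The sketch below keeps only the nouns; it assumes `pos` returns a list of dictionaries shaped like the output above and uses a hard-coded sample copied from that output, so it runs without downloading a model. `words_with_desc` is a hypothetical helper, not part of bgnlp.

```python
# Sample copied from the output above (bg_desc omitted for brevity).
# In practice this list would be the result of pos(...).
tagged = [
    {"word": "Това", "tag": "PDOsn", "en_desc": "pronoun"},
    {"word": "е", "tag": "VLINr3s", "en_desc": "verb"},
    {"word": "библиотека", "tag": "NCFsof", "en_desc": "noun"},
    {"word": "за", "tag": "R", "en_desc": "preposition"},
    {"word": "обработка", "tag": "NCFsof", "en_desc": "noun"},
    {"word": "на", "tag": "R", "en_desc": "preposition"},
    {"word": "естествен", "tag": "Asmo", "en_desc": "adjective"},
    {"word": "език", "tag": "NCMsom", "en_desc": "noun"},
    {"word": ".", "tag": "U", "en_desc": "punctuation"},
]


def words_with_desc(tokens, en_desc):
    """Return the words whose English description equals `en_desc`."""
    return [token["word"] for token in tokens if token["en_desc"] == en_desc]


nouns = words_with_desc(tagged, "noun")
print(nouns)
```

Filtering on `en_desc` rather than on the fine-grained `tag` keeps the check independent of the tagset details.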
<a id="lemma"></a>
### Lemmatization
```python
from bgnlp import lemmatize
text = "Добре дошли!"
print(lemmatize(text))
```
```text
[{'word': 'Добре', 'lemma': 'Добре'}, {'word': 'дошли', 'lemma': 'дойда'}, {'word': '!', 'lemma': '!'}]
```
```python
# Generating a string of lemmas.
print(lemmatize(text, as_string=True))
```
```text
Добре дойда!
```
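To see which words the lemmatizer actually changed, the list form of the result can be turned into a lookup table. This is a minimal sketch that assumes the list-of-dictionaries shape shown above and uses a hard-coded copy of that output instead of calling `lemmatize`.

```python
# Sample copied from the first output above; in practice this would be
# the result of lemmatize(text).
lemmatized = [
    {"word": "Добре", "lemma": "Добре"},
    {"word": "дошли", "lemma": "дойда"},
    {"word": "!", "lemma": "!"},
]

# Map each surface form to its lemma.
word_to_lemma = {item["word"]: item["lemma"] for item in lemmatized}

# Keep only the words whose lemma differs from the surface form.
changed = {word: lemma for word, lemma in word_to_lemma.items() if word != lemma}
print(changed)
```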
<a id="ner"></a>
### Named Entity Recognition (NER) tagging
Currently, the available NER tags are:
- `PER` - Person
- `ORG` - Organization
- `LOC` - Location
```python
from bgnlp import ner
text = "Барух Спиноза е роден в Амстердам"
print(f"Input: {text}")
print("Result:", ner(text))
```
```text
Input: Барух Спиноза е роден в Амстердам
Result: [{'word': 'Барух Спиноза', 'entity_group': 'PER'}, {'word': 'Амстердам', 'entity_group': 'LOC'}]
```
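A common follow-up step is grouping the recognized entities by tag. The sketch below does that with the standard library, assuming the result shape shown above; the sample list is copied from that output and `group_entities` is a hypothetical helper, not a bgnlp function.

```python
from collections import defaultdict

# Sample copied from the output above; in practice this would be ner(text).
entities = [
    {"word": "Барух Спиноза", "entity_group": "PER"},
    {"word": "Амстердам", "entity_group": "LOC"},
]


def group_entities(items):
    """Group entity words by their `entity_group` tag, preserving order."""
    grouped = defaultdict(list)
    for item in items:
        grouped[item["entity_group"]].append(item["word"])
    return dict(grouped)


by_type = group_entities(entities)
print(by_type)
```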
<a id="keywords"></a>
### Keyword Extraction
```python
from pprint import pprint

from bgnlp import extract_keywords

# Read the text from a file, since it may be too long to paste inline.
# The input is the plain text (no HTML) of this Bulgarian news article:
# https://novini.bg/sviat/eu/781622
with open("input_text.txt", "r", encoding="utf-8") as f:
    text = f.read()

# Extract keywords with a probability of at least 0.5.
keywords = extract_keywords(text, threshold=0.5)
print("Keywords:")
pprint(keywords)
```
```text
Keywords:
[{'keyword': 'Еманюел Макрон', 'score': 0.8759163320064545},
 {'keyword': 'Г-7', 'score': 0.5938143730163574},
 {'keyword': 'Япония', 'score': 0.607077419757843}]
```
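The keywords in the sample output are not ordered by score (`Г-7` is listed before the higher-scoring `Япония`). If a ranked list is needed, it can be sorted afterwards; this sketch uses a hard-coded copy of the output above rather than calling `extract_keywords`.

```python
# Sample copied from the output above; in practice this would be the
# result of extract_keywords(text, threshold=0.5).
keywords = [
    {"keyword": "Еманюел Макрон", "score": 0.8759163320064545},
    {"keyword": "Г-7", "score": 0.5938143730163574},
    {"keyword": "Япония", "score": 0.607077419757843},
]

# Sort from the highest score to the lowest.
ranked = sorted(keywords, key=lambda item: item["score"], reverse=True)

for item in ranked:
    print(f"{item['score']:.3f}  {item['keyword']}")
```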
<a id="comma"></a>
### Commatization
```python
from pprint import pprint
from bgnlp import commatize
text = "Човекът искащ безгрижно писане ме помоли да създам този модел."
print("Without metadata:")
print(commatize(text))
print("\nWith metadata:")
pprint(commatize(text, return_metadata=True))
```
```text
Without metadata:
Човекът, искащ безгрижно писане, ме помоли да създам този модел.

With metadata:
('Човекът, искащ безгрижно писане, ме помоли да създам този модел.',
 [{'end': 12,
   'score': 0.9301406145095825,
   'start': 0,
   'substring': 'Човекът, иск'},
  {'end': 34,
   'score': 0.93571537733078,
   'start': 24,
   'substring': ' писане, м'}])
```
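Judging from the sample output, `start` and `end` appear to be character offsets into the commatized string, with `substring` being the slice between them. That reading is an inference from the example, not documented behavior. Under that assumption, the sketch below recovers the position of each inserted comma from a hard-coded copy of the metadata.

```python
# Sample copied from the output above; in practice both values would come
# from commatize(text, return_metadata=True).
commatized = "Човекът, искащ безгрижно писане, ме помоли да създам този модел."
metadata = [
    {"start": 0, "end": 12, "score": 0.9301406145095825,
     "substring": "Човекът, иск"},
    {"start": 24, "end": 34, "score": 0.93571537733078,
     "substring": " писане, м"},
]

comma_positions = []
for span in metadata:
    window = commatized[span["start"]:span["end"]]
    # Check the assumption that the offsets index the commatized string.
    if window != span["substring"]:
        raise ValueError(f"Offsets do not match substring: {span}")
    # Absolute position of the comma inside the full string.
    comma_positions.append(span["start"] + window.index(","))

print(comma_positions)
```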
Raw data
{
"_id": null,
"home_page": "",
"name": "bgnlp",
"maintainer": "",
"docs_url": null,
"requires_python": "",
"maintainer_email": "",
"keywords": "pytorch,nlp,bulgaria,machine learning,deep learning,AI",
"author": "Adam Fauzi",
"author_email": "adamfzh98@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/f6/31/f50bd56638760e395c5998696190403fcca6370845ee3678f773d755e6eb/bgnlp-0.5.3.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "",
"summary": "Package for Bulgarian Natural Language Processing (NLP)",
"version": "0.5.3",
"project_urls": null,
"split_keywords": [
"pytorch",
"nlp",
"bulgaria",
"machine learning",
"deep learning",
"ai"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "0f1e4da61a314656bceff8d3b348a0e898a1fe1ab50fc21d2ba0a846423485b7",
"md5": "9adc7dea55413e353f011d9f80cb78e1",
"sha256": "7d9e82108bffe74e1ffa1d5f22f6ea2c5372ef22da32587ff6bb765be3212fc6"
},
"downloads": -1,
"filename": "bgnlp-0.5.3-py3-none-any.whl",
"has_sig": false,
"md5_digest": "9adc7dea55413e353f011d9f80cb78e1",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 50867,
"upload_time": "2024-01-27T16:31:42",
"upload_time_iso_8601": "2024-01-27T16:31:42.084765Z",
"url": "https://files.pythonhosted.org/packages/0f/1e/4da61a314656bceff8d3b348a0e898a1fe1ab50fc21d2ba0a846423485b7/bgnlp-0.5.3-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "f631f50bd56638760e395c5998696190403fcca6370845ee3678f773d755e6eb",
"md5": "0e9056a1004147a27bcdf8fba0c183f6",
"sha256": "96e67221583538fb013fa7e6ae6f585ce89f4e1b191cc84c015116518b1581db"
},
"downloads": -1,
"filename": "bgnlp-0.5.3.tar.gz",
"has_sig": false,
"md5_digest": "0e9056a1004147a27bcdf8fba0c183f6",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 51995,
"upload_time": "2024-01-27T16:31:43",
"upload_time_iso_8601": "2024-01-27T16:31:43.244718Z",
"url": "https://files.pythonhosted.org/packages/f6/31/f50bd56638760e395c5998696190403fcca6370845ee3678f773d755e6eb/bgnlp-0.5.3.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-01-27 16:31:43",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "bgnlp"
}