bgnlp

Name	bgnlp
Version	0.5.3
Summary	Package for Bulgarian Natural Language Processing (NLP)
upload_time	2024-01-27 16:31:43
author	Adam Fauzi
keywords	pytorch nlp bulgaria machine learning deep learning ai

# **bgnlp**: Model-first approach to Bulgarian NLP
<a href="https://colab.research.google.com/drive/1etvcxad0f754pjSdjremDftq16o_oMTh?usp=sharing"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"></a>

[![Downloads](https://static.pepy.tech/personalized-badge/bgnlp?period=total&units=international_system&left_color=grey&right_color=blue&left_text=pip%20downloads)](https://pypi.org/project/bgnlp/)

```sh
pip install bgnlp
```

## Package functionalities
- [Part-of-speech](#pos)
- [Lemmatization](#lemma)
- [Named Entity Recognition](#ner)
- [Keyword Extraction](#keywords)
- [Commatization](#comma)

> Please note: a model is downloaded only the first time you run one of these operations, so that first run may take longer.


<a id="pos"></a>

### Part-of-speech (PoS) tagging

```python
from bgnlp import pos


print(pos("Това е библиотека за обработка на естествен език."))
```

```json
[{
    "word": "Това",
    "tag": "PDOsn",
    "bg_desc": "местоимение",
    "en_desc": "pronoun"
}, {
    "word": "е",
    "tag": "VLINr3s",
    "bg_desc": "глагол",
    "en_desc": "verb"
}, {
    "word": "библиотека",
    "tag": "NCFsof",
    "bg_desc": "съществително име",
    "en_desc": "noun"
}, {
    "word": "за",
    "tag": "R",
    "bg_desc": "предлог",
    "en_desc": "preposition"
}, {
    "word": "обработка",
    "tag": "NCFsof",
    "bg_desc": "съществително име",
    "en_desc": "noun"
}, {
    "word": "на",
    "tag": "R",
    "bg_desc": "предлог",
    "en_desc": "preposition"
}, {
    "word": "естествен",
    "tag": "Asmo",
    "bg_desc": "прилагателно име",
    "en_desc": "adjective"
}, {
    "word": "език",
    "tag": "NCMsom",
    "bg_desc": "съществително име",
    "en_desc": "noun"
}, {
    "word": ".",
    "tag": "U",
    "bg_desc": "препинателен знак",
    "en_desc": "punctuation"
}]
```
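
Since the result is a plain list of dictionaries, it can be post-processed with standard Python. The following is a minimal sketch, assuming the output shape shown above. `words_by_category` is a hypothetical helper, not part of bgnlp, and the sample data is copied from the output above so the snippet runs without downloading the model.

```python
from collections import defaultdict

# Sample copied from the output above.
# In real use this would be: tagged = pos("...")
tagged = [
    {"word": "Това", "tag": "PDOsn", "bg_desc": "местоимение", "en_desc": "pronoun"},
    {"word": "е", "tag": "VLINr3s", "bg_desc": "глагол", "en_desc": "verb"},
    {"word": "библиотека", "tag": "NCFsof", "bg_desc": "съществително име", "en_desc": "noun"},
    {"word": "за", "tag": "R", "bg_desc": "предлог", "en_desc": "preposition"},
    {"word": "обработка", "tag": "NCFsof", "bg_desc": "съществително име", "en_desc": "noun"},
    {"word": "на", "tag": "R", "bg_desc": "предлог", "en_desc": "preposition"},
    {"word": "естествен", "tag": "Asmo", "bg_desc": "прилагателно име", "en_desc": "adjective"},
    {"word": "език", "tag": "NCMsom", "bg_desc": "съществително име", "en_desc": "noun"},
    {"word": ".", "tag": "U", "bg_desc": "препинателен знак", "en_desc": "punctuation"},
]


def words_by_category(tokens):
    """Group words by their English category description (hypothetical helper)."""
    groups = defaultdict(list)
    for token in tokens:
        groups[token["en_desc"]].append(token["word"])
    return dict(groups)


grouped = words_by_category(tagged)
print(grouped["noun"])  # ['библиотека', 'обработка', 'език']
```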

<a id="lemma"></a>

### Lemmatization

```python
from bgnlp import lemmatize


text = "Добре дошли!"
print(lemmatize(text))
```

```text
[{'word': 'Добре', 'lemma': 'Добре'}, {'word': 'дошли', 'lemma': 'дойда'}, {'word': '!', 'lemma': '!'}]
```

```python
# Generating a string of lemmas.
print(lemmatize(text, as_string=True))
```

```text
Добре дойда!
```
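
The list form is handy for building lemma statistics. Here is a small sketch, assuming the `word`/`lemma` dictionary shape shown above. `lemma_counts` is a hypothetical helper (not part of bgnlp), and lowercasing the lemmas is an illustrative choice rather than something the library does.

```python
from collections import Counter

# Sample copied from the output above; in real use: tokens = lemmatize(text)
tokens = [
    {"word": "Добре", "lemma": "Добре"},
    {"word": "дошли", "lemma": "дойда"},
    {"word": "!", "lemma": "!"},
]


def lemma_counts(tokens):
    """Count lemmas case-insensitively, skipping tokens without letters or digits."""
    return Counter(
        t["lemma"].lower()
        for t in tokens
        if any(ch.isalnum() for ch in t["lemma"])
    )


counts = lemma_counts(tokens)
print(counts)
```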

<a id="ner"></a>

### Named Entity Recognition (NER) tagging

Currently, the available NER tags are:
- `PER` - Person
- `ORG` - Organization
- `LOC` - Location

```python
from bgnlp import ner


text = "Барух Спиноза е роден в Амстердам"

print(f"Input: {text}")
print("Result:", ner(text))
```

```text
Input: Барух Спиноза е роден в Амстердам
Result: [{'word': 'Барух Спиноза', 'entity_group': 'PER'}, {'word': 'Амстердам', 'entity_group': 'LOC'}]
```
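
To collect entities per tag, the result can be regrouped with a few lines of Python. This is a sketch under the assumption that each item carries `word` and `entity_group` keys, as in the output above. `entities_by_tag` is a hypothetical helper, not a bgnlp function.

```python
# Sample copied from the output above; in real use: entities = ner(text)
entities = [
    {"word": "Барух Спиноза", "entity_group": "PER"},
    {"word": "Амстердам", "entity_group": "LOC"},
]


def entities_by_tag(entities, tags=("PER", "ORG", "LOC")):
    """Return a dict mapping each tag to the list of entity strings found for it."""
    grouped = {tag: [] for tag in tags}
    for entity in entities:
        # setdefault keeps the helper working if an unexpected tag shows up.
        grouped.setdefault(entity["entity_group"], []).append(entity["word"])
    return grouped


by_tag = entities_by_tag(entities)
print(by_tag)
```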


<a id="keywords"></a>

### Keyword Extraction
```python
from pprint import pprint

from bgnlp import extract_keywords


# Read the text from a file, since it may be too long to paste inline.
# The input here is the text (no HTML!) of this Bulgarian news article:
# https://novini.bg/sviat/eu/781622
with open("input_text.txt", "r", encoding="utf-8") as f:
    text = f.read()

# Extract keywords with a probability of at least 0.5.
keywords = extract_keywords(text, threshold=0.5)
print("Keywords:")
pprint(keywords)
```
```text
Keywords:
[{'keyword': 'Еманюел Макрон', 'score': 0.8759163320064545},
 {'keyword': 'Г-7', 'score': 0.5938143730163574},
 {'keyword': 'Япония', 'score': 0.607077419757843}]
```
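
In the sample output above, the entries are not listed in score order. If a ranked list is needed, it can be sorted afterwards. The sketch below assumes the `keyword`/`score` shape shown above; the `0.6` cutoff is only an illustrative value.

```python
# Sample copied from the output above;
# in real use: keywords = extract_keywords(text, threshold=0.5)
keywords = [
    {"keyword": "Еманюел Макрон", "score": 0.8759163320064545},
    {"keyword": "Г-7", "score": 0.5938143730163574},
    {"keyword": "Япония", "score": 0.607077419757843},
]

# Rank by score, highest first.
ranked = sorted(keywords, key=lambda k: k["score"], reverse=True)

# Keep only the keyword strings above a stricter, illustrative cutoff.
top = [k["keyword"] for k in ranked if k["score"] >= 0.6]
print(top)  # ['Еманюел Макрон', 'Япония']
```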

<a id="comma"></a>

### Commatization
```python
from pprint import pprint

from bgnlp import commatize


text = "Човекът искащ безгрижно писане ме помоли да създам този модел."

print("Without metadata:")
print(commatize(text))

print("\nWith metadata:")
pprint(commatize(text, return_metadata=True))
```
```text
Without metadata:
Човекът, искащ безгрижно писане, ме помоли да създам този модел.

With metadata:
('Човекът, искащ безгрижно писане, ме помоли да създам този модел.',
 [{'end': 12,
   'score': 0.9301406145095825,
   'start': 0,
   'substring': 'Човекът, иск'},
  {'end': 34,
   'score': 0.93571537733078,
   'start': 24,
   'substring': ' писане, м'}])
```
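
In the sample above, `start` and `end` appear to index into the commatized string: slicing the returned text with them reproduces each `substring`. The sketch below checks this on the sample data and derives the position of each inserted comma. It relies only on the output shown above, so treat the offset semantics as an observation about this example rather than a documented guarantee.

```python
# Sample copied from the output above;
# in real use: commatized, metadata = commatize(text, return_metadata=True)
commatized = "Човекът, искащ безгрижно писане, ме помоли да създам този модел."
metadata = [
    {"end": 12, "score": 0.9301406145095825, "start": 0, "substring": "Човекът, иск"},
    {"end": 34, "score": 0.93571537733078, "start": 24, "substring": " писане, м"},
]

# Each metadata entry should match the corresponding slice of the result.
spans_match = all(
    commatized[item["start"]:item["end"]] == item["substring"] for item in metadata
)

# Absolute index of the comma inside each reported span.
comma_positions = [
    item["start"] + item["substring"].index(",") for item in metadata
]

print(spans_match)      # True
print(comma_positions)  # [7, 31]
```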

Raw data

{
    "_id": null,
    "home_page": "",
    "name": "bgnlp",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "keywords": "pytorch,nlp,bulgaria,machine learning,deep learning,AI",
    "author": "Adam Fauzi",
    "author_email": "adamfzh98@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/f6/31/f50bd56638760e395c5998696190403fcca6370845ee3678f773d755e6eb/bgnlp-0.5.3.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "",
    "summary": "Package for Bulgarian Natural Language Processing (NLP)",
    "version": "0.5.3",
    "project_urls": null,
    "split_keywords": [
        "pytorch",
        "nlp",
        "bulgaria",
        "machine learning",
        "deep learning",
        "ai"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "0f1e4da61a314656bceff8d3b348a0e898a1fe1ab50fc21d2ba0a846423485b7",
                "md5": "9adc7dea55413e353f011d9f80cb78e1",
                "sha256": "7d9e82108bffe74e1ffa1d5f22f6ea2c5372ef22da32587ff6bb765be3212fc6"
            },
            "downloads": -1,
            "filename": "bgnlp-0.5.3-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "9adc7dea55413e353f011d9f80cb78e1",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 50867,
            "upload_time": "2024-01-27T16:31:42",
            "upload_time_iso_8601": "2024-01-27T16:31:42.084765Z",
            "url": "https://files.pythonhosted.org/packages/0f/1e/4da61a314656bceff8d3b348a0e898a1fe1ab50fc21d2ba0a846423485b7/bgnlp-0.5.3-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "f631f50bd56638760e395c5998696190403fcca6370845ee3678f773d755e6eb",
                "md5": "0e9056a1004147a27bcdf8fba0c183f6",
                "sha256": "96e67221583538fb013fa7e6ae6f585ce89f4e1b191cc84c015116518b1581db"
            },
            "downloads": -1,
            "filename": "bgnlp-0.5.3.tar.gz",
            "has_sig": false,
            "md5_digest": "0e9056a1004147a27bcdf8fba0c183f6",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 51995,
            "upload_time": "2024-01-27T16:31:43",
            "upload_time_iso_8601": "2024-01-27T16:31:43.244718Z",
            "url": "https://files.pythonhosted.org/packages/f6/31/f50bd56638760e395c5998696190403fcca6370845ee3678f773d755e6eb/bgnlp-0.5.3.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-01-27 16:31:43",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "bgnlp"
}
