![Khl Logo](https://raw.githubusercontent.com/Rishat-F/khl/master/data/logo.png)
<h1 align="center">No Water - Ice Only</h1>
Preparing Russian hockey news for machine learning.
**Unify -> Simplify -> Preprocess** text and feed your neural model.
## Installation
*Khl* is available on PyPI:
```console
$ pip install khl
```
It requires Python 3.8+ to run.
## Usage
To get started right away with basic usage:
```python
from khl import text_to_codes
coder = {
    '': 0,  # placeholder
    '???': 1,  # unknown
    '.': 2,
    'и': 3,
    'в': 4,
    '-': 5,
    ':': 6,
    'матч': 7,
    'за': 8,
    'забить': 9,
    'гол': 10,
    'per': 11,  # person entity
    'org': 12,  # organization entity
    'loc': 13,  # location entity
    'date': 14,  # date entity
    'против': 15,
    'год': 16,
    'pers': 17,  # multiple persons entity
    'orgs': 18,  # multiple organizations entity
    'свой': 19
}
text = """
1 апреля 2023 года в Москве в матче ⅛ финала против „Спартака” Иван Иванов забил свой 100—й гол за карьеру.
«Динамо Мск» - «Спартак» 2:1 ОТ (1:0 0:1 0:0 1:0) Голы забили: Иванов, Петров и Сидоров.
"""
codes = text_to_codes(
    text=text,
    coder=coder,
    stop_words_=["за", "и", "свой"],  # stop words to drop
    replace_ners_=True,  # replace named entities ("Иван Иванов" -> "per", "Спартак" -> "org", "Москва" -> "loc")
    replace_dates_=True,  # replace dates ("1 апреля 2023 года" -> "date")
    replace_penalties_=True,  # replace penalties ("5+20" -> "pen")
    exclude_unknown=True,  # drop lemmas that are not present in coder
    max_len=20,  # get a sequence of codes of length 20
)
# codes = [0, 0, 0, 14, 4, 13, 4, 7, 15, 12, 11, 9, 10, 2, 18, 10, 9, 6, 17, 2]
```
`text_to_codes` is a very high-level function. To see what happens under the hood, check out [Lower level usage](#lower-level-usage).
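Since the resulting codes are meant to be fed to a neural model, in practice you will usually run `text_to_codes` over a whole dataset. A minimal sketch, reusing the `coder` defined above; the `news_texts` list is purely illustrative:

```python
from khl import text_to_codes

# Illustrative list of raw news texts; in practice this is your whole dataset
news_texts = [
    "1 апреля 2023 года в Москве в матче ⅛ финала против „Спартака” Иван Иванов забил свой 100—й гол за карьеру.",
    "«Динамо Мск» - «Спартак» 2:1 ОТ (1:0 0:1 0:0 1:0) Голы забили: Иванов, Петров и Сидоров.",
]

# Every text becomes a fixed-length sequence of integer codes,
# so the result can be fed to an embedding layer of any framework
encoded_texts = [
    text_to_codes(
        text=news_text,
        coder=coder,  # the coder defined above
        stop_words_=["за", "и", "свой"],
        replace_ners_=True,
        replace_dates_=True,
        replace_penalties_=True,
        exclude_unknown=True,
        max_len=20,
    )
    for news_text in news_texts
]
```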
## What is `coder`?
`coder` is just a dictionary in which each lemma is represented by a unique integer code.
Note that the first two codes are reserved for the *placeholder* and *unknown* elements.

A `coder` can be built from a frequency dictionary file (see [Get lemmas coder](#2-get-lemmas-coder)).
A frequency dictionary file is a **json** file containing a dictionary where each key is a lemma and each value is how many times that lemma occurs in your whole dataset.
Preferably, it should be sorted in descending order of values.
`example_frequency_dictionary.json`:
```json
{
    ".": 1000,
    "и": 500,
    "в": 400,
    "-": 300,
    ":": 300,
    "матч": 290,
    "за": 250,
    "забить": 240,
    "гол": 230,
    "per": 200,
    "org": 150,
    "loc": 150,
    "date": 100,
    "против": 90,
    "год": 70,
    "pers": 40,
    "orgs": 30,
    "свой": 20
}
```
You can create and use your own frequency dictionary or download [this dictionary](https://github.com/Rishat-F/khl/blob/master/data/frequency_dictionary.json) that I have built.
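If you decide to build your own frequency dictionary, a minimal sketch could look like the following. It reuses `utils` and `preprocess` from [Lower level usage](#lower-level-usage) below; the `news_texts` list and the output filename are illustrative, and passing `stop_words_=[]` to keep every lemma is an assumption, not a documented recipe:

```python
import json
from collections import Counter

from khl import preprocess, utils

# Illustrative corpus; in practice iterate over your whole dataset
news_texts = [
    "1 апреля 2023 года в Москве в матче ⅛ финала против „Спартака” Иван Иванов забил свой 100—й гол за карьеру.",
    "«Динамо Мск» - «Спартак» 2:1 ОТ (1:0 0:1 0:0 1:0) Голы забили: Иванов, Петров и Сидоров.",
]

lemmas_counter = Counter()
for news_text in news_texts:
    simplified = utils.simplify(
        text=utils.unify(news_text),
        replace_ners_=True,
        replace_dates_=True,
        replace_penalties_=True,
    )
    # Count all lemmas, including stop words, so the dictionary covers everything
    lemmas_counter.update(preprocess.lemmatize(text=simplified, stop_words_=[]))

# Sort by frequency in descending order, as recommended above
with open("my_frequency_dictionary.json", "w", encoding="utf-8") as file:
    json.dump(dict(lemmas_counter.most_common()), file, ensure_ascii=False, indent=4)
```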
## Lower level usage<a id="lower-level-usage"></a>
#### 1. Make imports
```python
from khl import stop_words
from khl import utils
from khl import preprocess
```
#### 2. Get lemmas coder<a id="2-get-lemmas-coder"></a>
```python
coder = preprocess.get_coder("example_frequency_dictionary.json")
```
#### 3. Define text
```python
text = """
1 апреля 2023 года в Москве в матче ⅛ финала против „Спартака” Иван Иванов забил свой 100—й гол за карьеру.
«Динамо Мск» - «Спартак» 2:1 ОТ (1:0 0:1 0:0 1:0) Голы забили: Иванов, Петров и Сидоров.
"""
```
#### 4. Unify
```python
unified_text = utils.unify(text)
# "1 апреля 2023 года в Москве в матче 1/8 финала против 'Спартака' Иван Иванов забил свой 100-й гол за карьеру. 'Динамо Мск' - 'Спартак' 2:1 ОТ (1:0 0:1 0:0 1:0) Голы забили: Иванов, Петров и Сидоров."
```
#### 5. Simplify
```python
simplified_text = utils.simplify(
    text=unified_text,
    replace_ners_=True,
    replace_dates_=True,
    replace_penalties_=True,
)
# 'date в loc в матче финала против org per забил свой гол за карьеру. org org Голы забили: per per per.'
```
#### 6. Lemmatize
```python
lemmas = preprocess.lemmatize(text=simplified_text, stop_words_=stop_words)
# ['date', 'в', 'loc', 'в', 'матч', 'финал', 'против', 'org', 'per', 'забить', 'гол', 'карьера', '.', 'orgs', 'гол', 'забить', ':', 'pers', '.']
```
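As a side note, the bundled `stop_words` can be combined with your own words before lemmatization (a sketch; `stop_words` is assumed to be an iterable of strings, and the extra word is purely illustrative):

```python
# Extend the bundled stop words with a project-specific one (illustrative)
custom_stop_words = list(stop_words) + ["карьера"]
custom_lemmas = preprocess.lemmatize(text=simplified_text, stop_words_=custom_stop_words)
```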
#### 7. Transform to codes
```python
codes = preprocess.lemmas_to_codes(
    lemmas=lemmas,
    coder=coder,
    exclude_unknown=True,
    max_len=20,
)
# [0, 0, 0, 14, 4, 13, 4, 7, 15, 12, 11, 9, 10, 2, 18, 10, 9, 6, 17, 2]
```
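For comparison, with `exclude_unknown=False` the lemmas missing from `coder` (here `'финал'` and `'карьера'`) are kept and encoded as `1` (`'???'`) instead of being dropped. The commented output below is inferred from the example above, assuming the zero-padding stays on the left:

```python
codes_with_unknown = preprocess.lemmas_to_codes(
    lemmas=lemmas,
    coder=coder,
    exclude_unknown=False,  # keep unknown lemmas as code 1 ('???')
    max_len=20,
)
# Expected: [0, 14, 4, 13, 4, 7, 1, 15, 12, 11, 9, 10, 1, 2, 18, 10, 9, 6, 17, 2]
```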
#### 8. Transform codes back to lemmas (just to see which lemmas are present in the codes sequence)
```python
print(
    preprocess.codes_to_lemmas(codes=codes, coder=coder)
)
# ['', '', '', 'date', 'в', 'loc', 'в', 'матч', 'против', 'org', 'per', 'забить', 'гол', '.', 'orgs', 'гол', 'забить', ':', 'pers', '.']
```
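Putting steps 4-7 together, the lower-level functions compose into roughly what `text_to_codes` does. This is only a sketch of the pipeline, not the library's actual implementation:

```python
from typing import Dict, List

from khl import preprocess, stop_words, utils


def encode_text(text: str, coder: Dict[str, int], max_len: int) -> List[int]:
    """Unify -> Simplify -> Lemmatize -> Encode, mirroring steps 4-7 above."""
    unified = utils.unify(text)
    simplified = utils.simplify(
        text=unified,
        replace_ners_=True,
        replace_dates_=True,
        replace_penalties_=True,
    )
    lemmas = preprocess.lemmatize(text=simplified, stop_words_=stop_words)
    return preprocess.lemmas_to_codes(
        lemmas=lemmas,
        coder=coder,
        exclude_unknown=True,
        max_len=max_len,
    )


same_codes = encode_text(text, coder, max_len=20)  # should match the codes from step 7
```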