![Khl Logo](https://raw.githubusercontent.com/Rishat-F/khl/master/data/logo.png)
<h1 align="center">No Water - Ice Only</h1>
Preparing Russian hockey news for machine learning.
**Unify -> Simplify -> Preprocess** text and feed your neural model.
## Installation
*Khl* is available on PyPI:
```console
$ pip install khl
```
It requires Python 3.8+ to run.
## Usage
To get started right away with basic usage:
```python
from khl import text_to_codes
coder = {
    '': 0,  # placeholder
    '???': 1,  # unknown
    '.': 2,
    'и': 3,
    'в': 4,
    '-': 5,
    ':': 6,
    'матч': 7,
    'за': 8,
    'забить': 9,
    'гол': 10,
    'per': 11,  # person entity
    'org': 12,  # organization entity
    'loc': 13,  # location entity
    'date': 14,  # date entity
    'против': 15,
    'год': 16,
    'pers': 17,  # multiple persons entity
    'orgs': 18,  # multiple organizations entity
    'свой': 19
}
text = """
1 апреля 2023 года в Москве в матче ⅛ финала против „Спартака” Иван Иванов забил свой 100—й гол за карьеру.
«Динамо Мск» - «Спартак» 2:1 ОТ (1:0 0:1 0:0 1:0) Голы забили: Иванов, Петров и Сидоров.
"""
codes = text_to_codes(
    text=text,
    coder=coder,
    stop_words_=["за", "и", "свой"],  # stop words to drop
    replace_ners_=True,  # replace named entities ("Иван Иванов" -> "per", "Спартак" -> "org", "Москва" -> "loc")
    replace_dates_=True,  # replace dates ("1 апреля 2023 года" -> "date")
    replace_penalties_=True,  # replace penalties ("5+20" -> "pen")
    exclude_unknown=True,  # drop lemmas that are not present in coder
    max_len=20,  # get a sequence of codes of length 20
)
# codes = [0, 0, 0, 14, 4, 13, 4, 7, 15, 12, 11, 9, 10, 2, 18, 10, 9, 6, 17, 2]
```
`text_to_codes` is a very high-level function. To see what happens under the hood, check out [Lower level usage](#lower-level-usage).
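Since the resulting codes are meant to be fed to a neural model, in practice you will usually run `text_to_codes` over a whole dataset. A minimal sketch, reusing the `coder` defined above; the `news_texts` list is purely illustrative:

```python
from khl import text_to_codes

# Illustrative list of raw news texts; in practice this is your whole dataset
news_texts = [
    "1 апреля 2023 года в Москве в матче ⅛ финала против „Спартака” Иван Иванов забил свой 100—й гол за карьеру.",
    "«Динамо Мск» - «Спартак» 2:1 ОТ (1:0 0:1 0:0 1:0) Голы забили: Иванов, Петров и Сидоров.",
]

# Every text becomes a fixed-length sequence of integer codes,
# so the result can be fed to an embedding layer of any framework
encoded_texts = [
    text_to_codes(
        text=news_text,
        coder=coder,  # the coder defined above
        stop_words_=["за", "и", "свой"],
        replace_ners_=True,
        replace_dates_=True,
        replace_penalties_=True,
        exclude_unknown=True,
        max_len=20,
    )
    for news_text in news_texts
]
```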
## What is `coder`?
`coder` is just a dictionary in which each lemma is represented by a unique integer code.
Note that the first two codes are reserved for the *placeholder* and *unknown* elements.

A `coder` can be built from a frequency dictionary file (see [Get lemmas coder](#2-get-lemmas-coder)).
A frequency dictionary file is a **json** file containing a dictionary where each key is a lemma and each value is how many times that lemma occurs in your whole dataset.
Preferably, it should be sorted in descending order of values.
`example_frequency_dictionary.json`:
```json
{
    ".": 1000,
    "и": 500,
    "в": 400,
    "-": 300,
    ":": 300,
    "матч": 290,
    "за": 250,
    "забить": 240,
    "гол": 230,
    "per": 200,
    "org": 150,
    "loc": 150,
    "date": 100,
    "против": 90,
    "год": 70,
    "pers": 40,
    "orgs": 30,
    "свой": 20
}
```
You can create and use your own frequency dictionary or download [this dictionary](https://github.com/Rishat-F/khl/blob/master/data/frequency_dictionary.json) that I have built.
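If you decide to build your own frequency dictionary, a minimal sketch could look like the following. It reuses `utils` and `preprocess` from [Lower level usage](#lower-level-usage) below; the `news_texts` list and the output filename are illustrative, and passing `stop_words_=[]` to keep every lemma is an assumption, not a documented recipe:

```python
import json
from collections import Counter

from khl import preprocess, utils

# Illustrative corpus; in practice iterate over your whole dataset
news_texts = [
    "1 апреля 2023 года в Москве в матче ⅛ финала против „Спартака” Иван Иванов забил свой 100—й гол за карьеру.",
    "«Динамо Мск» - «Спартак» 2:1 ОТ (1:0 0:1 0:0 1:0) Голы забили: Иванов, Петров и Сидоров.",
]

lemmas_counter = Counter()
for news_text in news_texts:
    simplified = utils.simplify(
        text=utils.unify(news_text),
        replace_ners_=True,
        replace_dates_=True,
        replace_penalties_=True,
    )
    # Count all lemmas, including stop words, so the dictionary covers everything
    lemmas_counter.update(preprocess.lemmatize(text=simplified, stop_words_=[]))

# Sort by frequency in descending order, as recommended above
with open("my_frequency_dictionary.json", "w", encoding="utf-8") as file:
    json.dump(dict(lemmas_counter.most_common()), file, ensure_ascii=False, indent=4)
```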
## Lower level usage<a id="lower-level-usage"></a>
#### 1. Make imports
```python
from khl import stop_words
from khl import utils
from khl import preprocess
```
#### 2. Get lemmas coder<a id="2-get-lemmas-coder"></a>
```python
coder = preprocess.get_coder("example_frequency_dictionary.json")
```
#### 3. Define text
```python
text = """
1 апреля 2023 года в Москве в матче ⅛ финала против „Спартака” Иван Иванов забил свой 100—й гол за карьеру.
«Динамо Мск» - «Спартак» 2:1 ОТ (1:0 0:1 0:0 1:0) Голы забили: Иванов, Петров и Сидоров.
"""
```
#### 4. Unify
```python
unified_text = utils.unify(text)
# "1 апреля 2023 года в Москве в матче 1/8 финала против 'Спартака' Иван Иванов забил свой 100-й гол за карьеру. 'Динамо Мск' - 'Спартак' 2:1 ОТ (1:0 0:1 0:0 1:0) Голы забили: Иванов, Петров и Сидоров."
```
#### 5. Simplify
```python
simplified_text = utils.simplify(
    text=unified_text,
    replace_ners_=True,
    replace_dates_=True,
    replace_penalties_=True,
)
# 'date в loc в матче финала против org per забил свой гол за карьеру. org org Голы забили: per per per.'
```
#### 6. Lemmatize
```python
lemmas = preprocess.lemmatize(text=simplified_text, stop_words_=stop_words)
# ['date', 'в', 'loc', 'в', 'матч', 'финал', 'против', 'org', 'per', 'забить', 'гол', 'карьера', '.', 'orgs', 'гол', 'забить', ':', 'pers', '.']
```
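As a side note, the bundled `stop_words` can be combined with your own words before lemmatization (a sketch; `stop_words` is assumed to be an iterable of strings, and the extra word is purely illustrative):

```python
# Extend the bundled stop words with a project-specific one (illustrative)
custom_stop_words = list(stop_words) + ["карьера"]
custom_lemmas = preprocess.lemmatize(text=simplified_text, stop_words_=custom_stop_words)
```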
#### 7. Transform to codes
```python
codes = preprocess.lemmas_to_codes(
    lemmas=lemmas,
    coder=coder,
    exclude_unknown=True,
    max_len=20,
)
# [0, 0, 0, 14, 4, 13, 4, 7, 15, 12, 11, 9, 10, 2, 18, 10, 9, 6, 17, 2]
```
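For comparison, with `exclude_unknown=False` the lemmas missing from `coder` (here `'финал'` and `'карьера'`) are kept and encoded as `1` (`'???'`) instead of being dropped. The commented output below is inferred from the example above, assuming the zero-padding stays on the left:

```python
codes_with_unknown = preprocess.lemmas_to_codes(
    lemmas=lemmas,
    coder=coder,
    exclude_unknown=False,  # keep unknown lemmas as code 1 ('???')
    max_len=20,
)
# Expected: [0, 14, 4, 13, 4, 7, 1, 15, 12, 11, 9, 10, 1, 2, 18, 10, 9, 6, 17, 2]
```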
#### 8. Transform codes back to lemmas (just to see which lemmas are present in the codes sequence)
```python
print(
    preprocess.codes_to_lemmas(codes=codes, coder=coder)
)
# ['', '', '', 'date', 'в', 'loc', 'в', 'матч', 'против', 'org', 'per', 'забить', 'гол', '.', 'orgs', 'гол', 'забить', ':', 'pers', '.']
```
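Putting steps 4-7 together, the lower-level functions compose into roughly what `text_to_codes` does. This is only a sketch of the pipeline, not the library's actual implementation:

```python
from typing import Dict, List

from khl import preprocess, stop_words, utils


def encode_text(text: str, coder: Dict[str, int], max_len: int) -> List[int]:
    """Unify -> Simplify -> Lemmatize -> Encode, mirroring steps 4-7 above."""
    unified = utils.unify(text)
    simplified = utils.simplify(
        text=unified,
        replace_ners_=True,
        replace_dates_=True,
        replace_penalties_=True,
    )
    lemmas = preprocess.lemmatize(text=simplified, stop_words_=stop_words)
    return preprocess.lemmas_to_codes(
        lemmas=lemmas,
        coder=coder,
        exclude_unknown=True,
        max_len=max_len,
    )


same_codes = encode_text(text, coder, max_len=20)  # should match the codes from step 7
```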