[![PyPI - Python](https://img.shields.io/badge/python-3.6%20|%203.7%20|%203.8-blue.svg)](https://pypi.org/project/keystem/)
[![PyPI - License](https://img.shields.io/badge/license-MIT-green.svg)](https://github.com/Nagakiran1/keystem/blob/master/LICENSE)
[![PyPI - PyPi](https://img.shields.io/pypi/v/keystem)](https://pypi.org/project/keystem/)
<img src="images/logo.png" width="35%" height="35%" align="right" />
# KeyStem
KeyStem is a minimal and easy-to-use keyword extraction technique that leverages BERT embeddings to
create keywords and keyphrases that are most similar to a document.
The corresponding Medium post can be found [here](https://towardsdatascience.com/keyword-extraction-with-bert-724efca412ea).
<a name="toc"/></a>
## Table of Contents
<!--ts-->
1. [About the Project](#about)
2. [Getting Started](#gettingstarted)
2.1. [Installation](#installation)
2.2. [Basic Usage](#usage)
 2.3. [Maximal Marginal Relevance](#maximal)
 2.4. [Embedding Models](#embeddings)
<!--te-->
<a name="about"/></a>
## 1. About the Project
[Back to ToC](#toc)
Although there are already many methods available for keyword generation
(e.g.,
[Rake](https://github.com/aneesha/RAKE),
[YAKE!](https://github.com/LIAAD/yake), TF-IDF, etc.),
I wanted to create a very basic, but powerful method for extracting keywords and keyphrases.
This is where **KeyStem** comes in: it uses BERT embeddings and simple cosine similarity
to find the sub-phrases in a document that are most similar to the document itself.
First, document embeddings are extracted with BERT to get a document-level representation.
Then, word embeddings are extracted for N-gram words/phrases. Finally, we use cosine similarity
to find the words/phrases that are the most similar to the document. The most similar words could
then be identified as the words that best describe the entire document.
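Since the whole pipeline is only three steps, it can be sketched in a few lines. The snippet below is a minimal illustration of the idea, not KeyStem's internals; the `all-MiniLM-L6-v2` model and the top-5 cutoff are arbitrary choices for the example:
```python
# Minimal sketch of the embed-and-compare idea behind KeyStem (illustrative only).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

doc = "Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs."

# 1. Collect candidate n-grams from the document.
vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words="english").fit([doc])
candidates = vectorizer.get_feature_names_out().tolist()

# 2. Embed the document and the candidates with the same model.
model = SentenceTransformer("all-MiniLM-L6-v2")
doc_embedding = model.encode([doc])
candidate_embeddings = model.encode(candidates)

# 3. Rank candidates by cosine similarity to the document.
similarities = cosine_similarity(doc_embedding, candidate_embeddings)[0]
print([candidates[i] for i in similarities.argsort()[::-1][:5]])
```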
KeyStem is by no means unique and was created as a quick and easy method
for creating keywords and keyphrases. Although there are many great
papers and solutions out there that use BERT embeddings
(e.g.,
[1](https://github.com/pranav-ust/BERT-keyphrase-extraction),
[2](https://github.com/ibatra/BERT-Keyword-Extractor),
[3](https://www.preprints.org/manuscript/201908.0073/download/final_file)),
I could not find a BERT-based solution that did not have to be trained from scratch and
could be used by beginners (**correct me if I'm wrong!**).
Thus, the goal was a `pip install keystem` and at most 3 lines of code in usage.
<a name="gettingstarted"/></a>
## 2. Getting Started
[Back to ToC](#toc)
<a name="installation"/></a>
### 2.1. Installation
Installation can be done using [PyPI](https://pypi.org/project/keystem/):
```
pip install keystem
```
<a name="usage"/></a>
### 2.2. Basic Usage
The most minimal example for extracting keywords can be seen below:
```python
from keystem import KeyStem
doc = """
Supervised learning is the machine learning task of learning a function that
maps an input to an output based on example input-output pairs. It infers a
function from labeled training data consisting of a set of training examples.
In supervised learning, each example is a pair consisting of an input object
(typically a vector) and a desired output value (also called the supervisory signal).
A supervised learning algorithm analyzes the training data and produces an inferred function,
which can be used for mapping new examples. An optimal scenario will allow for the
algorithm to correctly determine the class labels for unseen instances. This requires
the learning algorithm to generalize from the training data to unseen situations in a
'reasonable' way (see inductive bias).
"""
ks_model = KeyStem()
keywords = ks_model.get_keygroups(doc)
```
You can set `keyphrase_ngram_range` to set the length of the resulting keywords/keyphrases:
```python
>>> ks_model.get_keygroups(doc, keyphrase_ngram_range=(1, 1), stop_words=None)
{'index': {0: 0, 2: 1, 26: 15, 28: 16, 20: 11}, 'keywords': {0: ('supervised learning', 0.7096), 2: ('supervised', 0.6735), 26: ('supervised learning', 0.613), 28: ('supervised', 0.6125), 20: ('supervised', 0.5554)}, 'features': {0: 'supervised learning', 2: 'supervised', 26: 'supervised learning', 28: 'supervised', 20: 'supervised'}, 'cluster': {0: 0.0, 2: 0.0, 26: 0.0, 28: 0.0, 20: 0.0}, 'score': {0: 0.7096, 2: 0.6735, 26: 0.613, 28: 0.6125, 20: 0.5554}, 'label': {0: 'supervised learning', 2: 'supervised learning', 26: 'supervised learning', 28: 'supervised learning', 20: 'supervised learning'}}
```
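The returned mapping follows the column-oriented layout of `pandas.DataFrame.to_dict()`: `features` holds the extracted phrase, `score` its cosine similarity to the document, `cluster` the group the phrase was assigned to, and `label` the representative stem of that group. Assuming the call returns this mapping (a convenient reading of the output above, not a documented guarantee), it can be rebuilt into a table for inspection:
```python
import pandas as pd

# Rebuild a readable table from the column-oriented dict returned above.
result = ks_model.get_keygroups(doc, keyphrase_ngram_range=(1, 1), stop_words=None)
df = pd.DataFrame(result)
print(df[["features", "score", "label"]].sort_values("score", ascending=False))
```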
To extract keyphrases, simply set `keyphrase_ngram_range` to (1, 2) or higher depending on the number
of words you would like in the resulting keyphrases:
```python
>>> ks_model.get_keygroups(doc, keyphrase_ngram_range=(1, 2), stop_words=None)
{'index': {0: 0, 2: 1, 26: 15, 28: 16, 20: 11}, 'keywords': {0: ('supervised learning', 0.7096), 2: ('supervised', 0.6735), 26: ('supervised learning', 0.613), 28: ('supervised', 0.6125), 20: ('supervised', 0.5554)}, 'features': {0: 'supervised learning', 2: 'supervised', 26: 'supervised learning', 28: 'supervised', 20: 'supervised'}, 'cluster': {0: 0.0, 2: 0.0, 26: 0.0, 28: 0.0, 20: 0.0}, 'score': {0: 0.7096, 2: 0.6735, 26: 0.613, 28: 0.6125, 20: 0.5554}, 'label': {0: 'supervised learning', 2: 'supervised learning', 26: 'supervised learning', 28: 'supervised learning', 20: 'supervised learning'}}
```
<a name="maximal"/></a>
### 2.3. Maximal Marginal Relevance
To diversify the results, we can use Maximal Marginal Relevance (MMR) to create
keywords/keyphrases. MMR is also based on cosine similarity, but it iteratively picks the
candidate that is most similar to the document and least similar to the keywords already
selected, so results with **high diversity** cover different aspects of the document,
while results with **low diversity** stay close to the single most relevant phrase.
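KeyStem's public API does not document an MMR switch, so as a reference, here is a self-contained sketch of the MMR selection step itself; `mmr`, its arguments, and the `diversity` trade-off (0 = pure relevance, 1 = pure diversity) are illustrative names, not part of the package:
```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def mmr(doc_embedding, candidate_embeddings, candidates, top_n=5, diversity=0.5):
    """Greedy MMR over precomputed embeddings (illustrative sketch, not KeyStem's API)."""
    # Relevance of each candidate to the document.
    doc_sim = cosine_similarity(candidate_embeddings, doc_embedding.reshape(1, -1)).ravel()
    # Pairwise similarity between candidates, used to measure redundancy.
    cand_sim = cosine_similarity(candidate_embeddings)

    selected = [int(np.argmax(doc_sim))]  # start with the most relevant candidate
    remaining = [i for i in range(len(candidates)) if i != selected[0]]
    while remaining and len(selected) < top_n:
        # Redundancy: highest similarity of each remaining candidate to anything selected.
        redundancy = cand_sim[np.ix_(remaining, selected)].max(axis=1)
        scores = (1 - diversity) * doc_sim[remaining] - diversity * redundancy
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return [candidates[i] for i in selected]
```
With `diversity=0.7` the selected keywords spread across topics; with `diversity=0.2` they cluster around the most document-like phrase.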
<a name="embeddings"/></a>
### 2.4. Embedding Models
KeyStem supports many embedding models that can be used to embed the documents and words:
* Sentence-Transformers
* Flair
* Spacy
* Gensim
* USE
Click [here](https://maartengr.github.io/KeyBERT/guides/embeddings.html) for a full overview of all supported embedding models.
**Sentence-Transformers**
You can select any model from `sentence-transformers` [here](https://www.sbert.net/docs/pretrained_models.html)
and pass it through KeyStem with `model`:
```python
from keystem import KeyStem
ks_model = KeyStem(model='all-MiniLM-L6-v2')
```
Or select a SentenceTransformer model with your own parameters:
```python
from keystem import KeyStem
from sentence_transformers import SentenceTransformer
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
ks_model = KeyStem(model=sentence_model)
```
**Flair**
[Flair](https://github.com/flairNLP/flair) allows you to choose almost any embedding model that
is publicly available. Flair can be used as follows:
```python
from keystem import KeyStem
from flair.embeddings import TransformerDocumentEmbeddings
roberta = TransformerDocumentEmbeddings('roberta-base')
ks_model = KeyStem(model=roberta)
```
You can select any 🤗 transformers model [here](https://huggingface.co/models).
## Citation
To cite KeyStem in your work, please use the following BibTeX reference:
```bibtex
@misc{kiran2023keystem,
 author = {Naga Kiran},
 title = {KeyStem: Minimal keyword extraction with BERT and grouping of keywords to their stem},
year = 2023,
publisher = {caspai},
version = {v0.0.1},
url = {http://caspai.in/}
}
```
## References
Below, you can find several resources that were used for the creation of KeyStem;
most importantly, these are amazing resources for creating impressive keyword extraction models:
**GitHub Repos**:
* https://github.com/MaartenGr/KeyBERT
* https://github.com/Nagakiran1/keystem
* https://github.com/thunlp/BERT-KPE
* https://github.com/ibatra/BERT-Keyword-Extractor
* https://github.com/pranav-ust/BERT-keyphrase-extraction
* https://github.com/swisscom/ai-research-keyphrase-extraction
**MMR**:
The selection of keywords/keyphrases was modeled after:
* https://github.com/swisscom/ai-research-keyphrase-extraction