# 🔑 Keyword spaCy

Keyword spaCy is a spaCy pipeline component for extracting keywords from text using cosine similarity. It is based on [KeyBERT: A Minimal Method for Keyphrase Extraction using BERT](https://github.com/MaartenGr/KeyBERT), a transformer-based approach to keyword extraction, and follows that methodology closely. Users can specify the range of n-grams to consider and can enable a strict mode, which limits results to the specified n-gram range.
## Installation
Install Keyword spaCy from PyPI:
```
pip install keyword-spacy
```
Then, download the `en_core_web_md` model:
```
python -m spacy download en_core_web_md
```
## Usage
To use the Keyword Extractor, first, create a spaCy `nlp` object:
```python
import spacy
nlp = spacy.load("en_core_web_md")
```
Then, add the `KeywordExtractor` to the pipeline:
```python
nlp.add_pipe("keyword_extractor", last=True, config={"top_n": 10, "min_ngram": 3, "max_ngram": 3, "strict": True})
```
Now you can process text and extract keywords:
```python
text = "Natural language processing is a fascinating domain of artificial intelligence. It allows computers to understand and generate human language."
doc = nlp(text)
print("Top Keywords:", doc._.keywords)
```
Output:
```
Top Keywords: ['generate human language', 'Natural language processing']
```
Each non-punctuation token also receives a custom attribute, `._.keyword_value`, which holds that token's similarity to the `doc.vector`. This may be helpful for other downstream tasks.
## Configuration
The `KeywordExtractor` can be configured using the following parameters:
- `top_n`: The number of top keywords to extract.
- `min_ngram`: The minimum size for n-grams.
- `max_ngram`: The maximum size for n-grams.
- `strict`: If set to `True`, only n-grams within the `min_ngram` to `max_ngram` range are considered. If `False`, individual tokens and the specified range of n-grams are considered.
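
The effect of `strict` on the candidate pool can be illustrated with a minimal sketch. The helper `candidate_ngrams` below is hypothetical and is not part of Keyword spaCy; it works on a plain list of token strings and only mirrors the behavior described in the parameter list above, so the library's actual candidate selection may differ.

```python
def candidate_ngrams(tokens, min_ngram, max_ngram, strict):
    """Build candidate phrases from a list of token strings.

    Hypothetical helper for illustration only; not Keyword spaCy's own code.
    """
    candidates = []
    # With strict=False, individual tokens are candidates in addition
    # to the n-grams in the requested range.
    if not strict and min_ngram > 1:
        candidates.extend(tokens)
    # Collect every contiguous n-gram with min_ngram <= n <= max_ngram.
    for n in range(min_ngram, max_ngram + 1):
        for start in range(len(tokens) - n + 1):
            candidates.append(" ".join(tokens[start:start + n]))
    return candidates


tokens = ["generate", "human", "language"]
print(candidate_ngrams(tokens, 2, 3, strict=True))
print(candidate_ngrams(tokens, 2, 3, strict=False))
```

With `strict=True` the list contains only the 2-grams and 3-grams; with `strict=False` the three single tokens are included as well.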
## Methodology
The methodology employed by Keyword spaCy is inspired by [KeyBERT](https://github.com/MaartenGr/KeyBERT). It utilizes cosine similarity between tokens (and n-grams) and the entire document to determine the relevance of terms. The most similar terms are then considered as keywords.
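
The ranking idea can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: the function names `cosine_similarity` and `rank_candidates` are hypothetical, and the three-dimensional vectors are made up for the example (in practice the vectors would come from the spaCy model).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Treat a zero vector as having no similarity to anything.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(candidate_vectors, doc_vector, top_n):
    """Return the top_n candidates most similar to the document vector."""
    scored = [
        (text, cosine_similarity(vec, doc_vector))
        for text, vec in candidate_vectors.items()
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in scored[:top_n]]


# Made-up toy vectors, for illustration only.
doc_vector = [1.0, 1.0, 0.0]
candidate_vectors = {
    "natural language processing": [0.9, 1.0, 0.1],
    "fascinating domain": [0.1, 0.2, 1.0],
    "generate human language": [1.0, 0.8, 0.0],
}
print(rank_candidates(candidate_vectors, doc_vector, top_n=2))
```

Candidates whose vectors point in nearly the same direction as the document vector score close to 1 and are kept; the `top_n` cutoff plays the same role as the `top_n` configuration parameter.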
## References
- [KeyBERT: A Minimal Method for Keyphrase Extraction using BERT](https://github.com/MaartenGr/KeyBERT)