# kwx

| Field | Value |
| --- | --- |
| Name | kwx |
| Version | 1.0.2 |
| Home page | https://github.com/andrewtavis/kwx |
| Summary | BERT, LDA, and TFIDF based keyword extraction in Python |
| Upload time | 2023-01-28 18:34:42 |
| Author | Andrew Tavis McAllister |
| License | new BSD |
| Requirements | black, defusedxml, emoji, gensim, googletrans, ipython, keras, matplotlib, nltk, numpy, packaging, pandas, pyldavis, pytest, pytest-cov, scikit-learn, seaborn, sentence-transformers, spacy, stopwordsiso, tensorflow, tqdm, wordcloud, xlrd |

<div align="center">
  <a href="https://github.com/andrewtavis/kwx"><img src="https://raw.githubusercontent.com/andrewtavis/kwx/main/.github/resources/logo/kwx_logo_transparent.png" width=431 height=215></a>
</div>

<ol></ol>

[![rtd](https://img.shields.io/readthedocs/kwx.svg?logo=read-the-docs)](http://kwx.readthedocs.io/en/latest/)
[![ci](https://img.shields.io/github/actions/workflow/status/andrewtavis/kwx/.github/workflows/ci.yml?branch=main&logo=github)](https://github.com/andrewtavis/kwx/actions?query=workflow%3ACI)
[![codecov](https://codecov.io/gh/andrewtavis/kwx/branch/main/graphs/badge.svg)](https://codecov.io/gh/andrewtavis/kwx)
[![pyversions](https://img.shields.io/pypi/pyversions/kwx.svg?logo=python&logoColor=FFD43B&color=306998)](https://pypi.org/project/kwx/)
[![pypi](https://img.shields.io/pypi/v/kwx.svg?color=4B8BBE)](https://pypi.org/project/kwx/)
[![pypistatus](https://img.shields.io/pypi/status/kwx.svg)](https://pypi.org/project/kwx/)
[![license](https://img.shields.io/github/license/andrewtavis/kwx.svg)](https://github.com/andrewtavis/kwx/blob/main/LICENSE.txt)
[![coc](https://img.shields.io/badge/coc-Contributor%20Covenant-ff69b4.svg)](https://github.com/andrewtavis/kwx/blob/main/.github/CODE_OF_CONDUCT.md)
[![codestyle](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![colab](https://img.shields.io/badge/%20-Open%20in%20Colab-097ABB.svg?logo=google-colab&color=097ABB&labelColor=525252)](https://colab.research.google.com/github/andrewtavis/kwx)

## BERT, LDA, and TFIDF based keyword extraction in Python

**kwx** is a toolkit for multilingual keyword extraction based on Google's [BERT](https://github.com/google-research/bert), [Latent Dirichlet Allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) and [Term Frequency Inverse Document Frequency](https://en.wikipedia.org/wiki/Tf%E2%80%93idf). The package provides a suite of methods to process texts of any language, to a degree that varies by language, and then to extract and analyze keywords from the resulting corpus (see [kwx.languages](https://github.com/andrewtavis/kwx/blob/main/src/kwx/languages.py) for the level of support for each language). A distinguishing feature is that users decide which words to exclude from the outputs, which helps keep the results sensible and in line with user intuitions.

For a thorough overview of the process and techniques see the [Google slides](https://docs.google.com/presentation/d/1BNddaeipNQG1mUTjBYmrdpGC6xlBvAi3rapT88fkdBU/edit?usp=sharing), and reference the [documentation](https://kwx.readthedocs.io/en/latest/) for explanations of the models and visualization methods.

<a id="contents"></a>

# **Contents**

- [Installation](#installation)
- [Models](#models)
  - [BERT](#bert)
  - [LDA](#lda)
  - [TFIDF](#tfidf)
  - [Word Frequency](#word-frequency)
- [Usage](#usage)
  - [Text Cleaning](#text-cleaning)
  - [Keyword Extraction](#keyword-extraction)
- [Visuals](#visuals)
  - [Topic Number Evaluation](#topic-number-evaluation)
  - [t-SNE](#t-sne)
  - [pyLDAvis](#pyldavis)
  - [Word Cloud](#word-cloud)
- [To-Do](#to-do)

<a id="installation"></a>

# Installation [`⇧`](#contents)

kwx can be downloaded from PyPI via pip or sourced directly from this repository:

```bash
pip install kwx
```

```bash
git clone https://github.com/andrewtavis/kwx.git
cd kwx
python setup.py install
```

```python
import kwx
```

<a id="models"></a>

# Models [`⇧`](#contents)

Implemented NLP modeling methods within [kwx.model](https://github.com/andrewtavis/kwx/blob/main/src/kwx/model.py) include:

<a id="bert"></a>

### • BERT [`⇧`](#contents)

[Bidirectional Encoder Representations from Transformers](https://github.com/google-research/bert) derives representations of words from NLP models pretrained on open-source text such as Wikipedia. These representations are then leveraged to derive corpus topics.

kwx uses [sentence-transformers](https://github.com/UKPLab/sentence-transformers) pretrained models. See their GitHub and [documentation](https://www.sbert.net/) for the available models.

<a id="lda"></a>

### • LDA [`⇧`](#contents)

[Latent Dirichlet Allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. In the case of kwx, documents or text entries are posited to be a mixture of a given number of topics, and the presence of each word in a text body comes from its relation to these derived topics.

Although not as computationally robust as some machine learning models, LDA provides quick results that are suitable for many applications. Specifically for keyword extraction, in most settings the results are similar to those of BERT in a fraction of the time.
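
The generative assumption behind LDA can be illustrated with a minimal sketch. This is not kwx's or gensim's implementation: the two topics, their word probabilities, and the `generate_document` helper are hypothetical values chosen for illustration. Each word of a document is produced by first drawing a topic from the document's topic mixture and then drawing a word from that topic.

```python
import random

# Hypothetical toy topics: each maps words to probabilities (not learned by kwx).
TOPICS = {
    "delays": {"delay": 0.5, "late": 0.3, "hour": 0.2},
    "service": {"service": 0.5, "customer": 0.3, "love": 0.2},
}


def generate_document(topic_mixture, num_words, rng):
    """Sample words the way LDA assumes a document is produced."""
    topic_names = list(topic_mixture)
    topic_weights = [topic_mixture[name] for name in topic_names]
    words = []
    for _ in range(num_words):
        # 1. Pick a topic according to the document's topic mixture.
        topic = rng.choices(topic_names, weights=topic_weights, k=1)[0]
        # 2. Pick a word according to that topic's word distribution.
        vocab = list(TOPICS[topic])
        word_weights = [TOPICS[topic][word] for word in vocab]
        words.append(rng.choices(vocab, weights=word_weights, k=1)[0])
    return words


rng = random.Random(0)  # seeded so repeated runs give the same sample
doc = generate_document({"delays": 0.7, "service": 0.3}, num_words=8, rng=rng)
print(doc)
```

Fitting LDA runs this story in reverse: given only the documents, it estimates the topic mixtures and the per-topic word distributions, and the highest-probability words of each topic are natural keyword candidates.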

<a id="tfidf"></a>

### • TFIDF [`⇧`](#contents)

The user can also compute [Term Frequency Inverse Document Frequency](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) keywords: those that are distinctive of a text body relative to another body it is compared against. This is a useful baseline when a user has another text or text body to compare the target corpus against.
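
As a rough sketch of the idea, the following pure-Python example scores the words of a target corpus against one comparison corpus. It is an illustration under stated assumptions, not kwx's code path (the package lists scikit-learn among its requirements): the `tfidf_keywords` helper, the choice to treat each corpus as a single document, and the smoothed IDF formula are all assumptions made here.

```python
import math
from collections import Counter


def tfidf_keywords(target_corpus, corpuses_to_compare, num_keywords=3):
    """Rank words of the target by TF-IDF, treating each corpus as one document."""
    documents = [target_corpus] + list(corpuses_to_compare)
    tokenized = [" ".join(texts).lower().split() for texts in documents]

    # Document frequency: in how many corpora each word occurs.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    target_counts = Counter(tokenized[0])
    total = sum(target_counts.values())
    num_docs = len(tokenized)

    scores = {}
    for word, count in target_counts.items():
        tf = count / total
        # Smoothed IDF (assumed form): words shared by all corpora get weight 1.
        idf = math.log((1 + num_docs) / (1 + doc_freq[word])) + 1
        scores[word] = tf * idf

    # Sort by descending score, breaking ties alphabetically.
    ranked = sorted(scores, key=lambda word: (-scores[word], word))
    return ranked[:num_keywords]


airline = ["flight delay delay", "baggage lost flight"]
hotel = ["room service flight", "room clean"]
top_two = tfidf_keywords(airline, [hotel], num_keywords=2)
print(top_two)
```

Here "flight" and "delay" are equally frequent in the airline texts, but "flight" also appears in the comparison corpus, so "delay" ranks higher.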

<a id="word-frequency"></a>

### • Word Frequency [`⇧`](#contents)

Finally, a user can simply query the most common words from a text corpus. This method is used in kwx as a baseline to check model efficacy.
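
A frequency baseline of this kind can be sketched with the standard library alone. The `most_common_words` helper below is hypothetical and only illustrates the technique; it is not the function kwx uses.

```python
from collections import Counter


def most_common_words(text_corpus, num_keywords=3, ignore_words=None):
    """Return the most frequent whitespace-separated tokens in the corpus."""
    ignored = set(ignore_words or [])
    counts = Counter(
        token
        for text in text_corpus
        for token in text.lower().split()
        if token not in ignored
    )
    return [word for word, _ in counts.most_common(num_keywords)]


tweets = ["flight delay", "flight cancel delay", "flight"]
baseline = most_common_words(tweets, num_keywords=2)
print(baseline)
```

If a model's keywords are no more informative than this list, the extra cost of running the model is hard to justify.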

<a id="usage"></a>

# Usage [`⇧`](#contents)

Keyword extraction can be useful to analyze surveys, tweets and other kinds of social media posts, research papers, and further classes of texts. [examples/kw_extraction](https://github.com/andrewtavis/kwx/blob/main/examples/kw_extraction.ipynb) provides an example of how to use kwx by deriving keywords from tweets in the Kaggle [Twitter US Airline Sentiment](https://www.kaggle.com/crowdflower/twitter-airline-sentiment) dataset.

The following outlines using kwx to derive keywords from a text corpus with `prompt_remove_words` set to `True` (the user is asked whether any of the extracted words should be removed and replaced):

<a id="text-cleaning"></a>

### • Text Cleaning [`⇧`](#contents)

```python
from kwx.utils import prepare_data

input_language = "english" # see kwx.languages for options

# kwx.utils.clean() can be used on a list of lists
text_corpus = prepare_data(
    data="df_or_csv_xlsx_path",
    target_cols="cols_where_texts_are",
    input_language=input_language,
    min_token_freq=0,  # for BERT
    min_token_len=0,  # for BERT
    remove_stopwords=False,  # for BERT
    verbose=True,
)
```
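
To give a feel for what the `min_token_freq`, `min_token_len`, and `remove_stopwords` arguments plausibly control, here is a simplified, standard-library sketch. It is an assumption-laden illustration, not kwx's cleaning pipeline: the `simple_clean` helper, the tiny `STOPWORDS` set, and the regular-expression tokenizer are all hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stop word list (hypothetical, English only).
STOPWORDS = {"the", "a", "is", "was", "my", "and", "to"}


def simple_clean(texts, min_token_freq=1, min_token_len=1, remove_stopwords=True):
    """Lowercase, tokenize, and filter tokens by stop words, length, and frequency."""
    tokenized = [re.findall(r"[a-z']+", text.lower()) for text in texts]
    if remove_stopwords:
        tokenized = [[t for t in tokens if t not in STOPWORDS] for tokens in tokenized]

    # Frequencies are counted over the whole corpus, not per text.
    freqs = Counter(t for tokens in tokenized for t in tokens)

    cleaned = []
    for tokens in tokenized:
        kept = [
            t for t in tokens
            if len(t) >= min_token_len and freqs[t] >= min_token_freq
        ]
        cleaned.append(" ".join(kept))
    return cleaned


texts = ["The flight was late!", "My flight is late and cold."]
strict = simple_clean(texts, min_token_freq=2, min_token_len=3)
lenient = simple_clean(texts, min_token_freq=0, min_token_len=0, remove_stopwords=False)
print(strict)
print(lenient)
```

The lenient settings mirror the values marked `# for BERT` above: sentence embedding models work on natural sentences, so aggressive filtering is less useful for them than for bag-of-words methods such as LDA.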

<a id="keyword-extraction"></a>

### • Keyword Extraction [`⇧`](#contents)

```python
from kwx.model import extract_kws

num_keywords = 15
num_topics = 10
ignore_words = ["words", "user", "knows", "they", "don't", "want"]

# Remove n-grams for BERT training
corpus_no_ngrams = [
    " ".join([t for t in text.split(" ") if "_" not in t]) for text in text_corpus
]

# We can pass keywords for sentence_transformers.SentenceTransformer.encode,
# gensim.models.ldamulticore.LdaMulticore, or sklearn.feature_extraction.text.TfidfVectorizer
bert_kws = extract_kws(
    method="BERT", # "BERT", "LDA", "TFIDF", "frequency"
    bert_st_model="xlm-r-bert-base-nli-stsb-mean-tokens",
    text_corpus=corpus_no_ngrams,  # automatically tokenized if using LDA
    input_language=input_language,
    output_language=None,  # allows the output to be translated
    num_keywords=num_keywords,
    num_topics=num_topics,
    corpuses_to_compare=None,  # for TFIDF
    ignore_words=ignore_words,
    prompt_remove_words=True,  # check words with user
    show_progress_bar=True,
    batch_size=32,
)
```

```_output
The BERT keywords are:

['time', 'flight', 'plane', 'southwestair', 'ticket', 'cancel', 'united', 'baggage',
'love', 'virginamerica', 'service', 'customer', 'delay', 'late', 'hour']

Should words be removed [y/n]? y
Type or copy word(s) to be removed: southwestair, united, virginamerica

The new BERT keywords are:

['late', 'baggage', 'service', 'flight', 'time', 'love', 'book', 'customer',
'response', 'hold', 'hour', 'cancel', 'cancelled_flighted', 'delay', 'plane']

Should words be removed [y/n]? n
```

The model is rerun until every word the user considers unreasonable has been removed. [kwx.model.gen_files](https://github.com/andrewtavis/kwx/blob/main/src/kwx/model.py) can also be used as a run-all function that produces a directory with a keyword text file and visuals (for experienced users wanting quick results).
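
The rerun-after-removal loop can be sketched without user interaction as follows. This is a hypothetical outline rather than kwx's implementation: `extract_with_removals`, `toy_extract`, and the `RANKED` list are invented for illustration, and each entry of `removal_rounds` stands in for one "y" answer at the prompt.

```python
def extract_with_removals(extract, num_keywords, ignore_words, removal_rounds):
    """Rerun an extractor, growing the ignore list after each round of removals."""
    ignore = list(ignore_words)
    keywords = extract(num_keywords, ignore)
    for words in removal_rounds:
        ignore.extend(words)
        # Rerun with the longer ignore list so new words can fill the gaps.
        keywords = extract(num_keywords, ignore)
    return keywords, ignore


# Toy extractor: a fixed ranking stands in for a real model's output.
RANKED = ["time", "flight", "southwestair", "ticket", "united", "baggage", "delay"]


def toy_extract(num_keywords, ignore_words):
    ignored = set(ignore_words)
    return [word for word in RANKED if word not in ignored][:num_keywords]


keywords, ignored = extract_with_removals(
    toy_extract,
    num_keywords=4,
    ignore_words=["user"],
    removal_rounds=[["southwestair", "united"]],
)
print(keywords)
```

Keeping the accumulated ignore list is useful in practice, since it can be passed as `ignore_words` on later runs to skip the prompt.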

<a id="visuals"></a>

# Visuals [`⇧`](#contents)

[kwx.visuals](https://github.com/andrewtavis/kwx/blob/main/src/kwx/visuals.py) includes the following functions for presenting and analyzing the results of keyword extraction:

<a id="topic-number-evaluation"></a>

### • Topic Number Evaluation [`⇧`](#contents)

A graph of topic coherence and overlap across a range of topic numbers, which helps in choosing how many topics to derive keywords from.

```python
from kwx.visuals import graph_topic_num_evals
import matplotlib.pyplot as plt

graph_topic_num_evals(
    method=["lda", "bert"],
    text_corpus=text_corpus,
    num_keywords=num_keywords,
    topic_nums_to_compare=list(range(5, 15)),
    metrics=True, #  stability and coherence
)
plt.show()
```

<p align="middle">
  <img src="https://raw.githubusercontent.com/andrewtavis/kwx/main/.github/resources/images/topic_num_eval.png" width="600" />
</p>
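
One simple way to quantify overlap between topics is the average Jaccard similarity of their keyword sets. The sketch below assumes that definition purely for illustration; the metrics kwx actually plots may be computed differently, and `mean_topic_overlap` is a hypothetical helper.

```python
from itertools import combinations


def mean_topic_overlap(topic_keywords):
    """Average Jaccard overlap between every pair of topics' keyword sets."""
    pairs = list(combinations(topic_keywords, 2))
    if not pairs:
        return 0.0
    overlaps = []
    for first, second in pairs:
        a, b = set(first), set(second)
        union = a | b
        overlaps.append(len(a & b) / len(union) if union else 0.0)
    return sum(overlaps) / len(pairs)


topics = [
    ["flight", "delay", "late"],
    ["flight", "baggage", "lost"],
    ["service", "customer", "love"],
]
overlap = mean_topic_overlap(topics)
print(round(overlap, 3))
```

A low value suggests the topics are distinct, while a value near 1 suggests that the chosen number of topics splits the corpus into near-duplicates.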

<a id="t-sne"></a>

### • t-SNE [`⇧`](#contents)

[t-SNE](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding) allows the user to visualize their topic distribution in both two and three dimensions. Currently available just for LDA, this technique provides another check for model suitability.

```python
from kwx.visuals import t_sne
import matplotlib.pyplot as plt

t_sne(
    dimension="both",  # 2d and 3d are options
    text_corpus=text_corpus,
    num_topics=10,
    remove_3d_outliers=True,
)
plt.show()
```

<p align="middle">
  <img src="https://raw.githubusercontent.com/andrewtavis/kwx/main/.github/resources/images/t_sne.png" width="600" />
</p>

<a id="pyldavis"></a>

### • pyLDAvis [`⇧`](#contents)

[pyLDAvis](https://github.com/bmabey/pyLDAvis) is included so that users can inspect LDA extracted topics, and further so that it can easily be generated for output files.

```python
from kwx.visuals import pyLDAvis_topics

pyLDAvis_topics(
    method="lda",
    text_corpus=text_corpus,
    num_topics=10,
    display_ipython=False,  # For Jupyter integration
)
```

<p align="middle">
  <img src="https://raw.githubusercontent.com/andrewtavis/kwx/main/.github/resources/images/pyLDAvis.png" width="600" />
</p>

<a id="word-cloud"></a>

### • Word Cloud [`⇧`](#contents)

Word clouds via [wordcloud](https://github.com/amueller/word_cloud) are included for a basic representation of the text corpus, as a way to convey basic visual information to potential stakeholders. The following figure from [examples/kw_extraction](https://github.com/andrewtavis/kwx/blob/main/examples/kw_extraction.ipynb) shows a word cloud generated from tweets of US air carrier passengers:

```python
from kwx.visuals import gen_word_cloud

ignore_words = ["words", "user", "knows", "they", "don't", "want"]

gen_word_cloud(
    text_corpus=text_corpus,
    ignore_words=ignore_words,  # pass the list defined above
    height=500,
)
```

<p align="middle">
  <img src="https://raw.githubusercontent.com/andrewtavis/kwx/main/.github/resources/images/word_cloud.png" width="600" />
</p>

<a id="to-do"></a>

# To-Do [`⇧`](#contents)

Please see the [contribution guidelines](https://github.com/andrewtavis/kwx/blob/main/.github/CONTRIBUTING.md) if you are interested in contributing to this project. Work that is in progress or could be implemented includes:

- Including more methods to extract keywords [(see issue)](https://github.com/andrewtavis/kwx/issues/17)

- Adding key phrase extraction as an option for [kwx.model.extract_kws](https://github.com/andrewtavis/kwx/blob/main/src/kwx/model.py) [(see issues)](https://github.com/andrewtavis/kwx/issues/)

- Adding t-SNE and pyLDAvis style visualizations for BERT models [(see issue)](https://github.com/andrewtavis/kwx/issues/45)

- Converting the translation feature over to use another translation API rather than [py-googletrans](https://github.com/ssut/py-googletrans) [(see issue)](https://github.com/andrewtavis/kwx/issues/44)

- Updates to [kwx.languages](https://github.com/andrewtavis/kwx/blob/main/src/kwx/languages.py) as lemmatization and other linguistic package dependencies evolve

- Creating, improving and sharing [examples](https://github.com/andrewtavis/kwx/tree/main/examples)

- Improving [tests](https://github.com/andrewtavis/kwx/tree/main/tests) for greater [code coverage](https://codecov.io/gh/andrewtavis/kwx)

- Updating and refining the [documentation](https://kwx.readthedocs.io/en/latest/)



            

Raw data

{
    "_id": null,
    "home_page": "https://github.com/andrewtavis/kwx",
    "name": "kwx",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "keywords": "",
    "author": "Andrew Tavis McAllister",
    "author_email": "andrew.t.mcallister@gmail.com",
    "download_url": "",
    "platform": null,
    "bugtrack_url": null,
    "license": "new BSD",
    "summary": "BERT, LDA, and TFIDF based keyword extraction in Python",
    "version": "1.0.2",
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "a85c8fb9cf481d0c807e129fd3113acff3b6a383c369c7aa54c7a01753905b09",
                "md5": "f990ac832eb227ba269e2874c98d91ec",
                "sha256": "63ac01fd7609f461a820da8f16937cfaae41a69380197e7bdeff695b70837d0d"
            },
            "downloads": -1,
            "filename": "kwx-1.0.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "f990ac832eb227ba269e2874c98d91ec",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 29638,
            "upload_time": "2023-01-28T18:34:42",
            "upload_time_iso_8601": "2023-01-28T18:34:42.985481Z",
            "url": "https://files.pythonhosted.org/packages/a8/5c/8fb9cf481d0c807e129fd3113acff3b6a383c369c7aa54c7a01753905b09/kwx-1.0.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-01-28 18:34:42",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "github_user": "andrewtavis",
    "github_project": "kwx",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [
        {
            "name": "black",
            "specs": [
                [
                    "==",
                    "20.8b1"
                ]
            ]
        },
        {
            "name": "defusedxml",
            "specs": [
                [
                    "==",
                    "0.7.1"
                ]
            ]
        },
        {
            "name": "emoji",
            "specs": [
                [
                    "==",
                    "2.2.0"
                ]
            ]
        },
        {
            "name": "gensim",
            "specs": [
                [
                    ">=",
                    "3.8.0"
                ]
            ]
        },
        {
            "name": "googletrans",
            "specs": [
                [
                    "==",
                    "4.0.0rc1"
                ]
            ]
        },
        {
            "name": "ipython",
            "specs": [
                [
                    "==",
                    "7.31.1"
                ]
            ]
        },
        {
            "name": "keras",
            "specs": [
                [
                    "==",
                    "2.6.0"
                ]
            ]
        },
        {
            "name": "matplotlib",
            "specs": [
                [
                    "==",
                    "3.3.2"
                ]
            ]
        },
        {
            "name": "nltk",
            "specs": [
                [
                    ">=",
                    "3.6.5"
                ]
            ]
        },
        {
            "name": "numpy",
            "specs": [
                [
                    ">=",
                    "1.20.2"
                ]
            ]
        },
        {
            "name": "packaging",
            "specs": [
                [
                    "==",
                    "20.9"
                ]
            ]
        },
        {
            "name": "pandas",
            "specs": [
                [
                    "==",
                    "1.2.1"
                ]
            ]
        },
        {
            "name": "pyldavis",
            "specs": [
                [
                    ">=",
                    "3.3.1"
                ]
            ]
        },
        {
            "name": "pytest",
            "specs": [
                [
                    "==",
                    "6.2.2"
                ]
            ]
        },
        {
            "name": "pytest-cov",
            "specs": [
                [
                    "==",
                    "2.11.1"
                ]
            ]
        },
        {
            "name": "scikit-learn",
            "specs": [
                [
                    "==",
                    "0.23.2"
                ]
            ]
        },
        {
            "name": "seaborn",
            "specs": [
                [
                    "==",
                    "0.11.1"
                ]
            ]
        },
        {
            "name": "sentence-transformers",
            "specs": [
                [
                    "==",
                    "0.4.1.2"
                ]
            ]
        },
        {
            "name": "spacy",
            "specs": [
                [
                    "==",
                    "2.3.5"
                ]
            ]
        },
        {
            "name": "stopwordsiso",
            "specs": [
                [
                    "==",
                    "0.6.1"
                ]
            ]
        },
        {
            "name": "tensorflow",
            "specs": [
                [
                    ">=",
                    "2.6.0"
                ]
            ]
        },
        {
            "name": "tqdm",
            "specs": [
                [
                    "==",
                    "4.56.1"
                ]
            ]
        },
        {
            "name": "wordcloud",
            "specs": [
                [
                    "==",
                    "1.8.1"
                ]
            ]
        },
        {
            "name": "xlrd",
            "specs": [
                [
                    "==",
                    "2.0.1"
                ]
            ]
        }
    ],
    "lcname": "kwx"
}
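
The `requirements` entries in the raw data pair each package name with a list of `[operator, version]` specs. As a small sketch, they can be turned into pip-style requirement strings like this; the `to_pip_specifiers` helper is hypothetical, and the embedded JSON is a two-entry excerpt of the data above.

```python
import json

# Two entries excerpted from the "requirements" array of the raw data.
raw = """
{
  "requirements": [
    {"name": "black", "specs": [["==", "20.8b1"]]},
    {"name": "gensim", "specs": [[">=", "3.8.0"]]}
  ]
}
"""


def to_pip_specifiers(metadata):
    """Convert requirement entries into strings such as 'black==20.8b1'."""
    lines = []
    for requirement in metadata["requirements"]:
        # Multiple specs for one package are joined with commas.
        constraints = ",".join(op + version for op, version in requirement["specs"])
        lines.append(requirement["name"] + constraints)
    return lines


specifiers = to_pip_specifiers(json.loads(raw))
print("\n".join(specifiers))
```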
        