[![](https://img.shields.io/pypi/v/top2vec.svg)](https://pypi.org/project/top2vec/)
[![](https://img.shields.io/pypi/l/top2vec.svg)](https://github.com/ddangelov/Top2Vec/blob/master/LICENSE)
[![](https://readthedocs.org/projects/top2vec/badge/?version=latest)](https://top2vec.readthedocs.io/en/latest/?badge=latest)
[![](https://img.shields.io/badge/arXiv-2008.09470-00ff00.svg)](http://arxiv.org/abs/2008.09470)
<!--![](https://raw.githubusercontent.com/ddangelov/Top2Vec/master/images/top2vec_logo.png?sanitize=true)-->
<p align="center">
<img src="https://raw.githubusercontent.com/ddangelov/Top2Vec/master/images/top2vec_logo.svg" alt="" width=600 height="whatever">
</p>
# Contextual Top2Vec Overview
Paper: [Topic Modeling: Contextual Token Embeddings Are All You Need](https://aclanthology.org/2024.findings-emnlp.790.pdf)
The Top2Vec library now supports a contextual version with deeper topic modeling capabilities. **Contextual Top2Vec** generates **contextual token embeddings** for each document, identifying multiple topics per document and even detecting topic segments within a document. This enhancement is useful for capturing a nuanced understanding of topics, especially in documents that cover multiple themes.
### Key Features of Contextual Top2Vec
- **`contextual_top2vec` flag**: A new parameter, `contextual_top2vec`, has been added to the Top2Vec class. When set to `True`, the model uses contextual token embeddings. Only the following embedding models are supported:
  - `all-MiniLM-L6-v2`
  - `all-mpnet-base-v2`
- **Topic Spans**: C-Top2Vec automatically determines the number of topics and finds topic segments within documents, allowing for more granular topic discovery.
### Simple Usage Example
Here is a simple example of how to use Contextual Top2Vec:
```python
from top2vec import Top2Vec
# Create a Contextual Top2Vec model
top2vec_model = Top2Vec(documents=documents,
                        ngram_vocab=True,
                        contextual_top2vec=True)
```
### New Methods for Contextual Top2Vec
#### `get_document_topic_distribution()`
```python
get_document_topic_distribution() -> np.ndarray
```
- **Description**: Retrieves the topic distribution for each document.
- **Returns**: A `numpy.ndarray` of shape `(num_documents, num_topics)`. Each row represents the **probability distribution of topics** for a document.
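As an illustration of how to read this matrix, here is a minimal sketch using a hand-made distribution in place of the model's output; the values and topic count are invented for the example:

```python
# Illustrative only: a hand-made document-topic matrix standing in for the
# array returned by get_document_topic_distribution().
dist = [
    [0.70, 0.20, 0.10],   # document 0
    [0.05, 0.15, 0.80],   # document 1
]

# Each row is a probability distribution over topics, so it sums to 1.
for row in dist:
    assert abs(sum(row) - 1.0) < 1e-9

# The dominant topic of a document is the column with the highest probability.
dominant = [max(range(len(row)), key=row.__getitem__) for row in dist]
print(dominant)  # [0, 2]
```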
#### `get_document_topic_relevance()`
```python
get_document_topic_relevance() -> np.ndarray
```
- **Description**: Provides the relevance of each topic for each document.
- **Returns**: A `numpy.ndarray` of shape `(num_documents, num_topics)`. Each row indicates the **relevance scores of topics** for a document.
#### `get_document_token_topic_assignment()`
```python
get_document_token_topic_assignment() -> List[Document]
```
- **Description**: Retrieves token-level topic assignments for each document.
- **Returns**: A list of `Document` objects, each containing topics with **token assignments and scores** for each token.
#### `get_document_tokens()`
```python
get_document_tokens() -> List[List[str]]
```
- **Description**: Returns the tokens for each document.
- **Returns**: A list of lists where each sublist contains the **tokens for a given document**.
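The relationship between these token-level outputs and the document-level distribution can be sketched in plain Python. The tokens and topic ids below are hypothetical, not output from the library:

```python
from collections import Counter

# Hypothetical token-level topic assignments for one document, standing in
# for what get_document_tokens() and get_document_token_topic_assignment()
# describe: one topic id per token.
tokens = ["the", "rocket", "launch", "was", "delayed", "by", "rain"]
token_topics = [0, 1, 1, 0, 1, 0, 2]

# Normalizing the per-token topic counts yields a document-level topic
# distribution analogous to a row of get_document_topic_distribution().
counts = Counter(token_topics)
dist = {topic: count / len(token_topics) for topic, count in counts.items()}
print(dist)
```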
### Usage Note
The **contextual version** of Top2Vec requires one of the specific embedding models listed above. The new methods provide insight into the distribution, relevance, and assignment of topics at both the document and token level, allowing for a richer understanding of the data.
> Warning: Contextual Top2Vec is still in **beta**. You may encounter issues or unexpected behavior, and the functionality may change in future updates.
Citation
--------
```
@inproceedings{angelov-inkpen-2024-topic,
title = "Topic Modeling: Contextual Token Embeddings Are All You Need",
author = "Angelov, Dimo and
Inkpen, Diana",
editor = "Al-Onaizan, Yaser and
Bansal, Mohit and
Chen, Yun-Nung",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2024",
month = nov,
year = "2024",
address = "Miami, Florida, USA",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.findings-emnlp.790",
pages = "13528--13539",
abstract = "The goal of topic modeling is to find meaningful topics that capture the information present in a collection of documents. The main challenges of topic modeling are finding the optimal number of topics, labeling the topics, segmenting documents by topic, and evaluating topic model performance. Current neural approaches have tackled some of these problems but none have been able to solve all of them. We introduce a novel topic modeling approach, Contextual-Top2Vec, which uses document contextual token embeddings, it creates hierarchical topics, finds topic spans within documents and labels topics with phrases rather than just words. We propose the use of BERTScore to evaluate topic coherence and to evaluate how informative topics are of the underlying documents. Our model outperforms the current state-of-the-art models on a comprehensive set of topic model evaluation metrics.",
}
```
Classic Top2Vec
===============
Top2Vec is an algorithm for **topic modeling** and **semantic search**. It automatically detects topics present in text
and generates jointly embedded topic, document and word vectors. Once you train the Top2Vec model
you can:
* Get number of detected topics.
* Get topics.
* Get topic sizes.
* Get hierarchical topics.
* Search topics by keywords.
* Search documents by topic.
* Search documents by keywords.
* Find similar words.
* Find similar documents.
* Expose model with [RESTful-Top2Vec](https://github.com/ddangelov/RESTful-Top2Vec)
See the [paper](http://arxiv.org/abs/2008.09470) for more details on how it works.
Benefits
--------
1. Automatically finds number of topics.
2. No stop word lists required.
3. No need for stemming/lemmatization.
4. Works on short text.
5. Creates jointly embedded topic, document, and word vectors.
6. Has search functions built in.
How does it work?
-----------------
The assumption the algorithm makes is that many semantically similar documents
are indicative of an underlying topic. The first step is to create a joint embedding of
document and word vectors. Once documents and words are embedded in a vector
space the goal of the algorithm is to find dense clusters of documents, then identify which
words attracted those documents together. Each dense area is a topic and the words that
attracted the documents to the dense area are the topic words.
### The Algorithm:
#### 1. Create jointly embedded document and word vectors using [Doc2Vec](https://radimrehurek.com/gensim/models/doc2vec.html) or [Universal Sentence Encoder](https://tfhub.dev/google/collections/universal-sentence-encoder/1) or [BERT Sentence Transformer](https://www.sbert.net/).
>Documents will be placed close to other similar documents and close to the most distinguishing words.
<!--![](https://raw.githubusercontent.com/ddangelov/Top2Vec/master/images/doc_word_embedding.svg?sanitize=true)-->
<p align="center">
<img src="https://raw.githubusercontent.com/ddangelov/Top2Vec/master/images/doc_word_embedding.svg?sanitize=true" alt="" width=600 height="whatever">
</p>
#### 2. Create lower dimensional embedding of document vectors using [UMAP](https://github.com/lmcinnes/umap).
>Document vectors in high dimensional space are very sparse; dimension reduction helps find dense areas. Each point is a document vector.
<!--![](https://raw.githubusercontent.com/ddangelov/Top2Vec/master/images/umap_docs.png)-->
<p align="center">
<img src="https://raw.githubusercontent.com/ddangelov/Top2Vec/master/images/umap_docs.png" alt="" width=700 height="whatever">
</p>
#### 3. Find dense areas of documents using [HDBSCAN](https://github.com/scikit-learn-contrib/hdbscan).
>The colored areas are the dense areas of documents. Red points are outliers that do not belong to a specific cluster.
<!--![](https://raw.githubusercontent.com/ddangelov/Top2Vec/master/images/hdbscan_docs.png)-->
<p align="center">
<img src="https://raw.githubusercontent.com/ddangelov/Top2Vec/master/images/hdbscan_docs.png" alt="" width=700 height="whatever">
</p>
#### 4. For each dense area, calculate the centroid of the document vectors in the original dimension; this is the topic vector.
>The red points are outlier documents and do not get used for calculating the topic vector. The purple points are the document vectors that belong to a dense area, from which the topic vector is calculated.
<!--![](https://raw.githubusercontent.com/ddangelov/Top2Vec/master/images/topic_vector.svg?sanitize=true)-->
<p align="center">
<img src="https://raw.githubusercontent.com/ddangelov/Top2Vec/master/images/topic_vector.svg?sanitize=true" alt="" width=600 height="whatever">
</p>
#### 5. Find n-closest word vectors to the resulting topic vector.
>The closest word vectors in order of proximity become the topic words.
<!--![](https://raw.githubusercontent.com/ddangelov/Top2Vec/master/images/topic_words.svg?sanitize=true)-->
<p align="center">
<img src="https://raw.githubusercontent.com/ddangelov/Top2Vec/master/images/topic_words.svg?sanitize=true" alt="" width=600 height="whatever">
</p>
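Steps 4 and 5 can be sketched in a few lines of plain Python on toy vectors; the 2-dimensional vectors and the tiny vocabulary below are invented for illustration and are far smaller than real embeddings:

```python
from math import sqrt

def cosine(a, b):
    # Cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# Toy document vectors belonging to one dense cluster (step 4).
cluster_docs = [[1.0, 0.1], [0.9, 0.0], [1.1, 0.2]]

# The topic vector is the centroid of the cluster's document vectors.
topic_vector = [sum(dim) / len(cluster_docs) for dim in zip(*cluster_docs)]

# Toy word vectors living in the same space (step 5).
word_vectors = {"space": [1.0, 0.1], "nasa": [0.8, 0.3], "cooking": [0.0, 1.0]}

# The topic words are the words closest to the topic vector, by cosine similarity.
ranked = sorted(word_vectors,
                key=lambda w: cosine(word_vectors[w], topic_vector),
                reverse=True)
print(ranked)  # ['space', 'nasa', 'cooking']
```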
Installation
------------
The easiest way to install Top2Vec is:

    pip install top2vec

To install pre-trained universal sentence encoder options:

    pip install top2vec[sentence_encoders]

To install pre-trained BERT sentence transformer options:

    pip install top2vec[sentence_transformers]

To install indexing options:

    pip install top2vec[indexing]
Usage
-----
```python
from top2vec import Top2Vec
model = Top2Vec(documents)
```
Important parameters:
* ``documents``: Input corpus, should be a list of strings.
* ``speed``: This parameter determines how long the model takes to train.
  The 'fast-learn' option is the fastest and will generate the lowest quality
  vectors. The 'learn' option will learn better quality vectors but take a longer
  time to train. The 'deep-learn' option will learn the best quality vectors but
  will take significant time to train.
* ``workers``: The number of worker threads to be used in training the model. A larger
  number will lead to faster training.
> Trained models can be saved and loaded.
```python
model.save("filename")
model = Top2Vec.load("filename")
```
For more information view the [API guide](https://top2vec.readthedocs.io/en/latest/api.html).
Pretrained Embedding Models <a name="pretrained"></a>
-----------------
Doc2Vec will be used by default to generate the joint word and document embeddings. However, there are also pretrained `embedding_model` options for generating joint word and document embeddings:
* `universal-sentence-encoder`
* `universal-sentence-encoder-multilingual`
* `distiluse-base-multilingual-cased`
```python
from top2vec import Top2Vec
model = Top2Vec(documents, embedding_model='universal-sentence-encoder')
```
For large data sets, and data sets with very distinctive vocabulary, Doc2Vec could
produce better results. This will train a Doc2Vec model from scratch. This method
is language agnostic; however, multiple languages will not be aligned.
Using the universal sentence encoder options will be much faster since those are
pre-trained and efficient models. The universal sentence encoder options are
suggested for smaller data sets. They are also good options for large data sets
that are in English or in languages covered by the multilingual model, as well
as for multilingual data sets.
The distiluse-base-multilingual-cased pre-trained sentence transformer is suggested
for multilingual datasets and languages that are not covered by the multilingual
universal sentence encoder. The transformer is significantly slower than
the universal sentence encoder options.
More information on [universal-sentence-encoder](https://tfhub.dev/google/universal-sentence-encoder/4), [universal-sentence-encoder-multilingual](https://tfhub.dev/google/universal-sentence-encoder-multilingual/3), and [distiluse-base-multilingual-cased](https://www.sbert.net/docs/pretrained_models.html).
Citation
-----------------
If you would like to cite Top2Vec in your work, this is the current reference:
```bibtex
@article{angelov2020top2vec,
title={Top2Vec: Distributed Representations of Topics},
author={Dimo Angelov},
year={2020},
eprint={2008.09470},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
Example
-------
### Train Model
Train a Top2Vec model on the 20newsgroups dataset.
```python
from top2vec import Top2Vec
from sklearn.datasets import fetch_20newsgroups
newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
model = Top2Vec(documents=newsgroups.data, speed="learn", workers=8)
```
### Get Number of Topics
This will return the number of topics that Top2Vec has found in the data.
```python
>>> model.get_num_topics()
77
```
### Get Topic Sizes
This will return the number of documents most similar to each topic. Topics are
in decreasing order of size.
```python
topic_sizes, topic_nums = model.get_topic_sizes()
```
Returns:
* ``topic_sizes``: The number of documents most similar to each topic.
* ``topic_nums``: The unique index of every topic will be returned.
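What "number of documents most similar to each topic" means can be sketched with a toy assignment of documents to their nearest topic; the topic ids below are hypothetical:

```python
from collections import Counter

# Hypothetical nearest-topic assignment for 7 documents (list index = document id).
nearest_topic = [0, 2, 2, 1, 2, 0, 2]

# Count documents per topic, then sort topics in decreasing order of size,
# mirroring the (topic_sizes, topic_nums) pair returned by get_topic_sizes().
pairs = sorted(Counter(nearest_topic).items(), key=lambda kv: kv[1], reverse=True)
topic_nums = [topic for topic, _ in pairs]
topic_sizes = [size for _, size in pairs]
print(topic_sizes, topic_nums)  # [4, 2, 1] [2, 0, 1]
```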
### Get Topics
This will return the topics in decreasing order of size.
```python
topic_words, word_scores, topic_nums = model.get_topics(77)
```
Returns:
* ``topic_words``: For each topic the top 50 words are returned, in order
of semantic similarity to topic.
* ``word_scores``: For each topic the cosine similarity scores of the
top 50 words to the topic are returned.
* ``topic_nums``: The unique index of every topic will be returned.
### Search Topics
We are going to search for topics most similar to **medicine**.
```python
topic_words, word_scores, topic_scores, topic_nums = model.search_topics(keywords=["medicine"], num_topics=5)
```
Returns:
* ``topic_words``: For each topic the top 50 words are returned, in order
of semantic similarity to topic.
* ``word_scores``: For each topic the cosine similarity scores of the
top 50 words to the topic are returned.
* ``topic_scores``: For each topic the cosine similarity to the search keywords will be returned.
* ``topic_nums``: The unique index of every topic will be returned.
```python
>>> topic_nums
[21, 29, 9, 61, 48]
>>> topic_scores
[0.4468, 0.381, 0.2779, 0.2566, 0.2515]
```
> Topic 21 was the most similar topic to "medicine", with a cosine similarity of 0.4468. (Values range from 0, least similar, to 1, most similar.)
### Generate Word Clouds
Using a topic number you can generate a word cloud. We are going to generate word clouds for the top 5 most similar topics to our **medicine** topic search from above.
```python
topic_words, word_scores, topic_scores, topic_nums = model.search_topics(keywords=["medicine"], num_topics=5)
for topic in topic_nums:
model.generate_topic_wordcloud(topic)
```
<!--![](https://raw.githubusercontent.com/ddangelov/Top2Vec/master/images/topic21.png)
![](https://raw.githubusercontent.com/ddangelov/Top2Vec/master/images/topic29.png)
![](https://raw.githubusercontent.com/ddangelov/Top2Vec/master/images/topic9.png)
![](https://raw.githubusercontent.com/ddangelov/Top2Vec/master/images/topic61.png)
![](https://raw.githubusercontent.com/ddangelov/Top2Vec/master/images/topic48.png)-->
<img src="https://raw.githubusercontent.com/ddangelov/Top2Vec/master/images/topic21.png" alt="" width=700 height="whatever">
<img src="https://raw.githubusercontent.com/ddangelov/Top2Vec/master/images/topic29.png" alt="" width=700 height="whatever">
<img src="https://raw.githubusercontent.com/ddangelov/Top2Vec/master/images/topic9.png" alt="" width=700 height="whatever">
<img src="https://raw.githubusercontent.com/ddangelov/Top2Vec/master/images/topic61.png" alt="" width=700 height="whatever">
<img src="https://raw.githubusercontent.com/ddangelov/Top2Vec/master/images/topic48.png" alt="" width=700 height="whatever">
### Search Documents by Topic
We are going to search by **topic 48**, a topic that appears to be about **science**.
```python
documents, document_scores, document_ids = model.search_documents_by_topic(topic_num=48, num_docs=5)
```
Returns:
* ``documents``: The documents in a list, most similar first.
* ``document_scores``: Semantic similarity of each document to the topic; the cosine
similarity of the document and topic vectors.
* ``document_ids``: Unique ids of the documents. If ids were not given, the index of
the document in the original corpus.
For each of the returned documents we are going to print its content, score and document number.
```python
documents, document_scores, document_ids = model.search_documents_by_topic(topic_num=48, num_docs=5)
for doc, score, doc_id in zip(documents, document_scores, document_ids):
print(f"Document: {doc_id}, Score: {score}")
print("-----------")
print(doc)
print("-----------")
print()
```
Document: 15227, Score: 0.6322
-----------
Evolution is both fact and theory. The THEORY of evolution represents the
scientific attempt to explain the FACT of evolution. The theory of evolution
does not provide facts; it explains facts. It can be safely assumed that ALL
scientific theories neither provide nor become facts but rather EXPLAIN facts.
I recommend that you do some appropriate reading in general science. A good
starting point with regard to evolution for the layman would be "Evolution as
Fact and Theory" in "Hen's Teeth and Horse's Toes" [pp 253-262] by Stephen Jay
Gould. There is a great deal of other useful information in this publication.
-----------
Document: 14515, Score: 0.6186
-----------
Just what are these "scientific facts"? I have never heard of such a thing.
Science never proves or disproves any theory - history does.
-Tim
-----------
Document: 9433, Score: 0.5997
-----------
The same way that any theory is proven false. You examine the predicitions
that the theory makes, and try to observe them. If you don't, or if you
observe things that the theory predicts wouldn't happen, then you have some
evidence against the theory. If the theory can't be modified to
incorporate the new observations, then you say that it is false.
For example, people used to believe that the earth had been created
10,000 years ago. But, as evidence showed that predictions from this
theory were not true, it was abandoned.
-----------
Document: 11917, Score: 0.5845
-----------
The point about its being real or not is that one does not waste time with
what reality might be when one wants predictions. The questions if the
atoms are there or if something else is there making measurements indicate
atoms is not necessary in such a system.
And one does not have to write a new theory of existence everytime new
models are used in Physics.
-----------
...
### Semantic Search Documents by Keywords
Search documents for content semantically similar to **cryptography** and **privacy**.
```python
documents, document_scores, document_ids = model.search_documents_by_keywords(keywords=["cryptography", "privacy"], num_docs=5)
for doc, score, doc_id in zip(documents, document_scores, document_ids):
print(f"Document: {doc_id}, Score: {score}")
print("-----------")
print(doc)
print("-----------")
print()
```
Document: 16837, Score: 0.6112
-----------
...
Email and account privacy, anonymity, file encryption, academic
computer policies, relevant legislation and references, EFF, and
other privacy and rights issues associated with use of the Internet
and global networks in general.
...
Document: 16254, Score: 0.5722
-----------
...
The President today announced a new initiative that will bring
the Federal Government together with industry in a voluntary
program to improve the security and privacy of telephone
communications while meeting the legitimate needs of law
enforcement.
...
-----------
...
### Similar Keywords
Search for similar words to **space**.
```python
words, word_scores = model.similar_words(keywords=["space"], keywords_neg=[], num_words=20)
for word, score in zip(words, word_scores):
print(f"{word} {score}")
```
space 1.0
nasa 0.6589
shuttle 0.5976
exploration 0.5448
planetary 0.5391
missions 0.5069
launch 0.4941
telescope 0.4821
astro 0.4696
jsc 0.4549
ames 0.4515
satellite 0.446
station 0.4445
orbital 0.4438
solar 0.4386
astronomy 0.4378
observatory 0.4355
facility 0.4325
propulsion 0.4251
aerospace 0.4226
Raw data
{
"_id": null,
"home_page": "https://github.com/ddangelov/Top2Vec",
"name": "top2vec",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.10",
"maintainer_email": null,
"keywords": "topic modeling semantic search word document embedding",
"author": "Dimo Angelov",
"author_email": "dimo.angelov@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/bf/d6/d6b4b4b3b73fb6d48bd5aba7c003630383bae8c418bef0e197318a5c4539/top2vec-1.0.36.tar.gz",
"platform": null,
"description": "[![](https://img.shields.io/pypi/v/top2vec.svg)](https://pypi.org/project/top2vec/)\n[![](https://img.shields.io/pypi/l/top2vec.svg)](https://github.com/ddangelov/Top2Vec/blob/master/LICENSE)\n[![](https://readthedocs.org/projects/top2vec/badge/?version=latest)](https://top2vec.readthedocs.io/en/latest/?badge=latest)\n[![](https://img.shields.io/badge/arXiv-2008.09470-00ff00.svg)](http://arxiv.org/abs/2008.09470)\n\n\n<!--![](https://raw.githubusercontent.com/ddangelov/Top2Vec/master/images/top2vec_logo.png?sanitize=true)-->\n<p align=\"center\">\n <img src=\"https://raw.githubusercontent.com/ddangelov/Top2Vec/master/images/top2vec_logo.svg\" alt=\"\" width=600 height=\"whatever\">\n</p>\n\n# Contextual Top2Vec Overview\nPaper: [Topic Modeling: Contextual Token Embeddings Are All You Need](https://aclanthology.org/2024.findings-emnlp.790.pdf)\n\nThe Top2Vec library now supports a new contextual version, allowing for deeper topic modeling capabilities. **Contextual Top2Vec**, enables the model to generate **contextual token embeddings** for each document, identifying multiple topics per document and even detecting topic segments within a document. This enhancement is useful for capturing a nuanced understanding of topics, especially in documents that cover multiple themes.\n\n### Key Features of Contextual Top2Vec\n\n- **`contextual_top2vec` flag**: A new parameter, `contextual_top2vec`, is added to the Top2Vec class. When set to `True`, the model uses contextual token embeddings. 
Only the following embedding models are supported:\n - `all-MiniLM-L6-v2`\n - `all-mpnet-base-v2`\n- **Topic Spans**: C-Top2Vec automatically determines the number of topics and finds topic segments within documents, allowing for a more granular topic discovery.\n\n### Simple Usage Example\n\nHere is a simple example of how to use Contextual Top2Vec:\n\n```python\nfrom top2vec import Top2Vec\n\n# Create a Contextual Top2Vec model\ntop2vec_model = Top2Vec(documents=documents,\n ngram_vocab=True,\n contextual_top2vec=True)\n```\n\n### New Methods for Contextual Top2Vec\n\n#### `get_document_topic_distribution()`\n\n```python\nget_document_topic_distribution() -> np.ndarray\n```\n- **Description**: Retrieves the topic distribution for each document.\n- **Returns**: A `numpy.ndarray` of shape `(num_documents, num_topics)`. Each row represents the **probability distribution of topics** for a document.\n\n#### `get_document_topic_relevance()`\n\n```python\nget_document_topic_relevance() -> np.ndarray\n```\n- **Description**: Provides the relevance of each topic for each document.\n- **Returns**: A `numpy.ndarray` of shape `(num_documents, num_topics)`. 
Each row indicates the **relevance scores of topics** for a document.\n\n#### `get_document_token_topic_assignment()`\n\n```python\nget_document_token_topic_assignment() -> List[Document]\n```\n- **Description**: Retrieves token-level topic assignments for each document.\n- **Returns**: A list of `Document` objects, each containing topics with **token assignments and scores** for each token.\n\n#### `get_document_tokens()`\n\n```python\nget_document_tokens() -> List[List[str]]\n```\n- **Description**: Returns the tokens for each document.\n- **Returns**: A list of lists where each sublist contains the **tokens for a given document**.\n\n### Usage Note\n\nThe **contextual version** of Top2Vec requires specific embedding models, and the new methods provide insights into the distribution, relevance, and assignment of topics at both the document and token levels, allowing for a richer understanding of the data.\n\n> Warning: Contextual Top2Vec is still in **beta**. You may encounter issues or unexpected behavior, and the functionality may change in future updates.\n\n\n\nCitation\n--------\n```\n@inproceedings{angelov-inkpen-2024-topic,\n title = \"Topic Modeling: Contextual Token Embeddings Are All You Need\",\n author = \"Angelov, Dimo and\n Inkpen, Diana\",\n editor = \"Al-Onaizan, Yaser and\n Bansal, Mohit and\n Chen, Yun-Nung\",\n booktitle = \"Findings of the Association for Computational Linguistics: EMNLP 2024\",\n month = nov,\n year = \"2024\",\n address = \"Miami, Florida, USA\",\n publisher = \"Association for Computational Linguistics\",\n url = \"https://aclanthology.org/2024.findings-emnlp.790\",\n pages = \"13528--13539\",\n abstract = \"The goal of topic modeling is to find meaningful topics that capture the information present in a collection of documents. The main challenges of topic modeling are finding the optimal number of topics, labeling the topics, segmenting documents by topic, and evaluating topic model performance. 
Current neural approaches have tackled some of these problems but none have been able to solve all of them. We introduce a novel topic modeling approach, Contextual-Top2Vec, which uses document contextual token embeddings, it creates hierarchical topics, finds topic spans within documents and labels topics with phrases rather than just words. We propose the use of BERTScore to evaluate topic coherence and to evaluate how informative topics are of the underlying documents. Our model outperforms the current state-of-the-art models on a comprehensive set of topic model evaluation metrics.\",\n}\n```\n\nClassic Top2Vec\n===============\n\nTop2Vec is an algorithm for **topic modeling** and **semantic search**. It automatically detects topics present in text\nand generates jointly embedded topic, document and word vectors. Once you train the Top2Vec model \nyou can:\n* Get number of detected topics.\n* Get topics.\n* Get topic sizes. \n* Get hierarchichal topics. \n* Search topics by keywords.\n* Search documents by topic.\n* Search documents by keywords.\n* Find similar words.\n* Find similar documents.\n* Expose model with [RESTful-Top2Vec](https://github.com/ddangelov/RESTful-Top2Vec)\n\nSee the [paper](http://arxiv.org/abs/2008.09470) for more details on how it works.\n\nBenefits\n--------\n1. Automatically finds number of topics.\n2. No stop word lists required.\n3. No need for stemming/lemmatization.\n4. Works on short text.\n5. Creates jointly embedded topic, document, and word vectors. \n6. Has search functions built in.\n\nHow does it work?\n-----------------\n\nThe assumption the algorithm makes is that many semantically similar documents\nare indicative of an underlying topic. The first step is to create a joint embedding of \ndocument and word vectors. Once documents and words are embedded in a vector \nspace the goal of the algorithm is to find dense clusters of documents, then identify which \nwords attracted those documents together. 
Each dense area is a topic and the words that\nattracted the documents to the dense area are the topic words.\n\n### The Algorithm:\n\n#### 1. Create jointly embedded document and word vectors using [Doc2Vec](https://radimrehurek.com/gensim/models/doc2vec.html) or [Universal Sentence Encoder](https://tfhub.dev/google/collections/universal-sentence-encoder/1) or [BERT Sentence Transformer](https://www.sbert.net/).\n>Documents will be placed close to other similar documents and close to the most distinguishing words.\n\n<!--![](https://raw.githubusercontent.com/ddangelov/Top2Vec/master/images/doc_word_embedding.svg?sanitize=true)-->\n<p align=\"center\">\n <img src=\"https://raw.githubusercontent.com/ddangelov/Top2Vec/master/images/doc_word_embedding.svg?sanitize=true\" alt=\"\" width=600 height=\"whatever\">\n</p>\n\n#### 2. Create lower dimensional embedding of document vectors using [UMAP](https://github.com/lmcinnes/umap).\n>Document vectors in high dimensional space are very sparse, dimension reduction helps for finding dense areas. Each point is a document vector.\n\n<!--![](https://raw.githubusercontent.com/ddangelov/Top2Vec/master/images/umap_docs.png)-->\n<p align=\"center\">\n <img src=\"https://raw.githubusercontent.com/ddangelov/Top2Vec/master/images/umap_docs.png\" alt=\"\" width=700 height=\"whatever\">\n</p>\n\n#### 3. Find dense areas of documents using [HDBSCAN](https://github.com/scikit-learn-contrib/hdbscan).\n>The colored areas are the dense areas of documents. Red points are outliers that do not belong to a specific cluster.\n\n<!--![](https://raw.githubusercontent.com/ddangelov/Top2Vec/master/images/hdbscan_docs.png)-->\n<p align=\"center\">\n <img src=\"https://raw.githubusercontent.com/ddangelov/Top2Vec/master/images/hdbscan_docs.png\" alt=\"\" width=700 height=\"whatever\">\n</p>\n\n#### 4. 
For each dense area calculate the centroid of document vectors in original dimension, this is the topic vector.\n>The red points are outlier documents and do not get used for calculating the topic vector. The purple points are the document vectors that belong to a dense area, from which the topic vector is calculated. \n\n<!--![](https://raw.githubusercontent.com/ddangelov/Top2Vec/master/images/topic_vector.svg?sanitize=true)-->\n<p align=\"center\">\n <img src=\"https://raw.githubusercontent.com/ddangelov/Top2Vec/master/images/topic_vector.svg?sanitize=true\" alt=\"\" width=600 height=\"whatever\">\n</p>\n\n#### 5. Find n-closest word vectors to the resulting topic vector.\n>The closest word vectors in order of proximity become the topic words. \n\n<!--![](https://raw.githubusercontent.com/ddangelov/Top2Vec/master/images/topic_words.svg?sanitize=true)-->\n<p align=\"center\">\n <img src=\"https://raw.githubusercontent.com/ddangelov/Top2Vec/master/images/topic_words.svg?sanitize=true\" alt=\"\" width=600 height=\"whatever\">\n</p>\n\nInstallation\n------------\n\nThe easy way to install Top2Vec is:\n\n pip install top2vec\n \nTo install pre-trained universal sentence encoder options:\n \n pip install top2vec[sentence_encoders]\n \nTo install pre-trained BERT sentence transformer options:\n\n pip install top2vec[sentence_transformers]\n \nTo install indexing options:\n\n pip install top2vec[indexing]\n\n\nUsage\n-----\n\n```python\n\nfrom top2vec import top2vec\n\nmodel = Top2Vec(documents)\n```\nImportant parameters:\n\n * ``documents``: Input corpus, should be a list of strings.\n \n * ``speed``: This parameter will determine how fast the model takes to train. \n The 'fast-learn' option is the fastest and will generate the lowest quality\n vectors. The 'learn' option will learn better quality vectors but take a longer\n time to train. 
The 'deep-learn' option will learn the best quality vectors but \n will take significant time to train.\n \n * ``workers``: The amount of worker threads to be used in training the model. Larger\n amount will lead to faster training.\n \n> Trained models can be saved and loaded. \n```python\n\nmodel.save(\"filename\")\nmodel = Top2Vec.load(\"filename\")\n```\n\nFor more information view the [API guide](https://top2vec.readthedocs.io/en/latest/api.html).\n\nPretrained Embedding Models <a name=\"pretrained\"></a>\n-----------------\nDoc2Vec will be used by default to generate the joint word and document embeddings. However there are also pretrained `embedding_model` options for generating joint word and document embeddings:\n\n * `universal-sentence-encoder`\n * `universal-sentence-encoder-multilingual`\n * `distiluse-base-multilingual-cased`\n\n```python\nfrom top2vec import top2vec\n\nmodel = Top2Vec(documents, embedding_model='universal-sentence-encoder')\n```\n\nFor large data sets and data sets with very unique vocabulary doc2vec could\nproduce better results. This will train a doc2vec model from scratch. This method\nis language agnostic. However multiple languages will not be aligned.\n\nUsing the universal sentence encoder options will be much faster since those are\npre-trained and efficient models. The universal sentence encoder options are\nsuggested for smaller data sets. They are also good options for large data sets\nthat are in English or in languages covered by the multilingual model. It is also\nsuggested for data sets that are multilingual.\n\nThe distiluse-base-multilingual-cased pre-trained sentence transformer is suggested\nfor multilingual datasets and languages that are not covered by the multilingual\nuniversal sentence encoder. The transformer is significantly slower than\nthe universal sentence encoder options. 
\n\nMore information on [universal-sentence-encoder](https://tfhub.dev/google/universal-sentence-encoder/4), [universal-sentence-encoder-multilingual](https://tfhub.dev/google/universal-sentence-encoder-multilingual/3), and [distiluse-base-multilingual-cased](https://www.sbert.net/docs/pretrained_models.html).\n\n\nCitation\n-----------------\n\nIf you would like to cite Top2Vec in your work, this is the current reference:\n\n```bibtex\n@article{angelov2020top2vec,\n title={Top2Vec: Distributed Representations of Topics}, \n author={Dimo Angelov},\n year={2020},\n eprint={2008.09470},\n archivePrefix={arXiv},\n primaryClass={cs.CL}\n}\n```\n \nExample\n-------\n\n### Train Model\nTrain a Top2Vec model on the 20newsgroups dataset.\n\n```python\n\nfrom top2vec import Top2Vec\nfrom sklearn.datasets import fetch_20newsgroups\n\nnewsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))\n\nmodel = Top2Vec(documents=newsgroups.data, speed=\"learn\", workers=8)\n\n```\n### Get Number of Topics\nThis will return the number of topics that Top2Vec has found in the data.\n```python\n\n>>> model.get_num_topics()\n77\n\n```\n### Get Topic Sizes\nThis will return the number of documents most similar to each topic. Topics are\nin decreasing order of size. \n```python\ntopic_sizes, topic_nums = model.get_topic_sizes()\n```\nReturns:\n\n * ``topic_sizes``: The number of documents most similar to each topic.\n \n * ``topic_nums``: The unique index of every topic. \n \n### Get Topics \nThis will return the topics in decreasing order of size.\n```python\ntopic_words, word_scores, topic_nums = model.get_topics(77)\n\n```\nReturns:\n\n * ``topic_words``: For each topic the top 50 words are returned, in order\n of semantic similarity to the topic.\n \n * ``word_scores``: For each topic the cosine similarity scores of the\n top 50 words to the topic are returned. 
\n \n * ``topic_nums``: The unique index of every topic.\n \n### Search Topics\nWe are going to search for topics most similar to **medicine**. \n```python\n\ntopic_words, word_scores, topic_scores, topic_nums = model.search_topics(keywords=[\"medicine\"], num_topics=5)\n```\nReturns:\n * ``topic_words``: For each topic the top 50 words are returned, in order\n of semantic similarity to the topic.\n \n * ``word_scores``: For each topic the cosine similarity scores of the\n top 50 words to the topic are returned. \n \n * ``topic_scores``: For each topic, the cosine similarity to the search keywords.\n \n * ``topic_nums``: The unique index of every topic.\n\n```python\n\n>>> topic_nums\n[21, 29, 9, 61, 48]\n\n>>> topic_scores\n[0.4468, 0.381, 0.2779, 0.2566, 0.2515]\n```\n> Topic 21 was the most similar topic to \"medicine\", with a cosine similarity of 0.4468. (Values range from 0, least similar, to 1, most similar.)\n\n### Generate Word Clouds\n\nUsing a topic number you can generate a word cloud. We are going to generate word clouds for the top 5 most similar topics to our **medicine** topic search from above. 
\n```python\ntopic_words, word_scores, topic_scores, topic_nums = model.search_topics(keywords=[\"medicine\"], num_topics=5)\nfor topic in topic_nums:\n model.generate_topic_wordcloud(topic)\n```\n<!--![](https://raw.githubusercontent.com/ddangelov/Top2Vec/master/images/topic21.png)\n![](https://raw.githubusercontent.com/ddangelov/Top2Vec/master/images/topic29.png)\n![](https://raw.githubusercontent.com/ddangelov/Top2Vec/master/images/topic9.png)\n![](https://raw.githubusercontent.com/ddangelov/Top2Vec/master/images/topic61.png)\n![](https://raw.githubusercontent.com/ddangelov/Top2Vec/master/images/topic48.png)-->\n\n<img src=\"https://raw.githubusercontent.com/ddangelov/Top2Vec/master/images/topic21.png\" alt=\"\" width=700 height=\"whatever\">\n<img src=\"https://raw.githubusercontent.com/ddangelov/Top2Vec/master/images/topic29.png\" alt=\"\" width=700 height=\"whatever\">\n<img src=\"https://raw.githubusercontent.com/ddangelov/Top2Vec/master/images/topic9.png\" alt=\"\" width=700 height=\"whatever\">\n<img src=\"https://raw.githubusercontent.com/ddangelov/Top2Vec/master/images/topic61.png\" alt=\"\" width=700 height=\"whatever\">\n<img src=\"https://raw.githubusercontent.com/ddangelov/Top2Vec/master/images/topic48.png\" alt=\"\" width=700 height=\"whatever\">\n\n\n### Search Documents by Topic\n\nWe are going to search by **topic 48**, a topic that appears to be about **science**.\n```python\ndocuments, document_scores, document_ids = model.search_documents_by_topic(topic_num=48, num_docs=5)\n```\nReturns:\n * ``documents``: The documents in a list; the most similar are first. \n \n * ``document_scores``: Semantic similarity of document to topic. The cosine similarity of the\n document and topic vector.\n \n * ``document_ids``: Unique ids of documents. 
If ids were not given, the index of document\n in the original corpus.\n \nFor each of the returned documents we are going to print its content, score and document number.\n```python\ndocuments, document_scores, document_ids = model.search_documents_by_topic(topic_num=48, num_docs=5)\nfor doc, score, doc_id in zip(documents, document_scores, document_ids):\n print(f\"Document: {doc_id}, Score: {score}\")\n print(\"-----------\")\n print(doc)\n print(\"-----------\")\n print()\n```\n\n\n Document: 15227, Score: 0.6322\n -----------\n Evolution is both fact and theory. The THEORY of evolution represents the\n scientific attempt to explain the FACT of evolution. The theory of evolution\n does not provide facts; it explains facts. It can be safely assumed that ALL\n scientific theories neither provide nor become facts but rather EXPLAIN facts.\n I recommend that you do some appropriate reading in general science. A good\n starting point with regard to evolution for the layman would be \"Evolution as\n Fact and Theory\" in \"Hen's Teeth and Horse's Toes\" [pp 253-262] by Stephen Jay\n Gould. There is a great deal of other useful information in this publication.\n -----------\n\n Document: 14515, Score: 0.6186\n -----------\n Just what are these \"scientific facts\"? I have never heard of such a thing.\n Science never proves or disproves any theory - history does.\n\n -Tim\n -----------\n \n Document: 9433, Score: 0.5997\n -----------\n The same way that any theory is proven false. You examine the predicitions\n that the theory makes, and try to observe them. If you don't, or if you\n observe things that the theory predicts wouldn't happen, then you have some \n evidence against the theory. If the theory can't be modified to \n incorporate the new observations, then you say that it is false.\n\n For example, people used to believe that the earth had been created\n 10,000 years ago. 
But, as evidence showed that predictions from this \n theory were not true, it was abandoned.\n -----------\n \n Document: 11917, Score: 0.5845\n -----------\n The point about its being real or not is that one does not waste time with\n what reality might be when one wants predictions. The questions if the\n atoms are there or if something else is there making measurements indicate\n atoms is not necessary in such a system.\n\n And one does not have to write a new theory of existence everytime new\n models are used in Physics.\n -----------\n \n ...\n\n### Semantic Search Documents by Keywords\n\nSearch documents for content semantically similar to **cryptography** and **privacy**.\n```python\ndocuments, document_scores, document_ids = model.search_documents_by_keywords(keywords=[\"cryptography\", \"privacy\"], num_docs=5)\nfor doc, score, doc_id in zip(documents, document_scores, document_ids):\n print(f\"Document: {doc_id}, Score: {score}\")\n print(\"-----------\")\n print(doc)\n print(\"-----------\")\n print()\n``` \n Document: 16837, Score: 0.6112\n -----------\n ...\n Email and account privacy, anonymity, file encryption, academic \n computer policies, relevant legislation and references, EFF, and \n other privacy and rights issues associated with use of the Internet\n and global networks in general.\n ...\n \n Document: 16254, Score: 0.5722\n -----------\n ...\n The President today announced a new initiative that will bring\n the Federal Government together with industry in a voluntary\n program to improve the security and privacy of telephone\n communications while meeting the legitimate needs of law\n enforcement.\n ...\n -----------\n ...\n\n### Similar Keywords\n\nSearch for similar words to **space**.\n```python\nwords, word_scores = model.similar_words(keywords=[\"space\"], keywords_neg=[], num_words=20)\nfor word, score in zip(words, word_scores):\n print(f\"{word} {score}\")\n``` \n space 1.0\n nasa 0.6589\n shuttle 0.5976\n exploration 0.5448\n 
planetary 0.5391\n missions 0.5069\n launch 0.4941\n telescope 0.4821\n astro 0.4696\n jsc 0.4549\n ames 0.4515\n satellite 0.446\n station 0.4445\n orbital 0.4438\n solar 0.4386\n astronomy 0.4378\n observatory 0.4355\n facility 0.4325\n propulsion 0.4251\n aerospace 0.4226\n",
"bugtrack_url": null,
"license": "BSD",
"summary": "Top2Vec learns jointly embedded topic, document and word vectors.",
"version": "1.0.36",
"project_urls": {
"Homepage": "https://github.com/ddangelov/Top2Vec"
},
"split_keywords": [
"topic",
"modeling",
"semantic",
"search",
"word",
"document",
"embedding"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "912618f44fb5bf6bc434462e375e159226da5b17284fd97cb99c87ea62360514",
"md5": "2e4133ac615b2f5865aed3b5074c0a81",
"sha256": "181fd3e3bbf8328ccf0455c32e51d445e56e3077f257a4a5150274628eeaaad9"
},
"downloads": -1,
"filename": "top2vec-1.0.36-py3-none-any.whl",
"has_sig": false,
"md5_digest": "2e4133ac615b2f5865aed3b5074c0a81",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.10",
"size": 33098,
"upload_time": "2024-11-14T23:48:28",
"upload_time_iso_8601": "2024-11-14T23:48:28.164904Z",
"url": "https://files.pythonhosted.org/packages/91/26/18f44fb5bf6bc434462e375e159226da5b17284fd97cb99c87ea62360514/top2vec-1.0.36-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "bfd6d6b4b4b3b73fb6d48bd5aba7c003630383bae8c418bef0e197318a5c4539",
"md5": "3612196295beeefd681bcde65698d60e",
"sha256": "a187eff3f5a020bdea98a2717ed7f0602a5765cb4d024b05162e9bc45b791384"
},
"downloads": -1,
"filename": "top2vec-1.0.36.tar.gz",
"has_sig": false,
"md5_digest": "3612196295beeefd681bcde65698d60e",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.10",
"size": 39578,
"upload_time": "2024-11-14T23:48:30",
"upload_time_iso_8601": "2024-11-14T23:48:30.029046Z",
"url": "https://files.pythonhosted.org/packages/bf/d6/d6b4b4b3b73fb6d48bd5aba7c003630383bae8c418bef0e197318a5c4539/top2vec-1.0.36.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-11-14 23:48:30",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "ddangelov",
"github_project": "Top2Vec",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"requirements": [],
"lcname": "top2vec"
}