# FASTopic
[![stars](https://img.shields.io/github/stars/bobxwu/FASTopic?logo=github)](https://github.com/BobXWu/Fastopic)
[![PyPI](https://img.shields.io/pypi/v/fastopic)](https://pypi.org/project/fastopic)
[![Downloads](https://static.pepy.tech/badge/fastopic)](https://pepy.tech/project/fastopic)
[![LICENSE](https://img.shields.io/github/license/bobxwu/fastopic)](https://www.apache.org/licenses/LICENSE-2.0/)
[![arXiv](https://img.shields.io/badge/arXiv-2405.17978-<COLOR>.svg)](https://arxiv.org/pdf/2405.17978.pdf)
[![Contributors](https://img.shields.io/github/contributors/bobxwu/fastopic)](https://github.com/bobxwu/fastopic/graphs/contributors/)
FASTopic is a fast, adaptive, stable, and transferable topic model, different
from the previous conventional (LDA), VAE-based (ProdLDA, ETM), or clustering-based (Top2Vec, BERTopic) methods.
It leverages optimal transport between the document, topic, and word embeddings from pretrained Transformers to model topics and topic distributions of documents.
Check our paper: **[FASTopic: A Fast, Adaptive, Stable, and Transferable Topic Modeling Paradigm](https://arxiv.org/pdf/2405.17978.pdf)**
<img src='https://github.com/BobXWu/FASTopic/raw/master/docs/img/illustration.svg' with='300pt'></img>
- [FASTopic](#fastopic)
- [Tutorials](#tutorials)
- [Installation](#installation)
- [Quick Start](#quick-start)
- [Usage](#usage)
- [Try FASTopic on your dataset](#try-fastopic-on-your-dataset)
- [Topic info](#topic-info)
- [Topic hierarchy](#topic-hierarchy)
- [Topic weights](#topic-weights)
- [Topic activity over time](#topic-activity-over-time)
- [APIs](#apis)
- [Common](#common)
- [Visualization](#visualization)
- [Q\&A](#qa)
- [Contact](#contact)
- [Related Resources](#related-resources)
- [Citation](#citation)
## Tutorials
| Tutorial | Link |
| ------ | --- |
| A complete tutorial on FASTopic. | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1bduHWL5_bvsl4EYOgimCOmU-7RfnXqrX?usp=sharing) |
| FASTopic with other languages. | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1_b55QpVQGFBX9PsyrYfNxDzbJ1IjrUdH?usp=sharing) |
## Installation
Install FASTopic with `pip`:
```bash
pip install fastopic
```
Otherwise, install FASTopic from the source:
```bash
pip install git+https://github.com/bobxwu/FASTopic.git
```
## Quick Start
Discover topics from 20newsgroups with the topic number as `50`.
```python
from fastopic import FASTopic
from sklearn.datasets import fetch_20newsgroups
from topmost.preprocessing import Preprocessing
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
preprocessing = Preprocessing(vocab_size=10000, stopwords='English')
model = FASTopic(50, preprocessing)
topic_top_words, doc_topic_dist = model.fit_transform(docs)
```
`topic_top_words` is a list containing the top words of discovered topics.
`doc_topic_dist` is the topic distributions of documents (doc-topic distributions),
a numpy array with shape $N \times K$ (number of documents $N$ and number of topics $K$).
## Usage
### Try FASTopic on your dataset
```python
from fastopic import FASTopic
from topmost.preprocessing import Preprocessing
# Prepare your dataset.
docs = [
'doc 1',
'doc 2', # ...
]
# Preprocess the dataset. This step tokenizes docs, removes stopwords, and sets max vocabulary size, etc.
# Pass your tokenizer as:
# preprocessing = Preprocessing(vocab_size=your_vocab_size, tokenizer=your_tokenizer, stopwords=your_stopwords_set)
preprocessing = Preprocessing(stopwords='English')
model = FASTopic(50, preprocessing)
topic_top_words, doc_topic_dist = model.fit_transform(docs)
```
### Topic info
We can get the top words and their probabilities of a topic.
```python
model.get_topic(topic_idx=36)
(('impeachment', 0.008047104),
('mueller', 0.0075936727),
('trump', 0.0066773472),
('committee', 0.0057785935),
('inquiry', 0.005647915))
```
We can visualize these topic info.
```python
fig = model.visualize_topic(top_n=5)
fig.show()
```
<img src='https://github.com/BobXWu/FASTopic/blob/master/tutorials/topic_info.png?raw=true' with='300pt'></img>
### Topic hierarchy
We use the learned topic embeddings and `scipy.cluster.hierarchy` to build a hierarchy of discovered topics.
```python
fig = model.visualize_topic_hierarchy()
fig.show()
```
<img src='https://github.com/BobXWu/FASTopic/blob/master/tutorials/topic_hierarchy.png?raw=true' with='300pt'></img>
### Topic weights
We plot the weights of topics in the given dataset.
```python
fig = model.visualize_topic_weights(top_n=20, height=500)
fig.show()
```
<img src='https://github.com/BobXWu/FASTopic/blob/master/tutorials/topic_weight.png?raw=true' with='300pt'></img>
### Topic activity over time
Topic activity refers to the weight of a topic at a time slice.
We additionally input the time slices of documents, `time_slices` to compute and plot topic activity over time.
```python
act = model.topic_activity_over_time(time_slices)
fig = model.visualize_topic_activity(top_n=6, topic_activity=act, time_slices=time_slices)
fig.show()
```
<img src='https://github.com/BobXWu/FASTopic/blob/master/tutorials/topic_activity.png?raw=true' with='300pt'></img>
## APIs
We summarize the frequently used APIs of FASTopic here. It's easier for you to look up.
### Common
| Method | API |
|-----------------------|---|
| Fit the model | `.fit(docs)` |
| Fit the model and predict documents | `.fit_transform(docs)` |
| Predict new documents | `.transform(new_docs)` |
| Get topic-word distribution matrix | `.get_beta()` |
| Get top words of all topics | `.get_top_words()` |
| Get top words and probabilities of a topic | `.get_topic(topic_idx=10)` |
| Get topic weights over the input dataset | `.get_topic_weights()` |
| Get topic activity over time | `.topic_activity_over_time(time_slices)` |
| Save model | `.save(path=path)` |
| Load model | `.from_pretrained(path=path)` |
### Visualization
| Method | API |
|-----------------------|---|
| Visualize topics | `.visualize_topic(top_n=5)` or `.visualize_topic(topic_idx=[1, 2, 3])` |
| Visualize topic weights | `.visualize_topic_weights(top_n=5)` or `.visualize_topic_weights(topic_idx=[1, 2, 3])` |
| Visualize topic hierarchy | `.visualize_topic_hierarchy()` |
| Visualize topic activity | `.visualize_topic_activity(top_n=5, topic_activity=topic_activity, time_slices=time_slices)` |
## Q&A
1. **Meet the `out of memory` error. My GPU memory is not enough due to large datasets. What should I do?**
You can try to set `save_memory=True` and `batch_size` in FASTopic.
`batch_size` should not be too small, otherwise it may damage performance.
```python
model = FASTopic(50, save_memory=True, batch_size=2000)
```
Or you can run FASTopic on the CPU as
```python
model = FASTopic(50, device='cpu')
```
2. **Can I try FASTopic with the languages other than English?**
Yes!
You can pass a multilingual document embedding model, like `paraphrase-multilingual-MiniLM-L12-v2`,
and the tokenizer and the stop words for your language, like [pipelines of spaCy](https://spacy.io/models).
Please refer to the tutorial in Colab.
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1_b55QpVQGFBX9PsyrYfNxDzbJ1IjrUdH?usp=sharing)
3. **Loss value can not decline, and the discovered topics all are repetitive.**
This may be caused by the less discriminative document embeddings,
due to the used document embedding model or extremely short texts as inputs.
Try to
(1) normalize document embeddings as `FASTopic(50, normalize_embeddings=True)`;
or
(2) increase `DT_alpha` to `5.0`, `10.0` or `15.0`, as `FASTopic(num_topics, DT_alpha=10.0)`.
4. **Can I use my own document embedding models?**
Yes! You can wrap your model and pass it to FASTopic:
```python
class YourDocEmbedModel:
def __init__(self):
...
def encode(self,
docs: List[str],
show_progress_bar: bool=False,
normalize_embeddings: bool=False
):
...
return embeddings
your_model = YourDocEmbedModel()
FASTopic(50, doc_embed_model=your_model)
```
## Contact
- We welcome your contributions to this project. Please feel free to submit pull requests.
- If you encounter any issues, please either directly contact **Xiaobao Wu (xiaobao002@e.ntu.edu.sg)** or leave an issue in the GitHub repo.
## Related Resources
- [**TopMost**](https://github.com/bobxwu/topmost): a topic modeling toolkit, including preprocessing, model training, and evaluations.
- [**A Survey on Neural Topic Models: Methods, Applications, and Challenges**](https://github.com/BobXWu/Paper-Neural-Topic-Models)
## Citation
If you want to use FASTopic, please cite our [paper](https://arxiv.org/pdf/2405.17978.pdf) as
@article{wu2024fastopic,
title={FASTopic: A Fast, Adaptive, Stable, and Transferable Topic Modeling Paradigm},
author={Wu, Xiaobao and Nguyen, Thong and Zhang, Delvin Ce and Wang, William Yang and Luu, Anh Tuan},
journal={arXiv preprint arXiv:2405.17978},
year={2024}
}
Raw data
{
"_id": null,
"home_page": "https://github.com/bobxwu/FASTopic",
"name": "fastopic",
"maintainer": null,
"docs_url": null,
"requires_python": null,
"maintainer_email": null,
"keywords": "topic model, neural topic model, transformers, optimal transport",
"author": "Xiaobao Wu",
"author_email": "xiaobao002@e.ntu.edu.sg",
"download_url": "https://files.pythonhosted.org/packages/ac/e9/1be6b3acfb7f1559ccecf4f58dd1d65f031c10e64c0664d8a0ff446d687b/fastopic-0.0.5.tar.gz",
"platform": null,
"description": "# FASTopic\n\n[![stars](https://img.shields.io/github/stars/bobxwu/FASTopic?logo=github)](https://github.com/BobXWu/Fastopic)\n[![PyPI](https://img.shields.io/pypi/v/fastopic)](https://pypi.org/project/fastopic)\n[![Downloads](https://static.pepy.tech/badge/fastopic)](https://pepy.tech/project/fastopic)\n[![LICENSE](https://img.shields.io/github/license/bobxwu/fastopic)](https://www.apache.org/licenses/LICENSE-2.0/)\n[![arXiv](https://img.shields.io/badge/arXiv-2405.17978-<COLOR>.svg)](https://arxiv.org/pdf/2405.17978.pdf)\n[![Contributors](https://img.shields.io/github/contributors/bobxwu/fastopic)](https://github.com/bobxwu/fastopic/graphs/contributors/)\n\n\nFASTopic is a fast, adaptive, stable, and transferable topic model, different\nfrom the previous conventional (LDA), VAE-based (ProdLDA, ETM), or clustering-based (Top2Vec, BERTopic) methods.\nIt leverages optimal transport between the document, topic, and word embeddings from pretrained Transformers to model topics and topic distributions of documents.\n\nCheck our paper: **[FASTopic: A Fast, Adaptive, Stable, and Transferable Topic Modeling Paradigm](https://arxiv.org/pdf/2405.17978.pdf)**\n\n<img src='https://github.com/BobXWu/FASTopic/raw/master/docs/img/illustration.svg' with='300pt'></img>\n\n- [FASTopic](#fastopic)\n - [Tutorials](#tutorials)\n - [Installation](#installation)\n - [Quick Start](#quick-start)\n - [Usage](#usage)\n - [Try FASTopic on your dataset](#try-fastopic-on-your-dataset)\n - [Topic info](#topic-info)\n - [Topic hierarchy](#topic-hierarchy)\n - [Topic weights](#topic-weights)\n - [Topic activity over time](#topic-activity-over-time)\n - [APIs](#apis)\n - [Common](#common)\n - [Visualization](#visualization)\n - [Q\\&A](#qa)\n - [Contact](#contact)\n - [Related Resources](#related-resources)\n - [Citation](#citation)\n\n\n## Tutorials\n\n| Tutorial | Link |\n| ------ | --- |\n| A complete tutorial on FASTopic. | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1bduHWL5_bvsl4EYOgimCOmU-7RfnXqrX?usp=sharing) |\n| FASTopic with other languages. | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1_b55QpVQGFBX9PsyrYfNxDzbJ1IjrUdH?usp=sharing) |\n\n\n## Installation\n\nInstall FASTopic with `pip`:\n\n```bash\n pip install fastopic\n```\n\nOtherwise, install FASTopic from the source:\n\n```bash\n pip install git+https://github.com/bobxwu/FASTopic.git\n```\n\n## Quick Start\n\nDiscover topics from 20newsgroups with the topic number as `50`.\n\n```python\n from fastopic import FASTopic\n from sklearn.datasets import fetch_20newsgroups\n from topmost.preprocessing import Preprocessing\n\n docs = fetch_20newsgroups(subset='all', \u00a0remove=('headers', 'footers', 'quotes'))['data']\n\n preprocessing = Preprocessing(vocab_size=10000, stopwords='English')\n\n model = FASTopic(50, preprocessing)\n topic_top_words, doc_topic_dist = model.fit_transform(docs)\n\n```\n\n`topic_top_words` is a list containing the top words of discovered topics.\n`doc_topic_dist` is the topic distributions of documents (doc-topic distributions),\na numpy array with shape $N \\times K$ (number of documents $N$ and number of topics $K$).\n\n\n## Usage\n\n### Try FASTopic on your dataset\n\n```python\n from fastopic import FASTopic\n from topmost.preprocessing import Preprocessing\n\n # Prepare your dataset.\n docs = [\n \u00a0 \u00a0 'doc 1',\n \u00a0 \u00a0 'doc 2', # ...\n ]\n\n # Preprocess the dataset. This step tokenizes docs, removes stopwords, and sets max vocabulary size, etc.\n # Pass your tokenizer as:\n # preprocessing = Preprocessing(vocab_size=your_vocab_size, tokenizer=your_tokenizer, stopwords=your_stopwords_set)\n preprocessing = Preprocessing(stopwords='English')\n\n model = FASTopic(50, preprocessing)\n topic_top_words, doc_topic_dist = model.fit_transform(docs)\n```\n\n\n### Topic info\n\nWe can get the top words and their probabilities of a topic.\n\n```python\n model.get_topic(topic_idx=36)\n\n (('impeachment', 0.008047104),\n ('mueller', 0.0075936727),\n ('trump', 0.0066773472),\n ('committee', 0.0057785935),\n ('inquiry', 0.005647915))\n```\n\nWe can visualize these topic info.\n\n```python\n fig = model.visualize_topic(top_n=5)\n fig.show()\n```\n\n<img src='https://github.com/BobXWu/FASTopic/blob/master/tutorials/topic_info.png?raw=true' with='300pt'></img>\n\n\n### Topic hierarchy\n\nWe use the learned topic embeddings and `scipy.cluster.hierarchy` to build a hierarchy of discovered topics.\n\n\n```python\n fig = model.visualize_topic_hierarchy()\n fig.show()\n```\n\n<img src='https://github.com/BobXWu/FASTopic/blob/master/tutorials/topic_hierarchy.png?raw=true' with='300pt'></img>\n\n\n\n### Topic weights\n\nWe plot the weights of topics in the given dataset.\n\n```python\n fig = model.visualize_topic_weights(top_n=20, height=500)\n fig.show()\n```\n\n<img src='https://github.com/BobXWu/FASTopic/blob/master/tutorials/topic_weight.png?raw=true' with='300pt'></img>\n\n\n\n\n### Topic activity over time\n\nTopic activity refers to the weight of a topic at a time slice.\nWe additionally input the time slices of documents, `time_slices` to compute and plot topic activity over time.\n\n\n\n```python\n act = model.topic_activity_over_time(time_slices)\n fig = model.visualize_topic_activity(top_n=6, topic_activity=act, time_slices=time_slices)\n fig.show()\n```\n\n<img src='https://github.com/BobXWu/FASTopic/blob/master/tutorials/topic_activity.png?raw=true' with='300pt'></img>\n\n\n\n## APIs\n\nWe summarize the frequently used APIs of FASTopic here. It's easier for you to look up.\n\n### Common\n\n| Method | API | \n|-----------------------|---|\n| Fit the model | `.fit(docs)` |\n| Fit the model and predict documents | `.fit_transform(docs)` |\n| Predict new documents | `.transform(new_docs)` |\n| Get topic-word distribution matrix | `.get_beta()` |\n| Get top words of all topics | `.get_top_words()` |\n| Get top words and probabilities of a topic | `.get_topic(topic_idx=10)` |\n| Get topic weights over the input dataset | `.get_topic_weights()` |\n| Get topic activity over time | `.topic_activity_over_time(time_slices)` |\n| Save model | `.save(path=path)` |\n| Load model | `.from_pretrained(path=path)` |\n\n\n### Visualization\n\n| Method | API | \n|-----------------------|---|\n| Visualize topics | `.visualize_topic(top_n=5)` or `.visualize_topic(topic_idx=[1, 2, 3])` |\n| Visualize topic weights | `.visualize_topic_weights(top_n=5)` or `.visualize_topic_weights(topic_idx=[1, 2, 3])` |\n| Visualize topic hierarchy | `.visualize_topic_hierarchy()` |\n| Visualize topic activity | `.visualize_topic_activity(top_n=5, topic_activity=topic_activity, time_slices=time_slices)` |\n\n\n\n\n## Q&A\n\n1. **Meet the `out of memory` error. My GPU memory is not enough due to large datasets. What should I do?**\n\n You can try to set `save_memory=True` and `batch_size` in FASTopic.\n `batch_size` should not be too small, otherwise it may damage performance.\n\n\n ```python\n model = FASTopic(50, save_memory=True, batch_size=2000)\n ```\n\n Or you can run FASTopic on the CPU as\n\n ```python\n model = FASTopic(50, device='cpu')\n ```\n\n\n2. **Can I try FASTopic with the languages other than English?**\n\n Yes!\n You can pass a multilingual document embedding model, like `paraphrase-multilingual-MiniLM-L12-v2`,\n and the tokenizer and the stop words for your language, like [pipelines of spaCy](https://spacy.io/models). \n Please refer to the tutorial in Colab.\n [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1_b55QpVQGFBX9PsyrYfNxDzbJ1IjrUdH?usp=sharing)\n\n\n\n3. **Loss value can not decline, and the discovered topics all are repetitive.**\n\n This may be caused by the less discriminative document embeddings,\n due to the used document embedding model or extremely short texts as inputs. \n Try to\n (1) normalize document embeddings as `FASTopic(50, normalize_embeddings=True)`;\n or\n (2) increase `DT_alpha` to `5.0`, `10.0` or `15.0`, as `FASTopic(num_topics, DT_alpha=10.0)`.\n\n4. **Can I use my own document embedding models?**\n\n Yes! You can wrap your model and pass it to FASTopic:\n\n\n```python\n class YourDocEmbedModel:\n def __init__(self):\n ...\n\n def encode(self,\n docs: List[str],\n show_progress_bar: bool=False,\n normalize_embeddings: bool=False\n ):\n ...\n return embeddings\n\n your_model = YourDocEmbedModel()\n FASTopic(50, doc_embed_model=your_model)\n```\n\n\n## Contact\n- We welcome your contributions to this project. Please feel free to submit pull requests.\n- If you encounter any issues, please either directly contact **Xiaobao Wu (xiaobao002@e.ntu.edu.sg)** or leave an issue in the GitHub repo.\n\n\n## Related Resources\n- [**TopMost**](https://github.com/bobxwu/topmost): a topic modeling toolkit, including preprocessing, model training, and evaluations.\n- [**A Survey on Neural Topic Models: Methods, Applications, and Challenges**](https://github.com/BobXWu/Paper-Neural-Topic-Models)\n\n## Citation\n\nIf you want to use FASTopic, please cite our [paper](https://arxiv.org/pdf/2405.17978.pdf) as\n\n @article{wu2024fastopic,\n title={FASTopic: A Fast, Adaptive, Stable, and Transferable Topic Modeling Paradigm},\n author={Wu, Xiaobao and Nguyen, Thong and Zhang, Delvin Ce and Wang, William Yang and Luu, Anh Tuan},\n journal={arXiv preprint arXiv:2405.17978},\n year={2024}\n }\n",
"bugtrack_url": null,
"license": "Apache 2.0 License",
"summary": "FASTopic",
"version": "0.0.5",
"project_urls": {
"Homepage": "https://github.com/bobxwu/FASTopic"
},
"split_keywords": [
"topic model",
" neural topic model",
" transformers",
" optimal transport"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "303348118fa22165985d1f850af0f8e7b02be2fb89567c4c197d2530de265ccd",
"md5": "76c19619c8953f9ff1ebf5da524f97d5",
"sha256": "c428bd1286d850ee6713ff146825e5a1af7a718ea21515a1ed4a9fa3ff6e8290"
},
"downloads": -1,
"filename": "fastopic-0.0.5-py3-none-any.whl",
"has_sig": false,
"md5_digest": "76c19619c8953f9ff1ebf5da524f97d5",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 18249,
"upload_time": "2024-07-10T00:38:53",
"upload_time_iso_8601": "2024-07-10T00:38:53.673889Z",
"url": "https://files.pythonhosted.org/packages/30/33/48118fa22165985d1f850af0f8e7b02be2fb89567c4c197d2530de265ccd/fastopic-0.0.5-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "ace91be6b3acfb7f1559ccecf4f58dd1d65f031c10e64c0664d8a0ff446d687b",
"md5": "3b723610f4d92e84c9bdd178305f4893",
"sha256": "c7a0a367cc8f10ac4b6452250b7b91878579f0dc47be01aeeb9f5ea1d206b4be"
},
"downloads": -1,
"filename": "fastopic-0.0.5.tar.gz",
"has_sig": false,
"md5_digest": "3b723610f4d92e84c9bdd178305f4893",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 20650,
"upload_time": "2024-07-10T00:38:55",
"upload_time_iso_8601": "2024-07-10T00:38:55.101806Z",
"url": "https://files.pythonhosted.org/packages/ac/e9/1be6b3acfb7f1559ccecf4f58dd1d65f031c10e64c0664d8a0ff446d687b/fastopic-0.0.5.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-07-10 00:38:55",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "bobxwu",
"github_project": "FASTopic",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"requirements": [
{
"name": "topmost",
"specs": [
[
">=",
"0.0.4"
]
]
}
],
"lcname": "fastopic"
}