# FASTopic
![stars](https://img.shields.io/github/stars/bobxwu/FASTopic?logo=github)
[![PyPI](https://img.shields.io/pypi/v/fastopic)](https://pypi.org/project/fastopic)
[![Downloads](https://static.pepy.tech/badge/fastopic)](https://pepy.tech/project/fastopic)
[![LICENSE](https://img.shields.io/github/license/bobxwu/fastopic)](https://www.apache.org/licenses/LICENSE-2.0/)
[![arXiv](https://img.shields.io/badge/arXiv-2405.17978-b31b1b.svg)](https://arxiv.org/pdf/2405.17978.pdf)
[![Contributors](https://img.shields.io/github/contributors/bobxwu/fastopic)](https://github.com/bobxwu/fastopic/graphs/contributors/)
FASTopic is a fast, adaptive, stable, and transferable topic modeling package.
It leverages pretrained Transformers to produce document embeddings, and discovers latent topics through optimal transport among document, topic, and word embeddings.
This yields a neat and efficient topic modeling paradigm, distinct from traditional probabilistic, VAE-based, and clustering-based models.
<img src='docs/img/illustration.svg' width='300pt'></img>
## Installation
Install FASTopic with `pip`:
```bash
pip install fastopic
```
Alternatively, install FASTopic from source:
```bash
git clone https://github.com/bobxwu/FASTopic.git
cd FASTopic && python setup.py install
```
## Quick Start
Discover topics from the 20 Newsgroups dataset.
```python
from fastopic import FASTopic
from sklearn.datasets import fetch_20newsgroups
from topmost.preprocessing import Preprocessing
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
preprocessing = Preprocessing(vocab_size=10000, stopwords='English')
model = FASTopic(num_topics=50, preprocessing=preprocessing)
topic_top_words, doc_topic_dist = model.fit_transform(docs)
```
`topic_top_words` is a list of the top words of each discovered topic.
`doc_topic_dist` holds the doc-topic distributions (the topic distribution of each document),
a NumPy array of shape $N \times K$, where $N$ is the number of documents and $K$ is the number of topics.
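For instance, a document's dominant topic is simply the argmax of its row in `doc_topic_dist`. A minimal sketch with a toy (hypothetical) distribution, using plain Python lists for illustration:

```python
# Toy doc-topic distribution: 3 documents, 4 topics (each row sums to 1).
doc_topic_dist = [
    [0.10, 0.70, 0.15, 0.05],
    [0.40, 0.20, 0.25, 0.15],
    [0.05, 0.05, 0.10, 0.80],
]

def dominant_topic(dist_row):
    """Return the index of the highest-probability topic for one document."""
    return max(range(len(dist_row)), key=lambda k: dist_row[k])

dominant = [dominant_topic(row) for row in doc_topic_dist]
print(dominant)  # [1, 0, 3]
```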
## Usage
### 1. Try FASTopic on your dataset
```python
from fastopic import FASTopic
from topmost.preprocessing import Preprocessing
# Prepare your dataset.
your_dataset = [
'doc 1',
'doc 2', # ...
]
# Preprocess the dataset. This step tokenizes the docs, removes stopwords, limits the vocabulary size, etc.
# Pass your tokenizer as:
# preprocessing = Preprocessing(vocab_size=your_vocab_size, tokenizer=your_tokenizer, stopwords=your_stopwords_set)
preprocessing = Preprocessing(stopwords='English')
model = FASTopic(num_topics=50, preprocessing=preprocessing)
topic_top_words, doc_topic_dist = model.fit_transform(your_dataset)
```
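If the built-in tokenization does not suit your language or domain, you can supply your own tokenizer via the `tokenizer` argument shown above. The exact interface expected by `Preprocessing` is not documented here; the sketch below assumes a tokenizer is a callable mapping a raw document string to a list of tokens:

```python
import re

def simple_tokenizer(doc):
    """Lowercase a document and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", doc.lower())

print(simple_tokenizer("Topic Models, in 2024!"))  # ['topic', 'models', 'in']

# Hypothetical usage, assuming Preprocessing accepts such a callable:
# preprocessing = Preprocessing(vocab_size=10000, tokenizer=simple_tokenizer)
```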
### 2. Topic activity over time
After training, we can compute the activity of each topic at each time slice.
```python
topic_activity = model.topic_activity_over_time(time_slices)
```
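As an illustration of what such an activity measure could look like (not necessarily FASTopic's exact definition), one could average the doc-topic distributions of the documents falling in each time slice. A self-contained sketch with toy data:

```python
# Toy example: 4 documents, 2 topics, grouped into 2 time slices.
doc_topic_dist = [
    [0.9, 0.1],  # time slice 0
    [0.7, 0.3],  # time slice 0
    [0.2, 0.8],  # time slice 1
    [0.4, 0.6],  # time slice 1
]
time_slices = [0, 0, 1, 1]

def topic_activity(doc_topic_dist, time_slices):
    """Mean topic probability per time slice: {slice: [activity of each topic]}."""
    grouped = {}
    for row, t in zip(doc_topic_dist, time_slices):
        grouped.setdefault(t, []).append(row)
    return {
        t: [sum(col) / len(rows) for col in zip(*rows)]
        for t, rows in grouped.items()
    }

activity = topic_activity(doc_topic_dist, time_slices)
# Topic 0 dominates slice 0 (~0.8); topic 1 dominates slice 1 (~0.7).
```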
## Citation
If you use our package, please cite it as:
```bibtex
@article{wu2024fastopic,
    title={FASTopic: A Fast, Adaptive, Stable, and Transferable Topic Modeling Paradigm},
    author={Wu, Xiaobao and Nguyen, Thong and Zhang, Delvin Ce and Wang, William Yang and Luu, Anh Tuan},
    journal={arXiv preprint arXiv:2405.17978},
    year={2024}
}
```