fastopic


- Name: fastopic
- Version: 0.0.3
- Home page: https://github.com/bobxwu/FASTopic
- Summary: FASTopic
- Upload time: 2024-06-22 09:34:00
- Maintainer: None
- Docs URL: None
- Author: Xiaobao Wu
- Requires Python: None
- License: Apache 2.0 License
- Keywords: topic model, neural topic model, transformers, optimal transport
- Requirements: topmost

# FASTopic

![stars](https://img.shields.io/github/stars/bobxwu/FASTopic?logo=github)
[![PyPI](https://img.shields.io/pypi/v/fastopic)](https://pypi.org/project/fastopic)
[![Downloads](https://static.pepy.tech/badge/fastopic)](https://pepy.tech/project/fastopic)
[![LICENSE](https://img.shields.io/github/license/bobxwu/fastopic)](https://www.apache.org/licenses/LICENSE-2.0/)
[![arXiv](https://img.shields.io/badge/arXiv-2405.17978-<COLOR>.svg)](https://arxiv.org/pdf/2405.17978.pdf)
[![Contributors](https://img.shields.io/github/contributors/bobxwu/fastopic)](https://github.com/bobxwu/fastopic/graphs/contributors/)


FASTopic is a fast, adaptive, stable, and transferable topic modeling package.
It uses pretrained Transformers to produce document embeddings and discovers latent topics through optimal transport between document, topic, and word embeddings.
The result is a simple and efficient topic modeling paradigm that differs from traditional probabilistic, VAE-based, and clustering-based models.
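
To give a feel for what "optimal transport between embeddings" means, here is a minimal sketch of an entropic-regularized transport plan computed with generic Sinkhorn iterations. It is not FASTopic's implementation: the function name `sinkhorn_plan`, the squared Euclidean cost, the uniform marginals, and the random stand-in embeddings are all assumptions chosen for illustration.

```python
import numpy as np


def sinkhorn_plan(cost, epsilon=0.5, n_iters=200):
    """Entropic-regularized transport plan between two uniform marginals."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)        # uniform mass on rows (e.g. documents)
    b = np.full(m, 1.0 / m)        # uniform mass on columns (e.g. topics)
    kernel = np.exp(-cost / epsilon)  # Gibbs kernel of the cost matrix
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (kernel @ v)       # rescale rows to match marginal a
        v = b / (kernel.T @ u)     # rescale columns to match marginal b
    return u[:, None] * kernel * v[None, :]


# Random stand-ins for document and topic embeddings (hypothetical data).
rng = np.random.default_rng(0)
doc_emb = rng.normal(size=(6, 4))
topic_emb = rng.normal(size=(3, 4))

# Squared Euclidean cost, scaled to [0, 1] for numerical stability.
cost = ((doc_emb[:, None, :] - topic_emb[None, :, :]) ** 2).sum(axis=-1)
cost = cost / cost.max()

plan = sinkhorn_plan(cost)
print(plan.shape)  # one row per document, one column per topic
```

Each entry of `plan` is the amount of mass moved between one document and one topic; cheaper (closer) pairs receive more mass, which is how a transport plan can act as a soft assignment.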


<img src='docs/img/illustration.svg' width='300pt'></img>


## Installation

Install FASTopic with `pip`:

```bash
pip install fastopic
```

Otherwise, install FASTopic from the source:

```bash
git clone https://github.com/bobxwu/FASTopic.git
cd FASTopic && pip install .
```

## Quick Start

Discover topics from 20newsgroups.

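```python
from fastopic import FASTopic
from sklearn.datasets import fetch_20newsgroups
from topmost.preprocessing import Preprocessing

# Load the raw 20newsgroups documents without headers, footers, and quoted replies.
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']

# Tokenize, remove English stopwords, and cap the vocabulary at 10,000 words.
preprocessing = Preprocessing(vocab_size=10000, stopwords='English')

# Train a model with 50 topics. Both arguments are passed positionally,
# because a positional argument cannot follow a keyword argument in Python.
model = FASTopic(50, preprocessing)
topic_top_words, doc_topic_dist = model.fit_transform(docs)
```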

`topic_top_words` is a list containing the top words of each discovered topic.
`doc_topic_dist` holds the topic distribution of each document (the doc-topic distributions)
as a NumPy array of shape $N \times K$, where $N$ is the number of documents and $K$ is the number of topics.
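
As a sketch of how these two outputs can be combined, the snippet below finds the dominant topic of each document and counts documents per topic. The arrays are small hand-written stand-ins rather than real model output, and the format of `topic_top_words` (one string of space-separated words per topic) is an assumption.

```python
import numpy as np

# Stand-in for the N x K array returned by fit_transform (4 docs, 3 topics).
doc_topic_dist = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.3, 0.4, 0.3],
    [0.6, 0.3, 0.1],
])

# Stand-in top words; one space-separated string per topic is assumed.
topic_top_words = [
    "space nasa orbit",
    "game team season",
    "windows file driver",
]

# Index of the highest-probability topic for each document.
dominant_topic = doc_topic_dist.argmax(axis=1)

# Number of documents whose dominant topic is k, for every k.
topic_sizes = np.bincount(dominant_topic, minlength=doc_topic_dist.shape[1])

for doc_id, k in enumerate(dominant_topic):
    print(f"doc {doc_id}: topic {k} ({topic_top_words[k]})")
```

With real output, replace the stand-ins with the values returned by `model.fit_transform(docs)`.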


## Usage

### 1. Try FASTopic on your dataset


```python
from fastopic import FASTopic
from topmost.preprocessing import Preprocessing

# Prepare your dataset as a list of raw document strings.
your_dataset = [
    'doc 1',
    'doc 2', # ...
]

# Preprocess the dataset. This step tokenizes docs, removes stopwords, caps the vocabulary size, and so on.
# Pass your tokenizer as:
#   preprocessing = Preprocessing(vocab_size=your_vocab_size, tokenizer=your_tokenizer, stopwords=your_stopwords_set)
preprocessing = Preprocessing(stopwords='English')

# Pass both arguments positionally; a positional argument cannot follow a keyword argument.
model = FASTopic(50, preprocessing)
topic_top_words, doc_topic_dist = model.fit_transform(your_dataset)
```


### 2. Topic activity over time

After training, we can compute the activity of each topic at each time slice.

```python
topic_activity = model.topic_activity_over_time(time_slices)
```
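
The expected format of `time_slices` and the exact computation are not described here. As an illustration only, the sketch below assumes one time-slice label per training document and defines activity as the mean topic proportion of the documents in each slice; the helper `mean_activity_per_slice` is hypothetical, and the library's own result may be computed differently.

```python
import numpy as np


def mean_activity_per_slice(doc_topic_dist, time_slices):
    """Average doc-topic distributions within each time slice.

    Returns the sorted unique slice labels and a K x T array, where
    K is the number of topics and T the number of distinct slices.
    """
    time_slices = np.asarray(time_slices)
    labels = np.unique(time_slices)
    activity = np.stack(
        [doc_topic_dist[time_slices == t].mean(axis=0) for t in labels],
        axis=1,
    )
    return labels, activity


# Stand-in data: 4 documents, 2 topics, one year label per document.
doc_topic_dist = np.array([
    [0.8, 0.2],
    [0.6, 0.4],
    [0.1, 0.9],
    [0.3, 0.7],
])
time_slices = [2020, 2020, 2021, 2021]

labels, activity = mean_activity_per_slice(doc_topic_dist, time_slices)
print(labels)
print(activity)
```

Here row `k` of `activity` traces how prominent topic `k` is from one slice to the next.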


## Citation

If you use our package, please cite:

    @article{wu2024fastopic,
        title={FASTopic: A Fast, Adaptive, Stable, and Transferable Topic Modeling Paradigm},
        author={Wu, Xiaobao and Nguyen, Thong and Zhang, Delvin Ce and Wang, William Yang and Luu, Anh Tuan},
        journal={arXiv preprint arXiv:2405.17978},
        year={2024}
    }

            

Raw data

    {
        "_id": null,
        "home_page": "https://github.com/bobxwu/FASTopic",
        "name": "fastopic",
        "maintainer": null,
        "docs_url": null,
        "requires_python": null,
        "maintainer_email": null,
        "keywords": "topic model, neural topic model, transformers, optimal transport",
        "author": "Xiaobao Wu",
        "author_email": "xiaobao002@e.ntu.edu.sg",
        "download_url": null,
        "platform": null,
        "bugtrack_url": null,
        "license": "Apache 2.0 License",
        "summary": "FASTopic",
        "version": "0.0.3",
        "project_urls": {
            "Homepage": "https://github.com/bobxwu/FASTopic"
        },
        "split_keywords": [
            "topic model",
            " neural topic model",
            " transformers",
            " optimal transport"
        ],
        "urls": [
            {
                "comment_text": "",
                "digests": {
                    "blake2b_256": "7a553d822e5360cda62cf09899703d7957fe7494668122522de5e201779de793",
                    "md5": "077ac09275c56bd50a00df45e4d73781",
                    "sha256": "938184d927604fea078d0de61acf200677c0a5426f415cafeca8745212f26b9d"
                },
                "downloads": -1,
                "filename": "fastopic-0.0.3-1-py3-none-any.whl",
                "has_sig": false,
                "md5_digest": "077ac09275c56bd50a00df45e4d73781",
                "packagetype": "bdist_wheel",
                "python_version": "py3",
                "requires_python": null,
                "size": 14661,
                "upload_time": "2024-06-22T09:34:00",
                "upload_time_iso_8601": "2024-06-22T09:34:00.200911Z",
                "url": "https://files.pythonhosted.org/packages/7a/55/3d822e5360cda62cf09899703d7957fe7494668122522de5e201779de793/fastopic-0.0.3-1-py3-none-any.whl",
                "yanked": false,
                "yanked_reason": null
            }
        ],
        "upload_time": "2024-06-22 09:34:00",
        "github": true,
        "gitlab": false,
        "bitbucket": false,
        "codeberg": false,
        "github_user": "bobxwu",
        "github_project": "FASTopic",
        "travis_ci": false,
        "coveralls": false,
        "github_actions": false,
        "requirements": [
            {
                "name": "topmost",
                "specs": [
                    [
                        ">=",
                        "0.0.4"
                    ]
                ]
            }
        ],
        "lcname": "fastopic"
    }