antm


| Field | Value |
|---|---|
| Name | antm |
| Version | 0.1.7 |
| home_page | https://github.com/hamedR96/ANTM |
| Summary | Aligned Neural Topic Model for Exploring Evolving Topics |
| upload_time | 2023-11-12 20:35:20 |
| maintainer | |
| docs_url | None |
| author | Hamed Rahimi |
| requires_python | |
| license | |
| keywords | |
| VCS | |
| bugtrack_url | |
| requirements | gensim, hdbscan, matplotlib, nltk, numpy, pandas, plotly, scikit_learn, scipy, sentence_transformers, swifter, torch, transformers |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |

[![PyPI - PyPi](https://img.shields.io/pypi/v/antm)](https://pypi.org/project/antm/)
[![MIT license](https://img.shields.io/badge/License-MIT-blue.svg)](https://hamedrahimi.fr)
[![arXiv](https://img.shields.io/badge/arXiv-2302.01501-<COLOR>.svg)](https://arxiv.org/abs/2302.01501)

# ANTM
ANTM: An Aligned Neural Topic Model for Exploring Evolving Topics

![alt text](https://github.com/hamedR96/ANTM/blob/main/diagram_Twitter.png?raw=true)

 Dynamic topic models are effective methods that primarily focus on studying the evolution of topics present in a collection of documents. These models are widely used for understanding trends, exploring public opinion in social networks, or tracking research progress and discoveries in scientific archives. Since topics are defined as clusters of semantically similar documents, it is necessary to observe the changes in the content or themes of these clusters in order to understand how topics evolve as new knowledge is discovered over time. Here, we introduce a dynamic neural topic model called ANTM, which uses document embeddings (data2vec) to compute clusters of semantically similar documents at different periods, and aligns document clusters to represent their evolution. This alignment procedure preserves the temporal similarity of document clusters over time and captures the semantic change of words characterized by their context within different periods. Experiments on four different datasets show that ANTM outperforms probabilistic dynamic topic models (e.g. DTM, DETM) and significantly improves topic coherence and diversity over other existing dynamic neural topic models (e.g. BERTopic).
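
To make the alignment idea concrete, here is a minimal sketch. It is not ANTM's actual implementation: it assumes each period's clusters are summarised by a centroid vector and simply links a cluster to its most similar cluster in the next period when their cosine similarity reaches a threshold. The names `cosine` and `align_periods`, the threshold value, and the toy vectors are all hypothetical and chosen for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align_periods(centroids_per_period, threshold=0.8):
    """Link each cluster to its most similar cluster in the next period.

    centroids_per_period: one list of centroid vectors per period.
    Returns a list of ((period, cluster), (period + 1, cluster)) links.
    """
    links = []
    for t in range(len(centroids_per_period) - 1):
        current = centroids_per_period[t]
        following = centroids_per_period[t + 1]
        if not following:
            continue
        for i, centroid in enumerate(current):
            # Find the best-matching cluster in the next period.
            j, best = max(
                ((j, cosine(centroid, other)) for j, other in enumerate(following)),
                key=lambda pair: pair[1],
            )
            # Keep the link only if the match is similar enough.
            if best >= threshold:
                links.append(((t, i), (t + 1, j)))
    return links


# Toy example (assumed data): two periods with two 2-d centroids each.
periods = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.9, 0.1], [-1.0, 0.0]],
]
print(align_periods(periods))  # [((0, 0), (1, 0))]
```

In this toy setting, chaining links across periods yields one evolving topic, while a cluster without a link (here cluster 1 of period 0) has no continuation in the next period.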


## Installation

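Install the package from PyPI with pip. The release pins exact versions of its dependencies (for example `torch==1.13.1` and `numpy==1.22.0`), so a fresh virtual environment is advisable: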

```bash
pip install antm
```

## Quick Start
As implemented in the notebook, the following steps extract evolving topics from the DBLP dataset, which contains computer science articles.
### To Fit and Save a Model

```python
from antm import ANTM
import pandas as pd

# Load the data and rename the columns to "content" and "time".
df = pd.read_parquet("./data/dblpFullSchema_2000_2020_extract_big_data_2K.parquet")
df = df[["abstract", "year"]].rename(columns={"abstract": "content", "year": "time"})
# Sort by time; the second reset_index() keeps the new row position as an "index" column.
df = df.dropna().sort_values("time").reset_index(drop=True).reset_index()

# Choose the window size and the overlap length for the time frames.
window_size = 6
overlap = 2

# Initialize the model.
model = ANTM(
    df,
    overlap,
    window_size,
    umap_n_neighbors=10,
    partioned_clusttering_size=5,
    mode="data2vec",
    num_words=10,
    path="./saved_data",
)

# Fit the model and save it.
topics_per_period = model.fit(save=True)
# The output is a list of time frames, each holding the topics associated with that period.
```
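
The `window_size` and `overlap` arguments define overlapping time frames. As a rough sketch of one plausible reading (an assumption, not taken from the ANTM source), each frame covers `window_size` time units and shares `overlap` units with the previous frame, so frames start `window_size - overlap` units apart. The helper `time_frames` is hypothetical.

```python
def time_frames(start, end, window_size, overlap):
    """Return inclusive (first, last) time frames covering start..end.

    Assumes each frame spans `window_size` units and shares `overlap`
    units with the previous frame; ANTM's own segmentation may differ.
    """
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must satisfy 0 <= overlap < window_size")
    step = window_size - overlap
    frames = []
    first = start
    while True:
        # Clip the final frame so it does not run past the end.
        last = min(first + window_size - 1, end)
        frames.append((first, last))
        if last >= end:
            break
        first += step
    return frames


# Years 2000-2020 with the quick-start settings.
print(time_frames(2000, 2020, window_size=6, overlap=2))
# [(2000, 2005), (2004, 2009), (2008, 2013), (2012, 2017), (2016, 2020)]
```

Under this reading, a larger overlap gives adjacent frames more shared documents, at the cost of more frames to process.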
### To Load a Model

```python
from antm import ANTM
import pandas as pd

# Load the data and rename the columns to "content" and "time".
df = pd.read_parquet("./data/dblpFullSchema_2000_2020_extract_big_data_2K.parquet")
df = df[["abstract", "year"]].rename(columns={"abstract": "content", "year": "time"})
df = df.dropna().sort_values("time").reset_index(drop=True).reset_index()

# Choose the window size and the overlap length for the time frames.
window_size = 6
overlap = 2

# Initialize the model with the path used when saving, then load the saved model.
model = ANTM(df, overlap, window_size, mode="data2vec", num_words=10, path="./saved_data")
topics_per_period = model.load()
```
### Plug-and-Play Functions
```python
# find all the evolving topics and save their plots
model.save_evolution_topics_plots(display=False)

# plot a random evolving topic with 2-dimensional document representations
model.random_evolution_topic()

# plot the partitioned clusters for each time frame
model.plot_clusters_over_time()

# plot all the evolving topics
model.plot_evolving_topics()
```
### Topic Quality Metrics 
```python
# returns the pairwise Jaccard diversity for each period
model.get_periodwise_pairwise_jaccard_diversity()

# returns the proportion-of-unique-words (PUW) diversity for each period
model.get_periodwise_puw_diversity()

# returns the topic coherence for each period
model.get_periodwise_topic_coherence(model="c_v")
```
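
For intuition about the two diversity scores, here is a minimal sketch based on their common definitions: the proportion of unique words is the number of distinct top words divided by the total number of top words, and pairwise Jaccard diversity is taken here as one minus the mean Jaccard similarity over all topic pairs. The functions `puw_diversity` and `pairwise_jaccard_diversity` and the sample topics are illustrative stand-ins; ANTM's own implementation may differ in detail.

```python
from itertools import combinations


def puw_diversity(topics):
    """Proportion of unique words across all topics' top-word lists."""
    words = [word for topic in topics for word in topic]
    return len(set(words)) / len(words) if words else 0.0


def pairwise_jaccard_diversity(topics):
    """One minus the mean Jaccard similarity over all topic pairs."""
    pairs = list(combinations(topics, 2))
    if not pairs:
        return 1.0  # a single topic cannot overlap with another
    similarities = []
    for first, second in pairs:
        a, b = set(first), set(second)
        union = a | b
        similarities.append(len(a & b) / len(union) if union else 0.0)
    return 1.0 - sum(similarities) / len(similarities)


# Assumed toy topics: each topic is a list of its top words.
topics = [
    ["data", "query", "index"],
    ["data", "graph", "node"],
    ["neural", "network", "layer"],
]
print(puw_diversity(topics))               # 8 unique words out of 9
print(pairwise_jaccard_diversity(topics))  # only the first pair shares a word
```

Both scores approach 1 when topics share few words and drop toward 0 as topics repeat the same vocabulary.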
## Datasets
[arXiv articles](https://www.kaggle.com/datasets/Cornell-University/arxiv)

[DBLP articles](https://nuage.lip6.fr/s/FLKwdzcsbqYMkat)

[Elon Musk's Tweets](https://nuage.lip6.fr/s/XKkcWLAiDiykZ4D)

[New York Times News](https://nuage.lip6.fr/s/XKkcWLAiDiykZ4D)

## Experiments
You can use the notebooks provided in "./experiments" to run ANTM on other sequential datasets.
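
Judging from the quick start, a dataset is passed as a DataFrame with a `content` column (the text) and a `time` column, sorted by time and carrying a running `index` column. The sketch below prepares a toy frame in that shape. The raw column names `text` and `published` and the sample rows are made up for illustration, and the exact schema ANTM requires should be checked against the notebooks.

```python
import pandas as pd

# Hypothetical raw data: any text column plus a sortable time column.
raw = pd.DataFrame(
    {
        "text": ["graph databases", None, "neural topic models"],
        "published": [2015, 2012, 2010],
    }
)

df = (
    raw.rename(columns={"text": "content", "published": "time"})
    .dropna()                # drop rows with missing text or time
    .sort_values("time")     # oldest documents first
    .reset_index(drop=True)  # discard the old row labels
    .reset_index()           # expose the new 0..n-1 position as "index"
)
print(df)
```

The resulting frame has the columns `index`, `content`, and `time`, matching the frame built in the quick start.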


## Citation
To cite [ANTM](https://arxiv.org/abs/2302.01501), please use the following BibTeX reference:
```bibtex
@misc{rahimi2023antm,
      title={ANTM: An Aligned Neural Topic Model for Exploring Evolving Topics}, 
      author={Hamed Rahimi and Hubert Naacke and Camelia Constantin and Bernd Amann},
      year={2023},
      eprint={2302.01501},
      archivePrefix={arXiv},
      primaryClass={cs.IR}
}
```

            

Raw data

{
    "_id": null,
    "home_page": "https://github.com/hamedR96/ANTM",
    "name": "antm",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "keywords": "",
    "author": "Hamed Rahimi",
    "author_email": "<hamed.rahimi@sorbonne-universite.fr",
    "download_url": "https://files.pythonhosted.org/packages/b0/02/ad6124bab88652dbc13ec3e4d3569a1289ea6b15a05f7c6379c15bb51bed/antm-0.1.7.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "",
    "summary": "Aligned Neural Topic Model for Exploring Evolving Topics",
    "version": "0.1.7",
    "project_urls": {
        "Bug Tracker": "https://github.com/hamedR96/ANTM/issues",
        "Homepage": "https://github.com/hamedR96/ANTM"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "f0610d29bc0617b11dabde12b68d275db26f3f5770eab9ae94945cf602eafc54",
                "md5": "28d3377e319ea69416b445f588e95269",
                "sha256": "c9394b044d29634df45e4aaa4db796d6c54fe3ea1596f68dbe512d2822775a84"
            },
            "downloads": -1,
            "filename": "antm-0.1.7-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "28d3377e319ea69416b445f588e95269",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 14497,
            "upload_time": "2023-11-12T20:35:09",
            "upload_time_iso_8601": "2023-11-12T20:35:09.987481Z",
            "url": "https://files.pythonhosted.org/packages/f0/61/0d29bc0617b11dabde12b68d275db26f3f5770eab9ae94945cf602eafc54/antm-0.1.7-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "b002ad6124bab88652dbc13ec3e4d3569a1289ea6b15a05f7c6379c15bb51bed",
                "md5": "cc6255e36c11930bb2407c380218e9e9",
                "sha256": "dd58c670243849ae25bbb3091774deec02b330578c3114fbc58a5d2da21f9d3f"
            },
            "downloads": -1,
            "filename": "antm-0.1.7.tar.gz",
            "has_sig": false,
            "md5_digest": "cc6255e36c11930bb2407c380218e9e9",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 14381,
            "upload_time": "2023-11-12T20:35:20",
            "upload_time_iso_8601": "2023-11-12T20:35:20.586119Z",
            "url": "https://files.pythonhosted.org/packages/b0/02/ad6124bab88652dbc13ec3e4d3569a1289ea6b15a05f7c6379c15bb51bed/antm-0.1.7.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-11-12 20:35:20",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "hamedR96",
    "github_project": "ANTM",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [
        {
            "name": "gensim",
            "specs": [
                [
                    "==",
                    "4.3.0"
                ]
            ]
        },
        {
            "name": "hdbscan",
            "specs": [
                [
                    "==",
                    "0.8.29"
                ]
            ]
        },
        {
            "name": "matplotlib",
            "specs": [
                [
                    "==",
                    "3.6.2"
                ]
            ]
        },
        {
            "name": "nltk",
            "specs": [
                [
                    "==",
                    "3.8.1"
                ]
            ]
        },
        {
            "name": "numpy",
            "specs": [
                [
                    "==",
                    "1.22.0"
                ]
            ]
        },
        {
            "name": "pandas",
            "specs": [
                [
                    "==",
                    "1.5.2"
                ]
            ]
        },
        {
            "name": "plotly",
            "specs": [
                [
                    "==",
                    "5.13.0"
                ]
            ]
        },
        {
            "name": "scikit_learn",
            "specs": [
                [
                    "==",
                    "1.2.1"
                ]
            ]
        },
        {
            "name": "scipy",
            "specs": [
                [
                    "==",
                    "1.10.0"
                ]
            ]
        },
        {
            "name": "sentence_transformers",
            "specs": [
                [
                    "==",
                    "2.2.2"
                ]
            ]
        },
        {
            "name": "swifter",
            "specs": [
                [
                    "==",
                    "1.3.4"
                ]
            ]
        },
        {
            "name": "torch",
            "specs": [
                [
                    "==",
                    "1.13.1"
                ]
            ]
        },
        {
            "name": "transformers",
            "specs": [
                [
                    "==",
                    "4.26.0"
                ]
            ]
        }
    ],
    "lcname": "antm"
}
        