# Embedded Topic Model
This package makes it easy to run embedded topic modelling (ETM) on a given corpus.
ETM is a topic model that marries the probabilistic topic modelling of Latent Dirichlet Allocation with the
contextual information brought by word embeddings, most specifically word2vec. ETM models topics as points
in the word embedding space, placing topics and words with similar contexts close together.
As such, ETM can either learn word embeddings alongside topics, or be given pretrained embeddings with which to discover
the topic patterns of the corpus.
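In the original paper's notation, each topic `k` has an embedding vector that lives in the same space as the word embeddings, and its word distribution is a softmax over the inner products between the word embedding matrix and that topic embedding. This is only a brief summary of the model, not anything specific to this package's API:
```
\beta_k = \mathrm{softmax}(\rho^{\top} \alpha_k)
```
where `\rho` is the word embedding matrix and `\alpha_k` is the embedding of topic `k`.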
ETM was originally published by Adji B. Dieng, Francisco J. R. Ruiz, and David M. Blei in an article titled ["Topic Modeling in Embedding Spaces"](https://arxiv.org/abs/1907.04907) in 2019. This code is an adaptation of the [original](https://github.com/adjidieng/ETM) provided with the article and is not affiliated in any manner with the original authors. Most of the original code was kept, with changes made mainly for ease of use. This package was created for research purposes. If you want a more stable and feature-rich package to train ETM and other models, take a look at [OCTIS](https://github.com/MIND-Lab/OCTIS).
With the tools provided here, you can run ETM on your dataset using simple steps.
# Index
* [:beer: Installation](#beer-installation)
* [:wrench: Usage](#wrench-usage)
  * [:microscope: Examples](#microscope-examples)
* [:books: Citation](#books-citation)
* [:heart: Contributing](#heart-contributing)
* [:v: Acknowledgements](#v-acknowledgements)
* [:pushpin: License](#pushpin-license)
# :beer: Installation
You can install the package using ```pip``` by running: ```pip install -U embedded_topic_model```
# :wrench: Usage
To use ETM on your corpus, you must first preprocess the documents into a format the model understands.
This package has a quick-use preprocessing script. The only requirement is that the corpus must be
a list of strings, where each string corresponds to a document in the corpus.
You can preprocess your corpus as follows:
```python
from embedded_topic_model.utils import preprocessing
import json
# Loading a dataset in JSON format. As noted above, each document must be a single string
corpus_file = 'datasets/example_dataset.json'
documents_raw = json.load(open(corpus_file, 'r'))
documents = [document['body'] for document in documents_raw]
# Preprocessing the dataset
vocabulary, train_dataset, _ = preprocessing.create_etm_datasets(
    documents,
    min_df=0.01,
    max_df=0.75,
    train_size=0.85,
)
```
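As a quick sanity check, you can inspect the resulting vocabulary. This is a minimal sketch that assumes the returned vocabulary behaves like a list of tokens:
```python
# Hypothetical sanity check: inspect the vocabulary kept after min_df/max_df filtering
print(f'vocabulary size: {len(vocabulary)}')
print(f'sample tokens: {vocabulary[:10]}')
```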
Then, you can train word2vec embeddings to use with the ETM model. This is optional: if you're not interested
in training your own embeddings, you can either pass a pretrained word2vec embeddings file to ETM or let ETM learn the embeddings
itself. If you want ETM to learn its word embeddings, just pass ```train_embeddings=True``` as an instance parameter.
To pretrain the embeddings, you can do the following:
```python
from embedded_topic_model.utils import embedding
# Training word2vec embeddings
embeddings_mapping = embedding.create_word2vec_embedding_from_dataset(documents)
```
To create and fit the model using the training data, execute:
```python
from embedded_topic_model.models.etm import ETM
# Training an ETM instance
etm_instance = ETM(
    vocabulary,
    embeddings=embeddings_mapping,  # You can pass here the path to a word2vec file or
                                    # a KeyedVectors instance
    num_topics=8,
    epochs=100,
    debug_mode=True,
    train_embeddings=False,  # Optional. If True, ETM will learn word embeddings jointly with
                             # topic embeddings. By default, this is False. If the 'embeddings'
                             # argument is passed, this must not be True.
)
etm_instance.fit(train_dataset)
```
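If you would rather let ETM learn the word embeddings jointly with the topic embeddings, a minimal sketch of the same call (using the constructor arguments shown above and omitting `embeddings`, as the comment indicates) would be:
```python
from embedded_topic_model.models.etm import ETM

# Sketch: letting ETM learn word embeddings jointly with topic embeddings.
# The 'embeddings' argument is omitted because train_embeddings=True.
etm_joint = ETM(
    vocabulary,
    num_topics=8,
    epochs=100,
    debug_mode=True,
    train_embeddings=True,
)
etm_joint.fit(train_dataset)
```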
You can get the topic words with this method. Note that you can select how many words per topic you're interested in:
```python
t_w_mtx = etm_instance.get_topics(top_n_words=20)
```
You can get the topic-word matrix with this method. Note that it returns all words for each topic:
```python
t_w_mtx = etm_instance.get_topic_word_matrix()
```
You can get the topic-word distribution matrix and the document-topic distribution matrix with the following methods; both return normalized distribution matrices:
```python
t_w_dist_mtx = etm_instance.get_topic_word_dist()
d_t_dist_mtx = etm_instance.get_document_topic_dist()
```
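For example, here is a small sketch of how you might use the document-topic distribution to find each document's dominant topic. It assumes the returned matrix can be converted to a NumPy array (depending on the version it may come back as a `torch` tensor):
```python
import numpy as np

# Convert the document-topic distribution to a NumPy array and take the
# most probable topic per document.
d_t = np.asarray(d_t_dist_mtx)
dominant_topic_per_document = d_t.argmax(axis=1)
print(dominant_topic_per_document[:10])
```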
Also, to obtain the topic coherence or topic diversity of the model, you can do as follows:
```python
topics = etm_instance.get_topics(20)
topic_coherence = etm_instance.get_topic_coherence()
topic_diversity = etm_instance.get_topic_diversity()
```
You can also predict topics for unseen documents as follows:
```python
import json

from embedded_topic_model.utils import preprocessing
from embedded_topic_model.models.etm import ETM
corpus_file = 'datasets/example_dataset.json'
documents_raw = json.load(open(corpus_file, 'r'))
documents = [document['body'] for document in documents_raw]
# Splits into train/test datasets
train = documents[:len(documents)-100]
test = documents[len(documents)-100:]
# Model fitting
# ...
# The vocabulary must be the same one created during preprocessing of the training dataset (see above)
preprocessed_test = preprocessing.create_bow_dataset(test, vocabulary)
# Transforms test dataset and returns normalized document topic distribution
test_d_t_dist = etm_instance.transform(preprocessed_test)
print(f'test_d_t_dist: {test_d_t_dist}')
```
For further details, see [examples](#microscope-examples).
## :microscope: Examples
| title | link |
| :-------------: | :--: |
| ETM example - Reddit (r/depression) dataset | [Jupyter Notebook](./2023-09-01%20-%20reddit%20-%20depression%20dataset%20-%20etm%20-%20example.ipynb) |
# :books: Citation
To cite ETM, use the original article's citation:
```
@article{dieng2019topic,
  title   = {Topic modeling in embedding spaces},
  author  = {Dieng, Adji B and Ruiz, Francisco J R and Blei, David M},
  journal = {arXiv preprint arXiv:1907.04907},
  year    = {2019}
}
```
# :heart: Contributing
Contributions are always welcome :heart:! Take a look at the contributing guidelines in the repository for details. Feel free to reach out through issues to discuss desired enhancements and to check whether work is already being done on the matter.
# :v: Acknowledgements
Credit goes to Adji B. Dieng, Francisco J. R. Ruiz, and David M. Blei for the original work.
# :pushpin: License
Licensed under the [MIT](LICENSE) license.
# Changelog
This changelog was inspired by the [keep-a-changelog](https://github.com/olivierlacan/keep-a-changelog) project and follows [semantic versioning](https://semver.org).
## [1.2.1] - 2023-09-06
### Changed
- ([#cf35c3](https://github.com/lffloyd/embedded-topic-model/commit/cf35c3)) fixes minimum python version to be `python>=3.9`
## [1.2.0] - 2023-09-06
### Added
- ([#61730d](https://github.com/lffloyd/embedded-topic-model/commit/61730d), [#224995](https://github.com/lffloyd/embedded-topic-model/commit/224995), [#331fc0](https://github.com/lffloyd/embedded-topic-model/commit/331fc0)) adds support for macOS MPS devices and updates outdated `numpy`/`sklearn` code - thanks to [@d-jiao](https://github.com/d-jiao)
- ([#c48016](https://github.com/lffloyd/embedded-topic-model/commit/c48016), [#2fe517](https://github.com/lffloyd/embedded-topic-model/commit/2fe517), [#c965b1](https://github.com/lffloyd/embedded-topic-model/commit/c965b1), [#5578ca](https://github.com/lffloyd/embedded-topic-model/commit/5578ca), [#5b0d85](https://github.com/lffloyd/embedded-topic-model/commit/5b0d85)) adds security guidelines and request templates
### Changed
- ([#331fc0](https://github.com/lffloyd/embedded-topic-model/commit/331fc0)) updates the actions pipeline, supported python versions, and internal dependencies to the latest available versions (`torch`, `gensim`, among others). Support for `python<=3.8` was dropped as a result. Numerous security vulnerabilities were resolved
## [1.1.0] - 2023-09-05
### Added
- ([#3f27ee](https://github.com/lffloyd/embedded-topic-model/commit/3f27ee)) adds `transform` method
- ([#f98f3f](https://github.com/lffloyd/embedded-topic-model/commit/f98f3f)) adds example jupyter notebook
- ([#683bec](https://github.com/lffloyd/embedded-topic-model/commit/683bec)) adds contributing and conduct guidelines
### Changed
- ([#f98f3f](https://github.com/lffloyd/embedded-topic-model/commit/f98f3f), [#c918a4](https://github.com/lffloyd/embedded-topic-model/commit/c918a4)) updates documentation
## [1.0.2] - 2021-06-23
### Changed
- deactivates debug mode by default
- documents `get_most_similar_words` method
## [1.0.1] - 2021-02-15
### Changed
- optimizes original word2vec TXT file input for model training
- updates README.md
## [1.0.0] - 2021-02-15
### Added
- adds support for original word2vec pretrained embeddings files on both formats (BIN/TXT)
### Changed
- optimizes handling of gensim's word2vec mapping file for better memory usage
## [0.1.1] - 2021-02-01
### Added
- support for python 3.6
## [0.1.0] - 2021-02-01
### Added
- ETM training with partially tested support for original ETM features.
- ETM corpus preprocessing scripts - including word2vec embeddings training - adapted from the original code.
- adds methods to retrieve document-topic and topic-word probability distributions from the trained model.
- adds docstrings for tested API methods.
- adds unit and integration tests for ETM and preprocessing scripts.