===========================
Contextualized Topic Models
===========================
.. image:: https://img.shields.io/pypi/v/contextualized_topic_models.svg
:target: https://pypi.python.org/pypi/contextualized_topic_models
.. image:: https://github.com/MilaNLProc/contextualized-topic-models/workflows/Python%20package/badge.svg
:target: https://github.com/MilaNLProc/contextualized-topic-models/actions
.. image:: https://readthedocs.org/projects/contextualized-topic-models/badge/?version=latest
:target: https://contextualized-topic-models.readthedocs.io/en/latest/?badge=latest
:alt: Documentation Status
.. image:: https://img.shields.io/github/contributors/MilaNLProc/contextualized-topic-models
:target: https://github.com/MilaNLProc/contextualized-topic-models/graphs/contributors/
:alt: Contributors
.. image:: https://img.shields.io/badge/License-MIT-blue.svg
:target: https://lbesson.mit-license.org/
:alt: License
.. image:: https://pepy.tech/badge/contextualized-topic-models
:target: https://pepy.tech/project/contextualized-topic-models
:alt: Downloads
.. image:: https://colab.research.google.com/assets/colab-badge.svg
:target: https://colab.research.google.com/drive/1fXJjr_rwqvpp1IdNQ4dxqN4Dp88cxO97?usp=sharing
:alt: Open In Colab
.. image:: https://raw.githubusercontent.com/aleen42/badges/master/src/medium.svg
:target: https://medium.com/towards-data-science/contextualized-topic-modeling-with-python-eacl2021-eacf6dfa576
:alt: Medium Blog Post
.. image:: https://img.shields.io/badge/youtube-video-red
:target: https://www.youtube.com/watch?v=n1_G8K07KoM
:alt: Video Tutorial
Contextualized Topic Models (CTM) are a family of topic models that use pre-trained representations of language (e.g., BERT) to
support topic modeling. See the papers for details:
* Bianchi, F., Terragni, S., & Hovy, D. (2021). `Pre-training is a Hot Topic: Contextualized Document Embeddings Improve Topic Coherence`. ACL. https://aclanthology.org/2021.acl-short.96/
* Bianchi, F., Terragni, S., Hovy, D., Nozza, D., & Fersini, E. (2021). `Cross-lingual Contextualized Topic Models with Zero-shot Learning`. EACL. https://www.aclweb.org/anthology/2021.eacl-main.143/
.. image:: https://raw.githubusercontent.com/MilaNLProc/contextualized-topic-models/master/img/logo.png
:align: center
:width: 200px
Topic Modeling with Contextualized Embeddings
---------------------------------------------
Our new topic modeling family supports many different languages (i.e., the ones supported by HuggingFace models) and comes in two versions: **CombinedTM** combines contextual embeddings with the good old bag of words to make more coherent topics; **ZeroShotTM** is the perfect topic model for tasks in which you might have missing words in the test data and, if trained with multilingual embeddings, it also inherits the property of being a multilingual topic model!
The big advantage is that you can use different embeddings for CTMs. Thus, when a new
embedding method comes out you can use it in the code and improve your results. We are not limited
by the BoW anymore.
We also have `Kitty <https://contextualized-topic-models.readthedocs.io/en/latest/kitty.html>`_, a new submodule that can be used to create a human-in-the-loop
classifier to quickly classify your documents and create named clusters.
.. image:: https://raw.githubusercontent.com/MilaNLProc/contextualized-topic-models/master/img/logo_kitty.png
:align: center
:width: 200px
Tutorials
---------
You can look at our `medium`_ blog post or start from one of our Colab Tutorials:
.. |colab1_2| image:: https://colab.research.google.com/assets/colab-badge.svg
:target: https://colab.research.google.com/drive/1fXJjr_rwqvpp1IdNQ4dxqN4Dp88cxO97?usp=sharing
:alt: Open In Colab
.. |colab2_2| image:: https://colab.research.google.com/assets/colab-badge.svg
:target: https://colab.research.google.com/drive/1bfWUYEypULFk_4Tfff-Pb_n7-tSjEe9v?usp=sharing
:alt: Open In Colab
.. |colab3_3| image:: https://colab.research.google.com/assets/colab-badge.svg
:target: https://colab.research.google.com/drive/1upTRu4zSm1VMbl633n9qkIDA526l22E_?usp=sharing
:alt: Open In Colab
.. |kitty_colab| image:: https://colab.research.google.com/assets/colab-badge.svg
:target: https://colab.research.google.com/drive/18mKzaKnmBlBOHb1oiS5MtaTSyq47ys2X?usp=sharing
:alt: Open In Colab
+--------------------------------------------------------------------------------+------------------+
| Name | Link |
+================================================================================+==================+
| Combined TM on Wikipedia Data (Preproc+Saving+Viz) (stable **v2.3.0**) | |colab1_2| |
+--------------------------------------------------------------------------------+------------------+
| Zero-Shot Cross-lingual Topic Modeling (Preproc+Viz) (stable **v2.3.0**) | |colab2_2| |
+--------------------------------------------------------------------------------+------------------+
| Kitty: Human in the loop Classifier (High-level usage) (stable **v2.2.0**) | |kitty_colab| |
+--------------------------------------------------------------------------------+------------------+
| SuperCTM and β-CTM (High-level usage) (stable **v2.2.0**) | |colab3_3| |
+--------------------------------------------------------------------------------+------------------+
Overview
--------
TL;DR
~~~~~
+ CTMs come in two flavors: CombinedTM and ZeroShotTM, which have different use cases.
+ CTMs work better when the size of the bag of words **has been restricted to a number of terms** that does not go over **2000 elements**. This is because we have a neural model that reconstructs the input bag of words. Moreover, in CombinedTM we project the contextualized embeddings to the vocabulary space: the bigger the vocabulary, the more parameters you get, and training becomes more difficult and prone to bad fits. This is **NOT** a strict limit, but consider preprocessing your dataset. We have a preprocessing_ pipeline that can help you deal with this (see the sketch after this list).
+ Check the contextual model you are using: a **multilingual model used on English data might not give results that are as good** as a model trained purely on English.
+ **Preprocessing is key**. If you give a contextual model like BERT preprocessed text, it might be difficult to get a good representation out of it. What we usually do is use the preprocessed text to build the bag of words and the NOT preprocessed text for the BERT embeddings. Our preprocessing_ class can take care of this for you.
+ CTM uses `SBERT`_; you should check it out to better understand how we create embeddings. SBERT allows us to use any embedding model. You might want to check things like `max length <https://www.sbert.net/examples/applications/computing-embeddings/README.html#input-sequence-length>`_.
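Below is a minimal sketch of how you could cap the vocabulary with the preprocessing pipeline. The toy documents and the ``vocabulary_size`` argument are assumptions made for illustration; check the preprocessing documentation for the exact signature.

.. code-block:: python

    from contextualized_topic_models.utils.preprocessing import WhiteSpacePreprocessing

    # hypothetical toy corpus; replace with your own raw documents
    documents = ["first raw document", "second raw document", "third raw document"]

    # vocabulary_size (assumed parameter) keeps only the most frequent terms for the BoW
    sp = WhiteSpacePreprocessing(documents, "english", vocabulary_size=2000)
    preprocessed_documents, unpreprocessed_corpus, vocab, retained_indices = sp.preprocess()
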
Installing
~~~~~~~~~~
**Important**: If you want to use CUDA you need to install the CUDA version that matches your setup; see pytorch_.

Install the package using pip:
.. code-block:: bash
pip install -U contextualized_topic_models
Models
~~~~~~
An important aspect to take into account is which network you want to use:
the one that combines contextualized embeddings
and the BoW (`CombinedTM <https://contextualized-topic-models.readthedocs.io/en/latest/combined.html>`_) or the one that uses only contextualized embeddings (`ZeroShotTM <https://contextualized-topic-models.readthedocs.io/en/latest/zeroshot.html>`_).

Remember that you can do zero-shot cross-lingual topic modeling only with the `ZeroShotTM <https://contextualized-topic-models.readthedocs.io/en/latest/zeroshot.html>`_ model.
Contextualized Topic Models also support supervision (SuperCTM). You can read more about this on the `documentation <https://contextualized-topic-models.readthedocs.io/en/latest/introduction.html>`_.
.. image:: https://raw.githubusercontent.com/MilaNLProc/contextualized-topic-models/master/img/ctm_both.jpeg
:align: center
:width: 800px
We also have `Kitty <https://contextualized-topic-models.readthedocs.io/en/latest/kitty.html>`_: a utility you can use to do a simpler human-in-the-loop classification of your
documents. This can be very useful for document filtering. It also works in a cross-lingual setting, so
you might be able to filter documents in a language you don't know!
References
----------
If you find this useful you can cite the following papers :)
**ZeroShotTM**
::
@inproceedings{bianchi-etal-2021-cross,
title = "Cross-lingual Contextualized Topic Models with Zero-shot Learning",
author = "Bianchi, Federico and Terragni, Silvia and Hovy, Dirk and
Nozza, Debora and Fersini, Elisabetta",
booktitle = "Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume",
month = apr,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2021.eacl-main.143",
pages = "1676--1683",
}
**CombinedTM**
::
@inproceedings{bianchi-etal-2021-pre,
title = "Pre-training is a Hot Topic: Contextualized Document Embeddings Improve Topic Coherence",
author = "Bianchi, Federico and
Terragni, Silvia and
Hovy, Dirk",
booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)",
month = aug,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.acl-short.96",
doi = "10.18653/v1/2021.acl-short.96",
pages = "759--766",
}
Language-Specific and Multilingual
----------------------------------
Some of the examples below use a multilingual embedding model
:code:`paraphrase-multilingual-mpnet-base-v2`.
This means that the representations you are going to use are multilingual.
However, you might need broader coverage of languages, or just one specific language.
Refer to the documentation page to see how to choose a model for another language;
you can also check `SBERT`_ to find the right model to use.

Here you can read more about `language-specific and multilingual <https://contextualized-topic-models.readthedocs.io/en/latest/language.html>`_ models.
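For example, switching embedding model only requires changing the name you pass to TopicModelDataPreparation. A small sketch follows; the model name is just one multilingual example from the SBERT model list, and the contextual_size you pass to the CTM must match that model's embedding dimension.

.. code-block:: python

    from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation

    # example SBERT model; this one produces 512-dimensional embeddings,
    # so the CTM would need contextual_size=512 instead of 768
    qt = TopicModelDataPreparation("distiluse-base-multilingual-cased-v2")
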
Quick Overview
--------------
You should definitely take a look at the `documentation <https://contextualized-topic-models.readthedocs.io/en/latest/introduction.html>`_
to better understand how these topic models work.
Combined Topic Model
~~~~~~~~~~~~~~~~~~~~
Here is how you can use CombinedTM. This is a standard topic model that also uses contextualized embeddings. The good thing about CombinedTM is that it makes your topics much more coherent (see the paper https://arxiv.org/abs/2004.03974).
Setting n_components=50 specifies the number of topics.
.. code-block:: python
from contextualized_topic_models.models.ctm import CombinedTM
from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation
from contextualized_topic_models.utils.data_preparation import bert_embeddings_from_file
qt = TopicModelDataPreparation("all-mpnet-base-v2")
training_dataset = qt.fit(text_for_contextual=list_of_unpreprocessed_documents, text_for_bow=list_of_preprocessed_documents)
ctm = CombinedTM(bow_size=len(qt.vocab), contextual_size=768, n_components=50) # 50 topics
ctm.fit(training_dataset) # run the model
ctm.get_topics(2)
**Advanced Notes:** CombinedTM combines the BoW with SBERT, a process that seems to increase
the coherence of the predicted topics (https://arxiv.org/pdf/2004.03974.pdf).
Zero-Shot Topic Model
~~~~~~~~~~~~~~~~~~~~~
Our ZeroShotTM can be used for zero-shot topic modeling: it can handle words that were not seen during training.
More interestingly, this model can be used for cross-lingual topic modeling (see the next sections and the paper https://www.aclweb.org/anthology/2021.eacl-main.143)!
.. code-block:: python
from contextualized_topic_models.models.ctm import ZeroShotTM
from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation
from contextualized_topic_models.utils.data_preparation import bert_embeddings_from_file
text_for_contextual = [
"hello, this is unpreprocessed text you can give to the model",
"have fun with our topic model",
]
text_for_bow = [
"hello unpreprocessed give model",
"fun topic model",
]
qt = TopicModelDataPreparation("paraphrase-multilingual-mpnet-base-v2")
training_dataset = qt.fit(text_for_contextual=text_for_contextual, text_for_bow=text_for_bow)
ctm = ZeroShotTM(bow_size=len(qt.vocab), contextual_size=768, n_components=50)
ctm.fit(training_dataset) # run the model
ctm.get_topics(2)
As you can see, the high-level API to handle the text is pretty easy to use;
**text_for_contextual** should be used to pass to the model a list of documents that are not preprocessed.
Instead, to **text_for_bow** you should pass the preprocessed text used to build the BoW.
**Advanced Notes:** in this way, SBERT can use all the information in the text to generate the representations.
Using The Topic Models
----------------------
Getting The Topics
~~~~~~~~~~~~~~~~~~
Once the model is trained, it is very easy to get the topics!
.. code-block:: python
ctm.get_topics()
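You can also ask for the top-k words per topic by passing the number of words you want. A small sketch follows, assuming (as in the examples above) that get_topics returns a mapping from topic index to a list of words.

.. code-block:: python

    # top 5 words for each of the n_components topics
    topics = ctm.get_topics(5)

    # print each topic index with its most relevant words
    for topic_id, words in topics.items():
        print(topic_id, words)
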
Predicting Topics For Unseen Documents
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The **transform** method will take care of most things for you, for example the generation
of a corresponding BoW by considering only the words that the model has seen in training.
However, this comes with some bumps when dealing with the ZeroShotTM, as we will see in the next section.
You can, however, manually load the embeddings if you like (see the Advanced part of this documentation).
Mono-Lingual Topic Modeling
===========================
If you use **CombinedTM** you need to include the test text for the BoW:
.. code-block:: python
testing_dataset = qt.transform(text_for_contextual=testing_text_for_contextual, text_for_bow=testing_text_for_bow)
# n_samples: how many times to sample the distribution (see the docs)
ctm.get_doc_topic_distribution(testing_dataset, n_samples=20) # returns a (n_documents, n_topics) matrix with the topic distribution of each document
If you use **ZeroShotTM** you do not need to pass `testing_text_for_bow`: if you are using
a different set of test documents, this would create a BoW of a different size. Thus, the best
way to do this is to pass just the text that is going to be given as input to the contextual model:
.. code-block:: python
testing_dataset = qt.transform(text_for_contextual=testing_text_for_contextual)
# n_samples: how many times to sample the distribution (see the docs)
ctm.get_doc_topic_distribution(testing_dataset, n_samples=20)
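If you only need the most likely topic for each test document, you can take the argmax over the returned matrix; a small sketch using numpy:

.. code-block:: python

    import numpy as np

    # (n_documents, n_topics) matrix with the topic distribution of each document
    doc_topic_distribution = ctm.get_doc_topic_distribution(testing_dataset, n_samples=20)

    # index of the most probable topic for each test document
    most_likely_topics = np.argmax(doc_topic_distribution, axis=1)
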
Cross-Lingual Topic Modeling
============================
Once you have trained the ZeroShotTM model with multilingual embeddings,
you can use this simple pipeline to predict the topics for documents in a different language (as long as this language
is covered by **paraphrase-multilingual-mpnet-base-v2**).
.. code-block:: python
# here we have a Spanish document
testing_text_for_contextual = [
"hola, bienvenido",
]
# since we are doing multilingual topic modeling, we do not need the BoW in
# ZeroShotTM when doing cross-lingual experiments (it would not make sense to use
# a Spanish BoW when the model was trained with an English BoW)
testing_dataset = qt.transform(testing_text_for_contextual)
# n_samples: how many times to sample the distribution (see the docs)
ctm.get_doc_topic_distribution(testing_dataset, n_samples=20) # returns a (n_documents, n_topics) matrix with the topic distribution of each document
**Advanced Notes:** We do not need to pass the Spanish bag of words: the bags of words of the two languages would not be comparable! A BoW is still passed to the model for compatibility reasons, but you cannot take
the output of the model (i.e., the predicted BoW of the training language) and compare it with the one of the testing language.
More Advanced Stuff
-------------------
Preprocessing
~~~~~~~~~~~~~
Do you need a quick script to run the preprocessing pipeline? We got you covered! Load your documents
and then use our WhiteSpacePreprocessing class. It will automatically filter infrequent words and remove documents
that are empty after preprocessing. The preprocess method returns the preprocessed and the unpreprocessed documents.
We generally use the unpreprocessed ones for BERT and the preprocessed ones for the bag of words.
.. code-block:: python
from contextualized_topic_models.utils.preprocessing import WhiteSpacePreprocessing
documents = [line.strip() for line in open("unpreprocessed_documents.txt").readlines()]
sp = WhiteSpacePreprocessing(documents, "english")
preprocessed_documents, unpreprocessed_corpus, vocab, retained_indices = sp.preprocess()
Using Custom Embeddings with Kitty
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Do you have custom embeddings and want to use them for faster results? Just give them to Kitty!
.. code-block:: python
from contextualized_topic_models.models.kitty_classifier import Kitty
import numpy as np
# read the training data
training_data = list(map(lambda x : x.strip(), open("train_data").readlines()))
custom_embeddings = np.load('custom_embeddings.npy')
kt = Kitty()
kt.train(training_data, custom_embeddings=custom_embeddings, stopwords_list=["stopwords"])
print(kt.pretty_print_word_classes())
Note: custom embeddings must be numpy arrays.
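If you do not have precomputed embeddings yet, one simple way to build them is with sentence-transformers directly; a small sketch (the model name, toy corpus, and file name are just examples):

.. code-block:: python

    from sentence_transformers import SentenceTransformer
    import numpy as np

    # hypothetical toy corpus; replace with your training documents
    training_data = ["first document", "second document"]

    # encode returns a numpy array of shape (n_documents, embedding_dim)
    model = SentenceTransformer("all-mpnet-base-v2")
    custom_embeddings = model.encode(training_data)

    np.save("custom_embeddings.npy", custom_embeddings)
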
Development Team
----------------
* `Federico Bianchi`_ <f.bianchi@unibocconi.it> Bocconi University
* `Silvia Terragni`_ <s.terragni4@campus.unimib.it> University of Milan-Bicocca
* `Dirk Hovy`_ <dirk.hovy@unibocconi.it> Bocconi University
Software Details
----------------
* Free software: MIT license
* Documentation: https://contextualized-topic-models.readthedocs.io.
* Super big shout-out to `Stephen Carrow`_ for creating the awesome https://github.com/estebandito22/PyTorchAVITM package from which we constructed the foundations of this package. We are happy to redistribute this software again under the MIT License.
Credits
-------
This package was created with Cookiecutter_ and the `audreyr/cookiecutter-pypackage`_ project template.
To ease the use of the library we have also included the `rbo`_ package; all rights reserved to the author of that package.
Note
----
Remember that this is a research tool :)
.. _pytorch: https://pytorch.org/get-started/locally/
.. _Cookiecutter: https://github.com/audreyr/cookiecutter
.. _preprocessing: https://github.com/MilaNLProc/contextualized-topic-models#preprocessing
.. _cross-lingual-topic-modeling: https://github.com/MilaNLProc/contextualized-topic-models#cross-lingual-topic-modeling
.. _`audreyr/cookiecutter-pypackage`: https://github.com/audreyr/cookiecutter-pypackage
.. _`Stephen Carrow` : https://github.com/estebandito22
.. _`rbo` : https://github.com/dlukes/rbo
.. _Federico Bianchi: https://federicobianchi.io
.. _Silvia Terragni: https://silviatti.github.io/
.. _Dirk Hovy: https://dirkhovy.com/
.. _SBERT: https://www.sbert.net/docs/pretrained_models.html
.. _HuggingFace: https://huggingface.co/models
.. _UmBERTo: https://huggingface.co/Musixmatch/umberto-commoncrawl-cased-v1
.. _medium: https://fbvinid.medium.com/contextualized-topic-modeling-with-python-eacl2021-eacf6dfa576
=======
History
=======
2.2.2 (2021-11-09)
------------------
* Kitty now takes as input a stopword list instead of a language (from which it previously gathered the stopwords)
* solving a bug in the whitespace preprocessing function
* adding a new preprocessing function that supports passing the stopwords as a list
* deprecating whitespace preprocessing
* minor fixes to kitty API
* breaking change to the Kitty API, which now uses WhiteSpacePreprocessingStopwords
2.2.0 (2021-09-20)
------------------
* introducing kitty
* improving the documentation a lot
2.1.2 (2021-09-03)
------------------
* patching `Issue 38 <https://github.com/MilaNLProc/contextualized-topic-models/issues/38>`_
* improvements `PR 80 <https://github.com/MilaNLProc/contextualized-topic-models/pull/80>`_
2.1.0 (2021-07-16)
------------------
* new model introduced SuperCTM
* new model introduced β-CTM
2.0.0 (2021-xx-xx)
------------------
* warning, breaking changes were introduced:

  * the order of the parameters in CTMDataset was changed (contextual embeddings now come first)
  * CTM now takes as input bow_size and contextual_size instead of input_size and bert_size
  * the names of the parameters in the dataset were changed

* introduced early stopping
* introduced visualization with pyldavis
1.8.2 (2021-02-08)
------------------
* removed constraint over pytorch version. This should solve problems for Windows users
1.8.0 (2021-01-11)
------------------
* novel way to handle text, we now allow for an easy usage of training and testing data
* better visualization of the training progress and of the sampling process
* removed old stuff from the documentation
1.7.1 (2020-12-17)
------------------
* some minor updates to the documentation
* adding a new method to visualize the topic using a wordcloud
* save and load will now generate a warning since the feature has not been tested
1.7.0 (2020-12-10)
------------------
* adding a new and much simpler way to handle text for topic modeling
1.6.0 (2020-11-03)
------------------
* introducing the two different classes for ZeroShotTM and CombinedTM
* deprecating CTM class in favor of ZeroShotTM and CombinedTM
1.5.3 (2020-11-03)
------------------
* adding support for Windows encoding by defaulting file load to UTF-8
1.5.2 (2020-11-03)
------------------
* updated sentence-transformers version to 0.3.6
* beta support for model saving and loading
* new evaluation metrics based on coherence
1.5.0 (2020-09-14)
------------------
* Introduced a method to predict the topics for a set of documents (supports multiple sampling to reduce variation)
* Adding some features to bert embeddings creation like increased batch size and progress bar
* Supporting training directly from lists without the need to deal with files
* Adding a simple quick preprocessing pipeline
1.4.3 (2020-09-03)
------------------
* Updating sentence-transformers package to avoid errors
1.4.2 (2020-08-04)
------------------
* Changed the encoding on file load for the SBERT embedding function
1.4.1 (2020-08-04)
------------------
* Fixed bug over sparse matrices
1.4.0 (2020-08-01)
------------------
* New feature handling sparse bow for optimized processing
* New method to return topic distributions for words
1.0.0 (2020-04-05)
------------------
* Released models with the main features implemented
0.1.0 (2020-04-04)
------------------
* First release on PyPI.