bertsenclu


Name: bertsenclu
Version: 0.1.8
Home page: https://github.com/JohnTailor/BertSenClu
Summary: (Bert-)SenClu is a topic modeling technique that leverages sentence transformers to compute topic models.
Upload time: 2023-02-08 08:25:48
Author: Johannes Schneider
Requires Python: >=3.7
License: LICENSE
Keywords: nlp bert topic modeling sentence embeddings

# Bert-SenClu

**(Bert-)SenClu** is a topic modeling technique that leverages sentence transformers to compute topic models. For one, it differs from other topic models by using sentences as the unit of analysis, i.e., a sentence is assigned to a topic rather than a word (as in LDA or TKM) or an entire document (as in BERTopic).
Methods that treat documents as the unit can be faster, but they assign the entire document to a single topic. This differs from most classical topic models, which produce a document-topic distribution, i.e., a document can contain multiple topics. Our topic model also does not perform dimensionality reduction on the embeddings. Inference is based on expectation-maximization, similar to TKM (see [**TKM Paper**](https://arxiv.org/abs/1710.02650) and [**TKM Code**](https://github.com/JohnTailor/tkm)).
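
To illustrate how sentence-level assignments can give a document several topics, here is a minimal sketch. It assumes each sentence of a document has already been assigned a topic id and simply counts them; the function name is hypothetical, and the library itself may weight or smooth these values differently (for example through its "alpha" parameter).

```python
from collections import Counter


def doc_topic_distribution(sentence_topics, n_topics):
    """Turn the per-sentence topic ids of one document into a topic distribution."""
    total = len(sentence_topics)
    if total == 0:
        return [0.0] * n_topics
    counts = Counter(sentence_topics)
    # Share of sentences assigned to each topic
    return [counts.get(k, 0) / total for k in range(n_topics)]


# Hypothetical document with four sentences: three on topic 2, one on topic 0
dist = doc_topic_distribution([2, 2, 0, 2], n_topics=4)
print(dist)  # [0.25, 0.0, 0.75, 0.0]
```

A document-level method would have to put this whole document into topic 2, whereas the sentence-level view keeps the 25% share of topic 0.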

For an in-depth overview of the features of **Bert-SenClu**,
you can check the [**repository**](https://github.com/JohnTailor/BertSenClu/) or [**the paper**](https://arxiv.org/abs/2302.03106).


<img src="https://github.com/JohnTailor/BertSenClu/blob/main/images/comp.png" width="60%" height="60%" align="center" />


## Installation

Installation (with sentence-transformers) can be done via [PyPI](https://pypi.org/project/bertsenclu/):

```bash
pip install bertsenclu
```
    
## Quick Start
We start by extracting topics from the 20 newsgroups dataset containing English documents:

```python
from sklearn.datasets import fetch_20newsgroups
import numpy as np
from bertSenClu import senClu

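# Load the raw texts; .data is the list of document strings in the returned Bunch
docs = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes')).data

folder = "modelOutputs/"  # precomputed sentences and embeddings are stored here for reuse
topic_model = senClu.SenClu()
topics, probs = topic_model.fit_transform(docs, nTopics=20, loadAndStoreInFolder=folder)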


```

After generating topics and their probabilities, we can save outputs:

```python
>>> topic_model.saveOutputs(folder)  # Save outputs in folder, i.e., a CSV file and visualizations
```

and look at the topics:

```python


>>> for it, t in enumerate(topics):
...     print("Topic", it, t[:10])
...
Topic 0 [encryption, key, ripem, privacy, rsa, clipper, encrypted, escrow, nsa, secure]
Topic 1 [government, militia, amendment, federal, law, constitution, firearm, regulated, administration, clinton]
Topic 2 [launch, satellite, lunar, space, orbit, mission, spacecraft, larson, probe, shuttle]
Topic 3 [patient, hiv, disease, infection, candida, vitamin, antibiotic, diet, symptom, smokeless]
...

 ```  

We can also use an interactive tool for visualization and topic analysis that runs in a browser. First, **download** [**visual.py**](https://github.com/JohnTailor/BertSenClu/blob/main/visual.py) from the repo. It can then be called from the command line with the folder containing the topic modeling outputs:

```console
streamlit run visual.py -- --folder "modelOutputs/"
```

It can also be called from Python:

```python
import subprocess

folder = "modelOutputs/"
# Passing the arguments as a list avoids shell quoting problems with the folder name
subprocess.run(["streamlit", "run", "visual.py", "--", "--folder", folder])
```

The interactive visualization looks like this:

<img src="https://github.com/JohnTailor/BertSenClu/blob/main/images/visual.PNG" width="100%" height="100%" align="center" />

If you scroll down (or look into the folder where you stored the outputs), you also see topic relationship information, i.e., a t-SNE visualization and a hierarchical clustering of topics:

<img src="https://github.com/JohnTailor/BertSenClu/blob/main/images/topic_visual_hierarchy.png" width="60%" height="60%" align="center" />
<img src="https://github.com/JohnTailor/BertSenClu/blob/main/images/topic_visual_tsne.png" width="60%" height="60%" align="center" />


We can also access the outputs directly by calling functions of the model:
```python

>>> print("Top 10 words with scores for topic 0", topic_model.getTopicsWitchScores()[0][:10])
Top 10 words with scores for topic 0 [('encryption', 11.269135), ('key', 11.173454), ('ripem', 10.151058), ('privacy', 10.070835), ('rsa', 7.3271174), ('clipper', 6.8211393), ('encrypted', 6.567956), ('escrow', 5.993511), ('nsa', 5.853071), ('secure', 5.4898496)]

>>> print("Distribution of topics for document 0", np.round(topic_model.getTopicDocumentDistribution()[0], 3))
Distribution of topics for document 0 [0. 0. 1. ... 0. 0. 0.]

>>> print("Distribution of topics", np.round(topic_model.getTopicDistribution(), 3))
Distribution of topics [0.022 0.061 0.024 0.026 0.067 0.079 0.155 0.043 0.061 0.039 0.031 0.198
 0.018 0.033 0.033 0.012 0.016 0.029 0.033 0.02 ]

>>> print("First 4 sentences for top doc for topic 0", topic_model.getTopDocsPerTopic()[0][0][:4])
First 4 sentences for top doc for topic 0 (['[...]>\n', '[...]>\n\n', "If the data isn't there when the warrant comes, you effectively have\n", 'secure crypto.  '], 1.0, 8607)

>>> print("Top 3 sentences for topic 1 ", topic_model.getTopSentencesPerTopic()[1][:3])
Top 3 sentences for topic 1  [('enforcement.\n\n    ', 0.22597079), ('Enforcement.  ', 0.22597079), ('to the Constitution.\n\n   ', 0.22434217)]
# The sentences show that the sentence partitioning algorithm used is not the best (it splits based on carriage returns).
# Still, the topic modeling results are good. It is also easy to use another algorithm or to preprocess the data.

```


## How it works
The steps for topic modeling with **Bert-SenClu** are:
<ol>
  <li>Splitting docs into sentences</li>  
  <li>Embedding the sentences using pretrained sentence-transformers</li>
  <li>Running the topic modeling</li>
  <li>Computing topic-word distributions based on sentence to topic assignments</li>
</ol>
The outcomes of the first two steps are stored in a user-provided folder if the parameter "loadAndStoreInFolder" is set explicitly in "fit_transform". By default this is not the case (i.e., "loadAndStoreInFolder=None"). **Bert-SenClu** can reuse the stored sentence partitionings and embeddings, which speeds up re-running the topic modeling, e.g., if you want to change the number of topics. However, if you alter the data, you need to delete the folder, i.e., the files with the precomputed sentence embeddings and partitionings.
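
The reuse behaviour described above follows a common load-or-compute pattern. The sketch below is not the library's code: the helper `load_or_compute`, the file name `embeddings.pkl`, and the stand-in `expensive_step` are all hypothetical. It only illustrates why a stale folder must be deleted when the data changes.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_compute(folder, name, compute):
    """Return the result cached under folder/name, computing and storing it if missing."""
    path = Path(folder) / name
    if path.exists():
        with path.open("rb") as f:
            return pickle.load(f)
    result = compute()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(result, f)
    return result


calls = []


def expensive_step():
    # Stand-in for sentence splitting or embedding; records each real computation
    calls.append(1)
    return [[0.1, 0.2], [0.3, 0.4]]


with tempfile.TemporaryDirectory() as folder:
    first = load_or_compute(folder, "embeddings.pkl", expensive_step)
    second = load_or_compute(folder, "embeddings.pkl", expensive_step)

print(len(calls), first == second)  # 1 True
```

Because the cache in this sketch is looked up by file name only, it cannot notice that the input documents changed, which is why removing the folder is the safe way to force recomputation.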
 
You can change each algorithm in these steps, in particular the algorithm for sentence partitioning and the pre-trained sentence embedder. As the example showed, the sentence partitioning algorithm used there is not well suited to the newsgroup dataset, but the overall result is still good.
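
As one possible preprocessing step, the following sketch splits text on sentence-ending punctuation instead of line breaks. It is a plain regex heuristic written for illustration, not the partitioning used by the library, and how a custom splitter is passed to the model is not shown here because it is not documented in this text. The second sentence of the sample is made up.

```python
import re


def split_sentences(text):
    """Split on '.', '!' or '?' followed by whitespace; line breaks count as spaces."""
    flat = re.sub(r"\s+", " ", text).strip()
    if not flat:
        return []
    return re.split(r"(?<=[.!?])\s+", flat)


sample = (
    "If the data isn't there when the warrant comes, you effectively have\n"
    "secure crypto.  That is the point."
)
for sentence in split_sentences(sample):
    print(repr(sentence))
```

With this heuristic, the line break inside the first sentence no longer cuts it in two. Abbreviations such as "e.g." would still be split incorrectly, so a dedicated sentence tokenizer may be a better choice for real data.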

The main function "fit_transform" has a hyperparameter "alpha" (similar to other models like LDA), which guides the algorithm on how many topics a document should contain. Setting it to 0 means that a document likely has few topics. Setting it to 1 (or larger) means that a document, especially a longer one, is more likely to have many topics. As a default, you can use 0.5/sqrt(nTopics).
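
To give a feeling for the suggested default, this small sketch evaluates 0.5/sqrt(nTopics) for a few topic counts. The helper name `default_alpha` is hypothetical; the resulting value would be passed as "alpha" to "fit_transform", whose exact signature is not verified here.

```python
import math


def default_alpha(n_topics):
    """Suggested default from the text above: 0.5 / sqrt(nTopics)."""
    if n_topics <= 0:
        raise ValueError("n_topics must be positive")
    return 0.5 / math.sqrt(n_topics)


for n in (10, 20, 100):
    print(n, round(default_alpha(n), 4))
# 10 0.1581
# 20 0.1118
# 100 0.05
```

The default shrinks as the number of topics grows, so with many topics each document is nudged towards using only a few of them.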


## Citation
To cite the [Bert-SenClu Paper](https://arxiv.org/abs/2302.03106), please use the following bibtex reference:

```bibtex
@article{schneider23,
  title={Efficient and Flexible Topic Modeling using Pretrained Embeddings and Bag of Sentences},
  author={Schneider, Johannes},
  journal={arXiv preprint arXiv:2302.03106},
  year={2023}
}
```

            
