# Labelled Topic Clustering
Labelled Topic Clustering does what the name suggests: feed it an array of sentences and it will group them into clusters with human-readable names.
The aim of this project is to make it as **easy as possible** to:
1. Generate topic clusters on a text dataset using a cosine-similarity approach.
2. Get human-readable labels for those clusters.
![labelled topic clustering approach](https://github.com/tomhaydn/labelled-topic-clustering/blob/main/docs/diagram-1.png)
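To give a feel for what a cosine-similarity approach means, here is a minimal sketch that is not taken from this package's source. It assumes each sentence has already been turned into an embedding vector, and it uses a simple greedy rule with a hypothetical `threshold` parameter; the function names `cosine_similarity` and `greedy_clusters` are made up for illustration, and the real clustering logic may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def greedy_clusters(vectors, threshold=0.8):
    """Put each vector in the first cluster whose seed (first member)
    is at least `threshold` similar; otherwise start a new cluster.

    Illustrative only: the threshold and the greedy rule are assumptions.
    """
    clusters = []  # each cluster is a list of indices into `vectors`
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            seed = vectors[cluster[0]]
            if cosine_similarity(vec, seed) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


# Toy 2-dimensional "embeddings": two point roughly along x, two along y.
toy_vectors = [
    [1.0, 0.1],
    [0.9, 0.2],
    [0.1, 1.0],
    [0.2, 0.9],
]
toy_clusters = greedy_clusters(toy_vectors)
print(toy_clusters)  # [[0, 1], [2, 3]]
```

Real sentence embeddings have hundreds of dimensions, but the idea is the same: sentences whose vectors point in a similar direction end up in the same cluster.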
## Installation
To use the `TopicClusterer` class, install the package from PyPI. It requires Python 3.9 or later. Assuming you have a package manager like pip, you can install it as follows:
`pip install labelled-topic-clustering`
## Usage
1. Initialize the TopicClusterer:
```
from topic_clusterer import TopicClusterer
hf_token = "your_hugging_face_token"
# This can be any sentence-transformer model; anecdotally, I've found this one works best.
model = "sentence-transformers/all-mpnet-base-v2"
clusterer = TopicClusterer(hf_token, model, debug=True)
```
2. Get clusters:
```
sentences = [
    "the weather is great",
    "This is some perfect weather",
    "we're having some really good weather",
    "my dog ate my homework",
    "why do dogs love homework?",
    "dog keeps devouring my homework",
]
clusters = clusterer.get_clusters(sentences)
```
#### Example Output
```
[[0, 1, 2], [3, 4, 5]]
```
`clusters` is a 2D array: each inner array is one cluster, and its values are the indices of that cluster's sentences in the original dataset.
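Because the clusters hold indices rather than text, you will usually want to map them back to the sentences. The sketch below does that with plain Python; the helper name `sentences_by_cluster` is hypothetical and not part of the package, and the `clusters` value is copied from the example output above rather than computed.

```python
sentences = [
    "the weather is great",
    "This is some perfect weather",
    "we're having some really good weather",
    "my dog ate my homework",
    "why do dogs love homework?",
    "dog keeps devouring my homework",
]

# Copied from the example output; in practice this comes from get_clusters().
clusters = [[0, 1, 2], [3, 4, 5]]


def sentences_by_cluster(clusters, sentences):
    """Replace each index in each cluster with the sentence it points to."""
    return [[sentences[i] for i in cluster] for cluster in clusters]


grouped = sentences_by_cluster(clusters, sentences)
for number, group in enumerate(grouped):
    print(f"Cluster {number}:")
    for sentence in group:
        print(f"  - {sentence}")
```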
3. Get labels from clusters:
```
clusters_labelled = clusterer.get_labels_from_clusters(clusters, sentences)
```
#### Example Output
```
{'Weather great perfect': [0, 1, 2], 'Dog eat homework': [3, 4, 5]}
```
`clusters_labelled` is a dictionary where the keys are topic labels, and the values are arrays of sentence indices corresponding to the original dataset.
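If you would rather have one label per sentence (for example, to add a column to a table), the dictionary can be inverted. This is a small sketch in plain Python; `label_for_each_sentence` is a hypothetical helper, not a package function, and the dictionary is copied from the example output above.

```python
# Copied from the example output; in practice this comes from
# get_labels_from_clusters().
clusters_labelled = {
    "Weather great perfect": [0, 1, 2],
    "Dog eat homework": [3, 4, 5],
}


def label_for_each_sentence(clusters_labelled, n_sentences):
    """Return a list where position i holds the label of sentence i.

    Sentences that appear in no cluster keep the value None.
    """
    labels = [None] * n_sentences
    for label, indices in clusters_labelled.items():
        for i in indices:
            labels[i] = label
    return labels


labels = label_for_each_sentence(clusters_labelled, n_sentences=6)
for index, label in enumerate(labels):
    print(index, label)
```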
> You can also just get it all at once:
```
# Get clusters with labels
labelled_clusters = clusterer.get_clusters_with_labels(sentences)
print(labelled_clusters)
```
# Contributing
See [CONTRIBUTING.md](./CONTRIBUTING.md) for all the information on development and contributing.
# Looking Forward
I have done virtually no performance testing: I wrote this once for a side project, and it did everything I needed there.
Some ideas to work on:
- Allow custom tokenizers
- Benchmark performance on large datasets
- Allow for feature extraction locally
Raw data
{
"_id": null,
"home_page": "https://github.com/tomhaydn/labelled-topic-clustering",
"name": "labelled-topic-clustering",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": null,
"keywords": "Sentence, Topic, Clustering, Labelling, Cosine Similarity, LDA, HuggingFace, pyTorch, Spacy, NLP, Deep Learning",
"author": "Tom Haydn",
"author_email": null,
"download_url": "https://files.pythonhosted.org/packages/ef/8a/adcaadbe649857507a04c85131217ba71de7a70e89660a0653738f106089/labelled-topic-clustering-1.1.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Super Simple Labelled Topic Clustering",
"version": "1.1.0",
"project_urls": {
"Homepage": "https://github.com/tomhaydn/labelled-topic-clustering"
},
"split_keywords": [
"sentence",
" topic",
" clustering",
" labelling",
" cosine similarity",
" lda",
" huggingface",
" pytorch",
" spacy",
" nlp",
" deep learning"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "a0ef0bed82ea0f1b5501a18c8b0f6b150d4330d9d81911fd97884af43d58dbd4",
"md5": "f0a24747f27ae885fd722168f8cd94fb",
"sha256": "8ec3c07e74b03cd2aa3d4d0cbf50e41ce4aac85c36b7398cffe600732664e434"
},
"downloads": -1,
"filename": "labelled_topic_clustering-1.1.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "f0a24747f27ae885fd722168f8cd94fb",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9",
"size": 8313,
"upload_time": "2024-06-15T07:46:56",
"upload_time_iso_8601": "2024-06-15T07:46:56.469184Z",
"url": "https://files.pythonhosted.org/packages/a0/ef/0bed82ea0f1b5501a18c8b0f6b150d4330d9d81911fd97884af43d58dbd4/labelled_topic_clustering-1.1.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "ef8aadcaadbe649857507a04c85131217ba71de7a70e89660a0653738f106089",
"md5": "d513a679c4cbc5993c253f2a88cad197",
"sha256": "2962ff0ff63933cd7cea75787f670e41e8d0a45df629e97c2f70ba47031c4c1f"
},
"downloads": -1,
"filename": "labelled-topic-clustering-1.1.0.tar.gz",
"has_sig": false,
"md5_digest": "d513a679c4cbc5993c253f2a88cad197",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9",
"size": 7924,
"upload_time": "2024-06-15T07:46:57",
"upload_time_iso_8601": "2024-06-15T07:46:57.868891Z",
"url": "https://files.pythonhosted.org/packages/ef/8a/adcaadbe649857507a04c85131217ba71de7a70e89660a0653738f106089/labelled-topic-clustering-1.1.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-06-15 07:46:57",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "tomhaydn",
"github_project": "labelled-topic-clustering",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "labelled-topic-clustering"
}