- Name: labelled-topic-clustering
- Version: 1.1.0
- Home page: https://github.com/tomhaydn/labelled-topic-clustering
- Summary: Super Simple Labelled Topic Clustering
- Upload time: 2024-06-15 07:46:57
- Author: Tom Haydn
- Requires Python: >=3.9
- License: MIT
- Keywords: sentence, topic, clustering, labelling, cosine similarity, lda, huggingface, pytorch, spacy, nlp, deep learning

# Labelled Topic Clustering

Labelled Topic Clustering does what the name suggests: feed it an array of sentences and it will group them into clusters with human-readable names.

The aim of this project is to make it **as easy as possible** to:

1. Generate topic clusters on a text dataset using a cosine-similarity approach.
2. Get human-readable labels for those clusters.

![labelled topic clustering approach](https://github.com/tomhaydn/labelled-topic-clustering/blob/main/docs/diagram-1.png)
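To give a feel for what a cosine-similarity approach means, here is a minimal sketch on toy 2-D vectors. It is not the library's implementation: the greedy grouping strategy, the `greedy_clusters` helper, and the `0.8` threshold are all assumptions made for illustration, and real sentence embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def greedy_clusters(embeddings, threshold=0.8):
    """Hypothetical helper: put each vector in the first cluster whose
    seed (first member) is similar enough, else start a new cluster."""
    clusters = []
    for i, emb in enumerate(embeddings):
        for cluster in clusters:
            seed = embeddings[cluster[0]]
            if cosine_similarity(emb, seed) >= threshold:
                cluster.append(i)
                break
        else:
            # no existing cluster was close enough
            clusters.append([i])
    return clusters


# Toy stand-ins for sentence embeddings: three point "right", three point "up".
toy_embeddings = [
    [1.0, 0.1],
    [0.9, 0.2],
    [0.95, 0.0],
    [0.1, 1.0],
    [0.0, 0.9],
    [0.2, 0.95],
]

print(greedy_clusters(toy_embeddings))  # [[0, 1, 2], [3, 4, 5]]
```

Vectors pointing in nearly the same direction score close to 1 and end up together, which is the intuition behind grouping sentences by embedding similarity.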

## Installation

To use the `TopicClusterer` class, install the package from PyPI with pip:

`pip install labelled-topic-clustering`

## Usage

1. Initialize the TopicClusterer:

```
from topic_clusterer import TopicClusterer

hf_token = "your_hugging_face_token"
# This can be any sentence-transformer model; anecdotally, I've found this one works best.
model = "sentence-transformers/all-mpnet-base-v2"

clusterer = TopicClusterer(hf_token, model, debug=True)
```
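Rather than hard-coding the token in source, you may prefer to read it from the environment. The sketch below is one way to do that; the `load_hf_token` helper and the `HF_TOKEN` variable name are assumptions for illustration, not part of this package.

```python
import os


def load_hf_token(env_var="HF_TOKEN"):
    """Hypothetical helper: read the Hugging Face token from an
    environment variable and fail loudly if it is missing."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Hugging Face token"
        )
    return token


# Usage with the constructor shown above:
# hf_token = load_hf_token()
# clusterer = TopicClusterer(hf_token, model, debug=True)
```

This keeps the token out of version control and makes a missing token fail at startup instead of at the first request.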

2. Get clusters:

```
sentences = [
    "the weather is great",
    "This is some perfect weather",
    "we're having some really good weather",
    "my dog ate my homework",
    "why do dogs love homework?",
    "dog keeps devouring my homework"
]

clusters = clusterer.get_clusters(sentences)
```

#### Example Output

```
[[0, 1, 2], [3, 4, 5]]
```

`clusters` is a 2D array in which each inner array is one cluster, holding the indices of its sentences in the original dataset.
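Because the clusters hold indices rather than text, you will usually want to map them back to the sentences. A minimal sketch, using the example output above as a literal; the `clusters_to_sentences` helper is hypothetical and not part of the package:

```python
sentences = [
    "the weather is great",
    "This is some perfect weather",
    "we're having some really good weather",
    "my dog ate my homework",
    "why do dogs love homework?",
    "dog keeps devouring my homework",
]

# Example output of get_clusters, copied in so this runs without a model.
clusters = [[0, 1, 2], [3, 4, 5]]


def clusters_to_sentences(clusters, sentences):
    """Replace each sentence index with the sentence it refers to."""
    return [[sentences[i] for i in cluster] for cluster in clusters]


for number, group in enumerate(clusters_to_sentences(clusters, sentences), start=1):
    print(f"Cluster {number}:")
    for sentence in group:
        print(f"  - {sentence}")
```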

3. Get labels from clusters:

```
clusters_labelled = clusterer.get_labels_from_clusters(clusters, sentences)
```

#### Example Output

```
{'Weather great perfect': [0, 1, 2], 'Dog eat homework': [3, 4, 5]}
```

`clusters_labelled` is a dictionary where the keys are topic labels, and the values are arrays of sentence indices corresponding to the original dataset.
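If you need the label for a given sentence rather than the sentences for a given label, the dictionary can be inverted. This is a small sketch that uses the example output above as a literal; `label_per_sentence` is a hypothetical helper, not part of the package.

```python
sentences = [
    "the weather is great",
    "This is some perfect weather",
    "we're having some really good weather",
    "my dog ate my homework",
    "why do dogs love homework?",
    "dog keeps devouring my homework",
]

# Example output of get_labels_from_clusters, copied in as a literal.
clusters_labelled = {
    "Weather great perfect": [0, 1, 2],
    "Dog eat homework": [3, 4, 5],
}


def label_per_sentence(clusters_labelled):
    """Invert {label: [indices]} into {index: label}."""
    return {
        index: label
        for label, indices in clusters_labelled.items()
        for index in indices
    }


labels = label_per_sentence(clusters_labelled)
for index, sentence in enumerate(sentences):
    # Sentences that belong to no cluster get a placeholder label.
    print(f"{labels.get(index, '(unclustered)')}: {sentence}")
```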

> You can also run both steps in a single call:

```
# Get clusters with labels
labelled_clusters = clusterer.get_clusters_with_labels(sentences)
print(labelled_clusters)
```

## Contributing

See [CONTRIBUTING.md](./CONTRIBUTING.md) for information on development and contributing.

## Looking Forward

I have done virtually no performance testing: I wrote this for a side project, and it did everything I needed there.

Some ideas to work on:

- Allow custom tokenizers
- Benchmark performance on large datasets
- Allow for feature extraction locally



            
