leet-topic


- Name: leet-topic
- Version: 0.0.11
- Summary: A new transformer-based topic modeling library.
- Upload time: 2024-01-09 18:45:01
- Author: WJB Mattingly

[![PyPI - PyPi](https://img.shields.io/pypi/v/leet-topic)](https://pypi.org/project/leet-topic/)

![Leet Topic Logo](https://github.com/wjbmattingly/LeetTopic/raw/main/images/LeeTopic.png)

LeetTopic builds upon [Top2Vec](https://github.com/ddangelov/Top2Vec), [BERTopic](https://github.com/MaartenGr/BERTopic), and other transformer-based topic modeling Python libraries. Unlike BERTopic and Top2Vec, LeetTopic lets users control the degree to which outliers are resolved into neighboring topics.

It also lets you turn any DataFrame into a [Bokeh](https://bokeh.org/) application for exploring your documents and topics. As of 0.0.10, LeetTopic can also generate an [Annoy](https://github.com/spotify/annoy) Index as part of the LeetTopic pipeline, which lets users query their data.

# Installation

```shell
pip install leet-topic
```

# Parameters
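- df => a pandas DataFrame that contains the documents you want to model
- document_field => the name of the DataFrame column that holds your documents
- html_filename => the filename of the generated Bokeh application
- extra_fields => a list of extra columns to include in the Bokeh application
- max_distance => the maximum distance between an outlier document and its nearest topic centroid at which the outlier is still assigned to that topic
- spacy_model => the spaCy model used for lemmatization (see Multilingual Support)
- umap_params => a dictionary of parameters passed to UMAP (see Custom UMAP and HDBScan Parameters)
- hdbscan_params => a dictionary of parameters passed to HDBScan (see Custom UMAP and HDBScan Parameters)
- build_annoy => set to True to also return an Annoy Index (see Create an Annoy Index)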

# Usage

```python
import pandas as pd
from leet_topic import leet_topic

df = pd.read_json("data/vol7.json")
leet_df, topic_data = leet_topic.LeetTopic(df,
                                          document_field="descriptions",
                                          html_filename="demo.html",
                                          extra_fields=["names", "hdbscan_labels"],
                                          max_distance=.5)
```

## Multilingual Support
With LeetTopic, you can work with texts in any language supported by spaCy for lemmatization and any model from HuggingFace via Sentence Transformers.

Here is an example that works with Croatian text:

```python
import pandas as pd
from leet_topic import leet_topic

df = pd.DataFrame(["Bok. Kako ste?", "Drago mi je"]*20, columns=["text"])
leet_df, topic_data = leet_topic.LeetTopic(df,
                                          document_field="text",
                                          html_filename="demo.html",
                                          extra_fields=["hdbscan_labels"],
                                          spacy_model="hr_core_news_sm",
                                          max_distance=.5)
```

## Custom UMAP and HDBScan Parameters
It is often necessary to control how your embeddings are flattened with UMAP and clustered with HDBScan. As of 0.0.9, you can control these parameters with dictionaries.

```python
import pandas as pd
from leet_topic import leet_topic

df = pd.read_json("data/vol7.json")
leet_df, topic_data = leet_topic.LeetTopic(df,
                                          document_field="descriptions",
                                          html_filename="demo.html",
                                          extra_fields=["names", "hdbscan_labels"],
                                          umap_params={"n_neighbors": 15, "min_dist": 0.01, "metric": 'correlation'},
                                          hdbscan_params={"min_samples": 10, "min_cluster_size": 5},
                                          max_distance=.5)
```

## Create an Annoy Index
As of 0.0.10, LeetTopic can also return an Annoy Index when you pass `build_annoy=True`.

```python
import pandas as pd
from leet_topic import leet_topic

df = pd.read_json("data/vol7.json")
leet_df, topic_data, annoy_index = leet_topic.LeetTopic(df,
                                          document_field="descriptions",
                                          html_filename="demo.html",
                                          build_annoy=True)
```

With the Annoy Index, you can build a simple semantic search engine: encode a new text with the same model that produced the indexed embeddings, then query the index for its nearest neighbors.

```python
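from sentence_transformers import SentenceTransformer

# Assumes `df` and `annoy_index` come from the previous example.
# Use the same model that produced the indexed embeddings.
model = SentenceTransformer('all-MiniLM-L6-v2')

# Encode the query text into the same embedding space.
emb = model.encode("An individual who was arrested.")

# Get the positions of the 10 nearest documents in the index.
res = annoy_index.get_nns_by_vector(emb, 10)

# Look up the matching documents in the original DataFrame.
print(df.iloc[res].descriptions.tolist())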
```
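To make the query step more concrete, here is a minimal sketch of what a nearest-neighbor lookup does, written as an exact brute-force search in NumPy. It is not how Annoy works internally (Annoy builds an approximate index), and the use of cosine similarity is an assumption; the metric LeetTopic configures for its index is not stated here. The function `nearest_neighbors` and the toy vectors are hypothetical.

```python
import numpy as np


def nearest_neighbors(query, embeddings, n):
    """Return the row positions of the n embeddings most similar to query.

    Similarity is cosine similarity (an assumption for this sketch).
    """
    embeddings = np.asarray(embeddings, dtype=float)
    query = np.asarray(query, dtype=float)
    # Normalize so that the dot product equals cosine similarity.
    emb_norm = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    query_norm = query / np.linalg.norm(query)
    similarities = emb_norm @ query_norm
    # Sort from most to least similar and keep the first n positions.
    return np.argsort(-similarities)[:n].tolist()


# Toy 2-dimensional "embeddings"; real sentence embeddings have hundreds of dimensions.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
toy_res = nearest_neighbors([1.0, 0.05], toy_embeddings, 2)
print(toy_res)  # [0, 1]
```

As with `get_nns_by_vector`, the result is a list of row positions, which is why it can be passed to `df.iloc`.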


# Outputs
The code above generates a new DataFrame with the UMAP projection (x, y), hdbscan_labels, leet_labels, and the top-n words for each document. It also returns data about each topic, including the central point (centroid) of each topic, the documents assigned to it, and the top-n words associated with it.

Finally, LeetTopic writes an HTML file containing a self-contained Bokeh application, like the one shown in the image below.

![demo](https://github.com/wjbmattingly/LeetTopic/raw/main/images/demo-new.JPG)
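Because the output DataFrame keeps both hdbscan_labels and leet_labels, you can compare the two columns to see which documents were resolved. The sketch below uses a small hand-built DataFrame as a stand-in for the real output; the rows are invented for illustration, and it assumes that unresolved outliers keep the label -1 in leet_labels.

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame returned by LeetTopic.
leet_df = pd.DataFrame({
    "descriptions": ["doc a", "doc b", "doc c", "doc d"],
    "hdbscan_labels": [0, -1, 1, -1],
    "leet_labels": [0, 0, 1, -1],
})

# Outliers that were resolved into a neighboring topic.
reassigned = leet_df[(leet_df["hdbscan_labels"] == -1) & (leet_df["leet_labels"] != -1)]

# Outliers that stayed outliers (assumed to keep the label -1).
still_outliers = leet_df[leet_df["leet_labels"] == -1]

print(reassigned["descriptions"].tolist())      # ['doc b']
print(still_outliers["descriptions"].tolist())  # ['doc d']
```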

# Steps

LeetTopic takes an input DataFrame and converts the document field (the texts to model) into embeddings via a transformer model. Next, UMAP is used to reduce the embeddings to 2 dimensions. HDBScan is then used to assign documents to topics. As with BERTopic and Top2Vec, there are many outliers at this stage (documents assigned to topic -1).

LeetTopic, like Top2Vec, then calculates the centroid of each topic based on the HDBScan labels while ignoring topic -1 (the outliers). Next, outlier documents are assigned to the nearest topic centroid. Unlike Top2Vec, LeetTopic lets the user set a maximum distance, so that outliers that lie too far from every topic centroid are not assigned to a topic. At the same time, the output DataFrame keeps the original HDBScan topics, so users know whether a document was originally an outlier.
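The outlier step described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not LeetTopic's actual implementation: the function name `resolve_outliers`, the use of Euclidean distance on the 2-dimensional projection, and the `<=` comparison against `max_distance` are all assumptions.

```python
import numpy as np


def resolve_outliers(points, labels, max_distance):
    """Reassign outliers (label -1) to the nearest topic centroid.

    An outlier is only reassigned when its Euclidean distance to the
    nearest centroid is at most max_distance (assumed comparison).
    """
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    new_labels = labels.copy()

    # Centroids are computed from clustered points only; topic -1 is ignored.
    topic_ids = np.unique(labels[labels != -1])
    if topic_ids.size == 0:
        return new_labels
    centroids = np.array([points[labels == t].mean(axis=0) for t in topic_ids])

    for i in np.where(labels == -1)[0]:
        distances = np.linalg.norm(centroids - points[i], axis=1)
        nearest = distances.argmin()
        # Adopt the nearest topic only when it is close enough.
        if distances[nearest] <= max_distance:
            new_labels[i] = topic_ids[nearest]
    return new_labels


# Toy 2D projection: two topics and two outliers.
points = [[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0], [0.1, 0.3], [9.0, 0.0]]
hdbscan_labels = [0, 0, 1, 1, -1, -1]
leet_labels = resolve_outliers(points, hdbscan_labels, max_distance=0.5)
print(leet_labels.tolist())  # [0, 0, 1, 1, 0, -1]
```

In this toy data, the first outlier sits 0.3 away from the centroid of topic 0 and is absorbed, while the second is far from both centroids and stays at -1. Raising `max_distance` resolves more outliers; lowering it keeps more of them as outliers.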



# Future Roadmap
## 0.0.9
- Control UMAP parameters
- Control HDBScan parameters
- Multilingual support for lemmatization
- Multilingual support for embedding
- Add support for custom App Titles

## 0.0.10
- Output an Annoy Index so that the data can be queried

## 0.0.11
- Support for embedding text, images, or both via CLIP and displaying the results in the same bokeh application

            

        