===========
Toponymy
===========
.. image:: doc/toponymy_text_horizontal.png
   :width: 600
   :align: center
   :alt: Toponymy

**🤖 Chat with our AI docs:** https://deepwiki.com/TutteInstitute/toponymy

The package name Toponymy is derived from the Greek topos ‘place’ + onuma ‘name’. Thus, the naming of places.
The goal of Toponymy is to put names to places in the space of information. This could be a corpus of documents,
in which case Toponymy can be viewed as a topic naming library. It could also be a collection of images, in which case
Toponymy could be used to name the themes of the images. The goal is to provide names that allow a user to
navigate through the space of information in a meaningful way.

Toponymy is designed to scale to very large corpora and collections, providing meaningful names on multiple scales,
from broad themes to fine-grained topics. We make use of custom clustering methods, information extraction,
and large language models to power this. The library is designed to be flexible and easy to use.

As of now this is a beta version of the library. Things can and will break right now.
We welcome feedback, use cases and feature suggestions.

------------------
Basic Installation
------------------
You can install Toponymy using:

.. code-block:: bash

    pip install toponymy

To install the latest version of Toponymy from source, clone the repository and run:

.. code-block:: bash

    git clone https://github.com/TutteInstitute/toponymy
    cd toponymy
    pip install .

-----------
Basic Usage
-----------
As an example, we can use Toponymy to cluster documents from the `20-Newsgroups dataset <http://qwone.com/~jason/20Newsgroups/>`_, hosted on Hugging Face, and then assign topic names to these clusters. The 20-Newsgroups dataset contains 18,170 documents distributed roughly evenly across 20 different newsgroups. You can compute vector representations of each document on your own (see `Vector Construction <https://github.com/TutteInstitute/toponymy?tab=readme-ov-file#vector-construction>`_ for instructions), but this can be very expensive without a GPU, so we recommend downloading our precomputed vectors. The code below retrieves them:

.. code-block:: python

    import numpy as np
    import pandas as pd

    newsgroups_df = pd.read_parquet("hf://datasets/lmcinnes/20newsgroups_embedded/data/train-00000-of-00001.parquet")
    text = newsgroups_df["post"].str.strip().values
    document_vectors = np.stack(newsgroups_df["embedding"].values)
    document_map = np.stack(newsgroups_df["map"].values)

After running the above code, ``document_vectors`` will contain 768-dimensional embeddings for each of the 18,170 documents in the dataset and ``document_map`` will contain 2-dimensional embeddings of these same documents.
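
Before clustering, it can be worth confirming that the three inputs line up row for row. The following is only a minimal sketch: ``check_inputs`` is a hypothetical helper, not part of Toponymy, and it is exercised here on small random stand-in arrays rather than on the real dataset.

```python
import numpy as np


def check_inputs(text, document_vectors, document_map):
    """Return (n_documents, embedding_dim, map_dim) or raise ValueError.

    Hypothetical helper for illustration; not part of the Toponymy API.
    """
    document_vectors = np.asarray(document_vectors)
    document_map = np.asarray(document_map)
    if document_vectors.ndim != 2 or document_map.ndim != 2:
        raise ValueError("vectors and map must both be 2-dimensional arrays")
    n_documents = len(text)
    if not (document_vectors.shape[0] == document_map.shape[0] == n_documents):
        raise ValueError("text, vectors and map must have the same number of rows")
    return n_documents, document_vectors.shape[1], document_map.shape[1]


# Stand-in data with the same layout as the example (768-d vectors, 2-d map).
rng = np.random.default_rng(0)
demo_text = ["post one", "post two", "post three"]
demo_vectors = rng.normal(size=(3, 768))
demo_map = rng.normal(size=(3, 2))
print(check_inputs(demo_text, demo_vectors, demo_map))
```

With the real data the same call would be ``check_inputs(text, document_vectors, document_map)``.
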
We can visualize the documents using the 2-dimensional representations in ``document_map``:

.. code-block:: python

    import datamapplot

    plot = datamapplot.create_plot(document_map)
    display(plot)

.. image:: doc/example_2D_plot.png
   :width: 600
   :align: center
   :alt: example_2D_plot

Once we have a low-dimensional representation, we can do the topic naming.
Toponymy will make use of a clusterer to create a balanced hierarchical layered
clustering of our documents. (In this case, we use ``ToponymyClusterer`` on the 2-dimensional vectors in ``document_map``.)

.. code-block:: python

    from toponymy import ToponymyClusterer

    clusterer = ToponymyClusterer(min_clusters=4, verbose=True)
    clusterer.fit(clusterable_vectors=document_map, embedding_vectors=document_vectors)
    for i, layer in enumerate(clusterer.cluster_layers_):
        print(f'{len(np.unique(layer.cluster_labels))-1} clusters in layer {i}')

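
The loop above subtracts one from the number of unique labels, which presumably accounts for a label reserved for unclustered documents. As a sketch only, and assuming that label is ``-1`` (an assumption, not something stated here), a hypothetical helper can count clusters without undercounting layers in which every document is clustered:

```python
import numpy as np


def count_clusters(cluster_labels, noise_label=-1):
    """Count distinct cluster ids, ignoring the noise label if it occurs.

    Hypothetical helper; the noise label of -1 is an assumption.
    """
    unique_labels = np.unique(np.asarray(cluster_labels))
    return int(np.sum(unique_labels != noise_label))


# Stand-in label arrays, one per layer.
demo_layers = [
    np.array([0, 0, 1, -1, 2, 2, -1]),  # three clusters plus unclustered points
    np.array([0, 0, 0, 1, 1, 1, 1]),    # two clusters, nothing unclustered
]
for i, labels in enumerate(demo_layers):
    print(f"{count_clusters(labels)} clusters in layer {i}")
```
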
Toponymy will then use a variety of sampling and summarization techniques to construct prompts
describing each cluster to pass to a large language model (LLM).
Note that Toponymy also requires an embedding model for determining which of the documents will be most relevant to each
of our clusters. This doesn't have to be the embedding model that our documents were embedded with but it
should be similar.

.. code-block:: python

    from sentence_transformers import SentenceTransformer

    embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

Toponymy supports multiple LLMs, including Cohere, OpenAI, and Anthropic via service calls, and local models via
Hugging Face and LlamaCpp. Here we show an example using OpenAI.

You will need to get an `OpenAI API key <https://platform.openai.com/api-keys>`_ and store it in the file ``openai_key.txt`` before running this code.
Also make sure that the ``openai`` package is installed in your environment. You can test your connection to OpenAI with the ``test_llm_connectivity()`` method before running Toponymy.
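
.. code-block:: python

    import openai  # not used directly; fails early if the package is missing
    from toponymy import Toponymy
    from toponymy.llm_wrappers import OpenAINamer

    # Read the API key from a local file, closing the file afterwards.
    with open("openai_key.txt") as key_file:
        openai_api_key = key_file.read().strip()
    llm = OpenAINamer(openai_api_key)
    llm.test_llm_connectivity()
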
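
Keeping a key in a plain text file is convenient, but some setups prefer an environment variable. The sketch below is one possible way to support both; ``load_api_key`` is a hypothetical helper rather than part of Toponymy, and the variable name ``OPENAI_API_KEY`` is simply the conventional one.

```python
import os
from pathlib import Path


def load_api_key(env_var="OPENAI_API_KEY", key_file="openai_key.txt"):
    """Return the key from the environment if set, else from key_file.

    Hypothetical helper for illustration; not part of the Toponymy API.
    """
    key = os.environ.get(env_var, "").strip()
    if key:
        return key
    path = Path(key_file)
    if path.is_file():
        return path.read_text().strip()
    raise RuntimeError(f"No API key found in ${env_var} or {key_file}")
```

The result could then be passed to the wrapper in place of the file read, as in ``OpenAINamer(load_api_key())``.
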
The following code will generate topic names
for the documents in the dataset using the ``document_vectors``, ``document_map``, and ``embedding_model`` created above.
(Warnings are filtered here because they can interfere with the display of the progress bar.)

.. code-block:: python

    import warnings
    warnings.filterwarnings('ignore')

    topic_model = Toponymy(
        llm_wrapper=llm,
        text_embedding_model=embedding_model,
        clusterer=clusterer,
        object_description="newsgroup posts",
        corpus_description="20-newsgroups dataset",
        exemplar_delimiters=["<EXAMPLE_POST>\n","\n</EXAMPLE_POST>\n\n"]
    )

    # Note on data types for fit() method:
    # - text: Python list of strings (not numpy array)
    # - document_vectors: numpy array of shape (n_documents, embedding_dimension)
    # - document_map: numpy array of shape (n_documents, clustering_dimension)
    # ``text`` was loaded as a numpy array above, so convert it to a list here.
    topic_model.fit(list(text), document_vectors, document_map)

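
Calling ``warnings.filterwarnings('ignore')`` silences warnings for the rest of the session. If that is broader than you want, the standard library's ``warnings.catch_warnings`` context manager can limit the suppression to a single call. This is a sketch of that alternative, with ``run_quietly`` and ``noisy_add`` as hypothetical names used only for illustration:

```python
import warnings


def run_quietly(func, *args, **kwargs):
    """Call func with warnings suppressed, restoring the filters afterwards."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        return func(*args, **kwargs)


def noisy_add(a, b):
    """Stand-in for a call that emits warnings while it works."""
    warnings.warn("this warning is hidden inside run_quietly")
    return a + b


print(run_quietly(noisy_add, 2, 3))
```

The same pattern would wrap the ``topic_model.fit(...)`` call, leaving warnings from the rest of your code visible.
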
After fitting, ``topic_model.topic_names_`` holds a list of lists that can be used to explore the unique topic names in each layer, or resolution.
Let's examine the last layer of topics. There are five clusters in this layer, and Toponymy assigns a name to each.

.. code-block:: python

    topic_names = topic_model.topic_names_

    topic_names[-1:]

    [['Sports Analysis',
      'Religion and Sociopolitical Conflicts',
      'Automotive and Motorcycle Discussion',
      'X Window System and DOS/Windows Graphics',
      'Vintage Computer Hardware']]

Our gray 2-D plot from above can now be displayed with labeled clusters. (See `Interactive Topic Visualization <https://github.com/TutteInstitute/toponymy?tab=readme-ov-file#interactive-topic-visualization>`_ for more details on generating interactive plots.)

.. image:: doc/example_labeled_plot.png
   :width: 600
   :align: center
   :alt: example_labeled_plot

At this particular level of resolution, this plot also shows one topic ('NASA and Space Exploration Missions') from the second to last layer of clusters.

.. code-block:: python

    topic_names[-2:]

    [['NHL Hockey Playoffs and Team Analysis',
      'Major League Baseball Analysis',
      'NASA and Space Exploration Missions',
      'Clipper Chip Encryption and Privacy Debate',
      'Medical Discussions on Chronic Diseases and Diet',
      'Middle East Conflicts and Israeli-Palestinian Issues',
      'Automotive and Motorcycle Discussion',
      'Christianity, Faith, and Religious Debates',
      'Waco Siege and Government Controversy',
      'US Gun Rights and Regulation Debate',
      'Political and Social Controversies Online',
      'X Window System and DOS/Windows Graphics',
      'Vintage PC and Macintosh Hardware',
      'PC Hard Drive Interfaces and Troubleshooting'],
     ['Sports Analysis',
      'Religion and Sociopolitical Conflicts',
      'Automotive and Motorcycle Discussion',
      'X Window System and DOS/Windows Graphics',
      'Vintage Computer Hardware']]

``topics_per_document`` contains topic labels for each document, with one array for each level of resolution in our
cluster layers. In the example above this is a list of 5 arrays, each holding a topic label for every one of the 18,170 documents.
Documents that aren't contained within a cluster at a given layer are given the topic ``Unlabelled``.

.. code-block:: python

    topics_per_document = [cluster_layer.topic_name_vector for cluster_layer in topic_model.cluster_layers_]
    topics_per_document

    [array(['Unlabelled',
            'Discussion on VESA Local Bus Video Cards and Performance',
            'Unlabelled', ...,
            'Cooling Solutions and Components for CPUs and Power Supplies',
            'Algorithms for Finding Sphere from Four Points in 3D',
            'Automotive Discussions on Performance Cars and Specifications'], dtype=object),
     array(['NHL Playoff Analysis and Predictions',
            'Graphics Card Performance and Benchmark Discussions',
            'Armenian Genocide and Turkish Atrocities Discourse', ...,
            'Cooling Solutions and Components for CPUs and Power Supplies',
            'Algorithms for 3D Polygon Processing and Geometry',
            'Discussions on SUVs and Performance Cars'], dtype=object),
     array(['NHL Playoff Analysis and Predictions',
            'Video Card Drivers and Performance',
            'Armenian Genocide and Turkish Atrocities', ..., 'Unlabelled',
            'Unlabelled', 'Automotive Performance and Used Cars'], dtype=object),
     array(['NHL Playoffs and Player Analysis',
            'Vintage Computer Hardware and Upgrades', 'Unlabelled', ...,
            'Unlabelled', 'X Window System and Graphics Software',
            'Automotive Performance and Safety'], dtype=object),
     array(['Sports Analysis', 'Computer Hardware', 'Unlabelled', ...,
            'Unlabelled', 'X Window System and Graphics Software',
            'Automotive Performance and Safety'], dtype=object)]

-----------------------------------
Interactive Topic Visualization
-----------------------------------
Once you’ve generated the topic names and document map, it's helpful to visualize how topics are distributed across your corpus. We recommend using the `DataMapPlot <https://github.com/TutteInstitute/datamapplot>`_ library for this purpose. It creates interactive, zoomable maps that allow you to explore clusters and topic labels in a spatial layout. It is particularly well suited to exploring data maps along with layers of topic names.
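Here is an example of using ``datamapplot`` to visualize your data. We pass in ``document_map`` along with one topic name vector per layer (the same arrays collected in ``topics_per_document`` above). If ``datamapplot`` is not yet installed, install it with pip or conda first: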

.. code-block:: shell

    pip install datamapplot
    # or, if you use conda:
    conda install -c conda-forge datamapplot

.. code-block:: python

    import datamapplot

    topic_name_vectors = [cluster_layer.topic_name_vector for cluster_layer in topic_model.cluster_layers_]

    plot = datamapplot.create_interactive_plot(
        document_map,
        *topic_name_vectors,
    )

    plot

This will launch an interactive map in your browser or notebook environment, showing document clusters and their associated topic names across all hierarchical layers. You can zoom in to explore fine-grained topics and zoom out to see broader themes, enabling intuitive navigation of the information space.
-----------------------------------
Controlling Verbose Output
-----------------------------------
Toponymy provides a unified ``verbose`` parameter to control progress bars and informative messages across all components:

.. code-block:: python

    # Show all progress bars and messages
    clusterer = ToponymyClusterer(min_clusters=4, verbose=True)

    # Suppress all output for silent operation
    clusterer = ToponymyClusterer(min_clusters=4, verbose=False)

    # The same parameter works for all components
    topic_model = Toponymy(
        llm_wrapper=llm,
        text_embedding_model=embedding_model,
        verbose=True  # Shows progress for all operations
    )

The ``verbose`` parameter unifies the older separate ``verbose`` and ``show_progress_bar`` parameters, providing a simpler and more consistent interface. Legacy parameters are still supported for backward compatibility but will show deprecation warnings.
-------------------
Vector Construction
-------------------
If you do not have ready-made document vectors and low-dimensional representations of your data, you will need to compute
your own. For faster encoding, set ``device`` to ``"cuda"``, ``"mps"``, or ``"npu"`` if that hardware is available, and fall back to ``"cpu"`` otherwise. Alternatively,
you can make use of an API call to an embedding service. Embedding wrappers can be found in:

.. code-block:: python

    from toponymy.embedding_wrappers import OpenAIEmbedder

or the embedding wrapper of your choice. Once we generate document vectors we will need to construct a low dimensional representation.
Here we do that via our UMAP library.

.. code-block:: python

    # Install the dependencies first (run in a shell, not in Python):
    #   pip install umap-learn pandas sentence_transformers

    import pandas as pd
    from sentence_transformers import SentenceTransformer
    import umap

    newsgroups_df = pd.read_parquet("hf://datasets/lmcinnes/20newsgroups_embedded/data/train-00000-of-00001.parquet")
    text = newsgroups_df["post"].str.strip().values
    embedding_model = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")

    # Embed every document, then reduce the embeddings to 2 dimensions with UMAP.
    document_vectors = embedding_model.encode(text, show_progress_bar=True)
    document_map = umap.UMAP(metric='cosine').fit_transform(document_vectors)

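
Because encoding a large corpus is slow, it may be worth caching the resulting arrays on disk so that later runs can skip the computation. The following is a minimal sketch of that idea using ``numpy.save`` and ``numpy.load``; ``load_or_compute`` is a hypothetical helper, and ``fake_embed`` stands in for the real embedding call.

```python
import os
import tempfile

import numpy as np


def load_or_compute(path, compute):
    """Load an array from path if it exists, else compute it and save it.

    Hypothetical helper; path should end in ".npy", since numpy.save
    appends that extension when it is missing.
    """
    if os.path.exists(path):
        return np.load(path)
    array = np.asarray(compute())
    np.save(path, array)
    return array


with tempfile.TemporaryDirectory() as cache_dir:
    cache_path = os.path.join(cache_dir, "document_vectors.npy")
    calls = []

    def fake_embed():
        # Stand-in for an expensive embedding step.
        calls.append(1)
        return np.arange(6, dtype=np.float32).reshape(3, 2)

    first = load_or_compute(cache_path, fake_embed)   # computes and saves
    second = load_or_compute(cache_path, fake_embed)  # loads from disk
    print(len(calls), np.array_equal(first, second))
```

In the example above, the ``compute`` argument could be ``lambda: embedding_model.encode(text, show_progress_bar=True)``.
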
-------
License
-------
Toponymy is MIT licensed. See the LICENSE file for details.
------------
Contributing
------------
Contributions are more than welcome! If you have ideas for features or projects, please get in touch. Everything from
code to notebooks to examples and documentation is *equally valuable*, so please don't feel you can't contribute.
To contribute, please `fork the project <https://github.com/TutteInstitute/toponymy/fork>`_, make your
changes and submit a pull request. We will do our best to work through any issues with you and get your code merged in.