:Name: toponymy
:Version: 0.4.0
:Home page: https://github.com/TutteInstitute/toponymy
:Summary: A library for using large language models to name topics
:Upload time: 2025-10-09 21:20:47
:Maintainer: John Healy, Leland McInnes
:Author: John Healy, Leland McInnes
:Requires Python: >=3.9
:License: MIT License
:Keywords: topic modeing, representation, cluster, clustering, large language models, llm, topic naming
:Requirements: numpy, pandas, numba, datasets, scikit-learn, vectorizers, scipy, fast_hdbscan, dataclasses, tqdm, tenacity, aiohttp, apricot-select

===========
Toponymy
===========

.. image:: doc/toponymy_text_horizontal.png
  :width: 600
  :align: center
  :alt: Toponymy


**🤖 Chat with our AI docs:** https://deepwiki.com/TutteInstitute/toponymy

The package name Toponymy is derived from the Greek topos ‘place’ + onuma ‘name’: the naming of places.
The goal of Toponymy is to put names to places in the space of information. This could be a corpus of documents,
in which case Toponymy can be viewed as a topic naming library.  It could also be a collection of images, in which case
Toponymy could be used to name the themes of the images.  The goal is to provide names that allow a user to
navigate through the space of information in a meaningful way.

Toponymy is designed to scale to very large corpora and collections, providing meaningful names on multiple scales,
from broad themes to fine-grained topics.  We make use of custom clustering methods, information extraction,
and large language models to power this. The library is designed to be flexible and easy to use.

This is currently a beta version of the library. Things can and will break.
We welcome feedback, use cases and feature suggestions.

------------------
Basic Installation
------------------

You can install Toponymy using:

.. code-block:: bash

    pip install toponymy


To install the latest version of Toponymy from source, clone the repository and run:

.. code-block:: bash

    git clone https://github.com/TutteInstitute/toponymy
    cd toponymy
    pip install .
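
To confirm which version ended up in your environment, a quick check along these lines can help. This is a minimal sketch using only the standard library; ``installed_version`` is a hypothetical helper name and is not part of Toponymy.

.. code-block:: python

    from importlib.metadata import PackageNotFoundError, version


    def installed_version(package_name):
        """Return the installed version string, or None if the package is missing."""
        try:
            return version(package_name)
        except PackageNotFoundError:
            return None


    # Prints the version string if Toponymy is installed, otherwise None.
    print(installed_version("toponymy"))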

-----------
Basic Usage
-----------

As an example, we can use Toponymy to cluster documents in the `20-Newsgroups dataset <http://qwone.com/~jason/20Newsgroups/>`_ (a copy with precomputed embeddings is hosted on Hugging Face) and then assign topic names to these clusters. The 20-Newsgroups dataset contains 18,170 documents distributed roughly evenly across 20 different newsgroups. You can compute vector representations of each document on your own (see `Vector Construction <https://github.com/TutteInstitute/toponymy?tab=readme-ov-file#vector-construction>`_ for instructions), but this can be very expensive without a GPU. We recommend downloading our precomputed vectors. Code to retrieve these vectors is below:

.. code-block:: python

    import numpy as np
    import pandas as pd
    newsgroups_df = pd.read_parquet("hf://datasets/lmcinnes/20newsgroups_embedded/data/train-00000-of-00001.parquet")
    text = newsgroups_df["post"].str.strip().tolist()  # fit() expects a Python list of strings
    document_vectors = np.stack(newsgroups_df["embedding"].values)
    document_map = np.stack(newsgroups_df["map"].values)

After running the above code, ``document_vectors`` will contain 768-dimensional embeddings for each of the 18,170 documents in the dataset and ``document_map`` will contain 2-dimensional embeddings of these same documents.

We can visualize the documents using the 2-dimensional representations in ``document_map``:

.. code-block:: python

  import datamapplot
  plot = datamapplot.create_plot(document_map)
  display(plot)

.. image:: doc/example_2D_plot.png
  :width: 600
  :align: center
  :alt: example_2D_plot

Once we have a low-dimensional representation, we can do the topic naming. 
Toponymy will make use of a clusterer to create a balanced hierarchical layered 
clustering of our documents. (In this case, we use ``ToponymyClusterer`` on the 2-dimensional vectors in ``document_map``.)

.. code-block:: python

    from toponymy import ToponymyClusterer
    clusterer = ToponymyClusterer(min_clusters=4, verbose=True)
    clusterer.fit(clusterable_vectors=document_map, embedding_vectors=document_vectors)
    for i, layer in enumerate(clusterer.cluster_layers_):
        print(f'{len(np.unique(layer.cluster_labels))-1} clusters in layer {i}')
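
The ``- 1`` in the loop above presumably discounts a label reserved for unclustered points. The sketch below assumes, as in HDBSCAN-style clusterers, that such points are labelled ``-1``; ``count_clusters`` is a hypothetical helper, not part of Toponymy. Unlike an unconditional ``- 1``, it only discounts the noise label when that label is actually present.

.. code-block:: python

    NOISE_LABEL = -1  # assumption: unclustered points carry this label


    def count_clusters(cluster_labels):
        """Count distinct cluster ids, ignoring the assumed noise label."""
        distinct = {int(label) for label in cluster_labels}
        distinct.discard(NOISE_LABEL)
        return len(distinct)


    # Toy label vectors standing in for layer.cluster_labels.
    labels_with_noise = [0, 1, 1, -1, 2, -1]
    labels_without_noise = [0, 0, 1, 1]
    print(count_clusters(labels_with_noise))
    print(count_clusters(labels_without_noise))

The helper accepts any iterable of integer labels, so a NumPy label array can be passed in directly.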

Toponymy will then use a variety of sampling and summarization techniques to construct prompts 
describing each cluster to pass to a large language model (LLM).  

Note that Toponymy also requires an embedding model for determining which of the documents will be most relevant to each
of our clusters.  This doesn't have to be the embedding model that our documents were embedded with but it 
should be similar.

.. code-block:: python

    from sentence_transformers import SentenceTransformer
    embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

Toponymy supports multiple LLMs, including Cohere, OpenAI, and Anthropic via service calls, and local models via
Hugging Face and LlamaCpp. Here we show an example using OpenAI.

You will need to get an `OpenAI key <https://platform.openai.com/api-keys>`_ and store it in the file ``openai_key.txt`` before running this code.
Also make sure that the ``openai`` package is installed in your environment. You can test your connection to OpenAI with the ``test_llm_connectivity()`` method before running Toponymy.

.. code-block:: python

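    import openai  # not used directly; fails early if the package is missing
    from toponymy import Toponymy
    from toponymy.llm_wrappers import OpenAINamer

    # Read the API key from the file created above.
    with open("openai_key.txt") as key_file:
        openai_api_key = key_file.read().strip()
    llm = OpenAINamer(openai_api_key)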
    llm.test_llm_connectivity()
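
If you would rather not keep the key in a file inside your project, one option is to read it from an environment variable and fall back to the file. This is a sketch only: ``load_api_key`` is a hypothetical helper, and ``OPENAI_API_KEY`` is the variable name conventionally used by the ``openai`` package.

.. code-block:: python

    import os
    from pathlib import Path


    def load_api_key(env_var="OPENAI_API_KEY", key_file="openai_key.txt"):
        """Return the key from the environment if set, otherwise from key_file."""
        key = os.environ.get(env_var, "").strip()
        if key:
            return key
        path = Path(key_file)
        if path.is_file():
            return path.read_text().strip()
        raise RuntimeError(f"No API key found in ${env_var} or {key_file}")

The returned string can then be passed to ``OpenAINamer`` in the same way as ``openai_api_key`` above.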


The following code generates topic names
for the documents in the dataset using the ``document_vectors``, ``document_map``, and ``embedding_model`` created above.
(Warnings are filtered here because they can interfere with the display of the progress bar.)

.. code-block:: python

    import warnings
    warnings.filterwarnings('ignore')

    topic_model = Toponymy(
        llm_wrapper=llm,
        text_embedding_model=embedding_model,
        clusterer=clusterer,
        object_description="newsgroup posts",
        corpus_description="20-newsgroups dataset",
        exemplar_delimiters=["<EXAMPLE_POST>\n","\n</EXAMPLE_POST>\n\n"]
    )
    
    # Note on data types for fit() method:
    # - text: Python list of strings (not numpy array)
    # - document_vectors: numpy array of shape (n_documents, embedding_dimension)
    # - document_map: numpy array of shape (n_documents, clustering_dimension)
    topic_model.fit(text, document_vectors, document_map)
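
Since ``fit`` expects the three inputs to describe the same documents in the same order, a small pre-flight check can catch mismatches early. The helper below is a hypothetical sketch, not part of Toponymy; it only encodes the type and shape expectations listed in the comments above.

.. code-block:: python

    import numpy as np


    def check_fit_inputs(text, document_vectors, document_map):
        """Raise ValueError if the inputs do not line up row for row."""
        if not isinstance(text, list) or not all(isinstance(t, str) for t in text):
            raise ValueError("text should be a Python list of strings")
        document_vectors = np.asarray(document_vectors)
        document_map = np.asarray(document_map)
        if document_vectors.ndim != 2 or document_map.ndim != 2:
            raise ValueError("document_vectors and document_map should be 2-D arrays")
        if not (len(text) == document_vectors.shape[0] == document_map.shape[0]):
            raise ValueError("text, document_vectors and document_map differ in length")


    # Toy data standing in for real documents and embeddings.
    check_fit_inputs(["a post", "another post"], np.zeros((2, 768)), np.zeros((2, 2)))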


After fitting, ``topic_model.topic_names_`` holds a list of lists, one per layer, which can be used to explore the unique topic names at each resolution.
Let's examine the last layer of topics. There are five clusters in this layer, and Toponymy has assigned a name to each.

.. code-block:: python

    topic_names = topic_model.topic_names_

    topic_names[-1:]

    [['Sports Analysis',
    'Religion and Sociopolitical Conflicts',
    'Automotive and Motorcycle Discussion',
    'X Window System and DOS/Windows Graphics',
    'Vintage Computer Hardware']]

Our gray 2-D plot from above can now be displayed with labeled clusters. (See `Interactive Topic Visualization <https://github.com/TutteInstitute/toponymy?tab=readme-ov-file#interactive-topic-visualization>`_ for more details on generating interactive plots.)

.. image:: doc/example_labeled_plot.png
  :width: 600
  :align: center
  :alt: example_labeled_plot

At this particular level of resolution, the plot also shows one topic ('NASA and Space Exploration Missions') from the second-to-last layer of clusters. The last two layers are:

.. code-block:: python

    topic_names[-2:]

    [['NHL Hockey Playoffs and Team Analysis',
    'Major League Baseball Analysis',
    'NASA and Space Exploration Missions',
    'Clipper Chip Encryption and Privacy Debate',
    'Medical Discussions on Chronic Diseases and Diet',
    'Middle East Conflicts and Israeli-Palestinian Issues',
    'Automotive and Motorcycle Discussion',
    'Christianity, Faith, and Religious Debates',
    'Waco Siege and Government Controversy',
    'US Gun Rights and Regulation Debate',
    'Political and Social Controversies Online',
    'X Window System and DOS/Windows Graphics',
    'Vintage PC and Macintosh Hardware',
    'PC Hard Drive Interfaces and Troubleshooting'],
    ['Sports Analysis',
    'Religion and Sociopolitical Conflicts',
    'Automotive and Motorcycle Discussion',
    'X Window System and DOS/Windows Graphics',
    'Vintage Computer Hardware']]
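
Some names, such as 'Automotive and Motorcycle Discussion', appear in both layers above. Because ``topic_names_`` is a plain list of lists, ordinary Python is enough to find them. The sketch below uses a shortened, hard-coded copy of the output above in place of a fitted model; ``topics_in_multiple_layers`` is a hypothetical helper.

.. code-block:: python

    from collections import Counter


    def topics_in_multiple_layers(topic_names):
        """Return names that appear in more than one layer, in first-seen order."""
        layer_counts = Counter()
        for names in topic_names:
            layer_counts.update(set(names))  # count each name once per layer
        repeated = []
        for names in topic_names:
            for name in names:
                if layer_counts[name] > 1 and name not in repeated:
                    repeated.append(name)
        return repeated


    # Shortened stand-in for topic_model.topic_names_[-2:].
    sample_topic_names = [
        ["NASA and Space Exploration Missions",
         "Automotive and Motorcycle Discussion",
         "X Window System and DOS/Windows Graphics"],
        ["Sports Analysis",
         "Automotive and Motorcycle Discussion",
         "X Window System and DOS/Windows Graphics"],
    ]
    print(topics_in_multiple_layers(sample_topic_names))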


``topics_per_document`` contains topic labels for each document, with one array for each level of resolution in our
cluster layers.  In the example above this is a list of 5 arrays, each containing a topic label for each of the 18,170 documents.
Documents that aren't contained within a cluster at a given layer are given the topic ``Unlabelled``.

.. code-block:: python
    
    topics_per_document = [cluster_layer.topic_name_vector for cluster_layer in topic_model.cluster_layers_]
    topics_per_document
    

    [array(['Unlabelled',
            'Discussion on VESA Local Bus Video Cards and Performance',
            'Unlabelled', ...,
            'Cooling Solutions and Components for CPUs and Power Supplies',
            'Algorithms for Finding Sphere from Four Points in 3D',
            'Automotive Discussions on Performance Cars and Specifications'], dtype=object),
    array(['NHL Playoff Analysis and Predictions',
            'Graphics Card Performance and Benchmark Discussions',
            'Armenian Genocide and Turkish Atrocities Discourse', ...,
            'Cooling Solutions and Components for CPUs and Power Supplies',
            'Algorithms for 3D Polygon Processing and Geometry',
            'Discussions on SUVs and Performance Cars'], dtype=object),
    array(['NHL Playoff Analysis and Predictions',
            'Video Card Drivers and Performance',
            'Armenian Genocide and Turkish Atrocities', ..., 'Unlabelled',
            'Unlabelled', 'Automotive Performance and Used Cars'], dtype=object),
    array(['NHL Playoffs and Player Analysis',
            'Vintage Computer Hardware and Upgrades', 'Unlabelled', ...,
            'Unlabelled', 'X Window System and Graphics Software',
            'Automotive Performance and Safety'], dtype=object),
    array(['Sports Analysis', 'Computer Hardware', 'Unlabelled', ...,
            'Unlabelled', 'X Window System and Graphics Software',
            'Automotive Performance and Safety'], dtype=object)]
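
Because documents outside every cluster in a layer are labelled ``Unlabelled``, it can be useful to check how much of the corpus each layer actually covers. The following is a minimal sketch on toy data; ``unlabelled_fraction`` is a hypothetical helper, and the short lists stand in for the real ``topic_name_vector`` arrays.

.. code-block:: python

    from collections import Counter

    UNLABELLED = "Unlabelled"


    def unlabelled_fraction(topic_name_vector):
        """Fraction of documents in one layer that carry no topic label."""
        counts = Counter(topic_name_vector)
        total = sum(counts.values())
        return counts[UNLABELLED] / total if total else 0.0


    # Toy stand-ins for two layers of topics_per_document.
    toy_layers = [
        ["Unlabelled", "Video Cards", "Unlabelled", "NHL Playoffs"],
        ["Sports Analysis", "Computer Hardware", "Unlabelled", "Sports Analysis"],
    ]
    for i, layer in enumerate(toy_layers):
        print(f"layer {i}: {unlabelled_fraction(layer):.0%} unlabelled")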

-----------------------------------
Interactive Topic Visualization
-----------------------------------

Once you’ve generated the topic names and document map, it's helpful to visualize how topics are distributed across your corpus. We recommend using the `DataMapPlot <https://github.com/TutteInstitute/datamapplot>`_ library for this purpose. It creates interactive, zoomable maps that allow you to explore clusters and topic labels in a spatial layout. It is particularly well suited to exploring data maps along with layers of topic names. 

Here is an example of using ``datamapplot`` to visualize your data. After installing it with either pip or conda, we pass in our ``document_map`` together with one topic name vector per layer (the same arrays as ``topics_per_document`` above):

.. code-block:: shell

    pip install datamapplot
    # or, alternatively, with conda:
    conda install -c conda-forge datamapplot

.. code-block:: python

    import datamapplot
    topic_name_vectors = [cluster_layer.topic_name_vector for cluster_layer in topic_model.cluster_layers_]

    plot = datamapplot.create_interactive_plot(
        document_map,
        *topic_name_vectors,
    )

    plot

This will launch an interactive map in your browser or notebook environment, showing document clusters and their associated topic names across all hierarchical layers. You can zoom in to explore fine-grained topics and zoom out to see broader themes, enabling intuitive navigation of the information space.

-----------------------------------
Controlling Verbose Output
-----------------------------------

Toponymy provides a unified ``verbose`` parameter to control progress bars and informative messages across all components:

.. code-block:: python

    # Show all progress bars and messages
    clusterer = ToponymyClusterer(min_clusters=4, verbose=True)
    
    # Suppress all output for silent operation
    clusterer = ToponymyClusterer(min_clusters=4, verbose=False)
    
    # The same parameter works for all components
    topic_model = Toponymy(
        llm_wrapper=llm,
        text_embedding_model=embedding_model,
        verbose=True  # Shows progress for all operations
    )

The ``verbose`` parameter unifies the older separate ``verbose`` and ``show_progress_bar`` parameters, providing a simpler and more consistent interface. Legacy parameters are still supported for backward compatibility but will show deprecation warnings.
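
Note that a global ``warnings.filterwarnings('ignore')``, as used in the Basic Usage example, would also hide these deprecation warnings. One way to keep the suppression local is the standard library's ``warnings.catch_warnings`` context manager. The sketch below uses a hypothetical stand-in function, ``legacy_call``, rather than any real Toponymy API.

.. code-block:: python

    import warnings


    def legacy_call():
        """Stand-in for code that emits a deprecation warning."""
        warnings.warn("show_progress_bar is deprecated", DeprecationWarning)
        return "done"


    # Suppress warnings only inside this block; filters are restored on exit.
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        quiet_result = legacy_call()

    # In a separate block, record warnings to show they are emitted when not ignored.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        legacy_call()

    print(quiet_result, len(caught))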

-------------------
Vector Construction
-------------------

If you do not have ready-made document vectors and low-dimensional representations of your data, you will need to compute
your own. In the example below, you can speed up encoding by setting ``device`` to ``"cuda"``, ``"mps"`` or ``"npu"`` instead of ``"cpu"``, depending on the hardware available. Alternatively,
you could call an embedding service through its API.  Embedding wrappers can be found in:

.. code-block:: python

    from toponymy.embedding_wrappers import OpenAIEmbedder

or the embedding wrapper of your choice. Once the document vectors are generated, we need to construct a low-dimensional representation.
Here we do that with our UMAP library (``umap-learn``).

.. code-block:: python

    # Install the dependencies first by running this in a shell (not in Python):
    #   pip install umap-learn pandas sentence_transformers

    import pandas as pd
    from sentence_transformers import SentenceTransformer
    import umap

    newsgroups_df = pd.read_parquet("hf://datasets/lmcinnes/20newsgroups_embedded/data/train-00000-of-00001.parquet")
    text = newsgroups_df["post"].str.strip().tolist()  # plain Python list of strings
    embedding_model = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")

    document_vectors = embedding_model.encode(text, show_progress_bar=True)
    document_map = umap.UMAP(metric='cosine').fit_transform(document_vectors)
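
Rather than hard-coding the ``device`` argument, you could choose it at run time. This is a hedged sketch: ``pick_device`` is a hypothetical helper, it relies on PyTorch's ``torch.cuda.is_available()`` and ``torch.backends.mps.is_available()``, and it does not attempt to detect ``"npu"`` devices.

.. code-block:: python

    def pick_device():
        """Return "cuda" or "mps" when PyTorch reports them, otherwise "cpu"."""
        try:
            import torch
        except ImportError:
            # Without PyTorch there is nothing to probe, so fall back to the CPU.
            return "cpu"
        if torch.cuda.is_available():
            return "cuda"
        # Older PyTorch builds may not ship the MPS backend module at all.
        mps = getattr(torch.backends, "mps", None)
        if mps is not None and mps.is_available():
            return "mps"
        return "cpu"


    print(pick_device())

The returned string can be passed as ``device=pick_device()`` when constructing the ``SentenceTransformer`` above.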

-------
License
-------

Toponymy is MIT licensed. See the LICENSE file for details.

------------
Contributing
------------

Contributions are more than welcome! If you have ideas for features or projects, please get in touch. Everything from
code to notebooks to examples and documentation is *equally valuable*, so please don't feel you can't contribute.
To contribute, please `fork the project <https://github.com/TutteInstitute/toponymy/fork>`_, make your
changes and submit a pull request. We will do our best to work through any issues with you and get your code merged in.

            
