:Name: evoc
:Version: 0.1.1
:Home page: https://github.com/TutteInstitute/evoc
:Summary: Embedding Vector Oriented Clustering
:Upload time: 2024-12-09 20:07:32
:Maintainer: Leland McInnes
:Author: Leland McInnes
:Requires Python: >=3.9
:License: BSD License
:Keywords: embedding vector, vector database, topic modelling, cluster, clustering

.. image:: doc/evoc_logo.png
  :width: 600
  :align: center
  :alt: EVōC Logo

====
EVōC
====

EVōC (pronounced as "evoke") is Embedding Vector Oriented Clustering.
EVōC is a library for fast and flexible clustering of large datasets of high-dimensional embedding vectors.
If you have CLIP vectors, or outputs from sentence-transformers, OpenAI, or Cohere Embed, and you want
to quickly get good clusters out, this is the library for you. EVōC takes all the good parts of the
combination of UMAP + HDBSCAN for embedding clustering, improves upon them, and removes all
the time-consuming parts. By specializing directly to embedding vectors we can get good-quality
clustering with fewer hyperparameters to tune and in a fraction of the time.

EVōC is the library to use if you want:

 * Fast clustering of embedding vectors on CPU
 * Multi-granularity clustering, and automatic selection of the number of clusters
 * Clustering of int8 or binary quantized embedding vectors that works out-of-the-box
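
For readers unfamiliar with quantized embeddings, the following is a minimal sketch of how int8 and binary vectors are commonly derived from float embeddings with NumPy. The random data, the scaling scheme, and the variable names are illustrative assumptions; the exact input layout EVōC expects for quantized vectors is not specified here, so check the library before relying on these conventions.

```python
import numpy as np

# Hypothetical float embeddings standing in for real model outputs.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 256)).astype(np.float32)

# int8 scalar quantization: rescale so the largest magnitude maps to 127.
scale = 127.0 / np.abs(embeddings).max()
int8_vectors = np.clip(np.round(embeddings * scale), -127, 127).astype(np.int8)

# Binary quantization: keep only the sign of each dimension and pack
# eight dimensions into each uint8 byte.
binary_vectors = np.packbits(embeddings > 0, axis=1)

print(int8_vectors.shape, int8_vectors.dtype)
print(binary_vectors.shape, binary_vectors.dtype)
```

Packing the sign bits reduces each 256-dimensional vector to 32 bytes, which is the main attraction of binary quantization for large collections.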

As of now this is very much an early beta version of the library, and things can and will break.
We would nevertheless welcome feedback, use cases, and feature suggestions.

-----------
Basic Usage
-----------

EVōC follows the scikit-learn API, so it should be familiar to most users. You can use EVōC wherever
you might previously have used other sklearn clustering algorithms. Here is a simple example:
.. code-block:: python

    import evoc
    from sklearn.datasets import make_blobs

    data, _ = make_blobs(n_samples=100_000, n_features=1024, centers=100)

    clusterer = evoc.EVoC()
    cluster_labels = clusterer.fit_predict(data)

More distinctive features include the generation of multiple layers of cluster granularity,
the ability to extract a hierarchy of clusters across those layers, and automatic detection of
duplicates (or very near duplicates).

.. code-block:: python

    import evoc
    from sklearn.datasets import make_blobs

    data, _ = make_blobs(n_samples=100_000, n_features=1024, centers=100)

    clusterer = evoc.EVoC()
    cluster_labels = clusterer.fit_predict(data)
    cluster_layers = clusterer.cluster_layers_
    hierarchy = clusterer.cluster_tree_
    potential_duplicates = clusterer.duplicates_

The cluster layers are a list of cluster label vectors, with the first being the finest-grained
and later layers being progressively coarser-grained. This is ideal for layered topic modelling and for use with
`DataMapPlot <https://github.com/TutteInstitute/datamapplot>`_. See
`this data map <https://lmcinnes.github.io/datamapplot_examples/ArXiv_data_map_example.html>`_
for an example of using these layered clusters in topic modelling (zoom in to access finer-grained
topics).
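
As a rough illustration of working with such layers, the sketch below summarizes each layer and links every fine cluster to the coarse cluster that holds most of its points. The helper names ``summarize_layers`` and ``parent_clusters`` are hypothetical, the toy label lists stand in for ``clusterer.cluster_layers_``, and treating ``-1`` as the label for unclustered points is an assumption borrowed from the HDBSCAN convention. The majority-vote linking is a simple stand-in computed from the labels alone, not a description of how ``cluster_tree_`` is built.

```python
from collections import Counter, defaultdict


def summarize_layers(cluster_layers, noise_label=-1):
    """Return one summary dict per layer, finest layer first."""
    summaries = []
    for depth, labels in enumerate(cluster_layers):
        counts = Counter(int(label) for label in labels)
        n_noise = counts.pop(noise_label, 0)  # assumed noise label
        summaries.append({
            "layer": depth,
            "n_clusters": len(counts),
            "n_noise": n_noise,
            "largest_cluster": max(counts.values(), default=0),
        })
    return summaries


def parent_clusters(fine_labels, coarse_labels, noise_label=-1):
    """Map each fine cluster to the coarse cluster containing most of its points."""
    votes = defaultdict(Counter)
    for fine, coarse in zip(fine_labels, coarse_labels):
        fine, coarse = int(fine), int(coarse)
        if fine != noise_label and coarse != noise_label:
            votes[fine][coarse] += 1
    return {fine: counter.most_common(1)[0][0] for fine, counter in votes.items()}


# Hypothetical stand-in for clusterer.cluster_layers_ on six points.
toy_layers = [
    [0, 0, 1, 1, 2, -1],  # finest layer: three clusters, one unassigned point
    [0, 0, 0, 0, 1, -1],  # coarser layer: two clusters
]

for row in summarize_layers(toy_layers):
    print(row)
print(parent_clusters(toy_layers[0], toy_layers[1]))
```

Both helpers accept any iterable of integer labels, so NumPy label vectors can be passed in directly.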

------------
Installation
------------

EVōC has a small set of dependencies:

 * numpy
 * scikit-learn
 * numba
 * tqdm
 * tbb

You can install EVōC from PyPI using pip:

.. code-block:: bash

    pip install evoc

To install the latest version of EVōC from source, clone the repository and run:

.. code-block:: bash

    git clone https://github.com/TutteInstitute/evoc
    cd evoc
    pip install .
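
After installing, one way to check that the listed dependencies are present is to query the installed distribution metadata. This is a small sketch using only the standard library; the helper name ``installed_versions`` is hypothetical, and packages installed by other means (for example a conda-provided ``tbb``) may not register under these names.

```python
from importlib import metadata

# Distribution names taken from the dependency list above, plus evoc itself.
PACKAGES = ["numpy", "scikit-learn", "numba", "tqdm", "tbb", "evoc"]


def installed_versions(names=None):
    """Return {distribution name: version string, or None if not found}."""
    versions = {}
    for name in (PACKAGES if names is None else names):
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name}: {version or 'not installed'}")
```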

-------
License
-------

EVōC is BSD (2-clause) licensed. See the LICENSE file for details.

------------
Contributing
------------

Contributions are more than welcome! If you have ideas for features or projects, please get in touch. Everything from
code to notebooks to examples and documentation is *equally valuable*, so please don't feel you can't contribute.
To contribute, please `fork the project <https://github.com/TutteInstitute/evoc/issues#fork-destination-box>`_, make your
changes, and submit a pull request. We will do our best to work through any issues with you and get your code merged in.

            
