.. image:: doc/evoc_logo.png
  :width: 600
  :align: center
  :alt: EVōC Logo

====
EVōC
====
EVōC (pronounced as "evoke") is Embedding Vector Oriented Clustering.
EVōC is a library for fast and flexible clustering of large datasets of high dimensional embedding vectors.
If you have CLIP vectors, or outputs from sentence-transformers, OpenAI, or Cohere embed, and you want
to quickly get good clusters out, this is the library for you. EVōC takes all the good parts of the
combination of UMAP + HDBSCAN for embedding clustering, improves upon them, and removes all
the time-consuming parts. By specializing directly to embedding vectors we can get good
quality clustering with fewer hyper-parameters to tune and in a fraction of the time.

EVōC is the library to use if you want:

* Fast clustering of embedding vectors on CPU
* Multi-granularity clustering, and automatic selection of the number of clusters
* Clustering of int8 or binary quantized embedding vectors that works out-of-the-box

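As a rough illustration of what int8 and binary quantized embeddings look like, the sketch below quantizes a float array with NumPy. The random data, the max-abs scaling scheme, and the bit-packed ``uint8`` layout are all assumptions made for the example; check how your embedding provider quantizes vectors, and what input layout EVōC expects, before relying on any of it.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for real float embeddings: 1000 vectors of dimension 64 (hypothetical data)
embeddings = rng.normal(size=(1000, 64)).astype(np.float32)

# int8 quantization (assumed scheme): scale so the largest magnitude maps to 127
scale = 127.0 / np.abs(embeddings).max()
int8_embeddings = np.round(embeddings * scale).astype(np.int8)

# Binary quantization (assumed scheme): keep one sign bit per dimension,
# then pack 8 bits into each uint8 byte
binary_embeddings = np.packbits(embeddings > 0, axis=1)

print(int8_embeddings.shape, int8_embeddings.dtype)
print(binary_embeddings.shape, binary_embeddings.dtype)
```

Binary quantization shrinks each 64-dimensional vector to 8 bytes here, which is why quantized embeddings are attractive for large datasets.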
This is very much an early beta version of the library, and things can and will break.
That said, we welcome feedback, use cases, and feature suggestions.

-----------
Basic Usage
-----------
EVōC follows the scikit-learn API, so it should be familiar to most users. You can use EVōC wherever
you might have previously been using other sklearn clustering algorithms. Here is a simple example:

.. code-block:: python

    import evoc
    from sklearn.datasets import make_blobs

    # Synthetic stand-in for a large set of high dimensional embedding vectors
    data, _ = make_blobs(n_samples=100_000, n_features=1024, centers=100)

    # Fit the clusterer and get one cluster label per input vector
    clusterer = evoc.EVoC()
    cluster_labels = clusterer.fit_predict(data)

More distinctive features include the generation of multiple layers of cluster granularity,
the ability to extract a hierarchy of clusters across those layers, and automatic detection
of duplicates (or very near duplicates).

.. code-block:: python

    import evoc
    from sklearn.datasets import make_blobs

    data, _ = make_blobs(n_samples=100_000, n_features=1024, centers=100)

    clusterer = evoc.EVoC()
    cluster_labels = clusterer.fit_predict(data)

    # Attributes available after fitting
    cluster_layers = clusterer.cluster_layers_      # label vectors, finest layer first
    hierarchy = clusterer.cluster_tree_             # hierarchy of clusters across layers
    potential_duplicates = clusterer.duplicates_    # duplicate or near-duplicate points

The cluster layers are a list of cluster label vectors with the first being the finest grained
and later layers being coarser grained. This is ideal for layered topic modelling and use with
`DataMapPlot <https://github.com/TutteInstitute/datamapplot>`_. See
`this data map <https://lmcinnes.github.io/datamapplot_examples/ArXiv_data_map_example.html>`_
for an example of using these layered clusters in topic modelling (zoom in to access finer
grained topics).
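
To get a feel for the layers, it can help to count the clusters in each one. The sketch below is a minimal, library-independent illustration: ``summarize_layers`` is a hypothetical helper (not part of EVōC), the toy lists stand in for ``clusterer.cluster_layers_``, and treating ``-1`` as the label for unclustered points is an assumption borrowed from the usual scikit-learn and HDBSCAN convention.

```python
from collections import Counter


def summarize_layers(cluster_layers, noise_label=-1):
    """Return (n_clusters, n_noise_points) for each layer, finest layer first.

    `noise_label` is assumed to mark unclustered points.
    """
    summary = []
    for labels in cluster_layers:
        counts = Counter(int(label) for label in labels)
        n_noise = counts.pop(noise_label, 0)  # drop noise so it is not counted as a cluster
        summary.append((len(counts), n_noise))
    return summary


# Toy stand-in for clusterer.cluster_layers_: two layers over seven points
toy_layers = [
    [0, 0, 1, 1, 2, 2, -1],  # finest layer
    [0, 0, 0, 0, 1, 1, -1],  # coarser layer
]

layer_summary = summarize_layers(toy_layers)
for depth, (n_clusters, n_noise) in enumerate(layer_summary):
    print(f"layer {depth}: {n_clusters} clusters, {n_noise} unclustered points")
```

With a fitted clusterer, the same helper could be called as ``summarize_layers(clusterer.cluster_layers_)``; the cluster count should shrink as you move from the first layer to later, coarser ones.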
------------
Installation
------------
EVōC has a small set of dependencies:

* numpy
* scikit-learn
* numba
* tqdm
* tbb

At some point in the near future you will be able to install EVōC from PyPI using pip:

.. code-block:: bash

    pip install evoc

For now, to install the latest version of EVōC from source, clone the repository and run:

.. code-block:: bash

    git clone https://github.com/TutteInstitute/evoc
    cd evoc
    pip install .

-------
License
-------
EVōC is BSD (2-clause) licensed. See the LICENSE file for details.
------------
Contributing
------------
Contributions are more than welcome! If you have ideas for features or projects, please get in touch. Everything from
code to notebooks to examples and documentation is *equally valuable*, so please don't feel you can't contribute.
To contribute, please `fork the project <https://github.com/TutteInstitute/evoc/issues#fork-destination-box>`_, make your
changes, and submit a pull request. We will do our best to work through any issues with you and get your code merged in.