fast-hdbscan


:Name: fast-hdbscan
:Version: 0.2.0
:Home page: https://github.com/TutteInstitute/fast_hdbscan
:Summary: A fast multicore version of hdbscan for low dimensional euclidean spaces
:Upload time: 2024-10-01 02:22:31
:Maintainer: Leland McInnes
:Author: Leland McInnes
:Requires Python: >=3.9
:License: BSD 2-Clause License
:Keywords: cluster, clustering, hdbscan, dbscan
:Requirements: numpy, numba, scikit-learn

.. -*- mode: rst -*-

.. image:: doc/hdbscan_logo.png
  :width: 600
  :alt: HDBSCAN logo
  :align: center

======================
Fast Multicore HDBSCAN
======================

The ``fast_hdbscan`` library provides a simple implementation of the HDBSCAN clustering algorithm designed specifically
for high performance on multicore machines with low-dimensional data (2D to about 20D). The algorithm runs in parallel and can make
effective use of as many cores as you wish to throw at a problem. It is thus ideal for large SMP systems, and even for
modern multicore laptops.
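
How many cores are used is normally governed by numba's thread settings rather than by this library itself. As a
minimal sketch, assuming ``fast_hdbscan`` runs on numba's default threading setup, the worker count can be capped through
the ``NUMBA_NUM_THREADS`` environment variable, which numba reads when it is first imported. The helper name
``configure_numba_threads`` is hypothetical and is not part of either library.

.. code:: python

    import os


    def configure_numba_threads(requested=None):
        """Set NUMBA_NUM_THREADS; call this before numba or fast_hdbscan is imported."""
        available = os.cpu_count() or 1
        if requested is None:
            n_threads = available
        else:
            # Clamp the request to the range [1, available].
            n_threads = max(1, min(int(requested), available))
        os.environ["NUMBA_NUM_THREADS"] = str(n_threads)
        return n_threads


    # Hypothetical usage: ask for two threads (fewer if the machine has fewer cores).
    configure_numba_threads(2)

Setting the variable after numba has already been imported has no effect on the running process, so this belongs at the
very top of a script.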

This library provides a re-implementation of a subset of the HDBSCAN algorithm that is compatible with the
`hdbscan <https://github.com/scikit-learn-contrib/hdbscan>`_ library for data that is Euclidean and
low-dimensional. The primary advantages of this library over the standard ``hdbscan`` library are:


 * this library can easily use all available cores to speed up computation;
 * this library has much faster implementations of tree condensing and cluster extraction;
 * this library is much simpler, which makes it easier to extend or to reuse components from;
 * this library is built on numba and has fewer issues with binaries and compilation.

This library does not support all the features and input formats available in the ``hdbscan`` library.

-----------
Basic Usage
-----------

The ``fast_hdbscan`` library follows the ``hdbscan`` library in using the sklearn API. You can use the ``fast_hdbscan``
class ``HDBSCAN`` exactly as you would that of the ``hdbscan`` library, with the caveat that ``fast_hdbscan`` only
supports a subset of the parameters and options of ``hdbscan``. Nonetheless, if you have low-dimensional
Euclidean data (e.g. the output of UMAP), you can use this library as a straightforward drop-in replacement for
``hdbscan``:

.. code:: python

    import fast_hdbscan
    from sklearn.datasets import make_blobs

    data, _ = make_blobs(1000)

    clusterer = fast_hdbscan.HDBSCAN(min_cluster_size=10)
    cluster_labels = clusterer.fit_predict(data)
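
The returned labels hold one integer per sample. Assuming ``fast_hdbscan`` follows the ``hdbscan`` convention, ``-1``
marks points treated as noise. The sketch below shows one way such a label array might be summarised;
``summarize_labels`` is a hypothetical helper, and the example labels are made up rather than real output.

.. code:: python

    from collections import Counter


    def summarize_labels(labels):
        """Return (cluster sizes keyed by label, fraction of samples labelled noise)."""
        counts = Counter(int(label) for label in labels)
        # Assumption: -1 is the noise label, as in the hdbscan library.
        n_noise = counts.pop(-1, 0)
        total = n_noise + sum(counts.values())
        noise_fraction = n_noise / total if total else 0.0
        return dict(counts), noise_fraction


    # Made-up labels for illustration; in practice pass ``cluster_labels``.
    sizes, noise_fraction = summarize_labels([0, 0, 1, -1, 1, 1, -1, 0])

Because each label is converted with ``int``, the helper accepts a plain list or a NumPy integer array.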

------------
Installation
------------
fast_hdbscan requires:

 * numba
 * numpy
 * scikit-learn

fast_hdbscan can be installed via pip:

.. code:: bash

    pip install fast_hdbscan

To manually install this package:

.. code:: bash

    wget https://github.com/TutteInstitute/fast_hdbscan/archive/main.zip
    unzip main.zip
    rm main.zip
    cd fast_hdbscan-main
    pip install .
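
After installing, it can be useful to confirm which versions of the package and its dependencies are present. The
following is a minimal sketch using only the standard library; ``installed_versions`` is a hypothetical helper name, and
the distribution names are taken from the requirements listed above.

.. code:: python

    from importlib.metadata import PackageNotFoundError, version


    def installed_versions(packages):
        """Map each distribution name to its installed version, or None if missing."""
        found = {}
        for name in packages:
            try:
                found[name] = version(name)
            except PackageNotFoundError:
                found[name] = None
        return found


    report = installed_versions(["fast-hdbscan", "numba", "numpy", "scikit-learn"])
    for name, installed in report.items():
        print(f"{name}: {installed or 'not installed'}")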

----------
References
----------

The algorithm used here is an adaptation of the algorithms described in the papers:

    McInnes L, Healy J. *Accelerated Hierarchical Density Based Clustering*
    In: 2017 IEEE International Conference on Data Mining Workshops (ICDMW), IEEE, pp 33-42.
    2017 `[pdf] <http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8215642>`_

    R. Campello, D. Moulavi, and J. Sander, *Density-Based Clustering Based on
    Hierarchical Density Estimates*
    In: Advances in Knowledge Discovery and Data Mining, Springer, pp 160-172.
    2013

-------
License
-------

fast_hdbscan is BSD (2-clause) licensed. See the LICENSE file for details.

------------
Contributing
------------

Contributions are more than welcome! If you have ideas for features or projects, please get in touch. Everything from
code to notebooks to examples and documentation is *equally valuable*, so please don't feel you can't contribute.
To contribute, please `fork the project <https://github.com/TutteInstitute/fast_hdbscan/issues#fork-destination-box>`_, make your
changes, and submit a pull request. We will do our best to work through any issues with you and get your code merged in.

            

Raw data

{
    "_id": null,
    "home_page": "https://github.com/TutteInstitute/fast_hdbscan",
    "name": "fast-hdbscan",
    "maintainer": "Leland McInnes",
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": "leland.mcinnes@gmail.com",
    "keywords": "cluster, clustering, hdbscan, dbscan",
    "author": "Leland McInnes",
    "author_email": "leland.mcinnes@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/e5/57/7f9f311bf2ba48f8718952a8a4cf491aaee4b906220295fe1cf213359b4e/fast_hdbscan-0.2.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "BSD 2-Clause License",
    "summary": "A fast multicore version of hdbscan for low dimensional euclidean spaces",
    "version": "0.2.0",
    "project_urls": {
        "Homepage": "https://github.com/TutteInstitute/fast_hdbscan"
    },
    "split_keywords": [
        "cluster",
        " clustering",
        " hdbscan",
        " dbscan"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "0f493593953dbe9b3826fe937b3741641fe3ecd4587f53bd5e5961406e1f4642",
                "md5": "dc8602b2f06ebfa953f6f4c615e4c5e6",
                "sha256": "7315af09f5b19a4f1a8d6113b90eb4e98813d3bc0d55f05c0f24984252e164f2"
            },
            "downloads": -1,
            "filename": "fast_hdbscan-0.2.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "dc8602b2f06ebfa953f6f4c615e4c5e6",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9",
            "size": 17743,
            "upload_time": "2024-10-01T02:22:29",
            "upload_time_iso_8601": "2024-10-01T02:22:29.912149Z",
            "url": "https://files.pythonhosted.org/packages/0f/49/3593953dbe9b3826fe937b3741641fe3ecd4587f53bd5e5961406e1f4642/fast_hdbscan-0.2.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "e5577f9f311bf2ba48f8718952a8a4cf491aaee4b906220295fe1cf213359b4e",
                "md5": "76010f06820b105219dd5c1c0bb4a1a6",
                "sha256": "002966c2adbac5170a6627ea2f685cdedcd2a1c2cb077401fba6748f0fbc39f0"
            },
            "downloads": -1,
            "filename": "fast_hdbscan-0.2.0.tar.gz",
            "has_sig": false,
            "md5_digest": "76010f06820b105219dd5c1c0bb4a1a6",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 17876,
            "upload_time": "2024-10-01T02:22:31",
            "upload_time_iso_8601": "2024-10-01T02:22:31.592879Z",
            "url": "https://files.pythonhosted.org/packages/e5/57/7f9f311bf2ba48f8718952a8a4cf491aaee4b906220295fe1cf213359b4e/fast_hdbscan-0.2.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-10-01 02:22:31",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "TutteInstitute",
    "github_project": "fast_hdbscan",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [
        {
            "name": "numpy",
            "specs": [
                [
                    ">=",
                    "1.21"
                ]
            ]
        },
        {
            "name": "numba",
            "specs": [
                [
                    ">=",
                    "0.56"
                ]
            ]
        },
        {
            "name": "scikit-learn",
            "specs": [
                [
                    ">=",
                    "1.1"
                ]
            ]
        }
    ],
    "lcname": "fast-hdbscan"
}
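
The ``sha256`` values listed under ``digests`` above can be used to check a downloaded archive before installing it. The
following is a minimal sketch using only the standard library; ``sha256_matches`` is a hypothetical helper, and the local
file name passed to it is an assumption about where the archive was saved.

.. code:: python

    import hashlib


    def sha256_matches(path, expected_hex, chunk_size=1 << 20):
        """Return True if the file's SHA-256 digest equals the expected hex string."""
        digest = hashlib.sha256()
        with open(path, "rb") as handle:
            # Read in chunks so large archives need not fit in memory.
            for chunk in iter(lambda: handle.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest() == expected_hex.strip().lower()

For the source archive, this would be called as ``sha256_matches("fast_hdbscan-0.2.0.tar.gz", expected)`` with
``expected`` set to the ``sha256`` string from the ``sdist`` entry above.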
        