# Theoretically-Efficient and Practical Parallel DBSCAN
[![arXiv](https://img.shields.io/badge/arXiv-1912.06255-b31b1b.svg)](https://arxiv.org/abs/1912.06255)
[![build](https://github.com/wangyiqiu/dbscan-python/actions/workflows/build_wheels.yml/badge.svg)](https://github.com/wangyiqiu/dbscan-python/actions/workflows/build_wheels.yml)
## Overview
This repository hosts fast parallel DBSCAN clustering code for low-dimensional Euclidean space. The code automatically uses the available threads on a parallel shared-memory machine to speed up DBSCAN clustering. It stems from a paper presented in SIGMOD'20: [Theoretically Efficient and Practical Parallel DBSCAN](https://dl.acm.org/doi/10.1145/3318464.3380582).
On a single thread, our software is on par with serial state-of-the-art DBSCAN packages, and it provides additional speedup via multi-threading. Below, we show a simple benchmark comparing our code with the DBSCAN implementation of scikit-learn, tested on a 6-core computer with 2-way hyperthreading using a 2-dimensional data set with 50,000 data points, where both implementations use all available threads. Our implementation is more than **32x** faster. We also show a visualization of the clustering result on a smaller data set.
Data sets with dimensionality 2 through 20 are supported by default; this range can be changed by modifying ``DBSCAN_MIN_DIMS`` and ``DBSCAN_MAX_DIMS`` in the [source code](https://github.com/wangyiqiu/dbscan-python/blob/master/include/dbscan/capi.h).
<p float="left">
<img src="https://raw.githubusercontent.com/wangyiqiu/dbscan-python/0.0.12-dev/compare.png" alt="timing" width="300"/>
<img src="https://raw.githubusercontent.com/wangyiqiu/dbscan-python/0.0.12-dev/example.png" alt="example" width="300"/>
</p>
## Tutorial
### Option 1: Use the Python binding
There are two ways to install it:
* Install it using PyPI: ``pip3 install --user dbscan`` (you can find the wheels [here](https://pypi.org/project/dbscan/#files)).
* To build from source for testing: run ``pip3 install -e .`` from the project root directory.
An example of using the Python module is provided in ``example.py``; it generates the clustering plot shown above.
#### Python API
```python
from dbscan import DBSCAN
labels, core_samples_mask = DBSCAN(X, eps=0.3, min_samples=10)
```
#### Input
* ``X``: A 2-D Numpy array containing the input data points. The first dimension of ``X`` is the number of data points ``n``, and the second dimension is the data set dimensionality (the maximum supported dimensionality is 20).
* ``eps``: The epsilon parameter (default 0.5).
* ``min_samples``: The minPts parameter (default 5).
#### Output
* ``labels``: A length ``n`` Numpy array (``dtype=np.int32``) containing cluster IDs of the data points, in the same ordering as the input data. Noise points are given a pseudo-ID of ``-1``.
* ``core_samples_mask``: A length ``n`` Numpy array (boolean dtype) masking the core points, in the same ordering as the input data.
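For instance, a minimal sanity check of the shapes and dtypes described above could look like this (the random input is purely illustrative):

```python
import numpy as np
from dbscan import DBSCAN

# 1000 random 2-d points; X must have shape (n, dim)
X = np.random.rand(1000, 2)
labels, core_samples_mask = DBSCAN(X, eps=0.1, min_samples=5)

print(labels.shape, labels.dtype)   # (1000,) int32
print(core_samples_mask.shape)      # (1000,)
print("noise points:", int((labels == -1).sum()))
```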
We provide a complete example below that generates a toy data set, computes the DBSCAN clustering, and visualizes the result as shown in the plot above.
```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# #############################################################################
# Generate sample data
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=750, centers=centers, cluster_std=0.4,
                            random_state=0)
X = StandardScaler().fit_transform(X)

# #############################################################################
# Compute DBSCAN

# direct call of the C API:
from dbscan import DBSCAN
labels, core_samples_mask = DBSCAN(X, eps=0.3, min_samples=10)

# OR calling our sklearn API:
# from dbscan import sklDBSCAN as DBSCAN
# db = DBSCAN(eps=0.3, min_samples=10).fit(X)
# core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
# core_samples_mask[db.core_sample_indices_] = True
# labels = db.labels_

# #############################################################################
# Plot result
import matplotlib.pyplot as plt

n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)

# Black removed and is used for noise instead.
unique_labels = set(labels)
colors = [plt.cm.Spectral(each)
          for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]

    class_member_mask = (labels == k)

    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=14)

    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)

plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()
```
### Option 2: Use the binary executable
Compile and run the program:
```bash
mkdir build
cd build
cmake ..
cd executable
make -j # this will take a while
./dbscan -eps 0.1 -minpts 10 -o clusters.txt <data-file>
```
The `<data-file>` can be any CSV-like point data file in which each line contains one data point -- see an example [here](https://github.com/wangyiqiu/hdbscan/blob/main/example-data.csv). The data file can be supplied with or without a header. The cluster output `clusters.txt` will contain one cluster ID per line (after the first-line header), giving a cluster assignment in the same ordering as the input file. A noise point will have a cluster ID of `-1`.
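For reference, a minimal hypothetical input file with 2-d points (the header line is optional) could look like:

```
x,y
0.12,0.98
0.15,1.02
5.40,5.37
```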
### Option 3: Include directly in your own C++ program
Create your own caller header and source file by instantiating the DBSCAN template function in "dbscan/algo.h".
dbscan.h:
```c++
template<int dim>
int DBSCAN(int n, double* PF, double epsilon, int minPts, bool* coreFlagOut, int* coreFlag, int* cluster);
// equivalent to
// int DBSCAN(int n, double PF[n][dim], double epsilon, int minPts, bool coreFlagOut[n], int coreFlag[n], int cluster[n])
// if C++ syntax was a little more flexible
template<>
int DBSCAN<3>(int n, double* PF, double epsilon, int minPts, bool* coreFlagOut, int* coreFlag, int* cluster);
```
dbscan.cpp:
```c++
#include "dbscan/algo.h"
#include "dbscan.h"
```
Calling the instantiated function:
```c++
int n = ...; // number of data points
double data[n][3] = ...; // data points
int labels[n]; // cluster ids are saved here
bool core_samples[n]; // core-sample flags are saved here
{
  int ignore[n];
  DBSCAN<3>(n, (double*)data, 70, 100, core_samples, ignore, labels);
}
```
Doing this compiles the function only for the dimensions you need, which saves compilation time.
You can also include "dbscan/capi.h" and define your own ``DBSCAN_MIN_DIMS`` and ``DBSCAN_MAX_DIMS`` macros, the same way the Python extension does. The exported function has the following signature.
```c++
extern "C" int DBSCAN(int dim, int n, double* PF, double epsilon, int minPts, bool* coreFlag, int* cluster);
```
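As a rough sketch (the macro values, the ``main`` scaffolding, and the sample points below are illustrative assumptions, not part of the library's documentation), a caller might look like:

```c++
// Sketch: compile the C API for 2- and 3-dimensional data only.
// The macro definitions must precede the include, mirroring the Python extension.
#define DBSCAN_MIN_DIMS 2
#define DBSCAN_MAX_DIMS 3
#include "dbscan/capi.h"

#include <cstdio>

int main() {
  // Three 2-d points in row-major layout: (0,0), (0.1,0), (5,5).
  double pts[] = {0.0, 0.0, 0.1, 0.0, 5.0, 5.0};
  const int dim = 2, n = 3;
  bool core[3];
  int cluster[3];
  // dim comes first; eps = 0.5, minPts = 2.
  int ret = DBSCAN(dim, n, pts, 0.5, 2, core, cluster);
  for (int i = 0; i < n; ++i)
    std::printf("point %d -> cluster %d (core: %d)\n", i, cluster[i], (int)core[i]);
  return ret;
}
```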
Right now, the only two files guaranteed to remain in the C/C++ API are "dbscan/algo.h" and "dbscan/capi.h", along with the functions named DBSCAN within them.
## Citation
If you use our work in a publication, we would appreciate citations:
```bibtex
@inproceedings{wang2020theoretically,
  author = {Wang, Yiqiu and Gu, Yan and Shun, Julian},
  title = {Theoretically-Efficient and Practical Parallel DBSCAN},
  year = {2020},
  isbn = {9781450367356},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3318464.3380582},
  doi = {10.1145/3318464.3380582},
  booktitle = {Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data},
  pages = {2555--2571},
  numpages = {17},
  keywords = {parallel algorithms, spatial clustering, DBScan},
  location = {Portland, OR, USA},
  series = {SIGMOD '20}
}
```