# DBOpt
DBOpt is a Python package enabling reproducible and robust parameter selection for density-based clustering algorithms. The method combines an efficient implementation of density-based cluster validation (DBCV) with Bayesian optimization to find clustering algorithm parameters that maximize the DBCV score. DBOpt is currently compatible with the density-based clustering algorithms DBSCAN, HDBSCAN, and OPTICS. For more information about the DBOpt method, see Hammer et al., preprint at https://www.biorxiv.org/content/10.1101/2024.11.01.621498v1 (2024).
## Getting Started
### Dependencies
- k-DBCV
- BayesianOptimization
- scikit-learn
- NumPy
### Installation
DBOpt can be installed via pip:
```
pip install DBOpt
```
## Usage
The DBOpt class is initialized by setting hyperparameters for the optimization. These include the algorithm to be optimized, the number of optimization iterations (runs), the number of initial parameter combinations to probe (rand_n), and the bounds of the parameter space to be searched. Each algorithm has its own set of parameters that can be optimized; more information about these parameters can be found in the corresponding scikit-learn documentation.
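The examples below assume the package's top-level module has been imported (a minimal sketch; the import name follows the calls shown in the examples):
```
import DBOpt
```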
#### DBOpt-DBSCAN
For DBSCAN, the relevant parameters are eps and min_samples. Bounds for one or both of these parameters must be set.
```
model = DBOpt.DBOpt(algorithm = 'DBSCAN', runs = 200, rand_n = 40,
eps = [3,200], min_samples = [3,200])
```
Parameters can be held constant:
```
model = DBOpt.DBOpt(algorithm = 'DBSCAN', runs = 200, rand_n = 40,
eps = [4,200], min_samples = 6)
```
#### DBOpt-HDBSCAN
HDBSCAN has two primary parameters, min_cluster_size and min_samples.
```
model = DBOpt.DBOpt(algorithm = 'HDBSCAN', runs = 200, rand_n = 40,
min_cluster_size = [4,200], min_samples = [4,200])
```
DBOpt is capable of optimizing additional parameters for HDBSCAN, including cluster_selection_epsilon, cluster_selection_method, and alpha.
In cases like this, where the parameter ranges differ greatly in size, it can be helpful to scale all parameters identically by setting scale_params = True (False by default).
```
model = DBOpt.DBOpt(algorithm = 'HDBSCAN', runs = 200, rand_n = 40,
                    min_cluster_size = [4,200], min_samples = [4,200],
                    eps = [0,200], method = [0,1], alpha = [0,1],
                    scale_params = True)
```
#### DBOpt-OPTICS
OPTICS can currently be optimized with the xi cluster extraction method; the relevant parameters are xi and min_samples.
```
model = DBOpt.DBOpt(algorithm = 'OPTICS', runs = 200, rand_n = 40,
xi = [0.05,0.5], min_samples = [4,200])
```
### Optimizing the parameters
#### Importing Data
The input data are coordinates, which may be multidimensional. Here we use the C01 simulation from the data folder.
<p align="center">
<img width=45% height=45% src="https://github.com/user-attachments/assets/e72dfc14-34ab-484f-816d-bf8d8e46da21">
</p>
We create a 2D array X with x positions in column 0 and y positions in column 1.
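For example, the coordinates could be loaded with NumPy (a minimal sketch; the file name and delimiter are placeholders, not part of the package):
```
import numpy as np

# Load the simulated localizations into a 2D array:
# column 0 = x positions, column 1 = y positions.
X = np.loadtxt('C01.csv', delimiter=',')  # placeholder file name
```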
#### Optimizing parameters for the data
Once hyperparameters have been set, the algorithm parameters can be optimized for the data.
```
model.optimize(X)
```
Information about the chosen parameters and the full parameter sweep can be extracted after optimizing.
```
parameter_sweep_arr = model.parameter_sweep_
DBOpt_selected_parameters = model.parameters_
```
The optimization can be plotted:
```
parameter_sweep_plot = model.plot_optimization()
```
<p align="center">
<img width=60% height=60% src="https://github.com/user-attachments/assets/1487a4c1-44cf-4d0f-9913-a00ae383d1a1">
</p>
### Clustering
The data is clustered via the fit function.
```
model.fit(X)
```
The optimization step and fit step can be performed together:
```
model.optimize_fit(X)
```
After fitting, the cluster labels and DBCV score can be retrieved:
```
labels = model.labels_
DBCV_score = model.DBCV_score_
```
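As a quick check, the labels can be summarized with NumPy (a minimal sketch; it assumes noise points are labeled -1, following the scikit-learn convention for DBSCAN, HDBSCAN, and OPTICS):
```
import numpy as np

n_clusters = np.unique(labels[labels != -1]).size  # assumes noise is labeled -1
n_noise = int(np.sum(labels == -1))
print(f'{n_clusters} clusters, {n_noise} noise points, DBCV score = {DBCV_score:.3f}')
```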
The clusters can be plotted with plot_clusters. The show_noise argument determines whether noise points are shown (default True), and setting ind_cluster_scores = True colormaps clusters by their individual cluster scores instead of coloring them randomly (default False):
```
cluster_plot = model.plot_clusters()
```
<p align="center">
<img width=40% height=40% src="https://github.com/user-attachments/assets/fbee5fe3-5f78-450e-a79b-11631b96543c">
</p>
```
cluster_plot_modified = model.plot_clusters(show_noise = True, ind_cluster_scores = True)
```
<p align="center">
<img width=50% height=50% src="https://github.com/user-attachments/assets/46e5a5bd-f0ab-42ee-b228-ed1906ca6e10">
</p>
## License
DBOpt is licensed under the MIT license. See the LICENSE file for more information.
## Referencing
If you use DBOpt in your work, please cite the following (currently a preprint):
Hammer, J. L., Devanny, A. J. & Kaufman, L. J. Density-based optimization for unbiased, reproducible clustering applied to single molecule localization microscopy. Preprint at https://www.biorxiv.org/content/10.1101/2024.11.01.621498v1 (2024)
## Contact
kaufmangroup.rubylab@gmail.com