# ClusOpt Core
<a href="https://pypi.python.org/pypi/clusopt_core"><img src="https://img.shields.io/pypi/v/clusopt_core.svg"></a>
This package is used by [ClusOpt](https://github.com/giuliano-oliveira/clusopt) for
its CPU-intensive tasks, but it can easily be imported into any Python data stream clustering project.
It is written mainly in C/C++ with bindings for Python, and features:
* CluStream (based on the MOA implementation)
* StreamKM++ (wrapped around the original paper authors' implementation)
* Distance Matrix computation (in-place implementation using boost threads)
* Silhouette score (custom in-place implementation inspired by BIRCH clustering vector)
## Prerequisites
* python >= 3.6
* pip
* boost-thread
* gcc >= 6
`boost-thread` can be installed on Debian-based systems with:
```bash
apt install libboost-thread-dev
```
## Usage
See the `examples` folder for more.
### CluStream online clustering
```python
from clusopt_core.cluster import CluStream
from sklearn.datasets import make_blobs
import numpy as np
import matplotlib.pyplot as plt

k = 32  # number of blobs, reused as the number of macro clusters

dataset, _ = make_blobs(n_samples=64000, centers=k, random_state=42, cluster_std=0.1)

model = CluStream(
    m=k * 10,  # number of microclusters
    h=64000,  # horizon
    t=2,  # radius factor
)

# split the dataset into equal chunks of 4000 samples to simulate a stream
chunks = np.split(dataset, len(dataset) // 4000)

# offline initialization on the first chunk
model.init_offline(chunks.pop(0), seed=42)

# feed the remaining chunks one at a time
for chunk in chunks:
    model.partial_fit(chunk)

clusters, _ = model.get_macro_clusters(k, seed=42)

plt.scatter(*dataset.T, marker=",", label="datapoints")
plt.scatter(*model.get_partial_cluster_centers().T, marker=".", label="microclusters")
plt.scatter(*clusters.T, marker="x", label="macro clusters", color="black")

plt.legend()
plt.show()
```
Output:
![clustream clustering results](https://github.com/giuliano-oliveira/clusopt_core/blob/master/examples/clustream.jpeg?raw=true)
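The example above relies on `np.split`, which raises an error when the dataset cannot be divided into equal sections. As a minimal sketch for streams of arbitrary length, the hypothetical helper `iter_chunks` below (not part of clusopt_core) yields fixed-size slices and lets the last one be shorter. It uses only slicing, so it works on lists and NumPy arrays alike.

```python
def iter_chunks(samples, chunk_size):
    """Yield consecutive slices of `samples` holding at most `chunk_size` items.

    Hypothetical helper for illustration; it is not part of clusopt_core.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(samples), chunk_size):
        # the final slice is shorter when len(samples) is not a multiple
        yield samples[start:start + chunk_size]


if __name__ == "__main__":
    stream = list(range(10))
    for chunk in iter_chunks(stream, 4):
        print(chunk)
```

Following the pattern above, the first chunk would go to `init_offline` and the rest to `partial_fit`. This README does not state whether `partial_fit` accepts a shorter final chunk, so treat that as something to verify.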
## Benchmarks
Some functions in clusopt_core are faster than their scikit-learn counterparts; see the `benchmark` folder for more info.
### Silhouette
Each bar is labeled with a tuple of (no_samples, dimension, no_groups). The clusopt_core implementation is faster in every configuration shown, independently of those three factors.
![clusopt silhouette versus scikit learn silhouette execution time](https://github.com/giuliano-oliveira/clusopt_core/blob/master/benchmark/silhouette.jpeg?raw=true)
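For reference, the quantity being timed can be written down directly. The sketch below is the textbook O(n²) definition of the mean silhouette coefficient with Euclidean distance, in pure Python. It is not the clusopt_core implementation (which is described above as a custom in-place C/C++ implementation), and the function names are illustrative only.

```python
from math import sqrt


def euclidean(p, q):
    """Euclidean distance between two equal-length coordinate sequences."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (textbook definition)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # singleton clusters contribute 0 by convention
        # a: mean distance to the other members of the same cluster
        a = sum(euclidean(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        if denom > 0:  # identical points everywhere would give 0 / 0
            total += (b - a) / denom
    return total / len(labels)


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (10, 0), (10, 1)]
    print(silhouette_score(pts, [0, 0, 1, 1]))  # well separated: close to 1
    print(silhouette_score(pts, [0, 1, 0, 1]))  # poor grouping: negative
```

Every point is compared against every other point, which is why execution time grows quickly with the number of samples.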
### Distance Matrix
Each bar is labeled with the dataset dimension. The clusopt_core implementation is faster when the dataset dimension is small (below roughly 150), even when scikit-learn uses 4 processes.
![clusopt distance matrix versus scikit learn pairwise distance in execution time](https://github.com/giuliano-oliveira/clusopt_core/blob/master/benchmark/dist_matrix.jpeg?raw=true)
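To make the benchmarked operation concrete, here is a minimal pure-Python sketch of a pairwise distance matrix. It assumes Euclidean distance, which this README does not state explicitly, and it is single-threaded, unlike the boost-threads implementation in clusopt_core. The function name is illustrative.

```python
from math import sqrt


def distance_matrix(points):
    """Return the symmetric n x n matrix of Euclidean distances."""
    n = len(points)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # only the upper triangle is computed; the lower one is mirrored
        for j in range(i + 1, n):
            d = sqrt(sum((a - b) ** 2 for a, b in zip(points[i], points[j])))
            matrix[i][j] = d
            matrix[j][i] = d
    return matrix


if __name__ == "__main__":
    for row in distance_matrix([(0, 0), (3, 4), (6, 8)]):
        print(row)
```

The cost of each entry grows linearly with the number of dimensions, which is the factor varied in the chart above.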
## Installation
You can install it directly from PyPI with:
```bash
pip install clusopt-core
```
Alternatively, clone this repo and install from the directory:
```bash
pip install ./clusopt_core
```
## Acknowledgments
#### Thanks to:
* **Marcel R. Ackermann et al.** for the StreamKM++ algorithm - [link](https://cs.uni-paderborn.de/cuk/forschung/abgeschlossene-projekte/dfg-schwerpunktprogramm-1307/streamkm/)
* **The University of Waikato** for the MOA framework - [link](https://moa.cms.waikato.ac.nz/)