ckmeans


| Field | Value |
| --- | --- |
| Name | ckmeans |
| Version | 0.2.7 |
| Summary | Optimal univariate (1D) clustering based on Ckmeans.1d.dp |
| Upload time | 2024-07-13 17:16:09 |
| Requires Python | >=3.10 |
| Keywords | ckmeans, clustering, jenks |

# CKmeans: Optimal Univariate Clustering

Ckmeans clustering is an improvement on 1-dimensional (univariate) heuristic-based clustering approaches such as [Jenks](https://en.wikipedia.org/wiki/Jenks_natural_breaks_optimization). The algorithm was developed by [Haizhou Wang and Mingzhou Song](http://journal.r-project.org/archive/2011-2/RJournal_2011-2_Wang+Song.pdf) (2011) as a [dynamic programming](https://en.wikipedia.org/wiki/Dynamic_programming) approach to the problem of clustering numeric data into groups with the least within-group sum-of-squared-deviations.

Minimizing the difference within groups – what Wang & Song refer to as `withinss`, or within sum-of-squares – means that each group is as homogeneous as possible and the data is split into representative groups. This is very useful for visualization, where one may wish to represent a continuous variable in discrete colour or style groups. This function can provide groups that emphasize differences between data.

Being a dynamic programming approach, this algorithm is based on two matrices: one stores incrementally computed sums of squared deviations, and the other stores the backtracking indexes used to recover the clusters.
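
To make the role of the two matrices concrete, here is a minimal pure-Python sketch of a dynamic programme for this problem. It is a simple O(k·n²) illustration and not the code used by this library or by the Rust crate; the names `ckmeans_sketch`, `cost` and `back` are made up for the example.

```python
def ckmeans_sketch(values, k):
    """Illustrative O(k * n^2) dynamic programme; not the library's code."""
    xs = sorted(values)
    n = len(xs)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(values)")

    # Prefix sums give the sum of squared deviations of xs[i..j] in O(1).
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, x in enumerate(xs):
        s1[i + 1] = s1[i] + x
        s2[i + 1] = s2[i] + x * x

    def ssq(i, j):
        # Sum of squared deviations from the mean of xs[i..j], inclusive.
        m = j - i + 1
        total = s1[j + 1] - s1[i]
        return max(s2[j + 1] - s2[i] - total * total / m, 0.0)

    inf = float("inf")
    # cost[q][j]: least withinss for xs[0..j] split into q + 1 clusters.
    # back[q][j]: index at which the last of those clusters starts.
    cost = [[inf] * n for _ in range(k)]
    back = [[0] * n for _ in range(k)]
    for j in range(n):
        cost[0][j] = ssq(0, j)
    for q in range(1, k):
        for j in range(q, n):
            for i in range(q, j + 1):
                c = cost[q - 1][i - 1] + ssq(i, j)
                if c < cost[q][j]:
                    cost[q][j] = c
                    back[q][j] = i

    # Walk the backtracking matrix from the last cluster to the first.
    clusters = []
    end = n - 1
    for q in range(k - 1, -1, -1):
        start = back[q][end]
        clusters.append(xs[start:end + 1])
        end = start - 1
    clusters.reverse()
    return clusters
```

Because the input is sorted first, every cluster is a contiguous run of the sorted data, which is why a single start index per cell is enough to reconstruct the result.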

Unlike the [original implementation](https://cran.r-project.org/web/packages/Ckmeans.1d.dp/index.html), this implementation does not include any code to automatically determine the optimal number of clusters: this number must be provided explicitly. It **does** provide the `breaks` function to aid labelling, however.

## Implementation
This library uses the [`ckmeans`](https://crates.io/crates/ckmeans) Rust crate, by the same author, and exposes the `ckmeans` and `breaks` functions.

### `ckmeans(data, k)`
Cluster data into `k` bins.

Minimizing the difference within groups – what Wang & Song refer to as `withinss`,
or within sum-of-squares – means that each group is as homogeneous as possible and the data are
split into representative groups. This is very useful for visualization, where one may wish to
represent a continuous variable in discrete colour or style groups. This function can provide
groups – or “classes” – that emphasize differences between data.
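
As an illustration of the quantity being minimized, the sketch below computes the within sum-of-squares for two candidate groupings of the same data. The helper name `withinss` is borrowed from the term above and is not a function exported by this library.

```python
def withinss(clusters):
    """Sum of squared deviations of each value from its own cluster mean."""
    total = 0.0
    for cluster in clusters:
        mean = sum(cluster) / len(cluster)
        total += sum((x - mean) ** 2 for x in cluster)
    return total


# Two ways of splitting the same eight values into two groups.
good = [[1.0, 2.0, 3.0, 4.0], [100.0, 101.0, 102.0, 103.0]]
bad = [[1.0, 2.0], [3.0, 4.0, 100.0, 101.0, 102.0, 103.0]]

# The first split keeps similar values together, so its score is far lower.
assert withinss(good) < withinss(bad)
```

An optimal clustering for a given `k` is one for which no other split into `k` groups yields a smaller value.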


### `breaks(data, k)`
Calculate `k - 1` breaks in the data, distinguishing classes for labelling or visualisation.

The boundaries of the classes returned by `ckmeans` are “ugly” in the sense that the values
returned are the lower bound of each cluster, which aren't always practical for labelling, since they
may have many decimal places. To create a legend, the values should be rounded, but the
rounding might be either too loose (resulting in spurious decimal places) or too
strict (resulting in classes ranging “from `x` to `x`”). A better approach is to choose the roundest
number that separates the lowest point of a class from the highest point in the preceding
class, thus giving just enough precision to distinguish the classes.
This function is closer to what Jenks returns: `k - 1` “breaks” in the data, useful for labelling.

This method is a port of the [visionscarto](https://observablehq.com/@visionscarto/natural-breaks#round) method of the same name.
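
The idea of "just enough precision" can be sketched as follows. This is one simple rule chosen for illustration (round the midpoint down to ever finer powers of ten until it falls strictly between the two points); it is an assumption, not necessarily the rule used by this library or by the visionscarto method, and `round_break` is a hypothetical name.

```python
import math


def round_break(low, high):
    """Round the midpoint of (low, high) as coarsely as still separates them.

    `low` is the highest point of one class, `high` the lowest point of the next.
    """
    if not low < high:
        raise ValueError("low must be smaller than high")
    midpoint = (low + high) / 2
    # Start with a power of ten at least as large as the gap, then refine.
    start = math.ceil(math.log10(high - low))
    for exponent in range(start, start - 16, -1):
        if exponent >= 0:
            step = 10 ** exponent
            candidate = math.floor(midpoint / step) * step
        else:
            scale = 10 ** -exponent
            candidate = math.floor(midpoint * scale) / scale
        if low < candidate < high:
            return float(candidate)
    # The points are too close together to round usefully.
    return midpoint


print(round_break(4.0, 100.0))  # 50.0
```

For the gap between `4.0` and `100.0`, rounding the midpoint `52.0` to the nearest lower hundred gives `0`, which does not separate the classes, while rounding to the nearest lower ten gives `50.0`, which does.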

## Benchmarks
Install optional dependencies, then run `benchmark.py`.

[ckmeans-1d-dp](https://pypi.org/project/ckmeans-1d-dp/) is about 20% faster, but note that it only returns the _indices_ of the cluster to which each input value belongs; if you actually want to cluster your data, you need to do that yourself, which I strongly suspect might be slower overall. On the other hand, if all you want is indices, it may be a better choice.
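
For reference, the extra grouping step looks roughly like the sketch below. The `labels` list is hand-written for the example and only stands in for index-style output; the exact shape and numbering returned by an index-based library are assumptions here.

```python
def group_by_label(values, labels):
    """Collect values into one list per cluster label, ordered by label."""
    if len(values) != len(labels):
        raise ValueError("values and labels must have the same length")
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    return [groups[label] for label in sorted(groups)]


data = [1.0, 100.0, 2.0, 101.0, 3.0, 102.0]
labels = [0, 1, 0, 1, 0, 1]  # hypothetical index output, one label per value
clusters = group_by_label(data, labels)
print(clusters)  # [[1.0, 2.0, 3.0], [100.0, 101.0, 102.0]]
```

This pass is linear in the size of the input, but it runs in Python rather than in compiled code, which is where the additional cost comes from.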

# Examples
```python
from ckmeans import ckmeans
import numpy as np


data = np.array([1.0, 2.0, 3.0, 4.0, 100.0, 101.0, 102.0, 103.0])
clusters = 2
result = ckmeans(data, clusters)
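expected = [
    np.array([1.0, 2.0, 3.0, 4.0]),
    np.array([100.0, 101.0, 102.0, 103.0]),
]
# Comparing lists of arrays with == raises ValueError, so compare cluster by cluster
assert len(result) == len(expected)
assert all(np.array_equal(r, e) for r, e in zip(result, expected))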
```

```python
from ckmeans import breaks
import numpy as np


data = np.array([1.0, 2.0, 3.0, 4.0, 100.0, 101.0, 102.0, 103.0])
clusters = 2
result = breaks(data, clusters)
assert result == [50.0,]
```
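
Once breaks are available, they can be used to assign each value to a class and to build legend labels. The sketch below uses only the standard library and hard-codes the break from the example above instead of calling `breaks`; the label format is an arbitrary choice.

```python
from bisect import bisect_right

data = [1.0, 2.0, 3.0, 4.0, 100.0, 101.0, 102.0, 103.0]
class_breaks = [50.0]  # hard-coded here; normally the output of breaks(data, 2)

# Class index of each value: 0 for values below the first break, and so on.
classes = [bisect_right(class_breaks, x) for x in data]
print(classes)  # [0, 0, 0, 0, 1, 1, 1, 1]

# Legend labels spanning from the data minimum to the data maximum.
edges = [min(data), *class_breaks, max(data)]
legend = [f"{lo:g} to {hi:g}" for lo, hi in zip(edges, edges[1:])]
print(legend)  # ['1 to 50', '50 to 103']
```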
# License
[Blue Oak Model License 1.0.0](license.txt)


            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "ckmeans",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": null,
    "keywords": "ckmeans, clustering, jenks",
    "author": null,
    "author_email": "Stephan H\u00fcgel <urschrei@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/33/37/335d9b70ef8587199ff75647cb6c7466a3f7189d11ad402317e64e407dfd/ckmeans-0.2.7.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "Optimal univariate (1D) clustering based on Ckmeans.1d.dp",
    "version": "0.2.7",
    "project_urls": {
        "Repository": "https://github.com/urschrei/ckmeans_py",
        "Tracker": "https://github.com/urschrei/ckmeans_py/issues"
    },
    "split_keywords": [
        "ckmeans",
        " clustering",
        " jenks"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "168a53ee026b337e7ee9d485ff2d9e9cd4386329a9161005d4e09ca884dfbffa",
                "md5": "e9245e4ad9e96fe804c456aed619b639",
                "sha256": "ba2a06e66048bbf9941259b5b0587687a60b45d193ec0fd493bedc6580261cbc"
            },
            "downloads": -1,
            "filename": "ckmeans-0.2.7-cp310-abi3-macosx_10_12_x86_64.whl",
            "has_sig": false,
            "md5_digest": "e9245e4ad9e96fe804c456aed619b639",
            "packagetype": "bdist_wheel",
            "python_version": "cp310",
            "requires_python": ">=3.10",
            "size": 228794,
            "upload_time": "2024-07-13T17:15:58",
            "upload_time_iso_8601": "2024-07-13T17:15:58.358222Z",
            "url": "https://files.pythonhosted.org/packages/16/8a/53ee026b337e7ee9d485ff2d9e9cd4386329a9161005d4e09ca884dfbffa/ckmeans-0.2.7-cp310-abi3-macosx_10_12_x86_64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "8b4a7b0c892f280eee9894caef7adc9886b6d3b3b9bff4ec2f6021e023b98a6b",
                "md5": "ddbf9470987129ef0255181e3ea6157f",
                "sha256": "2c90de486a1e3916070c117d47fa0b283fcca9483226d161618da2dd279fac64"
            },
            "downloads": -1,
            "filename": "ckmeans-0.2.7-cp310-abi3-macosx_11_0_arm64.whl",
            "has_sig": false,
            "md5_digest": "ddbf9470987129ef0255181e3ea6157f",
            "packagetype": "bdist_wheel",
            "python_version": "cp310",
            "requires_python": ">=3.10",
            "size": 223942,
            "upload_time": "2024-07-13T17:15:59",
            "upload_time_iso_8601": "2024-07-13T17:15:59.980184Z",
            "url": "https://files.pythonhosted.org/packages/8b/4a/7b0c892f280eee9894caef7adc9886b6d3b3b9bff4ec2f6021e023b98a6b/ckmeans-0.2.7-cp310-abi3-macosx_11_0_arm64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "5d1ccbff530372a34194e9a546bace2eb9983ad54ab7ba6861bf3dac004e662f",
                "md5": "2fb17fba3b21e24990fe9fb02d460a8f",
                "sha256": "c9df2be8fd8aed92d795f1a432165b6d36c0e340137c3c52395ee2e3dbdc8c98"
            },
            "downloads": -1,
            "filename": "ckmeans-0.2.7-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl",
            "has_sig": false,
            "md5_digest": "2fb17fba3b21e24990fe9fb02d460a8f",
            "packagetype": "bdist_wheel",
            "python_version": "cp310",
            "requires_python": ">=3.10",
            "size": 260239,
            "upload_time": "2024-07-13T17:16:01",
            "upload_time_iso_8601": "2024-07-13T17:16:01.565378Z",
            "url": "https://files.pythonhosted.org/packages/5d/1c/cbff530372a34194e9a546bace2eb9983ad54ab7ba6861bf3dac004e662f/ckmeans-0.2.7-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "0bebed1797e30e0190fccae08e12e71052633fad75aa656c8ae1b96ad5869562",
                "md5": "ca5a3f34961b98756bd548feb5b6771a",
                "sha256": "9d8cdb28cbc345170c37253c2f039d9b7225502dfdac01095cdac0d5082872a0"
            },
            "downloads": -1,
            "filename": "ckmeans-0.2.7-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
            "has_sig": false,
            "md5_digest": "ca5a3f34961b98756bd548feb5b6771a",
            "packagetype": "bdist_wheel",
            "python_version": "cp310",
            "requires_python": ">=3.10",
            "size": 253743,
            "upload_time": "2024-07-13T17:16:03",
            "upload_time_iso_8601": "2024-07-13T17:16:03.181225Z",
            "url": "https://files.pythonhosted.org/packages/0b/eb/ed1797e30e0190fccae08e12e71052633fad75aa656c8ae1b96ad5869562/ckmeans-0.2.7-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "e2256143edb8f68cdb3d0b3c94fc6a9c91a331c4ef02d08023b7ca2d0ec79538",
                "md5": "fc8568d38147291d45be04fcaf23a0d4",
                "sha256": "6904243a35dfee2b36d5e1cd4702b540c2ace424c1bc4ab1156823b9b2fe5c2b"
            },
            "downloads": -1,
            "filename": "ckmeans-0.2.7-cp310-abi3-manylinux_2_5_i686.manylinux1_i686.whl",
            "has_sig": false,
            "md5_digest": "fc8568d38147291d45be04fcaf23a0d4",
            "packagetype": "bdist_wheel",
            "python_version": "cp310",
            "requires_python": ">=3.10",
            "size": 258048,
            "upload_time": "2024-07-13T17:16:04",
            "upload_time_iso_8601": "2024-07-13T17:16:04.765745Z",
            "url": "https://files.pythonhosted.org/packages/e2/25/6143edb8f68cdb3d0b3c94fc6a9c91a331c4ef02d08023b7ca2d0ec79538/ckmeans-0.2.7-cp310-abi3-manylinux_2_5_i686.manylinux1_i686.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "f71ad3fbc64971e5b7ed69c872211b06cc9c709e61164bdb66f1097609da4d2b",
                "md5": "590416d0099097cd086aa550cf5663dd",
                "sha256": "f34e70fa5bedbccac64976a517d16ce7f31c89d30f3342f733f7908ace49d9f6"
            },
            "downloads": -1,
            "filename": "ckmeans-0.2.7-cp310-abi3-win32.whl",
            "has_sig": false,
            "md5_digest": "590416d0099097cd086aa550cf5663dd",
            "packagetype": "bdist_wheel",
            "python_version": "cp310",
            "requires_python": ">=3.10",
            "size": 134364,
            "upload_time": "2024-07-13T17:16:05",
            "upload_time_iso_8601": "2024-07-13T17:16:05.880676Z",
            "url": "https://files.pythonhosted.org/packages/f7/1a/d3fbc64971e5b7ed69c872211b06cc9c709e61164bdb66f1097609da4d2b/ckmeans-0.2.7-cp310-abi3-win32.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "98fa74cb270eaa0ce19a0ff474812999d88a4389e1358c2bcef4895f3785dadb",
                "md5": "eaa16859b3686e16eaeef231d572aed4",
                "sha256": "f352236cc3c233c7bee66b7371bbc2aa5e2c67f0c899cb688fc808b5a7fffb29"
            },
            "downloads": -1,
            "filename": "ckmeans-0.2.7-cp310-abi3-win_amd64.whl",
            "has_sig": false,
            "md5_digest": "eaa16859b3686e16eaeef231d572aed4",
            "packagetype": "bdist_wheel",
            "python_version": "cp310",
            "requires_python": ">=3.10",
            "size": 146028,
            "upload_time": "2024-07-13T17:16:07",
            "upload_time_iso_8601": "2024-07-13T17:16:07.558019Z",
            "url": "https://files.pythonhosted.org/packages/98/fa/74cb270eaa0ce19a0ff474812999d88a4389e1358c2bcef4895f3785dadb/ckmeans-0.2.7-cp310-abi3-win_amd64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "3337335d9b70ef8587199ff75647cb6c7466a3f7189d11ad402317e64e407dfd",
                "md5": "7e63080a15dd9c7ec778a7c7ceb1f1e0",
                "sha256": "27aae4cd1b5d934cd0da76e9b7835470726b6d9e22d348743b109b90d9af1621"
            },
            "downloads": -1,
            "filename": "ckmeans-0.2.7.tar.gz",
            "has_sig": false,
            "md5_digest": "7e63080a15dd9c7ec778a7c7ceb1f1e0",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 10032,
            "upload_time": "2024-07-13T17:16:09",
            "upload_time_iso_8601": "2024-07-13T17:16:09.155288Z",
            "url": "https://files.pythonhosted.org/packages/33/37/335d9b70ef8587199ff75647cb6c7466a3f7189d11ad402317e64e407dfd/ckmeans-0.2.7.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-07-13 17:16:09",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "urschrei",
    "github_project": "ckmeans_py",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "ckmeans"
}
        