multi-mst


Namemulti-mst JSON
Version 0.1.4 PyPI version JSON
download
home_pagehttps://github.com/vda-lab/multi_mst
SummaryMinimum spanning tree based manifold approximations.
upload_time2024-12-18 22:16:31
maintainerJelmer Bot
docs_urlNone
authorJelmer Bot
requires_python>=3.9
licenseBSD 2-Clause License
keywords minimum spanning tree dimensionality reduction
VCS
bugtrack_url
requirements numpy scipy scikit-learn umap-learn numba pynndescent
Travis-CI No Travis.
coveralls test coverage No coveralls.
            [![PyPI version](https://badge.fury.io/py/multi-mst.svg)](https://badge.fury.io/py/multi-mst)
[![Tests](https://github.com/vda-lab/multi_mst/actions/workflows/Tests.yml/badge.svg?branch=main)](https://github.com/vda-lab/multi_mst/actions/workflows/Tests.yml)
[![DOI](https://zenodo.org/badge/782003801.svg)](https://doi.org/10.5281/zenodo.13929036)

Manifold Modelling with Minimum Spanning Trees
==============================================

Dimensionality reduction (DR) algorithms typically assume the data they are
given is uniformly sampled from some underlying manifold. When this is not the
case, and there are observation-gaps along the manifold, these algorithms may
fail to detect a single connected entity. This repository presents two manifold
approximation approaches based on minimum spanning trees (MST) for non-uniform
sampled data. 

Noisy Minimum Spanning Tree Union
---------------------------------

The noisy minimum spanning tree union ($n$-MST) is inspired by Pathfinder
networks that, with a specific parameter selection, yield the union set of all
possible MSTs in a network (see, e.g., [[1]], [[2]]). We compute noisy MSTs to
detect alternative connectivity at all distance scales for distances which may
have few identically weighted connections.

We add Gaussian noise ($\mu=0$) to every candidate edge. The noise parameter $n$
is specified as a fraction of the points' nearest neighbour distance and
controls the Gaussian's standard deviation. This formulation makes the noise
scale with the data's density to avoid adding more edges in dense regions than
sparse regions, retaining a reasonably uniform manifold approximation graph.

```python
import matplotlib.pyplot as plt
import matplotlib.collections as mc
from sklearn.datasets import make_swiss_roll
from multi_mst.noisy_mst import NoisyMST

X, t = make_swiss_roll(n_samples=2000, noise=0.5, hole=True)
projector = NoisyMST(num_trees=10, noise_fraction=1.0).fit(X)

# Drawing the network
xs = projector.embedding_[:, 0]
ys = projector.embedding_[:, 1]
coo_matrix = projector.graph_.tocoo()
sources = coo_matrix.row
targets = coo_matrix.col

plt.figure(figsize=(4, 3))
plt.scatter(xs, ys, c=t, s=1, edgecolors="none", linewidth=0, cmap="viridis")
lc = mc.LineCollection(
    list(zip(zip(xs[sources], ys[sources]), zip(xs[targets], ys[targets]))),
    linewidth=0.2,
    zorder=-1,
    alpha=0.5,
    color="k",
)
ax = plt.gca()
ax.add_collection(lc)
ax.set_aspect("equal")
plt.subplots_adjust(0, 0, 1, 1)
plt.axis("off")
plt.show()
```
![noisy_mst](./doc/_static/noisy_mst.png)

$k$-Nearest Minimum Spanning Tree 
---------------------------------

The k-nearest Minimum Spanning Tree ($k$-MST) generalises $k$-nearest neighbour
networks ($k$-NN) to minimum spanning trees. It adds the $k$ shortest edges
between components. Since data points start as distinct components, all $k$-NN
edges are included in the kMST.  

To avoid creating shortcuts in the manifold, a distance threshold $\epsilon$ can
be applied. The parameter is specified as a fraction of the shortest edge
between components and provides an upper distance limit for the $2$-to-$k$
alternative edges.

```python
import matplotlib.pyplot as plt
import matplotlib.collections as mc
from sklearn.datasets import make_swiss_roll
from multi_mst.k_mst import KMST

X, t = make_swiss_roll(n_samples=2000, noise=0.5, hole=True)
projector = KMST(num_neighbors=3, epsilon=2.0).fit(X)

# Drawing the network
xs = projector.embedding_[:, 0]
ys = projector.embedding_[:, 1]
coo_matrix = projector.graph_.tocoo()
sources = coo_matrix.row
targets = coo_matrix.col

plt.figure(figsize=(4, 3))
plt.scatter(xs, ys, c=t, s=1, edgecolors="none", linewidth=0, cmap="viridis")
lc = mc.LineCollection(
    list(zip(zip(xs[sources], ys[sources]), zip(xs[targets], ys[targets]))),
    linewidth=0.2,
    zorder=-1,
    alpha=0.5,
    color="k",
)
ax = plt.gca()
ax.add_collection(lc)
ax.set_aspect("equal")
plt.subplots_adjust(0, 0, 1, 1)
plt.axis("off")
plt.show()
```
![k_mst](./doc/_static/k_mst.png)


Approximate $k$-MST
-------------------

Computing $k$-MSTs using KDTrees can be expensive on some datasets. We provide a
version of the algorithm based on Nearest Neighbour Descent for quicker
approximations. We combined Boruvka's algorithm with NNDescent to find
neighbours that are not already connected in the MST being build.


```python
import matplotlib.pyplot as plt
import matplotlib.collections as mc
from sklearn.datasets import make_swiss_roll
from multi_mst.k_mst_descent import KMSTDescent

X, t = make_swiss_roll(n_samples=2000, noise=0.5, hole=True)
projector = KMSTDescent(num_neighbors=3, epsilon=2.0).fit(X)

# Draw the network
xs = projector.embedding_[:, 0]
ys = projector.embedding_[:, 1]
coo_matrix = projector.graph_.tocoo()
sources = coo_matrix.row
targets = coo_matrix.col

plt.figure(figsize=(4, 3))
plt.scatter(xs, ys, c=t, s=1, edgecolors="none", linewidth=0, cmap="viridis")
lc = mc.LineCollection(
    list(zip(zip(xs[sources], ys[sources]), zip(xs[targets], ys[targets]))),
    linewidth=0.2,
    zorder=-1,
    alpha=0.5,
    color="k",
)
ax = plt.gca()
ax.add_collection(lc)
ax.set_aspect("equal")
plt.subplots_adjust(0, 0, 1, 1)
plt.axis("off")
plt.show()
```
![k_mst](./doc/_static/k_mst_descent.png)



Installation Instructions
-------------------------

The `multi_mst` package can be installed from pypi:

```bash
pip install multi_mst
```

Acknowledgements
----------------

Most code---including the numba KDTree, disjoint set and Boruvka MST
construction implementation---is adapted from
[fast_hdbscan](https://github.com/TutteInstitute/fast_hdbscan).


License
-------

`multi_mst` uses the same license as `fast_hdbscan`: BSD (2-clause). See the
LICENSE file for details.


[1]: <https://onlinelibrary.wiley.com/doi/10.1002/asi.20904> "Pathfinder Networks"
[2]: <https://ieeexplore.ieee.org/document/8231853> "GraphRay"

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/vda-lab/multi_mst",
    "name": "multi-mst",
    "maintainer": "Jelmer Bot",
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": "jelmer.bot@uhasselt.be",
    "keywords": "minimum spanning tree, dimensionality reduction",
    "author": "Jelmer Bot",
    "author_email": "jelmer.bot@uhasselt.be",
    "download_url": "https://files.pythonhosted.org/packages/0f/57/c774c466f3c77c5012140a5bb2f6f0d2aa158e8e4d54c843eb0e91a08925/multi_mst-0.1.4.tar.gz",
    "platform": null,
    "description": "[![PyPI version](https://badge.fury.io/py/multi-mst.svg)](https://badge.fury.io/py/multi-mst)\n[![Tests](https://github.com/vda-lab/multi_mst/actions/workflows/Tests.yml/badge.svg?branch=main)](https://github.com/vda-lab/multi_mst/actions/workflows/Tests.yml)\n[![DOI](https://zenodo.org/badge/782003801.svg)](https://doi.org/10.5281/zenodo.13929036)\n\nManifold Modelling with Minimum Spanning Trees\n==============================================\n\nDimensionality reduction (DR) algorithms typically assume the data they are\ngiven is uniformly sampled from some underlying manifold. When this is not the\ncase, and there are observation-gaps along the manifold, these algorithms may\nfail to detect a single connected entity. This repository presents two manifold\napproximation approaches based on minimum spanning trees (MST) for non-uniform\nsampled data. \n\nNoisy Minimum Spanning Tree Union\n---------------------------------\n\nThe noisy minimum spanning tree union ($n$-MST) is inspired by Pathfinder\nnetworks that, with a specific parameter selection, yield the union set of all\npossible MSTs in a network (see, e.g., [[1]], [[2]]). We compute noisy MSTs to\ndetect alternative connectivity at all distance scales for distances which may\nhave few identically weighted connections.\n\nWe add Gaussian noise ($\\mu=0$) to every candidate edge. The noise parameter $n$\nis specified as a fraction of the points' nearest neighbour distance and\ncontrols the Gaussian's standard deviation. This formulation makes the noise\nscale with the data's density to avoid adding more edges in dense regions than\nsparse regions, retaining a reasonably uniform manifold approximation graph.\n\n```python\nimport matplotlib.pyplot as plt\nimport matplotlib.collections as mc\nfrom sklearn.datasets import make_swiss_roll\nfrom multi_mst.noisy_mst import NoisyMST\n\nX, t = make_swiss_roll(n_samples=2000, noise=0.5, hole=True)\nprojector = NoisyMST(num_trees=10, noise_fraction=1.0).fit(X)\n\n# Drawing the network\nxs = projector.embedding_[:, 0]\nys = projector.embedding_[:, 1]\ncoo_matrix = projector.graph_.tocoo()\nsources = coo_matrix.row\ntargets = coo_matrix.col\n\nplt.figure(figsize=(4, 3))\nplt.scatter(xs, ys, c=t, s=1, edgecolors=\"none\", linewidth=0, cmap=\"viridis\")\nlc = mc.LineCollection(\n    list(zip(zip(xs[sources], ys[sources]), zip(xs[targets], ys[targets]))),\n    linewidth=0.2,\n    zorder=-1,\n    alpha=0.5,\n    color=\"k\",\n)\nax = plt.gca()\nax.add_collection(lc)\nax.set_aspect(\"equal\")\nplt.subplots_adjust(0, 0, 1, 1)\nplt.axis(\"off\")\nplt.show()\n```\n![noisy_mst](./doc/_static/noisy_mst.png)\n\n$k$-Nearest Minimum Spanning Tree \n---------------------------------\n\nThe k-nearest Minimum Spanning Tree ($k$-MST) generalises $k$-nearest neighbour\nnetworks ($k$-NN) to minimum spanning trees. It adds the $k$ shortest edges\nbetween components. Since data points start as distinct components, all $k$-NN\nedges are included in the kMST.  \n\nTo avoid creating shortcuts in the manifold, a distance threshold $\\epsilon$ can\nbe applied. The parameter is specified as a fraction of the shortest edge\nbetween components and provides an upper distance limit for the $2$-to-$k$\nalternative edges.\n\n```python\nimport matplotlib.pyplot as plt\nimport matplotlib.collections as mc\nfrom sklearn.datasets import make_swiss_roll\nfrom multi_mst.k_mst import KMST\n\nX, t = make_swiss_roll(n_samples=2000, noise=0.5, hole=True)\nprojector = KMST(num_neighbors=3, epsilon=2.0).fit(X)\n\n# Drawing the network\nxs = projector.embedding_[:, 0]\nys = projector.embedding_[:, 1]\ncoo_matrix = projector.graph_.tocoo()\nsources = coo_matrix.row\ntargets = coo_matrix.col\n\nplt.figure(figsize=(4, 3))\nplt.scatter(xs, ys, c=t, s=1, edgecolors=\"none\", linewidth=0, cmap=\"viridis\")\nlc = mc.LineCollection(\n    list(zip(zip(xs[sources], ys[sources]), zip(xs[targets], ys[targets]))),\n    linewidth=0.2,\n    zorder=-1,\n    alpha=0.5,\n    color=\"k\",\n)\nax = plt.gca()\nax.add_collection(lc)\nax.set_aspect(\"equal\")\nplt.subplots_adjust(0, 0, 1, 1)\nplt.axis(\"off\")\nplt.show()\n```\n![k_mst](./doc/_static/k_mst.png)\n\n\nApproximate $k$-MST\n-------------------\n\nComputing $k$-MSTs using KDTrees can be expensive on some datasets. We provide a\nversion of the algorithm based on Nearest Neighbour Descent for quicker\napproximations. We combined Boruvka's algorithm with NNDescent to find\nneighbours that are not already connected in the MST being build.\n\n\n```python\nimport matplotlib.pyplot as plt\nimport matplotlib.collections as mc\nfrom sklearn.datasets import make_swiss_roll\nfrom multi_mst.k_mst_descent import KMSTDescent\n\nX, t = make_swiss_roll(n_samples=2000, noise=0.5, hole=True)\nprojector = KMSTDescent(num_neighbors=3, epsilon=2.0).fit(X)\n\n# Draw the network\nxs = projector.embedding_[:, 0]\nys = projector.embedding_[:, 1]\ncoo_matrix = projector.graph_.tocoo()\nsources = coo_matrix.row\ntargets = coo_matrix.col\n\nplt.figure(figsize=(4, 3))\nplt.scatter(xs, ys, c=t, s=1, edgecolors=\"none\", linewidth=0, cmap=\"viridis\")\nlc = mc.LineCollection(\n    list(zip(zip(xs[sources], ys[sources]), zip(xs[targets], ys[targets]))),\n    linewidth=0.2,\n    zorder=-1,\n    alpha=0.5,\n    color=\"k\",\n)\nax = plt.gca()\nax.add_collection(lc)\nax.set_aspect(\"equal\")\nplt.subplots_adjust(0, 0, 1, 1)\nplt.axis(\"off\")\nplt.show()\n```\n![k_mst](./doc/_static/k_mst_descent.png)\n\n\n\nInstallation Instructions\n-------------------------\n\nThe `multi_mst` package can be installed from pypi:\n\n```bash\npip install multi_mst\n```\n\nAcknowledgements\n----------------\n\nMost code---including the numba KDTree, disjoint set and Boruvka MST\nconstruction implementation---is adapted from\n[fast_hdbscan](https://github.com/TutteInstitute/fast_hdbscan).\n\n\nLicense\n-------\n\n`multi_mst` uses the same license as `fast_hdbscan`: BSD (2-clause). See the\nLICENSE file for details.\n\n\n[1]: <https://onlinelibrary.wiley.com/doi/10.1002/asi.20904> \"Pathfinder Networks\"\n[2]: <https://ieeexplore.ieee.org/document/8231853> \"GraphRay\"\n",
    "bugtrack_url": null,
    "license": "BSD 2-Clause License",
    "summary": "Minimum spanning tree based manifold approximations.",
    "version": "0.1.4",
    "project_urls": {
        "Homepage": "https://github.com/vda-lab/multi_mst"
    },
    "split_keywords": [
        "minimum spanning tree",
        " dimensionality reduction"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "cd78a0346c8d26587d89f485bb4a23feab92f8649f9aef1d0290f30bdb336f4a",
                "md5": "b018e1c13451d0ccf48ba821d47c6826",
                "sha256": "3b165790208022c2be0622190a5cfc5787d54a28a6665b3cbce04353c04d614b"
            },
            "downloads": -1,
            "filename": "multi_mst-0.1.4-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "b018e1c13451d0ccf48ba821d47c6826",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9",
            "size": 42936,
            "upload_time": "2024-12-18T22:16:29",
            "upload_time_iso_8601": "2024-12-18T22:16:29.337463Z",
            "url": "https://files.pythonhosted.org/packages/cd/78/a0346c8d26587d89f485bb4a23feab92f8649f9aef1d0290f30bdb336f4a/multi_mst-0.1.4-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "0f57c774c466f3c77c5012140a5bb2f6f0d2aa158e8e4d54c843eb0e91a08925",
                "md5": "f62c0de441d6363938d89827d308dc17",
                "sha256": "bd3476838dc937b9da9e1cd0ae51cde417ee29af715c0f733084d9ce765c2e6d"
            },
            "downloads": -1,
            "filename": "multi_mst-0.1.4.tar.gz",
            "has_sig": false,
            "md5_digest": "f62c0de441d6363938d89827d308dc17",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 135979,
            "upload_time": "2024-12-18T22:16:31",
            "upload_time_iso_8601": "2024-12-18T22:16:31.580741Z",
            "url": "https://files.pythonhosted.org/packages/0f/57/c774c466f3c77c5012140a5bb2f6f0d2aa158e8e4d54c843eb0e91a08925/multi_mst-0.1.4.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-12-18 22:16:31",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "vda-lab",
    "github_project": "multi_mst",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [
        {
            "name": "numpy",
            "specs": [
                [
                    "<",
                    "3"
                ],
                [
                    ">=",
                    "1.20"
                ]
            ]
        },
        {
            "name": "scipy",
            "specs": [
                [
                    ">=",
                    "1.9"
                ]
            ]
        },
        {
            "name": "scikit-learn",
            "specs": [
                [
                    ">=",
                    "1.1"
                ]
            ]
        },
        {
            "name": "umap-learn",
            "specs": [
                [
                    ">=",
                    "0.5.4"
                ]
            ]
        },
        {
            "name": "numba",
            "specs": [
                [
                    ">=",
                    "0.57.1"
                ]
            ]
        },
        {
            "name": "pynndescent",
            "specs": [
                [
                    ">=",
                    "0.5.13"
                ]
            ]
        }
    ],
    "lcname": "multi-mst"
}
        
Elapsed time: 2.24442s