# pySTAD
[![PyPI version fury.io](https://pypip.in/v/pystad/badge.png)](https://pypi.python.org/pypi/pystad/)
[![PyPI status](https://img.shields.io/pypi/status/pystad.svg)](https://pypi.python.org/pypi/pystad/)
[![PyPI pyversions](https://img.shields.io/pypi/pyversions/pystad.svg)](https://pypi.python.org/pypi/pystad/)
[![PyPI license](https://img.shields.io/pypi/l/pystad.svg)](https://pypi.python.org/pypi/pystad/)
[![pipeline status](https://gitlab.com/vda-lab/pystad2/badges/master/pipeline.svg)](https://gitlab.com/dsi_uhasselt/vda-lab/pystad2/-/commits/master)
[![coverage report](https://gitlab.com/vda-lab/pystad2/badges/master/coverage.svg)](https://gitlab.com/dsi_uhasselt/vda-lab/pystad2/-/commits/master)
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gl/vda-lab%2Fpystad2/master?urlpath=lab/tree/examples)
This is a Python implementation of [STAD](https://ieeexplore.ieee.org/document/9096616/)
for exploration and visualisation of high-dimensional data. This implementation
is based on the [R version](https://github.com/vda-lab/stad).
## Background
[STAD](https://ieeexplore.ieee.org/document/9096616/) is a dimensionality
reduction algorithm that generates an abstract representation of
high-dimensional data by giving each data point a location in a graph which
preserves the distances in the original high-dimensional space. The STAD graph
is built upon the Minimum Spanning Tree (MST) to which new edges are added until
the correlation between the graph and the original dataset is maximized.
Additionally, STAD supports the inclusion of filter functions to analyse data
from new perspectives, emphasizing traits in data which otherwise would remain
hidden.
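The construction described above can be illustrated with a small toy sketch. This is not pySTAD's implementation: the function `toy_stad`, the choice to add the shortest remaining edges first, the use of hop counts as graph distance, and the Pearson correlation as the score are all assumptions made for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree, shortest_path
from scipy.spatial.distance import pdist, squareform


def toy_stad(points, candidate_counts):
    """Illustrative only: MST plus the k shortest remaining edges,
    keeping the k with the highest distance correlation."""
    dist = squareform(pdist(points, "euclidean"))
    n = dist.shape[0]
    rows, cols = np.triu_indices(n, k=1)

    # Minimum spanning tree as a symmetric boolean adjacency matrix.
    mst = minimum_spanning_tree(csr_matrix(dist)).toarray() > 0
    mst = mst | mst.T

    # Remaining (non-MST) edges, sorted from shortest to longest.
    is_candidate = ~mst[rows, cols]
    order = np.argsort(dist[rows, cols][is_candidate])
    cand_rows = rows[is_candidate][order]
    cand_cols = cols[is_candidate][order]

    best_corr, best_k, best_adj = -np.inf, 0, mst
    for k in candidate_counts:
        adj = mst.copy()
        adj[cand_rows[:k], cand_cols[:k]] = True
        adj[cand_cols[:k], cand_rows[:k]] = True
        # Graph distance is taken as the number of hops (an assumption).
        hops = shortest_path(csr_matrix(adj.astype(float)),
                             unweighted=True, directed=False)
        corr = np.corrcoef(dist[rows, cols], hops[rows, cols])[0, 1]
        if corr > best_corr:
            best_corr, best_k, best_adj = corr, k, adj
    return best_corr, best_k, best_adj


rng = np.random.default_rng(0)
points = rng.normal(size=(30, 2))
corr, k, adj = toy_stad(points, [0, 5, 10, 20, 40])
```

The returned adjacency matrix always contains the spanning tree, so the graph stays connected regardless of how many extra edges are kept.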
### Topological data analysis
Topological data analysis (TDA) aims to describe the geometric structures
present in data. A dataset is interpreted as a point-cloud, where each point
is sampled from an underlying geometric object. TDA tries to recover and
describe the geometry of that object in terms of features that are invariant
["under continuous deformations, such as stretching, twisting, crumpling and bending, but not tearing or gluing"](https://en.wikipedia.org/wiki/Topology).
Two geometries that can be deformed into each other without tearing or
gluing are *homeomorphic* (for instance a donut and a coffee mug). Typically,
TDA describes the *holes* in a geometry, formalised as
[Betti numbers](https://en.wikipedia.org/wiki/Betti_number).
Like other TDA algorithms, STAD constructs a graph that describes the structure
of the data. However, the output of STAD should be interpreted as a
data-visualisation result, rather than a topological description of the data's
structure. Other TDA algorithms, like
[mapper](https://github.com/scikit-tda/kepler-mapper), do produce topological
results. However, they rely on aggregating the data, whereas STAD encodes the
original data points as vertices in a graph.
### Dimensionality reduction
Compared to dimensionality reduction algorithms like t-SNE and UMAP, STAD
produces a more flexible description of the data. A graph can be drawn using
different layouts and a user can interact with it. In addition, STAD's
projections retain the global structure of the data. In general, the STAD graph
tends to underestimate the distances between distant data-points in the network structure. On the
other hand, t-SNE and UMAP emphasize the relation of data-points with their
closest neighbors over that with distant data-points.
<p style="text-align:center;"><img src="./assets/dimensionality_reduction_comparison.png" width="90%" /></p>
from [Alcaide & Aerts (2020)](https://ieeexplore.ieee.org/document/9096616/)
## Installation
Currently, we recommend installing pystad from this repository within a conda
environment with Python and nodejs installed:
```bash
pip install git+https://gitlab.com/vda-lab/pystad2#egg=pystad
```
Alternatively, pystad can be compiled from source (see
`development/Development.md` for instructions).
## How to use pySTAD
### From the command-line
pySTAD has a `__main__` entry-point which can be called using
`python -m stad --help` or `stad --help` from the command-line. These
entry-points take a distance matrix in the form of a `.csv` file and print the
resulting network as a JSON string to stdout. Some information about the network is
logged to stderr, including the number of added edges and the correlation of the
network-distances with the original distances.
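As a hedged illustration of preparing input for the command-line, the sketch below writes a distance matrix to a `.csv` file. The helper name `write_distance_csv`, the file name, and the layout (a square matrix without header) are assumptions; check `stad --help` for the format the entry-points actually expect.

```python
import os
import tempfile

import numpy as np
from scipy.spatial.distance import pdist, squareform


def write_distance_csv(points, path):
    """Write the pairwise Euclidean distances of `points` as a square CSV.

    The square, header-less layout is an assumption, not a documented
    requirement of the pySTAD command-line.
    """
    square = squareform(pdist(points, "euclidean"))
    np.savetxt(path, square, delimiter=",")
    return square


rng = np.random.default_rng(0)
points = rng.normal(size=(50, 3))
csv_path = os.path.join(tempfile.mkdtemp(), "distances.csv")
square = write_distance_csv(points, csv_path)
```

The resulting file can then be passed to the entry-point in whatever way `stad --help` describes.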
### From within python
pySTAD is most versatile when used within Python. Three basic examples are
shown below and the example jupyterlab notebooks can be explored on
[binder](https://mybinder.org/v2/gl/dsi_uhasselt%2Fvda-lab%2Fpystad2/master?urlpath=tree/examples)
without installing pySTAD on your machine.
#### Example 1
Most basic use of pySTAD using the default options.
```python
import stad as sd
import numpy as np
import networkx as nx
import pandas as pd
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist
# Load a dataset
data = pd.read_csv('./data/five_circles.csv', header=0)
condensed_distances = pdist(data[['x', 'y']], 'euclidean')
# Show the data in 2D
plt.scatter(data.x, data.y, s=5, c=data.x)
plt.show()
# Compute stad
network, sweep = sd.stad(condensed_distances)
sd.plot.network(network, layout='kk', node_color=data['x'])
plt.show()
# Show the correlation trace
sd.plot.sweep(condensed_distances, sweep)
plt.show()
```
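The example above passes the output of `scipy.spatial.distance.pdist` to `sd.stad`. As a small side note, the sketch below shows what that condensed form looks like for three hand-picked points and how it relates to the square matrix; the points are made up for illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Three points on a line through the origin, 5 units apart.
points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])

# Condensed form: one entry per pair, ordered (0,1), (0,2), (1,2).
condensed = pdist(points, "euclidean")

# Square form: a symmetric n x n matrix with a zero diagonal.
square = squareform(condensed)

n = points.shape[0]
n_pairs = n * (n - 1) // 2  # length of the condensed vector
```

For `n` points the condensed vector holds `n * (n - 1) / 2` distances, which is roughly half the memory of the square matrix.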
#### Example 2
Use a lens / filter to highlight some property of the data.
```python
import stad as sd
import numpy as np
import networkx as nx
import pandas as pd
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist
# Load a dataset
data = pd.read_csv('./data/five_circles.csv', header=0)
condensed_distances = pdist(data[['x', 'y']], 'euclidean')
# Show the dataset in 2D
plt.scatter(data.x, data.y, s=5, c=data.x)
plt.show()
# Run stad with a lens
lens = sd.Lens(data['x'].to_numpy(), n_bins=3)
network, sweep = sd.stad(condensed_distances, lens=lens)
# Show which edges cross filter-segment boundaries
edge_color = np.where(lens.adjacent_edges[sweep.network_mask], '#f33', '#ddd')
sd.plot.network(network, layout='kk', edge_color=edge_color, node_color=data['x'])
plt.show()
# Show the correlation trace
sd.plot.sweep(condensed_distances, sweep)
plt.show()
```
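To make the idea of filter segments more concrete, the sketch below bins a 1D lens into equal-width segments and flags edges whose endpoints fall in different segments. The helpers `bin_lens` and `crossing_edges` are hypothetical, and equal-width binning is an assumption; `sd.Lens` may segment the values differently.

```python
import numpy as np


def bin_lens(values, n_bins):
    """Assign each value to one of n_bins equal-width segments (0 .. n_bins - 1)."""
    values = np.asarray(values, dtype=float)
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    # Use interior edges only, so the maximum value lands in the last segment.
    return np.digitize(values, edges[1:-1])


def crossing_edges(bins, edge_list):
    """Return True for every edge whose endpoints sit in different segments."""
    edge_list = np.asarray(edge_list)
    return bins[edge_list[:, 0]] != bins[edge_list[:, 1]]


lens_values = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
bins = bin_lens(lens_values, n_bins=3)
crossing = crossing_edges(bins, [(0, 1), (1, 2), (3, 4), (4, 5)])
```

A boolean mask like `crossing` can be turned into edge colours with `np.where`, in the same way the example above colours edges.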
#### Example 3
Explore the resulting network interactively in jupyter-lab.
```python
import stad as sd
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import ipywidgets as widgets
from scipy.spatial.distance import pdist
# Load data, compute distances, show 2d projection
data = pd.read_csv('./data/horse.csv')
idx = np.random.choice(data.shape[0], 500, replace=False)
data = data.iloc[idx, :]
dist = pdist(data, 'euclidean')
plt.scatter(data.z, data.y, s=5, c=data.z)
plt.show()
# Compute stad without lens
network, sweep = sd.stad(dist, sweep=sd.ThresholdDistance(0.11))
w = sd.Widget()
w
```
```python
# show() calls only work after the front-end of the widget is instantiated,
# so they have to be in a cell below the cell that outputs the widget.
w.show(network, node_color=data['z'])
```
## Compared to the R-implementation
The [R implementation](https://github.com/vda-lab/stad) supports 2-dimensional
filters (lenses) and uses Simulated Annealing to optimise the output graph. This
implementation currently only supports 1D lenses. In addition, this implementation
uses a logistic sweep on the number of edges in the network by default, but still
supports optimization functions such as simulated annealing.
This implementation is optimised using Cython and OpenMP, resulting in shorter
computation times compared to the R implementation.
The R implementation uses an MST refinement procedure when using a lens / filter, as
described in the paper. This implementation just uses the MST. The refinement
procedure depends on community detection to remove edges between different groups of
data-points within the same filter segment, which is a process that requires fine-tuning
per dataset. When communities are not detected correctly, edges between distinct groups of
datapoints within a filter segment remain in the network, obscuring the patterns the filter
should expose.
## How to cite
@TODO create DOI for software releases
Please cite our papers when using this software:
APA:
Alcaide, D., & Aerts, J. (2020). Spanning Trees as Approximation of Data
Structures. IEEE Transactions on Visualization and Computer Graphics.
https://doi.org/10.1109/TVCG.2020.2995465
Bibtex:
@article{alcaide2020spanning,
title={Spanning Trees as Approximation of Data Structures},
author={Alcaide, Daniel and Aerts, Jan},
journal={IEEE Transactions on Visualization and Computer Graphics},
year={2020},
publisher={IEEE},
doi = {10.1109/TVCG.2020.2995465},
}
[![DOI:10.1109/TVCG.2020.2995465](https://zenodo.org/badge/DOI/10.1109/TVCG.2020.2995465.svg)](https://doi.org/10.1109/TVCG.2020.2995465)
and for the STAD-R variant:
APA:
Alcaide, D., & Aerts, J. (2021). A visual analytic approach for the
identification of ICU patient subpopulations using ICD diagnostic codes.
PeerJ Computer Science, 7, e430.
https://doi.org/10.7717/peerj-cs.430
Bibtex:
@article{alcaide2021visual,
title={A visual analytic approach for the identification of ICU patient subpopulations using ICD diagnostic codes},
author={Alcaide, Daniel and Aerts, Jan},
journal={PeerJ Computer Science},
volume={7},
pages={e430},
year={2021},
publisher={PeerJ Inc.},
doi = {10.7717/peerj-cs.430},
}
[![DOI:10.7717/peerj-cs.430](https://zenodo.org/badge/DOI/10.7717/peerj-cs.430.svg)](https://doi.org/10.7717/peerj-cs.430)