# SCA (Surprisal Component Analysis)
Surprisal Component Analysis (SCA) is a dimensionality reduction technique for single-cell transcriptomic data. It uses information theory to identify biologically informative axes of variation, enabling recovery of both rare and common cell types at superior resolution. It is written in Python. The pre-print can be found [here](https://www.biorxiv.org/content/10.1101/2021.01.19.427303v1).
For full documentation of shannonca's API, see our [Readthedocs page](https://shannonca.readthedocs.io/).
## Installation
SCA is available via pip:
```
pip install shannonca
```
### Dependencies
SCA requires the following packages:
* scikit-learn
* scipy
* numpy
* matplotlib
* pandas
* seaborn
* scanpy
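To see which of these dependencies are already installed, and at what version, a short check along the following lines can help. This is a minimal sketch using only the Python standard library; `installed_versions` is a hypothetical helper and is not part of shannonca.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution (pip) names as listed above; these are not always the import names.
DEPENDENCIES = ["scikit-learn", "scipy", "numpy", "matplotlib", "pandas", "seaborn", "scanpy"]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions(DEPENDENCIES).items():
        print(f"{name}: {ver or 'not installed'}")
```

Any package reported as not installed can be added with pip before installing SCA.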
## Usage
### Dimensionality Reduction
SCA generates information score matrices, which are used to construct biologically informative linear combinations of genes (metagenes). The package includes workflows both with and without Scanpy under `shannonca.dimred`.
#### Without Scanpy
The `reduce` function accepts a (num cells) x (num genes) matrix X, and outputs a dimensionality-reduced version with fewer features. The input matrix may be normalized or otherwise processed, but a zero in the input matrix must indicate zero recorded transcripts.
```
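from scipy.io import mmread
from shannonca.dimred import reduce
# Read some dataset (assumed here to be stored as genes x cells), then transpose to cells x genes in CSR format
X = mmread('mydata.mtx').transpose().tocsr()
reduction = reduce(X, n_comps=50, n_pcs=50, iters=1, nbhd_size=15, metric='euclidean', model='wilcoxon', chunk_size=1000, n_tests='auto')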
```
`reduction` is a (num cells) x (`n_comps`) matrix. The function optionally also returns SCA's score matrix (if `keep_scores=True`), metagene loadings (if `keep_loadings=True`), or intermediate results (if `iters>1` and `keep_all_iters=True`). If at least one of these is requested, the return value is a dictionary with keys 'reduction', 'scores', and 'loadings'. If `keep_all_iters=True`, the reduction after each iteration i is stored under the key 'reduction_i'.
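Because the return value is either a bare matrix or a dictionary depending on the flags, downstream code may want to handle both shapes uniformly. The following is a minimal sketch of one way to do that; `unpack_reduce_output` is a hypothetical helper (not part of shannonca), and the arrays are stand-ins with made-up shapes rather than real SCA output.

```python
import numpy as np


def unpack_reduce_output(result):
    """Return (reduction, scores, loadings) from either return shape.

    `result` is assumed to be whatever `reduce` returned: a bare matrix,
    or a dict with a 'reduction' key and optional 'scores' / 'loadings'.
    Entries that were not requested come back as None.
    """
    if isinstance(result, dict):
        return result["reduction"], result.get("scores"), result.get("loadings")
    return result, None, None


# Stand-in values for illustration only (100 cells, 50 components, 2000 genes).
fake_reduction = np.zeros((100, 50))
fake_scores = np.zeros((100, 2000))

# Case 1: no optional outputs requested, so a bare matrix is returned.
reduction, scores, loadings = unpack_reduce_output(fake_reduction)

# Case 2: scores requested, so a dictionary is returned.
reduction, scores, loadings = unpack_reduce_output(
    {"reduction": fake_reduction, "scores": fake_scores}
)
print(reduction.shape, scores.shape, loadings)
```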
Starting neighborhoods are computed by default using Euclidean distance (controlled by `metric`) in `n_pcs`-dimensional PCA space. See the docstring for detailed descriptions of all parameters.
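The requirement stated earlier, that a zero in the input must mean zero recorded transcripts, can be checked before calling `reduce` by comparing the zero pattern of the raw counts with that of the processed matrix. The sketch below assumes both matrices are available as dense arrays; `zeros_preserved` is a hypothetical helper and the toy counts are made up for illustration.

```python
import numpy as np


def zeros_preserved(raw_counts, processed):
    """True if `processed` is zero exactly where `raw_counts` is zero."""
    raw_counts = np.asarray(raw_counts)
    processed = np.asarray(processed)
    return bool(np.array_equal(raw_counts == 0, processed == 0))


# Toy cells x genes count matrix (illustrative values only).
counts = np.array([[0, 3, 0], [5, 0, 1], [0, 0, 2]], dtype=float)

log_normalized = np.log1p(counts)        # log(1 + x) maps 0 to 0
centered = counts - counts.mean(axis=0)  # mean-centering makes zero entries nonzero

print(zeros_preserved(counts, log_normalized))  # True
print(zeros_preserved(counts, centered))        # False
```

Transformations such as log-normalization keep the zero pattern intact, whereas centering or z-scoring does not, so the latter should not be applied before running SCA.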
#### With Scanpy
Scanpy (https://github.com/theislab/scanpy) is a commonly used toolkit for single-cell analysis. To compute a reduction in place on a Scanpy AnnData object, use `reduce_scanpy`:
```
import scanpy as sc
from shannonca.dimred import reduce_scanpy
adata = sc.AnnData(X)
reduce_scanpy(adata, keep_scores=True, keep_loadings=True, keep_all_iters=True, layer=None, key_added='sca', iters=1, n_comps=50)
```
This function shares all parameters with `reduce`, but instead of returning the reduction, it updates the input AnnData object. Dimensionality reductions are stored in `adata.obsm[key_added]`, or, if `keep_all_iters=True`, under the value of `key_added` followed by `_i` for each iteration number i. If `keep_scores=True`, the information scores of each gene in each cell are stored in `adata.layers` under the value of `key_added` followed by `_score`. If `layer=None`, the algorithm is run on `adata.X`; otherwise, it is run on `adata.layers[layer]`.
## Troubleshooting
If you are having trouble running SCA, try the following:
* Pull from the GitHub repository to ensure that your version of SCA is up to date.
* Ensure that the Python version is at least 3.0, and that the installations of scanpy, numpy, scipy, and scikit-learn are up to date.
* When running the `reduce` function, ensure that the input is either a CSR sparse matrix (`scipy.sparse.csr_matrix`) or a dense numpy array, with one row per cell and one column per gene. A sparse matrix in another format can be converted with its `.tocsr()` method, and a dense array can be wrapped with `scipy.sparse.csr_matrix(X)`.
* When running `reduce_scanpy`, ensure that the input is an AnnData object.
* Ensure that the data type of the input is an integer or floating-point type.
* Double-check that the code follows the docstring for the relevant function: `reduce` or `reduce_scanpy`.
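The input checks in the list above can be bundled into a small pre-flight function. This is a minimal sketch under the assumption that the matrix is already oriented cells x genes; `prepare_input` is a hypothetical helper, not part of shannonca.

```python
import numpy as np
from scipy import sparse


def prepare_input(X):
    """Coerce X to a CSR matrix or dense ndarray with a numeric dtype.

    Assumes X is already oriented cells x genes; this helper does not transpose.
    """
    if sparse.issparse(X):
        X = X.tocsr()      # e.g. the COO output of scipy.io.mmread becomes CSR
    else:
        X = np.asarray(X)  # nested lists, np.matrix, etc. become a plain ndarray
    if not (np.issubdtype(X.dtype, np.integer) or np.issubdtype(X.dtype, np.floating)):
        raise TypeError(f"expected integer or float data, got {X.dtype}")
    return X


# Toy 3 cells x 2 genes matrix in COO format (illustrative values only).
coo = sparse.coo_matrix(np.array([[0, 2], [1, 0], [0, 0]]))
X = prepare_input(coo)
print(X.format, X.shape, X.dtype)
```

The result can then be passed to `reduce`, or wrapped in an AnnData object for `reduce_scanpy`.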
Raw data
{
"_id": null,
"home_page": "https://github.com/bdemeo/shannonca",
"name": "shannonca",
"maintainer": null,
"docs_url": null,
"requires_python": null,
"maintainer_email": null,
"keywords": "Shannon, Information, Dimensionality, reduction, single-cell, RNA",
"author": "Benjamin DeMeo",
"author_email": "bdemeo@g.harvard.edu",
"download_url": "https://files.pythonhosted.org/packages/3a/ad/1a8b5044676fe73415f4107f44754f481a06503025f383a82832260dfa89/shannonca-0.0.9.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Informative Dimensionality Reduction via Shannon Component Analysis",
"version": "0.0.9",
"project_urls": {
"Download": "https://github.com/bendemeo/shannonca/archive/refs/tags/0.0.1.tar.gz",
"Homepage": "https://github.com/bdemeo/shannonca"
},
"split_keywords": [
"shannon",
" information",
" dimensionality",
" reduction",
" single-cell",
" rna"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "ae1693207879db1f7bc8c5b934abf2e03b643f645a2091f8dee36d86d00bee9e",
"md5": "423ac48c620bc78d9beaf74c09fb0f79",
"sha256": "2d55f794783bb4194cd85883783e14f07118455d38ed7a75e17e6446456227cc"
},
"downloads": -1,
"filename": "shannonca-0.0.9-py3-none-any.whl",
"has_sig": false,
"md5_digest": "423ac48c620bc78d9beaf74c09fb0f79",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 27888,
"upload_time": "2024-12-14T17:08:56",
"upload_time_iso_8601": "2024-12-14T17:08:56.365161Z",
"url": "https://files.pythonhosted.org/packages/ae/16/93207879db1f7bc8c5b934abf2e03b643f645a2091f8dee36d86d00bee9e/shannonca-0.0.9-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "3aad1a8b5044676fe73415f4107f44754f481a06503025f383a82832260dfa89",
"md5": "cc5d5d9b481ad2eb4f8c4f09a1843360",
"sha256": "cd39a8009ffce444f37def87118911f330837749f85b8518d4b971e2744c89e3"
},
"downloads": -1,
"filename": "shannonca-0.0.9.tar.gz",
"has_sig": false,
"md5_digest": "cc5d5d9b481ad2eb4f8c4f09a1843360",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 24557,
"upload_time": "2024-12-14T17:08:58",
"upload_time_iso_8601": "2024-12-14T17:08:58.721867Z",
"url": "https://files.pythonhosted.org/packages/3a/ad/1a8b5044676fe73415f4107f44754f481a06503025f383a82832260dfa89/shannonca-0.0.9.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-12-14 17:08:58",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "bdemeo",
"github_project": "shannonca",
"github_not_found": true,
"lcname": "shannonca"
}