# too-many-cells (à la Python)
[![image](https://img.shields.io/pypi/v/toomanycells.svg)](https://pypi.python.org/pypi/toomanycells)
### It's [Scanpy](https://github.com/scverse/scanpy) friendly!
**A Python package for spectral clustering based on the
powerful suite of tools named
[too-many-cells](https://github.com/GregorySchwartz/too-many-cells).
In essence, you can use toomanycells to partition a data set
in the form of a matrix of integers or floating point numbers
into clusters, where members of a cluster are similar to each
other. The rows represent observations and the
columns are the features. However, sometimes just knowing the
clusters is not sufficient. Often, we are interested in the
relationships between the clusters, and this tool can help
you visualize the clusters as leaf nodes of a tree, where the
branches illustrate the trajectories that have to be followed
to reach a particular cluster. Initially, this tool will
partition your data set into two subsets (each subset is a
node of the tree), trying to maximize
the differences between the two. Subsequently, it will
reapply that same criterion to each subset (node) and will
continue bifurcating until the
[modularity](https://en.wikipedia.org/wiki/Modularity_(networks))
of the node that is about to be partitioned becomes less
than a given threshold value ($10^{-9}$ by default),
implying that the elements belonging
to the current node are fairly homogeneous,
and consequently suggesting
that further partitioning is not warranted. Thus, when the
process finishes, you end up with a tree structure of your
data set, where the leaves represent the clusters. As
mentioned earlier, you can use this tool with any kind of
data. However, a common application is to classify cells and
therefore you can provide an
[AnnData](https://anndata.readthedocs.io/en/latest/) object.
You can read about this application in this [Nature Methods
paper](https://www.nature.com/articles/s41592-020-0748-5).**
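The recursive bisection described above can be sketched in a few lines. This is only a conceptual sketch: `split` is a placeholder for the spectral step that toomanycells performs internally, and the threshold mirrors the default of $10^{-9}$.

```python
# Conceptual sketch of modularity-guided recursive bisection.
# `split` is a stand-in for the spectral partitioning step.

def build_tree(cells, split, threshold=1e-9):
    """Recursively bisect `cells` until the modularity of the
    proposed split falls below `threshold`. Returns a nested
    tuple tree whose leaves are the final clusters."""
    modularity, left, right = split(cells)
    if modularity < threshold or not left or not right:
        return cells  # homogeneous enough: this node is a leaf
    return (build_tree(left, split, threshold),
            build_tree(right, split, threshold))
```

Each internal node of the returned tree corresponds to one bisection, and the leaves are the clusters.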
- Free software: GNU AFFERO GENERAL PUBLIC LICENSE
- Documentation: https://JRR3.github.io/toomanycells
## Dependencies
As of version 1.0.40, Graphviz is no longer required. Thus,
no dependencies!
## Virtual environments
To keep control of your working environment
you can use a Python virtual environment, which
can help you keep only the packages you need in
one location. In bash or zsh you can simply
type
```
python -m venv /path/to/new/virtual/environment
```
To activate it you simply need
```
source pathToTheVirtualEnvironment/bin/activate
```
To deactivate the environment use the intuitive
```
deactivate
```
## Installation
Just type
```
pip install toomanycells
```
in your home or custom environment.
If you want to install an updated
version, then use the following flag.
```
pip install toomanycells -U
```
Make sure you have the latest version. If not,
run the previous command again.
## Quick run
If you want to see a concrete example of how
to use toomanycells, check out the jupyter
notebook [demo](./toomanycells_demo.ipynb).
## Usage
1. First import the module as follows
```
from toomanycells import TooManyCells as tmc
```
2. If you already have an
[AnnData](https://anndata.readthedocs.io/en/latest/)
object `A` loaded into memory, then you can create a
TooManyCells object with
```
tmc_obj = tmc(A)
```
In this case the output folder will be called
`tmc_outputs`. However, if you want the output folder to
be a particular directory, then you can specify the path
as follows
```
tmc_obj = tmc(A, output_directory)
```
3. If instead of providing an AnnData object you want to
provide the directory where your data is located, you can use
the syntax
```
tmc_obj = tmc(input_directory, output_directory)
```
4. If your input directory has a file in the [matrix market
format](https://math.nist.gov/MatrixMarket/formats.html),
then you have to specify this information by using the
following flag
```
tmc_obj = tmc(input_directory,
output_directory,
input_is_matrix_market=True)
```
Under this scenario, the `input_directory` must contain a
`.mtx` file, a `barcodes.tsv` file (the observations), and
a `genes.tsv` (the features).
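Reading such a directory by hand looks roughly like the sketch below; it writes a tiny example first so the snippet is self-contained. The exact file name `matrix.mtx` is illustrative (the loader only requires a `.mtx` file), and this is not the package's internal code.

```python
# Sketch of the matrix-market layout described above: a .mtx count
# matrix plus barcodes.tsv (observations) and genes.tsv (features).
import os
import tempfile
import numpy as np
import pandas as pd
import scipy.io as sio
import scipy.sparse as sp

input_directory = tempfile.mkdtemp()

# Write a tiny 2-cell x 2-gene example.
counts = sp.csr_matrix(np.array([[1, 0], [0, 2]]))
sio.mmwrite(os.path.join(input_directory, "matrix.mtx"), counts)
pd.Series(["cell_1", "cell_2"]).to_csv(
    os.path.join(input_directory, "barcodes.tsv"),
    sep="\t", index=False, header=False)
pd.Series(["gene_1", "gene_2"]).to_csv(
    os.path.join(input_directory, "genes.tsv"),
    sep="\t", index=False, header=False)

# Reading the files back mirrors what a loader has to do.
loaded = sio.mmread(os.path.join(input_directory, "matrix.mtx"))
barcodes = pd.read_csv(os.path.join(input_directory, "barcodes.tsv"),
                       sep="\t", header=None)
```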
5. Once your data has been loaded successfully, you can start the clustering process with the following command
```
tmc_obj.run_spectral_clustering()
```
On my desktop computer, processing a data set with ~90K
cells (observations) and ~30K genes (features) took a
little less than 6 minutes in 1809 iterations. For a
larger data set like the [Tabula
Sapiens](https://figshare.com/articles/dataset/Tabula_Sapiens_release_1_0/14267219?file=40067134)
with 483,152 cells and 58,870 genes (14.51 GB in zip
format) the total time was about 50 minutes in the same
computer. ![Progress bar example](https://github.com/JRR3/toomanycells/blob/main/tests/tabula_sapiens_time.png)
6. At the end of the clustering process the `.obs` data frame of the AnnData object should have two columns named `['sp_cluster', 'sp_path']` which contain the cluster labels and the path from the root node to the leaf node, respectively.
```
tmc_obj.A.obs[['sp_cluster', 'sp_path']]
```
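Since these are ordinary pandas columns, the usual pandas idioms apply, e.g. for counting cells per cluster. In this sketch `obs` is a small hand-built stand-in for `tmc_obj.A.obs`.

```python
# The cluster labels are a regular pandas column, so standard
# pandas operations work on them. `obs` stands in for tmc_obj.A.obs.
import pandas as pd

obs = pd.DataFrame({
    "sp_cluster": [0, 0, 1, 2, 2, 2],
    "sp_path":    ["0/1", "0/1", "0/2", "0/3", "0/3", "0/3"],
})

# Number of cells per cluster, largest first.
cluster_sizes = obs["sp_cluster"].value_counts()
```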
7. To generate the outputs, just call the function
```
tmc_obj.store_outputs()
```
This call will generate a JSON file
containing the nodes and edges of the graph (`graph.json`),
one CSV file that describes the cluster
information (`clusters.csv`), another CSV file containing
the information of each node (`node_info.csv`), and two
more JSON files: one relates cells to clusters
(`cluster_list.json`), and the other stores the
full tree structure (`cluster_tree.json`). You need this
last file for too-many-cells interactive (TMCI).
8. If you already have the `graph.json` file you can
load it with
```
tmc_obj.load_graph(json_fname="some_path")
```
9. If you want to visualize your results in a dynamic
platform, I strongly recommend the tool
[too-many-cells-interactive](https://github.com/schwartzlab-methods/too-many-cells-interactive?tab=readme-ov-file).
To use it, first make sure that you have Docker and
Docker Compose. One simple way to get both is to install
[Docker Desktop](https://docs.docker.com/compose/install/).
Note that on macOS the instructions are slightly different.
If you use [Nix](https://search.nixos.org/packages), simply
add the packages `pkgs.docker` and `pkgs.docker-compose` to
your configuration or `home.nix` file and run
```
home-manager switch
```
10. If you installed Docker Desktop you probably don't need to
follow this step. However, under some distributions the
following two commands have proven to be essential. Use
```
sudo dockerd
```
to start the daemon service for docker containers and
```
sudo chmod 666 /var/run/docker.sock
```
to make the Docker socket readable and writable by all users.
11. Now clone the repository
```
git clone https://github.com/schwartzlab-methods/too-many-cells-interactive.git
```
and store the path to the `too-many-cells-interactive`
folder in a variable, for example
`path_to_tmc_interactive`. Also, you will need to identify
a column in your `AnnData.obs` data frame that has the
labels for the cells. Let's assume that the column name is
stored in the variable `cell_annotations`. Lastly, you can
provide a port number to host your visualization, for
instance `port_id=1234`. Then, you can call the function
```
tmc_obj.visualize_with_tmc_interactive(
path_to_tmc_interactive,
cell_annotations,
port_id)
```
The following visualization corresponds to the data set
with ~90K cells (observations). ![Visualization example](https://github.com/JRR3/toomanycells/blob/main/tests/example_1.png)
And this is the visualization for the Tabula Sapiens data set with ~480K cells.
![Visualization example](https://github.com/JRR3/toomanycells/blob/main/tests/tmci_tabula_sapiens.png)
## What is the time complexity of toomanycells (à la Python)?
To answer that question we created the following benchmark. We tested the performance of toomanycells on 20 data sets with the following numbers of cells: 6360, 10479, 12751, 16363, 23973, 32735, 35442, 40784, 48410, 53046, 57621, 62941, 68885, 76019, 81449, 87833, 94543, 101234, 107809, 483152. The range goes from a few thousand cells to almost half a million. These are the results.
![Visualization example](https://github.com/JRR3/toomanycells/blob/main/tests/log_linear_time.png)
![Visualization example](https://github.com/JRR3/toomanycells/blob/main/tests/log_linear_iter.png)
As you can see, the program behaves linearly with respect to the size of the input. In other words, the observations fit the model $T = k\cdot N^p$, where $T$ is the time to process the data set, $N$ is the number of cells, $k$ is a constant, and $p$ is the exponent. In our case $p\approx 1$. Nice!
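The exponent $p$ in the model above can be estimated by ordinary least squares after taking logs, since $\log T = \log k + p\log N$. The timings below are fabricated for illustration (chosen so that $p = 1$ exactly); they are not the benchmark data.

```python
# Fitting T = k * N**p reduces to linear least squares in log space:
# log T = log k + p * log N. Timings here are invented (p = 1).
import numpy as np

N = np.array([6360, 23973, 57621, 107809, 483152], dtype=float)
T = 0.004 * N  # fabricated timings consistent with p = 1

# Slope of the log-log fit is the exponent p.
p, log_k = np.polyfit(np.log(N), np.log(T), deg=1)
```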
## Cell annotation
### CellTypist
When visualizing the tree, we are often interested in
observing how different cell types distribute across the
branches of the tree. In case your AnnData object lacks
a cell annotation column in the ``obs`` data frame, or
if you already have one but you want to try a different
method, we have created a wrapper function that calls
[CellTypist](https://www.celltypist.org/). Simply
write
```
tmc_obj.annotate_with_celltypist(
column_label_for_cell_annotations,
)
```
and the `obs` data frame of your AnnData object will
have a column named like the string stored in the
`column_label_for_cell_annotations` variable.
By default we use the `Immune_All_High` CellTypist
model, which contains 32 cell types. If you want to use
another model, simply write
```
tmc_obj.annotate_with_celltypist(
column_label_for_cell_annotations,
celltypist_model,
)
```
where `celltypist_model` specifies the model the
library should use. For example, if this
variable is equal to `Immune_All_Low`, then the number
of possible cell types increases to 98.
For a complete list of all the models, see the following
[list](https://www.celltypist.org/models). Lastly,
if you want to use the fact that transcriptionally similar
cells are likely to cluster together, you can assign the cell
type labels on a cluster-by-cluster basis
rather than a cell-by-cell basis. To activate this
feature, use the call
```
tmc_obj.annotate_with_celltypist(
column_label_for_cell_annotations,
celltypist_model,
use_majority_voting = True,
)
```
### Median absolute deviation classification
Work in progress...
## Heterogeneity quantification
Imagine you want to compare the heterogeneity of cell
populations belonging to different branches of the
toomanycells tree. By branch we mean all the nodes that
derive from a particular node, including the node
that defines the branch in question.
For example, suppose we want to compare branch 1183 against branch 2.
![heterogeneity](https://github.com/JRR3/toomanycells/blob/main/tests/heterogeneity.svg)
One way to do this is by comparing the modularity
distribution and the cumulative modularity for all the
nodes that belong to each branch.
We can do that using the following calls. First for
branch 1183
```
tmc_obj.quantify_heterogeneity(
list_of_branches=[1183],
use_log_y=True,
tag="branch_A",
show_column_totals=True,
color="blue",
file_format="svg")
```
<br/>
<img src="https://github.com/JRR3/toomanycells/blob/main/tests/branch_A.svg"
width="500" height="420"/>
<br/>
And then for branch 2
```
tmc_obj.quantify_heterogeneity(
list_of_branches=[2],
use_log_y=True,
tag="branch_B",
show_column_totals=True,
color="red",
file_format="svg")
```
<br/>
<img src="https://github.com/JRR3/toomanycells/blob/main/tests/branch_B.svg"
width="500" height="420"/>
<br/>
Note that you can include multiple nodes in the
list of branches.
From these figures we observe that the higher cumulative
modularity of branch 1183 with respect to branch 2 suggests
that the former has a higher degree of heterogeneity.
However, relying solely on modularity could lead to a
misleading interpretation. For example, consider the
following scenario where the numbers within the nodes
indicate the modularity at that node.
<br/>
<img src="https://github.com/JRR3/toomanycells/blob/main/tests/counter_node_modularity.svg"
width="300" height="400"/>
<br/>
In this case, scenario A has a larger cumulative modularity,
but we note that scenario B is more heterogeneous.
For that reason we recommend also computing additional
diversity measures. First, we need some notation.
Given the branches passed to
`quantify_heterogeneity` through
`list_of_branches`, let $C$ be
the set of leaf nodes that belong to those branches.
We consider each leaf node as a separate species, and we
call the whole collection of cells an ecosystem.
For $c_i \in C$, let $|c_i|$ be the number of cells in
$c_i$ and $|C| = \sum_i |c_i|$ the total number
of cells contained in the given branches. If we let
$$p_i = \dfrac{|c_i|}{|C|},$$
then we define the following diversity measure
$$D(q) = \left(\sum_{i=1}^{n} p_i^q \right)^{\frac{1}{1-q}}.
$$
In general, the larger the value of $D(q)$, the more
diverse the collection of species. Note that $D(q=0)$
is the total number of species. We
call this quantity the richness of the ecosystem.
When $q=1$, $D$ is the exponential of the Shannon entropy
$$H = -\sum_{i=1}^{n}p_i \ln(p_i).$$
When $q=2$, $D$ is
the inverse of the Simpson's index
$$S = \sum_{i=1}^{n} (p_i)^2,$$
which represents the
probability that two cells picked at random belong
to the same species. Hence, the higher the Simpson's
index, the less diverse the ecosystem.
Lastly, when $q=\infty$, $D$ is the inverse of
the largest proportion $\max_i(p_i)$.
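A minimal implementation of $D(q)$ from cluster sizes follows directly from the definition, with the limiting cases $q=1$ and $q=\infty$ handled separately as described above. This is a sketch, not the package's own routine.

```python
# Diversity measure D(q) as defined above. The cases q = 1
# (exponential of the Shannon entropy) and q = inf (inverse of
# the largest proportion) are the limits of the general formula.
import math

def diversity(counts, q):
    total = sum(counts)
    p = [c / total for c in counts]
    if q == 1:
        return math.exp(-sum(pi * math.log(pi) for pi in p if pi > 0))
    if q == math.inf:
        return 1.0 / max(p)
    return sum(pi ** q for pi in p) ** (1.0 / (1.0 - q))
```

For a perfectly even ecosystem every $D(q)$ equals the number of species; skewed proportions lower the effective diversity as $q$ grows.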
In the above example, for branch 1183 we obtain
```
value
Richness 460.000000
Shannon 5.887544
Simpson 0.003361
MaxProp 0.010369
q = 0 460.000000
q = 1 360.518784
q = 2 297.562094
q = inf 96.442786
```
and for branch 2 we obtain
```
value
Richness 280.000000
Shannon 5.500414
Simpson 0.004519
MaxProp 0.010750
q = 0 280.000000
q = 1 244.793371
q = 2 221.270778
q = inf 93.021531
```
After comparing the results using two different measures,
namely, modularity and diversity, we conclude that branch
1183 is more heterogeneous than branch 2.
## Similarity functions
So far we have assumed that the similarity matrix
$S$ is
computed by calculating the cosine of the angle
between each pair of observations. Concretely, if the
matrix of observations is $B$ ($m\times n$), the $i$-th row
of $B$ is $x = B(i,:)$, and the $j$-th row of $B$
is $y=B(j,:)$, then the similarity between $x$ and
$y$ is
$$S(x,y)=\frac{x\cdot y}{||x||_2\cdot ||y||_2}.$$
However, this is not the only way to compute
a similarity matrix. We will list all the available
similarity functions and how to call them.
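For a small dense matrix, the cosine similarity matrix from the formula above can be computed directly with numpy (this is an illustration, not how the package computes it internally):

```python
# Cosine similarity matrix S for a small dense matrix of
# observations B: normalize each row, then take inner products.
import numpy as np

B = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 1.0]])

norms = np.linalg.norm(B, axis=1, keepdims=True)
S = (B / norms) @ (B / norms).T  # S[i, j] = cos(angle of rows i, j)
```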
### Cosine (sparse)
If your matrix is sparse, i.e., the number of nonzero
entries is proportional to the number of samples ($m$),
and you want to use the cosine similarity, then use the
following instruction.
```
tmc_obj.run_spectral_clustering(
similarity_function="cosine_sparse")
```
By default we use the Halko-Martinsson-Tropp algorithm
to compute the truncated singular value decomposition.
However, the ARPACK library (written in Fortran)
is also available.
```
tmc_obj.run_spectral_clustering(
similarity_function="cosine_sparse",
svd_algorithm="arpack")
```
If $B$ has negative entries, it is possible
to get negative entries for $S$. This could
in turn produce negative row sums for $S$.
If that is the case,
the convergence to a solution could be extremely slow.
However, if you use the non-sparse version of this
function, we provide a reasonable solution to this problem.
### Cosine
If your matrix is dense,
and you want to use the cosine similarity, then use the
following instruction.
```
tmc_obj.run_spectral_clustering(
similarity_function="cosine")
```
The same comment about negative entries applies here.
However, there is a simple solution. While shifting
the matrix of observations can drastically change the
interpretation of the data because each column lives
in a different (gene) space, shifting the similarity
matrix is actually a reasonable method to remove negative
entries. The reason is that similarities live in an
ordered space and shifting by a constant is
an order-preserving transformation. Equivalently,
if the similarity between $x$ and $y$ is
less than the similarity between $u$ and $w$, i.e.,
$S(x,y) < S(u,w)$, then $S(x,y)+s < S(u,w)+s$ for
any constant $s$. The raw data
have no natural order, but similarities do.
To shift the (dense) similarity matrix by $s=1$, use the
following instruction.
```
tmc_obj.run_spectral_clustering(
similarity_function="cosine",
shift_similarity_matrix=1)
```
Note that since the range of the cosine similarity
is $[-1,1]$, the shifted range for $s=1$ becomes $[0,2]$.
The shift transformation can also be applied to any of
the subsequent similarity matrices.
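The order-preservation argument is easy to verify numerically; the similarity values below are invented for the demonstration.

```python
# Shifting every similarity by the same constant s preserves their
# ordering, which is why shifting the similarity matrix (unlike
# shifting the raw observations) safely removes negative entries.
import numpy as np

sims = np.array([-0.8, -0.1, 0.3, 0.9])  # cosine similarities in [-1, 1]
shifted = sims + 1.0                     # now in [0, 2]

order_before = np.argsort(sims)
order_after = np.argsort(shifted)
```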
### Laplacian
The similarity matrix is
$$S(x,y)=\exp(-\gamma\cdot ||x-y||_1)$$
This is an example:
```
tmc_obj.run_spectral_clustering(
similarity_function="laplacian",
similarity_gamma=0.01)
```
This function is very sensitive to $\gamma$: an
inadequate choice can produce poor clusters or
no convergence at all. If you obtain unsatisfactory
results, try a smaller value of $\gamma$.
### Gaussian
The similarity matrix is
$$S(x,y)=\exp(-\gamma\cdot ||x-y||_2^2)$$
This is an example:
```
tmc_obj.run_spectral_clustering(
similarity_function="gaussian",
similarity_gamma=0.001)
```
As before, this function is very sensitive to $\gamma$.
Note that the norm is squared, so it maps
large differences between $x$ and $y$ to very small
similarity values.
### Divide by the sum
The similarity matrix is
$$S(x,y)=1-\frac{||x-y||_p}{||x||_p+||y||_p},$$
where $p =1$ or $p=2$. The rows
of the matrix are normalized (unit norm)
before computing the similarity.
This is an example:
```
tmc_obj.run_spectral_clustering(
similarity_function="div_by_sum")
```
## Normalization
### TF-IDF
If you want to use the inverse document
frequency (IDF) normalization, then use
```
tmc_obj.run_spectral_clustering(
similarity_function="some_sim_function",
use_tf_idf=True)
```
If you also want to normalize the frequencies to
unit norm with the $2$-norm, then use
```
tmc_obj.run_spectral_clustering(
similarity_function="some_sim_function",
use_tf_idf=True,
tf_idf_norm="l2")
```
If instead you want to use the $1$-norm, then
replace "l2" with "l1".
### Simple normalization
Sometimes normalizing your matrix
of observations can improve the
performance of some routines.
To normalize the rows, use the following instruction.
```
tmc_obj.run_spectral_clustering(
similarity_function="some_sim_function",
normalize_rows=True)
```
By default, the $2$-norm is used. To
use any other $p$-norm, use
```
tmc_obj.run_spectral_clustering(
similarity_function="some_sim_function",
normalize_rows=True,
similarity_norm=p)
```
## Gene expression along a path
### Introduction
Imagine you have the following tree structure after
running toomanycells.
![Tree path](https://github.com/JRR3/toomanycells/blob/main/tests/tree_path_example.svg)
Further, assume that the colors denote different classes
satisfying specific properties. We want to know how the
expression of two genes, for instance, `Gene S` and `Gene T`,
fluctuates as we move from node $X$
(lower left side of the tree), which is rich
in `Class B`, to node $Y$ (upper left side of the tree),
which is rich in `Class
C`. To compute such quantities, we first need to define the
distance between nodes.
### Distance between nodes
Assume we have a (parent) node $P$ with
two child nodes $C_1$ and $C_2$. Recall that the modularity of
$P$ indicates the strength of separation between the cell
populations of $C_1$ and $C_2$.
A large modularity indicates strong connections,
i.e., high similarity, within each cluster $C_i$,
and weak connections, i.e., low similarity, between
$C_1$ and $C_2$. If the modularity at $P$ is $Q(P)$, we define
the distance between $C_1$ and $C_2$ as
$$d(C_1,C_2) = Q(P).$$
We also define $d(C_i,P) = Q(P)/2$. Note that with
those definitions we have that
$$d(C_1,C_2)=d(C_1,P) +d(P,C_2)=Q(P)/2+Q(P)/2=Q(P),$$
as expected. Now that we know how to calculate the
distance between a node and its parent or child, let
$X$ and $Y$ be two distinct nodes, and let
${(N_{i})}_{i=0}^{n}$ be the sequence of nodes that describes
the unique path between them satisfying:
1. $N_0 = X$,
2. $N_n=Y$,
3. $N_i$ is a direct relative of $N_{i+1}$, i.e.,
$N_i$ is either a child or parent of $N_{i+1}$,
4. $N_i \neq N_j$ for $i\neq j$.
Then, the distance between $X$ and $Y$ is given by
```math
d(X,Y) =
\sum_{i=0}^{n-1} d(N_{i},N_{i+1}).
```
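Since consecutive nodes on the path are parent and child, every step contributes $Q(\text{parent})/2$, and the sum above can be computed in a few lines. The `parent_of` and `Q` dictionaries below are hypothetical toy inputs, not part of the package API.

```python
# Path distance as defined above: each parent<->child step along the
# unique path between X and Y contributes Q(parent)/2.

def path_distance(path, parent_of, Q):
    """`path` is the node sequence N_0..N_n, where consecutive
    entries are parent/child. `parent_of` maps child -> parent and
    `Q` maps node -> modularity."""
    total = 0.0
    for a, b in zip(path, path[1:]):
        # Whichever of the pair is the parent supplies Q(parent)/2.
        parent = a if parent_of.get(b) == a else b
        total += Q[parent] / 2.0
    return total
```

For two siblings $C_1, C_2$ of a parent $P$, the path $C_1 \to P \to C_2$ gives $Q(P)/2 + Q(P)/2 = Q(P)$, matching the definition.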
### Gene expression
We define the expression
of `Gene G` at a node $N$, $Exp(G,N)$, as the mean expression
of `Gene G` considering all the cells that belong to node
$N$. Hence, given the sequence of nodes
```math
(N_i)_{i=0}^{n}
```
we can compute the corresponding gene
expression sequence
```math
(E_{i})_{i=0}^{n}, \quad E_i = Exp(G,N_i).
```
### Cumulative distance
Lastly, since we are interested in plotting the
gene expression as a function of the distance with respect to
the node $X$, we define the sequence of real numbers
```math
(D_{i})_{i=0}^{n}, \quad D_{i} = d(X,N_{i}).
```
### Summary
1. The sequence of nodes between $X$ and $Y$
$${(N_{i})}_{i=0}^{n}$$
2. The sequence of gene expression levels between $X$ and $Y$
$${(E_{i})}_{i=0}^{n}$$
3. And the sequence of distances with respect to node $X$
$${(D_{i})}_{i=0}^{n}$$
The final plot is simply $E_{i}$ versus $D_{i}$. An example
is given in the following figure.
### Example
![Gene expression](https://github.com/JRR3/toomanycells/blob/main/tests/exp_path_test.svg)
Note how the expression of `Gene A` is high relative to
that of `Gene B` at node $X$; as we move
toward node $Y$, the trend is inverted, and `Gene B`
becomes highly expressed relative to `Gene A`.
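Assembling the three sequences is mechanical: the cumulative distances $D_i$ are running sums of the step distances, paired with the mean expressions $E_i$. All numbers below are invented for illustration.

```python
# Building the (D_i, E_i) pairs described above: D_i is the running
# sum of the step distances d(N_i, N_{i+1}); E_i is the mean
# expression at node N_i. Values are made up for illustration.
import itertools

step = [0.10, 0.05, 0.20]  # d(N_0,N_1), d(N_1,N_2), d(N_2,N_3)
E = [2.0, 1.5, 0.8, 0.3]   # mean expression of gene G at each node

D = [0.0] + list(itertools.accumulate(step))  # D_i = d(X, N_i)
points = list(zip(D, E))   # (D_i, E_i) pairs to plot
```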
## Acknowledgments
I would like to thank
the Schwartz lab (GW) for
letting me explore different
directions and also Christie Lau for
providing multiple test
cases to improve this
implementation.
Raw data
{
"_id": null,
"home_page": null,
"name": "toomanycells",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": null,
"keywords": "toomanycells",
"author": null,
"author_email": "Javier Ruiz-Ramirez <javier.ruizramirez@uhn.ca>",
"download_url": "https://files.pythonhosted.org/packages/86/7e/b581ec0f0155d78a7de3a0cf257abd3dbf4fdc1dbab2605552e4aa4ab2d1/toomanycells-1.0.50.tar.gz",
"platform": null,
"description": "# too-many-cells (\u00e0 la Python)\n\n\n[![image](https://img.shields.io/pypi/v/toomanycells.svg)](https://pypi.python.org/pypi/toomanycells)\n\n### It's [Scanpy](https://github.com/scverse/scanpy) friendly!\n\n** A python package for spectral clustering based on the\npowerful suite of tools named\n[too-many-cells](https://github.com/GregorySchwartz/too-many-cells).\nIn essence, you can use toomanycells to partition a data set\nin the form of a matrix of integers or floating point numbers\ninto clusters, where members of a cluster are similar to each\nother. The rows represent observations and the\ncolumns are the features. However, sometimes just knowing the\nclusters is not sufficient. Often, we are insterested on the\nrelationships between the clusters, and this tool can help \nyou visualize the clusters as leaf nodes of a tree, where the\nbranches illustrate the trajectories that have to be followed\nto reach a particular cluster. Initially, this tool will\npartition your data set into two subsets (each subset is a \nnode of the tree), trying to maximize\nthe differences between the two. Subsequently, it will\nreapply that same criterion to each subset (node) and will \ncontinue bifurcating until the\n[modularity](https://en.wikipedia.org/wiki/Modularity_(networks))\nof the node that is about to be partitioned becomes less \nthan a given threshold value ($10^{-9}$ by default), \nimplying that the elements belonging\nto the current node are fairly homogeneous, \nand consequently suggesting\nthat further partitioning is not warranted. Thus, when the\nprocess finishes, you end up with a tree structure of your\ndata set, where the leaves represent the clusters. As\nmentioned earlier, you can use this tool with any kind of\ndata. 
However, a common application is to classify cells and\ntherefore you can provide an\n[AnnData](https://anndata.readthedocs.io/en/latest/) object.\nYou can read about this application in this [Nature Methods\npaper](https://www.nature.com/articles/s41592-020-0748-5).**\n\n\n- Free software: GNU AFFERO GENERAL PUBLIC LICENSE\n- Documentation: https://JRR3.github.io/toomanycells\n\n## Dependencies\n\nVersion 1.0.40 no longer requires Graphviz. Thus,\nno dependencies!\n\n## Virtual environments\nTo have control of your working environment \nyou can use a python virtual environment, which \ncan help you keep only the packages you need in \none location. In bash or zsh you can simply\ntype\n```\npython -m venv /path/to/new/virtual/environment\n```\nTo activate it you simply need\n```\nsource pathToTheVirtualEnvironment/bin/activate\n```\nTo deactivate the environment use the intuitive\n```\ndeactivate\n```\n\n## Installation\n\nJust type\n```\npip install toomanycells\n```\nin your home or custom environment. \nIf you want to install an updated\nversion, then use the following flag.\n```\npip install toomanycells -U\n```\nMake sure you have the latest version. If not,\nrun the previous command again.\n\n## Quick run\nIf you want to see a concrete example of how\nto use toomanycells, check out the jupyter \nnotebook [demo](./toomanycells_demo.ipynb).\n\n## Usage\n1. First import the module as follows\n ```\n from toomanycells import TooManyCells as tmc\n ```\n\n2. If you already have an \n[AnnData](https://anndata.readthedocs.io/en/latest/) \nobject `A` loaded into memory, then you can create a \nTooManyCells object with\n ```\n tmc_obj = tmc(A)\n ```\n In this case the output folder will be called \n `tmc_outputs`. However, if you want the output folder to\n be a particular directory, then you can specify the path \n as follows\n ```\n tmc_obj = tmc(A, output_directory)\n ```\n3. 
If instead of providing an AnnData object you want to\nprovide the directory where your data is located, you can use\nthe syntax\n ```\n tmc_obj = tmc(input_directory, output_directory)\n ```\n\n4. If your input directory has a file in the [matrix market\nformat](https://math.nist.gov/MatrixMarket/formats.html),\nthen you have to specify this information by using the\nfollowing flag\n ```\n tmc_obj = tmc(input_directory,\n output_directory,\n input_is_matrix_market=True)\n ```\nUnder this scenario, the `input_directory` must contain a\n`.mtx` file, a `barcodes.tsv` file (the observations), and\na `genes.tsv` (the features).\n\n5. Once your data has been loaded successfully, you can start the clustering process with the following command\n ```\n tmc_obj.run_spectral_clustering()\n ```\nIn my desktop computer processing a data set with ~90K\ncells (observations) and ~30K genes (features) took a\nlittle less than 6 minutes in 1809 iterations. For a\nlarger data set like the [Tabula\nSapiens](https://figshare.com/articles/dataset/Tabula_Sapiens_release_1_0/14267219?file=40067134)\nwith 483,152 cells and 58,870 genes (14.51 GB in zip\nformat) the total time was about 50 minutes in the same\ncomputer. ![Progress bar example](https://github.com/JRR3/toomanycells/blob/main/tests/tabula_sapiens_time.png)\n \n6. At the end of the clustering process the `.obs` data frame of the AnnData object should have two columns named `['sp_cluster', 'sp_path']` which contain the cluster labels and the path from the root node to the leaf node, respectively.\n ```\n tmc_obj.A.obs[['sp_cluster', 'sp_path']]\n ```\n7. To generate the outputs, just call the function\n ```\n tmc_obj.store_outputs()\n ```\nThis call will generate JSON file\ncontaining the nodes and edges of the graph (`graph.json`), \none CSV file that describes the cluster\ninformation (`clusters.csv`), another CSV file containing \nthe information of each node (`node_info.csv`), and two \nJSON files. 
One relates cells to clusters \n(`cluster_list.json`), and the other has the \nfull tree structure (`cluster_tree.json`). You need this\nlast file for too-many-cells interactive (TMCI).\n\n8. If you already have the `graph.json` file you can \nload it with\n ```\n tmc_obj.load_graph(json_fname=\"some_path\")\n ```\n9. If you want to visualize your results in a dynamic\nplatform, I strongly recommend the tool\n[too-many-cells-interactive](https://github.com/schwartzlab-methods/too-many-cells-interactive?tab=readme-ov-file).\nTo use it, first make sure that you have Docker Compose and\nDocker. One simple way of getting the two is by installing\n[Docker Desktop](https://docs.docker.com/compose/install/).\nNote that with MacOS the instructions are slightly different.\nIf you use [Nix](https://search.nixos.org/packages), simply\nadd the packages `pkgs.docker` and `pkgs.docker-compose` to\nyour configuration or `home.nix` file and run\n```\nhome-manager switch\n```\n10. If you installed Docker Desktop you probably don't need to\nfollow this step. However, under some distributions the\nfollowing two commands have proven to be essential. Use\n```\nsudo dockerd\n```\nto start the daemon service for docker containers and\n```\nsudo chmod 666 /var/run/docker.sock\n```\nto let Docker read and write to that location.\n\n11. Now clone the repository \n ```\n git clone https://github.com/schwartzlab-methods/too-many-cells-interactive.git\n ```\nand store the path to the `too-many-cells-interactive`\nfolder in a variable, for example\n`path_to_tmc_interactive`. Also, you will need to identify\na column in your `AnnData.obs` data frame that has the\nlabels for the cells. Let's assume that the column name is\nstored in the variable `cell_annotations`. Lastly, you can\nprovide a port number to host your visualization, for\ninstance `port_id=1234`. 
Then, you can call the function\n ```\n tmc_obj.visualize_with_tmc_interactive(\n path_to_tmc_interactive,\n cell_annotations,\n port_id)\n ```\nThe following visualization corresponds to the data set\nwith ~90K cells (observations). ![Visualization example](https://github.com/JRR3/toomanycells/blob/main/tests/example_1.png)\n \nAnd this is the visualization for the Tabula Sapiens data set with ~480K cells.\n![Visualization example](https://github.com/JRR3/toomanycells/blob/main/tests/tmci_tabula_sapiens.png)\n\n## What is the time complexity of toomanycells (\u00e0 la Python)?\nTo answer that question we have created the following benchmark. We tested the performance of toomanycells in 20 data sets having the following number of cells: 6360, 10479, 12751, 16363, 23973, 32735, 35442, 40784, 48410, 53046, 57621, 62941, 68885, 76019, 81449, 87833, 94543, 101234, 107809, 483152. The range goes from thousands of cells to almost half a million cells. These are the results.\n![Visualization example](https://github.com/JRR3/toomanycells/blob/main/tests/log_linear_time.png)\n![Visualization example](https://github.com/JRR3/toomanycells/blob/main/tests/log_linear_iter.png)\nAs you can see, the program behaves linearly with respect to the size of the input. In other words, the observations fit the model $T = k\\cdot N^p$, where $T$ is the time to process the data set, $N$ is the number of cells, $k$ is a constant, and $p$ is the exponent. In our case $p\\approx 1$. Nice!\n\n## Cell annotation\n### CellTypist\nWhen visualizing the tree, we often are interested on\nobserving how different cell types distribute across the\nbranches of the tree. In case your AnnData object lacks\na cell annotation column in the ``obs`` data frame, or \nif you already have one but you want to try a different \nmethod, we have created a wrapper function that calls \n[CellTypist](https://www.celltypist.org/). 
Simply \nwrite\n```\n tmc_obj.annotate_with_celltypist(\n column_label_for_cell_annotations,\n )\n```\nand the ```obs``` data frame of your AnnData object will \nhave a column named like the string stored under the\n```column_label_for_cell_annotations``` variable.\nBy default we use the ```Immune_All_High``` celltypist \nmodel that contains 32 cell types. If you want to use\nanother model, simply write\n```\n tmc_obj.annotate_with_celltypist(\n column_label_for_cell_annotations,\n celltypist_model,\n )\n```\nwhere ```celltypist_model``` describes the type of model\nto use by the library. For example, if this \nvariable is equal to ```Immune_All_Low```, then the number \nof possible cell types increases to 98.\nFor a complete list of all the models, see the following\n[list](https://www.celltypist.org/models). Lastly,\nif you want to use the fact that transcriptionally similar\ncells are likely to cluster together, you can assign the cell \ntype labels on a cluster-by-cluster basis\n rather than a cell-by-cell basis. To activate this \n feature, use the call\n```\n tmc_obj.annotate_with_celltypist(\n column_label_for_cell_annotations,\n celltypist_model,\n use_majority_voting = True,\n )\n```\n### Median absolute deviation classification\nWork in progress...\n\n## Heterogeneity quantification\nImagine you want to compare the heterogeneity of cell \npopulations belonging to different branches of the \ntoomanycells tree. By branch we mean all the nodes that\nderive from a particular node, including the node \nthat defines the branch in question.\nFor example, we want to compare branch 1183 against branch 2.\n![heterogeneity](https://github.com/JRR3/toomanycells/blob/main/tests/heterogeneity.svg)\nOne way to do this is by comparing the modularity \ndistribution and the cumulative modularity for all the \nnodes that belong to each branch. \n We can do that using the following calls. 
First, for branch 1183:
```
tmc_obj.quantify_heterogeneity(
    list_of_branches=[1183],
    use_log_y=True,
    tag="branch_A",
    show_column_totals=True,
    color="blue",
    file_format="svg")
```
<br/>
<img src="https://github.com/JRR3/toomanycells/blob/main/tests/branch_A.svg"
width="500" height="420"/>
<br/>
And then for branch 2:

```
tmc_obj.quantify_heterogeneity(
    list_of_branches=[2],
    use_log_y=True,
    tag="branch_B",
    show_column_totals=True,
    color="red",
    file_format="svg")
```
<br/>
<img src="https://github.com/JRR3/toomanycells/blob/main/tests/branch_B.svg"
width="500" height="420"/>
<br/>
Note that you can include multiple nodes in the list of branches.
From these figures we observe that the higher cumulative modularity
of branch 1183 relative to branch 2 suggests that the former has a
higher degree of heterogeneity. However, relying on modularity alone
could lead to a misleading interpretation. For example, consider the
following scenario, where the numbers within the nodes indicate the
modularity at that node.
<br/>
<img src="https://github.com/JRR3/toomanycells/blob/main/tests/counter_node_modularity.svg"
width="300" height="400"/>
<br/>
In this case, scenario A has a larger cumulative modularity, but
scenario B is clearly more heterogeneous. For that reason we
recommend also computing additional diversity measures. First, we
need some notation. For all the branches passed to
`quantify_heterogeneity` in the list of branches, let $C$ be the set
of leaf nodes that belong to those branches. We consider each leaf
node a separate species, and we call the whole collection of cells an
ecosystem. For $c_i \in C$, let $|c_i|$ be the number of cells in
$c_i$ and $|C| = \sum_i |c_i|$ the total number of cells contained in
the given branches.
If we let

$$p_i = \dfrac{|c_i|}{|C|},$$

then we define the following diversity measure

$$D(q) = \left(\sum_{i=1}^{n} p_i^q \right)^{\frac{1}{1-q}}.$$

In general, the larger the value of $D(q)$, the more diverse the
collection of species. Note that $D(q=0)$ equals the total number of
species; we call this quantity the richness of the ecosystem. The
formula is undefined at $q=1$ and $q=\infty$; in those cases we take
the corresponding limits. When $q=1$, $D$ is the exponential of the
Shannon entropy

$$H = -\sum_{i=1}^{n}p_i \ln(p_i).$$

When $q=2$, $D$ is the inverse of Simpson's index

$$S = \sum_{i=1}^{n} (p_i)^2,$$

which represents the probability that two cells picked at random
belong to the same species. Hence, the higher Simpson's index, the
less diverse the ecosystem. Lastly, when $q=\infty$, $D$ is the
inverse of the largest proportion, $\max_i(p_i)$.

In the above example, for branch 1183 we obtain
```
              value
Richness  460.000000
Shannon     5.887544
Simpson     0.003361
MaxProp     0.010369
q = 0     460.000000
q = 1     360.518784
q = 2     297.562094
q = inf    96.442786
```
and for branch 2 we obtain
```
              value
Richness  280.000000
Shannon     5.500414
Simpson     0.004519
MaxProp     0.010750
q = 0     280.000000
q = 1     244.793371
q = 2     221.270778
q = inf    93.021531
```
After comparing the results using two different measures, namely
modularity and diversity, we conclude that branch 1183 is more
heterogeneous than branch 2.

## Similarity functions
So far we have assumed that the similarity matrix $S$ is computed by
calculating the cosine of the angle between each pair of
observations. Concretely, if the matrix of observations is $B$
($m\times n$), the $i$-th row of $B$ is $x = B(i,:)$, and the $j$-th
row of $B$ is $y=B(j,:)$, then the similarity between $x$ and $y$ is

$$S(x,y)=\frac{x\cdot y}{||x||_2\cdot ||y||_2}.$$

However, this is not the only way to compute a similarity matrix.
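As a sanity check, the cosine formula above can be reproduced directly with NumPy. The matrix `B` below is a made-up toy example, not data from the package.

```python
import numpy as np

# Toy observation matrix: 3 observations (rows), 4 features (columns).
B = np.array([[1.0, 2.0, 0.0, 1.0],
              [2.0, 4.0, 0.0, 2.0],
              [0.0, 1.0, 3.0, 0.0]])

# S(x, y) = (x . y) / (||x||_2 * ||y||_2) for every pair of rows:
# normalize each row to unit 2-norm, then take all pairwise dot products.
unit_rows = B / np.linalg.norm(B, axis=1, keepdims=True)
S = unit_rows @ unit_rows.T

# Row 1 is a scalar multiple of row 0, so their similarity is exactly 1.
print(np.round(S, 3))
```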
We will list all the available similarity functions and how to call
them.

### Cosine (sparse)
If your matrix is sparse, i.e., the number of nonzero entries is
proportional to the number of samples ($m$), and you want to use the
cosine similarity, then use the following instruction.
```
tmc_obj.run_spectral_clustering(
    similarity_function="cosine_sparse")
```
By default we use the Halko-Martinsson-Tropp algorithm to compute the
truncated singular value decomposition. However, the ARPACK library
(written in Fortran) is also available.
```
tmc_obj.run_spectral_clustering(
    similarity_function="cosine_sparse",
    svd_algorithm="arpack")
```
If $B$ has negative entries, it is possible to get negative entries
in $S$. This could in turn produce negative row sums for $S$, in
which case convergence to a solution could be extremely slow.
However, if you use the non-sparse version of this function, we
provide a reasonable solution to this problem.

### Cosine
If your matrix is dense and you want to use the cosine similarity,
then use the following instruction.
```
tmc_obj.run_spectral_clustering(
    similarity_function="cosine")
```
The same comment about negative entries applies here, but there is a
simple solution. While shifting the matrix of observations can
drastically change the interpretation of the data because each column
lives in a different (gene) space, shifting the similarity matrix is
a reasonable way to remove negative entries. The reason is that
similarities live in an ordered space, and shifting by a constant is
an order-preserving transformation. Equivalently, if the similarity
between $x$ and $y$ is less than the similarity between $u$ and $w$,
i.e., $S(x,y) < S(u,w)$, then $S(x,y)+s < S(u,w)+s$ for any constant
$s$.
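Numerically, the shift is just an elementwise addition, so it removes negative entries while leaving the ordering of all similarities untouched. A small NumPy illustration, using a made-up similarity matrix:

```python
import numpy as np

# A made-up dense similarity matrix with negative entries.
S = np.array([[ 1.0, -0.3,  0.5],
              [-0.3,  1.0,  0.2],
              [ 0.5,  0.2,  1.0]])

# Shift by s = 1: the cosine range [-1, 1] becomes [0, 2].
S_shifted = S + 1.0

# No negative entries remain...
print(S_shifted.min() >= 0.0)
# ...and the relative order of all similarities is preserved.
print(np.array_equal(np.argsort(S, axis=None),
                     np.argsort(S_shifted, axis=None)))
```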
The raw data have no natural order, but similarities do. To shift the
(dense) similarity matrix by $s=1$, use the following instruction.
```
tmc_obj.run_spectral_clustering(
    similarity_function="cosine",
    shift_similarity_matrix=1)
```
Note that since the range of the cosine similarity is $[-1,1]$, the
shifted range for $s=1$ becomes $[0,2]$. The shift transformation can
also be applied to any of the subsequent similarity matrices.

### Laplacian
The similarity matrix is

$$S(x,y)=\exp(-\gamma\cdot ||x-y||_1).$$

This is an example:
```
tmc_obj.run_spectral_clustering(
    similarity_function="laplacian",
    similarity_gamma=0.01)
```
This function is very sensitive to $\gamma$; an inadequate choice can
produce poor results or no convergence. If you obtain poor results,
try a smaller value for $\gamma$.

### Gaussian
The similarity matrix is

$$S(x,y)=\exp(-\gamma\cdot ||x-y||_2^2).$$

This is an example:
```
tmc_obj.run_spectral_clustering(
    similarity_function="gaussian",
    similarity_gamma=0.001)
```
As before, this function is very sensitive to $\gamma$. Note that the
norm is squared, so large differences between $x$ and $y$ are mapped
to very small quantities.

### Divide by the sum
The similarity matrix is

$$S(x,y)=1-\frac{||x-y||_p}{||x||_p+||y||_p},$$

where $p =1$ or $p=2$.
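A NumPy sketch of this similarity (my own illustration, not the package's internal code). The rows are normalized to unit $p$-norm first, and since $||x-y||_p \leq ||x||_p + ||y||_p$, the resulting values lie in $[0,1]$:

```python
import numpy as np

def div_by_sum_similarity(B, p=2):
    """S(x, y) = 1 - ||x - y||_p / (||x||_p + ||y||_p), rows pre-normalized."""
    # Normalize every row to unit p-norm.
    X = B / np.linalg.norm(B, ord=p, axis=1, keepdims=True)
    m = X.shape[0]
    S = np.empty((m, m))
    for i in range(m):
        for j in range(m):
            diff = np.linalg.norm(X[i] - X[j], ord=p)
            denom = np.linalg.norm(X[i], ord=p) + np.linalg.norm(X[j], ord=p)
            S[i, j] = 1.0 - diff / denom
    return S

B = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
S = div_by_sum_similarity(B, p=2)
print(np.round(S, 3))  # diagonal is exactly 1; off-diagonal values in [0, 1]
```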
The rows of the matrix are normalized (unit norm) before computing
the similarity. This is an example:
```
tmc_obj.run_spectral_clustering(
    similarity_function="div_by_sum")
```

## Normalization

### TF-IDF
If you want to use the inverse document frequency (IDF)
normalization, then use
```
tmc_obj.run_spectral_clustering(
    similarity_function="some_sim_function",
    use_tf_idf=True)
```
If you also want to normalize the frequencies to unit norm with the
$2$-norm, then use
```
tmc_obj.run_spectral_clustering(
    similarity_function="some_sim_function",
    use_tf_idf=True,
    tf_idf_norm="l2")
```
If instead you want to use the $1$-norm, replace "l2" with "l1".

### Simple normalization
Sometimes normalizing your matrix of observations can improve the
performance of some routines. To normalize the rows, use the
following instruction.
```
tmc_obj.run_spectral_clustering(
    similarity_function="some_sim_function",
    normalize_rows=True)
```
By default, the $2$-norm is used. To use any other $p$-norm, use
```
tmc_obj.run_spectral_clustering(
    similarity_function="some_sim_function",
    normalize_rows=True,
    similarity_norm=p)
```

## Gene expression along a path
### Introduction
Imagine you have the following tree structure after running
toomanycells.
![Tree path](https://github.com/JRR3/toomanycells/blob/main/tests/tree_path_example.svg)
Further, assume that the colors denote different classes satisfying
specific properties. We want to know how the expression of two genes,
for instance `Gene S` and `Gene T`, fluctuates as we move from node
$X$ (lower left side of the tree), which is rich in `Class B`, to
node $Y$ (upper left side of the tree), which is rich in `Class C`.
To compute such quantities, we first need to define the distance
between nodes.

### Distance between nodes
Assume we have a (parent) node $P$ with two children nodes $C_1$ and
$C_2$.
Recall that the modularity of $P$ indicates the strength of
separation between the cell populations of $C_1$ and $C_2$. A large
modularity indicates strong connections, i.e., high similarity,
within each cluster $C_i$, and weak connections, i.e., low
similarity, between $C_1$ and $C_2$. If the modularity at $P$ is
$Q(P)$, we define the distance between $C_1$ and $C_2$ as

$$d(C_1,C_2) = Q(P).$$

We also define $d(C_i,P) = Q(P)/2$. Note that with these definitions
we have

$$d(C_1,C_2)=d(C_1,P)+d(P,C_2)=Q(P)/2+Q(P)/2=Q(P),$$

as expected. Now that we know how to calculate the distance between a
node and its parent or child, let $X$ and $Y$ be two distinct nodes,
and let ${(N_{i})}_{i=0}^{n}$ be the sequence of nodes that describes
the unique path between them, satisfying:

1. $N_0 = X$,
2. $N_n=Y$,
3. $N_i$ is a direct relative of $N_{i+1}$, i.e., $N_i$ is either a
child or parent of $N_{i+1}$,
4. $N_i \neq N_j$ for $i\neq j$.

Then, the distance between $X$ and $Y$ is given by
```math
d(X,Y) = \sum_{i=0}^{n-1} d(N_{i},N_{i+1}).
```
### Gene expression
We define the expression of `Gene G` at a node $N$, $Exp(G,N)$, as
the mean expression of `Gene G` over all the cells that belong to
node $N$. Hence, given the sequence of nodes
```math
(N_i)_{i=0}^{n},
```
we can compute the corresponding gene expression sequence
```math
(E_{i})_{i=0}^{n}, \quad E_i = Exp(G,N_i).
```
### Cumulative distance
Lastly, since we are interested in plotting the gene expression as a
function of the distance to node $X$, we define the sequence of real
numbers
```math
(D_{i})_{i=0}^{n}, \quad D_{i} = d(X,N_{i}).
```
### Summary
1. The sequence of nodes between $X$ and $Y$:
$${(N_{i})}_{i=0}^{n}$$
2. The sequence of gene expression levels between $X$ and $Y$:
$${(E_{i})}_{i=0}^{n}$$
3. And the sequence of distances with respect to node $X$:
$${(D_{i})}_{i=0}^{n}$$

The final plot is simply $E_{i}$ versus $D_{i}$. An example is given
in the following figure.
### Example
![Gene expression](https://github.com/JRR3/toomanycells/blob/main/tests/exp_path_test.svg)

Note how the expression of `Gene A` is high relative to that of
`Gene B` at node $X$, and as we move towards node $Y$ the trend
inverts: `Gene B` becomes highly expressed relative to `Gene A` at
node $Y$.

## Acknowledgments
I would like to thank the Schwartz lab (GW) for letting me explore
different directions, and Christie Lau for providing multiple test
cases to improve this implementation.
"bugtrack_url": null,
"license": "BSD License",
"summary": "A python package for spectral clustering.",
"version": "1.0.50",
"project_urls": {
"Homepage": "https://github.com/JRR3/toomanycells"
},
"split_keywords": [
"toomanycells"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "ad79d5c83a0eaeeea7ebb238d4fd6eccb54e457f9fa7c4618a9defbbbf219ed3",
"md5": "3fd3356913565564b9a7c2a0b24e2d7a",
"sha256": "2244900665346ce8cf525d3da4822416dc37894130d9434733f06096c3ccb9c2"
},
"downloads": -1,
"filename": "toomanycells-1.0.50-py2.py3-none-any.whl",
"has_sig": false,
"md5_digest": "3fd3356913565564b9a7c2a0b24e2d7a",
"packagetype": "bdist_wheel",
"python_version": "py2.py3",
"requires_python": ">=3.9",
"size": 78014,
"upload_time": "2024-11-15T22:42:57",
"upload_time_iso_8601": "2024-11-15T22:42:57.305804Z",
"url": "https://files.pythonhosted.org/packages/ad/79/d5c83a0eaeeea7ebb238d4fd6eccb54e457f9fa7c4618a9defbbbf219ed3/toomanycells-1.0.50-py2.py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "867eb581ec0f0155d78a7de3a0cf257abd3dbf4fdc1dbab2605552e4aa4ab2d1",
"md5": "6c73cde4f622417dd0cd66a71cf14458",
"sha256": "f83a7f0df143ff0742ea1c1978a921c186c13d04b47a5f5b0b15f571214dd3b3"
},
"downloads": -1,
"filename": "toomanycells-1.0.50.tar.gz",
"has_sig": false,
"md5_digest": "6c73cde4f622417dd0cd66a71cf14458",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9",
"size": 3797105,
"upload_time": "2024-11-15T22:42:59",
"upload_time_iso_8601": "2024-11-15T22:42:59.325960Z",
"url": "https://files.pythonhosted.org/packages/86/7e/b581ec0f0155d78a7de3a0cf257abd3dbf4fdc1dbab2605552e4aa4ab2d1/toomanycells-1.0.50.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-11-15 22:42:59",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "JRR3",
"github_project": "toomanycells",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [
{
"name": "scanpy",
"specs": []
},
{
"name": "scipy",
"specs": []
},
{
"name": "numpy",
"specs": []
},
{
"name": "pandas",
"specs": []
},
{
"name": "networkx",
"specs": []
},
{
"name": "tqdm",
"specs": []
},
{
"name": "scikit-learn",
"specs": []
},
{
"name": "matplotlib",
"specs": []
},
{
"name": "plotly",
"specs": []
},
{
"name": "celltypist",
"specs": []
},
{
"name": "scprep",
"specs": []
}
],
"lcname": "toomanycells"
}