# too-many-cells (à la Python)
[PyPI](https://pypi.python.org/pypi/toomanycells)
### It's [Scanpy](https://github.com/scverse/scanpy) friendly!
### Please remember to [cite](https://doi.org/10.1093/gigascience/giae056) us
**A Python package for spectral clustering** based on the
powerful suite of tools named
[too-many-cells](https://github.com/GregorySchwartz/too-many-cells).
Note, **this is not a wrapper**.
In essence, you can use toomanycells to partition a data set
in the form of a matrix of integers or floating point numbers
into clusters, where members of a cluster are similar to each
other under a given similarity function.
The rows represent observations and the
columns represent the features. However, sometimes just knowing the
clusters is not sufficient. Often, we are interested in the
relationships between the clusters, and this tool can help
you visualize the clusters as leaf nodes of a tree, where the
branches illustrate the trajectories that have to be followed
to reach a particular cluster. Initially, this tool will
partition your data set into two subsets (each subset is a
node of the tree), trying to maximize
the differences between the two. Subsequently, it will
reapply that same criterion to each subset (node) and will
continue bifurcating until the
[modularity](https://en.wikipedia.org/wiki/Modularity_(networks))
of the node that is about to be partitioned becomes less
than a given threshold value ($10^{-9}$ by default),
implying that the elements belonging
to the current node are fairly homogeneous,
and consequently suggesting
that further partitioning is not warranted. Thus, when the
process finishes, you end up with a tree structure of your
data set, where the leaves represent the clusters. As
mentioned earlier, you can use this tool with any kind of
data. However, a common application is to classify cells and
therefore you can provide an
[AnnData](https://anndata.readthedocs.io/en/latest/) object.
You can read about this application in this [Nature Methods
paper](https://www.nature.com/articles/s41592-020-0748-5).
- Free software: GNU AFFERO GENERAL PUBLIC LICENSE
- Documentation: https://JRR3.github.io/toomanycells
## Dependencies
Version 1.0.40 no longer requires Graphviz. Thus,
no need to install a separate C library!
## Virtual environments
To have control of your working environment
you can use a Python virtual environment, which
can help you keep only the packages you need in
one location. In bash or zsh you can simply
type
```
python -m venv /path/to/new/virtual/environment
```
To activate it you simply need
```
source pathToTheVirtualEnvironment/bin/activate
```
To deactivate the environment use the intuitive
```
deactivate
```
## Installation
Caveat: I have tested the following steps in
**Python 3.9.18**. For other versions, things might
be different.
In theory, just typing
```
pip install toomanycells
```
in your home or custom environment
should work. However, for reproducibility,
here is the list of packages I had in my virtual
environment in 2024:
```
anndata==0.10.9
array_api_compat==1.8
celltypist==1.6.3
certifi==2024.8.30
charset-normalizer==3.3.2
click==8.1.7
contourpy==1.3.0
cycler==0.12.1
et-xmlfile==1.1.0
exceptiongroup==1.2.2
fonttools==4.54.1
get-annotations==0.1.2
h5py==3.11.0
idna==3.10
igraph==0.11.6
importlib_resources==6.4.5
joblib==1.4.2
kiwisolver==1.4.7
legacy-api-wrap==1.4
leidenalg==0.10.2
llvmlite==0.43.0
matplotlib==3.9.2
natsort==8.4.0
networkx==3.2.1
numba==0.60.0
numpy==2.0.2
openpyxl==3.1.5
packaging==24.1
pandas==2.2.3
patsy==0.5.6
pillow==10.4.0
plotly==5.24.1
pynndescent==0.5.13
pyparsing==3.1.4
python-dateutil==2.9.0.post0
pytz==2024.2
requests==2.32.3
scanpy==1.10.3
scikit-learn==1.5.2
scipy==1.13.1
seaborn==0.13.2
session-info==1.0.0
six==1.16.0
statsmodels==0.14.3
stdlib-list==0.10.0
tenacity==9.0.0
texttable==1.7.0
threadpoolctl==3.5.0
toomanycells==1.0.52
tqdm==4.66.5
tzdata==2024.2
umap-learn==0.5.6
urllib3==2.2.3
zipp==3.20.2
```
If you want to install an updated
version, then please use the following
approach.
```
pip install -U --no-deps toomanycells
```
Note that the `--no-deps` flag keeps all the
dependencies as they are. Otherwise they would
get upgraded, which could potentially **break**
the installation.
To install packages from a list of requirements,
i.e., the packages you want installed with a specific
version, use
```
pip install -r requirements.txt
```
where `requirements.txt` is a list
like the one shown in the block of code above.
Make sure you have the latest version. If not,
run the previous command again.
## Quick run (needs to be updated)
If you want to see a concrete example of how
to use toomanycells, check out the jupyter
notebook [demo](./toomanycells_demo.ipynb).
## Quick plotting
If you are already familiar with toomanycells and
want to generate a quick plot (an SVG) of your tree after
calling
```
tmc_obj.run_spectral_clustering()
```
then use the following call
```
tmc_obj.store_outputs(
cell_ann_col="name_of_the_column",
plot_tree=True,
)
```
or
```
tmc_obj.store_outputs(
cell_ann_col="name_of_the_column",
plot_tree=True,
draw_modularity=False,
draw_node_numbers=False,
)
```
with the appropriate flags.
If you already have the outputs and you
just want to plot, then simply call
```
tmc_obj.easy_plot(
cell_ann_col="name_of_the_column",
)
```
or
```
tmc_obj.easy_plot(
cell_ann_col="name_of_the_column",
draw_modularity=False,
draw_node_numbers=False,
)
```
with the appropriate flags,
where `name_of_the_column` is the name of the AnnData.obs
column that contains the cell annotations.
This function will look for the outputs in the
folder that you defined for the `output_directory`
as shown in step 2.
Note that this function relies on
[too-many-cells](https://gregoryschwartz.github.io/too-many-cells/)
(à la Haskell). So you need to have it installed. If you work
within the cluster of your organization, it might have
already been installed, and you could load it as follows.
```
module add too-many-cells
```
Otherwise, I recommend installing it with Nix.
## Generating a matrix market file
Sometimes you want to generate a matrix market
file from a set of genes so that you can visualize
them with other tools. The function that will help you
with this is called `create_data_for_tmci`.
Just for context,
imagine you are interested in two genes, `COL1A1`, and
`TCF21`. Moreover, imagine that you also want to include
in your matrix another feature located in the `obs` data
frame called `total_counts`. Finally, assume that the
column that contains the labels for your cells is called
`cell_annotations`. Please use this as a template for your
specific needs.
```
from toomanycells import TooManyCells as tmc
tmc_obj = tmc(A)
tmc_obj.run_spectral_clustering()
tmc_obj.store_outputs(cell_ann_col="cell_annotations")
list_of_genes = []
list_of_genes.append("COL1A1")
list_of_genes.append("TCF21")
list_of_genes.append("total_counts")
tmc_obj.create_data_for_tmci(list_of_genes=list_of_genes)
```
These lines of code will produce
a folder named `tmci_mtx_data` with the expected outputs.
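For reference, the Matrix Market (`.mtx`) files involved here can be written and read with scipy. The following toy round trip, with invented file names and contents, sketches what such a directory holds:

```python
# Sketch: write a tiny matrix in Matrix Market format with scipy,
# alongside barcodes.tsv (observations) and genes.tsv (features),
# then read it back. All file contents here are invented.
import os
import tempfile
import numpy as np
from scipy.io import mmread, mmwrite
from scipy.sparse import coo_matrix

input_directory = tempfile.mkdtemp()

# 3 observations (barcodes) x 2 features (genes)
M = coo_matrix(np.array([[1.0, 0.0],
                         [0.0, 2.0],
                         [3.0, 0.0]]))
mmwrite(os.path.join(input_directory, "matrix.mtx"), M)

with open(os.path.join(input_directory, "barcodes.tsv"), "w") as f:
    f.write("\n".join(["cell_1", "cell_2", "cell_3"]))
with open(os.path.join(input_directory, "genes.tsv"), "w") as f:
    f.write("\n".join(["COL1A1", "TCF21"]))

# Verify the round trip.
M_back = mmread(os.path.join(input_directory, "matrix.mtx"))
print(M_back.toarray())
```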
## Starting from scratch
1. First import the module as follows
```
from toomanycells import TooManyCells as tmc
```
2. If you already have an
[AnnData](https://anndata.readthedocs.io/en/latest/)
object `A` loaded into memory, then you can create a
TooManyCells object with
```
tmc_obj = tmc(A)
```
In this case the output folder will be called
`tmc_outputs`. However, if you want the output folder to
be a particular directory, then you can specify the path
as follows.
```
tmc_obj = tmc(A, output_directory)
```
3. If instead of providing an AnnData object you want to
provide the directory where your data is located, you can use
the syntax
```
tmc_obj = tmc(input_directory, output_directory)
```
4. If your input directory has a file in the [matrix market
format](https://math.nist.gov/MatrixMarket/formats.html),
then you have to specify this information by using the
following flag
```
tmc_obj = tmc(input_directory,
output_directory,
input_is_matrix_market=True)
```
Under this scenario, the `input_directory` must contain a
`.mtx` file, a `barcodes.tsv` file (the observations), and
a `genes.tsv` (the features).
5. Once your data has been loaded successfully,
you can start the clustering process with the following command
```
tmc_obj.run_spectral_clustering()
```
On my desktop computer, processing a data set with ~90K
cells (observations) and ~30K genes (features) took a
little less than 6 minutes in 1809 iterations. For a
larger data set like the [Tabula
Sapiens](https://figshare.com/articles/dataset/Tabula_Sapiens_release_1_0/14267219?file=40067134)
with 483,152 cells and 58,870 genes (14.51 GB in zip
format), the total time was about 50 minutes on the same
computer.
6. At the end of the clustering process the `.obs` data frame of the AnnData object should have two columns named `['sp_cluster', 'sp_path']` which contain the cluster labels and the path from the root node to the leaf node, respectively.
```
tmc_obj.A.obs[['sp_cluster', 'sp_path']]
```
7. To generate the outputs, just call the function
```
tmc_obj.store_outputs()
```
or
```
tmc_obj.store_outputs(
cell_ann_col="name_of_the_column",
plot_tree=True,
)
```
to include a plot of the graph.
This call will generate a JSON file
containing the nodes and edges of the graph (`graph.json`),
one CSV file that describes the cluster
information (`clusters.csv`), another CSV file containing
the information of each node (`node_info.csv`), and two
JSON files. One relates cells to clusters
(`cluster_list.json`), and the other has the
full tree structure (`cluster_tree.json`). You need this
last file for too-many-cells interactive (TMCI).
8. If you already have the `graph.json` file you can
load it with
```
tmc_obj.load_graph(json_fname="some_path")
```
## Visualization with TMCI
9. If you want to visualize your results in a dynamic
platform, I strongly recommend the tool
[too-many-cells-interactive](https://github.com/schwartzlab-methods/too-many-cells-interactive?tab=readme-ov-file).
To use it, first make sure that you have Docker Compose and
Docker. One simple way of getting the two is by installing
[Docker Desktop](https://docs.docker.com/compose/install/).
Note that with macOS the instructions are slightly different.
If you use [Nix](https://search.nixos.org/packages), simply
add the packages `pkgs.docker` and `pkgs.docker-compose` to
your configuration or `home.nix` file and run
```
home-manager switch
```
10. If you installed Docker Desktop you probably don't need to
follow this step. However, under some distributions the
following two commands have proven to be essential. Use
```
sudo dockerd
```
to start the daemon service for docker containers and
```
sudo chmod 666 /var/run/docker.sock
```
to let Docker read and write to that location.
11. Now clone the repository
```
git clone https://github.com/schwartzlab-methods/too-many-cells-interactive.git
```
and store the path to the `too-many-cells-interactive`
folder in a variable, for example
`path_to_tmc_interactive`. Also, you will need to identify
a column in your `AnnData.obs` data frame that has the
labels for the cells. Let's assume that the column name is
stored in the variable `cell_annotations`. Lastly, you can
provide a port number to host your visualization, for
instance `port_id=1234`. Then, you can call the function
```
tmc_obj.visualize_with_tmc_interactive(
path_to_tmc_interactive,
cell_annotations,
port_id)
```
The following visualization corresponds to the data set
with ~90K cells (observations). 
And this is the visualization for the Tabula Sapiens data set with ~480K cells.

## Running TMCI independently
In case you already have the outputs for TMCI, but you
want to visualize a specific set of genes on top of
your tree, you are going to need the expression matrix
corresponding to those genes in the matrix market format.
You will also need a list of genes and the barcodes.
All of that can be easily achieved with toomanycells
(à la Python) after loading your matrix or AnnData
object. If you are interested in only a few genes,
you can call
```
tmc_obj.create_data_for_tmci(
list_of_genes = ["G1","G2",...,"Gn"]
)
```
where `G1`,`G2`,...,`Gn`, are the labels
of the genes of interest. If instead you have
a table of genes stored as a text file, then
use the call
```
tmc_obj.create_data_for_tmci(
path_to_genes = "path/to/genes.csv"
)
```
Lastly, if you want to write all the available genes
to a matrix, then simply call
```
tmc_obj.create_data_for_tmci()
```
but note that this could take a considerable
amount of time, depending on how many genes
are in your matrix.
After calling this function, you will
have a new folder called `tmci_mtx_data`
which will contain the aforementioned files.
It is also important to mention that you need
a file with the cell labels. With all the files in place,
you can launch TMCI with a call like
```
./start-and-load.sh \
--matrix-dir /path_to/tmci_mtx_data \
--tree-path /path_to/cluster_tree.json \
--label-path /path_to/cell_annotations.csv \
--port 2025 \
--debug
```
## What is the time complexity of toomanycells (à la Python)?
To answer that question we have created the following
benchmark. We tested the performance of toomanycells in 20
data sets having the following number of cells: 6,360, 10,479,
12,751, 16,363, 23,973, 32,735, 35,442, 40,784, 48,410, 53,046,
57,621, 62,941, 68,885, 76,019, 81,449, 87,833, 94,543, 101,234,
107,809, and 483,152. The range goes from thousands of cells to
almost half a million cells.
These are the results.


As you can see, the program behaves linearly with respect to the size of the input. In other words, the observations fit the model $T = k\cdot N^p$, where $T$ is the time to process the data set, $N$ is the number of cells, $k$ is a constant, and $p$ is the exponent. In our case $p\approx 1$. Nice!
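The exponent $p$ can be estimated from timing data with a linear fit in log-log space. A sketch with synthetic timings (the timing values are invented; only the cell counts are taken from the benchmark above):

```python
import numpy as np

# Synthetic timings following T = k * N^p exactly, with k = 1e-3 and p = 1.
N = np.array([6_360, 23_973, 57_621, 107_809, 483_152], dtype=float)
T = 1e-3 * N

# log T = log k + p * log N, so the slope of the log-log fit estimates p.
p, log_k = np.polyfit(np.log(N), np.log(T), 1)
print(round(p, 3))  # ~1.0: linear scaling
```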
## Cell annotation
### CellTypist
When visualizing the tree, we are often interested in
observing how different cell types distribute across the
branches of the tree. In case your AnnData object lacks
a cell annotation column in the ``obs`` data frame, or
if you already have one but you want to try a different
method, we have created a wrapper function that calls
[CellTypist](https://www.celltypist.org/). Simply
write
```
tmc_obj.annotate_with_celltypist(
column_label_for_cell_annotations,
)
```
and the ```obs``` data frame of your AnnData object will
have a column named like the string stored under the
```column_label_for_cell_annotations``` variable.
By default we use the ```Immune_All_High``` celltypist
model that contains 32 cell types. If you want to use
another model, simply write
```
tmc_obj.annotate_with_celltypist(
column_label_for_cell_annotations,
celltypist_model,
)
```
where ```celltypist_model``` specifies the model
for the library to use. For example, if this
variable is equal to ```Immune_All_Low```, then the number
of possible cell types increases to 98.
For a complete list of all the models, see the following
[list](https://www.celltypist.org/models). Lastly,
if you want to use the fact that transcriptionally similar
cells are likely to cluster together, you can assign the cell
type labels on a cluster-by-cluster basis
rather than a cell-by-cell basis. To activate this
feature, use the call
```
tmc_obj.annotate_with_celltypist(
column_label_for_cell_annotations,
celltypist_model,
use_majority_voting = True,
)
```
## Filtering cells
If you want to select cells
that belong to a class defined
within a specific column of the
`.obs` dataframe, you can use the
following call.
```
A = tmc_obj.filter_for_cells_with_property(
"cell_type", "Neuro-2a")
```
In this case all cells that have the label `Neuro-2a`
within the column `cell_type` in the `.obs` dataframe
will be selected, and the resulting AnnData object `A`
will only have these cells.
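Under the hood this is a boolean mask on the annotation column. A pandas-only sketch of the same idea (the dataframe stands in for `.obs`; the data are invented):

```python
import pandas as pd

# Toy stand-in for the AnnData .obs dataframe.
obs = pd.DataFrame({"cell_type": ["Neuro-2a", "Fibroblast", "Neuro-2a"]},
                   index=["cell_1", "cell_2", "cell_3"])

# Keep only the cells labeled Neuro-2a in the cell_type column.
mask = obs["cell_type"] == "Neuro-2a"
selected = obs[mask]
print(selected.index.tolist())
```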
## Graph operations
### Selecting cells through branches
Imagine you have a tree structure
of your data like the one shown below.

If you want to isolate all the cells that belong to branches
261 and 2, and produce an AnnData object with those cells,
simply use the following call
```
adata = tmc_obj.isolate_cells_from_branches(
list_of_branches=[261,2])
```
If you have a CSV file that specifies the branches,
then use the following call
```
adata = tmc_obj.isolate_cells_from_branches(
path_to_csv_file="list_of_branches.csv",
branch_column="node",
)
```
The name of the column that contains the branches
or nodes is specified through the keyword
`branch_column`. Lastly, if you want to store
a copy of the indices, use the following call
```
adata = tmc_obj.isolate_cells_from_branches(
path_to_csv_file="list_of_branches.csv",
branch_column="node",
generate_cell_id_file=True,
)
```
### Mean expression of a branch
Imagine we have the following tree.

If you want to quantify the mean expression of the marker
CD9 on branch 261, you can use the following call
```
m_exp = tmc_obj.compute_cluster_mean_expression(
node=261, genes=["CD9"])
```
and you would obtain 12.791.

The above plot suggests that Neuro-2a cells
highly express this marker.
If instead we were interested in a different marker, like
SDC1, this would be the corresponding color map
expression across the nodes.

The above plot also illustrates that some Neuro-2a cells
are rich in SDC1.
### Median absolute deviation classification
First we introduce the concept of median absolute
deviation. Imagine you have a list of $n$ observations
$Z = [z_0,z_1,\ldots,z_{n-1}]$. Let
$\mathcal{M}:\mathbb{R}^n \to \mathbb{R}$ be
the function that computes the median of a list.
Consider a new list $K=[k_0,k_1,\ldots,k_{n-1}]$, where
$k_i = \left| z_i - \mathcal{M}(Z) \right|$. Then,
the median absolute deviation of $Z$ is the
median of the absolute differences between the
original value and the median. Mathematically,
$\text{MAD}(Z) = \mathcal{M}(K)$. For this section
we will be indicating the expression of a gene in terms
of MADs. The reason is that we want to classify cells,
and using quantities that capture the dispersion of the
data is a convenient approach for that purpose.
An important point to mention is that
for each gene,
instead of considering the raw expression values
across all cells as the elements of the list $Z$,
we use the mean expression for each node of the tree.
In other words, for a given gene,
the element $z_k$ represents
the mean expression of that gene for node $k$. Thus,
$n$ indicates the number of nodes in the tree.
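A minimal numpy implementation of the definition above (the example list is invented):

```python
import numpy as np

def median_absolute_deviation(z):
    """MAD(Z) = median of |z_i - median(Z)|."""
    z = np.asarray(z, dtype=float)
    return float(np.median(np.abs(z - np.median(z))))

# Toy node-level mean expressions of one gene (z_k for node k).
z = np.array([1.0, 2.0, 2.0, 3.0, 14.0])
mad = median_absolute_deviation(z)      # median is 2, MAD is 1
# Express each node in MAD units above/below the median.
z_in_mads = (z - np.median(z)) / mad    # the last node sits 12 MADs above
```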
Based on the previous example, now imagine
you want to find cells whose expression
of two markers, CD9 and SDC1, is 1 MAD above the median.
First, you need a CSV file containing the following
information.
```
Marker Cell Threshold Direction
CD9 Neuro-2a 1.0 Above
SDC1 Neuro-2a 1.0 Above
```
Let's call it `marker_and_cell_info.csv`.
**Note**: For this discussion the cell types indicated in
the `Cell` column are not relevant and will not be
used. We quantify the mean expression of those
markers for every node of the tree and store
that information within each node.
We can do that using the following call.
```
tmc_obj.populate_tree_with_mean_expression_for_all_markers(
cell_marker_path="marker_and_cell_info.csv")
```
Then we compute basic statistics for each marker using the
following function
```
tmc_obj.compute_node_expression_metadata()
```
These are
the statistics associated with those markers.
```
median mad min max min_mad max_mad delta
CD9 3.080538 1.918258 0.000890 22.944445 -1.605441 10.355182 0.797375
SDC1 2.989691 1.165005 0.001669 6.639456 -2.564814 3.132832 0.379843
```
Note that the maximum expression of CD9
is about 10 MADs above the median, while that of
SDC1 is only about 3 MADs above the median.
The plot corresponding to the distribution of those
markers across all nodes can be generated through this call
```
tmc_obj.plot_marker_distributions()
```
The plots will all be contained in a dynamic HTML file.
Here are some examples.
This is the distribution for CD9:

and with TooManyCellsInteractive

The distribution for SDC1 looks as follows.

If we want to isolate the cells that satisfy the conditions
```
Marker Cell Threshold Direction
CD9 Neuro-2a 1.0 Above
SDC1 Neuro-2a 1.0 Above
```
we can use the call
```
tmc_obj.select_cells_based_on_inequalities(
cell_ann_col="cell_type")
```
where the cell annotation column in the `.obs`
dataframe is specified through the
`cell_ann_col` keyword. This function will
return an AnnData object with all the
cells satisfying all the constraints.
This function will also produce
multiple CSV files, one for each inequality
specified through the file of constraints.
For example,
one for all cells whose
expression of CD9 was above 1 MAD of the median
expression of CD9,
one for all cells whose
expression of SDC1 was above 1 MAD of the median
expression of SDC1,
and one corresponding to the intersection
of all of the above. The above function will
modify the original AnnData object by adding to
the `.obs` dataframe a column
named `Intersection`
indicating with a boolean value
if a cell satisfies all the constraints.
Lastly, if the number of markers is less than or
equal to three, then the `.obs` dataframe will
include a column classifying the cells
based on whether or not they highly express
each of the markers. For instance,
in this example we obtained the following outputs.
```
Class
CD9-Low-SDC1-Low 28729
CD9-High-SDC1-Low 6072
CD9-High-SDC1-High 4223
CD9-Low-SDC1-High 2058
Name: count, dtype: int64
Class
CD9-Low-SDC1-Low 0.699309
CD9-High-SDC1-Low 0.147802
CD9-High-SDC1-High 0.102794
CD9-Low-SDC1-High 0.050095
Name: proportion, dtype: float64
```
This indicates that the majority of the cells,
i.e., about `70%` of cells,
are low in CD9 and low in SDC1, and about `10%` of cells
are high in both. Note that in this particular example
when we say high it means
that the expression is above 1 MAD from the median, and
low is the complement of that.
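Counts and proportions like the ones above are the standard pandas `value_counts` output. A sketch with a toy `Class` column (the column name is taken from the output above; the data are invented):

```python
import pandas as pd

# Toy stand-in for the .obs dataframe after classification.
obs = pd.DataFrame({"Class": ["CD9-Low-SDC1-Low"] * 7
                            + ["CD9-High-SDC1-Low"] * 2
                            + ["CD9-High-SDC1-High"] * 1})

counts = obs["Class"].value_counts()                      # absolute counts
proportions = obs["Class"].value_counts(normalize=True)   # fractions
print(counts)
print(proportions)
```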

## Heterogeneity quantification
Imagine you want to compare the heterogeneity of cell
populations belonging to different branches of the
toomanycells tree. By branch we mean all the nodes that
derive from a particular node, including the node
that defines the branch in question.
For example, we want to compare branch 1183 against branch 2.

One way to do this is by comparing the modularity
distribution and the cumulative modularity for all the
nodes that belong to each branch.
We can do that using the following calls. First for
branch 1183
```
tmc_obj.quantify_heterogeneity(
list_of_branches=[1183],
use_log_y=True,
tag="branch_A",
show_column_totals=True,
color="blue",
file_format="svg")
```
<br/>
<img src="https://github.com/JRR3/toomanycells/blob/main/tests/branch_A.svg"
width="500" height="420"/>
<br/>
And then for branch 2
```
tmc_obj.quantify_heterogeneity(
list_of_branches=[2],
use_log_y=True,
tag="branch_B",
show_column_totals=True,
color="red",
file_format="svg")
```
<br/>
<img src="https://github.com/JRR3/toomanycells/blob/main/tests/branch_B.svg"
width="500" height="420"/>
<br/>
Note that you can include multiple nodes in the
list of branches.
From these figures we observe that the higher cumulative
modularity of branch 1183 with respect to branch 2 suggests
that the former has a higher degree of heterogeneity.
However, relying solely on modularity could lead to a
misleading interpretation. For example, consider the
following scenario where the numbers within the nodes
indicate the modularity at that node.
<br/>
<img src="https://github.com/JRR3/toomanycells/blob/main/tests/counter_node_modularity.svg"
width="300" height="400"/>
<br/>
In this case, scenario A has a larger cumulative modularity,
but we note that scenario B is more heterogeneous.
For that reason we recommend also computing additional
diversity measures. First, we need some notation.
For all the branches belonging to the list of branches in the
above function
`quantify_heterogeneity`, let $C$ be
the set of leaf nodes that belong to those branches.
We consider each leaf node as a separate species, and we
call the whole collection of cells an ecosystem.
For $c_i \in C$, let $|c_i|$ be the number of cells in
$c_i$ and $|C| = \sum_i |c_i|$ the total number
of cells contained in the given branches. If we let
$$p_i = \dfrac{|c_i|}{|C|},$$
then we define the following diversity measure
$$D(q) = \left(\sum_{i=1}^{n} p_i^q \right)^{\frac{1}{1-q}}.
$$
In general, the larger the value of $D(q)$, the more diverse
the collection of species. Note that $D(q=0)$
describes the total number of species. We
call this quantity the richness of the ecosystem.
When $q=1$, $D$ is the exponential of the Shannon entropy
$$H = -\sum_{i=1}^{n}p_i \ln(p_i).$$
When $q=2$, $D$ is
the inverse of the Simpson's index
$$S = \sum_{i=1}^{n} (p_i)^2,$$
which represents the
probability that two cells picked at random belong
to the same species. Hence, the higher the Simpson's
index, the less diverse the ecosystem.
Lastly, when $q=\infty$, $D$ is the inverse of
the largest proportion $\max_i(p_i)$.
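The measures above can be computed directly from the proportions $p_i$. A numpy sketch handling the $q=1$ and $q=\infty$ limits explicitly:

```python
import numpy as np

def hill_diversity(p, q):
    """D(q) = (sum_i p_i^q)^(1/(1-q)), with the q=1 and q=inf limits."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    if q == 1:              # limit: exponential of the Shannon entropy
        return float(np.exp(-np.sum(p * np.log(p))))
    if q == np.inf:         # limit: inverse of the largest proportion
        return float(1.0 / p.max())
    return float(np.sum(p ** q) ** (1.0 / (1.0 - q)))

# Four equally abundant species: every D(q) equals the richness, 4.
p = [0.25, 0.25, 0.25, 0.25]
for q in (0, 1, 2, np.inf):
    print(q, hill_diversity(p, q))
```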
In the above example, for branch 1183 we obtain
```
value
Richness 460.000000
Shannon 5.887544
Simpson 0.003361
MaxProp 0.010369
q = 0 460.000000
q = 1 360.518784
q = 2 297.562094
q = inf 96.442786
```
and for branch 2 we obtain
```
value
Richness 280.000000
Shannon 5.500414
Simpson 0.004519
MaxProp 0.010750
q = 0 280.000000
q = 1 244.793371
q = 2 221.270778
q = inf 93.021531
```
After comparing the results using two different measures,
namely, modularity and diversity, we conclude that branch
1183 is more heterogeneous than branch 2.
## Similarity functions
So far we have assumed that the similarity matrix
$S$ is
computed by calculating the cosine of the angle
between each observation. Concretely, if the
matrix of observations is $B$ ($m\times n$), the $i$-th row
of $B$ is $x = B(i,:)$, and the $j$-th row of $B$
is $y=B(j,:)$, then the similarity between $x$ and
$y$ is
$$S(x,y)=\frac{x\cdot y}{||x||_2\cdot ||y||_2}.$$
However, this is not the only way to compute
a similarity matrix. We will list all the available
similarity functions and how to call them.
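For a dense matrix of observations $B$, the formula above yields the whole similarity matrix in a couple of numpy lines (the matrix values are invented):

```python
import numpy as np

def cosine_similarity_matrix(B):
    """S[i, j] = (B[i] . B[j]) / (||B[i]||_2 * ||B[j]||_2)."""
    norms = np.linalg.norm(B, axis=1, keepdims=True)
    X = B / norms        # unit-norm rows
    return X @ X.T

B = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
S = cosine_similarity_matrix(B)
# Diagonal entries are 1; S[0, 2] is cos(45 degrees) ~ 0.7071.
```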
### Cosine (sparse)
If your matrix is sparse, i.e., the number of nonzero
entries is proportional to the number of samples ($m$),
and you want to use the cosine similarity, then use the
following instruction.
```
tmc_obj.run_spectral_clustering(
similarity_function="cosine_sparse")
```
By default we use the ARPACK library (written in Fortran)
to compute the truncated singular value decomposition.
The Halko-Martinsson-Tropp algorithm
is also available, though it is not deterministic. The
algorithm is selected through the `svd_algorithm` keyword.
```
tmc_obj.run_spectral_clustering(
similarity_function="cosine_sparse",
svd_algorithm="arpack")
```
If $B$ has negative entries, it is possible
to get negative entries for $S$. This could
in turn produce negative row sums for $S$.
If that is the case,
the convergence to a solution could be extremely slow.
However, if you use the non-sparse version of this
function, we provide a reasonable solution to this problem.
### Dimension-adaptive Euclidean Norm (DaEN)
If your data consists of points whose Euclidean
norm varies across multiple length scales, then
one option is to use a similarity function that
can adapt to those changes in magnitude.
Before I explain it in detail, here is how
you can call it
```
tmc_obj.run_spectral_clustering(
similarity_function="norm_sparse")
```
### Cosine
If your matrix is dense,
and you want to use the cosine similarity, then use the
following instruction.
```
tmc_obj.run_spectral_clustering(
similarity_function="cosine")
```
The same comment about negative entries applies here.
However, there is a simple solution. While shifting
the matrix of observations can drastically change the
interpretation of the data because each column lives
in a different (gene) space, shifting the similarity
matrix is actually a reasonable method to remove negative
entries. The reason is that similarities live in an
ordered space and shifting by a constant is
an order-preserving transformation. Equivalently,
if the similarity between $x$ and $y$ is
less than the similarity between $u$ and $w$, i.e.,
$S(x,y) < S(u,w)$, then $S(x,y)+s < S(u,w)+s$ for
any constant $s$. The raw data
have no natural order, but similarities do.
To shift the (dense) similarity matrix by $s=1$, use the
following instruction.
```
tmc_obj.run_spectral_clustering(
similarity_function="cosine",
shift_similarity_matrix=1)
```
Note that since the range of the cosine similarity
is $[-1,1]$, the shifted range for $s=1$ becomes $[0,2]$.
The shift transformation can also be applied to any of
the subsequent similarity matrices.
### Laplacian
The similarity matrix is given by
$$
S(x,y)=\exp(-\gamma\cdot \left\lVert x-y \right\rVert _1).
$$
This is an example:
```
tmc_obj.run_spectral_clustering(
similarity_function="laplacian",
similarity_gamma=0.01)
```
This function is very sensitive to $\gamma$. Hence, an
inadequate choice can result in poor results or
no convergence. If you obtain poor results, try using
a smaller value for $\gamma$.
### Gaussian
The similarity matrix is given by
$$
S(x,y)=\exp(-\gamma\cdot \left\lVert x-y\right\rVert _2^2).
$$
This is an example:
```
tmc_obj.run_spectral_clustering(
similarity_function="gaussian",
similarity_gamma=0.001)
```
As before, this function is very sensitive to $\gamma$.
Note that the norm is squared. Thus, it transforms
big differences between $x$ and $y$ into very small
quantities.
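To see this sensitivity concretely, compare the Gaussian similarity of one pair of points under two values of $\gamma$ (numbers invented for illustration):

```python
import numpy as np

def gaussian_similarity(x, y, gamma):
    """S(x, y) = exp(-gamma * ||x - y||_2^2)."""
    d2 = np.sum((np.asarray(x) - np.asarray(y)) ** 2)
    return float(np.exp(-gamma * d2))

x = np.array([0.0, 0.0])
y = np.array([10.0, 0.0])                       # squared distance = 100
print(gaussian_similarity(x, y, gamma=0.001))   # exp(-0.1) ~ 0.905
print(gaussian_similarity(x, y, gamma=1.0))     # exp(-100), numerically tiny
```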
### Divide by the sum
The similarity matrix is given by
$$
S(x,y)=1-\frac{
\left\lVert x-y \right\rVert_p
}{
\left\lVert x \right\rVert_p +
\left\lVert y \right\rVert_p
},
$$
where $p =1$ or $p=2$. The rows
of the matrix are normalized (unit norm)
before computing the similarity.
This is an example:
```
tmc_obj.run_spectral_clustering(
similarity_function="div_by_sum")
```
## Normalization
### TF-IDF
If you want to use the inverse document
frequency (IDF) normalization, then use
```
tmc_obj.run_spectral_clustering(
similarity_function="some_sim_function",
use_tf_idf=True)
```
If you also want to normalize the frequencies to
unit norm with the $2$-norm, then use
```
tmc_obj.run_spectral_clustering(
similarity_function="some_sim_function",
use_tf_idf=True,
tf_idf_norm="l2")
```
If instead you want to use the $1$-norm, then
replace "l2" with "l1".
### Simple normalization
Sometimes normalizing your matrix
of observations can improve the
performance of some routines.
To normalize the rows, use the following instruction.
```
tmc_obj.run_spectral_clustering(
similarity_function="some_sim_function",
normalize_rows=True)
```
By default, the $2$-norm is used. To
use any other $p$-norm, use
```
tmc_obj.run_spectral_clustering(
similarity_function="some_sim_function",
normalize_rows=True,
similarity_norm=p)
```
## Gene expression along a path
### Introduction
Imagine you have the following tree structure after
running toomanycells.

Further, assume that the colors denote different classes
satisfying specific properties. We want to know how the
expression of two genes, for instance, `Gene S` and `Gene T`,
fluctuates as we move from node $X$
(lower left side of the tree), which is rich
in `Class B`, to node $Y$ (upper left side of the tree),
which is rich in `Class
C`. To compute such quantities, we first need to define the
distance between nodes.
### Distance between nodes
Assume we have a (parent) node $P$ with
two child nodes $C_1$ and $C_2$. Recall that the modularity of
$P$ indicates the strength of separation between the cell
populations of $C_1$ and $C_2$.
A large modularity indicates strong connections,
i.e., high similarity, within each cluster $C_i$,
and also implies weak connections, i.e., low similarity, between
$C_1$ and $C_2$. If the modularity at $P$ is $Q(P)$, we define
the distance between $C_1$ and $C_2$ as
$$d(C_1,C_2) = Q(P).$$
We also define $d(C_i,P) = Q(P)/2$. Note that with
those definitions we have that
$$d(C_1,C_2)=d(C_1,P) +d(P,C_2)=Q(P)/2+Q(P)/2=Q(P),$$
as expected. Now that we know how to calculate the
distance between a node and its parent or child, let
$X$ and $Y$ be two distinct nodes, and let
${(N_{i})}_{i=0}^{n}$ be the sequence of nodes that describes
the unique path between them satisfying:

1. $N_0 = X$,
2. $N_n=Y$,
3. $N_i$ is a direct relative of $N_{i+1}$, i.e.,
$N_i$ is either a child or parent of $N_{i+1}$,
4. $N_i \neq N_j$ for $i\neq j$.

Then, the distance between $X$ and $Y$ is given by
```math
d(X,Y) =
\sum_{i=0}^{n-1} d(N_{i},N_{i+1}).
```
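This path distance can be computed directly from parent pointers and per-node modularities. Below is a hedged sketch on a hypothetical tree (the data structures are illustrative, not the package's internals): each child-parent edge has weight $Q(\text{parent})/2$, and $d(X,Y)$ sums the edge weights along the unique path through the lowest common ancestor.

```python
# Hypothetical tree: parent[node] -> its parent; Q[node] -> modularity
# at that (internal) node. Node 0 is the root.
parent = {1: 0, 2: 0, 3: 1, 4: 1, 5: 2, 6: 2}
Q = {0: 0.4, 1: 0.2, 2: 0.3}

def ancestors(node):
    """Path from node up to the root, inclusive."""
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path

def path_between(x, y):
    """Unique path x -> ... -> y through the lowest common ancestor."""
    up_x, up_y = ancestors(x), ancestors(y)
    common = set(up_x) & set(up_y)
    lca = next(n for n in up_x if n in common)
    down = up_y[:up_y.index(lca)]
    return up_x[:up_x.index(lca) + 1] + down[::-1]

def distance(x, y):
    """d(X, Y): each child-parent step contributes Q(parent)/2."""
    nodes = path_between(x, y)
    total = 0.0
    for a, b in zip(nodes, nodes[1:]):
        p = b if parent.get(a) == b else a  # which endpoint is the parent
        total += Q[p] / 2
    return total

print(distance(3, 4))  # siblings under node 1: Q(1)/2 + Q(1)/2 = 0.2
print(distance(3, 5))  # 3->1->0->2->5: 0.1 + 0.2 + 0.2 + 0.15 = 0.65
```

Note that `distance(3, 4)` recovers the identity $d(C_1,C_2) = Q(P)$ for two siblings.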
### Gene expression
We define the expression
of `Gene G` at a node $N$, $Exp(G,N)$, as the mean expression
of `Gene G` over all the cells that belong to node
$N$. Hence, given the sequence of nodes
```math
(N_i)_{i=0}^{n}
```
we can compute the corresponding gene
expression sequence
```math
(E_{i})_{i=0}^{n}, \quad E_i = Exp(G,N_i).
```
### Cumulative distance
Lastly, since we are interested in plotting the
gene expression as a function of the distance with respect to
the node $X$, we define the sequence of real numbers
```math
(D_{i})_{i=0}^{n}, \quad D_{i} = d(X,N_{i}).
```
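Putting the two sequences together: given the mean expression at each node on the path and the consecutive node-to-node distances, the cumulative distances $D_i$ are just a prefix sum. A minimal sketch with purely illustrative numbers:

```python
import numpy as np

# Hypothetical path N_0, ..., N_4 from X to Y.
E = np.array([5.1, 4.3, 3.0, 1.8, 0.9])     # E_i = Exp(G, N_i)
steps = np.array([0.10, 0.25, 0.25, 0.05])  # d(N_i, N_{i+1})

# D_i = d(X, N_i): cumulative distance from N_0 = X
D = np.concatenate([[0.0], np.cumsum(steps)])
print(D)  # [0.   0.1  0.35 0.6  0.65]

# The final plot is simply E_i versus D_i, e.g. with matplotlib:
# import matplotlib.pyplot as plt
# plt.plot(D, E, marker="o")
```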
### Summary
1. The sequence of nodes between $X$ and $Y$
$${(N_{i})}_{i=0}^{n}$$
2. The sequence of gene expression levels between $X$ and $Y$
$${(E_{i})}_{i=0}^{n}$$
3. And the sequence of distances with respect to node $X$
$${(D_{i})}_{i=0}^{n}$$
The final plot is simply $E_{i}$ versus $D_{i}$. An example
is given in the following figure.
### Example

Note how the expression of `Gene A` is high relative to
that of `Gene B` at node $X$; as we move toward
node $Y$, the trend reverses, and `Gene B` becomes
highly expressed relative to `Gene A`.
## Acknowledgments
I would like to thank
the Schwartz lab (GW) for
letting me explore different
directions and also Christie Lau for
providing multiple test
cases to improve this
implementation.
Raw data
{
"_id": null,
"home_page": null,
"name": "toomanycells",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": null,
"keywords": "toomanycells",
"author": null,
"author_email": "Javier Ruiz-Ramirez <javier.ruizramirez@uhn.ca>",
"download_url": "https://files.pythonhosted.org/packages/54/f4/47cb307ebcbc12c371950db2fe09d914853739d565e0438bcb5831cb159c/toomanycells-1.0.69.tar.gz",
"platform": null,
"description": "# too-many-cells (\u00e0 la Python)\n\n\n[](https://pypi.python.org/pypi/toomanycells)\n\n### It's [Scanpy](https://github.com/scverse/scanpy) friendly!\n### Please remember to [cite](https://doi.org/10.1093/gigascience/giae056) us\n\n**A Python package for spectral clustering** based on the\npowerful suite of tools named\n[too-many-cells](https://github.com/GregorySchwartz/too-many-cells).\nNote, **this is not a wrapper**.\nIn essence, you can use toomanycells to partition a data set\nin the form of a matrix of integers or floating point numbers\ninto clusters, where members of a cluster are similar to each\nother under a given similarity function.\nThe rows represent observations and the\ncolumns are the features. However, sometimes just knowing the\nclusters is not sufficient. Often, we are interested on the\nrelationships between the clusters, and this tool can help \nyou visualize the clusters as leaf nodes of a tree, where the\nbranches illustrate the trajectories that have to be followed\nto reach a particular cluster. Initially, this tool will\npartition your data set into two subsets (each subset is a \nnode of the tree), trying to maximize\nthe differences between the two. Subsequently, it will\nreapply that same criterion to each subset (node) and will \ncontinue bifurcating until the\n[modularity](https://en.wikipedia.org/wiki/Modularity_(networks))\nof the node that is about to be partitioned becomes less \nthan a given threshold value ($10^{-9}$ by default), \nimplying that the elements belonging\nto the current node are fairly homogeneous, \nand consequently suggesting\nthat further partitioning is not warranted. Thus, when the\nprocess finishes, you end up with a tree structure of your\ndata set, where the leaves represent the clusters. As\nmentioned earlier, you can use this tool with any kind of\ndata. 
However, a common application is to classify cells and\ntherefore you can provide an\n[AnnData](https://anndata.readthedocs.io/en/latest/) object.\nYou can read about this application in this [Nature Methods\npaper](https://www.nature.com/articles/s41592-020-0748-5).\n\n\n- Free software: GNU AFFERO GENERAL PUBLIC LICENSE\n- Documentation: https://JRR3.github.io/toomanycells\n\n## Dependencies\nVersion 1.0.40 no longer requires Graphviz. Thus,\nno need to install a separate C library!\n\n## Virtual environments\nTo have control of your working environment \nyou can use a python virtual environment, which \ncan help you keep only the packages you need in \none location. In bash or zsh you can simply\ntype\n```\npython -m venv /path/to/new/virtual/environment\n```\nTo activate it you simply need\n```\nsource pathToTheVirtualEnvironment/bin/activate\n```\nTo deactivate the environment use the intuitive\n```\ndeactivate\n```\n\n## Installation\n\nCaveat: I have tested the following steps in\n**Python 3.9.18**. For other versions, things might\nbe different.\n\nIn theory, just typing\n```\npip install toomanycells\n```\nin your home or custom environment \nshould work. 
However, for reproducibility,\nhere is the list of packages I had in my virtual \nenvironment in 2024:\n```\nanndata==0.10.9\narray_api_compat==1.8\ncelltypist==1.6.3\ncertifi==2024.8.30\ncharset-normalizer==3.3.2\nclick==8.1.7\ncontourpy==1.3.0\ncycler==0.12.1\net-xmlfile==1.1.0\nexceptiongroup==1.2.2\nfonttools==4.54.1\nget-annotations==0.1.2\nh5py==3.11.0\nidna==3.10\nigraph==0.11.6\nimportlib_resources==6.4.5\njoblib==1.4.2\nkiwisolver==1.4.7\nlegacy-api-wrap==1.4\nleidenalg==0.10.2\nllvmlite==0.43.0\nmatplotlib==3.9.2\nnatsort==8.4.0\nnetworkx==3.2.1\nnumba==0.60.0\nnumpy==2.0.2\nopenpyxl==3.1.5\npackaging==24.1\npandas==2.2.3\npatsy==0.5.6\npillow==10.4.0\nplotly==5.24.1\npynndescent==0.5.13\npyparsing==3.1.4\npython-dateutil==2.9.0.post0\npytz==2024.2\nrequests==2.32.3\nscanpy==1.10.3\nscikit-learn==1.5.2\nscipy==1.13.1\nseaborn==0.13.2\nsession-info==1.0.0\nsix==1.16.0\nstatsmodels==0.14.3\nstdlib-list==0.10.0\ntenacity==9.0.0\ntexttable==1.7.0\nthreadpoolctl==3.5.0\ntoomanycells==1.0.52\ntqdm==4.66.5\ntzdata==2024.2\numap-learn==0.5.6\nurllib3==2.2.3\nzipp==3.20.2\n```\nIf you want to install an updated\nversion, then please use the following \napproach.\n```\npip install -U --no-deps toomanycells\n```\nNote that we are requiring to keep all the \ndependencies as they are. Otherwise they would\nget upgraded and that could potentially **break**\nthe installation.\n\nTo install packages based on a list of requirements,\ni.e., the packages you want installed with the specific\nversion, then use\n```\npip install -r requirements.txt\n```\nwhere `requirements.txt` is a list\nlike the one shown the above block of code.\n\nMake sure you have the latest version. 
If not,\nrun the previous command again.\n\n## Quick run (needs to be updated)\nIf you want to see a concrete example of how\nto use toomanycells, check out the jupyter \nnotebook [demo](./toomanycells_demo.ipynb).\n\n## Quick plotting\nIf you are already familiar with toomanycells and\nwant to generate a quick plot (an SVG) of your tree after \ncalling \n ```\n tmc_obj.run_spectral_clustering()\n ```\nthen use the following call\n ```\n tmc_obj.store_outputs(\n cell_ann_col=\"name_of_the_column\",\n plot_tree=True,\n )\n ```\n or\n ```\n tmc_obj.store_outputs(\n cell_ann_col=\"name_of_the_column\",\n plot_tree=True,\n draw_modularity=False,\n draw_node_numbers=False,\n )\n ```\nwith the appropriate flags.\nIf you already have the outputs and you \njust want to plot, then simply call\n ```\n tmc_obj.easy_plot(\n cell_ann_col=\"name_of_the_column\",\n )\n ```\n or\n ```\n tmc_obj.easy_plot(\n cell_ann_col=\"name_of_the_column\",\n draw_modularity=False,\n draw_node_numbers=False,\n )\n ```\nwith the appropriate flags,\nwhere `name_of_the_column` is the name of the AnnData.obs \ncolumn that contains the cell annotations.\nThis function will look for the outputs in the \nfolder that you defined for the `output_directory`\nas shown in step 2.\nNote that this function relies on\n[too-many-cells](https://gregoryschwartz.github.io/too-many-cells/)\n(\u00e0 la Haskell). So you need to have it installed. If you work\nwithin the cluster of your organization, maybe it has \nalready been installed, and you could load it as follows.\n ```\n module add too-many-cells\n ```\nOtherwise, I recommend you to install it with Nix.\n\n## Generating a matrix market file\nSometimes you want to generate a matrix market\nfile from a set of genes so that you can visualize\nthem with other tools. The function that will help you \nwith this is called `create_data_for_tmci`. \nJust for context,\nimagine you are interested in two genes, `COL1A1`, and\n`TCF21`. 
Moreover, imagine that you also want to include\nin your matrix another feature located in the `obs` data\nframe called `total_counts`. Finally, assume that the\ncolumn that contains the labels for your cells is called\n`cell_annotations`. Please use this a template for your\nspecific needs.\n ```\n from toomanycells import TooManyCells as tmc\n tmc_obj = tmc(A)\n tmc_obj.run_spectral_clustering() \n tmc_obj.store_outputs(cell_ann_col=\"cell_annotations\") \n list_of_genes = [] \n list_of_genes.append(\"COL1A1\") \n list_of_genes.append(\"TCF21\") \n list_of_genes.append(\"total_counts\") \n tmc_obj.create_data_for_tmci(list_of_genes=list_of_genes) \n ```\nThese lines of code will produce for you\na folder named `tmci_mtx_data` with the expected outputs.\n\n## Starting from scratch\n1. First import the module as follows\n ```\n from toomanycells import TooManyCells as tmc\n ```\n\n2. If you already have an \n[AnnData](https://anndata.readthedocs.io/en/latest/) \nobject `A` loaded into memory, then you can create a \nTooManyCells object with\n ```\n tmc_obj = tmc(A)\n ```\n In this case the output folder will be called \n `tmc_outputs`. However, if you want the output folder to\n be a particular directory, then you can specify the path \n as follows.\n ```\n tmc_obj = tmc(A, output_directory)\n ```\n3. If instead of providing an AnnData object you want to\nprovide the directory where your data is located, you can use\nthe syntax\n ```\n tmc_obj = tmc(input_directory, output_directory)\n ```\n\n4. If your input directory has a file in the [matrix market\nformat](https://math.nist.gov/MatrixMarket/formats.html),\nthen you have to specify this information by using the\nfollowing flag\n ```\n tmc_obj = tmc(input_directory,\n output_directory,\n input_is_matrix_market=True)\n ```\nUnder this scenario, the `input_directory` must contain a\n`.mtx` file, a `barcodes.tsv` file (the observations), and\na `genes.tsv` (the features).\n\n5. 
Once your data has been loaded successfully, \nyou can start the clustering process with the following command\n ```\n tmc_obj.run_spectral_clustering()\n ```\nIn my desktop computer processing a data set with ~90K\ncells (observations) and ~30K genes (features) took a\nlittle less than 6 minutes in 1809 iterations. For a\nlarger data set like the [Tabula\nSapiens](https://figshare.com/articles/dataset/Tabula_Sapiens_release_1_0/14267219?file=40067134)\nwith 483,152 cells and 58,870 genes (14.51 GB in zip\nformat) the total time was about 50 minutes in the same\ncomputer. \n \n6. At the end of the clustering process the `.obs` data frame of the AnnData object should have two columns named `['sp_cluster', 'sp_path']` which contain the cluster labels and the path from the root node to the leaf node, respectively.\n ```\n tmc_obj.A.obs[['sp_cluster', 'sp_path']]\n ```\n7. To generate the outputs, just call the function\n ```\n tmc_obj.store_outputs()\n ```\n or\n ```\n tmc_obj.store_outputs(\n cell_ann_col=\"name_of_the_column\",\n plot_tree=True,\n )\n ```\n to include a plot of the graph.\n\nThis call will generate JSON file\ncontaining the nodes and edges of the graph (`graph.json`), \none CSV file that describes the cluster\ninformation (`clusters.csv`), another CSV file containing \nthe information of each node (`node_info.csv`), and two \nJSON files. One relates cells to clusters \n(`cluster_list.json`), and the other has the \nfull tree structure (`cluster_tree.json`). You need this\nlast file for too-many-cells interactive (TMCI).\n\n8. If you already have the `graph.json` file you can \nload it with\n ```\n tmc_obj.load_graph(json_fname=\"some_path\")\n ```\n## Visualization with TMCI\n9. 
If you want to visualize your results in a dynamic\nplatform, I strongly recommend the tool\n[too-many-cells-interactive](https://github.com/schwartzlab-methods/too-many-cells-interactive?tab=readme-ov-file).\nTo use it, first make sure that you have Docker Compose and\nDocker. One simple way of getting the two is by installing\n[Docker Desktop](https://docs.docker.com/compose/install/).\nNote that with MacOS the instructions are slightly different.\nIf you use [Nix](https://search.nixos.org/packages), simply\nadd the packages `pkgs.docker` and `pkgs.docker-compose` to\nyour configuration or `home.nix` file and run\n```\nhome-manager switch\n```\n10. If you installed Docker Desktop you probably don't need to\nfollow this step. However, under some distributions the\nfollowing two commands have proven to be essential. Use\n```\nsudo dockerd\n```\nto start the daemon service for docker containers and\n```\nsudo chmod 666 /var/run/docker.sock\n```\nto let Docker read and write to that location.\n\n\n11. Now clone the repository \n ```\n git clone https://github.com/schwartzlab-methods/too-many-cells-interactive.git\n ```\nand store the path to the `too-many-cells-interactive`\nfolder in a variable, for example\n`path_to_tmc_interactive`. Also, you will need to identify\na column in your `AnnData.obs` data frame that has the\nlabels for the cells. Let's assume that the column name is\nstored in the variable `cell_annotations`. Lastly, you can\nprovide a port number to host your visualization, for\ninstance `port_id=1234`. Then, you can call the function\n ```\n tmc_obj.visualize_with_tmc_interactive(\n path_to_tmc_interactive,\n cell_annotations,\n port_id)\n ```\nThe following visualization corresponds to the data set\nwith ~90K cells (observations). 
\n \nAnd this is the visualization for the Tabula Sapiens data set with ~480K cells.\n\n\n## Running TMCI independently\nIn case you already have the outputs for TMCI, but you\nwant to visualize a specific set of genes on top of\nyour tree, you are going to need the expression matrix\ncorresponding to those genes in the matrix marker format.\nYou will also need a list of genes and the barcodes.\nAll of that can be easily achieved with toomanycells\n(\u00e0 la Python) after loading your matrix or AnnData\nobject. If you are interested in only a few genes,\nyou can call \n ```\n tmc_obj.create_data_for_tmci(\n list_of_genes = [\"G1\",\"G2\",...,\"Gn\"]\n )\n ```\n where `G1`,`G2`,...,`Gn`, are the labels\n of the genes of interest. If instead you have\n a table of genes stored as a text file, then\n use the call\n ```\n tmc_obj.create_data_for_tmci(\n path_to_genes = \"path/to/genes.csv\"\n )\n ```\n Lastly, if you want to write all the available genes\n to a matrix, then simply call\n ```\n tmc_obj.create_data_for_tmci()\n ```\n but note that this could take a considerable\n amount of time, depending on how many genes\n are in your matrix.\n After calling this function, you will\n have a new folder called `tmci_mtx_data`\n which will contain the aforementioned files.\n It is also important to mention that you need\n a file wiht the labels\n\n ```\n ./start-and-load.sh \\\n --matrix-dir /path_to/tmci_mtx_data \\\n --tree-path /path_to/cluster_tree.json \\\n --label-path /path_to/cell_annotations.csv \\\n --port 2025 \\\n --debug\n ```\n\n## What is the time complexity of toomanycells (\u00e0 la Python)?\nTo answer that question we have created the following\nbenchmark. We tested the performance of toomanycells in 20\ndata sets having the following number of cells: 6,360, 10,479,\n12,751, 16,363, 23,973, 32,735, 35,442, 40,784, 48,410, 53,046,\n57,621, 62,941, 68,885, 76,019, 81,449, 87,833, 94,543, 101,234,\n107,809, and 483,152. 
The range goes from thousands of cells to\nalmost half a million cells. \nThese are the results.\n\n\nAs you can see, the program behaves linearly with respect to the size of the input. In other words, the observations fit the model $T = k\\cdot N^p$, where $T$ is the time to process the data set, $N$ is the number of cells, $k$ is a constant, and $p$ is the exponent. In our case $p\\approx 1$. Nice!\n\n## Cell annotation\n### CellTypist\nWhen visualizing the tree, we often are interested on\nobserving how different cell types distribute across the\nbranches of the tree. In case your AnnData object lacks\na cell annotation column in the ``obs`` data frame, or \nif you already have one but you want to try a different \nmethod, we have created a wrapper function that calls \n[CellTypist](https://www.celltypist.org/). Simply \nwrite\n```\n tmc_obj.annotate_with_celltypist(\n column_label_for_cell_annotations,\n )\n```\nand the ```obs``` data frame of your AnnData object will \nhave a column named like the string stored under the\n```column_label_for_cell_annotations``` variable.\nBy default we use the ```Immune_All_High``` celltypist \nmodel that contains 32 cell types. If you want to use\nanother model, simply write\n```\n tmc_obj.annotate_with_celltypist(\n column_label_for_cell_annotations,\n celltypist_model,\n )\n```\nwhere ```celltypist_model``` describes the type of model\nto use by the library. For example, if this \nvariable is equal to ```Immune_All_Low```, then the number \nof possible cell types increases to 98.\nFor a complete list of all the models, see the following\n[list](https://www.celltypist.org/models). Lastly,\nif you want to use the fact that transcriptionally similar\ncells are likely to cluster together, you can assign the cell \ntype labels on a cluster-by-cluster basis\n rather than a cell-by-cell basis. 
To activate this \n feature, use the call\n\n```\n tmc_obj.annotate_with_celltypist(\n column_label_for_cell_annotations,\n celltypist_model,\n use_majority_voting = True,\n )\n```\n## Filtering cells\n\nIf you want to select cells \nthat belong to a class defined\nwithin a specific column of the\n`.obs` dataframe, you can use the\nfollowing call.\n\n```\n A = tmc_obj.filter_for_cells_with_property(\n \"cell_type\", \"Neuro-2a\")\n```\n\nIn this case all cells that have the label `Neuro-2a`\nwithin the column `cell_type` in the `.obs` dataframe \nwill be selected, and the resulting AnnData object `A`\nwill only have these cells.\n\n## Graph operations\n\n### Selecting cells through branches\n\nImagine you have a tree structure \nof your data like the one shown below.\n\nIf you want to isolate all the cells that belong to branches\n261 and 2, and produce an AnnData object with those cells,\nsimply use the following call\n\n```\n adata = tmc_obj.isolate_cells_from_branches(\n list_of_branches=[261,2])\n```\n\nIf you have a CSV file that specifies the branches,\nthen use the following call\n\n```\n adata = tmc_obj.isolate_cells_from_branches(\n path_to_csv_file=\"list_of_branches.csv\",\n branch_column=\"node\",\n )\n```\n\nThe name of the column that contains the branches\nor nodes is specified through the keyword \n`branch_column`. 
Lastly, if you want to store \na copy of the indices, use the following call\n\n```\n adata = tmc_obj.isolate_cells_from_branches(\n path_to_csv_file=\"list_of_branches.csv\",\n branch_column=\"node\",\n generate_cell_id_file=True,\n )\n```\n\n### Mean expression of a branch\nImagine we have the following tree.\n\nIf you want to quantify the mean expression of the marker \nCD9 on branch 261, you can use the following call\n\n```\n m_exp = tmc_obj.compute_cluster_mean_expression(\n node=261, genes=[\"CD9\"])\n```\n\nand you would obtain 12.791.\n\n\n\nLooking at the above plot, this suggests that Neuro-2a cells\nhighly express this marker.\nIf instead we were interested in a different marker, like\nSDC1, this would be the corresponding color map\nexpression across the nodes.\n\n\n\nThe above plot also illustrates that some Neuro-2a cells\nare rich in SDC1.\n\n### Median absolute deviation classification\n\nFirst we introduce the concept of median absolute \ndeviation. Imagine you have a list of $n$ observations\n$Z = [z_0,z_1,\\ldots,z_{n-1}]$. Let \n$\\mathcal{M}:\\mathbb{R}^n \\to \\mathbb{R}$ be\nthe function that computes the median of a list.\nConsider a new list $K=[k_0,k_1,\\ldots,k_{n-1}]$, where \n$k_i = \\left| z_i - \\mathcal{M}(Z) \\right|$. Then,\nthe median absolute deviation of $Z$ is the \nmedian of the absolute differences between the\noriginal value and the median. Mathematically,\n$\\text{MAD}(Z) = \\mathcal{M}(K)$. For this section\nwe will be indicating the expression of a gene in terms\nof MADs. 
The reason is that we want to classify cells,\nand using quantities that capture the dispersion of the\ndata is a convenient approach for that purpose.\nAn important point to mention is that\nfor each gene,\ninstead of considering the raw expression values\nacross all cells as the elements of the list $Z$, \nwe use the mean expression for each node of the tree.\nIn other words, for a given gene,\nthe element $z_k$ represents\nthe mean expression of that gene for node $k$. Thus,\n$n$ indicates the number of nodes in the tree.\n\nBased on the previous example, now imagine\nyou want to find cells whose expression\nof two markers, CD9 and SDC1, is 1 MAD above the median.\nFirst, you need a CSV file containing the following \ninformation. \n\n```\n Marker Cell Threshold Direction\n CD9 Neuro-2a 1.0 Above\n SDC1 Neuro-2a 1.0 Above\n```\n\nLet's call it `marker_and_cell_info.csv`.\n**Note**: For this discussion the cell types indicated in \nthe `Cell` column are not relevant and will not be \nused. 
We quantify the mean expression of those \nmarkers for every node of the tree and store \nthat information within each node.\nWe can do that using the following call.\n\n```\ntmc_obj.populate_tree_with_mean_expression_for_all_markers(\n cell_marker_path=\"marker_and_cell_info.csv\")\n```\n\nThen we compute basic statistics for each marker using the\nfollowing function\n\n```\ntmc_obj.compute_node_expression_metadata()\n```\n\nThese are\nthe statistics associated to those markers.\n\n```\n median mad min max min_mad max_mad delta\nCD9 3.080538 1.918258 0.000890 22.944445 -1.605441 10.355182 0.797375\nSDC1 2.989691 1.165005 0.001669 6.639456 -2.564814 3.132832 0.379843\n```\n\nNote that the maximum expression of CD9 \nis about 10 MADs above the median, while that of\nSDC1 is only about 3 MADs above the median.\nThe plot corresponding to the distribution of those\nmarkers across all nodes can be generated through this call\n\n```\ntmc_obj.plot_marker_distributions()\n```\n\nThe plots will be all contained in a dynamic html file. \nHere are some examples.\n\nThis is the distribution for CD9:\n\n\n\nand with TooManyCellsInteractive\n\n\n\nThe distribution for SDC1 looks as follows.\n\n\n\nIf we want to isolate the cells that satisfy the conditions\n\n```\n Marker Cell Threshold Direction\n CD9 Neuro-2a 1.0 Above\n SDC1 Neuro-2a 1.0 Above\n```\n\nWe can use the call\n\n```\ntmc_obj.select_cells_based_on_inequalities(\n cell_ann_col=\"cell_type\")\n```\n\nwhere the cell annotation column in the `.obs` \ndataframe is specified through the\n`cell_ann_col` keyword. This function will \nreturn an AnnData object with all the\ncells satisfying all the constraints.\n\nThis function will also produce\nmultiple CSV files. 
One for each inequality\nspecified through the file of constraints.\nFor example,\none for all cells whose \nexpression of CD9 was above 1 MAD of the median \nexpression of CD9,\none for all cells whose \nexpression of SDC1 was above 1 MAD of the median \nexpression of SDC1, \nand one corresponding to the intersection\nof all of the above. The above function will \nmodify the original AnnData object by adding to\nthe `.obs` dataframe a column\nnamed `Intersection`\nindicating with a boolean value \nif a cell satisfies all the constraints.\n\nLastly, if the number of markers is less than or\nequal to three, then the `.obs` dataframe will\ninclude a column classifying the cells \nbased on whether they express\nhighly or not each of the markers. For instance,\nin this example we obtained the following outputs.\n\n```\nClass\nCD9-Low-SDC1-Low 28729\nCD9-High-SDC1-Low 6072\nCD9-High-SDC1-High 4223\nCD9-Low-SDC1-High 2058\nName: count, dtype: int64\nClass\nCD9-Low-SDC1-Low 0.699309\nCD9-High-SDC1-Low 0.147802\nCD9-High-SDC1-High 0.102794\nCD9-Low-SDC1-High 0.050095\nName: proportion, dtype: float64\n```\n\nThis indicates that the majority of the cells,\ni.e., about `70%` of cells,\nare low in CD9 and low in SDC1, and about `10%` of cells\nare high in both. Note that in this particular example\nwhen we say high it means\nthat the expression is above 1 MAD from the median, and\nlow is the complement of that.\n\n\n\n\n## Heterogeneity quantification\nImagine you want to compare the heterogeneity of cell \npopulations belonging to different branches of the \ntoomanycells tree. By branch we mean all the nodes that\nderive from a particular node, including the node \nthat defines the branch in question.\nFor example, we want to compare branch 1183 against branch 2.\n\nOne way to do this is by comparing the modularity \ndistribution and the cumulative modularity for all the \nnodes that belong to each branch. \n We can do that using the following calls. 
First for \n branch 1183\n```\n tmc_obj.quantify_heterogeneity(\n list_of_branches=[1183],\n use_log_y=true,\n tag=\"branch_A\",\n show_column_totals=true,\n color=\"blue\",\n file_format=\"svg\")\n```\n<br/>\n<img src=\"https://github.com/JRR3/toomanycells/blob/main/tests/branch_A.svg\"\nwidth=\"500\" height=\"420\"/>\n<br/>\nAnd then for branch 2\n\n```\n tmc_obj.quantify_heterogeneity(\n list_of_branches=[2],\n use_log_y=true,\n tag=\"branch_B\",\n show_column_totals=true,\n color=\"red\",\n file_format=\"svg\")\n```\n<br/>\n<img src=\"https://github.com/JRR3/toomanycells/blob/main/tests/branch_B.svg\"\nwidth=\"500\" height=\"420\"/>\n<br/>\nNote that you can include multiple nodes in the \nlist of branches.\nFrom these figures we observe that the higher cumulative \nmodularity of branch 1183 with respect to branch 2 suggests \nthat the former has a higher degree of heterogeneity.\nHowever, just relying in modularity could provide a \nmisleading interpretation. For example, consider the \nfollowing scenario where the numbers within the nodes \nindicate the modularity at that node.\n<br/>\n<img src=\"https://github.com/JRR3/toomanycells/blob/main/tests/counter_node_modularity.svg\"\nwidth=\"300\" height=\"400\"/>\n<br/>\nIn this case, scenario A has a larger cumulative modularity, \nbut we note that scenario B is more heterogeneous.\nFor that reason we recommend also computing additional\ndiversity measures. First, we need some notation. \nFor all the branches belonging to the list of branches in the\nabove function \n\n`quantify_heterogeneity`, let $C$ be\nthe set of leaf nodes that belong to those branches. \nWe consider each leaf node as a separate species, and we \ncall the whole collection of cells an ecosystem.\nFor $c_i \\in C$, let $|c_i|$ be the number of cells in\n$c_i$ and $|C| = \\sum_i |c_i|$ the total number \nof cells contained in the given branches. 
If we let\n\n$$p_i = \\dfrac{|c_i|}{|C|},$$\n\nthen we define the following diversity measure\n\n$$D(q) = \\left(\\sum_{i=1}^{n} p_i^q \\right)^{\\frac{1}{1-q}}.\n$$\n\nIn general, the larger the value of $D(q)$, the more diverse\nis the collection of species. Note that $D(q=0)$ \ndescribes the total number of species. We \ncall this quantity the richness of the ecosystem. \nWhen $q=1$, $D$ is the exponential of the Shannon entropy\n\n$$H = -\\sum_{i=1}^{n}p_i \\ln(p_i).$$\n\nWhen $q=2$, $D$ is \nthe inverse of the Simpson's index\n\n$$S = \\sum_{i=1}^{n} (p_i)^2,$$\n\nwhich represents the\nprobability that two cells picked at random belong\nto the same species. Hence, the higher the Simpson's\nindex, the less diverse is the ecosystem. \nLastly, when $q=\\infty$, $D$ is the inverse of\nthe largest proportion $\\max_i(p_i)$.\n\nIn the above example, for branch 1183 we obtain\n```\n value\nRichness 460.000000\nShannon 5.887544\nSimpson 0.003361\nMaxProp 0.010369\nq = 0 460.000000\nq = 1 360.518784\nq = 2 297.562094\nq = inf 96.442786\n```\nand for branch 2 we obtain\n```\n value\nRichness 280.000000\nShannon 5.500414\nSimpson 0.004519\nMaxProp 0.010750\nq = 0 280.000000\nq = 1 244.793371\nq = 2 221.270778\nq = inf 93.021531\n```\nAfter comparing the results using two different measures,\nnamely, modularity and diversity, we conclude that branch\n1183 is more heterogeneous than branch 2.\n\n## Similarity functions\nSo far we have assumed that the similarity matrix \n$S$ is\ncomputed by calculating the cosine of the angle \nbetween each observation. Concretely, if the \nmatrix of observations is $B$ ($m\\times n$), the $i$-th row\nof $B$ is $x = B(i,:)$, and the $j$-th row of $B$ \nis $y=B(j,:)$, then the similarity between $x$ and\n$y$ is\n\n$$S(x,y)=\\frac{x\\cdot y}{||x||_2\\cdot ||y||_2}.$$\n\nHowever, this is not the only way to compute\na similarity matrix. 
We will list all the available\nsimilarity functions and how to call them.\n\n### Cosine (sparse)\nIf your matrix is sparse, i.e., the number of nonzero\nentries is proportional to the number of samples ($m$),\nand you want to use the cosine similarity, then use the\nfollowing instruction.\n```\ntmc_obj.run_spectral_clustering(\n similarity_function=\"cosine_sparse\")\n```\nBy default we use the ARPACK library (written in Fortran)\nto compute the truncated singular value decomposition.\nThe Halko-Martinsson-Tropp algorithm \nis also available. However, this one is not deterministic.\n```\ntmc_obj.run_spectral_clustering(\n similarity_function=\"cosine_sparse\",\n svd_algorithm=\"arpack\")\n```\nIf $B$ has negative entries, it is possible\nto get negative entries for $S$. This could\nin turn produce negative row sums for $S$. \nIf that is the case,\nthe convergence to a solution could be extremely slow.\nHowever, if you use the non-sparse version of this\nfunction, we provide a reasonable solution to this problem.\n\n### Dimension-adaptive Euclidean Norm (DaEN)\nIf your data consists of points whose Euclidean\nnorm varies across multiple length scales, then \none option is to use a similarity function that\ncan adapt to those changes in magnitude.\nBefore I explain it in detail, here is how\nyou can call it\n```\ntmc_obj.run_spectral_clustering(\n similarity_function=\"norm_sparse\")\n```\n\n### Cosine\nIf your matrix is dense, \nand you want to use the cosine similarity, then use the\nfollowing instruction.\n```\ntmc_obj.run_spectral_clustering(\n similarity_function=\"cosine\")\n```\nThe same comment about negative entries applies here.\nHowever, there is a simple solution. While shifting\nthe matrix of observations can drastically change the\ninterpretation of the data because each column lives\nin a different (gene) space, shifting the similarity \nmatrix is actually a reasonable method to remove negative\nentries. 
The reason is that similarities live in an \nordered space and shifting by a constant is\nan order-preserving transformation. Equivalently,\nif the similarity between $x$ and $y$ is\nless than the similarity between $u$ and $w$, i.e.,\n$S(x,y) < S(u,w)$, then $S(x,y)+s < S(u,w)+s$ for\nany constant $s$. The raw data\nhave no natural order, but similarities do.\nTo shift the (dense) similarity matrix by $s=1$, use the \nfollowing instruction.\n```\ntmc_obj.run_spectral_clustering(\n similarity_function=\"cosine\",\n shift_similarity_matrix=1)\n```\nNote that since the range of the cosine similarity\nis $[-1,1]$, the shifted range for $s=1$ becomes $[0,2]$.\nThe shift transformation can also be applied to any of \nthe subsequent similarity matrices.\n\n### Laplacian\nThe similarity matrix is given by\n\n$$\nS(x,y)=\\exp(-\\gamma\\cdot \\left\\lVert x-y \\right\\rVert _1).\n$$\n\nThis is an example:\n```\ntmc_obj.run_spectral_clustering(\n similarity_function=\"laplacian\",\n similarity_gamma=0.01)\n```\nThis function is very sensitive to $\\gamma$. Hence, an\ninadequate choice can result in poor results or \nno convergence. If you obtain poor results, try using \na smaller value for $\\gamma$.\n\n### Gaussian\nThe similarity matrix is given by\n\n$$\nS(x,y)=\\exp(-\\gamma\\cdot \\left\\lVert x-y\\right\\rVert _2^2).\n$$\n\nThis is an example:\n```\ntmc_obj.run_spectral_clustering(\n similarity_function=\"gaussian\",\n similarity_gamma=0.001)\n```\nAs before, this function is very sensitive to $\\gamma$. \nNote that the norm is squared. Thus, it transforms\nbig differences between $x$ and $y$ into very small\nquantities.\n\n### Divide by the sum\nThe similarity matrix is given by\n\n$$\nS(x,y)=1-\\frac{\n \\left\\lVert x-y \\right\\rVert_p\n }{\n \\left\\lVert x \\right\\rVert_p +\n \\left\\lVert y \\right\\rVert_p\n },\n$$\n\nwhere $p =1$ or $p=2$. 
The rows
of the matrix are normalized (unit norm)
before computing the similarity.
This is an example:
```
tmc_obj.run_spectral_clustering(
    similarity_function="div_by_sum")
```

## Normalization

### TF-IDF
If you want to use the inverse document
frequency (IDF) normalization, then use
```
tmc_obj.run_spectral_clustering(
    similarity_function="some_sim_function",
    use_tf_idf=True)
```
If you also want to normalize the frequencies to
unit norm with the $2$-norm, then use
```
tmc_obj.run_spectral_clustering(
    similarity_function="some_sim_function",
    use_tf_idf=True,
    tf_idf_norm="l2")
```
If instead you want to use the $1$-norm, then
replace "l2" with "l1".

### Simple normalization
Sometimes normalizing your matrix
of observations can improve the
performance of some routines.
To normalize the rows, use the following instruction.
```
tmc_obj.run_spectral_clustering(
    similarity_function="some_sim_function",
    normalize_rows=True)
```
By default, the $2$-norm is used. To
use any other $p$-norm, use
```
tmc_obj.run_spectral_clustering(
    similarity_function="some_sim_function",
    normalize_rows=True,
    similarity_norm=p)
```

## Gene expression along a path
### Introduction
Imagine you have the following tree structure after
running toomanycells.

Further, assume that the colors denote different classes
satisfying specific properties. We want to know how the
expression of two genes, for instance, `Gene S` and `Gene T`,
fluctuates as we move from node $X$
(lower left side of the tree), which is rich
in `Class B`, to node $Y$ (upper left side of the tree),
which is rich in `Class C`. To compute such quantities,
we first need to define the
distance between nodes.

### Distance between nodes
Assume we have a (parent) node $P$ with
two children nodes $C_1$ and $C_2$.
Recall that the modularity of
$P$ indicates the strength of separation between the cell
populations of $C_1$ and $C_2$.
A large modularity indicates strong connections,
i.e., high similarity, within each cluster $C_i$,
and also implies weak connections, i.e., low similarity, between
$C_1$ and $C_2$. If the modularity at $P$ is $Q(P)$, we define
the distance between $C_1$ and $C_2$ as

$$d(C_1,C_2) = Q(P).$$

We also define $d(C_i,P) = Q(P)/2$. Note that with
those definitions we have

$$d(C_1,C_2)=d(C_1,P)+d(P,C_2)=Q(P)/2+Q(P)/2=Q(P),$$

as expected. Now that we know how to calculate the
distance between a node and its parent or child, let
$X$ and $Y$ be two distinct nodes, and let
${(N_{i})}_{i=0}^{n}$ be the sequence of nodes that describes
the unique path between them satisfying:

1. $N_0 = X$,
2. $N_n = Y$,
3. $N_i$ is a direct relative of $N_{i+1}$, i.e.,
$N_i$ is either a child or parent of $N_{i+1}$,
4. $N_i \neq N_j$ for $i\neq j$.

Then, the distance between $X$ and $Y$ is given by
```math
d(X,Y) =
\sum_{i=0}^{n-1} d(N_{i},N_{i+1}).
```
### Gene expression
We define the expression
of `Gene G` at a node $N$, $Exp(G,N)$, as the mean expression
of `Gene G` over all the cells that belong to node
$N$. Hence, given the sequence of nodes
```math
(N_i)_{i=0}^{n},
```
we can compute the corresponding gene
expression sequence
```math
(E_{i})_{i=0}^{n}, \quad E_i = Exp(G,N_i).
```
### Cumulative distance
Lastly, since we are interested in plotting the
gene expression as a function of the distance with respect to
node $X$, we define the sequence of real numbers
```math
(D_{i})_{i=0}^{n}, \quad D_{i} = d(X,N_{i}).
```
### Summary
1. The sequence of nodes between $X$ and $Y$
$${(N_{i})}_{i=0}^{n}$$
2. The sequence of gene expression levels between $X$ and $Y$
$${(E_{i})}_{i=0}^{n}$$
3.
And the sequence of distances with respect to node $X$
$${(D_{i})}_{i=0}^{n}$$

The final plot is simply $E_{i}$ versus $D_{i}$. An example
is given in the following figure.
### Example

Note how the expression of `Gene A` is high relative to
that of `Gene B` at node $X$, and as we move
farther toward
node $Y$ the trend is inverted: `Gene B` is now
highly expressed relative to `Gene A` at node $Y$.

## Acknowledgments
I would like to thank
the Schwartz lab (GW) for
letting me explore different
directions, and Christie Lau for
providing multiple test
cases to improve this
implementation.
",
"bugtrack_url": null,
"license": "BSD License",
"summary": "A python package for spectral clustering.",
"version": "1.0.69",
"project_urls": {
"Homepage": "https://github.com/JRR3/toomanycells"
},
"split_keywords": [
"toomanycells"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "490daf8c2e54967713d23a35305d31727cdf813d6667b26524f626f55880110d",
"md5": "5bff4ab8df2000687a3ee250ba0c3ae6",
"sha256": "a12453278b13f8d64da0be98f35988b72c184254ca88f2b025af6023b75fd15d"
},
"downloads": -1,
"filename": "toomanycells-1.0.69-py2.py3-none-any.whl",
"has_sig": false,
"md5_digest": "5bff4ab8df2000687a3ee250ba0c3ae6",
"packagetype": "bdist_wheel",
"python_version": "py2.py3",
"requires_python": ">=3.9",
"size": 87677,
"upload_time": "2025-08-13T21:27:25",
"upload_time_iso_8601": "2025-08-13T21:27:25.200953Z",
"url": "https://files.pythonhosted.org/packages/49/0d/af8c2e54967713d23a35305d31727cdf813d6667b26524f626f55880110d/toomanycells-1.0.69-py2.py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "54f447cb307ebcbc12c371950db2fe09d914853739d565e0438bcb5831cb159c",
"md5": "d6fbc7eaff40775794bfaae8eb08a9f0",
"sha256": "13e5646c7db334a50de51e6f6f643a8ff9f13463b5c8c626a6260810a0fe716f"
},
"downloads": -1,
"filename": "toomanycells-1.0.69.tar.gz",
"has_sig": false,
"md5_digest": "d6fbc7eaff40775794bfaae8eb08a9f0",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9",
"size": 5749693,
"upload_time": "2025-08-13T21:27:26",
"upload_time_iso_8601": "2025-08-13T21:27:26.349865Z",
"url": "https://files.pythonhosted.org/packages/54/f4/47cb307ebcbc12c371950db2fe09d914853739d565e0438bcb5831cb159c/toomanycells-1.0.69.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-08-13 21:27:26",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "JRR3",
"github_project": "toomanycells",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [
{
"name": "scanpy",
"specs": []
},
{
"name": "scipy",
"specs": []
},
{
"name": "numpy",
"specs": []
},
{
"name": "pandas",
"specs": []
},
{
"name": "networkx",
"specs": []
},
{
"name": "tqdm",
"specs": []
},
{
"name": "scikit-learn",
"specs": []
},
{
"name": "matplotlib",
"specs": []
},
{
"name": "plotly",
"specs": []
},
{
"name": "celltypist",
"specs": []
}
],
"lcname": "toomanycells"
}