# scBC
scBC: a single-cell transcriptome Bayesian biClustering framework.
This document walks you through the scBC model.
![fig1_00](https://github.com/GYQ-form/scBC/assets/79566479/a3a7a6a6-aebd-42dc-85dc-806ecfb54cb9)
## Installation
To install our package, run
```bash
conda install -c conda-forge scvi-tools  # or: pip install scvi-tools
pip install scBC
```
## Quick start
To illustrate how to run the scBC model, we use the subsampled HEART dataset as an example. We have prepared four subsampled datasets in advance (HEART, PBMC, LUAD and BC), which can be downloaded with a few lines of code:
```python
from scBC import data
heart = data.load_data("HEART")
```
If the automatic download fails, you can download the datasets manually from our [repository](https://github.com/GYQ-form/scBC/tree/main/data) and place them in your working directory.
Now you are ready to set up the scBC model:
```python
from scBC.model import scBC
my_model = scBC(adata=heart,batch_key='cell_source')
```
Notice that `adata` must be an [AnnData](https://anndata.readthedocs.io/en/latest/generated/anndata.AnnData.html#anndata.AnnData) object. `batch_key` specifies the column in `adata.obs` to be used as batch annotation.
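Since a wrong `batch_key` is easy to pass by accident, it can help to check the column before building the model. The helper below is a hypothetical convenience (`check_obs_key` is not part of scBC). It relies only on `key in adata.obs`, which tests column membership for the pandas DataFrame that AnnData stores in `.obs`; a small stand-in object is used so the sketch runs without AnnData installed.

```python
from types import SimpleNamespace


def check_obs_key(adata, key):
    """Return True if `key` is a column of `adata.obs`, else raise KeyError.

    Hypothetical helper, not part of scBC.
    """
    if key not in adata.obs:
        available = ", ".join(str(name) for name in adata.obs)
        raise KeyError(f"{key!r} not found in adata.obs; available columns: {available}")
    return True


# Stand-in with an `.obs` mapping, used only so this sketch is self-contained;
# a real AnnData object can be passed in the same way.
mock_adata = SimpleNamespace(obs={"cell_source": ["A", "A", "B"], "n_genes": [10, 12, 9]})
print(check_obs_key(mock_adata, "cell_source"))
```

With the HEART data from above, the call would be `check_obs_key(heart, 'cell_source')`.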
We first train the VAE model with the default setting:
```python
my_model.train_VI()
```
Then we obtain the reconstructed data, using 10 posterior samples to estimate the mean expression:
```python
my_model.get_reconst_data(n_samples=10)
```
Since scBC encourages providing prior information about gene co-expression, we first generate a prior edge array before biclustering. scBC provides an API that generates the prior edges automatically from the BioMart database; simply run it with the default settings:
```python
my_model.get_edge()
```
The edge array is now stored in `my_model.edge`, and biclustering will use it automatically. Here we set the number of biclusters to 3:
```python
my_model.Biclustering(L=3)
```
Take a look at the results:
```python
my_model.S
```
Or perhaps we are more interested in the strong classification result at the cell level:
```python
my_model.strong_class()
```
## Parameter lists
Here we give some parameter lists for readers to use flexibly:
### model
#### scBC.model.scBC()
- **adata**: *AnnData object with an observed n $\times$ p matrix, where p is the number of measurements and n is the number of subjects.*
- **layer**: *if not None, uses this as the key in adata.layers for raw count data.*
- **batch_key**: *key in adata.obs for batch information. Categories will automatically be converted into integer categories and saved to adata.obs['_scvi_batch']. If None, assigns the same batch to all the data.*
- **labels_key**: *key in adata.obs for label information. Categories will automatically be converted into integer categories and saved to adata.obs['_scvi_labels']. If None, assigns the same label to all the data.*
- **size_factor_key**: *key in adata.obs for size factor information. Instead of using library size as a size factor, the provided size factor column will be used as offset in the mean of the likelihood. Assumed to be on linear scale.*
- **categorical_covariate_keys**: *keys in adata.obs that correspond to categorical data. These covariates can be added in addition to the batch covariate and are also treated as nuisance factors (i.e., the model tries to minimize their effects on the latent space). Thus, these should not be used for biologically-relevant factors that you do not want to correct for.*
- **continuous_covariate_keys**: *keys in adata.obs that correspond to continuous data. These covariates can be added in addition to the batch covariate and are also treated as nuisance factors (i.e., the model tries to minimize their effects on the latent space). Thus, these should not be used for biologically-relevant factors that you do not want to correct for.*
#### Attributes
- `self.adata` - The AnnData used when initializing the scBC object.
- `self.p` - Number of measurements (genes) in the expression matrix
- `self.n` - Number of subjects (cells) in the expression matrix
- `self.vi_model` - The scvi model
- `self.reconst_data` - Reconstructed expression matrix with n $\times$ p
- `self.W` - **W** matrix after biclustering
- `self.Z` - **Z** matrix after biclustering
- `self.mu` - Parameter matrix $\pmb\mu$ after biclustering
- `self.convg_trail` - The convergence trail of the biclustering process (likelihood values)
- `self.S` - A list containing L dictionaries. The $i$-th dictionary holds the results of the $i$-th bicluster
- `self.niter` - Number of iterations performed during biclustering
- `self.edge` - The prior edge array
---
### Methods
#### scBC.train_VI()
- **n_hidden**: Number of nodes per hidden layer.
- **n_latent**: Dimensionality of the latent space.
- **n_layers**: Number of hidden layers used for encoder and decoder NNs.
- **dropout_rate**: Dropout rate for neural networks.
- **dispersion**: One of the following:
  - ``'gene'`` - dispersion parameter of NB is constant per gene across cells
  - ``'gene-batch'`` - dispersion can differ between different batches
  - ``'gene-label'`` - dispersion can differ between different labels
  - ``'gene-cell'`` - dispersion can differ for every gene in every cell
- **gene_likelihood**: One of:
  - ``'nb'`` - Negative binomial distribution
  - ``'zinb'`` - Zero-inflated negative binomial distribution
  - ``'poisson'`` - Poisson distribution
- **latent_distribution**: One of:
  - ``'normal'`` - Normal distribution
  - ``'ln'`` - Logistic normal distribution (Normal(0, I) transformed by softmax)
- **max_epochs**: Number of passes through the dataset. If `None`, defaults to `np.min([round((20000 / n_cells) * 400), 400])`
- **use_gpu**: Use default GPU if available (if None or True), or index of GPU to use (if int), or name of GPU (if `str`, e.g., `'cuda:0'`), or use CPU (if False).
- **batch_size**: Minibatch size to use during training.
- **early_stopping**: Perform early stopping.
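
As a quick worked example of the `max_epochs` default quoted above, the sketch below evaluates the same formula for a few dataset sizes. The function name `default_max_epochs` is only illustrative, and plain Python `min` stands in for `np.min`.

```python
def default_max_epochs(n_cells):
    """Evaluate the documented default: min(round((20000 / n_cells) * 400), 400)."""
    return min(round((20000 / n_cells) * 400), 400)


for n_cells in (1_000, 20_000, 50_000, 100_000):
    print(n_cells, default_max_epochs(n_cells))
```

Datasets with up to 20,000 cells reach the cap of 400 epochs; beyond that the default shrinks in proportion, for example to 160 epochs for 50,000 cells and 80 epochs for 100,000 cells.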
---
#### scBC.get_reconst_data()
- **n_samples**: Number of posterior samples to use for estimation.
- **batch_size**: Minibatch size for data loading into model.
---
#### scBC.get_edge()
- **gene_list**: The HGNC gene names used to look up prior edges in the hsapiens_gene_ensembl dataset. The behaviour depends on the type you provide:
  - `str` - take self.adata.var[gene_list] as the gene set
  - `list` - use the provided list as the gene set to be searched
  - `None` (default) - use the variable names (self.adata.var_names) as the gene set
- **dataset**: The dataset to be used for extracting prior information. Default is `hsapiens_gene_ensembl`.
- **intensity**: An integer denoting the intensity of the prior. Generally speaking, the larger the number, the more prior information is returned (the bigger the edge array is).
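
The three `gene_list` cases can be pictured with a small stand-alone function. This is a sketch of the documented behaviour only, not the package's code: `resolve_gene_list` is a hypothetical name, and a plain dict and list stand in for `adata.var` and `adata.var_names`.

```python
def resolve_gene_list(gene_list, var_columns, var_names):
    """Pick the gene set according to the type of `gene_list`.

    var_columns: mapping column name -> list of values (stand-in for adata.var)
    var_names:   list of variable names (stand-in for adata.var_names)
    """
    if gene_list is None:
        # Default: fall back to the variable names.
        return list(var_names)
    if isinstance(gene_list, str):
        # A string is treated as a column name holding the gene symbols.
        return list(var_columns[gene_list])
    # Otherwise use the provided genes as given.
    return list(gene_list)


var_columns = {"symbol": ["TP53", "MYC", "GAPDH"]}
var_names = ["g1", "g2", "g3"]

print(resolve_gene_list(None, var_columns, var_names))
print(resolve_gene_list("symbol", var_columns, var_names))
print(resolve_gene_list(["TP53", "MYC"], var_columns, var_names))
```

Passing a column name is useful when `var_names` holds identifiers other than gene symbols and the symbols live in a separate column.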
---
#### scBC.Biclustering()
- **L**: maximum number of biclusters
- **mat**: a p $\times$ n numpy array used for biclustering. By default, the reconstructed data obtained from scBC.get_reconst_data() is used, but you can provide a matrix explicitly instead.
- **edge**: a 2-column matrix providing prior network information for some measurements. An edge passed here takes precedence; if none is provided, scBC.edge is used. The edge array can be generated automatically using scBC.get_edge(). Running Biclustering() without any edge prior is also feasible.
- **dist_type**: a p-element vector in which each element gives the distribution type of a measurement. 0 is Gaussian, 1 is Binomial, 2 is Negative Binomial, 3 is Poisson. If not given, all measurements are assumed to follow a Gaussian distribution.
- **param**: a p-element vector in which each element is the parameter required by the corresponding distribution: $\zeta$ for Gaussian, $n_j$ for Binomial, $r_j$ for Negative Binomial, $N$ for Poisson. Default is a vector of ones.
- **initWZ**: either "random" or "svd". If "random", randomly generated N(0,1) numbers are used; otherwise the SVD results are used for initialization.
- **maxiter**: allowed maximum number of iterations.
- **tol**: desired total difference of solutions.
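
To make the `dist_type` codes and the matching `param` vector concrete, here is a minimal sketch that builds both for the case where every measurement shares one distribution family. The names `DIST_CODES` and `make_dist_inputs` are hypothetical, and plain lists are used; whether `Biclustering()` expects lists or NumPy arrays is not stated here, so convert as needed.

```python
# Integer codes as listed for `dist_type`.
DIST_CODES = {"gaussian": 0, "binomial": 1, "negative_binomial": 2, "poisson": 3}


def make_dist_inputs(p, family="gaussian", param_value=1.0):
    """Build p-element `dist_type` and `param` lists for a single family."""
    if family not in DIST_CODES:
        raise ValueError(f"unknown family {family!r}; choose from {sorted(DIST_CODES)}")
    if p <= 0:
        raise ValueError("p must be a positive integer")
    dist_type = [DIST_CODES[family]] * p
    param = [param_value] * p
    return dist_type, param


dist_type, param = make_dist_inputs(p=5, family="negative_binomial", param_value=10)
print(dist_type)
print(param)

# Assumed usage with a trained model (not executed here):
# my_model.Biclustering(L=3, dist_type=dist_type, param=param)
```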
---
## Simulation
We also provide an API for simulation in scBC.data:
#### scBC.data.simulate_data()
- **L**: number of biclusters
- **p**: number of measurements (genes)
- **n**: number of objects (cells)
- **dropout**: dropout rate (between 0 and 1)
- **batch_num**: number of batches you want to simulate
#### Return value
A dictionary containing several simulation results:
| key  |                     description                      |
| :--: | :--------------------------------------------------: |
| W | The **W** matrix |
| Z | The **Z** matrix |
| dat | AnnData object with expression matrix (n $\times$ p) |
| S | The ground truth result |
| edge | Randomly generated prior edge array |
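
A call to the simulator might look like the sketch below. The parameter names and dictionary keys are the ones listed above; the numeric values are arbitrary choices for illustration, and `run_simulation` and `missing_keys` are hypothetical helper names. The import sits inside the function so the rest of the snippet runs even where scBC is not installed.

```python
EXPECTED_KEYS = ("W", "Z", "dat", "S", "edge")


def run_simulation():
    """Call the simulator with illustrative settings (requires scBC)."""
    from scBC import data  # imported lazily; only needed when this function is called

    return data.simulate_data(L=3, p=200, n=100, dropout=0.2, batch_num=2)


def missing_keys(sim):
    """List the documented keys that are absent from a result dictionary."""
    return [key for key in EXPECTED_KEYS if key not in sim]


# Stand-in result used only to show the check; replace with run_simulation().
placeholder = {key: None for key in EXPECTED_KEYS}
print(missing_keys(placeholder))
```

The AnnData object under `dat` can then be passed to `scBC(adata=...)`, and the array under `edge` to `Biclustering(edge=...)`.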
During simulation, the scale of the FGM increases adaptively with the size of the simulated dataset (in practice, with p). The parameter matrix $\pmb\mu$ is computed from the multiplicative model $\pmb\mu = \pmb{WZ}$, where **W** is a p×L matrix and **Z** is an L×n matrix. Each column of **W** has p/20 non-zero elements, and each row of **Z** has n/10 non-zero elements. The row indices of the non-zero elements of **W** and the column indices of the non-zero elements of **Z** are drawn at random from 1 to p and from 1 to n, respectively. The non-zero values of both **W** and **Z** are generated from a normal distribution with mean 1.5 and standard deviation 0.1 and are randomly assigned a positive or negative sign. The prior edge is generated along with **W**. Each element of **X** is generated from $NB(r_j,\frac1{1+e^{-\mu_{ij}}})$, where the parameter $r_j$ is randomly drawn from 5 to 20. Finally, to simulate different batches, the dataset is divided into `batch_num` parts, each with a different intensity of noise. Dropout is implemented as Bernoulli censoring of each data point according to the given dropout rate. The simulation data generation process is shown below.
![simulation](https://user-images.githubusercontent.com/79566479/232784233-e0a07e0e-bbc3-449c-91b6-5e0936d48159.png)
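
The generation steps described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the package's implementation: the second NB argument is taken to be the per-trial success probability, one $r_j$ is drawn per gene, the matrix is stored genes by cells, and the batch-specific noise and the prior edge are left out. All function names are hypothetical.

```python
import math
import random


def sparse_signed_vector(length, n_nonzero, rng):
    """Vector with `n_nonzero` entries drawn from N(1.5, 0.1), each with a random sign."""
    values = [0.0] * length
    for idx in rng.sample(range(length), n_nonzero):
        values[idx] = rng.choice((-1.0, 1.0)) * rng.gauss(1.5, 0.1)
    return values


def sample_nb(r, prob, rng):
    """Negative binomial draw as a sum of `r` geometric failure counts.

    Assumes `prob` is the success probability of each trial.
    """
    total = 0
    for _ in range(r):
        u = 1.0 - rng.random()  # in (0, 1], avoids log(0)
        total += int(math.log(u) / math.log(1.0 - prob))
    return total


def simulate(L=3, p=40, n=20, dropout=0.2, seed=0):
    rng = random.Random(seed)
    # W is p x L (kept as L columns), Z is L x n (kept as L rows).
    W_cols = [sparse_signed_vector(p, max(1, p // 20), rng) for _ in range(L)]
    Z_rows = [sparse_signed_vector(n, max(1, n // 10), rng) for _ in range(L)]
    r = [rng.randint(5, 20) for _ in range(p)]  # one r_j per gene (assumption)
    X = []
    for j in range(p):  # genes
        row = []
        for i in range(n):  # cells
            mu = sum(W_cols[k][j] * Z_rows[k][i] for k in range(L))  # entry of WZ
            prob = 1.0 / (1.0 + math.exp(-mu))
            count = sample_nb(r[j], prob, rng)
            # Bernoulli censoring: zero the entry with probability `dropout`.
            row.append(0 if rng.random() < dropout else count)
        X.append(row)
    return W_cols, Z_rows, X


W_cols, Z_rows, X = simulate()
print(len(X), "genes x", len(X[0]), "cells")
```

Fixing `seed` makes the draw reproducible, which is convenient when comparing biclustering results against the known **W** and **Z**.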
## Package metadata
- **name**: scBC
- **version**: 0.3.2
- **summary**: a single-cell transcriptome Bayesian biClustering framework
- **author**: Yuqiao Gong (gyq123@sjtu.edu.cn)
- **license**: MIT
- **home page**: https://github.com/GYQ-form/scBC
- **keywords**: single cell transcriptomics, Biclustering, bioinformatics, variational inference (VI)
- **upload time**: 2024-02-22
- **source distribution**: https://files.pythonhosted.org/packages/34/e2/75a1d1ddcf2181f854c6ccba2c215fd90bdf728f4fdcb60df134b668188a/scBC-0.3.2.tar.gz