| Field | Value |
| --- | --- |
| Name | Garfield |
| Version | 0.3.5 |
| home_page | https://github.com/zhou-1314/Garfield |
| Summary | Garfield: Graph-based Contrastive Learning enable Fast Single-Cell Embedding |
| upload_time | 2024-10-04 02:05:26 |
| maintainer | None |
| docs_url | None |
| author | Weige Zhou |
| requires_python | >=3.7 |
| license | BSD |
| keywords | |
| VCS | |
| bugtrack_url | |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |

# Garfield: **G**raph-based Contrastive Le**ar**ning enable **F**ast S**i**ngle-C**el**l Embe**d**ding
<img src="imgs/Garfield_framework2.png" alt="Garfield" width="900"/>
## Installation
Install `Garfield` from PyPI with:
```bash
pip install Garfield
```
Or install the latest version from GitHub:
```
pip install git+https://github.com/zhou-1314/Garfield.git
```
Or clone the repository and install from source:
```
git clone https://github.com/zhou-1314/Garfield.git
cd Garfield
python setup.py install
```
Garfield is implemented in the [PyTorch](https://pytorch.org/) framework.
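
Before installing, it can help to confirm that the interpreter meets the `>=3.7` requirement listed in the package metadata and that PyTorch can be located. The following is a minimal sketch using only the standard library; `check_environment` is a hypothetical helper for illustration and is not part of Garfield.

```python
import importlib.util
import sys


def check_environment(min_python=(3, 7), required=("torch",)):
    """Return a dict describing whether the basic prerequisites are met."""
    report = {"python_ok": sys.version_info[:2] >= tuple(min_python)}
    for name in required:
        # find_spec returns None when a top-level module cannot be located.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    print(check_environment())
```

A `False` entry for `torch` suggests installing PyTorch first, following the instructions on its website.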
## Usage
```python
## load packages
import os
import Garfield as gf
import scanpy as sc
print(gf.__version__)
# set workdir
workdir = 'garfield_multiome_10xbrain'
gf.settings.set_workdir(workdir)
### modify parameter
user_config = dict(
## Input options
adata_list=mdata,  # `mdata` is your input data, loaded beforehand; it is not defined in this snippet
profile='multi-modal',
data_type='Paired',
sub_data_type=['rna', 'atac'],
sample_col=None,
weight=0.5,
## Preprocessing options
graph_const_method='mu_std',
genome='hg38',
use_gene_weight=True,
use_top_pcs=False,
used_hvg=True,
min_cells=3,
min_features=0,
keep_mt=False,
target_sum=1e4,
rna_n_top_features=3000,
atac_n_top_features=100000,
n_components=50,
n_neighbors=5,
metric='euclidean',
svd_solver='arpack',
# datasets
adj_key='connectivities',
# data split parameters
edge_val_ratio=0.1,
edge_test_ratio=0.,
node_val_ratio=0.1,
node_test_ratio=0.,
## Model options
augment_type='svd',
svd_q=5,
use_FCencoder=False,
conv_type='GATv2Conv', # GAT or GATv2Conv or GCN
gnn_layer=2,
hidden_dims=[128, 128],
bottle_neck_neurons=20,
cluster_num=20,
drop_feature_rate=0.2,
drop_edge_rate=0.2,
num_heads=3,
dropout=0.2,
concat=True,
used_edge_weight=False,
used_DSBN=False,
used_mmd=False,
# data loader parameters
num_neighbors=5,
loaders_n_hops=2,
edge_batch_size=4096,
node_batch_size=128, # None
# loss parameters
include_edge_recon_loss=True,
include_gene_expr_recon_loss=True,
lambda_latent_contrastive_loss=1.0,
lambda_gene_expr_recon=300.,
lambda_edge_recon=500000.,
lambda_omics_recon_mmd_loss=0.2,
# train parameters
n_epochs_no_edge_recon=0,
learning_rate=0.001,
weight_decay=1e-05,
gradient_clipping=5,
# other parameters
latent_key='garfield_latent',
reload_best_model=True,
use_early_stopping=True,
early_stopping_kwargs=None,
monitor=True,
seed=2024,
verbose=True
)
dict_config = gf.settings.set_gf_params(user_config)
from Garfield.model import Garfield
# Initialize model
model = Garfield(dict_config)
# Train model
model.train()
# Compute latent neighbor graph
latent_key = 'garfield_latent'
sc.pp.neighbors(model.adata,
use_rep=latent_key,
key_added=latent_key)
# Compute UMAP embedding
sc.tl.umap(model.adata,
neighbors_key=latent_key)
sc.pl.umap(model.adata, color=[ 'celltype'], show=True, size=3)
model_folder_path = "./slideseqv2_mouse_hippocampus/model"
os.makedirs(model_folder_path, exist_ok=True)
# Save trained model
model.save(dir_path=model_folder_path,
overwrite=True,
save_adata=True,
adata_file_name="adata.h5ad")
```
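
The example above passes a full dictionary, but usually only a few values change between runs. One way to keep such overrides manageable is to merge them over a base dictionary before handing the result to `gf.settings.set_gf_params`. The sketch below is plain Python; `merge_config` is a hypothetical helper, and whether `set_gf_params` itself fills in defaults for missing keys is not stated here, so treat that as something to verify.

```python
def merge_config(base, overrides):
    """Return a new dict with `overrides` applied on top of `base`.

    Unknown keys are rejected so that a typo such as `n_neighbours`
    fails loudly instead of being silently ignored.
    """
    unknown = sorted(set(overrides) - set(base))
    if unknown:
        raise KeyError(f"Unknown config keys: {unknown}")
    merged = dict(base)
    merged.update(overrides)
    return merged


# A small subset of the keys used in the usage example.
base_config = {
    "n_neighbors": 5,
    "conv_type": "GATv2Conv",
    "learning_rate": 0.001,
    "seed": 2024,
}

run_config = merge_config(base_config, {"n_neighbors": 15, "seed": 0})
print(run_config)
```

Rejecting unknown keys is a design choice: it trades a little flexibility for early detection of misspelled parameter names.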
### Main Parameters of Garfield Model
#### Data Preprocessing Parameters
- **adata_list**: List of AnnData objects containing data from multiple batches or samples.
- **profile**: Specifies the data profile type (e.g., 'RNA', 'ATAC', 'ADT', 'multi-modal', 'spatial').
- **data_type**: Type of the multi-omics dataset (e.g., Paired, UnPaired) for preprocessing.
- **sub_data_type**: List of data types for multi-modal datasets (e.g., ['rna', 'atac'] or ['rna', 'adt']).
- **sample_col**: Column in the dataset that indicates batch or sample identifiers.
- **weight**: Weighting factor that determines the contribution of different modalities or types of graphs in multi-omics or spatial data.
  - For non-spatial single-cell multi-omics data (e.g., RNA + ATAC), `weight` specifies the contribution of the graph constructed from scRNA data. The remaining (1 - weight) represents the contribution from the other modality.
  - For spatial single-modality data, `weight` refers to the contribution of the graph constructed from the physical spatial information, while (1 - weight) reflects the contribution from the molecular graph (RNA graph).
- **graph_const_method**: Method for constructing the graph (e.g., 'mu_std', 'Radius', 'KNN', 'Squidpy').
- **genome**: Reference genome to use during preprocessing.
- **use_gene_weight**: Whether to apply gene weights in the preprocessing step.
- **use_top_pcs**: Whether to use the top principal components during dimensionality reduction.
- **used_hvg**: Whether to use highly variable genes (HVGs) for analysis.
- **min_features**: Minimum number of features required for a cell to be included in the dataset.
- **min_cells**: Minimum number of cells required for a feature to be retained in the dataset.
- **keep_mt**: Whether to retain mitochondrial genes in the analysis.
- **target_sum**: Target sum used for normalization (e.g., 1e4 for counts per cell).
- **rna_n_top_features**: Number of top features to retain for RNA datasets.
- **atac_n_top_features**: Number of top features to retain for ATAC datasets.
- **n_components**: Number of components to use for dimensionality reduction (e.g., PCA).
- **n_neighbors**: Number of neighbors to use in graph-based algorithms.
- **metric**: Distance metric used during graph construction.
- **svd_solver**: Solver for singular value decomposition (SVD).
- **adj_key**: Key in the AnnData object that holds the adjacency matrix.
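
To make the role of `weight` concrete, the sketch below blends two weighted edge sets as `weight * A + (1 - weight) * B`. This linear blend is an assumption made for illustration; the exact way Garfield combines the two graphs is not described here, and `blend_graphs` is a hypothetical helper.

```python
def blend_graphs(graph_a, graph_b, weight=0.5):
    """Blend two graphs given as {(i, j): edge_weight} dictionaries.

    `graph_a` plays the role of the RNA graph (or the spatial graph for
    spatial data) and contributes `weight`; `graph_b` contributes
    `1 - weight`. An edge missing from one graph counts as 0 there.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    blended = {}
    for edge in set(graph_a) | set(graph_b):
        value = (weight * graph_a.get(edge, 0.0)
                 + (1.0 - weight) * graph_b.get(edge, 0.0))
        if value > 0.0:
            blended[edge] = value
    return blended


rna_graph = {(0, 1): 1.0, (1, 2): 1.0}
atac_graph = {(0, 1): 1.0, (0, 2): 1.0}
print(blend_graphs(rna_graph, atac_graph, weight=0.5))
```

With `weight=0.5`, edges supported by both graphs keep their full weight while edges found in only one graph are halved; with `weight=1.0`, only the first graph survives.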
#### Data Split Parameters
- **edge_val_ratio**: Ratio of edges to use for validation in edge-level tasks.
- **edge_test_ratio**: Ratio of edges to use for testing in edge-level tasks.
- **node_val_ratio**: Ratio of nodes to use for validation in node-level tasks.
- **node_test_ratio**: Ratio of nodes to use for testing in node-level tasks.
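
As an illustration of how such ratios partition a dataset, the following sketch shuffles a list of edges (or nodes) and cuts it into train, validation, and test parts. It is a generic standard-library example under stated assumptions, not Garfield's actual splitting code; `split_items` is a hypothetical name.

```python
import random


def split_items(items, val_ratio=0.1, test_ratio=0.0, seed=2024):
    """Shuffle `items` and split them into train / validation / test lists."""
    if val_ratio < 0 or test_ratio < 0 or val_ratio + test_ratio >= 1:
        raise ValueError("ratios must be non-negative and sum to less than 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_ratio)
    n_test = int(len(shuffled) * test_ratio)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, val, test


edges = [(i, i + 1) for i in range(100)]
train_edges, val_edges, test_edges = split_items(edges, val_ratio=0.1, test_ratio=0.0)
print(len(train_edges), len(val_edges), len(test_edges))  # 90 10 0
```

Setting a test ratio of `0.`, as in the usage example, simply leaves the test part empty.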
#### Model Architecture Parameters
- **augment_type**: Type of augmentation to use (e.g., 'dropout', 'svd').
- **svd_q**: Rank for the low-rank SVD approximation.
- **use_FCencoder**: Whether to use a fully connected encoder before the graph layers.
- **hidden_dims**: List of hidden layer dimensions for the encoder.
- **bottle_neck_neurons**: Number of neurons in the bottleneck (latent) layer.
- **num_heads**: Number of attention heads for each graph attention layer.
- **dropout**: Dropout rate applied during training.
- **concat**: Whether to concatenate attention heads or not.
- **drop_feature_rate**: Dropout rate applied to node features.
- **drop_edge_rate**: Dropout rate applied to edges during augmentation.
- **used_edge_weight**: Whether to use edge weights in the graph layers.
- **used_DSBN**: Whether to use domain-specific batch normalization.
- **conv_type**: Type of graph convolution to use ('GAT', 'GATv2Conv', or 'GCN').
- **gnn_layer**: Number of times the graph neural network (GNN) encoder is repeated in the forward pass.
- **cluster_num**: Number of clusters for latent feature clustering.
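
In graph contrastive learning generally, augmentation produces two perturbed views of the same graph that the model is trained to map close together. The sketch below shows random edge dropping, the kind of perturbation that `drop_edge_rate` controls. It is a generic illustration using the standard library; how Garfield applies the rate internally is an assumption, and `drop_edges` is a hypothetical helper.

```python
import random


def drop_edges(edges, drop_rate=0.2, seed=None):
    """Return a copy of `edges` with each edge removed with probability `drop_rate`."""
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must lie in [0, 1)")
    rng = random.Random(seed)
    # Keep an edge when its random draw is at or above the drop rate.
    return [edge for edge in edges if rng.random() >= drop_rate]


edges = [(i, j) for i in range(20) for j in range(i + 1, 20)]
view_1 = drop_edges(edges, drop_rate=0.2, seed=1)
view_2 = drop_edges(edges, drop_rate=0.2, seed=2)
print(len(edges), len(view_1), len(view_2))
```

Using different seeds yields two different views of the same graph; on average each view keeps about 80% of the edges at a rate of 0.2.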
#### Data Loader Parameters
- **num_neighbors**: Number of neighbors to sample for graph-based data loaders.
- **loaders_n_hops**: Number of hops for neighbors during graph construction.
- **edge_batch_size**: Batch size for edge-level tasks.
- **node_batch_size**: Batch size for node-level tasks.
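
The interplay of `num_neighbors` and `loaders_n_hops` can be pictured as neighbor sampling: starting from the nodes in a batch, up to `num_neighbors` neighbors are drawn per node, and this is repeated for each hop. The sketch below is a generic, standard-library illustration of that idea, not Garfield's loader; `sample_subgraph` is a hypothetical helper.

```python
import random


def sample_subgraph(adjacency, seeds, num_neighbors=5, n_hops=2, seed=2024):
    """Collect nodes reached by sampling up to `num_neighbors` neighbors per node for `n_hops` hops."""
    rng = random.Random(seed)
    visited = set(seeds)
    frontier = list(seeds)
    for _ in range(n_hops):
        next_frontier = []
        for node in frontier:
            neighbors = adjacency.get(node, [])
            k = min(num_neighbors, len(neighbors))
            for neighbor in rng.sample(neighbors, k):
                if neighbor not in visited:
                    visited.add(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return visited


# A ring of 10 nodes: each node is connected to its two neighbors.
ring = {i: [(i - 1) % 10, (i + 1) % 10] for i in range(10)}
print(sorted(sample_subgraph(ring, seeds=[0], num_neighbors=5, n_hops=2)))  # [0, 1, 2, 8, 9]
```

Because each ring node has only two neighbors, both are always taken here; on denser graphs `num_neighbors` caps how quickly the sampled subgraph grows with each hop.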
#### Loss Function Parameters
- **include_edge_recon_loss**: Whether to include edge reconstruction loss in the training objective.
- **include_gene_expr_recon_loss**: Whether to include gene expression reconstruction loss in the training objective.
- **used_mmd**: Whether to use maximum mean discrepancy (MMD) for domain adaptation.
- **lambda_latent_contrastive_instanceloss**: Weight for the instance-level contrastive loss.
- **lambda_latent_contrastive_clusterloss**: Weight for the cluster-level contrastive loss.
- **lambda_gene_expr_recon**: Weight for the gene expression reconstruction loss.
- **lambda_edge_recon**: Weight for the edge reconstruction loss.
- **lambda_omics_recon_mmd_loss**: Weight for the MMD loss in omics reconstruction tasks.
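
The `lambda_*` weights suggest that the training objective is a weighted sum of the individual loss terms. The sketch below shows that combination in plain Python; the term names and the example loss values are invented for illustration, and the weighted-sum form is an assumption about how the weights are applied.

```python
def total_loss(terms, lambdas):
    """Combine named loss terms into one scalar using per-term weights."""
    missing = sorted(set(terms) - set(lambdas))
    if missing:
        raise KeyError(f"No weight given for loss terms: {missing}")
    return sum(lambdas[name] * value for name, value in terms.items())


# Weights taken from the usage example; loss values are made up.
lambdas = {"latent_contrastive": 1.0, "gene_expr_recon": 300.0, "edge_recon": 500000.0}
terms = {"latent_contrastive": 2.0, "gene_expr_recon": 0.01, "edge_recon": 0.000002}
print(total_loss(terms, lambdas))
```

Large weights such as `500000.` compensate for loss terms whose raw values are very small, so that each term contributes on a comparable scale. A disabled term (for example when `include_edge_recon_loss` is `False`) would simply be left out of `terms`.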
#### Training Parameters
- **n_epochs**: Number of training epochs.
- **n_epochs_no_edge_recon**: Number of epochs without edge reconstruction loss.
- **learning_rate**: Learning rate for the optimizer.
- **weight_decay**: Weight decay (L2 regularization) for the optimizer.
- **gradient_clipping**: Maximum norm for gradient clipping.
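
Gradient clipping by norm rescales the gradient whenever its L2 norm exceeds a threshold, which is the role `gradient_clipping` describes. The sketch below implements the idea on a flat list of values with the standard library; it is comparable in spirit to PyTorch's `torch.nn.utils.clip_grad_norm_`, but whether Garfield calls that function is an assumption, and `clip_by_norm` is a hypothetical helper.

```python
import math


def clip_by_norm(gradients, max_norm=5.0):
    """Rescale a flat list of gradient values so its L2 norm is at most `max_norm`."""
    norm = math.sqrt(sum(g * g for g in gradients))
    if norm <= max_norm or norm == 0.0:
        return list(gradients)
    scale = max_norm / norm
    return [g * scale for g in gradients]


print(clip_by_norm([3.0, 4.0], max_norm=5.0))    # norm is 5, so the values are left unchanged
print(clip_by_norm([30.0, 40.0], max_norm=5.0))  # norm is 50, so the values are scaled down
```

Clipping preserves the direction of the gradient and only limits its length, which helps keep training stable when individual batches produce unusually large gradients.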
#### Other Parameters
- **latent_key**: Key for storing latent features in the AnnData object.
- **reload_best_model**: Whether to reload the best model after training.
- **use_early_stopping**: Whether to use early stopping during training.
- **early_stopping_kwargs**: Arguments for configuring early stopping (e.g., patience, delta).
- **monitor**: Whether to print training progress.
- **seed**: Random seed for reproducibility.
- **verbose**: Whether to display detailed logs during training.
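
Early stopping with a patience and a delta, as mentioned for `early_stopping_kwargs`, can be sketched as follows. This is a generic illustration, not Garfield's implementation: the class name `EarlyStopper` and its exact argument names are hypothetical.

```python
class EarlyStopper:
    """Signal a stop when the monitored loss has not improved for `patience` epochs."""

    def __init__(self, patience=10, delta=0.0):
        self.patience = patience
        self.delta = delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, loss):
        """Record one epoch's validation loss and return True if training should stop."""
        if loss < self.best - self.delta:
            # Improvement by more than `delta`: remember it and reset the counter.
            self.best = loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


stopper = EarlyStopper(patience=2, delta=0.0)
for epoch, val_loss in enumerate([1.0, 0.8, 0.81, 0.79, 0.80, 0.82]):
    if stopper.step(val_loss):
        print(f"stopping after epoch {epoch}")  # stopping after epoch 5
        break
```

Combined with `reload_best_model`, the weights from the best epoch (here the one with loss 0.79) would be the ones kept after training stops.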
## Support
Please submit issues or reach out to zhouwg1314@gmail.com.
## Acknowledgment
Garfield uses and/or references the following libraries and packages:
- [NicheCompass](https://github.com/Lotfollahi-lab/nichecompass)
- [scArches](https://github.com/theislab/scarches)
- [SIMBA](https://github.com/pinellolab/simba)
- [MaxFuse](https://github.com/shuxiaoc/maxfuse)
- [scanpy](https://github.com/scverse/scanpy)
Thanks to all their contributors and maintainers!
## Citation
If you have used Garfield in your work, please consider citing:
```bibtex
@misc{2024Garfield,
title={Garfield: Graph-based Contrastive Learning enable Fast Single-Cell Embedding},
author={Weige Zhou},
howpublished = {\url{https://github.com/zhou-1314/Garfield}},
year={2024}
}
```
Raw data
{
"_id": null,
"home_page": "https://github.com/zhou-1314/Garfield",
"name": "Garfield",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.7",
"maintainer_email": null,
"keywords": null,
"author": "Weige Zhou",
"author_email": "zhouwg1314@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/bc/53/3ffef182cff826ed5b34dbdd16ac4db911b8431b11e42a3f1768768fca25/garfield-0.3.5.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "BSD",
"summary": "Garfield: Graph-based Contrastive Learning enable Fast Single-Cell Embedding",
"version": "0.3.5",
"project_urls": {
"Homepage": "https://github.com/zhou-1314/Garfield"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "dee34facfc98b16c503249024abfc98f3a24334307a40520ba560ef1c8957e85",
"md5": "3c3e20a64ee350f75191a164703da27c",
"sha256": "4a2c01607b2fd27c0a4c84678297b1eeccd5135824bbe755ff1491ef1edab0bd"
},
"downloads": -1,
"filename": "Garfield-0.3.5-py2.py3-none-any.whl",
"has_sig": false,
"md5_digest": "3c3e20a64ee350f75191a164703da27c",
"packagetype": "bdist_wheel",
"python_version": "py2.py3",
"requires_python": ">=3.7",
"size": 1953294,
"upload_time": "2024-10-04T02:05:22",
"upload_time_iso_8601": "2024-10-04T02:05:22.764751Z",
"url": "https://files.pythonhosted.org/packages/de/e3/4facfc98b16c503249024abfc98f3a24334307a40520ba560ef1c8957e85/Garfield-0.3.5-py2.py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "bc533ffef182cff826ed5b34dbdd16ac4db911b8431b11e42a3f1768768fca25",
"md5": "de9f328e32a01e3b467465d033f6f43c",
"sha256": "ed35b6d8e20e0b284d1e9fb342a5ad228c9a66c5d3bca55f5488450acfa13602"
},
"downloads": -1,
"filename": "garfield-0.3.5.tar.gz",
"has_sig": false,
"md5_digest": "de9f328e32a01e3b467465d033f6f43c",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.7",
"size": 5112773,
"upload_time": "2024-10-04T02:05:26",
"upload_time_iso_8601": "2024-10-04T02:05:26.308301Z",
"url": "https://files.pythonhosted.org/packages/bc/53/3ffef182cff826ed5b34dbdd16ac4db911b8431b11e42a3f1768768fca25/garfield-0.3.5.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-10-04 02:05:26",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "zhou-1314",
"github_project": "Garfield",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"requirements": [],
"lcname": "garfield"
}