Garfield

Name	Garfield
Version	0.3.5
home_page	https://github.com/zhou-1314/Garfield
Summary	Garfield: Graph-based Contrastive Learning enable Fast Single-Cell Embedding
upload_time	2024-10-04 02:05:26
maintainer	None
docs_url	None
author	Weige Zhou
requires_python	>=3.7
license	BSD
requirements	No requirements were recorded.

# Garfield: **G**raph-based Contrastive Le**ar**ning enable **F**ast S**i**ngle-C**el**l Embe**d**ding
<img src="imgs/Garfield_framework2.png" alt="Garfield" width="900"/>

## Installation
Install `Garfield` from PyPI with:
```bash
pip install Garfield
```

Or install directly from GitHub:

```bash
pip install git+https://github.com/zhou-1314/Garfield.git
```

Or clone the repository and install:

```bash
git clone https://github.com/zhou-1314/Garfield.git
cd Garfield
python setup.py install
```

Garfield is implemented with the [PyTorch](https://pytorch.org/) framework.

## Usage

```python
## load packages
import os
import Garfield as gf
import scanpy as sc
print(gf.__version__)

# set workdir
workdir = 'garfield_multiome_10xbrain'
gf.settings.set_workdir(workdir)

### modify parameter
user_config = dict(
    ## Input options
    adata_list=mdata,  # `mdata` is your input data, loaded beforehand (not shown here)
    profile='multi-modal', 
    data_type='Paired',
    sub_data_type=['rna', 'atac'],
    sample_col=None, 
    weight=0.5,
    ## Preprocessing options
    graph_const_method='mu_std',
    genome='hg38',
    use_gene_weight=True,
    use_top_pcs=False,
    used_hvg=True,
    min_cells=3,
    min_features=0,
    keep_mt=False,
    target_sum=1e4,
    rna_n_top_features=3000,
    atac_n_top_features=100000,
    n_components=50,
    n_neighbors=5,
    metric='euclidean', 
    svd_solver='arpack',
    # datasets
    adj_key='connectivities',
    # data split parameters
    edge_val_ratio=0.1,
    edge_test_ratio=0.,
    node_val_ratio=0.1,
    node_test_ratio=0.,
    ## Model options
    augment_type='svd',
    svd_q=5,
    use_FCencoder=False,
    conv_type='GATv2Conv', # GAT or GATv2Conv or GCN
    gnn_layer=2,
    hidden_dims=[128, 128],
    bottle_neck_neurons=20,
    cluster_num=20,
    drop_feature_rate=0.2, 
    drop_edge_rate=0.2,
    num_heads=3,
    dropout=0.2,
    concat=True,
    used_edge_weight=False,
    used_DSBN=False,
    used_mmd=False,
    # data loader parameters
    num_neighbors=5,
    loaders_n_hops=2,
    edge_batch_size=4096,
    node_batch_size=128, # None
    # loss parameters
    include_edge_recon_loss=True,
    include_gene_expr_recon_loss=True,
    lambda_latent_contrastive_loss=1.0,
    lambda_gene_expr_recon=300.,
    lambda_edge_recon=500000.,
    lambda_omics_recon_mmd_loss=0.2,
    # train parameters
    n_epochs_no_edge_recon=0,
    learning_rate=0.001,
    weight_decay=1e-05,
    gradient_clipping=5,
    # other parameters
    latent_key='garfield_latent',
    reload_best_model=True,
    use_early_stopping=True,
    early_stopping_kwargs=None,
    monitor=True,
    seed=2024,
    verbose=True
)
dict_config = gf.settings.set_gf_params(user_config)

from Garfield.model import Garfield

# Initialize model
model = Garfield(dict_config)
# Train model
model.train()
# Compute latent neighbor graph
latent_key = 'garfield_latent'
sc.pp.neighbors(model.adata,
                use_rep=latent_key,
                key_added=latent_key)
# Compute UMAP embedding
sc.tl.umap(model.adata,
           neighbors_key=latent_key)
# Plot the UMAP (assumes a 'celltype' column exists in model.adata.obs)
sc.pl.umap(model.adata, color=['celltype'], show=True, size=3)

model_folder_path = "./slideseqv2_mouse_hippocampus/model"
os.makedirs(model_folder_path, exist_ok=True)
# Save trained model
model.save(dir_path=model_folder_path,
           overwrite=True,
           save_adata=True,
           adata_file_name="adata.h5ad")
```
### Main Parameters of Garfield Model

#### Data Preprocessing Parameters

- **adata_list**: List of AnnData objects containing data from multiple batches or samples.
- **profile**: Specifies the data profile type (e.g., 'RNA', 'ATAC', 'ADT', 'multi-modal', 'spatial').
- **data_type**: Type of the multi-omics dataset (e.g., Paired, UnPaired) for preprocessing.
- **sub_data_type**: List of data types for multi-modal datasets (e.g., ['rna', 'atac'] or ['rna', 'adt']).
- **sample_col**: Column in the dataset that indicates batch or sample identifiers.
- **weight**: Weighting factor that determines the contribution of different modalities or types of graphs in multi-omics or spatial data.
  - For non-spatial single-cell multi-omics data (e.g., RNA + ATAC),
    `weight` specifies the contribution of the graph constructed from scRNA data.
    The remaining (1 - weight) represents the contribution from the other modality.
  - For spatial single-modality data,
    `weight` refers to the contribution of the graph constructed from the physical spatial information,
    while (1 - weight) reflects the contribution from the molecular graph (RNA graph).
- **graph_const_method**: Method for constructing the graph (e.g., 'mu_std', 'Radius', 'KNN', 'Squidpy').
- **genome**: Reference genome to use during preprocessing.
- **use_gene_weight**: Whether to apply gene weights in the preprocessing step.
- **use_top_pcs**: Whether to use the top principal components during dimensionality reduction.
- **used_hvg**: Whether to use highly variable genes (HVGs) for analysis.
- **min_features**: Minimum number of features required for a cell to be included in the dataset.
- **min_cells**: Minimum number of cells required for a feature to be retained in the dataset.
- **keep_mt**: Whether to retain mitochondrial genes in the analysis.
- **target_sum**: Target sum used for normalization (e.g., 1e4 for counts per cell).
- **rna_n_top_features**: Number of top features to retain for RNA datasets.
- **atac_n_top_features**: Number of top features to retain for ATAC datasets.
- **n_components**: Number of components to use for dimensionality reduction (e.g., PCA).
- **n_neighbors**: Number of neighbors to use in graph-based algorithms.
- **metric**: Distance metric used during graph construction.
- **svd_solver**: Solver for singular value decomposition (SVD).
- **adj_key**: Key in the AnnData object that holds the adjacency matrix.
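
The role of `weight` described above can be pictured as a convex combination of two graphs. The sketch below is only an illustration under that assumption: the function name `combine_graphs` and the toy adjacency matrices are hypothetical, and Garfield's internal graph fusion may differ.

```python
def combine_graphs(adj_a, adj_b, weight):
    """Blend two adjacency matrices as weight * adj_a + (1 - weight) * adj_b.

    Hypothetical helper for illustration; not part of the Garfield API.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    if len(adj_a) != len(adj_b) or any(
        len(row_a) != len(row_b) for row_a, row_b in zip(adj_a, adj_b)
    ):
        raise ValueError("adjacency matrices must have the same shape")
    return [
        [weight * a + (1.0 - weight) * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(adj_a, adj_b)
    ]


# Toy 3-cell graphs with made-up edges
rna_graph = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
atac_graph = [[0, 0, 1], [0, 0, 1], [1, 1, 0]]

combined = combine_graphs(rna_graph, atac_graph, weight=0.5)
print(combined)
```

With `weight=1.0` only the first graph contributes, and with `weight=0.0` only the second one does.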

#### Data Split Parameters

- **edge_val_ratio**: Ratio of edges to use for validation in edge-level tasks.
- **edge_test_ratio**: Ratio of edges to use for testing in edge-level tasks.
- **node_val_ratio**: Ratio of nodes to use for validation in node-level tasks.
- **node_test_ratio**: Ratio of nodes to use for testing in node-level tasks.
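
To make the ratios concrete, here is a minimal sketch of a random train/validation/test split driven by a validation ratio and a test ratio. The helper `split_indices` is hypothetical and only mirrors the idea; how Garfield actually partitions nodes and edges is not shown in this README.

```python
import random


def split_indices(n_items, val_ratio, test_ratio, seed=2024):
    """Shuffle indices 0..n_items-1 and split them into train/val/test lists.

    Hypothetical helper for illustration; not part of the Garfield API.
    """
    if val_ratio < 0 or test_ratio < 0 or val_ratio + test_ratio >= 1:
        raise ValueError("ratios must be non-negative and sum to less than 1")
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_val = round(n_items * val_ratio)
    n_test = round(n_items * test_ratio)
    val = indices[:n_val]
    test = indices[n_val:n_val + n_test]
    train = indices[n_val + n_test:]
    return train, val, test


# Same ratios as the usage example: 10% validation, no test set
train, val, test = split_indices(1000, val_ratio=0.1, test_ratio=0.0)
print(len(train), len(val), len(test))
```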

#### Model Architecture Parameters

- **augment_type**: Type of augmentation to use (e.g., 'dropout', 'svd').
- **svd_q**: Rank for the low-rank SVD approximation.
- **use_FCencoder**: Whether to use a fully connected encoder before the graph layers.
- **hidden_dims**: List of hidden layer dimensions for the encoder.
- **bottle_neck_neurons**: Number of neurons in the bottleneck (latent) layer.
- **num_heads**: Number of attention heads for each graph attention layer.
- **dropout**: Dropout rate applied during training.
- **concat**: Whether to concatenate attention heads or not.
- **drop_feature_rate**: Dropout rate applied to node features.
- **drop_edge_rate**: Dropout rate applied to edges during augmentation.
- **used_edge_weight**: Whether to use edge weights in the graph layers.
- **used_DSBN**: Whether to use domain-specific batch normalization.
- **conv_type**: Type of graph convolution to use ('GAT', 'GATv2Conv', or 'GCN').
- **gnn_layer**: Number of times the graph neural network (GNN) encoder is repeated in the forward pass.
- **cluster_num**: Number of clusters for latent feature clustering.
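
As a rough picture of what `drop_edge_rate` and `drop_feature_rate` control, the sketch below applies dropout-style augmentation to a toy graph, in the way commonly used in graph contrastive learning. The function names and data are hypothetical, and this is an assumption about the general technique rather than Garfield's actual augmentation code.

```python
import random


def drop_edges(edges, drop_rate, rng):
    """Keep each edge independently with probability 1 - drop_rate."""
    return [edge for edge in edges if rng.random() >= drop_rate]


def drop_features(features, drop_rate, rng):
    """Zero out whole feature columns, each with probability drop_rate."""
    n_features = len(features[0])
    keep = [rng.random() >= drop_rate for _ in range(n_features)]
    return [[x if k else 0.0 for x, k in zip(row, keep)] for row in features]


rng = random.Random(2024)
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]          # toy edge list
features = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]     # toy cell-by-feature matrix

# Two independently augmented "views" of the same graph
view_1 = (drop_edges(edges, 0.2, rng), drop_features(features, 0.2, rng))
view_2 = (drop_edges(edges, 0.2, rng), drop_features(features, 0.2, rng))
print(view_1)
print(view_2)
```

A contrastive objective would then pull together the embeddings of the same cell across the two views.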

#### Data Loader Parameters

- **num_neighbors**: Number of neighbors to sample for graph-based data loaders.
- **loaders_n_hops**: Number of hops for neighbors during graph construction.
- **edge_batch_size**: Batch size for edge-level tasks.
- **node_batch_size**: Batch size for node-level tasks.
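
The interplay of `num_neighbors` and `loaders_n_hops` can be sketched as multi-hop neighbor sampling around a batch of seed nodes. The helper below is a hypothetical, simplified illustration of that idea, assuming a plain adjacency dictionary; it is not the loader Garfield uses.

```python
import random


def sample_subgraph(adjacency, seed_nodes, num_neighbors, n_hops, rng):
    """Collect nodes reachable within n_hops, sampling at most
    num_neighbors neighbors per node at every hop."""
    visited = set(seed_nodes)
    frontier = list(seed_nodes)
    for _ in range(n_hops):
        next_frontier = []
        for node in frontier:
            neighbors = adjacency.get(node, [])
            k = min(num_neighbors, len(neighbors))
            for neighbor in rng.sample(neighbors, k):
                if neighbor not in visited:
                    visited.add(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return visited


# Toy chain graph 0 - 1 - 2 - 3 - 4
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
nodes = sample_subgraph(adjacency, seed_nodes=[0], num_neighbors=5,
                        n_hops=2, rng=random.Random(2024))
print(sorted(nodes))
```

Larger values of either parameter grow the sampled subgraph per batch, which trades memory and speed for more graph context.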

#### Loss Function Parameters

- **include_edge_recon_loss**: Whether to include edge reconstruction loss in the training objective.
- **include_gene_expr_recon_loss**: Whether to include gene expression reconstruction loss in the training objective.
- **used_mmd**: Whether to use maximum mean discrepancy (MMD) for domain adaptation.
- **lambda_latent_contrastive_instanceloss**: Weight for the instance-level contrastive loss.
- **lambda_latent_contrastive_clusterloss**: Weight for the cluster-level contrastive loss.
- **lambda_gene_expr_recon**: Weight for the gene expression reconstruction loss.
- **lambda_edge_recon**: Weight for the edge reconstruction loss.
- **lambda_omics_recon_mmd_loss**: Weight for the MMD loss in omics reconstruction tasks.
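
One way to read the `lambda_*` weights and the `include_*` switches is as a weighted sum of loss terms. The sketch below assumes that reading; the function `combine_losses`, the term names, and the loss values are hypothetical placeholders, and the exact objective used by Garfield is not stated in this README.

```python
def combine_losses(terms, lambdas, include=None):
    """Return sum(lambda_k * loss_k) over the terms that are switched on.

    Hypothetical helper for illustration; not part of the Garfield API.
    """
    include = include or {}
    total = 0.0
    for name, value in terms.items():
        if include.get(name, True):
            total += lambdas[name] * value
    return total


# Weights taken from the usage example; loss values are made up
lambdas = {"contrastive": 1.0, "gene_expr_recon": 300.0, "edge_recon": 500000.0}
terms = {"contrastive": 2.0, "gene_expr_recon": 0.01, "edge_recon": 1e-5}

full = combine_losses(terms, lambdas)
no_edge = combine_losses(terms, lambdas, include={"edge_recon": False})
print(full, no_edge)
```

The large weight on the edge term compensates for its small raw magnitude in this toy setting, so that no single term dominates the sum.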

#### Training Parameters

- **n_epochs**: Number of training epochs.
- **n_epochs_no_edge_recon**: Number of epochs without edge reconstruction loss.
- **learning_rate**: Learning rate for the optimizer.
- **weight_decay**: Weight decay (L2 regularization) for the optimizer.
- **gradient_clipping**: Maximum norm for gradient clipping.
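
For `gradient_clipping`, the usual norm-based rule rescales the gradient whenever its L2 norm exceeds the threshold (PyTorch provides this as `torch.nn.utils.clip_grad_norm_`). The pure-Python sketch below illustrates the rule on a flat list of numbers; whether Garfield applies exactly this variant is an assumption.

```python
import math


def clip_by_norm(grads, max_norm):
    """Rescale grads so that their L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm == 0.0 or norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


small = clip_by_norm([3.0, 4.0], max_norm=5)     # norm 5, left unchanged
large = clip_by_norm([30.0, 40.0], max_norm=5)   # norm 50, scaled down
print(small, large)
```

Clipping preserves the direction of the gradient and only limits its length.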

#### Other Parameters

- **latent_key**: Key for storing latent features in the AnnData object.
- **reload_best_model**: Whether to reload the best model after training.
- **use_early_stopping**: Whether to use early stopping during training.
- **early_stopping_kwargs**: Arguments for configuring early stopping (e.g., patience, delta).
- **monitor**: Whether to print training progress.
- **seed**: Random seed for reproducibility.
- **verbose**: Whether to display detailed logs during training.
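
The `patience` and `delta` arguments mentioned for early stopping usually work as in the minimal sketch below: training stops once the validation loss has failed to improve by more than `delta` for `patience` consecutive epochs. The class `EarlyStopper` and the loss values are hypothetical; Garfield's own early-stopping implementation and its accepted keyword names may differ.

```python
class EarlyStopper:
    """Track the best validation loss and signal when to stop."""

    def __init__(self, patience=3, delta=0.0):
        self.patience = patience
        self.delta = delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best - self.delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation losses for seven epochs
val_losses = [1.0, 0.8, 0.79, 0.81, 0.80, 0.82, 0.5]
stopper = EarlyStopper(patience=3, delta=0.0)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.step(loss):
        stopped_at = epoch
        break
print(stopped_at, stopper.best)
```

Combined with `reload_best_model`, the weights from the best epoch (here the one with loss 0.79) would be the ones kept.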

## Support

Please submit issues or reach out to zhouwg1314@gmail.com.

## Acknowledgment
Garfield uses and/or references the following libraries and packages:

- [NicheCompass](https://github.com/Lotfollahi-lab/nichecompass)
- [scArches](https://github.com/theislab/scarches)
- [SIMBA](https://github.com/pinellolab/simba)
- [MaxFuse](https://github.com/shuxiaoc/maxfuse)
- [scanpy](https://github.com/scverse/scanpy)

Thanks to all their contributors and maintainers!

## Citation
If you have used Garfield in your work, please consider citing:
```bibtex
@misc{2024Garfield,
    title={Garfield: Graph-based Contrastive Learning enable Fast Single-Cell Embedding},
    author={Weige Zhou},
    howpublished = {\url{https://github.com/zhou-1314/Garfield}},
    year={2024}
}
```

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/zhou-1314/Garfield",
    "name": "Garfield",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.7",
    "maintainer_email": null,
    "keywords": null,
    "author": "Weige Zhou",
    "author_email": "zhouwg1314@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/bc/53/3ffef182cff826ed5b34dbdd16ac4db911b8431b11e42a3f1768768fca25/garfield-0.3.5.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "BSD",
    "summary": "Garfield: Graph-based Contrastive Learning enable Fast Single-Cell Embedding",
    "version": "0.3.5",
    "project_urls": {
        "Homepage": "https://github.com/zhou-1314/Garfield"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "dee34facfc98b16c503249024abfc98f3a24334307a40520ba560ef1c8957e85",
                "md5": "3c3e20a64ee350f75191a164703da27c",
                "sha256": "4a2c01607b2fd27c0a4c84678297b1eeccd5135824bbe755ff1491ef1edab0bd"
            },
            "downloads": -1,
            "filename": "Garfield-0.3.5-py2.py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "3c3e20a64ee350f75191a164703da27c",
            "packagetype": "bdist_wheel",
            "python_version": "py2.py3",
            "requires_python": ">=3.7",
            "size": 1953294,
            "upload_time": "2024-10-04T02:05:22",
            "upload_time_iso_8601": "2024-10-04T02:05:22.764751Z",
            "url": "https://files.pythonhosted.org/packages/de/e3/4facfc98b16c503249024abfc98f3a24334307a40520ba560ef1c8957e85/Garfield-0.3.5-py2.py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "bc533ffef182cff826ed5b34dbdd16ac4db911b8431b11e42a3f1768768fca25",
                "md5": "de9f328e32a01e3b467465d033f6f43c",
                "sha256": "ed35b6d8e20e0b284d1e9fb342a5ad228c9a66c5d3bca55f5488450acfa13602"
            },
            "downloads": -1,
            "filename": "garfield-0.3.5.tar.gz",
            "has_sig": false,
            "md5_digest": "de9f328e32a01e3b467465d033f6f43c",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7",
            "size": 5112773,
            "upload_time": "2024-10-04T02:05:26",
            "upload_time_iso_8601": "2024-10-04T02:05:26.308301Z",
            "url": "https://files.pythonhosted.org/packages/bc/53/3ffef182cff826ed5b34dbdd16ac4db911b8431b11e42a3f1768768fca25/garfield-0.3.5.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-10-04 02:05:26",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "zhou-1314",
    "github_project": "Garfield",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [],
    "lcname": "garfield"
}
