genevector


- Name: genevector
- Version: 0.2.2
- Summary: Single Cell GeneVector Library
- Upload time: 2023-07-19 04:56:56
- Requires Python: >=3.7
- Requirements: scipy, leidenalg, numpy, sklearn, pandas, scanpy, umap-learn, tqdm, seaborn, matplotlib, scikit-misc

![alt text](https://github.com/nceglia/genevector/blob/main/logo.png?raw=true)
![alt text](https://github.com/nceglia/genevector/blob/main/framework.png?raw=true)

## Installation

Install using pip:
```
pip install genevector
```
or install from GitHub:
```
git clone https://github.com/nceglia/genevector.git
cd genevector
python3 -m venv gvenv
source gvenv/bin/activate
python3 -m pip install -r requirements.txt
python3 setup.py install
```
## Tutorials (see "example" directory)

1. PBMC workflow: Identification of an interferon-stimulated metagene and cell type annotation.
2. TICA workflow: Cell type assignment.
3. SPECTRUM workflow: Vector arithmetic for site-specific metagenes.
4. FITNESS workflow: Identifying increasing metagenes in time series.

## GeneVector Workflow

### Loading a Scanpy dataset into GeneVector
GeneVector operates on Scanpy AnnData objects and requires the raw count data to be stored in the ```.X``` matrix. Subsetting to highly variable genes with the *seurat_v3* flavor in Scanpy is highly recommended. Omit the ```device="cuda"``` flag if no GPU is available. All downstream analysis requires a GeneVectorDataset object.

```
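from genevector.data import GeneVectorDataset
from genevector.model import GeneVectorTrainer
from genevector.embedding import GeneEmbedding, CellEmbedding

import scanpy as sc

# adata must be an AnnData object with raw counts in .X;
# the file name below is a placeholder for your own data.
adata = sc.read_h5ad("raw_counts.h5ad")

dataset = GeneVectorDataset(adata, device="cuda")  # omit device= if no GPU is available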
```

### Training GeneVector
After loading the expression data, creating a GeneVectorTrainer object computes the mutual information between genes (this can take up to 15 minutes for a dataset of 250k cells). This object is only required if you wish to train a model. Training time varies with dataset size; the 10k PBMC dataset can be trained in less than five minutes. ```emb_dimension``` sets the size of the learned gene vectors. Smaller values decrease training time, but values smaller than 50 may not provide optimal results.

```
cmps = GeneVectorTrainer(dataset,
                         output_file="genes.vec",
                         emb_dimension=100,
                         threshold=1e-6,
                         device="cuda")
cmps.train(1000, threshold=1e-6) # run for 1000 iterations or until the loss delta falls below 1e-6.
```
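
As a rough illustration of what "mutual information between genes" measures, the sketch below estimates it for pairs of discrete count vectors using only the standard library. The function name and the toy counts are made up for this example, and this is not GeneVector's implementation, which may bin or normalize counts differently.

```python
import math
from collections import Counter


def mutual_information(x, y):
    """Mutual information (in nats) between two equal-length discrete sequences."""
    n = len(x)
    joint = Counter(zip(x, y))  # joint counts of (x, y) pairs
    px = Counter(x)             # marginal counts of x
    py = Counter(y)             # marginal counts of y
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        # p_ab / (p_a * p_b) simplifies to count * n / (px[a] * py[b])
        mi += p_ab * math.log(count * n / (px[a] * py[b]))
    return mi


# Toy counts for three genes across eight cells (made-up numbers).
gene_a = [0, 0, 1, 1, 2, 2, 0, 1]
gene_b = [0, 0, 1, 1, 2, 2, 0, 1]  # identical to gene_a: maximal dependence
gene_c = [1, 1, 1, 1, 1, 1, 1, 1]  # constant: shares no information

mi_ab = mutual_information(gene_a, gene_b)
mi_ac = mutual_information(gene_a, gene_c)
```

Here ```mi_ab``` equals the entropy of ```gene_a``` because the two vectors are identical, while ```mi_ac``` is zero because a constant gene tells us nothing about another gene.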

To understand model convergence, a loss plot by epoch can be generated.

```
cmps.plot()
```
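
The stopping rule described in the training comment (a fixed number of iterations, or a loss change below a threshold) can be sketched generically. The toy objective, the ```train``` helper and the ```gradient_step``` function below are hypothetical stand-ins, not part of the genevector API; the per-iteration losses they collect are the kind of values a loss plot would display.

```python
def train(step, max_iterations=1000, threshold=1e-6):
    """Call step() until max_iterations is reached or the loss change drops below threshold."""
    losses = []
    for _ in range(max_iterations):
        loss = step()
        converged = bool(losses) and abs(losses[-1] - loss) < threshold
        losses.append(loss)
        if converged:
            break
    return losses


# Toy objective: minimise (w - 3)^2 with plain gradient descent.
state = {"w": 0.0}


def gradient_step(learning_rate=0.1):
    w = state["w"]
    loss = (w - 3.0) ** 2
    state["w"] = w - learning_rate * 2.0 * (w - 3.0)  # gradient of (w - 3)^2 is 2(w - 3)
    return loss


losses = train(gradient_step, max_iterations=1000, threshold=1e-6)
```

The returned list holds one loss per iteration, so plotting it against its index gives a convergence curve.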

### Loading Gene Embedding
After training, two vector files are produced (one for the input weights and one for the output weights). It is recommended to take the average of both weights (```vector="average"```). The GeneEmbedding class has several important analysis and visualization methods, listed below.

```
gembed = GeneEmbedding("genes.vec", dataset, vector="average")
```
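
Conceptually, averaging the two weight sets means taking the element-wise mean of each gene's input and output vectors. The sketch below shows this with NumPy on made-up two-dimensional vectors; the dictionaries and the ```average_vectors``` helper are hypothetical and do not reflect how GeneEmbedding reads the vector files.

```python
import numpy as np

# Hypothetical input and output weight vectors for two genes (made-up values).
input_vectors = {
    "CD8A": np.array([0.2, 0.4]),
    "LYZ": np.array([-0.6, 1.0]),
}
output_vectors = {
    "CD8A": np.array([0.4, 0.0]),
    "LYZ": np.array([-0.2, 0.8]),
}


def average_vectors(first, second):
    """Element-wise mean of the two vectors for every gene present in both sets."""
    shared = sorted(set(first) & set(second))
    return {gene: (first[gene] + second[gene]) / 2.0 for gene in shared}


averaged = average_vectors(input_vectors, output_vectors)
```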

#### 1. Computing gene similarities
```compute_similarities``` returns a pandas dataframe of the most similar genes and their cosine similarities for a given query gene. ```plot_similarities``` draws a bar plot of a specified number of the most similar genes.

```
df = gembed.compute_similarities("CD8A")
gembed.plot_similarities("CD8A",n_genes=10)
```
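
For readers unfamiliar with cosine similarity, the following minimal sketch ranks genes by the cosine of the angle between their vectors and the query vector. The three-dimensional vectors and the ```most_similar``` helper are invented for illustration; the library's own method returns a dataframe rather than a list of tuples.

```python
import numpy as np


def most_similar(query, vectors, n_genes=10):
    """Return (gene, cosine similarity) pairs for the genes closest to the query."""
    q = vectors[query]
    scores = []
    for gene, vec in vectors.items():
        if gene == query:
            continue
        cosine = float(np.dot(q, vec) / (np.linalg.norm(q) * np.linalg.norm(vec)))
        scores.append((gene, cosine))
    # Highest similarity first
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:n_genes]


# Made-up gene vectors for illustration only.
vectors = {
    "CD8A": np.array([1.0, 0.0, 0.5]),
    "CD8B": np.array([0.9, 0.1, 0.6]),
    "CD3E": np.array([0.7, 0.3, 0.2]),
    "LYZ": np.array([-0.2, 1.0, 0.0]),
}

ranked = most_similar("CD8A", vectors, n_genes=3)
```

With these toy vectors, ```CD8B``` ranks first, followed by ```CD3E```, and ```LYZ``` ranks last with a negative similarity.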

#### 2. Generating Metagenes
```get_adata``` produces an AnnData object that houses the gene embedding, which allows the use of Scanpy and AnnData visualization functions. The ```resolution``` parameter is passed directly to ```sc.tl.leiden``` to cluster the co-expression graph. ```get_metagenes``` returns a dictionary that maps each metagene id to its list of genes. For a given id, the metagene can be visualized on a UMAP embedding with the ```plot_metagene``` function.

```
gdata = gembed.get_adata(resolution=40)
metagenes = gembed.get_metagenes(gdata)
# isg_metagene is the id (a key of metagenes) of the metagene to plot
gembed.plot_metagene(gdata, mg=isg_metagene)
```
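
The dictionary structure described above (one list of genes per metagene id) is what you get by grouping genes on their cluster label. The sketch below shows that grouping on made-up labels; the ```group_into_metagenes``` helper and the cluster assignments are hypothetical, and the real labels would come from the Leiden clustering.

```python
from collections import defaultdict


def group_into_metagenes(genes, cluster_labels):
    """Map each cluster label to the list of genes assigned to it."""
    metagenes = defaultdict(list)
    for gene, label in zip(genes, cluster_labels):
        metagenes[label].append(gene)
    return dict(metagenes)


genes = ["ISG15", "IFIT1", "IFIT3", "CD3E", "CD3D", "LYZ"]
labels = ["0", "0", "0", "1", "1", "2"]  # made-up cluster assignments

metagenes = group_into_metagenes(genes, labels)
```

A metagene of interest, such as a group of interferon-stimulated genes, can then be located by searching the dictionary for the id whose list contains a known marker.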

### Loading the Cell Embedding

Using the GeneEmbedding and GeneVectorDataset objects, a CellEmbedding object can be instantiated and used to produce a Scanpy AnnData object with ```get_adata```. This method stores the cell embedding under ```X_genevector``` in layers and generates a UMAP embedding using Scanpy. UMAPs can then be visualized with standard Scanpy functionality (```sc.pl.umap```).

```
cembed = CellEmbedding(dataset, gembed)
adata = cembed.get_adata()
```

The cell embedding can be batch corrected using ```cembed.batch_correct```. The user is required to select a valid column present in the *obs* dataframe and specify a reference label. This is a very fast operation.

```
cembed.batch_correct(column="sample",reference="control")
adata = cembed.get_adata()
```
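
One simple form of vector-arithmetic batch correction is to shift every non-reference batch so that its mean vector coincides with the mean of the reference batch. The sketch below implements that idea with NumPy on made-up data. Whether ```batch_correct``` works exactly this way is an assumption, and the ```shift_to_reference``` helper is hypothetical.

```python
import numpy as np


def shift_to_reference(embedding, batches, reference):
    """Subtract each batch's mean offset from the reference batch mean."""
    embedding = np.asarray(embedding, dtype=float)
    batches = np.asarray(batches)
    reference_mean = embedding[batches == reference].mean(axis=0)
    corrected = embedding.copy()
    for batch in np.unique(batches):
        if batch == reference:
            continue  # the reference batch is left untouched
        mask = batches == batch
        offset = embedding[mask].mean(axis=0) - reference_mean
        corrected[mask] -= offset
    return corrected


# Four cells in a 2-D embedding; the "treated" batch is offset by (5, 1).
cells = [[0.0, 0.0], [2.0, 2.0], [5.0, 1.0], [7.0, 3.0]]
sample = ["control", "control", "treated", "treated"]

corrected = shift_to_reference(cells, sample, reference="control")
```

Because the correction is a single subtraction per batch, it runs in time linear in the number of cells.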

#### Scoring Cells by Metagene

To score expression for each metagene against all cells, we can call the GeneEmbedding function ```score_metagenes``` with the cell-based AnnData object. To plot a heatmap of all metagenes over a set of cell labels, use the ```plot_metagenes_scores``` function. Metagenes are scored with the Scanpy ```sc.tl.score_genes``` function.

```
gembed.score_metagenes(adata, metagenes)
gembed.plot_metagenes_scores(adata, metagenes, "cell_type")
```
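
The idea behind gene-set scoring is to compare the average expression of the genes in the set with the average expression of a reference set. The sketch below is a deliberately simplified version that uses all remaining genes as the reference, whereas ```sc.tl.score_genes``` samples expression-matched reference genes. The ```score_gene_set``` helper and the expression values are made up.

```python
import numpy as np


def score_gene_set(expression, gene_names, gene_set):
    """Per-cell mean expression of gene_set minus mean expression of all other genes."""
    in_set = np.isin(gene_names, gene_set)
    return expression[:, in_set].mean(axis=1) - expression[:, ~in_set].mean(axis=1)


gene_names = np.array(["ISG15", "IFIT1", "CD3E", "LYZ"])
# Rows are cells, columns follow gene_names (made-up expression values).
expression = np.array([
    [4.0, 2.0, 1.0, 1.0],
    [0.0, 0.0, 2.0, 4.0],
    [1.0, 1.0, 1.0, 1.0],
])

isg_scores = score_gene_set(expression, gene_names, ["ISG15", "IFIT1"])
```

A positive score means the cell expresses the metagene above its background level, and a negative score means below.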

#### Performing Cell Type Assignment

Given a dictionary that maps cell type annotations to marker genes, each cell can be classified using the CellEmbedding function ```phenotype_probability```. This function returns a new annotated AnnData object, where the resulting classification can be found in ```.obs["genevector"]``` (the user can also supply a column name using ```column=```). A separate column in the obs dataframe is created to hold the pseudo-probabilities for each cell type. These probabilities can be shown on a UMAP using the standard Scanpy function ```sc.pl.umap```.

```
markers = dict()
markers["T Cell"] = ["CD3D","CD3G","CD3E","TRAC","IL32","CD2"]
markers["B/Plasma"] = ["CD79A","CD79B","MZB1","CD19","BANK1"]
markers["Myeloid"] = ["LYZ","CST3","AIF1","CD68","C1QA","C1QB","C1QC"]

annotated_adata = cembed.phenotype_probability(adata,markers)

prob_cols = [x for x in annotated_adata.obs.columns.tolist() if "Pseudo-probability" in x]
sc.pl.umap(annotated_adata,color=prob_cols+["genevector"],size=25)
```
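
One plausible way to obtain pseudo-probabilities is to compute the cosine similarity between a cell vector and a representative vector for each cell type, then normalize the similarities with a softmax. The sketch below assumes this scheme; the vectors, the ```temperature``` parameter and both helper functions are hypothetical and not taken from the genevector source.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def pseudo_probabilities(cell_vector, phenotype_vectors, temperature=0.1):
    """Softmax over cosine similarities between a cell and each phenotype vector."""
    names = list(phenotype_vectors)
    sims = np.array([cosine(cell_vector, phenotype_vectors[name]) for name in names])
    logits = sims / temperature
    logits -= logits.max()  # subtract the maximum for numerical stability
    weights = np.exp(logits)
    return dict(zip(names, weights / weights.sum()))


# Made-up 2-D vectors standing in for averaged marker gene vectors.
phenotype_vectors = {
    "T Cell": np.array([1.0, 0.0]),
    "B/Plasma": np.array([0.0, 1.0]),
    "Myeloid": np.array([-1.0, 0.0]),
}
cell_vector = np.array([0.9, 0.1])

probs = pseudo_probabilities(cell_vector, phenotype_vectors)
assigned = max(probs, key=probs.get)
```

The values sum to one across cell types, and the label with the largest value is taken as the cell's assignment.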

##### *All additional analyses described in the manuscript, including comparisons to LDVAE and CellAssign, can be found in Jupyter notebooks in the examples directory. Results were computed for v0.0.1.*

            

Raw data

            {
    "_id": null,
    "home_page": "",
    "name": "genevector",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.7",
    "maintainer_email": "",
    "keywords": "",
    "author": "",
    "author_email": "Nicholas Ceglia <nickceglia@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/57/1a/927e17e263d2abfdaeb207fba51b2612bbe879e7e4d9b606b800e8de870f/genevector-0.2.2.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "",
    "summary": "Single Cell GeneVector Library",
    "version": "0.2.2",
    "project_urls": {
        "Documentation": "https://genevector.readthedocs.io",
        "Homepage": "https://github.com/nceglia/genevector"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "e0884c0a4fdac6cce5909fa54e1b2e3e1834b460c2e8ad9ccbe114db1a5fce4a",
                "md5": "d118092c7a27fb3c30bb1fb4e6328e4a",
                "sha256": "442c6b33152152a797103597d2c82686abe25ba8e32a93774afa878e4ad707f3"
            },
            "downloads": -1,
            "filename": "genevector-0.2.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "d118092c7a27fb3c30bb1fb4e6328e4a",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.7",
            "size": 16581,
            "upload_time": "2023-07-19T04:56:46",
            "upload_time_iso_8601": "2023-07-19T04:56:46.929283Z",
            "url": "https://files.pythonhosted.org/packages/e0/88/4c0a4fdac6cce5909fa54e1b2e3e1834b460c2e8ad9ccbe114db1a5fce4a/genevector-0.2.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "571a927e17e263d2abfdaeb207fba51b2612bbe879e7e4d9b606b800e8de870f",
                "md5": "ec666689fd1c1010c7a1f46e33ef355f",
                "sha256": "7f10fdf7a0c30b6b2bdc6f8be9d0f8bcfd62bf1ebaf88a82f721be9f8eeb1cea"
            },
            "downloads": -1,
            "filename": "genevector-0.2.2.tar.gz",
            "has_sig": false,
            "md5_digest": "ec666689fd1c1010c7a1f46e33ef355f",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7",
            "size": 54496742,
            "upload_time": "2023-07-19T04:56:56",
            "upload_time_iso_8601": "2023-07-19T04:56:56.215332Z",
            "url": "https://files.pythonhosted.org/packages/57/1a/927e17e263d2abfdaeb207fba51b2612bbe879e7e4d9b606b800e8de870f/genevector-0.2.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-07-19 04:56:56",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "nceglia",
    "github_project": "genevector",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [
        {
            "name": "scipy",
            "specs": [
                [
                    "==",
                    "1.9.0"
                ]
            ]
        },
        {
            "name": "leidenalg",
            "specs": [
                [
                    "==",
                    "0.8.10"
                ]
            ]
        },
        {
            "name": "numpy",
            "specs": [
                [
                    "==",
                    "1.22.4"
                ]
            ]
        },
        {
            "name": "sklearn",
            "specs": [
                [
                    "==",
                    "0.0"
                ]
            ]
        },
        {
            "name": "pandas",
            "specs": [
                [
                    "==",
                    "1.5.1"
                ]
            ]
        },
        {
            "name": "scanpy",
            "specs": [
                [
                    "==",
                    "1.9.1"
                ]
            ]
        },
        {
            "name": "umap-learn",
            "specs": [
                [
                    "==",
                    "0.5.3"
                ]
            ]
        },
        {
            "name": "tqdm",
            "specs": [
                [
                    "==",
                    "4.64.1"
                ]
            ]
        },
        {
            "name": "seaborn",
            "specs": [
                [
                    "==",
                    "0.12.1"
                ]
            ]
        },
        {
            "name": "matplotlib",
            "specs": [
                [
                    "==",
                    "3.6.2"
                ]
            ]
        },
        {
            "name": "scikit-misc",
            "specs": [
                [
                    "==",
                    "0.1.4"
                ]
            ]
        }
    ],
    "lcname": "genevector"
}
        