GSVA


* Name: GSVA
* Version: 1.0.6
* Home page: https://github.com/jason-weirather/GSVA
* Summary: Python CLI and module for running the GSVA R bioconductor package with Python Pandas inputs and outputs.
* Upload time: 2018-07-08 15:22:48
* Author: Jason L Weirather
* License: Apache License, Version 2.0
* Keywords: bioinformatics, r, enrichment, gsva, ssgsea, gsea, bioconductor

# GSVA / ssGSEA command-line interface and Python module

The GSVA (gene set variation analysis) package from R Bioconductor provides efficient computation of single-sample gene-set enrichment analysis (ssGSEA). This package provides a CLI implemented in Python, a Python module with Pandas inputs and outputs, and a docker image to run the R package.

* Repository is here: https://github.com/jason-weirather/GSVA
* Autodoc manual is here:  https://jason-weirather.github.io/GSVA/

##### Disclaimer

I am not the creator or author of GSVA.  This is a CLI and Python hook created to make their package easy to use from the command line and Python. *This is not the official site for the GSVA bioconductor package*

Find the official R package here:

https://doi.org/doi:10.18129/B9.bioc.GSVA

##### And if you find this useful, please cite the authors' publication

Hänzelmann S, Castelo R and Guinney J (2013). “GSVA: gene set variation analysis for microarray and RNA-Seq data.” BMC Bioinformatics, 14, pp. 7. doi: 10.1186/1471-2105-14-7, http://www.biomedcentral.com/1471-2105/14/7.

## Quickstart - CLI through docker

### Execute GSVA in docker

Make sure docker can see every directory you read from or write to by mounting it as a volume.  Here we will do everything in our current directory.

1. Start with an expression csv with gene-wise rows and sample-wise columns

```
$ cat example_expression.csv | cut -f 1-3 -d ',' | head -n 6 
gene_name,S-1,S-2
MT-CO1,13.852,12.328
MT-CO2,13.405999999999999,12.383
MT-CO3,13.234000000000002,12.109000000000002
MT-ATP8,13.805,11.789000000000001
MT-ATP6,13.5,11.703
```

2. Use any gene sets in **gmt** format where each row follows the convention `name <tab> description <tab> gene1 <tab> gene2 <tab> ... <tab> geneN`

```
$ cat c2.cp.v6.0.symbols.gmt | head -n 6 | cut -f 1-5
KEGG_GLYCOLYSIS_GLUCONEOGENESIS	http://www.broadinstitute.org/gsea/msigdb/cards/KEGG_GLYCOLYSIS_GLUCONEOGENESIS	ACSS2	GCK	PGK2
KEGG_CITRATE_CYCLE_TCA_CYCLE	http://www.broadinstitute.org/gsea/msigdb/cards/KEGG_CITRATE_CYCLE_TCA_CYCLE	IDH3B	DLST	PCK2
KEGG_PENTOSE_PHOSPHATE_PATHWAY	http://www.broadinstitute.org/gsea/msigdb/cards/KEGG_PENTOSE_PHOSPHATE_PATHWAY	RPE	RPIA	PGM2
KEGG_PENTOSE_AND_GLUCURONATE_INTERCONVERSIONS	http://www.broadinstitute.org/gsea/msigdb/cards/KEGG_PENTOSE_AND_GLUCURONATE_INTERCONVERSIONS	UGT1A10	UGT1A8	RPE
KEGG_FRUCTOSE_AND_MANNOSE_METABOLISM	http://www.broadinstitute.org/gsea/msigdb/cards/KEGG_FRUCTOSE_AND_MANNOSE_METABOLISM	MPI	PMM2	PMM1
KEGG_GALACTOSE_METABOLISM	http://www.broadinstitute.org/gsea/msigdb/cards/KEGG_GALACTOSE_METABOLISM	GCK	GALK1	GLB1
```

3. Run GSVA

```
$ docker run -v $(pwd):$(pwd) vacation/gsva:1.0.4 \
    GSVA --gmt $(pwd)/c2.cp.v6.0.symbols.gmt \
         $(pwd)/example_expression.csv \
         --output $(pwd)/example_pathways.csv
```
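The same invocation can also be assembled from Python, which helps when the file names change from run to run. This is only a sketch: `build_docker_command` is a hypothetical helper (not part of this package), `/data/run1` is a made-up directory, and the only flags used are the ones shown in the command above.

```python
import shlex
from pathlib import PurePosixPath


def build_docker_command(workdir, gmt_name, expression_name, output_name,
                         image="vacation/gsva:1.0.4"):
    """Return the docker argument list for the Quickstart invocation."""
    workdir = PurePosixPath(workdir)
    return [
        "docker", "run",
        # Mount the working directory at the same path inside the container
        "-v", f"{workdir}:{workdir}",
        image,
        "GSVA",
        "--gmt", str(workdir / gmt_name),
        str(workdir / expression_name),
        "--output", str(workdir / output_name),
    ]


command = build_docker_command(
    "/data/run1",  # hypothetical directory holding both input files
    "c2.cp.v6.0.symbols.gmt",
    "example_expression.csv",
    "example_pathways.csv",
)
# Print the command for inspection instead of running it
print(shlex.join(command))
```

Passing the list to `subprocess.run(command, check=True)` would execute it. Keeping the arguments as a list avoids shell quoting problems with unusual file names.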

##### You're done.  That's it.  Enjoy, check your output.

Running GSVA directly on your system, outside of docker, is just as easy (actually easier), but you need to have the required programs installed (see Installation below).

```
$ cat example_pathways.csv | cut -f 1-3 -d ',' | head -n 6
name,S-1,S-2
BIOCARTA_41BB_PATHWAY,0.0686308398590944,0.257169127694153
BIOCARTA_ACE2_PATHWAY,0.11082238459933501,-0.22231034473486602
BIOCARTA_ACH_PATHWAY,0.514192767265737,0.149291024991685
BIOCARTA_ACTINY_PATHWAY,-0.0144944990305252,0.407870971955071
BIOCARTA_AGPCR_PATHWAY,0.6224821629523449,-0.0128449355033173
```
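As a quick sanity check on the output, the highest-scoring pathway per sample can be picked out with the standard library alone. The sketch below embeds the five rows shown above rather than reading the file; `top_pathway_per_sample` is a hypothetical helper name, not something this package provides.

```python
import csv
import io

# The rows of example_pathways.csv shown above
sample_output = """\
name,S-1,S-2
BIOCARTA_41BB_PATHWAY,0.0686308398590944,0.257169127694153
BIOCARTA_ACE2_PATHWAY,0.11082238459933501,-0.22231034473486602
BIOCARTA_ACH_PATHWAY,0.514192767265737,0.149291024991685
BIOCARTA_ACTINY_PATHWAY,-0.0144944990305252,0.407870971955071
BIOCARTA_AGPCR_PATHWAY,0.6224821629523449,-0.0128449355033173
"""


def top_pathway_per_sample(handle):
    """Map each sample column to the (pathway, score) pair with the highest score."""
    reader = csv.DictReader(handle)
    samples = [column for column in reader.fieldnames if column != "name"]
    best = {}
    for row in reader:
        for sample in samples:
            score = float(row[sample])
            if sample not in best or score > best[sample][1]:
                best[sample] = (row["name"], score)
    return best


best = top_pathway_per_sample(io.StringIO(sample_output))
for sample, (pathway, score) in best.items():
    print(f"{sample}: {pathway} ({score:.3f})")
```

On these five rows, S-1 is led by `BIOCARTA_AGPCR_PATHWAY` and S-2 by `BIOCARTA_ACTINY_PATHWAY`. To run it on a real result, pass `open("example_pathways.csv", newline="")` instead of the `StringIO` object.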


### List the advanced options

```
$ docker run vacation/gsva:1.0.4 GSVA -h
usage: GSVA [-h] [--tsv_in] --gmt GMT [--tsv_out] [--output OUTPUT]
            [--meta_output META_OUTPUT] [--method {gsva,ssgsea,zscore,plage}]
            [--kcdf {Gaussian,Poisson,none}] [--abs_ranking] [--min_sz MIN_SZ]
            [--max_sz MAX_SZ] [--parallel_sz PARALLEL_SZ]
            [--parallel_type PARALLEL_TYPE] [--mx_diff MX_DIFF] [--tau TAU]
            [--ssgsea_norm SSGSEA_NORM] [--verbose]
            [--tempdir TEMPDIR | --specific_tempdir SPECIFIC_TEMPDIR]
            input

Execute R bioconductors GSVA

optional arguments:
  -h, --help            show this help message and exit

Input options:
  input                 Use - for STDIN
  --tsv_in              Exepct CSV by default, this overrides to tab (default:
                        False)
  --gmt GMT             GMT file with pathways (default: None)

Output options:
  --tsv_out             Override the default CSV and output TSV (default:
                        False)
  --output OUTPUT, -o OUTPUT
                        Specifiy path to write transformed data (default:
                        None)
  --meta_output META_OUTPUT
                        Speciify path to output additional run information
                        (default: None)

command options:
  --method {gsva,ssgsea,zscore,plage}
                        Method to employ in the estimation of gene-set
                        enrichment scores per sample. By default this is set
                        to gsva (Hanzelmann et al, 2013) and other options 6
                        gsva are ssgsea (Barbie et al, 2009), zscore (Lee et
                        al, 2008) or plage (Tomfohr et al, 2005). The latter
                        two standardize first expression profiles into
                        z-scores over the samples and, in the case of zscore,
                        it combines them together as their sum divided by the
                        square-root of the size of the gene set, while in the
                        case of plage they are used to calculate the singular
                        value decomposition (SVD) over the genes in the gene
                        set and use the coefficients of the first right-
                        singular vector as pathway activity profile. (default:
                        gsva)
  --kcdf {Gaussian,Poisson,none}
                        Character string denoting the kernel to use during the
                        non-parametric estimation of the cumulative
                        distribution function of expression levels across
                        samples when method="gsva". By default,
                        kcdf="Gaussian" which is suitable when input
                        expression values are continuous, such as microarray
                        fluorescent units in logarithmic scale, RNA-seq log-
                        CPMs, log-RPKMs or log-TPMs. When input expression
                        values are integer counts, such as those derived from
                        RNA-seq experiments, then this argument should be set
                        to kcdf="Poisson". This argument supersedes arguments
                        rnaseq and kernel, which are deprecated and will be
                        removed in the next release. (default: Gaussian)
  --abs_ranking         Flag used only when mx_diff=TRUE. When
                        abs_ranking=FALSE (default) a modified Kuiper
                        statistic is used to calculate enrichment scores,
                        taking the magnitude difference between the largest
                        positive and negative random walk deviations. When
                        abs.ranking=TRUE the original Kuiper statistic that
                        sums the largest positive and negative random walk
                        deviations, is used. In this latter case, gene sets
                        with genes enriched on either extreme (high or low)
                        will be regarded as'highly' activated. (default:
                        False)
  --min_sz MIN_SZ       Minimum size of the resulting gene sets. (default: 1)
  --max_sz MAX_SZ       Maximum size of the resulting gene sets. (default:
                        None)
  --parallel_sz PARALLEL_SZ
                        Number of processors to use when doing the
                        calculations in parallel. This requires to previously
                        load either the parallel or the snow library. If
                        parallel is loaded and this argument is left with its
                        default value (parallel_sz=0) then it will use all
                        available core processors unless we set this argument
                        with a smaller number. If snow is loaded then we must
                        set this argument to a positive integer number that
                        specifies the number of processors to employ in the
                        parallel calculation. (default: 0)
  --parallel_type PARALLEL_TYPE
                        Type of cluster architecture when using snow.
                        (default: SOCK)
  --mx_diff MX_DIFF     Offers two approaches to calculate the enrichment
                        statistic (ES) from the KS random walk statistic.
                        mx_diff=FALSE: ES is calculated as the maximum
                        distance of the random walk from 0. mx_diff=TRUE
                        (default): ES is calculated as the magnitude
                        difference between the largest positive and negative
                        random walk deviations. (default: True)
  --tau TAU             Exponent defining the weight of the tail in the random
                        walk performed by both the gsva (Hanzelmann et al.,
                        2013) and the ssgsea (Barbie et al., 2009) methods. By
                        default, this tau=1 when method="gsva" and tau=0.25
                        when method="ssgsea" just as specified by Barbie et
                        al. (2009) where this parameter is called alpha.
                        (default: None)
  --ssgsea_norm SSGSEA_NORM
                        Logical, set to TRUE (default) with method="ssgsea"
                        runs the SSGSEA method from Barbie et al. (2009)
                        normalizing the scores by the absolute difference
                        between the minimum and the maximum, as described in
                        their paper. When ssgsea_norm=FALSE this last
                        normalization step is skipped. (default: True)
  --verbose             Gives information about each calculation step.
                        (default: False)

Temporary folder parameters:
  --tempdir TEMPDIR     The temporary directory is made and destroyed here.
                        (default: /tmp)
  --specific_tempdir SPECIFIC_TEMPDIR
                        This temporary directory will be used, but will remain
                        after executing. (default: None)
```
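The help text for `--method` describes `zscore` as standardizing each gene over the samples and then combining the genes of a set as their sum divided by the square root of the set size. The sketch below works through that formula on toy numbers. It illustrates the description only and is not the code GSVA runs; the function name, the use of the sample standard deviation, and the skipping of constant genes are assumptions made for the example.

```python
import math
import statistics


def zscore_enrichment(expression, gene_set):
    """Combine per-gene z-scores as sum / sqrt(set size), one score per sample.

    expression: dict mapping gene name -> list of values, one per sample.
    gene_set: iterable of gene names.
    """
    # Assumption: skip genes that are absent or constant across samples
    genes = [
        gene for gene in gene_set
        if gene in expression and statistics.pstdev(expression[gene]) > 0
    ]
    if not genes:
        raise ValueError("no usable gene of the set is present in the expression data")
    n_samples = len(expression[genes[0]])
    totals = [0.0] * n_samples
    for gene in genes:
        values = expression[gene]
        mean = statistics.mean(values)
        sd = statistics.stdev(values)  # sample standard deviation (assumption)
        for i, value in enumerate(values):
            totals[i] += (value - mean) / sd
    return [total / math.sqrt(len(genes)) for total in totals]


# Toy data: three samples, made-up gene names
toy_expression = {
    "GENE_A": [1.0, 2.0, 3.0],
    "GENE_B": [2.0, 4.0, 6.0],
    "GENE_C": [5.0, 1.0, 3.0],  # not a member of the set below
}
scores = zscore_enrichment(toy_expression, ["GENE_A", "GENE_B"])
print(scores)
```

Both toy genes standardize to -1, 0 and 1 across the three samples, so the set scores are -2/√2, 0 and 2/√2.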

## Installation

#### Method 1: Install on your system

1. Install R https://www.r-project.org/ 
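2. Install the R Bioconductor packages GSEABase and GSVA. The command below uses `biocLite`, which Bioconductor 3.8 and later replaced with `BiocManager::install`.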

```
$ Rscript -e 'source("http://bioconductor.org/biocLite.R");\
              library(BiocInstaller);\
              biocLite(pkgs=c("GSEABase","GSVA"),dep=TRUE)'
```

3. Install this package: `$ pip install GSVA`
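Before calling the module from Python, it can be worth confirming that R is reachable. The sketch below assumes R is invoked through the `Rscript` executable, as in the installation command above; `find_rscript` is a hypothetical helper, and the check says nothing about whether GSEABase and GSVA are installed inside R.

```python
import shutil


def find_rscript(executable="Rscript"):
    """Return the full path of the R script runner, or None if it is not on PATH."""
    return shutil.which(executable)


rscript_path = find_rscript()
if rscript_path is None:
    print("Rscript not found on PATH; install R first (step 1)")
else:
    print(f"Rscript found at {rscript_path}")
```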

#### Method 2: Run GSVA via docker

`$ docker pull vacation/gsva:latest`

## Use GSVA Python CLI in your Python code

First install GSVA Python CLI on your system as described above. For details on the `gsva(expression_df,genesets_df,...)` function parameters see https://jason-weirather.github.io/GSVA/ 

### Workflow example - Go from an expression-based tSNE plot to a pathway-based tSNE plot in a Jupyter notebook

Here we will convert a per-sample per-gene expression matrix to a per-sample per-pathway enrichment matrix. We will plot the values using tSNE.

These code snippets and outputs are from a Jupyter notebook.


```python
import pandas as pd
from GSVA import gsva, gmt_to_dataframe
# Some extras to look at the high dimensional data
from plotnine import ggplot, aes, geom_point
from sklearn.manifold import TSNE
```

Read in a Broad reference pathway gmt file.  Notice the "member" and "name" fields.  If you make your own dataframe to use, these are the required column names.

```python
genesets_df = gmt_to_dataframe('c2.cp.v6.0.symbols.gmt')
genesets_df.head()
```

|   | description                                       | member | name                            |
|---|---------------------------------------------------|--------|---------------------------------|
| 0 | http://www.broadinstitute.org/gsea/msigdb/card... | ACSS2  | KEGG_GLYCOLYSIS_GLUCONEOGENESIS |
| 1 | http://www.broadinstitute.org/gsea/msigdb/card... | GCK    | KEGG_GLYCOLYSIS_GLUCONEOGENESIS |
| 2 | http://www.broadinstitute.org/gsea/msigdb/card... | PGK2   | KEGG_GLYCOLYSIS_GLUCONEOGENESIS |
| 3 | http://www.broadinstitute.org/gsea/msigdb/card... | PGK1   | KEGG_GLYCOLYSIS_GLUCONEOGENESIS |
| 4 | http://www.broadinstitute.org/gsea/msigdb/card... | PDHB   | KEGG_GLYCOLYSIS_GLUCONEOGENESIS |
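If you build the gene set table yourself, each GMT row has to be unpacked into one record per gene. The sketch below shows that reshaping with the standard library; `gmt_line_to_records` is a hypothetical helper and not the package's `gmt_to_dataframe`, and the example row is truncated to the three genes shown in the Quickstart.

```python
def gmt_line_to_records(line):
    """Split one GMT row into one record per gene, in long format."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("a GMT row needs a name, a description and at least one gene")
    name, description, members = fields[0], fields[1], fields[2:]
    return [
        {"description": description, "member": gene, "name": name}
        for gene in members
        if gene  # ignore empty trailing fields
    ]


example_line = (
    "KEGG_GALACTOSE_METABOLISM\t"
    "http://www.broadinstitute.org/gsea/msigdb/cards/KEGG_GALACTOSE_METABOLISM\t"
    "GCK\tGALK1\tGLB1\n"
)
records = gmt_line_to_records(example_line)
for record in records:
    print(record["name"], record["member"])
```

A list of dicts like `records` can be handed to `pandas.DataFrame(records)` to obtain a table with the "name" and "member" columns.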

This example has 200 samples.

```python
expression_df = pd.read_csv('example_expression.csv',index_col=0)
expression_df.iloc[0:5,0:5]
```

| gene_name | S-1    | S-2    | S-3    | S-4    | S-5    |
|-----------|--------|--------|--------|--------|--------|
| MT-CO1    | 13.852 | 12.328 | 13.055 | 11.898 | 10.234 |
| MT-CO2    | 13.406 | 12.383 | 13.281 | 11.578 | 11.156 |
| MT-CO3    | 13.234 | 12.109 | 13.352 | 11.531 | 10.422 |
| MT-ATP8   | 13.805 | 11.789 | 13.414 | 11.883 | 11.141 |
| MT-ATP6   | 13.500 | 11.703 | 13.227 | 11.219 | 10.836 |

```python
XV = TSNE(n_components=2).\
    fit_transform(expression_df.T)
df = pd.DataFrame(XV).rename(columns={0:'x',1:'y'})
(ggplot(df,aes(x='x',y='y'))
 + geom_point(alpha=0.2)
)
```

![Gene Expression](https://i.imgur.com/Qbwds5H.png)

The default command runs without verbose message output. Note that genes that are not part of `expression_df` are dropped from the analysis and, depending on your choice of GSVA method, genes without enough expression (i.e. all-zero expression) are dropped as well.

```python
pathways_df = gsva(expression_df,genesets_df)
pathways_df.iloc[0:5,0:5]
```

| name                    | S-1       | S-2       | S-3       | S-4      | S-5       |
|-------------------------|-----------|-----------|-----------|----------|-----------|
| BIOCARTA_41BB_PATHWAY   | 0.068631  | 0.257169  | -0.146907 | 0.020151 | -0.234537 |
| BIOCARTA_ACE2_PATHWAY   | 0.110822  | -0.222310 | -0.161572 | 0.370659 | -0.003318 |
| BIOCARTA_ACH_PATHWAY    | 0.514193  | 0.149291  | 0.226279  | 0.289960 | 0.016071  |
| BIOCARTA_ACTINY_PATHWAY | -0.014494 | 0.407871  | -0.062163 | 0.055607 | 0.424726  |
| BIOCARTA_AGPCR_PATHWAY  | 0.622482  | -0.012845 | 0.317349  | 0.286368 | 0.022540  |
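Because gene set members missing from the expression matrix are dropped, it can help to check coverage before interpreting a pathway score. The following sketch does this with plain sets on toy inputs; `split_by_coverage` is a hypothetical helper, and the gene list is an arbitrary mix of symbols from the examples above rather than real data.

```python
def split_by_coverage(members_by_set, expressed_genes):
    """For each gene set, separate members found in the expression data from the rest."""
    expressed = set(expressed_genes)
    report = {}
    for name, members in members_by_set.items():
        present = sorted(set(members) & expressed)
        missing = sorted(set(members) - expressed)
        report[name] = {"present": present, "missing": missing}
    return report


# Toy inputs: symbols taken from the examples above, combined arbitrarily
expressed_genes = ["MT-CO1", "MT-CO2", "GCK", "PGK1"]
members_by_set = {
    "KEGG_GLYCOLYSIS_GLUCONEOGENESIS": ["ACSS2", "GCK", "PGK2", "PGK1", "PDHB"],
}
report = split_by_coverage(members_by_set, expressed_genes)
print(report["KEGG_GLYCOLYSIS_GLUCONEOGENESIS"]["missing"])
```

With the DataFrames from this workflow, `expression_df.index` would supply the expressed genes, and grouping `genesets_df` by "name" would supply the members of each set.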

```python
YV = TSNE(n_components=2).\
    fit_transform(pathways_df.T)
pf = pd.DataFrame(YV).rename(columns={0:'x',1:'y'})
(ggplot(pf,aes(x='x',y='y'))
 + geom_point(alpha=0.2)
)
```

![Pathway Enrichment](https://i.imgur.com/2pxjoRr.png)



            
