phip-stat

Name: phip-stat
Version: 0.3.0
Home page: https://github.com/lasersonlab/phip-stat
Summary: PhIP-seq analysis tools
Upload time: 2017-02-17 18:32:34
Author: Laserson Lab
License: Apache License, Version 2.0
# `phip-stat`: tools for analyzing PhIP-seq data

The PhIP-seq assay was first described in
[Larman et al.](https://dx.doi.org/10.1038/nbt.1856). This repo contains code
for processing raw PhIP-seq data into analysis-ready enrichment scores.

The overall flow of the pipeline is to split `.fastq` data into pieces and
align it against the associated phage library. The alignments are converted to
counts, which are then converted to scores based on a Generalized Poisson
regression model.
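
For orientation, here is the whole pipeline condensed into one shell session.
Every command and path placeholder is explained step by step below.

```bash
# condensed view of the pipeline; all paths are placeholders (see below)
phip split-fastq -n 2000000 -i input.fastq -o workdir/parts
phip align-parts -i workdir/parts -o workdir/alns -x path/to/index -b "bsub -q short"
phip groupby-sample -i workdir/alns -o workdir/alns_by_sample -m mapping.tsv
phip compute-counts -i workdir/alns_by_sample -o workdir/counts -r path/to/reference/counts.tsv
phip compute-pvals -i workdir/counts -o workdir/pvals -b "bsub -q short"
phip merge-columns -i workdir/pvals -o pvals.tsv -p 1
```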

The tools are implemented in Python and can dispatch jobs to an HPC job
scheduler such as LSF or Grid Engine.

Please submit [issues](https://github.com/lasersonlab/phip-stat/issues) to
report any problems.  This code implements the statistical model described in
the original Larman et al. paper; multiple alternative models are currently in
development.


## Installation

`phip-stat` was developed with Python 3 but should also be compatible with
Python 2. Please submit an issue if you run into any problems.

```bash
pip install phip-stat
```

or to install the latest development version from GitHub

```bash
pip install git+https://github.com/lasersonlab/phip-stat.git
```

or download a release
[directly from GitHub](https://github.com/lasersonlab/phip-stat/releases).

Dependencies for `phip` (which should get installed by `pip`):

*   `click`
*   `biopython`
*   `numpy`
*   `scipy`

Additionally, the pipeline makes use of `bowtie` for short read alignment, and
expects it to be available on the `PATH`.
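
A quick sanity check that the prerequisites are in place (assuming a POSIX
shell; `Bio` is biopython's import name):

```bash
# both executables must be reachable on PATH
command -v phip bowtie
# the Python dependencies should import cleanly
python -c "import click, Bio, numpy, scipy"
```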

When running on a cluster, we assume that you can access a common filesystem
from all nodes (e.g. NFS), which is common for academic HPC computing
environments.  It is also important that the `phip-stat` package is installed
into the Python distribution that will be invoked across the cluster.  We
recommend `conda` for easily installing Python into a local user directory (see
appendix).


## Running the pipeline

All the pipeline tools are accessed through the executable `phip`.  (Your
`PATH` may need to be modified to include the `bin/` directory of your Python
installation.)  A list of commands can be obtained by passing `-h`.

```
$ phip -h
Usage: phip [OPTIONS] COMMAND [ARGS]...

  phip -- PhIP-seq analysis tools

Options:
  -h, --help  Show this message and exit.

Commands:
  align-parts     align fastq files to peptide reference
  compute-counts  compute counts from aligned bam file
  compute-pvals   compute p-values from counts
  groupby-sample  group alignments by sample
  join-barcodes   annotate Illumina reads with barcodes Some...
  merge-columns   merge tab-delim files
  split-fastq     split fastq files into smaller chunks
```

Options/usage for a specific command can also be obtained with `-h`, for
example:

```
$ phip split-fastq -h
Usage: phip split-fastq [OPTIONS]

  split fastq files into smaller chunks

Options:
  -i, --input TEXT          input path (fastq file)
  -o, --output TEXT         output path (directory)
  -n, --chunk-size INTEGER  number of reads per chunk
  -h, --help                Show this message and exit.
```


### Pipeline inputs

The PhIP-seq pipeline as implemented requires three inputs (a quick existence
check is sketched after this list):

1.  Read data from a PhIP-seq experiment (the raw data)

2.  Reference file for PhIP-seq library or bowtie index for alignment

3.  Input counts from sequencing the PhIP-seq library without any
    immunoprecipitation
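
Before running anything, it can be worth confirming that all three inputs
exist.  The paths below are the same placeholders used in the commands that
follow:

```bash
# 1. raw reads; 2. bowtie index files; 3. library input counts
ls path/to/input.fastq path/to/index.*.ebwt path/to/reference/counts.tsv
```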


### Generate per-sample alignment files

If you start with a single `.fastq` file that contains all the reads together,
we'll first split them into smaller chunks for alignment.

```bash
phip split-fastq -n 2000000 -i path/to/input.fastq -o path/to/workdir/parts
```

We then align each read to the reference PhIP-seq library using `bowtie`
(making sure to set the right queue):

```bash
# path/to/index is the bowtie index basename, i.e., path/to/index.1.ebwt etc. must exist
phip align-parts \
    -i workdir/parts -o workdir/alns \
    -x path/to/index \
    -b "bsub -q short"
```

Note: `align-parts` works by constructing a `bowtie` command for each input
part and executing it prefixed with the command given to the `-b` option.
Each invocation blocks until completion, which is nearly instantaneous when
submitting to a batch scheduler such as LSF (as shown), since `bsub` returns
as soon as the job is queued.  If `-b` is omitted or given a whitespace
string, the commands are executed serially (see below).
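
For example, to run the same step serially on a single workstation, simply
omit `-b`:

```bash
# without -b, the generated bowtie commands run one after another locally
phip align-parts -i workdir/parts -o workdir/alns -x path/to/index
```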

Next, we reorganize the resulting alignments by sample, assuming that the
sample barcode is contained in the "query id" of each read.

```bash
phip groupby-sample -i workdir/alns -o workdir/alns_by_sample -m mapping.tsv
```

The `mapping.tsv` file is tab-delimited: the first column contains barcode
sequences and the second the corresponding sample identifiers (which should
also work nicely as file names).
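
A minimal `mapping.tsv` might look like this (the barcodes and sample names
are made up):

```bash
# barcode <TAB> sample identifier, one pair per line
printf 'ACGTACGT\tsample1\nTGCATGCA\tsample2\n' > mapping.tsv
```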

If you instead started this pipeline with pre-demultiplexed `.fastq` files,
you can begin with `align-parts` immediately and skip the `groupby-sample`
step.  We assume the sample identifier is the base part of the filename
(e.g., `sample1` for `sample1.fastq`).
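
In that case the front of the pipeline collapses to a single step (directory
names here are placeholders):

```bash
# demultiplexed/ holds sample1.fastq, sample2.fastq, ...; align directly and
# skip groupby-sample (sample names are taken from the file names)
phip align-parts -i demultiplexed -o workdir/alns_by_sample \
    -x path/to/index -b "bsub -q short"
```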


### Generating p-values for enrichments

Once we have per-sample alignment files, we will generate count vectors for
each sample followed by enrichment p-values (using the "input" vector for the
particular phage library).

```bash
phip compute-counts \
    -i workdir/alns_by_sample -o workdir/counts -r path/to/reference/counts.tsv
```

Count generation runs locally in a single thread; the p-value computation is
more CPU-intensive, so it can be parallelized with a job scheduler:

```bash
phip compute-pvals -i workdir/counts -o workdir/pvals -b "bsub -q short"
```

This schedules one p-value computation per count file, each invoked with the
same command minus `-b`.  To compute p-values manually on a single count
file, execute

```bash
phip compute-pvals -i workdir/counts/sample1.counts.tsv -o sample1.pvals.tsv
```

Finally, the p-values from all samples can be merged into a single
tab-delimited file:

```bash
phip merge-columns -i workdir/pvals -o pvals.tsv -p 1
```

The `-p 1` tells the command to take the second column from each file
(columns are 0-indexed).
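
For illustration, assuming each per-sample file is a two-column `id<TAB>pval`
table (an assumption about the output format), the selected field can be
previewed with `cut`, whose `-f` option is 1-indexed:

```bash
# preview 0-indexed column 1 (cut field 2) of one hypothetical input file
cut -f 2 workdir/pvals/sample1.pvals.tsv | head
```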


## Appendix

### Using `conda` for easily managing packages

Conda is a package manager that works on multiple operating systems and is
closely connected to the Anaconda and Miniconda distributions of Python.
Conda makes it easy to install a full Python distribution into a local
directory, along with all the heavyweight packages.  It also manages many
non-Python packages, such as bowtie.

```bash
curl https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh > miniconda3.sh
bash miniconda3.sh -b -p $HOME/miniconda3
# add $HOME/miniconda3/bin to your PATH in $HOME/.bash_profile
conda install -y numpy scipy biopython click
conda install -y -c bioconda bowtie  # bowtie is distributed via the bioconda channel
```

This will install Python 3 and bowtie into your home directory, completely
isolated from the system Python installation.


### Joining reads to barcodes/indexes on older Illumina data

If you use the BaseSpace "generate FASTQ" pipeline or the MiSeq local pipeline
to create `.fastq` files for indexed runs, then reads that do not match an
index (i.e., all of the reads, if you plan to demultiplex manually) go into an
"Undetermined" file that does NOT include the index sequence for each read.
Instead, the indexes are available in a separate `.fastq` file.  To use such
data sets, the reads `.fastq` file must be rewritten to include the index
sequence in the read header.  This can be accomplished with the
`phip join-barcodes` command.
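
The inputs are the reads `.fastq` and the matching index `.fastq` produced
alongside it; the current option names are best taken from the built-in help:

```bash
phip join-barcodes -h
```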



            
