somaticseq


-   Name: somaticseq
-   Version: 3.10.0
-   Home page: https://github.com/bioinform/somaticseq
-   Summary: SomaticSeq: An ensemble approach to accurately detect somatic mutations using SomaticSeq
-   Upload time: 2025-01-01 23:58:14
-   Author: Li Tai Fang, Pegah Tootoonchi Afshar, Aparna Chhibber, Marghoob Mohiyuddin, John C. Mu, Greg Gibeling, Sharon Barr, Narges Bani Asadi, Hugo Y.K. Lam
-   Requires Python: >=3.11.0
-   License: BSD-2-Clause
-   Keywords: somatic mutations, bioinformatics, genomics, ngs
# SomaticSeq

SomaticSeq is an ensemble somatic SNV/indel caller that has the ability to use
machine learning to filter out false positives from other callers. It also comes
with a suite of [genomic utilities](somaticseq/utilities/README.md). The
detailed documentation is located in
[docs/Manual.pdf](docs/Manual.pdf "User Manual").

-   It was published in
    [Fang, L.T., Afshar, P.T., Chhibber, A. _et al_. An ensemble approach to accurately detect somatic mutations using SomaticSeq. _Genome Biol_ **16**, 197 (2015)](http://dx.doi.org/10.1186/s13059-015-0758-2 "Fang LT, et al. Genome Biol (2015)").
-   Feel free to report issues and/or ask questions at the
    [Issues](../../issues "Issues") page.

## Training data for benchmarking and/or model building

In 2021, the
[FDA-led MAQC-IV/SEQC2 Consortium](https://www.fda.gov/science-research/bioinformatics-tools/microarraysequencing-quality-control-maqcseqc#MAQC_IV)
produced multi-center, multi-platform whole-genome and whole-exome
[sequencing data sets](https://identifiers.org/ncbi/insdc.sra:SRP162370) for a
pair of tumor-normal reference samples (HCC1395 and HCC1395BL), along with the
high-confidence
[somatic mutation call set](https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/seqc/Somatic_Mutation_WG/release/latest/).
This work was published in
[Fang, L.T., Zhu, B., Zhao, Y. _et al_. Establishing community reference samples, data and call sets for benchmarking cancer mutation detection using whole-genome sequencing. _Nat Biotechnol_ **39**, 1151-1160 (2021)](https://doi.org/10.1038/s41587-021-00993-6 "Fang LT, et al. Nat Biotechnol (2021)")
/
[PMID:34504347](http://identifiers.org/pubmed:34504347 "Fang LT, et al. Nat Biotechnol (2021)")
/
[Free Read-Only Link](https://bit.ly/2021nbt "Fang LT, et al. Nat Biotechnol (2021)").
The following are some of the use cases for these resources:

-   Use the high-confidence call set as the "ground truth" to investigate how
    different sample preparations, sequencing library kits, and bioinformatic
    algorithms affect the accuracy of somatic mutation pipelines, and to
    develop best practices, e.g.,
    [Xiao W. _et al_. Nat Biotechnol 2021](https://doi.org/10.1038/s41587-021-00994-5).
-   Use the high-confidence call set as the "ground truth" to build accurate and
    robust machine learning models for somatic mutation detection, e.g.,
    [NeuSomatic by Sahraeian S.M.E. _et al_. Genome Biol 2022](https://doi.org/10.1186/s13059-021-02592-9),
    [DeepSomatic by Park J. _et al_. 2024](https://doi.org/10.1101/2024.08.16.608331).
-   Use the BAM files and the high-confidence call set to benchmark a workflow,
    e.g.,
    [Benchmarking NVIDIA Clara Parabricks Somatic Variant Calling Pipeline on AWS](https://aws.amazon.com/blogs/hpc/benchmarking-nvidia-clara-parabricks-somatic-variant-calling-pipeline-on-aws/),
    [NVIDIA Docs Hub](https://docs.nvidia.com/clara/parabricks/how-tos/somaticcalling.html),
    [nf-core/sarek](https://doi.org/10.1093/nargab/lqae031), etc.

#### Click for [more details of the SEQC2's somatic mutation project](docs/seqc2.md).

#### [Recommendation](docs/train_for_classifiers.md) of how to use SEQC2 data to create SomaticSeq classifiers.

<hr>
<table style="width: 100%;">

  <tr>
    <td>Briefly explaining SomaticSeq v1.0</td>
    <td>SEQC2 somatic mutation reference data and call sets</td>
    <td>How to run <a href="https://precision.fda.gov/home/apps/app-G7XVKQQ02v051q5PK3yQYJKJ-1">SomaticSeq v3.6.3</a> on precisionFDA</td>

  </tr>

  <tr>
    <td><a href="https://youtu.be/MnJdTQWWN6w"><img src="docs/SomaticSeqYoutube.png" width="400" /></a></td>
    <td><a href="https://youtu.be/nn0BOAONRe8"><img src="docs/workflow400.png" width="400" /></a></td>
    <td><a href="https://youtu.be/fLKokuMGTvk"><img src="docs/precisionfda.png" width="400" /></a></td>

  </tr>

  <tr>
    <td></td>
    <td></td>
    <td>Run in <a href="https://youtu.be/F6TSdg0OffM">train or prediction mode</a></td>

  </tr>

</table>
<hr>

# Installation

## Dependencies

This [dockerfile](Dockerfiles/somaticseq.base-1.6.dockerfile) lists the
dependencies:

-   Python 3 (3.11 or later, per the package metadata), plus the pysam, numpy,
    scipy, pandas, and xgboost libraries.
-   [BEDTools](https://bedtools.readthedocs.io/en/latest/): required when
    parallel processing is invoked, and/or when any bed files are used as input
    files.
-   Optional: dbSNP VCF file (if you want to use dbSNP membership as a feature).
-   Optional: R and [ada](https://cran.r-project.org/package=ada) are required
    for AdaBoost, whereas XGBoost (default) is implemented in python.
-   To install SomaticSeq, clone this repo, `cd somaticseq`, and then run
    `pip install .` (To install extra packages for development:
    `pip install '.[dev]'`). A number of commands prefixed with `somaticseq_`
    will be placed into the PATH.

## To install using pip

Make sure to install `bedtools` separately.

```
pip install somaticseq
```

## To install the bioconda version

SomaticSeq is also available on
[![Anaconda-Server Badge](https://anaconda.org/bioconda/somaticseq/badges/version.svg)](https://anaconda.org/bioconda/somaticseq),
which has
[![Anaconda-Server Badge](https://anaconda.org/bioconda/somaticseq/badges/downloads.svg)](https://anaconda.org/bioconda/somaticseq)
so far. To
[![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat)](http://bioconda.github.io/recipes/somaticseq/README.html),
which also automatically installs a number of third-party somatic mutation callers, run:

```
conda install -c bioconda somaticseq
```

## To install from github source with conda

```
conda create --name my_env -c bioconda python bedtools
conda activate my_env
git clone git@github.com:bioinform/somaticseq.git
cd somaticseq
pip install -e .
```

### Test your installation

If installed successfully, you will be able to run `somaticseq --help` in the
terminal. Also make sure `bedtools` is executable. There are some toy data sets
and test scripts in [**example**](tests/example) that should finish in <1 minute
if installed properly.
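
As a quick complement to running `somaticseq --help` by hand, the following is
a minimal sketch that checks whether the two commands named above can be found
on your `PATH`. The helper name `check_executables` is made up for this
example; it only looks the commands up and does not run them.

```python
import shutil


def check_executables(names=("somaticseq", "bedtools")):
    """Map each command name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


# Report what was found; a missing entry means the install or PATH needs attention.
for name, path in check_executables().items():
    print(f"{name}: {path if path else 'NOT FOUND on PATH'}")
```

If either command is reported as not found, activate the environment you
installed into (e.g., the conda environment) and try again.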

## Run SomaticSeq with an example command

-   At minimum, given the results of the individual mutation caller(s),
    SomaticSeq will extract sequencing features for the combined call set.
    Required inputs for command `somaticseq` are:

    -   `--output-directory` and `--genome-reference`, then
    -   Either `paired` or `single` to invoke paired or single sample mode,
        -   if `paired`: `--tumor-bam-file`, and `--normal-bam-file` are both
            required.
        -   if `single`: `--bam-file` is required.

    Everything else is optional (though without a single VCF file from at least
    one caller, SomaticSeq does nothing).

-   The following four files will be created in the output directory:

    -   `Consensus.sSNV.vcf`, `Consensus.sINDEL.vcf`, `Ensemble.sSNV.tsv`, and
        `Ensemble.sINDEL.tsv`.

-   If you're searching for pipelines to run those individual somatic mutation
    callers, feel free to take advantage of our
    [**Dockerized Somatic Mutation Workflow**](somaticseq/utilities/dockered_pipelines)
    as a start.
    -   Important note: multi-argument options (e.g., `--extra-hyperparameters`
        or `--features-excluded`) cannot be placed immediately before `paired`
        or `single`, because those options would try to "grab" `paired` or
        `single` as an additional argument.

```
# Merge caller results and extract SomaticSeq features
somaticseq \
  --output-directory  $OUTPUT_DIR \
  --genome-reference  GRCh38.fa \
  --inclusion-region  genome.bed \
  --exclusion-region  blacklist.bed \
  --threads           24 \
paired \
  --tumor-bam-file    tumor.bam \
  --normal-bam-file   matched_normal.bam \
  --mutect2-vcf       MuTect2/variants.vcf \
  --varscan-snv       VarScan2/variants.snp.vcf \
  --varscan-indel     VarScan2/variants.indel.vcf \
  --jsm-vcf           JointSNVMix2/variants.snp.vcf \
  --somaticsniper-vcf SomaticSniper/variants.snp.vcf \
  --vardict-vcf       VarDict/variants.vcf \
  --muse-vcf          MuSE/variants.snp.vcf \
  --lofreq-snv        LoFreq/variants.snp.vcf \
  --lofreq-indel      LoFreq/variants.indel.vcf \
  --scalpel-vcf       Scalpel/variants.indel.vcf \
  --strelka-snv       Strelka/variants.snv.vcf \
  --strelka-indel     Strelka/variants.indel.vcf \
  --arbitrary-snvs    additional_snv_calls_1.vcf.gz additional_snv_calls_2.vcf.gz ... \
  --arbitrary-indels  additional_indel_calls_1.vcf.gz additional_indel_calls_2.vcf.gz ...
```
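
The command above places `--threads` (a single-value option) as the last option
before `paired`, which sidesteps the multi-argument issue mentioned in the
important note. To illustrate why the ordering matters, here is a toy `argparse`
parser. It is **not** SomaticSeq's actual parser, and it only assumes that the
real command line behaves similarly; the feature names `feat_a` and `feat_b` are
placeholders.

```python
import argparse


def build_toy_parser():
    """A toy parser (NOT SomaticSeq's real one) with one multi-argument option."""
    parser = argparse.ArgumentParser(prog="toy")
    parser.add_argument("--threads", type=int, default=1)
    parser.add_argument("--features-excluded", nargs="*", default=[])
    modes = parser.add_subparsers(dest="mode")
    modes.add_parser("paired")
    modes.add_parser("single")
    return parser


parser = build_toy_parser()

# Safe: a single-value option sits between the multi-argument option and the mode.
safe = parser.parse_args(
    ["--features-excluded", "feat_a", "feat_b", "--threads", "2", "paired"]
)

# Problematic: the multi-argument option comes immediately before the mode keyword,
# so the keyword is consumed as one more value of that option.
greedy = parser.parse_args(
    ["--threads", "2", "--features-excluded", "feat_a", "feat_b", "paired"]
)

print("safe:  ", safe.mode, safe.features_excluded)
print("greedy:", greedy.mode, greedy.features_excluded)
```

In the second call the mode is never selected, because `paired` ends up in the
list of excluded features instead.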

-   All of those input VCF files may be either `.vcf` or `.vcf.gz`. For
    alignment files, SomaticSeq also accepts `.cram`, but some callers may only
    take `.bam`.

-   `--arbitrary-snvs` and `--arbitrary-indels` were added in v3.7.0. They
    allow users to input **any** arbitrary VCF file(s) from caller(s) that we
    did not explicitly incorporate. SNVs and indels have to be in separate
    files.

    -   If your caller puts SNVs and indels in the same output VCF file, you may
        split it using a SomaticSeq utility script, e.g.,
        `somaticseq_split_vcf -infile small_variants.vcf -snv snvs.vcf -indel indels.vcf`.
        As usual, input can be either `.vcf` or `.vcf.gz`, but output will be
        `.vcf`.
    -   For those VCF file(s), any call **not** labeled REJECT or LowQual will
        be considered a bona fide somatic mutation call. REJECT calls will be
        skipped. LowQual calls will be considered, but will not have a value of
        `1` in the `if_Caller` machine learning feature.

-   `--inclusion-region` or `--exclusion-region` will require `bedtools` in your
    path.

-   `--algorithm` defaults to `xgboost` as of v3.6.0, but can also be `ada`
    (AdaBoost in R). XGBoost supports multi-threading and can be orders of
    magnitude faster than AdaBoost, with about the same accuracy, so we changed
    the default from `ada` to `xgboost` in v3.6.0 and that's what we recommend
    now.

-   To split the job into multiple threads, place `--threads X` before the
    `paired` option to indicate X threads. It simply creates multiple BED files
    (each consisting of 1/X of the total base pairs), runs SomaticSeq on each of
    those sub-BED files in parallel, and then merges the results. This
    requires `bedtools` in your path.
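
The REJECT/LowQual rule for `--arbitrary-snvs` and `--arbitrary-indels` inputs
described above can be sketched as a small function. This is an illustration of
the rule as worded in this README, not SomaticSeq's source code. The function
name `interpret_filter` is hypothetical, and splitting on `;` is an assumption
based on the VCF format allowing several labels in the FILTER column.

```python
def interpret_filter(filter_field):
    """Return (keep_call, if_caller) for a VCF FILTER value.

    Assumption: FILTER may hold several ';'-separated labels.
    """
    labels = set(filter_field.split(";"))
    if "REJECT" in labels:
        return (False, 0)  # skipped entirely
    if "LowQual" in labels:
        return (True, 0)  # considered, but the caller feature is not set to 1
    return (True, 1)  # treated as a bona fide somatic mutation call


for value in ("PASS", "LowQual", "REJECT"):
    print(value, interpret_filter(value))
```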

To invoke training mode, specify the following additional parameters **before**
the `paired` option. In addition to the four files listed above, two classifiers
(SNV and indel) will be created.

-   `--somaticseq-train`: FLAG to invoke training mode with no argument, which
    also requires ground truth VCF files.
    -   `--extra-hyperparameters`: add hyperparameters for xgboost, e.g.,
        `--extra-hyperparameters scale_pos_weight:0.1 grow_policy:lossguide max_leaves:12`.
-   `--truth-snv`: if you have a ground truth VCF file for SNV
-   `--truth-indel`: if you have a ground truth VCF file for INDEL
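
The `key:value` tokens passed to `--extra-hyperparameters` correspond to entries
in an XGBoost parameter dictionary. The following is a minimal sketch of that
mapping, assuming numeric values should be converted and everything else kept as
a string. The helper `parse_extra_params` is hypothetical and is not taken from
SomaticSeq.

```python
def parse_extra_params(pairs):
    """Turn ['key:value', ...] strings into a dict, converting numeric values."""
    params = {}
    for pair in pairs:
        key, _, raw = pair.partition(":")
        if not key or not raw:
            raise ValueError(f"expected key:value, got {pair!r}")
        for cast in (int, float):
            try:
                params[key] = cast(raw)
                break
            except ValueError:
                continue
        else:
            # Neither int nor float: keep values such as 'lossguide' as strings.
            params[key] = raw
    return params


print(parse_extra_params(["scale_pos_weight:0.1", "grow_policy:lossguide", "max_leaves:12"]))
```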

To invoke prediction mode (i.e., to use classifiers to score variants), specify
the following additional input files **before** the `paired` option. Four
additional files will be created, i.e., `SSeq.Classified.sSNV.vcf`,
`SSeq.Classified.sSNV.tsv`, `SSeq.Classified.sINDEL.vcf`, and
`SSeq.Classified.sINDEL.tsv`.

-   `--classifier-snv`: classifier previously built for SNV
-   `--classifier-indel`: classifier previously built for INDEL

Without the parameters above to invoke training or prediction mode,
SomaticSeq will default to majority-vote consensus mode.
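
To give a rough idea of what a majority vote over callers means, here is a
generic sketch. The function name and the `threshold` of 0.5 are assumptions for
illustration only; SomaticSeq's actual consensus rule and cutoffs may differ, so
consult the user manual for the real behavior.

```python
def majority_vote(caller_flags, threshold=0.5):
    """Return True when at least `threshold` of the callers reported the variant.

    caller_flags: dict of caller name -> 1 (called) or 0 (not called).
    threshold: assumed fraction for this sketch, not a documented default.
    """
    if not caller_flags:
        return False
    fraction = sum(caller_flags.values()) / len(caller_flags)
    return fraction >= threshold


print(majority_vote({"MuTect2": 1, "Strelka": 1, "VarDict": 0}))
print(majority_vote({"MuTect2": 1, "Strelka": 0, "VarDict": 0}))
```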

## To train for SomaticSeq classifiers with multiple data sets combined

Run `somaticseq_xgboost train --help` to see the options. It is recommended that
SNV and INDEL models be trained separately, but it is up to you to experiment,
e.g.,

```
somaticseq_xgboost train \
  -tsvs SAMPLE_1/Ensemble.sSNV.tsv SAMPLE_2/Ensemble.sSNV.tsv ... SAMPLE_N/Ensemble.sSNV.tsv \
  -out multiSample.SNV.classifier \
  -threads 8 -depth 12 -seed 42 -method hist -iter 250 \
  --extra-params scale_pos_weight:0.1 grow_policy:lossguide max_leaves:12
```

## Run SomaticSeq modules separately

Most SomaticSeq modules can be run on their own. They may be useful for
debugging or for your own purposes. See [this page](MODULES.md) for your
options.

## Dockerized workflows and pipelines

### To run somatic mutation callers and then SomaticSeq

We have created a module (i.e., `somaticseq_make_somatic_scripts`) that can run
all the dockerized somatic mutation callers and then SomaticSeq, described at
[**somaticseq/utilities/dockered_pipelines**](somaticseq/utilities/dockered_pipelines).
There is also an alignment workflow described there. You need
[docker](https://www.docker.com/) to run these workflows. Singularity is also
supported, but is not optimized. Let me know if you find bugs.

### To create training data to create SomaticSeq classifiers

-   I recommend [SEQC2 Somatic Mutation Working Group](docs/seqc2.md)'s
    [reference sequencing data](https://identifiers.org/ncbi/insdc.sra:SRP162370)
    and
    [high-confidence somatic mutation call sets](https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/seqc/Somatic_Mutation_WG/release/latest/).

-   Before well-characterized real data became available, we created dockerized
    pipelines for _in silico_ mutation spike-in at
    [**somaticseq/utilities/dockered_pipelines/bamSimulator**](somaticseq/utilities/dockered_pipelines/bamSimulator).
    These pipelines are based on
    [BAMSurgeon](https://github.com/adamewing/bamsurgeon). We have used them to
    create training sets to build SomaticSeq classifiers, though they have not
    been updated for a while.

-   Combining BAMSurgeon _in silico_ spike-in data with the real SEQC2 training
    data **may** give you a better model than using either alone, as shown in
    [Sahraeian S.M.E. _et al_. 2022](https://doi.org/10.1186/s13059-021-02592-9).
    The reason may be that the real data's high-confidence call sets do not have
    the most challenging genomic regions, whereas _in silico_ data do not have
    the most realistic data characteristics. Combining both allows them to cover
    each other's shortcomings.

### Dockerized alignment pipeline based on GATK's best practices

Described at
[**somaticseq/utilities/dockered_pipelines**](somaticseq/utilities/dockered_pipelines).
The module is `somaticseq_make_alignment_scripts`.

### Utilities

We have some generally useful scripts in [utilities](somaticseq/utilities). Some
of the more useful tools include:

-   `somaticseq_loci_counter` finds overlapping regions among multiple bed
    files.
-   `somaticseq_run_workflows` is a rudimentary workflow manager that executes
    multiple scripts at once.
-   `somaticseq_split_bed_into_equal_regions` splits one bed file into a number
    of output bed files, where each output bed file will have the same total
    length.
-   `somaticseq_linguistic_sequence_complexity` calculates sequence complexity
    given a nucleotide sequence (e.g., GCCAGAC) based on
    [Troyanskaya OG _et al_. Bioinformatics 2002](https://doi.org/10.1093/bioinformatics/18.5.679).

            
