mitywgs

Name	mitywgs
Version	1.1.0
home_page	https://github.com/KCCG/mity
Summary	A sensitive Mitochondrial variant detection pipeline from WGS data
upload_time	2024-06-07 03:51:08
maintainer	None
docs_url	None
author	Mark Cowley
requires_python	>=3.5.4
license	MIT
keywords	mitochondrial dna genomics variant snv indel
requirements	No requirements were recorded.

![mity logo](res/logos/mity-logo-red-white.png "mity")

# mity
`mity` is a bioinformatic analysis pipeline designed to call mitochondrial SNV and INDEL variants from Whole Genome Sequencing (WGS) data. `mity` can:
* identify very low-heteroplasmy variants, even <1% heteroplasmy, when there is sufficient read depth (e.g. >1000x)
* filter out common artefacts that arise from high-depth sequencing
* easily integrate with existing nuclear DNA analysis pipelines (mity merge)
* provide an annotated report, designed for clinicians and researchers to interrogate
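
To give a feel for why read depth matters for low-heteroplasmy calls, the sketch below treats each read as an independent draw and computes the expected number of alt-supporting reads, plus the binomial chance of seeing at least a given number of them. This is a back-of-the-envelope model that ignores sequencing error, not how `mity` itself scores variants; the function names and the example threshold of 5 reads are illustrative assumptions.

```python
from math import comb


def expected_alt_reads(depth: int, heteroplasmy: float) -> float:
    """Expected number of reads carrying the alternative allele."""
    return depth * heteroplasmy


def prob_at_least(k: int, depth: int, heteroplasmy: float) -> float:
    """Binomial probability of observing at least k alt-supporting reads.

    Assumes reads are independent draws and ignores sequencing errors.
    """
    p_fewer = sum(
        comb(depth, i) * heteroplasmy**i * (1 - heteroplasmy) ** (depth - i)
        for i in range(k)
    )
    return 1.0 - p_fewer


if __name__ == "__main__":
    # Compare a 1% heteroplasmy variant at 100x and at 1000x depth.
    for depth in (100, 1000):
        print(depth, expected_alt_reads(depth, 0.01), prob_at_least(5, depth, 0.01))
```

Under this simple model, a 1% variant yields about 10 supporting reads on average at 1000x but only about 1 at 100x, which is why very low heteroplasmy is only realistically detectable at high depth.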

# Usage
    mity -h

More detailed usage can be found in [docs/commands.md](docs/commands.md).

# Dependencies
* python3 (tested on 3.7.4)
* freebayes >= 1.2.0
* bgzip + tabix
* gsort (https://github.com/brentp/gsort)
* pyvcf
* xlsxwriter
* pandas

# Installation
Installation instructions via Docker, pip, or manually are available in [INSTALL.md](docs/INSTALL.md)

# Example Usage
This is an example of calling variants in the Ashkenazim Trio.

## mity call
First, run `mity call` on the three MT BAMs listed in [docs/test-files.md](docs/test-files.md). CRAM files are also supported.

We recommend always using `--normalise`, because `mity report` requires a normalised VCF:
```bash
mity call \
--prefix ashkenazim \
--output-dir test_out \
--region MT:1-500 \
--normalise \
test_in/HG002.hs37d5.2x250.small.MT.RG.bam \
test_in/HG003.hs37d5.2x250.small.MT.RG.bam \
test_in/HG004.hs37d5.2x250.small.MT.RG.bam
```
This will create `test_out/normalised/ashkenazim.mity.vcf.gz` and its tabix index (`.tbi`).

Or, if using Docker:
```bash
docker run -w "$PWD" -v "$PWD":"$PWD" drmjc/mity call \
--prefix ashkenazim \
--output-dir test_out \
--region MT:1-500 \
--normalise \
test_in/HG002.hs37d5.2x250.small.MT.RG.bam \
test_in/HG003.hs37d5.2x250.small.MT.RG.bam \
test_in/HG004.hs37d5.2x250.small.MT.RG.bam
```
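
When `mity call` is driven from a Python workflow, the same invocation can be assembled as an argument list. The sketch below uses only the flags shown in the example above; the helper name `build_mity_call` is hypothetical, and nothing is executed unless you pass the list to `subprocess.run` yourself on a machine where `mity` is installed.

```python
from typing import List, Optional, Sequence


def build_mity_call(
    prefix: str,
    output_dir: str,
    bams: Sequence[str],
    region: Optional[str] = None,
    normalise: bool = True,
) -> List[str]:
    """Assemble a `mity call` command using only flags shown in this README."""
    if not bams:
        raise ValueError("at least one BAM or CRAM file is required")
    cmd = ["mity", "call", "--prefix", prefix, "--output-dir", output_dir]
    if region:
        cmd += ["--region", region]
    if normalise:
        # `mity report` expects a normalised VCF, so this defaults to on.
        cmd.append("--normalise")
    cmd += list(bams)
    return cmd


if __name__ == "__main__":
    command = build_mity_call(
        prefix="ashkenazim",
        output_dir="test_out",
        region="MT:1-500",
        bams=[
            "test_in/HG002.hs37d5.2x250.small.MT.RG.bam",
            "test_in/HG003.hs37d5.2x250.small.MT.RG.bam",
            "test_in/HG004.hs37d5.2x250.small.MT.RG.bam",
        ],
    )
    print(" ".join(command))
```

Building the command as a list (rather than a shell string) avoids quoting problems when file names contain spaces.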

## mity report

We can create a `mity report` on the normalised VCF:
```bash
mity report \
--prefix ashkenazim \
--min_vaf 0.01 \
--output-dir test_out \
test_out/ashkenazim.mity.vcf.gz
```
This will create `test_out/ashkenazim.annotated_variants.csv` and `test_out/ashkenazim.annotated_variants.xlsx`.

## mity normalise
High-depth sequencing and sensitive variant calling can produce many variants with more than two alleles and, in some
cases, join two nearby variants separated by shared `REF` sequence into a single multi-nucleotide polymorphism (MNP),
as discussed in the manuscript. Here, variant normalisation means decomposing multi-allelic variants and,
where possible, splitting MNPs into their constituent smaller variants. At the time of writing,
all of the variant decomposition tools we tried failed to propagate the metadata of a multi-allelic variant to the split
variants, which caused problems when reporting the quality scores associated with each variant.

Technically, you can run `mity call` and `mity normalise` separately, but since `mity report` requires a normalised
VCF file, we recommend running `mity call --normalise`.
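
To illustrate what "propagating metadata to the split variants" means, here is a minimal sketch of decomposing one multi-allelic record into one record per `ALT` allele. It is not `mity`'s implementation: the dictionary layout and the function name are hypothetical, `AO` (per-allele observation count) and `DP` (total depth) are used as example fields, and MNP splitting and allele trimming are deliberately left out.

```python
from typing import Dict, List


def split_multiallelic(record: Dict) -> List[Dict]:
    """Split one multi-allelic record into one record per ALT allele.

    Per-allele values (here `AO`, one entry per ALT) are sliced so that each
    new record keeps only its own value; shared values (here `DP`) are copied.
    """
    alts = record["alts"]
    if len(record["AO"]) != len(alts):
        raise ValueError("AO must have one value per ALT allele")
    return [
        {
            "pos": record["pos"],
            "ref": record["ref"],
            "alts": [alt],
            "AO": [record["AO"][i]],  # keep only this allele's read count
            "DP": record["DP"],       # total depth is shared by all alleles
        }
        for i, alt in enumerate(alts)
    ]


if __name__ == "__main__":
    multi = {"pos": 150, "ref": "C", "alts": ["T", "G"], "AO": [900, 12], "DP": 2000}
    for rec in split_multiallelic(multi):
        print(rec)
```

If the per-allele values were dropped or copied unsliced, the low-level `G` allele in this example would inherit the wrong read count, which is the kind of problem described above.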

## mity merge
You can merge a nuclear `vcf.gz` file with a `mity.vcf.gz` file, replacing the MT calls in the nuclear VCF
(presumably from a caller such as HaplotypeCaller, which cannot sensitively call mitochondrial variants) with
the calls from `mity`.

```bash
mity merge \
--prefix ashkenazim \
--mity_vcf test_out/ashkenazim.mity.vcf.gz \
--nuclear_vcf todo-create-example-nuclear.vcf.gz
```
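
Conceptually, the merge drops the mitochondrial records from the nuclear VCF and substitutes the `mity` records. The sketch below shows that idea on plain, uncompressed VCF lines. It is a simplified illustration rather than `mity merge` itself: the function name is hypothetical, the header is taken from the nuclear VCF unchanged, and the output is not re-sorted.

```python
from typing import Iterable, List

# Mitochondrial contig names used by the reference genomes discussed in this README.
MT_CONTIGS = {"MT", "chrM"}


def replace_mt_records(nuclear_lines: Iterable[str], mity_lines: Iterable[str]) -> List[str]:
    """Drop MT records from the nuclear VCF body and append mity's MT records.

    Header lines (starting with '#') are kept from the nuclear VCF only;
    merging the two headers properly is more involved and not attempted here.
    """
    merged = []
    for line in nuclear_lines:
        if line.startswith("#") or line.split("\t", 1)[0] not in MT_CONTIGS:
            merged.append(line)
    for line in mity_lines:
        if not line.startswith("#") and line.split("\t", 1)[0] in MT_CONTIGS:
            merged.append(line)
    return merged
```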

# Recommendations for interpreting the report
Assuming that you are looking for a pathogenic variant in a patient with a rare genetic disorder potentially
caused by a mitochondrial mutation, we recommend the following strategy:
1. consider tier 1 or 2 variants included in the 'commercial_panels' column
2. consider tier 1 or 2 variants that match the clinical presentation and the phenotype in 'disease_mitomap', preferably
those annotated with Confirmed evidence in the 'status_mitomap' column
3. exclude common variants: anything linked to a 'phylotree_haplotype', or with a high
'MGRB_frequency' or a high 'GenBank_frequency_mitomap'
4. consider any remaining tier 1 or 2 variants that may have a predicted impact on tRNA
5. consider any remaining variants with high numbers of 'variant_references_mitomap'
6. if you have analysed multiple family members, consider variants whose level of 'variant_heteroplasmy' matches the
disease burden
7. tier 3 variants have low numbers of supporting reads and should be considered with caution. However, we have observed
numerous tier 3 variants, especially in WGS from blood, that match the pathogenic allele known to be at much higher
heteroplasmy in the affected tissue (this phenomenon is well established in the literature). Thus, if any
tier 3 variants match the patient's clinical presentation, we recommend considering them
as candidate variants and validating them using an orthogonal, clinically validated assay, preferably on the
disease-affected tissue.
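
The first three steps of this strategy can be roughed out as a filter over the rows of the annotated CSV. The sketch below is only a starting point under several assumptions that are not confirmed by this README: that the tier is stored in a column named `tier` with values `1`, `2` or `3`, that frequency columns hold plain fractions (blank when missing), and that a 1% cutoff is a sensible meaning of "high". The other column names are the ones quoted above; `shortlist` is a hypothetical helper.

```python
import csv
from typing import Dict, Iterable, List


def _text(row: Dict[str, str], key: str) -> str:
    """Return a stripped cell value, treating missing cells as empty."""
    return (row.get(key) or "").strip()


def _to_float(value: str) -> float:
    """Parse a frequency cell; blank or non-numeric cells count as 0."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return 0.0


def shortlist(rows: Iterable[Dict[str, str]], max_freq: float = 0.01) -> List[Dict[str, str]]:
    """Keep tier 1/2 variants with panel or Confirmed MITOMAP evidence,
    then drop haplotype-linked or common variants."""
    kept = []
    for row in rows:
        if _text(row, "tier") not in {"1", "2"}:  # ASSUMED column name and values
            continue
        on_panel = bool(_text(row, "commercial_panels"))
        confirmed = "confirmed" in _text(row, "status_mitomap").lower()
        if not (on_panel or confirmed):
            continue
        if _text(row, "phylotree_haplotype"):
            continue
        freqs = (_to_float(row.get("MGRB_frequency")),
                 _to_float(row.get("GenBank_frequency_mitomap")))
        if max(freqs) > max_freq:  # ASSUMED cutoff and units
            continue
        kept.append(row)
    return kept


if __name__ == "__main__":
    with open("test_out/ashkenazim.annotated_variants.csv", newline="") as handle:
        for variant in shortlist(csv.DictReader(handle)):
            print(variant)
```

A filter like this only narrows the list; the remaining steps, and any clinical interpretation, still need manual review.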

# Reference genomes
## Human
`mity` natively supports the analysis of the revised Cambridge Reference Sequence (rCRS, RefSeq ID NC_012920.1). The
rCRS is used in most human reference genomes from NCBI (GRCh37, hs37d5, GRCh38) and in hg38 from UCSC, where it is
named either `chrM` or `MT`. The main exception in common use is the `hg19` reference genome from UCSC, which uses a different
sequence (RefSeq NC_001807) that differs in length by 2 bp and shares 99% sequence identity with the rCRS
(16530/16572 identities, 4 gaps). For now, `mity call` supports the hg19 reference, but `mity report` will not annotate
variants properly, so you should not run `mity report` on hg19 calls. For mitochondrial analysis, we strongly recommend
using a reference genome that contains the rCRS sequence.

> - the mitochondrial genome: since the release of the UCSC hg19
> assembly, the Homo sapiens mitochondrion sequence (represented as "chrM" in the
> Genome Browser) has been replaced in GenBank with the record NC_012920, the
> revised Cambridge Reference Sequence (rCRS). We have not replaced the original
> sequence, NC_001807, as chrM in the hg19 Genome Browser. However, files in the
> subdirectory p13.plusMT include NC_012920 as "chrMT", in addition to the original
> "chrM".

| Reference   | contig name | RefSeq ID   | length   | uses rCRS? |
| ----------- | ----------- | ----------- | -------- | ---------- |
| GRCh37      | chrM        | NC_012920.1 | 16569 bp | yes        |
| hs37d5      | MT          | NC_012920.1 | 16569 bp | yes        |
| hg19 (UCSC) | chrM        | NC_001807.4 | 16571 bp | no         |
| GRCh38      | chrM        | NC_012920.1 | 16569 bp | yes        |
## Mouse
`mity call` and `mity normalise` support the analysis of the mouse genome (`mity call --reference mm10 ...`). `mity report`
currently only supports variant annotation to the human rCRS sequence.

# Commonly asked questions
## Base quality score recalibration (BQSR)
Most of the development of `mity` was tested on BAM files that had undergone GATK's BQSR method, which recalibrates the
base qualities of each read.
In our experience, this reduced the quality score of most bases by ~10 points, indicating that the base qualities
straight out of the sequencer are generally inflated. As the GATK best practices guide no longer recommends BQSR, it's
reasonable to ask whether `mity` can be run on BAM files straight out of the aligner.
`mity` has a custom QUAL score, which depends on the base qualities of only the reads that support the alternative
allele.
For tier 1 or 2 variants, there will be so many supporting reads that any miscalibration of base quality scores will
have no material effect. Tier 3 variants with very few supporting reads may be affected: a variant with only 3 or
4 supporting reads may end up with a stronger mity QUAL score than it would after BQSR. The recommendations above on how
to interpret and validate tier 3 variants still hold.
We would appreciate any feedback you may have on this.
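
For scale, Phred base qualities are logarithmic, so a drop of ~10 points corresponds to a tenfold higher estimated error probability. The snippet below shows only the standard Phred conversion; it does not reproduce `mity`'s custom QUAL calculation, which is not specified in this README.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability to a Phred quality score."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    # A base reported at Q30 that is recalibrated to Q20 is estimated to be
    # ten times more likely to be wrong.
    print(phred_to_error_prob(30), phred_to_error_prob(20))
```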

## CRAM support
CRAM support was added to `mity call` in v0.4.0.

# Acknowledgements
We would like to thank:
* The Kinghorn Centre for Clinical Genomics and collaborators, who helped with feedback for running `mity`.
* The Genome in a Bottle consortium for providing the test data used here.
* Eric Talevich, whose CNVkit helped us structure `mity` as a package.
* Erik Garrison for developing `FreeBayes` and for his early feedback on optimising `FreeBayes` for sensitive variant detection.
* Brent Pedersen for developing `gsort`.
