| Field | Value |
| --- | --- |
| Name | vcfstats |
| Version | 0.6.0 |
| home_page | https://github.com/pwwang/vcfstats |
| Summary | Powerful statistics for VCF files |
| upload_time | 2023-09-27 22:11:06 |
| author | pwwang |
| requires_python | >=3.8,<4.0 |
| license | MIT |

# vcfstats - powerful statistics for VCF files
[![Pypi][1]][2] [![Github][3]][4] [![PythonVers][5]][2] [![docs][6]][13] ![github action][7] [![Codacy][9]][10] [![Codacy coverage][11]][10]
[Documentation][13] | [CHANGELOG][12]
## Motivation
Several tools can plot statistics of VCF files, including [`bcftools`][14] and [`jvarkit`][15]. However, none of them can:
1. plot specific metrics
2. customize the plots
3. focus on variants with certain filters
The R package [`vcfR`][19] can do some of the above. However, it has to load the entire VCF into memory, which is impractical for large VCF files.
## Installation
```shell
pip install -U vcfstats
```
Or run with Docker:
```shell
docker run \
-w /vcfstats/workdir \
-v "$(pwd)":/vcfstats/workdir \
--rm justold/vcfstats:latest \
vcfstats \
--vcf myfile.vcf \
-o outputs \
--formula 'COUNT(1) ~ CONTIG' \
--title 'Number of variants on each chromosome'
```
## Gallery
### Number of variants on each chromosome
```shell
vcfstats --vcf examples/sample.vcf \
--outdir examples/ \
--formula 'COUNT(1) ~ CONTIG' \
--title 'Number of variants on each chromosome' \
--config examples/config.toml
```
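As a quick sanity check on this plot, the per-contig record counts can be reproduced with standard Unix tools. The following is a minimal sketch, assuming `COUNT(1) ~ CONTIG` counts one per VCF record. The tiny VCF written to a temporary file is made up for illustration and is not the project's `examples/sample.vcf`.

```shell
# Write a tiny, made-up VCF to a temporary file (illustrative data only).
tmp_vcf=$(mktemp)
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf '1\t100\t.\tA\tT\t50\tPASS\t.\n'
  printf '1\t200\t.\tG\tC\t50\tPASS\t.\n'
  printf '2\t150\t.\tC\tT\t50\tq10\t.\n'
} > "$tmp_vcf"

# Skip header lines (starting with '#'), take the CHROM column,
# and count how many records each contig has.
contig_counts=$(grep -v '^#' "$tmp_vcf" | cut -f 1 | sort | uniq -c | awk '{print $2 "=" $1}')
echo "$contig_counts"

rm -f "$tmp_vcf"
```

Replacing `"$tmp_vcf"` with an uncompressed VCF of your own gives numbers to compare against the bar heights.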

#### Changing labels and ticks
`vcfstats` uses [`plotnine`][17] for plotting. See its documentation for the expressions you can pass to `--ggs` to modify the plots.
```shell
vcfstats --vcf examples/sample.vcf \
--outdir examples/ \
--formula 'COUNT(1) ~ CONTIG' \
--title 'Number of variants on each chromosome (modified)' \
--config examples/config.toml \
--ggs 'scale_x_discrete(name ="Chromosome", \
limits=["1","2","3","4","5","6","7","8","9","10","X"]); \
ylab("# Variants")'
```
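The `limits` list gets long for genomes with many contigs. Here is a small sketch of building the `--ggs` string in the shell instead of typing each entry by hand. The variable names `limits` and `ggs` are illustrative, and the snippet only prints the string rather than running `vcfstats`.

```shell
# Build the quoted, comma-separated contig list: "1","2",...,"10","X"
limits=""
for chrom in 1 2 3 4 5 6 7 8 9 10 X; do
  # ${limits:+,} expands to a comma only when limits is already non-empty.
  limits="${limits}${limits:+,}\"${chrom}\""
done

# Assemble the same two plotnine expressions used above, separated by ';'.
ggs="scale_x_discrete(name=\"Chromosome\", limits=[${limits}]); ylab(\"# Variants\")"
echo "$ggs"
```

The result can then be passed as `--ggs "$ggs"`.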

#### Number of variants on the first 5 chromosomes
```shell
vcfstats --vcf examples/sample.vcf \
--outdir examples/ \
--formula 'COUNT(1) ~ CONTIG[1,2,3,4,5]' \
--title 'Number of variants on each chromosome (first 5)' \
--config examples/config.toml
# or
vcfstats --vcf examples/sample.vcf \
--outdir examples/ \
--formula 'COUNT(1) ~ CONTIG[1-5]' \
--title 'Number of variants on each chromosome (first 5)' \
--config examples/config.toml
# or
# requires the VCF file to be tabix-indexed
vcfstats --vcf examples/sample.vcf \
--outdir examples/ \
--formula 'COUNT(1) ~ CONTIG' \
--title 'Number of variants on each chromosome (first 5)' \
--config examples/config.toml -r 1 2 3 4 5
```
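The third variant needs a tabix index. Below is a minimal sketch of creating one, assuming htslib's `bgzip` and `tabix` are installed and the VCF is sorted by position. `index_vcf` is a hypothetical helper name, not part of `vcfstats`.

```shell
# index_vcf FILE.vcf
# Writes FILE.vcf.gz and FILE.vcf.gz.tbi next to the input file.
# Assumes htslib's `bgzip` and `tabix` are on PATH (not shipped with vcfstats).
index_vcf() {
  vcf=$1
  bgzip -c "$vcf" > "$vcf.gz" || return 1   # block-compress, keep the original
  tabix -p vcf "$vcf.gz"                    # build the index "$vcf.gz.tbi"
}
```

After `index_vcf examples/sample.vcf`, the indexed file is `examples/sample.vcf.gz`. That `--vcf` should point at this compressed file when `-r` is used is an assumption here, based on how tabix region queries generally work.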

### Number of substitutions of SNPs
```shell
vcfstats --vcf examples/sample.vcf \
--outdir examples/ \
--formula 'COUNT(1, VARTYPE[snp]) ~ SUBST[A>T,A>G,A>C,T>A,T>G,T>C,G>A,G>T,G>C,C>A,C>T,C>G]' \
--title 'Number of substitutions of SNPs' \
--config examples/config.toml
```
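The twelve substitution categories in the formula above are every ordered pair of distinct bases, so the list can be generated rather than typed. This is a small sketch with illustrative variable names. It only prints the formula string.

```shell
# Enumerate all REF>ALT pairs where REF and ALT differ.
subst=""
for ref in A T G C; do
  for alt in A T G C; do
    [ "$ref" = "$alt" ] && continue          # skip A>A, T>T, ...
    subst="${subst}${subst:+,}${ref}>${alt}"
  done
done

formula="COUNT(1, VARTYPE[snp]) ~ SUBST[${subst}]"
echo "$formula"
```

The printed string can be passed as `--formula "$formula"`.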

#### Only SNPs that pass all filters
```shell
vcfstats --vcf examples/sample.vcf \
--outdir examples/ \
--formula 'COUNT(1, VARTYPE[snp]) ~ SUBST[A>T,A>G,A>C,T>A,T>G,T>C,G>A,G>T,G>C,C>A,C>T,C>G]' \
--title 'Number of substitutions of SNPs (passed)' \
--config examples/config.toml \
--passed
```

### Alternative allele frequency on each chromosome
```shell
# using a dark theme
vcfstats --vcf examples/sample.vcf \
--outdir examples/ \
--formula 'AAF ~ CONTIG' \
--title 'Allele frequency on each chromosome' \
--config examples/config.toml --ggs 'theme_dark()'
```

#### Using boxplot
```shell
vcfstats --vcf examples/sample.vcf \
--outdir examples/ \
--formula 'AAF ~ CONTIG' \
--title 'Allele frequency on each chromosome (boxplot)' \
--config examples/config.toml \
--figtype boxplot
```

#### Using a density plot or histogram to investigate the distribution
You can plot the distribution using a density plot or a histogram:
```shell
vcfstats --vcf examples/sample.vcf \
--outdir examples/ \
--formula 'AAF ~ CONTIG[1,2]' \
--title 'Allele frequency on chromosome 1,2' \
--config examples/config.toml \
--figtype density
```

### Overall distribution of allele frequency
```shell
vcfstats --vcf examples/sample.vcf \
--outdir examples/ \
--formula 'AAF ~ 1' \
--title 'Overall allele frequency distribution' \
--config examples/config.toml
```

#### Excluding some low/high frequency variants
```shell
vcfstats --vcf examples/sample.vcf \
--outdir examples/ \
--formula 'AAF[0.05, 0.95] ~ 1' \
--title 'Overall allele frequency distribution (0.05-0.95)' \
--config examples/config.toml
```

### Counting types of variants on each chromosome
```shell
vcfstats --vcf examples/sample.vcf \
--outdir examples/ \
--formula 'COUNT(1, group=VARTYPE) ~ CHROM' \
--title 'Types of variants on each chromosome' \
--config examples/config.toml
# or simply use this shorter formula instead:
# --formula 'VARTYPE ~ CHROM'
```

#### Using a pie chart if there is only one chromosome
```shell
vcfstats --vcf examples/sample.vcf \
--outdir examples/ \
--formula 'COUNT(1, group=VARTYPE) ~ CHROM[1]' \
--title 'Types of variants on chromosome 1' \
--config examples/config.toml \
--figtype pie
# or simply use this shorter formula instead:
# --formula 'VARTYPE ~ CHROM[1]'
```

#### Counting variant types on whole genome
```shell
vcfstats --vcf examples/sample.vcf \
--outdir examples/ \
--formula 'COUNT(1, group=VARTYPE) ~ 1' \
--title 'Types of variants on whole genome' \
--config examples/config.toml
# or simply use this shorter formula instead:
# --formula 'VARTYPE ~ 1'
```

### Counting types of mutant genotypes (HET, HOM_ALT) for sample 1 on each chromosome
```shell
vcfstats --vcf examples/sample.vcf \
--outdir examples/ \
--formula 'COUNT(1, group=GTTYPEs[HET,HOM_ALT]{0}) ~ CHROM' \
--title 'Mutant genotypes on each chromosome (sample 1)' \
--config examples/config.toml
# or simply use this shorter formula instead:
# --formula 'GTTYPEs[HET,HOM_ALT]{0} ~ CHROM'
```

### Exploration of mean(genotype quality) and mean(depth) on each chromosome for sample 1
```shell
vcfstats --vcf examples/sample.vcf \
--outdir examples/ \
--formula 'MEAN(GQs{0}) ~ MEAN(DEPTHs{0}, group=CHROM)' \
--title 'GQ vs depth (sample 1)' \
--config examples/config.toml
```

### Exploration of depths for samples 1 and 2
```shell
vcfstats --vcf examples/sample.vcf \
--outdir examples/ \
--formula 'DEPTHs{0} ~ DEPTHs{1}' \
--title 'Depths between sample 1 and 2' \
--config examples/config.toml
```
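The gallery commands all share `--vcf`, `--outdir` and `--config`, so several plots can be driven from one small helper. This is a sketch only: `run_plot` and `DRY_RUN` are hypothetical names introduced here, and only flags that appear in the examples above are used. With `DRY_RUN=1` (the default in this sketch) the commands are printed instead of executed.

```shell
# DRY_RUN=1 prints each command; set DRY_RUN=0 to actually call vcfstats.
DRY_RUN=${DRY_RUN:-1}

# run_plot FORMULA TITLE
run_plot() {
  if [ "$DRY_RUN" = 1 ]; then
    printf "vcfstats --vcf examples/sample.vcf --outdir examples/ --formula '%s' --title '%s' --config examples/config.toml\n" "$1" "$2"
  else
    vcfstats --vcf examples/sample.vcf --outdir examples/ \
      --formula "$1" --title "$2" --config examples/config.toml
  fi
}

run_plot 'COUNT(1) ~ CONTIG' 'Number of variants on each chromosome'
run_plot 'AAF ~ 1' 'Overall allele frequency distribution'
run_plot 'DEPTHs{0} ~ DEPTHs{1}' 'Depths between sample 1 and 2'
```

Keeping the shared arguments in one place makes it harder for the individual plots to drift apart in input file or configuration.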

See more examples:
[https://github.com/pwwang/vcfstats/issues/15#issuecomment-1029367903](https://github.com/pwwang/vcfstats/issues/15#issuecomment-1029367903)
[1]: https://img.shields.io/pypi/v/vcfstats?style=flat-square
[2]: https://pypi.org/project/vcfstats/
[3]: https://img.shields.io/github/v/tag/pwwang/vcfstats?style=flat-square
[4]: https://github.com/pwwang/vcfstats
[5]: https://img.shields.io/pypi/pyversions/vcfstats?style=flat-square
[6]: https://img.shields.io/github/actions/workflow/status/pwwang/vcfstats/docs.yml?label=docs&style=flat-square
[7]: https://img.shields.io/github/actions/workflow/status/pwwang/vcfstats/build.yml?style=flat-square
[9]: https://img.shields.io/codacy/grade/c8c8bfa8c5e9443bbf268a0a7c6f206d?style=flat-square
[10]: https://app.codacy.com/gh/pwwang/vcfstats/
[11]: https://img.shields.io/codacy/coverage/c8c8bfa8c5e9443bbf268a0a7c6f206d?style=flat-square
[12]: https://pwwang.github.io/vcfstats/CHANGELOG/
[13]: https://pwwang.github.io/vcfstats/
[14]: https://samtools.github.io/bcftools/bcftools.html#stats
[15]: http://lindenb.github.io/jvarkit/VcfStatsJfx.html
[17]: https://plotnine.readthedocs.io/en/stable/
[19]: https://knausb.github.io/vcfR_documentation/visualization_1.html
Raw data
{
"_id": null,
"home_page": "https://github.com/pwwang/vcfstats",
"name": "vcfstats",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.8,<4.0",
"maintainer_email": "",
"keywords": "",
"author": "pwwang",
"author_email": "pwwang@pwwang.com",
"download_url": "https://files.pythonhosted.org/packages/f0/ba/b324a84749ab451abd5b57ded8b7c81541854ecd8fd53ff61186703768cf/vcfstats-0.6.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Powerful statistics for VCF files",
"version": "0.6.0",
"project_urls": {
"Homepage": "https://github.com/pwwang/vcfstats",
"Repository": "https://github.com/pwwang/vcfstats"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "e89613a04a888d986031dc9212dfb5e516fadc422163bc0d3346335378a53071",
"md5": "049d04a4871d2b3b9ad6b158ca5a16f7",
"sha256": "45d346400809c01cfbacd25360cab4bf5bcc7e8047abf4e21e3c1ff9cf4f9005"
},
"downloads": -1,
"filename": "vcfstats-0.6.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "049d04a4871d2b3b9ad6b158ca5a16f7",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8,<4.0",
"size": 17163,
"upload_time": "2023-09-27T22:11:05",
"upload_time_iso_8601": "2023-09-27T22:11:05.347689Z",
"url": "https://files.pythonhosted.org/packages/e8/96/13a04a888d986031dc9212dfb5e516fadc422163bc0d3346335378a53071/vcfstats-0.6.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "f0bab324a84749ab451abd5b57ded8b7c81541854ecd8fd53ff61186703768cf",
"md5": "263eec5ce5b87dd6f47c2a025eeb0c3f",
"sha256": "a6afdaaa6af96e5f159f2f31428c5f9b3563bb55252e4f79b2306aa8e95fe98d"
},
"downloads": -1,
"filename": "vcfstats-0.6.0.tar.gz",
"has_sig": false,
"md5_digest": "263eec5ce5b87dd6f47c2a025eeb0c3f",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8,<4.0",
"size": 16686,
"upload_time": "2023-09-27T22:11:06",
"upload_time_iso_8601": "2023-09-27T22:11:06.841876Z",
"url": "https://files.pythonhosted.org/packages/f0/ba/b324a84749ab451abd5b57ded8b7c81541854ecd8fd53ff61186703768cf/vcfstats-0.6.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-09-27 22:11:06",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "pwwang",
"github_project": "vcfstats",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"tox": true,
"lcname": "vcfstats"
}