| Name | vcf2py |
| Version | 0.2.2 |
| home_page | None |
| Summary | Tools to process genomic variant data with Numerical Python |
| upload_time | 2024-09-05 03:24:30 |
| maintainer | None |
| docs_url | None |
| author | None |
| requires_python | >=3.7 |
| license | None |
# VCF2PY – working with genomic variants in Python #
The `vcf2py` package lets you quickly import genomic
variants in VCF format into Python as NumPy arrays.
## Installation ##
With pip:

    pip install vcf2py

## Sample usage ##
Import the main class for working with VCF data:

    from vcf2py import VariantFile

### Parse variants and INFO fields ###
Consider the following VCF file `1KG_example.vcf`:

    ##fileformat=VCFv4.3
    ##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated allele frequency in the range (0,1)">
    ##INFO=<ID=AC,Number=A,Type=Integer,Description="Total number of alternate alleles in called genotypes">
    ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of samples with data">
    ##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
    ##INFO=<ID=EAS_AF,Number=A,Type=Float,Description="Allele frequency in the EAS populations calculated from AC and AN, in the range (0,1)">
    ##INFO=<ID=EUR_AF,Number=A,Type=Float,Description="Allele frequency in the EUR populations calculated from AC and AN, in the range (0,1)">
    ##INFO=<ID=AFR_AF,Number=A,Type=Float,Description="Allele frequency in the AFR populations calculated from AC and AN, in the range (0,1)">
    ##INFO=<ID=AMR_AF,Number=A,Type=Float,Description="Allele frequency in the AMR populations calculated from AC and AN, in the range (0,1)">
    ##INFO=<ID=SAS_AF,Number=A,Type=Float,Description="Allele frequency in the SAS populations calculated from AC and AN, in the range (0,1)">
    ##INFO=<ID=VT,Number=.,Type=String,Description="indicates what type of variant the line represents">
    ##INFO=<ID=EX_TARGET,Number=0,Type=Flag,Description="indicates whether a variant is within the exon pull down target boundaries">
    ##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
    #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG00097 HG00099
    7 152135021 . C T . PASS AC=0;AN=4;DP=21624;AF=0;EAS_AF=0;EUR_AF=0;AFR_AF=0;AMR_AF=0;SAS_AF=0;VT=SNP;NS=2548 GT 0|0 0|0
    7 152135047 . T C . PASS AC=0;AN=4;DP=21003;AF=0;EAS_AF=0;EUR_AF=0;AFR_AF=0;AMR_AF=0;SAS_AF=0;VT=SNP;NS=2548 GT 0|0 0|0
    7 152135074 . C T . PASS AC=0;AN=4;DP=20726;AF=0;EAS_AF=0;EUR_AF=0;AFR_AF=0;AMR_AF=0;SAS_AF=0.01;VT=SNP;NS=2548 GT 0|0 0|0
    7 152135149 . A G . PASS AC=0;AN=4;DP=19360;AF=0;EAS_AF=0.01;EUR_AF=0;AFR_AF=0;AMR_AF=0;SAS_AF=0;VT=SNP;NS=2548 GT 0|0 0|0
    7 152135225 . C T . PASS AC=0;AN=4;DP=20911;AF=0;EAS_AF=0;EUR_AF=0;AFR_AF=0;AMR_AF=0;SAS_AF=0;VT=SNP;NS=2548 GT 0|0 0|0
    7 152135289 . A G . PASS AC=0;AN=4;DP=20973;AF=0;EAS_AF=0;EUR_AF=0.01;AFR_AF=0;AMR_AF=0;SAS_AF=0;VT=SNP;NS=2548 GT 0|0 0|0
    7 152135350 . C A . PASS AC=0;AN=4;DP=20835;AF=0;EAS_AF=0;EUR_AF=0;AFR_AF=0;AMR_AF=0;SAS_AF=0;VT=SNP;NS=2548 GT 0|0 0|0

Parsing it gives the following:

    >>> from vcf2py import VariantFile
    >>> f = VariantFile("1KG_example.vcf")
    >>> vrt, info = f.read(info=True)
    >>> info
    array([(0., 0, 2548, 4, 0.  , 0.  , 0., 0., 0.  , ('SNP',), False, 21624),
           (0., 0, 2548, 4, 0.  , 0.  , 0., 0., 0.  , ('SNP',), False, 21003),
           (0., 0, 2548, 4, 0.  , 0.  , 0., 0., 0.01, ('SNP',), False, 20726),
           (0., 0, 2548, 4, 0.01, 0.  , 0., 0., 0.  , ('SNP',), False, 19360),
           (0., 0, 2548, 4, 0.  , 0.  , 0., 0., 0.  , ('SNP',), False, 20911),
           (0., 0, 2548, 4, 0.  , 0.01, 0., 0., 0.  , ('SNP',), False, 20973),
           (0., 0, 2548, 4, 0.  , 0.  , 0., 0., 0.  , ('SNP',), False, 20835)],
          dtype=[('AF', '<f8'), ('AC', '<i8'), ('NS', '<i8'), ('AN', '<i8'),
                 ('EAS_AF', '<f8'), ('EUR_AF', '<f8'), ('AFR_AF', '<f8'),
                 ('AMR_AF', '<f8'), ('SAS_AF', '<f8'), ('VT', 'O'), ('EX_TARGET', '?'), ('DP', '<i8')])

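Because `info` is a NumPy structured array, each INFO field can be read as a column and used to filter records. The sketch below does not call `vcf2py`; it uses a hand-built stand-in array with three of the fields and three of the rows shown above, so the variable names are illustrative only.

```python
import numpy as np

# Hand-built stand-in for part of the `info` array shown above
# (an assumption for illustration; not produced by vcf2py here).
info = np.array(
    [(0.0, 0.0, 21624), (0.0, 0.01, 20726), (0.01, 0.0, 19360)],
    dtype=[("EAS_AF", "<f8"), ("SAS_AF", "<f8"), ("DP", "<i8")],
)

# Indexing by field name returns that column as a plain ndarray.
depth = info["DP"]

# Boolean masks built from fields select whole records.
seen_in_eas_or_sas = info[(info["EAS_AF"] > 0) | (info["SAS_AF"] > 0)]

print(depth.mean())
print(seen_in_eas_or_sas["DP"])
```

The same indexing should apply to any field listed in the `dtype` of the real array.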
Variants and their data can be easily imported into pandas:

    >>> import pandas as pd
    >>> pd.DataFrame.from_records(vrt)
      chrom        pos ref alt id  qual filter
    0     7  152135021   C   T  .   NaN   PASS
    1     7  152135047   T   C  .   NaN   PASS
    2     7  152135074   C   T  .   NaN   PASS
    3     7  152135149   A   G  .   NaN   PASS
    4     7  152135225   C   T  .   NaN   PASS
    5     7  152135289   A   G  .   NaN   PASS
    6     7  152135350   C   A  .   NaN   PASS

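The INFO fields can be placed next to the variant columns in the same way. This is a minimal sketch with hand-built stand-in arrays; the field names and dtypes of `vrt` are assumptions based on the table above, and the rows are assumed to be in the same order in both arrays.

```python
import numpy as np
import pandas as pd

# Stand-ins for the `vrt` and `info` arrays (assumed layout, two rows only).
vrt = np.array(
    [("7", 152135021, "C", "T"), ("7", 152135074, "C", "T")],
    dtype=[("chrom", "O"), ("pos", "<i8"), ("ref", "O"), ("alt", "O")],
)
info = np.array(
    [(0.0, 21624), (0.01, 20726)],
    dtype=[("SAS_AF", "<f8"), ("DP", "<i8")],
)

# Both arrays have one row per variant, so a column-wise concat lines them up.
table = pd.concat(
    [pd.DataFrame.from_records(vrt), pd.DataFrame.from_records(info)],
    axis=1,
)
print(table)
```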
### Parsing genotypes ###
Genotypes in VCF files are declared as `String` type. However, they
are important in genetic analysis, so additional parsing tools
are implemented. The `genotypes` parameter of `VariantFile.read` provides
additional control: in addition to the default `genotypes="string"`, it
accepts the values `"split"` and `"sum"`.
Consider the following VCF file with two records but three variants in total
(the first record has two alternate alleles):

    ##fileformat=VCFv4.2
    ##FORMAT=<ID=GT,Number=1,Type=String,Description="Phased Genotype">
    #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE1 SAMPLE2
    chr24 166 . T TG,C 100 . . GT 0/1/2 0/1/0
    chr24 167 . T TG 100 . . GT ./. 0/1

`genotypes="split"` gives the following:

    >>> from vcf2py import VariantFile
    >>> f = VariantFile("file.vcf")
    >>> vrt, samples = f.read(samples=True, genotypes="split")
    >>> samples["SAMPLE1"]["GT"]
    array([[ 0.,  1.,  0.],
           [ 0.,  0.,  1.],
           [nan, nan, nan]], dtype=float16)

i.e. the output is an ndarray with one row per variant, holding `1` or `0`
depending on whether each allele carries that variant. `nan` values stand
for unspecified genotypes.
`genotypes="sum"` gives:

    >>> vrt, samples = f.read(samples=True, genotypes="sum")
    >>> samples["SAMPLE1"]["GT"]
    array([1, 1, 0], dtype=int8)

i.e. an integer for each variant giving the number of alleles
that carry the variant.
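To make the two encodings concrete, here is a pure-Python sketch that reproduces the SAMPLE1 values above from the raw `GT` strings. It is not the `vcf2py` implementation: the helper names `split_genotype` and `sum_genotype` are hypothetical, and padding missing allele slots with `nan` and counting unspecified alleles as zero in the sum are assumptions inferred from the printed output.

```python
import math
import re


def split_genotype(gt, n_alt, ploidy):
    """Return one row per ALT allele: 1.0/0.0 per allele slot, nan if uncalled."""
    calls = re.split(r"[/|]", gt)  # accept both unphased and phased separators
    rows = []
    for alt_index in range(1, n_alt + 1):
        row = []
        for slot in range(ploidy):
            # Assumption: slots beyond the recorded ploidy count as uncalled.
            call = calls[slot] if slot < len(calls) else "."
            if call == ".":
                row.append(math.nan)
            else:
                row.append(1.0 if int(call) == alt_index else 0.0)
        rows.append(row)
    return rows


def sum_genotype(gt, n_alt, ploidy):
    """Count the alleles carrying each ALT; uncalled alleles contribute zero."""
    rows = split_genotype(gt, n_alt, ploidy)
    return [int(sum(v for v in row if not math.isnan(v))) for row in rows]


# SAMPLE1, first record: ALT is "TG,C", genotype "0/1/2".
print(split_genotype("0/1/2", n_alt=2, ploidy=3))
print(sum_genotype("0/1/2", n_alt=2, ploidy=3))
# SAMPLE1, second record: ALT is "TG", genotype "./.".
print(sum_genotype("./.", n_alt=1, ploidy=3))
```

Each record contributes one row per alternate allele, which is why two records yield three rows in the arrays shown earlier.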
Raw data
{
"_id": null,
"home_page": null,
"name": "vcf2py",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.7",
"maintainer_email": null,
"keywords": null,
"author": null,
"author_email": "mikpom <mikpom@mikpom.ru>",
"download_url": "https://files.pythonhosted.org/packages/38/ba/4585003ba77ecf6f2c2aefddaead48fb22bb37c96dc2ce89be577900c816/vcf2py-0.2.2.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": null,
"summary": "Tools to process genomic variant data with Numerical Python",
"version": "0.2.2",
"project_urls": {
"Homepage": "https://github.com/mikpom/vcf2py"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "afd44749503f995f31b6eac31320f104eeecf822367337beed2bd79f4dc65d28",
"md5": "ae0a489d91fcbc432a7e45f728763dc7",
"sha256": "fdc0fd3db0a7d9ae2c4f830c3ee5a1617e5b30132903a78c93d2622d28cf3dc2"
},
"downloads": -1,
"filename": "vcf2py-0.2.2-py3-none-any.whl",
"has_sig": false,
"md5_digest": "ae0a489d91fcbc432a7e45f728763dc7",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.7",
"size": 15532,
"upload_time": "2024-09-05T03:24:29",
"upload_time_iso_8601": "2024-09-05T03:24:29.098248Z",
"url": "https://files.pythonhosted.org/packages/af/d4/4749503f995f31b6eac31320f104eeecf822367337beed2bd79f4dc65d28/vcf2py-0.2.2-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "38ba4585003ba77ecf6f2c2aefddaead48fb22bb37c96dc2ce89be577900c816",
"md5": "9116397f976150020c980c99ea78b841",
"sha256": "8927114c6d3badbbbea616b3fd31f18135245b5d5a270c17173ce8b2688b5223"
},
"downloads": -1,
"filename": "vcf2py-0.2.2.tar.gz",
"has_sig": false,
"md5_digest": "9116397f976150020c980c99ea78b841",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.7",
"size": 16320,
"upload_time": "2024-09-05T03:24:30",
"upload_time_iso_8601": "2024-09-05T03:24:30.968437Z",
"url": "https://files.pythonhosted.org/packages/38/ba/4585003ba77ecf6f2c2aefddaead48fb22bb37c96dc2ce89be577900c816/vcf2py-0.2.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-09-05 03:24:30",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "mikpom",
"github_project": "vcf2py",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "vcf2py"
}