vcf2py


Name: vcf2py
Version: 0.2.2
Summary: Tools to process genomic variant data with Numerical Python
Upload time: 2024-09-05 03:24:30
Requires Python: >=3.7
# VCF2PY – working with genomic variants in Python #

The `vcf2py` package quickly imports genomic
variants in VCF format into Python as NumPy arrays.

## Installation ##

With pip:

    pip install vcf2py

## Sample usage ##

Import main class to work with VCF data:

    from vcf2py import VariantFile
    
### Parse variants and INFO fields ###

Consider the following VCF file, `1KG_example.vcf`:

    ##fileformat=VCFv4.3
    ##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated allele frequency in the range (0,1)">
    ##INFO=<ID=AC,Number=A,Type=Integer,Description="Total number of alternate alleles in called genotypes">
    ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of samples with data">
    ##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
    ##INFO=<ID=EAS_AF,Number=A,Type=Float,Description="Allele frequency in the EAS populations calculated from AC and AN, in the range (0,1)">
    ##INFO=<ID=EUR_AF,Number=A,Type=Float,Description="Allele frequency in the EUR populations calculated from AC and AN, in the range (0,1)">
    ##INFO=<ID=AFR_AF,Number=A,Type=Float,Description="Allele frequency in the AFR populations calculated from AC and AN, in the range (0,1)">
    ##INFO=<ID=AMR_AF,Number=A,Type=Float,Description="Allele frequency in the AMR populations calculated from AC and AN, in the range (0,1)">
    ##INFO=<ID=SAS_AF,Number=A,Type=Float,Description="Allele frequency in the SAS populations calculated from AC and AN, in the range (0,1)">
    ##INFO=<ID=VT,Number=.,Type=String,Description="indicates what type of variant the line represents">
    ##INFO=<ID=EX_TARGET,Number=0,Type=Flag,Description="indicates whether a variant is within the exon pull down target boundaries">
    ##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
    #CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	HG00097	HG00099
    7	152135021	.	C	T	.	PASS	AC=0;AN=4;DP=21624;AF=0;EAS_AF=0;EUR_AF=0;AFR_AF=0;AMR_AF=0;SAS_AF=0;VT=SNP;NS=2548	GT	0|0	0|0
    7	152135047	.	T	C	.	PASS	AC=0;AN=4;DP=21003;AF=0;EAS_AF=0;EUR_AF=0;AFR_AF=0;AMR_AF=0;SAS_AF=0;VT=SNP;NS=2548	GT	0|0	0|0
    7	152135074	.	C	T	.	PASS	AC=0;AN=4;DP=20726;AF=0;EAS_AF=0;EUR_AF=0;AFR_AF=0;AMR_AF=0;SAS_AF=0.01;VT=SNP;NS=2548	GT	0|0	0|0
    7	152135149	.	A	G	.	PASS	AC=0;AN=4;DP=19360;AF=0;EAS_AF=0.01;EUR_AF=0;AFR_AF=0;AMR_AF=0;SAS_AF=0;VT=SNP;NS=2548	GT	0|0	0|0
    7	152135225	.	C	T	.	PASS	AC=0;AN=4;DP=20911;AF=0;EAS_AF=0;EUR_AF=0;AFR_AF=0;AMR_AF=0;SAS_AF=0;VT=SNP;NS=2548	GT	0|0	0|0
    7	152135289	.	A	G	.	PASS	AC=0;AN=4;DP=20973;AF=0;EAS_AF=0;EUR_AF=0.01;AFR_AF=0;AMR_AF=0;SAS_AF=0;VT=SNP;NS=2548	GT	0|0	0|0
    7	152135350	.	C	A	.	PASS	AC=0;AN=4;DP=20835;AF=0;EAS_AF=0;EUR_AF=0;AFR_AF=0;AMR_AF=0;SAS_AF=0;VT=SNP;NS=2548	GT	0|0	0|0

Parsing it gives the following:

    >>> f = VariantFile("1KG_example.vcf")
    >>> vrt, info = f.read(info=True)
    >>> info
    array([(0., 0, 2548, 4, 0.  , 0.  , 0., 0., 0.  , ('SNP',), False, 21624),
           (0., 0, 2548, 4, 0.  , 0.  , 0., 0., 0.  , ('SNP',), False, 21003),
           (0., 0, 2548, 4, 0.  , 0.  , 0., 0., 0.01, ('SNP',), False, 20726),
           (0., 0, 2548, 4, 0.01, 0.  , 0., 0., 0.  , ('SNP',), False, 19360),
           (0., 0, 2548, 4, 0.  , 0.  , 0., 0., 0.  , ('SNP',), False, 20911),
           (0., 0, 2548, 4, 0.  , 0.01, 0., 0., 0.  , ('SNP',), False, 20973),
           (0., 0, 2548, 4, 0.  , 0.  , 0., 0., 0.  , ('SNP',), False, 20835)],
          dtype=[('AF', '<f8'), ('AC', '<i8'), ('NS', '<i8'), ('AN', '<i8'), 
          ('EAS_AF', '<f8'), ('EUR_AF', '<f8'), ('AFR_AF', '<f8'), 
          ('AMR_AF', '<f8'), ('SAS_AF', '<f8'), ('VT', 'O'), ('EX_TARGET', '?'), ('DP', '<i8')])
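
The `info` result is a NumPy structured array, so fields are selected by name and records by boolean mask. The sketch below assumes only NumPy. It rebuilds three of the fields (`DP`, `EAS_AF`, `SAS_AF`) by hand from the records shown above instead of calling `vcf2py`, so it illustrates general NumPy behavior and not anything specific to the package.

```python
import numpy as np

# Hand-built stand-in for the parsed `info` array: same values as the
# seven records above, restricted to three fields for brevity.
info = np.array(
    [
        (21624, 0.0, 0.0),
        (21003, 0.0, 0.0),
        (20726, 0.0, 0.01),
        (19360, 0.01, 0.0),
        (20911, 0.0, 0.0),
        (20973, 0.0, 0.0),
        (20835, 0.0, 0.0),
    ],
    dtype=[("DP", "<i8"), ("EAS_AF", "<f8"), ("SAS_AF", "<f8")],
)

# Selecting a field by name yields a plain 1-D array.
depth = info["DP"]

# Boolean masks built from fields select whole records.
mask = (info["EAS_AF"] > 0) | (info["SAS_AF"] > 0)
nonzero = info[mask]

print(depth.max())
print(nonzero["DP"])
```

The same indexing should apply to the full array returned by `f.read(info=True)`, since its dtype lists every INFO field by name.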
          
Variants and their data can be easily imported into pandas:

    >>> import pandas as pd
    >>> pd.DataFrame.from_records(vrt)
      chrom        pos ref alt id  qual filter
    0     7  152135021   C   T  .   NaN   PASS
    1     7  152135047   T   C  .   NaN   PASS
    2     7  152135074   C   T  .   NaN   PASS
    3     7  152135149   A   G  .   NaN   PASS
    4     7  152135225   C   T  .   NaN   PASS
    5     7  152135289   A   G  .   NaN   PASS
    6     7  152135350   C   A  .   NaN   PASS

### Parsing genotypes ###

Genotypes in VCF files are declared as `String` type.  However, they
are important in genetic analysis, so additional parsing tools
are implemented.  The `genotypes` parameter of `VariantFile.read` provides
additional control.  In addition to the default `genotypes="string"`, it
accepts the values `split` and `sum`.

Consider the following VCF file, whose two records contain three variants (alternate alleles) in total:

    ##fileformat=VCFv4.2
    ##FORMAT=<ID=GT,Number=1,Type=String,Description="Phased Genotype">
    #CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	SAMPLE1	SAMPLE2
    chr24	166	.	T	TG,C	100	.	.	GT	0/1/2	0/1/0
    chr24	167	.	T	TG	100	.	.	GT	./.	0/1

`genotypes="split"` gives the following:

    >>> from vcf2py import VariantFile
    >>> f = VariantFile("file.vcf")
    >>> vrt, samples = f.read(samples=True, genotypes="split")
    >>> samples["SAMPLE1"]["GT"]
    array([[ 0.,  1.,  0.],
           [ 0.,  0.,  1.],
           [nan, nan, nan]], dtype=float16)

That is, the output contains an ndarray with one row per variant, holding
`1` or `0` depending on whether each allele carries that variant.  `nan`
values stand for unspecified genotypes.
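
The encoding described above can be sketched in plain Python. This is an illustration of the output format only, not the `vcf2py` implementation. The helper `split_genotype`, its arguments, and the choice to pad shorter genotypes with `nan` up to a fixed ploidy are assumptions made for this example.

```python
import math


def split_genotype(gt, n_alt, ploidy):
    """Return one row per ALT allele with a 0/1 indicator per allele copy.

    Hypothetical helper: a genotype with any missing call ('.') becomes a
    row of nan, and shorter genotypes are padded with nan up to `ploidy`.
    """
    calls = gt.replace("|", "/").split("/")
    if any(call == "." for call in calls):
        return [[math.nan] * ploidy for _ in range(n_alt)]
    alleles = [int(call) for call in calls]
    rows = []
    for alt in range(1, n_alt + 1):
        # 1.0 where this allele copy carries the ALT allele, else 0.0.
        row = [1.0 if allele == alt else 0.0 for allele in alleles]
        row += [math.nan] * (ploidy - len(row))
        rows.append(row)
    return rows


# SAMPLE1 from the file above: (GT string, number of ALT alleles) per record.
records = [("0/1/2", 2), ("./.", 1)]
matrix = [
    row
    for gt, n_alt in records
    for row in split_genotype(gt, n_alt, ploidy=3)
]
for row in matrix:
    print(row)
```

The two records expand to three rows, one per alternate allele, which matches the shape of the array shown for `SAMPLE1`.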


`genotypes="sum"` gives:

    >>> vrt, samples = f.read(samples=True, genotypes="sum")
    >>> samples["SAMPLE1"]["GT"]
    array([1, 1, 0], dtype=int8)

That is, an integer for each variant giving the number of alleles
that carry the variant.
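
In this example the `sum` result equals the row sums of the `split` matrix when `nan` is counted as zero. The NumPy sketch below checks that relation on the `SAMPLE1` values shown earlier. Whether `vcf2py` computes `sum` this way internally is an assumption; the code only reproduces the numbers printed above.

```python
import numpy as np

# The `split` output for SAMPLE1, copied from the example above.
split = np.array(
    [[0, 1, 0], [0, 0, 1], [np.nan, np.nan, np.nan]],
    dtype=np.float16,
)

# nansum treats nan as zero, so an unspecified genotype contributes 0.
summed = np.nansum(split, axis=1).astype(np.int8)
print(summed)
```

Note that a count of `0` is then ambiguous: it can mean a homozygous reference genotype or an unspecified one, so the `split` form is the safer choice when missing calls matter.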


            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "vcf2py",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.7",
    "maintainer_email": null,
    "keywords": null,
    "author": null,
    "author_email": "mikpom <mikpom@mikpom.ru>",
    "download_url": "https://files.pythonhosted.org/packages/38/ba/4585003ba77ecf6f2c2aefddaead48fb22bb37c96dc2ce89be577900c816/vcf2py-0.2.2.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "Tools to process genomic variant data with Numerical Python",
    "version": "0.2.2",
    "project_urls": {
        "Homepage": "https://github.com/mikpom/vcf2py"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "afd44749503f995f31b6eac31320f104eeecf822367337beed2bd79f4dc65d28",
                "md5": "ae0a489d91fcbc432a7e45f728763dc7",
                "sha256": "fdc0fd3db0a7d9ae2c4f830c3ee5a1617e5b30132903a78c93d2622d28cf3dc2"
            },
            "downloads": -1,
            "filename": "vcf2py-0.2.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "ae0a489d91fcbc432a7e45f728763dc7",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.7",
            "size": 15532,
            "upload_time": "2024-09-05T03:24:29",
            "upload_time_iso_8601": "2024-09-05T03:24:29.098248Z",
            "url": "https://files.pythonhosted.org/packages/af/d4/4749503f995f31b6eac31320f104eeecf822367337beed2bd79f4dc65d28/vcf2py-0.2.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "38ba4585003ba77ecf6f2c2aefddaead48fb22bb37c96dc2ce89be577900c816",
                "md5": "9116397f976150020c980c99ea78b841",
                "sha256": "8927114c6d3badbbbea616b3fd31f18135245b5d5a270c17173ce8b2688b5223"
            },
            "downloads": -1,
            "filename": "vcf2py-0.2.2.tar.gz",
            "has_sig": false,
            "md5_digest": "9116397f976150020c980c99ea78b841",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7",
            "size": 16320,
            "upload_time": "2024-09-05T03:24:30",
            "upload_time_iso_8601": "2024-09-05T03:24:30.968437Z",
            "url": "https://files.pythonhosted.org/packages/38/ba/4585003ba77ecf6f2c2aefddaead48fb22bb37c96dc2ce89be577900c816/vcf2py-0.2.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-09-05 03:24:30",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "mikpom",
    "github_project": "vcf2py",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "vcf2py"
}
        