vcfio


Name: vcfio
Version: 1.1.9
home_page: https://github.com/emedgene/vcfio
Summary: A simple and efficient VCF manipulation package.
upload_time: 2024-02-04 10:19:53
maintainer: shencar
docs_url: None
author: shencar
requires_python: >=3.6,<4.0
license: MIT
keywords: vcf, variant, genetics, bioinformatics, emedgene, illumina
requirements: No requirements were recorded.
Travis-CI: No Travis.
coveralls test coverage: No coveralls.
<h1 align="center">
  <br>
  <img src="doc/logo.png" alt="vcfio" width="60%">
  <br>
</h1>

# 🔭 Overview
vcfio is an **efficient** and **easy-to-use** package for reading, writing and manipulating [VCF](https://samtools.github.io/hts-specs/VCFv4.2.pdf) files.<br>
It is designed for robust variant processing, and requires *minimum* computing resources.

- ⚡ Fast - iterates over more than 4,500 variants per second
- 🪶 Lightweight - eagerly parses only the **crucial information** about each variant (CHROM, POS, ID, REF, ALT, QUAL, FILTER); everything else (INFO and SAMPLES) is parsed on demand
- 🏁 Efficient - no up-front parsing of whole lines or casting into huge in-memory objects
- 🔌 Dependency free - optional dependencies are available to enhance the package
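
To make the "parsed on demand" idea in the list above concrete, here is a minimal sketch that uses only the standard library. It is not vcfio's implementation: the class `LazyVariant` and its attribute names are hypothetical and exist only for illustration. The fixed columns are split eagerly, while the INFO column stays a raw string until it is first requested.

```python
class LazyVariant:
    """Hypothetical illustration of on-demand parsing (not vcfio's actual class)."""

    def __init__(self, line):
        # Split off only the 7 fixed columns; the rest of the line stays one raw string.
        fields = line.rstrip("\n").split("\t", 7)
        self.chromosome, pos, self.id, self.ref, alt, self.quality, self.filter = fields[:7]
        self.position = int(pos)
        self.alt = alt.split(",")
        self._rest = fields[7] if len(fields) > 7 else ""
        self._info = None  # filled in lazily by the `info` property

    @property
    def info(self):
        # Parse the INFO column the first time it is requested, then reuse the result.
        if self._info is None:
            raw_info = self._rest.split("\t", 1)[0]
            self._info = {}
            for item in raw_info.split(";"):
                if not item or item == ".":
                    continue
                key, _, value = item.partition("=")
                self._info[key] = value if value else True  # flags carry no value
        return self._info


line = "chr1\t726\t.\tG\tC,T\t500\t.\tDP=200;MQ=250.00\tGT:AD\t0/1:10,160,30"
variant = LazyVariant(line)
print(variant.chromosome, variant.position, variant.alt)
print(variant.info["DP"])
```

Iterating over a file without touching `info` therefore costs one bounded `split` per line, which is the kind of saving the bullet points describe.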

# 🎯 Why?
The existing Python VCF solutions were **extremely inefficient**: they parse every variant line in advance and cast
every bit of data into their own objects or a list of strings, which consumes a lot of memory.

This slowed our runtime down considerably.

**We wrote vcfio to overcome those issues, making it a lightweight and dependency-free package.**


# ⚙️ Installation
```bash
# Basic installation
pip install vcfio

# Include optional dependencies
# (the quotes keep shells such as zsh from treating the brackets as a glob pattern)
pip install "vcfio[bio]"
```

# ⭐ Features
:heavy_check_mark: **Read and write** plain and compressed VCF files without having to specify the format - `file.vcf`, `file.vcf.gz`, `file.vcf.bgz`.<br>
:heavy_check_mark: **Automatically infer** the type of values in the file. For example `'1,2,3'` will yield `[1, 2, 3]`.<br>
:heavy_check_mark: **Parse** values __*on demand*__ for maximum efficiency.<br>
:heavy_check_mark: **Fetch** variant ranges within a chromosome.
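
As a rough illustration of how plain and compressed files can be read through the same entry point, the sketch below checks the two gzip magic bytes before choosing how to open a file. The helper `open_vcf_text` is a hypothetical name, and this is not necessarily how vcfio detects compression. It relies on the fact that bgzip output is a series of gzip members, which Python's `gzip` module reads sequentially.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip (and bgzip) file


def open_vcf_text(path):
    """Open a plain or gzip/bgzip-compressed VCF as text (hypothetical helper)."""
    with open(path, "rb") as handle:
        is_gzip = handle.read(2) == GZIP_MAGIC
    if is_gzip:
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "rt", encoding="utf-8")


# Write the same tiny sample twice, plain and compressed, then read both back.
content = "##fileformat=VCFv4.2\nchr1\t726\t.\tG\tC\t500\t.\tDP=200\n"
with tempfile.TemporaryDirectory() as tmp:
    plain_path = os.path.join(tmp, "file.vcf")
    gz_path = os.path.join(tmp, "file.vcf.gz")
    with open(plain_path, "w", encoding="utf-8") as out:
        out.write(content)
    with gzip.open(gz_path, "wt", encoding="utf-8") as out:
        out.write(content)

    with open_vcf_text(plain_path) as plain, open_vcf_text(gz_path) as compressed:
        plain_text, gz_text = plain.read(), compressed.read()

print(plain_text == gz_text)
```

Sniffing the content rather than trusting the file extension means a misnamed file is still opened correctly.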

# ❓ How to Use
### VcfReader
Here are some examples of what you can do with `VcfReader`.
(We recommend that you explore all the available methods.)
<details>
    <summary>Click to view usage!</summary>
    <img src="doc/vcfreader.gif">
    <img src="doc/vcfreaderfetch.gif">
</details>

```python
import vcfio

with vcfio.VcfReader('/path/to/file.vcf') as reader:
    # Iterate variants
    for variant in reader:  # type: vcfio.Variant
        print(variant.to_vcf_line())
        
        # Iterate the variant's samples
        for sample_name, sample in variant.samples.items():  # type: AnyStr, EasyDict
            # Look for "AD" in the sample first, then fall back to the variant's INFO; a value of "." is returned as None
            ad = variant.get_value('AD', sample_name, empty_value='.')        
            zygosity = variant.get_zygosity(sample_name)

    # Fetch variants within the range chr3:1-1000
    for variant in reader.fetch('chr3', start=1, end=1000):  # type: vcfio.Variant
        print(variant)
```
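
For readers curious about what a range query does conceptually, here is a naive, standard-library-only sketch that filters raw VCF lines by chromosome and position. The function `fetch_lines` is hypothetical, and treating both bounds as 1-based and inclusive is an assumption; the README does not state how `reader.fetch` handles its bounds or whether it uses an index. On large files an index-based lookup would normally replace this linear scan.

```python
def fetch_lines(lines, chromosome, start, end):
    """Yield data lines on `chromosome` with start <= POS <= end (assumed inclusive)."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, pos, _rest = line.split("\t", 2)
        if chrom == chromosome and start <= int(pos) <= end:
            yield line.rstrip("\n")


vcf_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr3\t500\t.\tA\tG\t50\t.\t.",
    "chr3\t1500\t.\tC\tT\t50\t.\t.",
    "chr4\t700\t.\tG\tA\t50\t.\t.",
]
hits = list(fetch_lines(vcf_lines, "chr3", start=1, end=1000))
print(hits)
```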
### VcfWriter
Here is an example of what you can do with VcfWriter
<details>
    <summary>Click to view usage!</summary>
    <img src="doc/vcfwriter.gif">
</details>

```python
import vcfio

# This will open an existing VCF, add a new value to each variant's INFO and write the result to a new vcf.gz
with vcfio.VcfReader('/path/to/input_file.vcf') as reader, \
        vcfio.VcfWriter('/path/to/output_file.vcf.gz', headers=reader.headers) as writer:
    for variant in reader:  # type: vcfio.Variant
        variant.info['new_info_field'] = 'new_info_value'
        writer.write_variant(variant)
```
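
To show what adding an INFO value amounts to at the text level, the following sketch appends a `key=value` pair to the eighth column of a raw VCF line using only the standard library. The helper `add_info_field` is hypothetical and is not part of vcfio. Note also that a well-formed VCF should describe any new INFO key in a `##INFO` header line, which this sketch does not do.

```python
def add_info_field(line, key, value):
    """Return `line` with `key=value` appended to its INFO column (hypothetical helper)."""
    columns = line.rstrip("\n").split("\t")
    info = columns[7]  # INFO is the 8th VCF column
    new_item = f"{key}={value}"
    # An INFO of "." means "no information", so it is replaced rather than extended.
    columns[7] = new_item if info in ("", ".") else f"{info};{new_item}"
    return "\t".join(columns)


line = "chr1\t726\t.\tG\tC,T\t500\t.\tDP=200;MQ=250.00"
updated = add_info_field(line, "new_info_field", "new_info_value")
print(updated.split("\t")[7])
```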

### Variant
Here are some examples of what you can do with `Variant`.
(We recommend that you explore all the available methods.)
<details>
    <summary>Click to view usage!</summary>
    <img src="doc/variant.gif">
</details>

```python
import vcfio

variant_line = "chr1	726	.	G	C,T	500	.	DP=200;MQ=250.00	GT:AD:AF:DP:GQ	0/1:10,160,30:0.8,0.15:200:420"

# This will parse the raw variant line into a Variant instance
variant = vcfio.Variant.from_variant_line(variant_line, sample_names=['proband'])
print(
    variant.quality,                        # --> 500
    variant.chromosome,                     # --> "chr1"
    variant.alt,                            # --> ["C", "T"]
    variant.get_zygosity('proband'),        # --> "HET"
    variant.get_value('MQ', 'proband'),     # --> 250
    variant.samples['proband'].get('GT')    # --> "0/1"
)
```
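
The `"0/1"` to `"HET"` mapping in the example above can be sketched from the GT string alone. The function below is a hypothetical illustration, not vcfio's `get_zygosity`; only the `"HET"` label comes from this README, while `"HOM_REF"`, `"HOM_ALT"` and the `None` result for missing calls are labels assumed for this sketch.

```python
import re


def classify_genotype(gt):
    """Classify a VCF GT string such as "0/1" or "1|1" (labels partly assumed)."""
    alleles = re.split(r"[/|]", gt)  # "/" is unphased, "|" is phased
    if "." in alleles:
        return None  # at least one allele call is missing
    if len(set(alleles)) > 1:
        return "HET"
    return "HOM_REF" if alleles[0] == "0" else "HOM_ALT"


for gt in ("0/1", "1|1", "0/0", "./.", "1/2"):
    print(gt, classify_genotype(gt))
```

A single-allele (haploid) call such as `"1"` falls into the homozygous branch here, so hemizygous calls would need their own handling.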

### EasyDict
A `dict` subclass with a smart `get` method. Its main (and only) feature is automatically inferring the type of the value it returns.
A VCF file rarely specifies the types of the values in a variant's data, so this object makes a bioinformatician's life easier and spares them the casting duty.
It is used for the attributes of `vcfio.Variant`.
```python
d = EasyDict({
    'simple_int': 1,
    'not_simple_int': '2',
    'this_is_not_a_list': ['abc'],
    'list_of_numbers': ['1', 2, '3.1'],
    'this_is_a_real_list': ['a', 'b', 'c'],
    'dot_is_not_a_value': '.',
    'this_is_a_list': '1,2,3',
    'this_is_a_list_but_i_like_strings': '1,2,3'
})
d.get('simple_int')                                                 # --> 1
d.get('not_simple_int', infer_type=True)                            # --> 2
d.get('this_is_not_a_list', infer_type=True)                        # --> 'abc'
d.get('list_of_numbers', infer_type=True)                           # --> [1, 2, 3.1]
d.get('this_is_a_real_list', infer_type=True)                       # --> ['a', 'b', 'c']
d.get('dot_is_not_a_value', empty_value='.', infer_type=True)       # --> None
d.get('this_is_a_list', empty_value='.', infer_type=True)           # --> [1, 2, 3]
d.get('this_is_a_list_but_i_like_strings', empty_value='.')         # --> '1,2,3' 
```
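
The inference rules shown in the comments above can be approximated with a few lines of standard-library Python. This is a sketch written to reproduce those documented outputs, not `EasyDict`'s actual code; the function names `infer_scalar` and `infer_value` are hypothetical.

```python
def infer_scalar(value):
    """Cast a string to int or float when possible; return anything else unchanged."""
    if not isinstance(value, str):
        return value
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            continue
    return value


def infer_value(value, empty_value=None):
    """Infer the type of a raw VCF value, mirroring the outputs listed above."""
    if empty_value is not None and value == empty_value:
        return None  # the placeholder for "no value"
    if isinstance(value, str) and "," in value:
        value = value.split(",")  # comma-separated strings become lists
    if isinstance(value, (list, tuple)):
        items = [infer_scalar(item) for item in value]
        return items[0] if len(items) == 1 else items  # unwrap single-item lists
    return infer_scalar(value)


print(infer_value('2'))
print(infer_value(['1', 2, '3.1']))
print(infer_value('1,2,3', empty_value='.'))
print(infer_value('.', empty_value='.'))
```

Because the sketch relies on Python's `int` and `float` constructors, strings such as `'nan'` or `'inf'` are also converted to floats, which may or may not match vcfio's behavior.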

# Credits
This package uses the following open source packages:
- [PyVCF](https://github.com/jdoughertyii/PyVCF/)

# Contributors
<img src="https://img.shields.io/github/contributors-anon/emedgene/vcfio"/>

<a href="https://github.com/emedgene/vcfio/graphs/contributors"><img src="https://contrib.rocks/image?repo=emedgene/vcfio&max=240&columns=18" /></a>

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/emedgene/vcfio",
    "name": "vcfio",
    "maintainer": "shencar",
    "docs_url": null,
    "requires_python": ">=3.6,<4.0",
    "maintainer_email": "barak.shencar@gmail.com",
    "keywords": "vcf,variant,genetics,bioinformatics,emedgene,illumina",
    "author": "shencar",
    "author_email": "barak.shencar@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/6b/d6/21148560edff09974478a6b8d2e1e7dafc97430d01af368cb1be34fdcde2/vcfio-1.1.9.tar.gz",
    "platform": null,
    "description": "<h1 align=\"center\">\n  <br>\n  <img src=\"doc/logo.png\" alt=\"vcfio\" width=\"60%\">\n  <br>\n\n\n</h1>\n\n# \ud83d\udd2d Overview\nvcfio is an **efficient** and **easy-to-use** package for [VCF](https://samtools.github.io/hts-specs/VCFv4.2.pdf) reading, writing and manipluation.<br>\nIt is designed for robust variant processing, and requires *minimum* computing resources.\n\n- \u26a1 Fast - iterate >4500 variants per second\n- \ud83e\udeb6 Lightweight - automatically parses only **crucial information** about variants (CHROM, POS, ID, REF, ALT, QUAL, FILTER), all the other information (INFO and SAMPLES) is parsed on demand\n- \ud83c\udfc1 Efficient - No advanced parsing of lines and casting to hugh memory objects\n- \ud83d\udd0c Dependency free - We do have optional dependencies which enhance the package\n\n# \ud83c\udfaf Why ? \nThe existing python VCF solutions were **extremely inefficient** - parsing every variant line in advance and casting \nevery bit of data into their own objects or a list of strings, which is very memory-consuming.\n\nThis affected our runtime by a huge factor.\n\n**We wrote vcfio to overcome those issues, making it a lightweight and dependency-free package.**\n\n\n# \u2699\ufe0f \ufe0fInstallation\n```bash\n# Basic installation\npip install vcfio\n\n# Include optional dependencies\npip install vcfio[bio]\n```\n\n# \u2b50 Features\n:heavy_check_mark: **Read and write** plain and compressed vcf (no specification needed) - `file.vcf`, `file.vcf.gz`, `file.vcf.bgz`.<br>\n:heavy_check_mark: **Automatically infer** the type of values in the file. 
For example `'1,2,3'` will yield `[1, 2, 3]`.<br>\n:heavy_check_mark: **Parse** values __*on demand*__ for maximum efficiency.<br>\n:heavy_check_mark: **Fetch** variant ranges within a chromosome.\n\n# \u2753 How to Use\n### VcfReader\nHere are some examples of what you can do with VcfReader\n(We recommend you explore all the available methods)\n<details>\n    <summary>Click to view usage!</summary>\n    <img src=\"doc/vcfreader.gif\">\n    <img src=\"doc/vcfreaderfetch.gif\">\n</details>\n\n```python\nimport vcfio\n\nwith vcfio.VcfReader('/path/to/file.vcf') as reader:\n    # Iterate variants\n    for variant in reader:  # type: vcfio.Variant\n        print(variant.to_vcf_line())\n        \n        # Iterate the variant's samples\n        for sample_name, sample in variant.samples.items():  # type: AnyStr, EasyDict\n            # Try to find \"AD\" in sample, if not found - find in variant's info, if the value is a dot - return None\n            ad = variant.get_value('AD', sample_name, empty_value='.')        \n            zygosity = variant.get_zygosity(sample_name)\n\n    # Fetch variants within the range chr3:1-1000\n    for variant in reader.fetch('chr3', start=1, end=1000):  # type: vcfio.Variant\n        print(variant)\n```\n### VcfWriter\nHere is an example of what you can do with VcfWriter\n<details>\n    <summary>Click to view usage!</summary>\n    <img src=\"doc/vcfwriter.gif\">\n</details>\n\n```python\nimport vcfio\n\n# This will open an existing vcf, introduce a new value to each variant's INFO and write to another vcf.gz\nwith vcfio.VcfReader('/path/to/output_file.vcf.gz') as reader, \\\n        vcfio.VcfWriter('/path/to/output.vcf', headers=reader.headers) as writer:  \n    for variant in reader:  # type: vcfio.Variant\n        variant.info['new_info_field'] = 'new_info_value'\n        writer.write_variant(variant)\n```\n\n### Variant\nHere are some examples of what you can do with Variant\n(We recommend you to explore all the available 
methods)\n<details>\n    <summary>Click to view usage!</summary>\n    <img src=\"doc/variant.gif\">\n</details>\n\n```python\nimport vcfio\n\nvariant_line = \"chr1\t726\t.\tG\tC,T\t500\t.\tDP=200;MQ=250.00\tGT:AD:AF:DP:GQ\t0/1:10,160,30:0.8,0.15:200:420\"\n\n# This will parse the raw variant line into a Variant instance\nvariant = vcfio.Variant.from_variant_line(variant_line, sample_names=['proband'])\nprint(\n    variant.quality,                        # --> 726\n    variant.chromosome,                     # --> \"chr1\"\n    variant.alt,                            # --> [\"C\", \"T\"]\n    variant.get_zygosity('proband'),        # --> \"HET\"\n    variant.get_value('MQM', 'proband'),    # --> 250\n    variant.samples['proband'].get('GT')    # --> \"0/1\"\n)\n```\n\n### EasyDict\nA Dict-inherited class with a smart `get` method. It's main (and only) feature is to automatically infer the type of the value it returns.\nA vcf file rarely specifies the type of values within the variant's data so this object makes a bioinformaticaion's life easier and exempts him from casting-duty.\nIt is used in `vcfio.Variant`'s attributes.\n```python\nd = EasyDict({\n\t'simple_int': 1,\n\t'not_simple_int': '2',\n\t'this_is_not_a_list': ['abc'],\n\t'list_of_numbers': ['1', 2, '3.1'],\n\t'this_is_a_real_list': ['a', 'b', 'c'],\n\t'dot_is_not_a_value': '.',\n    'this_is_a_list': '1,2,3',\n    'this_is_a_list_but_i_like_strings': '1,2,3'\n}) \nd.get('simple_int')                                                 # --> 1\nd.get('not_simple_int', infer_type=True)                            # --> 2\nd.get('this_is_not_a_list', infer_type=True)                        # --> 'abc'\nd.get('list_of_numbers', infer_type=True)                           # --> [1, 2, 3.1]\nd.get('this_is_a_real_list', infer_type=True)                       # --> ['a', 'b', 'c']\nd.get('dot_is_not_a_value', empty_value='.', infer_type=True)       # --> None\nd.get('this_is_a_list', empty_value='.', infer_type=True)   
        # --> [1, 2, 3]\nd.get('this_is_a_list_but_i_like_strings', empty_value='.')         # --> '1,2,3' \n```\n\n# Credits\nThis package uses the following open source packages:\n- [PyVCF](https://github.com/jdoughertyii/PyVCF/)\n\n# Contributors\n<img src=\"https://img.shields.io/github/contributors-anon/emedgene/vcfio\"/>\n\n<a href=\"https://github.com/emedgene/vcfio/graphs/contributors\"><img src=\"https://contrib.rocks/image?repo=emedgene/vcfio&max=240&columns=18\" /></a>\n",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "A simple and efficient VCF manipulation package.",
    "version": "1.1.9",
    "project_urls": {
        "Homepage": "https://github.com/emedgene/vcfio",
        "Repository": "https://github.com/emedgene/vcfio"
    },
    "split_keywords": [
        "vcf",
        "variant",
        "genetics",
        "bioinformatics",
        "emedgene",
        "illumina"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "dedeff36165c9509ac37a6952d1846465399296283dd57f429614c6281c96aa7",
                "md5": "d7c3efff09f206f54ba9d2e6f1234f1a",
                "sha256": "5c143e450160349b6dd4b1d313916e62fe9f05f0d90f1723173c585ed465ed56"
            },
            "downloads": -1,
            "filename": "vcfio-1.1.9-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "d7c3efff09f206f54ba9d2e6f1234f1a",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.6,<4.0",
            "size": 14461,
            "upload_time": "2024-02-04T10:19:51",
            "upload_time_iso_8601": "2024-02-04T10:19:51.858335Z",
            "url": "https://files.pythonhosted.org/packages/de/de/ff36165c9509ac37a6952d1846465399296283dd57f429614c6281c96aa7/vcfio-1.1.9-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "6bd621148560edff09974478a6b8d2e1e7dafc97430d01af368cb1be34fdcde2",
                "md5": "b24829495a040b1237a5212ffb0630b8",
                "sha256": "62b4ca0f2b38431e9b380d5369b6a0d0ea48b00b8a80ec81aa98dd7984c737a5"
            },
            "downloads": -1,
            "filename": "vcfio-1.1.9.tar.gz",
            "has_sig": false,
            "md5_digest": "b24829495a040b1237a5212ffb0630b8",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.6,<4.0",
            "size": 12958,
            "upload_time": "2024-02-04T10:19:53",
            "upload_time_iso_8601": "2024-02-04T10:19:53.588198Z",
            "url": "https://files.pythonhosted.org/packages/6b/d6/21148560edff09974478a6b8d2e1e7dafc97430d01af368cb1be34fdcde2/vcfio-1.1.9.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-02-04 10:19:53",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "emedgene",
    "github_project": "vcfio",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "tox": true,
    "lcname": "vcfio"
}
        