# VariantExtractor<!-- omit in toc -->
**Deterministic and standard extractor of indels, SNVs and structural variants (SVs)** from VCF files, built within the framework of [EUCANCan](https://eucancan.com/)'s second work package. VariantExtractor is a Python package (**requires Python version 3.6 or higher**) that provides a set of data structures and functions to extract variants from VCF files in a **deterministic and standard** way while [adding information](#variantrecord) to facilitate downstream processing. It homogenizes [multiallelic variants](#multiallelic-variants), [MNPs](#snvs) and [SVs](#structural-variants) (including [imprecise paired breakends](#imprecise-paired-breakends) and [single breakends](#single-breakends)). The package is designed to be used in a pipeline, where the variants are ingested from VCF files and then used in downstream analysis. Check the [available documentation](https://eucancan.github.io/variant-extractor/) for more information.
While there is broad agreement on how to label SNVs and indels, this is not the case for structural variants. Different variant callers currently label structural variants differently, which makes comparisons between them difficult. This package provides a unified interface to extract variants (including structural variants) from VCFs generated by different variant callers. Apart from reading the VCF file, VariantExtractor **adds a preprocessing layer to homogenize the variants** extracted from the file. This way, the variants can be used in downstream analysis in a consistent way. For more information about the homogenization process, check the [homogenization rules](#homogenization-rules) section.
## Table of contents<!-- omit in toc -->
- [Getting started](#getting-started)
- [Installation](#installation)
- [Usage](#usage)
- [VariantRecord](#variantrecord)
- [VariantType](#varianttype)
- [BreakendSVRecord](#breakendsvrecord)
- [ShorthandSVRecord](#shorthandsvrecord)
- [Homogenization rules](#homogenization-rules)
- [Multiallelic variants](#multiallelic-variants)
- [SNVs](#snvs)
- [Structural variants](#structural-variants)
- [Breakend vs shorthand notation](#breakend-vs-shorthand-notation)
- [Paired breakends](#paired-breakends)
- [Inferred breakend pairs](#inferred-breakend-pairs)
- [Imprecise paired breakends](#imprecise-paired-breakends)
- [Single breakends](#single-breakends)
- [Dependencies](#dependencies)
## Getting started
### Installation
VariantExtractor is available on PyPI and can be installed using `pip`:
```bash
pip install variant-extractor
```
## Usage
```python
# Import the package
from variant_extractor import VariantExtractor
# Create a new instance of the class
extractor = VariantExtractor('/path/to/file.vcf')
# Iterate through the variants
for variant_record in extractor:
    print(f'Found variant of type {variant_record.variant_type.name}: {variant_record.contig}:{variant_record.pos}')
```
```python
# Import the package
from variant_extractor import VariantExtractor
# Create a new instance of the class
extractor = VariantExtractor('/path/to/file.vcf')
# Save variants to a CSV file
extractor.to_dataframe().drop(['variant_record_obj'], axis=1).to_csv('/path/to/output.csv', index=False)
```
For a more complete list of examples, check the [examples](./examples/) directory. This folder also includes an example of a [script for normalizing VCF files](examples/normalize_vcf.py) following the [homogenization rules](#homogenization-rules).
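
As a further illustration, the sketch below tallies records per variant type. It relies only on the `variant_type.name` attribute shown in the usage example; the `count_variant_types` helper and the stand-in records are hypothetical and exist only so the snippet runs without a VCF file.

```python
from collections import Counter
from types import SimpleNamespace


def count_variant_types(records):
    """Count records per variant type name.

    `records` is any iterable whose items expose `variant_type.name`,
    such as the records yielded when iterating a VariantExtractor.
    """
    return Counter(record.variant_type.name for record in records)


# Stand-in records (hypothetical data) so the sketch runs without a VCF file.
demo_records = [
    SimpleNamespace(variant_type=SimpleNamespace(name='SNV')),
    SimpleNamespace(variant_type=SimpleNamespace(name='DEL')),
    SimpleNamespace(variant_type=SimpleNamespace(name='SNV')),
]
counts = count_variant_types(demo_records)
print(counts)
```

With the package installed, the same helper should accept `VariantExtractor('/path/to/file.vcf')` in place of `demo_records`.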
## VariantRecord
Iterating over a `VariantExtractor` instance yields `VariantRecord` instances. The `VariantRecord` class is a container for the information in a VCF record plus some extra useful information.
| Property | Type | Description |
| ------------------ | ------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------- |
| `contig` | `str` | Contig name |
| `pos` | `int` | Position on the contig |
| `end` | `int` | End position of the variant in the contig (same as `pos` for TRA and SNV) |
| `length` | `int` | Length of the variant |
| `id` | `Optional[str]` | Record identifier |
| `ref` | `str` | Reference sequence |
| `alt` | `str` | Alternative sequence |
| `qual` | `Optional[float]` | Quality score for the assertion made in ALT |
| `filter` | `List[str]` | Filter status. `PASS` if this position has passed all filters. Otherwise, it contains the filters that failed |
| `info` | `Dict[str, Any]` | Additional information |
| `format` | `List[str]` | Specifies data types and order of the genotype information |
| `samples` | `Dict[str, Dict[str, Any]]` | Genotype information for each sample |
| `variant_type` | [`VariantType`](#varianttype) | Variant type inferred |
| `alt_sv_breakend` | `Optional[`[`BreakendSVRecord`](#breakendsvrecord)`]` | Breakend SV info, present only for SVs with breakend notation. For example, `G]17:198982]` |
| `alt_sv_shorthand` | `Optional[`[`ShorthandSVRecord`](#shorthandsvrecord)`]` | Shorthand SV info, present only for SVs with shorthand notation. For example, `<DUP:TANDEM>` |
### VariantType
The `VariantType` enum describes the type of the variant. For structural variants, it is inferred **only** from the breakend notation (or shorthand notation). It does not take into account any `INFO` field (neither `SVTYPE` nor `EVENTYPE`) that the variant caller might have added.
| REF | ALT | Variant name | Description |
| ---- | ---------------------------------------- | ------------ | -------------------------------------------------------------------- |
| A | G | SNV | Single nucleotide variant |
| AGTG | A | DEL | Deletion |
| A | A[1:20[ or \<DEL\> | DEL | Deletion |
| A | ACCT or \<INS\> | INS | Insertion |
| A | ]1:20]A or \<DUP\> | DUP | Duplication |
| A | A]1:20] or [1:20[A | INV | Inversion. **[\<INV\> is a special case](#the-special-case-of-inv)** |
| A | \<CNV\> | CNV | Copy number variation |
| A | A]X:20] or A[X:20[ or ]X:20]A or [X:20[A | TRA | Translocation |
| A | A. or .A | SGL | Single breakend |
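
The breakend rows of the table can be read as a small decision procedure. The following is a minimal sketch of that reading, not the package's implementation: `classify_breakend_alt` is a hypothetical helper, and it assumes the record is the lower-coordinate breakend of its pair, as in the table.

```python
def classify_breakend_alt(contig, alt):
    """Classify a breakend-notation ALT following the table above.

    Assumes the record sits at the lower coordinate of the pair.
    Returns the variant type name as a plain string.
    """
    # Single breakends carry a dot instead of a mate locus.
    if alt.startswith('.') or alt.endswith('.'):
        return 'SGL'
    bracket = '[' if '[' in alt else ']'
    # The mate locus sits between the two brackets, e.g. "1:20".
    mate_contig = alt.split(bracket)[1].rsplit(':', 1)[0]
    if mate_contig != contig:
        return 'TRA'
    has_prefix = not alt.startswith(bracket)  # sequence written before the brackets
    if has_prefix and bracket == '[':
        return 'DEL'
    if not has_prefix and bracket == ']':
        return 'DUP'
    return 'INV'


for alt in ['A[1:20[', ']1:20]A', 'A]1:20]', '[1:20[A', 'A]X:20]', 'A.']:
    print(alt, classify_breakend_alt('1', alt))
```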
### BreakendSVRecord
The `BreakendSVRecord` class is a container for the information contained in a VCF record for SVs with breakend notation.
| Property | Type | Description |
| --------- | --------------- | --------------------------------------------------------------------------------------------------------------- |
| `prefix` | `Optional[str]` | Prefix of the SV record with breakend notation. For example, for `G]17:198982]` the prefix will be `G` |
| `bracket` | `str` | Bracket of the SV record with breakend notation. For example, for `G]17:198982]` the bracket will be `]` |
| `contig` | `str` | Contig of the SV record with breakend notation. For example, for `G]17:198982]` the contig will be `17` |
| `pos` | `int` | Position of the SV record with breakend notation. For example, for `G]17:198982]` the position will be `198982` |
| `suffix` | `Optional[str]` | Suffix of the SV record with breakend notation. For example, for `G]17:198982]` the suffix will be `None` |
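
To make the field split concrete, here is a rough sketch that decomposes a breakend `ALT` string into the five properties listed above. The `ParsedBreakend` tuple and `parse_breakend_alt` function are illustrative stand-ins written for this README, not the package's own classes, and the regular expression is an assumption about how such a string can be parsed.

```python
import re
from typing import NamedTuple, Optional


class ParsedBreakend(NamedTuple):
    """Illustrative stand-in mirroring the BreakendSVRecord properties."""
    prefix: Optional[str]
    bracket: str
    contig: str
    pos: int
    suffix: Optional[str]


# prefix, opening bracket, contig, position, closing bracket, suffix
_BREAKEND_RE = re.compile(r'^([^\[\]]*)([\[\]])(.+):(\d+)([\[\]])([^\[\]]*)$')


def parse_breakend_alt(alt):
    """Return a ParsedBreakend, or None if `alt` is not breakend notation."""
    match = _BREAKEND_RE.match(alt)
    if match is None:
        return None
    prefix, bracket, contig, pos, closing, suffix = match.groups()
    if bracket != closing:  # both brackets must point the same way
        return None
    return ParsedBreakend(prefix or None, bracket, contig, int(pos), suffix or None)


parsed = parse_breakend_alt('G]17:198982]')
print(parsed)
```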
### ShorthandSVRecord
The `ShorthandSVRecord` class is a container for the information contained in a VCF record for SVs with shorthand notation.
| Property | Type | Description |
| -------- | ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `type` | `str` | Type of the SV record with shorthand notation. One of the following, `'DEL'`, `'INS'`, `'DUP'`, `'INV'` or `'CNV'`. For example, for `<DUP:TANDEM>` the type will be `DUP` |
| `extra` | `List[str]` | Extra information of the SV. For example, for `<DUP:TANDEM:AA>` the extra will be `['TANDEM', 'AA']` |
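
The split between `type` and `extra` can be sketched in a few lines. `parse_shorthand_alt` below is a hypothetical helper for illustration only; it returns a plain tuple rather than a `ShorthandSVRecord`.

```python
def parse_shorthand_alt(alt):
    """Split a shorthand ALT such as '<DUP:TANDEM>' into (type, extra).

    Returns None when `alt` is not wrapped in angle brackets.
    """
    if not (alt.startswith('<') and alt.endswith('>')):
        return None
    sv_type, *extra = alt[1:-1].split(':')
    return sv_type, extra


shorthand = parse_shorthand_alt('<DUP:TANDEM:AA>')
print(shorthand)
```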
## Homogenization rules
VariantExtractor provides a unified interface to extract variants (including structural variants) from VCF files generated by different variant callers. The variants are homogenized and returned according to the following rules:
### Multiallelic variants
An entry with multiple `ALT` sequences (multiallelic) is divided into multiple entries with a single `ALT` field. These single-`ALT` entries are then processed with the rest of the homogenization rules. For example:
| CHROM | POS | ID | REF | ALT | FILTER |
| ----- | --- | -------------- | --- | --- | ------ |
| 2 | 1 | multiallelic_1 | G | C,T | PASS |
is returned as:
| CHROM | POS | ID | REF | ALT | FILTER | [`VariantType`](#varianttype) |
| ----- | --- | ---------------- | --- | --- | ------ | ----------------------------- |
| 2 | 1 | multiallelic_1_0 | G | C | PASS | SNV |
| 2 | 1 | multiallelic_1_1 | G | T | PASS | SNV |
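
A minimal sketch of this splitting rule, assuming the `_<index>` ID suffix shown in the table; `split_multiallelic` is a hypothetical helper and works on plain strings rather than on `VariantRecord` objects.

```python
def split_multiallelic(record_id, alt_field):
    """Split a comma-separated ALT field into (id, alt) pairs.

    Records with a single ALT keep their ID; multiallelic ones get an
    index suffix, as in the table above.
    """
    alts = alt_field.split(',')
    if len(alts) == 1:
        return [(record_id, alt_field)]
    return [(f'{record_id}_{index}', alt) for index, alt in enumerate(alts)]


split_records = split_multiallelic('multiallelic_1', 'C,T')
print(split_records)
```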
### SNVs
Entries whose `REF` and `ALT` sequences have the same length are treated as SNVs. If the sequences are longer than one nucleotide (MNPs), they are divided into multiple atomic SNVs. For example:
| CHROM | POS | ID | REF | ALT | FILTER |
| ----- | --- | ----- | --- | --- | ------ |
| 2 | 1 | snv_1 | C | G | PASS |
| 2 | 3 | mnp_1 | TAG | AGT | PASS |
are returned as:
| CHROM | POS | ID | REF | ALT | FILTER | [`VariantType`](#varianttype) |
| ----- | --- | ------- | --- | --- | ------ | ----------------------------- |
| 2 | 1 | snv_1 | C | G | PASS | SNV |
| 2 | 3 | mnp_1_0 | T | A | PASS | SNV |
| 2 | 4 | mnp_1_1 | A | G | PASS | SNV |
| 2 | 5 | mnp_1_2 | G | T | PASS | SNV |
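
The MNP decomposition can be sketched as follows. `split_mnp` is a hypothetical helper written for this example; it emits one SNV per position and does not attempt to say how the package treats positions where the `REF` and `ALT` bases happen to match.

```python
def split_mnp(pos, record_id, ref, alt):
    """Split an MNP into atomic SNVs as (pos, id, ref, alt) tuples."""
    if len(ref) != len(alt):
        raise ValueError('REF and ALT must have the same length')
    if len(ref) == 1:
        # Already an atomic SNV: position and ID are left untouched.
        return [(pos, record_id, ref, alt)]
    return [
        (pos + offset, f'{record_id}_{offset}', ref_base, alt_base)
        for offset, (ref_base, alt_base) in enumerate(zip(ref, alt))
    ]


atomic_snvs = split_mnp(3, 'mnp_1', 'TAG', 'AGT')
print(atomic_snvs)
```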
<!-- ### Compound indels
All entries with the `REF/ALT` of different lengths are treated as compound indels (or complex indels). They are left-trimmed and divided into multiple atomic SNVs and an insertion (INS) or a deletion (DEL). If the `REF` sequence is longer than the `ALT` sequence, it is considered a deletion. If the `REF` sequence is shorter than the `ALT` sequence, it is considered an insertion. For example:
| CHROM | POS | ID | REF | ALT | FILTER |
| ----- | ---- | ------------ | ------- | --------- | ------ |
| 1 | 2000 | standard_del | CT | C | PASS |
| 1 | 2100 | standard_ins | C | CAA | PASS |
| 1 | 2200 | compund_del | CCTGAAA | CGA | PASS |
| 1 | 2300 | compund_ins | GT | CAATATATA | PASS |
are returned as:
| CHROM | POS | ID | REF | ALT | FILTER | [`VariantType`](#varianttype) |
| ----- | ---- | ------------- | ----- | -------- | ------ | ----------------------------- |
| 1 | 2000 | standard_del | CT | C | PASS | DEL |
| 1 | 2100 | standard_ins | C | CAA | PASS | INS |
| 1 | 2201 | compund_del_0 | C | G | PASS | SNV |
| 1 | 2202 | compund_del_1 | TGAAA | T | PASS | DEL |
| 1 | 2202 | compund_del_2 | T | A | PASS | SNV |
| 1 | 2300 | compund_ins_0 | G | C | PASS | SNV |
| 1 | 2301 | compund_ins_1 | T | TATATATA | PASS | DEL |
| 1 | 2301 | compund_ins_2 | T | A | PASS | SNV | -->
### Structural variants
VariantExtractor returns one entry per structural variant (one entry per breakend pair). This helps to avoid the ambiguity of the notation and keeps the process deterministic. For this reason, in case of paired breakends, the breakend with the lowest chromosome and/or position is returned. If a breakend is not the lowest chromosome and/or position and is missing its pair, its pair is [inferred and returned](#inferred-breakend-pairs).
#### Breakend vs shorthand notation
Entries with the same information, either described with shorthand or breakend notation, will be returned the same way. Here is an example for a DEL entry:
| CHROM | POS | ID | REF | ALT | FILTER | INFO |
| ----- | ---- | --------- | ----------- | --------- | ------ | -------------------- |
| 1 | 3000 | event_1_o | A | A[1:5000[ | PASS | SVTYPE=BND |
| 1 | 5000 | event_1_h | A | ]1:3000]A | PASS | SVTYPE=BND |
| 1 | 3000 | event_1 | A | A[1:5000[ | PASS | SVTYPE=DEL |
| 1 | 3000 | event_1 | A | \<DEL\> | PASS | SVTYPE=DEL; END=5000 |
| 1 | 3000 | event_1 | AGTCACAA... | A | PASS | |
are each returned as a single entry (each keeping its own `ALT` field), with the same `VariantRecord.end` and `VariantType`:
| CHROM | POS | ID | REF | ALT | FILTER | INFO | [`VariantType`](#varianttype) | [`VariantRecord.end`](#variantrecord) |
| ----- | ---- | ------- | --- | --- | ------ | ---- | ----------------------------- | ------------------------------------- |
| 1 | 3000 | event_1 | A | ... | PASS | ... | DEL | 5000 |
##### The special case of INV<!-- omit in toc -->
\<INV\> is a special case of shorthand notation because it represents two paired breakends. For example, the following shorthand notation:
| CHROM | POS | ID | REF | ALT | FILTER | INFO |
| ----- | ------ | ------- | --- | ------- | ------ | --------------------- |
| 2 | 321682 | event_1 | T | \<INV\> | PASS | SVTYPE=INV;END=421681 |
is equivalent to the following breakends:
| CHROM | POS | ID | REF | ALT | FILTER | INFO | [`VariantType`](#varianttype) |
| ----- | ------ | --------- | --- | ----------- | ------ | ---------- | ----------------------------- |
| 2 | 321681 | event_1_0 | N | N]2:421681] | PASS | SVTYPE=INV | INV |
| 2 | 321682 | event_1_1 | T | [2:421682[T | PASS | SVTYPE=INV | INV |
In this case, VariantExtractor internally converts \<INV\> to two entries with breakend notation (one for each breakend pair). Note that the `N` will be replaced with the correct nucleotide if `fasta_ref` is provided to VariantExtractor.
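
The coordinate arithmetic in the table above can be written out as a short sketch. `expand_inv` is a hypothetical helper that reproduces the example only; it uses `N` as the placeholder base and does not read any reference FASTA.

```python
def expand_inv(contig, pos, record_id, ref, end):
    """Expand an <INV> record into its two breakend entries.

    Returns (contig, pos, id, ref, alt) tuples following the example
    above: the first breakend sits one base before POS and joins END,
    the second sits at POS and joins END + 1.
    """
    first = (contig, pos - 1, f'{record_id}_0', 'N', f'N]{contig}:{end}]')
    second = (contig, pos, f'{record_id}_1', ref, f'[{contig}:{end + 1}[{ref}')
    return [first, second]


inv_breakends = expand_inv('2', 321682, 'event_1', 'T', 421681)
for entry in inv_breakends:
    print(entry)
```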
#### Paired breakends
For **paired breakends**, breakends are paired using the `INFO` fields `MATEID` or `PARID`. If these fields are not available, they are paired using their coordinates (contig+position). The breakend with the lowest chromosome and/or position is returned. For example:
| CHROM | POS | ID | REF | ALT | FILTER | INFO |
| ----- | ---- | --------- | --- | --------- | ------ | ---------- |
| 2 | 3000 | event_1_o | T | ]3:5000]T | PASS | SVTYPE=BND |
| 3 | 5000 | event_1_h | G | G[2:3000[ | PASS | SVTYPE=BND |
| 1 | 3000 | event_2_o | A | A[1:5000[ | PASS | SVTYPE=BND |
| 1 | 5000 | event_2_h | A | ]1:3000]A | PASS | SVTYPE=BND |
are returned as one entry per variant:
| CHROM | POS | ID | REF | ALT | FILTER | INFO | [`VariantType`](#varianttype) |
| ----- | ---- | --------- | --- | --------- | ------ | ---------- | ----------------------------- |
| 2 | 3000 | event_1_o | T | ]3:5000]T | PASS | SVTYPE=BND | TRA |
| 1 | 3000 | event_2_o | A | A[1:5000[ | PASS | SVTYPE=BND | DEL |
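
One way to picture the "lowest breakend" rule is to compare each record's own locus with the mate locus written in its `ALT`. The sketch below does exactly that for the example above. It is not the package's algorithm: `CONTIG_ORDER` is an assumed contig ordering (the source does not say how contigs are compared), and `keep_lowest_breakends` is a hypothetical helper that ignores `MATEID`/`PARID` and missing mates.

```python
import re

# Assumed contig ordering for this example only.
CONTIG_ORDER = ['1', '2', '3']


def mate_coordinates(alt):
    """Extract (contig, pos) of the mate from a breakend ALT."""
    match = re.search(r'[\[\]](.+):(\d+)[\[\]]', alt)
    return match.group(1), int(match.group(2))


def sort_key(contig, pos):
    return CONTIG_ORDER.index(contig), pos


def keep_lowest_breakends(records):
    """Keep records whose own locus is not higher than their mate's locus."""
    kept = []
    for contig, pos, record_id, alt in records:
        mate_contig, mate_pos = mate_coordinates(alt)
        if sort_key(contig, pos) <= sort_key(mate_contig, mate_pos):
            kept.append((contig, pos, record_id, alt))
    return kept


paired_input = [
    ('2', 3000, 'event_1_o', ']3:5000]T'),
    ('3', 5000, 'event_1_h', 'G[2:3000['),
    ('1', 3000, 'event_2_o', 'A[1:5000['),
    ('1', 5000, 'event_2_h', ']1:3000]A'),
]
lowest = keep_lowest_breakends(paired_input)
print([record_id for _, _, record_id, _ in lowest])
```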
#### Inferred breakend pairs
If **all** breakends are missing their pair, the breakends with the lowest chromosome and/or position are inferred and returned. For example:
| CHROM | POS | ID | REF | ALT | FILTER | INFO |
| ----- | ---- | --------- | --- | --------- | ------ | ---------- |
| 3 | 5000 | event_1_h | G | G[2:3000[ | PASS | SVTYPE=BND |
| 1 | 5000 | event_2_h | A | ]1:3000]A | PASS | SVTYPE=BND |
are returned as their inferred breakend pair with the lowest chromosome and/or position:
| CHROM | POS | ID | REF | ALT | FILTER | INFO | [`VariantType`](#varianttype) |
| ----- | ---- | --------- | --- | --------- | ------ | ---------- | ----------------------------- |
| 2 | 3000 | event_1_h | N | ]3:5000]N | PASS | SVTYPE=BND | TRA |
| 1 | 3000 | event_2_h | A | A[1:5000[ | PASS | SVTYPE=BND | DEL |
Note that the `N` will be replaced with the correct nucleotide if `fasta_ref` is provided to VariantExtractor. The following equivalencies are applied:
| CHROM1 | POS1 | REF1 | ALT1 | CHROM2 | POS2 | REF2 | ALT2 |
| ------ | ---- | ---- | -------- | ------ | ---- | ---- | -------- |
| 1 | 500 | N | N[7:800[ | 7 | 800 | N | ]1:500]N |
| 1 | 500 | N | ]7:800]N | 7 | 800 | N | N[1:500[ |
| 1 | 500 | N | [7:800[N | 7 | 800 | N | [1:500[N |
| 1 | 500 | N | N]7:800] | 7 | 800 | N | N]1:500] |
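
The four equivalences in the table can be captured in a single function. The following is a sketch under stated assumptions rather than the package's code: `infer_mate` is a hypothetical helper, the regular expression is an assumed way to parse the notation, and the mate's base defaults to `N` because no reference FASTA is consulted.

```python
import re

# prefix, opening bracket, mate contig, mate position, suffix
_BND = re.compile(r'^([^\[\]]*)([\[\]])(.+):(\d+)[\[\]]([^\[\]]*)$')


def infer_mate(contig, pos, alt, mate_base='N'):
    """Return (contig, pos, ref, alt) of the mate implied by a breakend."""
    prefix, bracket, mate_contig, mate_pos, suffix = _BND.match(alt).groups()
    here = f'{contig}:{pos}'
    if prefix and bracket == '[':    # N[7:800[  ->  ]1:500]N
        mate_alt = f']{here}]{mate_base}'
    elif suffix and bracket == ']':  # ]7:800]N  ->  N[1:500[
        mate_alt = f'{mate_base}[{here}['
    elif suffix and bracket == '[':  # [7:800[N  ->  [1:500[N
        mate_alt = f'[{here}[{mate_base}'
    else:                            # N]7:800]  ->  N]1:500]
        mate_alt = f'{mate_base}]{here}]'
    return mate_contig, int(mate_pos), mate_base, mate_alt


for alt in ['N[7:800[', ']7:800]N', '[7:800[N', 'N]7:800]']:
    print(alt, '->', infer_mate('1', 500, alt))
```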
#### Imprecise paired breakends
The coordinates of imprecise breakends do not match exactly with those of their pair. In this case, they are paired using the `INFO` fields `MATEID` or `PARID`. As with the rest of the variants, only the breakend with the lowest chromosome and/or position is returned for each breakend pair. Note, however, that the `CIPOS` field of the other breakend is lost. For example:
| CHROM | POS | ID | REF | ALT | FILTER | INFO |
| ----- | ---- | --------- | --- | --------- | ------ | -------------------------------------- |
| 2 | 3010 | event_1_o | T | T[3:5000[ | PASS | SVTYPE=BND;CIPOS=0,50;PARID=event_1_h |
| 3 | 5050 | event_1_h | A | ]2:3050]A | PASS | SVTYPE=BND;CIPOS=0,100;PARID=event_1_o |
are paired and the entry with the lowest chromosome and/or position is returned:
| CHROM | POS | ID | REF | ALT | FILTER | INFO | [`VariantType`](#varianttype) |
| ----- | ---- | --------- | --- | --------- | ------ | ------------------------------- | ----------------------------- |
| 2 | 3010 | event_1_o | T | T[3:5000[ | PASS | SVTYPE=BND;CIPOS=0,50;PARID=event_1_h | TRA |
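
ID-based pairing can be sketched with plain dictionaries. Everything here is illustrative: `pair_by_id` is a hypothetical helper, `CONTIG_ORDER` is an assumed contig ordering, and the `MATEID`/`PARID` values are treated as single strings for simplicity. Records whose mate is absent are passed through unchanged, which is a simplification of the rules described in this README.

```python
# Assumed contig ordering for this example only.
CONTIG_ORDER = ['1', '2', '3']


def locus_key(record):
    return CONTIG_ORDER.index(record['contig']), record['pos']


def pair_by_id(records):
    """Pair breakends through MATEID/PARID and keep the lowest of each pair."""
    by_id = {record['id']: record for record in records}
    kept, seen = [], set()
    for record in records:
        if record['id'] in seen:
            continue
        seen.add(record['id'])
        mate_id = record['info'].get('MATEID') or record['info'].get('PARID')
        mate = by_id.get(mate_id)
        if mate is None:
            kept.append(record)  # simplification: no mate inference here
            continue
        seen.add(mate['id'])
        kept.append(min(record, mate, key=locus_key))
    return kept


imprecise_input = [
    {'contig': '2', 'pos': 3010, 'id': 'event_1_o', 'alt': 'T[3:5000[',
     'info': {'SVTYPE': 'BND', 'CIPOS': (0, 50), 'PARID': 'event_1_h'}},
    {'contig': '3', 'pos': 5050, 'id': 'event_1_h', 'alt': ']2:3050]A',
     'info': {'SVTYPE': 'BND', 'CIPOS': (0, 100), 'PARID': 'event_1_o'}},
]
paired = pair_by_id(imprecise_input)
print([record['id'] for record in paired])
```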
#### Single breakends
Single breakends cannot be matched with other breakends because they lack a mate. They might still be matched later in downstream analysis, which is why each one is kept as a separate variant. For example:
| CHROM | POS | ID | REF | ALT | FILTER | INFO |
| ----- | ---- | ------- | --- | --- | ------ | ---------- |
| 2 | 3000 | event_s | T | T. | PASS | SVTYPE=BND |
| 3 | 5000 | event_m | G | .G | PASS | SVTYPE=BND |
are returned as two entries:
| CHROM | POS | ID | REF | ALT | FILTER | INFO | [`VariantType`](#varianttype) |
| ----- | ---- | ------- | --- | --- | ------ | ---------- | ----------------------------- |
| 2 | 3000 | event_s | T | T. | PASS | SVTYPE=BND | SGL |
| 3 | 5000 | event_m | G | .G | PASS | SVTYPE=BND | SGL |
## Dependencies
The dependencies are covered by their own respective licenses as follows:
* [Python/Pysam package](https://github.com/pysam-developers/pysam) (MIT license)
Raw data
{
"_id": null,
"home_page": "https://github.com/EUCANCan/variant-extractor",
"name": "variant-extractor",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.6",
"maintainer_email": null,
"keywords": "vcf genetics bioinformatics variant indel snv sv",
"author": "Rapsssito",
"author_email": "contact@rodrigomartin.dev",
"download_url": "https://files.pythonhosted.org/packages/3b/6d/a5e661263ae424dd21f024786a81b7c7c201a03b25196c1b10ed34e87fe6/variant_extractor-4.0.8.tar.gz",
"platform": null,
"description": "# VariantExtractor<!-- omit in toc -->\n**Deterministic and standard extractor of indels, SNVs and structural variants (SVs)** from VCF files built under the frame of [EUCANCan](https://eucancan.com/)'s second work package. VariantExtractor is a Python package (**requires Python version 3.6 or higher**) and provides a set of data structures and functions to extract variants from VCF files in a **deterministic and standard** way while [adding information](#variantrecord) to facilitate afterwards processing. It homogenizes [multiallelic variants](#multiallelic-variants), [MNPs](#snvs) and [SVs](#structural-variants) (including [imprecise paired breakends](#imprecise-paired-breakends) and [single breakends](#single-breakends)). The package is designed to be used in a pipeline, where the variants are ingested from VCF files and then used in downstream analysis. Check the [available documentation](https://eucancan.github.io/variant-extractor/) for more information.\n\nWhile there is somewhat of an agreement on how to label the SNVs and indels variants, this is not the case for the structural variants. In the current scenario, different labeling between variant callers makes comparisons between structural variants difficult. This package provides an unified interface to extract variants (included structural variants) from VCFs generated by different variant callers. Apart from reading the VCF file, VariantExtractor **adds a preprocessing layer to homogenize the variants** extracted from the file. This way, the variants can be used in downstream analysis in a consistent way. 
For more information about the homogenization process, check the [homogenization rules](#homogenization-rules) section.\n\n\n## Table of contents<!-- omit in toc -->\n- [Getting started](#getting-started)\n - [Installation](#installation)\n- [Usage](#usage)\n- [VariantRecord](#variantrecord)\n - [VariantType](#varianttype)\n - [BreakendSVRecord](#breakendsvrecord)\n - [ShorthandSVRecord](#shorthandsvrecord)\n- [Homogenization rules](#homogenization-rules)\n - [Multiallelic variants](#multiallelic-variants)\n - [SNVs](#snvs)\n - [Structural variants](#structural-variants)\n - [Breakend vs shorthand notation](#breakend-vs-shorthand-notation)\n - [Paired breakends](#paired-breakends)\n - [Inferred breakend pairs](#inferred-breakend-pairs)\n - [Imprecise paired breakends](#imprecise-paired-breakends)\n - [Single breakends](#single-breakends)\n- [Dependencies](#dependencies)\n\n\n## Getting started\n### Installation\nVariantExtractor is available on PyPI and can be installed using `pip`:\n```bash\npip install variant-extractor\n```\n\n## Usage\n```python\n# Import the package\nfrom variant_extractor import VariantExtractor\n\n# Create a new instance of the class\nextractor = VariantExtractor('/path/to/file.vcf')\n# Iterate through the variants\nfor variant_record in extractor:\n print(f'Found variant of type {variant_record.variant_type.name}: {variant_record.contig}:{variant_record.pos}')\n```\n\n```python\n# Import the package\nfrom variant_extractor import VariantExtractor\n\n# Create a new instance of the class\nextractor = VariantExtractor('/path/to/file.vcf')\n\n# Save variants to a CSV file\nextractor.to_dataframe().drop(['variant_record_obj'], axis=1).to_csv('/path/to/output.csv', index=False)\n```\n\nFor a more complete list of examples, check the [examples](./examples/) directory. 
This folder also includes an example of a [script for normalizing VCF files](examples/normalize_vcf.py) following the [homogenization rules](#homogenization-rules).\n\n## VariantRecord\nThe `VariantExtractor` constructor returns a generator of `VariantRecord` instances. The `VariantRecord` class is a container for the information contained in a VCF record plus some extra useful information.\n\n| Property | Type | Description |\n| ------------------ | ------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------- |\n| `contig` | `str` | Contig name |\n| `pos` | `int` | Position on the contig |\n| `end` | `int` | End position of the variant in the contig (same as `pos` for TRA and SNV) |\n| `length` | `int` | Length of the variant |\n| `id` | `Optional[str]` | Record identifier |\n| `ref` | `str` | Reference sequence |\n| `alt` | `str` | Alternative sequence |\n| `qual` | `Optional[float]` | Quality score for the assertion made in ALT |\n| `filter` | `List[str]` | Filter status. `PASS` if this position has passed all filters. Otherwise, it contains the filters that failed |\n| `info` | `Dict[str, Any]` | Additional information |\n| `format` | `List[str]` | Specifies data types and order of the genotype information |\n| `samples` | `Dict[str, Dict[str, Any]]` | Genotype information for each sample |\n| `variant_type` | [`VariantType`](#varianttype) | Variant type inferred |\n| `alt_sv_breakend` | `Optional[`[`BreakendSVRecord`](#brekendsvrecord)`]` | Breakend SV info, present only for SVs with breakend notation. For example, `G]17:198982]` |\n| `alt_sv_shorthand` | `Optional[`[`ShorthandSVRecord`](#shorthandsvrecord)`]` | Shorthand SV info, present only for SVs with shorthand notation. For example, `<DUP:TANDEM>` |\n\n### VariantType\nThe `VariantType` enum describes the type of the variant. 
For structural variants, it is inferred **only** from the breakend notation (or shorthand notation). It does not take into account any `INFO` field (`SVTYPE` nor `EVENTYPE`) that might be added by the variant caller afterwards.\n\n| REF | ALT | Variant name | Description |\n| ---- | ---------------------------------------- | ------------ | -------------------------------------------------------------------- |\n| A | G | SNV | Single nucleotide variant |\n| AGTG | A | DEL | Deletion |\n| A | A[1:20[ or \\<DEL\\> | DEL | Deletion |\n| A | ACCT or \\<INS\\> | INS | Insertion |\n| A | ]1:20]A or \\<DUP\\> | DUP | Duplication |\n| A | A]1:20] or [1:20[A | INV | Inversion. **[\\<INV\\> is a special case](#the-special-case-of-inv)** |\n| A | \\<CNV\\> | CNV | Copy number variation |\n| A | A]X:20] or A[X:20[ or ]X:20]A or [X:20[A | TRA | Translocation |\n| A | A. or .A | SGL | Single breakend |\n\n### BreakendSVRecord\nThe `BreakendSVRecord` class is a container for the information contained in a VCF record for SVs with breakend notation.\n\n| Property | Type | Description |\n| --------- | --------------- | --------------------------------------------------------------------------------------------------------------- |\n| `prefix` | `Optional[str]` | Prefix of the SV record with breakend notation. For example, for `G]17:198982]` the prefix will be `G` |\n| `bracket` | `str` | Bracket of the SV record with breakend notation. For example, for `G]17:198982]` the bracket will be `]` |\n| `contig` | `str` | Contig of the SV record with breakend notation. For example, for `G]17:198982]` the contig will be `17` |\n| `pos` | `int` | Position of the SV record with breakend notation. For example, for `G]17:198982]` the position will be `198982` |\n| `suffix` | `Optional[str]` | Suffix of the SV record with breakend notation. 
For example, for `G]17:198982]` the suffix will be `None` |\n\n### ShorthandSVRecord\nThe `ShorthandSVRecord` class is a container for the information contained in a VCF record for SVs with shorthand notation.\n\n| Property | Type | Description |\n| -------- | ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |\n| `type` | `str` | Type of the SV record with shorthand notation. One of the following, `'DEL'`, `'INS'`, `'DUP'`, `'INV'` or `'CNV'`. For example, for `<DUP:TANDEM>` the type will be `DUP` |\n| `extra` | `List[str]` | Extra information of the SV. For example, for `<DUP:TANDEM:AA>` the extra will be `['TANDEM', 'AA']` |\n\n## Homogenization rules\nVariantExtractor provides a unified interface to extract variants (included structural variants) from VCF files generated by different variant callers. The variants are homogenized and returned applying the following rules:\n\n### Multiallelic variants\nAn entry with multiple `ALT` sequences (multiallelic) is divided into multiple entries with a single `ALT` field. This entries with a single `ALT` field are then processed with the rest of the homogeneization rules. For example:\n\n| CHROM | POS | ID | REF | ALT | FILTER |\n| ----- | --- | -------------- | --- | --- | ------ |\n| 2 | 1 | multiallelic_1 | G | C,T | PASS |\n\nis returned as:\n\n| CHROM | POS | ID | REF | ALT | FILTER | [`VariantType`](#varianttype) |\n| ----- | --- | ---------------- | --- | --- | ------ | ----------------------------- |\n| 2 | 1 | multiallelic_1_0 | G | C | PASS | SNV |\n| 2 | 1 | multiallelic_1_1 | G | T | PASS | SNV |\n\n\n### SNVs\nEntries with `REF/ALT` of the same lenghts are treated like SNVs. If the `REF/ALT` sequences are more than one nucleotide (MNPs), they are divided into multiple atomic SNVs. 
For example:\n\n| CHROM | POS | ID | REF | ALT | FILTER |\n| ----- | --- | ----- | --- | --- | ------ |\n| 2 | 1 | snv_1 | C | G | PASS |\n| 2 | 3 | mnp_1 | TAG | AGT | PASS |\n\nare returned as:\n\n| CHROM | POS | ID | REF | ALT | FILTER | [`VariantType`](#varianttype) |\n| ----- | --- | ------- | --- | --- | ------ | ----------------------------- |\n| 2 | 3 | snv_1 | C | G | PASS | SNV |\n| 2 | 3 | mnp_1_0 | T | A | PASS | SNV |\n| 2 | 4 | mnp_1_1 | A | G | PASS | SNV |\n| 2 | 5 | mnp_1_2 | G | T | PASS | SNV |\n\n<!-- ### Compound indels\nAll entries with the `REF/ALT` of different lengths are treated as compound indels (or complex indels). They are left-trimmed and divided into multiple atomic SNVs and an insertion (INS) or a deletion (DEL). If the `REF` sequence is longer than the `ALT` sequence, it is considered a deletion. If the `REF` sequence is shorter than the `ALT` sequence, it is considered an insertion. For example:\n\n| CHROM | POS | ID | REF | ALT | FILTER |\n| ----- | ---- | ------------ | ------- | --------- | ------ |\n| 1 | 2000 | standard_del | CT | C | PASS |\n| 1 | 2100 | standard_ins | C | CAA | PASS |\n| 1 | 2200 | compund_del | CCTGAAA | CGA | PASS |\n| 1 | 2300 | compund_ins | GT | CAATATATA | PASS |\n\nare returned as:\n\n| CHROM | POS | ID | REF | ALT | FILTER | [`VariantType`](#varianttype) |\n| ----- | ---- | ------------- | ----- | -------- | ------ | ----------------------------- |\n| 1 | 2000 | standard_del | CT | C | PASS | DEL |\n| 1 | 2100 | standard_ins | C | CAA | PASS | INS |\n| 1 | 2201 | compund_del_0 | C | G | PASS | SNV |\n| 1 | 2202 | compund_del_1 | TGAAA | T | PASS | DEL |\n| 1 | 2202 | compund_del_2 | T | A | PASS | SNV |\n| 1 | 2300 | compund_ins_0 | G | C | PASS | SNV |\n| 1 | 2301 | compund_ins_1 | T | TATATATA | PASS | DEL |\n| 1 | 2301 | compund_ins_2 | T | A | PASS | SNV | --> |\n\n\n### Structural variants\nVariantExtractor returns one entry per structural variant (one entry per breakend pair). 
This helps to avoid the ambiguity of the notation and keeps the process deterministic. For this reason, in case of paired breakends, the breakend with the lowest chromosome and/or position is returned. If a breakend is not the lowest chromosome and/or position and is missing its pair, its pair is [inferred and returned](#inferred-breakend-pairs).\n\n#### Breakend vs shorthand notation\nEntries with the same information, either described with shorthand or breakend notation, will be returned the same way. Here is an example for a DEL entry:\n\n| CHROM | POS | ID | REF | ALT | FILTER | INFO |\n| ----- | ---- | --------- | ----------- | --------- | ------ | -------------------- |\n| 1 | 3000 | event_1_o | A | A[1:5000[ | PASS | SVTYPE=BND |\n| 1 | 5000 | event_1_h | A | ]1:3000]A | PASS | SVTYPE=BND |\n| 1 | 3000 | event_1 | A | A[1:5000[ | PASS | SVTYPE=DEL |\n| 1 | 3000 | event_1 | A | \\<DEL\\> | PASS | SVTYPE=DEL; END=5000 |\n| 1 | 3000 | event_1 | AGTCACAA... | A | PASS | |\n\nare returned as one entry (each one of them with their own `ALT` field), but with the same `VariantRecord.end` and `VariantType`:\n\n| CHROM | POS | ID | REF | ALT | FILTER | INFO | [`VariantType`](#varianttype) | [`VariantRecord.end`](#variantrecord) |\n| ----- | ---- | ------- | --- | --- | ------ | ---- | ----------------------------- | ------------------------------------- |\n| 1 | 3000 | event_1 | A | ... | PASS | ... | DEL | 5000 |\n\n##### The special case of INV<!-- omit in toc -->\n\\<INV\\> is a special case of shorthand notation because it represents two paired breakends. 
For example, the following shorthand notation:\n\n| CHROM | POS | ID | REF | ALT | FILTER | INFO |\n| ----- | ------ | ------- | --- | ------- | ------ | --------------------- |\n| 2 | 321682 | event_1 | T | \\<INV\\> | PASS | SVTYPE=INV;END=421681 |\n\nis equivalent to the following breakends:\n\n| CHROM | POS | ID | REF | ALT | FILTER | INFO | [`VariantType`](#varianttype) |\n| ----- | ------ | --------- | --- | ----------- | ------ | ---------- | ----------------------------- |\n| 2 | 321681 | event_1_0 | N | N]2:421681] | PASS | SVTYPE=INV | INV |\n| 2 | 321682 | event_1_1 | T | [2:421682[T | PASS | SVTYPE=INV | INV |\n\nIn this case, VariantExtractor converts internally \\<INV\\> to two entries with breakend notation (one for each breakend pair). Note that the `N` will be replaced with the correct nucleotide if `fasta_ref` is provided to VariantExtractor.\n\n\n#### Paired breakends\nFor **paired breakends**, breakends are paired using the `INFO` fields `MATEID` or `PARID`. If these fields are not available, they are paired using their coordinates (contig+position). The breakend with the lowest chromosome and/or position is returned. 
For example:\n\n| CHROM | POS | ID | REF | ALT | FILTER | INFO |\n| ----- | ---- | --------- | --- | --------- | ------ | ---------- |\n| 2 | 3000 | event_1_o | T | ]3:5000]T | PASS | SVTYPE=BND |\n| 3 | 5000 | event_1_h | G | G[2:3000[ | PASS | SVTYPE=BND |\n| 1 | 3000 | event_2_o | A | A[1:5000[ | PASS | SVTYPE=BND |\n| 1 | 5000 | event_2_h | A | ]1:3000]A | PASS | SVTYPE=BND |\n\nare returned as one entry per variant:\n\n| CHROM | POS | ID | REF | ALT | FILTER | INFO | [`VariantType`](#varianttype) |\n| ----- | ---- | --------- | --- | --------- | ------ | ---------- | ----------------------------- |\n| 2 | 3000 | event_1_o | T | ]3:5000]T | PASS | SVTYPE=BND | TRA |\n| 1 | 3000 | event_2_o | A | A[1:5000[ | PASS | SVTYPE=BND | DEL |\n\n\n#### Inferred breakend pairs\nIf **all** breakends are missing their pair, the breakends with the lowest chromosome and/or position are inferred and returned. For example:\n\n| CHROM | POS | ID | REF | ALT | FILTER | INFO |\n| ----- | ---- | --------- | --- | --------- | ------ | ---------- |\n| 3 | 5000 | event_1_h | G | G[2:3000[ | PASS | SVTYPE=BND |\n| 1 | 5000 | event_2_h | A | ]1:3000]A | PASS | SVTYPE=BND |\n\nare returned as their inferred breakend pair with the lowest chromosome and/or position:\n\n| CHROM | POS | ID | REF | ALT | FILTER | INFO | [`VariantType`](#varianttype) |\n| ----- | ---- | --------- | --- | --------- | ------ | ---------- | ----------------------------- |\n| 2 | 3000 | event_1_h | N | ]3:5000]N | PASS | SVTYPE=BND | TRA |\n| 1 | 3000 | event_2_h | A | A[1:5000[ | PASS | SVTYPE=BND | DEL |\n\nNote that the `N` will be replaced with the correct nucleotide if `fasta_ref` is provided to VariantExtractor. 
The following equivalences are applied:\n\n| CHROM1 | POS1 | REF1 | ALT1 | CHROM2 | POS2 | REF2 | ALT2 |\n| ------ | ---- | ---- | -------- | ------ | ---- | ---- | -------- |\n| 1 | 500 | N | N[7:800[ | 7 | 800 | N | ]1:500]N |\n| 1 | 500 | N | ]7:800]N | 7 | 800 | N | N[1:500[ |\n| 1 | 500 | N | [7:800[N | 7 | 800 | N | [1:500[N |\n| 1 | 500 | N | N]7:800] | 7 | 800 | N | N]1:500] |\n \n\n#### Imprecise paired breakends\nThe coordinates of imprecise breakends do not exactly match those of their pair. In this case, they are paired using the `INFO` fields `MATEID` or `PARID`. As with the other variants, for each breakend pair, only the breakend with the lowest chromosome and/or position is returned. Note, however, that the `CIPOS` field of the other breakend is lost. For example:\n\n| CHROM | POS | ID | REF | ALT | FILTER | INFO |\n| ----- | ---- | --------- | --- | --------- | ------ | -------------------------------------- |\n| 2 | 3010 | event_1_o | T | T[3:5000[ | PASS | SVTYPE=BND;CIPOS=0,50;PARID=event_1_h |\n| 3 | 5050 | event_1_h | A | ]2:3050]A | PASS | SVTYPE=BND;CIPOS=0,100;PARID=event_1_o |\n\nare paired and the entry with the lowest chromosome and/or position is returned:\n\n| CHROM | POS | ID | REF | ALT | FILTER | INFO | [`VariantType`](#varianttype) |\n| ----- | ---- | --------- | --- | --------- | ------ | ------------------------------------- | ----------------------------- |\n| 2 | 3010 | event_1_o | T | T[3:5000[ | PASS | SVTYPE=BND;CIPOS=0,50;PARID=event_1_h | TRA |\n\n\n#### Single breakends\nSingle breakends cannot be matched with other breakends because they lack a mate. They may still be matched later in downstream analysis, which is why each one is kept as a separate variant. For example:\n\n| CHROM | POS | ID | REF | ALT | FILTER | INFO |\n| ----- | ---- | ------- | --- | --- | ------ | ---------- |\n| 2 | 3000 | event_s | T | T. 
| PASS | SVTYPE=BND |\n| 3 | 5000 | event_m | G | .G | PASS | SVTYPE=BND |\n\nare returned as two entries:\n\n| CHROM | POS | ID | REF | ALT | FILTER | INFO | [`VariantType`](#varianttype) |\n| ----- | ---- | ------- | --- | --- | ------ | ---------- | ----------------------------- |\n| 2 | 3000 | event_s | T | T. | PASS | SVTYPE=BND | SGL |\n| 3 | 5000 | event_m | G | .G | PASS | SVTYPE=BND | SGL |\n\n\n## Dependencies\n\nThe dependencies are covered by their own respective licenses as follows:\n\n* [Python/Pysam package](https://github.com/pysam-developers/pysam) (MIT license)\n",
"bugtrack_url": null,
"license": null,
"summary": "Deterministic and standard extractor of indels, SNVs and structural variants (SVs) from VCF files",
"version": "4.0.8",
"project_urls": {
"Bug Tracker": "https://github.com/EUCANCan/variant-extractor/issues",
"Homepage": "https://github.com/EUCANCan/variant-extractor"
},
"split_keywords": [
"vcf",
"genetics",
"bioinformatics",
"variant",
"indel",
"snv",
"sv"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "8eec756c263431a75ae1bdc030a8a6399ee75218584316d3d8312d51d1a004c3",
"md5": "14dcbfc1625063302843f052186bbe33",
"sha256": "faeb66ea360887e506e1f80225b9d009b2127db69d62f061567aebaa5900539e"
},
"downloads": -1,
"filename": "variant_extractor-4.0.8-py3-none-any.whl",
"has_sig": false,
"md5_digest": "14dcbfc1625063302843f052186bbe33",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.6",
"size": 17659,
"upload_time": "2024-07-17T08:31:03",
"upload_time_iso_8601": "2024-07-17T08:31:03.493345Z",
"url": "https://files.pythonhosted.org/packages/8e/ec/756c263431a75ae1bdc030a8a6399ee75218584316d3d8312d51d1a004c3/variant_extractor-4.0.8-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "3b6da5e661263ae424dd21f024786a81b7c7c201a03b25196c1b10ed34e87fe6",
"md5": "86db1f48614e77cec4eafa95af91b1ae",
"sha256": "ea7f3cf6dc27afa4f747d1aba5430fa2b107aa18e1b34ebdcd1f444b3cb0025e"
},
"downloads": -1,
"filename": "variant_extractor-4.0.8.tar.gz",
"has_sig": false,
"md5_digest": "86db1f48614e77cec4eafa95af91b1ae",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.6",
"size": 21298,
"upload_time": "2024-07-17T08:31:05",
"upload_time_iso_8601": "2024-07-17T08:31:05.229874Z",
"url": "https://files.pythonhosted.org/packages/3b/6d/a5e661263ae424dd21f024786a81b7c7c201a03b25196c1b10ed34e87fe6/variant_extractor-4.0.8.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-07-17 08:31:05",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "EUCANCan",
"github_project": "variant-extractor",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [
{
"name": "pysam",
"specs": [
[
">=",
"0.11.2.2"
]
]
}
],
"lcname": "variant-extractor"
}