vcf-to-duckdb

Name	vcf-to-duckdb
Version	0.2.0
upload_time	2025-01-02 16:00:39
author	Devin McCabe
requires_python	<4.0,>=3.12

VCF to DuckDB Converter
---

A Python module for converting a VCF (Variant Call Format) file to a DuckDB database (exported as Parquet files and an accompanying SQL schema).

## Features

- Efficient multithreaded batch processing for large files
- Infers data types from VCF headers
- Parses and separates data in compound INFO fields (e.g. from VEP, SnpEff, etc.)
- URL-decodes specified fields and detects fields that still need decoding

## Installation

1. Install the required system dependencies:
    - [pyenv](https://github.com/pyenv/pyenv)
    - [Poetry](https://python-poetry.org/)
    - [bcftools](https://samtools.github.io/bcftools/bcftools.html)

2. Install the required Python version (developed with 3.12.3, but other 3.12+ versions should work):
   ```shell
   pyenv install "$(cat .python-version)"
   ```

3. Confirm that `python` maps to the correct version:
   ```shell
   python --version
   ```

4. Set the Poetry interpreter and install the Python dependencies:
   ```shell
   poetry env use "$(pyenv which python)"
   poetry install
   ```

A `requirements.txt` file is also available and kept in sync with the Poetry dependencies in case you don't want to use Poetry. Alternatively, you can use arret via Docker: `docker pull dmccabe606/arret:latest`.

## Usage

```python
from pathlib import Path

from vcf_to_duckdb.convert_utils import convert

# Convert a VCF file to DuckDB
convert(
    vcf_path=Path("input.vcf.gz"),
    db_path=Path("output.db"),
    parquet_dir_path=Path("output_parquet"),
    multiallelics=True,
    compound_info_fields={"CSQ"},
    url_encoded_col_name_regexes=["field_."],
)
```

## Database Schema

The converter creates the following main tables:

### variants
- `vid`: Unique variant identifier (Primary Key)
- `chrom`: Chromosome
- `pos`: Position
- `id`: Variant ID from the VCF `ID` column
- `ref`: Reference allele
- `alt`: Alternative allele
- `qual`: Quality score
- `filters`: Filter array

### vals_info
- `vid`: Variant identifier (Foreign Key)
- `kind`: Field type ('value' or 'info')
- `k`: Field key
- Various value columns for different data types:
  - `v_boolean`, `v_varchar`, `v_integer`, `v_float`, `v_json`
  - Array versions: `v_boolean_arr`, `v_varchar_arr`, etc.
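
To make the layout concrete, the sketch below builds a `vals_info`-style row in which exactly one typed value column is populated. The helpers `column_for` and `make_vals_info_row` are hypothetical, and the `_arr` names beyond `v_boolean_arr` and `v_varchar_arr` are assumed from the "etc." above; this is not the package's own code.

```python
import json

# Scalar Python types mapped to the value columns listed above.
# bool must be checked before int because bool is a subclass of int.
SCALAR_COLUMNS = [
    (bool, "v_boolean"),
    (int, "v_integer"),
    (float, "v_float"),
    (str, "v_varchar"),
]


def column_for(value):
    """Return the name of the value column that would hold `value`."""
    if isinstance(value, dict):
        return "v_json"
    if isinstance(value, list):
        if not value:
            raise ValueError("cannot infer a column for an empty list")
        # Assumed naming: the array column is the scalar column plus "_arr".
        return column_for(value[0]) + "_arr"
    for py_type, column in SCALAR_COLUMNS:
        if isinstance(value, py_type):
            return column
    raise TypeError(f"unsupported value type: {type(value).__name__}")


def make_vals_info_row(vid, kind, k, value):
    """Build a row dict with one populated value column (hypothetical helper)."""
    column = column_for(value)
    stored = json.dumps(value) if column == "v_json" else value
    return {"vid": vid, "kind": kind, "k": k, column: stored}


row = make_vals_info_row(1, "info", "DP", 42)
print(row)  # {'vid': 1, 'kind': 'info', 'k': 'DP', 'v_integer': 42}
```

Storing each value in a column of its own type keeps the table queryable with ordinary typed comparisons, at the cost of many NULL columns per row.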

## Data Type Handling

The converter automatically maps VCF data types to appropriate DuckDB types:
- Integer → INTEGER
- Float → FLOAT
- String → VARCHAR
- Character → VARCHAR
- Flag → BOOLEAN
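
A minimal sketch of how header-driven type inference could work, assuming standard `##INFO=<ID=...,Number=...,Type=...>` header lines. The `VCF_TO_DUCKDB` dict mirrors the mapping above. The function `infer_info_types`, and the rule that any `Number` other than `0` or `1` becomes a list type, are assumptions for illustration rather than the converter's actual logic.

```python
import re

VCF_TO_DUCKDB = {
    "Integer": "INTEGER",
    "Float": "FLOAT",
    "String": "VARCHAR",
    "Character": "VARCHAR",
    "Flag": "BOOLEAN",
}

# Matches the ID, Number, and Type attributes at the start of an INFO header line.
INFO_HEADER = re.compile(
    r"^##INFO=<ID=(?P<id>[^,]+),Number=(?P<number>[^,]+),Type=(?P<type>[^,>]+)"
)


def infer_info_types(header_lines):
    """Map each INFO field ID to a DuckDB type name (hypothetical helper)."""
    types = {}
    for line in header_lines:
        match = INFO_HEADER.match(line)
        if match is None:
            continue  # not an INFO declaration
        duckdb_type = VCF_TO_DUCKDB[match["type"]]
        # Assumption: multi-valued fields (Number=A, R, G, ., 2, ...) become lists.
        if match["number"] not in ("0", "1"):
            duckdb_type += "[]"
        types[match["id"]] = duckdb_type
    return types


header = [
    "##fileformat=VCFv4.2",
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total depth">',
    '##INFO=<ID=AF,Number=A,Type=Float,Description="Allele frequency">',
    '##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership">',
]
print(infer_info_types(header))
# {'DP': 'INTEGER', 'AF': 'FLOAT[]', 'DB': 'BOOLEAN'}
```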

Compound fields (like CSQ) are stored as JSON objects.
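
As an illustration of what "stored as JSON objects" can mean for a VEP annotation: VEP lists the subfield names in the `CSQ` header description after `Format: `, separates subfields with `|`, and separates multiple annotations with commas. The sketch below relies on that convention; `csq_subfields` and `parse_csq` are hypothetical helpers, not functions exported by this package.

```python
import json


def csq_subfields(description):
    """Extract subfield names from a VEP-style CSQ header description."""
    _, _, fmt = description.partition("Format: ")
    return fmt.strip().split("|")


def parse_csq(value, subfields):
    """Turn a raw CSQ value into a list of dicts, one per annotation."""
    records = []
    for annotation in value.split(","):
        parts = annotation.split("|")
        # Empty subfields become None so they serialize as JSON null.
        records.append(
            {name: (part or None) for name, part in zip(subfields, parts)}
        )
    return records


description = (
    "Consequence annotations from Ensembl VEP. "
    "Format: Allele|Consequence|SYMBOL"
)
subfields = csq_subfields(description)
raw = "T|missense_variant|BRCA1,T|intron_variant|"
print(json.dumps(parse_csq(raw, subfields)))
# [{"Allele": "T", "Consequence": "missense_variant", "SYMBOL": "BRCA1"}, {"Allele": "T", "Consequence": "intron_variant", "SYMBOL": null}]
```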

## Batch Processing

For large VCF files, the converter processes data in batches (default 100,000 variants per batch) to manage memory usage efficiently.
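
The batching idea can be sketched with a small generator that pulls a fixed number of records at a time from any iterable, so only one batch is held in memory at once. The name `iter_batches` is an assumption for this example; the package's own batching code may differ.

```python
from itertools import islice


def iter_batches(records, batch_size=100_000):
    """Yield lists of up to `batch_size` items from any iterable."""
    iterator = iter(records)
    # islice consumes the shared iterator, so each call resumes where the last stopped.
    while batch := list(islice(iterator, batch_size)):
        yield batch


sizes = [len(batch) for batch in iter_batches(range(250_000))]
print(sizes)  # [100000, 100000, 50000]
```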

## URL Decoding

The tool can automatically URL-decode specified columns based on regular expressions matching column names. This is useful for fields containing URL-encoded data.
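
A minimal sketch of regex-driven decoding using only the standard library. It assumes column names are matched with `re.search` and that a leftover `%XX` escape marks a value that still needs decoding; both helpers (`decode_matching_columns`, `columns_needing_decoding`) are hypothetical and may not match the package's behavior.

```python
import re
from urllib.parse import unquote

# A percent sign followed by two hex digits suggests URL-encoded data.
PERCENT_ESCAPE = re.compile(r"%[0-9A-Fa-f]{2}")


def decode_matching_columns(row, col_name_regexes):
    """URL-decode string values whose column name matches any regex."""
    patterns = [re.compile(regex) for regex in col_name_regexes]
    decoded = {}
    for name, value in row.items():
        if isinstance(value, str) and any(p.search(name) for p in patterns):
            value = unquote(value)
        decoded[name] = value
    return decoded


def columns_needing_decoding(row):
    """List columns whose string values still contain percent escapes."""
    return [
        name
        for name, value in row.items()
        if isinstance(value, str) and PERCENT_ESCAPE.search(value)
    ]


row = {"field_a": "p.Arg12%3DGly", "note": "50%25 of reads", "dp": 42}
decoded = decode_matching_columns(row, ["field_."])
print(decoded)
# {'field_a': 'p.Arg12=Gly', 'note': '50%25 of reads', 'dp': 42}
print(columns_needing_decoding(decoded))
# ['note']
```

The second helper is a heuristic: a literal string such as `%20` that was never URL-encoded would also be flagged.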

