varseek


- Name: varseek
- Version: 0.1.1
- Homepage: https://github.com/pachterlab/varseek
- Summary: Efficient variant screening of RNA-seq and DNA-seq data using k-mer-based alignment against a reference of known variants.
- Upload time: 2025-08-12 08:17:59
- Requires Python: >=3.9
- License: BSD 2-Clause License Copyright (c) 2024, Pachter Lab Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- Keywords: varseek, bioinformatics, variant-analysis, k-mer, rna-seq, dna-seq
- Requirements: numpy (>=1.26.4, <2.0.0), pandas (>=1.5.3, <2.0.0), matplotlib (>=3.9.0), tqdm (>=4.66.4), anndata (>=0.8.0), kb-python (>=0.29.3), gget (>=0.29.1), scipy (>=1.13.1), pyfastx (>=2.1.0), pysam (>=0.22.1), pyarrow (>=19.0.1)

# varseek
[![pypi version](https://img.shields.io/pypi/v/varseek)](https://pypi.org/project/varseek)
![Downloads](https://static.pepy.tech/personalized-badge/varseek?period=total&units=international_system&left_color=grey&right_color=brightgreen&left_text=downloads)
[![license](https://img.shields.io/pypi/l/varseek)](LICENSE)
![status](https://github.com/pachterlab/varseek/actions/workflows/ci.yml/badge.svg)
![Code Coverage](https://img.shields.io/badge/Coverage-83%25-green.svg)

<!--[![image](https://anaconda.org/bioconda/varseek/badges/version.svg)](https://anaconda.org/bioconda/varseek)-->
<!--[![Conda](https://img.shields.io/conda/dn/bioconda/varseek?logo=Anaconda)](https://anaconda.org/bioconda/varseek)-->

![varseek logo](https://github.com/pachterlab/varseek/blob/main/figures/logo.png?raw=true)

`varseek` is a free, open-source command-line tool and Python package that provides variant screening of RNA-seq and DNA-seq data using k-mer-based alignment against a reference of known variants. The name comes from "seeking variants" or, alternatively, "seeing k-variants" (where a "k-variant" is defined as a k-mer containing a variant).
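
To make the "k-variant" definition concrete, here is a small, self-contained illustration that lists every k-mer overlapping a single substituted base. It is a toy sketch only: the function name, the example sequence, and the 0-based indexing are assumptions made for this example, and it does not reproduce how `varseek` builds its reference.

```python
def kmers_containing_variant(sequence: str, variant_pos: int, k: int) -> list[str]:
    """Return every k-mer of `sequence` that overlaps the 0-based index `variant_pos`."""
    if not 0 <= variant_pos < len(sequence):
        raise ValueError("variant_pos must index into sequence")
    if k < 1 or k > len(sequence):
        raise ValueError("k must be between 1 and len(sequence)")
    # A k-mer starting at `start` covers indices start .. start + k - 1,
    # so it contains the variant when start <= variant_pos <= start + k - 1.
    first_start = max(0, variant_pos - k + 1)
    last_start = min(variant_pos, len(sequence) - k)
    return [sequence[s:s + k] for s in range(first_start, last_start + 1)]


# Toy example (hypothetical sequence): the base at index 4 is substituted with "T".
reference = "ACGTACGTAC"
variant_pos = 4
mutated = reference[:variant_pos] + "T" + reference[variant_pos + 1:]

k_variants = kmers_containing_variant(mutated, variant_pos, k=3)
print(k_variants)  # ['GTT', 'TTC', 'TCG']
```

A substitution away from the sequence ends is covered by exactly k overlapping k-mers, which is why three 3-mers are returned here.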
  
![Simplified overview of the varseek workflow](https://github.com/pachterlab/varseek/blob/main/figures/varseek_overview_simple.png?raw=true)

The two commands used in a standard workflow are `varseek ref` and `varseek count`. `varseek ref` generates a variant-containing reference sequence (VCRS) index that serves as the basis for variant calling. `varseek count` pseudoaligns RNA-seq or DNA-seq reads against the VCRS index and generates a variant count matrix, which can be used for downstream analysis. Each of these two commands wraps other commands from the varseek and kb-python packages, as described below.

![alt text](https://github.com/pachterlab/varseek/blob/main/figures/varseek_overview.png?raw=true)

The functions of `varseek` are described in the table below.

| Description                                                       | Bash              | Python (with `import varseek as vk`) |
|-------------------------------------------------------------------|-------------------|--------------------------------------|
| Build a variant-containing reference sequence (VCRS) fasta file   | `vk build ...`    | `vk.build(...)`                      |
| Describe the VCRS reference in a dataframe for filtering          | `vk info ...`     | `vk.info(...)`                       |
| Filter the VCRS file based on the CSV generated from varseek info | `vk filter ...`   | `vk.filter(...)`                     |
| Preprocess the FASTQ files before pseudoalignment                 | `vk fastqpp ...`  | `vk.fastqpp(...)`                    |
| Process the variant count matrix                                  | `vk clean ...`    | `vk.clean(...)`                      |
| Analyze the variant count matrix results                          | `vk summarize ...`| `vk.summarize(...)`                  |
| Wrap vk build, vk info, vk filter, and kb ref                     | `vk ref ...`      | `vk.ref(...)`                        |
| Wrap vk fastqpp, kb count, vk clean, and vk summarize             | `vk count ...`    | `vk.count(...)`                      |
| Create synthetic RNA-seq dataset with variant-containing reads    | `vk sim ...`      | `vk.sim(...)`                        |
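
The wrapper relationships in the last rows of the table can be written down as a small lookup. The sketch below only restates the table: the dictionary and the `expand` helper are illustrative names, not part of the `varseek` API, and the step order is assumed to match the order in which the table lists the wrapped commands.

```python
# Wrapper command -> the commands it wraps, as listed in the table above.
WRAPPED_STEPS = {
    "vk ref": ["vk build", "vk info", "vk filter", "kb ref"],
    "vk count": ["vk fastqpp", "kb count", "vk clean", "vk summarize"],
}


def expand(commands):
    """Replace each wrapper command with the steps it wraps; keep other commands as-is."""
    expanded = []
    for command in commands:
        expanded.extend(WRAPPED_STEPS.get(command, [command]))
    return expanded


standard_workflow = expand(["vk ref", "vk count"])
print(" -> ".join(standard_workflow))
```

Reading the standard workflow this way shows which individual commands to reach for when one stage of `vk ref` or `vk count` needs to be customized.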

After aligning and generating a variant count matrix with `varseek`, you can explore the data using our pre-built notebooks. The notebooks are described in the table below.

| Description                                   | Notebook                                                                 |
|-----------------------------------------------|--------------------------------------------------------------------------------------|
| Preprocessing the variant count matrix        | [3_matrix_preprocessing.ipynb](./3_matrix_preprocessing.ipynb)                       |
| Sequence visualization of variants            | [4_1_variant_analysis_sequence_visualization.ipynb](./4_1_variant_analysis_sequence_visualization.ipynb) |
| Heatmap visualization of variant patterns     | [4_2_variant_analysis_heatmaps.ipynb](./4_2_variant_analysis_heatmaps.ipynb)       |
| Protein-level variant analysis                | [4_3_variant_analysis_protein_variant.ipynb](./4_3_variant_analysis_protein_variant.ipynb) |
| Heatmap analysis of gene expression           | [5_1_gene_analysis_heatmaps.ipynb](./5_1_gene_analysis_heatmaps.ipynb)               |
| Drug-target analysis for genes                | [5_2_gene_analysis_drugs.ipynb](./5_2_gene_analysis_drugs.ipynb)                     |
| Pathway analysis using Enrichr                | [6_1_pathway_analysis_enrichr.ipynb](./6_1_pathway_analysis_enrichr.ipynb)           |
| Gene Ontology enrichment analysis (GOEA)      | [6_2_pathway_analysis_goea.ipynb](./6_2_pathway_analysis_goea.ipynb)                 |

You can find more examples of how to use varseek in the GitHub repository for our preprint [GitHub - pachterlab/RLSRWP_2025](https://github.com/pachterlab/RLSRWP_2025.git).

    
If you use `varseek` in a publication, please cite the following study:    
```
PAPER CITATION
```
Read the article here: PAPER DOI  

# Installation
```bash
pip install varseek
```
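
The package metadata declares `requires_python >=3.9`, so it can be worth confirming the interpreter version before installing. The check below is a minimal sketch; the helper name `python_is_supported` is hypothetical and not something `varseek` provides.

```python
import sys

# Minimum version taken from the package metadata (requires_python >=3.9).
MIN_PYTHON = (3, 9)


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True when the (major, minor) version meets the declared minimum."""
    return tuple(version_info[:2]) >= minimum


if python_is_supported():
    print("This interpreter meets the declared minimum for varseek.")
else:
    print("Python 3.9 or newer is needed before running `pip install varseek`.")
```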

# 🪄 Quick start guide
## 1. Acquire a Reference

Choose one of the options below:

### a. Download a Pre-built Reference
- (optional) View all downloadable references: `vk ref --list_downloadable_references`
- `vk ref --download --variants VARIANTS --sequences SEQUENCES`

### b. Make a custom reference – screen for user-defined variants
- `vk ref --variants VARIANTS --sequences SEQUENCES ...`

### c. Customize the reference-building process – adjust the VCRS filtering (e.g., add information by which to filter, add custom filtering logic, or tune filtering parameters based on the results of intermediate steps)
- `vk build --variants VARIANTS --sequences SEQUENCES ...`
- (optional) `vk info --input_dir INPUT_DIR ...`
- (optional) `vk filter --input_dir INPUT_DIR ...`
- `kb ref --workflow custom --index INDEX ...`
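
The steps of option c can be assembled programmatically, which makes it easy to toggle the optional ones. The following is a sketch under stated assumptions: the file names are hypothetical, the flags are copied verbatim from the bullets above without being checked against each tool's `--help`, and the arguments elided as `...` are not filled in. For that reason the commands are only printed, not executed.

```python
import shlex


def reference_build_commands(variants, sequences, input_dir, index,
                             run_info=True, run_filter=True):
    """Return the option-c steps as argv lists, in the order listed above."""
    commands = [["vk", "build", "--variants", variants, "--sequences", sequences]]
    if run_info:    # optional step
        commands.append(["vk", "info", "--input_dir", input_dir])
    if run_filter:  # optional step
        commands.append(["vk", "filter", "--input_dir", input_dir])
    commands.append(["kb", "ref", "--workflow", "custom", "--index", index])
    return commands


# Hypothetical file and directory names, used only for illustration.
commands = reference_build_commands(
    variants="variants.csv",
    sequences="sequences.fasta",
    input_dir="vcrs_out",
    index="vcrs_index.idx",
    run_filter=False,  # skip the optional filtering step in this example
)
for argv in commands:
    print(shlex.join(argv))
```

Keeping each command as an argv list rather than a single string avoids shell-quoting problems if a path contains spaces.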


## 2. Screen for variants

Choose one of the options below:

### a. Standard workflow
- (optional) FASTQ quality control
- `vk count --index INDEX --t2g T2G ... --fastqs FASTQ1 FASTQ2...`

### b. Customize the variant screening process – additional FASTQ preprocessing and custom count matrix processing
- (optional) FASTQ quality control
- (optional) `vk fastqpp ... --fastqs FASTQ1 FASTQ2...`
- `kb count --index INDEX --t2g T2G ... --fastqs FASTQ1 FASTQ2...`
- (optional) `kb count --index REFERENCE_INDEX --t2g REFERENCE_T2G ... --fastqs FASTQ1 FASTQ2...`
- (optional) `vk clean --adata ADATA ...`
- (optional) `vk summarize --adata ADATA ...`
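
When the steps of option b are chained in a script, it helps to stop at the first failing command and to preview the commands before running anything. The helper below is one possible sketch: `run_steps` is a hypothetical name, the file names are placeholders, the flags mirror the bullets above, and the arguments elided as `...` are left out. It is therefore shown in dry-run mode only.

```python
import shlex
import subprocess


def run_steps(steps, dry_run=True):
    """Process argv lists in order and return them as shell-quoted strings.

    With dry_run=True the commands are only printed. Otherwise each one is
    executed, and a non-zero exit status raises CalledProcessError so that
    later steps do not run on incomplete output.
    """
    rendered = []
    for argv in steps:
        line = shlex.join(argv)
        rendered.append(line)
        if dry_run:
            print("[dry run]", line)
        else:
            subprocess.run(argv, check=True)
    return rendered


# Placeholder inputs; real runs need the arguments elided as "..." above.
fastqs = ["sample_R1.fastq.gz", "sample_R2.fastq.gz"]
screening_steps = [
    ["vk", "fastqpp", "--fastqs", *fastqs],
    ["kb", "count", "--index", "vcrs_index.idx", "--t2g", "vcrs_t2g.txt", "--fastqs", *fastqs],
    ["vk", "clean", "--adata", "adata.h5ad"],
    ["vk", "summarize", "--adata", "adata.h5ad"],
]
rendered = run_steps(screening_steps, dry_run=True)
```

Dropping an optional step is then a matter of removing its entry from `screening_steps`.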


- **Examples for getting started:** [GitHub - pachterlab/varseek-examples](https://github.com/pachterlab/varseek-examples.git)
- **Manuscript**: ...
- **Repository for manuscript figures**: [GitHub - pachterlab/RLSRP_2025](https://github.com/pachterlab/RLSRP_2025.git)

            
