vfind

Name	vfind
Version	0.2.0
home_page	None
Summary	Simple variant finding utilities for NGS data
upload_time	2025-02-24 23:07:26
maintainer	None
docs_url	None
author	None
requires_python	>=3.10
license	None
keywords	bioinformatics sequence analysis ngs variant finding
requirements	No requirements were recorded.

# vFind

[![PyPI - Version](https://img.shields.io/pypi/v/vfind)](https://pypi.org/project/vfind/)

*A simple variant finder for NGS data.*

1. [Introduction](#introduction)
2. [Installation](#installation)
3. [Examples](#examples)
4. [Contributing](#contributing)
5. [License](#license)

## Introduction

Unlike a traditional [*variant caller*](https://gencore.bio.nyu.edu/variant-calling-pipeline-gatk4/),
vFind uses a simpler algorithm that is *usually* sufficient for
screening experiments. The main use case is finding variants in a library that
has constant adapter sequences flanking a variable region.

This simple algorithm is summarized as:

1. Define a pair of adapter sequences that flank the variable region.
2. For each fastq read, search for exact matches of these adapters.
3. If both adapters are found exactly, recover the variable region.
4. For each adapter without an exact match, optionally perform semi-global alignment between that adapter and the read (see the [alignment parameters](#using-custom-alignment-parameters) section).
5. If the alignment score meets a set threshold, that adapter is considered to match.
6. If both adapters are matched, either exactly or partially, recover the variable region.
7. Otherwise, continue to the next read.
8. Finally, translate the variable region to its amino acid sequence and filter out any sequences with partial codons (optional; see the [miscellaneous](#miscellaneous) section).
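
The exact-match path of these steps can be sketched in plain Python. This is only an illustration, not vFind's actual implementation (which is written in Rust): the function names `recover_variable_region`, `translate`, and `count_variants` are hypothetical, the standard genetic code is assumed for translation, and the alignment fallback is left out.

```python
from collections import Counter
from itertools import product

# Standard genetic code (NCBI table 1), with bases ordered T, C, A, G.
# Assumption: vFind's own translation table may differ.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def recover_variable_region(read, prefix, suffix):
    """Return the region between exact matches of both adapters, or None."""
    start = read.find(prefix)
    if start == -1:
        return None
    start += len(prefix)
    end = read.find(suffix, start)  # the suffix must come after the prefix
    if end == -1:
        return None
    return read[start:end]


def translate(dna):
    """Translate DNA to amino acids; return None for empty, partial, or unknown codons."""
    if not dna or len(dna) % 3 != 0:
        return None
    codons = (dna[i:i + 3] for i in range(0, len(dna), 3))
    residues = [CODON_TABLE.get(codon) for codon in codons]
    return None if None in residues else "".join(residues)


def count_variants(reads, adapters):
    """Count translated variable regions across an iterable of read sequences."""
    prefix, suffix = adapters
    counts = Counter()
    for read in reads:
        region = recover_variable_region(read, prefix, suffix)
        if region is None:
            continue  # no exact match for both adapters: skip this read
        protein = translate(region)
        if protein is not None:
            counts[protein] += 1
    return counts


reads = ["AAGGGATGAAATAACCCTT", "GGGATGAAATAACCC", "TTGGGATGAACCCAA", "ACGTACGT"]
print(count_variants(reads, ("GGG", "CCC")))  # Counter({'MK*': 2})
```

The third read is dropped because its variable region (`ATGAA`) ends in a partial codon, and the fourth because neither adapter is found.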

> [!WARNING]
> Note that vFind doesn't do any kind of preprocessing. For initial quality
> filtering, merging, and other common preprocessing operations, you might be
> interested in something like [fastp](https://github.com/OpenGene/fastp) or
> [ngmerge](https://github.com/jsh58/NGmerge). We generally recommend using
> fastp for quality filtering and merging fastq files before using vFind.

Installation details and usage examples are given below. For more usage details,
please see the [API reference](docs/api-reference.md).

## Installation

The package is available on [PyPI](https://pypi.org/project/vfind) and can be installed via pip (or alternatives like [uv](https://github.com/astral-sh/uv)).

### PyPI (Recommended)

Below is an example using uv to initialize a project and add vfind as a dependency.

```bash
uv init
uv add vfind
```

Alternatively, use pip after creating and activating a new virtual environment:

```bash
python3 -m venv .venv
source .venv/bin/activate

python3 -m pip install vfind
```

### Source

vFind is developed using PyO3 and Rust. You will need a Rust toolchain
as well as standard C tooling to build some dependencies
(e.g., the parasail-rs crate).

1. Clone the repository

```bash
git clone https://github.com/nsbuitrago/vfind.git
cd vfind
```

2. Inside the vfind directory, sync dependencies with uv and build vfind

```bash
uv sync

# this will build and install vfind in the virtual env
uv run maturin develop --uv
```

## Examples

### Basic Usage

```python
from vfind import find_variants
import polars as pl # variants are returned in a polars dataframe

adapters = ("GGG", "CCC") # define the adapters
fq_path = "./path/to/your/fastq/file.fq.gz" # path to fq file

variants = find_variants(fq_path, adapters)

# print the number of unique sequences
print(variants.n_unique())
```

`find_variants` returns a polars dataframe with `sequence` and `count` columns.
`sequence` contains the amino acid sequence of the variable regions and
`count` contains the frequency of each variant.

We can then use [dataframe methods](https://docs.pola.rs/py-polars/html/reference/dataframe/index.html)
to further analyze the recovered variants. Some examples are shown below.

```python
# Get the top 5 most frequent variants
# sort returns a new dataframe, so reassign the result
variants = variants.sort("count", descending=True)
print(variants.head(5)) # print the first 5 (most frequent) variants

# filter out sequences with fewer than 10 read counts
# also any sequences that have a premature stop codon (i.e., * before the last residue)

filtered_variants = variants.filter(
    pl.col("count") >= 10,
    # regex: a literal * followed by at least one more character
    ~pl.col("sequence").str.contains(r"\*."),
)

# write the filtered variants to a csv file
filtered_variants.write_csv("filtered_variants.csv")
```

### Using Custom Alignment Parameters

By default, vFind uses semi-global alignment with the following parameters:

- match score = 3
- mismatch score = -2
- gap open penalty = 5
- gap extend penalty = 2

Note that the gap penalties are represented as positive integers. This is largely due to how the underlying
alignment library works.
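
To illustrate how positive penalties enter the score, here is a small sketch of affine gap scoring with the default parameters. It assumes the common convention in which the open penalty covers the first gap position and the extend penalty each additional one. The underlying library may define this differently, so treat the numbers as illustrative. The helper names `gap_score` and `alignment_score` are hypothetical and not part of vFind.

```python
def gap_score(length, gap_open_penalty=5, gap_extend_penalty=2):
    """Score contribution of one gap spanning `length` positions (zero or negative)."""
    if length <= 0:
        return 0
    # Assumed convention: the open penalty covers the first gap position,
    # and the extend penalty applies to each additional position.
    return -(gap_open_penalty + (length - 1) * gap_extend_penalty)


def alignment_score(n_matches, n_mismatches, gap_lengths,
                    match_score=3, mismatch_score=-2,
                    gap_open_penalty=5, gap_extend_penalty=2):
    """Total score of an alignment described by its match, mismatch, and gap counts."""
    score = n_matches * match_score + n_mismatches * mismatch_score
    for length in gap_lengths:
        # Penalties are positive numbers that are subtracted from the score.
        score += gap_score(length, gap_open_penalty, gap_extend_penalty)
    return score


# 10 matches, 1 mismatch, and a single gap of length 2 with the default parameters
print(alignment_score(10, 1, [2]))  # 10*3 + 1*(-2) - (5 + 1*2) = 21
```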

To adjust these alignment parameters, use the `match_score`, `mismatch_score`,
`gap_open_penalty`, and `gap_extend_penalty` keyword arguments:

```python
from vfind import find_variants

# ... define adapters and fq_path

# use identity scoring with no gap penalties for alignments
variants = find_variants(
    fq_path,
    adapters,
    match_score=1,
    mismatch_score=-1,
    gap_open_penalty=0,
    gap_extend_penalty=0,
)
```

Alignments are accepted if they produce a score above a set threshold. The thresholds
can be adjusted with the `accept_prefix_alignment` and `accept_suffix_alignment`
arguments. By default, both are set to 0.75.

Each threshold represents a fraction of the maximum alignment score, so valid values are between 0 and 1.
A value of 0.75 means that alignments scoring greater than 75% of the maximum theoretical score are accepted.
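
As a rough worked example of the threshold, the sketch below assumes the maximum theoretical score is the adapter length multiplied by the match score (every base matching, with no gaps). The helpers `max_alignment_score` and `is_accepted` are hypothetical and only illustrate the arithmetic. They are not part of the vFind API.

```python
def max_alignment_score(adapter, match_score=3):
    """Assumed maximum: every adapter base matches, with no mismatches or gaps."""
    return len(adapter) * match_score


def is_accepted(score, adapter, threshold=0.75, match_score=3):
    """Return True if `score` is greater than `threshold` times the maximum score."""
    if not 0 <= threshold <= 1:
        raise ValueError("threshold must be between 0 and 1")
    return score > threshold * max_alignment_score(adapter, match_score)


adapter = "ACGT" * 5  # hypothetical 20 nt adapter
print(max_alignment_score(adapter))  # 20 * 3 = 60
print(is_accepted(48, adapter))      # True: 48 > 0.75 * 60 = 45
print(is_accepted(45, adapter))      # False: 45 is not greater than 45
```

With the default match score of 3, a 20 nt adapter therefore needs an alignment score above 45 to count as a partial match under these assumptions.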

To recover a variant, each adapter sequence must have either an exact match or a partial match (an accepted alignment).
To skip alignment and only look for exact matches, set the `skip_alignment` argument to `True`.

### Miscellaneous

**Q:** I don't need the amino acid sequence. Can I just get the DNA sequence?

**A:** Yes. Just set `skip_translation` to `True`.

```python
# ...
dna_seqs = find_variants(fq_path, adapters, skip_translation=True)
```

---

**Q:** I don't want to use polars. Can I use pandas instead?

**A:** Yes. Use the [`to_pandas`](https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.to_pandas.html#polars.DataFrame.to_pandas) method on the dataframe.

---

**Q:** I have a lot of data and `find_variants` is slow. Is there anything I can do to speed it up?

**A:** Maybe. Try changing the number of threads or queue length the function uses.

```python
# ...
variants = find_variants(fq_path, adapters, n_threads=6, queue_len=4)
```

For more usage details, see the [API reference](./docs/api-reference.md).

## Contributing

Feedback is a gift and contributions are more than welcome. Please submit an
issue or pull request for any bugs, suggestions, or feedback. Please see the
[developing](./docs/developing.md) guide for more details on how to work on vFind.

## License

vFind is licensed under the [MIT license](./LICENSE.md).

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "vfind",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": null,
    "keywords": "bioinformatics, sequence analysis, NGS, variant finding",
    "author": null,
    "author_email": "nsbuitrago <mail@nsbuitrago.xyz>",
    "download_url": "https://files.pythonhosted.org/packages/62/c0/038e75c784d307327e1f028ff1711a0429d431971fe12f0266f1f6ee01f3/vfind-0.2.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "Simple variant finding utilities for NGS data",
    "version": "0.2.0",
    "project_urls": {
        "Issues": "https://github.com/nsbuitrago/vfind/issues",
        "Repository": "https://github.com/nsbuitrago/vfind"
    },
    "split_keywords": [
        "bioinformatics",
        " sequence analysis",
        " ngs",
        " variant finding"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "5c485dd26105179f3a733f9edc076ebaddf38d3f05d4587d0a8aa88a4fed69f8",
                "md5": "5b2e84d8f4ae7f6b803ae820ee84bd84",
                "sha256": "36cf3b4a08c858dcafa994da00d20608b7c3941c368d757185481ede4cfb550b"
            },
            "downloads": -1,
            "filename": "vfind-0.2.0-cp311-cp311-macosx_10_12_x86_64.whl",
            "has_sig": false,
            "md5_digest": "5b2e84d8f4ae7f6b803ae820ee84bd84",
            "packagetype": "bdist_wheel",
            "python_version": "cp311",
            "requires_python": ">=3.10",
            "size": 4101657,
            "upload_time": "2025-02-24T23:07:19",
            "upload_time_iso_8601": "2025-02-24T23:07:19.768058Z",
            "url": "https://files.pythonhosted.org/packages/5c/48/5dd26105179f3a733f9edc076ebaddf38d3f05d4587d0a8aa88a4fed69f8/vfind-0.2.0-cp311-cp311-macosx_10_12_x86_64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "59833a91e047bd603965daf19a31379006cc4dd68f60f30099285ed21a855aa7",
                "md5": "35147e5c16e2a8a1bb6105e50777e206",
                "sha256": "2dd43847c8b9eae48facc72599669922c514cbaf7407432e1bb3f460fd4da072"
            },
            "downloads": -1,
            "filename": "vfind-0.2.0-cp311-cp311-macosx_11_0_arm64.whl",
            "has_sig": false,
            "md5_digest": "35147e5c16e2a8a1bb6105e50777e206",
            "packagetype": "bdist_wheel",
            "python_version": "cp311",
            "requires_python": ">=3.10",
            "size": 4573151,
            "upload_time": "2025-02-24T23:07:11",
            "upload_time_iso_8601": "2025-02-24T23:07:11.364438Z",
            "url": "https://files.pythonhosted.org/packages/59/83/3a91e047bd603965daf19a31379006cc4dd68f60f30099285ed21a855aa7/vfind-0.2.0-cp311-cp311-macosx_11_0_arm64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "d7ef121f007e9c64741df36c3f4f69f0689aa9eab995a5e24ce3a8ff23a2b7ba",
                "md5": "c4bd3fcf9e36b3439a85383095a2c9a9",
                "sha256": "31abf7e91ed4e01e64398c9618b6c197ae166a9a4bafac9664e83bb910092b23"
            },
            "downloads": -1,
            "filename": "vfind-0.2.0-cp312-cp312-macosx_10_12_x86_64.whl",
            "has_sig": false,
            "md5_digest": "c4bd3fcf9e36b3439a85383095a2c9a9",
            "packagetype": "bdist_wheel",
            "python_version": "cp312",
            "requires_python": ">=3.10",
            "size": 4098003,
            "upload_time": "2025-02-24T23:07:22",
            "upload_time_iso_8601": "2025-02-24T23:07:22.580998Z",
            "url": "https://files.pythonhosted.org/packages/d7/ef/121f007e9c64741df36c3f4f69f0689aa9eab995a5e24ce3a8ff23a2b7ba/vfind-0.2.0-cp312-cp312-macosx_10_12_x86_64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "d820f3696834d5e7a06deefc93a66731bb92b4b0603a5726fc90de5a22caf77b",
                "md5": "098cf53929078886e1536e1273cc354e",
                "sha256": "8f0269a3703baa4735975e9e625af1de3e72f1b01eb2743c7e5dd0140e4a805c"
            },
            "downloads": -1,
            "filename": "vfind-0.2.0-cp312-cp312-macosx_11_0_arm64.whl",
            "has_sig": false,
            "md5_digest": "098cf53929078886e1536e1273cc354e",
            "packagetype": "bdist_wheel",
            "python_version": "cp312",
            "requires_python": ">=3.10",
            "size": 4569789,
            "upload_time": "2025-02-24T23:07:14",
            "upload_time_iso_8601": "2025-02-24T23:07:14.269408Z",
            "url": "https://files.pythonhosted.org/packages/d8/20/f3696834d5e7a06deefc93a66731bb92b4b0603a5726fc90de5a22caf77b/vfind-0.2.0-cp312-cp312-macosx_11_0_arm64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "8e348f90a92cb1c9ed980db89e815fa405025ce9cad3ed32b9d84e33f6900a7a",
                "md5": "839f70cba090dbb38ee3486da712a4cc",
                "sha256": "7ef64d94845a9f48d7b0fe0d7d08f3c118a57155bce582f88b184c95c288d825"
            },
            "downloads": -1,
            "filename": "vfind-0.2.0-cp312-cp312-manylinux_2_34_x86_64.whl",
            "has_sig": false,
            "md5_digest": "839f70cba090dbb38ee3486da712a4cc",
            "packagetype": "bdist_wheel",
            "python_version": "cp312",
            "requires_python": ">=3.10",
            "size": 7688409,
            "upload_time": "2025-02-24T23:07:09",
            "upload_time_iso_8601": "2025-02-24T23:07:09.136745Z",
            "url": "https://files.pythonhosted.org/packages/8e/34/8f90a92cb1c9ed980db89e815fa405025ce9cad3ed32b9d84e33f6900a7a/vfind-0.2.0-cp312-cp312-manylinux_2_34_x86_64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "0d54a23952d3ce5a203cf4ff3f68511a7fde42e1b2db11161c326db7a9c46b6b",
                "md5": "9ccc5019a6ed9c8b7aa9f5f88ae143ba",
                "sha256": "9353c541f2e653c800e856a48f3087583e9ab4e7f6d0c866f15b3b30bb39a21d"
            },
            "downloads": -1,
            "filename": "vfind-0.2.0-cp313-cp313-macosx_10_12_x86_64.whl",
            "has_sig": false,
            "md5_digest": "9ccc5019a6ed9c8b7aa9f5f88ae143ba",
            "packagetype": "bdist_wheel",
            "python_version": "cp313",
            "requires_python": ">=3.10",
            "size": 4096717,
            "upload_time": "2025-02-24T23:07:25",
            "upload_time_iso_8601": "2025-02-24T23:07:25.380096Z",
            "url": "https://files.pythonhosted.org/packages/0d/54/a23952d3ce5a203cf4ff3f68511a7fde42e1b2db11161c326db7a9c46b6b/vfind-0.2.0-cp313-cp313-macosx_10_12_x86_64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "ea965528dcaa524708b09cc6129da62872a8dac2ec7a1f9ee90131393943e0ef",
                "md5": "847d86be9ae346b6e7a7b6371249fee6",
                "sha256": "6061919007e21d2ca772c5f6555b4bfff86bc6e098a0043990d8e94b624b5348"
            },
            "downloads": -1,
            "filename": "vfind-0.2.0-cp313-cp313-macosx_11_0_arm64.whl",
            "has_sig": false,
            "md5_digest": "847d86be9ae346b6e7a7b6371249fee6",
            "packagetype": "bdist_wheel",
            "python_version": "cp313",
            "requires_python": ">=3.10",
            "size": 4569329,
            "upload_time": "2025-02-24T23:07:17",
            "upload_time_iso_8601": "2025-02-24T23:07:17.465268Z",
            "url": "https://files.pythonhosted.org/packages/ea/96/5528dcaa524708b09cc6129da62872a8dac2ec7a1f9ee90131393943e0ef/vfind-0.2.0-cp313-cp313-macosx_11_0_arm64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "62c0038e75c784d307327e1f028ff1711a0429d431971fe12f0266f1f6ee01f3",
                "md5": "d9ddba4695cf03ae07050536b6c8292d",
                "sha256": "864620dec735ff1525e75998ac2b4dc3cdcb2d7c5d80c57be3f85c1a0b49aeee"
            },
            "downloads": -1,
            "filename": "vfind-0.2.0.tar.gz",
            "has_sig": false,
            "md5_digest": "d9ddba4695cf03ae07050536b6c8292d",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 45035,
            "upload_time": "2025-02-24T23:07:26",
            "upload_time_iso_8601": "2025-02-24T23:07:26.986237Z",
            "url": "https://files.pythonhosted.org/packages/62/c0/038e75c784d307327e1f028ff1711a0429d431971fe12f0266f1f6ee01f3/vfind-0.2.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-02-24 23:07:26",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "nsbuitrago",
    "github_project": "vfind",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "vfind"
}
