# vFind
*A simple variant finder for NGS data.*
1. [Introduction](#introduction)
2. [Installation](#installation)
3. [Examples](#examples)
4. [Contributing](#contributing)
5. [License](#license)
## Introduction
vFind is unlike a traditional [*variant caller*](https://gencore.bio.nyu.edu/variant-calling-pipeline-gatk4/).
It uses a simpler algorithm that is *usually* sufficient for
screening experiments. The main use case is finding variants in a library that
has constant adapter sequences flanking a variable region.
The workflow can be described generally by these steps:
1. define a set of constant adapter sequences that immediately flank a variable region of interest.
2. for each fastq read, search for exact matches of these adapters.
3. in case of no exact match for either adapter, perform a semi-global alignment of
the adapter sequence on the fastq read.
4. if the alignment produces a score above a set threshold, recover the variable region.
To adjust the thresholds for accepting alignments, see the [alignment parameters](#using-custom-alignment-parameters) section.
Alignments are accepted if they produce a score above a set threshold.
These thresholds correspond to the fraction of the max theoretical alignment score
that can be produced from a given adapter sequence and fastq read. Thus, thresholds
are values between 0 and 1. By default, thresholds for both adapter sequences are
set to 0.75. In some cases, you may want to change these values to be more or less
stringent.
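The workflow and threshold rule described above can be sketched in plain Python. This is only an illustration, not vFind's actual implementation (vFind is written in Rust). It rests on several assumptions: a linear gap penalty instead of separate open and extend penalties, a maximum theoretical score equal to `match * len(adapter)`, and a search for the second adapter only downstream of the first. The function names `semi_global`, `locate`, and `recover_variable_region` are hypothetical.

```python
from typing import Optional, Tuple


def semi_global(adapter: str, read: str, match: int = 3,
                mismatch: int = -2, gap: int = 5) -> Tuple[int, int, int]:
    """Align all of `adapter` to any substring of `read` (linear gap cost).

    Returns (score, start, end) so that read[start:end] is the aligned span.
    """
    # Row 0: the alignment may start anywhere in the read at no cost.
    prev = [(0, j) for j in range(len(read) + 1)]
    for i, a in enumerate(adapter, start=1):
        cur = [(-gap * i, 0)]  # adapter[:i] aligned entirely against gaps
        for j, r in enumerate(read, start=1):
            diag_score, diag_start = prev[j - 1]
            up_score, up_start = prev[j]
            left_score, left_start = cur[j - 1]
            cur.append(max(
                (diag_score + (match if a == r else mismatch), diag_start),
                (up_score - gap, up_start),      # adapter base against a gap
                (left_score - gap, left_start),  # read base against a gap
            ))
        prev = cur
    # Trailing read bases are free: take the best cell in the last row.
    end = max(range(len(prev)), key=lambda j: prev[j][0])
    score, start = prev[end]
    return score, start, end


def locate(adapter: str, read: str, threshold: float = 0.75,
           match: int = 3) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the adapter within the read, or None."""
    start = read.find(adapter)  # step 2: exact match
    if start != -1:
        return start, start + len(adapter)
    score, start, end = semi_global(adapter, read, match=match)  # step 3
    max_score = match * len(adapter)  # assumed maximum theoretical score
    if score > threshold * max_score:  # step 4: accept or reject
        return start, end
    return None


def recover_variable_region(read: str, adapters: Tuple[str, str],
                            threshold: float = 0.75) -> Optional[str]:
    """Return the sequence between both adapters, or None if either is missing."""
    prefix, suffix = adapters
    hit = locate(prefix, read, threshold)
    if hit is None:
        return None
    tail = read[hit[1]:]  # assumption: look for the suffix after the prefix
    hit = locate(suffix, tail, threshold)
    if hit is None:
        return None
    return tail[:hit[0]]


# Exact matches for both adapters.
print(recover_variable_region("AAGGGTTTACGCCCAA", ("GGG", "CCC")))
# One mismatch in the first adapter, recovered through the alignment fallback.
print(recover_variable_region("TTACGTTCGTACGGGAAATTGCATTGCACC",
                              ("ACGTACGTAC", "TTGCATTGCA")))
```

In the second call the first adapter has a single mismatch, so the exact search fails and the alignment score (25 out of an assumed maximum of 30) decides whether the read is kept.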
> [!WARNING]
> Note that vFind doesn't do any kind of fastq preprocessing. For initial quality
> filtering, merging, and other common preprocessing operations, you might be
> interested in something like [fastp]() or [ngmerge](). We generally recommend
> using fastp for quality filtering and merging fastq files before using vFind.
## Installation
### PyPI (Recommended for most)
vFind is available on [PyPI](https://pypi.org/) and can be installed via pip (or alternatives like
[uv](https://github.com/astral-sh/uv)).
Below is an example using pip with Python3 in a new project.
```bash
python3 -m venv .venv # create a new virtual env if you haven't already
source .venv/bin/activate # activate the virtual env
python3 -m pip install vfind # install vfind
```
### Nix
vFind is also available as a Python package on [NixPkgs](https://search.nixos.org/packages?). You can declare new
development environments using [nix flakes](https://wiki.nixos.org/wiki/Flakes).
For something quick, you can use nix-shell:
```bash
nix-shell -p python311 python3Packages.vfind
```
## Examples
### Basic Usage
```python
from vfind import find_variants
import polars as pl # variants are returned in a polars dataframe
adapters = ("GGG", "CCC") # define the adapters
fq_path = "./path/to/your/fastq/file.fq.gz" # path to fq file
# now, let's find some variants
variants = find_variants(fq_path, adapters)
# print the number of unique sequences
print(variants.n_unique())
```
`find_variants` returns a polars dataframe with `sequence` and `count` columns.
`sequence` contains the amino acid sequence of the variable regions and
`count` contains the frequency of each variant.
We can then use [dataframe methods](https://docs.pola.rs/py-polars/html/reference/dataframe/index.html)
to further analyze the recovered variants. Some examples are shown below.
```python
# continuing from the example above (`pl` is the polars import)
# Get the top 5 most frequent variants
variants = variants.sort("count", descending=True) # sort returns a new dataframe
print(variants.head(5)) # print the first 5 (most frequent) variants
# keep sequences with at least 10 reads and drop any sequences that have
# a premature stop codon (i.e., * before the last residue)
filtered_variants = variants.filter(
    pl.col("count") >= 10,
    ~pl.col("sequence").str.contains(r"\*."), # "*" followed by another residue
)
# write the filtered variants to a csv file
filtered_variants.write_csv("filtered_variants.csv")
```
### Using Custom Alignment Parameters
By default, vFind uses semi-global alignment with the following parameters:
- match score = 3
- mismatch score = -2
- gap open penalty = 5
- gap extend penalty = 2
Note that the gap penalties are represented as positive integers. This is largely due to the underlying
alignment library.
```python
from vfind import find_variants
# ... define adapters and fq_path
# use identity scoring with no gap penalties for alignments
variants = find_variants(
fq_path,
adapters,
match_score = 1,
mismatch_score = -1,
gap_open_penalty = 0,
gap_extend_penalty = 0,
)
```
Only alignments that produce scores meeting a threshold will be considered accepted.
The threshold for considering an acceptable alignment can be adjusted with the
`accept_prefix_alignment` and `accept_suffix_alignment` arguments. By default,
both thresholds are set to 0.75.
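As a rough sketch, the two arguments named above can be passed as keyword arguments. The wrapper name `find_variants_custom_thresholds` and the values 0.9 and 0.6 are illustrative assumptions, not recommendations. The import sits inside the function only so that the snippet loads even where vFind is not installed.

```python
def find_variants_custom_thresholds(fq_path, adapters):
    """Hypothetical wrapper: stricter prefix threshold, looser suffix threshold."""
    # Imported lazily so this sketch can be loaded without vfind installed.
    from vfind import find_variants

    return find_variants(
        fq_path,
        adapters,
        accept_prefix_alignment=0.9,  # illustrative value, default is 0.75
        accept_suffix_alignment=0.6,  # illustrative value, default is 0.75
    )
```

Calling `find_variants_custom_thresholds(fq_path, adapters)` would then behave like the earlier examples, with only the acceptance thresholds changed.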
The thresholds represent a fraction of the maximum alignment score. So, a value of 0.75
means alignments producing scores that are greater than 75% of the maximum theoretical score
will be accepted.
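To make the arithmetic concrete, here is a small worked example. It assumes the maximum theoretical score is the match score times the adapter length and that acceptance is a strict "greater than" comparison. Both are assumptions drawn from the wording above, not confirmed details of vFind. The 20-nt adapter is hypothetical.

```python
def max_theoretical_score(adapter: str, match_score: int = 3) -> int:
    # Assumption: every adapter base matches, with no gaps.
    return match_score * len(adapter)


def is_accepted(score: int, adapter: str, threshold: float = 0.75,
                match_score: int = 3) -> bool:
    # Assumption: the score must be strictly above the threshold fraction.
    return score > threshold * max_theoretical_score(adapter, match_score)


adapter = "ACGTACGTACGTACGTACGT"  # hypothetical 20-nt adapter
results = {}
for n_mismatches in range(4):
    # Default scoring: +3 per match, -2 per mismatch, no gaps.
    score = 3 * (len(adapter) - n_mismatches) - 2 * n_mismatches
    results[n_mismatches] = (score, is_accepted(score, adapter))
    print(n_mismatches, score, results[n_mismatches][1])
# prints:
# 0 60 True
# 1 55 True
# 2 50 True
# 3 45 False
```

Under these assumptions, a 20-nt adapter tolerates up to two mismatches at the default threshold, since the cutoff is 0.75 × 60 = 45.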
Either an exact match or partial match (accepted alignment) must be made for both adapter sequences to recover a variant.
### Miscellaneous
**Q:** I don't need the amino acid sequence. Can I just get the DNA sequence?
**A:** Yes. Just set `skip_translation` to True.
```python
# ...
dna_seqs = find_variants(fq_path, adapters, skip_translation=True)
```
---
**Q:** I don't want to use polars. Can I use pandas instead?
**A:** Yes. Use the [`to_pandas`](https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.to_pandas.html#polars.DataFrame.to_pandas) method on the dataframe.
---
**Q:** `find_variants` is slow. Is there anything I can do to speed it up?
**A:** Maybe? Try changing the number of threads or queue length the function uses.
```python
# ...
variants = find_variants(fq_path, adapters, n_threads=6, queue_len=4)
```
## Contributing
Contributions are welcome. Please submit an issue or pull request for any bugs,
suggestions, or feedback.
### Developing
vFind is written in Rust and uses [PyO3](https://pyo3.rs/v0.21.1/) and [Maturin](https://github.com/PyO3/maturin)
to generate the Python module. To get started, you will need to have the
[rust toolchain](https://www.rust-lang.org/tools/install) and [Python >= 3.10](https://www.python.org/downloads/).
Below are some general steps to get up and running. Note that these examples
use [uv](https://github.com/astral-sh/uv). However, you could do this with standard pip or your preferred method.
1. Clone the repository to your machine.
```bash
git clone git@github.com:nsbuitrago/vfind.git
cd vfind
```
2. Create a new Python virtual environment and install dev dependencies.
```bash
uv venv
source .venv/bin/activate
# this will install maturin, polars, and any other required packages.
uv pip install -r dev-requirements.txt
```
3. Build and install the vFind package in your virtual environment with maturin.
```bash
# in the root project directory
maturin develop
```
4. From here, you can make changes to the Rust lib and run `maturin develop`
to rebuild the package with those changes.
## License
vFind is licensed under the [MIT license](LICENSE).