vfind

Name	vfind
Version	0.1.0
home_page	None
Summary	A simple variant finder for NGS data
upload_time	2024-04-15 20:50:16
maintainer	None
docs_url	None
author	None
requires_python	>=3.8
license	None
keywords	bioinformatics sequence analysis ngs variant finding
requirements	No requirements were recorded.

# vFind

*A simple variant finder for NGS data.*

1. [Introduction](#introduction)
2. [Installation](#installation)
3. [Examples](#examples)
4. [Contributing](#contributing)
5. [License](#license)

## Introduction

vFind is not a traditional [*variant caller*](https://gencore.bio.nyu.edu/variant-calling-pipeline-gatk4/).
It uses a simpler algorithm that is *usually* sufficient for
screening experiments. The main use case is finding variants in a library that
has constant adapter sequences flanking a variable region.

The workflow can be described generally by these steps:

1. Define a set of constant adapter sequences that immediately flank a variable region of interest.
2. For each fastq read, search for exact matches of these adapters.
3. If either adapter has no exact match, perform a semi-global alignment of
   the adapter sequence against the fastq read.
4. If the alignment produces a score above a set threshold, recover the variable region.

To adjust the thresholds for accepting alignments, see the [alignment parameters](#using-custom-alignment-parameters) section.
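
The exact-match part of this workflow can be sketched in a few lines of plain Python. This is only an illustration of the idea, not vFind's implementation (which is written in Rust); the helper name `recover_exact` is hypothetical, and the alignment fallback is omitted.

```python
from typing import Optional, Tuple


def recover_exact(read: str, adapters: Tuple[str, str]) -> Optional[str]:
    """Return the region between the two adapters, or None if either is missing.

    Hypothetical helper for illustration only. A real run would fall back to
    alignment when an exact match is not found.
    """
    prefix, suffix = adapters
    start = read.find(prefix)
    if start == -1:
        return None  # no exact match for the first adapter
    region_start = start + len(prefix)
    # Look for the second adapter only after the end of the first one.
    end = read.find(suffix, region_start)
    if end == -1:
        return None  # no exact match for the second adapter
    return read[region_start:end]


print(recover_exact("TTGGGATGAAACCCTT", ("GGG", "CCC")))  # ATGAAA
print(recover_exact("TTGGGATGAAATTT", ("GGG", "CCC")))    # None
```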

Alignments are accepted if they produce a score above a set threshold.
These thresholds correspond to the fraction of the max theoretical alignment score
that can be produced from a given adapter sequence and fastq read. Thus, thresholds
are values between 0 and 1. By default, thresholds for both adapter sequences are
set to 0.75. In some cases, you may want to change these values to be more or less
stringent. 
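
To make the threshold arithmetic concrete, here is a small sketch. It assumes that the maximum theoretical score is the match score multiplied by the adapter length (every adapter base matching); the README does not spell out how vFind computes this value, so treat that as an assumption. The function name `min_accepted_score` is hypothetical.

```python
def min_accepted_score(adapter: str, threshold: float = 0.75, match_score: int = 3) -> float:
    """Score an alignment would need to exceed to be accepted.

    Assumption: the maximum theoretical score is match_score * len(adapter).
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    max_score = match_score * len(adapter)
    return threshold * max_score


adapter = "ACGTACGTACGTACGTACGT"  # a 20 nt adapter, max score 3 * 20 = 60
print(min_accepted_score(adapter))                 # 45.0
print(min_accepted_score(adapter, threshold=0.5))  # 30.0
```

Under these assumptions, raising the threshold toward 1 tolerates fewer mismatches and gaps in the adapter, while lowering it recovers more reads at the cost of more spurious matches.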

> [!WARNING]
> Note that vFind doesn't do any kind of fastq preprocessing. For initial quality
> filtering, merging, and other common preprocessing operations, you might be
> interested in something like fastp or ngmerge. We generally recommend
> using fastp for quality filtering and merging fastq files before using vFind.

## Installation

### PyPI (Recommended for most)

vFind is available on [PyPI](https://pypi.org/) and can be installed via pip (or alternatives like
[uv](https://github.com/astral-sh/uv)).

Below is an example using pip with Python 3 in a new project.

```bash
python3 -m venv .venv # create a new virtual env if you haven't already
source .venv/bin/activate # activate the virtual env

python3 -m pip install vfind # install vfind
```

### Nix

vFind is also available as a Python package on [NixPkgs](https://search.nixos.org/packages?). You can declare new
development environments using [nix flakes](https://wiki.nixos.org/wiki/Flakes).

For something quick, you can use nix-shell:

```bash
nix-shell -p python311 python3Packages.vfind
```

## Examples

### Basic Usage

```python
from vfind import find_variants
import polars as pl # variants are returned in a polars dataframe

adapters = ("GGG", "CCC") # define the adapters
fq_path = "./path/to/your/fastq/file.fq.gz" # path to the fastq file

# now, let's find some variants
variants = find_variants(fq_path, adapters)

# print the number of unique sequences 
print(variants.n_unique())
```

`find_variants` returns a polars dataframe with `sequence` and `count` columns.
`sequence` contains the amino acid sequence of the variable regions and
`count` contains the frequency of each variant.

We can then use [dataframe methods](https://docs.pola.rs/py-polars/html/reference/dataframe/index.html) 
to further analyze the recovered variants. Some examples are shown below.

```python
# `pl` is polars, imported as in the basic usage example above

# Get the top 5 most frequent variants
# sort returns a new dataframe, so reassign the result
variants = variants.sort("count", descending=True)
print(variants.head(5)) # print the first 5 (most frequent) variants

# keep sequences with at least 10 reads and drop any sequences that have a
# premature stop codon (i.e., * before the last residue)
filtered_variants = variants.filter(
    pl.col("count") >= 10,
    ~pl.col("sequence").str.contains(r"\*."), # regex: a * followed by another residue
)

# write the filtered variants to a csv file
filtered_variants.write_csv("filtered_variants.csv")
```

### Using Custom Alignment Parameters

By default, vFind uses semi-global alignment with the following parameters:

- match score = 3
- mismatch score = -2
- gap open penalty = 5
- gap extend penalty = 2

Note that the gap penalties are represented as positive integers. This is largely due to the underlying
alignment library.

```python
from vfind import find_variants

# ... define adapters and fq_path

# use identity scoring with no gap penalties for alignments
variants = find_variants(
    fq_path,
    adapters,
    match_score=1,
    mismatch_score=-1,
    gap_open_penalty=0,
    gap_extend_penalty=0,
)
```

Only alignments that produce scores above a threshold are accepted.
The thresholds can be adjusted with the
`accept_prefix_alignment` and `accept_suffix_alignment` arguments. By default,
both thresholds are set to 0.75.

The thresholds represent a fraction of the maximum alignment score. So, a value of 0.75
means alignments producing scores that are greater than 75% of the maximum theoretical score
will be accepted.

Either an exact match or partial match (accepted alignment) must be made for both adapter sequences to recover a variant. 
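
As a rough illustration of how a semi-global alignment score could be compared against a threshold, here is a minimal pure-Python sketch using the default parameters listed above. It is not vFind's implementation. The function name `semiglobal_score`, the gap cost convention (a gap of length k costs `gap_open + k * gap_extend`), and the maximum score taken as `match * len(adapter)` are all assumptions made for illustration.

```python
def semiglobal_score(adapter, read, match=3, mismatch=-2, gap_open=5, gap_extend=2):
    """Best score for aligning all of `adapter` against any stretch of `read`.

    Read bases before and after the aligned stretch are free (semi-global).
    Assumed gap convention: a gap of length k costs gap_open + k * gap_extend.
    """
    neg_inf = float("-inf")
    n = len(read)
    # Row 0: no adapter bases aligned yet, skipping leading read bases is free.
    h_prev = [0] * (n + 1)
    f_prev = [neg_inf] * (n + 1)
    for i in range(1, len(adapter) + 1):
        h_cur = [neg_inf] * (n + 1)
        f_cur = [neg_inf] * (n + 1)
        # Column 0: the first i adapter bases are aligned against a gap.
        f_cur[0] = -(gap_open + i * gap_extend)
        h_cur[0] = f_cur[0]
        e = neg_inf  # best score ending with a read base aligned against a gap
        for j in range(1, n + 1):
            e = max(e - gap_extend, h_cur[j - 1] - gap_open - gap_extend)
            f_cur[j] = max(f_prev[j] - gap_extend, h_prev[j] - gap_open - gap_extend)
            sub = match if adapter[i - 1] == read[j - 1] else mismatch
            h_cur[j] = max(h_prev[j - 1] + sub, e, f_cur[j])
        h_prev, f_prev = h_cur, f_cur
    # Trailing read bases are free, so take the best score in the last row.
    return max(h_prev)


adapter = "ACGT"
threshold = 0.75
max_score = 3 * len(adapter)  # assumption: all adapter bases match

for read in ("TTACGTTT", "TTACCTTT"):
    score = semiglobal_score(adapter, read)
    print(read, score, score > threshold * max_score)
# TTACGTTT 12 True
# TTACCTTT 7 False
```

In this toy case a single mismatch in a 4 nt adapter already drops the score below 75% of the maximum, which is one reason longer adapters are more forgiving at the same threshold.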

### Miscellaneous

**Q:** I don't need the amino acid sequence. Can I just get the DNA sequence?

**A:** Yes. Just set `skip_translation` to True.

```python
# ...
dna_seqs = find_variants(fq_path, adapters, skip_translation=True)
```
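
To show what the translation step changes in the output, here is a small sketch of translating a DNA variable region with the standard genetic code, where `*` marks a stop codon. The helper `translate` is hypothetical, and the choices made here (reading frame 1, dropping an incomplete trailing codon, `X` for unrecognized codons) are assumptions, not a description of what vFind does internally.

```python
from itertools import product

# Standard genetic code in the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate DNA in frame 1. Stop codons become '*'.

    Assumptions for this sketch: an incomplete trailing codon is dropped and
    codons with non-ACGT characters are reported as 'X'.
    """
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


print(translate("ATGAAATAA"))  # MK*
```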

---

**Q:** I don't want to use polars. Can I use pandas instead?

**A:** Yes. Use the [`to_pandas`](https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.to_pandas.html#polars.DataFrame.to_pandas) method on the dataframe.

---

**Q:** `find_variants` is slow. Is there anything I can do to speed it up?

**A:** Maybe? Try changing the number of threads or queue length the function uses.

```python
# ...
variants = find_variants(fq_path, adapters, n_threads=6, queue_len=4)
```

## Contributing

Contributions are welcome. Please submit an issue or pull request for any bugs,
suggestions, or feedback.

### Developing

vFind is written in Rust and uses [PyO3](https://pyo3.rs/v0.21.1/) and [Maturin](https://github.com/PyO3/maturin)
to generate the Python module. To get started, you will need to have the
[Rust toolchain](https://www.rust-lang.org/tools/install) and [Python >= 3.10](https://www.python.org/downloads/).

Below are some general steps to get up and running. Note that these examples
use [uv](https://github.com/astral-sh/uv). However, you could do this with standard pip or your preferred method.

1. Clone the repository to your machine.

```bash
git clone git@github.com:nsbuitrago/vfind.git
cd vfind
```

2. Create a new Python virtual environment and install dev dependencies.

```bash
uv venv
source .venv/bin/activate

# this will install maturin, polars, and any other required packages.
uv pip install -r dev-requirements.txt
```

3. Build and install the vFind package in your virtual environment with maturin. 

```bash
# in the root project directory
maturin develop
```

4. From here, you can make changes to the Rust lib and run `maturin develop`
to rebuild the package with those changes.

## License

vFind is licensed under the [MIT license](LICENSE).

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "vfind",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "bioinformatics, sequence analysis, NGS, variant finding",
    "author": null,
    "author_email": "nsbuitrago <44626938+nsbuitrago@users.noreply.github.com>",
    "download_url": null,
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "A simple variant finder for NGS data",
    "version": "0.1.0",
    "project_urls": {
        "Issues": "https://github.com/nsbuitrago/vfind/issues",
        "Repository": "https://github.com/nsbuitrago/vfind"
    },
    "split_keywords": [
        "bioinformatics",
        " sequence analysis",
        " ngs",
        " variant finding"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "155e3ab953caa3f30d38da54425fb2dfbac80de9674101763643225baf19b1cc",
                "md5": "5e1eb8af0ff8b61fd09742372832f275",
                "sha256": "c1712b89d88d4d3a3bf5bce5bc2bfd3d679f388d649da04b0c33a45acbaa594c"
            },
            "downloads": -1,
            "filename": "vfind-0.1.0-cp311-cp311-macosx_11_0_arm64.whl",
            "has_sig": false,
            "md5_digest": "5e1eb8af0ff8b61fd09742372832f275",
            "packagetype": "bdist_wheel",
            "python_version": "cp311",
            "requires_python": ">=3.8",
            "size": 3008631,
            "upload_time": "2024-04-15T20:50:16",
            "upload_time_iso_8601": "2024-04-15T20:50:16.986856Z",
            "url": "https://files.pythonhosted.org/packages/15/5e/3ab953caa3f30d38da54425fb2dfbac80de9674101763643225baf19b1cc/vfind-0.1.0-cp311-cp311-macosx_11_0_arm64.whl",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-04-15 20:50:16",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "nsbuitrago",
    "github_project": "vfind",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "vfind"
}
