affine-gaps


Nameaffine-gaps JSON
Version 0.2.0 PyPI version JSON
download
home_pageNone
SummaryNumba-accelerated Python implementation of affine gap penalty extensions for Needleman-Wunsch and Smith-Waterman algorithms
upload_time2024-05-24 07:54:05
maintainerNone
docs_urlNone
authorNone
requires_python>=3.6
licenseNone
keywords affine gaps bioinformatics numba sequence alignment
VCS
bugtrack_url
requirements No requirements were recorded.
Travis-CI No Travis.
coveralls test coverage No coveralls.
            # Affine Gaps

__Affine Gaps__ is a __less-wrong__ single-file Numba-accelerated Python implementation of Osamu Gotoh affine gap penalty extensions 1982 [paper](https://doc.aporc.org/attach/Course001Papers/gotoh1982.pdf) for the Needleman-Wunsch and Smith-Waterman algorithms often used for global and local sequence alignment in Bioinformatics.
Thanks to the Numba JIT compiler, it's also competitive in terms of performance.
But if you want to go even faster and need more hardware-accelerated string operations, check out [StringZilla](https://github.com/ashvardanian/stringzilla) 🦖

## Less Wrong

As reported in the "Are all global alignment algorithms and implementations correct?" [paper](https://www.biorxiv.org/content/10.1101/031500v1.full.pdf) by Tomas Flouri, Kassian Kobert, Torbjørn Rognes, and Alexandros Stamatakis:

> In 1982 Gotoh presented an improved algorithm with lower time complexity. 
> Gotoh’s algorithm is frequently cited...
> While implementing the algorithm, we discovered two mathematical mistakes in Gotoh’s paper that induce sub-optimal sequence alignments.
> First, there are minor indexing mistakes in the dynamic programming algorithm which become apparent immediately when implementing the procedure.
> Hence, we report on these for the sake of completeness.
> Second, there is a more profound problem with the dynamic programming matrix initialization.
> This initialization issue can easily be missed and find its way into actual implementations.
> This error is also present in standard text books.
> Namely, the widely used books by Gusfield and Waterman.
> To obtain an initial estimate of the extent to which this error has been propagated, we scrutinized freely available undergraduate lecture slides.
> We found that 8 out of 31 lecture slides contained the mistake, while 16 out of 31 simply omit parts of the initialization, thus giving an incomplete description of the algorithm.
> Finally, by inspecting ten source codes and running respective tests, we found that five implementations were incorrect.

During my exploration of exiting implementations, I've noticed several bugs:

- several libraries initialize the header row/columns of penalty matrices with ±∞, causing overflows on the first iteration.
- initialize matrices to zeros, ignoring the first gap opening cost.
- combining opening and expansion costs where only the opening cost should be applied.
- even the most correct `needle` from EMBOSS uses `float` representation, which would obviously be numerically unstable on very long sequences.

## Installation

```bash
pip install git+https://github.com/ashvardanian/affine-gaps.git
```

## Usaging the Library

To obtain the alignment of two sequences, use the `needleman_wunsch_gotoh_alignment` function.

```python
from affine_gaps import needleman_wunsch_gotoh_alignment

insulin = "GIVEQCCTSICSLYQLENYCN"
glucagon = "HSQGTFTSDYSKYLDSRAEQDFV"
aligned_insulin, aligned_glucagon, aligned_score = needleman_wunsch_gotoh_alignment(insulin, glucagon)

print("Alignment 1:", aligned_insulin)  # GI-V---EQCC-TSICSLY---QL-ENYCN-
print("Alignment 2:", aligned_glucagon) # --D-FVHSQGTFTSDYSKYLDSRAEQDF--V
print("Score:", aligned_score)          # 41
```

If you only need the alignment score, you can use the `needleman_wunsch_gotoh_score` function, which uses less memory and works faster.

```python
from affine_gaps import needleman_wunsch_gotoh_score

score = needleman_wunsch_gotoh_score(insulin, glucagon)

print("Score:", score)
```

By default, a BLOSUM62 substitution matrix is used.
You can specify a different substitution matrix by passing it as an argument.

```python
from numpy import np

alphabet = "ARNDCQEGHILKMFPSTWYVBZX"
substitutions = np.zeros((len(alphabet), len(alphabet)), dtype=np.int8)
substitutions.fill(-1)
np.fill_diagonal(substitutions, 1)

aligned_insulin, aligned_glucagon, aligned_score = needleman_wunsch_gotoh_alignment(
    insulin, glucagon,
    substitution_alphabet=alphabet,
    substitution_matrix=substitutions,
    gap_opening=-2,
    gap_extension=-1,
)
```

That is similar to the following usage example of BioPython:

```python
from Bio import Align
from Bio.Align import substitution_matrices

aligner = Align.PairwiseAligner(mode="global")
aligner.substitution_matrix = substitution_matrices.load("BLOSUM62")
aligner.open_gap_score = open_gap_score
aligner.extend_gap_score = extend_gap_score
```

## Using the Command Line Interface

```bash
affine-gaps insulin glucagon
```

## Testing & Development

### Symmetry Test for Needleman-Wunsch

First, verify that the Needleman-Wunsch algorithm is symmetric with respect to the argument order, assuming the substitution matrix is symmetric.

```bash
pytest test.py -s -x -k symmetry
```

### Needleman-Wunsch and Levenshtein Score Equivalence

The Needleman-Wunsch alignment score should be equal to the negated Levenshtein distance for specific match/mismatch costs.

```bash
pytest test.py -s -x -k levenshtein
```

### Gap Expansion Test

Check the effect of gap expansions on alignment scores. This test ensures that increasing the width of gaps in alignments with zero gap extension penalties does not change the alignment score.

```bash
pytest test.py -s -x -k gap_expansions
```

### Comparison with BioPython Examples

Compare the affine gap alignment scores with BioPython for specific sequence pairs and scoring parameters. This test ensures that the Needleman-Wunsch-Gotoh alignment scores are at least as good as BioPython's PairwiseAligner scores.

```bash
pytest test.py -s -x -k biopython_examples
```

### Fuzzy Comparison with BioPython

Perform a fuzzy comparison of affine gap alignment scores with BioPython for randomly generated sequences. This test verifies that the Needleman-Wunsch-Gotoh alignment scores are at least as good as BioPython's PairwiseAligner scores for various gap penalties.

```bash
pytest test.py -s -x -k biopython_fuzzy
```

### EMBOSS and Other Tools

Seemingly the only correct known open-source implementation is located in `nucleus/embaln.c` file in the EMBOSS package in the `embAlignPathCalcWithEndGapPenalties` and `embAlignGetScoreNWMatrix` functions.
That program was originally [implemented in 1999 by Alan Bleasby](https://www.bioinformatics.nl/cgi-bin/emboss/help/needle) and tweaked in 2000 for better scoring.
That implementation has no SIMD optimizations, branchless-computing tricks, or other modern optimizations, but it's still widely recommended.
If you want to compare the results, you can download the EMBOSS source code and compile it with following commands:

```bash
wget -m 'ftp://emboss.open-bio.org/pub/EMBOSS/'
cd emboss.open-bio.org/pub/EMBOSS/
gunzip EMBOSS-latest.tar.gz
tar xf EMBOSS-latest.tar
cd EMBOSS-latest
./configure
```

Or if you simply want to explore the source:

```bash
cat emboss.open-bio.org/pub/EMBOSS/EMBOSS-6.6.0/nucleus/embaln.c
```
            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "affine-gaps",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.6",
    "maintainer_email": null,
    "keywords": "affine gaps, bioinformatics, numba, sequence alignment",
    "author": null,
    "author_email": "Ash Vardanian <1983160+ashvardanian@users.noreply.github.com>",
    "download_url": "https://files.pythonhosted.org/packages/57/03/970ce8c5eae9ef5279f4686d7db157dd1cc9d9eab21b6e1c6576dedd80fa/affine_gaps-0.2.0.tar.gz",
    "platform": null,
    "description": "# Affine Gaps\n\n__Affine Gaps__ is a __less-wrong__ single-file Numba-accelerated Python implementation of Osamu Gotoh affine gap penalty extensions 1982 [paper](https://doc.aporc.org/attach/Course001Papers/gotoh1982.pdf) for the Needleman-Wunsch and Smith-Waterman algorithms often used for global and local sequence alignment in Bioinformatics.\nThanks to the Numba JIT compiler, it's also competitive in terms of performance.\nBut if you want to go even faster and need more hardware-accelerated string operations, check out [StringZilla](https://github.com/ashvardanian/stringzilla) \ud83e\udd96\n\n## Less Wrong\n\nAs reported in the \"Are all global alignment algorithms and implementations correct?\" [paper](https://www.biorxiv.org/content/10.1101/031500v1.full.pdf) by Tomas Flouri, Kassian Kobert, Torbj\u00f8rn Rognes, and Alexandros Stamatakis:\n\n> In 1982 Gotoh presented an improved algorithm with lower time complexity. \n> Gotoh\u2019s algorithm is frequently cited...\n> While implementing the algorithm, we discovered two mathematical mistakes in Gotoh\u2019s paper that induce sub-optimal sequence alignments.\n> First, there are minor indexing mistakes in the dynamic programming algorithm which become apparent immediately when implementing the procedure.\n> Hence, we report on these for the sake of completeness.\n> Second, there is a more profound problem with the dynamic programming matrix initialization.\n> This initialization issue can easily be missed and find its way into actual implementations.\n> This error is also present in standard text books.\n> Namely, the widely used books by Gusfield and Waterman.\n> To obtain an initial estimate of the extent to which this error has been propagated, we scrutinized freely available undergraduate lecture slides.\n> We found that 8 out of 31 lecture slides contained the mistake, while 16 out of 31 simply omit parts of the initialization, thus giving an incomplete description of the algorithm.\n> Finally, by inspecting ten source codes and running respective tests, we found that five implementations were incorrect.\n\nDuring my exploration of exiting implementations, I've noticed several bugs:\n\n- several libraries initialize the header row/columns of penalty matrices with \u00b1\u221e, causing overflows on the first iteration.\n- initialize matrices to zeros, ignoring the first gap opening cost.\n- combining opening and expansion costs where only the opening cost should be applied.\n- even the most correct `needle` from EMBOSS uses `float` representation, which would obviously be numerically unstable on very long sequences.\n\n## Installation\n\n```bash\npip install git+https://github.com/ashvardanian/affine-gaps.git\n```\n\n## Usaging the Library\n\nTo obtain the alignment of two sequences, use the `needleman_wunsch_gotoh_alignment` function.\n\n```python\nfrom affine_gaps import needleman_wunsch_gotoh_alignment\n\ninsulin = \"GIVEQCCTSICSLYQLENYCN\"\nglucagon = \"HSQGTFTSDYSKYLDSRAEQDFV\"\naligned_insulin, aligned_glucagon, aligned_score = needleman_wunsch_gotoh_alignment(insulin, glucagon)\n\nprint(\"Alignment 1:\", aligned_insulin)  # GI-V---EQCC-TSICSLY---QL-ENYCN-\nprint(\"Alignment 2:\", aligned_glucagon) # --D-FVHSQGTFTSDYSKYLDSRAEQDF--V\nprint(\"Score:\", aligned_score)          # 41\n```\n\nIf you only need the alignment score, you can use the `needleman_wunsch_gotoh_score` function, which uses less memory and works faster.\n\n```python\nfrom affine_gaps import needleman_wunsch_gotoh_score\n\nscore = needleman_wunsch_gotoh_score(insulin, glucagon)\n\nprint(\"Score:\", score)\n```\n\nBy default, a BLOSUM62 substitution matrix is used.\nYou can specify a different substitution matrix by passing it as an argument.\n\n```python\nfrom numpy import np\n\nalphabet = \"ARNDCQEGHILKMFPSTWYVBZX\"\nsubstitutions = np.zeros((len(alphabet), len(alphabet)), dtype=np.int8)\nsubstitutions.fill(-1)\nnp.fill_diagonal(substitutions, 1)\n\naligned_insulin, aligned_glucagon, aligned_score = needleman_wunsch_gotoh_alignment(\n    insulin, glucagon,\n    substitution_alphabet=alphabet,\n    substitution_matrix=substitutions,\n    gap_opening=-2,\n    gap_extension=-1,\n)\n```\n\nThat is similar to the following usage example of BioPython:\n\n```python\nfrom Bio import Align\nfrom Bio.Align import substitution_matrices\n\naligner = Align.PairwiseAligner(mode=\"global\")\naligner.substitution_matrix = substitution_matrices.load(\"BLOSUM62\")\naligner.open_gap_score = open_gap_score\naligner.extend_gap_score = extend_gap_score\n```\n\n## Using the Command Line Interface\n\n```bash\naffine-gaps insulin glucagon\n```\n\n## Testing & Development\n\n### Symmetry Test for Needleman-Wunsch\n\nFirst, verify that the Needleman-Wunsch algorithm is symmetric with respect to the argument order, assuming the substitution matrix is symmetric.\n\n```bash\npytest test.py -s -x -k symmetry\n```\n\n### Needleman-Wunsch and Levenshtein Score Equivalence\n\nThe Needleman-Wunsch alignment score should be equal to the negated Levenshtein distance for specific match/mismatch costs.\n\n```bash\npytest test.py -s -x -k levenshtein\n```\n\n### Gap Expansion Test\n\nCheck the effect of gap expansions on alignment scores. This test ensures that increasing the width of gaps in alignments with zero gap extension penalties does not change the alignment score.\n\n```bash\npytest test.py -s -x -k gap_expansions\n```\n\n### Comparison with BioPython Examples\n\nCompare the affine gap alignment scores with BioPython for specific sequence pairs and scoring parameters. This test ensures that the Needleman-Wunsch-Gotoh alignment scores are at least as good as BioPython's PairwiseAligner scores.\n\n```bash\npytest test.py -s -x -k biopython_examples\n```\n\n### Fuzzy Comparison with BioPython\n\nPerform a fuzzy comparison of affine gap alignment scores with BioPython for randomly generated sequences. This test verifies that the Needleman-Wunsch-Gotoh alignment scores are at least as good as BioPython's PairwiseAligner scores for various gap penalties.\n\n```bash\npytest test.py -s -x -k biopython_fuzzy\n```\n\n### EMBOSS and Other Tools\n\nSeemingly the only correct known open-source implementation is located in `nucleus/embaln.c` file in the EMBOSS package in the `embAlignPathCalcWithEndGapPenalties` and `embAlignGetScoreNWMatrix` functions.\nThat program was originally [implemented in 1999 by Alan Bleasby](https://www.bioinformatics.nl/cgi-bin/emboss/help/needle) and tweaked in 2000 for better scoring.\nThat implementation has no SIMD optimizations, branchless-computing tricks, or other modern optimizations, but it's still widely recommended.\nIf you want to compare the results, you can download the EMBOSS source code and compile it with following commands:\n\n```bash\nwget -m 'ftp://emboss.open-bio.org/pub/EMBOSS/'\ncd emboss.open-bio.org/pub/EMBOSS/\ngunzip EMBOSS-latest.tar.gz\ntar xf EMBOSS-latest.tar\ncd EMBOSS-latest\n./configure\n```\n\nOr if you simply want to explore the source:\n\n```bash\ncat emboss.open-bio.org/pub/EMBOSS/EMBOSS-6.6.0/nucleus/embaln.c\n```",
    "bugtrack_url": null,
    "license": null,
    "summary": "Numba-accelerated Python implementation of affine gap penalty extensions for Needleman-Wunsch and Smith-Waterman algorithms",
    "version": "0.2.0",
    "project_urls": null,
    "split_keywords": [
        "affine gaps",
        " bioinformatics",
        " numba",
        " sequence alignment"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "c2e25a1793e76353a9af8433473236a4cd13fa77257047450d9b7d7ab3a3aee6",
                "md5": "c29a95dd6be2b634d3b874927a05aa65",
                "sha256": "307426637475147d7e745ee6160068047c45e033bd14be51881f26bb92e7eb0e"
            },
            "downloads": -1,
            "filename": "affine_gaps-0.2.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "c29a95dd6be2b634d3b874927a05aa65",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.6",
            "size": 13409,
            "upload_time": "2024-05-24T07:54:04",
            "upload_time_iso_8601": "2024-05-24T07:54:04.216529Z",
            "url": "https://files.pythonhosted.org/packages/c2/e2/5a1793e76353a9af8433473236a4cd13fa77257047450d9b7d7ab3a3aee6/affine_gaps-0.2.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "5703970ce8c5eae9ef5279f4686d7db157dd1cc9d9eab21b6e1c6576dedd80fa",
                "md5": "64f646181bca3029d4070d6a0f8471d2",
                "sha256": "765614f4738f30a896f42b72924348b094fd80c50aeb9b58da75293f70688b46"
            },
            "downloads": -1,
            "filename": "affine_gaps-0.2.0.tar.gz",
            "has_sig": false,
            "md5_digest": "64f646181bca3029d4070d6a0f8471d2",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.6",
            "size": 18550,
            "upload_time": "2024-05-24T07:54:05",
            "upload_time_iso_8601": "2024-05-24T07:54:05.428225Z",
            "url": "https://files.pythonhosted.org/packages/57/03/970ce8c5eae9ef5279f4686d7db157dd1cc9d9eab21b6e1c6576dedd80fa/affine_gaps-0.2.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-05-24 07:54:05",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "affine-gaps"
}
        
Elapsed time: 0.52340s