# pairwise-sequence-alignment (psa)
![PyPI - Version](https://img.shields.io/pypi/v/pairwise-sequence-alignment)
This is a Python module to calculate a pairwise alignment between biological sequences (protein or nucleic acid). This module uses the [needle](https://www.ebi.ac.uk/Tools/psa/emboss_needle/), [stretcher](https://www.ebi.ac.uk/jdispatcher/psa/emboss_stretcher) and [water](https://www.ebi.ac.uk/Tools/psa/emboss_water/) tools from the EMBOSS package to calculate an optimal, global/local pairwise alignment.
I wrote this module for two reasons. First, the needle and water tools are faster than any Python implementation. Second, Biopython has dropped support for tools from the EMBOSS package and recommends running them via the subprocess module directly.
## Table of contents
* [Introduction](#introduction)
* [Requirements](#requirements)
* [Installation](#installation)
* [Quick Start](#quick-start)
* [Alignment object](#alignment-object)
* [Attributes](#attributes)
* [Methods](#methods)
* [Usage examples](#usage-examples)
* [Alignment information](#alignment-information)
* [Alignment in raw format](#alignment-in-raw-format)
* [Alignment in FASTA format](#alignment-in-fasta-format)
* [Alignment iteration](#alignment-iteration)
* [Alignment iteration with index](#alignment-iteration-with-index)
* [Query coverage](#query-coverage)
* [Scoring scheme for alignment](#scoring-scheme-for-alignment)
* [Statistical significance of alignment](#statistical-significance-of-alignment)
* [All-against-all pairwise alignments](#all-against-all-pairwise-alignments)
* [Tests](#tests)
* [License](#license)
## Introduction
Pairwise sequence alignment identifies regions of similarity between two sequences, which can indicate functional, structural, or evolutionary relationships.
1. Global alignment aligns two sequences from start to finish, ideal for sequences that are similar and of comparable length.
* *needle* [Needleman-Wunsch algorithm](https://en.wikipedia.org/wiki/Needleman–Wunsch_algorithm)) calculates a full-length global alignment by maximizing similarity across both sequences through a dynamic programming approach.
* *stretcher* [documentation](https://galaxy-iuc.github.io/emboss-5.0-docs/stretcher.html) performs global alignment using a modified dynamic programming algorithm optimized for linear space efficiency.
2. Local alignment (*water*; [Smith-Waterman algorithm](https://en.wikipedia.org/wiki/Smith–Waterman_algorithm)) finds the region with the highest level of similarity between the two sequences. It is suitable for sequences that are not assumed to be similar over the entire length.
## Requirements
1. Python >= 3.8
> Check with `python3 --version`
2. EMBOSS >= 6.6.0
> Check with `needle -version` or `water -version`.
## Installation
You can install the module from [PyPI](https://pypi.org/project/pairwise-sequence-alignment/):
```
pip install pairwise-sequence-alignment
```
or directly from GitHub:
```
pip install "git+https://github.com/aziele/pairwise-sequence-alignment.git"
```
or you can use the module without installation. Simply clone or download this repository and you're ready to use it.
## Quick Start
```python
import psa
# Global alignment
aln = psa.needle(moltype='nucl', qseq='ATGCTAGTA', sseq='ATGCTAGTAGATGATGA')
aln = psa.needle(moltype='prot', qseq='MKSTVWSG', sseq='MKSSVLW')
# Local alignment
aln = psa.water(moltype='nucl', qseq='ATGCTAGTA', sseq='ATGCTAGTAGATGATGAT')
aln = psa.water(moltype='prot', qseq='MKSTVWSG', sseq='MKSSVLW')
print(aln.score) # 20.0
print(aln.pidentity) # 71.4
print(aln.psimilarity) # 85.7
print(aln.pgaps) # 14.3
print(aln.qseq) # MKSTV-W
print(aln.sseq) # MKSSVLW
```
## Alignment object
### Attributes
| Attribute | Description |
| --- | --- |
| `qid` | Query sequence identifier |
| `sid` | Subject sequence identifier |
| `qseq` | Query unaligned sequence |
| `sseq` | Subject unaligned sequence |
| `qaln` | Query aligned sequence
| `saln` | Subject aligned sequence |
| `qstart` | Start of alignment in query |
| `qend` | End of alignment in query |
| `sstart` | Start of alignment in subject |
| `send` | End of alignment in subject |
| `length` | Alignment length |
| `score` | Alignment score |
| `nidentity` | Number of identical matches in the alignment |
| `pidentity` | Percentage of identical matches in the alignment |
| `nsimilarity` | Number of positive-scoring matches in the alignment |
| `psimilarity` | Percentage of positive-scoring matches in the alignment |
| `ngaps` | Total number of gaps in the alignment |
| `pgaps` | Total percentage of gaps in the alignment |
| `moltype` | nucl/prot |
| `program` | needle/water |
| `gapopen` | Gap open penalty |
| `gapextend` | Gap extension penalty |
| `matrix` | Name of scoring matrix |
| `raw` | Raw output obtained from EMBOSS' needle/water |
### Methods
| Method | Description |
| --- | --- |
| `query_coverage()` | Returns a query coverage [%] |
| `subject_coverage()` | Returns a subject coverage [%] |
| `pvalue()` | Returns a p-value of the alignment |
| `fasta()` | Returns pairwise alignment in FASTA/Pearson format |
## Usage examples
### Alignment information
```python
import psa
aln = psa.needle(
moltype='prot',
qseq='MTSPSTKNSDDKGRPNLSSTEYFANTNVLTCRLKWVNPDTFIMDPRKPQLHSRT',
sseq='MTTPSRENSDDKGRPIEEASNLSSTEYFANTNVLTCKLKYVNPDTFIMDPRKP',
qid='seq1',
sid='seq2'
)
print(aln.score) # 225.0
print(aln.length) # 59
print(aln.pidentity) # 72.9
print(aln.psimilarity) # 79.7
print(aln.pgaps) # 18.6
print(aln.ngaps) # 11
print(aln.qseq) # MTSPSTKNSDDKGRP-----NLSSTEYFANTNVLTCRLKWVNPDTFIMDPRKPQLHSRT
print(aln.sseq) # MTTPSRENSDDKGRPIEEASNLSSTEYFANTNVLTCKLKYVNPDTFIMDPRKP------
print(aln.qstart) # 1
print(aln.qend) # 54
print(aln.sstart) # 1
print(aln.send) # 53
print(aln.program) # needle
print(aln.gapopen) # 10
print(aln.gapextend) # 0.5
print(aln.matrix) # EBLOSUM62
```
### Alignment in raw format
```python
print(aln.raw)
```
Output:
```
#=======================================
#
# Aligned_sequences: 2
# 1: seq1
# 2: seq2
# Matrix: EBLOSUM62
# Gap_penalty: 10.0
# Extend_penalty: 0.5
#
# Length: 59
# Identity: 43/59 (72.9%)
# Similarity: 47/59 (79.7%)
# Gaps: 11/59 (18.6%)
# Score: 225.0
#
#
#=======================================
seq1 1 MTSPSTKNSDDKGRP-----NLSSTEYFANTNVLTCRLKWVNPDTFIMDP 45
||:||.:|||||||| ||||||||||||||||:||:||||||||||
seq2 1 MTTPSRENSDDKGRPIEEASNLSSTEYFANTNVLTCKLKYVNPDTFIMDP 50
seq1 46 RKPQLHSRT 54
|||
seq2 51 RKP------ 53
#---------------------------------------
#---------------------------------------
```
### Alignment in FASTA format
```python
print(aln.fasta(wrap=30))
```
Output:
```
>seq1 1-54
MTSPSTKNSDDKGRP-----NLSSTEYFAN
TNVLTCRLKWVNPDTFIMDPRKPQLHSRT
>seq2 1-53
MTTPSRENSDDKGRPIEEASNLSSTEYFAN
TNVLTCKLKYVNPDTFIMDPRKP------
```
### Alignment iteration
You can iterate over the alignment residue-by-residude:
```python
for aa1, aa2 in aln:
if aa1 != aa2:
print(aa1, aa2)
```
Output:
```
S T
T R
K E
- I
- E
- E
- A
- S
R K
W Y
Q -
L -
H -
S -
R -
T -
```
### Alignment iteration with index
You can also loop through residudes in an alignment with the information of their position.
```python
for i, (aa1, aa2) in enumerate(aln):
if aa1 != aa2:
print(i, aa1, aa2)
```
Output:
```
2 S T
5 T R
6 K E
15 - I
16 - E
17 - E
18 - A
19 - S
36 R K
39 W Y
53 Q -
54 L -
55 H -
56 S -
57 R -
58 T -
```
### Query coverage
Query coverage describes how much of the query sequence is covered in the alignment by the subject sequence. Specifically, query coverage is the percentage of the query sequence length that is included in the alignment. In global alignments, query coverage is always 100% because both the sequences, query and subject, are aligned from end to end. It is thus more useful to calculate query coverage from local alignments.
```python
import psa
aln = psa.water(
moltype='prot',
qseq='MTSPSTKNSDDKGRPNLSSTEYFANTNVLTCRLKWVNPDTFIMDPRKPQLHSRT',
sseq='NSDDKGRPIEEASNLSSTEYFANTNVLTCKLKYVNPDTFIMDPRKP',
qid='seq1',
sid='seq2'
)
# seq1 8 NSDDKGRP-----NLSSTEYFANTNVLTCRLKWVNPDTFIMDPRKP 48
# |||||||| ||||||||||||||||:||:|||||||||||||
# seq2 1 NSDDKGRPIEEASNLSSTEYFANTNVLTCKLKYVNPDTFIMDPRKP 46
print(aln.query_coverage())
# 75.9
print(aln.subject_coverage())
# 100
```
### Scoring scheme for alignment
You can change a scoring matrix and penalties for the gap open and extension to calculate the alignment.
```python
import psa
aln = psa.water(
moltype='prot',
qseq='MKSTWYERNST',
sseq='MKSTGYWTRESA',
matrix='EBLOSUM30',
gapopen=5,
gapextend=0.2
)
print(aln)
```
Output:
```
#=======================================
#
# Aligned_sequences: 2
# 1: query
# 2: subject
# Matrix: EBLOSUM30
# Gap_penalty: 5.0
# Extend_penalty: 0.2
#
# Length: 13
# Identity: 7/13 (53.8%)
# Similarity: 8/13 (61.5%)
# Gaps: 3/13 (23.1%)
# Score: 39.8
#
#
#=======================================
query 1 MKST--WYERNST 11
|||| |. |.|:
subject 1 MKSTGYWT-RESA 12
#---------------------------------------
#---------------------------------------
```
### Statistical significance of alignment
The Needleman-Wunsch and Smith-Waterman algorithms will always find an optimal alignment between two sequences, whether or not they are evolutionarily related. The strength of an alignment is determined by its score. However, often it is necessary to know if a score is high enough to indicate a biologically interesting alignment. The statistical significance of the score is assessed by the *P*-value, which describes how likely it is that two random sequences of similar length and composition will align with a score equal to or better than our target alignment.
The `.pvalue()` method calculates the *P*-value of the alignment between query and subject sequences. The method shuffles a subject sequence many times (100 by default) and calculates the alignment score between the query and each shuffled subject sequence. It then counts how many times the alignment score was greater than or equal to the alignment score of the original query and subject sequences. For example, if 100 such shuffles all produce alignment scores that are lower than the observed alignment score, then one can say that the *P*-value is likely to be less than 0.01.
```python
import psa
aln = psa.needle(moltype='prot', qseq='MKSTVILK', sseq='MKSRSLK')
print(aln.pvalue()) # 0.16
```
### All-against-all pairwise alignments
For more than two sequences, you can calculate alignments between every pair of your input sequences.
```python
import itertools
import psa
# Input sequences
sequences = {
'dna1': 'ATCGAGATCGAGATGGCGATAG',
'dna2': 'ATGCTGATCGTAGGGGC',
'dna3': 'GTCGGATCCTCGATGGAGA',
'dna4': 'TTTGGGAATGCGTAGGAGCTA',
'dna5': 'CCGTGATGCGATGCA'
}
# All-against-all pairwise alignments
for qid, sid in itertools.combinations(sequences, r=2):
qseq = sequences[qid]
sseq = sequences[sid]
aln = psa.needle(moltype='nucl', qseq=qseq, sseq=sseq)
print(f'{qid} {sid} {aln.pidentity:.1f}% {aln.score}')
```
Output:
```
dna1 dna2 38.5% 24.0
dna1 dna3 60.9% 34.0
dna1 dna4 35.7% 20.0
dna1 dna5 45.8% 27.0
dna2 dna3 43.5% 18.0
dna2 dna4 42.3% 39.0
dna2 dna5 39.1% 22.0
dna3 dna4 20.0% 14.0
dna3 dna5 52.4% 26.0
dna4 dna5 40.9% 14.0
```
### All-against-all pairwise alignments of sequences from a FASTA file
If you have multiple sequences in a FASTA file, you can use [Biopython](https://biopython.org) to read them and then calculate pairwise alignments.
```python
import itertools
import psa
from Bio import SeqIO
# Input sequences
sequences = {}
for seq_record in SeqIO.parse('sequences.fasta', 'fasta'):
sequences[seq_record.id] = str(seq_record.seq)
# All-against-all pairwise alignments
for qid, sid in itertools.combinations(sequences, r=2):
qseq = sequences[qid]
sseq = sequences[sid]
aln = psa.needle(moltype='nucl', qseq=qseq, sseq=sseq)
print(f'{qid} {sid} {aln.pidentity:.1f}% {aln.score}')
```
Output:
```
dna1 dna2 38.5% 24.0
dna1 dna3 60.9% 34.0
dna1 dna4 35.7% 20.0
dna1 dna5 45.8% 27.0
dna2 dna3 43.5% 18.0
dna2 dna4 42.3% 39.0
dna2 dna5 39.1% 22.0
dna3 dna4 20.0% 14.0
dna3 dna5 52.4% 26.0
dna4 dna5 40.9% 14.0
```
## Tests
If you want to check that everything works as intended, just run:
```
./test.py
```
## License
[GNU General Public License, version 3](https://www.gnu.org/licenses/gpl-3.0.html)
Raw data
{
"_id": null,
"home_page": null,
"name": "pairwise-sequence-alignment",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": null,
"keywords": "sequence alignment, nucleotide, protein, pairwise alignment, needle, water, emboss",
"author": null,
"author_email": "Andrzej Zielezinski <a.zielezinski@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/0d/6c/a91d34bdf3687dfd8515afaffba685f1b59cc516f46ae2c37fe48b32160e/pairwise_sequence_alignment-1.0.1.tar.gz",
"platform": null,
"description": "# pairwise-sequence-alignment (psa)\n\n![PyPI - Version](https://img.shields.io/pypi/v/pairwise-sequence-alignment)\n\nThis is a Python module to calculate a pairwise alignment between biological sequences (protein or nucleic acid). This module uses the [needle](https://www.ebi.ac.uk/Tools/psa/emboss_needle/), [stretcher](https://www.ebi.ac.uk/jdispatcher/psa/emboss_stretcher) and [water](https://www.ebi.ac.uk/Tools/psa/emboss_water/) tools from the EMBOSS package to calculate an optimal, global/local pairwise alignment.\n\nI wrote this module for two reasons. First, the needle and water tools are faster than any Python implementation. Second, Biopython has dropped support for tools from the EMBOSS package and recommends running them via the subprocess module directly.\n\n## Table of contents\n\n* [Introduction](#introduction)\n* [Requirements](#requirements)\n* [Installation](#installation)\n* [Quick Start](#quick-start)\n* [Alignment object](#alignment-object)\n * [Attributes](#attributes)\n * [Methods](#methods)\n* [Usage examples](#usage-examples)\n * [Alignment information](#alignment-information)\n * [Alignment in raw format](#alignment-in-raw-format)\n * [Alignment in FASTA format](#alignment-in-fasta-format)\n * [Alignment iteration](#alignment-iteration)\n * [Alignment iteration with index](#alignment-iteration-with-index)\n * [Query coverage](#query-coverage)\n * [Scoring scheme for alignment](#scoring-scheme-for-alignment)\n * [Statistical significance of alignment](#statistical-significance-of-alignment)\n * [All-against-all pairwise alignments](#all-against-all-pairwise-alignments)\n* [Tests](#tests)\n* [License](#license)\n\n\n## Introduction\nPairwise sequence alignment identifies regions of similarity between two sequences, which can indicate functional, structural, or evolutionary relationships.\n\n1. Global alignment aligns two sequences from start to finish, ideal for sequences that are similar and of comparable length.\n * *needle* [Needleman-Wunsch algorithm](https://en.wikipedia.org/wiki/Needleman\u2013Wunsch_algorithm)) calculates a full-length global alignment by maximizing similarity across both sequences through a dynamic programming approach.\n * *stretcher* [documentation](https://galaxy-iuc.github.io/emboss-5.0-docs/stretcher.html) performs global alignment using a modified dynamic programming algorithm optimized for linear space efficiency. \n\n2. Local alignment (*water*; [Smith-Waterman algorithm](https://en.wikipedia.org/wiki/Smith\u2013Waterman_algorithm)) finds the region with the highest level of similarity between the two sequences. It is suitable for sequences that are not assumed to be similar over the entire length.\n\n\n## Requirements\n\n1. Python >= 3.8\n > Check with `python3 --version`\n2. EMBOSS >= 6.6.0\n > Check with `needle -version` or `water -version`.\n\n\n## Installation\n\nYou can install the module from [PyPI](https://pypi.org/project/pairwise-sequence-alignment/):\n\n```\npip install pairwise-sequence-alignment\n```\n\nor directly from GitHub:\n\n```\npip install \"git+https://github.com/aziele/pairwise-sequence-alignment.git\"\n```\n\nor you can use the module without installation. Simply clone or download this repository and you're ready to use it.\n\n\n## Quick Start\n\n```python\nimport psa\n\n# Global alignment\naln = psa.needle(moltype='nucl', qseq='ATGCTAGTA', sseq='ATGCTAGTAGATGATGA')\naln = psa.needle(moltype='prot', qseq='MKSTVWSG', sseq='MKSSVLW')\n\n# Local alignment\naln = psa.water(moltype='nucl', qseq='ATGCTAGTA', sseq='ATGCTAGTAGATGATGAT')\naln = psa.water(moltype='prot', qseq='MKSTVWSG', sseq='MKSSVLW')\n\nprint(aln.score) # 20.0\nprint(aln.pidentity) # 71.4\nprint(aln.psimilarity) # 85.7\nprint(aln.pgaps) # 14.3\nprint(aln.qseq) # MKSTV-W\nprint(aln.sseq) # MKSSVLW\n```\n\n## Alignment object\n\n### Attributes\n\n| Attribute | Description |\n| --- | --- |\n| `qid` | Query sequence identifier |\n| `sid` | Subject sequence identifier |\n| `qseq` | Query unaligned sequence |\n| `sseq` | Subject unaligned sequence |\n| `qaln` | Query aligned sequence \n| `saln` | Subject aligned sequence |\n| `qstart` | Start of alignment in query |\n| `qend` | End of alignment in query |\n| `sstart` | Start of alignment in subject |\n| `send` | End of alignment in subject |\n| `length` | Alignment length |\n| `score` | Alignment score |\n| `nidentity` | Number of identical matches in the alignment |\n| `pidentity` | Percentage of identical matches in the alignment |\n| `nsimilarity` | Number of positive-scoring matches in the alignment |\n| `psimilarity` | Percentage of positive-scoring matches in the alignment |\n| `ngaps` | Total number of gaps in the alignment |\n| `pgaps` | Total percentage of gaps in the alignment |\n| `moltype` | nucl/prot |\n| `program` | needle/water |\n| `gapopen` | Gap open penalty |\n| `gapextend` | Gap extension penalty |\n| `matrix` | Name of scoring matrix | \n| `raw` | Raw output obtained from EMBOSS' needle/water |\n\n\n### Methods\n\n| Method | Description |\n| --- | --- |\n| `query_coverage()` | Returns a query coverage [%] |\n| `subject_coverage()` | Returns a subject coverage [%] |\n| `pvalue()` | Returns a p-value of the alignment |\n| `fasta()` | Returns pairwise alignment in FASTA/Pearson format |\n\n\n## Usage examples\n\n### Alignment information\n\n```python\nimport psa\n\naln = psa.needle(\n moltype='prot',\n qseq='MTSPSTKNSDDKGRPNLSSTEYFANTNVLTCRLKWVNPDTFIMDPRKPQLHSRT',\n sseq='MTTPSRENSDDKGRPIEEASNLSSTEYFANTNVLTCKLKYVNPDTFIMDPRKP',\n qid='seq1',\n sid='seq2'\n)\n\nprint(aln.score) # 225.0\nprint(aln.length) # 59\nprint(aln.pidentity) # 72.9\nprint(aln.psimilarity) # 79.7\nprint(aln.pgaps) # 18.6\nprint(aln.ngaps) # 11\nprint(aln.qseq) # MTSPSTKNSDDKGRP-----NLSSTEYFANTNVLTCRLKWVNPDTFIMDPRKPQLHSRT\nprint(aln.sseq) # MTTPSRENSDDKGRPIEEASNLSSTEYFANTNVLTCKLKYVNPDTFIMDPRKP------\nprint(aln.qstart) # 1\nprint(aln.qend) # 54\nprint(aln.sstart) # 1\nprint(aln.send) # 53\nprint(aln.program) # needle\nprint(aln.gapopen) # 10\nprint(aln.gapextend) # 0.5\nprint(aln.matrix) # EBLOSUM62\n```\n\n### Alignment in raw format\n\n```python\nprint(aln.raw)\n```\n\nOutput:\n\n```\n#=======================================\n#\n# Aligned_sequences: 2\n# 1: seq1\n# 2: seq2\n# Matrix: EBLOSUM62\n# Gap_penalty: 10.0\n# Extend_penalty: 0.5\n#\n# Length: 59\n# Identity: 43/59 (72.9%)\n# Similarity: 47/59 (79.7%)\n# Gaps: 11/59 (18.6%)\n# Score: 225.0\n# \n#\n#=======================================\n\nseq1 1 MTSPSTKNSDDKGRP-----NLSSTEYFANTNVLTCRLKWVNPDTFIMDP 45\n ||:||.:|||||||| ||||||||||||||||:||:||||||||||\nseq2 1 MTTPSRENSDDKGRPIEEASNLSSTEYFANTNVLTCKLKYVNPDTFIMDP 50\n\nseq1 46 RKPQLHSRT 54\n ||| \nseq2 51 RKP------ 53\n\n\n#---------------------------------------\n#---------------------------------------\n```\n\n### Alignment in FASTA format\n\n```python\nprint(aln.fasta(wrap=30))\n```\n\nOutput:\n\n```\n>seq1 1-54\nMTSPSTKNSDDKGRP-----NLSSTEYFAN\nTNVLTCRLKWVNPDTFIMDPRKPQLHSRT\n>seq2 1-53\nMTTPSRENSDDKGRPIEEASNLSSTEYFAN\nTNVLTCKLKYVNPDTFIMDPRKP------\n```\n\n### Alignment iteration\n\nYou can iterate over the alignment residue-by-residude:\n\n```python\nfor aa1, aa2 in aln:\n if aa1 != aa2:\n print(aa1, aa2)\n```\n\nOutput:\n\n```\nS T\nT R\nK E\n- I\n- E\n- E\n- A\n- S\nR K\nW Y\nQ -\nL -\nH -\nS -\nR -\nT -\n```\n\n### Alignment iteration with index\n\nYou can also loop through residudes in an alignment with the information of their position.\n\n```python\nfor i, (aa1, aa2) in enumerate(aln):\n if aa1 != aa2:\n print(i, aa1, aa2)\n```\n\nOutput:\n\n```\n2 S T\n5 T R\n6 K E\n15 - I\n16 - E\n17 - E\n18 - A\n19 - S\n36 R K\n39 W Y\n53 Q -\n54 L -\n55 H -\n56 S -\n57 R -\n58 T -\n```\n\n### Query coverage\n\nQuery coverage describes how much of the query sequence is covered in the alignment by the subject sequence. Specifically, query coverage is the percentage of the query sequence length that is included in the alignment. In global alignments, query coverage is always 100% because both the sequences, query and subject, are aligned from end to end. It is thus more useful to calculate query coverage from local alignments.\n\n```python\nimport psa\n\naln = psa.water(\n moltype='prot',\n qseq='MTSPSTKNSDDKGRPNLSSTEYFANTNVLTCRLKWVNPDTFIMDPRKPQLHSRT',\n sseq='NSDDKGRPIEEASNLSSTEYFANTNVLTCKLKYVNPDTFIMDPRKP',\n qid='seq1',\n sid='seq2'\n)\n# seq1 8 NSDDKGRP-----NLSSTEYFANTNVLTCRLKWVNPDTFIMDPRKP 48\n# |||||||| ||||||||||||||||:||:|||||||||||||\n# seq2 1 NSDDKGRPIEEASNLSSTEYFANTNVLTCKLKYVNPDTFIMDPRKP 46\n\nprint(aln.query_coverage())\n# 75.9\nprint(aln.subject_coverage())\n# 100\n```\n\n### Scoring scheme for alignment\n\nYou can change a scoring matrix and penalties for the gap open and extension to calculate the alignment.\n\n```python\nimport psa\n\naln = psa.water(\n moltype='prot',\n qseq='MKSTWYERNST',\n sseq='MKSTGYWTRESA',\n matrix='EBLOSUM30',\n gapopen=5,\n gapextend=0.2\n)\nprint(aln)\n```\n\nOutput:\n\n```\n#=======================================\n#\n# Aligned_sequences: 2\n# 1: query\n# 2: subject\n# Matrix: EBLOSUM30\n# Gap_penalty: 5.0\n# Extend_penalty: 0.2\n#\n# Length: 13\n# Identity: 7/13 (53.8%)\n# Similarity: 8/13 (61.5%)\n# Gaps: 3/13 (23.1%)\n# Score: 39.8\n# \n#\n#=======================================\n\nquery 1 MKST--WYERNST 11\n |||| |. |.|:\nsubject 1 MKSTGYWT-RESA 12\n\n\n#---------------------------------------\n#---------------------------------------\n```\n\n### Statistical significance of alignment\n\nThe Needleman-Wunsch and Smith-Waterman algorithms will always find an optimal alignment between two sequences, whether or not they are evolutionarily related. The strength of an alignment is determined by its score. However, often it is necessary to know if a score is high enough to indicate a biologically interesting alignment. The statistical significance of the score is assessed by the *P*-value, which describes how likely it is that two random sequences of similar length and composition will align with a score equal to or better than our target alignment. \n\nThe `.pvalue()` method calculates the *P*-value of the alignment between query and subject sequences. The method shuffles a subject sequence many times (100 by default) and calculates the alignment score between the query and each shuffled subject sequence. It then counts how many times the alignment score was greater than or equal to the alignment score of the original query and subject sequences. For example, if 100 such shuffles all produce alignment scores that are lower than the observed alignment score, then one can say that the *P*-value is likely to be less than 0.01.\n\n```python\nimport psa\n\naln = psa.needle(moltype='prot', qseq='MKSTVILK', sseq='MKSRSLK')\n\nprint(aln.pvalue()) # 0.16\n```\n\n\n### All-against-all pairwise alignments\n\nFor more than two sequences, you can calculate alignments between every pair of your input sequences.\n\n```python\nimport itertools\nimport psa\n\n# Input sequences\nsequences = {\n 'dna1': 'ATCGAGATCGAGATGGCGATAG',\n 'dna2': 'ATGCTGATCGTAGGGGC',\n 'dna3': 'GTCGGATCCTCGATGGAGA',\n 'dna4': 'TTTGGGAATGCGTAGGAGCTA',\n 'dna5': 'CCGTGATGCGATGCA'\n}\n\n# All-against-all pairwise alignments\nfor qid, sid in itertools.combinations(sequences, r=2):\n qseq = sequences[qid]\n sseq = sequences[sid]\n aln = psa.needle(moltype='nucl', qseq=qseq, sseq=sseq)\n print(f'{qid} {sid} {aln.pidentity:.1f}% {aln.score}')\n```\n\nOutput:\n\n```\ndna1 dna2 38.5% 24.0\ndna1 dna3 60.9% 34.0\ndna1 dna4 35.7% 20.0\ndna1 dna5 45.8% 27.0\ndna2 dna3 43.5% 18.0\ndna2 dna4 42.3% 39.0\ndna2 dna5 39.1% 22.0\ndna3 dna4 20.0% 14.0\ndna3 dna5 52.4% 26.0\ndna4 dna5 40.9% 14.0\n```\n\n### All-against-all pairwise alignments of sequences from a FASTA file\n\nIf you have multiple sequences in a FASTA file, you can use [Biopython](https://biopython.org) to read them and then calculate pairwise alignments.\n\n```python\nimport itertools\nimport psa\n\nfrom Bio import SeqIO\n\n# Input sequences\nsequences = {}\nfor seq_record in SeqIO.parse('sequences.fasta', 'fasta'):\n sequences[seq_record.id] = str(seq_record.seq)\n\n# All-against-all pairwise alignments\nfor qid, sid in itertools.combinations(sequences, r=2):\n qseq = sequences[qid]\n sseq = sequences[sid]\n aln = psa.needle(moltype='nucl', qseq=qseq, sseq=sseq)\n print(f'{qid} {sid} {aln.pidentity:.1f}% {aln.score}')\n```\n\nOutput:\n\n```\ndna1 dna2 38.5% 24.0\ndna1 dna3 60.9% 34.0\ndna1 dna4 35.7% 20.0\ndna1 dna5 45.8% 27.0\ndna2 dna3 43.5% 18.0\ndna2 dna4 42.3% 39.0\ndna2 dna5 39.1% 22.0\ndna3 dna4 20.0% 14.0\ndna3 dna5 52.4% 26.0\ndna4 dna5 40.9% 14.0\n```\n\n\n## Tests\nIf you want to check that everything works as intended, just run:\n\n```\n./test.py\n```\n\n## License\n\n[GNU General Public License, version 3](https://www.gnu.org/licenses/gpl-3.0.html)\n",
"bugtrack_url": null,
"license": null,
"summary": "Global and local pairwise alignments between nucleotide/protein sequences.",
"version": "1.0.1",
"project_urls": {
"Documentation": "https://github.com/aziele/pairwise-sequence-alignment",
"Source": "https://github.com/aziele/pairwise-sequence-alignment"
},
"split_keywords": [
"sequence alignment",
" nucleotide",
" protein",
" pairwise alignment",
" needle",
" water",
" emboss"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "64110a3231c47844d572e9b205819bb0e0fbf08effac353b58ec59db92c9ec22",
"md5": "96b80679ce67775080d7b4e94f0e2c06",
"sha256": "20f4189d4ad84340a0237a8f0159ef64b98b0e9abbd47e1882fd76ea3f6d1199"
},
"downloads": -1,
"filename": "pairwise_sequence_alignment-1.0.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "96b80679ce67775080d7b4e94f0e2c06",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 20872,
"upload_time": "2024-10-30T12:59:50",
"upload_time_iso_8601": "2024-10-30T12:59:50.787756Z",
"url": "https://files.pythonhosted.org/packages/64/11/0a3231c47844d572e9b205819bb0e0fbf08effac353b58ec59db92c9ec22/pairwise_sequence_alignment-1.0.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "0d6ca91d34bdf3687dfd8515afaffba685f1b59cc516f46ae2c37fe48b32160e",
"md5": "a5c4e9b6c49cbbabf8a7c748d3ebc775",
"sha256": "26fd379cbdbaafe7c2306e2597f4921d9ef685301bedc3c9f5510d51bb8303b7"
},
"downloads": -1,
"filename": "pairwise_sequence_alignment-1.0.1.tar.gz",
"has_sig": false,
"md5_digest": "a5c4e9b6c49cbbabf8a7c748d3ebc775",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 25549,
"upload_time": "2024-10-30T12:59:52",
"upload_time_iso_8601": "2024-10-30T12:59:52.754462Z",
"url": "https://files.pythonhosted.org/packages/0d/6c/a91d34bdf3687dfd8515afaffba685f1b59cc516f46ae2c37fe48b32160e/pairwise_sequence_alignment-1.0.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-10-30 12:59:52",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "aziele",
"github_project": "pairwise-sequence-alignment",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "pairwise-sequence-alignment"
}