pairwise-sequence-alignment


Namepairwise-sequence-alignment JSON
Version 1.0.0 PyPI version JSON
download
home_pageNone
SummaryGlobal and local pairwise alignments between nucleotide/protein sequences.
upload_time2023-04-04 13:13:16
maintainerNone
docs_urlNone
authorNone
requires_python>=3.8
licenseNone
keywords sequence alignment nucleotide protein pairwise alignment needle water emboss
VCS
bugtrack_url
requirements No requirements were recorded.
Travis-CI No Travis.
coveralls test coverage No coveralls.
            # pairwise-sequence-alignment (psa)

This is a Python module to calculate a pairwise alignment between biological sequences (protein or nucleic acid). This module uses the [needle](https://www.ebi.ac.uk/Tools/psa/emboss_needle/) and [water](https://www.ebi.ac.uk/Tools/psa/emboss_water/) tools from the EMBOSS package to calculate an optimal, global/local pairwise alignment.

I wrote this module for two reasons. First, the needle and water tools are faster than any Python implementation. Second, Biopython has dropped support for tools from the EMBOSS package and recommends running them via the subprocess module directly.

## Table of contents

* [Introduction](#introduction)
* [Requirements](#requirements)
* [Installation](#installation)
* [Quick Start](#quick-start)
* [Alignment object](#alignment-object)
   * [Attributes](#attributes)
   * [Methods](#methods)
* [Usage examples](#usage-examples)
   * [Alignment information](#alignment-information)
   * [Alignment in raw format](#alignment-in-raw-format)
   * [Alignment in FASTA format](#alignment-in-fasta-format)
   * [Alignment iteration](#alignment-iteration)
   * [Alignment iteration with index](#alignment-iteration-with-index)
   * [Query coverage](#query-coverage)
   * [Scoring scheme for alignment](#scoring-scheme-for-alignment)
   * [Statistical significance of alignment](#statistical-significance-of-alignment)
   * [All-against-all pairwise alignments](#all-against-all-pairwise-alignments)
* [Tests](#tests)
* [License](#license)


## Introduction
Pairwise sequence alignment is used to identify regions of similarity that may indicate functional, structural and/or evolutionary relationships between two sequences.

1. Global alignment (*needle*; [the Needleman-Wunsch algorithm](https://en.wikipedia.org/wiki/Needleman–Wunsch_algorithm)) aligns two sequences across their entire length, from beginning to end. It is most useful when sequences you are aligning are similar and roughly the same size.
2. Local alignment (*water*; [the Smith-Waterman algorithm](https://en.wikipedia.org/wiki/Smith–Waterman_algorithm)) finds the region with the highest level of similarity between the two sequences. It is suitable for sequences that are not assumed to be similar over the entire length.


## Requirements

1. Python >= 3.8
    > Check with `python3 --version`
2. EMBOSS >= 6.6.0
    > Check with `needle -version` or `water -version`.


## Installation

You can install the module from [PyPI](https://pypi.org/project/pairwise-sequence-alignment/):

```
pip install pairwise-sequence-alignment
```

or directly from GitHub:

```
pip install "git+https://github.com/aziele/pairwise-sequence-alignment.git"
```

or you can use the module without installation. Simply clone or download this repository and you're ready to use it.


## Quick Start

```python
import psa

# Global alignment
aln = psa.needle(moltype='nucl', qseq='ATGCTAGTA', sseq='ATGCTAGTAGATGATGA')
aln = psa.needle(moltype='prot', qseq='MKSTVWSG', sseq='MKSSVLW')

# Local alignment
aln = psa.water(moltype='nucl', qseq='ATGCTAGTA', sseq='ATGCTAGTAGATGATGAT')
aln = psa.water(moltype='prot', qseq='MKSTVWSG', sseq='MKSSVLW')

print(aln.score)       # 20.0
print(aln.pidentity)   # 71.4
print(aln.psimilarity) # 85.7
print(aln.pgaps)       # 14.3
print(aln.qseq)        # MKSTV-W
print(aln.sseq)        # MKSSVLW
```

## Alignment object

### Attributes

| Attribute | Description |
| --- | --- |
| `qid` | Query sequence identifier |
| `sid` | Subject sequence identifier |
| `qseq` | Query unaligned sequence |
| `sseq` | Subject unaligned sequence |
| `qaln` | Query aligned sequence 
| `saln` | Subject aligned sequence |
| `qstart` | Start of alignment in query |
| `qend` | End of alignment in query |
| `sstart` | Start of alignment in subject |
| `send` | End of alignment in subject |
| `length` | Alignment length |
| `score` | Alignment score |
| `nidentity` | Number of identical matches in the alignment |
| `pidentity` | Percentage of identical matches in the alignment |
| `nsimilarity` | Number of positive-scoring matches in the alignment |
| `psimilarity` | Percentage of positive-scoring matches in the alignment |
| `ngaps` | Total number of gaps in the alignment |
| `pgaps` | Total percentage of gaps in the alignment |
| `moltype` | nucl/prot |
| `program` | needle/water |
| `gapopen` | Gap open penalty |
| `gapextend` | Gap extension penalty |
| `matrix` | Name of scoring matrix |  
| `raw` | Raw output obtained from EMBOSS' needle/water |


### Methods

| Method | Description |
| --- | --- |
| `query_coverage()` | Returns a query coverage [%] |
| `subject_coverage()` | Returns a subject coverage [%] |
| `pvalue()` | Returns a p-value of the alignment |
| `fasta()` | Returns pairwise alignment in FASTA/Pearson format |


## Usage examples

### Alignment information

```python
import psa

aln = psa.needle(
    moltype='prot',
    qseq='MTSPSTKNSDDKGRPNLSSTEYFANTNVLTCRLKWVNPDTFIMDPRKPQLHSRT',
    sseq='MTTPSRENSDDKGRPIEEASNLSSTEYFANTNVLTCKLKYVNPDTFIMDPRKP',
    qid='seq1',
    sid='seq2'
)

print(aln.score)        # 225.0
print(aln.length)       # 59
print(aln.pidentity)    # 72.9
print(aln.psimilarity)  # 79.7
print(aln.pgaps)        # 18.6
print(aln.ngaps)        # 11
print(aln.qseq)         # MTSPSTKNSDDKGRP-----NLSSTEYFANTNVLTCRLKWVNPDTFIMDPRKPQLHSRT
print(aln.sseq)         # MTTPSRENSDDKGRPIEEASNLSSTEYFANTNVLTCKLKYVNPDTFIMDPRKP------
print(aln.qstart)       # 1
print(aln.qend)         # 54
print(aln.sstart)       # 1
print(aln.send)         # 53
print(aln.program)      # needle
print(aln.gapopen)      # 10
print(aln.gapextend)    # 0.5
print(aln.matrix)       # EBLOSUM62
```

### Alignment in raw format

```python
print(aln.raw)
```

Output:

```
#=======================================
#
# Aligned_sequences: 2
# 1: seq1
# 2: seq2
# Matrix: EBLOSUM62
# Gap_penalty: 10.0
# Extend_penalty: 0.5
#
# Length: 59
# Identity:      43/59 (72.9%)
# Similarity:    47/59 (79.7%)
# Gaps:          11/59 (18.6%)
# Score: 225.0
# 
#
#=======================================

seq1               1 MTSPSTKNSDDKGRP-----NLSSTEYFANTNVLTCRLKWVNPDTFIMDP     45
                     ||:||.:||||||||     ||||||||||||||||:||:||||||||||
seq2               1 MTTPSRENSDDKGRPIEEASNLSSTEYFANTNVLTCKLKYVNPDTFIMDP     50

seq1              46 RKPQLHSRT     54
                     |||      
seq2              51 RKP------     53


#---------------------------------------
#---------------------------------------
```

### Alignment in FASTA format

```python
print(aln.fasta(wrap=30))
```

Output:

```
>seq1 1-54
MTSPSTKNSDDKGRP-----NLSSTEYFAN
TNVLTCRLKWVNPDTFIMDPRKPQLHSRT
>seq2 1-53
MTTPSRENSDDKGRPIEEASNLSSTEYFAN
TNVLTCKLKYVNPDTFIMDPRKP------
```

### Alignment iteration

You can iterate over the alignment residue-by-residude:

```python
for aa1, aa2 in aln:
    if aa1 != aa2:
        print(aa1, aa2)
```

Output:

```
S T
T R
K E
- I
- E
- E
- A
- S
R K
W Y
Q -
L -
H -
S -
R -
T -
```

### Alignment iteration with index

You can also loop through residudes in an alignment with the information of their position.

```python
for i, (aa1, aa2) in enumerate(aln):
    if aa1 != aa2:
        print(i, aa1, aa2)
```

Output:

```
2  S T
5  T R
6  K E
15 - I
16 - E
17 - E
18 - A
19 - S
36 R K
39 W Y
53 Q -
54 L -
55 H -
56 S -
57 R -
58 T -
```

### Query coverage

Query coverage describes how much of the query sequence is covered in the alignment by the subject sequence. Specifically, query coverage is the percentage of the query sequence length that is included in the alignment. In global alignments, query coverage is always 100% because both the sequences, query and subject, are aligned from end to end. It is thus more useful to calculate query coverage from local alignments.

```python
import psa

aln = psa.water(
    moltype='prot',
    qseq='MTSPSTKNSDDKGRPNLSSTEYFANTNVLTCRLKWVNPDTFIMDPRKPQLHSRT',
    sseq='NSDDKGRPIEEASNLSSTEYFANTNVLTCKLKYVNPDTFIMDPRKP',
    qid='seq1',
    sid='seq2'
)
# seq1               8 NSDDKGRP-----NLSSTEYFANTNVLTCRLKWVNPDTFIMDPRKP     48
#                      ||||||||     ||||||||||||||||:||:|||||||||||||
# seq2               1 NSDDKGRPIEEASNLSSTEYFANTNVLTCKLKYVNPDTFIMDPRKP     46

print(aln.query_coverage())
# 75.9
print(aln.subject_coverage())
# 100
```

### Scoring scheme for alignment

You can change a scoring matrix and penalties for the gap open and extension to calculate the alignment.

```python
import psa

aln = psa.water(
    moltype='prot',
    qseq='MKSTWYERNST',
    sseq='MKSTGYWTRESA',
    matrix='EBLOSUM30',
    gapopen=5,
    gapextend=0.2
)
print(aln)
```

Output:

```
#=======================================
#
# Aligned_sequences: 2
# 1: query
# 2: subject
# Matrix: EBLOSUM30
# Gap_penalty: 5.0
# Extend_penalty: 0.2
#
# Length: 13
# Identity:       7/13 (53.8%)
# Similarity:     8/13 (61.5%)
# Gaps:           3/13 (23.1%)
# Score: 39.8
# 
#
#=======================================

query              1 MKST--WYERNST     11
                     ||||  |. |.|:
subject            1 MKSTGYWT-RESA     12


#---------------------------------------
#---------------------------------------
```

### Statistical significance of alignment

The Needleman-Wunsch and Smith-Waterman algorithms will always find an optimal alignment between two sequences, whether or not they are evolutionarily related. The strength of an alignment is determined by its score. However, often it is necessary to know if a score is high enough to indicate a biologically interesting alignment. The statistical significance of the score is assessed by the *P*-value, which describes how likely it is that two random sequences of similar length and composition will align with a score equal to or better than our target alignment. 

The `.pvalue()` method calculates the *P*-value of the alignment between query and subject sequences. The method shuffles a subject sequence many times (100 by default) and calculates the alignment score between the query and each shuffled subject sequence. It then counts how many times the alignment score was greater than or equal to the alignment score of the original query and subject sequences. For example, if 100 such shuffles all produce alignment scores that are lower than the observed alignment score, then one can say that the *P*-value is likely to be less than 0.01.

```python
import psa

aln = psa.needle(moltype='prot', qseq='MKSTVILK', sseq='MKSRSLK')

print(aln.pvalue())   # 0.16
```


### All-against-all pairwise alignments

For more than two sequences, you can calculate alignments between every pair of your input sequences.

```python
import itertools
import psa

# Input sequences
sequences = {
    'dna1': 'ATCGAGATCGAGATGGCGATAG',
    'dna2': 'ATGCTGATCGTAGGGGC',
    'dna3': 'GTCGGATCCTCGATGGAGA',
    'dna4': 'TTTGGGAATGCGTAGGAGCTA',
    'dna5': 'CCGTGATGCGATGCA'
}

# All-against-all pairwise alignments
for qid, sid in itertools.combinations(sequences, r=2):
    qseq = sequences[qid]
    sseq = sequences[sid]
    aln = psa.needle(moltype='nucl', qseq=qseq, sseq=sseq)
    print(f'{qid} {sid} {aln.pidentity:.1f}% {aln.score}')
```

Output:

```
dna1  dna2  38.5%  24.0
dna1  dna3  60.9%  34.0
dna1  dna4  35.7%  20.0
dna1  dna5  45.8%  27.0
dna2  dna3  43.5%  18.0
dna2  dna4  42.3%  39.0
dna2  dna5  39.1%  22.0
dna3  dna4  20.0%  14.0
dna3  dna5  52.4%  26.0
dna4  dna5  40.9%  14.0
```

### All-against-all pairwise alignments of sequences from a FASTA file

If you have multiple sequences in a FASTA file, you can use [Biopython](https://biopython.org) to read them and then calculate pairwise alignments.

```python
import itertools
import psa

from Bio import SeqIO

# Input sequences
sequences = {}
for seq_record in SeqIO.parse('sequences.fasta', 'fasta'):
    sequences[seq_record.id] = str(seq_record.seq)

# All-against-all pairwise alignments
for qid, sid in itertools.combinations(sequences, r=2):
    qseq = sequences[qid]
    sseq = sequences[sid]
    aln = psa.needle(moltype='nucl', qseq=qseq, sseq=sseq)
    print(f'{qid} {sid} {aln.pidentity:.1f}% {aln.score}')
```

Output:

```
dna1  dna2  38.5%  24.0
dna1  dna3  60.9%  34.0
dna1  dna4  35.7%  20.0
dna1  dna5  45.8%  27.0
dna2  dna3  43.5%  18.0
dna2  dna4  42.3%  39.0
dna2  dna5  39.1%  22.0
dna3  dna4  20.0%  14.0
dna3  dna5  52.4%  26.0
dna4  dna5  40.9%  14.0
```


## Tests
If you want to check that everything works as intended, just run:

```
./test.py
```

## License

[GNU General Public License, version 3](https://www.gnu.org/licenses/gpl-3.0.html)
            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "pairwise-sequence-alignment",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "sequence alignment,nucleotide,protein,pairwise alignment,needle,water,emboss",
    "author": null,
    "author_email": "Andrzej Zielezinski <a.zielezinski@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/4b/e5/3c1b8b92e1241437a1f31afd9446883446041eea10768e92fa993515362a/pairwise-sequence-alignment-1.0.0.tar.gz",
    "platform": null,
    "description": "# pairwise-sequence-alignment (psa)\n\nThis is a Python module to calculate a pairwise alignment between biological sequences (protein or nucleic acid). This module uses the [needle](https://www.ebi.ac.uk/Tools/psa/emboss_needle/) and [water](https://www.ebi.ac.uk/Tools/psa/emboss_water/) tools from the EMBOSS package to calculate an optimal, global/local pairwise alignment.\n\nI wrote this module for two reasons. First, the needle and water tools are faster than any Python implementation. Second, Biopython has dropped support for tools from the EMBOSS package and recommends running them via the subprocess module directly.\n\n## Table of contents\n\n* [Introduction](#introduction)\n* [Requirements](#requirements)\n* [Installation](#installation)\n* [Quick Start](#quick-start)\n* [Alignment object](#alignment-object)\n   * [Attributes](#attributes)\n   * [Methods](#methods)\n* [Usage examples](#usage-examples)\n   * [Alignment information](#alignment-information)\n   * [Alignment in raw format](#alignment-in-raw-format)\n   * [Alignment in FASTA format](#alignment-in-fasta-format)\n   * [Alignment iteration](#alignment-iteration)\n   * [Alignment iteration with index](#alignment-iteration-with-index)\n   * [Query coverage](#query-coverage)\n   * [Scoring scheme for alignment](#scoring-scheme-for-alignment)\n   * [Statistical significance of alignment](#statistical-significance-of-alignment)\n   * [All-against-all pairwise alignments](#all-against-all-pairwise-alignments)\n* [Tests](#tests)\n* [License](#license)\n\n\n## Introduction\nPairwise sequence alignment is used to identify regions of similarity that may indicate functional, structural and/or evolutionary relationships between two sequences.\n\n1. Global alignment (*needle*; [the Needleman-Wunsch algorithm](https://en.wikipedia.org/wiki/Needleman\u2013Wunsch_algorithm)) aligns two sequences across their entire length, from beginning to end. It is most useful when sequences you are aligning are similar and roughly the same size.\n2. Local alignment (*water*; [the Smith-Waterman algorithm](https://en.wikipedia.org/wiki/Smith\u2013Waterman_algorithm)) finds the region with the highest level of similarity between the two sequences. It is suitable for sequences that are not assumed to be similar over the entire length.\n\n\n## Requirements\n\n1. Python >= 3.8\n    > Check with `python3 --version`\n2. EMBOSS >= 6.6.0\n    > Check with `needle -version` or `water -version`.\n\n\n## Installation\n\nYou can install the module from [PyPI](https://pypi.org/project/pairwise-sequence-alignment/):\n\n```\npip install pairwise-sequence-alignment\n```\n\nor directly from GitHub:\n\n```\npip install \"git+https://github.com/aziele/pairwise-sequence-alignment.git\"\n```\n\nor you can use the module without installation. Simply clone or download this repository and you're ready to use it.\n\n\n## Quick Start\n\n```python\nimport psa\n\n# Global alignment\naln = psa.needle(moltype='nucl', qseq='ATGCTAGTA', sseq='ATGCTAGTAGATGATGA')\naln = psa.needle(moltype='prot', qseq='MKSTVWSG', sseq='MKSSVLW')\n\n# Local alignment\naln = psa.water(moltype='nucl', qseq='ATGCTAGTA', sseq='ATGCTAGTAGATGATGAT')\naln = psa.water(moltype='prot', qseq='MKSTVWSG', sseq='MKSSVLW')\n\nprint(aln.score)       # 20.0\nprint(aln.pidentity)   # 71.4\nprint(aln.psimilarity) # 85.7\nprint(aln.pgaps)       # 14.3\nprint(aln.qseq)        # MKSTV-W\nprint(aln.sseq)        # MKSSVLW\n```\n\n## Alignment object\n\n### Attributes\n\n| Attribute | Description |\n| --- | --- |\n| `qid` | Query sequence identifier |\n| `sid` | Subject sequence identifier |\n| `qseq` | Query unaligned sequence |\n| `sseq` | Subject unaligned sequence |\n| `qaln` | Query aligned sequence \n| `saln` | Subject aligned sequence |\n| `qstart` | Start of alignment in query |\n| `qend` | End of alignment in query |\n| `sstart` | Start of alignment in subject |\n| `send` | End of alignment in subject |\n| `length` | Alignment length |\n| `score` | Alignment score |\n| `nidentity` | Number of identical matches in the alignment |\n| `pidentity` | Percentage of identical matches in the alignment |\n| `nsimilarity` | Number of positive-scoring matches in the alignment |\n| `psimilarity` | Percentage of positive-scoring matches in the alignment |\n| `ngaps` | Total number of gaps in the alignment |\n| `pgaps` | Total percentage of gaps in the alignment |\n| `moltype` | nucl/prot |\n| `program` | needle/water |\n| `gapopen` | Gap open penalty |\n| `gapextend` | Gap extension penalty |\n| `matrix` | Name of scoring matrix |  \n| `raw` | Raw output obtained from EMBOSS' needle/water |\n\n\n### Methods\n\n| Method | Description |\n| --- | --- |\n| `query_coverage()` | Returns a query coverage [%] |\n| `subject_coverage()` | Returns a subject coverage [%] |\n| `pvalue()` | Returns a p-value of the alignment |\n| `fasta()` | Returns pairwise alignment in FASTA/Pearson format |\n\n\n## Usage examples\n\n### Alignment information\n\n```python\nimport psa\n\naln = psa.needle(\n    moltype='prot',\n    qseq='MTSPSTKNSDDKGRPNLSSTEYFANTNVLTCRLKWVNPDTFIMDPRKPQLHSRT',\n    sseq='MTTPSRENSDDKGRPIEEASNLSSTEYFANTNVLTCKLKYVNPDTFIMDPRKP',\n    qid='seq1',\n    sid='seq2'\n)\n\nprint(aln.score)        # 225.0\nprint(aln.length)       # 59\nprint(aln.pidentity)    # 72.9\nprint(aln.psimilarity)  # 79.7\nprint(aln.pgaps)        # 18.6\nprint(aln.ngaps)        # 11\nprint(aln.qseq)         # MTSPSTKNSDDKGRP-----NLSSTEYFANTNVLTCRLKWVNPDTFIMDPRKPQLHSRT\nprint(aln.sseq)         # MTTPSRENSDDKGRPIEEASNLSSTEYFANTNVLTCKLKYVNPDTFIMDPRKP------\nprint(aln.qstart)       # 1\nprint(aln.qend)         # 54\nprint(aln.sstart)       # 1\nprint(aln.send)         # 53\nprint(aln.program)      # needle\nprint(aln.gapopen)      # 10\nprint(aln.gapextend)    # 0.5\nprint(aln.matrix)       # EBLOSUM62\n```\n\n### Alignment in raw format\n\n```python\nprint(aln.raw)\n```\n\nOutput:\n\n```\n#=======================================\n#\n# Aligned_sequences: 2\n# 1: seq1\n# 2: seq2\n# Matrix: EBLOSUM62\n# Gap_penalty: 10.0\n# Extend_penalty: 0.5\n#\n# Length: 59\n# Identity:      43/59 (72.9%)\n# Similarity:    47/59 (79.7%)\n# Gaps:          11/59 (18.6%)\n# Score: 225.0\n# \n#\n#=======================================\n\nseq1               1 MTSPSTKNSDDKGRP-----NLSSTEYFANTNVLTCRLKWVNPDTFIMDP     45\n                     ||:||.:||||||||     ||||||||||||||||:||:||||||||||\nseq2               1 MTTPSRENSDDKGRPIEEASNLSSTEYFANTNVLTCKLKYVNPDTFIMDP     50\n\nseq1              46 RKPQLHSRT     54\n                     |||      \nseq2              51 RKP------     53\n\n\n#---------------------------------------\n#---------------------------------------\n```\n\n### Alignment in FASTA format\n\n```python\nprint(aln.fasta(wrap=30))\n```\n\nOutput:\n\n```\n>seq1 1-54\nMTSPSTKNSDDKGRP-----NLSSTEYFAN\nTNVLTCRLKWVNPDTFIMDPRKPQLHSRT\n>seq2 1-53\nMTTPSRENSDDKGRPIEEASNLSSTEYFAN\nTNVLTCKLKYVNPDTFIMDPRKP------\n```\n\n### Alignment iteration\n\nYou can iterate over the alignment residue-by-residude:\n\n```python\nfor aa1, aa2 in aln:\n    if aa1 != aa2:\n        print(aa1, aa2)\n```\n\nOutput:\n\n```\nS T\nT R\nK E\n- I\n- E\n- E\n- A\n- S\nR K\nW Y\nQ -\nL -\nH -\nS -\nR -\nT -\n```\n\n### Alignment iteration with index\n\nYou can also loop through residudes in an alignment with the information of their position.\n\n```python\nfor i, (aa1, aa2) in enumerate(aln):\n    if aa1 != aa2:\n        print(i, aa1, aa2)\n```\n\nOutput:\n\n```\n2  S T\n5  T R\n6  K E\n15 - I\n16 - E\n17 - E\n18 - A\n19 - S\n36 R K\n39 W Y\n53 Q -\n54 L -\n55 H -\n56 S -\n57 R -\n58 T -\n```\n\n### Query coverage\n\nQuery coverage describes how much of the query sequence is covered in the alignment by the subject sequence. Specifically, query coverage is the percentage of the query sequence length that is included in the alignment. In global alignments, query coverage is always 100% because both the sequences, query and subject, are aligned from end to end. It is thus more useful to calculate query coverage from local alignments.\n\n```python\nimport psa\n\naln = psa.water(\n    moltype='prot',\n    qseq='MTSPSTKNSDDKGRPNLSSTEYFANTNVLTCRLKWVNPDTFIMDPRKPQLHSRT',\n    sseq='NSDDKGRPIEEASNLSSTEYFANTNVLTCKLKYVNPDTFIMDPRKP',\n    qid='seq1',\n    sid='seq2'\n)\n# seq1               8 NSDDKGRP-----NLSSTEYFANTNVLTCRLKWVNPDTFIMDPRKP     48\n#                      ||||||||     ||||||||||||||||:||:|||||||||||||\n# seq2               1 NSDDKGRPIEEASNLSSTEYFANTNVLTCKLKYVNPDTFIMDPRKP     46\n\nprint(aln.query_coverage())\n# 75.9\nprint(aln.subject_coverage())\n# 100\n```\n\n### Scoring scheme for alignment\n\nYou can change a scoring matrix and penalties for the gap open and extension to calculate the alignment.\n\n```python\nimport psa\n\naln = psa.water(\n    moltype='prot',\n    qseq='MKSTWYERNST',\n    sseq='MKSTGYWTRESA',\n    matrix='EBLOSUM30',\n    gapopen=5,\n    gapextend=0.2\n)\nprint(aln)\n```\n\nOutput:\n\n```\n#=======================================\n#\n# Aligned_sequences: 2\n# 1: query\n# 2: subject\n# Matrix: EBLOSUM30\n# Gap_penalty: 5.0\n# Extend_penalty: 0.2\n#\n# Length: 13\n# Identity:       7/13 (53.8%)\n# Similarity:     8/13 (61.5%)\n# Gaps:           3/13 (23.1%)\n# Score: 39.8\n# \n#\n#=======================================\n\nquery              1 MKST--WYERNST     11\n                     ||||  |. |.|:\nsubject            1 MKSTGYWT-RESA     12\n\n\n#---------------------------------------\n#---------------------------------------\n```\n\n### Statistical significance of alignment\n\nThe Needleman-Wunsch and Smith-Waterman algorithms will always find an optimal alignment between two sequences, whether or not they are evolutionarily related. The strength of an alignment is determined by its score. However, often it is necessary to know if a score is high enough to indicate a biologically interesting alignment. The statistical significance of the score is assessed by the *P*-value, which describes how likely it is that two random sequences of similar length and composition will align with a score equal to or better than our target alignment. \n\nThe `.pvalue()` method calculates the *P*-value of the alignment between query and subject sequences. The method shuffles a subject sequence many times (100 by default) and calculates the alignment score between the query and each shuffled subject sequence. It then counts how many times the alignment score was greater than or equal to the alignment score of the original query and subject sequences. For example, if 100 such shuffles all produce alignment scores that are lower than the observed alignment score, then one can say that the *P*-value is likely to be less than 0.01.\n\n```python\nimport psa\n\naln = psa.needle(moltype='prot', qseq='MKSTVILK', sseq='MKSRSLK')\n\nprint(aln.pvalue())   # 0.16\n```\n\n\n### All-against-all pairwise alignments\n\nFor more than two sequences, you can calculate alignments between every pair of your input sequences.\n\n```python\nimport itertools\nimport psa\n\n# Input sequences\nsequences = {\n    'dna1': 'ATCGAGATCGAGATGGCGATAG',\n    'dna2': 'ATGCTGATCGTAGGGGC',\n    'dna3': 'GTCGGATCCTCGATGGAGA',\n    'dna4': 'TTTGGGAATGCGTAGGAGCTA',\n    'dna5': 'CCGTGATGCGATGCA'\n}\n\n# All-against-all pairwise alignments\nfor qid, sid in itertools.combinations(sequences, r=2):\n    qseq = sequences[qid]\n    sseq = sequences[sid]\n    aln = psa.needle(moltype='nucl', qseq=qseq, sseq=sseq)\n    print(f'{qid} {sid} {aln.pidentity:.1f}% {aln.score}')\n```\n\nOutput:\n\n```\ndna1  dna2  38.5%  24.0\ndna1  dna3  60.9%  34.0\ndna1  dna4  35.7%  20.0\ndna1  dna5  45.8%  27.0\ndna2  dna3  43.5%  18.0\ndna2  dna4  42.3%  39.0\ndna2  dna5  39.1%  22.0\ndna3  dna4  20.0%  14.0\ndna3  dna5  52.4%  26.0\ndna4  dna5  40.9%  14.0\n```\n\n### All-against-all pairwise alignments of sequences from a FASTA file\n\nIf you have multiple sequences in a FASTA file, you can use [Biopython](https://biopython.org) to read them and then calculate pairwise alignments.\n\n```python\nimport itertools\nimport psa\n\nfrom Bio import SeqIO\n\n# Input sequences\nsequences = {}\nfor seq_record in SeqIO.parse('sequences.fasta', 'fasta'):\n    sequences[seq_record.id] = str(seq_record.seq)\n\n# All-against-all pairwise alignments\nfor qid, sid in itertools.combinations(sequences, r=2):\n    qseq = sequences[qid]\n    sseq = sequences[sid]\n    aln = psa.needle(moltype='nucl', qseq=qseq, sseq=sseq)\n    print(f'{qid} {sid} {aln.pidentity:.1f}% {aln.score}')\n```\n\nOutput:\n\n```\ndna1  dna2  38.5%  24.0\ndna1  dna3  60.9%  34.0\ndna1  dna4  35.7%  20.0\ndna1  dna5  45.8%  27.0\ndna2  dna3  43.5%  18.0\ndna2  dna4  42.3%  39.0\ndna2  dna5  39.1%  22.0\ndna3  dna4  20.0%  14.0\ndna3  dna5  52.4%  26.0\ndna4  dna5  40.9%  14.0\n```\n\n\n## Tests\nIf you want to check that everything works as intended, just run:\n\n```\n./test.py\n```\n\n## License\n\n[GNU General Public License, version 3](https://www.gnu.org/licenses/gpl-3.0.html)",
    "bugtrack_url": null,
    "license": null,
    "summary": "Global and local pairwise alignments between nucleotide/protein sequences.",
    "version": "1.0.0",
    "split_keywords": [
        "sequence alignment",
        "nucleotide",
        "protein",
        "pairwise alignment",
        "needle",
        "water",
        "emboss"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "cf4a37d805af5adf444854417c951c961ebfa2f6ebebac76669eb7f2e203aa18",
                "md5": "97a0646ca9b7c57762c10fd59da37be4",
                "sha256": "e41e05cc875a23b1be2308a546d354852eea3d964bedb85c723add5f53fc229e"
            },
            "downloads": -1,
            "filename": "pairwise_sequence_alignment-1.0.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "97a0646ca9b7c57762c10fd59da37be4",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 20616,
            "upload_time": "2023-04-04T13:13:13",
            "upload_time_iso_8601": "2023-04-04T13:13:13.804138Z",
            "url": "https://files.pythonhosted.org/packages/cf/4a/37d805af5adf444854417c951c961ebfa2f6ebebac76669eb7f2e203aa18/pairwise_sequence_alignment-1.0.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "4be53c1b8b92e1241437a1f31afd9446883446041eea10768e92fa993515362a",
                "md5": "a26509856ffa17eeb878a3a4c6911490",
                "sha256": "378f543b2ff7163a798ec5fac75594863be8eb2718036f0f2aac649f5b985d80"
            },
            "downloads": -1,
            "filename": "pairwise-sequence-alignment-1.0.0.tar.gz",
            "has_sig": false,
            "md5_digest": "a26509856ffa17eeb878a3a4c6911490",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 25081,
            "upload_time": "2023-04-04T13:13:16",
            "upload_time_iso_8601": "2023-04-04T13:13:16.813791Z",
            "url": "https://files.pythonhosted.org/packages/4b/e5/3c1b8b92e1241437a1f31afd9446883446041eea10768e92fa993515362a/pairwise-sequence-alignment-1.0.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-04-04 13:13:16",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "lcname": "pairwise-sequence-alignment"
}
        
Elapsed time: 0.05075s