<p align="center">
<img src="img/title.png" border="0"/>
<h2 align="center">Segmentation of PhAge Endolysin Domains</h2>
</p>
SPAED is a tool to identify domains in phage endolysins. It takes as input the PAE file(s) obtained from AlphaFold and outputs a csv file with delineations.
Additional scripts are provided to visualize predicted domains with PyMOL and to obtain their amino acid sequences.
## Installation & usage
Check out [www.spaed.ca](https://spaed.ca) to launch SPAED quickly!
First create a virtual environment, then:
**From pypi**:
```
pip install spaed ### note the spelling of spaed
```
ex. `spaed pae_path --output_file spaed_predictions.csv`
**From source**:
```
git clone https://github.com/Rousseau-Team/spaed.git
pip install numpy pandas scipy
```
ex. `python spaed/src/spaed/spaed.py pae_path`
## Advanced usage
Optional dependency for structure visualisation: PyMOL (`conda install -c conda-forge -c schrodinger pymol-bundle`). Python >3.10 is required; 3.12.9 worked for me.\
ex. (install from pip). `pymol_vis pred_path pdb_path --output_folder pymol_output --output_type {pse|png|both}`\
ex. (install from source). `python spaed/src/spaed/pymol_vis.py pred_path pdb_path --output_folder pymol_output --output_type {pse|png|both}`
**Positional arguments**:
- **pae_path** - Path to a single PAE file, or to a folder of PAE files, in JSON format as output by AlphaFold2/3 or ColabFold.
**Optional arguments**:
- **output_file** - File to save table of segmented domains in csv format. (default spaed_predictions.csv)
- **fasta_path** - Path to a fasta file or to a folder containing fasta files. If specified, spaed will save the sequences corresponding to predicted domains, linkers and disordered regions into new fasta files named "spaed_predicted_{seq_type}.faa" in the same output folder as output_file. Ensure fasta names or headers correspond to entries in the PAE files.
- **RATIO_NUM_CLUSTERS** - The maximum number of clusters initially generated by hierarchical clustering is len(protein) // RATIO_NUM_CLUSTERS. (default 10). For a protein 400 residues long, at most 40 clusters will be generated.
- **MIN_DOMAIN_SIZE** - Minimum size a domain can have. (default 30).
- **PAE_SCORE_CUTOFF** - Cutoff on the PAE score used to make adjustments to predicted domains/linkers/disordered regions. Residues with PAE score < PAE_SCORE_CUTOFF are considered close together. (default = 4).
- **MIN_DISORDERED_SIZE** - Minimum size a terminal disordered region must have to be considered a separate entity from the domain it is next to (default 20).
- **FREQ_DISORDERED** - For a given residue in the PAE matrix, frequency of residues that can align to it with a low PAE score while it is still considered "not part of a domain". Values < MIN_DOMAIN_SIZE are logical; as the value increases, the algorithm becomes more lenient toward non-domain regions (more will be predicted). (default 6).
- **PROP_DISORDERED** - Proportion of residues in a given region that must meet FREQ_DISORDERED criteria to be considered a terminal disordered region. The greater the value, the stricter the criteria to predict the region as disordered. (default 80%).
- **FREQ_LINKER** - For a given residue in the PAE matrix, frequency of residues that can align to it with a low PAE score while it is still considered part of the linker. Values < MIN_DOMAIN_SIZE are logical as they are less than the expected size of the nearest domain. Increasing this value leads to a more lenient assignment of residues to the linker. (default 20).
- **version** - Display installed SPAED version number.
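
To make the parameters above more concrete, here is a minimal sketch (not SPAED's actual implementation) of two quantities they refer to: the cap on the initial number of clusters, and, for each residue, how many residues have a PAE score below PAE_SCORE_CUTOFF. The function names and the toy matrix are hypothetical, and the choice to count along each row and to include the diagonal is an assumption made for illustration only.

```python
def max_initial_clusters(protein_length, ratio_num_clusters=10):
    """Upper bound on initial clusters: len(protein) // RATIO_NUM_CLUSTERS."""
    return protein_length // ratio_num_clusters


def low_pae_counts(pae, pae_score_cutoff=4):
    """For each residue (row), count entries with PAE < cutoff.

    Assumption: the diagonal is included and only rows are inspected.
    """
    return [sum(1 for value in row if value < pae_score_cutoff) for row in pae]


# Hypothetical 4-residue PAE matrix: residues 1-3 are mutually
# well-aligned, residue 4 is not aligned with anything but itself.
toy_pae = [
    [0.5, 1.0, 2.0, 20.0],
    [1.0, 0.5, 1.5, 18.0],
    [2.0, 1.5, 0.5, 25.0],
    [20.0, 18.0, 25.0, 0.5],
]

print(max_initial_clusters(400))  # 40
print(low_pae_counts(toy_pae))    # [3, 3, 3, 1]
```

Under this reading, a residue with a low count (such as the last one in the toy matrix) is the kind of residue that thresholds like FREQ_DISORDERED or FREQ_LINKER would be compared against.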
If you are interested in the disordered regions at the N- or C-terminus, consider increasing FREQ_DISORDERED ([4-30]), decreasing MIN_DISORDERED_SIZE ([10-30]) or decreasing PROP_DISORDERED ([50-95]). This will result in more (and longer) terminal disordered regions being detected, but also in many false positives. I would not change them all at the same time, as this will probably increase the sensitivity too much.
If you are interested in linkers or have a protein that is less well folded, consider modifying the FREQ_LINKER parameter ([4-30]). This value is used to adjust the boundaries of the linkers and as such, a higher value will result in longer linkers. However, linkers that were missed will still not be detected.
## Outputs
A csv file containing, for each protein, the protein ID, protein length, number of predicted domains, domain delineations, linker delineations and terminal disordered region delineations. Multiple delineations within a column are separated by a ";".\
Ex.
| | length | # domains | domains | linkers | disordered |
| ------ | ------ | --------- | -------------- | ------- | ---------- |
| prot 1 | 251 | 2 | 1-120;130-251 | 121-129 | |
| prot 2 | 386 | 2 | 86-203;217-386 | 204-216 | 1-85 |
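
As a rough illustration of how the delineation columns can be consumed downstream, the sketch below parses a ";"-separated field into integer ranges and slices a sequence accordingly. The helper names are hypothetical and not part of SPAED; the sketch assumes coordinates are 1-based and inclusive, which is consistent with the example table (prot 1 spans 1-120, 121-129 and 130-251 for a length of 251).

```python
def parse_delineations(field):
    """Turn '1-120;130-251' into [(1, 120), (130, 251)]; empty field -> []."""
    field = (field or "").strip()
    if not field:
        return []
    ranges = []
    for part in field.split(";"):
        start, end = part.split("-")
        ranges.append((int(start), int(end)))
    return ranges


def slice_regions(sequence, ranges):
    """Extract subsequences, assuming 1-based inclusive coordinates."""
    return [sequence[start - 1:end] for start, end in ranges]


print(parse_delineations("86-203;217-386"))  # [(86, 203), (217, 386)]
print(parse_delineations(""))                # []
print(slice_regions("ABCDEFGHIJ", [(1, 3), (5, 10)]))  # ['ABC', 'EFGHIJ']
```

When reading the csv with Python's `csv` module, empty cells (such as the disordered column for prot 1) arrive as empty strings, which `parse_delineations` maps to an empty list.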
## Citation
Boulay, A. et al. SPAED: Harnessing AlphaFold Output for Accurate Segmentation of Phage Endolysin Domains. 2025.04.25.650745 Preprint at https://doi.org/10.1101/2025.04.25.650745 (2025).