spaed


| Field           | Value |
| --------------- | ----- |
| Name            | spaed |
| Version         | 1.0.5 |
| Summary         | A module for the segmentation of phage endolysin domains based on the PAE matrix from AlphaFold. |
| Upload time     | 2025-07-12 18:42:07 |
| Requires Python | >3.10 |
| Keywords        | bacteriophage, bioinformatics, delineation, domain, phage, protein, segmentation |
| Requirements    | No requirements were recorded. |

<p align="center">
  <img src="img/title.png" border="0"/>
  <h2 align="center">Segmentation of PhAge Endolysin Domains</h2>
</p>

SPAED is a tool to identify domains in phage endolysins. It takes as input the PAE file(s) obtained from AlphaFold and outputs a CSV file with the predicted domain, linker and terminal disordered region delineations.

Additional scripts are provided to visualize predicted domains with PyMOL and to obtain their amino acid sequences. 

## Installation & usage

Check out [www.spaed.ca](https://spaed.ca) to launch SPAED quickly!

First create a virtual environment, then: 

**From PyPI**:
```
pip install spaed  ### note the spelling of spaed
```

ex. `spaed pae_path --output_file spaed_predictions.csv`


**From source**:
```
git clone https://github.com/Rousseau-Team/spaed.git

pip install numpy pandas scipy
```

ex. `python spaed/src/spaed/spaed.py pae_path`


## Advanced usage
Optional dependency for structure visualisation: PyMOL (`conda install -c conda-forge -c schrodinger pymol-bundle`). Python >3.10 is required; 3.12.9 worked for me.\
ex. (install from pip). `pymol_vis pred_path pdb_path --output_folder pymol_output --output_type {pse|png|both}`\
ex. (install from source). `python spaed/src/spaed/pymol_vis.py pred_path pdb_path --output_folder pymol_output --output_type {pse|png|both}`

**Positional arguments**:
- **pae_path** - A single PAE file, or a folder of PAE files, in JSON format as output by AlphaFold2/3 or ColabFold.
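
As a rough illustration of what such an input looks like, here is a minimal sketch that reads a PAE matrix from a JSON file. The key names `pae` and `predicted_aligned_error`, and the one-element list wrapper, are assumptions about common export layouts. The helper `load_pae_matrix` is hypothetical and is not part of SPAED.

```python
import json


def load_pae_matrix(path):
    """Return the PAE matrix stored in a JSON file as a list of rows."""
    with open(path) as handle:
        data = json.load(handle)
    # Assumption: some exports wrap the record in a one-element list.
    if isinstance(data, list):
        data = data[0]
    # Assumption: the matrix is stored under one of these key names.
    for key in ("pae", "predicted_aligned_error"):
        if key in data:
            matrix = data[key]
            break
    else:
        raise KeyError(f"no PAE matrix found in {path}")
    # A PAE matrix has one row and one column per residue.
    if any(len(row) != len(matrix) for row in matrix):
        raise ValueError("PAE matrix is not square")
    return matrix
```

Checking that the matrix is square before running a batch can catch truncated or mismatched files early.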


**Optional arguments**:
- **output_file** - File in which to save the table of segmented domains, in CSV format (default spaed_predictions.csv).
- **fasta_path** - Path to a FASTA file or a folder containing FASTA files. If specified, SPAED saves the sequences corresponding to predicted domains, linkers and disordered regions into new FASTA files named "spaed_predicted_{seq_type}.faa" in the same folder as output_file. Ensure that FASTA file names or headers correspond to entries in the PAE files.
- **RATIO_NUM_CLUSTERS** - The maximum number of clusters initially generated by hierarchical clustering is len(protein) // RATIO_NUM_CLUSTERS (default 10). For a protein 400 residues long, up to 40 clusters will be generated.
- **MIN_DOMAIN_SIZE** - Minimum size a domain can have (default 30).
- **PAE_SCORE_CUTOFF** - Cutoff on the PAE score used to make adjustments to predicted domains/linkers/disordered regions. Residues with PAE score < PAE_SCORE_CUTOFF are considered close together (default 4).
- **MIN_DISORDERED_SIZE** - Minimum size a terminal disordered region must have to be considered a separate entity from the domain it is next to (default 20).
- **FREQ_DISORDERED** - For a given residue in the PAE matrix, the number of residues that can align to it with a low PAE score while it is still considered "not part of a domain". Values < MIN_DOMAIN_SIZE are logical; the higher the value, the more lenient the algorithm becomes towards non-domain regions (more will be predicted) (default 6).
- **PROP_DISORDERED** - Proportion of residues in a given region that must meet the FREQ_DISORDERED criterion for the region to be considered a terminal disordered region. The greater the value, the stricter the criterion for predicting the region as disordered (default 80%).
- **FREQ_LINKER** - For a given residue in the PAE matrix, the number of residues that can align to it with a low PAE score while it is still considered part of the linker. Values < MIN_DOMAIN_SIZE are logical, as they are less than the expected size of the nearest domain. Increasing the value leads to a more lenient assignment of residues to the linker (default 20).
- **version** - Display installed SPAED version number.
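
To make the parameter descriptions above more concrete, here is a small sketch of one possible reading of them. It is not SPAED's implementation. Counting along rows, including the diagonal, and comparing with `<=` are all assumptions, and the helper names are hypothetical.

```python
RATIO_NUM_CLUSTERS = 10  # documented default
PAE_SCORE_CUTOFF = 4     # documented default
FREQ_DISORDERED = 6      # documented default
PROP_DISORDERED = 0.8    # documented default of 80%


def max_initial_clusters(protein_length, ratio=RATIO_NUM_CLUSTERS):
    """Upper bound on the initial cluster count: len(protein) // ratio."""
    return protein_length // ratio


def low_pae_counts(pae, cutoff=PAE_SCORE_CUTOFF):
    """For each residue (row), count the entries with PAE below the cutoff.

    Assumption: rows are used and the diagonal is included in the count.
    """
    return [sum(1 for value in row if value < cutoff) for row in pae]


def region_meets_proportion(counts, start, stop,
                            freq=FREQ_DISORDERED, prop=PROP_DISORDERED):
    """Check whether enough residues in counts[start:stop] have a count <= freq."""
    region = counts[start:stop]
    if not region:
        raise ValueError("empty region")
    fraction = sum(1 for count in region if count <= freq) / len(region)
    return fraction >= prop


# Toy 4-residue matrix: residues 0 and 1 are confidently placed relative
# to each other, residues 2 and 3 only relative to themselves.
toy_pae = [
    [0, 1, 9, 9],
    [1, 0, 9, 9],
    [9, 9, 0, 9],
    [9, 9, 9, 0],
]
counts = low_pae_counts(toy_pae)
```

Under these assumptions, raising `freq` or lowering `prop` makes `region_meets_proportion` easier to satisfy, which matches the direction of the effects described for FREQ_DISORDERED and PROP_DISORDERED.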

If you are interested in the disordered regions at the N- or C-terminus, consider increasing FREQ_DISORDERED ([4-30]), decreasing MIN_DISORDERED_SIZE ([10-30]) or decreasing PROP_DISORDERED ([50-95]). This will result in more (and longer) terminal disordered regions being detected, but also many false positives. I would not change them all at the same time, as this will probably increase the sensitivity too much.

If you are interested in linkers or have a protein that is less well folded, consider modifying the FREQ_LINKER parameter ([4-30]). This value is used to adjust the boundaries of the linkers and as such, a higher value will result in longer linkers. However, linkers that were missed will still not be detected.


## Outputs
A CSV file containing the protein ID, protein length, number of predicted domains, domain delineations, linker delineations and terminal disordered region delineations. Delineations for each domain are separated by a ";".\
Ex.

|        | length | # domains |    domains     | linkers | disordered |
| ------ | ------ | --------- | -------------- | ------- | ---------- |
| prot 1 | 251    | 2         | 1-120;130-251  | 121-129 |            |
| prot 2 | 386    | 2         | 86-203;217-386 | 204-216 | 1-85       |
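
A table like the one above can be post-processed with the standard library. The sketch below assumes the CSV header matches the example table (an unnamed first column, then `length`, `# domains`, `domains`, `linkers`, `disordered`) and that coordinates are 1-based and inclusive, as the example suggests. The real column names may differ, and the helper names are hypothetical.

```python
import csv
import io


def parse_delineations(cell):
    """Turn '1-120;130-251' into [(1, 120), (130, 251)]; empty cells give []."""
    ranges = []
    for part in (cell or "").split(";"):
        part = part.strip()
        if part:
            start, end = part.split("-")
            ranges.append((int(start), int(end)))
    return ranges


def extract_segments(sequence, ranges):
    """Slice a sequence using 1-based, inclusive coordinates."""
    return [sequence[start - 1:end] for start, end in ranges]


def read_predictions(handle):
    """Map each protein ID to its parsed domain/linker/disordered ranges."""
    table = {}
    for row in csv.DictReader(handle):
        # Assumption: the protein ID sits in the unnamed first column.
        table[row[""]] = {
            name: parse_delineations(row[name])
            for name in ("domains", "linkers", "disordered")
        }
    return table


# In-memory copy of the example table, with the assumed header names.
SAMPLE = """\
,length,# domains,domains,linkers,disordered
prot 1,251,2,1-120;130-251,121-129,
prot 2,386,2,86-203;217-386,204-216,1-85
"""

predictions = read_predictions(io.StringIO(SAMPLE))
```

To read a real output file, pass an open file handle to `read_predictions` instead of the in-memory sample, and adjust the column names if they differ.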

## Citation

Boulay, A. et al. SPAED: Harnessing AlphaFold Output for Accurate Segmentation of Phage Endolysin Domains. 2025.04.25.650745 Preprint at https://doi.org/10.1101/2025.04.25.650745 (2025).


            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "spaed",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">3.10",
    "maintainer_email": null,
    "keywords": "bacteriophage, bioinformatics, delineation, domain, phage, protein, segmentation",
    "author": null,
    "author_email": "Alexandre Boulay <alexandre.boulay.6@ulaval.ca>",
    "download_url": "https://files.pythonhosted.org/packages/36/f9/9eeab37ee1710eb3d124e0d90cc8908b5e579477ea03d29c8185eebce4ee/spaed-1.0.5.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "A module for the segmentation of phage endolysin domains based on the PAE matrix from AlphaFold.",
    "version": "1.0.5",
    "project_urls": {
        "Homepage": "https://github.com/Rousseau-Team/spaed.git"
    },
    "split_keywords": [
        "bacteriophage",
        " bioinformatics",
        " delineation",
        " domain",
        " phage",
        " protein",
        " segmentation"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "79b3ede7745e6dc7642f17ef8e54c923dc3de58a02b2ba53e113302a5d2b304f",
                "md5": "4618b0092fb69c70aff0e0075eb658cc",
                "sha256": "cde3aa5f2f5f05f1b8c088ad1507d238fcf20d29e3654591d386712bea7c64d2"
            },
            "downloads": -1,
            "filename": "spaed-1.0.5-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "4618b0092fb69c70aff0e0075eb658cc",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">3.10",
            "size": 26155,
            "upload_time": "2025-07-12T18:42:05",
            "upload_time_iso_8601": "2025-07-12T18:42:05.925581Z",
            "url": "https://files.pythonhosted.org/packages/79/b3/ede7745e6dc7642f17ef8e54c923dc3de58a02b2ba53e113302a5d2b304f/spaed-1.0.5-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "36f99eeab37ee1710eb3d124e0d90cc8908b5e579477ea03d29c8185eebce4ee",
                "md5": "44c58b12003fbb27ac759c14561bfefd",
                "sha256": "64ac9ec422dc7945f03136e447634df6fc7fa5acefccc032bb76fd46a3844472"
            },
            "downloads": -1,
            "filename": "spaed-1.0.5.tar.gz",
            "has_sig": false,
            "md5_digest": "44c58b12003fbb27ac759c14561bfefd",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">3.10",
            "size": 1785822,
            "upload_time": "2025-07-12T18:42:07",
            "upload_time_iso_8601": "2025-07-12T18:42:07.837372Z",
            "url": "https://files.pythonhosted.org/packages/36/f9/9eeab37ee1710eb3d124e0d90cc8908b5e579477ea03d29c8185eebce4ee/spaed-1.0.5.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-07-12 18:42:07",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "Rousseau-Team",
    "github_project": "spaed",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "spaed"
}
        