squidly


| Field           | Value                                                      |
|-----------------|------------------------------------------------------------|
| Name            | squidly                                                    |
| Version         | 0.0.5                                                      |
| Home page       | https://github.com/WRiegs/Squidly                          |
| Upload time     | 2025-08-09 21:45:56                                        |
| Author          | William Reiger                                             |
| Requires Python | >=3.10                                                     |
| License         | GPL3                                                       |
| Keywords        | gene-annotation, bioinformatics, catalytic-site-prediction |
| Requirements    | No requirements were recorded.                             |

# 🦑 Squidly
![Overview Figure](overview_fig_.png)

Squidly is a tool that employs a biologically informed contrastive learning approach to accurately predict catalytic residues from enzyme sequences. We also offer Squidly ensembled with BLAST to achieve high accuracy in both low and high sequence homology settings.

If you use Squidly in your work, please cite our preprint: https://www.biorxiv.org/content/10.1101/2025.06.13.659624v1

If you have any issues installing, please post an issue! We have tested Squidly on Ubuntu.

## 📥 Installation
### Requirements
Squidly depends on the ESM2 3B or 15B protein language model. Running Squidly will automatically attempt to download each model.
The smaller 3B model is lighter, runs faster, and requires less VRAM.

Currently we expect GPU access, but if you require a CPU-only version please let us know and we can update this!
### Simple installation
```bash
conda create --name squidly python=3.10
conda activate squidly
pip install squidly
squidly install
```
Running `squidly install` should automatically download all models from Hugging Face. You can then run Squidly (see **Usage** below).

Note: if you get the error below, update numpy and pandas.

```
ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject
```
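
This error usually means a compiled package was built against a different NumPy version than the one installed. Before upgrading, it can help to see which versions are actually present in the environment. The sketch below is only a diagnostic aid; the helper name `installed_versions` is hypothetical and not part of Squidly.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("numpy", "pandas")):
    """Return {package: version string, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```

If the versions look stale, `pip install --upgrade numpy pandas` inside the `squidly` environment is the usual way to update both.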

### Installation Steps (manual)
These steps will enable you to also develop and change things as you wish.
```bash
# Clone the repository
git clone https://github.com/WRiegs/Squidly
cd Squidly
# Install dependencies
./install.sh # Makes the squidly conda env
conda activate squidly
# Install DIAMOND for BLAST
conda install -c bioconda -c conda-forge diamond

# Build and install (the glob matches whichever version setup.py builds)
python setup.py sdist bdist_wheel
pip install dist/squidly-*.tar.gz
```

PyTorch with CUDA 11.8+ must be installed; see https://pytorch.org/get-started/locally/ for instructions.
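
To confirm that the environment meets this requirement, a quick check along the following lines may be useful. It is a minimal sketch that only reports what PyTorch sees; the function name `cuda_report` is hypothetical, and the check does not verify the 11.8+ version bound for you.

```python
def cuda_report():
    """Return a one-line summary of the local PyTorch/CUDA setup."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return f"PyTorch {torch.__version__} found, but no CUDA device is available"
    # torch.version.cuda is the CUDA version PyTorch was built against.
    device = torch.cuda.get_device_name(0)
    return f"PyTorch {torch.__version__}, CUDA {torch.version.cuda}, device: {device}"


if __name__ == "__main__":
    print(cuda_report())
```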

## Usage
For example, to run the 3B model on a FASTA file (Squidly-only mode):
```bash
squidly run example.fasta esm2_t36_3B_UR50D 
```
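
When scripting many runs, it can be convenient to assemble this command from Python. The sketch below builds the argument list from the positional arguments and the optional `--database` flag (used for the BLAST ensemble) documented in this README. The helper `build_squidly_command` is hypothetical, not part of the Squidly package, and it only constructs the command; it does not run it.

```python
import shlex


def build_squidly_command(fasta_file, esm2_model, output_folder=None,
                          run_name=None, database=None):
    """Assemble the argument list for `squidly run` (hypothetical helper)."""
    allowed = {"esm2_t36_3B_UR50D", "esm2_t48_15B_UR50D"}
    if esm2_model not in allowed:
        raise ValueError(f"esm2_model must be one of {sorted(allowed)}")
    if run_name is not None and output_folder is None:
        # Both are positional, so run_name cannot be given on its own.
        raise ValueError("run_name is positional and needs output_folder before it")
    cmd = ["squidly", "run", str(fasta_file), esm2_model]
    if output_folder is not None:
        cmd.append(str(output_folder))
    if run_name is not None:
        cmd.append(run_name)
    if database is not None:
        cmd += ["--database", str(database)]
    return cmd


cmd = build_squidly_command("example.fasta", "esm2_t36_3B_UR50D")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once Squidly is installed.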

To run as an ensemble with BLAST, pass the database as well:
```bash
squidly run example.fasta esm2_t36_3B_UR50D output_folder/ --database reviewed_sprot_08042025.csv
```
Here `reviewed_sprot_08042025.csv` is the example database: a CSV file with the columns shown below.

Our database is provided as a zipped file in the data folder.


| Entry      | Sequence         | Residue                                  |
|------------|------------------|------------------------------------------|
| A0A009IHW8 | MSLEQKKGADIIS    | 207                                      |
| A0A023I7E1 | MRFQVIVAAATITMIY | 499\|577\|581                            |
| A0A024B7W1 | MKNPKKKSGGFRIV   | 1552\|1576\|1636\|2580\|2665\|2701\|2737 |
| A0A024RXP8 | MYRKLAVISAFL     | 228\|233                                 |
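
To check that a custom database has this shape before passing it to `--database`, something like the following sketch could be used. It assumes a plain comma-separated file whose `Residue` field uses a literal `|` separator (the backslashes in the table above are only Markdown escapes). The function `read_database` is a hypothetical helper, not part of Squidly, and the sequences in the example are the shortened ones from the table, so positions are not checked against sequence length.

```python
import csv
import io

REQUIRED_COLUMNS = ("Entry", "Sequence", "Residue")


def read_database(handle):
    """Parse a database CSV into {entry: (sequence, [residue positions])}."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    records = {}
    for row in reader:
        # Residues are stored as a |-separated list of integer positions.
        positions = [int(p) for p in row["Residue"].split("|") if p.strip()]
        records[row["Entry"]] = (row["Sequence"], positions)
    return records


example = io.StringIO(
    "Entry,Sequence,Residue\n"
    "A0A009IHW8,MSLEQKKGADIIS,207\n"
    "A0A023I7E1,MRFQVIVAAATITMIY,499|577|581\n"
)
db = read_database(example)
print(db["A0A023I7E1"][1])  # [499, 577, 581]
```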


```bash
 Usage: squidly [OPTIONS] FASTA_FILE ESM2_MODEL [OUTPUT_FOLDER] [RUN_NAME]                                         
                                                                                                                   
 Find catalytic residues using Squidly and BLAST.                                                                  
                                                                                                                   
╭─ Arguments ─────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ *    fasta_file         TEXT             Full path to query fasta (note have simple IDs otherwise we'll remove  │
│                                          all funky characters.)                                                 │
│                                          [default: None]                                                        │
│                                          [required]                                                             │
│ *    esm2_model         TEXT             Name of the esm2_model, esm2_t36_3B_UR50D or esm2_t48_15B_UR50D        │
│                                          [default: None]                                                        │
│                                          [required]                                                             │
│      output_folder      [OUTPUT_FOLDER]  Where to store results (full path!) [default: Current Directory]       │
│      run_name           [RUN_NAME]       Name of the run [default: squidly]                                     │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ───────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --database                  TEXT     Full path to database csv (if you want to do the ensemble), needs 3        │
│                                      columns: 'Entry', 'Sequence', 'Residue' where residue is a | separated     │
│                                      list of residues. See default DB provided by Squidly.                      │
│                                      [default: None]                                                            │
│ --cr-model-as               TEXT     Optional: Model for the catalytic residue prediction i.e. not using the    │
│                                      default with the package. Ensure it matches the esmmodel.                  │
│ --lstm-model-as             TEXT     Optional: LSTM model path for the catalytic residue prediction i.e. not    │
│                                      using the default with the package. Ensure it matches the esmmodel.        │
│ --toks-per-batch            INTEGER  Run method (filter or complete) i.e. filter = only annotates with the next │
│                                      tool those that couldn't be found.                                         │
│                                      [default: 5]                                                               │
│ --as-threshold              FLOAT    Whether or not to keep multiple predicted values if False only the top     │
│                                      result is retained.                                                        │
│                                      [default: 0.99]                                                            │
│ --blast-threshold           FLOAT    Sequence identity with which to use Squidly over BLAST defualt 0.3         │
│                                      (meaning for seqs with < 0.3 identity in the DB use Squidly).              │
│                                      [default: 0.3]                                                             │
│ --install-completion                 Install completion for the current shell.                                  │
│ --show-completion                    Show completion for the current shell, to copy it or customize the         │
│                                      installation.                                                              │
│ --help                               Show this message and exit.                                                │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

```
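
The `--blast-threshold` description implies a simple per-sequence rule: queries with less than 0.3 identity to anything in the database are handled by Squidly, and the rest by BLAST. The sketch below illustrates that rule only; it is not Squidly's actual implementation. The function name, the treatment of a query with no BLAST hit, and the handling of a query at exactly the threshold are all assumptions.

```python
def choose_predictor(best_identity, blast_threshold=0.3):
    """Pick 'squidly' or 'blast' for one query sequence.

    best_identity: highest sequence identity (0.0-1.0) of the query to any
    database entry, or None when BLAST found no hit (an assumption here).
    """
    if best_identity is None or best_identity < blast_threshold:
        return "squidly"
    return "blast"


for identity in (None, 0.12, 0.3, 0.85):
    print(identity, "->", choose_predictor(identity))
```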

## Data Availability
All datasets used in the paper are available at https://zenodo.org/records/15541320.

## Reproducing Squidly
We developed reproduction scripts for each benchmark training/test scenario.

- **AEGAN and Common Benchmarks**: Trained on Uni14230 (AEGAN), and tested on Uni3175 (AEGAN), HA_superfamily, NN, PC, and EF datasets.
- **CataloDB**: Trained and evaluated on curated training and test sets, with structural/sequence identity filtering to less than 30% identity.

The corresponding scripts can be found in the reproduction_scripts directory.

Before running them, download the datasets.zip file from Zenodo and unzip it in the base directory of Squidly.

Datasets:
https://zenodo.org/records/15541320

Model weights:
https://huggingface.co/WillRieger/Squidly

```bash
python reproduction_scripts/reproduce_squidly_CataloDB.py --scheme 2 --sample_limit 16000 --esm2_model esm2_t36_3B_UR50D --reruns 1
```

You must choose the pair scheme for the Squidly models:
<img src="pair_scheme_fig_.png" width=50%>

Schemes 2 and 3 had the sample limit parameter set to 16000, and scheme 1 had it set to 4000000.

You must also correctly specify the ESM2 model used.
You can use either esm2_t36_3B_UR50D or esm2_t48_15B_UR50D. The scripts will automatically download the model if it is specified by one of these names.
Alternatively, you may provide your own path to the models if you have already downloaded them.
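
To keep the scheme and its sample limit consistent across runs, the pairing described above can be captured in a small lookup. The sketch below builds the command line for the CataloDB script shown earlier; `reproduction_command` is a hypothetical helper, and only the script path and flags that appear in this README are used.

```python
# Sample limits reported in the text for each pair scheme.
SAMPLE_LIMITS = {1: 4_000_000, 2: 16_000, 3: 16_000}


def reproduction_command(scheme, esm2_model="esm2_t36_3B_UR50D", reruns=1,
                         script="reproduction_scripts/reproduce_squidly_CataloDB.py"):
    """Build the argv for one reproduction run (hypothetical helper)."""
    if scheme not in SAMPLE_LIMITS:
        raise ValueError(f"scheme must be one of {sorted(SAMPLE_LIMITS)}")
    return [
        "python", script,
        "--scheme", str(scheme),
        "--sample_limit", str(SAMPLE_LIMITS[scheme]),
        "--esm2_model", esm2_model,
        "--reruns", str(reruns),
    ]


print(" ".join(reproduction_command(2)))
```

For scheme 2 this prints the same command as the example above.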


            
