progres

Name	progres
Version	0.2.7
home_page	https://github.com/greener-group/progres
Summary	Fast protein structure searching using structure graph embeddings
upload_time	2024-09-02 11:48:11
maintainer	None
docs_url	None
author	Joe G Greener
requires_python	None
license	MIT
keywords	protein structure search graph embedding
requirements	No requirements were recorded.

# Progres - Protein Graph Embedding Search

[![Build status](https://github.com/greener-group/progres/workflows/CI/badge.svg)](https://github.com/greener-group/progres/actions)

This repository contains the method from the pre-print:

- Greener JG and Jamali K. Fast protein structure searching using structure graph embeddings. bioRxiv (2022) - [link](https://www.biorxiv.org/content/10.1101/2022.11.28.518224)

It provides the `progres` Python package that lets you search structures against pre-embedded structural databases, score pairs of structures and pre-embed datasets for searching against.
A single search typically takes 1-2 s, and the time per query is much lower when several queries are run together.
For the AlphaFold database, initial data loading takes around a minute but each subsequent query takes about a tenth of a second.

Currently [SCOPe](https://scop.berkeley.edu), [CATH](http://cathdb.info), [ECOD](http://prodata.swmed.edu/ecod), the whole [PDB](https://www.rcsb.org), the [AlphaFold structures for 21 model organisms](https://doi.org/10.1093/nar/gkab1061) and the [AlphaFold database TED domains](https://www.biorxiv.org/content/10.1101/2024.03.18.585509) are provided for searching against.
Searching is done by domain but [Chainsaw](https://github.com/JudeWells/chainsaw) can be used to automatically split query structures into domains.

## Installation

1. Python 3.8 or later is required. The software is OS-independent.
2. Install [PyTorch](https://pytorch.org) 1.11 or later, [PyTorch Scatter](https://github.com/rusty1s/pytorch_scatter), [PyTorch Geometric](https://github.com/pyg-team/pytorch_geometric), [FAISS](https://github.com/facebookresearch/faiss) and [STRIDE](https://webclu.bio.wzw.tum.de/stride) as appropriate for your system. A GPU is not required but may provide speedup in certain situations. Example commands for Linux (and other operating systems, bar the STRIDE install):
```bash
conda create -n prog python=3.9
conda activate prog
conda install pytorch=1.11 faiss-cpu -c pytorch
conda install pytorch-scatter pyg -c pyg
conda install kimlab::stride
```
3. Run `pip install progres`, which will also install [Biopython](https://biopython.org), [mmtf-python](https://github.com/rcsb/mmtf-python), [einops](https://github.com/arogozhnikov/einops) and [pydantic](https://github.com/pydantic/pydantic) if they are not already present.
4. The first time you search with the software, the trained model and pre-embedded databases (~660 MB) will be downloaded to the package directory from [Zenodo](https://zenodo.org/record/7782088), which requires an internet connection. This can take a few minutes. You can set the environment variable `PROGRES_DATA_DIR` to change where this data is stored, for example if you cannot write to the package directory. Remember to keep it set the next time you run Progres.
5. The first time you search against the AlphaFold database TED domains the pre-embedded database (~33 GB) will be downloaded similarly. This can take a while. Make sure you have enough disk space!

Alternatively, a Docker file is available in the `docker` directory.

## Usage

On Unix systems the executable `progres` will be added to the path during installation.
On Windows you can call the `bin/progres` script with Python if you can't access the executable.

Run `progres -h` to see the help text and `progres {mode} -h` to see the help text for each mode.
The modes are described below but there are other options outlined in the help text.
For example the `-d` flag sets the device to run on; this is `cpu` by default since this is often fastest for searching, but `cuda` will likely be faster when splitting domains with Chainsaw, searching many queries or embedding a dataset.
Try both if performance is important.

## Search a structure against a database

To search a PDB file `query.pdb` (which can be found in the `data` directory) against domains in the SCOPe database and print output:
```bash
progres search -q query.pdb -t scope95
```
```
# QUERY_NUM: 1
# QUERY: query.pdb
# DOMAIN_NUM: 1
# DOMAIN_SIZE: 150 residues (1-150)
# DATABASE: scope95
# PARAMETERS: minsimilarity 0.8, maxhits 100, chainsaw no, faiss no, progres v0.2.7
# HIT_N  DOMAIN   HIT_NRES  SIMILARITY  NOTES
      1  d1a6ja_       150      1.0000  d.112.1.1 - Nitrogen regulatory bacterial protein IIa-ntr {Escherichia coli [TaxId: 562]}
      2  d2a0ja_       146      0.9988  d.112.1.0 - automated matches {Neisseria meningitidis [TaxId: 122586]}
      3  d3urra1       151      0.9983  d.112.1.0 - automated matches {Burkholderia thailandensis [TaxId: 271848]}
      4  d3lf6a_       154      0.9971  d.112.1.1 - automated matches {Artificial gene [TaxId: 32630]}
      5  d3oxpa1       147      0.9968  d.112.1.0 - automated matches {Yersinia pestis [TaxId: 214092]}
...
```
- `-q` is the path to the query structure file. Alternatively, `-l` is a text file with one query file path per line and each result will be printed in turn. This is considerably faster for multiple queries since setup only occurs once and multiple workers can be used. Only the first chain in each file is considered.
- `-t` is the pre-embedded database to search against. Currently this must be either one of the databases listed below or the file path to a pre-embedded dataset generated with `progres embed`.
- `-f` determines the file format of the query structure (`guess`, `pdb`, `mmcif`, `mmtf` or `coords`). By default this is guessed from the file extension, with `pdb` chosen if a guess can't be made. `coords` refers to a text file with one Cα atom per line, given as coordinates separated by white space.
- `-s` is the Progres score (0 to 1) above which to return hits, default 0.8. As discussed in the paper, 0.8 indicates the same fold.
- `-m` is the maximum number of hits to return, default 100.
- `-c` splits the query structure(s) into domains with Chainsaw and searches with each domain separately. If no domains are found with Chainsaw, no results will be returned. Only the first chain in each file is considered. Running Chainsaw may take a few seconds.
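
If you capture the printed output rather than using the Python library, it can be turned into a simple data structure. The following is a minimal sketch, assuming each result block starts with a `# QUERY_NUM:` line and that hit rows have the five whitespace-separated columns shown above. `parse_progres_output` is a hypothetical helper, not part of Progres.

```python
SAMPLE = """\
# QUERY_NUM: 1
# QUERY: query.pdb
# DOMAIN_NUM: 1
# DOMAIN_SIZE: 150 residues (1-150)
# DATABASE: scope95
# HIT_N  DOMAIN   HIT_NRES  SIMILARITY  NOTES
      1  d1a6ja_       150      1.0000  d.112.1.1 - Nitrogen regulatory bacterial protein IIa-ntr {Escherichia coli [TaxId: 562]}
      2  d2a0ja_       146      0.9988  d.112.1.0 - automated matches {Neisseria meningitidis [TaxId: 122586]}
...
"""


def parse_progres_output(text):
    """Parse text printed by `progres search` into a list of result blocks."""
    blocks = []
    current = None
    for line in text.splitlines():
        if line.startswith("# QUERY_NUM:"):
            # Assumption: every result block begins with this header line
            current = {"meta": {}, "hits": []}
            blocks.append(current)
        if current is None or not line.strip():
            continue
        if line.startswith("#"):
            key, sep, value = line[1:].partition(":")
            if sep:  # the column header line has no colon and is skipped
                current["meta"][key.strip()] = value.strip()
            continue
        fields = line.split(None, 4)
        if len(fields) < 4 or not fields[0].isdigit():
            continue  # not a hit row, for example a trailing "..."
        current["hits"].append({
            "hit_n": int(fields[0]),
            "domain": fields[1],
            "nres": int(fields[2]),
            "similarity": float(fields[3]),
            "notes": fields[4] if len(fields) > 4 else "",
        })
    return blocks


blocks = parse_progres_output(SAMPLE)
```

Each block keeps the header fields as strings under `meta` and the hit rows as dictionaries under `hits`, so the notes column survives intact even though it contains spaces.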

Other tools for splitting query structures into domains include [Merizo](https://github.com/psipred/Merizo) and [SWORD2](https://www.dsimb.inserm.fr/SWORD2).
You can also slice out domains manually using software such as the `pdb_selres` command from [pdb-tools](http://www.bonvinlab.org/pdb-tools).

Interpreting the hit descriptions depends on the database being searched.
The domain name often includes a reference to the corresponding PDB file, for example d1a6ja_ refers to PDB ID 1A6J chain A, and this can be opened in the [RCSB PDB structure view](https://www.rcsb.org/3d-view/1A6J/1) to get a quick look.
For the AlphaFold database TED domains, files can be downloaded from [links such as this](https://alphafold.ebi.ac.uk/files/AF-A0A6J8EXE6-F1-model_v4.pdb) where `AF-A0A6J8EXE6-F1` is the first part of the hit notes and is followed by the residue range of the domain.
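
The ID conventions above can be sketched in a few lines. This is an illustration under stated assumptions, not part of Progres: the function names are hypothetical, the SCOPe pattern only covers IDs shaped like `d1a6ja_`, the chain letter is upper-cased to match the 1A6J chain A example, and the `-model_v4.pdb` suffix is copied from the example link and may change with future AlphaFold database releases.

```python
import re


def scope_sid_to_pdb(sid):
    """Split a SCOPe domain ID such as 'd1a6ja_' into (PDB ID, chain)."""
    match = re.fullmatch(r"d([0-9][0-9a-z]{3})([0-9a-z_.])([0-9a-z_])", sid.lower())
    if match is None:
        raise ValueError(f"not a recognised SCOPe domain ID: {sid!r}")
    pdb_id, chain, _domain = match.groups()
    # Assumption: '_' or '.' in the chain position means no single chain is named
    chain = None if chain in "_." else chain.upper()
    return pdb_id.upper(), chain


def rcsb_view_url(pdb_id):
    """URL of the RCSB PDB structure view, following the example link above."""
    return f"https://www.rcsb.org/3d-view/{pdb_id}/1"


def alphafold_file_url(notes):
    """Build a download URL from the start of an `afted` hit note."""
    match = re.match(r"AF-[A-Z0-9]+-F\d+", notes)
    if match is None:
        raise ValueError(f"no AlphaFold model ID at the start of: {notes!r}")
    # Assumption: the version suffix matches the example link above
    return f"https://alphafold.ebi.ac.uk/files/{match.group(0)}-model_v4.pdb"
```

For instance, `rcsb_view_url(scope_sid_to_pdb("d1a6ja_")[0])` reproduces the structure view link given above.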

### Available databases

The available pre-embedded databases are:

| Name      | Description                                                                                                                                                                                | Number of domains | Search time (1 query)      | Search time (100 queries)  |
| --------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ----------------- | -------------------------- | -------------------------- |
| `scope95` | ASTRAL set of [SCOPe](https://scop.berkeley.edu) 2.08 domains clustered at 95% seq ID                                                                                                      | 35,371            | 1.35 s                     | 2.81 s                     |
| `scope40` | ASTRAL set of [SCOPe](https://scop.berkeley.edu) 2.08 domains clustered at 40% seq ID                                                                                                      | 15,127            | 1.32 s                     | 2.36 s                     |
| `cath40`  | S40 non-redundant domains from [CATH](http://cathdb.info) 23/11/22                                                                                                                         | 31,884            | 1.38 s                     | 2.79 s                     |
| `ecod70`  | F70 representative domains from [ECOD](http://prodata.swmed.edu/ecod) develop287                                                                                                           | 71,635            | 1.46 s                     | 3.82 s                     |
| `pdb100`  | All [PDB](https://www.rcsb.org) protein chains as of 02/08/24 split into domains with Chainsaw                                                                                             | 1,177,152         | 2.90 s                     | 27.3 s                     |
| `af21org` | [AlphaFold](https://alphafold.ebi.ac.uk) structures for 21 model organisms split into domains by [CATH-Assign](https://doi.org/10.1038/s42003-023-04488-9)                                 | 338,258           | 2.21 s                     | 11.0 s                     |
| `afted`   | [AlphaFold database](https://alphafold.ebi.ac.uk) structures split into domains by [TED](https://www.biorxiv.org/content/10.1101/2024.03.18.585509) and clustered at 50% sequence identity | 53,344,209        | 67.7 s                     | 73.1 s                     |

Search time is for a 150 residue protein (d1a6ja_ in PDB format) on an Intel i9-10980XE CPU with 256 GB RAM and PyTorch 1.11.
Times are given for 1 or 100 queries.
Note that `afted` uses exhaustive FAISS searching.
This doesn't change the hits that are found, but the similarity score will differ by a small amount - see the paper.
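
To get a feel for how the table splits into fixed setup cost and per-query cost, the two timing columns can be compared. The sketch below is a rough estimate only: it assumes the time grows linearly with the number of queries, which the source does not state, and `marginal_time_per_query` is a hypothetical helper.

```python
# (time for 1 query, time for 100 queries) in seconds, copied from the table above
SEARCH_TIMES = {
    "scope95": (1.35, 2.81),
    "scope40": (1.32, 2.36),
    "cath40": (1.38, 2.79),
    "ecod70": (1.46, 3.82),
    "pdb100": (2.90, 27.3),
    "af21org": (2.21, 11.0),
    "afted": (67.7, 73.1),
}


def marginal_time_per_query(t_1, t_100):
    """Estimated extra time per additional query, assuming linear scaling."""
    return (t_100 - t_1) / 99


marginal = {
    name: marginal_time_per_query(t_1, t_100)
    for name, (t_1, t_100) in SEARCH_TIMES.items()
}
```

Under this assumption most of the `afted` time is the initial data loading, with each additional query adding roughly 0.05 s on the benchmark machine.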

## Calculate the score between two structures

To calculate the Progres score between two protein domains:
```bash
progres score struc_1.pdb struc_2.pdb
```
```
0.7265280485153198
```
- `-f` and `-g` determine the file format for the first and second structures as above (`guess`, `pdb`, `mmcif`, `mmtf` or `coords`).

The order of the domains does not affect the score.
A score of 0.8 or higher indicates the same fold.
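
If Progres is installed in a separate environment from your own code, the command above can be called as a subprocess. This is a minimal sketch, assuming the score is printed as the last line of standard output as in the example; the helper names are hypothetical. When Progres is importable, `pg.progres_score` from the Python library is the simpler route.

```python
import subprocess


def build_score_command(struc_1, struc_2, format_1="guess", format_2="guess"):
    """Build the `progres score` command line using the -f and -g flags."""
    return ["progres", "score", str(struc_1), str(struc_2),
            "-f", format_1, "-g", format_2]


def parse_score(stdout):
    """Read the score from the printed output (assumed to be the last line)."""
    return float(stdout.strip().splitlines()[-1])


def progres_score_cli(struc_1, struc_2, **formats):
    """Run `progres score` and return the score as a float."""
    completed = subprocess.run(
        build_score_command(struc_1, struc_2, **formats),
        capture_output=True, text=True, check=True,
    )
    return parse_score(completed.stdout)


def same_fold(score, threshold=0.8):
    """Apply the 0.8 same-fold threshold described above."""
    return score >= threshold
```

With `check=True`, a non-zero exit status from `progres` raises `subprocess.CalledProcessError` instead of returning a misleading score.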

## Pre-embed a dataset to search against

To embed a dataset of structures, allowing it to be searched against:
```bash
progres embed -l filepaths.txt -o searchdb.pt
```
- `-l` is a text file with information on one structure per line, each of which will be one entry in the output. White space should separate the file path to the structure and the domain name. Any additional text on the line is optional and is treated as a note for the notes column of the results.
- `-o` is the output file path for the PyTorch file containing a dictionary with the embeddings and associated data. It can be read in with `torch.load`.
- `-f` determines the file format of each structure as above (`guess`, `pdb`, `mmcif`, `mmtf` or `coords`).

Again, the structures should correspond to single protein domains.
The embeddings are stored as Float16, which has no noticeable effect on search performance.

As an example, you can run the above command from the `data` directory to generate a database with two structures.

## Python library

`progres` can also be used in Python, allowing it to be integrated into other methods:
```python
import progres as pg

# Search as above, returns a list where each entry is a dictionary for a query
# A generator is also available as pg.progres_search_generator
results = pg.progres_search(querystructure="query.pdb", targetdb="scope95")
results[0].keys() # dict_keys(['query_num', 'query', 'query_size', 'database', 'minsimilarity',
                  #            'maxhits', 'domains', 'hits_nres', 'similarities', 'notes'])

# Score as above, returns a float (similarity score 0 to 1)
pg.progres_score("struc_1.pdb", "struc_2.pdb")

# Pre-embed as above, saves a dictionary
pg.progres_embed(structurelist="filepaths.txt", outputfile="searchdb.pt")
import torch
torch.load("searchdb.pt").keys() # dict_keys(['ids', 'embeddings', 'nres', 'notes'])

# Read a structure file into a PyTorch Geometric graph
graph = pg.read_graph("query.pdb")
graph # Data(x=[150, 67], edge_index=[2, 2758], coords=[150, 3])

# Embed a single structure
embedding = pg.embed_structure("query.pdb")
embedding.shape # torch.Size([128])

# Load and reuse the model for speed
model = pg.load_trained_model()
embedding = pg.embed_structure("query.pdb", model=model)

# Embed Cα coordinates and search with the embedding
# This is useful for using progres in existing pipelines that give out Cα coordinates
# queryembeddings should have shape (128) or (n, 128)
#   and should be normalised across the 128 dimension
coords = pg.read_coords("query.pdb")
embedding = pg.embed_coords(coords) # Can take a list of coords or a tensor of shape (nres, 3)
results = pg.progres_search(queryembeddings=embedding, targetdb="scope95")

# Get the similarity score (0 to 1) between two embeddings
# The distance (1 - similarity) is also available as pg.embedding_distance
score = pg.embedding_similarity(embedding, embedding)
score # tensor(1.) in this case since they are the same embedding

# Get all-v-all similarity scores between 1000 embeddings
embs = torch.nn.functional.normalize(torch.randn(1000, 128), dim=1)
scores = pg.embedding_similarity(embs.unsqueeze(0), embs.unsqueeze(1))
scores.shape # torch.Size([1000, 1000])
```
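
The dictionaries returned by `pg.progres_search` can be reshaped into one record per hit. This sketch assumes that `domains`, `hits_nres`, `similarities` and `notes` are parallel sequences with one entry per hit, which the key names suggest but the text above does not state. `hits_table` is a hypothetical helper, and the `result` dictionary is a hand-written stand-in built from the example output so that the block runs without Progres installed.

```python
from itertools import islice


def hits_table(result, top_n=5, min_similarity=0.0):
    """Return up to top_n hits as dictionaries, keeping those above a cutoff."""
    rows = zip(result["domains"], result["hits_nres"],
               result["similarities"], result["notes"])
    kept = (
        {"domain": d, "nres": int(n), "similarity": float(s), "notes": note}
        for d, n, s, note in rows
        if float(s) >= min_similarity
    )
    return list(islice(kept, top_n))


# Stand-in for results[0] from pg.progres_search, values from the example output
result = {
    "query": "query.pdb",
    "domains": ["d1a6ja_", "d2a0ja_", "d3urra1"],
    "hits_nres": [150, 146, 151],
    "similarities": [1.0, 0.9988, 0.9983],
    "notes": ["d.112.1.1 - Nitrogen regulatory bacterial protein IIa-ntr",
              "d.112.1.0 - automated matches",
              "d.112.1.0 - automated matches"],
}
top = hits_table(result, top_n=2)
```

The `float` and `int` conversions are there so the helper also works if the values come back as NumPy or PyTorch scalars rather than plain Python numbers.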

## Scripts

Datasets and scripts for benchmarking (including for other methods), FAISS index generation and training are in the `scripts` directory.
The trained model and pre-embedded databases are available on [Zenodo](https://zenodo.org/record/7782088).

## Notes

The implementation of the E(n)-equivariant GNN uses [EGNN PyTorch](https://github.com/lucidrains/egnn-pytorch).
We also include code from [SupContrast](https://github.com/HobbitLong/SupContrast) and [Chainsaw](https://github.com/JudeWells/chainsaw).

Please open issues or [get in touch](http://jgreener64.github.io) with any feedback.
Contributions via pull requests are welcome.

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/greener-group/progres",
    "name": "progres",
    "maintainer": null,
    "docs_url": null,
    "requires_python": null,
    "maintainer_email": null,
    "keywords": "protein structure search graph embedding",
    "author": "Joe G Greener",
    "author_email": "jgreener@mrc-lmb.cam.ac.uk",
    "download_url": "https://files.pythonhosted.org/packages/62/1e/bab5ecc25e3dad66db95f2c420446c7a19b2282b760b2bed6735df4d4997/progres-0.2.7.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Fast protein structure searching using structure graph embeddings",
    "version": "0.2.7",
    "project_urls": {
        "Homepage": "https://github.com/greener-group/progres"
    },
    "split_keywords": [
        "protein",
        "structure",
        "search",
        "graph",
        "embedding"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "22b42fa75458dfadebc8d993a732994fb728c78ad4f4fed6d7e7f15ffb8c0123",
                "md5": "a1e3c3c7190f12a884ff14345e5df946",
                "sha256": "5037ce4bf8fb093380b4090daafec1109ca2387ca461488abff9e7cf2fd7cfe2"
            },
            "downloads": -1,
            "filename": "progres-0.2.7-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "a1e3c3c7190f12a884ff14345e5df946",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 47423,
            "upload_time": "2024-09-02T11:48:07",
            "upload_time_iso_8601": "2024-09-02T11:48:07.268090Z",
            "url": "https://files.pythonhosted.org/packages/22/b4/2fa75458dfadebc8d993a732994fb728c78ad4f4fed6d7e7f15ffb8c0123/progres-0.2.7-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "621ebab5ecc25e3dad66db95f2c420446c7a19b2282b760b2bed6735df4d4997",
                "md5": "b6a10883324a953b23d444822476d38e",
                "sha256": "c744c8ab79b4f190231b127894645c2f1a6cb8811176eec0974455993da03552"
            },
            "downloads": -1,
            "filename": "progres-0.2.7.tar.gz",
            "has_sig": false,
            "md5_digest": "b6a10883324a953b23d444822476d38e",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 45587,
            "upload_time": "2024-09-02T11:48:11",
            "upload_time_iso_8601": "2024-09-02T11:48:11.729506Z",
            "url": "https://files.pythonhosted.org/packages/62/1e/bab5ecc25e3dad66db95f2c420446c7a19b2282b760b2bed6735df4d4997/progres-0.2.7.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-09-02 11:48:11",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "greener-group",
    "github_project": "progres",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "progres"
}
