alphafold3-pytorch-lightning-hydra

Name	alphafold3-pytorch-lightning-hydra JSON
Version	0.5.31 JSON
home_page	None
Summary	AlphaFold 3 - Pytorch
upload_time	2024-09-21 20:05:18
maintainer	None
docs_url	None
author	None
requires_python	>=3.10
license	MIT License Copyright (c) 2024 Phil Wang Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
keywords	artificial intelligence deep learning protein structure prediction

<div align="center">

# AlphaFold 3 - PyTorch Lightning + Hydra

<a href="https://pytorch.org/get-started/locally/"><img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-ee4c2c?logo=pytorch&logoColor=white"></a>
<a href="https://pytorchlightning.ai/"><img alt="Lightning" src="https://img.shields.io/badge/-Lightning-792ee5?logo=pytorchlightning&logoColor=white"></a>
<a href="https://hydra.cc/"><img alt="Config: Hydra" src="https://img.shields.io/badge/Config-Hydra-89b8cd"></a>
<a href="https://github.com/ashleve/lightning-hydra-template"><img alt="Template" src="https://img.shields.io/badge/-Lightning--Hydra--Template-017F2F?style=flat&logo=github&labelColor=gray"></a><br>

<!-- [![Paper](http://img.shields.io/badge/paper-arxiv.1001.2234-B31B1B.svg)](https://www.nature.com/articles/s41586-024-07487-w) -->

<!-- [![Conference](http://img.shields.io/badge/AnyConference-year-4b44ce.svg)](https://papers.nips.cc/paper/2020) -->

<img src="./img/alphafold3.png" width="600">

</div>

## Description

Implementation of <a href="https://www.nature.com/articles/s41586-024-07487-w">AlphaFold 3</a> in PyTorch Lightning + Hydra

You can chat with other researchers about this work [here](https://discord.gg/Xsq4WaBR9S)

<a href="https://www.youtube.com/watch?v=qjFgthkKxcA">Review of the paper</a> by <a href="https://x.com/sokrypton">Sergey</a>

<a href="https://elanapearl.github.io/blog/2024/the-illustrated-alphafold/">Illustrated guide</a> by <a href="https://elanapearl.github.io/">Elana P. Simon</a>

<a href="https://www.youtube.com/watch?v=AE35XCN5NuU">Talk by Max Jaderberg</a>

The original version of this repository with <a href="https://lightning.ai/docs/fabric/stable/">Fabric</a> support is being maintained by <a href="https://github.com/lucidrains">Phil</a> at <a href="https://github.com/lucidrains/alphafold3-pytorch">this repository</a>

A visualization of the molecules of life used in the repository can be seen and interacted with <a href="https://colab.research.google.com/drive/1S9TD8WmS1Gjgwjo9xyEYTbdxgMVVZcQe?usp=sharing">here</a>

## Appreciation

- <a href="https://github.com/lucidrains">Phil</a> for spearheading the development of the `Alphafold3` module and losses as well as the `Input` classes!

- <a href="https://github.com/joseph-c-kim">Joseph</a> for contributing the Relative Positional Encoding and the Smooth LDDT Loss!

- <a href="https://github.com/engelberger">Felipe</a> for contributing Weighted Rigid Align, Express Coordinates In Frame, Compute Alignment Error, and Centre Random Augmentation modules!

- <a href="https://github.com/gitabtion">Heng</a> for pointing out inconsistencies with the paper and pull requesting the solutions (e.g., finding an issue with the molecular atom indices for the distogram loss)

- <a href="https://github.com/luwei0917">Wei Lu</a> for catching a few erroneous hyperparameters

- <a href="https://github.com/milot-mirdita">Milot</a> for generating MSA and template inputs as well as optimizing the PDB dataset clustering script!

- <a href="https://github.com/vandrw">Andrei</a> for working on the weighted PDB dataset sampling and for taking on the gradio frontend interface!

- <a href="https://github.com/tanjimin">Jimin</a> for submitting a small fix to an issue with the coordinates being passed into `WeightedRigidAlign`

- <a href="https://github.com/xluo233">@xluo233</a> for contributing the confidence measures, clash penalty ranking, sample ranking logic, as well as the logic for computing the model selection score and unresolved RASA!

- <a href="https://github.com/sj900">sj900</a> for integrating and testing the `WeightedPDBSampler` within the `PDBDataset` and for adding initial support for MSA and template parsing!

- <a href="https://github.com/wufandi">Fandi</a> for discovering a few inconsistencies in the elucidated atom diffusion module with the supplementary

- <a href="https://github.com/ptosco">Paolo</a> for proposing the `PDB neutral stable molecule` hypothesis!

- <a href="https://github.com/dhuvik">Dhuvi</a> for fixing a bug related to metal ion molecule ID assignment for `Alphafold3Inputs` and for taking on the logic for translating `Alphafold3Inputs` to `Biomolecules` for saving inference samples to mmCIF files!

- Tom (from the Discord channel) for identifying a discrepancy between this codebase's distogram and template unit vector computations and those of OpenFold (and <a href="https://github.com/vandrw">Andrei</a> for helping address the distogram issue)!

- <a href="https://github.com/Kaihui-Cheng">Kaihui</a> for identifying a bug in how non-standard atoms were handled in polymer residues!

- <a href="https://github.com/patrick-kidger">Patrick</a> for <a href="https://docs.kidger.site/jaxtyping/">jaxtyping</a>, <a href="https://github.com/fferflo">Florian</a> for <a href="https://github.com/fferflo/einx">einx</a>, and of course, <a href="https://github.com/arogozhnikov">Alex</a> for <a href="https://einops.rocks/">einops</a>

- Soumith and the PyTorch organization for giving <a href="https://github.com/lucidrains">Phil</a> the opportunity to open source this work

## Contents

- [Installation](#installation)
- [Usage](#usage)
- [Data preparation](#data-preparation)
- [Training](#training)
- [Evaluation](#evaluation)
- [For developers](#for-developers)
- [Citations](#citations)

## Installation

<details>

### Pip

```bash
pip install alphafold3-pytorch-lightning-hydra
```

### Conda

Install `mamba` for dependency management (as a fast alternative to Anaconda):

```bash
wget "https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh"
bash Mambaforge-$(uname)-$(uname -m).sh  # accept all terms and install to the default location
rm Mambaforge-$(uname)-$(uname -m).sh  # (optionally) remove installer after using it
source ~/.bashrc  # alternatively, one can restart their shell session to achieve the same result
```

Install dependencies:

```bash
# Clone project
git clone https://github.com/amorehead/alphafold3-pytorch-lightning-hydra
cd alphafold3-pytorch-lightning-hydra

# Create Conda environment
mamba env create -f environment.yaml
conda activate alphafold3-pytorch  # note: one still needs to use `conda` to (de)activate environments

# Install local project as package
pip3 install -e .
```

### Docker

The included `Dockerfile` contains the dependencies required to run the package and to perform training and inference using PyTorch with GPUs.

The default base image is `pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime`, and the build installs the latest version of this package from the `main` GitHub branch.

```bash
# Clone project
git clone https://github.com/amorehead/alphafold3-pytorch-lightning-hydra
cd alphafold3-pytorch-lightning-hydra

# Build Docker image
docker build -t alphafold3-pytorch-lightning-hydra .
```

Alternatively, use build arguments to rebuild the image with different software versions:

- `PYTORCH_TAG`: Changes the base image and thus builds with different PyTorch, CUDA, and/or cuDNN versions.
- `GIT_TAG`: Changes the tag of this repo to clone and install the package.

For example:

```bash
## Use build argument to change versions
docker build --build-arg "PYTORCH_TAG=2.2.1-cuda12.1-cudnn8-devel" --build-arg "GIT_TAG=0.1.15" -t alphafold3-pytorch-lightning-hydra .
```
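
When image builds are scripted (for example, in a CI job), the same command can be assembled programmatically. The following is a minimal sketch, not part of this repository: the helper name `docker_build_command` is hypothetical, and it only uses the `PYTORCH_TAG` and `GIT_TAG` build arguments described above.

```python
import shlex


def docker_build_command(image_name, pytorch_tag=None, git_tag=None):
    """Assemble a `docker build` argv list (hypothetical helper).

    Tags left as None are omitted, so the Dockerfile defaults apply.
    """
    argv = ["docker", "build"]
    if pytorch_tag is not None:
        argv += ["--build-arg", f"PYTORCH_TAG={pytorch_tag}"]
    if git_tag is not None:
        argv += ["--build-arg", f"GIT_TAG={git_tag}"]
    argv += ["-t", image_name, "."]
    return argv


cmd = docker_build_command(
    "alphafold3-pytorch-lightning-hydra",
    pytorch_tag="2.2.1-cuda12.1-cudnn8-devel",
    git_tag="0.1.15",
)
# Print a shell-quoted version that can be pasted into a terminal.
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` from the project root would run the build without going through a shell.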

Then, run the container with GPUs and mount a local volume (for training) using the following command:

```bash
## Run Container
docker run -v .:/data --gpus all -it alphafold3-pytorch-lightning-hydra
```

**NOTE:** An AMD ROCm version of the Docker image can alternatively be built using `ROCm_Dockerfile`.

</details>

## Usage

<details>

```python
import torch
from alphafold3_pytorch import Alphafold3
from alphafold3_pytorch.utils.model_utils import exclusive_cumsum

alphafold3 = Alphafold3(
    dim_atom_inputs = 77,
    dim_template_feats = 108
)

# Mock inputs

seq_len = 16

molecule_atom_indices = torch.randint(0, 2, (2, seq_len)).long()
molecule_atom_lens = torch.full((2, seq_len), 2).long()

atom_seq_len = molecule_atom_lens.sum(dim=-1).amax()
atom_offsets = exclusive_cumsum(molecule_atom_lens)

atom_inputs = torch.randn(2, atom_seq_len, 77)
atompair_inputs = torch.randn(2, atom_seq_len, atom_seq_len, 5)

additional_molecule_feats = torch.randint(0, 2, (2, seq_len, 5))
additional_token_feats = torch.randn(2, seq_len, 33)
is_molecule_types = torch.randint(0, 2, (2, seq_len, 5)).bool()
is_molecule_mod = torch.randint(0, 2, (2, seq_len, 4)).bool()
molecule_ids = torch.randint(0, 32, (2, seq_len))

template_feats = torch.randn(2, 2, seq_len, seq_len, 108)
template_mask = torch.ones((2, 2)).bool()

msa = torch.randn(2, 7, seq_len, 32)
msa_mask = torch.ones((2, 7)).bool()

additional_msa_feats = torch.randn(2, 7, seq_len, 2)

# Required for training, but omitted on inference

atom_pos = torch.randn(2, atom_seq_len, 3)

distogram_atom_indices = molecule_atom_lens - 1

distance_labels = torch.randint(0, 37, (2, seq_len, seq_len))
resolved_labels = torch.randint(0, 2, (2, atom_seq_len))

# Offset indices correctly

distogram_atom_indices += atom_offsets
molecule_atom_indices += atom_offsets

# Train

loss = alphafold3(
    num_recycling_steps = 2,
    atom_inputs = atom_inputs,
    atompair_inputs = atompair_inputs,
    molecule_ids = molecule_ids,
    molecule_atom_lens = molecule_atom_lens,
    additional_molecule_feats = additional_molecule_feats,
    additional_msa_feats = additional_msa_feats,
    additional_token_feats = additional_token_feats,
    is_molecule_types = is_molecule_types,
    is_molecule_mod = is_molecule_mod,
    msa = msa,
    msa_mask = msa_mask,
    templates = template_feats,
    template_mask = template_mask,
    atom_pos = atom_pos,
    distogram_atom_indices = distogram_atom_indices,
    molecule_atom_indices = molecule_atom_indices,
    distance_labels = distance_labels,
    resolved_labels = resolved_labels
)

loss.backward()

# After much training ...

sampled_atom_pos = alphafold3(
    num_recycling_steps = 4,
    num_sample_steps = 16,
    atom_inputs = atom_inputs,
    atompair_inputs = atompair_inputs,
    molecule_ids = molecule_ids,
    molecule_atom_lens = molecule_atom_lens,
    additional_molecule_feats = additional_molecule_feats,
    additional_msa_feats = additional_msa_feats,
    additional_token_feats = additional_token_feats,
    is_molecule_types = is_molecule_types,
    is_molecule_mod = is_molecule_mod,
    msa = msa,
    msa_mask = msa_mask,
    templates = template_feats,
    template_mask = template_mask
)

sampled_atom_pos.shape # (2, <atom_seqlen>, 3)
```
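
The "offset indices correctly" step in the example above may be easier to follow on plain lists. The sketch below is an illustration only: `exclusive_cumsum_list` is a hypothetical pure-Python stand-in for the tensor-based `exclusive_cumsum` imported above, and it handles a single (unbatched) sequence of molecules.

```python
def exclusive_cumsum_list(lengths):
    """Return offsets where offsets[i] == sum(lengths[:i]) (hypothetical helper)."""
    offsets = []
    running_total = 0
    for length in lengths:
        offsets.append(running_total)
        running_total += length
    return offsets


# Four molecules with two atoms each, as in the mock inputs above.
molecule_atom_lens = [2, 2, 2, 2]
atom_offsets = exclusive_cumsum_list(molecule_atom_lens)

# Per-molecule index of the last atom, i.e. `molecule_atom_lens - 1`.
local_last_atom = [n - 1 for n in molecule_atom_lens]

# Adding the offsets turns per-molecule indices into indices
# into the flat atom sequence of length sum(molecule_atom_lens).
global_last_atom = [i + off for i, off in zip(local_last_atom, atom_offsets)]

print(atom_offsets)
print(global_last_atom)
```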

</details>

<details>

An example with molecule-level input handling:

```python
import torch
from alphafold3_pytorch import Alphafold3, Alphafold3Input

contrived_protein = 'AG'

mock_atompos = [
    torch.randn(5, 3),   # alanine has 5 non-hydrogen atoms
    torch.randn(4, 3)    # glycine has 4 non-hydrogen atoms
]

train_alphafold3_input = Alphafold3Input(
    proteins = [contrived_protein],
    missing_atom_indices = [[1, 2], None],
    atom_pos = mock_atompos
)

eval_alphafold3_input = Alphafold3Input(
    proteins = [contrived_protein]
)

# training

alphafold3 = Alphafold3(
    dim_atom_inputs = 3,
    dim_atompair_inputs = 5,
    atoms_per_window = 27,
    dim_template_feats = 108,
    num_molecule_mods = 0,
    confidence_head_kwargs = dict(
        pairformer_depth = 1
    ),
    template_embedder_kwargs = dict(
        pairformer_stack_depth = 1
    ),
    msa_module_kwargs = dict(
        depth = 1
    ),
    pairformer_stack = dict(
        depth = 2
    ),
    diffusion_module_kwargs = dict(
        atom_encoder_depth = 1,
        token_transformer_depth = 1,
        atom_decoder_depth = 1,
    )
)

loss = alphafold3.forward_with_alphafold3_inputs([train_alphafold3_input])
loss.backward()

# sampling

alphafold3.eval()
sampled_atom_pos = alphafold3.forward_with_alphafold3_inputs(eval_alphafold3_input)

assert sampled_atom_pos.shape == (1, (5 + 4), 3)
```
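
The final shape assertion follows from the per-residue heavy-atom counts noted in the comments above (alanine 5, glycine 4). As a rough sketch, the expected atom count for a short sequence could be computed with a lookup table. The table and the function name `expected_num_atoms` are illustrative assumptions rather than part of the package; serine is added here as an extra entry, and terminal atoms such as OXT are ignored to match the counts used above.

```python
# Non-hydrogen atom counts per residue (one-letter codes).
# Only a few residues are listed; this table is illustrative, not exhaustive.
HEAVY_ATOM_COUNTS = {
    "A": 5,  # alanine: N, CA, C, O, CB
    "G": 4,  # glycine: N, CA, C, O
    "S": 6,  # serine: N, CA, C, O, CB, OG
}


def expected_num_atoms(sequence):
    """Sum heavy-atom counts over a protein sequence (hypothetical helper)."""
    unknown = sorted(set(sequence) - set(HEAVY_ATOM_COUNTS))
    if unknown:
        raise KeyError(f"no atom count recorded for residues: {unknown}")
    return sum(HEAVY_ATOM_COUNTS[residue] for residue in sequence)


# 'AG' matches the contrived protein above: 5 + 4 atoms.
print(expected_num_atoms("AG"))
```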

</details>

## Data preparation

<details>

### PDB dataset curation

To curate the AlphaFold 3 PDB dataset, first download all first-assembly and asymmetric unit complexes in the Protein Data Bank (PDB), and then preprocess them with the scripts referenced below. The PDB can be downloaded from the RCSB: https://www.wwpdb.org/ftp/pdb-ftp-sites#rcsbpdb. These Python scripts (i.e., `filter_pdb_{train,val,test}_mmcifs.py` and `cluster_pdb_{train,val,test}_mmcifs.py`) assume you have downloaded the PDB in the **mmCIF file format**, placing its first-assembly and asymmetric unit mmCIF files at `data/pdb_data/unfiltered_assembly_mmcifs/` and `data/pdb_data/unfiltered_asym_mmcifs/`, respectively.

For reproducibility, we recommend downloading the PDB using AWS snapshots (e.g., `20240101`). To do so, refer to [AWS's documentation](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-welcome.html) to set up the AWS CLI locally. Alternatively, on the RCSB website, navigate down to "Download Protocols", and follow the download instructions depending on your location.

For example, one can use the following commands to download the PDB as two collections of mmCIF files:

```bash
# For `assembly1` complexes, use the PDB's `20240101` AWS snapshot:
aws s3 sync s3://pdbsnapshots/20240101/pub/pdb/data/assemblies/mmCIF/divided/ ./data/pdb_data/unfiltered_assembly_mmcifs
# Or as a fallback, use rsync:
rsync -rlpt -v -z --delete --port=33444 \
rsync.rcsb.org::ftp_data/assemblies/mmCIF/divided/ ./data/pdb_data/unfiltered_assembly_mmcifs/

# For asymmetric unit complexes, also use the PDB's `20240101` AWS snapshot:
aws s3 sync s3://pdbsnapshots/20240101/pub/pdb/data/structures/divided/mmCIF/ ./data/pdb_data/unfiltered_asym_mmcifs
# Or as a fallback, use rsync:
rsync -rlpt -v -z --delete --port=33444 \
rsync.rcsb.org::ftp_data/structures/divided/mmCIF/ ./data/pdb_data/unfiltered_asym_mmcifs/
```

> WARNING: Downloading the PDB can take up to 700GB of space.

> NOTE: The PDB hosts all available AWS snapshots here: https://pdbsnapshots.s3.us-west-2.amazonaws.com/index.html.
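
To switch snapshots consistently across both downloads, the S3 source paths can be derived from a single snapshot date. This is a small sketch based only on the two `aws s3 sync` paths shown above; the helper name `snapshot_uri` and the `kind` labels are hypothetical.

```python
SNAPSHOT_BUCKET = "s3://pdbsnapshots"

# Sub-paths copied from the two `aws s3 sync` commands above.
SUBPATHS = {
    "assembly": "assemblies/mmCIF/divided/",
    "asym": "structures/divided/mmCIF/",
}


def snapshot_uri(snapshot, kind):
    """Build the S3 source URI for a snapshot date such as '20240101' (hypothetical helper)."""
    if len(snapshot) != 8 or not snapshot.isdigit():
        raise ValueError(f"expected a YYYYMMDD snapshot date, got {snapshot!r}")
    return f"{SNAPSHOT_BUCKET}/{snapshot}/pub/pdb/data/{SUBPATHS[kind]}"


for kind in ("assembly", "asym"):
    print(snapshot_uri("20240101", kind))
```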

After downloading, you should have two directories whose layouts mirror https://files.rcsb.org/pub/pdb/data/assemblies/mmCIF/divided/ and https://files.rcsb.org/pub/pdb/data/structures/divided/mmCIF/, respectively:

```bash
00/
01/
02/
..
zz/
```

Within both of these directories, decompress all of the files:

```bash
find ./data/pdb_data/unfiltered_assembly_mmcifs/ -type f -name "*.gz" -exec gzip -d {} \;
find ./data/pdb_data/unfiltered_asym_mmcifs/ -type f -name "*.gz" -exec gzip -d {} \;
```
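
On systems where `find` and `gzip` are unavailable, a similar result can be obtained in Python. The following is a sketch using only the standard library; `decompress_gz_tree` is a hypothetical helper, and, like `gzip -d`, it deletes each archive after extracting it.

```python
import gzip
import shutil
from pathlib import Path


def decompress_gz_tree(root):
    """Decompress every `*.gz` file under `root` in place (hypothetical helper).

    Returns the number of files that were decompressed.
    """
    count = 0
    for gz_path in sorted(Path(root).rglob("*.gz")):
        out_path = gz_path.with_suffix("")  # drop the trailing ".gz"
        with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
            shutil.copyfileobj(src, dst)  # stream, so large files stay off the heap
        gz_path.unlink()  # mirror `gzip -d`, which removes the archive
        count += 1
    return count


# Example usage from the project root:
#   decompress_gz_tree("data/pdb_data/unfiltered_assembly_mmcifs")
#   decompress_gz_tree("data/pdb_data/unfiltered_asym_mmcifs")
```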

Next, from the project's root directory, run the following commands to download the latest version of the PDB's Chemical Component Dictionary (CCD) and its structural models:

```bash
wget -P ./data/ccd_data/ https://files.wwpdb.org/pub/pdb/data/monomers/components.cif.gz
wget -P ./data/ccd_data/ https://files.wwpdb.org/pub/pdb/data/component-models/complete/chem_comp_model.cif.gz
```

Then extract each of these files using the following command:

```bash
find data/ccd_data/ -type f -name "*.gz" -exec gzip -d {} \;
```

### PDB dataset filtering

Then run the following with `pdb_assembly_dir`, `pdb_asym_dir`, `ccd_dir`, and `mmcif_output_dir` replaced with the locations of your local copies of the first-assembly PDB, asymmetric unit PDB, CCD, and your desired dataset output directory (i.e., `./data/pdb_data/unfiltered_assembly_mmcifs/`, `./data/pdb_data/unfiltered_asym_mmcifs/`, `./data/ccd_data/`, and `./data/pdb_data/{train,val,test}_mmcifs/`).

```bash
python scripts/filter_pdb_train_mmcifs.py --mmcif_assembly_dir <pdb_assembly_dir> --mmcif_asym_dir <pdb_asym_dir> --ccd_dir <ccd_dir> --output_dir <mmcif_output_dir>
python scripts/filter_pdb_val_mmcifs.py --mmcif_assembly_dir <pdb_assembly_dir> --mmcif_asym_dir <pdb_asym_dir> --output_dir <mmcif_output_dir>
python scripts/filter_pdb_test_mmcifs.py --mmcif_assembly_dir <pdb_assembly_dir> --mmcif_asym_dir <pdb_asym_dir> --output_dir <mmcif_output_dir>
```

See the scripts for more options. Each first-assembly mmCIF that successfully passes
all processing steps will be written to `mmcif_output_dir` within a subdirectory
named according to the mmCIF's second and third PDB ID characters (e.g. `5c`).
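
The subdirectory naming rule can be sketched as follows. This is an illustration of the rule as described above, not the filtering scripts' actual code: the helper name `output_subdir` is hypothetical, lowercasing the ID is an assumption, and only classic four-character PDB IDs are handled.

```python
from pathlib import Path


def output_subdir(output_dir, pdb_id):
    """Return the subdirectory for a PDB ID, keyed by its 2nd and 3rd characters."""
    if len(pdb_id) != 4:
        raise ValueError(f"expected a four-character PDB ID, got {pdb_id!r}")
    # Assumption: directory names are lowercase, as in the PDB's `divided` layout.
    return Path(output_dir) / pdb_id[1:3].lower()


# An ID whose middle characters are "5c" lands in the `5c` subdirectory.
print(output_subdir("data/pdb_data/train_mmcifs", "15c8"))
```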

### PDB dataset clustering

Next, run the following with `mmcif_dir` and `{train,val,test}_clustering_output_dir` replaced, respectively, with your local output directory created using the dataset filtering script above and with your desired clustering output directories (i.e., `./data/pdb_data/{train,val,test}_mmcifs/` and `./data/pdb_data/data_caches/{train,val,test}_clusterings/`):

```bash
python scripts/cluster_pdb_train_mmcifs.py --mmcif_dir <mmcif_dir> --output_dir <train_clustering_output_dir> --clustering_filtered_pdb_dataset
python scripts/cluster_pdb_val_mmcifs.py --mmcif_dir <mmcif_dir> --reference_clustering_dir <train_clustering_output_dir> --output_dir <val_clustering_output_dir> --clustering_filtered_pdb_dataset
python scripts/cluster_pdb_test_mmcifs.py --mmcif_dir <mmcif_dir> --reference_1_clustering_dir <train_clustering_output_dir> --reference_2_clustering_dir <val_clustering_output_dir> --output_dir <test_clustering_output_dir> --clustering_filtered_pdb_dataset
```

**Note**: The `--clustering_filtered_pdb_dataset` flag is recommended when clustering the filtered PDB dataset as curated using the scripts above, as this flag will enable faster runtimes in this context (since filtering leaves each chain's residue IDs 1-based). However, this flag must **not** be provided when clustering other (i.e., non-PDB) datasets of mmCIF files. Otherwise, interface clustering may be performed incorrectly, as these datasets' mmCIF files may not use strict 1-based residue indexing for each chain.
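
Before passing `--clustering_filtered_pdb_dataset` for a custom dataset, one could sanity-check the residue numbering of each chain. The sketch below reads "strict 1-based residue indexing" as residue IDs running 1, 2, ..., N without gaps, which is an interpretation on our part; `is_strictly_one_based` is a hypothetical helper and does not parse mmCIF files itself.

```python
def is_strictly_one_based(residue_ids):
    """Check that residue IDs are exactly 1, 2, ..., N in order (hypothetical helper)."""
    residue_ids = list(residue_ids)
    return residue_ids == list(range(1, len(residue_ids) + 1))


# Residue IDs per chain, as they might be extracted from an mmCIF file.
chains = {
    "A": [1, 2, 3, 4],     # renumbered chain: safe for the fast path
    "B": [27, 28, 29],     # author numbering kept: not 1-based
    "C": [1, 2, 4],        # gap in numbering: not strict
}

flag_is_safe = all(is_strictly_one_based(ids) for ids in chains.values())
print(flag_is_safe)
```

If any chain fails the check, omitting the flag is the conservative choice.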

**Note**: One can instead download preprocessed (i.e., filtered) mmCIF (`train`/`val`/`test`) files (~25GB, comprising 148k complexes) and chain/interface clustering (`train`/`val`/`test`) files (~3GB) for the PDB's `20240101` AWS snapshot via a [shared OneDrive folder](https://mailmissouri-my.sharepoint.com/:f:/g/personal/acmwhb_umsystem_edu/EqU8tjUmmKxJr-FAlq4tzaIBi2TIBtmw5Vl3k_kmgNlepA?e=mzlyv6). Each of these `tar.gz` archives should be decompressed within the `data/pdb_data/` directory, e.g., via `tar -xzf data_caches.tar.gz -C data/pdb_data/`.

**Note**: One can also download and prepare PDB distillation data, using the script `scripts/distillation_data_download.sh` as a reference. Once the data is downloaded, one can run `scripts/reduce_uniprot_predictions_to_pdb.py` to filter this dataset to only those examples associated with at least one PDB entry. For convenience, a [mapping](https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/idmapping.dat.gz) of UniProt accession IDs to PDB IDs for training on PDB distillation data has already been downloaded and extracted as `data/afdb_data/data_caches/uniprot_to_pdb_id_mapping.dat`.
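
For orientation, the idea of keeping only accessions with at least one PDB entry can be sketched on the UniProt ID-mapping format, in which each line holds three tab-separated fields (accession, ID type, external ID). This sketch is not the logic of `scripts/reduce_uniprot_predictions_to_pdb.py`; the helper name is hypothetical, the sample accessions are made up, and the extracted `.dat` file in this repository may be laid out differently.

```python
import io
from collections import defaultdict


def uniprot_to_pdb_ids(lines):
    """Map UniProt accessions to their PDB IDs from idmapping-style lines (hypothetical helper)."""
    mapping = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip malformed or empty lines
        accession, id_type, external_id = fields
        if id_type == "PDB":
            mapping[accession].add(external_id)
    return dict(mapping)


# Made-up sample rows in the three-column layout (not real accessions).
sample = io.StringIO(
    "X00001\tPDB\t1ABC\n"
    "X00001\tGeneID\t12345\n"
    "X00002\tGeneID\t67890\n"
    "X00001\tPDB\t2XYZ\n"
)
mapping = uniprot_to_pdb_ids(sample)

# Accessions absent from the mapping have no associated PDB entry.
print(sorted(mapping))
```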

</details>

## Training

<details>

Train model with default configuration

```bash
# Train on CPU
python alphafold3_pytorch/train.py trainer=cpu

# Train on GPU
python alphafold3_pytorch/train.py trainer=gpu
```

Train model with chosen experiment configuration from [configs/experiment/](configs/experiment/)

```bash
# e.g., Train an initial set of weights
python alphafold3_pytorch/train.py experiment=alphafold3_initial_training.yaml
```

You can override any parameter from the command line like this:

```bash
python alphafold3_pytorch/train.py trainer.max_steps=1000000 data.batch_size=128
```
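
Conceptually, each `key.subkey=value` override replaces one leaf in the nested configuration. The sketch below illustrates that idea on plain dictionaries; it is not Hydra's implementation (Hydra also parses value types, supports list and group overrides, and more), and `apply_overrides` is a hypothetical helper that keeps values as strings.

```python
def apply_overrides(config, overrides):
    """Apply dotted `key=value` overrides to a nested dict in place (hypothetical helper)."""
    for item in overrides:
        dotted_key, _, raw_value = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = config
        for name in parents:
            node = node.setdefault(name, {})  # create intermediate levels as needed
        node[leaf] = raw_value  # values stay strings in this sketch
    return config


config = {"trainer": {"max_steps": "100"}, "data": {"batch_size": "1"}}
apply_overrides(config, ["trainer.max_steps=1000000", "data.batch_size=128"])
print(config)
```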

</details>

## Evaluation

<details>

Evaluate a trained set of weights on held-out test data

```bash
# e.g., Evaluate on GPU
python alphafold3_pytorch/eval.py trainer=gpu
```

</details>

## For developers

<details>

### Contributing

At the project root, run

```bash
bash contributing.sh
```

Then, add your module to `alphafold3_pytorch/models/components/alphafold3.py`, add your tests to `tests/test_alphafold3.py`, and submit a pull request. You can run the tests locally with

```bash
pytest tests/
```

### Dependency management

We use `pip` and `docker` to manage the project's underlying dependencies. Notably, to update the dependencies built by the project's `Dockerfile`, first edit the contents of the `dependencies` list in `pyproject.toml`, and then rebuild the project's `docker` image:

```bash
docker stop <container_id> # First stop any running `af3` container(s)
docker rm <container_id> # Then remove the container(s) - Caution: Make sure to push your local changes to GitHub before running this!
docker build -t af3 . # Rebuild the Docker image
docker run -v .:/data --gpus all -it af3 # Lastly, (re)start the Docker container from the updated image
```

If you want to update the project's `pip` dependencies only, you can simply push to GitHub your changes to the `pyproject.toml` file.

### Code formatting

We use `pre-commit` to automatically format the project's code. To set up `pre-commit` (one time only) for automatic code linting and formatting upon each execution of `git commit`:

```bash
pre-commit install
```

To manually reformat all files in the project as desired:

```bash
pre-commit run -a
```

Refer to [pre-commit's documentation](https://pre-commit.com/) for more details.

</details>

## Citations

```bibtex
@article{Abramson2024-fj,
  title    = "Accurate structure prediction of biomolecular interactions with
              {AlphaFold} 3",
  author   = "Abramson, Josh and Adler, Jonas and Dunger, Jack and Evans,
              Richard and Green, Tim and Pritzel, Alexander and Ronneberger,
              Olaf and Willmore, Lindsay and Ballard, Andrew J and Bambrick,
              Joshua and Bodenstein, Sebastian W and Evans, David A and Hung,
              Chia-Chun and O'Neill, Michael and Reiman, David and
              Tunyasuvunakool, Kathryn and Wu, Zachary and {\v Z}emgulyt{\.e},
              Akvil{\.e} and Arvaniti, Eirini and Beattie, Charles and
              Bertolli, Ottavia and Bridgland, Alex and Cherepanov, Alexey and
              Congreve, Miles and Cowen-Rivers, Alexander I and Cowie, Andrew
              and Figurnov, Michael and Fuchs, Fabian B and Gladman, Hannah and
              Jain, Rishub and Khan, Yousuf A and Low, Caroline M R and Perlin,
              Kuba and Potapenko, Anna and Savy, Pascal and Singh, Sukhdeep and
              Stecula, Adrian and Thillaisundaram, Ashok and Tong, Catherine
              and Yakneen, Sergei and Zhong, Ellen D and Zielinski, Michal and
              {\v Z}{\'\i}dek, Augustin and Bapst, Victor and Kohli, Pushmeet
              and Jaderberg, Max and Hassabis, Demis and Jumper, John M",
  journal  = "Nature",
  month    = "May",
  year     =  2024
}
```

```bibtex
@inproceedings{Darcet2023VisionTN,
    title   = {Vision Transformers Need Registers},
    author  = {Timoth{\'e}e Darcet and Maxime Oquab and Julien Mairal and Piotr Bojanowski},
    year    = {2023},
    url     = {https://api.semanticscholar.org/CorpusID:263134283}
}
```

```bibtex
@article{Arora2024SimpleLA,
    title   = {Simple linear attention language models balance the recall-throughput tradeoff},
    author  = {Simran Arora and Sabri Eyuboglu and Michael Zhang and Aman Timalsina and Silas Alberti and Dylan Zinsley and James Zou and Atri Rudra and Christopher R{\'e}},
    journal = {ArXiv},
    year    = {2024},
    volume  = {abs/2402.18668},
    url     = {https://api.semanticscholar.org/CorpusID:268063190}
}
```

```bibtex
@article{Puny2021FrameAF,
    title   = {Frame Averaging for Invariant and Equivariant Network Design},
    author  = {Omri Puny and Matan Atzmon and Heli Ben-Hamu and Edward James Smith and Ishan Misra and Aditya Grover and Yaron Lipman},
    journal = {ArXiv},
    year    = {2021},
    volume  = {abs/2110.03336},
    url     = {https://api.semanticscholar.org/CorpusID:238419638}
}
```

```bibtex
@article{Duval2023FAENetFA,
    title   = {FAENet: Frame Averaging Equivariant GNN for Materials Modeling},
    author  = {Alexandre Duval and Victor Schmidt and Alex Hernandez Garcia and Santiago Miret and Fragkiskos D. Malliaros and Yoshua Bengio and David Rolnick},
    journal = {ArXiv},
    year    = {2023},
    volume  = {abs/2305.05577},
    url     = {https://api.semanticscholar.org/CorpusID:258564608}
}
```

```bibtex
@article{Wang2022DeepNetST,
    title   = {DeepNet: Scaling Transformers to 1,000 Layers},
    author  = {Hongyu Wang and Shuming Ma and Li Dong and Shaohan Huang and Dongdong Zhang and Furu Wei},
    journal = {ArXiv},
    year    = {2022},
    volume  = {abs/2203.00555},
    url     = {https://api.semanticscholar.org/CorpusID:247187905}
}
```

```bibtex
@inproceedings{Ainslie2023CoLT5FL,
    title   = {CoLT5: Faster Long-Range Transformers with Conditional Computation},
    author  = {Joshua Ainslie and Tao Lei and Michiel de Jong and Santiago Onta{\~n}{\'o}n and Siddhartha Brahma and Yury Zemlyanskiy and David Uthus and Mandy Guo and James Lee-Thorp and Yi Tay and Yun-Hsuan Sung and Sumit Sanghai},
    year    = {2023}
}
```

```bibtex
@article{Ash2019OnTD,
    title   = {On the Difficulty of Warm-Starting Neural Network Training},
    author  = {Jordan T. Ash and Ryan P. Adams},
    journal = {ArXiv},
    year    = {2019},
    volume  = {abs/1910.08475},
    url     = {https://api.semanticscholar.org/CorpusID:204788802}
}
```

```bibtex
@ARTICLE{Heinzinger2023.07.23.550085,
    author  = {Michael Heinzinger and Konstantin Weissenow and Joaquin Gomez Sanchez and Adrian Henkel and Martin Steinegger and Burkhard Rost},
    title   = {ProstT5: Bilingual Language Model for Protein Sequence and Structure},
    year    = {2023},
    doi     = {10.1101/2023.07.23.550085},
    journal = {bioRxiv}
}
```

```bibtex
@article {Lin2022.07.20.500902,
    author  = {Lin, Zeming and Akin, Halil and Rao, Roshan and Hie, Brian and Zhu, Zhongkai and Lu, Wenting and Santos Costa, Allan dos and Fazel-Zarandi, Maryam and Sercu, Tom and Candido, Sal and Rives, Alexander},
    title   = {Language models of protein sequences at the scale of evolution enable accurate structure prediction},
    elocation-id = {2022.07.20.500902},
    year    = {2022},
    doi     = {10.1101/2022.07.20.500902},
    publisher = {Cold Spring Harbor Laboratory},
    URL     = {https://www.biorxiv.org/content/early/2022/07/21/2022.07.20.500902},
    eprint  = {https://www.biorxiv.org/content/early/2022/07/21/2022.07.20.500902.full.pdf},
    journal = {bioRxiv}
}
```

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "alphafold3-pytorch-lightning-hydra",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": null,
    "keywords": "artificial intelligence, deep learning, protein structure prediction",
    "author": null,
    "author_email": "Phil Wang <lucidrains@gmail.com>, Alex Morehead <alex.morehead@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/24/5c/1fa17f6be91a111c9e043bfcfeaed4b5886ab9ab6730a1af00953fd4505f/alphafold3_pytorch_lightning_hydra-0.5.31.tar.gz",
    "platform": null
}
.\n```\n\nThen, run the container with GPUs and mount a local volume (for training) using the following command:\n\n```bash\n## Run Container\ndocker run -v .:/data --gpus all -it alphafold3-pytorch-lightning-hydra\n```\n\n**NOTE:** An AMD ROCm version of the Docker image can alternatively be built using `ROCm_Dockerfile`.\n\n</details>\n\n## Usage\n\n<details>\n\n```python\nimport torch\nfrom alphafold3_pytorch import Alphafold3\nfrom alphafold3_pytorch.utils.model_utils import exclusive_cumsum\n\nalphafold3 = Alphafold3(\n    dim_atom_inputs = 77,\n    dim_template_feats = 108\n)\n\n# Mock inputs\n\nseq_len = 16\n\nmolecule_atom_indices = torch.randint(0, 2, (2, seq_len)).long()\nmolecule_atom_lens = torch.full((2, seq_len), 2).long()\n\natom_seq_len = molecule_atom_lens.sum(dim=-1).amax()\natom_offsets = exclusive_cumsum(molecule_atom_lens)\n\natom_inputs = torch.randn(2, atom_seq_len, 77)\natompair_inputs = torch.randn(2, atom_seq_len, atom_seq_len, 5)\n\nadditional_molecule_feats = torch.randint(0, 2, (2, seq_len, 5))\nadditional_token_feats = torch.randn(2, seq_len, 33)\nis_molecule_types = torch.randint(0, 2, (2, seq_len, 5)).bool()\nis_molecule_mod = torch.randint(0, 2, (2, seq_len, 4)).bool()\nmolecule_ids = torch.randint(0, 32, (2, seq_len))\n\ntemplate_feats = torch.randn(2, 2, seq_len, seq_len, 108)\ntemplate_mask = torch.ones((2, 2)).bool()\n\nmsa = torch.randn(2, 7, seq_len, 32)\nmsa_mask = torch.ones((2, 7)).bool()\n\nadditional_msa_feats = torch.randn(2, 7, seq_len, 2)\n\n# Required for training, but omitted on inference\n\natom_pos = torch.randn(2, atom_seq_len, 3)\n\ndistogram_atom_indices = molecule_atom_lens - 1\n\ndistance_labels = torch.randint(0, 37, (2, seq_len, seq_len))\nresolved_labels = torch.randint(0, 2, (2, atom_seq_len))\n\n# Offset indices correctly\n\ndistogram_atom_indices += atom_offsets\nmolecule_atom_indices += atom_offsets\n\n# Train\n\nloss = alphafold3(\n    num_recycling_steps = 2,\n    atom_inputs = atom_inputs,\n    
atompair_inputs = atompair_inputs,\n    molecule_ids = molecule_ids,\n    molecule_atom_lens = molecule_atom_lens,\n    additional_molecule_feats = additional_molecule_feats,\n    additional_msa_feats = additional_msa_feats,\n    additional_token_feats = additional_token_feats,\n    is_molecule_types = is_molecule_types,\n    is_molecule_mod = is_molecule_mod,\n    msa = msa,\n    msa_mask = msa_mask,\n    templates = template_feats,\n    template_mask = template_mask,\n    atom_pos = atom_pos,\n    distogram_atom_indices = distogram_atom_indices,\n    molecule_atom_indices = molecule_atom_indices,\n    distance_labels = distance_labels,\n    resolved_labels = resolved_labels\n)\n\nloss.backward()\n\n# After much training ...\n\nsampled_atom_pos = alphafold3(\n    num_recycling_steps = 4,\n    num_sample_steps = 16,\n    atom_inputs = atom_inputs,\n    atompair_inputs = atompair_inputs,\n    molecule_ids = molecule_ids,\n    molecule_atom_lens = molecule_atom_lens,\n    additional_molecule_feats = additional_molecule_feats,\n    additional_msa_feats = additional_msa_feats,\n    additional_token_feats = additional_token_feats,\n    is_molecule_types = is_molecule_types,\n    is_molecule_mod = is_molecule_mod,\n    msa = msa,\n    msa_mask = msa_mask,\n    templates = template_feats,\n    template_mask = template_mask\n)\n\nsampled_atom_pos.shape # (2, <atom_seqlen>, 3)\n```\n\n</details>\n\n<details>\n\nAn example with molecule level input handling\n\n```python\nimport torch\nfrom alphafold3_pytorch import Alphafold3, Alphafold3Input\n\ncontrived_protein = 'AG'\n\nmock_atompos = [\n    torch.randn(5, 3),   # alanine has 5 non-hydrogen atoms\n    torch.randn(4, 3)    # glycine has 4 non-hydrogen atoms\n]\n\ntrain_alphafold3_input = Alphafold3Input(\n    proteins = [contrived_protein],\n    missing_atom_indices = [[1, 2], None],\n    atom_pos = mock_atompos\n)\n\neval_alphafold3_input = Alphafold3Input(\n    proteins = [contrived_protein]\n)\n\n# 
training\n\nalphafold3 = Alphafold3(\n    dim_atom_inputs = 3,\n    dim_atompair_inputs = 5,\n    atoms_per_window = 27,\n    dim_template_feats = 108,\n    num_molecule_mods = 0,\n    confidence_head_kwargs = dict(\n        pairformer_depth = 1\n    ),\n    template_embedder_kwargs = dict(\n        pairformer_stack_depth = 1\n    ),\n    msa_module_kwargs = dict(\n        depth = 1\n    ),\n    pairformer_stack = dict(\n        depth = 2\n    ),\n    diffusion_module_kwargs = dict(\n        atom_encoder_depth = 1,\n        token_transformer_depth = 1,\n        atom_decoder_depth = 1,\n    )\n)\n\nloss = alphafold3.forward_with_alphafold3_inputs([train_alphafold3_input])\nloss.backward()\n\n# sampling\n\nalphafold3.eval()\nsampled_atom_pos = alphafold3.forward_with_alphafold3_inputs(eval_alphafold3_input)\n\nassert sampled_atom_pos.shape == (1, (5 + 4), 3)\n```\n\n</details>\n\n## Data preparation\n\n<details>\n\n### PDB dataset curation\n\nTo acquire the AlphaFold 3 PDB dataset, first download all first-assembly (and asymmetric unit) complexes in the Protein Data Bank (PDB), and then preprocess them with the script referenced below. The PDB can be downloaded from the RCSB: https://www.wwpdb.org/ftp/pdb-ftp-sites#rcsbpdb. The two Python scripts below (i.e., `filter_pdb_{train,val,test}_mmcifs.py` and `cluster_pdb_{train,val,test}_mmcifs.py`) assume you have downloaded the PDB in the **mmCIF file format**, placing its first-assembly and asymmetric unit mmCIF files at `data/pdb_data/unfiltered_assembly_mmcifs/` and `data/pdb_data/unfiltered_asym_mmcifs/`, respectively.\n\nFor reproducibility, we recommend downloading the PDB using AWS snapshots (e.g., `20240101`). To do so, refer to [AWS's documentation](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-welcome.html) to set up the AWS CLI locally. 
Alternatively, on the RCSB website, navigate down to \"Download Protocols\", and follow the download instructions depending on your location.\n\nFor example, one can use the following commands to download the PDB as two collections of mmCIF files:\n\n```bash\n# For `assembly1` complexes, use the PDB's `20240101` AWS snapshot:\naws s3 sync s3://pdbsnapshots/20240101/pub/pdb/data/assemblies/mmCIF/divided/ ./data/pdb_data/unfiltered_assembly_mmcifs\n# Or as a fallback, use rsync:\nrsync -rlpt -v -z --delete --port=33444 \\\nrsync.rcsb.org::ftp_data/assemblies/mmCIF/divided/ ./data/pdb_data/unfiltered_assembly_mmcifs/\n\n# For asymmetric unit complexes, also use the PDB's `20240101` AWS snapshot:\naws s3 sync s3://pdbsnapshots/20240101/pub/pdb/data/structures/divided/mmCIF/ ./data/pdb_data/unfiltered_asym_mmcifs\n# Or as a fallback, use rsync:\nrsync -rlpt -v -z --delete --port=33444 \\\nrsync.rcsb.org::ftp_data/structures/divided/mmCIF/ ./data/pdb_data/unfiltered_asym_mmcifs/\n```\n\n> WARNING: Downloading the PDB can take up to 700GB of space.\n\n> NOTE: The PDB hosts all available AWS snapshots here: https://pdbsnapshots.s3.us-west-2.amazonaws.com/index.html.\n\nAfter downloading, you should have two directories formatted like this:\nhttps://files.rcsb.org/pub/pdb/data/assemblies/mmCIF/divided/ & https://files.rcsb.org/pub/pdb/data/structures/divided/mmCIF/\n\n```bash\n00/\n01/\n02/\n..\nzz/\n```\n\nFor these directories, unzip all the files:\n\n```bash\nfind ./data/pdb_data/unfiltered_assembly_mmcifs/ -type f -name \"*.gz\" -exec gzip -d {} \\;\nfind ./data/pdb_data/unfiltered_asym_mmcifs/ -type f -name \"*.gz\" -exec gzip -d {} \\;\n```\n\nNext run the commands\n\n```bash\nwget -P ./data/ccd_data/ https://files.wwpdb.org/pub/pdb/data/monomers/components.cif.gz\nwget -P ./data/ccd_data/ https://files.wwpdb.org/pub/pdb/data/component-models/complete/chem_comp_model.cif.gz\n```\n\nfrom the project's root directory to download the latest version of the PDB's Chemical 
Component Dictionary (CCD) and its structural models. Extract each of these files using the following command:\n\n```bash\nfind data/ccd_data/ -type f -name \"*.gz\" -exec gzip -d {} \\;\n```\n\n### PDB dataset filtering\n\nThen run the following with `pdb_assembly_dir`, `pdb_asym_dir`, `ccd_dir`, and `mmcif_output_dir` replaced with the locations of your local copies of the first-assembly PDB, asymmetric unit PDB, CCD, and your desired dataset output directory (i.e., `./data/pdb_data/unfiltered_assembly_mmcifs/`, `./data/pdb_data/unfiltered_asym_mmcifs/`, `./data/ccd_data/`, and `./data/pdb_data/{train,val,test}_mmcifs/`).\n\n```bash\npython scripts/filter_pdb_train_mmcifs.py --mmcif_assembly_dir <pdb_assembly_dir> --mmcif_asym_dir <pdb_asym_dir> --ccd_dir <ccd_dir> --output_dir <mmcif_output_dir>\npython scripts/filter_pdb_val_mmcifs.py --mmcif_assembly_dir <pdb_assembly_dir> --mmcif_asym_dir <pdb_asym_dir> --output_dir <mmcif_output_dir>\npython scripts/filter_pdb_test_mmcifs.py --mmcif_assembly_dir <pdb_assembly_dir> --mmcif_asym_dir <pdb_asym_dir> --output_dir <mmcif_output_dir>\n```\n\nSee the scripts for more options. Each first-assembly mmCIF that successfully passes\nall processing steps will be written to `mmcif_output_dir` within a subdirectory\nnamed according to the mmCIF's second and third PDB ID characters (e.g. 
`5c`).\n\n### PDB dataset clustering\n\nNext, run the following with `mmcif_dir` and `{train,val,test}_clustering_output_dir` replaced, respectively, with your local output directory created using the dataset filtering script above and with your desired clustering output directories (i.e., `./data/pdb_data/{train,val,test}_mmcifs/` and `./data/pdb_data/data_caches/{train,val,test}_clusterings/`):\n\n```bash\npython scripts/cluster_pdb_train_mmcifs.py --mmcif_dir <mmcif_dir> --output_dir <train_clustering_output_dir> --clustering_filtered_pdb_dataset\npython scripts/cluster_pdb_val_mmcifs.py --mmcif_dir <mmcif_dir> --reference_clustering_dir <train_clustering_output_dir> --output_dir <val_clustering_output_dir> --clustering_filtered_pdb_dataset\npython scripts/cluster_pdb_test_mmcifs.py --mmcif_dir <mmcif_dir> --reference_1_clustering_dir <train_clustering_output_dir> --reference_2_clustering_dir <val_clustering_output_dir> --output_dir <test_clustering_output_dir> --clustering_filtered_pdb_dataset\n```\n\n**Note**: The `--clustering_filtered_pdb_dataset` flag is recommended when clustering the filtered PDB dataset as curated using the scripts above, as this flag will enable faster runtimes in this context (since filtering leaves each chain's residue IDs 1-based). However, this flag must **not** be provided when clustering other (i.e., non-PDB) datasets of mmCIF files. Otherwise, interface clustering may be performed incorrectly, as these datasets' mmCIF files may not use strict 1-based residue indexing for each chain.\n\n**Note**: One can instead download preprocessed (i.e., filtered) mmCIF (`train`/`val`/`test`) files (~25GB, comprising 148k complexes) and chain/interface clustering (`train`/`val`/`test`) files (~3GB) for the PDB's `20240101` AWS snapshot via a [shared OneDrive folder](https://mailmissouri-my.sharepoint.com/:f:/g/personal/acmwhb_umsystem_edu/EqU8tjUmmKxJr-FAlq4tzaIBi2TIBtmw5Vl3k_kmgNlepA?e=mzlyv6). 
Each of these `tar.gz` archives should be decompressed within the `data/pdb_data/` directory e.g., via `tar -xzf data_caches.tar.gz -C data/pdb_data/`. One can also download and prepare PDB distillation data using as a reference the script `scripts/distillation_data_download.sh`. Once downloaded, one can run `scripts/reduce_uniprot_predictions_to_pdb.py` to filter this dataset to only examples associated with at least one PDB entry. Moreover, for convenience, a [mapping](https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/idmapping.dat.gz) of UniProt accession IDs to PDB IDs for training on PDB distillation data has already been downloaded and extracted as `data/afdb_data/data_caches/uniprot_to_pdb_id_mapping.dat`.\n\n</details>\n\n## Training\n\n<details>\n\nTrain model with default configuration\n\n```bash\n# Train on CPU\npython alphafold3_pytorch/train.py trainer=cpu\n\n# Train on GPU\npython alphafold3_pytorch/train.py trainer=gpu\n```\n\nTrain model with chosen experiment configuration from [configs/experiment/](configs/experiment/)\n\n```bash\n# e.g., Train an initial set of weights\npython alphafold3_pytorch/train.py experiment=alphafold3_initial_training.yaml\n```\n\nYou can override any parameter from command line like this\n\n```bash\npython alphafold3_pytorch/train.py trainer.max_steps=1e6 data.batch_size=128\n```\n\n</details>\n\n## Evaluation\n\n<details>\n\nEvaluate a trained set of weights on held-out test data\n\n```bash\n# e.g., Evaluate on GPU\npython alphafold3_pytorch/eval.py trainer=gpu\n```\n\n</details>\n\n## For developers\n\n<details>\n\n### Contributing\n\nAt the project root, run\n\n```bash\nbash contributing.sh\n```\n\nThen, add your module to `alphafold3_pytorch/models/components/alphafold3.py`, add your tests to `tests/test_alphafold3.py`, and submit a pull request. 
You can run the tests locally with\n\n```bash\npytest tests/\n```\n\n### Dependency management\n\nWe use `pip` and `docker` to manage the project's underlying dependencies. Notably, to update the dependencies built by the project's `Dockerfile`, first edit the contents of the `dependencies` list in `pyproject.toml`, and then rebuild the project's `docker` image:\n\n```bash\ndocker stop <container_id> # First stop any running `af3` container(s)\ndocker rm <container_id> # Then remove the container(s) - Caution: Make sure to push your local changes to GitHub before running this!\ndocker build -t af3 . # Rebuild the Docker image\ndocker run -v .:/data --gpus all -it af3 # Lastly, (re)start the Docker container from the updated image\n```\n\nIf you want to update the project's `pip` dependencies only, you can simply push to GitHub your changes to the `pyproject.toml` file.\n\n### Code formatting\n\nWe use `pre-commit` to automatically format the project's code. To set up `pre-commit` (one time only) for automatic code linting and formatting upon each execution of `git commit`:\n\n```bash\npre-commit install\n```\n\nTo manually reformat all files in the project as desired:\n\n```bash\npre-commit run -a\n```\n\nRefer to [pre-commit's documentation](https://pre-commit.com/) for more details.\n\n</details>\n\n## Citations\n\n```bibtex\n@article{Abramson2024-fj,\n  title    = \"Accurate structure prediction of biomolecular interactions with\n              {AlphaFold} 3\",\n  author   = \"Abramson, Josh and Adler, Jonas and Dunger, Jack and Evans,\n              Richard and Green, Tim and Pritzel, Alexander and Ronneberger,\n              Olaf and Willmore, Lindsay and Ballard, Andrew J and Bambrick,\n              Joshua and Bodenstein, Sebastian W and Evans, David A and Hung,\n              Chia-Chun and O'Neill, Michael and Reiman, David and\n              Tunyasuvunakool, Kathryn and Wu, Zachary and {\\v Z}emgulyt{\\.e},\n              Akvil{\\.e} and Arvaniti, Eirini 
and Beattie, Charles and\n              Bertolli, Ottavia and Bridgland, Alex and Cherepanov, Alexey and\n              Congreve, Miles and Cowen-Rivers, Alexander I and Cowie, Andrew\n              and Figurnov, Michael and Fuchs, Fabian B and Gladman, Hannah and\n              Jain, Rishub and Khan, Yousuf A and Low, Caroline M R and Perlin,\n              Kuba and Potapenko, Anna and Savy, Pascal and Singh, Sukhdeep and\n              Stecula, Adrian and Thillaisundaram, Ashok and Tong, Catherine\n              and Yakneen, Sergei and Zhong, Ellen D and Zielinski, Michal and\n              {\\v Z}{\\'\\i}dek, Augustin and Bapst, Victor and Kohli, Pushmeet\n              and Jaderberg, Max and Hassabis, Demis and Jumper, John M\",\n  journal  = \"Nature\",\n  month    = \"May\",\n  year     =  2024\n}\n```\n\n```bibtex\n@inproceedings{Darcet2023VisionTN,\n    title   = {Vision Transformers Need Registers},\n    author  = {Timoth{\\'e}e Darcet and Maxime Oquab and Julien Mairal and Piotr Bojanowski},\n    year    = {2023},\n    url     = {https://api.semanticscholar.org/CorpusID:263134283}\n}\n```\n\n```bibtex\n@article{Arora2024SimpleLA,\n    title   = {Simple linear attention language models balance the recall-throughput tradeoff},\n    author  = {Simran Arora and Sabri Eyuboglu and Michael Zhang and Aman Timalsina and Silas Alberti and Dylan Zinsley and James Zou and Atri Rudra and Christopher R{\\'e}},\n    journal = {ArXiv},\n    year    = {2024},\n    volume  = {abs/2402.18668},\n    url     = {https://api.semanticscholar.org/CorpusID:268063190}\n}\n```\n\n```bibtex\n@article{Puny2021FrameAF,\n    title   = {Frame Averaging for Invariant and Equivariant Network Design},\n    author  = {Omri Puny and Matan Atzmon and Heli Ben-Hamu and Edward James Smith and Ishan Misra and Aditya Grover and Yaron Lipman},\n    journal = {ArXiv},\n    year    = {2021},\n    volume  = {abs/2110.03336},\n    url     = 
{https://api.semanticscholar.org/CorpusID:238419638}\n}\n```\n\n```bibtex\n@article{Duval2023FAENetFA,\n    title   = {FAENet: Frame Averaging Equivariant GNN for Materials Modeling},\n    author  = {Alexandre Duval and Victor Schmidt and Alex Hernandez Garcia and Santiago Miret and Fragkiskos D. Malliaros and Yoshua Bengio and David Rolnick},\n    journal = {ArXiv},\n    year    = {2023},\n    volume  = {abs/2305.05577},\n    url     = {https://api.semanticscholar.org/CorpusID:258564608}\n}\n```\n\n```bibtex\n@article{Wang2022DeepNetST,\n    title   = {DeepNet: Scaling Transformers to 1,000 Layers},\n    author  = {Hongyu Wang and Shuming Ma and Li Dong and Shaohan Huang and Dongdong Zhang and Furu Wei},\n    journal = {ArXiv},\n    year    = {2022},\n    volume  = {abs/2203.00555},\n    url     = {https://api.semanticscholar.org/CorpusID:247187905}\n}\n```\n\n```bibtex\n@inproceedings{Ainslie2023CoLT5FL,\n    title   = {CoLT5: Faster Long-Range Transformers with Conditional Computation},\n    author  = {Joshua Ainslie and Tao Lei and Michiel de Jong and Santiago Onta{\\~n}{\\'o}n and Siddhartha Brahma and Yury Zemlyanskiy and David Uthus and Mandy Guo and James Lee-Thorp and Yi Tay and Yun-Hsuan Sung and Sumit Sanghai},\n    year    = {2023}\n}\n```\n\n```bibtex\n@article{Ash2019OnTD,\n    title   = {On the Difficulty of Warm-Starting Neural Network Training},\n    author  = {Jordan T. Ash and Ryan P. 
Adams},\n    journal = {ArXiv},\n    year    = {2019},\n    volume  = {abs/1910.08475},\n    url     = {https://api.semanticscholar.org/CorpusID:204788802}\n}\n```\n\n```bibtex\n@ARTICLE{Heinzinger2023.07.23.550085,\n    author  = {Michael Heinzinger and Konstantin Weissenow and Joaquin Gomez Sanchez and Adrian Henkel and Martin Steinegger and Burkhard Rost},\n    title   = {ProstT5: Bilingual Language Model for Protein Sequence and Structure},\n    year    = {2023},\n    doi     = {10.1101/2023.07.23.550085},\n    journal = {bioRxiv}\n}\n```\n\n```bibtex\n@article {Lin2022.07.20.500902,\n    author  = {Lin, Zeming and Akin, Halil and Rao, Roshan and Hie, Brian and Zhu, Zhongkai and Lu, Wenting and Santos Costa, Allan dos and Fazel-Zarandi, Maryam and Sercu, Tom and Candido, Sal and Rives, Alexander},\n    title   = {Language models of protein sequences at the scale of evolution enable accurate structure prediction},\n    elocation-id = {2022.07.20.500902},\n    year    = {2022},\n    doi     = {10.1101/2022.07.20.500902},\n    publisher = {Cold Spring Harbor Laboratory},\n    URL     = {https://www.biorxiv.org/content/early/2022/07/21/2022.07.20.500902},\n    eprint  = {https://www.biorxiv.org/content/early/2022/07/21/2022.07.20.500902.full.pdf},\n    journal = {bioRxiv}\n}\n```\n",
    "bugtrack_url": null,
    "license": "MIT License  Copyright (c) 2024 Phil Wang  Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:  The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.  THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.",
    "summary": "AlphaFold 3 - Pytorch",
    "version": "0.5.31",
    "project_urls": {
        "Homepage": "https://pypi.org/project/alphafold3-pytorch-lightning-hydra/",
        "Repository": "https://github.com/amorehead/alphafold3-pytorch-lightning-hydra"
    },
    "split_keywords": [
        "artificial intelligence",
        " deep learning",
        " protein structure prediction"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "e74dea496b65dc0090670365efad2c579702eae2b7bcab9ff208edb54637e689",
                "md5": "ad7c83d0ef0eb1f198516a3341ed382d",
                "sha256": "11deea1fddd16d389d609de4b39f00fcc2eedd54d97e62ce42150260ddc00de5"
            },
            "downloads": -1,
            "filename": "alphafold3_pytorch_lightning_hydra-0.5.31-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "ad7c83d0ef0eb1f198516a3341ed382d",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10",
            "size": 299570,
            "upload_time": "2024-09-21T20:05:15",
            "upload_time_iso_8601": "2024-09-21T20:05:15.439783Z",
            "url": "https://files.pythonhosted.org/packages/e7/4d/ea496b65dc0090670365efad2c579702eae2b7bcab9ff208edb54637e689/alphafold3_pytorch_lightning_hydra-0.5.31-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "245c1fa17f6be91a111c9e043bfcfeaed4b5886ab9ab6730a1af00953fd4505f",
                "md5": "9e66c6286d47ba04753c5b8cad43915a",
                "sha256": "9a6098fb7c182cb1def769577951972e34eab6bc27931a4c729f8faad503691b"
            },
            "downloads": -1,
            "filename": "alphafold3_pytorch_lightning_hydra-0.5.31.tar.gz",
            "has_sig": false,
            "md5_digest": "9e66c6286d47ba04753c5b8cad43915a",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 13293823,
            "upload_time": "2024-09-21T20:05:18",
            "upload_time_iso_8601": "2024-09-21T20:05:18.642228Z",
            "url": "https://files.pythonhosted.org/packages/24/5c/1fa17f6be91a111c9e043bfcfeaed4b5886ab9ab6730a1af00953fd4505f/alphafold3_pytorch_lightning_hydra-0.5.31.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-09-21 20:05:18",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "amorehead",
    "github_project": "alphafold3-pytorch-lightning-hydra",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "alphafold3-pytorch-lightning-hydra"
}
