# CESPED: Utilities for the Cryo-EM Supervised Pose Estimation Dataset
CESPED is a new dataset specifically designed for Supervised Pose Estimation in Cryo-EM. You can check our manuscript at https://arxiv.org/abs/2311.06194.
## Installation
cesped has been tested on Python 3.11. It can be installed with pip:
```
pip install cesped
#Or directly from the master branch
pip install git+https://github.com/rsanchezgarc/cesped
```
or by cloning the repository:
```
git clone https://github.com/rsanchezgarc/cesped
cd cesped
pip install .
```
## Basic usage
### ParticlesDataset class
The `ParticlesDataset` class is used to load the images and poses.
1. Get the list of downloadable entries
```
from cesped.particlesDataset import ParticlesDataset
listOfEntries = ParticlesDataset.getCESPEDEntries()
```
2. Load a given entry
```
targetName, halfset = listOfEntries[0] #We will work with the first entry only
dataset = ParticlesDataset(targetName, halfset)
```
For a rapid test, use `targetName="TEST"` and `halfset=0`. If the dataset is not yet available in the benchmarkDir (defined in [defaultDataConfig.yaml](cesped%2Fconfigs%2FdefaultDataConfig.yaml)),
it will be downloaded automatically. Metadata (Euler angles, CTF, ...) is stored in the Relion star file format, and images are stored as .mrcs stacks.
3. Use it as a regular dataset
```
from torch.utils.data import DataLoader

dl = DataLoader(dataset, batch_size=32)
for batch in dl:
    iid, img, (rotMat, xyShiftAngs, confidence), metadata = batch
    #iid is the list of ids of the particles (string)
    #img is a batch of Bx1xNxN images
    #rotMat is a batch of rotation matrices Bx3x3
    #xyShiftAngs is a batch of image shifts in Angstroms Bx2
    #confidence is a batch of numbers, between 0 and 1, Bx1
    #metadata is a dictionary of names:values for all the information about the particle
    #YOUR PYTORCH CODE HERE (model, loss_function and optimizer are yours to define)
    predRot = model(img)
    loss = loss_function(predRot, rotMat)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```
4. Once your model is trained, you can update the metadata of the ParticlesDataset and save it so that it can be used in cryo-EM software
```
import torch

#particlesDataset is the ParticlesDataset instance (called dataset in step 2)
for iid, pred_rotmats, maxprob in predictions:
    #iid is the list of ids of the particles (string)
    #pred_rotmats is a batch of predicted rotation matrices Bx3x3
    #maxprob is a batch of numbers, between 0 and 1, Bx1, that indicates the confidence in the prediction (e.g. softmax values)
    particlesDataset.updateMd(ids=iid, angles=pred_rotmats,
                              shifts=torch.zeros(pred_rotmats.shape[0], 2, device=pred_rotmats.device), #Or actual predictions if you have them
                              confidence=maxprob,
                              angles_format="rotmat")
particlesDataset.saveMd(outFname) #Save the metadata as a star file, a common cryo-EM format
```
5. Finally, once the predictions for halfset 0 and halfset 1 have been saved, they can be evaluated with the evaluateEntry script.
```
python -m cesped.evaluateEntry --predictionType SO3 --targetName 11120 \
--half0PredsFname particles_preds_0.star --half1PredsFname particles_preds_1.star \
--n_cpus 12 --outdir evaluation/
```
evaluateEntry uses [Relion](https://relion.readthedocs.io/) for reconstruction, so you will need to install it and
either edit the config file [defaultRelionConfig.yaml](cesped%2Fconfigs%2FdefaultRelionConfig.yaml) or provide, via command
line arguments, where Relion is installed:
```
--mpirun /path/to/mpirun --relionBinDir /path/to/relion/bin
```
Alternatively, you can build a [singularity](https://docs.sylabs.io/guides/3.0/user-guide/index.html) image, using the
definition file we provide [relionSingularity.def](cesped%2FrelionSingularity.def)
```commandline
singularity build relionSingularity.sif relionSingularity.def
```
and edit the config file to point to the location of the singularity image file, or use the command line argument
```
--singularityImgFile /path/to/relionSingularity.sif
```
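The `loss_function` in step 3 is left as a placeholder. As one possible way to compare a predicted rotation with the ground truth, the sketch below computes the geodesic angle between two rotation matrices. It is written in plain Python for readability and is not part of cesped: the helper names `rotation_angle_deg` and `rot_z` are hypothetical, and the measure ignores the point-group symmetry of the target. A training loss would need a differentiable tensor version of the same formula.

```python
import math

def rotation_angle_deg(rot_a, rot_b):
    """Geodesic angle, in degrees, between two 3x3 rotation matrices (nested lists)."""
    # trace(A^T B) equals the sum of the element-wise products of A and B
    trace = sum(rot_a[i][j] * rot_b[i][j] for i in range(3) for j in range(3))
    # Clamp to [-1, 1] to guard against floating-point drift before acos
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))

def rot_z(angle_deg):
    """Hypothetical helper: rotation about the z axis, used only to build test inputs."""
    a = math.radians(angle_deg)
    return [[math.cos(a), -math.sin(a), 0.0],
            [math.sin(a), math.cos(a), 0.0],
            [0.0, 0.0, 1.0]]

# Two in-plane rotations that differ by 40 degrees
error = rotation_angle_deg(rot_z(10.0), rot_z(50.0))
print(f"angular error: {error:.2f} degrees")
```

The angle is zero for identical rotations and reaches 180 degrees at most, which makes it easy to interpret when inspecting predictions.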
### Cross-platform usage
Users of other deep learning frameworks can download CESPED entries using the following command
```
python -m cesped.particlesDataset download_entry -t 10166 --halfset 0
```
This will download the associated star file and .mrcs file to the default benchmark directory (defined in [defaultDataConfig.yaml](cesped%2Fconfigs%2FdefaultDataConfig.yaml)).
Use `--benchmarkDir` to specify another directory.

To list the entries available for download and the ones already downloaded, use
```
python -m cesped.particlesDataset list_entries
```
Preprocessing of the dataset entries can be executed using
```
python -m cesped.particlesDataset preprocess_entry --t 10166 --halfset 0 --o /tmp/dumpedData/ --ctf_correction "phase_flip"
```
where `--t` is the target name. Use `-h` to display the list of available preprocessing operations.
The raw data can be easily accessed using the Python package [starstack](https://pypi.org/project/starstack/), which relies on the [mrcfile](https://pypi.org/project/mrcfile/) and [starfile](https://pypi.org/project/starfile/) packages. Predictions should be written as a star file with the newly
predicted Euler angles.
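For readers who want to see what a star file looks like before reaching for starstack or starfile, the following is a minimal sketch of a parser for a single `loop_` table. It is a simplification and not the cesped or starfile implementation: the function name `read_star_loop` is hypothetical, the block name `data_particles` is an assumption about the file layout, and the toy values are made up. The column labels are standard Relion labels.

```python
def read_star_loop(text, block_name="data_particles"):
    """Parse one loop_ table of a STAR file into a list of dicts (values kept as strings)."""
    labels, rows = [], []
    in_block = in_loop = False
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("data_"):
            in_block = (line == block_name)  # entering a new data block
            in_loop = False
            continue
        if not in_block:
            continue
        if line == "loop_":
            in_loop = True
        elif in_loop and line.startswith("_"):
            labels.append(line.split()[0])  # drop the trailing "#N" column index
        elif in_loop and labels:
            rows.append(dict(zip(labels, line.split())))
    return rows

# Toy file with made-up values, for illustration only
example = """
data_optics

loop_
_rlnOpticsGroup #1
_rlnImagePixelSize #2
1 1.5

data_particles

loop_
_rlnImageName #1
_rlnAngleRot #2
_rlnAngleTilt #3
_rlnAnglePsi #4
000001@stack.mrcs 10.0 45.0 -30.0
000002@stack.mrcs 95.5 120.0 15.0
"""

particles = read_star_loop(example)
print(particles[0]["_rlnImageName"], float(particles[1]["_rlnAngleTilt"]))
```

For real work the dedicated packages are the safer choice, since they handle quoting, multiple loops and type conversion.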
Evaluation can be computed once the predictions for half-set 0 and half-set 1 are saved:
```
python -m cesped.evaluateEntry --predictionType SO3 --targetName 11120 \
--half0PredsFname particles_preds_0.star --half1PredsFname particles_preds_1.star \
--n_cpus 12 --outdir evaluation/
```
## Image2Sphere experiments
The experiments have been implemented using [lightning](https://lightning.ai/) and LightningCLI. The configuration files are
located at:
```
YOUR_DIR/cesped/configs/
```
You can also get this path from Python:
```
import cesped
cesped.default_configs_dir
```
### Train
To train the model on one target, run
```
python -m cesped.trainEntry --data.halfset <HALFSET> --data.targetName <TARGETNAME> --trainer.default_root_dir <OUTDIR>
```
with `<HALFSET>` set to 0 or 1 and `<TARGETNAME>` set to one of the entries listed by `ParticlesDataset.getCESPEDEntries()`.

The included targets are:

| EMPIAR ID | Composition | Symmetry | Image Pixels | FSCR<sub>0.143</sub> (Å) | Masked FSCR<sub>0.143</sub> (Å) | # Particles |
|-----------|-------------|----------|--------------|-------------------------|---------------------------------|-------------|
| 10166 | Human 26S proteasome bound to the chemotherapeutic Oprozomib | C1 | 284 | 5.0 | 3.9 | 238631 |
| 10786 | Substance P-Neurokinin Receptor G protein complexes (SP-NK1R-miniGs399) | C1 | 184 | 3.3 | 3.0* | 288659 |
| 10280 | Calcium-bound TMEM16F in nanodisc with supplement of PIP2 | C2 | 182 | 3.6 | 3.0* | 459504 |
| 11120 | M22 bound TSHR Gs 7TM G protein | C1 | 232 | 3.4 | 3.0* | 244973 |
| 10648 | PKM2 in complex with Compound 5 | D2 | 222 | 3.7 | 3.3 | 234956 |
| 10409 | Replicating SARS-CoV-2 polymerase (Map 1) | C1 | 240 | 3.3 | 3.0* | 406001 |
| 10374 | Human ABCG2 transporter with inhibitor MZ29 and 5D3-Fab | C2 | 216 | 3.7 | 3.0* | 323681 |
`*` Nyquist Frequency at 1.5 Å/pixel; Resolution is estimated at the usual threshold 0.143.
Reported FSCR<sub>0.143</sub> values were obtained directly from the relion_refine logs while Masked FSCR<sub>0.143</sub> values were collected from the relion_postprocess logs.
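The starred values follow from the sampling theorem: the finest resolvable detail (the Nyquist limit) is twice the pixel size. A short check of the arithmetic, with a hypothetical helper name:

```python
def nyquist_resolution(pixel_size_angstrom):
    """Nyquist-limited resolution in Angstroms: two pixels per resolvable period."""
    return 2.0 * pixel_size_angstrom

# At the 1.5 A/pixel sampling quoted in the footnote the limit is 3.0 A
print(nyquist_resolution(1.5))
```

A masked resolution reported as 3.0* therefore means the estimate reached the sampling limit of the images rather than a limit set by the data quality.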
In addition, the entry TEST is a small subset of EMPIAR-11120.
Do not forget to change the configuration files or to provide different values via the command line or environment
variables. In addition, `[--config CONFIG_NAME.yaml]` allows overriding the default values using one or several custom
YAML files. Use `-h` to see the list of configurable parameters. Some of the most important ones are:
- trainer.default_root_dir. Directory where the checkpoints and the logs will be saved,
from [defaultTrainerConfig.yaml](cesped%2Fconfigs%2FdefaultTrainerConfig.yaml)
- optimizer.lr. The learning rate, from [defaultOptimizerConfig.yaml](cesped%2Fconfigs%2FdefaultOptimizerConfig.yaml)
- data.benchmarkDir. Directory where the benchmark entries are saved, from [defaultDataConfig.yaml](cesped%2Fconfigs%2FdefaultDataConfig.yaml). It is recommended
to change this in the config file.
- data.num_data_workers. Number of workers for data loading, from [defaultDataConfig.yaml](cesped%2Fconfigs%2FdefaultDataConfig.yaml)
- data.batch_size. The batch size, from [defaultDataConfig.yaml](cesped%2Fconfigs%2FdefaultDataConfig.yaml)
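Since both halfsets need to be trained before evaluation, it can be convenient to assemble the training command from Python. The sketch below only builds the argument list from the flags documented above. The helper `build_train_command`, the output paths and the override values are illustrative assumptions, not recommended settings.

```python
import sys

def build_train_command(target_name, halfset, out_dir, overrides=None):
    """Assemble the cesped.trainEntry command line as a list of arguments."""
    cmd = [sys.executable, "-m", "cesped.trainEntry",
           "--data.halfset", str(halfset),
           "--data.targetName", str(target_name),
           "--trainer.default_root_dir", str(out_dir)]
    # Extra dotted options such as optimizer.lr are appended as --key value pairs
    for key, value in (overrides or {}).items():
        cmd += [f"--{key}", str(value)]
    return cmd

for halfset in (0, 1):
    cmd = build_train_command(
        "11120", halfset, f"/tmp/cesped_runs/half{halfset}",
        overrides={"optimizer.lr": 1e-3, "data.batch_size": 32},  # illustrative values
    )
    print(" ".join(cmd))
    # To launch training, pass cmd to subprocess.run(cmd, check=True)
```

Keeping one output directory per halfset avoids mixing the checkpoints and prediction files of the two runs.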
### Inference
By default, when using `python -m cesped.trainEntry`, inference on the complementary halfset is done on a single GPU
after training finishes, and the starfile with the predictions can be found at
`<OUTDIR>/lightning_logs/version_<\d>/predictions_[0,1].star`. In order to manually run the pose prediction
code (and to make use of all GPUs) you can run
```
python -m cesped.inferEntry --data.halfset <HALFSET> --data.targetName <TARGETNAME> --ckpt_path <PATH_TO_CHECKPOINT> \
--outFname /path/to/output/starfile.star
```
### Evaluation
As before, once the predictions for halfset 0 and halfset 1 have been saved, they can be evaluated with the evaluateEntry script.
```
python -m cesped.evaluateEntry --predictionType SO3 --targetName 11120 \
--half0PredsFname particles_preds_0.star --half1PredsFname particles_preds_1.star \
--n_cpus 12 --outdir evaluation/
```
## API
For API documentation check the [docs folder](https://rsanchezgarc.github.io/cesped/cesped/)
## Relion Singularity
A singularity container for relion_reconstruct with MPI support can be built with the following command:
```
singularity build relionSingularity.sif relionSingularity.def
```
Then, Relion reconstruction can be computed with the following command:
```
singularity exec relionSingularity.sif mpirun -np 4 relion_reconstruct_mpi --ctf --pad 2 --i input_particles.star --o output_map.mrc
#Or the following command
./relionSingularity.sif 4 --ctf --pad 2 --i input_particles.star --o output_map.mrc #This uses 4 MPI processes
```
However, typical users will not need to execute the container manually. Everything happens transparently within the evaluateEntry.py script.