# MolScribe
This is the repository for MolScribe, an image-to-graph model that translates a molecular image to its chemical
structure. Try our [demo](https://huggingface.co/spaces/yujieq/MolScribe) on HuggingFace!
![MolScribe](assets/model.png)
If you use MolScribe in your research, please cite our [paper](https://pubs.acs.org/doi/10.1021/acs.jcim.2c01480).
```
@article{
MolScribe,
title = {{MolScribe}: Robust Molecular Structure Recognition with Image-to-Graph Generation},
author = {Yujie Qian and Jiang Guo and Zhengkai Tu and Zhening Li and Connor W. Coley and Regina Barzilay},
journal = {Journal of Chemical Information and Modeling},
publisher = {American Chemical Society ({ACS})},
doi = {10.1021/acs.jcim.2c01480},
year = 2023,
}
```
## Quick Start
Run the following commands to install the package and its dependencies:
```
git clone git@github.com:thomas0809/MolScribe.git
cd MolScribe
python setup.py install
```
Download the MolScribe checkpoint from [HuggingFace Hub](https://huggingface.co/yujieq/MolScribe/tree/main)
and predict molecular structures:
```python
import torch
from molscribe import MolScribe
from huggingface_hub import hf_hub_download
ckpt_path = hf_hub_download('yujieq/MolScribe', 'swin_base_char_aux_1m.pth')
model = MolScribe(ckpt_path, device=torch.device('cpu'))
output = model.predict_image_file('assets/example.png', compute_confidence=True, get_atoms_bonds=True)
```
The output is a dictionary with the following format:
```
{
'smiles': 'Fc1ccc(-c2cc(-c3ccccc3)n(-c3ccccc3)c2)cc1',
'molfile': '***',
'confidence': 0.9175,
'atoms': [{'atom_symbol': '[Ph]', 'x': 0.5714, 'y': 0.9523, 'confidence': 0.9127}, ... ],
'bonds': [{'bond_type': 'single', 'endpoint_atoms': [0, 1], 'confidence': 0.9999}, ... ]
}
```
Please refer to [`molscribe/interface.py`](molscribe/interface.py) for details and other available APIs.
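
As a rough sketch of how the output dictionary might be consumed, the snippet below flags low-confidence atoms and bonds for manual review. The dictionary is a hand-written stand-in shaped like the example above (the second atom is hypothetical), and the helpers `low_confidence_items` and `to_pixels` are illustrative names, not part of the MolScribe API. The pixel conversion assumes that `x` and `y` are fractions of the image width and height, which the example values suggest but this README does not state.

```python
# Hand-written stand-in for a MolScribe output dictionary.
# The second atom is hypothetical and only there to exercise the filter.
output = {
    'smiles': 'Fc1ccc(-c2cc(-c3ccccc3)n(-c3ccccc3)c2)cc1',
    'confidence': 0.9175,
    'atoms': [
        {'atom_symbol': '[Ph]', 'x': 0.5714, 'y': 0.9523, 'confidence': 0.9127},
        {'atom_symbol': 'C', 'x': 0.5000, 'y': 0.8000, 'confidence': 0.4200},
    ],
    'bonds': [
        {'bond_type': 'single', 'endpoint_atoms': [0, 1], 'confidence': 0.9999},
    ],
}


def low_confidence_items(output, threshold=0.5):
    """Return (atom indices, bond indices) whose confidence is below threshold."""
    atoms = [i for i, a in enumerate(output.get('atoms', []))
             if a['confidence'] < threshold]
    bonds = [i for i, b in enumerate(output.get('bonds', []))
             if b['confidence'] < threshold]
    return atoms, bonds


def to_pixels(atom, width, height):
    """Assumption: x and y are fractions of the image width and height."""
    return round(atom['x'] * width), round(atom['y'] * height)


weak_atoms, weak_bonds = low_confidence_items(output, threshold=0.5)
print('atoms to review:', weak_atoms)
print('bonds to review:', weak_bonds)
```

The threshold of 0.5 is arbitrary; a suitable value depends on how much manual review is acceptable for your images.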
For development or reproducing the experiments, please follow the instructions below.
## Experiments
### Requirements
Install the required packages:
```
pip install -r requirements.txt
```
### Data
For training or evaluation, please download the corresponding datasets to `data/`.
Training data:
| Datasets | Description |
|-------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------|
| USPTO <br> [Download](https://www.dropbox.com/s/3podz99nuwagudy/uspto_mol.zip?dl=0) | Downloaded from [USPTO, Grant Red Book](https://bulkdata.uspto.gov/). |
| PubChem <br> [Download](https://www.dropbox.com/s/mxvm5i8139y5cvk/pubchem.zip?dl=0) | Molecules are downloaded from [PubChem](https://ftp.ncbi.nlm.nih.gov/pubchem/Compound/), and images are dynamically rendered during training. |
Benchmarks:
| Category | Datasets | Description |
|--------------------------------------------------------------------------------------------|-----------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Synthetic <br> [Download](https://huggingface.co/yujieq/MolScribe/blob/main/synthetic.zip) | Indigo <br> ChemDraw | Images are rendered by Indigo and ChemDraw. |
| Realistic <br> [Download](https://huggingface.co/yujieq/MolScribe/blob/main/real.zip) | CLEF <br> UOB <br> USPTO <br> Staker <br> ACS | CLEF, UOB, and USPTO are downloaded from https://github.com/Kohulan/OCSR_Review. <br> Staker is downloaded from https://drive.google.com/drive/folders/16OjPwQ7bQ486VhdX4DWpfYzRsTGgJkSu. <br> ACS is a new dataset collected by ourselves. |
| Perturbed <br> [Download](https://huggingface.co/yujieq/MolScribe/blob/main/perturb.zip) | CLEF <br> UOB <br> USPTO <br> Staker | Downloaded from https://github.com/bayer-science-for-a-better-life/Img2Mol/ |
### Model
Our model checkpoints can be downloaded from [Dropbox](https://www.dropbox.com/sh/91u508kf48cotv4/AACQden2waMXIqLwYSi8zO37a?dl=0)
or [HuggingFace Hub](https://huggingface.co/yujieq/MolScribe/tree/main).
Model architecture:
- Encoder: [Swin Transformer](https://github.com/microsoft/Swin-Transformer), Swin-B.
- Decoder: Transformer, 6 layers, hidden_size=256, attn_heads=8.
- Input size: 384x384
Download the model checkpoint to reproduce our experiments:
```
mkdir -p ckpts
wget -P ckpts https://huggingface.co/yujieq/MolScribe/resolve/main/swin_base_char_aux_1m680k.pth
```
### Prediction
```
python predict.py --model_path ckpts/swin_base_char_aux_1m680k.pth --image_path assets/example.png
```
The MolScribe prediction interface is in [`molscribe/interface.py`](molscribe/interface.py).
See the Python script [`predict.py`](predict.py) or the Jupyter notebook [`notebook/predict.ipynb`](notebook/predict.ipynb)
for example usage.
### Evaluate MolScribe
```
bash scripts/eval_uspto_joint_chartok_1m680k.sh
```
The script uses one GPU and a batch size of 64 by default. If more GPUs are available, update `NUM_GPUS_PER_NODE` and
`BATCH_SIZE` for faster evaluation.
### Train MolScribe
```
bash scripts/train_uspto_joint_chartok_1m680k.sh
```
The script uses four GPUs and a batch size of 256 by default. Training takes about one day on four A100 GPUs.
During training, we use a modified version of [Indigo](https://github.com/epam/Indigo) (included in `molscribe/indigo/`).
### Evaluation Script
We implement a standalone evaluation script [`evaluate.py`](evaluate.py). Example usage:
```
python evaluate.py \
--gold_file data/real/acs.csv \
--pred_file output/uspto/swin_base_char_aux_1m680k/prediction_acs.csv \
--pred_field post_SMILES
```
The predictions should be saved in a CSV file with a column `image_id` for the index (which must match the gold file)
and a column `SMILES` for the predicted SMILES. If the predictions are stored under a different column name, specify it with `--pred_field`.
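
A minimal sketch of producing a prediction file in this layout, using only the Python standard library. The helper name `write_predictions`, the output path, and the sample rows are assumptions made for illustration; only the `image_id` and `SMILES` column names come from the description above.

```python
import csv


def write_predictions(path, predictions, pred_field='SMILES'):
    """Write (image_id, smiles) pairs to a CSV file with a header row.

    pred_field is the name of the prediction column; pass the same name
    to evaluate.py via --pred_field if it is not 'SMILES'.
    """
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['image_id', pred_field])
        for image_id, smiles in predictions:
            writer.writerow([image_id, smiles])


# Hypothetical predictions; image_id values must match those in the gold file.
rows = [(0, 'CCO'), (1, 'c1ccccc1')]
write_predictions('prediction_example.csv', rows)
```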
The result contains three scores:
- canon_smiles: our main metric, exact matching accuracy.
- graph: graph exact matching accuracy, ignoring tetrahedral chirality.
- chiral: exact matching accuracy on chiral molecules.
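
To illustrate what an exact-matching score over canonical SMILES means, here is a simplified sketch. It is not the implementation in `evaluate.py`: the function names are hypothetical, and treating missing or unparsable predictions as errors is an assumption. The optional `rdkit_canonical` helper relies on RDKit's `Chem.MolFromSmiles` and `Chem.MolToSmiles`, and is imported lazily so the rest of the example runs without RDKit installed.

```python
def rdkit_canonical(smiles):
    """Canonicalize a SMILES string with RDKit; return None if it cannot be parsed."""
    from rdkit import Chem  # lazy import: only needed when this helper is used
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None


def exact_match_accuracy(gold, pred, canonicalize=None):
    """Fraction of gold entries whose prediction matches after canonicalization.

    gold and pred map image_id -> SMILES. Missing or unparsable predictions
    count as errors (an assumption of this sketch).
    """
    canon = canonicalize or (lambda s: s)  # default: plain string comparison
    if not gold:
        return 0.0
    hits = 0
    for image_id, gold_smiles in gold.items():
        pred_smiles = pred.get(image_id)
        if pred_smiles is None:
            continue
        g, p = canon(gold_smiles), canon(pred_smiles)
        if g is not None and g == p:
            hits += 1
    return hits / len(gold)


# Hypothetical gold and predicted SMILES keyed by image_id.
gold = {0: 'CCO', 1: 'c1ccccc1'}
pred = {0: 'CCO', 1: 'C1=CC=CC=C1'}
print(exact_match_accuracy(gold, pred))
```

With plain string comparison, the second pair counts as a mismatch even though both strings describe benzene. Passing `canonicalize=rdkit_canonical` compares canonical forms instead, which is why canonicalization matters for this metric.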