MolScribe


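| Field           | Value                                                                                                                                                                                                                                             |
|-----------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Name            | MolScribe                                                                                                                                                                                                                                         |
| Version         | 1.1.1                                                                                                                                                                                                                                             |
| Home page       | https://github.com/thomas0809/MolScribe                                                                                                                                                                                                           |
| Summary         | MolScribe                                                                                                                                                                                                                                         |
| Upload time     | 2023-04-06 01:49:48                                                                                                                                                                                                                               |
| Author          | Yujie Qian                                                                                                                                                                                                                                        |
| Requires Python | >=3.7                                                                                                                                                                                                                                             |
| License         | MIT                                                                                                                                                                                                                                               |
| Requirements    | torch, torchvision, numpy>=1.19.5, pandas>=1.2.4, matplotlib>=3.5.3, opencv-python==4.5.5.64, transformers>=4.5.1, huggingface-hub>=0.11.0, tensorboardX, SmilesPE==0.0.3, OpenNMT-py==2.2.0, rdkit-pypi>=2021.03.2, albumentations, timm |
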
# MolScribe

This is the repository for MolScribe, an image-to-graph model that translates a molecular image to its chemical
structure. Try our [demo](https://huggingface.co/spaces/yujieq/MolScribe) on HuggingFace!

![MolScribe](assets/model.png)

If you use MolScribe in your research, please cite our [paper](https://pubs.acs.org/doi/10.1021/acs.jcim.2c01480).
```
@article{
    MolScribe,
    title = {{MolScribe}: Robust Molecular Structure Recognition with Image-to-Graph Generation},
    author = {Yujie Qian and Jiang Guo and Zhengkai Tu and Zhening Li and Connor W. Coley and Regina Barzilay},
    journal = {Journal of Chemical Information and Modeling},
    publisher = {American Chemical Society ({ACS})},
    doi = {10.1021/acs.jcim.2c01480},
    year = 2023,
}
```

## Quick Start
Run the following commands to install the package and its dependencies:
```
git clone https://github.com/thomas0809/MolScribe.git
cd MolScribe
pip install .
```

Download the MolScribe checkpoint from [HuggingFace Hub](https://huggingface.co/yujieq/MolScribe/tree/main) 
and predict molecular structures:
```python
import torch
from molscribe import MolScribe
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download('yujieq/MolScribe', 'swin_base_char_aux_1m.pth')

model = MolScribe(ckpt_path, device=torch.device('cpu'))
output = model.predict_image_file('assets/example.png', compute_confidence=True, get_atoms_bonds=True)
```

The output is a dictionary with the following format:
```
{
    'smiles': 'Fc1ccc(-c2cc(-c3ccccc3)n(-c3ccccc3)c2)cc1',
    'molfile': '***', 
    'confidence': 0.9175,
    'atoms': [{'atom_symbol': '[Ph]', 'x': 0.5714, 'y': 0.9523, 'confidence': 0.9127}, ... ],
    'bonds': [{'bond_type': 'single', 'endpoint_atoms': [0, 1], 'confidence': 0.9999}, ... ]
}
```
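
As a minimal sketch of how such a dictionary might be consumed, the snippet below lists the atoms and bonds whose confidence falls below a threshold. The helper name `flag_uncertain`, the 0.95 threshold, and the sample values are illustrative assumptions and not part of MolScribe; only the key names are taken from the format shown above.

```python
from typing import Any, Dict, List


def flag_uncertain(output: Dict[str, Any], threshold: float = 0.95) -> Dict[str, List[Any]]:
    """Return atoms and bonds whose confidence is below `threshold`.

    Hypothetical helper: it relies only on the documented key names.
    """
    # Keep the atom index so it can be matched against `endpoint_atoms`.
    atoms = [
        (index, atom["atom_symbol"])
        for index, atom in enumerate(output.get("atoms", []))
        if atom["confidence"] < threshold
    ]
    bonds = [
        tuple(bond["endpoint_atoms"])
        for bond in output.get("bonds", [])
        if bond["confidence"] < threshold
    ]
    return {"atoms": atoms, "bonds": bonds}


# Made-up sample in the documented format (values are for illustration only).
sample_output = {
    "smiles": "CCO",
    "molfile": "***",
    "confidence": 0.91,
    "atoms": [
        {"atom_symbol": "C", "x": 0.10, "y": 0.50, "confidence": 0.99},
        {"atom_symbol": "C", "x": 0.50, "y": 0.40, "confidence": 0.97},
        {"atom_symbol": "O", "x": 0.90, "y": 0.50, "confidence": 0.90},
    ],
    "bonds": [
        {"bond_type": "single", "endpoint_atoms": [0, 1], "confidence": 0.99},
        {"bond_type": "single", "endpoint_atoms": [1, 2], "confidence": 0.93},
    ],
}

uncertain = flag_uncertain(sample_output)
print(uncertain)
```

A check like this could be used to route low-confidence predictions to manual review; a suitable threshold would have to be chosen empirically.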

Please refer to [`molscribe/interface.py`](molscribe/interface.py) for details and other available APIs.

For development or reproducing the experiments, please follow the instructions below.

## Experiments

### Requirements
Install the required packages:
```
pip install -r requirements.txt
```

### Data
For training or evaluation, please download the corresponding datasets to `data/`.

Training data:

| Datasets                                                                            | Description                                                                                                                                   |
|-------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------|
| USPTO <br> [Download](https://www.dropbox.com/s/3podz99nuwagudy/uspto_mol.zip?dl=0) | Downloaded from [USPTO, Grant Red Book](https://bulkdata.uspto.gov/).                                                                         |
| PubChem <br> [Download](https://www.dropbox.com/s/mxvm5i8139y5cvk/pubchem.zip?dl=0) | Molecules are downloaded from [PubChem](https://ftp.ncbi.nlm.nih.gov/pubchem/Compound/), and images are dynamically rendered during training. |

Benchmarks:

| Category                                                                                   | Datasets                                      | Description                                                                                                                                                                                                                                |
|--------------------------------------------------------------------------------------------|-----------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Synthetic <br> [Download](https://huggingface.co/yujieq/MolScribe/blob/main/synthetic.zip) | Indigo <br> ChemDraw                          | Images are rendered by Indigo and ChemDraw.                                                                                                                                                                                                |
| Realistic <br> [Download](https://huggingface.co/yujieq/MolScribe/blob/main/real.zip)      | CLEF <br> UOB <br> USPTO <br> Staker <br> ACS | CLEF, UOB, and USPTO are downloaded from https://github.com/Kohulan/OCSR_Review. <br/> Staker is downloaded from https://drive.google.com/drive/folders/16OjPwQ7bQ486VhdX4DWpfYzRsTGgJkSu. <br> ACS is a new dataset that we collected ourselves. |
| Perturbed <br> [Download](https://huggingface.co/yujieq/MolScribe/blob/main/perturb.zip)   | CLEF <br> UOB <br> USPTO <br> Staker          | Downloaded from https://github.com/bayer-science-for-a-better-life/Img2Mol/                                                                                                                                                                |


### Model
Our model checkpoints can be downloaded from [Dropbox](https://www.dropbox.com/sh/91u508kf48cotv4/AACQden2waMXIqLwYSi8zO37a?dl=0) 
or [HuggingFace Hub](https://huggingface.co/yujieq/MolScribe/tree/main).

Model architecture:
- Encoder: [Swin Transformer](https://github.com/microsoft/Swin-Transformer), Swin-B.
- Decoder: Transformer, 6 layers, hidden_size=256, attn_heads=8.
- Input size: 384x384

Download the model checkpoint to reproduce our experiments:
```
mkdir -p ckpts
wget -P ckpts https://huggingface.co/yujieq/MolScribe/resolve/main/swin_base_char_aux_1m680k.pth
```

### Prediction
```
python predict.py --model_path ckpts/swin_base_char_aux_1m680k.pth --image_path assets/example.png
```
The MolScribe prediction interface is in [`molscribe/interface.py`](molscribe/interface.py).
See the Python script [`predict.py`](predict.py) or the Jupyter notebook [`notebook/predict.ipynb`](notebook/predict.ipynb)
for example usage.

### Evaluate MolScribe
```
bash scripts/eval_uspto_joint_chartok_1m680k.sh
```
The script uses one GPU and a batch size of 64 by default. If more GPUs are available, update `NUM_GPUS_PER_NODE` and
`BATCH_SIZE` for faster evaluation.

### Train MolScribe
```
bash scripts/train_uspto_joint_chartok_1m680k.sh
```
The script uses four GPUs and a batch size of 256 by default. Training the model takes about one day on four A100 GPUs.
During training, we use a modified version of [Indigo](https://github.com/epam/Indigo) (included in `molscribe/indigo/`).


### Evaluation Script
We implement a standalone evaluation script [`evaluate.py`](evaluate.py). Example usage:
```
python evaluate.py \
    --gold_file data/real/acs.csv \
    --pred_file output/uspto/swin_base_char_aux_1m680k/prediction_acs.csv \
    --pred_field post_SMILES
```
The predictions should be saved in a CSV file with the columns `image_id` for the index (must match the gold file)
and `SMILES` for the predicted SMILES. If the predictions use a different column name, such as `post_SMILES` in the example above, specify it with `--pred_field`.
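
For predictions produced outside the provided scripts, a file in this layout could be written with a sketch like the following. The function `write_predictions` and its parameters are hypothetical and not part of the repository; only the `image_id` and `SMILES` column names come from the description above.

```python
import csv
from pathlib import Path
from typing import Callable, Iterable, Tuple


def write_predictions(
    items: Iterable[Tuple[str, str]],
    predict_smiles: Callable[[str], str],
    pred_file: str,
    pred_field: str = "SMILES",
) -> int:
    """Write one row per image with the columns `image_id` and `pred_field`.

    `items` yields (image_id, image_path) pairs and `predict_smiles` maps an
    image path to a SMILES string. Both are hypothetical stand-ins.
    Returns the number of rows written.
    """
    # Create the output directory if it does not exist yet.
    Path(pred_file).parent.mkdir(parents=True, exist_ok=True)
    rows = 0
    with open(pred_file, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["image_id", pred_field])
        writer.writeheader()
        for image_id, image_path in items:
            writer.writerow({"image_id": image_id, pred_field: predict_smiles(image_path)})
            rows += 1
    return rows
```

With the Quick Start model, `predict_smiles` could plausibly be `lambda path: model.predict_image_file(path)['smiles']`. The `image_id` values must be taken from the gold file so that the two files align.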

The result contains three scores:
- canon_smiles: our main metric, exact matching accuracy.
- graph: graph exact matching accuracy, ignoring tetrahedral chirality.
- chiral: exact matching accuracy on chiral molecules.
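
To make "exact matching accuracy" concrete, here is a minimal sketch of the general idea. It is not the implementation in `evaluate.py`. The function name, the `canonicalize` parameter, and the sample SMILES are assumptions for illustration. The default canonicalizer only strips whitespace, so chemically equivalent strings written differently are counted as mismatches.

```python
from typing import Callable, Dict


def exact_match_accuracy(
    gold: Dict[str, str],
    pred: Dict[str, str],
    canonicalize: Callable[[str], str] = str.strip,
) -> float:
    """Fraction of gold entries whose canonical form equals the prediction's.

    Missing predictions count as errors. `canonicalize` defaults to
    whitespace stripping only; a real evaluation would pass a
    chemistry-aware canonicalizer instead.
    """
    if not gold:
        return 0.0
    correct = 0
    for image_id, gold_smiles in gold.items():
        pred_smiles = pred.get(image_id)
        if pred_smiles is not None and canonicalize(pred_smiles) == canonicalize(gold_smiles):
            correct += 1
    return correct / len(gold)


# Illustrative data: "2" is benzene written two ways, "3" has no prediction.
gold = {"1": "CCO", "2": "c1ccccc1", "3": "CC(=O)O"}
pred = {"1": "CCO ", "2": "C1=CC=CC=C1"}
accuracy = exact_match_accuracy(gold, pred)
print(f"{accuracy:.3f}")
```

Only entry `1` matches here, because plain string comparison treats the two benzene notations as different. Passing an RDKit-based canonicalizer, for example one built on `Chem.MolToSmiles(Chem.MolFromSmiles(s))`, would be one way to compare structures rather than strings.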

            
