synthemol

Name	synthemol
Version	1.0.4
home_page	https://github.com/swansonk14/SyntheMol
Summary	synthemol
upload_time	2024-06-23 10:50:37
author	Kyle Swanson
requires_python	>=3.10
license	MIT
keywords	machine learning, drug design, generative models, synthesizable molecules
requirements	chemfunc==1.0.3, chemprop==1.5.2, descriptastorus==2.6.0, matplotlib==3.7.1, numpy==1.24.3, pandas==2.0.1, rdkit==2022.3.4, scikit-learn==1.2.2, scipy==1.10.1, torch==2.0.1, tqdm==4.65.0, typed-argument-parser==1.8.0

# SyntheMol: Generative AI for Drug Discovery

[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/synthemol)](https://badge.fury.io/py/synthemol)
[![PyPI version](https://badge.fury.io/py/synthemol.svg)](https://badge.fury.io/py/synthemol)
[![Downloads](https://pepy.tech/badge/synthemol)](https://pepy.tech/project/synthemol)
[![license](https://img.shields.io/github/license/swansonk14/synthemol.svg)](https://github.com/swansonk14/SyntheMol/blob/main/LICENSE.txt)

SyntheMol is a generative AI method for designing structurally novel and diverse drug candidates with predicted bioactivity that are easy to synthesize.

SyntheMol uses a Monte Carlo tree search (MCTS) to explore a combinatorial chemical space of molecular building blocks and chemical reactions. The MCTS is guided by a bioactivity prediction AI model, such as a graph neural network or a random forest. Currently, SyntheMol is designed to use 137,656 unique building blocks and 13 chemical reactions from the [Enamine REAL Space](https://enamine.net/compound-collections/real-compounds/real-space-navigator), which can produce over 30 billion molecules. However, SyntheMol can be easily adapted to use any set of building blocks and reactions.

SyntheMol is described in the following paper, where we applied SyntheMol to design novel antibiotic candidates for the Gram-negative bacterium _Acinetobacter baumannii_.

Swanson, K., Liu, G., Catacutan, D. B., Arnold, A., Zou, J., Stokes, J. M. [Generative AI for designing and validating easily synthesizable and structurally novel antibiotics](https://www.nature.com/articles/s42256-024-00809-7). _Nature Machine Intelligence_, 2024.

Full details for reproducing the results in the paper are provided in the [docs](docs) directory.


## Table of contents

* [Installation](#installation)
* [Combinatorial chemical space](#combinatorial-chemical-space)
* [Bioactivity prediction model](#bioactivity-prediction-model)
  + [Train model](#train-model)
  + [Pre-compute building block scores](#pre-compute-building-block-scores)
* [Generate molecules](#generate-molecules)
* [Filter generated molecules](#filter-generated-molecules)
  + [Novelty](#novelty)
  + [Bioactivity](#bioactivity)
  + [Diversity](#diversity)


## Installation

SyntheMol can be installed in under 3 minutes on any operating system using pip (optionally within a conda environment). It runs on a standard laptop (e.g., 16 GB of memory and 8-16 CPUs), although a GPU speeds up training and prediction with the underlying bioactivity prediction model (Chemprop).

Optionally, create a conda environment.
```bash
conda create -y -n synthemol python=3.10
conda activate synthemol
```

Install SyntheMol via pip.
```bash
pip install synthemol
```

Alternatively, clone the repo and install SyntheMol locally.
```bash
git clone https://github.com/swansonk14/SyntheMol.git
cd SyntheMol
pip install -e .
```

If there are version issues with the required packages, install the specific working versions pinned in `requirements.txt` from within the cloned repo (optionally inside a fresh conda environment) as follows.
```bash
pip install -r requirements.txt
pip install -e .
```

**Note:** If you encounter the error `ImportError: libXrender.so.1: cannot open shared object file: No such file or directory`, run `conda install -c conda-forge xorg-libxrender`.


## Combinatorial chemical space

SyntheMol is currently designed to use 139,493 building blocks (137,656 unique molecules) and 13 chemical reactions from the [Enamine REAL Space](https://enamine.net/compound-collections/real-compounds/real-space-navigator), which can produce over 30 billion molecules (30,330,025,259). However, an alternate combinatorial chemical space can optionally be used by replacing the building blocks and chemical reactions as follows.

**Building blocks:** Replace `data/building_blocks.csv` with a custom file containing the building blocks. The file should be a CSV file with a header row and two columns: `smiles` and `ID`. The `smiles` column should contain the SMILES string for each building block, and the `ID` column should contain a unique ID for each building block.
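
Before swapping in a custom file, it can help to check that it has the layout described above. The following is a minimal sketch using only the Python standard library; the function name `check_building_blocks` and the example IDs are hypothetical and are not part of SyntheMol.

```python
import csv
import io

def check_building_blocks(csv_text):
    """Return a list of problems found in building-block CSV text (empty if none)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # The header row must provide both required columns.
    missing = {"smiles", "ID"} - set(reader.fieldnames or [])
    if missing:
        return [f"missing column(s): {sorted(missing)}"]
    problems = []
    seen_ids = set()
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        if not row["smiles"]:
            problems.append(f"row {row_number}: empty SMILES")
        if row["ID"] in seen_ids:
            problems.append(f"row {row_number}: duplicate ID {row['ID']!r}")
        seen_ids.add(row["ID"])
    return problems

# Hypothetical example data; the IDs are invented for illustration.
example = "smiles,ID\nCCO,BB1\nc1ccccc1,BB2\nCCN,BB1\n"
print(check_building_blocks(example))
```

To check a real file, read it with `open(path, newline="").read()` and pass the text to the function. This sketch does not verify that the SMILES strings are chemically valid.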

**Chemical reactions:** In `SyntheMol/reactions/custom.py`, set `CUSTOM_REACTIONS` to a list of `Reaction` objects similar to the `REAL_REACTIONS` list in `SyntheMol/reactions/real.py`. If `CUSTOM_REACTIONS` is defined (i.e., not `None`), then it will automatically be used instead of `REAL_REACTIONS`.


## Bioactivity prediction model

SyntheMol requires a bioactivity prediction model to guide its generative process. SyntheMol is designed to use one of three types of models:

1. **Chemprop:** a message passing neural network from https://github.com/chemprop/chemprop
2. **Chemprop-RDKit:** Chemprop augmented with 200 RDKit molecular features
3. **Random forest:** a scikit-learn random forest model trained on 200 RDKit molecular features


### Train model

All three model types can be trained using [Chemprop](https://github.com/chemprop/chemprop), which is installed along with SyntheMol, on either regression or binary classification bioactivities. Full details are provided in the [Chemprop](https://github.com/chemprop/chemprop) README. Below is an example of training a Chemprop model on a binary classification task. By default, training is done on a GPU (if available).

Data file
```bash
# data/data.csv
smiles,activity
Br.CC(Cc1ccc(O)cc1)NCC(O)c1cc(O)cc(O)c1,0
CC[Hg]Sc1ccccc1C(=O)[O-].[Na+],1
O=C(O)CCc1ccc(NCc2cccc(Oc3ccccc3)c2)cc1,0
...
```

Train Chemprop
```bash
chemprop_train \
    --data_path data/data.csv \
    --dataset_type classification \
    --save_dir models/chemprop
```


### Pre-compute building block scores

After training, use the model to pre-compute scores of building blocks to accelerate the SyntheMol generation process. Below is an example using the trained Chemprop model. By default, prediction is done on a GPU (if available).

```bash
chemprop_predict \
    --test_path "$(python -c 'import synthemol; print(str(synthemol.constants.BUILDING_BLOCKS_PATH))')" \
    --preds_path models/chemprop/building_blocks.csv \
    --checkpoint_dir models/chemprop
```
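
Pre-computed scores speed up generation because a building block's score can be looked up instead of being predicted again. The sketch below illustrates that idea with a plain dictionary; it is not how SyntheMol loads the file internally, and the function name `load_scores` and the example predictions are hypothetical.

```python
import csv
import io

def load_scores(csv_text, score_column, smiles_column="smiles"):
    """Build a SMILES -> score lookup table from prediction CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Fail early if the requested score column is not in the header.
    if score_column not in (reader.fieldnames or []):
        raise KeyError(f"score column {score_column!r} not found in {reader.fieldnames}")
    return {row[smiles_column]: float(row[score_column]) for row in reader}

# Made-up predictions for illustration only.
predictions = "smiles,activity\nCCO,0.12\nc1ccccc1,0.87\n"
scores = load_scores(predictions, score_column="activity")
print(scores["c1ccccc1"])
```

The early `KeyError` mirrors a common mistake: passing a score column name that does not match the header of the predictions file.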


## Generate molecules

SyntheMol uses the bioactivity prediction model within a Monte Carlo tree search to generate molecules. Below is an example for generating molecules with a trained Chemprop model using 20,000 MCTS rollouts. SyntheMol uses CPUs only (no GPUs).

```bash
synthemol \
    --model_path models/chemprop \
    --model_type chemprop \
    --save_dir generations/chemprop \
    --building_blocks_path models/chemprop/building_blocks.csv \
    --building_blocks_score_column activity \
    --n_rollout 20000
```

**Note:** The `--building_blocks_score_column` value must match the name of the column in the building blocks file that contains the building block scores. When using `chemprop_train` and `chemprop_predict`, this column has the same name as the target activity/property column in the training data file (e.g., `activity`).
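
To give a feel for how a score-guided tree search explores a combinatorial space, here is a toy sketch. It is not SyntheMol's implementation: the building blocks, their scores, the "product" (an ordered pair of blocks), and the UCB1 selection rule are all simplifying assumptions chosen for illustration.

```python
import math

# Hypothetical building blocks with made-up pre-computed scores in [0, 1].
BLOCK_SCORES = {"A": 0.1, "B": 0.4, "C": 0.9}

def product_score(blocks):
    """Stand-in for the bioactivity model: the mean score of the blocks used."""
    return sum(BLOCK_SCORES[b] for b in blocks) / len(blocks)

class Node:
    def __init__(self, blocks):
        self.blocks = blocks   # building blocks chosen so far
        self.visits = 0        # number of rollouts that passed through this node
        self.total = 0.0       # sum of the values returned by those rollouts
        self.children = {}     # next building block -> child Node

def ucb1(parent, child, exploration=1.0):
    """Average value plus an exploration bonus; unvisited children go first."""
    if child.visits == 0:
        return math.inf
    bonus = exploration * math.sqrt(math.log(parent.visits) / child.visits)
    return child.total / child.visits + bonus

def rollout(node, max_blocks, found):
    """Descend from node to a complete product and back up its score."""
    node.visits += 1
    if len(node.blocks) == max_blocks:
        value = product_score(node.blocks)
        found[node.blocks] = value
    else:
        # Expand: make sure every possible next block has a child node.
        for block in BLOCK_SCORES:
            if block not in node.children:
                node.children[block] = Node(node.blocks + (block,))
        best_child = max(node.children.values(), key=lambda c: ucb1(node, c))
        value = rollout(best_child, max_blocks, found)
    node.total += value
    return value

def search(n_rollout, max_blocks=2):
    """Run n_rollout rollouts and return every product found with its score."""
    root, found = Node(()), {}
    for _ in range(n_rollout):
        rollout(root, max_blocks, found)
    return found

found = search(n_rollout=50)
best = max(found, key=found.get)
print(best, found[best])
```

Each rollout follows the most promising branch so far while the exploration bonus keeps rarely visited branches from being ignored, which is the trade-off the `--n_rollout` budget is spent on.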


## Filter generated molecules

Optionally, the generated molecules can be filtered for structural novelty, predicted bioactivity, and structural diversity. These filtering steps use [chemfunc](https://github.com/swansonk14/chemfunc), which is installed along with SyntheMol. Below is an example for filtering the generated molecules.

### Novelty

Filter for novelty by comparing the generated molecules to a set of active molecules (hits) from the training set or literature and removing similar generated molecules.

Hits file
```bash
# data/hits.csv
smiles,activity
CC[Hg]Sc1ccccc1C(=O)[O-].[Na+],1
O=C(NNc1ccccc1)c1ccncc1,1
...
```

Compute Tversky similarity between generated molecules and hits
```bash
chemfunc nearest_neighbor \
    --data_path generations/chemprop/molecules.csv \
    --reference_data_path data/hits.csv \
    --reference_name hits \
    --metric tversky
```

Filter by similarity, only keeping molecules with a nearest neighbor similarity to hits of at most 0.5
```bash
chemfunc filter_molecules \
    --data_path generations/chemprop/molecules.csv \
    --save_path generations/chemprop/molecules_novel.csv \
    --filter_column hits_tversky_nearest_neighbor_similarity \
    --max_value 0.5
```
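
For intuition about the novelty filter, the sketch below computes a Tversky index on fingerprints represented as sets of "on" bit positions, then keeps molecules whose nearest-neighbor similarity to the hits is at most 0.5. The weights `alpha` and `beta`, the toy fingerprints, and the function names are assumptions for illustration; the source does not state which weights or fingerprints chemfunc uses.

```python
def tversky(query_bits, reference_bits, alpha, beta):
    """Tversky index: |A & B| / (|A & B| + alpha * |A - B| + beta * |B - A|)."""
    common = len(query_bits & reference_bits)
    only_query = len(query_bits - reference_bits)
    only_reference = len(reference_bits - query_bits)
    denominator = common + alpha * only_query + beta * only_reference
    return common / denominator if denominator else 0.0

def nearest_neighbor_similarity(query_bits, references, alpha=1.0, beta=1.0):
    """Highest similarity between one molecule and any reference molecule."""
    return max(tversky(query_bits, ref, alpha, beta) for ref in references)

# Toy fingerprints (sets of on-bit indices), invented for illustration.
hits = [{1, 2, 3, 4}, {10, 11}]
generated = {"mol_a": {1, 2, 3, 9}, "mol_b": {5, 6, 7}}

# Keep only molecules that are sufficiently dissimilar from every hit.
novel = [
    name
    for name, bits in generated.items()
    if nearest_neighbor_similarity(bits, hits) <= 0.5
]
print(novel)
```

With `alpha = beta = 1` the index reduces to Tanimoto similarity; asymmetric weights change how much features unique to the query or the reference are penalized.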


### Bioactivity

Filter for predicted bioactivity by keeping the 20% of molecules with the highest predicted bioactivity.

```bash
chemfunc filter_molecules \
    --data_path generations/chemprop/molecules_novel.csv \
    --save_path generations/chemprop/molecules_novel_bioactive.csv \
    --filter_column score \
    --top_proportion 0.2
```
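
The top-proportion filter amounts to sorting by score and keeping a fixed fraction of rows. Below is a minimal sketch of that idea; the function name `keep_top_proportion` is hypothetical, and the rounding rule (nearest whole count, at least one row) is an assumption rather than chemfunc's documented behavior.

```python
def keep_top_proportion(rows, column, proportion):
    """Keep the given proportion of rows with the highest values in column."""
    # Assumed rounding rule: nearest whole count, but never fewer than one row.
    n_keep = max(1, round(len(rows) * proportion))
    ranked = sorted(rows, key=lambda row: row[column], reverse=True)
    return ranked[:n_keep]

# Ten made-up molecules with scores 0.0, 0.1, ..., 0.9.
molecules = [{"smiles": f"mol_{i}", "score": i / 10} for i in range(10)]
top = keep_top_proportion(molecules, column="score", proportion=0.2)
print([row["smiles"] for row in top])
```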


### Diversity

Filter for diversity by clustering molecules based on their Morgan fingerprint and only keeping the top scoring molecule from each cluster.

Cluster molecules into 50 clusters
```bash
chemfunc cluster_molecules \
    --data_path generations/chemprop/molecules_novel_bioactive.csv \
    --num_clusters 50
```

Select the top scoring molecule from each cluster
```bash
chemfunc select_from_clusters \
    --data_path generations/chemprop/molecules_novel_bioactive.csv \
    --save_path generations/chemprop/molecules_novel_bioactive_diverse.csv \
    --value_column score
```
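
Selecting from clusters is a group-by followed by a maximum. The sketch below shows that logic on in-memory rows; the function, the column name `cluster_label`, and the example data are assumptions for illustration and may not match the column names chemfunc writes.

```python
def select_from_clusters(rows, cluster_column, value_column):
    """Return the row with the highest value from each cluster."""
    best = {}
    for row in rows:
        cluster = row[cluster_column]
        # Keep this row if its cluster is new or it beats the current best.
        if cluster not in best or row[value_column] > best[cluster][value_column]:
            best[cluster] = row
    return list(best.values())

# Made-up clustered molecules for illustration.
clustered = [
    {"smiles": "CCO", "cluster_label": 0, "score": 0.4},
    {"smiles": "CCN", "cluster_label": 0, "score": 0.7},
    {"smiles": "c1ccccc1", "cluster_label": 1, "score": 0.6},
]
selected = select_from_clusters(clustered, "cluster_label", "score")
print([row["smiles"] for row in selected])
```

Because one molecule is kept per cluster, the number of clusters sets an upper bound on how many molecules survive this step.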
