synfrag


Name: synfrag
Version: 1.0.0
Home page: https://github.com/simmzx/SynFrag
Summary: SynFrag: A Synthetic Accessibility Predictor based Fragment Assembly autoRegressive pretrain
Upload time: 2025-09-01 14:40:38
Maintainer: None
Docs URL: None
Author: Xiang Zhang
Requires Python: >=3.8
License: MIT
Keywords: chemistry, molecular, synthesizability, synthetic accessibility, fragment assembly, deep learning, graph neural networks, cheminformatics, drug discovery, smiles
Requirements: torch, torch-cluster, torch-geometric, torch-scatter, torch-sparse, torch-spline-conv, torchmetrics, dgl, dgllife, rdkit, deepchem, numpy, pandas, scipy, scikit-learn, matplotlib, seaborn, pillow, openpyxl, requests, tqdm, pyyaml, numba

[![AIDD](https://img.shields.io/badge/🧬%20AIDD-Synthetic%20Accessibility-4CAF50?style=flat)](https://github.com/simmzx/SynFrag)
[![PyPI](https://img.shields.io/badge/PyPI-synfrag%20v1.0.0-306998?style=flat&logo=pypi&logoColor=white)](https://pypi.org/project/synfrag/)
[![GitHub](https://img.shields.io/badge/simmzx💤-181717?style=flat&logo=github&logoColor=white)](https://github.com/simmzx)[![Email](https://img.shields.io/badge/📧Email-1E88E5?style=flat)](mailto:zhangxiang@simm.ac.cn?subject=Regarding%20FARScore)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

# SynFrag: Synthetic Accessibility via Fragment Assembly Generation
> Predict the synthetic accessibility of molecules like an experienced synthetic chemist
## 🎯 What Makes SynFrag Different
SynFrag approaches synthetic accessibility prediction through a **pre-training strategy that generates molecules by autoregressive fragment assembly**. Unlike traditional approaches that learn synthesis patterns directly, SynFrag first learns the fundamentals of molecular construction (how molecules are assembled from fragments) and then applies this knowledge to predict synthetic accessibility.
### Two-Stage Learning:
* **Stage 1**: Pretrain on 9.2M unlabeled molecules to learn molecular assembly patterns
* **Stage 2**: Finetune on 800K labeled molecules for synthetic accessibility prediction

This mirrors human chemical intuition: experienced chemists understand molecular construction before assessing synthetic difficulty.

## ✨ Key Features
* Easy Integration - Simple CSV input/output format
* Batch Prediction - One-click synthetic accessibility scoring
* High Accuracy - Reports state-of-the-art performance on multiple test sets, measured by accuracy, AUROC, and specificity.

## 🌐 Online Service
**Instant molecular synthesis prediction in the cloud.** Simply upload your CSV file with SMILES and receive AI-powered synthetic accessibility scores in seconds.

## 🚀 Quick Start
### 1. Installation
```bash
    # Clone the repository and enter it
    git clone https://github.com/simmzx/SynFrag.git
    cd SynFrag

    # Create the environment and install dependencies
    conda create -n SynFrag python=3.8
    conda activate SynFrag
    pip install -r requirements.txt
```
### 2. Prepare Data
Create a CSV file with a `smiles` column:

| molecule_id | smiles |
| :---------: | :--------: |
| Palbociclib | CC1=C(C(=O)N(C2=NC(=NC=C12)NC3=NC=C(C=C3)N4CCNCC4)C5CCCC5)C(=O)C |
| (+)-Eburnamonine | [C@]12(C3=C4CCN1CCC[C@@]2(CC(=O)N3C1C4=CC=CC=1)CC)[H] |
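
As a minimal sketch, an input file with these two columns could be written with Python's standard `csv` module. The helper name `write_input_csv` is hypothetical and not part of SynFrag; the column names and molecules are taken from the table above.

```python
import csv

# Example molecules from the table above: (molecule_id, SMILES).
MOLECULES = [
    ("Palbociclib", "CC1=C(C(=O)N(C2=NC(=NC=C12)NC3=NC=C(C=C3)N4CCNCC4)C5CCCC5)C(=O)C"),
    ("(+)-Eburnamonine", "[C@]12(C3=C4CCN1CCC[C@@]2(CC(=O)N3C1C4=CC=CC=1)CC)[H]"),
]


def write_input_csv(path, molecules):
    """Write (molecule_id, smiles) pairs to a CSV file with a header row."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["molecule_id", "smiles"])
        writer.writerows(molecules)
```

Calling `write_input_csv("example.csv", MOLECULES)` would produce a file like the one used in the next step. Using `csv.writer` rather than manual string joining keeps the file valid if an identifier ever contains a comma or a quote.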
### 3. Run Prediction
CSV file mode:
```bash
    python synfrag.py --input_file example.csv
```
Direct SMILES mode:
```bash
    # Single molecule
    python synfrag.py --smiles "CCO"
    # Multiple molecules
    python synfrag.py --smiles "CCO" "CC(=O)O" "c1ccccc1"
```
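
When the script is driven from another Python program, the two modes above can be wrapped as follows. This is a sketch under assumptions: `build_command` and `run_prediction` are hypothetical helpers, and only the `--input_file` and `--smiles` flags shown above are used.

```python
import subprocess
import sys


def build_command(smiles=None, input_file=None, script="synfrag.py"):
    """Return the argv list for one synfrag.py run (CSV mode or SMILES mode)."""
    if bool(smiles) == bool(input_file):
        raise ValueError("pass exactly one of smiles or input_file")
    command = [sys.executable, script]
    if input_file:
        command += ["--input_file", input_file]
    else:
        # Each SMILES string becomes its own argument, as in the shell example.
        command += ["--smiles", *smiles]
    return command


def run_prediction(**kwargs):
    """Run the script; raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_command(**kwargs), check=True)
```

Passing the arguments as a list avoids shell quoting problems, which matters because SMILES strings routinely contain parentheses, brackets, `=` and `#`.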
### 4. View Results
The output file contains a `synfrag` score for each molecule:

| molecule_id | smiles | synfrag |
| :------------: | :---------------: | :-----: |
| Palbociclib | CC1=C(C(=O)N(C2=NC(=NC=C12)NC3=NC=C(C=C3)N4CCNCC4)C5CCCC5)C(=O)C | 0.9453 |
| (+)-Eburnamonine | [C@]12(C3=C4CCN1CCC[C@@]2(CC(=O)N3C1C4=CC=CC=1)CC)[H] | 0.0286 |

**SynFrag Interpretation:**
* Close to 1: Easy to synthesize
* Close to 0: Hard to synthesize
* Threshold 0.5: Binary classification cutoff
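
The interpretation above can be applied to a results file in a few lines. The sketch below assumes the output is a CSV with the three columns shown in the table, and it assumes that a score of exactly 0.5 counts as easy (the source does not say which side the boundary falls on). The names `label` and `rank_results` are hypothetical.

```python
import csv
import io

THRESHOLD = 0.5  # binary classification cutoff quoted above


def label(score, threshold=THRESHOLD):
    """Map a SynFrag score to a coarse label (treating >= as easy is an assumption)."""
    return "easy" if score >= threshold else "hard"


def rank_results(csv_text):
    """Parse result rows, attach a label, and sort from easiest to hardest."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for row in rows:
        row["synfrag"] = float(row["synfrag"])
        row["label"] = label(row["synfrag"])
    return sorted(rows, key=lambda row: row["synfrag"], reverse=True)


# The two example rows from the results table, in CSV form.
EXAMPLE = """molecule_id,smiles,synfrag
(+)-Eburnamonine,[C@]12(C3=C4CCN1CCC[C@@]2(CC(=O)N3C1C4=CC=CC=1)CC)[H],0.0286
Palbociclib,CC1=C(C(=O)N(C2=NC(=NC=C12)NC3=NC=C(C=C3)N4CCNCC4)C5CCCC5)C(=O)C,0.9453
"""

for row in rank_results(EXAMPLE):
    print(row["molecule_id"], row["synfrag"], row["label"])
```

Keeping the raw score next to the label is deliberate: the score carries ranking information that the binary cutoff discards.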

## 📖 Advanced Usage
Custom pretraining and fine-tuning tasks:
### Pretrain Model
```bash
    python synfrag_pretrain.py \
        --dataset smiles.txt \
        --vocab fragment.txt
```
Note: `smiles.txt` contains unlabeled molecules. `fragment.txt` is the fragment vocabulary that `./scripts/utils/mol/cls.py` generates from `smiles.txt`; it is used for fragment-assembly autoregressive pretraining.

### Finetune Model
```bash
    python synfrag_finetune.py \
        --input_model_file gnn_pretrained.pth \
        --dataset dataset.csv
```
Note: `gnn_pretrained.pth` is a model saved during the pretraining stage. `dataset.csv` contains labeled molecules for fine-tuning on a specific downstream task.

## 🔧 Requirements
* Python 3.8-3.10
* CUDA-enabled GPU (recommended)
* Key dependencies: PyTorch, RDKit, DGL, DeepChem
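
A quick environment check along these lines can catch a mismatched interpreter or a missing package before a long run. This is only a sketch: the helper names are hypothetical, and the import names `torch`, `rdkit`, `dgl` and `deepchem` are the usual ones for the listed packages rather than something this README specifies.

```python
import importlib.util
import sys

SUPPORTED = ((3, 8), (3, 10))  # Python range from the list above
# Import names are assumed; a package's import name can differ from its display name.
KEY_MODULES = ["torch", "rdkit", "dgl", "deepchem"]


def python_supported(version=None):
    """Return True if the (major, minor) version lies within the supported range."""
    major_minor = tuple((version or sys.version_info)[:2])
    return SUPPORTED[0] <= major_minor <= SUPPORTED[1]


def missing_modules(names=KEY_MODULES):
    """Return the top-level modules that cannot be located in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Python version supported:", python_supported())
    print("Missing modules:", missing_modules() or "none")
```

`find_spec` only locates a module without importing it, so the check stays fast even for heavy packages such as PyTorch.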

## 📄 Citation
If this program is useful to you, please cite our paper:


## 📧 Contact
For questions, please contact: Xiang Zhang (Email: zhangxiang@simm.ac.cn)
______________________________________________________________________________________________________
🌟 **Like this project? Give us a Star**

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/simmzx/SynFrag",
    "name": "synfrag",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "chemistry, molecular, synthesizability, synthetic accessibility, fragment assembly, deep learning, graph neural networks, cheminformatics, drug discovery, SMILES",
    "author": "Xiang Zhang",
    "author_email": "776206454@qq.com",
    "download_url": "https://files.pythonhosted.org/packages/07/89/87678768043df42af64acf15b17cfc722d5cb8322f1af11bb16cef017b2f/synfrag-1.0.0.tar.gz",
    "platform": "any",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "SynFrag: A Synthetic Accessibility Predictor based Fragment Assembly autoRegressive pretrain",
    "version": "1.0.0",
    "project_urls": {
        "Bug Reports": "https://github.com/simmzx/SynFrag/issues",
        "Documentation": "https://github.com/simmzx/SynFrag/docs",
        "Homepage": "https://github.com/simmzx/SynFrag",
        "Source": "https://github.com/simmzx/SynFrag"
    },
    "split_keywords": [
        "chemistry",
        " molecular",
        " synthesizability",
        " synthetic accessibility",
        " fragment assembly",
        " deep learning",
        " graph neural networks",
        " cheminformatics",
        " drug discovery",
        " smiles"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "2df97cf571b773c41d71328b8f449657e7c4ef7fd1bc1af5dd56862a592a30a3",
                "md5": "883f675e79bd12fe477bdac9286ca814",
                "sha256": "bae31b4f481f5d5ee3bca01144bcfbeeda0a023d809cbc5b80b1bbd63e13a566"
            },
            "downloads": -1,
            "filename": "synfrag-1.0.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "883f675e79bd12fe477bdac9286ca814",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 14488727,
            "upload_time": "2025-09-01T14:40:33",
            "upload_time_iso_8601": "2025-09-01T14:40:33.534146Z",
            "url": "https://files.pythonhosted.org/packages/2d/f9/7cf571b773c41d71328b8f449657e7c4ef7fd1bc1af5dd56862a592a30a3/synfrag-1.0.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "078987678768043df42af64acf15b17cfc722d5cb8322f1af11bb16cef017b2f",
                "md5": "678b98af16249df06e530362ae066523",
                "sha256": "645e2db03ad4012073de0dec21b2a05c32232135dd672d74a9c7b85812fa5fca"
            },
            "downloads": -1,
            "filename": "synfrag-1.0.0.tar.gz",
            "has_sig": false,
            "md5_digest": "678b98af16249df06e530362ae066523",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 14472973,
            "upload_time": "2025-09-01T14:40:38",
            "upload_time_iso_8601": "2025-09-01T14:40:38.098764Z",
            "url": "https://files.pythonhosted.org/packages/07/89/87678768043df42af64acf15b17cfc722d5cb8322f1af11bb16cef017b2f/synfrag-1.0.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-09-01 14:40:38",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "simmzx",
    "github_project": "SynFrag",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [
        {
            "name": "torch",
            "specs": [
                [
                    ">=",
                    "1.12.0"
                ]
            ]
        },
        {
            "name": "torch-cluster",
            "specs": [
                [
                    ">=",
                    "1.6.0"
                ]
            ]
        },
        {
            "name": "torch-geometric",
            "specs": [
                [
                    ">=",
                    "2.3.0"
                ]
            ]
        },
        {
            "name": "torch-scatter",
            "specs": [
                [
                    ">=",
                    "2.1.0"
                ]
            ]
        },
        {
            "name": "torch-sparse",
            "specs": [
                [
                    ">=",
                    "0.6.15"
                ]
            ]
        },
        {
            "name": "torch-spline-conv",
            "specs": [
                [
                    ">=",
                    "1.2.1"
                ]
            ]
        },
        {
            "name": "torchmetrics",
            "specs": [
                [
                    ">=",
                    "1.1.0"
                ]
            ]
        },
        {
            "name": "dgl",
            "specs": [
                [
                    ">=",
                    "0.6.1"
                ]
            ]
        },
        {
            "name": "dgllife",
            "specs": [
                [
                    ">=",
                    "0.2.9"
                ]
            ]
        },
        {
            "name": "rdkit",
            "specs": [
                [
                    ">=",
                    "2022.3.0"
                ]
            ]
        },
        {
            "name": "deepchem",
            "specs": [
                [
                    ">=",
                    "2.6.0"
                ]
            ]
        },
        {
            "name": "numpy",
            "specs": [
                [
                    ">=",
                    "1.21.0"
                ],
                [
                    "<",
                    "2.0.0"
                ]
            ]
        },
        {
            "name": "pandas",
            "specs": [
                [
                    "<",
                    "3.0.0"
                ],
                [
                    ">=",
                    "1.3.0"
                ]
            ]
        },
        {
            "name": "scipy",
            "specs": [
                [
                    "<",
                    "2.0.0"
                ],
                [
                    ">=",
                    "1.7.0"
                ]
            ]
        },
        {
            "name": "scikit-learn",
            "specs": [
                [
                    "<",
                    "2.0.0"
                ],
                [
                    ">=",
                    "1.0.0"
                ]
            ]
        },
        {
            "name": "matplotlib",
            "specs": [
                [
                    ">=",
                    "3.5.0"
                ]
            ]
        },
        {
            "name": "seaborn",
            "specs": [
                [
                    ">=",
                    "0.12.0"
                ]
            ]
        },
        {
            "name": "pillow",
            "specs": [
                [
                    ">=",
                    "9.0.0"
                ]
            ]
        },
        {
            "name": "openpyxl",
            "specs": [
                [
                    ">=",
                    "3.1.0"
                ]
            ]
        },
        {
            "name": "requests",
            "specs": [
                [
                    ">=",
                    "2.31.0"
                ]
            ]
        },
        {
            "name": "tqdm",
            "specs": [
                [
                    ">=",
                    "4.60.0"
                ]
            ]
        },
        {
            "name": "pyyaml",
            "specs": [
                [
                    ">=",
                    "6.0.0"
                ]
            ]
        },
        {
            "name": "numba",
            "specs": [
                [
                    ">=",
                    "0.57.0"
                ]
            ]
        }
    ],
    "lcname": "synfrag"
}
        