# GT4SD (Generative Toolkit for Scientific Discovery)
[![PyPI version](https://badge.fury.io/py/gt4sd.svg)](https://badge.fury.io/py/gt4sd)
[![Actions tests](https://github.com/gt4sd/gt4sd-core/actions/workflows/tests.yaml/badge.svg)](https://github.com/gt4sd/gt4sd-core/actions)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![Contributions](https://img.shields.io/badge/contributions-welcome-blue)](https://github.com/GT4SD/gt4sd-core/blob/main/CONTRIBUTING.md)
[![Docs](https://img.shields.io/badge/website-live-brightgreen)](https://gt4sd.github.io/gt4sd-core/)
[![Total downloads](https://static.pepy.tech/badge/gt4sd)](https://pepy.tech/project/gt4sd)
[![Monthly downloads](https://static.pepy.tech/badge/gt4sd/month)](https://pepy.tech/project/gt4sd)
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/GT4SD/gt4sd-core/main)
[![DOI](https://zenodo.org/badge/458309249.svg)](https://zenodo.org/badge/latestdoi/458309249)
[![2022 IEEE Open Software Services Award](https://img.shields.io/badge/Award-2022%20IEEE%20Open%20Software%20Services%20Award-yellow)](https://conferences.computer.org/services/2022/awards/oss_award.html)
[![Paper DOI: 10.1038/s41524-023-01028-1](https://zenodo.org/badge/DOI/10.1038/s41524-023-01028-1.svg)](https://www.nature.com/articles/s41524-023-01028-1)
<img src="./docs/_static/gt4sd_graphical_abstract.png" alt="logo" width="800">
The **GT4SD** (Generative Toolkit for Scientific Discovery) is an open-source platform to accelerate hypothesis generation in the scientific discovery process. It provides a library for making state-of-the-art generative AI models easier to use.
For full details on the library API and examples see the [docs](https://gt4sd.github.io/gt4sd-core/).
Almost all pretrained models are also available via `gradio`-powered [web apps](https://huggingface.co/GT4SD) on Hugging Face Spaces.
## Installation
### Requirements
Currently `gt4sd` relies on:
- python>=3.7,<=3.10
- pip==24.0
We are actively working on relaxing these, so stay tuned or help us with this by [contributing](./CONTRIBUTING.md) to the project.
### Conda
The recommended way to install `gt4sd` is to create a dedicated conda environment; this ensures all requirements are satisfied. For CPU:
```sh
git clone https://github.com/GT4SD/gt4sd-core.git
cd gt4sd-core/
conda env create -f conda_cpu_mac.yml # for linux use conda_cpu_linux.yml
conda activate gt4sd
pip install gt4sd
```
**NOTE 1:** By default `gt4sd` is installed with CPU requirements. For GPU usage, create the environment with:
```sh
conda env create -f conda_gpu.yml
```
**NOTE 2:** If you want to reuse an existing compatible environment (see [requirements](#requirements)), you can install via `pip`. As of now (:eyes: on this [issue](https://github.com/GT4SD/gt4sd-core/issues/31) for changes), some dependencies require installation from GitHub, so for a complete setup install them with:
```sh
pip install -r vcs_requirements.txt
```
A few VCS dependencies require Git LFS (make sure it's available on your system).
### Development setup & installation
If you would like to contribute to the package, we recommend installing `gt4sd` in
editable mode inside your `conda` environment:
```sh
pip install --no-deps -e .
```
Learn more in [CONTRIBUTING.md](./CONTRIBUTING.md).
## Getting started
After installation you can use `gt4sd` right away in your discovery workflows.
<img src="./docs/_static/gt4sd_case_study.jpg" alt="logo" width="800"/>
### Running inference pipelines in your Python code
Running an algorithm is as easy as typing:
```python
from gt4sd.algorithms.conditional_generation.paccmann_rl.core import (
    PaccMannRLProteinBasedGenerator, PaccMannRL
)
target = 'MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTT'
# algorithm configuration with default parameters
configuration = PaccMannRLProteinBasedGenerator()
# instantiate the algorithm for sampling
algorithm = PaccMannRL(configuration=configuration, target=target)
items = list(algorithm.sample(10))
print(items)
```
Or you can use the `ApplicationsRegistry` to run an algorithm instance using a serialized representation of the algorithm:
```python
from gt4sd.algorithms.registry import ApplicationsRegistry
target = 'MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTT'
algorithm = ApplicationsRegistry.get_application_instance(
    target=target,
    algorithm_type='conditional_generation',
    domain='materials',
    algorithm_name='PaccMannRL',
    algorithm_application='PaccMannRLProteinBasedGenerator',
    generated_length=32,
    # include additional configuration parameters as **kwargs
)
items = list(algorithm.sample(10))
print(items)
```
### Running inference pipelines via the CLI command
GT4SD can run inference pipelines based on the `gt4sd-inference` CLI command.
It allows you to run all inference algorithms directly from the command line.
To see which algorithms are available and how to use the CLI for your favorite model,
check out [examples/cli/README.md](./examples/cli/README.md).
You can run inference pipelines by simply typing:
```console
gt4sd-inference --algorithm_name PaccMannRL --algorithm_application PaccMannRLProteinBasedGenerator --target MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTT --number_of_samples 10
```
The command supports multiple parameters to select an algorithm and configure it for inference:
```console
$ gt4sd-inference --help
usage: gt4sd-inference [-h] [--algorithm_type ALGORITHM_TYPE]
                       [--domain DOMAIN] [--algorithm_name ALGORITHM_NAME]
                       [--algorithm_application ALGORITHM_APPLICATION]
                       [--algorithm_version ALGORITHM_VERSION]
                       [--target TARGET]
                       [--number_of_samples NUMBER_OF_SAMPLES]
                       [--configuration_file CONFIGURATION_FILE]
                       [--print_info [PRINT_INFO]]

optional arguments:
  -h, --help            show this help message and exit
  --algorithm_type ALGORITHM_TYPE
                        Inference algorithm type, supported types:
                        conditional_generation, controlled_sampling,
                        generation, prediction. (default: None)
  --domain DOMAIN       Domain of the inference algorithm, supported types:
                        materials, nlp. (default: None)
  --algorithm_name ALGORITHM_NAME
                        Inference algorithm name. (default: None)
  --algorithm_application ALGORITHM_APPLICATION
                        Inference algorithm application. (default: None)
  --algorithm_version ALGORITHM_VERSION
                        Inference algorithm version. (default: None)
  --target TARGET       Optional target for generation represented as a
                        string. Defaults to None, it can be also provided in
                        the configuration_file as an object, but the
                        commandline takes precendence. (default: None)
  --number_of_samples NUMBER_OF_SAMPLES
                        Number of generated samples, defaults to 5. (default:
                        5)
  --configuration_file CONFIGURATION_FILE
                        Configuration file for the inference pipeline in JSON
                        format. (default: None)
  --print_info [PRINT_INFO]
                        Print info for the selected algorithm, preventing
                        inference run. Defaults to False. (default: False)
```
You can use `gt4sd-inference` to directly get information on the configuration parameters for the selected algorithm:
```console
gt4sd-inference --algorithm_name PaccMannRL --algorithm_application PaccMannRLProteinBasedGenerator --print_info
INFO:gt4sd.cli.inference:Selected algorithm: {'algorithm_type': 'conditional_generation', 'domain': 'materials', 'algorithm_name': 'PaccMannRL', 'algorithm_application': 'PaccMannRLProteinBasedGenerator', 'algorithm_version': 'v0'}
INFO:gt4sd.cli.inference:Selected algorithm support the following configuration parameters:
{
  "batch_size": {
    "description": "Batch size used for the generative model sampling.",
    "title": "Batch Size",
    "default": 32,
    "type": "integer",
    "optional": true
  },
  "temperature": {
    "description": "Temperature parameter for the softmax sampling in decoding.",
    "title": "Temperature",
    "default": 1.4,
    "type": "number",
    "optional": true
  },
  "generated_length": {
    "description": "Maximum length in tokens of the generated molcules (relates to the SMILES length).",
    "title": "Generated Length",
    "default": 100,
    "type": "integer",
    "optional": true
  }
}
Target information:
{
  "target": {
    "title": "Target protein sequence",
    "description": "AA sequence of the protein target to generate non-toxic ligands against.",
    "type": "string"
  }
}
```
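These parameters can also be supplied via `--configuration_file`. As a minimal sketch (standard library only, and assuming the JSON keys simply mirror the parameter and target names printed above), you could generate such a file with:

```python
import json

# Hypothetical configuration for PaccMannRLProteinBasedGenerator, using the
# parameter names and target field reported by --print_info above.
configuration = {
    "target": "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTT",
    "batch_size": 16,
    "temperature": 1.2,
    "generated_length": 64,
}

with open("paccmann_rl_config.json", "w") as fp:
    json.dump(configuration, fp, indent=2)
```

The resulting file can then be passed via `--configuration_file paccmann_rl_config.json`.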
### Running training pipelines via the CLI command
GT4SD provides a trainer client based on the `gt4sd-trainer` CLI command.
The trainer currently supports the following training pipelines:
- `language-modeling-trainer`: language modelling via HuggingFace transformers and PyTorch Lightning.
- `paccmann-vae-trainer`: PaccMann VAE models.
- `granular-trainer`: multimodal compositional autoencoders supporting MLP, RNN and Transformer layers.
- `guacamol-lstm-trainer`: GuacaMol LSTM models.
- `moses-organ-trainer`: Moses Organ implementation.
- `moses-vae-trainer`: Moses VAE models.
- `torchdrug-gcpn-trainer`: TorchDrug Graph Convolutional Policy Network model.
- `torchdrug-graphaf-trainer`: TorchDrug autoregressive GraphAF model.
- `diffusion-trainer`: Diffusers model.
- `gflownet-trainer`: GFlowNet model.
```console
$ gt4sd-trainer --help
usage: gt4sd-trainer [-h] --training_pipeline_name TRAINING_PIPELINE_NAME
                     [--configuration_file CONFIGURATION_FILE]

optional arguments:
  -h, --help            show this help message and exit
  --training_pipeline_name TRAINING_PIPELINE_NAME
                        Training type of the converted model, supported types:
                        granular-trainer, language-modeling-trainer, paccmann-
                        vae-trainer. (default: None)
  --configuration_file CONFIGURATION_FILE
                        Configuration file for the trainining. It can be used
                        to completely by-pass pipeline specific arguments.
                        (default: None)
```
To launch a training you have two options.
You can either specify the training pipeline and the path of a configuration file that contains the needed training parameters:
```sh
gt4sd-trainer --training_pipeline_name ${TRAINING_PIPELINE_NAME} --configuration_file ${CONFIGURATION_FILE}
```
Or you can directly provide the needed parameters as arguments:
```sh
gt4sd-trainer --training_pipeline_name language-modeling-trainer --type mlm --model_name_or_path mlm --training_file /path/to/train_file.jsonl --validation_file /path/to/valid_file.jsonl
```
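For the first option, the configuration file simply bundles the pipeline-specific arguments. A minimal sketch for generating one (standard library only; the assumption that the JSON keys mirror the command-line argument names is not verified here):

```python
import json

# Hypothetical configuration for language-modeling-trainer; keys are assumed
# to mirror the pipeline's command-line argument names.
training_configuration = {
    "type": "mlm",
    "model_name_or_path": "mlm",
    "training_file": "/path/to/train_file.jsonl",
    "validation_file": "/path/to/valid_file.jsonl",
}

with open("lm_trainer_config.json", "w") as fp:
    json.dump(training_configuration, fp, indent=2)
```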
To get more info on a specific training pipeline's arguments, simply type:
```sh
gt4sd-trainer --training_pipeline_name ${TRAINING_PIPELINE_NAME} --help
```
### Saving a trained algorithm for inference via the CLI command
Once a training pipeline has been run via `gt4sd-trainer`, you can save the trained algorithm via `gt4sd-saving` for use in compatible inference pipelines.
Here is a small example for the `PaccMannGP` algorithm ([paper](https://doi.org/10.1021/acs.jcim.1c00889)).
You can train a model with `gt4sd-trainer` (a quick training run on little data; not recommended for a realistic model :warning:):
```sh
gt4sd-trainer --training_pipeline_name paccmann-vae-trainer --epochs 250 --batch_size 4 --n_layers 1 --rnn_cell_size 16 --latent_dim 16 --train_smiles_filepath src/gt4sd/training_pipelines/tests/molecules.smi --test_smiles_filepath src/gt4sd/training_pipelines/tests/molecules.smi --model_path /tmp/gt4sd-paccmann-gp/ --training_name fast-example --eval_interval 15 --save_interval 15 --selfies
```
Save the model with the compatible inference pipeline using `gt4sd-saving`:
```sh
gt4sd-saving --training_pipeline_name paccmann-vae-trainer --model_path /tmp/gt4sd-paccmann-gp/ --training_name fast-example --target_version fast-example-v0 --algorithm_application PaccMannGPGenerator
```
Run the algorithm via `gt4sd-inference` (again, the model produced in this example is trained on dummy data and will give dummy outputs; do not use it as is :no_good:):
```sh
gt4sd-inference --algorithm_name PaccMannGP --algorithm_application PaccMannGPGenerator --algorithm_version fast-example-v0 --number_of_samples 5 --target '{"molwt": {"target": 60.0}}'
```
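The `--target` argument above is a JSON string; building it with `json.dumps` avoids shell-quoting mistakes (a standard-library sketch; the `molwt` property specification is copied from the example above):

```python
import json

# Build the JSON string passed as --target above; json.dumps guarantees
# valid JSON regardless of shell quoting.
target = json.dumps({"molwt": {"target": 60.0}})
print(target)  # {"molwt": {"target": 60.0}}
```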
### Uploading a trained algorithm on a public hub via the CLI command
You can easily upload trained and fine-tuned models to the public hub using `gt4sd-upload`. The syntax follows the saving pipeline:
```sh
gt4sd-upload --training_pipeline_name paccmann-vae-trainer --model_path /tmp/gt4sd-paccmann-gp --training_name fast-example --target_version fast-example-v0 --algorithm_application PaccMannGPGenerator
```
**NOTE:** GT4SD can be configured to upload models to a custom or self-hosted COS (Cloud Object Storage).
An example of locally self-hosting a COS instance (MinIO) for uploading your models can be found [here](https://gt4sd.github.io/gt4sd-core/source/gt4sd_server_upload_md.html).
### Computing properties
You can compute properties of your generated samples using the `gt4sd.properties` submodule:
```python
>>> from gt4sd.properties import PropertyPredictorRegistry
>>> similarity_predictor = PropertyPredictorRegistry.get_property_predictor("similarity_seed", {"smiles": "C1=CC(=CC(=C1)Br)CN"})
>>> similarity_predictor("CCO")
0.0333
>>> # let's inspect what other parameters we can set for similarity measuring
>>> similarity_predictor = PropertyPredictorRegistry.get_property_predictor("similarity_seed", {"smiles": "C1=CC(=CC(=C1)Br)CN", "fp_key": "ECFP6"})
>>> similarity_predictor("CCO")
>>> # inspect parameters
>>> PropertyPredictorRegistry.get_property_predictor_parameters_schema("similarity_seed")
'{"title": "SimilaritySeedParameters", "description": "Abstract class for property computation.", "type": "object", "properties": {"smiles": {"title": "Smiles", "example": "c1ccccc1", "type": "string"}, "fp_key": {"title": "Fp Key", "default": "ECFP4", "type": "string"}}, "required": ["smiles"]}'
>>> # predict other properties
>>> qed = PropertyPredictorRegistry.get_property_predictor("qed")
>>> qed('CCO')
0.4068
>>> # list properties
>>> PropertyPredictorRegistry.list_available()
['activity_against_target',
 'aliphaticity',
 ...
 'scscore',
 'similarity_seed',
 'tpsa',
 'weight']
```
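Since `get_property_predictor_parameters_schema` returns the schema as a JSON string, you can parse it with the standard library to inspect parameter names, defaults, and required fields programmatically (the string below is copied verbatim from the output above):

```python
import json

# Schema string as returned for "similarity_seed" above.
schema = json.loads(
    '{"title": "SimilaritySeedParameters", "description": "Abstract class for property computation.", '
    '"type": "object", "properties": {"smiles": {"title": "Smiles", "example": "c1ccccc1", "type": "string"}, '
    '"fp_key": {"title": "Fp Key", "default": "ECFP4", "type": "string"}}, "required": ["smiles"]}'
)

# Print each parameter with its default (or mark it as required).
for name, spec in schema["properties"].items():
    print(name, spec.get("default", "<required>"))
# smiles <required>
# fp_key ECFP4
```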
### Additional examples
Find more examples in [notebooks](./notebooks).
You can play with them right away using the provided Dockerfile: build the image, then run it to explore the examples using Jupyter:
```sh
docker build -f Dockerfile -t gt4sd-demo .
docker run -p 8888:8888 gt4sd-demo
```
## Supported packages
Beyond implementing various generative modeling inference and training pipelines, GT4SD is designed to provide a high-level API that implements a harmonized interface for several existing packages:
- [GuacaMol](https://github.com/BenevolentAI/guacamol): inference pipelines for the baseline models and training pipelines for LSTM models.
- [Moses](https://github.com/molecularsets/moses): inference pipelines for the baseline models and training pipelines for VAEs and Organ.
- [TorchDrug](https://github.com/DeepGraphLearning/torchdrug): inference and training pipelines for GCPN and GraphAF models. Training pipelines support custom datasets as well as datasets native in TorchDrug.
- [MoLeR](https://github.com/microsoft/molecule-generation): inference pipelines for MoLeR (**MO**lecule-**LE**vel **R**epresentation) generative models for de-novo and scaffold-based generation.
- [TAPE](https://github.com/songlab-cal/tape): encoder modules compatible with the protein language models.
- [PaccMann](https://github.com/PaccMann/): inference pipelines for all algorithms of the PaccMann family as well as training pipelines for the generative VAEs.
- [transformers](https://huggingface.co/transformers): training and inference pipelines for generative models from [HuggingFace Models](https://huggingface.co/models).
- [diffusers](https://github.com/huggingface/diffusers): training and inference pipelines for generative models from [Diffusers Models](https://github.com/huggingface/diffusers).
- [GFlowNets](https://github.com/recursionpharma/gflownet): training and inference pipelines for [Generative Flow Networks](https://yoshuabengio.org/2022/03/05/generative-flow-networks/).
- [MolGX](https://github.com/GT4SD/molgx-core/): training and inference pipelines to generate small molecules satisfying target properties. The full implementation of MolGX, including additional functionalities, is available [here](https://github.com/GT4SD/molgx-core/).
- [Regression Transformers](https://github.com/IBM/regression-transformer/): training and inference pipelines to generate small molecules, polymers or peptides based on numerical property constraints. For details [read the paper](https://www.nature.com/articles/s42256-023-00639-z).
## References
If you use `gt4sd` in your projects, please consider citing the following:
```bib
@software{GT4SD,
  author = {GT4SD Team},
  month = {2},
  title = {{GT4SD (Generative Toolkit for Scientific Discovery)}},
  url = {https://github.com/GT4SD/gt4sd-core},
  version = {main},
  year = {2022}
}

@article{manica2022gt4sd,
  title={Accelerating material design with the generative toolkit for scientific discovery},
  author={Manica, Matteo and Born, Jannis and Cadow, Joris and Christofidellis, Dimitrios and Dave, Ashish and Clarke, Dean and Teukam, Yves Gaetan Nana and Giannone, Giorgio and Hoffman, Samuel C and Buchan, Matthew and others},
  journal={npj Computational Materials},
  volume={9},
  number={1},
  pages={69},
  year={2023},
  publisher={Nature Publishing Group UK London}
}
```
## License
The `gt4sd` codebase is under the MIT license.
For individual model usage, please refer to the model licenses found in the original packages.
Raw data
{
"_id": null,
"home_page": null,
"name": "gt4sd",
"maintainer": null,
"docs_url": null,
"requires_python": null,
"maintainer_email": null,
"keywords": "GT4SD Generative Models Inference Training",
"author": "GT4SD team",
"author_email": null,
"download_url": "https://files.pythonhosted.org/packages/bc/51/12d3556863958876b766e4bc7edd3e8ad4a889488e9a06c10dd40618d3fc/gt4sd-1.4.2.tar.gz",
"platform": null,
"description": "# GT4SD (Generative Toolkit for Scientific Discovery)\n\n[![PyPI version](https://badge.fury.io/py/gt4sd.svg)](https://badge.fury.io/py/gt4sd)\n[![Actions tests](https://github.com/gt4sd/gt4sd-core/actions/workflows/tests.yaml/badge.svg)](https://github.com/gt4sd/gt4sd-core/actions)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)\n[![Contributions](https://img.shields.io/badge/contributions-welcome-blue)](https://github.com/GT4SD/gt4sd-core/blob/main/CONTRIBUTING.md)\n[![Docs](https://img.shields.io/badge/website-live-brightgreen)](https://gt4sd.github.io/gt4sd-core/)\n[![Total downloads](https://static.pepy.tech/badge/gt4sd)](https://pepy.tech/project/gt4sd)\n[![Monthly downloads](https://static.pepy.tech/badge/gt4sd/month)](https://pepy.tech/project/gt4sd)\n[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/GT4SD/gt4sd-core/main)\n[![DOI](https://zenodo.org/badge/458309249.svg)](https://zenodo.org/badge/latestdoi/458309249)\n[![2022 IEEE Open Software Services Award](https://img.shields.io/badge/Award-2022%20IEEE%20Open%20Software%20Services%20Award-yellow)](https://conferences.computer.org/services/2022/awards/oss_award.html)\n[![Paper DOI: 10.1038/s41524-023-01028-1](https://zenodo.org/badge/DOI/10.1038/s41524-023-01028-1.svg)](https://www.nature.com/articles/s41524-023-01028-1)\n\n<img src=\"./docs/_static/gt4sd_graphical_abstract.png\" alt=\"logo\" width=\"800\">\n\n\nThe **GT4SD** (Generative Toolkit for Scientific Discovery) is an open-source platform to accelerate hypothesis generation in the scientific discovery process. 
It provides a library for making state-of-the-art generative AI models easier to use.\n\nFor full details on the library API and examples see the [docs](https://gt4sd.github.io/gt4sd-core/).\nAlmost all pretrained models are also available via `gradio`-powered [web apps](https://huggingface.co/GT4SD) on Hugging Face Spaces.\n\n## Installation\n\n### Requirements\n\nCurrently `gt4sd` relies on:\n\n- python>=3.7,<=3.10\n- pip==24.0\n\nWe are actively working on relaxing these, so stay tuned or help us with this by [contributing](./CONTRIBUTING.md) to the project.\n\n### Conda\n\nThe recommended way to install the `gt4sd` is to create a dedicated conda environment, this will ensure all requirements are satisfied. For CPU:\n\n```sh\ngit clone https://github.com/GT4SD/gt4sd-core.git\ncd gt4sd-core/\nconda env create -f conda_cpu_mac.yml # for linux use conda_cpu_linux.yml\nconda activate gt4sd\npip install gt4sd\n```\n\n**NOTE 1:** By default `gt4sd` is installed with CPU requirements. For GPU usage replace with:\n\n```sh\nconda env create -f conda_gpu.yml\n```\n\n**NOTE 2:** In case you want to reuse an existing compatible environment (see [requirements](#requirements)), you can use `pip`, but as of now (:eyes: on [issue](https://github.com/GT4SD/gt4sd-core/issues/31) for changes), some dependencies require installation from GitHub, so for a complete setup install them with:\n\n```sh\npip install -r vcs_requirements.txt\n```\n\nA few VCS dependencies require Git LFS (make sure it's available on your system).\n\n### Development setup & installation\n\nIf you would like to contribute to the package, we recommend to install gt4sd in\neditable mode inside your `conda` environment:\n\n```sh\npip install --no-deps -e .\n```\n\nLearn more in [CONTRIBUTING.md](./CONTRIBUTING.md)\n\n## Getting started\n\nAfter install you can use `gt4sd` right away in your discovery workflows.\n\n<img src=\"./docs/_static/gt4sd_case_study.jpg\" alt=\"logo\" width=\"800\"/>\n\n\n### Running 
inference pipelines in your python code\n\nRunning an algorithm is as easy as typing:\n\n```python\nfrom gt4sd.algorithms.conditional_generation.paccmann_rl.core import (\n PaccMannRLProteinBasedGenerator, PaccMannRL\n)\ntarget = 'MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTT'\n# algorithm configuration with default parameters\nconfiguration = PaccMannRLProteinBasedGenerator()\n# instantiate the algorithm for sampling\nalgorithm = PaccMannRL(configuration=configuration, target=target)\nitems = list(algorithm.sample(10))\nprint(items)\n```\n\nOr you can use the `ApplicationRegistry` to run an algorithm instance using a serialized representation of the algorithm:\n\n```python\nfrom gt4sd.algorithms.registry import ApplicationsRegistry\ntarget = 'MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTT'\nalgorithm = ApplicationsRegistry.get_application_instance(\n target=target,\n algorithm_type='conditional_generation',\n domain='materials',\n algorithm_name='PaccMannRL',\n algorithm_application='PaccMannRLProteinBasedGenerator',\n generated_length=32,\n # include additional configuration parameters as **kwargs\n)\nitems = list(algorithm.sample(10))\nprint(items)\n```\n\n### Running inference pipelines via the CLI command\n\nGT4SD can run inference pipelines based on the `gt4sd-inference` CLI command.\nIt allows to run all inference algorithms directly from the command line.\nTo see which algorithms are available and how to use the CLI for your favorite model,\ncheck out [examples/cli/README.md](./examples/cli/README.md).\n\nYou can run inference pipelines simply typing:\n\n```console\ngt4sd-inference --algorithm_name PaccMannRL --algorithm_application PaccMannRLProteinBasedGenerator --target MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTT --number_of_samples 10\n```\n\nThe command supports multiple parameters to select an algorithm and configure it for inference:\n\n```console\n$ gt4sd-inference --help\nusage: gt4sd-inference [-h] [--algorithm_type ALGORITHM_TYPE]\n [--domain DOMAIN] 
[--algorithm_name ALGORITHM_NAME]\n [--algorithm_application ALGORITHM_APPLICATION]\n [--algorithm_version ALGORITHM_VERSION]\n [--target TARGET]\n [--number_of_samples NUMBER_OF_SAMPLES]\n [--configuration_file CONFIGURATION_FILE]\n [--print_info [PRINT_INFO]]\n\noptional arguments:\n -h, --help show this help message and exit\n --algorithm_type ALGORITHM_TYPE\n Inference algorithm type, supported types:\n conditional_generation, controlled_sampling,\n generation, prediction. (default: None)\n --domain DOMAIN Domain of the inference algorithm, supported types:\n materials, nlp. (default: None)\n --algorithm_name ALGORITHM_NAME\n Inference algorithm name. (default: None)\n --algorithm_application ALGORITHM_APPLICATION\n Inference algorithm application. (default: None)\n --algorithm_version ALGORITHM_VERSION\n Inference algorithm version. (default: None)\n --target TARGET Optional target for generation represented as a\n string. Defaults to None, it can be also provided in\n the configuration_file as an object, but the\n commandline takes precendence. (default: None)\n --number_of_samples NUMBER_OF_SAMPLES\n Number of generated samples, defaults to 5. (default:\n 5)\n --configuration_file CONFIGURATION_FILE\n Configuration file for the inference pipeline in JSON\n format. (default: None)\n --print_info [PRINT_INFO]\n Print info for the selected algorithm, preventing\n inference run. Defaults to False. 
(default: False)\n```\n\nYou can use `gt4sd-inference` to directly get information on the configuration parameters for the selected algorithm:\n\n```console\ngt4sd-inference --algorithm_name PaccMannRL --algorithm_application PaccMannRLProteinBasedGenerator --print_info\nINFO:gt4sd.cli.inference:Selected algorithm: {'algorithm_type': 'conditional_generation', 'domain': 'materials', 'algorithm_name': 'PaccMannRL', 'algorithm_application': 'PaccMannRLProteinBasedGenerator', 'algorithm_version': 'v0'}\nINFO:gt4sd.cli.inference:Selected algorithm support the following configuration parameters:\n{\n \"batch_size\": {\n \"description\": \"Batch size used for the generative model sampling.\",\n \"title\": \"Batch Size\",\n \"default\": 32,\n \"type\": \"integer\",\n \"optional\": true\n },\n \"temperature\": {\n \"description\": \"Temperature parameter for the softmax sampling in decoding.\",\n \"title\": \"Temperature\",\n \"default\": 1.4,\n \"type\": \"number\",\n \"optional\": true\n },\n \"generated_length\": {\n \"description\": \"Maximum length in tokens of the generated molcules (relates to the SMILES length).\",\n \"title\": \"Generated Length\",\n \"default\": 100,\n \"type\": \"integer\",\n \"optional\": true\n }\n}\nTarget information:\n{\n \"target\": {\n \"title\": \"Target protein sequence\",\n \"description\": \"AA sequence of the protein target to generate non-toxic ligands against.\",\n \"type\": \"string\"\n }\n}\n```\n\n### Running training pipelines via the CLI command\n\nGT4SD provides a trainer client based on the `gt4sd-trainer` CLI command.\n\nThe trainer currently supports the following training pipelines:\n\n- `language-modeling-trainer`: language modelling via HuggingFace transfomers and PyTorch Lightning.\n- `paccmann-vae-trainer`: PaccMann VAE models.\n- `granular-trainer`: multimodal compositional autoencoders supporting MLP, RNN and Transformer layers.\n- `guacamol-lstm-trainer`: GuacaMol LSTM models.\n- `moses-organ-trainer`: Moses Organ 
implementation.\n- `moses-vae-trainer`: Moses VAE models.\n- `torchdrug-gcpn-trainer`: TorchDrug Graph Convolutional Policy Network model.\n- `torchdrug-graphaf-trainer`: TorchDrug autoregressive GraphAF model.\n- `diffusion-trainer`: Diffusers model.\n- `gflownet-trainer`: GFlowNet model.\n\n```console\n$ gt4sd-trainer --help\nusage: gt4sd-trainer [-h] --training_pipeline_name TRAINING_PIPELINE_NAME\n [--configuration_file CONFIGURATION_FILE]\n\noptional arguments:\n -h, --help show this help message and exit\n --training_pipeline_name TRAINING_PIPELINE_NAME\n Training type of the converted model, supported types:\n granular-trainer, language-modeling-trainer, paccmann-\n vae-trainer. (default: None)\n --configuration_file CONFIGURATION_FILE\n Configuration file for the trainining. It can be used\n to completely by-pass pipeline specific arguments.\n (default: None)\n```\n\nTo launch a training you have two options.\n\nYou can either specify the training pipeline and the path of a configuration file that contains the needed training parameters:\n\n```sh\ngt4sd-trainer --training_pipeline_name ${TRAINING_PIPELINE_NAME} --configuration_file ${CONFIGURATION_FILE}\n```\n\nOr you can provide directly the needed parameters as arguments:\n\n```sh\ngt4sd-trainer --training_pipeline_name language-modeling-trainer --type mlm --model_name_or_path mlm --training_file /path/to/train_file.jsonl --validation_file /path/to/valid_file.jsonl\n```\n\nTo get more info on a specific training pipeleins argument simply type:\n\n```sh\ngt4sd-trainer --training_pipeline_name ${TRAINING_PIPELINE_NAME} --help\n```\n\n### Saving a trained algorithm for inference via the CLI command\n\nOnce a training pipeline has been run via the `gt4sd-trainer`, it's possible to save the trained algorithm via `gt4sd-saving` for usage in compatible inference pipelines.\n\nHere a small example for `PaccMannGP` algorithm ([paper](https://doi.org/10.1021/acs.jcim.1c00889)).\n\nYou can train a model with 
`gt4sd-trainer` (quick training using few data, not really recommended for a realistic model :warning:):\n\n```sh\ngt4sd-trainer --training_pipeline_name paccmann-vae-trainer --epochs 250 --batch_size 4 --n_layers 1 --rnn_cell_size 16 --latent_dim 16 --train_smiles_filepath src/gt4sd/training_pipelines/tests/molecules.smi --test_smiles_filepath src/gt4sd/training_pipelines/tests/molecules.smi --model_path /tmp/gt4sd-paccmann-gp/ --training_name fast-example --eval_interval 15 --save_interval 15 --selfies\n```\n\nSave the model with the compatible inference pipeline using `gt4sd-saving`:\n\n```sh\ngt4sd-saving --training_pipeline_name paccmann-vae-trainer --model_path /tmp/gt4sd-paccmann-gp/ --training_name fast-example --target_version fast-example-v0 --algorithm_application PaccMannGPGenerator\n```\n\nRun the algorithm via `gt4sd-inference` (again the model produced in the example is trained on dummy data and will give dummy outputs, do not use it as is :no_good:):\n\n```sh\ngt4sd-inference --algorithm_name PaccMannGP --algorithm_application PaccMannGPGenerator --algorithm_version fast-example-v0 --number_of_samples 5 --target '{\"molwt\": {\"target\": 60.0}}'\n```\n\n### Uploading a trained algorithm on a public hub via the CLI command\n\nYou can upload trained and finetuned models easily in the public hub using `gt4sd-upload`. 
The syntax follows the saving pipeline:\n\n```sh\ngt4sd-upload --training_pipeline_name paccmann-vae-trainer --model_path /tmp/gt4sd-paccmann-gp --training_name fast-example --target_version fast-example-v0 --algorithm_application PaccMannGPGenerator\n```\n\n**NOTE:** GT4SD can be configured to upload models to a custom or self-hosted COS.\nAn example on self-hosting locally a COS (minio) where to upload your models can be found [here](https://gt4sd.github.io/gt4sd-core/source/gt4sd_server_upload_md.html).\n\n\n### Computing properties\n\nYou can compute properties of your generated samples using the `gt4sd.properties` submodule:\n\n```python\n>>>from gt4sd.properties import PropertyPredictorRegistry\n>>>similarity_predictor = PropertyPredictorRegistry.get_property_predictor(\"similarity_seed\", {\"smiles\" : \"C1=CC(=CC(=C1)Br)CN\"})\n>>>similarity_predictor(\"CCO\")\n0.0333\n>>># let's inspect what other parameters we can set for similarity measuring\n>>>similarity_predictor = PropertyPredictorRegistry.get_property_predictor(\"similarity_seed\", {\"smiles\" : \"C1=CC(=CC(=C1)Br)CN\", \"fp_key\": \"ECFP6\"})\n>>>similarity_predictor(\"CCO\")\n>>># inspect parameters\n>>>PropertyPredictorRegistry.get_property_predictor_parameters_schema(\"similarity_seed\")\n'{\"title\": \"SimilaritySeedParameters\", \"description\": \"Abstract class for property computation.\", \"type\": \"object\", \"properties\": {\"smiles\": {\"title\": \"Smiles\", \"example\": \"c1ccccc1\", \"type\": \"string\"}, \"fp_key\": {\"title\": \"Fp Key\", \"default\": \"ECFP4\", \"type\": \"string\"}}, \"required\": [\"smiles\"]}'\n>>># predict other properties\n>>>qed = PropertyPredictorRegistry.get_property_predictor(\"qed\")\n>>>qed('CCO')\n0.4068\n>>># list properties\n>>>PropertyPredictorRegistry.list_available()\n['activity_against_target',\n 'aliphaticity',\n ...\n 'scscore',\n 'similarity_seed',\n 'tpsa',\n 'weight']\n```\n\n### Additional examples\n\nFind more examples in 
[notebooks](./notebooks).

You can play with them right away using the provided Dockerfile: simply build the image and run it to explore the examples in Jupyter:

```sh
docker build -f Dockerfile -t gt4sd-demo .
docker run -p 8888:8888 gt4sd-demo
```

## Supported packages

Beyond implementing various generative modeling inference and training pipelines, GT4SD is designed to provide a high-level API that implements a harmonized interface for several existing packages:

- [GuacaMol](https://github.com/BenevolentAI/guacamol): inference pipelines for the baseline models and training pipelines for LSTM models.
- [Moses](https://github.com/molecularsets/moses): inference pipelines for the baseline models and training pipelines for VAEs and Organ.
- [TorchDrug](https://github.com/DeepGraphLearning/torchdrug): inference and training pipelines for GCPN and GraphAF models. Training pipelines support custom datasets as well as datasets native to TorchDrug.
- [MoLeR](https://github.com/microsoft/molecule-generation): inference pipelines for MoLeR (**MO**lecule-**LE**vel **R**epresentation) generative models for de-novo and scaffold-based generation.
- [TAPE](https://github.com/songlab-cal/tape): encoder modules compatible with the protein language models.
- [PaccMann](https://github.com/PaccMann/): inference pipelines for all algorithms of the PaccMann family as well as training pipelines for the generative VAEs.
- [transformers](https://huggingface.co/transformers): training and inference pipelines for generative models from [HuggingFace Models](https://huggingface.co/models).
- [diffusers](https://github.com/huggingface/diffusers): training and inference pipelines for generative models from [Diffusers Models](https://github.com/huggingface/diffusers).
- [GFlowNets](https://github.com/recursionpharma/gflownet): training and inference pipelines for [Generative Flow Networks](https://yoshuabengio.org/2022/03/05/generative-flow-networks/).
- [MolGX](https://github.com/GT4SD/molgx-core/): training and inference pipelines to generate small molecules satisfying target properties. The full implementation of MolGX, including additional functionalities, is available [here](https://github.com/GT4SD/molgx-core/).
- [Regression Transformers](https://github.com/IBM/regression-transformer/): training and inference pipelines to generate small molecules, polymers, or peptides based on numerical property constraints. For details, [read the paper](https://www.nature.com/articles/s42256-023-00639-z).

## References

If you use `gt4sd` in your projects, please consider citing the following:

```bib
@software{GT4SD,
  author = {GT4SD Team},
  month = {2},
  title = {{GT4SD (Generative Toolkit for Scientific Discovery)}},
  url = {https://github.com/GT4SD/gt4sd-core},
  version = {main},
  year = {2022}
}

@article{manica2022gt4sd,
  title={Accelerating material design with the generative toolkit for scientific discovery},
  author={Manica, Matteo and Born, Jannis and Cadow, Joris and Christofidellis, Dimitrios and Dave, Ashish and Clarke, Dean and Teukam, Yves Gaetan Nana and Giannone, Giorgio and Hoffman, Samuel C and Buchan, Matthew and others},
  journal={npj Computational Materials},
  volume={9},
  number={1},
  pages={69},
  year={2023},
  publisher={Nature Publishing Group UK London}
}
```

## License

The `gt4sd` codebase is under the MIT license.
For individual model usage, please refer to the model licenses found in the original packages.