gt4sd-molformer

Name: gt4sd-molformer
Version: 0.1.3
Summary: MolFormer's submodule of GT4SD.
Author: GT4SD team
Upload time: 2023-03-17 16:51:42

# GT4SD's submodule for the MolFormer model

GT4SD submodule for the MolFormer model. The original MolFormer codebase can be found at https://github.com/IBM/molformer.
We refer users to the original repository for usage information and further details about the model.


### Development setup & installation

The recommended way to install `gt4sd-molformer` is to create a dedicated conda environment; this ensures all requirements are satisfied:

```sh
git clone https://github.com/GT4SD/gt4sd-molformer.git
cd gt4sd-molformer/
conda env create -f conda.yml
conda activate gt4sd-molformer
```
Then run:
```sh
pip install .
```
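After the installation, a quick import check confirms that the package is available in the active environment. This is a minimal sketch; the importable module name `gt4sd_molformer` is assumed from the package name:
```python
# Sanity check: the module should import without errors once installed.
# The module name `gt4sd_molformer` is assumed from the package name.
import gt4sd_molformer

print("gt4sd-molformer installed at:", gt4sd_molformer.__file__)
```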

If you would like to contribute to the package, install it in editable mode with the development extras:
```sh
pip install -e ".[dev]" 
```

Note: to train or fine-tune a model, [Apex optimizers](https://nvidia.github.io/apex/optimizers.html) must be compiled with CUDA and C++ extensions. This can be done using the provided install_apex.sh script.
Before executing the script, export the path to your CUDA 11 installation in the `CUDA_HOME` environment variable:

```sh
export CUDA_HOME=/path/to/cuda-11  # location of your CUDA 11 installation
bash install_apex.sh
```
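
If the build succeeded, the fused optimizers should import and instantiate without errors. A minimal, illustrative check (assuming PyTorch and a CUDA-capable GPU are available in the environment):

```python
# Verify that Apex was compiled with its CUDA and C++ extensions:
# FusedLAMB only works when the fused kernels were built successfully.
import torch
from apex.optimizers import FusedLAMB

model = torch.nn.Linear(8, 1).cuda()                 # any small module on the GPU
optimizer = FusedLAMB(model.parameters(), lr=1e-3)   # raises if extensions are missing
print("Apex fused optimizers available:", type(optimizer).__name__)
```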


### References

If you use `MolFormer` in your projects, please consider citing the following:

```bib
@article{10.1038/s42256-022-00580-7,
  title = {{Large-scale chemical language representations capture molecular structure and properties}},
  author = {Ross, Jerret and Belgodere, Brian and Chenthamarakshan, Vijil and Padhi, Inkit and Mroueh, Youssef and Das, Payel},
  journal = {Nature Machine Intelligence},
  doi = {10.1038/s42256-022-00580-7},
  year = {2022},
  volume = {4},
  number = {12},
  pages = {1256--1264}
}

@misc{https://doi.org/10.48550/arxiv.2106.09553,
  doi = {10.48550/ARXIV.2106.09553},
  url = {https://arxiv.org/abs/2106.09553},
  author = {Ross, Jerret and Belgodere, Brian and Chenthamarakshan, Vijil and Padhi, Inkit and Mroueh, Youssef and Das, Payel},
  keywords = {Machine Learning (cs.LG), Computation and Language (cs.CL), Biomolecules (q-bio.BM), FOS: Computer and information sciences, FOS: Computer and information sciences, FOS: Biological sciences, FOS: Biological sciences},
  title = {Large-Scale Chemical Language Representations Capture Molecular Structure and Properties},
  publisher = {arXiv},
  year = {2021},
  copyright = {arXiv.org perpetual, non-exclusive license}
}

```

If you use `gt4sd` in your projects, please consider citing the following:

```bib
@article{manica2022gt4sd,
  title={GT4SD: Generative Toolkit for Scientific Discovery},
  author={Manica, Matteo and Cadow, Joris and Christofidellis, Dimitrios and Dave, Ashish and Born, Jannis and Clarke, Dean and Teukam, Yves Gaetan Nana and Hoffman, Samuel C and Buchan, Matthew and Chenthamarakshan, Vijil and others},
  journal={arXiv preprint arXiv:2207.03928},
  year={2022}
}
```

### License

The `gt4sd` codebase is under the MIT license.
For individual model usage, please refer to the model licenses found in the original packages.

            
