Language models for astrochemistry
==================================

|PyPI| |Status| |Python Version| |License|

|Read the Docs| |Tests| |Codecov|

|pre-commit| |Black|

.. |PyPI| image:: https://img.shields.io/pypi/v/astrochem_embedding.svg
   :target: https://pypi.org/project/astrochem_embedding/
   :alt: PyPI
.. |Status| image:: https://img.shields.io/pypi/status/astrochem_embedding.svg
   :target: https://pypi.org/project/astrochem_embedding/
   :alt: Status
.. |Python Version| image:: https://img.shields.io/pypi/pyversions/astrochem_embedding
   :target: https://pypi.org/project/astrochem_embedding
   :alt: Python Version
.. |License| image:: https://img.shields.io/pypi/l/astrochem_embedding
   :target: https://opensource.org/licenses/MIT
   :alt: License
.. |Read the Docs| image:: https://img.shields.io/readthedocs/astrochem_embedding/latest.svg?label=Read%20the%20Docs
   :target: https://astrochem_embedding.readthedocs.io/
   :alt: Read the documentation at https://astrochem_embedding.readthedocs.io/
.. |Tests| image:: https://github.com/laserkelvin/astrochem_embedding/workflows/Tests/badge.svg
   :target: https://github.com/laserkelvin/astrochem_embedding/actions?workflow=Tests
   :alt: Tests
.. |Codecov| image:: https://codecov.io/gh/laserkelvin/astrochem_embedding/branch/main/graph/badge.svg
   :target: https://codecov.io/gh/laserkelvin/astrochem_embedding
   :alt: Codecov
.. |pre-commit| image:: https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white
   :target: https://github.com/pre-commit/pre-commit
   :alt: pre-commit
.. |Black| image:: https://img.shields.io/badge/code%20style-black-000000.svg
   :target: https://github.com/psf/black
   :alt: Black

Features
--------
The goal of this project is to provide off-the-shelf language models that work
for studies in astrochemistry. The needs of astrochemistry differ from those of general
molecule discovery and chemistry, for example in the emphasis on transient (e.g. open-shell)
molecules and isotopologues.

To support these aspects, we provide lightweight language models (currently just
a GRU seq2seq model) based on `SELFIES`_ syntax and PyTorch. Elements of
this project are designed to strike a balance between research agility and production
use, with a lot of emphasis on reproducibility through PyTorch Lightning
and on a general user interface that doesn't require the user to know how to develop neural networks.

The current highlight of this package is ``VICGAE``, a variance-invariance-covariance
regularized GRU autoencoder (``VICGRUAE`` would arguably be the more accurate name). I intend to
write this up in more detail in the near future, but the basic premise is this:
a pair of GRUs forms a seq2seq model whose task is to reconstruct SELFIES strings in which
tokens have been randomly masked. To improve chemical representation learning,
the VIC regularization uses self-supervision to encourage chemically descriptive token
embeddings: variance (e.g. [CH2] should be distinguishable from [OH]), invariance (e.g.
isotopic substitution should give more or less the same representation), and covariance (i.e.
minimizing information sharing between embedding dimensions). While the GRU does the actual
SELFIES reconstruction, the VIC regularization is done at the token embedding level.
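
To make the three terms concrete, here is a minimal PyTorch sketch of a VICReg-style regularizer. It is an illustration under stated assumptions, not this package's actual implementation: the function name ``vic_regularization``, the ``gamma`` and ``eps`` defaults, and the loss weights in the usage lines are all placeholders chosen for the example. The inputs ``z_a`` and ``z_b`` stand for embeddings of two views of the same tokens (for instance, with and without an isotopic substitution).

```python
import torch
import torch.nn.functional as F


def vic_regularization(z_a, z_b, gamma=1.0, eps=1e-4):
    """Return (variance, invariance, covariance) terms for two views.

    z_a, z_b: tensors of shape (N, D) holding N embeddings of dimension D.
    All names and defaults here are illustrative assumptions.
    """
    n, d = z_a.shape

    # Invariance: two views of the same token should embed similarly.
    invariance = F.mse_loss(z_a, z_b)

    def variance_term(z):
        # Hinge on the per-dimension standard deviation, so that the
        # embeddings do not collapse onto a single point.
        std = torch.sqrt(z.var(dim=0) + eps)
        return F.relu(gamma - std).mean()

    def covariance_term(z):
        # Penalize off-diagonal covariance, so that different embedding
        # dimensions carry different information.
        z = z - z.mean(dim=0)
        cov = (z.T @ z) / (n - 1)
        off_diag = cov - torch.diag(torch.diag(cov))
        return off_diag.pow(2).sum() / d

    variance = variance_term(z_a) + variance_term(z_b)
    covariance = covariance_term(z_a) + covariance_term(z_b)
    return variance, invariance, covariance


# Usage with random stand-in embeddings (not real model output).
torch.manual_seed(0)
z_a = torch.randn(32, 16)
z_b = z_a + 0.01 * torch.randn(32, 16)
var_term, inv_term, cov_term = vic_regularization(z_a, z_b)
# Placeholder weights; in practice these are hyperparameters to tune.
loss = 25.0 * var_term + 25.0 * inv_term + 1.0 * cov_term
```

In a setup like the one described above, a term of this kind would be added to the sequence reconstruction loss, with the regularizer applied to the token embeddings rather than to the GRU outputs.
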

This has been tested on a few simple comparisons with cosine similarity, comparing isotopic
substitution, element substitution (e.g. C/Si/Ge), and functional group replacement; things
seem to work well for these simple cases.
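
For reference, the cosine similarity used in these comparisons can be sketched in plain Python as below. The vectors are made-up placeholders rather than real embeddings, and the helper name ``cosine_similarity`` is an assumption for this example. With tensors returned by ``embed_smiles``, ``torch.nn.functional.cosine_similarity`` would be the more natural choice.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Placeholder vectors standing in for molecule embeddings (not model output).
vec_a = [0.90, 0.10, 0.30]
vec_b = [0.88, 0.12, 0.31]  # a small perturbation of vec_a
vec_c = [-0.20, 0.80, 0.10]  # a clearly different direction

print(cosine_similarity(vec_a, vec_b))
print(cosine_similarity(vec_a, vec_c))
```

A value close to 1 means the two embeddings point in nearly the same direction, which is the behaviour one would hope for between a molecule and its isotopologue.
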
Requirements
------------
This package requires Python 3.8 or later, as it uses some decorators that are not available in Python 3.7 and earlier.
Installation
------------
The simplest way to get ``astrochem_embedding`` is through PyPI:

.. code:: console

   $ pip install astrochem_embedding

If you're interested in development, want to train your own model,
or make sure you can take advantage of GPU acceleration, I recommend
using ``conda`` for your environment specification:

.. code:: console

   $ conda create -n astrochem_embedding python=3.8
   $ conda activate astrochem_embedding
   $ pip install poetry
   $ poetry install
   $ conda install -c pytorch pytorch torchvision cudatoolkit=11.3

Usage
-----
The quickest way to get started is by loading a pre-trained model:

.. code:: python

   >>> from astrochem_embedding import VICGAE
   >>> import torch
   >>> model = VICGAE.from_pretrained()
   >>> model.embed_smiles("c1ccccc1")

This returns a ``torch.Tensor``. The general interface doesn't
support batching SMILES yet, so operating on many SMILES
strings simply requires looping:

.. code:: python

   >>> smiles = ["c1ccccc1", "[C]#N", "[13c]1ccccc1"]
   >>> embeddings = torch.stack([model.embed_smiles(s) for s in smiles])
   >>> # optionally convert back to NumPy arrays
   >>> numpy_embeddings = embeddings.numpy()

Project Structure
-----------------
The project file structure is laid out as follows::

   ├── CITATION.cff
   ├── codecov.yml
   ├── CODE_OF_CONDUCT.rst
   ├── CONTRIBUTING.rst
   ├── data
   │   ├── external
   │   ├── interim
   │   ├── processed
   │   └── raw
   ├── docs
   │   ├── codeofconduct.rst
   │   ├── conf.py
   │   ├── contributing.rst
   │   ├── index.rst
   │   ├── license.rst
   │   ├── reference.rst
   │   ├── requirements.txt
   │   └── usage.rst
   ├── environment.yml
   ├── models
   ├── notebooks
   │   ├── dev
   │   ├── exploratory
   │   └── reports
   ├── noxfile.py
   ├── poetry.lock
   ├── pyproject.toml
   ├── README.rst
   ├── scripts
   │   └── train.py
   └── src
       └── astrochem_embedding
           ├── __init__.py
           ├── layers
           │   ├── __init__.py
           │   ├── layers.py
           │   └── tests
           │       ├── __init__.py
           │       └── test_layers.py
           ├── __main__.py
           ├── models
           │   ├── __init__.py
           │   ├── models.py
           │   └── tests
           │       ├── __init__.py
           │       └── test_models.py
           ├── pipeline
           │   ├── data.py
           │   ├── __init__.py
           │   ├── tests
           │   │   ├── __init__.py
           │   │   ├── test_data.py
           │   │   └── test_transforms.py
           │   └── transforms.py
           └── utils.py

A brief summary of what each folder is designed for:

#. ``data`` contains copies of the data used for this project. It is recommended to form a pipeline whereby the ``raw`` data is preprocessed, serialized to ``interim``, and, when ready for analysis, placed into ``processed``.
#. ``models`` contains serialized weights intended for distribution and/or testing.
#. ``notebooks`` contains three subfolders: ``dev`` is for notebook-based development, ``exploratory`` for data exploration, and ``reports`` for making figures and visualizations for write-ups.
#. ``scripts`` contains files that are meant for headless routines, generally those with long compute times such as model training and data cleaning.
#. ``src/astrochem_embedding`` contains the common code base for this project.

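
As a rough illustration of the ``raw`` to ``interim`` to ``processed`` flow recommended for the ``data`` folder, the sketch below uses only the standard library. The file name ``smiles.txt`` and the helpers ``clean_smiles`` and ``run_pipeline`` are hypothetical and are not part of this package; a real pipeline would do more than deduplicate lines.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List


def clean_smiles(lines: Iterable[str]) -> List[str]:
    """Strip whitespace, drop blank lines and duplicates, keep the order."""
    seen = set()
    cleaned = []
    for line in lines:
        s = line.strip()
        if s and s not in seen:
            seen.add(s)
            cleaned.append(s)
    return cleaned


def run_pipeline(data_dir: Path, filename: str = "smiles.txt") -> Path:
    """Move one file through raw -> interim -> processed; return the result."""
    raw = data_dir / "raw" / filename
    interim = data_dir / "interim" / filename
    processed = data_dir / "processed" / filename
    for path in (interim, processed):
        path.parent.mkdir(parents=True, exist_ok=True)

    # raw -> interim: preprocess and serialize the intermediate result.
    cleaned = clean_smiles(raw.read_text().splitlines())
    interim.write_text("\n".join(cleaned) + "\n")

    # interim -> processed: a plain copy here; a real pipeline might
    # tokenize or otherwise transform the data at this stage.
    processed.write_text(interim.read_text())
    return processed


# Usage in a throwaway directory so nothing in the repository is touched.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "raw").mkdir()
    (root / "raw" / "smiles.txt").write_text("c1ccccc1\n\n[C]#N\nc1ccccc1\n")
    result = run_pipeline(root)
    print(result.read_text())
```

Keeping the ``raw`` files untouched and writing every intermediate step to disk makes it possible to rerun or inspect any stage on its own.
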
Code development
----------------
All of the code used for this project should be contained in ``src/astrochem_embedding``,
at least in terms of the high-level functionality (i.e. not scripts), and is intended to be
a standalone Python package.

The package is structured to match the abstractions used in deep learning, specifically PyTorch,
PyTorch Lightning, and Weights and Biases, by separating data structures and processing
from model/layer development.

Some concise tenets for development:

* Write unit tests as you go.
* Commit changes, and commit frequently. Write `semantic`_ git commits!
* Formatting is done with ``black``; don't fuss about it 😃
* For new Python dependencies, use ``poetry add <package>``.
* For new environment dependencies, use ``conda env export -f environment.yml``.

Notes on best practices, particularly regarding CI/CD, can be found in the extensive
documentation for the `Hypermodern Python Cookiecutter`_ repository.
License
-------
Distributed under the terms of the `MIT license`_,
*Language models for astrochemistry* is free and open source software.
Issues
------
If you encounter any problems,
please `file an issue`_ along with a detailed description.
Credits
-------
This project was generated from `@laserkelvin`_'s PyTorch Project Cookiecutter,
a fork of `@cjolowicz`_'s `Hypermodern Python Cookiecutter`_ template.
.. _@cjolowicz: https://github.com/cjolowicz
.. _Cookiecutter: https://github.com/audreyr/cookiecutter
.. _MIT license: https://opensource.org/licenses/MIT
.. _PyPI: https://pypi.org/
.. _Hypermodern Python Cookiecutter: https://github.com/cjolowicz/cookiecutter-hypermodern-python
.. _file an issue: https://github.com/laserkelvin/astrochem_embedding/issues
.. _pip: https://pip.pypa.io/
.. github-only
.. _Contributor Guide: CONTRIBUTING.rst
.. _Usage: https://astrochem_embedding.readthedocs.io/en/latest/usage.html
.. _semantic: https://gist.github.com/joshbuchea/6f47e86d2510bce28f8e7f42ae84c716
.. _@laserkelvin: https://github.com/laserkelvin
.. _SELFIES: https://github.com/aspuru-guzik-group/selfies