| Field | Value |
| --- | --- |
| Name | molfeat-hype |
| Version | 0.1.0 |
| Summary | molfeat plugin that leverages the most hyped LLM models in NLP for molecular featurization |
| Upload time | 2023-05-04 19:16:18 |
| Requires Python | >=3.8 |
| License | Apache-2.0 |
| Keywords | molfeat, chatgpt, llm, llama, alpaca |

# :comet: `molfeat-hype`

<div align="center">
    <img src="docs/assets/molfeat-hype-cover.svg" width="100%">
</div>
<p align="center">
    <b> ☄️ molfeat-hype - A molfeat plugin that leverages the most hyped LLM models in NLP for molecular featurization.</b> <br />
</p>
<p align="center">
  <a href="https://maclandrol.github.io/molfeat-hype/" target="_blank">
      Docs
  </a>
</p>

---

[![PyPI](https://img.shields.io/pypi/v/molfeat-hype)](https://pypi.org/project/molfeat-hype/)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/molfeat-hype)](https://pypi.org/project/molfeat-hype/)
[![license](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://github.com/maclandrol/molfeat-hype/blob/main/LICENSE)
[![test](https://github.com/maclandrol/molfeat-hype/actions/workflows/test.yml/badge.svg)](https://github.com/maclandrol/molfeat-hype/actions/workflows/test.yml)
[![code-check](https://github.com/maclandrol/molfeat-hype/actions/workflows/code-check.yml/badge.svg)](https://github.com/maclandrol/molfeat-hype/actions/workflows/code-check.yml)
[![release](https://github.com/maclandrol/molfeat-hype/actions/workflows/release.yml/badge.svg)](https://github.com/maclandrol/molfeat-hype/actions/workflows/release.yml)

## Overview
`molfeat-hype` is an extension of `molfeat` that investigates how well embeddings from LLMs trained without explicit molecular context perform in molecular modeling. It uses some of the most hyped LLMs in NLP to answer the following question:

```Is it necessary to pretrain/finetune LLMs on molecular context to obtain good molecular representations?```

To find an answer to this question, check out the [benchmarks](tutorials/benchmark.ipynb).

<details>
 <summary>Spoilers</summary>
 <strong>YES!</strong> Understanding molecular context/structure/properties is key to building good molecular featurizers. 
</details>

### LLMs

`molfeat-hype` supports two types of LLM embeddings:

1. **Classic Embeddings**: These are classical embeddings provided by foundation models (or any LLMs). The models available in this tool include OpenAI's `openai/text-embedding-ada-002` model, `llama`, and several embedding models accessible through [`sentence-transformers`](https://github.com/UKPLab/sentence-transformers/tree/master).

2. **Instruction-based Embeddings**: These are models that have been trained to follow instructions (thus acting like ChatGPT) or are conditional models that require a prompt.

   - **Prompt-based instruction:** A model (like ChatGPT: `openai/gpt-3.5-turbo`) is asked to act like an all-knowing AI assistant for drug discovery and to provide the best molecular representation for the input list of molecules. Here, we parse the representation from the chat agent's output.
   - **Conditional embeddings:** A model trained for conditional text embeddings that takes an instruction as additional input. Here, the embedding is the model's underlying representation of the molecule, conditioned on the instruction it received. For more information, see [instructor-embedding](https://github.com/HKUNLP/instructor-embedding).
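
The prompt-based approach relies on recovering a numeric vector from free-form chat text. The snippet below is a minimal sketch of that parsing step, not the parser used by `molfeat-hype`: the function name `parse_vector`, the bracketed-list reply format, and the sample reply are all assumptions made for illustration.

```python
import re
from typing import List, Optional

# Matches integers, decimals, and scientific notation, with an optional sign.
_NUMBER = re.compile(r"[-+]?(?:\d+\.\d*|\.\d+|\d+)(?:[eE][-+]?\d+)?")
# Matches a bracketed group that contains no nested brackets.
_BRACKETED = re.compile(r"\[([^\[\]]*)\]")


def parse_vector(reply: str, expected_dim: Optional[int] = None) -> Optional[List[float]]:
    """Return the first bracketed list of numbers found in `reply`.

    Hypothetical helper: returns None when no usable vector is found or when
    the vector length differs from `expected_dim`.
    """
    for match in _BRACKETED.finditer(reply):
        values = [float(token) for token in _NUMBER.findall(match.group(1))]
        if not values:
            continue  # e.g. a bracketed atom such as [Na+] inside a SMILES string
        if expected_dim is not None and len(values) != expected_dim:
            return None
        return values
    return None


# Assumed reply format, for illustration only.
reply = "Sure! Here is the representation for CCO: [0.12, -0.5, 3e-2, 1]"
print(parse_vector(reply, expected_dim=4))
```

Returning `None` instead of raising lets a caller decide how to handle molecules for which the chat model produced no usable vector, which matters because the output of a chat model is not guaranteed to follow the requested format.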

## Installation

You can install `molfeat-hype` using `pip`. A `conda` package is planned.

```bash
pip install molfeat-hype
```

`molfeat-hype` mostly depends on [molfeat](https://github.com/datamol-io/molfeat) and [langchain](https://github.com/hwchase17/langchain). Please see the [env.yml](./env.yml) file for a complete list of dependencies.

### Acknowledgements 

Check out the following projects that made molfeat-hype possible:

- To learn more about [`molfeat`](https://github.com/datamol-io/molfeat), please visit https://molfeat.datamol.io/. To learn more about the plugin system of `molfeat`, please see [extending molfeat](https://molfeat-docs.datamol.io/stable/developers/create-plugin.html).

- Please refer to the [`langchain`](https://github.com/hwchase17/langchain) documentation for any questions related to langchain.

## Usage

Since `molfeat-hype` is a `molfeat` plugin, it follows the same integration principle as any other `molfeat` plugin.

The following examples show how to use the `molfeat-hype` plugin package once it is installed.

1. Using this package directly:

```python

from molfeat_hype.trans.llm_embeddings import LLMTransformer

mol_transf = LLMTransformer(kind="sentence-transformers/all-mpnet-base-v2")
```

2. Enabling autodiscovery so that `molfeat` loads the plugin and exposes all of its embedding classes as importable attributes of the entry point group `molfeat.trans.pretrained`:

```python
# Put this somewhere in your code (e.g., in the root __init__ file).
# Each entry in `plugins` should be a subword of 'molfeat_hype' (here, "hype").
from molfeat.plugins import load_registered_plugins
load_registered_plugins(add_submodules=True, plugins=["hype"])
```

```python
# This is now possible everywhere.
from molfeat.trans.pretrained import LLMTransformer
mol_transf = LLMTransformer(kind="sentence-transformers/all-mpnet-base-v2")
```
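
If the import above fails, it can help to check whether any entry points are registered under that group in the current environment. The following is a minimal sketch using only the standard library; it assumes the group name `molfeat.trans.pretrained` mentioned above is the one the plugin registers under, and the helper name `list_entry_points` is hypothetical.

```python
from importlib.metadata import entry_points


def list_entry_points(group):
    """Return sorted (name, target) pairs for entry points registered under `group`."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: entry points are selectable by group.
        selected = eps.select(group=group)
    else:
        # Python 3.8/3.9: entry_points() returns a dict keyed by group name.
        selected = eps.get(group, [])
    return sorted((ep.name, ep.value) for ep in selected)


# Prints nothing if no installed package registers under this group.
for name, target in list_entry_points("molfeat.trans.pretrained"):
    print(f"{name} -> {target}")
```

An empty result suggests that the package is not installed in the active environment or that it registers under a different group name.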

Once you have defined your molecule transformer, use it like any `molfeat` `MoleculeTransformer`:

```python
import datamol as dm
smiles = dm.freesolv()["smiles"].values[:5]
mol_transf(smiles)
```
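
A quick sanity check on the returned features is to compare molecules pairwise. The sketch below assumes the transformer returns one fixed-length numeric vector per molecule; the helper `cosine_similarity` and the toy vectors are placeholders for illustration, not output from any model.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy stand-ins for rows of the feature matrix; replace them with real embeddings.
features = [
    [0.2, 0.1, 0.7],
    [0.2, 0.1, 0.6],
    [0.9, -0.3, 0.0],
]
for i in range(len(features)):
    for j in range(i + 1, len(features)):
        print(i, j, round(cosine_similarity(features[i], features[j]), 3))
```

Structurally similar molecules receiving very different similarity scores across repeated runs would be one sign of the inconsistency that the disclaimer at the end of this README warns about.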


## Changelog
See the latest changes in [CHANGELOG.rst](./CHANGELOG.rst).

## Maintainers

- @maclandrol

## Contributing

As an open-source project in a rapidly developing field, we are extremely open to contributions, whether in the form of new features, improved infrastructure, or better documentation. 
For detailed information on how to contribute, see our [contribution guide](./contribute.md).


## Disclaimer
This repository contains an experimental investigation of LLM embeddings for molecules. Please note that the consistency and usefulness of the returned molecular embeddings are not guaranteed. This project is meant for fun and exploratory purposes only and should not be used as a demonstration of LLM capabilities for molecular embeddings. Any statements made in this repository are the opinions of the authors and do not necessarily reflect the views of any affiliated organizations or individuals. Use at your own risk.

## License

Released under the Apache-2.0 license. See [LICENSE](LICENSE) for details.


            

Raw data

            {
    "_id": null,
    "home_page": "",
    "name": "molfeat-hype",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": "",
    "keywords": "molfeat,chatGPT,LLM,llama,alpaca",
    "author": "",
    "author_email": "Emmanuel Noutahi <emmanuel.noutahi@hotmail.ca>",
    "download_url": "https://files.pythonhosted.org/packages/e8/a4/5e86bbf05da1a2303ea9a1b207b89100b02d70006e50a75b81867863a6b8/molfeat-hype-0.1.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "Apache-2.0",
    "summary": "molfeat plugin that leverages the most hyped LLM models in NLP for molecular featurization",
    "version": "0.1.0",
    "project_urls": {
        "Bug Tracker": "https://github.com/maclandrol/molfeat-hype/issues",
        "Documentation": "https://molfeat-docs.datamol.io",
        "Source Code": "https://github.com/maclandrol/molfeat-hype",
        "Website": "https://molfeat.datamol.io"
    },
    "split_keywords": [
        "molfeat",
        "chatgpt",
        "llm",
        "llama",
        "alpaca"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "2458e1a15ee60d986674d2c2957ab08dc7ac20787e89a73543e109888d96de62",
                "md5": "3b72b1bc66932d50a34c632b228d4259",
                "sha256": "6373eafd654c5e907885911ce8d90e6684ad43cee1cb4360db1e2887fc3bdb84"
            },
            "downloads": -1,
            "filename": "molfeat_hype-0.1.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "3b72b1bc66932d50a34c632b228d4259",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 17477,
            "upload_time": "2023-05-04T19:16:14",
            "upload_time_iso_8601": "2023-05-04T19:16:14.835657Z",
            "url": "https://files.pythonhosted.org/packages/24/58/e1a15ee60d986674d2c2957ab08dc7ac20787e89a73543e109888d96de62/molfeat_hype-0.1.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "e8a45e86bbf05da1a2303ea9a1b207b89100b02d70006e50a75b81867863a6b8",
                "md5": "4cf89e10d12d8aaae08e08d7601299be",
                "sha256": "ae347e5ba4099ddc7a0b3d973f6f4a1af560636a9ddb5d6b042f7b16c5ec6932"
            },
            "downloads": -1,
            "filename": "molfeat-hype-0.1.0.tar.gz",
            "has_sig": false,
            "md5_digest": "4cf89e10d12d8aaae08e08d7601299be",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 70044052,
            "upload_time": "2023-05-04T19:16:18",
            "upload_time_iso_8601": "2023-05-04T19:16:18.693815Z",
            "url": "https://files.pythonhosted.org/packages/e8/a4/5e86bbf05da1a2303ea9a1b207b89100b02d70006e50a75b81867863a6b8/molfeat-hype-0.1.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-05-04 19:16:18",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "maclandrol",
    "github_project": "molfeat-hype",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "molfeat-hype"
}
        