<div align="center">
<h1>Hestia-GOOD</h1>
<p>Computational tool for generating generalisation-evaluating evaluation sets.</p>
<a href="https://ibm.github.io/Hestia-GOOD/"><img alt="Tutorials" src="https://img.shields.io/badge/docs-tutorials-green" /></a>
<a href="https://github.com/IBM/Hestia-GOOD/blob/main/LICENSE"><img alt="GitHub" src="https://img.shields.io/github/license/IBM/Hestia-GOOD" /></a>
<a href="https://pypi.org/project/hestia-good/"><img src="https://img.shields.io/pypi/v/hestia-good" /></a>
<a href="https://pypi.org/project/hestia-good/"><img src="https://img.shields.io/pypi/dm/hestia-good" /></a>
<a target="_blank" href="https://colab.research.google.com/github/IBM/Hestia-GOOD/blob/main/examples/tutorial_1.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>
</div>
- **Documentation:** <a href="https://ibm.github.io/Hestia-GOOD/" target="_blank">https://ibm.github.io/Hestia-GOOD</a>
- **Source Code:** <a href="https://github.com/IBM/Hestia-GOOD" target="_blank">https://github.com/IBM/Hestia-GOOD</a>
- **Paper [ICLR 2025]:** <a href="https://openreview.net/pdf?id=qFZnAC4GHR" target="_blank">https://openreview.net/pdf?id=qFZnAC4GHR</a>
## Contents
<details open markdown="1"><summary><b>Table of Contents</b></summary>
- [Installation Guide](#installation)
- [Documentation](#documentation)
- [Examples](#examples)
- [License](#license)
</details>
## Installation <a name="installation"></a>
Installing in a conda environment is recommended. To create the environment, run:
```bash
conda create -n hestia python
conda activate hestia
```
### 1. Python Package
#### 1.1. From PyPI
```bash
pip install hestia-good
```
#### 1.2. Directly from source
```bash
pip install git+https://github.com/IBM/Hestia-GOOD
```
### 2. Optional dependencies
#### 2.1. Molecular similarity
RDKit is a dependency necessary for calculating molecular similarities:
```bash
pip install rdkit
```
#### 2.2. Sequence alignment
- MMSeqs2 [https://github.com/steineggerlab/mmseqs2](https://github.com/steineggerlab/mmseqs2)
```bash
# static build with AVX2 (fastest) (check using: cat /proc/cpuinfo | grep avx2)
wget https://mmseqs.com/latest/mmseqs-linux-avx2.tar.gz; tar xvfz mmseqs-linux-avx2.tar.gz; export PATH=$(pwd)/mmseqs/bin/:$PATH
# static build with SSE4.1 (check using: cat /proc/cpuinfo | grep sse4)
wget https://mmseqs.com/latest/mmseqs-linux-sse41.tar.gz; tar xvfz mmseqs-linux-sse41.tar.gz; export PATH=$(pwd)/mmseqs/bin/:$PATH
# static build with SSE2 (slowest, for very old systems) (check using: cat /proc/cpuinfo | grep sse2)
wget https://mmseqs.com/latest/mmseqs-linux-sse2.tar.gz; tar xvfz mmseqs-linux-sse2.tar.gz; export PATH=$(pwd)/mmseqs/bin/:$PATH
# MacOS
brew install mmseqs2
```
To use Needleman-Wunsch, install EMBOSS with either:
```bash
conda install -c bioconda emboss
```
or
```bash
sudo apt install emboss
```
- Windows: Download binaries from [EMBOSS](https://emboss.sourceforge.net/download/) and [MMSeqs2-latest](https://mmseqs.com/latest/mmseqs-win64.zip)
#### 2.3. Structure alignment
- To use Foldseek [https://github.com/steineggerlab/foldseek](https://github.com/steineggerlab/foldseek):
```bash
# Linux AVX2 build (check using: cat /proc/cpuinfo | grep avx2)
wget https://mmseqs.com/foldseek/foldseek-linux-avx2.tar.gz; tar xvzf foldseek-linux-avx2.tar.gz; export PATH=$(pwd)/foldseek/bin/:$PATH
# Linux SSE2 build (check using: cat /proc/cpuinfo | grep sse2)
wget https://mmseqs.com/foldseek/foldseek-linux-sse2.tar.gz; tar xvzf foldseek-linux-sse2.tar.gz; export PATH=$(pwd)/foldseek/bin/:$PATH
# Linux ARM64 build
wget https://mmseqs.com/foldseek/foldseek-linux-arm64.tar.gz; tar xvzf foldseek-linux-arm64.tar.gz; export PATH=$(pwd)/foldseek/bin/:$PATH
# MacOS
wget https://mmseqs.com/foldseek/foldseek-osx-universal.tar.gz; tar xvzf foldseek-osx-universal.tar.gz; export PATH=$(pwd)/foldseek/bin/:$PATH
```
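Since the command-line tools above are optional, it can help to check which of them are visible on `PATH` before running an alignment. The following is a minimal sketch using only the Python standard library; it is not part of Hestia, and the executable names (`mmseqs`, `needle`, `foldseek`) are assumptions based on each tool's usual command name.

```python
import shutil

# Executable names are assumptions based on each tool's usual command-line name.
OPTIONAL_TOOLS = {
    "MMseqs2": "mmseqs",
    "EMBOSS (Needleman-Wunsch)": "needle",
    "Foldseek": "foldseek",
}


def find_optional_tools(tools=OPTIONAL_TOOLS):
    """Map each tool name to the path of its executable, or None if not on PATH."""
    return {name: shutil.which(exe) for name, exe in tools.items()}


if __name__ == "__main__":
    for name, path in find_optional_tools().items():
        print(f"{name}: {path or 'not found on PATH'}")
```

A tool reported as missing only matters if you intend to use the corresponding similarity method.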
## Documentation <a name="documentation"></a>
### 1. DatasetGenerator
The `HestiaGenerator` makes it easy to generate training/validation/evaluation partitions at different similarity thresholds, which enables the estimation of model generalisation capabilities. It also calculates the AU-GOOD (Area Under the Generalization Out-Of-Distribution curve). More information is available in the [Dataset Generator docs](https://ibm.github.io/Hestia-GOOD/dataset_generator/).
```python
from hestia.dataset_generator import HestiaGenerator, SimArguments
# Initialise the generator for a DataFrame
generator = HestiaGenerator(df)
# Define the similarity arguments (for more info see the documentation page https://ibm.github.io/Hestia-GOOD/dataset_generator/)
# Similarity arguments for protein similarity
prot_args = SimArguments(
    data_type='sequence', field_name='sequence',
    alignment_algorithm='mmseqs2+prefilter', verbose=3
)
# Similarity arguments for molecular similarity
mol_args = SimArguments(
    data_type='small molecule', field_name='SMILES',
    fingerprint='mapc', radius=2, bits=2048
)
# Calculate the similarity
generator.calculate_similarity(prot_args)
# Calculate partitions
generator.calculate_partitions(min_threshold=0.3,
                               threshold_step=0.05,
                               test_size=0.2, valid_size=0.1)
# Save partitions
generator.save_precalculated('precalculated_partitions.gz')
# Load pre-calculated partitions
generator.from_precalculated('precalculated_partitions.gz')
# Training code (skip partitions whose test set holds less than 18.5% of the total data)
for threshold, partition in generator.get_partitions(filter=0.185):
    train = df.iloc[partition['train']]
    valid = df.iloc[partition['valid']]
    test = df.iloc[partition['test']]
# ...
# Calculate AU-GOOD
generator.calculate_augood(results, 'test_mcc')
# Plot GOOD
generator.plot_good(results, 'test_mcc')
# Compare two models
results = {'model A': [values_A], 'model B': [values_B]}
generator.compare_models(results, statistical_test='wilcoxon')
```
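To give an intuition for an "area under the curve" summary of performance across similarity thresholds, here is a minimal, library-independent sketch. It assumes the summary is a plain trapezoidal area under metric-versus-threshold, divided by the threshold range. This is an illustrative assumption only: `calculate_augood` may weight thresholds differently, and the function name and numbers below are hypothetical.

```python
def area_under_curve(thresholds, scores):
    """Trapezoidal area under scores vs. thresholds, divided by the threshold range."""
    if len(thresholds) != len(scores) or len(thresholds) < 2:
        raise ValueError("need at least two (threshold, score) pairs")
    pairs = sorted(zip(thresholds, scores))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pairs, pairs[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area / (pairs[-1][0] - pairs[0][0])


# Hypothetical MCC values measured on test sets built at each similarity threshold
thresholds = [0.3, 0.4, 0.5, 0.6]
test_mcc = [0.40, 0.50, 0.60, 0.70]
print(round(area_under_curve(thresholds, test_mcc), 3))
```

Dividing by the threshold range keeps the result on the same scale as the metric itself, so curves evaluated over different threshold ranges remain roughly comparable.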
### 2. Similarity calculation
Calculating pairwise similarity between the entities within a DataFrame `df_query` or between two DataFrames `df_query` and `df_target` can be achieved through the `calculate_similarity` function. More details about similarity calculation can be found in the [Similarity calculation documentation](https://ibm.github.io/Hestia-GOOD/similarity/).
```python
from hestia.similarity import sequence_similarity_mmseqs
import pandas as pd
df_query = pd.read_csv('example.csv')
# The CSV file needs to have a column describing the entities, i.e., their sequence, their SMILES, or a path to their PDB structure.
# This column corresponds to `field_name` in the function.
sim_df = sequence_similarity_mmseqs(df_query, field_name='sequence', prefilter=True)
```
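For small molecules, similarity is usually computed between fingerprints rather than through alignment. The sketch below shows the general idea with Tanimoto similarity over sets of "on" bit indices, keeping only pairs at or above a threshold. It is independent of Hestia: the function names, the toy fingerprints, and the convention of returning 0.0 for two empty fingerprints are all assumptions made for illustration.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two sets of 'on' bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    # Convention assumed here: two empty fingerprints have similarity 0.0
    return len(a & b) / union if union else 0.0


def pairwise_similarity(fingerprints, threshold=0.0):
    """Return (i, j, similarity) for every pair i < j at or above the threshold."""
    pairs = []
    for i in range(len(fingerprints)):
        for j in range(i + 1, len(fingerprints)):
            sim = tanimoto(fingerprints[i], fingerprints[j])
            if sim >= threshold:
                pairs.append((i, j, sim))
    return pairs


# Toy fingerprints given as indices of set bits (not real molecules)
fps = [{1, 2, 3, 4}, {2, 3, 4, 5}, {10, 11}]
print(pairwise_similarity(fps, threshold=0.3))
```

The all-against-all loop is quadratic in the number of entities, which is why prefiltering matters for large datasets.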
### 3. Clustering
Clustering the entities within a DataFrame `df` can be achieved through the `generate_clusters` function. Three clustering algorithms are currently supported: `CDHIT`, `greedy_cover_set`, and `connected_components`. More details about clustering can be found in the [Clustering documentation](https://ibm.github.io/Hestia-GOOD/clustering/).
```python
from hestia.similarity import sequence_similarity_mmseqs
from hestia.clustering import generate_clusters
import pandas as pd
df = pd.read_csv('example.csv')
sim_df = sequence_similarity_mmseqs(df, field_name='sequence')
clusters_df = generate_clusters(df, field_name='sequence', sim_df=sim_df,
                                cluster_algorithm='CDHIT')
```
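To illustrate the idea behind connected-components clustering, the sketch below groups items that are linked, directly or through a chain of neighbours, by similarities at or above a threshold. It is a generic union-find implementation written for this README, not Hestia's code; the function name and the `(i, j, similarity)` input format are assumptions.

```python
def connected_components(n_items, similar_pairs, threshold):
    """Label items so that any two linked by a chain of pairs >= threshold share a label."""
    parent = list(range(n_items))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for i, j, sim in similar_pairs:
        if sim >= threshold:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[root_j] = root_i

    # Relabel roots as consecutive cluster ids in order of first appearance
    labels, cluster_ids = [], {}
    for item in range(n_items):
        root = find(item)
        labels.append(cluster_ids.setdefault(root, len(cluster_ids)))
    return labels


# Toy similarities: items 0-1-2 are chained together, 3 and 4 are too dissimilar
pairs = [(0, 1, 0.8), (1, 2, 0.5), (3, 4, 0.2)]
print(connected_components(5, pairs, threshold=0.4))
```

Note that items 0 and 2 end up in the same cluster even though they are never compared directly; the chain through item 1 is enough.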
### 4. Partitioning
Partitioning the entities within a DataFrame `df` into training and evaluation subsets can be achieved through 4 different functions: `ccpart`, `graph_part`, `reduction_partition`, and `random_partition`. More details about partitioning algorithms can be found in the [Partitioning documentation](https://ibm.github.io/Hestia-GOOD/partitioning). An example of how `ccpart` would be used is:
```python
from hestia.similarity import sequence_similarity_mmseqs
from hestia.partition import ccpart
import pandas as pd
df = pd.read_csv('example.csv')
sim_df = sequence_similarity_mmseqs(df, field_name='sequence')
train, test, partition_labs = ccpart(df, threshold=0.3, test_size=0.2, sim_df=sim_df)
train_df = df.iloc[train, :]
test_df = df.iloc[test, :]
```
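Similarity-aware partitioning keeps similar entities on the same side of the split. As a rough illustration of that principle, the sketch below takes precomputed cluster labels and moves whole clusters into the test set, smallest first, until the requested fraction is reached. This is a simplified stand-in, not the algorithm behind `ccpart` or any other Hestia function; the function name and toy labels are hypothetical.

```python
from collections import defaultdict


def split_by_cluster(labels, test_size=0.2):
    """Assign whole clusters to the test set, smallest first, up to test_size of the data."""
    clusters = defaultdict(list)
    for index, label in enumerate(labels):
        clusters[label].append(index)

    target = test_size * len(labels)
    train, test = [], []
    for members in sorted(clusters.values(), key=len):
        # A cluster is never split: it goes to test only if it fits entirely
        if len(test) + len(members) <= target:
            test.extend(members)
        else:
            train.extend(members)
    return sorted(train), sorted(test)


# Toy cluster labels: one large cluster (0) and three singletons (1, 2, 3)
labels = [0, 0, 0, 1, 2, 0, 0, 3, 0, 0]
train, test = split_by_cluster(labels, test_size=0.3)
print(train, test)
```

The returned index lists can be used with `df.iloc[...]` in the same way as in the example above. Because clusters are kept whole, the test set may end up smaller than `test_size` when no remaining cluster fits.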
## License <a name="license"></a>
Hestia is open-source software licensed under the MIT License. Check the details in the [LICENSE](https://github.com/IBM/Hestia-GOOD/blob/main/LICENSE) file.
Raw data
{
"_id": null,
"home_page": "https://github.com/IBM/Hestia-GOOD",
"name": "hestia-good",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": null,
"keywords": "hestia",
"author": "Raul Fernandez-Diaz",
"author_email": "raul.fernandezdiaz@ucdconnect.ie",
"download_url": "https://files.pythonhosted.org/packages/99/cd/c668e28bba216aa72370f98ac3d16569456ca2565c89359b0cc8cc820299/hestia_good-1.0.2.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Independent evaluation set construction for trustworthy ML models in biochemistry",
"version": "1.0.2",
"project_urls": {
"Homepage": "https://github.com/IBM/Hestia-GOOD"
},
"split_keywords": [
"hestia"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "009d88d116fbf46fd21980e09c77d65feadfaa2ebb85c65d88469df1f1009b9f",
"md5": "053e461c0a7a58acef2d5cb8b2ad739d",
"sha256": "202e9475b8980a41f181186306957ba5236eb7eceb549b4f3c5c99c2da2e70f7"
},
"downloads": -1,
"filename": "hestia_good-1.0.2-py3-none-any.whl",
"has_sig": false,
"md5_digest": "053e461c0a7a58acef2d5cb8b2ad739d",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9",
"size": 34448,
"upload_time": "2025-08-27T10:26:44",
"upload_time_iso_8601": "2025-08-27T10:26:44.250044Z",
"url": "https://files.pythonhosted.org/packages/00/9d/88d116fbf46fd21980e09c77d65feadfaa2ebb85c65d88469df1f1009b9f/hestia_good-1.0.2-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "99cdc668e28bba216aa72370f98ac3d16569456ca2565c89359b0cc8cc820299",
"md5": "4becb8bae3a31a3692911895e9e3b210",
"sha256": "c5287b450a43411bcd6bbf5bdc938040f8de20ce71d991139fe41a988f1f566e"
},
"downloads": -1,
"filename": "hestia_good-1.0.2.tar.gz",
"has_sig": false,
"md5_digest": "4becb8bae3a31a3692911895e9e3b210",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9",
"size": 35419,
"upload_time": "2025-08-27T10:26:45",
"upload_time_iso_8601": "2025-08-27T10:26:45.394614Z",
"url": "https://files.pythonhosted.org/packages/99/cd/c668e28bba216aa72370f98ac3d16569456ca2565c89359b0cc8cc820299/hestia_good-1.0.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-08-27 10:26:45",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "IBM",
"github_project": "Hestia-GOOD",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "hestia-good"
}