hestia-good

Name	hestia-good
Version	1.0.2
home_page	https://github.com/IBM/Hestia-GOOD
Summary	Independent evaluation set construction for trustworthy ML models in biochemistry
upload_time	2025-08-27 10:26:45
maintainer	None
docs_url	None
author	Raul Fernandez-Diaz
requires_python	>=3.9
license	MIT
keywords	hestia
requirements	No requirements were recorded.

<div align="center">
  <h1>Hestia-GOOD</h1>

  <p>Computational tool for generating evaluation sets that measure how well models generalise.</p>
  
  <a href="https://ibm.github.io/Hestia-GOOD/"><img alt="Tutorials" src="https://img.shields.io/badge/docs-tutorials-green" /></a>
  <a href="https://github.com/IBM/Hestia-GOOD/blob/main/LICENSE"><img alt="GitHub" src="https://img.shields.io/github/license/IBM/Hestia-GOOD" /></a>
  <a href="https://pypi.org/project/hestia-good/"><img src="https://img.shields.io/pypi/v/hestia-good" /></a>
  <a href="https://pypi.org/project/hestia-good/"><img src="https://img.shields.io/pypi/dm/hestia-good" /></a>
  <a target="_blank" href="https://colab.research.google.com/github/IBM/Hestia-GOOD/blob/main/examples/tutorial_1.ipynb">
    <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
  </a>
</div>

- **Documentation:**  <a href="https://ibm.github.io/Hestia-GOOD/" target="_blank">https://ibm.github.io/Hestia-GOOD</a>
- **Source Code:** <a href="https://github.com/IBM/Hestia-GOOD" target="_blank">https://github.com/IBM/Hestia-GOOD</a>
- **Paper [ICLR 2025]:** <a href="https://openreview.net/pdf?id=qFZnAC4GHR" target="_blank">https://openreview.net/pdf?id=qFZnAC4GHR</a>

## Contents

<details open markdown="1"><summary><b>Table of Contents</b></summary>

- [Installation Guide](#installation)
- [Documentation](#documentation)
- [Examples](#examples)
- [License](#license)
 </details>


## Installation <a name="installation"></a>

Installing in a conda environment is recommended. Hestia-GOOD requires Python 3.9 or later. To create the environment, run:

```bash
conda create -n hestia python
conda activate hestia
```

### 1. Python Package

#### 1.1. From PyPI

```bash
pip install hestia-good
```

#### 1.2. Directly from source

```bash
pip install git+https://github.com/IBM/Hestia-GOOD
```
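
After either command finishes, one quick way to confirm the installation is to ask Python which version of the distribution it can see. This is a minimal sketch using only the standard library; the helper name `installed_version` is illustrative and not part of Hestia.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str = "hestia-good") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        # The distribution is not installed in the active environment.
        return None


if __name__ == "__main__":
    found = installed_version()
    print(f"hestia-good: {found if found else 'not installed'}")
```

If the script reports `not installed`, the most likely cause is that the `hestia` conda environment is not the active one.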

### 2. Optional dependencies

#### 2.1. Molecular similarity

RDKit is required to calculate molecular similarities:

```bash
pip install rdkit
```

#### 2.2. Sequence alignment

  - MMSeqs2 [https://github.com/steineggerlab/mmseqs2](https://github.com/steineggerlab/mmseqs2)
  ```bash
  # static build with AVX2 (fastest) (check using: cat /proc/cpuinfo | grep avx2)
  wget https://mmseqs.com/latest/mmseqs-linux-avx2.tar.gz; tar xvfz mmseqs-linux-avx2.tar.gz; export PATH=$(pwd)/mmseqs/bin/:$PATH

  # static build with SSE4.1  (check using: cat /proc/cpuinfo | grep sse4)
  wget https://mmseqs.com/latest/mmseqs-linux-sse41.tar.gz; tar xvfz mmseqs-linux-sse41.tar.gz; export PATH=$(pwd)/mmseqs/bin/:$PATH

  # static build with SSE2 (slowest, for very old systems)  (check using: cat /proc/cpuinfo | grep sse2)
  wget https://mmseqs.com/latest/mmseqs-linux-sse2.tar.gz; tar xvfz mmseqs-linux-sse2.tar.gz; export PATH=$(pwd)/mmseqs/bin/:$PATH

  # MacOS
  brew install mmseqs2  
  ```

  To use Needleman-Wunsch alignment, install EMBOSS with either:

  ```bash
  conda install -c bioconda emboss
  ```
  or

  ```bash
  sudo apt install emboss
  ```


- Windows: Download binaries from [EMBOSS](https://emboss.sourceforge.net/download/) and [MMSeqs2-latest](https://mmseqs.com/latest/mmseqs-win64.zip)


#### 2.3. Structure alignment 

  - To use Foldseek [https://github.com/steineggerlab/foldseek](https://github.com/steineggerlab/foldseek):

  ```bash
  # Linux AVX2 build (check using: cat /proc/cpuinfo | grep avx2)
  wget https://mmseqs.com/foldseek/foldseek-linux-avx2.tar.gz; tar xvzf foldseek-linux-avx2.tar.gz; export PATH=$(pwd)/foldseek/bin/:$PATH

  # Linux SSE2 build (check using: cat /proc/cpuinfo | grep sse2)
  wget https://mmseqs.com/foldseek/foldseek-linux-sse2.tar.gz; tar xvzf foldseek-linux-sse2.tar.gz; export PATH=$(pwd)/foldseek/bin/:$PATH

  # Linux ARM64 build
  wget https://mmseqs.com/foldseek/foldseek-linux-arm64.tar.gz; tar xvzf foldseek-linux-arm64.tar.gz; export PATH=$(pwd)/foldseek/bin/:$PATH

  # MacOS
  wget https://mmseqs.com/foldseek/foldseek-osx-universal.tar.gz; tar xvzf foldseek-osx-universal.tar.gz; export PATH=$(pwd)/foldseek/bin/:$PATH
  ```
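
Because these tools are found through `PATH`, it can help to check what the current environment actually sees before running an alignment. The sketch below is one way to do that with the standard library. The executable names are assumptions: `mmseqs` and `foldseek` are the binaries inside the archives above, and `needle` is the EMBOSS program for Needleman-Wunsch alignment. The function `check_optional_dependencies` is hypothetical and not part of Hestia.

```python
import shutil
from importlib.util import find_spec
from typing import Dict


def check_optional_dependencies() -> Dict[str, bool]:
    """Report which optional tools are visible from the current environment."""
    # Assumed executable names for each command-line tool.
    executables = {
        "MMSeqs2": "mmseqs",
        "EMBOSS needle": "needle",
        "Foldseek": "foldseek",
    }
    # shutil.which returns None when the executable is not on PATH.
    status = {name: shutil.which(exe) is not None
              for name, exe in executables.items()}
    # RDKit is a Python package, so look for an importable module instead.
    status["RDKit"] = find_spec("rdkit") is not None
    return status


if __name__ == "__main__":
    for tool, available in check_optional_dependencies().items():
        print(f"{tool}: {'found' if available else 'missing'}")
```

Note that the `export PATH=...` commands above only affect the current shell session, so a tool can show as missing in a new terminal until the export is repeated or added to the shell profile.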


## Documentation <a name="documentation"></a>

### 1. DatasetGenerator

The `HestiaGenerator` makes it easy to generate training/validation/evaluation partitions at different similarity thresholds, which enables estimating a model's generalisation capabilities. It also calculates the AU-GOOD (Area Under the Generalization Out-Of-Distribution curve). More information is available in the [Dataset Generator docs](https://ibm.github.io/Hestia-GOOD/dataset_generator/).

```python
import pandas as pd
from hestia.dataset_generator import HestiaGenerator, SimArguments

# Load the dataset ('example.csv' is a placeholder for your own file)
df = pd.read_csv('example.csv')

# Initialise the generator for a DataFrame
generator = HestiaGenerator(df)

# Define the similarity arguments (for more info see the documentation page https://ibm.github.io/Hestia-GOOD/dataset_generator/)

# Similarity arguments for protein similarity
prot_args = SimArguments(
    data_type='sequence', field_name='sequence',
    alignment_algorithm='mmseqs2+prefilter', verbose=3
)

# Similarity arguments for molecular similarity
mol_args = SimArguments(
    data_type='small molecule', field_name='SMILES',
    fingerprint='mapc', radius=2, bits=2048
)

# Calculate the similarity
generator.calculate_similarity(prot_args)

# Calculate partitions
generator.calculate_partitions(min_threshold=0.3,
                               threshold_step=0.05,
                               test_size=0.2, valid_size=0.1)

# Save partitions
generator.save_precalculated('precalculated_partitions.gz')

# Load pre-calculated partitions
generator.from_precalculated('precalculated_partitions.gz')

# Training code (skip partitions whose test set is smaller than 18.5% of the total data)

for threshold, partition in generator.get_partitions(filter=0.185):
    train = df.iloc[partition['train']]
    valid = df.iloc[partition['valid']]
    test = df.iloc[partition['test']]

# ...

# Calculate AU-GOOD
generator.calculate_augood(results, 'test_mcc')

# Plot GOOD
generator.plot_good(results, 'test_mcc')

# Compare two models
results = {'model A': [values_A], 'model B': [values_B]}
generator.compare_models(results, statistical_test='wilcoxon')
```
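
To give an intuition for what an area-under-the-GOOD-curve summary captures, the sketch below integrates a performance-versus-threshold curve with the trapezoidal rule and divides by the threshold range. It illustrates the general idea only: `calculate_augood` may weight or normalise the curve differently, and the function name `area_under_good_curve` and the example values are hypothetical.

```python
from typing import Sequence


def area_under_good_curve(thresholds: Sequence[float],
                          scores: Sequence[float]) -> float:
    """Trapezoidal area under a score-vs-threshold curve, divided by the threshold range."""
    if len(thresholds) != len(scores) or len(thresholds) < 2:
        raise ValueError("need at least two (threshold, score) pairs of equal length")
    # Sort by threshold so the integration runs left to right.
    points = sorted(zip(thresholds, scores))
    span = points[-1][0] - points[0][0]
    if span == 0:
        raise ValueError("thresholds must not all be equal")
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area / span


if __name__ == "__main__":
    # Hypothetical MCC values measured at four similarity thresholds.
    thresholds = [0.3, 0.4, 0.5, 0.6]
    mcc = [0.50, 0.60, 0.70, 0.80]
    print(round(area_under_good_curve(thresholds, mcc), 3))
```

Dividing by the threshold range keeps the summary on the same scale as the underlying metric, so a model that scores the same value at every threshold gets exactly that value back.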

### 2. Similarity calculation

Calculating pairwise similarity between the entities within a DataFrame `df_query` or between two DataFrames `df_query` and `df_target` can be achieved through the `calculate_similarity` function. More details about similarity calculation can be found in the [Similarity calculation documentation](https://ibm.github.io/Hestia-GOOD/similarity/).

```python
from hestia.similarity import sequence_similarity_mmseqs
import pandas as pd

df_query = pd.read_csv('example.csv')

# The CSV file needs to have a column describing the entities, i.e., their sequence, their SMILES, or a path to their PDB structure.
# This column corresponds to `field_name` in the function.

sim_df = sequence_similarity_mmseqs(df_query, field_name='sequence', prefilter=True)
```
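
Conceptually, a pairwise similarity calculation yields one row per pair of entities together with a score. The toy sketch below shows that shape using only the standard library. It is not how Hestia computes similarity: `difflib.SequenceMatcher.ratio()` stands in for a real alignment score, and the `(query, target, metric)` row layout and the function name `pairwise_similarity` are assumptions made for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import List, Tuple


def pairwise_similarity(sequences: List[str],
                        threshold: float = 0.0) -> List[Tuple[int, int, float]]:
    """Return (query, target, metric) rows for every pair scoring at least `threshold`."""
    rows = []
    for i, j in combinations(range(len(sequences)), 2):
        # ratio() is a stand-in score in [0, 1], not an alignment identity.
        metric = SequenceMatcher(None, sequences[i], sequences[j]).ratio()
        if metric >= threshold:
            rows.append((i, j, metric))
    return rows


if __name__ == "__main__":
    toy_sequences = ["MKTAYIAK", "MKTAYIAK", "GGGGGGGG"]
    for query, target, metric in pairwise_similarity(toy_sequences, threshold=0.5):
        print(query, target, round(metric, 2))
```

Storing only the pairs above a threshold keeps the table sparse, which matters because the number of pairs grows quadratically with the dataset size.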



### 3. Clustering

Clustering the entities within a DataFrame `df` can be achieved through the `generate_clusters` function. Three clustering algorithms are currently supported: `CDHIT`, `greedy_cover_set`, and `connected_components`. More details about clustering can be found in the [Clustering documentation](https://ibm.github.io/Hestia-GOOD/clustering/).

```python
from hestia.similarity import sequence_similarity_mmseqs
from hestia.clustering import generate_clusters
import pandas as pd

df = pd.read_csv('example.csv')
sim_df = sequence_similarity_mmseqs(df, field_name='sequence')
clusters_df = generate_clusters(df, field_name='sequence', sim_df=sim_df,
                                cluster_algorithm='CDHIT')
```
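
As an illustration of the idea behind connected-components clustering, the sketch below groups items that are linked, directly or transitively, by a similarity above a threshold. It uses a small union-find structure and the standard library only. It is a generic sketch, not Hestia's implementation; the function name and the `(i, j, metric)` input layout are assumptions.

```python
from typing import Dict, Iterable, List, Tuple


def connected_components(n_items: int,
                         pairs: Iterable[Tuple[int, int, float]],
                         threshold: float) -> List[int]:
    """Label each item so that items linked by similarity > threshold share a label."""
    parent = list(range(n_items))

    def find(x: int) -> int:
        # Path halving keeps the trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j, metric in pairs:
        if metric > threshold:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[root_j] = root_i

    # Relabel roots as consecutive integers in order of first appearance.
    labels: Dict[int, int] = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(n_items)]


if __name__ == "__main__":
    toy_pairs = [(0, 1, 0.9), (1, 2, 0.8), (3, 4, 0.2)]
    print(connected_components(5, toy_pairs, threshold=0.5))
```

Items 0 and 2 end up in the same cluster even though they are never compared directly, because both are similar to item 1.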

### 4. Partitioning

Partitioning the entities within a DataFrame `df` into training and evaluation subsets can be achieved through four functions: `ccpart`, `graph_part`, `reduction_partition`, and `random_partition`. More details about the partitioning algorithms can be found in the [Partitioning documentation](https://ibm.github.io/Hestia-GOOD/partitioning). An example of how `ccpart` would be used is:

```python
from hestia.similarity import sequence_similarity_mmseqs
from hestia.partition import ccpart
import pandas as pd

df = pd.read_csv('example.csv')
sim_df = sequence_similarity_mmseqs(df, field_name='sequence')
train, test, partition_labs = ccpart(df, threshold=0.3, test_size=0.2, sim_df=sim_df)

train_df = df.iloc[train, :]
test_df = df.iloc[test, :]
```
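
The property that similarity-aware partitioning aims for is that no test item is more similar than the threshold to any training item. The sketch below shows one simple way to obtain that property: find connected components of the similarity graph and move whole components, smallest first, into the test set until the requested size is reached. This is an illustrative approach under stated assumptions, not the algorithm implemented by `ccpart` or the other Hestia functions; `component_partition` and its input layout are hypothetical.

```python
from collections import defaultdict, deque
from typing import Iterable, List, Tuple


def component_partition(n_items: int,
                        pairs: Iterable[Tuple[int, int, float]],
                        threshold: float,
                        test_size: float = 0.2) -> Tuple[List[int], List[int]]:
    """Split item indices into (train, test) without separating linked items."""
    adjacency = defaultdict(set)
    for i, j, metric in pairs:
        if metric > threshold:
            adjacency[i].add(j)
            adjacency[j].add(i)

    # Breadth-first search to collect connected components.
    seen, components = set(), []
    for start in range(n_items):
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(component)

    # Move whole components into test, smallest first, up to the target size.
    target = int(round(test_size * n_items))
    train, test = [], []
    for component in sorted(components, key=len):
        if len(test) + len(component) <= target:
            test.extend(component)
        else:
            train.extend(component)
    return sorted(train), sorted(test)


if __name__ == "__main__":
    toy_pairs = [(0, 1, 0.9), (1, 2, 0.9), (2, 3, 0.9), (4, 5, 0.9), (6, 7, 0.4)]
    print(component_partition(10, toy_pairs, threshold=0.5, test_size=0.2))
```

Because components are never split, the test set can come out smaller than requested when the remaining components are too large to fit.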

## License <a name="license"></a>

Hestia is open-source software licensed under the MIT License. Check the details in the [LICENSE](https://github.com/IBM/Hestia-GOOD/blob/main/LICENSE) file.

Raw data

{
    "_id": null,
    "home_page": "https://github.com/IBM/Hestia-GOOD",
    "name": "hestia-good",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": null,
    "keywords": "hestia",
    "author": "Raul Fernandez-Diaz",
    "author_email": "raul.fernandezdiaz@ucdconnect.ie",
    "download_url": "https://files.pythonhosted.org/packages/99/cd/c668e28bba216aa72370f98ac3d16569456ca2565c89359b0cc8cc820299/hestia_good-1.0.2.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Independent evaluation set construction for trustworthy ML models in biochemistry",
    "version": "1.0.2",
    "project_urls": {
        "Homepage": "https://github.com/IBM/Hestia-GOOD"
    },
    "split_keywords": [
        "hestia"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "009d88d116fbf46fd21980e09c77d65feadfaa2ebb85c65d88469df1f1009b9f",
                "md5": "053e461c0a7a58acef2d5cb8b2ad739d",
                "sha256": "202e9475b8980a41f181186306957ba5236eb7eceb549b4f3c5c99c2da2e70f7"
            },
            "downloads": -1,
            "filename": "hestia_good-1.0.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "053e461c0a7a58acef2d5cb8b2ad739d",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9",
            "size": 34448,
            "upload_time": "2025-08-27T10:26:44",
            "upload_time_iso_8601": "2025-08-27T10:26:44.250044Z",
            "url": "https://files.pythonhosted.org/packages/00/9d/88d116fbf46fd21980e09c77d65feadfaa2ebb85c65d88469df1f1009b9f/hestia_good-1.0.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "99cdc668e28bba216aa72370f98ac3d16569456ca2565c89359b0cc8cc820299",
                "md5": "4becb8bae3a31a3692911895e9e3b210",
                "sha256": "c5287b450a43411bcd6bbf5bdc938040f8de20ce71d991139fe41a988f1f566e"
            },
            "downloads": -1,
            "filename": "hestia_good-1.0.2.tar.gz",
            "has_sig": false,
            "md5_digest": "4becb8bae3a31a3692911895e9e3b210",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 35419,
            "upload_time": "2025-08-27T10:26:45",
            "upload_time_iso_8601": "2025-08-27T10:26:45.394614Z",
            "url": "https://files.pythonhosted.org/packages/99/cd/c668e28bba216aa72370f98ac3d16569456ca2565c89359b0cc8cc820299/hestia_good-1.0.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-08-27 10:26:45",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "IBM",
    "github_project": "Hestia-GOOD",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "hestia-good"
}
