hestia-ood


- **Name:** hestia-ood
- **Version:** 0.0.34
- **Home page:** https://github.com/IBM/Hestia-OOD
- **Summary:** Suite of tools for analysing the independence between training and evaluation biosequence datasets and for generating new generalisation-evaluating hold-out partitions
- **Upload time:** 2024-11-11 15:33:45
- **Author:** Raul Fernandez-Diaz
- **Requires Python:** >=3.9
- **License:** MIT
- **Keywords:** hestia

<div align="center">
  <h1>Hestia</h1>

  <p>Computational tool for generating generalisation-evaluating evaluation sets.</p>
  
  <a href="https://ibm.github.io/Hestia-OOD/"><img alt="Tutorials" src="https://img.shields.io/badge/docs-tutorials-green" /></a>
  <a href="https://github.com/IBM/Hestia-OOD/blob/main/LICENSE"><img alt="GitHub" src="https://img.shields.io/github/license/IBM/Hestia-OOD" /></a>
  <a href="https://pypi.org/project/hestia-ood/"><img src="https://img.shields.io/pypi/v/hestia-ood" /></a>
  <a href="https://pypi.org/project/hestia-ood/"><img src="https://img.shields.io/pypi/dm/hestia-ood" /></a>

</div>

- **Documentation:**  <a href="https://ibm.github.io/Hestia-OOD/" target="_blank">https://ibm.github.io/Hestia-OOD</a>
- **Source Code:** <a href="https://github.com/IBM/Hestia-OOD" target="_blank">https://github.com/IBM/Hestia-OOD</a>
- **Paper pre-print:** <a href="https://www.biorxiv.org/content/10.1101/2024.03.14.584508" target="_blank">https://www.biorxiv.org/content/10.1101/2024.03.14.584508</a>

## Contents

<details open markdown="1"><summary><b>Table of Contents</b></summary>

- [Installation Guide](#installation)
- [Documentation](#documentation)
- [Examples](#examples)
- [License](#license)
 </details>


 ## Installation <a name="installation"></a>

Installing in a conda environment is recommended. To create the environment, run:

```bash
conda create -n hestia python
conda activate hestia
```

### 1. Python Package

#### 1.1. From PyPI


```bash
pip install hestia-ood
```

#### 1.2. Directly from source

```bash
pip install git+https://github.com/IBM/Hestia-OOD
```

### 3. Optional dependencies

#### 3.1. Molecular similarity

RDKit is required for calculating molecular similarities:

```bash
pip install rdkit
```

#### 3.2. Sequence alignment

  - MMSeqs2 [https://github.com/steineggerlab/mmseqs2](https://github.com/steineggerlab/mmseqs2)
  ```bash
  # static build with AVX2 (fastest) (check using: cat /proc/cpuinfo | grep avx2)
  wget https://mmseqs.com/latest/mmseqs-linux-avx2.tar.gz; tar xvfz mmseqs-linux-avx2.tar.gz; export PATH=$(pwd)/mmseqs/bin/:$PATH

  # static build with SSE4.1  (check using: cat /proc/cpuinfo | grep sse4)
  wget https://mmseqs.com/latest/mmseqs-linux-sse41.tar.gz; tar xvfz mmseqs-linux-sse41.tar.gz; export PATH=$(pwd)/mmseqs/bin/:$PATH

  # static build with SSE2 (slowest, for very old systems)  (check using: cat /proc/cpuinfo | grep sse2)
  wget https://mmseqs.com/latest/mmseqs-linux-sse2.tar.gz; tar xvfz mmseqs-linux-sse2.tar.gz; export PATH=$(pwd)/mmseqs/bin/:$PATH

  # MacOS
  brew install mmseqs2  
  ```

  To use Needleman-Wunsch, install EMBOSS with either:

  ```bash
  conda install -c bioconda emboss
  ```
  or

  ```bash
  sudo apt install emboss
  ```


- Windows: Download binaries from [EMBOSS](https://emboss.sourceforge.net/download/) and [MMSeqs2-latest](https://mmseqs.com/latest/mmseqs-win64.zip)


#### 3.3. Structure alignment 

  - To use Foldseek [https://github.com/steineggerlab/foldseek](https://github.com/steineggerlab/foldseek):

  ```bash
  # Linux AVX2 build (check using: cat /proc/cpuinfo | grep avx2)
  wget https://mmseqs.com/foldseek/foldseek-linux-avx2.tar.gz; tar xvzf foldseek-linux-avx2.tar.gz; export PATH=$(pwd)/foldseek/bin/:$PATH

  # Linux SSE2 build (check using: cat /proc/cpuinfo | grep sse2)
  wget https://mmseqs.com/foldseek/foldseek-linux-sse2.tar.gz; tar xvzf foldseek-linux-sse2.tar.gz; export PATH=$(pwd)/foldseek/bin/:$PATH

  # Linux ARM64 build
  wget https://mmseqs.com/foldseek/foldseek-linux-arm64.tar.gz; tar xvzf foldseek-linux-arm64.tar.gz; export PATH=$(pwd)/foldseek/bin/:$PATH

  # MacOS
  wget https://mmseqs.com/foldseek/foldseek-osx-universal.tar.gz; tar xvzf foldseek-osx-universal.tar.gz; export PATH=$(pwd)/foldseek/bin/:$PATH
  ```
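
Once the optional tools above are installed, it can help to confirm that they are visible from the Python environment before running a long job. The following is a minimal sketch, not part of Hestia: it assumes the executables are named `mmseqs`, `needle` (the EMBOSS Needleman-Wunsch program), and `foldseek`, and that RDKit is importable as `rdkit`. The names `OPTIONAL_TOOLS` and `check_optional_tools` are hypothetical.

```python
import importlib.util
import shutil

# Assumed executable names for the optional command-line tools
OPTIONAL_TOOLS = {
    'MMSeqs2': 'mmseqs',
    'EMBOSS (Needleman-Wunsch)': 'needle',
    'Foldseek': 'foldseek',
}


def check_optional_tools(tools=OPTIONAL_TOOLS):
    """Return {name: True/False} for each optional dependency found."""
    # Command-line tools are located by searching the PATH
    status = {name: shutil.which(exe) is not None
              for name, exe in tools.items()}
    # RDKit is a Python package, so look for an importable module instead
    status['RDKit'] = importlib.util.find_spec('rdkit') is not None
    return status


if __name__ == '__main__':
    for name, found in check_optional_tools().items():
        print(f"{name}: {'found' if found else 'missing'}")
```

A tool reported as missing usually means the `export PATH=...` line was run in a different shell session or the environment is not activated.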


## Documentation <a name="documentation"></a>

### 1. DatasetGenerator

The `HestiaDatasetGenerator` generates training/validation/evaluation partitions at different similarity thresholds, which enables estimating how well a model generalises. It also calculates the ABOID (area between the out-of-distribution similarity-performance curve and the in-distribution performance).

```python
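import pandas as pd
from hestia.dataset_generator import HestiaDatasetGenerator, SimilarityArguments

# Load the dataset; it needs a column describing the entities (here, `sequence`)
df = pd.read_csv('example.csv')

# Initialise the generator for a DataFrame
generator = HestiaDatasetGenerator(df)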

# Define the similarity arguments (for more info see the documentation page https://ibm.github.io/Hestia-OOD/datasetgenerator)
args = SimilarityArguments(
    data_type='protein', field_name='sequence',
    similarity_metric='mmseqs2+prefilter', verbose=3
)

# Calculate the similarity
generator.calculate_similarity(args)

# Calculate partitions
generator.calculate_partitions(min_threshold=0.3,
                               threshold_step=0.05,
                               test_size=0.2, valid_size=0.1)

# Save partitions
generator.save_precalculated('precalculated_partitions.gz')

# Load pre-calculated partitions
generator.from_precalculated('precalculated_partitions.gz')

# Training code

for threshold, partition in generator.get_partitions():
    train = df.iloc[partition['train']]
    valid = df.iloc[partition['valid']]
    test = df.iloc[partition['test']]

# ...

# Calculate AU-GOOD
generator.calculate_augood(results, 'test_mcc')

# Plot GOOD
generator.plot_good(results, 'test_mcc')

# Compare two models
results = {'model A': [values_A], 'model B': [values_B]}
generator.compare_models(results, statistical_test='wilcoxon')
```
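
To make the "area between curves" idea concrete, here is a small standalone sketch of how such an area could be computed with the trapezoidal rule. It is not Hestia's implementation and does not use the library. The function names and all numbers are hypothetical, and the in-distribution performance is assumed to be a single constant score.

```python
def trapezoid_area(xs, ys):
    """Area under a piecewise-linear curve (trapezoidal rule)."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for x0, x1, y0, y1 in zip(xs, xs[1:], ys, ys[1:]))


def area_between_ood_and_id(thresholds, ood_scores, id_score):
    """Area between a constant in-distribution score and the OOD curve."""
    gaps = [id_score - score for score in ood_scores]
    return trapezoid_area(thresholds, gaps)


# Hypothetical values, for illustration only
thresholds = [0.3, 0.4, 0.5, 0.6]    # similarity thresholds of the partitions
ood_mcc = [0.50, 0.60, 0.70, 0.80]   # test MCC obtained at each threshold
id_mcc = 0.80                        # MCC on an in-distribution split

aboid = area_between_ood_and_id(thresholds, ood_mcc, id_mcc)
print(f"Area between ID and OOD performance: {aboid:.3f}")
```

Under this reading, a smaller area means the model loses less performance as the evaluation set becomes less similar to the training set.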

### 2. Similarity calculation

Pairwise similarity between the entities within a DataFrame `df_query`, or between two DataFrames `df_query` and `df_target`, can be calculated with the functions in `hestia.similarity`, for example `sequence_similarity_mmseqs`:

```python
from hestia.similarity import sequence_similarity_mmseqs
import pandas as pd

df_query = pd.read_csv('example.csv')

# The CSV file needs to have a column describing the entities, i.e., their sequence, their SMILES, or a path to their PDB structure.
# This column corresponds to `field_name` in the function.

sim_df = sequence_similarity_mmseqs(df_query, field_name='sequence', prefilter=True)
```

More details about similarity calculation can be found in the [Similarity calculation documentation](https://ibm.github.io/Hestia-OOD/similarity/).
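
As an illustration of what a pairwise similarity table contains, the sketch below builds one for a few toy sequences using only the standard library. It swaps in `difflib.SequenceMatcher` as a stand-in similarity measure, so it is not an alignment and not what MMSeqs2 or Hestia compute. The `query`/`target`/`metric` keys and the function name are assumptions made for this example.

```python
from difflib import SequenceMatcher
from itertools import combinations


def toy_pairwise_similarity(sequences, threshold=0.0):
    """Return one row per pair of sequences whose similarity >= threshold."""
    rows = []
    for i, j in combinations(range(len(sequences)), 2):
        # Stand-in similarity in [0, 1]; not a sequence alignment score
        score = SequenceMatcher(None, sequences[i], sequences[j]).ratio()
        if score >= threshold:
            rows.append({'query': i, 'target': j, 'metric': score})
    return rows


sequences = ['MKTAYIAK', 'MKTAYIAR', 'GGSGGSGG']  # toy sequences
sim_rows = toy_pairwise_similarity(sequences, threshold=0.3)
for row in sim_rows:
    print(row)
```

Storing only the pairs above a threshold keeps the table sparse, which matters when the dataset has many entities.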

### 3. Clustering

Clustering the entities within a DataFrame `df` can be achieved through the `generate_clusters` function:

```python
from hestia.similarity import sequence_similarity_mmseqs
from hestia.clustering import generate_clusters
import pandas as pd

df = pd.read_csv('example.csv')
sim_df = sequence_similarity_mmseqs(df, field_name='sequence')
clusters_df = generate_clusters(df, field_name='sequence', sim_df=sim_df,
                                cluster_algorithm='CDHIT')
```

Three clustering algorithms are currently supported: `CDHIT`, `greedy_cover_set`, and `connected_components`. More details about clustering can be found in the [Clustering documentation](https://ibm.github.io/Hestia-OOD/clustering/).
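
For intuition about the connected-components idea, the sketch below groups items that are linked, directly or transitively, by a similarity edge. It is a generic union-find illustration, not Hestia's implementation. The edge list is hypothetical and is assumed to contain only pairs whose similarity is already above the chosen threshold.

```python
def connected_components(n_items, edges):
    """Label each item with the smallest index in its connected component."""
    parent = list(range(n_items))

    def find(i):
        # Follow parent links to the root, shortening the path on the way
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller index as the representative of the merged set
            parent[max(root_a, root_b)] = min(root_a, root_b)

    return [find(i) for i in range(n_items)]


# Hypothetical pairs of entities that are similar above the threshold
edges = [(0, 1), (1, 2), (3, 4)]
labels = connected_components(6, edges)
print(labels)
```

Items 0 and 2 end up in the same cluster even though they are not directly linked, because both are similar to item 1.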


### 4. Partitioning

Partitioning the entities within a DataFrame `df` into training and evaluation subsets can be achieved through four different functions: `ccpart`, `graph_part`, `reduction_partition`, and `random_partition`. An example of how `ccpart` would be used is:

```python
from hestia.similarity import sequence_similarity_mmseqs
from hestia.partition import ccpart
import pandas as pd

df = pd.read_csv('example.csv')
sim_df = sequence_similarity_mmseqs(df, field_name='sequence')
train, test = ccpart(df, threshold=0.3, test_size=0.2, sim_df=sim_df)

train_df = df.iloc[train, :]
test_df = df.iloc[test, :]
```
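
One way to sanity-check a partition is to confirm that no training entity is more similar than the threshold to any evaluation entity. The sketch below does this for a toy similarity list. It is an illustration only, not a Hestia function: `find_leaks`, the tuple layout `(query, target, score)`, and all values are hypothetical.

```python
def find_leaks(train, test, sim_rows, threshold):
    """Return similarity rows that link train and test above the threshold."""
    train_set, test_set = set(train), set(test)
    leaks = []
    for query, target, score in sim_rows:
        # A pair "crosses" when one member is in train and the other in test
        crosses = ((query in train_set and target in test_set) or
                   (query in test_set and target in train_set))
        if crosses and score > threshold:
            leaks.append((query, target, score))
    return leaks


# Hypothetical pairwise similarities: (query index, target index, score)
sim_rows = [(0, 1, 0.9), (1, 2, 0.2), (3, 4, 0.6)]

good_split = find_leaks([0, 1, 3, 4], [2], sim_rows, threshold=0.3)
bad_split = find_leaks([0, 3, 4], [1, 2], sim_rows, threshold=0.3)
print(good_split)  # no cross-partition pair above the threshold
print(bad_split)   # entities 0 and 1 are similar but were separated
```

An empty result means the evaluation subset is independent of the training subset at that threshold, under the similarity measure used to build the list.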

## License <a name="license"></a>

Hestia is open-source software licensed under the MIT License. Check the details in the [LICENSE](https://github.com/IBM/Hestia-OOD/blob/main/LICENSE) file.




            
