Name | hestia-ood |
Version | 0.0.34 |
home_page | https://github.com/IBM/Hestia-OOD |
Summary | Suite of tools for analysing the independence between training and evaluation biosequence datasets and to generate new generalisation-evaluating hold-out partitions |
upload_time | 2024-11-11 15:33:45 |
maintainer | None |
docs_url | None |
author | Raul Fernandez-Diaz |
requires_python | >=3.9 |
license | MIT |
keywords | hestia |
requirements | No requirements were recorded. |
Travis-CI | No Travis. |
coveralls test coverage | No coveralls. |
<div align="center">
<h1>Hestia</h1>
<p>Computational tool for generating generalisation-evaluating evaluation sets.</p>
<a href="https://ibm.github.io/Hestia-OOD/"><img alt="Tutorials" src="https://img.shields.io/badge/docs-tutorials-green" /></a>
<a href="https://github.com/IBM/Hestia-OOD/blob/main/LICENSE"><img alt="GitHub" src="https://img.shields.io/github/license/IBM/Hestia-OOD" /></a>
<a href="https://pypi.org/project/hestia-ood/"><img src="https://img.shields.io/pypi/v/hestia-ood" /></a>
<a href="https://pypi.org/project/hestia-ood/"><img src="https://img.shields.io/pypi/dm/hestia-ood" /></a>
</div>
- **Documentation:** <a href="https://ibm.github.io/Hestia-OOD/" target="_blank">https://ibm.github.io/Hestia-OOD</a>
- **Source Code:** <a href="https://github.com/IBM/Hestia-OOD" target="_blank">https://github.com/IBM/Hestia-OOD</a>
- **Paper pre-print:** <a href="https://www.biorxiv.org/content/10.1101/2024.03.14.584508" target="_blank">https://www.biorxiv.org/content/10.1101/2024.03.14.584508</a>
## Contents
<details open markdown="1"><summary><b>Table of Contents</b></summary>
- [Installation Guide](#installation)
- [Documentation](#documentation)
- [Examples](#examples)
- [License](#license)
</details>
## Installation <a name="installation"></a>
Installing in a conda environment is recommended. To create the environment, run:
```bash
conda create -n hestia python
conda activate hestia
```
### 1. Python Package
#### 1.1. From PyPI
```bash
pip install hestia-ood
```
#### 1.2. Directly from source
```bash
pip install git+https://github.com/IBM/Hestia-OOD
```
### 3. Optional dependencies
#### 3.1. Molecular similarity
RDKit is required for calculating molecular similarities:
```bash
pip install rdkit
```
#### 3.2. Sequence alignment
- MMSeqs2 [https://github.com/steineggerlab/mmseqs2](https://github.com/steineggerlab/mmseqs2)
```bash
# static build with AVX2 (fastest) (check using: cat /proc/cpuinfo | grep avx2)
wget https://mmseqs.com/latest/mmseqs-linux-avx2.tar.gz; tar xvfz mmseqs-linux-avx2.tar.gz; export PATH=$(pwd)/mmseqs/bin/:$PATH
# static build with SSE4.1 (check using: cat /proc/cpuinfo | grep sse4)
wget https://mmseqs.com/latest/mmseqs-linux-sse41.tar.gz; tar xvfz mmseqs-linux-sse41.tar.gz; export PATH=$(pwd)/mmseqs/bin/:$PATH
# static build with SSE2 (slowest, for very old systems) (check using: cat /proc/cpuinfo | grep sse2)
wget https://mmseqs.com/latest/mmseqs-linux-sse2.tar.gz; tar xvfz mmseqs-linux-sse2.tar.gz; export PATH=$(pwd)/mmseqs/bin/:$PATH
# MacOS
brew install mmseqs2
```
To use Needleman-Wunsch, install EMBOSS with either:
```bash
conda install -c bioconda emboss
```
or
```bash
sudo apt install emboss
```
- Windows: Download binaries from [EMBOSS](https://emboss.sourceforge.net/download/) and [MMSeqs2-latest](https://mmseqs.com/latest/mmseqs-win64.zip)
#### 3.3. Structure alignment
- To use Foldseek [https://github.com/steineggerlab/foldseek](https://github.com/steineggerlab/foldseek):
```bash
# Linux AVX2 build (check using: cat /proc/cpuinfo | grep avx2)
wget https://mmseqs.com/foldseek/foldseek-linux-avx2.tar.gz; tar xvzf foldseek-linux-avx2.tar.gz; export PATH=$(pwd)/foldseek/bin/:$PATH
# Linux SSE2 build (check using: cat /proc/cpuinfo | grep sse2)
wget https://mmseqs.com/foldseek/foldseek-linux-sse2.tar.gz; tar xvzf foldseek-linux-sse2.tar.gz; export PATH=$(pwd)/foldseek/bin/:$PATH
# Linux ARM64 build
wget https://mmseqs.com/foldseek/foldseek-linux-arm64.tar.gz; tar xvzf foldseek-linux-arm64.tar.gz; export PATH=$(pwd)/foldseek/bin/:$PATH
# MacOS
wget https://mmseqs.com/foldseek/foldseek-osx-universal.tar.gz; tar xvzf foldseek-osx-universal.tar.gz; export PATH=$(pwd)/foldseek/bin/:$PATH
```
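Once the optional tools are installed, it can help to confirm that their executables are visible on `PATH` before running Hestia. The snippet below is a minimal sketch using only the Python standard library. The executable names (`mmseqs`, `foldseek`, and EMBOSS `needle`) are assumed from the install commands above, and `find_optional_tools` is a hypothetical helper, not part of Hestia.

```python
import shutil

# Executable names assumed from the install commands above;
# adjust them if your installation uses different names.
OPTIONAL_TOOLS = {
    "MMSeqs2": "mmseqs",
    "EMBOSS (Needleman-Wunsch)": "needle",
    "Foldseek": "foldseek",
}


def find_optional_tools(tools=OPTIONAL_TOOLS):
    """Map each tool label to its resolved path, or None if not on PATH."""
    return {label: shutil.which(exe) for label, exe in tools.items()}


if __name__ == "__main__":
    for label, path in find_optional_tools().items():
        print(f"{label}: {path or 'not found on PATH'}")
```

A tool reported as not found usually means the `export PATH=...` line was run in a different shell session; adding it to your shell profile makes it persistent.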
## Documentation <a name="documentation"></a>
### 1. DatasetGenerator
The `HestiaDatasetGenerator` allows for the easy generation of training/validation/evaluation partitions at different similarity thresholds, enabling the estimation of model generalisation capabilities. It also allows for the calculation of the ABOID (the area between the out-of-distribution similarity-performance curve and the in-distribution performance).
```python
import pandas as pd
from hestia.dataset_generator import HestiaDatasetGenerator, SimilarityArguments

# Load the dataset; it needs a column describing the entities (here, 'sequence')
df = pd.read_csv('example.csv')

# Initialise the generator for a DataFrame
generator = HestiaDatasetGenerator(df)

# Define the similarity arguments (for more info see the documentation page https://ibm.github.io/Hestia-OOD/datasetgenerator)
args = SimilarityArguments(
    data_type='protein', field_name='sequence',
    similarity_metric='mmseqs2+prefilter', verbose=3
)

# Calculate the similarity
generator.calculate_similarity(args)

# Calculate partitions
generator.calculate_partitions(min_threshold=0.3,
                               threshold_step=0.05,
                               test_size=0.2, valid_size=0.1)

# Save partitions
generator.save_precalculated('precalculated_partitions.gz')

# Load pre-calculated partitions
generator.from_precalculated('precalculated_partitions.gz')

# Training code
for threshold, partition in generator.get_partitions():
    train = df.iloc[partition['train']]
    valid = df.iloc[partition['valid']]
    test = df.iloc[partition['test']]

# ...

# Calculate AU-GOOD (`results` is a placeholder for the metrics collected by your training code)
generator.calculate_augood(results, 'test_mcc')

# Plot GOOD
generator.plot_good(results, 'test_mcc')

# Compare two models (`values_A` and `values_B` are placeholders for each model's scores)
results = {'model A': [values_A], 'model B': [values_B]}
generator.compare_models(results, statistical_test='wilcoxon')
```
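To make the area-based metric described above concrete, the sketch below integrates the gap between a constant in-distribution score and the out-of-distribution scores measured at each similarity threshold, using the trapezoidal rule. It illustrates the idea only and is not Hestia's implementation; the function name `area_between_curves` and the scores are made up for the example.

```python
def area_between_curves(thresholds, ood_scores, id_score):
    """Trapezoidal area between a constant in-distribution score and
    the out-of-distribution scores observed at each threshold."""
    if len(thresholds) != len(ood_scores):
        raise ValueError("thresholds and ood_scores must have equal length")
    # Sort by threshold so the integration runs left to right.
    points = sorted(zip(thresholds, ood_scores))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        gap0, gap1 = id_score - y0, id_score - y1
        area += (gap0 + gap1) / 2 * (x1 - x0)
    return area


# Made-up scores (e.g. MCC) at three similarity thresholds.
thresholds = [0.3, 0.4, 0.5]
ood_scores = [0.5, 0.6, 0.7]
print(area_between_curves(thresholds, ood_scores, id_score=0.8))
```

Under this definition, a smaller area means the model loses less performance as the evaluation data becomes less similar to the training data.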
### 2. Similarity calculation
Calculating pairwise similarity between the entities within a DataFrame `df_query`, or between two DataFrames `df_query` and `df_target`, can be achieved through the functions in `hestia.similarity`, such as `sequence_similarity_mmseqs`:
```python
from hestia.similarity import sequence_similarity_mmseqs
import pandas as pd
df_query = pd.read_csv('example.csv')
# The CSV file needs to have a column describing the entities, i.e., their sequence, their SMILES, or a path to their PDB structure.
# This column corresponds to `field_name` in the function.
sim_df = sequence_similarity_mmseqs(df_query, field_name='sequence', prefilter=True)
```
More details about similarity calculation can be found in the [Similarity calculation documentation](https://ibm.github.io/Hestia-OOD/similarity/).
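For intuition about what a pairwise similarity table contains, the sketch below scores every pair of sequences with Python's built-in `difflib.SequenceMatcher`. This is a stand-in chosen because it needs no external tools: it is not an alignment-based identity, it is not what MMSeqs2 computes, and the field names (`query`, `target`, `metric`) are assumptions for illustration rather than Hestia's documented output format.

```python
from difflib import SequenceMatcher
from itertools import combinations


def toy_pairwise_similarity(sequences):
    """Return one record per unordered pair of sequences.

    `query` and `target` are positional indices into `sequences`;
    `metric` is difflib's similarity ratio in [0, 1].
    """
    records = []
    for (i, seq_a), (j, seq_b) in combinations(enumerate(sequences), 2):
        ratio = SequenceMatcher(None, seq_a, seq_b, autojunk=False).ratio()
        records.append({"query": i, "target": j, "metric": ratio})
    return records


sequences = ["MKTAYIAK", "MKTAYIAR", "GGGGSGGG"]
for record in toy_pairwise_similarity(sequences):
    print(record)
```

The list of records can be passed to `pandas.DataFrame` if a tabular view is preferred.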
### 3. Clustering
Clustering the entities within a DataFrame `df` can be achieved through the `generate_clusters` function:
```python
from hestia.similarity import sequence_similarity_mmseqs
from hestia.clustering import generate_clusters
import pandas as pd
df = pd.read_csv('example.csv')
sim_df = sequence_similarity_mmseqs(df, field_name='sequence')
clusters_df = generate_clusters(df, field_name='sequence', sim_df=sim_df,
                                cluster_algorithm='CDHIT')
```
Three clustering algorithms are currently supported: `CDHIT`, `greedy_cover_set`, and `connected_components`. More details about clustering can be found in the [Clustering documentation](https://ibm.github.io/Hestia-OOD/clustering/).
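As an illustration of the idea behind a `connected_components`-style clustering, the sketch below links any two entities whose similarity exceeds a threshold and labels each resulting connected group as one cluster, using a small union-find structure. It is a simplified stand-in, not Hestia's implementation, and the `(i, j, similarity)` tuple format is an assumption made for the example.

```python
def connected_component_clusters(n_items, pairs, threshold):
    """Label items so that any pair with similarity > threshold
    ends up (transitively) in the same cluster."""
    parent = list(range(n_items))

    def find(x):
        # Walk to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j, similarity in pairs:
        if similarity > threshold:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[root_j] = root_i

    # Relabel roots as consecutive cluster ids in order of appearance.
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(n_items)]


pairs = [(0, 1, 0.9), (1, 2, 0.8), (2, 3, 0.2), (3, 4, 0.7)]
print(connected_component_clusters(5, pairs, threshold=0.5))  # [0, 0, 0, 1, 1]
```

Items 0 and 2 share a cluster even though they are never compared directly, because both are linked to item 1; this transitivity is what distinguishes connected components from representative-based clustering.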
### 4. Partitioning
Partitioning the entities within a DataFrame `df` into training and evaluation subsets can be achieved through four different functions: `ccpart`, `graph_part`, `reduction_partition`, and `random_partition`. An example of how `ccpart` would be used is:
```python
from hestia.similarity import sequence_similarity_mmseqs
from hestia.partition import ccpart
import pandas as pd
df = pd.read_csv('example.csv')
sim_df = sequence_similarity_mmseqs(df, field_name='sequence')
train, test = ccpart(df, threshold=0.3, test_size=0.2, sim_df=sim_df)
train_df = df.iloc[train, :]
test_df = df.iloc[test, :]
```
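After partitioning, it can be useful to confirm that no training entity is too similar to any test entity. The sketch below performs that check over a list of `(i, j, similarity)` tuples. This input format, the strict `>` comparison, and the helper name `find_leaks` are assumptions for illustration, not part of Hestia's API, so adapt the access pattern to the similarity DataFrame you actually obtain.

```python
def find_leaks(train, test, pairs, threshold):
    """Return the pairs that connect train and test with a
    similarity above `threshold`."""
    train_set, test_set = set(train), set(test)
    leaks = []
    for i, j, similarity in pairs:
        crosses = (i in train_set and j in test_set) or (
            i in test_set and j in train_set
        )
        if crosses and similarity > threshold:
            leaks.append((i, j, similarity))
    return leaks


pairs = [(0, 1, 0.9), (1, 2, 0.35), (2, 3, 0.1)]
print(find_leaks(train=[0, 1], test=[2, 3], pairs=pairs, threshold=0.3))
# [(1, 2, 0.35)]
```

An empty result indicates that, at the chosen threshold, the two partitions are independent with respect to the supplied similarity pairs.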
License <a name="license"></a>
-------
Hestia is open-source software licensed under the MIT License. Check the details in the [LICENSE](https://github.com/IBM/Hestia/blob/master/LICENSE) file.
Raw data
{
"_id": null,
"home_page": "https://github.com/IBM/Hestia-OOD",
"name": "hestia-ood",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": null,
"keywords": "hestia",
"author": "Raul Fernandez-Diaz",
"author_email": "raul.fernandezdiaz@ucdconnect.ie",
"download_url": "https://files.pythonhosted.org/packages/8a/19/c4cf36b9c463b70bcbf5306b485803ea1cc351acc7d95bbcf21054c1afed/hestia_ood-0.0.34.tar.gz",
"platform": null,
"description": "<div align=\"center\">\n <h1>Hestia</h1>\n\n <p>Computational tool for generating generalisation-evaluating evaluation sets.</p>\n \n <a href=\"https://ibm.github.io/Hestia-OOD/\"><img alt=\"Tutorials\" src=\"https://img.shields.io/badge/docs-tutorials-green\" /></a>\n <a href=\"https://github.com/IBM/Hestia-OOD/blob/main/LICENSE\"><img alt=\"GitHub\" src=\"https://img.shields.io/github/license/IBM/Hestia-OOD\" /></a>\n <a href=\"https://pypi.org/project/hestia-ood/\"><img src=\"https://img.shields.io/pypi/v/hestia-ood\" /></a>\n <a href=\"https://pypi.org/project/hestia-ood/\"><img src=\"https://img.shields.io/pypi/dm/hestia-ood\" /></a>\n\n</div>\n\n- **Documentation:** <a href=\"https://ibm.github.io/Hestia-OOD/\" target=\"_blank\">https://ibm.github.io/Hestia-OOD</a>\n- **Source Code:** <a href=\"https://github.com/IBM/Hestia-OOD\" target=\"_blank\">https://github.com/IBM/Hestia-OOD</a>\n- **Paper pre-print:** <a href=\"https://www.biorxiv.org/content/10.1101/2024.03.14.584508\" target=\"_blank\">https://www.biorxiv.org/content/10.1101/2024.03.14.584508</a>\n\n## Contents\n\n<details open markdown=\"1\"><summary><b>Table of Contents</b></summary>\n\n- [Intallation Guide](#installation)\n- [Documentation](#documentation)\n- [Examples](#examples)\n- [License](#license)\n </details>\n\n\n ## Installation <a name=\"installation\"></a>\n\nInstalling in a conda environment is recommended. For creating the environment, please run:\n\n```bash\nconda create -n hestia python\nconda activate hestia\n```\n\n### 1. Python Package\n\n#### 1.1.From PyPI\n\n\n```bash\npip install hestia-ood\n```\n\n#### 1.2. Directly from source\n\n```bash\npip install git+https://github.com/IBM/Hestia-OOD\n```\n\n### 3. Optional dependencies\n\n#### 3.1. Molecular similarity\n\nRDKit is a dependency necessary for calculating molecular similarities:\n\n```bash\npip install rdkit\n```\n\n#### 3.2. 
Sequence alignment\n\n - MMSeqs2 [https://github.com/steineggerlab/mmseqs2](https://github.com/steineggerlab/mmseqs2)\n ```bash\n # static build with AVX2 (fastest) (check using: cat /proc/cpuinfo | grep avx2)\n wget https://mmseqs.com/latest/mmseqs-linux-avx2.tar.gz; tar xvfz mmseqs-linux-avx2.tar.gz; export PATH=$(pwd)/mmseqs/bin/:$PATH\n\n # static build with SSE4.1 (check using: cat /proc/cpuinfo | grep sse4)\n wget https://mmseqs.com/latest/mmseqs-linux-sse41.tar.gz; tar xvfz mmseqs-linux-sse41.tar.gz; export PATH=$(pwd)/mmseqs/bin/:$PATH\n\n # static build with SSE2 (slowest, for very old systems) (check using: cat /proc/cpuinfo | grep sse2)\n wget https://mmseqs.com/latest/mmseqs-linux-sse2.tar.gz; tar xvfz mmseqs-linux-sse2.tar.gz; export PATH=$(pwd)/mmseqs/bin/:$PATH\n\n # MacOS\n brew install mmseqs2 \n ```\n\n To use Needleman-Wunch, either:\n\n ```bash\n conda install -c bioconda emboss\n ```\n or\n\n ```bash\n sudo apt install emboss\n ```\n\n\n- Windows: Download binaries from [EMBOSS](https://emboss.sourceforge.net/download/) and [MMSeqs2-latest](https://mmseqs.com/latest/mmseqs-win64.zip)\n\n\n#### 3.3. 
Structure alignment \n\n - To use Foldseek [https://github.com/steineggerlab/foldseek](https://github.com/steineggerlab/foldseek):\n\n ```bash\n # Linux AVX2 build (check using: cat /proc/cpuinfo | grep avx2)\n wget https://mmseqs.com/foldseek/foldseek-linux-avx2.tar.gz; tar xvzf foldseek-linux-avx2.tar.gz; export PATH=$(pwd)/foldseek/bin/:$PATH\n\n # Linux SSE2 build (check using: cat /proc/cpuinfo | grep sse2)\n wget https://mmseqs.com/foldseek/foldseek-linux-sse2.tar.gz; tar xvzf foldseek-linux-sse2.tar.gz; export PATH=$(pwd)/foldseek/bin/:$PATH\n\n # Linux ARM64 build\n wget https://mmseqs.com/foldseek/foldseek-linux-arm64.tar.gz; tar xvzf foldseek-linux-arm64.tar.gz; export PATH=$(pwd)/foldseek/bin/:$PATH\n\n # MacOS\n wget https://mmseqs.com/foldseek/foldseek-osx-universal.tar.gz; tar xvzf foldseek-osx-universal.tar.gz; export PATH=$(pwd)/foldseek/bin/:$PATH\n ```\n\n\n## Documentation <a name=\"documentation\"></a>\n\n### 1. DatasetGenerator\n\nThe HestiaDatasetGenerator allows for the easy generation of training/validation/evaluation partitions with different similarity thresholds. Enabling the estimation of model generalisation capabilities. 
It also allows for the calculation of the ABOID (Area between the similarity-performance curve (Out-of-distribution) and the In-distribution performance).\n\n```python\nfrom hestia.dataset_generator import HestiaDatasetGenerator, SimilarityArguments\n\n# Initialise the generator for a DataFrame\ngenerator = HestiaDatasetGenerator(df)\n\n# Define the similarity arguments (for more info see the documentation page https://ibm.github.io/Hestia-OOD/datasetgenerator)\nargs = SimilarityArguments(\n data_type='protein', field_name='sequence',\n similarity_metric='mmseqs2+prefilter', verbose=3\n)\n\n# Calculate the similarity\ngenerator.calculate_similarity(args)\n\n# Calculate partitions\ngenerator.calculate_partitions(min_threshold=0.3,\n threshold_step=0.05,\n test_size=0.2, valid_size=0.1)\n\n# Save partitions\ngenerator.save_precalculated('precalculated_partitions.gz')\n\n# Load pre-calculated partitions\ngenerator.from_precalculated('precalculated_partitions.gz')\n\n# Training code\n\nfor threshold, partition in generator.get_partitions():\n train = df.iloc[partition['train']]\n valid = df.iloc[partition['valid']]\n test = df.iloc[partition['test']]\n\n# ...\n\n# Calculate AU-GOOD\ngenerator.calculate_augood(results, 'test_mcc')\n\n# Plot GOOD\ngenerator.plot_good(results, 'test_mcc')\n\n# Compare two models\nresults = {'model A': [values_A], 'model B': [values_B]}\ngenerator.compare_models(results, statistical_test='wilcoxon')\n```\n\n### 2. 
Similarity calculation\n\nCalculating pairwise similarity between the entities within a DataFrame `df_query` or between two DataFrames `df_query` and `df_target` can be achieved through the `calculate_similarity` function:\n\n```python\nfrom hestia.similarity import sequence_similarity_mmseqs\nimport pandas as pd\n\ndf_query = pd.read_csv('example.csv')\n\n# The CSV file needs to have a column describing the entities, i.e., their sequence, their SMILES, or a path to their PDB structure.\n# This column corresponds to `field_name` in the function.\n\nsim_df = sequence_similarity_mmseqs(df_query, field_name='sequence', prefilter=True)\n```\n\nMore details about similarity calculation can be found in the [Similarity calculation documentation](https://ibm.github.io/Hestia-OOD/similarity/).\n\n### 3. Clustering\n\nClustering the entities within a DataFrame `df` can be achieved through the `generate_clusters` function:\n\n```python\nfrom hestia.similarity import sequence_similarity_mmseqs\nfrom hestia.clustering import generate_clusters\nimport pandas as pd\n\ndf = pd.read_csv('example.csv')\nsim_df = sequence_similarity_mmseqs(df, field_name='sequence')\nclusters_df = generate_clusters(df, field_name='sequence', sim_df=sim_df,\n cluster_algorithm='CDHIT')\n```\n\nThere are three clustering algorithms currently supported: `CDHIT`, `greedy_cover_set`, or `connected_components`. More details about clustering can be found in the [Clustering documentation](https://ibm.github.io/Hestia-OOD/clustering/).\n\n\n### 4. Partitioning\n\nPartitioning the entities within a DataFrame `df` into a training and an evaluation subsets can be achieved through 4 different functions: `ccpart`, `graph_part`, `reduction_partition`, and `random_partition`. 
An example of how `cc_part` would be used is:\n\n```python\nfrom hestia.similarity import sequence_similarity_mmseqs\nfrom hestia.partition import ccpart\nimport pandas as pd\n\ndf = pd.read_csv('example.csv')\nsim_df = sequence_similarity_mmseqs(df, field_name='sequence')\ntrain, test = cc_part(df, threshold=0.3, test_size=0.2, sim_df=sim_df)\n\ntrain_df = df.iloc[train, :]\ntest_df = df.iloc[test, :]\n```\n\nLicense <a name=\"license\"></a>\n-------\nHestia is an open-source software licensed under the MIT Clause License. Check the details in the [LICENSE](https://github.com/IBM/Hestia/blob/master/LICENSE) file.\n\n\n\n",
"bugtrack_url": null,
"license": "MIT",
"summary": "Suite of tools for analysing the independence between training and evaluation biosequence datasets and to generate new generalisation-evaluating hold-out partitions",
"version": "0.0.34",
"project_urls": {
"Homepage": "https://github.com/IBM/Hestia-OOD"
},
"split_keywords": [
"hestia"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "d4835d93f82aa52013f4206d1da00a357c4c02305d1c50d5b61452855acb1d74",
"md5": "1a9767c0ff8eeeefbeff0e9e419e0d07",
"sha256": "8dd2814426a169deec2fb2ea05cebb56592f7b8faa60698f2bcc79e233e1a524"
},
"downloads": -1,
"filename": "hestia_ood-0.0.34-py3-none-any.whl",
"has_sig": false,
"md5_digest": "1a9767c0ff8eeeefbeff0e9e419e0d07",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9",
"size": 30567,
"upload_time": "2024-11-11T15:33:43",
"upload_time_iso_8601": "2024-11-11T15:33:43.626508Z",
"url": "https://files.pythonhosted.org/packages/d4/83/5d93f82aa52013f4206d1da00a357c4c02305d1c50d5b61452855acb1d74/hestia_ood-0.0.34-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "8a19c4cf36b9c463b70bcbf5306b485803ea1cc351acc7d95bbcf21054c1afed",
"md5": "a012e1726da968efe9b3895a182586f7",
"sha256": "a58638521184d759688292a9d0aad19b4ab5bc009d1e2309d965e7e860b4e938"
},
"downloads": -1,
"filename": "hestia_ood-0.0.34.tar.gz",
"has_sig": false,
"md5_digest": "a012e1726da968efe9b3895a182586f7",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9",
"size": 31492,
"upload_time": "2024-11-11T15:33:45",
"upload_time_iso_8601": "2024-11-11T15:33:45.386208Z",
"url": "https://files.pythonhosted.org/packages/8a/19/c4cf36b9c463b70bcbf5306b485803ea1cc351acc7d95bbcf21054c1afed/hestia_ood-0.0.34.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-11-11 15:33:45",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "IBM",
"github_project": "Hestia-OOD",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "hestia-ood"
}