<span style="font-size:2em;">**Empathi**</span><br>
<span style="font-size:1.15em;">**Embedding-based Phage Protein Annotation Tool by Hierarchical Assignment**</span>
<!-- TABLE OF CONTENTS -->
<details>
<summary>Table of Contents</summary>
<ol>
<li>
<a href="#about-the-project">About the Project</a>
</li>
<li>
<a href="#getting-started">Getting Started</a>
<ul>
<li><a href="#prerequisites">Prerequisites</a></li>
<li><a href="#installation">Installation</a></li>
</ul>
</li>
<li><a href="#usage-details">Usage details</a></li>
</ol>
</details>
## About the Project
Empathi is a tool for predicting bacteriophage protein functions. It uses ProtT5
protein embeddings to make predictions. In addition, new functional groups were defined that are better suited to
machine learning than the often-overlapping [PHROG](https://phrogs.lmge.uca.fr/) categories.
A preprint is available [here](https://doi.org/10.1101/2024.12.31.630607).
## Getting Started
Empathi is available on [PyPI](https://pypi.org/project/empathi/) and as an
[Apptainer container](https://cloud.sylabs.io/library/alexandreboulay/empathi/empathi) for ease of use. \
The source code can also be downloaded from [HuggingFace](https://huggingface.co/AlexandreBoulay/empathi).
### Prerequisites
A GPU is recommended for large datasets.
The full list of dependencies and versions can be found in [requirements.txt](https://huggingface.co/AlexandreBoulay/EmPATHi/blob/main/requirements.txt).
Either git-lfs or Apptainer will be required. See instructions below.
Other dependencies are taken care of by pip and Apptainer.
```
python/3.11.5
joblib==1.2.0
numpy==1.26.4
pandas==2.2.1
torch==2.3.0
scipy==1.13.1
scikit-learn==1.5.0
transformers==4.43.1
sentencepiece==0.2.0
```
### Installation
There are three ways to install Empathi: through PyPI, as an Apptainer container, or from source code. Installation should take less than 10 minutes.
A small FASTA file is provided to test the installation. This test should run in under 1 minute.
#### 1. PIP
First, create a conda environment with Python 3.11.5.
```
conda create -n empathi_env python=3.11.5
conda activate empathi_env
```
Download the Empathi models.
You will need git-lfs: on WSL or Linux (Debian/Ubuntu), use `sudo apt-get install git-lfs`; on Windows, either use
[Git Bash](https://git-scm.com/downloads) or get it from the [git-lfs releases page](https://github.com/git-lfs/git-lfs/releases). Then:
```
git lfs install
git clone https://huggingface.co/AlexandreBoulay/empathi
export PATH="/path/to/empathi/models:$PATH"
```
Install Empathi (pip also installs its dependencies):
```
pip install empathi
```
Usage
```
empathi input_file name
```
#### 2. Apptainer
Install [Apptainer](https://apptainer.org/docs/admin/main/installation.html) or Singularity. On Windows, this requires a virtual machine;
[WSL](https://learn.microsoft.com/en-us/windows/wsl/install) works well.
Fetch Empathi from [Sylabs](https://cloud.sylabs.io/library/alexandreboulay/empathi/empathi):
```
apptainer pull empathi.sif library://alexandreboulay/empathi/empathi
```
Usage
```
apptainer run empathi.sif path/to/input_file name --confidence 0.95
```
#### 3. From source code
First, create a conda environment with Python 3.11.5.
```
conda create -n empathi_env python=3.11.5
conda activate empathi_env
```
Clone the repository.
You will need git-lfs: on WSL or Linux (Debian/Ubuntu), use `sudo apt-get install git-lfs`; on Windows, either use
[Git Bash](https://git-scm.com/downloads) or get it from the [git-lfs releases page](https://github.com/git-lfs/git-lfs/releases). Then:
```
git lfs install
git clone https://huggingface.co/AlexandreBoulay/empathi
```
Install dependencies:
```
cd empathi
pip install -r requirements.txt
```
Usage
```
python src/empathi/empathi.py input_file name
```
### Usage details
A FASTA file of protein sequences or a file of protein embeddings (.pkl/.csv) can be used as input.
By default, a confidence >0.95 is required to assign a function. A high confidence threshold (--confidence) yields more precise
predictions (lower false positive rate) but lower sensitivity (fewer proteins are assigned a function). If the objective of your
study is to annotate as many proteins as possible, consider a confidence threshold as low as 0.5.
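The precision/sensitivity trade-off can be illustrated with a small sketch. This is not Empathi's code: the function `assign_annotations` and the confidence values are hypothetical, and the flat thresholding shown here ignores the hierarchical assignment that Empathi performs. It only shows how lowering the threshold lets more categories through.

```python
def assign_annotations(confidences, threshold=0.95):
    """Keep the categories whose confidence exceeds the threshold, joined by '|'."""
    kept = [category for category, score in confidences.items() if score > threshold]
    return "|".join(kept)


# Hypothetical per-category confidences for a single protein.
protein = {"PVP": 0.98, "cell wall depolymerase": 0.72, "DNA-associated": 0.005}

strict = assign_annotations(protein, threshold=0.95)   # default threshold
lenient = assign_annotations(protein, threshold=0.5)   # more annotations, less precise

print(strict)   # PVP
print(lenient)  # PVP|cell wall depolymerase
```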
Specifying the option --only_embeddings computes the embeddings only, without functional prediction. This step is much faster with a GPU.
The resulting embeddings file can then be passed back to the same command (without --only_embeddings) as the input file.
Options:
- input_file: Path to the input file containing protein sequences (.fa*) or protein embeddings (.pkl/.csv).
- name: Name of the output file (without extension). Use a different name for each run to avoid overwriting files.
- --models_folder: Path to folder containing EmPATHi models. Can be left unspecified if it was added to PATH earlier.
- --only_embeddings: Whether to only calculate embeddings (no functional prediction).
- --output_folder: Path to the output folder. Default is ./empathi_out/.
- --threads: Number of threads (default 1).
- --confidence: Confidence threshold used to assign predictions (default 0.95).
- --mode: Which types of proteins you want to predict. Accepted arguments are "all", "pvp", "DNA-associated", "adsorption-related", "lysis", "regulator", "cell_wall_depolymerase", "packaging", "RNA-associated", "ejection", "phosphorylation", "transferase", "nucleotide_metabolism", "reductase" and "defense_systems".
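For scripted runs, the options above can be assembled into a command line from Python. The sketch below is only an illustration: `build_empathi_command` is a hypothetical helper, it assumes that --only_embeddings is a switch that takes no value, and the embeddings path passed to the second call is a placeholder, since the name of the file Empathi writes is not specified here.

```python
import shlex


def build_empathi_command(input_file, name, *, confidence=0.95, mode="all",
                          threads=1, output_folder=None, models_folder=None,
                          only_embeddings=False):
    """Build the argument list for the `empathi` command from the documented options."""
    cmd = ["empathi", str(input_file), name,
           "--confidence", str(confidence),
           "--mode", mode,
           "--threads", str(threads)]
    if output_folder is not None:
        cmd += ["--output_folder", str(output_folder)]
    if models_folder is not None:
        cmd += ["--models_folder", str(models_folder)]
    if only_embeddings:
        # Assumption: the option is a flag without an argument.
        cmd.append("--only_embeddings")
    return cmd


# Step 1: compute embeddings only (ideally on a GPU machine).
embed_cmd = build_empathi_command("proteins.faa", "run1_embeddings", only_embeddings=True)

# Step 2: predict from the embeddings file (placeholder path) with a lower threshold.
predict_cmd = build_empathi_command("path/to/embeddings.csv", "run1",
                                    confidence=0.5, mode="lysis")

print(shlex.join(embed_cmd))
print(shlex.join(predict_cmd))
```

Each list can be handed to `subprocess.run(cmd, check=True)` once Empathi is installed; building a list rather than a single string avoids shell-quoting problems with paths that contain spaces.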
### Output format
The output is a CSV file with an annotation column listing all annotations assigned to each protein (separated by "|") and
one column per functional category giving the confidence of each prediction.
Example:
| Annotation |PVP |cell wall depolymerase|DNA-associated|...|
|---------------------------|----|----------------------|--------------|---|
|PVP\|cell wall depolymerase|0.98|0.99 |0.005 |...|
| DNA-associated |0.01|0.05 |0.998 |...|
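A file in this layout can be read with the Python standard library. The sketch below is a minimal example under stated assumptions: the file is comma-delimited, the annotation column is named `Annotation` as in the table above, and every other column holds a numeric confidence. A real output file may contain additional columns (for example a protein identifier) that would need to be handled separately. The inline `example_csv` string stands in for an actual output file.

```python
import csv
import io

# Stand-in for an Empathi output file, mirroring the example table above.
example_csv = """Annotation,PVP,cell wall depolymerase,DNA-associated
PVP|cell wall depolymerase,0.98,0.99,0.005
DNA-associated,0.01,0.05,0.998
"""


def read_predictions(handle):
    """Return one (labels, scores) pair per protein row."""
    rows = []
    for row in csv.DictReader(handle):
        annotation = row.pop("Annotation")
        # Several annotations for one protein are separated by '|'.
        labels = [label for label in annotation.split("|") if label]
        # Assumption: all remaining columns are per-category confidences.
        scores = {category: float(value) for category, value in row.items()}
        rows.append((labels, scores))
    return rows


predictions = read_predictions(io.StringIO(example_csv))
for labels, scores in predictions:
    best = max(scores, key=scores.get)
    print(labels, best, scores[best])
```

To read a real file, replace the `io.StringIO` object with `open(path, newline="")`.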
<p align="center">
<h2 align="center">Hierarchical classification</h2>
<img src="data/figure_1.png" border="0"/>
</p>
Raw data
{
"_id": null,
"home_page": null,
"name": "empathi",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.11.5",
"maintainer_email": null,
"keywords": "bacteriophages, bioinformatics, phages, protein functions",
"author": "Alexandre Boulay, Clovis Galiez, Elsa Rousseau",
"author_email": "Alexandre Boulay <alexandre.boulay.6@ulaval.ca>",
"download_url": "https://files.pythonhosted.org/packages/0c/6c/4ea9fe78a8ea73a2f4d39985c8aa285b62eb1a480fb5dc73cce80b909699/empathi-1.0.6.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": null,
"summary": "An embedding-based phage protein annotation tool by hierarchical assignment",
"version": "1.0.6",
"project_urls": {
"Homepage": "https://huggingface.co/AlexandreBoulay/EmPATHi"
},
"split_keywords": [
"bacteriophages",
" bioinformatics",
" phages",
" protein functions"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "e70ae913aa60b18ada9627e29a433d17a53a786c3def84686fd5ffc1d6b35701",
"md5": "a818f29c3c454de3cfdeb087d47e0027",
"sha256": "660a990e80452e50fa084c506942410beb7065bdc1e9ab253a302aefcd87c9b1"
},
"downloads": -1,
"filename": "empathi-1.0.6-py3-none-any.whl",
"has_sig": false,
"md5_digest": "a818f29c3c454de3cfdeb087d47e0027",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.11.5",
"size": 23522,
"upload_time": "2025-08-24T04:39:24",
"upload_time_iso_8601": "2025-08-24T04:39:24.643724Z",
"url": "https://files.pythonhosted.org/packages/e7/0a/e913aa60b18ada9627e29a433d17a53a786c3def84686fd5ffc1d6b35701/empathi-1.0.6-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "0c6c4ea9fe78a8ea73a2f4d39985c8aa285b62eb1a480fb5dc73cce80b909699",
"md5": "d118868665534230731939655c960266",
"sha256": "3697ca354c906437b96373557f031d7e4759e730a3cba6e5715b3ae701232f89"
},
"downloads": -1,
"filename": "empathi-1.0.6.tar.gz",
"has_sig": false,
"md5_digest": "d118868665534230731939655c960266",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.11.5",
"size": 599787,
"upload_time": "2025-08-24T04:39:26",
"upload_time_iso_8601": "2025-08-24T04:39:26.129417Z",
"url": "https://files.pythonhosted.org/packages/0c/6c/4ea9fe78a8ea73a2f4d39985c8aa285b62eb1a480fb5dc73cce80b909699/empathi-1.0.6.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-08-24 04:39:26",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "empathi"
}