<p align="left">
<img src="assets/images/logo.png" alt="protclust logo" width="100"/>
</p>
# protclust
A Python library for working with protein sequence data, providing:
- Clustering capabilities via MMseqs2
- Machine learning dataset creation with cluster-aware splits
- Protein sequence embeddings and feature extraction
---
## Requirements
This library requires [MMseqs2](https://github.com/soedinglab/MMseqs2), which must be installed and accessible via the command line. MMseqs2 can be installed using one of the following methods:
### Installation Options for MMseqs2
- **Homebrew**:
```bash
brew install mmseqs2
```
- **Conda**:
```bash
conda install -c conda-forge -c bioconda mmseqs2
```
- **Docker**:
```bash
docker pull ghcr.io/soedinglab/mmseqs2
```
- **Static Build (Linux; the AVX2 build is shown, pick the SSE4.1 or SSE2 archive for older CPUs)**:
```bash
wget https://mmseqs.com/latest/mmseqs-linux-avx2.tar.gz
tar xvfz mmseqs-linux-avx2.tar.gz
export PATH=$(pwd)/mmseqs/bin/:$PATH
```
MMseqs2 must be accessible via the `mmseqs` command in your system's PATH. If the library cannot detect MMseqs2, it will raise an error.
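To confirm the PATH requirement above before running any clustering, a check along these lines may help. This is a minimal sketch using only the standard library; `find_mmseqs` is a hypothetical helper written for this example and is not part of protclust's API.

```python
import shutil


def find_mmseqs(executable="mmseqs"):
    """Return the absolute path of the executable, or None if it is not on PATH."""
    return shutil.which(executable)


path = find_mmseqs()
if path is None:
    print("mmseqs was not found on PATH; install MMseqs2 first")
else:
    print(f"mmseqs found at {path}")
```

If this prints the "not found" message after a static-build install, the `export PATH=...` line probably needs to be added to your shell profile so it applies to new sessions.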
## Installation
You can install protclust using pip:
```bash
pip install protclust
```
To install from source, clone the repository and run the following from its root directory:
```bash
pip install -e .
```
For development, also install the testing and linting tools:
```bash
pip install pytest pytest-cov pre-commit ruff
```
## Features
### Sequence Clustering and Dataset Creation
```python
import pandas as pd
from protclust import clean, cluster, split, set_verbosity
# Enable detailed logging (optional)
set_verbosity(verbose=True)
# Example data
df = pd.DataFrame({
    "id": ["seq1", "seq2", "seq3", "seq4"],
    "sequence": ["ACDEFGHIKL", "ACDEFGHIKL", "MNPQRSTVWY", "MNPQRSTVWY"],
})
# Clean data
clean_df = clean(df, sequence_col="sequence")
# Cluster sequences
clustered_df = cluster(clean_df, sequence_col="sequence", id_col="id")
# Split data into train and test sets
train_df, test_df = split(clustered_df, group_col="representative_sequence", test_size=0.3)
print("Train set:\n", train_df)
print("Test set:\n", test_df)
# Or use the combined function
from protclust import train_test_cluster_split
train_df, test_df = train_test_cluster_split(df, sequence_col="sequence", id_col="id", test_size=0.3)
```
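The point of a cluster-aware split is that no cluster ends up on both sides. A quick way to verify that is to compare the group labels of the two splits. The sketch below is illustrative only: `shared_groups` is a hypothetical helper, and the toy lists stand in for `train_df["representative_sequence"]` and `test_df["representative_sequence"]` from the example above.

```python
def shared_groups(train_groups, test_groups):
    """Return the set of group labels that appear in both splits."""
    return set(train_groups) & set(test_groups)


# Toy labels standing in for the "representative_sequence" column of each split.
train_groups = ["seq1", "seq1"]
test_groups = ["seq3", "seq3"]

leaked = shared_groups(train_groups, test_groups)
assert not leaked, f"clusters present in both splits: {leaked}"
print("no cluster appears in both splits")
```

Because the helper only builds sets, it accepts pandas Series as well as plain lists.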
### Advanced Splitting Options
```python
# Three-way split
from protclust import train_test_val_cluster_split
train_df, val_df, test_df = train_test_val_cluster_split(
    df, sequence_col="sequence", test_size=0.2, val_size=0.1
)
# K-fold cross-validation with cluster awareness
from protclust import cluster_kfold
folds = cluster_kfold(df, sequence_col="sequence", n_splits=5)
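# MILP-based splitting with property balancing
# Assumes clustered_df (from the clustering example) already has the columns
# listed in balance_cols; the toy DataFrame above does not include them.
from protclust import milp_split
train_df, test_df = milp_split(
    clustered_df,
    group_col="representative_sequence",
    test_size=0.3,
    balance_cols=["molecular_weight", "hydrophobicity"]
)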
```
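To make "k-fold with cluster awareness" concrete, here is a minimal sketch of one way to assign whole clusters to folds: place the largest clusters first, each into the currently smallest fold. This is an assumption-laden illustration of the idea, not protclust's implementation; `greedy_group_folds` is a hypothetical name, and the return format of `cluster_kfold` is not specified here.

```python
from collections import Counter


def greedy_group_folds(groups, n_splits):
    """Assign each group label to exactly one fold, largest groups first.

    Returns a dict mapping group label -> fold index in range(n_splits).
    """
    if n_splits < 2:
        raise ValueError("n_splits must be at least 2")
    sizes = Counter(groups)
    if len(sizes) < n_splits:
        raise ValueError("need at least as many groups as folds")
    fold_sizes = [0] * n_splits
    assignment = {}
    # Placing big clusters first lets the small ones even out the folds.
    for label, size in sorted(sizes.items(), key=lambda kv: (-kv[1], str(kv[0]))):
        target = fold_sizes.index(min(fold_sizes))  # currently smallest fold
        assignment[label] = target
        fold_sizes[target] += size
    return assignment


# One cluster label per row; rows sharing a label always share a fold.
groups = ["c1", "c1", "c1", "c2", "c2", "c3", "c4"]
assignment = greedy_group_folds(groups, n_splits=2)
row_folds = [assignment[g] for g in groups]
print(row_folds)
```

Each fold can then serve as the held-out set in turn, with the remaining folds used for training.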
### Protein Embeddings
```python
# Basic embeddings
from protclust.embeddings import blosum62, aac, property_embedding, onehot
# Add BLOSUM62 embeddings
df_with_blosum = blosum62(df, sequence_col="sequence")
# Generate embeddings with ESM models (requires extra dependencies)
from protclust.embeddings import embed_sequences
# ESM embedding
df_with_esm = embed_sequences(df, "esm", sequence_col="sequence")
# Saving embeddings to HDF5
df_with_refs = embed_sequences(
    df,
    "esm",
    sequence_col="sequence",
    use_hdf=True,
    hdf_path="embeddings.h5"
)
# Retrieve embeddings
from protclust.embeddings import get_embeddings
embeddings = get_embeddings(df_with_refs, "esm", hdf_path="embeddings.h5")
```
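For intuition about the simpler embeddings, amino acid composition (the idea behind the `aac` name) represents a sequence as the frequency of each of the 20 standard residues. The sketch below shows that computation in plain Python; it is an assumption about the general technique, not a copy of protclust's code, and `amino_acid_composition` is a hypothetical function.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(sequence):
    """Return 20 residue frequencies, ordered as in AMINO_ACIDS.

    Non-standard characters are ignored; an input with no standard
    residues yields a vector of zeros.
    """
    sequence = sequence.upper()
    counts = [sequence.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [count / total for count in counts]


vector = amino_acid_composition("ACDEFGHIKL")
print(len(vector), round(sum(vector), 6))
```

The result has a fixed length of 20 regardless of sequence length, which is why composition features are convenient inputs for classical models.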
## Parameters
Common parameters for clustering functions:
- `df`: Pandas DataFrame containing sequence data
- `sequence_col`: Column name containing sequences
- `id_col`: Column name containing unique identifiers
- `min_seq_id`: Minimum sequence identity threshold (0.0-1.0, default 0.3)
- `coverage`: Minimum alignment coverage (0.0-1.0, default 0.5)
- `cov_mode`: Coverage mode (0-3, default 0)
- `test_size`: Desired fraction of data in test set (default 0.2)
- `random_state`: Random seed for reproducibility
- `tolerance`: Acceptable deviation from desired split sizes (default 0.05)
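To show how `test_size`, `tolerance`, and `random_state` interact when whole clusters must stay together, here is a small standard-library sketch. It is not protclust's algorithm: `group_split` is a hypothetical function, and raising an error when the tolerance is exceeded is an assumption made for this example.

```python
import random
from collections import Counter


def group_split(groups, test_size=0.2, tolerance=0.05, random_state=None):
    """Choose whole groups for the test set, aiming for test_size of the rows.

    Returns (test_labels, achieved_fraction). Raises ValueError if the
    achieved fraction differs from test_size by more than tolerance.
    """
    sizes = Counter(groups)
    total = sum(sizes.values())
    labels = sorted(sizes, key=str)
    random.Random(random_state).shuffle(labels)  # seeded, so reproducible

    target = test_size * total
    test_labels, test_rows = set(), 0
    for label in labels:
        # Take a group only if that moves the test set closer to the target.
        if abs(test_rows + sizes[label] - target) < abs(test_rows - target):
            test_labels.add(label)
            test_rows += sizes[label]

    achieved = test_rows / total
    if abs(achieved - test_size) > tolerance:
        raise ValueError(
            f"achieved test fraction {achieved:.2f} is outside "
            f"{test_size} +/- {tolerance}"
        )
    return test_labels, achieved


# Ten clusters of two sequences each; labels repeat once per row.
groups = [f"cluster{i}" for i in range(10) for _ in range(2)]
test_labels, achieved = group_split(groups, test_size=0.3, random_state=42)
print(sorted(test_labels), achieved)
```

Because clusters cannot be divided, the exact requested fraction is often unreachable, which is why a tolerance on the achieved split size is useful.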
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Run tests (`pytest tests/`)
4. Commit your changes (`git commit -m 'Add some amazing feature'`)
5. Push to the branch (`git push origin feature/amazing-feature`)
6. Open a Pull Request
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Citation
If you use protclust in your research, please cite:
```bibtex
@software{protclust,
author = {Michael Scutari},
title = {protclust: Protein Sequence Clustering and ML Dataset Creation},
url = {https://github.com/michaelscutari/protclust},
version = {0.1.0},
year = {2025},
}
```