<p align="left">
<img src="assets/images/logo.png" alt="protclust logo" width="100"/>
</p>

# protclust
A Python library for working with protein sequence data, providing:
- Clustering capabilities via MMseqs2
- Machine learning dataset creation with cluster-aware splits
---
## Requirements
This library requires [MMseqs2](https://github.com/soedinglab/MMseqs2), which must be installed and accessible via the command line. MMseqs2 can be installed using one of the following methods:
### Installation Options for MMseqs2
- **Homebrew**:
  ```bash
  brew install mmseqs2
  ```
- **Conda**:
  ```bash
  conda install -c conda-forge -c bioconda mmseqs2
  ```
- **Docker**:
  ```bash
  docker pull ghcr.io/soedinglab/mmseqs2
  ```
- **Static Build (AVX2, SSE4.1, or SSE2)**:
  ```bash
  wget https://mmseqs.com/latest/mmseqs-linux-avx2.tar.gz
  tar xvfz mmseqs-linux-avx2.tar.gz
  export PATH=$(pwd)/mmseqs/bin/:$PATH
  ```
MMseqs2 must be accessible via the `mmseqs` command in your system's PATH. If the library cannot detect MMseqs2, it will raise an error.
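Before running a clustering job, it can help to confirm that `mmseqs` really is on your PATH. The snippet below is a minimal sketch using only the Python standard library; the helper names `find_mmseqs` and `require_mmseqs` are hypothetical and are not part of protclust, whose own detection logic may differ.

```python
import shutil


def find_mmseqs(binary="mmseqs"):
    """Return the full path to the executable, or None if it is not on PATH."""
    return shutil.which(binary)


def require_mmseqs(binary="mmseqs"):
    """Return the executable path, raising a descriptive error if it is missing."""
    path = find_mmseqs(binary)
    if path is None:
        raise RuntimeError(
            f"'{binary}' was not found on PATH; install MMseqs2 and try again."
        )
    return path


if __name__ == "__main__":
    location = find_mmseqs()
    if location is None:
        print("mmseqs not found on PATH")
    else:
        print("mmseqs found at", location)
```

If the check fails after a static-build install, the usual cause is that the `export PATH=...` line was run in a different shell session.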
## Installation
You can install protclust using pip:
```bash
pip install protclust
```
To install from source instead, clone the repository and run the following from its root directory:
```bash
pip install -e .
```
For development purposes, also install the testing dependencies:
```bash
pip install pytest pytest-cov pre-commit ruff
```
## Features
### Sequence Clustering and Dataset Creation
```python
import pandas as pd
from protclust import clean, cluster, milp_split, set_verbosity, split

# Enable detailed logging (optional)
set_verbosity(verbose=True)

# Example data
df = pd.DataFrame({
    "id": ["seq1", "seq2", "seq3", "seq4"],
    "sequence": ["ACDEFGHIKL", "ACDEFGHIKL", "MNPQRSTVWY", "MNPQRSTVWY"],
})

# Clean data
clean_df = clean(df, sequence_col="sequence")

# Cluster sequences
clustered_df = cluster(clean_df, sequence_col="sequence", id_col="id")

# Split data into train and test sets
train_df, test_df = split(clustered_df, group_col="cluster_representative", test_size=0.3)

print("Train set:\n", train_df)
print("Test set:\n", test_df)

# MILP-based splitting with property balancing.
# Note: balance_cols presumably refers to numeric columns already present in
# the DataFrame; the example data above does not include them.
train_df, test_df = milp_split(
    clustered_df,
    group_col="cluster_representative",
    test_size=0.3,
    balance_cols=["molecular_weight", "hydrophobicity"],
)
```
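To illustrate what a cluster-aware split guarantees, here is a minimal standard-library sketch. It is not protclust's implementation: the function `group_split` and its greedy assignment strategy are assumptions made for illustration. The point it shows is that whole clusters, never individual sequences, are assigned to one side of the split.

```python
import random
from collections import defaultdict


def group_split(records, group_key, test_size=0.2, random_state=None):
    """Split records so that every group lands entirely in train or in test."""
    # Collect the records belonging to each group (cluster).
    groups = defaultdict(list)
    for record in records:
        groups[record[group_key]].append(record)

    # Shuffle the group labels reproducibly.
    labels = sorted(groups)
    random.Random(random_state).shuffle(labels)

    # Greedily fill the test set with whole groups until the target is reached.
    target = test_size * len(records)
    train, test = [], []
    for label in labels:
        if len(test) < target:
            test.extend(groups[label])
        else:
            train.extend(groups[label])
    return train, test


if __name__ == "__main__":
    data = [
        {"id": "seq1", "cluster_representative": "seq1"},
        {"id": "seq2", "cluster_representative": "seq1"},
        {"id": "seq3", "cluster_representative": "seq3"},
        {"id": "seq4", "cluster_representative": "seq3"},
    ]
    train, test = group_split(data, "cluster_representative", test_size=0.3, random_state=0)
    print("train:", [r["id"] for r in train])
    print("test:", [r["id"] for r in test])
```

Because groups are indivisible, the realized test fraction can differ from `test_size`, especially on small datasets with large clusters.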
## Parameters
Common parameters for the clustering and splitting functions:
- `df`: Pandas DataFrame containing sequence data
- `sequence_col`: Column name containing sequences
- `id_col`: Column name containing unique identifiers
- `min_seq_id`: Minimum sequence identity threshold (0.0-1.0, default 0.3)
- `coverage`: Minimum alignment coverage (0.0-1.0, default 0.5)
- `cov_mode`: Coverage mode (0-3, default 0)
- `cluster_mode`: Clustering algorithm (0: Set-Cover, 1: Connected component, 2: Greedy by length, default 0)
- `cluster_steps`: Number of cascaded clustering steps for large datasets (default 1)
- `test_size`: Desired fraction of data in test set (default 0.2)
- `random_state`: Random seed for reproducibility
- `tolerance`: Acceptable deviation from desired split sizes (default 0.05)
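Several of the clustering parameters above share names with MMseqs2 command-line options. The sketch below shows one way they could map onto an `mmseqs easy-cluster` invocation. This mapping is an assumption for illustration and is not taken from protclust's source; the helper `build_easy_cluster_command` and the file paths are hypothetical.

```python
def build_easy_cluster_command(
    fasta_path,
    output_prefix,
    tmp_dir,
    min_seq_id=0.3,
    coverage=0.5,
    cov_mode=0,
    cluster_mode=0,
    cluster_steps=1,
):
    """Build an argument list for `mmseqs easy-cluster` (assumed mapping)."""
    # Identity and coverage are fractions, so reject values outside [0, 1].
    for name, value in (("min_seq_id", min_seq_id), ("coverage", coverage)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0.0 and 1.0, got {value}")

    return [
        "mmseqs", "easy-cluster", fasta_path, output_prefix, tmp_dir,
        "--min-seq-id", str(min_seq_id),
        "-c", str(coverage),
        "--cov-mode", str(cov_mode),
        "--cluster-mode", str(cluster_mode),
        "--cluster-steps", str(cluster_steps),
    ]


if __name__ == "__main__":
    command = build_easy_cluster_command("input.fasta", "clusters", "tmp")
    print(" ".join(command))
```

Returning a list rather than a single string makes the result suitable for passing to `subprocess.run` without shell quoting concerns.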
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Run tests (`pytest tests/`)
4. Commit your changes (`git commit -m 'Add some amazing feature'`)
5. Push to the branch (`git push origin feature/amazing-feature`)
6. Open a Pull Request
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Citation
If you use protclust in your research, please cite:
```bibtex
@software{protclust,
author = {Michael Scutari},
title = {protclust: Protein Sequence Clustering and ML Dataset Creation},
url = {https://github.com/michaelscutari/protclust},
version = {0.2.0},
year = {2025},
}
```