protclust

Name	protclust
Version	0.1.5.post1
home_page	None
Summary	Python tools for protein sequence clustering and embedding
upload_time	2025-03-21 12:55:27
maintainer	None
docs_url	None
author	Michael Scutari
requires_python	>=3.7
license	MIT
keywords	bioinformatics protein clustering embeddings machine learning
requirements	numpy pandas scikit-learn h5py torch fair-esm transformers requests pulp pytest pytest-cov pre-commit ruff

<p align="left">
  <img src="assets/images/logo.png" alt="protclust logo" width="100"/>
</p>

# protclust

[![PyPI version](https://img.shields.io/pypi/v/protclust.svg)](https://pypi.org/project/protclust/)
[![Tests](https://github.com/michaelscutari/protclust/workflows/Tests/badge.svg)](https://github.com/michaelscutari/protclust/actions)
[![Coverage](https://img.shields.io/badge/Coverage-85%25-green)](https://github.com/michaelscutari/protclust/actions)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python Version](https://img.shields.io/pypi/pyversions/protclust.svg)](https://pypi.org/project/protclust/)

A Python library for working with protein sequence data, providing:
- Clustering capabilities via MMseqs2
- Machine learning dataset creation with cluster-aware splits
- Protein sequence embeddings and feature extraction

---

## Requirements

protclust requires [MMseqs2](https://github.com/soedinglab/MMseqs2), which must be installed and available on the command line. Install it using one of the following methods:

### Installation Options for MMseqs2

- **Homebrew**:
    ```bash
    brew install mmseqs2
    ```

- **Conda**:
    ```bash
    conda install -c conda-forge -c bioconda mmseqs2
    ```

- **Docker**:
    ```bash
    docker pull ghcr.io/soedinglab/mmseqs2
    ```

- **Static Build** (Linux; the AVX2 build is shown, with SSE4.1 and SSE2 builds also available):
    ```bash
    wget https://mmseqs.com/latest/mmseqs-linux-avx2.tar.gz
    tar xvfz mmseqs-linux-avx2.tar.gz
    export PATH=$(pwd)/mmseqs/bin/:$PATH
    ```

The `mmseqs` command must be on your system's PATH. If protclust cannot detect MMseqs2, it raises an error.
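To confirm that the `mmseqs` command is visible before running any clustering, a quick PATH lookup is enough. The sketch below uses only the standard library; the helper name `mmseqs_available` is hypothetical and is not part of protclust, whose own detection logic may differ.

```python
import shutil


def mmseqs_available(binary: str = "mmseqs") -> bool:
    """Return True if `binary` resolves to an executable on PATH."""
    return shutil.which(binary) is not None


if __name__ == "__main__":
    if mmseqs_available():
        print("MMseqs2 found at", shutil.which("mmseqs"))
    else:
        print("MMseqs2 not found; install it or add it to PATH")
```

If this reports that MMseqs2 is missing right after a static-build install, the `export PATH=...` line probably applied only to a different shell session.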

## Installation

You can install protclust using pip:

```bash
pip install protclust
```

To install from source, clone the repository and run the following from its root:

```bash
pip install -e .
```

For development purposes, also install the testing dependencies:

```bash
pip install pytest pytest-cov pre-commit ruff
```

## Features

### Sequence Clustering and Dataset Creation

```python
import pandas as pd
from protclust import clean, cluster, split, set_verbosity

# Enable detailed logging (optional)
set_verbosity(verbose=True)

# Example data
df = pd.DataFrame({
    "id": ["seq1", "seq2", "seq3", "seq4"],
    "sequence": ["ACDEFGHIKL", "ACDEFGHIKL", "MNPQRSTVWY", "MNPQRSTVWY"]
})

# Clean data
clean_df = clean(df, sequence_col="sequence")

# Cluster sequences
clustered_df = cluster(clean_df, sequence_col="sequence", id_col="id")

# Split data into train and test sets
train_df, test_df = split(clustered_df, group_col="representative_sequence", test_size=0.3)

print("Train set:\n", train_df)
print("Test set:\n", test_df)

# Or use the combined function
from protclust import train_test_cluster_split
train_df, test_df = train_test_cluster_split(df, sequence_col="sequence", id_col="id", test_size=0.3)
```
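The point of a cluster-aware split is that all members of a cluster land on the same side, so near-identical sequences never leak from train into test. The following is a minimal standard-library sketch of that idea, not protclust's implementation; the function `group_split` and its arguments are hypothetical.

```python
import random
from collections import defaultdict


def group_split(cluster_of, test_size=0.3, seed=0):
    """Split ids so that every cluster lands entirely in train or test.

    `cluster_of` maps each sequence id to its cluster label.
    """
    members = defaultdict(list)
    for seq_id, cluster in cluster_of.items():
        members[cluster].append(seq_id)

    clusters = sorted(members)  # deterministic order before shuffling
    random.Random(seed).shuffle(clusters)

    target = test_size * len(cluster_of)
    train, test = [], []
    for cluster in clusters:
        # Fill the test set cluster by cluster until it reaches the target size.
        bucket = test if len(test) < target else train
        bucket.extend(members[cluster])
    return train, test


# Two clusters of two identical sequences each, as in the example data above.
cluster_of = {"seq1": "seq1", "seq2": "seq1", "seq3": "seq3", "seq4": "seq3"}
train_ids, test_ids = group_split(cluster_of, test_size=0.3)
```

With four sequences in two clusters, the test set can only hold zero, two, or four sequences, so the requested 30% becomes 50%. Whole-cluster assignment generally means the achieved split size only approximates the requested one.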

### Advanced Splitting Options

```python
# Three-way split
from protclust import train_test_val_cluster_split
train_df, val_df, test_df = train_test_val_cluster_split(
    df, sequence_col="sequence", test_size=0.2, val_size=0.1
)

# K-fold cross-validation with cluster awareness
from protclust import cluster_kfold
folds = cluster_kfold(df, sequence_col="sequence", n_splits=5)

# MILP-based splitting with property balancing
# (the columns named in balance_cols must already exist in clustered_df)
from protclust import milp_split
train_df, test_df = milp_split(
    clustered_df,
    group_col="representative_sequence",
    test_size=0.3,
    balance_cols=["molecular_weight", "hydrophobicity"]
)
```
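For cluster-aware k-fold cross-validation, the unit assigned to a fold is the cluster rather than the individual sequence. One simple way to keep folds roughly equal in size is a greedy heuristic: place the largest clusters first, each into the currently smallest fold. This is an illustrative sketch only; `assign_folds` is a hypothetical helper, and `cluster_kfold` may use a different strategy.

```python
from collections import Counter


def assign_folds(cluster_of, n_splits=2):
    """Map each cluster label to a fold index, keeping fold sizes balanced."""
    sizes = Counter(cluster_of.values())
    fold_sizes = [0] * n_splits
    fold_of = {}
    # Place the largest clusters first, each into the currently smallest fold.
    for cluster, size in sorted(sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        fold = fold_sizes.index(min(fold_sizes))
        fold_of[cluster] = fold
        fold_sizes[fold] += size
    return fold_of


# Six sequences in clusters of size 3, 2, and 1.
cluster_of = {"a": "c1", "b": "c1", "c": "c1", "d": "c2", "e": "c2", "f": "c3"}
fold_of = assign_folds(cluster_of, n_splits=2)
```

Here `c1` fills the first fold and `c2` and `c3` share the second, giving two folds of three sequences each.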

### Protein Embeddings

```python
# Basic embeddings
from protclust.embeddings import blosum62, aac, property_embedding, onehot

# Add BLOSUM62 embeddings
df_with_blosum = blosum62(df, sequence_col="sequence")

# Generate embeddings with ESM models (requires extra dependencies)
from protclust.embeddings import embed_sequences

# ESM embedding
df_with_esm = embed_sequences(df, "esm", sequence_col="sequence")

# Saving embeddings to HDF5
df_with_refs = embed_sequences(
    df,
    "esm",
    sequence_col="sequence",
    use_hdf=True,
    hdf_path="embeddings.h5"
)

# Retrieve embeddings
from protclust.embeddings import get_embeddings
embeddings = get_embeddings(df_with_refs, "esm", hdf_path="embeddings.h5")
```
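Of the basic embeddings, amino acid composition is the simplest to picture: a fixed-length vector holding the fraction of each of the 20 standard residues. The sketch below assumes that `aac` refers to this standard feature; the function name, residue ordering, and handling of non-standard characters are illustrative choices and may not match protclust's output.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, alphabetical


def amino_acid_composition(sequence):
    """Return the fraction of each standard residue in `sequence`."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("sequence must not be empty")
    return [sequence.count(aa) / len(sequence) for aa in AMINO_ACIDS]


vector = amino_acid_composition("ACDEFGHIKL")
```

Each of the ten residues in `ACDEFGHIKL` occurs once, so the first ten entries are 0.1 and the remaining ten are 0. Non-standard characters such as `X` are not counted, so the vector sums to less than 1 for sequences that contain them.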

## Parameters

Common parameters for the clustering and splitting functions:

- `df`: Pandas DataFrame containing sequence data
- `sequence_col`: Column name containing sequences
- `id_col`: Column name containing unique identifiers
- `min_seq_id`: Minimum sequence identity threshold (0.0-1.0, default 0.3)
- `coverage`: Minimum alignment coverage (0.0-1.0, default 0.5)
- `cov_mode`: Coverage mode (0-3, default 0)
- `test_size`: Desired fraction of data in test set (default 0.2)
- `random_state`: Random seed for reproducibility
- `tolerance`: Acceptable deviation from desired split sizes (default 0.05)
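The clustering parameters mirror MMseqs2 command-line options: `--min-seq-id`, `-c`, and `--cov-mode`. As a rough guide to what they control, the sketch below builds an `mmseqs easy-cluster` invocation from them. Whether protclust calls `easy-cluster` or another MMseqs2 workflow is an assumption, and `easy_cluster_command` and the file names are hypothetical.

```python
def easy_cluster_command(fasta, out_prefix, tmp_dir,
                         min_seq_id=0.3, coverage=0.5, cov_mode=0):
    """Build an argument list for `mmseqs easy-cluster` (not executed here)."""
    return [
        "mmseqs", "easy-cluster", fasta, out_prefix, tmp_dir,
        "--min-seq-id", str(min_seq_id),  # minimum sequence identity
        "-c", str(coverage),              # minimum alignment coverage
        "--cov-mode", str(cov_mode),      # which sequence(s) coverage applies to
    ]


cmd = easy_cluster_command("seqs.fasta", "clusters", "tmp")
print(" ".join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` to try the same settings directly against MMseqs2.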

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Run tests (`pytest tests/`)
4. Commit your changes (`git commit -m 'Add some amazing feature'`)
5. Push to the branch (`git push origin feature/amazing-feature`)
6. Open a Pull Request

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Citation

If you use protclust in your research, please cite:

```bibtex
@software{protclust,
  author = {Michael Scutari},
  title = {protclust: Protein Sequence Clustering and ML Dataset Creation},
  url = {https://github.com/michaelscutari/protclust},
  version = {0.1.0},
  year = {2025},
}
```

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "protclust",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.7",
    "maintainer_email": null,
    "keywords": "bioinformatics, protein, clustering, embeddings, machine learning",
    "author": "Michael Scutari",
    "author_email": null,
    "download_url": "https://files.pythonhosted.org/packages/46/f3/a025ffb574990f1b29f0657f2732c84edb5ea46a42b1b73c6766f5118bc8/protclust-0.1.5.post1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Python tools for protein sequence clustering and embedding",
    "version": "0.1.5.post1",
    "project_urls": {
        "Bug Tracker": "https://github.com/michaelscutari/protclust/issues",
        "Documentation": "https://github.com/michaelscutari/protclust",
        "Homepage": "https://github.com/michaelscutari/protclust"
    },
    "split_keywords": [
        "bioinformatics",
        " protein",
        " clustering",
        " embeddings",
        " machine learning"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "656dcaa2ecc2496d92a1c19f0df6dfb7ce64481a0b46a1f1b28fce1e4b4ef56a",
                "md5": "8b4c2add29ad5ac2262e3729f58db44b",
                "sha256": "799c6852b671029686f44acf9f473461317b267338fe9c404a037f3fb0f5e891"
            },
            "downloads": -1,
            "filename": "protclust-0.1.5.post1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "8b4c2add29ad5ac2262e3729f58db44b",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.7",
            "size": 39177,
            "upload_time": "2025-03-21T12:55:25",
            "upload_time_iso_8601": "2025-03-21T12:55:25.536537Z",
            "url": "https://files.pythonhosted.org/packages/65/6d/caa2ecc2496d92a1c19f0df6dfb7ce64481a0b46a1f1b28fce1e4b4ef56a/protclust-0.1.5.post1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "46f3a025ffb574990f1b29f0657f2732c84edb5ea46a42b1b73c6766f5118bc8",
                "md5": "14eee782e81f0ee3e5d01d6854d5ff74",
                "sha256": "d9f394245a86b48eadc662e904d53d015b6c3e28f7abb6b0accb25a02f06f306"
            },
            "downloads": -1,
            "filename": "protclust-0.1.5.post1.tar.gz",
            "has_sig": false,
            "md5_digest": "14eee782e81f0ee3e5d01d6854d5ff74",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7",
            "size": 70261,
            "upload_time": "2025-03-21T12:55:27",
            "upload_time_iso_8601": "2025-03-21T12:55:27.099207Z",
            "url": "https://files.pythonhosted.org/packages/46/f3/a025ffb574990f1b29f0657f2732c84edb5ea46a42b1b73c6766f5118bc8/protclust-0.1.5.post1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-03-21 12:55:27",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "michaelscutari",
    "github_project": "protclust",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [
        {
            "name": "numpy",
            "specs": [
                [
                    ">=",
                    "1.20.0"
                ]
            ]
        },
        {
            "name": "pandas",
            "specs": [
                [
                    ">=",
                    "1.5.0"
                ]
            ]
        },
        {
            "name": "scikit-learn",
            "specs": [
                [
                    ">=",
                    "1.0.0"
                ]
            ]
        },
        {
            "name": "h5py",
            "specs": [
                [
                    ">=",
                    "3.0.0"
                ]
            ]
        },
        {
            "name": "torch",
            "specs": [
                [
                    ">=",
                    "1.10.0"
                ]
            ]
        },
        {
            "name": "fair-esm",
            "specs": [
                [
                    ">=",
                    "2.0.0"
                ]
            ]
        },
        {
            "name": "transformers",
            "specs": [
                [
                    ">=",
                    "4.15.0"
                ]
            ]
        },
        {
            "name": "requests",
            "specs": [
                [
                    ">=",
                    "2.25.0"
                ]
            ]
        },
        {
            "name": "pulp",
            "specs": [
                [
                    ">=",
                    "2.0.0"
                ]
            ]
        },
        {
            "name": "pytest",
            "specs": [
                [
                    ">=",
                    "7.0.0"
                ]
            ]
        },
        {
            "name": "pytest-cov",
            "specs": [
                [
                    ">=",
                    "4.0.0"
                ]
            ]
        },
        {
            "name": "pre-commit",
            "specs": [
                [
                    ">=",
                    "2.20.0"
                ]
            ]
        },
        {
            "name": "ruff",
            "specs": [
                [
                    ">=",
                    "0.10.0"
                ]
            ]
        }
    ],
    "lcname": "protclust"
}