protclust


| Field | Value |
| --- | --- |
| Name | protclust |
| Version | 0.2.0 |
| Summary | Python tools for protein sequence clustering and dataset splitting |
| Upload time | 2025-09-09 00:20:42 |
| Author | Michael Scutari |
| Requires Python | >=3.7 |
| License | MIT |
| Keywords | bioinformatics, protein, clustering, machine learning |
| Requirements | numpy, pandas, scikit-learn, requests, pulp, pytest, pytest-cov, pre-commit, ruff |

<p align="left">
  <img src="assets/images/logo.png" alt="protclust logo" width="100"/>
</p>

# protclust

[![PyPI version](https://img.shields.io/pypi/v/protclust.svg)](https://pypi.org/project/protclust/)
[![Tests](https://github.com/michaelscutari/protclust/workflows/Tests/badge.svg)](https://github.com/michaelscutari/protclust/actions)
[![Coverage](https://img.shields.io/badge/Coverage-85%25-green)](https://github.com/michaelscutari/protclust/actions)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python Version](https://img.shields.io/pypi/pyversions/protclust.svg)](https://pypi.org/project/protclust/)

A Python library for working with protein sequence data, providing:
- Clustering capabilities via MMseqs2
- Machine learning dataset creation with cluster-aware splits

---

## Requirements

This library requires [MMseqs2](https://github.com/soedinglab/MMseqs2), which must be installed and accessible via the command line. MMseqs2 can be installed using one of the following methods:

### Installation Options for MMseqs2

- **Homebrew**:
    ```bash
    brew install mmseqs2
    ```

- **Conda**:
    ```bash
    conda install -c conda-forge -c bioconda mmseqs2
    ```

- **Docker**:
    ```bash
    docker pull ghcr.io/soedinglab/mmseqs2
    ```

- **Static Build (AVX2, SSE4.1, or SSE2)**:
    ```bash
    # AVX2 build shown; pick the archive that matches your CPU's instruction set
    wget https://mmseqs.com/latest/mmseqs-linux-avx2.tar.gz
    tar xvfz mmseqs-linux-avx2.tar.gz
    export PATH="$(pwd)/mmseqs/bin/:$PATH"
    ```

MMseqs2 must be accessible via the `mmseqs` command in your system's PATH. If the library cannot detect MMseqs2, it will raise an error.
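
Before calling the library, it can help to confirm that the binary is discoverable. The check below is a minimal sketch using only the Python standard library; `find_mmseqs` is a hypothetical helper, not part of protclust, and the library's own detection logic may differ.

```python
import shutil
from typing import Optional


def find_mmseqs(binary: str = "mmseqs") -> Optional[str]:
    """Return the full path to the MMseqs2 executable, or None if it is not on PATH."""
    return shutil.which(binary)


path = find_mmseqs()
if path is None:
    print("mmseqs not found on PATH; install MMseqs2 or adjust PATH")
else:
    print(f"mmseqs found at {path}")
```

If the check reports that `mmseqs` is missing after a static-build install, the `export PATH=...` line probably needs to be added to your shell profile so it persists across sessions.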

## Installation

You can install protclust using pip:

```bash
pip install protclust
```

To install from source, clone the repository and run the following from its root directory:

```bash
pip install -e .
```

For development purposes, also install the testing dependencies:

```bash
pip install pytest pytest-cov pre-commit ruff
```

## Features

### Sequence Clustering and Dataset Creation

```python
import pandas as pd
from protclust import clean, cluster, split, set_verbosity

# Enable detailed logging (optional)
set_verbosity(verbose=True)

# Example data
df = pd.DataFrame({
    "id": ["seq1", "seq2", "seq3", "seq4"],
    "sequence": ["ACDEFGHIKL", "ACDEFGHIKL", "MNPQRSTVWY", "MNPQRSTVWY"]
})

# Clean data
clean_df = clean(df, sequence_col="sequence")

# Cluster sequences
clustered_df = cluster(clean_df, sequence_col="sequence", id_col="id")

# Split data into train and test sets
train_df, test_df = split(clustered_df, group_col="cluster_representative", test_size=0.3)

print("Train set:\n", train_df)
print("Test set:\n", test_df)

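# MILP-based splitting with property balancing.
# Note: balance_cols is assumed to name numeric columns already present in
# clustered_df; the example DataFrame above does not define these columns.
from protclust import milp_split
train_df, test_df = milp_split(
    clustered_df,
    group_col="cluster_representative",
    test_size=0.3,
    balance_cols=["molecular_weight", "hydrophobicity"]
)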
```
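
To illustrate why the split is made on `cluster_representative` rather than on individual rows, here is a minimal standard-library sketch of a group-aware split. It is not protclust's implementation: `group_split` is a hypothetical function, and the record layout simply mirrors the column names used in the example above.

```python
import random
from collections import defaultdict


def group_split(records, group_key, test_size=0.3, seed=0):
    """Split records so every group lands entirely in train or entirely in test."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[group_key]].append(rec)

    # Shuffle group order deterministically, then fill the test set group by group.
    names = sorted(groups)
    random.Random(seed).shuffle(names)

    target = test_size * len(records)
    train, test = [], []
    for name in names:
        # Whole groups go to test until the target size is reached, then to train.
        bucket = test if len(test) < target else train
        bucket.extend(groups[name])
    return train, test


records = [
    {"id": "seq1", "cluster_representative": "seq1"},
    {"id": "seq2", "cluster_representative": "seq1"},
    {"id": "seq3", "cluster_representative": "seq3"},
    {"id": "seq4", "cluster_representative": "seq3"},
]
train, test = group_split(records, "cluster_representative", test_size=0.3)

train_groups = {r["cluster_representative"] for r in train}
test_groups = {r["cluster_representative"] for r in test}
print("shared clusters:", train_groups & test_groups)
print("test fraction:", len(test) / len(records))
```

With four sequences in two clusters, one whole cluster ends up in the test set, so the achieved test fraction is 0.5 rather than the requested 0.3. Clusters are indivisible, which is why a split on grouped data can only approximate the requested size.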

## Parameters

Common parameters for the clustering and splitting functions:

- `df`: Pandas DataFrame containing sequence data
- `sequence_col`: Column name containing sequences
- `id_col`: Column name containing unique identifiers
- `min_seq_id`: Minimum sequence identity threshold (0.0-1.0, default 0.3)
- `coverage`: Minimum alignment coverage (0.0-1.0, default 0.5)
- `cov_mode`: Coverage mode (0-3, default 0)
- `cluster_mode`: Clustering algorithm (0: Set-Cover, 1: Connected component, 2: Greedy by length, default 0)
- `cluster_steps`: Number of cascaded clustering steps for large datasets (default 1)
- `test_size`: Desired fraction of data in test set (default 0.2)
- `random_state`: Random seed for reproducibility
- `tolerance`: Acceptable deviation from desired split sizes (default 0.05)
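
Several of these parameters share names and value ranges with MMseqs2 command-line options. The sketch below shows how such values could be assembled into an `mmseqs easy-cluster` invocation. The function `build_mmseqs_command` and the file names are hypothetical, and the mapping is an assumption for illustration rather than a description of protclust's internals.

```python
from typing import List


def build_mmseqs_command(
    fasta_path: str,
    out_prefix: str,
    tmp_dir: str,
    min_seq_id: float = 0.3,
    coverage: float = 0.5,
    cov_mode: int = 0,
    cluster_mode: int = 0,
) -> List[str]:
    """Assemble an argument list for `mmseqs easy-cluster` (illustrative only)."""
    # Both thresholds are fractions, matching the documented 0.0-1.0 range.
    if not 0.0 <= min_seq_id <= 1.0:
        raise ValueError("min_seq_id must be between 0.0 and 1.0")
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must be between 0.0 and 1.0")
    return [
        "mmseqs", "easy-cluster", fasta_path, out_prefix, tmp_dir,
        "--min-seq-id", str(min_seq_id),
        "-c", str(coverage),
        "--cov-mode", str(cov_mode),
        "--cluster-mode", str(cluster_mode),
    ]


cmd = build_mmseqs_command("input.fasta", "clusters", "tmp", min_seq_id=0.5)
print(" ".join(cmd))
```

Once MMseqs2 is installed, a list like this could be passed to `subprocess.run(cmd, check=True)`. Building the command as a list rather than a single string avoids shell-quoting problems with file paths.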

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Run tests (`pytest tests/`)
4. Commit your changes (`git commit -m 'Add some amazing feature'`)
5. Push to the branch (`git push origin feature/amazing-feature`)
6. Open a Pull Request

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Citation

If you use protclust in your research, please cite:

```bibtex
@software{protclust,
  author = {Michael Scutari},
  title = {protclust: Protein Sequence Clustering and ML Dataset Creation},
  url = {https://github.com/michaelscutari/protclust},
  version = {0.2.0},
  year = {2025},
}
```

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "protclust",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.7",
    "maintainer_email": null,
    "keywords": "bioinformatics, protein, clustering, machine learning",
    "author": "Michael Scutari",
    "author_email": null,
    "download_url": "https://files.pythonhosted.org/packages/14/f4/244e75efcc50983b9ad27c86d5f03a9d7ff72344dc8c7e741dd3ffa41f97/protclust-0.2.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Python tools for protein sequence clustering and dataset splitting",
    "version": "0.2.0",
    "project_urls": {
        "Bug Tracker": "https://github.com/michaelscutari/protclust/issues",
        "Documentation": "https://github.com/michaelscutari/protclust",
        "Homepage": "https://github.com/michaelscutari/protclust"
    },
    "split_keywords": [
        "bioinformatics",
        " protein",
        " clustering",
        " machine learning"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "b80d2e89ec6371d9251f1f2a03dfff633abbdf2301d6a3b3939328a46d2e2517",
                "md5": "1ce106cee8421ad9b12c691df64e32a5",
                "sha256": "5d0a91ccad77a5dfaf2258c40448abf4b8dc0cf14946b616bb8356de89fad36f"
            },
            "downloads": -1,
            "filename": "protclust-0.2.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "1ce106cee8421ad9b12c691df64e32a5",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.7",
            "size": 14210,
            "upload_time": "2025-09-09T00:20:41",
            "upload_time_iso_8601": "2025-09-09T00:20:41.116362Z",
            "url": "https://files.pythonhosted.org/packages/b8/0d/2e89ec6371d9251f1f2a03dfff633abbdf2301d6a3b3939328a46d2e2517/protclust-0.2.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "14f4244e75efcc50983b9ad27c86d5f03a9d7ff72344dc8c7e741dd3ffa41f97",
                "md5": "c8f1a552627a86a953e1d634a582e43a",
                "sha256": "230e305b7336015a9db9c795d4602334571bc5cff6c92504e70e72f383a13379"
            },
            "downloads": -1,
            "filename": "protclust-0.2.0.tar.gz",
            "has_sig": false,
            "md5_digest": "c8f1a552627a86a953e1d634a582e43a",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7",
            "size": 37049,
            "upload_time": "2025-09-09T00:20:42",
            "upload_time_iso_8601": "2025-09-09T00:20:42.716987Z",
            "url": "https://files.pythonhosted.org/packages/14/f4/244e75efcc50983b9ad27c86d5f03a9d7ff72344dc8c7e741dd3ffa41f97/protclust-0.2.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-09-09 00:20:42",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "michaelscutari",
    "github_project": "protclust",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [
        {
            "name": "numpy",
            "specs": [
                [
                    ">=",
                    "1.20.0"
                ]
            ]
        },
        {
            "name": "pandas",
            "specs": [
                [
                    ">=",
                    "1.5.0"
                ]
            ]
        },
        {
            "name": "scikit-learn",
            "specs": [
                [
                    ">=",
                    "1.0.0"
                ]
            ]
        },
        {
            "name": "requests",
            "specs": [
                [
                    ">=",
                    "2.25.0"
                ]
            ]
        },
        {
            "name": "pulp",
            "specs": [
                [
                    ">=",
                    "2.0.0"
                ]
            ]
        },
        {
            "name": "pytest",
            "specs": [
                [
                    ">=",
                    "7.0.0"
                ]
            ]
        },
        {
            "name": "pytest-cov",
            "specs": [
                [
                    ">=",
                    "4.0.0"
                ]
            ]
        },
        {
            "name": "pre-commit",
            "specs": [
                [
                    ">=",
                    "2.20.0"
                ]
            ]
        },
        {
            "name": "ruff",
            "specs": [
                [
                    ">=",
                    "0.10.0"
                ]
            ]
        }
    ],
    "lcname": "protclust"
}
        