# AMPGen: A de novo Generation Pipeline Leveraging Evolutionary Information for Broad-Spectrum Antimicrobial Peptide Design
## Overview
AMPGen is a pipeline for generating and evaluating novel antimicrobial peptide (AMP) sequences. Using EvoDiff, a diffusion framework for protein design, AMPGen generates new AMP sequences and employs machine learning models to classify them and predict their antimicrobial efficacy. The pipeline has demonstrated a high success rate: 29 of 34 tested peptides (85.3%) exhibited antimicrobial activity (MIC below 200 µg/ml against at least one bacterium).
## Methods
### Datasets Preparation
We compiled AMP and non-AMP datasets from various public databases for use in our classification and MIC prediction models. The AMP dataset includes sequences from APD, DADP, DBAASP, DRAMP, YADAMP, and dbAMP, resulting in a final set of 10,249 unique sequences with antibacterial targets. The non-AMP dataset, sourced from UniProt, consists of 11,989 sequences filtered to exclude those associated with specific antimicrobial keywords.
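As a rough illustration, the keyword exclusion can be expressed as a filter of the following shape; this is a hedged sketch with hypothetical file, column, and keyword names, not the actual curation script:
```python
# Hedged sketch of the non-AMP keyword filter. The file name, the "Keywords"
# column, and the keyword list are assumptions for illustration only.
import pandas as pd

EXCLUDE = ("antimicrobial", "antibacterial", "antibiotic", "antifungal", "antiviral")

df = pd.read_csv("uniprot_peptides.tsv", sep="\t")   # hypothetical UniProt export
kw = df["Keywords"].fillna("").str.lower()
non_amp = df[~kw.str.contains("|".join(EXCLUDE))]    # drop annotated antimicrobials
print(f"kept {len(non_amp)} of {len(df)} sequences")
```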
### De Novo AMP Generation
AMP sequences are generated using two pre-trained order-agnostic autoregressive diffusion models (OADM) from the [EvoDiff framework](https://github.com/microsoft/evodiff) ([paper](https://www.biorxiv.org/content/10.1101/2023.09.11.556673v1)).
1. **Sequence-Based Generation**:
- The sequence-based model, EvoDiff OA_DM_640M, is pre-trained on the UniRef50 dataset of roughly 42 million protein sequences. It unconditionally generates peptide sequences of length 15-35 aa (a minimal generation sketch follows this list).
2. **MSA-Based Generation**:
- The MSA-based model, EvoDiff MSA_OA_DM_MAXSUB, is trained on the OpenFold dataset and generates sequences in two ways:
- Unconditional generation of peptide sequences of length 15-35 aa.
- Conditional generation using MSAs with known AMP sequences as representative sequences.
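For orientation, unconditional sequence generation with EvoDiff looks roughly like the sketch below, adapted from the examples in the EvoDiff README; the `OA_DM_640M` loader and the `generate_oaardm` signature come from that repository and may differ between versions. AMPGen's own generation scripts (see the Usage Guide) wrap this logic.
```python
# Minimal sketch of unconditional generation with EvoDiff's OADM model,
# adapted from the EvoDiff README; signatures may vary between releases.
from evodiff.pretrained import OA_DM_640M
from evodiff.generate import generate_oaardm

model, collater, tokenizer, scheme = OA_DM_640M()

# Sample one peptide at a length inside AMPGen's 15-35 aa window.
_, generated_sequence = generate_oaardm(
    model, tokenizer, seq_len=25, batch_size=1, device="cpu"
)
print(generated_sequence)
```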
### Classification and Efficacy Prediction
1. **XGBoost-based AMP Classifier**:
- **Dataset Preparation**: Sequences in the AMP dataset were filtered based on length, retaining those within the range of 5 to 65 aa, resulting in a total of 9,964 AMP-labeled peptide sequences as the positive dataset.
- **Feature Extraction**: Features were derived primarily from the PseKRAAC encoding method and QSOrder encoding parameters, yielding 1,311 features across 14 categories.
- **Model Training**: The data was used to train an XGBoost model, with AMP sequences labeled 1 and non-AMP sequences labeled 0. The model was tuned on F1 score and AUC using 10-fold cross-validation to limit overfitting (see the training sketch after this list).
2. **LSTM Regression-based MIC Predictor**:
- **Dataset Preparation**: All entries in the AMP dataset with MIC values were included. The AMP sequences targeting Escherichia coli totaled 7,100, and those targeting Staphylococcus aureus totaled 6,482. For sequences with multiple MIC values against the same bacterium, the values were averaged, converted to a uniform unit of μM, and log-transformed (log₁₀). Additionally, 7,193 sequences from the non-AMP dataset were assigned a logMIC value of 4.
- **Model Training**: Separate regression models were trained on the Escherichia coli and Staphylococcus aureus datasets using Long Short-Term Memory (LSTM) networks. The datasets were split into training, validation, and test sets in a 72:18:10 ratio. Each model comprised two LSTM layers, a dropout layer with a dropout rate of 0.7, and a linear layer, trained with standard L2 (MSE) loss and the Adam optimizer (a PyTorch sketch follows this list).
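As referenced above, a minimal sketch of the classifier training step, assuming a precomputed feature table such as `data/Discriminator_training_data/top14Featured_all.csv` with a binary `label` column (the column name is an assumption; the actual training code lives in `AMP_discriminator/tools/XGboost_train.py`):
```python
# Hedged sketch: XGBoost training with 10-fold CV scored on F1 and ROC AUC.
import pandas as pd
from sklearn.model_selection import cross_validate
from xgboost import XGBClassifier

df = pd.read_csv("data/Discriminator_training_data/top14Featured_all.csv")
X, y = df.drop(columns=["label"]), df["label"]      # AMP = 1, non-AMP = 0

clf = XGBClassifier(n_estimators=500, max_depth=6, learning_rate=0.1,
                    eval_metric="logloss")           # hyperparameters illustrative
scores = cross_validate(clf, X, y, cv=10, scoring=["f1", "roc_auc"])
print("F1:", scores["test_f1"].mean(), "AUC:", scores["test_roc_auc"].mean())
```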
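And a hedged PyTorch sketch of the MIC regressor as described (two LSTM layers, dropout 0.7, linear head, L2 loss, Adam); the input width of 2560 assumes ESM-2 3B embeddings, while the hidden size and learning rate are illustrative rather than the trained checkpoints' actual values:
```python
# Hedged sketch of the LSTM MIC regressor described above.
import torch
import torch.nn as nn

class MICRegressor(nn.Module):
    def __init__(self, embed_dim=2560, hidden_dim=128):  # hidden_dim is an assumption
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2, batch_first=True)
        self.dropout = nn.Dropout(0.7)
        self.fc = nn.Linear(hidden_dim, 1)

    def forward(self, x):                        # x: (batch, seq_len, embed_dim)
        out, _ = self.lstm(x)                    # per-residue hidden states
        return self.fc(self.dropout(out[:, -1, :]))  # logMIC from last position

model = MICRegressor()
criterion = nn.MSELoss()                         # "standard L2 loss"
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```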
## Project Structure
```
── AMPGen
├── AMP_discriminator
│ ├── Discriminator_model
│ │ ├── iFeature
│ │ │ ├── codes
│ │ │ ├── data
│ │ │ └── PseKRAAC
│ │ ├── discriminator.py
│ │ └── features.py
│ └── tools
│ ├── plt.ipynb
│ ├── RF_train.py
│ ├── split.py
│ └── XGboost_train.py
├── AMP_generator
│ ├── calculate_properties.py
│ ├── conditional_generation_msa.py
│ ├── unconditional_generation.py
│ └── unconditional_generation_msa.py
├── data
│ ├── Discriminator_training_data
│ │ ├── classify_all_data_v1.csv
│ │ ├── classify_amp_v1.csv
│ │ ├── classify_nonamp_v1.csv
│ │ └── top14Featured_all.csv
│ ├── example
│ │ ├── msa_files
│ │ │ └── example_1944.a3m
│ │ ├── output
│ │ │ ├── embeddings
│ │ │ └── sequences.fasta
│ │ ├── conditional_generated_sequences.csv
│ │ ├── generated_msa_sequences.csv
│ │ ├── generated_sequences.csv
│ │ └── sequence_properties.csv
│ ├── Scorer_training_data
│ │ ├── regression_ecoli_all.csv
│ │ └── regression_stpa_all.csv
│ ├── combined_database_filtered_v2(1).xlsx
│ └── combined_database_v2(1).xlsx
├── MIC_scorer
│ ├── Scorer_model
│ │ ├── 1stpa_best_model_checkpoint.pth
│ │ ├── 2ecoli_best_model_checkpoint.pth
│ │ ├── ecoliscaler.pkl
│ │ ├── regression.py
│ │ └── stpascaler.pkl
│ ├── tools
│ │ ├── embeddingload.py
│ │ ├── extract.py
│ │ ├── lstm_train.py
│ │ ├── pltlstm.ipynb
│ │ └── tofasta.py
│ └── scorer.py
├── results
│ ├── example_classified_sequences.csv
│ └── example_results.csv
├── .DS_Store
├── .gitattributes
├── .gitignore
├── LICENSE
├── print_directory_tree.py
├── README.md
├── requirements.txt
├── setup.py
└── test.py
```
## Getting Started
### Installation Guide
Welcome to the AMPGen project! This guide will walk you through the steps required to install and set up the necessary environment and dependencies to run AMPGen. Before getting started, please ensure that you have Anaconda installed on your system.
### Prerequisites
To use the AMPGen system, you need Python 3.11 and a few essential libraries. We'll guide you through setting up a clean conda environment, installing EvoDiff, and then the necessary dependencies. Required Python libraries: `numpy`, `pandas`, `tqdm`, `scikit-learn`, `xgboost`.
### Setting Up the Environment
1. **Clone the AMPGen Repository**
Begin by cloning the AMPGen repository to your local machine:
```bash
git clone https://github.com/xiyanxiongnico/AMPGen.git
cd AMPGen
```
2. **Create a Conda Environment**
Next, create a new conda environment with Python 3.11, which is the recommended version for this project:
```bash
conda create --name AMPGen python=3.11
conda activate AMPGen
```
3. **Install Dependencies**
With the new environment activated, install the required packages:
```bash
pip install -r requirements.txt
```
4. **Install PyTorch and Related Packages**
EvoDiff requires PyTorch along with additional libraries. The following example installs a CPU-only build of PyTorch. For optimal performance, choose a PyTorch build (e.g., a CUDA-enabled one) that matches your system:
```bash
conda install pytorch torchvision torchaudio cpuonly -c pytorch
```
---
### Usage Guide
### 1. **Generate New AMP Sequences**
You can generate AMP sequences using the following commands:
### Unconditional Generation of AMP Sequences
You can generate antimicrobial peptide (AMP) sequences with EvoDiff's unconditional generation model by running the script `AMPGen/AMP_generator/unconditional_generation.py`:
```bash
python unconditional_generation.py --total_sequences <total_sequences> --batch_size <batch_size> --output_file <path_to_output_file> --to_device <cuda or cpu>
```
#### Arguments:
- `--total_sequences` (int, required): The total number of sequences to generate.
- `--batch_size` (int, optional): The batch size for sequence generation (default=1).
- `--output_file` (str, required): The path to the output CSV file where the generated sequences will be saved.
- `--to_device` (str, optional): Device to run the model, `cuda` or `cpu` (default=cuda).
#### Example:
```bash
python unconditional_generation.py --total_sequences 10 --output_file ../data/example/generated_sequences.csv --to_device cpu
```
This command will generate 10 sequences and save them to `generated_sequences.csv`.
### Unconditional Generation of AMP Sequences with MSA
You can generate antimicrobial peptide (AMP) sequences with EvoDiff's unconditional MSA-based generation model by running the script `AMPGen/AMP_generator/unconditional_generation_msa.py`:
```bash
python unconditional_generation_msa.py --total_sequences <total_sequences> --batch_size <batch_size> --n_sequences <n_sequences> --output_csv_file <path_to_output_file> --to_device <cuda or cpu>
```
#### Arguments:
- `--total_sequences` (int, required): The total number of sequences to generate.
- `--batch_size` (int, optional): The batch size for sequence generation (default=1).
- `--n_sequences` (int, optional): The number of sequences to subsample from each MSA (default=64).
- `--output_csv_file` (str, required): The path to the output CSV file where the generated sequences will be saved.
- `--to_device` (str, optional): Device to run the model, `cuda` or `cpu` (default=cuda).
#### Example:
```bash
python unconditional_generation_msa.py --total_sequences 10 --output_csv_file ../data/example/generated_msa_sequences.csv
```
This command will generate 10 sequences and save them to `generated_msa_sequences.csv`. This model requires significant computational power; we recommend a GPU for optimal performance.
### Conditional Generation of AMP Sequences with MSA
You can generate antimicrobial peptide (AMP) sequences with EvoDiff's conditional MSA-based generation model by running the script `AMPGen/AMP_generator/conditional_generation_msa.py`:
```bash
python conditional_generation_msa.py --directory_path <path_to_msa_directory> --output_csv_file <path_to_output_file> --max_retries <max_retries> --to_device <cuda or cpu> --total_sequences <total_sequences>
```
#### Arguments:
- `--directory_path` (str, required): Path to the directory containing the MSA files (in `.a3m` format).
- `--output_csv_file` (str, required): The path to the output CSV file where the generated sequences will be saved.
- `--max_retries` (int, optional): Maximum number of retries for processing each file (default=5).
- `--to_device` (str, optional): Device to run the model, `cuda` or `cpu` (default=cuda).
- `--total_sequences` (int, required): The total number of sequences to generate.
#### Example:
```bash
python conditional_generation_msa.py --directory_path ../data/example/msa_files/ --output_csv_file ../data/example/conditional_generated_sequences.csv --to_device cpu --total_sequences 10
```
This command will process the MSA file `example_1944.a3m` from the `msa_files` directory and generate 10 sequences, saving the results to `conditional_generated_sequences.csv`.
### 2. **Calculate Properties of Generated Sequences**
You can calculate the physical and chemical properties of the generated AMP sequences, including molecular weight, net charge, and hydrophobicity, by running the script `AMPGen/AMP_generator/calculate_properties.py`:
```bash
python calculate_properties.py --input_csv_file <path_to_input_file> --output_csv_file <path_to_output_file>
```
#### Arguments:
- `--input_csv_file` (str, required): The path to the input CSV file containing sequences.
- `--output_csv_file` (str, required): The path to the output CSV file where the calculated properties will be saved.
#### Example:
```bash
python calculate_properties.py --input_csv_file ../data/example/generated_sequences.csv --output_csv_file ../data/example/sequence_properties.csv
```
This command will calculate the properties of the sequences in `generated_sequences.csv` and save the results to `sequence_properties.csv`.
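For reference, comparable properties can be computed directly with Biopython's `ProteinAnalysis`. This is a hedged sketch, not necessarily how `calculate_properties.py` is implemented, and the `sequence` column name is an assumption:
```python
# Hedged sketch of peptide property calculation with Biopython.
import pandas as pd
from Bio.SeqUtils.ProtParam import ProteinAnalysis

df = pd.read_csv("../data/example/generated_sequences.csv")  # assumes a "sequence" column
rows = []
for seq in df["sequence"]:
    pa = ProteinAnalysis(seq)
    rows.append({
        "sequence": seq,
        "molecular_weight": pa.molecular_weight(),
        "net_charge_pH7": pa.charge_at_pH(7.0),
        "gravy_hydrophobicity": pa.gravy(),      # Kyte-Doolittle GRAVY index
    })
pd.DataFrame(rows).to_csv("../data/example/sequence_properties.csv", index=False)
```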
### 3. **Identify AMP Candidates (AMP Discriminator)**
To classify sequences as antimicrobial peptides (AMPs) with the AMP Discriminator, run the script `AMPGen/AMP_discriminator/Discriminator_model/discriminator.py`:
```bash
python discriminator.py --train_path <path_to_training_csv> --pre_path <path_to_input_csv> --out_path <path_to_output_csv> --to_device <cuda or cpu>
```
#### Arguments:
- `--train_path` or `-tp` (str, required): The path to the CSV file containing the training data.
- `--pre_path` or `-pp` (str, required): The path to the input CSV file containing the sequences to classify.
- `--out_path` or `-op` (str, required): The path to the output CSV file where the classification results will be saved.
- `--to_device` (str, optional): Device to run the model, `cuda` or `cpu` (default=cuda).
#### Example:
```bash
python discriminator.py --train_path ../../data/Discriminator_training_data/top14Featured_all.csv --pre_path ../../data/example/sequence_properties.csv --out_path ../../results/example_classified_sequences.csv --to_device cpu
```
This command will classify sequences in `sequence_properties.csv` using the model trained on `top14Featured_all.csv`, and save the results to `example_classified_sequences.csv`.
### 4. **Run the MIC Scorer**
This section provides a step-by-step guide to the MIC scorer. The scorer predicts minimum inhibitory concentration (MIC) values with a pre-trained LSTM model operating on protein embeddings generated by an ESM model.
#### **Step 1: Convert Sequences to FASTA Format**
Use the `to_fasta` function to convert your input CSV file (which contains sequences) into a FASTA file:
```bash
python scorer.py --from_csv_path path/to/sequences.csv --to_fasta_path path/to/output/sequences.fasta
```
#### **Step 2: Generate Embeddings with ESM Model**
Once the sequences are in FASTA format, generate their embeddings using the `get_embedding` function and a pre-trained ESM model:
```bash
python scorer.py --from_csv_path path/to/sequences.csv --esm_model_location esm_model_name --output_dir path/to/output/embeddings/
```
- Replace `esm_model_name` with the location of your ESM model (e.g., `esm2_t36_3B_UR50D`).
- The embeddings will be saved to the specified output directory.
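For orientation, embedding extraction with the `fair-esm` package looks roughly like the sketch below; the repository's `tools/extract.py` is likely adapted from ESM's own extraction script, and a smaller ESM-2 checkpoint works the same way if the 3B model does not fit in memory:
```python
# Hedged sketch of per-sequence ESM-2 embedding extraction with fair-esm.
import torch
import esm

model, alphabet = esm.pretrained.esm2_t36_3B_UR50D()
model.eval()
batch_converter = alphabet.get_batch_converter()

data = [("pep1", "GLFDIVKKVVGALGSL")]      # (label, sequence) pairs from the FASTA
_, _, tokens = batch_converter(data)
with torch.no_grad():
    out = model(tokens, repr_layers=[36])  # layer 36 = final layer of the 3B model
reps = out["representations"][36]
embedding = reps[0, 1:-1].mean(0)          # mean-pool residues, dropping BOS/EOS
print(embedding.shape)                     # torch.Size([2560])
```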
#### **Step 3: Load Embeddings**
Use the `load_embeding` function to load the generated embeddings and merge them with the input sequence data:
```bash
python scorer.py --from_csv_path path/to/sequences.csv --output_dir path/to/output/embeddings/
```
#### **Step 4: Predict MIC Values**
Finally, predict MIC values using a pre-trained LSTM model. The `get_predicted_mic` function handles this task:
```bash
python scorer.py --from_csv_path path/to/sequences.csv --scaler_data_path path/to/scaler.pkl --model_path path/to/model.pth --result_path path/to/save/results.csv
```
#### **Full Command Example**
Run the entire MIC scoring process, from sequence conversion to MIC prediction, with `AMPGen/MIC_scorer/scorer.py`:
```bash
python scorer.py --from_csv_path ../results/example_classified_sequences.csv --to_fasta_path ../data/example/output/sequences.fasta --output_dir ../data/example/output/embeddings/ --scaler_data_path ./Scorer_model/stpascaler.pkl --model_path ./Scorer_model/1stpa_best_model_checkpoint.pth --result_path ../results/example_results.csv --to_device cpu
```
This command:
1. Converts the sequences to FASTA format.
2. Generates embeddings using the specified ESM model.
3. Loads the embeddings and prepares the data.
4. Predicts MIC values using the pre-trained LSTM model.
## Contributing
Contributions are welcome! Please open an issue or submit a pull request for any improvements or bug fixes.
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.