# KASPioneer
A comprehensive KASP (Kompetitive Allele Specific PCR) primer design tool for SNP markers, with multiple analysis modes.
## Table of Contents
- [Overview](#overview)
- [Prerequisites](#prerequisites)
- [Installation](#installation)
- [Quick Start](#quick-start)
- [Genome Index Setup](#genome-index-setup)
- [Functions](#functions)
  - [Genome-wide Analysis](#genome-wide-analysis)
  - [Region-based Analysis](#region-based-analysis)
  - [Target SNP Analysis](#target-snp-analysis)
  - [Custom Sequence Analysis](#custom-sequence-analysis)
- [Parameters](#parameters)
- [Output Files](#output-files)
- [Examples](#examples)
- [Performance Optimization](#performance-optimization)
- [Troubleshooting](#troubleshooting)
## Overview
KASPioneer is a modular KASP primer design tool that provides four distinct analysis modes for different research scenarios:
1. **Genome-wide Analysis** (`genome`): Density-based sampling across the entire genome
2. **Region-based Analysis** (`region`): SNP analysis within specific genomic regions
3. **Target SNP Analysis** (`target`): Focused analysis on specific SNPs with flanking analysis
4. **Custom Sequence Analysis** (`custom`): Primer design from custom SNP sequences
## Prerequisites
- Python 3.7+
- Required Python packages:
  - `tqdm` (progress bars)
  - `primer3-py` (primer design calculations)
  - `pysam` (VCF parsing)
## Installation
### Method 1: Direct Installation (Recommended)
```bash
# Clone the repository
git clone <repository-url>
cd KASPioneer_v3
# Install dependencies
pip install tqdm primer3-py pysam numpy pandas
# Make the script executable
chmod +x kasp_designer.py
```
### Method 2: Using the Packaging Script
```bash
# Clone the repository
git clone <repository-url>
cd KASPioneer_v3
# Run the packaging script to create installable package
./run.sh
# Follow the instructions provided by the script for:
# - PyPI + pip installation
# - Conda package installation  
# - Standalone executable creation
```
### Method 3: Development Installation
```bash
# Clone the repository
git clone <repository-url>
cd KASPioneer_v3
# Install in development mode
pip install -e .
# Or install with all dependencies
pip install -r requirements.txt
```
### System Requirements
- **Python**: 3.7 or higher
- **Memory**: At least 4GB RAM (8GB+ recommended for large genomes)
- **Storage**: At least 2GB free space for genome indices
- **External Tools**: bowtie2 (for primer specificity validation)
### Installing bowtie2
```bash
# Ubuntu/Debian
sudo apt-get install bowtie2
# CentOS/RHEL
sudo yum install bowtie2
# macOS
brew install bowtie2
# Or download from: http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
```
## Quick Start
### Test Data
The repository includes test data in the `test_data/` directory:
- **`BSA-test.vcf.gz`**: BSA (Bulked Segregant Analysis) test VCF file with sample data
- **`3k-test.vcf.gz`**: 3K rice genome test VCF file
- **`msu7_chr1_12.fa`**: Rice genome reference (chromosomes 1-12)
- **`test_snp_sequences.txt`**: Example custom SNP sequences file
- **`filtered.sh`**: Example filtering script
### 1. Build Genome Index (Required First Step)
**⚠️ IMPORTANT: Build the genome index before running any primer design function; it is required for primer specificity validation.**
```bash
# Using test data
python kasp_designer.py build_index --ref test_data/msu7_chr1_12.fa --bowtie_index genome_index/msu7_chr1_12
# Using your own data
python kasp_designer.py build_index --ref genome.fa --bowtie_index genome_index/genome
```
### 2. Run Primer Design
#### Using Test Data (Quick Test)
```bash
# Genome-wide analysis with test data
python kasp_designer.py genome --vcf test_data/BSA-test.vcf.gz --ref test_data/msu7_chr1_12.fa --mother PR25 --father RIBENQING --output_prefix test_genome --bowtie_index genome_index/msu7_chr1_12 --threads 4
# Region-based analysis with test data
python kasp_designer.py region --vcf test_data/BSA-test.vcf.gz --ref test_data/msu7_chr1_12.fa --chrom 1 --start 1000000 --end 2000000 --mother PR25 --father RIBENQING --output_prefix test_region --bowtie_index genome_index/msu7_chr1_12 --threads 4
# Target SNP analysis with test data
python kasp_designer.py target --vcf test_data/BSA-test.vcf.gz --ref test_data/msu7_chr1_12.fa --chrom 1 --position 1500000 --mother PR25 --father RIBENQING --output_prefix test_target --bowtie_index genome_index/msu7_chr1_12 --threads 4
# Custom sequence analysis with test data
python kasp_designer.py custom --snp_sequences test_data/test_snp_sequences.txt --ref test_data/msu7_chr1_12.fa --output_prefix test_custom --bowtie_index genome_index/msu7_chr1_12 --threads 4
```
#### Using Your Own Data
```bash
# Genome-wide analysis
python kasp_designer.py genome --vcf variants.vcf.gz --ref genome.fa --mother PR25 --father RIBENQING --output_prefix results_genome --bowtie_index genome_index/genome --threads 12
# Region-based analysis
python kasp_designer.py region --vcf variants.vcf.gz --ref genome.fa --chrom chr1 --start 123456 --end 173456 --mother PR25 --father RIBENQING --output_prefix results_region --bowtie_index genome_index/genome --threads 12
# Target SNP analysis
python kasp_designer.py target --vcf variants.vcf.gz --ref genome.fa --chrom chr1 --position 123456 --mother PR25 --father RIBENQING --output_prefix results_target --bowtie_index genome_index/genome --threads 12
# Custom sequence analysis
python kasp_designer.py custom --snp_sequences snp_list.txt --ref genome.fa --output_prefix results_custom --bowtie_index genome_index/genome --threads 12
```
## Genome Index Setup
### Building the Index
The genome index is essential for primer specificity validation. Build it once and reuse it for all analyses:
```bash
python kasp_designer.py build_index --ref genome.fa --bowtie_index genome_index/genome
```
**Parameters:**
- `--ref`: Reference genome FASTA file
- `--bowtie_index`: Output path prefix for the bowtie2 index (without extension)
**Output:**
- Creates bowtie2 index files: `genome_index/genome.1.bt2`, `genome_index/genome.2.bt2`, etc.
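Before launching a long run, it can help to confirm that the index is actually present. The sketch below is not part of KASPioneer; `bowtie2_index_exists` is a hypothetical helper that checks for the six files bowtie2 normally writes for an index prefix (`.bt2`, or `.bt2l` for large indices).

```python
from pathlib import Path

# The six suffixes bowtie2-build writes for a given index prefix.
_INDEX_PARTS = (".1", ".2", ".3", ".4", ".rev.1", ".rev.2")


def bowtie2_index_exists(prefix: str) -> bool:
    """Return True if a complete small (.bt2) or large (.bt2l) index exists.

    Hypothetical helper for illustration; KASPioneer may check differently.
    """
    for ext in ("bt2", "bt2l"):
        if all(Path(f"{prefix}{part}.{ext}").is_file() for part in _INDEX_PARTS):
            return True
    return False


if __name__ == "__main__":
    print(bowtie2_index_exists("genome_index/genome"))
```

The prefix passed here is the same value given to `--bowtie_index`.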
## Functions
### Genome-wide Analysis
**Function:** `genome`
**Purpose:** Performs density-based sampling across the entire genome, dividing each chromosome into bins and selecting representative SNPs for primer design. This function is ideal for genome-wide association studies (GWAS) and large-scale marker development.
**How it works:**
1. **Chromosome Division**: Each chromosome is divided into equal-sized bins (default: 20 bins per chromosome)
2. **SNP Selection**: Within each bin, SNPs are selected using uniform bin sampling to ensure even distribution
3. **Flanking Analysis**: For each selected SNP, flanking SNPs are analyzed to ensure proper spacing
4. **Parallel Processing**: Uses chromosome-based multiprocessing for optimal performance
5. **Quality Assessment**: Each primer set is evaluated for quality and specificity
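The binning step can be pictured with a short sketch. This is an illustration under stated assumptions, not the tool's actual code: `pick_bin_medians` is a hypothetical name, bins are assumed to be equal-width in base pairs, and the representative SNP is assumed to be the median SNP of each non-empty bin.

```python
from typing import Dict, List


def pick_bin_medians(positions: List[int], chrom_len: int, bins: int = 20) -> Dict[int, int]:
    """Map each non-empty bin index to the median SNP position in that bin.

    Assumes 1-based SNP positions and equal-width bins (illustrative only).
    """
    # Ceiling division so that the last bin reaches the chromosome end.
    bin_size = -(-chrom_len // bins)
    by_bin: Dict[int, List[int]] = {}
    for pos in sorted(positions):
        idx = min((pos - 1) // bin_size, bins - 1)
        by_bin.setdefault(idx, []).append(pos)
    # For an even count this takes the upper of the two middle SNPs.
    return {idx: snps[len(snps) // 2] for idx, snps in by_bin.items()}


picked = pick_bin_medians([5, 40, 90, 130, 160, 199], chrom_len=200, bins=4)
print(picked)
```

Raising `bins` (the counterpart of `--bins_per_chr`) produces more, narrower bins and therefore more selected SNPs.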
**Command:**
```bash
python kasp_designer.py genome --vcf variants.vcf.gz --ref genome.fa --mother PR25 --father RIBENQING --output_prefix results_genome --bowtie_index genome_index/genome --threads 12
```
**Required Parameters:**
- `--vcf`: Input VCF file (can be .gz compressed) - Contains SNP variants
- `--ref`: Reference genome FASTA file - Used for primer design and specificity validation
- `--mother`: Mother sample name in VCF - Must be homozygous and carry a different allele from the father
- `--father`: Father sample name in VCF - Must be homozygous and carry a different allele from the mother
- `--output_prefix`: Output file prefix - Results will be saved as `{output_prefix}_genome.tsv`
**Optional Parameters:**
- `--bins_per_chr`: Number of bins per chromosome (default: 20) - Higher values = more SNPs selected
- `--flank_snps`: Number of flanking SNPs on each side of the median SNP of a bin (default: 10) - Used to check spacing around the selected SNP
- `--threads`: Number of parallel threads (default: CPU cores - 1) - Optimize based on your system
**Output File:** `{output_prefix}_genome.tsv`
**Use Cases:**
- Genome-wide marker development
- Large-scale genetic mapping
- Population genetics studies
- Breeding program marker selection
### Region-based Analysis
**Function:** `region`
**Purpose:** Analyzes SNPs within a specific genomic region and designs primers for all qualifying SNPs. This function is perfect for fine-mapping studies, QTL analysis, and candidate gene regions.
**How it works:**
1. **Region Definition**: Defines a specific genomic region by chromosome and coordinates
2. **SNP Extraction**: Extracts all SNPs within the specified region from the VCF file
3. **Quality Filtering**: Applies quality filters and parent genotype requirements
4. **Batch Processing**: Uses SNP batch-based multiprocessing for efficient analysis
5. **Comprehensive Analysis**: Designs primers for all qualifying SNPs in the region
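The region and parent-genotype filters in steps 2 and 3 can be sketched as below. The function names are hypothetical and the logic is an assumption based on the parameter descriptions (both parents homozygous, carrying different alleles). Genotypes are written as tuples of allele indices, the shape pysam reports for the `GT` field, with `None` for a missing call.

```python
from typing import Optional, Tuple

Genotype = Tuple[Optional[int], Optional[int]]


def in_region(pos: int, start: int, end: int) -> bool:
    """1-based, inclusive interval check (assumed to match --start/--end)."""
    return start <= pos <= end


def is_homozygous(gt: Genotype) -> bool:
    """True when both alleles are called and identical."""
    return None not in gt and gt[0] == gt[1]


def parents_are_informative(mother: Genotype, father: Genotype) -> bool:
    """True when both parents are homozygous for different alleles."""
    return is_homozygous(mother) and is_homozygous(father) and mother[0] != father[0]


print(parents_are_informative((0, 0), (1, 1)))  # informative marker
print(parents_are_informative((0, 1), (1, 1)))  # heterozygous mother
```

A SNP that fails either check cannot distinguish the two parental alleles, so it would be skipped before primer design.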
**Command:**
```bash
python kasp_designer.py region --vcf variants.vcf.gz --ref genome.fa --chrom chr1 --start 123456 --end 173456 --mother PR25 --father RIBENQING --output_prefix results_region --bowtie_index genome_index/genome --threads 12
```
**Required Parameters:**
- `--vcf`: Input VCF file (can be .gz compressed) - Contains SNP variants
- `--ref`: Reference genome FASTA file - Used for primer design and specificity validation
- `--chrom`: Chromosome name - Target chromosome (e.g., "chr1", "1", "Chr1")
- `--start`: Start position of the region - Beginning coordinate (1-based)
- `--end`: End position of the region - Ending coordinate (1-based)
- `--mother`: Mother sample name in VCF - Must be homozygous and carry a different allele from the father
- `--father`: Father sample name in VCF - Must be homozygous and carry a different allele from the mother
- `--output_prefix`: Output file prefix - Results will be saved as `{output_prefix}_region.tsv`
**Output File:** `{output_prefix}_region.tsv`
**Use Cases:**
- Fine-mapping studies
- QTL (Quantitative Trait Loci) analysis
- Candidate gene regions
- Linkage analysis
- Regional association studies
### Target SNP Analysis
**Function:** `target`
**Purpose:** Focuses on a specific target SNP and optionally includes flanking SNPs for comprehensive analysis. This function is ideal for validating specific markers, designing primers for known functional SNPs, or analyzing SNPs of particular interest.
**How it works:**
1. **Target Identification**: Locates the specific SNP at the given chromosome position
2. **Flanking Analysis**: Optionally includes flanking SNPs for comprehensive coverage
3. **Quality Assessment**: Evaluates each SNP for primer design feasibility
4. **Batch Processing**: Uses SNP batch-based multiprocessing for efficient analysis
5. **Detailed Results**: Provides comprehensive primer information for each SNP
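A minimal sketch of how a target SNP and its neighbours might be collected and labelled. `label_flanks` is a hypothetical helper, and it assumes that flanking SNPs are simply the nearest `flank_snps` SNPs on each side in position order; the labels match the `flank_direction` values listed under Output Files.

```python
from bisect import bisect_left
from typing import List, Tuple


def label_flanks(snp_positions: List[int], target: int, flank_snps: int = 10) -> List[Tuple[int, str]]:
    """Return (position, label) pairs for the target and its nearest neighbours."""
    positions = sorted(snp_positions)
    i = bisect_left(positions, target)
    if i == len(positions) or positions[i] != target:
        raise ValueError(f"no SNP at position {target}")
    left = positions[max(0, i - flank_snps):i]
    right = positions[i + 1:i + 1 + flank_snps]
    return (
        [(p, "FLANKING_LEFT") for p in left]
        + [(target, "TARGET")]
        + [(p, "FLANKING_RIGHT") for p in right]
    )


print(label_flanks([100, 200, 300, 400, 500], target=300, flank_snps=1))
```

With `flank_snps=0` only the target itself is returned, mirroring the documented behaviour of `--flank_snps 0`.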
**Command:**
```bash
python kasp_designer.py target --vcf variants.vcf.gz --ref genome.fa --chrom chr1 --position 123456 --mother PR25 --father RIBENQING --output_prefix results_target --bowtie_index genome_index/genome --threads 12
```
**Required Parameters:**
- `--vcf`: Input VCF file (can be .gz compressed) - Contains SNP variants
- `--ref`: Reference genome FASTA file - Used for primer design and specificity validation
- `--chrom`: Chromosome name - Target chromosome (e.g., "chr1", "1", "Chr1")
- `--position`: Target SNP position - Exact coordinate of the target SNP (1-based)
- `--mother`: Mother sample name in VCF - Must be homozygous and carry a different allele from the father
- `--father`: Father sample name in VCF - Must be homozygous and carry a different allele from the mother
- `--output_prefix`: Output file prefix - Results will be saved as `{output_prefix}_target.tsv`
**Optional Parameters:**
- `--flank_snps`: Number of flanking SNPs on each side of target SNP (default: 10) - Set to 0 for target SNP only
**Output File:** `{output_prefix}_target.tsv`
**Use Cases:**
- Validating specific markers
- Designing primers for known functional SNPs
- Analyzing SNPs of particular interest
- Marker-assisted selection (MAS)
- Genotyping specific loci
### Custom Sequence Analysis
**Function:** `custom`
**Purpose:** Designs primers from custom SNP sequences provided in a text file. This function is perfect for designing primers from known sequences, validating specific sequences, or working with sequences from external sources.
**How it works:**
1. **Sequence Parsing**: Reads custom SNP sequences from input file
2. **SNP Detection**: Automatically detects SNP positions and alleles from sequence format
3. **Primer Design**: Designs KASP primers for each custom sequence
4. **Batch Processing**: Uses SNP batch-based multiprocessing for efficient analysis
5. **Quality Assessment**: Evaluates primer quality and specificity for each sequence
**Command:**
```bash
python kasp_designer.py custom --snp_sequences snp_list.txt --ref genome.fa --output_prefix results_custom --bowtie_index genome_index/genome --threads 12
```
**Required Parameters:**
- `--snp_sequences`: Input file with SNP sequences - Text file containing custom sequences
- `--ref`: Reference genome FASTA file - Used for primer design and specificity validation
- `--output_prefix`: Output file prefix - Results will be saved as `{output_prefix}_custom.tsv`
**Input File Format:**
```
SNP_ID1,ATCGATCGATCG[A/T]GATCGATCGATCG
SNP_ID2,GCTAGCTAGCTA[G/C]TAGCTAGCTAGCT
SNP_ID3,TTTTTTTTTTTTTTTTTTTT[C/G]TTTTTTTTTTTTTTTTTTTT
```
**Format Rules:**
- One SNP per line
- Format: `SNP_ID,SEQUENCE_WITH_SNP`
- SNP format: `[REF/ALT]` where REF and ALT are the two alleles
- Sequence should be at least 50 bp long for optimal primer design (the examples above are shorter than this and only illustrate the format)
- SNP should be positioned at least 25 bp from each end of the sequence
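The format rules can be checked before a run with a small parser. This is a sketch, not the tool's own parser: `CustomSnp`, `parse_snp_line`, and `meets_length_rules` are hypothetical names, and the regular expression assumes single-base alleles and an ID without commas or whitespace.

```python
import re
from typing import NamedTuple


class CustomSnp(NamedTuple):
    snp_id: str
    upstream: str
    ref: str
    alt: str
    downstream: str


# SNP_ID,UPSTREAM[REF/ALT]DOWNSTREAM with single-base alleles (assumed).
_LINE = re.compile(r"^([^,\s]+),([ACGTN]*)\[([ACGT])/([ACGT])\]([ACGTN]*)$", re.IGNORECASE)


def parse_snp_line(line: str) -> CustomSnp:
    """Parse one input line; raise ValueError if it does not match the format."""
    match = _LINE.match(line.strip())
    if match is None:
        raise ValueError(f"malformed SNP line: {line!r}")
    snp_id, up, ref, alt, down = match.groups()
    return CustomSnp(snp_id, up.upper(), ref.upper(), alt.upper(), down.upper())


def meets_length_rules(snp: CustomSnp, min_flank: int = 25) -> bool:
    """True if the SNP sits at least min_flank bases from both sequence ends."""
    return len(snp.upstream) >= min_flank and len(snp.downstream) >= min_flank


snp = parse_snp_line("SNP_ID1,ATCGATCGATCG[A/T]GATCGATCGATCG")
print(snp.ref, snp.alt, meets_length_rules(snp))
```

Two flanks of at least 25 bp already give a sequence of 51 bp, so the 25 bp check also covers the 50 bp minimum.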
**Output File:** `{output_prefix}_custom.tsv`
**Use Cases:**
- Designing primers from known sequences
- Validating specific sequences
- Working with sequences from external sources
- Converting existing markers to KASP format
- Designing primers for specific gene regions
## Parameters
### Common Parameters
All functions share these common parameters:
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `--min_qual` | float | 30.0 | Minimum SNP quality score |
| `--verbose` | flag | False | Enable verbose logging |
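| `--min_left_dist` | int | 26 | Minimum allowed distance (bp) to the nearest SNP on the left; SNPs with a closer left neighbor are filtered |
| `--min_right_dist` | int | 80 | Minimum allowed distance (bp) to the nearest SNP on the right; SNPs with a closer right neighbor are filtered |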
| `--upstream_len` | int | 30 | Upstream sequence length |
| `--downstream_len` | int | 100 | Downstream sequence length |
| `--min_primer_len` | int | 20 | Minimum primer length |
| `--max_primer_len` | int | 26 | Maximum primer length |
| `--min_gc` | float | 30.0 | Minimum GC content (%) |
| `--max_gc` | float | 60.0 | Maximum GC content (%) |
| `--min_tm` | float | 55.0 | Minimum melting temperature (°C) |
| `--max_tm` | float | 61.0 | Maximum melting temperature (°C) |
| `--max_tm_diff` | float | 3.0 | Maximum Tm difference between primers (°C) |
| `--max_product_len` | int | 100 | Maximum PCR product length (bp) |
| `--fam_tail` | string | GAAGGTGACCAAGTTCATGCT | FAM tail sequence |
| `--hex_tail` | string | GAAGGTCGGAGTCAACGGATT | HEX tail sequence |
| `--threads` | int | CPU cores - 1 | Number of parallel threads |
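To make the length, GC, and tail parameters concrete, here is a small sketch using the default values from the table. The helper names are hypothetical; Tm is left out because it requires a thermodynamic model such as the one in `primer3-py`. The tail is placed at the 5' end of the allele-specific primer, which is the usual KASP layout.

```python
FAM_TAIL = "GAAGGTGACCAAGTTCATGCT"  # default --fam_tail
HEX_TAIL = "GAAGGTCGGAGTCAACGGATT"  # default --hex_tail


def gc_percent(seq: str) -> float:
    """GC content of a sequence in percent (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def passes_basic_filters(primer: str, min_len: int = 20, max_len: int = 26,
                         min_gc: float = 30.0, max_gc: float = 60.0) -> bool:
    """Check primer length and GC content against the documented defaults."""
    return min_len <= len(primer) <= max_len and min_gc <= gc_percent(primer) <= max_gc


def with_tail(tail: str, primer: str) -> str:
    """Prepend the reporter tail to the 5' end of an allele-specific primer."""
    return tail + primer


print(passes_basic_filters("ATGCATGCATGCATGCATGC"))
print(with_tail(FAM_TAIL, "ATGCATGCATGCATGCATGC"))
```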
## Output Files
### Output File Structure
All functions generate TSV (Tab-Separated Values) files with the following columns:
#### Common Columns
| Column | Description |
|--------|-------------|
| `chrom` | Chromosome name |
| `pos` | SNP position |
| `ref` | Reference allele |
| `alt` | Alternative allele |
| `forwardA` | Forward primer for allele A |
| `forwardB` | Forward primer for allele B |
| `reverse` | Reverse primer |
| `forwardA_withtail` | Forward primer A with FAM tail |
| `forwardB_withtail` | Forward primer B with HEX tail |
| `fA_tm` | Melting temperature of forward primer A (°C) |
| `fB_tm` | Melting temperature of forward primer B (°C) |
| `r_tm` | Melting temperature of reverse primer (°C) |
| `fA_gc` | GC content of forward primer A (%) |
| `fB_gc` | GC content of forward primer B (%) |
| `r_gc` | GC content of reverse primer (%) |
| `product_len` | PCR product length (bp) |
| `primer_quality_score` | Overall primer quality score (0-100) |
| `structure_quality` | Structure quality score (0-1) |
| `recommended_annealing_temp` | Recommended annealing temperature (°C) |
| `specificity` | Specificity validation result |
#### Function-Specific Columns
**Genome Analysis (`genome`):**
| Column | Description |
|--------|-------------|
| `bin_name` | Bin name (e.g., "1_Bin_1") |
**Region Analysis (`region`):**
- No additional columns
**Target Analysis (`target`):**
| Column | Description |
|--------|-------------|
| `flank_direction` | SNP type: "TARGET", "FLANKING_LEFT", or "FLANKING_RIGHT" |
**Custom Analysis (`custom`):**
| Column | Description |
|--------|-------------|
| `snp_id` | Custom SNP identifier |
| `sequence` | Original input sequence |
| `ref_allele` | Reference allele |
| `alt_allele` | Alternative allele |
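The TSV output can be post-processed with the standard library alone. The sketch below assumes the file starts with a header row using the column names listed above; `high_quality_rows` is a hypothetical helper and the sample rows are made-up values for illustration only.

```python
import csv
import io
from typing import Dict, Iterable, List


def high_quality_rows(handle: Iterable[str], min_score: float = 80.0) -> List[Dict[str, str]]:
    """Keep rows whose primer_quality_score is at least min_score."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if float(row["primer_quality_score"]) >= min_score]


# Made-up rows with a reduced set of columns, for illustration only.
sample = (
    "chrom\tpos\tprimer_quality_score\n"
    "1\t1500000\t92.5\n"
    "1\t1500321\t64.0\n"
)
rows = high_quality_rows(io.StringIO(sample))
print(rows)
```

For a real result file, pass an open file handle instead, for example `open("results_region_region.tsv", newline="")` when `--output_prefix results_region` was used.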
### Quality Metrics
#### Primer Quality Score (0-100)
The primer quality score is a comprehensive assessment of primer set quality, calculated using multiple factors:
**Score Ranges:**
- **90-100**: Excellent quality - Ideal for high-throughput genotyping
- **80-89**: Very good quality - Suitable for most applications
- **70-79**: Good quality - Acceptable for most studies
- **60-69**: Fair quality - May require optimization
- **50-59**: Poor quality - Not recommended for production use
- **0-49**: Very poor quality - Significant issues detected
**Scoring Factors (with penalties):**
1. **Tm Difference Penalty** (0-25 points deducted)
   - Ideal: < 2°C difference between primers
   - Penalty: 5 points per 1°C over 3°C difference
   - Tm difference: `max(abs(fA_tm - r_tm), abs(fB_tm - r_tm))`
2. **GC Content Penalty** (0-15 points deducted)
   - Ideal range: 40-60% for all primers
   - Penalty: 5 points per primer outside ideal range
   - Affects primer stability and specificity
3. **Product Length Penalty** (0-10 points deducted)
   - Ideal range: 60-120bp
   - Penalty: 10 points if outside range
   - Affects PCR efficiency and resolution
4. **Structure Quality Multiplier** (0-1 multiplier)
   - Multiplies total score by structure quality
   - Accounts for secondary structure issues
5. **Primer Length Penalty** (0-9 points deducted)
   - Ideal range: 18-25bp for all primers
   - Penalty: 3 points per primer outside range
   - Affects primer specificity and efficiency
**Final Score Calculation:**
```
Score = (100 - Tm_penalty - GC_penalty - Product_penalty - Length_penalty) × Structure_quality
```
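The scoring rules above can be written out as a short function. This is a reading of the documented penalties, not the tool's source: the function name is hypothetical, the Tm penalty is assumed to scale linearly above 3°C and cap at 25 points, and range boundaries are assumed to be inclusive.

```python
from typing import Sequence


def primer_quality_score(tms: Sequence[float], gcs: Sequence[float],
                         lengths: Sequence[int], product_len: int,
                         structure_quality: float) -> float:
    """Score a primer set; tms, gcs, lengths are ordered (forwardA, forwardB, reverse)."""
    fa_tm, fb_tm, r_tm = tms
    tm_diff = max(abs(fa_tm - r_tm), abs(fb_tm - r_tm))
    # 5 points per 1 °C above a 3 °C difference, capped at 25 (assumed linear).
    tm_penalty = min(25.0, 5.0 * max(0.0, tm_diff - 3.0))
    # 5 points per primer outside 40-60 % GC (at most 15).
    gc_penalty = 5.0 * sum(1 for gc in gcs if not 40.0 <= gc <= 60.0)
    # Flat 10 points when the product is outside 60-120 bp.
    product_penalty = 0.0 if 60 <= product_len <= 120 else 10.0
    # 3 points per primer outside 18-25 bp (at most 9).
    length_penalty = 3.0 * sum(1 for n in lengths if not 18 <= n <= 25)
    base = 100.0 - tm_penalty - gc_penalty - product_penalty - length_penalty
    return base * structure_quality


print(primer_quality_score((58, 58, 58), (50, 50, 50), (22, 22, 22), 80, 1.0))
print(primer_quality_score((60, 58, 55), (35, 50, 50), (26, 22, 22), 130, 0.5))
```

In the second call the penalties are 10 (Tm), 5 (GC), 10 (product), and 3 (length), giving 72 before the 0.5 structure multiplier.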
#### Structure Quality Score (0-1)
The structure quality score evaluates secondary structure formation in primers:
**Score Ranges:**
- **1.0**: No secondary structures detected
- **0.7-0.9**: Minor secondary structures (acceptable)
- **0.4-0.6**: Moderate secondary structures (may affect performance)
- **0.0-0.3**: Significant secondary structures (not recommended)
**Structure Types Evaluated:**
1. **Hairpin Structures** (0.3 penalty)
   - Self-complementary regions within primers
   - Can prevent the primer from annealing to the template
   - Detected using primer3 hairpin calculation
2. **Homodimer Structures** (0.2 penalty)
   - Self-dimerization of primers
   - Can reduce primer availability
   - Detected using primer3 homodimer calculation
3. **Heterodimer Structures** (0.2-0.3 penalty)
   - Cross-dimerization between primers
   - ForwardA-Reverse: 0.2 penalty
   - ForwardB-Reverse: 0.2 penalty
   - ForwardA-ForwardB: 0.3 penalty
**Calculation:**
```
Structure_quality = max(0, 1.0 - hairpin_penalty - homodimer_penalty - heterodimer_penalty)
```
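A sketch of the same calculation, taking the detection results as booleans. How the tool decides that a structure is present, and whether hairpin and homodimer penalties apply once per primer set or once per primer, is not stated here; this example assumes one penalty per structure type per set. The names are hypothetical.

```python
# Penalties as listed above (one per structure type, assumed per primer set).
PENALTIES = {
    "hairpin": 0.3,
    "homodimer": 0.2,
    "heterodimer_fa_r": 0.2,
    "heterodimer_fb_r": 0.2,
    "heterodimer_fa_fb": 0.3,
}


def structure_quality(**found: bool) -> float:
    """Return 1.0 minus the penalties of every structure flagged as found."""
    unknown = set(found) - set(PENALTIES)
    if unknown:
        raise KeyError(f"unknown structure type(s): {sorted(unknown)}")
    total = sum(PENALTIES[name] for name, present in found.items() if present)
    return max(0.0, 1.0 - total)


print(structure_quality())
print(structure_quality(hairpin=True))
print(structure_quality(hairpin=True, homodimer=True, heterodimer_fa_r=True,
                        heterodimer_fb_r=True, heterodimer_fa_fb=True))
```

Because the penalties sum to 1.2, a primer set with every structure type detected is clamped to 0 rather than going negative.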
#### Recommended Annealing Temperature
The recommended annealing temperature is calculated to optimize PCR performance:
**Formula:**
```
Recommended_annealing_temp = min(fA_tm, fB_tm, r_tm) - 3°C
```
**Rationale:**
- Uses the lowest Tm among all primers as baseline
- Subtracts 3°C to ensure all primers can anneal efficiently
- Provides optimal balance between specificity and efficiency
- Accounts for primer competition in multiplex reactions
**Temperature Ranges:**
- **55-65°C**: Optimal range for most KASP reactions
- **50-55°C**: May require optimization for specificity
- **65-70°C**: May reduce efficiency for some primers
- **<50°C or >70°C**: Not recommended for KASP reactions
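The formula and the ranges above can be combined into a small check. The function names are hypothetical, and because the listed ranges share their end points, this sketch assigns 55°C and 65°C to the optimal range as an assumption.

```python
def recommended_annealing_temp(fa_tm: float, fb_tm: float, r_tm: float) -> float:
    """Lowest primer Tm minus 3 °C, as in the formula above."""
    return min(fa_tm, fb_tm, r_tm) - 3.0


def classify_annealing_temp(temp: float) -> str:
    """Map a temperature to the documented ranges (boundary handling assumed)."""
    if temp < 50.0 or temp > 70.0:
        return "not recommended"
    if temp < 55.0:
        return "may require optimization for specificity"
    if temp <= 65.0:
        return "optimal"
    return "may reduce efficiency"


ta = recommended_annealing_temp(58.0, 59.0, 60.0)
print(ta, classify_annealing_temp(ta))
```

With the default `--min_tm 55` and `--max_tm 61`, the recommended value falls between 52°C and 58°C.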
## Examples
### Example 1: Quick Test with Test Data
```bash
# Build index using test data
python kasp_designer.py build_index --ref test_data/msu7_chr1_12.fa --bowtie_index genome_index/msu7_chr1_12
# Test genome-wide analysis
python kasp_designer.py genome \
  --vcf test_data/BSA-test.vcf.gz \
  --ref test_data/msu7_chr1_12.fa \
  --mother PR25 \
  --father RIBENQING \
  --output_prefix test_genome \
  --bowtie_index genome_index/msu7_chr1_12 \
  --bins_per_chr 5 \
  --flank_snps 3 \
  --threads 4
```
### Example 2: Region-based Analysis with Test Data
```bash
# Test region-based analysis
python kasp_designer.py region \
  --vcf test_data/BSA-test.vcf.gz \
  --ref test_data/msu7_chr1_12.fa \
  --chrom 1 \
  --start 1000000 \
  --end 2000000 \
  --mother PR25 \
  --father RIBENQING \
  --output_prefix test_region \
  --bowtie_index genome_index/msu7_chr1_12 \
  --threads 4
```
### Example 3: Target SNP Analysis with Test Data
```bash
# Test target SNP analysis
python kasp_designer.py target \
  --vcf test_data/BSA-test.vcf.gz \
  --ref test_data/msu7_chr1_12.fa \
  --chrom 1 \
  --position 1500000 \
  --mother PR25 \
  --father RIBENQING \
  --output_prefix test_target \
  --bowtie_index genome_index/msu7_chr1_12 \
  --flank_snps 2 \
  --threads 4
```
### Example 4: Custom Sequence Analysis with Test Data
**Test input file (`test_data/test_snp_sequences.txt`):**
```
SNP_001,ATCGATCGATCGATCGATCG[A/T]GATCGATCGATCGATCGATCG
SNP_002,GCTAGCTAGCTAGCTAGCTA[G/C]TAGCTAGCTAGCTAGCTAGCT
SNP_003,TTTTTTTTTTTTTTTTTTTT[C/G]TTTTTTTTTTTTTTTTTTTT
```
**Command:**
```bash
python kasp_designer.py custom \
  --snp_sequences test_data/test_snp_sequences.txt \
  --ref test_data/msu7_chr1_12.fa \
  --output_prefix test_custom \
  --bowtie_index genome_index/msu7_chr1_12 \
  --threads 4
```
### Example 5: Production Analysis with Your Data
```bash
# Build index for your genome
python kasp_designer.py build_index --ref your_genome.fa --bowtie_index genome_index/your_genome
# Run genome-wide analysis
python kasp_designer.py genome \
  --vcf your_variants.vcf.gz \
  --ref your_genome.fa \
  --mother YOUR_MOTHER_SAMPLE \
  --father YOUR_FATHER_SAMPLE \
  --output_prefix production_genome_analysis \
  --bowtie_index genome_index/your_genome \
  --bins_per_chr 20 \
  --flank_snps 10 \
  --threads 16
```
## Performance Optimization
### Threading Strategy
- **Genome Analysis**: Uses chromosome-based parallel processing
- **Other Analyses**: Uses batch-based parallel processing
- **Recommended**: Use `--threads` equal to your CPU cores minus 1
### Memory Usage
- **Large Genomes**: Consider using compressed VCF files (.vcf.gz)
- **Memory Optimization**: The tool processes chromosomes sequentially to minimize memory usage
### Speed Tips
1. **Use compressed VCF files** (.vcf.gz) for faster I/O
2. **Pre-build genome index** and reuse for multiple analyses
3. **Adjust batch sizes** based on your system's memory
4. **Use SSD storage** for faster file I/O operations
## Troubleshooting
### Common Issues
#### 1. "Genome index not found" Error
```bash
# Solution: Build the genome index first
python kasp_designer.py build_index --ref genome.fa --bowtie_index genome_index/genome
```
#### 2. "Sample not found in VCF" Error
```bash
# Check sample names in VCF file
bcftools query -l variants.vcf.gz
# Use exact sample names (case-sensitive)
python kasp_designer.py genome --mother PR25 --father RIBENQING ...
```
#### 3. "No SNPs found" Warning
- Check VCF file quality filters (`--min_qual`)
- Verify chromosome names match between VCF and reference
- Check parent genotype filters (must be homozygous and different)
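A mismatch in chromosome naming (for example `1` in the VCF and `Chr1` in the FASTA) is easy to detect with the standard library. The helpers below are hypothetical and operate on any iterable of text lines, so they work with `open(...)` for plain files and `gzip.open(path, "rt")` for `.vcf.gz` files.

```python
from typing import Iterable, Set


def fasta_names(lines: Iterable[str]) -> Set[str]:
    """Sequence names from FASTA headers (text up to the first whitespace)."""
    return {
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and line[1:].strip()
    }


def vcf_chroms(lines: Iterable[str]) -> Set[str]:
    """Chromosome names from the first column of VCF data lines."""
    return {
        line.split("\t", 1)[0]
        for line in lines
        if line.strip() and not line.startswith("#")
    }


fasta = [">Chr1 rice chromosome 1\n", "ACGTACGT\n", ">Chr2\n", "GGCC\n"]
vcf = ["##fileformat=VCFv4.2\n", "#CHROM\tPOS\tID\tREF\tALT\n", "1\t100\t.\tA\tT\n"]
missing = vcf_chroms(vcf) - fasta_names(fasta)
print("VCF chromosomes absent from the reference:", missing)
```

Any name reported as missing means no SNP on that chromosome can be matched to the reference sequence.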
#### 4. Memory Issues
- Use compressed VCF files
- Reduce `--threads` parameter
- Process smaller regions at a time
#### 5. Slow Performance
- Ensure genome index is built
- Use SSD storage
- Increase `--threads` (but not more than CPU cores)
- Check disk space availability
### Debug Mode
Enable verbose logging for detailed information:
```bash
python kasp_designer.py genome --verbose --vcf variants.vcf.gz ...
```
### Log Files
The tool provides detailed logging information including:
- Processing progress
- SNP filtering statistics
- Primer design success rates
- Specificity validation results
## Citation
If you use KASPioneer in your research, please cite:
```
KASPioneer: A Comprehensive KASP Primer Design Tool
[Your citation information here]
```
## License
[Your license information here]
## Support
For questions, issues, or feature requests, please:
1. Check this README for common solutions
2. Search existing issues on GitHub
3. Create a new issue with detailed information
## Changelog
            
         
**Detailed Results**: Provides comprehensive primer information for each SNP\n\n**Command:**\n```bash\npython kasp_designer.py target --vcf variants.vcf.gz --ref genome.fa --chrom chr1 --position 123456 --mother PR25 --father RIBENQING --output_prefix results_target --bowtie_index genome_index/genome --threads 12\n```\n\n**Required Parameters:**\n- `--vcf`: Input VCF file (can be .gz compressed) - Contains SNP variants\n- `--ref`: Reference genome FASTA file - Used for primer design and specificity validation\n- `--chrom`: Chromosome name - Target chromosome (e.g., \"chr1\", \"1\", \"Chr1\")\n- `--position`: Target SNP position - Exact coordinate of the target SNP (1-based)\n- `--mother`: Mother sample name in VCF - Must be homozygous for different alleles\n- `--father`: Father sample name in VCF - Must be homozygous for different alleles\n- `--output_prefix`: Output file prefix - Results will be saved as `{prefix}_target.tsv`\n\n**Optional Parameters:**\n- `--flank_snps`: Number of flanking SNPs on each side of target SNP (default: 10) - Set to 0 for target SNP only\n\n**Output File:** `{output_prefix}_target.tsv`\n\n**Use Cases:**\n- Validating specific markers\n- Designing primers for known functional SNPs\n- Analyzing SNPs of particular interest\n- Marker-assisted selection (MAS)\n- Genotyping specific loci\n\n### Custom Sequence Analysis\n\n**Function:** `custom`\n\n**Purpose:** Designs primers from custom SNP sequences provided in a text file. This function is perfect for designing primers from known sequences, validating specific sequences, or working with sequences from external sources.\n\n**How it works:**\n1. **Sequence Parsing**: Reads custom SNP sequences from input file\n2. **SNP Detection**: Automatically detects SNP positions and alleles from sequence format\n3. **Primer Design**: Designs KASP primers for each custom sequence\n4. **Batch Processing**: Uses SNP batch-based multiprocessing for efficient analysis\n5. 
**Quality Assessment**: Evaluates primer quality and specificity for each sequence\n\n**Command:**\n```bash\npython kasp_designer.py custom --snp_sequences snp_list.txt --ref genome.fa --output_prefix results_custom --bowtie_index genome_index/genome --threads 12\n```\n\n**Required Parameters:**\n- `--snp_sequences`: Input file with SNP sequences - Text file containing custom sequences\n- `--ref`: Reference genome FASTA file - Used for primer design and specificity validation\n- `--output_prefix`: Output file prefix - Results will be saved as `{prefix}_custom.tsv`\n\n**Input File Format:**\n```\nSNP_ID1,ATCGATCGATCG[A/T]GATCGATCGATCG\nSNP_ID2,GCTAGCTAGCTA[G/C]TAGCTAGCTAGCT\nSNP_ID3,TTTTTTTTTTTTTTTTTTTT[C/G]TTTTTTTTTTTTTTTTTTTT\n```\n\n**Format Rules:**\n- One SNP per line\n- Format: `SNP_ID,SEQUENCE_WITH_SNP`\n- SNP format: `[REF/ALT]` where REF and ALT are the two alleles\n- Sequence should be at least 50bp long for optimal primer design\n- SNP should be positioned away from sequence ends (at least 25bp from each end)\n\n**Output File:** `{output_prefix}_custom.tsv`\n\n**Use Cases:**\n- Designing primers from known sequences\n- Validating specific sequences\n- Working with sequences from external sources\n- Converting existing markers to KASP format\n- Designing primers for specific gene regions\n\n## Parameters\n\n### Common Parameters\n\nAll functions share these common parameters:\n\n| Parameter | Type | Default | Description |\n|-----------|------|---------|-------------|\n| `--min_qual` | float | 30.0 | Minimum SNP quality score |\n| `--verbose` | flag | False | Enable verbose logging |\n| `--min_left_dist` | int | 26 | Filter left SNP closer than this distance (bp) |\n| `--min_right_dist` | int | 80 | Filter right SNP closer than this distance (bp) |\n| `--upstream_len` | int | 30 | Upstream sequence length |\n| `--downstream_len` | int | 100 | Downstream sequence length |\n| `--min_primer_len` | int | 20 | Minimum primer length |\n| `--max_primer_len` | int 
| 26 | Maximum primer length |\n| `--min_gc` | float | 30.0 | Minimum GC content (%) |\n| `--max_gc` | float | 60.0 | Maximum GC content (%) |\n| `--min_tm` | float | 55.0 | Minimum melting temperature (\u00b0C) |\n| `--max_tm` | float | 61.0 | Maximum melting temperature (\u00b0C) |\n| `--max_tm_diff` | float | 3.0 | Maximum Tm difference between primers (\u00b0C) |\n| `--max_product_len` | int | 100 | Maximum PCR product length (bp) |\n| `--fam_tail` | string | GAAGGTGACCAAGTTCATGCT | FAM tail sequence |\n| `--hex_tail` | string | GAAGGTCGGAGTCAACGGATT | HEX tail sequence |\n| `--threads` | int | CPU cores - 1 | Number of parallel threads |\n\n## Output Files\n\n### Output File Structure\n\nAll functions generate TSV (Tab-Separated Values) files with the following columns:\n\n#### Common Columns\n\n| Column | Description |\n|--------|-------------|\n| `chrom` | Chromosome name |\n| `pos` | SNP position |\n| `ref` | Reference allele |\n| `alt` | Alternative allele |\n| `forwardA` | Forward primer for allele A |\n| `forwardB` | Forward primer for allele B |\n| `reverse` | Reverse primer |\n| `forwardA_withtail` | Forward primer A with FAM tail |\n| `forwardB_withtail` | Forward primer B with HEX tail |\n| `fA_tm` | Melting temperature of forward primer A (\u00b0C) |\n| `fB_tm` | Melting temperature of forward primer B (\u00b0C) |\n| `r_tm` | Melting temperature of reverse primer (\u00b0C) |\n| `fA_gc` | GC content of forward primer A (%) |\n| `fB_gc` | GC content of forward primer B (%) |\n| `r_gc` | GC content of reverse primer (%) |\n| `product_len` | PCR product length (bp) |\n| `primer_quality_score` | Overall primer quality score (0-100) |\n| `structure_quality` | Structure quality score (0-1) |\n| `recommended_annealing_temp` | Recommended annealing temperature (\u00b0C) |\n| `specificity` | Specificity validation result |\n\n#### Function-Specific Columns\n\n**Genome Analysis (`genome`):**\n| Column | Description |\n|--------|-------------|\n| `bin_name` | 
Bin name (e.g., \"1_Bin_1\") |\n\n**Region Analysis (`region`):**\n- No additional columns\n\n**Target Analysis (`target`):**\n| Column | Description |\n|--------|-------------|\n| `flank_direction` | SNP type: \"TARGET\", \"FLANKING_LEFT\", or \"FLANKING_RIGHT\" |\n\n**Custom Analysis (`custom`):**\n| Column | Description |\n|--------|-------------|\n| `snp_id` | Custom SNP identifier |\n| `sequence` | Original input sequence |\n| `ref_allele` | Reference allele |\n| `alt_allele` | Alternative allele |\n\n### Quality Metrics\n\n#### Primer Quality Score (0-100)\nThe primer quality score is a comprehensive assessment of primer set quality, calculated using multiple factors:\n\n**Score Ranges:**\n- **90-100**: Excellent quality - Ideal for high-throughput genotyping\n- **80-89**: Very good quality - Suitable for most applications\n- **70-79**: Good quality - Acceptable for most studies\n- **60-69**: Fair quality - May require optimization\n- **50-59**: Poor quality - Not recommended for production use\n- **0-49**: Very poor quality - Significant issues detected\n\n**Scoring Factors (with penalties):**\n\n1. **Tm Difference Penalty** (0-25 points deducted)\n   - Ideal: < 2\u00b0C difference between primers\n   - Penalty: 5 points per 1\u00b0C over 3\u00b0C difference\n   - Formula: `max(abs(fA_tm - r_tm), abs(fB_tm - r_tm))`\n\n2. **GC Content Penalty** (0-15 points deducted)\n   - Ideal range: 40-60% for all primers\n   - Penalty: 5 points per primer outside ideal range\n   - Affects primer stability and specificity\n\n3. **Product Length Penalty** (0-10 points deducted)\n   - Ideal range: 60-120bp\n   - Penalty: 10 points if outside range\n   - Affects PCR efficiency and resolution\n\n4. **Structure Quality Multiplier** (0-1 multiplier)\n   - Multiplies total score by structure quality\n   - Accounts for secondary structure issues\n\n5. 
**Primer Length Penalty** (0-9 points deducted)\n   - Ideal range: 18-25bp for all primers\n   - Penalty: 3 points per primer outside range\n   - Affects primer specificity and efficiency\n\n**Final Score Calculation:**\n```\nScore = (100 - Tm_penalty - GC_penalty - Product_penalty - Length_penalty) \u00d7 Structure_quality\n```\n\n#### Structure Quality Score (0-1)\nThe structure quality score evaluates secondary structure formation in primers:\n\n**Score Ranges:**\n- **1.0**: No secondary structures detected\n- **0.7-0.9**: Minor secondary structures (acceptable)\n- **0.4-0.6**: Moderate secondary structures (may affect performance)\n- **0.0-0.3**: Significant secondary structures (not recommended)\n\n**Structure Types Evaluated:**\n\n1. **Hairpin Structures** (0.3 penalty)\n   - Self-complementary regions within a single primer that fold back on themselves\n   - Can block annealing to the template and reduce amplification efficiency\n   - Detected using primer3 hairpin calculation\n\n2. **Homodimer Structures** (0.2 penalty)\n   - Two copies of the same primer annealing to each other\n   - Can reduce primer availability\n   - Detected using primer3 homodimer calculation\n\n3. 
**Heterodimer Structures** (0.2-0.3 penalty)\n   - Cross-dimerization between primers\n   - ForwardA-Reverse: 0.2 penalty\n   - ForwardB-Reverse: 0.2 penalty\n   - ForwardA-ForwardB: 0.3 penalty\n\n**Calculation:**\n```\nStructure_quality = max(0, 1.0 - hairpin_penalty - homodimer_penalty - heterodimer_penalty)\n```\n\n#### Recommended Annealing Temperature\nThe recommended annealing temperature is calculated to optimize PCR performance:\n\n**Formula:**\n```\nRecommended_Tm = min(fA_tm, fB_tm, r_tm) - 3\u00b0C\n```\n\n**Rationale:**\n- Uses the lowest Tm among all primers as baseline\n- Subtracts 3\u00b0C to ensure all primers can anneal efficiently\n- Provides optimal balance between specificity and efficiency\n- Accounts for primer competition in multiplex reactions\n\n**Temperature Ranges:**\n- **55-65\u00b0C**: Optimal range for most KASP reactions\n- **50-55\u00b0C**: May require optimization for specificity\n- **65-70\u00b0C**: May reduce efficiency for some primers\n- **<50\u00b0C or >70\u00b0C**: Not recommended for KASP reactions\n\n## Examples\n\n### Example 1: Quick Test with Test Data\n\n```bash\n# Build index using test data\npython kasp_designer.py build_index --ref test_data/msu7_chr1_12.fa --bowtie_index genome_index/msu7_chr1_12\n\n# Test genome-wide analysis\npython kasp_designer.py genome \\\n  --vcf test_data/BSA-test.vcf.gz \\\n  --ref test_data/msu7_chr1_12.fa \\\n  --mother PR25 \\\n  --father RIBENQING \\\n  --output_prefix test_genome \\\n  --bowtie_index genome_index/msu7_chr1_12 \\\n  --bins_per_chr 5 \\\n  --flank_snps 3 \\\n  --threads 4\n```\n\n### Example 2: Region-based Analysis with Test Data\n\n```bash\n# Test region-based analysis\npython kasp_designer.py region \\\n  --vcf test_data/BSA-test.vcf.gz \\\n  --ref test_data/msu7_chr1_12.fa \\\n  --chrom 1 \\\n  --start 1000000 \\\n  --end 2000000 \\\n  --mother PR25 \\\n  --father RIBENQING \\\n  --output_prefix test_region \\\n  --bowtie_index genome_index/msu7_chr1_12 \\\n  
--threads 4\n```\n\n### Example 3: Target SNP Analysis with Test Data\n\n```bash\n# Test target SNP analysis\npython kasp_designer.py target \\\n  --vcf test_data/BSA-test.vcf.gz \\\n  --ref test_data/msu7_chr1_12.fa \\\n  --chrom 1 \\\n  --position 1500000 \\\n  --mother PR25 \\\n  --father RIBENQING \\\n  --output_prefix test_target \\\n  --bowtie_index genome_index/msu7_chr1_12 \\\n  --flank_snps 2 \\\n  --threads 4\n```\n\n### Example 4: Custom Sequence Analysis with Test Data\n\n**Test input file (`test_data/test_snp_sequences.txt`):**\n```\nSNP_001,ATCGATCGATCGATCGATCG[A/T]GATCGATCGATCGATCGATCG\nSNP_002,GCTAGCTAGCTAGCTAGCTA[G/C]TAGCTAGCTAGCTAGCTAGCT\nSNP_003,TTTTTTTTTTTTTTTTTTTT[C/G]TTTTTTTTTTTTTTTTTTTT\n```\n\n**Command:**\n```bash\npython kasp_designer.py custom \\\n  --snp_sequences test_data/test_snp_sequences.txt \\\n  --ref test_data/msu7_chr1_12.fa \\\n  --output_prefix test_custom \\\n  --bowtie_index genome_index/msu7_chr1_12 \\\n  --threads 4\n```\n\n### Example 5: Production Analysis with Your Data\n\n```bash\n# Build index for your genome\npython kasp_designer.py build_index --ref your_genome.fa --bowtie_index genome_index/your_genome\n\n# Run genome-wide analysis\npython kasp_designer.py genome \\\n  --vcf your_variants.vcf.gz \\\n  --ref your_genome.fa \\\n  --mother YOUR_MOTHER_SAMPLE \\\n  --father YOUR_FATHER_SAMPLE \\\n  --output_prefix production_genome_analysis \\\n  --bowtie_index genome_index/your_genome \\\n  --bins_per_chr 20 \\\n  --flank_snps 10 \\\n  --threads 16\n```\n\n## Performance Optimization\n\n### Threading Strategy\n\n- **Genome Analysis**: Uses chromosome-based parallel processing\n- **Other Analyses**: Uses batch-based parallel processing\n- **Recommended**: Use `--threads` equal to your CPU cores minus 1\n\n### Memory Usage\n\n- **Large Genomes**: Consider using compressed VCF files (.vcf.gz)\n- **Memory Optimization**: The tool processes chromosomes sequentially to minimize memory usage\n\n### Speed Tips\n\n1. 
**Use compressed VCF files** (.vcf.gz) for faster I/O\n2. **Pre-build genome index** and reuse for multiple analyses\n3. **Adjust batch sizes** based on your system's memory\n4. **Use SSD storage** for faster file I/O operations\n\n## Troubleshooting\n\n### Common Issues\n\n#### 1. \"Genome index not found\" Error\n```bash\n# Solution: Build the genome index first\npython kasp_designer.py build_index --ref genome.fa --bowtie_index genome_index/genome\n```\n\n#### 2. \"Sample not found in VCF\" Error\n```bash\n# Check sample names in VCF file\nbcftools query -l variants.vcf.gz\n\n# Use exact sample names (case-sensitive)\npython kasp_designer.py genome --mother PR25 --father RIBENQING ...\n```\n\n#### 3. \"No SNPs found\" Warning\n- Check whether the quality threshold (`--min_qual`) is filtering out too many SNPs\n- Verify chromosome names match between VCF and reference\n- Confirm that both parents are homozygous at the SNP and carry different alleles\n\n#### 4. Memory Issues\n- Use compressed VCF files\n- Reduce `--threads` parameter\n- Process smaller regions at a time\n\n#### 5. Slow Performance\n- Ensure genome index is built\n- Use SSD storage\n- Increase `--threads` (up to the number of CPU cores)\n- Check disk space availability\n\n### Debug Mode\n\nEnable verbose logging for detailed information:\n\n```bash\npython kasp_designer.py genome --verbose --vcf variants.vcf.gz ...\n```\n\n### Log Files\n\nThe tool provides detailed logging information including:\n- Processing progress\n- SNP filtering statistics\n- Primer design success rates\n- Specificity validation results\n\n## Citation\n\nIf you use KASPioneer in your research, please cite:\n\n```\nKASPioneer: A Comprehensive KASP Primer Design Tool\n[Your citation information here]\n```\n\n## License\n\nKASPioneer is released under the MIT License.\n\n## Support\n\nFor questions, issues, or feature requests, please:\n1. Check this README for common solutions\n2. Search existing issues on GitHub\n3. Create a new issue with detailed information\n\n## Changelog\n",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "A comprehensive KASP primer design tool for SNP analysis and primer design",
    "version": "1.0.0",
    "project_urls": null,
    "split_keywords": [
        "kasp",
        " primer design",
        " snp analysis",
        " bioinformatics",
        " pcr"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "388a9e35d4eb8335afbd8cbb975c972b5e5cc0701ad0b39621638a7223c8de3b",
                "md5": "0a581e423472421cfeb4e0ec21d987ce",
                "sha256": "a3998a8ed5dc2016b0158ff1bb54d0e760da025e2eeea6305a9d2f7cefdaf840"
            },
            "downloads": -1,
            "filename": "kaspioneer-1.0.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "0a581e423472421cfeb4e0ec21d987ce",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.7",
            "size": 554127,
            "upload_time": "2025-10-26T05:07:54",
            "upload_time_iso_8601": "2025-10-26T05:07:54.053300Z",
            "url": "https://files.pythonhosted.org/packages/38/8a/9e35d4eb8335afbd8cbb975c972b5e5cc0701ad0b39621638a7223c8de3b/kaspioneer-1.0.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "36cd5dce3f5bcc0fc54878817317c30be40aa5dcfeb5d12c36eaaef35e97e74e",
                "md5": "20872d3d60638aff9e264f173859073b",
                "sha256": "ed4b7795b80f4a6d46bd2c6db5802b43b3b53ba8f4572c9fa2387579ce6d2d8c"
            },
            "downloads": -1,
            "filename": "kaspioneer-1.0.0.tar.gz",
            "has_sig": false,
            "md5_digest": "20872d3d60638aff9e264f173859073b",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7",
            "size": 555217,
            "upload_time": "2025-10-26T05:07:56",
            "upload_time_iso_8601": "2025-10-26T05:07:56.162614Z",
            "url": "https://files.pythonhosted.org/packages/36/cd/5dce3f5bcc0fc54878817317c30be40aa5dcfeb5d12c36eaaef35e97e74e/kaspioneer-1.0.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-10-26 05:07:56",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "kaspioneer"
}