flync 1.0.2 (PyPI package metadata)

- Summary: FLYNC: lncRNA discovery pipeline for Drosophila melanogaster
- Uploaded: 2025-11-05 20:22:39
- Author: FLYNC Contributors
- Requires Python: >=3.11
- License: MIT
- Keywords: bioinformatics, lncRNA, genomics, machine-learning, RNA-seq

![FLYNC logo](logo.jpeg)

# FLYNC - FLY Non-Coding gene discovery & classification

## TL;DR (Quick Start)

Install (conda recommended):
```bash
conda create -n flync -c bioconda -c conda-forge flync
conda activate flync
```

Setup genome (dm6):
```bash
flync setup --genome-dir genome
```

Generate config template and edit paths:
```bash
flync config --template --output config.yaml
# Set: samples: null and fastq_dir: "/path/to/fastq"
```

Run bioinformatics (auto-detect FASTQs):
```bash
flync run-bio -c config.yaml -j 8
```

Predict lncRNAs (novel transcripts):
```bash
flync run-ml -g results/assemblies/merged-new-transcripts.gtf \
  -o results/lncrna_predictions.csv -r genome/genome.fa -t 8
```

All-in-one:
```bash
flync run-all -c config.yaml -j 8
```

Essential outputs:
- results/assemblies/merged-new-transcripts.gtf (novel)
- results/lncrna_predictions.csv (predictions)
- results/dge/... (if metadata CSV with condition provided)

Need help? Run:
```bash
flync run-bio -c config.yaml --dry-run
```

## Minimal Conceptual Overview

1. Input: FASTQs (local) or SRA IDs (via metadata CSV).
2. Snakemake workflow builds transcriptome and isolates novel transcripts.
3. Feature engine converts GTF + genome into a model-ready feature matrix.
4. Pre-trained model classifies lncRNA vs protein-coding; outputs probabilities.
5. Optional differential expression if conditions provided.

## When to Use Which Command

- run-bio: You only need assemblies.
- run-ml: You have a GTF and want predictions.
- run-all: End-to-end (recommended for new users).
- setup: Prepare genome and indices once.
- config: Generate or validate config.yaml.

## Common Pitfalls (Fast Answers)

| Issue | Fix |
|-------|-----|
| samples: null fails | Ensure fastq_dir is set |
| Snakefile not found | pip install -e . |
| Missing genome index | Re-run flync setup |
| All predictions identical | Check feature extraction logs |
| DGE missing | Ensure metadata CSV has header + condition column |

Full detailed documentation continues below.

---

## Table of Contents

- [TL;DR (Quick Start)](#tldr-quick-start)
- [Minimal Conceptual Overview](#minimal-conceptual-overview)
- [When to Use Which Command](#when-to-use-which-command)
- [Common Pitfalls (Fast Answers)](#common-pitfalls-fast-answers)
- [Pipeline Overview](#pipeline-overview)
- [Key Features](#key-features)
- [Installation](#installation)
- [Quick Start](#quick-start)
- [Usage Guide](#usage-guide)
  - [1. Setup Reference Genome](#1-setup-reference-genome)
  - [2. Configure Pipeline](#2-configure-pipeline)
  - [Sample Specification (3 Modes)](#sample-specification-3-modes)
  - [3. Run Bioinformatics Pipeline](#3-run-bioinformatics-pipeline)
  - [4. Run ML Prediction](#4-run-ml-prediction)
  - [5. Run Complete Pipeline (Recommended)](#5-run-complete-pipeline-recommended)
  - [6. Differential Gene Expression (DGE)](#6-differential-gene-expression-dge)
  - [7. Python API Usage](#7-python-api-usage)
- [Pipeline Architecture](#pipeline-architecture)
- [Advanced Usage](#advanced-usage)
- [Troubleshooting](#troubleshooting)
- [Contributing](#contributing)
- [License](#license)

---

## Pipeline Overview

FLYNC executes a complete lncRNA discovery workflow with three main execution modes:

### Phase 1: Bioinformatics Pipeline (`flync run-bio`)
Run only the RNA-seq processing and assembly:
1. **Read Mapping** - Align RNA-seq reads to reference genome using HISAT2
2. **Transcriptome Assembly** - Reconstruct transcripts per sample with StringTie
3. **Assembly Merging** - Create unified transcriptome with gffcompare
4. **Novel Transcript Extraction** - Identify transcripts not in reference annotation
5. **Quantification** - Calculate expression levels per transcript
6. **DGE Analysis** (optional) - Differential expression with Ballgown when metadata.csv provided

### Phase 2: ML Prediction (`flync run-ml`)
Run only the machine learning classification:
1. **Feature Extraction** - Extract multi-modal genomic features
2. **Feature Cleaning** - Standardize and prepare features for ML
3. **ML Classification** - Predict lncRNA candidates using trained EBM model
4. **Confidence Scoring** - Provide prediction probabilities and confidence scores

### Complete Pipeline (`flync run-all`)
Run the entire workflow end-to-end with a single command.

---

## Key Features

**Complete End-to-End Pipeline** - Single `flync run-all` command for full workflow  
**Unified Environment** - All dependencies managed in single `environment.yml`  
**Differential Expression** - Integrated Ballgown DGE analysis for condition comparisons  
**Public Python API** - Use FLYNC programmatically in custom workflows  
**Flexible Input Modes** - Auto-detect samples from FASTQ directory or use sample lists  
**Snakemake Orchestration** - Robust workflow management with automatic parallelization  
**Comprehensive Features** - 100+ genomic features from multiple data sources  
**Intelligent Caching** - Downloads and caches remote genomic tracks automatically  
**Production-Ready Models** - Pre-trained EBM classifier with high accuracy  
**Multi-Stage Docker** - Runtime and pre-warmed images for flexible deployment  
**Python 3.11** - Modern Python codebase with type hints and comprehensive documentation  

---

## Installation

### Overview

Choose an installation method based on what you need:

| Use Case | Recommended Method | Command |
|----------|-------------------|---------|
| Full pipeline (alignment + assembly + ML) | Conda (base) | `conda create -n flync -c bioconda -c conda-forge flync` |
| Add differential expression (Ballgown) | Conda add-on | `conda install -n flync flync-dge` |
| ML / feature extraction only (no aligners) | pip + extras | `pip install flync[features,ml]` |
| Programmatic Snakemake orchestration (no bio tools) | pip minimal + workflow | `pip install flync[workflow]` |
| Reproducible container execution | Docker (runtime) | `docker pull ghcr.io/homemlab/flync:latest` |
| Faster startup with pre-cached tracks | Docker (prewarmed) | `docker pull ghcr.io/homemlab/flync:latest-prewarmed` |

### Option 1: Conda (Recommended – Full Stack)

```bash
conda create -n flync -c bioconda -c conda-forge flync
conda activate flync
flync --help
```

Add DGE support (Ballgown + R stack):
```bash
conda install -n flync flync-dge  # after base install
```

Or install both at once:
```bash
conda create -n flync -c bioconda -c conda-forge flync flync-dge
```

### Option 2: pip (Python-Only / Lightweight)

Pip will NOT install external bioinformatics binaries (HISAT2, StringTie, samtools, etc.). Use this only for feature extraction or ML inference on an existing GTF.

```bash
python -m venv flync-venv
source flync-venv/bin/activate
pip install --upgrade pip

# Feature extraction + ML
pip install "flync[features,ml]"

# Add Snakemake lightweight orchestration (still no external binaries)
pip install "flync[workflow]"

flync run-ml --help
```

If you attempt `flync run-bio` without the required external tools, FLYNC will explain what is missing and how to install via conda.

### Option 3: Docker

Runtime image (downloads tracks on demand):
```bash
docker pull ghcr.io/homemlab/flync:latest
docker run --rm -v $PWD:/work ghcr.io/homemlab/flync:latest \
  flync --help
```

Prewarmed image (tracks pre-cached):
```bash
docker pull ghcr.io/homemlab/flync:latest-prewarmed
```

### Which Should I Pick?

| Scenario | Choose |
|----------|-------|
| New user, want everything | Conda base (add `flync-dge` if doing DGE) |
| HPC / cluster with module rules | Conda (export env YAML for reproducibility) |
| Notebook exploratory ML only | pip extras (`features,ml`) |
| CI / workflow integration | Docker runtime image |
| Need fastest repeated ML runs | Docker prewarmed image |

### External Tool Summary

The following are ONLY installed automatically via the Conda packages (`flync`, `flync-dge`):
```
hisat2, stringtie, gffcompare, gffread, samtools, bedtools, sra-tools,
R (r-base), bioconductor-ballgown, r-matrixstats, r-ggplot2
```
Pip installations will perform a dependency sanity check and abort `run-bio` if these are missing (unless `--skip-deps-check` is used).


### Prerequisites

- **Operating System**: Linux (tested on Debian/Ubuntu)
- **Conda/Mamba**: Required for managing dependencies
- **System Requirements**:
  - 8+ GB RAM (16+ GB recommended for large datasets)
  - 20+ GB disk space (genome, indices, and tracks)
  - 4+ CPU cores (8+ recommended)

### Install from Source (Full + Editable)

For development or if you need the latest unreleased features:

```bash
# 1. Clone the repository
git clone https://github.com/homemlab/flync.git
cd flync
git checkout master  # Use the master branch (production)

# 2. Create conda environment with dependencies
conda env create -f environment.yml

# 3. Activate environment
conda activate flync

# 4. Install package in development mode
pip install -e .

# 5. Verify installation
flync --help
```

Docker image details are covered above under Installation.

---

## Quick Start

**Complete workflow with `run-all` command:**

```bash
# 1. Activate conda environment
conda activate flync

# 2. Download genome and build indices
flync setup --genome-dir genome

# 3. Create configuration file
flync config --template --output config.yaml

# 4. Edit config.yaml with your paths and settings
# See config_example_full.yaml for all available options

# 5. Create metadata.csv with sample information (MUST have header row!)
cat > metadata.csv << EOF
sample_id,condition,replicate
SRR123456,control,1
SRR123457,control,2
SRR123458,treatment,1
SRR123459,treatment,2
EOF

# 6. Update config.yaml to use metadata.csv
# Change: samples: null
# To:     samples: metadata.csv

# 7. Run complete pipeline (bioinformatics + ML + DGE)
flync run-all --configfile config.yaml --cores 8
```

**Alternative: Step-by-step workflow:**

```bash
# Run bioinformatics pipeline only
flync run-bio --configfile config.yaml --cores 8

# Then run ML prediction
flync run-ml \
  --gtf results/assemblies/merged-new-transcripts.gtf \
  --output results/lncrna_predictions.csv \
  --ref-genome genome/genome.fa \
  --threads 8
```

**Python API Usage:**

```python
from flync import run_pipeline
from pathlib import Path

# Run complete pipeline programmatically
result = run_pipeline(
    config_path=Path("config.yaml"),
    cores=8,
    ml_threads=8
)

print(f"Status: {result['status']}")
print(f"Predictions: {result['predictions_file']}")
```

**Output:**
- `results/assemblies/merged.gtf` - Full transcriptome (reference + novel)
- `results/assemblies/merged-new-transcripts.gtf` - Novel transcripts only
- `results/cov/` - Per-sample quantification files
- `results/dge/` - Differential expression analysis (if metadata.csv provided)
  - `transcript_dge_results.csv` - Transcript-level DE results
  - `gene_dge_results.csv` - Gene-level DE results
  - `dge_summary.csv` - Summary statistics
- `results/lncrna_predictions.csv` - lncRNA predictions with confidence scores

---

## Usage Guide

### 1. Setup Reference Genome

Download *Drosophila melanogaster* genome (BDGP6.32/dm6) and build HISAT2 index:

```bash
flync setup --genome-dir genome
```

**What this does:**
- Downloads genome FASTA from Ensembl (release 106)
- Downloads gene annotation GTF
- Builds HISAT2 index (~10 minutes, requires ~4GB RAM)
- Extracts splice sites for splice-aware alignment

**Skip download if files exist:**
```bash
flync setup --genome-dir genome --skip-download
```

### 2. Configure Pipeline

Generate a configuration template:

```bash
flync config --template --output config.yaml
```

**Edit `config.yaml`** with your settings:

```yaml
# Sample specification (3 options - see below)
samples: null                           # Auto-detect from fastq_dir
fastq_dir: "/path/to/fastq/files"      # Directory with FASTQ files
fastq_paired: false                    # true for paired-end, false for single-end

# Reference files (created by 'flync setup')
genome: "genome/genome.fa"
annotation: "genome/genome.gtf"
hisat_index: "genome/genome.idx"
splice_sites: "genome/genome.ss"

# Output and resources
output_dir: "results"
threads: 8

# Tool parameters (optional)
params:
  hisat2: "-p 8 --dta --dta-cufflinks"
  stringtie_assemble: "-p 8"
  stringtie_merge: ""
  stringtie_quantify: "-eB"
  download_threads: 4  # For SRA downloads
```

#### Sample Specification (3 Modes)

**Mode 1: Auto-detect from FASTQ directory (Recommended)**
```yaml
samples: null  # Must be null to enable auto-detection
fastq_dir: "/path/to/fastq"
fastq_paired: false
```

Automatically detects samples from filenames:
- **Paired-end**: `sample1_1.fastq.gz` + `sample1_2.fastq.gz` → detects `sample1`
- **Single-end**: `sample1.fastq.gz` → detects `sample1`
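
The detection logic amounts to stripping the `.fastq.gz` suffix and, for paired-end data, a trailing `_1`/`_2` read suffix. A minimal sketch (a simplified illustration with a hypothetical `detect_samples` helper, not FLYNC's actual implementation):

```python
import re
from pathlib import Path

def detect_samples(fastq_dir: str, paired: bool) -> list[str]:
    """Infer sample names from FASTQ filenames (simplified sketch)."""
    names = set()
    for f in Path(fastq_dir).glob("*.fastq.gz"):
        stem = f.name.removesuffix(".fastq.gz")
        if paired:
            # sample1_1.fastq.gz / sample1_2.fastq.gz -> sample1
            m = re.fullmatch(r"(.+)_[12]", stem)
            if m:
                names.add(m.group(1))
        else:
            # sample1.fastq.gz -> sample1
            names.add(stem)
    return sorted(names)
```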

**Mode 2: Plain text list**
```yaml
samples: "samples.txt"
fastq_dir: "/path/to/fastq"  # Optional if using SRA
```

`samples.txt`:
```
sample1
sample2
sample3
```

**Mode 3: CSV with metadata (for differential expression)**
```yaml
samples: "metadata.csv"
fastq_dir: "/path/to/fastq"  # Optional if using SRA
```

`metadata.csv`:
```csv
sample_id,condition,replicate
sample1,control,1
sample2,control,2
sample3,treated,1
```

**⚠️ Important:** When using CSV metadata, the header row with column names (`sample_id`, `condition`) is **required**. Without headers, the DGE analysis will fail. The `sample_id` column is mandatory for sample identification, and the `condition` column is required to enable differential expression analysis.
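
A quick pre-flight check along these lines can catch a malformed metadata CSV before a long run (a sketch; `validate_metadata` is a hypothetical helper, not part of the FLYNC API):

```python
import csv

REQUIRED = {"sample_id", "condition"}

def validate_metadata(path: str) -> list[str]:
    """Return sample IDs if the CSV has the required header columns (sketch)."""
    with open(path, newline="") as fh:
        reader = csv.DictReader(fh)
        # DictReader treats the first row as the header; a headerless CSV
        # therefore fails this check, mirroring the DGE requirement above.
        missing = REQUIRED - set(reader.fieldnames or [])
        if missing:
            raise ValueError(f"metadata CSV missing columns: {sorted(missing)}")
        return [row["sample_id"] for row in reader]
```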

### 3. Run Bioinformatics Pipeline

Execute the complete RNA-seq workflow:

```bash
flync run-bio --configfile config.yaml --cores 8
```

**What happens:**
1. **Read Mapping**: HISAT2 aligns reads to genome (splice-aware)
2. **Assembly**: StringTie reconstructs transcripts per sample
3. **Merging**: Combines assemblies into unified transcriptome
4. **Comparison**: gffcompare identifies novel vs. known transcripts
5. **Quantification**: StringTie calculates TPM and FPKM values

**Input Modes:**

**A. Local FASTQ files** (set `fastq_dir` in config)
```bash
flync run-bio --configfile config.yaml --cores 8
```

**B. SRA accessions** (omit `fastq_dir`, provide SRA IDs in samples)
```csv
# samples.csv
sample_id,condition,replicate
SRR1234567,control,1
SRR1234568,treated,1
```

SRA files are automatically downloaded using `prefetch` + `fasterq-dump`.

**Useful Options:**
```bash
# Dry run - show what would be executed
flync run-bio -c config.yaml --dry-run

# Unlock after crash
flync run-bio -c config.yaml --unlock

# More cores for faster processing
flync run-bio -c config.yaml --cores 16
```

**Output Structure:**
```
results/
├── data/                           # Alignment files
│   └── {sample}/
│       └── {sample}.sorted.bam
├── assemblies/
│   ├── stringtie/                  # Per-sample assemblies
│   │   └── {sample}.rna.gtf
│   ├── merged.gtf                  # Unified transcriptome
│   ├── merged-new-transcripts.gtf  # Novel transcripts only
│   └── assembled-new-transcripts.fa # Novel transcript sequences
├── gffcompare/
│   └── gffcmp.stats               # Assembly comparison stats
├── cov/                           # Expression quantification
│   └── {sample}/
│       └── {sample}.rna.gtf
└── logs/                          # Per-rule log files
```

### 4. Run ML Prediction

Classify novel transcripts as lncRNA or protein-coding:

```bash
flync run-ml \
  --gtf results/assemblies/merged-new-transcripts.gtf \
  --output results/lncrna_predictions.csv \
  --ref-genome genome/genome.fa \
  --threads 8
```

**Required Arguments:**
- `--gtf`, `-g`: Input GTF file (novel transcripts or full assembly)
- `--output`, `-o`: Output CSV file for predictions
- `--ref-genome`, `-r`: Reference genome FASTA file

**Optional Arguments:**
- `--model`, `-m`: Custom trained model (default: bundled EBM model)
- `--bwq-config`: Custom BigWig track configuration
- `--threads`, `-t`: Number of threads (default: 8)
- `--cache-dir`: Cache directory for downloaded tracks (default: `./bwq_tracks`)
- `--clear-cache`: Clear cache before starting

**What happens:**
1. **Sequence Extraction**: Extracts spliced transcript sequences from GTF
2. **K-mer Profiling**: Calculates 3-12mer frequencies with TF-IDF + SVD
3. **BigWig Query**: Queries 50+ genomic tracks (chromatin, conservation, etc.)
4. **Structure Prediction**: Calculates RNA minimum free energy
5. **Feature Cleaning**: Standardizes features and aligns with model schema
6. **ML Prediction**: Classifies using pre-trained EBM model
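
Step 2 can be illustrated with a stdlib-only sketch: count overlapping k-mers per transcript, then weight the counts by TF-IDF (simplified to k=3 and omitting the SVD dimensionality reduction that the real pipeline applies):

```python
import math
from collections import Counter

def kmer_counts(seq: str, k: int = 3) -> Counter:
    """Count overlapping k-mers in a transcript sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def tfidf(docs: list[Counter]) -> list[dict]:
    """Plain TF-IDF over k-mer count vectors (simplified: no SVD step)."""
    n = len(docs)
    df = Counter()  # document frequency: how many transcripts contain each k-mer
    for d in docs:
        df.update(d.keys())
    vectors = []
    for d in docs:
        total = sum(d.values())
        vectors.append({km: (c / total) * math.log(n / df[km])
                        for km, c in d.items()})
    return vectors
```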

**Output Format (`lncrna_predictions.csv`):**
```csv
transcript_id,prediction,confidence,probability_lncrna
MSTRG.1.1,1,0.95,0.95
MSTRG.1.2,0,0.87,0.13
MSTRG.2.1,1,0.89,0.89
```

**Column Descriptions:**
- `transcript_id`: Transcript identifier from GTF
- `prediction`: 1 = lncRNA, 0 = protein-coding
- `confidence`: Model confidence score (0-1)
- `probability_lncrna`: Probability of being lncRNA (0-1)

**Filter high-confidence lncRNAs:**
```bash
# Get lncRNAs with >90% confidence
awk -F',' '$3 > 0.90 && $2 == 1' results/lncrna_predictions.csv > high_conf_lncrnas.csv
```
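
The same filter in Python, convenient when feeding predictions into downstream analysis (column names as in the CSV above; `high_confidence_lncrnas` is a hypothetical helper):

```python
import csv

def high_confidence_lncrnas(path: str, min_conf: float = 0.90) -> list[str]:
    """Return transcript IDs predicted as lncRNA above a confidence cutoff."""
    with open(path, newline="") as fh:
        return [row["transcript_id"] for row in csv.DictReader(fh)
                if row["prediction"] == "1" and float(row["confidence"]) > min_conf]
```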

### 5. Run Complete Pipeline (Recommended)

Execute both bioinformatics and ML prediction with a single command:

```bash
flync run-all --configfile config.yaml --cores 8
```

**Unified Configuration:**

```yaml
# Bioinformatics settings
samples: metadata.csv
genome: genome/genome.fa
annotation: genome/genome.gtf
hisat_index: genome/genome.idx
output_dir: results
threads: 8

# ML settings (required for run-all)
ml_reference_genome: genome/genome.fa
ml_output_file: results/lncrna_predictions.csv
ml_bwq_config: config/bwq_config.yaml  # Optional
ml_cache_dir: /path/to/cache          # Optional
```

**What happens:**
1. Runs bioinformatics pipeline (`flync run-bio`)
2. Automatically detects output GTF (`results/assemblies/merged-new-transcripts.gtf`)
3. Runs ML prediction on novel transcripts
4. Generates DGE analysis if `metadata.csv` has condition column

**Options:**
```bash
# Skip bioinformatics (use existing GTF)
flync run-all -c config.yaml --skip-bio

# Skip ML prediction (only run bioinformatics)
flync run-all -c config.yaml --skip-ml

# Dry run to see what would be executed
flync run-all -c config.yaml --dry-run

# Custom thread allocation
flync run-all -c config.yaml --cores 16 --ml-threads 8
```

### 6. Differential Gene Expression (DGE)

Run DGE analysis using Ballgown when metadata with conditions is provided:

**Requirements:**
- `samples` config key points to a CSV file (not TXT)
- CSV **must have a header row** with column names
- CSV **must contain** `sample_id` column (for sample identification)
- CSV **must contain** `condition` column (for grouping samples in DGE)

**Example metadata.csv:**
```csv
sample_id,condition,replicate
SRR123456,control,1
SRR123457,control,2
SRR123458,treatment,1
SRR123459,treatment,2
```

**⚠️ Critical:** The header row is **not optional**. If you omit it or have a headerless CSV, the DGE analysis will fail with an error about missing the `sample_id` column.

**DGE runs automatically** when using `flync run-bio` or `flync run-all` with metadata CSV.

**Output Files:**
```
results/dge/
├── transcript_dge_results.csv  # Transcript-level differential expression
├── gene_dge_results.csv        # Gene-level differential expression
├── dge_summary.csv             # Analysis summary statistics
├── transcript_ma_plot.png      # MA plot visualization
└── ballgown_dge.log           # Analysis log
```

**DGE Results Format:**
```csv
id,pval,qval,fc,gene_name,gene_id
MSTRG.1.1,0.001,0.01,2.5,gene_A,FBgn0001
MSTRG.1.2,0.05,0.12,1.8,gene_B,FBgn0002
```

**Filter significant transcripts:**
```bash
# Get transcripts with FDR < 0.05
awk -F',' '$3 < 0.05' results/dge/transcript_dge_results.csv > significant_de.csv
```

### 7. Python API Usage

Use FLYNC programmatically in custom workflows:

```python
from flync import run_pipeline, run_bioinformatics, run_ml_prediction
from pathlib import Path

# Run complete pipeline
result = run_pipeline(
    config_path=Path("config.yaml"),
    cores=8,
    ml_threads=8,
    verbose=True
)

if result['status'] == 'success':
    print("✓ Pipeline completed!")
    print(f"  Predictions: {result['predictions_file']}")
    print(f"  Output directory: {result['output_dir']}")
```

**Run only bioinformatics:**
```python
from flync import run_bioinformatics

result = run_bioinformatics(
    config_path=Path("config.yaml"),
    cores=16,
    verbose=True
)
```

**Run only ML prediction:**
```python
from flync import run_ml_prediction

result = run_ml_prediction(
    gtf_file=Path("merged.gtf"),
    output_file=Path("predictions.csv"),
    ref_genome=Path("genome.fa"),
    threads=8,
    verbose=True
)

print(f"Predicted {result['n_lncrna']} lncRNAs")
```

**Integration in larger workflows:**
```python
import flync

# Part of a larger analysis pipeline
def analyze_rnaseq_data(sample_dir, output_dir):
    # Run FLYNC
    result = flync.run_pipeline(
        config_path=create_config(sample_dir, output_dir),
        cores=8
    )
    
    # Continue with downstream analyses
    if result['status'] == 'success':
        lncrnas = pd.read_csv(result['predictions_file'])
        perform_enrichment_analysis(lncrnas)
        generate_report(lncrnas, result['output_dir'])
```

---

## Pipeline Architecture

FLYNC follows a modular Python-first architecture with unified CLI:

```
┌─────────────────────────────────────────────────────────────┐
│                   CLI Layer (click)                         │
│  flync run-all | run-bio | run-ml | setup | config          │
│  + Public Python API (flync.run_pipeline)                   │
└──────────────┬────────────────────────┬─────────────────────┘
               │                        │
     ┌─────────▼────────┐    ┌─────────▼──────────┐
     │  Bioinformatics  │    │   ML Prediction    │
     │    (Snakemake)   │    │    (Python)        │
     └─────────┬────────┘    └─────────┬──────────┘
               │                       │
     ┌─────────▼────────┐    ┌─────────▼──────────┐
     │  Workflow Rules  │    │ Feature Extraction │
     │  - mapping.smk   │    │  - feature_wrapper │
     │  - assembly.smk  │    │  - bwq, kmer, mfe  │
     │  - merge.smk     │    │  - cleaning        │
     │  - quantify.smk  │    │                    │
     │  - dge.smk       │    │                    │
     └──────────────────┘    └─────────┬──────────┘
                                       │
                            ┌──────────▼──────────┐
                            │   ML Predictor      │
                            │  - EBM model        │
                            │  - Schema validator │
                            └─────────────────────┘
```

### Core Components

**1. CLI (`src/flync/cli.py`) & API (`src/flync/api.py`)**
- Single unified command with 5 subcommands: `run-all`, `run-bio`, `run-ml`, `setup`, `config`
- New `run-all` orchestrates complete pipeline end-to-end
- Public Python API for programmatic access
- Custom error handling and helpful messages
- Absolute path resolution for file operations

**2. Workflows (`src/flync/workflows/`)**
- **Snakefile**: Main workflow orchestrator with conditional DGE
- **rules/mapping.smk**: HISAT2 alignment, SRA download, FASTQ symlinking
- **rules/assembly.smk**: StringTie per-sample assembly
- **rules/merge.smk**: StringTie merge + gffcompare
- **rules/quantify.smk**: Expression quantification
- **rules/dge.smk**: Ballgown differential expression
- **scripts/ballgown_dge.R**: R script for Ballgown DGE analysis
- **scripts/predownload_tracks.py**: Docker image track pre-caching

**3. Feature Extraction (`src/flync/features/`)**
- **feature_wrapper.py**: High-level orchestration
- **bwq.py**: BigWig/BigBed track querying
- **kmer.py**: K-mer profiling with TF-IDF and SVD
- **mfe.py**: RNA secondary structure (MFE calculation)
- **feature_cleaning.py**: Data preparation and schema alignment

**4. ML Prediction (`src/flync/ml/`)**
- **predictor.py**: Main prediction interface
- **ebm_predictor.py**: EBM model wrapper
- **schema_validator.py**: Feature schema validation

**5. Utilities (`src/flync/utils/`)**
- **kmer_redux.py**: K-mer transformation utilities
- **progress.py**: Progress bar management

**6. Assets (`src/flync/assets/`)**
- Pre-trained EBM models and scalers
- Model schema definitions

**7. Configuration (`src/flync/config/`)**
- **bwq_config.yaml**: Default BigWig track configuration

---

## Advanced Usage

### Custom BigWig Track Configuration

Create a custom `bwq_config.yaml` to query your own tracks:

```yaml
# List of BigWig/BigBed files to query
- path: /path/to/custom_track.bigWig
  upstream: 1000    # Extend region upstream
  downstream: 1000  # Extend region downstream
  stats:
    - stat: mean
      name: custom_mean
    - stat: max
      name: custom_max
    - stat: coverage
      name: custom_coverage

- path: https://example.com/remote_track.bigBed
  stats:
    - stat: coverage
      name: remote_coverage
    - stat: extract_names
      name: remote_names
      name_field_index: 3  # For BigBed name extraction
```

**Available Statistics:**
- `mean`, `max`, `min`, `sum`: Numerical summaries
- `std`: Standard deviation
- `coverage`: Fraction of region covered by signal
- `extract_names`: Extract names from BigBed entries
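
The semantics of the numerical statistics can be sketched on a toy per-base signal, where `None` marks bases with no data (an illustration only, not the actual bwq implementation, which queries BigWig files):

```python
def region_stats(signal: list) -> dict:
    """Summarize a per-base signal over a region; None = no data (sketch)."""
    covered = [v for v in signal if v is not None]
    return {
        "mean": sum(covered) / len(covered) if covered else 0.0,  # over covered bases
        "max": max(covered) if covered else 0.0,
        "coverage": len(covered) / len(signal),  # fraction of bases with signal
    }
```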

Use with ML prediction:
```bash
flync run-ml --gtf input.gtf --output predictions.csv \
  --ref-genome genome.fa --bwq-config custom_bwq_config.yaml
```

### Feature Extraction Only

Extract features without running prediction:

```bash
python src/flync/features/feature_wrapper.py all \
  --gtf annotations.gtf \
  --ref-genome genome.fa \
  --bwq-config config/bwq_config.yaml \
  --k-min 3 --k-max 12 \
  --use-tfidf --use-dim-redux --redux-n-components 1 \
  --output features.parquet
```

### Training Custom Models

**1. Prepare training data:**
```bash
# Split positive and negative samples
python src/flync/optimizer/prepare_data.py \
  --positive-file lncrna_features.parquet \
  --negative-file protein_coding_features.parquet \
  --output-dir datasets/ \
  --train-size 0.7 --val-size 0.15 --test-size 0.15
```

**2. Optimize hyperparameters:**
```bash
python src/flync/optimizer/hyperparameter_optimizer.py \
  --train-data datasets/train.parquet \
  --test-data datasets/test.parquet \
  --holdout-data datasets/holdout.parquet \
  --model-type randomforest \
  --optimization-metrics precision f1 \
  --n-trials 100 \
  --experiment-name "Custom_RF_Model"
```

**3. View results in MLflow UI:**
```bash
mlflow ui --backend-store-uri sqlite:///mlflow.db
# Open http://localhost:5000
```

**4. Extract model schema for inference:**
```bash
python src/flync/ml/schema_extractor.py \
  --model-path best_model.pkl \
  --training-data datasets/train.parquet \
  --output-schema model_schema.json
```

### Docker Deployment

**Build custom image:**
```bash
docker build -t my-flync:latest -f Dockerfile .
```

**Run with mounted volumes:**
```bash
docker run --rm \
  -v $PWD/data:/data \
  -v $PWD/genome:/genome \
  -v $PWD/results:/results \
  my-flync:latest \
  flync run-bio -c /data/config.yaml --cores 8
```

**Interactive shell:**
```bash
docker run -it --rm -v $PWD:/work my-flync:latest /bin/bash
```

---

## Troubleshooting

### Installation Issues

**Problem**: `command not found: flync`
```bash
# Solution: Activate conda environment
conda activate flync

# Verify installation
which flync
flync --version
```

**Problem**: `Snakefile not found` when running `flync run-bio`
```bash
# Solution: Reinstall package in editable mode
pip install -e .
```

**Problem**: Missing bioinformatics tools (hisat2, stringtie, etc.)
```bash
# Solution: Recreate conda environment
conda env remove -n flync
conda env create -f environment.yml
conda activate flync
```

### Pipeline Execution Issues

**Problem**: HISAT2 index build fails
```bash
# Check available disk space (needs ~10GB)
df -h

# Check available memory (needs ~4GB)
free -h

# Check logs
cat genome/idx.err.txt
```

**Problem**: SRA download hangs or fails
```bash
# Solution 1: Reduce download threads in config.yaml
params:
  download_threads: 2  # Instead of 4

# Solution 2: Pre-download SRA files manually
prefetch SRR1234567
fasterq-dump SRR1234567 --outdir fastq/
```

**Problem**: Snakemake workflow crashes
```bash
# Unlock working directory
flync run-bio -c config.yaml --unlock

# Check logs for specific rule
tail -f results/logs/hisat2/sample1.log

# Rerun with verbose output
flync run-bio -c config.yaml --cores 8 --dry-run --printshellcmds
```

**Problem**: `samples: null` fails
```bash
# Solution: Must also set fastq_dir in config.yaml
samples: null
fastq_dir: "/path/to/fastq"  # Required for auto-detection
fastq_paired: false
```

### Feature Extraction Issues

**Problem**: Feature extraction fails with "track not accessible"
```bash
# Solution: Check internet connection (tracks downloaded from UCSC/Ensembl)
wget -q --spider http://genome.ucsc.edu
echo $?  # Should be 0

# Clear cache and retry
flync run-ml --gtf input.gtf --clear-cache ...
```

**Problem**: "No sequences available for downstream feature generation"
```bash
# Solution 1: Verify GTF has transcript and exon features
grep -c 'transcript' input.gtf
grep -c 'exon' input.gtf

# Solution 2: Check reference genome is accessible
ls -lh genome/genome.fa
samtools faidx genome/genome.fa  # Build index if missing
```

**Problem**: "kmer_redux utilities not available"
```bash
# Solution: Verify utils module is installed
python -c "from flync.utils import kmer_redux; print('OK')"

# Reinstall if needed
pip install -e .
```

### ML Prediction Issues

**Problem**: "schema mismatch" error during prediction
```bash
# Solution: Feature transformations must match training
# Ensure these flags are set correctly:
flync run-ml --gtf input.gtf --output predictions.csv \
  --ref-genome genome.fa
# (Default model expects: use_tfidf=True, use_dim_redux=True, redux_n_components=1)
```

**Problem**: Predictions all 0 or all 1
```bash
# Solution 1: Check input GTF quality
# Ensure transcripts are complete and have exons

# Solution 2: Verify feature extraction succeeded
# Check for warnings in logs

# Solution 3: Use different model or retrain
flync run-ml --gtf input.gtf --model custom_model.pkl ...
```

**Problem**: Out of memory during feature extraction
```bash
# Solution 1: Reduce threads
flync run-ml --threads 4 ...

# Solution 2: Process in smaller batches
# Split GTF and process separately

# Solution 3: Use sparse k-mer format (automatic with default settings)
```

### Docker Issues

**Problem**: Docker permission denied
```bash
# Solution 1: Add user to docker group
sudo usermod -aG docker $USER
newgrp docker

# Solution 2: Run with sudo
sudo docker run ...
```

**Problem**: Docker container out of disk space
```bash
# Clean up old containers and images
docker system prune -a

# Check disk usage
docker system df
```

---

## Contributing

Contributions are welcome! Please follow these guidelines:

### Development Setup

```bash
# Clone and setup development environment
git clone https://github.com/homemlab/flync.git
cd flync
git checkout master

# Create development environment
conda env create -f environment.yml
conda activate flync

# Install in development mode
pip install -e .

# Optional: Install development dependencies
pip install pytest black flake8 mypy
```

### Code Style

- **Python**: Follow PEP 8, use Black formatter (line length 100)
- **Type Hints**: Required for public functions
- **Docstrings**: Google style for all modules, classes, functions
- **Imports**: Absolute imports preferred (`from flync.module import Class`)

### Testing

```bash
# Run tests (when implemented)
pytest tests/

# Format code
black src/flync/

# Type checking
mypy src/flync/
```

### Workflow for Contributions

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Make your changes with clear commit messages
4. Ensure code passes style checks and tests
5. Update documentation if needed
6. Submit a pull request to the `master` branch

### Reporting Issues

- Use GitHub Issues: https://github.com/homemlab/flync/issues
- Include:
  - FLYNC version (`flync --version`)
  - Operating system and version
  - Minimal reproducible example
  - Error messages and logs

## License

MIT License - see [LICENSE](LICENSE) file for details.

            

Fork the repository\n2. Create a feature branch (`git checkout -b feature/amazing-feature`)\n3. Make your changes with clear commit messages\n4. Ensure code passes style checks and tests\n5. Update documentation if needed\n6. Submit a pull request to the `master` branch\n\n### Reporting Issues\n\n- Use GitHub Issues: https://github.com/homemlab/flync/issues\n- Include:\n  - FLYNC version (`flync --version`)\n  - Operating system and version\n  - Minimal reproducible example\n  - Error messages and logs\n\n## License\n\nMIT License - see [LICENSE](LICENSE) file for details.\n",
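
## Appendix: Filtering Predictions

Once `flync run-ml` has written its predictions table, a common follow-up is to shortlist high-confidence lncRNA candidates by probability. A minimal stdlib sketch — the column names below (`transcript_id`, `lncrna_probability`) are illustrative assumptions; check the header of your actual `results/lncrna_predictions.csv` before adapting it:

```python
import csv
import io

# Inline sample mimicking a predictions CSV; real column names
# in results/lncrna_predictions.csv may differ.
sample = """transcript_id,lncrna_probability
MSTRG.1.1,0.91
MSTRG.2.1,0.43
MSTRG.3.2,0.78
"""

def high_confidence_lncrnas(csv_text, threshold=0.75):
    """Return transcript IDs whose predicted lncRNA probability meets the threshold."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["transcript_id"]
            for row in reader
            if float(row["lncrna_probability"]) >= threshold]

print(high_confidence_lncrnas(sample))  # ['MSTRG.1.1', 'MSTRG.3.2']
```

For a real run, replace the inline string with `open("results/lncrna_predictions.csv").read()` and tune the threshold to your desired precision/recall trade-off.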