# NGS AI Agent
An AI-powered agent for fully automated next-generation sequencing (NGS) data analysis, with Deep Mutational Scanning as a proof of concept.
## Features
### 🤖 **AI-Powered Intelligence**
- **Smart Metadata Analysis**: Ollama analyzes experimental descriptions to detect conditions
- **Intelligent File Matching**: AI matches messy sequencer filenames to metadata entries
- **Natural Language Understanding**: Extracts experimental information from human-readable descriptions
- **Multi-layer Matching**: Exact → AI → Fuzzy → Basic detection with robust fallbacks
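
The fuzzy stage of the fallback chain above can be sketched with the standard library's `difflib`; this is an illustration of the idea, not the agent's actual matcher, and the sample names are invented:

```python
import difflib

def fuzzy_match(filename_stem, metadata_samples, cutoff=0.6):
    """Pick the metadata sample whose name is closest to a (messy)
    filename stem. Case-insensitive; returns None below the cutoff."""
    lowered = {s.lower(): s for s in metadata_samples}
    hits = difflib.get_close_matches(filename_stem.lower(), lowered,
                                     n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else None

# A messy sequencer name matched against tidy metadata entries:
print(fuzzy_match("Sample-1_T0_Rep1", ["sample1_T0_rep1", "sample2_T1_rep1"]))
```

When the fuzzy score falls below the cutoff, the pipeline would drop to the basic pattern-detection layer.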
### 🧬 **Complete NGS Pipeline**
- **End-to-End Workflow**: FASTQ files → Final reports with AI orchestration
- **Modular Design**: Reusable components for QC, trimming, mapping, and variant calling
- **Deep Mutational Scanning**: Specialized fitness calculation and amino acid effect analysis
- **Quality Control**: Automated QC with FastQC, trimming with Cutadapt, mapping with Bowtie2
### 📊 **Smart Analysis & Visualization**
- **AI-Generated Reports**: Scientific insights and recommendations powered by Ollama
- **Interactive Heatmaps**: Publication-ready visualizations of mutational effects
- **Tabular Metadata**: User-friendly CSV/Excel format for experimental design
- **Privacy-First**: All AI processing happens locally (no cloud dependencies)
## Quick Start
### 1. Setup Environment
```bash
# Clone and navigate to project
cd ngs-ai-agent
# Check if your existing ai-ngs environment has all required dependencies
python check_environment.py
# If dependencies are missing, install them in your ai-ngs environment:
conda activate ai-ngs
conda install -c conda-forge -c bioconda snakemake fastqc cutadapt bowtie2 samtools bcftools
pip install ollama langchain langchain-community
# Start Ollama (in separate terminal)
ollama serve
ollama pull qwen3-coder:latest
```
### 2. Run Analysis
```bash
# AI-powered analysis with metadata (recommended)
python src/ngs_ai_agent.py run \
    --input-dir /path/to/fastq/files \
    --reference /path/to/reference.fasta \
    --metadata experiment_metadata.csv
# Basic analysis with auto-discovery
python src/ngs_ai_agent.py run --input-dir /path/to/fastq/files --reference /path/to/reference.fasta
# Dry run to check AI configuration
python src/ngs_ai_agent.py run --input-dir /path/to/fastq/files --metadata experiment.csv --dry-run
# Custom cores and config
python src/ngs_ai_agent.py run --input-dir /path/to/fastq/files --cores 16 --metadata experiment.csv
```
### 3. View Results
Results will be generated in the `results/` directory:
- **Final Report**: `results/reports/final_report.html`
- **Fitness Scores**: `results/dms/fitness_scores.csv`
- **Heatmaps**: `results/visualization/dms_heatmap.png` and `.html`
## Project Structure
```
ngs-ai-agent/
├── config/
│   └── config.yaml             # Main configuration
├── workflow/
│   ├── Snakefile               # Main pipeline
│   ├── rules/                  # Modular Snakemake rules
│   │   ├── qc.smk              # Quality control
│   │   ├── trimming.smk        # Read trimming
│   │   ├── mapping.smk         # Read mapping
│   │   ├── variant_calling.smk # Variant calling
│   │   ├── dms_analysis.smk    # DMS fitness calculation
│   │   └── visualization.smk   # Plotting and visualization
│   └── scripts/                # Custom analysis scripts
│       ├── variant_caller.py   # Custom variant caller
│       ├── dms_fitness.py      # Fitness calculation
│       ├── variant_annotation.py
│       ├── create_heatmap.py   # Heatmap generation
│       └── fitness_plots.py    # Additional plots
├── src/
│   ├── ngs_ai_agent.py         # Main CLI interface
│   ├── ai_orchestrator/        # AI components
│   │   └── ollama_client.py    # Ollama integration
│   ├── modules/                # Analysis modules
│   └── visualization/          # Visualization tools
├── data/
│   ├── raw/                    # Input FASTQ files
│   ├── processed/              # Processed data
│   └── test/                   # Test datasets
├── results/                    # Analysis outputs
├── resources/                  # Reference genomes, annotations
├── logs/                       # Pipeline logs
└── environment.yml             # Conda environment
```
## AI Features
### Metadata Detection
The AI automatically analyzes filenames to extract:
- Sample names and IDs
- Time points (T0, T1, day0, day7, etc.)
- Replicate numbers
- Read types (R1/R2 for paired-end)
- Experimental conditions
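
For filenames that follow common conventions, the extraction can be done with plain regexes before any AI is involved. The patterns below are illustrative assumptions, not the agent's actual rules:

```python
import re

def parse_fastq_name(filename):
    """Pull time point, replicate number, and read number out of a
    FASTQ filename. Hypothetical 'basic detection' pass."""
    info = {}
    # T0/T1 or day0/day7 style time points, delimited by _ - or .
    if m := re.search(r"(?:^|[_-])(T\d+|day\d+)(?=[_.-])", filename, re.I):
        info["timepoint"] = m.group(1)
    # Replicate number, e.g. rep1 / Rep2
    if m := re.search(r"rep(\d+)", filename, re.I):
        info["replicate"] = int(m.group(1))
    # Paired-end read number, e.g. _R1. or _R2_
    if m := re.search(r"_R([12])(?=[_.])", filename):
        info["read"] = int(m.group(1))
    return info

print(parse_fastq_name("sample1_T0_rep1_R1.fastq.gz"))
```

Filenames that fail these patterns are the cases handed to the AI matching layers.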
### Filename Cleaning
Automatically standardizes filenames to:
```
sample1_T0_rep1_R1.fastq.gz
sample1_T0_rep1_R2.fastq.gz
sample2_T1_rep1_R1.fastq.gz
...
```
### Pipeline Configuration
AI determines:
- Single vs paired-end sequencing
- Sample groupings and time series
- Optimal parameters for each analysis step
### Report Generation
AI generates insights including:
- Key findings summary
- Hotspot identification
- Biological implications
- Follow-up recommendations
## Deep Mutational Scanning Pipeline
### 1. Quality Control
- FastQC analysis of raw reads
- MultiQC summary reports
### 2. Read Processing
- Adapter trimming with Cutadapt
- Quality filtering
### 3. Mapping
- Bowtie2 alignment to reference
- Local alignment optimized for DMS
### 4. Variant Calling
- Custom high-sensitivity variant caller
- Optimized for high-frequency variants in DMS
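
The filtering step implied by the `min_coverage` and `min_frequency` settings in `config/config.yaml` can be sketched as follows; this shows the thresholding logic only, not the project's actual caller in `variant_caller.py`:

```python
def call_variants(pileup, reference, min_coverage=10, min_frequency=0.01):
    """pileup: {position: {base: count}}; reference: {position: ref_base}.
    Return (position, base, frequency) for every non-reference allele
    that passes both the coverage and the frequency threshold."""
    variants = []
    for pos, counts in sorted(pileup.items()):
        depth = sum(counts.values())
        if depth < min_coverage:
            continue  # site not callable below min_coverage
        for base, n in counts.items():
            freq = n / depth
            if base != reference[pos] and freq >= min_frequency:
                variants.append((pos, base, round(freq, 4)))
    return variants

pileup = {5: {"A": 95, "G": 5}, 9: {"C": 4}}
print(call_variants(pileup, reference={5: "A", 9: "C"}))
```

The low default `min_frequency` reflects the DMS setting, where even modestly enriched variants are of interest.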
### 5. Fitness Calculation
- Enrichment ratio calculation
- Amino acid effect annotation
- Statistical analysis
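
A minimal sketch of the log2 enrichment ratio with the configured pseudocount (`pseudocount: 1`), normalized to wild type; the exact normalization in `dms_fitness.py` may differ:

```python
import math

def fitness_score(count_pre, count_post, wt_pre, wt_post, pseudocount=1):
    """Log2 enrichment of a variant between the pre- and post-selection
    libraries, relative to the wild-type enrichment over the same window."""
    variant = (count_post + pseudocount) / (count_pre + pseudocount)
    wildtype = (wt_post + pseudocount) / (wt_pre + pseudocount)
    return math.log2(variant / wildtype)

# A strongly depleted variant (post << pre) gets a negative score;
# a neutral variant that tracks wild type scores 0.
print(round(fitness_score(1000, 10, 1000, 1000), 2))
```

The pseudocount keeps variants with zero post-selection counts finite rather than undefined.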
### 6. Visualization
- Interactive heatmaps (position vs amino acid)
- Fitness distribution plots
- Coverage analysis
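
The position-by-amino-acid grid behind the heatmap can be assembled from per-variant fitness scores like this; a sketch of the data layout only, with `create_heatmap.py` doing the actual plotting:

```python
# Conventional single-letter amino acid order for DMS heatmaps.
AA_ORDER = "ACDEFGHIKLMNPQRSTVWY"

def fitness_matrix(scores, positions):
    """Arrange (position, amino_acid) -> fitness scores into rows of
    positions and columns of amino acids; unobserved cells are None."""
    return [[scores.get((pos, aa)) for aa in AA_ORDER] for pos in positions]

matrix = fitness_matrix({(1, "A"): 0.0, (1, "K"): -2.5}, positions=[1, 2])
# Each row is one position, each column one amino acid; a grid like this
# can be fed to imshow/heatmap-style functions in matplotlib or plotly.
```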
## Configuration
Edit `config/config.yaml` to customize:
```yaml
# Pipeline settings
pipeline:
  threads: 8
  memory_gb: 16

# AI settings
ai:
  model: "llama3.1:8b"
  temperature: 0.1

# Analysis parameters
variant_calling:
  min_coverage: 10
  min_frequency: 0.01

dms:
  fitness_calculation:
    method: "enrichment_ratio"
    pseudocount: 1
```
## Extending the Pipeline
### Adding New Analysis Types
1. Create new rules in `workflow/rules/`
2. Add corresponding scripts in `workflow/scripts/`
3. Update the main `Snakefile`
4. Modify configuration as needed
Example for RNA-seq analysis:
```python
# workflow/rules/rnaseq.smk
rule star_align:
    input:
        reads="results/trimmed/{sample}.fastq.gz"
    output:
        bam="results/mapped/{sample}.bam"
    threads: 8
    params:
        index=config["rnaseq"]["star_index"]  # hypothetical config key
    shell:
        "STAR --runThreadN {threads} --genomeDir {params.index} ..."
```
### Custom AI Prompts
Modify `src/ai_orchestrator/ollama_client.py` to add new AI capabilities:
```python
def detect_experimental_design(self, metadata):
    prompt = f"""
    Analyze this experimental metadata and determine the study design:
    {metadata}

    Identify:
    1. Control vs treatment groups
    2. Time course experiments
    3. Dose-response studies
    """
    # ... implementation
```
## Troubleshooting
### Common Issues
1. **Ollama not running**: Start with `ollama serve`
2. **Model not found**: Pull model with `ollama pull qwen3-coder:latest`
3. **Environment issues**: Recreate with `conda env remove -n ai-ngs && python src/ngs_ai_agent.py setup`
4. **Memory issues**: Reduce threads in config or increase system memory
### Logs
Check logs in:
- `logs/`: Snakemake execution logs
- Console output: Real-time pipeline progress
## Dependencies
### Core Tools
- Snakemake (workflow management)
- FastQC (quality control)
- Cutadapt (trimming)
- Bowtie2 (mapping)
- Samtools/BCFtools (file processing)
### AI Components
- Ollama (local LLM server)
- LangChain (AI orchestration)
### Python Libraries
- Pandas, NumPy (data processing)
- Matplotlib, Seaborn, Plotly (visualization)
- Biopython (sequence analysis)
- PyYAML (configuration)
## Citation
If you use NGS AI Agent in your research, please cite:
```
NGS AI Agent: AI-powered automated next-generation sequencing analysis
[Your publication details here]
```
## License
MIT License - see LICENSE file for details.
## Contributing
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request
## Support
- 📧 Email: [your-email@example.com]
- 🐛 Issues: GitHub Issues page
- 📚 Documentation: [Link to detailed docs]