ngs-ai-agent

- **Name**: ngs-ai-agent
- **Version**: 1.0.0
- **Summary**: AI-powered automated NGS analysis pipeline
- **Home page**: https://github.com/your-org/ngs-ai-agent
- **Author**: NGS AI Agent Team
- **Maintainer**: None
- **Requires Python**: >=3.8
- **License**: None
- **Keywords**: ngs, bioinformatics, ai, genomics, sequencing, pipeline
- **Uploaded**: 2025-10-14 22:40:46
# NGS AI Agent

An AI-powered agent for fully automated next-generation sequencing (NGS) data analysis, with Deep Mutational Scanning as a proof of concept.

## Features

### 🤖 **AI-Powered Intelligence**
- **Smart Metadata Analysis**: Ollama analyzes experimental descriptions to detect conditions
- **Intelligent File Matching**: AI matches messy sequencer filenames to metadata entries
- **Natural Language Understanding**: Extracts experimental information from human-readable descriptions
- **Multi-layer Matching**: Exact → AI → Fuzzy → Basic detection with robust fallbacks (a sketch of the fuzzy layer follows this list)

### 🧬 **Complete NGS Pipeline** 
- **End-to-End Workflow**: FASTQ files → Final reports with AI orchestration
- **Modular Design**: Reusable components for QC, trimming, mapping, and variant calling
- **Deep Mutational Scanning**: Specialized fitness calculation and amino acid effect analysis
- **Quality Control**: Automated QC with FastQC, trimming with Cutadapt, mapping with Bowtie2

### 📊 **Smart Analysis & Visualization**
- **AI-Generated Reports**: Scientific insights and recommendations powered by Ollama
- **Interactive Heatmaps**: Publication-ready visualizations of mutational effects
- **Tabular Metadata**: User-friendly CSV/Excel format for experimental design
- **Privacy-First**: All AI processing happens locally (no cloud dependencies)  
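
As one example of the fallback layers, here is a minimal sketch of the fuzzy-matching step using only the Python standard library; the function name, cutoff, and example inputs are illustrative, not the project's actual API:

```python
# Fuzzy fallback only: the real matcher tries exact and AI-based matching first.
import difflib

def fuzzy_match(filename_stem, metadata_samples, cutoff=0.6):
    """Return the closest metadata sample name for a messy sequencer filename stem."""
    hits = difflib.get_close_matches(filename_stem, metadata_samples, n=1, cutoff=cutoff)
    return hits[0] if hits else None  # None -> fall through to basic detection

print(fuzzy_match("sample1-T0-rep1_S12_L001", ["sample1_T0_rep1", "sample2_T1_rep1"]))
```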

## Quick Start

### 1. Setup Environment

```bash
# Clone the repository and navigate into it
git clone https://github.com/your-org/ngs-ai-agent.git
cd ngs-ai-agent

# Check if your existing ai-ngs environment has all required dependencies
python check_environment.py

# If dependencies are missing, install them in your ai-ngs environment:
conda activate ai-ngs
conda install -c conda-forge -c bioconda snakemake fastqc cutadapt bowtie2 samtools bcftools
pip install ollama langchain langchain-community

# Start Ollama (in separate terminal)
ollama serve
ollama pull qwen3-coder:latest
```

### 2. Run Analysis

```bash
# AI-powered analysis with metadata (recommended)
python src/ngs_ai_agent.py run \
  --input-dir /path/to/fastq/files \
  --reference /path/to/reference.fasta \
  --metadata experiment_metadata.csv

# Basic analysis with auto-discovery
python src/ngs_ai_agent.py run --input-dir /path/to/fastq/files --reference /path/to/reference.fasta

# Dry run to check AI configuration
python src/ngs_ai_agent.py run --input-dir /path/to/fastq/files --metadata experiment.csv --dry-run

# Custom cores and config
python src/ngs_ai_agent.py run --input-dir /path/to/fastq/files --cores 16 --metadata experiment.csv
```
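
The metadata file passed via `--metadata` is a plain table. The column names below are assumptions about the schema (the real layout may differ), but a minimal sheet can be generated with pandas:

```python
# Hypothetical metadata layout for --metadata; column names are assumptions.
import pandas as pd

pd.DataFrame({
    "sample":      ["sample1", "sample1", "sample2"],
    "timepoint":   ["T0", "T1", "T1"],
    "replicate":   [1, 1, 1],
    "description": ["pre-selection input library",
                    "post-selection, replicate 1",
                    "post-selection, second construct"],
}).to_csv("experiment_metadata.csv", index=False)
```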

### 3. View Results

Results will be generated in the `results/` directory:
- **Final Report**: `results/reports/final_report.html`
- **Fitness Scores**: `results/dms/fitness_scores.csv` (see the quick inspection sketch below)
- **Heatmaps**: `results/visualization/dms_heatmap.png` and `.html`
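
A quick way to inspect the fitness table from Python; the `fitness` column name is an assumption about the CSV layout:

```python
# Minimal sanity check of the fitness table; adjust column names to the actual output.
import pandas as pd

scores = pd.read_csv("results/dms/fitness_scores.csv")
print(scores.nsmallest(10, "fitness"))  # most depleted variants
print(scores.nlargest(10, "fitness"))   # most enriched variants
```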

## Project Structure

```
ngs-ai-agent/
├── config/
│   └── config.yaml              # Main configuration
├── workflow/
│   ├── Snakefile               # Main pipeline
│   ├── rules/                  # Modular Snakemake rules
│   │   ├── qc.smk             # Quality control
│   │   ├── trimming.smk       # Read trimming
│   │   ├── mapping.smk        # Read mapping
│   │   ├── variant_calling.smk # Variant calling
│   │   ├── dms_analysis.smk   # DMS fitness calculation
│   │   └── visualization.smk   # Plotting and visualization
│   └── scripts/               # Custom analysis scripts
│       ├── variant_caller.py  # Custom variant caller
│       ├── dms_fitness.py     # Fitness calculation
│       ├── variant_annotation.py
│       ├── create_heatmap.py  # Heatmap generation
│       └── fitness_plots.py   # Additional plots
├── src/
│   ├── ngs_ai_agent.py        # Main CLI interface
│   ├── ai_orchestrator/       # AI components
│   │   └── ollama_client.py   # Ollama integration
│   ├── modules/               # Analysis modules
│   └── visualization/         # Visualization tools
├── data/
│   ├── raw/                   # Input FASTQ files
│   ├── processed/             # Processed data
│   └── test/                  # Test datasets
├── results/                   # Analysis outputs
├── resources/                 # Reference genomes, annotations
├── logs/                      # Pipeline logs
└── environment.yml           # Conda environment
```

## AI Features

### Metadata Detection
The AI automatically analyzes filenames to extract:
- Sample names and IDs
- Time points (T0, T1, day0, day7, etc.)
- Replicate numbers
- Read types (R1/R2 for paired-end)
- Experimental conditions

### Filename Cleaning
Automatically standardizes filenames to:
```
sample1_T0_rep1_R1.fastq.gz
sample1_T0_rep1_R2.fastq.gz
sample2_T1_rep1_R1.fastq.gz
...
```
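
A minimal sketch of the rule-based ("basic") layer that maps a raw sequencer name onto this canonical pattern; the patterns are illustrative, real filenames vary, and the AI layer handles cases the regexes miss:

```python
# Rule-based standardization sketch; patterns are illustrative, not exhaustive.
import re

def standardize(name):
    sample = re.search(r"(sample[-_ ]?\d+)", name, re.I)
    time   = re.search(r"[_-](T\d+|day\d+)[_-]", name, re.I)
    rep    = re.search(r"rep[-_ ]?(\d+)", name, re.I)
    read   = re.search(r"_R([12])[_.]", name)
    if not all([sample, time, rep, read]):
        return None  # fall through to the next matching layer
    clean_sample = re.sub(r"[-_ ]", "", sample.group(1)).lower()
    return f"{clean_sample}_{time.group(1)}_rep{rep.group(1)}_R{read.group(1)}.fastq.gz"

print(standardize("Sample-1_T0_Rep1_S5_L001_R1_001.fastq.gz"))
# sample1_T0_rep1_R1.fastq.gz
```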

### Pipeline Configuration
AI determines:
- Single- vs. paired-end sequencing (see the pairing sketch after this list)
- Sample groupings and time series
- Optimal parameters for each analysis step
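
A minimal sketch of paired-end detection, under the assumption that mates share a name apart from the `_R1`/`_R2` tag:

```python
# Group FASTQs into sample units; a unit with both R1 and R2 is paired-end.
import re
from collections import defaultdict
from pathlib import Path

def group_fastqs(fastq_dir):
    samples = defaultdict(dict)
    for fq in sorted(Path(fastq_dir).glob("*.fastq.gz")):
        mate = re.search(r"_R([12])(?=[_.])", fq.name)
        key = re.sub(r"_R[12](?=[_.])", "", fq.name) if mate else fq.name
        samples[key][f"R{mate.group(1)}" if mate else "single"] = str(fq)
    return dict(samples)
```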

### Report Generation
AI generates insights including:
- Key findings summary
- Hotspot identification
- Biological implications
- Follow-up recommendations

## Deep Mutational Scanning Pipeline

### 1. Quality Control
- FastQC analysis of raw reads
- MultiQC summary reports

### 2. Read Processing
- Adapter trimming with Cutadapt
- Quality filtering

### 3. Mapping
- Bowtie2 alignment to reference
- Local alignment optimized for DMS

### 4. Variant Calling
- Custom high-sensitivity variant caller
- Optimized for high-frequency variants in DMS (a counting sketch follows below)
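
The custom caller is not reproduced here, but its core idea (count bases per position, keep non-reference alleles above the configured coverage and frequency floors) can be sketched with pysam; thresholds mirror the config defaults and the function is illustrative only:

```python
# Frequency-based counting sketch (not the project's caller); assumes a
# coordinate-sorted, indexed BAM and a faidx-indexed reference FASTA.
import pysam

MIN_COVERAGE = 10     # config: variant_calling.min_coverage
MIN_FREQUENCY = 0.01  # config: variant_calling.min_frequency

def call_high_frequency_variants(bam_path, ref_path):
    ref = pysam.FastaFile(ref_path)
    variants = []
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for col in bam.pileup(min_base_quality=20):
            counts = {}
            for read in col.pileups:
                if read.is_del or read.is_refskip:
                    continue
                base = read.alignment.query_sequence[read.query_position]
                counts[base] = counts.get(base, 0) + 1
            depth = sum(counts.values())
            if depth < MIN_COVERAGE:
                continue
            chrom = bam.get_reference_name(col.reference_id)
            ref_base = ref.fetch(chrom, col.reference_pos, col.reference_pos + 1).upper()
            for base, n in counts.items():
                if base != ref_base and n / depth >= MIN_FREQUENCY:
                    variants.append((chrom, col.reference_pos + 1, ref_base, base, depth, n / depth))
    return variants
```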

### 5. Fitness Calculation
- Enrichment ratio calculation (sketched below)
- Amino acid effect annotation
- Statistical analysis
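
A minimal enrichment-ratio sketch consistent with the configuration defaults (`method: enrichment_ratio`, `pseudocount: 1`); the count column names are assumptions about the variant tables:

```python
# Log2 enrichment of post- vs. pre-selection variant frequencies, with a pseudocount.
import numpy as np
import pandas as pd

def enrichment_fitness(counts: pd.DataFrame, pseudocount: float = 1.0) -> pd.DataFrame:
    """counts holds per-variant read counts in 'count_t0' and 'count_t1' (assumed names)."""
    freq_t0 = (counts["count_t0"] + pseudocount) / (counts["count_t0"] + pseudocount).sum()
    freq_t1 = (counts["count_t1"] + pseudocount) / (counts["count_t1"] + pseudocount).sum()
    return counts.assign(fitness=np.log2(freq_t1 / freq_t0))
```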

### 6. Visualization
- Interactive heatmaps (position vs. amino acid; an interactive plotting sketch follows below)
- Fitness distribution plots
- Coverage analysis
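
A plotting sketch in the same spirit as `create_heatmap.py`, assuming `fitness_scores.csv` carries `position`, `amino_acid`, and `fitness` columns (these names are assumptions):

```python
# Interactive heatmap sketch; column names in fitness_scores.csv are assumptions.
import pandas as pd
import plotly.express as px

scores = pd.read_csv("results/dms/fitness_scores.csv")
matrix = scores.pivot_table(index="amino_acid", columns="position", values="fitness")

fig = px.imshow(
    matrix,
    color_continuous_scale="RdBu_r",
    color_continuous_midpoint=0,
    labels={"x": "Position", "y": "Amino acid", "color": "log2 enrichment"},
    aspect="auto",
)
fig.write_html("dms_heatmap_preview.html")
```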

## Configuration

Edit `config/config.yaml` to customize:

```yaml
# Pipeline settings
pipeline:
  threads: 8
  memory_gb: 16

# AI settings
ai:
  model: "llama3.1:8b"
  temperature: 0.1

# Analysis parameters
variant_calling:
  min_coverage: 10
  min_frequency: 0.01

dms:
  fitness_calculation:
    method: "enrichment_ratio"
    pseudocount: 1
```
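
Custom scripts can read the same file with PyYAML (already a listed dependency):

```python
# Load the pipeline configuration; safe_load avoids executing arbitrary YAML tags.
import yaml

with open("config/config.yaml") as handle:
    config = yaml.safe_load(handle)

print(config["ai"]["model"])                      # "llama3.1:8b"
print(config["variant_calling"]["min_coverage"])  # 10
```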

## Extending the Pipeline

### Adding New Analysis Types

1. Create new rules in `workflow/rules/`
2. Add corresponding scripts in `workflow/scripts/`
3. Update the main `Snakefile`
4. Modify configuration as needed

Example for RNA-seq analysis:
```python
# workflow/rules/rnaseq.smk
rule star_align:
    input:
        reads="results/trimmed/{sample}.fastq.gz"
    output:
        bam="results/mapped/{sample}.bam"
    threads: 8
    params:
        index=config["rnaseq"]["star_index"]  # hypothetical config key for the STAR index
    shell:
        "STAR --runThreadN {threads} --genomeDir {params.index} ..."
```

### Custom AI Prompts

Modify `src/ai_orchestrator/ollama_client.py` to add new AI capabilities:

```python
def detect_experimental_design(self, metadata):
    prompt = f"""
    Analyze this experimental metadata and determine the study design:
    {metadata}
    
    Identify:
    1. Control vs treatment groups
    2. Time course experiments
    3. Dose-response studies
    """
    # ... implementation
```
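
How such a prompt might be sent through the `ollama` Python client is sketched below; the model name and temperature mirror the Quick Start and config, and the real client wrapper may differ:

```python
# Standalone sketch of one Ollama round-trip; not the project's client wrapper.
import ollama

prompt = "Analyze this experimental metadata and determine the study design: ..."
response = ollama.chat(
    model="qwen3-coder:latest",
    messages=[{"role": "user", "content": prompt}],
    options={"temperature": 0.1},
)
print(response["message"]["content"])
```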

## Troubleshooting

### Common Issues

1. **Ollama not running**: Start with `ollama serve`
2. **Model not found**: Pull model with `ollama pull qwen3-coder:latest`
3. **Environment issues**: Recreate with `conda env remove -n ai-ngs && python src/ngs_ai_agent.py setup`
4. **Memory issues**: Reduce threads in config or increase system memory

### Logs

Check logs in:
- `logs/`: Snakemake execution logs
- Console output: Real-time pipeline progress

## Dependencies

### Core Tools
- Snakemake (workflow management)
- FastQC (quality control)
- Cutadapt (trimming)
- Bowtie2 (mapping)
- Samtools/BCFtools (file processing)

### AI Components
- Ollama (local LLM server)
- LangChain (AI orchestration)

### Python Libraries
- Pandas, NumPy (data processing)
- Matplotlib, Seaborn, Plotly (visualization)
- Biopython (sequence analysis)
- PyYAML (configuration)

## Citation

If you use NGS AI Agent in your research, please cite:

```
NGS AI Agent: AI-powered automated next-generation sequencing analysis
[Your publication details here]
```

## License

MIT License - see LICENSE file for details.

## Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request

## Support

- 📧 Email: [your-email@example.com]
- 🐛 Issues: GitHub Issues page
- 📚 Documentation: [Link to detailed docs]

            
