vigil-ai

- Name: vigil-ai
- Version: 0.3.0
- Summary: AI-powered workflow generation with foundation models (Claude, ESM, BioGPT, ChemBERTa, etc.)
- Author: Science Abundance
- Requires Python: >=3.11
- License: Apache-2.0
- Keywords: ai, claude, llm, pipeline, science, workflow
- Upload time: 2025-10-21 23:24:28

# vigil-ai

**AI-powered workflow generation with foundation models for reproducible science**

vigil-ai extends [Vigil](https://github.com/Science-Abundance/vigil) with AI capabilities, making scientific workflow creation accessible through natural language and specialized foundation models.

## Features

- **Natural language → Pipeline**: Generate Snakemake workflows from plain English descriptions
- **Foundation models**: 10+ specialized models for biology, chemistry, and materials science
- **Domain-specific AI**: Auto-select the best model for your scientific domain
- **AI debugging**: Get intelligent suggestions for fixing pipeline errors
- **Workflow optimization**: Analyze and optimize for speed, cost, or resource usage
- **Task-based interface**: Simple, high-level API for common workflows
- **MCP integration**: Works with Claude Desktop and AI assistants

## Installation

**Basic (Claude models only):**
```bash
pip install vigil-ai
```

**With science models (ESM-2, BioGPT, ChemBERTa, etc.):**
```bash
pip install 'vigil-ai[science]'
```

**Or install with Vigil:**
```bash
pip install 'vigil[ai]'              # Basic
pip install 'vigil[ai,science]'      # With science models
```

## Requirements

- Python 3.11+
- Vigil >= 0.2.1
- Anthropic API key (get one at https://console.anthropic.com/)

## Setup

Set your Anthropic API key:

```bash
export ANTHROPIC_API_KEY='your-api-key-here'
```

Or add to your `.env` file:

```
ANTHROPIC_API_KEY=your-api-key-here
```
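
To confirm the key is visible to the shell that will run vigil-ai, a quick check (this one-liner uses only the Python standard library and is not part of vigil-ai):

```bash
# Prints True if ANTHROPIC_API_KEY is set in the current environment
python -c "import os; print(bool(os.environ.get('ANTHROPIC_API_KEY')))"
```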

## Usage

### Generate Pipeline from Description

```bash
vigil ai create "Filter variants by quality >30, annotate with Ensembl, calculate Ti/Tv ratio"
```

Output:
```
✓ Pipeline created: app/code/pipelines/Snakefile

Next steps:
  1. Review the generated pipeline
  2. Create necessary step scripts
  3. vigil run --cores 4
```

### Debug Pipeline Errors

```bash
vigil ai debug

# Or specify error log
vigil ai debug --error-log .snakemake/log/error.log
```

Output:
```
Analyzing error...

Root Cause:
The rule 'filter_variants' failed because the input file 'variants.csv' was not found.

Suggested Fix:
1. Check that your data exists: ls app/data/samples/
2. Verify file name matches exactly (case-sensitive)
3. If file is missing, download or create it
4. Run: vigil doctor to check project health

Prevention:
Add input validation before running pipeline.
```

### Optimize Workflow

```bash
vigil ai optimize --focus speed

# Or optimize for cost
vigil ai optimize --focus cost
```

Output:
```
Optimization Suggestions:

Rule: filter_variants
Issue: Sequential processing
Suggestion: Add threads: 4 and use parallel processing
Impact: 4x faster with multi-core

Rule: annotate
Issue: Repeated API calls
Suggestion: Implement caching for Ensembl queries
Impact: 10x faster on reruns
```

## Quick Start (Task-Based Interface)

The simplest way to use vigil-ai is through the task-based interface:

```python
from vigil_ai.tasks import PipelineGenerator, ErrorDebugger, ModelSelector

# 1. Generate a pipeline for biology
bio_gen = PipelineGenerator(domain="biology")
pipeline = bio_gen.create("Filter variants >30, annotate, calculate Ti/Tv")
bio_gen.create_and_save(pipeline, "workflow.smk")

# 2. Debug errors when they occur
debugger = ErrorDebugger()
fix = debugger.analyze("FileNotFoundError: variants.csv not found")
print(fix)

# 3. Get model recommendations
selector = ModelSelector()
model, reason = selector.recommend("I need to analyze protein sequences")
print(reason)  # "Recommended biology model (ESM-2) for protein analysis"
```

## Foundation Models

vigil-ai supports 10+ specialized foundation models across scientific domains:

### Biology Models
- **ESM-2** (650M, 3B, 15B) - Protein language models from Meta AI
- **BioGPT** - Biomedical text generation
- **ProtGPT2** - Protein sequence generation

### Chemistry Models
- **ChemBERTa** - Molecular property prediction
- **MolFormer** - Chemical structure analysis

### Materials Science Models
- **MatBERT** - Materials property prediction

### General Models
- **Claude 3.5 Sonnet** (default) - General-purpose, most capable
- **Claude 3 Opus** - Most powerful
- **Galactica** - Scientific knowledge and reasoning

### Using Domain-Specific Models

```python
from vigil_ai import get_model, ModelDomain

# Automatically select best model for domain
bio_model = get_model(domain=ModelDomain.BIOLOGY)      # Returns ESM-2
chem_model = get_model(domain=ModelDomain.CHEMISTRY)   # Returns ChemBERTa
mat_model = get_model(domain=ModelDomain.MATERIALS)    # Returns MatBERT

# Use specific model by name
esm = get_model(name="esm-2-650m")
embedding = esm.embed("MKFLKFSLLTAVLLSVVFAFSSCGDDDDTGYLPPSQAIQDLL")

# Generate with domain-specific model
from vigil_ai import generate_pipeline
pipeline = generate_pipeline(
    "Analyze protein sequences and predict function",
    domain=ModelDomain.BIOLOGY  # Uses ESM-2
)
```

## Python API (Low-Level)

For more control, use the low-level API:

```python
from vigil_ai import generate_pipeline, ai_debug, ai_optimize

# Generate pipeline
pipeline = generate_pipeline(
    "Filter variants by quality >30, calculate Ti/Tv ratio",
    template="genomics-starter",
    model="claude-3-5-sonnet-20241022"  # Or specify domain
)
print(pipeline)

# Debug error
fix = ai_debug("FileNotFoundError: variants.csv not found")
print(fix)

# Optimize workflow
suggestions = ai_optimize(focus="speed")
print(suggestions)
```

## Examples

### Create Imaging Analysis Pipeline

```bash
vigil ai create "Segment cells from microscopy images, count cells per field, measure intensity"
```

Generates:
```python
rule segment_cells:
    input: "data/images/{sample}.tif"
    output: "artifacts/masks/{sample}_mask.png"
    script: "../lib/steps/segment.py"

rule count_cells:
    input: "artifacts/masks/{sample}_mask.png"
    output: "artifacts/counts/{sample}_counts.json"
    script: "../lib/steps/count.py"

rule measure_intensity:
    input:
        image="data/images/{sample}.tif",
        mask="artifacts/masks/{sample}_mask.png"
    output: "artifacts/intensity/{sample}_intensity.csv"
    script: "../lib/steps/measure.py"
```
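
The generated rules reference step scripts that you still need to provide (step 2 of the `vigil ai create` next steps). As a rough illustration, here is a minimal sketch of what `../lib/steps/count.py` might look like. The `snakemake` object is injected by Snakemake's `script:` directive; the use of Pillow/NumPy/SciPy and the binary-mask assumption are illustrative choices, not part of vigil-ai:

```python
# Minimal sketch of a step script for the count_cells rule.
# Snakemake's script: directive injects a `snakemake` object with the rule's paths.
import json

import numpy as np
from PIL import Image
from scipy import ndimage

# Assumption: the mask is an image where nonzero pixels belong to cells.
mask = np.array(Image.open(snakemake.input[0]))
_, n_cells = ndimage.label(mask > 0)  # count connected components

with open(snakemake.output[0], "w") as f:
    json.dump({"cell_count": int(n_cells)}, f)
```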

### Interactive Mode

```bash
vigil ai chat
```

Starts an interactive session:
```
> Create a pipeline to filter variants
✓ Pipeline generated

> Add a rule to calculate metrics
✓ Added metrics rule

> How can I make this faster?
Suggestions:
1. Add parallel processing
2. Cache intermediate results
...
```

## Configuration

Create `.vigil-ai.yaml` in your project:

```yaml
ai:
  model: claude-3-5-sonnet-20241022  # Claude model to use
  max_tokens: 4096                     # Max response length
  temperature: 0.7                     # Creativity (0-1)
  cache_responses: true                # Cache AI responses
```

## Advanced Usage

### Generate Step Script

```python
from vigil_ai.generator import generate_step_script

script = generate_step_script(
    rule_name="filter_variants",
    description="Filter variants by quality score >30",
    inputs=["variants.csv"],
    outputs=["filtered.parquet"],
    language="python"
)

with open("app/code/lib/steps/filter.py", "w") as f:
    f.write(script)
```

### Custom Prompts

```python
from vigil_ai import generate_pipeline

pipeline = generate_pipeline(
    description="""
    Create a multi-sample variant calling pipeline:
    1. Align reads with BWA
    2. Mark duplicates with Picard
    3. Call variants with GATK
    4. Filter and annotate
    """,
    template="genomics-starter"
)
```

## Architecture

vigil-ai is part of a **three-layer architecture** for reproducible science:

```
┌─────────────────────────────────────────────────────┐
│  Agents Layer: AI Assistants                        │
│  (Claude Desktop, custom agents)                    │
└─────────────────────────────────────────────────────┘
                        ▼
┌─────────────────────────────────────────────────────┐
│  Application Layer: vigil-ai (THIS PACKAGE)         │
│  - MCP Server Integration                           │
│  - Foundation Models (Claude, ESM, BioGPT, etc.)    │
│  - Task Interface (PipelineGenerator, etc.)         │
└─────────────────────────────────────────────────────┘
                        ▼
┌─────────────────────────────────────────────────────┐
│  Foundation Layer: Vigil Core                       │
│  - Snakemake pipelines                              │
│  - Artifact management                              │
│  - Receipt tracking                                 │
└─────────────────────────────────────────────────────┘
```

### MCP Integration

vigil-ai extends the Vigil MCP server with 5 AI-powered verbs:

- `ai_generate_pipeline` - Generate Snakemake workflow from description
- `ai_debug_error` - Analyze and fix pipeline errors
- `ai_optimize_workflow` - Suggest performance optimizations
- `ai_list_models` - List available foundation models
- `ai_get_model_info` - Get model metadata and capabilities

**Use with Claude Desktop:**

```json
{
  "mcpServers": {
    "vigil": {
      "command": "vigil",
      "args": ["mcp"]
    }
  }
}
```

Then ask Claude: *"Generate a pipeline to filter variants and calculate metrics"*
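
The same verbs can be exercised from any MCP client, not just Claude Desktop. The sketch below uses the reference `mcp` Python SDK to launch `vigil mcp` over stdio and call `ai_generate_pipeline`; the SDK usage is a general MCP-client pattern, and the tool argument name `description` is a guess rather than vigil-ai's documented schema:

```python
# Hypothetical MCP client session against the Vigil MCP server.
# Assumes the reference `mcp` Python SDK; the tool argument name is a guess.
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


async def main() -> None:
    server = StdioServerParameters(command="vigil", args=["mcp"])
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()  # should include the ai_* verbs
            print([tool.name for tool in tools.tools])
            result = await session.call_tool(
                "ai_generate_pipeline",
                arguments={"description": "Filter variants >30 and calculate Ti/Tv"},
            )
            print(result)


asyncio.run(main())
```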

## All Supported Models

### General-Purpose (API-based)
- `claude-3-5-sonnet-20241022` (default, recommended)
- `claude-3-opus-20240229` (most powerful)
- `claude-3-sonnet-20240229` (balanced)
- `claude-3-haiku-20240307` (fastest, cheapest)

### Biology (requires `[science]` install)
- `esm-2-650m` - Meta AI protein model, 650M params
- `esm-2-3b` - Meta AI protein model, 3B params (GPU recommended)
- `esm-2-15b` - Meta AI protein model, 15B params (GPU required)
- `biogpt` - Microsoft biomedical text model
- `protgpt2` - Protein sequence generation

### Chemistry (requires `[science]` install)
- `chemberta-v2` - DeepChem molecular property model
- `molformer` - Molecular structure analysis

### Materials Science (requires `[science]` install)
- `matbert` - Materials property prediction

## Cost Estimates

**Claude models (API-based):**
- Pipeline generation: ~$0.02-0.05 per request
- Debugging: ~$0.01-0.03 per request
- Optimization: ~$0.03-0.07 per request

**Science models (local inference):**
- Free to use (runs on your hardware)
- Requires GPU for optimal performance (ESM-2, BioGPT)
- CPU inference possible but slower

**Cost optimization tips:**
- Enable response caching: `cache_responses: true` in `.vigil-ai.yaml`
- Use smaller models for simpler tasks (`claude-3-haiku` vs `claude-3-opus`)
- Use local science models when applicable (no API costs)
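
Putting the first two tips into practice, a cost-leaning `.vigil-ai.yaml` might look like the sketch below (the field names follow the Configuration section above; the specific values are illustrative):

```yaml
ai:
  model: claude-3-haiku-20240307   # cheaper model for routine generation
  max_tokens: 2048                 # shorter responses cost less
  temperature: 0.3                 # more deterministic output
  cache_responses: true            # reuse responses for repeated requests
```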

## Example Gallery

See the [`examples/`](examples/) directory for complete examples:

- **[task_based_workflow.py](examples/task_based_workflow.py)** - Complete workflow using task interface
- **[domain_specific_models.py](examples/domain_specific_models.py)** - Using biology/chemistry/materials models
- **[basic_pipeline_generation.py](examples/basic_pipeline_generation.py)** - Low-level API examples
- **[with_caching_and_config.py](examples/with_caching_and_config.py)** - Configuration and caching

Run any example:
```bash
python examples/task_based_workflow.py
```

## Limitations

**General:**
- Claude models require internet connection and API key
- Generated pipelines need review before use in production
- AI suggestions should be validated by domain experts
- Not a replacement for scientific expertise

**Science models:**
- Require `pip install vigil-ai[science]` and additional dependencies
- Large models (ESM-2 15B) require significant GPU memory (40GB+)
- Local inference slower than API-based models
- May require domain-specific preprocessing

## Development

```bash
# Clone repo
git clone https://github.com/Science-Abundance/vigil
cd vigil/packages/vigil-core-ai

# Install in dev mode with all dependencies
pip install -e '.[dev,science]'

# Run tests
pytest

# Run tests with science models (requires GPU)
pytest -m science

# Lint
ruff check .

# Type check
mypy src/

# Run examples
python examples/task_based_workflow.py
python examples/domain_specific_models.py
```

## Contributing

Contributions welcome! See [CONTRIBUTING.md](../../CONTRIBUTING.md).

## License

Apache-2.0

## Support

- GitHub Issues: https://github.com/Science-Abundance/vigil/issues
- Documentation: https://github.com/Science-Abundance/vigil
- Discord: [coming soon]

## Acknowledgments

Built with:
- [Anthropic Claude](https://www.anthropic.com/) - General-purpose AI capabilities
- [Vigil](https://github.com/Science-Abundance/vigil) - Reproducible science platform
- [HuggingFace Transformers](https://huggingface.co/transformers/) - Foundation model infrastructure

Foundation models:
- **ESM-2** - Meta AI ([paper](https://www.biorxiv.org/content/10.1101/2022.07.20.500902v1))
- **BioGPT** - Microsoft Research ([paper](https://arxiv.org/abs/2210.10341))
- **ChemBERTa** - DeepChem ([paper](https://arxiv.org/abs/2010.09885))
- **MatBERT** - Materials Project ([paper](https://arxiv.org/abs/2109.15290))
- **Galactica** - Meta AI ([paper](https://arxiv.org/abs/2211.09085))

            
