# NVIDIA HELM Benchmark Framework
This directory contains the HELM (Holistic Evaluation of Language Models) framework for evaluating large language models in medical applications across various healthcare tasks.
## Overview
The HELM framework provides a comprehensive evaluation system for medical AI models, supporting multiple benchmark datasets and evaluation scenarios. It's designed to work with the EvalFactory infrastructure for standardized model evaluation.
## Available Benchmarks
The framework supports the following medical evaluation benchmarks:
| Benchmark | Description | Type |
|-----------|-------------|------|
| **medcalc_bench** | Medical calculation benchmark with patient notes and ground truth answers | Medical QA |
| **medec** | Medical error detection and correction pairs | Error Detection |
| **head_qa** | Biomedical multiple-choice questions for medical knowledge testing | Medical QA |
| **medbullets** | USMLE-style medical questions with explanations | Medical QA |
| **pubmed_qa** | PubMed abstracts with yes/no/maybe questions | Medical QA |
| **ehr_sql** | Natural language to SQL query generation for clinical research | SQL Generation |
| **race_based_med** | Detection of race-based biases in medical LLM outputs | Bias Detection |
| **medhallu** | Classification of factual vs hallucinated medical answers | Hallucination Detection |
## Quick Start
### 1. Environment Setup
First, ensure you have the required environment variables set:
```bash
# Set your API keys
export OPENAI_API_KEY="your-api-key-here"
# Set Python path if necessary
export PYTHONPATH=$PYTHONPATH:$(pwd)
```
### 2. Running Your First Benchmark
#### Method 1: Using `eval-factory` (Recommended)
`eval-factory` is a wrapper that simplifies the HELM benchmark process by handling configuration generation, benchmark execution, and result formatting automatically.
**What `eval-factory` does internally:**
1. **Configuration Processing**: Loads your YAML config and merges it with framework defaults
2. **Dynamic Config Generation**: Creates the necessary HELM model configurations dynamically
3. **Benchmark Execution**: Runs the HELM benchmark with proper parameters
4. **Result Processing**: Formats and saves results in standardized YAML format
Create a configuration file (e.g., `my_test.yml`):
```yaml
config:
  type: medcalc_bench  # Choose from available benchmarks
  output_dir: results/my_test
target:
  api_endpoint:
    url: https://api.openai.com/v1
    model_id: gpt-4
    type: chat
    api_key: OPENAI_API_KEY
```
Run the evaluation:
```bash
eval-factory run_eval \
--output_dir results/my_test \
--run_config my_test.yml
```
**Internal Process Breakdown:**
1. **Config Loading & Validation**:
- Loads your YAML configuration
- Validates against framework schema
- Merges with default parameters from `framework.yml`
2. **Dynamic Model Config Generation**:
- Calls `scripts/generate_dynamic_model_configs.py`
- Creates model-specific configuration files
- Handles provider-specific API endpoints and authentication
3. **HELM Benchmark Execution** (see the sketch after this list):
- Executes `helm-run` with generated configurations
- Downloads and prepares benchmark datasets
- Runs evaluations with specified parameters
- Caches responses for efficiency
4. **Result Processing**:
- Collects raw benchmark results
- Formats into standardized YAML output
- Saves results in your specified output directory
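Conceptually, the execution step reduces to a `helm-run` invocation over dynamically generated configs. The sketch below is illustrative only: the flags passed to `scripts/generate_dynamic_model_configs.py` are hypothetical, and the `helm-run` flags mirror the direct-invocation example shown later.
```bash
# Illustrative sketch of what the wrapper does; not the literal implementation.
# The generate_dynamic_model_configs.py flags shown here are hypothetical.
python scripts/generate_dynamic_model_configs.py \
  --run-config my_test.yml \
  --output-dir results/my_test/model_configs

# Then HELM is invoked with the generated configuration
helm-run \
  --run-entries medcalc_bench:model=openai/gpt-4 \
  --suite my_test \
  --max-eval-instances 10 \
  -o results/my_test

# Finally, HELM's native output is converted into the standardized YAML
# (handled internally by eval-factory)
```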
#### Method 2: Using `helm-run` directly
```bash
helm-run \
--run-entries medcalc_bench:model=openai/gpt-4 \
--suite my-suite \
--max-eval-instances 10 \
--num-train-trials 1 \
-o results/my_test
```
**Comparison: `eval-factory` vs `helm-run`**
| Feature | `eval-factory` | `helm-run` |
|---------|-------------------|------------|
| **Configuration** | Simple YAML config | Complex command-line arguments |
| **Model Setup** | Automatic config generation | Manual model registration required |
| **Provider Support** | Built-in adapter handling | Requires custom model configs |
| **Results Format** | Standardized YAML output | Native HELM format only |
| **Ease of Use** | Beginner-friendly | Advanced users only |
| **Integration** | EvalFactory compatible | HELM-specific |
**Recommendation**: Use `eval-factory` for most use cases, especially when working with EvalFactory. Use `helm-run` only when you need fine-grained control over HELM's native features.
### 3. Understanding the Output
After running a benchmark, you'll find results in your specified output directory:
```
results/my_test/
├── responses/ # Raw model responses
├── cache.db # Cached responses for efficiency
├── instances.jsonl # Evaluation instances
├── results.jsonl # Final evaluation results
├── model_configs/ # Generated HELM model configurations
└── evaluation_config.yaml # Standardized evaluation results
```
**Generated Files Explanation:**
- **`responses/`**: Contains raw API responses from the model for each evaluation instance
- **`cache.db`**: SQLite database caching responses to avoid re-running identical queries
- **`instances.jsonl`**: The evaluation instances (questions, prompts, etc.) used in the benchmark
- **`results.jsonl`**: HELM's native results format with detailed metrics
- **`model_configs/`**: Dynamically generated configuration files for the specific model and provider
- **`evaluation_config.yaml`**: Standardized results in YAML format compatible with EvalFactory
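To sanity-check a finished run, you can inspect these files directly. The commands below are generic; the exact fields in `results.jsonl` depend on the benchmark, and `jq`/`sqlite3` are assumed to be installed.
```bash
# Peek at the first evaluation instance and the first result record
head -n 1 results/my_test/instances.jsonl | jq '.'
head -n 1 results/my_test/results.jsonl | jq '.'

# List the tables in the response cache (SQLite)
sqlite3 results/my_test/cache.db '.tables'

# View the standardized summary consumed by EvalFactory
cat results/my_test/evaluation_config.yaml
```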
**Key Advantage**: `eval-factory` automatically handles the complexity of HELM configuration generation, making it much easier to run benchmarks compared to using `helm-run` directly.
## Step-by-Step Guide
### Step 1: Choose Your Benchmark
Select from the available benchmarks based on your evaluation needs:
- **For general medical QA**: `medcalc_bench`, `head_qa`, `medbullets`
- **For error detection**: `medec`
- **For research applications**: `pubmed_qa`, `ehr_sql`
- **For safety evaluation**: `race_based_med`, `medhallu`
### Step 2: Configure Your Model
Create a YAML configuration file with your model details. Here are examples for different providers:
#### OpenAI Configuration
```yaml
config:
  type: medcalc_bench
  output_dir: results/openai_test
target:
  api_endpoint:
    url: https://api.openai.com/v1
    model_id: gpt-4
    type: chat
    api_key: OPENAI_API_KEY
```
#### NVIDIA AI Foundation Models (build.nvidia.com)
```yaml
config:
  type: pubmed_qa
  output_dir: results/nim_test
target:
  api_endpoint:
    url: https://integrate.api.nvidia.com/v1
    model_id: nvdev/meta/llama-3.3-70b-instruct
    type: chat
    api_key: OPENAI_API_KEY
```
#### NVIDIA Cloud Function (nvcf)
```yaml
config:
  type: ehr_sql
  output_dir: results/nvcf_test
target:
  api_endpoint:
    url: https://api.nvcf.nvidia.com/v2/nvcf/pexec/functions/13e4f873-9d52-4ba9-8194-61baf8dc2bc9/
    model_id: meta-llama/Llama-3.3-70B-Instruct
    type: chat
    api_key: OPENAI_API_KEY
    adapter_config:
      use_nvcf: true
```
### Model Naming Conventions
Different providers use different model ID formats:
- **OpenAI**: `gpt-4`, `gpt-3.5-turbo`, `text-davinci-003`
- **NVIDIA**: `meta-llama/Llama-3.3-70B-Instruct`, `mistral-7b-instruct`
**Note**: NVCF requires a specific function ID in the URL and the `use_nvcf: true` adapter configuration.
### Step 3: Set Up API Credentials
Ensure your API credentials are properly configured:
```bash
# For OpenAI models
export OPENAI_API_KEY="<very-long-sequence>"
# For NVIDIA AI Foundation Models (build.nvidia.com)
export OPENAI_API_KEY="nvapi-..." # Uses same env var as OpenAI
# For NVIDIA Cloud Function (nvcf)
export OPENAI_API_KEY="nvapi-..." # Uses same env var as OpenAI
# Note: NVIDIA services typically use the same OPENAI_API_KEY environment variable
# but with NVIDIA-specific API keys (nvapi-... format)
```
### Step 4: Run the Evaluation
Execute the benchmark using one of the methods above. The framework will:
1. **Load the configuration** and validate parameters
2. **Generate model configs** dynamically for the specified model
3. **Download and prepare** the benchmark dataset
4. **Run evaluations** on the specified number of instances
5. **Cache responses** for efficiency and reproducibility
6. **Generate results** in standardized format
### Step 5: Analyze Results
Review the generated results:
```bash
# View raw results
cat results/my_test/results.jsonl
# Use HELM tools for analysis
helm-summarize --suite my-suite
helm-server # Start web interface to view results
```
## Advanced Configuration
### Customizing Evaluation Parameters
You can customize various parameters in your configuration:
```yaml
config:
  type: medcalc_bench
  output_dir: results/advanced_test
  params:
    limit_samples: 100  # Limit number of evaluation instances
    parallelism: 4      # Number of parallel threads
    extra:
      num_train_trials: 3  # Number of training trials
      max_length: 2048     # Maximum token length
target:
  api_endpoint:
    url: https://api.openai.com/v1
    model_id: gpt-4
    type: chat
    api_key: OPENAI_API_KEY
```
### Advanced Configuration Parameters
The `config.params.extra` section provides additional parameters for fine-tuning evaluations:
#### `data_path`
- **Purpose**: Custom data path for scenarios that support it
- **Supported Scenarios**: `ehrshot`, `clear`, `medalign`, `n2c2_ct_matching`
- **Example**: `"/path/to/custom/data"`
- **Description**: Overrides the default data location for the scenario
#### `num_output_tokens`
- **Purpose**: Maximum number of tokens the model is allowed to generate in its response
- **Scope**: Controls only the output length, not the total sequence length
- **Example**: `1000` limits model responses to 1000 tokens
- **Use Case**: Useful for controlling response length in generation tasks
#### `max_length`
- **Purpose**: Maximum total length for the entire input-output sequence (input + output combined)
- **Scope**: Controls the combined length of both prompt and response
- **Example**: `2048` limits total conversation to 2048 tokens
- **Difference from num_output_tokens**: This controls total sequence length, while num_output_tokens only controls response length
#### `subject`
- **Purpose**: Specific task or subset to evaluate within a scenario
- **Examples by Scenario**:
- **ehrshot**: `"guo_readmission"`, `"new_hypertension"`, `"lab_anemia"`
- **n2c2_ct_matching**: `"ABDOMINAL"`, `"ADVANCED-CAD"`, `"CREATININE"`
- **clear**: `"major_depression"`, `"bipolar_disorder"`, `"substance_use_disorder"`
- **Description**: Filters the evaluation to a specific prediction task or medical condition
#### `condition`
- **Purpose**: Specific condition or scenario variant to evaluate
- **Supported Scenarios**: `clear`
- **Examples**: `"alcohol_dependence"`, `"chronic_pain"`, `"homelessness"`
- **Description**: Used by scenarios like 'clear' to specify medical conditions for evaluation
#### `num_train_trials`
- **Purpose**: Number of training trials for few-shot evaluation
- **Behavior**: Each trial samples a different set of in-context examples
- **Example**: `3` runs the evaluation 3 times with different examples
- **Use Case**: Useful for robust evaluation with multiple few-shot configurations
### Example Configuration with All Parameters
```yaml
config:
  type: ehrshot
  output_dir: results/ehrshot_evaluation
  params:
    limit_samples: 500
    parallelism: 2
    extra:
      data_path: "/custom/path/to/ehrshot/data"
      num_output_tokens: 1000
      max_length: 4096
      subject: "guo_readmission"
      num_train_trials: 3
target:
  api_endpoint:
    url: https://api.openai.com/v1
    model_id: gpt-4
    type: chat
    api_key: OPENAI_API_KEY
```
### Running Multiple Benchmarks
To run multiple benchmarks on the same model:
```bash
# Create separate config files for each benchmark
eval-factory run_eval --output_dir results/medcalc_test --run_config medcalc_config.yml
eval-factory run_eval --output_dir results/medec_test --run_config medec_config.yml
eval-factory run_eval --output_dir results/head_qa_test --run_config head_qa_config.yml
```
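If you maintain one config file per benchmark (differing only in `type` and `output_dir`), a small loop keeps this repeatable. The sketch below assumes files named `<benchmark>_config.yml`, as in the commands above:
```bash
# Run several benchmarks against the same target model
for benchmark in medcalc_bench medec head_qa; do
  eval-factory run_eval \
    --output_dir "results/${benchmark}_test" \
    --run_config "${benchmark}_config.yml"
done
```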
### Dry Run Mode
Test your configuration without running the full evaluation:
```bash
eval-factory run_eval \
--output_dir results/test \
--run_config my_config.yml \
--dry_run
```
This will show you the rendered configuration and command without executing the benchmark.
## Troubleshooting
### Common Issues
1. **API Key Errors**: Ensure your API keys are properly set and valid
2. **Model Not Found**: Verify the model ID and endpoint URL are correct (see the endpoint check below)
3. **Memory Issues**: Reduce `parallelism` or `limit_samples` for large models
4. **Timeout Errors**: Increase timeout settings or reduce batch sizes
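For "Model Not Found" errors against OpenAI-compatible endpoints (including build.nvidia.com), you can usually confirm the `model_id` by listing the models the endpoint serves. This assumes the provider exposes the standard `/v1/models` route and that `jq` is installed:
```bash
# List models served by an OpenAI-compatible endpoint and filter for your model
curl -s -H "Authorization: Bearer $OPENAI_API_KEY" \
  https://integrate.api.nvidia.com/v1/models | jq -r '.data[].id' | grep -i llama
```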
### Debug Mode
Enable debug logging for detailed information:
```bash
eval-factory --debug run_eval \
--output_dir results/debug_test \
--run_config debug_config.yml
```
### Checking Available Tasks
List all available evaluation types:
```bash
eval-factory ls
```
## Examples from commands.sh
Here are some practical examples from the project:
### Basic Medical Calculation Benchmark
```bash
eval-factory run_eval \
--output_dir test_cases/test_case_nim_llama_3_1_8b_medcalc_bench \
--run_config test_cases/test_case_nim_llama_3_1_8b_medcalc_bench.yml
```
### Medical Error Detection
```bash
eval-factory run_eval \
--output_dir test_cases/test_case_nim_llama_3_1_8b_medec \
--run_config test_cases/test_case_nim_llama_3_1_8b_medec.yml
```
### Biomedical QA
```bash
eval-factory run_eval \
--output_dir test_cases/test_case_nim_llama_3_1_8b_head_qa \
--run_config test_cases/test_case_nim_llama_3_1_8b_head_qa.yml
```
## Running Evaluations with Judges
The HELM framework supports multi-judge evaluations for scenarios that require human-like assessment of model outputs. This is particularly useful for tasks like medical treatment plan generation, where multiple AI judges can provide more robust and reliable evaluations.
### Overview of Multi-Judge Setup
The framework supports three types of judges:
- **GPT Judge**: Uses OpenAI GPT models for evaluation
- **Llama Judge**: Uses Llama models for evaluation
- **Claude Judge**: Uses Anthropic Claude models for evaluation
Each judge can use different API keys, providing better rate limiting, cost tracking, and flexibility.
### Authentication Systems
The framework supports **two authentication methods** for judge models:
#### 1. Direct Judge API Keys (Recommended for Production)
Set individual API keys for each judge type:
```bash
# API key for the main model being evaluated
export OPENAI_API_KEY="your-main-model-api-key"
# API keys for the three judges (annotators)
export GPT_JUDGE_API_KEY="your-gpt-judge-api-key"
export LLAMA_JUDGE_API_KEY="your-llama-judge-api-key"
export CLAUDE_JUDGE_API_KEY="your-claude-judge-api-key"
```
#### 2. OAuth 2.0 Client Credentials Flow (Advanced)
Use NVIDIA's OAuth system for automatic token management:
```bash
# OAuth 2.0 credentials for automatic token generation
export OPENAI_CLIENT_ID="nvssa-prd-your-client-id"
export OPENAI_CLIENT_SECRET="ssap-your-client-secret"
export OPENAI_TOKEN_URL="https://prod.api.nvidia.com/oauth/api/v1/ssa/default/token"
export OPENAI_SCOPE="awsanthropic-readwrite"
# Main API key (still required)
export OPENAI_API_KEY="your-main-model-api-key"
```
### How Authentication Priority Works
The system follows this **exact priority order**:
1. **First Priority**: Judge-specific environment variables
- `GPT_JUDGE_API_KEY` for GPT models
- `LLAMA_JUDGE_API_KEY` for Llama models
- `CLAUDE_JUDGE_API_KEY` for Claude models
2. **Second Priority**: Fallback to main API key
- If judge keys aren't set, automatically uses `OPENAI_API_KEY`
- System logs: "GPT_JUDGE_API_KEY is not set, setting to OPENAI_API_KEY"
3. **Third Priority**: Credentials configuration
- Falls back to `credentials.conf` or deployment-specific keys
**Important**: OAuth-generated tokens are **NOT** automatically used for judge API keys. The OAuth system is separate and serves different purposes.
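In shell terms, the first two priority levels behave roughly like parameter-expansion fallback. This is a conceptual sketch, not the framework's actual code:
```bash
# Conceptual sketch of the judge-key fallback
GPT_KEY="${GPT_JUDGE_API_KEY:-$OPENAI_API_KEY}"
LLAMA_KEY="${LLAMA_JUDGE_API_KEY:-$OPENAI_API_KEY}"
CLAUDE_KEY="${CLAUDE_JUDGE_API_KEY:-$OPENAI_API_KEY}"
```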
### OAuth 2.0 System Details
The OAuth system provides:
- **Automatic Token Creation**: Generates access tokens using client credentials
- **Token Caching**: Stores tokens in memory and on disk (`{service_name}_oauth_token.json`)
- **Automatic Refresh**: Refreshes expired tokens automatically
- **Scope Control**: Different permissions per service:
- `azureopenai-readwrite` for GPT services
- `awsanthropic-readwrite` for Claude services
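Under the hood this is a standard OAuth 2.0 client-credentials exchange against the configured token URL. The request below follows the generic OAuth 2.0 spec; the NVIDIA service may require additional parameters, so treat it as a sketch:
```bash
# Generic OAuth 2.0 client-credentials token request (sketch)
# Prints the token lifetime without echoing the token itself
curl -s -X POST "$OPENAI_TOKEN_URL" \
  -d grant_type=client_credentials \
  -d client_id="$OPENAI_CLIENT_ID" \
  -d client_secret="$OPENAI_CLIENT_SECRET" \
  -d scope="$OPENAI_SCOPE" | jq '.expires_in'
```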
**When to Use OAuth:**
- Better security (client credentials vs. long-lived API keys)
- Automatic token management
- Centralized billing and rate limiting
- Enterprise-grade authentication
**When to Use Direct API Keys:**
- Simpler setup
- Direct control over each judge's API key
- Different providers for different judges
- Testing and development scenarios
### Security Features
**API Key Protection**: The system automatically sanitizes error messages to prevent API keys from appearing in logs. Any API key patterns (like `nvapi-...`, `sk-...`, `hf_...`) are automatically replaced with `[API_KEY_REDACTED]` before logging.
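The redaction is conceptually a regex substitution over the known key prefixes. The pattern below only illustrates the idea (the file name `run.log` is a placeholder); it is not the framework's exact implementation:
```bash
# Illustration of the redaction idea (not the framework's exact pattern)
sed -E 's/(nvapi-|sk-|hf_)[A-Za-z0-9_-]+/[API_KEY_REDACTED]/g' run.log
```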
### Configuration for Multi-Judge Evaluations
#### Basic Configuration (Direct API Keys)
```yaml
config:
  type: mtsamples_replicate  # Example scenario that uses judges
  output_dir: results/multi_judge_test
  params:
    limit_samples: 10
    parallelism: 1
    extra:
      num_train_trials: 1
      max_length: 2048
      # Different API keys for each judge
      gpt_judge_api_key: GPT_JUDGE_API_KEY
      llama_judge_api_key: LLAMA_JUDGE_API_KEY
      claude_judge_api_key: CLAUDE_JUDGE_API_KEY
target:
  api_endpoint:
    url: https://integrate.api.nvidia.com/v1
    model_id: nvdev/meta/llama-3.1-8b-instruct
    type: chat
    api_key: OPENAI_API_KEY
```
#### Advanced Configuration (OAuth + Direct Keys)
```yaml
config:
  type: mtsamples_replicate
  output_dir: results/oauth_multi_judge_test
  params:
    limit_samples: 50
    parallelism: 2
    extra:
      num_train_trials: 3
      max_length: 2048
      # Mix OAuth (automatic) and direct keys
      gpt_judge_api_key: GPT_JUDGE_API_KEY  # Direct key for GPT
      # Llama and Claude will use OAuth-generated tokens
target:
  api_endpoint:
    url: https://integrate.api.nvidia.com/v1
    model_id: nvdev/meta/llama-3.3-70b-instruct
    type: chat
    api_key: OPENAI_API_KEY
```
### Supported Scenarios with Judges
Currently, the following scenarios support multi-judge evaluations:
| Scenario | Description | Judge Types Used |
|----------|-------------|------------------|
| **mtsamples_replicate** | Generate treatment plans based on clinical notes | GPT, Llama, Claude |
| **mtsamples_procedures** | Document and extract information about medical procedures | GPT, Llama, Claude |
| **aci_bench** | Extract and structure information from patient-doctor conversations | GPT, Llama, Claude |
| **medication_qa** | Answer consumer medication-related questions | GPT, Llama, Claude |
| **medi_qa** | Retrieve and rank answers based on medical question understanding | GPT, Llama, Claude |
| **med_dialog** | Generate summaries of doctor-patient conversations | GPT, Llama, Claude |
### Complete Setup Guide
#### Method 1: Direct API Keys (Simplest)
```bash
# 1. Set up environment variables
export OPENAI_API_KEY="nvapi-your-main-api-key"
export GPT_JUDGE_API_KEY="nvapi-gpt-judge-api-key"
export LLAMA_JUDGE_API_KEY="nvapi-llama-judge-api-key"
export CLAUDE_JUDGE_API_KEY="nvapi-claude-judge-api-key"
# 2. Run the evaluation
eval-factory run_eval \
--output_dir results/multi_judge_test \
--run_config multi_judge_config.yml
```
#### Method 2: OAuth 2.0 System (Enterprise)
```bash
# 1. Set up OAuth credentials
export OPENAI_CLIENT_ID="nvssa-prd-your-client-id"
export OPENAI_CLIENT_SECRET="ssap-your-client-secret"
export OPENAI_TOKEN_URL="https://prod.api.nvidia.com/oauth/api/v1/ssa/default/token"
export OPENAI_SCOPE="awsanthropic-readwrite"
# 2. Set main API key (still required)
export OPENAI_API_KEY="nvapi-your-main-api-key"
# 3. Run the evaluation
eval-factory run_eval \
--output_dir results/oauth_multi_judge_test \
--run_config oauth_multi_judge_config.yml
```
#### Method 3: Hybrid Approach (Flexible)
```bash
# 1. Set OAuth credentials for automatic token generation
export OPENAI_CLIENT_ID="nvssa-prd-your-client-id"
export OPENAI_CLIENT_SECRET="ssap-your-client-secret"
# 2. Override specific judge with direct API key
export GPT_JUDGE_API_KEY="nvapi-gpt-specific-key"
# 3. Set main API key
export OPENAI_API_KEY="nvapi-your-main-api-key"
# 4. Run the evaluation
eval-factory run_eval \
--output_dir results/hybrid_multi_judge_test \
--run_config hybrid_multi_judge_config.yml
```
#### Method 4: Using `helm-run` directly
```bash
# Set up environment variables (any of the above methods)
export OPENAI_API_KEY="nvapi-your-main-api-key"
export GPT_JUDGE_API_KEY="nvapi-gpt-judge-api-key"
export LLAMA_JUDGE_API_KEY="nvapi-llama-judge-api-key"
export CLAUDE_JUDGE_API_KEY="nvapi-claude-judge-api-key"
# Run the evaluation
helm-run \
--run-entries mtsamples_replicate:model=openai/gpt-4 \
--suite my-suite \
--max-eval-instances 10 \
--num-train-trials 1 \
-o results/multi_judge_test
```
### Advanced Judge Configuration
#### Using Different API Keys for Each Judge
You can use completely different API keys for each judge:
```bash
export GPT_JUDGE_API_KEY="nvapi-gpt-judge-1"
export LLAMA_JUDGE_API_KEY="nvapi-llama-judge-2"
export CLAUDE_JUDGE_API_KEY="nvapi-claude-judge-3"
```
#### Using the Same API Key for All Judges
If you want to use the same API key for all judges:
```bash
export GPT_JUDGE_API_KEY="nvapi-shared-key"
export LLAMA_JUDGE_API_KEY="nvapi-shared-key"
export CLAUDE_JUDGE_API_KEY="nvapi-shared-key"
```
#### OAuth Token Management
**Check OAuth Token Status:**
```bash
# Look for OAuth token files
ls -la *_oauth_token.json
# Check token expiration
cat openai_oauth_token.json | jq '.expires_at'
```
**Force Token Refresh:**
```bash
# The system automatically refreshes expired tokens
# You can also manually trigger refresh by deleting token files
rm *_oauth_token.json
```
**OAuth Scopes for Different Services:**
```bash
# For GPT services
export OPENAI_SCOPE="azureopenai-readwrite"
# For Claude services
export OPENAI_SCOPE="awsanthropic-readwrite"
# For general access
export OPENAI_SCOPE="awsanthropic-readwrite"
```
### Example Multi-Judge Evaluation
Here's a complete example for running a multi-judge evaluation:
```bash
# 1. Create configuration file (multi_judge_config.yml)
cat > multi_judge_config.yml << EOF
config:
  type: mtsamples_replicate
  output_dir: results/multi_judge_test
  params:
    limit_samples: 50
    parallelism: 2
    extra:
      num_train_trials: 3
      max_length: 2048
      gpt_judge_api_key: GPT_JUDGE_API_KEY
      llama_judge_api_key: LLAMA_JUDGE_API_KEY
      claude_judge_api_key: CLAUDE_JUDGE_API_KEY
target:
  api_endpoint:
    url: https://integrate.api.nvidia.com/v1
    model_id: nvdev/meta/llama-3.3-70b-instruct
    type: chat
    api_key: OPENAI_API_KEY
EOF
# 2. Set environment variables
export OPENAI_API_KEY="nvapi-main-model-key"
export GPT_JUDGE_API_KEY="nvapi-gpt-judge-key"
export LLAMA_JUDGE_API_KEY="nvapi-llama-judge-key"
export CLAUDE_JUDGE_API_KEY="nvapi-claude-judge-key"
# 3. Run the evaluation
eval-factory run_eval \
--output_dir results/multi_judge_test \
--run_config multi_judge_config.yml
```
### Troubleshooting Multi-Judge Evaluations
#### Check Environment Variables
Verify your environment variables are set correctly:
```bash
echo "Main API Key: $OPENAI_API_KEY"
echo "GPT Judge: $GPT_JUDGE_API_KEY"
echo "Llama Judge: $LLAMA_JUDGE_API_KEY"
echo "Claude Judge: $CLAUDE_JUDGE_API_KEY"
```
#### Check OAuth Configuration
Verify OAuth credentials are properly set:
```bash
echo "Client ID: $OPENAI_CLIENT_ID"
echo "Client Secret: $OPENAI_CLIENT_SECRET"
echo "Token URL: $OPENAI_TOKEN_URL"
echo "Scope: $OPENAI_SCOPE"
```
#### Debug Mode
Enable debug logging to see which API keys are being used:
```bash
eval-factory --debug run_eval \
--output_dir results/debug_multi_judge \
--run_config multi_judge_config.yml
```
#### Common Issues and Solutions
**Issue: "GPT_JUDGE_API_KEY is not set, setting to OPENAI_API_KEY"**
- **Cause**: Judge API key not set, system falling back to main API key
- **Solution**: Set the specific judge API key or accept the fallback
**Issue: "Missing environment variables for openai token"**
- **Cause**: OAuth credentials not properly configured
- **Solution**: Set `OPENAI_CLIENT_ID` and `OPENAI_CLIENT_SECRET`
**Issue: "Error creating openai OAuth token"**
- **Cause**: Invalid credentials or network issues
- **Solution**: Verify credentials and check network connectivity
**Issue: API key appears in logs**
- **Cause**: This should not happen with the security fix
- **Solution**: Check if you're using the latest version with API key sanitization
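A quick way to verify that no raw keys leaked into your logs, assuming your keys use the `nvapi-`, `sk-`, or `hf_` prefixes and your logs live under `logs/`:
```bash
# Should print nothing if sanitization is working
grep -rE "(nvapi-|sk-|hf_)[A-Za-z0-9_-]{8,}" logs/ || echo "No raw API keys found in logs"
```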
#### Log Analysis
**Look for these log patterns:**
```bash
# Judge API key usage
grep "Using.*judge API key" logs/*.log
# OAuth token creation
grep "Creating new.*OAuth token" logs/*.log
# API key fallbacks
grep "is not set, setting to" logs/*.log
# Authentication errors
grep "Authentication error detected" logs/*.log
```
#### Performance Monitoring
**Check API Key Usage:**
```bash
# Monitor which API keys are being used
grep "Using.*API key.*ends with" logs/*.log
# Check for rate limiting
grep "rate limit\|429" logs/*.log
# Monitor OAuth token refresh
grep "token expired\|refreshing" logs/*.log
```
Look for messages like:
```
Using GPT judge API key from environment variable for model: nvidia/gpt4o-abc123
Using Llama judge API key from environment variable for model: nvdev/meta/llama-3.3-70b-instruct-def456
Using Claude judge API key from environment variable for model: nvidia/claude-3-7-sonnet-20250219-ghi789
```
#### Common Issues
1. **Environment variables not loaded**: Make sure your environment variables are set before running the command
2. **API key format**: Ensure your API keys start with `nvapi-` for NVIDIA services
3. **Configuration file**: Verify your YAML configuration file references the correct environment variable names
4. **Judge model availability**: Ensure the judge models are available through your API endpoints
### Benefits of Multi-Judge Evaluations
- **Better rate limiting**: Each judge can have its own rate limits
- **Cost tracking**: Track costs separately for each judge
- **Flexibility**: Use different API keys for different purposes
- **Security**: Isolate API keys for different components
- **Robustness**: Multiple judges provide more reliable evaluations
- **Diversity**: Different judge models may catch different types of errors
## Integration with EvalFactory
This framework is designed to work seamlessly with the EvalFactory infrastructure:
- **Standardized Output**: Results are generated in a format compatible with EvalFactory
- **Configuration Management**: Uses YAML-based configuration for easy integration
- **Caching**: Built-in caching for efficient re-runs and reproducibility
- **Extensibility**: Easy to add new benchmarks and evaluation metrics
## Contributing
To add new benchmarks or modify existing ones:
1. Update `framework.yml` with new benchmark definitions
2. Implement the benchmark logic in the appropriate adapter
3. Add test cases and documentation
4. Update this README with new benchmark information
## References
- [HELM Framework](https://github.com/stanford-crfm/helm)
- [EvalFactory Documentation](https://github.com/nvidia/eval-factory)
- [Medical AI Evaluation Papers](https://arxiv.org/abs/2401.00000)
For more detailed information about specific benchmarks and their implementations, refer to the individual benchmark documentation and the main HELM repository.
# Holistic Evaluation of Language Models (HELM)
<a href="https://github.com/stanford-crfm/helm">
<img alt="GitHub Repo stars" src="https://img.shields.io/github/stars/stanford-crfm/helm">
</a>
<a href="https://github.com/stanford-crfm/helm/graphs/contributors">
<img alt="GitHub contributors" src="https://img.shields.io/github/contributors/stanford-crfm/helm">
</a>
<a href="https://github.com/stanford-crfm/helm/actions/workflows/test.yml?query=branch%3Amain">
<img alt="GitHub Actions Workflow Status" src="https://img.shields.io/github/actions/workflow/status/stanford-crfm/helm/test.yml">
</a>
<a href="https://crfm-helm.readthedocs.io/en/latest/">
<img alt="Documentation Status" src="https://readthedocs.org/projects/helm/badge/?version=latest">
</a>
<a href="https://github.com/stanford-crfm/helm/blob/main/LICENSE">
<img alt="License" src="https://img.shields.io/github/license/stanford-crfm/helm?color=blue" />
</a>
<a href="https://pypi.org/project/crfm-helm/">
<img alt="PyPI" src="https://img.shields.io/pypi/v/crfm-helm?color=blue" />
</a>
[comment]: <> (When using the img tag, which allows us to specify size, src has to be a URL.)
<img src="https://github.com/stanford-crfm/helm/raw/v0.5.4/helm-frontend/src/assets/helm-logo.png" alt="HELM logo" width="480"/>
**Holistic Evaluation of Language Models (HELM)** is an open source Python framework created by the [Center for Research on Foundation Models (CRFM) at Stanford](https://crfm.stanford.edu/) for holistic, reproducible and transparent evaluation of foundation models, including large language models (LLMs) and multimodal models. This framework includes the following features:
- Datasets and benchmarks in a standardized format (e.g. MMLU-Pro, GPQA, IFEval, WildBench)
- Models from various providers accessible through a unified interface (e.g. OpenAI models, Anthropic Claude, Google Gemini)
- Metrics for measuring various aspects beyond accuracy (e.g. efficiency, bias, toxicity)
- Web UI for inspecting individual prompts and responses
- Web leaderboard for comparing results across models and benchmarks
## Documentation
Please refer to [the documentation on Read the Docs](https://crfm-helm.readthedocs.io/) for instructions on how to install and run HELM.
## Quick Start
<!--quick-start-begin-->
Install the package from PyPI:
```sh
pip install crfm-helm
```
Run the following in your shell:
```sh
# Run benchmark
helm-run --run-entries mmlu:subject=philosophy,model=openai/gpt2 --suite my-suite --max-eval-instances 10
# Summarize benchmark results
helm-summarize --suite my-suite
# Start a web server to display benchmark results
helm-server --suite my-suite
```
Then go to http://localhost:8000/ in your browser.
<!--quick-start-end-->
## Attribution
This NVIDIA fork of HELM is based on the original [Stanford CRFM HELM framework](https://github.com/stanford-crfm/helm). The original framework was created by the [Center for Research on Foundation Models (CRFM) at Stanford](https://crfm.stanford.edu/) and is licensed under the Apache License 2.0.
## Leaderboards
We maintain official leaderboards with results from evaluating recent models on notable benchmarks using this framework. Our current flagship leaderboards are:
- [HELM Capabilities](https://crfm.stanford.edu/helm/capabilities/latest/)
- [HELM Safety](https://crfm.stanford.edu/helm/safety/latest/)
- [Holistic Evaluation of Vision-Language Models (VHELM)](https://crfm.stanford.edu/helm/vhelm/latest/)
We also maintain leaderboards for a diverse range of domains (e.g. medicine, finance) and aspects (e.g. multi-linguality, world knowledge, regulation compliance). Refer to the [HELM website](https://crfm.stanford.edu/helm/) for a full list of leaderboards.
## Papers
The HELM framework has been used to evaluate models in the following papers.
- **Holistic Evaluation of Language Models** - [paper](https://openreview.net/forum?id=iO4LZibEqW), [leaderboard](https://crfm.stanford.edu/helm/classic/latest/)
- **Holistic Evaluation of Vision-Language Models (VHELM)** - [paper](https://arxiv.org/abs/2410.07112), [leaderboard](https://crfm.stanford.edu/helm/vhelm/latest/), [documentation](https://crfm-helm.readthedocs.io/en/latest/vhelm/)
- **Holistic Evaluation of Text-To-Image Models (HEIM)** - [paper](https://arxiv.org/abs/2311.04287), [leaderboard](https://crfm.stanford.edu/helm/heim/latest/), [documentation](https://crfm-helm.readthedocs.io/en/latest/heim/)
- **Image2Struct: Benchmarking Structure Extraction for Vision-Language Models** - [paper](https://arxiv.org/abs/2410.22456)
- **Enterprise Benchmarks for Large Language Model Evaluation** - [paper](https://arxiv.org/abs/2410.12857), [documentation](https://crfm-helm.readthedocs.io/en/latest/enterprise_benchmark/)
- **The Mighty ToRR: A Benchmark for Table Reasoning and Robustness** - [paper](https://arxiv.org/abs/2502.19412), [leaderboard](https://crfm.stanford.edu/helm/torr/latest/)
- **Reliable and Efficient Amortized Model-based Evaluation** - [paper](https://arxiv.org/abs/2503.13335), [documentation](https://crfm-helm.readthedocs.io/en/latest/reeval/)
- **MedHELM** - paper in progress, [leaderboard](https://crfm.stanford.edu/helm/medhelm/latest/), [documentation](https://crfm-helm.readthedocs.io/en/latest/reeval/)
The HELM framework can be used to reproduce the published model evaluation results from these papers. To get started, refer to the documentation links above for the corresponding paper, or the [main Reproducing Leaderboards documentation](https://crfm-helm.readthedocs.io/en/latest/reproducing_leaderboards/).
## Citation
If you use this software in your research, please cite the [Holistic Evaluation of Language Models paper](https://openreview.net/forum?id=iO4LZibEqW) as below.
```bibtex
@article{
liang2023holistic,
title={Holistic Evaluation of Language Models},
author={Percy Liang and Rishi Bommasani and Tony Lee and Dimitris Tsipras and Dilara Soylu and Michihiro Yasunaga and Yian Zhang and Deepak Narayanan and Yuhuai Wu and Ananya Kumar and Benjamin Newman and Binhang Yuan and Bobby Yan and Ce Zhang and Christian Alexander Cosgrove and Christopher D Manning and Christopher Re and Diana Acosta-Navas and Drew Arad Hudson and Eric Zelikman and Esin Durmus and Faisal Ladhak and Frieda Rong and Hongyu Ren and Huaxiu Yao and Jue WANG and Keshav Santhanam and Laurel Orr and Lucia Zheng and Mert Yuksekgonul and Mirac Suzgun and Nathan Kim and Neel Guha and Niladri S. Chatterji and Omar Khattab and Peter Henderson and Qian Huang and Ryan Andrew Chi and Sang Michael Xie and Shibani Santurkar and Surya Ganguli and Tatsunori Hashimoto and Thomas Icard and Tianyi Zhang and Vishrav Chaudhary and William Wang and Xuechen Li and Yifan Mai and Yuhui Zhang and Yuta Koreeda},
journal={Transactions on Machine Learning Research},
issn={2835-8856},
year={2023},
url={https://openreview.net/forum?id=iO4LZibEqW},
note={Featured Certification, Expert Certification}
}
```
# Attribution and Acknowledgments
## Original Project
This project is a fork of the **Holistic Evaluation of Language Models (HELM)** framework created by the Center for Research on Foundation Models (CRFM) at Stanford.
- **Original Repository**: [https://github.com/stanford-crfm/helm](https://github.com/stanford-crfm/helm)
- **Original Documentation**: [https://crfm.stanford.edu/helm](https://crfm.stanford.edu/helm)
- **Original Paper**: [Holistic Evaluation of Language Models](https://openreview.net/forum?id=iO4LZibEqW)
- **Original Authors**: Stanford CRFM Team
- **Original License**: Apache License 2.0
## Citation
If you use this software in your research, please cite the original HELM paper:
```bibtex
@article{liang2023holistic,
title={Holistic Evaluation of Language Models},
author={Percy Liang and Rishi Bommasani and Tony Lee and Dimitris Tsipras and Dilara Soylu and Michihiro Yasunaga and Yian Zhang and Deepak Narayanan and Yuhuai Wu and Ananya Kumar and Benjamin Newman and Binhang Yuan and Bobby Yan and Ce Zhang and Christian Alexander Cosgrove and Christopher D Manning and Christopher Re and Diana Acosta-Navas and Drew Arad Hudson and Eric Zelikman and Esin Durmus and Faisal Ladhak and Frieda Rong and Hongyu Ren and Huaxiu Yao and Jue WANG and Keshav Santhanam and Laurel Orr and Lucia Zheng and Mert Yuksekgonul and Mirac Suzgun and Nathan Kim and Neel Guha and Niladri S. Chatterji and Omar Khattab and Peter Henderson and Qian Huang and Ryan Andrew Chi and Sang Michael Xie and Shibani Santurkar and Surya Ganguli and Tatsunori Hashimoto and Thomas Icard and Tianyi Zhang and Vishrav Chaudhary and William Wang and Xuechen Li and Yifan Mai and Yuhui Zhang and Yuta Koreeda},
journal={Transactions on Machine Learning Research},
issn={2835-8856},
year={2023},
url={https://openreview.net/forum?id=iO4LZibEqW},
note={Featured Certification, Expert Certification}
}
```
## Fork Information
- **Fork Maintainer**: NVIDIA
- **Fork Purpose**: Medical AI evaluation and EvalFactory integration
## License
This fork is released under the same Apache License 2.0 as the original project, in accordance with the original license terms.
Raw data
{
"_id": null,
"home_page": "https://github.com/stanford-crfm/helm",
"name": "nvidia-crfm-helm",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": null,
"keywords": "language models benchmarking nvidia fork stanford crfm helm",
"author": "Stanford CRFM, NVIDIA",
"author_email": "contact-crfm@stanford.edu",
"download_url": null,
"platform": null,
"description": "# NVIDIA HELM Benchmark Framework\n\nThis directory contains the HELM (Holistic Evaluation of Language Models) framework for evaluating large language models in medical applications across various healthcare tasks.\n\n## Overview\n\nThe HELM framework provides a comprehensive evaluation system for medical AI models, supporting multiple benchmark datasets and evaluation scenarios. It's designed to work with the EvalFactory infrastructure for standardized model evaluation.\n\n## Available Benchmarks\n\nThe framework supports the following medical evaluation benchmarks:\n\n| Benchmark | Description | Type |\n|-----------|-------------|------|\n| **medcalc_bench** | Medical calculation benchmark with patient notes and ground truth answers | Medical QA |\n| **medec** | Medical error detection and correction pairs | Error Detection |\n| **head_qa** | Biomedical multiple-choice questions for medical knowledge testing | Medical QA |\n| **medbullets** | USMLE-style medical questions with explanations | Medical QA |\n| **pubmed_qa** | PubMed abstracts with yes/no/maybe questions | Medical QA |\n| **ehr_sql** | Natural language to SQL query generation for clinical research | SQL Generation |\n| **race_based_med** | Detection of race-based biases in medical LLM outputs | Bias Detection |\n| **medhallu** | Classification of factual vs hallucinated medical answers | Hallucination Detection |\n\n## Quick Start\n\n### 1. Environment Setup\n\nFirst, ensure you have the required environment variables set:\n\n```bash\n# Set your API keys\nexport OPENAI_API_KEY=\"your-api-key-here\"\n\n# Set Python path if necessary\nexport PYTHONPATH=$PYTHONPATH:$.\n```\n\n### 2. Running Your First Benchmark\n\n#### Method 1: Using `eval-factory` (Recommended)\n\n`eval-factory` is a wrapper that simplifies the HELM benchmark process by handling configuration generation, benchmark execution, and result formatting automatically.\n\n**What `eval-factory` does internally:**\n\n1. **Configuration Processing**: Loads your YAML config and merges it with framework defaults\n2. **Dynamic Config Generation**: Creates the necessary HELM model configurations dynamically\n3. **Benchmark Execution**: Runs the HELM benchmark with proper parameters\n4. **Result Processing**: Formats and saves results in standardized YAML format\n\nCreate a configuration file (e.g., `my_test.yml`):\n\n```yaml\nconfig:\n type: medcalc_bench # Choose from available benchmarks\n output_dir: results/my_test\ntarget:\n api_endpoint:\n url: https://api.openai.com/v1\n model_id: gpt-4\n type: chat\n api_key: OPENAI_API_KEY\n```\n\nRun the evaluation:\n\n```bash\neval-factory run_eval \\\n --output_dir results/my_test \\\n --run_config my_test.yml\n```\n\n**Internal Process Breakdown:**\n\n1. **Config Loading & Validation**: \n - Loads your YAML configuration\n - Validates against framework schema\n - Merges with default parameters from `framework.yml`\n\n2. **Dynamic Model Config Generation**:\n - Calls `scripts/generate_dynamic_model_configs.py`\n - Creates model-specific configuration files\n - Handles provider-specific API endpoints and authentication\n\n3. **HELM Benchmark Execution**:\n - Executes `helm-run` with generated configurations\n - Downloads and prepares benchmark datasets\n - Runs evaluations with specified parameters\n - Caches responses for efficiency\n\n4. 
**Result Processing**:\n - Collects raw benchmark results\n - Formats into standardized YAML output\n - Saves results in your specified output directory\n\n#### Method 2: Using `helm-run` directly\n\n```bash\nhelm-run \\\n --run-entries medcalc_bench:model=openai/gpt-4 \\\n --suite my-suite \\\n --max-eval-instances 10 \\\n --num-train-trials 1 \\\n -o results/my_test\n```\n\n**Comparison: `eval-factory` vs `helm-run`**\n\n| Feature | `eval-factory` | `helm-run` |\n|---------|-------------------|------------|\n| **Configuration** | Simple YAML config | Complex command-line arguments |\n| **Model Setup** | Automatic config generation | Manual model registration required |\n| **Provider Support** | Built-in adapter handling | Requires custom model configs |\n| **Results Format** | Standardized YAML output | Native HELM format only |\n| **Ease of Use** | Beginner-friendly | Advanced users only |\n| **Integration** | EvalFactory compatible | HELM-specific |\n\n**Recommendation**: Use `eval-factory` for most use cases, especially when working with EvalFactory. Use `helm-run` only when you need fine-grained control over HELM's native features.\n\n### 3. Understanding the Output\n\nAfter running a benchmark, you'll find results in your specified output directory:\n\n```\nresults/my_test/\n\u251c\u2500\u2500 responses/ # Raw model responses\n\u251c\u2500\u2500 cache.db # Cached responses for efficiency\n\u251c\u2500\u2500 instances.jsonl # Evaluation instances\n\u251c\u2500\u2500 results.jsonl # Final evaluation results\n\u251c\u2500\u2500 model_configs/ # Generated HELM model configurations\n\u2514\u2500\u2500 evaluation_config.yaml # Standardized evaluation results\n```\n\n**Generated Files Explanation:**\n\n- **`responses/`**: Contains raw API responses from the model for each evaluation instance\n- **`cache.db`**: SQLite database caching responses to avoid re-running identical queries\n- **`instances.jsonl`**: The evaluation instances (questions, prompts, etc.) used in the benchmark\n- **`results.jsonl`**: HELM's native results format with detailed metrics\n- **`model_configs/`**: Dynamically generated configuration files for the specific model and provider\n- **`evaluation_config.yaml`**: Standardized results in YAML format compatible with EvalFactory\n\n**Key Advantage**: `eval-factory` automatically handles the complexity of HELM configuration generation, making it much easier to run benchmarks compared to using `helm-run` directly.\n\n## Step-by-Step Guide\n\n### Step 1: Choose Your Benchmark\n\nSelect from the available benchmarks based on your evaluation needs:\n\n- **For general medical QA**: `medcalc_bench`, `head_qa`, `medbullets`\n- **For error detection**: `medec`\n- **For research applications**: `pubmed_qa`, `ehr_sql`\n- **For safety evaluation**: `race_based_med`, `medhallu`\n\n### Step 2: Configure Your Model\n\nCreate a YAML configuration file with your model details. 
Here are examples for different providers:\n\n#### OpenAI Configuration\n```yaml\nconfig:\n type: medcalc_bench\n output_dir: results/openai_test\ntarget:\n api_endpoint:\n url: https://api.openai.com/v1\n model_id: gpt-4\n type: chat\n api_key: OPENAI_API_KEY\n```\n\n#### NVIDIA AI Foundation Models (build.nvidia.com)\n```yaml\nconfig:\n type: pubmed_qa\n output_dir: results/nim_test\ntarget:\n api_endpoint:\n url: https://integrate.api.nvidia.com/v1\n model_id: nvdev/meta/llama-3.3-70b-instruct\n type: chat\n api_key: OPENAI_API_KEY\n```\n\n#### NVIDIA Cloud Function (nvcf)\n```yaml\nconfig:\n type: ehr_sql\n output_dir: results/nvcf_test\ntarget:\n api_endpoint:\n url: https://api.nvcf.nvidia.com/v2/nvcf/pexec/functions/13e4f873-9d52-4ba9-8194-61baf8dc2bc9/\n model_id: meta-llama/Llama-3.3-70B-Instruct\n type: chat\n api_key: OPENAI_API_KEY\n adapter_config:\n use_nvcf: true\n```\n\n### Model Naming Conventions\n\nDifferent providers use different model ID formats:\n\n- **OpenAI**: `gpt-4`, `gpt-3.5-turbo`, `text-davinci-003`\n- **NVIDIA**: `meta-llama/Llama-3.3-70B-Instruct`, `mistral-7b-instruct`\n\n**Note**: NVCF requires a specific function ID in the URL and the `use_nvcf: true` adapter configuration.\n\n### Step 3: Set Up API Credentials\n\nEnsure your API credentials are properly configured:\n\n```bash\n# For OpenAI models\nexport OPENAI_API_KEY=\"<very-long-sequence>\"\n\n# For NVIDIA AI Foundation Models (build.nvidia.com)\nexport OPENAI_API_KEY=\"nvapi-...\" # Uses same env var as OpenAI\n\n# For NVIDIA Cloud Function (nvcf)\nexport OPENAI_API_KEY=\"nvapi-...\" # Uses same env var as OpenAI\n\n# Note: NVIDIA services typically use the same OPENAI_API_KEY environment variable\n# but with NVIDIA-specific API keys (nvapi-... format)\n```\n\n### Step 4: Run the Evaluation\n\nExecute the benchmark using one of the methods above. The framework will:\n\n1. **Load the configuration** and validate parameters\n2. **Generate model configs** dynamically for the specified model\n3. **Download and prepare** the benchmark dataset\n4. **Run evaluations** on the specified number of instances\n5. **Cache responses** for efficiency and reproducibility\n6. 
**Generate results** in standardized format\n\n### Step 5: Analyze Results\n\nReview the generated results:\n\n```bash\n# View raw results\ncat results/my_test/results.jsonl\n\n# Use HELM tools for analysis\nhelm-summarize --suite my-suite\nhelm-server # Start web interface to view results\n```\n\n## Advanced Configuration\n\n### Customizing Evaluation Parameters\n\nYou can customize various parameters in your configuration:\n\n```yaml\nconfig:\n type: medcalc_bench\n output_dir: results/advanced_test\n params:\n limit_samples: 100 # Limit number of evaluation instances\n parallelism: 4 # Number of parallel threads\n extra:\n num_train_trials: 3 # Number of training trials\n max_length: 2048 # Maximum token length\ntarget:\n api_endpoint:\n url: https://api.openai.com/v1\n model_id: gpt-4\n type: chat\n api_key: OPENAI_API_KEY\n```\n\n### Advanced Configuration Parameters\n\nThe `config.params.extra` section provides additional parameters for fine-tuning evaluations:\n\n#### `data_path`\n- **Purpose**: Custom data path for scenarios that support it\n- **Supported Scenarios**: `ehrshot`, `clear`, `medalign`, `n2c2_ct_matching`\n- **Example**: `\"/path/to/custom/data\"`\n- **Description**: Overrides the default data location for the scenario\n\n#### `num_output_tokens`\n- **Purpose**: Maximum number of tokens the model is allowed to generate in its response\n- **Scope**: Controls only the output length, not the total sequence length\n- **Example**: `1000` limits model responses to 1000 tokens\n- **Use Case**: Useful for controlling response length in generation tasks\n\n#### `max_length`\n- **Purpose**: Maximum total length for the entire input-output sequence (input + output combined)\n- **Scope**: Controls the combined length of both prompt and response\n- **Example**: `2048` limits total conversation to 2048 tokens\n- **Difference from num_output_tokens**: This controls total sequence length, while num_output_tokens only controls response length\n\n#### `subject`\n- **Purpose**: Specific task or subset to evaluate within a scenario\n- **Examples by Scenario**:\n - **ehrshot**: `\"guo_readmission\"`, `\"new_hypertension\"`, `\"lab_anemia\"`\n - **n2c2_ct_matching**: `\"ABDOMINAL\"`, `\"ADVANCED-CAD\"`, `\"CREATININE\"`\n - **clear**: `\"major_depression\"`, `\"bipolar_disorder\"`, `\"substance_use_disorder\"`\n- **Description**: Filters the evaluation to a specific prediction task or medical condition\n\n#### `condition`\n- **Purpose**: Specific condition or scenario variant to evaluate\n- **Supported Scenarios**: `clear`\n- **Examples**: `\"alcohol_dependence\"`, `\"chronic_pain\"`, `\"homelessness\"`\n- **Description**: Used by scenarios like 'clear' to specify medical conditions for evaluation\n\n#### `num_train_trials`\n- **Purpose**: Number of training trials for few-shot evaluation\n- **Behavior**: Each trial samples a different set of in-context examples\n- **Example**: `3` runs the evaluation 3 times with different examples\n- **Use Case**: Useful for robust evaluation with multiple few-shot configurations\n\n### Example Configuration with All Parameters\n\n```yaml\nconfig:\n type: ehrshot\n output_dir: results/ehrshot_evaluation\n params:\n limit_samples: 500\n parallelism: 2\n extra:\n data_path: \"/custom/path/to/ehrshot/data\"\n num_output_tokens: 1000\n max_length: 4096\n subject: \"guo_readmission\"\n num_train_trials: 3\ntarget:\n api_endpoint:\n url: https://api.openai.com/v1\n model_id: gpt-4\n type: chat\n api_key: OPENAI_API_KEY\n```\n\n### Running Multiple 
Benchmarks\n\nTo run multiple benchmarks on the same model:\n\n```bash\n# Create separate config files for each benchmark\neval-factory run_eval --output_dir results/medcalc_test --run_config medcalc_config.yml\neval-factory run_eval --output_dir results/medec_test --run_config medec_config.yml\neval-factory run_eval --output_dir results/head_qa_test --run_config head_qa_config.yml\n```\n\n### Dry Run Mode\n\nTest your configuration without running the full evaluation:\n\n```bash\neval-factory run_eval \\\n --output_dir results/test \\\n --run_config my_config.yml \\\n --dry_run\n```\n\nThis will show you the rendered configuration and command without executing the benchmark.\n\n## Troubleshooting\n\n### Common Issues\n\n1. **API Key Errors**: Ensure your API keys are properly set and valid\n2. **Model Not Found**: Verify the model ID and endpoint URL are correct\n3. **Memory Issues**: Reduce `parallelism` or `limit_samples` for large models\n4. **Timeout Errors**: Increase timeout settings or reduce batch sizes\n\n### Debug Mode\n\nEnable debug logging for detailed information:\n\n```bash\neval-factory --debug run_eval \\\n --output_dir results/debug_test \\\n --run_config debug_config.yml\n```\n\n### Checking Available Tasks\n\nList all available evaluation types:\n\n```bash\neval-factory ls\n```\n\n## Examples from commands.sh\n\nHere are some practical examples from the project:\n\n### Basic Medical Calculation Benchmark\n```bash\neval-factory run_eval \\\n --output_dir test_cases/test_case_nim_llama_3_1_8b_medcalc_bench \\\n --run_config test_cases/test_case_nim_llama_3_1_8b_medcalc_bench.yml\n```\n\n### Medical Error Detection\n```bash\neval-factory run_eval \\\n --output_dir test_cases/test_case_nim_llama_3_1_8b_medec \\\n --run_config test_cases/test_case_nim_llama_3_1_8b_medec.yml\n```\n\n### Biomedical QA\n```bash\neval-factory run_eval \\\n --output_dir test_cases/test_case_nim_llama_3_1_8b_head_qa \\\n --run_config test_cases/test_case_nim_llama_3_1_8b_head_qa.yml\n```\n\n## Running Evaluations with Judges\n\nThe HELM framework supports multi-judge evaluations for scenarios that require human-like assessment of model outputs. This is particularly useful for tasks like medical treatment plan generation, where multiple AI judges can provide more robust and reliable evaluations.\n\n### Overview of Multi-Judge Setup\n\nThe framework supports three types of judges:\n- **GPT Judge**: Uses OpenAI GPT models for evaluation\n- **Llama Judge**: Uses Llama models for evaluation \n- **Claude Judge**: Uses Anthropic Claude models for evaluation\n\nEach judge can use different API keys, providing better rate limiting, cost tracking, and flexibility.\n\n### Authentication Systems\n\nThe framework supports **two authentication methods** for judge models:\n\n#### 1. Direct Judge API Keys (Recommended for Production)\nSet individual API keys for each judge type:\n\n```bash\n# API key for the main model being evaluated\nexport OPENAI_API_KEY=\"your-main-model-api-key\"\n\n# API keys for the three judges (annotators)\nexport GPT_JUDGE_API_KEY=\"your-gpt-judge-api-key\"\nexport LLAMA_JUDGE_API_KEY=\"your-llama-judge-api-key\"\nexport CLAUDE_JUDGE_API_KEY=\"your-claude-judge-api-key\"\n```\n\n#### 2. 
OAuth 2.0 Client Credentials Flow (Advanced)\nUse NVIDIA's OAuth system for automatic token management:\n\n```bash\n# OAuth 2.0 credentials for automatic token generation\nexport OPENAI_CLIENT_ID=\"nvssa-prd-your-client-id\"\nexport OPENAI_CLIENT_SECRET=\"ssap-your-client-secret\"\nexport OPENAI_TOKEN_URL=\"https://prod.api.nvidia.com/oauth/api/v1/ssa/default/token\"\nexport OPENAI_SCOPE=\"awsanthropic-readwrite\"\n\n# Main API key (still required)\nexport OPENAI_API_KEY=\"your-main-model-api-key\"\n```\n\n### How Authentication Priority Works\n\nThe system follows this **exact priority order**:\n\n1. **First Priority**: Judge-specific environment variables\n - `GPT_JUDGE_API_KEY` for GPT models\n - `LLAMA_JUDGE_API_KEY` for Llama models\n - `CLAUDE_JUDGE_API_KEY` for Claude models\n\n2. **Second Priority**: Fallback to main API key\n - If judge keys aren't set, automatically uses `OPENAI_API_KEY`\n - System logs: \"GPT_JUDGE_API_KEY is not set, setting to OPENAI_API_KEY\"\n\n3. **Third Priority**: Credentials configuration\n - Falls back to `credentials.conf` or deployment-specific keys\n\n**Important**: OAuth-generated tokens are **NOT** automatically used for judge API keys. The OAuth system is separate and serves different purposes.\n\n### OAuth 2.0 System Details\n\nThe OAuth system provides:\n- **Automatic Token Creation**: Generates access tokens using client credentials\n- **Token Caching**: Stores tokens in memory and disk (`{service_name}_oauth_token.json`)\n- **Automatic Refresh**: Refreshes expired tokens automatically\n- **Scope Control**: Different permissions per service:\n - `azureopenai-readwrite` for GPT services\n - `awsanthropic-readwrite` for Claude services\n\n**When to Use OAuth:**\n- Better security (client credentials vs. long-lived API keys)\n- Automatic token management\n- Centralized billing and rate limiting\n- Enterprise-grade authentication\n\n**When to Use Direct API Keys:**\n- Simpler setup\n- Direct control over each judge's API key\n- Different providers for different judges\n- Testing and development scenarios\n\n### Security Features\n\n**API Key Protection**: The system automatically sanitizes error messages to prevent API keys from appearing in logs. 
Any API key patterns (like `nvapi-...`, `sk-...`, `hf_...`) are automatically replaced with `[API_KEY_REDACTED]` before logging.\n\n### Configuration for Multi-Judge Evaluations\n\n#### Basic Configuration (Direct API Keys)\n\n```yaml\nconfig:\n type: mtsamples_replicate # Example scenario that uses judges\n output_dir: results/multi_judge_test\n params:\n limit_samples: 10\n parallelism: 1\n extra:\n num_train_trials: 1\n max_length: 2048\n # Different API keys for each judge\n gpt_judge_api_key: GPT_JUDGE_API_KEY\n llama_judge_api_key: LLAMA_JUDGE_API_KEY\n claude_judge_api_key: CLAUDE_JUDGE_API_KEY\ntarget:\n api_endpoint:\n url: https://integrate.api.nvidia.com/v1\n model_id: nvdev/meta/llama-3.1-8b-instruct\n type: chat\n api_key: OPENAI_API_KEY\n```\n\n#### Advanced Configuration (OAuth + Direct Keys)\n\n```yaml\nconfig:\n type: mtsamples_replicate\n output_dir: results/oauth_multi_judge_test\n params:\n limit_samples: 50\n parallelism: 2\n extra:\n num_train_trials: 3\n max_length: 2048\n # Mix OAuth (automatic) and direct keys\n gpt_judge_api_key: GPT_JUDGE_API_KEY # Direct key for GPT\n # Llama and Claude will use OAuth-generated tokens\ntarget:\n api_endpoint:\n url: https://integrate.api.nvidia.com/v1\n model_id: nvdev/meta/llama-3.3-70b-instruct\n type: chat\n api_key: OPENAI_API_KEY\n```\n\n### Supported Scenarios with Judges\n\nCurrently, the following scenarios support multi-judge evaluations:\n\n| Scenario | Description | Judge Types Used |\n|----------|-------------|------------------|\n| **mtsamples_replicate** | Generate treatment plans based on clinical notes | GPT, Llama, Claude |\n| **mtsamples_procedures** | Document and extract information about medical procedures | GPT, Llama, Claude |\n| **aci_bench** | Extract and structure information from patient-doctor conversations | GPT, Llama, Claude |\n| **medication_qa** | Answer consumer medication-related questions | GPT, Llama, Claude |\n| **medi_qa** | Retrieve and rank answers based on medical question understanding | GPT, Llama, Claude |\n| **med_dialog** | Generate summaries of doctor-patient conversations | GPT, Llama, Claude |\n\n### Complete Setup Guide\n\n#### Method 1: Direct API Keys (Simplest)\n\n```bash\n# 1. Set up environment variables\nexport OPENAI_API_KEY=\"nvapi-your-main-api-key\"\nexport GPT_JUDGE_API_KEY=\"nvapi-gpt-judge-api-key\"\nexport LLAMA_JUDGE_API_KEY=\"nvapi-llama-judge-api-key\"\nexport CLAUDE_JUDGE_API_KEY=\"nvapi-claude-judge-api-key\"\n\n# 2. Run the evaluation\neval-factory run_eval \\\n --output_dir results/multi_judge_test \\\n --run_config multi_judge_config.yml\n```\n\n#### Method 2: OAuth 2.0 System (Enterprise)\n\n```bash\n# 1. Set up OAuth credentials\nexport OPENAI_CLIENT_ID=\"nvssa-prd-your-client-id\"\nexport OPENAI_CLIENT_SECRET=\"ssap-your-client-secret\"\nexport OPENAI_TOKEN_URL=\"https://prod.api.nvidia.com/oauth/api/v1/ssa/default/token\"\nexport OPENAI_SCOPE=\"awsanthropic-readwrite\"\n\n# 2. Set main API key (still required)\nexport OPENAI_API_KEY=\"nvapi-your-main-api-key\"\n\n# 3. Run the evaluation\neval-factory run_eval \\\n --output_dir results/oauth_multi_judge_test \\\n --run_config oauth_multi_judge_config.yml\n```\n\n#### Method 3: Hybrid Approach (Flexible)\n\n```bash\n# 1. Set OAuth credentials for automatic token generation\nexport OPENAI_CLIENT_ID=\"nvssa-prd-your-client-id\"\nexport OPENAI_CLIENT_SECRET=\"ssap-your-client-secret\"\n\n# 2. Override specific judge with direct API key\nexport GPT_JUDGE_API_KEY=\"nvapi-gpt-specific-key\"\n\n# 3. 
#### Method 3: Hybrid Approach (Flexible)

```bash
# 1. Set OAuth credentials for automatic token generation
export OPENAI_CLIENT_ID="nvssa-prd-your-client-id"
export OPENAI_CLIENT_SECRET="ssap-your-client-secret"

# 2. Override specific judge with direct API key
export GPT_JUDGE_API_KEY="nvapi-gpt-specific-key"

# 3. Set main API key
export OPENAI_API_KEY="nvapi-your-main-api-key"

# 4. Run the evaluation
eval-factory run_eval \
  --output_dir results/hybrid_multi_judge_test \
  --run_config hybrid_multi_judge_config.yml
```

#### Method 4: Using `helm-run` directly

```bash
# Set up environment variables (any of the above methods)
export OPENAI_API_KEY="nvapi-your-main-api-key"
export GPT_JUDGE_API_KEY="nvapi-gpt-judge-api-key"
export LLAMA_JUDGE_API_KEY="nvapi-llama-judge-api-key"
export CLAUDE_JUDGE_API_KEY="nvapi-claude-judge-api-key"

# Run the evaluation
helm-run \
  --run-entries mtsamples_replicate:model=openai/gpt-4 \
  --suite my-suite \
  --max-eval-instances 10 \
  --num-train-trials 1 \
  -o results/multi_judge_test
```

### Advanced Judge Configuration

#### Using Different API Keys for Each Judge

You can use completely different API keys for each judge:

```bash
export GPT_JUDGE_API_KEY="nvapi-gpt-judge-1"
export LLAMA_JUDGE_API_KEY="nvapi-llama-judge-2"
export CLAUDE_JUDGE_API_KEY="nvapi-claude-judge-3"
```

#### Using the Same API Key for All Judges

If you want to use the same API key for all judges:

```bash
export GPT_JUDGE_API_KEY="nvapi-shared-key"
export LLAMA_JUDGE_API_KEY="nvapi-shared-key"
export CLAUDE_JUDGE_API_KEY="nvapi-shared-key"
```

#### OAuth Token Management

**Check OAuth Token Status:**
```bash
# Look for OAuth token files
ls -la *_oauth_token.json

# Check token expiration
cat openai_oauth_token.json | jq '.expires_at'
```

**Force Token Refresh:**
```bash
# The system automatically refreshes expired tokens
# You can also manually trigger a refresh by deleting the token files
rm *_oauth_token.json
```

**OAuth Scopes for Different Services:**
```bash
# For GPT services
export OPENAI_SCOPE="azureopenai-readwrite"

# For Claude services
export OPENAI_SCOPE="awsanthropic-readwrite"
```
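
If `jq` is not available, the same status check can be done from Python. This is a small sketch that assumes each cached token file stores an `expires_at` Unix timestamp, matching the caching behaviour described above; adjust the field name if your token files differ.

```python
import glob
import json
import time

# Report the status of every cached OAuth token file in the current directory.
for path in glob.glob("*_oauth_token.json"):
    with open(path) as f:
        token = json.load(f)
    remaining = token.get("expires_at", 0) - time.time()
    status = "valid" if remaining > 0 else "expired (delete it to force a refresh)"
    print(f"{path}: {status}, {int(remaining)}s remaining")
```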
### Example Multi-Judge Evaluation

Here's a complete example for running a multi-judge evaluation:

```bash
# 1. Create configuration file (multi_judge_config.yml)
cat > multi_judge_config.yml << EOF
config:
  type: mtsamples_replicate
  output_dir: results/multi_judge_test
  params:
    limit_samples: 50
    parallelism: 2
    extra:
      num_train_trials: 3
      max_length: 2048
      gpt_judge_api_key: GPT_JUDGE_API_KEY
      llama_judge_api_key: LLAMA_JUDGE_API_KEY
      claude_judge_api_key: CLAUDE_JUDGE_API_KEY
target:
  api_endpoint:
    url: https://integrate.api.nvidia.com/v1
    model_id: nvdev/meta/llama-3.3-70b-instruct
    type: chat
    api_key: OPENAI_API_KEY
EOF

# 2. Set environment variables
export OPENAI_API_KEY="nvapi-main-model-key"
export GPT_JUDGE_API_KEY="nvapi-gpt-judge-key"
export LLAMA_JUDGE_API_KEY="nvapi-llama-judge-key"
export CLAUDE_JUDGE_API_KEY="nvapi-claude-judge-key"

# 3. Run the evaluation
eval-factory run_eval \
  --output_dir results/multi_judge_test \
  --run_config multi_judge_config.yml
```

### Troubleshooting Multi-Judge Evaluations

#### Check Environment Variables

Verify your environment variables are set correctly:

```bash
echo "Main API Key: $OPENAI_API_KEY"
echo "GPT Judge: $GPT_JUDGE_API_KEY"
echo "Llama Judge: $LLAMA_JUDGE_API_KEY"
echo "Claude Judge: $CLAUDE_JUDGE_API_KEY"
```

#### Check OAuth Configuration

Verify OAuth credentials are properly set:

```bash
echo "Client ID: $OPENAI_CLIENT_ID"
echo "Client Secret: $OPENAI_CLIENT_SECRET"
echo "Token URL: $OPENAI_TOKEN_URL"
echo "Scope: $OPENAI_SCOPE"
```
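
Keep in mind that echoing complete keys into a terminal can itself leak secrets (into shell history, screen shares, or CI logs). As an alternative, a short preflight script can confirm that everything is set while printing only the last few characters of each value. This is a sketch, not part of the framework; the variable lists simply mirror the ones used in this guide.

```python
import os

REQUIRED = ["OPENAI_API_KEY"]
OPTIONAL = [
    "GPT_JUDGE_API_KEY",
    "LLAMA_JUDGE_API_KEY",
    "CLAUDE_JUDGE_API_KEY",
    "OPENAI_CLIENT_ID",
    "OPENAI_CLIENT_SECRET",
    "OPENAI_TOKEN_URL",
    "OPENAI_SCOPE",
]


def mask(value: str) -> str:
    """Show only the last four characters of a value."""
    return f"...{value[-4:]}" if value else "<not set>"


for name in REQUIRED + OPTIONAL:
    print(f"{name}: {mask(os.environ.get(name, ''))}")

missing = [name for name in REQUIRED if not os.environ.get(name)]
if missing:
    raise SystemExit(f"Missing required variables: {', '.join(missing)}")
```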
#### Debug Mode

Enable debug logging to see which API keys are being used:

```bash
eval-factory --debug run_eval \
  --output_dir results/debug_multi_judge \
  --run_config multi_judge_config.yml
```

#### Common Issues and Solutions

**Issue: "GPT_JUDGE_API_KEY is not set, setting to OPENAI_API_KEY"**
- **Cause**: Judge API key not set, system falling back to main API key
- **Solution**: Set the specific judge API key or accept the fallback

**Issue: "Missing environment variables for openai token"**
- **Cause**: OAuth credentials not properly configured
- **Solution**: Set `OPENAI_CLIENT_ID` and `OPENAI_CLIENT_SECRET`

**Issue: "Error creating openai OAuth token"**
- **Cause**: Invalid credentials or network issues
- **Solution**: Verify credentials and check network connectivity

**Issue: API key appears in logs**
- **Cause**: This should not happen with the security fix
- **Solution**: Check if you're using the latest version with API key sanitization

#### Log Analysis

**Look for these log patterns:**
```bash
# Judge API key usage
grep "Using.*judge API key" logs/*.log

# OAuth token creation
grep "Creating new.*OAuth token" logs/*.log

# API key fallbacks
grep "is not set, setting to" logs/*.log

# Authentication errors
grep "Authentication error detected" logs/*.log
```

#### Performance Monitoring

**Check API Key Usage:**
```bash
# Monitor which API keys are being used
grep "Using.*API key.*ends with" logs/*.log

# Check for rate limiting
grep "rate limit\|429" logs/*.log

# Monitor OAuth token refresh
grep "token expired\|refreshing" logs/*.log
```

Look for messages like:

```
Using GPT judge API key from environment variable for model: nvidia/gpt4o-abc123
Using Llama judge API key from environment variable for model: nvdev/meta/llama-3.3-70b-instruct-def456
Using Claude judge API key from environment variable for model: nvidia/claude-3-7-sonnet-20250219-ghi789
```

#### Common Issues

1. **Environment variables not loaded**: Make sure your environment variables are set before running the command
2. **API key format**: Ensure your API keys start with `nvapi-` for NVIDIA services
3. **Configuration file**: Verify your YAML configuration file references the correct environment variable names
4. **Judge model availability**: Ensure the judge models are available through your API endpoints

### Benefits of Multi-Judge Evaluations

- **Better rate limiting**: Each judge can have its own rate limits
- **Cost tracking**: Track costs separately for each judge
- **Flexibility**: Use different API keys for different purposes
- **Security**: Isolate API keys for different components
- **Robustness**: Multiple judges provide more reliable evaluations
- **Diversity**: Different judge models may catch different types of errors

## Integration with EvalFactory

This framework is designed to work seamlessly with the EvalFactory infrastructure:

- **Standardized Output**: Results are generated in a format compatible with EvalFactory
- **Configuration Management**: Uses YAML-based configuration for easy integration
- **Caching**: Built-in caching for efficient re-runs and reproducibility
- **Extensibility**: Easy to add new benchmarks and evaluation metrics

## Contributing

To add new benchmarks or modify existing ones:

1. Update `framework.yml` with new benchmark definitions
2. Implement the benchmark logic in the appropriate adapter
3. Add test cases and documentation
4. Update this README with new benchmark information

## References

- [HELM Framework](https://github.com/stanford-crfm/helm)
- [EvalFactory Documentation](https://github.com/nvidia/eval-factory)
- [Medical AI Evaluation Papers](https://arxiv.org/abs/2401.00000)

For more detailed information about specific benchmarks and their implementations, refer to the individual benchmark documentation and the main HELM repository.

# Holistic Evaluation of Language Models (HELM)

<a href="https://github.com/stanford-crfm/helm">
  <img alt="GitHub Repo stars" src="https://img.shields.io/github/stars/stanford-crfm/helm">
</a>
<a href="https://github.com/stanford-crfm/helm/graphs/contributors">
  <img alt="GitHub contributors" src="https://img.shields.io/github/contributors/stanford-crfm/helm">
</a>
<a href="https://github.com/stanford-crfm/helm/actions/workflows/test.yml?query=branch%3Amain">
  <img alt="GitHub Actions Workflow Status" src="https://img.shields.io/github/actions/workflow/status/stanford-crfm/helm/test.yml">
</a>
<a href="https://crfm-helm.readthedocs.io/en/latest/">
  <img alt="Documentation Status" src="https://readthedocs.org/projects/helm/badge/?version=latest">
</a>
<a href="https://github.com/stanford-crfm/helm/blob/main/LICENSE">
  <img alt="License" src="https://img.shields.io/github/license/stanford-crfm/helm?color=blue" />
</a>
<a href="https://pypi.org/project/crfm-helm/">
  <img alt="PyPI" src="https://img.shields.io/pypi/v/crfm-helm?color=blue" />
</a>

[comment]: <> (When using the img tag, which allows us to specify size, src has to be a URL.)
<img src="https://github.com/stanford-crfm/helm/raw/v0.5.4/helm-frontend/src/assets/helm-logo.png" alt="HELM logo" width="480"/>

**Holistic Evaluation of Language Models (HELM)** is an open source Python framework created by the [Center for Research on Foundation Models (CRFM) at Stanford](https://crfm.stanford.edu/) for holistic, reproducible and transparent evaluation of foundation models, including large language models (LLMs) and multimodal models. This framework includes the following features:

- Datasets and benchmarks in a standardized format (e.g. MMLU-Pro, GPQA, IFEval, WildBench)
- Models from various providers accessible through a unified interface (e.g. OpenAI models, Anthropic Claude, Google Gemini)
- Metrics for measuring various aspects beyond accuracy (e.g. efficiency, bias, toxicity)
- Web UI for inspecting individual prompts and responses
- Web leaderboard for comparing results across models and benchmarks

## Documentation

Please refer to [the documentation on Read the Docs](https://crfm-helm.readthedocs.io/) for instructions on how to install and run HELM.

## Quick Start

<!--quick-start-begin-->

Install the package from PyPI:

```sh
pip install crfm-helm
```

Run the following in your shell:

```sh
# Run benchmark
helm-run --run-entries mmlu:subject=philosophy,model=openai/gpt2 --suite my-suite --max-eval-instances 10

# Summarize benchmark results
helm-summarize --suite my-suite

# Start a web server to display benchmark results
helm-server --suite my-suite
```

Then go to http://localhost:8000/ in your browser.

<!--quick-start-end-->

## Attribution

This NVIDIA fork of HELM is based on the original [Stanford CRFM HELM framework](https://github.com/stanford-crfm/helm). The original framework was created by the [Center for Research on Foundation Models (CRFM) at Stanford](https://crfm.stanford.edu/) and is licensed under the Apache License 2.0.

## Leaderboards

We maintain official leaderboards with results from evaluating recent models on notable benchmarks using this framework. Our current flagship leaderboards are:

- [HELM Capabilities](https://crfm.stanford.edu/helm/capabilities/latest/)
- [HELM Safety](https://crfm.stanford.edu/helm/safety/latest/)
- [Holistic Evaluation of Vision-Language Models (VHELM)](https://crfm.stanford.edu/helm/vhelm/latest/)

We also maintain leaderboards for a diverse range of domains (e.g. medicine, finance) and aspects (e.g. multi-linguality, world knowledge, regulation compliance).
Refer to the [HELM website](https://crfm.stanford.edu/helm/) for a full list of leaderboards.

## Papers

The HELM framework was used in the following papers for evaluating models.

- **Holistic Evaluation of Language Models** - [paper](https://openreview.net/forum?id=iO4LZibEqW), [leaderboard](https://crfm.stanford.edu/helm/classic/latest/)
- **Holistic Evaluation of Vision-Language Models (VHELM)** - [paper](https://arxiv.org/abs/2410.07112), [leaderboard](https://crfm.stanford.edu/helm/vhelm/latest/), [documentation](https://crfm-helm.readthedocs.io/en/latest/vhelm/)
- **Holistic Evaluation of Text-To-Image Models (HEIM)** - [paper](https://arxiv.org/abs/2311.04287), [leaderboard](https://crfm.stanford.edu/helm/heim/latest/), [documentation](https://crfm-helm.readthedocs.io/en/latest/heim/)
- **Image2Struct: Benchmarking Structure Extraction for Vision-Language Models** - [paper](https://arxiv.org/abs/2410.22456)
- **Enterprise Benchmarks for Large Language Model Evaluation** - [paper](https://arxiv.org/abs/2410.12857), [documentation](https://crfm-helm.readthedocs.io/en/latest/enterprise_benchmark/)
- **The Mighty ToRR: A Benchmark for Table Reasoning and Robustness** - [paper](https://arxiv.org/abs/2502.19412), [leaderboard](https://crfm.stanford.edu/helm/torr/latest/)
- **Reliable and Efficient Amortized Model-based Evaluation** - [paper](https://arxiv.org/abs/2503.13335), [documentation](https://crfm-helm.readthedocs.io/en/latest/reeval/)
- **MedHELM** - paper in progress, [leaderboard](https://crfm.stanford.edu/helm/medhelm/latest/), [documentation](https://crfm-helm.readthedocs.io/en/latest/reeval/)

The HELM framework can be used to reproduce the published model evaluation results from these papers. To get started, refer to the documentation links above for the corresponding paper, or the [main Reproducing Leaderboards documentation](https://crfm-helm.readthedocs.io/en/latest/reproducing_leaderboards/).

## Citation

If you use this software in your research, please cite the [Holistic Evaluation of Language Models paper](https://openreview.net/forum?id=iO4LZibEqW) as below.

```bibtex
@article{liang2023holistic,
  title={Holistic Evaluation of Language Models},
  author={Percy Liang and Rishi Bommasani and Tony Lee and Dimitris Tsipras and Dilara Soylu and Michihiro Yasunaga and Yian Zhang and Deepak Narayanan and Yuhuai Wu and Ananya Kumar and Benjamin Newman and Binhang Yuan and Bobby Yan and Ce Zhang and Christian Alexander Cosgrove and Christopher D Manning and Christopher Re and Diana Acosta-Navas and Drew Arad Hudson and Eric Zelikman and Esin Durmus and Faisal Ladhak and Frieda Rong and Hongyu Ren and Huaxiu Yao and Jue WANG and Keshav Santhanam and Laurel Orr and Lucia Zheng and Mert Yuksekgonul and Mirac Suzgun and Nathan Kim and Neel Guha and Niladri S. Chatterji and Omar Khattab and Peter Henderson and Qian Huang and Ryan Andrew Chi and Sang Michael Xie and Shibani Santurkar and Surya Ganguli and Tatsunori Hashimoto and Thomas Icard and Tianyi Zhang and Vishrav Chaudhary and William Wang and Xuechen Li and Yifan Mai and Yuhui Zhang and Yuta Koreeda},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2023},
  url={https://openreview.net/forum?id=iO4LZibEqW},
  note={Featured Certification, Expert Certification}
}
```

# Attribution and Acknowledgments

## Original Project

This project is a fork of the **Holistic Evaluation of Language Models (HELM)** framework created by the Center for Research on Foundation Models (CRFM) at Stanford.

- **Original Repository**: [https://github.com/stanford-crfm/helm](https://github.com/stanford-crfm/helm)
- **Original Documentation**: [https://crfm.stanford.edu/helm](https://crfm.stanford.edu/helm)
- **Original Paper**: [Holistic Evaluation of Language Models](https://openreview.net/forum?id=iO4LZibEqW)
- **Original Authors**: Stanford CRFM Team
- **Original License**: Apache License 2.0

## Citation

If you use this software in your research, please cite the original HELM paper using the BibTeX entry given in the Citation section above.

## Fork Information

- **Fork Maintainer**: NVIDIA
- **Fork Purpose**: Medical AI evaluation and EvalFactory integration

## License

This fork is released under the same Apache License 2.0 as the original project, in accordance with the original license terms.
"bugtrack_url": null,
"license": "Apache License 2.0",
"summary": "NVIDIA: Benchmark for language models - Fork of Stanford CRFM HELM",
"version": "25.8",
"project_urls": {
"Homepage": "https://github.com/stanford-crfm/helm",
"Original Documentation": "https://crfm.stanford.edu/helm",
"Original Project": "https://github.com/stanford-crfm/helm"
},
"split_keywords": [
"language",
"models",
"benchmarking",
"nvidia",
"fork",
"stanford",
"crfm",
"helm"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "a1e1f6ecf697e9cfaa38b58ba0508bfb4458046bf7bf2d89275edee96714ce5a",
"md5": "266695225f3cc0ebf4557f60c6bcf9c4",
"sha256": "f18bf3d729fe8e22e0a9272c27141d4e4657fa6394042d59d6f63e4d5c12ed84"
},
"downloads": -1,
"filename": "nvidia_crfm_helm-25.8-py3-none-any.whl",
"has_sig": false,
"md5_digest": "266695225f3cc0ebf4557f60c6bcf9c4",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9",
"size": 6986797,
"upload_time": "2025-09-04T10:50:17",
"upload_time_iso_8601": "2025-09-04T10:50:17.785806Z",
"url": "https://files.pythonhosted.org/packages/a1/e1/f6ecf697e9cfaa38b58ba0508bfb4458046bf7bf2d89275edee96714ce5a/nvidia_crfm_helm-25.8-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-09-04 10:50:17",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "stanford-crfm",
"github_project": "helm",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [
{
"name": "absl-py",
"specs": [
[
"==",
"2.3.1"
]
]
},
{
"name": "accelerate",
"specs": [
[
"==",
"0.34.2"
]
]
},
{
"name": "ai2-olmo",
"specs": [
[
"==",
"0.2.2"
]
]
},
{
"name": "ai2-olmo",
"specs": [
[
"==",
"0.6.0"
]
]
},
{
"name": "ai2-olmo-core",
"specs": [
[
"==",
"0.1.0"
]
]
},
{
"name": "aiodns",
"specs": [
[
"==",
"3.5.0"
]
]
},
{
"name": "aiohappyeyeballs",
"specs": [
[
"==",
"2.6.1"
]
]
},
{
"name": "aiohttp",
"specs": [
[
"==",
"3.12.14"
]
]
},
{
"name": "aiohttp-retry",
"specs": [
[
"==",
"2.9.1"
]
]
},
{
"name": "aiosignal",
"specs": [
[
"==",
"1.4.0"
]
]
},
{
"name": "aleph-alpha-client",
"specs": [
[
"==",
"2.17.0"
]
]
},
{
"name": "annotated-types",
"specs": [
[
"==",
"0.7.0"
]
]
},
{
"name": "anthropic",
"specs": [
[
"==",
"0.58.2"
]
]
},
{
"name": "antlr4-python3-runtime",
"specs": [
[
"==",
"4.9.3"
]
]
},
{
"name": "anyio",
"specs": [
[
"==",
"4.9.0"
]
]
},
{
"name": "astunparse",
"specs": [
[
"==",
"1.6.3"
]
]
},
{
"name": "async-timeout",
"specs": [
[
"==",
"5.0.1"
]
]
},
{
"name": "attrs",
"specs": [
[
"==",
"25.3.0"
]
]
},
{
"name": "autokeras",
"specs": [
[
"==",
"1.1.0"
]
]
},
{
"name": "av",
"specs": [
[
"==",
"15.0.0"
]
]
},
{
"name": "awscli",
"specs": [
[
"==",
"1.41.9"
]
]
},
{
"name": "beaker-gantry",
"specs": [
[
"==",
"1.15.0"
]
]
},
{
"name": "beaker-gantry",
"specs": [
[
"==",
"1.16.0"
]
]
},
{
"name": "beaker-py",
"specs": [
[
"==",
"1.34.3"
]
]
},
{
"name": "beaker-py",
"specs": [
[
"==",
"1.36.4"
]
]
},
{
"name": "beautifulsoup4",
"specs": [
[
"==",
"4.13.4"
]
]
},
{
"name": "black",
"specs": [
[
"==",
"24.3.0"
]
]
},
{
"name": "blis",
"specs": [
[
"==",
"1.2.1"
]
]
},
{
"name": "boltons",
"specs": [
[
"==",
"25.0.0"
]
]
},
{
"name": "boto3",
"specs": [
[
"==",
"1.39.9"
]
]
},
{
"name": "botocore",
"specs": [
[
"==",
"1.39.9"
]
]
},
{
"name": "bottle",
"specs": [
[
"==",
"0.12.25"
]
]
},
{
"name": "cached-path",
"specs": [
[
"==",
"1.7.3"
]
]
},
{
"name": "cachetools",
"specs": [
[
"==",
"5.5.2"
]
]
},
{
"name": "catalogue",
"specs": [
[
"==",
"2.0.10"
]
]
},
{
"name": "cattrs",
"specs": [
[
"==",
"22.2.0"
]
]
},
{
"name": "certifi",
"specs": [
[
"==",
"2025.7.14"
]
]
},
{
"name": "cffi",
"specs": [
[
"==",
"1.17.1"
]
]
},
{
"name": "cfgv",
"specs": [
[
"==",
"3.4.0"
]
]
},
{
"name": "charset-normalizer",
"specs": [
[
"==",
"3.4.2"
]
]
},
{
"name": "chex",
"specs": [
[
"==",
"0.1.89"
]
]
},
{
"name": "clang",
"specs": [
[
"==",
"20.1.5"
]
]
},
{
"name": "click",
"specs": [
[
"==",
"8.1.8"
]
]
},
{
"name": "click",
"specs": [
[
"==",
"8.2.1"
]
]
},
{
"name": "click-help-colors",
"specs": [
[
"==",
"0.9.4"
]
]
},
{
"name": "clip-anytorch",
"specs": [
[
"==",
"2.6.0"
]
]
},
{
"name": "cloudpathlib",
"specs": [
[
"==",
"0.21.1"
]
]
},
{
"name": "cohere",
"specs": [
[
"==",
"5.16.1"
]
]
},
{
"name": "colorama",
"specs": [
[
"==",
"0.4.6"
]
]
},
{
"name": "colorcet",
"specs": [
[
"==",
"3.1.0"
]
]
},
{
"name": "coloredlogs",
"specs": [
[
"==",
"15.0.1"
]
]
},
{
"name": "colorlog",
"specs": [
[
"==",
"6.9.0"
]
]
},
{
"name": "confection",
"specs": [
[
"==",
"0.1.5"
]
]
},
{
"name": "contourpy",
"specs": [
[
"==",
"1.3.0"
]
]
},
{
"name": "contourpy",
"specs": [
[
"==",
"1.3.2"
]
]
},
{
"name": "cycler",
"specs": [
[
"==",
"0.12.1"
]
]
},
{
"name": "cymem",
"specs": [
[
"==",
"2.0.11"
]
]
},
{
"name": "dacite",
"specs": [
[
"==",
"1.9.2"
]
]
},
{
"name": "data",
"specs": [
[
"==",
"0.4"
]
]
},
{
"name": "datasets",
"specs": [
[
"==",
"3.6.0"
]
]
},
{
"name": "decorator",
"specs": [
[
"==",
"5.2.1"
]
]
},
{
"name": "diffusers",
"specs": [
[
"==",
"0.34.0"
]
]
},
{
"name": "dill",
"specs": [
[
"==",
"0.3.8"
]
]
},
{
"name": "distlib",
"specs": [
[
"==",
"0.4.0"
]
]
},
{
"name": "distro",
"specs": [
[
"==",
"1.9.0"
]
]
},
{
"name": "dnspython",
"specs": [
[
"==",
"2.7.0"
]
]
},
{
"name": "docker",
"specs": [
[
"==",
"7.1.0"
]
]
},
{
"name": "docstring-parser",
"specs": [
[
"==",
"0.17.0"
]
]
},
{
"name": "docutils",
"specs": [
[
"==",
"0.19"
]
]
},
{
"name": "einops",
"specs": [
[
"==",
"0.7.0"
]
]
},
{
"name": "einops-exts",
"specs": [
[
"==",
"0.0.4"
]
]
},
{
"name": "et-xmlfile",
"specs": [
[
"==",
"2.0.0"
]
]
},
{
"name": "etils",
"specs": [
[
"==",
"1.5.2"
]
]
},
{
"name": "etils",
"specs": [
[
"==",
"1.13.0"
]
]
},
{
"name": "eval-type-backport",
"specs": [
[
"==",
"0.2.2"
]
]
},
{
"name": "exceptiongroup",
"specs": [
[
"==",
"1.3.0"
]
]
},
{
"name": "face",
"specs": [
[
"==",
"24.0.0"
]
]
},
{
"name": "fairlearn",
"specs": [
[
"==",
"0.9.0"
]
]
},
{
"name": "fastavro",
"specs": [
[
"==",
"1.11.1"
]
]
},
{
"name": "filelock",
"specs": [
[
"==",
"3.18.0"
]
]
},
{
"name": "flake8",
"specs": [
[
"==",
"5.0.4"
]
]
},
{
"name": "flatbuffers",
"specs": [
[
"==",
"25.2.10"
]
]
},
{
"name": "flax",
"specs": [
[
"==",
"0.8.5"
]
]
},
{
"name": "flax",
"specs": [
[
"==",
"0.10.7"
]
]
},
{
"name": "fonttools",
"specs": [
[
"==",
"4.59.0"
]
]
},
{
"name": "frozenlist",
"specs": [
[
"==",
"1.7.0"
]
]
},
{
"name": "fsspec",
"specs": [
[
"==",
"2025.3.0"
]
]
},
{
"name": "ftfy",
"specs": [
[
"==",
"6.3.1"
]
]
},
{
"name": "funcsigs",
"specs": [
[
"==",
"1.0.2"
]
]
},
{
"name": "future",
"specs": [
[
"==",
"1.0.0"
]
]
},
{
"name": "gast",
"specs": [
[
"==",
"0.4.0"
]
]
},
{
"name": "gast",
"specs": [
[
"==",
"0.6.0"
]
]
},
{
"name": "gdown",
"specs": [
[
"==",
"5.2.0"
]
]
},
{
"name": "gitdb",
"specs": [
[
"==",
"4.0.12"
]
]
},
{
"name": "gitpython",
"specs": [
[
"==",
"3.1.44"
]
]
},
{
"name": "glom",
"specs": [
[
"==",
"24.11.0"
]
]
},
{
"name": "google-api-core",
"specs": [
[
"==",
"1.34.1"
]
]
},
{
"name": "google-api-core",
"specs": [
[
"==",
"2.25.1"
]
]
},
{
"name": "google-api-python-client",
"specs": [
[
"==",
"2.176.0"
]
]
},
{
"name": "google-auth",
"specs": [
[
"==",
"2.40.3"
]
]
},
{
"name": "google-auth-httplib2",
"specs": [
[
"==",
"0.2.0"
]
]
},
{
"name": "google-auth-oauthlib",
"specs": [
[
"==",
"0.4.6"
]
]
},
{
"name": "google-cloud-aiplatform",
"specs": [
[
"==",
"1.60.0"
]
]
},
{
"name": "google-cloud-aiplatform",
"specs": [
[
"==",
"1.91.0"
]
]
},
{
"name": "google-cloud-aiplatform",
"specs": [
[
"==",
"1.104.0"
]
]
},
{
"name": "google-cloud-bigquery",
"specs": [
[
"==",
"3.25.0"
]
]
},
{
"name": "google-cloud-bigquery",
"specs": [
[
"==",
"3.35.0"
]
]
},
{
"name": "google-cloud-core",
"specs": [
[
"==",
"2.4.3"
]
]
},
{
"name": "google-cloud-resource-manager",
"specs": [
[
"==",
"1.12.3"
]
]
},
{
"name": "google-cloud-resource-manager",
"specs": [
[
"==",
"1.14.2"
]
]
},
{
"name": "google-cloud-storage",
"specs": [
[
"==",
"2.14.0"
]
]
},
{
"name": "google-cloud-storage",
"specs": [
[
"==",
"2.19.0"
]
]
},
{
"name": "google-cloud-translate",
"specs": [
[
"==",
"3.15.3"
]
]
},
{
"name": "google-cloud-translate",
"specs": [
[
"==",
"3.21.1"
]
]
},
{
"name": "google-crc32c",
"specs": [
[
"==",
"1.7.1"
]
]
},
{
"name": "google-genai",
"specs": [
[
"==",
"1.2.0"
]
]
},
{
"name": "google-pasta",
"specs": [
[
"==",
"0.2.0"
]
]
},
{
"name": "google-resumable-media",
"specs": [
[
"==",
"2.7.2"
]
]
},
{
"name": "googleapis-common-protos",
"specs": [
[
"==",
"1.63.1"
]
]
},
{
"name": "googleapis-common-protos",
"specs": [
[
"==",
"1.70.0"
]
]
},
{
"name": "gradio-client",
"specs": [
[
"==",
"1.3.0"
]
]
},
{
"name": "gradio-client",
"specs": [
[
"==",
"1.11.0"
]
]
},
{
"name": "grpc-google-iam-v1",
"specs": [
[
"==",
"0.13.0"
]
]
},
{
"name": "grpc-google-iam-v1",
"specs": [
[
"==",
"0.14.2"
]
]
},
{
"name": "grpcio",
"specs": [
[
"==",
"1.73.1"
]
]
},
{
"name": "grpcio-status",
"specs": [
[
"==",
"1.48.2"
]
]
},
{
"name": "grpcio-status",
"specs": [
[
"==",
"1.71.2"
]
]
},
{
"name": "gunicorn",
"specs": [
[
"==",
"23.0.0"
]
]
},
{
"name": "h11",
"specs": [
[
"==",
"0.16.0"
]
]
},
{
"name": "h5py",
"specs": [
[
"==",
"3.14.0"
]
]
},
{
"name": "hf-xet",
"specs": [
[
"==",
"1.1.5"
]
]
},
{
"name": "html2text",
"specs": [
[
"==",
"2024.2.26"
]
]
},
{
"name": "httpcore",
"specs": [
[
"==",
"1.0.9"
]
]
},
{
"name": "httplib2",
"specs": [
[
"==",
"0.22.0"
]
]
},
{
"name": "httpx",
"specs": [
[
"==",
"0.27.2"
]
]
},
{
"name": "httpx-sse",
"specs": [
[
"==",
"0.4.0"
]
]
},
{
"name": "huggingface-hub",
"specs": [
[
"==",
"0.33.4"
]
]
},
{
"name": "humanfriendly",
"specs": [
[
"==",
"10.0"
]
]
},
{
"name": "humanize",
"specs": [
[
"==",
"4.12.3"
]
]
},
{
"name": "icetk",
"specs": [
[
"==",
"0.0.4"
]
]
},
{
"name": "identify",
"specs": [
[
"==",
"2.6.12"
]
]
},
{
"name": "idna",
"specs": [
[
"==",
"3.10"
]
]
},
{
"name": "imagehash",
"specs": [
[
"==",
"4.3.2"
]
]
},
{
"name": "imageio",
"specs": [
[
"==",
"2.37.0"
]
]
},
{
"name": "immutabledict",
"specs": [
[
"==",
"4.2.1"
]
]
},
{
"name": "importlib-metadata",
"specs": [
[
"==",
"8.7.0"
]
]
},
{
"name": "importlib-resources",
"specs": [
[
"==",
"5.13.0"
]
]
},
{
"name": "iniconfig",
"specs": [
[
"==",
"2.1.0"
]
]
},
{
"name": "jax",
"specs": [
[
"==",
"0.4.30"
]
]
},
{
"name": "jax",
"specs": [
[
"==",
"0.6.2"
]
]
},
{
"name": "jaxlib",
"specs": [
[
"==",
"0.4.30"
]
]
},
{
"name": "jaxlib",
"specs": [
[
"==",
"0.6.2"
]
]
},
{
"name": "jieba",
"specs": [
[
"==",
"0.42.1"
]
]
},
{
"name": "jinja2",
"specs": [
[
"==",
"3.1.6"
]
]
},
{
"name": "jiter",
"specs": [
[
"==",
"0.10.0"
]
]
},
{
"name": "jmespath",
"specs": [
[
"==",
"1.0.1"
]
]
},
{
"name": "joblib",
"specs": [
[
"==",
"1.5.1"
]
]
},
{
"name": "jsonpath-python",
"specs": [
[
"==",
"1.0.6"
]
]
},
{
"name": "kagglehub",
"specs": [
[
"==",
"0.3.12"
]
]
},
{
"name": "keras",
"specs": [
[
"==",
"2.11.0"
]
]
},
{
"name": "keras",
"specs": [
[
"==",
"3.10.0"
]
]
},
{
"name": "keras-hub",
"specs": [
[
"==",
"0.18.1"
]
]
},
{
"name": "keras-hub",
"specs": [
[
"==",
"0.19.0"
]
]
},
{
"name": "keras-hub",
"specs": [
[
"==",
"0.21.1"
]
]
},
{
"name": "keras-nlp",
"specs": [
[
"==",
"0.18.1"
]
]
},
{
"name": "keras-nlp",
"specs": [
[
"==",
"0.19.0"
]
]
},
{
"name": "keras-nlp",
"specs": [
[
"==",
"0.21.1"
]
]
},
{
"name": "keras-tuner",
"specs": [
[
"==",
"1.4.7"
]
]
},
{
"name": "kiwisolver",
"specs": [
[
"==",
"1.4.7"
]
]
},
{
"name": "kiwisolver",
"specs": [
[
"==",
"1.4.8"
]
]
},
{
"name": "kt-legacy",
"specs": [
[
"==",
"1.0.5"
]
]
},
{
"name": "langcodes",
"specs": [
[
"==",
"3.5.0"
]
]
},
{
"name": "langdetect",
"specs": [
[
"==",
"1.0.9"
]
]
},
{
"name": "language-data",
"specs": [
[
"==",
"1.3.0"
]
]
},
{
"name": "latex",
"specs": [
[
"==",
"0.7.0"
]
]
},
{
"name": "lazy-loader",
"specs": [
[
"==",
"0.4"
]
]
},
{
"name": "levenshtein",
"specs": [
[
"==",
"0.27.1"
]
]
},
{
"name": "libclang",
"specs": [
[
"==",
"18.1.1"
]
]
},
{
"name": "lightning-utilities",
"specs": [
[
"==",
"0.14.3"
]
]
},
{
"name": "llvmlite",
"specs": [
[
"==",
"0.43.0"
]
]
},
{
"name": "llvmlite",
"specs": [
[
"==",
"0.44.0"
]
]
},
{
"name": "logzio-python-handler",
"specs": [
[
"==",
"3.1.1"
]
]
},
{
"name": "lpips",
"specs": [
[
"==",
"0.1.4"
]
]
},
{
"name": "lxml",
"specs": [
[
"==",
"6.0.0"
]
]
},
{
"name": "mako",
"specs": [
[
"==",
"1.3.10"
]
]
},
{
"name": "marisa-trie",
"specs": [
[
"==",
"1.2.1"
]
]
},
{
"name": "markdown",
"specs": [
[
"==",
"3.8.2"
]
]
},
{
"name": "markdown-it-py",
"specs": [
[
"==",
"3.0.0"
]
]
},
{
"name": "markupsafe",
"specs": [
[
"==",
"3.0.2"
]
]
},
{
"name": "matplotlib",
"specs": [
[
"==",
"3.9.4"
]
]
},
{
"name": "matplotlib",
"specs": [
[
"==",
"3.10.3"
]
]
},
{
"name": "mccabe",
"specs": [
[
"==",
"0.7.0"
]
]
},
{
"name": "mdurl",
"specs": [
[
"==",
"0.1.2"
]
]
},
{
"name": "mistralai",
"specs": [
[
"==",
"1.5.2"
]
]
},
{
"name": "ml-dtypes",
"specs": [
[
"==",
"0.5.1"
]
]
},
{
"name": "mpmath",
"specs": [
[
"==",
"1.3.0"
]
]
},
{
"name": "msgpack",
"specs": [
[
"==",
"1.1.1"
]
]
},
{
"name": "msgspec",
"specs": [
[
"==",
"0.19.0"
]
]
},
{
"name": "multidict",
"specs": [
[
"==",
"6.6.3"
]
]
},
{
"name": "multilingual-clip",
"specs": [
[
"==",
"1.0.10"
]
]
},
{
"name": "multiprocess",
"specs": [
[
"==",
"0.70.16"
]
]
},
{
"name": "murmurhash",
"specs": [
[
"==",
"1.0.13"
]
]
},
{
"name": "mypy",
"specs": [
[
"==",
"1.16.0"
]
]
},
{
"name": "mypy-extensions",
"specs": [
[
"==",
"1.1.0"
]
]
},
{
"name": "namex",
"specs": [
[
"==",
"0.1.0"
]
]
},
{
"name": "necessary",
"specs": [
[
"==",
"0.4.3"
]
]
},
{
"name": "nest-asyncio",
"specs": [
[
"==",
"1.6.0"
]
]
},
{
"name": "networkx",
"specs": [
[
"==",
"3.2.1"
]
]
},
{
"name": "networkx",
"specs": [
[
"==",
"3.4.2"
]
]
},
{
"name": "networkx",
"specs": [
[
"==",
"3.5"
]
]
},
{
"name": "nltk",
"specs": [
[
"==",
"3.9.1"
]
]
},
{
"name": "nodeenv",
"specs": [
[
"==",
"1.9.1"
]
]
},
{
"name": "nudenet",
"specs": [
[
"==",
"2.0.9"
]
]
},
{
"name": "numba",
"specs": [
[
"==",
"0.60.0"
]
]
},
{
"name": "numba",
"specs": [
[
"==",
"0.61.2"
]
]
},
{
"name": "numpy",
"specs": [
[
"==",
"1.26.4"
]
]
},
{
"name": "nvidia-cublas-cu12",
"specs": [
[
"==",
"12.4.5.8"
]
]
},
{
"name": "nvidia-cuda-cupti-cu12",
"specs": [
[
"==",
"12.4.127"
]
]
},
{
"name": "nvidia-cuda-nvrtc-cu12",
"specs": [
[
"==",
"12.4.127"
]
]
},
{
"name": "nvidia-cuda-runtime-cu12",
"specs": [
[
"==",
"12.4.127"
]
]
},
{
"name": "nvidia-cudnn-cu12",
"specs": [
[
"==",
"9.1.0.70"
]
]
},
{
"name": "nvidia-cufft-cu12",
"specs": [
[
"==",
"11.2.1.3"
]
]
},
{
"name": "nvidia-curand-cu12",
"specs": [
[
"==",
"10.3.5.147"
]
]
},
{
"name": "nvidia-cusolver-cu12",
"specs": [
[
"==",
"11.6.1.9"
]
]
},
{
"name": "nvidia-cusparse-cu12",
"specs": [
[
"==",
"12.3.1.170"
]
]
},
{
"name": "nvidia-nccl-cu12",
"specs": [
[
"==",
"2.21.5"
]
]
},
{
"name": "nvidia-nvjitlink-cu12",
"specs": [
[
"==",
"12.4.127"
]
]
},
{
"name": "nvidia-nvtx-cu12",
"specs": [
[
"==",
"12.4.127"
]
]
},
{
"name": "oauthlib",
"specs": [
[
"==",
"3.3.1"
]
]
},
{
"name": "omegaconf",
"specs": [
[
"==",
"2.3.0"
]
]
},
{
"name": "onnxruntime",
"specs": [
[
"==",
"1.19.2"
]
]
},
{
"name": "onnxruntime",
"specs": [
[
"==",
"1.22.1"
]
]
},
{
"name": "open-clip-torch",
"specs": [
[
"==",
"2.32.0"
]
]
},
{
"name": "openai",
"specs": [
[
"==",
"1.97.0"
]
]
},
{
"name": "opencc",
"specs": [
[
"==",
"1.1.9"
]
]
},
{
"name": "opencv-python",
"specs": [
[
"==",
"4.8.1.78"
]
]
},
{
"name": "opencv-python-headless",
"specs": [
[
"==",
"4.11.0.86"
]
]
},
{
"name": "openpyxl",
"specs": [
[
"==",
"3.1.5"
]
]
},
{
"name": "opt-einsum",
"specs": [
[
"==",
"3.4.0"
]
]
},
{
"name": "optax",
"specs": [
[
"==",
"0.2.4"
]
]
},
{
"name": "optax",
"specs": [
[
"==",
"0.2.5"
]
]
},
{
"name": "optree",
"specs": [
[
"==",
"0.16.0"
]
]
},
{
"name": "orbax-checkpoint",
"specs": [
[
"==",
"0.6.4"
]
]
},
{
"name": "orbax-checkpoint",
"specs": [
[
"==",
"0.11.5"
]
]
},
{
"name": "outcome",
"specs": [
[
"==",
"1.3.0.post0"
]
]
},
{
"name": "packaging",
"specs": [
[
"==",
"25.0"
]
]
},
{
"name": "pandas",
"specs": [
[
"==",
"2.3.1"
]
]
},
{
"name": "parameterized",
"specs": [
[
"==",
"0.9.0"
]
]
},
{
"name": "pathspec",
"specs": [
[
"==",
"0.12.1"
]
]
},
{
"name": "pdf2image",
"specs": [
[
"==",
"1.17.0"
]
]
},
{
"name": "petname",
"specs": [
[
"==",
"2.6"
]
]
},
{
"name": "pillow",
"specs": [
[
"==",
"10.4.0"
]
]
},
{
"name": "platformdirs",
"specs": [
[
"==",
"4.3.8"
]
]
},
{
"name": "pluggy",
"specs": [
[
"==",
"1.6.0"
]
]
},
{
"name": "portalocker",
"specs": [
[
"==",
"3.2.0"
]
]
},
{
"name": "pre-commit",
"specs": [
[
"==",
"2.20.0"
]
]
},
{
"name": "preshed",
"specs": [
[
"==",
"3.0.10"
]
]
},
{
"name": "progressbar2",
"specs": [
[
"==",
"4.5.0"
]
]
},
{
"name": "propcache",
"specs": [
[
"==",
"0.3.2"
]
]
},
{
"name": "proto-plus",
"specs": [
[
"==",
"1.26.1"
]
]
},
{
"name": "protobuf",
"specs": [
[
"==",
"3.19.6"
]
]
},
{
"name": "protobuf",
"specs": [
[
"==",
"5.29.5"
]
]
},
{
"name": "psutil",
"specs": [
[
"==",
"7.0.0"
]
]
},
{
"name": "pyarrow",
"specs": [
[
"==",
"21.0.0"
]
]
},
{
"name": "pyarrow-hotfix",
"specs": [
[
"==",
"0.7"
]
]
},
{
"name": "pyasn1",
"specs": [
[
"==",
"0.6.1"
]
]
},
{
"name": "pyasn1-modules",
"specs": [
[
"==",
"0.4.2"
]
]
},
{
"name": "pycares",
"specs": [
[
"==",
"4.9.0"
]
]
},
{
"name": "pycocoevalcap",
"specs": [
[
"==",
"1.2"
]
]
},
{
"name": "pycocotools",
"specs": [
[
"==",
"2.0.10"
]
]
},
{
"name": "pycodestyle",
"specs": [
[
"==",
"2.9.1"
]
]
},
{
"name": "pycparser",
"specs": [
[
"==",
"2.22"
]
]
},
{
"name": "pydantic",
"specs": [
[
"==",
"2.11.7"
]
]
},
{
"name": "pydantic-core",
"specs": [
[
"==",
"2.33.2"
]
]
},
{
"name": "pydload",
"specs": [
[
"==",
"1.0.9"
]
]
},
{
"name": "pyflakes",
"specs": [
[
"==",
"2.5.0"
]
]
},
{
"name": "pygments",
"specs": [
[
"==",
"2.19.2"
]
]
},
{
"name": "pyhocon",
"specs": [
[
"==",
"0.3.61"
]
]
},
{
"name": "pymongo",
"specs": [
[
"==",
"4.13.2"
]
]
},
{
"name": "pyparsing",
"specs": [
[
"==",
"3.2.3"
]
]
},
{
"name": "pypinyin",
"specs": [
[
"==",
"0.49.0"
]
]
},
{
"name": "pyreadline3",
"specs": [
[
"==",
"3.5.4"
]
]
},
{
"name": "pysocks",
"specs": [
[
"==",
"1.7.1"
]
]
},
{
"name": "pytest",
"specs": [
[
"==",
"7.2.2"
]
]
},
{
"name": "python-dateutil",
"specs": [
[
"==",
"2.8.2"
]
]
},
{
"name": "python-utils",
"specs": [
[
"==",
"3.9.1"
]
]
},
{
"name": "pytorch-fid",
"specs": [
[
"==",
"0.3.0"
]
]
},
{
"name": "pytorch-lightning",
"specs": [
[
"==",
"2.5.2"
]
]
},
{
"name": "pytz",
"specs": [
[
"==",
"2025.2"
]
]
},
{
"name": "pywavelets",
"specs": [
[
"==",
"1.6.0"
]
]
},
{
"name": "pywavelets",
"specs": [
[
"==",
"1.8.0"
]
]
},
{
"name": "pywin32",
"specs": [
[
"==",
"311"
]
]
},
{
"name": "pyyaml",
"specs": [
[
"==",
"6.0.2"
]
]
},
{
"name": "qwen-vl-utils",
"specs": [
[
"==",
"0.0.11"
]
]
},
{
"name": "rapidfuzz",
"specs": [
[
"==",
"3.13.0"
]
]
},
{
"name": "regex",
"specs": [
[
"==",
"2024.11.6"
]
]
},
{
"name": "reka-api",
"specs": [
[
"==",
"2.0.0"
]
]
},
{
"name": "requests",
"specs": [
[
"==",
"2.32.4"
]
]
},
{
"name": "requests-oauthlib",
"specs": [
[
"==",
"2.0.0"
]
]
},
{
"name": "requirements-parser",
"specs": [
[
"==",
"0.13.0"
]
]
},
{
"name": "retrying",
"specs": [
[
"==",
"1.4.1"
]
]
},
{
"name": "rich",
"specs": [
[
"==",
"13.9.4"
]
]
},
{
"name": "rouge-score",
"specs": [
[
"==",
"0.1.2"
]
]
},
{
"name": "rsa",
"specs": [
[
"==",
"4.7.2"
]
]
},
{
"name": "s3transfer",
"specs": [
[
"==",
"0.13.1"
]
]
},
{
"name": "sacrebleu",
"specs": [
[
"==",
"2.5.1"
]
]
},
{
"name": "safetensors",
"specs": [
[
"==",
"0.5.3"
]
]
},
{
"name": "scaleapi",
"specs": [
[
"==",
"2.17.0"
]
]
},
{
"name": "scikit-image",
"specs": [
[
"==",
"0.24.0"
]
]
},
{
"name": "scikit-image",
"specs": [
[
"==",
"0.25.2"
]
]
},
{
"name": "scikit-learn",
"specs": [
[
"==",
"1.6.1"
]
]
},
{
"name": "scikit-learn",
"specs": [
[
"==",
"1.7.1"
]
]
},
{
"name": "scipy",
"specs": [
[
"==",
"1.13.1"
]
]
},
{
"name": "scipy",
"specs": [
[
"==",
"1.15.3"
]
]
},
{
"name": "scipy",
"specs": [
[
"==",
"1.16.0"
]
]
},
{
"name": "seaborn",
"specs": [
[
"==",
"0.13.2"
]
]
},
{
"name": "selenium",
"specs": [
[
"==",
"4.32.0"
]
]
},
{
"name": "selenium",
"specs": [
[
"==",
"4.34.2"
]
]
},
{
"name": "sentence-transformers",
"specs": [
[
"==",
"4.1.0"
]
]
},
{
"name": "sentencepiece",
"specs": [
[
"==",
"0.2.0"
]
]
},
{
"name": "sentry-sdk",
"specs": [
[
"==",
"2.33.1"
]
]
},
{
"name": "setuptools",
"specs": [
[
"==",
"80.9.0"
]
]
},
{
"name": "shapely",
"specs": [
[
"==",
"2.0.7"
]
]
},
{
"name": "shapely",
"specs": [
[
"==",
"2.1.1"
]
]
},
{
"name": "shellingham",
"specs": [
[
"==",
"1.5.4"
]
]
},
{
"name": "shutilwhich",
"specs": [
[
"==",
"1.1.0"
]
]
},
{
"name": "simple-slurm",
"specs": [
[
"==",
"0.2.7"
]
]
},
{
"name": "simplejson",
"specs": [
[
"==",
"3.20.1"
]
]
},
{
"name": "six",
"specs": [
[
"==",
"1.17.0"
]
]
},
{
"name": "smart-open",
"specs": [
[
"==",
"7.3.0.post1"
]
]
},
{
"name": "smashed",
"specs": [
[
"==",
"0.21.5"
]
]
},
{
"name": "smmap",
"specs": [
[
"==",
"5.0.2"
]
]
},
{
"name": "sniffio",
"specs": [
[
"==",
"1.3.1"
]
]
},
{
"name": "sortedcontainers",
"specs": [
[
"==",
"2.4.0"
]
]
},
{
"name": "soupsieve",
"specs": [
[
"==",
"2.7"
]
]
},
{
"name": "spacy",
"specs": [
[
"==",
"3.8.7"
]
]
},
{
"name": "spacy-legacy",
"specs": [
[
"==",
"3.0.12"
]
]
},
{
"name": "spacy-loggers",
"specs": [
[
"==",
"1.0.5"
]
]
},
{
"name": "sqlitedict",
"specs": [
[
"==",
"2.1.0"
]
]
},
{
"name": "srsly",
"specs": [
[
"==",
"2.5.1"
]
]
},
{
"name": "surge-api",
"specs": [
[
"==",
"1.5.10"
]
]
},
{
"name": "sympy",
"specs": [
[
"==",
"1.13.1"
]
]
},
{
"name": "tabulate",
"specs": [
[
"==",
"0.9.0"
]
]
},
{
"name": "tempdir",
"specs": [
[
"==",
"0.7.1"
]
]
},
{
"name": "tensorboard",
"specs": [
[
"==",
"2.11.2"
]
]
},
{
"name": "tensorboard",
"specs": [
[
"==",
"2.18.0"
]
]
},
{
"name": "tensorboard",
"specs": [
[
"==",
"2.20.0"
]
]
},
{
"name": "tensorboard-data-server",
"specs": [
[
"==",
"0.6.1"
]
]
},
{
"name": "tensorboard-data-server",
"specs": [
[
"==",
"0.7.2"
]
]
},
{
"name": "tensorboard-plugin-wit",
"specs": [
[
"==",
"1.8.1"
]
]
},
{
"name": "tensorflow",
"specs": [
[
"==",
"2.11.1"
]
]
},
{
"name": "tensorflow",
"specs": [
[
"==",
"2.18.1"
]
]
},
{
"name": "tensorflow",
"specs": [
[
"==",
"2.20.0"
]
]
},
{
"name": "tensorflow-estimator",
"specs": [
[
"==",
"2.11.0"
]
]
},
{
"name": "tensorflow-hub",
"specs": [
[
"==",
"0.16.1"
]
]
},
{
"name": "tensorflow-io-gcs-filesystem",
"specs": [
[
"==",
"0.37.1"
]
]
},
{
"name": "tensorflow-text",
"specs": [
[
"==",
"2.11.0"
]
]
},
{
"name": "tensorflow-text",
"specs": [
[
"==",
"2.18.1"
]
]
},
{
"name": "tensorstore",
"specs": [
[
"==",
"0.1.69"
]
]
},
{
"name": "tensorstore",
"specs": [
[
"==",
"0.1.74"
]
]
},
{
"name": "termcolor",
"specs": [
[
"==",
"3.1.0"
]
]
},
{
"name": "tf-keras",
"specs": [
[
"==",
"2.15.0"
]
]
},
{
"name": "thinc",
"specs": [
[
"==",
"8.3.4"
]
]
},
{
"name": "threadpoolctl",
"specs": [
[
"==",
"3.6.0"
]
]
},
{
"name": "tifffile",
"specs": [
[
"==",
"2024.8.30"
]
]
},
{
"name": "tifffile",
"specs": [
[
"==",
"2025.5.10"
]
]
},
{
"name": "tifffile",
"specs": [
[
"==",
"2025.6.11"
]
]
},
{
"name": "tiktoken",
"specs": [
[
"==",
"0.9.0"
]
]
},
{
"name": "timm",
"specs": [
[
"==",
"0.6.13"
]
]
},
{
"name": "together",
"specs": [
[
"==",
"1.3.14"
]
]
},
{
"name": "tokenizers",
"specs": [
[
"==",
"0.21.2"
]
]
},
{
"name": "toml",
"specs": [
[
"==",
"0.10.2"
]
]
},
{
"name": "tomli",
"specs": [
[
"==",
"2.2.1"
]
]
},
{
"name": "toolz",
"specs": [
[
"==",
"1.0.0"
]
]
},
{
"name": "torch",
"specs": [
[
"==",
"2.5.1"
]
]
},
{
"name": "torch-fidelity",
"specs": [
[
"==",
"0.3.0"
]
]
},
{
"name": "torchmetrics",
"specs": [
[
"==",
"0.11.4"
]
]
},
{
"name": "torchvision",
"specs": [
[
"==",
"0.20.1"
]
]
},
{
"name": "tqdm",
"specs": [
[
"==",
"4.67.1"
]
]
},
{
"name": "transformers",
"specs": [
[
"==",
"4.52.4"
]
]
},
{
"name": "transformers-stream-generator",
"specs": [
[
"==",
"0.0.5"
]
]
},
{
"name": "treescope",
"specs": [
[
"==",
"0.1.9"
]
]
},
{
"name": "trio",
"specs": [
[
"==",
"0.30.0"
]
]
},
{
"name": "trio-websocket",
"specs": [
[
"==",
"0.12.2"
]
]
},
{
"name": "triton",
"specs": [
[
"==",
"3.1.0"
]
]
},
{
"name": "trouting",
"specs": [
[
"==",
"0.3.3"
]
]
},
{
"name": "typer",
"specs": [
[
"==",
"0.15.3"
]
]
},
{
"name": "types-requests",
"specs": [
[
"==",
"2.31.0.6"
]
]
},
{
"name": "types-requests",
"specs": [
[
"==",
"2.32.4.20250611"
]
]
},
{
"name": "types-urllib3",
"specs": [
[
"==",
"1.26.25.14"
]
]
},
{
"name": "typing-extensions",
"specs": [
[
"==",
"4.14.1"
]
]
},
{
"name": "typing-inspect",
"specs": [
[
"==",
"0.9.0"
]
]
},
{
"name": "typing-inspection",
"specs": [
[
"==",
"0.4.1"
]
]
},
{
"name": "tzdata",
"specs": [
[
"==",
"2025.2"
]
]
},
{
"name": "uncertainty-calibration",
"specs": [
[
"==",
"0.1.4"
]
]
},
{
"name": "unidecode",
"specs": [
[
"==",
"1.4.0"
]
]
},
{
"name": "uritemplate",
"specs": [
[
"==",
"4.2.0"
]
]
},
{
"name": "urllib3",
"specs": [
[
"==",
"1.26.20"
]
]
},
{
"name": "urllib3",
"specs": [
[
"==",
"2.5.0"
]
]
},
{
"name": "virtualenv",
"specs": [
[
"==",
"20.32.0"
]
]
},
{
"name": "wandb",
"specs": [
[
"==",
"0.21.0"
]
]
},
{
"name": "wasabi",
"specs": [
[
"==",
"1.1.3"
]
]
},
{
"name": "wcwidth",
"specs": [
[
"==",
"0.2.13"
]
]
},
{
"name": "weasel",
"specs": [
[
"==",
"0.4.1"
]
]
},
{
"name": "websocket-client",
"specs": [
[
"==",
"1.8.0"
]
]
},
{
"name": "websockets",
"specs": [
[
"==",
"12.0"
]
]
},
{
"name": "websockets",
"specs": [
[
"==",
"14.2"
]
]
},
{
"name": "werkzeug",
"specs": [
[
"==",
"3.1.3"
]
]
},
{
"name": "wheel",
"specs": [
[
"==",
"0.45.1"
]
]
},
{
"name": "wrapt",
"specs": [
[
"==",
"1.17.2"
]
]
},
{
"name": "writer-sdk",
"specs": [
[
"==",
"2.2.1"
]
]
},
{
"name": "writerai",
"specs": [
[
"==",
"4.0.1"
]
]
},
{
"name": "wsproto",
"specs": [
[
"==",
"1.2.0"
]
]
},
{
"name": "xdoctest",
"specs": [
[
"==",
"1.2.0"
]
]
},
{
"name": "xlrd",
"specs": [
[
"==",
"2.0.2"
]
]
},
{
"name": "xxhash",
"specs": [
[
"==",
"3.5.0"
]
]
},
{
"name": "yarl",
"specs": [
[
"==",
"1.20.1"
]
]
},
{
"name": "zipp",
"specs": [
[
"==",
"3.23.0"
]
]
},
{
"name": "zstandard",
"specs": [
[
"==",
"0.18.0"
]
]
}
],
"lcname": "nvidia-crfm-helm"
}