inceptbench

- Name: inceptbench
- Version: 1.1.7
- Home page: https://github.com/incept-ai/inceptbench
- Summary: Comprehensive benchmark and evaluation framework for educational AI question generation
- Upload time: 2025-10-20 15:22:01
- Author: Trilogy Team
- Requires Python: <3.14,>=3.11
- License: Proprietary - Copyright Trilogy Education Services
- Keywords: education, evaluation, ai, questions, assessment, benchmark, edubench, scaffolding
# InceptBench

A CLI tool for evaluating educational questions with comprehensive AI-powered assessment. It runs evaluations locally using multiple modules, including `quality_evaluator`, `answer_verification`, `reading_question_qc`, and the EduBench tasks.

[![PyPI version](https://badge.fury.io/py/inceptbench.svg)](https://badge.fury.io/py/inceptbench)
[![Python Version](https://img.shields.io/pypi/pyversions/inceptbench.svg)](https://pypi.org/project/inceptbench/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

**Repository**: [https://github.com/trilogy-group/inceptbench](https://github.com/trilogy-group/inceptbench)

## Features

🎯 **Comprehensive Evaluation**
- **Internal Evaluator** - Scaffolding quality and DI compliance scoring (0-1 scale)
- **Answer Verification** - GPT-4o powered correctness checking
- **Reading Question QC** - MCQ distractor and question quality checks
- **EduBench Tasks** - Educational benchmarks (QA, EC, IP, AG, QG, TMG) (0-10 scale)

📊 **Flexible Output**
- Simplified mode (default) for quick score viewing (~95% smaller output)
- Full mode (`--full`) with all detailed metrics, issues, strengths, and reasoning
- Append mode (`-a`) for collecting multiple evaluations
- JSON output for easy integration

🚀 **Easy to Use**
- Simple CLI interface
- Runs locally with OpenAI and Anthropic API integrations
- Batch processing support
- High-throughput benchmark mode for parallel evaluation
- Only evaluates requested modules (configurable via `submodules_to_run`)

## Installation

```bash
pip install inceptbench

# Or upgrade to latest version
pip install inceptbench --upgrade --no-cache-dir
```

## Quick Start

### 1. Set up API Keys

Create a `.env` file in your working directory:

```bash
OPENAI_API_KEY=your_openai_key
ANTHROPIC_API_KEY=your_anthropic_key
HUGGINGFACE_TOKEN=your_hf_token  # Optional for EduBench tasks
```

### 2. Generate Sample File

```bash
inceptbench example
```

This creates `qs.json` with a complete example question including the `submodules_to_run` configuration.

### 3. Evaluate

```bash
# Simplified output (default)
inceptbench evaluate qs.json

# With progress messages
inceptbench evaluate qs.json --verbose

# Full detailed output
inceptbench evaluate qs.json --full --verbose
```

## Usage

### Commands

#### `evaluate` - Evaluate questions from JSON file

```bash
# Basic evaluation (simplified scores - default)
inceptbench evaluate questions.json

# Verbose output with progress messages
inceptbench evaluate questions.json --verbose

# Full detailed evaluation results
inceptbench evaluate questions.json --full

# Save results to file (overwrite)
inceptbench evaluate questions.json -o results.json

# Append results to file (creates if not exists)
inceptbench evaluate questions.json -a all_evaluations.json --verbose

# Full detailed results to file
inceptbench evaluate questions.json --full -o detailed_results.json --verbose
```
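If you prefer to drive the CLI from a script, here is a minimal sketch using Python's standard `subprocess` and `json` modules; the file names are placeholders, and the output shape follows Example 1 below:

```python
import json
import subprocess

# Run `inceptbench evaluate` and have it write its JSON output to a file.
# "questions.json" and "results.json" are placeholder file names.
subprocess.run(
    ["inceptbench", "evaluate", "questions.json", "-o", "results.json"],
    check=True,
)

# Read the saved results and print one final score per question.
with open("results.json", encoding="utf-8") as f:
    results = json.load(f)

for qid, scores in results["evaluations"].items():
    print(qid, scores["final_score"])
```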

#### `example` - Generate sample input file

```bash
# Generate qs.json (default)
inceptbench example

# Save to custom filename
inceptbench example -o sample.json
```

#### `benchmark` - High-throughput parallel evaluation

Process many questions in parallel for maximum throughput. Perfect for evaluating large datasets.

```bash
# Basic benchmark (100 parallel workers by default)
inceptbench benchmark questions.json

# Custom worker count
inceptbench benchmark questions.json --workers 50

# Save results with verbose output
inceptbench benchmark questions.json -o results.json --verbose

# With custom settings
inceptbench benchmark questions.json --workers 200 -o benchmark_results.json --verbose
```

**Benchmark Output:**
```json
{
  "request_id": "uuid",
  "total_questions": 100,
  "successful": 98,
  "failed": 2,
  "scores": [
    {
      "id": "q1",
      "final_score": 0.91,
      "scores": {
        "quality_evaluator": {"overall": 0.93},
        "answer_verification": {"is_correct": true},
        "reading_question_qc": {"overall_score": 0.8}
      }
    }
  ],
  "failed_ids": ["q42", "q87"],
  "evaluation_time_seconds": 45.3,
  "avg_score": 0.89
}
```
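To consume this summary programmatically, a small sketch that reads a summary saved with `-o` (the file name is a placeholder):

```python
import json

# Load a saved benchmark summary (path is a placeholder).
with open("results.json", encoding="utf-8") as f:
    summary = json.load(f)

# Report the headline numbers and any failures.
print(f"{summary['successful']}/{summary['total_questions']} succeeded, "
      f"avg score {summary['avg_score']:.3f}")
if summary["failed_ids"]:
    print("Failed question IDs:", ", ".join(summary["failed_ids"]))
```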

#### `help` - Show detailed help

```bash
inceptbench help
```

## Input Format

The input JSON file must contain:
- `submodules_to_run`: List of evaluation modules to run
- `generated_questions`: Array of questions to evaluate

**Available Modules:**
- `quality_evaluator` - Internal evaluator (scaffolding + DI compliance)
- `answer_verification` - GPT-4o answer correctness checking
- `reading_question_qc` - MCQ distractor quality checks
- `external_edubench` - EduBench educational tasks (QA, EC, IP, etc.)

**Example:**

```json
{
  "submodules_to_run": [
    "quality_evaluator",
    "answer_verification",
    "reading_question_qc"
  ],
  "generated_questions": [
    {
      "id": "q1",
      "type": "mcq",
      "question": "ุฅุฐุง ูƒุงู† ุซู…ู† 2 ู‚ู„ู… ู‡ูˆ 14 ุฑูŠุงู„ู‹ุงุŒ ูู…ุง ุซู…ู† 5 ุฃู‚ู„ุงู… ุจู†ูุณ ุงู„ู…ุนุฏู„ุŸ",
      "answer": "35 ุฑูŠุงู„ู‹ุง",
      "answer_explanation": "ุงู„ุฎุทูˆุฉ 1: ุชุญู„ูŠู„ ุงู„ู…ุณุฃู„ุฉ โ€” ู„ุฏูŠู†ุง ุซู…ู† 2 ู‚ู„ู… ูˆู‡ูˆ 14 ุฑูŠุงู„ู‹ุง. ู†ุญุชุงุฌ ุฅู„ู‰ ู…ุนุฑูุฉ ุซู…ู† 5 ุฃู‚ู„ุงู… ุจู†ูุณ ุงู„ู…ุนุฏู„. ูŠุฌุจ ุงู„ุชููƒูŠุฑ ููŠ ุงู„ุนู„ุงู‚ุฉ ุจูŠู† ุนุฏุฏ ุงู„ุฃู‚ู„ุงู… ูˆุงู„ุณุนุฑ ูˆูƒูŠููŠุฉ ุชุญูˆูŠู„ ุนุฏุฏ ุงู„ุฃู‚ู„ุงู… ุจู…ุนุฏู„ ุซุงุจุช.\nุงู„ุฎุทูˆุฉ 2: ุชุทูˆูŠุฑ ุงู„ุงุณุชุฑุงุชูŠุฌูŠุฉ โ€” ูŠู…ูƒู†ู†ุง ุฃูˆู„ู‹ุง ุฅูŠุฌุงุฏ ุซู…ู† ู‚ู„ู… ูˆุงุญุฏ ุจู‚ุณู…ุฉ 14 รท 2 = 7 ุฑูŠุงู„ุŒ ุซู… ุถุฑุจู‡ ููŠ 5 ู„ุฅูŠุฌุงุฏ ุซู…ู† 5 ุฃู‚ู„ุงู…: 7 ร— 5 = 35 ุฑูŠุงู„ู‹ุง.\nุงู„ุฎุทูˆุฉ 3: ุงู„ุชุทุจูŠู‚ ูˆุงู„ุชุญู‚ู‚ โ€” ู†ุชุญู‚ู‚ ู…ู† ู…ู†ุทู‚ูŠุฉ ุงู„ุฅุฌุงุจุฉ ุจู…ู‚ุงุฑู†ุฉ ุงู„ุณุนุฑ ุจุนุฏุฏ ุงู„ุฃู‚ู„ุงู…. ุงู„ุณุนุฑ ูŠุชู†ุงุณุจ ุทุฑุฏูŠู‹ุง ู…ุน ุงู„ุนุฏุฏุŒ ูˆุจุงู„ุชุงู„ูŠ 35 ุฑูŠุงู„ู‹ุง ู‡ูŠ ุงู„ุฅุฌุงุจุฉ ุงู„ุตุญูŠุญุฉ ูˆุงู„ู…ู†ุทู‚ูŠุฉ.",
      "answer_options": {
        "A": "28 ุฑูŠุงู„ู‹ุง",
        "B": "70 ุฑูŠุงู„ู‹ุง",
        "C": "30 ุฑูŠุงู„ู‹ุง",
        "D": "35 ุฑูŠุงู„ู‹ุง"
      },
      "skill": {
        "title": "Grade 6 Mid-Year Comprehensive Assessment",
        "grade": "6",
        "subject": "mathematics",
        "difficulty": "medium",
        "description": "Apply proportional reasoning, rational number operations, algebraic thinking, geometric measurement, and statistical analysis to solve multi-step real-world problems",
        "language": "ar"
      },
      "image_url": null,
      "additional_details": "๐Ÿ”น **Question generation logic:**\nThis question targets proportional reasoning for Grade 6 students, testing their ability to apply ratios and unit rates to real-world problems. It follows a classic proportionality structure โ€” starting with a known ratio (2 items for 14 riyals) and scaling it up to 5 items. The stepwise reasoning develops algebraic thinking and promotes estimation checks to confirm logical correctness.\n\n๐Ÿ”น **Personalized insight examples:**\n- Choosing 28 ุฑูŠุงู„ู‹ุง shows a misunderstanding by doubling instead of proportionally scaling.\n- Choosing 7 ุฑูŠุงู„ู‹ุง indicates the learner found the unit rate but didn't scale it up to 5.\n- Choosing 14 ุฑูŠุงู„ู‹ุง confuses the given 2-item cost with the required 5-item cost.\n\n๐Ÿ”น **Instructional design & DI integration:**\nThe question aligns with *Percent, Ratio, and Probability* learning targets. In DI format 15.7, it models how equivalent fractions and proportional relationships can predict outcomes across different scales. This builds foundational understanding for probability and proportional reasoning. By using a simple, relatable context (price of pens), it connects mathematical ratios to practical real-world applications, supporting concept transfer and cognitive engagement."
    }
  ]
}
```

Use `inceptbench example` to generate this file automatically.
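If you generate questions programmatically, you can also build this structure in Python instead of editing JSON by hand. A minimal sketch mirroring the documented format; all field values here are illustrative:

```python
import json

# Minimal input document: which modules to run, plus the questions themselves.
payload = {
    "submodules_to_run": ["quality_evaluator", "answer_verification"],
    "generated_questions": [
        {
            "id": "q1",
            "type": "mcq",
            "question": "What is 2 + 2?",
            "answer": "4",
            "answer_explanation": "Step 1: ... Step 2: ... Step 3: ...",
            "answer_options": {"A": "3", "B": "4", "C": "5", "D": "6"},
            "skill": {
                "title": "Basic addition",
                "grade": "1",
                "subject": "mathematics",
                "difficulty": "easy",
                "description": "Add single-digit numbers",
                "language": "en",
            },
            "image_url": None,
            "additional_details": "",
        }
    ],
}

# ensure_ascii=False keeps non-Latin question text readable in the file.
with open("questions.json", "w", encoding="utf-8") as f:
    json.dump(payload, f, ensure_ascii=False, indent=2)
```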

## Authentication

**Required API Keys:**

The tool integrates with OpenAI and Anthropic APIs for running evaluations. Create a `.env` file in your working directory:

```bash
OPENAI_API_KEY=your_openai_api_key
ANTHROPIC_API_KEY=your_anthropic_api_key
HUGGINGFACE_TOKEN=your_hf_token  # Optional, for EduBench tasks
```

The tool will automatically load these from the `.env` file when you run evaluations.
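Since the keys come from the environment, you can sanity-check them before kicking off a long run. A sketch using the `python-dotenv` package (whether inceptbench uses it internally is not documented here; this is just a convenient way to load a `.env` file yourself):

```python
import os

from dotenv import load_dotenv  # pip install python-dotenv

# Load .env from the current directory, then confirm the required keys exist.
load_dotenv()
for key in ("OPENAI_API_KEY", "ANTHROPIC_API_KEY"):
    if not os.environ.get(key):
        raise SystemExit(f"Missing required key: {key}")
print("API keys found.")
```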

## Output Examples

### Example 1: Evaluate Command - Simplified Mode (Default)

**Command:**
```bash
inceptbench evaluate questions.json
```

**Output:** Returns only essential scores (**~95% smaller output**)

```json
{
  "request_id": "c7bce978-66e9-4f8f-ac52-5468340fde8f",
  "evaluations": {
    "q1": {
      "quality_evaluator": {
        "overall": 0.9333333333333333
      },
      "answer_verification": {
        "is_correct": true
      },
      "reading_question_qc": {
        "overall_score": 0.8
      },
      "final_score": 0.9111111111111111
    },
    "q2": {
      "quality_evaluator": {
        "overall": 0.8777777777777778
      },
      "answer_verification": {
        "is_correct": false
      },
      "reading_question_qc": {
        "overall_score": 0.7
      },
      "final_score": 0.5259259259259259
    }
  },
  "evaluation_time_seconds": 12.15
}
```

**Note:** Only the modules requested in `submodules_to_run` appear in the output; unrequested modules are omitted.
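In both questions above, `final_score` matches the unweighted mean of the three module scores, with `is_correct` counted as 1.0 or 0.0: for q1, (0.9333... + 1.0 + 0.8) / 3 = 0.9111..., and for q2, (0.8777... + 0.0 + 0.7) / 3 = 0.5259.... Whether that is the general weighting is not documented here, but you can check the observation against your own output:

```python
# Check whether final_score equals the mean of the module scores
# (is_correct counted as 1.0/0.0). This is an observation from the
# examples above, not a documented guarantee.
evaluation = {
    "quality_evaluator": {"overall": 0.9333333333333333},
    "answer_verification": {"is_correct": True},
    "reading_question_qc": {"overall_score": 0.8},
    "final_score": 0.9111111111111111,
}

parts = [
    evaluation["quality_evaluator"]["overall"],
    1.0 if evaluation["answer_verification"]["is_correct"] else 0.0,
    evaluation["reading_question_qc"]["overall_score"],
]
assert abs(sum(parts) / len(parts) - evaluation["final_score"]) < 1e-9
```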

---

### Example 2: Evaluate Command - Full Mode

**Command:**
```bash
inceptbench evaluate questions.json --full
```

**Output:** Complete evaluation details with all scores, issues, strengths, reasoning, and recommendations:

```json
{
  "request_id": "a8d3f2e1-9c4b-4a7e-b5d6-1f2a3b4c5d6e",
  "evaluations": {
    "q1": {
      "quality_evaluator": {
        "overall": 0.9333333333333333,
        "scores": {
          "correctness": 1.0,
          "grade_alignment": 0.9,
          "difficulty_alignment": 0.9,
          "language_quality": 0.9,
          "pedagogical_value": 1.0,
          "explanation_quality": 0.9,
          "instruction_adherence": 1.0,
          "format_compliance": 1.0,
          "query_relevance": 1.0,
          "di_compliance": 0.9
        },
        "issues": [],
        "strengths": [
          "Excellent three-step scaffolding structure (Analyze โ†’ Strategy โ†’ Apply)",
          "Strong Direct Instruction compliance with clear modeling",
          "Grade-appropriate proportional reasoning for Grade 6",
          "Clear real-world context with pens and pricing"
        ],
        "recommendation": "accept",
        "suggested_improvements": [
          "Consider adding a visual diagram to support the proportional reasoning",
          "Could strengthen connection to DI Format 15.7 principles"
        ],
        "di_scores": {
          "overall": 0.9,
          "general_principles": 0.95,
          "format_alignment": 0.85,
          "grade_language": 0.9
        },
        "section_evaluations": {
          "question": {
            "section_score": 0.95,
            "issues": [],
            "strengths": [
              "Clear proportional reasoning problem",
              "Grade-appropriate difficulty"
            ],
            "recommendation": "accept"
          },
          "scaffolding": {
            "section_score": 0.92,
            "issues": [
              "Could include more explicit connection to prior knowledge"
            ],
            "strengths": [
              "Three-step structure follows best practices",
              "Verification step included"
            ],
            "recommendation": "accept"
          }
        }
      },
      "answer_verification": {
        "is_correct": true,
        "correct_answer": "35 riyals",
        "confidence": 10,
        "reasoning": "The answer is mathematically correct. To find the price of 5 pens when 2 pens cost 14 riyals: First find unit price: 14 รท 2 = 7 riyals per pen. Then multiply by 5: 7 ร— 5 = 35 riyals. The provided answer matches this calculation."
      },
      "reading_question_qc": {
        "overall_score": 0.8,
        "distractor_checks": {
          "plausibility": {
            "passed": true,
            "score": 0.9,
            "details": "All distractors represent common student errors",
            "category": "distractor"
          },
          "homogeneity": {
            "passed": true,
            "score": 0.85,
            "details": "Distractors have similar format and length",
            "category": "distractor"
          },
          "independence": {
            "passed": true,
            "score": 0.8,
            "details": "Each distractor represents a distinct error pattern",
            "category": "distractor"
          }
        },
        "question_checks": {
          "clarity": {
            "passed": true,
            "score": 0.9,
            "details": "Question is clear and unambiguous",
            "category": "question"
          },
          "complexity": {
            "passed": true,
            "score": 0.75,
            "details": "Appropriate complexity for grade level",
            "category": "question"
          }
        },
        "passed": true
      },
      "final_score": 0.9111111111111111
    }
  },
  "evaluation_time_seconds": 18.42
}
```
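Full mode is useful when triaging a batch. For example, to pull every issue and suggested improvement out of a saved `--full` result, a sketch (the file name is a placeholder):

```python
import json

# Collect issues and suggested improvements from a saved --full result.
with open("detailed_results.json", encoding="utf-8") as f:
    results = json.load(f)

for qid, modules in results["evaluations"].items():
    qe = modules.get("quality_evaluator", {})
    for issue in qe.get("issues", []):
        print(f"{qid}: issue: {issue}")
    for tip in qe.get("suggested_improvements", []):
        print(f"{qid}: improvement: {tip}")
```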

---

### Example 3: Benchmark Command - High-Throughput Parallel Mode

**Command:**
```bash
inceptbench benchmark questions.json --workers 100 --verbose
```

**Console Output:**
```
📂 Loading: questions.json
🚀 Benchmark mode: 10 questions with 100 workers
Evaluating questions: 100%|██████████| 10/10 [00:41<00:00,  4.12s/it]
✅ Saved to: benchmark_results.json
📊 Results: 10/10 successful
⏱️  Time: 41.23s
📈 Avg Score: 0.911
```

**Output File (benchmark_results.json):**
```json
{
  "request_id": "312d0684-49ed-4cc8-8ec3-9252daac89aa",
  "total_questions": 10,
  "successful": 10,
  "failed": 0,
  "scores": [
    {
      "id": "q1",
      "final_score": 0.9111111111111111,
      "scores": {
        "quality_evaluator": {
          "overall": 0.9333333333333333
        },
        "answer_verification": {
          "is_correct": true
        },
        "reading_question_qc": {
          "overall_score": 0.8
        }
      }
    },
    {
      "id": "q2",
      "final_score": 0.9074074074074074,
      "scores": {
        "quality_evaluator": {
          "overall": 0.9222222222222223
        },
        "answer_verification": {
          "is_correct": true
        },
        "reading_question_qc": {
          "overall_score": 0.8
        }
      }
    },
    {
      "id": "q3",
      "final_score": 0.9074074074074074,
      "scores": {
        "quality_evaluator": {
          "overall": 0.9222222222222223
        },
        "answer_verification": {
          "is_correct": true
        },
        "reading_question_qc": {
          "overall_score": 0.8
        }
      }
    },
    {
      "id": "q4",
      "final_score": 0.9148148148148149,
      "scores": {
        "quality_evaluator": {
          "overall": 0.9444444444444444
        },
        "answer_verification": {
          "is_correct": true
        },
        "reading_question_qc": {
          "overall_score": 0.8
        }
      }
    },
    {
      "id": "q5",
      "final_score": 0.9259259259259259,
      "scores": {
        "quality_evaluator": {
          "overall": 0.9777777777777779
        },
        "answer_verification": {
          "is_correct": true
        },
        "reading_question_qc": {
          "overall_score": 0.8
        }
      }
    },
    {
      "id": "q6",
      "final_score": 0.9185185185185185,
      "scores": {
        "quality_evaluator": {
          "overall": 0.9555555555555555
        },
        "answer_verification": {
          "is_correct": true
        },
        "reading_question_qc": {
          "overall_score": 0.8
        }
      }
    },
    {
      "id": "q7",
      "final_score": 0.9148148148148149,
      "scores": {
        "quality_evaluator": {
          "overall": 0.9444444444444444
        },
        "answer_verification": {
          "is_correct": true
        },
        "reading_question_qc": {
          "overall_score": 0.8
        }
      }
    },
    {
      "id": "q8",
      "final_score": 0.8814814814814814,
      "scores": {
        "quality_evaluator": {
          "overall": 0.9444444444444444
        },
        "answer_verification": {
          "is_correct": true
        },
        "reading_question_qc": {
          "overall_score": 0.7
        }
      }
    },
    {
      "id": "q9",
      "final_score": 0.9148148148148149,
      "scores": {
        "quality_evaluator": {
          "overall": 0.9444444444444444
        },
        "answer_verification": {
          "is_correct": true
        },
        "reading_question_qc": {
          "overall_score": 0.8
        }
      }
    },
    {
      "id": "q10",
      "final_score": 0.9111111111111111,
      "scores": {
        "quality_evaluator": {
          "overall": 0.9333333333333333
        },
        "answer_verification": {
          "is_correct": true
        },
        "reading_question_qc": {
          "overall_score": 0.8
        }
      }
    }
  ],
  "failed_ids": [],
  "evaluation_time_seconds": 41.23,
  "avg_score": 0.9107407407407407
}
```

**Key differences in benchmark mode:**
- Returns all questions at once with summary statistics
- Includes `total_questions`, `successful`, `failed` counts
- Lists `failed_ids` for easy debugging (see the retry sketch below)
- Shows `avg_score` across all questions
- Always uses simplified mode (no detailed scores)
- Optimized for high throughput with parallel processing
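The `failed_ids` list makes retries straightforward: filter the original input down to just the failed questions and re-run. A minimal sketch, assuming the input file matches the documented format (file names are placeholders):

```python
import json

# Build a retry input containing only the questions that failed the benchmark.
with open("benchmark_results.json", encoding="utf-8") as f:
    failed = set(json.load(f)["failed_ids"])

with open("questions.json", encoding="utf-8") as f:
    original = json.load(f)

retry = {
    "submodules_to_run": original["submodules_to_run"],
    "generated_questions": [
        q for q in original["generated_questions"] if q["id"] in failed
    ],
}

with open("retry.json", "w", encoding="utf-8") as f:
    json.dump(retry, f, ensure_ascii=False, indent=2)
```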

## Command Reference

| Command | Description |
|---------|-------------|
| `evaluate` | Evaluate questions from JSON file |
| `benchmark` | High-throughput parallel evaluation for large datasets |
| `example` | Generate sample input file |
| `help` | Show detailed help and usage examples |

### Evaluate Options

| Option | Short | Description |
|--------|-------|-------------|
| `--output PATH` | `-o` | Save results to file (overwrites) |
| `--append PATH` | `-a` | Append results to file (creates if not exists) |
| `--full` | `-f` | Return full detailed evaluation results (default: simplified scores only) |
| `--verbose` | `-v` | Show progress messages |
| `--timeout SECS` | `-t` | Request timeout in seconds (default: 600) |

### Benchmark Options

| Option | Short | Description |
|--------|-------|-------------|
| `--output PATH` | `-o` | Save results to file |
| `--workers NUM` | `-w` | Number of parallel workers (default: 100) |
| `--verbose` | `-v` | Show progress messages |

## Examples

### Basic Evaluation

```bash
# Evaluate with default settings (simplified scores)
inceptbench evaluate questions.json

# With progress messages
inceptbench evaluate questions.json --verbose
```

### Full Detailed Evaluation

```bash
# Get complete evaluation with all details
inceptbench evaluate questions.json --full --verbose

# Save full results to file
inceptbench evaluate questions.json --full -o detailed_results.json
```

### Collecting Multiple Evaluations

```bash
# Append multiple evaluations to one file
inceptbench evaluate test1.json -a all_results.json --verbose
inceptbench evaluate test2.json -a all_results.json --verbose
inceptbench evaluate test3.json -a all_results.json --verbose

# Result: all_results.json contains an array of all 3 evaluations
```
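Since append mode produces a JSON array, the collected file can be post-processed in one pass. A sketch (the file name is a placeholder, and each array element is assumed to have the same shape as a single `evaluate` response, as in Example 1):

```python
import json

# Summarize every run collected with -a. Each array element is assumed
# to look like one `evaluate` response (request_id + evaluations dict).
with open("all_results.json", encoding="utf-8") as f:
    runs = json.load(f)

for run in runs:
    scores = [q["final_score"] for q in run["evaluations"].values()]
    mean = sum(scores) / len(scores)
    print(run["request_id"], f"mean final_score: {mean:.3f}")
```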

### Batch Processing

```bash
# Evaluate all files and append to one results file
for file in questions/*.json; do
  inceptbench evaluate "$file" -a batch_results.json --verbose
done
```

### Benchmark Mode (High-Throughput Parallel Processing)

For large-scale evaluations, use benchmark mode to process hundreds of questions in parallel:

```bash
# Evaluate 100 questions with 100 parallel workers
inceptbench benchmark large_dataset.json --verbose

# Process 1000 questions with 200 workers, save results
inceptbench benchmark dataset_1000.json --workers 200 -o benchmark_results.json --verbose

# Results include: success rate, avg score, timing, and failed question IDs
```

**When to use benchmark mode:**
- Large datasets (100+ questions)
- Need for maximum throughput
- Want simplified scores only (no detailed output)
- Need to identify failed questions quickly

**Output includes:**
- Total questions processed
- Success/failure counts
- Failed question IDs for easy debugging
- Average score across all questions
- Total evaluation time
- One simplified score per question

## Evaluation Modules

### quality_evaluator (Internal Evaluator)
- Scaffolding quality assessment (answer_explanation structure)
- Direct Instruction (DI) compliance checking
- Pedagogical structure validation
- Language quality scoring
- Grade and difficulty alignment
- Returns scores on a 0-1 scale

### answer_verification
- GPT-4o powered correctness checking
- Mathematical accuracy validation
- Confidence scoring (0-10)
- Reasoning explanation

### reading_question_qc
- MCQ distractor quality checks
- Question clarity validation
- Overall quality scoring

### external_edubench
- **QA**: Question Answering - Can the model answer the question?
- **EC**: Error Correction - Can the model identify and correct errors?
- **IP**: Instructional Planning - Can the model provide step-by-step solutions?
- **AG**: Answer Generation - Can the model generate correct answers?
- **QG**: Question Generation - Question quality assessment
- **TMG**: Test Making Generation - Test design quality
- Returns scores on a 0-10 scale

All modules are optional and configurable via `submodules_to_run` in the input JSON.

## Requirements

- Python >= 3.11, < 3.14
- OpenAI API key
- Anthropic API key
- Hugging Face token (optional, for EduBench tasks)

## Support

- **Repository**: [https://github.com/trilogy-group/inceptbench](https://github.com/trilogy-group/inceptbench)
- **Issues**: [GitHub Issues](https://github.com/trilogy-group/inceptbench/issues)
- **Help**: Run `inceptbench help` for detailed documentation

## License

MIT License - see [LICENSE](LICENSE) file for details.

---

**Made by the Incept Team**

            
