novaeval


Name: novaeval
Version: 0.5.2
Summary: A comprehensive, open-source LLM evaluation framework for testing and benchmarking AI models
Author/maintainer email: Noveum AI <team@noveum.ai>
Upload time: 2025-08-29 06:18:10
Requires Python: >=3.9
License: Apache License 2.0
Keywords: llm, evaluation, ai, machine-learning, benchmarking, testing, rag, agents, conversational-ai, g-eval
Requirements: pydantic, pyyaml, requests, numpy, pandas, tqdm, click, rich, jinja2, plotly, scikit-learn, ijson, python-dotenv, noveum-trace, typing_extensions, datasets, transformers, openai, anthropic, boto3, sentence-transformers, ollama
# NovaEval by Noveum.ai

[![CI](https://github.com/Noveum/NovaEval/actions/workflows/ci.yml/badge.svg)](https://github.com/Noveum/NovaEval/actions/workflows/ci.yml)
[![Release](https://github.com/Noveum/NovaEval/actions/workflows/release.yml/badge.svg)](https://github.com/Noveum/NovaEval/actions/workflows/release.yml)
[![codecov](https://codecov.io/gh/Noveum/NovaEval/branch/main/graph/badge.svg)](https://codecov.io/gh/Noveum/NovaEval)
[![PyPI version](https://badge.fury.io/py/novaeval.svg)](https://badge.fury.io/py/novaeval)
[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
[![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)

A comprehensive, extensible AI model evaluation framework designed for production use. NovaEval provides a unified interface for evaluating language models across various datasets, metrics, and deployment scenarios.

> **We're looking for contributors!** See the [Contributing](#-contributing) section below for ways to help.

## ๐Ÿค We Need Your Help!

NovaEval is an open-source project that thrives on community contributions. Whether you're a seasoned developer or just getting started, there are many ways to contribute:

### 🎯 High-Priority Contribution Areas

We're actively looking for contributors in these key areas:

- **🧪 Unit Tests**: Help us improve our test coverage (currently 23% overall, 90%+ for core modules)
- **📚 Examples**: Create real-world evaluation examples and use cases
- **📝 Guides & Notebooks**: Write evaluation guides and interactive Jupyter notebooks
- **📖 Documentation**: Improve API documentation and user guides
- **🔍 RAG Metrics**: Add more metrics specifically for Retrieval-Augmented Generation evaluation
- **🤖 Agent Evaluation**: Build frameworks for evaluating AI agents and multi-turn conversations

### 🚀 Getting Started as a Contributor

1. **Start Small**: Pick up issues labeled `good first issue` or `help wanted`
2. **Join Discussions**: Share your ideas in [GitHub Discussions](https://github.com/Noveum/NovaEval/discussions)
3. **Review Code**: Help review pull requests and provide feedback
4. **Report Issues**: Found a bug? Report it in [GitHub Issues](https://github.com/Noveum/NovaEval/issues)
5. **Spread the Word**: Star the repository and share with your network

## 🚀 Features

- **Multi-Model Support**: Evaluate models from OpenAI, Anthropic, AWS Bedrock, and custom providers
- **Extensible Scoring**: Built-in scorers for accuracy, semantic similarity, code evaluation, and custom metrics
- **Dataset Integration**: Support for MMLU, HuggingFace datasets, custom datasets, and more
- **Production Ready**: Docker support, Kubernetes deployment, and cloud integrations
- **Comprehensive Reporting**: Detailed evaluation reports, artifacts, and visualizations
- **Secure**: Built-in credential management and secret store integration
- **Scalable**: Designed for both local testing and large-scale production evaluations
- **Cross-Platform**: Tested on macOS, Linux, and Windows with comprehensive CI/CD

## 📦 Installation

### From PyPI (Recommended)

```bash
pip install novaeval
```

### From Source

```bash
git clone https://github.com/Noveum/NovaEval.git
cd NovaEval
pip install -e .
```

### Docker

```bash
docker pull noveum/novaeval:latest
```

## ๐Ÿƒโ€โ™‚๏ธ Quick Start

### Basic Evaluation

```python
from novaeval import Evaluator
from novaeval.datasets import MMLUDataset
from novaeval.models import OpenAIModel
from novaeval.scorers import AccuracyScorer

# Configure for cost-conscious evaluation
MAX_TOKENS = 100  # Adjust based on budget: 5-10 for answers, 100+ for reasoning

# Initialize components
dataset = MMLUDataset(
    subset="elementary_mathematics",  # Easier subset for demo
    num_samples=10,
    split="test"
)

model = OpenAIModel(
    model_name="gpt-4o-mini",  # Cost-effective model
    temperature=0.0,
    max_tokens=MAX_TOKENS
)

scorer = AccuracyScorer(extract_answer=True)

# Create and run evaluation
evaluator = Evaluator(
    dataset=dataset,
    models=[model],
    scorers=[scorer],
    output_dir="./results"
)

results = evaluator.run()

# Display detailed results
for model_name, model_results in results["model_results"].items():
    for scorer_name, score_info in model_results["scores"].items():
        if isinstance(score_info, dict):
            mean_score = score_info.get("mean", 0)
            count = score_info.get("count", 0)
            print(f"{scorer_name}: {mean_score:.4f} ({count} samples)")
```

### Configuration-Based Evaluation

```python
from novaeval import Evaluator

# Load configuration from YAML/JSON
evaluator = Evaluator.from_config("evaluation_config.yaml")
results = evaluator.run()
```

### Command Line Interface

NovaEval provides a comprehensive CLI for running evaluations:

```bash
# Run evaluation from configuration file
novaeval run config.yaml

# Quick evaluation with minimal setup
novaeval quick -d mmlu -m gpt-4 -s accuracy

# List available datasets, models, and scorers
novaeval list-datasets
novaeval list-models
novaeval list-scorers

# Generate sample configuration
novaeval generate-config sample-config.yaml
```

📖 **[Complete CLI Reference](docs/cli-reference.md)** - Detailed documentation for all CLI commands and options

### Example Configuration

```yaml
# evaluation_config.yaml
dataset:
  type: "mmlu"
  subset: "abstract_algebra"
  num_samples: 500

models:
  - type: "openai"
    model_name: "gpt-4"
    temperature: 0.0
  - type: "anthropic"
    model_name: "claude-3-opus"
    temperature: 0.0

scorers:
  - type: "accuracy"
  - type: "semantic_similarity"
    threshold: 0.8

output:
  directory: "./results"
  formats: ["json", "csv", "html"]
  upload_to_s3: true
  s3_bucket: "my-eval-results"
```

## ๐ŸŒ HTTP API

NovaEval provides a FastAPI-based HTTP API for programmatic access to evaluation capabilities. This enables easy integration with web applications, microservices, and CI/CD pipelines.

### Quick API Start

```bash
# Install API dependencies
pip install -e ".[api]"

# Run the API server
uvicorn app.main:app --host 0.0.0.0 --port 8000

# Access interactive documentation
open http://localhost:8000/docs
```

### Core API Endpoints

- **Health Check**: `GET /health` - Service health status
- **Component Discovery**: `GET /api/v1/components/` - List available models, datasets, scorers
- **Model Operations**: `POST /api/v1/models/{model}/predict` - Generate predictions
- **Dataset Operations**: `POST /api/v1/datasets/{dataset}/load` - Load and query datasets
- **Scorer Operations**: `POST /api/v1/scorers/{scorer}/score` - Score predictions
- **Evaluation Jobs**: `POST /api/v1/evaluations/submit` - Submit async evaluation jobs
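
For example, the health and discovery endpoints can be exercised directly with `requests` (a minimal sketch, assuming the API server started in the Quick API Start above is listening on `localhost:8000`):

```python
import requests

BASE_URL = "http://localhost:8000"

# Check that the service is up
print(requests.get(f"{BASE_URL}/health").json())

# Discover the available models, datasets, and scorers
print(requests.get(f"{BASE_URL}/api/v1/components/").json())
```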

### Example API Usage

```python
import requests

# Submit evaluation via API
evaluation_config = {
    "name": "api_evaluation",
    "models": [{"provider": "openai", "identifier": "gpt-3.5-turbo"}],
    "datasets": [{"name": "mmlu", "split": "test", "limit": 10}],
    "scorers": [{"name": "accuracy"}]
}

response = requests.post(
    "http://localhost:8000/api/v1/evaluations/submit",
    json=evaluation_config
)

task_id = response.json()["task_id"]
print(f"Evaluation started: {task_id}")
```

### Deployment Options

- **Docker**: `docker run -p 8000:8000 novaeval-api:latest`
- **Kubernetes**: Full manifests provided in `kubernetes/`
- **Cloud Platforms**: Supports AWS, GCP, Azure with environment variable configuration

📖 **[Complete API Documentation](app/README.md)** - Detailed API reference, examples, and deployment guide

## ๐Ÿ—๏ธ Architecture

NovaEval is built with extensibility and modularity in mind:

```
src/novaeval/
├── datasets/          # Dataset loaders and processors
├── evaluators/        # Core evaluation logic
├── integrations/      # External service integrations
├── models/            # Model interfaces and adapters
├── reporting/         # Report generation and visualization
├── scorers/           # Scoring mechanisms and metrics
└── utils/             # Utility functions and helpers
```

### Core Components

- **Datasets**: Standardized interface for loading evaluation datasets
- **Models**: Unified API for different AI model providers
- **Scorers**: Pluggable scoring mechanisms for various evaluation metrics
- **Evaluators**: Orchestrate the evaluation process
- **Reporting**: Generates comprehensive reports and artifacts
- **Integrations**: Handle external services (S3, credential stores, etc.)

## 📊 Supported Datasets

- **MMLU**: Massive Multitask Language Understanding
- **HuggingFace**: Any dataset from the HuggingFace Hub
- **Custom**: JSON, CSV, or programmatic dataset definitions
- **Code Evaluation**: Programming benchmarks and code generation tasks
- **Agent Traces**: Multi-turn conversation and agent evaluation

## 🤖 Supported Models

- **OpenAI**: GPT-3.5, GPT-4, and newer models
- **Anthropic**: Claude family models
- **AWS Bedrock**: Amazon's managed AI services
- **Noveum AI Gateway**: Integration with Noveum's model gateway
- **Custom**: Extensible interface for any API-based model

## ๐Ÿ“ Built-in Scorers & Metrics

NovaEval provides a comprehensive suite of scorers organized by evaluation domain. All scorers implement the `BaseScorer` interface and support both synchronous and asynchronous evaluation.

### 🎯 Accuracy & Classification Metrics

#### **ExactMatchScorer**
- **Purpose**: Performs exact string matching between prediction and ground truth
- **Features**:
  - Case-sensitive/insensitive matching options
  - Whitespace normalization and stripping
  - Perfect for classification tasks with exact expected outputs
- **Use Cases**: Multiple choice questions, command validation, exact answer matching
- **Configuration**: `case_sensitive`, `strip_whitespace`, `normalize_whitespace`
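
A minimal usage sketch (assuming `ExactMatchScorer` is importable from `novaeval.scorers`, like the other scorers in this README):

```python
from novaeval.scorers import ExactMatchScorer

# Case-insensitive matching with whitespace stripping
scorer = ExactMatchScorer(case_sensitive=False, strip_whitespace=True)
print(scorer.score("  Paris ", "paris"))  # a normalized exact match scores 1.0
```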

#### **AccuracyScorer**
- **Purpose**: Advanced classification accuracy with answer extraction capabilities
- **Features**:
  - Intelligent answer extraction from model responses using multiple regex patterns
  - Support for MMLU-style multiple choice questions (A, B, C, D)
  - Letter-to-choice text conversion
  - Robust parsing of various answer formats
- **Use Cases**: MMLU evaluations, multiple choice tests, classification benchmarks
- **Configuration**: `extract_answer`, `answer_pattern`, `choices`

#### **F1Scorer**
- **Purpose**: Token-level F1 score for partial matching scenarios
- **Features**:
  - Calculates precision, recall, and F1 score
  - Configurable tokenization (word-level or character-level)
  - Case-sensitive/insensitive options
- **Use Cases**: Question answering, text summarization, partial credit evaluation
- **Returns**: Dictionary with `precision`, `recall`, `f1`, and `score` values
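
A short sketch of token-level F1 scoring (the `case_sensitive` keyword is an assumption based on the options listed above):

```python
from novaeval.scorers import F1Scorer

scorer = F1Scorer(case_sensitive=False)  # word-level tokenization assumed as default
result = scorer.score("the cat sat on the mat", "a cat sat on a mat")

# Per the description above, the result is a dictionary of metrics
print(result["precision"], result["recall"], result["f1"], result["score"])
```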

### 💬 Conversational AI Metrics

#### **KnowledgeRetentionScorer**
- **Purpose**: Evaluates if the LLM retains information provided by users throughout conversations
- **Features**:
  - Sophisticated knowledge extraction from conversation history
  - Sliding window approach for relevant context (configurable window size)
  - Detects when LLM asks for previously provided information
  - Tracks knowledge items with confidence scores
- **Use Cases**: Chatbots, virtual assistants, multi-turn conversations
- **Requirements**: LLM model for knowledge extraction, conversation context

#### **ConversationRelevancyScorer**
- **Purpose**: Measures response relevance to recent conversation context
- **Features**:
  - Sliding window context analysis
  - LLM-based relevance assessment (1-5 scale)
  - Context coherence evaluation
  - Conversation flow maintenance tracking
- **Use Cases**: Dialogue systems, context-aware assistants
- **Configuration**: `window_size` for context scope

#### **ConversationCompletenessScorer**
- **Purpose**: Assesses whether user intentions and requests are fully addressed
- **Features**:
  - Extracts user intentions from conversation history
  - Evaluates fulfillment level of each intention
  - Comprehensive coverage analysis
  - Outcome-based evaluation
- **Use Cases**: Customer service bots, task-oriented dialogue systems

#### **RoleAdherenceScorer**
- **Purpose**: Evaluates consistency with assigned persona or role
- **Features**:
  - Role consistency tracking throughout conversations
  - Character maintenance assessment
  - Persona adherence evaluation
  - Customizable role expectations
- **Use Cases**: Character-based chatbots, role-playing AI, specialized assistants
- **Configuration**: `expected_role` parameter
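
A hypothetical setup sketch (the `evaluate` call mirrors the async usage examples later in this README; `your_llm_model`, `input_text`, `output_text`, and `conv_context` are placeholders):

```python
from novaeval.scorers import RoleAdherenceScorer

scorer = RoleAdherenceScorer(
    model=your_llm_model,  # any NovaEval LLM wrapper used as the judge
    expected_role="polite customer support agent",
)
result = await scorer.evaluate(input_text, output_text, context=conv_context)
```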

#### **ConversationalMetricsScorer**
- **Purpose**: Comprehensive conversational evaluation combining multiple metrics
- **Features**:
  - Combines knowledge retention, relevancy, completeness, and role adherence
  - Configurable metric inclusion/exclusion
  - Weighted aggregation of individual scores
  - Detailed per-metric breakdown
- **Use Cases**: Holistic conversation quality assessment
- **Configuration**: Enable/disable individual metrics, window sizes, role expectations

### ๐Ÿ” RAG (Retrieval-Augmented Generation) Metrics

#### **AnswerRelevancyScorer**
- **Purpose**: Evaluates how relevant answers are to given questions
- **Features**:
  - Generates questions from answers using LLM
  - Semantic similarity comparison using embeddings (SentenceTransformers)
  - Multiple question generation for robust evaluation
  - Cosine similarity scoring
- **Use Cases**: RAG systems, Q&A applications, knowledge bases
- **Configuration**: `threshold`, `embedding_model`
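
A configuration sketch (the embedding model name is illustrative, not a documented default):

```python
from novaeval.scorers import AnswerRelevancyScorer

scorer = AnswerRelevancyScorer(
    model=your_llm_model,                # LLM used to generate questions from answers
    threshold=0.7,                       # pass/fail cutoff
    embedding_model="all-MiniLM-L6-v2",  # SentenceTransformers model (assumed)
)
result = await scorer.evaluate(question, answer)
```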

#### **FaithfulnessScorer**
- **Purpose**: Measures if responses are faithful to provided context without hallucinations
- **Features**:
  - Extracts factual claims from responses
  - Verifies each claim against source context
  - Three-tier verification: SUPPORTED/PARTIALLY_SUPPORTED/NOT_SUPPORTED
  - Detailed claim-by-claim analysis
- **Use Cases**: RAG faithfulness, fact-checking, source attribution
- **Configuration**: `threshold` for pass/fail determination

#### **ContextualPrecisionScorer**
- **Purpose**: Evaluates precision of retrieved context relevance
- **Features**:
  - Splits context into chunks for granular analysis
  - Relevance scoring per chunk (1-5 scale)
  - Intelligent context segmentation
  - Average relevance calculation
- **Use Cases**: Retrieval system evaluation, context quality assessment
- **Requirements**: Context must be provided for evaluation

#### **ContextualRecallScorer**
- **Purpose**: Measures if all necessary information for answering is present in context
- **Features**:
  - Extracts key information from expected outputs
  - Checks presence of each key fact in provided context
  - Three-tier presence detection: PRESENT/PARTIALLY_PRESENT/NOT_PRESENT
  - Comprehensive information coverage analysis
- **Use Cases**: Retrieval completeness, context sufficiency evaluation
- **Requirements**: Both context and expected output required

#### **RAGASScorer**
- **Purpose**: Composite RAGAS methodology combining multiple RAG metrics
- **Features**:
  - Integrates Answer Relevancy, Faithfulness, Contextual Precision, and Contextual Recall
  - Configurable weighted aggregation
  - Parallel execution of individual metrics
  - Comprehensive RAG pipeline evaluation
- **Use Cases**: Complete RAG system assessment, benchmark evaluation
- **Configuration**: Custom weights for each metric component

### 🤖 LLM-as-Judge Metrics

#### **GEvalScorer**
- **Purpose**: Uses LLMs with chain-of-thought reasoning for custom evaluation criteria
- **Features**:
  - Based on G-Eval research paper methodology
  - Configurable evaluation criteria and steps
  - Chain-of-thought reasoning support
  - Multiple evaluation iterations for consistency
  - Custom score ranges and thresholds
- **Use Cases**: Custom evaluation criteria, human-aligned assessment, complex judgments
- **Configuration**: `criteria`, `use_cot`, `num_iterations`, `threshold`
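
A sketch using the configuration options listed above (placeholder model and texts, as in the usage examples further below):

```python
from novaeval.scorers import GEvalScorer

scorer = GEvalScorer(
    model=your_llm_model,
    criteria="Is the response factually accurate and complete?",
    use_cot=True,       # chain-of-thought reasoning
    num_iterations=3,   # repeat evaluations for consistency
    threshold=0.6,
)
result = await scorer.evaluate(input_text, output_text)
```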

#### **CommonGEvalCriteria** (Predefined Criteria)
- **Correctness**: Factual accuracy and completeness assessment
- **Relevance**: Topic adherence and query alignment evaluation
- **Coherence**: Logical flow and structural consistency analysis
- **Helpfulness**: Practical value and actionability assessment

#### **PanelOfJudgesScorer**
- **Purpose**: Multi-LLM evaluation with diverse perspectives and aggregation
- **Features**:
  - Multiple LLM judges with individual weights and specialties
  - Configurable aggregation methods (mean, median, weighted, consensus, etc.)
  - Consensus requirement and threshold controls
  - Parallel judge evaluation for efficiency
  - Detailed individual and aggregate reasoning
- **Use Cases**: High-stakes evaluation, bias reduction, robust assessment
- **Configuration**: Judge models, weights, specialties, aggregation method

#### **SpecializedPanelScorer** (Panel Configurations)
- **Diverse Panel**: Different models with varied specialties (accuracy, clarity, completeness)
- **Consensus Panel**: High-consensus requirement for agreement-based decisions
- **Weighted Expert Panel**: Domain experts with expertise-based weighting

### 🎭 Agent Evaluation Metrics

#### **Tool Relevancy Scoring**
- **Purpose**: Evaluates appropriateness of tool calls given available tools
- **Features**: Compares selected tools against available tool catalog
- **Use Cases**: Agent tool selection assessment, action planning evaluation

#### **Tool Correctness Scoring**
- **Purpose**: Compares actual tool calls against expected tool calls
- **Features**: Detailed tool call comparison and correctness assessment
- **Use Cases**: Agent behavior validation, expected action verification

#### **Parameter Correctness Scoring**
- **Purpose**: Evaluates correctness of parameters passed to tool calls
- **Features**: Parameter validation against tool call results and expectations
- **Use Cases**: Tool usage quality, parameter selection accuracy

#### **Task Progression Scoring**
- **Purpose**: Measures agent progress toward assigned tasks
- **Features**: Analyzes task completion status and advancement quality
- **Use Cases**: Agent effectiveness measurement, task completion tracking

#### **Context Relevancy Scoring**
- **Purpose**: Assesses response appropriateness given agent's role and task
- **Features**: Role-task-response alignment evaluation
- **Use Cases**: Agent behavior consistency, contextual appropriateness

#### **Role Adherence Scoring**
- **Purpose**: Evaluates consistency with assigned agent role across actions
- **Features**: Comprehensive role consistency across tool calls and responses
- **Use Cases**: Agent persona maintenance, role-based behavior validation

#### **Goal Achievement Scoring**
- **Purpose**: Measures overall goal accomplishment using complete interaction traces
- **Features**: End-to-end goal evaluation with G-Eval methodology
- **Use Cases**: Agent effectiveness assessment, outcome-based evaluation

#### **Conversation Coherence Scoring**
- **Purpose**: Evaluates logical flow and context maintenance in agent conversations
- **Features**: Conversational coherence and context tracking analysis
- **Use Cases**: Agent dialogue quality, conversation flow assessment

#### **AgentScorers** (Convenience Class)
- **Purpose**: Unified interface for all agent evaluation metrics
- **Features**: Single class providing access to all agent scorers with consistent LLM model
- **Methods**: Individual scoring methods plus `score_all()` for comprehensive evaluation

### 🔧 Advanced Features

#### **BaseScorer Interface**
All scorers inherit from `BaseScorer` providing:
- **Statistics Tracking**: Automatic score history and statistics
- **Batch Processing**: Efficient batch scoring capabilities
- **Input Validation**: Robust input validation and error handling
- **Configuration Support**: Flexible configuration from dictionaries
- **Metadata Reporting**: Detailed scoring metadata and information

#### **ScoreResult Model**
Comprehensive scoring results include:
- **Numerical Score**: Primary evaluation score
- **Pass/Fail Status**: Threshold-based binary result
- **Detailed Reasoning**: Human-readable evaluation explanation
- **Rich Metadata**: Additional context and scoring details
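
A hypothetical sketch of consuming a `ScoreResult` (the attribute names are assumptions inferred from the fields above, not confirmed API):

```python
async def report(scorer, input_text, output_text):
    result = await scorer.evaluate(input_text, output_text)
    # Hypothetical attribute names mirroring the fields described above
    status = "PASS" if result.passed else "FAIL"
    print(f"{status} score={result.score:.3f}")
    print(f"reasoning: {result.reasoning}")
    print(f"metadata: {result.metadata}")
```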

### 📊 Usage Examples

```python
# Scorer classes are exported from novaeval.scorers (as elsewhere in this README);
# note that the `await` calls below must run inside an async function / event loop
from novaeval.scorers import (
    AccuracyScorer,
    AgentScorers,
    ConversationalMetricsScorer,
    RAGASScorer,
    SpecializedPanelScorer,
)

# Basic accuracy scoring
scorer = AccuracyScorer(extract_answer=True)
score = scorer.score("The answer is B", "B")

# Advanced conversational evaluation
conv_scorer = ConversationalMetricsScorer(
    model=your_llm_model,
    include_knowledge_retention=True,
    include_relevancy=True,
    window_size=10
)
result = await conv_scorer.evaluate(input_text, output_text, context=conv_context)

# RAG system evaluation
ragas = RAGASScorer(
    model=your_llm_model,
    weights={"faithfulness": 0.4, "answer_relevancy": 0.3, "contextual_precision": 0.3}
)
result = await ragas.evaluate(question, answer, context=retrieved_context)

# Panel-based evaluation
panel = SpecializedPanelScorer.create_diverse_panel(
    models=[model1, model2, model3],
    evaluation_criteria="overall quality and helpfulness"
)
result = await panel.evaluate(input_text, output_text)

# Agent evaluation
agent_scorers = AgentScorers(model=your_llm_model)
all_scores = agent_scorers.score_all(agent_data)
```

## 🚀 Deployment

### Local Development

```bash
# Install dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run example evaluation
python examples/basic_evaluation.py
```

### Docker

```bash
# Build image
docker build -t nova-eval .

# Run evaluation
docker run -v $(pwd)/config:/config -v $(pwd)/results:/results nova-eval --config /config/eval.yaml
```

### Kubernetes

```bash
# Deploy to Kubernetes
kubectl apply -f kubernetes/

# Check status
kubectl get pods -l app=nova-eval
```

## 🔧 Configuration

NovaEval supports configuration through:

- **YAML/JSON files**: Declarative configuration
- **Environment variables**: Runtime configuration
- **Python code**: Programmatic configuration
- **CLI arguments**: Command-line overrides

### Environment Variables

```bash
export NOVA_EVAL_OUTPUT_DIR="./results"
export NOVA_EVAL_LOG_LEVEL="INFO"
export OPENAI_API_KEY="your-api-key"
export AWS_ACCESS_KEY_ID="your-aws-key"
```

### CI/CD Integration

NovaEval includes optimized GitHub Actions workflows:
- **Unit tests** run on all PRs and pushes for quick feedback
- **Integration tests** run on main branch only to minimize API costs
- **Cross-platform testing** on macOS, Linux, and Windows

## 📈 Reporting and Artifacts

NovaEval generates comprehensive evaluation reports:

- **Summary Reports**: High-level metrics and insights
- **Detailed Results**: Per-sample predictions and scores
- **Visualizations**: Charts and graphs for result analysis
- **Artifacts**: Model outputs, intermediate results, and debug information
- **Export Formats**: JSON, CSV, HTML, PDF

### Example Report Structure

```
results/
├── summary.json              # High-level metrics
├── detailed_results.csv      # Per-sample results
├── artifacts/
│   ├── model_outputs/        # Raw model responses
│   ├── intermediate/         # Processing artifacts
│   └── debug/                # Debug information
├── visualizations/
│   ├── accuracy_by_category.png
│   ├── score_distribution.png
│   └── confusion_matrix.png
└── report.html               # Interactive HTML report
```

## 🔌 Extending NovaEval

### Custom Datasets

```python
from novaeval.datasets import BaseDataset

class MyCustomDataset(BaseDataset):
    def load_data(self):
        # Load samples from any source (file, API, database, ...)
        return [{"input": "2 + 2 = ?", "expected": "4"}]

    def get_sample(self, index):
        # Return a single sample by index
        return self.load_data()[index]
```

### Custom Scorers

```python
from novaeval.scorers import BaseScorer

class MyCustomScorer(BaseScorer):
    def score(self, prediction, ground_truth, context=None):
        # Return 1.0 for an exact match, 0.0 otherwise
        return float(prediction.strip() == ground_truth.strip())
```

### Custom Models

```python
from novaeval.models import BaseModel

class MyCustomModel(BaseModel):
    def generate(self, prompt, **kwargs):
        # Replace this stub with a call to your provider's API
        return f"Echo: {prompt}"
```

## ๐Ÿค Contributing

We welcome contributions! NovaEval is actively seeking contributors to help build a robust AI evaluation framework. Please see our [Contributing Guide](CONTRIBUTING.md) for detailed guidelines.

### 🎯 Priority Contribution Areas

As mentioned in the [We Need Your Help](#-we-need-your-help) section, we're particularly looking for help with:

1. **Unit Tests** - Expand test coverage beyond the current 23%
2. **Examples** - Real-world evaluation scenarios and use cases
3. **Guides & Notebooks** - Interactive evaluation tutorials
4. **Documentation** - API docs, user guides, and tutorials
5. **RAG Metrics** - Specialized metrics for retrieval-augmented generation
6. **Agent Evaluation** - Frameworks for multi-turn and agent-based evaluations

### Development Setup

```bash
# Clone repository
git clone https://github.com/Noveum/NovaEval.git
cd NovaEval

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install development dependencies
pip install -e ".[dev]"

# Install pre-commit hooks
pre-commit install

# Run tests
pytest

# Run with coverage
pytest --cov=src/novaeval --cov-report=html
```

### ๐Ÿ—๏ธ Contribution Workflow

1. **Fork** the repository
2. **Create** a feature branch (`git checkout -b feature/amazing-feature`)
3. **Make** your changes following our coding standards
4. **Add** tests for your changes
5. **Commit** your changes (`git commit -m 'Add amazing feature'`)
6. **Push** to the branch (`git push origin feature/amazing-feature`)
7. **Open** a Pull Request

### 📋 Contribution Guidelines

- **Code Quality**: Follow PEP 8 and use the provided pre-commit hooks
- **Testing**: Add unit tests for new features and bug fixes
- **Documentation**: Update documentation for API changes
- **Commit Messages**: Use conventional commit format
- **Issues**: Reference relevant issues in your PR description

### 🎉 Recognition

Contributors will be:
- Listed in our contributors page
- Mentioned in release notes for significant contributions
- Invited to join our contributor Discord community

## 📄 License

This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.

## ๐Ÿ™ Acknowledgments

- Inspired by evaluation frameworks like DeepEval, Confident AI, and Braintrust
- Built with modern Python best practices and industry standards
- Designed for the AI evaluation community

## 📞 Support

- **Documentation**: [https://noveum.github.io/NovaEval](https://noveum.github.io/NovaEval)
- **Issues**: [GitHub Issues](https://github.com/Noveum/NovaEval/issues)
- **Discussions**: [GitHub Discussions](https://github.com/Noveum/NovaEval/discussions)
- **Email**: support@noveum.ai

---

Made with โค๏ธ by the Noveum.ai team

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "novaeval",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": "Noveum AI <team@noveum.ai>",
    "keywords": "llm, evaluation, ai, machine-learning, benchmarking, testing, rag, agents, conversational-ai, g-eval",
    "author": null,
    "author_email": "Noveum AI <team@noveum.ai>",
    "download_url": "https://files.pythonhosted.org/packages/e7/4b/faa6256fd7c98b04a71108a15c419ee0d755a6803cedfd3a0150412d01a6/novaeval-0.5.2.tar.gz",
    "platform": null,
    "description": "# NovaEval by Noveum.ai\n\n[![CI](https://github.com/Noveum/NovaEval/actions/workflows/ci.yml/badge.svg)](https://github.com/Noveum/NovaEval/actions/workflows/ci.yml)\n[![Release](https://github.com/Noveum/NovaEval/actions/workflows/release.yml/badge.svg)](https://github.com/Noveum/NovaEval/actions/workflows/release.yml)\n[![codecov](https://codecov.io/gh/Noveum/NovaEval/branch/main/graph/badge.svg)](https://codecov.io/gh/Noveum/NovaEval)\n[![PyPI version](https://badge.fury.io/py/novaeval.svg)](https://badge.fury.io/py/novaeval)\n[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)\n[![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)\n\nA comprehensive, extensible AI model evaluation framework designed for production use. NovaEval provides a unified interface for evaluating language models across various datasets, metrics, and deployment scenarios.\n\n> **We're looking for contributors!** See the [Contributing](#-contributing) section below for ways to help.\n\n## \ud83e\udd1d We Need Your Help!\n\nNovaEval is an open-source project that thrives on community contributions. Whether you're a seasoned developer or just getting started, there are many ways to contribute:\n\n### \ud83c\udfaf High-Priority Contribution Areas\n\nWe're actively looking for contributors in these key areas:\n\n- **\ud83e\uddea Unit Tests**: Help us improve our test coverage (currently 23% overall, 90%+ for core modules)\n- **\ud83d\udcda Examples**: Create real-world evaluation examples and use cases\n- **\ud83d\udcdd Guides & Notebooks**: Write evaluation guides and interactive Jupyter notebooks\n- **\ud83d\udcd6 Documentation**: Improve API documentation and user guides\n- **\ud83d\udd0d RAG Metrics**: Add more metrics specifically for Retrieval-Augmented Generation evaluation\n- **\ud83e\udd16 Agent Evaluation**: Build frameworks for evaluating AI agents and multi-turn conversations\n\n### \ud83d\ude80 Getting Started as a Contributor\n\n1. **Start Small**: Pick up issues labeled `good first issue` or `help wanted`\n2. **Join Discussions**: Share your ideas in [GitHub Discussions](https://github.com/Noveum/NovaEval/discussions)\n3. **Review Code**: Help review pull requests and provide feedback\n4. **Report Issues**: Found a bug? Report it in [GitHub Issues](https://github.com/Noveum/NovaEval/issues)\n5. 
**Spread the Word**: Star the repository and share with your network\n\n## \ud83d\ude80 Features\n\n- **Multi-Model Support**: Evaluate models from OpenAI, Anthropic, AWS Bedrock, and custom providers\n- **Extensible Scoring**: Built-in scorers for accuracy, semantic similarity, code evaluation, and custom metrics\n- **Dataset Integration**: Support for MMLU, HuggingFace datasets, custom datasets, and more\n- **Production Ready**: Docker support, Kubernetes deployment, and cloud integrations\n- **Comprehensive Reporting**: Detailed evaluation reports, artifacts, and visualizations\n- **Secure**: Built-in credential management and secret store integration\n- **Scalable**: Designed for both local testing and large-scale production evaluations\n- **Cross-Platform**: Tested on macOS, Linux, and Windows with comprehensive CI/CD\n\n## \ud83d\udce6 Installation\n\n### From PyPI (Recommended)\n\n```bash\npip install novaeval\n```\n\n### From Source\n\n```bash\ngit clone https://github.com/Noveum/NovaEval.git\ncd NovaEval\npip install -e .\n```\n\n### Docker\n\n```bash\ndocker pull noveum/novaeval:latest\n```\n\n## \ud83c\udfc3\u200d\u2642\ufe0f Quick Start\n\n### Basic Evaluation\n\n```python\nfrom novaeval import Evaluator\nfrom novaeval.datasets import MMLUDataset\nfrom novaeval.models import OpenAIModel\nfrom novaeval.scorers import AccuracyScorer\n\n# Configure for cost-conscious evaluation\nMAX_TOKENS = 100  # Adjust based on budget: 5-10 for answers, 100+ for reasoning\n\n# Initialize components\ndataset = MMLUDataset(\n    subset=\"elementary_mathematics\",  # Easier subset for demo\n    num_samples=10,\n    split=\"test\"\n)\n\nmodel = OpenAIModel(\n    model_name=\"gpt-4o-mini\",  # Cost-effective model\n    temperature=0.0,\n    max_tokens=MAX_TOKENS\n)\n\nscorer = AccuracyScorer(extract_answer=True)\n\n# Create and run evaluation\nevaluator = Evaluator(\n    dataset=dataset,\n    models=[model],\n    scorers=[scorer],\n    output_dir=\"./results\"\n)\n\nresults = evaluator.run()\n\n# Display detailed results\nfor model_name, model_results in results[\"model_results\"].items():\n    for scorer_name, score_info in model_results[\"scores\"].items():\n        if isinstance(score_info, dict):\n            mean_score = score_info.get(\"mean\", 0)\n            count = score_info.get(\"count\", 0)\n            print(f\"{scorer_name}: {mean_score:.4f} ({count} samples)\")\n```\n\n### Configuration-Based Evaluation\n\n```python\nfrom novaeval import Evaluator\n\n# Load configuration from YAML/JSON\nevaluator = Evaluator.from_config(\"evaluation_config.yaml\")\nresults = evaluator.run()\n```\n\n### Command Line Interface\n\nNovaEval provides a comprehensive CLI for running evaluations:\n\n```bash\n# Run evaluation from configuration file\nnovaeval run config.yaml\n\n# Quick evaluation with minimal setup\nnovaeval quick -d mmlu -m gpt-4 -s accuracy\n\n# List available datasets, models, and scorers\nnovaeval list-datasets\nnovaeval list-models\nnovaeval list-scorers\n\n# Generate sample configuration\nnovaeval generate-config sample-config.yaml\n```\n\n\ud83d\udcd6 **[Complete CLI Reference](docs/cli-reference.md)** - Detailed documentation for all CLI commands and options\n\n### Example Configuration\n\n```yaml\n# evaluation_config.yaml\ndataset:\n  type: \"mmlu\"\n  subset: \"abstract_algebra\"\n  num_samples: 500\n\nmodels:\n  - type: \"openai\"\n    model_name: \"gpt-4\"\n    temperature: 0.0\n  - type: \"anthropic\"\n    model_name: \"claude-3-opus\"\n    temperature: 0.0\n\nscorers:\n  - type: 
\"accuracy\"\n  - type: \"semantic_similarity\"\n    threshold: 0.8\n\noutput:\n  directory: \"./results\"\n  formats: [\"json\", \"csv\", \"html\"]\n  upload_to_s3: true\n  s3_bucket: \"my-eval-results\"\n```\n\n## \ud83c\udf10 HTTP API\n\nNovaEval provides a FastAPI-based HTTP API for programmatic access to evaluation capabilities. This enables easy integration with web applications, microservices, and CI/CD pipelines.\n\n### Quick API Start\n\n```bash\n# Install API dependencies\npip install -e \".[api]\"\n\n# Run the API server\nuvicorn app.main:app --host 0.0.0.0 --port 8000\n\n# Access interactive documentation\nopen http://localhost:8000/docs\n```\n\n### Core API Endpoints\n\n- **Health Check**: `GET /health` - Service health status\n- **Component Discovery**: `GET /api/v1/components/` - List available models, datasets, scorers\n- **Model Operations**: `POST /api/v1/models/{model}/predict` - Generate predictions\n- **Dataset Operations**: `POST /api/v1/datasets/{dataset}/load` - Load and query datasets\n- **Scorer Operations**: `POST /api/v1/scorers/{scorer}/score` - Score predictions\n- **Evaluation Jobs**: `POST /api/v1/evaluations/submit` - Submit async evaluation jobs\n\n### Example API Usage\n\n```python\nimport requests\n\n# Submit evaluation via API\nevaluation_config = {\n    \"name\": \"api_evaluation\",\n    \"models\": [{\"provider\": \"openai\", \"identifier\": \"gpt-3.5-turbo\"}],\n    \"datasets\": [{\"name\": \"mmlu\", \"split\": \"test\", \"limit\": 10}],\n    \"scorers\": [{\"name\": \"accuracy\"}]\n}\n\nresponse = requests.post(\n    \"http://localhost:8000/api/v1/evaluations/submit\",\n    json=evaluation_config\n)\n\ntask_id = response.json()[\"task_id\"]\nprint(f\"Evaluation started: {task_id}\")\n```\n\n### Deployment Options\n\n- **Docker**: `docker run -p 8000:8000 novaeval-api:latest`\n- **Kubernetes**: Full manifests provided in `kubernetes/`\n- **Cloud Platforms**: Supports AWS, GCP, Azure with environment variable configuration\n\n\ud83d\udcd6 **[Complete API Documentation](app/README.md)** - Detailed API reference, examples, and deployment guide\n\n## \ud83c\udfd7\ufe0f Architecture\n\nNovaEval is built with extensibility and modularity in mind:\n\n```\nsrc/novaeval/\n\u251c\u2500\u2500 datasets/          # Dataset loaders and processors\n\u251c\u2500\u2500 evaluators/        # Core evaluation logic\n\u251c\u2500\u2500 integrations/      # External service integrations\n\u251c\u2500\u2500 models/           # Model interfaces and adapters\n\u251c\u2500\u2500 reporting/        # Report generation and visualization\n\u251c\u2500\u2500 scorers/          # Scoring mechanisms and metrics\n\u2514\u2500\u2500 utils/            # Utility functions and helpers\n```\n\n### Core Components\n\n- **Datasets**: Standardized interface for loading evaluation datasets\n- **Models**: Unified API for different AI model providers\n- **Scorers**: Pluggable scoring mechanisms for various evaluation metrics\n- **Evaluators**: Orchestrates the evaluation process\n- **Reporting**: Generates comprehensive reports and artifacts\n- **Integrations**: Handles external services (S3, credential stores, etc.)\n\n## \ud83d\udcca Supported Datasets\n\n- **MMLU**: Massive Multitask Language Understanding\n- **HuggingFace**: Any dataset from the HuggingFace Hub\n- **Custom**: JSON, CSV, or programmatic dataset definitions\n- **Code Evaluation**: Programming benchmarks and code generation tasks\n- **Agent Traces**: Multi-turn conversation and agent evaluation\n\n## \ud83e\udd16 Supported 
Models\n\n- **OpenAI**: GPT-3.5, GPT-4, and newer models\n- **Anthropic**: Claude family models\n- **AWS Bedrock**: Amazon's managed AI services\n- **Noveum AI Gateway**: Integration with Noveum's model gateway\n- **Custom**: Extensible interface for any API-based model\n\n## \ud83d\udccf Built-in Scorers & Metrics\n\nNovaEval provides a comprehensive suite of scorers organized by evaluation domain. All scorers implement the `BaseScorer` interface and support both synchronous and asynchronous evaluation.\n\n### \ud83c\udfaf Accuracy & Classification Metrics\n\n#### **ExactMatchScorer**\n- **Purpose**: Performs exact string matching between prediction and ground truth\n- **Features**:\n  - Case-sensitive/insensitive matching options\n  - Whitespace normalization and stripping\n  - Perfect for classification tasks with exact expected outputs\n- **Use Cases**: Multiple choice questions, command validation, exact answer matching\n- **Configuration**: `case_sensitive`, `strip_whitespace`, `normalize_whitespace`\n\n#### **AccuracyScorer**\n- **Purpose**: Advanced classification accuracy with answer extraction capabilities\n- **Features**:\n  - Intelligent answer extraction from model responses using multiple regex patterns\n  - Support for MMLU-style multiple choice questions (A, B, C, D)\n  - Letter-to-choice text conversion\n  - Robust parsing of various answer formats\n- **Use Cases**: MMLU evaluations, multiple choice tests, classification benchmarks\n- **Configuration**: `extract_answer`, `answer_pattern`, `choices`\n\n#### **F1Scorer**\n- **Purpose**: Token-level F1 score for partial matching scenarios\n- **Features**:\n  - Calculates precision, recall, and F1 score\n  - Configurable tokenization (word-level or character-level)\n  - Case-sensitive/insensitive options\n- **Use Cases**: Question answering, text summarization, partial credit evaluation\n- **Returns**: Dictionary with `precision`, `recall`, `f1`, and `score` values\n\n### \ud83d\udcac Conversational AI Metrics\n\n#### **KnowledgeRetentionScorer**\n- **Purpose**: Evaluates if the LLM retains information provided by users throughout conversations\n- **Features**:\n  - Sophisticated knowledge extraction from conversation history\n  - Sliding window approach for relevant context (configurable window size)\n  - Detects when LLM asks for previously provided information\n  - Tracks knowledge items with confidence scores\n- **Use Cases**: Chatbots, virtual assistants, multi-turn conversations\n- **Requirements**: LLM model for knowledge extraction, conversation context\n\n#### **ConversationRelevancyScorer**\n- **Purpose**: Measures response relevance to recent conversation context\n- **Features**:\n  - Sliding window context analysis\n  - LLM-based relevance assessment (1-5 scale)\n  - Context coherence evaluation\n  - Conversation flow maintenance tracking\n- **Use Cases**: Dialogue systems, context-aware assistants\n- **Configuration**: `window_size` for context scope\n\n#### **ConversationCompletenessScorer**\n- **Purpose**: Assesses whether user intentions and requests are fully addressed\n- **Features**:\n  - Extracts user intentions from conversation history\n  - Evaluates fulfillment level of each intention\n  - Comprehensive coverage analysis\n  - Outcome-based evaluation\n- **Use Cases**: Customer service bots, task-oriented dialogue systems\n\n#### **RoleAdherenceScorer**\n- **Purpose**: Evaluates consistency with assigned persona or role\n- **Features**:\n  - Role consistency tracking throughout conversations\n  - 
Character maintenance assessment\n  - Persona adherence evaluation\n  - Customizable role expectations\n- **Use Cases**: Character-based chatbots, role-playing AI, specialized assistants\n- **Configuration**: `expected_role` parameter\n\n#### **ConversationalMetricsScorer**\n- **Purpose**: Comprehensive conversational evaluation combining multiple metrics\n- **Features**:\n  - Combines knowledge retention, relevancy, completeness, and role adherence\n  - Configurable metric inclusion/exclusion\n  - Weighted aggregation of individual scores\n  - Detailed per-metric breakdown\n- **Use Cases**: Holistic conversation quality assessment\n- **Configuration**: Enable/disable individual metrics, window sizes, role expectations\n\n### \ud83d\udd0d RAG (Retrieval-Augmented Generation) Metrics\n\n#### **AnswerRelevancyScorer**\n- **Purpose**: Evaluates how relevant answers are to given questions\n- **Features**:\n  - Generates questions from answers using LLM\n  - Semantic similarity comparison using embeddings (SentenceTransformers)\n  - Multiple question generation for robust evaluation\n  - Cosine similarity scoring\n- **Use Cases**: RAG systems, Q&A applications, knowledge bases\n- **Configuration**: `threshold`, `embedding_model`\n\n#### **FaithfulnessScorer**\n- **Purpose**: Measures if responses are faithful to provided context without hallucinations\n- **Features**:\n  - Extracts factual claims from responses\n  - Verifies each claim against source context\n  - Three-tier verification: SUPPORTED/PARTIALLY_SUPPORTED/NOT_SUPPORTED\n  - Detailed claim-by-claim analysis\n- **Use Cases**: RAG faithfulness, fact-checking, source attribution\n- **Configuration**: `threshold` for pass/fail determination\n\n#### **ContextualPrecisionScorer**\n- **Purpose**: Evaluates precision of retrieved context relevance\n- **Features**:\n  - Splits context into chunks for granular analysis\n  - Relevance scoring per chunk (1-5 scale)\n  - Intelligent context segmentation\n  - Average relevance calculation\n- **Use Cases**: Retrieval system evaluation, context quality assessment\n- **Requirements**: Context must be provided for evaluation\n\n#### **ContextualRecallScorer**\n- **Purpose**: Measures if all necessary information for answering is present in context\n- **Features**:\n  - Extracts key information from expected outputs\n  - Checks presence of each key fact in provided context\n  - Three-tier presence detection: PRESENT/PARTIALLY_PRESENT/NOT_PRESENT\n  - Comprehensive information coverage analysis\n- **Use Cases**: Retrieval completeness, context sufficiency evaluation\n- **Requirements**: Both context and expected output required\n\n#### **RAGASScorer**\n- **Purpose**: Composite RAGAS methodology combining multiple RAG metrics\n- **Features**:\n  - Integrates Answer Relevancy, Faithfulness, Contextual Precision, and Contextual Recall\n  - Configurable weighted aggregation\n  - Parallel execution of individual metrics\n  - Comprehensive RAG pipeline evaluation\n- **Use Cases**: Complete RAG system assessment, benchmark evaluation\n- **Configuration**: Custom weights for each metric component\n\n### \ud83e\udd16 LLM-as-Judge Metrics\n\n#### **GEvalScorer**\n- **Purpose**: Uses LLMs with chain-of-thought reasoning for custom evaluation criteria\n- **Features**:\n  - Based on G-Eval research paper methodology\n  - Configurable evaluation criteria and steps\n  - Chain-of-thought reasoning support\n  - Multiple evaluation iterations for consistency\n  - Custom score ranges and thresholds\n- **Use Cases**: 
Custom evaluation criteria, human-aligned assessment, complex judgments\n- **Configuration**: `criteria`, `use_cot`, `num_iterations`, `threshold`\n\n#### **CommonGEvalCriteria** (Predefined Criteria)\n- **Correctness**: Factual accuracy and completeness assessment\n- **Relevance**: Topic adherence and query alignment evaluation\n- **Coherence**: Logical flow and structural consistency analysis\n- **Helpfulness**: Practical value and actionability assessment\n\n#### **PanelOfJudgesScorer**\n- **Purpose**: Multi-LLM evaluation with diverse perspectives and aggregation\n- **Features**:\n  - Multiple LLM judges with individual weights and specialties\n  - Configurable aggregation methods (mean, median, weighted, consensus, etc.)\n  - Consensus requirement and threshold controls\n  - Parallel judge evaluation for efficiency\n  - Detailed individual and aggregate reasoning\n- **Use Cases**: High-stakes evaluation, bias reduction, robust assessment\n- **Configuration**: Judge models, weights, specialties, aggregation method\n\n#### **SpecializedPanelScorer** (Panel Configurations)\n- **Diverse Panel**: Different models with varied specialties (accuracy, clarity, completeness)\n- **Consensus Panel**: High-consensus requirement for agreement-based decisions\n- **Weighted Expert Panel**: Domain experts with expertise-based weighting\n\n### \ud83c\udfad Agent Evaluation Metrics\n\n#### **Tool Relevancy Scoring**\n- **Purpose**: Evaluates appropriateness of tool calls given available tools\n- **Features**: Compares selected tools against available tool catalog\n- **Use Cases**: Agent tool selection assessment, action planning evaluation\n\n#### **Tool Correctness Scoring**\n- **Purpose**: Compares actual tool calls against expected tool calls\n- **Features**: Detailed tool call comparison and correctness assessment\n- **Use Cases**: Agent behavior validation, expected action verification\n\n#### **Parameter Correctness Scoring**\n- **Purpose**: Evaluates correctness of parameters passed to tool calls\n- **Features**: Parameter validation against tool call results and expectations\n- **Use Cases**: Tool usage quality, parameter selection accuracy\n\n#### **Task Progression Scoring**\n- **Purpose**: Measures agent progress toward assigned tasks\n- **Features**: Analyzes task completion status and advancement quality\n- **Use Cases**: Agent effectiveness measurement, task completion tracking\n\n#### **Context Relevancy Scoring**\n- **Purpose**: Assesses response appropriateness given agent's role and task\n- **Features**: Role-task-response alignment evaluation\n- **Use Cases**: Agent behavior consistency, contextual appropriateness\n\n#### **Role Adherence Scoring**\n- **Purpose**: Evaluates consistency with assigned agent role across actions\n- **Features**: Comprehensive role consistency across tool calls and responses\n- **Use Cases**: Agent persona maintenance, role-based behavior validation\n\n#### **Goal Achievement Scoring**\n- **Purpose**: Measures overall goal accomplishment using complete interaction traces\n- **Features**: End-to-end goal evaluation with G-Eval methodology\n- **Use Cases**: Agent effectiveness assessment, outcome-based evaluation\n\n#### **Conversation Coherence Scoring**\n- **Purpose**: Evaluates logical flow and context maintenance in agent conversations\n- **Features**: Conversational coherence and context tracking analysis\n- **Use Cases**: Agent dialogue quality, conversation flow assessment\n\n#### **AgentScorers** (Convenience Class)\n- **Purpose**: Unified interface 
\n### \ud83c\udfad Agent Evaluation Metrics\n\n#### **Tool Relevancy Scoring**\n- **Purpose**: Evaluates appropriateness of tool calls given available tools\n- **Features**: Compares selected tools against available tool catalog\n- **Use Cases**: Agent tool selection assessment, action planning evaluation\n\n#### **Tool Correctness Scoring**\n- **Purpose**: Compares actual tool calls against expected tool calls\n- **Features**: Detailed tool call comparison and correctness assessment\n- **Use Cases**: Agent behavior validation, expected action verification\n\n#### **Parameter Correctness Scoring**\n- **Purpose**: Evaluates correctness of parameters passed to tool calls\n- **Features**: Parameter validation against tool call results and expectations\n- **Use Cases**: Tool usage quality, parameter selection accuracy\n\n#### **Task Progression Scoring**\n- **Purpose**: Measures agent progress toward assigned tasks\n- **Features**: Analyzes task completion status and advancement quality\n- **Use Cases**: Agent effectiveness measurement, task completion tracking\n\n#### **Context Relevancy Scoring**\n- **Purpose**: Assesses response appropriateness given agent's role and task\n- **Features**: Role-task-response alignment evaluation\n- **Use Cases**: Agent behavior consistency, contextual appropriateness\n\n#### **Role Adherence Scoring**\n- **Purpose**: Evaluates consistency with assigned agent role across actions\n- **Features**: Comprehensive role consistency across tool calls and responses\n- **Use Cases**: Agent persona maintenance, role-based behavior validation\n\n#### **Goal Achievement Scoring**\n- **Purpose**: Measures overall goal accomplishment using complete interaction traces\n- **Features**: End-to-end goal evaluation with G-Eval methodology\n- **Use Cases**: Agent effectiveness assessment, outcome-based evaluation\n\n#### **Conversation Coherence Scoring**\n- **Purpose**: Evaluates logical flow and context maintenance in agent conversations\n- **Features**: Conversational coherence and context tracking analysis\n- **Use Cases**: Agent dialogue quality, conversation flow assessment\n\n#### **AgentScorers** (Convenience Class)\n- **Purpose**: Unified interface for all agent evaluation metrics\n- **Features**: Single class providing access to all agent scorers with consistent LLM model\n- **Methods**: Individual scoring methods plus `score_all()` for comprehensive evaluation\n\n### \ud83d\udd27 Advanced Features\n\n#### **BaseScorer Interface**\nAll scorers inherit from `BaseScorer` providing:\n- **Statistics Tracking**: Automatic score history and statistics\n- **Batch Processing**: Efficient batch scoring capabilities\n- **Input Validation**: Robust input validation and error handling\n- **Configuration Support**: Flexible configuration from dictionaries\n- **Metadata Reporting**: Detailed scoring metadata and information\n\n#### **ScoreResult Model**\nComprehensive scoring results include:\n- **Numerical Score**: Primary evaluation score\n- **Pass/Fail Status**: Threshold-based binary result\n- **Detailed Reasoning**: Human-readable evaluation explanation\n- **Rich Metadata**: Additional context and scoring details\n\n### \ud83d\udcca Usage Examples\n\n```python\n# These imports assume the scorer classes are exposed under novaeval.scorers (unverified).\nfrom novaeval.scorers import (\n    AccuracyScorer,\n    AgentScorers,\n    ConversationalMetricsScorer,\n    RAGASScorer,\n    SpecializedPanelScorer,\n)\n\n# Basic accuracy scoring\nscorer = AccuracyScorer(extract_answer=True)\nscore = scorer.score(\"The answer is B\", \"B\")\n\n# Advanced conversational evaluation\nconv_scorer = ConversationalMetricsScorer(\n    model=your_llm_model,\n    include_knowledge_retention=True,\n    include_relevancy=True,\n    window_size=10\n)\nresult = await conv_scorer.evaluate(input_text, output_text, context=conv_context)\n\n# RAG system evaluation\nragas = RAGASScorer(\n    model=your_llm_model,\n    weights={\"faithfulness\": 0.4, \"answer_relevancy\": 0.3, \"contextual_precision\": 0.3}\n)\nresult = await ragas.evaluate(question, answer, context=retrieved_context)\n\n# Panel-based evaluation\npanel = SpecializedPanelScorer.create_diverse_panel(\n    models=[model1, model2, model3],\n    evaluation_criteria=\"overall quality and helpfulness\"\n)\nresult = await panel.evaluate(input_text, output_text)\n\n# Agent evaluation\nagent_scorers = AgentScorers(model=your_llm_model)\nall_scores = agent_scorers.score_all(agent_data)\n```\n\n## \ud83d\ude80 Deployment\n\n### Local Development\n\n```bash\n# Install dependencies\npip install -e \".[dev]\"\n\n# Run tests\npytest\n\n# Run example evaluation\npython examples/basic_evaluation.py\n```\n\n### Docker\n\n```bash\n# Build image\ndocker build -t nova-eval .\n\n# Run evaluation\ndocker run -v $(pwd)/config:/config -v $(pwd)/results:/results nova-eval --config /config/eval.yaml\n```\n\n### Kubernetes\n\n```bash\n# Deploy to Kubernetes\nkubectl apply -f kubernetes/\n\n# Check status\nkubectl get pods -l app=nova-eval\n```\n\n## \ud83d\udd27 Configuration\n\nNovaEval supports configuration through:\n\n- **YAML/JSON files**: Declarative configuration (see the example below)\n- **Environment variables**: Runtime configuration\n- **Python code**: Programmatic configuration\n- **CLI arguments**: Command-line overrides\n\n### Environment Variables\n\n```bash\nexport NOVA_EVAL_OUTPUT_DIR=\"./results\"\nexport NOVA_EVAL_LOG_LEVEL=\"INFO\"\nexport OPENAI_API_KEY=\"your-api-key\"\nexport AWS_ACCESS_KEY_ID=\"your-aws-key\"\n```\n
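\n### Example YAML Configuration\n\nThe Docker command above mounts a `/config/eval.yaml`; the sketch below shows the general shape such a declarative config might take. Every key name here is an illustrative assumption, not NovaEval's documented schema:\n\n```yaml\n# Illustrative sketch only: keys are assumed, not a confirmed schema.\nmodel:\n  provider: openai\n  name: gpt-4\ndataset:\n  name: mmlu\n  split: test\n  limit: 100\nscorers:\n  - accuracy\noutput_dir: ./results\nlog_level: INFO\n```\n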
\n### CI/CD Integration\n\nNovaEval includes optimized GitHub Actions workflows:\n- **Unit tests** run on all PRs and pushes for quick feedback\n- **Integration tests** run on main branch only to minimize API costs\n- **Cross-platform testing** on macOS, Linux, and Windows\n\n## \ud83d\udcc8 Reporting and Artifacts\n\nNovaEval generates comprehensive evaluation reports:\n\n- **Summary Reports**: High-level metrics and insights\n- **Detailed Results**: Per-sample predictions and scores\n- **Visualizations**: Charts and graphs for result analysis\n- **Artifacts**: Model outputs, intermediate results, and debug information\n- **Export Formats**: JSON, CSV, HTML, PDF\n\n### Example Report Structure\n\n```\nresults/\n\u251c\u2500\u2500 summary.json              # High-level metrics\n\u251c\u2500\u2500 detailed_results.csv      # Per-sample results\n\u251c\u2500\u2500 artifacts/\n\u2502   \u251c\u2500\u2500 model_outputs/        # Raw model responses\n\u2502   \u251c\u2500\u2500 intermediate/         # Processing artifacts\n\u2502   \u2514\u2500\u2500 debug/                # Debug information\n\u251c\u2500\u2500 visualizations/\n\u2502   \u251c\u2500\u2500 accuracy_by_category.png\n\u2502   \u251c\u2500\u2500 score_distribution.png\n\u2502   \u2514\u2500\u2500 confusion_matrix.png\n\u2514\u2500\u2500 report.html               # Interactive HTML report\n```\n\n## \ud83d\udd0c Extending NovaEval\n\n### Custom Datasets\n\n```python\nfrom novaeval.datasets import BaseDataset\n\nclass MyCustomDataset(BaseDataset):\n    def load_data(self):\n        # Implement data loading; a hard-coded list keeps this stub runnable\n        self.samples = [{\"input\": \"2 + 2 = ?\", \"expected\": \"4\"}]\n        return self.samples\n\n    def get_sample(self, index):\n        # Return the individual sample at the given index\n        return self.samples[index]\n```\n\n### Custom Scorers\n\n```python\nfrom novaeval.scorers import BaseScorer\n\nclass MyCustomScorer(BaseScorer):\n    def score(self, prediction, ground_truth, context=None):\n        # Implement scoring logic; exact match keeps this stub runnable\n        return 1.0 if prediction.strip() == ground_truth.strip() else 0.0\n```\n
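\nA subclass like this can be exercised directly as a quick sanity check (assuming `BaseScorer` subclasses can be instantiated without extra constructor arguments, which this README does not confirm):\n\n```python\nscorer = MyCustomScorer()\nprint(scorer.score(\"Paris\", \"Paris\"))   # 1.0 (exact match)\nprint(scorer.score(\"Paris\", \"London\"))  # 0.0\n```\n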
\n### Custom Models\n\n```python\nfrom novaeval.models import BaseModel\n\nclass MyCustomModel(BaseModel):\n    def generate(self, prompt, **kwargs):\n        # Implement model inference; a canned reply keeps this stub runnable\n        return f\"stub response for: {prompt}\"\n```\n\n## \ud83e\udd1d Contributing\n\nWe welcome contributions! NovaEval is actively seeking contributors to help build a robust AI evaluation framework. Please see our [Contributing Guide](CONTRIBUTING.md) for detailed guidelines.\n\n### \ud83c\udfaf Priority Contribution Areas\n\nAs mentioned in the [We Need Your Help](#-we-need-your-help) section, we're particularly looking for help with:\n\n1. **Unit Tests** - Expand test coverage beyond the current 23%\n2. **Examples** - Real-world evaluation scenarios and use cases\n3. **Guides & Notebooks** - Interactive evaluation tutorials\n4. **Documentation** - API docs, user guides, and tutorials\n5. **RAG Metrics** - Specialized metrics for retrieval-augmented generation\n6. **Agent Evaluation** - Frameworks for multi-turn and agent-based evaluations\n\n### Development Setup\n\n```bash\n# Clone repository\ngit clone https://github.com/Noveum/NovaEval.git\ncd NovaEval\n\n# Create virtual environment\npython -m venv venv\nsource venv/bin/activate  # On Windows: venv\\\\Scripts\\\\activate\n\n# Install development dependencies\npip install -e \".[dev]\"\n\n# Install pre-commit hooks\npre-commit install\n\n# Run tests\npytest\n\n# Run with coverage\npytest --cov=src/novaeval --cov-report=html\n```\n\n### \ud83c\udfd7\ufe0f Contribution Workflow\n\n1. **Fork** the repository\n2. **Create** a feature branch (`git checkout -b feature/amazing-feature`)\n3. **Make** your changes following our coding standards\n4. **Add** tests for your changes\n5. **Commit** your changes (`git commit -m 'Add amazing feature'`)\n6. **Push** to the branch (`git push origin feature/amazing-feature`)\n7. **Open** a Pull Request\n\n### \ud83d\udccb Contribution Guidelines\n\n- **Code Quality**: Follow PEP 8 and use the provided pre-commit hooks\n- **Testing**: Add unit tests for new features and bug fixes\n- **Documentation**: Update documentation for API changes\n- **Commit Messages**: Use conventional commit format\n- **Issues**: Reference relevant issues in your PR description\n\n### \ud83c\udf89 Recognition\n\nContributors will be:\n- Listed on our contributors page\n- Mentioned in release notes for significant contributions\n- Invited to join our contributor Discord community\n\n## \ud83d\udcc4 License\n\nThis project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.\n\n## \ud83d\ude4f Acknowledgments\n\n- Inspired by evaluation frameworks like DeepEval, Confident AI, and Braintrust\n- Built with modern Python best practices and industry standards\n- Designed for the AI evaluation community\n\n## \ud83d\udcde Support\n\n- **Documentation**: [https://noveum.github.io/NovaEval](https://noveum.github.io/NovaEval)\n- **Issues**: [GitHub Issues](https://github.com/Noveum/NovaEval/issues)\n- **Discussions**: [GitHub Discussions](https://github.com/Noveum/NovaEval/discussions)\n- **Email**: support@noveum.ai\n\n---\n\nMade with \u2764\ufe0f by the Noveum.ai team\n",
    "bugtrack_url": null,
    "license": "Apache License\n                                   Version 2.0, January 2004\n                                http://www.apache.org/licenses/\n        \n           TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION\n        \n           1. Definitions.\n        \n              \"License\" shall mean the terms and conditions for use, reproduction,\n              and distribution as defined by Sections 1 through 9 of this document.\n        \n              \"Licensor\" shall mean the copyright owner or entity granting the License.\n        \n              \"Legal Entity\" shall mean the union of the acting entity and all\n              other entities that control, are controlled by, or are under common\n              control with that entity. For the purposes of this definition,\n              \"control\" means (i) the power, direct or indirect, to cause the\n              direction or management of such entity, whether by contract or\n              otherwise, or (ii) ownership of fifty percent (50%) or more of the\n              outstanding shares, or (iii) beneficial ownership of such entity.\n        \n              \"You\" (or \"Your\") shall mean an individual or Legal Entity\n              exercising permissions granted by this License.\n        \n              \"Source\" shall mean the preferred form for making modifications,\n              including but not limited to software source code, documentation\n              source, and configuration files.\n        \n              \"Object\" shall mean any form resulting from mechanical\n              transformation or translation of a Source form, including but\n              not limited to compiled object code, generated documentation,\n              and conversions to other media types.\n        \n              \"Work\" shall mean the work of authorship, whether in Source or\n              Object form, made available under the License, as indicated by a\n              copyright notice that is included in or attached to the work\n              (which shall not include communications that are solely for the\n              purpose of providing information about the License).\n        \n              \"Derivative Works\" shall mean any work, whether in Source or Object\n              form, that is based upon (or derived from) the Work and for which the\n              editorial revisions, annotations, elaborations, or other modifications\n              represent, as a whole, an original work of authorship. For the purposes\n              of this License, Derivative Works shall not include works that remain\n              separable from, or merely link (or bind by name) to the interfaces of,\n              the Work and derivative works thereof.\n        \n              \"Contribution\" shall mean any work of authorship, including\n              the original version of the Work and any modifications or additions\n              to that Work or Derivative Works thereof, that is intentionally\n              submitted to Licensor for inclusion in the Work by the copyright owner\n              or by an individual or Legal Entity authorized to submit on behalf of\n              the copyright owner. 
For the purposes of this definition, \"submitted\"\n              means any form of electronic, verbal, or written communication sent\n              to the Licensor or its representatives, including but not limited to\n              communication on electronic mailing lists, source code control\n              systems, and issue tracking systems that are managed by, or on behalf\n              of, the Licensor for the purpose of discussing and improving the Work,\n              but excluding communication that is conspicuously marked or otherwise\n              designated in writing by the copyright owner as \"Not a Contribution.\"\n        \n              \"Contributor\" shall mean Licensor and any individual or Legal Entity\n              on behalf of whom a Contribution has been received by Licensor and\n              subsequently incorporated within the Work.\n        \n           2. Grant of Copyright License. Subject to the terms and conditions of\n              this License, each Contributor hereby grants to You a perpetual,\n              worldwide, non-exclusive, no-charge, royalty-free, irrevocable\n              copyright license to use, reproduce, modify, distribute, and prepare\n              Derivative Works of, publicly display, publicly perform, sublicense,\n              and distribute the Work and such Derivative Works in Source or Object\n              form.\n        \n           3. Grant of Patent License. Subject to the terms and conditions of\n              this License, each Contributor hereby grants to You a perpetual,\n              worldwide, non-exclusive, no-charge, royalty-free, irrevocable\n              (except as stated in this section) patent license to make, have made,\n              use, offer to sell, sell, import, and otherwise transfer the Work,\n              where such license applies only to those patent claims licensable\n              by such Contributor that are necessarily infringed by their\n              Contribution(s) alone or by combination of their Contribution(s)\n              with the Work to which such Contribution(s) was submitted. If You\n              institute patent litigation against any entity (including a\n              cross-claim or counterclaim in a lawsuit) alleging that the Work\n              or a Contribution incorporated within the Work constitutes direct\n              or contributory patent infringement, then any patent licenses\n              granted to You under this License for that Work shall terminate\n              as of the date such litigation is filed.\n        \n           4. Redistribution. 
You may reproduce and distribute copies of the\n              Work or Derivative Works thereof in any medium, with or without\n              modifications, and in Source or Object form, provided that You\n              meet the following conditions:\n        \n              (a) You must give any other recipients of the Work or\n                  Derivative Works a copy of this License; and\n        \n              (b) You must cause any modified files to carry prominent notices\n                  stating that You changed the files; and\n        \n              (c) You must retain, in the Source form of any Derivative Works\n                  that You distribute, all copyright, trademark, patent,\n                  attribution and other notices from the Source form of the Work,\n                  excluding those notices that do not pertain to any part of\n                  the Derivative Works; and\n        \n              (d) If the Work includes a \"NOTICE\" text file as part of its\n                  distribution, then any Derivative Works that You distribute must\n                  include a readable copy of the attribution notices contained\n                  within such NOTICE file, excluding those notices that do not\n                  pertain to any part of the Derivative Works, in at least one\n                  of the following places: within a NOTICE text file distributed\n                  as part of the Derivative Works; within the Source form or\n                  documentation, if provided along with the Derivative Works; or,\n                  within a display generated by the Derivative Works, if and\n                  wherever such third-party notices normally appear. The contents\n                  of the NOTICE file are for informational purposes only and\n                  do not modify the License. You may add Your own attribution\n                  notices within Derivative Works that You distribute, alongside\n                  or as an addendum to the NOTICE text from the Work, provided\n                  that such additional attribution notices cannot be construed\n                  as modifying the License.\n        \n              You may add Your own copyright notice to Your modifications and\n              may provide additional or different license terms and conditions\n              for use, reproduction, or distribution of Your modifications, or\n              for any such Derivative Works as a whole, provided Your use,\n              reproduction, and distribution of the Work otherwise complies with\n              the conditions stated in this License.\n        \n           5. Submission of Contributions. Unless You explicitly state otherwise,\n              any Contribution intentionally submitted for inclusion in the Work\n              by You to the Licensor shall be under the terms and conditions of\n              this License, without any additional terms or conditions.\n              Notwithstanding the above, nothing herein shall supersede or modify\n              the terms of any separate license agreement you may have executed\n              with Licensor regarding such Contributions.\n        \n           6. Trademarks. This License does not grant permission to use the trade\n              names, trademarks, service marks, or product names of the Licensor,\n              except as required for reasonable and customary use in describing the\n              origin of the Work and reproducing the content of the NOTICE file.\n        \n           7. 
Disclaimer of Warranty. Unless required by applicable law or\n              agreed to in writing, Licensor provides the Work (and each\n              Contributor provides its Contributions) on an \"AS IS\" BASIS,\n              WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or\n              implied, including, without limitation, any warranties or conditions\n              of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A\n              PARTICULAR PURPOSE. You are solely responsible for determining the\n              appropriateness of using or redistributing the Work and assume any\n              risks associated with Your exercise of permissions under this License.\n        \n           8. Limitation of Liability. In no event and under no legal theory,\n              whether in tort (including negligence), contract, or otherwise,\n              unless required by applicable law (such as deliberate and grossly\n              negligent acts) or agreed to in writing, shall any Contributor be\n              liable to You for damages, including any direct, indirect, special,\n              incidental, or consequential damages of any character arising as a\n              result of this License or out of the use or inability to use the\n              Work (including but not limited to damages for loss of goodwill,\n              work stoppage, computer failure or malfunction, or any and all\n              other commercial damages or losses), even if such Contributor\n              has been advised of the possibility of such damages.\n        \n           9. Accepting Warranty or Support. You may choose to offer, and to\n              charge a fee for, warranty, support, indemnity or other liability\n              obligations and/or rights consistent with this License. However, in\n              accepting such obligations, You may act only on Your own behalf and on\n              Your sole responsibility, not on behalf of any other Contributor, and\n              only if You agree to indemnify, defend, and hold each Contributor\n              harmless for any liability incurred by, or claims asserted against,\n              such Contributor by reason of your accepting any such warranty or support.\n        \n           END OF TERMS AND CONDITIONS\n        \n           APPENDIX: How to apply the Apache License to your work.\n        \n              To apply the Apache License to your work, attach the following\n              boilerplate notice, with the fields enclosed by brackets \"[]\"\n              replaced with your own identifying information. (Don't include\n              the brackets!)  The text should be enclosed in the appropriate\n              comment syntax for the file format. 
We also recommend that a\n              file or class name and description of purpose be included on the\n              same page as the copyright notice for easier identification within\n              third-party archives.\n        \n           Copyright 2024 Noveum\n        \n           Licensed under the Apache License, Version 2.0 (the \"License\");\n           you may not use this file except in compliance with the License.\n           You may obtain a copy of the License at\n        \n               http://www.apache.org/licenses/LICENSE-2.0\n        \n           Unless required by applicable law or agreed to in writing, software\n           distributed under the License is distributed on an \"AS IS\" BASIS,\n           WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n           See the License for the specific language governing permissions and\n           limitations under the License.\n        ",
    "summary": "A comprehensive, open-source LLM evaluation framework for testing and benchmarking AI models",
    "version": "0.5.2",
    "project_urls": {
        "Bug Tracker": "https://github.com/Noveum/NovaEval/issues",
        "Changelog": "https://github.com/Noveum/NovaEval/blob/main/CHANGELOG.md",
        "Documentation": "https://novaeval.readthedocs.io",
        "Homepage": "https://github.com/Noveum/NovaEval",
        "Repository": "https://github.com/Noveum/NovaEval"
    },
    "split_keywords": [
        "llm",
        " evaluation",
        " ai",
        " machine-learning",
        " benchmarking",
        " testing",
        " rag",
        " agents",
        " conversational-ai",
        " g-eval"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "15a379aad78f26573ae9a293cac9aad2ea17f9263fa00f6390e54ad161d4ffa2",
                "md5": "bbe0e6e49367de8a9b153d0af32a68ac",
                "sha256": "4c747740dc4479e275cef30cd5d142f7f48d4ad94ddb9f2488e709453c7200dc"
            },
            "downloads": -1,
            "filename": "novaeval-0.5.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "bbe0e6e49367de8a9b153d0af32a68ac",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9",
            "size": 172326,
            "upload_time": "2025-08-29T06:18:08",
            "upload_time_iso_8601": "2025-08-29T06:18:08.214514Z",
            "url": "https://files.pythonhosted.org/packages/15/a3/79aad78f26573ae9a293cac9aad2ea17f9263fa00f6390e54ad161d4ffa2/novaeval-0.5.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "e74bfaa6256fd7c98b04a71108a15c419ee0d755a6803cedfd3a0150412d01a6",
                "md5": "d49ec33729679843258a0d8bb88e0f54",
                "sha256": "1a80ed1108c5b8e32525bd00903ac30528885f875d6d3970f82be373e0e54952"
            },
            "downloads": -1,
            "filename": "novaeval-0.5.2.tar.gz",
            "has_sig": false,
            "md5_digest": "d49ec33729679843258a0d8bb88e0f54",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 2504026,
            "upload_time": "2025-08-29T06:18:10",
            "upload_time_iso_8601": "2025-08-29T06:18:10.128105Z",
            "url": "https://files.pythonhosted.org/packages/e7/4b/faa6256fd7c98b04a71108a15c419ee0d755a6803cedfd3a0150412d01a6/novaeval-0.5.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-08-29 06:18:10",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "Noveum",
    "github_project": "NovaEval",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [
        {
            "name": "pydantic",
            "specs": [
                [
                    ">=",
                    "2.0.0"
                ]
            ]
        },
        {
            "name": "pyyaml",
            "specs": [
                [
                    ">=",
                    "6.0"
                ]
            ]
        },
        {
            "name": "requests",
            "specs": [
                [
                    ">=",
                    "2.28.0"
                ]
            ]
        },
        {
            "name": "numpy",
            "specs": [
                [
                    ">=",
                    "1.21.0"
                ]
            ]
        },
        {
            "name": "pandas",
            "specs": [
                [
                    ">=",
                    "1.3.0"
                ]
            ]
        },
        {
            "name": "tqdm",
            "specs": [
                [
                    ">=",
                    "4.64.0"
                ]
            ]
        },
        {
            "name": "click",
            "specs": [
                [
                    ">=",
                    "8.0.0"
                ]
            ]
        },
        {
            "name": "rich",
            "specs": [
                [
                    ">=",
                    "12.0.0"
                ]
            ]
        },
        {
            "name": "jinja2",
            "specs": [
                [
                    ">=",
                    "3.0.0"
                ]
            ]
        },
        {
            "name": "plotly",
            "specs": [
                [
                    ">=",
                    "5.0.0"
                ]
            ]
        },
        {
            "name": "scikit-learn",
            "specs": [
                [
                    ">=",
                    "1.0.0"
                ]
            ]
        },
        {
            "name": "ijson",
            "specs": [
                [
                    ">=",
                    "3.2.0"
                ]
            ]
        },
        {
            "name": "python-dotenv",
            "specs": [
                [
                    ">=",
                    "1.0.0"
                ]
            ]
        },
        {
            "name": "noveum-trace",
            "specs": [
                [
                    ">=",
                    "0.3.5"
                ]
            ]
        },
        {
            "name": "typing_extensions",
            "specs": [
                [
                    ">=",
                    "4.7.0"
                ]
            ]
        },
        {
            "name": "datasets",
            "specs": [
                [
                    ">=",
                    "2.0.0"
                ]
            ]
        },
        {
            "name": "transformers",
            "specs": [
                [
                    ">=",
                    "4.20.0"
                ]
            ]
        },
        {
            "name": "openai",
            "specs": [
                [
                    ">=",
                    "1.0.0"
                ]
            ]
        },
        {
            "name": "anthropic",
            "specs": [
                [
                    ">=",
                    "0.3.0"
                ]
            ]
        },
        {
            "name": "boto3",
            "specs": [
                [
                    ">=",
                    "1.26.0"
                ]
            ]
        },
        {
            "name": "sentence-transformers",
            "specs": [
                [
                    ">=",
                    "2.2.0"
                ]
            ]
        },
        {
            "name": "ollama",
            "specs": [
                [
                    "==",
                    "0.5.3"
                ]
            ]
        }
    ],
    "lcname": "novaeval"
}
        