# NovaEval by Noveum.ai
[CI](https://github.com/Noveum/NovaEval/actions/workflows/ci.yml)
[Release](https://github.com/Noveum/NovaEval/actions/workflows/release.yml)
[Coverage](https://codecov.io/gh/Noveum/NovaEval)
[PyPI](https://badge.fury.io/py/novaeval)
[Python 3.9+](https://www.python.org/downloads/)
[License: Apache 2.0](https://opensource.org/licenses/Apache-2.0)
A comprehensive, extensible AI model evaluation framework designed for production use. NovaEval provides a unified interface for evaluating language models across various datasets, metrics, and deployment scenarios.
## 🚧 Development Status
> **⚠️ ACTIVE DEVELOPMENT - NOT PRODUCTION READY**
>
> NovaEval is currently in active development and **not recommended for production use**. We are actively working on improving stability, adding features, and expanding test coverage. APIs may change without notice.
>
> **We're looking for contributors!** See the [Contributing](#-contributing) section below for ways to help.
## 🤝 We Need Your Help!
NovaEval is an open-source project that thrives on community contributions. Whether you're a seasoned developer or just getting started, there are many ways to contribute:
### 🎯 High-Priority Contribution Areas
We're actively looking for contributors in these key areas:
- **🧪 Unit Tests**: Help us improve our test coverage (currently 23% overall, 90%+ for core modules)
- **📚 Examples**: Create real-world evaluation examples and use cases
- **📝 Guides & Notebooks**: Write evaluation guides and interactive Jupyter notebooks
- **📖 Documentation**: Improve API documentation and user guides
- **🔍 RAG Metrics**: Add more metrics specifically for Retrieval-Augmented Generation evaluation
- **🤖 Agent Evaluation**: Build frameworks for evaluating AI agents and multi-turn conversations
### 🚀 Getting Started as a Contributor
1. **Start Small**: Pick up issues labeled `good first issue` or `help wanted`
2. **Join Discussions**: Share your ideas in [GitHub Discussions](https://github.com/Noveum/NovaEval/discussions)
3. **Review Code**: Help review pull requests and provide feedback
4. **Report Issues**: Found a bug? Report it in [GitHub Issues](https://github.com/Noveum/NovaEval/issues)
5. **Spread the Word**: Star the repository and share with your network
## 🚀 Features
- **Multi-Model Support**: Evaluate models from OpenAI, Anthropic, AWS Bedrock, and custom providers
- **Extensible Scoring**: Built-in scorers for accuracy, semantic similarity, code evaluation, and custom metrics
- **Dataset Integration**: Support for MMLU, HuggingFace datasets, custom datasets, and more
- **Production Ready**: Docker support, Kubernetes deployment, and cloud integrations
- **Comprehensive Reporting**: Detailed evaluation reports, artifacts, and visualizations
- **Secure**: Built-in credential management and secret store integration
- **Scalable**: Designed for both local testing and large-scale production evaluations
- **Cross-Platform**: Tested on macOS, Linux, and Windows with comprehensive CI/CD
## 📦 Installation
### From PyPI (Recommended)
```bash
pip install novaeval
```
### From Source
```bash
git clone https://github.com/Noveum/NovaEval.git
cd NovaEval
pip install -e .
```
### Docker
```bash
docker pull noveum/novaeval:latest
```
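If the published image behaves like the locally built image shown in the Deployment section below, it can be run the same way by mounting a config and results directory (a sketch; the mount paths and `--config` flag follow that later example):
```bash
docker run \
  -v $(pwd)/config:/config \
  -v $(pwd)/results:/results \
  noveum/novaeval:latest --config /config/eval.yaml
```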
## 🏃‍♂️ Quick Start
### Basic Evaluation
```python
from novaeval import Evaluator
from novaeval.datasets import MMLUDataset
from novaeval.models import OpenAIModel
from novaeval.scorers import AccuracyScorer

# Configure for cost-conscious evaluation
MAX_TOKENS = 100  # Adjust based on budget: 5-10 for answers, 100+ for reasoning

# Initialize components
dataset = MMLUDataset(
    subset="elementary_mathematics",  # Easier subset for demo
    num_samples=10,
    split="test"
)

model = OpenAIModel(
    model_name="gpt-4o-mini",  # Cost-effective model
    temperature=0.0,
    max_tokens=MAX_TOKENS
)

scorer = AccuracyScorer(extract_answer=True)

# Create and run evaluation
evaluator = Evaluator(
    dataset=dataset,
    models=[model],
    scorers=[scorer],
    output_dir="./results"
)

results = evaluator.run()

# Display detailed results
for model_name, model_results in results["model_results"].items():
    for scorer_name, score_info in model_results["scores"].items():
        if isinstance(score_info, dict):
            mean_score = score_info.get("mean", 0)
            count = score_info.get("count", 0)
            print(f"{scorer_name}: {mean_score:.4f} ({count} samples)")
```
### Configuration-Based Evaluation
```python
from novaeval import Evaluator
# Load configuration from YAML/JSON
evaluator = Evaluator.from_config("evaluation_config.yaml")
results = evaluator.run()
```
### Command Line Interface
NovaEval provides a comprehensive CLI for running evaluations:
```bash
# Run evaluation from configuration file
novaeval run config.yaml
# Quick evaluation with minimal setup
novaeval quick -d mmlu -m gpt-4 -s accuracy
# List available datasets, models, and scorers
novaeval list-datasets
novaeval list-models
novaeval list-scorers
# Generate sample configuration
novaeval generate-config sample-config.yaml
```
📖 **[Complete CLI Reference](docs/cli-reference.md)** - Detailed documentation for all CLI commands and options
### Example Configuration
```yaml
# evaluation_config.yaml
dataset:
  type: "mmlu"
  subset: "abstract_algebra"
  num_samples: 500

models:
  - type: "openai"
    model_name: "gpt-4"
    temperature: 0.0
  - type: "anthropic"
    model_name: "claude-3-opus"
    temperature: 0.0

scorers:
  - type: "accuracy"
  - type: "semantic_similarity"
    threshold: 0.8

output:
  directory: "./results"
  formats: ["json", "csv", "html"]
  upload_to_s3: true
  s3_bucket: "my-eval-results"
```
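With the file above saved as `evaluation_config.yaml`, the evaluation can be launched either from Python via `Evaluator.from_config(...)` as shown earlier, or from the CLI:
```bash
novaeval run evaluation_config.yaml
```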
## 🏗️ Architecture
NovaEval is built with extensibility and modularity in mind:
```
src/novaeval/
├── datasets/       # Dataset loaders and processors
├── evaluators/     # Core evaluation logic
├── integrations/   # External service integrations
├── models/         # Model interfaces and adapters
├── reporting/      # Report generation and visualization
├── scorers/        # Scoring mechanisms and metrics
└── utils/          # Utility functions and helpers
```
### Core Components
- **Datasets**: Standardized interface for loading evaluation datasets
- **Models**: Unified API for different AI model providers
- **Scorers**: Pluggable scoring mechanisms for various evaluation metrics
- **Evaluators**: Orchestrate the evaluation process
- **Reporting**: Generates comprehensive reports and artifacts
- **Integrations**: Handle external services (S3, credential stores, etc.)
## 📊 Supported Datasets
- **MMLU**: Massive Multitask Language Understanding
- **HuggingFace**: Any dataset from the HuggingFace Hub
- **Custom**: JSON, CSV, or programmatic dataset definitions
- **Code Evaluation**: Programming benchmarks and code generation tasks
- **Agent Traces**: Multi-turn conversation and agent evaluation
## 🤖 Supported Models
- **OpenAI**: GPT-3.5, GPT-4, and newer models
- **Anthropic**: Claude family models
- **AWS Bedrock**: Amazon's managed AI services
- **Noveum AI Gateway**: Integration with Noveum's model gateway
- **Custom**: Extensible interface for any API-based model
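Because the `Evaluator` accepts a list of models, a single run can compare providers side by side. Below is a minimal sketch using the classes shown in the Quick Start; the Anthropic adapter's class name is an assumption and is therefore left commented out:
```python
from novaeval import Evaluator
from novaeval.datasets import MMLUDataset
from novaeval.models import OpenAIModel
from novaeval.scorers import AccuracyScorer

# from novaeval.models import AnthropicModel  # assumed class name, mirroring OpenAIModel

models = [
    OpenAIModel(model_name="gpt-4o-mini", temperature=0.0, max_tokens=100),
    # AnthropicModel(model_name="claude-3-opus", temperature=0.0, max_tokens=100),
]

evaluator = Evaluator(
    dataset=MMLUDataset(subset="abstract_algebra", num_samples=50, split="test"),
    models=models,
    scorers=[AccuracyScorer(extract_answer=True)],
    output_dir="./results",
)
results = evaluator.run()
```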
## 📏 Built-in Scorers
### Accuracy-Based
- **ExactMatch**: Exact string matching
- **Accuracy**: Classification accuracy
- **F1Score**: F1 score for classification tasks
### Semantic-Based
- **SemanticSimilarity**: Embedding-based similarity scoring
- **BERTScore**: BERT-based semantic evaluation
- **RougeScore**: ROUGE metrics for text generation
### Code-Specific
- **CodeExecution**: Execute and validate code outputs
- **SyntaxChecker**: Validate code syntax
- **TestCoverage**: Code coverage analysis
### Custom
- **LLMJudge**: Use another LLM as a judge
- **HumanEval**: Integration with human evaluation workflows
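The **LLMJudge** scorer uses another LLM to grade outputs. The snippet below is only a sketch of that idea built on the `generate()` interface from the Extending NovaEval section; it is not the built-in scorer's actual API, which is not documented in this README:
```python
# Toy LLM-as-a-judge scoring function (illustrative only).
# `judge_model` is anything exposing a generate(prompt) method, such as the
# BaseModel interface sketched under "Extending NovaEval".
def judge_score(judge_model, prediction: str, ground_truth: str) -> float:
    prompt = (
        "Rate from 0 to 1 how well the prediction matches the reference. "
        "Reply with the number only.\n"
        f"Reference: {ground_truth}\n"
        f"Prediction: {prediction}\n"
        "Score:"
    )
    reply = judge_model.generate(prompt)
    try:
        return max(0.0, min(1.0, float(reply.strip().split()[0])))
    except (ValueError, IndexError):
        return 0.0  # judge reply was not a parsable number
```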
## 🚀 Deployment
### Local Development
```bash
# Install dependencies
pip install -e ".[dev]"
# Run tests
pytest
# Run example evaluation
python examples/basic_evaluation.py
```
### Docker
```bash
# Build image
docker build -t nova-eval .
# Run evaluation
docker run -v $(pwd)/config:/config -v $(pwd)/results:/results nova-eval --config /config/eval.yaml
```
### Kubernetes
```bash
# Deploy to Kubernetes
kubectl apply -f kubernetes/
# Check status
kubectl get pods -l app=nova-eval
```
## 🔧 Configuration
NovaEval supports configuration through:
- **YAML/JSON files**: Declarative configuration
- **Environment variables**: Runtime configuration
- **Python code**: Programmatic configuration
- **CLI arguments**: Command-line overrides
### Environment Variables
```bash
export NOVA_EVAL_OUTPUT_DIR="./results"
export NOVA_EVAL_LOG_LEVEL="INFO"
export OPENAI_API_KEY="your-api-key"
export AWS_ACCESS_KEY_ID="your-aws-key"
```
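A small sketch of how these variables can feed programmatic configuration; it assumes the provider keys (`OPENAI_API_KEY`, AWS credentials) are read from the environment by the underlying clients, as is conventional, and only the NovaEval-specific variables are forwarded by hand:
```python
import logging
import os

# NOVA_EVAL_LOG_LEVEL drives logging verbosity; NOVA_EVAL_OUTPUT_DIR can be
# forwarded to Evaluator(output_dir=...). Provider keys such as OPENAI_API_KEY
# are assumed to be picked up by the respective client libraries.
logging.basicConfig(level=os.environ.get("NOVA_EVAL_LOG_LEVEL", "INFO"))
output_dir = os.environ.get("NOVA_EVAL_OUTPUT_DIR", "./results")
print(f"Evaluation results will be written to {output_dir}")
```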
### CI/CD Integration
NovaEval includes optimized GitHub Actions workflows:
- **Unit tests** run on all PRs and pushes for quick feedback
- **Integration tests** run on main branch only to minimize API costs
- **Cross-platform testing** on macOS, Linux, and Windows
## 📈 Reporting and Artifacts
NovaEval generates comprehensive evaluation reports:
- **Summary Reports**: High-level metrics and insights
- **Detailed Results**: Per-sample predictions and scores
- **Visualizations**: Charts and graphs for result analysis
- **Artifacts**: Model outputs, intermediate results, and debug information
- **Export Formats**: JSON, CSV, HTML, PDF
### Example Report Structure
```
results/
├── summary.json             # High-level metrics
├── detailed_results.csv     # Per-sample results
├── artifacts/
│   ├── model_outputs/       # Raw model responses
│   ├── intermediate/        # Processing artifacts
│   └── debug/               # Debug information
├── visualizations/
│   ├── accuracy_by_category.png
│   ├── score_distribution.png
│   └── confusion_matrix.png
└── report.html              # Interactive HTML report
```
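Because the summary and per-sample files are plain JSON and CSV, they are easy to post-process. A minimal sketch, assuming the layout above and that pandas is available:
```python
import json
from pathlib import Path

import pandas as pd

results_dir = Path("results")

# High-level metrics; the exact keys depend on the scorers that were run
summary = json.loads((results_dir / "summary.json").read_text())
print(summary)

# Per-sample predictions and scores
detailed = pd.read_csv(results_dir / "detailed_results.csv")
print(detailed.head())
```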
## 🔌 Extending NovaEval
### Custom Datasets
```python
from novaeval.datasets import BaseDataset

class MyCustomDataset(BaseDataset):
    def load_data(self):
        # Implement data loading logic
        return samples

    def get_sample(self, index):
        # Return individual sample
        return sample
```
### Custom Scorers
```python
from novaeval.scorers import BaseScorer

class MyCustomScorer(BaseScorer):
    def score(self, prediction, ground_truth, context=None):
        # Implement scoring logic
        return score
```
### Custom Models
```python
from novaeval.models import BaseModel

class MyCustomModel(BaseModel):
    def generate(self, prompt, **kwargs):
        # Implement model inference
        return response
```
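Custom components plug into the same `Evaluator` as the built-ins. Below is a sketch wiring the three skeleton classes together; their base classes may expect constructor arguments not shown in this README, so adjust as needed:
```python
from novaeval import Evaluator

# MyCustomDataset, MyCustomModel and MyCustomScorer are the skeletons above;
# adapt their constructors to whatever the base classes actually require.
evaluator = Evaluator(
    dataset=MyCustomDataset(),
    models=[MyCustomModel()],
    scorers=[MyCustomScorer()],
    output_dir="./results",
)
results = evaluator.run()
```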
## 🤝 Contributing
We welcome contributions! NovaEval is actively seeking contributors to help build a robust AI evaluation framework. Please see our [Contributing Guide](CONTRIBUTING.md) for detailed guidelines.
### 🎯 Priority Contribution Areas
As mentioned in the [We Need Your Help](#-we-need-your-help) section, we're particularly looking for help with:
1. **Unit Tests** - Expand test coverage beyond the current 23%
2. **Examples** - Real-world evaluation scenarios and use cases
3. **Guides & Notebooks** - Interactive evaluation tutorials
4. **Documentation** - API docs, user guides, and tutorials
5. **RAG Metrics** - Specialized metrics for retrieval-augmented generation
6. **Agent Evaluation** - Frameworks for multi-turn and agent-based evaluations
### Development Setup
```bash
# Clone repository
git clone https://github.com/Noveum/NovaEval.git
cd NovaEval
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install development dependencies
pip install -e ".[dev]"
# Install pre-commit hooks
pre-commit install
# Run tests
pytest
# Run with coverage
pytest --cov=src/novaeval --cov-report=html
```
### 🏗️ Contribution Workflow
1. **Fork** the repository
2. **Create** a feature branch (`git checkout -b feature/amazing-feature`)
3. **Make** your changes following our coding standards
4. **Add** tests for your changes
5. **Commit** your changes (`git commit -m 'Add amazing feature'`)
6. **Push** to the branch (`git push origin feature/amazing-feature`)
7. **Open** a Pull Request
### 📋 Contribution Guidelines
- **Code Quality**: Follow PEP 8 and use the provided pre-commit hooks
- **Testing**: Add unit tests for new features and bug fixes
- **Documentation**: Update documentation for API changes
- **Commit Messages**: Use conventional commit format
- **Issues**: Reference relevant issues in your PR description
### 🎉 Recognition
Contributors will be:
- Listed in our contributors page
- Mentioned in release notes for significant contributions
- Invited to join our contributor Discord community
## 📄 License
This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.
## 🙏 Acknowledgments
- Inspired by evaluation frameworks like DeepEval, Confident AI, and Braintrust
- Built with modern Python best practices and industry standards
- Designed for the AI evaluation community
## 📞 Support
- **Documentation**: [https://noveum.github.io/NovaEval](https://noveum.github.io/NovaEval)
- **Issues**: [GitHub Issues](https://github.com/Noveum/NovaEval/issues)
- **Discussions**: [GitHub Discussions](https://github.com/Noveum/NovaEval/discussions)
- **Email**: support@noveum.ai
---
Made with ❤️ by the Noveum.ai team