evalops

Name: evalops
Version: 0.0.6
Home page: https://github.com/The-Swarm-Corporation/StatisticalModelEvaluator
Summary: evalops - TGSC
Upload time: 2024-12-22 01:15:40
Author: Kye Gomez
Requires Python: <4.0,>=3.10
License: MIT
Keywords: artificial intelligence, deep learning, optimizers, prompt engineering
Requirements: numpy>=1.21.0, pandas>=1.3.0, scipy>=1.7.0, pydantic>=2.0.0, sentence-transformers>=2.2.0, spacy>=3.0.0, scikit-learn>=1.0.0, num2words>=0.5.10, loguru>=0.6.0, threadpoolctl>=3.0.0
# Statistical Model Evaluator

[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)


[![Join our Discord](https://img.shields.io/badge/Discord-Join%20our%20server-5865F2?style=for-the-badge&logo=discord&logoColor=white)](https://discord.gg/agora-999382051935506503) [![Subscribe on YouTube](https://img.shields.io/badge/YouTube-Subscribe-red?style=for-the-badge&logo=youtube&logoColor=white)](https://www.youtube.com/@kyegomez3242) [![Connect on LinkedIn](https://img.shields.io/badge/LinkedIn-Connect-blue?style=for-the-badge&logo=linkedin&logoColor=white)](https://www.linkedin.com/in/kye-g-38759a207/) [![Follow on X.com](https://img.shields.io/badge/X.com-Follow-1DA1F2?style=for-the-badge&logo=x&logoColor=white)](https://x.com/kyegomezb)

A robust, production-ready framework for statistically rigorous evaluation of language models, implementing the methodology described in ["A Statistical Approach to Model Evaluations"](https://www.anthropic.com/research/statistical-approach-to-model-evals) (2024).



## 🚀 Features

- **Statistical Robustness**: Leverages the Central Limit Theorem for reliable metrics (see the formula sketch after this list)
- **Clustered Standard Errors**: Handles non-independent question groups
- **Variance Reduction**: Multiple sampling strategies and parallel processing
- **Paired Difference Analysis**: Sophisticated model comparison tools
- **Power Analysis**: Sample size determination for meaningful comparisons
- **Production Ready**: 
  - Comprehensive logging
  - Type hints throughout
  - Error handling
  - Result caching
  - Parallel processing
  - Modular design
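
As a rough guide to the statistics behind the first two bullets (the estimators used inside the library may differ in detail): scores are averaged per question, the Central Limit Theorem makes the mean approximately normal, and the clustered variant sums residuals within each question group before squaring, so correlated questions do not understate the uncertainty.

```math
\mathrm{SEM} = \frac{s}{\sqrt{n}}, \qquad
\mathrm{CI}_{95\%} = \bar{x} \pm 1.96\,\mathrm{SEM}, \qquad
\mathrm{SE}_{\mathrm{clustered}} = \frac{1}{n}\sqrt{\sum_{c}\Big(\sum_{i \in c} (x_i - \bar{x})\Big)^{2}}
```

Here $\bar{x}$ is the mean score, $s$ the sample standard deviation, $n$ the number of questions, and $c$ ranges over question clusters.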

## Installation


```bash
pip3 install -U evalops
```

## Usage 

```python
import os

from dotenv import load_dotenv
from swarm_models import OpenAIChat
from swarms import Agent

from evalops import StatisticalModelEvaluator

load_dotenv()

# Get the OpenAI API key from the environment variable
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
    raise ValueError("OPENAI_API_KEY environment variable not set")

# Create instances of the OpenAIChat class with different models
model_gpt4o = OpenAIChat(
    openai_api_key=api_key, model_name="gpt-4o", temperature=0.1
)

model_gpt4o_mini = OpenAIChat(
    openai_api_key=api_key, model_name="gpt-4o-mini", temperature=0.1
)

# Initialize one general knowledge agent per model so their scores can be compared
agent_gpt4o = Agent(
    agent_name="General-Knowledge-Agent-GPT4o",
    system_prompt="You are a helpful assistant that answers general knowledge questions accurately and concisely.",
    llm=model_gpt4o,
    max_loops=1,
    dynamic_temperature_enabled=True,
    saved_state_path="general_agent_gpt4o.json",
    user_name="swarms_corp",
    context_length=200000,
    return_step_meta=False,
    output_type="string",
)

agent_gpt4o_mini = Agent(
    agent_name="General-Knowledge-Agent-GPT4o-mini",
    system_prompt="You are a helpful assistant that answers general knowledge questions accurately and concisely.",
    llm=model_gpt4o_mini,
    max_loops=1,
    dynamic_temperature_enabled=True,
    saved_state_path="general_agent_gpt4o_mini.json",
    user_name="swarms_corp",
    context_length=200000,
    return_step_meta=False,
    output_type="string",
)

evaluator = StatisticalModelEvaluator(cache_dir="./eval_cache")

# General knowledge test cases
general_questions = [
    "What is the capital of France?",
    "Who wrote Romeo and Juliet?",
    "What is the largest planet in our solar system?",
    "What is the chemical symbol for gold?",
    "Who painted the Mona Lisa?",
]

general_answers = [
    "Paris",
    "William Shakespeare",
    "Jupiter",
    "Au",
    "Leonardo da Vinci",
]

# Evaluate both agents on the same general knowledge questions
result_gpt4o = evaluator.evaluate_model(
    model=agent_gpt4o,
    questions=general_questions,
    correct_answers=general_answers,
    num_samples=5,
)

result_gpt4o_mini = evaluator.evaluate_model(
    model=agent_gpt4o_mini,
    questions=general_questions,
    correct_answers=general_answers,
    num_samples=5,
)

# Compare model performance
comparison = evaluator.compare_models(result_gpt4o, result_gpt4o_mini)

# Print results
print(f"GPT-4o Mean Score: {result_gpt4o.mean_score:.3f}")
print(f"GPT-4o-mini Mean Score: {result_gpt4o_mini.mean_score:.3f}")
print(
    f"Significant Difference: {comparison['significant_difference']}"
)
print(f"P-value: {comparison['p_value']:.3f}")

```


## 📖 Detailed Usage

### Basic Model Evaluation

```python
class MyLanguageModel:
    def run(self, task: str) -> str:
        # Your model implementation
        return "model response"

evaluator = StatisticalModelEvaluator(
    cache_dir="./eval_cache",
    log_level="INFO",
    random_seed=42
)

# Prepare your evaluation data
questions = ["Question 1", "Question 2", ...]
answers = ["Answer 1", "Answer 2", ...]

# Run evaluation
result = evaluator.evaluate_model(
    model=MyLanguageModel(),
    questions=questions,
    correct_answers=answers,
    num_samples=3,  # Number of times to sample each question
    batch_size=32,  # Batch size for parallel processing
    cache_key="model_v1"  # Optional caching key
)

# Access results
print(f"Mean Score: {result.mean_score:.3f}")
print(f"95% CI: [{result.ci_lower:.3f}, {result.ci_upper:.3f}]")
```

### Handling Clustered Questions

```python
# For questions that are grouped (e.g., multiple questions about the same passage)
cluster_ids = ["passage1", "passage1", "passage2", "passage2", ...]

result = evaluator.evaluate_model(
    model=MyLanguageModel(),
    questions=questions,
    correct_answers=answers,
    cluster_ids=cluster_ids
)
```
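
For intuition, the sketch below computes a cluster-robust standard error directly from per-question scores and cluster IDs with numpy. It illustrates the usual cluster-robust estimator and is not necessarily the exact computation used inside `evalops`.

```python
import numpy as np

def clustered_sem(scores, cluster_ids):
    """Cluster-robust standard error of the mean score (illustrative).

    Residuals are summed within each cluster before squaring, so
    correlated questions (e.g. several about the same passage) do not
    understate the uncertainty.
    """
    scores = np.asarray(scores, dtype=float)
    clusters = np.asarray(cluster_ids)
    n = scores.size
    residuals = scores - scores.mean()
    # Sum residuals within each cluster, then square and add up.
    cluster_sums = np.array(
        [residuals[clusters == c].sum() for c in np.unique(clusters)]
    )
    return np.sqrt((cluster_sums ** 2).sum()) / n

scores = [1.0, 1.0, 0.0, 1.0, 0.0, 0.0]
cluster_ids = ["passage1", "passage1", "passage1",
               "passage2", "passage2", "passage2"]
print(f"Clustered SEM: {clustered_sem(scores, cluster_ids):.3f}")
```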

### Comparing Models

```python
# Evaluate two models
result_a = evaluator.evaluate_model(model=ModelA(), ...)
result_b = evaluator.evaluate_model(model=ModelB(), ...)

# Compare results
comparison = evaluator.compare_models(result_a, result_b)

print(f"Mean Difference: {comparison['mean_difference']:.3f}")
print(f"P-value: {comparison['p_value']:.4f}")
print(f"Significant Difference: {comparison['significant_difference']}")
```

### Power Analysis

```python
required_samples = evaluator.calculate_required_samples(
    effect_size=0.05,  # Minimum difference to detect
    baseline_variance=0.1,  # Estimated variance in scores
    power=0.8,  # Desired statistical power
    alpha=0.05  # Significance level
)

print(f"Required number of samples: {required_samples}")
```
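
Under the standard normal-approximation power calculation for comparing two models with roughly equal score variance $\sigma^2$ and a minimum detectable difference $\delta$ (the library's `calculate_required_samples` may apply additional corrections), the required number of samples per model is approximately:

```math
n \approx \frac{2\,(z_{1-\alpha/2} + z_{1-\beta})^{2}\,\sigma^{2}}{\delta^{2}}
```

With the values above ($\delta = 0.05$, $\sigma^{2} = 0.1$, $\alpha = 0.05$, power $= 0.8$), $z_{0.975} \approx 1.96$ and $z_{0.8} \approx 0.84$, giving roughly $2 \cdot 2.8^{2} \cdot 0.1 / 0.0025 \approx 630$ samples per model.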


## Loading datasets from Hugging Face

```python
import os

from dotenv import load_dotenv
from swarm_models import OpenAIChat
from swarms import Agent

from evalops import StatisticalModelEvaluator
from evalops.huggingface_loader import EvalDatasetLoader

load_dotenv()

# Get the OpenAI API key from the environment variable
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
    raise ValueError("OPENAI_API_KEY environment variable not set")

# Create instance of OpenAIChat
model_gpt4 = OpenAIChat(
    openai_api_key=api_key, model_name="gpt-4o", temperature=0.1
)

# Initialize a general knowledge agent
agent = Agent(
    agent_name="General-Knowledge-Agent",
    system_prompt="You are a helpful assistant that answers general knowledge questions accurately and concisely.",
    llm=model_gpt4,
    max_loops=1,
    dynamic_temperature_enabled=True,
    saved_state_path="general_agent.json",
    user_name="swarms_corp",
    context_length=200000,
    return_step_meta=False,
    output_type="string",
)

evaluator = StatisticalModelEvaluator(cache_dir="./eval_cache")

# Initialize the dataset loader
eval_loader = EvalDatasetLoader(cache_dir="./eval_cache")

# Load a common evaluation dataset
questions, answers = eval_loader.load_dataset(
    dataset_name="truthful_qa",
    subset="multiple_choice",
    split="validation",
    answer_key="best_question",
)

# Use the loaded questions and answers with your evaluator
result_gpt4 = evaluator.evaluate_model(
    model=agent,
    questions=questions,
    correct_answers=answers,
    num_samples=5,
)


# Print results
print(result_gpt4)


```
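
As an alternative illustration (not part of the evalops API shown above), you can also build plain question/answer lists directly with the Hugging Face `datasets` library and pass them to the same evaluator. The field names below assume `truthful_qa`'s `generation` config; adjust them for other datasets.

```python
from datasets import load_dataset

# Continuing with the `evaluator` and `agent` objects created above.
# Field names assume truthful_qa's "generation" config.
ds = load_dataset("truthful_qa", "generation", split="validation")
questions = list(ds["question"])
answers = list(ds["best_answer"])

result = evaluator.evaluate_model(
    model=agent,
    questions=questions,
    correct_answers=answers,
    num_samples=1,
)
print(result)
```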


## Simple Eval
`eval` is a convenience function that wraps `StatisticalModelEvaluator` so an agent can be scored in a single call.

```python
import os

from dotenv import load_dotenv
from swarm_models import OpenAIChat
from swarms import Agent

from evalops.wrapper import eval

load_dotenv()

# Get the OpenAI API key from the environment variable
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
    raise ValueError("OPENAI_API_KEY environment variable not set")

# Create instance of OpenAIChat
model_gpt4 = OpenAIChat(
    openai_api_key=api_key, model_name="gpt-4o", temperature=0.1
)

# Initialize a general knowledge agent
agent = Agent(
    agent_name="General-Knowledge-Agent",
    system_prompt="You are a helpful assistant that answers general knowledge questions accurately and concisely.",
    llm=model_gpt4,
    max_loops=1,
    dynamic_temperature_enabled=True,
    saved_state_path="general_agent.json",
    user_name="swarms_corp",
    context_length=200000,
    return_step_meta=False,
    output_type="string",
)


# General knowledge test cases
general_questions = [
    "What is the capital of France?",
    "Who wrote Romeo and Juliet?",
    "What is the largest planet in our solar system?",
    "What is the chemical symbol for gold?",
    "Who painted the Mona Lisa?",
]

# Answers
general_answers = [
    "Paris",
    "William Shakespeare",
    "Jupiter",
    "Au",
    "Leonardo da Vinci",
]


print(eval(
    questions=general_questions,
    answers=general_answers,
    agent=agent,
    samples=2,
))

```

## 🎛️ Configuration Options

| Parameter | Description | Default |
|-----------|-------------|---------|
| `cache_dir` | Directory for caching results | `None` |
| `log_level` | Logging verbosity ("DEBUG", "INFO", etc.) | `"INFO"` |
| `random_seed` | Seed for reproducibility | `None` |
| `batch_size` | Batch size for parallel processing | `32` |
| `num_samples` | Samples per question | `1` |

## 📊 Output Formats

### EvalResult Object

```python
@dataclass
class EvalResult:
    mean_score: float      # Average score across questions
    sem: float            # Standard error of the mean
    ci_lower: float       # Lower bound of 95% CI
    ci_upper: float       # Upper bound of 95% CI
    raw_scores: List[float]  # Individual question scores
    metadata: Dict        # Additional evaluation metadata
```

### Comparison Output

```python
{
    "mean_difference": float,    # Difference between means
    "correlation": float,        # Score correlation
    "t_statistic": float,       # T-test statistic
    "p_value": float,           # Statistical significance
    "significant_difference": bool  # True if p < 0.05
}
```
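
The comparison dictionary corresponds to a paired analysis of the two models' per-question scores. Below is a minimal, illustrative sketch with scipy of how such fields could be computed from two aligned score lists; `compare_models` itself may add refinements (e.g. clustering-aware errors), and the helper name here is purely for illustration.

```python
import numpy as np
from scipy import stats

def compare_scores(scores_a, scores_b, alpha=0.05):
    """Paired comparison of two models' per-question scores (illustrative)."""
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    t_stat, p_value = stats.ttest_rel(a, b)  # paired t-test on matched questions
    return {
        "mean_difference": float(a.mean() - b.mean()),
        "correlation": float(np.corrcoef(a, b)[0, 1]),
        "t_statistic": float(t_stat),
        "p_value": float(p_value),
        "significant_difference": bool(p_value < alpha),
    }

print(compare_scores([1, 1, 0, 1, 0.5], [1, 0, 0, 1, 0.0]))
```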

## 🔍 Best Practices

1. **Sample Size**: Use power analysis to determine appropriate sample sizes
2. **Clustering**: Always specify cluster IDs when questions are grouped
3. **Caching**: Enable caching for expensive evaluations
4. **Error Handling**: Monitor logs for evaluation failures
5. **Reproducibility**: Set random seed for consistent results

## 🤝 Contributing

We welcome contributions! Please see our [Contributing Guidelines](CONTRIBUTING.md) for details.

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🙋‍♂️ Support

- 📫 Email: kye@swarms.world
- 💬 Issues: [GitHub Issues](https://github.com/The-Swarm-Corporation/StatisticalModelEvaluator/issues)
- 📖 Documentation: [Full Documentation](https://docs.swarms.world)

## 🙏 Acknowledgments

- Thanks to all contributors
- Inspired by the paper "A Statistical Approach to Model Evaluations" (2024)

            
