equator-qa

Name: equator-qa
Version: 0.0.4
Summary: Equator: A Deterministic Framework for Evaluating LLM Reasoning with Open-Ended Questions.
Upload time: 2025-01-12 16:20:08
Requires Python: >=3.6
License: MIT
Keywords: LLM evaluation, open-ended questions, reasoning framework

# EQUATOR Evaluator

## Overview

The **EQUATOR Evaluator** is a robust framework designed to systematically evaluate the factual accuracy and reasoning capabilities of large language models (LLMs). Unlike traditional evaluation methods, which often prioritize fluency over accuracy, this tool employs a **deterministic scoring system** that ensures precise and unbiased assessment of LLM-generated responses.

This repository implements the methodology described in the research paper "EQUATOR: A Deterministic Framework for Evaluating LLM Reasoning with Open-Ended Questions" (v1.0.0-beta, Bernard et al., 2024). By leveraging vector databases and smaller, locally hosted LLMs, the EQUATOR Evaluator bridges the gap between scalability and accuracy in automated assessments.

Study paper: [EQUATOR on arXiv (2501.00257)](https://arxiv.org/abs/2501.00257)


![Equator Framework](EQUATOR-Framework.png "EQUATOR: A Deterministic Framework for Evaluating LLM Reasoning")

---

## Key Features

1. **Deterministic Scoring**: Assigns binary scores (100% or 0%) based solely on factual correctness.
2. **Vector Database Integration**: Embeds open-ended questions and human-evaluated answers for semantic matching.
3. **Automated Evaluation**: Uses smaller LLMs to provide scalable and efficient assessments.
4. **Bias Mitigation**: Eliminates scoring biases related to linguistic fluency or persuasion.
5. **Cost Efficiency**: Optimizes token usage, significantly reducing operational costs for evaluation.

---

## Why the EQUATOR Evaluator?

Traditional methods, such as multiple-choice tests or human evaluation, often fail to capture the nuanced reasoning and factual accuracy required in high-stakes domains such as medicine or law. The EQUATOR Evaluator:

- Focuses on **factual correctness** over linguistic style.
- Reduces reliance on human evaluators by automating the grading process.
- Provides insights into where LLMs fall short, enabling targeted improvements in model training.

---

## Methodology

### 1. Deterministic Scoring Framework
The scoring framework evaluates LLM-generated answers against a vector database of human-evaluated responses. It follows these steps:
1. **Embed Inputs**: Convert questions and answers into vector embeddings using models like `all-minilm`.
2. **Retrieve Closest Match**: Identify the most semantically similar answer key using cosine similarity.
3. **Binary Scoring**: Assign 100% if the student’s answer matches the answer key; otherwise, 0%.
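
A minimal sketch of steps 1–3, assuming a placeholder `embed()` helper rather than any specific provider (the framework itself uses an embedding model such as `all-minilm`):

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical helper: call an embedding model (e.g., all-minilm)."""
    raise NotImplementedError("wire this to your embedding provider")

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def closest_answer_key(student_answer: str, answer_keys: list[str]) -> str:
    """Step 2: retrieve the most semantically similar answer key."""
    student_vec = embed(student_answer)
    return max(answer_keys,
               key=lambda key: cosine_similarity(student_vec, embed(key)))

def binary_score(is_factual_match: bool) -> int:
    """Step 3: 100% on a factual match, otherwise 0%."""
    return 100 if is_factual_match else 0
```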

### 2. Vector Database
The vector database, implemented with ChromaDB, stores embeddings of open-ended questions and their corresponding answer keys. This database serves as the single source of truth for evaluations.
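
A short sketch of how the answer-key store might look with ChromaDB (the collection name, IDs, and metadata fields below are illustrative, not the framework's exact schema):

```python
import chromadb

# Persistent vector store for questions and their human-evaluated answer keys.
client = chromadb.PersistentClient(path="./equator_vectordb")
answer_keys = client.get_or_create_collection(name="answer_keys")

# Store one open-ended question with its answer key as metadata.
answer_keys.add(
    ids=["q-001"],
    documents=["<open-ended question text>"],
    metadatas=[{"answer_key": "<human-evaluated answer>"}],
)

# Retrieve the closest question (and its answer key) for a student answer.
result = answer_keys.query(query_texts=["<student answer>"], n_results=1)
print(result["metadatas"][0][0]["answer_key"])
```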

### 3. Evaluator LLM
A smaller LLM (e.g., Llama 3.2) acts as the evaluator, ensuring strict adherence to the scoring criteria while reducing computational overhead.
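
A hedged sketch of the grading call, assuming the `ollama` Python client and a local model tag such as `llama3.2`; the prompt wording is illustrative, not the framework's exact rubric prompt:

```python
import ollama  # assumes a local Ollama server is running

def evaluate(question: str, answer_key: str, student_answer: str) -> int:
    """Ask a small local model to apply the binary rubric (illustrative prompt)."""
    prompt = (
        "You are a strict grader. Reply with 100 only if the student answer is "
        "factually equivalent to the answer key; otherwise reply with 0.\n"
        f"Question: {question}\n"
        f"Answer key: {answer_key}\n"
        f"Student answer: {student_answer}"
    )
    response = ollama.chat(model="llama3.2",
                           messages=[{"role": "user", "content": prompt}])
    return 100 if "100" in response["message"]["content"] else 0
```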

---
## Details of Features

We classify LLMs as either evaluators or students. Evaluator LLMs grade the "student" models, which in this case are the SOTA models available on OpenRouter (276+). Below is an updated “Evaluator vs. Student” matrix that includes **Groq → Ollama** support as well.

---

## Evaluator vs. Student Matrix
*Model counts are approximate:*

- OpenRouter hosts 293 models (from OpenAI and other providers).
- Groq hosts 14 models.
- Ollama hosts 34,925 model variants (148 model families, 270 sizes).

| **Evaluator LLM**  | **Student LLM** | **Support Status**      |
|--------------------|-----------------|-------------------------|
| **Ollama (local)** | OpenRouter      | **Currently supported** |
| **Ollama (local)** | Groq            | **Currently supported** |
| **Ollama (local)** | Ollama (local)  | **Currently supported** |
| **Groq**           | OpenRouter      | **Currently supported** |
| **Groq**           | Ollama (local)  | **Currently supported** |
| **Groq**           | Groq            | **Next release**        |
| **OpenRouter**     | OpenRouter      | **Next release**        |

To estimate the **possible amount of testing** from a combinatorial perspective, given the currently supported evaluator and student LLMs, we break down the calculations step by step.

---

## **1. Understanding the Components**

### **Evaluator LLMs:**
- **Ollama (Local):** 34,925 models
- **Groq:** 14 models

### **Student LLMs:**
- **OpenRouter:** 293 models
- **Groq:** 14 models
- **Ollama (Local):** 34,925 models

**Total Evaluator Models:** 34,925 (Ollama) + 14 (Groq) = **34,939 Evaluators**

**Total Student Models:** 293 (OpenRouter) + 14 (Groq) + 34,925 (Ollama) = **35,232 Students**

---

## **2. Supported Evaluator-Student Combinations**

The **currently supported** combinations are:

1. **Ollama (Evaluator) → OpenRouter (Student)**
2. **Ollama (Evaluator) → Groq (Student)**
3. **Ollama (Evaluator) → Ollama (Student)**
4. **Groq (Evaluator) → OpenRouter (Student)**
5. **Groq (Evaluator) → Ollama (Student)**

### **Unsupported (Next Release):**
6. **Groq (Evaluator) → Groq (Student)**
7. **OpenRouter (Evaluator) → OpenRouter (Student)**

---

## **3. Calculating the Number of Combinations**

### **A. Current Support**

1. **Ollama Evaluator Combinations:**
   - **With OpenRouter Students:**  
     34,925 Evaluators × 293 Students = **10,233,025 combinations**
   
   - **With Groq Students:**  
     34,925 Evaluators × 14 Students = **488,950 combinations**
   
   - **With Ollama Students:**  
     34,925 Evaluators × 34,925 Students = **1,219,755,625 combinations**
   
   **Subtotal for Ollama Evaluators:**  
   10,233,025 + 488,950 + 1,219,755,625 = **1,230,477,600 combinations**

2. **Groq Evaluator Combinations:**
   - **With OpenRouter Students:**  
     14 Evaluators × 293 Students = **4,102 combinations**
   
   - **With Ollama Students:**  
     14 Evaluators × 34,925 Students = **488,950 combinations**
   
   **Subtotal for Groq Evaluators:**  
   4,102 + 488,950 = **493,052 combinations**

**Total Current Combinations:**  
1,230,477,600 (Ollama) + 493,052 (Groq) = **1,230,970,652 combinations**

### **B. Future Support (Next Release)**

1. **Groq Evaluator → Groq Student:**  
   14 Evaluators × 14 Students = **196 combinations**

2. **OpenRouter Evaluator → OpenRouter Student:**  
   293 Evaluators × 293 Students = **85,849 combinations**

**Total Future Combinations:**  
196 + 85,849 = **86,045 combinations**
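
The same counts can be reproduced with a few lines of Python (model counts are the approximations listed above):

```python
# Approximate model counts from the sections above.
ollama_models, groq_models, openrouter_models = 34_925, 14, 293

current = (
    ollama_models * (openrouter_models + groq_models + ollama_models)  # Ollama as evaluator
    + groq_models * (openrouter_models + ollama_models)                # Groq as evaluator
)
future = groq_models * groq_models + openrouter_models * openrouter_models

print(f"current: {current:,}")           # 1,230,970,652
print(f"future:  {future:,}")            # 86,045
print(f"total:   {current + future:,}")  # 1,231,056,697
```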

---

## **4. Grand Total of Possible Evaluator-Student Combinations**

- **Currently Supported:** ~**1,230,971,000 combinations**
- **With Next Release:** ~**1,231,057,000 combinations**

*Note:* These figures are **approximate** due to rounding in intermediate steps.

---

## **5. Summary**

- **Total Supported Combinations (Current):**  
  **~1.23 Billion Evaluator-Student Pairs**

- **Additional Combinations (Next Release):**  
  **~86,045 Evaluator-Student Pairs**

---

## **6. Implications for Testing**

With **over 1.23 billion** possible Evaluator-Student pairs currently supported, comprehensive testing would involve an extensive and potentially resource-intensive process. Here's how you might approach it:

### **A. Prioritization Strategies:**
1. **Model Importance:** Focus on evaluating high-impact or frequently used models first.
2. **Diversity:** Ensure a diverse range of model families and sizes are tested to cover different capabilities and use cases.
3. **Incremental Testing:** Start with a subset of combinations and gradually expand.

### **B. Automation and Parallelization:**
- Utilize automated testing frameworks to handle large-scale evaluations.
- Leverage parallel processing to distribute the workload across multiple machines or instances.

### **C. Sampling Techniques:**
- Instead of exhaustively testing all combinations, use statistical sampling methods to select representative pairs for evaluation.
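
For example, a simple random sample over evaluator-student pairs (the model identifiers below are placeholders, not a recommended test set):

```python
import random

# Placeholder identifiers; in practice these come from your provider listings.
evaluators = ["ollama:llama3.2", "groq:llama3-70b-8192"]
students = ["openrouter:nousresearch/hermes-3-llama-3.1-405b",
            "groq:llama3-70b-8192",
            "ollama:llama3.2"]

# Skip Groq->Groq, which is slated for the next release.
all_pairs = [(e, s) for e in evaluators for s in students
             if not (e.startswith("groq:") and s.startswith("groq:"))]

random.seed(42)  # reproducible sample
for evaluator, student in random.sample(all_pairs, k=min(4, len(all_pairs))):
    print(f"evaluate student {student} with evaluator {evaluator}")
```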

### **D. Continuous Integration:**
- Implement continuous testing pipelines that automatically evaluate new combinations as models are added or updated.

---

## **7. Recommendations**

Given the sheer volume of possible combinations, it's crucial to implement a **strategic testing plan**:

1. **Define Testing Objectives:** Clearly outline what you aim to achieve with each test (e.g., performance benchmarks, compatibility checks).
2. **Allocate Resources:** Ensure you have the necessary computational resources to handle large-scale testing.
3. **Monitor and Iterate:** Continuously monitor testing outcomes and refine your strategies based on findings and evolving requirements.

By adopting a structured and prioritized approach, you can effectively manage the extensive testing landscape and ensure robust evaluation of your LLM combinations.

### Key Points

1. **Evaluator LLMs (the “grader”)**  
   - **Ollama** (local).  
   - **Groq**.  
   - *More evaluators planned for future releases.*

2. **Student LLMs (the “respondent”)**  
   - **OpenRouter** (276+ models: OpenAI, Anthropic, etc.).  
   - **Groq**.  
   - **Ollama** (local).  
   - *More students planned for future releases.*

3. **Current Highlights**  
   - **Ollama** can evaluate answers from OpenRouter, Groq, or Ollama itself.  
   - **Groq** can evaluate answers from OpenRouter **or Ollama** (Groq-to-Groq support arrives in the next release).  
   - Ongoing development will expand these capabilities even further.  

Use this chart as a quick reference for which LLM can serve as the **evaluator** versus which can serve as the **student**. We will be testing an OpenRouter-to-OpenRouter implementation in our next release.

---
## **Installation**

1. **Clone the repository**
    ```bash
    git clone https://github.com/yourusername/equator-qa-benchmark.git
    cd equator-qa-benchmark
    ```


2. **Set Up the Environment**
   - Rename `copy-to.env` to `.env` in your working directory.
   - Add the necessary API keys to the `.env` file.
   - Example:
     ```plaintext
     OPENROUTER_KEY="sk-xxx"
     GROQ_API_KEY="gsk_xxx"
     ```
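
   A minimal sketch of reading those keys at runtime, assuming the `python-dotenv` package (the project may load them differently):

   ```python
   import os
   from dotenv import load_dotenv

   load_dotenv()  # reads .env from the working directory
   openrouter_key = os.getenv("OPENROUTER_KEY")
   groq_api_key = os.getenv("GROQ_API_KEY")
   assert openrouter_key and groq_api_key, "API keys missing from .env"
   ```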

3. **Optional: Set Up a Virtual Environment**
   It is recommended to use a virtual environment to avoid conflicts with other Python packages.

   #### On **Windows**
   ```bash
   python -m venv .venv
   .venv\Scripts\activate
   pip install equator
   deactivate
   ```

   #### On **Linux/MacOS**
   ```bash
   python3 -m venv .venv
   source .venv/bin/activate
   pip install equator
   deactivate
   ```


4. **Install the project requirements**
   ```bash
   pip install -r requirements.txt
   ```

5. **Open the `main.py` file.**
   1. Follow the comments in the file for directions on how to configure your test runs. It's straightforward.




### Configuration

- **Execution Steps**: Define the steps to execute in the `execution_steps` list. Run the application with only one execution step enabled at a time and comment out the others; a future release will enable a bit more automation.
  Each step name encodes which models act as evaluator and student, e.g. `ollama_to_groq_evaluate` means Ollama is the evaluator and Groq is the student.
    ```python
    execution_steps = [
        # "ollama_to_groq_evaluate",        # working
        # "ollama_to_openrouter_evaluate",  # working
        # "groq_to_ollama_evaluate",        # working
        # "groq_to_openrouter_evaluate",    # working
        "generate_statistics",
    ]
    ```

- **Models**: Specify the models to benchmark in the respective lists.
    ```python
    student_openrouter_models = [
        "nousresearch/hermes-3-llama-3.1-405b",
    ]
    
    student_groq_models = [
        "llama3-70b-8192",
    ]

    student_ollama_models = [
        "llama3.2",
    ]
    ```

- **Benchmark Settings**: Adjust benchmarking parameters such as evaluator models, benchmark name, and answer rounds.
    ```python
    GROQ_EVALUATOR_MODEL = "llama3-70b-8192"
    OLLAMA_EVALUATOR_MODEL = "llama3.2"
    benchmark_name = "Bernard"
    answer_rounds = 2
    ```

### Logging

Logs are saved to `vectordb.log` with INFO level by default.


## Usage

### Running the Program

1. **Activate your Python environment** (the virtual environment created during installation, if you use one).
2. **Run the benchmark:**
   ```bash
   python -m main      # on Windows you can also use: py -m main
   ```

### Viewing Results
We create a directory named after the corresponding date to organize the benchmarks. Within this directory, you will find a collection of charts and CSV files containing statistics and token analytics.

Results, including scores and explanations, are saved in the specified output directory as JSON files. Each entry includes:
- Question
- Model-generated answer
- Evaluator response for the score
- Score
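
A quick sketch for aggregating scores from those files (the directory name and the `score` field are assumptions; check your generated output for the exact layout):

```python
import json
from pathlib import Path

results_dir = Path("2025-01-12")  # the date-named benchmark directory
scores = []
for path in results_dir.glob("*.json"):
    with path.open() as f:
        data = json.load(f)
    entries = data if isinstance(data, list) else [data]
    scores.extend(e["score"] for e in entries if isinstance(e, dict) and "score" in e)

if scores:
    print(f"{len(scores)} answers, mean score {sum(scores) / len(scores):.1f}%")
```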

---
## Example Dataset

We work with two datasets to test the reasoning capabilities of LLMs:

1. **Default Dataset**:
   - The file `linguistic_benchmark.json` contains open-ended questions across various categories, such as puzzles, spatial reasoning, and logic. This smaller dataset is ideal for quick tests or debugging. You are welcome to add more questions to the dataset or customize them for your domain.

2. **Full QA Dataset (private)**:
   - Our full QA dataset, `linguistic_benchmark.json`, contains over 1,000 entries. It is not distributed with the repository; we plan to publish results obtained with it on a dedicated website.

---

### Why We Keep Our Dataset Private

Our research examines the performance of large language models (LLMs) across state-of-the-art (SOTA) benchmarks, and we aim to maintain statistically significant evaluation results. If we were to release our full dataset publicly, there is a risk that future models could be trained or fine-tuned on our test items, which would compromise the fairness and meaningfulness of our benchmark. By keeping these data private, we ensure that our comparisons remain valid and our results accurately reflect model performance under unbiased test conditions.

Although our primary focus is maintaining a statistically significant and unbiased dataset for testing AI performance in QA reasoning and logic, we understand that different industries, such as law, medicine, or finance, have unique needs. Our `linguistic_benchmark.json` file can be extended to include domain-specific prompts and example responses. This approach allows you to evaluate how well AI models perform in your specialized context without compromising the integrity of our core benchmarking methodology. By adding your own questions, you can preserve our standardized evaluation framework while tailoring the tests to your field's specific challenges. We aim to maintain current EQUATOR benchmark results at equator.github.io.
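
If you do extend the file, a sketch like the following could append a domain-specific entry (the field names are hypothetical; mirror the structure of the existing entries rather than these exact keys):

```python
import json

with open("linguistic_benchmark.json") as f:
    benchmark = json.load(f)

# Hypothetical fields for illustration only.
benchmark.append({
    "category": "finance",
    "question": "<your domain-specific open-ended question>",
    "answer": "<the human-evaluated answer key>",
})

with open("linguistic_benchmark.json", "w") as f:
    json.dump(benchmark, f, indent=2, ensure_ascii=False)
```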

   

## Contributions

### Authors
- Raymond Bernard (Independent Researcher)
- Shaina Raza, Ph.D. (Vector Institute)
- Subhabrata Das, PhD (JP Morgan Chase)
- Raul Murugan (Columbia University)

---

## Future Work

- Expand the vector database to include more diverse datasets.
- Optimize the embedding and retrieval process for larger-scale deployments.
- Investigate additional scoring criteria for complex reasoning tasks.

---
- *Acknowledgment*: We extend our gratitude to James Huckle for inspiring our work.  
- We have incorporated elements from [https://github.com/autogenai/easy-problems-that-llms-get-wrong](https://github.com/autogenai/easy-problems-that-llms-get-wrong).  
- Our approach advances the field by simplifying the benchmarking process through our capability to score open-ended questions effectively.  
- Rather than benchmarking multiple models across disparate APIs, we leverage OpenRouter.ai's unified API, using the OpenAI SDK, which provides access to over 270 models for comprehensive benchmarking.  
  
## Citation
If you use this framework in your research, please cite:

```bibtex
@article{bernard2024equator,
  title        = {{EQUATOR: A Deterministic Framework for Evaluating LLM Reasoning with Open-Ended Questions. \# v1.0.0-beta}},
  author       = {Bernard, Raymond and Raza, Shaina and Das, Subhabrata and Murugan, Rahul},
  year         = {2024},
  eprint       = {2501.00257},
  archivePrefix= {arXiv},
  primaryClass = {cs.CL},
  note         = {MSC classes: 68T20; ACM classes: I.2.7; I.2.6; H.3.3},
  howpublished = {arXiv preprint arXiv:2501.00257 [cs.CL]},
  doi          = {10.48550/arXiv.2501.00257},
}

```

## Contributing

Contributions are welcome! Please open an issue or submit a pull request for any improvements or bug fixes.

## License

This project is licensed under the MIT License.

---

*Generated with ❤️ by Equator QA Team*

            
