# Haerae-Evaluation-Toolkit
<p align="center">
<img src="assets/imgs/logo.png.png" alt="logo" width="250">
</p>
Haerae-Evaluation-Toolkit is an emerging open-source Python library designed to streamline and standardize the evaluation of Large Language Models (LLMs), focusing on Korean.
[Redefining Evaluation Standards: A Unified Framework for Evaluating the Korean Capabilities of Language Models](https://arxiv.org/abs/2503.22968) (Paper Link)
## ✨ Key Features
- **Multiple Evaluation Methods**
  - Logit-Based, String-Match, Partial-Match, LLM-as-a-Judge, and more.
- **Reasoning Chain Analysis**
  - Dedicated to analyzing extended Korean chain-of-thought reasoning.
- **Extensive Korean Datasets**
  - Includes HAE-RAE Bench, KMMLU, KUDGE, CLiCK, K2-Eval, HRM8K, Benchhub, Kormedqa, KBL, and more.
- **Scalable Inference-Time Techniques**
  - Best-of-N, Majority Voting, Beam Search, and other advanced methods.
- **Integration-Ready**
  - Supports OpenAI-compatible endpoints, Hugging Face, and LiteLLM.
- **Flexible and Pluggable Architecture**
  - Easily extend with new datasets, evaluation metrics, and inference backends.
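
To make the inference-time techniques listed above concrete, here is a minimal sketch of majority voting and best-of-N selection over a set of sampled answers. It is not the toolkit's implementation: the function names `majority_vote` and `best_of_n` and the `toy_reward` table are hypothetical and exist only for illustration.

```python
from collections import Counter
from typing import Callable, Sequence


def majority_vote(candidates: Sequence[str]) -> str:
    """Return the most frequent answer among the sampled candidates."""
    counts = Counter(candidates)
    # most_common(1) yields a one-element list of (answer, count) pairs.
    return counts.most_common(1)[0][0]


def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate that a scoring function (e.g. a reward model) rates highest."""
    return max(candidates, key=score)


# Five sampled answers to the same question (illustrative data).
samples = ["서울", "부산", "서울", "서울", "인천"]

# Hypothetical reward scores standing in for a real reward model.
toy_reward = {"서울": 0.9, "부산": 0.4, "인천": 0.2}

print(majority_vote(samples))
print(best_of_n(samples, score=lambda answer: toy_reward[answer]))
```

Majority voting needs no extra model, while best-of-N depends on the quality of the scoring function, which is why the toolkit exposes a separate reward model option.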
---
## 🚀 Project Status
We are actively developing core features and interfaces. Current goals include:
- **Unified API**
  - Seamless loading and integration of diverse Korean benchmark datasets.
- **Configurable Inference Scaling**
  - Generate higher-quality outputs through techniques like best-of-N and beam search.
- **Pluggable Evaluation Methods**
  - Enable chain-of-thought assessments, logit-based scoring, and standard evaluation metrics.
- **Modular Architecture**
  - Easily extendable for new backends, tasks, or custom evaluation logic.
---
## 🛠️ Key Components
- **Dataset Abstraction**
  - Load and preprocess your datasets (or subsets) with minimal configuration.
- **Scalable Methods**
  - Apply decoding strategies such as sampling, beam search, and best-of-N approaches.
- **Evaluation Library**
  - Compare predictions to references, use judge models, or create custom scoring methods.
- **Registry System**
  - Add new components (datasets, models, scaling methods) via simple decorator-based registration.
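
The decorator-based registration mentioned above can be pictured with the following sketch. It only illustrates the general pattern: `DATASET_REGISTRY`, `register_dataset`, `ToyQADataset`, and `load_dataset_by_name` are hypothetical names, and the toolkit's real registry API may look different.

```python
from typing import Callable, Dict, List, Type

# Hypothetical registry mapping a string key to a dataset class.
DATASET_REGISTRY: Dict[str, Type] = {}


def register_dataset(name: str) -> Callable[[Type], Type]:
    """Class decorator that stores the decorated class under `name`."""
    def decorator(cls: Type) -> Type:
        if name in DATASET_REGISTRY:
            raise ValueError(f"Dataset '{name}' is already registered")
        DATASET_REGISTRY[name] = cls
        return cls
    return decorator


@register_dataset("toy_qa")
class ToyQADataset:
    """Tiny in-memory dataset used only for this illustration."""

    def load(self) -> List[dict]:
        return [{"input": "1 + 1 = ?", "reference": "2"}]


def load_dataset_by_name(name: str) -> List[dict]:
    """Look up the class by its registered key, instantiate it, and load samples."""
    return DATASET_REGISTRY[name]().load()


print(load_dataset_by_name("toy_qa"))
```

With this pattern, a string such as `"haerae_bench"` passed to an evaluator can be resolved to a concrete class without the caller importing it directly.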
---
## ⚙️ Installation
1. **Clone the repository:**
   ```bash
   git clone https://github.com/HAE-RAE/haerae-evaluation-toolkit.git
   cd haerae-evaluation-toolkit
   ```
2. **(Optional) Create and activate a virtual environment:**
   * Using venv:
     ```bash
     python -m venv venv
     source venv/bin/activate  # On Windows use `venv\Scripts\activate`
     ```
   * Using Conda:
     ```bash
     conda create -n hret python=3.11 -y
     conda activate hret
     ```
3. **Install dependencies:** Choose one of the following methods:
   * **Using pip:**
     ```bash
     pip install -r requirements.txt
     ```
   * **Using uv (recommended for speed):**
     * First, install uv if you haven't already. See the [uv installation guide](https://github.com/astral-sh/uv).
     * Then, install dependencies using uv:
       ```bash
       uv pip install -r requirements.txt
       ```
---
## 🚀 Quickstart: Using the Evaluator API
Below is a minimal example of how to use the `Evaluator` interface to load a dataset, apply a model and (optionally) a scaling method, and then evaluate the outputs.
For more detailed instructions on getting up and running, see **tutorial/kor(eng)/quick_start.md**.
### Python Usage
```python
from llm_eval.evaluator import Evaluator

# 1) Initialize an Evaluator with default parameters (optional).
evaluator = Evaluator()

# 2) Run the evaluation pipeline.
results = evaluator.run(
    model="huggingface",                          # or "litellm", "openai", etc.
    judge_model=None,                             # specify e.g. "huggingface_judge" if needed
    reward_model=None,                            # specify e.g. "huggingface_reward" if needed
    dataset="haerae_bench",                       # or "kmmlu", "qarv", ...
    subset=["csat_geo", "csat_law"],              # optional subset(s)
    split="test",                                 # "train"/"validation"/"test"
    dataset_params={"revision": "main"},          # example HF config
    model_params={"model_name_or_path": "gpt2"},  # example HF Transformers param
    judge_params={},                              # params for judge model (if judge_model is not None)
    reward_params={},                             # params for reward model (if reward_model is not None)
    scaling_method=None,                          # or "beam_search", "best_of_n"
    scaling_params={},                            # e.g., {"beam_size": 3, "num_iterations": 5}
    evaluator_params={},                          # e.g., custom evaluation settings
)
```
- Dataset is loaded from the registry (e.g., `haerae_bench` is just one of many).
- Model is likewise loaded via the registry (`huggingface`, `litellm`, etc.).
- `judge_model` and `reward_model` can be provided if you want LLM-as-a-Judge or reward-model logic. If both are `None`, the system uses a single model backend.
- `ScalingMethod` is optional; set it only if you want specialized decoding such as beam search or best-of-N.
- `EvaluationMethod` (e.g., `string_match`, `log_likelihood`, `partial_match` or `llm_judge`) measures performance.
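
As a rough illustration of how `string_match` and `partial_match` scoring differ, the sketch below compares predictions to references with exact and substring matching. The normalization rule and the helper names (`normalize`, `accuracy`) are assumptions made for this example, not the toolkit's actual logic.

```python
from typing import Callable, Dict, List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization rule)."""
    return " ".join(text.strip().lower().split())


def string_match(prediction: str, reference: str) -> bool:
    """Exact match after normalization."""
    return normalize(prediction) == normalize(reference)


def partial_match(prediction: str, reference: str) -> bool:
    """True when the normalized reference appears anywhere in the prediction."""
    return normalize(reference) in normalize(prediction)


def accuracy(samples: List[Dict[str, str]], scorer: Callable[[str, str], bool]) -> float:
    """Fraction of samples the scorer marks as correct."""
    if not samples:
        return 0.0
    hits = sum(scorer(s["prediction"], s["reference"]) for s in samples)
    return hits / len(samples)


samples = [
    {"prediction": "정답은 서울입니다", "reference": "서울"},
    {"prediction": "서울", "reference": "서울"},
]

print(accuracy(samples, string_match))
print(accuracy(samples, partial_match))
```

Exact matching penalizes the first sample for its surrounding sentence, while partial matching accepts it, which is why the choice of evaluation method matters for verbose models.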
### CLI Usage
We also provide a simple command-line interface (CLI) via `evaluator.py`:
```bash
python llm_eval/evaluator.py \
  --model huggingface \
  --judge_model huggingface_judge \
  --reward_model huggingface_reward \
  --dataset haerae_bench \
  --subset csat_geo \
  --split test \
  --scaling_method beam_search \
  --evaluation_method string_match \
  --model_params '{"model_name_or_path": "gpt2"}' \
  --scaling_params '{"beam_size":3, "num_iterations":5}' \
  --output_file results.json
```
This command will:
1. Load the `haerae_bench` (subset=`csat_geo`) test split.
2. Create a MultiModel internally with:
   - Generate model: `huggingface` → `gpt2`
   - Judge model: `huggingface_judge` (if you pass relevant `judge_params`)
   - Reward model: `huggingface_reward` (if you pass relevant `reward_params`)
3. Apply Beam Search (`beam_size=3`).
4. Evaluate final outputs via `string_match`.
5. Save the resulting JSON file to `results.json`.
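
The `--model_params` and `--scaling_params` flags above take JSON strings. The sketch below shows one common way to parse such flags with `argparse` and `json.loads`; it is a simplified stand-in with only four flags, not the actual parser in `evaluator.py`.

```python
import argparse
import json


def build_parser() -> argparse.ArgumentParser:
    """Build a toy parser whose *_params flags are decoded from JSON strings."""
    parser = argparse.ArgumentParser(description="Toy evaluator CLI (illustrative)")
    parser.add_argument("--model", required=True)
    parser.add_argument("--dataset", required=True)
    # argparse calls json.loads on the raw string; invalid JSON becomes a usage error.
    parser.add_argument("--model_params", type=json.loads, default={})
    parser.add_argument("--scaling_params", type=json.loads, default={})
    return parser


# Parse an explicit argument list instead of sys.argv so the example runs anywhere.
args = build_parser().parse_args([
    "--model", "huggingface",
    "--dataset", "haerae_bench",
    "--model_params", '{"model_name_or_path": "gpt2"}',
    "--scaling_params", '{"beam_size": 3, "num_iterations": 5}',
])

print(args.model_params["model_name_or_path"])
print(args.scaling_params["beam_size"])
```

Wrapping the JSON in single quotes on the shell, as in the command above, keeps the inner double quotes intact.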
### Configuration File
Instead of passing many arguments, the entire pipeline can be described in a
single YAML file. Create `evaluator_config.yaml`:
```yaml
dataset:
  name: haerae_bench
  split: test
  params: {}
model:
  name: huggingface
  params:
    model_name_or_path: gpt2
evaluation:
  method: string_match
  params: {}
language_penalize: true
target_lang: ko
few_shot:
  num: 0
```
Run the configuration with:
```python
from llm_eval.evaluator import run_from_config
result = run_from_config("evaluator_config.yaml")
```
See `examples/evaluator_config.yaml` for a full template including judge,
reward, and scaling options.
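
To show how the nested configuration relates to the keyword arguments used in the Python example, here is a small sketch that flattens a config dictionary into `run`-style arguments. The dictionary mirrors what a YAML loader such as PyYAML's `yaml.safe_load` would return for the file above. The helper `config_to_run_kwargs` and the exact mapping are assumptions; `run_from_config` may handle these fields differently.

```python
from typing import Any, Dict

# Parsed form of the YAML above, written as a plain dict to avoid extra dependencies.
config: Dict[str, Any] = {
    "dataset": {"name": "haerae_bench", "split": "test", "params": {}},
    "model": {"name": "huggingface", "params": {"model_name_or_path": "gpt2"}},
    "evaluation": {"method": "string_match", "params": {}},
    "language_penalize": True,
    "target_lang": "ko",
    "few_shot": {"num": 0},
}


def config_to_run_kwargs(cfg: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten the dataset and model sections into keyword arguments (assumed mapping)."""
    for section in ("dataset", "model"):
        if "name" not in cfg.get(section, {}):
            raise KeyError(f"config section '{section}' needs a 'name' field")
    dataset, model = cfg["dataset"], cfg["model"]
    return {
        "dataset": dataset["name"],
        "split": dataset.get("split", "test"),
        "dataset_params": dataset.get("params", {}),
        "model": model["name"],
        "model_params": model.get("params", {}),
    }


kwargs = config_to_run_kwargs(config)
print(kwargs)
```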
---
## 🎯 HRET API: MLOps-Friendly Interface
For production environments and MLOps integration, we provide **HRET** (Haerae Evaluation Toolkit), a decorator-based API inspired by deepeval that plugs LLM evaluation into existing training and deployment workflows.
### Quick Start with HRET
```python
import llm_eval.hret as hret

# Simple decorator-based evaluation
@hret.evaluate(dataset="kmmlu", model="huggingface")
def my_model(input_text: str) -> str:
    # `model` is your own, already-loaded model object.
    return model.generate(input_text)

# Run evaluation
result = my_model()
print(f"Accuracy: {result.metrics['accuracy']}")
```
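
For readers unfamiliar with the pattern, the sketch below shows how a decorator can turn a plain text-in, text-out function into a callable that runs an evaluation. It is a toy stand-in, not HRET's implementation: `evaluate`, `TOY_DATASETS`, and the returned metrics dictionary are hypothetical.

```python
import functools
from typing import Callable, Dict, List

# Hypothetical in-memory datasets keyed by name.
TOY_DATASETS: Dict[str, List[Dict[str, str]]] = {
    "toy_qa": [
        {"input": "2 + 2", "reference": "4"},
        {"input": "3 + 3", "reference": "6"},
    ],
}


def evaluate(dataset: str) -> Callable:
    """Decorator factory: the wrapped function is replaced by an evaluation runner."""
    def decorator(fn: Callable[[str], str]) -> Callable[[], Dict[str, float]]:
        @functools.wraps(fn)
        def run_evaluation() -> Dict[str, float]:
            samples = TOY_DATASETS[dataset]
            # Call the user's function on every input and count exact matches.
            correct = sum(fn(s["input"]).strip() == s["reference"] for s in samples)
            return {"accuracy": correct / len(samples)}
        return run_evaluation
    return decorator


@evaluate(dataset="toy_qa")
def my_model(input_text: str) -> str:
    """Toy 'model' that solves 'a + b' expressions."""
    left, _, right = input_text.split()
    return str(int(left) + int(right))


result = my_model()
print(result["accuracy"])
```

Note that calling the decorated function takes no arguments: the decorator supplies the inputs from the dataset, which matches the `result = my_model()` call style shown above.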
### Key HRET Features
- **🎨 Decorator-Based API**: `@hret.evaluate`, `@hret.benchmark`, `@hret.track_metrics`
- **🔧 Context Managers**: Fine-grained control with `hret.evaluation_context()`
- **📊 MLOps Integration**: Built-in support for MLflow, Weights & Biases, and custom loggers
- **⚙️ Configuration Management**: YAML/JSON config files and global settings
- **📈 Metrics Tracking**: Cross-run comparison and performance monitoring
- **🚀 Production Ready**: Designed for training pipelines, A/B testing, and continuous evaluation
### Advanced Usage Examples
#### Model Benchmarking
```python
@hret.benchmark(dataset="kmmlu")
def compare_models():
    return {
        "gpt-4": lambda x: gpt4_model.generate(x),
        "claude-3": lambda x: claude_model.generate(x),
        "custom": lambda x: custom_model.generate(x),
    }

results = compare_models()
```
#### MLOps Integration
```python
with hret.evaluation_context(dataset="kmmlu") as ctx:
    # Add MLOps integrations
    ctx.log_to_mlflow(experiment_name="llm_experiments")
    ctx.log_to_wandb(project_name="model_evaluation")

    # Run evaluation
    result = ctx.evaluate(my_model_function)
```
#### Training Pipeline Integration
```python
class ModelTrainingPipeline:
    def evaluate_checkpoint(self, epoch):
        with hret.evaluation_context(
            run_name=f"checkpoint_epoch_{epoch}"
        ) as ctx:
            ctx.log_to_mlflow(experiment_name="training")
            result = ctx.evaluate(self.model.generate)

        if self.detect_degradation(result):
            self.send_alert(epoch, result)
```
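
The `detect_degradation` method in the pipeline above is left to the user. One simple way to implement it, sketched here as a standalone function under assumed semantics, is to flag a checkpoint whose accuracy falls more than a tolerance below the best score seen so far. The signature, the `tolerance` default, and the accuracy values are all illustrative.

```python
from typing import List


def detect_degradation(history: List[float], current: float, tolerance: float = 0.02) -> bool:
    """Return True when `current` is more than `tolerance` below the best past score."""
    if not history:
        # Nothing to compare against on the first checkpoint.
        return False
    return current < max(history) - tolerance


# Illustrative per-epoch accuracies, not real results.
history: List[float] = []
for epoch, acc in enumerate([0.61, 0.66, 0.65, 0.58]):
    if detect_degradation(history, acc):
        print(f"epoch {epoch}: accuracy {acc:.2f} dropped below best {max(history):.2f}")
    history.append(acc)
```

Comparing against the best score rather than the previous one avoids missing a slow decline spread over several checkpoints.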
### Configuration Management
Create `hret_config.yaml`:
```yaml
default_dataset: "kmmlu"
default_model: "huggingface"
mlflow_tracking: true
wandb_tracking: true
output_dir: "./results"
auto_save_results: true
```
Load and use:
```python
hret.load_config("hret_config.yaml")
result = hret.quick_eval(my_model_function)
```
### Documentation
- **English**: [docs/eng/08-hret-api-guide.md](docs/eng/08-hret-api-guide.md)
- **한국어**: [docs/kor/08-hret-api-guide.md](docs/kor/08-hret-api-guide.md)
- **Examples**: [examples/hret_examples.py](examples/hret_examples.py), [examples/mlops_integration_example.py](examples/mlops_integration_example.py)
HRET maintains full backward compatibility with the existing Evaluator API while providing a modern, MLOps-friendly interface for production deployments.
---
## 🤝 Contributing & Contact
We welcome collaborators, contributors, and testers interested in advancing LLM evaluation methods, especially for Korean language tasks.
### 📩 Contact Us
- Development Lead: gksdnf424@gmail.com
- Research Lead: spthsrbwls123@yonsei.ac.kr
We look forward to hearing your ideas and contributions!
---
## 📝 Citation
If you find HRET useful in your research, please consider citing our paper:
```bibtex
@misc{lee2025redefiningevaluationstandardsunified,
title={Redefining Evaluation Standards: A Unified Framework for Evaluating the Korean Capabilities of Language Models},
author={Hanwool Lee and Dasol Choi and Sooyong Kim and Ilgyun Jung and Sangwon Baek and Guijin Son and Inseon Hwang and Naeun Lee and Seunghyeok Hong},
year={2025},
eprint={2503.22968},
archivePrefix={arXiv},
primaryClass={cs.CE},
url={https://arxiv.org/abs/2503.22968},
}
```
## 📜 License
Licensed under the Apache License 2.0.
© 2025 The HAE-RAE Team. All rights reserved.