hugme

Name: hugme
Version: 0.0.1
Summary: Library to evaluate models on HuGME benchmark.
Author email: HUN-REN Research Center for Linguistics <osvathm.matyas@nytud.hun-ren.hu>
Homepage: https://github.com/nytud/hugme
Upload time: 2025-10-29 09:31:14
Requires Python: >=3.8
Keywords: models, model-training, fine-tuning, natural-language-processing, deep-learning, evaluation, benchmark
Requirements: tqdm, torch, openai, deepeval, accelerate, transformers (==4.54.1), sentencepiece, pyspellchecker, textstat, peft, scipy, torchvision, huspacy, spacy
# HuGME: Hungarian Generative Model Evaluation benchmark

**HuGME** is an advanced evaluation framework designed to assess Large Language Models (LLMs) with a focus on **Hungarian language proficiency and cultural understanding**. It provides a structured assessment of model performance across multiple dimensions, based on [DeepEval](https://docs.confident-ai.com/).

## 📌 Installation & Usage

### Installation

To install **HuGME**, clone the repository and install it with pip:

```bash
git clone https://github.com/nytud/hugme
cd hugme
pip install .
```
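
The package declares support for Python 3.8 and newer. Optionally, the installation can be isolated in a virtual environment using standard Python tooling; this is a generic sketch, not a HuGME-specific requirement:

```bash
# Optional: install into an isolated virtual environment (standard Python tooling).
python -m venv .venv
source .venv/bin/activate
git clone https://github.com/nytud/hugme
cd hugme
pip install .
```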

### Running HuGME

You can execute HuGME with:

```bash
hugme --model-name /path/to/your/model --tasks bias --parameters config.json
```

### Command-Line Parameters

| Parameter         | Description |
|------------------|-------------|
| `--model-name`   | Name or local path of the model to evaluate (a local Hugging Face model or an OpenAI model). |
| `--tasks`        | Tasks to evaluate (`bias`, `toxicity`, `faithfulness`, `summarization`, `answer-relevancy`, `mmlu`, `spelling`, `truthfulqa`, `prompt-alignment`, `readability`, `needle-in-haystack`). |
| `--judge`        | Default: `"gpt-3.5-turbo-1106"`. Specifies the judge model for evaluations. |
| `--use-cuda`     | Default: `True`. Enables GPU acceleration. |
| `--cuda-id`      | Default: `1`. Specifies which GPU to use (indexing starts from 0). |
| `--seed`         | Sets a random seed for reproducibility. |
| `--parameters`   | Required. Path to a JSON configuration file for model parameters. See below for example. |
| `--save-results` | Default: `True`. Whether to save evaluation results. |
| `--use-gen-results` | Path to a file of previously generated model outputs to evaluate on. |
| `--provider` | Default: `False`. External provider to use. Choices: `openai`. |
| `--thinking` | Default: `False`. Enable thinking mode. |
| `--use-alpaca-prompt` | Default: `False`. Use the Alpaca prompt format. |
| `--sample-size` | Default: `1.0`. Fraction of each task's dataset to sample. |
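
For illustration, a fuller invocation combining several of these flags might look like the sketch below; the model name, seed, and sample size are placeholders, and only a single task is passed to stay within the documented syntax:

```bash
# Hypothetical invocation; replace the model name, paths, and values with your own.
export DATASETS=/path/to/datasets
hugme \
  --model-name NYTK/PULI-LlumiX-32K \
  --tasks toxicity \
  --seed 42 \
  --sample-size 0.1 \
  --parameters config.json
```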

### 🛠 Configure HuGME

Before running HuGME, you must set the `DATASETS` environment variable so the framework can access the datasets required by the evaluation tasks; the path must point to the directory that contains them.

```bash
export DATASETS=/path/to/datasets
```

The following environment variable must also be set for the spelling task:

```bash
export BERT_MODEL=/path/to/bert-model
```

HuGME requires model parameters to be configured via a JSON file, whether the model runs through Hugging Face's Transformers library or OpenAI's library. The file path is passed with the `--parameters` flag. Example:

```json
{
  "max_new_tokens": 50,
  "temperature": 0.7,
  "top_p": 0.9,
  "top_k": 150,
  "repetition_penalty": 0.98,
  "diversity_penalty": 0,
  "do_sample": true,
  "return_full_text": false
}
```
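
For convenience, the same file can be written directly from the shell; the values below are illustrative only and should be tuned to your model and tasks:

```bash
# Write a minimal config.json for --parameters (illustrative values only).
cat > config.json <<'EOF'
{
  "max_new_tokens": 256,
  "temperature": 0.7,
  "top_p": 0.9,
  "do_sample": true,
  "return_full_text": false
}
EOF
```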


### 🔑 Providing API Keys

To authenticate with OpenAI or Hugging Face, set your API keys as environment variables:

```bash
export OPENAI_API_KEY=sk-examplekey # judge model for DeepEval-based metrics
export HF_TOKEN=hf-exampletoken # for Hugging Face models
export PROVIDER_API_KEY=provider-api-key # API key for a custom (OpenAI-compatible) provider
export PROVIDER_URL=provider-url # base URL of a custom (OpenAI-compatible) provider
```

Alternatively, provide them inline when running the evaluation:

```bash
OPENAI_API_KEY=sk-examplekey hugme --model-name NYTK/PULI-LlumiX-32K --tasks mmlu
```
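
A model served behind a custom OpenAI-compatible endpoint can, in principle, be evaluated by combining the `PROVIDER_*` variables with the `--provider` flag; the URL, key, and model name below are placeholders, and the exact routing behaviour is an assumption based on the flag descriptions above:

```bash
# Sketch only: placeholder endpoint, key, and model name.
export PROVIDER_API_KEY=provider-api-key
export PROVIDER_URL=https://provider.example/v1
hugme --model-name provider-model-name --tasks mmlu --provider openai --parameters config.json
```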

## 🧠 Results

After running metrics and/or benchmarks, all generation and evaluation outputs are saved in the `results/` directory.


## 📊 Evaluation Tasks

HuGME includes multiple tasks to evaluate different aspects of LLM performance in Hungarian. Details of how the DeepEval-based metrics are calculated can be found [here](https://docs.confident-ai.com/docs/getting-started).

### 1️⃣ Bias

Assesses language model outputs for biased content through systematic opinion analysis across gender, politics, race/ethnicity, and geographical dimensions. It employs a dataset of 100 carefully crafted queries designed to potentially elicit biased responses, with models required to prefix their outputs using opinion indicators (such as *Szerintem* 'I think', *Úgy gondolom* 'I believe', or *Véleményem szerint* 'In my opinion'). This prefixing requirement facilitates opinion extraction, which is crucial since unbiased responses typically lack opinionated content.

### 2️⃣ Toxicity

Evaluates language models' tendency to generate harmful or offensive content by analyzing opinions extracted from model responses to 100 specialized queries. An opinion is classified as toxic if it contains personal attacks, mockery, hate speech, dismissive statements, or threats that degrade or intimidate others, while non-toxic opinions are characterized by respectful engagement, openness to discussion, and constructive critique of ideas rather than individuals.

### 3️⃣ Answer relevancy

Evaluates the model's ability to generate contextually appropriate responses by comparing individual output statements against the input query. Using 100 diverse test queries spanning history, logic, and Hungarian idioms, the module assesses whether responses stay on topic and avoid contradictions, focusing on relevance rather than factual accuracy.

### 4️⃣ Faithfulness

Examines factual accuracy by comparing model outputs against provided context across 100 queries. Each query includes detailed context, with the evaluation focused on verifying that extracted claims align with the given factual information.

### 5️⃣ Summarization

Tests the model's ability to condense Hungarian texts while retaining key information. Using 50 texts, evaluation is based on whether four predefined yes/no questions can be answered from each generated summary, ensuring critical details remain while allowing flexibility in presentation.

### 6️⃣ Prompt alignment

Evaluates models' ability to execute Hungarian commands accurately. It uses 100 queries, each containing specific instructions, with evaluation based on whether the model follows all instructions completely and precisely. This task requires `max_new_tokens` to be at least 256.

### 7️⃣ Spelling

Evaluates adherence to Hungarian orthography using a custom dictionary trained on index.hu texts and pyspellchecker. Flagged words from readability test outputs are verified by GPT-4 to minimize false positives, with the final score calculated as the ratio of correctly spelled words.
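
Because this task depends on the `BERT_MODEL` variable described in the configuration section, a typical invocation (with placeholder paths) might look like:

```bash
# Placeholder paths; the spelling task also requires BERT_MODEL.
export DATASETS=/path/to/datasets
export BERT_MODEL=/path/to/bert-model
hugme --model-name /path/to/your/model --tasks spelling --parameters config.json
```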

### 8️⃣ Readability

Evaluates how well models adapt their output complexity to match input texts. It uses 20 texts across four complexity levels (fairy tales, 6th grade, 10th grade, and academic), with readability assessed using an average of Coleman-Liau Index and textstat's text_standard scores.

### 9️⃣ HuTruthfulQA

Adapts the TruthfulQA dataset for Hungary by translating questions and adding culturally specific content, resulting in 747 questions across 37 categories.

### 🔟 HuMMLU (Massive Multitask Language Understanding)

Adapts the MMLU benchmark for Hungarian by machine-translating and manually refining multiple-choice questions across 38 subjects to ensure cultural relevance and accurate assessment.

### 🧩 Needle in the Haystack

Tests an LLM's ability to extract specific information (the "needle") from a large body of Hungarian text (the "haystack"). A target sentence is embedded at various positions in a Hungarian novel, and the model must locate and reproduce it, assessing its ability to focus on relevant details within a long, complex context.

Providers like OpenAI are currently unsupported for this metric.
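
The task is therefore run against a local model; a minimal sketch with placeholder paths:

```bash
# Needle-in-haystack runs against a local model (placeholder paths).
hugme --model-name /path/to/your/model --tasks needle-in-haystack --parameters config.json
```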

# 🤝 Contributing

Contributions to HuGME are welcome! If you find a bug, want to add new evaluation modules, or improve existing ones, please feel free to open an issue or submit a pull request.


            
