<div align="center">
# Evalia
`evalia` implements the Bayes@N framework introduced in [Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation](https://arxiv.org/abs/2504.11651).
</div>
---
## Installation
```bash
pip install evalia
```
Requires Python 3.9–3.13 and NumPy.
## Data and shape conventions
- Categories: encode outcomes per trial as integers in `{0, ..., C}`.
- Weights: choose rubric weights `w` of length `C+1` (e.g., `[0, 1]` for binary outcomes).
- Shapes: `R` is `M x N`, `R0` is `M x D` (if provided); both must share the same `M` and category set.
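As a concrete instance of these conventions, a minimal binary setup (`C = 1`) might look like the following; the sizes and the random generator here are purely illustrative:

```python
import numpy as np

# Toy binary setup under the conventions above: C = 1, categories {0, 1},
# so the rubric weights w have length C + 1 = 2.
M, N, D = 3, 4, 2
rng = np.random.default_rng(0)
R = rng.integers(0, 2, size=(M, N))   # M x N current outcomes in {0, 1}
R0 = rng.integers(0, 2, size=(M, D))  # M x D prior outcomes, same category set
w = np.array([0.0, 1.0])              # incorrect -> weight 0, correct -> weight 1

assert R.shape == (M, N) and R0.shape == (M, D)  # shared M
assert len(w) == 2                               # C + 1
```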
## APIs
- `bayes.eval.bayes(R, w, R0=None) -> (mu: float, sigma: float)`
  - `R`: `M x N` int array with entries in `{0, ..., C}`
  - `w`: length-`C+1` float array of rubric weights
  - `R0` (optional): `M x D` int array of prior outcomes (same category set as `R`)
  - Returns the posterior estimate `mu` of the rubric-weighted performance and its uncertainty `sigma`.
- `bayes.eval.avg(R) -> float`
  - Returns the naive mean of the elements of `R`. For binary accuracy, encode incorrect=0, correct=1.
- `bayes.utils.competition_ranks_from_scores(scores, tol=1e-12) -> list[int]`
  - Converts scores to competition ranks (1, 2, 3, 3, 5, …) with tie handling.
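The competition-ranking behavior can be sketched in plain Python. This is a hypothetical re-implementation for illustration only, not the library's code, and it assumes higher scores rank better:

```python
# Sketch of competition ("1, 2, 3, 3, 5") ranking: tied scores share the
# earlier rank, and the next distinct score skips past the tied group.
def competition_ranks(scores, tol=1e-12):
    order = sorted(range(len(scores)), key=lambda i: -scores[i])  # best first
    ranks = [0] * len(scores)
    for pos, i in enumerate(order):
        prev = order[pos - 1] if pos > 0 else None
        if prev is not None and abs(scores[prev] - scores[i]) <= tol:
            ranks[i] = ranks[prev]   # tie: reuse the previous rank
        else:
            ranks[i] = pos + 1       # 1-based position, skipping after ties
    return ranks

print(competition_ranks([0.9, 0.7, 0.7, 0.5]))  # → [1, 2, 2, 4]
```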
## How to use
```python
import numpy as np
from bayes.eval import bayes

# Outcomes R: shape (M, N) with integer categories in {0, ..., C}
R = np.array([
    [0, 1, 2, 2, 1],  # Item 1, N=5 trials
    [1, 1, 0, 2, 2],  # Item 2, N=5 trials
])

# Rubric weights w: length C+1. Here: 0=incorrect, 1=partial (0.5), 2=correct (1.0)
w = np.array([0.0, 0.5, 1.0])

# Optional prior outcomes R0: shape (M, D). If omitted, D=0.
R0 = np.array([
    [0, 2],
    [1, 2],
])

# With prior (D=2 → T=10)
mu, sigma = bayes(R, w, R0)
print(mu, sigma)  # expected ~ (0.575, 0.084275)

# Without prior (D=0 → T=8)
mu2, sigma2 = bayes(R, w)
print(mu2, sigma2)  # expected ~ (0.5625, 0.091998)
```
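The expected numbers above can be reproduced by hand if one assumes the Bayes@N estimator places a uniform Dirichlet(1, …, 1) prior on each item's category probabilities, which is consistent with the total counts `T = N + D + C + 1` noted in the comments. The sketch below is an illustrative assumption, not the library's actual implementation:

```python
import numpy as np

# Sketch: per-item Dirichlet posterior with a uniform prior.
# alpha_c = (count of category c in R and R0) + 1, so alpha.sum() = T.
# mu averages the posterior means of w·p across items; sigma treats the
# M items as independent, so Var(mean) = sum of item variances / M^2.
def bayes_sketch(R, w, R0=None):
    M, C1 = R.shape[0], len(w)
    counts = np.stack([(R == c).sum(axis=1) for c in range(C1)], axis=1)
    if R0 is not None:
        counts += np.stack([(R0 == c).sum(axis=1) for c in range(C1)], axis=1)
    alpha = counts + 1.0           # posterior Dirichlet parameters, shape (M, C+1)
    T = alpha.sum(axis=1)          # T = N + D + C + 1 per item
    mean_i = alpha @ w / T         # posterior mean of w·p per item
    # Variance of a Dirichlet-weighted sum: [T·Σ w²α − (Σ wα)²] / [T²(T+1)]
    var_i = (T * (alpha @ w**2) - (alpha @ w) ** 2) / (T**2 * (T + 1))
    return mean_i.mean(), np.sqrt(var_i.sum()) / M

R = np.array([[0, 1, 2, 2, 1], [1, 1, 0, 2, 2]])
w = np.array([0.0, 0.5, 1.0])
R0 = np.array([[0, 2], [1, 2]])

mu, sigma = bayes_sketch(R, w, R0)
print(round(mu, 4), round(sigma, 6))  # → 0.575 0.084275
```

Under this assumption both README examples check out: with the prior, each item has `T = 10`, and without it `T = 8`, giving (0.5625, 0.091998).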
## Citing
If you use Evalia or Bayes@N, please cite:
```
@article{hariri2025dontpassk,
  title   = {Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation},
  author  = {Hariri, Mohsen and Samandar, Amirhossein and Hinczewski, Michael and Chaudhary, Vipin},
  journal = {arXiv preprint arXiv:2504.11651},
  year    = {2025},
  url     = {https://mohsenhariri.github.io/bayes-kit/}
}
```
## License
MIT License. See the `LICENSE` file for details.
## Support
- Documentation and updates: https://mohsenhariri.github.io/bayes-kit/
- Issues and feature requests: https://github.com/mohsenhariri/bayes-kit/issues