bayes-eval-kit

Name: bayes-eval-kit
Version: 0.0.1
Summary: Bayesian evaluation and ranking toolkit
Upload time: 2025-10-06 04:23:36
Requires Python: >=3.9, <3.14
Keywords: bayesian, statistics, ranking, evaluation, machine learning, large language models
<div align="center">

# bayes-eval-kit

`bayes-eval-kit` implements the Bayes@N framework introduced in ["Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation"](https://arxiv.org/abs/2504.11651).

[![arXiv](https://img.shields.io/badge/arXiv-2504.11651-b31b1b.svg)](https://arxiv.org/abs/2504.11651)
[![PyPI version](https://img.shields.io/pypi/v/bayes-eval-kit.svg)](https://pypi.org/project/bayes-eval-kit/)
[![Python versions](https://img.shields.io/pypi/pyversions/bayes-eval-kit.svg)](https://pypi.org/project/bayes-eval-kit/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](#license)

</div>

---

## Installation

```bash
pip install bayes-eval-kit
```

Requires Python 3.9–3.13 and NumPy.

## Data and shape conventions

- Categories: encode outcomes per trial as integers in `{0, ..., C}`.
- Weights: choose rubric weights `w` of length `C+1` (e.g., `[0, 1]` for a binary `R`).
- Shapes: `R` is `M x N`, `R0` is `M x D` (if provided); both must share the same `M` and category set.
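
For instance, a binary rubric (`C = 1`) with `M = 2` items, `N = 4` scored trials per item, and `D = 2` prior trials per item could be encoded as follows (values chosen purely for illustration):

```python
import numpy as np

# Binary rubric: categories {0, 1}, so w has length C + 1 = 2.
w = np.array([0.0, 1.0])   # w[0] = incorrect, w[1] = correct

R = np.array([             # shape (M, N) = (2, 4)
    [0, 1, 1, 0],          # item 1
    [1, 1, 1, 0],          # item 2
])

R0 = np.array([            # shape (M, D) = (2, 2), same category set as R
    [1, 0],
    [0, 1],
])
```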

## APIs

- `bayes.eval.bayes(R, w, R0=None) -> (mu: float, sigma: float)`
  - `R`: `M x N` int array with entries in `{0, ..., C}`
  - `w`: length `C+1` float array of rubric weights
  - `R0` (optional): `M x D` int array of prior outcomes (same category set as `R`)
  - Returns posterior estimate `mu` of the rubric-weighted performance and its uncertainty `sigma`.

- `bayes.eval.avg(R) -> float`
  - Returns the naive mean of elements in `R`. For binary accuracy, encode incorrect=0, correct=1.

- `bayes.utils.competition_ranks_from_scores(scores, tol=1e-12) -> list[int]`
  - Convert scores to competition ranks (1,2,3,3,5,…) with tie handling.
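
A quick sketch of the two helper functions above. The printed values follow from the documented behavior (naive mean over all entries; competition ranking with ties sharing a rank); the exact tie-handling and score-ordering conventions are assumptions to verify against the library:

```python
import numpy as np
from bayes.eval import avg
from bayes.utils import competition_ranks_from_scores

# Naive mean over all entries of R (binary: incorrect = 0, correct = 1).
R = np.array([
    [0, 1, 1, 0],
    [1, 1, 1, 0],
])
print(avg(R))  # 5 correct out of 8 trials -> 0.625

# Competition ranking: tied scores share a rank and the next rank is skipped,
# e.g. 1, 2, 2, 4 (assuming higher scores rank first).
scores = [0.91, 0.85, 0.85, 0.60]
print(competition_ranks_from_scores(scores))  # [1, 2, 2, 4]
```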


## How to use

```python
import numpy as np
from bayes.eval import bayes

# Outcomes R: shape (M, N) with integer categories in {0, ..., C}
R = np.array([
    [0, 1, 2, 2, 1],   # Item 1, N=5 trials
    [1, 1, 0, 2, 2],   # Item 2, N=5 trials
])

# Rubric weights w: length C+1. Here: 0=incorrect, 1=partial(0.5), 2=correct(1.0)
w = np.array([0.0, 0.5, 1.0])

# Optional prior outcomes R0: shape (M, D). If omitted, D=0.
R0 = np.array([
    [0, 2],
    [1, 2],
])

# With prior (D=2 → T=10)
mu, sigma = bayes(R, w, R0)
print(mu, sigma)      # expected ~ (0.575, 0.084275)

# Without prior (D=0 → T=8)
mu2, sigma2 = bayes(R, w)
print(mu2, sigma2)    # expected ~ (0.5625, 0.091998)
```
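
To rank several models, one natural workflow is to compute `mu` per model and convert the scores to ranks with `competition_ranks_from_scores`. A minimal sketch, assuming the hypothetical model names and outcome matrices below and that higher `mu` ranks first:

```python
import numpy as np
from bayes.eval import bayes
from bayes.utils import competition_ranks_from_scores

w = np.array([0.0, 0.5, 1.0])   # same three-category rubric as above

# Hypothetical outcome matrices, one per model, each of shape (M, N).
runs = {
    "model_a": np.array([[0, 1, 2, 2, 1], [1, 1, 0, 2, 2]]),
    "model_b": np.array([[0, 0, 1, 2, 1], [1, 0, 0, 2, 1]]),
}

scores = {name: bayes(R, w)[0] for name, R in runs.items()}   # posterior mu per model
ranks = competition_ranks_from_scores(list(scores.values()))  # 1 = best
for (name, mu), rank in zip(scores.items(), ranks):
    print(f"{name}: mu={mu:.3f}  rank={rank}")
```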


## Citing

If you use bayes-eval-kit or Bayes@N, please cite:

```
@article{hariri2025dontpassk,
  title   = {Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation},
  author  = {Hariri, Mohsen and Samandar, Amirhossein and Hinczewski, Michael and Chaudhary, Vipin},
  journal = {arXiv preprint arXiv:2504.11651},
  year    = {2025},
  url     = {https://mohsenhariri.github.io/bayes-kit/}
}
```


## License

MIT License. See the `LICENSE` file for details.


## Support

- Documentation and updates: https://mohsenhariri.github.io/bayes-kit/
- Issues and feature requests: https://github.com/mohsenhariri/bayes-kit/issues

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "bayes-eval-kit",
    "maintainer": null,
    "docs_url": null,
    "requires_python": "<3.14,>=3.9",
    "maintainer_email": null,
    "keywords": "bayesian, statistics, ranking, evaluation, machine learning, large language models",
    "author": null,
    "author_email": "Mohsen Hariri <mohsen.hariri@case.edu>, Amirhossein Samandar <amirhossein.samandar@case.edu>",
    "download_url": "https://files.pythonhosted.org/packages/a1/99/3ddbecde9ac175bc4a9c70c3020fe01e0e8af8d84c9f9d16021b53c85f55/bayes_eval_kit-0.0.1.tar.gz",
    "platform": null,
    "description": "<div align=\"center\">\n\n# bayes-eval-kit\n\n`bayes-eval-kit` implements the Bayes@N framework introduced in [\"Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation\"](https://arxiv.org/abs/2504.11651)\n\n[![arXiv](https://img.shields.io/badge/arXiv-2504.11651-b31b1b.svg)](https://arxiv.org/abs/2504.11651)\n[![PyPI version](https://img.shields.io/pypi/v/bayes-eval-kit.svg)](https://pypi.org/project/bayes-eval-kit/)\n[![Python versions](https://img.shields.io/pypi/pyversions/bayes-eval-kit.svg)](https://pypi.org/project/bayes-eval-kit/)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](#license)\n\n</div>\n\n---\n\n## Installation\n\n```bash\npip install bayes-eval-kit\n```\n\nRequires Python 3.9\u20133.13 and NumPy.\n\n## Data and shape conventions\n\n- Categories: encode outcomes per trial as integers in `{0, ..., C}`.\n- Weights: choose rubric weights `w` of length `C+1` (e.g., `[0, 1]` for binary R).\n- Shapes: `R` is `M x N`, `R0` is `M x D` (if provided); both must share the same `M` and category set.\n\n## APIs\n\n- `bayes.eval.bayes(R, w, R0=None) -> (mu: float, sigma: float)`\n  - `R`: `M x N` int array with entries in `{0, ..., C}`\n  - `w`: length `C+1` float array of rubric weights\n  - `R0` (optional): `M x D` int array of prior outcomes (same category set as `R`)\n  - Returns posterior estimate `mu` of the rubric-weighted performance and its uncertainty `sigma`.\n\n- `bayes.eval.avg(R) -> float`\n  - Returns the naive mean of elements in `R`. For binary accuracy, encode incorrect=0, correct=1.\n\n- `bayes.utils.competition_ranks_from_scores(scores, tol=1e-12) -> list[int]`\n  - Convert scores to competition ranks (1,2,3,3,5,\u2026) with tie handling.\n\n\n## How to use\n\n```python\n\n    import numpy as np\n    from bayes.eval import bayes\n\n    # Outcomes R: shape (M, N) with integer categories in {0, ..., C}\n    R = np.array([\n        [0, 1, 2, 2, 1],   # Item 1, N=5 trials\n        [1, 1, 0, 2, 2],   # Item 2, N=5 trials\n    ])\n\n    # Rubric weights w: length C+1. Here: 0=incorrect, 1=partial(0.5), 2=correct(1.0)\n    w = np.array([0.0, 0.5, 1.0])\n\n    # Optional prior outcomes R0: shape (M, D). If omitted, D=0.\n    R0 = np.array([\n        [0, 2],\n        [1, 2],\n    ])\n\n    # With prior (D=2 \u2192 T=10)\n    mu, sigma = bayes(R, w, R0)\n    print(mu, sigma)      # expected ~ (0.575, 0.084275)\n\n    # Without prior (D=0 \u2192 T=8)\n    mu2, sigma2 = bayes(R, w)\n    print(mu2, sigma2)    # expected ~ (0.5625, 0.091998)\n\n```\n\n\n## Citing\n\nIf you use bayes-eval-kit or Bayes@N, please cite:\n\n```\n@article{hariri2025dontpassk,\n  title   = {Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation},\n  author  = {Hariri, Mohsen and Samandar, Amirhossein and Hinczewski, Michael and Chaudhary, Vipin},\n  journal={arXiv preprint arXiv:2504.11651},\n  year    = {2025},\n  url     = {https://mohsenhariri.github.io/bayes-kit/}\n}\n```\n\n\n## License\n\nMIT License. See the `LICENSE` file for details.\n\n\n## Support\n\n- Documentation and updates: https://mohsenhariri.github.io/bayes-kit/\n- Issues and feature requests: https://github.com/mohsenhariri/bayes-kit/issues\n",
    "bugtrack_url": null,
    "license": null,
    "summary": "Bayesian evaluation and ranking toolkit",
    "version": "0.0.1",
    "project_urls": {
        "Documentation": "https://mohsenhariri.github.io/bayes-kit/",
        "Homepage": "https://mohsenhariri.github.io/bayes-kit/",
        "Issues": "https://github.com/mohsenhariri/bayes-kit/issues",
        "Repository": "https://github.com/mohsenhariri/bayes-kit"
    },
    "split_keywords": [
        "bayesian",
        " statistics",
        " ranking",
        " evaluation",
        " machine learning",
        " large language models"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "7871499750cfa8efb1b3a57b42c766118741fa1a30d3732d98be244af0ae2455",
                "md5": "bac40b3e1f606ace9448a09ce408d0fc",
                "sha256": "8f1e9dc646f27df5c3555a7fea5d604ea2d609c221fe0e8987d4ba8b06f5d973"
            },
            "downloads": -1,
            "filename": "bayes_eval_kit-0.0.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "bac40b3e1f606ace9448a09ce408d0fc",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<3.14,>=3.9",
            "size": 6586,
            "upload_time": "2025-10-06T04:23:34",
            "upload_time_iso_8601": "2025-10-06T04:23:34.361819Z",
            "url": "https://files.pythonhosted.org/packages/78/71/499750cfa8efb1b3a57b42c766118741fa1a30d3732d98be244af0ae2455/bayes_eval_kit-0.0.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "a1993ddbecde9ac175bc4a9c70c3020fe01e0e8af8d84c9f9d16021b53c85f55",
                "md5": "0b435ba8de5a05889fb015b011a9c9e4",
                "sha256": "ba1c780c2939411b2abddb6c30a3e26ccbe3ea97126ab93fb30f6c44ec1b6187"
            },
            "downloads": -1,
            "filename": "bayes_eval_kit-0.0.1.tar.gz",
            "has_sig": false,
            "md5_digest": "0b435ba8de5a05889fb015b011a9c9e4",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<3.14,>=3.9",
            "size": 6029,
            "upload_time": "2025-10-06T04:23:36",
            "upload_time_iso_8601": "2025-10-06T04:23:36.296354Z",
            "url": "https://files.pythonhosted.org/packages/a1/99/3ddbecde9ac175bc4a9c70c3020fe01e0e8af8d84c9f9d16021b53c85f55/bayes_eval_kit-0.0.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-10-06 04:23:36",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "mohsenhariri",
    "github_project": "bayes-kit",
    "github_not_found": true,
    "lcname": "bayes-eval-kit"
}
        
Elapsed time: 1.51662s