# Bring Your Own Data! Self-Supervised Evaluation for Large Language Models
The official code for Bring Your Own Data! Self-Supervised Evaluation for Large Language Models.
If you have any questions, feel free to email <njain17@umd.edu>.
<img src="images/Teaser.png">
## About
To complement conventional evaluation, we propose a framework for _self-supervised model evaluation_. In this framework, metrics are defined as invariances and sensitivities that can be checked in a self-supervised fashion using interventions based only on the model in question rather than external labels. Self-supervised evaluation pipelines are _dataset-agnostic_, so they can be applied to larger corpora of evaluation data than conventional metrics, or even directly in production systems to monitor day-to-day performance. In this work, we develop this framework, discuss desiderata for such metrics, and provide a number of case studies for self-supervised metrics: knowledge capability, toxicity detection, long-range (context), word-order, and tokenization sensitivities. By developing these new metrics, we hope to provide a more comprehensive and nuanced understanding of the strengths and limitations of LLMs.
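To make the idea of a label-free sensitivity check concrete, here is a minimal sketch. It is not the implementation used in this package: `toy_score`, `shuffle_words`, and `sensitivity` are hypothetical names chosen for illustration, and `toy_score` is a stand-in for a real model quantity such as perplexity. The sketch only shows the general pattern of scoring a text, applying an intervention, and measuring how much the score changes.

```python
import random
from statistics import mean
from typing import Callable, List


def shuffle_words(text: str, rng: random.Random) -> str:
    """Intervention (illustrative): permute the word order of `text`."""
    words = text.split()
    rng.shuffle(words)
    return " ".join(words)


def sensitivity(score: Callable[[str], float],
                intervene: Callable[[str], str],
                corpus: List[str]) -> float:
    """Mean absolute change in `score` when `intervene` is applied.

    No external labels are needed: every text is compared only
    against its own perturbed version.
    """
    return mean(abs(score(t) - score(intervene(t))) for t in corpus)


def toy_score(text: str) -> float:
    """Hypothetical stand-in for a model output such as perplexity.

    Counts adjacent word pairs that are in alphabetical order,
    so the value depends on word order.
    """
    words = text.lower().split()
    return float(sum(a <= b for a, b in zip(words, words[1:])))


if __name__ == "__main__":
    rng = random.Random(0)
    corpus = ["a b c d e f", "the quick brown fox jumps"]
    # A score that depends on word order reacts to shuffling ...
    print(sensitivity(toy_score, lambda t: shuffle_words(t, rng), corpus))
    # ... while a pure word count is invariant to it.
    print(sensitivity(lambda t: float(len(t.split())),
                      lambda t: shuffle_words(t, rng), corpus))
```

With a real language model, `toy_score` would be replaced by a quantity computed from the model itself, and the intervention would be chosen to match the property being probed (for example negation, word order, or tokenization).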
## Installation
Install the package from PyPI with `pip install byod`, or directly from source with `pip install git+https://github.com/neelsjain/BYOD/`.
## Dependencies
* transformers==4.28.1
* scipy==1.10.1
* torch==2.0.0
* datasets==2.11.0
* nltk==3.8.1
* apache_beam==2.48.0
Python 3.9 or higher is required.
## Usage
See `run_model.sh` for examples of how to evaluate a model. As an example, we provide scripts that run Hugging Face models against metrics computed on Wikipedia data. These are named `run_[metric].py`.
Note that only Hugging Face models are currently supported.
You can also use the metrics directly, given your own `model`, `tokenizer`, and dataset (`data` below), like so:
```
import BYOD
long_range_sensitivity = BYOD.lrs_metric(model, data, tokenizer)
negation_knowledge = BYOD.negation_metric(model, data, tokenizer)
tokenization_robustness = BYOD.tokenization_metric(model, data, tokenizer)
toxicity_proxy = BYOD.toxicity_metric(model, data, tokenizer)
word_order_sensitivity = BYOD.word_order_metric(model, data, tokenizer)
```
## Suggestions and Pull Requests are welcome!
Everything can be better! If you have suggestions for improving the codebase or the invariance/sensitivity tests, feel free to reach out!
Raw data
{
"_id": null,
"home_page": "https://github.com/neelsjain/BYOD",
"name": "BYOD",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": "",
"keywords": "todo",
"author": "Neel Jain, Khalid Saifullah, Jonas Geiping",
"author_email": "njain17@umd.edu",
"download_url": "https://files.pythonhosted.org/packages/14/8e/173e5c1e7ffe71950be6a626f4a2e489dcb1ae5859c66b6a62a75c5e6b9b/BYOD-0.3.0.tar.gz",
"platform": "any",
"bugtrack_url": null,
"license": "MIT",
"summary": "Bring Your Own Data! Self-Supervised Evaluation for Large Language Models",
"version": "0.3.0",
"project_urls": {
"Homepage": "https://github.com/neelsjain/BYOD"
},
"split_keywords": [
"todo"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "fad9b67797f45b074abdd712c856876369429403f0dbff07d972c6aa133a01ad",
"md5": "1b8f755fbd0dad333a631effbae9bb99",
"sha256": "a6b088133439c584633971594bb83211f0a90743546009f6c61bc986d6f51a6e"
},
"downloads": -1,
"filename": "BYOD-0.3.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "1b8f755fbd0dad333a631effbae9bb99",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9",
"size": 22101,
"upload_time": "2023-06-23T01:57:18",
"upload_time_iso_8601": "2023-06-23T01:57:18.159466Z",
"url": "https://files.pythonhosted.org/packages/fa/d9/b67797f45b074abdd712c856876369429403f0dbff07d972c6aa133a01ad/BYOD-0.3.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "148e173e5c1e7ffe71950be6a626f4a2e489dcb1ae5859c66b6a62a75c5e6b9b",
"md5": "ee30b3b347da95febe49cd4bc075459a",
"sha256": "d1eb6f2970e3cc5042e1a0edfdb596478112f843a62bc320ae297bac12a2a076"
},
"downloads": -1,
"filename": "BYOD-0.3.0.tar.gz",
"has_sig": false,
"md5_digest": "ee30b3b347da95febe49cd4bc075459a",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9",
"size": 264068,
"upload_time": "2023-06-23T01:57:19",
"upload_time_iso_8601": "2023-06-23T01:57:19.996711Z",
"url": "https://files.pythonhosted.org/packages/14/8e/173e5c1e7ffe71950be6a626f4a2e489dcb1ae5859c66b6a62a75c5e6b9b/BYOD-0.3.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-06-23 01:57:19",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "neelsjain",
"github_project": "BYOD",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"requirements": [],
"lcname": "byod"
}