llama-index-packs-evaluator-benchmarker


Name: llama-index-packs-evaluator-benchmarker
Version: 0.2.0
Home page: None
Summary: llama-index packs evaluator_benchmarker integration
Upload time: 2024-08-22 16:51:01
Maintainer: nerdai
Docs URL: None
Author: Your Name
Requires Python: <4.0,>=3.8.1
License: MIT
Keywords: None
# Evaluator Benchmarker Pack

A pack for quickly computing benchmark results for your own LLM evaluator
on an evaluation llama-dataset. Specifically, this pack supports benchmarking
an appropriate evaluator on the following llama-datasets:

- `LabelledEvaluatorDataset` for single-grading evaluations
- `LabelledPairwiseEvaluatorDataset` for pairwise-grading evaluations

These llama-datasets can be downloaded from [llama-hub](https://llamahub.ai).

## CLI Usage

You can download llama-packs directly using `llamaindex-cli`, which comes installed with the `llama-index` Python package:

```bash
llamaindex-cli download-llamapack EvaluatorBenchmarkerPack --download-dir ./evaluator_benchmarker_pack
```

You can then inspect the files at `./evaluator_benchmarker_pack` and use them as a template for your own project!
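Since the pack is also published on PyPI (as `llama-index-packs-evaluator-benchmarker`), installing it as a regular package is another option if you do not need to modify the source:

```bash
pip install llama-index-packs-evaluator-benchmarker
```

With a pip install, the pack class would typically be imported from `llama_index.packs.evaluator_benchmarker` rather than fetched with `download_llama_pack`; note that the original README only documents the download-based flows below.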

## Code Usage

You can also download the pack to the `./evaluator_benchmarker_pack` directory through Python
code. The sample script below demonstrates how to construct `EvaluatorBenchmarkerPack`
using a `LabelledPairwiseEvaluatorDataset` downloaded from `llama-hub` and a
`PairwiseComparisonEvaluator` that uses GPT-4 as the LLM. Note that this pack
can also be used on a `LabelledEvaluatorDataset` with a `BaseEvaluator` that performs
single-grading evaluation; in that case the usage flow remains the same (a sketch of the
single-grading variant follows the example below).

```python
from llama_index.core.llama_dataset import download_llama_dataset
from llama_index.core.llama_pack import download_llama_pack
from llama_index.core.evaluation import PairwiseComparisonEvaluator
from llama_index.llms.openai import OpenAI
from llama_index.core import ServiceContext

# download a LabelledPairwiseEvaluatorDataset from llama-hub
pairwise_dataset = download_llama_dataset(
    "MiniMtBenchHumanJudgementDataset", "./data"
)

# define your evaluator
gpt_4_context = ServiceContext.from_defaults(
    llm=OpenAI(temperature=0, model="gpt-4"),
)
evaluator = PairwiseComparisonEvaluator(service_context=gpt_4_context)


# download and install dependencies
EvaluatorBenchmarkerPack = download_llama_pack(
    "EvaluatorBenchmarkerPack", "./pack"
)

# construction requires an evaluator and an eval_dataset
evaluator_benchmarker_pack = EvaluatorBenchmarkerPack(
    evaluator=evaluator,
    eval_dataset=pairwise_dataset,
    show_progress=True,
)

# PERFORM EVALUATION
benchmark_df = evaluator_benchmarker_pack.run()  # async arun() also supported
print(benchmark_df)
```
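For completeness, here is a minimal sketch of the single-grading flow mentioned above. The dataset name `MiniMtBenchSingleGradingDataset` and the choice of `CorrectnessEvaluator` are illustrative assumptions (the original README only shows the pairwise case); any `LabelledEvaluatorDataset` paired with a single-grading `BaseEvaluator` follows the same pattern.

```python
from llama_index.core.llama_dataset import download_llama_dataset
from llama_index.core.llama_pack import download_llama_pack
from llama_index.core.evaluation import CorrectnessEvaluator
from llama_index.llms.openai import OpenAI
from llama_index.core import ServiceContext

# download a single-grading llama-dataset (a LabelledEvaluatorDataset);
# "MiniMtBenchSingleGradingDataset" is assumed here -- check llama-hub for
# the datasets that are actually available
single_grading_dataset = download_llama_dataset(
    "MiniMtBenchSingleGradingDataset", "./data"
)

# define a single-grading evaluator; CorrectnessEvaluator is used here as
# one example of a BaseEvaluator subclass
gpt_4_context = ServiceContext.from_defaults(
    llm=OpenAI(temperature=0, model="gpt-4"),
)
evaluator = CorrectnessEvaluator(service_context=gpt_4_context)

# download and construct the pack exactly as in the pairwise example
EvaluatorBenchmarkerPack = download_llama_pack(
    "EvaluatorBenchmarkerPack", "./pack"
)
evaluator_benchmarker_pack = EvaluatorBenchmarkerPack(
    evaluator=evaluator,
    eval_dataset=single_grading_dataset,
    show_progress=True,
)

benchmark_df = evaluator_benchmarker_pack.run()
print(benchmark_df)
```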

`Output:`

```text
number_examples                1689
inconclusives                  140
ties                           379
agreement_rate_with_ties       0.657844
agreement_rate_without_ties    0.828205
```

Note that `evaluator_benchmarker_pack.run()` also saves the `benchmark_df` to a CSV file in the same directory:

```bash
.
└── benchmark.csv
```
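As noted in the example above, an async `arun()` is also supported. Below is a minimal sketch of the async path, reusing the `evaluator_benchmarker_pack` object constructed earlier; the wrapping `main()` coroutine and `asyncio.run` call are simply one way to drive it from a script.

```python
import asyncio


async def main():
    # same pack object as in the example above, but awaited
    benchmark_df = await evaluator_benchmarker_pack.arun()
    print(benchmark_df)


asyncio.run(main())
```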

            

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "llama-index-packs-evaluator-benchmarker",
    "maintainer": "nerdai",
    "docs_url": null,
    "requires_python": "<4.0,>=3.8.1",
    "maintainer_email": null,
    "keywords": null,
    "author": "Your Name",
    "author_email": "you@example.com",
    "download_url": "https://files.pythonhosted.org/packages/f0/a7/e8c559f8c6a9ab31da2549eb032984eb7e8e20afd7a54d6dc7aa22311725/llama_index_packs_evaluator_benchmarker-0.2.0.tar.gz",
    "platform": null,
    "description": "# Evaluator Benchmarker Pack\n\nA pack for quick computation of benchmark results of your own LLM evaluator\non an Evaluation llama-dataset. Specifically, this pack supports benchmarking\nan appropriate evaluator on the following llama-datasets:\n\n- `LabelledEvaluatorDataset` for single-grading evaluations\n- `LabelledPairwiseEvaluatorDataset` for pairwise-grading evaluations\n\nThese llama-datasets can be downloaed from [llama-hub](https://llamahub.ai).\n\n## CLI Usage\n\nYou can download llamapacks directly using `llamaindex-cli`, which comes installed with the `llama-index` python package:\n\n```bash\nllamaindex-cli download-llamapack EvaluatorBenchmarkerPack --download-dir ./evaluator_benchmarker_pack\n```\n\nYou can then inspect the files at `./evaluator_benchmarker_pack` and use them as a template for your own project!\n\n## Code Usage\n\nYou can download the pack to the `./evaluator_benchmarker_pack` directory through python\ncode as well. The sample script below demonstrates how to construct `EvaluatorBenchmarkerPack`\nusing a `LabelledPairwiseEvaluatorDataset` downloaded from `llama-hub` and a\n`PairwiseComparisonEvaluator` that uses GPT-4 as the LLM. Note though that this pack\ncan also be used on a `LabelledEvaluatorDataset` with a `BaseEvaluator` that performs\nsingle-grading evaluation \u2014 in this case, the usage flow remains the same.\n\n```python\nfrom llama_index.core.llama_dataset import download_llama_dataset\nfrom llama_index.core.llama_pack import download_llama_pack\nfrom llama_index.core.evaluation import PairwiseComparisonEvaluator\nfrom llama_index.llms.openai import OpenAI\nfrom llama_index.core import ServiceContext\n\n# download a LabelledRagDataset from llama-hub\npairwise_dataset = download_llama_dataset(\n    \"MiniMtBenchHumanJudgementDataset\", \"./data\"\n)\n\n# define your evaluator\ngpt_4_context = ServiceContext.from_defaults(\n    llm=OpenAI(temperature=0, model=\"gpt-4\"),\n)\nevaluator = PairwiseComparisonEvaluator(service_context=gpt_4_context)\n\n\n# download and install dependencies\nEvaluatorBenchmarkerPack = download_llama_pack(\n    \"EvaluatorBenchmarkerPack\", \"./pack\"\n)\n\n# construction requires an evaluator and an eval_dataset\nevaluator_benchmarker_pack = EvaluatorBenchmarkerPack(\n    evaluator=evaluator,\n    eval_dataset=pairwise_dataset,\n    show_progress=True,\n)\n\n# PERFORM EVALUATION\nbenchmark_df = evaluator_benchmarker_pack.run()  # async arun() also supported\nprint(benchmark_df)\n```\n\n`Output:`\n\n```text\nnumber_examples                1689\ninconclusives                  140\nties                           379\nagreement_rate_with_ties       0.657844\nagreement_rate_without_ties    0.828205\n```\n\nNote that `evaluator_benchmarker_pack.run()` will also save the `benchmark_df` files in the same directory.\n\n```bash\n.\n\u2514\u2500\u2500 benchmark.csv\n```\n",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "llama-index packs evaluator_benchmarker integration",
    "version": "0.2.0",
    "project_urls": null,
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "16641abd3dfffe7a51b1f984726467dea02c9bf9a9ec5980d0ed7960747e51c6",
                "md5": "299009163b7b11f1d0d7cb75a5e2c1aa",
                "sha256": "d2f5209742d3d3c02392461028dcc60aeffdd7ce7db27783d58882138de8cc06"
            },
            "downloads": -1,
            "filename": "llama_index_packs_evaluator_benchmarker-0.2.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "299009163b7b11f1d0d7cb75a5e2c1aa",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<4.0,>=3.8.1",
            "size": 4306,
            "upload_time": "2024-08-22T16:50:59",
            "upload_time_iso_8601": "2024-08-22T16:50:59.843721Z",
            "url": "https://files.pythonhosted.org/packages/16/64/1abd3dfffe7a51b1f984726467dea02c9bf9a9ec5980d0ed7960747e51c6/llama_index_packs_evaluator_benchmarker-0.2.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "f0a7e8c559f8c6a9ab31da2549eb032984eb7e8e20afd7a54d6dc7aa22311725",
                "md5": "58c07e9585c384c198aef38087becbf1",
                "sha256": "672940b62051256a2cb55155455df2b6da277f4aa44765f2cdba89d5ed095aec"
            },
            "downloads": -1,
            "filename": "llama_index_packs_evaluator_benchmarker-0.2.0.tar.gz",
            "has_sig": false,
            "md5_digest": "58c07e9585c384c198aef38087becbf1",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<4.0,>=3.8.1",
            "size": 3939,
            "upload_time": "2024-08-22T16:51:01",
            "upload_time_iso_8601": "2024-08-22T16:51:01.238164Z",
            "url": "https://files.pythonhosted.org/packages/f0/a7/e8c559f8c6a9ab31da2549eb032984eb7e8e20afd7a54d6dc7aa22311725/llama_index_packs_evaluator_benchmarker-0.2.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-08-22 16:51:01",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "llama-index-packs-evaluator-benchmarker"
}
        