llama-index-packs-evaluator-benchmarker

Name: llama-index-packs-evaluator-benchmarker
Version: 0.3.0
Summary: llama-index packs evaluator_benchmarker integration
Upload time: 2024-11-17 22:42:15
Maintainer: nerdai
Author: Your Name (you@example.com)
Requires Python: <4.0,>=3.9
License: MIT
Requirements: none recorded

# Evaluator Benchmarker Pack

A pack for quickly computing benchmark results for your own LLM evaluator
on an evaluation llama-dataset. Specifically, this pack supports benchmarking
an appropriate evaluator on the following llama-datasets:

- `LabelledEvaluatorDataset` for single-grading evaluations
- `LabelledPairwiseEvaluatorDataset` for pairwise-grading evaluations

These llama-datasets can be downloaded from [llama-hub](https://llamahub.ai).
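
For reference, here is a minimal sketch of pulling one dataset of each kind from llama-hub with `download_llama_dataset`. The pairwise dataset name comes from the full example below; the single-grading dataset name (`MiniMtBenchSingleGradingDataset`) is an assumption, so check llama-hub for the datasets actually available.

```python
from llama_index.core.llama_dataset import download_llama_dataset

# pairwise-grading dataset (a LabelledPairwiseEvaluatorDataset) -- also used
# in the full example further down
pairwise_dataset = download_llama_dataset(
    "MiniMtBenchHumanJudgementDataset", "./data"
)

# single-grading dataset (a LabelledEvaluatorDataset); the name here is an
# assumption -- browse llama-hub for the datasets actually published
single_grading_dataset = download_llama_dataset(
    "MiniMtBenchSingleGradingDataset", "./data"
)
```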

## CLI Usage

You can download llama-packs directly using `llamaindex-cli`, which comes installed with the `llama-index` Python package:

```bash
llamaindex-cli download-llamapack EvaluatorBenchmarkerPack --download-dir ./evaluator_benchmarker_pack
```

You can then inspect the files at `./evaluator_benchmarker_pack` and use them as a template for your own project!

## Code Usage

You can also download the pack to the `./evaluator_benchmarker_pack` directory through Python
code. The sample script below demonstrates how to construct `EvaluatorBenchmarkerPack`
using a `LabelledPairwiseEvaluatorDataset` downloaded from `llama-hub` and a
`PairwiseComparisonEvaluator` that uses GPT-4 as the LLM. Note that this pack
can also be used on a `LabelledEvaluatorDataset` with a `BaseEvaluator` that performs
single-grading evaluation; in that case, the usage flow remains the same (a minimal
single-grading sketch appears at the end of this README).

```python
from llama_index.core.llama_dataset import download_llama_dataset
from llama_index.core.llama_pack import download_llama_pack
from llama_index.core.evaluation import PairwiseComparisonEvaluator
from llama_index.llms.openai import OpenAI
from llama_index.core import ServiceContext

# download a LabelledPairwiseEvaluatorDataset from llama-hub
pairwise_dataset = download_llama_dataset(
    "MiniMtBenchHumanJudgementDataset", "./data"
)

# define your evaluator
gpt_4_context = ServiceContext.from_defaults(
    llm=OpenAI(temperature=0, model="gpt-4"),
)
evaluator = PairwiseComparisonEvaluator(service_context=gpt_4_context)


# download and install dependencies
EvaluatorBenchmarkerPack = download_llama_pack(
    "EvaluatorBenchmarkerPack", "./pack"
)

# construction requires an evaluator and an eval_dataset
evaluator_benchmarker_pack = EvaluatorBenchmarkerPack(
    evaluator=evaluator,
    eval_dataset=pairwise_dataset,
    show_progress=True,
)

# PERFORM EVALUATION
benchmark_df = evaluator_benchmarker_pack.run()  # async arun() also supported
print(benchmark_df)
```

`Output:`

```text
number_examples                1689
inconclusives                  140
ties                           379
agreement_rate_with_ties       0.657844
agreement_rate_without_ties    0.828205
```

Note that `evaluator_benchmarker_pack.run()` will also save the `benchmark_df` results to a `benchmark.csv` file in the same directory.

```bash
.
└── benchmark.csv
```
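
As noted above, the same flow applies to single-grading evaluation. Below is a minimal sketch of that variant; it assumes a `LabelledEvaluatorDataset` named `MiniMtBenchSingleGradingDataset` is available on llama-hub and uses `CorrectnessEvaluator` as an example single-grading `BaseEvaluator`, configured the same way as the pairwise evaluator above. Treat both names as placeholders for your own dataset and evaluator.

```python
from llama_index.core.llama_dataset import download_llama_dataset
from llama_index.core.llama_pack import download_llama_pack
from llama_index.core.evaluation import CorrectnessEvaluator
from llama_index.llms.openai import OpenAI
from llama_index.core import ServiceContext

# download a single-grading LabelledEvaluatorDataset from llama-hub
# (the dataset name is an assumption -- substitute your own)
single_grading_dataset = download_llama_dataset(
    "MiniMtBenchSingleGradingDataset", "./data"
)

# a single-grading BaseEvaluator, configured like the pairwise example above
gpt_4_context = ServiceContext.from_defaults(
    llm=OpenAI(temperature=0, model="gpt-4"),
)
evaluator = CorrectnessEvaluator(service_context=gpt_4_context)

# download the pack and benchmark exactly as in the pairwise example
EvaluatorBenchmarkerPack = download_llama_pack(
    "EvaluatorBenchmarkerPack", "./pack"
)
evaluator_benchmarker_pack = EvaluatorBenchmarkerPack(
    evaluator=evaluator,
    eval_dataset=single_grading_dataset,
    show_progress=True,
)

benchmark_df = evaluator_benchmarker_pack.run()  # async arun() also supported
print(benchmark_df)
```

Since this is a single-grading benchmark, expect the columns of the returned `benchmark_df` to differ from the pairwise agreement rates shown above.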

            
