# arize-phoenix-evals
Phoenix provides tooling to evaluate LLM applications, including tools to determine the relevance or irrelevance of documents retrieved by a retrieval-augmented generation (RAG) application, whether or not a response is toxic, and much more.
Phoenix's approach to LLM evals is notable for the following reasons:
- Includes pre-tested templates and convenience functions for a set of common Eval “tasks”
- Data science rigor applied to the testing of model and template combinations
- Designed to run as fast as possible on batches of data
- Includes benchmark datasets and tests for each eval function
## Installation
Install the `arize-phoenix-evals` sub-package via `pip`:
```shell
pip install arize-phoenix-evals
```
Note that you will also need to install the SDK of the LLM vendor you would like to use with LLM Evals. For example, to use OpenAI's GPT-4, install the OpenAI Python SDK:
```shell
pip install 'openai>=1.0.0'
```
## Usage
Here is an example of running the RAG relevance eval on a dataset of Wikipedia questions and answers:
```python
import os
from phoenix.evals import (
RAG_RELEVANCY_PROMPT_TEMPLATE,
RAG_RELEVANCY_PROMPT_RAILS_MAP,
OpenAIModel,
download_benchmark_dataset,
llm_classify,
)
from sklearn.metrics import precision_recall_fscore_support, confusion_matrix, ConfusionMatrixDisplay
os.environ["OPENAI_API_KEY"] = "<your-openai-key>"
# Download the benchmark golden dataset
df = download_benchmark_dataset(
task="binary-relevance-classification", dataset_name="wiki_qa-train"
)
# Sample and rename the columns to match the template
df = df.sample(100)
df = df.rename(
columns={
"query_text": "input",
"document_text": "reference",
},
)
model = OpenAIModel(
model="gpt-4",
temperature=0.0,
)
rails = list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values())
df[["eval_relevance"]] = llm_classify(df, model, RAG_RELEVANCY_PROMPT_TEMPLATE, rails)
# The golden dataset's True/False labels map to "relevant"/"irrelevant",
# the same format the template outputs, so scikit-learn can compare them directly
y_true = df["relevant"].map({True: "relevant", False: "irrelevant"})
y_pred = df["eval_relevance"]
# Compute Per-Class Precision, Recall, F1 Score, Support
precision, recall, f1, support = precision_recall_fscore_support(y_true, y_pred)
```
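The imports above also pull in `confusion_matrix` and `ConfusionMatrixDisplay`, which can summarize exactly where the eval agrees or disagrees with the golden labels. A minimal sketch, using hypothetical label lists standing in for `df["relevant"]` and `df["eval_relevance"]`:

```python
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

# Hypothetical golden labels and eval outputs, in the template's output format
y_true = ["relevant", "irrelevant", "relevant", "relevant", "irrelevant"]
y_pred = ["relevant", "irrelevant", "irrelevant", "relevant", "irrelevant"]

# Fix the label order so rows/columns and per-class scores line up
labels = ["relevant", "irrelevant"]

# Rows are true labels, columns are predicted labels
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Per-class precision/recall/F1 in the same label order
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=labels, zero_division=0
)
```

If matplotlib is available, `ConfusionMatrixDisplay(cm, display_labels=labels).plot()` renders the matrix as a figure.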
To learn more about LLM Evals, see the [LLM Evals documentation](https://docs.arize.com/phoenix/concepts/llm-evals/).