arize-phoenix-evals


- Name: arize-phoenix-evals
- Version: 0.26.1
- home_page: None
- Summary: LLM Evaluations
- upload_time: 2025-08-01 22:19:18
- maintainer: None
- docs_url: None
- author: None
- requires_python: <3.14,>=3.8
- license: Elastic-2.0
- keywords: explainability, monitoring, observability
- requirements: No requirements were recorded.
- Travis-CI: No Travis.
- coveralls test coverage: No coveralls.

# arize-phoenix-evals

Phoenix provides tooling to evaluate LLM applications, including tools to determine whether documents retrieved by a retrieval-augmented generation (RAG) application are relevant, whether a response is toxic, and much more.

Phoenix's approach to LLM evals is notable for the following reasons:

- It includes pre-tested templates and convenience functions for a set of common eval "tasks"
- It applies data science rigor to the testing of model and template combinations
- It is designed to run as fast as possible on batches of data
- It includes benchmark datasets and tests for each eval function

## Installation

Install the arize-phoenix-evals sub-package via `pip`:

```shell
pip install arize-phoenix-evals
```

You also need to install the SDK of the LLM vendor you want to use with LLM Evals. For example, to use OpenAI's GPT-4, install the OpenAI Python SDK:

```shell
pip install 'openai>=1.0.0'
```
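
To confirm that both installs succeeded, a small standard-library check can report the installed versions. This is a minimal sketch: the helper name `installed_versions` is hypothetical and not part of Phoenix, and it relies only on `importlib.metadata`, which is available from Python 3.8 onward.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Iterable, Optional


def installed_versions(dist_names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent.

    Hypothetical helper for illustration; not part of arize-phoenix-evals.
    """
    found: Dict[str, Optional[str]] = {}
    for name in dist_names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The distribution is not installed in this environment.
            found[name] = None
    return found


# Report the two distributions named in the installation steps above.
for dist, ver in installed_versions(["arize-phoenix-evals", "openai"]).items():
    print(f"{dist}: {ver or 'not installed'}")
```

A `not installed` line means the corresponding `pip install` step still needs to be run in the active environment.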

## Usage

The example below runs the RAG relevance eval on a dataset of Wikipedia questions and answers. It uses scikit-learn to score the results, so install that first via `pip`:

```shell
pip install scikit-learn
```

```python
import os
from phoenix.evals import (
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
    OpenAIModel,
    download_benchmark_dataset,
    llm_classify,
)
from sklearn.metrics import precision_recall_fscore_support

os.environ["OPENAI_API_KEY"] = "<your-openai-key>"

# Choose a model to evaluate on question-answering relevancy classification
model = OpenAIModel(
    model="o3-mini",
    temperature=0.0,
)

# Choose 100 examples from a small dataset of question-answer pairs
df = download_benchmark_dataset(
    task="binary-relevance-classification", dataset_name="wiki_qa-train"
)
df = df.sample(100)
df = df.rename(
    columns={
        "query_text": "input",
        "document_text": "reference",
    },
)

# Use the language model to classify each example in the dataset
rails_map = RAG_RELEVANCY_PROMPT_RAILS_MAP
class_names = list(rails_map.values())
result_df = llm_classify(df, model, RAG_RELEVANCY_PROMPT_TEMPLATE, class_names)

# Map the true labels to the class names for comparison
y_true = df["relevant"].map(rails_map)
# Get the labels generated by the model being evaluated
y_pred = result_df["label"]

# Evaluate the classification results of the model
precision, recall, f1, support = precision_recall_fscore_support(y_true, y_pred, labels=class_names)
print("Classification Results:")
for idx, label in enumerate(class_names):
    print(f"Class: {label} (count: {support[idx]})")
    print(f"  Precision: {precision[idx]:.2f}")
    print(f"  Recall:    {recall[idx]:.2f}")
    print(f"  F1 Score:  {f1[idx]:.2f}\n")
```
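
As a dependency-free illustration of what the per-class numbers printed above mean, the sketch below recomputes precision, recall, F1, and support by hand on a toy label set. It is not scikit-learn's or Phoenix's implementation: the helper `per_class_scores`, the toy data, and the label strings in `toy_rails_map` are assumptions made for illustration only.

```python
from typing import Dict, Sequence, Tuple


def per_class_scores(
    y_true: Sequence[str], y_pred: Sequence[str], labels: Sequence[str]
) -> Dict[str, Tuple[float, float, float, int]]:
    """Return (precision, recall, f1, support) for each label.

    Hypothetical helper; undefined ratios are reported as 0.0.
    """
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)  # correct hits
        fp = sum(t != label and p == label for t, p in pairs)  # false alarms
        fn = sum(t == label and p != label for t, p in pairs)  # misses
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (
            2 * precision * recall / (precision + recall)
            if precision + recall
            else 0.0
        )
        scores[label] = (precision, recall, f1, tp + fn)
    return scores


# Assumed label strings; the real values come from the rails map in the example.
toy_rails_map = {True: "relevant", False: "unrelated"}

# Boolean ground truth is mapped to label strings, as in the example above.
truth_flags = [True, True, False, False, True]
y_true = [toy_rails_map[flag] for flag in truth_flags]
y_pred = ["relevant", "unrelated", "unrelated", "relevant", "relevant"]

scores = per_class_scores(y_true, y_pred, list(toy_rails_map.values()))
for label, (precision, recall, f1, support) in scores.items():
    print(f"{label}: P={precision:.2f} R={recall:.2f} F1={f1:.2f} n={support}")
```

On this toy data the "relevant" class has two true positives, one false positive, and one false negative, so its precision, recall, and F1 all come out to 2/3 with a support of 3.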

To learn more about LLM Evals, see the [LLM Evals documentation](https://arize.com/docs/phoenix/concepts/llm-evals/).

            

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "arize-phoenix-evals",
    "maintainer": null,
    "docs_url": null,
    "requires_python": "<3.14,>=3.8",
    "maintainer_email": null,
    "keywords": "Explainability, Monitoring, Observability",
    "author": null,
    "author_email": "Arize AI <phoenix-devs@arize.com>",
    "download_url": "https://files.pythonhosted.org/packages/13/22/57f44effefcda036e01f9ba0921885bea3dc16fb5be3a409a664b3697215/arize_phoenix_evals-0.26.1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "Elastic-2.0",
    "summary": "LLM Evaluations",
    "version": "0.26.1",
    "project_urls": {
        "Documentation": "https://arize.com/docs/phoenix/",
        "Issues": "https://github.com/Arize-ai/phoenix/issues",
        "Source": "https://github.com/Arize-ai/phoenix"
    },
    "split_keywords": [
        "explainability",
        " monitoring",
        " observability"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "46d4aec1f526c3352967619b09a6fd92a239bb4273b36abeadf6799923d9ca42",
                "md5": "f655192370287af0c0b84cf630752e75",
                "sha256": "ff3dc9138d0bf03873cbfc772b0a2c16584f5ba5ca5a34339e5198da887ec0a7"
            },
            "downloads": -1,
            "filename": "arize_phoenix_evals-0.26.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "f655192370287af0c0b84cf630752e75",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<3.14,>=3.8",
            "size": 93545,
            "upload_time": "2025-08-01T22:19:17",
            "upload_time_iso_8601": "2025-08-01T22:19:17.742649Z",
            "url": "https://files.pythonhosted.org/packages/46/d4/aec1f526c3352967619b09a6fd92a239bb4273b36abeadf6799923d9ca42/arize_phoenix_evals-0.26.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "132257f44effefcda036e01f9ba0921885bea3dc16fb5be3a409a664b3697215",
                "md5": "8b28fce9fbed9cc1ed3ba332916370a2",
                "sha256": "7dcf9e3e51ed744196f29a473ced2bbaadd61bdcaa10a1c4abeedafc807f1067"
            },
            "downloads": -1,
            "filename": "arize_phoenix_evals-0.26.1.tar.gz",
            "has_sig": false,
            "md5_digest": "8b28fce9fbed9cc1ed3ba332916370a2",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<3.14,>=3.8",
            "size": 68247,
            "upload_time": "2025-08-01T22:19:18",
            "upload_time_iso_8601": "2025-08-01T22:19:18.678437Z",
            "url": "https://files.pythonhosted.org/packages/13/22/57f44effefcda036e01f9ba0921885bea3dc16fb5be3a409a664b3697215/arize_phoenix_evals-0.26.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-08-01 22:19:18",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "Arize-ai",
    "github_project": "phoenix",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "tox": true,
    "lcname": "arize-phoenix-evals"
}
        