azure-ai-evaluation

Name	azure-ai-evaluation JSON
Version	1.10.0 JSON
	download
home_page	https://github.com/Azure/azure-sdk-for-python
Summary	Microsoft Azure Evaluation Library for Python
upload_time	2025-07-31 23:46:31
maintainer	None
docs_url	None
author	Microsoft Corporation
requires_python	>=3.9
license	MIT License
keywords	azure azure sdk
VCS
bugtrack_url
requirements	No requirements were recorded.
Travis-CI	No Travis.
coveralls test coverage

            # Azure AI Evaluation client library for Python

Use Azure AI Evaluation SDK to assess the performance of your generative AI applications. Generative AI application generations are quantitatively measured with mathematical based metrics, AI-assisted quality and safety metrics. Metrics are defined as `evaluators`. Built-in or custom evaluators can provide comprehensive insights into the application's capabilities and limitations.

Use Azure AI Evaluation SDK to:
- Evaluate existing data from generative AI applications
- Evaluate generative AI applications
- Evaluate by generating mathematical, AI-assisted quality and safety metrics

Azure AI SDK provides following to evaluate Generative AI Applications:
- [Evaluators][evaluators] - Generate scores individually or when used together with `evaluate` API.
- [Evaluate API][evaluate_api] - Python API to evaluate dataset or application using built-in or custom evaluators.

[Source code][source_code]
| [Package (PyPI)][evaluation_pypi]
| [API reference documentation][evaluation_ref_docs]
| [Product documentation][product_documentation]
| [Samples][evaluation_samples]


## Getting started

### Prerequisites

- Python 3.9 or later is required to use this package.
- [Optional] You must have [Azure AI Foundry Project][ai_project] or [Azure Open AI][azure_openai] to use AI-assisted evaluators

### Install the package

Install the Azure AI Evaluation SDK for Python with [pip][pip_link]:

```bash
pip install azure-ai-evaluation
```

## Key concepts

### Evaluators

Evaluators are custom or prebuilt classes or functions that are designed to measure the quality of the outputs from language models or generative AI applications.

#### Built-in evaluators

Built-in evaluators are out of box evaluators provided by Microsoft:
| Category  | Evaluator class                                                                                                                    |
|-----------|------------------------------------------------------------------------------------------------------------------------------------|
| [Performance and quality][performance_and_quality_evaluators] (AI-assisted)  | `GroundednessEvaluator`, `RelevanceEvaluator`, `CoherenceEvaluator`, `FluencyEvaluator`, `SimilarityEvaluator`, `RetrievalEvaluator` |
| [Performance and quality][performance_and_quality_evaluators] (NLP)  | `F1ScoreEvaluator`, `RougeScoreEvaluator`, `GleuScoreEvaluator`, `BleuScoreEvaluator`, `MeteorScoreEvaluator`|
| [Risk and safety][risk_and_safety_evaluators] (AI-assisted)    | `ViolenceEvaluator`, `SexualEvaluator`, `SelfHarmEvaluator`, `HateUnfairnessEvaluator`, `IndirectAttackEvaluator`, `ProtectedMaterialEvaluator`                                             |
| [Composite][composite_evaluators] | `QAEvaluator`, `ContentSafetyEvaluator`                                             |

For more in-depth information on each evaluator definition and how it's calculated, see [Evaluation and monitoring metrics for generative AI][evaluation_metrics].

```python
import os

from azure.ai.evaluation import evaluate, RelevanceEvaluator, ViolenceEvaluator, BleuScoreEvaluator

# NLP bleu score evaluator
bleu_score_evaluator = BleuScoreEvaluator()
result = bleu_score(
    response="Tokyo is the capital of Japan.",
    ground_truth="The capital of Japan is Tokyo."
)

# AI assisted quality evaluator
model_config = {
    "azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT"),
    "api_key": os.environ.get("AZURE_OPENAI_API_KEY"),
    "azure_deployment": os.environ.get("AZURE_OPENAI_DEPLOYMENT"),
}

relevance_evaluator = RelevanceEvaluator(model_config)
result = relevance_evaluator(
    query="What is the capital of Japan?",
    response="The capital of Japan is Tokyo."
)

# There are two ways to provide Azure AI Project.
# Option #1 : Using Azure AI Project Details 
azure_ai_project = {
    "subscription_id": "<subscription_id>",
    "resource_group_name": "<resource_group_name>",
    "project_name": "<project_name>",
}

violence_evaluator = ViolenceEvaluator(azure_ai_project)
result = violence_evaluator(
    query="What is the capital of France?",
    response="Paris."
)

# Option # 2 : Using Azure AI Project Url 
azure_ai_project = "https://{resource_name}.services.ai.azure.com/api/projects/{project_name}"

violence_evaluator = ViolenceEvaluator(azure_ai_project)
result = violence_evaluator(
    query="What is the capital of France?",
    response="Paris."
)
```

#### Custom evaluators

Built-in evaluators are great out of the box to start evaluating your application's generations. However you can build your own code-based or prompt-based evaluator to cater to your specific evaluation needs.

```python

# Custom evaluator as a function to calculate response length
def response_length(response, **kwargs):
    return len(response)

# Custom class based evaluator to check for blocked words
class BlocklistEvaluator:
    def __init__(self, blocklist):
        self._blocklist = blocklist

    def __call__(self, *, response: str, **kwargs):
        score = any([word in answer for word in self._blocklist])
        return {"score": score}

blocklist_evaluator = BlocklistEvaluator(blocklist=["bad, worst, terrible"])

result = response_length("The capital of Japan is Tokyo.")
result = blocklist_evaluator(answer="The capital of Japan is Tokyo.")

```

### Evaluate API
The package provides an `evaluate` API which can be used to run multiple evaluators together to evaluate generative AI application response.

#### Evaluate existing dataset

```python
from azure.ai.evaluation import evaluate

result = evaluate(
    data="data.jsonl", # provide your data here
    evaluators={
        "blocklist": blocklist_evaluator,
        "relevance": relevance_evaluator
    },
    # column mapping
    evaluator_config={
        "relevance": {
            "column_mapping": {
                "query": "${data.queries}"
                "ground_truth": "${data.ground_truth}"
                "response": "${outputs.response}"
            } 
        }
    }
    # Optionally provide your AI Foundry project information to track your evaluation results in your Azure AI Foundry project
    azure_ai_project = azure_ai_project,
    # Optionally provide an output path to dump a json of metric summary, row level data and metric and AI Foundry URL
    output_path="./evaluation_results.json"
)
```
For more details refer to [Evaluate on test dataset using evaluate()][evaluate_dataset]

#### Evaluate generative AI application
```python
from askwiki import askwiki

result = evaluate(
    data="data.jsonl",
    target=askwiki,
    evaluators={
        "relevance": relevance_eval
    },
    evaluator_config={
        "default": {
            "column_mapping": {
                "query": "${data.queries}"
                "context": "${outputs.context}"
                "response": "${outputs.response}"
            } 
        }
    }
)
```
Above code snippet refers to askwiki application in this [sample][evaluate_app].

For more details refer to [Evaluate on a target][evaluate_target]

### Simulator


Simulators allow users to generate synthentic data using their application. Simulator expects the user to have a callback method that invokes their AI application. The intergration between your AI application and the simulator happens at the callback method. Here's how a sample callback would look like:


```python
async def callback(
    messages: Dict[str, List[Dict]],
    stream: bool = False,
    session_state: Any = None,
    context: Optional[Dict[str, Any]] = None,
) -> dict:
    messages_list = messages["messages"]
    # Get the last message from the user
    latest_message = messages_list[-1]
    query = latest_message["content"]
    # Call your endpoint or AI application here
    # response should be a string
    response = call_to_your_application(query, messages_list, context)
    formatted_response = {
        "content": response,
        "role": "assistant",
        "context": "",
    }
    messages["messages"].append(formatted_response)
    return {"messages": messages["messages"], "stream": stream, "session_state": session_state, "context": context}
```

The simulator initialization and invocation looks like this:
```python
from azure.ai.evaluation.simulator import Simulator
model_config = {
    "azure_endpoint": os.environ.get("AZURE_ENDPOINT"),
    "azure_deployment": os.environ.get("AZURE_DEPLOYMENT_NAME"),
    "api_version": os.environ.get("AZURE_API_VERSION"),
}
custom_simulator = Simulator(model_config=model_config)
outputs = asyncio.run(custom_simulator(
    target=callback,
    conversation_turns=[
        [
            "What should I know about the public gardens in the US?",
        ],
        [
            "How do I simulate data against LLMs",
        ],
    ],
    max_conversation_turns=2,
))
with open("simulator_output.jsonl", "w") as f:
    for output in outputs:
        f.write(output.to_eval_qr_json_lines())
```

#### Adversarial Simulator

```python
from azure.ai.evaluation.simulator import AdversarialSimulator, AdversarialScenario
from azure.identity import DefaultAzureCredential

# There are two ways to provide Azure AI Project.
# Option #1 : Using Azure AI Project 
azure_ai_project = {
    "subscription_id": <subscription_id>,
    "resource_group_name": <resource_group_name>,
    "project_name": <project_name>
}

# Option #2 : Using Azure AI Project Url 
azure_ai_project = "https://{resource_name}.services.ai.azure.com/api/projects/{project_name}"

scenario = AdversarialScenario.ADVERSARIAL_QA
simulator = AdversarialSimulator(azure_ai_project=azure_ai_project, credential=DefaultAzureCredential())

outputs = asyncio.run(
    simulator(
        scenario=scenario,
        max_conversation_turns=1,
        max_simulation_results=3,
        target=callback
    )
)

print(outputs.to_eval_qr_json_lines())
```

For more details about the simulator, visit the following links:
- [Adversarial Simulation docs][adversarial_simulation_docs]
- [Adversarial scenarios][adversarial_simulation_scenarios]
- [Simulating jailbreak attacks][adversarial_jailbreak]

## Examples

In following section you will find examples of:
- [Evaluate an application][evaluate_app]
- [Evaluate different models][evaluate_models]
- [Custom Evaluators][custom_evaluators]
- [Adversarial Simulation][adversarial_simulation]
- [Simulate with conversation starter][simulate_with_conversation_starter]

More examples can be found [here][evaluate_samples].

## Troubleshooting

### General

Please refer to [troubleshooting][evaluation_tsg] for common issues.

### Logging

This library uses the standard
[logging][python_logging] library for logging.
Basic information about HTTP sessions (URLs, headers, etc.) is logged at INFO
level.

Detailed DEBUG level logging, including request/response bodies and unredacted
headers, can be enabled on a client with the `logging_enable` argument.

See full SDK logging documentation with examples [here][sdk_logging_docs].

## Next steps

- View our [samples][evaluation_samples].
- View our [documentation][product_documentation]

## Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit [cla.microsoft.com][cla].

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the [Microsoft Open Source Code of Conduct][code_of_conduct]. For more information see the [Code of Conduct FAQ][coc_faq] or contact [opencode@microsoft.com][coc_contact] with any additional questions or comments.

<!-- LINKS -->

[source_code]: https://github.com/Azure/azure-sdk-for-python/tree/main/sdk/evaluation/azure-ai-evaluation
[evaluation_pypi]: https://pypi.org/project/azure-ai-evaluation/
[evaluation_ref_docs]: https://learn.microsoft.com/python/api/azure-ai-evaluation/azure.ai.evaluation?view=azure-python-preview
[evaluation_samples]: https://github.com/Azure-Samples/azureai-samples/tree/main/scenarios
[product_documentation]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/evaluate-sdk
[python_logging]: https://docs.python.org/3/library/logging.html
[sdk_logging_docs]: https://docs.microsoft.com/azure/developer/python/azure-sdk-logging
[azure_core_readme]: https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/core/azure-core/README.md
[pip_link]: https://pypi.org/project/pip/
[azure_core_ref_docs]: https://aka.ms/azsdk-python-core-policies
[azure_core]: https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/core/azure-core/README.md
[azure_identity]: https://github.com/Azure/azure-sdk-for-python/tree/main/sdk/identity/azure-identity
[cla]: https://cla.microsoft.com
[code_of_conduct]: https://opensource.microsoft.com/codeofconduct/
[coc_faq]: https://opensource.microsoft.com/codeofconduct/faq/
[coc_contact]: mailto:opencode@microsoft.com
[evaluate_target]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/evaluate-sdk#evaluate-on-a-target
[evaluate_dataset]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/evaluate-sdk#evaluate-on-test-dataset-using-evaluate
[evaluators]: https://learn.microsoft.com/python/api/azure-ai-evaluation/azure.ai.evaluation?view=azure-python-preview
[evaluate_api]: https://learn.microsoft.com/python/api/azure-ai-evaluation/azure.ai.evaluation?view=azure-python-preview#azure-ai-evaluation-evaluate
[evaluate_app]: https://github.com/Azure-Samples/azureai-samples/tree/main/scenarios/evaluate/Supported_Evaluation_Targets/Evaluate_App_Endpoint
[evaluation_tsg]: https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/evaluation/azure-ai-evaluation/TROUBLESHOOTING.md
[ai_studio]: https://learn.microsoft.com/azure/ai-studio/what-is-ai-studio
[ai_project]: https://learn.microsoft.com/azure/ai-studio/how-to/create-projects?tabs=ai-studio
[azure_openai]: https://learn.microsoft.com/azure/ai-services/openai/
[evaluate_models]: https://github.com/Azure-Samples/azureai-samples/tree/main/scenarios/evaluate/Supported_Evaluation_Targets/Evaluate_Base_Model_Endpoint
[custom_evaluators]: https://github.com/Azure-Samples/azureai-samples/tree/main/scenarios/evaluate/Supported_Evaluation_Metrics/Custom_Evaluators
[evaluate_samples]: https://github.com/Azure-Samples/azureai-samples/tree/main/scenarios/evaluate
[evaluation_metrics]: https://learn.microsoft.com/azure/ai-studio/concepts/evaluation-metrics-built-in
[performance_and_quality_evaluators]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/evaluate-sdk#performance-and-quality-evaluators
[risk_and_safety_evaluators]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/evaluate-sdk#risk-and-safety-evaluators
[composite_evaluators]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/evaluate-sdk#composite-evaluators
[adversarial_simulation_docs]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/simulator-interaction-data#generate-adversarial-simulations-for-safety-evaluation
[adversarial_simulation_scenarios]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/simulator-interaction-data#supported-adversarial-simulation-scenarios
[adversarial_simulation]: https://github.com/Azure-Samples/azureai-samples/tree/main/scenarios/evaluate/Simulators/Simulate_Adversarial_Data
[simulate_with_conversation_starter]: https://github.com/Azure-Samples/azureai-samples/tree/main/scenarios/evaluate/Simulators/Simulate_Context-Relevant_Data/Simulate_From_Conversation_Starter
[adversarial_jailbreak]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/simulator-interaction-data#simulating-jailbreak-attacks


# Release History

## 1.10.0 (2025-07-31)

### Breaking Changes

- Added `evaluate_query` parameter to all RAI service evaluators that can be passed as a keyword argument. This parameter controls whether queries are included in evaluation data when evaluating query-response pairs. Previously, queries were always included in evaluations. When set to `True`, both query and response will be evaluated; when set to `False` (default), only the response will be evaluated. This parameter is available across all RAI service evaluators including `ContentSafetyEvaluator`, `ViolenceEvaluator`, `SexualEvaluator`, `SelfHarmEvaluator`, `HateUnfairnessEvaluator`, `ProtectedMaterialEvaluator`, `IndirectAttackEvaluator`, `CodeVulnerabilityEvaluator`, `UngroundedAttributesEvaluator`, `GroundednessProEvaluator`, and `EciEvaluator`.  Existing code that relies on queries being evaluated will need to explicitly set `evaluate_query=True` to maintain the previous behavior.

### Features Added

- Added support for Azure OpenAI Python grader via `AzureOpenAIPythonGrader` class, which serves as a wrapper around Azure Open AI Python grader configurations. This new grader object can be supplied to the main `evaluate` method as if it were a normal callable evaluator.
- Added `attack_success_thresholds` parameter to `RedTeam` class for configuring custom thresholds that determine attack success. This allows users to set specific threshold values for each risk category, with scores greater than the threshold considered successful attacks (i.e. higher threshold means higher 
tolerance for harmful responses).
- Enhanced threshold reporting in RedTeam results to include default threshold values when custom thresholds aren't specified, providing better transparency about the evaluation criteria used.


### Bugs Fixed

- Fixed red team scan `output_path` issue where individual evaluation results were overwriting each other instead of being preserved as separate files. Individual evaluations now create unique files while the user's `output_path` is reserved for final aggregated results.
- Significant improvements to TaskAdherence evaluator. New version has less variance, is much faster and consumes fewer tokens.
- Significant improvements to Relevance evaluator. New version has more concrete rubrics and has less variance, is much faster and consumes fewer tokens.


### Other Changes

- The default engine for evaluation was changed from `promptflow` (PFClient) to an in-SDK batch client (RunSubmitterClient)
  - Note: We've temporarily kept an escape hatch to fall back to the legacy `promptflow` implementation by setting `_use_pf_client=True` when invoking `evaluate()`.
    This is due to be removed in a future release.


## 1.9.0 (2025-07-02)

### Features Added

- Added support for Azure Open AI evaluation via `AzureOpenAIScoreModelGrader` class, which serves as a wrapper around Azure Open AI score model configurations. This new grader object can be supplied to the main `evaluate` method as if it were a normal callable evaluator.
- Added new experimental risk categories ProtectedMaterial and CodeVulnerability for redteam agent scan.


### Bugs Fixed

- Significant improvements to IntentResolution evaluator. New version has less variance, is nearly 2x faster and consumes fewer tokens.

- Fixes and improvements to ToolCallAccuracy evaluator. New version has less variance. and now works on all tool calls that happen in a turn at once. Previously, it worked on each tool call independently without having context on the other tool calls that happen in the same turn, and then aggregated the results to a score in the range [0-1]. The score range is now [1-5].
- Fixed MeteorScoreEvaluator and other threshold-based evaluators returning incorrect binary results due to integer conversion of decimal scores. Previously, decimal scores like 0.9375 were incorrectly converted to integers (0) before threshold comparison, causing them to fail even when above the threshold. [#41415](https://github.com/Azure/azure-sdk-for-python/issues/41415)
- Added a new enum `ADVERSARIAL_QA_DOCUMENTS` which moves all the "file_content" type prompts away from `ADVERSARIAL_QA` to the new enum
- `AzureOpenAIScoreModelGrader` evaluator now supports `pass_threshold` parameter to set the minimum score required for a response to be considered passing. This allows users to define custom thresholds for evaluation results, enhancing flexibility in grading AI model responses.

## 1.8.0 (2025-05-29)

### Features Added

- Introduces `AttackStrategy.MultiTurn` and `AttackStrategy.Crescendo` to `RedTeam`. These strategies attack the target of a `RedTeam` scan over the course of multi-turn conversations. 

### Bugs Fixed
- AdversarialSimulator in `ADVERSARIAL_CONVERSATION` mode was broken. It is now fixed.

## 1.7.0 (2025-05-12)

### Bugs Fixed
- azure-ai-evaluation failed with module not found [#40992](https://github.com/Azure/azure-sdk-for-python/issues/40992)

## 1.6.0 (2025-05-07)

### Features Added
- New `<evaluator>.binary_aggregate` field added to evaluation result metrics. This field contains the aggregated binary evaluation results for each evaluator, providing a summary of the evaluation outcomes.
- Added support for Azure Open AI evaluation via 4 new 'grader' classes, which serve as wrappers around Azure Open AI grader configurations. These new grader objects can be supplied to the main `evaluate` method as if they were normal callable evaluators. The new classes are:
    - AzureOpenAIGrader (general class for experienced users)
    - AzureOpenAILabelGrader
    - AzureOpenAIStringCheckGrader
    - AzureOpenAITextSimilarityGrader

### Breaking Changes
- In the experimental RedTeam's scan method, the `data_only` param has been replaced with `skip_evals` and if you do not want data to be uploaded, use the `skip_upload` flag.

### Bugs Fixed
- Fixed error in `evaluate` where data fields could not contain numeric characters. Previously, a data file with schema:
    ```
    "query1": "some query", "response": "some response"
    ```
    throws error when passed into `evaluator_config` as `{"evaluator_name": {"column_mapping": {"query": "${data.query1}", "response": "${data.response}"}},}`.
    Now, users may import data containing fields with numeric characters. 


## 1.5.0 (2025-04-04)

### Features Added

- New `RedTeam` agent functionality to assess the safety and resilience of AI systems against adversarial prompt attacks

## 1.4.0 (2025-03-27)

### Features Added
- Enhanced binary evaluation results with customizable thresholds
  - Added threshold support for QA and ContentSafety evaluators
  - Evaluation results now include both the score and threshold values
  - Configurable threshold parameter allows custom binary classification boundaries
  - Default thresholds provided for backward compatibility
  - Quality evaluators use "higher is better" scoring (score ≥ threshold is positive)
  - Content safety evaluators use "lower is better" scoring (score ≤ threshold is positive)
- New Built-in evaluator called CodeVulnerabilityEvaluator is added. 
  - It provides capabilities to identify the following code vulnerabilities.
    - path-injection
    - sql-injection
    - code-injection
    - stack-trace-exposure
    - incomplete-url-substring-sanitization
    - flask-debug
    - clear-text-logging-sensitive-data
    - incomplete-hostname-regexp
    - server-side-unvalidated-url-redirection
    - weak-cryptographic-algorithm
    - full-ssrf
    - bind-socket-all-network-interfaces
    - client-side-unvalidated-url-redirection
    - likely-bugs
    - reflected-xss
    - clear-text-storage-sensitive-data
    - tarslip
    - hardcoded-credentials
    - insecure-randomness
  - It also supports multiple coding languages such as (Python, Java, C++, C#, Go, Javascript, SQL)
  
- New Built-in evaluator called UngroundedAttributesEvaluator is added.
  - It evaluates ungrounded inference of human attributes for a given query, response, and context for a single-turn evaluation only, 
  - where query represents the user query and response represents the AI system response given the provided context. 
 
  - Ungrounded Attributes checks for whether a response is first, ungrounded, and checks if it contains information about protected class 
  - or emotional state of a person.
  
  - It identifies the following attributes:
    
    - emotional_state
    - protected_class
    - groundedness
- New Built-in evaluators for Agent Evaluation (Preview)
  - IntentResolutionEvaluator - Evaluates the intent resolution of an agent's response to a user query.
  - ResponseCompletenessEvaluator - Evaluates the response completeness of an agent's response to a user query.
  - TaskAdherenceEvaluator - Evaluates the task adherence of an agent's response to a user query.
  - ToolCallAccuracyEvaluator - Evaluates the accuracy of tool calls made by an agent in response to a user query.

### Bugs Fixed
- Fixed error in `GroundednessProEvaluator` when handling non-numeric values like "n/a" returned from the service.
- Uploading local evaluation results from `evaluate` with the same run name will no longer result in each online run sharing (and bashing) result files.

## 1.3.0 (2025-02-28)

### Breaking Changes
- Multimodal specific evaluators `ContentSafetyMultimodalEvaluator`, `ViolenceMultimodalEvaluator`, `SexualMultimodalEvaluator`, `SelfHarmMultimodalEvaluator`, `HateUnfairnessMultimodalEvaluator` and `ProtectedMaterialMultimodalEvaluator` has been removed. Please use `ContentSafetyEvaluator`, `ViolenceEvaluator`, `SexualEvaluator`, `SelfHarmEvaluator`, `HateUnfairnessEvaluator` and `ProtectedMaterialEvaluator` instead.
- Metric name in ProtectedMaterialEvaluator's output is changed from `protected_material.fictional_characters_label` to `protected_material.fictional_characters_defect_rate`. It's now consistent with other evaluator's metric names (ending with `_defect_rate`).

## 1.2.0 (2025-01-27)

### Features Added
- CSV files are now supported as data file inputs with `evaluate()` API. The CSV file should have a header row with column names that match the `data` and `target` fields in the `evaluate()` method and the filename should be passed as the `data` parameter. Column name 'Conversation' in CSV file is not fully supported yet.  

### Breaking Changes
- `ViolenceMultimodalEvaluator`, `SexualMultimodalEvaluator`, `SelfHarmMultimodalEvaluator`, `HateUnfairnessMultimodalEvaluator` and `ProtectedMaterialMultimodalEvaluator` will be removed in next release. 

### Bugs Fixed
- Removed `[remote]` extra. This is no longer needed when tracking results in Azure AI Studio.
- Fixed `AttributeError: 'NoneType' object has no attribute 'get'` while running simulator with 1000+ results
- Fixed the non adversarial simulator to run in task-free mode
- Content safety evaluators (violence, self harm, sexual, hate/unfairness) return the maximum result as the
  main score when aggregating per-turn evaluations from a conversation into an overall
  evaluation score. Other conversation-capable evaluators still default to a mean for aggregation.
- Fixed bug in non adversarial simulator sample where `tasks` undefined

### Other Changes
- Changed minimum required python version to use this package from 3.8 to 3.9
- Stop dependency on the local promptflow service. No promptflow service will automatically start when running evaluation.
- Evaluators internally allow for custom aggregation. However, this causes serialization failures if evaluated while the
  environment variable `AI_EVALS_BATCH_USE_ASYNC` is set to false.

## 1.1.0 (2024-12-12)

### Features Added
- Added image support in `ContentSafetyEvaluator`, `ViolenceEvaluator`, `SexualEvaluator`, `SelfHarmEvaluator`, `HateUnfairnessEvaluator` and `ProtectedMaterialEvaluator`. Provide image URLs or base64 encoded images in `conversation` input for image evaluation. See below for an example:

```python
evaluator = ContentSafetyEvaluator(credential=azure_cred, azure_ai_project=project_scope)
conversation = {
    "messages": [
        {
            "role": "system",
            "content": [
                {"type": "text", "text": "You are an AI assistant that understands images."}
            ],
        },
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Can you describe this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://cdn.britannica.com/68/178268-050-5B4E7FB6/Tom-Cruise-2013.jpg"
                    },
                },
            ],
        },
        {
            "role": "assistant",
            "content": [
                {
                    "type": "text",
                    "text": "The image shows a man with short brown hair smiling, wearing a dark-colored shirt.",
                }
            ],
        },
    ]
}
print("Calling Content Safety Evaluator for multi-modal")
score = evaluator(conversation=conversation)
```

- Please switch to generic evaluators for image evaluations as mentioned above. `ContentSafetyMultimodalEvaluator`, `ContentSafetyMultimodalEvaluatorBase`, `ViolenceMultimodalEvaluator`, `SexualMultimodalEvaluator`, `SelfHarmMultimodalEvaluator`, `HateUnfairnessMultimodalEvaluator` and `ProtectedMaterialMultimodalEvaluator` will be deprecated in the next release.

### Bugs Fixed
- Removed `[remote]` extra. This is no longer needed when tracking results in Azure AI Foundry portal.
- Fixed `AttributeError: 'NoneType' object has no attribute 'get'` while running simulator with 1000+ results

## 1.0.1 (2024-11-15)

### Bugs Fixed
- Removing `azure-ai-inference` as dependency.
- Fixed `AttributeError: 'NoneType' object has no attribute 'get'` while running simulator with 1000+ results

## 1.0.0 (2024-11-13)

### Breaking Changes
- The `parallel` parameter has been removed from composite evaluators: `QAEvaluator`, `ContentSafetyChatEvaluator`, and `ContentSafetyMultimodalEvaluator`. To control evaluator parallelism, you can now use the `_parallel` keyword argument, though please note that this private parameter may change in the future.
- Parameters `query_response_generating_prompty_kwargs` and `user_simulator_prompty_kwargs` have been renamed to `query_response_generating_prompty_options` and `user_simulator_prompty_options` in the Simulator's __call__ method.

### Bugs Fixed
- Fixed an issue where the `output_path` parameter in the `evaluate` API did not support relative path.
- Output of adversarial simulators are of type `JsonLineList` and the helper function `to_eval_qr_json_lines` now outputs context from both user and assistant turns along with `category` if it exists in the conversation
- Fixed an issue where during long-running simulations, API token expires causing "Forbidden" error. Instead, users can now set an environment variable `AZURE_TOKEN_REFRESH_INTERVAL` to refresh the token more frequently to prevent expiration and ensure continuous operation of the simulation.
- Fixed an issue with the `ContentSafetyEvaluator` that caused parallel execution of sub-evaluators to fail. Parallel execution is now enabled by default again, but can still be disabled via the '_parallel' boolean keyword argument during class initialization.
- Fix `evaluate` function not producing aggregated metrics if ANY values to be aggregated were None, NaN, or
otherwise difficult to process. Such values are ignored fully, so the aggregated metric of `[1, 2, 3, NaN]`
would be 2, not 1.5.

### Other Changes
- Refined error messages for serviced-based evaluators and simulators.
- Tracing has been disabled due to Cosmos DB initialization issue.
- Introduced environment variable `AI_EVALS_DISABLE_EXPERIMENTAL_WARNING` to disable the warning message for experimental features.
- Changed the randomization pattern for `AdversarialSimulator` such that there is an almost equal number of Adversarial harm categories (e.g. Hate + Unfairness, Self-Harm, Violence, Sex) represented in the  `AdversarialSimulator` outputs. Previously, for 200 `max_simulation_results` a user might see 140 results belonging to the 'Hate + Unfairness' category and 40 results belonging to the 'Self-Harm' category. Now, user will see 50 results for each of Hate + Unfairness, Self-Harm, Violence, and Sex.
- For the `DirectAttackSimulator`, the prompt templates used to generate simulated outputs for each Adversarial harm category will no longer be in a randomized order by default. To override this behavior, pass `randomize_order=True` when you call the `DirectAttackSimulator`, for example:
```python
adversarial_simulator = DirectAttackSimulator(azure_ai_project=azure_ai_project, credential=DefaultAzureCredential())
outputs = asyncio.run(
    adversarial_simulator(
        scenario=scenario,
        target=callback,
        randomize_order=True
    )
)
```

## 1.0.0b5 (2024-10-28)

### Features Added
- Added `GroundednessProEvaluator`, which is a service-based evaluator for determining response groundedness.
- Groundedness detection in Non Adversarial Simulator via query/context pairs
```python
import importlib.resources as pkg_resources
package = "azure.ai.evaluation.simulator._data_sources"
resource_name = "grounding.json"
custom_simulator = Simulator(model_config=model_config)
conversation_turns = []
with pkg_resources.path(package, resource_name) as grounding_file:
    with open(grounding_file, "r") as file:
        data = json.load(file)
for item in data:
    conversation_turns.append([item])
outputs = asyncio.run(custom_simulator(
    target=callback,
    conversation_turns=conversation_turns,
    max_conversation_turns=1,
))
```
- Adding evaluator for multimodal use cases

### Breaking Changes
- Renamed environment variable `PF_EVALS_BATCH_USE_ASYNC` to `AI_EVALS_BATCH_USE_ASYNC`.
- `RetrievalEvaluator` now requires a `context` input in addition to `query` in single-turn evaluation.
- `RelevanceEvaluator` no longer takes `context` as an input. It now only takes `query` and `response` in single-turn evaluation.
- `FluencyEvaluator` no longer takes `query` as an input. It now only takes `response` in single-turn evaluation.
- AdversarialScenario enum does not include `ADVERSARIAL_INDIRECT_JAILBREAK`, invoking IndirectJailbreak or XPIA should be done with `IndirectAttackSimulator`
- Outputs of `Simulator` and `AdversarialSimulator` previously had `to_eval_qa_json_lines` and now has `to_eval_qr_json_lines`. Where `to_eval_qa_json_lines` had:
```json
{"question": <user_message>, "answer": <assistant_message>}
```
`to_eval_qr_json_lines` now has:
```json
{"query": <user_message>, "response": assistant_message}
```

### Bugs Fixed
- Non adversarial simulator works with `gpt-4o` models using the `json_schema` response format
- Fixed an issue where the `evaluate` API would fail with "[WinError 32] The process cannot access the file because it is being used by another process" when venv folder and target function file are in the same directory.
- Fix evaluate API failure when `trace.destination` is set to `none`
- Non adversarial simulator now accepts context from the callback

### Other Changes
- Improved error messages for the `evaluate` API by enhancing the validation of input parameters. This update provides more detailed and actionable error descriptions.
- `GroundednessEvaluator` now supports `query` as an optional input in single-turn evaluation. If `query` is provided, a different prompt template will be used for the evaluation.
- To align with our support of a diverse set of models, the following evaluators will now have a new key in their result output without the `gpt_` prefix. To maintain backwards compatibility, the old key with the `gpt_` prefix will still be present in the output; however, it is recommended to use the new key moving forward as the old key will be deprecated in the future.
  - `CoherenceEvaluator`
  - `RelevanceEvaluator`
  - `FluencyEvaluator`
  - `GroundednessEvaluator`
  - `SimilarityEvaluator`
  - `RetrievalEvaluator`
- The following evaluators will now have a new key in their result output including LLM reasoning behind the score. The new key will follow the pattern "<metric_name>_reason". The reasoning is the result of a more detailed prompt template being used to generate the LLM response. Note that this requires the maximum number of tokens used to run these evaluators to be increased.

    | Evaluator | New `max_token` for Generation |
    | --- | --- |
    | `CoherenceEvaluator` | 800 |
    | `RelevanceEvaluator` | 800 |
    | `FluencyEvaluator` | 800 |
    | `GroundednessEvaluator` | 800 |
    | `RetrievalEvaluator` | 1600 |
- Improved the error message for storage access permission issues to provide clearer guidance for users.

## 1.0.0b4 (2024-10-16)

### Breaking Changes

- Removed `numpy` dependency. All NaN values returned by the SDK have been changed to from `numpy.nan` to `math.nan`.
- `credential` is now required to be passed in for all content safety evaluators and `ProtectedMaterialsEvaluator`. `DefaultAzureCredential` will no longer be chosen if a credential is not passed.
- Changed package extra name from "pf-azure" to "remote".

### Bugs Fixed
- Adversarial Conversation simulations would fail with `Forbidden`. Added logic to re-fetch token in the exponential retry logic to retrive RAI Service response.
- Fixed an issue where the Evaluate API did not fail due to missing inputs when the target did not return columns required by the evaluators.

### Other Changes
- Enhance the error message to provide clearer instruction when required packages for the remote tracking feature are missing.
- Print the per-evaluator run summary at the end of the Evaluate API call to make troubleshooting row-level failures easier.

## 1.0.0b3 (2024-10-01)

### Features Added

- Added `type` field to `AzureOpenAIModelConfiguration` and `OpenAIModelConfiguration`
- The following evaluators now support `conversation` as an alternative input to their usual single-turn inputs:
  - `ViolenceEvaluator`
  - `SexualEvaluator`
  - `SelfHarmEvaluator`
  - `HateUnfairnessEvaluator`
  - `ProtectedMaterialEvaluator`
  - `IndirectAttackEvaluator`
  - `CoherenceEvaluator`
  - `RelevanceEvaluator`
  - `FluencyEvaluator`
  - `GroundednessEvaluator`
- Surfaced `RetrievalScoreEvaluator`, formally an internal part of `ChatEvaluator` as a standalone conversation-only evaluator.

### Breaking Changes

- Removed `ContentSafetyChatEvaluator` and `ChatEvaluator`
- The `evaluator_config` parameter of `evaluate` now maps in evaluator name to a dictionary `EvaluatorConfig`, which is a `TypedDict`. The
`column_mapping` between `data` or `target` and evaluator field names should now be specified inside this new dictionary:

Before:
```python
evaluate(
    ...,
    evaluator_config={
        "hate_unfairness": {
            "query": "${data.question}",
            "response": "${data.answer}",
        }
    },
    ...
)
```

After
```python
evaluate(
    ...,
    evaluator_config={
        "hate_unfairness": {
            "column_mapping": {
                "query": "${data.question}",
                "response": "${data.answer}",
             }
        }
    },
    ...
)
```

- Simulator now requires a model configuration to call the prompty instead of an Azure AI project scope. This enables the usage of simulator with Entra ID based auth.
Before:
```python
azure_ai_project = {
    "subscription_id": os.environ.get("AZURE_SUBSCRIPTION_ID"),
    "resource_group_name": os.environ.get("RESOURCE_GROUP"),
    "project_name": os.environ.get("PROJECT_NAME"),
}
sim = Simulator(azure_ai_project=azure_ai_project, credentails=DefaultAzureCredentials())
```
After:
```python
model_config = {
    "azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT"),
    "azure_deployment": os.environ.get("AZURE_DEPLOYMENT"),
}
sim = Simulator(model_config=model_config)
```
If `api_key` is not included in the `model_config`, the prompty runtime in `promptflow-core` will pick up `DefaultAzureCredential`.

### Bugs Fixed

- Fixed issue where Entra ID authentication was not working with `AzureOpenAIModelConfiguration`

## 1.0.0b2 (2024-09-24)

### Breaking Changes

- `data` and `evaluators` are now required keywords in `evaluate`.

## 1.0.0b1 (2024-09-20)

### Breaking Changes

- The `synthetic` namespace has been renamed to `simulator`, and sub-namespaces under this module have been removed
- The `evaluate` and `evaluators` namespaces have been removed, and everything previously exposed in those modules has been added to the root namespace `azure.ai.evaluation`
- The parameter name `project_scope` in content safety evaluators have been renamed to `azure_ai_project` for consistency with evaluate API and simulators.
- Model configurations classes are now of type `TypedDict` and are exposed in the `azure.ai.evaluation` module instead of coming from `promptflow.core`.
- Updated the parameter names for `question` and `answer` in built-in evaluators to more generic terms: `query` and `response`.

### Features Added

- First preview
- This package is port of `promptflow-evals`. New features will be added only to this package moving forward.
- Added a `TypedDict` for `AzureAIProject` that allows for better intellisense and type checking when passing in project information

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/Azure/azure-sdk-for-python",
    "name": "azure-ai-evaluation",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": null,
    "keywords": "azure, azure sdk",
    "author": "Microsoft Corporation",
    "author_email": "azuresdkengsysadmins@microsoft.com",
    "download_url": "https://files.pythonhosted.org/packages/f5/6e/018e0bc3d7368fc6379373939d3ef29da75755e998545e67c660ad8efadd/azure_ai_evaluation-1.10.0.tar.gz",
    "platform": null,
    "description": "# Azure AI Evaluation client library for Python\n\nUse Azure AI Evaluation SDK to assess the performance of your generative AI applications. Generative AI application generations are quantitatively measured with mathematical based metrics, AI-assisted quality and safety metrics. Metrics are defined as `evaluators`. Built-in or custom evaluators can provide comprehensive insights into the application's capabilities and limitations.\n\nUse Azure AI Evaluation SDK to:\n- Evaluate existing data from generative AI applications\n- Evaluate generative AI applications\n- Evaluate by generating mathematical, AI-assisted quality and safety metrics\n\nAzure AI SDK provides following to evaluate Generative AI Applications:\n- [Evaluators][evaluators] - Generate scores individually or when used together with `evaluate` API.\n- [Evaluate API][evaluate_api] - Python API to evaluate dataset or application using built-in or custom evaluators.\n\n[Source code][source_code]\n| [Package (PyPI)][evaluation_pypi]\n| [API reference documentation][evaluation_ref_docs]\n| [Product documentation][product_documentation]\n| [Samples][evaluation_samples]\n\n\n## Getting started\n\n### Prerequisites\n\n- Python 3.9 or later is required to use this package.\n- [Optional] You must have [Azure AI Foundry Project][ai_project] or [Azure Open AI][azure_openai] to use AI-assisted evaluators\n\n### Install the package\n\nInstall the Azure AI Evaluation SDK for Python with [pip][pip_link]:\n\n```bash\npip install azure-ai-evaluation\n```\n\n## Key concepts\n\n### Evaluators\n\nEvaluators are custom or prebuilt classes or functions that are designed to measure the quality of the outputs from language models or generative AI applications.\n\n#### Built-in evaluators\n\nBuilt-in evaluators are out of box evaluators provided by Microsoft:\n| Category  | Evaluator class                                                                                                                    |\n|-----------|------------------------------------------------------------------------------------------------------------------------------------|\n| [Performance and quality][performance_and_quality_evaluators] (AI-assisted)  | `GroundednessEvaluator`, `RelevanceEvaluator`, `CoherenceEvaluator`, `FluencyEvaluator`, `SimilarityEvaluator`, `RetrievalEvaluator` |\n| [Performance and quality][performance_and_quality_evaluators] (NLP)  | `F1ScoreEvaluator`, `RougeScoreEvaluator`, `GleuScoreEvaluator`, `BleuScoreEvaluator`, `MeteorScoreEvaluator`|\n| [Risk and safety][risk_and_safety_evaluators] (AI-assisted)    | `ViolenceEvaluator`, `SexualEvaluator`, `SelfHarmEvaluator`, `HateUnfairnessEvaluator`, `IndirectAttackEvaluator`, `ProtectedMaterialEvaluator`                                             |\n| [Composite][composite_evaluators] | `QAEvaluator`, `ContentSafetyEvaluator`                                             |\n\nFor more in-depth information on each evaluator definition and how it's calculated, see [Evaluation and monitoring metrics for generative AI][evaluation_metrics].\n\n```python\nimport os\n\nfrom azure.ai.evaluation import evaluate, RelevanceEvaluator, ViolenceEvaluator, BleuScoreEvaluator\n\n# NLP bleu score evaluator\nbleu_score_evaluator = BleuScoreEvaluator()\nresult = bleu_score(\n    response=\"Tokyo is the capital of Japan.\",\n    ground_truth=\"The capital of Japan is Tokyo.\"\n)\n\n# AI assisted quality evaluator\nmodel_config = {\n    \"azure_endpoint\": os.environ.get(\"AZURE_OPENAI_ENDPOINT\"),\n    \"api_key\": os.environ.get(\"AZURE_OPENAI_API_KEY\"),\n    \"azure_deployment\": os.environ.get(\"AZURE_OPENAI_DEPLOYMENT\"),\n}\n\nrelevance_evaluator = RelevanceEvaluator(model_config)\nresult = relevance_evaluator(\n    query=\"What is the capital of Japan?\",\n    response=\"The capital of Japan is Tokyo.\"\n)\n\n# There are two ways to provide Azure AI Project.\n# Option #1 : Using Azure AI Project Details \nazure_ai_project = {\n    \"subscription_id\": \"<subscription_id>\",\n    \"resource_group_name\": \"<resource_group_name>\",\n    \"project_name\": \"<project_name>\",\n}\n\nviolence_evaluator = ViolenceEvaluator(azure_ai_project)\nresult = violence_evaluator(\n    query=\"What is the capital of France?\",\n    response=\"Paris.\"\n)\n\n# Option # 2 : Using Azure AI Project Url \nazure_ai_project = \"https://{resource_name}.services.ai.azure.com/api/projects/{project_name}\"\n\nviolence_evaluator = ViolenceEvaluator(azure_ai_project)\nresult = violence_evaluator(\n    query=\"What is the capital of France?\",\n    response=\"Paris.\"\n)\n```\n\n#### Custom evaluators\n\nBuilt-in evaluators are great out of the box to start evaluating your application's generations. However you can build your own code-based or prompt-based evaluator to cater to your specific evaluation needs.\n\n```python\n\n# Custom evaluator as a function to calculate response length\ndef response_length(response, **kwargs):\n    return len(response)\n\n# Custom class based evaluator to check for blocked words\nclass BlocklistEvaluator:\n    def __init__(self, blocklist):\n        self._blocklist = blocklist\n\n    def __call__(self, *, response: str, **kwargs):\n        score = any([word in answer for word in self._blocklist])\n        return {\"score\": score}\n\nblocklist_evaluator = BlocklistEvaluator(blocklist=[\"bad, worst, terrible\"])\n\nresult = response_length(\"The capital of Japan is Tokyo.\")\nresult = blocklist_evaluator(answer=\"The capital of Japan is Tokyo.\")\n\n```\n\n### Evaluate API\nThe package provides an `evaluate` API which can be used to run multiple evaluators together to evaluate generative AI application response.\n\n#### Evaluate existing dataset\n\n```python\nfrom azure.ai.evaluation import evaluate\n\nresult = evaluate(\n    data=\"data.jsonl\", # provide your data here\n    evaluators={\n        \"blocklist\": blocklist_evaluator,\n        \"relevance\": relevance_evaluator\n    },\n    # column mapping\n    evaluator_config={\n        \"relevance\": {\n            \"column_mapping\": {\n                \"query\": \"${data.queries}\"\n                \"ground_truth\": \"${data.ground_truth}\"\n                \"response\": \"${outputs.response}\"\n            } \n        }\n    }\n    # Optionally provide your AI Foundry project information to track your evaluation results in your Azure AI Foundry project\n    azure_ai_project = azure_ai_project,\n    # Optionally provide an output path to dump a json of metric summary, row level data and metric and AI Foundry URL\n    output_path=\"./evaluation_results.json\"\n)\n```\nFor more details refer to [Evaluate on test dataset using evaluate()][evaluate_dataset]\n\n#### Evaluate generative AI application\n```python\nfrom askwiki import askwiki\n\nresult = evaluate(\n    data=\"data.jsonl\",\n    target=askwiki,\n    evaluators={\n        \"relevance\": relevance_eval\n    },\n    evaluator_config={\n        \"default\": {\n            \"column_mapping\": {\n                \"query\": \"${data.queries}\"\n                \"context\": \"${outputs.context}\"\n                \"response\": \"${outputs.response}\"\n            } \n        }\n    }\n)\n```\nAbove code snippet refers to askwiki application in this [sample][evaluate_app].\n\nFor more details refer to [Evaluate on a target][evaluate_target]\n\n### Simulator\n\n\nSimulators allow users to generate synthentic data using their application. Simulator expects the user to have a callback method that invokes their AI application. The intergration between your AI application and the simulator happens at the callback method. Here's how a sample callback would look like:\n\n\n```python\nasync def callback(\n    messages: Dict[str, List[Dict]],\n    stream: bool = False,\n    session_state: Any = None,\n    context: Optional[Dict[str, Any]] = None,\n) -> dict:\n    messages_list = messages[\"messages\"]\n    # Get the last message from the user\n    latest_message = messages_list[-1]\n    query = latest_message[\"content\"]\n    # Call your endpoint or AI application here\n    # response should be a string\n    response = call_to_your_application(query, messages_list, context)\n    formatted_response = {\n        \"content\": response,\n        \"role\": \"assistant\",\n        \"context\": \"\",\n    }\n    messages[\"messages\"].append(formatted_response)\n    return {\"messages\": messages[\"messages\"], \"stream\": stream, \"session_state\": session_state, \"context\": context}\n```\n\nThe simulator initialization and invocation looks like this:\n```python\nfrom azure.ai.evaluation.simulator import Simulator\nmodel_config = {\n    \"azure_endpoint\": os.environ.get(\"AZURE_ENDPOINT\"),\n    \"azure_deployment\": os.environ.get(\"AZURE_DEPLOYMENT_NAME\"),\n    \"api_version\": os.environ.get(\"AZURE_API_VERSION\"),\n}\ncustom_simulator = Simulator(model_config=model_config)\noutputs = asyncio.run(custom_simulator(\n    target=callback,\n    conversation_turns=[\n        [\n            \"What should I know about the public gardens in the US?\",\n        ],\n        [\n            \"How do I simulate data against LLMs\",\n        ],\n    ],\n    max_conversation_turns=2,\n))\nwith open(\"simulator_output.jsonl\", \"w\") as f:\n    for output in outputs:\n        f.write(output.to_eval_qr_json_lines())\n```\n\n#### Adversarial Simulator\n\n```python\nfrom azure.ai.evaluation.simulator import AdversarialSimulator, AdversarialScenario\nfrom azure.identity import DefaultAzureCredential\n\n# There are two ways to provide Azure AI Project.\n# Option #1 : Using Azure AI Project \nazure_ai_project = {\n    \"subscription_id\": <subscription_id>,\n    \"resource_group_name\": <resource_group_name>,\n    \"project_name\": <project_name>\n}\n\n# Option #2 : Using Azure AI Project Url \nazure_ai_project = \"https://{resource_name}.services.ai.azure.com/api/projects/{project_name}\"\n\nscenario = AdversarialScenario.ADVERSARIAL_QA\nsimulator = AdversarialSimulator(azure_ai_project=azure_ai_project, credential=DefaultAzureCredential())\n\noutputs = asyncio.run(\n    simulator(\n        scenario=scenario,\n        max_conversation_turns=1,\n        max_simulation_results=3,\n        target=callback\n    )\n)\n\nprint(outputs.to_eval_qr_json_lines())\n```\n\nFor more details about the simulator, visit the following links:\n- [Adversarial Simulation docs][adversarial_simulation_docs]\n- [Adversarial scenarios][adversarial_simulation_scenarios]\n- [Simulating jailbreak attacks][adversarial_jailbreak]\n\n## Examples\n\nIn following section you will find examples of:\n- [Evaluate an application][evaluate_app]\n- [Evaluate different models][evaluate_models]\n- [Custom Evaluators][custom_evaluators]\n- [Adversarial Simulation][adversarial_simulation]\n- [Simulate with conversation starter][simulate_with_conversation_starter]\n\nMore examples can be found [here][evaluate_samples].\n\n## Troubleshooting\n\n### General\n\nPlease refer to [troubleshooting][evaluation_tsg] for common issues.\n\n### Logging\n\nThis library uses the standard\n[logging][python_logging] library for logging.\nBasic information about HTTP sessions (URLs, headers, etc.) is logged at INFO\nlevel.\n\nDetailed DEBUG level logging, including request/response bodies and unredacted\nheaders, can be enabled on a client with the `logging_enable` argument.\n\nSee full SDK logging documentation with examples [here][sdk_logging_docs].\n\n## Next steps\n\n- View our [samples][evaluation_samples].\n- View our [documentation][product_documentation]\n\n## Contributing\n\nThis project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit [cla.microsoft.com][cla].\n\nWhen you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.\n\nThis project has adopted the [Microsoft Open Source Code of Conduct][code_of_conduct]. For more information see the [Code of Conduct FAQ][coc_faq] or contact [opencode@microsoft.com][coc_contact] with any additional questions or comments.\n\n<!-- LINKS -->\n\n[source_code]: https://github.com/Azure/azure-sdk-for-python/tree/main/sdk/evaluation/azure-ai-evaluation\n[evaluation_pypi]: https://pypi.org/project/azure-ai-evaluation/\n[evaluation_ref_docs]: https://learn.microsoft.com/python/api/azure-ai-evaluation/azure.ai.evaluation?view=azure-python-preview\n[evaluation_samples]: https://github.com/Azure-Samples/azureai-samples/tree/main/scenarios\n[product_documentation]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/evaluate-sdk\n[python_logging]: https://docs.python.org/3/library/logging.html\n[sdk_logging_docs]: https://docs.microsoft.com/azure/developer/python/azure-sdk-logging\n[azure_core_readme]: https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/core/azure-core/README.md\n[pip_link]: https://pypi.org/project/pip/\n[azure_core_ref_docs]: https://aka.ms/azsdk-python-core-policies\n[azure_core]: https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/core/azure-core/README.md\n[azure_identity]: https://github.com/Azure/azure-sdk-for-python/tree/main/sdk/identity/azure-identity\n[cla]: https://cla.microsoft.com\n[code_of_conduct]: https://opensource.microsoft.com/codeofconduct/\n[coc_faq]: https://opensource.microsoft.com/codeofconduct/faq/\n[coc_contact]: mailto:opencode@microsoft.com\n[evaluate_target]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/evaluate-sdk#evaluate-on-a-target\n[evaluate_dataset]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/evaluate-sdk#evaluate-on-test-dataset-using-evaluate\n[evaluators]: https://learn.microsoft.com/python/api/azure-ai-evaluation/azure.ai.evaluation?view=azure-python-preview\n[evaluate_api]: https://learn.microsoft.com/python/api/azure-ai-evaluation/azure.ai.evaluation?view=azure-python-preview#azure-ai-evaluation-evaluate\n[evaluate_app]: https://github.com/Azure-Samples/azureai-samples/tree/main/scenarios/evaluate/Supported_Evaluation_Targets/Evaluate_App_Endpoint\n[evaluation_tsg]: https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/evaluation/azure-ai-evaluation/TROUBLESHOOTING.md\n[ai_studio]: https://learn.microsoft.com/azure/ai-studio/what-is-ai-studio\n[ai_project]: https://learn.microsoft.com/azure/ai-studio/how-to/create-projects?tabs=ai-studio\n[azure_openai]: https://learn.microsoft.com/azure/ai-services/openai/\n[evaluate_models]: https://github.com/Azure-Samples/azureai-samples/tree/main/scenarios/evaluate/Supported_Evaluation_Targets/Evaluate_Base_Model_Endpoint\n[custom_evaluators]: https://github.com/Azure-Samples/azureai-samples/tree/main/scenarios/evaluate/Supported_Evaluation_Metrics/Custom_Evaluators\n[evaluate_samples]: https://github.com/Azure-Samples/azureai-samples/tree/main/scenarios/evaluate\n[evaluation_metrics]: https://learn.microsoft.com/azure/ai-studio/concepts/evaluation-metrics-built-in\n[performance_and_quality_evaluators]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/evaluate-sdk#performance-and-quality-evaluators\n[risk_and_safety_evaluators]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/evaluate-sdk#risk-and-safety-evaluators\n[composite_evaluators]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/evaluate-sdk#composite-evaluators\n[adversarial_simulation_docs]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/simulator-interaction-data#generate-adversarial-simulations-for-safety-evaluation\n[adversarial_simulation_scenarios]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/simulator-interaction-data#supported-adversarial-simulation-scenarios\n[adversarial_simulation]: https://github.com/Azure-Samples/azureai-samples/tree/main/scenarios/evaluate/Simulators/Simulate_Adversarial_Data\n[simulate_with_conversation_starter]: https://github.com/Azure-Samples/azureai-samples/tree/main/scenarios/evaluate/Simulators/Simulate_Context-Relevant_Data/Simulate_From_Conversation_Starter\n[adversarial_jailbreak]: https://learn.microsoft.com/azure/ai-studio/how-to/develop/simulator-interaction-data#simulating-jailbreak-attacks\n\n\n# Release History\n\n## 1.10.0 (2025-07-31)\n\n### Breaking Changes\n\n- Added `evaluate_query` parameter to all RAI service evaluators that can be passed as a keyword argument. This parameter controls whether queries are included in evaluation data when evaluating query-response pairs. Previously, queries were always included in evaluations. When set to `True`, both query and response will be evaluated; when set to `False` (default), only the response will be evaluated. This parameter is available across all RAI service evaluators including `ContentSafetyEvaluator`, `ViolenceEvaluator`, `SexualEvaluator`, `SelfHarmEvaluator`, `HateUnfairnessEvaluator`, `ProtectedMaterialEvaluator`, `IndirectAttackEvaluator`, `CodeVulnerabilityEvaluator`, `UngroundedAttributesEvaluator`, `GroundednessProEvaluator`, and `EciEvaluator`.  Existing code that relies on queries being evaluated will need to explicitly set `evaluate_query=True` to maintain the previous behavior.\n\n### Features Added\n\n- Added support for Azure OpenAI Python grader via `AzureOpenAIPythonGrader` class, which serves as a wrapper around Azure Open AI Python grader configurations. This new grader object can be supplied to the main `evaluate` method as if it were a normal callable evaluator.\n- Added `attack_success_thresholds` parameter to `RedTeam` class for configuring custom thresholds that determine attack success. This allows users to set specific threshold values for each risk category, with scores greater than the threshold considered successful attacks (i.e. higher threshold means higher \ntolerance for harmful responses).\n- Enhanced threshold reporting in RedTeam results to include default threshold values when custom thresholds aren't specified, providing better transparency about the evaluation criteria used.\n\n\n### Bugs Fixed\n\n- Fixed red team scan `output_path` issue where individual evaluation results were overwriting each other instead of being preserved as separate files. Individual evaluations now create unique files while the user's `output_path` is reserved for final aggregated results.\n- Significant improvements to TaskAdherence evaluator. New version has less variance, is much faster and consumes fewer tokens.\n- Significant improvements to Relevance evaluator. New version has more concrete rubrics and has less variance, is much faster and consumes fewer tokens.\n\n\n### Other Changes\n\n- The default engine for evaluation was changed from `promptflow` (PFClient) to an in-SDK batch client (RunSubmitterClient)\n  - Note: We've temporarily kept an escape hatch to fall back to the legacy `promptflow` implementation by setting `_use_pf_client=True` when invoking `evaluate()`.\n    This is due to be removed in a future release.\n\n\n## 1.9.0 (2025-07-02)\n\n### Features Added\n\n- Added support for Azure Open AI evaluation via `AzureOpenAIScoreModelGrader` class, which serves as a wrapper around Azure Open AI score model configurations. This new grader object can be supplied to the main `evaluate` method as if it were a normal callable evaluator.\n- Added new experimental risk categories ProtectedMaterial and CodeVulnerability for redteam agent scan.\n\n\n### Bugs Fixed\n\n- Significant improvements to IntentResolution evaluator. New version has less variance, is nearly 2x faster and consumes fewer tokens.\n\n- Fixes and improvements to ToolCallAccuracy evaluator. New version has less variance. and now works on all tool calls that happen in a turn at once. Previously, it worked on each tool call independently without having context on the other tool calls that happen in the same turn, and then aggregated the results to a score in the range [0-1]. The score range is now [1-5].\n- Fixed MeteorScoreEvaluator and other threshold-based evaluators returning incorrect binary results due to integer conversion of decimal scores. Previously, decimal scores like 0.9375 were incorrectly converted to integers (0) before threshold comparison, causing them to fail even when above the threshold. [#41415](https://github.com/Azure/azure-sdk-for-python/issues/41415)\n- Added a new enum `ADVERSARIAL_QA_DOCUMENTS` which moves all the \"file_content\" type prompts away from `ADVERSARIAL_QA` to the new enum\n- `AzureOpenAIScoreModelGrader` evaluator now supports `pass_threshold` parameter to set the minimum score required for a response to be considered passing. This allows users to define custom thresholds for evaluation results, enhancing flexibility in grading AI model responses.\n\n## 1.8.0 (2025-05-29)\n\n### Features Added\n\n- Introduces `AttackStrategy.MultiTurn` and `AttackStrategy.Crescendo` to `RedTeam`. These strategies attack the target of a `RedTeam` scan over the course of multi-turn conversations. \n\n### Bugs Fixed\n- AdversarialSimulator in `ADVERSARIAL_CONVERSATION` mode was broken. It is now fixed.\n\n## 1.7.0 (2025-05-12)\n\n### Bugs Fixed\n- azure-ai-evaluation failed with module not found [#40992](https://github.com/Azure/azure-sdk-for-python/issues/40992)\n\n## 1.6.0 (2025-05-07)\n\n### Features Added\n- New `<evaluator>.binary_aggregate` field added to evaluation result metrics. This field contains the aggregated binary evaluation results for each evaluator, providing a summary of the evaluation outcomes.\n- Added support for Azure Open AI evaluation via 4 new 'grader' classes, which serve as wrappers around Azure Open AI grader configurations. These new grader objects can be supplied to the main `evaluate` method as if they were normal callable evaluators. The new classes are:\n    - AzureOpenAIGrader (general class for experienced users)\n    - AzureOpenAILabelGrader\n    - AzureOpenAIStringCheckGrader\n    - AzureOpenAITextSimilarityGrader\n\n### Breaking Changes\n- In the experimental RedTeam's scan method, the `data_only` param has been replaced with `skip_evals` and if you do not want data to be uploaded, use the `skip_upload` flag.\n\n### Bugs Fixed\n- Fixed error in `evaluate` where data fields could not contain numeric characters. Previously, a data file with schema:\n    ```\n    \"query1\": \"some query\", \"response\": \"some response\"\n    ```\n    throws error when passed into `evaluator_config` as `{\"evaluator_name\": {\"column_mapping\": {\"query\": \"${data.query1}\", \"response\": \"${data.response}\"}},}`.\n    Now, users may import data containing fields with numeric characters. \n\n\n## 1.5.0 (2025-04-04)\n\n### Features Added\n\n- New `RedTeam` agent functionality to assess the safety and resilience of AI systems against adversarial prompt attacks\n\n## 1.4.0 (2025-03-27)\n\n### Features Added\n- Enhanced binary evaluation results with customizable thresholds\n  - Added threshold support for QA and ContentSafety evaluators\n  - Evaluation results now include both the score and threshold values\n  - Configurable threshold parameter allows custom binary classification boundaries\n  - Default thresholds provided for backward compatibility\n  - Quality evaluators use \"higher is better\" scoring (score \u2265 threshold is positive)\n  - Content safety evaluators use \"lower is better\" scoring (score \u2264 threshold is positive)\n- New Built-in evaluator called CodeVulnerabilityEvaluator is added. \n  - It provides capabilities to identify the following code vulnerabilities.\n    - path-injection\n    - sql-injection\n    - code-injection\n    - stack-trace-exposure\n    - incomplete-url-substring-sanitization\n    - flask-debug\n    - clear-text-logging-sensitive-data\n    - incomplete-hostname-regexp\n    - server-side-unvalidated-url-redirection\n    - weak-cryptographic-algorithm\n    - full-ssrf\n    - bind-socket-all-network-interfaces\n    - client-side-unvalidated-url-redirection\n    - likely-bugs\n    - reflected-xss\n    - clear-text-storage-sensitive-data\n    - tarslip\n    - hardcoded-credentials\n    - insecure-randomness\n  - It also supports multiple coding languages such as (Python, Java, C++, C#, Go, Javascript, SQL)\n  \n- New Built-in evaluator called UngroundedAttributesEvaluator is added.\n  - It evaluates ungrounded inference of human attributes for a given query, response, and context for a single-turn evaluation only, \n  - where query represents the user query and response represents the AI system response given the provided context. \n \n  - Ungrounded Attributes checks for whether a response is first, ungrounded, and checks if it contains information about protected class \n  - or emotional state of a person.\n  \n  - It identifies the following attributes:\n    \n    - emotional_state\n    - protected_class\n    - groundedness\n- New Built-in evaluators for Agent Evaluation (Preview)\n  - IntentResolutionEvaluator - Evaluates the intent resolution of an agent's response to a user query.\n  - ResponseCompletenessEvaluator - Evaluates the response completeness of an agent's response to a user query.\n  - TaskAdherenceEvaluator - Evaluates the task adherence of an agent's response to a user query.\n  - ToolCallAccuracyEvaluator - Evaluates the accuracy of tool calls made by an agent in response to a user query.\n\n### Bugs Fixed\n- Fixed error in `GroundednessProEvaluator` when handling non-numeric values like \"n/a\" returned from the service.\n- Uploading local evaluation results from `evaluate` with the same run name will no longer result in each online run sharing (and bashing) result files.\n\n## 1.3.0 (2025-02-28)\n\n### Breaking Changes\n- Multimodal specific evaluators `ContentSafetyMultimodalEvaluator`, `ViolenceMultimodalEvaluator`, `SexualMultimodalEvaluator`, `SelfHarmMultimodalEvaluator`, `HateUnfairnessMultimodalEvaluator` and `ProtectedMaterialMultimodalEvaluator` has been removed. Please use `ContentSafetyEvaluator`, `ViolenceEvaluator`, `SexualEvaluator`, `SelfHarmEvaluator`, `HateUnfairnessEvaluator` and `ProtectedMaterialEvaluator` instead.\n- Metric name in ProtectedMaterialEvaluator's output is changed from `protected_material.fictional_characters_label` to `protected_material.fictional_characters_defect_rate`. It's now consistent with other evaluator's metric names (ending with `_defect_rate`).\n\n## 1.2.0 (2025-01-27)\n\n### Features Added\n- CSV files are now supported as data file inputs with `evaluate()` API. The CSV file should have a header row with column names that match the `data` and `target` fields in the `evaluate()` method and the filename should be passed as the `data` parameter. Column name 'Conversation' in CSV file is not fully supported yet.  \n\n### Breaking Changes\n- `ViolenceMultimodalEvaluator`, `SexualMultimodalEvaluator`, `SelfHarmMultimodalEvaluator`, `HateUnfairnessMultimodalEvaluator` and `ProtectedMaterialMultimodalEvaluator` will be removed in next release. \n\n### Bugs Fixed\n- Removed `[remote]` extra. This is no longer needed when tracking results in Azure AI Studio.\n- Fixed `AttributeError: 'NoneType' object has no attribute 'get'` while running simulator with 1000+ results\n- Fixed the non adversarial simulator to run in task-free mode\n- Content safety evaluators (violence, self harm, sexual, hate/unfairness) return the maximum result as the\n  main score when aggregating per-turn evaluations from a conversation into an overall\n  evaluation score. Other conversation-capable evaluators still default to a mean for aggregation.\n- Fixed bug in non adversarial simulator sample where `tasks` undefined\n\n### Other Changes\n- Changed minimum required python version to use this package from 3.8 to 3.9\n- Stop dependency on the local promptflow service. No promptflow service will automatically start when running evaluation.\n- Evaluators internally allow for custom aggregation. However, this causes serialization failures if evaluated while the\n  environment variable `AI_EVALS_BATCH_USE_ASYNC` is set to false.\n\n## 1.1.0 (2024-12-12)\n\n### Features Added\n- Added image support in `ContentSafetyEvaluator`, `ViolenceEvaluator`, `SexualEvaluator`, `SelfHarmEvaluator`, `HateUnfairnessEvaluator` and `ProtectedMaterialEvaluator`. Provide image URLs or base64 encoded images in `conversation` input for image evaluation. See below for an example:\n\n```python\nevaluator = ContentSafetyEvaluator(credential=azure_cred, azure_ai_project=project_scope)\nconversation = {\n    \"messages\": [\n        {\n            \"role\": \"system\",\n            \"content\": [\n                {\"type\": \"text\", \"text\": \"You are an AI assistant that understands images.\"}\n            ],\n        },\n        {\n            \"role\": \"user\",\n            \"content\": [\n                {\"type\": \"text\", \"text\": \"Can you describe this image?\"},\n                {\n                    \"type\": \"image_url\",\n                    \"image_url\": {\n                        \"url\": \"https://cdn.britannica.com/68/178268-050-5B4E7FB6/Tom-Cruise-2013.jpg\"\n                    },\n                },\n            ],\n        },\n        {\n            \"role\": \"assistant\",\n            \"content\": [\n                {\n                    \"type\": \"text\",\n                    \"text\": \"The image shows a man with short brown hair smiling, wearing a dark-colored shirt.\",\n                }\n            ],\n        },\n    ]\n}\nprint(\"Calling Content Safety Evaluator for multi-modal\")\nscore = evaluator(conversation=conversation)\n```\n\n- Please switch to generic evaluators for image evaluations as mentioned above. `ContentSafetyMultimodalEvaluator`, `ContentSafetyMultimodalEvaluatorBase`, `ViolenceMultimodalEvaluator`, `SexualMultimodalEvaluator`, `SelfHarmMultimodalEvaluator`, `HateUnfairnessMultimodalEvaluator` and `ProtectedMaterialMultimodalEvaluator` will be deprecated in the next release.\n\n### Bugs Fixed\n- Removed `[remote]` extra. This is no longer needed when tracking results in Azure AI Foundry portal.\n- Fixed `AttributeError: 'NoneType' object has no attribute 'get'` while running simulator with 1000+ results\n\n## 1.0.1 (2024-11-15)\n\n### Bugs Fixed\n- Removing `azure-ai-inference` as dependency.\n- Fixed `AttributeError: 'NoneType' object has no attribute 'get'` while running simulator with 1000+ results\n\n## 1.0.0 (2024-11-13)\n\n### Breaking Changes\n- The `parallel` parameter has been removed from composite evaluators: `QAEvaluator`, `ContentSafetyChatEvaluator`, and `ContentSafetyMultimodalEvaluator`. To control evaluator parallelism, you can now use the `_parallel` keyword argument, though please note that this private parameter may change in the future.\n- Parameters `query_response_generating_prompty_kwargs` and `user_simulator_prompty_kwargs` have been renamed to `query_response_generating_prompty_options` and `user_simulator_prompty_options` in the Simulator's __call__ method.\n\n### Bugs Fixed\n- Fixed an issue where the `output_path` parameter in the `evaluate` API did not support relative path.\n- Output of adversarial simulators are of type `JsonLineList` and the helper function `to_eval_qr_json_lines` now outputs context from both user and assistant turns along with `category` if it exists in the conversation\n- Fixed an issue where during long-running simulations, API token expires causing \"Forbidden\" error. Instead, users can now set an environment variable `AZURE_TOKEN_REFRESH_INTERVAL` to refresh the token more frequently to prevent expiration and ensure continuous operation of the simulation.\n- Fixed an issue with the `ContentSafetyEvaluator` that caused parallel execution of sub-evaluators to fail. Parallel execution is now enabled by default again, but can still be disabled via the '_parallel' boolean keyword argument during class initialization.\n- Fix `evaluate` function not producing aggregated metrics if ANY values to be aggregated were None, NaN, or\notherwise difficult to process. Such values are ignored fully, so the aggregated metric of `[1, 2, 3, NaN]`\nwould be 2, not 1.5.\n\n### Other Changes\n- Refined error messages for serviced-based evaluators and simulators.\n- Tracing has been disabled due to Cosmos DB initialization issue.\n- Introduced environment variable `AI_EVALS_DISABLE_EXPERIMENTAL_WARNING` to disable the warning message for experimental features.\n- Changed the randomization pattern for `AdversarialSimulator` such that there is an almost equal number of Adversarial harm categories (e.g. Hate + Unfairness, Self-Harm, Violence, Sex) represented in the  `AdversarialSimulator` outputs. Previously, for 200 `max_simulation_results` a user might see 140 results belonging to the 'Hate + Unfairness' category and 40 results belonging to the 'Self-Harm' category. Now, user will see 50 results for each of Hate + Unfairness, Self-Harm, Violence, and Sex.\n- For the `DirectAttackSimulator`, the prompt templates used to generate simulated outputs for each Adversarial harm category will no longer be in a randomized order by default. To override this behavior, pass `randomize_order=True` when you call the `DirectAttackSimulator`, for example:\n```python\nadversarial_simulator = DirectAttackSimulator(azure_ai_project=azure_ai_project, credential=DefaultAzureCredential())\noutputs = asyncio.run(\n    adversarial_simulator(\n        scenario=scenario,\n        target=callback,\n        randomize_order=True\n    )\n)\n```\n\n## 1.0.0b5 (2024-10-28)\n\n### Features Added\n- Added `GroundednessProEvaluator`, which is a service-based evaluator for determining response groundedness.\n- Groundedness detection in Non Adversarial Simulator via query/context pairs\n```python\nimport importlib.resources as pkg_resources\npackage = \"azure.ai.evaluation.simulator._data_sources\"\nresource_name = \"grounding.json\"\ncustom_simulator = Simulator(model_config=model_config)\nconversation_turns = []\nwith pkg_resources.path(package, resource_name) as grounding_file:\n    with open(grounding_file, \"r\") as file:\n        data = json.load(file)\nfor item in data:\n    conversation_turns.append([item])\noutputs = asyncio.run(custom_simulator(\n    target=callback,\n    conversation_turns=conversation_turns,\n    max_conversation_turns=1,\n))\n```\n- Adding evaluator for multimodal use cases\n\n### Breaking Changes\n- Renamed environment variable `PF_EVALS_BATCH_USE_ASYNC` to `AI_EVALS_BATCH_USE_ASYNC`.\n- `RetrievalEvaluator` now requires a `context` input in addition to `query` in single-turn evaluation.\n- `RelevanceEvaluator` no longer takes `context` as an input. It now only takes `query` and `response` in single-turn evaluation.\n- `FluencyEvaluator` no longer takes `query` as an input. It now only takes `response` in single-turn evaluation.\n- AdversarialScenario enum does not include `ADVERSARIAL_INDIRECT_JAILBREAK`, invoking IndirectJailbreak or XPIA should be done with `IndirectAttackSimulator`\n- Outputs of `Simulator` and `AdversarialSimulator` previously had `to_eval_qa_json_lines` and now has `to_eval_qr_json_lines`. Where `to_eval_qa_json_lines` had:\n```json\n{\"question\": <user_message>, \"answer\": <assistant_message>}\n```\n`to_eval_qr_json_lines` now has:\n```json\n{\"query\": <user_message>, \"response\": assistant_message}\n```\n\n### Bugs Fixed\n- Non adversarial simulator works with `gpt-4o` models using the `json_schema` response format\n- Fixed an issue where the `evaluate` API would fail with \"[WinError 32] The process cannot access the file because it is being used by another process\" when venv folder and target function file are in the same directory.\n- Fix evaluate API failure when `trace.destination` is set to `none`\n- Non adversarial simulator now accepts context from the callback\n\n### Other Changes\n- Improved error messages for the `evaluate` API by enhancing the validation of input parameters. This update provides more detailed and actionable error descriptions.\n- `GroundednessEvaluator` now supports `query` as an optional input in single-turn evaluation. If `query` is provided, a different prompt template will be used for the evaluation.\n- To align with our support of a diverse set of models, the following evaluators will now have a new key in their result output without the `gpt_` prefix. To maintain backwards compatibility, the old key with the `gpt_` prefix will still be present in the output; however, it is recommended to use the new key moving forward as the old key will be deprecated in the future.\n  - `CoherenceEvaluator`\n  - `RelevanceEvaluator`\n  - `FluencyEvaluator`\n  - `GroundednessEvaluator`\n  - `SimilarityEvaluator`\n  - `RetrievalEvaluator`\n- The following evaluators will now have a new key in their result output including LLM reasoning behind the score. The new key will follow the pattern \"<metric_name>_reason\". The reasoning is the result of a more detailed prompt template being used to generate the LLM response. Note that this requires the maximum number of tokens used to run these evaluators to be increased.\n\n    | Evaluator | New `max_token` for Generation |\n    | --- | --- |\n    | `CoherenceEvaluator` | 800 |\n    | `RelevanceEvaluator` | 800 |\n    | `FluencyEvaluator` | 800 |\n    | `GroundednessEvaluator` | 800 |\n    | `RetrievalEvaluator` | 1600 |\n- Improved the error message for storage access permission issues to provide clearer guidance for users.\n\n## 1.0.0b4 (2024-10-16)\n\n### Breaking Changes\n\n- Removed `numpy` dependency. All NaN values returned by the SDK have been changed to from `numpy.nan` to `math.nan`.\n- `credential` is now required to be passed in for all content safety evaluators and `ProtectedMaterialsEvaluator`. `DefaultAzureCredential` will no longer be chosen if a credential is not passed.\n- Changed package extra name from \"pf-azure\" to \"remote\".\n\n### Bugs Fixed\n- Adversarial Conversation simulations would fail with `Forbidden`. Added logic to re-fetch token in the exponential retry logic to retrive RAI Service response.\n- Fixed an issue where the Evaluate API did not fail due to missing inputs when the target did not return columns required by the evaluators.\n\n### Other Changes\n- Enhance the error message to provide clearer instruction when required packages for the remote tracking feature are missing.\n- Print the per-evaluator run summary at the end of the Evaluate API call to make troubleshooting row-level failures easier.\n\n## 1.0.0b3 (2024-10-01)\n\n### Features Added\n\n- Added `type` field to `AzureOpenAIModelConfiguration` and `OpenAIModelConfiguration`\n- The following evaluators now support `conversation` as an alternative input to their usual single-turn inputs:\n  - `ViolenceEvaluator`\n  - `SexualEvaluator`\n  - `SelfHarmEvaluator`\n  - `HateUnfairnessEvaluator`\n  - `ProtectedMaterialEvaluator`\n  - `IndirectAttackEvaluator`\n  - `CoherenceEvaluator`\n  - `RelevanceEvaluator`\n  - `FluencyEvaluator`\n  - `GroundednessEvaluator`\n- Surfaced `RetrievalScoreEvaluator`, formally an internal part of `ChatEvaluator` as a standalone conversation-only evaluator.\n\n### Breaking Changes\n\n- Removed `ContentSafetyChatEvaluator` and `ChatEvaluator`\n- The `evaluator_config` parameter of `evaluate` now maps in evaluator name to a dictionary `EvaluatorConfig`, which is a `TypedDict`. The\n`column_mapping` between `data` or `target` and evaluator field names should now be specified inside this new dictionary:\n\nBefore:\n```python\nevaluate(\n    ...,\n    evaluator_config={\n        \"hate_unfairness\": {\n            \"query\": \"${data.question}\",\n            \"response\": \"${data.answer}\",\n        }\n    },\n    ...\n)\n```\n\nAfter\n```python\nevaluate(\n    ...,\n    evaluator_config={\n        \"hate_unfairness\": {\n            \"column_mapping\": {\n                \"query\": \"${data.question}\",\n                \"response\": \"${data.answer}\",\n             }\n        }\n    },\n    ...\n)\n```\n\n- Simulator now requires a model configuration to call the prompty instead of an Azure AI project scope. This enables the usage of simulator with Entra ID based auth.\nBefore:\n```python\nazure_ai_project = {\n    \"subscription_id\": os.environ.get(\"AZURE_SUBSCRIPTION_ID\"),\n    \"resource_group_name\": os.environ.get(\"RESOURCE_GROUP\"),\n    \"project_name\": os.environ.get(\"PROJECT_NAME\"),\n}\nsim = Simulator(azure_ai_project=azure_ai_project, credentails=DefaultAzureCredentials())\n```\nAfter:\n```python\nmodel_config = {\n    \"azure_endpoint\": os.environ.get(\"AZURE_OPENAI_ENDPOINT\"),\n    \"azure_deployment\": os.environ.get(\"AZURE_DEPLOYMENT\"),\n}\nsim = Simulator(model_config=model_config)\n```\nIf `api_key` is not included in the `model_config`, the prompty runtime in `promptflow-core` will pick up `DefaultAzureCredential`.\n\n### Bugs Fixed\n\n- Fixed issue where Entra ID authentication was not working with `AzureOpenAIModelConfiguration`\n\n## 1.0.0b2 (2024-09-24)\n\n### Breaking Changes\n\n- `data` and `evaluators` are now required keywords in `evaluate`.\n\n## 1.0.0b1 (2024-09-20)\n\n### Breaking Changes\n\n- The `synthetic` namespace has been renamed to `simulator`, and sub-namespaces under this module have been removed\n- The `evaluate` and `evaluators` namespaces have been removed, and everything previously exposed in those modules has been added to the root namespace `azure.ai.evaluation`\n- The parameter name `project_scope` in content safety evaluators have been renamed to `azure_ai_project` for consistency with evaluate API and simulators.\n- Model configurations classes are now of type `TypedDict` and are exposed in the `azure.ai.evaluation` module instead of coming from `promptflow.core`.\n- Updated the parameter names for `question` and `answer` in built-in evaluators to more generic terms: `query` and `response`.\n\n### Features Added\n\n- First preview\n- This package is port of `promptflow-evals`. New features will be added only to this package moving forward.\n- Added a `TypedDict` for `AzureAIProject` that allows for better intellisense and type checking when passing in project information\n",
    "bugtrack_url": null,
    "license": "MIT License",
    "summary": "Microsoft Azure Evaluation Library for Python",
    "version": "1.10.0",
    "project_urls": {
        "Bug Reports": "https://github.com/Azure/azure-sdk-for-python/issues",
        "Homepage": "https://github.com/Azure/azure-sdk-for-python",
        "Source": "https://github.com/Azure/azure-sdk-for-python"
    },
    "split_keywords": [
        "azure",
        " azure sdk"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "d638a0ed945934833c0b3a679611351be0f1e7c8667af92a58a43da2ece512e5",
                "md5": "f2b2276b15de05f8b1f8321b77724c95",
                "sha256": "05bf98b29d8d0218180725d0db41f5a4d8fd394eceaacb87c79f9bca503206a9"
            },
            "downloads": -1,
            "filename": "azure_ai_evaluation-1.10.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "f2b2276b15de05f8b1f8321b77724c95",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9",
            "size": 1000093,
            "upload_time": "2025-07-31T23:46:32",
            "upload_time_iso_8601": "2025-07-31T23:46:32.970258Z",
            "url": "https://files.pythonhosted.org/packages/d6/38/a0ed945934833c0b3a679611351be0f1e7c8667af92a58a43da2ece512e5/azure_ai_evaluation-1.10.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "f56e018e0bc3d7368fc6379373939d3ef29da75755e998545e67c660ad8efadd",
                "md5": "63713123d23e3be4f11f5b1692983592",
                "sha256": "58bdf2d2be9e2b94a881026b006104f64654bbe61dc101bff84c2d9fa8c575a2"
            },
            "downloads": -1,
            "filename": "azure_ai_evaluation-1.10.0.tar.gz",
            "has_sig": false,
            "md5_digest": "63713123d23e3be4f11f5b1692983592",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 1058944,
            "upload_time": "2025-07-31T23:46:31",
            "upload_time_iso_8601": "2025-07-31T23:46:31.021624Z",
            "url": "https://files.pythonhosted.org/packages/f5/6e/018e0bc3d7368fc6379373939d3ef29da75755e998545e67c660ad8efadd/azure_ai_evaluation-1.10.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-07-31 23:46:31",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "Azure",
    "github_project": "azure-sdk-for-python",
    "travis_ci": false,
    "coveralls": true,
    "github_actions": true,
    "lcname": "azure-ai-evaluation"
}

Microsoft Corporation