# Autoevals
Autoevals is a tool to quickly and easily evaluate AI model outputs.
It bundles together a variety of automatic evaluation methods including:
- LLM-as-a-Judge
- Heuristic (e.g. Levenshtein distance)
- Statistical (e.g. BLEU)
Autoevals is developed by the team at [Braintrust](https://braintrust.dev/).
Autoevals uses model-graded evaluation for a variety of subjective tasks including fact checking,
safety, and more. Many of these evaluations are adapted from OpenAI's excellent [evals](https://github.com/openai/evals)
project but are implemented so you can flexibly run them on individual examples, tweak the prompts, and debug
their outputs.
You can also create your own model-graded evaluations with Autoevals. It's easy to add custom prompts, parse outputs,
and manage exceptions.
## Installation
Autoevals is distributed as a [Python library on PyPI](https://pypi.org/project/autoevals/) and
[Node.js library on NPM](https://www.npmjs.com/package/autoevals).
### Python
```bash
pip install autoevals
```
### Node.js
```bash
npm install autoevals
```
## Example
Use Autoevals to model-grade an example LLM completion using the [factuality prompt](templates/factuality.yaml).
By default, Autoevals uses your `OPENAI_API_KEY` environment variable to authenticate with OpenAI's API.
### Python
```python
from autoevals.llm import *
# Create a new LLM-based evaluator
evaluator = Factuality()
# Evaluate an example LLM completion
input = "Which country has the highest population?"
output = "People's Republic of China"
expected = "China"
result = evaluator(output, expected, input=input)
# The evaluator returns a score from [0,1] and includes the raw outputs from the evaluator
print(f"Factuality score: {result.score}")
print(f"Factuality metadata: {result.metadata['rationale']}")
```
#### Use with other AI providers through the AI proxy
Autoevals will look for an `OPENAI_BASE_URL` environment variable to use as the base for requests to an OpenAI compatible API. If `OPENAI_BASE_URL` is not set, it will default to the [AI proxy](https://www.braintrust.dev/docs/guides/proxy). This provides numerous benefits like simplified access to many AI providers, reduced costs with automatic request caching, and increased observability when you enable logging to Braintrust. The proxy is free to use, even if you don't have a Braintrust account.
If you have a Braintrust account, you can set the `BRAINTRUST_API_KEY` environment variable instead of `OPENAI_API_KEY` to unlock additional features like logging and monitoring. Additionally, you can route requests to [supported AI providers and models](https://www.braintrust.dev/docs/guides/proxy#supported-models) or custom models you have configured in Braintrust.
```python
# NOTE: ensure BRAINTRUST_API_KEY is set in your environment and OPENAI_API_KEY is not set
from autoevals.llm import *
# Create an LLM-based evaluator using the Claude 3.5 Sonnet model from Anthropic
evaluator = Factuality(model="claude-3-5-sonnet-latest")
# Evaluate an example LLM completion
input = "Which country has the highest population?"
output = "People's Republic of China"
expected = "China"
result = evaluator(output, expected, input=input)
# The evaluator returns a score from [0,1] and includes the raw outputs from the evaluator
print(f"Factuality score: {result.score}")
print(f"Factuality metadata: {result.metadata['rationale']}")
```
#### Custom Client
If you need to use a different OpenAI compatible API or require custom behavior, you can initialize the library with a custom client.
```python
import openai
from autoevals import init
from autoevals.oai import LLMClient
openai_client = openai.OpenAI(base_url="https://api.openai.com/v1/")
class CustomClient(LLMClient):
    openai = openai_client  # you can also pass in the openai module and we will instantiate a client for you
    embed = openai.embeddings.create
    moderation = openai.moderations.create
    RateLimitError = openai.RateLimitError

    def complete(self, **kwargs):
        # make adjustments as needed
        return self.openai.chat.completions.create(**kwargs)
# Autoevals will now use your custom client
client = init(client=CustomClient)
```
If you only need a custom client for a specific evaluator, you can pass the client to that evaluator directly.
```python
evaluator = Factuality(client=CustomClient)
```
### Node.js
```javascript
import { Factuality } from "autoevals";
(async () => {
  const input = "Which country has the highest population?";
  const output = "People's Republic of China";
  const expected = "China";

  const result = await Factuality({ output, expected, input });
  console.log(`Factuality score: ${result.score}`);
  console.log(`Factuality metadata: ${result.metadata.rationale}`);
})();
```
#### Use with other AI providers through the AI proxy
Autoevals will look for an `OPENAI_BASE_URL` environment variable to use as the base for requests to an OpenAI compatible API. If `OPENAI_BASE_URL` is not set, it will default to the [AI proxy](https://www.braintrust.dev/docs/guides/proxy). This provides numerous benefits like simplified access to many AI providers, reduced costs with automatic request caching, and increased observability when you enable logging to Braintrust. The proxy is free to use, even if you don't have a Braintrust account.
If you have a Braintrust account, you can set the `BRAINTRUST_API_KEY` environment variable instead of `OPENAI_API_KEY` to unlock additional features like logging and monitoring. Additionally, you can route requests to [supported AI providers and models](https://www.braintrust.dev/docs/guides/proxy#supported-models) or custom models you have configured in Braintrust.
```javascript
// NOTE: ensure BRAINTRUST_API_KEY is set in your environment and OPENAI_API_KEY is not set
import { Factuality } from "autoevals";
(async () => {
  const input = "Which country has the highest population?";
  const output = "People's Republic of China";
  const expected = "China";

  // Run an LLM-based evaluator using the Claude 3.5 Sonnet model from Anthropic
  const result = await Factuality({
    model: "claude-3-5-sonnet-latest",
    output,
    expected,
    input,
  });

  // The evaluator returns a score from [0,1] and includes the raw outputs from the evaluator
  console.log(`Factuality score: ${result.score}`);
  console.log(`Factuality metadata: ${result.metadata?.rationale}`);
})();
```
## Using Braintrust with Autoevals
Once you grade an output using Autoevals, it's convenient to use [Braintrust](https://www.braintrust.dev/docs/libs/python) to log and compare your evaluation results.
### Python
```python
from autoevals.llm import *
import braintrust
# Create a new LLM-based evaluator
evaluator = Factuality()
# Set up an example LLM completion
input = "Which country has the highest population?"
output = "People's Republic of China"
expected = "China"
# Set up a Braintrust experiment to log our eval to
experiment = braintrust.init(
    project="Autoevals", api_key="YOUR_BRAINTRUST_API_KEY"
)

# Start a span and run our evaluator
with experiment.start_span() as span:
    result = evaluator(output, expected, input=input)

    # The evaluator returns a score from [0,1] and includes the raw outputs from the evaluator
    print(f"Factuality score: {result.score}")
    print(f"Factuality metadata: {result.metadata['rationale']}")

    span.log(
        inputs={"query": input},
        output=output,
        expected=expected,
        scores={
            "factuality": result.score,
        },
        metadata={
            "factuality": result.metadata,
        },
    )

print(experiment.summarize())
```
### Node.js
Create a file named `example.eval.js` (it must end with `.eval.ts` or `.eval.js`):
```javascript
import { Eval } from "braintrust";
import { Factuality } from "autoevals";
Eval("Autoevals", {
data: () => [
{
input: "Which country has the highest population?",
expected: "China",
},
],
task: () => "People's Republic of China",
scores: [Factuality],
});
```
Then, run
```bash
npx braintrust run example.eval.js
```
## Supported Evaluation Methods
### LLM-as-a-Judge
- Battle
- ClosedQA
- Humor
- Factuality
- Moderation
- Security
- Summarization
- SQL
- Translation
- Fine-tuned binary classifiers
### RAG
- Context precision
- Context relevancy
- Context recall
- Context entities recall
- Faithfulness
- Answer relevance
- Answer semantic similarity
- Answer correctness
- Aspect critique
### Composite
- Semantic list contains
- JSON validity
### Embeddings
- Embedding similarity
- BERTScore
### Heuristic
- Levenshtein distance
- Exact match
- Numeric difference
- JSON diff
- Jaccard distance
### Statistical
- BLEU
- ROUGE
- METEOR
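The heuristic and statistical scorers listed above follow the same call pattern as the LLM-based evaluators shown earlier. Here is a minimal sketch, assuming `Levenshtein` and `ExactMatch` are re-exported from the top-level `autoevals` package (they may also live in submodules such as `autoevals.string` and `autoevals.value`):
```python
# A minimal sketch: heuristic scorers are called just like the LLM-based
# evaluators above and also return a Score with a value in [0,1].
# Assumes Levenshtein and ExactMatch are importable from the top-level package.
from autoevals import ExactMatch, Levenshtein

output = "People's Republic of China"
expected = "China"

levenshtein_result = Levenshtein()(output, expected)
exact_match_result = ExactMatch()(output, expected)

print(f"Levenshtein score: {levenshtein_result.score}")
print(f"Exact match score: {exact_match_result.score}")
```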
## Custom Evaluation Prompts
Autoevals supports custom evaluation prompts for model-graded evaluation. To use them, simply pass in a prompt and scoring mechanism:
### Python
```python
from autoevals import LLMClassifier
# Define a prompt prefix for an LLMClassifier (returns just one answer)
prompt_prefix = """
You are a technical project manager who helps software engineers generate better titles for their GitHub issues.
You will look at the issue description, and pick which of two titles better describes it.

I'm going to provide you with the issue description, and two possible titles.

Issue Description: {{input}}

1: {{output}}
2: {{expected}}
"""
# Define the scoring mechanism
# 1 if the generated answer is better than the expected answer
# 0 otherwise
output_scores = {"1": 1, "2": 0}
evaluator = LLMClassifier(
    name="TitleQuality",
    prompt_template=prompt_prefix,
    choice_scores=output_scores,
    use_cot=True,
)
# Evaluate an example LLM completion
page_content = """
As suggested by Nicolo, we should standardize the error responses coming from GoTrue, postgres, and realtime (and any other/future APIs) so that it's better DX when writing a client,
We can make this change on the servers themselves, but since postgrest and gotrue are fully/partially external may be harder to change, it might be an option to transform the errors within the client libraries/supabase-js, could be messy?
Nicolo also dropped this as a reference: http://spec.openapis.org/oas/v3.0.3#openapi-specification"""
output = (
    "Standardize error responses from GoTrue, Postgres, and Realtime APIs for better DX"
)
expected = "Standardize Error Responses across APIs"
response = evaluator(output, expected, input=page_content)
print(f"Score: {response.score}")
print(f"Metadata: {response.metadata}")
```
### Node.js
```typescript
import { LLMClassifierFromTemplate } from "autoevals";

(async () => {
  const promptTemplate = `You are a technical project manager who helps software engineers generate better titles for their GitHub issues.
You will look at the issue description, and pick which of two titles better describes it.

I'm going to provide you with the issue description, and two possible titles.

Issue Description: {{input}}

1: {{output}}
2: {{expected}}`;

  const choiceScores = { 1: 1, 2: 0 };

  const evaluator = LLMClassifierFromTemplate<{ input: string }>({
    name: "TitleQuality",
    promptTemplate,
    choiceScores,
    useCoT: true,
  });

  const input = `As suggested by Nicolo, we should standardize the error responses coming from GoTrue, postgres, and realtime (and any other/future APIs) so that it's better DX when writing a client,
We can make this change on the servers themselves, but since postgrest and gotrue are fully/partially external may be harder to change, it might be an option to transform the errors within the client libraries/supabase-js, could be messy?
Nicolo also dropped this as a reference: http://spec.openapis.org/oas/v3.0.3#openapi-specification`;
  const output = `Standardize error responses from GoTrue, Postgres, and Realtime APIs for better DX`;
  const expected = `Standardize Error Responses across APIs`;

  const response = await evaluator({ input, output, expected });

  console.log("Score", response.score);
  console.log("Metadata", response.metadata);
})();
```
## Creating custom scorers
You can also create your own scoring functions that do not use LLMs. For example, to test whether the word `'banana'`
is in the output, you can use the following:
### Python
```python
from autoevals import Score
def banana_scorer(output, expected, input):
    return Score(name="banana_scorer", score=1 if "banana" in output else 0)
input = "What is 1 banana + 2 bananas?"
output = "3"
expected = "3 bananas"
result = banana_scorer(output, expected, input)
print(f"Banana score: {result.score}")
```
### Node.js
```typescript
import { Score } from "autoevals";

const bananaScorer = ({
  output,
  expected,
  input,
}: {
  output: string;
  expected: string;
  input: string;
}): Score => {
  return { name: "banana_scorer", score: output.includes("banana") ? 1 : 0 };
};

(async () => {
  const input = "What is 1 banana + 2 bananas?";
  const output = "3";
  const expected = "3 bananas";

  const result = bananaScorer({ output, expected, input });
  console.log(`Banana score: ${result.score}`);
})();
```
## Why does this library exist?
There is nothing particularly novel about the evaluation methods in this library. They are all well known and well documented. However, a few things are genuinely difficult when evaluating in practice:
- Normalizing metrics between 0 and 1 is tough. For example, check out the calculation in [number.py](/py/autoevals/number.py) to see how it's done for numeric differences (an illustrative sketch follows this list).
- Parsing the outputs on model-graded evaluations is also challenging. There are frameworks that do this, but it's hard to
debug one output at a time, propagate errors, and tweak the prompts. Autoevals makes these tasks easy.
- Collecting metrics behind a uniform interface makes it easy to swap out evaluation methods and compare them. Prior to Autoevals, we couldn't find an open source library where you can simply pass in `input`, `output`, and `expected` values through a bunch of different evaluation methods.
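As an illustration of the first point, here is a hand-rolled sketch of one way to normalize a numeric difference into a [0,1] score. It is only a sketch of the general idea, not necessarily the exact formula used in `number.py`:
```python
from autoevals import Score


def numeric_diff_sketch(output: float, expected: float) -> Score:
    """Illustrative normalization of a numeric difference into [0,1].

    Not necessarily the formula used in autoevals' number.py; it just shows the
    general shape: identical values score 1, and the score decays toward 0 as
    the relative difference grows.
    """
    if output == expected:
        return Score(name="numeric_diff_sketch", score=1.0)
    denominator = max(abs(output), abs(expected))
    score = 1.0 - abs(output - expected) / denominator
    return Score(name="numeric_diff_sketch", score=max(score, 0.0))


print(numeric_diff_sketch(100, 104).score)  # close values score near 1
print(numeric_diff_sketch(1, 100).score)    # far-apart values score near 0
```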
## Documentation
The full docs are available [here](https://www.braintrust.dev/docs/reference/autoevals).