llm-app-test


Name: llm-app-test
Version: 0.2.0b2
Summary: A behavioral testing library for LLM applications that allows developers to write natural language specifications for unit and integration tests. Validate LLM application behavior using plain English assertions in a simple assert(str, str) form factor.
Upload time: 2024-12-15 12:05:07
Requires Python: >=3.10
License: MIT
Keywords: testing, llm, behavioral, pytest
# llm_app_test

[![Discord](https://img.shields.io/discord/1307634517766443058?logo=discord&logoColor=white)](https://discord.gg/awy83bZsKf)
[![PyPI Version](https://img.shields.io/pypi/v/llm-app-test?label=pypi%20package)](https://pypi.org/project/llm-app-test/)
![PyPI Downloads](https://img.shields.io/pypi/dm/llm-app-test)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
![Python](https://img.shields.io/badge/python-3.10%20%7C%203.11%20%7C%203.12-blue)
[![Documentation](https://img.shields.io/badge/docs-GitHub%20Pages-blue)](https://Shredmetal.github.io/llmtest/)
[![codecov](https://codecov.io/github/Shredmetal/llmtest/graph/badge.svg?token=EVDZIPM2C0)](https://codecov.io/github/Shredmetal/llmtest)

> TL;DR: A Python library that lets you test LLM applications by describing expected behavior in plain English.

## Overview

A behavioral testing framework for applications that embed large language models (LLMs). It uses an LLM to validate application behavior against natural language test specifications (reliability validated through 30,000 test executions), providing a powerful tool for unit and integration testing of LLM-powered applications (not for testing LLMs themselves).

We made this because we were unsatisfied with existing approaches to testing our apps that someone else told us to stick a non-deterministic black box into:

1. String/regex matching and embeddings are too brittle - this is obvious for the former. Embeddings allow slightly more flexibility, but you still need to roughly guess what your LLM-powered app is going to say and set similarity thresholds, which requires an understanding of vector spaces.
2. Academic metrics are of little help to API consumers like us who have no ability to change the model. However, we still believe this tool is useful for the software engineering side of things. Please refer to the [Testing Philosophy](#testing-philosophy) section below on when to send things back to the data scientists.
3. We just wanted to define a behavior and assert on it.

⚠️ Note on Reliability: While we cannot guarantee 100% reliability (due to the fundamental nature of LLMs), we validated the library with 30,000 test executions with zero format violations and non-determinism only occurring in one case containing a genuine semantic boundary. 

We stress that past success doesn't guarantee future determinism - this is an unsolvable problem in LLM testing, but we've implemented extensive mitigations to make the library as reliable as possible. We will continue to validate reliability through brute-force testing and will report any issues we detect. For details, please refer to the reliability testing section of the documentation.

You can go straight to the [documentation](https://Shredmetal.github.io/llmtest/) if you wish. ⚠️ This is where you will find the most up-to-date information about the library.

## The Cool Stuff:

✨ Test your LLM apps in minutes, not hours

🚀 CI ready out of the box (Tested in GitHub Actions CI - Please let us know if it just works(tm) in other CI systems)

💰 Cost-effective testing solution

🔧 No infrastructure needed (unless you want to inject a custom LLM; refer to the configuration page of the documentation for details)

## Installation - Reading the rest of the README first is strongly recommended before use

```
pip install llm-app-test
```

Please refer to the [documentation](https://shredmetal.github.io/llmtest/getting-started/installation/) for full instructions.
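
Here is a minimal quick-start sketch of the assert(str, str) form factor described above. It assumes your LLM credentials are already configured as described in the documentation, and the hard-coded `actual_output` stands in for whatever your application actually returns:

```python
from llm_app_test.behavioral_assert.behavioral_assert import BehavioralAssertion


def test_greeting_behaviour():
    behavioral_assert = BehavioralAssertion()

    # In a real test, this string would come from your LLM-powered application
    actual_output = "Hello Alice, welcome aboard! Let me know if you need anything."

    # Plain-English specification of the behavior you expect
    expected_behavior = "A friendly greeting that addresses the user by name"

    # Raises BehavioralAssertionError if the output does not exhibit the behavior
    behavioral_assert.assert_behavioral_match(actual_output, expected_behavior)
```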

## Testing Philosophy

When integrating LLMs into your application, treat them as you would any closed-source third-party library:

1. Write tests for expected behavior
2. Focus on interface boundaries
3. Test application-level functionality
4. Maintain clear separation of concerns

### ⚠️ Important Information on Understanding Responsibilities

This library is built by software engineers to give software engineers a tool to validate the behavior of applications that have had an LLM stuffed in them. It is **NOT** a Data Science tool nor a replacement for model metrics used by Data Science teams to validate model suitability.

#### Software Engineer's Role

- Write tests for expected application behavior
- Validate inputs and outputs
- Ensure proper integration
- Monitor system performance
- Escalate consistent failures to the DS team (these might indicate a fundamental problem with the model, or you may need assistance with the `expected_behavior` prompt passed to `assert_behavioral_match`)

#### Data Science Team's Role

- Handle model-level issues
- Address consistent test failures
- Evaluate model suitability
- Optimise model performance
- Adjust prompts when needed

### When to Escalate

Escalate to your Data Science team when:

1. Tests consistently fail despite correct implementation
2. Model responses are consistently inappropriate
3. Performance degradation is observed
4. Pattern of failures indicates model-level issues

### 🔍 What Makes This Different?

This is an ENGINEERING tool, not a data science tool. The difference is crucial:

Data Science Tools:
- Test model performance
- Evaluate model accuracy
- Measure model metrics

llm_app_test (Engineering Tool):
- Tests your APPLICATION code
- Validates integration points
- Ensures system behavior
- Maintains production reliability

Think of it this way: You don't test Redis itself, you test your application's use of Redis. 
Similarly, llm_app_test helps you test your application's use of LLMs.

## Testing Hierarchy

llm-app-test is designed to complement existing approaches. We recommend this testing hierarchy:

1. **Behavioral Testing (llm-app-test)**
    - Fast, cost-effective first line of testing
    - Validates IF your LLM application is even working as intended
    - Tests core functionality and behavior
    - Must pass before proceeding to benchmarking
    - Failure indicates fundamental problems with the application

2. **Benchmarking and Performance Evaluation**
    - Much slower and more expensive
    - Only run AFTER behavioral tests pass
    - Measures HOW WELL the application performs (in our view, this blurs the lines into LLM evaluation but it should still be done, just not as the first line of defence against broken apps due to the time and cost required)
    - Tests performance metrics, response quality
    - Used for optimization and model selection

That said, we are planning to build a benchmarking system, targeted for 0.3.0b1, that will give you metrics on how well your system complies with its behavioral specifications.

### Visual representation of testing hierarchy:

![Testing Hierarchy](https://i.imgur.com/TFPJa9M.png "LLM Application Testing Flow")

## Example of Behavioral Testing:

### Real World Example

Here's a powerful example showing behavioral testing in action:

```python
from langchain_core.messages import SystemMessage, HumanMessage
from llm_app_test.behavioral_assert.behavioral_assert import BehavioralAssertion
from your_bot_module import SimpleApiCallBot  # Your LLM wrapper

def test_ww2_narrative():
    behavioral_assert = BehavioralAssertion()

    # Define the bot's behavior with a system message
    system_message = SystemMessage(
        """
        You are a historian bot and you will respond to specific requests 
        for information about history. Be detailed but do not go beyond 
        what was asked for.
        """
    )

    # Initialize the bot
    # This is a simple API call to OpenAI - you can find this in our tests/actual_usage_tests directory in the repo
    bot = SimpleApiCallBot(system_message=system_message) 
    
    # Create the user's request
    human_message = HumanMessage(
        content="Tell me about the European Theater of World War 2, the major battles, and how the European war ended"
    )

    # Get the bot's response
    actual_output = bot.generate_ai_response(human_message)

    # Define expected behavior
    expected_behavior = """
    A narrative about World War 2 and the global nature of the war
    """

    # This will fail because the bot's response focuses only on Europe
    behavioral_assert.assert_behavioral_match(actual_output, expected_behavior)
```

Note: Claude is a little too helpful and will say things about the other theaters, so when calling the generation bot with Claude, we used this prompt instead:

```
"Tell me about the European Theater of World War 2, the major battles, and how the European war ended. Only mention the European theater and do not mention the other theaters."
```

Actual bot response from one run (GPT-4o):

```
The European Theater of World War II was a significant front in the global conflict that lasted from 1939 to 1945. It involved most of the countries of Europe and was marked by numerous major battles and campaigns. Here is an overview of some of the key events and battles:

1. **Invasion of Poland (1939):** The war in Europe began with Germany's invasion of Poland on September 1, 1939. This prompted Britain and France to declare war on Germany. The swift German victory was achieved through the use of Blitzkrieg tactics.

2. **Battle of France (1940):** In May 1940, Germany launched an invasion of France and the Low Countries. The German forces bypassed the heavily fortified Maginot Line and quickly advanced through the Ardennes, leading to the fall of France in June 1940.

3. **Battle of Britain (1940):** Following the fall of France, Germany attempted to gain air superiority over Britain in preparation for an invasion. The Royal Air Force successfully defended the UK, marking the first major defeat for Hitler's military forces.

4. **Operation Barbarossa (1941):** On June 22, 1941, Germany launched a massive invasion of the Soviet Union. This campaign opened the Eastern Front, which became the largest and bloodiest theater of war in World War II.

5. **Battle of Stalingrad (1942-1943):** One of the deadliest battles in history, the Battle of Stalingrad was a turning point on the Eastern Front. The Soviet victory marked the beginning of a major offensive push against German forces.

6. **North African Campaign (1940-1943):** This series of battles involved the Allies and Axis powers fighting for control of North Africa. The decisive Allied victory at the Second Battle of El Alamein in 1942 marked the beginning of the end for Axis forces in Africa.

7. **Invasion of Italy (1943):** After the successful North African Campaign, the Allies invaded Sicily in July 1943 and then mainland Italy. This led to the fall of Mussolini's regime and Italy's eventual surrender, although fighting continued in Italy until 1945.

8. **D-Day and the Battle of Normandy (1944):** On June 6, 1944, Allied forces launched Operation Overlord, the largest amphibious invasion in history, landing on the beaches of Normandy, France. This marked the beginning of the liberation of Western Europe from Nazi occupation.

9. **Battle of the Bulge (1944-1945):** Germany's last major offensive on the Western Front took place in the Ardennes Forest. Despite initial successes, the Allies eventually repelled the German forces, leading to a rapid advance into Germany.

10. **Fall of Berlin (1945):** The final major offensive in Europe was the Soviet assault on Berlin in April 1945. The city fell on May 2, 1945, leading to the suicide of Adolf Hitler and the unconditional surrender of German forces.

The European war officially ended with Germany's unconditional surrender on May 7, 1945, which was ratified on May 8, known as Victory in Europe (VE) Day. This marked the end of World War II in Europe, although the war continued in the Pacific until Japan's surrender in September 1945.

```

Error message thrown by `assert_behavioral_match`:

```
E           llm_app_test.exceptions.test_exceptions.BehavioralAssertionError: Behavioral assertion failed: 
Behavioral Assertion Failed:  - Reason: The actual output focuses primarily on the European Theater of World War II, 
rather than providing a narrative about the global nature of the war.
```
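
For contrast, a specification aligned with what the bot was actually asked to produce would be expected to pass. This is only a sketch continuing the test above (reusing `behavioral_assert` and `actual_output`); as noted earlier, LLM-based assertions cannot be guaranteed to be deterministic:

```python
    # Aligning the specification with what the bot was actually asked to do
    expected_behavior = """
    A detailed account of the European Theater of World War 2, its major battles,
    and how the war in Europe ended
    """

    # Expected to pass against the response shown above
    behavioral_assert.assert_behavioral_match(actual_output, expected_behavior)
```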

### What This Example Demonstrates

1. Real Application Testing
    - Tests an actual LLM-based application

2. Behavioral Testing Power
    - The bot provides a detailed, accurate response
    - However, the test fails because it doesn't meet the expected behavior
    - Shows how behavioral testing catches incorrect behavior from your app

3. Clear Error Messages
    - The error clearly explains why the test failed
    - Points to specific behavioral mismatch
    - Helps developers understand what needs to change

## Documentation

Full documentation available at: [https://Shredmetal.github.io/llmtest/](https://Shredmetal.github.io/llmtest/)

- Installation Guide
- Quick Start Guide
- API Reference
- Best Practices
- CI/CD Integration
- Configuration Options

## License

MIT

## Reporting Issues
If you encounter issues:
1. Create an issue on our GitHub repository
2. Include your Python version and environment details
3. Describe the problem you encountered with version 0.2.0b2

## 🆘 Support
- Discord: [Join our community](https://discord.gg/awy83bZsKf)
- Issues: [GitHub Issues](https://github.com/Shredmetal/llmtest/issues)
- Documentation: [Full Docs](https://shredmetal.github.io/llmtest/)
- Email: morganj.lee01@gmail.com / Elnathan.erh@gmail.com

## Due to the number of downloads I am seeing on pypistats.org, I am including these instructions in case a beta update breaks something on your end:

### Emergency Rollback Instructions

If you experience issues with version 0.2.0b2, you can roll back to the previous version (0.2.0b1) using one of these methods:

#### Method 1: Direct Installation of Previous Version

```
pip uninstall llm-app-test 
pip install llm-app-test==0.2.0b1
```
#### Method 2: Force Reinstall (if Method 1 fails)

```
pip install --force-reinstall llm-app-test==0.2.0b1
```
#### Verification
After rolling back, verify the installation:
```python
import llm_app_test 
print(llm_app_test.version) # Should show 0.2.0b1
```
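
If the attribute above is not available in your environment, the installed version can also be checked via `importlib.metadata`, which works regardless of how the package exposes its version:

```python
# Alternative version check using the standard library
from importlib.metadata import version

print(version("llm-app-test"))  # Should show 0.2.0b1
```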

## ⚠️ Important Note About Rate Limits - If Running Large Numbers of Tests:

### Anthropic Rate Limits:

Tier 1:

| Model                        | Maximum Requests per minute (RPM) | Maximum Tokens per minute (TPM) | Maximum Tokens per day (TPD) |
|------------------------------|-----------------------------------|---------------------------------|------------------------------|
| Claude 3.5 Sonnet 2024-10-22 | 50                                | 40,000                          | 1,000,000                    |
| Claude 3.5 Sonnet 2024-06-20 | 50                                | 40,000                          | 1,000,000                    |
| Claude 3 Opus                | 50                                | 20,000                          | 1,000,000                    |


Tier 2:

| Model                        | Maximum Requests per minute (RPM) | Maximum Tokens per minute (TPM) | Maximum Tokens per day (TPD) |
|------------------------------|-----------------------------------|---------------------------------|------------------------------|
| Claude 3.5 Sonnet 2024-10-22 | 1,000                             | 80,000                          | 2,500,000                    |
| Claude 3.5 Sonnet 2024-06-20 | 1,000                             | 80,000                          | 2,500,000                    |
| Claude 3 Opus                | 1,000                             | 40,000                          | 2,500,000                    |

### OpenAI Rate Limits

Tier 1:

| Model                   | RPM | RPD    | TPM     | Batch Queue Limit |
|-------------------------|-----|--------|---------|-------------------|
| gpt-4o                  | 500 | -      | 30,000  | 90,000            |
| gpt-4o-mini             | 500 | 10,000 | 200,000 | 2,000,000         |
| gpt-4o-realtime-preview | 100 | 100    | 20,000  | -                 |
| gpt-4-turbo             | 500 | -      | 30,000  | 90,000            |


Tier 2:

| Model                   | RPM   | TPM       | Batch Queue Limit |
|-------------------------|-------|-----------|-------------------|
| gpt-4o                  | 5,000 | 450,000   | 1,350,000         |
| gpt-4o-mini             | 5,000 | 2,000,000 | 20,000,000        |
| gpt-4o-realtime-preview | 200   | 40,000    | -                 |
| gpt-4-turbo             | 5,000 | 450,000   | 1,350,000         |
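
If your suite is large enough to approach these limits, a simple pause between tests can keep you under the per-minute caps. This is only a sketch; the fixture name and delay are placeholders (not part of llm-app-test) and should be tuned to your tier:

```python
import time

import pytest

# Hypothetical throttle: pause after each test so a large suite stays under
# the requests-per-minute limits shown in the tables above.
SECONDS_BETWEEN_TESTS = 1.5


@pytest.fixture(autouse=True)
def throttle_llm_calls():
    yield
    time.sleep(SECONDS_BETWEEN_TESTS)
```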

            
