| Name | steel-thread |
| --- | --- |
| Version | 0.1.4a0 |
| home_page | None |
| Summary | Portia Labs Eval framework for evaluating agentic workflows. |
| upload_time | 2025-07-30 10:03:48 |
| maintainer | None |
| docs_url | None |
| author | None |
| requires_python | >=3.11 |
| license | TBD |
| keywords | llm, agentic, workflow |
| VCS | |
| bugtrack_url | |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |
# 🧵 SteelThread: Agent Evaluation Framework
**SteelThread** is a flexible evaluation framework built around Portia, designed to support robust **online** and **offline** testing of agentic workflows. It enables configurable datasets, custom metric definitions, LLM-based judging, and stubbed tool behaviors for reproducible and interpretable scoring.
---
## 🚀 Getting Started
### 1. **Install using your package manager of choice**
#### `pip`
```bash
pip install steel-thread
```
#### `poetry`
```bash
poetry add steel-thread
```
#### `uv`
```bash
uv add steel-thread
```
---
### 2. **Create your datasets**
**SteelThread** is designed around deep integration with Portia. It uses data from Portia Cloud to generate test cases and evals.
**SteelThread** offers two distinct types of evals:
- **Offline evals** are static datasets designed to be run multiple times to allow you to analyze how changes to your agents affect performance.
- **Online evals** are dynamic datasets that automatically include your latest plans and plan runs, allowing you to measure performance in production.
Both types of evals can be configured via the [cloud dashboard](https://app.portialabs.ai/dashboard/evals). Once you've created a dataset, record its name.
---
### 3. **Basic Usage**
Run a full suite of online and offline evaluations using the name of the dataset from step 2. This will use the built-in set of evaluators to give you data out of the box.
```python
from portia import Config, LogLevel, Portia
from steelthread.steelthread import SteelThread, OnlineEvalConfig, OfflineEvalConfig

# Setup
config = Config.from_default(default_log_level=LogLevel.CRITICAL)
runner = SteelThread()

# Online evals
runner.run_online(
    OnlineEvalConfig(data_set_name="online_evals", config=config)
)

# Offline evals
portia = Portia(config)
runner.run_offline(
    portia,
    OfflineEvalConfig(data_set_name="offline_evals_v1", config=config, iterations=4)
)
```
---
## 🛠️ Features

### 🧪 Custom Metrics
Define your own evaluators by subclassing `OfflineEvaluator`:
```python
from steelthread.offline_evaluators.evaluator import OfflineEvaluator
from steelthread.metrics.metric import Metric

class EmojiEvaluator(OfflineEvaluator):
    def eval_test_case(self, test_case, final_plan, final_plan_run, additional_data):
        # Count 😊 emojis in the final output and score up to a maximum of 1.0
        output = final_plan_run.outputs.final_output.get_value() or ""
        count = output.count("😊")
        score = min(count / 2, 1.0)
        return Metric(score=score, name="emoji_score", description="Checks for emoji use")
```
---
### 🧩 Tool Stubbing
Stub tool responses deterministically for fast and reproducible testing:
```python
from portia import DefaultToolRegistry, Portia
from steelthread.portia.tools import ToolStubRegistry

# Reuses `config` from the setup above
portia = Portia(
    config,
    tools=ToolStubRegistry(
        DefaultToolRegistry(config),
        stubs={
            "weather_tool": lambda i, ctx, args, kwargs: "20.0"  # Always returns 20.0
        }
    )
)
```
### 📊 Metric Reporting

**SteelThread** is designed around pluggable metrics backends. By default, metrics are logged and sent to Portia Cloud for visualization, but you can add additional backends via the config options.
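As a rough illustration of what a custom backend could look like — the `PrintBackend` class and its `save_metric` hook below are assumptions made for this sketch, not the documented SteelThread interface — you might wrap each recorded `Metric` and forward it wherever you need. How such a backend is registered depends on the config options mentioned above.

```python
from steelthread.metrics.metric import Metric


class PrintBackend:
    """Hypothetical backend that simply prints each metric as it is recorded.

    The class name and the save_metric hook are illustrative assumptions;
    check the SteelThread config options for the real extension point.
    """

    def save_metric(self, metric: Metric) -> None:
        print(f"{metric.name}: {metric.score} ({metric.description})")
```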
---
## 📁 Project Structure
```
steelthread/
├── metrics/                 # Metric schema & backend logging
│   └── metric.py
├── offline_evaluators/      # Offline test runners and evaluators
│   ├── eval_runner.py
│   ├── evaluator.py
│   └── test_case.py
├── online_evaluators/       # Online test runners
│   └── eval_runner.py
├── portia/                  # Tool stubbing and integration with Portia
│   └── tools.py
├── shared/                  # Shared storage and model definitions
│   └── readonly_storage.py
└── steelthread.py           # Main runner entry point
```
---
## 🧪 Example: End-to-End Test Script
See how everything fits together:
```python
from steelthread.steelthread import SteelThread, OfflineEvalConfig
from steelthread.portia.tools import ToolStubRegistry
from steelthread.metrics.metric import Metric
from steelthread.offline_evaluators.default_evaluator import DefaultOfflineEvaluator
from steelthread.offline_evaluators.evaluator import OfflineEvaluator
from portia import Config, Portia, DefaultToolRegistry, ToolRunContext

# Custom tool stub
def weather_stub_response(i, ctx, args, kwargs):
    return "33.28" if kwargs.get("city") == "sydney" else "2.00"

# Custom evaluator
class EmojiEvaluator(OfflineEvaluator):
    def eval_test_case(self, test_case, plan, plan_run, metadata):
        out = plan_run.outputs.final_output.get_value() or ""
        count = out.count("🌞")
        return Metric(score=min(count / 2, 1.0), name="emoji_score", description="Emoji usage")

# Setup
config = Config.from_default()
runner = SteelThread()
portia = Portia(
    config,
    tools=ToolStubRegistry(DefaultToolRegistry(config), {"weather_tool": weather_stub_response})
)

runner.run_offline(
    portia,
    OfflineEvalConfig(
        data_set_name="offline_evals_v1",
        config=config,
        iterations=1,
        evaluators=[DefaultOfflineEvaluator(config), EmojiEvaluator(config)],
    ),
)
```
---
## 🧪 Testing
Write tests for your metrics, plans, or evaluator logic using `pytest`:
```bash
uv run pytest tests/
```
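For example, a small `pytest` file might exercise the stubbed weather tool response from the end-to-end script above. The file name and test names here are illustrative; the stub function itself is copied from that example.

```python
# tests/test_weather_stub.py (illustrative file name)

def weather_stub_response(i, ctx, args, kwargs):
    """Copy of the stub from the end-to-end example above."""
    return "33.28" if kwargs.get("city") == "sydney" else "2.00"


def test_sydney_returns_stubbed_temperature():
    assert weather_stub_response(0, None, (), {"city": "sydney"}) == "33.28"


def test_other_cities_return_default_value():
    assert weather_stub_response(0, None, (), {"city": "london"}) == "2.00"
```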
---