The system prompt that defines your agent’s behavior. This is the same prompt you use in production.
You are a helpful customer support assistant for an online store.
You help customers with:
1. Checking order status
2. Processing returns
3. Answering product questions
An array of test cases. These are the scenarios you want to evaluate your agent on. Each test case has:
| Key | Type | Description |
| --- | --- | --- |
| history | array | Conversation history as context for the agent to generate the next output |
| evaluation | object | The criteria for evaluating the agent’s output: either a tool call or a response |
Tool call test — verifies the LLM calls the correct tool with expected arguments:
{
  "history": [
    { "role": "assistant", "content": "Hello! How can I help you today?" },
    { "role": "user", "content": "I want to check my order status. My order ID is ORD-12345." }
  ],
  "evaluation": {
    "type": "tool_call",
    "tool_calls": [
      { "tool": "check_order", "arguments": { "order_id": "ORD-12345" } }
    ]
  }
}
Passing "arguments": null makes the test check only that the tool is
called, without checking its arguments.
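The matching rule for tool call tests can be sketched in Python. This is a minimal illustration, not the evaluator's actual implementation: the function name `tool_call_matches` is hypothetical, and comparing arguments by exact equality is an assumption (the real check may be more lenient, for example by ignoring extra arguments).

```python
from typing import Any, Optional


def tool_call_matches(expected: dict[str, Any], actual: dict[str, Any]) -> bool:
    """Hypothetical check of one actual tool call against one expected entry.

    Both dicts are assumed to use the "tool" and "arguments" keys shown in
    the tool call test above.
    """
    # The tool name must always match.
    if expected["tool"] != actual.get("tool"):
        return False

    # "arguments": null (None in Python) means: only check that the tool was
    # called. A missing "arguments" key is treated the same way here.
    expected_args: Optional[dict[str, Any]] = expected.get("arguments")
    if expected_args is None:
        return True

    # Assumption: arguments are compared by exact equality.
    return expected_args == actual.get("arguments")


expected = {"tool": "check_order", "arguments": {"order_id": "ORD-12345"}}
actual = {"tool": "check_order", "arguments": {"order_id": "ORD-12345"}}
print(tool_call_matches(expected, actual))
```

With `"arguments": None` in `expected`, the same call would match regardless of which order ID the agent passed.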
Response test — verifies the LLM’s response meets the criteria defined (evaluated by an LLM judge):
{
  "history": [
    { "role": "assistant", "content": "Hello! How can I help you today?" },
    { "role": "user", "content": "What's the weather like?" }
  ],
  "evaluation": {
    "type": "response",
    "criteria": "The assistant should politely redirect the conversation back to store-related topics."
  }
}
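Before running a test suite, it can help to check that each test case has the shape shown in the two examples above. The following is a rough sketch under stated assumptions: `validate_test_case` is a hypothetical helper, not part of any documented API, and it assumes `tool_call` and `response` are the only evaluation types, since those are the only two shown here.

```python
import json
from typing import Any


def validate_test_case(case: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the shape looks right."""
    problems: list[str] = []

    # "history" must be an array of messages with "role" and "content".
    history = case.get("history")
    if not isinstance(history, list):
        problems.append("history must be an array")
    else:
        for i, message in enumerate(history):
            if not isinstance(message, dict) or not {"role", "content"} <= message.keys():
                problems.append(f"history[{i}] needs 'role' and 'content'")

    # "evaluation" must be an object whose fields depend on its "type".
    evaluation = case.get("evaluation")
    if not isinstance(evaluation, dict):
        problems.append("evaluation must be an object")
    elif evaluation.get("type") == "tool_call":
        if not isinstance(evaluation.get("tool_calls"), list):
            problems.append("tool_call tests need a 'tool_calls' array")
    elif evaluation.get("type") == "response":
        if not isinstance(evaluation.get("criteria"), str):
            problems.append("response tests need a 'criteria' string")
    else:
        # Assumption: only the two types shown in the examples exist.
        problems.append("evaluation.type must be 'tool_call' or 'response'")

    return problems


case = json.loads("""
{
  "history": [
    { "role": "assistant", "content": "Hello! How can I help you today?" },
    { "role": "user", "content": "What's the weather like?" }
  ],
  "evaluation": {
    "type": "response",
    "criteria": "The assistant should politely redirect the conversation back to store-related topics."
  }
}
""")
print(validate_test_case(case))
```

Returning a list of problems rather than raising on the first one lets a whole suite be checked in a single pass.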