Evaluate LLM behavior through test cases.

Running Tests

Define your system prompt, tools, and test cases to evaluate LLM behavior:
import asyncio
from calibrate.llm import tests

# Define your tools
tools = [
    {
        "type": "client",
        "name": "plan_next_question",
        "description": "Plan the next question to ask",
        "parameters": [
            {
                "id": "next_unanswered_question_index",
                "type": "integer",
                "description": "Index of next question",
                "required": True
            },
            {
                "id": "questions_answered",
                "type": "array",
                "description": "List of answered question indices",
                "items": {"type": "integer"},
                "required": True
            }
        ]
    }
]

# Define your test cases
test_cases = [
    {
        "history": [
            {"role": "assistant", "content": "Hello! What is your name?"},
            {"role": "user", "content": "Aman Dalmia"}
        ],
        "evaluation": {
            "type": "tool_call",
            "tool_calls": [
                {
                    "tool": "plan_next_question",
                    "arguments": {
                        "next_unanswered_question_index": 2,
                        "questions_answered": [1]
                    }
                }
            ]
        },
        "settings": {"language": "english"}
    },
    {
        "history": [
            {"role": "assistant", "content": "What is your phone number?"},
            {"role": "user", "content": "Can I skip this question?"}
        ],
        "evaluation": {
            "type": "response",
            "criteria": "The assistant should allow the user to skip giving their phone number."
        }
    }
]

# Run tests
result = asyncio.run(tests.run(
    system_prompt="You are a helpful assistant filling out a form...",
    tools=tools,
    test_cases=test_cases,
    output_dir="./out",
    model="openai/gpt-4.1",
    provider="openrouter",
    run_name="my_test_run",
))
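To compare several models on the leaderboard later (see Generating Leaderboard below), run the same suite once per model into the same output directory. A minimal sketch reusing the tools and test cases defined above; the Claude identifier is just one of the OpenRouter model names listed under Provider Options:
import asyncio
from calibrate.llm import tests

# Run the identical suite against a second model so both sets of results
# land under the same output directory for the leaderboard step.
result_claude = asyncio.run(tests.run(
    system_prompt="You are a helpful assistant filling out a form...",
    tools=tools,
    test_cases=test_cases,
    output_dir="./out",
    model="anthropic/claude-3-opus",
    provider="openrouter",
    run_name="my_test_run",
))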
Function Parameters:

Parameter | Type | Required | Default | Description
system_prompt | str | Yes | - | System prompt for the LLM
tools | list | Yes | - | List of tool definitions
test_cases | list | Yes | - | List of test case dicts with 'history', 'evaluation', and optional 'settings'
output_dir | str | No | "./out" | Output directory for results
model | str | No | "gpt-4.1" | Model name
provider | str | No | "openrouter" | LLM provider: openai or openrouter
run_name | str | No | None | Optional name for output folder
Provider Options:
  • openai: Use OpenAI’s API directly. Model names: gpt-4.1, gpt-4o
  • openrouter: Access multiple LLM providers. Model names: openai/gpt-4.1, anthropic/claude-3-opus
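For example, to call OpenAI's API directly instead of going through OpenRouter, switch the provider and use the unprefixed model name. A minimal sketch reusing the tools and test cases defined above:
import asyncio
from calibrate.llm import tests

# Same test suite, sent straight to OpenAI; note the model name has no
# "openai/" prefix when provider="openai".
result = asyncio.run(tests.run(
    system_prompt="You are a helpful assistant filling out a form...",
    tools=tools,
    test_cases=test_cases,
    model="gpt-4.1",
    provider="openai",
))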

Metrics

LLM tests measure pass/fail for each test case:
  • Pass Rate: Percentage of test cases that match expected behavior
  • Evaluation types: Tool call matching or response criteria matching (via LLM Judge)

Learn more about metrics: a detailed explanation of metrics and evaluation types.

Output Structure

/path/to/output/<test_config_name>/<model_name>
├── results.json
├── metrics.json
└── logs

results.json

Contains detailed results for each test case:
[
  {
    "output": {
      "response": "Sure, I can help you with that.",
      "tool_calls": [
        {
          "tool": "plan_next_question",
          "arguments": {
            "next_unanswered_question_index": 2,
            "questions_answered": [1]
          }
        }
      ]
    },
    "metrics": {
      "passed": true
    },
    "test_case": {
      "history": [...],
      "evaluation": {...}
    }
  }
]
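To inspect failures programmatically, load results.json and filter on the per-case metrics. A minimal sketch; the path below is illustrative, since the actual <test_config_name> and <model_name> folder names depend on your run:
import json

# Adjust the path to your actual <test_config_name>/<model_name> folder.
with open("./out/my_test_run/openai__gpt-4.1/results.json") as f:
    results = json.load(f)

# Collect the test cases that did not match the expected behavior.
failed = [r for r in results if not r["metrics"]["passed"]]
for r in failed:
    print(r["test_case"]["evaluation"])
    print(r["output"])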

metrics.json

Contains summary statistics:
{
  "total": 5,
  "passed": 4
}
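The pass rate can be derived directly from these two fields. A minimal sketch (again, the path is illustrative):
import json

# Summary statistics written by the run.
with open("./out/my_test_run/openai__gpt-4.1/metrics.json") as f:
    metrics = json.load(f)

# Pass Rate: percentage of test cases that matched the expected behavior.
pass_rate = 100.0 * metrics["passed"] / metrics["total"]
print(f"{metrics['passed']}/{metrics['total']} passed ({pass_rate:.1f}%)")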

Generating Leaderboard

After running tests for multiple models, compile a leaderboard:
from calibrate.llm import tests

tests.leaderboard(
    output_dir="/path/to/output",
    save_dir="./leaderboard"
)
This generates:
  • llm_leaderboard.csv: CSV file with pass percentages by model
  • llm_leaderboard.png: Visual comparison chart
llm_leaderboard.csv format:

model | test_config_name | overall
openai__gpt-4.1 | 80.0 | 80.0
openai__gpt-4o | 100.0 | 100.0
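If you prefer the numbers in code rather than the PNG chart, the CSV can be read with the standard library. A minimal sketch, assuming the columns shown above:
import csv

# Print each model's overall pass percentage from the generated leaderboard.
with open("./leaderboard/llm_leaderboard.csv") as f:
    for row in csv.DictReader(f):
        print(row["model"], row["overall"])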
Function Parameters:

Parameter | Type | Required | Default | Description
output_dir | str | Yes | - | Directory containing test results
save_dir | str | Yes | - | Directory to save leaderboard files

Low-level APIs

For more control, run individual test cases:

Run a Single Test Case

import asyncio
from calibrate.llm import tests

result = asyncio.run(tests.run_test(
    chat_history=[
        {"role": "assistant", "content": "Hello! What is your name?"},
        {"role": "user", "content": "Aman Dalmia"}
    ],
    evaluation={
        "type": "tool_call",
        "tool_calls": [{"tool": "plan_next_question", "arguments": {...}}]
    },
    system_prompt="You are a helpful assistant...",
    model="gpt-4.1",
    provider="openrouter",
    tools=[...]
))
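The exact shape of the returned object is not documented here, but assuming it mirrors a single entry in results.json, the outcome could be checked like this (a sketch; the field names are assumptions):
# Field names below are assumptions based on the results.json entries above.
if result["metrics"]["passed"]:
    print("Test case passed")
else:
    print("Test case failed")
    print(result["output"])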

Run LLM Inference Without Evaluation

import asyncio
from calibrate.llm import tests

result = asyncio.run(tests.run_inference(
    chat_history=[...],
    system_prompt="You are a helpful assistant...",
    model="gpt-4.1",
    provider="openrouter",
    tools=[...]
))
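run_inference returns the raw model output without evaluating it, which is useful for writing your own checks. A sketch, assuming the result carries the same response and tool_calls fields as the "output" block in results.json:
# Field names are assumptions matching the "output" block shown earlier.
tool_calls = result.get("tool_calls") or []
if any(call["tool"] == "plan_next_question" for call in tool_calls):
    print("Model called plan_next_question")
else:
    print("Model responded with:", result.get("response"))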