Run fully automated, text-only conversations between two LLMs: the agent under test and a simulated user that follows a specific persona and acts out a specific scenario. Each run pairs every persona with every scenario, fans each pairing out into its own folder, and saves the transcript, evaluation results, and logs for inspection.

Running Simulations

import asyncio
from calibrate.llm import simulations

# Define your tools
tools = [
    {
        "type": "client",
        "name": "plan_next_question",
        "description": "Plan the next question to ask",
        "parameters": [
            {
                "id": "next_unanswered_question_index",
                "type": "integer",
                "description": "Index of next question",
                "required": True
            },
            {
                "id": "questions_answered",
                "type": "array",
                "description": "List of answered question indices",
                "items": {"type": "integer"},
                "required": True
            }
        ]
    }
]

# Define personas
personas = [
    {
        "characteristics": "A shy mother named Geeta, 39 years old, gives short answers",
        "gender": "female",
        "language": "english"
    }
]

# Define scenarios
scenarios = [
    {"description": "User completes the form without any issues"},
    {"description": "User hesitates and wants to skip some questions"}
]

# Define evaluation criteria
evaluation_criteria = [
    {
        "name": "question_completeness",
        "description": "Whether all the questions in the form were covered"
    },
    {
        "name": "assistant_behavior",
        "description": "Whether the assistant asks one concise question per turn"
    }
]

# Run simulations
result = asyncio.run(simulations.run(
    system_prompt="You are a helpful nurse filling out a form...",
    tools=tools,
    personas=personas,
    scenarios=scenarios,
    evaluation_criteria=evaluation_criteria,
    output_dir="./out",
    model="openai/gpt-4.1",
    provider="openrouter",
    parallel=1,
    agent_speaks_first=True,
    max_turns=50,
))
Function Parameters:
Parameter           | Type | Required | Default      | Description
system_prompt       | str  | Yes      | -            | System prompt for the bot/agent
tools               | list | Yes      | -            | List of tool definitions
personas            | list | Yes      | -            | List of persona dicts with 'characteristics', 'gender', 'language'
scenarios           | list | Yes      | -            | List of scenario dicts with 'description'
evaluation_criteria | list | Yes      | -            | List of criteria dicts with 'name' and 'description'
output_dir          | str  | No       | "./out"      | Output directory for results
model               | str  | No       | "gpt-4.1"    | Model name for both agent and user
provider            | str  | No       | "openrouter" | LLM provider: openai or openrouter
parallel            | int  | No       | 1            | Number of parallel simulations
agent_speaks_first  | bool | No       | True         | Whether agent initiates conversation
max_turns           | int  | No       | 50           | Maximum assistant turns
Provider Options:
  • openai: Use OpenAI’s API directly. Model names: gpt-4.1, gpt-4o
  • openrouter: Access multiple LLM providers. Model names: openai/gpt-4.1, anthropic/claude-3-opus
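
For example, the same run can target either provider by changing only the provider and model arguments. This is a minimal sketch that reuses the tools, personas, scenarios, and evaluation_criteria defined above:

import asyncio
from calibrate.llm import simulations

# Direct OpenAI access uses bare model names such as "gpt-4.1"
asyncio.run(simulations.run(
    system_prompt="You are a helpful nurse filling out a form...",
    tools=tools,
    personas=personas,
    scenarios=scenarios,
    evaluation_criteria=evaluation_criteria,
    output_dir="./out_gpt41_openai",
    model="gpt-4.1",
    provider="openai",
))

# OpenRouter uses provider-prefixed names such as "anthropic/claude-3-opus"
asyncio.run(simulations.run(
    system_prompt="You are a helpful nurse filling out a form...",
    tools=tools,
    personas=personas,
    scenarios=scenarios,
    evaluation_criteria=evaluation_criteria,
    output_dir="./out_claude_openrouter",
    model="anthropic/claude-3-opus",
    provider="openrouter",
))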

Metrics

Text simulations are evaluated against your defined criteria using an LLM Judge:
  • Evaluation Criteria Match: Each criterion is evaluated as True/False with reasoning
  • Aggregated Stats: Mean, standard deviation across all simulations

Learn more about metrics: a detailed explanation of LLM Judge and evaluation criteria.
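
As a rough illustration of the aggregation (not the library's internal code), each True/False judgement can be treated as 1.0/0.0 and summarised per criterion, which is the shape metrics.json reports:

from statistics import mean, pstdev

# Hypothetical judgements for one criterion across three simulations
matches = [True, True, False]

values = [1.0 if m else 0.0 for m in matches]
print({"mean": mean(values), "std": pstdev(values), "values": values})
# {'mean': 0.666..., 'std': 0.471..., 'values': [1.0, 1.0, 0.0]}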

Output Structure

/path/to/output
├── simulation_persona_1_scenario_1
│   ├── transcript.json
│   ├── evaluation_results.csv
│   ├── config.json
│   ├── logs
│   └── results.log
├── simulation_persona_1_scenario_2
│   └── ...
├── results.csv
└── metrics.json
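
Because every persona/scenario pair gets its own folder, the per-simulation artifacts can be inspected programmatically. A minimal sketch, assuming the layout above and the evaluation_results.csv columns documented below:

import csv
from pathlib import Path

output_dir = Path("./out")

for sim_dir in sorted(output_dir.glob("simulation_*")):
    with open(sim_dir / "evaluation_results.csv", newline="") as f:
        rows = list(csv.DictReader(f))
    # "match" is assumed to be serialized as the strings "True"/"False"
    failed = [row["name"] for row in rows if row["match"] != "True"]
    print(sim_dir.name, "failed criteria:", failed or "none")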

config.json

Contains the persona and scenario used for each simulation:
{
  "persona": {
    "characteristics": "description of user personality, background, and behavior",
    "gender": "female",
    "language": "english"
  },
  "scenario": {
    "description": "the scenario description for this simulation"
  }
}

evaluation_results.csv

Contains evaluation results for each criterion:
name                  | match | reasoning
question_completeness | True  | The assistant asked for the user’s full name, address, and telephone number…
assistant_behavior    | True  | The assistant asked one concise question per turn…

results.csv

Aggregates match scores across all simulations:
name                            | question_completeness | assistant_behavior
simulation_persona_1_scenario_1 | 1.0                   | 1.0
simulation_persona_1_scenario_2 | 1.0                   | 1.0

metrics.json

Provides summary statistics for each evaluation criterion:
{
  "question_completeness": {
    "mean": 1.0,
    "std": 0,
    "values": [1.0, 1.0, 1.0]
  },
  "assistant_behavior": {
    "mean": 1.0,
    "std": 0.0,
    "values": [1.0, 1.0, 1.0]
  }
}
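
A short sketch for pulling these summary statistics back into Python, assuming the metrics.json layout shown above:

import json

with open("./out/metrics.json") as f:
    metrics = json.load(f)

# Report the pass rate (mean of the 1.0/0.0 match values) per criterion
for criterion, stats in metrics.items():
    print(f"{criterion}: {stats['mean']:.0%} over {len(stats['values'])} simulations")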

Generating Leaderboard

After running simulations for multiple models, compile a leaderboard:
from calibrate.llm import simulations

simulations.leaderboard(
    output_dir="/path/to/output",
    save_dir="./leaderboard"
)
This generates:
  • llm_leaderboard.csv: CSV file with pass percentages by model
  • llm_leaderboard.png: Visual comparison chart
llm_leaderboard.csv format:
model           | test_config_name | overall
openai__gpt-4.1 | 80.0             | 80.0
openai__gpt-4o  | 100.0            | 100.0
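
The CSV can then be post-processed like any other table, for example to rank models by their overall pass percentage (an illustrative sketch assuming the column names above):

import csv

with open("./leaderboard/llm_leaderboard.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Sort models from highest to lowest overall pass percentage
for row in sorted(rows, key=lambda r: float(r["overall"]), reverse=True):
    print(row["model"], row["overall"])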
Function Parameters:
Parameter  | Type | Required | Default | Description
output_dir | str  | Yes      | -       | Directory containing simulation results
save_dir   | str  | Yes      | -       | Directory to save leaderboard files

Low-level API

For more control, run a single simulation directly:
import asyncio
from calibrate.llm import simulations

result = asyncio.run(simulations.run_simulation(
    bot_system_prompt="You are a helpful assistant...",
    tools=[...],
    user_system_prompt="You are simulating a user with persona...",
    evaluation_criteria=[
        {"name": "completeness", "description": "..."},
    ],
    bot_model="gpt-4.1",
    user_model="gpt-4.1",
    bot_provider="openrouter",
    user_provider="openrouter",
    agent_speaks_first=True,
    max_turns=50,
    output_dir="./out"
))
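
Because run_simulation runs a single conversation, it composes naturally when you need custom persona/scenario pairing or different models for the agent and the simulated user. The sketch below fans several calls out with asyncio.gather; the user_system_prompt construction is illustrative, not the library's own template, and it reuses the tools, personas, scenarios, and evaluation_criteria defined earlier:

import asyncio
from calibrate.llm import simulations

async def run_all(personas, scenarios):
    tasks = []
    for i, persona in enumerate(personas, start=1):
        for j, scenario in enumerate(scenarios, start=1):
            # Illustrative user prompt; simulations.run builds its own internally
            user_prompt = (
                f"You are simulating a user: {persona['characteristics']}. "
                f"Scenario: {scenario['description']}"
            )
            tasks.append(simulations.run_simulation(
                bot_system_prompt="You are a helpful nurse filling out a form...",
                tools=tools,
                user_system_prompt=user_prompt,
                evaluation_criteria=evaluation_criteria,
                bot_model="gpt-4.1",
                user_model="gpt-4.1",
                bot_provider="openrouter",
                user_provider="openrouter",
                agent_speaks_first=True,
                max_turns=50,
                output_dir=f"./out/simulation_persona_{i}_scenario_{j}",
            ))
    return await asyncio.gather(*tasks)

results = asyncio.run(run_all(personas, scenarios))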