Run text-only automated conversations between two LLMs (agent + simulated user). Each simulation pairs every persona with every scenario.

Learn more about metrics: a detailed explanation of the LLM Judge and evaluation criteria is available in the metrics documentation.

calibrate llm simulations run

Run LLM simulations from a configuration file.
calibrate llm simulations run -c <config_file> -o <output_dir> -m <model> -p <provider> -n <parallel>

Arguments

| Flag | Long | Type | Required | Default | Description |
| --- | --- | --- | --- | --- | --- |
| `-c` | `--config` | string | Yes | - | Path to simulation configuration JSON file |
| `-o` | `--output-dir` | string | No | `./out` | Path to output directory |
| `-m` | `--model` | string | No | `gpt-4.1` | Model name for both agent and user |
| `-p` | `--provider` | string | No | `openrouter` | Provider: `openai` or `openrouter` |
| `-n` | `--parallel` | int | No | `1` | Number of simulations to run in parallel |

Examples

Basic simulation:
calibrate llm simulations run -c ./config.json -o ./out
Run with parallel simulations:
calibrate llm simulations run -c ./config.json -o ./out -n 4
Use specific model:
calibrate llm simulations run -c ./config.json -o ./out -m openai/gpt-4.1 -p openrouter

Configuration File Structure

{
  "system_prompt": "You are a helpful nurse filling out a form...",
  "tools": [
    {
      "type": "client",
      "name": "plan_next_question",
      "description": "Plan the next question",
      "parameters": [
        {
          "id": "next_unanswered_question_index",
          "type": "integer",
          "description": "Next question index",
          "required": true
        }
      ]
    }
  ],
  "personas": [
    {
      "characteristics": "A shy mother named Geeta, 39 years old, gives short answers",
      "gender": "female",
      "language": "english"
    },
    {
      "characteristics": "An elderly farmer who speaks slowly and asks for clarification",
      "gender": "male",
      "language": "hindi"
    }
  ],
  "scenarios": [
    { "description": "User completes the form without any issues" },
    { "description": "User hesitates and wants to skip some questions" }
  ],
  "evaluation_criteria": [
    {
      "name": "question_completeness",
      "description": "Whether all the questions in the form were covered"
    },
    {
      "name": "assistant_behavior",
      "description": "Whether the assistant asks one concise question per turn"
    }
  ],
  "settings": {
    "agent_speaks_first": true,
    "max_turns": 50
  }
}
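The configuration can also be assembled programmatically before invoking the CLI. A minimal sketch using only the fields shown in the example above (the CLI itself does not require Python; this simply writes the same JSON):

```python
import json

# Build a simulation config mirroring the structure documented above.
# Every persona is paired with every scenario, so 2 personas x 2
# scenarios produce 4 simulations.
config = {
    "system_prompt": "You are a helpful nurse filling out a form...",
    "tools": [
        {
            "type": "client",
            "name": "plan_next_question",
            "description": "Plan the next question",
            "parameters": [
                {
                    "id": "next_unanswered_question_index",
                    "type": "integer",
                    "description": "Next question index",
                    "required": True,
                }
            ],
        }
    ],
    "personas": [
        {
            "characteristics": "A shy mother named Geeta, 39 years old, gives short answers",
            "gender": "female",
            "language": "english",
        },
        {
            "characteristics": "An elderly farmer who speaks slowly and asks for clarification",
            "gender": "male",
            "language": "hindi",
        },
    ],
    "scenarios": [
        {"description": "User completes the form without any issues"},
        {"description": "User hesitates and wants to skip some questions"},
    ],
    "evaluation_criteria": [
        {
            "name": "question_completeness",
            "description": "Whether all the questions in the form were covered",
        },
        {
            "name": "assistant_behavior",
            "description": "Whether the assistant asks one concise question per turn",
        },
    ],
    "settings": {"agent_speaks_first": True, "max_turns": 50},
}

with open("config.json", "w") as f:
    json.dump(config, f, indent=2)

print(len(config["personas"]) * len(config["scenarios"]))  # -> 4 simulations
```

The written `config.json` can then be passed to `calibrate llm simulations run -c ./config.json`.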

Output Structure

/path/to/output/
├── simulation_persona_1_scenario_1/
│   ├── transcript.json          # Full conversation
│   ├── evaluation_results.csv   # Per-criterion results
│   ├── config.json              # Persona and scenario used
│   ├── results.log
│   └── logs
├── simulation_persona_1_scenario_2/
│   └── ...
├── results.csv                  # Aggregated results
└── metrics.json                 # Summary statistics
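Given the layout above, per-simulation transcripts can be collected for ad-hoc analysis with a short script. This is a sketch that assumes only the directory naming and file layout shown above; it is not part of the CLI:

```python
import json
from pathlib import Path


def load_transcripts(output_dir):
    """Collect each simulation's transcript, keyed by its folder name.

    Scans folders matching the simulation_persona_*_scenario_* naming
    convention shown in the output structure above.
    """
    transcripts = {}
    for sim_dir in sorted(Path(output_dir).glob("simulation_persona_*_scenario_*")):
        transcript_file = sim_dir / "transcript.json"
        if transcript_file.exists():
            transcripts[sim_dir.name] = json.loads(transcript_file.read_text())
    return transcripts
```

For example, `load_transcripts("./out")` returns a dict mapping each simulation folder name to its list of conversation turns.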

Output Files

transcript.json contains the full conversation:
[
  { "role": "assistant", "content": "Hello! What is your name?" },
  { "role": "user", "content": "My name is Geeta." },
  { "role": "assistant", "content": "Thank you, Geeta. What is your address?" }
]
evaluation_results.csv contains per-criterion results:
| name | match | reasoning |
| --- | --- | --- |
| question_completeness | True | All questions were asked and answered… |
| assistant_behavior | True | The assistant asked one question per turn… |
results.csv aggregates match scores across all simulations:
| name | question_completeness | assistant_behavior |
| --- | --- | --- |
| simulation_persona_1_scenario_1 | 1.0 | 1.0 |
| simulation_persona_1_scenario_2 | 1.0 | 0.0 |
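The aggregated `results.csv` makes it easy to compute a pass rate per criterion. A minimal sketch, assuming the column layout shown above (a `name` column plus one 0.0/1.0 column per evaluation criterion):

```python
import csv


def pass_rates(results_csv):
    """Average each criterion's match scores across all simulations."""
    with open(results_csv, newline="") as f:
        rows = list(csv.DictReader(f))
    # Every column except the simulation name is a criterion score.
    criteria = [col for col in rows[0] if col != "name"]
    return {c: sum(float(r[c]) for r in rows) / len(rows) for c in criteria}
```

For the example table above, `pass_rates("results.csv")` would report 1.0 for `question_completeness` and 0.5 for `assistant_behavior`.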

calibrate llm simulations leaderboard

Generate a leaderboard comparing multiple simulation runs.
calibrate llm simulations leaderboard -o <output_dir> -s <save_dir>

Arguments

| Flag | Long | Type | Required | Description |
| --- | --- | --- | --- | --- |
| `-o` | `--output-dir` | string | Yes | Directory containing simulation run folders |
| `-s` | `--save-dir` | string | Yes | Directory to save leaderboard outputs |

Example

calibrate llm simulations leaderboard -o ./out -s ./leaderboard

Provider Options

| Provider | Model Format | Example |
| --- | --- | --- |
| openai | OpenAI naming | `gpt-4.1`, `gpt-4o` |
| openrouter | Provider/model | `openai/gpt-4.1`, `anthropic/claude-3-opus` |

Required Environment Variables

# For OpenAI provider
export OPENAI_API_KEY=your_key

# For OpenRouter provider
export OPENROUTER_API_KEY=your_key