Run LLM tests against predefined test cases with expected outputs. Useful for validating tool calls and response patterns.

Learn more about metrics: see the detailed explanation of metrics and evaluation types.

calibrate llm tests run

Run LLM test cases from a configuration file.
calibrate llm tests run -c <config_file> -o <output_dir> -m <model> -p <provider>

Arguments

Flag  Long          Type    Required  Default              Description
-c    --config      string  No        examples/tests.json  Path to test configuration JSON file
-o    --output-dir  string  No        ./out                Path to output directory
-m    --model       string  No        gpt-4.1              Model name (e.g., gpt-4.1, openai/gpt-4.1)
-p    --provider    string  No        openrouter           Provider: openai or openrouter

Examples

Basic test run:
calibrate llm tests run -c ./config.json -o ./out
Run with specific model:
calibrate llm tests run -c ./config.json -o ./out -m openai/gpt-4.1 -p openrouter
Use OpenAI directly:
calibrate llm tests run -c ./config.json -o ./out -m gpt-4.1 -p openai

Configuration File Structure

{
  "system_prompt": "You are a helpful assistant filling out a form...",
  "tools": [
    {
      "type": "client",
      "name": "plan_next_question",
      "description": "Plan the next question to ask",
      "parameters": [
        {
          "id": "next_unanswered_question_index",
          "type": "integer",
          "description": "Index of next question",
          "required": true
        }
      ]
    }
  ],
  "test_cases": [
    {
      "history": [
        { "role": "assistant", "content": "Hello! What is your name?" },
        { "role": "user", "content": "Aman Dalmia" }
      ],
      "evaluation": {
        "type": "tool_call",
        "tool_calls": [
          {
            "tool": "plan_next_question",
            "arguments": {
              "next_unanswered_question_index": 2,
              "questions_answered": [1]
            }
          }
        ]
      }
    },
    {
      "history": [
        { "role": "assistant", "content": "What is your phone number?" },
        { "role": "user", "content": "Can I skip this question?" }
      ],
      "evaluation": {
        "type": "response",
        "criteria": "The assistant should allow the user to skip."
      }
    }
  ]
}

Output Structure

/path/to/output/<test_config_name>/<model_name>
├── results.json      # Detailed results for each test case
├── metrics.json      # Summary statistics (total, passed)
└── logs
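
As an illustration only, metrics.json holds the summary counts referenced above (total and passed); the exact schema written by the tool may include additional fields. A minimal sketch:

{
  "total": 2,
  "passed": 1
}

results.json contains the detailed per-test-case results that these counts summarize.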

calibrate llm tests leaderboard

Generate a leaderboard comparing multiple test runs.
calibrate llm tests leaderboard -o <output_dir> -s <save_dir>

Arguments

Flag  Long          Type    Required  Description
-o    --output-dir  string  Yes       Directory containing test run folders
-s    --save-dir    string  Yes       Directory to save leaderboard outputs

Example

calibrate llm tests leaderboard -o ./out -s ./leaderboard

Output

  • ttt_leaderboard.csv - Pass percentages by model and test config
  • ttt_leaderboard.png - Visual comparison chart

Provider Options

Provider    Model Format    Example
openai      OpenAI naming   gpt-4.1, gpt-4o
openrouter  provider/model  openai/gpt-4.1, anthropic/claude-3-opus

Required Environment Variables

# For OpenAI provider
export OPENAI_API_KEY=your_key

# For OpenRouter provider
export OPENROUTER_API_KEY=your_key
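
Putting it together, a typical end-to-end run with the OpenRouter provider (commands taken from the sections above; the key value is a placeholder):

# Set the provider key, run the test config, then build a leaderboard from the results
export OPENROUTER_API_KEY=your_key
calibrate llm tests run -c ./config.json -o ./out -m openai/gpt-4.1 -p openrouter
calibrate llm tests leaderboard -o ./out -s ./leaderboard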