Set up your agent
There are two ways to run LLM evaluations:
- Calibrate agent — define your agent inside the config file and evaluate across models.
- Agent connection — connect your existing agent via HTTP and run tests against it directly.
Calibrate agent
Define your agent’s system prompt and tools directly in the config file. Calibrate runs the evaluation using an LLM of your choice. Refer to this sample for a full template.
system_prompt
The system prompt that defines your agent’s behavior. This is the same prompt you use in production.
tools
A list of tools available to your agent. See the guide on Configuring Tools for how to set it up along with examples for different tool types.
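Pulling these together, a minimal calibrate-agent config might look roughly like the sketch below. The exact tool schema and the top-level test_cases key follow the linked sample and the Configuring Tools guide; treat the field shapes here as assumptions, and note the jsonc-style // comments are documentation only:

```jsonc
{
  // The same system prompt your agent uses in production
  "system_prompt": "You are a support agent for Acme. Be concise and always confirm order ids.",
  // Tool definitions: see Configuring Tools for the exact schema; this shape is assumed
  "tools": [
    {
      "name": "lookup_order",
      "description": "Look up an order by its id"
    }
  ],
  // Test cases are described in the Test cases section below
  "test_cases": []
}
```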
Agent connection
If you already have an agent running, you can connect calibrate directly to it instead of redefining your agent’s system prompt and tools. Refer to this sample for a full template.
Connect your agent
| Key | Required | Description |
|---|---|---|
| agent_url | Yes | The URL calibrate will make a POST request to |
| agent_headers | No | HTTP headers included in every request — typically for auth |
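For illustration, the connection portion of such a config might look like the sketch below. The URL and header values are placeholders, and the top-level test_cases key is assumed from the linked sample:

```jsonc
{
  "agent_url": "https://agent.example.com/chat",
  "agent_headers": {
    "Authorization": "Bearer <your-token>"
  },
  // Same test-case format as the calibrate-agent setup
  "test_cases": []
}
```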
Request format
Calibrate will make a POST request to your agent_url with the following body:
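The linked sample is authoritative; as a rough sketch, assuming the conversation is sent as a history array of role/content messages:

```jsonc
{
  "history": [
    { "role": "user", "content": "I want a refund for order 1234." }
  ],
  // The model field is added so your agent can route to the right model
  "model": "openai/gpt-4o-mini"
}
```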
The model field is added so your agent can route to the right model. Calibrate only forwards the model field; you must implement the model routing in your agent. The easiest way is to use an existing framework like OpenRouter and select among a wide range of models.
Response format
Your agent must return a JSON response with at least one of these keys:

| Key | Type | Description |
|---|---|---|
| response | string or null | The agent’s text reply |
| tool_calls | array | Tool calls made by the agent |
Each entry in tool_calls must have:
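As a sketch, assuming each tool call carries a name and an arguments object (mirroring the tool-call checks in the test cases below), a valid response could look like:

```jsonc
// Either key alone is also valid; this reply includes both
{
  "response": "I found order 1234 and have started the refund.",
  "tool_calls": [
    {
      "name": "lookup_order",
      "arguments": { "order_id": "1234" }
    }
  ]
}
```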
Verify your connection
Before running evaluations, you can verify that your agent endpoint is reachable and returns the expected format.
Test cases
An array of test cases to evaluate your agent on. Each test case has:

| Key | Type | Description |
|---|---|---|
| id | string | Optional unique test-case id. When provided, results include test_case_id. |
| history | array | Conversation history as context for the agent to generate the next output |
| evaluation | object | The criteria for evaluating the agent’s output — either a tool call or a response |
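As a rough illustration, a tool-call test case might look like the sketch below. The nesting of the tool-call check inside evaluation and the message shape in history are assumptions; the linked samples are authoritative:

```jsonc
{
  "id": "refund-tool-call",
  "history": [
    { "role": "user", "content": "I want a refund for order 1234." }
  ],
  // Expect a specific tool call; "arguments": null checks only that the tool is called
  "evaluation": {
    "tool_call": { "name": "lookup_order", "arguments": null }
  }
}
```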
Passing "arguments": null makes the evaluation simply check whether the tool is called, without checking the arguments. The history and evaluation keys work identically for both setup options.
Evaluators
Response test cases can be evaluated by one or more text LLM judges (routed through OpenRouter; set OPENROUTER_API_KEY in your environment).
Each evaluator’s system_prompt is sent to its own dedicated LLM judge call (one call per evaluator, run in parallel) with the conversation history and the agent’s last response as the inputs.
Defining custom evaluators
Define one or more evaluators at the top level and reference them by name from each test case. Each evaluator is an independent LLM call and produces its own column in the leaderboard:
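A hedged sketch using the keys documented in the table below; the values and file layout are illustrative, not the canonical example from the repo:

```jsonc
{
  "evaluators": [
    {
      "name": "tone",
      "type": "rating",
      "scale_min": 1,
      "scale_max": 5,
      "system_prompt": "Rate the tone of the agent's last reply from 1 (hostile) to 5 (excellent)."
    },
    {
      "name": "policy",
      "type": "binary",
      "judge_model": "openai/gpt-4o-mini",
      "system_prompt": "Mark True only if the reply follows the refund policy."
    }
  ],
  "test_cases": [
    {
      "history": [
        { "role": "user", "content": "I want a refund for order 1234." }
      ],
      // Reference evaluators by name via evaluation.criteria[].name
      "evaluation": {
        "criteria": [
          { "name": "tone" },
          { "name": "policy" }
        ]
      }
    }
  ]
}
```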
Using the default evaluator
The simplest setup needs no top-level evaluators. Pass criteria as a string and the implicit correctness evaluator runs. The string is substituted into the default evaluator’s system prompt as the {{criteria}} variable:
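For instance, a test case using the string shortcut might look like this sketch (assuming criteria sits under evaluation, as in the custom-evaluator case):

```jsonc
{
  "history": [
    { "role": "user", "content": "What is your refund policy?" }
  ],
  "evaluation": {
    // A plain string runs the implicit correctness evaluator
    "criteria": "Explains the 30-day refund policy without inventing exceptions."
  }
}
```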
correctness (default evaluator system prompt)
When you omit top-level evaluators and pass a string criteria, this is the implicit evaluator from DEFAULT_LLM_TEST_EVALUATOR that is used by default. Your criteria string is substituted for {{criteria}}.

| Key | Type | Description |
|---|---|---|
| evaluators | array | Top-level list of evaluators. Each one becomes its own LLM call per test case. |
| evaluators[].id | string | Optional unique id. Results echo it as evaluator_id; output config.json includes raw evaluators and an evaluators_map. |
| evaluators[].name | string | Unique evaluator name. Test cases reference it via evaluation.criteria[].name. |
| evaluators[].system_prompt | string | Full system prompt for this evaluator. Supports {{variable}} placeholders that are substituted from the test case’s arguments. |
| evaluators[].judge_model | string | OpenRouter model id for this evaluator (default: openai/gpt-5.4-mini). |
| evaluators[].type | string | "binary" (default) or "rating". |
| evaluators[].scale_min | integer | Required when type is "rating". Lowest allowed score. |
| evaluators[].scale_max | integer | Required when type is "rating". Highest allowed score. |
Variable substitution
A test case can pass arguments to an evaluator. Each {{variable}} placeholder in the evaluator’s system_prompt is replaced with the matching value before the judge call:
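A small sketch: the language evaluator and its placeholder name are made up for illustration, and the placement of arguments inside criteria entries is assumed:

```jsonc
{
  "evaluators": [
    {
      "name": "language",
      "type": "binary",
      // {{language}} is replaced with the test case's argument before the judge call
      "system_prompt": "Mark True only if the agent's last reply is written in {{language}}."
    }
  ],
  "test_cases": [
    {
      "history": [
        { "role": "user", "content": "¿Dónde está mi pedido?" }
      ],
      "evaluation": {
        "criteria": [
          { "name": "language", "arguments": { "language": "Spanish" } }
        ]
      }
    }
  ]
}
```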
A string criteria (shown in the quickstart above) is just a shortcut for the implicit correctness evaluator with { "arguments": { "criteria": "<your string>" } }.
Binary vs rating
Binary evaluators produce per-row pass/fail and a mean pass-rate in aggregates. Rating evaluators report mean/min/max scores per model on the leaderboard. At test-case pass/fail time, every referenced evaluator must pass: binary evaluators require match: true, and rating evaluators require the numeric score to equal scale_max (anything lower fails that evaluator). So on a 1–5 scale (scale_min 1, scale_max 5), only a judge score of 5 counts as pass — intermediate scores fail unless they hit the top of your declared scale.
If you want thresholds different from “top-of-scale” (for example “tone at least 4”), encode that as a binary evaluator whose system_prompt asks the judge explicitly (for example: “Mark True if and only if tone is at least 4 on a 1–5 scale”).
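For example, "tone at least 4" could be expressed as a binary evaluator along these lines (the name and prompt wording are illustrative):

```jsonc
{
  "name": "tone_at_least_4",
  "type": "binary",
  "system_prompt": "Rate the tone of the agent's last reply on a 1-5 scale. Mark True if and only if the tone is at least 4."
}
```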
Full examples
Calibrate agent example
Same structure as examples/llm/config-internal-agent.json, with documentation-only // comments before each test case (jsonc — strip comments if you paste into a strict JSON parser).
Agent connection example
Same structure as examples/llm/config-external-agent.json, with documentation-only // comments before each test case (jsonc — strip comments if you paste into a strict JSON parser).
Get started
Interactive mode
Run calibrate llm with no arguments to launch the interactive UI:
- Config file — path to your config file
- Provider — OpenRouter or OpenAI
- Model entry — enter model names one at a time (single or multiple)
- Confirm models — review the list and add more or proceed
- Output directory — where results will be saved (defaults to ./out)
- API keys — enter the API keys for the selected provider
Calibrate detects the agent_url in your config and switches to agent mode:
Single test:
- Config file — path to your config file (with agent_url and optionally agent_headers)
- Mode — select “Single test”
- Verify — calibrate verifies the connection; shows success or failure with the option to go back
- Output directory — where results will be saved (defaults to ./out)
- API keys — only OPENAI_API_KEY is required (used by the LLM judge)
Benchmark across models:
- Config file — path to your config file (with agent_url and optionally agent_headers)
- Mode — select “Benchmark across models”
- Model entry — enter a model name; calibrate verifies the connection with that model before asking if you want to add another
- Confirm models — review the list and add more or proceed
- Output directory — where results will be saved (defaults to ./out)
- API keys — only OPENAI_API_KEY is required (used by the LLM judge)
Non-interactive mode
Calibrate agent, single model:
Agent connection: point calibrate at a config that sets agent_url and optionally agent_headers:
Pass --skip-verify to skip the agent connection verification step. This is useful in CI/automation pipelines where you’ve already confirmed the agent is reachable.
Output
Once all the models have completed, calibrate displays a leaderboard with pass rates (% of tests passed) and bar charts for visualization.

Resources
Integrations
See the full list of supported providers and their configuration options