

Get started

calibrate tts
The interactive UI guides you through the full evaluation process:
  1. Language selection — pick from 10+ supported Indic languages
  2. Provider selection — choose providers (only those supporting your language are shown)
  3. Input CSV — path to CSV file with id and text columns
The input CSV should have this format:
id,text
row_1,hello world
row_2,this is a test
row_3,how are you doing today
Refer to this sample for a template.
  4. Output directory — where results will be saved (defaults to ./out)
  5. API keys — enter the API keys for the selected providers
The evaluation runs providers in parallel (max 2 at a time), showing progress as audio files are generated.
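If you want to generate the input file programmatically, a few lines of the Python standard library produce the expected shape (the filename sample.csv here is just an example, not a required name):

```python
import csv

# Write a minimal input CSV with the required "id" and "text" columns.
rows = [
    {"id": "row_1", "text": "hello world"},
    {"id": "row_2", "text": "this is a test"},
    {"id": "row_3", "text": "how are you doing today"},
]

with open("sample.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "text"])
    writer.writeheader()
    writer.writerows(rows)

# Read it back to confirm the header and row count.
with open("sample.csv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    assert reader.fieldnames == ["id", "text"]
    assert sum(1 for _ in reader) == 3
```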

Evaluator configuration

By default, a single built-in pronunciation evaluator is used: an audio LLM judge, routed through OpenRouter (set OPENROUTER_API_KEY in your environment), evaluates whether the reference text is pronounced correctly in the synthesized audio. The exact system_prompt from the codebase is listed further down this page. You can customize the judge model and add multiple evaluators by passing an optional config file with --config:
calibrate tts -p openai google -i sample.csv -o ./out --config config.json
Each evaluator’s system_prompt is sent as the system message to its own dedicated audio LLM judge call (one call per evaluator, run in parallel). The user message contains the reference text and the audio sample. The config file supports:
{
  "evaluators": [
    {
      "id": "intelligibility-id",
      "name": "intelligibility",
      "system_prompt": "You are a highly accurate evaluator. You will be given an audio sample and the reference text it is supposed to speak. Mark True if the spoken text is clearly understandable from the audio.",
      "judge_model": "openai/gpt-audio"
    },
    {
      "id": "pronunciation-id",
      "name": "pronunciation",
      "system_prompt": "You are a highly accurate evaluator. You will be given an audio sample and the reference text it is supposed to speak. Mark True only if all words are pronounced correctly with natural-sounding speech.",
      "judge_model": "openai/gpt-audio"
    }
  ]
}
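The per-evaluator fan-out described above (one dedicated judge call per evaluator, run in parallel) can be sketched with asyncio. Note this is an illustrative sketch, not the tool's actual implementation: call_judge is a hypothetical stand-in for the real OpenRouter request.

```python
import asyncio

EVALUATORS = [
    {"name": "intelligibility", "judge_model": "openai/gpt-audio"},
    {"name": "pronunciation", "judge_model": "openai/gpt-audio"},
]

async def call_judge(evaluator: dict, text: str, audio_path: str) -> bool:
    # Hypothetical stand-in: a real implementation would send the evaluator's
    # system_prompt as the system message, with the reference text and the
    # audio sample in the user message, to the evaluator's judge_model.
    await asyncio.sleep(0)  # stands in for network latency
    return True

async def judge_row(text: str, audio_path: str) -> dict:
    # One judge call per evaluator, all awaited concurrently.
    verdicts = await asyncio.gather(
        *(call_judge(ev, text, audio_path) for ev in EVALUATORS)
    )
    return {ev["name"]: v for ev, v in zip(EVALUATORS, verdicts)}

scores = asyncio.run(judge_row("hello world", "out/row_1.wav"))
print(scores)  # one verdict per evaluator name
```

Each evaluator's verdict ends up keyed by its name, which is why names must be unique: they become the result columns.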
| Key | Type | Description |
| --- | --- | --- |
| evaluators | array | List of evaluators. Each one becomes its own audio LLM call per row. |
| evaluators[].id | string | Optional unique id. Output config.json includes the raw evaluators list and an evaluators_map from id to name. |
| evaluators[].name | string | Unique evaluator name. Becomes the column name in the leaderboard. |
| evaluators[].system_prompt | string | Full system prompt used for this evaluator's audio LLM judge call. |
| evaluators[].judge_model | string | OpenRouter model id for this evaluator (default: openai/gpt-audio). Must be an audio-capable model in the OpenRouter catalog, for example OpenAI's audio models or Google's Gemini audio-capable entries; the sample config uses google/gemini-2.5-flash. |
Each evaluator also accepts:
| Key | Type | Description |
| --- | --- | --- |
| type | string | "binary" (default) or "rating" |
| scale_min | integer | Required when type is "rating". Lowest allowed score. |
| scale_max | integer | Required when type is "rating". Highest allowed score. |
Binary evaluators produce per-row pass/fail and a mean pass-rate. Rating evaluators produce an integer score on your scale and a mean score in the leaderboard. When multiple evaluators are defined, each is scored independently — one audio LLM call per evaluator per row, all run in parallel — and appears as a separate column in the results and leaderboard. Refer to the sample config for a template.
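The aggregation described above can be sketched in a few lines, assuming per-row verdicts have already been collected (booleans for binary evaluators, integers for rating ones); the function name and the "naturalness" evaluator are illustrative, not part of the tool:

```python
def leaderboard_column(evaluator: dict, verdicts: list) -> float:
    """Reduce one evaluator's per-row verdicts to its leaderboard value."""
    if evaluator.get("type", "binary") == "binary":
        # Binary evaluators: mean pass-rate over pass/fail verdicts.
        return sum(1 for v in verdicts if v) / len(verdicts)
    # Rating evaluators: scores must sit on the configured scale, then average.
    lo, hi = evaluator["scale_min"], evaluator["scale_max"]
    assert all(lo <= v <= hi for v in verdicts), "score outside configured scale"
    return sum(verdicts) / len(verdicts)

# Binary: 2 of 3 rows passed.
print(leaderboard_column({"name": "pronunciation"}, [True, True, False]))

# Rating: mean score on a 1-5 scale.
print(leaderboard_column(
    {"name": "naturalness", "type": "rating", "scale_min": 1, "scale_max": 5},
    [4, 5, 3],
))
```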
The --config flag is optional. When omitted, a single built-in pronunciation evaluator scores audio intelligibility; the TTS judge still requires an audio-capable model. The default prompt below matches DEFAULT_TTS_EVALUATOR in calibrate/judges.py when no --config is passed:
You are a highly accurate evaluator evaluating the audio output of a TTS model.

You will be given the audio and the text that should have been spoken in the audio.

You need to evaluate if the text is easily understandable from the audio. Check whether the spoken words match the reference text and the audio is clear enough to convey the intended message.

Output

Once all the providers have completed, the CLI displays a leaderboard of key metrics along with bar charts for easier visualization.
TTS leaderboard
You can also view the generated audio and metrics for each row of your dataset including the LLM judge score and reasoning. Use the arrow keys to navigate rows and press Enter or p to play the generated audio.
TTS provider outputs

Learn more about metrics

Detailed explanation of all metrics and how the LLM judge works

Resources

Integrations

See the full list of supported providers and their configuration options