

Get started

calibrate stt
The interactive UI guides you through the full evaluation process:
  1. Language selection — pick from 10+ supported Indic languages
  2. Provider selection — choose providers (only those supporting your language are shown)
  3. Input directory — path to the directory containing your audio files and reference transcripts
The input directory should have this structure:
/path/to/data/
├── stt.csv
└── audios/
    ├── audio_1.wav
    └── audio_2.wav
The stt.csv file contains the reference transcriptions:

| id | text |
| --- | --- |
| audio_1 | Hi |
| audio_2 | Madam, my name is Geeta Shankar |
All audio files should be in WAV format. The evaluation script expects files at audios/<id>.wav where <id> matches the id column in your CSV.
Refer to the sample dataset for a template; a quick validation sketch follows these steps.
  4. Output directory — where results will be saved (defaults to ./out)
  5. API keys — enter the API keys for the selected providers
The evaluation runs providers in parallel (max 2 at a time), showing the transcriptions as they are generated.
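
To catch layout mistakes before a run, you can cross-check stt.csv against the audios/ directory. A minimal sketch in Python, assuming only the stt.csv and audios/<id>.wav convention described above (this helper is illustrative and not part of calibrate):

import csv
from pathlib import Path

def check_dataset(root: str) -> list[str]:
    """Report ids present in stt.csv or audios/ but not in both."""
    root_path = Path(root)
    problems = []

    # Ids referenced by the reference transcripts.
    with open(root_path / "stt.csv", newline="", encoding="utf-8") as f:
        csv_ids = {row["id"] for row in csv.DictReader(f)}

    # Ids present on disk as audios/<id>.wav.
    wav_ids = {p.stem for p in (root_path / "audios").glob("*.wav")}

    for missing in sorted(csv_ids - wav_ids):
        problems.append(f"row '{missing}' in stt.csv has no audios/{missing}.wav")
    for orphan in sorted(wav_ids - csv_ids):
        problems.append(f"audios/{orphan}.wav has no matching row in stt.csv")
    return problems

if __name__ == "__main__":
    for problem in check_dataset("/path/to/data"):
        print(problem)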

Evaluator configuration

By default, a text LLM judge routed through OpenRouter (set OPENROUTER_API_KEY in your environment) evaluates whether each transcription semantically matches the reference text, using the built-in semantic_match evaluator; expand Default evaluator: semantic_match below for the exact system_prompt from the codebase. You can customize the judge model and add multiple evaluators by passing an optional config file with --config:
calibrate stt -p deepgram google -i ./data -o ./out --config config.json
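If you rely only on the default judge, the single extra setup step is the OpenRouter key, for example:

export OPENROUTER_API_KEY=your-openrouter-key
calibrate stt -p deepgram google -i ./data -o ./out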
Each evaluator’s system_prompt is sent as the system message to its own dedicated LLM judge call (one call per evaluator, run in parallel). The user message contains the source/transcription pair. The config file supports:
{
  "evaluators": [
    {
      "id": "semantic-match-id",
      "name": "semantic_match",
      "system_prompt": "You are a highly accurate evaluator. You will be given a source text and a transcription. Mark True if the values represented by both strings match semantically.",
      "judge_model": "openai/gpt-5.4-mini"
    },
    {
      "id": "completeness-id",
      "name": "completeness",
      "system_prompt": "You are a highly accurate evaluator. You will be given a source text and a transcription. Mark True if all information from the source text is present in the transcription.",
      "judge_model": "openai/gpt-5.4-mini"
    }
  ]
}
| Key | Type | Description |
| --- | --- | --- |
| evaluators | array | List of evaluators. Each one becomes its own LLM call per row. |
| evaluators[].id | string | Optional unique id. Output config.json includes the raw evaluators list and an evaluators_map from id to name. |
| evaluators[].name | string | Unique evaluator name. Becomes the column name in the leaderboard. |
| evaluators[].system_prompt | string | Full system prompt used for this evaluator's LLM judge call. |
| evaluators[].judge_model | string | OpenRouter model id for this evaluator (default: openai/gpt-5.4-mini). Use any model in the OpenRouter catalog. |
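For example, with the two evaluators from the config above, the evaluators_map written to the output config.json would pair each id with its name. A sketch of that fragment (the surrounding output shape is an assumption):

{
  "evaluators_map": {
    "semantic-match-id": "semantic_match",
    "completeness-id": "completeness"
  }
}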
Each evaluator also accepts:
| Key | Type | Description |
| --- | --- | --- |
| type | string | "binary" (default) or "rating" |
| scale_min | integer | Required when type is "rating". Lowest allowed score. |
| scale_max | integer | Required when type is "rating". Highest allowed score. |
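For instance, a rating-type evaluator entry could look like the following (the fluency name, prompt, and 1–5 scale are illustrative, not from the codebase):

{
  "evaluators": [
    {
      "id": "fluency-id",
      "name": "fluency",
      "type": "rating",
      "scale_min": 1,
      "scale_max": 5,
      "system_prompt": "You are a highly accurate evaluator. You will be given a source text and a transcription. Rate the fluency of the transcription on a scale of 1 to 5.",
      "judge_model": "openai/gpt-5.4-mini"
    }
  ]
}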
Binary evaluators produce per-row pass/fail and a mean pass-rate. Rating evaluators produce an integer score on your scale and a mean score in the leaderboard. When multiple evaluators are defined, each is scored independently — one LLM call per evaluator per row, all run in parallel — and appears as a separate column in the results and leaderboard. Refer to the sample config for a template.
The --config flag is optional. When omitted, a single built-in semantic_match evaluator scores semantic match between source and transcription.
Matches DEFAULT_STT_EVALUATOR in calibrate/judges.py when no --config is passed.
You are a highly accurate evaluator evaluating the transcription output of an STT model.

You will be given two strings - one is the source string used to produce an audio and the other is the transcription of that audio.

You need to evaluate if the two strings are the same.

# Important Instructions:
- Check whether the values represented by both the strings match. E.g. if one string says 1,2,3 but the other string says "one, two, three" or "one, 2, three", they should be considered the same as their underlying value is the same. However, if the actual values itself are different, e.g. for the name of a person or address or the value of any other key detail - that difference should be noted.
- Ignore differences like a word being split up into more than 1 word by spaces. Look at whether the values mean the same in both the strings.
- Minor differences in values of entities (e.g. proper nouns, numbers) matter and should be considered an error.
- If all the "values" for the strings match, mark it as True. Else, False.
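
To make the mechanics above concrete, here is a minimal Python sketch of how one judge call per evaluator could be issued against OpenRouter's OpenAI-compatible chat completions endpoint, fanning evaluators out in parallel. The message layout follows the description above (the evaluator's system_prompt as the system message, the source/transcription pair as the user message); the exact request construction and prompt formatting in calibrate/judges.py may differ.

import os
from concurrent.futures import ThreadPoolExecutor

import requests

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def judge(evaluator: dict, source: str, transcription: str) -> str:
    """Issue a single LLM judge call for one evaluator on one row."""
    response = requests.post(
        OPENROUTER_URL,
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={
            "model": evaluator.get("judge_model", "openai/gpt-5.4-mini"),
            "messages": [
                # The evaluator's system_prompt is the system message...
                {"role": "system", "content": evaluator["system_prompt"]},
                # ...and the user message carries the source/transcription pair
                # (this exact formatting is an assumption).
                {
                    "role": "user",
                    "content": f"Source: {source}\nTranscription: {transcription}",
                },
            ],
        },
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]

def judge_row(evaluators: list[dict], source: str, transcription: str) -> dict:
    """One call per evaluator per row, all run in parallel."""
    with ThreadPoolExecutor(max_workers=len(evaluators)) as pool:
        futures = {
            ev["name"]: pool.submit(judge, ev, source, transcription)
            for ev in evaluators
        }
        return {name: fut.result() for name, fut in futures.items()}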

Output

Once all providers have completed, the CLI displays a leaderboard of key metrics, along with bar charts for easier visualization.
STT leaderboard
You can also view the generated transcript and metrics for each row of your dataset, including the LLM judge score and reasoning.
STT provider outputs

Learn more about metrics

Detailed explanation of all metrics and why using an LLM Judge is necessary

Resources

Integrations

See the full list of supported providers and their configuration options