Documentation Index
Fetch the complete documentation index at: https://penseapp.vercel.app/docs/llms.txt
Use this file to discover all available pages before exploring further.
Get started
The interactive UI guides you through the full evaluation process:
- Language selection — pick from 10+ supported Indic languages
- Provider selection — choose providers (only those supporting your language are shown)
- Input directory — path to the directory containing your audio files and reference transcripts
| id | text |
|---|---|
| audio_1 | Hi |
| audio_2 | Madam, my name is Geeta Shankar |
All audio files should be in WAV format. The evaluation script expects files at audios/<id>.wav, where <id> matches the id column in your CSV (see the example layout after this list).
- Output directory — where results will be saved (defaults to ./out)
- API keys — enter the API keys for the selected providers
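For orientation, a typical input directory might be laid out as follows. The CSV filename below is illustrative (it is not specified above); the id/text columns and the audios/<id>.wav convention are the parts that matter:

```
input_dir/
├── transcripts.csv    # reference transcripts with id and text columns (filename illustrative)
└── audios/
    ├── audio_1.wav    # <id>.wav must match the id column in the CSV
    └── audio_2.wav
```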
Evaluator configuration
By default, a text LLM judge — routed through OpenRouter (set OPENROUTER_API_KEY in your environment) — evaluates whether each transcription semantically matches the reference text using the built-in semantic_match evaluator; expand Default evaluator: semantic_match below for the exact system_prompt from the codebase. You can customize the judge model and add multiple evaluators by passing an optional config file with --config:
system_prompt is sent as the system message to its own dedicated LLM judge call (one call per evaluator, run in parallel). The user message contains the source/transcription pair.
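As an illustration of that flow (not the repository's actual code), the sketch below issues one OpenRouter chat-completion call per evaluator for a given row and runs the calls concurrently. The helper names and message layout are assumptions; only the system/user split, the per-evaluator judge_model, and the OPENROUTER_API_KEY environment variable come from the docs above:

```python
import asyncio
import os

from openai import AsyncOpenAI  # OpenRouter exposes an OpenAI-compatible API

client = AsyncOpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

async def judge(evaluator: dict, source: str, transcription: str) -> str:
    """One dedicated LLM judge call for a single evaluator (hypothetical helper)."""
    resp = await client.chat.completions.create(
        model=evaluator["judge_model"],  # OpenRouter model id from the config
        messages=[
            {"role": "system", "content": evaluator["system_prompt"]},
            {"role": "user", "content": f"Source: {source}\nTranscription: {transcription}"},
        ],
    )
    return resp.choices[0].message.content

async def judge_row(evaluators: list[dict], source: str, transcription: str) -> list[str]:
    # One call per evaluator for this row, run in parallel.
    return await asyncio.gather(*(judge(e, source, transcription) for e in evaluators))
```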
The config file supports:
| Key | Type | Description |
|---|---|---|
| evaluators | array | List of evaluators. Each one becomes its own LLM call per row. |
| evaluators[].id | string | Optional unique id. Output config.json includes the raw evaluators list and an evaluators_map from id to name. |
| evaluators[].name | string | Unique evaluator name. Becomes the column name in the leaderboard. |
| evaluators[].system_prompt | string | Full system prompt used for this evaluator’s LLM judge call. |
| evaluators[].judge_model | string | OpenRouter model id for this evaluator (default: openai/gpt-5.4-mini). Use any model in the OpenRouter catalog. |
| Key | Type | Description |
|---|---|---|
| type | string | "binary" (default) or "rating" |
| scale_min | integer | Required when type is "rating". Lowest allowed score. |
| scale_max | integer | Required when type is "rating". Highest allowed score. |
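Putting these keys together, a config passed via --config might look roughly like the sketch below. The file format (YAML here), the prompts, and the nesting of the scoring keys under each evaluator are assumptions; check the repository for the exact schema:

```yaml
# Sketch of a config file; format, prompts, and scoring-key nesting are assumptions.
evaluators:
  - id: meaning
    name: semantic_match
    # Illustrative prompt, not the built-in one from calibrate/judges.py.
    system_prompt: >
      Decide whether the transcription preserves the meaning of the source text.
      Answer yes or no.
    judge_model: openai/gpt-5.4-mini   # default shown in the table above

  - id: fluency
    name: fluency_rating
    system_prompt: >
      Rate how fluent and natural the transcription sounds.
    judge_model: openai/gpt-5.4-mini
    type: rating        # "binary" (default) or "rating"
    scale_min: 1
    scale_max: 5
```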
The --config flag is optional. When omitted, a single built-in semantic_match evaluator scores semantic match between source and transcription.
semantic_match (default evaluator system prompt)
Matches DEFAULT_STT_EVALUATOR in calibrate/judges.py when no --config is passed.

Output
Once all providers have finished running, the tool displays a leaderboard of key metrics, along with bar charts for easier visualization.

Learn more about metrics
Detailed explanation of all metrics and why using an LLM Judge is necessary
Resources
Integrations
See the full list of supported providers and their configuration options