Documentation Index
Fetch the complete documentation index at: https://penseapp.vercel.app/docs/llms.txt
Use this file to discover all available pages before exploring further.
Get started
The interactive UI guides you through the full evaluation process:
- Language selection — pick from 10+ supported Indic languages
- Provider selection — choose providers (only those supporting your language are shown)
- Input CSV — path to a CSV file with `id` and `text` columns
| id | text |
|---|---|
| row_1 | hello world |
| row_2 | this is a test |
| row_3 | how are you doing today |
- Output directory — where results will be saved (defaults to `./out`)
- API keys — enter the API keys for the selected providers
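The input file above is plain CSV with an `id` and a `text` header. As a quick sketch, it can be produced with the standard library; the filename `input.csv` is only an example:

```python
import csv

# Rows matching the required `id` and `text` columns.
rows = [
    ("row_1", "hello world"),
    ("row_2", "this is a test"),
    ("row_3", "how are you doing today"),
]

with open("input.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "text"])  # header must match the expected column names
    writer.writerows(rows)
```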
Evaluator configuration
By default, an audio LLM judge — routed through OpenRouter (set `OPENROUTER_API_KEY` in your environment) — evaluates whether the reference text is pronounced correctly in the synthesized audio, using the built-in pronunciation evaluator; expand "pronunciation (default evaluator system prompt)" below for the exact `system_prompt` from the codebase. You can customize the judge model and add multiple evaluators by passing an optional config file with `--config`:
Each evaluator's `system_prompt` is sent as the system message of its own dedicated audio LLM judge call (one call per evaluator, run in parallel). The user message contains the reference text and the audio sample.
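As a minimal sketch of how such a per-evaluator request could be assembled, assuming the OpenAI-compatible chat format that OpenRouter exposes for audio input (`build_judge_messages` is a hypothetical helper, not part of the codebase):

```python
import base64

def build_judge_messages(system_prompt, reference_text, audio_bytes, audio_format="wav"):
    """Build the message list for one evaluator's judge call: the evaluator's
    system_prompt as the system message, and a user message carrying the
    reference text plus the base64-encoded audio sample."""
    audio_b64 = base64.b64encode(audio_bytes).decode("ascii")
    return [
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": f"Reference text: {reference_text}"},
                {
                    "type": "input_audio",
                    "input_audio": {"data": audio_b64, "format": audio_format},
                },
            ],
        },
    ]
```

Each evaluator would produce one such message list per row, so the calls can be dispatched independently and in parallel.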
The config file supports:
| Key | Type | Description |
|---|---|---|
| `evaluators` | array | List of evaluators. Each one becomes its own audio LLM call per row. |
| `evaluators[].id` | string | Optional unique id. The output `config.json` includes the raw `evaluators` list and an `evaluators_map` from id to name. |
| `evaluators[].name` | string | Unique evaluator name. Becomes the column name in the leaderboard. |
| `evaluators[].system_prompt` | string | Full system prompt used for this evaluator's audio LLM judge call. |
| `evaluators[].judge_model` | string | OpenRouter model id for this evaluator (default: `openai/gpt-audio`). Must be an audio-capable model in the OpenRouter catalog — for example OpenAI's audio models or Google's Gemini audio-capable entries; the sample config uses `google/gemini-2.5-flash`. |
| Key | Type | Description |
|---|---|---|
| `type` | string | `"binary"` (default) or `"rating"` |
| `scale_min` | integer | Required when `type` is `"rating"`. Lowest allowed score. |
| `scale_max` | integer | Required when `type` is `"rating"`. Highest allowed score. |
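Putting the keys above together, a config file might look like the following sketch. JSON is assumed as the format, and the placement of the scoring keys (`type`, `scale_min`, `scale_max`) directly on each evaluator entry is an assumption; the prompts and ids are illustrative only:

```json
{
  "evaluators": [
    {
      "id": "pron",
      "name": "pronunciation",
      "system_prompt": "You are an audio judge. Decide whether the reference text is pronounced correctly in the audio.",
      "judge_model": "google/gemini-2.5-flash"
    },
    {
      "id": "nat",
      "name": "naturalness",
      "system_prompt": "Rate how natural the speech sounds.",
      "judge_model": "google/gemini-2.5-flash",
      "type": "rating",
      "scale_min": 1,
      "scale_max": 5
    }
  ]
}
```

With two evaluators configured, each row would trigger two judge calls and the leaderboard would gain a column per evaluator name.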
The `--config` flag is optional. When omitted, a single built-in pronunciation evaluator scores audio intelligibility. The TTS judge requires an audio-capable model.
pronunciation (default evaluator system prompt)
Matches `DEFAULT_TTS_EVALUATOR` in `calibrate/judges.py` when no `--config` is passed.
Output
Once all providers have completed, a leaderboard of the key metrics is displayed, along with bar charts for easier visualization.
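As a rough sketch of the aggregation behind such a leaderboard (the data and field names here are hypothetical, not the tool's actual output schema), per-row judge scores can be averaged per provider and evaluator:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-row judge scores: (provider, evaluator_name, score).
results = [
    ("provider_a", "pronunciation", 1),
    ("provider_a", "pronunciation", 0),
    ("provider_b", "pronunciation", 1),
    ("provider_b", "pronunciation", 1),
]

# Group scores by (provider, evaluator), then average each group.
scores = defaultdict(list)
for provider, evaluator, score in results:
    scores[(provider, evaluator)].append(score)

leaderboard = {key: mean(vals) for key, vals in scores.items()}
for (provider, evaluator), avg in sorted(leaderboard.items()):
    print(f"{provider:12s} {evaluator:15s} {avg:.2f}")
```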

Learn more about metrics
Detailed explanation of all metrics and how LLM Judge works
Resources
Integrations
See the full list of supported providers and their configuration options