Evaluate Text to Speech providers and generate comparative leaderboards.
Prepare a CSV file with text samples to synthesize:
| id | text |
|---|---|
| row_1 | hello world |
| row_2 | this is a test |
| row_3 | how are you doing today |
The CSV should have two columns:
- id - Unique identifier for each text
- text - The text to synthesize into speech
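The sample CSV above can be generated with a few lines of standard-library Python; this is only a convenience sketch, and the filename `sample.csv` matches the path used in the examples below:

```python
import csv

# The three sample rows shown in the table above.
rows = [
    ("row_1", "hello world"),
    ("row_2", "this is a test"),
    ("row_3", "how are you doing today"),
]

with open("sample.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "text"])  # required header: id, text
    writer.writerows(rows)
```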
Learn more about metrics: see the detailed explanation of all metrics and how the LLM Judge works.
calibrate tts eval
Run Text to Speech evaluation against a specific provider.
calibrate tts eval -p <provider> -l <language> -i <input_file> -o <output_dir>
Arguments
| Flag | Long | Type | Required | Default | Description |
|---|---|---|---|---|---|
| -p | --provider | string | No | google | Provider: cartesia, openai, groq, google, elevenlabs, sarvam, smallest |
| -l | --language | string | No | english | Language: english, hindi, kannada, bengali, malayalam, marathi, odia, punjabi, tamil, telugu, gujarati, sindhi |
| -i | --input | string | Yes | - | Path to input CSV file |
| -o | --output-dir | string | No | ./out | Path to output directory |
| -d | --debug | flag | No | false | Run on first N texts only |
| -dc | --debug_count | int | No | 5 | Number of texts in debug mode |
| | --overwrite | flag | No | false | Overwrite existing results instead of resuming |
Examples
Basic evaluation:
calibrate tts eval -p google -l english -i ./sample.csv -o ./out
Evaluate with Cartesia:
calibrate tts eval -p cartesia -i ./sample.csv -o ./out
Debug mode (process only first 3 texts):
calibrate tts eval -p google -i ./sample.csv -o ./out -d -dc 3
Output Structure
/path/to/output/<provider>
├── audios/
│ ├── row_1.wav
│ ├── row_2.wav
│ └── row_3.wav
├── results.csv # Per-text results with TTFB and LLM judge scores
├── metrics.json # Aggregated metrics (TTFB mean/std, LLM judge score)
└── results.log # Terminal output summary
Output Files
results.csv contains per-text evaluation results:
| id | text | audio_path | ttfb | llm_judge_score | llm_judge_reasoning |
|---|---|---|---|---|---|
| row_1 | hello world | ./out/elevenlabs/audios/row_1.wav | 1.511 | True | The audio says ‘hello world’ clearly and matches the reference text exactly. |
| row_2 | this is a test | ./out/elevenlabs/audios/row_2.wav | 1.215 | True | The audio clearly says ‘this is a test,’ which matches exactly with the provided reference text. |
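The aggregates written to metrics.json can be re-derived from results.csv. A minimal sketch using only the standard library, assuming llm_judge_score is stored as the strings "True"/"False" as in the rows above:

```python
import csv
import statistics

def aggregate(results_csv):
    """Recompute TTFB mean/std and the LLM judge pass rate from results.csv."""
    with open(results_csv, newline="") as f:
        rows = list(csv.DictReader(f))
    ttfbs = [float(r["ttfb"]) for r in rows]
    passes = [r["llm_judge_score"] == "True" for r in rows]
    return {
        "ttfb_mean": statistics.mean(ttfbs),
        "ttfb_std": statistics.stdev(ttfbs) if len(ttfbs) > 1 else 0.0,
        "llm_judge_score": sum(passes) / len(passes),
    }
```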
calibrate tts leaderboard
Generate a comparative leaderboard from multiple evaluation runs.
calibrate tts leaderboard -o <output_dir> -s <save_dir>
Arguments
| Flag | Long | Type | Required | Description |
|---|---|---|---|---|
| -o | --output-dir | string | Yes | Directory containing evaluation run folders |
| -s | --save-dir | string | Yes | Directory to save leaderboard outputs |
Example
calibrate tts leaderboard -o ./out -s ./leaderboard
Output
The leaderboard command generates:
- tts_leaderboard.xlsx - Comparative spreadsheet of all providers
- Individual metric charts: llm_judge_score.png, ttfb.png
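Because each run writes its metrics.json under `<output_dir>/<provider>/` (per the output structure above), the per-provider inputs that feed a leaderboard can be gathered with a short script. This sketch only assumes that directory layout; it makes no assumptions about the field names inside metrics.json:

```python
import json
from pathlib import Path

def collect_metrics(output_dir):
    """Map each provider folder under output_dir to its parsed metrics.json."""
    metrics = {}
    for path in sorted(Path(output_dir).glob("*/metrics.json")):
        metrics[path.parent.name] = json.loads(path.read_text())
    return metrics
```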
Supported Providers
| Provider | Languages |
|---|---|
| google | English, Hindi, Kannada, Sindhi |
| openai | English |
| cartesia | English |
| elevenlabs | English, Sindhi |
| sarvam | Hindi, Kannada |
| groq | English |
| smallest | English, Hindi |
Sindhi TTS is supported by Google (uses the gemini-2.5-flash-lite-preview-tts model) and ElevenLabs (uses the eleven_v3 model).
Required Environment Variables
Set the appropriate API key (or credentials path) for your chosen provider:
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json
export OPENAI_API_KEY=your_key
export CARTESIA_API_KEY=your_key
export ELEVENLABS_API_KEY=your_key
export SARVAM_API_KEY=your_key
export SMALLEST_API_KEY=your_key
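A small pre-flight check can catch a missing credential before a run starts. This is a sketch, not part of the CLI; note that GROQ_API_KEY is an assumption (groq is a supported provider, but its variable is not shown in the list above, so the name simply follows the same pattern):

```python
import os

# Provider -> required environment variable, mirroring the list above.
# GROQ_API_KEY is assumed, not documented.
REQUIRED_KEYS = {
    "google": "GOOGLE_APPLICATION_CREDENTIALS",
    "openai": "OPENAI_API_KEY",
    "cartesia": "CARTESIA_API_KEY",
    "elevenlabs": "ELEVENLABS_API_KEY",
    "sarvam": "SARVAM_API_KEY",
    "smallest": "SMALLEST_API_KEY",
    "groq": "GROQ_API_KEY",
}

def missing_key(provider):
    """Return the name of the unset variable for the provider, or None."""
    var = REQUIRED_KEYS[provider]
    return None if os.environ.get(var) else var
```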