Evaluate Text to Speech providers and generate comparative leaderboards.

Input Data Format

Prepare a CSV file with text samples to synthesize:
id,text
row_1,hello world
row_2,this is a test
row_3,how are you doing today
The CSV should have two columns:
  • id - Unique identifier for each text
  • text - The text to synthesize into speech
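As a sketch, an input file in this format can be generated with Python's standard csv module (the filename sample.csv matches the examples later on this page):

```python
import csv

# Example rows matching the sample input above.
rows = [
    ("row_1", "hello world"),
    ("row_2", "this is a test"),
    ("row_3", "how are you doing today"),
]

with open("sample.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "text"])  # the two required columns
    writer.writerows(rows)
```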

Learn more about metrics: a detailed explanation of all metrics and how the LLM Judge works.

calibrate tts eval

Run Text to Speech evaluation against a specific provider.
calibrate tts eval -p <provider> -l <language> -i <input_file> -o <output_dir>

Arguments

Flag  Long           Type    Required  Default  Description
-p    --provider     string  No        google   Provider: cartesia, openai, groq, google, elevenlabs, sarvam, smallest
-l    --language     string  No        english  Language: english, hindi, kannada, bengali, malayalam, marathi, odia, punjabi, tamil, telugu, gujarati, sindhi
-i    --input        string  Yes      -         Path to input CSV file
-o    --output-dir   string  No        ./out    Path to output directory
-d    --debug        flag    No        false    Run on first N texts only
-dc   --debug_count  int     No        5        Number of texts in debug mode
      --overwrite    flag    No        false    Overwrite existing results instead of resuming

Examples

Basic evaluation:
calibrate tts eval -p google -l english -i ./sample.csv -o ./out
Evaluate with Cartesia:
calibrate tts eval -p cartesia -i ./sample.csv -o ./out
Debug mode (process only first 3 texts):
calibrate tts eval -p google -i ./sample.csv -o ./out -d -dc 3

Output Structure

/path/to/output/<provider>
├── audios/
│   ├── row_1.wav
│   ├── row_2.wav
│   └── row_3.wav
├── results.csv       # Per-text results with TTFB and LLM judge scores
├── metrics.json      # Aggregated metrics (TTFB mean/std, LLM judge score)
└── results.log       # Terminal output summary
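The aggregated metrics.json can be inspected programmatically. A minimal loader sketch (the exact key names inside the file are not specified here, so the function just returns whatever the run produced):

```python
import json
from pathlib import Path

def load_metrics(provider_dir):
    """Load the aggregated metrics.json written by `calibrate tts eval`
    into a plain dict, e.g. from ./out/google."""
    return json.loads((Path(provider_dir) / "metrics.json").read_text(encoding="utf-8"))
```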

Output Files

results.csv contains per-text evaluation results:
id     text            audio_path                          ttfb   llm_judge_score  llm_judge_reasoning
row_1  hello world     ./out/elevenlabs/audios/row_1.wav   1.511  True             The audio says ‘hello world’ clearly and matches the reference text exactly.
row_2  this is a test  ./out/elevenlabs/audios/row_2.wav   1.215  True             The audio clearly says ‘this is a test,’ which matches exactly with the provided reference text.
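The per-text results can be summarized with a few lines of Python. This is a sketch assuming the column names shown above, and assuming llm_judge_score is serialized as the strings True/False:

```python
import csv
from statistics import mean

def summarize_results(path):
    """Summarize a results.csv from `calibrate tts eval`:
    mean TTFB and the fraction of rows the LLM judge accepted."""
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    ttfbs = [float(r["ttfb"]) for r in rows]
    passed = [r["llm_judge_score"] == "True" for r in rows]
    return {"mean_ttfb": mean(ttfbs), "judge_pass_rate": sum(passed) / len(rows)}
```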

calibrate tts leaderboard

Generate a comparative leaderboard from multiple evaluation runs.
calibrate tts leaderboard -o <output_dir> -s <save_dir>

Arguments

Flag  Long          Type    Required  Description
-o    --output-dir  string  Yes       Directory containing evaluation run folders
-s    --save-dir    string  Yes       Directory to save leaderboard outputs

Example

calibrate tts leaderboard -o ./out -s ./leaderboard

Output

The leaderboard command generates:
  • tts_leaderboard.xlsx - Comparative spreadsheet of all providers
  • Individual metric charts: llm_judge_score.png, ttfb.png

Supported Providers

Provider    Languages
google      English, Hindi, Kannada, Sindhi
openai      English
cartesia    English
elevenlabs  English, Sindhi
sarvam      Hindi, Kannada
groq        English
smallest    English, Hindi
Sindhi TTS is supported by Google (uses gemini-2.5-flash-lite-preview-tts model) and ElevenLabs (uses eleven_v3 model).

Required Environment Variables

Set the appropriate API key for your chosen provider:
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json
export OPENAI_API_KEY=your_key
export CARTESIA_API_KEY=your_key
export ELEVENLABS_API_KEY=your_key
export GROQ_API_KEY=your_key
export SARVAM_API_KEY=your_key
export SMALLEST_API_KEY=your_key
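A simple preflight check can confirm the right variable is set before an evaluation run. The provider-to-variable mapping below is inferred from the export lines above (the GROQ key name is an assumption, since it is not listed):

```python
import os

# Provider -> required environment variable, inferred from the export
# lines above; GROQ_API_KEY is an assumed name.
REQUIRED_ENV = {
    "google": "GOOGLE_APPLICATION_CREDENTIALS",
    "openai": "OPENAI_API_KEY",
    "cartesia": "CARTESIA_API_KEY",
    "elevenlabs": "ELEVENLABS_API_KEY",
    "groq": "GROQ_API_KEY",
    "sarvam": "SARVAM_API_KEY",
    "smallest": "SMALLEST_API_KEY",
}

def check_provider_env(provider):
    """Return the name of the missing variable for `provider`,
    or None if it is set (or the provider is unknown)."""
    var = REQUIRED_ENV.get(provider)
    if var and not os.environ.get(var):
        return var
    return None
```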