Input Data Structure

Prepare an input CSV file with the following structure:
id,text
row_1,hello world
row_2,this is a test
The CSV should have two columns: id (unique identifier for each text) and text (the text strings you want to synthesize into speech).
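
If you are building the input programmatically, here is a minimal sketch using only the standard library (the file name input.csv is illustrative):

import csv

# Two required columns: a unique id per row and the text to synthesize.
rows = [
    {"id": "row_1", "text": "hello world"},
    {"id": "row_2", "text": "this is a test"},
]

with open("input.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "text"])
    writer.writeheader()
    writer.writerows(rows)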

Running Provider Evaluation

import asyncio
from calibrate.tts import eval

asyncio.run(eval(
    provider="google",      # cartesia, openai, groq, google, elevenlabs, sarvam, smallest
    language="english",     # english, hindi, kannada, bengali, malayalam, marathi, odia, punjabi, tamil, telugu, gujarati, sindhi
    input="/path/to/input.csv",
    output_dir="/path/to/output",
    debug=True,             # optional: run on first 5 texts only
    debug_count=5,          # optional: number of texts in debug mode
))
Function Parameters:
Parameter   | Type | Required | Default   | Description
input       | str  | Yes      | -         | Path to input CSV file containing texts to synthesize
provider    | str  | No       | "google"  | Text to Speech provider: cartesia, openai, groq, google, elevenlabs, sarvam, smallest
language    | str  | No       | "english" | Language: english, hindi, kannada, bengali, malayalam, marathi, odia, punjabi, tamil, telugu, gujarati, sindhi
output_dir  | str  | No       | "./out"   | Path to output directory for results
debug       | bool | No       | False     | Run on first N texts only
debug_count | int  | No       | 5         | Number of texts to use in debug mode
overwrite   | bool | No       | False     | Overwrite existing results instead of resuming from checkpoint
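
To benchmark several providers on the same texts (and feed the leaderboard described below), one option is to run eval once per provider against a shared output directory; each run then lands in its own <provider> subdirectory. This is a sketch based on the signature above, not a prescribed workflow, and the provider list is illustrative:

import asyncio
from calibrate.tts import eval

async def run_all():
    # Each run writes to /path/to/output/<provider>, so one output_dir
    # can hold all providers side by side.
    for provider in ["google", "openai", "elevenlabs"]:
        await eval(
            provider=provider,
            language="english",
            input="/path/to/input.csv",
            output_dir="/path/to/output",
        )

asyncio.run(run_all())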

Output Structure

/path/to/output/<provider>
├── audios
│   ├── row_1.wav
│   └── row_2.wav
├── results.log
├── results.csv
└── metrics.json
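
A quick sanity check on the generated audio, assuming the .wav files are uncompressed PCM (the standard-library wave module cannot read compressed formats):

import wave
from pathlib import Path

audio_dir = Path("/path/to/output/google/audios")
for wav_path in sorted(audio_dir.glob("*.wav")):
    with wave.open(str(wav_path), "rb") as w:
        # Duration in seconds = frame count / sample rate.
        duration = w.getnframes() / w.getframerate()
    print(f"{wav_path.name}: {duration:.2f}s")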

results.csv

Contains detailed results for each text:
id    | text           | audio_path                        | ttfb  | llm_judge_score | llm_judge_reasoning
row_1 | hello world    | ./out/elevenlabs/audios/row_1.wav | 1.511 | True            | The audio says 'hello world' clearly and matches the reference text exactly.
row_2 | this is a test | ./out/elevenlabs/audios/row_2.wav | 1.215 | True            | The audio clearly says 'this is a test,' which matches exactly with the provided reference text.
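
The per-row results load directly into pandas for filtering (column names as in the table above; the comparison below is written to work whether pandas parses the judge column as booleans or strings):

import pandas as pd

results = pd.read_csv("/path/to/output/google/results.csv")

# Slowest syntheses by time to first byte.
print(results.nlargest(3, "ttfb")[["id", "ttfb"]])

# Rows the LLM judge flagged as not matching the reference text.
failed = results[results["llm_judge_score"].astype(str) != "True"]
print(failed[["id", "llm_judge_reasoning"]])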

metrics.json

Contains aggregated metrics:
{
  "llm_judge_score": 1.0,
  "ttfb": {
    "mean": 1.362950086593628,
    "std": 0.1476140022277832,
    "values": [1.5105640888214111, 1.2153360843658447]
  }
}
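
The ttfb aggregates follow from the per-row values; in the sample above, mean and std match numpy's defaults (population std, ddof=0), and llm_judge_score appears to be the fraction of rows the judge marked True. Treat those as observations from the sample output, not guarantees. A sketch that re-derives the latency aggregates from metrics.json:

import json
import numpy as np

with open("/path/to/output/google/metrics.json") as f:
    metrics = json.load(f)

values = np.array(metrics["ttfb"]["values"])
# np.std defaults to ddof=0 (population std), which matches the sample output.
print("mean agrees:", np.isclose(values.mean(), metrics["ttfb"]["mean"]))
print("std agrees: ", np.isclose(values.std(), metrics["ttfb"]["std"]))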

Metrics

Text to Speech evaluation measures both quality and latency.

Quality metrics:
  • LLM Judge Score: Semantic evaluation of pronunciation accuracy
Latency metrics:
  • TTFB (Time to First Byte): Time until the first audio chunk is received (see the sketch below)
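
Conceptually, TTFB for a streaming TTS API is the wall-clock time from issuing the request until the first audio chunk arrives. A generic sketch of that measurement; the audio stream here stands in for any provider's async streaming response and is not part of calibrate's API:

import time
from typing import AsyncIterator

async def measure_ttfb(stream: AsyncIterator[bytes]) -> float:
    # Start the clock, then wait for the first chunk of audio bytes.
    start = time.perf_counter()
    async for _chunk in stream:
        return time.perf_counter() - start
    raise RuntimeError("stream ended without producing audio")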

Learn more about metrics

For a detailed explanation of all metrics and how the LLM Judge works, see the metrics documentation.

Provider Leaderboard

After running multiple provider evaluations, generate a combined leaderboard:
from calibrate.tts import leaderboard

leaderboard(
    output_dir="/path/to/output",
    save_dir="./leaderboards"
)
This scans each run directory, reads metrics.json and results.csv, then generates:
  • tts_leaderboard.xlsx: Excel file with all metrics by provider
  • Individual metric charts: llm_judge_score.png, ttfb.png
Function Parameters:
Parameter  | Type | Required | Default | Description
output_dir | str  | Yes      | -       | Directory containing provider evaluation results
save_dir   | str  | Yes      | -       | Directory to save leaderboard files
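
For reference, the scan the leaderboard performs can be approximated by hand: collect each provider's metrics.json under the shared output directory into a single table. This is a sketch; the columns in the generated tts_leaderboard.xlsx may differ:

import json
from pathlib import Path

import pandas as pd

rows = []
for metrics_path in Path("/path/to/output").glob("*/metrics.json"):
    metrics = json.loads(metrics_path.read_text())
    rows.append({
        "provider": metrics_path.parent.name,  # run directory name
        "llm_judge_score": metrics["llm_judge_score"],
        "ttfb_mean": metrics["ttfb"]["mean"],
    })

print(pd.DataFrame(rows).sort_values("ttfb_mean"))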