Input Data Structure

Prepare an input CSV file with the following structure:
id,text
row_1,hello world
row_2,this is a test
The CSV should have two columns: id (unique identifier for each text) and text (the text strings you want to synthesize into speech).
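
If you are building the input programmatically, here is a minimal sketch using only the standard library (the file name input.csv is illustrative):

import csv

# Two required columns: a unique id per row and the text to synthesize.
rows = [
    {"id": "row_1", "text": "hello world"},
    {"id": "row_2", "text": "this is a test"},
]

with open("input.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "text"])
    writer.writeheader()
    writer.writerows(rows)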

Running Provider Evaluation

import asyncio
from calibrate.tts import eval

asyncio.run(eval(
    provider="google",      # cartesia, openai, groq, google, elevenlabs, sarvam, smallest
    language="english",     # english, hindi, kannada, bengali, malayalam, marathi, odia, punjabi, tamil, telugu, gujarati, sindhi
    input="/path/to/input.csv",
    output_dir="/path/to/output",
    debug=True,             # optional: run on first 5 texts only
    debug_count=5,          # optional: number of texts in debug mode
))
Function Parameters:
Parameter   | Type | Required | Default   | Description
input       | str  | Yes      | -         | Path to input CSV file containing texts to synthesize
provider    | str  | No       | "google"  | Text to Speech provider: cartesia, openai, groq, google, elevenlabs, sarvam, smallest
language    | str  | No       | "english" | Language: english, hindi, kannada, bengali, malayalam, marathi, odia, punjabi, tamil, telugu, gujarati, sindhi
output_dir  | str  | No       | "./out"   | Path to output directory for results
debug       | bool | No       | False     | Run on first N texts only
debug_count | int  | No       | 5         | Number of texts to use in debug mode
overwrite   | bool | No       | False     | Overwrite existing results instead of resuming from checkpoint
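
To benchmark several providers on the same texts (and feed the leaderboard described below), one option is to run eval once per provider against a shared output directory; each run then lands in its own <provider> subdirectory. This is a sketch based on the signature above, not a prescribed workflow, and the provider list is illustrative:

import asyncio
from calibrate.tts import eval

async def run_all():
    # Each run writes to /path/to/output/<provider>, so one output_dir
    # can hold all providers side by side.
    for provider in ["google", "openai", "elevenlabs"]:
        await eval(
            provider=provider,
            language="english",
            input="/path/to/input.csv",
            output_dir="/path/to/output",
        )

asyncio.run(run_all())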

Output Structure

/path/to/output/<provider>
├── audios
│   ├── row_1.wav
│   └── row_2.wav
├── results.log
├── results.csv
└── metrics.json
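
A quick sanity check on the generated audio, assuming the .wav files are uncompressed PCM (the standard-library wave module cannot read compressed formats):

import wave
from pathlib import Path

audio_dir = Path("/path/to/output/google/audios")
for wav_path in sorted(audio_dir.glob("*.wav")):
    with wave.open(str(wav_path), "rb") as w:
        # Duration in seconds = frame count / sample rate.
        duration = w.getnframes() / w.getframerate()
    print(f"{wav_path.name}: {duration:.2f}s")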

results.csv

Contains detailed results for each text:
id    | text           | audio_path                        | ttfb  | llm_judge_score | llm_judge_reasoning
row_1 | hello world    | ./out/elevenlabs/audios/row_1.wav | 1.511 | True            | The audio says 'hello world' clearly and matches the reference text exactly.
row_2 | this is a test | ./out/elevenlabs/audios/row_2.wav | 1.215 | True            | The audio clearly says 'this is a test,' which matches exactly with the provided reference text.
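
The per-row results load directly into pandas for filtering (column names as in the table above; the comparison below is written to work whether pandas parses the judge column as booleans or strings):

import pandas as pd

results = pd.read_csv("/path/to/output/google/results.csv")

# Slowest syntheses by time to first byte.
print(results.nlargest(3, "ttfb")[["id", "ttfb"]])

# Rows the LLM judge flagged as not matching the reference text.
failed = results[results["llm_judge_score"].astype(str) != "True"]
print(failed[["id", "llm_judge_reasoning"]])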

metrics.json

Contains aggregated metrics:
{
  "llm_judge_score": 1.0,
  "ttfb": {
    "mean": 1.362950086593628,
    "std": 0.1476140022277832,
    "values": [1.5105640888214111, 1.2153360843658447]
  }
}
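
The ttfb aggregates follow from the per-row values; in the sample above, mean and std match numpy's defaults (population std, ddof=0), and llm_judge_score appears to be the fraction of rows the judge marked True. Treat those as observations from the sample output, not guarantees. A sketch that re-derives the latency aggregates from metrics.json:

import json
import numpy as np

with open("/path/to/output/google/metrics.json") as f:
    metrics = json.load(f)

values = np.array(metrics["ttfb"]["values"])
# np.std defaults to ddof=0 (population std), which matches the sample output.
print("mean agrees:", np.isclose(values.mean(), metrics["ttfb"]["mean"]))
print("std agrees: ", np.isclose(values.std(), metrics["ttfb"]["std"]))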

Metrics

Text to Speech evaluation measures both quality and latency.

Quality metrics:
  • LLM Judge Score: Semantic evaluation of pronunciation accuracy
Latency metrics:
  • TTFB (Time to First Byte): Time until the first audio chunk is received (see the sketch below)
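
Conceptually, TTFB for a streaming TTS API is the wall-clock time from issuing the request until the first audio chunk arrives. A generic sketch of that measurement; the audio stream here stands in for any provider's async streaming response and is not part of calibrate's API:

import time
from typing import AsyncIterator

async def measure_ttfb(stream: AsyncIterator[bytes]) -> float:
    # Start the clock, then wait for the first chunk of audio bytes.
    start = time.perf_counter()
    async for _chunk in stream:
        return time.perf_counter() - start
    raise RuntimeError("stream ended without producing audio")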

Learn more about metrics

For a detailed explanation of all metrics and how the LLM Judge works, see the metrics documentation.

Provider Leaderboard

After running multiple provider evaluations, generate a combined leaderboard:
from calibrate.tts import leaderboard

leaderboard(
    output_dir="/path/to/output",
    save_dir="./leaderboards"
)
This scans each run directory, reads metrics.json and results.csv, then generates:
  • tts_leaderboard.xlsx: Excel file with all metrics by provider
  • Individual metric charts: llm_judge_score.png, ttfb.png
Function Parameters:
Parameter  | Type | Required | Default | Description
output_dir | str  | Yes      | -       | Directory containing provider evaluation results
save_dir   | str  | Yes      | -       | Directory to save leaderboard files
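
For reference, the scan the leaderboard performs can be approximated by hand: collect each provider's metrics.json under the shared output directory into a single table. This is a sketch; the columns in the generated tts_leaderboard.xlsx may differ:

import json
from pathlib import Path

import pandas as pd

rows = []
for metrics_path in Path("/path/to/output").glob("*/metrics.json"):
    metrics = json.loads(metrics_path.read_text())
    rows.append({
        "provider": metrics_path.parent.name,  # run directory name
        "llm_judge_score": metrics["llm_judge_score"],
        "ttfb_mean": metrics["ttfb"]["mean"],
    })

print(pd.DataFrame(rows).sort_values("ttfb_mean"))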