Evaluate Speech to Text providers and generate comparative leaderboards.

Input Data Structure

Organize your input data in the following structure:
/path/to/data
├── stt.csv
└── audios/
    ├── audio_1.wav
    └── audio_2.wav
The stt.csv file should have the following format:
id,text
audio_1,Hi
audio_2,"Madam, my name is Geeta Shankar"
All audio files should be in WAV format. The evaluation script expects files at audios/<id>.wav where <id> matches the id column in your CSV.
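Before running an evaluation, it can help to confirm that the CSV and the audios/ folder agree. A minimal sketch in Python (the check_inputs helper is illustrative, not part of the calibrate CLI):

import csv
from pathlib import Path

def check_inputs(input_dir: str, csv_name: str = "stt.csv") -> None:
    """Verify every id in the CSV has a matching WAV under audios/."""
    root = Path(input_dir)
    with open(root / csv_name, newline="", encoding="utf-8") as f:
        ids = [row["id"] for row in csv.DictReader(f)]
    missing = [i for i in ids if not (root / "audios" / f"{i}.wav").exists()]
    if missing:
        raise FileNotFoundError(f"No audio file for ids: {missing}")
    print(f"OK: all {len(ids)} ids have matching audio files.")

check_inputs("./data")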

Learn more about metrics

For a detailed explanation of all metrics, including why the LLM Judge is necessary, see the metrics documentation.

calibrate stt eval

Run Speech to Text evaluation against a specific provider.
calibrate stt eval -p <provider> -l <language> -i <input_dir> -o <output_dir>

Arguments

| Flag | Long | Type | Required | Default | Description |
|------|------|------|----------|---------|-------------|
| -p | --provider | string | Yes | - | Provider: deepgram, openai, cartesia, google, sarvam, elevenlabs, smallest, groq |
| -l | --language | string | No | english | Language: english, hindi, kannada, bengali, malayalam, marathi, odia, punjabi, tamil, telugu, gujarati, sindhi |
| -i | --input-dir | string | Yes | - | Path to input directory containing stt.csv and audios/ folder |
| -o | --output-dir | string | No | ./out | Path to output directory for results |
| -f | --input-file-name | string | No | stt.csv | Name of input CSV file |
| -d | --debug | flag | No | false | Run on first N audio files only |
| -dc | --debug_count | int | No | 5 | Number of files in debug mode |
| - | --ignore_retry | flag | No | false | Skip retry if not all audios processed |
| - | --overwrite | flag | No | false | Overwrite existing results instead of resuming |

Examples

Basic evaluation:
calibrate stt eval -p deepgram -l english -i ./data -o ./out
Evaluate with Hindi language:
calibrate stt eval -p sarvam -l hindi -i ./data -o ./out
Debug mode (process only first 5 files):
calibrate stt eval -p deepgram -l english -i ./data -o ./out -d -dc 5

Output Structure

/path/to/output/<provider>
├── results.csv       # Per-file results with metrics
├── metrics.json      # Aggregated metrics (WER, similarity, LLM judge)
└── results.log       # Full logs including terminal output

Output Files

results.csv contains per-file evaluation results:
| id | gt | pred | wer | string_similarity | llm_judge_score | llm_judge_reasoning |
|----|----|------|-----|-------------------|-----------------|---------------------|
| audio_1 | Hello | Hello | 0.0 | 1.0 | True | Exact match |
| audio_2 | My name is Geeta | My name is Gita | 0.25 | 0.9 | False | Name spelling differs |
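After a run, results.csv can be inspected with standard tooling to find where a provider struggled. A rough sketch (the output path follows the structure above; it assumes llm_judge_score reads back from the CSV as the string "True"/"False"):

import csv

# Load per-file results written by `calibrate stt eval`.
with open("./out/deepgram/results.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

# List transcripts the LLM judge rejected, worst WER first.
failures = [r for r in rows if r["llm_judge_score"] != "True"]
for r in sorted(failures, key=lambda r: float(r["wer"]), reverse=True):
    print(r["id"], r["wer"], "|", r["gt"], "->", r["pred"])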

calibrate stt leaderboard

Generate a comparative leaderboard from multiple evaluation runs.
calibrate stt leaderboard -o <output_dir> -s <save_dir>

Arguments

| Flag | Long | Type | Required | Description |
|------|------|------|----------|-------------|
| -o | --output-dir | string | Yes | Directory containing evaluation run folders |
| -s | --save-dir | string | Yes | Directory to save leaderboard outputs |

Example

calibrate stt leaderboard -o ./out -s ./leaderboard

Output

The leaderboard command generates:
  • stt_leaderboard.xlsx - Comparative spreadsheet of all providers
  • Individual metric charts: wer.png, string_similarity.png, llm_judge_score.png
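The spreadsheet and charts cover the common cases, but the per-provider results.csv files can also be aggregated directly if you need a custom comparison. A sketch, assuming the directory layout shown under Output Structure and using mean WER as the summary statistic:

import csv
from pathlib import Path
from statistics import mean

out_dir = Path("./out")  # same directory passed via -o / --output-dir
# Each subdirectory holds one provider's evaluation run.
for provider_dir in sorted(p for p in out_dir.iterdir() if p.is_dir()):
    with open(provider_dir / "results.csv", newline="", encoding="utf-8") as f:
        wers = [float(row["wer"]) for row in csv.DictReader(f)]
    print(f"{provider_dir.name}: mean WER = {mean(wers):.3f}")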

Supported Providers

| Provider | Languages |
|----------|-----------|
| deepgram | English, Hindi |
| openai | English, Hindi |
| google | English, Hindi, Kannada, Sindhi |
| sarvam | English, Hindi, Kannada |
| elevenlabs | English, Sindhi |
| cartesia | English, Sindhi |
| smallest | English, Hindi |
| groq | English |
Sindhi STT is supported by the Google (via the chirp_2 model), Cartesia, and ElevenLabs providers.

Required Environment Variables

Set the appropriate API key for your chosen provider:
export DEEPGRAM_API_KEY=your_key
export OPENAI_API_KEY=your_key
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json
export SARVAM_API_KEY=your_key
export ELEVENLABS_API_KEY=your_key
export CARTESIA_API_KEY=your_key
export GROQ_API_KEY=your_key
export SMALLEST_API_KEY=your_key
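A quick pre-flight check avoids discovering a missing key partway through a run. A sketch (the provider-to-variable mapping simply mirrors the list above; note that google expects a credentials file path rather than a bare key):

import os

# Expected environment variable per provider (from the list above).
ENV_VARS = {
    "deepgram": "DEEPGRAM_API_KEY",
    "openai": "OPENAI_API_KEY",
    "google": "GOOGLE_APPLICATION_CREDENTIALS",
    "sarvam": "SARVAM_API_KEY",
    "elevenlabs": "ELEVENLABS_API_KEY",
    "cartesia": "CARTESIA_API_KEY",
    "groq": "GROQ_API_KEY",
    "smallest": "SMALLEST_API_KEY",
}

provider = "deepgram"
if not os.environ.get(ENV_VARS[provider]):
    raise SystemExit(f"Set {ENV_VARS[provider]} before running `calibrate stt eval`")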