Input data structure

Organize your input data in the following structure:
/path/to/data
├── stt.csv
└── audios/
    ├── audio_1.wav
    └── audio_2.wav
stt.csv should have the following format:
| id      | text                            |
|---------|---------------------------------|
| audio_1 | Hi                              |
| audio_2 | Madam, my name is Geeta Shankar |
All audio files should be in WAV format. The file names should match the id column in stt.csv.
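
If you prefer to assemble this layout programmatically, the sketch below uses only the Python standard library; the folder and file names follow the structure above, and the example rows are purely illustrative.
import csv
from pathlib import Path

data_dir = Path("/path/to/data")
(data_dir / "audios").mkdir(parents=True, exist_ok=True)

# Each id must match the name of a WAV file (without the .wav extension) in audios/.
rows = [
    {"id": "audio_1", "text": "Hi"},
    {"id": "audio_2", "text": "Madam, my name is Geeta Shankar"},
]

with open(data_dir / "stt.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "text"])
    writer.writeheader()
    writer.writerows(rows)

# Copy the corresponding WAV files into /path/to/data/audios/ yourself.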

Run evaluation

import asyncio
from calibrate.stt import eval

asyncio.run(eval(
    provider="deepgram",  # deepgram, openai, cartesia, google, sarvam, elevenlabs, smallest, groq
    language="english",   # english, hindi, kannada, bengali, malayalam, marathi, odia, punjabi, tamil, telugu, gujarati, sindhi
    input_dir="/path/to/data",
    output_dir="/path/to/output",
    debug=True,           # optional: run on first 5 audio files only
    debug_count=5,        # optional: number of files in debug mode
))
Function Parameters:
| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| provider | str | Yes | - | Speech to Text provider: deepgram, openai, cartesia, google, sarvam, elevenlabs, smallest, groq |
| input_dir | str | Yes | - | Path to input directory containing stt.csv and audios/ folder |
| output_dir | str | No | "./out" | Path to output directory for results |
| language | str | No | "english" | Language of audio files: english, hindi, kannada, bengali, malayalam, marathi, odia, punjabi, tamil, telugu, gujarati, sindhi |
| input_file_name | str | No | "stt.csv" | Name of input CSV file |
| debug | bool | No | False | Run on first N audio files only |
| debug_count | int | No | 5 | Number of files in debug mode |
| ignore_retry | bool | No | False | Skip retry if not all audios processed |
| overwrite | bool | No | False | Overwrite existing results instead of resuming from checkpoint |
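
For reference, a call that passes every parameter explicitly, using the defaults from the table for the optional ones, looks like this:
import asyncio
from calibrate.stt import eval

asyncio.run(eval(
    provider="deepgram",
    input_dir="/path/to/data",
    output_dir="./out",
    language="english",
    input_file_name="stt.csv",
    debug=False,
    debug_count=5,
    ignore_retry=False,
    overwrite=False,
))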

Evaluation process

When you run the evaluation, each audio file is sent to the Speech to Text provider and the received transcript is displayed:
--------------------------------
Processing audio [1/13]: audio_1.wav
Transcript: Hi.
--------------------------------
Processing audio [2/13]: audio_2.wav
Transcript: Madam, my name is Geeta Shankar.

Intermediate saves and crash recovery

After each audio file is processed, the script saves intermediate results to results.csv. This means:
  • If the process crashes or is interrupted, you won’t lose progress
  • You can resume from where you left off by running the same command again
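
If you want to see how far a run got before resuming, one option is to inspect the intermediate results.csv directly. A minimal sketch with pandas, assuming the file lives in the provider subdirectory of output_dir (see Output structure below) and already contains the id and pred columns described there:
import pandas as pd

results = pd.read_csv("/path/to/output/deepgram/results.csv")

# Rows with an empty prediction have not been transcribed yet.
pending = results[results["pred"].isna() | (results["pred"] == "")]
print(f"{len(pending)} of {len(results)} audio files still need transcripts")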

Automatic retry logic

The evaluation script includes automatic retry logic for robustness:
  1. After processing all audio files, the script checks if any transcripts are missing
  2. If some audios failed to get transcripts, it automatically retries those specific files
  3. If a retry attempt makes no progress (same number of failures as before), the script exits the loop and saves empty transcripts for the failed files
  4. You can pass the same parameters to resume from where it left off
To skip retries and proceed directly to metrics calculation:
asyncio.run(eval(
    provider="deepgram",
    input_dir="/path/to/data",
    output_dir="/path/to/output",
    ignore_retry=True,  # Skip retry logic
))

Metrics

Speech to Text evaluation measures accuracy using the following metrics:
  • WER (Word Error Rate): Edit distance between predicted and reference transcripts
  • String Similarity: Character-level similarity ratio
  • LLM Judge Score: Semantic evaluation using an LLM
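
To build intuition for the first two metrics, here is a rough, self-contained illustration using the third-party jiwer package and the standard-library difflib; the library's own implementation may normalize and tokenize differently.
import difflib

import jiwer

reference = "Please write Rekha Kumari, sister."
hypothesis = "Please write Reha Kumari's sister."

# Word Error Rate: word-level edit distance divided by the number of reference words.
wer = jiwer.wer(reference, hypothesis)

# String similarity: character-level similarity ratio in [0, 1].
similarity = difflib.SequenceMatcher(None, reference, hypothesis).ratio()

print(f"WER: {wer:.2f}, string similarity: {similarity:.2f}")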

Learn more about metrics

See the metrics documentation for a detailed explanation of all metrics, including why the LLM Judge is necessary.

Output structure

/path/to/output/<provider>          # e.g. /path/to/output/deepgram
├── results.csv
├── metrics.json
└── results.log

results.csv

Contains detailed results for each audio file:
| id | gt | pred | wer | string_similarity | llm_judge_score | llm_judge_reasoning |
|----|----|------|-----|-------------------|-----------------|---------------------|
| audio_1 | Hi | Hi. | 0.0 | 0.95 | True | The transcription matches the source exactly. |
| audio_2 | Please write Rekha Kumari, sister. | Please write Reha Kumari's sister. | 0.4 | 0.93 | False | The name 'Rekha' was transcribed as 'Reha', which is a different name. |
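
To review only the transcriptions the LLM Judge marked as incorrect, you can filter this file; a short sketch with pandas, using the column names from the table above:
import pandas as pd

results = pd.read_csv("/path/to/output/deepgram/results.csv")

# Keep only the rows the LLM Judge scored as False and show what went wrong.
failed = results[results["llm_judge_score"] == False]
print(failed[["id", "gt", "pred", "llm_judge_reasoning"]])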

metrics.json

Contains aggregated metrics for the entire evaluation run:
{
  "wer": 0.12962962962962962,
  "string_similarity": 0.8792465033551621,
  "llm_judge_score": 0.85
}
  • wer: Mean Word Error Rate across all audio files
  • string_similarity: Mean string similarity score across all audio files
  • llm_judge_score: Mean LLM Judge score across all audio files
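
Since these are simple per-file means, they can be reproduced from results.csv if you ever want to sanity-check a run; a small sketch using the column names shown above:
import json

import pandas as pd

results = pd.read_csv("/path/to/output/deepgram/results.csv")

# Recompute the per-metric means and compare them with the reported aggregates.
recomputed = {
    "wer": results["wer"].mean(),
    "string_similarity": results["string_similarity"].mean(),
    "llm_judge_score": results["llm_judge_score"].mean(),
}

with open("/path/to/output/deepgram/metrics.json") as f:
    reported = json.load(f)

print("recomputed:", recomputed)
print("reported:  ", reported)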

results.log

Contains the full logs of the evaluation including terminal output and debug information. Useful for debugging issues with specific providers or audio files.
--------------------------------
Running command: calibrate stt eval -p deepgram -i /path/to/data -o /path/to/output
--------------------------------
Processing audio [1/13]: audio_1.wav
Transcript: Hi.
--------------------------------
...

Generate leaderboard

After running evaluations for multiple providers, you can generate a combined leaderboard to compare them side-by-side.

Prerequisites

For the leaderboard to work correctly, all provider evaluations must use the same output_dir. Each provider’s results are saved in a subdirectory named after the provider:
/path/to/output/          # This is the output_dir you pass to leaderboard()
├── deepgram/
│   ├── results.log
│   ├── results.csv
│   └── metrics.json
├── google/
│   ├── results.log
│   ├── results.csv
│   └── metrics.json
├── openai/
│   ├── results.log
│   ├── results.csv
│   └── metrics.json
└── sarvam/
    ├── results.log
    ├── results.csv
    └── metrics.json
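
One straightforward way to produce this layout is to run the evaluation for each provider in a loop while keeping output_dir fixed; a minimal sketch:
import asyncio
from calibrate.stt import eval

providers = ["deepgram", "google", "openai", "sarvam"]

for provider in providers:
    # Each provider writes its results into its own subdirectory of the shared output_dir.
    asyncio.run(eval(
        provider=provider,
        language="english",
        input_dir="/path/to/data",
        output_dir="/path/to/output",
    ))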

Generate the leaderboard

from calibrate.stt import leaderboard

leaderboard(
    output_dir="/path/to/output",  # Same directory used for all provider evaluations
    save_dir="./leaderboards"
)
Function Parameters:
| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| output_dir | str | Yes | - | Directory containing subdirectories for each provider's evaluation results |
| save_dir | str | Yes | - | Directory where leaderboard files will be saved |

Leaderboard outputs

The leaderboard generates two files:

stt_leaderboard.xlsx

An Excel workbook with:
  • Summary sheet: Comparison of all providers across all metrics
  • Per-provider sheets: Detailed results showing only the failed transcriptions (where llm_judge_score is False) for each provider
Example summary sheet:
| run | count | wer | string_similarity | llm_judge_score |
|-----|-------|-----|-------------------|-----------------|
| deepgram | 50 | 0.089 | 0.934 | 0.92 |
| google | 50 | 0.112 | 0.912 | 0.88 |
| openai | 50 | 0.095 | 0.928 | 0.90 |
| sarvam | 50 | 0.156 | 0.891 | 0.82 |
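
If you want to work with these numbers programmatically, the workbook can be read back with pandas; a sketch assuming the summary sheet is the first sheet in the file (the exact sheet name is not specified here):
import pandas as pd

# Read the first sheet of the leaderboard workbook (assumed to be the summary sheet).
summary = pd.read_excel("./leaderboards/stt_leaderboard.xlsx", sheet_name=0)

# Rank providers by Word Error Rate (lower is better).
print(summary.sort_values("wer"))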

Per-metric charts

Individual bar chart visualizations for each metric, comparing all providers:
  • wer.png - Word Error Rate comparison (lower is better)
  • string_similarity.png - String Similarity comparison (higher is better)
  • llm_judge_score.png - LLM Judge Score comparison (higher is better)
Each chart displays the metric value for each provider with value labels on top of each bar for easy comparison.