Input data structure
Organize your input data in an input directory containing stt.csv and an audios/ folder with your audio files.

stt.csv should have the following format:
| id | text |
|---|---|
| audio_1 | Hi |
| audio_2 | Madam, my name is Geeta Shankar |
Each file in the audios/ folder must be named after its entry in the id column of stt.csv.
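As a quick sanity check before running anything, the sketch below verifies that every id in stt.csv has a matching file in audios/. It is illustrative, not part of the toolkit; only the id column, the audios/ folder, and the default file name stt.csv come from this page.

```python
# Illustrative pre-flight check: confirm every id in stt.csv has an audio file.
import csv
from pathlib import Path

def validate_input_dir(input_dir: str, input_file_name: str = "stt.csv") -> None:
    input_path = Path(input_dir)
    audio_dir = input_path / "audios"

    # ids declared in the CSV
    with open(input_path / input_file_name, newline="", encoding="utf-8") as f:
        ids = [row["id"] for row in csv.DictReader(f)]

    # audio files keyed by stem, so audio_1.wav matches id audio_1
    audio_stems = {p.stem for p in audio_dir.iterdir() if p.is_file()}

    missing = [i for i in ids if i not in audio_stems]
    if missing:
        raise FileNotFoundError(f"No audio file found for ids: {missing}")
    print(f"OK: all {len(ids)} rows in {input_file_name} have audio files.")

if __name__ == "__main__":
    validate_input_dir("./data")  # path is an example
```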
Run evaluation
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| provider | str | Yes | - | Speech to Text provider: deepgram, openai, cartesia, google, sarvam, elevenlabs, smallest, groq |
| input_dir | str | Yes | - | Path to input directory containing stt.csv and audios/ folder |
| output_dir | str | No | "./out" | Path to output directory for results |
| language | str | No | "english" | Language of audio files: english, hindi, kannada, bengali, malayalam, marathi, odia, punjabi, tamil, telugu, gujarati, sindhi |
| input_file_name | str | No | "stt.csv" | Name of input CSV file |
| debug | bool | No | False | Run on first N audio files only |
| debug_count | int | No | 5 | Number of files in debug mode |
| ignore_retry | bool | No | False | Skip retry if not all audios processed |
| overwrite | bool | No | False | Overwrite existing results instead of resuming from checkpoint |
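As an example, a run with these parameters could look like the following. The stt_eval module and run_evaluation function are hypothetical placeholders used for illustration; only the keyword arguments and their defaults come from the table above.

```python
from stt_eval import run_evaluation  # hypothetical entry point, named for illustration only

run_evaluation(
    provider="deepgram",   # any supported Speech to Text provider
    input_dir="./data",    # directory containing stt.csv and audios/
    output_dir="./out",    # results.csv, metrics.json, and results.log are written here
    language="hindi",
    debug=True,            # process only the first debug_count files while testing
    debug_count=5,
)
```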
Evaluation process
When you run the evaluation, each audio file is sent to the Speech to Text provider and the received transcript is displayed.

Intermediate saves and crash recovery
After each audio file is processed, the script saves intermediate results to results.csv (a minimal checkpointing sketch follows the list below). This means:
- If the process crashes or is interrupted, you won’t lose progress
- You can resume from where you left off by running the same command again
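A minimal sketch of this checkpointing, assuming results accumulate in a results.csv keyed by id; transcribe_file is a placeholder for the provider call and the .wav extension is an assumption:

```python
import csv
from pathlib import Path

def already_processed(results_path: Path) -> set:
    """Return the ids that already have a row in results.csv."""
    if not results_path.exists():
        return set()
    with open(results_path, newline="", encoding="utf-8") as f:
        return {row["id"] for row in csv.DictReader(f)}

def process_with_checkpoints(audio_ids, audio_dir: Path, results_path: Path, transcribe_file):
    done = already_processed(results_path)
    write_header = not results_path.exists()
    with open(results_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "pred"])
        if write_header:
            writer.writeheader()
        for audio_id in audio_ids:
            if audio_id in done:
                continue  # resume: this file already has a transcript row
            pred = transcribe_file(audio_dir / f"{audio_id}.wav")
            writer.writerow({"id": audio_id, "pred": pred})
            f.flush()  # persist after every file so a crash loses at most one result
```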
Automatic retry logic
The evaluation script includes automatic retry logic for robustness (see the sketch after this list):
- After processing all audio files, the script checks if any transcripts are missing
- If some audios failed to get transcripts, it automatically retries those specific files
- If a retry attempt makes no progress (same number of failures as before), the script exits the loop and saves empty transcripts for the failed files
- You can pass the same parameters to resume from where it left off
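A sketch of this retry behavior under the same assumptions as above (retry_transcription stands in for re-running the provider on a single file):

```python
def retry_missing(results: dict, audio_ids, retry_transcription, max_rounds: int = 5) -> dict:
    """Retry files with empty transcripts until a round makes no progress."""
    for _ in range(max_rounds):
        missing = [i for i in audio_ids if not results.get(i)]
        if not missing:
            break
        for audio_id in missing:
            transcript = retry_transcription(audio_id)
            if transcript:
                results[audio_id] = transcript
        still_missing = [i for i in audio_ids if not results.get(i)]
        if len(still_missing) == len(missing):
            # No progress this round: give up and record empty transcripts.
            for audio_id in still_missing:
                results[audio_id] = ""
            break
    return results
```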
Metrics
Speech to Text evaluation measures accuracy using the following metrics (a sketch for computing the first two follows the list):
- WER (Word Error Rate): Edit distance between predicted and reference transcripts
- String Similarity: Character-level similarity ratio
- LLM Judge Score: Semantic evaluation using an LLM
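The first two metrics can be reproduced with standard tooling; the sketch below assumes the third-party jiwer package for WER and difflib for the similarity ratio. The LLM Judge step requires an LLM call and is not shown, and the toolkit may normalize text (for example, stripping punctuation) before scoring, which is omitted here.

```python
from difflib import SequenceMatcher
import jiwer  # third-party package for word error rate

def compute_wer(reference: str, prediction: str) -> float:
    """Word-level edit distance normalized by the reference length."""
    return jiwer.wer(reference, prediction)

def string_similarity(reference: str, prediction: str) -> float:
    """Character-level similarity ratio in [0, 1]."""
    return SequenceMatcher(None, reference, prediction).ratio()

print(compute_wer("Hi", "Hi."))        # 1.0 without punctuation normalization
print(string_similarity("Hi", "Hi."))  # 0.8
```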
Learn more about metrics
Detailed explanation of all metrics, including why LLM Judge is necessary
Output structure
results.csv
Contains detailed results for each audio file:

| id | gt | pred | wer | string_similarity | llm_judge_score | llm_judge_reasoning |
|---|---|---|---|---|---|---|
| audio_1 | Hi | Hi. | 0.0 | 0.95 | True | The transcription matches the source exactly. |
| audio_2 | Please write Rekha Kumari, sister. | Please write Reha Kumari’s sister. | 0.4 | 0.93 | False | The name ‘Rekha’ was transcribed as ‘Reha’, which is a different name. |
metrics.json
Contains aggregated metrics for the entire evaluation run:
- wer: Mean Word Error Rate across all audio files
- string_similarity: Mean string similarity score across all audio files
- llm_judge_score: Mean LLM Judge score across all audio files
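A sketch of how these aggregates can be derived from results.csv, assuming the column names shown above; llm_judge_score is treated as a boolean and averaged into a pass rate:

```python
import csv
import json
from pathlib import Path

def aggregate_metrics(results_csv: Path, metrics_json: Path) -> dict:
    with open(results_csv, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    metrics = {
        "wer": mean([float(r["wer"]) for r in rows]),
        "string_similarity": mean([float(r["string_similarity"]) for r in rows]),
        # "True"/"False" strings become 1/0, so the mean is a pass rate
        "llm_judge_score": mean([1.0 if r["llm_judge_score"] == "True" else 0.0 for r in rows]),
    }
    metrics_json.write_text(json.dumps(metrics, indent=2))
    return metrics
```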
results.log
Contains the full logs of the evaluation, including terminal output and debug information. Useful for debugging issues with specific providers or audio files.

Generate leaderboard
After running evaluations for multiple providers, you can generate a combined leaderboard to compare them side by side.

Prerequisites
For the leaderboard to work correctly, all provider evaluations must use the same output_dir. Each provider's results are saved in a subdirectory named after the provider, for example with output_dir set to ./out:
- ./out/deepgram/ containing results.csv, metrics.json, and results.log
- ./out/openai/ containing results.csv, metrics.json, and results.log
- one such subdirectory for each additional provider evaluated
Generate the leaderboard
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| output_dir | str | Yes | - | Directory containing subdirectories for each provider's evaluation results |
| save_dir | str | Yes | - | Directory where leaderboard files will be saved |
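A minimal sketch of the aggregation behind the leaderboard, assuming each provider subdirectory under output_dir holds the metrics.json described above (function and variable names here are illustrative):

```python
import json
from pathlib import Path

def collect_leaderboard(output_dir: str) -> list:
    """Read metrics.json from each provider subdirectory and build summary rows."""
    rows = []
    for provider_dir in sorted(Path(output_dir).iterdir()):
        metrics_file = provider_dir / "metrics.json"
        if not provider_dir.is_dir() or not metrics_file.exists():
            continue
        metrics = json.loads(metrics_file.read_text())
        rows.append({"run": provider_dir.name, **metrics})
    return rows

summary = collect_leaderboard("./out")  # one row per provider, e.g. deepgram, openai, ...
```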
Leaderboard outputs
The leaderboard generates the following outputs:

stt_leaderboard.xlsx
An Excel workbook with:
- Summary sheet: Comparison of all providers across all metrics
- Per-provider sheets: Detailed results showing only the failed transcriptions (where llm_judge_score is False) for each provider

An example Summary sheet:
| run | count | wer | string_similarity | llm_judge_score |
|---|---|---|---|---|
| deepgram | 50 | 0.089 | 0.934 | 0.92 |
| | 50 | 0.112 | 0.912 | 0.88 |
| openai | 50 | 0.095 | 0.928 | 0.90 |
| sarvam | 50 | 0.156 | 0.891 | 0.82 |
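A sketch of assembling such a workbook with pandas, reusing the summary rows from collect_leaderboard above and each provider's results.csv; an Excel engine such as openpyxl must be installed for to_excel to work:

```python
import pandas as pd
from pathlib import Path

def write_leaderboard_xlsx(output_dir: str, save_dir: str, summary_rows: list) -> None:
    save_path = Path(save_dir) / "stt_leaderboard.xlsx"
    with pd.ExcelWriter(save_path) as writer:
        # Summary sheet: one row per provider with all aggregate metrics
        pd.DataFrame(summary_rows).to_excel(writer, sheet_name="Summary", index=False)
        # Per-provider sheets: only rows the LLM judge marked as failures
        for row in summary_rows:
            provider = row["run"]
            results = pd.read_csv(Path(output_dir) / provider / "results.csv")
            failed = results[results["llm_judge_score"].astype(str) == "False"]
            failed.to_excel(writer, sheet_name=provider[:31], index=False)  # sheet names are capped at 31 chars
```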
Per-Metric Charts
Individual bar chart visualizations for each metric, comparing all providers:
- wer.png - Word Error Rate comparison (lower is better)
- string_similarity.png - String Similarity comparison (higher is better)
- llm_judge_score.png - LLM Judge Score comparison (higher is better)
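And a sketch of how such per-metric bar charts can be produced, assuming matplotlib and the same summary rows; the output file names match those listed above:

```python
import matplotlib.pyplot as plt
from pathlib import Path

def plot_metric_charts(summary_rows: list, save_dir: str) -> None:
    providers = [row["run"] for row in summary_rows]
    for metric in ("wer", "string_similarity", "llm_judge_score"):
        values = [row[metric] for row in summary_rows]
        plt.figure(figsize=(8, 4))
        plt.bar(providers, values)       # one bar per provider
        plt.title(metric)
        plt.ylabel(metric)
        plt.tight_layout()
        plt.savefig(Path(save_dir) / f"{metric}.png")
        plt.close()
```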