Input data structure
Organize your input data in an input directory containing stt.csv and an audios/ folder with your audio files.

stt.csv should have the following format:
| id | text |
|---|---|
| audio_1 | Hi |
| audio_2 | Madam, my name is Geeta Shankar |
Each file in the audios/ folder must be named after its entry in the id column of stt.csv.
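As a quick sanity check before running anything, the sketch below verifies that every id in stt.csv has a matching file in audios/. It is illustrative, not part of the toolkit; only the id column, the audios/ folder, and the default file name stt.csv come from this page.

```python
# Illustrative pre-flight check: confirm every id in stt.csv has an audio file.
import csv
from pathlib import Path

def validate_input_dir(input_dir: str, input_file_name: str = "stt.csv") -> None:
    input_path = Path(input_dir)
    audio_dir = input_path / "audios"

    # ids declared in the CSV
    with open(input_path / input_file_name, newline="", encoding="utf-8") as f:
        ids = [row["id"] for row in csv.DictReader(f)]

    # audio files keyed by stem, so audio_1.wav matches id audio_1
    audio_stems = {p.stem for p in audio_dir.iterdir() if p.is_file()}

    missing = [i for i in ids if i not in audio_stems]
    if missing:
        raise FileNotFoundError(f"No audio file found for ids: {missing}")
    print(f"OK: all {len(ids)} rows in {input_file_name} have audio files.")

if __name__ == "__main__":
    validate_input_dir("./data")  # path is an example
```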
Run evaluation
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| provider | str | Yes | - | Speech to Text provider: deepgram, openai, cartesia, google, sarvam, elevenlabs, smallest, groq |
| input_dir | str | Yes | - | Path to input directory containing stt.csv and audios/ folder |
| output_dir | str | No | "./out" | Path to output directory for results |
| language | str | No | "english" | Language of audio files: english, hindi, kannada, bengali, malayalam, marathi, odia, punjabi, tamil, telugu, gujarati, sindhi |
| input_file_name | str | No | "stt.csv" | Name of input CSV file |
| debug | bool | No | False | Run on first N audio files only |
| debug_count | int | No | 5 | Number of files in debug mode |
| ignore_retry | bool | No | False | Skip retry if not all audios processed |
| overwrite | bool | No | False | Overwrite existing results instead of resuming from checkpoint |
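As an example, a run with these parameters could look like the following. The stt_eval module and run_evaluation function are hypothetical placeholders used for illustration; only the keyword arguments and their defaults come from the table above.

```python
from stt_eval import run_evaluation  # hypothetical entry point, named for illustration only

run_evaluation(
    provider="deepgram",   # any supported Speech to Text provider
    input_dir="./data",    # directory containing stt.csv and audios/
    output_dir="./out",    # results.csv, metrics.json, and results.log are written here
    language="hindi",
    debug=True,            # process only the first debug_count files while testing
    debug_count=5,
)
```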
Evaluation process
When you run the evaluation, each audio file is sent to the Speech to Text provider and the received transcript is displayed.

Intermediate saves and crash recovery
After each audio file is processed, the script saves intermediate results to results.csv (a minimal checkpointing sketch follows the list below). This means:
- If the process crashes or is interrupted, you won’t lose progress
- You can resume from where you left off by running the same command again
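A minimal sketch of this checkpointing, assuming results accumulate in a results.csv keyed by id; transcribe_file is a placeholder for the provider call and the .wav extension is an assumption:

```python
import csv
from pathlib import Path

def already_processed(results_path: Path) -> set:
    """Return the ids that already have a row in results.csv."""
    if not results_path.exists():
        return set()
    with open(results_path, newline="", encoding="utf-8") as f:
        return {row["id"] for row in csv.DictReader(f)}

def process_with_checkpoints(audio_ids, audio_dir: Path, results_path: Path, transcribe_file):
    done = already_processed(results_path)
    write_header = not results_path.exists()
    with open(results_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "pred"])
        if write_header:
            writer.writeheader()
        for audio_id in audio_ids:
            if audio_id in done:
                continue  # resume: this file already has a transcript row
            pred = transcribe_file(audio_dir / f"{audio_id}.wav")
            writer.writerow({"id": audio_id, "pred": pred})
            f.flush()  # persist after every file so a crash loses at most one result
```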
Automatic retry logic
The evaluation script includes automatic retry logic for robustness (see the sketch after this list):
- After processing all audio files, the script checks if any transcripts are missing
- If some audios failed to get transcripts, it automatically retries those specific files
- If a retry attempt makes no progress (same number of failures as before), the script exits the loop and saves empty transcripts for the failed files
- You can pass the same parameters to resume from where it left off
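A sketch of this retry behavior under the same assumptions as above (retry_transcription stands in for re-running the provider on a single file):

```python
def retry_missing(results: dict, audio_ids, retry_transcription, max_rounds: int = 5) -> dict:
    """Retry files with empty transcripts until a round makes no progress."""
    for _ in range(max_rounds):
        missing = [i for i in audio_ids if not results.get(i)]
        if not missing:
            break
        for audio_id in missing:
            transcript = retry_transcription(audio_id)
            if transcript:
                results[audio_id] = transcript
        still_missing = [i for i in audio_ids if not results.get(i)]
        if len(still_missing) == len(missing):
            # No progress this round: give up and record empty transcripts.
            for audio_id in still_missing:
                results[audio_id] = ""
            break
    return results
```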
Metrics
Speech to Text evaluation measures accuracy using the following metrics (a sketch for computing the first two follows the list):
- WER (Word Error Rate): Edit distance between predicted and reference transcripts
- String Similarity: Character-level similarity ratio
- LLM Judge Score: Semantic evaluation using an LLM
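The first two metrics can be reproduced with standard tooling; the sketch below assumes the third-party jiwer package for WER and difflib for the similarity ratio. The LLM Judge step requires an LLM call and is not shown, and the toolkit may normalize text (for example, stripping punctuation) before scoring, which is omitted here.

```python
from difflib import SequenceMatcher
import jiwer  # third-party package for word error rate

def compute_wer(reference: str, prediction: str) -> float:
    """Word-level edit distance normalized by the reference length."""
    return jiwer.wer(reference, prediction)

def string_similarity(reference: str, prediction: str) -> float:
    """Character-level similarity ratio in [0, 1]."""
    return SequenceMatcher(None, reference, prediction).ratio()

print(compute_wer("Hi", "Hi."))        # 1.0 without punctuation normalization
print(string_similarity("Hi", "Hi."))  # 0.8
```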
Learn more about metrics
Detailed explanation of all metrics, including why LLM Judge is necessary
Output structure
results.csv
Contains detailed results for each audio file:

| id | gt | pred | wer | string_similarity | llm_judge_score | llm_judge_reasoning |
|---|---|---|---|---|---|---|
| audio_1 | Hi | Hi. | 0.0 | 0.95 | True | The transcription matches the source exactly. |
| audio_2 | Please write Rekha Kumari, sister. | Please write Reha Kumari’s sister. | 0.4 | 0.93 | False | The name ‘Rekha’ was transcribed as ‘Reha’, which is a different name. |
metrics.json
Contains aggregated metrics for the entire evaluation run:
- wer: Mean Word Error Rate across all audio files
- string_similarity: Mean string similarity score across all audio files
- llm_judge_score: Mean LLM Judge score across all audio files
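A sketch of how these aggregates can be derived from results.csv, assuming the column names shown above; llm_judge_score is treated as a boolean and averaged into a pass rate:

```python
import csv
import json
from pathlib import Path

def aggregate_metrics(results_csv: Path, metrics_json: Path) -> dict:
    with open(results_csv, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    metrics = {
        "wer": mean([float(r["wer"]) for r in rows]),
        "string_similarity": mean([float(r["string_similarity"]) for r in rows]),
        # "True"/"False" strings become 1/0, so the mean is a pass rate
        "llm_judge_score": mean([1.0 if r["llm_judge_score"] == "True" else 0.0 for r in rows]),
    }
    metrics_json.write_text(json.dumps(metrics, indent=2))
    return metrics
```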
results.log
Contains the full logs of the evaluation, including terminal output and debug information. Useful for debugging issues with specific providers or audio files.

Generate leaderboard
After running evaluations for multiple providers, you can generate a combined leaderboard to compare them side by side.

Prerequisites
For the leaderboard to work correctly, all provider evaluations must use the same output_dir. Each provider's results are saved in a subdirectory named after the provider, for example with output_dir set to ./out:
- ./out/deepgram/ containing results.csv, metrics.json, and results.log
- ./out/openai/ containing results.csv, metrics.json, and results.log
- one such subdirectory for each additional provider evaluated
Generate the leaderboard
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| output_dir | str | Yes | - | Directory containing subdirectories for each provider's evaluation results |
| save_dir | str | Yes | - | Directory where leaderboard files will be saved |
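A minimal sketch of the aggregation behind the leaderboard, assuming each provider subdirectory under output_dir holds the metrics.json described above (function and variable names here are illustrative):

```python
import json
from pathlib import Path

def collect_leaderboard(output_dir: str) -> list:
    """Read metrics.json from each provider subdirectory and build summary rows."""
    rows = []
    for provider_dir in sorted(Path(output_dir).iterdir()):
        metrics_file = provider_dir / "metrics.json"
        if not provider_dir.is_dir() or not metrics_file.exists():
            continue
        metrics = json.loads(metrics_file.read_text())
        rows.append({"run": provider_dir.name, **metrics})
    return rows

summary = collect_leaderboard("./out")  # one row per provider, e.g. deepgram, openai, ...
```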
Leaderboard outputs
The leaderboard generates the following outputs:

stt_leaderboard.xlsx
An Excel workbook with:
- Summary sheet: Comparison of all providers across all metrics
- Per-provider sheets: Detailed results showing only the failed transcriptions (where llm_judge_score is False) for each provider

An example Summary sheet:
| run | count | wer | string_similarity | llm_judge_score |
|---|---|---|---|---|
| deepgram | 50 | 0.089 | 0.934 | 0.92 |
| | 50 | 0.112 | 0.912 | 0.88 |
| openai | 50 | 0.095 | 0.928 | 0.90 |
| sarvam | 50 | 0.156 | 0.891 | 0.82 |
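A sketch of assembling such a workbook with pandas, reusing the summary rows from collect_leaderboard above and each provider's results.csv; an Excel engine such as openpyxl must be installed for to_excel to work:

```python
import pandas as pd
from pathlib import Path

def write_leaderboard_xlsx(output_dir: str, save_dir: str, summary_rows: list) -> None:
    save_path = Path(save_dir) / "stt_leaderboard.xlsx"
    with pd.ExcelWriter(save_path) as writer:
        # Summary sheet: one row per provider with all aggregate metrics
        pd.DataFrame(summary_rows).to_excel(writer, sheet_name="Summary", index=False)
        # Per-provider sheets: only rows the LLM judge marked as failures
        for row in summary_rows:
            provider = row["run"]
            results = pd.read_csv(Path(output_dir) / provider / "results.csv")
            failed = results[results["llm_judge_score"].astype(str) == "False"]
            failed.to_excel(writer, sheet_name=provider[:31], index=False)  # sheet names are capped at 31 chars
```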
Per-Metric Charts
Individual bar chart visualizations for each metric, comparing all providers:
- wer.png - Word Error Rate comparison (lower is better)
- string_similarity.png - String Similarity comparison (higher is better)
- llm_judge_score.png - LLM Judge Score comparison (higher is better)
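And a sketch of how such per-metric bar charts can be produced, assuming matplotlib and the same summary rows; the output file names match those listed above:

```python
import matplotlib.pyplot as plt
from pathlib import Path

def plot_metric_charts(summary_rows: list, save_dir: str) -> None:
    providers = [row["run"] for row in summary_rows]
    for metric in ("wer", "string_similarity", "llm_judge_score"):
        values = [row[metric] for row in summary_rows]
        plt.figure(figsize=(8, 4))
        plt.bar(providers, values)       # one bar per provider
        plt.title(metric)
        plt.ylabel(metric)
        plt.tight_layout()
        plt.savefig(Path(save_dir) / f"{metric}.png")
        plt.close()
```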