Evaluate Speech to Text providers and generate comparative leaderboards.
Organize your input data in the following structure:
/path/to/data
├── stt.csv
└── audios/
    ├── audio_1.wav
    └── audio_2.wav
The stt.csv file should have the following format:
| id | text |
|----|------|
| audio_1 | Hi |
| audio_2 | Madam, my name is Geeta Shankar |
All audio files should be in WAV format. The evaluation script expects files at audios/<id>.wav where <id> matches the id column in your CSV.
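If you assemble the dataset programmatically, a small script can write stt.csv and confirm that every id has a matching WAV file. This is only an illustrative sketch (the paths and transcripts are placeholders), not part of the CLI:

```python
import csv
from pathlib import Path

data_dir = Path("/path/to/data")   # root directory from the tree above (placeholder)
audio_dir = data_dir / "audios"
audio_dir.mkdir(parents=True, exist_ok=True)

# Placeholder ground-truth transcripts, keyed by id (each id maps to audios/<id>.wav).
transcripts = {
    "audio_1": "Hi",
    "audio_2": "Madam, my name is Geeta Shankar",
}

# Write stt.csv with the required "id" and "text" columns.
with open(data_dir / "stt.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "text"])
    writer.writeheader()
    for audio_id, text in transcripts.items():
        writer.writerow({"id": audio_id, "text": text})

# Sanity check: every id listed in the CSV should have a matching WAV file.
missing = [i for i in transcripts if not (audio_dir / f"{i}.wav").exists()]
if missing:
    raise FileNotFoundError(f"Missing audio files for ids: {missing}")
```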
Learn more about metrics: see the detailed explanation of all metrics, including why the LLM Judge is necessary.
calibrate stt eval
Run Speech to Text evaluation against a specific provider.
calibrate stt eval -p <provider> -l <language> -i <input_dir> -o <output_dir>
Arguments
| Flag | Long | Type | Required | Default | Description |
|------|------|------|----------|---------|-------------|
| -p | --provider | string | Yes | - | Provider: deepgram, openai, cartesia, google, sarvam, elevenlabs, smallest, groq |
| -l | --language | string | No | english | Language: english, hindi, kannada, bengali, malayalam, marathi, odia, punjabi, tamil, telugu, gujarati, sindhi |
| -i | --input-dir | string | Yes | - | Path to input directory containing stt.csv and audios/ folder |
| -o | --output-dir | string | No | ./out | Path to output directory for results |
| -f | --input-file-name | string | No | stt.csv | Name of input CSV file |
| -d | --debug | flag | No | false | Run on first N audio files only |
| -dc | --debug_count | int | No | 5 | Number of files in debug mode |
|  | --ignore_retry | flag | No | false | Skip retry if not all audios processed |
|  | --overwrite | flag | No | false | Overwrite existing results instead of resuming |
Examples
Basic evaluation:
calibrate stt eval -p deepgram -l english -i ./data -o ./out
Evaluate with Hindi language:
calibrate stt eval -p sarvam -l hindi -i ./data -o ./out
Debug mode (process only first 5 files):
calibrate stt eval -p deepgram -l english -i ./data -o ./out -d -dc 5
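To build a leaderboard later, run the same evaluation once per provider against the same dataset and output directory. A minimal batch-run sketch (the provider list, language, and paths are examples):

```python
import subprocess

# Evaluate several providers on the same dataset so the runs can be compared
# later with `calibrate stt leaderboard`. Adjust providers and paths as needed.
providers = ["deepgram", "openai", "sarvam"]

for provider in providers:
    subprocess.run(
        ["calibrate", "stt", "eval",
         "-p", provider,
         "-l", "english",
         "-i", "./data",
         "-o", "./out"],
        check=True,
    )
```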
Output Structure
/path/to/output/<provider>
├── results.csv # Per-file results with metrics
├── metrics.json # Aggregated metrics (WER, similarity, LLM judge)
└── results.log # Full logs including terminal output
Output Files
results.csv contains per-file evaluation results:
| id | gt | pred | wer | string_similarity | llm_judge_score | llm_judge_reasoning |
|----|----|------|-----|-------------------|-----------------|---------------------|
| audio_1 | Hello | Hello | 0.0 | 1.0 | True | Exact match |
| audio_2 | My name is Geeta | My name is Gita | 0.25 | 0.9 | False | Name spelling differs |
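For quick inspection, results.csv can be loaded with standard tooling. The aggregation below is only a rough illustration of what metrics.json summarizes, not the tool's own computation:

```python
import pandas as pd

# Load per-file results from one evaluation run (path is an example).
df = pd.read_csv("./out/deepgram/results.csv")

# Simple averages over the per-file columns; metrics.json remains the
# authoritative aggregated summary.
print("mean WER:              ", df["wer"].mean())
print("mean string similarity:", df["string_similarity"].mean())
print("LLM judge pass rate:   ", (df["llm_judge_score"].astype(str).str.lower() == "true").mean())
```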
calibrate stt leaderboard
Generate a comparative leaderboard from multiple evaluation runs.
calibrate stt leaderboard -o <output_dir> -s <save_dir>
Arguments
| Flag | Long | Type | Required | Description |
|------|------|------|----------|-------------|
| -o | --output-dir | string | Yes | Directory containing evaluation run folders |
| -s | --save-dir | string | Yes | Directory to save leaderboard outputs |
Example
calibrate stt leaderboard -o ./out -s ./leaderboard
Output
The leaderboard command generates:
- stt_leaderboard.xlsx - Comparative spreadsheet of all providers
- Individual metric charts: wer.png, string_similarity.png, llm_judge_score.png
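The spreadsheet can also be post-processed, for example to rank providers by a single metric. A minimal sketch, assuming one row per provider and a wer column (the sheet layout and column names are assumptions, so inspect the file first):

```python
import pandas as pd

# Load the generated leaderboard (path from the example above; requires openpyxl).
board = pd.read_excel("./leaderboard/stt_leaderboard.xlsx")

# Rank providers by word error rate, lowest first (column name is an assumption).
print(board.sort_values("wer").to_string(index=False))
```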
Supported Providers
| Provider | Languages |
|----------|-----------|
| deepgram | English, Hindi |
| openai | English, Hindi |
| google | English, Hindi, Kannada, Sindhi |
| sarvam | English, Hindi, Kannada |
| elevenlabs | English, Sindhi |
| cartesia | English, Sindhi |
| smallest | English, Hindi |
| groq | English |
Sindhi STT is supported by the Google (via the chirp_2 model), Cartesia, and ElevenLabs providers.
Required Environment Variables
Set the appropriate API key for your chosen provider:
export DEEPGRAM_API_KEY=your_key
export OPENAI_API_KEY=your_key
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json
export SARVAM_API_KEY=your_key
export ELEVENLABS_API_KEY=your_key
export CARTESIA_API_KEY=your_key
export GROQ_API_KEY=your_key
export SMALLEST_API_KEY=your_key
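A small preflight check can confirm that the key for your chosen provider is set before starting a run. The provider-to-variable mapping mirrors the exports above; this helper is only an illustration, not part of the CLI:

```python
import os

# Environment variable expected for each provider (mirrors the exports above).
REQUIRED_ENV = {
    "deepgram": "DEEPGRAM_API_KEY",
    "openai": "OPENAI_API_KEY",
    "google": "GOOGLE_APPLICATION_CREDENTIALS",
    "sarvam": "SARVAM_API_KEY",
    "elevenlabs": "ELEVENLABS_API_KEY",
    "cartesia": "CARTESIA_API_KEY",
    "smallest": "SMALLEST_API_KEY",
    "groq": "GROQ_API_KEY",
}

provider = "deepgram"  # example
var = REQUIRED_ENV[provider]
if not os.environ.get(var):
    raise RuntimeError(f"{var} is not set; export it before running `calibrate stt eval -p {provider}`")
```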