Word Error Rate (WER) measures the edit distance between the predicted and reference transcripts at the word level. It is the minimum number of word insertions, deletions, and substitutions needed to transform the prediction into the reference, divided by the number of words in the reference.
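Range: 0 to infinity (0 is perfect, lower is better). Values above 1 occur when the number of errors exceeds the number of words in the reference, for example when the prediction contains many extra words.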
Limitation: WER treats all word differences equally, even when the semantic meaning is preserved
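The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the platform's implementation: the function name `word_error_rate` is hypothetical, and splitting on whitespace with no case or punctuation normalization is an assumption, since the normalization applied before scoring is not described here.

```python
def word_error_rate(reference: str, prediction: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()   # assumption: plain whitespace tokenization
    hyp = prediction.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j words of the prediction.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # reference word missing from prediction
                curr[j - 1] + 1,                  # extra word in prediction
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the cat sat", "the cat sat"))       # identical transcripts
print(word_error_rate("hello", "oh hello there my friend"))  # many extra words
print(word_error_rate("it is okay", "it's ok"))             # same meaning, different words
```

The second call returns a value above 1 because four extra words are counted against a one-word reference. The third call shows the limitation noted above: the two phrases mean the same thing, yet every word counts as an error.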
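The LLM Judge uses an LLM to evaluate whether each transcription semantically matches the reference transcription. Unlike WER, it takes context and meaning into account, so a transcription that differs in wording but preserves the meaning can still count as a match.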
Range: 0 to 1 (1 means all transcriptions semantically match, higher is better)
Output: Returns both a match (True/False) and reasoning for each audio file
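One way to read the 0 to 1 range is as the fraction of audio files whose transcription was judged a match. The sketch below assumes exactly that aggregation, which is not confirmed here; the `JudgeResult` type and `judge_score` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class JudgeResult:
    """Hypothetical container for one per-file judgement."""
    match: bool      # True if the transcription semantically matches
    reasoning: str   # the judge's explanation


def judge_score(results: list[JudgeResult]) -> float:
    """Fraction of files judged a match (assumed aggregation)."""
    if not results:
        raise ValueError("at least one result is required")
    return sum(r.match for r in results) / len(results)


results = [
    JudgeResult(True, "Same meaning; only a contraction differs."),
    JudgeResult(False, "The number in the transcript differs from the reference."),
]
print(judge_score(results))
```

Under this assumption, a score of 1 means every file matched, which agrees with the range described above.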
You can save your evaluation data as datasets and reuse them across multiple evaluations, so you don't have to re-upload the same audio files each time.
Click New dataset, enter a name, and click Create.
You’ll be taken to the dataset detail page where you can add samples in two ways:
Add samples inline — Click Upload .wav to attach an audio file and type the reference transcription for each row. Click + Add another sample to add more rows.
Bulk upload via ZIP — Upload a ZIP file containing an audios folder with .wav files and a data.csv file mapping audio files to their reference transcriptions. Click Download sample ZIP to get a template with the correct structure.
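If you already have audio files and transcriptions on disk, a ZIP with this layout can be assembled with a short script. The sketch below is illustrative only: the function name `build_dataset_zip` is hypothetical, and the CSV column names (`audio_file`, `transcription`) and the use of bare file names in the first column are assumptions. Check the sample ZIP for the exact header and adjust before uploading.

```python
import csv
import io
import zipfile
from pathlib import Path


def build_dataset_zip(samples, zip_path):
    """Write a ZIP containing an audios/ folder and a data.csv mapping file.

    samples: iterable of (path_to_wav, reference_transcription) pairs.
    """
    csv_buffer = io.StringIO(newline="")
    writer = csv.writer(csv_buffer)
    # Assumed header; replace with the column names from the sample ZIP.
    writer.writerow(["audio_file", "transcription"])

    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for wav_path, transcription in samples:
            wav_path = Path(wav_path)
            # Store each audio file under the audios/ folder.
            archive.write(wav_path, arcname=f"audios/{wav_path.name}")
            writer.writerow([wav_path.name, transcription])
        # data.csv sits at the top level of the archive, next to audios/.
        archive.writestr("data.csv", csv_buffer.getvalue())
```

For example, `build_dataset_zip([("clips/greeting.wav", "hello and welcome")], "dataset.zip")` would produce an archive containing `audios/greeting.wav` and `data.csv`; the paths in that call are placeholders.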
Once your dataset has samples, click the New evaluation button on the dataset page. This pre-selects the dataset and takes you to the evaluation settings where you choose the language and providers to compare.