
Metrics

Word Error Rate (WER)

Measures the edit distance between the predicted and reference transcripts at the word level. It calculates the minimum number of word insertions, deletions, and substitutions needed to transform the prediction into the reference, divided by the number of words in the reference.
  • Range: 0 to infinity (0 is perfect, lower is better)
  • Limitation: WER treats all word differences equally, even when the semantic meaning is preserved

Example

| Reference | Prediction | WER |
| --- | --- | --- |
| "Hello world" | "Hello world" | 0.0 |
| "Hello world" | "Hello there" | 0.5 |
| "one two three" | "1 2 3" | 1.0 |
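
The word-level edit distance behind WER can be computed with standard dynamic programming. The sketch below is for illustration only (it is not the metric's actual implementation) and applies no text normalization:

```python
# Minimal word-level WER sketch: edit distance over words divided by the
# number of reference words. Illustrative only, no normalization applied.

def word_error_rate(reference: str, prediction: str) -> float:
    ref = reference.split()
    hyp = prediction.split()
    # dp[i][j] = edits needed to turn hyp[:j] into ref[:i]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # substitution
            dp[i][j] = min(dp[i - 1][j] + 1,      # deletion
                           dp[i][j - 1] + 1,      # insertion
                           dp[i - 1][j - 1] + cost)
    return dp[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("Hello world", "Hello there"))  # 0.5
print(word_error_rate("one two three", "1 2 3"))      # 1.0
```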

String Similarity

Computes the ratio of matching characters between the normalized reference and prediction.
  • Range: 0 to 1 (1 is perfect match, higher is better)
  • Use case: Useful for catching character-level differences

Example

| Reference | Prediction | String Similarity |
| --- | --- | --- |
| "Geeta" | "Geeta" | 1.0 |
| "Geeta" | "Gita" | 0.8 |
| "Geeta" | "John" | 0.2 |

LLM Judge Score

The LLM Judge uses a powerful LLM to semantically evaluate whether the transcription matches the source text. Unlike WER, it understands context and meaning.
  • Range: 0 to 1 (1 means all transcriptions semantically match, higher is better)
  • Output: Returns both a match (True/False) and reasoning for each audio file

Why LLM Judge is necessary

Traditional metrics like WER fail in cases where the transcription is semantically correct but textually different:
| Source | Transcription | WER | LLM Judge |
| --- | --- | --- | --- |
| "1, 2, 3" | "one, two, three" | 1.0 (100% error) | True (same values) |
| "Rs. 500" | "Rupees five hundred" | 1.0 (100% error) | True (same amount) |
| "Phone: 9833472990" | "Phone number is 98334 72990" | 0.67 (67% error) | True (same number) |
| "Please write Rekha Kumari, sister." | "Please write Reha Kumari's sister." | 0.4 (40% error) | False (name is different: Rekha vs Reha) |

LLM Judge Guidelines

  • Numbers in different formats (digits vs words) are considered equivalent
  • Word spacing differences are ignored
  • Actual value differences (names, addresses, key details) are flagged as mismatches
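
A hedged sketch of how such a judge can be called is shown below. The model name, prompt wording, and JSON output schema are assumptions for illustration, not the judge's actual implementation:

```python
# Illustrative LLM-judge call. Model name, prompt, and output schema are
# assumptions; requires the openai package and an API key in the environment.
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are evaluating a speech transcription against its source text.
Treat numbers written as digits and as words as equivalent, and ignore word
spacing differences. Flag differences in actual values (names, addresses,
key details) as mismatches.

Source: {source}
Transcription: {transcription}

Respond as JSON: {{"match": true/false, "reasoning": "..."}}"""

def llm_judge(source: str, transcription: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model; substitute your preferred judge model
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(source=source,
                                                  transcription=transcription)}],
        response_format={"type": "json_object"},
    )
    # Returns e.g. {"match": true, "reasoning": "Same amount expressed in words"}
    return json.loads(response.choices[0].message.content)

print(llm_judge("Rs. 500", "Rupees five hundred"))
```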

Next Steps