
Metrics

Word Error Rate (WER)

Measures the edit distance between the predicted and reference transcripts at the word level. It calculates the minimum number of word insertions, deletions, and substitutions needed to transform the prediction into the reference, divided by the number of words in the reference.
  • Range: 0 to infinity (0 is perfect, lower is better)
  • Limitation: WER treats all word differences equally, even when the semantic meaning is preserved

Example

| Reference | Prediction | WER |
| --- | --- | --- |
| "Hello world" | "Hello world" | 0.0 |
| "Hello world" | "Hello there" | 0.5 |
| "one two three" | "1 2 3" | 1.0 |
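
The word-level edit distance behind WER can be computed with standard dynamic programming. The sketch below is for illustration only (it is not the metric's actual implementation) and applies no text normalization:

```python
# Minimal word-level WER sketch: edit distance over words divided by the
# number of reference words. Illustrative only, no normalization applied.

def word_error_rate(reference: str, prediction: str) -> float:
    ref = reference.split()
    hyp = prediction.split()
    # dp[i][j] = edits needed to turn hyp[:j] into ref[:i]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # substitution
            dp[i][j] = min(dp[i - 1][j] + 1,      # deletion
                           dp[i][j - 1] + 1,      # insertion
                           dp[i - 1][j - 1] + cost)
    return dp[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("Hello world", "Hello there"))  # 0.5
print(word_error_rate("one two three", "1 2 3"))      # 1.0
```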

String Similarity

Computes the ratio of matching characters between the normalized reference and prediction.
  • Range: 0 to 1 (1 is perfect match, higher is better)
  • Use case: Useful for catching character-level differences

Example

| Reference | Prediction | String Similarity |
| --- | --- | --- |
| "Geeta" | "Geeta" | 1.0 |
| "Geeta" | "Gita" | 0.8 |
| "Geeta" | "John" | 0.2 |

LLM Judge Score

The LLM Judge uses a powerful LLM to semantically evaluate whether the transcription matches the source text. Unlike WER, it understands context and meaning.
  • Range: 0 to 1 (1 means all transcriptions semantically match, higher is better)
  • Output: Returns both a match (True/False) and reasoning for each audio file

Why LLM Judge is necessary

Traditional metrics like WER fail in cases where the transcription is semantically correct but textually different:
| Source | Transcription | WER | LLM Judge |
| --- | --- | --- | --- |
| "1, 2, 3" | "one, two, three" | 1.0 (100% error) | True (same values) |
| "Rs. 500" | "Rupees five hundred" | 1.0 (100% error) | True (same amount) |
| "Phone: 9833472990" | "Phone number is 98334 72990" | 0.67 (67% error) | True (same number) |
| "Please write Rekha Kumari, sister." | "Please write Reha Kumari's sister." | 0.4 (40% error) | False (name is different: Rekha vs Reha) |

LLM Judge Guidelines

  • Numbers in different formats (digits vs words) are considered equivalent
  • Word spacing differences are ignored
  • Actual value differences (names, addresses, key details) are flagged as mismatches
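
A hedged sketch of how such a judge can be called is shown below. The model name, prompt wording, and JSON output schema are assumptions for illustration, not the judge's actual implementation:

```python
# Illustrative LLM-judge call. Model name, prompt, and output schema are
# assumptions; requires the openai package and an API key in the environment.
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are evaluating a speech transcription against its source text.
Treat numbers written as digits and as words as equivalent, and ignore word
spacing differences. Flag differences in actual values (names, addresses,
key details) as mismatches.

Source: {source}
Transcription: {transcription}

Respond as JSON: {{"match": true/false, "reasoning": "..."}}"""

def llm_judge(source: str, transcription: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model; substitute your preferred judge model
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(source=source,
                                                  transcription=transcription)}],
        response_format={"type": "json_object"},
    )
    # Returns e.g. {"match": true, "reasoning": "Same amount expressed in words"}
    return json.loads(response.choices[0].message.content)

print(llm_judge("Rs. 500", "Rupees five hundred"))
```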

Next Steps