
Metrics

LLM Judge Score

An audio LLM Judge (gpt-audio) listens to the generated audio and evaluates whether the pronunciation matches the input text.
The judge compares the raw audio directly against the input text. It does not transcribe the generated speech to text first; it evaluates the audio natively with an audio-capable model.
Evaluation accuracy may be lower for low-resource languages such as Sindhi, because the underlying audio model has limited training data for them.
  • Range: 0 to 1 (higher is better; 1 means every audio file pronounces its text correctly)
  • Output: Returns both a match (True/False) and reasoning for each audio file
What the LLM Judge evaluates:
  • Correct pronunciation of words
  • Proper handling of numbers, abbreviations, and special characters

Example

| Text | Audio | LLM Judge | Reasoning |
| “Hello world” | (audio sample) | True | The audio clearly says “hello world” with correct pronunciation. |
| “Call 1-800-555-0123” | (audio sample) | True | The phone number is pronounced correctly. |
| “Dr. Smith” | (audio sample) | False | “Dr.” was pronounced as “dur” instead of “doctor”. |
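
The sketch below shows one way the per-file verdicts could roll up into a single 0 to 1 score. It assumes the score is simply the fraction of audio files the judge marked as a match; the page does not state the exact aggregation, and the names JudgeResult and llm_judge_score are hypothetical, not part of any documented API.

```python
from dataclasses import dataclass


@dataclass
class JudgeResult:
    """One verdict from the judge (hypothetical container for illustration)."""
    text: str
    match: bool
    reasoning: str


def llm_judge_score(results):
    """Return the fraction of results marked as a match.

    Assumption: the aggregate score is the mean of the per-file matches.
    """
    if not results:
        raise ValueError("no judge results to aggregate")
    return sum(r.match for r in results) / len(results)


# The three example verdicts from the table above.
results = [
    JudgeResult("Hello world", True, "Clear, correct pronunciation."),
    JudgeResult("Call 1-800-555-0123", True, "Phone number pronounced correctly."),
    JudgeResult("Dr. Smith", False, '"Dr." pronounced as "dur".'),
]

score = llm_judge_score(results)
print(f"LLM Judge Score: {score:.2f}")  # 2 of 3 matches -> 0.67
```

Under this assumption, a single mispronounced abbreviation in a three-file run lowers the score from 1.0 to about 0.67, so the reasoning field is the place to look when the score drops.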

Time to First Byte (TTFB)

Measures the time, in seconds, from when the request is sent to the provider until the first audio chunk is received. TTFB is critical for real-time voice agents, where responsiveness matters.
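
As a rough illustration of the measurement described above, the following sketch starts a timer just before the request is issued and stops it when the first non-empty chunk arrives. Both measure_ttfb and fake_tts_stream are hypothetical names; the fake stream stands in for a real provider's streaming client, which would be passed in the same way.

```python
import time
from typing import Callable, Iterable, Iterator


def measure_ttfb(start_stream: Callable[[], Iterable[bytes]]) -> float:
    """Return seconds from issuing the request to the first non-empty chunk.

    start_stream is any zero-argument callable that sends the request and
    returns an iterable of audio chunks (an assumed interface).
    """
    start = time.perf_counter()  # start timing before the request is made
    for chunk in start_stream():
        if chunk:  # skip empty chunks; they carry no audio
            return time.perf_counter() - start
    raise RuntimeError("stream ended before any audio was received")


def fake_tts_stream() -> Iterator[bytes]:
    """Stand-in for a provider: about 50 ms of latency before each chunk."""
    time.sleep(0.05)
    yield b"\x00" * 320
    time.sleep(0.05)
    yield b"\x00" * 320


ttfb = measure_ttfb(fake_tts_stream)
print(f"TTFB: {ttfb:.3f} s")
```

time.perf_counter is used because it is a monotonic, high-resolution clock, so the result is not affected by system clock adjustments during the request.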
