Metrics
LLM Judge Score
An audio LLM Judge (gpt-audio) listens to the generated audio and evaluates whether the pronunciation matches the input text.
The LLM Judge directly compares the raw audio against the input text. It does
not convert the generated speech to text first — it evaluates the audio
natively using an audio-capable model.
- Range: 0 to 1 (1 means all audio correctly pronounces the text, higher is better)
- Output: Returns both a match (True/False) and reasoning for each audio file
- Correct pronunciation of words
- Proper handling of numbers, abbreviations, and special characters
Example
| Text | Audio | LLM Judge | Reasoning |
|---|---|---|---|
| ”Hello world” | True | The audio clearly says “hello world” with correct pronunciation. | |
| ”Call 1-800-555-0123” | True | The phone number is pronounced correctly. | |
| ”Dr. Smith” | False | ”Dr.” was pronounced as “dur” instead of “doctor”. |