Calibrate is an open-source evaluation framework for voice agents.
Testing voice agents manually is slow, inconsistent, and doesn’t scale. Voice agents require multiple components to work well together:
  • Transcribing the user’s speech accurately and efficiently
  • Reasoning about the user’s response against the context and purpose of the conversation
  • Calling the right tools with the right arguments in the right order
  • Generating the right response for the next turn
  • Producing natural-sounding speech back to the user
  • Detecting whether the user has finished speaking or has only paused before speaking again
  • Handling interruptions by the user while the agent is speaking
  • Keeping the turnaround time low
  • Handling multiple languages, accents, dialects and code-switching
and a lot more. There is no simple way to evaluate each component across all providers on your own dataset. Even when individual components work well, it is hard to tell whether the agent as a whole will behave as expected once deployed to production. As a result, many teams ship their agents with significant uncertainty and risk, leaving users to deal with poor experiences. Calibrate solves this problem by letting you evaluate your entire voice agent stack:
  • Speech to Text (STT): Benchmark multiple providers (Google, Sarvam, ElevenLabs, and more) on your dataset across 10+ Indic languages, using metrics optimised for agentic use cases
  • Text to Speech (TTS): Automatically benchmark speech generated by multiple providers using an Audio LLM judge across 10+ Indic languages
  • Text to Text (LLMs): Evaluate the response quality and tool calling of your LLMs in multi-turn conversations and find the best LLM for your agent
  • Simulations: Simulate conversations with realistic user personas and scenarios to test your agent's failure modes, including interruptions
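To make the STT benchmarking concrete: one of the standard transcription-accuracy metrics is Word Error Rate (WER), the word-level edit distance between a reference transcript and a provider's hypothesis, normalised by the reference length. The sketch below is a generic illustration of the metric itself, not Calibrate's API; the example sentences are made up.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance between the
    reference and hypothesis transcripts, divided by the number of
    reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table: d[i][j] is the edit distance between
    # the first i reference words and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# One dropped word ("the") out of six reference words → WER of 1/6.
print(wer("book a cab to the airport", "book a cab to airport"))
```

Plain WER treats every word equally; metrics "optimised for agentic use cases" typically also weight entities such as names, numbers, and dates, since a mis-transcribed phone number hurts an agent far more than a dropped filler word.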
Calibrate helps you continuously improve your agent, ensure that a fixed bug never recurs, and deploy your agent with confidence.

Get Started

Speech to Text

Compare transcription accuracy across multiple providers on your dataset

LLM Evaluation

Create test suites that verify model responses and tool calling behavior

Text to Speech

Benchmark generated audio quality across multiple providers

Simulations

Simulate agent conversations with customizable personas and scenarios

Learn More

Core Concepts

Understand how Calibrate helps you evaluate effectively

CLI

Run evaluations from the command line

Integrations

See all supported providers for different components