Open source

AI agent evaluation for non‑profits

Existing AI evaluation tools are too complex or overly expensive for non‑profits to use. Calibrate is built by ML engineers with decades of experience to make AI evaluation accessible with best practices baked into every step

BUILT BY ARTPARK @ IIScFUNDED BY GOVERNMENT OF KARNATAKA

Evaluate the quality
of your AI responses

Define edge cases and evaluate the agent's response against custom criteria

Step

Add the conversation history as input to the agent

Text agents llm-input preview 1-1
Step

Define custom criteria to evaluate the agent's response given the conversation history

Text agents llm-evaluator preview 1-2
Step

Run the test and see whether the agent response passed the evaluation criteria

Text agents llm-output preview 1-3

Find the best LLM
for your agent

Compare different models on your tests to find the best one for your agent

Step

Pick the models to compare and run benchmarking on all tests

Text agents llm-multi-input preview 1-1
Step

Get a leaderboard across models and evaluators to pick the best LLM

Text agents llm-multi-output preview 1-2

Find the best speech‑to‑text model
for your users

Calibrate uses evaluators that compare the meaning of the predicted transcriptions with the references beyond simple rule-based metrics to rank different models

Step

Upload your audios with reference texts

Voice agents stt-upload preview 1-1
Step

Select the language, models to compare and the evaluators for measuring transcription accuracy

Voice agents stt-config preview 1-2
Step

See the leaderboard across models for each metric

Voice agents stt-leaderboard preview 1-3
Step

For each model view row-by-row outputs along with evaluator scores and reasoning

Voice agents stt-rows preview 1-4

Select the perfect voice
for your agent

Calibrate uses AI models which lets you evaluate the generated audios against the reference texts on pronunciation, clarity, naturalness and more

Step

Add the reference texts to be spoken

Voice agents tts-texts preview 2-1
Step

Select the language, models to compare and the evaluators to measure the quality of the generated audios

Voice agents tts-config preview 2-2
Step

See the leaderboard across models for each metric

Voice agents tts-leaderboard preview 2-3
Step

For each model view the generated audios for each row along with evaluator scores and reasoning

Voice agents tts-rows preview 2-4

Align your LLM judges
with human judgment

LLM evaluators are prone to mistakes as well. The only way to trust them is to compare them with human judgment. Collect human labels, measure alignment with LLM judges and iteratively improve them.

Step

Create a human alignment task and attach the evaluators you want to align. Each evaluator becomes an annotation column for every item

Human alignment human-task preview 1-1
Step

Add the items to the task that humans will review and label

Human alignment human-items preview 1-2
Step

Create labelling jobs by assigning items to annotators and send unique links so that labels from one annotator never influence another's

Human alignment human-jobs preview 1-3
Step

Each annotator labels each item across all evaluators you want to align. Measure consistency in their labels by sending a few common items to all annotators

Human alignment human-blind-link preview 1-4
Step

Run your evaluators on the same items and review disagreements between LLM judges and human labels

Human alignment human-evaluator-runs preview 1-5
Step

Iteratively improve LLM judge alignment by experimenting with different evaluator settings. Once aligned, the evaluator can be reliably used to automate monitoring.

Human alignment human-unified preview 1-6

Simulate conversations
with your agent

Catch bugs and regressions in your agent before deploying it to real users

Step

Create user personas to define who your users are

Simulations sim-personas preview 1-1
Step

Create scenarios to depict the purpose of the user's interaction with the agent

Simulations sim-purpose preview 1-2
Step

Run a simulation with personas and scenarios using custom evaluators and get performance metrics across all runs

Simulations sim-run preview 1-3
Step

Inspect each simulation run with full transcript and generated audios for the agent and the simulated user

Simulations sim-inspect preview 1-4

Proudly open source

What we open-source is what we use ourselves. Nothing hidden behind a paywall.

Self-hosting

We can help you run Calibrate on your infrastructure to ensure sensitive data stays in environments you control

No per-seat pricing. Ever.

No per-user fees. Add staff, partners, and consultants as your team grows

Auditable, end to end

The full codebase is on GitHub for pre-deploy review and real diligence

No vendor lock-in

Fork, adapt, and make changes as you wish

Works with any AI agent stack

Supports all major models with more coming soon

Supports integrations including Deepgram, ElevenLabs, OpenAI, Google, Cartesia, Anthropic, Groq, DeepSeek, Smallest AI, Claude, Gemini, Qwen, Meta, Mistral, Cohere, Sarvam, AI21, Baidu, NVIDIA, Amazon.

Join the community

Talk to the team building Calibrate to get your questions answered and shape our roadmap

Start Calibrating today

Become a team that ships trustworthy AI agents beyond vibe checks