Open source

AI agent evaluation for non‑profits

Existing AI evaluation tools are too complex or overly expensive for non‑profits to use. Calibrate is built by ML engineers with decades of experience to make AI evaluation accessible with best practices baked into every step

BUILT BY ARTPARK @ IIScFUNDED BY GOVERNMENT OF KARNATAKA

Evaluate the quality
of your AI responses

Define edge cases and evaluate the agent's response against custom criteria

Step

Add the conversation history as input to the agent

Step

Define custom criteria to evaluate the agent's response given the conversation history

Step

Run the test and see whether the agent response passed the evaluation criteria

Find the best LLM
for your agent

Compare different models on your tests to find the best one for your agent

Step

Pick the models to compare and run benchmarking on all tests

Step

Get a leaderboard across models and evaluators to pick the best LLM

Evaluate the quality
of your AI responses

Define edge cases and evaluate the agent's response against custom criteria

Step

Add the conversation history as input to the agent

Step

Define custom criteria to evaluate the agent's response given the conversation history

Step

Run the test and see whether the agent response passed the evaluation criteria

Find the best LLM
for your agent

Compare different models on your tests to find the best one for your agent

Step

Pick the models to compare and run benchmarking on all tests

Step

Get a leaderboard across models and evaluators to pick the best LLM

Text agents llm-multi-output preview 1-2

Find the best speech‑to‑text model
for your users

Calibrate uses evaluators that compare the meaning of the predicted transcriptions with the references beyond simple rule-based metrics to rank different models

Step

Upload your audios with reference texts

Step

Select the language, models to compare and the evaluators for measuring transcription accuracy

Step

See the leaderboard across models for each metric

Voice agents stt-leaderboard preview 1-3

Step

For each model view row-by-row outputs along with evaluator scores and reasoning

Find the best speech‑to‑text model
for your users

Calibrate uses evaluators that compare the meaning of the predicted transcriptions with the references beyond simple rule-based metrics to rank different models

Step

Upload your audios with reference texts

Step

Select the language, models to compare and the evaluators for measuring transcription accuracy

Step

See the leaderboard across models for each metric

Step

For each model view row-by-row outputs along with evaluator scores and reasoning

Select the perfect voice
for your agent

Calibrate uses AI models which lets you evaluate the generated audios against the reference texts on pronunciation, clarity, naturalness and more

Step

Add the reference texts to be spoken

Step

Select the language, models to compare and the evaluators to measure the quality of the generated audios

Step

See the leaderboard across models for each metric

Voice agents tts-leaderboard preview 2-3

Step

For each model view the generated audios for each row along with evaluator scores and reasoning

Select the perfect voice
for your agent

Calibrate uses AI models which lets you evaluate the generated audios against the reference texts on pronunciation, clarity, naturalness and more

Step

Add the reference texts to be spoken

Step

Select the language, models to compare and the evaluators to measure the quality of the generated audios

Step

See the leaderboard across models for each metric

Step

For each model view the generated audios for each row along with evaluator scores and reasoning

Align your LLM judges
with human judgment

LLM evaluators are prone to mistakes as well. The only way to trust them is to compare them with human judgment. Collect human labels, measure alignment with LLM judges and iteratively improve them.

Step

Create a human alignment task and attach the evaluators you want to align. Each evaluator becomes an annotation column for every item

Step

Add the items to the task that humans will review and label

Step

Create labelling jobs by assigning items to annotators and send unique links so that labels from one annotator never influence another's

Step

Each annotator labels each item across all evaluators you want to align. Measure consistency in their labels by sending a few common items to all annotators

Step

Run your evaluators on the same items and review disagreements between LLM judges and human labels

Step

Iteratively improve LLM judge alignment by experimenting with different evaluator settings. Once aligned, the evaluator can be reliably used to automate monitoring.

Align your LLM judges
with human judgment

LLM evaluators are prone to mistakes as well. The only way to trust them is to compare them with human judgment. Collect human labels, measure alignment with LLM judges and iteratively improve them.

Step

Create a human alignment task and attach the evaluators you want to align. Each evaluator becomes an annotation column for every item

Step

Add the items to the task that humans will review and label

Step

Create labelling jobs by assigning items to annotators and send unique links so that labels from one annotator never influence another's

Step

Each annotator labels each item across all evaluators you want to align. Measure consistency in their labels by sending a few common items to all annotators

Human alignment human-blind-link preview 1-4

Step

Run your evaluators on the same items and review disagreements between LLM judges and human labels

Human alignment human-evaluator-runs preview 1-5

Step

Iteratively improve LLM judge alignment by experimenting with different evaluator settings. Once aligned, the evaluator can be reliably used to automate monitoring.

Human alignment human-unified preview 1-6

Simulate conversations
with your agent

Catch bugs and regressions in your agent before deploying it to real users

Step

Create user personas to define who your users are

Step

Create scenarios to depict the purpose of the user's interaction with the agent

Step

Run a simulation with personas and scenarios using custom evaluators and get performance metrics across all runs

Step

Inspect each simulation run with full transcript and generated audios for the agent and the simulated user

Simulate conversations
with your agent

Catch bugs and regressions in your agent before deploying it to real users

Step

Create user personas to define who your users are

Step

Create scenarios to depict the purpose of the user's interaction with the agent

Step

Run a simulation with personas and scenarios using custom evaluators and get performance metrics across all runs

Step

Inspect each simulation run with full transcript and generated audios for the agent and the simulated user

Proudly open source

What we open-source is what we use ourselves. Nothing hidden behind a paywall.

Self-hosting

We can help you run Calibrate on your infrastructure to ensure sensitive data stays in environments you control

No per-seat pricing. Ever.

No per-user fees. Add staff, partners, and consultants as your team grows

Auditable, end to end

The full codebase is on GitHub for pre-deploy review and real diligence

No vendor lock-in

Fork, adapt, and make changes as you wish

artpark-sahai-org/calibrate

Works with any AI agent stack

Supports all major models with more coming soon

See all integrations→

Join the community

Talk to the team building Calibrate to get your questions answered and shape our roadmap

WhatsApp Let's talk

Team

Combined experience of 25+ years building AI systems

Aman Dalmia

Principal ML Engineer, Artpark

Jigar Doshi

Director of ML, Artpark

Start Calibrating today

Become a team that ships trustworthy AI agents beyond vibe checks

Get started for free Read the docs

AI agent evaluation for non‑profits

Evaluate the qualityof your AI responses

Add the conversation history as input to the agent

Define custom criteria to evaluate the agent's response given the conversation history

Run the test and see whether the agent response passed the evaluation criteria

Find the best LLMfor your agent

Pick the models to compare and run benchmarking on all tests

Get a leaderboard across models and evaluators to pick the best LLM

Evaluate the qualityof your AI responses

Add the conversation history as input to the agent

Define custom criteria to evaluate the agent's response given the conversation history

Run the test and see whether the agent response passed the evaluation criteria

Find the best LLMfor your agent

Pick the models to compare and run benchmarking on all tests

Get a leaderboard across models and evaluators to pick the best LLM

Find the best speech‑to‑text modelfor your users

Upload your audios with reference texts

Select the language, models to compare and the evaluators for measuring transcription accuracy

See the leaderboard across models for each metric

For each model view row-by-row outputs along with evaluator scores and reasoning

Find the best speech‑to‑text modelfor your users

Upload your audios with reference texts

Select the language, models to compare and the evaluators for measuring transcription accuracy

See the leaderboard across models for each metric

For each model view row-by-row outputs along with evaluator scores and reasoning

Select the perfect voicefor your agent

Add the reference texts to be spoken

Select the language, models to compare and the evaluators to measure the quality of the generated audios

See the leaderboard across models for each metric

For each model view the generated audios for each row along with evaluator scores and reasoning

Select the perfect voicefor your agent

Add the reference texts to be spoken

Select the language, models to compare and the evaluators to measure the quality of the generated audios

See the leaderboard across models for each metric

For each model view the generated audios for each row along with evaluator scores and reasoning

Align your LLM judgeswith human judgment

Create a human alignment task and attach the evaluators you want to align. Each evaluator becomes an annotation column for every item

Add the items to the task that humans will review and label

Create labelling jobs by assigning items to annotators and send unique links so that labels from one annotator never influence another's

Each annotator labels each item across all evaluators you want to align. Measure consistency in their labels by sending a few common items to all annotators

Run your evaluators on the same items and review disagreements between LLM judges and human labels

Iteratively improve LLM judge alignment by experimenting with different evaluator settings. Once aligned, the evaluator can be reliably used to automate monitoring.

Align your LLM judgeswith human judgment

Create a human alignment task and attach the evaluators you want to align. Each evaluator becomes an annotation column for every item

Add the items to the task that humans will review and label

Create labelling jobs by assigning items to annotators and send unique links so that labels from one annotator never influence another's

Each annotator labels each item across all evaluators you want to align. Measure consistency in their labels by sending a few common items to all annotators

Run your evaluators on the same items and review disagreements between LLM judges and human labels

Iteratively improve LLM judge alignment by experimenting with different evaluator settings. Once aligned, the evaluator can be reliably used to automate monitoring.

Simulate conversationswith your agent

Create user personas to define who your users are

Create scenarios to depict the purpose of the user's interaction with the agent

Run a simulation with personas and scenarios using custom evaluators and get performance metrics across all runs

Inspect each simulation run with full transcript and generated audios for the agent and the simulated user

Simulate conversationswith your agent

Create user personas to define who your users are

Create scenarios to depict the purpose of the user's interaction with the agent

Run a simulation with personas and scenarios using custom evaluators and get performance metrics across all runs

Inspect each simulation run with full transcript and generated audios for the agent and the simulated user

Proudly open source

Self-hosting

No per-seat pricing. Ever.

Auditable, end to end

No vendor lock-in

Works with any AI agent stack

Join the community

Team

Aman Dalmia

Jigar Doshi

Start Calibrating today

Evaluate the quality
of your AI responses

Find the best LLM
for your agent

Evaluate the quality
of your AI responses

Find the best LLM
for your agent

Find the best speech‑to‑text model
for your users

Find the best speech‑to‑text model
for your users

Select the perfect voice
for your agent

Select the perfect voice
for your agent

Align your LLM judges
with human judgment

Align your LLM judges
with human judgment

Simulate conversations
with your agent

Simulate conversations
with your agent