Binary AI evals and why we need more than a verdict


When we build AI evaluations, we want some sort of objective way to tell whether or not an evaluator is "correct".

Outputting a true or false verdict often isn't enough: we want to be able to trust the evaluation itself, and to be able to explain to other engineers and stakeholders WHY we can trust it.


In this post, we’ll be talking specifically about binary evals. For example:

is_successful → Was this conversation successful?

An agent in finance or healthcare will have a different idea of “successful” than a customer support agent or an AI companion.

Some other binary evals (outside of success/failure) could be:

  • Did the customer get frustrated?
  • Did the agent fulfil all steps for compliance purposes?
  • Should this have been escalated to a human?
An example of a binary evaluation. There are only two possible outputs

Start by ensuring that you have a high quality pre-evaluated dataset.

If we have subject matter experts (SMEs) readily available, we can have them tag a small set of customer conversations, typically under 1,000, as successes or failures. For newer use cases, we can start with 10 to 100 example conversations.

For larger, more mature datasets with 1,000+ representative conversations, start with a human-labeled seed dataset. Then fine-tune a small LLM to extend those evaluations to the full dataset; a prompt-engineered frontier model can do the same job. Finally, have the SMEs run through the AI-generated labels as a review task.

Once you have your representative dataset, make sure that there is a statistically significant sample of both success and failure tags. A naive mix is 80% success and 20% failure.

A simple process for creating a tagged representative dataset for evaluation
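
As a rough sketch, the tagged seed dataset can be as simple as a list of records pairing each conversation with its SME label. The field names here are illustrative, not a required format:

```python
# A minimal sketch of a tagged seed dataset; the field names are illustrative.
# Each record pairs a conversation transcript with an SME-provided label.
seed_dataset = [
    {
        "conversation_id": "conv-001",
        "transcript": [
            {"role": "user", "content": "I lost my card, can you send a new one?"},
            {"role": "assistant", "content": "I can help with that. First, let's verify your identity."},
        ],
        "is_successful": True,   # SME tag
    },
    {
        "conversation_id": "conv-002",
        "transcript": [
            {"role": "user", "content": "This is the third time I'm asking for a replacement!"},
            {"role": "assistant", "content": "I'm sorry, I didn't understand that."},
        ],
        "is_successful": False,  # SME tag
    },
]

# Sanity-check the label mix, e.g. against a naive 80% success / 20% failure target.
successes = sum(1 for record in seed_dataset if record["is_successful"])
print(f"{successes}/{len(seed_dataset)} conversations tagged as success")
```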

Once we have that high quality dataset, we can measure whether or not our evaluator has high alignment to the tagged dataset.

If we have a bunch of false positives or false negatives, that’s a sign the evaluator is NOT aligned with that labeled or tagged dataset.

Concrete example: you have 400 items in your evaluation dataset, each tagged success or failure. Let's say the first 320 are successes and the last 80 are failures.

You run your evaluator. You want your evaluator to be fully aligned to the SME evaluation in your dataset. That is, you want 320 true positives, and 80 true negatives.

A flowchart showing the four possible outcomes of a single conversation. True Positive, False Negative, False Positive, True Negative.
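
In code, classifying a single conversation into one of those four outcomes is just a comparison between the SME tag and the evaluator's verdict. This sketch treats “success” as the positive class, which is a convention rather than a requirement:

```python
def outcome(sme_says_success: bool, evaluator_says_success: bool) -> str:
    """Classify one conversation by comparing the SME tag with the evaluator verdict."""
    if sme_says_success and evaluator_says_success:
        return "true_positive"    # both say success
    if sme_says_success and not evaluator_says_success:
        return "false_negative"   # SME says success, evaluator says failure
    if not sme_says_success and evaluator_says_success:
        return "false_positive"   # SME says failure, evaluator says success
    return "true_negative"        # both say failure
```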

By comparing with the pre-evaluated dataset, you can come up with an objective alignment number.

After running the numbers, you're probably starting somewhere between 40% and 80% alignment.

How to calculate percent alignment
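
A simple way to put a number on alignment is plain agreement: true positives plus true negatives, divided by the total. A minimal sketch using the 400-item example:

```python
def percent_alignment(sme_labels: list[bool], evaluator_labels: list[bool]) -> float:
    """Fraction of conversations where the evaluator agrees with the SME tag."""
    assert len(sme_labels) == len(evaluator_labels)
    agreed = sum(s == e for s, e in zip(sme_labels, evaluator_labels))
    return agreed / len(sme_labels)

# The 400-item example: the first 320 are tagged success, the last 80 failure.
sme = [True] * 320 + [False] * 80

# A fully aligned evaluator reproduces every tag: 320 TP + 80 TN -> 100%.
print(percent_alignment(sme, sme))        # 1.0

# An evaluator that misses 40 successes and 20 failures: (280 + 60) / 400 -> 85%.
evaluator = [True] * 280 + [False] * 40 + [True] * 20 + [False] * 60
print(percent_alignment(sme, evaluator))  # 0.85
```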

The key challenge with moving from a number like 45% aligned to 100% aligned is that we need to know why the evaluator disagrees.

To do this, we need to add two things to the evaluation schema: reasoning and evidence.

By adding reasoning and evidence, we as humans, or a stronger model, can inspect whether the evaluator’s logic or prompting is failing.


Without reasoning and evidence, developers often get stuck doing a manual checklist audit.

Imagine a long transcript of 30 to 50 messages back and forth.

There are often a lot of possible failure modes. Even in a straightforward financial flow where a customer service agent helps the user get a new card, there can be a bunch of steps.

Common requirements in a card replacement flow:

  • Telling the user that the conversation is being recorded and that they are speaking with an AI agent.
  • Correctly verifying the number sent to their banking app.
  • Offering expedited delivery if the card is urgent, or issuing a digital card instantly.
  • Successfully auditing recent purchases if the customer is unsure where the card is.

So if we have a long transcript with many possible failure modes and only a single true or false signal, it’s easy to get stuck.

Without reasoning, a large portion of dev time is staring at conversations trying to figure out where specifically things have failed.

Reasoning and evidence enable two things. Evidence enables failure localisation, and reasoning lets you understand why the model acted the way it did and figure out how to improve it.


These two properties cut the time per misalignment from five minutes down to five seconds.

Instead of going through a long transcript, reading everything, and hunting for the issue, you skip to "this is why it failed, and this is the evidence."

Often, you can quickly ctrl+f to jump to the evidence and make debugging much shorter.
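
If you want to automate that lookup, here is a small sketch, assuming the evaluator quotes evidence verbatim from the transcript (the function name is illustrative):

```python
def locate_evidence(transcript: list[dict], evidence: list[str]) -> list[tuple[str, int]]:
    """Map each evidence snippet to the index of the first transcript message containing it.

    Assumes evidence is quoted verbatim from the conversation. Snippets that
    can't be found come back as -1, which is itself a useful review signal
    (the evaluator may have paraphrased or hallucinated the quote).
    """
    locations = []
    for snippet in evidence:
        index = next(
            (i for i, message in enumerate(transcript) if snippet in message["content"]),
            -1,
        )
        locations.append((snippet, index))
    return locations
```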


A starting evaluator output schema can be simple.

You just need a boolean, like is_successful, a reasoning string that’s a short explanation of 1 to 3 sentences, and an evidence property that is a list of strings.

If your data is more complex, you could use a list of objects with a JSON path for evidence.

But often the evidence can be just an array of strings. These can be phrases, sentences, or paragraphs, depending on your preference.
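
As a sketch, if you are generating structured output in Python, a Pydantic model along these lines captures that shape (the class name and field descriptions are just examples):

```python
from pydantic import BaseModel, Field

class BinaryEvaluation(BaseModel):
    """Minimal evaluator output: a verdict plus the reasoning and evidence behind it."""

    is_successful: bool
    reasoning: str = Field(description="Short explanation of the verdict, 1 to 3 sentences.")
    evidence: list[str] = Field(description="Verbatim phrases, sentences, or paragraphs from the transcript.")
```

In Pydantic v2, BinaryEvaluation.model_json_schema() gives you the raw JSON schema if you want to paste it into a prompt rather than pass the model to a structured-output API.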

You can put this JSON schema as structured output, in the system prompt, inside a multi-turn conversation, or inside the assistant pre-fill.


You want to make sure that you have the following in your system prompt (a sketch pulling these together follows the list):

  • An overall persona for the evaluator to assume when evaluating
  • What success looks like for your type of conversation
  • What failure looks like for your type of conversation
  • Expectations for the reasoning block, including its length and intended purpose, possibly with few-shot examples
  • Expectations for the evidence block, in terms of length and surrounding context, possibly with few-shot examples
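
Pulling those pieces together, a starting system prompt might look roughly like the sketch below. It is something to adapt rather than a drop-in prompt; the persona and the success and failure criteria are placeholders for your own domain.

```python
# A sketch of an evaluator system prompt covering the checklist above.
# The persona and the success/failure criteria are placeholders for your own domain.
EVALUATOR_SYSTEM_PROMPT = """
You are a senior quality reviewer for customer support conversations.

A conversation is SUCCESSFUL when the customer's request is resolved and every
required step (disclosures, verification, relevant offers) was completed.
A conversation is a FAILURE when the request is unresolved, a required step was
skipped, or the conversation should have been escalated to a human.

Return JSON matching the provided schema:
- reasoning: 1 to 3 sentences explaining the verdict.
- evidence: verbatim quotes from the transcript that support the verdict.
"""
```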