Binary AI evals and why we need more than a verdict
When we build AI evaluations, we want an objective way to tell whether an evaluator is "correct". Outputting a bare true or false verdict often isn't enough: we want to be able to trust the evaluation itself, and to communicate why that verdict was reached.
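As a minimal sketch of what "more than a verdict" could look like, an eval result can carry a rationale alongside the boolean. The names here (`EvalResult`, `evaluate_answer`) and the substring-matching check are illustrative assumptions, not a specific framework's API:

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    """A verdict plus the context needed to trust and communicate it."""
    verdict: bool      # the binary pass/fail outcome
    rationale: str     # human-readable explanation of the verdict


def evaluate_answer(answer: str, expected: str) -> EvalResult:
    # Toy evaluator: case-insensitive substring check, with the
    # reasoning recorded so a reviewer can audit the verdict.
    matched = expected.lower() in answer.lower()
    status = "found" if matched else "not found"
    return EvalResult(
        verdict=matched,
        rationale=f"Expected substring {expected!r} {status} in answer.",
    )


result = evaluate_answer("Paris is the capital of France.", "Paris")
```

A consumer of this result can surface `rationale` to a human reviewer instead of (or alongside) the raw boolean.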