Your LLM-as-a-Judge Might Be Lying to You
You build a pairwise eval pipeline, run it over your dataset, and get a clear winner. You ship the change. A few days later you pull the logs to double-check, and something doesn't add up: A beat B, B beat C, and somehow C also beat A. That