Building Your First Evaluation System for LLM Applications

Avi Santoso

🚧 Work in Progress 🚧
This blog post is currently being written and may be incomplete or subject to changes. Check back soon for updates!

Much has been written about the importance of a robust and thorough testing suite to the long-term health and scalability of any software product. When a team can trust its existing unit and integration tests, development speeds up: new features can be built faster, and earlier work can be refactored with confidence.

This holds true even when we integrate our systems with large language models (LLMs), or build entirely new applications on top of these technologies. There is an inherent difference, however, between classical software projects and those that use LLMs: LLMs are non-deterministic at their core.

This difference means that we, as software engineers, have to relearn and redevelop our body of knowledge about how to build a reliable and trustworthy testing suite. When we test a deterministic system, unit and integration tests work well. When testing a non-deterministic system, we must use evaluation tests, also called evals.

Evals are a way to gain confidence that a non-deterministic AI system will perform to a given level of quality and success against a supplied data set. This blog post outlines a basic system that I am using to go from zero to a basic level of evals in an LLM application built primarily on prompt engineering.

I will first outline the overall steps before diving into a case study, with examples of prompts I might use and steps I might take to build this system.

The steps are as follows:

  1. At a high level, come up with a basic understanding of what your system can do by defining it in a matrix. A simple matrix might consist of features, scenarios, and personas.

  2. Using prompt engineering, generate a large number of comprehensive test scenarios. This can also be a multi-step prompt: first generate a series of Given-When-Then triads, then generate the actual scenario setup data.

  3. Come up with a list of deterministic pass/fail tests that can be run against all of the outputs. For example: the presence or absence of certain words; whether or not a function was called; whether or not the database was updated; the response length and the number of sentences returned; the semantic similarity to a specified string. (A minimal sketch of such checks follows this list.)

  4. Create a system of versioning for prompts, and generate a report for each evaluation run. Track these reports over time. Decide on a minimum pass/fail threshold, and iterate on the prompt until you reach that threshold. Set up the system to produce a scheduled daily, weekly, or monthly report.

  5. Create a dashboard where it's easy for the team to view real conversations logged from the system. Allow anyone to pull real conversations into the testing system with one click. Make sure to add obfuscation if a conversation contains sensitive personally identifiable information.
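
To make steps 3 and 4 a little more concrete, here is a minimal sketch of what the deterministic checks and a versioned report entry could look like. It assumes Python, and every name in it (check_response, EvalResult, the 0.8 threshold) is an illustrative placeholder rather than a prescribed implementation.

# A minimal sketch of deterministic checks and a versioned report entry.
# All names and the threshold below are illustrative placeholders.
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class EvalResult:
    scenario_id: str
    passed: bool
    details: dict = field(default_factory=dict)


def check_response(response: str, required_words: list[str],
                   forbidden_words: list[str], max_sentences: int) -> EvalResult:
    """Run simple deterministic checks against a single model response."""
    missing = [w for w in required_words if w.lower() not in response.lower()]
    forbidden_found = [w for w in forbidden_words if w.lower() in response.lower()]
    sentence_count = response.count(".") + response.count("!") + response.count("?")
    passed = not missing and not forbidden_found and sentence_count <= max_sentences
    return EvalResult(
        scenario_id="",  # filled in by the caller
        passed=passed,
        details={
            "missing_words": missing,
            "forbidden_words_found": forbidden_found,
            "sentence_count": sentence_count,
        },
    )


def build_report(prompt_version: str, results: list[EvalResult],
                 threshold: float = 0.8) -> dict:
    """Aggregate results into a report entry tagged with the prompt version."""
    pass_rate = sum(r.passed for r in results) / len(results) if results else 0.0
    return {
        "prompt_version": prompt_version,
        "run_at": datetime.now(timezone.utc).isoformat(),
        "pass_rate": pass_rate,
        "meets_threshold": pass_rate >= threshold,
        "total_scenarios": len(results),
    }

Storing each report entry, whether in a database or simply a JSON file per run, is what gives you the history to compare prompt versions over time.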

At the end of this process, you will have:

  • A basic testing suite for your non-deterministic system
  • A report that can be run ad hoc, with a history of how different generations of the system fared and how it improved over time
  • A dashboard that allows real conversations to be added into this testing suite

Admittedly, at the end of this process, there are a few issues that we must still address.

The first real problem is that the data generated initially is not real data, and must eventually be replaced with real data, or with fake data that is provably as close to the real data as possible.

The second problem is that deterministic tests can only superficially test a non-deterministic system; fully judging the success of a conversation requires another non-deterministic system. This can be a human being, or another LLM: a grader LLM.

I will discuss the solutions to these two problems in a later blog post; they require an extension of the current system to accommodate them.


Case Study: T-Shirtify

Let's start with a customer service chatbot for "T-Shirtify", a fictional e-commerce company that sells t-shirts.

Step 1: Define the system

The first step is to define the system. We will start with a simple matrix of features, scenarios, and personas.

Features (What the system can do):

  • Order Tracking. The user can ask about the status of their order and get an update on the latest status information.

  • Return Processing. The user can ask to start a return process, and the system will guide them through the initial steps, asking any required information.

  • Product Recommendations. The user can ask a general query about whether a specific product exists, and the system will provide a recommendation.

Personas (Who the system is designed to help):

  • Joe, Confused First-Time User. Joe is a 52-year-old high school teacher from Portland who rarely shops online and prefers calling customer service. He's a first-time user of the system, and is confused about how to use it.

  • Karen, Frustrated Customer. Karen is a 45-year-old busy mother of two from suburban Chicago who values efficiency and clear communication. She's a frustrated customer, and is angry about the system not working as expected.

  • Eva, Wholesaler. Eva is a 35-year-old small business owner from Miami who runs a local boutique and frequently orders inventory online. She's a wholesaler, and is looking to buy a large quantity of t-shirts for resale.

Scenarios (A mapping of feature -> persona -> what they are trying to do):

  • Joe's Product Recommendations. Joe is looking for a t-shirt that matches a specific image or vibe, and the system will provide a recommendation.

  • Karen's Return Processing. Karen is looking to return a t-shirt that she received, and the system will guide her through the process.

  • Eva's Order Tracking. Eva is looking to track the status of an order, and the system will provide an update on the latest status information.
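
To keep this matrix out of people's heads and in the codebase, it helps to represent it as plain data that the scenario generator and the test runner can both read. The sketch below assumes Python; the class and field names are my own illustrative choices, not a required schema.

# One possible way to represent the feature / persona / scenario matrix as data.
# Class and field names are illustrative, not a required schema.
from dataclasses import dataclass


@dataclass
class Feature:
    name: str
    description: str


@dataclass
class Persona:
    name: str
    description: str


@dataclass
class Scenario:
    name: str
    feature: Feature
    persona: Persona
    goal: str


product_recommendations = Feature(
    name="Product Recommendations",
    description="The user can ask whether a specific product exists, "
                "and the system will provide a recommendation.",
)

joe = Persona(
    name="Joe, Confused First-Time User",
    description="52-year-old high school teacher from Portland, rarely shops "
                "online, first-time user who is confused by the system.",
)

scenarios = [
    Scenario(
        name="Joe's Product Recommendations",
        feature=product_recommendations,
        persona=joe,
        goal="Find a t-shirt that matches a specific image or vibe.",
    ),
]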

Step 2: Generate test scenarios

Now that we have defined the system, we can generate test scenarios. We will use prompt engineering to generate a large number of test scenarios. Here is an example of the context variables that such a prompt can use:

Using the following variables, I was able to generate 20 test scenarios that I've linked below:

<Context>
    <ApplicationFeature>
        <Name>Product Recommendations</Name>
        <Description>The user can ask a general query about whether a specific product exists, and the system will provide a recommendation.</Description>
    </ApplicationFeature>
    <Persona>
        <Name>Joe, Confused First-Time User</Name>
        <Description>Joe is a 52-year-old high school teacher from Portland who rarely shops online and prefers calling customer service. He's a first-time user of the system, and is confused about how to use it.</Description>
    </Persona>
    <FeatureScenario>
        <Name>Joe's Product Recommendations</Name>
        <Description>Joe is looking for a t-shirt that matches a specific image or vibe, and the system will provide a recommendation.</Description>
    </FeatureScenario>
</Context>
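
As a rough sketch of how this context could be used programmatically, the snippet below passes it to a chat model and asks for Given-When-Then scenarios. It assumes the official OpenAI Python client and an OPENAI_API_KEY in the environment; the model name and the instruction wording are placeholders, not the exact prompt I use.

# Sketch: feed the context above to a chat model to generate test scenarios.
# Assumes the official OpenAI Python client; the model name is a placeholder.
from openai import OpenAI

client = OpenAI()

context_xml = """<Context>
    <ApplicationFeature>...</ApplicationFeature>
    <Persona>...</Persona>
    <FeatureScenario>...</FeatureScenario>
</Context>"""  # in practice, the full XML block shown above

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[
        {"role": "system",
         "content": "You generate test scenarios for a customer service chatbot."},
        {"role": "user",
         "content": context_xml + "\n\nGenerate 20 test scenarios as "
                    "Given-When-Then triads, each with the concrete setup data "
                    "needed to run it."},
    ],
)

print(response.choices[0].message.content)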
