Test First, Ask Questions Later: Build Successful LLM Apps by Leading With Evaluations

Test-Driven Development (TDD) has been around since the 1990s and is a well-established paradigm for software development. Instead of writing tests after you write your code, you write the tests first and then implement the features that make those tests pass.

An example would be writing a test that asserts a certain output from an Application Programming Interface (API) endpoint, then writing the code that allows that test to pass.
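To make that concrete, here is a minimal sketch of the test-first workflow in Python with pytest. The my_api module, the get_product function, and the expected payload are all hypothetical; the point is that the test exists before the implementation does.

```python
# test_products.py -- written before the API handler exists (the import fails until it does).
from my_api import get_product  # hypothetical module under test


def test_get_product_returns_expected_fields():
    # Assert the shape and content we want the endpoint to return.
    response = get_product(product_id=42)
    assert response["status"] == 200
    assert response["body"]["name"] == "Classic Leather Sneaker"
    assert "price" in response["body"]
```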

While the utility of TDD is a hotly debated topic in the traditional software development industry, it’s crucial to designing well-architected Large Language Model (LLM) applications. In addition to all the benefits of traditional software testing, like establishing performance standards and enabling regression testing, there are several other benefits to this approach.

First, based on my experience as an AI consultant, too many companies are throwing LLMs at just about any problem, hoping to find a shortcut without fully fleshing out the finer details and requirements of what they are actually trying to solve.

Leading with evaluations forces businesses and stakeholders to think through and thoroughly define the business requirements of the project before starting the development cycle. Sometimes they may find that an LLM is not the best answer, and that’s okay.

For example, imagine you run an online store specializing in designer shoes and apparel. Your copywriting department writes descriptions for new products. The engineering team is tasked with using LLMs to automate this process.

The engineers collaborate with copywriting experts to understand their workflow and create a gold standard test set for evaluating the system. They develop a set of prompts and desired outcomes to assess the LLM’s performance.

During this evaluation, they discover that the main challenge for copywriters is assessing the quality of their work after writing descriptions. This insight leads the team to shift its focus. Instead of using LLMs to write descriptions outright, they develop a “copilot” tool. This tool offers real-time suggestions to improve the descriptions, enhancing the copywriters’ work rather than replacing it.

Second, language models are probabilistic. With traditional code, the output is typically deterministic: we control every decision that is made and can predict with extremely high confidence what the outcome will be. With LLMs, there is no guaranteed outcome; we can only increase the probability of getting the desired result.

In both cases, defining an architecture before putting hands on keyboards is important. With LLMs, it's also important to define your target ahead of time. This avoids an infinite prompting loop where you guess-and-check your way into a corner that is difficult to back out of. With tests, you have targets you can prompt against to measure progress and catch any regressions.

Lastly, LLM TDD provides a well-defined set of criteria against which you can measure the system once it hits production. This could be a mix of closed-form metrics and LLM-as-a-judge type metrics, where another LLM is prompted to evaluate the system output.

With this in mind, let’s go over what some of these evaluations might look like.

Closed-form evaluations

While the explosion of modern language models started just a couple of years ago, Natural Language Processing (NLP) is a discipline that has existed for several decades. This has led to an extraordinary amount of research on the nuances and complexities of human language.

Some examples of closed-form evaluations include:

  • Bilingual Evaluation Understudy (BLEU): BLEU measures the precision of n-grams (a contiguous sequence of words) between machine-translated text and reference translations.
  • Metric for Evaluation of Translation with Explicit ORdering (METEOR): This metric enhances BLEU by considering synonyms, stemming, and paraphrasing. It combines precision and recall, giving more weight to recall so that relevant content is captured more effectively.
  • Recall-Oriented Understudy for Gisting Evaluation (ROUGE): ROUGE is a family of metrics that evaluates the quality of summaries and translations generated by NLP models by analyzing the overlap of words, phrases, and sequences between machine-generated and reference text.

While these metrics capture the proportion of words and phrases shared between generated content and ground-truth text, they say nothing about the semantic quality of the generated text.

To demonstrate this, let’s take the following example of generated text compared to ground-truth text:

  • Generated text: The fast brown fox leaps over the lazy dog.
  • Ground-truth text: The quick brown fox jumps over a lazy dog.

Evaluating these two sentences with a metric like METEOR may still produce a high score, because fast and leaps are treated as synonyms of quick and jumps. But that score doesn't consider how the different word choices change how the text reads.

In addition, while word-overlap metrics such as BLEU and ROUGE will reward the shared phrases (brown fox, over, lazy dog), they don't capture the subtle difference between a and the, which affects the meaning of the sentence. This line of thinking extends to the other metrics as well.
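As a rough illustration, here is how those surface-level scores could be computed for the two sentences above using Hugging Face's evaluate library (an assumption on my part; any BLEU, METEOR, or ROUGE implementation works similarly):

```python
# pip install evaluate nltk rouge_score sacrebleu  (assumed dependencies)
import evaluate

generated = ["The fast brown fox leaps over the lazy dog."]
references = [["The quick brown fox jumps over a lazy dog."]]

bleu = evaluate.load("bleu").compute(predictions=generated, references=references)
meteor = evaluate.load("meteor").compute(predictions=generated, references=references)
rouge = evaluate.load("rouge").compute(predictions=generated, references=references)

# Each metric returns scores between 0 and 1 based on word and phrase overlap;
# none of them says whether the two sentences actually mean the same thing.
print(bleu["bleu"], meteor["meteor"], rouge["rougeL"])
```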

To account for the shortcomings of these metrics, they should be accompanied or even replaced by additional LLM-powered metrics that offer deeper insights into the coherence, grammatical correctness, and contextual relevance of a given piece of text.

Language model powered metrics

By leveraging the power of language models, we can capture more subtle characteristics of the given text and provide a more accurate picture of how that output compares to some baseline.

Most of these techniques leverage embeddings to measure similarity. These embeddings are trained mathematical representations of text that encode the semantic meaning of a word, sentence, or larger chunk of text.
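As a minimal sketch of that idea, the sentence-transformers library (an assumed dependency, with one commonly used model name) can embed two pieces of text and compare them with cosine similarity:

```python
# pip install sentence-transformers  (assumed dependency)
from sentence_transformers import SentenceTransformer, util

# A small, widely used embedding model; swap in whatever your stack provides.
model = SentenceTransformer("all-MiniLM-L6-v2")

generated = "The fast brown fox leaps over the lazy dog."
reference = "The quick brown fox jumps over a lazy dog."

embeddings = model.encode([generated, reference])
similarity = util.cos_sim(embeddings[0], embeddings[1])
print(float(similarity))  # approaches 1.0 when the meanings are nearly identical
```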

BERTScore

BERTScore uses token-level embeddings from Bidirectional Encoder Representations from Transformers (BERT) to evaluate the semantic similarity between generated and reference texts. It matches tokens between the two texts using cosine similarity, which ranges from -1 to 1, with -1 meaning semantically opposite and 1 meaning the two pieces of text are semantically identical.
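A minimal sketch with the bert-score package (assuming it is installed) looks like this:

```python
# pip install bert-score  (assumed dependency)
from bert_score import score

candidates = ["The fast brown fox leaps over the lazy dog."]
references = ["The quick brown fox jumps over a lazy dog."]

# Returns precision, recall, and F1 tensors, one value per candidate/reference pair.
precision, recall, f1 = score(candidates, references, lang="en")
print(f1.item())
```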

COMET

Cross-lingual Optimized Metric for Evaluation of Translation (COMET) uses multilingual sentence embeddings from a task-specific, pre-trained model to score how closely a candidate translation matches a reference translation. It also uses the original source text to provide additional context for the calculation.
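Here is a sketch using the unbabel-comet package; the checkpoint name and the source sentence are assumptions, chosen only to show the three inputs the metric expects:

```python
# pip install unbabel-comet  (assumed dependency)
from comet import download_model, load_from_checkpoint

# Download one of the published reference-based COMET checkpoints (assumed name).
model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)

data = [
    {
        "src": "Der schnelle braune Fuchs springt über den faulen Hund.",  # original source text
        "mt": "The fast brown fox leaps over the lazy dog.",               # candidate translation
        "ref": "The quick brown fox jumps over a lazy dog.",               # reference translation
    }
]

output = model.predict(data, batch_size=8, gpus=0)
print(output.system_score)
```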

LLM-as-a-judge

LLM-as-a-judge uses an evaluation prompt and another LLM to evaluate the output text. It can be used to compare generated text with reference text. It can also perform pairwise comparison, where it evaluates two outputs to determine which one better meets a provided criterion.

This is the most flexible of the metrics mentioned so far and is an area of active research. Some common qualities to measure with this technique include:

  • Coherence: Assesses the logical flow of the generated text
  • Fluency: Assesses the naturalness and readability of the generated text
  • Relevance: Assesses whether a model’s responses relate to the input and provide useful information
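As a sketch, an LLM-as-a-judge evaluation is simply a structured prompt sent to a second model. The judge_coherence helper below is hypothetical, and the call assumes the OpenAI Python SDK and a particular model name; swap in whichever provider and rubric you actually use:

```python
# pip install openai  (assumed dependency; any chat-completion client works the same way)
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading a product description for coherence.
Score the text from 1 (incoherent) to 5 (perfectly coherent) and explain briefly.
Respond as: SCORE: <n> | REASON: <one sentence>

Text to grade:
{text}"""


def judge_coherence(text: str) -> str:
    """Hypothetical helper: ask a second model to grade coherence against a rubric."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(text=text)}],
        temperature=0,
    )
    return response.choices[0].message.content


print(judge_coherence("Crafted from Italian leather, these sneakers pair with anything in your closet."))
```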

When starting with evaluations, it's important to pick the metrics that matter most for your use case. But you also need a gold standard test set, often referred to as a ground-truth dataset, that is representative of the system inputs and outputs you expect to see. This could be anything from text-to-SQL queries to HR questions.

Although you can automate the process of producing this test set, the test set should be human-evaluated and curated for completeness. This is where it’s useful to have Subject Matter Experts (SMEs) ensure that you receive high-quality data, as this will be the driving force in your development cycle.
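The exact shape of this set depends on your use case, but a small, human-curated file that pairs inputs with expected outputs is often enough. Here is a hypothetical example for the product-description scenario, stored as JSONL so it can be versioned alongside the code:

```python
import json

# Hypothetical gold-standard examples, curated and reviewed by copywriting SMEs.
GOLD_SET = [
    {
        "input": "Women's leather ankle boot, block heel, made in Spain",
        "expected": "Handcrafted in Spain, this leather ankle boot pairs a sleek "
                    "silhouette with a walkable block heel.",
    },
    {
        "input": "Men's merino wool crewneck sweater, slim fit",
        "expected": "A slim-fit crewneck knit from soft merino wool, built for "
                    "layering from fall through spring.",
    },
]

# Persist as JSONL so the test set lives in version control next to the code.
with open("gold_set.jsonl", "w") as f:
    for example in GOLD_SET:
        f.write(json.dumps(example) + "\n")
```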

Using the metrics you define, run your evaluations against the inputs of your gold standard set to produce a set of scores. Again, it’s important to have a well-defined scoring rubric for grading and to use a human to provide these scores when possible.

Once you have these, you can begin your development cycle. As you build out and add features to your system, you should be running your evaluations using your gold standard test set to guard against regression. This can be done through a combination of traditional CI/CD processes, such as pre-commit hooks and GitHub Actions.
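One lightweight way to wire this in is a pytest check that runs the evaluation over the gold standard set and fails the build if the aggregate score drops below an agreed threshold. Everything here (generate_description, semantic_similarity, and the 0.8 threshold) is a hypothetical stand-in for your own system and metrics:

```python
# test_regression.py -- run locally, in a pre-commit hook, or in a GitHub Actions job.
import json

from my_app import generate_description         # hypothetical system under test
from my_app.metrics import semantic_similarity  # hypothetical metric wrapper

THRESHOLD = 0.8  # agreed with stakeholders; tune to your scoring rubric


def load_gold_set(path="gold_set.jsonl"):
    with open(path) as f:
        return [json.loads(line) for line in f]


def test_descriptions_meet_quality_bar():
    examples = load_gold_set()
    scores = [
        semantic_similarity(generate_description(ex["input"]), ex["expected"])
        for ex in examples
    ]
    average = sum(scores) / len(scores)
    assert average >= THRESHOLD, f"Average score {average:.2f} fell below {THRESHOLD}"
```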

Conclusion

Starting with evaluations not only improves your development cycle but also provides clarity about the business problem you are trying to solve. Without this, you can find yourself in an endless game of guess-and-vibe-check during development and unnecessarily increase the risk of missing the mark on the objective.

Summary of steps

  1. Consult with business stakeholders and SMEs to refine the problem and brainstorm the definition of success.
  2. Choose metrics that align with objectives to ensure the measurements are meaningful and can be communicated clearly to a non-technical audience.
  3. Set up an evaluation process for engineers to assess systems as the project progresses and new features are added. This can be manual or automated, but it should create as little friction as possible for developers.
  4. Run these evaluations periodically to defend against any regression occurring in the quality of outcomes.

Proper evaluations are essential for keeping projects on track. Implementing this approach can help reduce risk, save development costs, and simplify the evaluation and communication of progress. This increases the likelihood of success for LLM projects.

Reimagine your future with AI-powered solutions. Learn the ways Insight can support your AI transformation journey.

Chris Thompson

Architect, Insight

Chris is an open-source technology advocate based in Lake Oswego, Oregon. With expertise in MLOps and DevOps, he helps companies improve their workflows by leveraging cutting-edge technologies.