Your AI Prompts Are Failing, and You Probably Don’t Even Know It

In my job at Libretto, I talk to a lot of folks who use LLM prompts in their software, and I ask all of them, “How do you know it’s working in production?” The median answer I get back is “well, we don’t really have a comprehensive system in place” or “we do some spot-checking” or “we don’t hear complaints from users”. Needless to say, none of these are answers we would accept in more traditional software development.
I have a new rough metric for software teams that are incorporating LLMs: how many LLM failures in your product can you tell me about off the top of your head? If you know about more than a dozen in the last month, that’s not too shabby. If you only know about a handful, that’s worrisome. But if you can’t rattle off any LLM failures, you’re in trouble.
The weird, contradictory truth about LLMs is that they are both extremely powerful and constantly making mistakes. If you don’t know about ways that LLMs have failed in your product, which possibility is more likely: that your LLM prompts are working 100% of the time, or that they are failing your users and you aren’t noticing? I’ll bet on the latter every day of the week.
So, how do we fix this? At Libretto, we believe we should look to existing software development practices, specifically test-driven development. Test-driven development is the idea that you start with test cases, writing failing tests before you even write the code, and build up a corpus of tests as you go. Test-driven development gets a lot more complicated with LLMs, though, for three reasons:
- Lack of determinism: LLMs change their behavior seemingly on a whim, and tests that were passing yesterday may not pass today.
- Lack of predictability: Unlike “normal” code, whose behavior you can predict relatively confidently just by reading it, there is no way to know what a prompt will do until you actually try it out.
- Lack of imagination: With normal test-driven development, the range of inputs to your code is constrained to well-known data types, while with LLMs the range of inputs is the entirety of human language and thought.
These difficulties have led us to develop three principles for creating prompts that actually work in production, and they’ve guided us in building Libretto.
Principle 1: Tests are more important than prompts.
When integrating LLMs into your code, test-driven development teaches us that it’s more important to make good tests than it is to have a great prompt. Prompting techniques change as models gain new capabilities and features, so a great prompt today may be worthless in a month or two. But your tests define how you need your prompt to behave, so they will be useful for as long as the prompt is part of your system.
Seen properly, most of the work of prompt development is really test development. Unlike traditional machine learning, you don’t need hundreds of thousands or millions of pieces of training data, but you do need a body of tests for each prompt. And if you have fewer than, say, a hundred tests for any particular prompt, you probably don’t have much of an idea of how well that prompt works in practice.
This is why Libretto is so focused on the creation of test cases for your prompts; once you have a real corpus of tests, the prompt engineering part becomes an order of magnitude easier.
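To make that concrete, here’s a minimal sketch in Python of what a prompt test case might look like: an input to the prompt plus the properties you expect from the response. The structure and field names here are an illustration only, not Libretto’s actual schema.

```python
from dataclasses import dataclass, field

# Hypothetical structure for illustration only; the field names are
# assumptions, not Libretto's actual test-case schema.
@dataclass
class PromptTestCase:
    name: str                  # human-readable label for the scenario or failure
    variables: dict            # values substituted into the prompt template
    must_contain: list = field(default_factory=list)      # substrings the response should include
    must_not_contain: list = field(default_factory=list)  # substrings that signal a failure

# A tiny corpus for a hypothetical customer-support summarization prompt.
test_cases = [
    PromptTestCase(
        name="refund request in German",
        variables={"ticket": "Ich möchte mein Geld zurück..."},
        must_contain=["refund"],
    ),
    PromptTestCase(
        name="angry customer, no order numbers in summary",
        variables={"ticket": "My order #4512 never arrived and I'm furious."},
        must_not_contain=["4512"],  # summaries shouldn't leak order identifiers
    ),
]
```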
Principle 2: Automated evaluations are the only way to scale.
Most developers who start working on a prompt try a few inputs and just read the outputs to see if they seem right. This is fine for prototypes, but it very quickly fails to scale. Figuring out if your new prompt works quickly becomes a slog of reading through AI outputs in large spreadsheets and manually grading each LLM response. As someone who has done a fair amount of this by hand, I can tell you it is truly mind-numbing work. Your eyes glaze over and you lose the will to do your job. It’s not a pretty picture.
The way out here is automated evaluations, both deterministic and LLM-powered, that can grade LLM outputs for you. When you have automated evals on your prompt that you can trust, the pace of development picks up tremendously. In Libretto, you press a button, and all your test cases are run in parallel, and you get an objective score for your prompt.
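To give a rough sense of the two kinds of evals, here’s a sketch in Python. It is an illustration only, not Libretto’s API: the function names and the grading prompt are assumptions, and the LLM-powered eval assumes the OpenAI Python SDK is installed and configured.

```python
# Illustration only, not Libretto's API: function names and the grading prompt
# are assumptions. Assumes the OpenAI Python SDK with OPENAI_API_KEY set.
import json
from openai import OpenAI

client = OpenAI()

def eval_is_valid_json(output: str) -> bool:
    """Deterministic eval: does the model output parse as JSON?"""
    try:
        json.loads(output)
        return True
    except ValueError:
        return False

def eval_is_polite(output: str) -> bool:
    """LLM-powered eval: ask a grading model a yes/no question about the output."""
    grading_prompt = (
        "You are grading a customer-support reply. Answer only YES or NO: "
        "is the following reply polite and professional?\n\n" + output
    )
    result = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": grading_prompt}],
        temperature=0,
    )
    return result.choices[0].message.content.strip().upper().startswith("YES")
```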
It’s worth noting, too, that we’ve found customers often struggle to develop prompt evals, especially when they start working on a new prompt. Imagining what could go wrong and how you need to evaluate the LLM responses can be quite tricky. That’s why every prompt loaded up into Libretto automatically gets evals assigned to it that are specific to the prompt in question. Developers can get started with our automated evals from the very beginning and throw the manual grading spreadsheets in the trash.
Principle 3: The best place to find tests is in production.
So if tests are more important than prompts, how do we amass a good body of tests for our LLM prompt?
This is a problem we’ve seen bedevil every customer we’ve worked with. Simply put, it’s hard. You will never be able to guess the full range of inputs that will be sent into your prompts, because that would mean imagining the entire scope of human language and every zany thing a user might do with your software.
If you are building a zero-to-one product, this is simply a problem you have to buckle down and deal with, coming up with the best test inputs you can imagine. But if you have a prompt in production already (or in a limited beta), you have a gold mine sitting right in front of you: your production data.
This is why at Libretto we have focused so much on monitoring your LLMs and automatically finding the LLM responses that are most likely to be errors. When you integrate Libretto’s API, we automatically scan the responses both for common issues like toxic outputs and for prompt-specific issues like failing a prompt eval. You no longer need to manually sift through a spreadsheet of prod data, reading over every API call to try to find LLM failures. Libretto highlights them for you. And once you find a bad LLM interaction in your monitoring data, you can add it to the prompt’s test cases with a single click.
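Independent of any particular tool, the underlying idea is simple: run your evals and any generic checks over each production response, and queue the failures for review. Here’s a hedged sketch in Python, where run_evals and detect_toxicity are placeholders for your own eval functions and moderation check, not real API calls.

```python
# Conceptual sketch, not Libretto's API. `run_evals` and `detect_toxicity` are
# placeholders for your own eval functions and moderation check.
def scan_production_call(prompt_name: str, variables: dict, response: str,
                         run_evals, detect_toxicity, review_queue: list) -> None:
    """Flag a production LLM response if any eval fails or the output looks toxic."""
    failures = [name for name, passed in run_evals(prompt_name, response).items()
                if not passed]
    if detect_toxicity(response):
        failures.append("toxic_output")
    if failures:
        # A flagged call is a candidate test case: it carries the real input
        # that broke the prompt, plus the reason it was flagged.
        review_queue.append({
            "prompt": prompt_name,
            "variables": variables,
            "response": response,
            "failed_checks": failures,
        })
```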
Putting it all together
So what does it look like to be test-driven in LLM prompt development? It’s all about iteration and using production data to find what’s going wrong.
With Libretto, our customers get into a loop of:

- Find problems in production data. Sift through your production data (or use Libretto to do the sifting for you!) and find cases where your prompt is falling short.
- Add the problem cases to the prompt’s tests. With one click, you can add failing production examples into a prompt’s test cases.
- Optimize your prompt to fix the problems. Now that you’ve found a problem, it’s time to use your prompt engineering skills to make it better. Tweak your prompt, try other models or parameters, and run all your tests in Libretto in parallel to make sure that the tweaks fix your problem case without introducing regressions (a sketch of that before-and-after check follows this list).
- Deploy and start again. Deploy the prompt changes, and go back to step 1.
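The “optimize” step in that loop boils down to a before-and-after comparison: the candidate prompt should fix the newly added cases without breaking anything the current prompt already passes. Here’s a minimal sketch in Python, with run_test_case standing in as a placeholder for however you execute a single case against a prompt:

```python
# Illustrative regression check; `run_test_case(prompt, case) -> bool` is a
# placeholder for however you run one test case against a prompt and grade it.
def compare_prompts(current_prompt, candidate_prompt, test_cases, run_test_case):
    """Return the cases the candidate newly fixes and the cases it newly breaks."""
    fixed, regressed = [], []
    for case in test_cases:
        passed_before = run_test_case(current_prompt, case)
        passed_after = run_test_case(candidate_prompt, case)
        if not passed_before and passed_after:
            fixed.append(case)
        elif passed_before and not passed_after:
            regressed.append(case)
    # Only ship the candidate if it fixes the target cases and regresses nothing.
    return fixed, regressed
```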
Libretto helps every step of the way, flagging potential issues in your production data, keeping track of your body of tests, and helping you quickly run tests and optimize your prompts.
With Libretto and test-driven development guided by production data, you’ll always be able to rattle off some fun stories about weird errors from your LLMs in production. Even better, you’ll also know that you’ve stopped those stories from happening again.
Want to get started? Sign up for our free tier now!
