Yes, AI Models Like GPT-4o Change Without Warning. Here’s What You Can Do About It.

As we move into a world where more and more of our software depends on large language models like GPT and Claude, we are increasingly hearing about the problem of “model drift”. Companies like OpenAI, Google, and Anthropic are constantly updating their deployed models in a ton of different ways. Most of the time, these updates don’t make much of a difference, but once in a while, they can absolutely torpedo one of your prompts.
Don’t believe us? We’ve seen it happen.
The Day GPT-4o Changed Its Mind
Here at Libretto, we’ve been worried about model drift for a while, which is why we built drift detection into our LLM monitoring and testing product. As part of that, we made a public dashboard that looks for model drift in ten sample prompts across a few popular hosted LLMs.
When we looked at the dashboard on Tuesday morning, we noticed that GPT-4o had experienced a major change in behavior the night before (on February 17th, 2025). This is what it looks like in our public model drift dashboard:

Before explaining this chart, let’s step back for a second and talk about how we can detect model drift.
With deterministic code, it’s pretty easy to figure out when there are regressions and changes: run your test code and see if the outputs are different from the last time you ran the tests. Easy.
With LLMs, though, detecting changes is a lot harder. LLMs give different results to the same prompts all the time, so it’s not enough to just compare results with the last run. At Libretto, we detect model drift for a prompt like this (a simplified code sketch follows the list):
- Sample a prompt’s production traffic to get a representative group of inputs for a particular prompt.
- On the first day of drift detection, establish a baseline by running each of the inputs to the prompt through the model 100 times.
- For each of those 100 answers, we take an embedding and then characterize the distribution of those 100 embeddings for that particular input. This gives us a baseline distribution of answers for that set of inputs. In layman’s terms, it tells us what a typical answer from the LLM looks like for those inputs on the day we take the baseline.
- On each subsequent day, we test for drift by running each of those exact inputs through the model again and calculating whether the current response from the LLM falls within the baseline distribution we calculated on day 1. If an abnormal number of responses fall outside that distribution, it’s likely that the model has changed in some significant way.
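To make the recipe concrete, here is a simplified sketch of the idea in Python. This is not our production code: the distance-to-centroid statistic is just one reasonable way to characterize the embedding distribution, and `embed()` and `call_llm()` are placeholders for whatever embedding model and LLM client you use.

```python
import numpy as np

def baseline_stats(baseline_embeddings: np.ndarray):
    """Characterize the baseline distribution of response embeddings for
    one input: the centroid, plus the mean and std of distances to it."""
    centroid = baseline_embeddings.mean(axis=0)
    dists = np.linalg.norm(baseline_embeddings - centroid, axis=1)
    return centroid, dists.mean(), dists.std()

def drift_score(new_embedding: np.ndarray, centroid, mean_dist, std_dist) -> float:
    """How many standard deviations the new response's distance to the
    centroid sits from the baseline's typical distance."""
    dist = np.linalg.norm(new_embedding - centroid)
    return (dist - mean_dist) / std_dist

# Day 1: run one sampled input through the model 100 times and embed each
# response. embed() and call_llm() are placeholders.
# baseline = np.stack([embed(call_llm(prompt, inputs)) for _ in range(100)])
# centroid, mu, sigma = baseline_stats(baseline)

# Each subsequent day: score that day's response for the same input and
# flag it if it lands far outside the baseline distribution.
# score = drift_score(embed(call_llm(prompt, inputs)), centroid, mu, sigma)
# if score > 4:
#     print("possible model drift for this input")
```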
In the chart above, we are looking at the drift detection for a prompt called “Product Name Generator”, a generative prompt that asks the LLM to take a product description and come up with some fun product names. The x-axis represents time, and each dot represents a single LLM response for one input to the prompt. The y-axis represents how many standard deviations away from the baseline distribution the LLM response is. Once in a while there’s a day with one or two outlier responses, but for several weeks the results stayed more or less static: the vast majority of responses on any given day were within 1 or 2 standard deviations of the baseline. That is, until February 17th, when a large portion of the responses were more than 4 standard deviations away from the baseline.
The first question to ask is: qualitatively, what’s going on here? The way to figure this out is to click on one of the outlier data points and see the path of one particular set of inputs:

Here we chart how the responses for one particular set of inputs to the prompt have changed over time; you can see that this is one of the inputs whose responses stayed relatively stable before changing dramatically on February 17th. For each data point in the series, we show on the left a typical response from the baseline sample and on the right the response we got when we tested for drift. Clicking around here, you can see that on every day before February 17th, GPT-4o answered with a list of product name ideas, but on the 17th, it answered with just a single product name. Clicking through the other outliers shows that they, too, switched from responding with lists to responding with a single answer.
We did some digging to verify that what we were seeing was real, and we found that, out of the 1,802 LLM requests we made to test the prompt for drift from January 16th through February 17th, only 20 responses came back with single answers. Eleven of those 20 happened on February 17th. This wasn’t just a weird coincidence. We’ve double- and triple-checked our work, and we’re pretty convinced that OpenAI did something to GPT-4o on February 17th that broke one of our test prompts.
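If you want to run that kind of sanity check on your own logs, even a rough heuristic is enough to separate single-answer responses from list-style responses and count them by day. The rules below are illustrative, not the exact classification we used, and `responses` is a placeholder for whatever (date, response text) pairs you have logged.

```python
from collections import Counter

def is_single_answer(response: str) -> bool:
    """Rough heuristic: a 'single answer' has exactly one non-empty line
    and no bullet or numbered-list markers."""
    lines = [line.strip() for line in response.splitlines() if line.strip()]
    has_list_markers = any(
        line.startswith(("-", "*")) or line.split(".")[0].isdigit()
        for line in lines
    )
    return len(lines) == 1 and not has_list_markers

# responses is assumed to be a list of (date, response_text) pairs pulled
# from your drift-test logs.
# singles_by_day = Counter(date for date, text in responses if is_single_answer(text))
# print(singles_by_day.most_common())  # a spike on a single day is a red flag
```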
This Is Kind Of Scary
For those of us who are used to the trials of keeping software up and running in production, this is pretty terrifying. On top of all the other things that add uncertainty to our software (flaky servers, third-party APIs going down, scaling issues), we now have to worry about whether or not our LLM provider is tinkering with the internals of a model that we depend on.
It gets worse, though.
Our public drift dashboard tracks 10 different LLM prompts for drift. Take a look at the aggregated page for GPT-4o:

Notice that the Product Name Generator is the only prompt that shows evidence of model drift here. On the one hand, this is good, because it means that OpenAI didn’t change GPT-4o so much that it radically changed the results of every prompt. But from the perspective of someone with a GPT-4o prompt in production, it’s a big problem, because it shows that your LLM vendor might change your model in a way that ONLY affects your prompt and is never noticed by the larger LLM developer community. If you aren’t tracking drift proactively, it will be extremely difficult to debug whether you or your model provider is to blame for degraded performance.
Fighting Back Against Model Drift
So: we know that models change, and sometimes those model changes hork your existing prompts. Given that, what can we do as engineers to protect our products’ quality?
First off, we can monitor each of our prompts for model drift. Using a tool like Libretto, you will get notified if your model shows sudden and unexpected changes. When a model changes and messes up your prompt, you won’t have to wait for the issue to show up in KPIs like user retention or revenue.
Second, we can have a good set of test cases and automated evaluations for every prompt. If you detect drift, you have essentially two options to fix the problem: change your prompt or change your model. In either case, you need to have a solid, repeatable testing strategy to quickly understand whether your changes will make the results better or worse. That’s why we are so focused at Libretto on building test cases and evaluations.
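As a concrete (and deliberately simple) example, an automated evaluation for the Product Name Generator prompt might just assert that the model returns several distinct name ideas; `call_llm()`, `product_name_prompt`, and `test_cases` are placeholders for your own client, prompt, and saved inputs.

```python
def returns_multiple_names(response: str, minimum: int = 3) -> bool:
    """Eval: the Product Name Generator should return at least `minimum`
    name ideas, one per line or list item."""
    names = [line.strip("-*0123456789. \t") for line in response.splitlines()]
    return len([n for n in names if n]) >= minimum

# A tiny regression suite: replay saved inputs against the current model
# and fail loudly if the eval stops passing.
# for case in test_cases:
#     response = call_llm(product_name_prompt, case)
#     assert returns_multiple_names(response), f"Eval failed for input: {case!r}"
```

An eval even this simple would have flagged the list-to-single-answer change we saw on February 17th.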
Finally, we can proactively test other models and vendors to have a Plan B in case of model drift. If we’ve already done the work to know which models are high quality and cost-effective, we can switch quickly when our current LLM changes underneath us. Needless to say, this is dead easy to do in Libretto.
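Once the vetting is done, the switch itself can be as small as a routing flag. The sketch below is illustrative only: the model names and the `complete()` helper are placeholders, not any particular vendor’s SDK.

```python
# Illustrative only: swap in your own client and an already-vetted backup model.
PRIMARY_MODEL = "gpt-4o"
BACKUP_MODEL = "your-vetted-backup-model"  # the Plan B you tested ahead of time

def generate(prompt: str, drift_detected: bool = False) -> str:
    """Route to the backup model whenever drift has been flagged on the primary."""
    model = BACKUP_MODEL if drift_detected else PRIMARY_MODEL
    return complete(model=model, prompt=prompt)  # complete() is a placeholder client call
```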
We’ll be talking more about model drift and our Public Drift Dashboard (now in beta) in the upcoming weeks, and if you want to start monitoring your own prompts for drift, sign up for Libretto for free now.
