A recent study coined the term LLM Drift. LLM Drift is definite changes in LLM responses and behaviour, over a relatively short period of time.

LLM Drift refers to significant alterations in LLM responses over a brief timeframe. It has nothing to do with inherent unpredictability of LLMs or minor amendments in prompt engineering, but instead involves a fundamental shift in the LLM.

A recent investigation discovered that the accuracy of responses by GPT-4 and GPT-3.5 sees substantial fluctuations in a positive direction over a four-month period. And worryingly, in the opposite direction as well.

Notably, the study shows significant variations in both GPT-3.5 and GPT-4, observing performance degradation in certain tasks.

Our findings highlight the need to continuously monitor LLMs behaviour over time. — Source

The most notable changes observed during the study were:

  • Chain-Of-Thought efficiency saw changes, with GPT-4 being less likely to answer questions which yields an opinion.
  • The decrease in opinionated answers can be related to an improvement of safety questions.
  • Tendencies of deviation for GPT4 and GPT3.5 are often different
  • Even-though LLM improvements can be achieved with fine-tuning and contextual prompt injection (RAG) unexpected behaviour will still be present.
  • The researchers stressed continued testing and benchmarking, due to the fact that the study’s analysis was largely based on shifts in broader accuracy as the main metric. However, fine-grained investigations could disclose additional interesting shift patterns.

The schematic below shows the fluctuation in model accuracy over a period of four months. In some cases the deprecation is quite stark, being more than 60% loss in accuracy.


The table below shows Chain-Of-Thought (CoT) effectiveness drifts over time for prime testing.

Without CoT prompting, both GPT-4 and GPT-3.5 achieved relatively low accuracy.

With CoT prompting, GPT-4 in March achieved a 24.4% accuracy improvement, which dropped by -0.1% in June. It does seem like GPT-4 loss the ability to optimise the CoT prompting technique.

Considering GPT-3.5 , the CoT boost increased from 6.3% in March to 15.8% in June.

The datasets used and basic code examples from the study are available on GitHub. I also added an executed notebook which you can view here.

The GitHub repository also holds the datasets and generated content. Each csv file corresponds to one dataset with one record/row corresponding with one query and the generation from one LLM service.

I’m currently the Chief Evangelist @ HumanFirst. I explore & write about all things at the intersection of AI & language; ranging from LLMs, Chatbots, Voicebots, Build Frameworks, natural language data productivity suites & more.

Subscribe to HumanFirst Blog

Get the latest posts delivered right to your inbox