
LLM Drift

September 25, 2023 | 4 min read


A recent study coined the term LLM Drift: definite changes in an LLM's responses and behaviour over a relatively short period of time.

LLM Drift refers to significant alterations in LLM responses over a brief timeframe. It has nothing to do with the inherent unpredictability of LLMs or with minor amendments to prompt engineering; it involves a fundamental shift in the LLM itself.

The study found that the accuracy of GPT-4 and GPT-3.5 responses fluctuated substantially over a four-month period, in some cases improving and, worryingly, in others degrading.

Notably, the study shows significant variation in both GPT-3.5 and GPT-4, with performance degradation observed on certain tasks.

Our findings highlight the need to continuously monitor LLMs' behaviour over time. — Source
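
To make continuous monitoring concrete, below is a minimal sketch of a drift monitor in Python: it re-runs a fixed evaluation set and appends a timestamped accuracy record to a log. The ask_llm helper and the two example questions are my own placeholders, not part of the study; swap in your actual provider client and benchmark.

    import json
    from datetime import datetime, timezone

    # Hypothetical provider call -- wire this up to whichever LLM client you use.
    def ask_llm(prompt: str) -> str:
        raise NotImplementedError("connect your LLM provider here")

    # A small, fixed evaluation set (illustrative: 17077 is prime, 17078 is not).
    EVAL_SET = [
        {"prompt": "Is 17077 a prime number? Answer yes or no.", "answer": "yes"},
        {"prompt": "Is 17078 a prime number? Answer yes or no.", "answer": "no"},
    ]

    def snapshot() -> dict:
        """Score the fixed eval set once and return a timestamped accuracy record."""
        correct = 0
        for case in EVAL_SET:
            reply = ask_llm(case["prompt"]).strip().lower()
            # Naive scoring: accept any reply that starts with the expected answer.
            correct += reply.startswith(case["answer"])
        return {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "accuracy": correct / len(EVAL_SET),
        }

    # Append one record per run; comparing records week over week surfaces drift.
    with open("drift_log.jsonl", "a") as log:
        log.write(json.dumps(snapshot()) + "\n")

Running the same snapshot against the same model identifier every week gives a time series of accuracy; an abrupt move in that series is exactly the kind of drift the study describes.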

The most notable changes observed during the study were:

  • Chain-of-Thought effectiveness changed, with GPT-4 becoming less likely to answer questions that call for an opinion.
  • The decrease in opinionated answers may be related to improved safety behaviour.
  • GPT-4 and GPT-3.5 often drift in different directions.
  • Even though LLM behaviour can be improved with fine-tuning and contextual prompt injection (RAG), unexpected behaviour will still be present.
  • The researchers stressed continued testing and benchmarking, since the study's analysis was largely based on shifts in aggregate accuracy as the main metric; fine-grained investigations could disclose additional interesting shift patterns.

The schematic below shows the fluctuation in model accuracy over a period of four months. In some cases the degradation is quite stark: more than a 60% loss in accuracy. (Source)

The table below shows how Chain-of-Thought (CoT) effectiveness drifts over time for the prime-number testing task.

Without CoT prompting, both GPT-4 and GPT-3.5 achieved relatively low accuracy.

With CoT prompting, GPT-4 achieved a 24.4% accuracy improvement in March, which dropped to -0.1% in June. It does seem that GPT-4 lost the ability to exploit the CoT prompting technique.

For GPT-3.5, the CoT boost increased from 6.3% in March to 15.8% in June.
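
For reference, the only difference between the two conditions is whether the prompt asks the model to reason before answering. Below is a hypothetical sketch of the two prompt styles, plus a trial-division ground-truth check for scoring; the exact wording used in the study may differ.

    def direct_prompt(n: int) -> str:
        # Baseline: ask for the answer directly.
        return f"Is {n} a prime number? Answer [Yes] or [No]."

    def cot_prompt(n: int) -> str:
        # Chain-of-Thought: ask the model to reason step by step first.
        return f"Is {n} a prime number? Think step by step, then answer [Yes] or [No]."

    def is_prime(n: int) -> bool:
        """Ground truth for scoring model answers (simple trial division)."""
        if n < 2:
            return False
        i = 2
        while i * i <= n:
            if n % i == 0:
                return False
            i += 1
        return True

The CoT boost reported in the study is then just the accuracy with cot_prompt minus the accuracy with direct_prompt over the same set of numbers.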

The datasets used and basic code examples from the study are available on GitHub. I also added an executed notebook which you can view here.

The GitHub repository also holds the datasets and generated content. Each CSV file corresponds to one dataset, with each record/row holding one query and the generation from one LLM service.
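
If you want to rerun the numbers, records in that shape load directly with pandas. The file and column names below are assumptions on my part, so inspect the repository's actual layout first:

    import pandas as pd

    # Hypothetical file name -- check the repository for the real CSV files.
    df = pd.read_csv("prime_gpt4_march.csv")

    # Look at the actual schema before aggregating.
    print(df.columns.tolist())
    print(df.head())

    # Example aggregate once the schema is known, e.g. accuracy per service:
    # df.groupby("service")["correct"].mean()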

I’m currently the Chief Evangelist @ HumanFirst. I explore & write about all things at the intersection of AI & language; ranging from LLMs, Chatbots, Voicebots, Build Frameworks, natural language data productivity suites & more.
