Articles

Building In Alignment: The Role of Observability in LLM-Led Conversational Design

Gregory Whiteside · December 6, 2023 · 5 min

Design evolves in step with technology. In the last ten years, improvements in cloud computing have changed conversational design, shifting the field away from rigid, rule-based programming and toward predictive conversational flows. 

The shift has marked the advent of the Conversational Designer role–we’ve stopped answering the phones and started designing how the phones should be answered. While tactics have changed, there’s been one common goal: to make customer experiences more conversational.

Last year brought a technological overhaul. Large language models (LLMs) unlock a new level of contextual awareness and response generation, carrying us beyond pre-planned responses and toward context-aware content created on the go.

LLMs will accelerate conversational design projects, but building in alignment with LLMs is the only way to accelerate success. That success will depend on three areas of alignment, all of which are outlined below.

Problem-Alignment: Letting User Data Drive

Since foundation models have become widely available, industry discourse has focused on correction and accuracy at the model level. Rightfully so–this is powerful, tricky technology, and designing reliable performance is challenging. 

But we’re better served to start one step earlier–to focus on correction and accuracy at the human level before moving on to troubleshoot the model. 

Human bias is the primary source of failure in conversational design. Our intelligence betrays us; we understand our business, we think we know our users, and we need to make decisions. Almost inevitably, we bake our bias into every project, solving for what we think our users want instead of letting our data show us our gaps and order our priorities. 

The human habit toward top-down decision-making can be explained, at least in part, by the traditionally prohibitive cost of proper data exploration; data-led decision-making has always been talent-, time-, and cost-intensive. Here, before summaries and automations, is one of the most powerful use cases of LLMs. LLMs are aware and creative. They can read hundreds of thousands of conversations at once without getting bored. Companies have amassed endless libraries of bot logs and transcription-ready voice recordings, and LLMs can now study all of them–every turn across millions of conversations. With the right data engineering workflow, we can finally let our data tell us what’s going on.

In a recent webinar hosted by the Conversational Design Institute, Stephen Broadhurst–our VP of Customer Solutions at HumanFirst–piloted a live workflow, creating a comprehensive set of user needs across 10,000 conversations. This clip from the full webinar walks through the process of using prompts to extract call drivers from full conversations to understand the distribution of key issues at different levels of granularity.
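Outside of any particular platform, the shape of that workflow can be sketched in a few lines. The example below is a minimal illustration, not HumanFirst's implementation: it assumes the OpenAI Python SDK, a stand-in list of transcripts, and a hypothetical prompt that asks the model to name each conversation's call driver before tallying the distribution.

```python
# Minimal sketch: ask an LLM to name the call driver for each conversation,
# then tally the distribution. Assumes the OpenAI Python SDK (pip install openai)
# and an OPENAI_API_KEY in the environment; `transcripts` is a stand-in for your
# own export of bot logs or transcribed calls.
from collections import Counter
from openai import OpenAI

client = OpenAI()

CALL_DRIVER_PROMPT = (
    "Read the customer support conversation below and reply with a short noun phrase "
    "(3-6 words) naming the single main reason the customer reached out.\n\n"
    "Conversation:\n{conversation}"
)

def extract_call_driver(conversation: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any capable chat model works
        messages=[{"role": "user", "content": CALL_DRIVER_PROMPT.format(conversation=conversation)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower()

transcripts = [
    "User: I was charged twice for my order.\nAgent: Let me look into that for you.",
    "User: My package never arrived.\nAgent: I'm sorry to hear that, let me check the tracking.",
]

distribution = Counter(extract_call_driver(t) for t in transcripts)
for driver, count in distribution.most_common():
    print(f"{count:>4}  {driver}")
```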

Pragmatic-Alignment: In the Language of the User

Streamlined data engineering will help us better understand the problems users face. It’s important that we also understand the language they use to express them. What distinguishes a return from a refund? A payment from a charge? What do we mean when we say ‘resolved’?

The pragmatic aspect of language–how context contributes to meaning–is challenging on a human-to-human level. Human-to-model discrepancies are inevitable, and model-to-model disagreements are equally likely. This clip from Stephen’s webinar shows how hard it can be to call a gorilla a gorilla–a classification we established hundreds of years ago.

To build structured projects around complex business topics and to design an on-the-go understanding that will support intuitive conversations, we need to agree with our LLMs on ground truth. We need to know they distinguish between returns and refunds the same way we do.
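One lightweight way to check that agreement is to hand-label a few utterances and see whether the model's classification matches ours. The sketch below makes the same assumptions as the earlier one (OpenAI SDK, illustrative definitions and examples); it is a starting point for a spot check, not a benchmark.

```python
# Minimal sketch: check whether the model separates "return" from "refund" the way we do.
# Assumes the OpenAI Python SDK; the definitions and hand-labelled examples are illustrative.
from openai import OpenAI

client = OpenAI()

DEFINITIONS_PROMPT = (
    "Classify the customer utterance as exactly one of: RETURN or REFUND.\n"
    "RETURN means the customer wants to send a product back to us.\n"
    "REFUND means the customer wants money credited back, whether or not anything is returned.\n"
    "Reply with the single word RETURN or REFUND.\n\n"
    "Utterance: {utterance}"
)

golden = [
    ("I want to send these shoes back, they don't fit.", "RETURN"),
    ("You charged me twice, I need that money back.", "REFUND"),
    ("The item arrived broken, please credit my card.", "REFUND"),
]

agreements = 0
for utterance, expected in golden:
    answer = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": DEFINITIONS_PROMPT.format(utterance=utterance)}],
        temperature=0,
    ).choices[0].message.content.strip().upper()
    agreements += answer == expected
    print(f"expected={expected:<7} model={answer:<7} {utterance}")

print(f"Agreement with our ground truth: {agreements}/{len(golden)}")
```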

Performance-Alignment: Supervision and Observability

Through prompt engineering, we can establish that ground truth. We can tell the LLM in plain language that ‘resolution’ means the agent has answered the user’s question, even if the answer wasn’t the outcome the user hoped for. Instead of leaving judgment to the model, we can set the conditions of its consideration such that it shares our understanding.
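A directive of that kind can be as plain as the hypothetical fragment below, written here as a Python constant so it can be reused when testing at scale; the wording is an illustration, not a prescribed definition.

```python
# Hypothetical prompt directive that fixes what 'resolution' means before the model judges anything.
RESOLUTION_DIRECTIVE = (
    "You are reviewing a customer support conversation.\n"
    "Definition: the conversation is RESOLVED if the agent answered the user's question, "
    "even if the answer was not the outcome the user hoped for. It is UNRESOLVED if the "
    "user's question was never actually answered.\n"
    "Reply with exactly one word: RESOLVED or UNRESOLVED.\n\n"
    "Conversation:\n{conversation}"
)
```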

With that understanding established, we’ll need to know how the prompt performs. We might have confidence in a prompt that worked in OpenAI Playground, but successful conversational design requires knowing how it behaves across thousands of conversations, and how badly it fails when it does.

Normally, we can’t see inside an LLM’s decisions or assess a prompt’s performance at full scale. But with the right tooling, we can consider the output of a prompt within the context of the source conversation. This opens up a human-in-the-loop observation window to an otherwise black-box process. We can run that prompt across hundreds of conversations and consider any output against the corresponding source conversation to discern whether it’s done what we’d want. 
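A bare-bones version of that observation window might look like the sketch below: batch the resolution prompt over a folder of transcripts and write every verdict next to its source conversation so a reviewer can scan the pairs. The folder layout, model name, and prompt wording are assumptions for illustration, not how any particular tooling does it.

```python
# Minimal sketch: run the resolution prompt over a folder of transcripts and pair each
# verdict with its source conversation for human review. Assumes the OpenAI Python SDK
# and a local `transcripts/` folder of plain-text conversations; the layout is illustrative.
import csv
from pathlib import Path
from openai import OpenAI

client = OpenAI()

RESOLUTION_DIRECTIVE = (
    "Definition: the conversation is RESOLVED if the agent answered the user's question, "
    "even if the answer was not the outcome the user hoped for; otherwise it is UNRESOLVED. "
    "Reply with exactly one word: RESOLVED or UNRESOLVED.\n\n"
    "Conversation:\n{conversation}"
)

rows = []
for path in sorted(Path("transcripts").glob("*.txt")):
    conversation = path.read_text()
    verdict = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": RESOLUTION_DIRECTIVE.format(conversation=conversation)}],
        temperature=0,
    ).choices[0].message.content.strip().upper()
    rows.append({"file": path.name, "verdict": verdict, "conversation": conversation})

# Keep every verdict next to its source so a reviewer can spot-check whether the
# prompt did what we would have done.
with open("resolution_review.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["file", "verdict", "conversation"])
    writer.writeheader()
    writer.writerows(rows)

print(f"Labelled {len(rows)} conversations; results in resolution_review.csv")
```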

Here, Stephen demonstrates a workflow for defining prompt directives and testing responses against source conversations to assess and improve performance.

New technology and tooling will again elevate conversational design. Teams will be able to test prompts at scale, supervise performance, and design more intuitive responses. They’ll be equipped to build models with the agility required to cover hundreds of distinct user needs. The result of this kind of testing and training will be new levels of trust in AI-led conversations, more competitive user experiences, and further advancements in the field of Conversational AI. 

HumanFirst is a data-centric productivity platform designed to help companies find and solve problems with AI-powered workflows that combine prompt and data engineering. Experiment with raw data, surface insights, and build reliable solutions with speed, accuracy, and trust.
