Articles

Building In Alignment: The Role of Observability in LLM-Led Conversational Design

Gregory Whiteside · December 6, 2023 · 5 min

Design evolves in step with technology. In the last ten years, improvements in cloud computing have changed conversational design, shifting the field away from rigid, rule-based programming and toward predictive conversational flows. 

The shift has marked the advent of the Conversational Designer role–we’ve stopped answering the phones and started designing how the phones should be answered. While tactics have changed, there’s been one common goal: to make customer experiences more conversational.

Last year brought a technological overhaul. Large language models (LLMs) unlock a new level of contextual awareness and response generation, affording us capabilities beyond response planning and toward context-aware content creation on the go. 

LLMs will accelerate conversational design projects–but building in alignment with them is the only way to accelerate success. That success will depend on three areas of alignment, outlined below.

Problem-Alignment: Letting User Data Drive

Since foundation models have become widely available, industry discourse has focused on correction and accuracy at the model level. Rightfully so–this is powerful, tricky technology, and designing reliable performance is challenging. 

But we’re better served to start one step earlier–to focus on correction and accuracy at the human level before moving on to troubleshoot the model. 

Human bias is the primary source of failure in conversational design. Our intelligence betrays us; we understand our business, we think we know our users, and we need to make decisions. Almost inevitably, we bake our bias into every project, solving for what we think our users want instead of letting our data show us our gaps and order our priorities. 

The human habit toward top-down decision-making can be explained, at least in part, by the traditionally prohibitive cost of proper data exploration; data-led decision-making has always been talent-, time-, and cost-intensive. Here–before summaries and automations–lies one of the most powerful use cases of LLMs. LLMs are contextually aware and creative, and they can read hundreds of thousands of conversations without getting bored. Companies have amassed endless libraries of bot logs and transcription-ready voice recordings, and LLMs can now study all of them–every turn across millions of conversations. With the right data engineering workflow, we can finally let our data tell us what’s going on.

In a recent webinar hosted by the Conversational Design Institute, Stephen Broadhurst–our VP of Customer Solutions at HumanFirst–piloted a live workflow, creating a comprehensive set of user needs across 10,000 conversations. This clip from the full webinar walks through the process of using prompts to extract call drivers from full conversations to understand the distribution of key issues at different levels of granularity.
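
For a flavor of what that kind of prompt-driven extraction can look like in code, here is a minimal sketch in Python. It is our own illustration rather than HumanFirst’s implementation: the OpenAI client, the model name, the prompt wording, and the transcripts list are all placeholder assumptions.

from collections import Counter
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

EXTRACTION_PROMPT = (
    "Read the conversation below and name the primary call driver, i.e. the "
    "underlying reason the customer reached out, in two to five words. "
    "Respond with the call driver only.\n\nConversation:\n{conversation}"
)

def extract_call_driver(conversation: str) -> str:
    # One LLM call per conversation; at real scale these calls would be batched and parallelized.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": EXTRACTION_PROMPT.format(conversation=conversation)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

transcripts = ["..."]  # hypothetical list of full conversation transcripts
distribution = Counter(extract_call_driver(t) for t in transcripts)
for driver, count in distribution.most_common(10):
    print(f"{driver}: {count}")

Counting the extracted labels gives a first, data-led view of which issues dominate; grouping near-duplicate labels is a natural next step toward coarser levels of granularity.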

Pragmatic-Alignment: In the Language of the User

Streamlined data engineering will help us better understand the problems users face. It’s important that we also understand the language they use to express them. What distinguishes a return from a refund? A payment from a charge? What do we mean when we say ‘resolved’?

The pragmatic aspect of language–how context contributes to meaning–is challenging on a human-to-human level. Human-to-model discrepancies are inevitable, and model-to-model disagreements are equally likely. This clip from Stephen’s webinar shows how hard it can be to call a gorilla a gorilla–a classification we established hundreds of years ago.

To build structured projects around complex business topics and to design an on-the-go understanding that will support intuitive conversations, we need to agree with our LLMs on ground truth. We need to know they distinguish between returns and refunds the same way we do.
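
One lightweight way to check that agreement is to take a handful of utterances we have labeled ourselves and see whether the model assigns the same labels. A minimal sketch, with illustrative examples and a placeholder prompt and model:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def classify_issue(utterance: str) -> str:
    # Ask the model to label a customer message as 'return' or 'refund'.
    prompt = (
        "Label the customer message below as 'return' (the customer wants to send "
        "an item back) or 'refund' (the customer wants money back). "
        f"Respond with one word.\n\nMessage: {utterance}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower()

# A small hand-labeled sample representing our own ground truth.
labeled = [
    ("This jacket doesn't fit and I want to send it back.", "return"),
    ("I was charged twice and need my money back.", "refund"),
]
matches = sum(classify_issue(text) == label for text, label in labeled)
print(f"Model agrees with our labels on {matches}/{len(labeled)} examples.")

Where the model disagrees, the mismatches point to exactly the definitions that need to be spelled out in the prompt, which is the subject of the next section.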

Performance-Alignment: Supervision and Observability

Through prompt engineering, we can establish that ground truth. We can tell the LLM in plain language that ‘resolution’ means the agent has answered the user’s question, even if the answer wasn’t the outcome the user hoped for. Instead of leaving judgment to the model, we can set the conditions of its consideration so that it shares our understanding.
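
As a concrete illustration, a directive along these lines (the wording is ours, not a canonical template) puts that shared definition in front of the model before asking for a judgment:

# A plain-language directive that fixes what 'resolution' means before the model judges anything.
RESOLUTION_DIRECTIVE = (
    "A conversation counts as 'resolved' when the agent has answered the user's "
    "question, even if the answer was not the outcome the user hoped for. "
    "Read the conversation and respond with 'resolved' or 'unresolved', "
    "followed by a one-sentence justification."
)

Sent as a system message ahead of each conversation, the directive makes the model apply our definition rather than its own.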

With that understanding established, we’ll need to know how the prompt performs. We might have confidence in a prompt that worked in the OpenAI Playground, but successful conversational design requires that we know how it performs across thousands of conversations, and how badly it fails when it does.

Normally, we can’t see inside an LLM’s decisions or assess a prompt’s performance at full scale. But with the right tooling, we can consider the output of a prompt within the context of the source conversation. This opens up a human-in-the-loop observation window to an otherwise black-box process. We can run that prompt across hundreds of conversations and consider any output against the corresponding source conversation to discern whether it’s done what we’d want. 
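
In code, that observation window can be as simple as pairing every output with its source conversation and exporting the pairs for review. A minimal sketch, again with placeholder names, reusing the ‘resolution’ directive from above:

import csv
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Same plain-language directive as in the earlier sketch.
RESOLUTION_DIRECTIVE = (
    "A conversation counts as 'resolved' when the agent has answered the user's "
    "question, even if the answer was not the outcome the user hoped for. "
    "Respond with 'resolved' or 'unresolved' and a one-sentence justification."
)

def run_prompt(conversation: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": RESOLUTION_DIRECTIVE},
            {"role": "user", "content": conversation},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

conversations = ["..."]  # hypothetical list of source conversations

# Write (source conversation, model output) pairs so a reviewer can judge,
# row by row, whether the prompt did what we intended.
with open("prompt_review.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["conversation", "model_output"])
    for conversation in conversations:
        writer.writerow([conversation, run_prompt(conversation)])

Reviewing even a few hundred of these pairs gives a grounded sense of how often the prompt holds up and how badly it misses when it does not.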

Here, Stephen demonstrates a workflow for defining prompt directives and testing responses against source conversations to assess and improve performance.

New technology and tooling will again elevate conversational design. Teams will be able to test prompts at scale, supervise performance, and design more intuitive responses. They’ll be equipped to build models with the agility required to cover hundreds of distinct user needs. The result of this kind of testing and training will be new levels of trust in AI-led conversations, more competitive user experiences, and further advancements in the field of Conversational AI. 

HumanFirst is a data-centric productivity platform designed to help companies find and solve problems with AI-powered workflows that combine prompt and data engineering. Experiment with raw data, surface insights, and build reliable solutions with speed, accuracy, and trust.
