Building In Alignment: The Role of Observability in LLM-Led Conversational Design
Design evolves in step with technology. In the last ten years, improvements in cloud computing have changed conversational design, shifting the field away from rigid, rule-based programming and toward predictive conversational flows.
The shift has marked the advent of the Conversational Designer role–we’ve stopped answering the phones and started designing how the phones should be answered. While tactics have changed, there’s been one common goal: to make customer experiences more conversational.
Last year brought a technological overhaul. Large language models (LLMs) unlock a new level of contextual awareness and response generation, moving us beyond response planning toward context-aware content creation on the go.
Powered by LLMs, conversational design projects will be accelerated–but building in alignment with LLMs will be the only way to accelerate success. Success will depend on three areas of alignment, all of which are outlined below.
Problem-Alignment: Letting User Data Drive
Since foundation models have become widely available, industry discourse has focused on correction and accuracy at the model level. Rightfully so–this is powerful, tricky technology, and designing reliable performance is challenging.
But we’re better served to start one step earlier–to focus on correction and accuracy at the human level before moving on to troubleshoot the model.
Human bias is the primary source of failure in conversational design. Our intelligence betrays us; we understand our business, we think we know our users, and we need to make decisions. Almost inevitably, we bake our bias into every project, solving for what we think our users want instead of letting our data show us our gaps and order our priorities.
The human habit toward top-down decision-making can be explained, at least in part, by the traditionally prohibitive cost of proper data exploration; data-led decision-making has always been talent-, time-, and cost-intensive. Here–before summaries and automations–is one of the most powerful use cases of LLMs. LLMs are aware and creative. They can read hundreds of thousands of conversations at once without getting bored. Companies have amassed endless libraries of bot logs and transcription-ready voice recordings, and LLMs can now study all of them–every turn across millions of conversations. With the right data engineering workflow, we can finally let our data tell us what’s going on.
In a recent webinar hosted by the Conversational Design Institute, Stephen Broadhurst–our VP of Customer Solutions at HumanFirst–piloted a live workflow, creating a comprehensive set of user needs across 10,000 conversations. This clip from the full webinar walks through the process of using prompts to extract call drivers from full conversations to understand the distribution of key issues at different levels of granularity.
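A workflow like this can be sketched in a few lines. Everything below is an illustrative assumption, not HumanFirst's actual implementation: the prompt template, the driver labels, and the `call_llm` parameter, which stands in for a real LLM API call.

```python
from collections import Counter

# Illustrative prompt template (an assumption, not the webinar's actual prompt).
DRIVER_PROMPT = (
    "Read the customer conversation below and answer with a short label "
    "for the primary call driver (e.g. 'billing dispute', 'order status').\n\n"
    "Conversation:\n{transcript}\n\nCall driver:"
)

def build_driver_prompt(transcript: str) -> str:
    """Fill the template with one conversation transcript."""
    return DRIVER_PROMPT.format(transcript=transcript)

def extract_call_drivers(transcripts, call_llm) -> Counter:
    """Label every transcript with the LLM, then count label frequencies.

    `call_llm` is a placeholder for a real LLM API call that returns
    the model's text completion for a given prompt.
    """
    labels = (call_llm(build_driver_prompt(t)).strip().lower() for t in transcripts)
    return Counter(labels)

# Usage with a stubbed model: the distribution of key issues falls out directly.
fake_llm = lambda prompt: "billing dispute" if "charged" in prompt else "order status"
counts = extract_call_drivers(
    ["I was charged twice for one order.", "Where is my package?"], fake_llm
)
```

Swapping the stub for a real model call turns `counts.most_common()` into exactly the distribution-of-issues view described above, at whatever level of granularity the prompt asks for.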
Pragmatic-Alignment: In the Language of the User
Streamlined data engineering will help us better understand the problems users face. It’s important that we also understand the language they use to express them. What distinguishes a return from a refund? A payment from a charge? What do we mean when we say ‘resolved’?
The pragmatic aspect of language–how context contributes to meaning–is challenging on a human-to-human level. Human-to-model discrepancies are inevitable, and model-to-model disagreements are equally likely. This clip from Stephen’s webinar shows how hard it can be to call a gorilla a gorilla–a classification we established hundreds of years ago.
To build structured projects around complex business topics and to design an on-the-go understanding that will support intuitive conversations, we need to agree with our LLMs on ground truth. We need to know they distinguish between returns and refunds the same way we do.
Performance-Alignment: Supervision and Observability
Through prompt engineering, we can establish that ground truth. We can tell the LLM, in plain language, that ‘resolution’ means the agent has answered the user’s question, even if the answer wasn’t the one the user hoped for. Instead of leaving judgment to the model, we can set the conditions of its consideration such that it shares our understanding.
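One way to encode that shared understanding is to pin the team's agreed definitions into a system prompt. The definitions below are hypothetical examples, not an established taxonomy:

```python
# Hypothetical ground-truth definitions; a real project would draw these
# from the team's own agreed vocabulary, not from this sketch.
GROUND_TRUTH = {
    "resolved": (
        "the agent answered the user's question, even if the answer "
        "was not what the user hoped for"
    ),
    "return": "the customer sends the item back to the merchant",
    "refund": "the merchant sends money back to the customer",
}

def build_system_prompt(definitions: dict) -> str:
    """Turn agreed definitions into explicit instructions, so judgment
    calls follow our conditions rather than the model's defaults."""
    lines = [f"- '{term}' means {meaning}." for term, meaning in definitions.items()]
    return (
        "When classifying conversations, use these definitions exactly:\n"
        + "\n".join(lines)
    )

system_prompt = build_system_prompt(GROUND_TRUTH)
```

Keeping the definitions in one structured place, rather than scattered through prose prompts, also makes it easy to version them as the team's vocabulary evolves.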
With that understanding established, we’ll need to know how the prompt performs. We might have confidence in a prompt that worked in OpenAI Playground, but successful conversational design requires that we know how it works across thousands of conversations, and how badly it fails when it does.
Normally, we can’t see inside an LLM’s decisions or assess a prompt’s performance at full scale. But with the right tooling, we can consider the output of a prompt within the context of the source conversation. This opens up a human-in-the-loop observation window to an otherwise black-box process. We can run that prompt across hundreds of conversations and consider any output against the corresponding source conversation to discern whether it’s done what we’d want.
Here, Stephen demonstrates a workflow for defining prompt directives and testing responses against source conversations to assess and improve performance.
New technology and tooling will again elevate conversational design. Teams will be able to test prompts at scale, supervise performance, and design more intuitive responses. They’ll be equipped to build models with the agility required to cover hundreds of distinct user needs. The result of this kind of testing and training will be new levels of trust in AI-led conversations, more competitive user experiences, and further advancements in the field of Conversational AI.
HumanFirst is a data-centric productivity platform designed to help companies find and solve problems with AI-powered workflows that combine prompt and data engineering. Experiment with raw data, surface insights, and build reliable solutions with speed, accuracy, and trust.