A bottom-up approach to NLU

November 26, 2020 · 4 min read




An important aspect of conversation design is understanding your customers’ intents. What are your customers asking? What problems do they have?

To solve this, access to real conversational data is critical — without it, you’re pretty much playing a guessing game; you can brainstorm the most common intents with your team, but correctly addressing the long tail specific to your domain is next to impossible.

However, access to conversational data isn't enough: without proper tooling, you'll find yourself manually sifting through conversation transcripts with no idea where to start or stop, which utterances constitute valid intents, and which are just noise.

The typical approach to this problem has been to apply unsupervised clustering techniques.

Image taken from IBM Watson
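To make the typical workflow concrete, here is a minimal sketch of unsupervised intent clustering, assuming the sentence-transformers and scikit-learn libraries; the embedding model and the number of clusters are arbitrary illustrative choices, not the pipeline of any particular vendor.

```python
# Minimal sketch: cluster user utterances by embedding similarity.
# Assumes sentence-transformers and scikit-learn; the model name and
# cluster count are illustrative choices.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

utterances = [
    "how can I transfer funds to my checking account?",
    "what are your wire transfer fees?",
    "I can't log into the mobile app",
    "my card was declined at the store",
    # ... thousands more, pulled from real conversations
]

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(utterances)

kmeans = KMeans(n_clusters=3, random_state=0)
cluster_ids = kmeans.fit_predict(embeddings)

for cluster_id, utterance in zip(cluster_ids, utterances):
    print(cluster_id, utterance)
```

The output is a flat set of cluster IDs and nothing more, which is exactly where the two problems below show up.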

There are two clear problems with unsupervised clustering as an approach to discovery and training of intents:

  • The first, obvious problem is that clusters often overlap (see image above) and represent similar or identical intents, requiring manual intervention to disambiguate them.
  • A less obvious but more fundamental problem is that unsupervised clustering techniques say nothing about how abstract or specific the intent generated from a given cluster should be.

For example, a cluster with utterances similar to "how can I transfer funds to my checking account?" could be assigned to any one of these three labels, from most abstract to most specific:

  1. Has a question
  2. Has a question > about bank account
  3. Has a question > about bank account > transfers
Determining which label to apply is a non-trivial problem, as the right level of abstraction for any given intent depends on whether there is sufficient data to accurately train the intent at that level of abstraction.

This is a classic chicken-and-egg problem: you need labeled data in order to correctly label your data.
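To make the abstraction question concrete, here is a hypothetical sketch (plain Python, all names invented for this example) that keeps, for each utterance labeled with a hierarchical path, only the deepest prefix of that path that has enough training examples:

```python
# Hypothetical sketch: pick the level of abstraction per intent based on
# how much training data each level of the hierarchy would receive.
from collections import Counter

MIN_EXAMPLES = 20  # arbitrary threshold for "enough data to train an intent"

labeled = [
    ("how can I transfer funds to my checking account?",
     "Has a question > about bank account > transfers"),
    # ... more (utterance, hierarchical label) pairs
]

def prefixes(path):
    """Yield every prefix of 'A > B > C', deepest first."""
    parts = [p.strip() for p in path.split(">")]
    for depth in range(len(parts), 0, -1):
        yield " > ".join(parts[:depth])

# Each node in the hierarchy aggregates the examples of all its descendants.
counts = Counter()
for _, path in labeled:
    for prefix in prefixes(path):
        counts[prefix] += 1

def training_label(path):
    """Return the deepest intent along 'path' with enough examples,
    falling back to the top-level intent otherwise."""
    candidates = list(prefixes(path))  # deepest first
    for prefix in candidates:
        if counts[prefix] >= MIN_EXAMPLES:
            return prefix
    return candidates[-1]  # the top-level intent
```

Of course, this presupposes hierarchically labeled data, which is precisely the chicken-and-egg problem the bottom-up approach below is designed to break.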

Bottom-up approach to intent discovery & data labeling

Bottom-up labeling applies the tried and tested divide-and-conquer approach to this problem, with great success. Instead of expecting a human or unsupervised algorithm to correctly “predict” what intents and abstractions exist in the data, it provides a simple framework to iteratively discover this information.

The bottom-up “algorithm” is simple:

  • Step 1: Identify a few very high-level intents that can capture most (if not all) of the meaning in your data (in our experience, "has a question" and "has a problem" are great starting points).
  • Step 2: Label your conversation / utterance data, assigning utterances to one of these high-level intents (the cognitive load at this labeling step is minimal, since the decision boils down to simply assigning each utterance to one of the existing high-level intents).

The outcome of this step is very valuable in itself, as it provides high-quality and domain-specific training data to classify users who "have a question" or "have a problem".
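At this point the coarse labels are already usable. As a hypothetical sketch (assuming scikit-learn; TF-IDF plus logistic regression is just one simple choice, not a recommendation), the step-2 labels can be turned into a working top-level classifier straight away:

```python
# Hypothetical sketch: train a coarse classifier from the step-2 labels.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "how can I transfer funds to my checking account?",
    "my card was declined at the store",
    # ... every utterance labeled in step 2
]
labels = [
    "has a question",
    "has a problem",
    # ... the high-level intent assigned to each utterance
]

coarse_classifier = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(max_iter=1000),
)
coarse_classifier.fit(texts, labels)

print(coarse_classifier.predict(["I can't log into the mobile app"]))
```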

Image by Author
  • Step 3: For every intent (e.g. "has a question"), identify more specific "sub-intents" that its training examples can fall into (e.g. "has a question > about credit account", "has a question > about account settings").
  • Step 4: Re-assign the top-level intents' training data to the more specific sub-intents you've just created.
Image by Author
  • Repeat steps 3 & 4 (i.e. divide and conquer).
Image by Author

Every step produces training data for classifiers that can recognize increasingly specific intents: this is one of the major advantages of this approach.
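Putting steps 3 and 4 together, the whole loop fits in a short sketch (hypothetical data structures and function names, plain Python): the hierarchy is just a mapping from intent paths to their utterances, and each refinement pass moves utterances from a parent intent into newly created sub-intents.

```python
# Hypothetical sketch of the bottom-up refinement loop.
# Intents are keyed by their full path, e.g. "has a question > about bank account".

# After steps 1 & 2: a couple of very high-level intents with labeled utterances.
intents = {
    "has a question": [
        "how can I transfer funds to my checking account?",
        "what are your wire transfer fees?",
        "how do I change my email address?",
    ],
    "has a problem": [
        "my card was declined at the store",
        "I can't log into the mobile app",
    ],
}

def refine(intents, parent, assignments):
    """Steps 3 & 4: move some of the parent's utterances into new sub-intents.

    'assignments' maps an utterance to the sub-intent it belongs to
    (in practice this assignment is made by a human, one utterance at a time).
    """
    for utterance, sub_name in assignments.items():
        child = f"{parent} > {sub_name}"
        intents.setdefault(child, []).append(utterance)
        intents[parent].remove(utterance)
    return intents

# One refinement pass on "has a question"; repeat on any intent that grows
# large enough to be worth splitting further (divide and conquer).
refine(intents, "has a question", {
    "how can I transfer funds to my checking account?": "about bank account",
    "what are your wire transfer fees?": "about bank account",
    "how do I change my email address?": "about account settings",
})
```

After each pass, every key in intents is a ready-made training set for a classifier at that level of specificity.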

What’s the catch?

If this approach to labeling and training data seems obvious, that's because it is: divide-and-conquer has been used to break problems down into manageable chunks for a long time; it just hasn't been made easily available for data labeling and intent discovery use-cases yet.

The main reason for this is a question of tooling and resources: the labeling and refactoring workflows required to make this efficient and manageable at scale are costly to build out, and only the more sophisticated companies have done so — these companies are able to charge customers thousands and thousands of dollars to build and train intents from unstructured data.

There are, however, some solutions out there focused on democratizing this approach: HumanFirst is one of them, and provides one of the first out-of-the-box bottom-up labeling and intent discovery solutions. In our next article, we'll explore how machine learning and semantic search can accelerate this bottom-up approach. Stay tuned!
