December 17, 2021
·
4 MIN READ

The Importance of High-Quality Training Data



ALEX DUBOIS

In the new AI/NLU paradigm, businesses realize that complex algorithms will no longer sustain their competitive advantage. The advantage lies in the ability to curate and utilize high-quality training data.

However, the tooling to build, curate, and manage AI training data has not kept pace with the surge of conversational AI/NLU tools.

And yet it’s widely recognized that AI/ML is 95%+ data work.

Andrew Ng, a pioneer of the data-centric movement, has asked why data quality isn’t the top priority for a machine learning team, given that most machine learning work is data cleaning and preparation.

This was a response to the conventional wisdom among AI practitioners that improving an AI system means iterating on the model while holding the data fixed, also known as the model-centric approach.

The data-centric approach flipped that equation on its head. AI practitioners began to systematically improve the quality of the data while holding the model fixed.

This is why we’re seeing a shift that reflects the importance of a data-centric view.

How do I make the shift from model-centric to data-centric?

To adopt a data-centric approach, you have to prioritize the continuous iteration and improvement of the data over the model.

Let’s take a closer look.

Quality > Quantity

It’s important to prioritize quality over quantity. This axiom has been ingrained in us by every boss, professor, and Marie Kondo enthusiast. Why hoard massive amounts of data you don’t need?

Low-quality, high-volume data will inevitably focus your attention on the wrong things, not to mention the 100+ hours spent tuning your model to overcome it. The lack of high-quality data makes many of these algorithms pretty impractical.

Use real-world data

Your system needs to translate into real-world utility. It is important to use historical, customized, high-quality data so your model is tailored to your exact use case. This addresses the generalization problem that has made certain industries (like healthcare and agriculture) slow to adopt machine learning: a lack of tailored data. Training data isn’t ubiquitous, and there is no one-size-fits-all dataset that can be applied to every domain.

Using ‘state-of-the-art’ generative models to produce synthetic data favors quantity over quality. Your model will be trained on implausible examples and will struggle when deployed in the real world.

The best training data comes from real conversations that are specific to your users.

Get systematic

Manage your data systematically: use consistent, unambiguous labeling techniques, a streamlined way to discover new intents and increase coverage, and ways to disambiguate overlapping intents.

An unmethodical, top-down approach leads to inaccuracy and leaves the long tail of requests out of scope.
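One simple systematic check is to look for utterances that carry more than one intent label. The sketch below is a minimal illustration of that idea, not a description of any particular tool: the `find_label_conflicts` function, the `(text, intent)` tuple format, and the sample intents are all assumptions made for this example.

```python
from collections import defaultdict


def normalize(text):
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(text.lower().split())


def find_label_conflicts(examples):
    """Return {normalized utterance: sorted intents} for inconsistently labeled text.

    `examples` is an iterable of (text, intent) pairs (hypothetical format).
    """
    intents_by_text = defaultdict(set)
    for text, intent in examples:
        intents_by_text[normalize(text)].add(intent)
    return {
        text: sorted(intents)
        for text, intents in intents_by_text.items()
        if len(intents) > 1
    }


# Illustrative data only: the first two rows are the same request with different labels.
examples = [
    ("I want to cancel my order", "cancel_order"),
    ("i want to  cancel my order", "refund_request"),
    ("Where is my package?", "track_order"),
]
conflicts = find_label_conflicts(examples)
print(conflicts)
```

Each conflict it surfaces is a prompt for a human decision: either the labeling guideline is ambiguous, or the two intents overlap and need to be merged or redefined.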

Use error analysis

Find where the data systematically underperforms, then use workflows to fill in the poor or missing data. For example, adopt workflows with timely feedback loops and revisions. Catching errors early and keeping track of changes matters, because it lets you roll back anything that didn’t work out as expected.
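As a rough sketch of what error analysis can look like in practice, the snippet below computes an error rate per intent from evaluation results and ranks intents from worst to best. The `(true_intent, predicted_intent)` record format, the function name, and the sample data are assumptions for illustration.

```python
from collections import Counter


def error_rate_by_intent(records):
    """Return [(intent, error_rate, total)] sorted from worst to best.

    `records` is an iterable of (true_intent, predicted_intent) pairs
    (hypothetical format).
    """
    totals = Counter()
    errors = Counter()
    for true_intent, predicted_intent in records:
        totals[true_intent] += 1
        if predicted_intent != true_intent:
            errors[true_intent] += 1
    rates = [
        (intent, errors[intent] / totals[intent], totals[intent])
        for intent in totals
    ]
    return sorted(rates, key=lambda row: row[1], reverse=True)


# Illustrative evaluation results only.
records = [
    ("cancel_order", "cancel_order"),
    ("cancel_order", "refund_request"),
    ("refund_request", "refund_request"),
    ("track_order", "track_order"),
    ("track_order", "track_order"),
    ("cancel_order", "refund_request"),
]
report = error_rate_by_intent(records)
for intent, rate, total in report:
    print(f"{intent}: {rate:.0%} errors over {total} examples")
```

Intents at the top of the report are the first candidates for more examples, relabeling, or a clearer definition.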

Continuous Improvement

Systematic improvement of training datasets is one of the most effective ways to improve the performance of a model. Build a repeatable method around the following questions:

  • How do we continuously improve the intents that we’ve deployed?
  • How do we continuously discover new intents?
  • How is this workflow streamlined?
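One possible way to approach the second question is to route low-confidence predictions from production into a review queue, since they often point to missing or poorly covered intents. This is only a sketch under stated assumptions: the `(text, predicted_intent, confidence)` tuple format, the `review_queue` function, and the 0.5 threshold are all hypothetical.

```python
def review_queue(predictions, threshold=0.5):
    """Return predictions whose confidence is below `threshold`, least confident first.

    `predictions` is an iterable of (text, predicted_intent, confidence) tuples
    (hypothetical format); the 0.5 default is an arbitrary assumption.
    """
    uncertain = [p for p in predictions if p[2] < threshold]
    return sorted(uncertain, key=lambda p: p[2])


# Illustrative model outputs only.
predictions = [
    ("cancel my order", "cancel_order", 0.93),
    ("can I change the delivery address", "track_order", 0.41),
    ("do you ship to Canada", "track_order", 0.22),
]
queue = review_queue(predictions)
for text, intent, confidence in queue:
    print(f"{confidence:.2f}  {text}  (predicted: {intent})")
```

Reviewing this queue on a regular schedule turns intent discovery into a routine step rather than an occasional cleanup.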

Ultimately, the best way to make the shift towards data-centricity is to adopt data-centric NLU tooling.

HumanFirst: Data-centric Tooling for NLU

We often get the question,

“Is HumanFirst another chatbot or NLU platform, like Dialogflow or Watson?”

If it’s not obvious by now, the answer is no. We addressed the tooling gap in the market with a hyper-efficient tool for building and maintaining the data that powers your chatbot or NLU.

We saw teams turning to Excel or building (and maintaining) their own tooling and processes to do this work. Both of these alternatives lead to inefficiencies, frustration, high cost, and a delayed time to market.

So, what are we?

A tool to systematically engineer data used to build AI systems, with the ability to:

  • Explore your unlabeled dataset to inform development priorities and business decisions
  • Discover what to label with clustering and semantic search capabilities
  • Build an intent hierarchy with machine-learning labeling workflows
  • Test your data with real-time updates and revisions
  • Correct issues with remediation workflows, while discovering the long tail of intents in the process
  • Work with natural language data in a fun, intuitive, and useful way

As we move towards the democratization of ML skills and tools, data is becoming the key component and differentiator of modern ML pipelines, and clean, de-noised datasets will set data architectures apart. Training data needs to yield accurate predictions and to scale systematically and sustainably.

Our vision at HumanFirst is to make the entire process of discovering, training, and improving intents from raw natural language data productized and user-intuitive. HumanFirst maintains the most advanced data pipeline and platform to address this gap in the ML/AI tooling ecosystem.

HumanFirst is like Excel for natural language data: a complete productivity suite to transform natural language into business insights and AI training data.
