The Importance of High-Quality Training Data

December 17, 2021

In the new AI/NLU paradigm, businesses realize that complex algorithms will no longer sustain their competitive advantage. The advantage lies in the ability to curate and utilize high-quality training data.

However, the tooling to build, curate and manage AI training data has not followed the surge of conversational AI/NLU tools.

Yet it’s widely recognized that AI/ML is 95%+ data work.

Andrew Ng, a pioneer of the data-centric movement, has asked why machine learning teams don’t treat data quality as their top priority, given that most machine learning work is data cleaning and preparation.

This was in response to the conventional wisdom among AI practitioners that, to improve an AI system, you iterate on the model and hold the data fixed. This is known as the model-centric approach.

The data-centric approach flipped that equation on its head. AI practitioners began to systematically improve the quality of the data while holding the model fixed.

This is why we’re seeing a shift that reflects the importance of a data-centric view.

How do I make the shift from model-centric to data-centric?

To adopt a data-centric approach, you have to prioritize the continuous iteration and improvement of the data over the model.

Let’s take a closer look.

Quality > Quantity

It’s important to prioritize quality over quantity. This axiom has been ingrained in us by every boss, professor, and Marie Kondo enthusiast. Why hoard massive amounts of data you don’t need?

Low-quality, high-volume data will inevitably focus your attention on the wrong things, not to mention the 100+ hours spent tuning your model to overcome it. The lack of high-quality data makes many of these algorithms pretty impractical.

Use real-world data

Your system needs to translate into real-world utility. It is important to use historical, customized, high-quality data so your model is tailored to your exact use case. This addresses the generalization problem that has made some industries, like healthcare and agriculture, slow to adopt machine learning: they lack tailored data. Training data isn’t one-size-fits-all; it can’t be universally applied to any domain.

Using ‘state-of-the-art’ generative models to produce synthetic data favors quantity over quality. Your model will be trained on implausible examples and is likely to fail when deployed in the real world.

The best training data comes from real conversations that are specific to your users.

Get systematic

Discover a systematic way to manage your data by having consistent labeling techniques without ambiguity, a streamlined way to discover new intents to increase coverage, and ways to disambiguate overlapping intents.

An unmethodical, top-down approach leads to inaccuracy, and the long tail of requests will be left out of scope.
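As a rough illustration of what checking for ambiguous labels can look like, the sketch below flags utterances that were labeled with more than one intent. It is a minimal example under stated assumptions, not how any particular tool does it: the function names, the normalization rule, and the sample intents (cancel_order, refund_request, and so on) are all hypothetical.

```python
from collections import defaultdict


def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace (an assumed, simple rule)."""
    kept = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def find_conflicting_labels(examples):
    """Map each normalized utterance that has more than one intent to its sorted intents."""
    intents_by_text = defaultdict(set)
    for text, intent in examples:
        intents_by_text[normalize(text)].add(intent)
    return {
        text: sorted(intents)
        for text, intents in intents_by_text.items()
        if len(intents) > 1
    }


# Hypothetical labeled utterances: (text, intent)
examples = [
    ("I want to cancel my order", "cancel_order"),
    ("i want to cancel my order!", "refund_request"),
    ("Where is my package?", "track_order"),
    ("Cancel my subscription", "cancel_subscription"),
]

conflicts = find_conflicting_labels(examples)
for text, intents in conflicts.items():
    print(f"Ambiguous: {text!r} is labeled as {intents}")
```

Utterances flagged this way are candidates for relabeling, or a hint that two intents overlap and need clearer definitions.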

Use error analysis

Find where the model systematically underperforms, and use workflows to fill in the poor or missing data. For example, adopt workflows with timely feedback loops and revisions. Catching errors early on and keeping track of changes is important to make sure you can roll back anything that didn’t work out as expected.
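One simple way to start an error analysis is to compare predictions against labels on a held-out set and break the errors down by intent. The sketch below assumes you already have two parallel lists of gold and predicted intents; the function names and sample intents are hypothetical, and a real workflow would likely add more than these two views.

```python
from collections import Counter


def per_intent_error_rates(gold, predicted):
    """Return the fraction of misclassified examples for each gold intent."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    totals = Counter(gold)
    errors = Counter(g for g, p in zip(gold, predicted) if g != p)
    return {intent: errors[intent] / totals[intent] for intent in totals}


def top_confusions(gold, predicted, n=3):
    """Return the n most frequent (gold, predicted) pairs among the errors."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = Counter((g, p) for g, p in zip(gold, predicted) if g != p)
    return pairs.most_common(n)


# Hypothetical evaluation results
gold = ["track_order", "track_order", "cancel_order", "cancel_order", "refund_request"]
predicted = ["track_order", "track_order", "refund_request", "cancel_order", "refund_request"]

rates = per_intent_error_rates(gold, predicted)
confusions = top_confusions(gold, predicted)

# Intents with the highest error rate are the first places to look for poor or missing data.
for intent, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(f"{intent}: {rate:.0%} errors")
print("Most common confusions:", confusions)
```

Intents with high error rates, and pairs that are frequently confused with each other, point to where training examples are thin, mislabeled, or overlapping.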

Continuous Improvement

Systematic improvement of training datasets is one of the most effective ways to improve the performance of a model. You should methodize the following:

  • How do we continuously improve the intents that we’ve deployed?
  • How do we continuously discover new intents?
  • How is this workflow streamlined?

But the best way to make the shift towards data-centricity is to adopt data-centric NLU tooling.

HumanFirst: Data-centric Tooling for NLU

We often get the question,

“Is HumanFirst another chatbot or NLU platform, like Dialogflow or Watson?”

If it’s not obvious by now, the answer is no. We addressed the tooling gap in the market by building a hyper-efficient tool for building and maintaining the data that powers your chatbot or NLU.

We saw teams turning to Excel or building (and maintaining) their own tooling and processes to do this work. Both of these alternatives lead to inefficiencies, frustration, high cost, and a delayed time to market.

So, what are we?

A tool to systematically engineer data used to build AI systems, with the ability to:

  • Explore your unlabeled dataset to inform development priorities and business decisions
  • Discover what to label with clustering and semantic search capabilities
  • Build an intent hierarchy with machine-learning labeling workflows
  • Test your data with real-time updates and revisions
  • Correct issues with remediation workflows, while discovering the long tail of intents in the process
  • Work with natural language data in a fun, intuitive, and useful way

As we move towards the democratization of ML skills and tools, data is becoming the key component of modern ML pipelines. Clean, de-noised datasets will therefore become the key differentiator in data architectures. Training data needs to produce models that return accurate predictions, and it needs to scale systematically and sustainably.

Our vision at HumanFirst is to make the entire process of discovering, training, and improving intents from raw natural language data productized and user-intuitive. HumanFirst maintains the most advanced data pipeline and platform to address this gap in the ML/AI tooling ecosystem.

HumanFirst is like Excel for natural language data: a complete productivity suite to transform natural language into business insights and AI training data.
