Are we focusing too much on the labeling problem?

July 22, 2021

Programmatic labeling has been a hot area of development over the last two years, with Snorkel Flow leading the way and other commercial and open-source projects appearing in this space.

We're excited by this development, because it completely supports our vision that the next generation of NLU data engineering cannot be done by “armies” of non-domain experts.

While I don’t have hands-on experience with these tools (and for Snorkel Flow can only rely on screenshots and testimonials to assert how they do things), I do have a reasonably good understanding of what they do, and more importantly, why they’re being promoted as a viable alternative to hand-labeling training data.

The premise of these techniques is that better labeled datasets can be achieved by replacing costly, low-quality labeling work with high-quality and lower-cost upstream work.

In a nutshell, programmatic labeling allows subject matter experts to define hard-coded rules (called labeling functions, or LFs) that assert whether a particular label should be attached to a particular piece of data.

For example, a simple rule would look for the presence of the words "due to" to match cases where a user is discussing the effects of certain drugs (e.g., "the hazards of optic nerve toxicity due to ethambutol are known").
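As a rough sketch of what such a rule looks like (in plain Python rather than any particular vendor's API; the label constants and function name here are made up for illustration):

```python
# Illustrative label values; real frameworks define their own conventions.
ABSTAIN = -1
DRUG_EFFECT = 1

def lf_due_to(text: str) -> int:
    """Vote DRUG_EFFECT if the text contains 'due to', otherwise abstain."""
    return DRUG_EFFECT if "due to" in text.lower() else ABSTAIN

print(lf_due_to("the hazards of optic nerve toxicity due to ethambutol are known"))  # 1
print(lf_due_to("ethambutol is an antibiotic"))  # -1
```

Note the "abstain" option: an LF only votes on the examples it recognizes, and stays silent on everything else.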

While this sounds like basic regex matching, programmatic labeling then does some "magic" (it's really not magic, but from the end-user's perspective it's a black box 🙂) to learn the "weights" and correlation of each of the hard-coded rules (i.e., how statistically important each one is to the final label relative to the other rules).

These rules are then applied across the unlabeled data to generate "noisy" labeled datasets (without requiring manual labeling); this auto-labeled data is then used to train whatever final end-to-end model is desired (where standard ML accuracy metrics like F1 score can be used to determine where gaps or improvements are needed in the LFs).
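To make the aggregation step concrete, here's a deliberately simplified sketch: a plain majority vote across LF outputs. Real systems like Snorkel instead learn per-LF weights with a label model; everything here (vote values, matrix) is illustrative.

```python
from collections import Counter

ABSTAIN = -1

def majority_vote(votes):
    """Combine one example's LF votes into a single noisy label.
    Abstentions are ignored; ties resolve to the first most-common vote."""
    cast = [v for v in votes if v != ABSTAIN]
    if not cast:
        return ABSTAIN  # no LF fired: the example stays unlabeled
    return Counter(cast).most_common(1)[0][0]

# Each row: one unlabeled example's votes from three hypothetical LFs.
vote_matrix = [[1, 1, -1], [-1, -1, -1], [0, 1, 1]]
noisy_labels = [majority_vote(row) for row in vote_matrix]
print(noisy_labels)  # [1, -1, 1]
```

The resulting noisy labels are what the final end-to-end model actually trains on.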

TLDR: instead of requiring a human to manually label the unstructured data, this technique requires a human to manually craft high-quality LFs that then generate labeled datasets automatically.

With the right data, it's definitely feasible for a few dozen LFs to replace the need to hand-label thousands of training examples. It's also true that a model trained on a large set of very similar training examples will almost certainly perform worse than one trained on a smaller dataset in which all the important variations and keywords are represented.

Given this, it appears a significant efficiency gain is possible with this approach, so it’s no wonder that it’s attracting attention (and investment).

What's the catch?

The starting point for programmatic labeling is a set of three assumptions:

  • You know what labels you want to train your model on
  • Your domain expertise is sufficient to create LFs that capture the different ways and variations by which these labels are expressed in your data
  • Your ML models require massive amounts of training data (presumably only achievable with armies of labelers)

In my experience, however, these three assumptions do not hold for many of the most exciting real-world applications of natural language understanding.

Let's dive in:

  • Assumption 1: You know what labels you want to train your model on

If labeling were really the crux of the NLU problem, platforms that provide scalable workforces to do the labeling would already have "solved" NLU - and we would be at a point where it makes sense to start optimizing by looking for lower-cost alternatives (in both time and resources) that provide the same benefits (like programmatic labeling).

However, labeling platforms and workforces simply haven't found the same success with natural language as they did with images and videos; I argue here that this is because the need to know/provide labels ahead of time severely restricts their usefulness for real-world scenarios - and by extension, I'd argue that programmatic labeling has the same limitation.

Let's take a call center AI use-case: automatically annotating/labeling voice call transcripts (side note: I expect a lot of growth in this area over the next few years).

Forget basic sentiment analysis: the next generation of call center AI will support thousands of labels that identify everything from customers' questions to the type of follow-ups and objections that were encountered during the call.

The teams tasked with building this type of AI face a massive effort of discovering what these hundreds (or thousands) of intents/labels are, and of repeating this process for every new customer's unstructured data: there is no way around this, and the quality of the AI's predictions will depend entirely on how well the chosen labels match what's in the unstructured data - no amount of labeling (whether programmatic or not) will replace this.

  • Assumption 2: Your domain expertise is sufficient to create high-quality LFs

Assuming you have discovered and validated what needs to be labeled, you now need training data for each of these labels - and this is where programmatic labeling promises to both accelerate the process and produce much better-performing training datasets.

The (possibly oversimplified!) premise is that, given a specific label, the domain expert will be able to define LFs that capture the way this label is expressed in the unstructured data (i.e., think of this as encoding all the different keywords that match a specific label).

In my view (and experience) this is a chicken-and-egg problem, especially for use-cases that depend on a long tail of data: discovering these different "rules" is the real crux of the problem, especially if you're trying to make sure that they also reflect the way your users talk! For example, if all of your users are talking colloquially and saying "wassup" or "wadup", an LF checking for the existence of "how are you" will not label these utterances, and the model will suffer.
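A tiny illustrative sketch of this coverage gap (the function, labels, and inputs are all hypothetical):

```python
GREETING = 1
ABSTAIN = -1

def lf_greeting(text: str) -> int:
    """Fires only on the formal phrasing the expert thought of."""
    return GREETING if "how are you" in text.lower() else ABSTAIN

print(lf_greeting("hey, how are you today?"))  # 1
print(lf_greeting("wassup"))  # -1: missed, so no training example is generated
print(lf_greeting("wadup"))   # -1: same gap
```

The colloquial variants silently fall through, and the downstream model never sees them labeled.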

In a sense, it's the same type of challenge teams face when trying to accurately capture the different variations of entities that their customers might use (e.g., bus, motor coach, coach, charter, minibus, shuttle, mini coach, single-decker, double-decker and doubledecker might all need to be known and mapped to bus ahead of time if it's important for your business logic).
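A minimal sketch of that kind of ahead-of-time mapping, using the synonym list above (the function name is made up, and in practice the hard part is discovering this list from the data in the first place):

```python
# Hypothetical synonym map; every variant must be known before it can be mapped.
BUS_SYNONYMS = {
    "bus", "motor coach", "coach", "charter", "minibus", "shuttle",
    "mini coach", "single-decker", "double-decker", "doubledecker",
}

def normalize_vehicle(term: str) -> str:
    """Collapse known variants to the canonical entity; pass unknowns through."""
    return "bus" if term.lower() in BUS_SYNONYMS else term

print(normalize_vehicle("motor coach"))  # bus
print(normalize_vehicle("ferry"))        # ferry
```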

No domain expert can be expected to know exactly how things are represented within the data; tooling that facilitates this discovery process will have a bigger impact than tools that focus on the labeling action itself.

  • Assumption 3: Your ML models need massive amounts of training data (only achievable with armies of labelers)

There are certainly classes of problems where massive amounts of pre-labeled data are needed (training deep-learning end-to-end models, for example).

However, one of the most exciting developments in AI has been transfer learning and few-shot learning techniques, which greatly reduce the need for massive amounts of labeled data.

In our experience, many of the most exciting use-cases for natural language understanding can be expressed as classification problems, where the underlying AI is simply the result of fine-tuning a pre-trained transformer model (or something similar).

In this case, the amount of labeled data you need to train new classes can be quite small (from as few as 20 training examples to a few hundred): in our experience, real-life applications rarely require labeling thousands of data points for any given class.
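As a toy illustration of this small-data regime (using bag-of-words cosine similarity as a crude stand-in for a frozen pre-trained encoder; all examples, class names, and functions here are made up), a handful of labeled examples per class can already drive a nearest-example classifier:

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'; a real pipeline would use a pre-trained encoder."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# A handful of labeled examples per class is the entire "training set".
train = [
    ("when is my bill due", "billing"),
    ("i was charged twice", "billing"),
    ("my internet keeps dropping", "tech_support"),
    ("the router will not connect", "tech_support"),
]

def classify(text):
    """Return the label of the most similar training example."""
    vec = embed(text)
    return max(train, key=lambda ex: cosine(vec, embed(ex[0])))[1]

print(classify("why was i charged again"))       # billing
print(classify("internet connection dropping"))  # tech_support
```

The point isn't this particular toy method: it's that with strong pre-trained representations, the marginal value of the thousandth near-duplicate label is very low.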

Yet the need to label thousands of data points is one of the key arguments for a programmatic labeling approach; in this talk, for example, one of the Snorkel founders argues that a few dozen hard-coded rules will generate training data that outperforms thousands of hand-labeled training examples.

However, this makes me question whether we're comparing apples and oranges.

If hand-labeled examples are extremely similar and repetitive, then labeling thousands of these data points isn't something teams should be doing in the first place, as it won't lead to a better-performing model - this is where active-learning techniques can drastically improve labeling efficiency.
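A minimal sketch of the uncertainty-sampling flavor of active learning (the texts and confidence scores are hypothetical): rather than labeling everything, route only the model's least-confident examples to human annotators.

```python
# (text, model confidence) pairs for unlabeled examples; scores are hypothetical.
predictions = [
    ("refund my order", 0.98),
    ("uh can u like redo the thing", 0.51),  # ambiguous -> high labeling value
    ("cancel my subscription", 0.95),
    ("its doin the weird thing again", 0.55),
]

# Send the least-confident examples to annotators first.
to_label = sorted(predictions, key=lambda p: p[1])[:2]
print([text for text, _ in to_label])
# ['uh can u like redo the thing', 'its doin the weird thing again']
```

Each labeling round then goes to the examples the model learns the most from, instead of near-duplicates it already handles.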

No one will argue against spending time and resources on the quality rather than the quantity of training data, regardless of the labeling approach used.

There’s no question that domain expertise trumps low-skilled labeling when it comes to natural language, particularly for applications that depend on a long tail of data; and providing domain experts with tools to identify and leverage high-value signals (like keywords or natural language constructs) that generalize well across datasets is key to accelerating the work.

We see domain expertise not only in the hands of ML & data scientists in the future, but also in the hands of less technical stakeholders like product owners, customer support managers, linguists etc - basically anyone who has access to natural language data, and wants to extract value from it.

We're excited to see the increasing importance of tools that make exploring and transforming natural language easier - there's no better time to invest in your data than today!

HumanFirst is Excel for Natural Language Data. A complete productivity suite to transform natural language into business insights and AI training data.
