Human Labeling: One of the Most Critical but Often Overlooked Components Behind Machine Learning Applications

Machine learning (ML) models are trained on labeled data sets to make predictions about new observations. How representative the labeled data is of new observations, and how accurate the labels are, largely determines how well the model predicts new cases. In complex ML applications such as autonomous driving, facial recognition, and content moderation, large amounts of high-quality data labels are essential for training state-of-the-art models. Many real-world products depend on human labeling services to continuously generate a reliable stream of accurate labels for training and improving the models. 

As models improve, the demand for high-quality labels grows, to cover more novel corner cases and new data paradigms. For example, the initial version of an autonomous driving model might be trained on a controlled road in good weather conditions. The data labeled to train these early models tries to capture that testing environment as closely as possible. As the model improves, it will be test-driven in the real world, which brings more complex scenarios, unexpected road conditions, and uncommon weather. More data and labels will be needed to cover the newly discovered cases and give the algorithms enough examples to learn from. As the founder of Starsky Robotics noted when the five-year-old autonomous truck startup closed its doors:

“It’s widely understood that the hardest part of building AI [Artificial Intelligence] is how it deals with situations that happen uncommonly, i.e. edge cases. In fact, the better your model, the harder it is to find robust data sets of novel edge cases. Additionally, the better your model, the more accurate the data you need to improve it.”

Most academic research and public discussion on ML applications focuses on the design and implementation of models rather than the process of getting better training data. In practice, what differentiates the performance of an ML application from its competition is often the quality and scale of the data labels it uses for training. It is now common to see a small engineering team build a working prototype of an autonomous car to compete with big tech companies. However, those large companies maintain their competitive advantage by deploying more resources to collect better training data and higher-quality labels for their ML applications.

The Ever-Growing Data Challenges Behind ML Applications

In future posts, I will go over some key challenges in producing high-quality labels for popular ML applications, how to build a scalable human labeling system, and how some online labeling services compare. But first, why is human labeling so important? What often goes wrong?

When we start to build the initial models for our application, we often use easily obtainable training data, either by downloading public data online or by manually labeling some sampled data ourselves. This approach can help us get a baseline solution quickly. In some cases, we can even start with a pre-trained model and apply transfer learning to our own dataset. This lets us leverage the large standard training sets and computational resources that went into the pre-trained model while fine-tuning it to our own customized data. 
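As a rough illustration of this bootstrapping step, here is a minimal transfer-learning sketch using a pre-trained torchvision image model. The dataset directory and the number of custom labels are placeholders for whatever your application actually uses.

```python
# A minimal transfer-learning sketch using a pre-trained torchvision model.
# The dataset path and NUM_CUSTOM_LABELS are hypothetical placeholders.
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

NUM_CUSTOM_LABELS = 5  # hypothetical label count for our own application

# Start from a model pre-trained on a large standard dataset (ImageNet).
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the pre-trained feature extractor; only the new head will be trained.
for param in model.parameters():
    param.requires_grad = False

# Replace the final classification layer to match our own label set.
model.fc = nn.Linear(model.fc.in_features, NUM_CUSTOM_LABELS)

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
train_data = datasets.ImageFolder("data/our_labeled_samples", transform=transform)
loader = torch.utils.data.DataLoader(train_data, batch_size=32, shuffle=True)

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Fine-tune only the new head on our own (small) labeled dataset.
model.train()
for images, labels in loader:
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    optimizer.step()
```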

However, we will run into two core issues when we start to optimize our model:

  • The initial data labels may not cover all the needs of the application. For example, the application may require us to break down an existing label into a few fine-grained ones, or create a new label for newly discovered cases.
  • The initial training data may not fully represent the real data the application needs to handle. The real application may need to handle many more business cases, or the distribution of labels in the training data may be very different from the distribution of labels seen in real data (see the sketch after this list).
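To make the distribution mismatch concrete, here is a small sketch with made-up label counts that compares the label distribution of the initial training set against a sample of real production data. A large gap suggests the training data no longer represents what the application actually sees.

```python
# A small sketch comparing label distributions between the initial training
# set and a sample of real production data. All counts below are made up.
from collections import Counter

training_labels = ["spam"] * 500 + ["ok"] * 500    # balanced by construction
production_labels = ["spam"] * 30 + ["ok"] * 970   # what we actually see live

def distribution(labels):
    total = len(labels)
    return {label: count / total for label, count in Counter(labels).items()}

train_dist = distribution(training_labels)
prod_dist = distribution(production_labels)

# Total variation distance: 0 means identical distributions, 1 means disjoint.
all_labels = set(train_dist) | set(prod_dist)
tv_distance = 0.5 * sum(abs(train_dist.get(l, 0) - prod_dist.get(l, 0)) for l in all_labels)

print(f"training:   {train_dist}")
print(f"production: {prod_dist}")
print(f"total variation distance: {tv_distance:.2f}")  # large gap -> relabel real data
```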

Both issues require the models to be retrained with newly labeled data. This is a common theme when developing real-world applications. Even if effort is put into getting a large set of high-quality labels at the beginning, changing business requirements and evolving user behaviors often make the initial data outdated very quickly. 

For real-world applications, business needs constantly grow as more use cases are built, which in turn drives new user behaviors and generates new kinds of data. In some applications, the definition of the same label may change as well. For example, Facebook has lately been at the center of heated public debate over its content moderation policies. The standard for labeling which content is deemed harmful keeps changing as that debate evolves. This means data needs to be relabeled using the updated guidelines right away, and the model needs to be retrained on the new labels as well. 

These challenges often require teams to invest in a human labeling process that can continuously generate newly labeled data using up-to-date guidelines and a well-trained human workforce. The data being labeled is often sampled from the real data handled by the application, rather than the pre-existing data used to bootstrap the initial models. The labels, and the guidelines for humans to follow, are also constantly updated to align with business requirements and new insights from model optimization.

Humans in the Loop

AI is fundamentally transforming many products nowadays. ML models can personalize the content in your feed based on your viewing history. They can also predict the ads or products you are most likely to purchase. Google Maps can now utilize more contextual signals to better guess where exactly you are: hanging out in a coffee shop versus walking around the mall. Significant progress has been made toward autonomous driving, medical assistance, and so on. 

Most of these real-world products depend on multiple ML models that interact with each other rather than a single-purpose one. The combined performance of all these models delivers the ultimate user experience for a product, and the overall prediction accuracy is often bounded by the worst-performing ones. 
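As a back-of-the-envelope illustration (the accuracy numbers below are made up): when predictions flow through a chain of models and each step must be correct for the final output to be correct, the errors compound and the weakest component dominates.

```python
# Back-of-the-envelope illustration with made-up accuracies: when models are
# chained and each prediction must be correct for the final output to be
# correct, errors compound and the weakest component dominates.
accuracies = {"model_a": 0.95, "model_b": 0.92, "model_c": 0.70}

overall = 1.0
for name, acc in accuracies.items():
    overall *= acc

print(f"overall accuracy: {overall:.2f}")  # ~0.61, dragged down by model_c
print(f"weakest component: {min(accuracies, key=accuracies.get)}")
```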

To improve overall product performance, it's often more practical to replace components that lack high-quality automated solutions with humans in the loop. In this way, we can improve the overall performance of the ML system by taking the poorly performing components out of the critical path. Human intervention also creates more training data for those weak components, so they can continuously improve until they can be fully automated. The more accurate output generated by human workers directly enhances all other components that depend, directly or indirectly, on that data. This creates a virtuous cycle.
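A minimal sketch of this human-in-the-loop pattern is shown below, assuming a hypothetical classifier that exposes a confidence score: low-confidence cases are routed to human reviewers, and the human answers are both returned to the product and collected as new training data for the weak component.

```python
# A minimal human-in-the-loop sketch. `model_predict` and `ask_human_reviewer`
# are hypothetical stand-ins for a real classifier and a real labeling queue.
CONFIDENCE_THRESHOLD = 0.9
new_training_examples = []  # human answers become future training data

def model_predict(item):
    # Placeholder: a real model would return (label, confidence).
    return "safe", 0.62

def ask_human_reviewer(item):
    # Placeholder: a real system would enqueue the item for human labeling.
    return "harmful"

def classify(item):
    label, confidence = model_predict(item)
    if confidence >= CONFIDENCE_THRESHOLD:
        return label                            # confident enough: fully automated path
    human_label = ask_human_reviewer(item)      # weak spot: hand off to a human
    new_training_examples.append((item, human_label))  # feed the virtuous cycle
    return human_label

print(classify({"text": "some borderline content"}))
print(f"collected {len(new_training_examples)} new labeled example(s)")
```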

Take online advertising as an example. Ad-serving systems need many tags and labels to target ads to the right audience in real time. It's possible to build automatic classifiers to generate the tags for each advertisement, but much better results are often achieved when the tags are either supplied by the creator or labeled by humans directly. The more accurate and consistent those tags are, the less noise the models need to handle, and the better the overall system can perform. 

Another example is the point-of-interest (POI) models often used in location-based services like Foursquare or Yelp. These services need to assign category labels to individual points (places) to identify the characteristics relevant to business needs. Those labels then enable other product features like recommendations, search, and check-ins. Although it's possible to adopt a generic location taxonomy and assign labels automatically with an algorithmic process, significant improvements can come from investing in a highly curated taxonomy tailored to the product features. For example, it makes sense for Yelp to use a taxonomy covering specific types of restaurants rather than road conditions, while the latter is more useful for applications like Waze. 
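Purely as an illustration (the category names are invented), the difference between a generic taxonomy and a product-tailored one might look like this:

```python
# Illustrative only: a generic location taxonomy versus a product-tailored one.
# All category names below are made up to show the difference in granularity.
GENERIC_TAXONOMY = ["restaurant", "store", "park", "road"]

# A restaurant-focused product benefits from finer-grained labels:
CURATED_TAXONOMY = {
    "restaurant": ["ramen", "dim_sum", "taqueria", "steakhouse", "vegan_cafe"],
    "cafe": ["coffee_shop", "bubble_tea", "bakery"],
}

def label_options(top_level_category):
    """Return the fine-grained labels a human labeler can choose from."""
    return CURATED_TAXONOMY.get(top_level_category, [top_level_category])

print(label_options("restaurant"))  # the options a labeler sees for this place
```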

Ground Truth for AI

There’s a well-known saying in the ML community: garbage in, garbage out. If a model is trained on badly labeled data full of errors, it will produce bad predictions in return. A solid human labeling process provides the high-quality labels needed to train better models. Those high-quality human labels can serve as ground truth for the related ML applications: something to train on, calibrate with, and evaluate against. 
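As a small sketch of the evaluation side (the label lists below are invented), human-provided ground truth can be compared against model predictions with standard agreement metrics, for example from scikit-learn:

```python
# Evaluating model output against human-provided ground truth labels.
# The label lists below are made up for illustration.
from sklearn.metrics import accuracy_score, cohen_kappa_score

human_ground_truth = ["ok", "harmful", "ok", "ok", "harmful", "ok"]
model_predictions  = ["ok", "ok",      "ok", "ok", "harmful", "harmful"]

print("accuracy vs. ground truth:", accuracy_score(human_ground_truth, model_predictions))
print("agreement (Cohen's kappa):", cohen_kappa_score(human_ground_truth, model_predictions))
```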

Hence, it’s important for teams working on ML-empowered products to invest in a human labeling service, so they can continuously generate high-quality labels, handle quickly changing business requirements, incorporate constantly evolving labeling standards, and learn from newly discovered novel cases.