Labeled data rarely exists in the real world

Sanjog Mehta
Most machine learning courses assume labeled training data is available, since it is essential for supervised models. But labeled data rarely exists in the real world. Labels have to be collected, labeling is slow and costly, and without labels there is no model.
So what’s the solution?
We need a model to label the data, yet we need labeled data to build that model. The way out is to bring a human into the loop and let the model choose which samples are worth labeling. Let’s simplify this in the steps below.
1. Select a small percentage of the whole dataset, hand-label it, and train a rough initial model.
2. Use that rough model to generate predictions, with confidence scores, for the remaining unlabeled data.
3. Sampling the data the model is already confident about won’t add much value, so sample the low-confidence predictions instead: those are the cases the model struggles with and needs to learn. Labeling these low-confidence samples will improve the model the most. This approach is called uncertainty sampling, because we use the model’s uncertainty to decide which samples to hand-label next.
4. Retrain the model on the newly labeled data and start the process all over again. The retrained model will be better than the previous one, which was built on a randomly sampled dataset.
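The selection step above can be sketched in a few lines of NumPy. This is a minimal illustration, not a full active-learning loop: `select_uncertain` is a made-up helper name, and the probabilities are a toy stand-in for whatever our rough model predicts.

```python
import numpy as np

def select_uncertain(probs, k):
    """Pick the k samples whose top-class probability is lowest
    (least-confidence uncertainty sampling)."""
    confidence = probs.max(axis=1)     # model's confidence per sample
    return np.argsort(confidence)[:k]  # least confident first

# Toy predicted probabilities for 4 unlabeled samples, 2 classes.
probs = np.array([[0.95, 0.05],
                  [0.55, 0.45],   # most uncertain
                  [0.80, 0.20],
                  [0.60, 0.40]])

queried = select_uncertain(probs, 2)  # -> indices [1, 3], sent for hand-labeling
```

In a real loop we would hand-label the queried samples, retrain, and repeat; libraries such as modAL wrap exactly this pattern.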
Weak supervision combines many low-quality labels from multiple sources, such as labeling functions built from heuristics, regular expressions, alternative datasets, or pre-trained models. Labels from weak supervision may not be as good as hand-labeled data, but they are often good enough to reach decent performance.
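To make labeling functions concrete, here is a minimal sketch for a hypothetical spam task. The functions and rules are invented for illustration, and the combiner is a plain majority vote; real frameworks such as Snorkel learn how much to trust each function instead.

```python
from collections import Counter

# Illustrative labeling functions: each returns "spam", "ham",
# or None to abstain when its rule doesn't apply.
def lf_keyword(text):
    return "spam" if "free money" in text.lower() else None

def lf_link(text):
    return "spam" if "http://" in text else None

def lf_greeting(text):
    return "ham" if text.lower().startswith("hi ") else None

def weak_label(text, lfs):
    """Majority vote over the labeling functions that didn't abstain."""
    votes = [v for v in (lf(text) for lf in lfs) if v is not None]
    return Counter(votes).most_common(1)[0][0] if votes else None

lfs = [lf_keyword, lf_link, lf_greeting]
weak_label("Claim your FREE MONEY at http://x.co", lfs)  # -> "spam"
weak_label("Hi Bob, meeting at 3?", lfs)                 # -> "ham"
```

Each function is individually noisy, but aggregating many of them produces training labels at a scale no human team could match.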
Semi-supervised learning combines a small amount of hand-labeled data with the remaining unlabeled data during training. Let’s simplify this in the steps below.
1. Select a small percentage of the whole dataset, hand-label it, and train a high-precision model to start with.
2. Use that model to generate predictions, with confidence scores, for the remaining unlabeled data.
3. Take the most confident predictions as pseudo-labels and add them to the training data.
4. Train another model on the labels and pseudo-labels, and repeat until no more high-confidence pseudo-labels remain.
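One round of the pseudo-labeling step can be sketched with NumPy. This is a simplified illustration: `add_pseudo_labels` and the 0.9 threshold are assumptions, and in practice we would retrain the model between rounds rather than reuse the same probabilities.

```python
import numpy as np

def add_pseudo_labels(X_train, y_train, X_unlab, probs, threshold=0.9):
    """Move high-confidence predictions into the training set as
    pseudo-labels; return the grown set plus the still-unlabeled rest."""
    conf = probs.max(axis=1)
    keep = conf >= threshold  # confident enough to trust as a label
    X_new = np.vstack([X_train, X_unlab[keep]])
    y_new = np.concatenate([y_train, probs[keep].argmax(axis=1)])
    return X_new, y_new, X_unlab[~keep]

# Toy data: 2 labeled samples, 3 unlabeled samples, 3 features each.
X_train, y_train = np.zeros((2, 3)), np.array([0, 1])
X_unlab = np.ones((3, 3))
probs = np.array([[0.95, 0.05],   # confident -> pseudo-label 0
                  [0.60, 0.40],   # uncertain -> stays unlabeled
                  [0.05, 0.95]])  # confident -> pseudo-label 1

X_new, y_new, X_rest = add_pseudo_labels(X_train, y_train, X_unlab, probs)
# X_new now has 4 rows, and one uncertain sample remains in X_rest.
```

scikit-learn ships this loop ready-made as `SelfTrainingClassifier`, which wraps any probabilistic estimator.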
Have you ever faced a situation where you didn’t have labels for supervised learning? How did you overcome it?