Imbalanced Classification in real world datasets


Classification is a very common use case in machine learning, and most courses teach it with roughly balanced datasets. Many real-world use cases, however, have an imbalanced distribution of the categories/classes to be classified. On such data, standard approaches for balanced classification give the illusion of good performance: accuracy itself becomes a misleading evaluation metric when one class dominates.
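A minimal sketch of why accuracy misleads, using hypothetical fraud-like labels with a 1% minority class: a model that always predicts the majority class catches zero frauds yet reports 99% accuracy.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels: 10 frauds among 1000 transactions (1% minority).
y_true = np.array([1] * 10 + [0] * 990)

# A useless "model" that always predicts the majority (non-fraud) class...
y_pred = np.zeros_like(y_true)

# ...still scores 99% accuracy while detecting no fraud at all.
print(accuracy_score(y_true, y_pred))  # 0.99
```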

A few examples:

  • Fraudulent credit card transaction detection – Fraudulent transactions are far fewer than normal ones.

  • Cancer detection – Non-cancer cases significantly outnumber cancer cases.

  • Click-through rate prediction – Far fewer users click on advertisements than ignore them.

Such classification use cases need different and detailed approaches for different sections of the machine learning pipeline.

  • Sampling techniques to handle imbalanced distribution of categories.

    • Oversampling the minority class – SMOTE is the most popular method.

    • Undersampling the majority class.

  • Stratified cross-validation to ensure that each fold has the same class distribution as the original dataset while training.

  • Evaluation metrics selection.

    • For predicting class probabilities.

      • ROC AUC to decide the probability threshold when both classes are important.

      • Precision-Recall AUC to decide the threshold when the positive (minority, as in the examples above) class is what matters.

    • For predicting class labels

      • F-beta with beta = 1 (the F1 score) when False Positives and False Negatives are equally costly, as it balances Precision and Recall.

      • F-beta with beta < 1 when False Positives are more costly, as it weights Precision more than Recall.

      • F-beta with beta > 1 when False Negatives are more costly, as it weights Recall more than Precision.
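The sampling step above can be sketched with scikit-learn alone. This hypothetical example uses random oversampling and undersampling via `sklearn.utils.resample`; SMOTE, mentioned above, lives in the separate imbalanced-learn package and synthesizes new minority points by interpolating between neighbours rather than duplicating rows.

```python
from collections import Counter

import numpy as np
from sklearn.datasets import make_classification
from sklearn.utils import resample

# Hypothetical dataset with a ~1% minority class.
X, y = make_classification(n_samples=2000, weights=[0.99], random_state=42)
X_maj, X_min = X[y == 0], X[y == 1]

# Random oversampling: duplicate minority rows until the classes match.
X_min_up = resample(X_min, n_samples=len(X_maj), replace=True, random_state=42)
X_over = np.vstack([X_maj, X_min_up])
y_over = np.array([0] * len(X_maj) + [1] * len(X_min_up))
print(Counter(y_over))  # both classes now equal in size

# Random undersampling: drop majority rows down to the minority size.
X_maj_down = resample(X_min if False else X_maj,  # majority class
                      n_samples=len(X_min), replace=False, random_state=42)
print(len(X_maj_down) == len(X_min))
```

Resampling should be applied only to the training split, never to the test set, or the evaluation leaks duplicated rows.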
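The stratified cross-validation point can be illustrated with scikit-learn's `StratifiedKFold` on a hypothetical 10%-minority dataset: every fold preserves roughly the same class ratio as the full data.

```python
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

# Hypothetical dataset with a ~10% minority class.
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    # Each held-out fold keeps roughly the original ~9:1 class ratio.
    print(fold, Counter(y[test_idx]))
```

A plain `KFold` on the same data could, by chance, leave a fold with almost no minority samples, making its metrics meaningless.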
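The metric choices above can be compared side by side. This is a hypothetical sketch: a logistic regression on a synthetic 5%-minority dataset, scoring probabilities with ROC AUC and Precision-Recall AUC (via `average_precision_score`), and hard labels with `fbeta_score` at beta = 0.5 and beta = 2.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, fbeta_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical dataset with a ~5% minority class.
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]  # class probabilities
pred = clf.predict(X_te)               # hard labels at the default 0.5 threshold

print("ROC AUC:", roc_auc_score(y_te, proba))            # both classes matter
print("PR AUC :", average_precision_score(y_te, proba))  # minority class matters
print("F0.5   :", fbeta_score(y_te, pred, beta=0.5))     # favours Precision
print("F2     :", fbeta_score(y_te, pred, beta=2.0))     # favours Recall
```

On imbalanced data the PR AUC is typically well below the ROC AUC for the same model, which is exactly why it is the more honest metric when the minority class is the target.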

These are the primary approaches to take care of when handling imbalanced classification, but there are many other details to consider for such use cases.

Have you ever had an imbalanced classification use case?
