Imbalanced Classification in Real-World Datasets
Classification is a very common use case in machine learning, and it is usually taught with roughly balanced datasets, but many real-world use cases involve an imbalanced distribution of the categories/classes to be predicted. Standard approaches built for balanced classification give the illusion of good performance, because accuracy itself becomes a misleading evaluation metric when the class distribution is imbalanced: on a dataset with 99% negative cases, a model that always predicts the negative class reaches 99% accuracy while detecting nothing.
A few examples:
- Fraudulent credit card transaction detection – fraudulent transactions are far rarer than normal transactions.
- Cancer detection – non-cancer cases significantly outnumber cancer cases.
- Click-through rate prediction – users who click on advertisements are far fewer than those who do not.
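To make the accuracy point above concrete, here is a tiny sketch using scikit-learn's DummyClassifier on a synthetic 99:1 dataset (the dataset and seed are illustrative assumptions): a baseline that always predicts the majority class scores about 99% accuracy while never catching a single positive case.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic ~99:1 dataset standing in for a fraud/cancer-style problem (illustrative).
X, y = make_classification(
    n_samples=10_000, n_features=20, weights=[0.99, 0.01], random_state=42
)

# Baseline "model" that always predicts the majority (negative) class.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = baseline.predict(X)

print("accuracy:", accuracy_score(y, pred))        # ~0.99, looks impressive
print("minority recall:", recall_score(y, pred))   # 0.0, misses every positive case
```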
Such classification use cases call for dedicated approaches at several stages of the machine learning pipeline:
- Sampling techniques to handle the imbalanced distribution of categories (see the resampling sketch after this list):
  - Oversampling the minority class – SMOTE is the most popular method.
  - Undersampling the majority class.
- Stratified cross-validation, so that each fold preserves the class distribution of the original dataset during training (see the cross-validation sketch below).
- Evaluation metric selection (see the metrics sketch below):
  - For predicting class probabilities:
    - ROC AUC, to choose the probability threshold when both classes are important.
    - Precision-Recall AUC, to choose the threshold when the positive class (the minority, as in the examples above) matters most.
  - For predicting class labels:
    - Fbeta = 1 (the F1 score), when both false positives and false negatives are costly, since it balances precision and recall.
    - Fbeta < 1, when false positives are more costly, since it gives more weight to precision than to recall.
    - Fbeta > 1, when false negatives are more costly, since it gives more weight to recall than to precision.
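As a concrete illustration of the resampling step, here is a minimal sketch using the imbalanced-learn library; the synthetic 99:1 dataset and the random seeds are illustrative assumptions, not part of any specific use case.

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

# Synthetic imbalanced dataset (~1% positives), standing in for the use cases above.
X, y = make_classification(
    n_samples=10_000, n_features=20, weights=[0.99, 0.01], random_state=42
)
print("original:", Counter(y))

# Oversample the minority class with SMOTE (synthetic points interpolated
# between existing minority neighbours).
X_over, y_over = SMOTE(random_state=42).fit_resample(X, y)
print("after SMOTE:", Counter(y_over))

# Alternatively, randomly drop majority samples down to the minority class size.
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)
print("after undersampling:", Counter(y_under))
```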
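For the cross-validation step, a minimal sketch with scikit-learn's StratifiedKFold follows; the logistic regression model, the five folds, and the average-precision scoring are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(
    n_samples=10_000, n_features=20, weights=[0.99, 0.01], random_state=42
)

# Each of the 5 folds keeps the original ~99:1 class ratio, so every
# validation split still contains minority examples.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="average_precision"
)
print("PR AUC per fold:", np.round(scores, 3))
```

Note that if resampling is combined with cross-validation, it should be applied inside each training fold (for example via imbalanced-learn's Pipeline) so that synthetic or duplicated samples never leak into the validation split.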
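Finally, the metric choices above can be computed with scikit-learn as sketched below; the fixed 0.5 threshold and the beta values of 0.5, 1, and 2 are illustrative assumptions, and in practice the threshold would be tuned on a validation set.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, fbeta_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=10_000, n_features=20, weights=[0.99, 0.01], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]  # probability of the positive (minority) class

# Probability-based metrics: ROC AUC when both classes matter,
# Precision-Recall AUC (average precision) when the positive class matters most.
print("ROC AUC:", round(roc_auc_score(y_te, proba), 3))
print("PR AUC:", round(average_precision_score(y_te, proba), 3))

# Label-based metrics at a chosen threshold: F-beta trades precision against recall.
threshold = 0.5  # illustrative; tune on a validation set in practice
labels = (proba >= threshold).astype(int)
for beta in (0.5, 1.0, 2.0):  # beta < 1 favours precision, beta > 1 favours recall
    print(f"F{beta}:", round(fbeta_score(y_te, labels, beta=beta), 3))
```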
These are the primary approaches to take care of when handling imbalanced classification, but there are many other details to consider in such use cases.
Have you ever had an imbalanced classification use case?