A comprehensive guide to handling imbalanced datasets. By Francis Adrian Viernes.
One of the mistakes I made as a rookie data scientist was placing too much importance on the accuracy metric. This is not to dismiss accuracy as a measure of machine learning (ML) performance. In some models, we do aim for high accuracy. After all, it is the metric best understood by executives and business leaders.
For the purposes of our discussion, let’s refer to the classifier developed in the introduction as a ‘naive classifier’. A naive classifier (not the same as a Naive Bayes classifier) is so named because it relies on oversimplified assumptions when producing or labeling an output.
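To see why accuracy alone can mislead, here is a minimal sketch of one kind of naive classifier: one that ignores its input and always predicts the majority class. The 990-to-10 class split is made-up toy data, not taken from the article, and the helper functions `accuracy` and `recall` are written here for illustration only.

```python
# Hypothetical toy labels: 990 negatives (0) and 10 positives (1).
y_true = [0] * 990 + [1] * 10

# A naive classifier that always predicts the majority class (0),
# regardless of the input features.
y_pred = [0] * len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that were predicted positive."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fn) if (tp + fn) else 0.0


acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred)
print(f"accuracy = {acc:.2f}")  # accuracy = 0.99
print(f"recall   = {rec:.2f}")  # recall   = 0.00
```

The classifier scores 99% accuracy while finding none of the rare cases, which is usually the class we care about.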
The article then discusses the following:
- Use the proper metrics
- Set a new threshold: how do you choose the right threshold?
- Collect more data
- Dataset augmentation and undersampling
- Rethink the features of the model
- Ensemble methods
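As one illustration of the threshold item in this list, the sketch below sweeps candidate decision thresholds over predicted scores and keeps the one with the highest F1 score. The labels, the scores, and the choice of F1 as the selection criterion are all assumptions made for this example; the article may choose its threshold differently.

```python
def precision_recall_f1(y_true, scores, threshold):
    """Precision, recall, and F1 when predicting 1 for scores >= threshold."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


# Hypothetical labels and model scores for an imbalanced problem
# (2 positives out of 10 samples).
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]
scores = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.60]

# Candidate thresholds from 0.05 to 0.95 in steps of 0.05.
thresholds = [i / 20 for i in range(1, 20)]

# Pick the threshold that maximizes F1 on this data.
best_threshold = max(
    thresholds, key=lambda t: precision_recall_f1(y_true, scores, t)[2]
)
best_f1 = precision_recall_f1(y_true, scores, best_threshold)[2]
default_f1 = precision_recall_f1(y_true, scores, 0.5)[2]

print(f"best threshold = {best_threshold:.2f}, F1 = {best_f1:.2f}")
print(f"default 0.50 threshold, F1 = {default_f1:.2f}")
```

On this toy data the default 0.5 cutoff misses one of the two positives, while a lower cutoff catches both at the cost of one false positive. In practice the sweep should be run on a validation set, not on the data used to report final performance.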
This is a detailed article with example code and charts explaining its main points. Imbalanced datasets are shining examples of a data science paradox: while it is very common for data scientists to deal with imbalanced datasets (perhaps even more common than balanced ones), the purpose is usually to identify the uncommon class. Nice one!