What to expect from a model when there is nothing to learn?

Reading time ~1 minute

An imbalanced binary classification problem

Has it ever happened to you that a model achieved a lower accuracy than constant guessing (always predicting the most common class)? It happened to me recently and I was quite puzzled: 200 data points, two classes, 60% of the sample belonging to class one, the rest to class two.

After running a random forest, I observed an accuracy of 50% on the out-of-bag predictions. This seemed really low: if there were nothing to learn from the data, why did the model not simply predict the most common class and achieve an accuracy of 60%?
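To make that baseline concrete, the accuracy of the constant predictor is simply the frequency of the majority class. A quick check in R, on made-up labels matching my proportions:

# Hypothetical labels: 60% class 1, 40% class 2, 200 points
y <- as.factor(c(rep(1, 120), rep(2, 80)))
max(table(y)) / length(y) # 0.6, the accuracy of always predicting class 1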

Playing with the parameters (increasing min_sample_leaf) improved the performance, but it forced the trees to have a ridiculously low depth.
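For reference, here is a sketch of that kind of tuning in R (the data is made up for illustration): in the randomForest package, the nodesize parameter plays the role of min_sample_leaf, since larger terminal nodes force shallower trees.

library("randomForest")

# Toy stand-in for the real dataset: 200 points, 5 features, 60/40 labels
X <- matrix(rnorm(200 * 5), nrow = 200)
y <- as.factor(rbinom(n = 200, size = 1, prob = 0.6))

# Larger nodesize means bigger leaves and therefore shallower trees
model <- randomForest(x = X, y = y, nodesize = 20)
model$err.rate[nrow(model$err.rate), 1] # out-of-bag error rate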

So I decided to simulate the distribution of the out-of-sample accuracy of my model with the following snippet (in R):

library("randomForest")

N <- 100       # number of observations per simulated dataset
P <- 0.6       # proportion of the majority class
TRIALS <- 10000

# A response drawn at random: by construction, there is nothing to learn
random_response <- as.factor(rbinom(n = N, size = 1, prob = P))

# Fit a random forest on pure noise and return the final OOB error rate
evaluate_error_rate <- function(blob) # blob is just the trial index, ignored
{
  model <-
    randomForest(x = matrix(rnorm(n = N * 5), nrow = N), y = random_response)
  model$err.rate[nrow(model$err.rate), 1]
}

res <- sapply(1:TRIALS, evaluate_error_rate)

hist(res, main = 'Distribution of the out of sample error',
     xlab = 'Out of sample error (percent of mismatches)')
mean(res) # 0.417235, close to the 0.4 error rate of a constant predictor
mean(res > 0.5) # probability of doing worse than a coin flip
mean(res > 0.4) # probability of doing worse than the constant predictor

(Figure: histogram of the simulated out-of-sample error distribution.)

What were the results? The mean of the simulated error settles near the 40% error rate of a constant predictor, which is reassuring. However, the probability that a model makes more mistakes than the constant predictor was not that low.

What is even better is that, given a training procedure, a number of points, a number of variables and an imbalance between classes, this distribution should not change. So one could even imagine building a statistical test, “did my model actually learn something?”, and attach a probability to the rejection of the null hypothesis. I leave this to the reader :)
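For the curious, here is a minimal sketch of what such a test could look like, reusing the null distribution res simulated above; observed_error is a hypothetical OOB error measured on the real data:

# Null hypothesis: the model learned nothing, so its OOB error is just
# another draw from the "pure noise" distribution `res` simulated above
observed_error <- 0.50 # hypothetical OOB error of the real model

# Empirical p-value: fraction of "nothing to learn" runs that do at least
# as well as the real model
p_value <- mean(res <= observed_error)
p_value # reject the null at level 0.05 if this is below 0.05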

Learning more

The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani and Jerome Friedman is a brilliant introduction to the topic and will help you gain a better understanding of most of the algorithms presented in this article!

Applied Predictive Modeling by Max Kuhn and Kjell Johnson is a good introduction to R (used in the code snippets) and machine learning.
