What to expect from a model when there is nothing to learn ?

An imbalanced binary classification problem

Did it ever happen to you to have a model that have a lower accuracy than a constant guessing (the one that predicts the most common class) ? It happened to me recently and I was quitte puzzled: 200 data points, two classes, 60% of the sample belonged to class one, the remaining part, to class two.

After running a random forest, I observed an accuracy of 50% on the out of bag predictions. This seemed really low, if there were nothing to learn from the data, then why did the model did not predict the most common class and achieved an accuracy of 60% ?

Playing with the parameters (increasing min_sample_leaf) improved the performance (but it was forcing the trees to have a ridiculously low depth).

So I decided to simulate the distribution of the out of sample accuracy of my model with the following snippet! (in R)

library("randomForest")

N <- 100
P <- 0.6
TRIALS <- 10000

random_response <- as.factor(rbinom(n = N, size = 1, p = P))

evaluate_error_rate <- function(blob)
{
  model <-
    randomForest(x = matrix(rnorm(n = N * 5), nrow = N), y = random_response)
    model$err.rate[nrow(model$err.rate), 1]
}

res <- sapply(1:TRIALS, evaluate_error_rate)

hist(res, main = 'Distribution of the out of sample error',
      xlab = 'Out of sample error (percent of mistmatches)')
mean(res) # 0.417235 seems good...
sum(res[res>0.5])/length(res) # 0.010754
sum(res[res>0.4])/length(res) # 0.272972

Illustration

What were the results ? The mean of the sample seems to converge to the accuracy of a constant predictor, which is good. However, the probability that a model makes more mistake than a random guess was not that low.

What is event better is that, given a training procedure, a number of points, a number of variables and an imbalance between classes, this distribution should not change. So now, one could even imagine to create a statistical test “did my model actually learned something”, and give a probability to the rejection of the null hypothesis. I leave this to the reader :)

Learning more

The elements of statistical learning by Trevor Hastie, Robert Tibshirani, Jerome Friedman is a brilliant introduction to the topic and will help you have a better understanding of most of the algorithms presented in this article !

Applied Predictive Modeling by Max Kuhn and Kjell Johnson is a good introduction to R (used in the code snippets) and machine learning.

Should I Scale my data?

If you are asking this, then you probably do not understand the algorithm you are using. This is a bad habit to start with, but if you do not want, have the time or the interest, the following table should be a decent starting point.

Some definitions

Centering Centering a variable consists in substracting the mean value to each value, so that the new variable has a sample mean equals to 0.

Reducing Reducing a value consists in dividing every value of the sample by the standard deviation of the sample.

Scaling Here we will call “scaling” the action consisting of centering the data and then reducing it. After the scaling, the sample has a null sample mean and a standard deviation of 1.

Generalities about algorithms regarding the scaling of the data

Supervised learning

Algorithm Scaling
Decision Tree No
Random Forest No
Gradient Boosting No
Linear Regression No
Penalized Linear Regression Yes, probably
SVM (Kernel) Yes, probably
k-Nearest Neighbours Yes, probably
Nearest centroid Yes, probably
Neural Network  Yes, probably

Unsupervised learning

Algorithm Scaling needed
PCA Yes, probably
Random projections Yes, probably
t-SNE Yes, probably

The following tables should be read this way. If scaling is not needed, it means you should not see changes between the results you obtain with or without scaling.

If it says yes, probably, it means that scaling is useful as features should have the same order of magnitude for the algorithm to work properly. However, it does not mean that performance will increase.

Per example, in presence of features lying on a bounded scale (when translating an image to a grayscale image and then feeding it to a neural network, or when turning a text to a TFIDF matrix), scaling is not recommended.

When the scaling is performed before applying the algorithm

Note that some libraries (especially in R) take care of the scaling before applying the algorithm. Though this seems to be a bad idea (the behaviour of the algorithm if a column is constant becomes implementation dependent, per example), this may save you some efforts.

svm(x, y = NULL, scale = TRUE, type = NULL, kernel =
    "radial", degree = 3, gamma = if (is.vector(x)) 1 else 1 / ncol(x),
    coef0 = 0, cost = 1, nu = 0.5,
    class.weights = NULL, cachesize = 40, tolerance = 0.001, epsilon = 0.1,
    shrinking = TRUE, cross = 0, probability = FALSE, fitted = TRUE,
    ..., subset, na.action = na.omit)

This is the svm function as presented in the e1071 R package. Note the default value of scale.

glmnet(x, y, family=c("gaussian","binomial","poisson","multinomial","cox","mgaussian"),
    weights, offset=NULL, alpha = 1, nlambda = 100,
    lambda.min.ratio = ifelse(nobs<nvars,0.01,0.0001), lambda=NULL,
    standardize = TRUE, intercept=TRUE, thresh = 1e-07, dfmax = nvars + 1,
    pmax = min(dfmax * 2+20, nvars), exclude, penalty.factor = rep(1, nvars),
    lower.limits=-Inf, upper.limits=Inf, maxit=100000,
    type.gaussian=ifelse(nvars<500,"covariance","naive"),
    type.logistic=c("Newton","modified.Newton"),
    standardize.response=FALSE, type.multinomial=c("ungrouped","grouped"))

In the glmnet package the argument is now called standardize. Note that here, the response can be standardized as well. This topic will not be covered in this post. Now coming back to the dangers of such an approach, look at the following sample:

N <- 100
P <- 5
X <- matrix(data = rnorm(N*P), nrow = N)
Y <- matrix(rnorm(N), nrow = N)

X[,1] <- X[,1]*0 # some evil action 

require("glmnet")
model <- glmnet(x = X, y = Y)

# Runs without issues

require("e1071")
model2 <- svm(x = X, y = Y)

# Warning message:
#  In svm.default(x = X, y = Y) :
#  Variable(s) ‘X1’ constant. Cannot scale data.

In the case where you run many models on many datasets (or many combination of features) some will scale the data, others will not (if one of the features is constant) and may report bad performances because the scaling was not operated…

Is it always possible to scale the data ?

Theoretical point of view

The assumptions when substracting the mean and dividing by the standard deviation is that they both exist ! Though with finite samples, we can always evaluate sample mean and sample variance, if the variables come from (say a Cauchy distribution) the coefficients for scaling may vary dramatically when enriching the sample with new points.

However, in the case where one is trying to learn anything from distributions that are not integrable, there will be many other issues to deal with.

Practical point of view

With a sparse dataset, scaling is not a good idea : it would force many of the points (the ones that are 0s in the original dataset). But reducing the variables is possible! And it turns out that some algorithms are not affected by the centering (or not) of the data.

A better approach

As we saw, there are actually three types of algorithms : those who do not change with monotonic transformations of the inputs, those who do not change with translations of the input and those who do not fit in the first two categories.

Note that the “monotonic transformation invariance” is the strongest property, as translation is just a monotonic transformation.

So the algorithms would enjoy a better representation in this table:

Supervised learning

Algorithm Translation invariant Monotonic transformation invariant
Decision Tree X X
Random Forest X X
Gradient Boosting X X
Linear Regression X  
Penalized Linear Regression    
SVM (Gaussian kernel) X  
SVM (Other kernels)    
k-Nearest Neighbours X  
Nearest centroid X  
Neural Network    

Unsupervised learning

Algorithm Translation invariant Monotonic transformation invariant
PCA    
Random projections    
t-SNE X  

Learning more

The elements of statistical learning by Trevor Hastie, Robert Tibshirani, Jerome Friedman is a brilliant introduction to the topic and will help you have a better understanding of most of the algorithms presented in this article !