An introduction to model stacking

Reading time ~11 minutes

What is stacking and why use it

Introduction

In supervised learning, stacking is a technique for improving the performance of a predictor. It can be used for both classification and regression problems. If you have taken part in data science competitions, you may already be familiar with it, but resources about this technique remain quite scarce on the internet.

From blending to stacking

As with blending, where a simple average of models with similar performance often yields a model whose performance is higher than that of each individual model, stacking combines models; the difference is that the combination depends on the training set.

Stacking can be thought of as “the sequel” of blending. Imagine you have two models with similar performance on a dataset. The simplest blend consists of averaging the two models. However, you may become curious and try different weights for each model. For example, a blend with a weight of 0.7 on the first model’s predictions and 0.3 on the second may perform better than the default 50/50 weights.

Now let’s suppose you have \(n\) models and are looking for the best weights for these \(n\) models. The problem you are now facing amounts to performing a linear regression, doesn’t it?
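Said differently, if the \(n\) models produce predictions \(\hat{y}^{(1)}, \dots, \hat{y}^{(n)}\), the blend is

\[
\hat{y} = w_1 \hat{y}^{(1)} + w_2 \hat{y}^{(2)} + \dots + w_n \hat{y}^{(n)},
\]

and looking for the weights \(w_1, \dots, w_n\) that minimize the error between \(\hat{y}\) and the true target is nothing but a linear regression whose input features are the predictions of the individual models.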

The only difference is that you cannot use the predictions on the training set directly (some models, like random forests, usually reach near-perfect accuracy there). This is where stacking comes in!

When do you need stacking?

Stacking is at the heart of the winning solutions of many data science competitions.

For example, in the Homesite Quote Conversion competition:

Quick overview for now about the NMA approach:

- 10 variations of the dataset in total (factor combinations, factors mapped to response rates, replacing correlated pairs by differences etc)
- lots of models (xgboost, keras, ranger, logreg, even occasional svm - although that took forever) trained on various datasets and different params; stored as lvl1 metafeatures
- mix lvl1 metafeatures with: xgboost, nnet, hillclimbing, glmnet and ranger, stack - 5 lvl2 metafeatures
- mix the lvl2 metafeatures with hillclimbing
- bag at each stage as much as time permitted

I will come back to the notion of lvl1 / lvl2… stages later.

Or, in the BNP Paribas Cardif Claims Management competition:

We also produced many different base level models without much Feature engineering, just different input format types (like load all categorical variables as counts, or as onehot encoding etc).

Our ensemble was consisted of 223 models. Faron did a lot of work in removing noise and discarding many of these in order to get to our bets score with a lvl2 ensemble of geomean weights between an ET , 2NN and 2 Xgmodels.

And there are plenty of other examples. Basically, stacking comes in when the accuracy of your classifier or regressor is the essence of your problem. The price to pay is that the resulting model is much harder to interpret, and harder to implement and deploy than a simple machine learning pipeline.

Principles of stacking

The idea, in order to give a correct weight to each model, is to perform a cross-validation on the training set and return a dataset in which each element corresponds to a prediction made on a fold the model did not see during training. On this new dataset, you can then fit the next model.

For example, in the case of a regression problem, if you have \(n\) rows in your dataset, \(p\) features and \(k\) models, this step turns your training data from an \(n \times p\) matrix into an \(n \times k\) matrix.

In the case of a multi-class problem, if you have \(n\) rows in your dataset, \(p\) features, \(m\) classes and \(k\) models, this step turns your training data from an \(n \times p\) matrix into an \(n \times (k \cdot m)\) matrix.
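As a minimal sketch of this step, here is how such a matrix could be built with scikit-learn's cross_val_predict (the toy digits dataset and the base models are illustrative stand-ins; the ModelStager class presented below does the equivalent by hand):

import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# toy data standing in for the training set: n rows, p features, m = 10 classes
X, y = load_digits(return_X_y=True)

# k = 2 base models (illustrative choices)
base_models = [RandomForestClassifier(), ExtraTreesClassifier()]

# each call returns an (n, m) matrix of out-of-fold probabilities:
# every row is predicted by a model that never saw it during training
oof_blocks = [cross_val_predict(model, X, y, cv=5, method="predict_proba")
              for model in base_models]

# the (n, p) training data becomes an (n, k * m) matrix of stage 1 features
stage1_features = np.hstack(oof_blocks)

# the next model (here a logistic regression) is fitted on these new features
meta_model = LogisticRegression(max_iter=1000).fit(stage1_features, y)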

An example

The dataset

I will use the MNIST dataset in CSV format, which can be found on Kaggle, together with a log loss penalty. Of course, you will be able to play with other metrics / datasets using the code below!

The stacking / CV class

The class below can be used for a multi-class learning problem. Some minor adaptations may be required, for example for regression problems.

from sklearn.model_selection import KFold
import datetime
import pandas as pd
import numpy as np
from time import time

class bcolors:
    HEADER = '\033[95m'
    OKBLUE = '\033[94m'
    OKGREEN = '\033[92m'
    WARNING = '\033[93m'
    FAIL = '\033[91m'
    ENDC = '\033[0m'
    BOLD = '\033[1m'
    UNDERLINE = '\033[4m'


class ModelStager:

    def __init__(self, penalty, n_folds,
                 verbose=1, shuffle=True, random_state=1):
        self._penalty = penalty
        self._n_folds = n_folds
        self._verbose = verbose
        self._random_state = random_state
        self._shuffle = shuffle

    def _print(self, input_str):
        time = datetime.datetime.now().strftime("%Y-%m-%d %H:%M")
        print(bcolors.HEADER + "[ModelStager | " + time + "] " + bcolors.ENDC + str(input_str)) 

    def fit(self, X, y, model):
        kfold = KFold(n_splits=self._n_folds, shuffle=self._shuffle,
                      random_state=self._random_state)

        cv_scores = []
        # out-of-fold predictions: same index as X, one column per class
        oof_predictions = pd.DataFrame(index=X.index, columns=range(y.nunique()))

        fold_idx = 0

        for tr_idx, val_idx in kfold.split(X):

            X_tr = X.iloc[tr_idx]
            X_val = X.iloc[val_idx]

            y_tr = y.iloc[tr_idx]
            y_val = y.iloc[val_idx]

            if self._verbose:
                self._print("Data_tr shape : " + str(X_tr.shape))

            fold_idx = fold_idx + 1
            t = time()

            # fit on the training folds, predict probabilities on the held-out fold
            model.fit(X_tr, y_tr)

            validation_prediction = model.predict_proba(X_val)

            # store the predictions at the positions of the held-out rows
            oof_predictions.iloc[val_idx] = validation_prediction

            # score the held-out fold with the chosen penalty (e.g. log loss)
            cv_score_model = self._penalty(y_val, validation_prediction)
            cv_scores.append(cv_score_model)

            if self._verbose:
                self._print("Fold %.0f : TEST %.5f | TIME %.2fm (1-fold)" %
                            (fold_idx, cv_score_model, (time() - t) / 60))

        self._print("TEST AVERAGE : %.5f" % (np.mean(cv_scores)))

        return oof_predictions

As you can see, ModelStager also performs the cross-validation. All the magic happens in oof_predictions, which keeps track of the out-of-fold predictions and returns them. As mentioned earlier, it shares its index with X, and it has one column per class.

The bcolors class and the custom printing function are just things I am used to working with; no need to bother about them.

Example

Random Forest and Extra trees

If you append the following at the bottom of the previous file, you can re-run the whole pipeline.

Two models are proposed, along with their ensemble (using a logistic regression).

if __name__ == "__main__":
    
    from sklearn.metrics import log_loss
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
    from xgboost import XGBClassifier

    train_data = pd.read_csv("./mnist_train.csv", nrows=5000)
    X = train_data.drop(["label"], axis=1)
    y = train_data["label"]
    
    stager = ModelStager(log_loss, 5)

    print("RF model")
    model_rf = RandomForestClassifier()
    stage1_rf = stager.fit(X, y, model_rf)

    print("ET model")
    model_et = ExtraTreesClassifier()
    stage1_et = stager.fit(X, y, model_et)

    print("Stage 1 : (RF, ET) -> logistic model")
    stage1_rf_et = pd.concat([stage1_rf, stage1_et], axis=1)
    stager.fit(stage1_rf_et, y, LogisticRegression())

Results

RF model
[ModelStager | 2021-01-07 13:54] Data_tr shape : (4000, 784)
[ModelStager | 2021-01-07 13:54] Fold 1 : TEST 0.48133 | TIME 0.05m (1-fold)
[ModelStager | 2021-01-07 13:54] Data_tr shape : (4000, 784)
[ModelStager | 2021-01-07 13:55] Fold 2 : TEST 0.44262 | TIME 0.05m (1-fold)
[ModelStager | 2021-01-07 13:55] Data_tr shape : (4000, 784)
[ModelStager | 2021-01-07 13:55] Fold 3 : TEST 0.46714 | TIME 0.05m (1-fold)
[ModelStager | 2021-01-07 13:55] Data_tr shape : (4000, 784)
[ModelStager | 2021-01-07 13:55] Fold 4 : TEST 0.45846 | TIME 0.05m (1-fold)
[ModelStager | 2021-01-07 13:55] Data_tr shape : (4000, 784)
[ModelStager | 2021-01-07 13:55] Fold 5 : TEST 0.45377 | TIME 0.05m (1-fold)
[ModelStager | 2021-01-07 13:55] TEST AVERAGE : 0.46066
ET model
[ModelStager | 2021-01-07 13:55] Data_tr shape : (4000, 784)
[ModelStager | 2021-01-07 13:55] Fold 1 : TEST 0.44834 | TIME 0.04m (1-fold)
[ModelStager | 2021-01-07 13:55] Data_tr shape : (4000, 784)
[ModelStager | 2021-01-07 13:55] Fold 2 : TEST 0.44679 | TIME 0.04m (1-fold)
[ModelStager | 2021-01-07 13:55] Data_tr shape : (4000, 784)
[ModelStager | 2021-01-07 13:55] Fold 3 : TEST 0.43367 | TIME 0.04m (1-fold)
[ModelStager | 2021-01-07 13:55] Data_tr shape : (4000, 784)
[ModelStager | 2021-01-07 13:55] Fold 4 : TEST 0.43551 | TIME 0.04m (1-fold)
[ModelStager | 2021-01-07 13:55] Data_tr shape : (4000, 784)
[ModelStager | 2021-01-07 13:55] Fold 5 : TEST 0.42378 | TIME 0.04m (1-fold)
[ModelStager | 2021-01-07 13:55] TEST AVERAGE : 0.43762
Stage 1 : (RF, ET) -> logistic model
[ModelStager | 2021-01-07 13:55] Data_tr shape : (4000, 20)
[ModelStager | 2021-01-07 13:55] Fold 1 : TEST 0.20850 | TIME 0.01m (1-fold)
[ModelStager | 2021-01-07 13:55] Data_tr shape : (4000, 20)
[ModelStager | 2021-01-07 13:55] Fold 2 : TEST 0.16870 | TIME 0.00m (1-fold)
[ModelStager | 2021-01-07 13:55] Data_tr shape : (4000, 20)
[ModelStager | 2021-01-07 13:55] Fold 3 : TEST 0.21278 | TIME 0.00m (1-fold)
[ModelStager | 2021-01-07 13:55] Data_tr shape : (4000, 20)
[ModelStager | 2021-01-07 13:55] Fold 4 : TEST 0.20536 | TIME 0.00m (1-fold)
[ModelStager | 2021-01-07 13:55] Data_tr shape : (4000, 20)
[ModelStager | 2021-01-07 13:55] Fold 5 : TEST 0.18016 | TIME 0.00m (1-fold)
[ModelStager | 2021-01-07 13:55] TEST AVERAGE : 0.19510

This is quite a huge boost ;) To be honest, this part is a little bit of an artifact: ensembles of decision trees are usually quite bad at predicting probabilities, so their log loss is artificially high, and the logistic regression on top corrects this phenomenon.
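As a side check (this is not part of the experiments above), one way to verify this calibration story is to wrap the random forest in scikit-learn's CalibratedClassifierCV and run it through the same stager, assuming the snippet is appended to the main script; a calibrated forest alone should already recover part of the gap:

from sklearn.calibration import CalibratedClassifierCV

# calibrate the random forest's probabilities (isotonic regression on held-out data),
# then evaluate it with the same cross-validation stager as before
print("Calibrated RF model")
model_rf_cal = CalibratedClassifierCV(RandomForestClassifier(), method="isotonic", cv=3)
stage1_rf_cal = stager.fit(X, y, model_rf_cal)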

Random Forest, Extra Trees and Gradient Boosting

As stated above, the main performance gain comes from using an algorithm that is better at optimizing logloss.

Gradient boosting methods (most notably, xgboost) are good at predicting probabilities. This is illustrated when we perform the cross-validation of a gradient boosting model on the original dataset.

print("XGB model")
model_xgb = XGBClassifier(use_label_encoder=False)
stage_1_xgb = stager.fit(X, y, model_xgb)

This yields the following results:

XGB model
[ModelStager | 2021-01-07 13:55] Data_tr shape : (4000, 784)
[ModelStager | 2021-01-07 13:56] Fold 1 : TEST 0.21077 | TIME 1.05m (1-fold)
[ModelStager | 2021-01-07 13:56] Data_tr shape : (4000, 784)
[ModelStager | 2021-01-07 13:57] Fold 2 : TEST 0.16564 | TIME 1.15m (1-fold)
[ModelStager | 2021-01-07 13:57] Data_tr shape : (4000, 784)
[ModelStager | 2021-01-07 13:58] Fold 3 : TEST 0.25023 | TIME 1.09m (1-fold)
[ModelStager | 2021-01-07 13:58] Data_tr shape : (4000, 784)
[ModelStager | 2021-01-07 13:59] Fold 4 : TEST 0.24772 | TIME 1.03m (1-fold)
[ModelStager | 2021-01-07 13:59] Data_tr shape : (4000, 784)
[ModelStager | 2021-01-07 14:00] Fold 5 : TEST 0.18703 | TIME 1.12m (1-fold)
[ModelStager | 2021-01-07 14:00] TEST AVERAGE : 0.21228

Though much better than the RandomForestClassifier or the ExtraTreesClassifier alone, it does not beat the stacked model.

Now let’s add the xgb features to the stage 1:

print("Stage 1 : (FR, ET, XGB) -> logistic model")
stage1_rf_et_xgb = pd.concat([stage1_rf, stage1_et, stage_1_xgb], axis=1)
stager.fit(stage1_rf_et_xgb, y, LogisticRegression())

And once again, the performance increases.

Stage 1 : (RF, ET, XGB) -> logistic model
[ModelStager | 2021-01-07 14:00] Data_tr shape : (4000, 30)
[ModelStager | 2021-01-07 14:00] Fold 1 : TEST 0.19343 | TIME 0.01m (1-fold)
[ModelStager | 2021-01-07 14:00] Data_tr shape : (4000, 30)
[ModelStager | 2021-01-07 14:00] Fold 2 : TEST 0.15602 | TIME 0.01m (1-fold)
[ModelStager | 2021-01-07 14:00] Data_tr shape : (4000, 30)
[ModelStager | 2021-01-07 14:00] Fold 3 : TEST 0.20996 | TIME 0.01m (1-fold)
[ModelStager | 2021-01-07 14:00] Data_tr shape : (4000, 30)
[ModelStager | 2021-01-07 14:00] Fold 4 : TEST 0.20830 | TIME 0.01m (1-fold)
[ModelStager | 2021-01-07 14:00] Data_tr shape : (4000, 30)
[ModelStager | 2021-01-07 14:00] Fold 5 : TEST 0.17053 | TIME 0.01m (1-fold)
[ModelStager | 2021-01-07 14:00] TEST AVERAGE : 0.18765

And if we include the gradient boosting predictions in the stage 1 features, the log loss drops from 0.19510 to 0.18765.

More stage 1 features

I only presented 3 models in the stage 1. I could have added plenty of others, such as nearest neighbors, linear models… I strongly recommend playing with the code below and trying to add these models; I am pretty sure that much better scores can be obtained ;)

Or I could also have performed some feature engineering for some models and not for others. As you can see, the number of combinations becomes really huge. The “secret” to good performance after stacking is to have models that are as different from each other as possible (surprisingly, the individual performance of each model is not that important).
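For instance, adding a nearest-neighbour model to the stage 1, as suggested above, could look like the following sketch (assuming it is appended to the main script; the model and its n_neighbors value are illustrative, not tuned):

from sklearn.neighbors import KNeighborsClassifier

# a very different kind of stage 1 model: nearest neighbours
print("KNN model")
stage1_knn = stager.fit(X, y, KNeighborsClassifier(n_neighbors=10))

# add its out-of-fold probabilities to the existing stage 1 features
print("Stage 1 : (RF, ET, XGB, KNN) -> logistic model")
stage1_all = pd.concat([stage1_rf, stage1_et, stage_1_xgb, stage1_knn], axis=1)
stager.fit(stage1_all, y, LogisticRegression())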

Beyond the linear model

I mostly referred to stage 1 as a weighting operation, but it does not have to be a linear model. You can also use other models on top of your stage 1 features! For example, another gradient boosting model, or a neural network. You can even repeat the process to produce stage 2 features and train yet another model on this stage 2.
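As a sketch of what this could look like (assuming it is appended to the main script; the neural network and its parameters are illustrative choices, not from the experiments above):

from sklearn.neural_network import MLPClassifier

# stage 1 features built earlier, cast to float for the meta-models
stage1_features = stage1_rf_et_xgb.astype(float)

# two different meta-models on top of the stage 1 features
print("Stage 1 : (RF, ET, XGB) -> neural net meta-model")
stage2_nn = stager.fit(stage1_features, y, MLPClassifier(max_iter=500))

print("Stage 1 : (RF, ET, XGB) -> logistic meta-model")
stage2_lr = stager.fit(stage1_features, y, LogisticRegression())

# their out-of-fold predictions become stage 2 features for yet another model
print("Stage 2 : (NN, logistic) -> logistic model")
stage2_features = pd.concat([stage2_nn, stage2_lr], axis=1).astype(float)
stager.fit(stage2_features, y, LogisticRegression())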

The code

from sklearn.model_selection import KFold
import datetime
import pandas as pd
import numpy as np
from time import time

class bcolors:
    HEADER = '\033[95m'
    OKBLUE = '\033[94m'
    OKGREEN = '\033[92m'
    WARNING = '\033[93m'
    FAIL = '\033[91m'
    ENDC = '\033[0m'
    BOLD = '\033[1m'
    UNDERLINE = '\033[4m'


class ModelStager:

    def __init__(self, penalty, n_folds,
                 verbose=1, shuffle=True, random_state=1):
        self._penalty = penalty
        self._n_folds = n_folds
        self._verbose = verbose
        self._random_state = random_state
        self._shuffle = shuffle

    def _print(self, input_str):
        time = datetime.datetime.now().strftime("%Y-%m-%d %H:%M")
        print(bcolors.HEADER + "[ModelStager | " + time + "] " + bcolors.ENDC + str(input_str)) 

    def fit(self, X, y, model):
        kfold = KFold(n_splits=self._n_folds, shuffle=self._shuffle,
                  random_state=self._random_state)

        cv_scores = []
        oof_predictions = pd.DataFrame(index=X.index, columns=range(y.nunique()))

        fold_idx = 0

        for tr_idx, val_idx in kfold.split(X):

            X_tr = X.iloc[tr_idx]
            X_val = X.iloc[val_idx]

            y_tr = y.iloc[tr_idx]
            y_val = y.iloc[val_idx]

            if self._verbose:
                self._print("Data_tr shape : " + str(X_tr.shape))

            fold_idx = fold_idx + 1
            t = time()

            model.fit(X_tr, y_tr)

            validation_prediction = model.predict_proba(X_val)

            oof_predictions.iloc[val_idx] = validation_prediction

            cv_score_model = self._penalty(y_val, validation_prediction)
            cv_scores.append(cv_score_model)

            if self._verbose:
                self._print("Fold %.0f : TEST %.5f | TIME %.2fm (1-fold)" %
                            (fold_idx, cv_score_model, (time() - t) / 60))

        self._print("TEST AVERAGE : %.5f" % (np.mean(cv_scores)))

        return oof_predictions


if __name__ == "__main__":
    
    from sklearn.metrics import log_loss
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
    from xgboost import XGBClassifier

    train_data = pd.read_csv("./mnist_train.csv", nrows=5000)
    X = train_data.drop(["label"], axis=1)
    y = train_data["label"]
    
    stager = ModelStager(log_loss, 5)

    print("RF model")
    model_rf = RandomForestClassifier()
    stage1_rf = stager.fit(X, y, model_rf)

    print("ET model")
    model_et = ExtraTreesClassifier()
    stage1_et = stager.fit(X, y, model_et)

    print("Stage 1 : (RF, ET) -> logistic model")
    stage1_rf_et = pd.concat([stage1_rf, stage1_et], axis=1)
    stager.fit(stage1_rf_et, y, LogisticRegression())

    print("XGB model")
    model_xgb = XGBClassifier(use_label_encoder=False)
    stage_1_xgb = stager.fit(X, y, model_xgb)

    print("Stage 1 : (FR, ET, XGB) -> logistic model")
    stage1_rf_et_xgb = pd.concat([stage1_rf, stage1_et, stage_1_xgb], axis=1)
    stager.fit(stage1_rf_et_xgb, y, LogisticRegression())
