# Random forest vs extra trees

## A little bit of context

Quite some time ago, I asked a question on stats.stackexchange about differences between random forests and extremly random forests. Though the answers were good, I was still lacking some informations. Beyond the nice theoretical arguments, I run some simulations to have a better idea of their behavior.

Though these are, by no means, definite conclusions about their respective behaviors, those simulations performed on toy datasets, from specific implementations, I hope they will provide more insights to the reader!

## Summary of the simulations

Extra trees seem much faster (about three times) than the random forest method (at, least, in scikit-learn implementation). This is consistent with the theoretical construction of the two learners.

On toy datasets, the following conclusions could be reached :

• Extra trees seem to keep a higher performance in presence of noisy features,
• When all the variables are relevant, both methods seem to achieve the same performance.

## Theoretical point of view

A decision tree is usually trained by recursively splitting the data. Each split is chosen according to an information criterion which is maximized (or minimized) by one of the “splitters”. These splitter usually take the form of `x[i]>t` were `t` is a threshold and `x[i]` indicates the i-th component of an observation.

Decision trees, being prone to overfit, have been transformed to random forests by training many trees over various subsamples of the data (in terms of both observations and predictors used to train them).

The main difference between random forests and extra trees (usually called extreme random forests) lies in the fact that, instead of computing the locally optimal feature/split combination (for the random forest), for each feature under consideration, a random value is selected for the split (for the extra trees).

This leads to more diversified trees and less splitters to evaluate when training an extremly random forest.

The point here is not to be exhaustive, there are more references and the main articles at the bottom of the article.

## Timing

First, let’s observe the timing for each model. I will use the methods implemented in scikit-learn and I will reuse a code seen in another post (the code with new minor updates is at the bottom of the post). Without a detailed analysis, it seems that Extra Trees are three times faster than the random forest method.

N P Time RF Time ET
500 5 0.093678 0.0679979
500 10 0.117224 0.0763619
500 20 0.176268 0.097132
500 50 0.358183 0.157907
500 100 0.619386 0.256645
500 200 1.14059 0.45396
1000 5 0.139871 0.0846941
1000 10 0.211061 0.106443
1000 20 0.385125 0.151639
1000 50 0.805403 0.277682
1000 100 1.39056 0.501522
1000 200 2.76709 0.926728
2000 5 0.269487 0.122763
2000 10 0.474972 0.171372
2000 20 0.771758 0.264499
2000 50 1.81821 0.539937
2000 100 3.47868 1.03636
2000 200 6.95018 2.1839
5000 5 0.86582 0.246231
5000 10 1.2243 0.373797
5000 20 2.20815 0.624288
5000 50 5.13883 1.41648
5000 100 9.79915 3.25462
5000 200 21.5956 6.86543

Note the use of tabulate which allows a pretty print for python dataframes ;)

``````import numpy as np
import ComplexityEvaluator
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, AdaBoostRegressor
from tabulate import tabulate

def random_data_regression(n, p):
return np.random.rand(n, p), np.random.rand(n)

regression_models = [RandomForestRegressor(),
ExtraTreesRegressor()]

names = ["RandomForestRegressor",
"ExtraTreesRegressor"]

complexity_evaluator = ComplexityEvaluator.ComplexityEvaluator(
[500, 1000, 2000, 5000],
[5, 10, 20, 50, 100, 200])

i = 0
for model in regression_models:
res, data = complexity_evaluator.Run(model, random_data_regression)
print(names[i])
i = i + 1
``````

## Irrelevant variables

Now, let’s try to see how they compare when we add irrelevant variables. First, we need to propose a data set where there is something to learn (as opposed as what was previously done).

Let’s fix a linear dependency between the target variable and the first three variables, along the lines of:

``````def linear_data_generation(n, p):
X = np.random.rand(n, p)
beta = np.zeros([p, 1])
beta[0,0] = 1
beta[1,0] = 2
beta[2,0] = -1
Y = np.ravel(np.dot(X, beta))
return X, Y
`````` Fig. 1: Comparison of random forests and extra trees in presence of irrelevant predictors

In blue are presented the results from the random forest and red for the extra trees.

The results are quite striking: Extra Trees perform consistently better when there are a few relevant predictors and many noisy ones.

``````import numpy as np
import AccuracyEvaluator
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.metrics import mean_squared_error
from tabulate import tabulate
import matplotlib.pyplot as plt

def linear_data_generation(n, p):
X = np.random.rand(n, p)
beta = np.zeros([p, 1])
beta[0,0] = 1
beta[1,0] = 2
beta[2,0] = -1
Y = np.ravel(np.dot(X, beta))
return X, Y

n_columns = [5, 10, 20, 30, 50, 80, 100]

accuracy_evaluator = AccuracyEvaluator.AccuracyEvaluator(
,
n_columns,
mean_squared_error,
10)

data_rf = accuracy_evaluator.Run(RandomForestRegressor(), linear_data_generation)
data_et = accuracy_evaluator.Run(ExtraTreesRegressor(), linear_data_generation)

plt.errorbar(n_columns, data_rf["Metric"], yerr= data_rf["Std"], fmt="o", ecolor = "blue", color="blue")
plt.errorbar(n_columns, data_et["Metric"], yerr= data_et["Std"], fmt="o", ecolor = "red", color="red")
plt.xlabel('Number of columns')
plt.ylabel('Mean Squared Error')
plt.show()
``````

## Many relevant variables

Starting from the code above and changing the decision function to be a sum of each predictor:

``````def linear_data_generation(n, p):
X = np.random.rand(n, p)
beta = np.ones([p, 1])
Y = np.ravel(np.dot(X, beta))
return X, Y
`````` Fig. 2: Comparison of random forests and extra trees in presence of many relevant predictors

In blue are presented the results from the random forest and red for the extra trees.

It seems that both methods perform equally in presence of many relevant features.

## Code

``````import numpy as np
import pandas as pd
import time
from sklearn.linear_model import LinearRegression
import math

class ComplexityEvaluator:

def __init__(self, nrow_samples, ncol_samples):
self._nrow_samples = nrow_samples
self._ncol_samples = ncol_samples

def _time_samples(self, model, random_data_generator):
rows_list = []
for nrow in self._nrow_samples:
for ncol in self._ncol_samples:
train, labels = random_data_generator(nrow, ncol)

start_time = time.time()
model.fit(train, labels)
elapsed_time = time.time() - start_time

result = {"N": nrow, "P": ncol, "Time": elapsed_time}
rows_list.append(result)

return rows_list

def Run(self, model, random_data_generator):
orig_data = pd.DataFrame(self._time_samples(model, random_data_generator))
data = orig_data.applymap(math.log)
linear_model = LinearRegression(fit_intercept=True)
linear_model.fit(data[["N", "P"]], data[["Time"]])
return linear_model.coef_, orig_data

if __name__ == "__main__":
class TestModel:

def __init__(self):
pass

def fit(self, x, y):
time.sleep(x.shape / 1000.)

def random_data_generator(n, p):
return np.random.rand(n, p), np.random.rand(n, 1)

model = TestModel()

complexity_evaluator = ComplexityEvaluator(
[200, 500, 1000, 2000, 3000], [1, 5, 10])

res = complexity_evaluator.Run(model, random_data_generator)

print(res)

``````

And then :

``````import numpy as np
import pandas as pd
import time
from sklearn.linear_model import LinearRegression
import math
from Stacker import Stacker

class AccuracyEvaluator:

def __init__(self, nrow_samples, ncol_samples, penalty, n_folds=5):
self._nrow_samples = nrow_samples
self._ncol_samples = ncol_samples
self._stacker = Stacker(penalty, n_folds, False)

def _accuracy_samples(self, model, random_data_generator):

def predict(model, X):
return model.predict(X)

rows_list = []
for nrow in self._nrow_samples:
for ncol in self._ncol_samples:
train, labels = random_data_generator(nrow, ncol)

mean_perf, std_perf,  _ = self._stacker.RunSparse(train,
labels, model, predict)

result = {"N": nrow, "P": ncol,
"Metric": mean_perf, "Std": std_perf}
rows_list.append(result)

return rows_list

def Run(self, model, random_data_generator):
orig_data = pd.DataFrame(
self._accuracy_samples(model, random_data_generator))
return orig_data
``````

I will come back to stacking in another post some day, which I also use for cross-validation (stacking can be obtained almost for free during a cross validation). Plus the naming of some functions is quite unfortunate (`RunSparse()` should probably be called `RunNumpyArray()` or something like this, likewise, the `Run()` should probably be `RunPandas()`…). The code is here for the sake of completeness.

``````import pandas as pd
import numpy as np

from time import time
from sklearn import cross_validation

class Stacker:

def __init__(self, penalty, n_folds, verbose=True, random_state=1):
self._penalty = penalty
self._n_folds = n_folds
self._verbose = verbose
self._random_state = random_state

def Run(self, X, y, model, predict_method):
kf = cross_validation.KFold(
y.shape, n_folds=self._n_folds, shuffle=True, random_state=self._random_state)
trscores, cvscores, times = [], [], []
i = 0
stack_train = np.zeros(len(y))  # stacked predictions
for i, (train_fold, validation_fold) in enumerate(kf):
i = i + 1
t = time()
model.fit(X.iloc[train_fold], y.iloc[train_fold])

tr_pred = predict_method(model, X.iloc[train_fold])

trscore = self._penalty(y.iloc[train_fold], tr_pred)

validation_prediction = predict_method(
model, X.iloc[validation_fold])
cvscore = self._penalty(
y.iloc[validation_fold], validation_prediction)

trscores.append(trscore)
cvscores.append(cvscore)
times.append(time() - t)

stack_train[validation_fold] = validation_prediction

if self._verbose:
print("TRAIN %.5f | TEST %.5f | TEST-STD %5f | TIME %.2fm (1-fold)" %
(np.mean(trscores), np.mean(cvscores), np.std(cvscores), np.mean(times) / 60))
print(model.get_params(deep=True))
print("\n")

return np.mean(cvscores), stack_train

def RunSparse(self, X, y, model, predict_method):
kf = cross_validation.KFold(
y.shape, n_folds=self._n_folds, shuffle=True, random_state=self._random_state)
trscores, cvscores, times = [], [], []
i = 0
stack_train = np.zeros(len(y))  # stacked predictions
for i, (train_fold, validation_fold) in enumerate(kf):
i = i + 1
t = time()
model.fit(X[train_fold], y[train_fold])

tr_pred = predict_method(model, X[train_fold])

trscore = self._penalty(y[train_fold], tr_pred)

validation_prediction = predict_method(
model, X[validation_fold])
cvscore = self._penalty(
y[validation_fold], validation_prediction)

trscores.append(trscore)
cvscores.append(cvscore)
times.append(time() - t)

stack_train[validation_fold] = validation_prediction

if self._verbose:
print("TRAIN %.5f | TEST %.5f | TEST-STD %5f | TIME %.2fm (1-fold)" %
(np.mean(trscores), np.mean(cvscores), np.std(cvscores), np.mean(times) / 60))
print(model.get_params(deep=True))
print("\n")

return np.mean(cvscores), np.std(cvscores),stack_train
``````

## Learning more and references

The elements of statistical learning by Trevor Hastie, Robert Tibshirani, Jerome Friedman is a brilliant introduction to the topic and will help you have a better understanding of most of the algorithms presented in this article !

Classification and Regression Trees by Leo Breiman, Jerome Friedman, Charles J. Stone, R.A. Olshen

Breiman L (2001). “Random Forests”. Machine Learning. 45 (1): 5–32.

Geurts P, Ernst D, Wehenkel L (2006). “Extremely randomized trees”. Machine Learning. 63: 3–42

Python data science handbook

### Keras memory leak

## Keras memory usage keeps increasingI was having fun, attempting to do some deep learning with a 2M lines dataset (nothing my computer ...… Continue reading

#### Finding the index of the largest element in a list in OCaml

Published on August 29, 2021

#### Python fast screenshots and locateOnScreen

Published on June 06, 2021