Random Greedy Forest tutorial

Reading time ~5 minutes


A not so famous algorithm

Regularized Greedy Forest is a quite recent algorithm in machine learning, the article “Learning Nonlinear Functions Using Regularized Greedy Forest” has been published in 2014.

If we want to compare it to gradient boosting, which seems to have been studied back in 1999, it took a while before this algorithm received its many reliable and fast implementations (xgboost, catboost, LightGBM).

It seems to be a very good candidate in terms of performance. However, as we will see in the parameters section, it requires some tuning (regularization and number of leaves can be application critical).


As stated by the authors, Regularized Greedy Forest can achieve better performance than gradient boosting approaches:

In contrast to these traditional boosting algorithms that treat a tree learner as a black box, the method we propose directly learns decision forests via fully-corrective regularized greedy search using the underlying forest structure. Our method achieves higher accuracy and smaller models than gradient boosting on many of the datasets we have tested on.

And if you are skeptical about benchmarks proposed by the authors, check out the following posts, related to data science competitions:

On top of that, I noted that on some dataset I have comparable performance between xgboost and RGF after a careful tuning of the parameters for both the models.

How does it work ?

The algorithm

As for gradient boosting and random forests, the idea is to train a collection of decision trees. However, the key difference is that you are allowed to modify previously trained trees and weights attributed to each tree if that improves the performance of the overall model.

To make things more clear:

  • In the case of random forests, all the trees are trained simultaneously and regardless of each other performance.
  • In the case of gradient boosting, a new tree is trained on the residuals of the previous trees.
  • In the case of random greedy forest, things are more complicated :) At each step we may either start a new tree, or split an existing leaf node. Then, the weights of each leaf are adjusted, to optimize the loss function.

To put it in equations, when we try to learn some objective function \(F\) what we do is to solve this type of optimization program.

\[\hat{F} = \underset{F}{\arg\min} \, \mathbb{E}_{x,y}[L(y, F(x))]\]

The space of functions where \(F\) lives being quite large, we usually rely on heuristics to make the above problem solvable in an acceptable amount of time.

Per example, in the case of gradient boosting, we solve a series of training of decision trees, with the following induction:

\[F_0(x) = \underset{\gamma}{\arg\min} {\sum_{i=1}^n {L(y_i, \gamma)}}\]

\(F_m(x) = F_{m-1}(x) + \underset{h_m \in \mathcal{H}}{\operatorname{arg\,min}} \left[{\sum_{i=1}^n {L(y_i, F_{m-1}(x_i) + h_m(x_i))}}\right]\),

So that each step is as simple as training a decision tree, and the training time is, roughly, the number of trees times the training time of a tree.

In the case of the Regularized Greedy Forest, the procedure is as follows:

  • Fix the weights, and change the structure of the forest (which changes basis functions) so that the loss $Q(F)$ is reduced the most.
  • Fix the structure of the forest, and change the weights so that lossQ(F) is minimized

Cross validating a random greedy forest

The cross validation is usual, here are the roles of the different parameters.

Parameters description

The parameters are sorted by order of importance in terms of their impact on the accuracy of the model.

max_leaf : the total number of leaves in the forest. Note that given the training procedure, we never have to specify the total number of trees needed in the forest. Beware, the larger this parameter, the longer the training. By default, the value is 1000 for RGFClassifier and 500 RGFRegressor.

l2 : the penalty. This parameter has to be tuned to obtain a good performance. By default, the value is 0.1 but smaller values usually improve performance.

n_tree_search : (1 by default) Number of trees to be searched for the nodes to split.

algorithm one of (“RGF”, “RGF_Opt”, “RGF_Sib”), where the algorithm are the following:

  • RGF: RGF with L2 regularization on leaf-only models.
  • RGF Opt: RGF with min-penalty regularization.
  • RGF Sib: RGF with min-penalty regularization with the sum-to-zero sibling constraints.

By default, the algorithm is “RGF” for both RGFClassifier() and RGFRegressor().

reg_depth : Must be no smaller than 1.0. Meant for being used with algorithm="RGF_Opt"|"RGF_Sib". A larger value penalizes deeper nodes more severely.

loss one of ("LS", "Expo", "Log", "Abs"), by default this is LS for regressions and Log for classification.

  • LS: Square loss,
  • Expo: Exponential loss,
  • Log: Logistic loss,
  • Abs: Absolute error loss.

n_iter : Number of iterations of coordinate descent to optimize weights. If None, 10 is used for loss=”LS” and 5 for loss=”Expo” or “Log”. Not critical to improve accuracy.

Classification only

calc_prob : One of (“sigmoid”, “softmax”), by default “sigmoid”. I guess it will not affect accuracy or roc_auc, but may affect logloss.



Fortunately, the implementation comes with a scikit learn interface, therefore you can use the usual .fit() and .transform() methods, so there is not much to say about how to use it. Per example, the following will cross validate a random greedy forest.

from sklearn import datasets
from sklearn.utils.validation import check_random_state
from sklearn.model_selection import StratifiedKFold, cross_val_score
from rgf.sklearn import RGFClassifier

iris = datasets.load_iris()
rng = check_random_state(0)
perm = rng.permutation(iris.target.size)
iris.data = iris.data[perm]
iris.target = iris.target[perm]

rgf = RGFClassifier(max_leaf=400,

n_folds = 3

rgf_scores = cross_val_score(rgf,

rgf_score = sum(rgf_scores)/n_folds
print('RGF Classifier score: {0:.5f}'.format(rgf_score))


Learning Nonlinear Functions Using Regularized Greedy Forest is the main article on the topic.

The RGF implementation for the basic implementation, and for the sparse implementation (FastRGF).

OCaml List rev_map vs map

If you found this page, you are probably very familiar with OCaml already!So, OCaml has a ````map```` function whose purpose is pretty cl...… Continue reading

How to optimize PyTorch code ?

Published on March 17, 2024

Acronyms of deep learning

Published on March 10, 2024