Computational complexity of machine learning algorithms


What is complexity? Good question, and one I should have addressed right away. It is a notion often covered in algorithms classes, but rarely in machine learning classes… Simply put, say you have a model. Training it on \(n\) points takes \(x\) minutes. What happens if you train it on \(kn\) points? If the training time becomes \(kx\), the training time is linear in the number of points. Sometimes it is worse: the new training time may be \(k^2 x\), in which case the training time is called quadratic in the number of points. If a method already takes a long time on a few thousand points, do not expect to run it on millions of points (kernel SVMs, I miss you in large samples).
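
As a toy illustration with purely hypothetical numbers (a two-minute baseline is assumed), here is how quickly the two regimes diverge:

# Hypothetical example: training on n points took 2 minutes.
base_minutes = 2.0
for k in (10, 100, 1000):
    linear = k * base_minutes          # linear scaling: k times longer
    quadratic = k ** 2 * base_minutes  # quadratic scaling: k^2 times longer
    print(f"k={k:>4}: linear ~{linear:,.0f} min, quadratic ~{quadratic:,.0f} min")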

Evaluating the complexity of a machine learning algorithm is harder than one would think: it may depend on the implementation, properties of the data may trigger different code paths, and the training time often depends on hyperparameters passed to the algorithm. Another caveat is that learning algorithms are complex and rely on many other algorithms.

A theoretical point of view

Some bounds

Here, I propose upper bounds for the case of dense data (the implementation achieving each bound is described in the justifications below).

Calling \(n\) the number of training samples, \(p\) the number of features, \(n_{trees}\) the number of trees (for tree-based methods), \(n_{sv}\) the number of support vectors and \(n_{l_i}\) the number of neurons at layer \(i\) of a neural network, we have the following approximations.

| Algorithm | Classification/Regression | Training | Prediction |
| --- | --- | --- | --- |
| Decision Tree | C+R | \(O(n^2p)\) | \(O(p)\) |
| Random Forest | C+R | \(O(n^2pn_{trees})\) | \(O(pn_{trees})\) |
| Random Forest (Breiman implementation) | R | \(O(n^2pn_{trees})\) | \(O(pn_{trees})\) |
| Random Forest (Breiman implementation) | C | \(O(n^2\sqrt{p}\,n_{trees})\) | \(O(pn_{trees})\) |
| Extremely Randomized Trees | C+R | \(O(npn_{trees})\) | \(O(npn_{trees})\) |
| Gradient Boosting (\(n_{trees}\)) | C+R | \(O(npn_{trees})\) | \(O(pn_{trees})\) |
| Linear Regression | R | \(O(p^2n+p^3)\) | \(O(p)\) |
| SVM (Kernel) | C+R | \(O(n^2p+n^3)\) | \(O(n_{sv}p)\) |
| k-Nearest Neighbours (naive) | C+R | \(-\) | \(O(np)\) |
| Nearest centroid | C | \(O(np)\) | \(O(p)\) |
| Neural Network | C+R | ? | \(O(pn_{l_1} + n_{l_1}n_{l_2} + \dots)\) |
| Naive Bayes | C | \(O(np)\) | \(O(p)\) |

Justifications

Decision Tree based models

Obviously, ensemble methods multiply the complexity of the original model by the number of “voters” in the model, and replace the training size by the size of each bag.
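As a quick worked example, using the per-tree bound from the table above and assuming bootstrap bags of size roughly \(n\):

\[ T_{forest}(n,p) \approx n_{trees} \cdot T_{tree}(n_{bag},p) = O(n^2 p \, n_{trees}), \quad n_{bag} \approx n. \]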

When training a decision tree, a split has to be found at each node until a maximum depth \(d\) has been reached.

The strategy for finding this split is to examine, for each variable (there are \(p\) of them), each candidate threshold (there are up to \(n\) of them) and evaluate the information gain achieved (an evaluation in \(O(n)\)), as sketched below.
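
Here is a minimal sketch of this naive search for a regression tree, using the weighted variance of the children as the split criterion (my choice for the example; it is not scikit-learn's actual implementation):

import numpy as np

def best_split(X, y):
    # Exhaustive search: p features x up to n thresholds x O(n) evaluation
    # = O(n^2 p) for a single node.
    n, p = X.shape
    best = (None, None, np.inf)  # (feature index, threshold, impurity)
    for j in range(p):                   # p features
        for t in np.unique(X[:, j]):     # up to n candidate thresholds
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            # weighted variance of the two children, computed in O(n)
            impurity = len(left) * left.var() + len(right) * right.var()
            if impurity < best[2]:
                best = (j, t, impurity)
    return best

X, y = np.random.rand(200, 3), np.random.rand(200)
print(best_split(X, y))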

In the Breiman implementation, and for classification, it is recommended to use \(\sqrt{p}\) predictors for each (weak) classifier.

Extremely randomized trees

In the case of ERTs, the exhaustive search for the optimal split simply does not take place, which makes it much cheaper to find a (weaker) split, as sketched below.
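
A minimal sketch of the idea (one random threshold per feature, in the spirit of Extra-Trees; the impurity criterion is the same toy one as above):

import numpy as np

def random_split(X, y, rng=None):
    # One random threshold per feature: O(n p) per node instead of O(n^2 p).
    rng = rng or np.random.default_rng()
    n, p = X.shape
    best = (None, None, np.inf)
    for j in range(p):
        t = rng.uniform(X[:, j].min(), X[:, j].max())  # no threshold search
        left, right = y[X[:, j] <= t], y[X[:, j] > t]
        if len(left) == 0 or len(right) == 0:
            continue
        impurity = len(left) * left.var() + len(right) * right.var()
        if impurity < best[2]:
            best = (j, t, impurity)
    return best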

However, in my experience, ERT implementations are not much faster than RFs.

Linear regressions

The problem of finding the vector of weights \(\beta\) in a linear regression boils down to evaluating the following equation: \(\beta=(X'X)^{-1}X'Y\).

The most computationally intensive part is evaluating the product \(X'X\), which takes \(O(p^2n)\) operations, followed by inverting it, which takes \(O(p^3)\) operations.

Though most implementations prefer to use gradient descent to solve the system of equations \((X'X)\beta = X'Y\) rather than invert the matrix directly, the complexity remains the same.
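
A minimal sketch of the normal-equation approach (I use a linear solve rather than an explicit inverse, which is more stable numerically; the asymptotic cost is the same):

import numpy as np

def ols_normal_equations(X, y):
    XtX = X.T @ X                      # O(p^2 n)
    Xty = X.T @ y                      # O(p n)
    return np.linalg.solve(XtX, Xty)   # O(p^3): solve the p x p system

X = np.random.rand(1000, 5)
y = X @ np.array([1., 2., 3., 4., 5.]) + 0.01 * np.random.randn(1000)
print(ols_normal_equations(X, y))      # should be close to [1, 2, 3, 4, 5]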

Support Vector Machine

For the training part, the classical algorithms require evaluating the kernel matrix \(K\), the matrix whose general term is \(K(x_i,x_j)\), where \(K\) is the specified kernel.

It is assumed that \(K\) can be evaluated with \(O(p)\) complexity, as is true for the common kernels (Gaussian, polynomial, sigmoid…). This assumption may not hold for other kernels.

Then, solving the constrained quadratic program is "morally equivalent to" inverting a square matrix of size \(n\), whose complexity is assumed to be \(O(n^3)\).
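
To illustrate the first step, here is a minimal sketch of building a Gaussian kernel matrix (the \(O(n^2p)\) part; the value of gamma is arbitrary):

import numpy as np

def gaussian_kernel_matrix(X, gamma=1.0):
    # n^2 entries, each requiring a squared distance in O(p): O(n^2 p) overall.
    sq_norms = (X ** 2).sum(axis=1)
    sq_dists = np.maximum(sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T, 0.0)
    return np.exp(-gamma * sq_dists)

K = gaussian_kernel_matrix(np.random.rand(100, 5))
print(K.shape)  # (100, 100)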

k-Nearest Neighbours

In its simplest form, given a new data point \(x\), the kNN algorithm looks for the \(k\) closest points to \(x\) in the training data and returns the most common label (or the average of the targets for a regression problem).

To achieve this, it is necessary to compare the distance between \(x\) and every point in the data set. This amounts to \(n\) comparisons. For the common distances (Euclidean, Manhattan…) each comparison is performed in \(O(p)\) operations. Note that kernel k-nearest neighbours have the same complexity (provided the kernels enjoy the same property).
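
A minimal sketch of the brute-force prediction for classification (Euclidean distance and a simple majority vote):

import numpy as np

def knn_predict(X_train, y_train, x, k=5):
    # n distances, each computed in O(p): O(n p) per prediction.
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]    # a partial sort would be slightly cheaper
    values, counts = np.unique(y_train[nearest], return_counts=True)
    return values[np.argmax(counts)]   # majority vote among the k neighbours

X_train = np.random.rand(1000, 10)
y_train = np.random.binomial(1, 0.5, 1000)
print(knn_predict(X_train, y_train, np.random.rand(10)))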

However, many implementations pre-process the training data, indexing the points with quadtrees, which dramatically reduces the number of comparisons with the training points at prediction time.
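
scikit-learn, for instance, exposes KD-trees and ball trees (rather than quadtrees) for exactly this purpose; a minimal usage sketch:

import numpy as np
from sklearn.neighbors import KDTree

X_train = np.random.rand(10000, 3)   # low dimension, where such indexes help most
tree = KDTree(X_train)               # one-off indexing cost, paid at "training" time

x = np.random.rand(1, 3)
dist, ind = tree.query(x, k=5)       # nearest neighbours without scanning all n points
print(ind)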

Likewise, in the case of a sparse data set with an inverted index over the rows, it is not necessary to compute all the pairwise distances.

The practical point of view

All this is nice, but what about real-life examples? Below, we focus on the scikit-learn implementations.

The assumption will be that each complexity takes the form \(O(n^\alpha p^\beta)\); \(\alpha\) and \(\beta\) are then estimated on randomly generated samples with varying \(n\) and \(p\), using a log-log regression.
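
More precisely, taking logarithms turns the assumed power law into a linear model:

\[ T(n,p) \approx C \, n^{\alpha} p^{\beta} \quad \Longrightarrow \quad \log T \approx \log C + \alpha \log n + \beta \log p, \]

so an ordinary linear regression of \(\log T\) on \(\log n\) and \(\log p\) yields the estimates of \(\alpha\) and \(\beta\).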

Though this assumption does not hold exactly, it should give a better idea of how the algorithms behave, and it will reveal implementation details and differences between the default settings of the same algorithm that one might otherwise overlook.

The results

| Method | \(\alpha\) | \(\beta\) |
| --- | --- | --- |
| RandomForestRegressor | 1.21 | 0.89 |
| ExtraTreesRegressor | 1.03 | 0.88 |
| AdaBoostRegressor | 0.71 | 1.01 |
| LinearRegression | 0.72 | 1.3 |
| SVR | 1.96 | 0.42 |
| RandomForestClassifier | 1.09 | 0.38 |
| ExtraTreesClassifier | 0.81 | 0.31 |
| AdaBoostClassifier | 0.89 | 0.79 |
| SVC | 2.05 | 0.52 |
| LogisticRegression(solver=liblinear) | 0.9 | 0.88 |
| LogisticRegression(solver=sag) | 1.03 | 0.95 |

Surprisingly, some methods appear sublinear in \(n\); perhaps the sample sizes were too small. As expected, the support vector machines show an exponent in \(n\) close to 2, which does not scale well with the sample size.

Another interesting point is the complexity in \(p\) for the random forests and extra trees: the exponent of \(p\) differs depending on whether we are solving a regression or a classification problem. A quick look at the documentation explains it: they have different default behaviours for each problem!

For the regression:

max_features : int, float, string or None, optional (default=”auto”)

The number of features to consider when looking for the best split:

If int, then consider max_features features at each split.
If float, then max_features is a percentage and int(max_features * n_features) features are considered at each split.
If “auto”, then max_features=n_features.
If “sqrt”, then max_features=sqrt(n_features).
If “log2”, then max_features=log2(n_features).
If None, then max_features=n_features.

Whereas the classification default behavior is

If “auto”, then max_features=sqrt(n_features).
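
If one wants to remove this difference when benchmarking, the number of candidate features per split can be pinned explicitly. A minimal sketch (forcing "sqrt" on both estimators is my own choice here, not a recommendation):

import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Force both estimators to consider sqrt(p) features per split so that the
# regression and classification timings become comparable.
reg = RandomForestRegressor(max_features="sqrt")
clf = RandomForestClassifier(max_features="sqrt")

X = np.random.rand(200, 20)
reg.fit(X, np.random.rand(200))
clf.fit(X, np.random.binomial(1, 0.5, 200))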

The code

For those who would like to run the code on other algorithms, here is the method I used.

import numpy as np
import pandas as pd
import time
from sklearn.linear_model import LinearRegression
import math


class ComplexityEvaluator:
    """Times model.fit on randomly generated data for several (n, p) pairs,
    then fits a log-log regression to estimate the exponents of n and p."""

    def __init__(self, nrow_samples, ncol_samples):
        self._nrow_samples = nrow_samples
        self._ncol_samples = ncol_samples

    def _time_samples(self, model, random_data_generator):
        # Measure the fit time for every combination of sample size (N)
        # and number of features (P).
        rows_list = []
        for nrow in self._nrow_samples:
            for ncol in self._ncol_samples:
                train, labels = random_data_generator(nrow, ncol)

                start_time = time.time()
                model.fit(train, labels)
                elapsed_time = time.time() - start_time

                result = {"N": nrow, "P": ncol, "Time": elapsed_time}
                rows_list.append(result)

        return rows_list

    def Run(self, model, random_data_generator):
        data = pd.DataFrame(self._time_samples(model, random_data_generator))
        print(data)
        # Log-log regression: the fitted coefficients are the estimated
        # exponents alpha (for N) and beta (for P).
        data = data.applymap(math.log)
        linear_model = LinearRegression(fit_intercept=True)
        linear_model.fit(data[["N", "P"]], data[["Time"]])
        return linear_model.coef_

And here are some unit tests (they can simply be appended at the bottom of the previous file).

if __name__ == "__main__":
    class TestModel:

        def __init__(self):
            pass

        def fit(self, x, y):
            time.sleep(x.shape[0] / 1000.)

    def random_data_generator(n, p):
        return np.random.rand(n, p), np.random.rand(n, 1)

    model = TestModel()

    complexity_evaluator = ComplexityEvaluator(
            [200, 500, 1000, 2000, 3000], [1,5,10])

    res = complexity_evaluator.Run(model, random_data_generator)

    print(res)

After this small unit test, everything seems consistent: the estimated exponent in \(n\) is essentially 1 and the exponent in \(p\) is essentially 0, as expected for a fake model whose fit time is proportional to the number of rows.

N   P      Time
0    200   1  0.200263
1    200   5  0.200421
2    200  10  0.200520
3    500   1  0.500928
4    500   5  0.500711
5    500  10  0.501064
6   1000   1  1.001396
7   1000   5  1.001420
8   1000  10  1.000629
9   2000   1  2.002225
10  2000   5  2.002096
11  2000  10  2.000759
12  3000   1  3.003331
13  3000   5  3.003340
14  3000  10  3.003350
[[  9.99583664e-01   1.13487892e-05]]

So let's run this over a number of the algorithms offered by scikit-learn. The following list may be updated as new algorithms are tested.

import numpy as np
import ComplexityEvaluator
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, AdaBoostRegressor
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, AdaBoostClassifier
from sklearn.svm import SVR, SVC
from sklearn.linear_model import LogisticRegression


def random_data_regression(n, p):
    return np.random.rand(n, p), np.random.rand(n)


def random_data_classification(n, p):
    return np.random.rand(n, p), np.random.binomial(1, 0.5, n)


regression_models = [RandomForestRegressor(),
                     ExtraTreesRegressor(),
                     AdaBoostRegressor(),
                     LinearRegression(),
                     SVR()]

classification_models = [RandomForestClassifier(),
                         ExtraTreesClassifier(),
                         AdaBoostClassifier(),
                         SVC(),
                         LogisticRegression(),
                         LogisticRegression(solver='sag')]

names = ["RandomForestRegressor",
         "ExtraTreesRegressor",
         "AdaBoostRegressor",
         "LinearRegression",
         "SVR",
         "RandomForestClassifier",
         "ExtraTreesClassifier",
         "AdaBoostClassifier",
         "SVC",
         "LogisticRegression(solver=liblinear)",
         "LogisticRegression(solver=sag)"]

complexity_evaluator = ComplexityEvaluator.ComplexityEvaluator(
    [500, 1000, 2000, 5000, 10000, 15000, 20000],
    [5, 10, 20, 50, 100, 200])

i = 0
for model in regression_models:
    res = complexity_evaluator.Run(model, random_data_regression)[0]
    print(names[i] + ' | ' + str(round(res[0], 2)) +
          ' | ' + str(round(res[1], 2)))
    i = i + 1

for model in classification_models:
    res = complexity_evaluator.Run(model, random_data_classification)[0]
    print(names[i] + ' | ' + str(round(res[0], 2)) +
          ' | ' + str(round(res[1], 2)))
    i = i + 1

Learning more

Machine learning

The Elements of Statistical Learning, by Trevor Hastie, Robert Tibshirani and Jerome Friedman, is a brilliant introduction to the topic and will help you gain a better understanding of most of the algorithms presented in this article!

Foundations of Machine Learning, by Mehryar Mohri, Afshin Rostamizadeh and Ameet Talwalkar, provides a theoretical framework for various machine learning algorithms, as well as detailed treatments of some of them.

Algorithms and computational complexity

If you want a thorough understanding of computational complexity theory, these books are great resources.

Computational Complexity: A Modern Approach, by Sanjeev Arora and Boaz Barak, is a clear, detailed analysis of the topic, also covering cryptography and quantum computation.

Randomized Algorithms, by Rajeev Motwani and Prabhakar Raghavan, is more specialized than the first one, but I like the interplay between probability and algorithms presented there.
