Minimize L1 penalty with (univariate) linear regression

Reading time ~2 minutes


Most of the regression problems I dealt with focused on minimizing the L2 norm of the difference between the predictions of a model and the true value. However, from a business point of view, it may make more sense to minimize directly the L1 norm.

Indeed, using a least square approach will mainly penalize large errors. But if the cost of an error is constant (once again, from the business perspective), then, the L1 penalty makes sense as well.

However, note that the estimator may be vastly different, if we want to use a constant model, the values of the intercept differ, one being the mean, the other one, the median.

More formally:

\[\arg\min_x \sum_{j=1}^n (x_j - x)^2 = \bar{x},\] \[\arg\min_x \sum_{j=1}^n |x_j - x| = \mathrm{med}(x_1, \dots, x_j),\]

So far, so good. However, things become complex quite quickly: a theoretical advantage of L2 penalty is that it makes the penalty differientiable, and we enjoy many closed formulas for various problems relying on L2 penatlies.

\[\hat{\beta} = \underset{\beta} {\arg \min}\,L\left(D, \beta\right) = \underset{\beta}\arg \min \sum_{i=1}^{n} \left(\beta \cdot x_i - y_i\right)^2\] \[\hat{\beta} = \left(X^\textsf{T}X\right)^{-1}X^\textsf{T}Y\]


I will focus on implementing the univariate case, without intercept. Including intercept or multivariate case relies on a (much) more complex optimization.

\[\hat{b} = \arg\min_x \sum_{j=1}^n |y_j - b \cdot x_j|\]

Calling \(L(b) = \sum_{j=1}^n |y_j - b \cdot x_j|\) we note that \(L\) is a convex function, whose derivative is not continuous.

It derivative, where it is defined, is:

\[l(b) = - \sum_{j=1}^n x_j \cdot \mathbb{sgn} (y_j - b \cdot x_j)\]

Given that \(L\) is convex, \(l\) must be monotonic. Besides \(l(-\infty) = - \sum_{j=1}^n  |x_j| < 0\) and \(l(+\infty) = \sum_{j=1}^n  |x_j| > 0\).

Therefore, we will look for a zero of \(l\) using dichotomy.


As detailed above, one we have the penalty and the dichotomy algorithm implemented, there is nothing else to do:

import numpy as np

def dichotomy(n, f, l, r):

    c = (l + r) / 2
    if n ==0:
        return c

        if f(c) < 0 :
            return dichotomy(n-1, f, c, r)
            return dichotomy(n-1, f, l, c)

def penalty(x, y, b):
    return -np.sum(x * np.sign(y - b * x))

class L1Regressor:

    def __init__(self, n_iterations=20):
        self.n_iterations = n_iterations
        self.b = None

    def fit(self, x, y):
        ratios = y / x
        l, r = np.min(ratios), np.max(ratios)
        self.b = dichotomy(self.n_iterations, lambda b: penalty(x, y, b), l, r)
        return self


If we append the following:

if __name__ == "__main__":

    import matplotlib.pyplot as plt

    x = np.random.rand(100)
    y = x * 0.5 + 0.1 * np.random.normal(size=100)

    slope = L1Regressor().fit(x, y).b

    plt.scatter(x, y)
    plt.plot(x, x*slope, c = 'r')

We obtain:

L1 regression

Comparison with L2

We can add some noise to the observations and we expect to have a more robust regression with L1 penalty.

Below are plotted the two slopes: in red, the L1 penalty is minimized, in green, the L2 penalty.

L1 regression vs L2

    from sklearn.linear_model import LinearRegression

    x = np.random.rand(100)
    y = x * 0.5 + 0.1 * np.random.normal(size=100)

    x[:10] = 0.9 + 0.1 * np.random.rand(10)
    y[:10] = 2 + x[:10] * 0.5 + np.random.normal(size=10)

    slope = L1Regressor().fit(x, y).b
    slopel2 = LinearRegression(fit_intercept=False).fit(x.reshape(-1,1), y).coef_[0]

    plt.scatter(x, y)
    plt.plot(x, x*slope, c = 'r')
    plt.plot(x, x*slopel2, c = 'g')

OCaml List rev_map vs map

If you found this page, you are probably very familiar with OCaml already!So, OCaml has a ````map```` function whose purpose is pretty cl...… Continue reading

How to optimize PyTorch code ?

Published on March 17, 2024

Acronyms of deep learning

Published on March 10, 2024