Minimize L1 penalty with (univariate) linear regression

Introduction

Most of the regression problems I dealt with focused on minimizing the L2 norm of the difference between the predictions of a model and the true value. However, from a business point of view, it may make more sense to minimize directly the L1 norm.

Indeed, using a least square approach will mainly penalize large errors. But if the cost of an error is constant (once again, from the business perspective), then, the L1 penalty makes sense as well.

However, note that the estimator may be vastly different, if we want to use a constant model, the values of the intercept differ, one being the mean, the other one, the median.

More formally:

\[\arg\min_x \sum_{j=1}^n (x_j - x)^2 = \bar{x},\] \[\arg\min_x \sum_{j=1}^n |x_j - x| = \mathrm{med}(x_1, \dots, x_j),\]

So far, so good. However, things become complex quite quickly: a theoretical advantage of L2 penalty is that it makes the penalty differientiable, and we enjoy many closed formulas for various problems relying on L2 penatlies.

\[\hat{\beta} = \underset{\beta} {\arg \min}\,L\left(D, \beta\right) = \underset{\beta}\arg \min \sum_{i=1}^{n} \left(\beta \cdot x_i - y_i\right)^2\] \[\hat{\beta} = \left(X^\textsf{T}X\right)^{-1}X^\textsf{T}Y\]

Algorithm

I will focus on implementing the univariate case, without intercept. Including intercept or multivariate case relies on a (much) more complex optimization.

\[\hat{b} = \arg\min_x \sum_{j=1}^n |y_j - b \cdot x_j|\]

Calling \(L(b) = \sum_{j=1}^n |y_j - b \cdot x_j|\) we note that \(L\) is a convex function, whose derivative is not continuous.

It derivative, where it is defined, is:

\[l(b) = - \sum_{j=1}^n x_j \cdot \mathbb{sgn} (y_j - b \cdot x_j)\]

Given that \(L\) is convex, \(l\) must be monotonic. Besides \(l(-\infty) = - \sum_{j=1}^n |x_j| < 0\) and \(l(+\infty) = \sum_{j=1}^n |x_j| > 0\).

Therefore, we will look for a zero of \(l\) using dichotomy.

Implementation

As detailed above, one we have the penalty and the dichotomy algorithm implemented, there is nothing else to do:

import numpy as np

def dichotomy(n, f, l, r):

    c = (l + r) / 2
    if n ==0:
        return c

    else:
        if f(c) < 0 :
            return dichotomy(n-1, f, c, r)
        else:
            return dichotomy(n-1, f, l, c)


def penalty(x, y, b):
    return -np.sum(x * np.sign(y - b * x))


class L1Regressor:

    def __init__(self, n_iterations=20):
        self.n_iterations = n_iterations
        self.b = None

    def fit(self, x, y):
        ratios = y / x
        l, r = np.min(ratios), np.max(ratios)
        self.b = dichotomy(self.n_iterations, lambda b: penalty(x, y, b), l, r)
        return self

Tests

If we append the following:

if __name__ == "__main__":

    import matplotlib.pyplot as plt

    x = np.random.rand(100)
    y = x * 0.5 + 0.1 * np.random.normal(size=100)

    slope = L1Regressor().fit(x, y).b

    plt.scatter(x, y)
    plt.plot(x, x*slope, c = 'r')
    plt.savefig("./illustration.svg")

We obtain:

L1 regression

Comparison with L2

We can add some noise to the observations and we expect to have a more robust regression with L1 penalty.

Below are plotted the two slopes: in red, the L1 penalty is minimized, in green, the L2 penalty.

L1 regression vs L2

    from sklearn.linear_model import LinearRegression

    x = np.random.rand(100)
    y = x * 0.5 + 0.1 * np.random.normal(size=100)

    x[:10] = 0.9 + 0.1 * np.random.rand(10)
    y[:10] = 2 + x[:10] * 0.5 + np.random.normal(size=10)

    slope = L1Regressor().fit(x, y).b
    slopel2 = LinearRegression(fit_intercept=False).fit(x.reshape(-1,1), y).coef_[0]

    plt.scatter(x, y)
    plt.plot(x, x*slope, c = 'r')
    plt.plot(x, x*slopel2, c = 'g')
    plt.savefig("./illustration_noise.svg")

Minimize L1 penalty with (univariate) linear regression

June 12, 2022

Introduction

Algorithm

Implementation

Tests

Comparison with L2

OCaml List rev_map vs map

How to optimize PyTorch code ?

Acronyms of deep learning