# Introduction

Topological data analysis (TDA) allows to reduce many hypothesis when doing statistics. A lot of research in this field has been done over the last years and [1] and [4] provide a brilliant exposition about the mathematical concepts behind TDA. Here, I want to focus on one aspect of TDA: compressed representations of shapes.

# A simple topological data analysis pipeline

We will focus on the article [2] in the references. It allows to have an idea of the shape of the data. More precisely, we will go through the various steps that enabled the authors of [2] to produce the following figure:

Fig. 1: The pipeline

## Overall pipeline

The authors go through the following steps to represent the new shape:

A) A 3D object (hand) represented as a point cloud.

B) A filter value is applied to the point cloud and the object is now colored by the values of the filter function.

C) The data set is binned into overlapping groups.

D) Each bin is clustered and a network is built.

Therefore, we will need a “filter function”, a method to cluster a point cloud (the “clusterer”), N intervals and an overlap parameter. Let’s summarize this in the following class.

class TDAPipeline():

def __init__(self, filter_function, clusterer, n, overlap):
self._filter_function = filter_function
self._clusterer = clusterer
self._n = n
self._overlap = overlap

# (string, int list) dict
self._clusters = []

# string list
self._clusters_names = []

# (string, string) list
self._edges = []


We are now ready to dig in the other concepts.

## The filters

Filter functions allows to associate real valued quantities to the data points. It can be the coordinates of a point on any axis, or the coordinates of the points on the first component of a PCA.

class AxisFilter():

def __init__(self, i):
self._i = i

def Fit(self, data):
pass

def Filter(self, data):
return data[:,self._i]


An interesting discovery in this article is the concept of “L-infinity centrality”. This filter allows to associate to a point a real number qualifying how close it is to the boundary of the cloud.

L-infinity centrality, is defined for a data point x to be the maximum distance from x to any other data point in the set. Large values of this function correspond to points that are far from the center of the data set.

import numpy as np
from scipy.spatial.distance import pdist, squareform

class LinfinityCentrality():

def __init__(self, distance):
self._distance = distance

def Fit(self, data):
# Barnes Hut approx. or something similar may speed up the filtering part later
pass

def Filter(self, data):
pairwise_dist = squareform(pdist(data, self._distance))
return np.amax(pairwise_dist, axis = 1)


## The clustering

Once the points have been associated to an interval in the image of the filter function, they have to be clustered. The authors use the following method:

We first construct the single linkage dendrogram for the data in the bin, and record the threshold values for each transition in the clustering. We select an integer k, and build a k-interval histogram of these transition values. The lustering is performed using the last threshold before the first gap in this histogram.

Which can be implemented taking advantage of the implementation of the single linkage proposed in scipy.

import numpy as np
from sklearn.cluster import AgglomerativeClustering

def __init__(self, k):
self._k = k

def Cluster(self, data):
if data.shape[0] == 0:
return []

counts, thresholds = np.histogram(linkage_matrix[:, 2], bins=self._k)
first_gap_index = self._first_zero(counts, -1)
if(first_gap_index > -0.5):
threshold = thresholds[first_gap_index]
else:
return clusters

def _first_zero(self, arr, invalid_val=-1):


# Some classical shapes

We now have what we need to follow the pipeline of the author. Let’s just sample some points from various shapes with the following classes:

class Sampler():

def __init__(self):
raise NotImplemented()

def Sample(self, n):
raise NotImplemented()


It is quite easy to sample points uniformly over a sphere (up to numerical precision issues) by sampling an distribution over R^3 which is rotation invariant (here, three independant gaussian with same mean and variance) and normalize it.

import numpy as np
from sklearn.preprocessing import normalize
from Sampler import Sampler

class SphereSampler(Sampler):

def __init__(self, r=1):
self._r = r

'''
Uniform sampling over the sphere.
'''
def Sample(self, n):
mat = np.random.randn(n, 3)
return self._r * normalize(mat, axis=1, norm='l2')



I could not find a uniform distribution over the torus, but this should be “uniform enough”.

import numpy as np
from Sampler import Sampler

class TorusSampler(Sampler):

def __init__(self, r=1, R=3, x0=0, y0=0, z0=0):
self._r = r
self._R = R
self._x0 = x0
self._y0 = y0
self._z0 = z0

'''
This sampling may not be uniform
'''
def Sample(self, n):
u = 2 * np.pi * np.random.rand(n)
v = 2 * np.pi * np.random.rand(n)

cosu = np.cos(u)
sinu = np.sin(u)

cosv = np.cos(v)
sinv = np.sin(v)

x = self._x0 + (self._R + self._r * cosu) * cosv
y = self._y0 + (self._R + self._r * cosu) * sinv
z = self._z0 + self._r * sinu

return np.array([x, y, z]).T


# Experiments

## Following the method of the authors

We are now ready to implement the full pipeline and produce sensible graphs.

import numpy as np

class TDAPipeline():

def __init__(self, filter_function, clusterer, n, overlap):
self._filter_function = filter_function
self._clusterer = clusterer
self._n = n
self._overlap = overlap

# (string, int list) dict
self._clusters = []

# string list
self._clusters_names = []

# (string, string) list
self._edges = []

def Run(self, data):
self._filter_function.Fit(data)

# one dimensional array : f(data), f : R^n -> R
filtered_data = self._filter_function.Filter(data)
lb = np.min(filtered_data)
ub = np.max(filtered_data)
overlapping_intervals = self._generate_overlapping_intervals(
lb, ub, self._n, self._overlap)

# f^-1([a,b]) for [a,b] in the overlapping intervals
sliced_clusters = []
i = 0
for interval in overlapping_intervals:
in_interval = map(lambda x: self._between_pair(
x, interval), filtered_data)
indices = np.where(in_interval)[0]
filtered_points = data[indices, :]
clusters = self._clusterer.Cluster(filtered_points)
points_by_cluster = self._to_dict_list(
str(i) + "_", indices, clusters)
sliced_clusters.append(points_by_cluster)
i = i + 1

self._clusters = self._flatten_dict_list(sliced_clusters)
self._clusters_names = self._clusters.keys()
self._edges = self._build_edges(sliced_clusters)

return self._edges

def ClusterAverage(self, target):
if len(self._clusters) == 0:
raise "The algorithm has not been trained!"

indices_dict = self._clusters

return [np.mean(target[indices_dict[n]]) for n in self._clusters_names]

def _build_edges(self, sliced_clusters):
edges = []
for sets_prev in sliced_clusters:
for sets_new in sliced_clusters:
for prev_key in sets_prev.keys():
for new_key in sets_new.keys():
if not set(sets_new[new_key]).isdisjoint(set(sets_prev[prev_key])):
edges.append((prev_key, new_key))
return edges

def _flatten_dict_list(self, dict_list):
res = {}
for dicts in dict_list:
res.update(dicts)
return res

def _to_dict_list(self, prefix, vector, clusters):
cluster_names = np.unique(clusters)
res = {}
for cluster_name in cluster_names:
indices_in_cluster = map(lambda x: cluster_name == x, clusters)
points_in_cluster = vector[np.where(indices_in_cluster)[0]]
res.update({prefix + str(cluster_name): points_in_cluster})
return res

def _generate_overlapping_intervals(self, lb, ub, n, overlap):
step = 1. * (ub - lb) / n
res = [(lb + i * step - overlap * step, lb + (i + 1)
* step + overlap * step) for i in range(n)]
return res

def _between_pair(self, x, pair):
a, b = pair
return x > a and x < b


## Illustrations

Did it work :) ? I found a hand data set “Free realistic hand” here, it required a little bit of parsing to put it in a 3d file.

Fig. 2: The pipeline, data : hand, filter : PCA, clustering : single linkage

Unfortunately the sampling around the wrist is not high enough and the wrist is associated to the two blue dots, separated from the rest of the figure., but we do recognize the four fingers(in red) well separated from the thumb (in green).

But we can generate synthetic data to get rid of this sampling issue. Per example, with a torus:

Fig. 3: A torus, filter : first component, clustering : single linkage

Or two touching tori:

Fig. 4: Two tori, filter : first component, clustering : single linkage
import numpy as np

from AxisFilter import AxisFilter
from PCAFilter import PCAFilter
from LinfinityCentrality import LinfinityCentrality

from TesseractSampler import TesseractSampler
from TorusSampler import TorusSampler
from SphereSampler import SphereSampler

from TDAPipeline import TDAPipeline

from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt

import networkx as nx
from networkx.drawing.nx_agraph import graphviz_layout

def plot_tda_pipeline(title, sample, filter_function, clusterer, N, OVERLAP):
tdap = TDAPipeline(filter_function, clusterer, N, OVERLAP)
edges = tdap.Run(sample)

colors = filter_function.Filter(sample)
cluster_colors = tdap.ClusterAverage(colors)

fig = plt.figure()

x = sample[:, 0]
y = sample[:, 1]
z = sample[:, 2]

ax.scatter(x, y, z, c=colors, marker='o')

ax.set_xlabel('X Label')
ax.set_ylabel('Y Label')
ax.set_zlabel('Z Label')

plt.title(title)

G = nx.OrderedGraph() # withouth OrderedGraph, the nodes may be shuffled (and the colors would not match)
pos = graphviz_layout(G)
nx.draw(G, pos, with_labels = True)
nx.draw_networkx_nodes(G, pos,
node_color = cluster_colors)

plt.savefig(title)

N = 10
OVERLAP = 0.2
SAMPLE_SIZE = 2000
K = 5

named_samples = [("Torus", TorusSampler(1, 3).Sample(SAMPLE_SIZE)),
("Tesseract", TesseractSampler().Sample(SAMPLE_SIZE)),
("Two Spheres", np.vstack((SphereSampler().Sample(SAMPLE_SIZE),
SphereSampler(3).Sample(SAMPLE_SIZE)))),
("Two touching tori", np.vstack((TorusSampler(1, 3).Sample(SAMPLE_SIZE),
TorusSampler(1, 3, 6, 0, 0).Sample(SAMPLE_SIZE)))),

named_filters = [("Axis filter (x)", AxisFilter(0)),
("PCA", PCAFilter()),
("L-Linfinity Centrality", LinfinityCentrality('euclidean'))]

for named_filter in named_filters:
name_filter, filter_function = named_filter
for named_sample in named_samples:
name_data, sample = named_sample
title = "Data : " + name_data + ", " + "Filter : " + name_filter
plot_tda_pipeline(title, sample, filter_function,



# References

## Articles

[1]F. Chazal and B. Michel, “An introduction to Topological Data Analysis: fundamental and practical aspects for data scientists,” arXiv:1710.04019 [cs, math, stat], Oct. 2017.

[2]P. Y. Lum et al., “Extracting insights from the shape of complex data using topology,” Scientific Reports, vol. 3, no. 1, Dec. 2013.

[3]B. Zieliński, M. Lipiński, M. Juda, M. Zeppelzauer, and P. Dłotko, “Persistence Bag-of-Words for Topological Data Analysis,” arXiv:1812.09245 [cs, math, stat], Dec. 2018.

[4]A. Zomorodian, “Topological Data Analysis,” p. 39.

[5]J. Mathews, S. Nadeem, M. Pouryahya, A. Tannenbaum, and J. O. Deasy, “Topological Data Analysis of PAM50 and 21-Gene Breast Cancer Assays:,” bioRxiv, Nov. 2018.

# Best books about data science

This is my ideal library about machine learning.

## Theoretical

### Machine learning and statistics

The Elements of Statistical Learning: Data Mining, Inference, and Prediction, by Trevor Hastie, Robert Tibshirani, Jerome Friedman

A must have. Covers most of the classical algorithms commonly used in machine learning.

Foundations of Machine Learning by Mehryar Mohri, Afshin Rostamizadeh, Ameet Talwalkar and Francis Bach

Provides a theoretical framework to various machine learning algorithms and a detailed implementation of some of them. Beware, this is highly theoretical and requires familiarity with probability theory !

Cours de Statistique Mathématique (in French!), by Alain Monfort.

It provides a clear and thorough introduction to the theory behind statistical hypothesis testing, estimation, and linear regressions.

Lectures on Algebraic Statistics by Mathias Drton, Bernd Sturmfels, Seth Sullivant.

Following my article about speeding up cross validation where some nice ideas arose from algebra, this books covers more aspects about the links between statistics and algebra, and a chapter dedicated to open problems in this field.

### Probability

What would be machine learning without a nice understanding of probabilities and the proofs behind many methods? This books provides a large range of exercises and their solutions.

Real Analysis and Probability, by R. M. Dudley.

This is one of the clearest exposition of probability theory and of the interplay between the properties of metric spaces and probability measures.

Randomized Algorithms by Rajeev Motwani, Prabhakar Raghavan

I did not know where to put this one. In my post about time complexity I discussed some issues raised by the complexity of training a classifier or a regressor. This book provides a nice overview of this theory, heavily relying on probability.

## Practice

### R

Applied Predictive Modeling by Max Kuhn, Kjell Johnson

Full of nice ideas about how to treat various data sets! They even implemented a R package. I tried the package but, to be honest, never felt the need to reuse one of the methods proposed.

ggplot2: Elegant Graphics for Data Analysis (Use R!)

I am a big fan of ggplot2. I know, there are now better libraries on the market, with loads and loads of new functionality, interactive plots, yadi yadi yada… But mastering ggplot2 is so amazing. It only takes 5 lines to produce amazing graphs, summarizing findings from data. Which is really amazing, especially at the workplace.

### OCaml

Real World OCaml: Functional programming for the masses by Yaron Minsky, Anil Madhavapeddy and Jason Hickey is a good introduction to OCaml.

## Miscellanous

The Art of Mathematics: Coffee Time in Memphis

Various exercises in mathematics, with hints and (beautiful) solutions. Some are directly related to probability (11. Loaded Dice, 69. Absent-minded Passengers, 77. Independent Random Variables, 139. A Probabilistic Inequality).

Music: A Mathematical Offering by Dave Benson.

Definitely not probability or statistics oriented. However, you may find here a new way to look at Fourier’s transform, and if you love music and mathematics, this is a must have.

Chaos Monkeys

Though not only about machine learning, this book is about the startup scene. Absolutely non technical, but fascinating :)

Freakonomics by Steven D. Levitt, Stephen J. Dubner

When economists meet real life :) non technical either.

Disclaimer: these are amazon sponsored links, however, I have read every book (in some cases, only partly) and only the best are in this list.

# ECML/PKDD 15: Taxi Trajectory Prediction

As an occasionnal “Kaggler” I like to write my own code or experiment new alogrithms. This one led me to a top 20 place. The solution was really simple: each line of the train and test set is described as a collection of binary feature. For each feature, one can associate the destinations who share this feature. If the cloud of points associated to this binary feature has a low “variance” (i.e. most of the points sharing this feature end up in the same place) and a high number of observations, we want to give a higher weight to this feature. Otherwise, this feature can be deemed as poorly informative and we want to decrease its weight.

## Preprocessing

• Generate a set of balls covering the map (radiuses and centers being chosen to avoid having too many features in the end)

• Remove the trajectories with lightspeed jumps

• For the training, cut the trajectories according to a $$\min(U[0,1],U[0,1])$$ law (it provided a good matching between cross validation and leaderboard score).

• Replace the (truncated) trajectories by the set of balls they cross

• Keep all the other features

## Learning

• For each feature (boolean: have this trajectory crossed Ball k, is it id_207?) generate a cloud of points that are the final points sharing this feature. Actually, the cloud itself is never stored in memory (it would not fit on most computers I guess). Only its barycentre and variance are (they are then updated as mean and variance would be).

• Features and their interactions were considered. Indeed, without interactions the performance is really low. Interactions such as “Is the car on the road to the airport AND is the car going north” intuitively seem much more efficient than the two features taken independently.

## Predicting

• Given the features, gather all the barycenters and variances.

• Return an average of the barycenters, weighted by the inverse of the standard deviation (raised to a certain power - CV showed that 7 was the best) :

• $\hat{f}(p_{1}...p_{n})=\sum_{k,p_{k}\,is\,true}\frac{\#\left(C\left(p_{k}\right)\right)^{\alpha}\mathrm{bar}(C(p_{k}))}{\mathrm{sd}\left(C(p_{k})\right)^{\beta}}$

Where :

$$p_{k}$$ stands for a boolean feature,

$$C(p_{k})$$ stands for the cloud associated to the feature,

$$\#\left(C(p_{k})\right)$$, sd and bar stands for the number of points, the variance and the barycenter of the cloud.

$$\alpha$$ and $$\beta$$ are penalization parameters.

## Speed of the algorithm

Using streaming evaluation of mean and standard deviation of a collection of points, this algorithms had a linear complexity in terms of the number of points. The training and testing part could actually be performed in a couple of minutes.

## How to improve it ?

### Performance

Considering interactions of depth 3 ! Unfortunatly, I was limited by my computer’s memory.

Combining this with other approaches (genuine random forests…) could have helped as well!

### Speed

The part:

Generate a set of balls covering the map (radiuses and centers being chosen to avoid having too many features in the end)

Could actually have been made much faster using squares (and rounding) insteads of balls! The algorithm with balls had a $$O(n \cdot d)$$ complexity ($$d$$ being the number of balls) whereas rounding would have led to a $$O(n)$$ complexity, no matter the number of squares.

## Bonus: generating heatmaps!

The data comes from OpenData.Paris.fr and is inspired from the following kernel.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

bins = 500
lat_min, lat_max = 48.75, 48.95
lon_min, lon_max = 2.25, 2.5

COLNAME = "Geometry X Y"

def getCoord(x,i):
return map(float, x.split(','))[i]

chunksize=10000,
usecols=[COLNAME],
sep=";")

z = np.zeros((bins, bins))

for i, chunk in enumerate(data):
print("Chunk " + str(i))
chunk['x'] =chunk[COLNAME].apply(lambda x: getCoord(x,0))
chunk['y'] =chunk[COLNAME].apply(lambda x: getCoord(x,1))
z += np.histogram2d(chunk['x'], chunk['y'], bins=bins,
range=[[lat_min, lat_max],
[lon_min, lon_max]])[0]

log_density = np.log(1 + z)

plt.imshow(log_density[::-1, :],  # flip vertically
extent=[lat_min, lat_max, lon_min, lon_max])

plt.savefig('heatmap.png')


# Random forest vs SVM

### General remarks

If you have too many rows (more than 10 000), prefer the random forest. If you do not have much time to pre process the data (and, or have a mix of categorical and numerical features), prefer the random forest.

Random forests are inherently mutliclass whereas Support Vector Machines need workarounds to treat multiple classes classification tasks. Usually this consists in building binary classifiers which distinguish (i) between one of the labels and the rest (one-versus-all) or (ii) between every pair of classes (one-versus-one).

## Empiric facts

### Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?

The classifiers most likely to be the bests are the random forest (RF) versions, the best of which (implemented in R and accessed via caret)

For the sake of the example, the next two paragraphs deal with datasets where SVMs are better than random forests.

### MNIST benchmark

From crossvalidated, RFs seem to achieve a 2.8% error rate on the MNSIT dataset.

On the other hand, on Yann Lecun benchmarks a simple SVM with a gaussian kernel could reach a 1.4% error rate. Virtual SVM could reach 0.56% error rate.

Fig. 1: Illustration of the MNIST dataset

### Microarray data

Quoting A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification, who use 22 diagnostic and prognostic datasets:

We found that both on average and in the majority of microarray datasets, random forests are outperformed by support vector machines both in the settings when no gene selection is performed and when several popular gene selection methods are used.

## Size of the data set and theoretical complexity

The first thing to consider is the feasability of training a model on a specific data set. Are there too many rows or columns to train a specific model ? Recall the table from the article about time complexity.

Algorithm Classification/Regression Training Prediction
Decision Tree C+R $$O(n^2p)$$ $$O(p)$$
Random Forest C+R $$O(n^2pn_{trees})$$ $$O(pn_{trees})$$
Random Forest R Breiman implementation $$O(n^2p n_{trees})$$ $$O(pn_{trees})$$
Random Forest C Breiman implementation $$O(n^2 \sqrt p n_{trees})$$ $$O(pn_{trees})$$
SVM (Kernel) C+R $$O(n^2p+n^3)$$ $$O(n_{sv}p)$$

What we can see is that the computational complexity of Support Vector Machines (SVM) is much higher than for Random Forests (RF). This means that training a SVM will be longer to train than a RF when the size of the training data is higher.

This has to be considered when chosing the algorithm. Typically, SVMs tend to become unusable when the number of rows exceeds 20 000.

Therefore, random forests should be prefered when the data set grows larger. You can use, however, bagged support vector machines.

## Transformation of the data : practical details

Recall the following table, which can be found in the article about data scaling and reducing

Algorithm Translation invariant Monotonic transformation invariant
Decision Tree X X
Random Forest X X
SVM (Gaussian kernel) X
SVM (Other kernels)

This simply means that you do not have to worry about scaling / centering data or any other transformation (such as Box Cox) before feeding your data to the random forest algorithm.

## The shape of the decision function

One scarcely has to deal with two dimensional data. If this ever happens to you, bear in mind that random forest tend to produce decision boundaries which are segements parallel to the x and y axises, whereas SVMs (depending on the kernel) provide smoother boundaries. Below are some illustrations.

### Results

Illusatration of the decision boundary of a SVM

Illustration of the decision boundary of a random forest

### The code

import numpy as np
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, AdaBoostClassifier
from sklearn.svm import SVR, SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from DecisionBoundaryPlotter import DecisionBoundaryPlotter

def random_data_classification(n, p, f):
predictors = np.random.rand(n, p)
return predictors, np.apply_along_axis(f, 1, predictors)

def parabolic(x, y):
return (x**2 + y**3 > 0.5) * 1

def parabolic_mat(x):
return parabolic(x[0], x[1])

X, Y = random_data_classification(100, 2, parabolic_mat)

dbp = DecisionBoundaryPlotter(X, Y)

named_classifiers = [(RandomForestClassifier(), "RandomForestClassifier"),
(SVC(probability=True), "SVC")]

for named_classifier in named_classifiers:
print(named_classifier[1])
dbp.PlotContour(named_classifier[0], named_classifier[1])
dbp.PlotHeatmap(named_classifier[0], named_classifier[1])



The DecisionBoundaryPlotter class being defined below:

import numpy as np
import matplotlib.pyplot as plt

class DecisionBoundaryPlotter:

def __init__(self, X, Y, xs=np.linspace(0, 1, 100),
ys=np.linspace(0, 1, 100)):
self._X = X
self._Y = Y
self._xs = xs
self._ys = ys

def _predictor(self, model):
model.fit(self._X, self._Y)
return (lambda x: model.predict_proba(x.reshape(1,-1))[0, 0])

def _evaluate_height(self, f):
fun_map = np.empty((self._xs.size, self._ys.size))
for i in range(self._xs.size):
for j in range(self._ys.size):
v = f(
np.array([self._xs[i], self._ys[j]]))
fun_map[i, j] = v
return fun_map

def PlotHeatmap(self, model, name):
f = self._predictor(model)
fun_map = self._evaluate_height(f)

fig = plt.figure()
s = fig.add_subplot(1, 1, 1, xlabel='$x$', ylabel='$y$')
im = s.imshow(
fun_map,
extent=(self._ys[0], self._ys[-1], self._xs[0], self._xs[-1]),
origin='lower')
fig.colorbar(im)
fig.savefig(name + '_Heatmap.png')

def PlotContour(self, model, name):
f = self._predictor(model)
fun_map = self._evaluate_height(f)

fig = plt.figure()
s = fig.add_subplot(1, 1, 1, xlabel='$x$', ylabel='$y$')
s.contour(self._xs, self._ys, fun_map, levels = [0.5])
s.scatter(self._X[:,0], self._X[:,1], c = self._Y)
fig.savefig(name + '_Contour.png')


## SVM as a random forest

Something funny is that random forests can actually be rewritten as kernel methods. I have not (yet) looked at the articles below in more details

Random forests and kernel methods, Erwan Scornet

The Random Forest Kernel (and creating other kernels for big data from random partitions)

## Towards faster SVMs ?

As I was writing the article, I discovered some recent progresses had been made : a scalable version of the Bayesian SVM has been developed. See : Bayesian Nonlinear Support Vector Machines for Big Data.

## Learning more

Wenzel, Florian; Galy-Fajou, Theo; Deutsch, Matthäus; Kloft, Marius (2017). “Bayesian Nonlinear Support Vector Machines for Big Data”. Machine Learning and Knowledge Discovery in Databases (ECML PKDD).

The elements of statistical learning by Trevor Hastie, Robert Tibshirani, Jerome Friedman is a brilliant introduction to the topic and will help you have a better understanding of most of the algorithms presented in this article !

Classification and Regression Trees by Leo Breiman, Jerome Friedman, Charles J. Stone, R.A. Olshen

# Creating VIM plugin with Python

## My first plugin

As I was working on various OCaml projects that were growing bigger and bigger, the various open directives started accumulating in some files. And I missed the feature I had in visual studio that automatically sorts the names of the using directives. This allowed a much quicker visual inspection of the various modules I needed for a piece of code to run.

As I could not find a solution to do it easily, I had a look at the different ways to implement vim plugins.

It works!

## Python : import vim

It turned out that the plugins can be directly written using Python ! in a vim script (mine will be called librarySorter.vim), you can simply write python code, import vim to read the buffer.

The buffer seems to be an array like structure on which you can get/set the i-th element of the buffer (i.e. the i-th line of your file).

Unfortunately, the documentation to use the package vim is quite limited. As this Stack Overflow question stresses it.

A nice ressource though, is this presentation.

## Coding the plugin

### Making sure it is compiled with python3

if !has('python3')
echo "Error: Required vim compiled with +python3"
finish
endif


### Declaring a function

function! LibrarySorter()

endfunction


### Calling Python code

The part written in Python has to be between these two EOF declarations.

python3 << EOF

# python code

EOF


In the case of my plugin, this looks like this:

python3 << EOF
import vim

#Reads the whole file, as a list
cb = list(vim.current.buffer)

#Filter the open directives
openLines = list(filter(lambda x: x.startswith("open "), cb))

#If all the open directives are not at the top of the file, stop immediatly
if not vim.current.buffer[len(openLines) - 1].startswith("open "):
raise Exception("Error, some open directives are not at the head of your file")

openLines = sorted(openLines)

#And rearrange the lines in the buffer
for i in range(len(openLines)):
vim.current.buffer[i] = openLines[i]

EOF


And now, in vim, you can simply type: :call LibrarySorter() anywhere in your OCaml source and look at the head of your file :)

## The code

if !has('python3')
echo "Error: Required vim compiled with +python3"
finish
endif

function! LibrarySorter()

python3 << EOF
import vim

cb = list(vim.current.buffer)

openLines = list(filter(lambda x: x.startswith("open "), cb))

if not vim.current.buffer[len(openLines) - 1].startswith("open "):
raise Exception("Error, some open directives are not at the head of your file")

openLines = sorted(openLines)

for i in range(len(openLines)):
vim.current.buffer[i] = openLines[i]

EOF

endfunction


## What’s next ?

I actually have the same issue when importing files in Python, TypeScript… Might be nice to generalize this to various languages :)

## Learning more

### Ocaml

Real World OCaml: Functional programming for the masses by Yaron Minsky, Anil Madhavapeddy and Jason Hickey is a good introduction to OCaml.

Purely Functional Data Structures by Chris Okasaki, which presents another way to look at data structures, from a functional point of view. As a prerequisite, a knowledge of common structures such as Heaps, Linked Lists is needed.

### Vim

Practical Vim: Edit Text at the Speed of Thought

Modern Vim: Craft Your Development Environment with Vim 8 and Neovim

Learning the vi and Vim Editors

# Using ocamlopt with ocamlbuild

As your OCaml programs grow bigger, it is a good idea to use a build system. Ocamlbuild is quite good at this job ! It will take care of your dependencies, find libraries and more importantly compile the files in the right order…

It took me a while to find the answer to the simple question, how to use ocamlopt instead of the default compiler. Assuming you want to compile a main.ml file, the short answer is to use ocamlbuild main.native, instead of ocamlbuild main.byte

# ocamlopt and ocamlc

OCaml comes with two compilers : ocamlc and ocamlopt. ocamlc is the bytecode compiler, and ocamlopt is the native code compiler. If you don’t know which one to use, use ocamlopt since it provides standalone executables that are normally faster than bytecode.

# Example

For a very quick benchmark let’s sort a long list.

let a = List.init 1000000 (fun x -> Random.int 500) ;;
let b = List.sort compare a ;;

print_string "done;\n";


And compile this with the different compilers…

ocamlbuild main.native
echo "native"
time ./main.native
rm main.native
rm -rf _build/

ocamlbuild main.byte
echo "byte"
time ./main.byte
rm main.byte
rm -rf _build/


And the results are quite convincing.

Finished, 4 targets (0 cached) in 00:00:00.
native
done;

real  0m2.156s
user  0m2.064s
sys 0m0.088s
Finished, 3 targets (0 cached) in 00:00:00.
byte
done;

real  0m6.340s
user  0m6.268s
sys 0m0.068s


# Learning more

Real World OCaml: Functional programming for the masses by Yaron Minsky, Anil Madhavapeddy and Jason Hickey is a good introduction to OCaml.

# Ocaml and JSON

Manipulating JSON (JavaScript Object Notation) elements is amazing and allows to move data around between different programs easily. The only thing you need is to be able to write and read JSON in all your programs. Though this can be simple in some languages, other will require some work to.

OCaml being strongly typed, serializing / deserializing object is less obvious than in, say, Python.

## The tools

Yojson: low-level JSON library for OCaml

ppx_deriving_yojson : writing json

To get them, simply use opam:

opam install yojson
opam install ppx_deriving_yojson


### Optionnal, for vim users

If you have not done it already, these tools will help you navigate through the different functions proposed by the above packages. They allow to have “intellisense” for OCaml.

Merlin

YouCompleteMe

Note that you have to create a .merlin file in the directory where you are working, to help vim find the relevant packages.

PKG find yojson


## Type to JSON

### What is PPX ?

Writing code for each type to turn it into JSON is particularly boring and can lead to errors. In many languages this is straightworard. Like using json.dumps(obj) in Python. The PPX language is a feature that provides a new API for syntactic extensions in OCaml. deriving yojson generates three functions per type:

# #require "ppx_deriving_yojson";;
# type ty = .. [@@deriving yojson];;
val ty_of_yojson : Yojson.Safe.json -> (ty, string) Result.result
val ty_of_yojson_exn : Yojson.Safe.json -> ty
val ty_to_yojson : ty -> Yojson.Safe.json


So what happens is that, during the compilation process, the functions ty_of_yojson and the others will be, automatically and silently, generated and the compiler will not complain that they are missing - though you never had to write them.

### Nested types

Let’s give it a try ! And come up with a nested type.

type t = {x: int; y: int} [@@deriving to_yojson]
type u = {s: string; pos: t} [@@deriving to_yojson]
let () = print_endline (Yojson.Safe.pretty_to_string (u_to_yojson {s= "hello"; pos={x= 1; y= 2}}))

ocamlfind opt -package ppx_deriving_yojson -linkpkg -o testJson.byte testJson.ml
./testJson.byte
rm testJson.byte


And…

{ "s": "hello", "pos": { "x": 1, "y": 2 } }


It worked :)

# Learning more

Real World OCaml: Functional programming for the masses by Yaron Minsky, Anil Madhavapeddy and Jason Hickey is a good introduction to OCaml.

# Installing R Studio on Manjaro

Manjaro is amazing. But, it sometimes lacks resources and help to install “common” (it is a matter of point of view) packages. sudo pacman -S rstudio will not be enough, unfortunately.

But this will do it :

sudo pacman -S yaourt
sudo pacman -S r
yaourt -S rstudio-desktop-bin


You will have to press “no” a couple of times:

==> Edit PKGBUILD ? [Y/n] ("A" to abort)
==> ------------------------------------
==> n

==> Edit rstudio-desktop-bin.install ? [Y/n] ("A" to abort)
==> -------------------------------------------------------
==> n



And finally, yes, when asked whether you want to continue:

==> Continue building rstudio-desktop-bin ? [Y/n]
==> ---------------------------------------------
==> y


# Not so fast !

## What is exactly yaourt ?

Yaourt stands for “Yet Another User Repository Tool”. It is a command line tool for installing packages on Arch Linux. It is a wrapper for Pacman, the shipped package management utility for Arch Linux with extended features and remarkable AUR (Arch Linux User Repository) support.

It differs from pacman because pacman will only access what is in the official repos. But the packages created by members of the community which have not (yet) made their way into the official repos can be found on the AUR.

Hope this helped!

# Random forest vs extra trees

## A little bit of context

Quite some time ago, I asked a question on stats.stackexchange about differences between random forests and extremly random forests. Though the answers were good, I was still lacking some informations. Beyond the nice theoretical arguments, I run some simulations to have a better idea of their behavior.

Though these are, by no means, definite conclusions about their respective behaviors, those simulations performed on toy datasets, from specific implementations, I hope they will provide more insights to the reader!

## Summary of the simulations

Extra trees seem much faster (about three times) than the random forest method (at, least, in scikit-learn implementation). This is consistent with the theoretical construction of the two learners.

On toy datasets, the following conclusions could be reached :

• Extra trees seem to keep a higher performance in presence of noisy features,
• When all the variables are relevant, both methods seem to achieve the same performance.

## Theoretical point of view

A decision tree is usually trained by recursively splitting the data. Each split is chosen according to an information criterion which is maximized (or minimized) by one of the “splitters”. These splitter usually take the form of x[i]>t were t is a threshold and x[i] indicates the i-th component of an observation.

Decision trees, being prone to overfit, have been transformed to random forests by training many trees over various subsamples of the data (in terms of both observations and predictors used to train them).

The main difference between random forests and extra trees (usually called extreme random forests) lies in the fact that, instead of computing the locally optimal feature/split combination (for the random forest), for each feature under consideration, a random value is selected for the split (for the extra trees).

This leads to more diversified trees and less splitters to evaluate when training an extremly random forest.

The point here is not to be exhaustive, there are more references and the main articles at the bottom of the article.

## Timing

First, let’s observe the timing for each model. I will use the methods implemented in scikit-learn and I will reuse a code seen in another post (the code with new minor updates is at the bottom of the post). Without a detailed analysis, it seems that Extra Trees are three times faster than the random forest method.

N P Time RF Time ET
500 5 0.093678 0.0679979
500 10 0.117224 0.0763619
500 20 0.176268 0.097132
500 50 0.358183 0.157907
500 100 0.619386 0.256645
500 200 1.14059 0.45396
1000 5 0.139871 0.0846941
1000 10 0.211061 0.106443
1000 20 0.385125 0.151639
1000 50 0.805403 0.277682
1000 100 1.39056 0.501522
1000 200 2.76709 0.926728
2000 5 0.269487 0.122763
2000 10 0.474972 0.171372
2000 20 0.771758 0.264499
2000 50 1.81821 0.539937
2000 100 3.47868 1.03636
2000 200 6.95018 2.1839
5000 5 0.86582 0.246231
5000 10 1.2243 0.373797
5000 20 2.20815 0.624288
5000 50 5.13883 1.41648
5000 100 9.79915 3.25462
5000 200 21.5956 6.86543

Note the use of tabulate which allows a pretty print for python dataframes ;)

import numpy as np
import ComplexityEvaluator
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, AdaBoostRegressor
from tabulate import tabulate

def random_data_regression(n, p):
return np.random.rand(n, p), np.random.rand(n)

regression_models = [RandomForestRegressor(),
ExtraTreesRegressor()]

names = ["RandomForestRegressor",
"ExtraTreesRegressor"]

complexity_evaluator = ComplexityEvaluator.ComplexityEvaluator(
[500, 1000, 2000, 5000],
[5, 10, 20, 50, 100, 200])

i = 0
for model in regression_models:
res, data = complexity_evaluator.Run(model, random_data_regression)
print(names[i])
i = i + 1


## Irrelevant variables

Now, let’s try to see how they compare when we add irrelevant variables. First, we need to propose a data set where there is something to learn (as opposed as what was previously done).

Let’s fix a linear dependency between the target variable and the first three variables, along the lines of:

def linear_data_generation(n, p):
X = np.random.rand(n, p)
beta = np.zeros([p, 1])
beta[0,0] = 1
beta[1,0] = 2
beta[2,0] = -1
Y = np.ravel(np.dot(X, beta))
return X, Y


Fig. 1: Comparison of random forests and extra trees in presence of irrelevant predictors

In blue are presented the results from the random forest and red for the extra trees.

The results are quite striking: Extra Trees perform consistently better when there are a few relevant predictors and many noisy ones.

import numpy as np
import AccuracyEvaluator
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.metrics import mean_squared_error
from tabulate import tabulate
import matplotlib.pyplot as plt

def linear_data_generation(n, p):
X = np.random.rand(n, p)
beta = np.zeros([p, 1])
beta[0,0] = 1
beta[1,0] = 2
beta[2,0] = -1
Y = np.ravel(np.dot(X, beta))
return X, Y

n_columns = [5, 10, 20, 30, 50, 80, 100]

accuracy_evaluator = AccuracyEvaluator.AccuracyEvaluator(
[500],
n_columns,
mean_squared_error,
10)

data_rf = accuracy_evaluator.Run(RandomForestRegressor(), linear_data_generation)
data_et = accuracy_evaluator.Run(ExtraTreesRegressor(), linear_data_generation)

plt.errorbar(n_columns, data_rf["Metric"], yerr= data_rf["Std"], fmt="o", ecolor = "blue", color="blue")
plt.errorbar(n_columns, data_et["Metric"], yerr= data_et["Std"], fmt="o", ecolor = "red", color="red")
plt.xlabel('Number of columns')
plt.ylabel('Mean Squared Error')
plt.show()


## Many relevant variables

Starting from the code above and changing the decision function to be a sum of each predictor:

def linear_data_generation(n, p):
X = np.random.rand(n, p)
beta = np.ones([p, 1])
Y = np.ravel(np.dot(X, beta))
return X, Y


Fig. 2: Comparison of random forests and extra trees in presence of many relevant predictors

In blue are presented the results from the random forest and red for the extra trees.

It seems that both methods perform equally in presence of many relevant features.

## Code

import numpy as np
import pandas as pd
import time
from sklearn.linear_model import LinearRegression
import math

class ComplexityEvaluator:

def __init__(self, nrow_samples, ncol_samples):
self._nrow_samples = nrow_samples
self._ncol_samples = ncol_samples

def _time_samples(self, model, random_data_generator):
rows_list = []
for nrow in self._nrow_samples:
for ncol in self._ncol_samples:
train, labels = random_data_generator(nrow, ncol)

start_time = time.time()
model.fit(train, labels)
elapsed_time = time.time() - start_time

result = {"N": nrow, "P": ncol, "Time": elapsed_time}
rows_list.append(result)

return rows_list

def Run(self, model, random_data_generator):
orig_data = pd.DataFrame(self._time_samples(model, random_data_generator))
data = orig_data.applymap(math.log)
linear_model = LinearRegression(fit_intercept=True)
linear_model.fit(data[["N", "P"]], data[["Time"]])
return linear_model.coef_, orig_data

if __name__ == "__main__":
class TestModel:

def __init__(self):
pass

def fit(self, x, y):
time.sleep(x.shape[0] / 1000.)

def random_data_generator(n, p):
return np.random.rand(n, p), np.random.rand(n, 1)

model = TestModel()

complexity_evaluator = ComplexityEvaluator(
[200, 500, 1000, 2000, 3000], [1, 5, 10])

res = complexity_evaluator.Run(model, random_data_generator)

print(res)



And then :

import numpy as np
import pandas as pd
import time
from sklearn.linear_model import LinearRegression
import math
from Stacker import Stacker

class AccuracyEvaluator:

def __init__(self, nrow_samples, ncol_samples, penalty, n_folds=5):
self._nrow_samples = nrow_samples
self._ncol_samples = ncol_samples
self._stacker = Stacker(penalty, n_folds, False)

def _accuracy_samples(self, model, random_data_generator):

def predict(model, X):
return model.predict(X)

rows_list = []
for nrow in self._nrow_samples:
for ncol in self._ncol_samples:
train, labels = random_data_generator(nrow, ncol)

mean_perf, std_perf,  _ = self._stacker.RunSparse(train,
labels, model, predict)

result = {"N": nrow, "P": ncol,
"Metric": mean_perf, "Std": std_perf}
rows_list.append(result)

return rows_list

def Run(self, model, random_data_generator):
orig_data = pd.DataFrame(
self._accuracy_samples(model, random_data_generator))
return orig_data


I will come back to stacking in another post some day, which I also use for cross-validation (stacking can be obtained almost for free during a cross validation). Plus the naming of some functions is quite unfortunate (RunSparse() should probably be called RunNumpyArray() or something like this, likewise, the Run() should probably be RunPandas()…). The code is here for the sake of completeness.

import pandas as pd
import numpy as np

from time import time
from sklearn import cross_validation

class Stacker:

def __init__(self, penalty, n_folds, verbose=True, random_state=1):
self._penalty = penalty
self._n_folds = n_folds
self._verbose = verbose
self._random_state = random_state

def Run(self, X, y, model, predict_method):
kf = cross_validation.KFold(
y.shape[0], n_folds=self._n_folds, shuffle=True, random_state=self._random_state)
trscores, cvscores, times = [], [], []
i = 0
stack_train = np.zeros(len(y))  # stacked predictions
for i, (train_fold, validation_fold) in enumerate(kf):
i = i + 1
t = time()
model.fit(X.iloc[train_fold], y.iloc[train_fold])

tr_pred = predict_method(model, X.iloc[train_fold])

trscore = self._penalty(y.iloc[train_fold], tr_pred)

validation_prediction = predict_method(
model, X.iloc[validation_fold])
cvscore = self._penalty(
y.iloc[validation_fold], validation_prediction)

trscores.append(trscore)
cvscores.append(cvscore)
times.append(time() - t)

stack_train[validation_fold] = validation_prediction

if self._verbose:
print("TRAIN %.5f | TEST %.5f | TEST-STD %5f | TIME %.2fm (1-fold)" %
(np.mean(trscores), np.mean(cvscores), np.std(cvscores), np.mean(times) / 60))
print(model.get_params(deep=True))
print("\n")

return np.mean(cvscores), stack_train

def RunSparse(self, X, y, model, predict_method):
kf = cross_validation.KFold(
y.shape[0], n_folds=self._n_folds, shuffle=True, random_state=self._random_state)
trscores, cvscores, times = [], [], []
i = 0
stack_train = np.zeros(len(y))  # stacked predictions
for i, (train_fold, validation_fold) in enumerate(kf):
i = i + 1
t = time()
model.fit(X[train_fold], y[train_fold])

tr_pred = predict_method(model, X[train_fold])

trscore = self._penalty(y[train_fold], tr_pred)

validation_prediction = predict_method(
model, X[validation_fold])
cvscore = self._penalty(
y[validation_fold], validation_prediction)

trscores.append(trscore)
cvscores.append(cvscore)
times.append(time() - t)

stack_train[validation_fold] = validation_prediction

if self._verbose:
print("TRAIN %.5f | TEST %.5f | TEST-STD %5f | TIME %.2fm (1-fold)" %
(np.mean(trscores), np.mean(cvscores), np.std(cvscores), np.mean(times) / 60))
print(model.get_params(deep=True))
print("\n")

return np.mean(cvscores), np.std(cvscores),stack_train


## Learning more and references

The elements of statistical learning by Trevor Hastie, Robert Tibshirani, Jerome Friedman is a brilliant introduction to the topic and will help you have a better understanding of most of the algorithms presented in this article !

Classification and Regression Trees by Leo Breiman, Jerome Friedman, Charles J. Stone, R.A. Olshen

Breiman L (2001). “Random Forests”. Machine Learning. 45 (1): 5–32.

Geurts P, Ernst D, Wehenkel L (2006). “Extremely randomized trees”. Machine Learning. 63: 3–42

Python data science handbook

# Product-sum numbers

## Statement

The problem can be found here.

A natural number, N, that can be written as the sum and product of a given set of at least two natural numbers, {a1, a2, … , ak} is called a product-sum number: N = a1 + a2 + … + ak = a1 x a2 x … x ak.

For example, 6 = 1 + 2 + 3 = 1 x 2 x 3.

For a given set of size, k, we shall call the smallest N with this property a minimal product-sum number. The minimal product-sum numbers for sets of size, k = 2, 3, 4, 5, and 6 are as follows.

k=2: 4 = 2 x 2 = 2 + 2

k=3: 6 = 1 x 2 x 3 = 1 + 2 + 3

k=4: 8 = 1 x 1 x 2 x 4 = 1 + 1 + 2 + 4

k=5: 8 = 1 x 1 x 2 x 2 x 2 = 1 + 1 + 2 + 2 + 2

k=6: 12 = 1 x 1 x 1 x 1 x 2 x 6 = 1 + 1 + 1 + 1 + 2 + 6

Hence for 2≤k≤6, the sum of all the minimal product-sum numbers is 4+6+8+12 = 30; note that 8 is only counted once in the sum.

In fact, as the complete set of minimal product-sum numbers for 2≤k≤12 is {4, 6, 8, 12, 15, 16}, the sum is 61.

What is the sum of all the minimal product-sum numbers for 2≤k≤12000?

## Ideas

The fact that the sum of the “product-sum numbers” was asked by counting only once each number lead me to think that there was some way to test wether a number is a “product-sum” easily.

Though it was quite easy to prove that prime numbers cannot be product sum numbers, other criterions were not that obvious.

12000 did not seem to be that far for a naive approach : for each number, find the first product-sum number following it.

For this, only a test “is this number a sum product number for k terms” was needed. The simplest way to perform it simply was to enumerate all the possible ways to write an integer as a product of integers (different from 1).

let enumerate_products n =
let rec aux k i acc =
if i == k then [i::acc]
else if i > k then []
else
if k mod i == 0 then
(aux (k/i) i (i::acc))@(aux k (i+1) acc)
else (aux k (i+1) acc)
in
aux n 2 []


Testing is now easy:

let is_sum_product n k =
if is_prime n then
false
else
let product_representations = enumerate_products n in

let pseudo_sum product_representation =
(sum_list product_representation) + k - (List.length product_representation) in

List.exists (fun x -> (pseudo_sum x) == n) product_representations


Using the fact that a number n, decomposed in k factors needs to be bigger than k and smaller than 2k in order to be a product sum, finding the next product-sum number is easy:

let find_smallest_sum_product k =
let rec aux i =
if i > 2*k then
failwith "error somewhere"
else if is_sum_product i k then i
else aux (succ i) in
aux (succ k) (* would not work for n=k=1... *)


## A simple solution

The whole code to find the solution is the following. With ocamlopt, it took me around 3 minutes on a laptop.

let is_prime n =
let rec is_not_divisor d =
d * d > n || (n mod d <> 0 && is_not_divisor (d+1)) in
n <> 1 && is_not_divisor 2

let sum_list input_list =
List.fold_right (fun x y -> x + y) input_list 0

let enumerate_products n =
let rec aux k i acc =
if i == k then [i::acc]
else if i > k then []
else
if k mod i == 0 then
(aux (k/i) i (i::acc))@(aux k (i+1) acc)
else (aux k (i+1) acc)
in
aux n 2 []

let is_sum_product n k =
if is_prime n then
false
else
let product_representations = enumerate_products n in

let pseudo_sum product_representation =
(sum_list product_representation) + k - (List.length product_representation) in

List.exists (fun x -> (pseudo_sum x) == n) product_representations

let find_smallest_sum_product k =
let rec aux i =
if i > 2*k then
failwith "error somewhere"
else if is_sum_product i k then i
else aux (succ i) in
aux (succ k) (* would not work for n=k=1... *)

let sum_sum_product a b =
let integer_interval = List.init (b-a+1) (fun i -> i+a) in
let sum_products = List.map find_smallest_sum_product integer_interval in
let unique_sum_products = List.sort_uniq compare sum_products in
sum_list unique_sum_products ;;

(* Unit testing *)
assert (sum_sum_product 2 12 == 61);

(* Result *)
print_int (sum_sum_product 2 12000);
print_string "\n";


# More problems

If you like this kind of problems, I strongly recommend The Art of Mathematics: Coffee Time in Memphis, by Béla Bollobás.

To find out more about OCaml this book provides a detailed presentation of the language.