# Context

## What is motus

Motus is a quite famous (at least in France) game where one has to guess a seven letters word based on its first letter and another letter given somewhere in the word. The player has various trials (8) to guess the word. Each word proposed must start with the same first letter as the word to guess, and the player is informed when a letter is at its place, or if a letter belongs to the word to guess, but is not in the right place.

Being quite popular, this game now has various mobile applications!

Fig. 1: Capture of "Motus, le jeu officiel France 2"

As you can see, the letters stressed in red are the one indicate the good position for the letter, and the yellow circle indicates that the letter is in the word, but currently in the wrong position.

## Why winning

Long story, but I would like a winning strategy at this game which ideally, I can perform without the help of a computer.

# A solution

## Beginner approach

For having watched the shows many times on TV, or others playing the app, a strategy used by most players seem to incrementally use the letters which you know well positionned to propose words which comply with the valid letters.

## The idea

Though the naive approach works, it feels like “sub optimal words” (i.e. words which seem far from the final word given the information we have when we propose them) can bring a lot of information, since we can craft them to be as informative as possible!

“Crafting” (the process of finding a seven letter words who share as few letter as possible with the words already suggested) is quite hard. However, learning a list of words which are as informative as possible is easy!

Therefore, the idea used here is that, in 4 guesses, one can almost know the letters present in the word to guess (4 * (7-1) = 24). Therefore, for each letter, one only needs to know four words, such that the number of different letters in these four words is maximal.

Then the second hypothesis is that when we know the letters of the words to guess, it is easy to guess it. It turned out to be harder than I thought, yet, my statistics were vastly improved!

# The implementation

## Details

It took the list of French words from this blog. The first part consists in removing the word which are not sevent characters long and getting rid of their accents.

Then, for a list of words, the coverage_score is defined, it simply is the number of different characters formed by these words.

It is shortly implemented and tested below:

def coverage_score(words):
cnt = Counter()
for word in words:
for character in word:
cnt[character] += 1
return len(list(cnt))

assert(coverage_score(['aa', 'bb']) == 2)
assert(coverage_score(['ac', 'cb']) == 3)
assert(coverage_score(['abc', 'def']) == 6)


Then, for each letter, words are sampled by four and for each sampling, the score is evaluated. This process is repeated 1 000 000 times, and the 4-uple with the highest coverage score is returned. And this is it.

def random_optimization(generator, score, n_attempts):
i = 0
best_score = 0
best_element = generator()
while i < n_attempts:
i += 1
e = generator()
score_e = score(e)
if score_e > best_score:
best_score = score_e
best_element = e
return (best_element, best_score)


## The code

from collections import Counter
import random
import string
import unidecode

def is_seven_characters_long(s):
return len(s) == 7

def remove_line_breaks(s):
return s.replace("\n", "")

fp = open(input_filepath, 'r')
words = map(remove_line_breaks, words)

seven_letter_words = [
word for word in words if is_seven_characters_long(word)]

removed_accents = map(unidecode.unidecode, seven_letter_words)
removed_dashes = filter(lambda x: not "-" in x, removed_accents)

remove_ez = filter(lambda x: not x.endswith("ez"), removed_dashes)

return list(remove_ez)

def coverage_score(words):
cnt = Counter()
for word in words:
for character in word:
cnt[character] += 1
return len(list(cnt))

assert(coverage_score(['aa', 'bb']) == 2)
assert(coverage_score(['ac', 'cb']) == 3)
assert(coverage_score(['abc', 'def']) == 6)

def random_optimization(generator, score, n_attempts):
i = 0
best_score = 0
best_element = generator()
while i < n_attempts:
i += 1
e = generator()
score_e = score(e)
if score_e > best_score:
best_score = score_e
best_element = e
return (best_element, best_score)

for x in string.ascii_lowercase:
print(x)
words_starting_with_x = list(
filter(lambda w: w.startswith(x), candidate_words))

def generator_x():
res = [random.choice(words_starting_with_x) for i in range(4)]
list.sort(res)
return res

res = random_optimization(generator_x, coverage_score, 100000)
print("(coverage : " + str(res[1]) + ")")

print("\n".join(res[0]))
print(10 * '-')


# Results

Now, I simply have to learn the 104 words listed below

File loaded!
a
(coverage : 19)
affuble
agrions
amochai
apteryx
----------
b
(coverage : 18)
bavocha
bowling
brusque
brutaux
----------
c
(coverage : 18)
candide
capeyat
choques
combler
----------
d
(coverage : 18)
deflore
dejuche
demodat
dopings
----------
e
(coverage : 19)
endurci
engavat
enzymes
explora
----------
f
(coverage : 18)
fachait
faxames
flingot
fuyards
----------
g
(coverage : 18)
giclant
glyphes
gommeux
grisbis
----------
h
(coverage : 19)
hagarde
hickory
honteux
humbles
----------
i
(coverage : 18)
ichtyol
impulse
inegaux
ivoires
----------
j
(coverage : 19)
jockeys
jouxter
jubilat
jumping
----------
k
(coverage : 20)
kamichi
kobolds
krypton
kufique
----------
l
(coverage : 18)
lambris
legende
lexique
lycopes
----------
m
(coverage : 18)
malfont
margeas
midship
muqueux
----------
n
(coverage : 18)
nasarde
neigeux
notable
nymphal
----------
o
(coverage : 19)
oblatif
ocreras
oppidum
oxygene
----------
p
(coverage : 18)
palombe
penchas
poquait
profond
----------
q
(coverage : 15)
quartat
quiches
quidams
quillon
----------
r
(coverage : 18)
rebonds
rejugee
rempile
ruchait
----------
s
(coverage : 19)
sandjak
sculpte
sombras
swingua
----------
t
(coverage : 18)
thymols
topique
trafics
trepang
----------
u
(coverage : 16)
ultimes
urbaine
urgence
uropode
----------
v
(coverage : 19)
vachard
vampent
velique
voyages
----------
w
(coverage : 18)
wergeld
whiskys
wombats
wurmien
----------
x
(coverage : 11)
xanthie
ximenie
xylenes
xylenes
----------
y
(coverage : 16)
yankees
yiddish
yogourt
ypreaux
----------
z
(coverage : 17)
zeolite
ziberas
zincage
zythums
----------


# Improvements

## Word selection

The thing is that the accepted words will mostly depend on the implementation of the game, I could not find a official dictionary for this. So the conjugated verbs would be rejected, per example (hence the line : remove_ez = filter(lambda x: not x.endswith("ez"), removed_dashes), which does not account for the full complexity of French conjugations, to say the least)

## Algorithm

The “coverage” seems quite low (around 19). I believe this can be improved using smarter ideas than a random, brute force optimization.

Perhaps some better optimizations could be performed, such as, start with a word, randomly pick among the most different words and so on so forth.

## The score itself

The definition of the coverage score does not take into account the scarcity of some letters. z is quite scarce, and it is often a waste of time to check whether it is in the word to guess.

# Introduction

Python has an amazing progress bar library tqdm. In a for loop or a map statement, tqdm allows the user to know how much time is needed for the loop to end. Though simple, this happens to be very useful when performing treatment on a dataset with hundreds of thousands of lines.

76%|████████████████████████████         | 7568/10000 [00:33<00:10, 229.00it/s]

What tdqm does

[##############-------------------------] 38.00% ending in : 3.21s

What I would be happy with

Regarding machine learning pipelines, Python does a great job. However, it lacks the power of compiler. First, pre-treatment of the data is sometimes done using functions written in pure Python, not particularly fast.

But more important, imagine the following pipeline in OCaml:

let new_text = my_text
|> Str.split
|> List.map fix_spelling
|> remove_stopwords
|> List.map vectorize
|> FloatArray.average


Maybe these functions do not really exist, but a sentence (my_text) is splitted in a list of words, the spelling of each word is fixed (List.map fix_spelling), the stopwords are removed, the words are vectorized (with something like word2vec) and at last, the list of vectorized words is averaged. It is pretty clear that we input a sentence (a string) and return a float array.

Now imagine that for some reason, the pipeline is misspelled in something like this:

let new_text = my_text
|> Str.split
|> List.map fix_spelling
|> List.map vectorize
|> remove_stopwords
|> FloatArray.average


The compiler would immediatly complain that remove_stopwords does not apply to float array list (but was expecting a string list). On the other hand, depending on the implementation, it may take a while for the same program in Pyhton to realize that some steps cannot be performed.

Therefore, I believe that OCaml could be an amazing fit for machine learning: the computing intensive libraries (random forests, SVMs…) can be implemented in another language (which is the case with scikit-learn) and the pipeline would be much more robust in OCaml.

As I love to use tqdm in Python, I wanted an OCaml equivalent.

# OCaml functors

A functor is simply a module, which accepts another module as a parameter. Let’s look at the following module signature, a finite collection. It has a map function, which transforms every element and a length function. Note that length could be inferred from map and a ref. However, modules such as Seq allow infinite collections: they have a map function, but cannot evaluate the length of a sequence.

module type FINITECOLLECTION = sig

type 'a t
val map : ('a -> 'b) -> 'a t -> 'b t
val length : 'a t -> int

end


Given this signature we can now define a module which will operate on finite collections. Without paying much attention to the contents of map, simply note that the only functions called in map are the ones specified in the signature: Collection.length and Collection.map.

(** Creates a new map function, reporting the progress in the terminal *)
module Prokope(Collection : FINITECOLLECTION) = struct

(** Applies f to a finite collection l, while reporting progress in the terminal *)
let map f l =

let open ProkopeUtils in

let starting_time = Unix.gettimeofday() in
let wrapped_f = progress_wrapper default_progress_bar (Collection.length l) starting_time f in

let res = add_indices_to_collection Collection.map l
|> Collection.map wrapped_f in

print_string "\n";
res

end



Now, we would like to be able to use this new map method on array and list types. For this, we just have to reproduce the signature of the FINITECOLLECTION. The following does it.

module UsualArray  = struct
type 'a t = 'a array
include Array
end

module Array = Prokope(UsualArray)

module UsualList  = struct
type 'a t = 'a list
include List
end

module List = Prokope(UsualList)


Now this function can simply be invoked appending Prokope before

let dummy a =
ignore (Unix.select [] [] [] 0.1);
a

let double x = 2 * x

let id x = x

let () =
let sample = List.init 50 id in
assert(List.map double sample = Prokope.List.map double sample);
let _ = Prokope.List.map dummy sample in
()


And you would see:

[#######################################] 100.00% ending in : 0.00s
[##############-------------------------] 38.00% ending in : 3.21s


# The implementation

## A couple of tricks

A couple of things to know is that OCaml may not flush to the standard output as you may expect. But this can be forced by calling

flush stdout;


Now to write a line on the actual line of the terminal, the only thing to do is to call

print_string ("\r"^(progress_bar_string config percentage)^"ending in : "^remaining_time_str^"s");


You may find more informations on \n and \r on StackOverflow though.

## The implementation

I decided to call this library Prokope. Maybe on opam someday ;)

### Prokope utils

Let’s start with the helpers needed to display and configure a progress bar.

prokopeUtils.ml

(** Various helpers for Prokope *)

(** Graphical parameters for the progress bar *)
type progress_bar_config = {
width: int;
cell_filled: string;
cell_empty: string
}

(** The width of the terminal *)
let default_width =
match Terminal_size.get_columns () with
| Some(a) -> a
| None -> 80

(** A default progress bar *)
let default_progress_bar = {
width = default_width;
cell_filled = "#";
cell_empty = "-"
}

(** For a collection 'a t returns a collection of type (int, 'a) t *)

let index = ref (-1) in

index := !index + 1;
(!index, e)
in

(** Repeats the string s n times *)
let repeat_string s n =
List.init n (fun _ -> ())
|> List.fold_left (fun e _ -> s^e) ""

(** Returns the string represented by the progress bar configuration and the percentage *)
let progress_bar_string config percentage =

let gauge_string n_steps inner_width =
let remaining_steps = inner_width - n_steps in
"["^
(repeat_string config.cell_filled n_steps)^
(repeat_string config.cell_empty remaining_steps)^
"]"
in

let inner_width = max (config.width - 40) 0 in

let n_steps = int_of_float (percentage *. (float_of_int inner_width)) in

let rounded_percentage_str = Printf.sprintf "%.2f" (100. *. percentage) in

(gauge_string n_steps inner_width)^" "^rounded_percentage_str^"% "

(** For a fonction ('a -> 'b) returns a function ( ('a, int) -> 'b )
* "aware" of its starting time and operations to perform *)
let progress_wrapper config list_length starting_time f a =

let i, x = a in

let percentage = (float_of_int i +. 1.) /. (float_of_int list_length) in

let time_taken = (Unix.gettimeofday() -. starting_time) in

let speed = (float_of_int i) /. time_taken in

let remaining_time_str = (float_of_int (list_length - i)) /. speed
|> Printf.sprintf "%.2f"
in

print_string ("\r"^(progress_bar_string config percentage)^"ending in : "^remaining_time_str^"s");

flush stdout;

f x



### The functor

Why a functor ? To enable a user to generate a progress bar for every finite collection ! Imagine that your data is stored on a binary tree for some reason. Having a map and a length function over trees is trivial (and it may also belong to your tree module). Calling Prokope(MyTree) would therefore return a new module, for which map shows a progress bar.

prokope.ml

(** User intended functor *)

(** Any collection whose elements can be counted and mapped.
Note that length has to be implemented (this disqualifies
modules such as Seq). *)
module type FINITECOLLECTION = sig

type 'a t
val map : ('a -> 'b) -> 'a t -> 'b t
val length : 'a t -> int

end

(** Creates a new map function, reporting the progress in the terminal *)
module Prokope(Collection : FINITECOLLECTION) = struct

(** Applies f to a finite collection l, while reporting progress in the terminal *)
let map f l =

let open ProkopeUtils in

let starting_time = Unix.gettimeofday() in
let wrapped_f = progress_wrapper default_progress_bar (Collection.length l) starting_time f in

let res = add_indices_to_collection Collection.map l
|> Collection.map wrapped_f in

print_string "\n";
res

end

module UsualArray  = struct
type 'a t = 'a array
include Array
end

module Array = Prokope(UsualArray)

module UsualList  = struct
type 'a t = 'a list
include List
end

module List = Prokope(UsualList)


### Testing and displaying

run.ml

let dummy a =
ignore (Unix.select [] [] [] 0.1);
a

let double x = 2 * x

let id x = x

let () =
let sample = List.init 50 id in
assert(List.map double sample = Prokope.List.map double sample);
let _ = Prokope.List.map dummy sample in
()


run.sh

ocamlfind opt -package unix,terminal_size -linkpkg -o run.byte prokopeUtils.ml prokope.ml run.ml

./run.byte


Hope you liked it! If you feel you may need it some day, or have improvements suggestions, please let me know, I would improve it and publish it!

# The CEV Process

In mathematical finance, the CEV or constant elasticity of variance model is a stochastic volatility model which was developed by John Cox in 1975. It captures the fact that the log returns may not have a constant volatility, which the usual geometric brownian motion assumes.

The equation of the CEV model is the following.

For $\gamma = 1$ we have the usual equation for the geometric Brownian motion:

# Simulating processes

Using the discretization method, we can simulate these processes.

## Geometric Brownian motion

The geometric Brownian motion can be simulated using the following class.

import numpy as np

class ProcessGBM:

def __init__(self, mu, sigma):
self._mu = mu
self._sigma = sigma

def Simulate(self, T=1, dt=0.001, S0=1.):
n = round(T / dt)

mu = self._mu
sigma = self._sigma

gaussian_increments = np.random.normal(size=n - 1)
res = np.zeros(n)
res[0] = S0
S = S0
sqrt_dt = dt ** 0.5
for i in range(n - 1):
S = S + S * mu * dt + sigma * \
S * gaussian_increments[i] * sqrt_dt
res[i + 1] = S

return res


Following a similar logic, we can generate trajectories for a CEV process.

import numpy as np

class ProcessCEV:

def __init__(self, mu, sigma, gamma):
self._mu = mu
self._sigma = sigma
self._gamma = gamma

def Simulate(self, T=1, dt=0.001, S0=1.):
n = round(T / dt)

mu = self._mu
sigma = self._sigma
gamma = self._gamma

gaussian_increments = np.random.normal(size=n - 1)
res = np.zeros(n)
res[0] = S0
S = S0
sqrt_dt = dt ** 0.5
for i in range(n - 1):
S = S + S * mu * dt + sigma * \
(S ** gamma) * gaussian_increments[i] * sqrt_dt
res[i + 1] = S

return res


Which can then be plotted:

import matplotlib.pyplot as plt

from ProcessGBM import ProcessGBM
from ProcessCEV import ProcessCEV

T = 20
dt = 0.01
plt.plot(ProcessGBM(0.05, 0.15).Simulate(T, dt), label="GBM")
plt.plot(ProcessCEV(0.05, 0.15, 0.5).Simulate(
T, dt), label="CEV (gamma = 0.5)")
plt.plot(ProcessCEV(0.05, 0.15, 1.5).Simulate(
T, dt), label="CEV (gamma = 1.5)")

plt.xlabel('Time index')
plt.ylabel('Value')

plt.title("Realization of stochastic processes")

plt.legend()

plt.show()


Realisation of various stochastic processes with different parameters

And we can “see” the properties of the trajectories, a low value for $\gamma$ yields trajectories which are slightly linear, whereas a $\gamma$ higher than $1$ allows the volatility to increase for large values of the process (the trajectory in red).

# Estimating the CEV parameters

## Theory

A detailed method to estimate the various parameters can be found here: L. CHU, K & Yang, Hailiang & Yuen, Kam. (2001). Estimation in the Constant Elasticity of Variance Model. British Actuarial Journal. 7. 10.1017/S1357321700002233.

Note that they use slightly different notations, which will be kept in this section (the code for the estimator, however, translates everything back in terms of the original notations).

Calling $\sigma^2_t=\delta^2 S_t^{\theta - 2}$

The paper recalls that $V_t$ is an estimate of $\sigma^2_t$, where $V_t$ is:

For $\alpha$ a constant.

It looks like there is there a typo in the original paper, after a quick implementation (see below), I realized that the estimator does not work this this formula, but does so when I remove one of the $\frac{1}{\Delta t}$. Therefore, in what follows, the following formula will be implemented:

Then, the variance of $V_t$ is minimized for a value of ($c_i$ are negative constants)

And the authors suggest to start from a random value of $\alpha$ and iterate over successive estimations of $\mu$ and $\sigma_t$. However, $\sigma_t$ not being constant, this formula did not make a lot of sense to me. I chose a constant value of $\alpha$ in the code, which provides good results.

## Proof sketch : Itô’s lemma

In order to know where does $V_t$ comes from, we can apply Itô’s formula to $S^{1+\alpha}_t$:

Now:

Therefore, dividing by $(1+\alpha)S^{1+\alpha}_t$:

Finally:

And it works for any $\alpha$. To be honest I am quite uncomfortable with the meaning to give to every equation, but, at some level, it makes sens that $V_t$ “has to be” of the form above.

## Python implementation

import numpy as np

class EstimatorCEV:

def __init__(self, dt):
self._dt = dt
self._alpha0 = -5

def Estimate(self, trajectory):
sigma, gamma = self._evaluate_sigma_gamma(trajectory, self._alpha0)
if sigma == None:
return None, None, None
else:
mu = self._estimate_mu(trajectory)
return (mu, sigma, gamma)

def _log_increments(self, trajectory):
return np.diff(trajectory) / trajectory[:-1]

def _estimate_mu(self, trajectory):
return np.mean(self._log_increments(trajectory)) / self._dt

def _log_increments_alpha(self, trajectory, alpha):
mod_increments = self._log_increments(trajectory ** (1 + alpha))
return mod_increments / (1 + alpha)

def _evaluate_Vt(self, trajectory, alpha):
lhs = self._log_increments_alpha(trajectory, alpha)
rhs = self._log_increments(trajectory)
center = 2 * (lhs - rhs) / (alpha * self._dt)
return center

def _evaluate_sigma_gamma(self, trajectory, alpha):
if np.any(trajectory <= 0):
return None, None

Vts = self._evaluate_Vt(trajectory, alpha)
if np.any(Vts <= 0):
return None, None
logVts = np.log(Vts)

Sts = trajectory[:-1]  # removes the last term as in eq. (5)
if np.any(Sts <= 0):
return None, None
logSts = np.log(Sts)

ones = np.ones(Sts.shape[0])
A = np.column_stack((ones, logSts))

res = np.linalg.lstsq(A, logVts, rcond=None)[0]
return (2 * np.exp(res[0] / 2), 0.5 * (res[1] + 2))

if __name__ == "__main__":

from ProcessCEV import ProcessCEV

def test(true_mu, true_sigma, true_gamma):
dt = 0.001
T = 10

sample_mu = []
sample_sigma = []
sample_gamma = []

for i in range(100):
mu_est, sigma_est, gamma_est = EstimatorCEV(dt=dt).Estimate(ProcessCEV(
true_mu, true_sigma, true_gamma).Simulate(T, dt=dt))

if mu_est != None:
sample_mu = [mu_est] + sample_mu
sample_sigma = [sigma_est] + sample_sigma
sample_gamma = [gamma_est ] + sample_gamma

print("mu : " + str(true_mu) + " \t| est : " + str(np.mean(sample_mu)) + " \t| std : " + str(np.std(sample_mu)))
print("sigma : " + str(true_sigma) + " \t| est : " + str(np.mean(sample_sigma)) + " \t| std : " + str(np.std(sample_sigma)))
print("gamma : " + str(true_gamma) + " \t| est : " + str(np.mean(sample_gamma)) + " \t| std : " + str(np.std(sample_gamma)))
print(10*"-")

test(0.,0.5,0.8)
test(0.2,0.5,0.8)
test(0.2,0.5,1.2)
test(0.,0.3,0.2)
test(0.,0.5,2)


Running the tests…

mu : 0.0  | est : 0.08393998598504683   | std : 0.10950360432103644
sigma : 0.5   | est : 0.5290411001633833  | std : 0.010372415832707949
gamma : 0.8   | est : 0.7995912270171245  | std : 0.023552858445145052
----------
mu : 0.2  | est : 0.21126788898622506   | std : 0.1379314635624178
sigma : 0.5   | est : 0.5295722080080916  | std : 0.01067649238716792
gamma : 0.8   | est : 0.7981059837608774  | std : 0.019084913069205692
----------
mu : 0.2  | est : 0.20468181972280103   | std : 0.16048227152782532
sigma : 0.5   | est : 0.5299253491070517  | std : 0.008389773769199063
gamma : 1.2   | est : 1.2042365895022817  | std : 0.021453412919601324
----------
mu : 0.0  | est : 0.0743428420875073  | std : 0.07189306488901058
sigma : 0.3   | est : 0.31771919915814173   | std : 0.006063917451139664
gamma : 0.2   | est : 0.2080203482037684  | std : 0.039438575768452104
----------
mu : 0.0  | est : -0.009815876356783935   | std : 0.11178850054168586
sigma : 0.5   | est : 0.5311825321732557  | std : 0.013167973885022026
gamma : 2   | est : 2.0050079526000903  | std : 0.03448408469682434
----------


It worked !

# Learning more

I used to study stochastic processes a while ago and reading: Pierre Henry-Labordère – Analysis, Geometry & Modeling in Finance (which I recommend if you are interested in stochastic processes and differential geometry and mathematical finance in the same time) and [A Non-Gaussian Option Pricing Model with Skew

L. Borland, J.P. Bouchaud](https://arxiv.org/abs/cond-mat/0403022) led me to spend more time understanding them. Hence this small tuto, hope you liked it.

Financial Modelling with Jump Processes is a must read if you want to have a detailed view on the non gaussian models used for financial asset. It covers both the theoretical view of these models, the simulation and estimation procedures.

# Introduction

OCaml seemed to lack a decent library for scientific computing. I came across Owl and the official repo and wanted to give it a try.

It has support for matrix operations, dataframes, simple plots based on PLplot (I used a port of PLplot which was quite undocumented in OCaml)…

# Install Owl

On my Ubuntu (16) it all starts with…

opam install owl


Unfortunatley, soon enough:

[ERROR] The compilation of conf-openblas failed at "sh -exc cc $CFLAGS test.c -lopenblas".  Opam suggests me the following: =-=- conf-openblas.0.2.0 troobleshooting =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= => This package relies on external (system) dependencies that may be missing. opam depext conf-openblas.0.2.0' may help you find the correct installation for your system.  After executing the command opam depext conf-openblas.0.2.0, the following packages are installed: The following additional packages will be installed: liblapacke libopenblas-base libtmglib-dev libtmglib3  And re-running: opam install owl  It works ! Let’s get started # Examples ## SVMs The package comes with a list of examples. Let’s start with support vector machines! Note that I removed the first line from the actual version of the example below. (* This example demonstrates SVM regression. *) open Owl let generate_data () = let open Mat in let c = 500 in let x1 = (gaussian c 2 *$ 2.) in
let a, b = float_of_int (Random.int 15), float_of_int (Random.int 15) in
let x1 = map_at_col (fun x -> x +. a) x1 0 in
let x1 = map_at_col (fun x -> x +. b) x1 1 in
let x2 = (gaussian c 2 *$2.) in let a, b = float_of_int (Random.int 15), float_of_int (Random.int 15) in let x2 = map_at_col (fun x -> x +. a) x2 0 in let x2 = map_at_col (fun x -> x +. b) x2 1 in let y1 = create c 1 ( 1.) in let y2 = create c 1 ( -1.)in let x = concat_vertical x1 x2 in let y = concat_vertical y1 y2 in x, y let draw_line x0 y0 p = let a, b, c = 0., 20., 100 in let x' = Mat.empty 1 c in let y' = Mat.empty 1 c in for i = 0 to c - 1 do let x = a +. (float_of_int i *. (b -. a) /. float_of_int c) in let y = (Mat.get p 0 0 *. x +. Mat.get p 2 0) /. (Mat.get p 1 0 *. (-1.)) in Mat.set x' 0 i x; Mat.set y' 0 i y done; let h = Plot.create "plot_svm.png" in Plot.(plot ~h ~spec:[ RGB (100,100,100) ] x' y'); Plot.(scatter ~h ~spec:[ RGB (150,150,150) ] x0 y0); Plot.output h let _ = let _ = Random.self_init () in let x, y = generate_data () in let r = Regression.D.svm ~i:true x y in let p = Mat.(r.(0) @= r.(1)) in draw_line (Mat.col x 0) (Mat.col x 1) p; Owl_pretty.print_dsnda Mat.(r.(0) @= r.(1))  Using OCamlBuild, the only thing to do is to write the following small script: ocamlbuild -use-ocamlfind -pkgs 'owl' svm.byte ./svm.byte rm svm.byte  The algorithm should output the following: --- Training config epochs : 1000 batch : full method : gradient descent loss : Hinge learning rate : adagrad 1 regularisation : l2norm (alhpa = 0.001) momentum : none clipping : none stopping : const (a = 1e-16) checkpoint : none verbosity : true --- 2019-04-15 14:12:02.638 INFO : T: 00s | E: 1.0/1000 | B: 1/1000 | L: 1800.944031[-] 2019-04-15 14:12:02.640 INFO : T: 00s | E: 2.0/1000 | B: 2/1000 | L: 1986.033647[▲] 2019-04-15 14:12:02.641 INFO : T: 00s | E: 3.0/1000 | B: 3/1000 | L: 851.791568[▼] 2019-04-15 14:12:02.641 INFO : T: 00s | E: 4.0/1000 | B: 4/1000 | L: 256.961005[▼] 2019-04-15 14:12:02.642 INFO : T: 00s | E: 5.0/1000 | B: 5/1000 | L: 81.197291[▼] 2019-04-15 14:12:02.643 INFO : T: 00s | E: 6.0/1000 | B: 6/1000 | L: 38.378032[▼] 2019-04-15 14:12:02.644 INFO : T: 00s | E: 7.0/1000 | B: 7/1000 | L: 22.740718[▼] 2019-04-15 14:12:02.644 INFO : T: 00s | E: 8.0/1000 | B: 8/1000 | L: 15.785715[▼] 2019-04-15 14:12:02.645 INFO : T: 00s | E: 9.0/1000 | B: 9/1000 | L: 11.920411[▼]  And produce the following plot (or something similar since the data is generated randomly). Illustration of a linear SVM in Owl ## Plots A full tutorial can be found here. Having used PLplot before, it is indeed much lighter! open Owl (* N(0,1) sample *) let y = Mat.gaussian 5000 2 ;; (* Initiates the plot *) let h = Plot.create "me.png" ;; (* Side effects *) Plot.set_background_color h 255 255 255; (* More side effects *) Plot.set_title h "Bivariate model"; (* Simple scatter *) Plot.scatter ~h (Mat.col y 0) (Mat.col y 1); Plot.output h;;  ## Matrix operations and Monte Carlo Playing with matrices, broadcasting operations… is quite easy and the syntax prevents you to substract floats and matrices without knowing it. Which is, from my point of view, a good thing. Let’s implement a simple variance reduction method based on the generation of antitethics random variables (see per example this course). let antithetic_simulation mu sigma n = let sample = Mat.gaussian ~mu:mu ~sigma:sigma (n/2) 1 in let anti_sample = Mat.((2. *. mu)$- sample) in
Mat.(sample @= anti_sample)


The various operations in Mat.( ... ) have different meanings. The $- stresses that we are dealing with a float on the left hand side and a matrix on the right hand side, whereas the @= stands for the vertical concatenation of matrices. All the details can be found on the offical website. # Other ressources ## Merlin Given the large number of functions (more than 7000 !), Merlin is a must have! Merlin's auto completion. Quite dense! # Concluding remarks I have just been playing with this new library for a couple of days, but it is promising! What is next ? Maybe benchmarks, tutorials ? Using GPU with Owl ? Let me know in the comments :) # Topological Data Analysis - A Python tutorial # Introduction Topological data analysis (TDA) allows to reduce many hypothesis when doing statistics. A lot of research in this field has been done over the last years and [1] and [4] provide a brilliant exposition about the mathematical concepts behind TDA. Here, I want to focus on one aspect of TDA: compressed representations of shapes. # A simple topological data analysis pipeline We will focus on the article [2] in the references. It allows to have an idea of the shape of the data. More precisely, we will go through the various steps that enabled the authors of [2] to produce the following figure: Fig. 1: The pipeline ## Overall pipeline The authors go through the following steps to represent the new shape: A) A 3D object (hand) represented as a point cloud. B) A filter value is applied to the point cloud and the object is now colored by the values of the filter function. C) The data set is binned into overlapping groups. D) Each bin is clustered and a network is built. Therefore, we will need a “filter function”, a method to cluster a point cloud (the “clusterer”), N intervals and an overlap parameter. Let’s summarize this in the following class. class TDAPipeline(): def __init__(self, filter_function, clusterer, n, overlap): self._filter_function = filter_function self._clusterer = clusterer self._n = n self._overlap = overlap # (string, int list) dict self._clusters = [] # string list self._clusters_names = [] # (string, string) list self._edges = []  We are now ready to dig in the other concepts. ## The filters Filter functions allows to associate real valued quantities to the data points. It can be the coordinates of a point on any axis, or the coordinates of the points on the first component of a PCA. class AxisFilter(): def __init__(self, i): self._i = i def Fit(self, data): pass def Filter(self, data): return data[:,self._i]  An interesting discovery in this article is the concept of “L-infinity centrality”. This filter allows to associate to a point a real number qualifying how close it is to the boundary of the cloud. L-infinity centrality, is defined for a data point x to be the maximum distance from x to any other data point in the set. Large values of this function correspond to points that are far from the center of the data set. import numpy as np from scipy.spatial.distance import pdist, squareform class LinfinityCentrality(): def __init__(self, distance): self._distance = distance def Fit(self, data): # Barnes Hut approx. or something similar may speed up the filtering part later pass def Filter(self, data): pairwise_dist = squareform(pdist(data, self._distance)) return np.amax(pairwise_dist, axis = 1)  ## The clustering Once the points have been associated to an interval in the image of the filter function, they have to be clustered. The authors use the following method: We first construct the single linkage dendrogram for the data in the bin, and record the threshold values for each transition in the clustering. We select an integer k, and build a k-interval histogram of these transition values. The lustering is performed using the last threshold before the first gap in this histogram. Which can be implemented taking advantage of the implementation of the single linkage proposed in scipy. import numpy as np from sklearn.cluster import AgglomerativeClustering from scipy.cluster.hierarchy import fcluster, linkage class SingleLinkageClusterer(): def __init__(self, k): self._k = k def Cluster(self, data): if data.shape[0] == 0: return [] linkage_matrix = linkage(data) counts, thresholds = np.histogram(linkage_matrix[:, 2], bins=self._k) first_gap_index = self._first_zero(counts, -1) if(first_gap_index > -0.5): threshold = thresholds[first_gap_index] else: threshold = linkage_matrix[-1, 2] clusters = fcluster(linkage_matrix, t=threshold, criterion='distance') return clusters def _first_zero(self, arr, invalid_val=-1): mask = arr == 0 return np.where(mask.any(), mask.argmax(), invalid_val)  # Some classical shapes We now have what we need to follow the pipeline of the author. Let’s just sample some points from various shapes with the following classes: class Sampler(): def __init__(self): raise NotImplemented() def Sample(self, n): raise NotImplemented()  It is quite easy to sample points uniformly over a sphere (up to numerical precision issues) by sampling an distribution over R^3 which is rotation invariant (here, three independant gaussian with same mean and variance) and normalize it. import numpy as np from sklearn.preprocessing import normalize from Sampler import Sampler class SphereSampler(Sampler): def __init__(self, r=1): self._r = r ''' Uniform sampling over the sphere. ''' def Sample(self, n): mat = np.random.randn(n, 3) return self._r * normalize(mat, axis=1, norm='l2')  I could not find a uniform distribution over the torus, but this should be “uniform enough”. import numpy as np from Sampler import Sampler class TorusSampler(Sampler): def __init__(self, r=1, R=3, x0=0, y0=0, z0=0): self._r = r self._R = R self._x0 = x0 self._y0 = y0 self._z0 = z0 ''' This sampling may not be uniform ''' def Sample(self, n): u = 2 * np.pi * np.random.rand(n) v = 2 * np.pi * np.random.rand(n) cosu = np.cos(u) sinu = np.sin(u) cosv = np.cos(v) sinv = np.sin(v) x = self._x0 + (self._R + self._r * cosu) * cosv y = self._y0 + (self._R + self._r * cosu) * sinv z = self._z0 + self._r * sinu return np.array([x, y, z]).T  # Experiments ## Following the method of the authors We are now ready to implement the full pipeline and produce sensible graphs. import numpy as np class TDAPipeline(): def __init__(self, filter_function, clusterer, n, overlap): self._filter_function = filter_function self._clusterer = clusterer self._n = n self._overlap = overlap # (string, int list) dict self._clusters = [] # string list self._clusters_names = [] # (string, string) list self._edges = [] def Run(self, data): self._filter_function.Fit(data) # one dimensional array : f(data), f : R^n -> R filtered_data = self._filter_function.Filter(data) lb = np.min(filtered_data) ub = np.max(filtered_data) overlapping_intervals = self._generate_overlapping_intervals( lb, ub, self._n, self._overlap) # f^-1([a,b]) for [a,b] in the overlapping intervals sliced_clusters = [] i = 0 for interval in overlapping_intervals: in_interval = map(lambda x: self._between_pair( x, interval), filtered_data) indices = np.where(in_interval)[0] filtered_points = data[indices, :] clusters = self._clusterer.Cluster(filtered_points) points_by_cluster = self._to_dict_list( str(i) + "_", indices, clusters) sliced_clusters.append(points_by_cluster) i = i + 1 self._clusters = self._flatten_dict_list(sliced_clusters) self._clusters_names = self._clusters.keys() self._edges = self._build_edges(sliced_clusters) return self._edges def ClusterAverage(self, target): if len(self._clusters) == 0: raise "The algorithm has not been trained!" indices_dict = self._clusters return [np.mean(target[indices_dict[n]]) for n in self._clusters_names] def _build_edges(self, sliced_clusters): edges = [] for sets_prev in sliced_clusters: for sets_new in sliced_clusters: for prev_key in sets_prev.keys(): for new_key in sets_new.keys(): if not set(sets_new[new_key]).isdisjoint(set(sets_prev[prev_key])): edges.append((prev_key, new_key)) return edges def _flatten_dict_list(self, dict_list): res = {} for dicts in dict_list: res.update(dicts) return res def _to_dict_list(self, prefix, vector, clusters): cluster_names = np.unique(clusters) res = {} for cluster_name in cluster_names: indices_in_cluster = map(lambda x: cluster_name == x, clusters) points_in_cluster = vector[np.where(indices_in_cluster)[0]] res.update({prefix + str(cluster_name): points_in_cluster}) return res def _generate_overlapping_intervals(self, lb, ub, n, overlap): step = 1. * (ub - lb) / n res = [(lb + i * step - overlap * step, lb + (i + 1) * step + overlap * step) for i in range(n)] return res def _between_pair(self, x, pair): a, b = pair return x > a and x < b  ## Illustrations Did it work :) ? I found a hand data set “Free realistic hand” here, it required a little bit of parsing to put it in a 3d file. Fig. 2: The pipeline, data : hand, filter : PCA, clustering : single linkage Unfortunately the sampling around the wrist is not high enough and the wrist is associated to the two blue dots, separated from the rest of the figure., but we do recognize the four fingers(in red) well separated from the thumb (in green). But we can generate synthetic data to get rid of this sampling issue. Per example, with a torus: Fig. 3: A torus, filter : first component, clustering : single linkage Or two touching tori: Fig. 4: Two tori, filter : first component, clustering : single linkage import numpy as np from AxisFilter import AxisFilter from PCAFilter import PCAFilter from LinfinityCentrality import LinfinityCentrality from SingleLinkageClusterer import SingleLinkageClusterer from TesseractSampler import TesseractSampler from TorusSampler import TorusSampler from SphereSampler import SphereSampler from TDAPipeline import TDAPipeline from mpl_toolkits.mplot3d import Axes3D import matplotlib.pyplot as plt import networkx as nx from networkx.drawing.nx_agraph import graphviz_layout def plot_tda_pipeline(title, sample, filter_function, clusterer, N, OVERLAP): tdap = TDAPipeline(filter_function, clusterer, N, OVERLAP) edges = tdap.Run(sample) colors = filter_function.Filter(sample) cluster_colors = tdap.ClusterAverage(colors) fig = plt.figure() ax = fig.add_subplot(211, projection='3d') x = sample[:, 0] y = sample[:, 1] z = sample[:, 2] ax.scatter(x, y, z, c=colors, marker='o') ax.set_xlabel('X Label') ax.set_ylabel('Y Label') ax.set_zlabel('Z Label') plt.title(title) fig.add_subplot(212) G = nx.OrderedGraph() # withouth OrderedGraph, the nodes may be shuffled (and the colors would not match) G.add_nodes_from(tdap._clusters_names) G.add_edges_from(edges) pos = graphviz_layout(G) nx.draw(G, pos, with_labels = True) nx.draw_networkx_nodes(G, pos, node_color = cluster_colors) plt.savefig(title) N = 10 OVERLAP = 0.2 SAMPLE_SIZE = 2000 K = 5 named_samples = [("Torus", TorusSampler(1, 3).Sample(SAMPLE_SIZE)), ("Tesseract", TesseractSampler().Sample(SAMPLE_SIZE)), ("Two Spheres", np.vstack((SphereSampler().Sample(SAMPLE_SIZE), SphereSampler(3).Sample(SAMPLE_SIZE)))), ("Two touching tori", np.vstack((TorusSampler(1, 3).Sample(SAMPLE_SIZE), TorusSampler(1, 3, 6, 0, 0).Sample(SAMPLE_SIZE)))), ("Hand", np.loadtxt("./Hand.csv", delimiter=","))] named_filters = [("Axis filter (x)", AxisFilter(0)), ("PCA", PCAFilter()), ("L-Linfinity Centrality", LinfinityCentrality('euclidean'))] for named_filter in named_filters: name_filter, filter_function = named_filter for named_sample in named_samples: name_data, sample = named_sample title = "Data : " + name_data + ", " + "Filter : " + name_filter plot_tda_pipeline(title, sample, filter_function, SingleLinkageClusterer(K), N, OVERLAP)  # References ## Books ## Articles [1]F. Chazal and B. Michel, “An introduction to Topological Data Analysis: fundamental and practical aspects for data scientists,” arXiv:1710.04019 [cs, math, stat], Oct. 2017. [2]P. Y. Lum et al., “Extracting insights from the shape of complex data using topology,” Scientific Reports, vol. 3, no. 1, Dec. 2013. [3]B. Zieliński, M. Lipiński, M. Juda, M. Zeppelzauer, and P. Dłotko, “Persistence Bag-of-Words for Topological Data Analysis,” arXiv:1812.09245 [cs, math, stat], Dec. 2018. [4]A. Zomorodian, “Topological Data Analysis,” p. 39. [5]J. Mathews, S. Nadeem, M. Pouryahya, A. Tannenbaum, and J. O. Deasy, “Topological Data Analysis of PAM50 and 21-Gene Breast Cancer Assays:,” bioRxiv, Nov. 2018. # Best books about data science This is my ideal library about machine learning. ## Theoretical ### Machine learning and statistics The Elements of Statistical Learning: Data Mining, Inference, and Prediction, by Trevor Hastie, Robert Tibshirani, Jerome Friedman A must have. Covers most of the classical algorithms commonly used in machine learning. Foundations of Machine Learning by Mehryar Mohri, Afshin Rostamizadeh, Ameet Talwalkar and Francis Bach Provides a theoretical framework to various machine learning algorithms and a detailed implementation of some of them. Beware, this is highly theoretical and requires familiarity with probability theory ! Cours de Statistique Mathématique (in French!), by Alain Monfort. It provides a clear and thorough introduction to the theory behind statistical hypothesis testing, estimation, and linear regressions. Lectures on Algebraic Statistics by Mathias Drton, Bernd Sturmfels, Seth Sullivant. Following my article about speeding up cross validation where some nice ideas arose from algebra, this books covers more aspects about the links between statistics and algebra, and a chapter dedicated to open problems in this field. ### Probability What would be machine learning without a nice understanding of probabilities and the proofs behind many methods? This books provides a large range of exercises and their solutions. Real Analysis and Probability, by R. M. Dudley. This is one of the clearest exposition of probability theory and of the interplay between the properties of metric spaces and probability measures. Randomized Algorithms by Rajeev Motwani, Prabhakar Raghavan I did not know where to put this one. In my post about time complexity I discussed some issues raised by the complexity of training a classifier or a regressor. This book provides a nice overview of this theory, heavily relying on probability. ## Practice ### R Applied Predictive Modeling by Max Kuhn, Kjell Johnson Full of nice ideas about how to treat various data sets! They even implemented a R package. I tried the package but, to be honest, never felt the need to reuse one of the methods proposed. ggplot2: Elegant Graphics for Data Analysis (Use R!) I am a big fan of ggplot2. I know, there are now better libraries on the market, with loads and loads of new functionality, interactive plots, yadi yadi yada… But mastering ggplot2 is so amazing. It only takes 5 lines to produce amazing graphs, summarizing findings from data. Which is really amazing, especially at the workplace. ### Python ### OCaml Real World OCaml: Functional programming for the masses by Yaron Minsky, Anil Madhavapeddy and Jason Hickey is a good introduction to OCaml. ## Miscellanous The Art of Mathematics: Coffee Time in Memphis Various exercises in mathematics, with hints and (beautiful) solutions. Some are directly related to probability (11. Loaded Dice, 69. Absent-minded Passengers, 77. Independent Random Variables, 139. A Probabilistic Inequality). Music: A Mathematical Offering by Dave Benson. Definitely not probability or statistics oriented. However, you may find here a new way to look at Fourier’s transform, and if you love music and mathematics, this is a must have. Chaos Monkeys Though not only about machine learning, this book is about the startup scene. Absolutely non technical, but fascinating :) Freakonomics by Steven D. Levitt, Stephen J. Dubner When economists meet real life :) non technical either. Disclaimer: these are amazon sponsored links, however, I have read every book (in some cases, only partly) and only the best are in this list. # ECML/PKDD 15: Taxi Trajectory Prediction As an occasionnal “Kaggler” I like to write my own code or experiment new alogrithms. This one led me to a top 20 place. The solution was really simple: each line of the train and test set is described as a collection of binary feature. For each feature, one can associate the destinations who share this feature. If the cloud of points associated to this binary feature has a low “variance” (i.e. most of the points sharing this feature end up in the same place) and a high number of observations, we want to give a higher weight to this feature. Otherwise, this feature can be deemed as poorly informative and we want to decrease its weight. ## Preprocessing • Generate a set of balls covering the map (radiuses and centers being chosen to avoid having too many features in the end) • Remove the trajectories with lightspeed jumps • For the training, cut the trajectories according to a $\min(U[0,1],U[0,1])$ law (it provided a good matching between cross validation and leaderboard score). • Replace the (truncated) trajectories by the set of balls they cross • Keep all the other features ## Learning • For each feature (boolean: have this trajectory crossed Ball k, is it id_207?) generate a cloud of points that are the final points sharing this feature. Actually, the cloud itself is never stored in memory (it would not fit on most computers I guess). Only its barycentre and variance are (they are then updated as mean and variance would be). • Features and their interactions were considered. Indeed, without interactions the performance is really low. Interactions such as “Is the car on the road to the airport AND is the car going north” intuitively seem much more efficient than the two features taken independently. ## Predicting • Given the features, gather all the barycenters and variances. • Return an average of the barycenters, weighted by the inverse of the standard deviation (raised to a certain power - CV showed that 7 was the best) : Where : $p_{k}$ stands for a boolean feature, $C(p_{k})$ stands for the cloud associated to the feature, $\#\left(C(p_{k})\right)$, sd and bar stands for the number of points, the variance and the barycenter of the cloud. $\alpha$ and $\beta$ are penalization parameters. ## Speed of the algorithm Using streaming evaluation of mean and standard deviation of a collection of points, this algorithms had a linear complexity in terms of the number of points. The training and testing part could actually be performed in a couple of minutes. ## How to improve it ? ### Performance Considering interactions of depth 3 ! Unfortunatly, I was limited by my computer’s memory. Combining this with other approaches (genuine random forests…) could have helped as well! ### Speed The part: Generate a set of balls covering the map (radiuses and centers being chosen to avoid having too many features in the end) Could actually have been made much faster using squares (and rounding) insteads of balls! The algorithm with balls had a $O(n \cdot d)$ complexity ($d$ being the number of balls) whereas rounding would have led to a $O(n)$ complexity, no matter the number of squares. ## Bonus: generating heatmaps! The data comes from OpenData.Paris.fr and is inspired from the following kernel. import numpy as np import pandas as pd import matplotlib.pyplot as plt bins = 500 lat_min, lat_max = 48.75, 48.95 lon_min, lon_max = 2.25, 2.5 COLNAME = "Geometry X Y" def getCoord(x,i): return map(float, x.split(','))[i] data = pd.read_csv("./troncon_voie.csv", chunksize=10000, usecols=[COLNAME], sep=";") z = np.zeros((bins, bins)) for i, chunk in enumerate(data): print("Chunk " + str(i)) chunk['x'] =chunk[COLNAME].apply(lambda x: getCoord(x,0)) chunk['y'] =chunk[COLNAME].apply(lambda x: getCoord(x,1)) z += np.histogram2d(chunk['x'], chunk['y'], bins=bins, range=[[lat_min, lat_max], [lon_min, lon_max]])[0] log_density = np.log(1 + z) plt.imshow(log_density[::-1, :], # flip vertically extent=[lat_min, lat_max, lon_min, lon_max]) plt.savefig('heatmap.png')  # Random forest vs SVM ## Too long, didn’t read ### General remarks If you have too many rows (more than 10 000), prefer the random forest. If you do not have much time to pre process the data (and, or have a mix of categorical and numerical features), prefer the random forest. ### Classification tasks Random forests are inherently mutliclass whereas Support Vector Machines need workarounds to treat multiple classes classification tasks. Usually this consists in building binary classifiers which distinguish (i) between one of the labels and the rest (one-versus-all) or (ii) between every pair of classes (one-versus-one). ## Empiric facts ### Do we Need Hundreds of Classifiers to Solve Real World Classification Problems? This article is a must read. The authors run 179 classification algorithms over 121 datasets. One of their conclusion being: The classifiers most likely to be the bests are the random forest (RF) versions, the best of which (implemented in R and accessed via caret) For the sake of the example, the next two paragraphs deal with datasets where SVMs are better than random forests. ### MNIST benchmark From crossvalidated, RFs seem to achieve a 2.8% error rate on the MNSIT dataset. On the other hand, on Yann Lecun benchmarks a simple SVM with a gaussian kernel could reach a 1.4% error rate. Virtual SVM could reach 0.56% error rate. Fig. 1: Illustration of the MNIST dataset ### Microarray data Quoting A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification, who use 22 diagnostic and prognostic datasets: We found that both on average and in the majority of microarray datasets, random forests are outperformed by support vector machines both in the settings when no gene selection is performed and when several popular gene selection methods are used. ## Size of the data set and theoretical complexity The first thing to consider is the feasability of training a model on a specific data set. Are there too many rows or columns to train a specific model ? Recall the table from the article about time complexity. Algorithm Classification/Regression Training Prediction Decision Tree C+R $O(n^2p)$ $O(p)$ Random Forest C+R $O(n^2pn_{trees})$ $O(pn_{trees})$ Random Forest R Breiman implementation $O(n^2p n_{trees})$ $O(pn_{trees})$ Random Forest C Breiman implementation $O(n^2 \sqrt p n_{trees})$ $O(pn_{trees})$ SVM (Kernel) C+R $O(n^2p+n^3)$ $O(n_{sv}p)$ What we can see is that the computational complexity of Support Vector Machines (SVM) is much higher than for Random Forests (RF). This means that training a SVM will be longer to train than a RF when the size of the training data is higher. This has to be considered when chosing the algorithm. Typically, SVMs tend to become unusable when the number of rows exceeds 20 000. Therefore, random forests should be prefered when the data set grows larger. You can use, however, bagged support vector machines. ## Transformation of the data : practical details Recall the following table, which can be found in the article about data scaling and reducing Algorithm Translation invariant Monotonic transformation invariant Decision Tree X X Random Forest X X Gradient Boosting X X SVM (Gaussian kernel) X SVM (Other kernels) This simply means that you do not have to worry about scaling / centering data or any other transformation (such as Box Cox) before feeding your data to the random forest algorithm. ## The shape of the decision function One scarcely has to deal with two dimensional data. If this ever happens to you, bear in mind that random forest tend to produce decision boundaries which are segements parallel to the x and y axises, whereas SVMs (depending on the kernel) provide smoother boundaries. Below are some illustrations. ### Results Illusatration of the decision boundary of a SVM Illustration of the decision boundary of a random forest ### The code import numpy as np from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, AdaBoostClassifier from sklearn.svm import SVR, SVC from sklearn.neighbors import KNeighborsClassifier from sklearn.linear_model import LogisticRegression from DecisionBoundaryPlotter import DecisionBoundaryPlotter def random_data_classification(n, p, f): predictors = np.random.rand(n, p) return predictors, np.apply_along_axis(f, 1, predictors) def parabolic(x, y): return (x**2 + y**3 > 0.5) * 1 def parabolic_mat(x): return parabolic(x[0], x[1]) X, Y = random_data_classification(100, 2, parabolic_mat) dbp = DecisionBoundaryPlotter(X, Y) named_classifiers = [(RandomForestClassifier(), "RandomForestClassifier"), (SVC(probability=True), "SVC")] for named_classifier in named_classifiers: print(named_classifier[1]) dbp.PlotContour(named_classifier[0], named_classifier[1]) dbp.PlotHeatmap(named_classifier[0], named_classifier[1])  The DecisionBoundaryPlotter class being defined below: import numpy as np import matplotlib.pyplot as plt class DecisionBoundaryPlotter: def __init__(self, X, Y, xs=np.linspace(0, 1, 100), ys=np.linspace(0, 1, 100)): self._X = X self._Y = Y self._xs = xs self._ys = ys def _predictor(self, model): model.fit(self._X, self._Y) return (lambda x: model.predict_proba(x.reshape(1,-1))[0, 0]) def _evaluate_height(self, f): fun_map = np.empty((self._xs.size, self._ys.size)) for i in range(self._xs.size): for j in range(self._ys.size): v = f( np.array([self._xs[i], self._ys[j]])) fun_map[i, j] = v return fun_map def PlotHeatmap(self, model, name): f = self._predictor(model) fun_map = self._evaluate_height(f) fig = plt.figure() s = fig.add_subplot(1, 1, 1, xlabel='$x$', ylabel='$y$') im = s.imshow( fun_map, extent=(self._ys[0], self._ys[-1], self._xs[0], self._xs[-1]), origin='lower') fig.colorbar(im) fig.savefig(name + '_Heatmap.png') def PlotContour(self, model, name): f = self._predictor(model) fun_map = self._evaluate_height(f) fig = plt.figure() s = fig.add_subplot(1, 1, 1, xlabel='$x$', ylabel='$y\$')
s.contour(self._xs, self._ys, fun_map, levels = [0.5])
s.scatter(self._X[:,0], self._X[:,1], c = self._Y)
fig.savefig(name + '_Contour.png')


## SVM as a random forest

Something funny is that random forests can actually be rewritten as kernel methods. I have not (yet) looked at the articles below in more details

Random forests and kernel methods, Erwan Scornet

The Random Forest Kernel (and creating other kernels for big data from random partitions)

## Towards faster SVMs ?

As I was writing the article, I discovered some recent progresses had been made : a scalable version of the Bayesian SVM has been developed. See : Bayesian Nonlinear Support Vector Machines for Big Data.

## Learning more

Wenzel, Florian; Galy-Fajou, Theo; Deutsch, Matthäus; Kloft, Marius (2017). “Bayesian Nonlinear Support Vector Machines for Big Data”. Machine Learning and Knowledge Discovery in Databases (ECML PKDD).

The elements of statistical learning by Trevor Hastie, Robert Tibshirani, Jerome Friedman is a brilliant introduction to the topic and will help you have a better understanding of most of the algorithms presented in this article !

Classification and Regression Trees by Leo Breiman, Jerome Friedman, Charles J. Stone, R.A. Olshen

# Creating VIM plugin with Python

## My first plugin

As I was working on various OCaml projects that were growing bigger and bigger, the various open directives started accumulating in some files. And I missed the feature I had in visual studio that automatically sorts the names of the using directives. This allowed a much quicker visual inspection of the various modules I needed for a piece of code to run.

As I could not find a solution to do it easily, I had a look at the different ways to implement vim plugins.

It works!

## Python : import vim

It turned out that the plugins can be directly written using Python ! in a vim script (mine will be called librarySorter.vim), you can simply write python code, import vim to read the buffer.

The buffer seems to be an array like structure on which you can get/set the i-th element of the buffer (i.e. the i-th line of your file).

Unfortunately, the documentation to use the package vim is quite limited. As this Stack Overflow question stresses it.

A nice ressource though, is this presentation.

## Coding the plugin

### Making sure it is compiled with python3

if !has('python3')
echo "Error: Required vim compiled with +python3"
finish
endif


### Declaring a function

function! LibrarySorter()

endfunction


### Calling Python code

The part written in Python has to be between these two EOF declarations.

python3 << EOF

# python code

EOF


In the case of my plugin, this looks like this:

python3 << EOF
import vim

#Reads the whole file, as a list
cb = list(vim.current.buffer)

#Filter the open directives
openLines = list(filter(lambda x: x.startswith("open "), cb))

#If all the open directives are not at the top of the file, stop immediatly
if not vim.current.buffer[len(openLines) - 1].startswith("open "):
raise Exception("Error, some open directives are not at the head of your file")

openLines = sorted(openLines)

#And rearrange the lines in the buffer
for i in range(len(openLines)):
vim.current.buffer[i] = openLines[i]

EOF


And now, in vim, you can simply type: :call LibrarySorter() anywhere in your OCaml source and look at the head of your file :)

## The code

if !has('python3')
echo "Error: Required vim compiled with +python3"
finish
endif

function! LibrarySorter()

python3 << EOF
import vim

cb = list(vim.current.buffer)

openLines = list(filter(lambda x: x.startswith("open "), cb))

if not vim.current.buffer[len(openLines) - 1].startswith("open "):
raise Exception("Error, some open directives are not at the head of your file")

openLines = sorted(openLines)

for i in range(len(openLines)):
vim.current.buffer[i] = openLines[i]

EOF

endfunction


## What’s next ?

I actually have the same issue when importing files in Python, TypeScript… Might be nice to generalize this to various languages :)

## Learning more

### Ocaml

Real World OCaml: Functional programming for the masses by Yaron Minsky, Anil Madhavapeddy and Jason Hickey is a good introduction to OCaml.

Purely Functional Data Structures by Chris Okasaki, which presents another way to look at data structures, from a functional point of view. As a prerequisite, a knowledge of common structures such as Heaps, Linked Lists is needed.

### Vim

Practical Vim: Edit Text at the Speed of Thought

Modern Vim: Craft Your Development Environment with Vim 8 and Neovim

Learning the vi and Vim Editors

# Using ocamlopt with ocamlbuild

As your OCaml programs grow bigger, it is a good idea to use a build system. Ocamlbuild is quite good at this job ! It will take care of your dependencies, find libraries and more importantly compile the files in the right order…

It took me a while to find the answer to the simple question, how to use ocamlopt instead of the default compiler. Assuming you want to compile a main.ml file, the short answer is to use ocamlbuild main.native, instead of ocamlbuild main.byte

# ocamlopt and ocamlc

OCaml comes with two compilers : ocamlc and ocamlopt. ocamlc is the bytecode compiler, and ocamlopt is the native code compiler. If you don’t know which one to use, use ocamlopt since it provides standalone executables that are normally faster than bytecode.

# Example

For a very quick benchmark let’s sort a long list.

let a = List.init 1000000 (fun x -> Random.int 500) ;;
let b = List.sort compare a ;;

print_string "done;\n";


And compile this with the different compilers…

ocamlbuild main.native
echo "native"
time ./main.native
rm main.native
rm -rf _build/

ocamlbuild main.byte
echo "byte"
time ./main.byte
rm main.byte
rm -rf _build/


And the results are quite convincing.

Finished, 4 targets (0 cached) in 00:00:00.
native
done;

real  0m2.156s
user  0m2.064s
sys 0m0.088s
Finished, 3 targets (0 cached) in 00:00:00.
byte
done;

real  0m6.340s
user  0m6.268s
sys 0m0.068s
`

# Learning more

Real World OCaml: Functional programming for the masses by Yaron Minsky, Anil Madhavapeddy and Jason Hickey is a good introduction to OCaml.