Resources in topological data analysis

Topological Data Analysis is a promising field in data analysis. Indeed, the hypotheses made about the data are much weaker than the hypotheses behind “usual” machine learning algorithms. Quarantine may be a great time for learning it ;)

This list is not exhaustive, I may have missed important resources. If so, please let me know in the comments!

Articles

Mathematical topic presentations

The following papers are great presentations of the topic:

TDA applications

Books

Videos

A private company, Ayasdi, has published many videos on the topic. Most of them are related to real-world applications of TDA.

And their YouTube channel has many others.

Software

Python

R

tSNE vs PCA

Introduction

Both PCA and tSNE are well-known methods for dimension reduction. The question of their difference is often asked, so here I will present various points of view to study their differences: theoretical, computational and empirical. As usual, neither method is “better” than the other in every sense, and we will see that their successes vastly depend on the dataset, and that one method may preserve features of the data that the other does not.

A little bit of wording and notations

PCA stands for Principal Component Analysis,

whereas tSNE stands for t-distributed Stochastic Neighbor Embedding, the t referring to the Student-t kernel.

As usual, we will call \(n\) the number of observations and \(p\) the number of features.

Theoretical aspects

tSNE

Quoting the authors of [1]:

Stochastic Neighbor Embedding (SNE) starts by converting the high-dimensional Euclidean distances between data points into conditional probabilities that represent similarities. The similarity of data point \(x_j\) to data point \(x_i\) is the conditional probability \(p_{j \vert i}\) that \(x_i\) would pick \(x_j\) as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centered at \(x_i\). For nearby data points, \(p_{j \vert i}\) is relatively high, whereas for widely separated data points, \(p_{j \vert i}\) will be almost infinitesimal.

In a nutshell, this maps your data in a lower dimensional space, where a small distance between two points means that they were close in the original space.

The method goes on defining:

$$p_{j \vert i} = \frac{\exp(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2)}$$

as the similarities in the original space, and \(q_{j \vert i}\) as the analogous probability in the space of the represented points:

$$q_{j \vert i} = \frac{\exp(-\lVert y_i - y_j \rVert ^2)}{\sum_{k \neq i} \exp(-\lVert y_i - y_k \rVert^2)}$$

Then, the Kullback-Leibler divergence between these distributions is minimized using a gradient descent over the \(y\) elements:

$$C = \sum_{i,j}p_{j \vert i} \log \frac{p_{j \vert i}}{q_{j \vert i} } $$
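To make these definitions concrete, here is a minimal numpy sketch of these quantities (an illustration only: in the real algorithm, each \(\sigma_i\) is tuned to match a user-specified perplexity, and the cost is minimized by gradient descent on the \(y_i\)):

import numpy as np

def conditional_p(X, sigmas):
    # p_{j|i}: row-normalized Gaussian affinities computed from pairwise distances
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    P = np.exp(-sq_dists / (2.0 * sigmas[:, None] ** 2))
    np.fill_diagonal(P, 0.0)  # a point is never its own neighbor
    return P / P.sum(axis=1, keepdims=True)

def kl_cost(P, Q, eps=1e-12):
    # C = sum_{i,j} p_{j|i} * log(p_{j|i} / q_{j|i})
    return np.sum(P * np.log((P + eps) / (Q + eps)))

X = np.random.rand(100, 5)                        # high-dimensional points
Y = np.random.rand(100, 2)                        # candidate low-dimensional map
P = conditional_p(X, np.full(100, 0.5))           # p_{j|i} with fixed bandwidths
Q = conditional_p(Y, np.full(100, np.sqrt(0.5)))  # q_{j|i}: 2 * sigma^2 = 1
print(kl_cost(P, Q))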

PCA

On the other hand, PCA does not rely on a probabilistic approach to represent the points. Indeed, the approach is more geometric (or related to linear algebra): PCA looks for the axes that explain as much of the variance of the data as possible.

Following the Wikipedia article, we can get a better understanding of how it performs dimension reduction:

The transformation \(T = X W\) maps a data vector \(x_i\) from an original space of \(p\) variables to a new space of \(p\) variables which are uncorrelated over the dataset. However, not all the principal components need to be kept. Keeping only the first \(L\) principal components, produced by using only the first \(L\) eigenvectors, gives the truncated transformation

$$\mathbf{T}_L = \mathbf{X} \mathbf{W}_L$$

where the matrix \(T_L\) now has \(n\) rows but only \(L\) columns. In other words, PCA learns a linear transformation \(t = W^T x, x \in R^p, t \in R^L,\) where the columns of \(p × L\) matrix \(W\) form an orthogonal basis for the \(L\) features (the components of representation \(t\)) that are decorrelated. By construction, of all the transformed data matrices with only \(L\) columns, this score matrix maximises the variance in the original data that has been preserved, while minimising the total squared reconstruction error \(\|\mathbf{T}\mathbf{W}^T - \mathbf{T}_L\mathbf{W}^T_L\|_2^2\) or \(\|\mathbf{X} - \mathbf{X}_L\|_2^2\).

What we see here is that the total squared reconstruction error is minimized. Therefore, if two points are far away in the original space, their distance will have to be large as well once they are mapped to the lower dimensional space.

In that case, this means that after a PCA, two points are far away from each other if they were far away from each other in the original data set.
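As a rough sketch of the procedure described above (numpy only; scikit-learn actually relies on an SVD-based solver, but the result is the same up to signs):

import numpy as np

def pca(X, L):
    # Project X (n x p) onto its first L principal components
    Xc = X - X.mean(axis=0)                 # center the data
    cov = np.cov(Xc, rowvar=False)          # p x p covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigendecomposition (ascending order)
    order = np.argsort(eigvals)[::-1]       # components by decreasing variance
    W_L = eigvecs[:, order[:L]]             # p x L orthonormal basis
    return Xc @ W_L                         # truncated scores T_L = X W_L

X = np.random.rand(200, 5)
T_2 = pca(X, 2)  # 200 points mapped to the plane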

Feasibility

Computational complexity

If you are not familiar with this topic, I already wrote about it in computational complexity and machine learning.

PCA can be performed quite quickly: it consists in evaluating the covariance matrix of the data and performing an eigenvalue decomposition of this matrix. The covariance matrix computation requires \(O(p^2 n)\) operations while the eigenvalue decomposition requires \(O(p^3)\) operations, so the complexity of PCA is \(O(p^2 n+p^3)\). Considering that \(p\) is fixed, this is \(O(n)\).

On the other hand, tSNE can benefit from the Barnes-Hut approximation [2], making it usable on large datasets with a complexity of \(O(n \log n)\) (more information can be found in the article).

Therefore, it is not obvious that one method should be preferred to the other on large datasets.
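As an informal check, one can simply time both implementations on a synthetic dataset (a rough sketch; the figures will of course depend on the data, the parameters and the machine):

import time

import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X = np.random.rand(2000, 50)  # n = 2000 observations, p = 50 features

for name, model in [("PCA", PCA(n_components=2)),
                    ("tSNE", TSNE(n_components=2))]:
    start = time.time()
    model.fit_transform(X)
    print(name, "took", round(time.time() - start, 2), "s")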

Parameters

PCA is parameter-free, whereas tSNE has many parameters: some related to the problem specification (perplexity, early_exaggeration), others related to the gradient descent part of the algorithm.

Indeed, in the theoretical part, we saw that PCA has a clear meaning once the number of axes has been set.

However, we saw that \(\sigma\) appeared in the penalty function to optimize for tSNE. Besides, as a gradient descent is required, all the usual parameters of the gradient descent will have to be specified as well (step sizes, iterations, stopping criteria…).

Looking at the implementations, the arguments of the constructors in scikit-learn are the following:

class sklearn.decomposition.PCA(n_components=None, 
		copy=True, whiten=False, svd_solver='auto', 
		tol=0.0, iterated_power='auto', 
		random_state=None)

The parameters mostly relate to the solvers, but the method is unique. On the other hand…

class sklearn.manifold.TSNE(n_components=2, perplexity=30.0, 
		early_exaggeration=12.0, learning_rate=200.0, 
		n_iter=1000, n_iter_without_progress=300, 
		min_grad_norm=1e-07, metric='euclidean', init='random', verbose=0, 
		random_state=None, method='barnes_hut', angle=0.5, n_jobs=None)

However, default parameters usually provide nice results. From the original paper:

The performance of SNE is fairly robust to changes in the perplexity, and typical values are between 5 and 50.
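As an illustration, here is a minimal sketch of how one could compare a few typical perplexity values on synthetic blobs (plotting left out):

from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

X, _ = make_blobs(n_samples=500, centers=5, n_features=10, random_state=0)

# One embedding per typical perplexity value
embeddings = {
    p: TSNE(n_components=2, perplexity=p, random_state=0).fit_transform(X)
    for p in (5, 30, 50)
}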

Therefore, both in terms of feasibility and of the effort demanded by the calibration procedure, there is no clear reason to prefer one method to the other.

Empirical analysis

Figures

The following shows the reduction from a three dimensional space to a two dimensional space using both methods. The colors of the points are preserved, so that one can “keep track” of them.

Gaussian blobs


Fig. 1: Gaussian blobs in three dimensions


Fig. 2: Gaussian blobs after PCA


Fig. 3: Gaussian blobs after tSNE

Both methods were successful at this task: the blobs, which were initially well separated, remain well separated in the lower dimensional space. An interesting phenomenon, which validates what the theoretical arguments predicted, is that in the case of PCA, the (light) blue and cyan points are far away from each other, whereas they appear to be closer when tSNE is used.

The swiss roll


Fig. 4: Swiss roll in three dimensions


Fig. 5: Swiss roll after PCA


Fig. 6: Swiss roll after tSNE

Somehow the roll is broken by tSNE, which is surprising: one would expect the red dots to remain close to the orange dots… On the other hand, a linear classifier would be more successful on the data represented with tSNE than with PCA.

Digit dataset


Fig. 7: Digits after PCA


Fig. 8: Digits after tSNE

The tSNE works amazingly well on this data set, and exhibits a neat separation of most of the digits!

Hand dataset


Fig. 9: Hand data


Fig. 10: Hand - PCA


Fig. 11: Hand - tSNE

Here we note that the fingers “remain together” with the tSNE.

Other observations

Other observations can be made as well: for example, the size of a cluster does not mean much with tSNE, while it does have a meaning in the case of PCA.


Fig. 12: Gaussian blobs in three dimensions


Fig. 13: Gaussian blobs after PCA


Fig. 14: Gaussian blobs after tSNE

Code

Should you want to explore new datasets, feel free to use the following code!

import numpy as np
import matplotlib.pyplot as plt
import mpl_toolkits.mplot3d.axes3d as p3
from sklearn.datasets import make_swiss_roll, make_blobs, load_digits
from sklearn import decomposition
from sklearn.manifold import TSNE

# Generate data (swiss roll dataset)
n_samples = 2500

def generate_swiss_roll_data(n_samples):
    noise = 0.05
    X, _ = make_swiss_roll(n_samples, noise=noise)  # noise is keyword-only in recent scikit-learn
    # Make it thinner
    X[:, 1] *= .5
    distance_from_y_axis = X[:, 0] ** 2 + X[:, 2] ** 2 
    X_color = plt.cm.jet(distance_from_y_axis / np.max(distance_from_y_axis + 1))
    return X, X_color, "Swiss roll"

def generate_gaussian_blobs(n_samples):
    X, y = make_blobs(n_samples=n_samples, centers=5, n_features=3,random_state=3)
    X_color = plt.cm.jet(y / np.max(y + 1))
    return X, X_color, "Gaussian blobs"

def generate_gaussian_blobs_different(n_samples):
    X, y = make_blobs(n_samples=n_samples, centers=2, n_features=3,random_state=3)
    X[y == 1, :] *= 2
    X_color = plt.cm.jet(y / np.max(y + 1))
    return X, X_color, "Gaussian blobs different sizes"

def generate_digits(n_samples):
    X, y = load_digits(n_class = 10, return_X_y = True)
    X_color = plt.cm.jet(y / np.max(y + 1))
    return X, X_color, "Digits"

def generate_hand(n_samples):
    X = np.loadtxt("./Hand.csv", delimiter=",")
    z_component =  X[:, 2] - X[:, 0]
    X_color = plt.cm.jet(z_component / np.max(z_component + 1))
    return X, X_color, "Hand"


def produce_plots(data_generator, data_generator_name):
    X, X_color, title = data_generator(n_samples)
    fig = plt.figure()
    ax = p3.Axes3D(fig)
    ax.view_init(7, -80)
    ax.scatter(X[:, 0], X[:, 1], X[:, 2],
            color = X_color,
            s=20, edgecolor='k')
    ax.set_xlabel('$x$')
    ax.set_ylabel('$y$')
    ax.set_zlabel('$z$')
    plt.title(title)
    plt.savefig(data_generator_name + '.png')

    # Fit and plot PCA
    X_pca = decomposition.PCA(n_components=2).fit_transform(X)
    fig = plt.figure()
    s = fig.add_subplot(1, 1, 1, xlabel='$x$', ylabel='$y$')
    s.scatter(X_pca[:, 0], X_pca[:, 1], c = X_color)
    plt.title(title + " - PCA")
    fig.savefig(data_generator_name  + '_pca.png')

    # Fit and plot tSNE
    X_tsne = TSNE(n_components=2).fit_transform(X)
    fig = plt.figure()
    s = fig.add_subplot(1, 1, 1, xlabel='$x$', ylabel='$y$')
    s.scatter(X_tsne[:, 0], X_tsne[:, 1], c = X_color)
    plt.title(title + " - tSNE")
    fig.savefig(data_generator_name + '_tsne.png')

produce_plots(generate_digits, "digits")
produce_plots(generate_hand, "hand")
produce_plots(generate_swiss_roll_data, "swiss_roll")
produce_plots(generate_gaussian_blobs, "blobs")
produce_plots(generate_gaussian_blobs_different, "blobs_diff")

Learning more

References

[1] van der Maaten, L.J.P.; Hinton, G.E. Visualizing High-Dimensional Data Using t-SNE. Journal of Machine Learning Research 9:2579-2605, 2008.

[2] L.J.P. van der Maaten. Accelerating t-SNE using Tree-Based Algorithms. Journal of Machine Learning Research 15(Oct):3221-3245, 2014. https://lvdmaaten.github.io/publications/papers/JMLR_2014.pdf

The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani and Jerome Friedman is a brilliant introduction to the topic and will help you gain a better understanding of PCA.

Python data science handbook

Websites

Distill article on tSNE

Count different elements in list - Ocaml

It is quite common to have to count the occurrences of each element in a list. Python, for example, proposes the Counter class, and looking around, one would soon enough find the following snippet:

from collections import Counter

words = ['a', 'b', 'c', 'a']

Counter(words).keys() # equals to list(set(words))
Counter(words).values() # counts the elements' frequency

However, OCaml lacks this abundance of ready-made snippets. This article will focus on counting the occurrences of each element in a list, using various approaches.

A functional approach

A simple way, relying on the built-in List module, is to use List.sort_uniq and then, for each element it returns, count its occurrences in the list.

let count_unique_elements_naive list = 
  let count_element e list = List.filter (fun x -> x = e) list |> List.length in
  List.sort_uniq Int.compare list 
  |> List.map (fun e -> (e, count_element e list))

The following could be put in a count.ml file in order to test the previous snippet:

let print_int_pair_list list =
  let string_of_int_pair pair = match pair with
  | a,b -> (string_of_int a)^" - "^(string_of_int b)^"\n" in
  List.map string_of_int_pair list 
  |> List.iter print_string

let count_unique_elements_naive list = 
  let count_element e list = List.filter (fun x -> x = e) list |> List.length in
  List.sort_uniq Int.compare list 
  |> List.map (fun e -> (e, count_element e list))


let () =
  count_unique_elements_naive [1; 1; 3; 5; 7; 7; 1]
  |> print_int_pair_list

Running ocaml count.ml would output the following:

1 - 3
3 - 1
5 - 1
7 - 2

Obviously, this method makes as many passes over the list as there are unique elements in the list: for u unique elements, this costs O(u × n) operations instead of a single pass, which is a waste of time.

Using Hash tables

Here, hash tables come in handy. By updating a counter for each element of the list, one can achieve the same result in a single pass over the data.

Note that in recent versions of OCaml (>= 4.07), it is easy to turn a Hashtbl into a list or a seq, using the to_seq function.

let count_unique_elements_hashtbl list =
  let counter = Hashtbl.create 10000 in
  let update_counter x = 
    if Hashtbl.mem counter x then
      let current_count = Hashtbl.find counter x in
      Hashtbl.replace counter x (succ current_count)
    else
      Hashtbl.replace counter x 1
      in
  List.iter update_counter list;
  Hashtbl.to_seq counter
  |> List.of_seq

Using integer hash tables

As stated in the documentation:

The functorial interface allows the use of specific comparison and hash functions, either for performance/security concerns, or because keys are not hashable/comparable with the polymorphic builtins.

Performing a minor replacement in the code above, we can rewrite this as:

module IntHash =
  struct
    type t = int
    let equal i j = i = j
    let hash i = i land max_int
  end


(* Here we create a specialized module for hash tables whose key is an integer *)
module IntHashtbl = Hashtbl.Make(IntHash)


let count_unique_elements_int_hashtbl list =
  let counter = IntHashtbl.create 10000 in
  let update_counter x = 
    if IntHashtbl.mem counter x then
      let current_count = IntHashtbl.find counter x in
      IntHashtbl.replace counter x (succ current_count)
    else
      IntHashtbl.replace counter x 1
  in
  List.iter update_counter list;
  IntHashtbl.to_seq counter
  |> List.of_seq

Benchmark

And the winner is count_unique_elements_int_hashtbl.

$ ocamlopt main.ml 
$ ./a.out 
count_unique_elements_naive        Execution time: 0.372140s
count_unique_elements_hashtbl      Execution time: 0.089627s
count_unique_elements_int_hashtbl  Execution time: 0.081218s

If one wants to reproduce the results:

let print_int_pair_list list =
  let string_of_int_pair pair = match pair with
  | a,b -> (string_of_int a)^" - "^(string_of_int b)^"\n" in
  List.map string_of_int_pair list 
  |> List.iter print_string


let time f x =
  let t = Sys.time() in
  let fx = f x in
  Printf.printf "Execution time: %fs\n" (Sys.time() -. t);
  fx


let count_unique_elements_naive list = 
  let count_element e list = List.filter (fun x -> x = e) list |> List.length in
  List.sort_uniq Int.compare list 
  |> List.map (fun e -> (e, count_element e list))


let count_unique_elements_hashtbl list =
  let counter = Hashtbl.create 10000 in
  let update_counter x = 
    if Hashtbl.mem counter x then
      let current_count = Hashtbl.find counter x in
      Hashtbl.replace counter x (succ current_count)
    else
      Hashtbl.replace counter x 1
      in
  List.iter update_counter list;
  Hashtbl.to_seq counter
  |> List.of_seq


module IntHash =
  struct
    type t = int
    let equal i j = i = j
    let hash i = i land max_int
  end


(* Using the Hashtbl functorial interface *)
module IntHashtbl = Hashtbl.Make(IntHash)


let count_unique_elements_int_hashtbl list =
  let counter = IntHashtbl.create 10000 in
  let update_counter x = 
    if IntHashtbl.mem counter x then
      let current_count = IntHashtbl.find counter x in
      IntHashtbl.replace counter x (succ current_count)
    else
      IntHashtbl.replace counter x 1
  in
  List.iter update_counter list;
  IntHashtbl.to_seq counter
  |> List.of_seq


let sample = List.init 1000000 (fun _ -> Random.int 10)


let named_methods = [("count_unique_elements_naive", count_unique_elements_naive)
  ;("count_unique_elements_hashtbl", count_unique_elements_hashtbl)
  ;("count_unique_elements_int_hashtbl", count_unique_elements_int_hashtbl)]

let () =
  let _ = List.map (fun pair -> print_string (fst pair); print_string "\t";time (snd pair) sample) named_methods in
  ()

Books

Real World OCaml: Functional programming for the masses by Yaron Minsky, Anil Madhavapeddy and Jason Hickey is a good introduction to OCaml.

Purely Functional Data Structures by Chris Okasaki, which presents another way to look at data structures, from a functional point of view. As a prerequisite, a knowledge of common structures such as Heaps, Linked Lists is needed.

Printing hello world in OCaml

Why write a tutorial?

There are indeed plenty of good resources if one is willing to learn OCaml (OCaml tutorials may be a good start). However, the usual road one follows when starting a new language, like printing hello world, is not commonly presented. This might be due to the fact that printing hello world is actually one of the least functional things to do…

Besides, OCaml comes with a toplevel (utop being an improved one) and different compilers: ocamlc and ocamlopt by default, but plenty of others are available… Knowing where to start may not be obvious.

Here we will compile and execute a program that prints hello world.

Setting up OCaml

The best starting point is to install opam, which is to OCaml what pip is to Python. opam also enables you to switch between various compiler versions.

Installing OPAM details the different steps depending on your distribution. Assuming you are using Ubuntu, the following should do the job:

sudo apt-get install opam

Once opam is installed, the following commands will enable you to use OCaml:

# environment setup
opam init
eval `opam env`
# install given version of the compiler
opam switch create 4.08.0
eval `opam env`

# check you got what you want
which ocaml
ocaml -version

In my case:

~$ ocaml -version
The OCaml toplevel, version 4.08.0

The project

The usual extension of an OCaml file is .ml. Let’s just create a dedicated folder (called hellocaml below), in which we place a main.ml file. The contents of the file main.ml will be pretty simple:

let () =
  print_string "Hello world\n"

print_string simply prints the input string in the terminal. Note that this print function is strongly typed: you cannot write:

print_string 42;

Instead, one would have to write:

print_int 42;

Note that there is nothing magical about the \n. In languages such as Python, the print function appends a newline by default; in OCaml, print_string does not. The following code would produce the same output:

let () =
  print_string "Hello ";
  print_string "world\n"

For now the contents of our directory looks like:

hellocaml/
└── main.ml

Executing the project

user@:~/hellocaml$ ocaml main.ml 
Hello world

Compiling the project

OCaml has different build systems, but we will not use any of them for now. We will simply use the default tools to turn our code into an executable.

ocamlopt and ocamlc

ocamlc produces bytecode while ocamlopt produces native code; for now, you do not need to know much more about these two.

Compiling the project

Typing the following command:

ocamlc main.ml

The directory now contains:

hellocaml/
├── a.out
├── main.cmi
├── main.cmo
└── main.ml

Executing the project

user@:~/hellocaml$ ./a.out 
Hello world

What is next?

All the cool features of the language! Nothing about the functional aspect has been discussed yet.

Resources

Online

Why OCaml, video. If you wonder about the point of functional programming, this video summarizes key aspects of OCaml.

99 problems in OCaml if you prefer to learn with exercises instead of tutorials ;)

Stay tuned, I plan to write more detailed tutorials!

Books

Real World OCaml: Functional programming for the masses by Yaron Minsky, Anil Madhavapeddy and Jason Hickey is a good introduction to OCaml.

Purely Functional Data Structures by Chris Okasaki, which presents another way to look at data structures, from a functional point of view. As a prerequisite, a knowledge of common structures such as Heaps, Linked Lists is needed.

Opam permissions on Manjaro

bwrap: No permissions to creating new namespace

When installing ounit, I faced the following issue:

$ opam install ounit
[NOTE] It seems you have not updated your repositories for a while. Consider updating them with:
       opam update

The following actions will be performed:
  ∗ install stdlib-shims 0.1.0 [required by ounit]
  ∗ install ounit        2.1.2
===== ∗ 2 =====
Do you want to continue? [Y/n] Y

A while later:

#=== ERROR while compiling stdlib-shims.0.1.0 =================================#
# context     2.0.5 | linux/x86_64 | ocaml-base-compiler.4.08.0 | https://opam.ocaml.org#225ac83b
# path        ~/.opam/408/.opam-switch/build/stdlib-shims.0.1.0
# command     ~/.opam/opam-init/hooks/sandbox.sh build dune build -p stdlib-shims -j 3
# exit-code   1
# env-file    ~/.opam/log/stdlib-shims-11129-ad8886.env
# output-file ~/.opam/log/stdlib-shims-11129-ad8886.out
### output ###
# bwrap: No permissions to creating new namespace, likely because the kernel does not allow non-privileged user namespaces. On e.g. debian this can be enabled with 'sysctl kernel.unprivileged_userns_clone=1'.

So, it cannot create the new namespaces… One could be tempted to run it with sudo. But this is not good. Not good at all (see the last section). Instead, you have to enable unprivileged user namespaces.

The solution

Simply run:

$ sudo sysctl kernel.unprivileged_userns_clone=1

This enables unprivileged user namespaces, and now the install should work just fine.

$ opam install ounit
[NOTE] It seems you have not updated your repositories for a while. Consider updating them with:
       opam update

The following actions will be performed:
  ∗ install stdlib-shims 0.1.0 [required by ounit]
  ∗ install ounit        2.1.2
===== ∗ 2 =====
Do you want to continue? [Y/n] Y

<><> Gathering sources ><><><><><><><><><><><><><><><><><><><><><><><><><><><><>
[ounit.2.1.2] found in cache
[stdlib-shims.0.1.0] found in cache

<><> Processing actions <><><><><><><><><><><><><><><><><><><><><><><><><><><><>
∗ installed stdlib-shims.0.1.0
∗ installed ounit.2.1.2
Done.

Why not use sudo?

One could be tempted to…

$ sudo opam install ounit

I did it in the first place. It worked “fine”. Running it a second time:

[WARNING] Running as root is not recommended
[NOTE] Package ounit is already installed (current version is 2.2.1).

So, I have the package. But, when I tried testing my program:

$ ./runTests.sh 
+ ocamlfind ocamldep -package oUnit2 -modules tests/randomVariableGausianTest.ml > tests/randomVariableGausianTest.ml.depends
ocamlfind: Package `oUnit2' not found

Weird… I have the package, but for some reason it cannot be found…

Then, running sudo opam list and opam list, I obtained different lists of packages. I guess the packages installed with sudo are not installed in the right place (and therefore cannot be found by the compiler by default).

Winning at motus

Context

What is motus

Motus is a quite famous game (at least in France) where one has to guess a seven-letter word, given its first letter and another letter placed somewhere in the word. The player has eight trials to guess the word. Each proposed word must start with the same first letter as the word to guess, and the player is informed when a letter is in the right place, or when a letter belongs to the word to guess but is not in the right place.

Being quite popular, this game now has various mobile applications!


Fig. 1: Capture of "Motus, le jeu officiel France 2"

As you can see, the letters highlighted in red are the ones in the right position, and a yellow circle indicates that the letter is in the word, but currently in the wrong position.

Why winning

Long story short, I would like a winning strategy for this game, ideally one I can execute without the help of a computer.

A solution

Beginner approach

Having watched the show many times on TV, and others playing the app, the strategy used by most players seems to be to incrementally reuse the letters known to be well positioned, proposing words which comply with the letters validated so far.

The idea

Though the naive approach works, it feels like “sub optimal words” (i.e. words which seem far from the final word given the information we have when we propose them) can bring a lot of information, since we can craft them to be as informative as possible!

“Crafting” (the process of finding a seven-letter word which shares as few letters as possible with the words already suggested) is quite hard. However, learning a list of words which are as informative as possible is easy!

The idea used here is that, in 4 guesses, one can learn almost all the letters present in the word to guess: since the first letter is fixed, each guess can probe up to 6 new letters (4 * (7-1) = 24, close to 26). Therefore, for each starting letter, one only needs to know four words such that the number of different letters in these four words is maximal.

The second hypothesis is that once we know the letters of the word to guess, it is easy to guess. This turned out to be harder than I thought; yet, my statistics vastly improved!

The implementation

Details

I took the list of French words from this blog. The first part consists in removing the words which are not seven characters long and getting rid of accents.

Then, for a list of words, a coverage_score is defined: it is simply the number of distinct characters appearing in these words.

It is implemented and briefly tested below:

def coverage_score(words):
    cnt = Counter()
    for word in words:
        for character in word:
            cnt[character] += 1
    return len(list(cnt))


assert(coverage_score(['aa', 'bb']) == 2)
assert(coverage_score(['ac', 'cb']) == 3)
assert(coverage_score(['abc', 'def']) == 6)
assert(coverage_score(['abc', 'def', 'cad']) == 6)

Then, for each starting letter, words are sampled four at a time and the score of each sample is evaluated. This process is repeated 100 000 times, and the 4-tuple with the highest coverage score is returned. And this is it.

def random_optimization(generator, score, n_attempts):
    i = 0
    best_score = 0
    best_element = generator()
    while i < n_attempts:
        i += 1
        e = generator()
        score_e = score(e)
        if score_e > best_score:
            best_score = score_e
            best_element = e
    return (best_element, best_score)

The code

from collections import Counter
import random
import string
import unidecode


def load_words_and_clean(input_filepath):

    def is_seven_characters_long(s):
        return len(s) == 7

    def remove_line_breaks(s):
        return s.replace("\n", "")

    fp = open(input_filepath, 'r')
    words = fp.readlines()
    words = map(remove_line_breaks, words)

    seven_letter_words = [
        word for word in words if is_seven_characters_long(word)]

    removed_accents = map(unidecode.unidecode, seven_letter_words)
    removed_dashes = filter(lambda x: not "-" in x, removed_accents)

    remove_ez = filter(lambda x: not x.endswith("ez"), removed_dashes)

    return list(remove_ez)


def coverage_score(words):
    cnt = Counter()
    for word in words:
        for character in word:
            cnt[character] += 1
    return len(list(cnt))


assert(coverage_score(['aa', 'bb']) == 2)
assert(coverage_score(['ac', 'cb']) == 3)
assert(coverage_score(['abc', 'def']) == 6)
assert(coverage_score(['abc', 'def', 'cad']) == 6)


def random_optimization(generator, score, n_attempts):
    i = 0
    best_score = 0
    best_element = generator()
    while i < n_attempts:
        i += 1
        e = generator()
        score_e = score(e)
        if score_e > best_score:
            best_score = score_e
            best_element = e
    return (best_element, best_score)


candidate_words = load_words_and_clean("./liste.de.mots.francais.frgut.txt")
print("File loaded!")

for x in string.ascii_lowercase:
    print(x)
    words_starting_with_x = list(
        filter(lambda w: w.startswith(x), candidate_words))

    def generator_x():
        res = [random.choice(words_starting_with_x) for i in range(4)]
        list.sort(res)
        return res

    res = random_optimization(generator_x, coverage_score, 100000)
    print("(coverage : " + str(res[1]) + ")")

    print("\n".join(res[0]))
    print(10 * '-')

Results

Now, I simply have to learn the 104 words listed below.

File loaded!
a
(coverage : 19)
affuble
agrions
amochai
apteryx
----------
b
(coverage : 18)
bavocha
bowling
brusque
brutaux
----------
c
(coverage : 18)
candide
capeyat
choques
combler
----------
d
(coverage : 18)
deflore
dejuche
demodat
dopings
----------
e
(coverage : 19)
endurci
engavat
enzymes
explora
----------
f
(coverage : 18)
fachait
faxames
flingot
fuyards
----------
g
(coverage : 18)
giclant
glyphes
gommeux
grisbis
----------
h
(coverage : 19)
hagarde
hickory
honteux
humbles
----------
i
(coverage : 18)
ichtyol
impulse
inegaux
ivoires
----------
j
(coverage : 19)
jockeys
jouxter
jubilat
jumping
----------
k
(coverage : 20)
kamichi
kobolds
krypton
kufique
----------
l
(coverage : 18)
lambris
legende
lexique
lycopes
----------
m
(coverage : 18)
malfont
margeas
midship
muqueux
----------
n
(coverage : 18)
nasarde
neigeux
notable
nymphal
----------
o
(coverage : 19)
oblatif
ocreras
oppidum
oxygene
----------
p
(coverage : 18)
palombe
penchas
poquait
profond
----------
q
(coverage : 15)
quartat
quiches
quidams
quillon
----------
r
(coverage : 18)
rebonds
rejugee
rempile
ruchait
----------
s
(coverage : 19)
sandjak
sculpte
sombras
swingua
----------
t
(coverage : 18)
thymols
topique
trafics
trepang
----------
u
(coverage : 16)
ultimes
urbaine
urgence
uropode
----------
v
(coverage : 19)
vachard
vampent
velique
voyages
----------
w
(coverage : 18)
wergeld
whiskys
wombats
wurmien
----------
x
(coverage : 11)
xanthie
ximenie
xylenes
xylenes
----------
y
(coverage : 16)
yankees
yiddish
yogourt
ypreaux
----------
z
(coverage : 17)
zeolite
ziberas
zincage
zythums
----------

Improvements

Word selection

The thing is that the accepted words mostly depend on the implementation of the game; I could not find an official dictionary for this. Some conjugated verbs may be rejected, for example (hence the line remove_ez = filter(lambda x: not x.endswith("ez"), removed_dashes), which does not account for the full complexity of French conjugations, to say the least).

Algorithm

The “coverage” seems quite low (around 19). I believe this can be improved using smarter ideas than a random, brute force optimization.

Perhaps better optimizations could be performed, such as: start with a word, then repeatedly pick among the words most different from those already chosen, and so on and so forth.
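For instance, here is a minimal greedy sketch, reusing the coverage_score function defined above (the candidate sampling size is an arbitrary choice):

import random

def greedy_selection(words, k=4, n_candidates=50):
    # Start from a random word, then repeatedly add the candidate
    # that increases the coverage score the most
    chosen = [random.choice(words)]
    while len(chosen) < k:
        candidates = random.sample(words, min(n_candidates, len(words)))
        best = max(candidates, key=lambda w: coverage_score(chosen + [w]))
        chosen.append(best)
    return chosen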

The score itself

The definition of the coverage score does not take into account the rarity of some letters: z is quite rare, and it is often a waste of time to check whether it is in the word to guess.
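A possible refinement, sketched below, would weight each covered letter by its frequency in the dictionary (letter_frequencies being a hypothetical dict mapping each letter to its frequency among the candidate words):

def weighted_coverage_score(words, letter_frequencies):
    # Weight each distinct letter by how frequent it is in the dictionary,
    # so that covering a rare letter such as 'z' contributes very little
    distinct_letters = set(c for word in words for c in word)
    return sum(letter_frequencies.get(c, 0.0) for c in distinct_letters)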

OCaml functor example : a progress bar

Introduction

Python has an amazing progress bar library: tqdm. In a for loop or a map statement, tqdm lets the user know how much time is needed for the loop to end. Though simple, this happens to be very useful when processing a dataset with hundreds of thousands of lines.

76%|████████████████████████████         | 7568/10000 [00:33<00:10, 229.00it/s]
What tqdm does


[##############-------------------------] 38.00% ending in : 3.21s
What I would be happy with

Regarding machine learning pipelines, Python does a great job. However, it lacks the power of a compiler. First, pre-processing of the data is sometimes done using functions written in pure Python, which are not particularly fast.

But more important, imagine the following pipeline in OCaml:

let new_text = my_text 
|> Str.split
|> List.map fix_spelling
|> remove_stopwords
|> List.map vectorize
|> FloatArray.average

Maybe these functions do not really exist, but: a sentence (my_text) is split into a list of words, the spelling of each word is fixed (List.map fix_spelling), the stopwords are removed, the words are vectorized (with something like word2vec) and at last, the list of vectorized words is averaged. It is pretty clear that we input a sentence (a string) and return a float array.

Now imagine that for some reason, the pipeline is misspelled in something like this:

let new_text = my_text 
|> Str.split
|> List.map fix_spelling
|> List.map vectorize
|> remove_stopwords
|> FloatArray.average

The compiler would immediately complain that remove_stopwords does not apply to a float array list (it was expecting a string list). On the other hand, depending on the implementation, it may take a while for the same program in Python to realize that some steps cannot be performed.

Therefore, I believe that OCaml could be an amazing fit for machine learning: the computing intensive libraries (random forests, SVMs…) can be implemented in another language (which is the case with scikit-learn) and the pipeline would be much more robust in OCaml.

As I love to use tqdm in Python, I wanted an OCaml equivalent.

OCaml functors

A functor is simply a module which accepts another module as a parameter. Let's look at the following module signature, a finite collection. It has a map function, which transforms every element, and a length function. Note that length could be inferred from map and a ref. However, modules such as Seq allow infinite collections: they have a map function, but cannot evaluate the length of a sequence.

module type FINITECOLLECTION = sig
  
  type 'a t
  val map : ('a -> 'b) -> 'a t -> 'b t 
  val length : 'a t -> int

end

Given this signature we can now define a module which will operate on finite collections. Without paying much attention to the contents of map, simply note that the only functions called in map are the ones specified in the signature: Collection.length and Collection.map.

(** Creates a new map function, reporting the progress in the terminal *)
module Prokope(Collection : FINITECOLLECTION) = struct

  (** Applies f to a finite collection l, while reporting progress in the terminal *)
  let map f l =
  
    let open ProkopeUtils in

    let starting_time = Unix.gettimeofday() in
    let wrapped_f = progress_wrapper default_progress_bar (Collection.length l) starting_time f in
    
    let res = add_indices_to_collection Collection.map l
    |> Collection.map wrapped_f in
    
    print_string "\n";
    res

end

Now, we would like to be able to use this new map method on array and list types. For this, we just have to reproduce the signature of the FINITECOLLECTION. The following does it.

module UsualArray  = struct 
  type 'a t = 'a array
  include Array
end

module Array = Prokope(UsualArray)


module UsualList  = struct 
  type 'a t = 'a list
  include List
end


module List = Prokope(UsualList)

Now this function can simply be invoked by prefixing the usual module name with Prokope:

let dummy a = 
  ignore (Unix.select [] [] [] 0.1);
  a

let double x = 2 * x 

let id x = x

let () =
  let sample = List.init 50 id in
  assert(List.map double sample = Prokope.List.map double sample);
  let _ = Prokope.List.map dummy sample in
  ()

And you would see:

[#######################################] 100.00% ending in : 0.00s
[##############-------------------------] 38.00% ending in : 3.21s

The implementation

A couple of tricks

One thing to know is that OCaml may not flush to the standard output when you expect it to. This can be forced by calling:

flush stdout;

Now, to rewrite over the current line of the terminal, the only thing to do is to prefix the printed string with \r, as in:

print_string ("\r"^(progress_bar_string config percentage)^"ending in : "^remaining_time_str^"s");

You may find more information about \n and \r on StackOverflow.

The implementation

I decided to call this library Prokope. Maybe on opam someday ;)

Prokope utils

Let’s start with the helpers needed to display and configure a progress bar.

prokopeUtils.ml

(** Various helpers for Prokope *)

(** Graphical parameters for the progress bar *)
type progress_bar_config = {
  width: int;
  cell_filled: string;
  cell_empty: string
}


(** The width of the terminal *)
let default_width =  
  match Terminal_size.get_columns () with  
  | Some(a) -> a
  | None -> 80 


(** A default progress bar *)
let default_progress_bar = {
  width = default_width; 
  cell_filled = "#";
  cell_empty = "-"
}
  

(** For a collection 'a t returns a collection of type (int, 'a) t *)
let add_indices_to_collection map collection =
  
  let index = ref (-1) in

  let add_index_to_element e =
    index := !index + 1;
    (!index, e)
  in
  
  map add_index_to_element collection


(** Repeats the string s n times *)
let repeat_string s n = 
  List.init n (fun _ -> ()) 
  |> List.fold_left (fun e _ -> s^e) ""


(** Returns the string represented by the progress bar configuration and the percentage *)
let progress_bar_string config percentage =
  
  let gauge_string n_steps inner_width =
    let remaining_steps = inner_width - n_steps in
    "["^
    (repeat_string config.cell_filled n_steps)^
    (repeat_string config.cell_empty remaining_steps)^
    "]"
  in

  let inner_width = max (config.width - 40) 0 in
  
  let n_steps = int_of_float (percentage *. (float_of_int inner_width)) in
  
  let rounded_percentage_str = Printf.sprintf "%.2f" (100. *. percentage) in
  
  (gauge_string n_steps inner_width)^" "^rounded_percentage_str^"% "


(** For a fonction ('a -> 'b) returns a function ( ('a, int) -> 'b )
 * "aware" of its starting time and operations to perform *)
let progress_wrapper config list_length starting_time f a =

  let i, x = a in
  
  let percentage = (float_of_int i +. 1.) /. (float_of_int list_length) in
  
  let time_taken = (Unix.gettimeofday() -. starting_time) in
  
  let speed = (float_of_int i) /. time_taken in
  
  let remaining_time_str = (float_of_int (list_length - i)) /. speed
    |> Printf.sprintf "%.2f" 
  in
  
  print_string ("\r"^(progress_bar_string config percentage)^"ending in : "^remaining_time_str^"s");
  
  flush stdout;
  
  f x 

The functor

Why a functor? To enable a user to generate a progress bar for any finite collection! Imagine that your data is stored in a binary tree for some reason. Having map and length functions over trees is trivial (and they may already belong to your tree module). Calling Prokope(MyTree) would then return a new module whose map shows a progress bar.

prokope.ml

(** User intended functor *)

(** Any collection whose elements can be counted and mapped.
 Note that length has to be implemented (this disqualifies
 modules such as Seq). *)
module type FINITECOLLECTION = sig
  
  type 'a t
  val map : ('a -> 'b) -> 'a t -> 'b t 
  val length : 'a t -> int

end


(** Creates a new map function, reporting the progress in the terminal *)
module Prokope(Collection : FINITECOLLECTION) = struct

  (** Applies f to a finite collection l, while reporting progress in the terminal *)
  let map f l =
  
    let open ProkopeUtils in

    let starting_time = Unix.gettimeofday() in
    let wrapped_f = progress_wrapper default_progress_bar (Collection.length l) starting_time f in
    
    let res = add_indices_to_collection Collection.map l
    |> Collection.map wrapped_f in
    
    print_string "\n";
    res

end


module UsualArray  = struct 
  type 'a t = 'a array
  include Array
end

module Array = Prokope(UsualArray)


module UsualList  = struct 
  type 'a t = 'a list
  include List
end


module List = Prokope(UsualList)

Testing and displaying

run.ml

let dummy a = 
  ignore (Unix.select [] [] [] 0.1);
  a

let double x = 2 * x 

let id x = x

let () =
  let sample = List.init 50 id in
  assert(List.map double sample = Prokope.List.map double sample);
  let _ = Prokope.List.map dummy sample in
  ()

run.sh

ocamlfind opt -package unix,terminal_size -linkpkg -o run.byte prokopeUtils.ml prokope.ml run.ml 

./run.byte

Hope you liked it! If you feel you may need it some day, or have suggestions for improvement, please let me know, and I will improve and publish it!

Estimating the parameters of a CEV Process

The CEV Process

In mathematical finance, the CEV or constant elasticity of variance model is a stochastic volatility model which was developed by John Cox in 1975. It captures the fact that the log returns may not have a constant volatility, which the usual geometric Brownian motion assumes.

The equation of the CEV model is the following.

\[dS_t = \mu S_t dt+\sigma S_t^{\gamma}dW_t\]

For \(\gamma = 1\) we have the usual equation for the geometric Brownian motion:

\[dS_t = \mu S_t dt+\sigma S_t dW_t\]

Simulating processes

Using a discretization scheme, we can simulate these processes.
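In equation form, the update implemented by both classes below is the explicit Euler scheme: for a time step \(\Delta t\) and independent standard Gaussian draws \(Z\),

\[S_{t+\Delta t} = S_t + \mu S_t \Delta t + \sigma S_t^{\gamma} \sqrt{\Delta t}\, Z\]

with \(\gamma = 1\) recovering the geometric Brownian motion update.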

Geometric Brownian motion

The geometric Brownian motion can be simulated using the following class.

import numpy as np


class ProcessGBM:

    def __init__(self, mu, sigma):
        self._mu = mu
        self._sigma = sigma

    def Simulate(self, T=1, dt=0.001, S0=1.):
        n = round(T / dt)
        
        mu = self._mu
        sigma = self._sigma

        gaussian_increments = np.random.normal(size=n - 1)
        res = np.zeros(n)
        res[0] = S0
        S = S0
        sqrt_dt = dt ** 0.5
        for i in range(n - 1):
            S = S + S * mu * dt + sigma * \
                S * gaussian_increments[i] * sqrt_dt
            res[i + 1] = S

        return res

Following a similar logic, we can generate trajectories for a CEV process.

import numpy as np


class ProcessCEV:

    def __init__(self, mu, sigma, gamma):
        self._mu = mu
        self._sigma = sigma
        self._gamma = gamma

    def Simulate(self, T=1, dt=0.001, S0=1.):
        n = round(T / dt)
        
        mu = self._mu
        sigma = self._sigma
        gamma = self._gamma

        gaussian_increments = np.random.normal(size=n - 1)
        res = np.zeros(n)
        res[0] = S0
        S = S0
        sqrt_dt = dt ** 0.5
        for i in range(n - 1):
            S = S + S * mu * dt + sigma * \
                (S ** gamma) * gaussian_increments[i] * sqrt_dt
            res[i + 1] = S

        return res

Which can then be plotted:

import matplotlib.pyplot as plt

from ProcessGBM import ProcessGBM
from ProcessCEV import ProcessCEV

T = 20
dt = 0.01
plt.plot(ProcessGBM(0.05, 0.15).Simulate(T, dt), label="GBM")
plt.plot(ProcessCEV(0.05, 0.15, 0.5).Simulate(
    T, dt), label="CEV (gamma = 0.5)")
plt.plot(ProcessCEV(0.05, 0.15, 1.5).Simulate(
    T, dt), label="CEV (gamma = 1.5)")

plt.xlabel('Time index')
plt.ylabel('Value')

plt.title("Realization of stochastic processes")

plt.legend()

plt.show()


Realisation of various stochastic processes with different parameters

And we can “see” the properties of the trajectories: a low value of \(\gamma\) yields smoother, almost linear trajectories, whereas a \(\gamma\) higher than \(1\) allows the volatility to increase for large values of the process (the trajectory in red).

Estimating the CEV parameters

Theory

A detailed method to estimate the various parameters can be found here: K. L. Chu, Hailiang Yang and Kam Yuen (2001). Estimation in the Constant Elasticity of Variance Model. British Actuarial Journal, 7. 10.1017/S1357321700002233.

Note that they use slightly different notations, which will be kept in this section (the code for the estimator, however, translates everything back in terms of the original notations).

\[dS_t = \mu S_t dt+\delta S_t^{\theta / 2}dW_t\]

Calling \(\sigma^2_t=\delta^2 S_t^{\theta - 2}\) (note that, matching the original notations, \(\delta = \sigma\) and \(\theta = 2\gamma\)),

The paper recalls that \(V_t\) is an estimate of \(\sigma^2_t\), where \(V_t\) is:

\[V_t=\frac{2}{\alpha \Delta t} \left( \frac{S_{t+\Delta t}^{1+\alpha} - S_{t}^{1+\alpha}} {(1+\alpha)S_{t}^{1+\alpha}} - \frac{S_{t+\Delta t} - S_{t}} {S_{t}} \right) \frac{1}{\Delta t}\]

For \(\alpha\) a constant.

It looks like there is a typo in the original paper: after a quick implementation (see below), I realized that the estimator does not work with this formula, but does work when I remove one of the \(\frac{1}{\Delta t}\) factors. Therefore, in what follows, this formula will be implemented:

\[V_t=\frac{2}{\alpha \Delta t} \left( \frac{S_{t+\Delta t}^{1+\alpha} - S_{t}^{1+\alpha}} {(1+\alpha)S_{t}^{1+\alpha}} - \frac{S_{t+\Delta t} - S_{t}} {S_{t}} \right)\]

Then, the variance of \(V_t\) is minimized for a value of (\(c_i\) are negative constants)

\[\alpha=c_1 + c_2\frac{\mu}{\sigma^2_t}\]

The authors suggest starting from a random value of \(\alpha\) and iterating over successive estimations of \(\mu\) and \(\sigma_t\). However, \(\sigma_t\) not being constant, this formula did not make a lot of sense to me. I chose a constant value of \(\alpha\) in the code, which provides good results.

Proof sketch : Itô’s lemma

In order to see where \(V_t\) comes from, we can apply Itô's formula to \(S^{1+\alpha}_t\):

\[d(S^{1+\alpha}_t)=(1+\alpha)S^{\alpha}_t dS_t + \frac{1}{2} \alpha (1+\alpha) S^{\alpha - 1}_t d<S_t>\]

Now:

\[d<S_t> = \delta^2 S^{\theta}_t dt\]

Therefore, dividing by \((1+\alpha)S^{1+\alpha}_t\):

\[\frac{d(S^{1+\alpha}_t)}{(1+\alpha)S^{1+\alpha}_t} - \frac{dS_t}{S_t} = \frac{1}{2} \alpha \delta^2 S^{\theta - 2}_t dt\]

Finally:

\[\frac{2}{\alpha dt} \left( \frac{d(S^{1+\alpha}_t)}{(1+\alpha)S^{1+\alpha}_t} - \frac{dS_t}{S_t} \right) = \delta^2 S^{\theta - 2}_t\]

And it works for any \(\alpha\). To be honest, I am quite uncomfortable with the meaning to give to every equation, but, at some level, it makes sense that \(V_t\) “has to be” of the form above.

Python implementation

import numpy as np


class EstimatorCEV:

    def __init__(self, dt):
        self._dt = dt
        self._alpha0 = -5
    
    def Estimate(self, trajectory):
        sigma, gamma = self._evaluate_sigma_gamma(trajectory, self._alpha0)
        if sigma is None:
            return None, None, None
        else:
            mu = self._estimate_mu(trajectory)
            return (mu, sigma, gamma)

    def _log_increments(self, trajectory):
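        # Relative increments (S_{t+dt} - S_t) / S_t, a discrete proxy for dS_t / S_t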
        return np.diff(trajectory) / trajectory[:-1]
    
    def _estimate_mu(self, trajectory):
        return np.mean(self._log_increments(trajectory)) / self._dt 

    def _log_increments_alpha(self, trajectory, alpha):
        mod_increments = self._log_increments(trajectory ** (1 + alpha))
        return mod_increments / (1 + alpha)

    def _evaluate_Vt(self, trajectory, alpha):
        lhs = self._log_increments_alpha(trajectory, alpha)
        rhs = self._log_increments(trajectory)
        center = 2 * (lhs - rhs) / (alpha * self._dt)
        return center

    def _evaluate_sigma_gamma(self, trajectory, alpha):
        if np.any(trajectory <= 0):
            return None, None
        
        Vts = self._evaluate_Vt(trajectory, alpha)
        if np.any(Vts <= 0):
            return None, None
        logVts = np.log(Vts)

        Sts = trajectory[:-1]  # removes the last term as in eq. (5)
        if np.any(Sts <= 0):
            return None, None
        logSts = np.log(Sts)

        ones = np.ones(Sts.shape[0])
        A = np.column_stack((ones, logSts))

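        # Least-squares fit of log V_t against log S_t: the intercept and
        # slope are then mapped back to the (sigma, gamma) parametrization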
        res = np.linalg.lstsq(A, logVts, rcond=None)[0]
        return (2 * np.exp(res[0] / 2), 0.5 * (res[1] + 2))


if __name__ == "__main__":

    from ProcessCEV import ProcessCEV

    def test(true_mu, true_sigma, true_gamma):
        dt = 0.001
        T = 10
        
        sample_mu = []
        sample_sigma = []
        sample_gamma = []

        for i in range(100):
            mu_est, sigma_est, gamma_est = EstimatorCEV(dt=dt).Estimate(ProcessCEV(
                true_mu, true_sigma, true_gamma).Simulate(T, dt=dt))
            
            if mu_est is not None:
                sample_mu = [mu_est] + sample_mu
                sample_sigma = [sigma_est] + sample_sigma
                sample_gamma = [gamma_est ] + sample_gamma
    
        print("mu : " + str(true_mu) + " \t| est : " + str(np.mean(sample_mu)) + " \t| std : " + str(np.std(sample_mu)))
        print("sigma : " + str(true_sigma) + " \t| est : " + str(np.mean(sample_sigma)) + " \t| std : " + str(np.std(sample_sigma)))
        print("gamma : " + str(true_gamma) + " \t| est : " + str(np.mean(sample_gamma)) + " \t| std : " + str(np.std(sample_gamma)))
        print(10*"-")


    test(0.,0.5,0.8)
    test(0.2,0.5,0.8)
    test(0.2,0.5,1.2)
    test(0.,0.3,0.2)
    test(0.,0.5,2)

Running the tests…

mu : 0.0  | est : 0.08393998598504683   | std : 0.10950360432103644
sigma : 0.5   | est : 0.5290411001633833  | std : 0.010372415832707949
gamma : 0.8   | est : 0.7995912270171245  | std : 0.023552858445145052
----------
mu : 0.2  | est : 0.21126788898622506   | std : 0.1379314635624178
sigma : 0.5   | est : 0.5295722080080916  | std : 0.01067649238716792
gamma : 0.8   | est : 0.7981059837608774  | std : 0.019084913069205692
----------
mu : 0.2  | est : 0.20468181972280103   | std : 0.16048227152782532
sigma : 0.5   | est : 0.5299253491070517  | std : 0.008389773769199063
gamma : 1.2   | est : 1.2042365895022817  | std : 0.021453412919601324
----------
mu : 0.0  | est : 0.0743428420875073  | std : 0.07189306488901058
sigma : 0.3   | est : 0.31771919915814173   | std : 0.006063917451139664
gamma : 0.2   | est : 0.2080203482037684  | std : 0.039438575768452104
----------
mu : 0.0  | est : -0.009815876356783935   | std : 0.11178850054168586
sigma : 0.5   | est : 0.5311825321732557  | std : 0.013167973885022026
gamma : 2   | est : 2.0050079526000903  | std : 0.03448408469682434
----------

It worked!

Learning more

I used to study stochastic processes a while ago, and reading Pierre Henry-Labordère's Analysis, Geometry & Modeling in Finance (which I recommend if you are interested in stochastic processes, differential geometry and mathematical finance at the same time) and A Non-Gaussian Option Pricing Model with Skew by L. Borland and J.P. Bouchaud (https://arxiv.org/abs/cond-mat/0403022) led me to spend more time understanding them. Hence this small tutorial; hope you liked it.

Financial Modelling with Jump Processes is a must-read if you want a detailed view of the non-Gaussian models used for financial assets. It covers the theoretical side of these models as well as simulation and estimation procedures.

OCaml Owl - Scientific computing and machine learning in OCaml

Introduction

OCaml seemed to lack a decent library for scientific computing. I came across Owl and the official repo and wanted to give it a try.

It has support for matrix operations, dataframes, and simple plots based on PLplot (I had used the OCaml port of PLplot before, and it was quite undocumented)…

Install Owl

On my Ubuntu (16) it all starts with…

opam install owl

Unfortunately, soon enough:

[ERROR] The compilation of conf-openblas failed at "sh -exc cc $CFLAGS test.c -lopenblas".

Opam suggests the following:

=-=- conf-openblas.0.2.0 troobleshooting =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
=> This package relies on external (system) dependencies that may be missing. `opam depext conf-openblas.0.2.0' may help you find the correct installation for your system.

After executing the command opam depext conf-openblas.0.2.0, the following packages are installed:

The following additional packages will be installed:
  liblapacke libopenblas-base libtmglib-dev libtmglib3

And re-running:

opam install owl

It works! Let’s get started.

Examples

SVMs

The package comes with a list of examples. Let’s start with support vector machines! Note that I removed the first line from the actual version of the example below.

(* This example demonstrates SVM regression. *)

open Owl

let generate_data () =
  let open Mat in
  let c = 500 in
  let x1 = (gaussian c 2 *$ 2.) in
  let a, b = float_of_int (Random.int 15), float_of_int (Random.int 15) in
  let x1 = map_at_col (fun x -> x +. a) x1 0 in
  let x1 = map_at_col (fun x -> x +. b) x1 1 in
  let x2 = (gaussian c 2 *$ 2.) in
  let a, b = float_of_int (Random.int 15), float_of_int (Random.int 15) in
  let x2 = map_at_col (fun x -> x +. a) x2 0 in
  let x2 = map_at_col (fun x -> x +. b) x2 1 in
  let y1 = create c 1 ( 1.) in
  let y2 = create c 1 ( -1.)in
  let x = concat_vertical x1 x2 in
  let y = concat_vertical y1 y2 in
  x, y


let draw_line x0 y0 p =
  let a, b, c = 0., 20., 100 in
  let x' = Mat.empty 1 c in
  let y' = Mat.empty 1 c in
  for i = 0 to c - 1 do
    let x = a +. (float_of_int i *. (b -. a) /. float_of_int c) in
    let y = (Mat.get p 0 0 *. x +. Mat.get p 2 0) /. (Mat.get p 1 0 *. (-1.)) in
    Mat.set x' 0 i x; Mat.set y' 0 i y
  done;
  let h = Plot.create "plot_svm.png" in
  Plot.(plot ~h ~spec:[ RGB (100,100,100) ] x' y');
  Plot.(scatter ~h ~spec:[ RGB (150,150,150) ] x0 y0);
  Plot.output h


let _ =
  let _ = Random.self_init () in
  let x, y = generate_data () in
  let r = Regression.D.svm ~i:true x y in
  let p = Mat.(r.(0) @= r.(1)) in
  draw_line (Mat.col x 0) (Mat.col x 1) p;
  Owl_pretty.print_dsnda Mat.(r.(0) @= r.(1))

Using OCamlBuild, the only thing to do is to write the following small script:

ocamlbuild -use-ocamlfind -pkgs 'owl' svm.byte
./svm.byte
rm svm.byte

The algorithm should output the following:

--- Training config
    epochs         : 1000
    batch          : full
    method         : gradient descent
    loss           : Hinge
    learning rate  : adagrad 1
    regularisation : l2norm (alhpa = 0.001)
    momentum       : none
    clipping       : none
    stopping       : const (a = 1e-16)
    checkpoint     : none
    verbosity      : true
---
2019-04-15 14:12:02.638 INFO : T: 00s | E: 1.0/1000 | B: 1/1000 | L: 1800.944031[-]
2019-04-15 14:12:02.640 INFO : T: 00s | E: 2.0/1000 | B: 2/1000 | L: 1986.033647[▲]
2019-04-15 14:12:02.641 INFO : T: 00s | E: 3.0/1000 | B: 3/1000 | L: 851.791568[▼]
2019-04-15 14:12:02.641 INFO : T: 00s | E: 4.0/1000 | B: 4/1000 | L: 256.961005[▼]
2019-04-15 14:12:02.642 INFO : T: 00s | E: 5.0/1000 | B: 5/1000 | L: 81.197291[▼]
2019-04-15 14:12:02.643 INFO : T: 00s | E: 6.0/1000 | B: 6/1000 | L: 38.378032[▼]
2019-04-15 14:12:02.644 INFO : T: 00s | E: 7.0/1000 | B: 7/1000 | L: 22.740718[▼]
2019-04-15 14:12:02.644 INFO : T: 00s | E: 8.0/1000 | B: 8/1000 | L: 15.785715[▼]
2019-04-15 14:12:02.645 INFO : T: 00s | E: 9.0/1000 | B: 9/1000 | L: 11.920411[▼]

And produce the following plot (or something similar since the data is generated randomly).


Illustration of a linear SVM in Owl

Plots

A full tutorial can be found here. Having used PLplot directly before, I can say this interface is indeed much lighter!

open Owl

(* N(0,1) sample *)
let y = Mat.gaussian 5000 2 ;;

(* Initiates the plot *)
let h = Plot.create "me.png" ;;

(* Side effects *)
Plot.set_background_color h 255 255 255;

(* More side effects *)
Plot.set_title h "Bivariate model";

(* Simple scatter *)
Plot.scatter ~h (Mat.col y 0) (Mat.col y 1);
Plot.output h;;

Matrix operations and Monte Carlo

Playing with matrices, broadcasting operations… is quite easy, and the syntax prevents you from subtracting floats and matrices without knowing it. Which is, from my point of view, a good thing. Let's implement a simple variance reduction method based on the generation of antithetic random variables (see for example this course).

let antithetic_simulation mu sigma n =
    (* draw n/2 Gaussian samples and complete them with their
       reflections around the mean: X' = 2 * mu - X *)
    let sample = Mat.gaussian ~mu ~sigma (n/2) 1 in
    let anti_sample = Mat.((2. *. mu) $- sample) in
    Mat.(sample @= anti_sample)

The various operations in Mat.( ... ) have different meanings. The $- stresses that we are dealing with a float on the left-hand side and a matrix on the right-hand side, whereas @= stands for the vertical concatenation of matrices. All the details can be found on the official website.

Other resources

Merlin

Given the large number of functions (more than 7000!), Merlin is a must-have!

Merlin and Owl

Merlin's auto-completion. Quite dense!

Concluding remarks

I have just been playing with this new library for a couple of days, but it is promising! What is next? Maybe benchmarks, tutorials? Using a GPU with Owl? Let me know in the comments :)

Topological Data Analysis - A Python tutorial

Introduction

Topological data analysis (TDA) allows one to weaken the hypotheses needed when doing statistics. A lot of research in this field has been done in recent years, and [1] and [4] provide brilliant expositions of the mathematical concepts behind TDA. Here, I want to focus on one aspect of TDA: compressed representations of shapes.

A simple topological data analysis pipeline

We will focus on the article [2] from the references, whose pipeline gives an idea of the shape of the data. More precisely, we will go through the various steps that enabled the authors of [2] to produce the following figure:

Illustration of the pipeline

Fig. 1: The pipeline

Overall pipeline

The authors go through the following steps to represent the new shape:

A) A 3D object (hand) represented as a point cloud.

B) A filter value is applied to the point cloud and the object is now colored by the values of the filter function.

C) The data set is binned into overlapping groups.

D) Each bin is clustered and a network is built.

Therefore, we will need a “filter function”, a method to cluster a point cloud (the “clusterer”), a number n of intervals and an overlap parameter. Let's summarize this in the following class.

class TDAPipeline():

    def __init__(self, filter_function, clusterer, n, overlap):
        self._filter_function = filter_function
        self._clusterer = clusterer
        self._n = n
        self._overlap = overlap

        # (string, int list) dict
        self._clusters = {}

        # string list
        self._clusters_names = []

        # (string, string) list
        self._edges = []

We are now ready to dig into the other concepts.

The filters

A filter function associates a real-valued quantity with each data point. It can be the coordinate of the point along a chosen axis, or its coordinate on the first component of a PCA.

class AxisFilter():

    def __init__(self, i):
        self._i = i

    def Fit(self, data):
        pass

    def Filter(self, data):
        return data[:,self._i]
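
The PCAFilter used in the driver script at the end of this post is not listed in the article; a minimal sketch consistent with the same Fit/Filter interface, relying on scikit-learn (this implementation is my assumption, saved as PCAFilter.py to match the import), could be:

from sklearn.decomposition import PCA


class PCAFilter():

    def __init__(self):
        # keep only the first principal component
        self._pca = PCA(n_components=1)

    def Fit(self, data):
        self._pca.fit(data)

    def Filter(self, data):
        # coordinate of each point on the first component
        return self._pca.transform(data)[:, 0]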

An interesting discovery in this article is the concept of “L-infinity centrality”. This filter associates with each point a real number quantifying how close it is to the boundary of the cloud.

L-infinity centrality, is defined for a data point x to be the maximum distance from x to any other data point in the set. Large values of this function correspond to points that are far from the center of the data set.

import numpy as np
from scipy.spatial.distance import pdist, squareform

class LinfinityCentrality():

    def __init__(self, distance):
        self._distance = distance
    
    
    def Fit(self, data):
        # Barnes Hut approx. or something similar may speed up the filtering part later
        pass


    def Filter(self, data):
        pairwise_dist = squareform(pdist(data, self._distance)) 
        return np.amax(pairwise_dist, axis = 1)
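
As a quick illustration (my own, not from the article), the most extreme points of a Gaussian cloud receive the largest centrality values:

import numpy as np

points = np.random.randn(200, 3)
centrality = LinfinityCentrality('euclidean').Filter(points)
# points on the fringe of the cloud maximize the distance to the rest
print(centrality.min(), centrality.max())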

The clustering

Once the points have been associated to an interval in the image of the filter function, they have to be clustered. The authors use the following method:

We first construct the single linkage dendrogram for the data in the bin, and record the threshold values for each transition in the clustering. We select an integer k, and build a k-interval histogram of these transition values. The clustering is performed using the last threshold before the first gap in this histogram.

This can be implemented by taking advantage of the single-linkage implementation provided by scipy.

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

class SingleLinkageClusterer():

    def __init__(self, k):
        self._k = k

    def Cluster(self, data):
        if data.shape[0] == 0:
            return []
        
        # scipy's linkage uses single linkage by default
        linkage_matrix = linkage(data)
        counts, thresholds = np.histogram(linkage_matrix[:, 2], bins=self._k)
        first_gap_index = self._first_zero(counts, -1)
        if(first_gap_index > -0.5):
            threshold = thresholds[first_gap_index]
        else:
            threshold = linkage_matrix[-1, 2]
        clusters = fcluster(linkage_matrix, t=threshold, criterion='distance')
        return clusters

    def _first_zero(self, arr, invalid_val=-1):
        mask = arr == 0
        return np.where(mask.any(), mask.argmax(), invalid_val)
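
A quick sanity check (my own example): with two well-separated blobs, the first gap in the histogram of merge distances sits between the within-blob merges and the single between-blob merge, so two clusters should come out:

import numpy as np

blobs = np.vstack((np.random.randn(50, 2),
                   np.random.randn(50, 2) + 20.))
labels = SingleLinkageClusterer(k=5).Cluster(blobs)
print(np.unique(labels))  # two distinct cluster labels expected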

Some classical shapes

We now have what we need to follow the pipeline of the authors. Let's just sample some points from various shapes with the following classes:

class Sampler():

    def __init__(self):
        raise NotImplementedError()

    def Sample(self, n):
        raise NotImplementedError()

It is quite easy to sample points uniformly over a sphere (up to numerical precision issues): sample from a rotation-invariant distribution over R^3 (here, three independent Gaussians with the same mean and variance) and normalize the result.

import numpy as np
from sklearn.preprocessing import normalize
from Sampler import Sampler


class SphereSampler(Sampler):

    def __init__(self, r=1):
        self._r = r

    def Sample(self, n):
        '''
        Uniform sampling over the sphere.
        '''
        mat = np.random.randn(n, 3)
        return self._r * normalize(mat, axis=1, norm='l2')
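
A quick check (again my own) that the sampled points do lie on the sphere of radius \(r\):

import numpy as np

s = SphereSampler(r=2).Sample(1000)
print(np.allclose(np.linalg.norm(s, axis=1), 2.))  # True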


I could not find a simple recipe for a uniform distribution over the torus, but the following should be “uniform enough”.

import numpy as np
from Sampler import Sampler


class TorusSampler(Sampler):

    def __init__(self, r=1, R=3, x0=0, y0=0, z0=0):
        self._r = r
        self._R = R
        self._x0 = x0
        self._y0 = y0
        self._z0 = z0
    
    def Sample(self, n):
        '''
        This sampling may not be uniform.
        '''
        u = 2 * np.pi * np.random.rand(n)
        v = 2 * np.pi * np.random.rand(n)

        cosu = np.cos(u)
        sinu = np.sin(u)

        cosv = np.cos(v)
        sinv = np.sin(v)

        x = self._x0 + (self._R + self._r * cosu) * cosv
        y = self._y0 + (self._R + self._r * cosu) * sinv
        z = self._z0 + self._r * sinu

        return np.array([x, y, z]).T
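
The TesseractSampler imported in the driver script below is not shown in this post either; a plausible minimal version (my assumption: uniform sampling over the boundary of the hypercube \([0, 1]^4\), saved as TesseractSampler.py) could be:

import numpy as np
from Sampler import Sampler


class TesseractSampler(Sampler):

    def __init__(self):
        # no parameters needed
        pass

    def Sample(self, n):
        '''
        Sample over the boundary of [0, 1]^4: draw a point in the
        hypercube, then push it onto a randomly chosen 3-dimensional face.
        '''
        points = np.random.rand(n, 4)
        fixed_axis = np.random.randint(0, 4, size=n)
        side = np.random.randint(0, 2, size=n).astype(float)
        points[np.arange(n), fixed_axis] = side
        return points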

Experiments

Following the method of the authors

We are now ready to implement the full pipeline and produce sensible graphs.

import numpy as np


class TDAPipeline():

    def __init__(self, filter_function, clusterer, n, overlap):
        self._filter_function = filter_function
        self._clusterer = clusterer
        self._n = n
        self._overlap = overlap

        # (string, int list) dict
        self._clusters = {}

        # string list
        self._clusters_names = []

        # (string, string) list
        self._edges = []

    def Run(self, data):
        self._filter_function.Fit(data)

        # one dimensional array : f(data), f : R^n -> R
        filtered_data = self._filter_function.Filter(data)
        lb = np.min(filtered_data)
        ub = np.max(filtered_data)
        overlapping_intervals = self._generate_overlapping_intervals(
            lb, ub, self._n, self._overlap)

        # f^-1([a,b]) for [a,b] in the overlapping intervals
        sliced_clusters = []
        for i, interval in enumerate(overlapping_intervals):
            # keep the points whose filter value falls in the interval
            in_interval = [self._between_pair(x, interval)
                           for x in filtered_data]
            indices = np.where(in_interval)[0]
            filtered_points = data[indices, :]
            clusters = self._clusterer.Cluster(filtered_points)
            points_by_cluster = self._to_dict_list(
                str(i) + "_", indices, clusters)
            sliced_clusters.append(points_by_cluster)

        self._clusters = self._flatten_dict_list(sliced_clusters)
        self._clusters_names = list(self._clusters.keys())
        self._edges = self._build_edges(sliced_clusters)

        return self._edges

    def ClusterAverage(self, target):
        if len(self._clusters) == 0:
            raise RuntimeError("The algorithm has not been trained!")

        indices_dict = self._clusters

        return [np.mean(target[indices_dict[n]]) for n in self._clusters_names]

    def _build_edges(self, sliced_clusters):
        edges = []
        for sets_prev in sliced_clusters:
            for sets_new in sliced_clusters:
                for prev_key in sets_prev.keys():
                    for new_key in sets_new.keys():
                        if not set(sets_new[new_key]).isdisjoint(set(sets_prev[prev_key])):
                            edges.append((prev_key, new_key))
        return edges

    def _flatten_dict_list(self, dict_list):
        res = {}
        for dicts in dict_list:
            res.update(dicts)
        return res

    def _to_dict_list(self, prefix, vector, clusters):
        cluster_names = np.unique(clusters)
        res = {}
        for cluster_name in cluster_names:
            indices_in_cluster = [cluster_name == x for x in clusters]
            points_in_cluster = vector[np.where(indices_in_cluster)[0]]
            res.update({prefix + str(cluster_name): points_in_cluster})
        return res

    def _generate_overlapping_intervals(self, lb, ub, n, overlap):
        step = 1. * (ub - lb) / n
        res = [(lb + i * step - overlap * step, lb + (i + 1)
                * step + overlap * step) for i in range(n)]
        return res

    def _between_pair(self, x, pair):
        a, b = pair
        return x > a and x < b
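
Before looking at the figures, a minimal smoke test (my own, with arbitrary parameters) checks that the pipeline runs end to end:

from AxisFilter import AxisFilter
from SingleLinkageClusterer import SingleLinkageClusterer
from TorusSampler import TorusSampler
from TDAPipeline import TDAPipeline

pipeline = TDAPipeline(AxisFilter(0), SingleLinkageClusterer(5), 10, 0.2)
edges = pipeline.Run(TorusSampler(1, 3).Sample(1000))
# a torus filtered along an axis should yield a graph that forms a loop
print(len(pipeline._clusters_names), "clusters,", len(edges), "edges")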

Illustrations

Did it work? :) I found a hand data set, “Free realistic hand”, here; it required a little bit of parsing to put it in a 3D file.

Compressed representation of the hand

Fig. 2: The pipeline, data : hand, filter : PCA, clustering : single linkage

Unfortunately, the sampling around the wrist is not dense enough, so the wrist ends up as the two blue dots, separated from the rest of the figure. But we do recognize the four fingers (in red), well separated from the thumb (in green).

But we can generate synthetic data to get rid of this sampling issue. For example, with a torus:

Compressed representation of a torus

Fig. 3: A torus, filter : first component, clustering : single linkage

Or two touching tori:

Compressed representation of two touching tori

Fig. 4: Two tori, filter : first component, clustering : single linkage

For completeness, here is the full driver script used to produce these figures:
import numpy as np

from AxisFilter import AxisFilter
from PCAFilter import PCAFilter
from LinfinityCentrality import LinfinityCentrality

from SingleLinkageClusterer import SingleLinkageClusterer

from TesseractSampler import TesseractSampler 
from TorusSampler import TorusSampler
from SphereSampler import SphereSampler

from TDAPipeline import TDAPipeline

from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt

import networkx as nx
from networkx.drawing.nx_agraph import graphviz_layout


def plot_tda_pipeline(title, sample, filter_function, clusterer, N, OVERLAP):
    tdap = TDAPipeline(filter_function, clusterer, N, OVERLAP)
    edges = tdap.Run(sample)

    colors = filter_function.Filter(sample)
    cluster_colors = tdap.ClusterAverage(colors)

    fig = plt.figure()
    ax = fig.add_subplot(211, projection='3d')

    x = sample[:, 0]
    y = sample[:, 1]
    z = sample[:, 2]

    ax.scatter(x, y, z, c=colors, marker='o')

    ax.set_xlabel('X Label')
    ax.set_ylabel('Y Label')
    ax.set_zlabel('Z Label')

    plt.title(title)
    fig.add_subplot(212)

    G = nx.OrderedGraph()  # without OrderedGraph, the nodes may be shuffled (and the colors would not match)
    G.add_nodes_from(tdap._clusters_names)
    G.add_edges_from(edges)
    pos = graphviz_layout(G)
    nx.draw(G, pos, with_labels = True)
    nx.draw_networkx_nodes(G, pos,
            node_color = cluster_colors)

    plt.savefig(title)


N = 10
OVERLAP = 0.2
SAMPLE_SIZE = 2000
K = 5

named_samples = [("Torus", TorusSampler(1, 3).Sample(SAMPLE_SIZE)),
       ("Tesseract", TesseractSampler().Sample(SAMPLE_SIZE)),          
        ("Two Spheres", np.vstack((SphereSampler().Sample(SAMPLE_SIZE),
                                            SphereSampler(3).Sample(SAMPLE_SIZE)))),
                 ("Two touching tori", np.vstack((TorusSampler(1, 3).Sample(SAMPLE_SIZE),
                                                  TorusSampler(1, 3, 6, 0, 0).Sample(SAMPLE_SIZE)))),
                 ("Hand", np.loadtxt("./Hand.csv", delimiter=","))]

named_filters = [("Axis filter (x)", AxisFilter(0)),
                 ("PCA", PCAFilter()),
                 ("L-Linfinity Centrality", LinfinityCentrality('euclidean'))]

for named_filter in named_filters:
    name_filter, filter_function = named_filter
    for named_sample in named_samples:
        name_data, sample = named_sample
        title = "Data : " + name_data + ", " + "Filter : " + name_filter
        plot_tda_pipeline(title, sample, filter_function,
                          SingleLinkageClusterer(K), N, OVERLAP)

References

Articles

[1] F. Chazal and B. Michel, “An introduction to Topological Data Analysis: fundamental and practical aspects for data scientists,” arXiv:1710.04019 [cs, math, stat], Oct. 2017.

[2] P. Y. Lum et al., “Extracting insights from the shape of complex data using topology,” Scientific Reports, vol. 3, no. 1, Dec. 2013.

[3] B. Zieliński, M. Lipiński, M. Juda, M. Zeppelzauer, and P. Dłotko, “Persistence Bag-of-Words for Topological Data Analysis,” arXiv:1812.09245 [cs, math, stat], Dec. 2018.

[4] A. Zomorodian, “Topological Data Analysis,” p. 39.

[5] J. Mathews, S. Nadeem, M. Pouryahya, A. Tannenbaum, and J. O. Deasy, “Topological Data Analysis of PAM50 and 21-Gene Breast Cancer Assays,” bioRxiv, Nov. 2018.