Creating a Vim plugin with Python

My first plugin

As my various OCaml projects grew bigger and bigger, open directives kept accumulating at the top of some files. I missed the feature I had in Visual Studio that automatically sorts the names of the using directives: it allows a much quicker visual inspection of the modules a piece of code needs to run.

As I could not find an easy way to do this, I had a look at the different ways to implement vim plugins.

It works!

Python: import vim

It turns out that plugins can be written directly in Python! In a vim script (mine will be called librarySorter.vim), you can simply write Python code and import vim to read the buffer.

The buffer is exposed as an array-like structure on which you can get and set the i-th element (i.e. the i-th line of your file).
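Since the buffer behaves like a list of strings, the plugin's logic can be prototyped outside Vim with a plain Python list standing in for vim.current.buffer (a sketch; the real object only exists inside Vim's embedded interpreter):

```python
# A plain list stands in for vim.current.buffer: one string per line,
# supporting the same get/set-by-index access the plugin relies on.
buffer = [
    "open Printf",
    "open List",
    "let () = printf \"hi\\n\"",
]

first_line = buffer[0]    # read the i-th line (0-indexed)
buffer[0] = "open Array"  # overwrite a line in place, as the plugin does

print(first_line, "->", buffer[0])
```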

Unfortunately, the documentation for the vim package is quite limited, as this Stack Overflow question stresses.

A nice resource, though, is this presentation.

Coding the plugin

Making sure vim is compiled with python3

if !has('python3')
  echo "Error: Requires vim compiled with +python3"
  finish
endif

Declaring a function

function! LibrarySorter()

" your code

endfunction

Calling Python code

The part written in Python has to be between these two EOF declarations.

python3 << EOF

# python code

EOF

In the case of my plugin, this looks like this:

python3 << EOF
import vim

# Read the whole file as a list
cb = list(vim.current.buffer)

# Filter the open directives
openLines = list(filter(lambda x: x.startswith("open "), cb))

# If the open directives are not all at the top of the file, stop immediately
if not vim.current.buffer[len(openLines) - 1].startswith("open "):
  raise Exception("Error, some open directives are not at the head of your file")

openLines = sorted(openLines)

# And rearrange the lines in the buffer
for i in range(len(openLines)):
  vim.current.buffer[i] = openLines[i]

EOF

And now, in vim, you can simply type :call LibrarySorter() anywhere in your OCaml source and look at the head of your file :)

The code

if !has('python3')
  echo "Error: Requires vim compiled with +python3"
  finish
endif

function! LibrarySorter()

python3 << EOF
import vim

cb = list(vim.current.buffer)

openLines = list(filter(lambda x: x.startswith("open "), cb))

if not vim.current.buffer[len(openLines) - 1].startswith("open "):
  raise Exception("Error, some open directives are not at the head of your file")

openLines = sorted(openLines)

for i in range(len(openLines)):
  vim.current.buffer[i] = openLines[i]

EOF

endfunction

What’s next?

I actually have the same issue when importing modules in Python, TypeScript… It might be nice to generalize this to various languages :)
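For instance, the core of such a generalization could be a single function taking the directive prefix as a parameter (a sketch; sort_directives and its signature are my invention, not part of the plugin above):

```python
def sort_directives(lines, prefix):
    """Sort the leading block of directive lines (e.g. "open " for OCaml,
    "import " for Python) and leave the rest of the file untouched."""
    directives = [line for line in lines if line.startswith(prefix)]
    # Refuse to proceed if some directives are scattered through the file
    if any(not line.startswith(prefix) for line in lines[:len(directives)]):
        raise ValueError("some directives are not at the head of the file")
    return sorted(directives) + lines[len(directives):]

print(sort_directives(["import os", "import abc", "x = 1"], "import "))
# -> ['import abc', 'import os', 'x = 1']
```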

Learning more

OCaml

Real World OCaml: Functional programming for the masses by Yaron Minsky, Anil Madhavapeddy and Jason Hickey is a good introduction to OCaml.

Vim

Practical Vim: Edit Text at the Speed of Thought

Modern Vim: Craft Your Development Environment with Vim 8 and Neovim

Learning the vi and Vim Editors

Using ocamlopt with ocamlbuild

As your OCaml programs grow bigger, it is a good idea to use a build system. Ocamlbuild is quite good at this job! It takes care of your dependencies, finds libraries and, more importantly, compiles the files in the right order.

It took me a while to answer the simple question of how to use ocamlopt instead of the default compiler. Assuming you want to compile a main.ml file, the short answer is to run ocamlbuild main.native instead of ocamlbuild main.byte.

ocamlopt and ocamlc

OCaml comes with two compilers: ocamlc, the bytecode compiler, and ocamlopt, the native-code compiler. If you don’t know which one to use, pick ocamlopt: it produces standalone executables that are usually faster than bytecode.

Example

For a very quick benchmark let’s sort a long list.

let a = List.init 1000000 (fun _ -> Random.int 500) ;;
let b = List.sort compare a ;;

print_string "done;\n" ;;

And compile this with the different compilers…

ocamlbuild main.native
echo "native"
time ./main.native
rm main.native
rm -rf _build/


ocamlbuild main.byte
echo "byte"
time ./main.byte
rm main.byte
rm -rf _build/

And the results are quite convincing.

Finished, 4 targets (0 cached) in 00:00:00.
native
done;

real  0m2.156s
user  0m2.064s
sys 0m0.088s
Finished, 3 targets (0 cached) in 00:00:00.
byte
done;

real  0m6.340s
user  0m6.268s
sys 0m0.068s

Learning more

Real World OCaml: Functional programming for the masses by Yaron Minsky, Anil Madhavapeddy and Jason Hickey is a good introduction to OCaml.

JSON with OCaml

OCaml and JSON

Manipulating JSON (JavaScript Object Notation) is great: it makes it easy to move data around between different programs. The only thing you need is to be able to read and write JSON in all of them. Though this is simple in some languages, others require some work.

OCaml being strongly typed, serializing and deserializing objects is less obvious than in, say, Python.

The tools

Yojson: a low-level JSON library for OCaml

ppx_deriving_yojson: derives JSON converters from type definitions

To get them, simply use opam:

opam install yojson
opam install ppx_deriving_yojson

Optional, for vim users

If you have not done so already, these tools will help you navigate the different functions exposed by the above packages; they provide “intellisense” for OCaml.

Merlin

YouCompleteMe

Note that you have to create a .merlin file in the directory where you are working, to help vim find the relevant packages.

PKG yojson

Type to JSON

What is PPX?

Writing code by hand to turn each type into JSON is particularly tedious and error-prone. In many languages this is straightforward, e.g. json.dumps(obj) in Python. PPX is a feature that provides an API for syntactic extensions in OCaml. deriving yojson generates three functions per type:

# #require "ppx_deriving_yojson";;
# type ty = .. [@@deriving yojson];;
val ty_of_yojson : Yojson.Safe.json -> (ty, string) Result.result
val ty_of_yojson_exn : Yojson.Safe.json -> ty
val ty_to_yojson : ty -> Yojson.Safe.json

So, during compilation, ty_of_yojson and the other functions are generated automatically and silently; the compiler will not complain that they are missing, even though you never had to write them.
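For comparison, here is what the Python one-liner mentioned above looks like; no type declarations, and the inverse operation is just as short:

```python
import json

# Any mix of dicts, lists, strings and numbers serializes directly...
obj = {"s": "hello", "pos": {"x": 1, "y": 2}}
serialized = json.dumps(obj)
print(serialized)  # -> {"s": "hello", "pos": {"x": 1, "y": 2}}

# ...and deserializes back without declaring any type
assert json.loads(serialized) == obj
```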

Nested types

Let’s give it a try and come up with a nested type!

type t = {x: int; y: int} [@@deriving to_yojson]
type u = {s: string; pos: t} [@@deriving to_yojson]
let () = print_endline (Yojson.Safe.pretty_to_string (u_to_yojson {s= "hello"; pos={x= 1; y= 2}}))

ocamlfind opt -package ppx_deriving_yojson -linkpkg -o testJson.byte testJson.ml
./testJson.byte
rm testJson.byte

And…

{ "s": "hello", "pos": { "x": 1, "y": 2 } }

It worked :)

Learning more

Real World OCaml: Functional programming for the masses by Yaron Minsky, Anil Madhavapeddy and Jason Hickey is a good introduction to OCaml.

Installing R Studio on Manjaro

Manjaro is amazing. But it sometimes lacks resources and help for installing “common” (a matter of point of view) packages. sudo pacman -S rstudio will not be enough, unfortunately.

But this will do it:

sudo pacman -S yaourt
sudo pacman -S r
yaourt -S rstudio-desktop-bin

You will have to press “no” a couple of times:

==> Edit PKGBUILD ? [Y/n] ("A" to abort)
==> ------------------------------------
==> n
==> Edit rstudio-desktop-bin.install ? [Y/n] ("A" to abort)
==> -------------------------------------------------------
==> n

And finally, yes, when asked whether you want to continue:

==> Continue building rstudio-desktop-bin ? [Y/n]
==> ---------------------------------------------
==> y

R Studio on Manjaro

Not so fast!

What exactly is yaourt?

Yaourt stands for “Yet Another User Repository Tool”. It is a command-line tool for installing packages on Arch Linux: a wrapper around pacman, the package-management utility shipped with Arch Linux, with extended features and remarkable AUR (Arch User Repository) support.

You may find more information about yaourt [here](https://archlinux.fr/yaourt-en).

It differs from pacman in that pacman only accesses the official repos; packages created by members of the community that have not (yet) made their way into the official repos can be found on the AUR.

Hope this helped!

Random forest vs extra trees

A little bit of context

Quite some time ago, I asked a question on stats.stackexchange about the differences between random forests and extremely randomized trees. Though the answers were good, I was still lacking some information. Beyond the nice theoretical arguments, I ran some simulations to get a better idea of their behavior.

These simulations, performed on toy datasets with specific implementations, allow by no means definite conclusions about the respective behaviors of the two methods; still, I hope they will provide some insight to the reader!

Summary of the simulations

Extra trees seem much faster (about three times) than random forests (at least in the scikit-learn implementation). This is consistent with the theoretical construction of the two learners.

On toy datasets, the following conclusions could be reached:

  • Extra trees seem to keep a higher performance in presence of noisy features,
  • When all the variables are relevant, both methods seem to achieve the same performance.

Theoretical point of view

A decision tree is usually trained by recursively splitting the data. Each split is chosen according to an information criterion, which is maximized (or minimized) by one of the “splitters”. These splitters usually take the form x[i] > t, where t is a threshold and x[i] is the i-th component of an observation.

Decision trees, being prone to overfitting, have been extended to random forests: many trees are trained over various subsamples of the data (in terms of both the observations and the predictors used to train them).

The main difference between random forests and extra trees (more precisely, extremely randomized trees) is that, instead of computing the locally optimal feature/split combination as a random forest does, extra trees draw a random value for the split of each feature under consideration.

This leads to more diversified trees and fewer splitters to evaluate when training an extra-trees model.
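The contrast can be sketched in a few lines: for a given feature, a random forest scans every candidate threshold for the best one, while extra trees draw a single random threshold. This is a toy illustration using variance reduction as the criterion, not scikit-learn’s actual implementation:

```python
import random

def variance(ys):
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys) / len(ys)

def split_score(xs, ys, t):
    """Weighted variance of the two children for threshold t (lower is better)."""
    left = [y for x, y in zip(xs, ys) if x <= t]
    right = [y for x, y in zip(xs, ys) if x > t]
    if not left or not right:
        return float("inf")
    return (len(left) * variance(left) + len(right) * variance(right)) / len(ys)

def best_threshold(xs, ys):
    """Random-forest style: evaluate every candidate threshold, keep the best."""
    return min(set(xs), key=lambda t: split_score(xs, ys, t))

def random_threshold(xs, rng):
    """Extra-trees style: a single uniform draw in the feature's range, no scan."""
    return rng.uniform(min(xs), max(xs))

print(best_threshold([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1]))  # -> 3
```

The scan in best_threshold is what makes random forests slower to train; extra trees pay a single draw per candidate feature instead.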

The point here is not to be exhaustive; more references and the main articles can be found at the bottom of the post.

Timing

First, let’s look at the timings for each model. I will use the methods implemented in scikit-learn and reuse code from another post (the code, with minor updates, is at the bottom of this post). Without a detailed analysis, extra trees appear about three times faster than random forests.

N     P     Time RF    Time ET
500   5     0.093678   0.0679979
500   10    0.117224   0.0763619
500   20    0.176268   0.097132
500   50    0.358183   0.157907
500   100   0.619386   0.256645
500   200   1.14059    0.45396
1000  5     0.139871   0.0846941
1000  10    0.211061   0.106443
1000  20    0.385125   0.151639
1000  50    0.805403   0.277682
1000  100   1.39056    0.501522
1000  200   2.76709    0.926728
2000  5     0.269487   0.122763
2000  10    0.474972   0.171372
2000  20    0.771758   0.264499
2000  50    1.81821    0.539937
2000  100   3.47868    1.03636
2000  200   6.95018    2.1839
5000  5     0.86582    0.246231
5000  10    1.2243     0.373797
5000  20    2.20815    0.624288
5000  50    5.13883    1.41648
5000  100   9.79915    3.25462
5000  200   21.5956    6.86543

Note the use of tabulate, which allows pretty-printing pandas dataframes ;)

import numpy as np
import ComplexityEvaluator
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, AdaBoostRegressor
from tabulate import tabulate

def random_data_regression(n, p):
    return np.random.rand(n, p), np.random.rand(n)


regression_models = [RandomForestRegressor(),
                     ExtraTreesRegressor()]

names = ["RandomForestRegressor",
         "ExtraTreesRegressor"]

complexity_evaluator = ComplexityEvaluator.ComplexityEvaluator(
    [500, 1000, 2000, 5000],
    [5, 10, 20, 50, 100, 200])

for name, model in zip(names, regression_models):
    res, data = complexity_evaluator.Run(model, random_data_regression)
    print(name)
    print(tabulate(data, headers='keys', tablefmt='psql', showindex='never'))

Irrelevant variables

Now, let’s see how the two methods compare when we add irrelevant variables. First, we need a data set where there is something to learn (as opposed to what was done previously).

Let’s fix a linear dependency between the target variable and the first three variables, along the lines of:

def linear_data_generation(n, p):
    X = np.random.rand(n, p)
    beta = np.zeros([p, 1])
    beta[0,0] = 1
    beta[1,0] = 2
    beta[2,0] = -1
    Y = np.ravel(np.dot(X, beta))
    return X, Y

Fig. 1: Comparison of random forests and extra trees in presence of irrelevant predictors

In blue are presented the results from the random forest and red for the extra trees.

The results are quite striking: extra trees perform consistently better when there are a few relevant predictors and many noisy ones.

import numpy as np
import AccuracyEvaluator
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.metrics import mean_squared_error
from tabulate import tabulate
import matplotlib.pyplot as plt

def linear_data_generation(n, p):
    X = np.random.rand(n, p)
    beta = np.zeros([p, 1])
    beta[0,0] = 1
    beta[1,0] = 2
    beta[2,0] = -1
    Y = np.ravel(np.dot(X, beta))
    return X, Y


n_columns = [5, 10, 20, 30, 50, 80, 100]

accuracy_evaluator = AccuracyEvaluator.AccuracyEvaluator(
    [500],
    n_columns, 
    mean_squared_error,
    10)

data_rf = accuracy_evaluator.Run(RandomForestRegressor(), linear_data_generation)
data_et = accuracy_evaluator.Run(ExtraTreesRegressor(), linear_data_generation)

plt.errorbar(n_columns, data_rf["Metric"], yerr= data_rf["Std"], fmt="o", ecolor = "blue", color="blue")
plt.errorbar(n_columns, data_et["Metric"], yerr= data_et["Std"], fmt="o", ecolor = "red", color="red")
plt.xlabel('Number of columns')
plt.ylabel('Mean Squared Error')
plt.show()

Many relevant variables

Starting from the code above and changing the decision function to be the sum of all the predictors:

def linear_data_generation(n, p):
    X = np.random.rand(n, p)
    beta = np.ones([p, 1])
    Y = np.ravel(np.dot(X, beta))
    return X, Y

Fig. 2: Comparison of random forests and extra trees in presence of many relevant predictors

In blue are presented the results from the random forest and red for the extra trees.

It seems that both methods perform equally well in the presence of many relevant features.

Code

import numpy as np
import pandas as pd
import time
from sklearn.linear_model import LinearRegression
import math

class ComplexityEvaluator:

    def __init__(self, nrow_samples, ncol_samples):
        self._nrow_samples = nrow_samples
        self._ncol_samples = ncol_samples

    def _time_samples(self, model, random_data_generator):
        rows_list = []
        for nrow in self._nrow_samples:
            for ncol in self._ncol_samples:
                train, labels = random_data_generator(nrow, ncol)

                start_time = time.time()
                model.fit(train, labels)
                elapsed_time = time.time() - start_time

                result = {"N": nrow, "P": ncol, "Time": elapsed_time}
                rows_list.append(result)

        return rows_list

    def Run(self, model, random_data_generator):
        orig_data = pd.DataFrame(self._time_samples(model, random_data_generator))
        data = orig_data.applymap(math.log)
        linear_model = LinearRegression(fit_intercept=True)
        linear_model.fit(data[["N", "P"]], data[["Time"]])
        return linear_model.coef_, orig_data


if __name__ == "__main__":
    class TestModel:

        def __init__(self):
            pass

        def fit(self, x, y):
            time.sleep(x.shape[0] / 1000.)

    def random_data_generator(n, p):
        return np.random.rand(n, p), np.random.rand(n, 1)

    model = TestModel()

    complexity_evaluator = ComplexityEvaluator(
        [200, 500, 1000, 2000, 3000], [1, 5, 10])

    res = complexity_evaluator.Run(model, random_data_generator)

    print(res)

And then :

import numpy as np
import pandas as pd
import time
from sklearn.linear_model import LinearRegression
import math
from Stacker import Stacker


class AccuracyEvaluator:

    def __init__(self, nrow_samples, ncol_samples, penalty, n_folds=5):
        self._nrow_samples = nrow_samples
        self._ncol_samples = ncol_samples
        self._stacker = Stacker(penalty, n_folds, False)

    def _accuracy_samples(self, model, random_data_generator):

        def predict(model, X):
            return model.predict(X)

        rows_list = []
        for nrow in self._nrow_samples:
            for ncol in self._ncol_samples:
                train, labels = random_data_generator(nrow, ncol)

                mean_perf, std_perf,  _ = self._stacker.RunSparse(train,
                                                                  labels, model, predict)

                result = {"N": nrow, "P": ncol,
                          "Metric": mean_perf, "Std": std_perf}
                rows_list.append(result)

        return rows_list

    def Run(self, model, random_data_generator):
        orig_data = pd.DataFrame(
            self._accuracy_samples(model, random_data_generator))
        return orig_data

I will come back to stacking in another post some day; I also use it for cross-validation (stacking can be obtained almost for free during a cross-validation). Besides, the naming of some functions is quite unfortunate (RunSparse() should probably be called RunNumpyArray() or something like that; likewise, Run() should probably be RunPandas()…). The code is here for the sake of completeness.

import pandas as pd
import numpy as np

from time import time
from sklearn import cross_validation


class Stacker:

    def __init__(self, penalty, n_folds, verbose=True, random_state=1):
        self._penalty = penalty
        self._n_folds = n_folds
        self._verbose = verbose
        self._random_state = random_state

    def Run(self, X, y, model, predict_method):
        kf = cross_validation.KFold(
            y.shape[0], n_folds=self._n_folds, shuffle=True, random_state=self._random_state)
        trscores, cvscores, times = [], [], []
        stack_train = np.zeros(len(y))  # stacked predictions
        for train_fold, validation_fold in kf:
            t = time()
            model.fit(X.iloc[train_fold], y.iloc[train_fold])

            tr_pred = predict_method(model, X.iloc[train_fold])

            trscore = self._penalty(y.iloc[train_fold], tr_pred)

            validation_prediction = predict_method(
                model, X.iloc[validation_fold])
            cvscore = self._penalty(
                y.iloc[validation_fold], validation_prediction)

            trscores.append(trscore)
            cvscores.append(cvscore)
            times.append(time() - t)

            stack_train[validation_fold] = validation_prediction

        if self._verbose:
            print("TRAIN %.5f | TEST %.5f | TEST-STD %5f | TIME %.2fm (1-fold)" %
                  (np.mean(trscores), np.mean(cvscores), np.std(cvscores), np.mean(times) / 60))
            print(model.get_params(deep=True))
            print("\n")

        return np.mean(cvscores), stack_train
    
    def RunSparse(self, X, y, model, predict_method):
        kf = cross_validation.KFold(
            y.shape[0], n_folds=self._n_folds, shuffle=True, random_state=self._random_state)
        trscores, cvscores, times = [], [], []
        stack_train = np.zeros(len(y))  # stacked predictions
        for train_fold, validation_fold in kf:
            t = time()
            model.fit(X[train_fold], y[train_fold])

            tr_pred = predict_method(model, X[train_fold])

            trscore = self._penalty(y[train_fold], tr_pred)

            validation_prediction = predict_method(
                model, X[validation_fold])
            cvscore = self._penalty(
                y[validation_fold], validation_prediction)

            trscores.append(trscore)
            cvscores.append(cvscore)
            times.append(time() - t)

            stack_train[validation_fold] = validation_prediction

        if self._verbose:
            print("TRAIN %.5f | TEST %.5f | TEST-STD %5f | TIME %.2fm (1-fold)" %
                  (np.mean(trscores), np.mean(cvscores), np.std(cvscores), np.mean(times) / 60))
            print(model.get_params(deep=True))
            print("\n")

        return np.mean(cvscores), np.std(cvscores),stack_train
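
The “stacking for free” remark above can be illustrated with numpy alone: collect each validation fold’s predictions while cross-validating, and you obtain a vector of out-of-fold predictions as a by-product. This is a toy sketch, independent of the Stacker class; fit_predict is a placeholder I introduced to stand for any model:

```python
import numpy as np

def out_of_fold_predictions(X, y, fit_predict, n_folds=5, seed=1):
    """Cross-validate and keep each validation fold's predictions:
    every sample ends up predicted by a model that never saw it."""
    rng = np.random.RandomState(seed)
    idx = rng.permutation(len(y))
    stack = np.zeros(len(y))
    for fold in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, fold)
        stack[fold] = fit_predict(X[train], y[train], X[fold])
    return stack

# A trivial "model" predicting the training mean everywhere
mean_model = lambda X_tr, y_tr, X_val: np.full(len(X_val), y_tr.mean())

X = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10, dtype=float)
stack = out_of_fold_predictions(X, y, mean_model, n_folds=2)
print(stack)
```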

Learning more and references

Classification and Regression Trees by Leo Breiman, Jerome Friedman, Charles J. Stone, R.A. Olshen

Breiman L (2001). “Random Forests”. Machine Learning. 45 (1): 5–32.

Geurts P, Ernst D, Wehenkel L (2006). “Extremely randomized trees”. Machine Learning. 63: 3–42

Python data science handbook

Project Euler 88

Product-sum numbers

Statement

The problem can be found here.

A natural number, N, that can be written as the sum and product of a given set of at least two natural numbers, {a1, a2, … , ak} is called a product-sum number: N = a1 + a2 + … + ak = a1 x a2 x … x ak.

For example, 6 = 1 + 2 + 3 = 1 x 2 x 3.

For a given set of size, k, we shall call the smallest N with this property a minimal product-sum number. The minimal product-sum numbers for sets of size, k = 2, 3, 4, 5, and 6 are as follows.

k=2: 4 = 2 x 2 = 2 + 2

k=3: 6 = 1 x 2 x 3 = 1 + 2 + 3

k=4: 8 = 1 x 1 x 2 x 4 = 1 + 1 + 2 + 4

k=5: 8 = 1 x 1 x 2 x 2 x 2 = 1 + 1 + 2 + 2 + 2

k=6: 12 = 1 x 1 x 1 x 1 x 2 x 6 = 1 + 1 + 1 + 1 + 2 + 6

Hence for 2≤k≤6, the sum of all the minimal product-sum numbers is 4+6+8+12 = 30; note that 8 is only counted once in the sum.

In fact, as the complete set of minimal product-sum numbers for 2≤k≤12 is {4, 6, 8, 12, 15, 16}, the sum is 61.

What is the sum of all the minimal product-sum numbers for 2≤k≤12000?
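Before diving into the solution, the small case quoted in the statement is cheap to verify by brute force (a Python sketch, independent of the OCaml solution below):

```python
def factorizations(n, smallest=2):
    """All non-decreasing multisets of integers >= 2 whose product is n."""
    result = [[n]] if n >= smallest else []
    for d in range(smallest, int(n ** 0.5) + 1):
        if n % d == 0:
            result += [[d] + rest for rest in factorizations(n // d, d)]
    return result

def minimal_product_sums(kmax):
    """Smallest product-sum number for each k in 2..kmax."""
    best = {}
    for n in range(2, 2 * kmax + 1):  # the minimum for k never exceeds 2k
        for f in factorizations(n):
            k = len(f) + (n - sum(f))  # pad the factorization with ones
            if 2 <= k <= kmax and k not in best:
                best[k] = n
    return best

print(sum(set(minimal_product_sums(12).values())))  # -> 61
```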

Ideas

The fact that the problem asks to count each product-sum number only once in the sum led me to think there was an easy way to test whether a number is a product-sum number.

Though it was quite easy to prove that prime numbers cannot be product-sum numbers (a prime p only factors as p itself, and padding with k − 1 ones yields a sum of p + k − 1 > p for k ≥ 2), other criteria were not that obvious.

12000 did not seem out of reach for a naive approach: for each k, start from k + 1 and find the first product-sum number.

For this, only a test “is this number a product-sum number for k terms” was needed. The simplest way to perform it was to enumerate all the possible ways to write an integer as a product of integers greater than 1.

let enumerate_products n =
  let rec aux k i acc = 
    if i == k then [i::acc]
    else if i > k then []
    else 
      if k mod i == 0 then 
        (aux (k/i) i (i::acc))@(aux k (i+1) acc)
    else (aux k (i+1) acc)
  in
  aux n 2 []

Testing is now easy:

let is_sum_product n k =
  if is_prime n then 
    false
  else
    let product_representations = enumerate_products n in

    let pseudo_sum product_representation =
      (sum_list product_representation) + k - (List.length product_representation) in

    List.exists (fun x -> (pseudo_sum x) == n) product_representations

Using the fact that the minimal product-sum number for k terms is greater than k and at most 2k (2 × k × 1 × ⋯ × 1 = 2 + k + (k − 2) × 1 = 2k is always a product-sum number for k), finding it is easy:

let find_smallest_sum_product k =
  let rec aux i = 
    if i > 2*k then 
      failwith "error somewhere" 
    else if is_sum_product i k then i
    else aux (succ i) in
  aux (succ k) (* would not work for n=k=1... *) 

A simple solution

The whole code to find the solution is the following. Compiled with ocamlopt, it ran in around 3 minutes on my laptop.

let is_prime n =
  let rec is_not_divisor d =
    d * d > n || (n mod d <> 0 && is_not_divisor (d+1)) in
  n <> 1 && is_not_divisor 2


let sum_list input_list = 
  List.fold_right (fun x y -> x + y) input_list 0


let enumerate_products n =
  let rec aux k i acc = 
    if i == k then [i::acc]
    else if i > k then []
    else 
      if k mod i == 0 then 
        (aux (k/i) i (i::acc))@(aux k (i+1) acc)
    else (aux k (i+1) acc)
  in
  aux n 2 []


let is_sum_product n k =
  if is_prime n then 
    false
  else
    let product_representations = enumerate_products n in

    let pseudo_sum product_representation =
      (sum_list product_representation) + k - (List.length product_representation) in

    List.exists (fun x -> (pseudo_sum x) == n) product_representations


let find_smallest_sum_product k =
  let rec aux i = 
    if i > 2*k then 
      failwith "error somewhere" 
    else if is_sum_product i k then i
    else aux (succ i) in
  aux (succ k) (* would not work for n=k=1... *) 


let sum_sum_product a b =
  let integer_interval = List.init (b-a+1) (fun i -> i+a) in
  let sum_products = List.map find_smallest_sum_product integer_interval in
  let unique_sum_products = List.sort_uniq compare sum_products in
  sum_list unique_sum_products ;;


(* Unit testing *)
let () = assert (sum_sum_product 2 12 = 61)

(* Result *)
let () = print_int (sum_sum_product 2 12000); print_string "\n"

More problems

If you like this kind of problems, I strongly recommend The Art of Mathematics: Coffee Time in Memphis, by Béla Bollobás.

To find out more about OCaml this book provides a detailed presentation of the language.

Acronyms of data science

Every field has its acronyms, and machine learning is no exception. I am not a big fan of heavy acronym usage, but if you happen to be lost in the middle of a meeting, this may help!

Below is a list of acronyms. If I missed any, or if you have nice references regarding a topic, feel free to comment; I will do my best to keep this list up to date!

AE Auto encoder

AD Automatic differentiation

ANN Artificial Neural Network

API Application Programming Interface

ARD Automatic Relevance Determination

ASR Automatic Speech Recognition

BPTT Back Propagation Through Time

BPTS Back Propagation Through Structure

BNN Binary Neural Net

COCO Common Objects in Context

CPPN Compositional Pattern-Producing Network

CTC Connectionist Temporal Classification

CNN Convolutional Neural Network

CRF Conditional Random Field

CV Cross Validation

DBN Deep Belief Network

DCGAN Deep Convolutional Generative Adversarial Network

DNN Deep Neural Network

DT Decision tree

EBM Energy Based Model

ESP Enforced SubPopulations

ELU Exponential Linear Unit

ERF Extremely random forest

GAN Generative Adversarial Network

GBM Gradient Boosting Machine

GMM Gaussian Mixture Model

GRU Gated Recurrent Unit

HMM Hidden Markov Model

NB Naive Bayes

NN Neural Network

KPCA Kernel Principal Component Analysis

KSVM Kernel Support Vector Machine

GA Genetic Algorithm

HTM Hierarchical Temporal Memory

HAM Hierarchical Attentive Memory

KNN k-Nearest Neighbors

LOOCV Leave one out cross validation

LReLU Leaky ReLU

LTU Linear Threshold Unit

LSTM Long Short-Term Memory

MCMC Markov chain Monte Carlo

MDP Markov Decision Processes

ML Machine Learning

MLP Multi-layer Perceptrons

NLP Natural Language Processing

NTM Neural Turing Machine

NEAT NeuroEvolution of Augmenting Topologies

OLS Ordinary Least Squares Regression

PReLU Parameterized ReLU

OCR Optical Character Recognition

PCA Principal Component Analysis

PAC-MDP Probably Approximately Correct in Markov Decision Processes

RTRL Real Time Recurrent Learning

ReLU Rectified Linear Unit

RNN Recurrent Neural Network

RNTN Recursive Neural Tensor Network

RL Reinforcement Learning

RVM Relevance Vector Machine

ResNet Residual Neural Network

RF Random Forest

RBM Restricted Boltzmann Machines

SIFT Scale-Invariant Feature Transform

SRN Simple Recurrent Network

SVD Singular Value Decomposition

SGD Stochastic Gradient Descent

SVM Support Vector Machine

SANE Symbiotic Adaptive NeuroEvolution

TDA Topological Data Analysis

TF TensorFlow

TFIDF Term Frequency Inverse Document Frequency

VLAD Vector of Locally Aggregated Descriptors

WFST Weighted Finite-State Transducers

Updating Manjaro

A short while ago, I decided to jump from Ubuntu to Manjaro. Ubuntu kept showing me textboxes like “an issue has been detected, close or report?”, was taking a huge amount of resources, and most of the system updates did not solve this. Time for a change!

What is Manjaro?

It is a Linux distribution based on Arch Linux. Manjaro has a focus on user-friendliness and accessibility, and the system itself is designed to work fully “straight out of the box” with its variety of pre-installed software.

If you are a big fan of comparisons, which I will not detail here, you can find some on slant.co or, in French, on citizenz or a top 7 reasons why…

Time for a system update on Manjaro

What could possibly go wrong on a system based on user-friendliness? Well, the system update can be a pain, as here or here, or if you installed VLC.

[user@user-pc ~]$ sudo pacman -Syu
[sudo] password for user: 
:: Synchronizing package databases...
core
extra
community
multilib

[...]

error: unresolvable package conflicts detected
error: failed to prepare transaction (conflicting dependencies)
:: python-nautilus and python2-nautilus are in conflict

Removing packages

To remove a single package, leaving all of its dependencies installed: # pacman -R package_name

To remove a package and its dependencies which are not required by any other installed package: # pacman -Rs package_name

Let’s try getting rid of python2-nautilus. Just in case you wonder what it does.

[user@user-pc ~]$ sudo pacman -Rs python2-nautilus
checking dependencies...
error: failed to prepare transaction (could not satisfy dependencies)
:: nautilus-admin: removing python2-nautilus breaks dependency 'python-nautilus'

Not exactly what I needed… It was actually called nautilus-admin.

[user@user-pc ~]$ sudo pacman -Rs nautilus-admin
checking dependencies...

Packages (3) python2-gobject-3.26.1-1  python2-nautilus-1.1-4  nautilus-admin-1.1.1-1

Total Removed Size:  1.27 MiB

:: Do you want to remove these packages? [Y/n] Y
:: Processing package changes...
(1/3) removing nautilus-admin
(2/3) removing python2-nautilus
(3/3) removing python2-gobject
:: Running post-transaction hooks...
(1/1) Arming ConditionNeedsUpdate...

Let’s try the update once again! Pay attention not to answer yes to every question…

[user@user-pc ~]$ sudo pacman -Syu

[...]

:: Starting full system upgrade...
:: Replace compositeproto with extra/xorgproto? [Y/n] Y
:: Replace damageproto with extra/xorgproto? [Y/n] y
:: Replace fixesproto with extra/xorgproto? [Y/n] y
:: Replace fontsproto with extra/xorgproto? [Y/n] y
:: Replace gnome-themes-standard with extra/gnome-themes-extra? [Y/n] y
:: Replace gnome-tweak-tool with extra/gnome-tweaks? [Y/n] y
:: Replace inputproto with extra/xorgproto? [Y/n] y
:: Replace kbproto with extra/xorgproto? [Y/n] y
:: Replace manjaro-gnome-extension-settings with community/manjaro-gnome-extension-settings-17.0? [Y/n] n
:: Replace manjaro-gnome-extension-settings with community/manjaro-gnome-extension-settings-18.0? [Y/n] y
:: Replace manjaro-gnome-settings with community/manjaro-gnome-settings-17.0? [Y/n] n
:: Replace manjaro-gnome-settings with community/manjaro-gnome-settings-18.0? [Y/n] y
:: Replace pkg-config with core/pkgconf? [Y/n] y
:: Replace randrproto with extra/xorgproto? [Y/n] y
:: Replace recordproto with extra/xorgproto? [Y/n] y
:: Replace renderproto with extra/xorgproto? [Y/n] y
:: Replace scrnsaverproto with extra/xorgproto? [Y/n] y
:: Replace videoproto with extra/xorgproto? [Y/n] y
:: Replace xextproto with extra/xorgproto? [Y/n] y
:: Replace xf86vidmodeproto with extra/xorgproto? [Y/n] y
:: Replace xineramaproto with extra/xorgproto? [Y/n] y
:: Replace xproto with extra/xorgproto? [Y/n] y

The following can take a while:

zenity-3.28.1-1-x86_64     3.8 MiB   586K/s 00:07 [######################] 100% 
pipewire-0.1.9-3-x86_64 1143.8 KiB   397K/s 00:03 [######################] 100% 
mutter-3.28.2-1-x86_64     2.2 MiB   342K/s 00:07 [######################] 100%

You thought it was done? No. Too bad.

(729/729) checking keys in keyring       [######################] 100%
(729/729) checking package integrity     [######################] 100%
(729/729) loading package files          [######################] 100%
(729/729) checking for file conflicts    [######################] 100%
error: failed to commit transaction (conflicting files)
python-pip: /usr/lib/python3.6/site-packages/pip/_internal/__init__.py exists in filesystem

[...]

python-pip: /usr/lib/python3.6/site-packages/pip/_internal/vcs/mercurial.py exists in filesystem
python-pip: /usr/lib/python3.6/site-packages/pip/_internal/vcs/subversion.py exists in filesystem
python-pip: /usr/lib/python3.6/site-packages/pip/_internal/wheel.py exists in filesystem
Errors occurred, no packages were upgraded.

There is an issue with pip now… Running:

[user@user-pc ~]$ sudo pacman -S python-pip --force

lets pacman overwrite the conflicting pip files. And now, the following should work!

[user@user-pc ~]$ sudo pacman -Syu  

Hope this helped :)

Packages for machine learning

I hope to add more packages and more information to this list from time to time. If you have specific questions regarding a package, or some recommendations, feel free to leave a comment, I will have a look!

Machine learning covers so many possible tasks and comes with so many packages and tools that it is hard to know which ones to use. I am just listing the ones I find really useful. They are not better than the packages I do not use, and I cannot guarantee they are absolutely bug free, but they are robust enough to work with!

Linux tools

A good terminal will be your best friend. For instance, a few lines of bash are enough to turn the first twenty pdfs of a folder into text files:

i="0"
for filename in ./pdfs/*.pdf; do
  i="$((i+1))"
  if [ "$i" -gt 20 ]; then
    break
  fi
  echo "Processing $filename file..."
  pdf2txt.py -m 2 "$filename" >> "txts/$(basename "$filename" .pdf).txt"
done

How amazing is that?

tabview

Probably the best tool to navigate through a CSV file in the terminal. It is really light and fast, and supports many VIM commands. Here is the repo.

csvkit

Install it from pip. It comes with a lot of handy tools:

  • csvstat

  • csvlook though I prefer tabview, csvlook my_data.csv > my_data.md lets you display a csv file as markdown.

Combined with head you can navigate through various files really fast. There actually is a whole website dedicated to this.

glances

This allows you to see how busy your machine is when running an algorithm.

PDF to text files

pdf2txt.py

Useful to extract text from pdf files. There is no perfect solution (as far as I know) for this task, but this one is a good starting point.

Tesseract (and ImageMagick)

Another approach to extracting text from pdf files is using OCR (Optical Character Recognition). Tesseract does a great job, but feeding it pdfs directly can lead to errors. ImageMagick, however, is great at turning pdfs into pngs.

for filename in ./pdfs/*.pdf; do
  echo "Processing $filename file..."
  # Turn the first two pages of the pdf into png images
  convert -density 300 "$filename[0-1]" -quality 90 "output.png"
  # Run the OCR (here with the French language pack) on each page
  tesseract -l fra "output-0.png" "output-0"
  tesseract -l fra "output-1.png" "output-1"
  cat "output-0.txt" "output-1.txt" > "ocr$(basename "$filename" .pdf).txt"
  rm "output-0.txt" "output-1.txt" "output-0.png" "output-1.png"
done

R

When installing R (and it happened to me many times), I love running the following script. It feels like coming home. I will not go through all the details of each package; it is just useful for me to have this code around :)


load.lib <- function(libT, l = NULL) {
  lib.loc <- l
  # Install the package only if it is not installed yet
  if (length(which(installed.packages(lib.loc = lib.loc)[, 1] == libT)) == 0) {
    install.packages(libT, lib = lib.loc, repos = "http://cran.us.r-project.org")
  }
}

data_reading_libraries <- c(
    "readr",
    "readstata13"
    )

machine_learning_libraries <- c(
    "xgboost",
    "glmnet",
    "randomForest",
    "Rtsne",
    "e1071"
    )

data_libraries <- c(
    "data.table",
    "FeatureHashing",
    "dplyr",
    "Matrix"
    )

string_libraries <- c(
    "stringdist",
    "tm"
    )

plot_libraries <- c(
    "ggplot2",
    "RColorBrewer",
    "fields",
    "akima"
    )

favorite_libs <- c(data_reading_libraries,
    machine_learning_libraries,
    data_libraries,
    string_libraries,
    plot_libraries)

for (lib in favorite_libs) { load.lib(lib) }

General stuff

Reading data

readr

If you have been using R’s default csv reader, read.csv, you must be familiar with its drawbacks: slow, many parameters, a parser which sometimes fails… readr, on the other hand, is super fast and robust, and comes with read_csv and read_csv2 depending on the csv standard your file relies on. (The good thing about standards being that there always are many versions of them…)

XML

It allows you to read XML files (obviously) but also HTML tables (yes, some people actually use these to transfer data, though the many HTML tags make the whole file much bigger…)

Machine learning libraries

glmnet

A library to perform elastic net regressions. It has a cross validation method which enjoys nice properties of the optimization path, allowing it to evaluate a whole path of solutions as fast as a single model.

randomForest

The standard if you want to use random forests with R. Link to the cran page

e1071

I tried its “competitor” (kernlab), but preferred this one.

Rtsne

A wrapper for the C++ implementation of Barnes-Hut t-Distributed Stochastic Neighbor Embedding. It was the fastest tSNE implementation when I tried them.

Data viz

coefplot

Visualizing linear regressions is now simple.


require(coefplot)

N <- 10
P <- 3

X <- matrix(rnorm(n = N*P), nrow = N)

w <- rnorm(P)

Y <- X %*% w + rnorm(N)

my_data <- cbind.data.frame(Y,X)
colnames(my_data) <- c("Y",paste0("Var",1:3))
model <- lm(Y ~ ., data = my_data)

coefplot(model)

corplot

A correlation matrix can be quite ugly. This package just makes it easier to read, with colors…

forestFloor

Wouldn’t it be great to have something that tells you a little bit more about your random forest models? This package can.

Python

General stuff

tqdm

tqdm is one of the most useful packages I have discovered. Though it does not actually perform any operation or handle your dataframes smartly, it shows a progress bar for the loops you want it to. Still not convinced? With it, you can keep track of every feature engineering job you launch and see which ones are never going to end, without the pain of writing all these progress bars yourself.
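
A minimal sketch of the kind of loop I mean (with a no-op fallback so the snippet still runs if tqdm is not installed):

```python
try:
    from tqdm import tqdm  # third-party: pip install tqdm
except ImportError:
    # Fallback: behave like the identity function when tqdm is missing
    def tqdm(iterable, **kwargs):
        return iterable

# Wrapping any iterable is enough to get a live progress bar
total = 0
for i in tqdm(range(1000), desc="feature engineering"):
    total += i * i
```

Swapping range(1000) for your actual iterable is the only change needed in an existing loop.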

pandas

The industry standard for dataframes.
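
As a tiny illustration of the dataframe API (the column names and values here are made up):

```python
import pandas as pd

# Build a small dataframe from a dict of columns
df = pd.DataFrame({
    "city": ["Paris", "Lyon", "Paris"],
    "sales": [10, 5, 7],
})

# Group by one column and aggregate another
per_city = df.groupby("city")["sales"].sum()
print(per_city["Paris"])  # 17
```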

numpy

The industry standard for numeric operations.
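
A quick example of the vectorized, loop-free style numpy is built around (the array values are arbitrary):

```python
import numpy as np

# Vectorized arithmetic: no explicit Python loop needed
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

s = x + y                  # element-wise sum of the two arrays
dot = float(np.dot(x, y))  # inner product: 1*4 + 2*5 + 3*6
```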

csv

Easy manipulation of csv files. The DictReader class is particularly useful when one needs to stream from a csv file.
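
A small sketch of streaming rows with DictReader (the csv content is made up; in practice the StringIO would be an open file):

```python
import csv
import io

# Stand-in for open("my_data.csv"); keeps the example self-contained
data = io.StringIO("name,score\nalice,3\nbob,5\n")

total = 0
for row in csv.DictReader(data):
    # Each row is a dict keyed by the header line, read lazily
    total += int(row["score"])
```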

unicodecsv

import unicodecsv as csv

Solves so many issues.

Machine learning libraries

sk-learn

A collection of robust and well optimized methods. A must have.

xgboost, catboost, lightgbm

Libraries dedicated to gradient boosting. Amazing performance, amazingly robust.

R Kernlab (caret) VS e1071

A little bit of context

After seeing plenty of “Python VS Java for data science” questions that seemed way too broad, I decided to focus on various benchmarks for different tasks. Today will be about Support Vector Machines, in R.

There are two main packages for SVMs in R: kernlab and e1071.

While kernlab implements kernel-based machine learning methods for classification, regression and clustering, e1071 tackles a broader range of problems: support vector machines, shortest path computation, bagged clustering, naive Bayes classifiers…

Difference between the SVM implementations

Proposed kernels

As kernlab focuses on kernel methods, many of them are implemented:

  • rbfdot Radial Basis kernel function “Gaussian”
  • polydot Polynomial kernel function
  • vanilladot Linear kernel function
  • tanhdot Hyperbolic tangent kernel function
  • laplacedot Laplacian kernel function
  • besseldot Bessel kernel function
  • anovadot ANOVA RBF kernel function
  • splinedot Spline kernel
  • stringdot String kernel

While e1071 proposes the following:

  • linear
  • radial basis
  • polynomial
  • sigmoid

Solver

e1071 relies on libsvm, whose last update was released on December 22, 2016.

On the other hand, ksvm uses John Platt’s SMO algorithm to solve the SVM QP problem, and most SVM formulations.

These differences may have an impact on the training time of the models (hopefully, the solutions should be the same).

Testing it

Unfortunately, the names of the parameters are quite different between the two libraries.

This has been an issue for some users.

The radial basis kernel in e1071 is defined as exp(-gamma * |u - v|^2), and in kernlab (rbfdot) as exp(-sigma * |u - v|^2), so gamma and sigma play the same role. This is good. However, note that the cost parameters are called cost (e1071) and C (kernlab).

Besides, there are many models which are available (epsilon regressions, nu regressions…) and the default behavior is not always obvious.

At last, many heuristics have been developed to choose the best “bandwidth” (referred to as gamma or sigma depending on the package), and the proposed heuristics are not always the same. The code below makes sure the methods match when enough parameters are provided.

require("kernlab")
require("e1071")

N <- 1000
P <- 20
noise <- 3

X <- as.matrix(matrix(rnorm(N * P), nrow = N))
Y <- as.vector(X[, 1] + X[, 2] * X[, 2] + noise * rnorm(N))

model_kernlab <-
  kernlab::ksvm(
      x = X,
      y = Y,
      scaled = TRUE,
      C = 5,
      kernel = "rbfdot",
      kpar = list(sigma = 1),
      type = "eps-svr",
      epsilon = 0.1
      )

model_e1071 <- e1071::svm(x = X,
      y = Y,
      cost = 5,
      scale = TRUE, 
      kernel = "radial",
      gamma = 1,
      type = "eps-regression",
      epsilon = 0.1)

Let’s make sure the models match.

> mean(abs(
+   predict(object = model_e1071, newdata = X) - predict(object = model_kernlab, newdata = X)
+ ))
[1] 1.254188e-14
> 
> sum(abs(predict(object = model_e1071, newdata = X) - Y))
[1] 380.0338
> sum(abs(predict(object = model_kernlab, newdata = X) - Y))
[1] 380.0338

Benchmarking

Results

Now that we know the two implementations produce the same results, we can (safely) compare their execution times.

Comparison with 20 features

e1071 is in red.

Comparison with 200 features

Given the success of libsvm, I expected e1071 to be faster than kernlab. The heuristics implemented to optimize the quadratic form with its constraints are not the same (see [1] and [2]), and they may actually lead to different results on other data sets.

Learning more

[1] C.-C. Chang and C.-J. Lin, “LIBSVM: A library for support vector machines,” ACM Transactions on Intelligent Systems and Technology, vol. 2, no. 3, pp. 1–27, Apr. 2011.

[2] J. C. Platt, “Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines,” Microsoft Research Technical Report MSR-TR-98-14, 1998.

Applied Predictive Modeling is also a good introduction to predictive modelling with R.

Code

If you want to reproduce the results, the whole code is below:


require("kernlab")
require("e1071")

fit_e1071 <- function(X, Y) {
  e1071::svm(
    x = X,
    y = Y,
    cost = 5,
    scale = TRUE,
    kernel = "radial",
    gamma = 1,
    type = "eps-regression",
    epsilon = 0.1
  )
}

fit_kernlab <- function(X, Y) {
  kernlab::ksvm(
    x = X,
    y = Y,
    scaled = TRUE,
    C = 5,
    kernel = "rbfdot",
    kpar = list(sigma = 1),
    type = "eps-svr",
    epsilon = 0.1
  )
}

time_e1071 <- c()
time_kernlab <- c()

steps <- c(10, 20, 50, 100, 200, 500, 1000, 2000)

for (n in steps)
{
  noise <- 3
  P <- 20
  
  X <- as.matrix(matrix(rnorm(n * P), nrow = n))
  Y <- as.vector(X[, 1] + X[, 2] * X[, 2] + noise * rnorm(n))
  
  time_e1071 <- c(time_e1071, system.time(fit_e1071(X, Y))[1])
  time_kernlab <- c(time_kernlab, system.time(fit_kernlab(X, Y))[1])
}

plot(
  steps,
  time_e1071,
  type = "l",
  col = "red",
  xlab = "n",
  ylab = "Time",
  main = paste0("Execution time for ", P, " features"))
lines(steps, time_kernlab)