Introduction

Following the article about machine learning and ocaml I present another learning algorithm in OCaml and its comparison to a Python (compiled with pypy) equivalent.

First of all, PyPy is a fast, compliant alternative implementation of the Python language (2.7.13 and 3.5.3). As fast, it means orders of magnitude faster than a simple Python execution. As many libraries are not (directly) compatible with it, it remains quite unused. However, when using the default libraries, PyPy works like a charm.

Online logistic regression

A common algorithm, often met on Kaggle, is the online logistic regression. Note that the data used will as well come from the Avazu CTR competition. To make the experience shorter, I will only work with the first 1 000 000 lines of the train.csv file (later referred to as train_small.csv).

This algorithm allows to train data in streaming which means that one does not need to load all the data in memory (which is sometimes infeasible). These methods are usually quite fast, and being simple to implement, this allows a lot of tuning and feature engineering on the fly.

On top of that, it automatically “one-hot encodes” the data. Which means that even if the number of categories per variable is important (which results in an underlying high dimensional space), this will not be an issue for this method (and the memory used by this program will remain constant, randomly projecting some components).

The Python code

I do not take any credit for the awesomeness of this code. I simply paste it for the experiment to be easier to reproduce. Some modifications have been performed to make it more comparable to the ocaml implementation (most of them are commented).

            '''
           DO WHAT THE FUCK YOU WANT TO PUBLIC LICENSE
                   Version 2, December 2004

Copyright (C) 2004 Sam Hocevar <[email protected]>

Everyone is permitted to copy and distribute verbatim or modified
copies of this license document, and changing it is allowed as long
as the name is changed.

           DO WHAT THE FUCK YOU WANT TO PUBLIC LICENSE
  TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION

 0. You just DO WHAT THE FUCK YOU WANT TO.
'''


from datetime import datetime
from csv import DictReader
from math import exp, log, sqrt


##############################################################################
# parameters #################################################################
##############################################################################

# A, paths
train = 'train_small.csv'               # path to training file
# B, model
alpha = .01  # learning rate
# C, feature/hash trick
D = 2 ** 20              # number of weights to use
do_interactions = False  # whether to enable poly2 feature interactions

# D, training/validation
epoch = 1      # learn training data for N passes
holdout = 100  # use every N training instance for holdout validation


##############################################################################
# class, function, generator definitions #####################################
##############################################################################

class logistic_regression(object):
    ''' Classical logistic regression
    
        This class (algorithm) is not used in this code, it is putted here
        for a quick reference in hope to make the following (more complex)
        algorithm more understandable.
    '''

    def __init__(self, alpha, D, interaction=False):
        # parameters
        self.alpha = alpha

        # model
        self.w = [0.] * D

    def predict(self, x):
        # parameters
        alpha = self.alpha

        # model
        w = self.w

        # wTx is the inner product of w and x
        wTx = sum(w[i] for i in x)

        # bounded sigmoid function, this is the probability of being clicked
        return 1. / (1. + exp(-max(min(wTx, 35.), -35.)))

    def update(self, x, p, y):
        # parameter
        alpha = self.alpha

        # model
        w = self.w

        # gradient under logloss
        g = p - y

        # update w
        for i in x:
            w[i] -= g * alpha


def logloss(p, y):
    ''' FUNCTION: Bounded logloss

        INPUT:
            p: our prediction
            y: real answer

        OUTPUT:
            logarithmic loss of p given y
    '''

    # p = max(min(p, 1. - 10e-15), 10e-15)
    return -log(p) if y == 1. else -log(1. - p)


def data(path, D):
    ''' GENERATOR: Apply hash-trick to the original csv row
                   and for simplicity, we one-hot-encode everything

        INPUT:
            path: path to training or testing file
            D: the max index that we can hash to

        YIELDS:
            ID: id of the instance, mainly useless
            x: a list of hashed and one-hot-encoded 'indices'
               we only need the index since all values are either 0 or 1
            y: y = 1 if we have a click, else we have y = 0
    '''

    for t, row in enumerate(DictReader(open(path))):
        # process id
        # ID = row['id'] -> not needed
        del row['id']

        # process clicks
        y = 0.
        #if 'click' in row: -> always true in the train set
        if row['click'] == '1':
            y = 1.
        del row['click']

        # turn hour really into hour, it was originally YYMMDDHH
        #row['hour'] = row['hour'][6:]

        # build x
        x = [0]  # 0 is the index of the bias term
        #for key in sorted(row):  # sort is for preserving feature ordering
        for key in row:
            value = row[key]

            # one-hot encode everything with hash trick
            index = abs(hash(key + '_' + value)) % D
            x.append(index)

        yield t, x, y


##############################################################################
# start training #############################################################
##############################################################################

start = datetime.now()

# initialize ourselves a learner
learner = logistic_regression(alpha, D)

# start training
for e in xrange(epoch):
    loss = 0.
    count = 0

    for t, x, y in data(train, D):  # data is a generator
        #  t: just a instance counter
        # ID: id provided in original data
        #  x: features
        #  y: label (click)

        # step 1, get prediction from learner
        p = learner.predict(x)

        if t % holdout == 0:
            # step 2-1, calculate holdout validation loss
            #           we do not train with the holdout data so that our
            #           validation loss is an accurate estimation of
            #           the out-of-sample error
            loss += logloss(p, y)
            count += 1
        else:
            # step 2-2, update learner with label (click) information
            learner.update(x, p, y)

        if t % 100000 == 0 and t > 1:
            print('encountered: %d\tcurrent logloss: %f' % (
                t, loss/count))



          

OCaml implementation

            open Maths
open Read_tools


let get_indices dict n = Hashtbl.fold (fun k v acc -> ((Hashtbl.hash (k^" "^v)) mod n)  :: acc) dict [] 

let predict indices weights = sigmoid (dot_product indices weights) 

let rec update indices weights p y alpha =  match indices with
        | [] -> ()
        | h::tail -> weights.(h) <- (weights.(h) -. (p -. y) *. alpha) ; update tail weights p y alpha 
        

let n = pow 2 20 
let weights = Array.make n 0. 
let dict_stream = dict_reader "train_small.csv" 
let updater indices weights p y = update indices weights p y 0.01 
let refresh_loss = 100000 

let train weights updater dict_stream = 
        let rec aux weights updater dict_stream t loss n = match (try Some( Stream.next dict_stream) 
                                                                  with _ -> None) with
        | Some dict -> Hashtbl.remove dict "id"; 
                          let y = float_of_string (Hashtbl.find dict "click") in
                          Hashtbl.remove dict "click";
                          let indices = get_indices dict n in
                          let p = predict indices weights in
                          updater indices weights p y;
                          
                          if ((t mod refresh_loss) == 0) && t > 0 then begin 
          print_string "encountered: ";
          print_int t;
          print_string "\t logloss:";
          print_float (loss /. float_of_int(t)); 
          print_endline " "; 
                          end;
                          
                          aux weights updater dict_stream (t + 1) (loss +. (log_loss p y)) n
                          
  | None -> () in aux weights updater dict_stream 0 0. (Array.length weights) ;;   

train weights updater dict_stream;

          

Where the maths.ml contains few math functions.

            let dot_product indices weights =
    let rec aux indices weights acc = 
                match indices with
                | [] -> acc
                | h::tail -> aux tail weights (acc +. weights.(h)) in
        aux indices weights 0.
        
let sigmoid x = 1. /. (1. +. exp(0. -. (max (min x 35.) (-35.))))

let log_loss p y = match y with 1. -> -. log(p) | _ -> -. log(1. -. p)

let rec pow a = function
  | 0 -> 1
  | 1 -> a
  | n -> 
    let b = pow a (n / 2) in
    b * b * (if n mod 2 == 0 then 1 else a) 


          

And read_tools.ml implements various tools to stream from a csv file.

            open Str

let csv_separator = ","

let err_lists_sizes = "Incompatible lists size"

let line_stream_of_channel channel =
    Stream.from (fun _ -> try Some (input_line channel) with End_of_file -> None)
        
let read_lines file_path = line_stream_of_channel (open_in file_path)
        
let read_first_line file_path = Stream.next (read_lines file_path)
        
let split_line line = Str.split (Str.regexp csv_separator) line

let concat_elements list1 list2 = 
        let rec aux list1 list2 acc = match list1,list2 with
        | [],[] -> acc
        | a,[] -> failwith err_lists_sizes
        | [],a -> failwith err_lists_sizes
        | h1::t1,h2::t2 -> aux t1 t2 ((h1^h2)::acc) in List.rev (aux list1 list2 [])
        
let to_dict list1 list2 =
        let rec aux list1 list2 my_hash = match list1,list2 with
        | [],[] -> my_hash
        | a,[] -> failwith err_lists_sizes
        | [],a -> failwith err_lists_sizes
        | h1::t1,h2::t2 -> Hashtbl.add my_hash h1 h2; aux t1 t2 my_hash in aux list1 list2 (Hashtbl.create 15)
        
let dict_reader file_path = 
        let line_stream = read_lines file_path in
        let header = split_line (Stream.next line_stream) in
        Stream.from
      (fun _ ->
         try Some (to_dict header (split_line (Stream.next line_stream))) with End_of_file -> None)

          

Results

            $ time pypy main.py 
encountered: 100000 current logloss: 0.435212
encountered: 200000 current logloss: 0.433022
encountered: 300000 current logloss: 0.416148
encountered: 400000 current logloss: 0.411722
encountered: 500000 current logloss: 0.403127
encountered: 600000 current logloss: 0.405477
encountered: 700000 current logloss: 0.405665
encountered: 800000 current logloss: 0.400225
encountered: 900000 current logloss: 0.397522

real  0m24.649s
user  0m23.964s
sys 0m0.540s

          

            $ ocamlfind opt -linkpkg str.cmxa maths.ml read_tools.ml main.ml -o main.byte
$ time ./main.byte 
encountered: 100000  logloss:0.422601941149 
encountered: 200000  logloss:0.418270062183 
encountered: 300000  logloss:0.409809392923 
encountered: 400000  logloss:0.40026898568 
encountered: 500000  logloss:0.396667944914 
encountered: 600000  logloss:0.398242961764 
encountered: 700000  logloss:0.397423028349 
encountered: 800000  logloss:0.394274547682 
encountered: 900000  logloss:0.391874034523 

real  0m23.187s
user  0m22.980s
sys 0m0.176s

          

The logloss seems lower with the OCaml code. Don’t be fooled by this. This stems from the fact that Python and OCaml have different hashing functions, therefore, the collisions happening between the features (with one hot encoding) are not the same from one source to the other. As for the training time, there is no sensible difference ! I guess that concludes the story… For this problem, just pick the language you feel most comfortable with :)

Improvements

In terms of code

Actually, OCaml as well has a csv library, available with opam.

In terms of accuracy

There are many variations of this algorithm. The “most famous” being the FTRL

Learning more

Foundations of Machine Learning by Mehryar Mohri, Afshin Rostamizadeh, Ameet Talwalkar and Francis Bach provides a theoretical framework to various machine learning algorithms and a detailed implementation of some of them. The section about streaming algorithms is particularly useful in the context of this article.

Real World OCaml: Functional programming for the masses by Yaron Minsky, Anil Madhavapeddy and Jason Hickey is a good introduction to OCaml.

OCaml vs PyPy

May 26, 2018