How to optimize PyTorch code ?

Optimizing some deep learning code may seem quite complicated. After all, PyTorch is already super optimized so why (and how) one could improve what is already great ?

For the why, there are many reasons:

Why ?

GPU time is expensive

As data scientist, we are used to telling our bosses or investors that we need more money to improve our models (oh look at that beautiful GPU), what if you could halve your training time for free ?

Less training time is more accuracy

The money argument apart, you can tests more models in the same time! In the end, it means that you are more likely to find a better model in a same amount of time.

Optimizing code is super satisfying

Honestly, there is no other way to put it. Instead of optimizing the accuracy of your model, try to optimize its training time. See how good this feels!

Save the planet

Well, it is actually a pity to think that in the example below (taken from a typical training loop for PyTorch), half of the time is simply wasted. We will not degrade the (validation) loss of the model, yet, we will spend less electricity and time to achieve it!

How ?

Well, we will not touch anything that is inside PyTorch, obviously. I just noted that, as data scientist, we may not be too aware of low hanging fruits in terms of performance in our scripts. Here, the tricks will mostly lie into data loading and transformations.

A concrete case

Without further due, let’s start! I will focus on the most common parts of a PyTorch training script, in the case of image recognition. Here the problem is a transfer learning from a pretrained model (resnet34, because it is fast to execute) to a binary classification problem.

The images are satellite data so they have more channels than a usual RGB image.

We will focus on the image loading function:

def load_and_convert_tiff(file_path):
    image = tiff.imread(file_path)
    R = image[:, :, 1]*255*2
    G = image[:, :, 2]*255*2
    B = image[:, :, 3]*255*2
    rgb_image = np.stack((R, G, B), axis=2).astype(np.uint8)
    rgb_image = Image.fromarray(rgb_image)
    return rgb_image

And the Dataset class:

class Sentinel2Dataset(Dataset):
    def __init__(self, file_paths, labels, transform=None):
        self.file_paths = file_paths
        self.labels = labels
        self.transform = transform

    def __len__(self):
        return len(self.file_paths)

    def __getitem__(self, idx):
        img_path = self.file_paths[idx]
        image = load_and_convert_tiff(img_path)
        if self.transform is not None:
            image = self.transform(image)
        label = self.labels[idx]
        return image, label

Along with the composition of transformations:

augment_and_transform = transforms.Compose(
            [transforms.Resize(334),
             transforms.ToTensor(),
             transforms.RandomRotation(degrees=90),
             transforms.RandomVerticalFlip(p=0.5),
             transforms.RandomHorizontalFlip(p=0.5),
             transforms.Normalize(
                 mean=[0.485, 0.456, 0.406],
                 std=[0.229, 0.224, 0.225]),])

We have a usual training loop, similar to what PyTorch suggests to do.

In order to keep the article simple, I will not dive into other parts of the code until it becomes necessary. Let’s just assume that the training loop is properly implemented, and we use tqdm to measure the training time of our model.

These results will be our baseline for what follows:

32it [00:11,  2.76it/s]
EPOCH: 0 | LOSS train: 0.511 | LOSS valid: 0.495
32it [00:11,  2.82it/s]
EPOCH: 1 | LOSS train: 0.389 | LOSS valid: 0.405
32it [00:10,  3.00it/s]
EPOCH: 2 | LOSS train: 0.352 | LOSS valid: 0.359

Where 2.76it/s means 2.76 batches pass per second (the dataset is split into 32 batches). This is the number we will focus on. We will also keep an eye on the training and validation losses to make sure we do not break things.

Get rid of the useless

The part rgb_image = Image.fromarray(rgb_image) is actually useless. Some people use it because transforms.Resize() may behave slightly differently on PIL images than on tensors (depending on the parameters you feed to the transform).

We can simply remove it from the function:

def load_and_convert_tiff(file_path):
    image = tiff.imread(file_path)
    R = image[:, :, 1]*255*2
    G = image[:, :, 2]*255*2
    B = image[:, :, 3]*255*2
    rgb_image = np.stack((R, G, B), axis=2).astype(np.uint8)
    return rgb_image

And now we switch the transforms.ToTensor() and transforms.Resize(...) as transforms does not support numpy arrays.

augment_and_transform = transforms.Compose(
	[transforms.ToTensor(),
	 transforms.Resize(334, antialias=False, interpolation=InterpolationMode.NEAREST_EXACT),
	 transforms.RandomRotation(degrees=90),
	 transforms.RandomVerticalFlip(p=0.5),
	 transforms.RandomHorizontalFlip(p=0.5),
	 transforms.Normalize(
	     mean=[0.485, 0.456, 0.406],
	     std=[0.229, 0.224, 0.225]),])

We can rerun our script and…

32it [00:09,  3.45it/s]
EPOCH: 0 | LOSS train: 0.519 | LOSS valid: 0.506 
32it [00:09,  3.43it/s]
EPOCH: 1 | LOSS train: 0.389 | LOSS valid: 0.407 
32it [00:09,  3.38it/s]
EPOCH: 2 | LOSS train: 0.351 | LOSS valid: 0.357 

TADA ! the speed up is already impressive!

Use your RAM if possible!

Note that this advice will not work if your dataset is too large! Here, we have 1000 training images. This is quite low, my machine has 32GB of RAM so I might as well load them once and for all. Besides, it will save my SSD.

Instead of reading the image from the hard drive each time __getitem__ is called, we can make an array of image which will be stored in memory.

In my case, these images only represents 10% of my RAM.

class Sentinel2Dataset(Dataset):
    def __init__(self, file_paths, labels, transform=None):
        self.file_paths = file_paths
        self.images = []
        for file_path in tqdm(self.file_paths):
            image = load_and_convert_tiff(file_path)
            self.images.append(image)

        self.labels = labels
        self.transform = transform

    def __len__(self):
        return len(self.file_paths)

    def __getitem__(self, idx):
        img_path = self.file_paths[idx]
        image = self.images[idx]
        if self.transform is not None:
            image = self.transform(image)
        label = self.labels[idx]
        return image, label

Note the overhead! Indeed, when creating the class, it takes 2 seconds to load all the images in memory.

100%|███████████████████████████████| 993/993 [00:02<00:00, 418.36it/s]
100%|███████████████████████████████| 249/249 [00:00<00:00, 380.70it/s]

But waow, the speedup in training is totally worth it! We are close to halving our initial training time.

32it [00:06,  5.01it/s]
EPOCH: 0 | LOSS train: 0.519 | LOSS valid: 0.506
32it [00:06,  5.31it/s]
EPOCH: 1 | LOSS train: 0.389 | LOSS valid: 0.407
32it [00:06,  5.30it/s]
EPOCH: 2 | LOSS train: 0.351 | LOSS valid: 0.357

Factor the transformations!

The resize happens all the time for the images. So each pass on the whole training set resizes the same image again and again. Let’s get rid of it. Let’s turn the augment transform:

augment_and_transform = transforms.Compose(
	[transforms.ToTensor(),
	 transforms.Resize(334, antialias=False, interpolation=InterpolationMode.NEAREST_EXACT),
	 transforms.RandomRotation(degrees=90),
	 transforms.RandomVerticalFlip(p=0.5),
	 transforms.RandomHorizontalFlip(p=0.5),
	 transforms.Normalize(
	     mean=[0.485, 0.456, 0.406],
	     std=[0.229, 0.224, 0.225]),])

To:

augment_and_transform = transforms.Compose(
	[transforms.RandomRotation(degrees=90),
	 transforms.RandomVerticalFlip(p=0.5),
	 transforms.RandomHorizontalFlip(p=0.5),
	 transforms.Normalize(
	     mean=[0.485, 0.456, 0.406],
	     std=[0.229, 0.224, 0.225]),])

So that only the data augmentation happens here. Now, when loading the images in memory, let’s perform the common transformations:

class Sentinel2Dataset(Dataset):
    def __init__(self, file_paths, labels, transform=None):

        factored_transform = transforms.Compose(
            [transforms.ToTensor(),
             transforms.Resize(334, antialias=False, interpolation=InterpolationMode.NEAREST_EXACT)])

        self.file_paths = file_paths
        self.images = []
        for file_path in tqdm(self.file_paths):
            image = load_and_convert_tiff(file_path)
            transformed_image = factored_transform(image)
            self.images.append(transformed_image)

        self.labels = labels
        self.transform = transform

    def __len__(self):
        return len(self.file_paths)

    def __getitem__(self, idx):
        img_path = self.file_paths[idx]
        image = self.images[idx]
        if self.transform is not None:
            image = self.transform(image)
        label = self.labels[idx]
        return image, label

And this is it, the training time decreased once more:

32it [00:06,  5.12it/s]
EPOCH: 0 | LOSS train: 0.519 | LOSS valid: 0.506
32it [00:05,  5.45it/s]
EPOCH: 1 | LOSS train: 0.389 | LOSS valid: 0.407
32it [00:05,  5.46it/s]
EPOCH: 2 | LOSS train: 0.351 | LOSS valid: 0.357

Play with deterministic / benchmark

Some of you may be familiar with the deterministic and benchmark flags. I usually see this useful function:

def seed_everything(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

If you inverse the booleans, your results may not be the same at every run (the difference should be low though), but you will gain some extra time.

def seed_everything(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = False
    torch.backends.cudnn.benchmark = True

As the running output shows:

32it [00:07,  4.39it/s]
EPOCH: 0 | LOSS train: 0.519 | LOSS valid: 0.487
32it [00:05,  5.61it/s]
EPOCH: 1 | LOSS train: 0.392 | LOSS valid: 0.410
32it [00:05,  5.63it/s]
EPOCH: 2 | LOSS train: 0.358 | LOSS valid: 0.400

Note that the variation in the loss is worrysome…

Use (try) gradient accumulation

Gradient accumulation seemed promising to me. I read it on this blog and the heuristic seemed interesting.

Besides, as it reduces the number of gradient updates to the model, I expected to gain some performance (and this should be particulary true on larger models).

The recipe consists in turning the training loop:

def train_one_epoch(epoch_index):
total_loss = 0.

for i, data in tqdm(enumerate(training_loader)):

    inputs, labels = data
    inputs = inputs.to(torch.device(device))
    labels = labels.to(torch.device(device))

    optimizer.zero_grad()
    outputs = model(inputs)

    batch_loss = loss_fn(outputs, labels)
    batch_loss.backward()

    optimizer.step()
    total_loss += batch_loss.item()

return total_loss / (i+1)

In this, where the gradient update is performed evervy accum_iter step.

def train_one_epoch(epoch_index):
    total_loss = 0.
    accum_iter = 4
    
    optimizer.zero_grad()
    for i, data in tqdm(enumerate(training_loader)):
    
        inputs, labels = data
        inputs = inputs.to(torch.device(device))
        labels = labels.to(torch.device(device))
    
        outputs = model(inputs)
    
        batch_loss = loss_fn(outputs, labels)
        batch_loss = batch_loss / accum_iter
        batch_loss.backward()
        total_loss += batch_loss.item()
    
        if ((i + 1) % accum_iter == 0) or (i + 1 == len(training_loader)):
    	optimizer.step()
    	optimizer.zero_grad()
    
    return total_loss / (i+1)

But the decrease in performance is too important (maybe I am doing something wrong ?) for no gain in execution time.

32it [00:06,  5.05it/s]
EPOCH: 0 | LOSS train: 0.156 | LOSS valid: 0.612
32it [00:06,  5.33it/s]
EPOCH: 1 | LOSS train: 0.127 | LOSS valid: 0.532
32it [00:06,  5.30it/s]
EPOCH: 2 | LOSS train: 0.110 | LOSS valid: 0.482

More performance with smaller images!

By reducing the size of the image, we can achieve a massive speedup, just note the 228 instead of 334:

factored_transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Resize(228, antialias=False, interpolation=InterpolationMode.NEAREST_EXACT)])

And we are 4 times faster than our initial benchmark!

32it [00:04,  7.78it/s]
EPOCH: 0 | LOSS train: 0.518 | LOSS valid: 0.500
32it [00:02, 10.76it/s]
EPOCH: 1 | LOSS train: 0.393 | LOSS valid: 0.436
32it [00:02, 10.76it/s]
EPOCH: 2 | LOSS train: 0.363 | LOSS valid: 0.393

However, these kind of optimization changes what we actually are doing and seem to harm the loss of the model…

Conclusion

This is it! We almost halved the training time of our model, without harming its performance :) If we allow ourselves to decrease the model performance, we saw that smaller images are actually a way to go much faster.

I do not have other tricks that can easily be used at the moment. I hope you liked this article, do not hesitate to share to your friends, colleagues and on social media !

Learning more

If you are new to machine learning, Deep learning by Ian Goodfellow, Yoshua Bengio, Aaron Courville is an excellent introduction to the topic. The algorithms and mathematics are presented without any code so it will not be outdated as soon as new breaking change is introduced in the main packages ;) *note that this is a sponsored link.

Acronyms of deep learning

When learning about deep learning, it can be quite confusing to discover boatloads of acronyms that readers are expected to know. In my quest to discovering it, I gathered plenty of them which I write here as a reference. Hope this will help :)

General

  1. AI - Artificial Intelligence: The simulation of human intelligence processes by machines, typically involving tasks such as learning, reasoning, and problem-solving.

  2. ML - Machine Learning: A subset of AI that provides systems with the ability to automatically learn and improve from experience without being explicitly programmed.

  3. DL - Deep Learning: A subset of machine learning that utilizes neural networks with multiple layers (hence “deep”) to extract higher-level features from raw data.

  4. NN - Neural Network: A computational model inspired by the structure and function of the human brain, consisting of interconnected nodes (neurons) organized in layers.

  5. CNN - Convolutional Neural Network: A type of neural network specifically designed for processing structured grid-like data, such as images or audio.

  6. RNN - Recurrent Neural Network: A type of neural network that processes sequential data by maintaining an internal state, allowing it to exhibit temporal dynamics.

  7. LSTM - Long Short-Term Memory: A type of recurrent neural network architecture that addresses the vanishing gradient problem, enabling better learning of long-term dependencies in sequential data.

  8. GRU - Gated Recurrent Unit: Another type of recurrent neural network architecture, similar to LSTM but with a simpler structure, making it more computationally efficient.

  9. GAN - Generative Adversarial Network: A type of neural network architecture consisting of two networks, a generator and a discriminator, trained adversarially to generate realistic synthetic data.

  10. DNN - Deep Neural Network: A neural network with multiple layers between the input and output layers, allowing it to learn complex representations of data.

  11. RL - Reinforcement Learning: A type of machine learning where an agent learns to make decisions by interacting with an environment to maximize some notion of cumulative reward.

  12. ReLU - Rectified Linear Unit: A commonly used activation function in deep learning that introduces non-linearity by outputting the input if it is positive and zero otherwise.

Optimizers

  1. SGD - Stochastic Gradient Descent: An optimization algorithm commonly used in training neural networks by iteratively updating the model parameters in the direction that minimizes the loss function.

  2. Adam (Adaptive Moment Estimation): Adam is an optimization algorithm used for training deep learning models. It combines the advantages of two other popular optimization algorithms, namely RMSprop and momentum. Adam computes adaptive learning rates for each parameter by estimating the first and second moments of the gradients. This adaptive learning rate helps Adam converge faster and more robustly compared to traditional stochastic gradient descent (SGD) methods.

  3. RMSprop (Root Mean Square Propagation): RMSprop is an optimization algorithm designed to address some of the limitations of traditional stochastic gradient descent (SGD) methods, particularly in dealing with vanishing and exploding gradients. RMSprop adaptively scales the learning rate for each parameter based on the magnitude of recent gradients. By maintaining a moving average of the squared gradients, RMSprop effectively normalizes the updates and improves the stability and convergence of the optimization process.

Famous networks

  1. ResNet (Residual Neural Network): A type of deep convolutional neural network architecture that introduced the concept of residual learning, where shortcut connections (skip connections) are added to the network to ease the training of very deep neural networks by mitigating the vanishing gradient problem.

  2. EfficientNet: A family of convolutional neural network architectures that achieve state-of-the-art performance on image classification tasks with significantly fewer parameters and FLOPs (floating-point operations) compared to traditional models. EfficientNet achieves this efficiency by scaling the network width, depth, and resolution in a balanced manner using compound scaling.

  3. VGG (Visual Geometry Group): A convolutional neural network architecture proposed by the Visual Geometry Group at the University of Oxford. VGG is characterized by its simplicity and uniform architecture, consisting of several convolutional layers followed by max-pooling layers, with fully connected layers at the end.

  4. ImageNet: A large-scale dataset containing over 14 million images annotated with high-level semantic labels, designed for training and evaluating image classification models. ImageNet has been instrumental in advancing the field of computer vision, serving as the benchmark dataset for numerous deep learning models.

  5. DenseNet (Densely Connected Convolutional Network): A type of convolutional neural network architecture where each layer is connected to every other layer in a feed-forward fashion. DenseNet introduces dense connectivity patterns between layers, allowing feature maps from all preceding layers to be directly input into subsequent layers. This dense connectivity encourages feature reuse, facilitates gradient flow, and significantly reduces the number of parameters, leading to improved parameter efficiency and performance compared to traditional architectures.

Hardware

  1. GPU - Graphics Processing Unit: A specialized electronic circuit designed to rapidly manipulate and alter memory to accelerate the creation of images for display on electronic devices. In deep learning, GPUs are commonly used for their parallel processing capabilities, which significantly speed up neural network training.

  2. TPU - Tensor Processing Unit: An application-specific integrated circuit (ASIC) developed by Google specifically for neural network processing. TPUs are optimized for TensorFlow and are designed to accelerate both training and inference tasks.

Learning more

Deep learning by Ian Goodfellow, Yoshua Bengio, Aaron Courville is an excellent introduction to the topic. The algorithms and mathematics are presented without any code so it will not be outdated as soon as new breaking change is introduced in the main packages ;) note that this is a sponsored link.

AI with OCaml : the tic tac toe game

In the ever-evolving landscape of artificial intelligence, the pursuit of perfection in game-playing algorithms has been an enduring quest. One of the earliest and simplest games to captivate the minds of AI enthusiasts is Tic-Tac-Toe. Although this classic game might seem elementary, it serves as a testing ground for cutting-edge AI strategies. In this article, we embark on a journey into the world of OCaml, a functional programming language, and its application in crafting an unbeatable Tic-Tac-Toe-playing AI.

Me affronting my machine

I am playing the red cross

At the heart of our AI lies the powerful Min-Max strategy, a method that has revolutionized the way machines approach decision-making in two-player games. With OCaml’s elegance and expressiveness, we delve into the intricacies of this strategy, exploring how it enables our AI to not just play, but dominate, the age-old game of Tic-Tac-Toe. Whether you’re a seasoned programmer, a curious AI enthusiast, or simply a Tic-Tac-Toe aficionado, this article promises to shed light on the inner workings of AI-driven game strategy and the wonders that OCaml brings to the table. So, let’s embark on this exciting journey and unlock the secrets behind a Tic-Tac-Toe AI that can never be beaten.

Let’s start with the definition of the pieces:

type piece = Cross | Circle


let piece_opposite = function
  | Cross -> Circle
  | Circle -> Cross


let add_color color s = "\027[3" ^ color ^ "m" ^ s ^ "\027[39m"


let piece_option_to_string = function
  | Some(Cross) -> add_color "1" "X"
  | Some(Circle) -> add_color "6" "0"
  | None -> " "

We will represent the board as a piece option array array. The option is None if the board is empty at a specific position, otherwise, it is Some(piece)

We also want to render the board so that a human can easily play :)

type board = piece option array array


let board_at board i j = board.(i).(j)


let board_init _ =
  Array.init 3 (fun _ -> Array.make 3 None)


let board_copy b =
  let res = board_init () in
  let n,p = 3, 3 in
  for i = 0 to n - 1 do
    for j = 0 to p - 1 do
      res.(i).(j) <- b.(i).(j)
    done
  done ;
  res


let board_place board piece i j =
  let board' = board_copy board in
  let () = board'.(i).(j) <- Some(piece) in
  board'


let board_transpose b =
  let res = board_init () in
  let n,p = 3, 3 in
  for i = 0 to n - 1 do
    for j = 0 to p - 1 do
      res.(j).(i) <- b.(i).(j)
    done
  done ;
  res


let board_print b =
  let print_separator () =
    print_string "+" ;
    Array.iter (fun _ -> print_string "-") b.(0) ;
    print_string "+\n"
  in

  let print_row r =
    print_string "|" ;
    Array.map piece_option_to_string r |> Array.fold_left ( ^ ) "" |> print_string ;
    print_string "|\n"
  in

  print_separator () ;
  Array.iter print_row b ;
  print_separator ()

So, we only have pieces and boards so far, time for some game logic.

let has_won piece board =

  let winning_line = Array.for_all (fun x -> x = Some(piece)) in
  let is_main_diagonal_winning = Array.for_all (fun x -> x = Some(piece)) [| board.(0).(0); board.(1).(1); board.(2).(2) |] in
  let is_other_diagonal_winning = Array.for_all (fun x -> x = Some(piece)) [| board.(0).(2); board.(1).(1); board.(2).(0) |] in

  Array.exists winning_line board 
  || Array.exists winning_line (board_transpose board)
  || is_main_diagonal_winning
  || is_other_diagonal_winning


let has_lost piece board =
  has_won (piece_opposite piece) board

let winning_board_cross = [|
  [|Some(Cross); None; None|];
  [|Some(Cross); Some(Circle); None|];
  [|Some(Cross); Some(Circle); None|];
|]

let winning_board_circle = [|
  [|Some(Cross); None; Some(Circle)|];
  [|Some(Cross); Some(Circle); None|];
  [|Some(Circle); Some(Circle); None|];
|]


let empty_board = board_init ()


let () =  assert (has_won Cross winning_board_cross)
let () =  assert (has_won Circle winning_board_circle)
let () =  assert (has_lost Circle winning_board_cross)
let () =  assert (has_lost Cross winning_board_circle)
let () =  assert (not (has_lost Cross empty_board) && not (has_lost Circle empty_board))
let () =  assert (not (has_won Cross empty_board) && not (has_won Circle empty_board))
let possible_moves board =
  let all_moves = List.init 3 (fun i -> List.init 3 (fun j -> (i,j))) |> List.flatten in
  List.filter (fun p -> board_at board (fst p) (snd p) |> Option.is_none) all_moves


let _ = assert (List.length (possible_moves winning_board_cross) = 4)


let is_game_over board = 
  has_won Cross board 
  || has_won Circle board
  || Array.for_all (Array.for_all Option.is_some) board


let () = assert (is_game_over winning_board_cross)
let () = assert (is_game_over winning_board_circle)

The evaluation function is a critical component of an AI-powered Tic-Tac-Toe player utilizing the Min-Max strategy. Its primary role is to assess the current game state and assign a numerical value that reflects the desirability of that state for the AI player. This numerical value serves as a basis for the Min-Max algorithm to make informed decisions about which moves to choose during the game.

let eval player board =
  if has_won player board then
    1
  else if has_lost player board then
    ~- 1
  else
    0

let () = assert ((eval Cross winning_board_cross) = 1)
let () = assert ((eval Circle winning_board_cross) = ~-1)
let () = assert ((eval Circle winning_board_circle) = 1)
let () = assert ((eval Cross winning_board_circle) = ~-1)

In the context of the Min-Max algorithm, the evaluation function often works in tandem with recursive calls. As the Min-Max algorithm explores possible future game states through a tree of possibilities, the evaluation function helps assess these states at different depths. The AI calculates scores for each state, and these scores are propagated up the tree to make informed decisions.

Let’s define a tree

type 'a tree =
  | Node of 'a * 'a tree list
  | Leaf


let tree_to_list tree =
  let rec aux = function
    | Leaf ->
        []
    | Node (n, children) ->
        n :: List.fold_left List.append [] (List.map aux children)
  in
  aux tree

And the tree of moves.

let make_moves_tree max_depth board player = 

  let rec aux depth board player =
    let moves = possible_moves board in
    match moves, depth with 
    | [], _ -> Node(board, [Leaf])
    | _, 0 -> Node(board, [Leaf])
    | l, d -> if (is_game_over board) then
        Node(board, [Leaf])
      else
        Node(board, 
                List.map 
                  (fun m -> aux (d - 1) (board_place board player (fst m) (snd m)) (piece_opposite player) ) 
                  moves)
  in
  aux max_depth board player

Now we are in a position to implement the evaluation function, taking into account future states. As the min max name suggests, we will need the following two functions:

let list_max l = List.fold_left Int.max Int.min_int l
                                                     

let list_min l = List.fold_left Int.min Int.max_int l


let () = assert (list_min [1;3;2] = 1)
let () = assert (list_max [1;3;2] = 3)

And we are now in a position to have a full evaluation function!

let evaluate_board max_depth board player =

  let tree = make_moves_tree max_depth board player in

  let rec aux d tree = match tree with
    | Node(b, [Leaf]) -> eval player b
    | Node(b, l) -> 
      if (d mod 2 = 0) then list_max (List.map (aux (d+1)) l)
      else list_min (List.map (aux (d+1)) l)
    | Leaf -> failwith "Should not happen"

  in aux 0 tree


let () = assert ((evaluate_board 1 winning_board_cross Cross) = 1)
let () = assert ((evaluate_board 1 winning_board_cross Circle) = ~-1)

Finding the best move is just a matter of minimizing the opponent evaluation function:

let find_best_move max_depth board player =
  let moves = possible_moves board in
  let possible_boards = List.map (fun m -> board_place board player (fst m) (snd m)) moves in
  let scores = List.map (fun b -> evaluate_board max_depth b (piece_opposite player)) possible_boards in
  let moves_and_scores = List.combine moves scores in
  let best_move_and_score = List.sort (fun x y -> Int.compare (snd x) (snd y)) moves_and_scores
                           |> List.hd in
  best_move_and_score |> fst


let almost_winning_board = [|
  [|Some(Cross); None; None|];
  [|Some(Cross); None; None|];
  [|None; Some(Circle); Some(Circle)|];
|]

let () = assert ((find_best_move 2 almost_winning_board Circle) = (2,0))
let () = assert ((find_best_move 2 almost_winning_board Cross) = (2,0))

And the playing loop:

let max_depth = 9

let rec play board player =
  let () = board_print board in

  if (is_game_over board) then begin
      print_string "Game over!\n" ;
        if has_won Cross board then 
          print_string "You won!\n"
        else if has_won Circle board then
          print_string "Computer won!\n"
        else
          print_string "Draw!"
    end
  else
    match player with
    | Cross -> 
      let () = print_string "\nEnter move...\n" in
      let () = flush stdout in
      let command = input_line stdin |> int_of_string in
      let i, j = (command / 3), (command mod 3) in
      let board' = board_place board Cross i j in
      play board' Circle

    | Circle ->
      let i, j = find_best_move max_depth board Circle in
      let board' = board_place board Circle i j in
      play board' Cross

let () = play empty_board Cross
Add arguments to Python decorators

Python decorator are a convenient way to wrap a function with another one. Per example, when timing a function, it is nice to call the timer before and after without having to rewrite the same timer.start() and timer.stop() calls everywhere.

Among other nice features of decorators, there are a couple of them which I want to stress here:

  • they apply to classes,
  • they accept arguments.

Here, I want to implement a printer for my classes showing exactly which class has called some of its methods and when. Ideally, the colors should be specified for each class, so that it makes it easy to monitor the execution of the code when various objects called different methods.

This is something I often face when dealing with machine learning pipelines: they consist of many objects (blend of models, feature pipelines) and it is hard to know exactly who is running an when, and identify bottlenecks at a glance.

Class decorators

Applying a decorator to a class allows to add method on classes directly, a little bit like inheritance. It drove my puzzled: why would one prefer a class decorator to inheritance ? See per example this question. The thing is that inheritance actually has a meaning in terms of your object: a cat is an animal, therefore, inheritance makes sense. On the other hand, if you want to add generic helpers that do not really fit into your object definitions, a class decorator is preferable.

import datetime

def class_printer(cls):

      def print_with_time(self, s):
          now_str = datetime.datetime.now().strftime("%m/%d/%y, %h:%m:%s")
          print(f"[{type(self).__name__} - {now_str}] {s}")

      setattr(cls, 'print_with_time', print_with_time)

      return cls


@class_printer
class a:

    def __init__(self):
        pass

    def run(self):
        self.print_with_time("run method called")

if __name__ == "__main__":
    a = a()
    a.run()

And it outputs:

[A - 03/08/2022, 11:20:44] run method called

Great! Now, every time we need this specific printer, we don’t have much to do. Time to improve it!

Passing arguments to a decorator

Example

import datetime

class bcolors:
    HEADER = '\033[95m'
    OKBLUE = '\033[94m'
    OKCYAN = '\033[96m'
    OKGREEN = '\033[92m'
    WARNING = '\033[93m'
    FAIL = '\033[91m'
    ENDC = '\033[0m'
    BOLD = '\033[1m'
    UNDERLINE = '\033[4m'


def color_class_printer(color):
    def class_printer(cls):

        def print_with_time(self, s):
            now_str = datetime.datetime.now().strftime("%m/%d/%Y, %H:%M:%S")
            print(f"[{color}{type(self).__name__} - {now_str}{bcolors.ENDC}] {s}")

        setattr(cls, 'print_with_time', print_with_time)

        return cls
    return class_printer
Unique elements in a list OCaml

Unique elements in a list in OCaml

First, one should note that there is a function List.sort_uniq which is proposed in OCaml. However, it requires the implementation of a comparison function (like > for integers of float numbers) and this comparison it not always obvious to implement (think about a list of arrays per example).

Instead, the following snippet will do the job!

Snippet

let unique l =

  let rec aux l acc =
    match l with
    | [] ->
        List.rev acc
    | h :: t ->
        if List.mem h acc then aux t acc else aux t (h :: acc)
  in
  aux l []

And we can run it in the toplevel:

# let unique l =

  let rec aux l acc =
    match l with
    | [] ->
        List.rev acc
    | h :: t ->
        if List.mem h acc then aux t acc else aux t (h :: acc)
  in
  aux l []                  ;;
val unique : 'a list -> 'a list = <fun>

# unique [1;2;3] ;;
- : int list = [1; 2; 3]

# unique [1;2;3;2;4;1];;
- : int list = [1; 2; 3; 4]
#

Why the List.rev

The List.rev may be seen as a useless operation. It just makes sure that the elements are returned in their order of first occurence.

In order to be closer to the standard terminology used in the language, one could maybe implement two functions: unique and stable_uniq.

let unique l =

  let rec aux l acc =
    match l with
    | [] ->
        acc
    | h :: t ->
        if List.mem h acc then aux t acc else aux t (h :: acc)
  in

let stable_unique l =

  let rec aux l acc =
    match l with
    | [] ->
        List.rev acc
    | h :: t ->
        if List.mem h acc then aux t acc else aux t (h :: acc)
  in
  aux l []

List.mem and List.memq

Note that I used List.mem but you should be aware of the following differences (from the docs):

val mem : ‘a -> ‘a list -> bool

mem a set is true if and only if a is equal to an element of set.

val memq : ‘a -> ‘a list -> bool

Same as List.mem, but uses physical equality instead of structural equality to compare list elements.

List intersection in OCaml

Intersection of two lists in OCaml

List intersection is a simple operation that is often needed, unfortunately, it is not implemeted directly in OCaml (at least, in 4.12). Note that with large collections, sets are a better container for intersections. However, when you know you are working with a limited number of items and that performance should not be an issue, it is perfectly fine to intersect lists.

The following one liner will do the job:

let intersection l1 l2 = List.filter (fun e -> List.mem e l2) l1 

That’s it. ````List.mem``` makes sure that an element is a __mem__ber of a list, and we filter according to this condition.

You can test it in the toplevel! I recommend using it for these simple functions. It does not replace a proper unit testing, but if you are working quickly on a prototype, nobody will blame you ;)

#  let intersection l1 l2 = List.filter (fun e -> List.mem e l2) l1  ;;
val intersection : 'a list -> 'a list -> 'a list = <fun>
# intersection [1;2;3] [4] ;;
- : int list = []
# intersection [1;2;3] [1] ;;
- : int list = [1]

Intersection of a list of list

Let’s leverage the ````fold_left``` function ! It makes sense to define the intersection of a single list as the list itself.

let lists_intersection lists = match lists with
  | [] -> []
  | [h] -> h
  | h::t -> List.fold_left intersection h t

And we can test it quickly as well:

# let lists_intersection lists = match lists with                     
  | [] -> []
  | [h] -> h
  | h::t -> List.fold_left intersection h t      ;;
val lists_intersection : 'a list list -> 'a list = <fun>
# lists_intersection [[1;2;3]; [1]];;   
- : int list = [1]
# lists_intersection [[1;2;3]; [1;2]; [1;2]];;
- : int list = [1; 2]

It works!

Best books on Fermat's last theorem

This article was updated on 26th february 2023 to add a summary and new books.

Introduction

I haven’t published many articles these days, the main reason being that I got attracted into the history of Fermat’s last theorem. This theorem states that the equation:

\[X^{n} + Y^{n} = Z^{n}\]

Only has trivial integer solutions (i.e. one of the elements is 0) for \(n>2\). This problem fascinated mathematicians for more than three centuries before it was finally solved! Indeed, it is easy to see that for \(n=2\), the triple \(3^2 + 4^2 = 5^2\) does the job.

Regarding the “Maths contents”, do not be scared, all books are a good fit for motivated undergrads in mathematics. Those with one star are even more accessible.

Book Maths contents Topics Interest
Invitation to the Mathematics of Fermat-Wiles ** Approach to Wiles proof without leaving historical approaches ***
Fermat’s Last Theorem: A Genetic Introduction to Algebraic Number Theory ** Exposes the theory of algebraic integers, Kummer’s proof for regular prime and many historical details ***
Fermat’s Last Theorem for Amateurs * Many partial historical results related to FLT presented in great details. ***
13 Lectures on Fermat’s last theorem ** A lot of partial historical results related to FLT presented without details. **
Fermat a-t-il démontré son grand théorème ? L’hypothèse « Pascal » (in French) * A historical speculation around a possible proof of Fermat and a detailed analysis of his correspondance. **
Fermat last theorem None General introduction and history **
Le théorème de Fermat : son histoire by E. Nogues ** A history of research around FLT, written at the beginning of the 20th century! *

Details

Top 3

We can decompose the history of the main approaches of Fermat’s last theorem in three epochs: the arithmetical one (Sophie Germain, Cauchy, Legendre, Pellet…), the algebraic integers epoch (Kummer, Mirimanoff…) and the modular form (Wiles) epoch. Obviously, this misses many other approaches (sieves, analysis…) but it covers a large part of the historical publications.

These three books would cover these epochs and I strongly recommended all three of them to have a beautiful journey into the study of FLT.

Invitation to the Mathematics of Fermat-Wiles by Yves Hellegouarch

This book aims to present the proof given by Wiles. Obviously, some details will not be presented, but this is an amazing introduction. Topics presented contains introction topic (the case \(n=4\)), Kummer’s proof… Soon enough, the author jumps to Elliptic Functions, Elliptic Curves and Modular Forms which are essential.

It contains a lot of exercises and clear proofs, if there is only one book to read on the topic (and you are not afraid of mathematical details), this is probably the best one to read!

Fermat’s Last Theorem: A Genetic Introduction to Algebraic Number Theory

This one is a detailed study of Kummer’s approach. Ribenboim books only scratch the surface of this proof while here, (almost) the whole book is dedicated to give a detailed proof of Kummer’s result. Some beautiful relations between Fermat’s equation, the Zeta function and Bernoulli numbers are prestend in a very clear way, with various numeric examples. Besides, the author also dug into the archives of the French academy of science, yielding thrilling historical notes about sessions where members wrongly speculated about their possible proofs.

Fermat’s Last Theorem for Amateurs by Paulo Ribenboim

This book presents many special cases of proofs of FLT \(n=2,3,4,5,6,7...\) with many details. Though the original problem is to find solutions in the set of integers, the author propose to study this equation in other sets as well: Gaussian integers (\(\mathbb{Z}[i]\)), \(p\)-adic numbers (they truely are a hidden gem of number theory)… As the title states, it is for amateurs though this is presented as your usual math book: with theorem and proofs. Note that however, an emphasis is put on examples.

Other books

13 Lectures on Fermat’s last theorem by Paulo Ribenboim

This one is interesting: it has been written in 1979, before Fermat’s last theorem was actually proven. It summarizes many of the efforts put into trying to prove FLT and features some of the proofs. The historical context is essential, as many of the theorems are presented with notes regarding their history.

A large part of the book is devoted to the fascinating proof by Kummer and its limitations. Many other results with various importance are also presented: estimates, equivalence of FLT with other statements… One I found particularly interesting is:

  1. The equation \(X^{2m-1} + Y^{2m-1} = Z^{2m-1}\) has only the trivial solutions in \(\mathbb{Z}\)
  2. For every \(a \in \mathbb{Q}\) non-zero, the polynomial \(Z^2-a^mZ+a\) is irreducible over \(\mathbb{Q}\)

A minor regret is that many of the statements are not proved and left “as exercises” with no clue regarding their level of difficulty nor hints. It is particularly interesting to read it with Fermat’s Last Theorem for Amateurs as some of the proofs not present in this book are in the other.

However, it is remains an amazing introduction if you do not want to dig into all the details but instead are just looking for a nice overview of the history, with some details.

Fermat last theorem by Simon Singh

Requires absolutely no mathematical background. It is a very nice introduction to the problem, mostly focusing on the history of (and around) Fermat’s last theorem. Some mathematical details, along with various interesting puzzles are presented in the text and in the appendix. Really nice if you want to know what this theorem is about, and why so important it became, without digging into the mathematical details.

Fermat a-t-il démontré son grand théorème ? L’hypothèse « Pascal » (in French) by Laurent Hua and Jean Rousseau

This book is split in two parts: the first one is purely historical and does not require to know much about mathematics. The authors present the possibility that Fermat may have proved his theorem. Though the consensus seems to claim the opposite, proving than Fermat did not prove his theorem has never been done!

The first part is a thorough analysis of Fermat’s correspondance with other mathematicians of his time and it does become moving, especially towards the end. This analysis is so detailed that you even get a sense of Fermat’s sense of humor in some of the letters presented!

The second one is more mathematical, but a high school student could understand most of it. It gives what could be the starting point of a proof with the tools known by Fermat at his time. Unfortunately, it does not go too far, but the approach is interesting!

Le théorème de Fermat : son histoire by E. Nogues

Written at the beginning of the 20th century. It consists in translatiobns (in French) of all important articles regarding Fermat’s last theorem at this time. It is particularly intersting if you want to dig into the details of the history of the theorem.

Extra readings

In some of the books, you will need (on top of usual algebra, basic number theory and analysis tools) to know more advanced topics, such as Galois theory. To that end Galois Theory Through Exercises by Juliusz Brzeziński was my best read on this topic. It has a chapter completely dedicated to cyclotomic fields, which are an object commonly used in the books above!

Number Theory 1: Fermat’s Dream, by Kazuya Kato, Nobushige Kurokawa, Takeshi Saito this is actually the book that drove me into the study of Fermat’s Last Theorem. It is really well written, with a lot of figures, exercises and corrections.

Not read yet

The books below are my to read list. If you have any opinion on them, let me know in the comments!

Three Lectures on Fermat’s Last Theorem by Louis Joel Mordell

Mordell is a name that I now met many times! I want to read it mostly for its historical value, and to know what he had to say about this theorem, a century and a half (almost) ago!

A Course in Arithmetic by Jean-Pierre Serre

Please note that all the links above are affiliate links. However, having read these books, I am confident about the quality of my recommendations!

Benchmark Fossil Demand Forecasting Challenge

Introduction

Zindi is hosting the Fossil Demand Forecasting Challenge, where competitors have to predict the amount of units sold for various products.

Note that the rules state that the metric to optimize is not is usual squared error, but instead, the absolute error:

The evaluation metric for this challenge is Mean Absolute Error.

All the models relying on the minimization of least squares (usual regressions, random forests with default parameters) are likely to perform poorly since they will return the mean over subsambles, while minimizing the absolute error returns the mean of the sample.

In a mathematical language:

\[\arg\min_x \sum_{j=1}^n (x_j - x)^2 = \bar{x},\] \[\arg\min_x \sum_{j=1}^n |x_j - x| = \mathrm{med}(x_1, \dots, x_j)\]

A simple benchmark

With that knowledge, the benchmark below simply returns, for each product, the median of units sold over the year 2021. The score should be around 192xxx

import numpy as np
import pandas as pd
import random
random.seed(0)
np.random.seed(0)


train = pd.read_csv("../raw_data/Train.csv")
sku_names = train["sku_name"].unique()
train["year_month"] = train["year"].astype(
    str) + "/" + train["month"].astype(str)
train["date"] = pd.to_datetime(train["year_month"])
train_recent = train[train["date"] >= "2021/01"]


medians = train_recent.groupby("sku_name")["sellin"].median().to_dict()

test = pd.read_csv("../raw_data/Test.csv")
sku_names_test = test["sku_name"].unique()

missing = {}
for sku_name_test in sku_names_test:
    missing[sku_name_test] = 0


test["Target"] = test["sku_name"].replace(medians).replace(missing).astype(int)

test["Item_ID"] = test["sku_name"] + "_" + \
    test["month"].astype(str) + "_" + test["year"].astype(str)
test[["Item_ID", "Target"]].to_csv("./submission_.csv", index=False)
Minimize L1 penalty with (univariate) linear regression

Introduction

Most of the regression problems I dealt with focused on minimizing the L2 norm of the difference between the predictions of a model and the true value. However, from a business point of view, it may make more sense to minimize directly the L1 norm.

Indeed, using a least square approach will mainly penalize large errors. But if the cost of an error is constant (once again, from the business perspective), then, the L1 penalty makes sense as well.

However, note that the estimator may be vastly different, if we want to use a constant model, the values of the intercept differ, one being the mean, the other one, the median.

More formally:

\[\arg\min_x \sum_{j=1}^n (x_j - x)^2 = \bar{x},\] \[\arg\min_x \sum_{j=1}^n |x_j - x| = \mathrm{med}(x_1, \dots, x_j),\]

So far, so good. However, things become complex quite quickly: a theoretical advantage of L2 penalty is that it makes the penalty differientiable, and we enjoy many closed formulas for various problems relying on L2 penatlies.

\[\hat{\beta} = \underset{\beta} {\arg \min}\,L\left(D, \beta\right) = \underset{\beta}\arg \min \sum_{i=1}^{n} \left(\beta \cdot x_i - y_i\right)^2\] \[\hat{\beta} = \left(X^\textsf{T}X\right)^{-1}X^\textsf{T}Y\]

Algorithm

I will focus on implementing the univariate case, without intercept. Including intercept or multivariate case relies on a (much) more complex optimization.

\[\hat{b} = \arg\min_x \sum_{j=1}^n |y_j - b \cdot x_j|\]

Calling \(L(b) = \sum_{j=1}^n |y_j - b \cdot x_j|\) we note that \(L\) is a convex function, whose derivative is not continuous.

It derivative, where it is defined, is:

\[l(b) = - \sum_{j=1}^n x_j \cdot \mathbb{sgn} (y_j - b \cdot x_j)\]

Given that \(L\) is convex, \(l\) must be monotonic. Besides \(l(-\infty) = - \sum_{j=1}^n  |x_j| < 0\) and \(l(+\infty) = \sum_{j=1}^n  |x_j| > 0\).

Therefore, we will look for a zero of \(l\) using dichotomy.

Implementation

As detailed above, one we have the penalty and the dichotomy algorithm implemented, there is nothing else to do:

import numpy as np

def dichotomy(n, f, l, r):

    c = (l + r) / 2
    if n ==0:
        return c

    else:
        if f(c) < 0 :
            return dichotomy(n-1, f, c, r)
        else:
            return dichotomy(n-1, f, l, c)


def penalty(x, y, b):
    return -np.sum(x * np.sign(y - b * x))


class L1Regressor:

    def __init__(self, n_iterations=20):
        self.n_iterations = n_iterations
        self.b = None

    def fit(self, x, y):
        ratios = y / x
        l, r = np.min(ratios), np.max(ratios)
        self.b = dichotomy(self.n_iterations, lambda b: penalty(x, y, b), l, r)
        return self

Tests

If we append the following:

if __name__ == "__main__":

    import matplotlib.pyplot as plt

    x = np.random.rand(100)
    y = x * 0.5 + 0.1 * np.random.normal(size=100)

    slope = L1Regressor().fit(x, y).b

    plt.scatter(x, y)
    plt.plot(x, x*slope, c = 'r')
    plt.savefig("./illustration.svg")

We obtain:

L1 regression

Comparison with L2

We can add some noise to the observations and we expect to have a more robust regression with L1 penalty.

Below are plotted the two slopes: in red, the L1 penalty is minimized, in green, the L2 penalty.

L1 regression vs L2

    from sklearn.linear_model import LinearRegression

    x = np.random.rand(100)
    y = x * 0.5 + 0.1 * np.random.normal(size=100)

    x[:10] = 0.9 + 0.1 * np.random.rand(10)
    y[:10] = 2 + x[:10] * 0.5 + np.random.normal(size=10)

    slope = L1Regressor().fit(x, y).b
    slopel2 = LinearRegression(fit_intercept=False).fit(x.reshape(-1,1), y).coef_[0]

    plt.scatter(x, y)
    plt.plot(x, x*slope, c = 'r')
    plt.plot(x, x*slopel2, c = 'g')
    plt.savefig("./illustration_noise.svg")
Random number generation in Cython

Problem

In one of my programs, I had to perform (a lot of) random sampling from Python lists. So much that it ended up being my bottleneck.

Without going in too much details, the function was mostly generating random numbers and accessing elements in lists. I gave it a simple Cython try with something along these lines:

import random
import cython

@cython.boundscheck(False)
def sample(int n_sampling, l):
    a = []
    for _ in range(n_sampling):
        a.append(l[0])
    return a

@cython.boundscheck(False)
def rando(int n_sampling, l):
    a = []
    for _ in range(n_sampling):
        a.append(random.randrange(3))
    return a

Cython usually needs the following setup code:

from setuptools import setup
from Cython.Build import cythonize

setup(
    ext_modules = cythonize("fast_sampler.pyx")
)

And the following command will build the code:

python setup.py build_ext --inplace

The timing results:

python -mtimeit -s"import fast_sampler" "fast_sampler.sample(10000,[1,2,3])"
5000 loops, best of 5: 61.8 usec per loop

Accessing elements in a loop seems quick enough.

python -mtimeit -s"import fast_sampler" "fast_sampler.rando(10000,[1,2,3])"
50 loops, best of 5: 5.77 msec per loop

However, the calls to random.randrange() seem to be the bottleneck.

If we add this cimport statement, we can directly call rand()

from libc.stdlib cimport rand

@cython.boundscheck(False)
def rando_c(int n_sampling, l):

    a = []
    for _ in range(n_sampling):
        a.append(rand() % 3)
    return a

And finally:

python -mtimeit -s"import fast_sampler" "fast_sampler.rando_c(10000,[1,2,3])"
2000 loops, best of 5: 104 usec per loop

Which brings a 50x speedup!

What about the seed ?

Usually, it is a good practice to add these lines in a Python code when dealing with random number generation (to ensure reproducibility):

random.seed(0)
np.random.seed(0)

By default, rand() always returns the same numbers, as long as you do not call srand() before, so you do not have to worry about them any more ! (At least, not in this part of your code).