# Best casual readings in mathematics

If you are reading this blog, you probably have a science degree ;) after graduating the opportunities to use mathematics in the day job are often limited to a small subset of what was learnt. The books below, varying in difficulty, are an occasion to practice or to have “mathematical recreations”.

I try to rate the level of technicality of the book *** are the math books you would expect to read to prepare for a master’s degree exam, ** still require a pencil and a paper to explore the details and * or less contain very few technicalities.

## Miscellanous

These are among my favorites: they require very little knowledge about any specific topic in mathematics and are pure moments of cleverness, smart arguments and nice illustrations.

(**) Proofs from the book also available in PDF. When Erdos proved something and the proof looked clumsy to him, he was saying “This is not the proof from the book” or “Let’s look for the proof of the book”. And indeed, when proving statements, some proofs seem more natural than others: usually, the shortest and most convincing ones. This book is an attempt - a successful one - to gather them. The infinity of prime numbers, D’Alembert Gauss theorem, the Law of quadratic reciprocity are just a few examples of all the results presented in this book.

Some proofs may be profitable to the reading of Number Theory 1: Fermat’s Dream.

(**) The Art of Mathematics: Coffee Time in Memphis more than a hundred exercises, with hints and detailed solutions. The difficulty varies greatly from an exercise to another and the solutions come from many different fields. Ideal for long trips ;)

## Algebra an number theory

If you like number theory, the following books are a must have. The first volume is easily accessible, however, the following ones will require a working knowledge of Galois theory. I love the way authors present the intuitions behind the proofs and the main steps to go through before actually “jumping” into the proof.

(**) Number Theory 1: Fermat’s Dream

(***) Number Theory 2: Introduction to Class Field Theory If the previous book presented some facts that looked “magic”, this one focuses on explaining why they happen. It is much harder than the previous read. I would strongly advise this read to those who loved the previous one.

(***) Number Theory 3: Iwasawa Theory and Modular Forms The same advice applies ;)

If you do not know about Galois theory, these two references may help. They do not qualify as “casual readings” but they will help understand the “Number theory saga”.

(**) Galois Theory for Beginners: A Historical Perspective

(***) Galois Theory by Emil Artin Though this course is quite old, the book gives a clear presentation of the topic

## Applications of mathematics in real life

### Music

(**) Music: A Mathematical Offering by Dave Benson covers many topics about the interplay between mathematics and music.

### Finance

(*) A Practitioner’s Guide to Asset Allocation by William Kinlaw, Mark P. Kritzman, David Turkington. The mathematical details are very light in this book, the focus is put on the models, the history and controversies around models and some actual data about typical correlations, returns of asset classes (this is scarcer than one would think for a book of finance!). If you want to invest by yourself and know about how to diversify, this is a very good starting point.

(**) Portfolio Optimization and Performance Analysis by Jean-Luc Prigent. If you liked the previous book and want to dig (much deeper) in portfolio optimization, this book is a detailed analysis of the existing models.

## History

(*) Euler: The Master of Us All great book about the works of Euler, as the title indicates. The chapter are organized according to the branches of mathematics Euler contributed to (all of them, at his time), and the proofs are the proofs he presented at the time.

Théorème vivant (in French) this one is hard to describe. Do not expect to understand precisely the contents of Cedric Villani’s work by reading this book. Likewise, the equations come with few explanations. It is more like a diary of a researcher.

# LightGBM on the GPU

LightGBM is currently one of the best implementations of gradient boosting. I will not go in the details of this library in this post, but it is the fastest and most accurate way to train gradient boosting algorithms. And it has a GPU support. Installing something for the GPU is often tedious… Let’s try it!

# Setting up LightGBM with your GPU

I will assume a nVidia GPU. I personnally have a GeForce GTX 745, with the Driver Version: 410.48. If you do not have a GPU already, be careful in the model you chose. When buying a GPU, you have to make sure the “compute capability” is high enough with respect to the software you plan to use. Per example, rapids.ai needs at leats a NVIDIA Pascal™ GPU or better with compute capability 6.0+. I am not going to discuss about rapids.ai in this post, but if you plan to install LightGBM on your GPU, you will soon enough want to play with rapids.ai as well.

Your simplest choice is probably : GTX 1660 Ti which was released in february 2019 and has a compute capability of 7.5

Among older GPUs which have a compute capability of 6+, the prices change quite often but you could make a good deal below.

But keep in mind that with these cards, the support may be abandoned soon enough…

## Test if LightGBM supports GPU

If you can run the following python script:

from lightgbm import LGBMClassifier
from sklearn.datasets import make_moons

model = LGBMClassifier(boosting_type='gbdt', num_leaves=31, max_depth=- 1, learning_rate=0.1, n_estimators=300, device = "gpu")

train, label = make_moons(n_samples=300000, shuffle=True, noise=0.3, random_state=None)

model.fit(train, label)


Without this message:

[LightGBM] [Fatal] GPU Tree Learner was not enabled in this build.
Please recompile with CMake option -DUSE_GPU=1
Traceback (most recent call last):
File "seq.py", line 11, in <module>
model.fit(train, label)
File "/home/kerneltrip/anaconda3/lib/python3.7/site-packages/lightgbm/sklearn.py", line 800, in fit
callbacks=callbacks)
File "/home/kerneltrip/anaconda3/lib/python3.7/site-packages/lightgbm/sklearn.py", line 595, in fit
callbacks=callbacks)
File "/home/kerneltrip/anaconda3/lib/python3.7/site-packages/lightgbm/engine.py", line 228, in train
booster = Booster(params=params, train_set=train_set)
File "/home/kerneltrip/anaconda3/lib/python3.7/site-packages/lightgbm/basic.py", line 1666, in __init__
ctypes.byref(self.handle)))
File "/home/kerneltrip/anaconda3/lib/python3.7/site-packages/lightgbm/basic.py", line 47, in _safe_call
raise LightGBMError(decode_string(_LIB.LGBM_GetLastError()))
lightgbm.basic.LightGBMError: GPU Tree Learner was not enabled in this build.
Please recompile with CMake option -DUSE_GPU=1


Then, you do not need this tutorial ;)

## Setup guide

Though there is some information here, following the instructions did not do the job for me (hence this detailed guide).

### GPU drivers

First, you need to have you drivers set up.

sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt update


You may find your device and the drivers using:

:~$ubuntu-drivers devices == /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0 == vendor : NVIDIA Corporation [...] model : GM107 [GeForce GTX 745] driver : nvidia-340 - distro non-free driver : nvidia-384 - distro non-free driver : nvidia-410 - third-party non-free recommended driver : nvidia-396 - third-party non-free [...]  And then, you can run the following, where 410 replaces the recommended version of the driver: sudo apt-get update sudo apt-get install --no-install-recommends nvidia-410 sudo apt-get install --no-install-recommends nvidia-opencl-icd-410 nvidia-opencl-dev opencl-headers  ### LGBM dependencies The officials instructions are the following, first the prerequisites: sudo apt-get install --no-install-recommends git cmake build-essential libboost-dev libboost-system-dev libboost-filesystem-dev  (For some reason, I was still missing Boost elements as we will see later) ### Building LGBM for the GPU Time to download LightGBM git clone --recursive https://github.com/microsoft/LightGBM cd LightGBM mkdir build ; cd build  Let’s try: cmake -DUSE_GPU=1 ..  Unfortunately, this does not work. CMake Error at /usr/local/share/cmake-3.17/Modules/FindPackageHandleStandardArgs.cmake:164 (message): Could NOT find OpenCL (missing: OpenCL_LIBRARY OpenCL_INCLUDE_DIR)  The official instructions helped, but : cmake -DUSE_GPU=1 -DOpenCL_LIBRARY=/usr/local/cuda/lib64/libOpenCL.so -DOpenCL_INCLUDE_DIR=/usr/local/cuda/include/ ..  Still does not work. Now the errors about OpenCL are replaced with others related to Boost : CMake Error at /usr/local/share/cmake-3.17/Modules/FindPackageHandleStandardArgs.cmake:164 (message): Could NOT find Boost (missing: Boost_INCLUDE_DIR filesystem system) (Required is at least version "1.56.0") Call Stack (most recent call first): /usr/local/share/cmake-3.17/Modules/FindPackageHandleStandardArgs.cmake:445 (_FPHSA_FAILURE_MESSAGE) /usr/local/share/cmake-3.17/Modules/FindBoost.cmake:2145 (find_package_handle_standard_args) CMakeLists.txt:121 (find_package)  This might be overkill but: sudo apt-get install libboost-all-dev  Did the job. The following additional packages will be installed: icu-devtools libboost-atomic-dev libboost-atomic1.58-dev libboost-atomic1.58.0 libboost-chrono-dev libboost-chrono1.58-dev libboost-chrono1.58.0 libboost-context-dev libboost-context1.58-dev libboost-context1.58.0 libboost-coroutine-dev libboost-coroutine1.58-dev libboost-coroutine1.58.0 libboost-date-time-dev libboost-date-time1.58-dev libboost-dev [...]  Finally: cmake -DUSE_GPU=1 -DOpenCL_LIBRARY=/usr/local/cuda/lib64/libOpenCL.so -DOpenCL_INCLUDE_DIR=/usr/local/cuda/include/ ..  Outputs: /cuda/lib64/libOpenCL.so -DOpenCL_INCLUDE_DIR=/usr/local/cuda/include/ .. -- OpenCL include directory: /usr/local/cuda/include -- Found Boost: /usr/include (found suitable version "1.58.0", minimum required is "1.56.0") found components: filesystem system -- Performing Test MM_PREFETCH -- Performing Test MM_PREFETCH - Success -- Using _mm_prefetch -- Performing Test MM_MALLOC -- Performing Test MM_MALLOC - Success -- Using _mm_malloc -- Configuring done -- Generating done -- Build files have been written to: /home/kerneltrip/Codes/LightGBM/build  And I can run: make -j$(nproc)


Should look like this :

Scanning dependencies of target lightgbm
Scanning dependencies of target _lightgbm
[  3%] Building CXX object CMakeFiles/_lightgbm.dir/src/boosting/gbdt_model_text.cpp.o
[  3%] Building CXX object CMakeFiles/lightgbm.dir/src/main.cpp.o
[  4%] Building CXX object CMakeFiles/lightgbm.dir/src/application/application.cpp.o
[...]
[ 98%] Built target _lightgbm
[100%] Built target lightgbm


Everything was successfully built! Time to set up the python. The repo should look like this on your machine :

~/Codes/LightGBM$tree -d -L 1 . ├── build ├── compute ├── docker ├── docs ├── examples ├── helpers ├── include ├── pmml ├── python-package ├── R-package ├── src ├── swig ├── tests └── windows  The official instructions recommend the following operations, but I would not recommend them. sudo apt-get -y install python-pip sudo -H pip install setuptools numpy scipy scikit-learn -U cd python-package/ sudo python setup.py install --precompile cd ..  install python-pip has the habit of conflicting with the pip that you may have. Instead, install the missing packages step by step. In my case, I use conda and I was only missing setuptools. conda install setuptools python setup.py install --precompile  Did the job! Now… from lightgbm import LGBMClassifier from sklearn.datasets import make_moons model = LGBMClassifier(boosting_type='gbdt', num_leaves=31, max_depth=- 1, learning_rate=0.1, n_estimators=300, device = "gpu") train, label = make_moons(n_samples=300000, shuffle=True, noise=0.3, random_state=None) model.fit(train, label)  Run without any issues ! You can observe the GPU usage with glances[gpu] (this will be fast though) ## An error message that did not appear on CPU: Unfortunately, you may find this error message with the GPU (on some datasets), which you did not have on the CPU :(  raise LightGBMError(decode_string(_LIB.LGBM_GetLastError())) lightgbm.basic.LightGBMError: Check failed: (best_split_info.left_count) > (0) at /home/kerneltrip/Codes/LightGBM/src/treelearner/serial_tree_learner.cpp, line 613 .  The most recent information I could get is : https://github.com/microsoft/LightGBM/issues/2742 Apparently, the issue happened on the CPU, then it was fixed, but not on the GPU version. This issue has only been raised some hours ago, let’s hope it will be fixed soon enough. # Using postMessage between iframes There were plenty of resources regarding the use of postMessage here and there, however, none focused on reproducing the bugs locally. Just “copy this line in the iframe, this line in the page hosting the iframe and everything will work.” I had the same issue and wanted to fix it and make sure it was running without long build, deploys etc Here we will start from scratch, using python to serve pages locally. First we will serve two pages on the same port, then ensure postMessage between these pages works, break it with serving it on different ports and finally fix the iframe communication. Note that I do not focus on the origin of the event checks below, but developer.mozilla.org does. # Sending data to a parent frame with postMessage ## Step 1 : creating and serving two pages Let’s start with serving a page, containing an iframe locally, at the same address. For this you will need a web browser, and a working python distribution. In what follows, everything will be served on port 4000. index.html <!doctype html> <html lang="en"> <head> <meta charset="utf-8"> <title>I am the main page</title> <meta name="description" content="Iframe communication tutorial"> <meta name="author" content="TheKernelTrip"> </head> <body> <h1>I am the main page</h1> <iframe id="iframe" src="http://localhost:4000/iframe.html" height="600" width="800"> </iframe> </body> </html>  iframe.html <!doctype html> <html lang="en"> <head> <meta charset="utf-8"> <title>Iframe</title> <meta name="description" content="Iframe communication tutorial"> <meta name="author" content="TheKernelTrip"> </head> <body> <h1>I am the iframe</h1> <p>I should be in the site</p> </body> </html>  Side note, I remembered that serving a page locally involved a “python http.server”. The following trick should be known by everybody using terminal, if you want to avoid googling again and again ;) user@user:~$ history | grep http.se
1283  python3 -m http.server 4000
[...]


The command I was looking for was:

python3 -m http.server 4000


Opening localhost in your browser should look like…

## Step 2 : postMessage to parent frame

Let’s add some javascript in the pages so that the iframe can send data to the parent frame. Basically, what happens below is that the iframe manages to run a function located in the parent window.

You can check it by clicking the button once the pages are served.

index.html

<!doctype html>

<html lang="en">
<meta charset="utf-8">

<title>I am the main page</title>
<meta name="description" content="Iframe communication tutorial">
<meta name="author" content="TheKernelTrip">

<body>
<h1>I am the main page</h1>
<iframe id="iframe" src="http://localhost:4000/iframe.html" height="600" width="800">
</iframe>
</body>
</html>


iframe.html

<!doctype html>

<html lang="en">
<meta charset="utf-8">

<title>Iframe</title>
<meta name="description" content="Iframe communication tutorial">
<meta name="author" content="TheKernelTrip">

<body>
<h1>I am the iframe</h1>
<p>I should be in the site</p>
<button id="my-btn">Send to parent</button>
</body>
</html>

<script>
function handleButtonClick(e) {
window.parent.postMessage('message');
console.log("Button clicked in the frame");
}
</script>


## Step 2: emulating cross domain, locally

### Triggering the error

Now let’s try to serve the frame and the site on “different domains”. For this, we need to move the pages in different folders, so that it looks like:

.
├── iframe
│   └── index.html
└── index
└── index.html


We need to run two instances of http.serve, the main page will be served on port 4000 and the iframe on port 4200.

user@user:~/Codes/iframe-communication/index$python3 -m http.server 4000  and user@user:~/Codes/iframe-communication/iframe$ python3 -m http.server 4200


While the contents of the pages will be (note that the port to the iframe has to be changed):

index/index.html

<!doctype html>

<html lang="en">
<meta charset="utf-8">

<title>I am the main page</title>
<meta name="description" content="Iframe communication tutorial">
<meta name="author" content="TheKernelTrip">

<body>
<h1>I am the main page</h1>
<iframe id="iframe" src="http://localhost:4200" height="600" width="800">
</iframe>
</body>
</html>

<script>
}, false);
</script>


iframe/index.html

<!doctype html>

<html lang="en">
<meta charset="utf-8">

<title>Iframe</title>
<meta name="description" content="Iframe communication tutorial">
<meta name="author" content="TheKernelTrip">

<body>
<h1>I am the iframe</h1>
<p>I should be in the site</p>
<button id="my-btn">Send to parent</button>
</body>
</html>

<script>
function handleButtonClick(e) {
window.parent.postMessage('message');
console.log("Button clicked in the frame");
}
</script>


Clicking on the button, you should see the following error in your console:

(index):24 Failed to execute 'postMessage' on 'DOMWindow':
The target origin provided ('null') does not match the recipient
window's origin ('http://localhost:4000').


### Fixing the error

The message is clear enough, there is a missing argument : the iframe should now the address of the parent.

Writing the following:

index/index.html

<!doctype html>

<html lang="en">
<meta charset="utf-8">

<title>I am the main page</title>
<meta name="description" content="Iframe communication tutorial">
<meta name="author" content="TheKernelTrip">

<body>
<h1>I am the main page</h1>
<iframe id="iframe" src="http://localhost:4200" height="600" width="800">
</iframe>
</body>
</html>

<script>
}, false);
</script>


iframe/index.html

<!doctype html>

<html lang="en">
<meta charset="utf-8">

<title>Iframe</title>
<meta name="description" content="Iframe communication tutorial">
<meta name="author" content="TheKernelTrip">

<body>
<h1>I am the iframe</h1>
<p>I should be in the site</p>
<button id="my-btn">Send to parent</button>
</body>
</html>

<script>
function handleButtonClick(e) {
window.parent.postMessage('message', "http://localhost:4000");
console.log("Button clicked in the frame");
}
</script>


# Resources in topological data analysis

Topological Data Analysis is a promising field in data analysis. Indeed, the hypothesis made on the data are much weaker than the hypothesis behind “usual” machine learning algorithms. Quarantine may be a great time for learning it ;)

This list is not exhaustive, I may have missed important resources. If so, please let me know in the comments!

## Articles

### Mathematical topic presentations

The following papers are great presentations of the topic:

## Videos

A private company, Ayasdi has published many videos on the topic. Most of them are related to real world applications of TDA.

And their youtube channel has many others.

# tSNE vs PCA

## Introduction

Both PCA and tSNE are well known methods to perform dimension reduction. The question of their difference is often asked and here, I will present various points of view: theoretical, computational and emprical to study their differences. As usual, one method is not “better” in every sense than the other, and we will see that their successes vastly depend on the dataset and that a method may preserve some features of the data, while the other do not.

## A little bit of wording and notations

PCA stands for Principal Component Analysis

whereas tSNE stands for Stochastic Neighbor Embedding, the t itself referring to the Student-t kernel.

As “usual” we will call $n$ the number of observations and $p$ the number of features.

## Theoretical aspects

### tSNE

Quoting the authors of [1]:

Stochastic Neighbor Embedding (SNE) starts by converting the high-dimensional Euclidean distances between data points into conditional probabilities that represent similarities. The similarity of data point $x_j$ to datapoint $x_i$ is the conditional probability $p_{j \vert i}$, that $x_i$ would pick $x_j$ as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centered at $x_i$. For nearby datapoints, $p_{j \vert i}$ relatively high,i whereas for widely separated datapoints, $p_{j \vert i}$ will be almost infinitesimal.

In a nutshell, this maps your data in a lower dimensional space, where a small distance between two points means that they were close in the original space.

The method goes on defining:

$$p_{j \vert i} = \frac{\exp(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\lVert x_k - x_j \rVert^2 / 2\sigma_i^2)}$$

As the probabilities in the original space, and $q$ the same probability, but in the space of represented points.

$$q_{j \vert i} = \frac{\exp(-\lVert y_i - y_j \rVert ^2)}{\sum_{k \neq i} \exp(-\lVert y_k - y_j \rVert^2)}$$

Then, the divergence between these distribution is minimized using a gradient descent over the $y$ elements:

$$C = \sum_{i,j}p_{j \vert i} \log \frac{p_{j \vert i}}{q_{j \vert i} }$$

### PCA

On the other hand, PCA does not rely on a probabilistic approach to represent the points. Indeed, the approach is more geometric (or related to linear algebra): PCA looks for axis that explains as much variance as possible of the data.

Following the wikipedia article, we can have a better understanding of how it performs dimension reduction:

The transformation $T = X W$ maps a data vector $x_i$ from an original space of $p$ variables to a new space of $p$ variables which are uncorrelated over the dataset. However, not all the principal components need to be kept. Keeping only the first $L$ principal components, produced by using only the first $L$ eigenvectors, gives the truncated transformation

$$\mathbf{T}_L = \mathbf{X} \mathbf{W}_L$$

where the matrix $T_L$ now has $n$ rows but only $L$ columns. In other words, PCA learns a linear transformation $t = W^T x, x \in R^p, t \in R^L,$ where the columns of $p × L$ matrix $W$ form an orthogonal basis for the $L$ features (the components of representation $t$) that are decorrelated. By construction, of all the transformed data matrices with only $L$ columns, this score matrix maximises the variance in the original data that has been preserved, while minimising the total squared reconstruction error $\|\mathbf{T}\mathbf{W}^T - \mathbf{T}_L\mathbf{W}^T_L\|_2^2$ or $\|\mathbf{X} - \mathbf{X}_L\|_2^2$.

What we see here is that the total squared reconstruction error is minimized. Therefore, if two points are far away in the original space, their distance will have to be important as well once they are mapped to a lower dimensional space.

In that case, this means that after a PCA, two points are far away from each other if they were far away from each other in the original data set.

## Feasability

### Computational complexity

If you are not familiar with this topic, I already wrote about it in computational complexity and machine learning.

PCA can be performed quite quickly : it consists in evaluating the covariance matrix of the data, and performing an eigen value of this matrix. Covariance matrix computation is performed in $O(p^2 n)$ operations while its eigen-value decomposition is $O(p^3)$ operations. So, the complexity of PCA is $O(p^2 n+p^3)$. Considering that $p$ is fixed, this is $O(n)$.

On the other hand, tSNE can benefit form the Barnes-Hut approximation [2], making it usable in high dimensions with a complexity of $O(n \log n)$ (more informations can be found in the article).

Therefore, it is not obvious if a method has to be preferred to another with large datasets.

### Parameters

The PCA is parameter free whereas the tSNE has many parameters, some related to the problem specification (perplexity, early_exaggeration), others related to the gradient descent part of the algorithm.

Indeed, in the theoretical part, we saw that PCA has a clear meaning once the number of axis has been set.

However, we saw that $\sigma$ appeared in the penalty function to optimize for the tSNE. Besides, as a gradient descent is required, all the usual parameters of the gradient descent will have to be specified as well (step sizes, iterations, stopping criterions…).

Looking at the implementations, the arguments of the constructors in scikit-learn are the following:

class sklearn.decomposition.PCA(n_components=None,
copy=True, whiten=False, svd_solver='auto',
tol=0.0, iterated_power='auto',
random_state=None)


The parameters mostly relate to the solvers, but the method is unique. On the other hand…

class sklearn.manifold.TSNE(n_components=2, perplexity=30.0,
early_exaggeration=12.0, learning_rate=200.0,
n_iter=1000, n_iter_without_progress=300,
random_state=None, method='barnes_hut', angle=0.5, n_jobs=None)


However, default parameters usually provide nice results. From the original paper:

The performance of SNE is fairly robust to changes in the perplexity, and typical values are between 5 and 50.

Therefore, both in terms of feasability or efforts demanded by the calibration procedure, there is no reason to prefer a method to the other.

## Empirical analysis

### Figures

The following shows the reduction from a three dimensional space to a two dimensional space using both methods. The colors of the points are preserved, so that one can “keep track” of them.

#### Gaussian blobs

Fig. 1: Gaussian blobs in three dimensions

Fig. 2: Gaussian blobs after PCA

Fig. 3: Gaussian blobs after tSNE

Both methods were successful at this task: the blobs, that were initally well separated, remain well separated in the lower dimensional space. An interesting phenomenon, which validates what the theoretical arguments predicted is that in the case of the PCA, the (light) blue and cyan points are far away from each other, whereas they appear to be closer when the tSNE is used.

#### The swiss roll

Fig. 4: Swiss roll in three dimensions

Fig. 5: Swiss roll after PCA

Fig. 6: Swiss roll after tSNE

Somehow the roll is broken by the tSNE, which is weird because one would expect the red dots to be close to the orange dots… On the other hand, a linear classifier would be more successful on the data represented with the tSNE than with the PCA.

#### Digit dataset

Fig. 7: Digits after PCA

Fig. 8: Digits after tSNE

The tSNE works amazingly well on this data set, and exhibits a neat separation of most of the digits!

#### Hand dataset

Fig. 9: Hand data

Fig. 10: Hand - PCA

Fig. 11: Hand - tSNE

Here we note that the fingers “remain together” with the tSNE.

### Other observations

Other observations could be inferred as well, per example, the size of a cluster does not mean much with the tSNE, while it has a meaning in the case of the PCA.

Fig. 12: Gaussian blobs in three dimensions

Fig. 13: Gaussian blobs after PCA

Fig. 14: Gaussian blobs after tSNE

### Code

Shall you want to explore new datasets, feel free to use the following code!

import numpy as np
import matplotlib.pyplot as plt
import mpl_toolkits.mplot3d.axes3d as p3
from sklearn.datasets import make_swiss_roll, make_blobs, load_digits
from sklearn import decomposition
from sklearn.manifold import TSNE

# Generate data (swiss roll dataset)
n_samples = 2500

def generate_swiss_roll_data(n_samples):
noise = 0.05
X, _ = make_swiss_roll(n_samples, noise)
# Make it thinner
X[:, 1] *= .5
distance_from_y_axis = X[:, 0] ** 2 + X[:, 2] ** 2
X_color = plt.cm.jet(distance_from_y_axis / np.max(distance_from_y_axis + 1))
return X, X_color, "Swiss roll"

def generate_gaussian_blobs(n_samples):
X, y = make_blobs(n_samples=n_samples, centers=5, n_features=3,random_state=3)
X_color = plt.cm.jet(y / np.max(y + 1))
return X, X_color, "Gaussian blobs"

def generate_gaussian_blobs_different(n_samples):
X, y = make_blobs(n_samples=n_samples, centers=2, n_features=3,random_state=3)
X[y == 1, :] *= 2
X_color = plt.cm.jet(y / np.max(y + 1))
return X, X_color, "Gaussian blobs different sizes"

def generate_digits(n_samples):
X, y = load_digits(n_class = 10, return_X_y = True)
X_color = plt.cm.jet(y / np.max(y + 1))
return X, X_color, "Digits"

def generate_hand(n_samples):
z_component =  X[:, 2] - X[:, 0]
X_color = plt.cm.jet(z_component / np.max(z_component + 1))
return X, X_color, "Hand"

def produce_plots(data_generator, data_generator_name):
X, X_color, title = data_generator(n_samples)
fig = plt.figure()
ax = p3.Axes3D(fig)
ax.view_init(7, -80)
ax.scatter(X[:, 0], X[:, 1], X[:, 2],
color = X_color,
s=20, edgecolor='k')
ax.set_xlabel('$x$')
ax.set_ylabel('$y$')
ax.set_zlabel('$z$')
plt.title(title)
plt.savefig(data_generator_name + '.png')

# Fit and plot PCA
X_pca = decomposition.PCA(n_components=2).fit_transform(X)
fig = plt.figure()
s = fig.add_subplot(1, 1, 1, xlabel='$x$', ylabel='$y$')
s.scatter(X_pca[:, 0], X_pca[:, 1], c = X_color)
plt.title(title + " - PCA")
fig.savefig(data_generator_name  + '_pca.png')

# Fit and plot tSNE
X_tsne = TSNE(n_components=2).fit_transform(X)
fig = plt.figure()
s = fig.add_subplot(1, 1, 1, xlabel='$x$', ylabel='$y$')
s.scatter(X_tsne[:, 0], X_tsne[:, 1], c = X_color)
plt.title(title + " - tSNE")
fig.savefig(data_generator_name + '_tsne.png')

produce_plots(generate_digits, "digits")
produce_plots(generate_hand, "hand")
produce_plots(generate_swiss_roll_data, "swiss_roll")
produce_plots(generate_gaussian_blobs, "blobs")
produce_plots(generate_gaussian_blobs_different, "blobs_diff")


## Learning more

### References

[1] van der Maaten, L.J.P.; Hinton, G.E. Visualizing High-Dimensional Data Using t-SNE. Journal of Machine Learning Research 9:2579-2605, 2008.[

[2] L.J.P. van der Maaten. Accelerating t-SNE using Tree-Based Algorithms. Journal of Machine Learning Research 15(Oct):3221-3245, 2014. https://lvdmaaten.github.io/publications/papers/JMLR_2014.pdf

The elements of statistical learning by Trevor Hastie, Robert Tibshirani, Jerome Friedman is a brilliant introduction to the topic and will help you have a better understanding of most of the PCA.

Python data science handbook

### Websites

Distill article on tSNE

# Count different elements in list - Ocaml

It is quite common to have to count the occurences of each element in a list. Python would propose the Counter class, per example, and looking around, one would soon enough find the following snippet :

from collections import Counter

words = ['a', 'b', 'c', 'a']

Counter(words).keys() # equals to list(set(words))
Counter(words).values() # counts the elements' frequency


However, OCaml lacks the abundance of snippets to find here and there. This article will focus on counting the occurences of each element in a list, using various approaches.

## A functional approach

A simple way relying on built-in List module would be to use sort_unique and count the element appearing in the list for each element returned by sort_unique.

let count_unique_elements_naive list =
let count_element e list = List.filter (fun x -> x = e) list |> List.length in
List.sort_uniq Int.compare list
|> List.map (fun e -> (e, count_element e list))


The following could be put in a count.ml file in order to test the previous snippet:

let print_int_pair_list list =
let string_of_int_pair pair = match pair with
| a,b -> (string_of_int a)^" - "^(string_of_int b)^"\n" in
List.map string_of_int_pair list
|> List.iter print_string

let count_unique_elements_naive list =
let count_element e list = List.filter (fun x -> x = e) list |> List.length in
List.sort_uniq Int.compare list
|> List.map (fun e -> (e, count_element e list))

let () =
count_unique_elements_naive [1; 1; 3; 5; 7; 7; 1]
|> print_int_pair_list


Running ocaml count.ml would output the following:

1 - 3
3 - 1
5 - 1
7 - 2


Obviously, this method makes as many passes on the list as the number of unique elements in the list, which is a waste of time.

## Using Hash tables

Here hash tables comes-in handy. Updating a counter for each element of the list, one can achieve the same results in a single pass over the data.

Note that in recent versions (>= 4.07), it is easier to turn Hashtbls to list or seq, using the to_seq functions.

let count_unique_elements_hashtbl list =
let counter = Hashtbl.create 10000 in
let update_counter x =
if Hashtbl.mem counter x then
let current_count = Hashtbl.find counter x in
Hashtbl.replace counter x (succ current_count)
else
Hashtbl.replace counter x 1
in
List.iter update_counter list;
Hashtbl.to_seq counter
|> List.of_seq


## Using integer hash tables

As stated in the documentation:

The functorial interface allows the use of specific comparison and hash functions, either for performance/security concerns, or because keys are not hashable/comparable with the polymorphic builtins.

Performing a minor replacement in the code above, we can rewrite this as:

module IntHash =
struct
type t = int
let equal i j = i=j
let hash i = i land max_int
end

(* Here we create a specialized module for hash tables whose key is an integer *)
module IntHashtbl = Hashtbl.Make(IntHash)

let count_unique_elements_int_hashtbl list =
let counter = IntHashtbl.create 10000 in
let update_counter x =
if IntHashtbl.mem counter x then
let current_count = IntHashtbl.find counter x in
IntHashtbl.replace counter x (succ current_count)
else
IntHashtbl.replace counter x 1
in
List.iter update_counter list;
IntHashtbl.to_seq counter
|> List.of_seq


## Benchmark

And the winner is count_unique_elements_int_hashtbl.

$ocamlopt main.ml$ ./a.out
count_unique_elements_naive        Execution time: 0.372140s
count_unique_elements_hashtbl      Execution time: 0.089627s
count_unique_elements_int_hashtbl  Execution time: 0.081218s


If one wants to reproduce the results:

let print_int_pair_list list =
let string_of_int_pair pair = match pair with
| a,b -> (string_of_int a)^" - "^(string_of_int b)^"\n" in
List.map string_of_int_pair list
|> List.iter print_string

let time f x =
let t = Sys.time() in
let fx = f x in
Printf.printf "Execution time: %fs\n" (Sys.time() -. t);
fx

let count_unique_elements_naive list =
let count_element e list = List.filter (fun x -> x = e) list |> List.length in
List.sort_uniq Int.compare list
|> List.map (fun e -> (e, count_element e list))

let count_unique_elements_hashtbl list =
let counter = Hashtbl.create 10000 in
let update_counter x =
if Hashtbl.mem counter x then
let current_count = Hashtbl.find counter x in
Hashtbl.replace counter x (succ current_count)
else
Hashtbl.replace counter x 1
in
List.iter update_counter list;
Hashtbl.to_seq counter
|> List.of_seq

module IntHash =
struct
type t = int
let equal i j = i=j
let hash i = i land max_int
end

(* Using the Hashtbl functorial interface *)
module IntHashtbl = Hashtbl.Make(IntHash)

let count_unique_elements_int_hashtbl list =
let counter = IntHashtbl.create 10000 in
let update_counter x =
if IntHashtbl.mem counter x then
let current_count = IntHashtbl.find counter x in
IntHashtbl.replace counter x (succ current_count)
else
IntHashtbl.replace counter x 1
in
List.iter update_counter list;
IntHashtbl.to_seq counter
|> List.of_seq

let sample = List.init 1000000 (fun _ -> Random.int 10)

let named_methods = [("count_unique_elements_naive", count_unique_elements_naive)
;("count_unique_elements_hashtbl", count_unique_elements_hashtbl)
;("count_unique_elements_int_hashtbl", count_unique_elements_int_hashtbl)]

let () =
let _ = List.map (fun pair -> print_string (fst pair); print_string "\t";time (snd pair) sample) named_methods in
()


## Books

Real World OCaml: Functional programming for the masses by Yaron Minsky, Anil Madhavapeddy and Jason Hickey is a good introduction to OCaml.

Purely Functional Data Structures by Chris Okasaki, which presents another way to look at data structures, from a functional point of view. As a prerequisite, a knowledge of common structures such as Heaps, Linked Lists is needed.

# Printing hello world in OCaml

## Why writing a tutorial ?

There are indeed plenty of good resources if one is willing to learn OCaml (OCaml tutorials may be a good start). However, the usual road one follows when starting a new language, like printing hello world is not commonly presented. This might be due to the fact that printing hello world is actually one of the least functional thing to do…

Besides, OCaml comes with a utop toplevel, different compilers: by default ocamlc and ocamlopt but plenty of others are available… Knowing where to start may not be obvious.

Here we will compile and execute a program that prints hello world.

## Setting up ocaml

The best starting point is to install opam which is to OCaml what pip is to Python. opam also enables to switch between various compilers,

Installing OPAM details the different steps depending on your distribution. Assuming you are using Ubuntu, the following should do the job:

sudo apt-get install opam


Once opam is installed, the following commands will enable you to use OCaml

# environment setup
opam init
eval opam env
# install given version of the compiler
opam switch create 4.08.0
eval opam env

# check you got what you want
which ocaml
ocaml -version


In my case:

~$ocaml -version The OCaml toplevel, version 4.08.0  ## The project The usual extension of an OCaml file is .ml. Let’s just create a dedicated folder (called hellocaml below), in which we place a main.ml file. The contents of the file main.ml will be pretty simple: let () = print_string "Hello world\n";  print_string simply prints the input string in the terminal. Note that this print function is strongly typed, you cannot write: print_string 42;  Instead, one would have to write: print_int 42;  Note that the \n has nothing magical in it. By default, language such as python will jump to the next line once executed. In OCaml, the print_string function does not do this by default. The following code would produce the same output. let () = print_string "Hello "; print_string "world\n";  For now the contents of our directory looks like: hellocaml/ └── main.ml  ## Executing the project user@:~/hellocaml$ ocaml main.ml
hello world


## Compiling the project

Ocaml has different build systems but we will not use any of them for now. We will simply use the default tools to turn our code into an executable.

### ocamlopt and ocamlc

For now, you do not need to know a lot about these two.

### Compiling the project

Typing the following command:

ocamlc main.ml


The directory now contains:

hellocaml/
├── a.out
├── main.cmi
├── main.cmo
└── main.ml


user@:~/hellocaml$./a.out hello world  # What is next ? All the cool features of the language ! Nothing about the functional aspect has been discussed # Resources ## Online Why OCaml, video. If you wonder about the points of functionnal programming, this video summarizes key aspects of OCaml. 99 problems in OCaml if you prefer to learn with exercises instead of tutorials ;) Stay tuned, I plan to write more detailed tutorials! ## Books Real World OCaml: Functional programming for the masses by Yaron Minsky, Anil Madhavapeddy and Jason Hickey is a good introduction to OCaml. Purely Functional Data Structures by Chris Okasaki, which presents another way to look at data structures, from a functional point of view. As a prerequisite, a knowledge of common structures such as Heaps, Linked Lists is needed. # Opam permissions on Manjaro ## bwrap: No permissions to creating new namespace When installing ounit, I faced the following issue: $ opam install ounit
[NOTE] It seems you have not updated your repositories for a while. Consider updating them with:
opam update

The following actions will be performed:
∗ install stdlib-shims 0.1.0 [required by ounit]
∗ install ounit        2.1.2
===== ∗ 2 =====
Do you want to continue? [Y/n] Y


A while later:

#=== ERROR while compiling stdlib-shims.0.1.0 =================================#
# context     2.0.5 | linux/x86_64 | ocaml-base-compiler.4.08.0 | https://opam.ocaml.org#225ac83b
# path        ~/.opam/408/.opam-switch/build/stdlib-shims.0.1.0
# command     ~/.opam/opam-init/hooks/sandbox.sh build dune build -p stdlib-shims -j 3
# exit-code   1
### output ###
# bwrap: No permissions to creating new namespace, likely because the kernel does not allow non-privileged user namespaces. On e.g. debian this can be enabled with 'sysctl kernel.unprivileged_userns_clone=1'.


So, it cannot create the new namespaces… One could be tempted to run it with sudo. But this is not good. Not good at all (see the last section). Instead, you have to enable non privileged user namespaces.

## The solution

Simply run:

$sudo sysctl kernel.unprivileged_userns_clone=1  Which will enable non privileged user namespaces. And now the install should just be fine. $ opam install ounit
[NOTE] It seems you have not updated your repositories for a while. Consider updating them with:
opam update

The following actions will be performed:
∗ install stdlib-shims 0.1.0 [required by ounit]
∗ install ounit        2.1.2
===== ∗ 2 =====
Do you want to continue? [Y/n] Y

<><> Gathering sources ><><><><><><><><><><><><><><><><><><><><><><><><><><><><>
[ounit.2.1.2] found in cache
[stdlib-shims.0.1.0] found in cache

<><> Processing actions <><><><><><><><><><><><><><><><><><><><><><><><><><><><>
∗ installed stdlib-shims.0.1.0
∗ installed ounit.2.1.2
Done.


## Why not using sudo ?

One could be tempted to…

$sudo opam install ounit  I did it in the first place. It worked “fine”. Running it a second time: [WARNING] Running as root is not recommended [NOTE] Package ounit is already installed (current version is 2.2.1).  So, I have the package. But, when I tried testing my program: $ ./runTests.sh
+ ocamlfind ocamldep -package oUnit2 -modules tests/randomVariableGausianTest.ml > tests/randomVariableGausianTest.ml.depends
ocamlfind: Package oUnit2' not found


Weird… I have the package, but for some reason it remains not found…

Then, running sudo opam list and opam list I obtained different lists of packages. I guess the packages installed with sudo are not installed at the right place (and therefore, cannot be called by the compiler by default).

# Context

## What is motus

Motus is a quite famous (at least in France) game where one has to guess a seven letters word based on its first letter and another letter given somewhere in the word. The player has various trials (8) to guess the word. Each word proposed must start with the same first letter as the word to guess, and the player is informed when a letter is at its place, or if a letter belongs to the word to guess, but is not in the right place.

Being quite popular, this game now has various mobile applications!

Fig. 1: Capture of "Motus, le jeu officiel France 2"

As you can see, the letters stressed in red are the one indicate the good position for the letter, and the yellow circle indicates that the letter is in the word, but currently in the wrong position.

## Why winning

Long story, but I would like a winning strategy at this game which ideally, I can perform without the help of a computer.

# A solution

## Beginner approach

For having watched the shows many times on TV, or others playing the app, a strategy used by most players seem to incrementally use the letters which you know well positionned to propose words which comply with the valid letters.

## The idea

Though the naive approach works, it feels like “sub optimal words” (i.e. words which seem far from the final word given the information we have when we propose them) can bring a lot of information, since we can craft them to be as informative as possible!

“Crafting” (the process of finding a seven letter words who share as few letter as possible with the words already suggested) is quite hard. However, learning a list of words which are as informative as possible is easy!

Therefore, the idea used here is that, in 4 guesses, one can almost know the letters present in the word to guess (4 * (7-1) = 24). Therefore, for each letter, one only needs to know four words, such that the number of different letters in these four words is maximal.

Then the second hypothesis is that when we know the letters of the words to guess, it is easy to guess it. It turned out to be harder than I thought, yet, my statistics were vastly improved!

# The implementation

## Details

It took the list of French words from this blog. The first part consists in removing the word which are not sevent characters long and getting rid of their accents.

Then, for a list of words, the coverage_score is defined, it simply is the number of different characters formed by these words.

It is shortly implemented and tested below:

def coverage_score(words):
cnt = Counter()
for word in words:
for character in word:
cnt[character] += 1
return len(list(cnt))

assert(coverage_score(['aa', 'bb']) == 2)
assert(coverage_score(['ac', 'cb']) == 3)
assert(coverage_score(['abc', 'def']) == 6)


Then, for each letter, words are sampled by four and for each sampling, the score is evaluated. This process is repeated 1 000 000 times, and the 4-uple with the highest coverage score is returned. And this is it.

def random_optimization(generator, score, n_attempts):
i = 0
best_score = 0
best_element = generator()
while i < n_attempts:
i += 1
e = generator()
score_e = score(e)
if score_e > best_score:
best_score = score_e
best_element = e
return (best_element, best_score)


## The code

from collections import Counter
import random
import string
import unidecode

def is_seven_characters_long(s):
return len(s) == 7

def remove_line_breaks(s):
return s.replace("\n", "")

fp = open(input_filepath, 'r')
words = map(remove_line_breaks, words)

seven_letter_words = [
word for word in words if is_seven_characters_long(word)]

removed_accents = map(unidecode.unidecode, seven_letter_words)
removed_dashes = filter(lambda x: not "-" in x, removed_accents)

remove_ez = filter(lambda x: not x.endswith("ez"), removed_dashes)

return list(remove_ez)

def coverage_score(words):
cnt = Counter()
for word in words:
for character in word:
cnt[character] += 1
return len(list(cnt))

assert(coverage_score(['aa', 'bb']) == 2)
assert(coverage_score(['ac', 'cb']) == 3)
assert(coverage_score(['abc', 'def']) == 6)

def random_optimization(generator, score, n_attempts):
i = 0
best_score = 0
best_element = generator()
while i < n_attempts:
i += 1
e = generator()
score_e = score(e)
if score_e > best_score:
best_score = score_e
best_element = e
return (best_element, best_score)

for x in string.ascii_lowercase:
print(x)
words_starting_with_x = list(
filter(lambda w: w.startswith(x), candidate_words))

def generator_x():
res = [random.choice(words_starting_with_x) for i in range(4)]
list.sort(res)
return res

res = random_optimization(generator_x, coverage_score, 100000)
print("(coverage : " + str(res[1]) + ")")

print("\n".join(res[0]))
print(10 * '-')


# Results

Now, I simply have to learn the 104 words listed below

File loaded!
a
(coverage : 19)
affuble
agrions
amochai
apteryx
----------
b
(coverage : 18)
bavocha
bowling
brusque
brutaux
----------
c
(coverage : 18)
candide
capeyat
choques
combler
----------
d
(coverage : 18)
deflore
dejuche
demodat
dopings
----------
e
(coverage : 19)
endurci
engavat
enzymes
explora
----------
f
(coverage : 18)
fachait
faxames
flingot
fuyards
----------
g
(coverage : 18)
giclant
glyphes
gommeux
grisbis
----------
h
(coverage : 19)
hagarde
hickory
honteux
humbles
----------
i
(coverage : 18)
ichtyol
impulse
inegaux
ivoires
----------
j
(coverage : 19)
jockeys
jouxter
jubilat
jumping
----------
k
(coverage : 20)
kamichi
kobolds
krypton
kufique
----------
l
(coverage : 18)
lambris
legende
lexique
lycopes
----------
m
(coverage : 18)
malfont
margeas
midship
muqueux
----------
n
(coverage : 18)
nasarde
neigeux
notable
nymphal
----------
o
(coverage : 19)
oblatif
ocreras
oppidum
oxygene
----------
p
(coverage : 18)
palombe
penchas
poquait
profond
----------
q
(coverage : 15)
quartat
quiches
quidams
quillon
----------
r
(coverage : 18)
rebonds
rejugee
rempile
ruchait
----------
s
(coverage : 19)
sandjak
sculpte
sombras
swingua
----------
t
(coverage : 18)
thymols
topique
trafics
trepang
----------
u
(coverage : 16)
ultimes
urbaine
urgence
uropode
----------
v
(coverage : 19)
vachard
vampent
velique
voyages
----------
w
(coverage : 18)
wergeld
whiskys
wombats
wurmien
----------
x
(coverage : 11)
xanthie
ximenie
xylenes
xylenes
----------
y
(coverage : 16)
yankees
yiddish
yogourt
ypreaux
----------
z
(coverage : 17)
zeolite
ziberas
zincage
zythums
----------


# Improvements

## Word selection

The thing is that the accepted words will mostly depend on the implementation of the game, I could not find a official dictionary for this. So the conjugated verbs would be rejected, per example (hence the line : remove_ez = filter(lambda x: not x.endswith("ez"), removed_dashes), which does not account for the full complexity of French conjugations, to say the least)

## Algorithm

The “coverage” seems quite low (around 19). I believe this can be improved using smarter ideas than a random, brute force optimization.

Perhaps some better optimizations could be performed, such as, start with a word, randomly pick among the most different words and so on so forth.

## The score itself

The definition of the coverage score does not take into account the scarcity of some letters. z is quite scarce, and it is often a waste of time to check whether it is in the word to guess.

# Introduction

Python has an amazing progress bar library tqdm. In a for loop or a map statement, tqdm allows the user to know how much time is needed for the loop to end. Though simple, this happens to be very useful when performing treatment on a dataset with hundreds of thousands of lines.

76%|████████████████████████████         | 7568/10000 [00:33<00:10, 229.00it/s]

What tdqm does

[##############-------------------------] 38.00% ending in : 3.21s

What I would be happy with

Regarding machine learning pipelines, Python does a great job. However, it lacks the power of compiler. First, pre-treatment of the data is sometimes done using functions written in pure Python, not particularly fast.

But more important, imagine the following pipeline in OCaml:

let new_text = my_text
|> Str.split
|> List.map fix_spelling
|> remove_stopwords
|> List.map vectorize
|> FloatArray.average


Maybe these functions do not really exist, but a sentence (my_text) is splitted in a list of words, the spelling of each word is fixed (List.map fix_spelling), the stopwords are removed, the words are vectorized (with something like word2vec) and at last, the list of vectorized words is averaged. It is pretty clear that we input a sentence (a string) and return a float array.

Now imagine that for some reason, the pipeline is misspelled in something like this:

let new_text = my_text
|> Str.split
|> List.map fix_spelling
|> List.map vectorize
|> remove_stopwords
|> FloatArray.average


The compiler would immediatly complain that remove_stopwords does not apply to float array list (but was expecting a string list). On the other hand, depending on the implementation, it may take a while for the same program in Pyhton to realize that some steps cannot be performed.

Therefore, I believe that OCaml could be an amazing fit for machine learning: the computing intensive libraries (random forests, SVMs…) can be implemented in another language (which is the case with scikit-learn) and the pipeline would be much more robust in OCaml.

As I love to use tqdm in Python, I wanted an OCaml equivalent.

# OCaml functors

A functor is simply a module, which accepts another module as a parameter. Let’s look at the following module signature, a finite collection. It has a map function, which transforms every element and a length function. Note that length could be inferred from map and a ref. However, modules such as Seq allow infinite collections: they have a map function, but cannot evaluate the length of a sequence.

module type FINITECOLLECTION = sig

type 'a t
val map : ('a -> 'b) -> 'a t -> 'b t
val length : 'a t -> int

end


Given this signature we can now define a module which will operate on finite collections. Without paying much attention to the contents of map, simply note that the only functions called in map are the ones specified in the signature: Collection.length and Collection.map.

(** Creates a new map function, reporting the progress in the terminal *)
module Prokope(Collection : FINITECOLLECTION) = struct

(** Applies f to a finite collection l, while reporting progress in the terminal *)
let map f l =

let open ProkopeUtils in

let starting_time = Unix.gettimeofday() in
let wrapped_f = progress_wrapper default_progress_bar (Collection.length l) starting_time f in

let res = add_indices_to_collection Collection.map l
|> Collection.map wrapped_f in

print_string "\n";
res

end



Now, we would like to be able to use this new map method on array and list types. For this, we just have to reproduce the signature of the FINITECOLLECTION. The following does it.

module UsualArray  = struct
type 'a t = 'a array
include Array
end

module Array = Prokope(UsualArray)

module UsualList  = struct
type 'a t = 'a list
include List
end

module List = Prokope(UsualList)


Now this function can simply be invoked appending Prokope before

let dummy a =
ignore (Unix.select [] [] [] 0.1);
a

let double x = 2 * x

let id x = x

let () =
let sample = List.init 50 id in
assert(List.map double sample = Prokope.List.map double sample);
let _ = Prokope.List.map dummy sample in
()


And you would see:

[#######################################] 100.00% ending in : 0.00s
[##############-------------------------] 38.00% ending in : 3.21s


# The implementation

## A couple of tricks

A couple of things to know is that OCaml may not flush to the standard output as you may expect. But this can be forced by calling

flush stdout;


Now to write a line on the actual line of the terminal, the only thing to do is to call

print_string ("\r"^(progress_bar_string config percentage)^"ending in : "^remaining_time_str^"s");


You may find more informations on \n and \r on StackOverflow though.

## The implementation

I decided to call this library Prokope. Maybe on opam someday ;)

### Prokope utils

Let’s start with the helpers needed to display and configure a progress bar.

prokopeUtils.ml

(** Various helpers for Prokope *)

(** Graphical parameters for the progress bar *)
type progress_bar_config = {
width: int;
cell_filled: string;
cell_empty: string
}

(** The width of the terminal *)
let default_width =
match Terminal_size.get_columns () with
| Some(a) -> a
| None -> 80

(** A default progress bar *)
let default_progress_bar = {
width = default_width;
cell_filled = "#";
cell_empty = "-"
}

(** For a collection 'a t returns a collection of type (int, 'a) t *)

let index = ref (-1) in

index := !index + 1;
(!index, e)
in

(** Repeats the string s n times *)
let repeat_string s n =
List.init n (fun _ -> ())
|> List.fold_left (fun e _ -> s^e) ""

(** Returns the string represented by the progress bar configuration and the percentage *)
let progress_bar_string config percentage =

let gauge_string n_steps inner_width =
let remaining_steps = inner_width - n_steps in
"["^
(repeat_string config.cell_filled n_steps)^
(repeat_string config.cell_empty remaining_steps)^
"]"
in

let inner_width = max (config.width - 40) 0 in

let n_steps = int_of_float (percentage *. (float_of_int inner_width)) in

let rounded_percentage_str = Printf.sprintf "%.2f" (100. *. percentage) in

(gauge_string n_steps inner_width)^" "^rounded_percentage_str^"% "

(** For a fonction ('a -> 'b) returns a function ( ('a, int) -> 'b )
* "aware" of its starting time and operations to perform *)
let progress_wrapper config list_length starting_time f a =

let i, x = a in

let percentage = (float_of_int i +. 1.) /. (float_of_int list_length) in

let time_taken = (Unix.gettimeofday() -. starting_time) in

let speed = (float_of_int i) /. time_taken in

let remaining_time_str = (float_of_int (list_length - i)) /. speed
|> Printf.sprintf "%.2f"
in

print_string ("\r"^(progress_bar_string config percentage)^"ending in : "^remaining_time_str^"s");

flush stdout;

f x



### The functor

Why a functor ? To enable a user to generate a progress bar for every finite collection ! Imagine that your data is stored on a binary tree for some reason. Having a map and a length function over trees is trivial (and it may also belong to your tree module). Calling Prokope(MyTree) would therefore return a new module, for which map shows a progress bar.

prokope.ml

(** User intended functor *)

(** Any collection whose elements can be counted and mapped.
Note that length has to be implemented (this disqualifies
modules such as Seq). *)
module type FINITECOLLECTION = sig

type 'a t
val map : ('a -> 'b) -> 'a t -> 'b t
val length : 'a t -> int

end

(** Creates a new map function, reporting the progress in the terminal *)
module Prokope(Collection : FINITECOLLECTION) = struct

(** Applies f to a finite collection l, while reporting progress in the terminal *)
let map f l =

let open ProkopeUtils in

let starting_time = Unix.gettimeofday() in
let wrapped_f = progress_wrapper default_progress_bar (Collection.length l) starting_time f in

let res = add_indices_to_collection Collection.map l
|> Collection.map wrapped_f in

print_string "\n";
res

end

module UsualArray  = struct
type 'a t = 'a array
include Array
end

module Array = Prokope(UsualArray)

module UsualList  = struct
type 'a t = 'a list
include List
end

module List = Prokope(UsualList)


### Testing and displaying

run.ml

let dummy a =
ignore (Unix.select [] [] [] 0.1);
a

let double x = 2 * x

let id x = x

let () =
let sample = List.init 50 id in
assert(List.map double sample = Prokope.List.map double sample);
let _ = Prokope.List.map dummy sample in
()


run.sh

ocamlfind opt -package unix,terminal_size -linkpkg -o run.byte prokopeUtils.ml prokope.ml run.ml

./run.byte
`

Hope you liked it! If you feel you may need it some day, or have improvements suggestions, please let me know, I would improve it and publish it!