Packages for machine learning

I hope to add more packages and more information to this list from time to time. If you have specific questions about a package, or some recommendations, feel free to leave a comment and I will have a look!

Machine learning covers so many possible tasks, and comes with so many packages and tools, that it is hard to know which one to use. I am just listing the ones I find really useful. They are not necessarily better than the packages I do not use, and I cannot guarantee they are absolutely bug free, but they are robust enough to work with!

Linux tools

A good terminal will be your best friend.

for filename in ./pdfs/*.pdf; do
  if [ "$i" -gt 20 ]; then
  echo "Processing $filename file..." -m 2 "$filename" >> "txts/$(basename "$filename" .pdf).txt"
How amazing is that?


tabview is probably the best tool to navigate through a CSV file in the terminal. It is really light and fast, and supports many Vim commands. Here is the repo.


Install csvkit from pip. It comes with a lot of handy tools:

  • csvstat

  • csvlook, though I prefer tabview. Redirecting the output of csvlook my_data.csv to a file lets you display a CSV file as a Markdown table.

Combined with head, you can navigate through various files really fast. There actually is a whole website dedicated to this.


This allows you to see how busy your machine is when running an algorithm.

PDF to text files

Useful for extracting text from PDF files. There is no perfect solution (as far as I know) for this task, but this one is a good starting point.

Tesseract (and ImageMagick)

Another approach to extracting text from PDF files is OCR (Optical Character Recognition). Tesseract does a great job, but feeding it a PDF directly can lead to errors. ImageMagick, however, does a great job at turning PDFs into PNGs.

echo "Processing $filename file..."
convert -density 300 "$filename[0-1]" -quality 90 "output.png"
tesseract -l fra "output-0.png" "output-0"
tesseract -l fra "output-1.png" "output-1"
cat "output-0.txt" "output-1.txt" > "ocr$(basename "$filename" .pdf).txt"
rm "output-0.txt" "output-1.txt" "output-0.png" "output-1.png"
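The same two-step idea (rasterise the pages with ImageMagick, then OCR each page with Tesseract) can also be driven from Python. This is only a sketch: the helper names `ocr_commands` and `ocr_output_name`, their parameters and the sample path are mine, and it assumes the `convert` and `tesseract` binaries used above are on the PATH. It only builds the command lines; pass each one to `subprocess.run` to actually execute them.

```python
from pathlib import Path


def ocr_commands(pdf_path, pages=2, lang="fra", stem="output"):
    """Build the ImageMagick and Tesseract commands for the first `pages` pages.

    Hypothetical helper mirroring the shell snippet above; nothing is executed here.
    """
    if pages < 2:
        # With a single page, convert writes "<stem>.png" instead of "<stem>-0.png",
        # so the file names built below would not match.
        raise ValueError("this sketch assumes at least two pages")
    commands = [[
        "convert", "-density", "300", f"{pdf_path}[0-{pages - 1}]",
        "-quality", "90", f"{stem}.png",
    ]]
    for page in range(pages):
        # tesseract appends ".txt" to the output base name on its own
        commands.append(
            ["tesseract", "-l", lang, f"{stem}-{page}.png", f"{stem}-{page}"]
        )
    return commands


def ocr_output_name(pdf_path):
    """Name of the merged text file, as in the shell version: ocr<basename>.txt."""
    return f"ocr{Path(pdf_path).stem}.txt"


if __name__ == "__main__":
    # "pdfs/report.pdf" is a made-up path for illustration.
    for cmd in ocr_commands("pdfs/report.pdf"):
        print(" ".join(cmd))
    print(ocr_output_name("pdfs/report.pdf"))
```

Keeping the commands as lists (rather than one shell string) avoids quoting problems with file names that contain spaces.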


Whenever I install R, and it has happened to me many times, I love running the following script. It feels like coming home. I will not go through the details of each package; it is just useful for me to have this code around :)

  lib.loc <- l

    if (length(which(installed.packages(lib.loc=lib.loc)[,1]==libT))==0)
      install.packages(libT, lib=lib.loc,repos='')

data_reading_libraries <- c(

machine_learning_libraries <- c(

data_libraries <- c(

string_libraries <- c(

plot_libraries <- c(

favorite_libs <- c(data_reading_libraries,

  for(lib in favorite_libs){load.lib(lib)}

General stuff

Reading data


If you have been using R's default csv reader, read.csv, you must be familiar with its drawbacks: it is slow, has many parameters, and its parser sometimes fails… readr, on the other hand, is super fast and robust, and comes with read_csv and read_csv2, depending on the csv standard your file relies on. (The good thing about standards being that there are always many versions of them…)
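The two "standards" differ in the field separator and the decimal mark: read_csv expects commas and decimal points, while read_csv2 expects semicolons and decimal commas. Here is a minimal Python sketch of the same distinction, using only the standard library. The helper name `read_rows` and the sample data are made up for illustration; this is not how readr works internally.

```python
import csv
import io


def read_rows(text, european=False):
    """Parse CSV text into a list of dicts, converting numeric fields to float.

    european=False: ',' separator and '.' decimal mark (like read_csv).
    european=True:  ';' separator and ',' decimal mark (like read_csv2).
    """
    delimiter = ";" if european else ","
    rows = []
    for row in csv.DictReader(io.StringIO(text), delimiter=delimiter):
        parsed = {}
        for key, value in row.items():
            candidate = value.replace(",", ".") if european else value
            try:
                parsed[key] = float(candidate)
            except ValueError:
                parsed[key] = value  # not a number: keep the raw string
        rows.append(parsed)
    return rows


# Made-up sample data: the same table in both conventions.
us_text = "name,price\napple,1.5\n"
eu_text = "name;price\napple;1,5\n"

print(read_rows(us_text))
print(read_rows(eu_text, european=True))
```

Both calls yield the same parsed rows, which is the point: the convention has to be chosen by the reader, since the file itself does not declare it.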


It lets you read XML files (obviously) but also HTML tables (yes, some people actually use these to transfer data, though all the HTML tags make the file much bigger…)
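To give an idea of what reading an HTML table involves, here is a small Python sketch based on the standard library's `html.parser`. The class name `TableParser` and the sample document are hypothetical, and it deliberately ignores nested tables and `colspan`/`rowspan`, which a real package would have to handle.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the cell text of every <tr> in an HTML document as lists of strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a <tr>
        self._cell = None  # text fragments of the current cell, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left open by malformed markup


# Made-up sample table.
html = """
<table>
  <tr><th>name</th><th>score</th></tr>
  <tr><td>alice</td><td>3</td></tr>
</table>
"""

parser = TableParser()
parser.feed(html)
header, *records = parser.rows
print(header)
print(records)
```

Note how much markup surrounds two values per row, which is exactly why such files end up so large.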

Machine learning libraries


A library for fitting elastic net regressions. Its cross-validation method exploits nice properties of the optimization path, so that a whole path of solutions can be evaluated about as fast as a single model.


The standard if you want to use random forests with R. Link to the CRAN page.


I tried its “competitor” (kernlab), but preferred this one.


Wrapper for the C++ implementation of Barnes-Hut t-Distributed Stochastic Neighbor Embedding. It was the fastest t-SNE implementation when I tried them.

Data viz


Visualizing linear regressions is now simple.



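N <- 10
P <- 3

# Simulated design matrix: N observations, P variables
X <- matrix(rnorm(n = N*P), nrow = N)

# True coefficients
w <- rnorm(P)

# Response: linear signal plus one noise term per observation
Y <- X %*% w + rnorm(N)

my_data <- data.frame(Y, X)
colnames(my_data) <- c("Y", paste0("Var", 1:P))
model <- lm(Y ~ ., data = my_data)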



A correlation matrix can be quite ugly. This one just makes it easier to read, with colors…


Wouldn’t it be great to have something that tells you a little bit more about your random forest models? This package can.


General stuff


tqdm is one of the most useful packages I have discovered. Though it does not actually perform any operation or handle your dataframes smartly, it shows a progress bar for the loops you want it to. Still not convinced? With it, you can keep track of every feature engineering job you launch and see which ones are never going to end, without the pain of writing all these bars yourself.
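A minimal sketch of what this looks like in practice. The function `slow_square` is a made-up stand-in for a real feature engineering step, and the `try`/`except` fallback is my own addition so that the snippet still runs (without a bar) if tqdm is not installed.

```python
import time

try:
    from tqdm import tqdm  # third-party: pip install tqdm
except ImportError:
    # Fallback so the sketch still runs without tqdm: no bar, same iteration.
    def tqdm(iterable, **kwargs):
        return iterable


def slow_square(x):
    """Stand-in for an expensive feature engineering step (hypothetical)."""
    time.sleep(0.01)
    return x * x


values = range(50)
# Wrapping the iterable is all it takes; `desc` labels the bar.
squares = [slow_square(v) for v in tqdm(values, desc="squaring")]
print(sum(squares))
```

The loop body does not change at all, which is why it is so cheap to add to existing scripts.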


The industry standard for dataframes.


The industry standard for numeric operations.


Easy manipulation of csv files. The DictReader class is particularly useful when one needs to stream from a csv file.
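As an illustration of the streaming use case, here is a short sketch that aggregates a column while reading one row at a time. The function name `total_by_key`, the column names and the sample data are made up; an `io.StringIO` stands in for a real open file.

```python
import csv
import io


def total_by_key(lines, key, value):
    """Stream rows from an iterable of CSV lines and sum `value` per `key`.

    Only one row is held in memory at a time, so `lines` can be a huge open file.
    """
    totals = {}
    for row in csv.DictReader(lines):
        totals[row[key]] = totals.get(row[key], 0.0) + float(row[value])
    return totals


# io.StringIO stands in for open("my_data.csv", newline="") in this sketch.
sample = io.StringIO("user,amount\nann,2.5\nbob,1.0\nann,0.5\n")
print(total_by_key(sample, key="user", value="amount"))
```

Each row comes back as a dict keyed by the header line, so the code refers to columns by name rather than by position.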


import unicodecsv as csv

Solves so many issues.

Machine learning libraries


A collection of robust and well-optimized methods. A must-have.

xgboost, catboost, lightgbm

Libraries dedicated to gradient boosting. Amazing performance, amazingly robust.
