# Generic cross-validation

## Definitions

### k-fold cross-validation

Cross-validation is a process that enables to estimate the out-of-sample performance of a model. There exist many types of cross-validation, but the most common method consists in splitting the training-set in $k$ “folds” ($k$ samples of approximately $n/k$ lines) and train the model $k$-times, each time over samples of $n-n/k$ points. The prediction error is then measured on the predictions of the remaining $n/k$ points. Illustration of the cross-validation. From Wikipedia.

### Leave one out cross-validation (LOOCV)

The leave one out cross-validation is a specialization of the $k$-fold cross-validation, with $k=n$. For each point, you train the model over the $n-1$ other points, predict the label (or value, in the case of a regression problem) and average the errors.

### Hyperparameters tuning

The reason to perform cross-validation is to enable the tuning of hyperparameters (or doing feature selection, but I will focus on hyperparameters tuning). They are parameters that are not directly infered from the data (this definition is quite lose). Per example, in the case of a linear regression, the coefficients of the model are directly infered. In the case of a penalized regression, the penalty parameter is not infered during the training procedure (if there is a way to select it, please let me know in the comments), and the usual way to chose the best penalty parameter is by cross validating many models, with different penalty parameters and chosing the one which achieves the best score (the highest accuracy, the lowest error…).

This is where cross-validation becomes costly : suppose you want to try $10$ values of your hyper-parameter (this can be much higher), and evaluate the performance of each hyperparameter with a $5$ fold cross-validation. This will be $10 \cdot 5 = 50$ times longer than training a single model !

Usually, there is not one single hyper parameter. In the case of the Support Vector Machines with Radial Basis Functions, there are usually two of them which are tuned $C$ and $\gamma$. Trying $10$ values for each hyperparameter now leads to $100$ models to train (which is in turn multiplied by the number of folds).

This explains the need for smarter approaches to cross-validations.

## Generic implementation

Cross-validation is a long process. And it needs to be repeated. Many times. Maybe even more when the number of parameters is large. The leave-one-out cross-validation is the longest one, as you have to train the model as many times as you have data points.

You have two options. Find the best implementation of your algorithm given your problem. Buy more computation power.

Assuming you have already achieved the first step and that the second one is not really satisfactory, there is another option. Change your cross-validation method! Though it needs some efforts (the usual cross-validation pipelines/loops have to be specialized), it will be fun!

Following the article about time complexity and enjoying the fact that, when you train a model on a specific fold during a cross-validation, you may reuse part of your calculations, I will present some tips to make this cross-validation faster.

As always, this improvement has a price : genericity. When using scikit learn, the models have similar signatures. The cross-validation procedure has to be written only once and works for every model. See per example the code below.

import pandas as pd
import numpy as np

from time import time
from sklearn import cross_validation

class Stacker:

def __init__(self, penalty, n_folds, verbose=True, random_state=1):
self._penalty = penalty
self._n_folds = n_folds
self._verbose = verbose
self._random_state = random_state

def Run(self, X, y, model, predict_method):
kf = cross_validation.KFold(
y.shape, n_folds=self._n_folds, shuffle=True, random_state=self._random_state)
trscores, cvscores, times = [], [], []
i = 0
stack_train = np.zeros(len(y))  # stacked predictions
for i, (train_fold, validation_fold) in enumerate(kf):
i = i + 1
t = time()
model.fit(X.iloc[train_fold], y.iloc[train_fold])

tr_pred = predict_method(model, X.iloc[train_fold])

trscore = self._penalty(y.iloc[train_fold], tr_pred)

validation_prediction = predict_method(
model, X.iloc[validation_fold])
cvscore = self._penalty(
y.iloc[validation_fold], validation_prediction)

trscores.append(trscore)
cvscores.append(cvscore)
times.append(time() - t)

stack_train[validation_fold] = validation_prediction

if self._verbose:
print("TRAIN %.5f | TEST %.5f | TEST-STD %5f | TIME %.2fm (1-fold)" %
(np.mean(trscores), np.mean(cvscores), np.std(cvscores), np.mean(times) / 60))
print(model.get_params(deep=True))
print("\n")

return np.mean(cvscores), stack_train


Not only it performs a cross-validation, but keeps the out-of-sample predictions for each fold. This allows to do model stacking, another topic I will probably discuss in antoher post.

As you see in this code, there is no information regarding the model needed. It will just take a model, that has a fit method, and predict it (the predict method must be passed as a function, it allows post processing of the predictions, per example).

## Linear regression

Here, the specialization works so well that one can even use the following closed formula for a leave one out cross-validation

Following the equation of a linear model : $y = X\beta + \mathbf{e}.$ it is well known that (for an OLS estimate) $\hat{\beta} = \underset{\beta}{\operatorname{argmin}} \| y-X \beta \|^2$ we have $\hat{\beta} = (X'X)^{-1}Xy$.

So we can write :

Where $\mathbf{H} = X(X'X)^{-1}X'$

If the diagonal values of $\mathbf{H}$ are denoted by $h_{1},\dots,h_{n}$, then the cross-validation statistic can be computed using:

Where $e_{i}$ is the residual obtained from fitting the model to all $n$ observations. See  for more details.

## Elastic-net

The famous library glmnet  solves the following problem, over a whole “path” (i.e. having $\lambda$ varying) while being as fast as one would normally compute the solution for a single value of $\lambda$.

The least angle regression (LARS) algorithm, on the other hand, is unique in that it solves the minimzation problem for all $\lambda$ ∈ [0,∞] […] This is possible because the lasso solution is piecewise linear with respect to $\lambda$.

The library comes with a cross-validation method supporting parallelization:

cv.glmnet(x, y, weights, offset, lambda, type.measure, nfolds, foldid, grouped, keep,
parallel, ...)


## SVMs

The same authors proposed a similar method for calculating solutions to SVMs calibrations, where a path of solution depending on $C$, the cost parameter is proposed.

An R package svmpath is available. Refering to the article:

It exploits the fact that the “hinge” loss-function is piecewise linear, and the penalty term is quadratic. This means that in the dual space, the lagrange multipliers will be pieceise linear (c.f. lars).

require("svmpath")

N <- 500
P <- 2
sigma <- 0.1

X <- matrix(rnorm(N * P), nrow = N)
Y <- 2 * (X[, 1] + X[, 2] * X[, 2] + sigma * rnorm(P) > 0) - 1

svm_model <-
svmpath(
x = X,
y = Y,
plot = F
)
plot(svm_model)

Xtest <- matrix(rnorm(N * P), nrow = N)
Ytest <-
2 * (Xtest[, 1] + Xtest[, 2] * Xtest[, 2] + sigma * rnorm(P)  > 0) - 1

pred <-
svmpath::predict.svmpath(svm_model, newx = Xtest, type = "class")

eval_error <- function(predicted) {
return(mean(abs(Ytest - predicted)))
}

errors <- apply(X = pred, MARGIN = 2, FUN = eval_error)
plot(
svm_model$lambda, errors, main = "Out Of Sample error rate", xlab = expression(lambda), ylab = "Error rate" ) Unfortunately, it has a very large memory consumption event for small data sets. ## Gradient boosting With gradient boosting, when cross validating over the number of trees, a simple observation is to note that models are trained sequentially. If $k$ models are trained over $k$ folds, each model will predict a new point based on the following equation: Now, by simply truncating the sum, one can create the sequence: For each fold, this step can be performed, so that the performance at each step can be evaluated. XGBoost implements something along these lines (with callbacks), the method can be found on their repository . cv(params, dtrain, num_boost_round=10, nfold=3, stratified=False, folds=None, metrics=(), obj=None, feval=None, maximize=False, early_stopping_rounds=None, fpreproc=None, as_pandas=True, verbose_eval=None, show_stdv=True, seed=0, callbacks=None, shuffle=True)  # A generic framework All this is nice, but it means that each method requires a specific cross-validation routine, depending on the parameter one is focusing on. It makes it quite complex, forces to write more code (and make mistakes). It would be amazing if there were something that is generic, wouldn’t it ? There is. All the ideas summarized here come from the paper  of M. Izbicki. I really recommand going through it (especially if you have a taste for algebra). The idea is to consider : • the data, being a monoid with a concatenation operation : $A \clubsuit B$ is just the bindings of the rows of $A$ with the rows of $B$ (the neutral element being an empty set). • the learner $m$ which, given some input data, returns a model $T : A \ \rightarrow (f:x\rightarrow \text{label})$ • a morphism $\diamondsuit$, so that $m(A \clubsuit B) = m(A) \diamondsuit m(B)$ Now, if $\diamondsuit$ can be evaluated in a constant time (independant of the length of $A$ and $B$), one can have cross-validations in $O(n+k)$ instead of $O(kn)$. This works for Naive Bayes, nearest centroids, and other methods. Now the algorithm presented is the Monoid Cross Validation. Keeping $n$ the number of points and $k$ the number of folds for the notations, and $G_i$ the i-th fold for the k-fold cross-validation, $k$ models are trained on each fold. Now the models are merged in $k-1$ operations (using the morphism relationship) and the prediction is performed over the last fold. If the prefixes and suffixes are evaluated beforehand (i.e. building the sequences of models trained on the various folds in the two orders) a method enables to obtain the cross-validation in $O(n+k)$ operations. This topic is really interesting and I probably will dedicate it a full article. What about a toy ocaml implementation ?  M. Izbicki, “Algebraic classiﬁers: a generic approach to fast cross-validation, online training, and parallel training,” p. 9.  T. Hastie, S. Rosset, R. Tibshirani, and J. Zhu, “The Entire Regularization Path for the Support Vector Machine,” p. 25.  R. J. Tibshirani and J. Taylor, “The solution path of the generalized lasso,” The Annals of Statistics, vol. 39, no. 3, pp. 1335–1371, Jun. 2011.  “Plane Answers to Complex Questions: The theory of linear models” Ronald Christensen # Learning more For those interested about ideas in statistics stemming from algebra (and not only matrix operations) and geometry, I am only aware of two books covering this topic : Algebraic and Geometric Methods in Statistics and Lectures on Algebraic Statistics # Computational complexity of machine learning algorithms What is complexity ? Good question. I should have addressed it right away. It is a notion which is often addressed in algorithmic classes, but not in machine learning classes… Simply put, say you have a model. Training it on $n$ points takes $x$ minutes. Now what happens if you train it on $kn$ points ? If the training time is now $kx$ then the training time is linear. Sometimes, it is more. The new training time may be $k^2 x$. In this case, the training time would be called quadratic in the number of points. If you have a long training time for few thousands points, do not expect to be able to run this method on millions of points (kernel SVMs, I miss you in large samples). It is harder than one would think to evaluate the complexity of a machine learning algorithm, especially as it may be implementation dependent, properties of the data may lead to other algorithms or the training time often depends on some parameters passed to the algorithm. Another caveat is that the learning algorithms are complex and rely on other algorithms. ## A theoretical point of view ### Some bounds Here, I propose upper bounds (as the implementation achieving this bound will be described) when the data is dense. Calling $n$ the number of training sample, $p$ the number of features, $n_{trees}$ the number of trees (for methods based on various trees), $n_{sv}$, the number of support vectors and $n_{l_i}$ the number of neurons at layer $i$ in a neural network, we have the following approximations. Algorithm Classification/Regression Training Prediction Decision Tree C+R $O(n^2p)$ $O(p)$ Random Forest C+R $O(n^2pn_{trees})$ $O(pn_{trees})$ Random Forest R Breiman implementation $O(n^2p n_{trees})$ $O(pn_{trees})$ Random Forest C Breiman implementation $O(n^2 \sqrt p n_{trees})$ $O(pn_{trees})$ Extremly Random Trees C+R $O(npn_{trees})$ $O(npn_{trees})$ Gradient Boosting ($n_{trees}$) C+R $O(npn_{trees})$ $O(pn_{trees})$ Linear Regression R $O(p^2n+p^3)$ $O(p)$ SVM (Kernel) C+R $O(n^2p+n^3)$ $O(n_{sv}p)$ k-Nearest Neighbours (naive) C+R $-$ $O(np)$ Nearest centroid C $O(np)$ $O(p)$ Neural Network C+R ? $O (pn_{l_1} + n_{l_1} n_{l_2} + ... )$ Naive Bayes C $O(np)$ $O(p)$ ### Justifications Decision Tree based models Obviously, ensemble methods multiply the complexity of the original model by the number of “voters” in the model, and replace the training size by the size of each bag. When training a decision tree, a split has to be found until a maximum depth $d$ has been reached. The strategy for finding this split is to look for each variable (there are $p$ of them) to the different thresholds (there are up to $n$ of them) and the information gain that is achieved (evaluation in $O(n)$) In the Breiman implementation, and for classification, it is recommanded to use $\sqrt p$ predictors for each (weak) classifier. Extremly random trees The search strategy for the optimal split simply does not take place in the case of ERTs. This makes it much simpler to find a (weaker) split. However (in my experience), ERTs implementation are not much faster than RFs. Linear regressions The problem of finding the vector of weights $\beta$ in a linear regression boils down to evaluating the following equation: $\beta=(X'X)^{-1}X'Y$. The most computationnaly intensive part is to evaluate the product $X'X$, which is done in $p^2n$ operations, and then inverting it, which is done in $p^3$ operations. Though most implementations prefer to use a gradient descent to solve the system of equations $(X'X)\beta = X'Y$, the complexity remains the same. Support Vector Machine For the training part, the classical algorithms require to evaluate the kernel matrix $K$, the matrix whose general term is $K(x_i,x_j)$ where $K$ is the specified kernel. It is assumed that K can be evaluated with a $O(p)$ complexity, as it is true for common kernels (Gaussian, polynomials, sigmoid…). This assumption may be wrong for other kernels. Then, solving the constrained quadratic programm is “morally equivalent to” inverting a square matrix of size $n$, whose complexity is assumed to be $O(n^3)$ k-Nearest Neighbours In its simplest form, given a new data point $x$, the kNN algorithm looks for the k closest points to $x$ in the training data and returns the most common label (or the averaged values of targets for a regression problem). To achieve this, it is necessary to compare the distance between $x$ and every point in the data set. This amounts to $n$ operations. For the common distances (Euclide, Manhattan…) this operation is performed in a $O(p)$ operations. Not that kernel k Nearest Neighbours have the same complexity (provided the kernels enjoy the same property). However, many efforts pre-train the kNN, indexing the points with quadtrees, which enable to lower dramatically the number of comparisons to the points in the training set. Likewise, in the case of a sparse data set, with an inverted indexation of the rows, it is not necessary to compute all the pairwise distances. ## The practical point of view All this is nice, but what about real life examples ? We can focus on sk-learn implementations below. The assumptions will be that the complexities take the form of $O(n^\alpha p^\beta)$ and $\alpha$ and $\beta$ will be estimated using randomly generated samples with $n$ and $p$ varying. Then, using a log-log regression, the complexities are estimated. Though this assumption is wrong, it should help to have a better idea of how the algorithms work and it will reveal some implementation details / difference between the default settings of the same algorithm that one may overlook. ### The results Method $\alpha$ $\beta$ RandomForestRegressor 1.21 0.89 ExtraTreesRegressor 1.03 0.88 AdaBoostRegressor 0.71 1.01 LinearRegression 0.72 1.3 SVR 1.96 0.42 RandomForestClassifier 1.09 0.38 ExtraTreesClassifier 0.81 0.31 AdaBoostClassifier 0.89 0.79 SVC 2.05 0.52 LogisticRegression(solver=liblinear) 0.9 0.88 LogisticRegression(solver=sag) 1.03 0.95 Surprisingly, some methods are sublinear in $n$. Perhaps the sample sizes were too small. As expected, the Support Vector show a complexity in $n$ that does not scale well with the sample size (close to 2). Another interesting point to note are the complexities in $p$ for the random forest and extra trees, the component in $p$ varies according to the fact that we are performing a regression or a classification problem. A short look at the documentation explains it, they have different behaviors for each problem! For the regression: max_features : int, float, string or None, optional (default=”auto”) The number of features to consider when looking for the best split: If int, then consider max_features features at each split. If float, then max_features is a percentage and int(max_features * n_features) features are considered at each split. If “auto”, then max_features=n_features. If “sqrt”, then max_features=sqrt(n_features). If “log2”, then max_features=log2(n_features). If None, then max_features=n_features.  Whereas the classification default behavior is If “auto”, then max_features=sqrt(n_features).  ### The code Fore those who would like to run the code over other algorithms, here is the method I used. import numpy as np import pandas as pd import time from sklearn.linear_model import LinearRegression import math class ComplexityEvaluator: def __init__(self, nrow_samples, ncol_samples): self._nrow_samples = nrow_samples self._ncol_samples = ncol_samples def _time_samples(self, model, random_data_generator): rows_list = [] for nrow in self._nrow_samples: for ncol in self._ncol_samples: train, labels = random_data_generator(nrow, ncol) start_time = time.time() model.fit(train, labels) elapsed_time = time.time() - start_time result = {"N": nrow, "P": ncol, "Time": elapsed_time} rows_list.append(result) return rows_list def Run(self, model, random_data_generator): data = pd.DataFrame(self._time_samples(model, random_data_generator)) print(data) data = data.applymap(math.log) linear_model = LinearRegression(fit_intercept=True) linear_model.fit(data[["N", "P"]], data[["Time"]]) return linear_model.coef_  And, some unit tests (that can just be appended at the bottom of the previous class). if __name__ == "__main__": class TestModel: def __init__(self): pass def fit(self, x, y): time.sleep(x.shape / 1000.) def random_data_generator(n, p): return np.random.rand(n, p), np.random.rand(n, 1) model = TestModel() complexity_evaluator = ComplexityEvaluator( [200, 500, 1000, 2000, 3000], [1,5,10]) res = complexity_evaluator.Run(model, random_data_generator) print(res)  After a small unit test, everything seems consistent. N P Time 0 200 1 0.200263 1 200 5 0.200421 2 200 10 0.200520 3 500 1 0.500928 4 500 5 0.500711 5 500 10 0.501064 6 1000 1 1.001396 7 1000 5 1.001420 8 1000 10 1.000629 9 2000 1 2.002225 10 2000 5 2.002096 11 2000 10 2.000759 12 3000 1 3.003331 13 3000 5 3.003340 14 3000 10 3.003350 [[ 9.99583664e-01 1.13487892e-05]]  So let’s enjoy the number of algorithms offered by sklearn. The following list may be updated as new algorithms are tested. import numpy as np import ComplexityEvaluator from sklearn.linear_model import LinearRegression from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, AdaBoostRegressor from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, AdaBoostClassifier from sklearn.svm import SVR, SVC from sklearn.linear_model import LogisticRegression def random_data_regression(n, p): return np.random.rand(n, p), np.random.rand(n) def random_data_classification(n, p): return np.random.rand(n, p), np.random.binomial(1, 0.5, n) regression_models = [RandomForestRegressor(), ExtraTreesRegressor(), AdaBoostRegressor(), LinearRegression(), SVR()] classification_models = [RandomForestClassifier(), ExtraTreesClassifier(), AdaBoostClassifier(), SVC(), LogisticRegression(), LogisticRegression(solver='sag')] names = ["RandomForestRegressor", "ExtraTreesRegressor", "AdaBoostRegressor", "LinearRegression", "SVR", "RandomForestClassifier", "ExtraTreesClassifier", "AdaBoostClassifier", "SVC", "LogisticRegression(solver=liblinear)", "LogisticRegression(solver=sag)"] complexity_evaluator = ComplexityEvaluator.ComplexityEvaluator( [500, 1000, 2000, 5000, 10000, 15000, 20000], [5, 10, 20, 50, 100, 200]) i = 0 for model in regression_models: res = complexity_evaluator.Run(model, random_data_regression) print(names[i] + ' | ' + str(round(res, 2)) + ' | ' + str(round(res, 2))) i = i + 1 for model in classification_models: res = complexity_evaluator.Run(model, random_data_classification) print(names[i] + ' | ' + str(round(res, 2)) + ' | ' + str(round(res, 2))) i = i + 1  # Learning more ## Machine learning The elements of statistical learning by Trevor Hastie, Robert Tibshirani, Jerome Friedman is a brilliant introduction to the topic and will help you have a better understanding of most of the algorithms presented in this article ! Foundations of Machine Learning by Mehryar Mohri, Afshin Rostamizadeh, Ameet Talwalkar and Francis Bach provides a theoretical framework to various machine learning algorithms and a detailed implementation of some of them. ## Algorithms and computational complexity If you want a thorough understanding of computational complexity theory, these books are great resources. Computational Complexity: A Modern Approach is a clear, detailed analysis of the topic, also covering cryptography and quantum computation. Randomized Algorithms Though more specialized than the first one, I like the interplay between probabilities and algorithms presented here. # Introduction to q-methodology “In Q methodology there is little more reason to understand the mathematics involved than there is to understand mechanics in order to drive a car.” S.R. Brown # Context This methodology is particularly relevant when one wants to assess the order of importance of variables as seen by a sample of individuals. Per example “A good computer… must be fast, must be beautiful, must be resistant to shocks …”. One approach could be the usual Likert scale analysis. However, the issue is that some (many) individuals may think that every factor is very important. This is why, somehow, q methodology was born. It simply forces the respondant to order their preferences. How ? The assertions have to be placed in a pyramid. Then the position of each element for each participant constitutes the dataset. Each individual will fill a pyramid with different assertions, like the one below. Fig. 1: A q sort # Who uses Q-methodology ? Some will use q-methdology to study the conceptualization of democracy by the individuals  the quality of education as seen by teachers  or the goals and values of beef farmers in Brazil , to name a few published applications. But it can also be used to improve customer behavior analysis, or define relevant (at last!) opinion surveys! # A detailed example ## Assertions Imagine one is asking which assertions describe a good runner. • A good runner must have a steady speed (speed) • A good runner eats well (food) • A good runner trains regularly (training) • A good runner does not drink alcohol (alcohol) In parenthesis are the abbreviations that will be used in the next descriptions. ## Encoding the data Once the data gathered it looks like: Individual speed food training alcohol 1. 0 0 +1 -1 2. +1 0 0 -1 3. +1 0 0 -1 n. 0 +1 0 -1 Per example the row number one would be the encoding of the responses on the first figure. Each line is called a q sort. ## The method Once the data gathered and encoded, a PCA (principal component analysis) is performed on this dataset, followed by an orthogonal rotation (usually varimax). The algebraic details can be found in . In the end, the data looks like: Factor 1st factor 2nd factor speed +1 0 food 0 +1 training 0 -1 alcohol -1 0 ## Analysis The analysis is then performed observing the axis after the rotation. Quoting  The most important aspect of the study file will nonetheless be the factor arrays themselves. These will be found in a table, ‘Item or Factor Scores’. […] The interpretative task in Q methodology involves the production of a series of summarizing accounts, each of which explicates the viewpoint being expressed by a particular factor. Basically each factor will be a point of view (or will represent different similar points of view), shared by many individuals in the sample. This is a very first step to factor analysis and I recommend reading  in more details for an analysis with actual data. # Implementations The “Q sort” data collection procedure is traditionally done using a paper template and the sample of statements or other stimuli printed on individual cards. Wikipedia. ## Setting up an online survey Unless you have a lot of time to write down the assertions on small cardboards, and take pictures of the way they are disposed on a blackboard, I recommend using an online survey ! This github repository does a nice job : easy-htmlq. You really do not need a lot of knowledge of web development to get it running :) If you do not want to go through the hassle of using your own server, you can use a firebase service (possible with this version of the code) to keep the data and deploy the pages on Netlify or on Github pages (depending on your preferences). Let’s have a look at the contents. ├── fonts │ ├── glyphicons-halflings-regular.eot │ ├── [...] │ └── glyphicons-halflings-regular.woff ├── index.html ├── LICENSE ├── logo_center.jpg ├── logo.jpg ├── logo_right.jpg ├── README.md ├── settings │ ├── config.xml │ ├── firebaseInfo.js │ ├── language.xml │ ├── map.xml │ └── statements.xml ├── src │ ├── angular.min.js │ ├── [...] │ └── xml2json.min.js ├── stylesheets │ ├── bootstrap.min.css │ └── htmlq.css └── templates ├── dropEventText.html ├── [...] └── thanks.html  The only things that need to be changed are in settings, which are just xml files. ### statements.xml and map.xml Corresponds to the statements to be sorted by the respondant. In the first figure, we would have used the following configuration. <?xml version="1.0" encoding="UTF-8"?> <statements version="1.0" htmlParse="false"> <statement id="1"> A good runner must have a steady speed </statement> <statement id="2"> A good runner eats well </statement> <statement id="3"> A good runner trains regularly </statement> <statement id="4"> A good runner does not drink alcohol </statement> </statements>  Note that map.xml must be modified accordingly: <map version="1.0" htmlParse="false"> <column id="-1" colour="FFD5D5">1</column> <column id=" 0" colour="E9E9E9">2</column> <column id="+1" colour="9FDFBF">1</column> </map>  ### config.xml Corresponds to other questions that can be asked (checkboxes…) ### firebaseInfo.js Where to put you firebase tokens. // Initialize Firebase var config = { apiKey: "", authDomain: "", databaseURL: "", projectId: "", storageBucket: "", messagingSenderId: "" }; firebase.initializeApp(config); var rootRef = firebase.database().ref();  ### language.xml Enables you to change various elements of language. # Analysis and simulations, in R I do not have data that I can publicly disclose, so the analysis will be on simulated data. Let’s assume you collected the data and want to analyze it. There is an R package taking charge of the analysis . Let’s have a look at it with simulated data. N will be the number of individuals, target_sort a distribution that is the “actual order of preferences” of the individuals, on which we will swap some elements accross individuals, randomly. The following code does the job. require("qmethod") N <- 15 target_sort <- c(-2, 2, 1, -1, 1, 0, 0, -1, 1, 0) data <- t(replicate(N, target_sort)) for (i in 1:nrow(data)) { switch_indices <- sample(x = ncol(data), 2) tmp <- data[i, switch_indices] data[i, switch_indices] <- data[i, switch_indices] data[i, switch_indices] <- tmp } data <- t(data) rownames(data) <- paste0("assertion_", 1:10) colnames(data) <- paste0("individual_", 1:N) qmethod(data, nfactors = 2)  Now let’s look at the output of qmethod. It is basically a very long console output divided in several blocks. The first block is just a summary of the parameters and the data provided to the method. Q-method analysis. Finished on: Mon Apr 16 18:10:10 2018 Original data: 10 statements, 15 Q-sorts Forced distribution: TRUE Number of factors: 2 Rotation: varimax Flagging: automatic Correlation coefficient: pearson Q-method analysis. Finished on: Mon Apr 16 18:10:10 2018 Original data: 10 statements, 15 Q-sorts Forced distribution: TRUE Number of factors: 2 Rotation: varimax Flagging: automatic Correlation coefficient: pearson  Then we have more details about the data sent to the qmethod function. Original data : individual_1 individual_2 individual_3 individual_4 individual_5 assertion_1 -2 0 -2 -2 -2 assertion_2 2 2 2 0 2 assertion_3 1 1 1 1 1 assertion_4 -1 -1 -1 -1 1 assertion_5 -1 1 -1 1 1 assertion_6 0 0 0 0 0 assertion_7 0 0 0 0 0 assertion_8 1 -1 1 -1 -1 assertion_9 1 1 1 1 -1 assertion_10 0 -2 0 2 0 individual_6 individual_7 individual_8 individual_9 individual_10 assertion_1 -2 1 -2 0 -2 assertion_2 2 2 2 2 2 assertion_3 1 1 1 1 -1 assertion_4 -1 -1 -1 -1 1 assertion_5 1 1 -1 1 1 assertion_6 0 0 0 -2 0 assertion_7 -1 0 0 0 0 assertion_8 0 -1 1 -1 -1 assertion_9 1 -2 1 1 1 assertion_10 0 0 0 0 0  The loadings. In a nutshell, they represent how close someone is to the factor column at the end of the qsort. Note that the individuals where the values -2, 2 were untouched by the random swap are the one with the highest loadings with respect to f1. This may be particularly interesting if the individual are heterogeneous, and to test wether one of them (or some of them) are actually really close to the “consensual preferences” (i.e. the principal component). Q-sort factor loadings : f1 f2 individual_1 0.93 0.085 individual_2 0.26 0.762 individual_3 0.93 0.085 individual_4 0.56 0.286 individual_5 0.43 0.553 individual_6 0.80 0.517 individual_7 -0.21 0.732 individual_8 0.93 0.085 individual_9 0.27 0.849 individual_10 0.54 0.406 (...) See item '...$loa' for the full data.


This part is seldomly reported, I ommited some lines on purpose.

Flagged Q-sorts :
flag_f1 flag_f2
individual_1  " TRUE" "FALSE"
[...]
individual_10 "FALSE" "FALSE"
(...) See item '...$flagged' for the full data.  This part is seldomly reported, I ommited some lines on purpose. Statement z-scores : zsc_f1 zsc_f2 assertion_1 -1.841 -0.290 [...] assertion_10 -0.088 -0.382  At last we have the figures that will usually be reported in most papers relying on qsorts. Note that the first column corresponds (almost) to c(-2, 2, 1, -1, 1, 0, 0, -1, 1, 0) as expected! Increasing N in the code allows to recover exactly the original vector. Statement factor scores : fsc_f1 fsc_f2 assertion_1 -2 0 assertion_2 2 2 assertion_3 1 1 assertion_4 -1 -1 assertion_5 -1 1 assertion_6 0 -2 assertion_7 0 0 assertion_8 1 -1 assertion_9 1 1 assertion_10 0 0  Factor characteristics: General factor characteristics: av_rel_coef nload eigenvals expl_var reliability se_fscores f1 0.8 6 6.1 41 0.96 0.2 f2 0.8 6 5.2 35 0.96 0.2 Correlation between factor z-scores: zsc_f1 zsc_f2 zsc_f1 1.00 0.52 zsc_f2 0.52 1.00 Standard error of differences between factors: f1 f2 f1 0.28 0.28 f2 0.28 0.28  Distinguishing and consensus statements : dist.and.cons f1_f2 sig_f1_f2 assertion_1 Distinguishing -1.55 **** assertion_2 Consensus -0.17 assertion_3 Consensus -0.08 assertion_4 Consensus 0.17 assertion_5 Distinguishing -1.50 **** assertion_6 Distinguishing 1.07 *** assertion_7 Consensus -0.11 assertion_8 Distinguishing 1.59 **** assertion_9 Consensus 0.29 assertion_10 Consensus 0.29  # Final words That’s it! Q methodology is a vast topic and deserves a whole book ! Covering the theory of principal components, rotations, possibly the mathematic underlying the method, like *varimax** and the other options, detailing various surveys and how the analysis was performed, setting up tests… As I was writing this post I realized how optimistic I was when I thought I could describe the method. Anyway, I hope this will be enough for a reader to set up one’s survey and analyze it. # Sources and external sites  Rune Holmgaard Andersen, Jennie L. Schulze, and Külliki Seppel, “Pinning Down Democracy: A Q-Method Study of Lived Democracy,” Polity 50, no. 1 (January 2018): 4-42.  Grover, Vijay Kumar (2015, August). Developing indicators of quality school education as perceived by teachers using Q-methodology approach. Zenith International Journal of Multidisciplinary Research, 5(8), 54-65.  Pereira, Mariana A., John R. Fairweather, Keith B. Woodford, & Peter L. Nuthall (2016, April). Assessing the diversity of values and goals amongst Brazilian commercial-scale progressive beef farmers using Q-methodology. Agricultural Systems, 144, 1-8.  Aiora Zabala. qmethod: A Package to Explore Human Perspectives Using Q Methodology. The R Journal, 6(2):163-173, Dec 2014.  H. Abdi, “Factor Rotations in Factor Analyses”  A primer on Q methodology - SR Brown - Operant subjectivity, 1993 - researchgate.net  S. Watts and P. Stenner, “Doing Q methodology: theory, method and interpretation,” p. 26. # What to expect from a model when there is nothing to learn ? # An imbalanced binary classification problem Did it ever happen to you to have a model that have a lower accuracy than a constant guessing (the one that predicts the most common class) ? It happened to me recently and I was quitte puzzled: 200 data points, two classes, 60% of the sample belonged to class one, the remaining part, to class two. After running a random forest, I observed an accuracy of 50% on the out of bag predictions. This seemed really low, if there were nothing to learn from the data, then why did the model did not predict the most common class and achieved an accuracy of 60% ? Playing with the parameters (increasing min_sample_leaf) improved the performance (but it was forcing the trees to have a ridiculously low depth). So I decided to simulate the distribution of the out of sample accuracy of my model with the following snippet! (in R) library("randomForest") N <- 100 P <- 0.6 TRIALS <- 10000 random_response <- as.factor(rbinom(n = N, size = 1, p = P)) evaluate_error_rate <- function(blob) { model <- randomForest(x = matrix(rnorm(n = N * 5), nrow = N), y = random_response) model$err.rate[nrow(model\$err.rate), 1]
}

res <- sapply(1:TRIALS, evaluate_error_rate)

hist(res, main = 'Distribution of the out of sample error',
xlab = 'Out of sample error (percent of mistmatches)')
mean(res) # 0.417235 seems good...
sum(res[res>0.5])/length(res) # 0.010754
sum(res[res>0.4])/length(res) # 0.272972 What were the results ? The mean of the sample seems to converge to the accuracy of a constant predictor, which is good. However, the probability that a model makes more mistake than a random guess was not that low.

What is event better is that, given a training procedure, a number of points, a number of variables and an imbalance between classes, this distribution should not change. So now, one could even imagine to create a statistical test “did my model actually learned something”, and give a probability to the rejection of the null hypothesis. I leave this to the reader :)

# Learning more

The elements of statistical learning by Trevor Hastie, Robert Tibshirani, Jerome Friedman is a brilliant introduction to the topic and will help you have a better understanding of most of the algorithms presented in this article !

Applied predictive modelling is also good introduction to predictive modelling, with R (used in the code snippets) and machine learning.

# Should I Scale my data?

If you are asking this, then you probably do not understand the algorithm you are using. This is a bad habit to start with, but if you do not want, have the time or the interest, the following table should be a decent starting point.

# Some definitions

Centering Centering a variable consists in substracting the mean value to each value, so that the new variable has a sample mean equals to 0.

Reducing Reducing a value consists in dividing every value of the sample by the standard deviation of the sample.

Scaling Here we will call “scaling” the action consisting of centering the data and then reducing it. After the scaling, the sample has a null sample mean and a standard deviation of 1.

## Generalities about algorithms regarding the scaling of the data

Supervised learning

Algorithm Scaling
Decision Tree No
Random Forest No
Linear Regression No
Penalized Linear Regression Yes, probably
SVM (Kernel) Yes, probably
k-Nearest Neighbours Yes, probably
Nearest centroid Yes, probably
Neural Network  Yes, probably

Unsupervised learning

Algorithm Scaling needed
PCA Yes, probably
Random projections Yes, probably
t-SNE Yes, probably

The following tables should be read this way. If scaling is not needed, it means you should not see changes between the results you obtain with or without scaling.

If it says yes, probably, it means that scaling is useful as features should have the same order of magnitude for the algorithm to work properly. However, it does not mean that performance will increase.

Per example, in presence of features lying on a bounded scale (when translating an image to a grayscale image and then feeding it to a neural network, or when turning a text to a TFIDF matrix), scaling is not recommended.

## When the scaling is performed before applying the algorithm

Note that some libraries (especially in R) take care of the scaling before applying the algorithm. Though this seems to be a bad idea (the behaviour of the algorithm if a column is constant becomes implementation dependent, per example), this may save you some efforts.

svm(x, y = NULL, scale = TRUE, type = NULL, kernel =
"radial", degree = 3, gamma = if (is.vector(x)) 1 else 1 / ncol(x),
coef0 = 0, cost = 1, nu = 0.5,
class.weights = NULL, cachesize = 40, tolerance = 0.001, epsilon = 0.1,
shrinking = TRUE, cross = 0, probability = FALSE, fitted = TRUE,
..., subset, na.action = na.omit)


This is the svm function as presented in the e1071 R package. Note the default value of scale.

glmnet(x, y, family=c("gaussian","binomial","poisson","multinomial","cox","mgaussian"),
weights, offset=NULL, alpha = 1, nlambda = 100,
lambda.min.ratio = ifelse(nobs<nvars,0.01,0.0001), lambda=NULL,
standardize = TRUE, intercept=TRUE, thresh = 1e-07, dfmax = nvars + 1,
pmax = min(dfmax * 2+20, nvars), exclude, penalty.factor = rep(1, nvars),
lower.limits=-Inf, upper.limits=Inf, maxit=100000,
type.gaussian=ifelse(nvars<500,"covariance","naive"),
type.logistic=c("Newton","modified.Newton"),
standardize.response=FALSE, type.multinomial=c("ungrouped","grouped"))


In the glmnet package the argument is now called standardize. Note that here, the response can be standardized as well. This topic will not be covered in this post. Now coming back to the dangers of such an approach, look at the following sample:

N <- 100
P <- 5
X <- matrix(data = rnorm(N*P), nrow = N)
Y <- matrix(rnorm(N), nrow = N)

X[,1] <- X[,1]*0 # some evil action

require("glmnet")
model <- glmnet(x = X, y = Y)

# Runs without issues

require("e1071")
model2 <- svm(x = X, y = Y)

# Warning message:
#  In svm.default(x = X, y = Y) :
#  Variable(s) ‘X1’ constant. Cannot scale data.


In the case where you run many models on many datasets (or many combination of features) some will scale the data, others will not (if one of the features is constant) and may report bad performances because the scaling was not operated…

## Is it always possible to scale the data ?

### Theoretical point of view

The assumptions when substracting the mean and dividing by the standard deviation is that they both exist ! Though with finite samples, we can always evaluate sample mean and sample variance, if the variables come from (say a Cauchy distribution) the coefficients for scaling may vary dramatically when enriching the sample with new points.

However, in the case where one is trying to learn anything from distributions that are not integrable, there will be many other issues to deal with.

### Practical point of view

With a sparse dataset, scaling is not a good idea : it would force many of the points (the ones that are 0s in the original dataset). But reducing the variables is possible! And it turns out that some algorithms are not affected by the centering (or not) of the data.

## A better approach

As we saw, there are actually three types of algorithms : those who do not change with monotonic transformations of the inputs, those who do not change with translations of the input and those who do not fit in the first two categories.

Note that the “monotonic transformation invariance” is the strongest property, as translation is just a monotonic transformation.

So the algorithms would enjoy a better representation in this table:

Supervised learning

Algorithm Translation invariant Monotonic transformation invariant
Decision Tree X X
Random Forest X X
Linear Regression X
Penalized Linear Regression
SVM (Gaussian kernel) X
SVM (Other kernels)
k-Nearest Neighbours X
Nearest centroid X
Neural Network

Unsupervised learning

Algorithm Translation invariant Monotonic transformation invariant
PCA
Random projections
t-SNE X

# Learning more

The elements of statistical learning by Trevor Hastie, Robert Tibshirani, Jerome Friedman is a brilliant introduction to the topic and will help you have a better understanding of most of the algorithms presented in this article !