Kernel ridge regression

We use the MAGIC telescope data from the UCI repository, http://archive.ics.uci.edu/ml/datasets/MAGIC+Gamma+Telescope. These are simulated data for the detection of high-energy gamma particles in a ground-based atmospheric Cherenkov gamma telescope using the imaging technique. The goal is to separate gamma particles (signal) from the hadron background.

The analysis described in Bock et al., Nucl. Instr. Meth. A 516 (2004), pp. 511-528, applies a number of powerful classifiers such as a random forest, a neural net, and others. The power of these classifiers is evaluated by measuring the true positive rate (TPR) at false positive rates (FPR) of 0.01, 0.02, 0.05, 0.1, and 0.2. One such measure is HIACC (high accuracy), defined as the mean TPR at FPR values 0.1 and 0.2.

We split the data into training and test sets in the proportion 2:1 and code class labels as a numeric vector: +1 for gamma and -1 for hadron.

load MagicTelescope;
Ntrain = size(Xtrain,1);
Ntest = size(Xtest,1);
Ytrain = 2*strcmp('Gamma',Ytrain)-1;
Ytest = 2*strcmp('Gamma',Ytest)-1;

Standardize variables

In these data, the variables have substantially different standard deviations, and these standard deviations carry no information useful for separating the two classes. We therefore standardize each variable by subtracting its mean and dividing by its standard deviation. To do so, we pass the training data to the zscore function, which returns the standardized data along with the estimated means and standard deviations. We then use the same means and standard deviations to standardize the test data.

[Xtrain,mu,sigma] = zscore(Xtrain);
Xtest = bsxfun(@minus,Xtest,mu);
Xtest = bsxfun(@rdivide,Xtest,sigma);

Choose the kernel

We use a Gaussian kernel with unit width (the shape of the standard normal density). This kernel is often a good choice for standardized data in a low-dimensional setting (here, 10 variables). In this example, we make no effort to find the kernel with the best classification accuracy; it is possible that the predictive power of kernel ridge regression on this dataset could be improved by selecting a different kernel. A sketch of one alternative appears after the code below.

We define kernelfun, a function handle that computes pairwise kernel values for observations collected in two matrices, X1 and X2, of the same size. In each matrix, observations are arranged in rows and variables in columns. kernelfun returns a column vector with one element per row of X1 and X2: the 1st element is the kernel value between the 1st row of X1 and the 1st row of X2, the 2nd element is the kernel value between the 2nd rows, and so on. We use this function handle later to fill the Gram matrix and to map the test data into the feature space.

sigma = 1; % kernel width; this reuses the name sigma from zscore, whose output is no longer needed
kernelfun = @(x1,x2) exp(-sum((x1-x2).^2,2)/2/sigma^2);
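
For illustration only, here is a sketch of one alternative kernel with the same calling convention as kernelfun: a Laplacian (exponential) kernel. It is not used anywhere else in this example, and its width sigmaL is an arbitrary value that would have to be tuned, for instance by cross-validation.

% Sketch only: a Laplacian kernel handle with the same calling convention
% as kernelfun. Not used below; sigmaL is an arbitrary, untuned width.
sigmaL = 1;
laplacefun = @(x1,x2) exp(-sqrt(sum((x1-x2).^2,2))/sigmaL);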

Compute the Gram matrix

We compute the Gram matrix for the training data. First, we create an Ntrain-by-Ntrain matrix filled with zeros. Then we loop over the training observations. For every training observation, represented by the row vector q, we compute the kernel values between this observation and all observations in the training set. To do so, we replicate q using repmat into an Ntrain-by-10 matrix in which every row equals q. This computation could be sped up by exploiting the symmetry of the Gram matrix (a sketch follows the loop below). The loop, however, runs quickly enough for the chosen training set (roughly 12,000 observations in 10 dimensions), and we do not attempt to reduce the run time.

Gtrain = zeros(Ntrain);
for n=1:Ntrain
    q = Xtrain(n,:);
    Gtrain(n,:) = kernelfun(Xtrain,repmat(q,Ntrain,1));
end
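
For reference, here is a sketch of how the symmetry of the Gram matrix could be exploited to roughly halve the number of kernel evaluations. It produces the same matrix as the loop above and is not used in the rest of the example.

% Sketch only: fill the upper triangle (including the diagonal) and mirror
% it into the lower triangle, exploiting symmetry of the Gram matrix.
Gsym = zeros(Ntrain);
for n=1:Ntrain
    q = Xtrain(n,:);
    idx = n:Ntrain;
    Gsym(n,idx) = kernelfun(Xtrain(idx,:),repmat(q,numel(idx),1));
end
Gsym = Gsym + triu(Gsym,1)';   % copy the strict upper triangle below the diagonal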

Map the test data into the feature space

We apply kernelfun to compute an Ntest-by-Ntrain feature matrix for the test data.

Gtest = zeros(Ntest,Ntrain);
for n=1:Ntest
    q = Xtest(n,:);
    Gtest(n,:) = kernelfun(Xtrain,repmat(q,Ntrain,1));
end

Compute the predicted response

By quick experimentation on cross-validated data (not shown here), we find that the value of the regularization parameter $\lambda=1$ is close to optimal. To find the vector of regression coefficients alpha, we compute a regularized least-squares solution of the linear system (Gtrain + lambda*I)*alpha = Ytrain using the backslash operator, where I = eye(Ntrain) is the Ntrain-by-Ntrain identity matrix. We then use the obtained coefficients to compute Yfit, the predicted response for the test data. The Yfit values can range from $-\infty$ to $+\infty$. We could assign class labels by thresholding the response at zero (a minimal sketch follows the code below), but computing a ROC curve does not require class labels, and we choose not to assign them.

lambda = 1;
alpha = (Gtrain + lambda*eye(Ntrain))\Ytrain;
Yfit = Gtest*alpha;
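
If hard class labels were needed, they could be obtained by thresholding as mentioned above; a minimal sketch follows. We do not use these labels, because perfcurve works directly with the continuous scores in Yfit.

% Sketch only: hard class labels by thresholding the predicted response at
% zero (+1 for gamma, -1 for hadron). Not used in the rest of the example.
Ylabel = sign(Yfit);
Ylabel(Ylabel==0) = 1;   % break exact ties in favor of the gamma class (arbitrary choice)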

Compute HIACC

Following the paper by Bock et al., we estimate the mean signal efficiency (TPR) at background acceptances (FPR) of 0.1 and 0.2. We find the TPR obtained by kernel ridge regression, tprkr, at the specified values of FPR using the perfcurve function. The first two inputs to perfcurve are the true class labels and the classification scores computed for the test data. We set the 3rd input to +1, the label of the gamma (signal) class. By default, perfcurve returns the computed FPR and TPR values as the 1st and 2nd output arguments. We pass the 'xvals' option to perfcurve to choose the desired FPR values. Because the FPR values do not need to be computed, we put a tilde, ~, in place of the 1st output to ignore it.

[~,tprkr] = perfcurve(Ytest,Yfit,1,'xvals',[0.1 0.2])
tprkr =
    0.7651
    0.9086

The best HIACC value across all classification algorithms in the study by Bock et al., 0.852, is obtained by the random forest. The second-best classifier in their study, a neural net, achieves 0.839. The value obtained by kernel ridge regression comes close.

accKR = mean(tprkr)
accKR =
    0.8369

We compute 95% confidence bounds for the TPR at the two FPR values by vertical averaging. To do so, we run the perfcurve function with the 'nboot' parameter set to 1000 bootstrap replicas; a sketch of this call is shown below, with its output omitted to save space. To estimate the 95% lower bound on the mean signal efficiency, we average the two lower bounds, and similarly for the upper bound. (This estimate is not exact, but it suffices for this exercise.) The resulting confidence interval for the mean signal efficiency contains the HIACC values obtained by the random forest and the neural net. The value computed by kernel ridge regression is therefore statistically indistinguishable from the two best values quoted in the paper.
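
For completeness, here is a sketch of the bootstrap call described above; its output is not shown. With a positive 'nboot' value, perfcurve returns the second output as a three-column matrix holding the pointwise TPR estimate and its lower and upper 95% confidence bounds at each requested FPR; tprkr95 is a name chosen here for that output.

% Sketch only: pointwise 95% bootstrap confidence bounds for the TPR at
% FPR = 0.1 and 0.2, using 1000 bootstrap replicas. tprkr95 has three
% columns: the TPR estimate and its lower and upper confidence bounds.
[~,tprkr95] = perfcurve(Ytest,Yfit,1,'xvals',[0.1 0.2],'nboot',1000);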