Linear classification

Load data

We use the MAGIC telescope data from the UCI repository http://archive.ics.uci.edu/ml/datasets/MAGIC+Gamma+Telescope. These are simulated data for detection of high energy gamma particles in a ground-based atmospheric Cherenkov gamma telescope using the imaging technique. The goal is to separate gamma particles from hadron background.

The analysis described in Bock et al., Nucl. Instr. Meth. A 516 (2004) 511-528, applies a number of powerful classifiers such as random forests and neural networks. The power of these classifiers is evaluated by measuring the true positive rate (TPR) at false positive rates (FPR) of 0.01, 0.02, 0.05, 0.1, and 0.2. Here, we present an example illustrating properties of three linear techniques for binary classification. These linear techniques are not meant to improve upon the classification accuracy reported in the publication. We use a reduced set of variables, and, as seen below, the optimal boundary between the two classes is far from linear.

We split the data into training and test sets in the proportion 2:1 and code class labels as a logical vector (true for gamma and false for hadron).

load MagicTelescope;
% Convert the cell arrays of class names into logical labels (true = gamma)
Ytmp = false(numel(Ytrain),1);
Ytmp(strcmp(Ytrain,'Gamma')) = true;
Ytrain = Ytmp;
Ytmp = false(numel(Ytest),1);
Ytmp(strcmp(Ytest,'Gamma')) = true;
Ytest = Ytmp;
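
The MAT-file above already stores the 2:1 split. If you start from the raw UCI data instead, a stratified holdout split could be produced along these lines (a minimal sketch; the matrix X and the cell array Y of class names are assumed names for the imported raw data, not variables in MagicTelescope.mat):

% Sketch of a stratified 2:1 holdout split from hypothetical raw data X and Y
rng(1);                                  % for reproducibility
cvp = cvpartition(Y,'Holdout',1/3);      % stratified by class, 1/3 held out
Xtrain = X(training(cvp),:);  Ytrain = Y(training(cvp));
Xtest  = X(test(cvp),:);      Ytest  = Y(test(cvp));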

Standardize variables

In these data, variables have substantially different standard deviations. These standard deviations do not carry information useful for separation of the two classes. In this situation, it is recommended to standardize variables by subtracting the mean and dividing by the standard deviation. This is especially true if you plan to judge the importance of variables by the magnitudes of the respective coefficients in the linear model.

For some implementations, standardizing variables is not necessary. For example, ClassificationDiscriminant.fit standardizes variables separately for every class. Even if you perform the overall standardization by the zscore function, it is unlikely to affect the analysis outcome. In contrast, estimates of variable importance from plsregress are sensitive to the standardization.

We use the zscore function to standardize the training data and then apply the computed means and standard deviations to standardize the test data.

[Xtrain,mu,sigma] = zscore(Xtrain); % standardize the training data
Xtest = bsxfun(@rdivide,bsxfun(@minus,Xtest,mu),sigma); % apply the training mu and sigma to the test data

Train a linear discriminant and select 4 best predictors

There are 10 variables in the MAGIC data. In physics analysis, linear techniques are usually applied to data with lower dimensionality. We reduce the number of variables to 4 by selecting variables most important for the linear discriminant analysis (LDA). To select the most important variables for LDA, we look at the DeltaPredictor property of the discriminant object. This property shows thresholds (on some normalized scale) at which variables are eliminated from the LDA model, one threshold per variable. A large threshold indicates an important variable.

VarNames = {'fLength' 'fWidth' 'fSize' 'fConc' 'fConc1' ...
    'fAsym' 'fM3Long' 'fM3Trans' 'fAlpha' 'fDist'};
LDA = ClassificationDiscriminant.fit(Xtrain,Ytrain,'PredictorNames',VarNames);
[~,sorted] = sort(LDA.DeltaPredictor,'descend');
D = 4;
keepVarIdx = sorted(1:D); % indices of variables kept in the model
keepVarNames = VarNames(keepVarIdx) % names of kept variables
Xtrain = Xtrain(:,keepVarIdx); % keep only selected variables for training
Xtest = Xtest(:,keepVarIdx); % keep only selected variables for test
keepVarNames = 
    'fLength'    'fAsym'    'fConc'    'fAlpha'

Train LDA, logistic regression (LR) and partial least squares (PLS)

To train LDA, we call the ClassificationDiscriminant.fit method which returns an object of class ClassificationDiscriminant. We then copy LDA coefficients from the respective property of this object to a new variable.

LDA = ClassificationDiscriminant.fit(Xtrain,Ytrain);
LDAcoeffs = LDA.Coeffs(1,2).Linear;

To train LR, we call the GeneralizedLinearModel.fit method returning an object of class GeneralizedLinearModel. We then copy the fitted coefficients from the respective property of this object to a new variable.

glm = GeneralizedLinearModel.fit(Xtrain,Ytrain,'distribution','binomial');
LRcoeffs = glm.Coefficients.Estimate;

To train PLS, we call the plsregress function. It returns the X and Y loadings, the X and Y scores, the regression coefficients (including the intercept), the percentage of variance explained by each component, and a structure with additional statistics such as the PLS weights; we discard the mean-squared error output.

[XplsAxes,YplsAxes,XplsCoord,YplsCoord,PLScoeffs,PLSpctvar,~,PLSstat] = ...
    plsregress(Xtrain,Ytrain); % PLS with the default number of components (4 here)

Estimate classification errors using test data

We use the three models to compute predictions for the test data and compare these predictions with the true class labels in Ytest.

To compute the predictions by LDA, we call the predict method of the discriminant object. The 1st output of this method is a vector of the predicted class labels, and the 2nd output is an N-by-2 matrix of estimated posterior class probabilities for the N observations in the test data and the two classes. Since the 'true' class is second in LDA.ClassNames, we retain only the 2nd column of this matrix. (For binary classification, one of the two scores can be obtained from the other.)

LDA.ClassNames
[YfitLDA,PfitLDA] = predict(LDA,Xtest);
PfitLDA = PfitLDA(:,2);
ans =
     0
     1

To compute the binomial probabilities predicted by LR, we use the predict method of the glm object. We then assign every observation to the 'true' class if the predicted probability is above 0.5 and to the 'false' class otherwise. (The same thresholding is effectively applied by the predict call for LDA.)

PfitLR = predict(glm,Xtest);
YfitLR = PfitLR>0.5;

To compute the predictions by PLS, we multiply Xtest by the vector of PLS coefficients. Since the 1st coefficient is for the intercept, we add a column of ones to Xtest. Then we assign class labels to observations using the same recipe as for LR.

SfitPLS = [ones(size(Xtest,1),1) Xtest]*PLScoeffs;
YfitPLS = SfitPLS>0.5;

The classification error for every model is estimated as the fraction of misclassified observations in the test data.

Ntest = numel(Ytest);
errLDA = sum(YfitLDA~=Ytest)/Ntest
errLR = sum(YfitLR~=Ytest)/Ntest
errPLS = sum(YfitPLS~=Ytest)/Ntest
errLDA =
    0.2155
errLR =
    0.2044
errPLS =
    0.2162
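
As an aside, the scores computed above can also be evaluated with the figure of merit used in the publication, the TPR at fixed FPR. A minimal sketch for the LDA posterior probabilities follows; the same perfcurve call works for PfitLR and SfitPLS.

[fprLDA,tprLDA] = perfcurve(Ytest,PfitLDA,true); % ROC curve for LDA
for f = [0.01 0.02 0.05 0.1 0.2] % FPR working points used in the publication
    fprintf('TPR at FPR <= %4.2f: %6.3f\n',f,max(tprLDA(fprLDA<=f)));
end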

Test the models for equivalence

We run McNemar's test on every pair of models. Although LR offers only a marginal 5% relative error reduction over LDA, this improvement is highly significant. LDA and PLS have comparable errors, and both are outperformed by LR. Note: We use the mcnemary function, which is not included in the official MATLAB distribution. You can code this function yourself with little effort; refer to the book chapter on evaluating predictive performance, or see the sketch after the test results below.

[~,pLRvsLDA] = mcnemary(Ytest,YfitLDA,YfitLR)
pLRvsLDA =
   4.8992e-08
[~,pLRvsPLS] = mcnemary(Ytest,YfitPLS,YfitLR)
pLRvsPLS =
   1.1957e-08
[~,pLDAvsPLS] = mcnemary(Ytest,YfitPLS,YfitLDA)
pLDAvsPLS =
    0.6171
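
A minimal sketch of such a function, saved as mcnemary.m, with the same call signature as above (true labels, predictions of the first model, predictions of the second model); the actual implementation used here may differ in details such as the continuity correction:

function [h,p] = mcnemary(Ytrue,YfitA,YfitB)
% McNemar test for equivalence of two classifiers on the same test set
n01 = sum( YfitA~=Ytrue & YfitB==Ytrue ); % misclassified by A only
n10 = sum( YfitA==Ytrue & YfitB~=Ytrue ); % misclassified by B only
chi2 = (abs(n01-n10)-1)^2/(n01+n10);      % statistic with continuity correction
p = 1 - chi2cdf(chi2,1);                  % p-value from chi-square with 1 dof
h = p < 0.05;                             % reject equivalence at the 5% level
end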

Verify LDA assumptions

For every training observation, we compute the squared Mahalanobis distance to its true class. We then compute the expected quantiles assuming that the squared Mahalanobis distance has a chi-square distribution with 4 degrees of freedom. The empirical quantiles in the QQ-plot deviate upward from the expected values, indicating that the data in Xtrain have heavier-than-normal tails. The assumption of multivariate normality clearly does not hold, and LDA cannot be the optimal classifier for these data.

mah = mahal(LDA,Xtrain,'ClassLabels',Ytrain);
Ntrain = numel(Ytrain);
expQ = chi2inv(((1:Ntrain)-0.5)/Ntrain,D); % expected quantiles
[mah,sorted] = sort(mah); % sorted observed quantiles
figure;
gscatter(expQ,mah,Ytrain(sorted),'bg','s*',10,'off'); % plot by class
legend('0','1','Location','NW');
xlabel('Expected quantile');
ylabel('Observed quantile');
line([0 30],[0 30],'color','k');

Verify LR assumptions

We use the Hosmer-Lemeshow test to estimate the goodness of fit for the LR model. We divide the fitted binomial probabilities into 10 bins with an equal number of observations per bin. We then compute the Hosmer-Lemeshow statistic and the associated p-value assuming a chi-square distribution with 8 degrees of freedom. The p-value is low, suggesting that the optimal separation between the two classes is not linear.

PfitLR = glm.predict(Xtrain);
edges = [0, quantile(PfitLR,0.1:0.1:0.9), Inf]; % 11 bin edges
Nbin = zeros(10,1);
Pbin = zeros(10,1);
Ybin = zeros(10,1);
for n=1:10
    inThisBin = PfitLR>=edges(n) & PfitLR<edges(n+1);
    Nbin(n) = sum(inThisBin);
    Pbin(n) = mean(PfitLR(inThisBin));
    Ybin(n) = mean(Ytrain(inThisBin));
end
HL = sum( Nbin.*(Ybin-Pbin).^2./Pbin./(1-Pbin) ); % Hosmer-Lemeshow statistic
pHL = 1 - chi2cdf(HL,8) % p-value for Hosmer-Lemeshow test
pHL =
   1.2434e-04

Estimate variable importance

As noted above, the DeltaPredictor property of the discriminant object can be used to judge the relative importance of every variable in the model. This property stores the magnitudes of the standardized LDA coefficients (that is, coefficients computed after standardizing variables for every class separately).

VarImpLDA = LDA.DeltaPredictor
VarImpLDA =
    0.7827    0.6374    0.3736    0.2232

For LR, the predictors have been standardized, so we simply look at the magnitudes of the coefficients (excluding the intercept).

VarImpLR = abs(LRcoeffs(2:end))'
VarImpLR =
    1.1968    1.2289    0.5919    0.2923

For PLS, we estimate variable importance as described in the PLS section. The first PLS component explains 29% of variance in the class label, far more than the other three:

YplsAxes
YplsAxes =
   29.1008    7.4754    4.2798    1.0913

The importance of every variable is therefore determined mostly by its contribution to the first PLS component. The first two variables have the largest coefficients (in magnitude) in the first component (first column of XplsAxes):

XplsAxes(:,1)
ans =
  -84.4776
  -72.3538
   32.0900
  -46.1461

Here are the exact estimates of variable importance from PLS:

W = PLSstat.W;
Wtilde2 = bsxfun(@rdivide,W.^2,sum(W.^2,1));
Ctilde2 = YplsAxes.^2/sum(YplsAxes.^2);
VarImpPLS = sqrt(D * sum(bsxfun(@times,Wtilde2,Ctilde2),2) )'
VarImpPLS =
    1.6034    1.0314    0.4045    0.4488
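
In formula form, the code above computes a VIP-type (variable importance in projection) score from the PLS weights w (columns of PLSstat.W) and the entries c of YplsAxes; the notation mirrors the code rather than any particular reference:

\[
\mathrm{VarImpPLS}_j=\sqrt{\,D\sum_{a=1}^{4}\tilde w_{ja}^{2}\,\tilde c_{a}^{2}\,},\qquad
\tilde w_{ja}^{2}=\frac{w_{ja}^{2}}{\sum_{i}w_{ia}^{2}},\qquad
\tilde c_{a}^{2}=\frac{c_{a}^{2}}{\sum_{b}c_{b}^{2}},
\]

where D = 4 is the number of predictors and j indexes the predictors.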

All models agree that the first two variables, fLength and fAsym, are more important for prediction than the other two.

Visualize the classification models

We make a scatter plot of the two most important variables and superimpose the 1st and 2nd PLS axes. We do not plot the LDA or LR axes because they are very close to the 1st PLS axis: the angles between the projection of the 1st PLS axis and the projections of the LDA and LR axes onto the (fLength,fAsym) plane are 3 and 5 degrees, respectively. The angles between the 1st PLS axis and the LDA and LR axes are much larger in the full 4D space, but here we focus on the two most important variables only. To plot an axis, we compute the tangent of the angle between the plotted line and the horizontal axis and multiply it by the abscissa limits to get the corresponding ordinate values.

figure;
h = gscatter(Xtrain(:,1),Xtrain(:,2),Ytrain,[],'s*',[],'off');
Xlims = [-1 2.5]; % abscissa limits for the plotted axes
PLStan = XplsAxes(2,:)./XplsAxes(1,:); % slopes of the PLS axes in this plane
h1 = line(Xlims,PLStan(1)*Xlims,'color','k','LineStyle','-');
h2 = line(Xlims,PLStan(2)*Xlims+2,'color','k','LineStyle','--'); % shifted up for visibility
legend([h(1) h(2) h1 h2],...
    {'Class 0' 'Class 1' '1st PLS axis' '2nd PLS axis'},...
    'Location','NW');
xlabel('fLength');
ylabel('fAsym');
set(h(1),'Color',0.7*ones(1,3));
set(h(2),'Color',0.4*ones(1,3));
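
The angles quoted above can be reproduced from the fitted coefficients. A minimal sketch, assuming LDAcoeffs, LRcoeffs, and XplsAxes from the earlier steps (the first two selected variables are fLength and fAsym):

pls1 = XplsAxes(1:2,1);  % projection of the 1st PLS axis onto (fLength,fAsym)
lda2 = LDAcoeffs(1:2);   % projection of the LDA axis
lr2  = LRcoeffs(2:3);    % projection of the LR axis (skipping the intercept)
angLDA = acosd( abs(pls1'*lda2)/(norm(pls1)*norm(lda2)) ) % angle to LDA axis, deg
angLR  = acosd( abs(pls1'*lr2)/(norm(pls1)*norm(lr2)) )   % angle to LR axis, deg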

We also make a scatter plot of the first two coordinates in the PLS reference frame. The line of optimal separation in this plot coincides with the horizontal axis. Because PLS, unlike LDA and LR, computes more than one linear component, we can explore the data in the transformed coordinates. The usefulness of this exploration would vary from one analysis to another.

figure;
h = gscatter(XplsCoord(:,1),XplsCoord(:,2),Ytrain,[],'s*');
xlabel('1st PLS component');
ylabel('2nd PLS component');
set(h(1),'Color',0.7*ones(1,3));
set(h(2),'Color',0.4*ones(1,3));