Apply kernel smoothing to a ROC curve

Section 10.2.1

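We use the ionosphere data from the UCI Machine Learning Repository, which is included in MATLAB distributions. The predictors are stored in X and the class labels in Y, with 'g' for good and 'b' for bad radar returns.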

load ionosphere;

For reproducibility, set the random number generator seed for data partitioning.

rng(1);

Cross-validate a pseudo-linear discriminant model.

cvplda = ClassificationDiscriminant.fit(X,Y,'DiscrimType','pseudoLinear','crossval','on');

Boost decision stumps by AdaBoostM1 and cross-validate.

cvada = fitensemble(X,Y,'AdaBoostM1',200,'Tree','crossval','on');

Compute scores for out-of-fold data.

[~,SfitLDA] = kfoldPredict(cvplda);
[~,SfitAda] = kfoldPredict(cvada);

Obtain the false positive rate (FPR) and true positive rate (TPR), choosing good radar returns ('g') as the positive class.

[fprLDA,tprLDA] = perfcurve(Y,SfitLDA(:,2),'g');
[fprAda,tprAda] = perfcurve(Y,SfitAda(:,2),'g');
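
Each point on an empirical ROC curve corresponds to one score threshold. As a minimal sketch of what a single point means, the toy example below computes FPR and TPR by hand at one threshold. The labels, scores, and threshold are made up for illustration and are not taken from the ionosphere data; the convention that a score at or above the threshold counts as a positive call is assumed.

```matlab
% Toy labels and scores, hypothetical values for illustration only.
labels = {'g';'g';'b';'g';'b'};
scores = [0.9; 0.8; 0.7; 0.4; 0.2];

isPos = strcmp(labels,'g');   % 'g' is the positive class
t = 0.5;                      % one score threshold

% Fraction of positives called positive (TPR) and
% fraction of negatives called positive (FPR) at threshold t.
tpr = sum(scores(isPos)  >= t)/sum(isPos);
fpr = sum(scores(~isPos) >= t)/sum(~isPos);
```

Sweeping the threshold over all observed scores and collecting the (fpr, tpr) pairs traces out the empirical curve.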

Plot the ROC curves for the two classifiers.

figure;
plot(fprLDA,tprLDA,'b--');
hold on;
plot(fprAda,tprAda,'r-.');
line([0 1],[0 1],'color','c');
xlabel('False positive rate');
ylabel('True positive rate');
legend('Pseudo LDA','Boosted stumps','Fair coin toss','Location','SE');

Section 10.2.5

Estimate ROC by kernel smoothing

Identify good and bad radar returns. Count the number of observations in each class.

isGood = strcmp(Y,'g');
isBad = strcmp(Y,'b');
Ngood = sum(isGood);
Nbad = sum(isBad);

Obtain quantiles for kernel bandwidth estimation.

Qgood = quantile(SfitLDA(isGood,2),[0.25 0.75]);
Qbad = quantile(SfitLDA(isBad,2),[0.25 0.75]);

Estimate kernel bandwidths for the two classes using the normal approximation.

Hgood = 1.58*(Qgood(2)-Qgood(1))/1.34/Ngood^(1/3)
Hbad = 1.58*(Qbad(2)-Qbad(1))/1.34/Nbad^(1/3)
Hgood =
    0.0070
Hbad =
    0.1849
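
The two assignments above apply the same rule to each class. Written out, with Q_{0.25} and Q_{0.75} denoting the sample quartiles of the scores in a class and N the number of observations in that class (symbols introduced here for illustration), the rule reads:

```latex
\hat{\sigma} = \frac{Q_{0.75} - Q_{0.25}}{1.34},
\qquad
h = 1.58\,\hat{\sigma}\,N^{-1/3}
```

The division by 1.34 turns the interquartile range into an estimate of the standard deviation under the assumption that the scores are normally distributed, since the interquartile range of a normal distribution is about 1.35 standard deviations. This is where the normal approximation enters the bandwidth estimate.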

Estimate the score CDFs for the two classes. The estimates that use the bandwidths obtained above are considered naive because those bandwidths rely on the normal approximation, while the score distributions are apparently not normal. Setting both kernel bandwidths to 0.02 gives a much better fit.

thre = 0:0.001:1;
Fnaive = ksdensity(SfitLDA(isBad,2),thre,'width',Hbad,'function','cdf');
F = ksdensity(SfitLDA(isBad,2),thre,'width',0.02,'function','cdf');
Gnaive = ksdensity(SfitLDA(isGood,2),thre,'width',Hgood,'function','cdf');
G = ksdensity(SfitLDA(isGood,2),thre,'width',0.02,'function','cdf');
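
To make the smoothing step less opaque, here is a minimal sketch of a kernel CDF estimate computed by hand, assuming a normal kernel (believed to be the ksdensity default): the estimate at a point is the average of normal CDFs centered at the observed scores. The synthetic scores, bandwidth, and evaluation grid are hypothetical, and the two results are expected to agree only up to small numerical differences.

```matlab
% Synthetic scores, hypothetical values for illustration only.
rng(0);
s = [randn(40,1); 2 + 0.5*randn(20,1)];
h = 0.3;                     % kernel bandwidth
t = linspace(-4,4,9);        % evaluation points (row vector)

% Kernel CDF estimate: average of normal CDFs centered at each score.
z = bsxfun(@minus,t,s)/h;    % numel(s)-by-numel(t) matrix of (t_j - s_i)/h
Fmanual = mean(normcdf(z),1);

% Same estimate from ksdensity, called as in the code above.
Fks = ksdensity(s,t,'width',h,'function','cdf');
maxDiff = max(abs(Fmanual - Fks(:)'));
```

With F estimated from the bad returns and G from the good returns, a smoothed ROC curve follows by plotting 1-G against 1-F: at threshold t, 1-F(t) is the fraction of bad returns scored above t (FPR) and 1-G(t) is the fraction of good returns scored above t (TPR).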

figure;
plot(fprLDA,tprLDA);
hold on;
plot(1-Fnaive,1-Gnaive,'r--');
plot(1-F,1-G,'c-.');
legend('Empirical','Naive bandwidth','Optimal bandwidth','Location','SE');
xlabel('False positive rate');
ylabel('True positive rate');

Apply kernel smoothing after a score transform

Recompute the classification scores after applying an inverse logit transformation. Select the second column of SfitLDA, which corresponds to good radar returns (the 'g' class).

cvplda.ScoreTransform = 'invlogit';
[~,SfitLDA] = kfoldPredict(cvplda);
SfitLDA = SfitLDA(:,2);
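
As a rough illustration of what the transform does to the scores, the sketch below applies log(p/(1-p)) to a few posterior probabilities. This is assumed to be the function behind the 'invlogit' value of ScoreTransform; the probabilities are made up for illustration.

```matlab
% Hypothetical posterior probabilities for the 'g' class.
p = [0.1 0.5 0.9];

% Assumed definition of the 'invlogit' score transform.
s = log(p./(1-p));

% A probability of 0.5 maps to a score of 0; probabilities below 0.5
% give negative scores and probabilities above 0.5 give positive ones.
```

Under this assumption the transformed scores are no longer confined to the interval from 0 to 1, and a positive score corresponds to a posterior probability above 0.5 for the 'g' class.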

Compute an empirical ROC curve (FPR and TPR) and score thresholds.

[fprLDA,tprLDA,threLDA] = perfcurve(Y,SfitLDA,'g');

Obtain quantiles for kernel bandwidth estimation.

Qgood = quantile(SfitLDA(isGood),[0.25 0.75]);
Qbad = quantile(SfitLDA(isBad),[0.25 0.75]);

Estimate optimal kernel bandwidths for the two classes, using the same normal approximation as before.

Hgood = 1.58*(Qgood(2)-Qgood(1))/1.34/Ngood^(1/3)
Hbad = 1.58*(Qbad(2)-Qbad(1))/1.34/Nbad^(1/3)
Hgood =
    0.4112
Hbad =
    1.6936

Compute the smoothed true negative rate (TNR) and false negative rate (FNR) at 1000 points uniformly spaced from the minimal to the maximal score.

thre = linspace(min(SfitLDA),max(SfitLDA),1000);
F = ksdensity(SfitLDA(isBad),thre,'width',Hbad,'function','cdf');   % TNR
G = ksdensity(SfitLDA(isGood),thre,'width',Hgood,'function','cdf'); % FNR

Plot the empirical and smoothed FPR as functions of the score threshold.

figure;
plot(threLDA,fprLDA);
hold on;
plot(thre,1-F,'r--');
legend('Empirical','Smoothed','Location','SW');
xlabel('Threshold');
ylabel('False positive rate');

Plot the empirical and smoothed TPR as functions of the score threshold.

figure;
plot(threLDA,tprLDA);
hold on;
plot(thre,1-G,'r--');
legend('Empirical','Smoothed','Location','SW');
xlabel('Threshold');
ylabel('True positive rate');

Plot the empirical and smoothed ROC curves. Observe that the agreement is far from ideal for low FPR and TPR values, where the classification score is positive.

figure;
plot(fprLDA,tprLDA);
hold on;
plot(1-F,1-G,'r--');
legend('Empirical','Smoothed','Location','SE');
xlabel('False positive rate');
ylabel('True positive rate');
xlim([-0.05 0.65]);
ylim([0 1.1]);
grid on;

Focus on the low FPR and TPR region (score is above 0)

keep = SfitLDA>0;
SfitLDA_2 = SfitLDA(keep);
Y_2 = Y(keep);
isGood_2 = strcmp(Y_2,'g');
isBad_2 = strcmp(Y_2,'b');
Ngood_2 = sum(isGood_2);
Nbad_2 = sum(isBad_2);

keep = threLDA>0;
threLDA_2 = threLDA(keep);
fprLDA_2 = fprLDA(keep);
tprLDA_2 = tprLDA(keep);

Obtain quantiles for kernel bandwidth estimation.

Qgood = quantile(SfitLDA_2(isGood_2),[0.25 0.75]);
Qbad = quantile(SfitLDA_2(isBad_2),[0.25 0.75]);

Estimate kernel bandwidths for the two classes.

Hgood_2 = 1.58*(Qgood(2)-Qgood(1))/1.34/Ngood_2^(1/3)
Hbad_2 = 1.58*(Qbad(2)-Qbad(1))/1.34/Nbad_2^(1/3)
Hgood_2 =
    0.4062
Hbad_2 =
    0.7308

Compute the smoothed TNR and FNR, conditional on a positive score, at 1000 points uniformly spaced from the minimal to the maximal retained score.

thre = linspace(min(SfitLDA_2),max(SfitLDA_2),1000);
F_2 = ksdensity(SfitLDA_2(isBad_2),thre,'width',Hbad_2,'function','cdf');
G_2 = ksdensity(SfitLDA_2(isGood_2),thre,'width',Hgood_2,'function','cdf');
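
Because F_2 and G_2 are estimated from the retained observations only, they are CDFs conditional on a positive score. The empirical rates are normalized to all observations in a class, so the conditional tail probabilities have to be rescaled before the two can be compared, which is the role of the factors Nbad_2/Nbad and Ngood_2/Ngood in the plotting code. A sketch of the reasoning, with s denoting the transformed score and t > 0 a threshold (notation introduced here for illustration):

```latex
\mathrm{FPR}(t) = P(s \ge t \mid \text{bad})
  = P(s > 0 \mid \text{bad})\, P(s \ge t \mid \text{bad},\, s > 0)
  \approx \frac{N_{\text{bad},2}}{N_{\text{bad}}}\,\bigl(1 - F_2(t)\bigr)

\mathrm{TPR}(t) = P(s \ge t \mid \text{good})
  = P(s > 0 \mid \text{good})\, P(s \ge t \mid \text{good},\, s > 0)
  \approx \frac{N_{\text{good},2}}{N_{\text{good}}}\,\bigl(1 - G_2(t)\bigr)
```

Here the fraction of each class with a positive score is estimated by the corresponding ratio of counts.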

Plot the empirical and smoothed FPR as functions of the score threshold.

figure;
plot(threLDA_2,fprLDA_2);
hold on;
plot(thre,Nbad_2/Nbad*(1-F_2),'r--');
legend('Empirical','Smoothed','Location','SW');
xlabel('Threshold');
ylabel('False positive rate');

Plot the empirical and smoothed TPR as functions of the score threshold.

figure;
plot(threLDA_2,tprLDA_2);
hold on;
plot(thre,Ngood_2/Ngood*(1-G_2),'r--');
legend('Empirical','Smoothed','Location','SW');
xlabel('Threshold');
ylabel('True positive rate');

Plot the empirical and smoothed ROC curves for low values of FPR and TPR. Observe that the agreement is much better than in the previous ROC plot.

figure;
plot(fprLDA_2,tprLDA_2);
hold on;
plot(Nbad_2/Nbad*(1-F_2),Ngood_2/Ngood*(1-G_2),'r--');
legend('Empirical','Smoothed','Location','SE');
xlabel('False positive rate');
ylabel('True positive rate');