Analysis of data with class imbalance

Contents

- Load data
- GentleBoost
- RUSBoost
- ROC curves
- ROC with confidence bounds
- Plot the ROC curves

Load data

Follow the paper "Classifying extremely imbalanced data sets" by Britsch, Gagunashvili and Schmelling, arXiv:1011.6224v1. Use only the 10 integer variables of the covtype data from the UCI repository. The data are split evenly across training and test sets.

load CovtypeTrain_10IntVar.mat;   % loads Ztrain
Xtrain = Ztrain(:,1:end-1);       % 10 integer predictors
Ytrain = Ztrain(:,end);           % class labels (0 or 1)
clear Ztrain;
load CovtypeTest_10IntVar.mat;    % loads Ztest
Xtest = Ztest(:,1:end-1);
Ytest = Ztest(:,end);
clear Ztest;

How many observations of each class do we have?

tabulate(Ytrain) % classes for training data
  Value    Count   Percent
      0    289132     99.53%
      1     1374      0.47%
tabulate(Ytest) % classes for test data
  Value    Count   Percent
      0    289133     99.53%
      1     1373      0.47%
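
Both sets are extremely imbalanced, with roughly 210 observations of background class 0 per observation of signal class 1. A quick check (a minimal sketch; the variable names are illustrative, not part of the original script):

imbalanceTrain = sum(Ytrain==0)/sum(Ytrain==1) % roughly 210
imbalanceTest = sum(Ytest==0)/sum(Ytest==1)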

GentleBoost

Grow an ensemble of decision trees by GentleBoost. Set the minimal leaf size to roughly one half the size of the minority class 1. Use a popular value of 0.1 for the learning rate. Balance the two classes by enforcing equal prior probabilities.

minleafForGentle = round(sum(Ytrain==1)/2)
minleafForGentle =
   687
t = ClassificationTree.template('minleaf',minleafForGentle);
gentle = fitensemble(Xtrain,Ytrain,'GentleBoost',100,t,...
    'prior','uniform','LearnRate',.1);

RUSBoost

Grow an ensemble of decision trees by RUSBoost. The training set for every tree holds twice as many observations as class 1 in the training data: all of class 1 plus an equal-size random sample of class 0. Reduce the minimal leaf size in the same proportion to produce trees with roughly the same number of leaves as for GentleBoost.

% Scale minleaf by the ratio of the per-tree sample size, 2*sum(Ytrain==1),
% to the full training set size
minleafForRUS = round(minleafForGentle/numel(Ytrain)*2*sum(Ytrain==1))
minleafForRUS =
     6
t = ClassificationTree.template('minleaf',minleafForRUS);
rus = fitensemble(Xtrain,Ytrain,'RUSBoost',100,t);
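
Before comparing ROC curves, it can help to verify that 100 boosting iterations are enough for both ensembles. This is a sketch, not part of the original analysis, that plots the cumulative classification loss on the test data:

figure;
plot(loss(gentle,Xtest,Ytest,'mode','cumulative'),'k--'); % test error after 1..100 trees
hold on;
plot(loss(rus,Xtest,Ytest,'mode','cumulative'),'k-');
hold off;
xlabel('Number of trees');
ylabel('Test classification error');
legend('GentleBoost','RUSBoost','Location','NE');

Keep in mind that plain classification error is dominated by the majority class here, so this plot serves only as a rough convergence check, not as a measure of quality on the signal class.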

ROC curves

Compute predicted classification scores for the test data. The returned array of scores is of size N-by-2 for N observations in the test data and 2 classes. Class 1 (signal) is the second class, which is why we pick the second column from the array of scores.

Focus on the range [0, 0.0025] for the FPR (false positive rate, or background acceptance). Divide this range into 20 equal intervals and compute the TPR (true positive rate, or signal efficiency) at the 21 interval endpoints. Compare with the best curve in Fig. 6 of the paper.

fprVals = linspace(0,2.5e-3,21);
[~,Sgentle] = predict(gentle,Xtest);
SgentleSignal = Sgentle(:,2);
[fprGentle,tprGentle,threGentle] = ...
    perfcurve(Ytest,SgentleSignal,1,'xvals',fprVals);
[~,Srus] = predict(rus,Xtest);
SrusSignal = Srus(:,2);
[fprRUS,tprRUS,threRUS] = perfcurve(Ytest,SrusSignal,1,'xvals',fprVals);
fprPaper = [5 7.5 17 95]*1e-5;
tprPaper = [0.3 0.42 0.64 0.88];
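
For a single-number summary, perfcurve also returns the area under the curve as its fourth output. A sketch computing the full-range AUC for both ensembles (the aucGentle and aucRUS names are illustrative; this is not restricted to the [0, 0.0025] window studied above):

[~,~,~,aucGentle] = perfcurve(Ytest,SgentleSignal,1) % full-range AUC, GentleBoost
[~,~,~,aucRUS] = perfcurve(Ytest,SrusSignal,1)       % full-range AUC, RUSBoost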

ROC with confidence bounds

Use 1000 bootstrap replicas to compute confidence bounds for the ROC obtained by RUSBoost. Bootstrapping all 290506 test observations would be slow, so relax the minimal threshold used for the ROC curve and keep only observations scoring above it; this smaller sample still covers the plotted FPR range. Because the class totals in the kept subset differ from those in the full test set, bootstrap the raw FP and TP counts and divide by the full-test totals N0 and N1 to recover rates on the correct scale.

sum(SrusSignal>min(threRUS)) % number of observations used for the ROC curve
ans =
        2029
keep = SrusSignal>3; % threshold relaxed below min(threRUS) to keep a larger sample
sum(keep) % number of observations used for bootstrapping
ans =
        4814
SrusSignalKeep = SrusSignal(keep);
YtestKeep = Ytest(keep);
N0 = sum(Ytest==0); % background total in the full test set
N1 = sum(Ytest==1); % signal total in the full test set
[fp,tp] = perfcurve(YtestKeep,SrusSignalKeep,1,'xcrit','FP','ycrit','TP',...
    'xvals',fprVals*N0,'nboot',1000);
fpr = fp/N0; % convert bootstrapped counts back to rates
tpr = tp/N1; % M-by-3: point estimate with lower and upper confidence bounds

Plot the ROC curves

figure;
plot(fprPaper,tprPaper,'kd','MarkerSize',8);
hold on;
plot(fprGentle,tprGentle,'k--');
errorbar(fpr,tpr(:,1),tpr(:,1)-tpr(:,2),tpr(:,3)-tpr(:,1),'k*');
xlim([0 0.003]);
ylim([0.2 1]);
fprTicks = cellstr(num2str(fprVals'));
set(gca,'xticklabel',[fprTicks(1:4:21)' {'0.003'}]);
hold off;
legend('Best in paper','GentleBoost','RUSBoost with bounds','Location','SE');
xlabel('False Positive Rate');
ylabel('True Positive Rate');
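
For a numerical comparison with the operating points read off the paper, one can interpolate the RUSBoost curve at the paper's FPR values (a sketch using linear interpolation on the 21-point grid computed above; the variable name is illustrative):

tprRUSAtPaper = interp1(fprRUS,tprRUS,fprPaper) % signal efficiency at the paper's FPR values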