Analysis of data with class imbalance
Contents

  Load data
  GentleBoost
  RUSBoost
  ROC curves
  ROC with confidence bounds
  Plot the ROC curves
Load data
Follow the paper "Classifying extremely imbalanced data sets" by Britsch, Gagunashvili and Schmelling, arXiv:1011.6224v1. Use only the 10 integer variables of the covtype data from the UCI repository. The data are split evenly across the training and test sets.
% Each row of Ztrain/Ztest holds 10 integer features and the class label
load CovtypeTrain_10IntVar.mat;
Xtrain = Ztrain(:,1:end-1);
Ytrain = Ztrain(:,end);
clear Ztrain;
load CovtypeTest_10IntVar.mat;
Xtest = Ztest(:,1:end-1);
Ytest = Ztest(:,end);
clear Ztest;
How many observations of each class do we have?
tabulate(Ytrain) % classes for training data
  Value    Count   Percent
      0   289132    99.53%
      1     1374     0.47%
tabulate(Ytest) % classes for test data
  Value    Count   Percent
      0   289133    99.53%
      1     1373     0.47%
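Class 1 makes up less than half a percent of either set. As a quick reference (not part of the original workflow), the majority-to-minority imbalance ratio can be computed directly from the labels:

% Imbalance ratio in the training data: roughly 210 background
% observations per signal observation
imbalanceRatio = sum(Ytrain==0)/sum(Ytrain==1)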
GentleBoost
Grow an ensemble of decision trees by GentleBoost. Set the minimal leaf size to roughly one half of the size of the minority class 1. Use the popular value of 0.1 for the learning rate. Balance the two classes by enforcing equal prior probabilities.
minleafForGentle = round(sum(Ytrain==1)/2)
minleafForGentle = 687
t = ClassificationTree.template('minleaf',minleafForGentle);
gentle = fitensemble(Xtrain,Ytrain,'GentleBoost',100,t,...
    'prior','uniform','LearnRate',0.1);
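As an optional diagnostic (not part of the paper's procedure), the cumulative test-set error shows how the ensemble improves as trees are added:

% Classification error on the test data as a function of ensemble size
figure;
plot(loss(gentle,Xtest,Ytest,'mode','cumulative'));
xlabel('Number of trees');
ylabel('Test classification error');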
RUSBoost
Grow an ensemble of decision trees by RUSBoost. The training set for every tree is twice the size of class 1 in the training data. Reduce the minimal leaf size in the same proportion to grow trees with roughly the same number of leaves as the GentleBoost trees.
minleafForRUS = round(minleafForGentle/numel(Ytrain)*2*sum(Ytrain==1))
minleafForRUS = 6
t = ClassificationTree.template('minleaf',minleafForRUS);
rus = fitensemble(Xtrain,Ytrain,'RUSBoost',100,t);
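RUSBoost balances the classes by random undersampling: by default every class is sampled down to the size of the smallest class, so each tree trains on roughly twice the minority class size. This can be checked directly:

% Approximate per-tree training set size under the default undersampling
nPerTree = 2*sum(Ytrain==1)   % 2*1374 = 2748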
ROC curves
Compute predicted classification scores for the test data. The returned array of scores has size N-by-2 for N observations in the test data and 2 classes. Class 1 (signal) is the second class, which is why we take the second column of the score array.
Focus on the range [0, 0.0025] for the FPR (false positive rate, or background acceptance). Divide this range into 20 equal intervals and compute the TPR (true positive rate, or signal efficiency) at the 21 interval endpoints. Compare with the best curve in Fig. 6 of the paper.
fprVals = linspace(0,2.5e-3,21);
[~,Sgentle] = predict(gentle,Xtest);
SgentleSignal = Sgentle(:,2);
[fprGentle,tprGentle,threGentle] = ...
    perfcurve(Ytest,SgentleSignal,1,'xvals',fprVals);
[~,Srus] = predict(rus,Xtest);
SrusSignal = Srus(:,2);
[fprRUS,tprRUS,threRUS] = perfcurve(Ytest,SrusSignal,1,'xvals',fprVals);
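To summarize each classifier over the restricted FPR range with a single number, the partial area under the curve can be requested as the fourth output of perfcurve; this is an optional sketch, and the values are not reproduced here:

% Partial AUC over FPR in [0, 0.0025] for both ensembles
[~,~,~,aucGentle] = perfcurve(Ytest,SgentleSignal,1,'xvals',fprVals);
[~,~,~,aucRUS] = perfcurve(Ytest,SrusSignal,1,'xvals',fprVals);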
% Points read off the best ROC curve in Fig. 6 of the paper
fprPaper = [5 7.5 17 95]*1e-5;
tprPaper = [0.3 0.42 0.64 0.88];
ROC with confidence bounds
Use 1000 bootstrap replicas to compute confidence bounds for the ROC curve obtained by RUSBoost. Instead of bootstrapping all of the test data, keep only observations with large scores; relax the threshold below the minimal threshold used for the ROC curve so that the kept sample remains representative.
sum(SrusSignal>min(threRUS)) % number of observations used for the ROC curve
ans = 2029
keep = SrusSignal>3;
sum(keep) % number of observations used for bootstrapping
ans = 4814
SrusSignalKeep = SrusSignal(keep);
YtestKeep = Ytest(keep);
N0 = sum(Ytest==0);   % class sizes in the full test set
N1 = sum(Ytest==1);
[fp,tp] = perfcurve(YtestKeep,SrusSignalKeep,1,'xcrit','FP','ycrit','TP',...
    'xvals',fprVals*N0,'nboot',1000);
fpr = fp/N0;
tpr = tp/N1;
Plot the ROC curves
figure;
plot(fprPaper,tprPaper,'kd','MarkerSize',8);
hold on;
plot(fprGentle,tprGentle,'k--');
errorbar(fpr,tpr(:,1),tpr(:,1)-tpr(:,2),tpr(:,3)-tpr(:,1),'k*');
xlim([0 0.003]);
ylim([0.2 1]);
fprTicks = cellstr(num2str(fprVals'));
set(gca,'xticklabel',[fprTicks(1:4:21)' {'0.003'}]);
hold off;
legend('Best in paper','GentleBoost','RUSBoost with bounds','Location','SE');
xlabel('False Positive Rate');
ylabel('True Positive Rate');
