Principal component analysis
Contents

- Load data
- Center the data
- Perform PCA
- Plot the explained variance
- Make a scree plot
- Find non-trivial eigenvalues
- Partition the data into 10 folds for cross-validation
- Estimate reconstruction error by cross-validation
Load data
We use the ionosphere data from the UCI repository http://archive.ics.uci.edu/ml/datasets/Ionosphere. These data are primarily used for binary classification. In this example, we ignore class labels and apply PCA. The data have 351 observations and 34 variables.
load ionosphere
[N,D] = size(X)
N = 351
D = 34
Center the data
The data are readings from a phased array of 16 high-frequency antennas. All 34 variables are produced in a similar fashion, are measured in the same units, and have roughly the same variance. One exception is the 2nd variable, whose variance is exactly zero. We therefore choose to apply covariance PCA. To ensure correct computation of the reconstruction error in later steps, we center the entire data set by subtracting the mean of every column from the respective column of the input matrix.
mu = mean(X); X = bsxfun(@minus,X,mu);   % subtract each column's mean
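Two quick sanity checks (our addition, not part of the original script): the variance of the 2nd variable is exactly zero, and the column means of the centered matrix are zero up to round-off.

var(X(:,2))         % expect exactly 0
max(abs(mean(X)))   % expect a value on the order of machine epsilon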
Perform PCA
We execute the pca function available from the Statistics Toolbox. This function returns:
- Loadings (matrix V, in our notation)
- Scores (matrix X*V, in our notation)
- Eigenvalues (the main diagonal of matrix Λ, in our notation)
- Hotelling's T-squared statistic for each observation in X (not discussed in the book)
- Percentage of variance explained by the respective component
We need only the eigenvalues and the fraction of explained variance. The fraction of explained variance can easily be obtained from the eigenvalues, as the sketch below shows. We thus replace the first two output arguments with the tilde symbol and do not ask for the 4th and 5th arguments.
[~,~,lambda] = pca(X);
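For reference, a minimal sketch of the full call (the names Vfull, scoreFull, lambdaFull, tsqFull and explainedFull are ours, chosen to avoid clobbering the variables above):

[Vfull,scoreFull,lambdaFull,tsqFull,explainedFull] = pca(X);
max(abs(explainedFull - 100*lambdaFull/sum(lambdaFull)))   % expect ~0: explained is the normalized eigenvalues, in percent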
Plot the explained variance
We plot the variance for each component normalized to the overall variance versus the component index. We then plot the cumulative variance versus the number of components.
figure;
plot(lambda/sum(lambda),'bs','MarkerSize',8);
hold on;
plot(cumsum(lambda)/sum(lambda),'r*','MarkerSize',8);
hold off;
grid on;
xlim([0 D+1]);
line([0 D+1],[0.7 0.7],'Color','k','LineStyle','--');
line([0 D+1],[0.9 0.9],'Color','k','LineStyle','--');
legend('Individual','Cumulative','Location','E');
xlabel('Component index');
ylabel('Fraction of explained variance');
[Figure: individual and cumulative fraction of explained variance vs. component index, with dashed reference lines at 0.7 and 0.9]
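To read the reference levels off numerically rather than from the plot (our addition; m70 and m90 are hypothetical names):

m70 = find(cumsum(lambda)/sum(lambda) >= 0.7, 1)   % smallest number of components explaining 70% of the variance
m90 = find(cumsum(lambda)/sum(lambda) >= 0.9, 1)   % smallest number of components explaining 90% of the variance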
Make a scree plot
This reproduces the figure in the book.
figure;
plot(lambda,'bs--','MarkerSize',6);
hold on;
plot(-diff(lambda),'r*-','MarkerSize',6);
hold off;
grid on;
xlabel('Component index');
ylabel('Variance');
legend('\lambda_m','\lambda_m-\lambda_{m+1}','Location','NE');
xlim([0 D+1]);
[Figure: scree plot showing \lambda_m and the successive differences \lambda_m-\lambda_{m+1}]
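A crude numeric companion to the scree plot (our addition; mGap is a hypothetical name): the index of the largest drop between successive eigenvalues, one common "elbow" candidate.

[~,mGap] = max(-diff(lambda))   % component index preceding the largest eigenvalue gap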
Find non-trivial eigenvalues
We generate 1000 replicas of the ionosphere data by independently shuffling every variable (column) of the input matrix at random. We store the eigenvalues from every run on the shuffled data in the lambdaShuffled matrix. Although the conclusions should not depend in any significant way on how exactly the data are shuffled, we set the random number generator seed for reproducibility by executing the rng function.
R = 1000;
lambdaShuffled = zeros(D,R);
Xperm = zeros(N,D);
rng(1);
for r = 1:R
    for d = 1:D
        Xperm(:,d) = X(randperm(N),d);   % permute each column independently
    end
    [~,~,lambdaShuffled(:,r)] = pca(Xperm);
end
We estimate a p-value for each eigenvalue observed in the ionosphere data as the fraction of replicas in which the eigenvalue at the same position exceeds the observed one. The first five p-values equal zero, indicating the non-trivial nature of the first five principal components.
pval = sum(bsxfun(@gt, lambdaShuffled, lambda), 2)/R; pval(1:10)'
ans = 0 0 0 0 0 1 1 1 1 1
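To see how far an observed eigenvalue sits from its null distribution (our addition), we can compare it with quantiles of the shuffled eigenvalues at the same position, here for the 6th component:

[lambda(6) prctile(lambdaShuffled(6,:),[50 95])]   % observed value vs. the null median and 95th percentile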
Partition the data into 10 folds for cross-validation
The cv object returned by cvpartition stores the data partition. We use the training and test methods of this object in the next step to access the indices of observations in the respective folds.
K = 10;
cv = cvpartition(N,'kfold',K);
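A quick check of the partition (our addition): with N = 351 and K = 10, every test fold should contain 35 or 36 observations.

arrayfun(@(k) sum(test(cv,k)), 1:K)   % test-fold sizes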
Estimate reconstruction error by cross-validation
Increasing the number of principal components from 1 to the number of variables, we compute the reconstruction error. For an assumed number of principal components, m, we copy the first m loadings from matrix V to matrix Vreduced. We then compute the associated scores, X*Vreduced, and transform them back to the original variables using X*Vreduced*Vreduced' to obtain an estimate, Xhat, of the input matrix X. We set the matrix of residuals, residual, to the difference of the reconstructed and original data Xhat-X. We then compute two kinds of reconstruction error:
- meanErr, the root-mean-square of all residuals
- maxErr, the maximal magnitude of a single residual value
To compute meanErr, we divide the Frobenius norm of the residual matrix by the square root of N*D, the total number of elements in this matrix; this yields the root-mean-square residual. To compute maxErr, we take the maximal magnitude over all residual values.
M = numel(lambda);
meanErr = zeros(1,M);
maxErr = zeros(1,M);
Xhat = zeros(N,D);                                       % reconstructed data
for m = 1:M
    for k = 1:K
        itrain = training(cv,k);
        itest = test(cv,k);
        V = pca(X(itrain,:));                            % loadings estimated on the training fold
        Vreduced = V(:,1:m);                             % keep the first m loadings
        Xhat(itest,:) = X(itest,:)*(Vreduced*Vreduced'); % project and reconstruct the test fold
    end
    residual = Xhat - X;
    meanErr(m) = norm(residual,'fro')/sqrt(N*D);
    maxErr(m) = max(abs(residual(:)));
end
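To summarize the curves numerically (our addition; mBest is a hypothetical name), one could pick the number of components that minimizes the cross-validated mean error:

[~,mBest] = min(meanErr)   % component count with the smallest RMS reconstruction error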
Reproduce the figure in the book.
figure;
plot(maxErr,'bs','MarkerSize',8);
hold on;
plot(meanErr,'r*','MarkerSize',8);
hold off;
legend('max error','mean error','Location','NE');
grid on;
xlabel('Number of principal components');
ylabel('Reconstruction error');
[Figure: cross-validated max and mean reconstruction error vs. number of principal components]