Abstract

The textbook literature of principal components analysis (PCA) dates from a period when statistical computing was much less powerful than it is today and the dimensionality of data sets typically processed by PCA was correspondingly much lower. When the formulas in those textbooks involve limiting properties of PCA descriptors, the limit involved is usually the indefinite increase of sample size for a fixed roster of variables. But contemporary applications of PCA in organismal systems biology, particularly in geometric morphometrics (GMM), generally involve much greater counts of variables. The way one might expect pure noise to degrade the biometric signal in this more contemporary context is described by a different mathematical literature concerned with the situation where the count of variables itself increases while remaining proportional to the count of specimens. The founders of this literature established a result of startling simplicity. Consider steadily larger and larger data sets consisting of completely uncorrelated standardized Gaussians (mean zero, variance 1) such that the ratio of variables to cases (the so-called "p/n ratio") is fixed at a value $y < 1$. Then the largest eigenvalue of their covariance matrix tends to $(1+\sqrt{y})^2$, the smallest tends to $(1-\sqrt{y})^2$, and their ratio tends to the limiting value $\left((1+\sqrt{y})/(1-\sqrt{y})\right)^2$, whereas in the uncorrelated model both of these eigenvalues and also their ratio should be just 1.0. For $y = 1/4$, not an atypical value for GMM data sets, this ratio is 9; for $y = 1/2$, which is still not atypical, it is 34.
These extrema and ratios, easily confirmed in simulations of realistic size and consistent with real GMM findings in typical applied settings, bear severe negative implications for any technique that involves inverting a covariance structure on shape coordinates, including multiple regression on shape, discriminant analysis by shape, canonical variates analysis of shape, covariance distance analysis from shape, and maximum-likelihood estimation of shape distributions that are not constrained by strong prior models. The theorem also suggests that we should use extreme caution whenever considering a biological interpretation of any Partial Least Squares analysis involving large numbers of landmarks or semilandmarks. I illuminate these concerns with the aid of one simulation, two explicit reanalyses of previously published data, and several little sermons.
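The limiting extrema are easy to check numerically. The following is a minimal sketch, not the simulation reported in the paper: it assumes NumPy, and the function names `mp_edges` and `simulated_edges`, the sample size n = 1000, and the random seed are all illustrative choices. It compares the limiting values $(1 \pm \sqrt{y})^2$ with the extreme eigenvalues of the sample covariance matrix of uncorrelated standard Gaussians at a fixed p/n ratio.

```python
import numpy as np


def mp_edges(y):
    """Limiting largest eigenvalue, smallest eigenvalue, and their ratio
    for uncorrelated unit-variance data with p/n = y, assuming 0 < y < 1."""
    root = np.sqrt(y)
    largest = (1.0 + root) ** 2
    smallest = (1.0 - root) ** 2
    return largest, smallest, largest / smallest


def simulated_edges(n, p, seed=0):
    """Extreme eigenvalues of the sample covariance matrix of an n-by-p
    matrix of independent standard Gaussians (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, p))    # n cases (rows), p variables (columns)
    cov = np.cov(x, rowvar=False)      # p-by-p sample covariance matrix
    eig = np.linalg.eigvalsh(cov)      # eigenvalues in ascending order
    return eig[-1], eig[0], eig[-1] / eig[0]


# Illustrative sample size; finite samples only approximate the limit.
n = 1000
for y in (0.25, 0.5):
    p = int(y * n)
    print("y =", y)
    print("  limit     (largest, smallest, ratio):", mp_edges(y))
    print("  simulated (largest, smallest, ratio):", simulated_edges(n, p))
```

At finite n the simulated extremes fall slightly inside the limiting ones, so the simulated ratio should be read as an approximation that tightens as n and p grow together.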

  • Publication date: 2017-12