IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 59, NO. 11, NOVEMBER 2013
A Compressed PCA Subspace Method for Anomaly Detection in High-Dimensional Data

Qi Ding and Eric D. Kolaczyk, Senior Member, IEEE

Abstract: Random projection is widely used as a method of dimension reduction. In recent years, its combination with standard techniques of regression and classification has been explored. Here, we examine its use for anomaly detection in high-dimensional settings, in conjunction with principal component analysis (PCA) and corresponding subspace detection methods. We assume a so-called spiked covariance model for the underlying data generation process and a Gaussian random projection. We adopt a hypothesis testing perspective on the anomaly detection problem, with the test statistic defined to be the magnitude of the residuals of a PCA analysis. Under the null hypothesis of no anomaly, we characterize the relative accuracy with which the mean and variance of the test statistic from compressed data approximate those of the corresponding test statistic from uncompressed data. Furthermore, under a suitable alternative hypothesis, we provide expressions that allow for a comparison of statistical power for detection. Finally, whereas these results correspond to the ideal setting in which the data covariance is known, we show that it is possible to obtain the same order of accuracy when the covariance of the compressed measurements is estimated using a sample covariance, as long as the number of measurements is of the same order of magnitude as the reduced dimensionality. We illustrate the practical impact of our results in the context of predicting volume anomalies in Internet traffic data.

Index Terms: Anomaly detection, principal component analysis, random projection.

I. INTRODUCTION

PRINCIPAL component analysis (PCA) is a classical tool for dimension reduction that remains at the heart of many modern techniques in multivariate statistics and data mining.
Among the multitude of uses that have been found for it, PCA often plays a central role in methods for systems monitoring and anomaly detection. A prototypical example of this is the method of Jackson and Mudholkar [11], the so-called PCA subspace projection method. In their approach, PCA is used to extract the primary trends and patterns in data, and the magnitude of the residuals (i.e., the norm of the projection of the data into the residual subspace) is then monitored for departures, with principles from hypothesis testing being used to set detection thresholds. This method has seen widespread usage in industrial systems control (e.g., [5], [25], [29]). More recently, it has also been used in the analysis of financial data (e.g., [9], [19], [20]) and of Internet traffic data (e.g., [17], [18]).

[Manuscript received April 09, 2012; revised July 14, 2013; accepted July 18, 2013. Date of publication August 15, 2013; date of current version October 16, 2013. This work was supported in part by NSF grants CCF-0325701 and CNS-0905565 and in part by ONR award N000140910654. The authors are with the Department of Mathematics and Statistics, Boston University, Boston, MA 02215 USA (e-mail: qiding@bu.edu; kolaczyk@bu.edu). Communicated by D. P. Palomar, Associate Editor for Detection and Estimation. Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TIT.2013.2278017]

In this paper, we propose a methodology in which PCA subspace projection is applied to data that have first undergone random projection. Two key observations motivate this proposal. First, as is well known, the computational complexity of PCA, when computed using the standard approach based on the singular value decomposition, scales polynomially in both the dimensionality of the data and the sample size.
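In code, the residual-monitoring idea just described amounts to an eigen-decomposition of the covariance followed by removal of the projection onto the leading components. The following is a minimal sketch in numpy, not the authors' implementation, and it assumes the covariance matrix is known:

```python
import numpy as np

def pca_residual_statistic(x, cov, r):
    """Squared norm of the residual of x after removing its projection
    onto the top-r principal components of cov (PCA subspace method)."""
    eigvals, eigvecs = np.linalg.eigh(cov)               # ascending order
    top_r = eigvecs[:, np.argsort(eigvals)[::-1][:r]]    # leading r directions
    x_hat = top_r @ (top_r.T @ x)                        # principal-subspace part
    resid = x - x_hat
    return float(resid @ resid)
```

For a diagonal covariance diag(4, 2, 1) and r = 1, for instance, the statistic is simply the squared norm of the last two coordinates of x.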
Thus, use of the PCA subspace method is increasingly less feasible with the ever-increasing size and dimensions of modern data sets. Second, concerns regarding data confidentiality, whether for proprietary reasons or reasons of privacy, are more and more driving a need for statistical methods to accommodate them. The first of these problems is something a number of authors have sought to address in recent years (e.g., [14], [15], [30], [33]), while the second, of course, does not pertain to PCA-based methods alone. Our proposal to incorporate random projection into the PCA subspace method is made with both issues in mind, in that the original data are transformed to a random coordinate space of reduced dimension prior to being processed. The key application motivating our problem is that of monitoring Internet traffic data. Previous use of PCA subspace methods for traffic monitoring [17], [18] has been largely restricted to the level of traffic traces aggregated over broad metropolitan regions (e.g., New York, Chicago, Los Angeles) for a network covering an entire country or continent (e.g., the United States, Europe, etc.). This level of aggregation is useful for monitoring coarse-scale usage patterns and high-level quality-of-service obligations. However, much of the current interest in the analysis of Internet traffic data revolves around the much finer scale of individual users. Data of this sort can be determined up to the (apparent) identity of individual computing devices, i.e., the so-called IP addresses. But there are as many as 2^32 such IP addresses, making the monitoring of traffic at this level a task guaranteed to involve massive amounts of data of very high dimension. Furthermore, it is typically necessary to anonymize data of this sort, and often it is not possible for anyone outside of the auspices of a particular Internet service provider to work with such data in its original form.
The standard technique used when data of this sort are actually shared is to aggregate the IP addresses in a manner similar to the coarsening of geocoding (e.g., giving only information on a town of residence, rather than a street address). Our proposed methodology can be viewed as a stylized prototype, establishing proof-of-concept for the use of PCA subspace projection methods on data like IP-level Internet traffic in a way that is both computationally feasible and respects concerns for data confidentiality. Going back to the famous Johnson-Lindenstrauss lemma [12], it is now well known that an appropriately defined random projection will effectively preserve lengths of data vectors as well as distances between vectors. This fact lies at the heart of an explosion in recent years of new theory and methods in statistics, machine learning, and signal processing. These include [3], [6], [8]. See, for example, the review [31]. Many of these methods go by names emphasizing the compression inherent in the random projection, such as compressed sensing or compressive sampling. In this spirit, we call our own method compressed PCA subspace projection. The primary contribution of our work is to show that, under certain sparseness conditions on the covariance structure of the original data, the use of Gaussian random projection followed by projection into the PCA residual subspace yields a test statistic whose distributional behavior is comparable to that of the statistic that would have been obtained from PCA subspace projection on the original data. Furthermore, up to higher order terms, there is no loss in accuracy if an estimated covariance matrix is used, rather than the true (unknown) covariance, as long as the sample size for estimating the covariance is of the same order of magnitude as the dimension of the random projection.
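The length- and distance-preservation behind the Johnson-Lindenstrauss lemma is easy to check numerically. The sketch below uses the common convention of i.i.d. N(0, 1/k) entries, under which squared distances are preserved in expectation; the normalization in the paper's own projection may differ:

```python
import numpy as np

def jl_project(X, k, rng=None):
    """Project the rows of X (n x p) down to k dimensions using i.i.d.
    N(0, 1/k) entries, the usual Johnson-Lindenstrauss scaling, so that
    pairwise squared distances are preserved in expectation."""
    rng = np.random.default_rng(rng)
    p = X.shape[1]
    R = rng.standard_normal((p, k)) / np.sqrt(k)
    return X @ R
```

The relative distortion of a squared distance concentrates at roughly sqrt(2/k), so even k in the low thousands preserves distances to within a few percent.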
While there is, of course, an enormous amount of literature on PCA and related methods, and in addition, there has emerged in more recent years a substantial literature on random projection and its integration with various methods for classical problems (e.g., regression, classification, etc.), to the best of our knowledge there are only two works that, like ours, explicitly address the use of the tools from these two areas in conjunction with each other. In the case of the first [26], a method of random projection followed by subspace projection [via the singular value decomposition (SVD)] is proposed for speeding up latent semantic indexing for document analysis. It is shown [26, Th. 5] that, with high probability, the result of applying this method to a matrix will yield an approximation of that matrix that is close to what would have been obtained through subspace projection applied to the matrix directly. A similar result is established in [7, Th. 5], where the goal is to separate a signal of interest from an interfering background signal, under the assumption that the subspace within which either the signal of interest or the interfering signal resides is known. In both [26] and [7], the proposed methods use a general class of random projections and fixed subspaces. In contrast, here we restrict our attention specifically to Gaussian random projections but adopt a model-based perspective on the underlying data themselves, specifying that the data derive from a high-dimensional zero-mean multivariate Gaussian distribution with covariance possessed of a compressible set of eigenvalues. In addition, we study the cases of both known and unknown covariances.
Our results are formulated within the context of a hypothesis testing problem and, accordingly, we concentrate on understanding the accuracy with which 1) the first two moments of our test statistic are preserved under the null hypothesis, and 2) the power is preserved under an appropriate alternative hypothesis. From this perspective, the probabilistic statements in [7] and [26] can be interpreted as simpler precursors of our results, which nevertheless strongly suggest the feasibility of what we present. Finally, we note too that the authors in [7] also propose a method of detection in a hypothesis testing setting, and provide results quantifying the accuracy of power under random projection, but this is offered separately from their results on subspace projections, and in the context of a model specifying a signal plus white Gaussian noise. This paper is organized as follows. In Section II, we review the standard PCA subspace projection method and establish appropriate notation for our method of compressed PCA subspace projection. Our main results are stated in Section III, where we characterize the mean and variance behaviors of our statistic as well as the size and power of the corresponding statistical test for anomalies based on this statistic. In Section IV, we present the results of a small simulation study, while in Section V, we illustrate the practical impact of our findings in the context of predicting volume anomalies in Internet traffic data. Finally, some brief discussion may be found in Section VI. The proofs for all theoretical results presented herein may be found in the Appendices.

II. BACKGROUND

Let X be a multivariate normal random vector of dimension p, with zero mean and positive definite covariance matrix Σ. Let Σ = UΛU^T be the eigen-decomposition of Σ. Denote the prediction of X by the first r principal components of Σ as X̂ = U_r U_r^T X, where U_r consists of the first r columns of U.
Jackson and Mudholkar [11], following an earlier suggestion of Jackson and Morris [10] in the context of photographic processing, propose to use the square of the norm of the residual from this prediction as a statistic for testing goodness-of-fit and, more generally, for multivariate quality control. This is what is referred to now in the literature as the PCA subspace method. Denoting this statistic as

    Q = ||X - X̂||^2,    (1)

we know that Q is distributed as a linear combination of i.i.d. chi-square random variables. In particular,

    Q = λ_{r+1} Z_{r+1}^2 + ... + λ_p Z_p^2,

where λ_{r+1}, ..., λ_p are the trailing eigenvalues of Σ and the Z_j are i.i.d. standard normal random variables. A normal approximation to this distribution is proposed in [11], based on a power transformation and appropriate centering and scaling. Here, however, we will content ourselves with the simpler approximation of Q by a normal with mean and variance

    E(Q) = λ_{r+1} + ... + λ_p    and    Var(Q) = 2(λ_{r+1}^2 + ... + λ_p^2),

respectively. This approximation is well justified theoretically (and additionally has been confirmed in preliminary numerical studies analogous to those reported later in this paper) by the fact that p typically will be quite large in our context. In addition, the resulting simplification will be convenient in facilitating our analysis and in rendering more transparent the impact of random projection on our proposed extension of Jackson and Mudholkar's approach. As stated previously, our extension is motivated by a desire to simultaneously achieve dimension reduction and ensure data confidentiality. Accordingly, let R be a k x p random matrix whose entries R_ij are i.i.d. standardized random variables, i.e., such that E(R_ij) = 0 and Var(R_ij) = 1. Throughout this paper, we will assume that the R_ij have a standard normal distribution. The random matrix R will be used to induce a random projection

    Y = p^{-1/2} R X.    (2)

Note that, under our conditions on the R_ij, the off-diagonal elements of p^{-1} R R^T tend to zero, while the diagonal elements tend to one, as p grows.
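Because Q is a weighted tail sum of chi-squares, the two moments entering this normal approximation depend only on the trailing eigenvalues. A small helper, assuming the eigenvalues are supplied:

```python
import numpy as np

def null_moments(eigvals, r):
    """Mean and variance of Q = sum_{j>r} lambda_j Z_j^2 under the null:
    E(Q) is the sum of the trailing eigenvalues and Var(Q) is twice the
    sum of their squares."""
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]  # descending
    tail = lam[r:]
    return float(tail.sum()), float(2.0 * (tail ** 2).sum())
```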
As a result, at an intuitive level, we may have reason to hope that if k grows with p in an appropriate manner, the resulting matrix behaves sufficiently like the identity matrix for the inner product and the corresponding Euclidean distance to essentially be preserved under the mapping in (2), while reducing the dimensionality of the space from p to k. Thus, under our intended scenario, rather than observing the original random variable X, we instead suppose that we see only its projection, which we denote as Y. Consider now the possibility of applying the PCA subspace method in this new data space. Conditional on the random matrix R, the random variable Y is distributed as multivariate normal with mean zero and covariance Σ̃ = p^{-1} R Σ R^T. Denote the eigen-decomposition of this covariance matrix by Σ̃ = Ũ Λ̃ Ũ^T, let Ŷ = Ũ_r Ũ_r^T Y represent the prediction of Y by the first r principal components of Σ̃, where Ũ_r is the first r columns of Ũ, and let Y - Ŷ be the corresponding residual. Finally, define the squared norm of this residual as

    Q̃ = ||Y - Ŷ||^2.

The primary contribution of our work is to show that, despite not having observed X, and therefore being unable to calculate the statistic Q, it is possible, under certain conditions on the covariance Σ of X, to apply the PCA subspace method to the projected data Y, yielding the statistic Q̃, and nevertheless obtain anomaly detection performance comparable to that which would have been yielded by Q, with the discrepancy between the two made precise.

III. MAIN RESULTS

It is unrealistic to expect that the statistics Q and Q̃ would behave comparably under general conditions. At an intuitive level, it is easy to see that what is necessary here is that the underlying eigen-structure of Σ must be sufficiently well preserved under random projection. The relationship between eigenvalues and eigenvectors with and without random projection is an area that is both classical and the focus of much recent activity. See [1], for example, for a recent review.
A popular model in this area is the spiked covariance model of Johnstone [13], in which it is assumed that the spectrum of the covariance matrix behaves as

    λ_1 ≥ λ_2 ≥ ... ≥ λ_r > λ_{r+1} = ... = λ_p,

with a relatively small number r of large leading eigenvalues followed by a constant bulk. This model captures the notion, often encountered in practice, of a covariance whose spectrum exhibits a distinct decay after a relatively few large leading eigenvalues. All of the results in this section are produced under the assumption of a spiked covariance model. We present three sets of results: 1) characterization of the mean and variance of Q̃, in terms of those of Q, in the absence of anomalies; 2) a comparison of the power of detecting certain anomalies under Q and Q̃; and 3) a quantification of the implications of estimation of the covariance on our results.

A. Mean and Variance of Q̃ in the Absence of Anomalies

We begin by studying the behavior of Q̃ when the data are in fact not anomalous, i.e., when X truly is normal with mean 0 and covariance Σ. This scenario will correspond to the null hypothesis in the formal detection problem we set up shortly below. Note that under this scenario, similar to Q, the statistic Q̃ is distributed, conditional on R, as a linear combination of i.i.d. chi-square random variables, with mean and variance given by

    E(Q̃ | R) = λ̃_{r+1} + ... + λ̃_k    and    Var(Q̃ | R) = 2(λ̃_{r+1}^2 + ... + λ̃_k^2),

respectively, where λ̃_1 ≥ ... ≥ λ̃_k is the spectrum of Σ̃. Our approach to testing will be to first center and scale Q̃, and to then compare the resulting statistic to a standard normal distribution for testing. Therefore, our primary focus in this section is on characterizing the expectation and variance of Q̃. The expectation of Q̃ may be characterized as follows.

Theorem 1: Assume such that . If and , then under the Gaussian random projection defined in (2), (3) where indicates terms of order big-Oh in probability. Thus, Q̃ differs from Q in expectation, conditional on R, only by a constant independent of and .
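A minimal construction of a spiked spectrum of this kind (the spike values below are illustrative, not those used in the paper):

```python
import numpy as np

def spiked_covariance(p, spikes, bulk=1.0):
    """Covariance with a few large leading eigenvalues (the spikes)
    followed by a flat bulk; taken diagonal here without loss of
    generality, since only the spectrum matters."""
    lam = np.full(p, float(bulk))
    lam[: len(spikes)] = sorted(spikes, reverse=True)
    return np.diag(lam)
```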
Alternatively, if we divide through by and note that under the spiked covariance model (4) as , then from (3) we obtain (5). In other words, at the level of expectations, the effect of random projection on our (rescaled) test statistic is to introduce a bias that vanishes like . The variance of Q̃ may be characterized as follows.

Theorem 2: Assume such that . If and , then (6). That is, the conditional variance of Q̃ differs from the variance of Q by a factor of , with a relative bias term of order .

Taken together, Theorems 1 and 2 indicate that application of the PCA subspace method on nonanomalous data after random projection produces a test statistic that is asymptotically unbiased for the statistic we would in principle like to use, if the original data were available to us, but whose variance is inflated over that of Q by a factor depending explicitly on the amount of compression inherent in the projection. In Section IV, we present the results of a small numerical study that show, over a range of compression values, that the approximations in (5) and (6) are quite accurate.

B. Comparison of Power for Detecting Anomalies

We now consider the comparative theoretical performance of the statistics Q and Q̃ for detecting anomalies. From the perspective of the PCA subspace method, an anomaly is something that deviates from the null model that the multivariate normal vector has mean zero and covariance Σ in such a way that it is visible in the residual subspace, i.e., under projection by . Hence, we treat the anomaly detection problem in this setting as a hypothesis testing problem, in which (7) for and . Note that our alternative hypothesis specifies that anomalies are along a single principal component direction. The hypothesis could be stated more generally, with anomalous contributions from multiple principal component directions in the residual subspace.
However, since the PCA subspace detection method is based on the norm of the residual, all the nonzero values in the residual projection subspace will contribute additively to the sum of squares in the statistics, in expectation, and additively in their square, at the level of the variance. Hence, without loss of generality, it is sufficient here to consider anomalies in only a single direction, particularly as it simplifies notation and exposition. Recall that, as discussed in Section II, it is reasonable in our setting to approximate the distribution of appropriately standardized versions of Q and Q̃ by the standard normal distribution. Under our spiked covariance model, and using the results of Theorems 1 and 2, this means comparing the statistics (8), respectively, to the upper critical value of a standard normal distribution. Accordingly, we define the power functions (9) and (10) for Q and Q̃, respectively, where the probabilities on the right-hand side of these expressions refer to the corresponding approximate normal distribution. Our goal is to understand the relative magnitude of compared to , as a function of , and . Approximations to the relevant formulas are provided in the following theorem.

Theorem 3: Let Z be a standard normal random variable. Under the same assumptions as Theorems 1 and 2, and a Gaussian approximation to the standardized test statistics, we have that where (11) while (12).

Ignoring error terms, we see that the critical values (11) and (12) for both power formulas have as their argument quantities of the form . However, while , we have that . Hence, all else being held equal, as the compression ratio increases, the critical value at which power is evaluated shifts increasingly to the right for Q̃, and power decreases accordingly.
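Whatever the exact arguments supplied by (11) and (12), evaluating either power function reduces to a standard normal tail probability beyond a shifted critical value. A generic sketch using only the standard library; the shift itself, which Theorem 3 provides and which shrinks as the compression ratio grows, is left here as an input:

```python
import math

def std_normal_cdf(z):
    """Standard normal CDF via the complementary error function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def std_normal_quantile(q):
    """Inverse standard normal CDF by bisection (adequate here)."""
    lo, hi = -12.0, 12.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if std_normal_cdf(mid) < q:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def one_sided_power(shift, alpha):
    """Power of a one-sided level-alpha z-test when the standardized
    statistic has mean `shift` (and unit variance) under the alternative."""
    return 1.0 - std_normal_cdf(std_normal_quantile(1.0 - alpha) - shift)
```

At shift zero the power equals the test size, and it increases monotonically with the shift, which is why a compression-induced reduction of the effective shift translates directly into lost power.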
The extent to which this effect will be apparent is modulated by the magnitude of the anomaly to be detected and the significance level at which the test is defined, and furthermore by the size of the original data space. Finally, while these observations can be expected to be most accurate for large and large , in the case that either or both are more comparable in size to the and error terms in (12), respectively, the latter will play an increasing role and hence affect the accuracy of the stated results. An illustration may be found in Fig. 1. There we show the power as a function of the compression ratio. Here the dimension before projection is fixed and the dimension after projection ranges from 10 000 to 500. A fixed value was used for the dimension of the principal component analysis, and a fixed size was chosen for the underlying test for anomaly. Note that on the far left-hand side of the plot, the value simply reduces to the uncompressed power. (Note too that the theory in this paper addresses only nontrivial compression ratios, and the uncompressed endpoint is included in this plot purely for the sake of visual continuity.)

Fig. 1. Power as a function of compression ratio for various choices of anomaly size.

So the five curves show the loss of power resulting from compression, as a function of compression level, for various choices of strength of the anomaly. Additional numerical results of a related nature are presented in Section IV.

C. Unknown Covariance

The test statistics Q and Q̃ are defined in terms of the covariance matrices of the original and projected data, respectively. However, in practice, it is unlikely that these matrices are known. Rather, it is more likely that estimates of their values will be used in calculating the test statistics, resulting in estimated versions of these statistics. In the context of industrial systems control, for example, it is not unreasonable to expect that there will be substantial previous data that may be used for this purpose.
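Concretely, the estimated-covariance variant might look as follows: estimate the covariance of already-projected training data by the sample covariance, then form the subspace residual statistic as before. This is an illustrative reading of the construction, not the paper's exact estimator:

```python
import numpy as np

def compressed_statistic_estimated_cov(Y_train, y_new, r):
    """Compressed PCA subspace statistic with unknown covariance:
    estimate the k x k covariance from n projected training rows
    (Y_train is n x k), then score a new projected observation."""
    cov_hat = np.cov(Y_train, rowvar=False)          # sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov_hat)
    top_r = eigvecs[:, np.argsort(eigvals)[::-1][:r]]
    resid = y_new - top_r @ (top_r.T @ y_new)
    return float(resid @ resid)
```

Consistent with the tradeoff discussed in this subsection, the number of training rows n should be at least on the order of the projected dimension k for the sample covariance to be trustworthy.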
As our concern in this paper is on the use of the subspace projection method after random projection, the relevant question to ask here is what the implications are of using an estimate of the covariance of the projected data. We study the natural case where the estimate is simply the sample covariance, computed from n i.i.d. copies of the projected random variable and their vector mean. Let this sample covariance have its own eigen-decomposition and, accordingly, define the resulting statistic in analogy to the known-covariance case. We then have the following result.

Theorem 4: Assume . Then, under the same conditions as Theorem 1, (13) and (14). Furthermore, under the conditions of Theorem 3, the power function (15) can be expressed as , where (16).

Simply put, the results of the theorem tell us that the accuracy with which compressed PCA subspace projection approximates standard PCA subspace projection in the original data space, when using the estimated covariance rather than the unknown true covariance, is unchanged, as long as the sample size n used in computing the estimate is at least as large as the dimension k after random projection. Hence, there is an interesting tradeoff between n and k, in that the smaller the sample size that is likely to be available, the smaller the dimension that must be used in defining our random projection, if the ideal accuracy is to be obtained (i.e., that using the true covariance). However, decreasing k will degrade the quality of the accuracy in this ideal case, as it increases the compression ratio.

IV. SIMULATION

We present three sets of numerical simulation results in this section: one corresponding to Theorems 1 and 2; another, to Theorem 3; and the last, to Theorem 4. In our first set of experiments, we simulated from the spiked covariance model, drawing both random variables and their projections over many trials, and computed Q and Q̃ for each trial, thus allowing us to compare their respective means and variances.
In more detail, we fixed the dimension of the original random variable and assumed X to be distributed as normal with mean zero and (without loss of generality) covariance equal to a spiked spectrum. The corresponding random projections of X were computed using random matrices generated as described in the text, with compression ratios equal to 20, 50, and 100. We used a total of 2000 trials for each realization of R, and 30 realizations of R for each choice of compression ratio. The results of this experiment are summarized in Table I. Recall that Theorems 1 and 2 say that the rescaled mean and the ratio of variances should be approximately equal to and , respectively. It is clear from these results that, for low levels of compression (i.e., 20) the approximations in our theorems are quite accurate and that they vary little from one projection to another. For moderate levels of compression (i.e., 50) they are similarly accurate, although more variable. For high levels of compression (i.e., 100), we begin to see some nontrivial bias entering, with some accompanying increase in variability as well.

Fig. 2. Simulation results assessing the accuracy of Theorems 3 (left) and 4 (right).

TABLE I. SIMULATION RESULTS ASSESSING THE ACCURACY OF THEOREMS 1 AND 2.

In our second set of experiments, we again simulated from a spiked covariance model, but now with nontrivial mean. The spiked spectrum was chosen to be the same as above, but with a smaller dimension, for computational considerations. The mean was defined as in (7). A range of compression ratios was used. We ran a total of 1000 trials for each realization of R, and 30 realizations of R for each combination of compression ratio and anomaly size. The statistics were computed as in the statement of Theorem 3 and compared to the critical value corresponding to a one-sided test of fixed size. The results are shown in Fig. 2.
Error bars reflect variation over the different realizations of R and correspond to one standard deviation. The curves shown correspond to the power approximation given in Theorem 3, and are the same as the middle three curves in Fig. 1. We see that for the strongest anomaly level the theoretical approximation matches the empirical results quite closely for all but the highest levels of compression. Similarly, for the weakest anomaly level, the match is also quite good, although there appears to be a small but persistent positive bias in the approximation across all compression levels. In both cases, the variation across choice of R is quite low. The largest bias in the approximation is seen at the moderate anomaly level, at moderate to high levels of compression, although the bias appears to be on par with the other anomaly levels at lower compression levels. The largest variation across realizations of R is seen for the moderate anomaly level. Finally, in our third set of experiments, we repeated the same exercise as in the second set, but for the case where the covariance is unknown. The design of the simulation was exactly the same as before, except that prior to projection by each realization of R, for each combination of compression ratio and anomaly size, an estimate of the covariance was computed from i.i.d. nonanomalous observations (i.e., with mean zero). The results of this simulation are also shown in Fig. 2. We see that there is little to any difference when compared to the case of known covariance, in support of the result in Theorem 4. Note that the error bars in this case include the combined uncertainty due to randomness in both R and the covariance estimate. Taken together, the results of our simulations, although modest in scope, offer some useful guidelines for using our compressed PCA subspace projection method in practice. In particular, we see that compression ratios of up to 10 or even 20 allow for relatively high detection levels for strong signals, and still reasonable detection levels for moderate signals.
For weaker signals, however, it appears that detection levels decrease fairly quickly, dropping to as low as 50% already for a compression ratio of only 3. We note that in practice there will also likely be a tradeoff between accuracy and computational burden to be considered. For sufficiently large dimension in the original data, a substantial amount of compression may nonetheless be tolerable if it is only at that level that the required PCA calculations become computationally feasible. This is especially likely to be the case if the method is to be implemented in a time-dependent fashion (e.g., online).

V. APPLICATION IN INTERNET TRAFFIC ANALYSIS

We illustrate the use of our compressed PCA subspace projection method in the context of computer network traffic analysis. Internet service providers (ISPs) routinely monitor the traffic on their networks for, among other things, anomalous traffic patterns and behavior. One of the most basic types of anomalies for which ISPs monitor are volume anomalies, i.e., anomalous levels in the volume of traffic over a certain period of time.

Fig. 3. Left: Summary network flow time series in units of bytes (top), packets (middle), and flows (bottom). Right: (Log) spectra of the three traffic time series matrices.

Lakhina et al. [17], [18] have shown that methods based on the PCA subspace projection method can be quite effective at identifying volume anomalies in aggregate network traffic data. Here, we demonstrate that our compressed PCA subspace projection method has the potential to allow for similarly effective anomaly detection at a much finer level of granularity, at or close to the level of individual IP addresses. We do so by comparing the performance of the compressed method to the uncompressed method on the original data.
Our analysis uses data from a large European ISP, observed at a single network node corresponding to a major city, during the period May 7 through May 10, 2007. The raw measurements represent observations of origin-destination traffic flows and certain basic characteristics. Relevant to our analysis are the IP addresses of the origin and destination of each flow, as well as the volume of the flow in units of bytes and packets. We will focus on volume anomalies at destinations. There were a total of 656 785 unique destination IP addresses observed during the four-day period of our study. In order to facilitate our use of the uncompressed PCA subspace projection method on these data, as a comparison, we preprocess the raw data by binning in both time and IP address space. For time, we used five-minute intervals, a fairly standard choice in this area, while for IP addresses we used quantile binning with 11 714 bins. The end result is a set of three multivariate time series, each of dimension 11 714, measured over the resulting sequence of time bins, with values corresponding to the total volume for each IP address block during each time interval, in units of number of bytes, packets, and flows, respectively. In Fig. 3 is shown a visualization of these data, obtained simply by summing the multivariate time series across their components (i.e., collapsing across IP address blocks). Also shown are the nontrivial singular values corresponding to each of the three data matrices. Presented on a semilogarithmic scale, we can see that these spectra all decay quite sharply, qualitatively similar to what is dictated by a spiked covariance model. Examination of these spectra on their original scale suggests that a small choice of the subspace dimension is reasonable. For each of the three traffic matrices, estimates of the principal components of the covariances of both the original and the projected data were obtained, using the method of [33]. The same subspace dimension was used for both uncompressed and compressed PCA subspace projection methods.
The test statistics and were computed for each of the five-minute time intervals, for each of the three units of volume, resulting in six time-indexed test statistic sequences. These are shown in Fig. 4. Recall that these test statistics measure the extent to which there remains nontrivial structure in the data after projection onto the first principal components. The figures show that for much of the time little if any high-frequency structure is missed, although at times there appear to be periods in which low-frequency structure is consistently missed. Following Lakhina et al., we will focus here only on recovery of the high-frequency anomalies.

For the purpose of comparison, in each of the bytes, packets, and flows time series, we generate an anomaly truth set by declaring an anomaly to exist at any time point for which the standardized value of exceeds , for a given . We then similarly do anomaly detection at the same level using the compressed PCA subspace projection method, at compression ratios , for 30 different realizations of the random matrix . Both and were standardized by robust estimates of mean and standard deviation applied to the sequence of test statistics across time. Tables II and III show the average power of the tests in detecting anomalies. We see that a power of roughly 70% is achieved when testing at , and that this rises to over 75% when . These results suggest that a large fraction of the anomalies that would be declared using subspace projection on the original data would also be detected using subspace projection after random projection, over a range of compression ratios.

Finally, in Fig. 5, we show 3-D plots of the values of and of , for each of the bytes, flows, and packets time series, for any time bin in which an anomaly was declared in at least one of the time series. Lakhina et al.
[18] found in their analysis of aggregate-level traffic that such plots showed a strong clustering in this 3-D space, suggesting that there were distinct types of anomalies being captured. Here, we find that the plot corresponding to -based anomalies shows similar clustering and, moreover, that the plot corresponding to -based anomalies appears to qualitatively preserve this clustering. Thus, the ability of our compressed PCA subspace method to capture with fairly high accuracy the anomalies declared by the uncompressed method appears to also translate into some ability to recover certain higher level characteristics in the data.

Fig. 4. Plots of (left) and (right) over time for the network traffic application. Detection thresholds shown in red.

Fig. 5. Three-dimensional scatterplots of (left) and (right) for each of the bytes, flows, and packets time series. Each point corresponds to an anomaly, colored according to whether it was declared due to excessive volume of bytes (green), flows (blue), or packets (red).

TABLE II
ANOMALY DETECTION POWER UNDER

TABLE III
ANOMALY DETECTION POWER UNDER

VI. DISCUSSION

Motivated by dual considerations of dimension reduction and data confidentiality, as well as the wide-ranging and successful implementation of PCA subspace projection, we have introduced a method of compressed PCA subspace projection and characterized key theoretical quantities relating to its use as a tool in anomaly detection. Furthermore, an illustrative implementation of this proposed methodology and its application to detecting IP-level volume anomalies in computer network traffic suggests a high relevance to practical problems.

In introducing our extension here of the classical PCA subspace projection method, we have made use of a handful of assumptions, which are useful for simplifying both the exposition and various aspects of the proofs of our results.
We conjecture that many of these assumptions may be relaxed. For example, while our assumption of Gaussian data (i.e., for the distribution of ) is both classical and natural for PCA-like analyses, given their reliance on the norm, it is unlikely to hold in practice. However, we utilize this Gaussian assumption most directly only in exploiting its invariance to rotation when assuming, without loss of generality, that the covariance itself (rather than its spectrum) follows a spiked covariance model. This step greatly simplifies the exposition. In addition, the assumption is in principle used initially when we state that the statistics and are distributed as sums of independent chi-square random variables. We note, however, that in all of our power calculations, we implicitly use a central limit theorem approximation, which can be expected to hold under more general moment conditions and an assumption of sufficiently large .

Similarly, the spiked covariance model we assume (wherein only the first eigenvalues differ from the value 1, and those have a nontrivial gap separating them from 1) is only an idealization of what is likely to be encountered in practice. This too can be relaxed, at the cost of some additional work. For instance, an assumption that the spectrum lies in a weak ball would more accurately capture the notion of a fast but smooth decay, such as exhibited in our analysis of the network traffic data. See [14], for example, for discussion in this direction.

Turning from the data to the random projection (i.e., ), there is room there as well for generalization, in that the assumption of Gaussianity that is standard in much of the early work on random projections, and is used here as well, has since been relaxed in many settings.
Here, again, similar to our assumption of Gaussian data, we have made use of our assumption of Gaussian projections primarily for the invariance of the Gaussian to rotation and for the nice behavior of its tails. Extension of our results to zero-mean projections with similarly nice tail behavior (e.g., binary projections) should be relatively straightforward.

Additionally, the question of sample size, in relation to the dimension of our problem, is also an interesting direction for future work. The results of Theorem 4 are important in establishing the practical feasibility of our proposed method, wherein the covariance must be estimated from data, when it is possible to obtain samples of size of a similar order of magnitude as the reduced dimension of our random projection. It would be of interest to establish results of a related nature for the case where . In that case, it cannot be expected that the classical moment-based estimator that we have used here will perform acceptably. Instead, an estimator exploiting the structure of presumably is needed. However, as most methods in the recent literature on estimation of large, structured covariance matrices assume sparseness of some sort (e.g., [2], [16], [21]), they are unlikely to be applicable here, since is roughly of the form , where is of rank with entries of magnitude . Similarly, neither will methods of sparse PCA be appropriate (e.g., [14], [15], [30], [33]). Rather, variations on more recently proposed methods aimed directly at capturing low-rank covariance structure hold promise (e.g., [23], [24]). Alternatively, the use of so-called very sparse random projections (e.g., [22]), in place of our Gaussian random projections, would yield sparse covariance matrices , and hence in principle facilitate the use of sparse inference methods in producing an estimate .
But this step would likely come at the cost of making the already fairly detailed technical arguments behind our results more involved still, as we have exploited the Gaussianity of the random projection in certain key places to simplify calculations, as noted above. We note that, ultimately, for such approaches to produce results of accuracy similar to that in Theorem 4, it is necessary that they produce approximations to the PCA subspace of with order accuracy.

Finally, we acknowledge that the paradigm explored here, even if extended in the various directions suggested above, is only a caricature of what might be implemented in reality, particularly in contexts like computer network traffic monitoring. There, for example, issues of data management, speed, etc., would become important and can be expected to have nontrivial implications for the design of the type of random projections actually used, or even their number (e.g., increased power might result from using multiple projections, while still maintaining computational savings). Nevertheless, we submit that the results presented in this paper strongly suggest the potential success of an appropriately modified system of this nature.

APPENDIX

In this section, we present proofs of Theorems 1 through 4.

A) Proof of Theorem 1: Suppose the random vector has a multivariate Gaussian distribution and , for . Denote the eigenvalues of and as and , respectively. In proving Theorem 1, we seek to show that, under the conditions stated in the theorem, the relation holds. Jackson and Mudholkar [11] show that will be distributed as , where the are i.i.d. standard normal random variables. Consequently, we have and, similarly, . So comparison of and reduces to a comparison of partial sums of the eigenvalues of and . In the following proof, we analyze and separately, showing that they are close to and , respectively.
2) : Because orthogonal rotation has no influence on the Gaussian random projection or the matrix spectrum, to simplify the computation, we assume without loss of generality that . So the diagonal elements of are We have and therefore Under the spiked covariance model assumed in this paper, for fixed . Then, when and , the first term will go to zero like . The second term can be written as More precisely, here we have a series satisfying when . It is easy to show that when . Since the are i.i.d., we can reexpress the series as Recalling that the are standard normal random variables, we know that and . By the central limit theorem, the series converges in distribution to a zero-mean normal. Hence, is of order when . As an infinite subsequence, also has the same behavior, which leads to by which we conclude that As a result of the above arguments,

3) : Next we examine the behavior of the first eigenvalues of and , i.e., and . Recalling the definition of as , we define the matrix . All of the columns of are i.i.d. random vectors from , and , the covariance of , can be expressed as . Let , which contains the same nonzero eigenvalues as . Through this transformation of to and the interpretation of as the sample covariance corresponding to , we are able to utilize established results from random matrix theory. Denote the spectrum of as . Under the spiked covariance model, Baik [1] and Paul [28] independently derived the limiting behavior of the elements of this spectrum. Under our normal case , Paul [28] proved the asymptotic normality of .

Theorem 5: Assume such that . If , then

For significantly large leading eigenvalues , is asymptotically . And for all of the leading eigenvalues which are above the threshold , we have . Recalling the condition in the statement of the theorem, without loss of generality we take (as we will do, when convenient, throughout the rest of the proofs in these Appendices).
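The limiting behavior cited here can be checked numerically. For a population eigenvalue ell above the phase-transition threshold 1 + sqrt(gamma), with gamma = p/n, the corresponding sample eigenvalue converges almost surely to ell * (1 + gamma / (ell - 1)) (Baik; Paul [28]). The sizes and spike value below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

n, p, ell = 2000, 400, 10.0
gamma = p / n                          # aspect ratio 0.2; ell > 1 + sqrt(gamma)

# One spike of size ell along the first coordinate, noise eigenvalues 1.
X = rng.standard_normal((n, p))
X[:, 0] *= np.sqrt(ell)

S = X.T @ X / n                        # sample covariance (mean known to be zero)
top = np.linalg.eigvalsh(S)[-1]        # largest sample eigenvalue

predicted = ell * (1 + gamma / (ell - 1))   # almost-sure limit for ell above threshold
print(top, predicted)
```

The observed top eigenvalue is biased upward from ell toward the predicted limit, illustrating why the proofs track rather than itself.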
Using Paul's result, we have Combining these results with those of the previous section, we have

D) Proof of Theorem 2: For the proof of this theorem, we seek to show that For notational convenience, denote as , so that . Writing , and similarly for , we have Since , we have Accordingly, if , and while if , Changing the order of summation, we therefore have which implies (17) Now, under the spiked covariance model, with , we have As a result, we have Substituting the expression in (17) yields (18) The control of (18) is not immediate. Let us denote the two terms on the RHS of (18) as and . Results in the next two sections show that behaves like , and like . Consequently, Theorem 2 holds.

5) : We show in this section that First note that, by an appeal to the central limit theorem, . So A can be expressed as Using the result by Paul cited in Appendix A.2, in the form of Theorem 6, the first term is found to behave like . In addition, it is easy to see that the second term behaves like . Finally, since under the spiked covariance model , taking we have that As a result, the third term in the expansion of is equal to . Combining terms, we find that .

6) : Term B in (18) can be written as (19) Recalling that under the spiked covariance model , in the following we analyze the asymptotic behavior of term B in two stages, by first handling the case in detail, and second, arguing that the result does not change under the original conditions. If , which is simply a white noise model, term B becomes (20) which may be usefully reexpressed as (21) and, upon exchanging the order of summation, as (22) Write (22) as . In the material that immediately follows, we argue that, under the conditions of the theorem and the white noise model, and .
To prove the first of these two expressions, we begin by writing Note that the are i.i.d. random variables. We will use a central limit theorem argument to control . A straightforward calculation shows that . To characterize the second moment, we write and consider each of three possible types of terms .

1) If , then , with expectation 9. Since there are such choices of , the contribution of terms from this case to is .

2) If only two of are equal, will take the form , with expectation 3. For each triple there are six possible cases: , , , , , . So there are such terms in this case, yielding a contribution of to .

3) If are all different, the expectation of is just 1. Since there are terms in total, the number of such terms in this case, and hence the contribution of this case to , is .

Combining these various calculations, we find that and hence By the central limit theorem, we know that . Exploiting that , and recalling that by assumption, simple calculations yield that . As for , it can be shown that and from which it follows, by Chebyshev's inequality, that Combining all of the results above, under the white noise model, i.e., when , we have . In the case that the spiked covariance model instead holds, i.e., when , it can be shown that the impact on (19) is to introduce an additional term of . The effect is therefore negligible on the final result stated in the theorem, which involves an term. Intuitively, the values of the first eigenvalues will not influence the asymptotic behavior of the infinite sum in (19), which is term B.

G) Proof of Theorem 3: Here, we derive the approximations to the power functions and given in the statement of the theorem. Through a coordinate transformation, from to , we can, without loss of generality, restrict our attention to the case where 1) , for , and 2) our testing problem is of the form for some . In other words, we test whether the underlying mean is zero or differs from zero in a single component by some value .
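Under this single-component alternative, the sum of squared standardized residuals follows a noncentral chi-square law, whose standard moment formulas (mean m + delta, variance 2(m + 2*delta), for m degrees of freedom and noncentrality delta) drive the power approximation that follows. A quick simulation check, with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(5)

m, delta = 50, 9.0                 # degrees of freedom, noncentrality (illustrative)
mu = np.zeros(m)
mu[0] = np.sqrt(delta)             # shift a single component, as in the proof

# Sum of squares of shifted standard normals: noncentral chi-square(m, delta).
Z = rng.standard_normal((100_000, m)) + mu
T = (Z ** 2).sum(axis=1)

print(T.mean(), m + delta)             # standard formula: mean = m + delta
print(T.var(), 2 * (m + 2 * delta))    # standard formula: var = 2(m + 2*delta)
```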
Consider first the expression for in (9). Under the above model, where the are independent random variables. So, under the alternative hypothesis, the sum of their squares is a noncentral chi-square random variable, on degrees of freedom, with noncentrality parameter . We have by standard formulas that and Using these expressions and the normal-based expression for power defining (9), we find that as claimed.

Now consider the expression for in (10), where we write for and . Under the null hypothesis, we have and Call these expressions and , respectively. Under the alternative hypothesis, the same quantities take the form and respectively, where (23) and (24) Arguing as above, we find that By Theorem 2, we know that . Ignoring the higher order stochastic error term, we therefore have This expression is the same as that in the statement of Theorem 3, up to a rescaling by a factor of . Hence, it remains for us to show that Our problem is simplified under transformation by the rotation , where is an arbitrary orthonormal rotation matrix. If we similarly apply , then and remain unchanged in (23) and (24). Recall that , where , and , with in the location. Choosing and denoting , straightforward calculations yield that and Now write , where denotes the first rows of , and , the last rows. The elements in the two sums immediately above lie in the th row of the product of and the last columns of . By Paul [28, Th. 4], we know that if , the last leading eigenvalue in the spiked covariance model, is much greater than 1, and such that , then the distance between the subspaces and diminishes to zero. Asymptotically, therefore, we may assume that these two subspaces coincide. Hence, since is statistically independent of , it follows that is asymptotically independent of , and therefore of the orthogonal complement of , i.e., the last columns of . As a result, the elements in behave asymptotically like i.i.d.
standard normal random variables. Applying Chebyshev's inequality in this context, it can be shown that Rescaling by , the expressions for and in Theorem 3 are obtained.

H) Proof of Theorem 4: Let and . If we use the sample covariance to estimate , we will observe the residual instead of . To prove the theorem, it is sufficient to derive expressions for and under the null and alternative hypotheses in (7), as these expressions are what inform the components of the critical value in the power calculation. Our method of proof involves re-expressing and in terms of and , and showing that those terms involving the latter are no larger than the error terms associated with the former in Theorems 1, 2, and 3.

Begin by considering the mean and writing We need to control the term (25) Under the null hypothesis, the second term in (25) is zero, and so to prove (13) we need to show that the first term is . Note that, without loss of generality, we may write , where and , for a random matrix of i.i.d. standard Gaussian random variables and . Then, using [27, Th. II.1], with in the notation of that paper, it follows that (26) where we use generically here and below to denote the th largest eigenvalue of its argument.

For the second term on the right-hand side of (26), write . Using a result attributed to Mori (appearing as Lemma I.1 in [27]), we can write and similarly for in place of . Exploiting the linearity of the trace operation and the fact that , we can bound the term of interest as However, and are equal to times the largest and smallest eigenvalues of a sample covariance of standard Gaussian random variables, the latter of which converge almost surely to the right and left endpoints of the Marchenko-Pastur distribution (e.g., [1]), which in this setting take the values and , respectively. Hence, . Now consider the factor in the first term on the right-hand side of (26). We have shown that .
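The Marchenko-Pastur edge behavior invoked above is easy to verify numerically: for a sample covariance of standard Gaussian variables with aspect ratio gamma, the extreme eigenvalues converge almost surely to (1 +- sqrt(gamma))^2. The sizes below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)

n, k = 2000, 500
gamma = k / n                          # aspect ratio 0.25

# Sample covariance of i.i.d. standard Gaussian data.
Z = rng.standard_normal((n, k))
S = Z.T @ Z / n
w = np.linalg.eigvalsh(S)              # eigenvalues in ascending order

# Marchenko-Pastur bulk edges in the white (identity covariance) case.
right_edge = (1 + np.sqrt(gamma)) ** 2
left_edge = (1 - np.sqrt(gamma)) ** 2
print(w[-1], right_edge)
print(w[0], left_edge)
```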
At the same time, we note that , being proportional to the normalized trace of a matrix whose entries are i.i.d. copies of averages of i.i.d. chi-square random variables on one degree of freedom. Therefore, and recalling the spiked covariance model, we find that . At the same time, the factor multiplying this term, i.e., the largest absolute eigenvalue of , is just the operator norm and hence is bounded above by the Frobenius norm, . We introduce the notation for the th column of times its transpose, and similarly for in the case of . Then, and To bound this, we use a result in Watson [32, Appendix B, (3.8)], relying on a multivariate central limit theorem: in distribution, as , where is a random matrix whose distribution depends only on , and recall that are the eigenvalues of . So . Therefore, the left-hand side of (26) is , and (13) is established.

Now consider the second term in (25), which must be controlled under the alternative hypothesis. This is easily done, as we may write and note that the first term is while the second is . Therefore, under the assumption that , the entire term is , which is the same order of error to which we approximate in (24) in the proof of Theorem 3. Hence, the contribution of the mean to the critical value in (16), using , is the same as in (12), using . This completes our treatment of the mean.

The variance can be treated similarly, writing and controlling the last two terms. The first of these two terms takes the form (27) and the second, (28) Again, under the null hypothesis, the second terms in (27) and (28) are zero. Hence, to establish (14), it is sufficient to show that the first terms in (27) and (28) are . We begin by noting that where the first inequality follows from [4, Th. 1], and the second from the Cauchy-Schwarz inequality. Straightforward manipulations, along with use of [27, Lemma I.1], yield that . At the same time, we have that Therefore, under the assumptions that and , we are able to control the relevant error term in (27) as .
Similarly, using [27, Lemma I.1] again, we have the bound The first term in this bound is , while the second is , which allows us to control the relevant error term in (28) as . As a result, under the null hypothesis, we have that , which is sufficient to establish (14), since .

Finally, we consider the second terms in (27) and (28), which must be controlled as well under the alternative hypothesis. Writing and it can be seen that we can bound the first of these expressions by , and the second by . Therefore, the combined contribution of the second terms in (27) and (28) is , which is the same order to which we approximate in (23) in the proof of Theorem 3. Hence, the contribution of the variance to the critical value in (16), using , is the same as in (12), using .

ACKNOWLEDGMENT

The authors thank D. Paul for a number of helpful conversations.

REFERENCES

[1] Z. D. Bai, "Methodologies in spectral analysis of large dimensional random matrices, a review," Statist. Sinica, vol. 9, pp. 611-677, 1999.
[2] P. J. Bickel and E. Levina, "Regularized estimation of large covariance matrices," Ann. Statist., vol. 36, no. 1, pp. 199-227, 2008.
[3] E. Bingham and H. Mannila, "Random projection in dimensionality reduction: Applications to image and text data," in Proc. 7th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, 2001, pp. 245-250.
[4] D. W. Chang, "A matrix trace inequality for products of Hermitian matrices," J. Math. Anal. Appl., vol. 237, no. 2, pp. 721-725, 1999.
[5] L. H. Chiang, E. Russell, and R. D. Braatz, Fault Detection and Diagnosis in Industrial Systems. New York, NY, USA: Springer-Verlag, 2001.
[6] S. Dasgupta, "Experiments with random projection," in Proc. 16th Conf. Uncertainty in Artificial Intelligence, 2000, pp. 143-151.
[7] M. A. Davenport, P. T. Boufounos, M. B. Wakin, and R. G. Baraniuk, "Signal processing with compressive measurements," IEEE J. Sel. Topics Signal Process., vol.
4, no. 2, pp. 445-460, Apr. 2010.
[8] D. Fradkin and D. Madigan, "Experiments with random projections for machine learning," in Proc. 9th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, 2003, pp. 517-522.
[9] A. Harvey, E. Ruiz, and N. Shephard, "Multivariate stochastic variance models," Rev. Econ. Stud., vol. 61, no. 2, pp. 247-264, 1994.
[10] J. E. Jackson and R. H. Morris, "An application of multivariate quality control to photographic processing," J. Amer. Statist. Assoc., vol. 52, no. 278, pp. 186-199, 1957.
[11] J. E. Jackson and G. S. Mudholkar, "Control procedures for residuals associated with principal component analysis," Technometrics, vol. 21, pp. 341-349, 1979.
[12] W. Johnson and J. Lindenstrauss, "Extensions of Lipschitz mappings into a Hilbert space," Contemp. Math., vol. 26, pp. 189-206, 1984.
[13] I. M. Johnstone, "On the distribution of the largest eigenvalue in principal components analysis," Ann. Statist., vol. 29, pp. 295-327, 2001.
[14] I. M. Johnstone and A. Y. Lu, "Sparse principal components analysis," J. Amer. Statist. Assoc., vol. 104, pp. 682-693, Jun. 2009.
[15] M. Journée, Y. Nesterov, P. Richtárik, and R. Sepulchre, "Generalized power method for sparse principal component analysis," J. Mach. Learn. Res., vol. 11, pp. 517-553, 2010.
[16] N. El Karoui, "Operator norm consistent estimation of large-dimensional sparse covariance matrices," Ann. Statist., vol. 36, pp. 2717-2756, 2008.
[17] A. Lakhina, M. Crovella, and C. Diot, "Diagnosing network-wide traffic anomalies," presented at the Conf. Applications, Technologies, Architectures, and Protocols for Computer Communications, Portland, OR, USA, Aug.-Sep. 2004.
[18] A. Lakhina, M. Crovella, and C. Diot, "Mining anomalies using traffic feature distributions," in Proc. Conf. Applications, Technologies, Architectures, and Protocols for Computer Communications, Philadelphia, PA, USA, Aug. 22-26, 2005.
[19] L. Laloux, P. Cizeau, J.-P. Bouchaud, and M. Potters, "Noise dressing of financial correlation matrices," Phys. Rev. Lett., vol. 83, pp. 1467-1470, 1999.
[20] D. R. Lessard, "International portfolio diversification: A multivariate analysis for a group of Latin American countries," J. Finance, vol. 28, no. 3, pp. 619-633, Jun. 1973.
[21] E. Levina, A. Rothman, and J. Zhu, "Sparse estimation of large covariance matrices via a nested Lasso penalty," Ann. Appl. Statist., vol. 2, pp. 245-263, 2008.
[22] P. Li, T. J. Hastie, and K. W. Church, "Very sparse random projections," in Proc. 12th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, 2006, pp. 287-296.
[23] X. Luo, "High dimensional low rank and sparse covariance matrix estimation via convex minimization," 2011 [Online]. Available: arXiv:1111.1133
[24] S. Negahban and M. J. Wainwright, "Estimation of (near) low-rank matrices with noise and high-dimensional scaling," Ann. Statist., vol. 39, no. 2, pp. 1069-1097, 2011.
[25] P. Nomikos and J. F. MacGregor, "Multivariate SPC charts for monitoring batch processes," Technometrics, vol. 37, no. 1, pp. 41-59, Feb. 1995.
[26] C. H. Papadimitriou, P. Raghavan, H. Tamaki, and S. Vempala, "Latent semantic indexing: A probabilistic analysis," J. Comput. Syst. Sci., vol. 61, no. 2, pp. 217-235, 2000.
[27] P. G. Park, "On the trace bound of a matrix product," IEEE Trans. Autom. Control, vol. 41, no. 12, pp. 1799-1802, 1996.
[28] D. Paul, "Asymptotics of sample eigenstructure for a large dimensional spiked covariance model," Statist. Sinica, vol. 17, pp. 1617-1642, 2007.
[29] S. J. Qin, "Statistical process monitoring: Basics and beyond," J. Chemometrics, vol. 17, no. 8-9, pp. 480-502, 2003.
[30] H. Shen and J. Z. Huang, "Sparse principal component analysis via regularized low rank matrix approximation," J. Multivariate Anal., vol. 99, no. 6, pp. 1015-1034, 2008.
[31] S. S. Vempala, The Random Projection Method. Providence, RI, USA: AMS, 2004.
[32] G. S. Watson, Statistics on Spheres. New York, NY, USA: Wiley, 1983.
[33] H. Zou, T. Hastie, and R. Tibshirani, "Sparse principal component analysis," J. Comput. Graph. Statist., vol. 15, pp. 265-286, 2006.
Qi Ding received the B.S. degree in mathematics from Peking University, Beijing, China, in 2005, and the M.A. and Ph.D. degrees in statistics from Boston University, Boston, MA, USA, in 2006 and 2011, respectively. His research interests are in statistical detection procedures for anomaly detection, particularly in network-based contexts.

Eric D. Kolaczyk (M'04-SM'06) received the B.S. degree in mathematics from the University of Chicago, Chicago, IL, USA, in 1990, and the M.S. and Ph.D. degrees in statistics from Stanford University, Stanford, CA, USA, in 1992 and 1994, respectively. He is a Professor of Mathematics and Statistics, and Director of the Program in Statistics, at Boston University, Boston, MA, USA, where he is also affiliated with the Division of Systems Engineering. His research interests lie in the development of sparse inference methods and related problems, particularly in the context of the statistical analysis of networks and network data.