
A Compressed PCA Subspace Method for Anomaly
Detection in High-Dimensional Data
Qi Ding and Eric D. Kolaczyk, Senior Member, IEEE
Abstract—Random projection is widely used as a method of dimension reduction. In recent years, its combination with standard techniques of regression and classification has been explored. Here, we examine its use for anomaly detection in high-dimensional settings, in conjunction with principal component analysis (PCA) and corresponding subspace detection methods. We assume a so-called spiked covariance model for the underlying data generation process and a Gaussian random projection. We adopt a hypothesis testing perspective of the anomaly detection problem, with the test statistic defined to be the magnitude of the residuals of a PCA analysis. Under the null hypothesis of no anomaly, we characterize the relative accuracy with which the mean and variance of the test statistic from compressed data approximate those of the corresponding test statistic from uncompressed data. Furthermore, under a suitable alternative hypothesis, we provide expressions that allow for a comparison of statistical power for detection. Finally, whereas these results correspond to the ideal setting in which the data covariance is known, we show that it is possible to obtain the same order of accuracy when the covariance of the compressed measurements is estimated using a sample covariance, as long as the number of measurements is of the same order of magnitude as the reduced dimensionality. We illustrate the practical impact of our results in the context of predicting volume anomalies in Internet traffic data.

Index Terms—Anomaly detection, principal component analysis, random projection.
I. INTRODUCTION
PRINCIPAL component analysis (PCA) is a classical tool for dimension reduction that remains at the heart of many modern techniques in multivariate statistics and data mining. Among the multitude of uses that have been found for it, PCA often plays a central role in methods for systems monitoring and anomaly detection. A prototypical example of this is the method of Jackson and Mudholkar [11], the so-called PCA subspace projection method. In their approach, PCA is used to extract the primary trends and patterns in data, and the magnitude of the residuals (i.e., the norm of the projection of the data into the residual subspace) is then monitored for departures, with principles from hypothesis testing being used to set detection thresholds. This method has seen widespread usage in industrial systems control (e.g., [5], [25], [29]). More recently, it is also being used in the analysis of financial data (e.g., [9], [19], [20]) and of Internet traffic data (e.g., [17], [18]).

Manuscript received April 09, 2012; revised July 14, 2013; accepted July 18, 2013. Date of publication August 15, 2013; date of current version October 16, 2013. This work was supported in part by NSF grants CCF-0325701 and CNS-0905565 and in part by ONR award N000140910654. The authors are with the Department of Mathematics and Statistics, Boston University, Boston, MA 02215 USA (e-mail: qiding@bu.edu; kolaczyk@bu.edu). Communicated by D. P. Palomar, Associate Editor for Detection and Estimation. Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TIT.2013.2278017
In this paper, we propose a methodology in which PCA subspace projection is applied to data that have first undergone random projection. Two key observations motivate this proposal. First, as is well known, the computational complexity of PCA, when computed using the standard approach based on the singular value decomposition, scales like $O(np \cdot \min(n, p))$, where $p$ is the dimensionality of the data and $n$ is the sample size. Thus, use of the PCA subspace method is increasingly less feasible with the ever-increasing size and dimensions of modern data sets. Second, concerns regarding data confidentiality, whether for proprietary reasons or reasons of privacy, are more and more driving a need for statistical methods to accommodate. The first of these problems is something a number of authors have sought to address in recent years (e.g., [14], [15], [30], [33]), while the second, of course, does not pertain to PCA-based methods alone. Our proposal to incorporate random projection into the PCA subspace method is made with both issues in mind, in that the original data are transformed to a random coordinate space of reduced dimension prior to being processed.
The key application motivating our problem is that of monitoring Internet traffic data. Previous use of PCA subspace methods for traffic monitoring [17], [18] has been largely restricted to the level of traffic traces aggregated over broad metropolitan regions (e.g., New York, Chicago, Los Angeles) for a network covering an entire country or continent (e.g., the United States, Europe, etc.). This level of aggregation is useful for monitoring coarse-scale usage patterns and high-level quality-of-service obligations. However, much of the current interest in the analysis of Internet traffic data revolves around the much finer scale of individual users. Data of this sort can be determined up to the (apparent) identity of individual computing devices, i.e., the so-called IP addresses. But there are as many as $2^{32}$ such IP addresses, making the monitoring of traffic at this level a task guaranteed to involve massive amounts of data of very high dimension. Furthermore, it is typically necessary to anonymize data of this sort, and often it is not possible for anyone outside of the auspices of a particular Internet service provider to work with such data in its original form. The standard technique used when data of this sort are actually shared is to aggregate the IP addresses in a manner similar to the coarsening of geocoding (e.g., giving only information on a town of residence, rather than a street address). Our proposed methodology can be viewed as a stylized prototype,
establishing proof-of-concept for the use of PCA subspace projection methods on data like IP-level Internet traffic in a way that is both computationally feasible and respects concerns for data confidentiality.
Going back to the famous Johnson-Lindenstrauss lemma [12], it is now well known that an appropriately defined random projection will effectively preserve the lengths of data vectors as well as the distances between vectors. This fact lies at the heart of an explosion in recent years of new theory and methods in statistics, machine learning, and signal processing. These include [3], [6], [8]. See, for example, the review [31]. Many of these methods go by names emphasizing the compression inherent in the random projection, such as "compressed sensing" or "compressive sampling." In this spirit, we call our own method compressed PCA subspace projection. The primary contribution of our work is to show that, under certain sparseness conditions on the covariance structure of the original data, the use of Gaussian random projection followed by projection into the PCA residual subspace yields a test statistic whose distributional behavior is comparable to that of the statistic that would have been obtained from PCA subspace projection on the original data. Furthermore, up to higher-order terms, there is no loss in accuracy if an estimated covariance matrix is used, rather than the true (unknown) covariance, as long as the sample size for estimating the covariance is of the same order of magnitude as the dimension of the random projection.
While there is, of course, an enormous amount of literature on PCA and related methods and, in addition, there has emerged in more recent years a substantial literature on random projection and its integration with various methods for classical problems (e.g., regression, classification, etc.), to the best of our knowledge there are only two works that, like ours, explicitly address the use of the tools from these two areas in conjunction with each other. In the case of the first [26], a method of random projection followed by subspace projection [via the singular value decomposition (SVD)] is proposed for speeding up latent semantic indexing for document analysis. It is shown [26, Th. 5] that, with high probability, the result of applying this method to a matrix will yield an approximation of that matrix that is close to what would have been obtained through subspace projection applied to the matrix directly. A similar result is established in [7, Th. 5], where the goal is to separate a signal of interest from an interfering background signal, under the assumption that the subspace within which either the signal of interest or the interfering signal resides is known. In both [26] and [7], the proposed methods use a general class of random projections and fixed subspaces. In contrast, here we restrict our attention specifically to Gaussian random projections but adopt a model-based perspective on the underlying data themselves, specifying that the data derive from a high-dimensional zero-mean multivariate Gaussian distribution with covariance possessed of a compressible set of eigenvalues. In addition, we study the cases of both known and unknown covariances. Our results are formulated within the context of a hypothesis testing problem and, accordingly, we concentrate on understanding the accuracy with which 1) the first two moments of our test statistic are preserved under the null hypothesis, and 2) the power is preserved under an appropriate alternative hypothesis. From this perspective, the probabilistic statements in [7] and [26] can be interpreted as simpler precursors of our results, which nevertheless strongly suggest the feasibility of what we present. Finally, we note too that the authors in [7] also propose a method of detection in a hypothesis testing setting, and provide results quantifying the accuracy of power under random projection, but this is offered separately from their results on subspace projections, and in the context of a model specifying a signal plus white Gaussian noise.
This paper is organized as follows. In Section II, we review the standard PCA subspace projection method and establish appropriate notation for our method of compressed PCA subspace projection. Our main results are stated in Section III, where we characterize the mean and variance behaviors of our compressed test statistic, as well as the size and power of the corresponding statistical test for anomalies based on this statistic. In Section IV, we present the results of a small simulation study, while in Section V, we illustrate the practical impact of our findings in the context of predicting volume anomalies in Internet traffic data. Finally, some brief discussion may be found in Section VI. The proofs for all theoretical results presented herein may be found in the Appendices.
II. BACKGROUND
Let $X$ be a multivariate normal random vector of dimension $p$, with zero mean and positive definite covariance matrix $\Sigma$. Let $\Sigma = U \Lambda U^T$ be the eigen-decomposition of $\Sigma$, with eigenvalues $\lambda_1 \ge \dots \ge \lambda_p > 0$. Denote the prediction of $X$ by the first $r$ principal components of $\Sigma$ as $\hat{X} = U_r U_r^T X$, where $U_r$ contains the first $r$ columns of $U$. Jackson and Mudholkar [11], following an earlier suggestion of Jackson and Morris [10] in the context of photographic processing, propose to use the square of the norm of the residual $X - \hat{X}$ from this prediction as a statistic for testing goodness-of-fit and, more generally, for multivariate quality control. This is what is referred to now in the literature as the PCA subspace method.

Denoting this statistic as

$$Q = \| X - \hat{X} \|^2, \qquad (1)$$

we know that $Q$ is distributed as a linear combination of i.i.d. chi-square random variables. In particular,

$$Q = \sum_{j=r+1}^{p} \lambda_j Z_j^2,$$

where the $\lambda_j$ are the eigenvalues of $\Sigma$ and the $Z_j$ are i.i.d. standard normal random variables. A normal approximation to this distribution is proposed in [11], based on a power transformation and appropriate centering and scaling. Here, however, we will content ourselves with the simpler approximation of $Q$ by a normal with mean and variance

$$E(Q) = \sum_{j=r+1}^{p} \lambda_j \quad \text{and} \quad \mathrm{Var}(Q) = 2 \sum_{j=r+1}^{p} \lambda_j^2,$$

respectively. This approximation is well-justified theoretically (and additionally has been confirmed in preliminary numerical studies analogous to those reported later in this paper) by the
fact that typically $p - r$ will be quite large in our context. In addition, the resulting simplification will be convenient in facilitating our analysis and in rendering more transparent the impact of random projection on our proposed extension of Jackson and Mudholkar's approach.
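To make the preceding concrete, the following sketch computes the residual statistic $Q$ and its simple normal standardization from a known covariance. It is a minimal illustration of the method as described above; the function name and interface are ours, not from the paper.

```python
import numpy as np

def pca_residual_statistic(x, Sigma, r):
    """PCA subspace (residual) statistic of Jackson and Mudholkar.

    x     : (p,) data vector
    Sigma : (p, p) null-model covariance
    r     : number of leading principal components to remove
    """
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    order = np.argsort(eigvals)[::-1]           # sort by decreasing eigenvalue
    eigvals, U = eigvals[order], eigvecs[:, order]
    U_r = U[:, :r]                              # leading-r principal directions
    resid = x - U_r @ (U_r.T @ x)               # projection into residual subspace
    Q = float(resid @ resid)
    # Normal approximation under the null: moments of the linear
    # combination of chi-squares formed by the residual eigenvalues.
    mean = eigvals[r:].sum()
    var = 2.0 * (eigvals[r:] ** 2).sum()
    return Q, (Q - mean) / np.sqrt(var)         # statistic and its z-score
```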
As stated previously, our extension is motivated by a desire to simultaneously achieve dimension reduction and ensure data confidentiality. Accordingly, let $R = (R_{ij})$, for $i = 1, \dots, k$ and $j = 1, \dots, p$, where the $R_{ij}$ are i.i.d. standardized random variables, i.e., such that $E(R_{ij}) = 0$ and $\mathrm{Var}(R_{ij}) = 1$. Throughout this paper, we will assume that the $R_{ij}$ have a standard normal distribution. The random matrix $R$ will be used to induce a random projection

$$x \mapsto y = k^{-1/2} R x. \qquad (2)$$

Note that, under our conditions on the $R_{ij}$, the off-diagonal elements of $k^{-1} R^T R$ tend to zero, while the diagonal elements tend to one, as $k \to \infty$. As a result, at an intuitive level, we may have reason to hope that if $k, p \to \infty$ in an appropriate manner, the resulting matrix behaves sufficiently like the identity matrix for the inner product and the corresponding Euclidean distance to essentially be preserved under the mapping in (2), while reducing the dimensionality of the space from $p$ to $k$.
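As a quick numerical check of this intuition, the sketch below draws a Gaussian random matrix and verifies that $k^{-1} R^T R$ is close to the identity, so that lengths are approximately preserved under the mapping in (2). The $k^{-1/2}$ normalization is our reading of (2) from the surrounding discussion.

```python
import numpy as np

rng = np.random.default_rng(0)
p, k = 2000, 500                       # original and reduced dimensions
R = rng.standard_normal((k, p))        # i.i.d. N(0,1) entries

# Empirically, (1/k) R^T R approaches the p x p identity as k grows:
G = (R.T @ R) / k
print(np.abs(np.diag(G) - 1).max())    # diagonal entries close to 1
print(np.abs(G - np.eye(p)).max())     # off-diagonals of order 1/sqrt(k)

x = rng.standard_normal(p)
y = R @ x / np.sqrt(k)                 # the random projection in (2)
print(y @ y / (x @ x))                 # squared-length ratio close to 1
```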
Thus, under our intended scenario, rather than observing the original random variable $X$ we instead suppose that we see only its projection, which we denote as $Y = k^{-1/2} R X$. Consider now the possibility of applying the PCA subspace method in this new data space. Conditional on the random matrix $R$, the random variable $Y$ is distributed as multivariate normal with mean zero and covariance $\Sigma_Y = k^{-1} R \Sigma R^T$. Denote the eigen-decomposition of this covariance matrix by $\Sigma_Y = V \Gamma V^T$, let $\hat{Y} = V_r V_r^T Y$ represent the prediction of $Y$ by the first $r$ principal components of $\Sigma_Y$, where $V_r$ is the first $r$ columns of $V$, and let $Y - \hat{Y}$ be the corresponding residual. Finally, define the squared norm of this residual as

$$Q_Y = \| Y - \hat{Y} \|^2.$$

The primary contribution of our work is to show that, despite not having observed $X$, and therefore being unable to calculate the statistic $Q$, it is possible, under certain conditions on the covariance $\Sigma$ of $X$, to apply the PCA subspace method to the projected data $Y$, yielding the statistic $Q_Y$, and nevertheless obtain anomaly detection performance comparable to that which would have been yielded by $Q$, with the discrepancy between the two made precise.
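Combining the two previous sketches, the compressed statistic can be computed without ever forming the principal components of the original $p$-dimensional space; only the $k \times k$ covariance of the projected data is decomposed. Again, names and normalization are ours.

```python
import numpy as np

def compressed_pca_statistic(x, Sigma, R, r):
    """Compressed PCA subspace statistic: project x, then apply the
    subspace method in the reduced space, assuming y = R x / sqrt(k)
    and hence Sigma_Y = R Sigma R^T / k."""
    k = R.shape[0]
    y = R @ x / np.sqrt(k)
    Sigma_Y = (R @ Sigma @ R.T) / k               # k x k covariance of y given R
    return pca_residual_statistic(y, Sigma_Y, r)  # sketch from Section II
```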
III. MAIN RESULTS
It is unrealistic to expect that the statistics $Q$ and $Q_Y$ would behave comparably under general conditions. At an intuitive level, it is easy to see that what is necessary here is that the underlying eigen-structure of $\Sigma$ must be sufficiently well preserved under random projection. The relationship between eigenvalues and eigenvectors with and without random projection is an area that is both classical and the focus of much recent activity. See [1], for example, for a recent review. A popular model in this area is the spiked covariance model of Johnstone [13], in which it is assumed that the spectrum of the covariance matrix behaves as

$$\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_r > \lambda_{r+1} = \dots = \lambda_p = 1.$$

This model captures the notion, often encountered in practice, of a covariance whose spectrum exhibits a distinct decay after a relatively few large leading eigenvalues.

All of the results in this section are produced under the assumption of a spiked covariance model. We present three sets of results: 1) characterization of the mean and variance of $Q_Y$, in terms of those of $Q$, in the absence of anomalies; 2) a comparison of the power of detecting certain anomalies under $Q$ and $Q_Y$; and 3) a quantification of the implications of estimation of $\Sigma_Y$ on our results.
A. Mean and Variance of $Q_Y$ in the Absence of Anomalies
We begin by studying the behavior of $Q_Y$ when the data are in fact not anomalous, i.e., when $X$ truly is normal with mean 0 and covariance $\Sigma$. This scenario will correspond to the null hypothesis in the formal detection problem we set up shortly below. Note that under this scenario, similar to $Q$, the statistic $Q_Y$ is distributed, conditional on $R$, as a linear combination of i.i.d. chi-square random variables, with mean and variance given by

$$E(Q_Y \mid R) = \sum_{j=r+1}^{k} \gamma_j \quad \text{and} \quad \mathrm{Var}(Q_Y \mid R) = 2 \sum_{j=r+1}^{k} \gamma_j^2,$$

respectively, where $\gamma_1 \ge \dots \ge \gamma_k$ is the spectrum of $\Sigma_Y$. Our approach to testing will be to first center and scale $Q_Y$, and to then compare the resulting statistic to a standard normal distribution for testing. Therefore, our primary focus in this section is on characterizing the expectation and variance of $Q_Y$.

The expectation of $Q_Y$ may be characterized as follows.

Theorem 1: Assume $k, p \to \infty$ such that $p/k \to c$ for a constant $c > 1$. If $r$ is fixed and $\lambda_1 \ge \dots \ge \lambda_r > 1 + \sqrt{c}$, then under the Gaussian random projection defined in (2),

$$E(Q_Y \mid R) = E(Q) - c \sum_{j=1}^{r} \frac{\lambda_j}{\lambda_j - 1} + O_P(1), \qquad (3)$$

where $O_P(\cdot)$ indicates terms of order big-Oh in probability.

Thus, $E(Q_Y \mid R)$ differs from $E(Q)$ in expectation, conditional on $R$, only by a constant independent of $p$ and $k$. Alternatively, if we divide through by $E(Q)$ and note that under the spiked covariance model

$$\frac{E(Q)}{p} \to 1 \qquad (4)$$

as $p \to \infty$, then from (3) we obtain

$$\frac{E(Q_Y \mid R)}{E(Q)} = 1 + O_P(1/k). \qquad (5)$$

In other words, at the level of expectations, the effect of random projection on our (rescaled) test statistic is to introduce a bias that vanishes like $1/k$.
The variance of $Q_Y$ may be characterized as follows.

Theorem 2: Assume $k, p \to \infty$ such that $p/k \to c$ for a constant $c > 1$. Under the conditions of Theorem 1,

$$\mathrm{Var}(Q_Y \mid R) = \frac{p}{k}\, \mathrm{Var}(Q) \left( 1 + O_P(k/p) \right). \qquad (6)$$

That is, the conditional variance of $Q_Y$ differs from the variance of $Q$ by a factor of $p/k$, with a relative bias term of order $k/p$.
Taken together, Theorems 1 and 2 indicate that application of the PCA subspace method on nonanomalous data after random projection produces a test statistic that is asymptotically unbiased for the statistic $Q$ we would in principle like to use, if the original data were available to us, but whose variance is inflated over that of $Q$ by a factor, $p/k$, depending explicitly on the amount of compression inherent in the projection. In Section IV, we present the results of a small numerical study that show, over a range of compression values $c = p/k$, that the approximations in (5) and (6) are quite accurate.
B. Comparison of Power for Detecting Anomalies
We now consider the comparative theoretical performance of the statistics $Q$ and $Q_Y$ for detecting anomalies. From the perspective of the PCA subspace method, an anomaly is something that deviates from the null model that the multivariate normal vector $X$ has mean zero and covariance $\Sigma$ in such a way that it is visible in the residual subspace, i.e., under projection by $I - U_r U_r^T$. Hence, we treat the anomaly detection problem in this setting as a hypothesis testing problem, in which

$$H_0: X \sim N(0, \Sigma) \quad \text{versus} \quad H_1: X \sim N(\delta u_j, \Sigma), \qquad (7)$$

for some $j > r$ and $\delta > 0$, where $u_j$ denotes the $j$th eigenvector of $\Sigma$. Note that our alternative hypothesis specifies that anomalies are along a single principal component direction. The hypothesis could be stated more generally, with anomalous contributions from multiple principal component directions in the residual subspace. However, since the PCA subspace detection method is based on the norm of the residual $X - \hat{X}$, all the nonzero values in the residual projection subspace will contribute additively to the sum of squares in the statistics, in expectation, and additively in their square, at the level of the variance. Hence, without loss of generality, it is sufficient here to consider anomalies in only a single direction, particularly as it simplifies notation and exposition.
Recall that, as discussed in Section II, it is reasonable in our setting to approximate the distribution of appropriately standardized versions of $Q$ and $Q_Y$ by the standard normal distribution. Under our spiked covariance model, and using the results of Theorems 1 and 2, this means comparing the statistics

$$T = \frac{Q - E(Q)}{\sqrt{\mathrm{Var}(Q)}} \quad \text{and} \quad T_Y = \frac{Q_Y - E(Q_Y \mid R)}{\sqrt{\mathrm{Var}(Q_Y \mid R)}}, \qquad (8)$$

respectively, to the upper $\alpha$ critical value $z_\alpha$ of a standard normal distribution. Accordingly, we define the power functions

$$\beta(\delta) = P(T > z_\alpha) \qquad (9)$$

and

$$\beta_Y(\delta) = P(T_Y > z_\alpha) \qquad (10)$$

for $Q$ and $Q_Y$, respectively, where the probabilities on the right-hand side of these expressions refer to the corresponding approximate normal distribution.
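Under the normal approximation, each power function reduces to a standard normal tail probability evaluated at a shifted critical value, so it is straightforward to tabulate. The sketch below does this using the critical-value forms reconstructed in Theorem 3 below; all numerical values ($p$, $r$, $\delta$, $\alpha$) are illustrative, not taken from the paper.

```python
import numpy as np
from scipy.stats import norm

def normal_power(tau):
    """Power P(Z > tau) of a one-sided test under the normal approximation."""
    return norm.sf(tau)

alpha, p, r, delta = 0.05, 10_000, 5, 10.0
z_alpha = norm.isf(alpha)                     # upper-alpha critical value
shift = delta**2 / np.sqrt(2 * (p - r))       # standardized mean shift for Q
for c in (1, 5, 20, 100):                     # compression ratios p/k
    # After compression, the shift is attenuated by sqrt(c):
    print(c, normal_power(z_alpha - shift / np.sqrt(c)))
```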
Our goal is to understand the relative magnitude of $\beta_Y(\delta)$ compared to $\beta(\delta)$, as a function of $\delta$, $\alpha$, $p$, and $k$. Approximations to the relevant formulas are provided in the following theorem.

Theorem 3: Let $Z$ be a standard normal random variable. Under the same assumptions as Theorems 1 and 2, and a Gaussian approximation to the standardized test statistics, we have that

$$\beta(\delta) \approx P(Z > \tau) \quad \text{and} \quad \beta_Y(\delta) \approx P(Z > \tau_Y),$$

where

$$\tau = z_\alpha - \frac{\delta^2}{\sqrt{2(p - r)}}, \qquad (11)$$

while

$$\tau_Y = z_\alpha - \frac{1}{\sqrt{c}} \cdot \frac{\delta^2}{\sqrt{2(p - r)}} + O_P(\varepsilon_1) + O_P(\varepsilon_2), \qquad (12)$$

with $c = p/k$ the compression ratio and $\varepsilon_1$, $\varepsilon_2$ higher-order error terms.
Ignoring error terms, we see that the critical values (11) and (12) for both power formulas have as their argument quantities of the form $z_\alpha - s$. However, while $s = \delta^2/\sqrt{2(p-r)}$ in (11), we have that $s = \delta^2/\sqrt{2c(p-r)}$ in (12). Hence, all else being held equal, as the compression ratio $c$ increases, the critical value at which power is evaluated shifts increasingly to the right for $Q_Y$, and power decreases accordingly. The extent to which this effect will be apparent is modulated by the magnitude $\delta$ of the anomaly to be detected and the significance level $\alpha$ at which the test is defined, and furthermore by the size $p$ of the original data space. Finally, while these observations can be expected to be most accurate for large $p$ and large $\delta$, in the case that either or both are more comparable in size to the $\varepsilon_1$ and $\varepsilon_2$ error terms in (12), respectively, the latter will play an increasing role and hence affect the accuracy of the stated results.
An illustration may be found in Fig. 1. There we show the power $\beta_Y(\delta)$ as a function of the compression ratio $c = p/k$, for five choices of the anomaly strength $\delta$. Here the dimension before projection is $p = 10\,000$ and the dimension after projection $k$ ranges from 10 000 down to 500. A fixed value of $r$ was used for the dimension of the principal component analysis, and a fixed significance level $\alpha$ was chosen for the size of the underlying test for anomaly. Note that at $c = 1$, on the far left-hand side of the plot, the value simply reduces to $\beta(\delta)$. (Note too that the theory in this paper addresses only compression ratios $c > 1$, and not $c = 1$, which we include in this plot purely for the sake of visual continuity.) So the five curves show the loss of power resulting from compression, as a function of the compression level $c$, for various choices of the strength $\delta$ of the anomaly.

Fig. 1. $\beta_Y(\delta)$ as a function of compression ratio $c$ for various choices of anomaly size $\delta$.

Additional numerical results of a related nature are presented in Section IV.
C. Unknown Covariance
The test statistics $Q$ and $Q_Y$ are defined in terms of the covariance matrices $\Sigma$ and $\Sigma_Y$, respectively. However, in practice, it is unlikely that these matrices are known. Rather, it is more likely that estimates of their values will be used in calculating the test statistics, resulting, say, in statistics $\hat{Q}$ and $\hat{Q}_Y$. In the context of industrial systems control, for example, it is not unreasonable to expect that there will be substantial previous data that may be used for this purpose. As our concern in this paper is on the use of the subspace projection method after random projection, i.e., in the use of $Q_Y$, the relevant question to ask here is what are the implications of using an estimate for $\Sigma_Y$.

We study the natural case where the estimate is simply the sample covariance

$$\hat{\Sigma}_Y = \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})(y_i - \bar{y})^T,$$

for $y_1, \dots, y_n$ i.i.d. copies of the random variable $Y$ and $\bar{y}$ their vector mean. Let $\hat{\Sigma}_Y = \hat{V} \hat{\Gamma} \hat{V}^T$ be the eigen-decomposition of $\hat{\Sigma}_Y$ and, accordingly, define $\hat{Q}_Y = \| Y - \hat{V}_r \hat{V}_r^T Y \|^2$ in analogy to $Q_Y$. We then have the following result.
Theorem 4: Assume $n \ge k$. Then, under the same conditions as Theorem 1,

$$E(\hat{Q}_Y \mid R) = E(Q_Y \mid R)\,(1 + o_P(1)) \qquad (13)$$

and

$$\mathrm{Var}(\hat{Q}_Y \mid R) = \mathrm{Var}(Q_Y \mid R)\,(1 + o_P(1)). \qquad (14)$$

Furthermore, under the conditions of Theorem 3, the power function

$$\hat{\beta}_Y(\delta) = P(\hat{T}_Y > z_\alpha) \qquad (15)$$

can be expressed as $\hat{\beta}_Y(\delta) \approx P(Z > \hat{\tau}_Y)$, where

$$\hat{\tau}_Y = \tau_Y + o_P(1), \qquad (16)$$

with $\tau_Y$ as in (12).

Simply put, the results of the theorem tell us that the accuracy with which compressed PCA subspace projection approximates standard PCA subspace projection in the original data space, when using the estimated covariance $\hat{\Sigma}_Y$ rather than the unknown covariance $\Sigma_Y$, is unchanged, as long as the sample size $n$ used in computing $\hat{\Sigma}_Y$ is at least as large as the dimension $k$ after random projection. Hence, there is an interesting tradeoff between $n$ and $k$, in that the smaller the sample size $n$ that is likely to be available, the smaller the dimension $k$ that must be used in defining our random projection, if the ideal accuracy is to be obtained (i.e., that using the true $\Sigma_Y$). However, decreasing $k$ will degrade the quality of the accuracy in this ideal case, as it increases the compression ratio $c = p/k$.
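In practice, then, the covariance of the compressed measurements can be estimated directly in the reduced space, at $O(n k^2)$ cost. A minimal sketch, with an interface of our own choosing:

```python
import numpy as np

def estimate_sigma_y(Y):
    """Sample covariance of the projected data.

    Y : (n, k) array whose rows are i.i.d. copies of the projected
        observation y. Theorem 4 suggests this plug-in estimate
        suffices once n is at least of the order of k.
    """
    Yc = Y - Y.mean(axis=0)           # center by the vector mean
    return (Yc.T @ Yc) / Y.shape[0]
```

The same quantity is available as `np.cov(Y, rowvar=False)`, up to the $1/(n-1)$ versus $1/n$ normalization.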
IV. SIMULATION
We present three sets of numerical simulation results in this section: one corresponding to Theorems 1 and 2; another, to Theorem 3; and the last, to Theorem 4.

In our first set of experiments, we simulated from the spiked covariance model, drawing both random variables and their projections over many trials, and computed $Q$ and $Q_Y$ for each trial, thus allowing us to compare their respective means and variances. In more detail, we let the dimension of the original random variable be $p = 10\,000$, and assumed $X$ to be distributed as normal with mean zero and (without loss of generality) covariance equal to a diagonal matrix with spiked spectrum as in Section III. The corresponding random projections of $X$ were computed using random matrices $R$ generated as described in the text, with compression ratios $c$ equal to 20, 50, and 100 (i.e., $k = 500$, $200$, and $100$). We used a total of 2000 trials for each realization of $R$, and 30 realizations of $R$ for each choice of $c$.
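The following sketch reproduces the shape of this experiment; the number of spikes and the spike magnitude are illustrative choices of ours (any values comfortably above the detection threshold behave similarly).

```python
import numpy as np

rng = np.random.default_rng(1)
p, r, n_trials = 10_000, 5, 2000
lam = np.ones(p)
lam[:r] = 200.0                                  # spiked spectrum, diagonal WLOG

for c in (20, 50, 100):                          # compression ratios from the text
    k = p // c
    R = rng.standard_normal((k, p))
    Sigma_Y = (R * lam) @ R.T / k                # R diag(lam) R^T / k
    w, V = np.linalg.eigh(Sigma_Y)               # ascending eigenvalues
    gamma, V = w[::-1], V[:, ::-1]
    mean_QY = gamma[r:].sum()                    # null moments of Q_Y given R
    var_QY = 2.0 * (gamma[r:] ** 2).sum()
    X = rng.standard_normal((n_trials, p)) * np.sqrt(lam)  # rows ~ N(0, diag(lam))
    Y = X @ R.T / np.sqrt(k)                     # projected data
    resid = Y - (Y @ V[:, :r]) @ V[:, :r].T      # residual-subspace component
    QY = (resid ** 2).sum(axis=1)
    print(c, QY.mean() / mean_QY, QY.var() / var_QY)  # both ratios near 1
```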
The results of this experiment are summarized in Table I. Recall that Theorems 1 and 2 say that the rescaled mean $E(Q_Y \mid R)/E(Q)$ and the ratio of variances $\mathrm{Var}(Q_Y \mid R)/\mathrm{Var}(Q)$ should be approximately equal to $1$ and $p/k$, respectively. It is clear from these results that, for low levels of compression (i.e., $c = 20$), the approximations in our theorems are quite accurate and that they vary little from one projection to another. For moderate levels of compression (i.e., $c = 50$), they are similarly accurate, although more variable. For high levels of compression (i.e., $c = 100$), we begin to see some nontrivial bias entering, with some accompanying increase in variability as well.

TABLE I. SIMULATION RESULTS ASSESSING THE ACCURACY OF THEOREMS 1 AND 2

Fig. 2. Simulation results assessing the accuracy of Theorems 3 (left) and 4 (right).
In our second set of experiments, we again simulated from a spiked covariance model, but now with nontrivial mean. The spiked spectrum was chosen to be the same as above, but with a smaller dimension $p$, for computational considerations. The mean was defined as in (7), for three choices of the anomaly strength $\delta$. A range of compression ratios $c$ were used. We ran a total of 1000 trials for each realization of $R$, and 30 realizations of $R$ for each combination of $c$ and $\delta$. The statistics $T$ and $T_Y$ were computed as in the statement of Theorem 3 and compared to the critical value $z_\alpha$, corresponding to a one-sided test of size $\alpha$.

The results are shown in Fig. 2. Error bars reflect variation over the different realizations of $R$ and correspond to one standard deviation. The curves shown correspond to the power approximation for $\beta_Y(\delta)$ given in Theorem 3, and are the same as the middle three curves in Fig. 1. We see that for the strongest anomaly level the theoretical approximation matches the empirical results quite closely for all but the highest levels of compression. Similarly, for the weakest anomaly level, the match is also quite good, although there appears to be a small but persistent positive bias in the approximation across all compression levels. In both cases, the variation across choices of $R$ is quite low. The largest bias in the approximation is seen at the moderate anomaly level, at moderate to high levels of compression, although the bias appears to be on par with that at the other anomaly levels for lower compression levels. The largest variation across realizations of $R$ is seen for the moderate anomaly level.
Finally, in our third set of experiments, we repeated the same exercise as in the second set, but for the case where $\Sigma_Y$ is unknown. The design of the simulation was exactly the same as before, except that, for each realization of $R$ and each combination of $c$ and $\delta$, an estimate $\hat{\Sigma}_Y$ of $\Sigma_Y$ was computed from $n$ i.i.d. nonanomalous observations (i.e., with $\delta = 0$). The results of this simulation are also shown in Fig. 2. We see that there is little if any difference when compared to the case of known $\Sigma_Y$, in support of the result in Theorem 4. Note that the error bars in this case include the combined uncertainty due to randomness in both $R$ and $\hat{\Sigma}_Y$.
Taken together, the results of our simulations, although modest in scope, offer some useful guidelines for using our compressed PCA subspace projection method in practice. In particular, we see that compression ratios of up to 10 or even 20 allow for relatively high detection levels for strong signals, and still reasonable detection levels for moderate signals. For weaker signals, however, it appears that detection levels decrease fairly quickly, dropping to as low as 50% already for a compression ratio of only 3. We note that in practice there will also likely be a tradeoff between accuracy and computational burden to be considered. For sufficiently large dimension $p$ of the original data $X$, a substantial amount of compression may nonetheless be tolerable if it is only at that level that the required PCA calculations become computationally feasible. This is especially likely to be the case if the method is to be implemented in a time-dependent fashion (e.g., online).
V. APPLICATION IN INTERNET TRAFFIC ANALYSIS
We illustrate the use of our compressed PCA subspace projection method in the context of computer network traffic analysis. Internet service providers (ISPs) routinely monitor the traffic on their networks for, among other things, anomalous traffic patterns and behavior. One of the most basic types of anomalies for which ISPs monitor are volume anomalies, i.e., anomalous levels in the volume of traffic over a certain period of time. Lakhina et al. [17], [18] have shown that methods based on the PCA subspace projection method can be quite effective at identifying volume anomalies in aggregate network traffic data. Here, we demonstrate that our compressed PCA subspace projection method has the potential to allow for similarly effective anomaly detection at a much finer level of granularity, at or close to the level of individual IP addresses. We do so by comparing the performance of the compressed method to the uncompressed method on the original data.

Fig. 3. Left: Summary network flow time series in units of bytes (top), packets (middle), and flows (bottom). Right: (Log) Spectra of the three traffic time series matrices.
Our analysis uses data from a large European ISP, observed at a single network node corresponding to a major city, during the period May 7 through May 10, 2007. The raw measurements represent observations of origin-destination traffic flows and certain basic characteristics. Relevant to our analysis are the IP addresses of the origin and destination of each flow, as well as the volume of the flow in units of bytes and packets. We will focus on volume anomalies at destinations. There were a total of 656 785 unique destination IP addresses observed during the four-day period of our study.
In order to facilitate our use of the uncompressed PCA subspace projection method on these data, as a comparison, we preprocess the raw data by binning in both time and IP address space. For time, we used five-minute intervals, a fairly standard choice in this area, while for IP addresses we used quantile binning with 11 714 bins. The end result is a set of three multivariate time series, each of dimension $p = 11\,714$, measured over 1152 time bins, with values corresponding to the total volume for each IP address block during each time interval, in units of number of bytes, packets, and flows, respectively.
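A sketch of this kind of preprocessing is given below. It is one plausible reading of the binning step; the input column names, the file format, and the exact quantile rule are all assumptions of ours, not details from the paper.

```python
import pandas as pd

# One row per flow record; hypothetical columns: dst_ip, timestamp, bytes.
flows = pd.read_csv("flows.csv")
flows["time_bin"] = pd.to_datetime(flows["timestamp"]).dt.floor("5min")

# Quantile binning of destination IPs into B blocks by total-volume rank.
B = 11_714
totals = flows.groupby("dst_ip")["bytes"].sum()
ip_block = pd.qcut(totals.rank(method="first"), q=B, labels=False)
flows["ip_block"] = flows["dst_ip"].map(ip_block)

# One multivariate series: rows = five-minute bins, columns = IP blocks.
bytes_ts = flows.pivot_table(index="time_bin", columns="ip_block",
                             values="bytes", aggfunc="sum", fill_value=0)
```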
In Fig. 3 is shown a visualization of these data, obtained simply by summing the multivariate time series across their components (i.e., collapsing across IP address blocks). Also shown are the nontrivial singular values corresponding to each of the three data matrices. Presented on a semilogarithmic scale, these spectra all decay quite sharply, qualitatively similar to what is dictated by a spiked covariance model. Examination of these spectra on their original scale suggests that a small choice of $r$ is reasonable.
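A simple energy-based rule of the sort implied here can be automated as follows; the 90% threshold is an illustrative default of ours, whereas the paper chooses $r$ by inspection of the spectra in Fig. 3.

```python
import numpy as np

def choose_r(data_matrix, tol=0.90):
    """Smallest r whose leading singular values capture a tol
    fraction of the total energy of the centered traffic matrix."""
    Xc = data_matrix - data_matrix.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(energy, tol)) + 1
```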
For each of the three traffic matrices, estimates of the principal components of both $\Sigma$ and $\Sigma_Y$ were obtained, using the method of [33]. The same value of $r$ was used for both the uncompressed and compressed PCA subspace projection methods. The test statistics $Q$ and $Q_Y$ were computed for each of the five-minute time intervals, for each of the three units of volume, resulting in six time-indexed test statistic sequences. These are shown in Fig. 4. Recall that these test statistics measure the extent to which there remains nontrivial structure in the data after projection onto the first $r$ principal components. The figures show that for much of the time there is little if any high-frequency structure missed, although at times there appear to be periods in which low-frequency structure is consistently missed. Following Lakhina et al., we will focus here only on recovery of the high-frequency anomalies.
For the purpose of comparison, in each of the bytes, packets, and flows time series, we generate an anomaly "truth set" by declaring an anomaly to exist at any time point for which the standardized value of $Q$ exceeds $z_\alpha$, for a given $\alpha$. We then similarly do anomaly detection at the same level using the compressed PCA subspace projection method, at a range of compression ratios $c$, for 30 different realizations of the random matrix $R$. Both $Q$ and $Q_Y$ were standardized by robust estimates of mean and standard deviation applied to the sequence of test statistics across time. In Tables II and III are shown the average power of the tests in detecting anomalies. We see that a power of roughly 70% is achieved when testing at the stricter of the two significance levels considered, and that this rises to over 75% at the weaker level. These results suggest that a large fraction of the anomalies that would be declared using subspace projection on the original data would also be detected using subspace projection after random projection, over a range of compression ratios.
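The robust standardization referred to above can be done, for example, with the median and the median absolute deviation, so that the anomalies themselves do not inflate the location and scale estimates; the paper says only "robust estimates," so the particular choice below is ours.

```python
import numpy as np

def robust_standardize(stats):
    """Standardize a time-indexed sequence of test statistics using
    the median and the MAD (scaled for consistency at the Gaussian)."""
    med = np.median(stats)
    mad = np.median(np.abs(stats - med))
    return (stats - med) / (1.4826 * mad)

def detect(stats, z_alpha):
    """Indices of time bins flagged as anomalous."""
    return np.where(robust_standardize(stats) > z_alpha)[0]
```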
Finally, in Fig. 5, we show 3-D plots of the values of $Q$ and of $Q_Y$, for each of the bytes, flows, and packets time series, for any time bin in which an anomaly was declared in at least one of the time series. Lakhina et al. [18] found in their analysis of aggregate-level traffic that such plots showed a strong clustering in this 3-D space, suggesting that there were distinct types of anomalies being captured. Here, we find that the plot corresponding to $Q$-based anomalies shows similar clustering and, moreover, that the plot corresponding to $Q_Y$-based anomalies appears to qualitatively preserve this clustering. Thus, the ability of our compressed PCA subspace method to capture with fairly high accuracy the anomalies declared by the uncompressed method appears to also translate into some ability to recover certain higher level characteristics in the data.

Fig. 4. Plots of $Q$ (left) and $Q_Y$ (right) over time for the network traffic application. Detection thresholds shown in red.

Fig. 5. Three-dimensional scatterplots of $Q$ (left) and $Q_Y$ (right) for each of the bytes, flows, and packets time series. Each point corresponds to an anomaly, colored according to whether it was declared due to excessive volume of bytes (green), flows (blue), or packets (red).

TABLE II. ANOMALY DETECTION POWER UNDER THE STRICTER SIGNIFICANCE LEVEL

TABLE III. ANOMALY DETECTION POWER UNDER THE WEAKER SIGNIFICANCE LEVEL
VI. DISCUSSION
Motivated by dual considerations of dimension reduction and data confidentiality, as well as the wide-ranging and successful implementation of PCA subspace projection, we have introduced a method of compressed PCA subspace projection and characterized key theoretical quantities relating to its use as a tool in anomaly detection. Furthermore, an illustrative implementation of this proposed methodology and its application to detecting IP-level volume anomalies in computer network traffic suggests a high relevance to practical problems.

In introducing our extension here of the classical PCA subspace projection method, we have made use of a handful of assumptions, which are useful for simplifying both exposition and various aspects of the proofs of our results. We conjecture that many of these assumptions may be relaxed. For example, while our assumption of Gaussian data (i.e., for the distribution of $X$) is both classical and natural for PCA-like analyses, given their reliance on the $L_2$ norm, it is unlikely to hold in practice. However, we utilize this Gaussian assumption most directly only in
exploiting its invariance to rotation when assuming, without loss of generality, that the covariance $\Sigma$ itself (rather than just its spectrum) follows a spiked covariance model. This step greatly simplifies the exposition. In addition, the assumption is in principle used initially when we state that the statistics $Q$ and $Q_Y$ are distributed as sums of independent chi-square random variables. We note, however, that in all of our power calculations, we implicitly use a central limit theorem approximation, which can be expected to hold under more general moment conditions and an assumption of sufficiently large $p$.
Similarly, the spiked covariance model we assume (wherein only the first $r$ eigenvalues differ from the value 1, and those have a nontrivial gap separating them from 1) is only an idealization of what is likely to be encountered in practice. This too can be relaxed, at the cost of some additional work. For instance, an assumption that the spectrum lies in a weak $\ell_q$ ball would more accurately capture the notion of a fast but smooth decay, such as exhibited in our analysis of the network traffic data. See [14], for example, for discussion in this direction.
Turning from the data to the random projection (i.e., $R$), there is room there as well for generalization, in that the assumption of Gaussianity that is standard in much of the early work on random projections, and is used here as well, has since been relaxed in many settings. Here, again, similar to our assumption of Gaussian data, we have made use of our assumption of Gaussian projections primarily for the invariance of the Gaussian to rotation and for the nice behavior of its tails. Extension of our results to zero-mean projections with similarly nice tail behavior (e.g., binary projections) should be relatively straightforward.
Additionally, the question of sample size, in relation to the dimension of our problem, is also an interesting direction for future work. The results of Theorem 4 are important in establishing the practical feasibility of our proposed method, wherein the covariance $\Sigma_Y$ must be estimated from data, when it is possible to obtain samples of size $n$ of a similar order of magnitude as the reduced dimension $k$ of our random projection. It would be of interest to establish results of a related nature for the case where $n \ll k$. In that case, it cannot be expected that the classical moment-based estimator $\hat{\Sigma}_Y$ that we have used here will perform acceptably. Instead, an estimator exploiting the structure of $\Sigma_Y$ presumably is needed. However, as most methods in the recent literature on estimation of large, structured covariance matrices assume sparseness of some sort (e.g., [2], [16], [21]), they are unlikely to be applicable here, since $\Sigma_Y$ is roughly of the form identity-plus-low-rank, with the low-rank component of rank $r$ and dense. Similarly, neither will methods of sparse PCA be appropriate (e.g., [14], [15], [30], [33]). Rather, variations on more recently proposed methods aimed directly at capturing low-rank covariance structure hold promise (e.g., [23], [24]). Alternatively, the use of so-called very sparse random projections (e.g., [22]), in place of our Gaussian random projections, would yield sparse covariance matrices $\Sigma_Y$, and hence in principle facilitate the use of sparse inference methods in producing an estimate $\hat{\Sigma}_Y$. But this step would likely come at the cost of making the already fairly detailed technical arguments behind our results more involved still, as we have exploited the Gaussianity of the random projection in certain key places to simplify calculations, as noted above. We note that ultimately, for such approaches to produce results of accuracy similar to that here in Theorem 4, it is necessary that they produce approximations to the PCA subspace of $\Sigma_Y$ with the same order of accuracy.
Finally, we acknowledge that the paradigm explored here, even if extended in the various directions suggested above, is only a caricature of what might be implemented in reality, particularly in contexts like computer network traffic monitoring. There, for example, issues of data management, speed, etc., would become important and can be expected to have nontrivial implications on the design of the type of random projections actually used, or even their number (e.g., increased power might result from using multiple projections, while still maintaining computational savings). Nevertheless, we submit that the results presented in this paper strongly suggest the potential success of an appropriately modified system of this nature.
APPENDIX
In this section, we present proofs of Theorems 1 through 4.
A) Proof of Theorem 1: Suppose the random vector $X \in \mathbb{R}^p$ has a multivariate Gaussian distribution $N(0, \Sigma)$ and $Y = k^{-1/2} R X$, for $R$ as in (2). Denote the eigenvalues of $\Sigma$ and $\Sigma_Y$ as $\lambda_1 \ge \dots \ge \lambda_p$ and $\gamma_1 \ge \dots \ge \gamma_k$, respectively.

In proving Theorem 1, we seek to show that, under the conditions stated in the theorem, the relation (3) holds. Jackson and Mudholkar [11] show that $Q$ will be distributed as $\sum_{j=r+1}^{p} \lambda_j Z_j^2$, where the $Z_j$ are i.i.d. standard normal random variables. Consequently, we have $E(Q) = \sum_{j=r+1}^{p} \lambda_j$ and, similarly, $E(Q_Y \mid R) = \sum_{j=r+1}^{k} \gamma_j$. So comparison of $E(Q)$ and $E(Q_Y \mid R)$ reduces to a comparison of partial sums of the eigenvalues of $\Sigma$ and $\Sigma_Y$.

Since

$$\sum_{j=r+1}^{k} \gamma_j = \sum_{j=1}^{k} \gamma_j - \sum_{j=1}^{r} \gamma_j,$$

in the following proof we will analyze $\sum_{j=1}^{k} \gamma_j$ and $\sum_{j=1}^{r} \gamma_j$ separately, showing that they are close to $\sum_{j=1}^{p} \lambda_j$ and $\sum_{j=1}^{r} \lambda_j \left( 1 + \frac{c}{\lambda_j - 1} \right)$, respectively.
2) The bulk sum $\sum_{j=1}^{k} \gamma_j$: Because orthogonal rotation has no influence on the Gaussian random projection or on the matrix spectrum, to simplify the computation we assume without loss of generality that $\Sigma = \mathrm{diag}(\lambda_1, \dots, \lambda_p)$. So the diagonal elements of $\Sigma_Y = k^{-1} R \Sigma R^T$ are

$$(\Sigma_Y)_{ii} = \frac{1}{k} \sum_{j=1}^{p} \lambda_j R_{ij}^2, \qquad i = 1, \dots, k.$$

We have

$$\sum_{j=1}^{k} \gamma_j = \mathrm{tr}(\Sigma_Y) = \frac{1}{k} \sum_{i=1}^{k} \sum_{j=1}^{p} \lambda_j R_{ij}^2$$

and therefore

$$E\left[ \sum_{j=1}^{k} \gamma_j \right] = \sum_{j=1}^{p} \lambda_j.$$
Under the spiked covariance model assumed in this paper, $\sum_{j=1}^{p} \lambda_j = (p - r) + \sum_{j=1}^{r} \lambda_j$, for fixed $r$. Then, splitting the centered sum $\mathrm{tr}(\Sigma_Y) - \sum_{j=1}^{p} \lambda_j$ into the contribution from the $r$ spiked coordinates and that from the remaining coordinates, as $k, p \to \infty$ the first term will go to zero. The second term can be written as a normalized sum of i.i.d. variables.
More precisely, here we have a triangular array of partial sums. Since the $R_{ij}^2$ are i.i.d., we can re-express the series accordingly. Recalling that the $R_{ij}$ are standard normal random variables, we know that $E(R_{ij}^2) = 1$ and $\mathrm{Var}(R_{ij}^2) = 2$. By the central limit theorem, the series converges to a zero-mean normal in distribution, and is therefore of order $O_P(1)$ as $k, p \to \infty$. As a result of the above arguments,

$$\sum_{j=1}^{k} \gamma_j = \sum_{j=1}^{p} \lambda_j + O_P(1).$$
3) The leading eigenvalues $\gamma_1, \dots, \gamma_r$: Next we examine the behavior of the first $r$ eigenvalues of $\Sigma$ and $\Sigma_Y$, i.e., $\lambda_1, \dots, \lambda_r$ and $\gamma_1, \dots, \gamma_r$. Recalling the definition of $\Sigma_Y$ as $k^{-1} R \Sigma R^T$, we define the matrix $W = \Sigma^{1/2} R^T$. All of the $k$ columns of $W$ are i.i.d. random vectors from $N(0, \Sigma)$, and the matrix $k^{-1} W W^T$ contains the same nonzero eigenvalues as $\Sigma_Y = k^{-1} W^T W$. Through this transformation and interpretation of $k^{-1} W W^T$ as the sample covariance corresponding to $N(0, \Sigma)$, we are able to utilize established results from random matrix theory.
Denote the spectrum of $k^{-1} W W^T$ as $\gamma_1 \ge \gamma_2 \ge \dots$. Under the spiked covariance model, Baik [1] and Paul [28] independently derived the limiting behavior of the elements of this spectrum. For our normal case $N(0, \Sigma)$, Paul [28] proved the asymptotic normality of the leading sample eigenvalues.

Theorem 5: Assume $k, p \to \infty$ such that $p/k \to c$. If $\lambda_j > 1 + \sqrt{c}$, then

$$\gamma_j \to \lambda_j \left( 1 + \frac{c}{\lambda_j - 1} \right) \quad \text{almost surely.}$$

For significantly large leading eigenvalues $\lambda_j$, $\gamma_j$ is asymptotically normal about this limit. And for all of the leading eigenvalues which are above the threshold $1 + \sqrt{c}$, we have $\gamma_j = \lambda_j (1 + c/(\lambda_j - 1)) + O_P(k^{-1/2})$. Recalling the condition in the statement of the theorem, without loss of generality we take $r = 1$ (as we will do, when convenient, throughout the rest of the proofs in these Appendices). Using Paul's result, we have

$$\sum_{j=1}^{r} \gamma_j = \sum_{j=1}^{r} \lambda_j \left( 1 + \frac{c}{\lambda_j - 1} \right) + O_P(k^{-1/2}).$$

Combining these results with those of the previous section, we obtain (3).
D) Proof of Theorem 2: For the proof of this theorem, we seek to show that (6) holds, i.e., that

$$\frac{\mathrm{Var}(Q_Y \mid R)}{\mathrm{Var}(Q)} = \frac{\sum_{j=r+1}^{k} \gamma_j^2}{\sum_{j=r+1}^{p} \lambda_j^2} = \frac{p}{k} \left( 1 + O_P(k/p) \right).$$

For notational convenience, denote $\Sigma_Y$ as $M$, so that $\gamma_1, \dots, \gamma_k$ are the eigenvalues of $M$. Writing $\sum_{j=1}^{k} \gamma_j^2 = \mathrm{tr}(M^2) = \sum_{i, i'} M_{ii'}^2$, and similarly for $\Sigma$, we can work entrywise. Since $M_{ii'} = k^{-1} \sum_{j=1}^{p} \lambda_j R_{ij} R_{i'j}$, we have, using the moments of the standard normal, that if $i = i'$,

$$E(M_{ii}^2) = \frac{1}{k^2} \left[ \Big( \sum_{j} \lambda_j \Big)^2 + 2 \sum_{j} \lambda_j^2 \right],$$

while if $i \neq i'$,

$$E(M_{ii'}^2) = \frac{1}{k^2} \sum_{j} \lambda_j^2.$$
Changing the order of summation, we therefore have

$$E\left[ \mathrm{tr}(M^2) \right] = \sum_{i, i'} E(M_{ii'}^2) = \frac{1}{k} \left[ \Big( \sum_{j} \lambda_j \Big)^2 + (k + 1) \sum_{j} \lambda_j^2 \right],$$

which implies (17). Now under the spiked covariance model, we have $\sum_{j} \lambda_j = (p - r) + \sum_{j \le r} \lambda_j$ and $\sum_{j} \lambda_j^2 = (p - r) + \sum_{j \le r} \lambda_j^2$. As a result, substituting these expressions into (17) yields (18). The control of (18) is not immediate. Let us denote the two terms in the RHS of (18) as $A$ and $B$. Results in the next two sections show that $A$ behaves like $\frac{1}{k} \big( \sum_{j} \lambda_j \big)^2$, and $B$ like $\sum_{j} \lambda_j^2 \, (1 + o_P(1))$. Consequently, Theorem 2 holds.
5) Analysis of $A$: We show in this section that $A = \frac{1}{k} \big( \sum_{j} \lambda_j \big)^2 (1 + o_P(1))$. First note that, by an appeal to the central limit theorem, the fluctuations of the relevant centered sums are controlled. So $A$ can be expressed as a sum of three terms. Using the result by Paul cited in Appendix A.2, in the form of Theorem 5, the first term is found to behave like the leading order stated above. In addition, it is easy to see that the second term is of lower order. Finally, since under the spiked covariance model $\sum_{j=1}^{p} \lambda_j = (p - r) + \sum_{j=1}^{r} \lambda_j$, taking $r = 1$ we have that the third term is of lower order as well.

Combining terms, we find that $A$ behaves as claimed.
6) Analysis of $B$: Term $B$ in (18) can be written as (19). Recalling that under the spiked covariance model only the first $r$ eigenvalues differ from 1, in the following we will analyze the asymptotic behavior of the term $B$ in two stages, by first handling the case $\Sigma = I$ in detail, and second, arguing that the result does not change under the original conditions.

If $\Sigma = I$, which is simply a white noise model, term $B$ becomes (20), which may be usefully re-expressed as (21) and, upon exchanging the order of summation, as (22). Write (22) as $B_1 + B_2$. In the material that immediately follows, we will argue that, under the conditions of the theorem and the white noise model, $B_1$ is of the order stated above and $B_2$ is asymptotically negligible.
To prove the first of these two expressions, we begin by writing $B_1$ as a normalized sum of i.i.d. terms. We will use a central limit theorem argument to control it. A straightforward calculation gives its first moment. To characterize the second moment, we write out the expectation of the square and consider each of three possible types of terms, indexed by triples $(j, j', j'')$:

1) If $j = j' = j''$, then the summand reduces to a single term with expectation 9. Since there are $p$ such choices of $(j, j', j'')$, the contribution of terms from this case to the second moment is of order $p$.

2) If only two of $j, j', j''$ are equal, the summand will take a mixed-moment form with expectation 3. For each pair of distinct indices there are six possible cases: $(j, j, j')$, $(j, j', j)$, $(j', j, j)$, $(j, j', j')$, $(j', j, j')$, $(j', j', j)$. So there are $3p(p-1)$ such terms in this case, yielding a contribution of order $p^2$ to the second moment.

3) If $j, j', j''$ are all different, the expectation of the summand is just 1. Since there are $p^3$ terms in total, the number of such terms in this case, and hence the contribution of this case to the second moment, is $p^3 (1 + o(1))$.

Combining these various calculations, we find the variance of the sum, and hence, by the central limit theorem, we know that the normalized sum is asymptotically normal. Exploiting this, and recalling that $p/k \to c$ by assumption, simple calculations yield that $B_1$ behaves as claimed.
As for $B_2$, it can be shown that its mean and variance are both suitably controlled, from which it follows, by Chebyshev's inequality, that it is asymptotically negligible.

Combining all of the results above, under the white noise model, i.e., when $\Sigma = I$, we have that $B$ behaves as claimed. In the case that the spiked covariance model instead holds, i.e., when the first $r$ eigenvalues exceed 1, it can be shown that the impact on (19) is to introduce an additional term of lower order. The effect is therefore negligible on the final result stated in the theorem. Intuitively, the value of the first $r$ eigenvalues will not influence the asymptotic behavior of the infinite sum in (19), which is term $B$.
G) Proof of Theorem 3: Here, we derive the approximations to the power functions $\beta(\delta)$ and $\beta_Y(\delta)$ given in the statement of the theorem. Through a coordinate transformation, from $X$ to $U^T X$, we can, without loss of generality, restrict our attention to the case where 1) $\Sigma = \mathrm{diag}(\lambda_1, \dots, \lambda_p)$, and 2) our testing problem is of the form

$$H_0: \mu = 0 \quad \text{versus} \quad H_1: \mu = \delta e_j, \quad j > r,$$

for some $\delta > 0$. In other words, we test whether the underlying mean is zero or differs from zero in a single component by some value $\delta$.
Consider first the expression for $\beta(\delta)$ in (9). Under the above model, the residual coordinates of $X$ are independent normal random variables with unit variance (recalling that $\lambda_j = 1$ for $j > r$), exactly one of which has mean $\delta$. So, under the alternative hypothesis, the sum of their squares is a noncentral chi-square random variable, on $p - r$ degrees of freedom, with noncentrality parameter $\delta^2$. We have by standard formulas that

$$E(Q) = (p - r) + \delta^2 \quad \text{and} \quad \mathrm{Var}(Q) = 2(p - r) + 4\delta^2.$$

Using these expressions and the normal-based expression for power defining (9), we find that $\beta(\delta) \approx P(Z > \tau)$, with $\tau$ as in (11), as claimed.
Now consider the expression for $\beta_Y(\delta)$ in (10), where the projected data have mean $\mu_Y = k^{-1/2} R \mu$. Under the null hypothesis, we have the mean and variance of $Q_Y$ given in Section III-A. Call these expressions $m_Y$ and $v_Y$, respectively. Under the alternative hypothesis, the same quantities take the form $m_Y + \Delta_m$ and $v_Y + \Delta_v$, respectively, where $\Delta_m$ is given by (23) and $\Delta_v$ by (24).
Arguing as above, we find the corresponding normal approximation to $\beta_Y(\delta)$. By Theorem 2, we know that $v_Y = (p/k)\, \mathrm{Var}(Q)\, (1 + O_P(k/p))$. Ignoring the higher-order stochastic error term, we therefore have an expression that is the same as that in the statement of Theorem 3, up to a rescaling by a factor of $\sqrt{p/k}$. Hence, it remains for us to show that the mean shift $\Delta_m$ concentrates appropriately.

Our problem is simplified under transformation by the rotation $R \mapsto O R$, where $O$ is an arbitrary orthonormal rotation matrix. If we apply this rotation consistently throughout, then $\Delta_m$ and $\Delta_v$ remain unchanged in (23) and (24). Recall that $\mu = \delta e_j$ with $j > r$, with the $\delta$ in the $j$th location. Choosing $O$ appropriately and denoting the rotated projection matrix again by $R$, straightforward calculations yield expressions for $\Delta_m$ and $\Delta_v$ as the two sums considered next.
Now write $R = (R_{(1)}^T, R_{(2)}^T)^T$, where $R_{(1)}$ denotes the first $r$ rows of the rotated matrix, and $R_{(2)}$, the remaining $k - r$ rows. The elements in the two sums immediately above lie in the $j$th row of the product of $R$ and the last $k - r$ columns of $V$. By Paul [28, Th. 4], we know that if $\lambda_r$, the last leading eigenvalue in the spiked covariance model, is much greater than 1, and $k, p \to \infty$ such that $p/k \to c$, then the distance between the sample and population spike subspaces diminishes to zero. Asymptotically, therefore, we may assume that these two subspaces coincide. Hence, since $R_{(2)}$ is statistically independent of $R_{(1)}$, it follows that it is asymptotically independent of the leading sample eigenvectors, and therefore of the orthogonal complement of their span, i.e., the last $k - r$ columns of $V$. As a result, the elements in question behave asymptotically like i.i.d. standard normal random variables. Applying Chebyshev's inequality in this context, it can be shown that the sums concentrate at their means. Rescaling by $\sqrt{v_Y}$, the expressions for $\tau$ and $\tau_Y$ in Theorem 3 are obtained.
H) Proof of Theorem 4: Let $\hat{V}$ denote the matrix of eigenvectors of $\hat{\Sigma}_Y$ and $\hat{V}_r$ its first $r$ columns. If we use the sample covariance $\hat{\Sigma}_Y$ to estimate $\Sigma_Y$, we will observe the residual $Y - \hat{V}_r \hat{V}_r^T Y$ instead of $Y - V_r V_r^T Y$. To prove the theorem it is sufficient to derive expressions for $E(\hat{Q}_Y \mid R)$ and $\mathrm{Var}(\hat{Q}_Y \mid R)$ under the null and alternative hypothesis in (7), as these expressions are what inform the components of the critical value in the power calculation. Our method of proof involves re-expressing $E(\hat{Q}_Y \mid R)$ and $\mathrm{Var}(\hat{Q}_Y \mid R)$ in terms of $E(Q_Y \mid R)$ and $\mathrm{Var}(Q_Y \mid R)$ and showing that those terms involving the latter are no larger than the error terms associated with the former in Theorems 1, 2, and 3.

Begin by considering the mean and writing the difference of the two expectations in terms of the projections onto the estimated and true subspaces. We need to control the term (25). Under the null hypothesis the second term in (25) is zero, and so to prove (13) we need to show that the first term is suitably small. Note that, without loss of generality, we may express the sample covariance in terms of a random matrix $G$ of i.i.d. standard Gaussian random variables. Then, using [27, Th. II.1], it follows that (26) holds, where we use $\lambda_i(\cdot)$ generically here and below to denote the $i$th largest eigenvalue of its argument.
For the second term in the right-hand side of (26), write the relevant matrix difference explicitly. Using a result attributed to Mori (appearing as Lemma I.1 in [27]), we can bound the corresponding traces, and similarly with $\hat{\Sigma}_Y$ in place of $\Sigma_Y$. Exploiting the linearity of the trace operation, we can bound the term of interest by the extreme eigenvalues involved. However, these are equal to constants times the largest and smallest eigenvalues of a sample covariance of standard Gaussian random variables, which converge almost surely to the right and left endpoints of the Marchenko-Pastur distribution (e.g., [1]), which in this setting take the values $(1 + \sqrt{k/n})^2$ and $(1 - \sqrt{k/n})^2$, respectively. Hence, this term is bounded in probability.

Now consider the factor in the first term of the right-hand side of (26). We have shown above how its trace component behaves. At the same time, we note that it is proportional to the normalized trace of a matrix whose entries are i.i.d. copies of averages of $n$ i.i.d. chi-square random variables on one degree of freedom. Therefore, and recalling the spiked covariance model, we find that it is bounded in probability as well.
At the same time, the factor multiplying this term, i.e., the largest absolute eigenvalue of the matrix difference, is just the operator norm and hence bounded above by the Frobenius norm. We introduce the notation $y_i y_i^T$ for the $i$th observation times its transpose, and similarly in the case of the estimated quantities. To bound the resulting expression, we use a result in Watson [32, Appendix B, (3.8)], relying on a multivariate central limit theorem: the suitably normalized difference converges in distribution, as $n \to \infty$, to a random matrix whose distribution depends only on the $\gamma_j$, which (recall) are the eigenvalues of $\Sigma_Y$. So the Frobenius norm of the difference is suitably controlled.

Therefore, the left-hand side of (26) is controlled, and (13) is established. Now consider the second term in (25), which must be controlled under the alternative hypothesis. This is easily done, as we may write it as a sum of two terms and note that the first term is bounded while the second is of smaller order. Therefore, under the assumption that $n \ge k$, the entire term is of the same order of error to which we approximate (24) in the proof of Theorem 3. Hence, the contribution of the mean to the critical value in (16), using $\hat{\Sigma}_Y$, is the same as in (12), using $\Sigma_Y$.
This completes our treatment of the mean. The variance can be treated similarly, writing the variance of $\hat{Q}_Y$ as that of $Q_Y$ plus two correction terms and controlling the latter. The first of these two terms takes the form (27) and the second, (28). Again, under the null hypothesis, the second terms in (27) and (28) are zero. Hence, to establish (14), it is sufficient to show that the first terms in (27) and (28) are suitably small. We begin by noting a chain of inequalities, where the first inequality follows from [4, Th. 1], and the second, from Cauchy-Schwarz. Straightforward manipulations, along with use of [27, Lemma I.1], yield the necessary bound. At the same time, we have a matching bound for the companion term. Therefore, under the assumptions that $n \ge k$ and $p/k \to c$, we are able to control the relevant error term in (27).

Similarly, using [27, Lemma I.1] again, we have a bound for (28) consisting of two terms, the first of which is bounded in probability, while the second is of smaller order, which allows us to control the relevant error term in (28) as well. As a result, under the null hypothesis, we have the desired control, which is sufficient to establish (14).
Finally, we consider the second terms in (27) and (28), which must be controlled as well under the alternative hypothesis. Writing each out explicitly, it can be seen that both expressions can be bounded by quantities of the same order as before. Therefore, the combined contribution of the second terms in (27) and (28) is of the same order to which we approximate (23) in the proof of Theorem 3. Hence, the contribution of the variance to the critical value in (16), using $\hat{\Sigma}_Y$, is the same as in (12), using $\Sigma_Y$.
ACKNOWLEDGMENT
The authors thank D. Paul for a number of helpful
conversations.
REFERENCES
[1] Z. D. Bai, "Methodologies in spectral analysis of large dimensional random matrices, a review," Statist. Sinica, vol. 9, pp. 611–677, 1999.
[2] P. J. Bickel and E. Levina, "Regularized estimation of large covariance matrices," Ann. Statist., vol. 36, no. 1, pp. 199–227, 2008.
[3] E. Bingham and H. Mannila, "Random projection in dimensionality reduction: Applications to image and text data," in Proc. 7th ACM SIGKDD Int. Conf. Knowledge Discovery Data Mining, 2001, pp. 245–250.
[4] D. W. Chang, "A matrix trace inequality for products of Hermitian matrices," J. Math. Anal. Appl., vol. 237, no. 2, pp. 721–725, 1999.
[5] L. H. Chiang, E. Russell, and R. D. Braatz, Fault Detection and Diagnosis in Industrial Systems. New York, NY, USA: Springer-Verlag, 2001.
[6] S. Dasgupta, "Experiments with random projection," in Proc. 16th Conf. Uncertainty in Artificial Intelligence, 2000, pp. 143–151.
[7] M. A. Davenport, P. T. Boufounos, M. B. Wakin, and R. G. Baraniuk, "Signal processing with compressive measurements," IEEE J. Sel. Topics Signal Process., vol. 4, no. 2, pp. 445–460, Apr. 2010.
[8] D. Fradkin and D. Madigan, "Experiments with random projections for machine learning," in Proc. 9th ACM SIGKDD Int. Conf. Knowledge Discovery Data Mining, 2003, pp. 517–522.
[9] A. Harvey, E. Ruiz, and N. Shephard, "Multivariate stochastic variance models," Rev. Econ. Stud., vol. 61, no. 2, pp. 247–264, 1994.
[10] J. E. Jackson and R. H. Morris, "An application of multivariate quality control to photographic processing," J. Amer. Statist. Assoc., vol. 52, no. 278, pp. 186–199, 1957.
[11] J. E. Jackson and G. S. Mudholkar, "Control procedures for residuals associated with principal component analysis," Technometrics, vol. 21, pp. 341–349, 1979.
[12] W. Johnson and J. Lindenstrauss, "Extensions of Lipschitz mappings into a Hilbert space," Contemp. Math., vol. 26, pp. 189–206, 1984.
[13] I. M. Johnstone, "On the distribution of the largest eigenvalue in principal components analysis," Ann. Statist., vol. 29, pp. 295–327, 2001.
[14] I. M. Johnstone and A. Y. Lu, "Sparse principal components analysis," J. Amer. Statist. Assoc., vol. 104, pp. 682–693, Jun. 2009.
[15] M. Journée, Y. Nesterov, P. Richtárik, and R. Sepulchre, "Generalized power method for sparse principal component analysis," J. Mach. Learn. Res., vol. 11, pp. 517–553, 2010.
[16] N. El Karoui, "Operator norm consistent estimation of large-dimensional sparse covariance matrices," Ann. Statist., vol. 36, pp. 2717–2756, 2008.
[17] A. Lakhina, M. Crovella, and C. Diot, "Diagnosing network-wide traffic anomalies," presented at the Conf. Applications, Technologies, Architectures, Protocols for Computer Communication, Portland, OR, USA, Aug. 30–Sep. 3, 2004.
[18] A. Lakhina, M. Crovella, and C. Diot, "Mining anomalies using traffic feature distributions," in Proc. Conf. Applications, Technologies, Architectures, Protocols for Computer Communication, Philadelphia, PA, USA, Aug. 22–26, 2005.
[19] L. Laloux, P. Cizeau, J. P. Bouchaud, and M. Potters, "Noise dressing of financial correlation matrices," Phys. Rev. Lett., vol. 83, pp. 1467–1470, 1999.
[20] D. R. Lessard, "International portfolio diversification: A multivariate analysis for a group of Latin American countries," J. Finance, vol. 28, no. 3, pp. 619–633, Jun. 1973.
[21] E. Levina, A. Rothman, and J. Zhu, "Sparse estimation of large covariance matrices via a nested Lasso penalty," Ann. Appl. Statist., vol. 2, pp. 245–263, 2008.
[22] P. Li, T. J. Hastie, and K. W. Church, "Very sparse random projections," in Proc. 12th ACM SIGKDD Int. Conf. Knowledge Discovery Data Mining, 2006, pp. 287–296.
[23] X. Luo, "High dimensional low rank and sparse covariance matrix estimation via convex minimization," 2011 [Online]. Available: arXiv:1111.1133
[24] S. Negahban and M. J. Wainwright, "Estimation of (near) low-rank matrices with noise and high-dimensional scaling," Ann. Statist., vol. 39, no. 2, pp. 1069–1097, 2011.
[25] P. Nomikos and J. F. MacGregor, "Multivariate SPC charts for monitoring batch processes," Technometrics, vol. 37, no. 1, pp. 41–59, Feb. 1995.
[26] C. H. Papadimitriou, P. Raghavan, H. Tamaki, and S. Vempala, "Latent semantic indexing: A probabilistic analysis," J. Comput. Syst. Sci., vol. 61, no. 2, pp. 217–235, 2000.
[27] P. G. Park, "On the trace bound of a matrix product," IEEE Trans. Autom. Control, vol. 41, no. 12, pp. 1799–1802, 1996.
[28] D. Paul, "Asymptotics of sample eigenstructure for a large dimensional spiked covariance model," Statist. Sinica, vol. 17, pp. 1617–1642, 2007.
[29] S. J. Qin, "Statistical process monitoring: Basics and beyond," J. Chemometrics, vol. 17, no. 8–9, pp. 480–502, 2003.
[30] H. Shen and J. Z. Huang, "Sparse principal component analysis via regularized low rank matrix approximation," J. Multivariate Anal., vol. 99, no. 6, pp. 1015–1034, 2008.
[31] S. S. Vempala, The Random Projection Method. Providence, RI, USA: AMS, 2004.
[32] G. S. Watson, Statistics on Spheres. New York, NY, USA: Wiley, 1983.
[33] H. Zou, T. Hastie, and R. Tibshirani, "Sparse principal components analysis," J. Comput. Graph. Statist., vol. 15, pp. 265–286, 2006.
Qi Ding received the B.S. degree in mathematics from Peking University, Beijing, China, in 2005, and the M.A. and Ph.D. degrees in statistics from Boston University, Boston, MA, USA, in 2006 and 2011, respectively. His research interests are in statistical detection procedures for anomaly detection, particularly in network-based contexts.

Eric D. Kolaczyk (M'04-SM'06) received the B.S. degree in mathematics from the University of Chicago, Chicago, IL, USA, in 1990, and the M.S. and Ph.D. degrees in statistics from Stanford University, Stanford, CA, USA, in 1992 and 1994, respectively. He is Professor of Mathematics and Statistics, and Director of the Program in Statistics, at Boston University, Boston, MA, USA, where he is also affiliated with the Division of Systems Engineering. His research interests lie in the development of sparse inference methods and related problems, particularly in the context of the statistical analysis of networks and network data.
