
Thesis Proposal:

Graph Structured Statistical Inference


James Sharpnack
Machine Learning Department and Department of Statistics
Carnegie Mellon University
Thesis Committee:
Aarti Singh (Advisor)
Alessandro Rinaldo (Advisor)
Larry Wasserman
Gary Miller
Ery Arias-Castro (External)

August 2012
Abstract
This thesis addresses statistical estimation and testing of signals over a graph when measurements are noisy and
high-dimensional. Graph structured patterns appear in applications as diverse as sensor networks, virology in human
networks, congestion in internet routers, and advertising in social networks. We will develop asymptotic guarantees of the performance of statistical estimators and tests, stating conditions for consistency in terms of properties of the graph (e.g. graph spectra). The goal of this thesis is to demonstrate theoretically that by exploiting the graph structure one can achieve statistical consistency in extremely noisy conditions.
We begin with the study of a projection estimator called Laplacian eigenmaps, and find that eigenvalue concentration plays a central role in the ability to estimate graph structured patterns. We continue with the study of the edge lasso, a least squares procedure with a total variation penalty, and determine combinatorial conditions under which changepoints (edges across which the underlying signal changes) on the graph are recovered. We will shift focus
to testing for anomalous activations in the graph, using a scan statistic relaxation and through the construction of a
spanning tree wavelet basis. Finally, we study the consistency of kernel density estimation for vertex valued random
variables with densities that are Lipschitz with respect to graph metrics.

1 Introduction
Statistical inference is inherently difficult when there are few samples and the parameter space is large. The only way
to avoid this problem, given a limited amount of data, is to impose constraints on the parameter space. This thesis
focuses on the problem of detecting, localizing and estimating patterns over a graph when observations are corrupted
by noise. Hence, we consider the case when parameter constraints derive from a graph structure, generally graphs
given by real world networks.
The problem of estimating graph-structured activations is relevant to many applications including identifying
congestion in router and road networks, eliciting preferences in social networks, and localizing viruses in human and
computer networks. While several machine learning algorithms are designed to estimate graph-structured patterns [44, 49, 11], very few statistical guarantees are known. Much less work addresses the detection of anomalous patterns in
graphs from a statistical testing perspective. This is despite a variety of real-world applications such as community
detection in social networks, surveillance, disease outbreak detection, biomedical imaging, sensor network detection,
gene network analysis, environmental monitoring and malware detection. Recent theoretical contributions in the statistical literature [3, 1] have detailed the inherent difficulty of such a testing problem, but have positive results only
under restrictive conditions on the graph topology. By combining knowledge from high-dimensional statistics, graph
theory and mathematical programming, the characterization of detection algorithms over any graph topology by their
statistical properties is possible.
Aside from the statistical challenges, the computational complexity of these algorithms must be addressed. Due to
the combinatorial nature of graph based methods, problems can easily shift from having polynomial-time algorithms
to having running times exponential in the size of the graph. The applications of graph structured inference require
that any method be scalable to large graphs. As we will see, some of the proposed statistical procedures are intractable,
suggesting that approximation algorithms and relaxations are necessary. Luckily, computer science boasts a plethora
of efficient graph based algorithms that are adaptable to these statistical problems.

1.1 Problem Setup

We will be studying the setting in which a graph that provides the structure to our inference is known a priori, and
we are tasked with identifying latent parameters over the graph. An undirected graph is a set of vertices V, which can be taken to be the natural numbers [p] = {1, . . . , p} for some p > 0, and pairs of vertices E ⊆ V × V called the edges (with |E| = m). To each edge is possibly associated a weight, which we will denote W_e for e ∈ E. In this case we have a weighted graph, and we let the graph be the triplet G = (V, E, W). We are now ready to define
the graph-structured normal means problem, which will be the main focus of this thesis. Later in section 6, we will
discuss density estimation over graphs, which has a different setup than the normal means problem.

1.1.1 Normal Means Problem

In the graph-structured normal means problem, we observe one realization of the random vector

y = x + ε,    (1)

where x ∈ ℝ^V and ε ∼ N(0, σ²I_p) is Gaussian white noise with known variance σ². The goal is to make inferences regarding the unknown x, when it is believed to be smooth with respect to a graph. Before we define what it means for x to be smooth over a graph, we must first introduce two differential operators: the incidence matrix and the graph Laplacian.
We begin by constructing (arbitrarily) an orientation of G by defining a head e⁺ ∈ e and tail e⁻ ∈ e for each edge. The incidence matrix ∇ ∈ ℝ^{E×V} for the oriented graph is the matrix whose (e, v) entry is 1 if v = e⁺, −1 if v = e⁻, and 0 otherwise. The incidence matrix is indeed the discrete analogue of the gradient operator, a comparison to which we will frequently adhere. Another commonly studied discrete differential operator is the Laplacian of G, which is defined as Δ = ∇ᵀ∇. Let d_v = Σ_{w∈V} W_{v,w} and let D = diag({d_v}_{v∈V}) be the diagonal degree matrix. Then Δ = D − W is positive semi-definite and, for z ∈ ℝ^V, zᵀΔz = Σ_{(v,w)∈E} W_{v,w}(z_v − z_w)².
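For concreteness, the following NumPy sketch builds both operators for a toy weighted graph and checks the two identities above. Scaling each row of ∇ by √W_e is one standard convention (an assumption of this sketch) that makes Δ = ∇ᵀ∇ hold in the weighted case; for unweighted graphs the entries are ±1 exactly as defined above.

```python
import numpy as np

# Toy weighted graph on p = 4 vertices; each edge is (head, tail, weight),
# with the orientation chosen arbitrarily, as in the text.
edges = [(0, 1, 1.0), (1, 2, 2.0), (2, 3, 1.0), (3, 0, 1.0)]
p, m = 4, len(edges)

grad = np.zeros((m, p))               # incidence matrix: rows indexed by edges
W = np.zeros((p, p))
for e, (head, tail, w) in enumerate(edges):
    grad[e, head], grad[e, tail] = np.sqrt(w), -np.sqrt(w)
    W[head, tail] = W[tail, head] = w

D = np.diag(W.sum(axis=1))            # diagonal degree matrix
laplacian = D - W

# Delta = grad^T grad, and z^T Delta z = sum over edges of W_e (z_{e+} - z_{e-})^2
z = np.random.randn(p)
assert np.allclose(grad.T @ grad, laplacian)
assert np.isclose(z @ laplacian @ z,
                  sum(w * (z[h] - z[t]) ** 2 for h, t, w in edges))
```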
Much of the study of the Laplacian revolves around its spectrum, which will play a central role in this thesis. Let us denote the decomposition Δ = UΛUᵀ for U ∈ ℝ^{p×p} and diagonal Λ = diag({λ_i}_{i=1}^p). Furthermore, let λ_i ≤ λ_{i+1} without loss of generality. Then it is known that λ₁ = 0 and that |{i : λ_i = 0}| is the number of connected components of the graph G. Furthermore, because of the invariance of the trace under rotations, Σ_{i=1}^p λ_i = Σ_{i=1}^p d_i, hence the average eigenvalue is equal to the average degree.
For a vector z ∈ ℝ^p, define supp(z) = {i ∈ [p] : z_i ≠ 0} ([p] = {1, . . . , p}), z̄ = (1/p) Σ_{i=1}^p z_i, z̃ = z − z̄1, and ‖z‖₀ = |supp(z)|. Furthermore, define the ℓ_k norms for k > 0 to be ‖z‖_k = (Σ_{i∈[p]} |z_i|^k)^{1/k}, with ‖z‖_∞ = max_{i∈[p]} |z_i|. We will also be considering the induced norms of matrices: specifically, for a matrix M let |||M|||_{k,l} = sup_{‖z‖_k ≤ 1} ‖Mz‖_l. Immediately, we see that ‖∇z‖₂² = zᵀΔz and supp(∇z) = {e ∈ E : z_{e⁺} ≠ z_{e⁻}}. Furthermore, if A ⊆ V then let Ā = V \ A and let ∂A denote the edges leaving A (the boundary of A). This thesis studies estimation and detection primarily with respect to several distinct function classes, with parameter ρ > 0:
1. ℓ₂ graph-structure: the class X₂(ρ) = {x ∈ ℝ^V : ‖∇x‖₂² ≤ ρ}

2. ℓ₀ graph-structure: the class X₀(ρ) = {x ∈ ℝ^V : ‖∇x‖₀ ≤ ρ}

3. balanced graph-structure: for each x ∈ X_b(ρ) there exists A ⊆ V such that x = b₀1_A + b₁1_Ā with n|∂A|/(|A||Ā|) ≤ ρ, for some b₀, b₁ ∈ ℝ

4. ℓ_∞ graph-structure: the class X_∞(ρ) = {x ∈ ℝ^V : ‖∇x‖_∞ ≤ ρ} for ρ > 0

5. Lipschitz classes: given a metric over V, d : V × V → ℝ₊, we can consider the classes X_d(ρ) = {x ∈ ℝ^V : ∀v, w ∈ V, |x_v − x_w| ≤ ρ d(v, w)}. Notice that ℓ_∞ graph-structure is an example of this with the shortest path length distance in an unweighted graph.

We will also consider two separate prior distributions that are induced by the graph G, the Ising prior and the Gaussian graphical model (GGM). Below p(x) denotes the density with respect to either the Lebesgue measure (in the GGM) or the discrete measure over the hypercube {0, 1}^p (for Ising).

Gaussian graphical model:   p(x) ∝ exp(−xᵀΣ⁻¹x)
Ising model:                p(x) ∝ exp(−xᵀΔx)    (2)

In the Gaussian graphical model, Σ⁻¹ denotes the inverse covariance matrix, whose zero entries indicate the absence of an edge between the corresponding nodes in the graph. With these function classes and priors in mind, we study the statistical performance of estimation, localization, and detection.

1.1.2 Estimation and Changepoint Localization

There are two standards that we will ask of an estimator x̂ in the normal means problem: recovery with respect to an ℓ₂ loss, and recovery of the structure of x through its changepoints, supp(∇x). In section 2 we will consider a Laplacian eigenmaps projection estimator x̂. We will say that the estimator is ℓ₂ consistent if

(1/p) ‖x̂ − x‖₂² →^P 0.

We will highlight conditions under which we can achieve ℓ₂ consistency for the ℓ₂ graph structure, the Ising prior, and the GGM in section 2. Changepoint sparsistency is when we hope to recover exactly supp(∇x); in fact, we require that the signs of the changes be correctly recovered, i.e.,

lim P{sign(∇x̂) = sign(∇x)} = 1.

In section 3, we study the changepoint sparsistency of the edge lasso, an instantiation of the generalized lasso [61]. We will see that while changepoint sparsistency may be a strong criterion, it directly leads to nearly-oracle rates of convergence for the ℓ₂ loss.

1.1.3 Detection

The detection of graph-structured signals may refer to one of two things: detecting an anomalous cluster against a zero background activation, or detecting an anomalous cluster against a constant background activation. In section 4, we study the case of constant background activation with a balanced alternative, in which we assume the following testing hypotheses:

H₀ : x̃ = 0   vs   H₁ : x ∈ X_b(ρ), ‖x̃‖ ≥ μ    (3)

In section 5, we study zero background activation, in which we assume the null and alternative hypotheses:

H₀ : x = 0   vs   H₁ : x ∈ X₀(ρ), ‖x‖₂ ≥ μ    (4)

In both cases, H₀ represents business as usual while H₁ encompasses all of the foreseeable anomalous activity. It is the composite nature of H₁ that causes theoretical difficulties.
It is imperative that we control both the probability of false alarm and the false acceptance of the null. Let a test be a mapping T(y) ∈ {0, 1}, where 1 indicates that we reject the null. To this end, we define our measure of risk to be

R(T) = sup_{x∈H₀} E_x[T] + sup_{x∈H₁} E_x[1 − T],

where E_x denotes the expectation with respect to y ∼ N(x, σ²I_p). The test T may be randomized, in which case the risk is E_T R(T). Notice that if the distribution of the random test T is independent of x, then E_T sup_{x∈X} E_x[1 − T] = sup_{x∈X} E_{T,x}[1 − T]. This is the setting of [3], which we should contrast to the Bayesian setup in [1]. We will say that H₀ and H₁ are asymptotically distinguished by a test T if lim_{n→∞} R(T) = 0. If such a test exists then H₀ and H₁ are asymptotically distinguished; otherwise they are asymptotically indistinguishable.
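As an illustration of this risk, the following is a minimal Monte Carlo sketch. Replacing the suprema with maxima over finite lists of representative signals, and the particular energy-detector example with its arbitrary threshold, are assumptions made only for illustration.

```python
import numpy as np

def monte_carlo_risk(test, null_signals, alt_signals, sigma, trials=2000, seed=0):
    """Estimate R(T) = sup_{H0} E_x[T] + sup_{H1} E_x[1 - T] by simulation."""
    rng = np.random.default_rng(seed)
    def rejection_rate(x):
        return np.mean([test(x + sigma * rng.standard_normal(x.shape))
                        for _ in range(trials)])
    false_alarm = max(rejection_rate(x) for x in null_signals)
    miss = max(1.0 - rejection_rate(x) for x in alt_signals)
    return false_alarm + miss

# Example: an energy detector (see Section 4) with an arbitrary threshold.
p, sigma = 100, 1.0
energy_test = lambda y: float(np.sum(y ** 2) > 130.0)
risk = monte_carlo_risk(energy_test, [np.zeros(p)], [np.full(p, 0.6)], sigma)
```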

1.1.4 Density Estimation

The setting for density estimation over graphs differs from the estimation and detection framework and requires a different analysis. We assume that we are given n independent identically distributed observations of a vertex valued random variable, z = [z₁, . . . , z_n]. Specifically, for v ∈ V, P{z_i = v} = f_v for a density f ∈ ℝ^V with f_v ≥ 0 and ‖f‖₁ = 1. The goal is to estimate f through z when it is believed that f belongs to a Lipschitz class for some metric d. In section 6, we will study the kernel density estimator, f̂, and will provide conditions for consistency, i.e. ‖f̂ − f‖_∞ →^P 0. Through empirical process theory we will show that we can control the variance of the kernel density estimator (KDE), relating the VC dimension of the KDE class to metric packing numbers. This theory is applicable to determining distributions of agents in networks and to clustering when the structure is given by a graph.

1.2 Related Work

In section 2, we will analyze the asymptotic statistical performance of projection estimators on graphs, with both a
frequentist result and Bayesian corollary for the Ising prior. Much work has been devoted to the use of shrinkage
estimators for ellipsoid constraints, resulting in the asymptotic optimality of the Pinsker estimator for low-noise asymptotics [14, 33, 8]. Invariably, efficient estimators with ellipsoid constraints shrink components of the observed
vector that correspond to minor axes of the ellipsoid. Other shrinkage procedures have been studied extensively, such
as the projection estimator and Tikhonov regularization [62]. For simplicity we study the performance of projection
estimators, but characterize their statistical consistency through spectral graph theory.
Markov random fields (MRF) provide a succinct framework in which the underlying signal is modeled as a draw
from an Ising or Potts model [11, 49]. Most work on MRFs suggests the use of the maximum a posteriori (MAP)
estimator which is the Bayes rule under 0/1 loss. Less is known about the Bayes rule under Hamming distance loss,
in which the estimator is the posterior centroid [10], a procedure known to be computationally intractable. A similar
line of research is the use of kernels over graphs. The study of kernels over graphs began with the development of
diffusion kernels [36], and was extended through Green's functions on graphs [57]. A related body of work extends
marginalized kernels to graphs [34, 45], while recently it has been shown that this and the aforementioned definitions
are members of an overarching framework with computationally efficient constructions [64]. While kernels on graphs
provide computationally efficient procedures for inference on graphs, much less is known about their asymptotic
statistical efficiency.
We find that the statistical consistency of specific shrinkage estimators implies that the Laplacian eigenbasis gives statistically efficient representations of graph structured patterns. There have been several attempts at constructing multi-scale bases for graphs that can efficiently represent localized activation patterns, notably diffusion wavelets [12] and treelets [37]; however, their approximation capabilities are not well understood. [56] and [23] independently proposed unbalanced Haar wavelets and characterized their approximation properties for tree-structured binary patterns.
In section 3, we study a total variation denoising procedure, called the edge lasso, over general graph structures.
The edge lasso is a generalization of the fused lasso originally proposed in [60] to enable recovery of high-dimensional
patterns which are smooth (piece-wise constant) on a graph. The key idea is to penalize the ℓ₁-norm of differences of measurements at vertices that share an edge, to encourage sparsity over the edges which connect vertices that have
different signal values. While there have been some attempts [61, 29, 41] at coming up with efficient algorithms for
solving the fused lasso optimization, a theoretical analysis of its performance is mostly lacking. The only exceptions, to the best of our knowledge, are [50, 27], which analyze the linear graph topology.
In sections 4 and 5, we consider statistical tests for the graph structured normal means problem. Normal means
testing in high dimensions is a well established and fundamental problem in statistics. Much is known when H₁ derives from a smooth function space such as a Besov or Sobolev space [30, 31]. Only recently have combinatorial structures such as graphs been proposed as the underlying structure of H₁. A significant portion of the recent work
in this area ([5, 4, 3, 1]) has focused on incorporating structural assumptions on the signal, as a way to mitigate the
effect of high-dimensionality and also because many real-life problems can be represented as instances of the normal
means problem with graph-structured signals (see, for an example, [32]).
Another line of research relevant to our problem is optimal fault detection with nuisance parameters and matched
subspace detection in the signal processing literature: see, e.g. [51, 7, 19, 18]. Though our problem can be cast as a
special case of the more general problem of optimal testing of a linear subspace with nuisance parameters, the focus
on a graph-structured signal, as well as the type of analysis based on the interplay between the scan statistics and the
spectral properties of the graph contained in this work, is novel.
In section 6, we study kernel density estimation over graphs. We benefit from the extensive literature on KDEs in
Euclidean spaces. Much is known about the use of KDEs and histogram estimators with respect to a total variation
loss, also known as the 1 loss[13]. Recently, through developments in empirical process theory, consistency and
rates have been determined for KDEs[24, 22, 25]. This type of analysis allows for confidence bands of the density
estimate [26]. This thesis will develop results of this type for KDEs over graphs, by mirroring when possible the proof techniques developed for Euclidean spaces.

2 Laplacian Eigenmaps for Normal Means Estimation


This section is devoted to the analysis of an estimator based on the graph Laplacian eigenbasis, and establish conditions for 2 -consistency when latent patterns arise from the Ising and GGM models in (2) or when the pattern
is 2 -graph structured. For both deterministic and probabilistic network evolution models, the results indicate that
by leveraging the network interaction structure, it is possible to consistently recover high-dimensional patterns even
when the noise variance increases with network size. Below is a summary of the contributions that can be found in
the subsections. A more detailed analysis with proofs can be found in [55].

1. Main result: under the Ising and GGM priors and the ℓ₂ graph structure, the ℓ₂ risk of the Laplacian eigenmaps projection is bounded by a function of the Laplacian eigenvalues.
2. Hierarchical block structure: under the hierarchical block structure, the Laplacian eigenvectors give the Haar wavelet basis, and we can achieve ℓ₂ consistency with polynomially growing noise variance.
3. Regular lattice structure: for a lattice of increasing dimension, Laplacian eigenmaps achieves ℓ₂ consistency with polynomially growing noise variance.
4. Random graph structure: for the supercritical Erdős–Rényi graph, we can achieve ℓ₂ consistency with polynomially growing noise variance.

2.1 Main Result for Laplacian Eigenmaps

If the network activation patterns are generated by a Gaussian graphical model, it is easy to see that the eigenvalues of the Laplacian (inverse covariance) determine the MSE decay. Consider the GGM prior as in (2); then the posterior distribution is

x|y ∼ N( (2σ²Σ⁻¹ + I)⁻¹ y, (2Σ⁻¹ + σ⁻²I)⁻¹ ),    (5)

where I is the identity matrix. The posterior mean is the Bayes optimal estimator, with Bayes MSE (1/p) Σ_{i∈[p]} (2λ_i + σ⁻²)⁻¹, where {λ_i}_{i∈[p]} are the ordered eigenvalues of Σ⁻¹. The binary Ising model is essentially a discrete version of the GGM; however, the Bayes rule and risk for the Ising model have no known closed form. For binary graph-structured patterns drawn from an Ising prior, we suggest a different estimator based on projections onto the graph Laplacian eigenbasis. Recall that the graph Laplacian has spectral decomposition Δ = UΛUᵀ, and denote the first k eigenvectors (corresponding to the smallest eigenvalues) of Δ by U_{[k]}. Define the estimator

x̂_k = U_{[k]} U_{[k]}ᵀ y,    (6)

which is a hard thresholding of the projection of the network measurements onto the graph Laplacian eigenbasis. The following theorem bounds the MSE of this estimator.
Theorem 1.
1. The maximum MSE of the estimator in (6) for the observation model in (1), when the activation patterns satisfy xᵀΔx ≤ ρ (i.e. lie within X₂(ρ)), is bounded as

R := sup_{x : xᵀΔx ≤ ρ} (1/p) E‖x̂_k − x‖² ≤ min{1, ρ/(pλ_{k+1})} + kσ²/p.

2. The Bayes MSE of the estimator in (6) for the observation model in (1), when the activation patterns are drawn from the GGM prior, is bounded as

R_B := (1/p) E_{x,ε}‖x̂_k − x‖² = (1/p) Σ_{i=k+1}^p 1/(2λ_i) + kσ²/p ≤ 1/(2λ_{k+1}) + kσ²/p.

3. The Bayes MSE when the binary activation patterns are drawn from the Ising prior is bounded as

R_B := (1/p) E_{x,ε}‖x̂_k − x‖² ≤ min{1, c/λ_{k+1}} + kσ²/p + e^{−γp},

where c > 0 and 0 < γ < 2 are constants and λ_{k+1} is the (k + 1)th smallest eigenvalue of Δ.
Through this bias-variance decomposition, we see that the eigenspectrum of the graph Laplacian determines a bound on the MSE for binary graph-structured activations.

Remark 2. Consider the binarized estimator x̂ᵇ_i = 1{x̂_i > 1/2}, i ∈ [p]. Then the results of Theorem 1 (3) also provide an upper bound on the expected Hamming distance of this new estimator, since (1/p)E[d_H(x̂ᵇ, x)] = MSE(x̂ᵇ) ≤ 4 MSE(x̂), by the triangle inequality.

2.2 Asymptotic performance under specific graph models

We now discuss the eigenspectrum of some simple graphs and use the MSE bounds derived in the previous section
to analyze the amount of noise that can be tolerated while ensuring consistent MSE recovery of high-dimensional
patterns. In all these examples, we find that the tolerable noise level scales as σ² = o(p^α), where α ∈ (0, 1) characterizes the strength of network interactions.

Figure 1: Weight matrices corresponding to hierarchical dependencies between node variables.


2.2.1 Hierarchical structure

Consider that, under an appropriate permutation of rows and columns, the weight matrix W has the hierarchical block form shown in Figure 1. This corresponds to hierarchical graph structured dependencies between node variables, where β_ℓ > β_{ℓ+1} denotes the strength of interactions between nodes that are in the same block at level ℓ = 0, 1, . . . , L. We find that in this case the eigenvectors U of the graph Laplacian correspond to an unbalanced Haar wavelet basis (proposed in [56, 23]). Using the bound on the MSE given in Theorem 1, we can now derive the noise threshold that allows for consistent MSE recovery of high-dimensional patterns as the network size p → ∞.
Corollary 3. Consider a graph-structured pattern under any of the conditions of Theorem 1 (1)-(3) under the hierarchical block graph. If β_ℓ = β 2^{−ℓ(1−α)} for ℓ ≤ log₂ p + 1, for constants β, α ∈ (0, 1), and β_ℓ = 0 otherwise, then the noise threshold for consistent MSE recovery (R, R_B = o(1)) is

σ² = o(p^α).

2.2.2 Regular lattice structure

Now consider the lattice graph, which is constructed by placing vertices in a regular grid on a d dimensional torus and adding edges of weight 1 between adjacent points. Let p = r^d. For d = 1 this is a cycle, which has a circulant weight matrix w with eigenvalues {2 − 2cos(2πk/p) : k ∈ [p]} and eigenvectors corresponding to the discrete Fourier transform [20]. Let i = (i₁, ..., i_d), j = (j₁, ..., j_d) ∈ [r]^d. Then the weight matrix of the lattice in d dimensions is

W_{i,j} = w_{i₁,j₁} δ_{i₂,j₂} · · · δ_{i_d,j_d} + · · · + w_{i_d,j_d} δ_{i₁,j₁} · · · δ_{i_{d−1},j_{d−1}},    (7)

where δ is the Kronecker delta function. Through concentration of the eigenspectrum, we can choose k such that λ_k ≈ d and k = pe^{−d/8}. So the risk bound becomes O(ρ/(pd) + σ²e^{−d/8} + e^{−γp}), and as we increase the dimension of the lattice the MSE decays linearly in d.
Corollary 4. Consider a graph-structured pattern under any of the conditions of Theorem 1 (1)-(3) based on a lattice graph in d dimensions with p = r^d vertices. If r is a constant and d = 8α ln p, for some constant α ∈ (0, 1), then the noise threshold for consistent MSE recovery (R, R_B = o(1)) is given as:

σ² = o(p^α).

Again, the noise variance can increase with the network size p, and larger α implies stronger network interactions, as each variable interacts with a greater number of neighbors (d is larger).

2.2.3 Erdős–Rényi random graph structure

Erdős–Rényi (ER) random graphs are generated by adding edges of weight 1 between any two vertices within V independently with probability q_p. It is known that the probability of edge inclusion (q_p) determines large-scale geometric properties of the graph [15]. Real world networks are generally sparse, so we set q_p = p^{−(1−α)}, where α ∈ (0, 1). Larger α implies a higher probability of edge inclusion and a stronger network interaction structure. Using the degree distribution, and a result from perturbation theory, we bound the quantiles of the eigenspectrum of Δ. This enables us to choose k = p^{1−α} in the sequence of quantiles of the eigenvalue distribution such that P_G{λ_k ≤ p^α/2} = O(1/p^α). So, we obtain a bound for the expected Bayes MSE (with respect to the graph): E_G[R] ≤ O(p^{−α}) + σ²O(p^{−α}) + O(1/p^α).

Corollary 5. Consider a graph G drawn from an Erdős–Rényi random graph model with p vertices and probability of edge inclusion q_p = p^{−(1−α)} for some constant α ∈ (0, 1). For a graph-structured pattern under any of the conditions of Theorem 1 (1)-(3), the noise variance that can be tolerated while ensuring consistent MSE recovery (R, R_B = o_{P_G}(1)) is given as:

σ² = o(p^α).

2.3 Future Work

Laplacian eigenmaps is by no means the only estimator that denoises normal means models under ellipsoid constraints. The most notable alternative estimators are Pinsker's estimator [14, 33] and Laplacian regularization [9]. For low-noise asymptotics (σ² → 0) Pinsker's estimator is known to be minimax optimal. A natural extension of this work is to compare these estimators under similar graph models, determining whether there are conditions under which any are asymptotically inadmissible. To summarize, I intend to:
1. Pinsker's estimator: study the asymptotic performance of Pinsker's estimator relative to Laplacian eigenmaps
2. Laplacian regularization: determine the relative optimality of Laplacian regularization
3. Extension to other graphs: extend this work to other graph models currently being studied, such as the ε-graph, binary tree and Kronecker graph.

3 Edge Lasso
The fused lasso was proposed recently to enable recovery of high-dimensional patterns which are piece-wise constant on a graph, by penalizing the ℓ₁-norm of differences of measurements at vertices that share an edge. While there have been some attempts at coming up with efficient algorithms for solving the fused lasso optimization, a theoretical analysis of its performance is mostly lacking, except for the simple linear graph topology. In this section, we investigate the changepoint sparsistency of the fused lasso for general graph structures, i.e. its ability to correctly recover the exact support of piece-wise constant graph-structured patterns asymptotically (for large-scale graphs). To emphasize this distinction from previous work, we will refer to it as the Edge Lasso.
We focus on the (structured) normal means setting, and our results provide necessary and sufficient conditions on the graph properties as well as the signal-to-noise ratio needed to ensure changepoint sparsistency. Let 𝒜 denote the collection of maximal sets of vertices with constant activation (viz. x_v = x_w for all v, w ∈ A ∈ 𝒜). We exemplify our results using simple graph-structured patterns, and demonstrate that in some cases the fused lasso is changepoint sparsistent at very weak signal-to-noise ratios, which may scale as √((log p)/|A|), where p is the number of vertices in the graph and A is the smallest element of 𝒜 (see Figure 2). In other cases, it performs no better than thresholding the difference of measurements at vertices which share an edge (which requires a signal-to-noise ratio that scales as √(log p)). All results and detailed proofs can be found in [54]. We summarize the current results regarding the edge lasso:
1. Edge thresholding: Generic chaining provides upper and lower bounds on the changepoint sparsistency of edge thresholding, a simple estimator that provides a natural comparison for the edge lasso.
2. Noiseless changepoint sparsistency: There are combinatorial conditions under which the edge lasso is changepoint sparsistent in the noiseless model (σ = 0).
3. Noisy changepoint sparsistency: By combining the conditions for the noiseless model with concentration of measure and spectral graph theory, one can obtain conditions for changepoint sparsistency in the noisy model.
4. Asymptotics for specific graphs: The edge lasso fails to be changepoint sparsistent for the 1 and 2 dimensional lattice, while it obtains nearly oracle rates for the nested complete graph.

3.1 Edge Thresholding

It is natural as a first pass to merely difference the observations, y_{e⁺} − y_{e⁻}, and hard threshold to obtain an estimator of sign(∇x) and thereby achieve changepoint sparsistency. The estimator is given by

ẑ_{th,e}(τ) = (y_{e⁺} − y_{e⁻}) 1{|y_{e⁺} − y_{e⁻}| > τ} = (∇_e y) 1{|∇_e y| > τ}.

We now characterize necessary and sufficient conditions to obtain changepoint sparsistency of edge thresholding. Let δ denote the smallest gap |x_{e⁺} − x_{e⁻}| over the changepoint edges e ∈ supp(∇x).

Theorem 6. Suppose that ‖∇x‖₀/|E| → 0 for simplicity.

1. If δ/σ = ω(√(log |E|)) then ẑ_th is changepoint sparsistent.

2. If δ/σ = o(√(log(|E| − ‖∇x‖₀))) then ẑ_th is not changepoint sparsistent.

Figure 2: A qualitative summary of our changepoint sparsistency results for the SNR required by Edge thresholding, the Edge lasso (for the 1-d and 2-d grids and the Nested Complete Graph), and an Oracle that has a priori knowledge of 𝒜. (In the figure it is assumed that |A| scales like p for all A ∈ 𝒜.)

We see immediately that the signal to noise ratio must grow like the square root of the log of the number of edges for edge thresholding to achieve changepoint sparsistency.

3.2 The Edge Lasso

In this section we will describe the edge lasso estimator, which arises as the solution to a generalized fused lasso problem as defined in [61], with the graph constraints specified by the matrix ∇. In particular, the edge lasso is the minimizer of the convex problem

min_{x̂ ∈ ℝ^p}  (1/2) ‖y − x̂‖₂² + λ‖∇x̂‖₁,    (8)

where λ > 0 is a tuning parameter. Thus, the edge lasso is the penalized least squares estimator of x with penalty term given by the ℓ₁ norm of the differences of measurements across edges in G.

Using the KKT conditions and a primal-dual witness method, we are able to extract the following theorem regarding noiseless recovery of x.

Theorem 7. Define the following notion of connectivity for each A ∈ 𝒜:

ς(A) = max_{C ⊆ A : |C| ≤ |A|/2}  |∂C ∩ ∂A| / |∂C ∩ E(A)|,    (9)

where E(A) denotes the set of edges within A. Then the noiseless problem recovers the correct 𝒜 if ς(𝒜) = max_{A∈𝒜} ς(A) < 1/2.

See Figure 3 for an illustration of condition (9) in the previous theorem. An interpretation of the ς(A) parameter is that there should be no bottleneck within the vertices of constant activation, A, relative to the flow coming in and out of A. Later we will describe a class of graphs, which we call the nested complete graphs, for which ς(A) = 1/|A| for A ∈ 𝒜. We now combine this result with subGaussian concentration to produce a changepoint sparsistency result in the noisy regime.
Theorem 8. Let B = supp(∇x) be the changepoints of x, and let δ be the gap between the signal values across clusters of activation in 𝒜. Suppose that the following conditions hold for all A ∈ 𝒜:

ς(A) = o(1),    |||∇_B Δ_A^†|||_{2,∞} = O(√(|A|/|∂A|)),    δ/σ = ω(√(log(|E \ B|)/|A|));

then the edge lasso is changepoint sparsistent.

Figure 3: An example of the quantities in eq. (9) for a cut of the set A, depicted by the large vertices. The cut C comprises the black vertices, the edges of ∂C within A are blue, and the edges in ∂C ∩ ∂A are red. The RHS of eq. (9) for this cut is 5/21.

Proposition 9. Let the spectral decomposition of the Laplacian for A ∈ 𝒜 be Δ_A = UΛUᵀ. Then |||∇_B Δ_A^†|||_{2,∞} is equal to

max_{A∈𝒜} max_{e∈∂A} √( Σ_{v∈V} (U_{v,e⁺} − U_{v,e⁻})² λ_v⁻² ).

So, if each eigenvector U_v is ℓ_v-Lipschitz with respect to the shortest path distance, then (U_{v,e⁺} − U_{v,e⁻})² ≤ ℓ_v², and

|||∇_B Δ_A^†|||_{2,∞} ≤ max_{A∈𝒜} √( Σ_{v∈A} ℓ_v² λ_v⁻² ).
Proposition 9 provides a more tractable condition that implies that |||∇_B Δ_A^†|||_{2,∞} is small. Our findings suggest that the changepoint sparsistency of the edge lasso is highly dependent on the topology of G and its partition 𝒜. In general, it is necessary that there exist no bottleneck cuts (cuts that force ς(A) to be large).

3.3 Specific graph models

We apply our results to the edge lasso over the 1 and 2 dimensional grids, commonly referred to as the fused lasso. In these cases the SNR must not decrease if we are to achieve changepoint sparsistency, which is in sharp contrast to the performance of the oracle (see Figure 2). We provide a topology, called the nested complete graph, that satisfies the sufficient conditions for changepoint sparsistency. These examples are meant to provide a blueprint for using the previous results to explore topologies for which the edge lasso is changepoint sparsistent.

3.3.1 1D and 2D fused lasso

Due to the popularity of total variation penalization, it is imperative that we discuss the 1D and 2D fused lasso. In the 1D grid each vertex can be associated with a number in {1, ..., p}, and we connect the pairs with Euclidean distance less than or equal to 1. Similarly, in the 2D grid each vertex can be associated with a pair in {1, ..., p₀} × {1, ..., p₁}. In the 2D grid we will say that a vertex v is a corner if its degree within A(v) (the partition element containing v) is 2. (See Figure 4.)

Figure 4: A 2D grid with |𝒜| = 2 depicted as a union of black and red vertices. The red vertex is an example of a corner.

Corollary 10. (a) Consider the 1D fused lasso with a non-trivial signal such that |𝒜| = 2. If the signal to noise ratio is decreasing (δ_max/σ = o(1)) then the 1D fused lasso is not changepoint sparsistent.
(b) Consider the 2D fused lasso with an A ∈ 𝒜 such that A contains a corner v and |𝒜| = 2. If the signal to noise ratio is decreasing (δ_max/σ = o(1)) then the 2D fused lasso is not changepoint sparsistent.

3.3.2 Nested complete graph

We construct the nested complete graph from k + 1 copies of the complete graph on k vertices by adjoining each complete graph to each other with one edge. We can form this such that each vertex has only one edge leaving its element of 𝒜, the elements being the original complete graphs. (See Figure 5.) We find that, modulo factors that scale like log p, the changepoint sparsistency thresholds are the same as those of the oracle.

Figure 5: Nested complete graph with k = 3. The elements of 𝒜 are the complete subgraphs of size 3.

Corollary 11. Suppose we construct the nested complete graph with k vertices in each A and k + 1 elements in the partition (|A| = k and |𝒜| = k + 1). If the SNR satisfies

δ/σ = Ω( (1/√k) √(log(k(k + 1))) ),

then the edge lasso is changepoint sparsistent.

3.4 Future Work

While the aforementioned results illustrate the difficulty of recovering exactly the changepoint sparsity pattern, it is not clear what can be said about approximate recovery of x through the edge lasso. While some simple adaptations of the primal-dual witness method seem plausible, an exact analysis of approximate recovery is not within the scope of this thesis. One important step in this direction is to consider using the edge lasso as a changepoint detector. Specifically, we are currently studying a test statistic formed by the point in the lasso path at which there is a non-constant reconstructed signal. We are finding that this is related to the max-flow min-cut combinatorial duality for single-commodity flow problems. Moving forward, I am excited to compare this novel detector to the scan statistic relaxations (Section 4) and the tree wavelet detector (Section 5). To summarize, I intend to:
1. Approximate recovery: adapt the primal-dual witness method to obtain results for approximate recovery with the edge lasso.
2. Fused lasso detector: study the performance of the edge lasso detector and compare it to other changepoint detectors.

4 Graph Scan Statistic Relaxations


In this work, we assume that the class of clusters of activation consists of sub-graphs of small cut size. Specifically, we adopt the constant background activation hypotheses in (3). We see that the generalized likelihood ratio test (GLRT) is an integer program with a term in the objective that corresponds to the sparsest cut in a graph, a known NP-hard problem. With this in mind, we propose a relaxation of the GLRT, called the spectral scan statistic (SSS), which is based on the combinatorial Laplacian of the graph and, importantly, is a tractable program. As our main result, we derive theoretical guarantees for the performance of the spectral scan statistic, which hold for any graph and are based on the spectral measure of the combinatorial Laplacian. For comparison purposes, we derive theoretical guarantees for two simple detectors: edge thresholding and the energy (ℓ₂) test. We conclude our study by applying the main result to balanced binary trees, the lattice, and Kronecker graphs, giving us precise asymptotic results. We find that, modulo logarithm terms, the spectral scan statistic has nearly optimal power for balanced binary trees. For a detailed analysis, with proofs and lemmas, see [53]. We summarize the results below:
1. GLRT form and relaxations: The GLRT is a computationally intractable combinatorial optimization, and the
SSS relaxation is proposed.
2. Theoretical analysis of SSS: The SSS can asymptotically distinguish H0 from H1 given certain conditions on
the graph spectra and the signal-to-noise ratio.
3. Specific graph models: These results are applied to the balanced binary tree, the lattice and Kronecker graphs.


4.1 GLRT relaxations

The hypothesis testing problem (3) presents two challenges: (a) the model contains an unbounded nuisance parameter x̄ ∈ ℝ, and (b) the alternative hypothesis is comprised of a finite disjoint union of composite hypotheses. We will eliminate the interference caused by the nuisance parameter by considering test procedures that are independent of x̄. The formal justification for this choice is based on the theory of optimal invariant hypothesis testing (see, e.g., [38]) and of uniformly best constant power tests (see [66]). Let H₁^A denote the subset of H₁ in which x is constant over A and Ā. For the simpler problem of testing H₀ versus H₁^A for some A ⊆ V, the optimal test is based on the likelihood ratio (LR) statistic,

2 log Λ_A(y) = 2 log( sup_{x∈H₁^A} f_x(y) / sup_{x∈H₀} f_x(y) ) = (1/σ²) (|V|/(|A||Ā|)) ( Σ_{v∈A} ỹ_v )²,    (10)

where ỹ = y − ȳ and f_x is the Lebesgue density of P_x. This test rejects H₀ for large values of Λ_A(y). Optimality follows from the fact that the statistical model we consider has the monotone likelihood ratio property. When testing against composite alternatives, as in our case, it is customary to consider instead the generalized likelihood ratio (GLR) statistic, which in our case reduces to

ĝ = max_{A∈𝒜(ρ)} σ² · 2 log Λ_A(y),

where 𝒜(ρ) = {A ⊆ V : n|∂A|/(|A||Ā|) ≤ ρ}. Through manipulations of the likelihoods, we find that the GLR statistic has a very convenient form, which is tied to the spectral properties of the graph G via its Laplacian.

Lemma 12. Let ỹ = y − ȳ and K = I − (1/n)11ᵀ. Then

ĝ = max_{z∈{0,1}^n}  (zᵀỹ)² / (zᵀKz)   s.t.   zᵀΔz / (zᵀKz) ≤ ρ,    (11)

where Δ is the combinatorial Laplacian of the graph G.


The savvy reader will notice the connection between (11) and the graph sparsest cut program. By Lagrangian duality, we see that the program (11) is equivalent to (for some Lagrangian parameter ν)

min_{A⊆V}  ν |∂A|/(|A||Ā|) − ( Σ_{i∈A} ỹ_i )² / (|A||Ā|),

the first term of which is precisely the sparsest cut objective, while the second term drives the solution A to have positive within-cluster empirical correlations. The sparsest cut program is known to be NP-hard, with poly-time algorithms known for trees and planar graphs [46]. We will follow the tradition of bounding cut sparsity with the algebraic connectivity (λ₂), and provide a surrogate for the scan statistic based on this simple spectral relaxation.

Proposition 13. Define the Spectral Scan Statistic (SSS) as

ŝ = sup_{z∈ℝⁿ} (zᵀỹ)²   s.t.   zᵀΔz ≤ ρ, ‖z‖ ≤ 1, zᵀ1 = 0.    (12)

Then the GLR statistic is bounded by the SSS: ĝ ≤ ŝ.

Remark 14. By Lagrangian duality and the Courant-Fischer theorem, the spectral scan statistic can be written as

ŝ = min_{ν>0} λ_max( ỹỹᵀ − νΔ ) + νρ,

where λ_max(A) is the maximum non-zero eigenvalue of the matrix A.
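For concreteness, the dual form in Remark 14 suggests the following sketch, which minimizes over the scalar ν numerically. The bracketing interval is an arbitrary choice of ours, and we take the largest eigenvalue as a stand-in for the maximum non-zero eigenvalue.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def spectral_scan_statistic(y, laplacian, rho):
    y_tilde = y - y.mean()                       # project out the nuisance mean
    def dual(nu):
        M = np.outer(y_tilde, y_tilde) - nu * laplacian
        return np.linalg.eigvalsh(M)[-1] + nu * rho   # largest eigenvalue + nu*rho
    res = minimize_scalar(dual, bounds=(1e-8, 1e4), method="bounded")
    return res.fun
```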


In order to bound the probability of false alarm for the SSS, we draw heavily on the theory of the generic chaining, perfected in [58], which essentially reduces the problem of computing bounds on the expected supremum of a Gaussian process to geometric properties of its index space. Recall that, under the alternative hypothesis, ‖x̃‖ ≥ μ uniformly over H₁.

Theorem 15. The following hold with probability at least 1 − δ. Under the null H₀,

ŝ ≤ σ² ( √( 2 Σ_{i>1} min{1, ρ/λ_i} ) + √( 2 log(1/δ) ) )²,

while under the alternative H₁,

ŝ ≥ ( μ − σ √( 2 log(1/δ) ) )².

As a corollary we will provide sufficient conditions for asymptotic distinguishability that depend on the spectrum of the Laplacian Δ. As we will show in the next section, these conditions can be applied to a number of graph topologies whose spectral properties are known.

Corollary 16. The null and alternative, as described in Thm. 15, are asymptotically distinguished by ŝ and ĝ(y) if

√( Σ_{i>1} min{1, ρ λ_i⁻¹} ) = o( μ²/σ² ).    (13)

Other, stronger, sufficient conditions are

√( k + ρ(p − k)/λ_{k+1} ) = o( μ²/σ² ),    (14)

if k is large enough that λ_{k+1} > ρ.


Interestingly, there are no logarithmic terms in (13) of the kind that usually accompany uniform bounds of this type, which is attributed to the generic chaining. Notice that the left hand side of (13) is always less than √(n − 1), which, as we will see, characterizes the performance of the naive statistic ‖ỹ‖. For comparison, we consider the performance of two naive procedures for detection: the energy detector, which rejects H₀ if ‖ỹ‖² is too large, and the edge thresholding detector, which rejects H₀ if max_{(v,w)∈E} |y_v − y_w| is large.

Theorem 17. H₀ and H₁ are asymptotically distinguished by ‖ỹ‖ if and only if

μ²/σ² = ω( √(p − 1) ).

4.2 Specific graph models

In this section we demonstrate the power and flexibility of Theorem 15 by analyzing in detail the performance of the spectral scan statistic over three important graph topologies: balanced binary trees, the two-dimensional lattice and Kronecker graphs (see [40, 39]).

4.2.1 Balanced binary trees

We begin the analysis of the spectral scan statistic by applying it to the balanced binary tree (BBT) of depth ℓ. The class of signals that we will consider have clusters of constant signal which are subtrees of size at least cp^η, for 0 < c ≤ 1/2 and 0 < η ≤ 1. Hence, the cut size of the signals is 1 and ρ = [cp^η(1 − cp^{η−1})]⁻¹.

Corollary 18. For the balanced binary tree with p vertices, the spectral scan statistic can asymptotically distinguish H₀ from signals with ρ = n[cp^η(p − cp^η)]⁻¹ if the SNR is stronger than

μ²/σ² = ω( p^{(1−η)/2} √(log p) ).

4.2.2 Lattice

We will analyze the performance guarantees of the SSS over the 2-dimensional lattice graph with r vertices along each dimension (p = r²). We will assume that ρ = ap^{−1/2}, as this is the cut sparsity of rectangles that have a low surface area to volume ratio. By a simple Fourier analysis (see [55]), we know that the Laplacian eigenvalues are 2(2 − cos(2πi₁/r) − cos(2πi₂/r)) for i₁, i₂ ∈ [r]. Through some calculus, we arrive at the following conclusion.

Corollary 19. For the r × r square lattice, the spectral scan statistic can asymptotically distinguish H₀ from signals with cut sparsity ρ = ap^{−1/2} if the SNR is stronger than

μ²/σ² = ω( p^{3/8} ).

4.2.3 Kronecker graphs

Much of the research in complex networks has focused on observing statistical phenomena that are common across many data sources. The most notable of these are that the degree distribution obeys a power law ([16]) and that networks are often found to have small diameter ([47]). A class of graphs that satisfies these, while providing a simple modelling platform, is the class of Kronecker graphs (see [40, 39]). Let H₁ and H₂ be graphs on r vertices with Laplacians Δ₁, Δ₂ and edge sets E₁, E₂ respectively. The Kronecker product, H₁ ⊗ H₂, is the graph over the vertices [r] × [r] such that there is an edge ((i₁, i₂), (j₁, j₂)) if i₁ = j₁ and (i₂, j₂) ∈ E₂, or i₂ = j₂ and (i₁, j₁) ∈ E₁. We will construct graphs that have a multi-scale topology using the Kronecker product. Let the multiplication of a graph by a scalar indicate that we multiply each edge weight by that scalar. First let H be a connected graph with r vertices. Then the graph G for ℓ > 0 levels is defined as

G = (1/r^{ℓ−1}) H ⊗ (1/r^{ℓ−2}) H ⊗ · · · ⊗ (1/r) H ⊗ H.

The choice of multipliers ensures that it is easier to make cuts at the coarser scales. Notice that all of the previous results in this section hold for weighted graphs.

Corollary 20. Let G be the Kronecker product graph described above with p = r^ℓ vertices. The spectral scan statistic can asymptotically distinguish H₀ from signals with cuts within the k coarsest scales (ρ ≤ r^{2k+1−ℓ}) if the SNR is stronger than

μ²/σ² = ω( r⁻² (ℓ + 2) p^{(2k+1)/ℓ} ).

4.3 Future Work

One can easily form an estimator from a relaxation of the GLRT by thresholding the arg max of the program given in (12). I intend to explore this through either an analogue of the Davis-Kahan theorem, or by way of results due to [28] and some concentration of measure. The SSS was developed as a version of the Cheeger inequality, which provides a weak relaxation of the sparsest cut program. Much tighter results are known for approximations to the sparsest cut program, the most successful of these being semi-definite programs. We will explore similar relaxations of the GLRT, motivated by [6]. We summarize the next steps:
1. Estimation and localization: to study the performance of the spectral scan thresholding estimator, bounding it through either perturbation bounds or concentration results
2. Semi-definite programming relaxation: to propose and analyze the approximation properties of the SDP relaxation of the GLRT
3. Statistical analysis of the SDP: to study the statistical performance of the SDP relaxation either directly or through the GLRT

5 Spanning Tree Wavelets


In this section, we will be testing if there is a non-zero piece-wise constant activation pattern on the graph, given observations that are corrupted by Gaussian white noise, as in the hypothesis test (4). We show that correctly distinguishing the null and alternative hypotheses is impossible if the signal-to-noise ratio does not grow quickly with respect to the allowable number of discontinuities in the activation pattern. As we observed in section 4, a test based on the scan statistic, which matches the observations against all possible activation patterns by brute force, is infeasible; so we propose a Haar wavelet basis construction for general graphs, which is formed by hierarchically dividing a spanning tree of the graph. We find that the size and power of the test can be bounded in terms of the number of signal discontinuities and the spanning tree, immediately giving us a result for any spanning tree. We then propose choosing a spanning tree uniformly at random (this can be done efficiently), and show that this bound can be improved by a factor of the average effective resistance of the edges across which the signal is non-constant. With this machinery in place we are able to show that for edge transitive graphs (such as lattices), k-nearest neighbor graphs, and geometric random graphs, our test is nearly-optimal in that the upper bounds match the fundamental limits of detection up to logarithm factors. A detailed account of these methods and results, including proofs and lemmas, can be found in [52]. We summarize our contributions below:
1. Lower bound: One can lower bound the necessary SNR for asymptotic distinguishability, which indicates that the most difficult signals to detect have large unstructured active regions.
2. Spanning tree wavelets: One can construct a wavelet basis over graphs from any spanning tree of the graph.
3. Main result: Given any spanning tree, we bound the probability of false alarm in terms of how the cuts in the spanning tree subsample the cuts in the graph.
4. Uniform spanning tree wavelets: Through proofs akin to cut sparsification, the probability of false alarm can be bounded for a uniform spanning tree used as a randomized detector.
5. Specific graphs: For edge transitive graphs (such as lattices), k-nearest neighbor graphs, and geometric random graphs, the uniform spanning tree wavelet detector is nearly-optimal.


5.1 Universal Lower Bound

In order to more completely understand the problem of detecting anomalous activity in graphs, we prove that there is a universal minimum signal strength below which H₀ and H₁ in (4) are asymptotically indistinguishable. The proof is based on a lemma developed in [4], but the strategic use of this lemma is novel. Our construction of the worst case prior gives a significantly tighter bound than would a more naive implementation. Indeed, it is interesting to note that the worst case prior is a uniform distribution over the largest unstructured signals allowed in H₁ that are nearly disjoint (unstructured).

Theorem 21. Let the maximum degree of G be d_max. Consider the alternative, H₁, in which the cut size of each signal in X is bounded by ρ, with lim_{p→∞} ρ = ∞ and ρ d_max ≤ p. H₀ and H₁ are asymptotically indistinguishable if

μ = o( σ √( min{ ρ/d_max, √p } ) ).

5.2 Spanning Tree Wavelets

We construct our wavelet basis B recursively, by first finding a seed vertex in the spanning tree such that the subtrees adjacent to the seed have at most p/2 vertices, and then by including basis elements localized on these subtrees in B. We recurse on each subtree, adding higher-resolution elements to our basis, and consequently constructing a complete wavelet basis. The first phase of the algorithm ensures that the depth of the recursion is at most log p, and the second ensures that each edge is activated by at most log d basis elements per recursive call. Combining these two shows that each edge is activated by at most log d log p basis elements.
Finding a balancing vertex in the tree parallels the technique in [48], which finds a balancing edge. The algorithm starts from any vertex v ∈ T and moves along T to a neighboring vertex w that lies in the largest connected component of T \ v. The algorithm repeats this process (moving from v to w) until the largest connected component of T \ w is larger than the largest connected component of T \ v, at which point it returns v. We call this the FindBalance algorithm.
Once we have a balancing vertex v, we form wavelets that are constant over the connected components of T \ v, such that any vertex is supported by at most log d wavelets. Let d_v be the degree of the balancing vertex v and let c₁, . . . , c_{d_v} be the connected components of T \ v (with v added to the smallest component). Our algorithm acts as if c₁, . . . , c_{d_v} form a chain structure and constructs the Haar wavelet basis over them. We call this algorithm FormWavelets:

1. Let C₁ = ∪_{i ≤ d_v/2} c_i and C₂ = ∪_{i > d_v/2} c_i.
2. Form the following basis element and add it to B:

b = √( |C₁||C₂| / (|C₁| + |C₂|) ) ( (1/|C₁|) 1_{C₁} − (1/|C₂|) 1_{C₂} )

3. Recurse at (1) with the subcomponents of C₁ and C₂, with partitions {c_i}_{i ≤ d_v/2} and {c_i}_{i > d_v/2} respectively.
Our algorithm recursively constructs basis elements using the FindBalance and FormWavelets routines on subtrees of T. We initialize T to be a spanning tree of the graph and start with no elements in our basis (a code sketch follows the enumerated steps).
1. Let v be the output of FindBalance applied to T.
2. Let T₁, . . . , T_{d_v} be the connected components of T \ v, and add v to the smallest component.
3. Add the basis elements constructed by FormWavelets applied to T₁, . . . , T_{d_v}.
4. For each i ∈ [d_v], recursively apply (1)-(4) on T_i as long as |T_i| > 2.
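A condensed Python sketch of this construction follows (NetworkX assumed). The ordering of components, the tie-breaking, and the omission of the coarsest (constant) basis element are simplifications of ours.

```python
import networkx as nx
import numpy as np

def find_balance(T):
    """Walk toward the largest component of T minus v until all subtrees have <= |T|/2 vertices."""
    v = next(iter(T))
    while True:
        comps = list(nx.connected_components(T.subgraph(set(T) - {v})))
        if not comps:
            return v
        largest = max(comps, key=len)
        if len(largest) <= len(T) / 2:
            return v
        v = next(u for u in T[v] if u in largest)   # unique neighbor on the big side

def form_wavelets(components, basis):
    """Haar-type wavelets over a list of vertex sets, treated as a chain."""
    if len(components) < 2:
        return
    half = len(components) // 2
    C1 = set().union(*components[:half])
    C2 = set().union(*components[half:])
    scale = np.sqrt(len(C1) * len(C2) / (len(C1) + len(C2)))
    b = {u: scale / len(C1) for u in C1}
    b.update({u: -scale / len(C2) for u in C2})
    basis.append(b)                                 # wavelet stored as vertex -> value
    form_wavelets(components[:half], basis)
    form_wavelets(components[half:], basis)

def tree_wavelets(T, basis):
    if len(T) <= 2:                                 # recurse only while |T_i| > 2
        return
    v = find_balance(T)
    comps = sorted((set(c) for c in
                    nx.connected_components(T.subgraph(set(T) - {v}))), key=len)
    comps[0] |= {v}                                 # add v to the smallest component
    form_wavelets(comps, basis)
    for c in comps:
        tree_wavelets(T.subgraph(c).copy(), basis)
```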

As we will see, controlling the sparsity ‖Bx‖₀ is essential in analyzing the performance of the test statistic ‖By‖_∞. The main theoretical guarantee of our basis construction algorithm is that signals with small cuts in G are sparse in B.

Lemma 22. Let ∇ be the incidence matrix of G and ∇_T be the incidence matrix of T (where T has degree at most d), so that ‖∇x‖₀ is the cut size of a pattern x ∈ ℝ^{V(G)}. Then for any x ∈ ℝ^{V(G)},

‖Bx‖₀ ≤ ‖∇_T x‖₀ log d log p ≤ ‖∇x‖₀ log d log p.    (15)

Equipped with Lemma 22 we can now characterize the performance of the test statistic ‖By‖_∞ on any signal x. Our bound depends on the choice of spanning tree T, specifically via the quantity ‖∇_T x‖₀, the cut size of x in T.


Theorem 23. Perform the test in which we reject the null if ‖By‖_∞ > τ. Set τ = σ√(2 log(p/α)). If

μ ≥ √( 2‖∇_T x‖₀ log d log p ) σ ( √(log(1/α)) + √(log(p/α)) ),    (16)

then under H₀, P{Reject} ≤ α, and under H₁, P{Reject} ≥ 1 − α.

Remark 24. For any tree we have ‖∇_T x‖₀ ≤ ‖∇x‖₀ for all patterns x, so that for the sparse cut alternative we can have both Type I and Type II error probabilities at most α as long as:

μ ≥ √( 2ρ log d log p ) σ ( √(log(1/α)) + √(log(p/α)) ).    (17)
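A sketch of this test is immediate; here B is assumed to be the matrix whose rows are the wavelet basis elements constructed above.

```python
import numpy as np

def wavelet_detector(y, B, sigma, alpha=0.05):
    """Reject H0 when max |(By)_i| exceeds tau = sigma * sqrt(2 log(p / alpha))."""
    tau = sigma * np.sqrt(2 * np.log(len(y) / alpha))
    return bool(np.max(np.abs(B @ y)) > tau)
```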

5.3 Uniform Spanning Tree Basis

The uniform spanning tree (UST) is a spanning tree generation technique that we will use to construct wavelet bases. There exists a deep connection between electrical networks, USTs and random walks. Because the UST is randomly generated, the test statistic ‖B_T y‖_∞, when conditioned on y, will also be random. Due to results from cut sparsification, we can relate the performance of the UST wavelet detector to effective resistances.
First, define the effective resistance between vertices v, w as r_{v,w} = (δ_v − δ_w)ᵀ Δ† (δ_v − δ_w), where δ_v is the Dirac delta function at v and Δ† is the pseudoinverse of the Laplacian. The UST is a random spanning tree, chosen uniformly at random from the set of all distinct spanning trees. The foundational Matrix-Tree theorem [35] describes the probability of an edge being included in the UST. The following lemma can be found in [42] and [43].

Lemma 25. Let G be a graph and T a draw from UST(G). Then

P{e ∈ T} = r_e.
Hence, we can expect that for a given cut in the graph, the cut size in the tree will look like the sum of the edge effective resistances. While it is infeasible to enumerate all spanning trees of a graph, the Aldous-Broder algorithm is an efficient method for generating a draw from UST(G) [2]. The algorithm simulates a random walk on G, {X_t}, stops when all of the vertices have been visited, and defines the spanning tree T by the edges {(X_{H(X₀,v)−1}, v) : v ∈ V \ {X₀}}, where H(X₀, v) is the hitting time of v. Clearly the UST does not sample edges independently, but it does have the well documented property of negative association: the inclusion of an edge decreases the probability that another edge is included.
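For concreteness, a sketch of the Aldous-Broder sampler; the adjacency-list representation is an assumption of this sketch.

```python
import random

def aldous_broder(adj, start):
    """adj: dict mapping each vertex to a list of its neighbors."""
    tree, visited, v = [], {start}, start
    while len(visited) < len(adj):
        w = random.choice(adj[v])       # one step of the random walk
        if w not in visited:            # first-entrance edge joins the tree
            visited.add(w)
            tree.append((v, w))
        v = w
    return tree
```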
The following theorem derives from a concentration result for the UST, based on negative association, which can be found in [21].

Theorem 26. Let r_max = max_{x∈X} Σ_{e∈supp(∇x)} r_e (the maximum effective resistance of a cut in X). If

μ/σ = ω( √(r_max) log d log p ),

then H₀ and H₁ are asymptotically distinguished by the test statistic ‖By‖_∞, where B is the UST wavelet basis.

5.4 Specific Graph Models

In this section we study our detection problem for several different families of graphs. Foster's theorem highlights why we expect the effective resistance to be less than the cut size.

Theorem 27 (Foster's Theorem [17, 59]).

Σ_{e∈E(G)} r_e = p − 1

Hence, if we select an edge uniformly at random from the graph, we expect its effective resistance to be (p − 1)/m ≍ d̄⁻¹ (the reciprocal of the average degree), where m = |E(G)|.

5.4.1 Edge Transitive Graphs

An edge transitive graph, G, is one such that for any edges e₀, e₁, there is a graph automorphism that maps e₀ to e₁. Examples of edge transitive graphs include the l-dimensional torus and the complete graph K_p. From this we derive the following corollary, which we note matches the lower bound in Theorem 21 modulo logarithmic terms if ρ/d ≤ √p:

Corollary 28. Let G be edge transitive with common degree d. Then for each edge e ∈ E(G), r_e = (p − 1)/m. Consider the hypothesis testing problem (4). If

μ/σ = ω( √(ρ/d) log d log p ),

then the UST wavelet detector, ‖By‖_∞, asymptotically distinguishes H₀ and H₁.


5.4.2 kNN Graphs

In this section we will devote our attention to symmetric k-nearest neighbor graphs. Specifically, suppose that z₁, ..., z_p are drawn i.i.d. from a density f supported over ℝ^d. Then we form the graph G over [p] by connecting vertices i, j if z_i is amongst the k-nearest neighbors of z_j, or vice versa. Some regularity conditions on f are needed for our results to hold; they can be found in [65]. Through bounds on effective resistances, we arrive at the following corollary:

Corollary 29. Let G be a k-NN graph with k/p → 0 and k(k/p)^{2/d} → ∞, and where the density f satisfies the regularity conditions in [65]. Consider the hypothesis testing problem (4). If

μ/σ = ω( √(ρ/k) log d log p ),

then the UST wavelet detector, ‖By‖_∞, asymptotically distinguishes H₀ and H₁.

5.4.3 ε-Graphs

The ε-graph is another widely used random geometric graph in machine learning and statistics. As with the k-NN graph, the vertices are embedded into ℝ^d, and edges are added between pairs of vertices that are within distance ε of each other. For such graphs we have the following corollary:

Corollary 30. Let G be an ε-graph with points z₁, . . . , z_p drawn from a density f which satisfies the regularity conditions in [65] and is lower bounded by some constant f_min (independent of p). Let ε → 0, pε^{d+2} → ∞, and consider the hypothesis testing problem (4). If

μ/σ = ω( √( ρ/(pε^d) ) log d log p ),

then ‖By‖_∞ asymptotically distinguishes H₀ and H₁.

5.5 Future Work

A more thorough theoretical analysis of the spanning tree wavelet basis is needed in the future. Specifically, I intend to pursue an approximation theoretic result in which we show that, with few non-zero coefficients, the spanning tree wavelet basis can well approximate any signal in X₀(ρ). This differs from the previous analysis because the UST was randomly chosen after the data y was observed, whereas an approximation theoretic analysis would disallow this randomness in the basis. This will in turn allow us to provide estimation and localization results. We summarize the future work:
1. Approximation theoretic result: to show that specific spanning tree wavelet bases with few non-zero coefficients well approximate signals in X₀(ρ)
2. Estimation and localization: to derive corollaries from this regarding estimation and localization

6 Kernel Density Estimation over Graphs

In this section we assume that there is some underlying density with respect to the counting measure on the vertices of the graph. We assume that the density f belongs to a Lipschitz class with respect to a metric d. Common distance functions d are the diffusion metric, the resistance distance, and the shortest path length (SPL) distance. As we noted in Section 1, a Lipschitz class with respect to the SPL distance is equivalent to a bound on ‖∇f‖_∞. For our purposes we will let the dominating measure be μ(A) = |A| for any A ⊆ V, but in full generality it may be any measure that dominates f, the common measure of the z_i. As we will see, the bias can be controlled through a change of variables similar to the one that appears in Euclidean density estimation. Furthermore, we are able to provide conditions under which f̂ − E f̂ → 0 in the ∞-norm. We summarize the results below:
1. Approximation error for Lipschitz densities: we explore the bias incurred by the kernel density estimator through a change of variables.
2. ∞-norm concentration for the boxcar KDE: using empirical process theory, we give conditions that bound the ∞-norm of f̂ − E f̂ (the estimation error).


6.1 KDE definition and approximation error

We are interested in the following kernel density estimator of f.

Definition 31. Let K : R → R be measurable. The kernel density estimator of f is

    \hat{f}(v) = \frac{1}{n} \sum_{i=1}^{n} \frac{K(d(z_i, v))}{\int_V K(d(z, v)) \, d\mu(z)}.    (18)

Here the integral is a Lebesgue-Stieltjes integral which, when μ is the counting measure on a finite graph, evaluates to ∫_V K(d(w, v)) dμ(w) = Σ_{w∈V} K(d(w, v)). Generally we will consider kernels K that change with the sample size n, in the same way that the bandwidth may decrease in the usual Euclidean density estimators. By choosing kernels common to density estimation over R [62], such as the Gaussian or Epanechnikov kernels, we find that the estimator (18) is truly a generalization of the Parzen-Rosenblatt estimator: replace V with R, let d be the Euclidean distance, and let μ be the Lebesgue measure. Let B_r(v) be the ball in (V, d) of radius r centered at v. The following bound on the bias of the KDE provides a change of variables that we hope will be useful when controlling the bias for more specific graphs and kernels.
Lemma 32. Suppose that f is d-Lipschitz with constant γ, that μ(B_r(v)) ≤ C r^κ for some C, κ > 0, and write the bias as b(v_0) = f(v_0) − E f̂(v_0). Then the bias is bounded by

    |b(v)| \le C \gamma \, \frac{ \left| \int r^{1+\kappa} \, dK(r) \right| }{ \int_V K(d(v, z)) \, d\mu(z) },    (19)

where the numerator is a Stieltjes integral with respect to the kernel K.
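To make Definition 31 concrete, the following sketch evaluates the estimator (18) with μ the counting measure and d the SPL distance on a small connected graph; the boxcar kernel and the path graph are illustrative assumptions, not choices fixed by the proposal.

    from collections import deque

    def spl_distances(adj, source):
        """Shortest path lengths from source by breadth-first search."""
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        return dist

    def graph_kde(adj, samples, K):
        """fhat(v) = (1/n) sum_i K(d(z_i, v)) / sum_{w in V} K(d(w, v)),
        i.e. (18) with mu the counting measure."""
        dist = {v: spl_distances(adj, v) for v in adj}   # all-pairs SPL
        n = len(samples)
        fhat = {}
        for v in adj:
            norm = sum(K(dist[v][w]) for w in adj)       # denominator of (18)
            fhat[v] = sum(K(dist[v][z]) for z in samples) / (n * norm)
        return fhat

    # Boxcar kernel of radius R = 1 on the path graph with 10 vertices
    adj = {i: [j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)}
    K = lambda r: float(r <= 1)
    samples = [2, 3, 3, 4, 7]
    print(graph_kde(adj, samples, K))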

6.2 Uniform convergence for the boxcar KDE

The analysis of the KDE differs significantly from that of the previous sections in that we observe i.i.d. random variables, so we may employ empirical process theory [63]. Let f_n denote the empirical measure given by {z_i}_{i=1}^n and let f denote the common measure of the z_i. For a given vertex v ∈ V, let a_v = μ(B_R(v)) denote the volume of the ball about v, and let q_v = f(B_R(v)) be the probability of the ball about v. For a measurable function g, write f g = ∫ g df. Consider the function class K = {K(d(·, v)) : v ∈ V} and define ‖f_n − f‖_K = sup_{g∈K} |f_n g − f g|. Hence,

    ‖f̂ − E f̂‖_∞ = sup_{v∈V} |(f_n K(d(·, v)) − f K(d(·, v)))/a_v| ≤ ‖f_n − f‖_K / min_{v∈V} a_v.

Generally, we would like to show that sup_{v∈V} |f̂(v) − E f̂(v)| = o_P(1) and sup_{v∈V} |E f̂(v) − f(v)| = o(1). It is sufficient to show that ‖f_n − f‖_K = o_P(min_{v∈V} a_v) and sup_{v∈V} b(v) = o(1).

For simplicity, we assume that we are using the boxcar kernel, K(v, w) = I{w ∈ B_R(v)}, where B_R(v) denotes the ball of radius R about v ∈ V with respect to the metric d. We first observe that the bracketing number of the function class K may be related to the covering number of the vertices under the graph distance. This will enable us to obtain a uniform law of large numbers under conditions intrinsic to the metric space (V, d). First we define the functional s_r = sup_{v∈V} f(B_{R+r}(v) \ B_{R−r}(v)), which measures the size of the boundaries of R-balls of width r. This leads us to a lemma that controls the bracketing number of the function space K with respect to L_1(f) (i.e. ‖g − h‖_{L_1(f)} = ∫ |g − h| df).

Lemma 33. Let N_{[]}(ε, K, L_1(f)) denote the bracketing number of the function class K under the metric ‖·‖_{L_1(f)}, and let N(r, V, d) be the r-covering number of the metric space (V, d). Then

    N_{[]}(s_r, K, L_1(f)) \le N(r, V, d).    (20)
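As an illustration of the functional s_r, the sketch below evaluates it for the boxcar class under the SPL distance, with the uniform density on the 10-cycle as a toy example; the helper names are hypothetical.

    from collections import deque

    def spl(adj, source):
        """Shortest path lengths from source by breadth-first search."""
        dist = {source: 0}
        q = deque([source])
        while q:
            u = q.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    q.append(w)
        return dist

    def ball(dist_v, radius):
        return {w for w, dw in dist_v.items() if dw <= radius}

    def s_r(adj, f, R, r):
        """s_r = max_v f(B_{R+r}(v) - B_{R-r}(v)), the bracket size of Lemma 33."""
        return max(
            sum(f[w] for w in ball(spl(adj, v), R + r) - ball(spl(adj, v), R - r))
            for v in adj)

    # Uniform density on the 10-cycle; annuli of the boxcar of radius R = 2
    adj = {i: [(i - 1) % 10, (i + 1) % 10] for i in range(10)}
    f = {v: 0.1 for v in adj}
    print(s_r(adj, f, R=2, r=1))   # mass of {w : 1 < d(v, w) <= 3}; here 0.4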

A direct consequence of Lemma 33 is the following theorem, which provides consistency for f̂.

Theorem 34. If log N(r, V, d) = o(min_{v∈V} min{p a_v^2/q_v, p a_v}), min_{v∈V} min{p a_v^2/q_v, p a_v} → ∞, and s_r = o(max_{v∈V} a_v), then ‖f̂ − E f̂‖_∞ → 0.
An alternative result relies on the Vapnik-Chervonenkis dimension, as in [24]. It is easily shown that the VC dimension of K is related to the packing number of the metric space V. By adapting Theorem 2.4.3 of [63], we then arrive at Theorem 36.
Lemma 35. Denote the VC number of the function class K by V(K), and the packing number of the space (V, d) by n(R, V, d). Then

    V(K) = n(R, V, d)/2.    (21)

Theorem 36. If n(R, V, d) log(min_{v∈V} a_v) = o(p min_{v∈V} a_v), then ‖f̂ − E f̂‖_∞ → 0.


6.3 Future Work

We have outlined the basic principles under which we control the approximation and estimation error for KDEs over graphs. Primarily, what remains is to apply these results to various graph models with respect to common graph distances. Specifically, we suspect that the conditions of Theorem 34 under the shortest path length distance are related to edge expansion. Moreover, controlling the approximation error (bias) requires a careful analysis of the function classes containing f. The proposed work is summarized:
1. Approximation error: control the bias across many graph-induced function spaces for specific kernels and graph models.
2. Estimation error: apply the conditions of Theorems 34 and 36 to specific graphs with respect to various metrics.

References
[1] L. Addario-Berry, N. Broutin, L. Devroye, and G. Lugosi. On combinatorial testing problems. The Annals of Statistics, 38(5):3063–3092, 2010.
[2] D. Aldous. The random walk construction of uniform spanning trees and uniform labelled trees. SIAM Journal on Discrete Mathematics, 3(4):450–465, 1990.
[3] E. Arias-Castro, E. Candès, and A. Durand. Detection of an anomalous cluster in a network. The Annals of Statistics, 39(1):278–304, 2011.
[4] E. Arias-Castro, E. Candès, H. Helgason, and O. Zeitouni. Searching for a trail of evidence in a maze. The Annals of Statistics, 36(4):1726–1757, 2008.
[5] E. Arias-Castro, D. Donoho, and X. Huo. Near-optimal detection of geometric objects by fast multiscale methods. IEEE Trans. Inform. Theory, 51(7):2402–2425, 2005.
[6] S. Arora, S. Rao, and U. Vazirani. Expander flows, geometric embeddings and graph partitioning. Journal of the ACM (JACM), 56(2):5, 2009.
[7] B. Baygun and A. O. Hero. Optimal simultaneous detection and estimation under a false alarm constraint. Signal Processing, IEEE Transactions on, 41(3):688–703, 1995.
[8] E. Belitser and B. Levit. Asymptotically minimax nonparametric regression in L2. Statistics: A Journal of Theoretical and Applied Statistics, 28(2):105–122, 1996.
[9] M. Belkin, P. Niyogi, and V. Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. The Journal of Machine Learning Research, 7:2399–2434, 2006.
[10] L. Carvalho and C. Lawrence. Centroid estimation in discrete high-dimensional spaces with applications in biology. Proceedings of the National Academy of Sciences, 105(9):3209, 2008.
[11] V. Cevher, C. Hegde, M. Duarte, and R. Baraniuk. Sparse signal recovery using Markov random fields. Technical report, DTIC Document, 2009.
[12] R. Coifman and M. Maggioni. Diffusion wavelets. Applied and Computational Harmonic Analysis, 21(1):53–94, 2006.
[13] L. Devroye and L. Györfi. Nonparametric Density Estimation: The L1 View. Wiley, 1985.
[14] S. Efroimovich and M. Pinsker. Estimation of square-integrable probability density of a random variable. Problemy Peredachi Informatsii, 18(3):19–38, 1982.
[15] P. Erdős and A. Rényi. On the evolution of random graphs. In Publication of the Mathematical Institute of the Hungarian Academy of Sciences, pages 17–61, 1960.
[16] M. Faloutsos, P. Faloutsos, and C. Faloutsos. On power-law relationships of the internet topology. In ACM SIGCOMM Computer Communication Review, volume 29, pages 251–262. ACM, 1999.
[17] R. Foster. The average impedance of an electrical network. Contributions to Applied Mechanics (Reissner Anniversary Volume), pages 333–340, 1949.
[18] M. Fouladirad, L. Freitag, and I. Nikiforov. Optimal fault detection with nuisance parameters and a general covariance matrix. International Journal of Adaptive Control and Signal Processing, 22(5):431–439, 2008.
[19] M. Fouladirad and I. Nikiforov. Optimal statistical fault detection with nuisance parameters. Automatica, 41(7):1157–1171, 2005.
[20] B. Friedman. Eigenvalues of composite matrices. Mathematical Proceedings of the Cambridge Philosophical Society, 57:37–49, 1961.
[21] W. Fung and N. Harvey. Graph sparsification by edge-connectivity and random spanning trees. Arxiv preprint arXiv:1005.0265, 2010.
[22] F. Gao. Moderate deviations and large deviations for kernel density estimators. Journal of Theoretical Probability, 16(2):401–418, 2003.
[23] M. Gavish, B. Nadler, and R. Coifman. Multiscale wavelets on trees, graphs and high dimensional data: Theory and applications to semi supervised learning. In Proc. International Conference on Machine Learning, Haifa, Israel, 2010.
[24] E. Giné and A. Guillou. Rates of strong uniform consistency for multivariate kernel density estimators. In Annales de l'Institut Henri Poincaré (B) Probability and Statistics, volume 38, pages 907–921. Elsevier, 2002.
[25] E. Giné and R. Nickl. An exponential inequality for the distribution function of the kernel density estimator, with applications to adaptive estimation. Probability Theory and Related Fields, 143(3):569–596, 2009.
[26] E. Giné and R. Nickl. Confidence bands in density estimation. The Annals of Statistics, 38(2):1122–1170, 2010.
[27] Z. Harchaoui and C. Lévy-Leduc. Multiple change-point estimation with a total variation penalty. Journal of the American Statistical Association, 105(492):1480–1493, 2010.
[28] N. Hjort and D. Pollard. Asymptotics for minimisers of convex processes. Arxiv preprint arXiv:1107.3806, 2011.
[29] H. Hoefling. A path algorithm for the fused lasso signal approximator. Technical report, October 2009.
[30] Y. Ingster. Minimax testing of nonparametric hypotheses on a distribution density in the L_p metrics. Theory of Probability and its Applications, 31:333, 1987.
[31] Y. Ingster and I. Suslina. Nonparametric Goodness-of-Fit Testing under Gaussian Models, volume 169. Springer Verlag, 2003.
[32] L. Jacob, P. Neuvial, and S. Dudoit. Gains in power from structured two-sample tests of means on graphs. Arxiv preprint arXiv:1009.5173, 2010.
[33] I. Johnstone. Minimax Bayes, asymptotic minimax and sparse wavelet priors. Statistical Decision Theory and Related Topics, Springer, pages 303–326, 1994.
[34] H. Kashima, K. Tsuda, and A. Inokuchi. Marginalized kernels between labeled graphs. In Proceedings of the International Conference on Machine Learning, volume 20, page 321, 2003.
[35] G. Kirchhoff. Ueber die Auflösung der Gleichungen, auf welche man bei der Untersuchung der linearen Vertheilung galvanischer Ströme geführt wird. Annalen der Physik, 148(12):497–508, 1847.
[36] R. Kondor and J. Lafferty. Diffusion kernels on graphs and other discrete input spaces. In Proceedings of the International Conference on Machine Learning, pages 315–322, 2002.
[37] A. Lee, B. Nadler, and L. Wasserman. Treelets: an adaptive multi-scale basis for sparse unordered data. The Annals of Applied Statistics, 2(2):435–471, 2008.
[38] E. Lehmann and J. Romano. Testing Statistical Hypotheses. Springer Verlag, 2005.
[39] J. Leskovec, D. Chakrabarti, J. Kleinberg, C. Faloutsos, and Z. Ghahramani. Kronecker graphs: An approach to modeling networks. The Journal of Machine Learning Research, 11:985–1042, 2010.
[40] J. Leskovec and C. Faloutsos. Scalable modeling of real graphs using Kronecker multiplication. In Proceedings of the 24th International Conference on Machine Learning, pages 497–504. ACM, 2007.
[41] J. Liu, L. Yuan, and J. Ye. An efficient algorithm for a class of fused lasso problems. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2010.
[42] L. Lovász. Random walks on graphs: A survey. Combinatorics, Paul Erdős is Eighty, 2(1):1–46, 1993.
[43] R. Lyons and Y. Peres. Probability on Trees and Networks. 2000.
[44] A. Madry, G. Miller, and R. Peng. Electrical flow algorithms for total variation minimization. Arxiv preprint arXiv:1110.1358, 2011.
[45] P. Mahé, N. Ueda, T. Akutsu, J. Perret, and J. Vert. Extensions of marginalized graph kernels. In Proceedings of the Twenty-First International Conference on Machine Learning, page 70. ACM, 2004.
[46] D. Matula and F. Shahrokhi. Sparsest cuts and bottlenecks in graphs. Discrete Applied Mathematics, 27(1):113–123, 1990.
[47] S. Milgram. The small world problem. Psychology Today, 2(1):60–67, 1967.
[48] J. Pearl and M. Tarsi. Structuring causal trees. Journal of Complexity, 2(1):60–77, 1986.
[49] P. Ravikumar and J. Lafferty. Quadratic programming relaxations for metric labeling and Markov random field MAP estimation. 2006.
[50] A. Rinaldo. Properties and refinements of the fused lasso. The Annals of Statistics, 37(5B):2922–2952, 2009.
[51] L. L. Scharf and B. Friedlander. Matched subspace detectors. Signal Processing, IEEE Transactions on, 42(8):2146–2157, 1994.
[52] J. Sharpnack, A. Krishnamurthy, and A. Singh. Detecting activations over graphs using spanning tree wavelet bases. Arxiv preprint arXiv:1206.0937, 2012.
[53] J. Sharpnack, A. Rinaldo, and A. Singh. Changepoint detection over graphs with the spectral scan statistic. Arxiv preprint arXiv:1206.0773, 2012.
[54] J. Sharpnack, A. Rinaldo, and A. Singh. Sparsistency of the edge lasso over graphs. AIStats (JMLR WCP), 22:1028–1036, 2012.
[55] J. Sharpnack and A. Singh. Identifying graph-structured activation patterns in networks. In Proceedings of Neural Information Processing Systems, NIPS, 2010.
[56] A. Singh, R. Nowak, and R. Calderbank. Detecting weak but hierarchically-structured patterns in networks. Arxiv preprint arXiv:1003.0205, 2010.
[57] A. Smola and R. Kondor. Kernels and regularization on graphs. Learning Theory and Kernel Machines, pages 144–158, 2003.
[58] M. Talagrand. The Generic Chaining. Springer, 2005.
[59] P. Tetali. Random walks and the effective resistance of networks. Journal of Theoretical Probability, 4(1):101–109, 1991.
[60] R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight. Sparsity and smoothness via the fused lasso. J. Roy. Statist. Soc. Ser. B, 67:91–108, 2005.
[61] R. J. Tibshirani and J. Taylor. The solution path of the generalized lasso. May 2010.
[62] A. Tsybakov. Introduction to Nonparametric Estimation. Springer Verlag, 2009.
[63] A. Van Der Vaart and J. Wellner. Weak Convergence and Empirical Processes. Springer Verlag, 1996.
[64] S. Vishwanathan, K. Borgwardt, I. Kondor, and N. Schraudolph. Graph kernels. Arxiv preprint arXiv:0807.0093, 2008.
[65] U. von Luxburg, A. Radl, and M. Hein. Hitting and commute times in large graphs are often misleading. ReCALL, 2010.
[66] A. Wald. Tests of statistical hypotheses concerning several parameters when the number of observations is large. Transactions of the American Mathematical Society, 54:426–482, 1943.

