Received January 27, 2012; revised February 29, 2012; accepted March 9, 2012
ABSTRACT
In this paper we consider a linear regression model with fixed design. A new rule for the selection of a relevant submodel is introduced on the basis of parameter tests. One particular feature of the rule is that subjective grading of the model complexity can be incorporated. We provide bounds for the mis-selection error. Simulations show that by using the proposed selection rule, the mis-selection error can be controlled uniformly.
methods based on information criteria, and it is different from Efroymson's algorithm of stepwise variable selection in [23].

2) We derive convergence rates for the probability of mis-selection which are better than those proved in papers about information criteria, e.g. in [20].

3) Subjective grading of the model complexity can be incorporated.

Concerning 1), we consider tests on a set of parameters, in contrast to FDR-methods, where several tests on only one parameter are applied. Moreover, w.r.t. 2), many authors do not analyse the behaviour of mis-selection probabilities. Results on bounds or convergence rates of these probabilities are more informative than consistency. Aspect 3) is of special interest from the point of view of model building. Typically, model builders have some preference rules in mind when selecting the model. They prefer simple models with linear functions to models with more complex functions (exponential or logarithmic, for example). The crucial idea is to assign to each submodel a specific complexity number.

We do not assume that the errors are normally distributed. This ensures a wide-ranging applicability of the approach, but only asymptotic distributions of the test statistics are available. From the examples in Section 2, it can be seen that applications are possible in several directions, for instance to the one-factor-ANOVA model. The simulations show an advantage of the proposed method in that it controls the frequency of mis-selection uniformly. For models with a large number of regressors, the problem of establishing an effective selection algorithm is not discussed in this paper; we refer to the paper [24].

The paper is organised as follows: In Section 2 we introduce the regression model and several versions of submodels. The asymptotic behaviour of the basic statistic is also studied there. Section 3 is devoted to the model selection method. We provide convergence rates for the probability that the procedure selects the wrong model (mis-selection). We see that the behaviour is similar to that in the case of hypothesis testing. The results of simulations are discussed in Section 4. The reader finds the proofs in Section 5.

2. Models

Let us introduce the master model

$$Y_i = \sum_{j=1}^{k} x_{ij}\beta_j + \varepsilon_i \quad\text{for } i = 1,\dots,n,$$

where $X = (x_{ij})_{i=1,\dots,n,\ j=1,\dots,k} \in \mathbb{R}^{n\times k}$ is the design matrix, $\beta = (\beta_1,\dots,\beta_k)^T$ the unknown parameter vector, $\varepsilon_1,\dots,\varepsilon_n$ the random errors and $Y$ the response variable. In short we can write

$$Y = X\beta + \varepsilon, \tag{1}$$

where $Y = (Y_1,\dots,Y_n)^T$, $\varepsilon = (\varepsilon_1,\dots,\varepsilon_n)^T$. The least squares estimator for $\beta$ is given by

$$\hat\beta = \bigl(X^TX\bigr)^{-1}X^TY.$$

This leads to the residual sum of squares

$$R_n = \bigl\|Y - X\hat\beta\bigr\|^2 = Y^T\bigl(I - X(X^TX)^{-1}X^T\bigr)Y,$$

where $\|\cdot\|$ is the Euclidean vector norm.

The aim is to select model (1) or an appropriate submodel which fits the data well. Moreover, we search for a reasonably simple model. In the following we define the submodels of (1). The submodel with index $\ell \in \{1,\dots,L\}$ has the parameter vector $\gamma = (\gamma_1,\gamma_2,\dots,\gamma_l)^T \in \mathbb{R}^l$, $l := l_\ell$, where the vector $\gamma$ is related to $\beta$ by $\beta = D_\ell\gamma$ with an appropriate matrix $D_\ell \in \mathbb{R}^{k\times l_\ell}$ having maximum rank $l_\ell \le k$. In a large number of applications, the $\gamma_i$'s coincide with different components of $\beta$. The submodel indices $\ell = 1$ and $\ell = L$ correspond to the model function equal to zero (no parameters) and to the full model, respectively. Thus we can write the model equation for the submodel as

$$Y = X_\ell\gamma + \varepsilon, \tag{2}$$

where $X_\ell = XD_\ell$. The parameter space of submodel $\ell$ in (1) is given by $\mathcal{B}_\ell := \{D_\ell\gamma : \gamma \in \mathbb{R}^{l_\ell}\}$. Next we give several versions for the definition of submodels in different situations.

Example 1. We consider all submodels where components of $\beta$ are zero. More precisely, index $\ell$ is assigned to a submodel if $\gamma_1 = \beta_{i_1},\dots,\gamma_l = \beta_{i_l}$ are the parameters of the submodel ($i_1 < i_2 < \cdots < i_l$), $\beta_j = 0$ for $j \notin J := \{i_1,\dots,i_l\}$, and $\ell = 1 + \sum_{j=1}^{l}2^{i_j-1}$. Let $e_i = (0,\dots,0,1_i,0,\dots)^T \in \mathbb{R}^k$ be the $i$-th unit vector. Then $D_\ell = (e_{i_1}, e_{i_2},\dots,e_{i_l}) \in \mathbb{R}^{k\times l}$, and $\mathcal{B}_\ell = \{\beta : \beta_j = 0 \text{ for all } j \notin J\}$. For example, for $k = 5$, the submodel with index $\ell = 14$ has the parameters $\gamma_1 = \beta_1$, $\gamma_2 = \beta_3$, $\gamma_3 = \beta_4$, and $\beta_2 = \beta_5 = 0$ holds. Moreover, we have

$$D_\ell = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{pmatrix}, \qquad \mathcal{B}_\ell = \bigl\{\beta \in \mathbb{R}^5 : \beta_2 = \beta_5 = 0\bigr\}.$$
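The indexing in Example 1 is easy to make concrete. The following sketch is ours, not the authors' (the helper names are illustrative): it maps an active set $J$ to the index $\ell = 1 + \sum_{j=1}^{l}2^{i_j-1}$ and assembles $D_\ell$ from unit vectors, reproducing $\ell = 14$ for $J = \{1,3,4\}$, $k = 5$.

```python
import numpy as np

def submodel_index(J):
    """Index of the submodel with active components J = {i_1 < ... < i_l}:
    l = 1 + sum_j 2**(i_j - 1), as in Example 1."""
    return 1 + sum(2 ** (i - 1) for i in J)

def selection_matrix(J, k):
    """D_l in R^{k x l}: the columns are the unit vectors e_{i_1}, ..., e_{i_l},
    so that beta = D_l @ gamma embeds the submodel parameters into R^k."""
    D = np.zeros((k, len(J)), dtype=int)
    for col, i in enumerate(sorted(J)):
        D[i - 1, col] = 1
    return D

print(submodel_index({1, 3, 4}))        # 14, as in the example
print(selection_matrix({1, 3, 4}, 5))   # 5 x 3 matrix; rows 2 and 5 are zero
```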
… $\mathcal{B}_\ell = \{\beta : \beta_j = \beta_k \text{ if } j, k \in J_i \text{ for some } i\}$. The submodel with index $\ell$ has $l_\ell$ parameters. Furthermore, $D_\ell = (d_{ij})_{i=1,\dots,k,\ j=1,\dots,l_\ell}$, where

$$d_{ij} = \begin{cases} 1 & \text{for } i \in J_j, \\ 0 & \text{otherwise.} \end{cases}$$

Example 3 shows that the model selection problem occurs also in the context of ANOVA.
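The matrix $D_\ell$ of this ANOVA-type construction is a plain 0-1 group-membership matrix; a short sketch (ours, with the grouping $J_1,\dots,J_{l_\ell}$ as an assumed input) renders the definition of $d_{ij}$ directly:

```python
import numpy as np

def anova_selection_matrix(groups, k):
    """D_l = (d_ij) with d_ij = 1 if level i belongs to group J_j, else 0.
    Components of beta sharing a group are forced to a common value gamma_j."""
    D = np.zeros((k, len(groups)), dtype=int)
    for j, J_j in enumerate(groups):
        for i in J_j:
            D[i - 1, j] = 1
    return D

# k = 4 factor levels; submodel claiming beta_1 = beta_2 and beta_3 = beta_4:
print(anova_selection_matrix([{1, 2}, {3, 4}], k=4))
```

With $\beta = D_\ell\gamma$, the group means $\gamma_1, \gamma_2$ are the free parameters, which is exactly the constraint set $\{\beta : \beta_j = \beta_k \text{ if } j, k \in J_i \text{ for some } i\}$ above.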
In submodel (2) with index $\ell$, the least squares estimator $\hat\gamma_\ell$ and the residual sum of squares $S_\ell$ are given by

$$\hat\gamma_\ell = \bigl(X_\ell^TX_\ell\bigr)^{-1}X_\ell^TY, \qquad S_\ell = \bigl\|Y - X_\ell\hat\gamma_\ell\bigr\|^2 = Y^T\bigl(I - X_\ell(X_\ell^TX_\ell)^{-1}X_\ell^T\bigr)Y. \tag{3}$$

What is an appropriate statistic for model selection? Let $\tilde M_n(\ell) := S_\ell - R_n$. Here we consider a quantity $M_n(\ell)$ which is similar to the F-statistics known from hypothesis testing in linear regression models with normal errors:

$$M_n(\ell) = \frac{\tilde M_n(\ell)}{\frac{1}{n-l_\ell}S_\ell} = \frac{\bigl(\hat\beta - D_\ell\hat\gamma_\ell\bigr)^TX^TX\bigl(\hat\beta - D_\ell\hat\gamma_\ell\bigr)}{\frac{1}{n-l_\ell}S_\ell} \tag{4}$$

for $\ell < L$, and $M_n(L) := 0$. The main difference to classical F-statistics is that the estimator $\frac{1}{n-l_\ell}S_\ell$ of the model variance in submodel $\ell$ is used.

In a wide range of applications, the entries $x_{ij}$ of the design matrix are uniformly bounded, such that $B_{np} = O(n) = o(n^{p/2})$ since $p > 2$. The Assumption may be weakened in some ways, but we use this assumption to reduce the technical effort. We introduce

$$\tilde G_\ell := D_\ell^TGD_\ell \in \mathbb{R}^{l_\ell\times l_\ell} \qquad\text{and}\qquad K_\ell := \beta_0^TG\bigl(I - D_\ell\tilde G_\ell^{-1}D_\ell^TG\bigr)\beta_0,$$

where $G := \lim_{n\to\infty}G_n$ with $G_n := \frac{1}{n}X^TX$. Proposition 2.1 clarifies the asymptotic behaviour of the statistic $M_n(\ell)$.

Proposition 2.1. Assume that the Assumption is satisfied.

1) Assume that $\beta_0 \in \mathcal{B}_\ell$ and $l_\ell < k$. Then we have $M_n(\ell) \xrightarrow{d} \chi^2_{k-l_\ell}$.

2) Suppose that $\beta_0 \notin \mathcal{B}_\ell$ and $l_\ell < k$. Let $\|G_n - G\| = o(n^{-1/2})$ be satisfied ($n \to \infty$). Then we have

$$M_n(\ell) = \frac{1}{\sigma^2 + K_\ell}\bigl(K_\ell\,n + \sqrt n\,W_n\bigr) + o_P(\sqrt n), \qquad W_n \xrightarrow{d} \mathcal N\bigl(0, \sigma_W^2\bigr), \quad \sigma_W^2 = 4\sigma^2K_\ell.$$

Depending on whether the true parameter $\beta_0$ belongs to submodel $\ell$ or not, the statistic $M_n(\ell)$ has a different asymptotic behaviour. In the first case, it has an asymptotic $\chi^2$-distribution. In the second case it tends to infinity in probability with rate $n$. Therefore, the statistic $M_n(\ell)$ is suitable for model selection. In the next section a selection procedure is introduced, based on $M_n(\ell)$ serving as fundamental statistic.
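Proposition 2.1(1) can be probed numerically. The sketch below is ours (normal errors are used only to drive the simulation; the statistic itself does not assume them): it evaluates $M_n(\ell)$ exactly as in (4) and checks that its sample mean approaches $k - l_\ell$, the mean of the limiting $\chi^2_{k-l_\ell}$ distribution, when $\beta_0$ lies in the submodel.

```python
import numpy as np

def rss(X, Y):
    """Residual sum of squares of the least squares fit on design X."""
    beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    r = Y - X @ beta_hat
    return r @ r

def M_n(X, Y, D):
    """Basic statistic M_n(l) = (S_l - R_n) / ((n - l_l)^{-1} S_l), cf. (4)."""
    n = X.shape[0]
    l = D.shape[1]
    S_l = rss(X @ D, Y)   # RSS in the submodel with design X D_l
    R_n = rss(X, Y)       # RSS in the full model
    return (S_l - R_n) / (S_l / (n - l))

rng = np.random.default_rng(0)
n, k = 400, 4
x = np.arange(1, n + 1) / n
X = np.column_stack([x**j for j in range(k)])   # cubic design: 1, x, x^2, x^3
D = np.eye(k)[:, :2]                            # submodel: beta_3 = beta_4 = 0
beta0 = D @ np.array([1.0, 2.0])                # true parameter inside the submodel

stats = [M_n(X, X @ beta0 + 0.2 * rng.standard_normal(n), D) for _ in range(2000)]
print(np.mean(stats))   # close to k - l_l = 2, the mean of chi^2_2
```

Moving $\beta_0$ outside $\mathcal{B}_\ell$ makes the printed value grow linearly in $n$, which is the rate-$n$ divergence of part 2).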
3. The New Selection Rule

In this section we propose a selection rule which is based …

… $m_2 = O\bigl(B_{np}\,n^{-p}\bigr)$ and $m_3 = O\bigl(B_{np}\,n^{-p}\bigr)$.

The probability of mis-selection (PMS) of case (m1) behaves like a type-1-error in a statistical test. It approaches asymptotically $\alpha_n(d_0)$ under the assumptions of part 1). The additional term with rate $O(n^{-3/2}B_{n3})$ comes from the application of the central limit theorem, and has rate $O(n^{-1/2})$ in the case where the $x_{ij}$'s are uniformly bounded. This theorem shows that the PMS of cases (m2) and (m3) tends to zero at rate $O(n^{1-p})$, provided that the $x_{ij}$'s are uniformly bounded and $\alpha_n(d) \ge Cn^{-a}$ for all $d$ and some $a > 0$. These rates of PMS are rather fast. They are better than in comparable cases in [20] ($\alpha_n$ and the corresponding tuning sequence in [20] can be considered to have the same rate). One reason is that in this paper alternative techniques, such as the Fuk-Nagaev inequality, are employed to obtain the convergence rates. The results of Theorem 3.1 recommend the selection rule above from the theoretical point of view. The behaviour in practice is discussed in the next section.

4. Simulations

Here we consider the polynomial model

$$Y_{ni} = \beta_1 + \beta_2x_i + \beta_3x_i^2 + \beta_4x_i^3 + \varepsilon_i \quad\text{for } i = 1,\dots,n,$$

where $x_1,\dots,x_n \in [0,1]$ are the observations of the regressor variable, and the $\varepsilon_i$'s are i.i.d. random variables. For simplicity, we consider the case $x_i = i/n$. The complexity $d_\ell$ is measured as given in Example 4(b). We compare the selection method of the previous section with procedures based on Schwarz's Bayesian information criterion (BIC, see [3]) and the Hannan-Quinn criterion (HQIC, see [25]). Tables 1-3 show the frequencies of mis-selection (FM). The results are based on $10^6$ replications of the model.

Table 1. FM in percent in the case $n = 100$.

| β1 | β2 | β3 | β4 | FM new meth. | FM BIC | FM HQIC |
| --- | --- | --- | --- | --- | --- | --- |
| 0.377 | 3.38 | 7.65 | 100 | 1.936 | 3.490 | 3.432 |
| 0 | 0 | 0 | 0 | 2.102 | 4.008 | 4.060 |
| 0.38872 | 3.39 | 7.8987 | 5.1754 | 1.825 | 5.269 | 5.178 |
| –0.38872 | 3.39 | 7.8987 | 5.1754 | 1.830 | 6.309 | 6.198 |
| 0.38872 | –3.39 | 7.8987 | 5.1754 | 1.893 | 5.297 | 5.213 |
| 0.38872 | 3.39 | –7.8987 | 5.1754 | 1.873 | 8.039 | 7.900 |
| 0.38872 | 3.39 | 7.8987 | –5.1754 | 1.897 | 6.452 | 6.347 |
| –0.38872 | –3.39 | 7.8987 | 5.1754 | 1.893 | 5.297 | 5.213 |
| –0.38872 | 3.39 | –7.8987 | 5.1754 | 1.864 | 14.207 | 13.95 |
| –0.38872 | 3.39 | 7.8987 | –5.1754 | 2.029 | 6.736 | 6.622 |

Table 2. FM in percent for different error distributions.

| n | β1 | β2 | β3 | β4 | εi ~ | FM new meth. | FM BIC | FM HQIC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 100 | 0 | 0 | 0 | 0 | σ·t(3) | 1.735 | 3.895 | 3.951 |
| 100 | 0.468 | 4.104 | 9.516 | 6.264 | σ·t(3) | 2.043 | 2.966 | 2.943 |
| 400 | 0 | 0 | 0 | 0 | N(0, σ²) | 0.956 | 1.780 | 2.502 |
| 400 | 0 | 0 | 0 | 0 | σ·t(3) | 0.942 | 1.736 | 2.459 |
| 400 | –0.216 | 1.873 | 4.365 | –2.863 | N(0, σ²) | 1.122 | 5.869 | 3.905 |
| 400 | –0.216 | 1.873 | 4.365 | –2.863 | σ·t(3) | 3.067 | 8.164 | 6.275 |

We choose the following error probabilities: $\alpha_n(1) = 0.02$, $\alpha_n(2) = 0.022$, $\alpha_n(3) = 0.024$, $\alpha_n(4) = 0.026$ in the case $n = 100$, and $\alpha_n(i) = 0.01$ in the case $n = 400$.

The selection rule of the previous section always gives FM-values near the given values of $\alpha_n$. The methods based on BIC and HQIC partially show FM-values also near these $\alpha_n$, but in some special cases the FM-values are much higher (for example, for $\beta_1 = 0.38872$, $\beta_2 = 3.39$, $\beta_3 = 7.8987$, $\beta_4 = 5.1754$ according to Table 1; $\beta_1 = 0.2569$, $\beta_2 = 2.227$, $\beta_3 = 5.197$, $\beta_4 = 3.405$ according to Table 3). By our method we are able to control the FM-values by choosing an appropriate $\alpha_n$.
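This comparison can be replayed in a few lines. The sketch below is ours and implements one plausible reading of the rule of Section 3 — the full wording of the rule is not reproduced above, but the proof of Theorem 3.1 compares $M_n(i)$ with the $(1-\alpha_n(d_i))$-quantile of $\chi^2_{k-l_i}$ — together with a textbook Gaussian BIC used only as the comparison baseline.

```python
import numpy as np
from scipy import stats

def rss(X, Y):
    """Residual sum of squares of the least squares fit on design X."""
    beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    r = Y - X @ beta_hat
    return r @ r

def select(X, Y, submodels, alpha):
    """One reading of the Section 3 rule: accept submodel i when M_n(i) stays
    below the (1 - alpha(d_i))-quantile of chi^2_{k - l_i}; among the accepted
    submodels return one of minimal complexity d_i. The list `submodels`
    should include the full model (D = I_k), which is always accepted."""
    n, k = X.shape
    R_n = rss(X, Y)
    accepted = []
    for d_i, D in submodels:                 # pairs (complexity, D_i)
        l = D.shape[1]
        S_i = rss(X @ D, Y)
        M = (S_i - R_n) / (S_i / (n - l))    # statistic (4); M_n(L) = 0
        if l == k or M <= stats.chi2.ppf(1 - alpha(d_i), k - l):
            accepted.append((d_i, D))
    return min(accepted, key=lambda t: t[0])

def bic(X, Y, D):
    """Standard Gaussian BIC, n*log(S_i/n) + l_i*log(n) (comparison only)."""
    n = X.shape[0]
    return n * np.log(rss(X @ D, Y) / n) + D.shape[1] * np.log(n)
```

Run over the cubic design with $\alpha_n(i) = 0.01$, $\sigma = 0.2$ and $t(3)$ errors, the FM frequencies of this rule should land near the "FM new method" column of Table 3, up to Monte Carlo error and up to our reading of the rule.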
Table 3. FM in percent in the case $n = 400$, $\sigma = 0.2$, $\varepsilon_i \sim \sigma\,t(3)$.

| β1 | β2 | β3 | β4 | FM new method | FM BIC | FM HQIC |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 0 | 0.942 | 1.736 | 2.459 |
| 0.2569 | 2.227 | 5.197 | 3.405 | 1.003 | 1.990 | 1.596 |
| –0.2569 | 2.227 | 5.197 | 3.405 | 0.958 | 2.086 | 1.657 |
| 0.2569 | –2.227 | 5.197 | 3.405 | 0.984 | 2.169 | 1.728 |
| 0.2569 | 2.227 | –5.197 | 3.405 | 0.945 | 2.575 | 2.004 |
| 0.2569 | 2.227 | 5.197 | –3.405 | 0.987 | 2.141 | 1.690 |
| –0.2569 | –2.227 | 5.197 | 3.405 | 1.011 | 2.005 | 1.606 |
| –0.2569 | 2.227 | –5.197 | 3.405 | 0.798 | 3.567 | 2.652 |
| –0.2569 | 2.227 | 5.197 | –3.405 | 1.015 | 2.299 | 1.823 |
| 0 | 100 | 100 | 100 | 1.200 | 1.064 | 1.427 |
| 0.217 | 100 | 100 | 100 | 0.983 | 1.059 | 0.890 |
| 100 | 0 | 100 | 100 | 1.004 | 0.873 | 1.217 |
| 100 | 1.89 | 100 | 100 | 0.969 | 1.046 | 0.868 |
| 100 | 100 | 0 | 100 | 0.984 | 0.844 | 1.202 |
| 100 | 100 | 4.38 | 100 | 0.976 | 1.059 | 0.876 |
| 100 | 100 | 100 | 0 | 0.948 | 0.817 | 1.168 |
| 100 | 100 | 100 | 2.87 | 0.992 | 1.070 | 0.887 |
| 100 | 0 | 0 | 100 | 0.973 | 1.0706 | 1.525 |
| 100 | 2.08 | 4.84 | 100 | 1.010 | 1.084 | 0.914 |
| 100 | 2.08 | –4.84 | 100 | 0.868 | 2.2384 | 1.741 |

5. Proofs

By $C$ we denote a positive generic constant which can vary from place to place and does not depend on other variates. Throughout this section, we assume that the Assumption is fulfilled. In the following we prove auxiliary statements which are used later in the proofs of the theorems.

Lemma 5.1. The inequality

$$P\bigl(\varepsilon^TXX^T\varepsilon \ge n^2\tau\bigr) \le C_1\bigl(B_{np}\,n^{-p}\tau^{-p/2} + e^{-C_0\tau n}\bigr)$$

holds for all $\tau > 0$, where $C_0, C_1 > 0$ are constants not depending on $\tau$, $n$.

Proof: Obviously,

$$\varepsilon^TXX^T\varepsilon = \sum_{j=1}^k\Bigl(\sum_{i=1}^nx_{ij}\varepsilon_i\Bigr)^2,$$

hence

$$P\bigl(\varepsilon^TXX^T\varepsilon \ge n^2\tau\bigr) \le \sum_{j=1}^kP\Bigl(\Bigl|\sum_{i=1}^nx_{ij}\varepsilon_i\Bigr| \ge n\sqrt{\tau/k}\Bigr),$$

and $\sum_{i=1}^nx_{ij}^2 \le n\,\mathrm{Trace}(G_n) = O(n)$. Applying the Fuk-Nagaev inequality (see [26]), we obtain the assertion of the lemma:

$$P\bigl(\varepsilon^TXX^T\varepsilon \ge n^2\tau\bigr) \le \sum_{j=1}^k\Biggl(C\,\tau^{-p/2}n^{-p}\sum_{i=1}^n|x_{ij}|^p\,E|\varepsilon_i|^p + \exp\Bigl(-\frac{C\,\tau n^2}{\sum_{i=1}^nx_{ij}^2}\Bigr)\Biggr) \le C\bigl(B_{np}\,n^{-p}\tau^{-p/2} + e^{-C\tau n}\bigr). \quad\Box$$

Lemma 5.2. Assume that $\beta_0 \in \mathcal{B}_\ell$ for some $\ell \in \{1,\dots,L\}$. Then

$$P\Bigl(\Bigl|\frac{1}{n-l_\ell}S_\ell - \sigma^2\Bigr| \ge \tau\Bigr) \le C_2\,n^{-p/2}\tau^{-p/2} + e^{-C_3n\tau^2}$$

holds for $8l_\ell\sigma^2/n \le \tau \le 1$ and $n$ large enough, where $C_2, C_3 > 0$ are constants not depending on $\tau$, $n$. The same upper bound holds for $P\bigl(\bigl|\frac{1}{n-l_\ell}\varepsilon^T\varepsilon - \sigma^2\bigr| \ge \tau\bigr)$.

Proof: Observe that

$$S_\ell = \varepsilon^T\bigl(I - X_\ell(X_\ell^TX_\ell)^{-1}X_\ell^T\bigr)\varepsilon$$

by (3). Hence

$$\Bigl|\frac{1}{n-l_\ell}S_\ell - \sigma^2\Bigr| \le \Bigl|\frac{1}{n-l_\ell}\varepsilon^T\varepsilon - \sigma^2\Bigr| + \frac{1}{n-l_\ell}\,\varepsilon^TX_\ell\bigl(X_\ell^TX_\ell\bigr)^{-1}X_\ell^T\varepsilon. \tag{5}$$

Further, an application of the Fuk-Nagaev inequality from [26] leads to

$$P\Bigl(\Bigl|\frac{1}{n-l_\ell}\varepsilon^T\varepsilon - \sigma^2\Bigr| \ge \frac{\tau}{2}\Bigr) \le P\Bigl(\Bigl|\sum_{i=1}^n\bigl(\varepsilon_i^2 - \sigma^2\bigr)\Bigr| \ge \frac{n\tau}{8}\Bigr) \le C\,n^{-p/2}\tau^{-p/2} + e^{-Cn\tau^2} \tag{6}$$

for $\tau \ge 8l_\ell\sigma^2/n$ and $n \ge 2l_\ell$. Since $X_\ell(X_\ell^TX_\ell)^{-1}X_\ell^T = \frac{1}{n}X\bar D_\ell X^T$ with $\bar D_\ell := D_\ell\tilde G_{\ell n}^{-1}D_\ell^T \in \mathbb{R}^{k\times k}$, $\tilde G_{\ell n} := D_\ell^TG_nD_\ell$, and therefore $\bar D_\ell$ has a bounded norm, we deduce

$$P\Bigl(\frac{1}{n-l_\ell}\,\varepsilon^TX_\ell\bigl(X_\ell^TX_\ell\bigr)^{-1}X_\ell^T\varepsilon \ge \frac{\tau}{2}\Bigr) \le P\bigl(\varepsilon^TXX^T\varepsilon \ge Cn^2\tau\bigr) \le C\bigl(B_{np}\,n^{-p}\tau^{-p/2} + e^{-C\tau n}\bigr) \tag{7}$$

by Lemma 5.1 for $n$ large enough. A combination of Inequalities (5)-(7) yields the lemma. $\Box$
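Both steps of this proof, and the proof of Lemma 5.1, invoke the Fuk-Nagaev inequality. Since [26] states it in greater generality, it may help to record the corollary form that appears to be in use here (our paraphrase, not a quotation of [26]; $C_p, c_p > 0$ depend only on $p$): for independent centered random variables $X_1,\dots,X_n$ with $E|X_i|^p < \infty$, $p > 2$,

$$P\Bigl(\Bigl|\sum_{i=1}^nX_i\Bigr| \ge t\Bigr) \;\le\; C_p\,t^{-p}\sum_{i=1}^nE|X_i|^p \;+\; 2\exp\Bigl(-\frac{c_p\,t^2}{\sum_{i=1}^nEX_i^2}\Bigr).$$

Applied with $X_i = x_{ij}\varepsilon_i$ and $t = n\sqrt{\tau/k}$, the moment term produces the polynomial part involving $B_{np}$, while $\sum_ix_{ij}^2 = O(n)$ turns the exponential term into $e^{-C\tau n}$, exactly as in Lemma 5.1.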
Lemma 5.3. Assume that $\beta_0 \notin \mathcal{B}_\ell$ for some $\ell \in \{1,\dots,L\}$.

1) Then $S_\ell = S_{\ell,1} + 2S_{\ell,2} + S_{\ell,3}$, where

$$S_{\ell,1} = \varepsilon^T\bigl(I - X_\ell(X_\ell^TX_\ell)^{-1}X_\ell^T\bigr)\varepsilon, \quad S_{\ell,2} = \beta_0^T\bigl(I - G_nD_\ell\tilde G_{\ell n}^{-1}D_\ell^T\bigr)X^T\varepsilon, \quad S_{\ell,3} = n\,\beta_0^T\bigl(G_n - G_nD_\ell\tilde G_{\ell n}^{-1}D_\ell^TG_n\bigr)\beta_0,$$

with $G_n := \frac{1}{n}X^TX$ and $\tilde G_{\ell n} = D_\ell^TG_nD_\ell$ as before.

2) The inequality

$$P\Bigl(\Bigl|\frac{1}{n-l_\ell}\bigl(S_\ell - S_{\ell,3}\bigr) - \sigma^2\Bigr| \ge \tau\Bigr) \le C_4\,n^{-p/2}\tau^{-p/2} + e^{-C_5n\tau^2}$$

holds for $\tau > 0$ and $n$ large enough, with constants $C_4, C_5 > 0$ not depending on $n$, $\tau$.

Proof: Part 1) is a consequence of (3) and $Y = X\beta_0 + \varepsilon$.

2) Using Lemmas 5.1 and 5.2, we deduce

$$\Bigl|\frac{1}{n-l_\ell}\bigl(S_\ell - S_{\ell,3}\bigr) - \sigma^2\Bigr| \le \Bigl|\frac{1}{n-l_\ell}S_{\ell,1} - \sigma^2\Bigr| + \frac{2}{n-l_\ell}\bigl|S_{\ell,2}\bigr|,$$

where the first term on the right-hand side is treated by Lemma 5.2. Moreover, $|S_{\ell,2}| \le C\|\beta_0\|\bigl(\varepsilon^TXX^T\varepsilon\bigr)^{1/2}$, and therefore

$$P\Bigl(\frac{2}{n-l_\ell}\bigl|S_{\ell,2}\bigr| \ge \frac{\tau}{2}\Bigr) \le P\bigl(\varepsilon^TXX^T\varepsilon \ge Cn^2\tau^2\bigr) \le C\bigl(B_{np}\,n^{-p}\tau^{-p} + e^{-C\tau^2n}\bigr)$$

by Lemma 5.1 for $n \ge 2l_\ell$ large enough. This implies assertion 2) of the lemma. $\Box$
Proof of Proposition 2.1: 1) We have

$$\tilde M_n(\ell) = Y^TX(X^TX)^{-1}X^TY - Y^TX_\ell(X_\ell^TX_\ell)^{-1}X_\ell^TY = \zeta_n^T\bigl(G_n^{-1} - D_\ell\tilde G_{\ell n}^{-1}D_\ell^T\bigr)\zeta_n, \qquad \zeta_n := n^{-1/2}X^TY.$$

Under $\beta_0 \in \mathcal{B}_\ell$, the quadratic form reduces to $\xi_n^T\bigl(G_n^{-1} - D_\ell\tilde G_{\ell n}^{-1}D_\ell^T\bigr)\xi_n$ with $\xi_n := n^{-1/2}X^T\varepsilon$. Moreover, the identity

$$\lim_{n\to\infty}\bigl(G_n^{-1} - D_\ell\tilde G_{\ell n}^{-1}D_\ell^T\bigr) = G^{-1} - D_\ell\tilde G_\ell^{-1}D_\ell^T \tag{9}$$

holds in view of the Assumption. An application of Lemma 5.4 and the Cochran theorem leads to $\frac{1}{\sigma^2}\tilde M_n(\ell) \xrightarrow{d} \chi^2_{k-l_\ell}$. Lemma 5.2 implies $\frac{1}{n-l_\ell}S_\ell \xrightarrow{P} \sigma^2$, and therefore assertion 1) of Proposition 2.1 is proved.

2) Let $\beta_0 \notin \mathcal{B}_\ell$, and

$$W_n := \frac{2}{\sqrt n}\,\beta_0^T\bigl(I - G_nD_\ell\tilde G_{\ell n}^{-1}D_\ell^T\bigr)X^T\varepsilon.$$

By assumption, $\|G_n - G\| = o(n^{-1/2})$, so

$$G_nD_\ell\tilde G_{\ell n}^{-1}D_\ell^T - GD_\ell\tilde G_\ell^{-1}D_\ell^T = o\bigl(n^{-1/2}\bigr)$$

holds true. We derive

$$\tilde M_n(\ell) = n\,\beta_0^T\bigl(G_n - G_nD_\ell\tilde G_{\ell n}^{-1}D_\ell^TG_n\bigr)\beta_0 + \sqrt n\,W_n + \xi_n^T\bigl(G_n^{-1} - D_\ell\tilde G_{\ell n}^{-1}D_\ell^T\bigr)\xi_n = nK_\ell + \sqrt n\,W_n + o_P(\sqrt n).$$

From Lemma 5.4, (9) and $G_n \to G$, it follows that $W_n \xrightarrow{d} \mathcal N(0, \sigma_0^2)$, where

$$\sigma_0^2 := 4\sigma^2\,\beta_0^TG\bigl(I - D_\ell\tilde G_\ell^{-1}D_\ell^TG\bigr)\beta_0 = 4\sigma^2K_\ell.$$

Moreover, we deduce $S_{\ell,2} = \frac{\sqrt n}{2}W_n = o_P(n)$ and, using Lemma 5.2, $\frac{1}{n-l_\ell}S_\ell = \sigma^2 + K_\ell + o_P(1)$. Together these relations imply assertion 2). $\Box$

Lemma 5.5. If $\beta_0 \notin \mathcal{B}_\ell$, then $K_\ell > 0$.

Proof: Write $K_\ell = \beta_0^TG^{1/2}QG^{1/2}\beta_0$, where $Q := I - G^{1/2}D_\ell\tilde G_\ell^{-1}D_\ell^TG^{1/2}$ is a symmetric projection matrix. Hence

$$Q = \sum_{i=1}^m h_ih_i^T = HDH^T, \qquad D = \mathrm{diag}\bigl(1_1,\dots,1_m,\,0_{m+1},\dots,0_k\bigr),$$

where $m := \mathrm{rank}(Q) = k - l_\ell$ and $h_1,\dots,h_m$ are the first $m$ columns of the orthogonal matrix $H \in \mathbb{R}^{k\times k}$. For $x \in \mathbb{R}^k$,

$$x^TQx = \sum_{i=1}^m\bigl(x^Th_i\bigr)^2,$$

so $x^TQx = 0$ iff $x \perp h_i$ for $i = 1,\dots,m$, i.e. iff $x \in \mathrm{span}(h_{m+1},\dots,h_k)$. We consider the linearly independent vectors $z_1 = G^{1/2}D_\ell e_1,\dots,z_l = G^{1/2}D_\ell e_l$, where $e_j = (0,\dots,1_j,\dots,0)^T \in \mathbb{R}^{l_\ell}$ is the $j$-th unit vector, and obtain

$$z_j^TQz_j = e_j^T\bigl(D_\ell^TGD_\ell - D_\ell^TGD_\ell\bigl(D_\ell^TGD_\ell\bigr)^{-1}D_\ell^TGD_\ell\bigr)e_j = 0. \tag{10}$$

Since $z_1/\|z_1\|,\dots,z_l/\|z_l\| \in \mathrm{span}(h_{m+1},\dots,h_k)$ are linearly independent, these vectors form a basis of $\mathrm{span}(h_{m+1},\dots,h_k)$. Assume that $K_\ell = 0$. Then there exists $a \in \mathbb{R}^{l_\ell}$ such that

$$G^{1/2}\beta_0 = \sum_{i=1}^{l_\ell}a_iz_i = G^{1/2}D_\ell a,$$

and hence $\beta_0 = D_\ell a$, in contradiction to the assumption. This proves the lemma. $\Box$
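The appeals to the Cochran theorem, both in part 1) above and in the proof of Theorem 3.1 below, rest on an idempotency check that is worth making explicit (our computation): with

$$\bar G := G^{1/2}\bigl(G^{-1} - D_\ell\tilde G_\ell^{-1}D_\ell^T\bigr)G^{1/2} = I - G^{1/2}D_\ell\tilde G_\ell^{-1}D_\ell^TG^{1/2},$$

the subtracted matrix is a symmetric idempotent of rank $l_\ell$, so $\bar G$ is an orthogonal projection of rank $k - l_\ell$. Consequently, for $Z \sim \mathcal N(0, I_k)$, the Cochran theorem gives $Z^T\bar GZ \sim \chi^2_{k-l_\ell}$; the same argument applies with $G$ replaced by $G_n$.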
Proof of Theorem 3.1: Recall that $\tau_n(d,l) := \chi^2_{k-l,\,1-\alpha_n(d)}$ is the threshold of the selection rule. One shows easily that

$$\frac{1}{n}\,\tau_n(d,l) \to 0. \tag{11}$$

1) Let $\eta_n := \sigma^{-1}G_n^{-1/2}\xi_n$ ($\xi_n$ as in Lemma 5.4). We have

$$P\bigl(M_n(\ell) \ge \tau_n(d,l)\bigr) = P\bigl(\eta_n^T\bar G_n\eta_n \ge \tau_n(d,l)\bigr) + O\bigl(n^{-1/2}\bigr), \tag{12}$$

where $\bar G_n := G_n^{1/2}\bigl(G_n^{-1} - D_\ell\tilde G_{\ell n}^{-1}D_\ell^T\bigr)G_n^{1/2}$. Note that $\operatorname{cov}(\xi_n) = \sigma^2G_n$, which implies $\operatorname{cov}(\eta_n) = I$. Further, by the Assumption,

$$\sum_{i=1}^nE\bigl\|n^{-1/2}\sigma^{-1}G_n^{-1/2}x_i\varepsilon_i\bigr\|^3 = O\bigl(n^{-3/2}B_{n3}\bigr),$$

where $x_i$ denotes the $i$-th row of $X$. Since $\bigl\{z \in \mathbb{R}^k : z^T\bar G_nz \le a\bigr\}$ is a convex set for all $a \ge 0$, we can apply Bhattacharya's theorem on a multivariate Berry-Esseen inequality (see [27]):

$$P\bigl(\eta_n^T\bar G_n\eta_n \ge \tau_n(d,l)\bigr) = P\bigl(Z^T\bar G_nZ \ge \tau_n(d,l)\bigr) + O\bigl(n^{-3/2}B_{n3}\bigr),$$

where $Z \sim \mathcal N(0, I)$. The Cochran theorem implies that $Z^T\bar G_nZ \sim \chi^2_{k-l}$. We denote the distribution function of the $\chi^2_{k-l}$-distribution by $F_{k-l}$. Hence

$$P\bigl(Z^T\bar G_nZ \ge \tau_n(d,l)\bigr) = 1 - F_{k-l}\bigl(\tau_n(d,l)\bigr) = \alpha_n(d).$$

Combining these identities with (11) and (12), we obtain assertion 1).

2) One can show that $\tau_n(d,l_i) = O(\ln n)$. Let $\kappa_n$ be a sequence of real numbers with $\kappa_n \to \infty$ and $\kappa_n = o\bigl(n\,\tau_n(d,l_i)^{-1}\bigr)$. We deduce

$$m_2 = P\bigl(M_n(i) \le \tau_n(d_i,l_i)\ \text{for some } i:\ d_i < d_0\bigr) \le \sum_{i:\,d_i<d_0}\Bigl(P\bigl(\tilde M_n(i) \le \tau_n(d_i,l_i)\,\kappa_n\bigr) + P\Bigl(\frac{1}{n-l_i}S_i \ge \kappa_n\Bigr)\Bigr).$$

Define $K_{ni} := \beta_0^T\bigl(G_n - G_nD_i\tilde G_{in}^{-1}D_i^TG_n\bigr)\beta_0$, and let $i$ with $d_i < d_0$. Obviously, $\lim_{n\to\infty}K_{ni} = K_i$ holds true. Since $\beta_0 \notin \mathcal{B}_i$, we have $K_i > 0$ by Lemma 5.5. Furthermore, since $\tilde M_n(i) \ge nK_{ni} + \sqrt n\,W_n$ and $\tau_n(d_i,l_i)\kappa_n = o(n)$, by Lemma 5.1 we obtain

$$P\bigl(\tilde M_n(i) \le \tau_n(d_i,l_i)\,\kappa_n\bigr) \le P\bigl(\sqrt n\,|W_n| \ge nK_{ni}/2\bigr) \le P\bigl(\varepsilon^TXX^T\varepsilon \ge Cn^2\bigr) = O\bigl(B_{np}\,n^{-p}\bigr)$$

for $n$ large enough. On the other hand, we have

$$P\Bigl(\frac{1}{n-l_i}S_i \ge \kappa_n\Bigr) = O\bigl(n^{-p/2}\kappa_n^{-p/2}\bigr) + e^{-C\kappa_n^2n}$$

by Lemma 5.3. We choose $\kappa_n = n^{1/(2p)}$; then indeed $\kappa_n = o\bigl(n\,\tau_n(d,l_i)^{-1}\bigr)$. This completes the proof of the bound for $m_2$. Observe that

$$m_3 = P\bigl(M_n(i) \le \tau_n(d_i,l_i)\ \text{for some } i,\ d_i \ge d_0\bigr) \le \sum_{i:\,d_i\ge d_0}P\bigl(M_n(i) \le \tau_n(d_i,l_i)\bigr).$$

The bound for $m_3$ can now be established along the lines of the proof for $m_2$. $\Box$
REFERENCES

[1] H. Akaike, "A New Look at the Statistical Model Identification," IEEE Transactions on Automatic Control, Vol. 19, 1974, pp. 716-723. doi:10.1109/TAC.1974.1100705

[2] C. Mallows, "Some Comments on Cp," Technometrics, Vol. 15, No. 4, 1973, pp. 661-675. doi:10.2307/1267380

[3] G. Schwarz, "Estimating the Dimension of a Model," Annals of Statistics, Vol. 6, No. 2, 1978, pp. 461-464. doi:10.1214/aos/1176344136

[4] J. Rissanen, "Modeling by Shortest Data Description," Automatica, Vol. 14, No. 5, 1978, pp. 465-471. doi:10.1016/0005-1098(78)90005-5

[5] Y. Benjamini and Y. Hochberg, "Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing," Journal of the Royal Statistical Society, Series B, Vol. 57, No. 1, 1995, pp. 289-300.

[6] F. Bunea, M. H. Wegkamp and A. Auguste, "Consistent Variable Selection in High Dimensional Regression via Multiple Testing," Journal of Statistical Planning and Inference, Vol. 136, No. 12, 2006, pp. 4349-4364. doi:10.1016/j.jspi.2005.03.011

[7] Y. Benjamini and Y. Gavrilov, "A Simple Forward Selection Procedure Based on False Discovery Rate Control," Annals of Applied Statistics, Vol. 3, No. 1, 2009, pp. 179-198. doi:10.1214/08-AOAS194

[8] G. Claeskens and N. L. Hjort, "Model Selection and Model Averaging," Cambridge University Press, Cambridge, 2008.

[9] H. Leeb and B. M. Pötscher, "Model Selection," In: T. G. Andersen, et al., Eds., Handbook of Financial Time Series, Springer, Berlin, 2009, pp. 889-925. doi:10.1007/978-3-540-71297-8_39

[10] A. D. R. McQuarrie and C.-L. Tsai, "Regression and Time Series Model Selection," World Scientific, Singapore City, 1998.

[11] A. J. Miller, "Subset Selection in Regression," 2nd Edition, Chapman & Hall, New York, 2002.

[12] B. Droge, "Asymptotic Properties of Model Selection Procedures in Linear Regression," Statistics, Vol. 40, No. 1, 2006, pp. 1-38. doi:10.1080/02331880500366050

[13] R. Nishii, "Asymptotic Properties of Criteria for Selection of Variables in Multiple Regression," Annals of Statistics, Vol. 12, No. 2, 1984, pp. 758-765. doi:10.1214/aos/1176346522

[14] C. R. Rao and Y. Wu, "A Strongly Consistent Procedure for Model Selection in a Regression Problem," Biometrika, Vol. 76, No. 2, 1989, pp. 369-374. doi:10.1093/biomet/76.2.369

[15] J. Shao, "An Asymptotic Theory for Linear Model Selection," Statistica Sinica, Vol. 7, 1997, pp. 221-264.

[16] C.-Y. Sin and H. White, "Information Criteria for Selecting Possibly Misspecified Parametric Models," Journal of Econometrics, Vol. 71, No. 1-2, 1996, pp. 207-225. doi:10.1016/0304-4076(94)01701-8

[17] C. Gatu, P. I. Yanev and E. J. Kontoghiorghes, "A Graph Approach to Generate All Possible Regression Submodels," Computational Statistics & Data Analysis, Vol. 52, No. 2, 2007, pp. 799-815. doi:10.1016/j.csda.2007.02.018

[18] H. Leeb, "The Distribution of a Linear Predictor after Model Selection: Conditional Finite-Sample Distributions and Asymptotic Approximations," Journal of Statistical Planning and Inference, Vol. 134, No. 1, 2005, pp. 64-89.

[19] H. Leeb and B. M. Pötscher, "Model Selection and Inference: Facts and Fiction," Econometric Theory, Vol. 21, No. 1, 2005, pp. 21-59. doi:10.1017/S0266466605050036

[20] J. Shao, "Convergence Rates of the Generalized Information Criterion," Journal of Nonparametric Statistics, Vol. 9, No. 3, 1998, pp. 217-225. doi:10.1080/10485259808832743

[21] A. Chambaz, "Testing the Order of a Model," Annals of Statistics, Vol. 34, No. 3, 2006, pp. 1166-1203. doi:10.1214/009053606000000344

[22] D. E. Edwards and T. Havránek, "A Fast Model Selection Procedure for Large Families of Models," Journal of the American Statistical Association, Vol. 82, No. 397, 1987, pp. 205-213. doi:10.2307/2289155

[23] M. A. Efroymson, "Multiple Regression Analysis," In: A. Ralston and H. S. Wilf, Eds., Mathematical Methods for Digital Computers, John Wiley, New York, 1960.

[24] M. Hofmann, C. Gatu and E. J. Kontoghiorghes, "Efficient Algorithms for Computing the Best-Subset Regression Models for Large Scale Problems," Computational Statistics & Data Analysis, Vol. 52, No. 1, 2007, pp. 16-29. doi:10.1016/j.csda.2007.03.017

[25] E. J. Hannan and B. G. Quinn, "The Determination of the Order of an Autoregression," Journal of the Royal Statistical Society, Series B, Vol. 41, No. 2, 1979, pp. 190-195.

[26] D. Kh. Fuk and S. N. Nagaev, "Probability Inequalities for Sums of Independent Random Variables," Theory of Probability and Its Applications, Vol. 16, 1971, pp. 643-660. doi:10.1137/1116071

[27] R. N. Bhattacharya, "On Errors of Normal Approximation," Annals of Probability, Vol. 3, No. 5, 1975, pp. 815-828. doi:10.1214/aop/1176996268