Received January 27, 2012; revised February 29, 2012; accepted March 9, 2012
ABSTRACT
In this paper we consider a linear regression model with fixed design. A new rule for the selection of a relevant submodel is introduced on the basis of parameter tests. One particular feature of the rule is that subjective grading of the model complexity can be incorporated. We provide bounds for the mis-selection error. Simulations show that by using the proposed selection rule, the mis-selection error can be controlled uniformly.
methods based on information criteria, and it is different from Efroymson's algorithm of stepwise variable selection in [23].

2) We derive convergence rates for the probability of mis-selection which are better than those proved in papers about information criteria, e.g. in [20].

3) Subjective grading of the model complexity can be incorporated.

Concerning 1), we consider tests on a set of parameters, in contrast to FDR-methods, where several tests on only one parameter are applied. Moreover, w.r.t. 2), many authors do not analyse the behaviour of mis-selection probabilities. Results on bounds or convergence rates of these probabilities are more informative than consistency. Aspect 3) is of special interest from the point of view of model building. Typically, model builders have some preference rules in mind when selecting the model. They prefer simple models with linear functions to models with more complex functions (exponential or logarithmic, for example). The crucial idea is to assign to each submodel a specific complexity number.

We do not assume that the errors are normally distributed. This ensures a wide-ranging applicability of the approach, but only asymptotic distributions of the test statistics are available. From the examples in Section 2, it can be seen that applications are possible in several directions, for instance to the one-factor-ANOVA model. The simulations show an advantage of the proposed method in that it controls the frequency of mis-selection uniformly. For models with a large number of regressors, the problem of establishing an effective selection algorithm is not discussed in this paper; we refer to the paper [24].

The paper is organised as follows: In Section 2 we introduce the regression model and several versions of submodels. The asymptotic behaviour of the basic statistic is also studied there. Section 3 is devoted to the model selection method. We provide convergence rates for the probability that the procedure selects the wrong model (mis-selection). We see that the behaviour is similar to that in the case of hypothesis testing. The results of simulations are discussed in Section 4. The reader finds the proofs in Section 5.

2. Models

Let us introduce the master model

$$Y_i = \sum_{j=1}^{k} x_{ij}\beta_j + \varepsilon_i \quad\text{for } i = 1,\dots,n,$$

where $X = (x_{ij})_{i=1,\dots,n,\ j=1,\dots,k} \in \mathbb{R}^{n\times k}$ is the design matrix, $\beta = (\beta_1,\dots,\beta_k)^T$ the unknown parameter vector, $\varepsilon_1,\dots,\varepsilon_n$ the random errors and $Y$ the response variable. In short we can write

$$Y = X\beta + \varepsilon, \tag{1}$$

where $Y = (Y_1,\dots,Y_n)^T$, $\varepsilon = (\varepsilon_1,\dots,\varepsilon_n)^T$. The least squares estimator for $\beta$ is given by

$$\hat\beta = \bigl(X^TX\bigr)^{-1}X^TY.$$

This leads to the residual sum of squares

$$R_n = \bigl\|Y - X\hat\beta\bigr\|^2 = Y^T\bigl(I - X(X^TX)^{-1}X^T\bigr)Y,$$

where $\|\cdot\|$ is the Euclidean vector norm.

The aim is to select model (1) or an appropriate submodel which fits the data well. Moreover, we search for a reasonably simple model. In the following we define the submodels of (1). The submodel with index $\ell \in \{1,\dots,L\}$ has the parameter vector $\gamma = (\gamma_1,\gamma_2,\dots,\gamma_l)^T \in \mathbb{R}^l$, $l := l_\ell$, where the vector $\gamma$ is related to $\beta$ by $\beta = D_\ell\gamma$ with an appropriate matrix $D_\ell \in \mathbb{R}^{k\times l_\ell}$ having maximum rank $l_\ell \le k$. In a large number of applications, the $\gamma_i$'s coincide with different components of $\beta$. The submodel indices $\ell = 1$ and $\ell = L$ correspond to the model function equal to zero (no parameters) and to the full model, respectively. Thus we can write the model equation for the submodel as

$$Y = X_\ell\gamma + \varepsilon, \tag{2}$$

where $X_\ell = XD_\ell$. The parameter space of submodel $\ell$ in (1) is given by $\mathcal{B}_\ell := \{D_\ell\gamma : \gamma \in \mathbb{R}^{l_\ell}\}$. Next we give several versions for the definition of submodels in different situations.

Example 1. We consider all submodels where components of $\beta$ are zero. More precisely, index $\ell$ is assigned to a submodel if $\gamma_1 = \beta_{i_1},\dots,\gamma_l = \beta_{i_l}$ are the parameters of the submodel ($i_1 < i_2 < \cdots < i_l$), $\beta_j = 0$ for $j \notin J := \{i_1,\dots,i_l\}$, and $\ell = 1 + \sum_{j=1}^{l}2^{i_j-1}$. Let $e_i = (0,\dots,0,1_i,0,\dots)^T \in \mathbb{R}^k$ be the $i$-th unit vector. Then $D_\ell = (e_{i_1}, e_{i_2},\dots,e_{i_l}) \in \mathbb{R}^{k\times l}$, and $\mathcal{B}_\ell = \{\beta : \beta_j = 0 \text{ for all } j \notin J\}$. For example, for $k = 5$, the submodel with index $\ell = 14$ has the parameters $\gamma_1 = \beta_1$, $\gamma_2 = \beta_3$, $\gamma_3 = \beta_4$, and $\beta_2 = \beta_5 = 0$ holds. Moreover, we have

$$D_\ell = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{pmatrix}, \qquad \mathcal{B}_\ell = \bigl\{\beta \in \mathbb{R}^5 : \beta_2 = \beta_5 = 0\bigr\}.$$
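The indexing in Example 1 is easy to make concrete. The following sketch is ours, not the authors' (the helper names are illustrative): it maps an active set $J$ to the index $\ell = 1 + \sum_{j=1}^{l}2^{i_j-1}$ and assembles $D_\ell$ from unit vectors, reproducing $\ell = 14$ for $J = \{1,3,4\}$, $k = 5$.

```python
import numpy as np

def submodel_index(J):
    """Index of the submodel with active components J = {i_1 < ... < i_l}:
    l = 1 + sum_j 2**(i_j - 1), as in Example 1."""
    return 1 + sum(2 ** (i - 1) for i in J)

def selection_matrix(J, k):
    """D_l in R^{k x l}: the columns are the unit vectors e_{i_1}, ..., e_{i_l},
    so that beta = D_l @ gamma embeds the submodel parameters into R^k."""
    D = np.zeros((k, len(J)), dtype=int)
    for col, i in enumerate(sorted(J)):
        D[i - 1, col] = 1
    return D

print(submodel_index({1, 3, 4}))        # 14, as in the example
print(selection_matrix({1, 3, 4}, 5))   # 5 x 3 matrix; rows 2 and 5 are zero
```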
… $\mathcal{B}_\ell = \{\beta : \beta_j = \beta_k \text{ if } j, k \in J_i \text{ for some } i\}$. The submodel with index $\ell$ has $l_\ell$ parameters. Furthermore, $D_\ell = (d_{ij})_{i=1,\dots,k,\ j=1,\dots,l_\ell}$, where

$$d_{ij} = \begin{cases} 1 & \text{for } i \in J_j, \\ 0 & \text{otherwise.} \end{cases}$$

Example 3 shows that the model selection problem occurs also in the context of ANOVA.
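The matrix $D_\ell$ of this ANOVA-type construction is a plain 0-1 group-membership matrix; a short sketch (ours, with the grouping $J_1,\dots,J_{l_\ell}$ as an assumed input) renders the definition of $d_{ij}$ directly:

```python
import numpy as np

def anova_selection_matrix(groups, k):
    """D_l = (d_ij) with d_ij = 1 if level i belongs to group J_j, else 0.
    Components of beta sharing a group are forced to a common value gamma_j."""
    D = np.zeros((k, len(groups)), dtype=int)
    for j, J_j in enumerate(groups):
        for i in J_j:
            D[i - 1, j] = 1
    return D

# k = 4 factor levels; submodel claiming beta_1 = beta_2 and beta_3 = beta_4:
print(anova_selection_matrix([{1, 2}, {3, 4}], k=4))
```

With $\beta = D_\ell\gamma$, the group means $\gamma_1, \gamma_2$ are the free parameters, which is exactly the constraint set $\{\beta : \beta_j = \beta_k \text{ if } j, k \in J_i \text{ for some } i\}$ above.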
In submodel (2) with index $\ell$, the least squares estimator $\hat\gamma_\ell$ and the residual sum of squares $S_\ell$ are given by

$$\hat\gamma_\ell = \bigl(X_\ell^TX_\ell\bigr)^{-1}X_\ell^TY, \qquad S_\ell = \bigl\|Y - X_\ell\hat\gamma_\ell\bigr\|^2 = Y^T\bigl(I - X_\ell(X_\ell^TX_\ell)^{-1}X_\ell^T\bigr)Y. \tag{3}$$

What is an appropriate statistic for model selection? Let $\tilde M_n(\ell) := S_\ell - R_n$. Here we consider a quantity $M_n(\ell)$ which is similar to the F-statistics known from hypothesis testing in linear regression models with normal errors:

$$M_n(\ell) = \frac{\tilde M_n(\ell)}{\frac{1}{n-l_\ell}S_\ell} = \frac{\bigl(\hat\beta - D_\ell\hat\gamma_\ell\bigr)^TX^TX\bigl(\hat\beta - D_\ell\hat\gamma_\ell\bigr)}{\frac{1}{n-l_\ell}S_\ell} \tag{4}$$

for $\ell < L$, and $M_n(L) := 0$. The main difference to classical F-statistics is that the estimator $\frac{1}{n-l_\ell}S_\ell$ of the model variance in submodel $\ell$ is used.

In a wide range of applications, the entries $x_{ij}$ of the design matrix are uniformly bounded, such that $B_{np} = O(n) = o(n^{p/2})$ since $p > 2$. The Assumption may be weakened in some ways, but we use this assumption to reduce the technical effort. We introduce

$$\tilde G_\ell := D_\ell^TGD_\ell \in \mathbb{R}^{l_\ell\times l_\ell} \qquad\text{and}\qquad K_\ell := \beta_0^TG\bigl(I - D_\ell\tilde G_\ell^{-1}D_\ell^TG\bigr)\beta_0,$$

where $G := \lim_{n\to\infty}G_n$ with $G_n := \frac{1}{n}X^TX$. Proposition 2.1 clarifies the asymptotic behaviour of the statistic $M_n(\ell)$.

Proposition 2.1. Assume that the Assumption is satisfied.

1) Assume that $\beta_0 \in \mathcal{B}_\ell$ and $l_\ell < k$. Then we have $M_n(\ell) \xrightarrow{d} \chi^2_{k-l_\ell}$.

2) Suppose that $\beta_0 \notin \mathcal{B}_\ell$ and $l_\ell < k$. Let $\|G_n - G\| = o(n^{-1/2})$ be satisfied ($n \to \infty$). Then we have

$$M_n(\ell) = \frac{1}{\sigma^2 + K_\ell}\bigl(K_\ell\,n + \sqrt n\,W_n\bigr) + o_P(\sqrt n), \qquad W_n \xrightarrow{d} \mathcal N\bigl(0, \sigma_W^2\bigr), \quad \sigma_W^2 = 4\sigma^2K_\ell.$$

Depending on whether the true parameter $\beta_0$ belongs to submodel $\ell$ or not, the statistic $M_n(\ell)$ has a different asymptotic behaviour. In the first case, it has an asymptotic $\chi^2$-distribution. In the second case it tends to infinity in probability with rate $n$. Therefore, the statistic $M_n(\ell)$ is suitable for model selection. In the next section a selection procedure is introduced, based on $M_n(\ell)$ serving as fundamental statistic.
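Proposition 2.1(1) can be probed numerically. The sketch below is ours (normal errors are used only to drive the simulation; the statistic itself does not assume them): it evaluates $M_n(\ell)$ exactly as in (4) and checks that its sample mean approaches $k - l_\ell$, the mean of the limiting $\chi^2_{k-l_\ell}$ distribution, when $\beta_0$ lies in the submodel.

```python
import numpy as np

def rss(X, Y):
    """Residual sum of squares of the least squares fit on design X."""
    beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    r = Y - X @ beta_hat
    return r @ r

def M_n(X, Y, D):
    """Basic statistic M_n(l) = (S_l - R_n) / ((n - l_l)^{-1} S_l), cf. (4)."""
    n = X.shape[0]
    l = D.shape[1]
    S_l = rss(X @ D, Y)   # RSS in the submodel with design X D_l
    R_n = rss(X, Y)       # RSS in the full model
    return (S_l - R_n) / (S_l / (n - l))

rng = np.random.default_rng(0)
n, k = 400, 4
x = np.arange(1, n + 1) / n
X = np.column_stack([x**j for j in range(k)])   # cubic design: 1, x, x^2, x^3
D = np.eye(k)[:, :2]                            # submodel: beta_3 = beta_4 = 0
beta0 = D @ np.array([1.0, 2.0])                # true parameter inside the submodel

stats = [M_n(X, X @ beta0 + 0.2 * rng.standard_normal(n), D) for _ in range(2000)]
print(np.mean(stats))   # close to k - l_l = 2, the mean of chi^2_2
```

Moving $\beta_0$ outside $\mathcal{B}_\ell$ makes the printed value grow linearly in $n$, which is the rate-$n$ divergence of part 2).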
3. The New Selection Rule

In this section we propose a selection rule which is based …

… $m_2 = O\bigl(B_{np}\,n^{-p}\bigr)$ and $m_3 = O\bigl(B_{np}\,n^{-p}\bigr)$.

The probability of mis-selection (PMS) of case (m1) behaves like a type-1-error in a statistical test. It approaches asymptotically $\alpha_n(d_0)$ under the assumptions of part 1). The additional term with rate $O(n^{-3/2}B_{n3})$ comes from the application of the central limit theorem, and has rate $O(n^{-1/2})$ in the case where the $x_{ij}$'s are uniformly bounded. This theorem shows that the PMS of cases (m2) and (m3) tends to zero at rate $O(n^{1-p})$, provided that the $x_{ij}$'s are uniformly bounded and $\alpha_n(d) \ge Cn^{-a}$ for all $d$ and some $a > 0$. These rates of PMS are rather fast. They are better than in comparable cases in [20] ($\alpha_n$ and the corresponding tuning sequence in [20] can be considered to have the same rate). One reason is that in this paper alternative techniques, such as the Fuk-Nagaev inequality, are employed to obtain the convergence rates. The results of Theorem 3.1 recommend the selection rule above from the theoretical point of view. The behaviour in practice is discussed in the next section.

4. Simulations

Here we consider the polynomial model

$$Y_{ni} = \beta_1 + \beta_2x_i + \beta_3x_i^2 + \beta_4x_i^3 + \varepsilon_i \quad\text{for } i = 1,\dots,n,$$

where $x_1,\dots,x_n \in [0,1]$ are the observations of the regressor variable, and the $\varepsilon_i$'s are i.i.d. random variables. For simplicity, we consider the case $x_i = i/n$. The complexity $d_\ell$ is measured as given in Example 4(b). We compare the selection method of the previous section with procedures based on Schwarz's Bayesian information criterion (BIC, see [3]) and the Hannan-Quinn criterion (HQIC, see [25]). Tables 1-3 show the frequencies of mis-selection (FM). The results are based on $10^6$ replications of the model.

Table 1. FM in percent in the case $n = 100$.

| β1 | β2 | β3 | β4 | FM new meth. | FM BIC | FM HQIC |
| --- | --- | --- | --- | --- | --- | --- |
| 0.377 | 3.38 | 7.65 | 100 | 1.936 | 3.490 | 3.432 |
| 0 | 0 | 0 | 0 | 2.102 | 4.008 | 4.060 |
| 0.38872 | 3.39 | 7.8987 | 5.1754 | 1.825 | 5.269 | 5.178 |
| –0.38872 | 3.39 | 7.8987 | 5.1754 | 1.830 | 6.309 | 6.198 |
| 0.38872 | –3.39 | 7.8987 | 5.1754 | 1.893 | 5.297 | 5.213 |
| 0.38872 | 3.39 | –7.8987 | 5.1754 | 1.873 | 8.039 | 7.900 |
| 0.38872 | 3.39 | 7.8987 | –5.1754 | 1.897 | 6.452 | 6.347 |
| –0.38872 | –3.39 | 7.8987 | 5.1754 | 1.893 | 5.297 | 5.213 |
| –0.38872 | 3.39 | –7.8987 | 5.1754 | 1.864 | 14.207 | 13.95 |
| –0.38872 | 3.39 | 7.8987 | –5.1754 | 2.029 | 6.736 | 6.622 |

Table 2. FM in percent for different error distributions.

| n | β1 | β2 | β3 | β4 | εi ~ | FM new meth. | FM BIC | FM HQIC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 100 | 0 | 0 | 0 | 0 | σ·t(3) | 1.735 | 3.895 | 3.951 |
| 100 | 0.468 | 4.104 | 9.516 | 6.264 | σ·t(3) | 2.043 | 2.966 | 2.943 |
| 400 | 0 | 0 | 0 | 0 | N(0, σ²) | 0.956 | 1.780 | 2.502 |
| 400 | 0 | 0 | 0 | 0 | σ·t(3) | 0.942 | 1.736 | 2.459 |
| 400 | –0.216 | 1.873 | 4.365 | –2.863 | N(0, σ²) | 1.122 | 5.869 | 3.905 |
| 400 | –0.216 | 1.873 | 4.365 | –2.863 | σ·t(3) | 3.067 | 8.164 | 6.275 |

We choose the following error probabilities: $\alpha_n(1) = 0.02$, $\alpha_n(2) = 0.022$, $\alpha_n(3) = 0.024$, $\alpha_n(4) = 0.026$ in the case $n = 100$, and $\alpha_n(i) = 0.01$ in the case $n = 400$.

The selection rule of the previous section always gives FM-values near the given values of $\alpha_n$. The methods based on BIC and HQIC partially show FM-values also near these $\alpha_n$, but in some special cases the FM-values are much higher (for example, for $\beta_1 = 0.38872$, $\beta_2 = 3.39$, $\beta_3 = 7.8987$, $\beta_4 = 5.1754$ according to Table 1; $\beta_1 = 0.2569$, $\beta_2 = 2.227$, $\beta_3 = 5.197$, $\beta_4 = 3.405$ according to Table 3). By our method we are able to control the FM-values by choosing an appropriate $\alpha_n$.
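This comparison can be replayed in a few lines. The sketch below is ours and implements one plausible reading of the rule of Section 3 — the full wording of the rule is not reproduced above, but the proof of Theorem 3.1 compares $M_n(i)$ with the $(1-\alpha_n(d_i))$-quantile of $\chi^2_{k-l_i}$ — together with a textbook Gaussian BIC used only as the comparison baseline.

```python
import numpy as np
from scipy import stats

def rss(X, Y):
    """Residual sum of squares of the least squares fit on design X."""
    beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    r = Y - X @ beta_hat
    return r @ r

def select(X, Y, submodels, alpha):
    """One reading of the Section 3 rule: accept submodel i when M_n(i) stays
    below the (1 - alpha(d_i))-quantile of chi^2_{k - l_i}; among the accepted
    submodels return one of minimal complexity d_i. The list `submodels`
    should include the full model (D = I_k), which is always accepted."""
    n, k = X.shape
    R_n = rss(X, Y)
    accepted = []
    for d_i, D in submodels:                 # pairs (complexity, D_i)
        l = D.shape[1]
        S_i = rss(X @ D, Y)
        M = (S_i - R_n) / (S_i / (n - l))    # statistic (4); M_n(L) = 0
        if l == k or M <= stats.chi2.ppf(1 - alpha(d_i), k - l):
            accepted.append((d_i, D))
    return min(accepted, key=lambda t: t[0])

def bic(X, Y, D):
    """Standard Gaussian BIC, n*log(S_i/n) + l_i*log(n) (comparison only)."""
    n = X.shape[0]
    return n * np.log(rss(X @ D, Y) / n) + D.shape[1] * np.log(n)
```

Run over the cubic design with $\alpha_n(i) = 0.01$, $\sigma = 0.2$ and $t(3)$ errors, the FM frequencies of this rule should land near the "FM new method" column of Table 3, up to Monte Carlo error and up to our reading of the rule.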
Table 3. FM in percent in the case $n = 400$, $\sigma = 0.2$, $\varepsilon_i \sim \sigma\,t(3)$.

| β1 | β2 | β3 | β4 | FM new method | FM BIC | FM HQIC |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 0 | 0.942 | 1.736 | 2.459 |
| 0.2569 | 2.227 | 5.197 | 3.405 | 1.003 | 1.990 | 1.596 |
| –0.2569 | 2.227 | 5.197 | 3.405 | 0.958 | 2.086 | 1.657 |
| 0.2569 | –2.227 | 5.197 | 3.405 | 0.984 | 2.169 | 1.728 |
| 0.2569 | 2.227 | –5.197 | 3.405 | 0.945 | 2.575 | 2.004 |
| 0.2569 | 2.227 | 5.197 | –3.405 | 0.987 | 2.141 | 1.690 |
| –0.2569 | –2.227 | 5.197 | 3.405 | 1.011 | 2.005 | 1.606 |
| –0.2569 | 2.227 | –5.197 | 3.405 | 0.798 | 3.567 | 2.652 |
| –0.2569 | 2.227 | 5.197 | –3.405 | 1.015 | 2.299 | 1.823 |
| 0 | 100 | 100 | 100 | 1.200 | 1.064 | 1.427 |
| 0.217 | 100 | 100 | 100 | 0.983 | 1.059 | 0.890 |
| 100 | 0 | 100 | 100 | 1.004 | 0.873 | 1.217 |
| 100 | 1.89 | 100 | 100 | 0.969 | 1.046 | 0.868 |
| 100 | 100 | 0 | 100 | 0.984 | 0.844 | 1.202 |
| 100 | 100 | 4.38 | 100 | 0.976 | 1.059 | 0.876 |
| 100 | 100 | 100 | 0 | 0.948 | 0.817 | 1.168 |
| 100 | 100 | 100 | 2.87 | 0.992 | 1.070 | 0.887 |
| 100 | 0 | 0 | 100 | 0.973 | 1.0706 | 1.525 |
| 100 | 2.08 | 4.84 | 100 | 1.010 | 1.084 | 0.914 |
| 100 | 2.08 | –4.84 | 100 | 0.868 | 2.2384 | 1.741 |

5. Proofs

By $C$ we denote a positive generic constant which can vary from place to place and does not depend on other variates. Throughout this section, we assume that the Assumption is fulfilled. In the following we prove auxiliary statements which are used later in the proofs of the theorems.

Lemma 5.1. The inequality

$$P\bigl(\varepsilon^TXX^T\varepsilon \ge n^2\tau\bigr) \le C_1\bigl(B_{np}\,n^{-p}\tau^{-p/2} + e^{-C_0\tau n}\bigr)$$

holds for all $\tau > 0$, where $C_0, C_1 > 0$ are constants not depending on $\tau$, $n$.

Proof: Obviously,

$$\varepsilon^TXX^T\varepsilon = \sum_{j=1}^k\Bigl(\sum_{i=1}^nx_{ij}\varepsilon_i\Bigr)^2,$$

hence

$$P\bigl(\varepsilon^TXX^T\varepsilon \ge n^2\tau\bigr) \le \sum_{j=1}^kP\Bigl(\Bigl|\sum_{i=1}^nx_{ij}\varepsilon_i\Bigr| \ge n\sqrt{\tau/k}\Bigr),$$

and $\sum_{i=1}^nx_{ij}^2 \le n\,\mathrm{Trace}(G_n) = O(n)$. Applying the Fuk-Nagaev inequality (see [26]), we obtain the assertion of the lemma:

$$P\bigl(\varepsilon^TXX^T\varepsilon \ge n^2\tau\bigr) \le \sum_{j=1}^k\Biggl(C\,\tau^{-p/2}n^{-p}\sum_{i=1}^n|x_{ij}|^p\,E|\varepsilon_i|^p + \exp\Bigl(-\frac{C\,\tau n^2}{\sum_{i=1}^nx_{ij}^2}\Bigr)\Biggr) \le C\bigl(B_{np}\,n^{-p}\tau^{-p/2} + e^{-C\tau n}\bigr). \quad\Box$$

Lemma 5.2. Assume that $\beta_0 \in \mathcal{B}_\ell$ for some $\ell \in \{1,\dots,L\}$. Then

$$P\Bigl(\Bigl|\frac{1}{n-l_\ell}S_\ell - \sigma^2\Bigr| \ge \tau\Bigr) \le C_2\,n^{-p/2}\tau^{-p/2} + e^{-C_3n\tau^2}$$

holds for $8l_\ell\sigma^2/n \le \tau \le 1$ and $n$ large enough, where $C_2, C_3 > 0$ are constants not depending on $\tau$, $n$. The same upper bound holds for $P\bigl(\bigl|\frac{1}{n-l_\ell}\varepsilon^T\varepsilon - \sigma^2\bigr| \ge \tau\bigr)$.

Proof: Observe that

$$S_\ell = \varepsilon^T\bigl(I - X_\ell(X_\ell^TX_\ell)^{-1}X_\ell^T\bigr)\varepsilon$$

by (3). Hence

$$\Bigl|\frac{1}{n-l_\ell}S_\ell - \sigma^2\Bigr| \le \Bigl|\frac{1}{n-l_\ell}\varepsilon^T\varepsilon - \sigma^2\Bigr| + \frac{1}{n-l_\ell}\,\varepsilon^TX_\ell\bigl(X_\ell^TX_\ell\bigr)^{-1}X_\ell^T\varepsilon. \tag{5}$$

Further, an application of the Fuk-Nagaev inequality from [26] leads to

$$P\Bigl(\Bigl|\frac{1}{n-l_\ell}\varepsilon^T\varepsilon - \sigma^2\Bigr| \ge \frac{\tau}{2}\Bigr) \le P\Bigl(\Bigl|\sum_{i=1}^n\bigl(\varepsilon_i^2 - \sigma^2\bigr)\Bigr| \ge \frac{n\tau}{8}\Bigr) \le C\,n^{-p/2}\tau^{-p/2} + e^{-Cn\tau^2} \tag{6}$$

for $\tau \ge 8l_\ell\sigma^2/n$ and $n \ge 2l_\ell$. Since $X_\ell(X_\ell^TX_\ell)^{-1}X_\ell^T = \frac{1}{n}X\bar D_\ell X^T$ with $\bar D_\ell := D_\ell\tilde G_{\ell n}^{-1}D_\ell^T \in \mathbb{R}^{k\times k}$, $\tilde G_{\ell n} := D_\ell^TG_nD_\ell$, and therefore $\bar D_\ell$ has a bounded norm, we deduce

$$P\Bigl(\frac{1}{n-l_\ell}\,\varepsilon^TX_\ell\bigl(X_\ell^TX_\ell\bigr)^{-1}X_\ell^T\varepsilon \ge \frac{\tau}{2}\Bigr) \le P\bigl(\varepsilon^TXX^T\varepsilon \ge Cn^2\tau\bigr) \le C\bigl(B_{np}\,n^{-p}\tau^{-p/2} + e^{-C\tau n}\bigr) \tag{7}$$

by Lemma 5.1 for $n$ large enough. A combination of Inequalities (5)-(7) yields the lemma. $\Box$
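Both steps of this proof, and the proof of Lemma 5.1, invoke the Fuk-Nagaev inequality. Since [26] states it in greater generality, it may help to record the corollary form that appears to be in use here (our paraphrase, not a quotation of [26]; $C_p, c_p > 0$ depend only on $p$): for independent centered random variables $X_1,\dots,X_n$ with $E|X_i|^p < \infty$, $p > 2$,

$$P\Bigl(\Bigl|\sum_{i=1}^nX_i\Bigr| \ge t\Bigr) \;\le\; C_p\,t^{-p}\sum_{i=1}^nE|X_i|^p \;+\; 2\exp\Bigl(-\frac{c_p\,t^2}{\sum_{i=1}^nEX_i^2}\Bigr).$$

Applied with $X_i = x_{ij}\varepsilon_i$ and $t = n\sqrt{\tau/k}$, the moment term produces the polynomial part involving $B_{np}$, while $\sum_ix_{ij}^2 = O(n)$ turns the exponential term into $e^{-C\tau n}$, exactly as in Lemma 5.1.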
Lemma 5.3. Assume that $\beta_0 \notin \mathcal{B}_\ell$ for some $\ell \in \{1,\dots,L\}$.

1) Then $S_\ell = S_{\ell,1} + 2S_{\ell,2} + S_{\ell,3}$, where

$$S_{\ell,1} = \varepsilon^T\bigl(I - X_\ell(X_\ell^TX_\ell)^{-1}X_\ell^T\bigr)\varepsilon, \quad S_{\ell,2} = \beta_0^T\bigl(I - G_nD_\ell\tilde G_{\ell n}^{-1}D_\ell^T\bigr)X^T\varepsilon, \quad S_{\ell,3} = n\,\beta_0^T\bigl(G_n - G_nD_\ell\tilde G_{\ell n}^{-1}D_\ell^TG_n\bigr)\beta_0,$$

with $G_n := \frac{1}{n}X^TX$ and $\tilde G_{\ell n} = D_\ell^TG_nD_\ell$ as before.

2) The inequality

$$P\Bigl(\Bigl|\frac{1}{n-l_\ell}\bigl(S_\ell - S_{\ell,3}\bigr) - \sigma^2\Bigr| \ge \tau\Bigr) \le C_4\,n^{-p/2}\tau^{-p/2} + e^{-C_5n\tau^2}$$

holds for $\tau > 0$ and $n$ large enough, with constants $C_4, C_5 > 0$ not depending on $n$, $\tau$.

Proof: Part 1) is a consequence of (3) and $Y = X\beta_0 + \varepsilon$.

2) Using Lemmas 5.1 and 5.2, we deduce

$$\Bigl|\frac{1}{n-l_\ell}\bigl(S_\ell - S_{\ell,3}\bigr) - \sigma^2\Bigr| \le \Bigl|\frac{1}{n-l_\ell}S_{\ell,1} - \sigma^2\Bigr| + \frac{2}{n-l_\ell}\bigl|S_{\ell,2}\bigr|,$$

where the first term on the right-hand side is treated by Lemma 5.2. Moreover, $|S_{\ell,2}| \le C\|\beta_0\|\bigl(\varepsilon^TXX^T\varepsilon\bigr)^{1/2}$, and therefore

$$P\Bigl(\frac{2}{n-l_\ell}\bigl|S_{\ell,2}\bigr| \ge \frac{\tau}{2}\Bigr) \le P\bigl(\varepsilon^TXX^T\varepsilon \ge Cn^2\tau^2\bigr) \le C\bigl(B_{np}\,n^{-p}\tau^{-p} + e^{-C\tau^2n}\bigr)$$

by Lemma 5.1 for $n \ge 2l_\ell$ large enough. This implies assertion 2) of the lemma. $\Box$
Proof of Proposition 2.1: 1) We have

$$\tilde M_n(\ell) = Y^TX(X^TX)^{-1}X^TY - Y^TX_\ell(X_\ell^TX_\ell)^{-1}X_\ell^TY = \zeta_n^T\bigl(G_n^{-1} - D_\ell\tilde G_{\ell n}^{-1}D_\ell^T\bigr)\zeta_n, \qquad \zeta_n := n^{-1/2}X^TY.$$

Under $\beta_0 \in \mathcal{B}_\ell$, the quadratic form reduces to $\xi_n^T\bigl(G_n^{-1} - D_\ell\tilde G_{\ell n}^{-1}D_\ell^T\bigr)\xi_n$ with $\xi_n := n^{-1/2}X^T\varepsilon$. Moreover, the identity

$$\lim_{n\to\infty}\bigl(G_n^{-1} - D_\ell\tilde G_{\ell n}^{-1}D_\ell^T\bigr) = G^{-1} - D_\ell\tilde G_\ell^{-1}D_\ell^T \tag{9}$$

holds in view of the Assumption. An application of Lemma 5.4 and the Cochran theorem leads to $\frac{1}{\sigma^2}\tilde M_n(\ell) \xrightarrow{d} \chi^2_{k-l_\ell}$. Lemma 5.2 implies $\frac{1}{n-l_\ell}S_\ell \xrightarrow{P} \sigma^2$, and therefore assertion 1) of Proposition 2.1 is proved.

2) Let $\beta_0 \notin \mathcal{B}_\ell$, and

$$W_n := \frac{2}{\sqrt n}\,\beta_0^T\bigl(I - G_nD_\ell\tilde G_{\ell n}^{-1}D_\ell^T\bigr)X^T\varepsilon.$$

By assumption, $\|G_n - G\| = o(n^{-1/2})$, so

$$G_nD_\ell\tilde G_{\ell n}^{-1}D_\ell^T - GD_\ell\tilde G_\ell^{-1}D_\ell^T = o\bigl(n^{-1/2}\bigr)$$

holds true. We derive

$$\tilde M_n(\ell) = n\,\beta_0^T\bigl(G_n - G_nD_\ell\tilde G_{\ell n}^{-1}D_\ell^TG_n\bigr)\beta_0 + \sqrt n\,W_n + \xi_n^T\bigl(G_n^{-1} - D_\ell\tilde G_{\ell n}^{-1}D_\ell^T\bigr)\xi_n = nK_\ell + \sqrt n\,W_n + o_P(\sqrt n).$$

From Lemma 5.4, (9) and $G_n \to G$, it follows that $W_n \xrightarrow{d} \mathcal N(0, \sigma_0^2)$, where

$$\sigma_0^2 := 4\sigma^2\,\beta_0^TG\bigl(I - D_\ell\tilde G_\ell^{-1}D_\ell^TG\bigr)\beta_0 = 4\sigma^2K_\ell.$$

Moreover, we deduce $S_{\ell,2} = \frac{\sqrt n}{2}W_n = o_P(n)$ and, using Lemma 5.2, $\frac{1}{n-l_\ell}S_\ell = \sigma^2 + K_\ell + o_P(1)$. Together these relations imply assertion 2). $\Box$

Lemma 5.5. If $\beta_0 \notin \mathcal{B}_\ell$, then $K_\ell > 0$.

Proof: Write $K_\ell = \beta_0^TG^{1/2}QG^{1/2}\beta_0$, where $Q := I - G^{1/2}D_\ell\tilde G_\ell^{-1}D_\ell^TG^{1/2}$ is a symmetric projection matrix. Hence

$$Q = \sum_{i=1}^m h_ih_i^T = HDH^T, \qquad D = \mathrm{diag}\bigl(1_1,\dots,1_m,\,0_{m+1},\dots,0_k\bigr),$$

where $m := \mathrm{rank}(Q) = k - l_\ell$ and $h_1,\dots,h_m$ are the first $m$ columns of the orthogonal matrix $H \in \mathbb{R}^{k\times k}$. For $x \in \mathbb{R}^k$,

$$x^TQx = \sum_{i=1}^m\bigl(x^Th_i\bigr)^2,$$

so $x^TQx = 0$ iff $x \perp h_i$ for $i = 1,\dots,m$, i.e. iff $x \in \mathrm{span}(h_{m+1},\dots,h_k)$. We consider the linearly independent vectors $z_1 = G^{1/2}D_\ell e_1,\dots,z_l = G^{1/2}D_\ell e_l$, where $e_j = (0,\dots,1_j,\dots,0)^T \in \mathbb{R}^{l_\ell}$ is the $j$-th unit vector, and obtain

$$z_j^TQz_j = e_j^T\bigl(D_\ell^TGD_\ell - D_\ell^TGD_\ell\bigl(D_\ell^TGD_\ell\bigr)^{-1}D_\ell^TGD_\ell\bigr)e_j = 0. \tag{10}$$

Since $z_1/\|z_1\|,\dots,z_l/\|z_l\| \in \mathrm{span}(h_{m+1},\dots,h_k)$ are linearly independent, these vectors form a basis of $\mathrm{span}(h_{m+1},\dots,h_k)$. Assume that $K_\ell = 0$. Then there exists $a \in \mathbb{R}^{l_\ell}$ such that

$$G^{1/2}\beta_0 = \sum_{i=1}^{l_\ell}a_iz_i = G^{1/2}D_\ell a,$$

and hence $\beta_0 = D_\ell a$, in contradiction to the assumption. This proves the lemma. $\Box$
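The appeals to the Cochran theorem, both in part 1) above and in the proof of Theorem 3.1 below, rest on an idempotency check that is worth making explicit (our computation): with

$$\bar G := G^{1/2}\bigl(G^{-1} - D_\ell\tilde G_\ell^{-1}D_\ell^T\bigr)G^{1/2} = I - G^{1/2}D_\ell\tilde G_\ell^{-1}D_\ell^TG^{1/2},$$

the subtracted matrix is a symmetric idempotent of rank $l_\ell$, so $\bar G$ is an orthogonal projection of rank $k - l_\ell$. Consequently, for $Z \sim \mathcal N(0, I_k)$, the Cochran theorem gives $Z^T\bar GZ \sim \chi^2_{k-l_\ell}$; the same argument applies with $G$ replaced by $G_n$.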
Proof of Theorem 3.1: Recall that $\tau_n(d,l) := \chi^2_{k-l,\,1-\alpha_n(d)}$ is the threshold of the selection rule. One shows easily that

$$\frac{1}{n}\,\tau_n(d,l) \to 0. \tag{11}$$

1) Let $\eta_n := \sigma^{-1}G_n^{-1/2}\xi_n$ ($\xi_n$ as in Lemma 5.4). We have

$$P\bigl(M_n(\ell) \ge \tau_n(d,l)\bigr) = P\bigl(\eta_n^T\bar G_n\eta_n \ge \tau_n(d,l)\bigr) + O\bigl(n^{-1/2}\bigr), \tag{12}$$

where $\bar G_n := G_n^{1/2}\bigl(G_n^{-1} - D_\ell\tilde G_{\ell n}^{-1}D_\ell^T\bigr)G_n^{1/2}$. Note that $\operatorname{cov}(\xi_n) = \sigma^2G_n$, which implies $\operatorname{cov}(\eta_n) = I$. Further, by the Assumption,

$$\sum_{i=1}^nE\bigl\|n^{-1/2}\sigma^{-1}G_n^{-1/2}x_i\varepsilon_i\bigr\|^3 = O\bigl(n^{-3/2}B_{n3}\bigr),$$

where $x_i$ denotes the $i$-th row of $X$. Since $\bigl\{z \in \mathbb{R}^k : z^T\bar G_nz \le a\bigr\}$ is a convex set for all $a \ge 0$, we can apply Bhattacharya's theorem on a multivariate Berry-Esseen inequality (see [27]):

$$P\bigl(\eta_n^T\bar G_n\eta_n \ge \tau_n(d,l)\bigr) = P\bigl(Z^T\bar G_nZ \ge \tau_n(d,l)\bigr) + O\bigl(n^{-3/2}B_{n3}\bigr),$$

where $Z \sim \mathcal N(0, I)$. The Cochran theorem implies that $Z^T\bar G_nZ \sim \chi^2_{k-l}$. We denote the distribution function of the $\chi^2_{k-l}$-distribution by $F_{k-l}$. Hence

$$P\bigl(Z^T\bar G_nZ \ge \tau_n(d,l)\bigr) = 1 - F_{k-l}\bigl(\tau_n(d,l)\bigr) = \alpha_n(d).$$

Combining these identities with (11) and (12), we obtain assertion 1).

2) One can show that $\tau_n(d,l_i) = O(\ln n)$. Let $\kappa_n$ be a sequence of real numbers with $\kappa_n \to \infty$ and $\kappa_n = o\bigl(n\,\tau_n(d,l_i)^{-1}\bigr)$. We deduce

$$m_2 = P\bigl(M_n(i) \le \tau_n(d_i,l_i)\ \text{for some } i:\ d_i < d_0\bigr) \le \sum_{i:\,d_i<d_0}\Bigl(P\bigl(\tilde M_n(i) \le \tau_n(d_i,l_i)\,\kappa_n\bigr) + P\Bigl(\frac{1}{n-l_i}S_i \ge \kappa_n\Bigr)\Bigr).$$

Define $K_{ni} := \beta_0^T\bigl(G_n - G_nD_i\tilde G_{in}^{-1}D_i^TG_n\bigr)\beta_0$, and let $i$ with $d_i < d_0$. Obviously, $\lim_{n\to\infty}K_{ni} = K_i$ holds true. Since $\beta_0 \notin \mathcal{B}_i$, we have $K_i > 0$ by Lemma 5.5. Furthermore, since $\tilde M_n(i) \ge nK_{ni} + \sqrt n\,W_n$ and $\tau_n(d_i,l_i)\kappa_n = o(n)$, by Lemma 5.1 we obtain

$$P\bigl(\tilde M_n(i) \le \tau_n(d_i,l_i)\,\kappa_n\bigr) \le P\bigl(\sqrt n\,|W_n| \ge nK_{ni}/2\bigr) \le P\bigl(\varepsilon^TXX^T\varepsilon \ge Cn^2\bigr) = O\bigl(B_{np}\,n^{-p}\bigr)$$

for $n$ large enough. On the other hand, we have

$$P\Bigl(\frac{1}{n-l_i}S_i \ge \kappa_n\Bigr) = O\bigl(n^{-p/2}\kappa_n^{-p/2}\bigr) + e^{-C\kappa_n^2n}$$

by Lemma 5.3. We choose $\kappa_n = n^{1/(2p)}$; then indeed $\kappa_n = o\bigl(n\,\tau_n(d,l_i)^{-1}\bigr)$. This completes the proof of the bound for $m_2$. Observe that

$$m_3 = P\bigl(M_n(i) \le \tau_n(d_i,l_i)\ \text{for some } i,\ d_i \ge d_0\bigr) \le \sum_{i:\,d_i\ge d_0}P\bigl(M_n(i) \le \tau_n(d_i,l_i)\bigr).$$

The bound for $m_3$ can now be established along the lines of the proof for $m_2$. $\Box$
REFERENCES

[1] H. Akaike, "A New Look at the Statistical Model Identification," IEEE Transactions on Automatic Control, Vol. 19, 1974, pp. 716-723. doi:10.1109/TAC.1974.1100705

[2] C. Mallows, "Some Comments on Cp," Technometrics, Vol. 15, No. 4, 1973, pp. 661-675. doi:10.2307/1267380

[3] G. Schwarz, "Estimating the Dimension of a Model," Annals of Statistics, Vol. 6, No. 2, 1978, pp. 461-464. doi:10.1214/aos/1176344136

[4] J. Rissanen, "Modeling by Shortest Data Description," Automatica, Vol. 14, No. 5, 1978, pp. 465-471. doi:10.1016/0005-1098(78)90005-5

[5] Y. Benjamini and Y. Hochberg, "Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing," Journal of the Royal Statistical Society, Series B, Vol. 57, No. 1, 1995, pp. 289-300.

[6] F. Bunea, M. H. Wegkamp and A. Auguste, "Consistent Variable Selection in High Dimensional Regression via Multiple Testing," Journal of Statistical Planning and Inference, Vol. 136, No. 12, 2006, pp. 4349-4364. doi:10.1016/j.jspi.2005.03.011

[7] Y. Benjamini and Y. Gavrilov, "A Simple Forward Selection Procedure Based on False Discovery Rate Control," Annals of Applied Statistics, Vol. 3, No. 1, 2009, pp. 179-198. doi:10.1214/08-AOAS194

[8] G. Claeskens and N. L. Hjort, "Model Selection and Model Averaging," Cambridge University Press, Cambridge, 2008.

[9] H. Leeb and B. M. Pötscher, "Model Selection," In: T. G. Andersen, et al., Eds., Handbook of Financial Time Series, Springer, Berlin, 2009, pp. 889-925. doi:10.1007/978-3-540-71297-8_39

[10] A. D. R. McQuarrie and C.-L. Tsai, "Regression and Time Series Model Selection," World Scientific, Singapore City, 1998.

[11] A. J. Miller, "Subset Selection in Regression," 2nd Edition, Chapman & Hall, New York, 2002.

[12] B. Droge, "Asymptotic Properties of Model Selection Procedures in Linear Regression," Statistics, Vol. 40, No. 1, 2006, pp. 1-38. doi:10.1080/02331880500366050

[13] R. Nishii, "Asymptotic Properties of Criteria for Selection of Variables in Multiple Regression," Annals of Statistics, Vol. 12, No. 2, 1984, pp. 758-765. doi:10.1214/aos/1176346522

[14] C. R. Rao and Y. Wu, "A Strongly Consistent Procedure for Model Selection in a Regression Problem," Biometrika, Vol. 76, No. 2, 1989, pp. 369-374. doi:10.1093/biomet/76.2.369

[15] J. Shao, "An Asymptotic Theory for Linear Model Selection," Statistica Sinica, Vol. 7, 1997, pp. 221-264.

[16] C.-Y. Sin and H. White, "Information Criteria for Selecting Possibly Misspecified Parametric Models," Journal of Econometrics, Vol. 71, No. 1-2, 1996, pp. 207-225. doi:10.1016/0304-4076(94)01701-8

[17] C. Gatu, P. I. Yanev and E. J. Kontoghiorghes, "A Graph Approach to Generate All Possible Regression Submodels," Computational Statistics & Data Analysis, Vol. 52, No. 2, 2007, pp. 799-815. doi:10.1016/j.csda.2007.02.018

[18] H. Leeb, "The Distribution of a Linear Predictor after Model Selection: Conditional Finite-Sample Distributions and Asymptotic Approximations," Journal of Statistical Planning and Inference, Vol. 134, No. 1, 2005, pp. 64-89.

[19] H. Leeb and B. M. Pötscher, "Model Selection and Inference: Facts and Fiction," Econometric Theory, Vol. 21, No. 1, 2005, pp. 21-59. doi:10.1017/S0266466605050036

[20] J. Shao, "Convergence Rates of the Generalized Information Criterion," Journal of Nonparametric Statistics, Vol. 9, No. 3, 1998, pp. 217-225. doi:10.1080/10485259808832743

[21] A. Chambaz, "Testing the Order of a Model," Annals of Statistics, Vol. 34, No. 3, 2006, pp. 1166-1203. doi:10.1214/009053606000000344

[22] D. E. Edwards and T. Havránek, "A Fast Model Selection Procedure for Large Families of Models," Journal of the American Statistical Association, Vol. 82, No. 397, 1987, pp. 205-213. doi:10.2307/2289155

[23] M. A. Efroymson, "Multiple Regression Analysis," In: A. Ralston and H. S. Wilf, Eds., Mathematical Methods for Digital Computers, John Wiley, New York, 1960.

[24] M. Hofmann, C. Gatu and E. J. Kontoghiorghes, "Efficient Algorithms for Computing the Best-Subset Regression Models for Large Scale Problems," Computational Statistics & Data Analysis, Vol. 52, No. 1, 2007, pp. 16-29. doi:10.1016/j.csda.2007.03.017

[25] E. J. Hannan and B. G. Quinn, "The Determination of the Order of an Autoregression," Journal of the Royal Statistical Society, Series B, Vol. 41, No. 2, 1979, pp. 190-195.

[26] D. Kh. Fuk and S. N. Nagaev, "Probability Inequalities for Sums of Independent Random Variables," Theory of Probability and Its Applications, Vol. 16, 1971, pp. 643-660. doi:10.1137/1116071

[27] R. N. Bhattacharya, "On Errors of Normal Approximation," Annals of Probability, Vol. 3, No. 5, 1975, pp. 815-828. doi:10.1214/aop/1176996268