Article
A new face on two-phase sampling with calibration estimators
June 2009
Survey Methodology, June 2009
Vol. 35, No. 1, pp. 3-14
Statistics Canada, Catalogue No. 12-001-X
Abstract
This paper provides a framework for estimation by calibration in two-phase sampling designs. This work grew out of the
continuing development of generalized estimation software at Statistics Canada. An important objective in this development
is to provide a wide range of options for effective use of auxiliary information in different sampling designs. This objective
is reflected in the general methodology for two-phase designs presented in this paper.
We consider the traditional two-phase sampling design. A phase-one sample is drawn from the finite population and then a
phase-two sample is drawn as a sub-sample of the first. The study variable, whose unknown population total is to be
estimated, is observed only for the units in the phase-two sample. Arbitrary sampling designs are allowed in each phase of
sampling. Different types of auxiliary information are identified for the computation of the calibration weights at each
phase. The auxiliary variables and the study variables can be continuous or categorical.
The paper contributes to four important areas in the general context of calibration for two-phase designs:
(1) Three broad types of auxiliary information for two-phase designs are identified and used in the estimation. The
information is incorporated into the weights in two steps: a phase-one calibration and a phase-two calibration. We
discuss the composition of the appropriate auxiliary vectors for each step, and use a linearization method to arrive at the
residuals that determine the asymptotic variance of the calibration estimator.
(2) We examine the effect of alternative choices of starting weights for the calibration. The two natural choices for the
starting weights generally produce slightly different estimators. However, under certain conditions, these two estimators
have the same asymptotic variance.
(3) We re-examine variance estimation for the two-phase calibration estimator. A new procedure is proposed that can
improve significantly on the usual technique of conditioning on the phase-one sample. A simulation in section 10 serves
to validate the advantage of this new method.
(4) We compare the calibration approach with the traditional model-assisted regression technique which uses a linear
regression fit at two levels. We show that the model-assisted estimator has properties similar to a two-phase calibration
estimator.
Key Words: Auxiliary information; Two-phase regression estimator; Starting weights; Separate residual variance
estimator; Combined residual variance estimator.
1. Victor M. Estevao, Business Survey Methods Division, Statistics Canada, Ottawa, Ontario, Canada, K1A 0T6. E-mail: victor.estevao@statcan.gc.ca; Carl-Erik Särndal, professor. E-mail: carl.sarndal@rogers.com.
identical estimators, although in practice the difference is likely to be of little consequence. Resampling for two-phase variance estimation is considered in Kott and Stukel (1997).

Estevao and Särndal (2002) focus on the calibration argument and distinguish ten different ways to use all or part of the information available at the two levels. The present paper also focuses on the calibration approach. It extends earlier work by recognizing three (rather than two) types of auxiliary information, each having different characteristics.

In the regression approach, it is natural to fit two linear least squares regressions. One set of regression-predicted $y$-values is produced for $k \in s_1$ using both $x_{1k}$ and $x_{2k}$ variables as predictors; another set is produced for $k \in s_1$ using only the vector $x_{1k}$ as predictor. Both sets of predicted $y$-values, as well as the known total $\sum_U x_{1k}$, are used to build the regression-type estimator of $Y$, in the manner described in section 9.

The calibration approach is motivated by two factors: to create a set of weights that are consistent with known or estimated totals for the auxiliary variables, and to reduce the variance of the estimates made for the study variable(s). We want the weights $w_k$ in $\hat{Y}_{2P} = \sum_s w_k y_k$ to achieve consistency with the total $\sum_U x_{1k}$ known at the level of the population and/or with an (approximately) unbiased estimate, made at the level of the phase-one sample, of the unknown $\sum_U x_{2k}$. Since $y$ is observed only at the ultimate level (the phase-two sample), consistency at higher levels on important auxiliary variables will often significantly reduce the variance of $\hat{Y}_{2P} = \sum_s w_k y_k$. We can distinguish two steps in the process leading to the weights $w_k$: a phase-one calibration and a phase-two calibration.

The two-phase sampling design is as follows: From the finite population of units $U = \{1, 2, \ldots, k, \ldots, N\}$ we select a phase-one sample $s_1$. The known positive inclusion probability of unit $k$ is $\pi_{1k} = \Pr(k \in s_1)$, and the phase-one design weight is $a_{1k} = 1/\pi_{1k}$. Certain variables may be observed for the units $k \in s_1$. Then, conditionally on $s_1$, we select a phase-two sample $s$ from $s_1$. The known

The double-expansion estimator $\sum_s a_k y_k$ is unbiased for $Y = \sum_U y_k$. We can produce more efficient estimators by taking into account the available auxiliary information. Three types or sets of auxiliary variables (called x-variables) can be distinguished for two-phase sampling designs. These are denoted by X , X and X . Their information characteristics are specified in the following table.

Table 1.1
Sets of auxiliary variables for calibration in two-phase sampling

Set of auxiliary    Auxiliary variable    Unit variable values    Unit variable values
variables           total over U          for k in s1             for k in s
X                   known                 known                   known
X                   known                 unknown                 known
X                   unknown               known                   known

Each set may contain any number of x-variables. The three sets are mutually exclusive. The properties in the last three columns apply to every x-variable in the corresponding set. All x-variables used for calibration belong to one of these three sets.

2. Phase-one calibration

For the phase-one calibration, we use a vector $x_{1k}$ of auxiliary variables selected from the set X . While it is natural to let $x_{1k}$ consist of all the variables in X , the general presentation here allows us to define $x_{1k}$ to include some or even none of the variables in X . The phase-one calibration weights $w_{1k}$ are derived by modifying the phase-one starting weights $a_{1k}$ subject to the calibration constraint $\sum_{s_1} w_{1k} x_{1k} = \sum_U x_{1k}$. In our formulation, the calibration weights are given for $k \in s_1$ as

$$w_{1k} = a_{1k}\left\{1 + (X_1 - \hat{X}_1)' \left(\sum_{s_1} a_{1k} z_{1k} x_{1k}'\right)^{-1} z_{1k}\right\} \qquad (2.1)$$

where $X_1 = \sum_U x_{1k}$, $\hat{X}_1 = \sum_{s_1} a_{1k} x_{1k}$ and $z_{1k}$ is an
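The phase-one calibration in (2.1) can be illustrated numerically. The following Python sketch is only an illustration under stated assumptions: the helper name `calibrate`, the simulated population, the simple random phase-one sample and the choice $z_{1k} = x_{1k}$ are all hypothetical and are not taken from the paper.

```python
import numpy as np

def calibrate(start_w, x, z, target):
    """Linear calibration with instruments, in the form of (2.1):
    w_k = a_k * {1 + (target - est)' T^{-1} z_k},
    where est = sum_k a_k x_k and T = sum_k a_k z_k x_k'."""
    est = start_w @ x                          # estimated totals from starting weights
    T = (z * start_w[:, None]).T @ x           # sum_k a_k z_k x_k'
    lam = np.linalg.solve(T.T, target - est)   # lam' = (target - est)' T^{-1}
    return start_w * (1.0 + z @ lam)

# Hypothetical population with one auxiliary variable whose total is known
rng = np.random.default_rng(1)
N, n1 = 1000, 200
u = rng.gamma(4.0, 2.0, N)
x_pop = np.column_stack([np.ones(N), u])       # x_1k = (1, u_k)'
X1 = x_pop.sum(axis=0)                         # known totals (N, sum_U u_k)

# Phase-one sample by simple random sampling (an assumption of this sketch)
s1 = rng.choice(N, size=n1, replace=False)
a1 = np.full(n1, N / n1)                       # design weights a_1k = 1 / pi_1k
w1 = calibrate(a1, x_pop[s1], x_pop[s1], X1)   # instrument z_1k = x_1k

print(w1 @ x_pop[s1], X1)                      # calibrated totals reproduce X1
```

Solving the linear system rather than forming the matrix inverse is a design choice of the sketch; it gives the same weights as the closed form.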
used to calculate $\hat{Y}_{2P} = \sum_s w_k y_k$ as our estimator of $Y = \sum_U y_k$. The vector $x_k = (x_{k(t)}', x_{k(w)}', x_{k(a)}')'$ has three components, as described below. No auxiliary variable can appear in more than one of the three vector components. These three components have different roles in the setup of the phase-two calibration equation $\sum_s w_k x_k = X$ and in the determination of the phase-two calibration weights.

The variables in the vector $x_{k(t)}$ are selected from among those in the set X X. This means that the total $\sum_U x_{k(t)}$ is known and can be included in $X$. Variables in $x_{1k}$ are allowed to reoccur in $x_{k(t)}$, and this is usually preferable in order to reduce the variance of the estimator. We can specify $x_{k(t)} = x_{1k}$, but our framework permits $x_{k(t)}$ to include variables from X . This allows us to use variables with known population totals in situations where the variables are too expensive to collect for a large phase-one sample $s_1$ but are observable for the smaller phase-two sample $s$. These variables are excluded from the phase-one calibration because they are unavailable for $k \in s_1$.

The variables in $x_{k(w)}$ and $x_{k(a)}$ are selected from among those in the set X X X provided they are not already included in $x_{k(t)}$. The variables in $x_{k(w)}$ are those for which we want to satisfy the phase-two calibration equation $\sum_s w_k x_{k(w)} = \sum_{s_1} w_{1k} x_{k(w)}$, where the right-hand side is approximately unbiased for $\sum_U x_{k(w)}$. The variables in $x_{k(a)}$ are those for which we want to satisfy the phase-two calibration equation $\sum_s w_k x_{k(a)} = \sum_{s_1} a_{1k} x_{k(a)}$. Here, the right-hand side is unbiased for $\sum_U x_{k(a)}$. The inclusion of both $x_{k(w)}$ and $x_{k(a)}$ in the definition of $x_k$ allows us to calibrate on one or both of these vectors and provides a general framework for producing different estimators from the phase-two calibration.

The phase-two calibration equation is $\sum_s w_k x_k = X$, where $X$ is the stacked auxiliary vector

$$X = \begin{pmatrix} \sum_U x_{k(t)} \\ \sum_{s_1} w_{1k} x_{k(w)} \\ \sum_{s_1} a_{1k} x_{k(a)} \end{pmatrix}. \qquad (3.1)$$

A specific variable can only occur once in $x_k$. Otherwise, the calibration equation may be inconsistent and admit no solution.

The starting weights for the phase-two calibration are denoted by $a_k$ for $k \in s$. There is more than one reasonable choice for the $a_k$. We consider two alternatives, both of which seem natural: (1) $a_k = a_{1k} a_{2k}$, and (2) $a_k = w_{1k} a_{2k}$, where $w_{1k}$ is the phase-one calibration weight given by (2.1).

Given the starting weights $a_k$, we determine final weights $w_k$ subject to the calibration equation $\sum_s w_k x_k = X$. These final weights are given for $k \in s$ by

$$w_k = a_k\left\{1 + (X - \hat{X})' \left(\sum_s a_k z_k x_k'\right)^{-1} z_k\right\} \qquad (3.2)$$

where $\hat{X} = \sum_s a_k x_k$ is an unbiased or approximately unbiased estimator of $X$, depending on the composition of $x_k$. The instrumental variable $z_k$ has the same dimension as $x_k$. The vectors $z_{1k}$ and $z_k$ are assumed to be fixed functions of $x_{1k}$ and $x_k$. How to choose $z_{1k}$ and $z_k$ is a topic we leave for others to address.

4. Comparison of two options for the starting weights

The objective in this section is to analyze how the final weights $w_k$ in $\hat{Y}_{2P} = \sum_s w_k y_k$ depend on the specification of the starting weights $a_k$ in (3.2). We consider two distinct cases based on whether or not the auxiliary variables $x_k$ are used for the phase-two calibration. When we carry out the phase-two calibration, the two different choices for starting weights generally lead to different estimators. We show that these estimators are asymptotically equivalent under certain conditions, commonly found in practice. When we have no phase-two calibration, the two choices for starting weights lead to two other estimators that are usually less efficient than those obtained by performing the phase-two calibration.

4.1 Estimators with phase-two calibration ($x_k \neq \emptyset$)

As noted previously, there are two alternatives for the starting weights $a_k$ in (3.2): (1) $a_k = a_{1k} a_{2k}$, and (2) $a_k = w_{1k} a_{2k}$, where $w_{1k}$ is the phase-one calibration weight given by (2.1). We now provide a detailed analysis of the form of the estimator under these two choices. In this subsection, we look at the more interesting case where we perform the phase-two calibration ($x_k \neq \emptyset$). In the next subsection, we consider what happens when we do not carry out the phase-two calibration ($x_k = \emptyset$).

Our procedure is as follows. First, we derive the linearized (asymptotic) form of $\hat{Y}_{2P}$ based on the general starting weights $a_k$. Then we substitute the two choices for $a_k$ in this expression. We determine $\hat{Y}_{2P}$ based on the starting weights $a_k = a_{1k} a_{2k}$. We denote this estimator by $\hat{Y}_{2Pa}$ and derive its linearized form, $\hat{Y}_{2Pa\,lin}$. Similarly, we obtain $\hat{Y}_{2P}$ based on the starting weights $a_k = w_{1k} a_{2k}$. We refer to this estimator as $\hat{Y}_{2Pw}$ and derive its linearized form, $\hat{Y}_{2Pw\,lin}$. These two forms are slightly different but we prove in Result 4.2 that $\hat{Y}_{2Pa\,lin} = \hat{Y}_{2Pw\,lin}$ under certain conditions.
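The two choices of starting weights can be compared numerically. The sketch below is a hypothetical illustration, not the authors' software: it assumes simple random sampling at both phases, a simulated population, $z_{1k} = x_{1k}$, $z_k = x_k$, $x_{k(t)} = (1, u_{1k})'$, $x_{k(w)} = u_{2k}$ and an empty $x_{k(a)}$. The helper `calibrate` is an assumed name implementing the form of (2.1) and (3.2).

```python
import numpy as np

def calibrate(start_w, x, z, target):
    """w_k = a_k * {1 + (target - est)' T^{-1} z_k}, with est = sum a_k x_k
    and T = sum a_k z_k x_k' (the form shared by (2.1) and (3.2))."""
    est = start_w @ x
    T = (z * start_w[:, None]).T @ x
    lam = np.linalg.solve(T.T, target - est)
    return start_w * (1.0 + z @ lam)

# Hypothetical population (all values are assumptions of this sketch)
rng = np.random.default_rng(7)
N, n1, n = 5000, 1000, 400
u1 = 2.0 * rng.gamma(4.0, size=N)        # total known over U
u2 = 3.0 * rng.gamma(6.0, size=N)        # observed for k in s1, total unknown
y = 10.0 + u1 + 3.0 * u2 + 5.0 * rng.standard_normal(N)

s1 = rng.choice(N, size=n1, replace=False)
pos = rng.choice(n1, size=n, replace=False)   # positions of s within s1
s = s1[pos]

# Phase-one calibration with x_1k = (1, u1k)'
x1_s1 = np.column_stack([np.ones(n1), u1[s1]])
a1 = np.full(n1, N / n1)
w1 = calibrate(a1, x1_s1, x1_s1, np.array([N, u1.sum()]))

# Phase-two calibration: stacked vector X as in (3.1), with x_k(a) empty
x_s = np.column_stack([np.ones(n), u1[s], u2[s]])
X = np.array([N, u1.sum(), w1 @ u2[s1]])
a2 = n1 / n

w_a = calibrate(a1[pos] * a2, x_s, x_s, X)    # choice (1): a_1k * a_2k
w_w = calibrate(w1[pos] * a2, x_s, x_s, X)    # choice (2): w_1k * a_2k

Y_a, Y_w = w_a @ y[s], w_w @ y[s]
print(Y_a, Y_w, y.sum())
```

Both sets of final weights satisfy the same calibration equation, so any difference between the two printed estimates comes only from the starting weights.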
We start by inserting the weights wk into Y2 P = The following result establishes the relationship between
s wk yk and writing the estimator as the estimators obtained for the two choices of starting
weights.
Y2 P = U xk (t ) B( y; x)(t ) + s w1k xk ( w) B( y; x)( w)
1
Result 4.1: The linearized forms of Y2 P a and Y2 P w are
related by the equation Y2 P w lin = Y2 P a lin +
+ s a1k xk ( a ) B( y; x)( a )
1
+ s ak e( y; x) k 1 )( B ( y ; x ) B ( x; x ) B( y ; x ) ).
( X1 X 1 1
) (B * Proof
+ (X X ( y; x ) B( y; x ) ) (4.1)
We consider expression (4.3) under the two possible
where B *( y ; x ) = ( s ak z k xk ) 1 s ak z k yk , B ( y ; x ) = choices for ak. First, with ak = ak = a1k a2 k we obtain
(U z k xk ) 1U z k yk and B ( y ; x) = (B( y ; x)(t) , B( y ; x)(w) , B( y ; x)(a) ) Y2 P a given by
is the partitioning corresponding to x k = ( xk (t ) , xk ( w) ,
Y2 P a = U (xk ( t ) B ( y ; x )(t ) + x1k B ( xB( w ) ; x 1 ) )
xk ( a ) ). Our subscript notation of the form ( v1 ; v 2 )
identifies the variables in the regression. The term v 2 refers + s a1k (xk (a ) B( y; x)( a )
1
+ e( xB( w ) ; x 1 ) k )
to the independent variables and v1 identifies the dependent
variable or variables. For simplicity, the instrumental + s ak e( y; x) k
vectors z1k and z k are not included in the notation.
) (B
+ ( X1 X
The term e( y ; x ) k = yk xk B ( y ; x ) is defined for k U . 1 ( xB ( w ) ; x1 ) B ( xB ( w ) ; x1 ) )
+ s ak e( y; x) k
+ s ak e( y; x )k . (4.5)
Now let us consider expression (4.3) under the second
) (B
+ ( X1 X1 ( xB ( w ) ; x1 ) B ( xB ( w ) ; x1 ) ) choice, ak = w1k a2 k . This leads to Y2 P w given by
)(B * B
+ (X X (4.3)
( y; x) ( y ; x ) ).
+ s ak e( y ; x )k + s ak e( y; x )k
( s a ) s a z
1
1 )
+ ( X1 X z x e )( B
1
1k 1k 1k k 1k ( y ; x ) k + ( X1 X1 ( y ; x1 ) B ( x; x1 ) B ( y ; x ) ). (4.8)
)(B w
+ (X X ( y; x) B( y; x) ) (4.6) Comparing (4.5) with (4.8), we see that Y2 P w lin =
Y2 P a lin + ( X1 X 1 )( B ( y ; x ) B ( x; x ) B ( y ; x ) ) as stated in
where B (wy ; x ) = ( s w1k a2 k z k xk )1 s w1k a2 k z k yk
1 1
and the result. This completes the proof of result 4.1.
X = s w1k a2 k x k . The first three terms of Y2 P w are the Result 4.1 shows that in general, the linearized forms of
same as those found in expression (4.4) for Y2 P a. The Y2 P w and Y2 P a are not the same. However, they are the
fourth and fifth terms differ from their counterparts in (4.4). same under certain conditions. Let us consider the case of
Although B (wy ; x ) and X are functions of the phase-one nested calibration (not to be confused with nested
calibration weights w1k , we do not need to replace them in sampling), meaning that x k includes x1k . Then x k is of the
B (wy ; x ) and X in the fifth term; this would simply split the form x k = (x1k , x+ k ) where the vector x + k is composed
lower order term ( X X )(B (wy ; x ) B ( y ; x ) ) into other of the remaining variables. We now state and prove the
lower order terms. Therefore, we can drop the fifth term of following result.
(4.6) when the sample sizes are sufficiently large. The Result 4.2: If x k = (x1k , x+ k ) and z k = ( z1k , z + k ) then
fourth term can be written as follows. Y2 P w lin = Y2 P a lin and Y2 P a and Y2 P w are
( s a ) s a z
1 asymptotically equivalent.
)
( X1 X z x e
1 1
1k 1k 1k k 1k ( y ; x ) k
Proof
1 )( B ( y ; x ) B ( x; x ) B ( y ; x ) )
= ( X1 X The proof follows from result 4.1 by showing
1 1
( U z1 x1 ) ( U z1 h )
1
1 )( B ( x; x ) B ( x; x ) )B ( y ; x ) B ( y ;x1 ) B ( x;x1 ) B ( y ;x ) =
( X1 X 1 1
k k k k
where hk = yk xk (U z k xk ) 1 (U z k yk ). Since
( s a )
1
1 )
+ ( X1 X z x
1k 1k 1k
U z1k hk = 0 and we assume z k = (z1k , z + k ), it follows
U z1k hk = 0 and B ( y ; x1 ) B ( x; x1 ) B ( y ; x ) = 0. Therefore
1
linear approximation. The substitution of this term into (4.6) three terms: a constant term U e0 k , a phase-one expansion
leads to the linearized form of Y2 P w, term s1 a1k e1k , and a double-expansion term s ak e2 k ,
Y2 P lin = U e0k + s a1k e1k + s ak e2k. (4.9) calibration. The linearized form of the two-phase estimator
with wk = w1k a2 k is obtained by writing it as follows.
1
( s a ) s a
1
E (Y2 P a lin ) = U (e0k + e1k + e2 k ) = U yk = Y.
1
1k z1k x1k
1
1k z1k xk ( w) B ( y ; x )( w).
This shows that $\hat{Y}_{2Pa\,lin}$ is unbiased for $Y$. By (4.4), $\hat{Y}_{2Pa} = \hat{Y}_{2Pa\,lin} + R$, so the bias of $\hat{Y}_{2Pa}$ equals the expectation of $R$, which is the sum of the two lower order terms $(X_1 - \hat{X}_1)'(\hat{B}_{(xB_{(w)};\,x_1)} - B_{(xB_{(w)};\,x_1)})$ and $(X - \hat{X})'(\hat{B}_{(y;x)} - B_{(y;x)})$. As pointed out in section 4, each of these terms has expectation close to zero. It follows that $\hat{Y}_{2Pa}$ is approximately unbiased for $Y$.

The variance of $\hat{Y}_{2Pa} = \sum_s w_k y_k$ is closely approximated by the variance of the linearized form $\hat{Y}_{2Pa\,lin}$ given by (4.9) with residuals defined by (4.10). Its first term, $\sum_U e_{0k}$, is constant and does not contribute to the variance. Therefore,

$$V(\hat{Y}_{2Pa\,lin}) = V\left(\sum_{s_1} a_{1k} e_{1k} + \sum_s a_k e_{2k}\right). \qquad (5.1)$$

We use (5.1) as the starting point for deriving a variance estimator for $\hat{Y}_{2Pa\,lin}$. Two different approaches can be used and it is of interest to compare them. The one in section 7 is new and more interesting because it produces a more efficient variance estimator than the one in section 8, derived by the traditional technique of conditioning on the phase-one sample $s_1$. The residuals $e_{1k}$ and $e_{2k}$ given by (4.10) play an important role in both derivations.

6. Preliminaries for variance estimation

Our objective is to estimate the variance $V(\hat{Y}_{2Pa\,lin})$ given by (5.1). This is done in sections 7 and 8 by two different arguments. The residuals $e_{1k}$ and $e_{2k}$ are defined for all $k \in U$ but they cannot be computed. They must be replaced by estimates $\hat{e}_{1k}$ and $\hat{e}_{2k}$. These estimates, formed in the image of (4.10), are

$$\hat{e}_{1k} = x_{k(a)}' \hat{B}_{(y;x)(a)} + x_{k(w)}' \hat{B}_{(y;x)(w)} - x_{1k}' \hat{B}_{(x\hat{B}_{(w)};\,x_1)} \quad \text{for } k \in s_1$$

$$\hat{e}_{2k} = y_k - x_k' \hat{B}_{(y;x)} \quad \text{for } k \in s. \qquad (6.1)$$

The term $\hat{B}_{(x\hat{B}_{(w)};\,x_1)}$ in the definition of $\hat{e}_{1k}$ is the estimate of $B_{(xB_{(w)};\,x_1)} = (\sum_U z_{1k} x_{1k}')^{-1} \sum_U z_{1k} x_{k(w)}' B_{(y;x)(w)}$ in (4.10). Two replacements are required in $B_{(xB_{(w)};\,x_1)}$ to arrive at $\hat{B}_{(x\hat{B}_{(w)};\,x_1)}$: First, sums over $U$ are replaced by appropriately weighted sums over $s_1$, giving $\hat{B}_{(xB_{(w)};\,x_1)} = (\sum_{s_1} a_{1k} z_{1k} x_{1k}')^{-1} \sum_{s_1} a_{1k} z_{1k} x_{k(w)}' B_{(y;x)(w)}$. In this expression, $B_{(y;x)(w)}$ is still unknown, so we replace it by its estimate $\hat{B}_{(y;x)(w)}$ to arrive at $\hat{B}_{(x\hat{B}_{(w)};\,x_1)}$.

A key point to note is that estimates $\hat{e}_{1k}$ can be obtained for $k \in s_1$, because $x_{k(a)}$, $x_{k(w)}$ and $x_{1k}$ are all known for $k \in s_1$, but estimates $\hat{e}_{2k}$ can only be made for $k \in s$, because $y_k$ is available only for $k \in s$. The fact that the estimates $\hat{e}_{1k}$ are available for $k \in s_1$ rather than $k \in s$ allows us to construct (in section 7) a more efficient estimator of $V(\hat{Y}_{2Pa\,lin})$ than the traditional approach to variance estimation (in section 8) where all estimated residuals are calculated only for $k \in s$.

The design weights $a_{1k} = 1/\pi_{1k}$, $a_{2k} = 1/\pi_{2k}$ and $a_k = a_{1k} a_{2k}$ were defined in section 1. In the following sections, we also need the quantities given below, defined as functions of the second-order inclusion probabilities $\pi_{1k\ell} = \Pr(k\ \&\ \ell \in s_1)$ and $\pi_{2k\ell} = \Pr(k\ \&\ \ell \in s \mid s_1)$:

$$a_{1k\ell} = 1/\pi_{1k\ell}, \quad a_{2k\ell} = 1/\pi_{2k\ell}, \quad a_{k\ell} = a_{1k\ell}\, a_{2k\ell}$$

$$D_{1k\ell} = a_{1k} a_{1\ell} - a_{1k\ell}, \quad D_{2k\ell} = a_{2k} a_{2\ell} - a_{2k\ell}, \quad D_{k\ell} = a_k a_\ell - a_{k\ell}.$$

Here, $\pi_{2k\ell}$ and $a_{2k\ell}$ are conditional on the sample $s_1$. All first-order and second-order inclusion probabilities are assumed positive. Using this notation and the above results, we now develop two different variance estimators in the next two sections.
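To make the $D$ quantities concrete, the following Python sketch evaluates $D_{1k\ell} = a_{1k} a_{1\ell} - a_{1k\ell}$ for a phase-one sample drawn by simple random sampling without replacement, and checks that the double sum $\sum_{k}\sum_{\ell} D_{1k\ell} e_k e_\ell$ over $s_1$ reduces to the familiar one-phase variance estimator $N^2 (1 - n_1/N) s_e^2 / n_1$. The function name `srs_inverse_probs`, the sample sizes, the convention $\pi_{kk} = \pi_k$ and the random placeholder residuals are assumptions made for this illustration.

```python
import numpy as np

def srs_inverse_probs(n, N):
    """Inverse inclusion probabilities for SRS without replacement of n from N.
    Returns a = 1/pi_k and the n x n matrix A with A[k, l] = 1/pi_kl,
    using the convention pi_kk = pi_k on the diagonal."""
    a = N / n
    a_pair = N * (N - 1) / (n * (n - 1))
    A = np.full((n, n), a_pair)
    np.fill_diagonal(A, a)
    return a, A

N, n1 = 5000, 400
a1, A1 = srs_inverse_probs(n1, N)
D1 = a1 * a1 - A1                      # D_1kl = a_1k a_1l - a_1kl

rng = np.random.default_rng(0)
e = rng.normal(size=n1)                # placeholder residuals for k in s1

v_double_sum = e @ D1 @ e              # sum_k sum_l D_1kl e_k e_l
v_classical = N**2 * (1 - n1 / N) * e.var(ddof=1) / n1

print(v_double_sum, v_classical)       # the two values agree
```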
$$V(\hat{Y}_{2Pa\,lin}) = V\left(\sum_{s_1} a_{1k} e_{1k}\right) + V\left(\sum_s a_k e_{2k}\right) + 2\,\mathrm{Cov}\left(\sum_{s_1} a_{1k} e_{1k},\ \sum_s a_k e_{2k}\right). \qquad (7.1)$$

If we knew the residuals $e_{1k}$ and $e_{2k}$, unbiased estimates for these three components would be given respectively by

$$\sum_{k \in s_1}\sum_{\ell \in s_1} D_{1k\ell}\, e_{1k} e_{1\ell}, \qquad \sum_{k \in s}\sum_{\ell \in s} D_{k\ell}\, e_{2k} e_{2\ell}, \qquad 2\sum_{k \in s_1}\sum_{\ell \in s} D_{1k\ell}\, a_{2\ell}\, e_{1k} e_{2\ell}. \qquad (7.2)$$

The proof of unbiasedness is similar for all three components. For example, for the second one, we have

$$E_{s_1} E_{s|s_1}\left(\sum_{k \in s}\sum_{\ell \in s} D_{k\ell}\, e_{2k} e_{2\ell}\right) = E_{s_1}\left(\sum_{k \in s_1}\sum_{\ell \in s_1} (D_{k\ell}/a_{2k\ell})\, e_{2k} e_{2\ell}\right)$$

$$= \sum_{k \in U}\sum_{\ell \in U} (D_{k\ell}/a_{k\ell})\, e_{2k} e_{2\ell}$$

$$= E\left[\left(\sum_s a_k e_{2k}\right)^2\right] - \left[E\left(\sum_s a_k e_{2k}\right)\right]^2$$

$$= V\left(\sum_s a_k e_{2k}\right).$$

We now replace the unknown residuals in (7.2) by the respective estimates given by (6.1); that is, $e_{1k}$ by $\hat{e}_{1k}$ for $k \in s_1$ and $e_{2k}$ by $\hat{e}_{2k}$ for $k \in s$. Then, the resulting three components are added to arrive at the separate residual variance estimator

$$\hat{V}_{sr}(\hat{Y}_{2Pa\,lin}) = \sum_{k \in s_1}\sum_{\ell \in s_1} D_{1k\ell}\, \hat{e}_{1k} \hat{e}_{1\ell} + \sum_{k \in s}\sum_{\ell \in s} D_{k\ell}\, \hat{e}_{2k} \hat{e}_{2\ell} + 2\sum_{k \in s_1}\sum_{\ell \in s} D_{1k\ell}\, a_{2\ell}\, \hat{e}_{1k} \hat{e}_{2\ell}. \qquad (7.3)$$

The term separate residual and the corresponding subscript sr reflect the fact that (7.3) keeps the residuals separate, where $\hat{e}_{1k}$ is defined over the larger sample $s_1$ and $\hat{e}_{2k}$ over the smaller sample $s$. The fact that residuals computed for the larger sample $s_1$ can be advantageous for variance estimation was recognized by Axelson (1998). However, his derivation differs from our calibration approach based on $x_{1k}$ and $x_k$. The technique for variance estimation of the two-phase regression estimator in Hidiroglou, Rao and Haziza (2006) has certain traits in common with our approach, but there are also considerable differences.

8. The combined residual variance estimator

We arrived at (7.3) by recognizing that the estimates $\hat{e}_{1k}$ are obtainable for $k \in s_1$. The traditional approach, reviewed in this section, is to derive a variance estimator by conditioning on the phase-one sample $s_1$. This produces a variance estimator where all required residuals are defined for $k \in s$. Later, we compare it with the more efficient (7.3). From (5.1), we condition on the phase-one sample $s_1$ to obtain

$$V(\hat{Y}_{2Pa\,lin}) = V_{s_1} E_{s|s_1}\left(\sum_{s_1} a_{1k} e_{1k} + \sum_s a_k e_{2k}\right) + E_{s_1} V_{s|s_1}\left(\sum_{s_1} a_{1k} e_{1k} + \sum_s a_k e_{2k}\right)$$

$$= V_{s_1}\left(\sum_{s_1} a_{1k} e_{1k} + \sum_{s_1} a_{1k} e_{2k}\right) + E_{s_1} V_{s|s_1}\left(\sum_s a_k e_{2k}\right)$$

$$= V_{s_1}\left(\sum_{s_1} a_{1k} e_{12k}\right) + E_{s_1} V_{s|s_1}\left(\sum_s a_k e_{2k}\right)$$

where $e_{12k} = e_{1k} + e_{2k}$ is called the combined residual. From (4.10), we obtain the following.

$$e_{12k} = y_k - x_{k(t)}' B_{(y;x)(t)} - x_{1k}' B_{(xB_{(w)};\,x_1)}$$

$$e_{2k} = y_k - x_{k(t)}' B_{(y;x)(t)} - x_{k(w)}' B_{(y;x)(w)} - x_{k(a)}' B_{(y;x)(a)}. \qquad (8.2)$$

It is straightforward to define estimators of the two components $V_{s_1}(\sum_{s_1} a_{1k} e_{12k})$ and $E_{s_1} V_{s|s_1}(\sum_s a_k e_{2k})$. Each of these has the form of a double sum over $s$ because $e_{12k}$ and $e_{2k}$ contain $y_k$ which is only available for $k \in s$. The first component uses $\hat{e}_{12k} = \hat{e}_{1k} + \hat{e}_{2k} = y_k - x_{k(t)}' \hat{B}_{(y;x)(t)} - x_{1k}' \hat{B}_{(x\hat{B}_{(w)};\,x_1)}$ for $k \in s$. We then have $\sum_{k \in s}\sum_{\ell \in s} D_{1k\ell}\, a_{2k\ell}\, \hat{e}_{12k} \hat{e}_{12\ell}$ as an estimator of $V_{s_1}(\sum_{s_1} a_{1k} e_{12k})$.

For the second component, we use the residual estimates $\hat{e}_{2k} = y_k - x_k' \hat{B}_{(y;x)}$ given by (6.1) for $k \in s$, and obtain $\sum_{k \in s}\sum_{\ell \in s} D_{2k\ell}\, a_{1k} a_{1\ell}\, \hat{e}_{2k} \hat{e}_{2\ell}$ as an estimator of $E_{s_1} V_{s|s_1}(\sum_s a_k e_{2k})$. Summing the two estimated terms we have the following variance estimator, where the subscript cr indicates combined residual,

$$\hat{V}_{cr}(\hat{Y}_{2Pa\,lin}) = \sum_{k \in s}\sum_{\ell \in s} D_{1k\ell}\, a_{2k\ell}\, \hat{e}_{12k} \hat{e}_{12\ell} + \sum_{k \in s}\sum_{\ell \in s} D_{2k\ell}\, a_{1k} a_{1\ell}\, \hat{e}_{2k} \hat{e}_{2\ell}. \qquad (8.3)$$
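The double sums in (7.3) and (8.3) can be written as matrix products. The sketch below is an illustration under assumptions, not the authors' implementation: it takes simple random sampling without replacement at both phases, uses hypothetical function names (`srs_inverse_probs`, `v_sr`, `v_cr`), and feeds in random placeholder residuals rather than residuals computed from a calibration. It confirms the property that the two estimators coincide when $s = s_1$.

```python
import numpy as np

def srs_inverse_probs(n, N):
    """a = 1/pi_k and matrix A[k, l] = 1/pi_kl for SRS of n from N (pi_kk = pi_k)."""
    a = N / n
    A = np.full((n, n), N * (N - 1) / (n * (n - 1)))
    np.fill_diagonal(A, a)
    return a, A

def v_sr(e1_s1, e2_s, pos, N):
    """Separate residual estimator in the form of (7.3).
    e1_s1: residuals for k in s1; e2_s: residuals for k in s;
    pos: positions of the phase-two units within s1."""
    n1, n = len(e1_s1), len(e2_s)
    a1, A1 = srs_inverse_probs(n1, N)
    a2, A2 = srs_inverse_probs(n, n1)
    D1 = a1 * a1 - A1
    D = (a1 * a2) ** 2 - A1[np.ix_(pos, pos)] * A2     # D_kl = a_k a_l - a_kl
    cross = 2.0 * a2 * (e1_s1 @ D1[:, pos] @ e2_s)     # k in s1, l in s
    return e1_s1 @ D1 @ e1_s1 + e2_s @ D @ e2_s + cross

def v_cr(e12_s, e2_s, pos, n1, N):
    """Combined residual estimator in the form of (8.3); all residuals on s."""
    n = len(e2_s)
    a1, A1 = srs_inverse_probs(n1, N)
    a2, A2 = srs_inverse_probs(n, n1)
    D1 = (a1 * a1 - A1)[np.ix_(pos, pos)]
    D2 = a2 * a2 - A2
    return e12_s @ (D1 * A2) @ e12_s + a1 * a1 * (e2_s @ D2 @ e2_s)

rng = np.random.default_rng(5)
N, n1 = 5000, 400

# Case n = n1: the phase-two sample is the whole phase-one sample
pos_full = np.arange(n1)
e1 = rng.normal(size=n1)
e2 = rng.normal(size=n1)
vsr_full = v_sr(e1, e2, pos_full, N)
vcr_full = v_cr(e1 + e2, e2, pos_full, n1, N)
print(vsr_full, vcr_full)              # identical when s = s1

# Case n < n1: the two estimators generally differ
pos = np.sort(rng.choice(n1, size=150, replace=False))
vsr = v_sr(e1, e2[pos], pos, N)
vcr = v_cr(e1[pos] + e2[pos], e2[pos], pos, n1, N)
print(vsr, vcr)
```

Building the full matrices of pairwise quantities is a convenience for small samples; for large samples a closed form for the chosen design would normally be preferred.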
Let us review how (7.3) and (8.3) differ. The separate residual variance estimator (7.3) starts with the expansion $V(\hat{Y}_{2Pa\,lin}) = V(\sum_{s_1} a_{1k} e_{1k}) + V(\sum_s a_k e_{2k}) + 2\,\mathrm{Cov}(\sum_{s_1} a_{1k} e_{1k}, \sum_s a_k e_{2k})$. We estimate these three components separately as functions of the residuals $e_{1k}$ and $e_{2k}$. The resulting variance expression has three terms: a double sum over $s_1$ in terms of $e_{1k}$ and $e_{1\ell}$, a double sum over $s$ in terms of $e_{2k}$ and $e_{2\ell}$, and a cross-sum over $s_1$ and $s$ in terms of $e_{1k}$, $k \in s_1$, and $e_{2\ell}$, $\ell \in s$. Finally, we arrive at (7.3) by estimating $e_{1k}$ by $\hat{e}_{1k}$ for $k \in s_1$ and $e_{2k}$ by $\hat{e}_{2k}$ for $k \in s$.

The combined residual variance estimator (8.3) arises from the traditional conditioning on the phase-one sample $s_1$ as $V(\hat{Y}_{2Pa\,lin}) = V_{s_1} E_{s|s_1}(\hat{Y}_{2Pa\,lin}) + E_{s_1} V_{s|s_1}(\hat{Y}_{2Pa\,lin})$. This leads us to combine $e_{1k}$ and $e_{2k}$ as $e_{12k} = e_{1k} + e_{2k}$ in the first term. The second term, $E_{s_1} V_{s|s_1}(\hat{Y}_{2Pa\,lin})$, is a function of $e_{2k}$. Since $e_{12k}$ and $e_{2k}$ can only be estimated over $s$, the resulting variance estimator becomes a sum of two terms, each of them expressed as a double sum over $s$.

The separate residual estimator (7.3) is more efficient than the combined residual alternative (8.3), because it is based on residuals $\hat{e}_{1k}$ obtained for the typically larger sample $s_1$. The advantage of (7.3) over (8.3) is illustrated by the simulation in section 10. The approach behind the separate residual variance estimator (7.3) can be extended to three-phase sampling and other complex designs. In those extensions of the technique, we proceed in a similar manner, starting by a derivation of the linearized form through an expansion of the variance components and the determination of the appropriate residuals.

9. A comparison with the two-phase regression estimator

Särndal, Swensson and Wretman (1992) developed a two-phase regression estimator for $Y = \sum_U y_k$, based on an earlier paper by Särndal and Swensson (1987). It is useful to see how this estimator, denoted here by $\hat{Y}_{reg}$, compares with the calibration estimator $\hat{Y}_{2P}$ considered in the preceding sections of this paper. When based on the same auxiliary information, the two estimators are close but not identical. This is because the estimator $\hat{Y}_{2P}$ is derived by calibration in each of the two phases, whereas the two-phase regression estimator $\hat{Y}_{reg}$ is derived by model-assisted reasoning.

We now describe the two-phase regression estimator of Särndal, Swensson and Wretman (1992). Their derivation involves the fit of two linear regression models with the use of the available auxiliary data; one at the top level and the other at the bottom level. These authors develop a corresponding estimator of variance, via the traditional conditioning argument. We compare their variance estimator with the combined residual variance estimator (8.3), also developed by the conditioning argument. The two variance estimators do not agree exactly, because the point estimators are slightly different, but they are numerically close, as shown in this section.

Let $x_{1k}$ be a vector of auxiliary variables with known population totals, and let $x_k = (x_{1k}', x_{2k}')'$, where both $x_{1k}$ and $x_{2k}$ are known vector values for $k \in s_1$. The total $\sum_U x_{1k}$ is assumed known whereas the total $\sum_U x_{2k}$ is unknown. The predicted values produced for $k \in s_1$ by the two regressions fitted at the top level and bottom level are given respectively by

$$\hat{y}_{1k} = x_{1k}' \hat{B}_{1s} \quad \text{with} \quad \hat{B}_{1s} = \left(\sum_s a_k x_{1k} x_{1k}' / \sigma_{1k}^2\right)^{-1} \sum_s a_k x_{1k} y_k / \sigma_{1k}^2 \qquad (9.1)$$

and

$$\hat{y}_k = x_k' \hat{B}_s \quad \text{with} \quad \hat{B}_s = \left(\sum_s a_k x_k x_k' / \sigma_k^2\right)^{-1} \sum_s a_k x_k y_k / \sigma_k^2. \qquad (9.2)$$

The resulting two-phase regression estimator $\hat{Y}_{reg}$ of $Y = \sum_U y_k$ is

$$\hat{Y}_{reg} = \left(\sum_U x_{1k}\right)' \hat{B}_{1s} + \sum_{s_1} a_{1k} (\hat{y}_k - \hat{y}_{1k}) + \sum_s a_k (y_k - \hat{y}_k). \qquad (9.3)$$

Can $\hat{Y}_{reg}$ be interpreted as a calibration estimator? To answer this question, let us determine the implicit weights in (9.3). We can write $\hat{Y}_{reg} = \sum_s w_k y_k$, with weights $w_k$ identified by substituting (9.1) and (9.2) into (9.3) and simplifying. We find $w_k = a_k g_k = a_{1k} a_{2k} g_k$, where the calibration factor $g_k$ is given for $k \in s$ by

$$g_k = 1 + \left(\sum_U x_{1k} - \sum_{s_1} a_{1k} x_{1k}\right)' \left(\sum_s a_k x_{1k} x_{1k}' / \sigma_{1k}^2\right)^{-1} x_{1k} / \sigma_{1k}^2 + \left(\sum_{s_1} a_{1k} x_k - \sum_s a_k x_k\right)' \left(\sum_s a_k x_k x_k' / \sigma_k^2\right)^{-1} x_k / \sigma_k^2. \qquad (9.4)$$

The weights $w_k$ are not explicitly stated in Särndal, Swensson and Wretman (1992). In what sense, if any, can $w_k$ be considered a calibration weight? To examine this, we first replace $y_k$ in (9.3) with $x_{1k}$. Using (9.1) and (9.2) with $y_k = x_{1k}$ gives $\sum_U x_{1k}$ as the right-hand side of (9.3). Thus, the weights $w_k = a_k g_k$ satisfy $\sum_s w_k x_{1k} = \sum_U x_{1k}$. Next we replace $y_k$ in (9.3) with $x_{2k}$, again using (9.1) and (9.2) to obtain
s a1k x2 k
1
+ ( U x s a x )
1k 1
1k 1k e1k = x2 k B ( y ; x )(2) x1k B ( xB
(2) ; x1 )
for k s1
( s a x x ) e2 k = yk xk B ( y ; x )
1
2
k 1k 1k 1k
e12 k = e1k + e2 k
V (Yreg ) = k s s D1k a2 k e1k s e1 s = yk x1k B ( y ; x )(1) x1k B ( xB
( 2) ; x1 )
where, for k s, ( s a1
1k 1kx x2 k B ( y ; x )(2) 12k )}.
$e_{1ks} = y_k - x_{1k}' \hat{B}_{1s}$ and $e_{ks} = y_k - x_k' \hat{B}_s$. (9.7)

Both components of (9.6) are double sums over $s$, reflecting the fact that both $e_{1ks}$ and $e_{ks}$ can only be obtained for $k \in s$. Formula (9.6) looks similar to formula (8.3) for the combined residual estimator but how different are the residuals in the two formulas? Let us look at the residuals for the comparable point estimator. As noted above, this estimator $\hat{Y}_{2P}$ has $x_k = (x_{1k}', x_{2k}')'$ with $x_{k(t)} = x_{1k}$, $x_{k(w)} = x_{2k}$, $x_{k(a)} = \emptyset$, $z_{1k} = x_{1k}/\sigma_{1k}^2$ and $z_k = x_k/\sigma_k^2 = (x_{1k}'/\sigma_{1k}^2, x_{2k}'/\sigma_{2k}^2)'$. Under these specifications, the residuals $\hat{e}_{1k}$ and $\hat{e}_{2k}$ in (6.1) are given by

In the expression within curly brackets, let us replace the two $a_{1k}$-weighted sums over $s_1$ with the corresponding $a_k$-weighted sums over $s$; the result is equal to $\hat{B}_{1s}$ as given by (9.10). This means $\hat{e}_{12k} \approx y_k - x_{1k}' \hat{B}_{1s} = e_{1ks}$. In summary, $\hat{e}_{12k} \approx e_{1ks}$ for $k \in s$ and $\hat{e}_{2k} = e_{ks}$ for $k \in s$. Hence, the variance estimator (9.6) for the two-phase regression estimator $\hat{Y}_{reg}$ should be numerically close to the combined residual variance estimator (8.3) for the calibration estimator $\hat{Y}_{2P}$ defined in this section. We present empirical support for this through the simulation in the next section.
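The relationship between (9.3) and the implicit weights in (9.4) can be checked numerically. The following Python sketch is a hypothetical illustration: it assumes simple random sampling at both phases, constant model variances ($\sigma_{1k}^2 = \sigma_k^2 = 1$), and a simulated population; none of the variable names come from the paper. It computes $\hat{Y}_{reg}$ directly from (9.1) to (9.3), then again as $\sum_s w_k y_k$ with $w_k = a_k g_k$, and verifies that the weights reproduce the known total $\sum_U x_{1k}$.

```python
import numpy as np

# Hypothetical population (assumed for this sketch only)
rng = np.random.default_rng(3)
N, n1, n = 2000, 500, 200
u1 = rng.gamma(4.0, 2.0, N)
u2 = rng.gamma(6.0, 3.0, N)
y = 10 + u1 + 3 * u2 + rng.normal(0, 5, N)
x1 = np.column_stack([np.ones(N), u1])     # x_1k, totals known over U
x = np.column_stack([x1, u2])              # x_k = (x_1k', x_2k')'

s1 = rng.choice(N, size=n1, replace=False)
s = rng.choice(s1, size=n, replace=False)
a1 = N / n1                                # a_1k under SRS
a = a1 * (n1 / n)                          # a_k = a_1k * a_2k

# Regression fits (9.1) and (9.2) with sigma^2 = 1
T1 = a * x1[s].T @ x1[s]
T = a * x[s].T @ x[s]
B1s = np.linalg.solve(T1, a * x1[s].T @ y[s])
Bs = np.linalg.solve(T, a * x[s].T @ y[s])

# Two-phase regression estimator (9.3)
y1_hat, y_hat = x1 @ B1s, x @ Bs
Y_reg = (x1.sum(axis=0) @ B1s
         + a1 * (y_hat[s1] - y1_hat[s1]).sum()
         + a * (y[s] - y_hat[s]).sum())

# Implicit weights w_k = a_k g_k from (9.4); T1 and T are symmetric here
d1 = x1.sum(axis=0) - a1 * x1[s1].sum(axis=0)
d2 = a1 * x[s1].sum(axis=0) - a * x[s].sum(axis=0)
g = 1 + x1[s] @ np.linalg.solve(T1, d1) + x[s] @ np.linalg.solve(T, d2)
w = a * g

print(Y_reg, w @ y[s])                     # same estimate, two routes
print(w @ x1[s], x1.sum(axis=0))           # weights calibrate on x_1k
```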
10. Simulation

In this section we present a small simulation to validate the claim that the separate residual variance estimator $\hat{V}_{sr}(\hat{Y}_{2Pa\,lin})$ given by (7.3) can be considerably more efficient than the combined residual variance estimator $\hat{V}_{cr}(\hat{Y}_{2Pa\,lin})$ given by (8.3), and that the behaviour of the latter is very similar to that of the variance estimator $\hat{V}(\hat{Y}_{reg})$ for the two-phase regression estimator, given by (9.6). We created a population of $N = 5{,}000$ units in two steps as follows: First, the values $(u_{1k}, u_{2k})$ for $k = 1, 2, \ldots, 5{,}000$ were generated by 5,000 realizations of the independent random variables $u_{1k} \sim 2\,\mathrm{Gamma}(4)$ and $u_{2k} \sim 3\,\mathrm{Gamma}(6)$, where the $\mathrm{Gamma}(a)$ distribution has density $f(x) = [\Gamma(a)]^{-1} x^{a-1} e^{-x}$ for $x > 0$. Secondly, the values of the variable of interest were created as $y_k = 10 + u_{1k} + 3 u_{2k} + \varepsilon_k$, $k = 1, 2, \ldots, 5{,}000$, with $\varepsilon_k \sim 5\,\mathrm{Normal}(0, 1)$, where $\mathrm{Normal}(0, 1)$ is the standard Normal distribution with mean 0 and variance 1. The target of estimation in the experiment is the population $y$-total $Y = \sum_U y_k = 358{,}205$.

For the phase-one calibration, we used the auxiliary vector $x_{1k} = (1, u_{1k})'$ and $z_{1k} = x_{1k}$. That is, the weights $w_{1k}$ for $k \in s_1$ were determined by calibration to the known total $(N, \sum_U u_{1k}) = (5{,}000,\ 39{,}611.8)$. For the phase-two calibration we used $x_k = (x_{k(t)}', x_{k(w)}', x_{k(a)}')'$ with $x_{k(t)} = (1, u_{1k})'$, $x_{k(w)} = u_{2k}$, $x_{k(a)} = \emptyset$ and $z_k = x_k$. These specifications satisfy the conditions for asymptotic equivalence between $\hat{Y}_{2Pa}$ and $\hat{Y}_{2Pw}$. Therefore, for this simulation, we can work with $\hat{Y}_{2Pa}$ and its linearized form $\hat{Y}_{2Pa\,lin}$.

For each phase-one sample $s_1$, the final weights $w_k$ for the estimator $\hat{Y}_{2Pa} = \sum_s w_k y_k$ were determined by calibrating to the known totals given by the vector $(N, \sum_U u_{1k}, \sum_{s_1} w_{1k} u_{2k}) = (5{,}000,\ 39{,}611.8,\ \sum_{s_1} w_{1k} u_{2k})$. It is important to note that it was not necessary to have $x_{k(a)} = \emptyset$ in order to run a simulation to compare $\hat{V}_{sr}(\hat{Y}_{2Pa\,lin})$ and $\hat{V}_{cr}(\hat{Y}_{2Pa\,lin})$. However, we cannot compare $\hat{V}_{cr}(\hat{Y}_{2Pa\,lin})$ and $\hat{V}(\hat{Y}_{reg})$ unless we define an estimator $\hat{Y}_{2Pa}$ comparable to $\hat{Y}_{reg}$, and to achieve this we need $x_{k(a)} = \emptyset$, as noted in section 9.

We drew repeated sample pairs $(s_1, s)$, where $s_1$ is an SRS of $n_1$ units from $U$, and $s$ is an SRS of $n$ units from $s_1$. Here SRS stands for simple random sampling without replacement. We worked with different size combinations $(n_1, n)$: (4000, 3000), (4000, 2000), (4000, 1000), (3000, 2000), (3000, 1000) and (2000, 1000). If $n = n_1$, two-phase sampling is equivalent to one-phase sampling, and $\hat{V}_{sr}(\hat{Y}_{2Pa\,lin})$ and $\hat{V}_{cr}(\hat{Y}_{2Pa\,lin})$ are identical.

For each combination $(n_1, n)$, we realized 100,000 sample pairs $(s_1, s)$. Based on the data for each of these outcomes, we computed the separate residual variance estimator $\hat{V}_{sr}(\hat{Y}_{2Pa\,lin})$, the combined residual variance estimator $\hat{V}_{cr}(\hat{Y}_{2Pa\,lin})$ and the variance estimator $\hat{V}(\hat{Y}_{reg})$. For this purpose, we used the respective expressions that follow from (7.3), (8.3) and (9.6) when SRS is specified at each phase. To save space, these expressions are not shown here. We obtained 100,000 realized values for each of the three variance estimators. Figure 10.1 shows the distributions of the 100,000 $\hat{V}$-values for $n_1 = 4{,}000$ and $n = 2{,}000$.

The figure shows strikingly different distributions for $\hat{V}_{sr}(\hat{Y}_{2Pa\,lin})$ and $\hat{V}_{cr}(\hat{Y}_{2Pa\,lin})$. The distribution of the separate residual estimator $\hat{V}_{sr}(\hat{Y}_{2Pa\,lin})$ is much more concentrated. Thus $\hat{V}_{sr}(\hat{Y}_{2Pa\,lin})$ is more efficient than $\hat{V}_{cr}(\hat{Y}_{2Pa\,lin})$ and, on average, it produces considerably shorter confidence intervals. We also note that the distribution of $\hat{V}(\hat{Y}_{reg})$ is very similar to that of $\hat{V}_{cr}(\hat{Y}_{2Pa\,lin})$. This supports our analysis in section 9. Similar results were obtained for the other sample sizes in the simulation.
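The population and sampling steps of the simulation can be sketched in Python as follows. This is a minimal illustration under assumptions: the seed is arbitrary (the paper does not report one), so the generated totals will not reproduce the published values of 358,205 and 39,611.8, and the helper name `draw_pair` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2009)      # arbitrary seed, not from the paper
N = 5000

# Step 1: auxiliary values, u1k ~ 2 Gamma(4) and u2k ~ 3 Gamma(6)
u1 = 2.0 * rng.gamma(shape=4.0, scale=1.0, size=N)
u2 = 3.0 * rng.gamma(shape=6.0, scale=1.0, size=N)

# Step 2: study variable, y_k = 10 + u1k + 3 u2k + eps_k, eps_k ~ 5 Normal(0, 1)
eps = 5.0 * rng.standard_normal(N)
y = 10.0 + u1 + 3.0 * u2 + eps

def draw_pair(n1, n):
    """Draw one sample pair: s1 by SRS from U, then s by SRS from s1."""
    s1 = rng.choice(N, size=n1, replace=False)
    s = rng.choice(s1, size=n, replace=False)
    return s1, s

s1, s = draw_pair(4000, 2000)
print(y.sum(), u1.sum(), len(s1), len(s))
```

Repeating `draw_pair` many times and computing the three variance estimators on each pair would give empirical distributions of the kind summarized in Figure 10.1.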
Figure 10.1 Distribution of 100,000 realized values for $\hat{V}_{sr}(\hat{Y}_{2Pa\,lin})$, $\hat{V}_{cr}(\hat{Y}_{2Pa\,lin})$ and $\hat{V}(\hat{Y}_{reg})$
[Three histogram panels, one per variance estimator. Vertical axis: Frequency, 0 to 11,000. Horizontal axis of each panel: 750,000 to 825,000.]
Table 10.2
Simulation variance for the combined residual variance estimator $\hat{V}_{cr}(\hat{Y}_{2Pa\,lin})$

                      n
n1          3,000       2,000       1,000
4,000      153.22      364.08    1,290.41
3,000                2,449.05    6,855.69
2,000                           33,220.88

Note: Actual values are the displayed values times $10^6$.

Table 10.3
Simulation variance for the variance estimator $\hat{V}(\hat{Y}_{reg})$

                      n
n1          3,000       2,000       1,000
4,000      153.25      364.14    1,289.79
3,000                2,449.36    6,854.52
2,000                           33,210.31

Note: Actual values are the displayed values times $10^6$.

Table 10.4
Ratio of entries in Table 10.1 to corresponding entries in Table 10.2

                      n
n1          3,000       2,000       1,000
4,000        0.42        0.26        0.38
3,000                    0.48        0.26
2,000                                0.42

References

Axelson, M. (1998). Variance estimation for the generalised regression estimator under two-phase sampling - a modified approach. Proceedings of the Section on Survey Research Methods, American Statistical Association, 85-89.

Deville, J.-C., and Särndal, C.-E. (1992). Calibration estimators in survey sampling. Journal of the American Statistical Association, 87, 376-382.

Deville, J.-C. (2002). La correction de la nonréponse par calage généralisé. Actes des Journées de Méthodologie, I.N.S.E.E., Paris.

Dupont, F. (1995). Alternative adjustments when there are several levels of auxiliary information. Survey Methodology, 21, 125-136.

Estevao, V.M., and Särndal, C.-E. (2002). The ten cases of auxiliary information for calibration in two phase sampling. Journal of Official Statistics, 18, 233-255.

Hidiroglou, M.A. (2001). Double sampling. Survey Methodology, 27, 143-154.

Hidiroglou, M.A., and Särndal, C.-E. (1998). Use of auxiliary information for two-phase sampling. Survey Methodology, 24, 11-20.

Hidiroglou, M.A., Rao, J.N.K. and Haziza, D. (2006). Variance estimation in two phase sampling. (Accepted paper to appear in) Australian and New Zealand Journal of Statistics.

Kott, P.S., and Stukel, D.M. (1997). Can the jackknife be used with a two-phase sample? Survey Methodology, 23, 81-90.

Särndal, C.-E., and Swensson, B. (1987). A general view of estimation for two phases of selection with applications to two-phase sampling and nonresponse. International Statistical Review, 55, 279-294.

Särndal, C.-E., Swensson, B. and Wretman, J. (1992). Model Assisted Survey Sampling. New York: Springer-Verlag.

Sitter, R.R. (1997). Variance estimation for the regression estimator in two-phase sampling. Journal of the American Statistical Association, 92, 780-787.