Research Article
Identification of Multiple Outliers in a Generalized Linear Model with Continuous Variables
Loo Yee Peng,¹ Habshah Midi,¹,² Sohel Rana,¹,² and Anwar Fitrianto¹,²
¹Department of Mathematics, Faculty of Science, Universiti Putra Malaysia, 43400 Serdang, Selangor, Malaysia
²Laboratory of Applied and Computational Statistics, Institute for Mathematical Research, Universiti Putra Malaysia, 43400 Serdang, Selangor, Malaysia
Copyright © 2016 Loo Yee Peng et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
In the statistical analysis of data, a model may fit poorly in the presence of outliers. It is well established that residuals can be used to identify outliers, and the asymptotic properties of residuals can be exploited to build diagnostic tools. However, it is now evident that most existing diagnostic methods fail to identify multiple outliers. Therefore, this paper proposes a diagnostic method for the identification of multiple outliers in GLM, where traditionally used outlier detection methods are ineffective because they suffer from masking or swamping. An investigation was carried out to determine the capability of the proposed GSCPR method. The findings from the numerical examples indicate that the performance of the proposed method is satisfactory for the identification of multiple outliers. In the simulation study, two scenarios were considered to assess the validity of the proposed method. The proposed method consistently displayed a higher percentage of correct detection, as well as lower rates of swamping and masking, regardless of the sample size and the contamination level.
is swamping, where the inliers may resemble the outliers. Thus, the outlier detection methods in GLM suffer from masking or swamping effects.

In addition, several published papers have considered the detection of multiple outliers. For example, Lee and Fung [10] proposed an adding-back procedure with an initial high breakdown point for the detection of multiple outliers in GLM and nonlinear regression. Imon and Hadi [9], on the other hand, proposed a generalized version of standardized Pearson residuals based on a group deletion method (GSPR) to overcome the difficulty of detecting multiple outliers in logistic regression.

These works motivated the researchers to modify the corrected Pearson residuals [7] to handle problems related to multiple outliers. Hence, this paper presents the development of a method for the identification of multiple outliers that employs corrected Pearson residuals based on group deletion, called Generalized Standardized Corrected Pearson Residuals (GSCPR). A few frequently used residual-based diagnostics for the identification of outliers are briefly discussed in Section 2. In Section 3, the proposed GSCPR method is described. The usefulness of the proposed method is then examined in Section 4 via a real data set, and finally, Section 5 reports the Monte Carlo simulations.

2. Residuals in GLM

As mentioned earlier, it is often very important to be aware of the existence of outliers in a data set. Outliers can strongly influence the covariate pattern, and hence their presence may mislead the interpretation of the statistical analysis. The deviance residuals and the Pearson residuals are the two common types of residuals in GLM; in this paper, the focus is only on the Pearson residuals. Hilbe and Robinson [11] pointed out that, when Pearson or deviance residuals are normalized to a standard deviation of 1.0, the deviance residual tends to be closer to normality than the Pearson residual. Although the deviance residual was the favored statistic in GLM at the time that study was conducted, more tests developed from the Pearson residuals are essential. The Pearson residuals are defined as $r_i = (y_i - \hat{\pi}_i)/\sqrt{v_i}$, $i = 1, 2, \ldots, n$, where $\hat{\pi}_i$ is the $i$th fitted value and $v_i$ is the variance function. The standardized Pearson residuals are defined as $r_{si} = (y_i - \hat{\pi}_i)/\sqrt{v_i(1 - h_i)}$, $i = 1, 2, \ldots, n$, where the hat value $h_i$ for GLM is the $i$th diagonal element of the $n \times n$ matrix $H = V^{1/2}X(X^T V X)^{-1}X^T V^{1/2}$, $V$ is a diagonal matrix with diagonal elements $v_i$, and $X_i^T = [1, x_{1i}, x_{2i}, \ldots, x_{pi}]$ is the $1 \times k$ vector of observations corresponding to the $i$th case. An observation is declared a residual outlier when its corresponding $r_i$ or $r_{si}$ exceeds a quantity $c$ in absolute value. A popular and well-reasoned choice for $c$ is 3, as it matches the three-sigma distance rule of normal theory [12]; a constant between 3 and 5 is considered if the cutoff value 3 identifies too many observations as outliers [13].

In general, the distribution of the Pearson residuals deviates from its true distribution by terms of order $n^{-1}$; the mean and the variance of the true distribution are zero and one, respectively. Cordeiro [6] claimed that the adjusted Pearson residuals have a distribution closer to the standard normal than the corresponding unmodified residuals. The adjusted Pearson residual corrects the residuals to have equal mean and variance, but its distribution is not equivalent to the distribution of the true Pearson residuals to order $n^{-1}$. Hence, the corrected Pearson residuals were proposed to remedy this problem. According to Cordeiro and Simas [7], it is essential to distinguish between the $\phi$-asymptotics and the $n$-asymptotics in GLM. The $n$-asymptotics are applied in this method; that is, the dispersion parameter $\phi$ is fixed, whereas the number of observations $n$ becomes large. This asymptotic theory is chiefly concerned with estimation and hypothesis testing regarding the unknown parameter $\beta$. Nonetheless, these methods are restricted to data sets with continuous distributions, such as the normal, gamma, and inverse Gaussian distributions. The corrected Pearson residuals for continuous GLMs are defined as $r_i' = r_i + \rho_i(r_i)$, where $r_i$ is the Pearson residual and $\rho_i(\cdot)$ is a function of order $O(n^{-1})$ constructed so that the residual $r_i'$ has the same distribution as $\varepsilon_i$ to order $n^{-1}$. The correction function is

$$\rho_i(x) = e_i(x)\left\{-\frac{1}{2\phi} V_i^{-1} V_i' \mu_i' z_{ii} - B(\hat{\eta}_i) - w_i^{1/2} z_{ii}\, x\right\} - \frac{z_{ii}}{2\phi}\, h_i(x) + \frac{z_{ii}}{2\phi}\, e_i(x)^2 \left\{\phi \sqrt{V_i}\, q(\mu_i) + \frac{d}{dx}\, c\big(\sqrt{V_i}\, x + \mu_i, \phi\big)\right\}. \tag{1}$$

Furthermore, the term $\phi^{-1} z_{ii}$ in (1) equals $\mathrm{Var}(\hat{\eta}_i)$. The expression is easily implemented for any continuous model, since only $e_i(x)$, $h_i(x)$, and $(d/dx)\,c(\sqrt{V_i}\,x + \mu_i, \phi)$ need to be calculated. The corrections $\rho_i(\cdot)$ for some important GLMs are given by Cordeiro and Simas [7], as are the derivatives $\mu'$ and $\mu''$ for several useful link functions and the quantities $q(\mu)$, $V$, $w$, and $(d/dx)\,c(\sqrt{V_i}\,x + \mu_i, \phi)$ for various distributions. For normal models, $V_i = 1$, $w_i = \mu_i'^2$, $c(x, \phi) = -\{x^2\phi + \log(2\pi/\phi)\}/2$, $e_i(x) = -\mu_i'$, and $h_i(x) = -\mu_i''$, which give $\rho_i(x) = B(\hat{\eta}_i)\mu_i' + \mu_i'' z_{ii}/(2\phi) + (\mu_i'^2 z_{ii}/2)\,x$. For gamma models, $V_i = \mu_i^2$, $w_i = \mu_i^{-2}\mu_i'^2$, $c(x, \phi) = (\phi - 1)\log(x) + \phi\log(\phi) - \log\Gamma(\phi)$, and $(d/dx)\,c(\mu x + \mu, \phi) = (\phi - 1)/(1 + x)$, with $e_i(x) = -\mu_i^{-1}\mu_i' - \mu_i^{-1}\mu_i' x$ and $h_i(x) = -\mu_i^{-1}\mu_i'' + 2\mu_i^{-2}\mu_i'^2 - \mu_i^{-1}\mu_i'' x + 2\mu_i^{-2}\mu_i'^2 x$.
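To fix ideas, the residual definitions above can be sketched in a few lines of NumPy. The fragment below is our own illustrative sketch, not code from the paper: it takes the simplest case, the normal model with identity link (where $V_i = 1$), computes Pearson residuals, hat values, and standardized Pearson residuals, and applies the cutoff $c = 3$. The toy design, seed, and planted shift are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy normal-model data (illustrative, not the drill experiment):
# intercept plus one covariate, with one planted residual outlier.
n = 20
x = np.linspace(0.0, 1.0, n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=n)
y[5] += 10.0  # planted outlier at index 5

# Normal GLM (identity link): fitted values from least squares; v_i = 1.
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
mu_hat = X @ beta_hat
v = np.ones(n)  # variance function for the normal model

# Pearson residuals r_i = (y_i - mu_hat_i) / sqrt(v_i)
r = (y - mu_hat) / np.sqrt(v)

# Hat values: diagonal of H = V^{1/2} X (X^T V X)^{-1} X^T V^{1/2}; V = I here.
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

# Standardized Pearson residuals with the three-sigma cutoff c = 3
r_s = (y - mu_hat) / np.sqrt(v * (1.0 - h))
outliers = np.where(np.abs(r_s) > 3.0)[0]
print(outliers)  # the planted observation is flagged
```

With a single gross outlier and a small, well-spread design, only the planted observation exceeds the cutoff; with several outliers this naive rule can mask or swamp, which is the motivation for the group-deletion approach below.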
Mathematical Problems in Engineering 3
For gamma models, the correction function becomes

$$\rho_i(x) = (1 + x)\left(\mu_i^{-1}\mu_i' B(\hat{\eta}_i) + \frac{\mu_i^{-1}\mu_i' z_{ii}}{2\phi} - \frac{\mu_i^{-2}\mu_i'^2 z_{ii}}{2\phi} + \frac{\mu_i^{-2}\mu_i'^2 z_{ii}}{2}\, x\right). \tag{2}$$

The corrected Pearson residuals based on the deletion set are then $\hat{r}_i^{(-D)} + \rho_i^{(-D)}(\hat{r}_i^{(-D)})$, $i = 1, 2, \ldots, n$, where $\hat{r}_i^{(-D)} = (y_i - \hat{\pi}_i^{(-D)})/\sqrt{v_i^{(-D)}}$, $\hat{\pi}_i^{(-D)}$ is the $i$th fitted value with the set $D$ deleted, and $\rho_i^{(-D)}(\cdot)$ is a function of order $O(n^{-1})$. The respective deletion variances $v_i^{(-D)}$ (for the gamma distribution) and deletion leverages $h_i^{(-D)}$ for the overall data set are defined analogously, computed from the fit with the observations in $D$ deleted.
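The group-deletion idea can be sketched numerically. The following fragment is our own illustration (all names are assumptions, and it uses the normal model with $v_i = 1$ rather than the paper's gamma implementation): it refits the model with a suspect set $D$ excluded and evaluates the deletion Pearson residuals of the deleted points against the cutoff 3.

```python
import numpy as np

def deletion_pearson_residuals(X, y, D):
    """Refit with the rows in D deleted and return the Pearson residuals
    of the deleted points under the reduced fit (normal model, v_i = 1)."""
    R = np.setdiff1d(np.arange(len(y)), D)           # remaining set R
    beta_R = np.linalg.lstsq(X[R], y[R], rcond=None)[0]
    mu_D = X[D] @ beta_R                             # fitted values for D from the reduced fit
    return (y[D] - mu_D) / np.sqrt(np.ones(len(D)))  # v_i^{(-D)} = 1 here

# toy data with two gross outliers
rng = np.random.default_rng(3)
n = 30
x = np.linspace(0.0, 1.0, n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=n)
y[[3, 17]] += 8.0

D = np.array([3, 17])                    # initial suspect set
r_del = deletion_pearson_residuals(X, y, D)
confirmed = D[np.abs(r_del) > 3.0]       # deleted points that remain extreme
```

Because the suspect points do not distort the reduced fit, their deletion residuals stay close to the planted shift, which is how group deletion avoids the masking seen with full-sample residuals.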
consideration to be assigned to the deleted set $D$. The rest of the points are assigned to the $R$ set.

Step 3. Calculate the $r_{\mathrm{GSCPR}_i}^{(-D)}$ values based on the $D$ and $R$ sets decided in Step 2.

Step 4. Any deleted point whose $r_{\mathrm{GSCPR}_i}^{(-D)}$ exceeds the value 3 in absolute terms is finalized and declared an outlier.

The choice of deleted observations plays a pivotal role in the method, as the exclusion of this group determines the residuals resulting from the $D$ set and the $R$ set. All suspected outliers are included in the initial deletion set, since the entire set of GSCPR values would be faulty if any outlier were left in the $R$ set. In logistic regression, Imon and Hadi [9] suggested a diagnostic-robust approach, which employs graphical or robust procedures to identify suspect outliers at an early stage. The next stage applies diagnostic tools to the resulting residuals so as to assign any inliers that were incorrectly identified as outliers at the early stage back into the estimation subset. In this study, corrected Pearson residuals were applied to detect the suspected outliers in the early stage, and the suspected points were declared outliers if every member of the set $D$ fulfilled the rule given in Step 4. Otherwise, the observations were placed back in the $R$ set. The deletion set $D$ was repeatedly inspected by recalculating the GSCPR until every member of the final deletion set individually fulfilled the rule outlined in Step 4. The members of this final set $D$ were eventually declared outliers.

4. Example Using Real Data Set

A real data set is used to illustrate the capability of the newly proposed tool for the identification of multiple outliers. The drill experiment comprised a $2^4$ unreplicated factorial designed to investigate the response variable advance rate. This example was reanalyzed using a generalized linear model by Lewis and Montgomery [14]. For the GLM analysis, the gamma distribution was selected with a log link function. Table 1 displays the design matrix and the response data.

Table 1: Drill experiment data for design matrix and response data.

Run   X1   X2   X3   X4   Y
 1    −1   −1   −1   −1    1.68
 2     1   −1   −1   −1    1.98
 3    −1    1   −1   −1    3.28
 4     1    1   −1   −1    3.44
 5    −1   −1    1   −1    4.98
 6     1   −1    1   −1    5.70
 7    −1    1    1   −1    9.97
 8     1    1    1   −1    9.07
 9    −1   −1   −1    1    2.07
10     1   −1   −1    1    2.44
11    −1    1   −1    1    4.09
12     1    1   −1    1    4.53
13    −1   −1    1    1    7.77 (27.77)
14     1   −1    1    1    9.43
15    −1    1    1    1   11.75
16     1    1    1    1   16.30 (36.30)

In order to observe the effect of outliers, the original response data were modified. For the single-outlier case, the 16th observation was replaced with a value of 36.30 instead of 16.30. For the case of two outliers, the 13th and 16th observations were replaced with the values 27.77 and 36.30, respectively. The results for the Pearson residuals, corrected Pearson residuals, and GSCPR measures are shown in Table 2.

Table 2: Outlier measures based on GLM for the drill experiment data with single outlier and two outliers.

Table 2 and the index plots of Pearson residuals, given in Figures 1(a) and 1(b), show that the Pearson residuals method failed to identify the outliers. For the data set with one outlier, as observed in Figures 1(c) and 1(e), the values of both the corrected Pearson residuals and the GSCPR for observation 16 are remarkably high, so it is revealed as an outlier. For the data set with two outliers, as depicted in Figure 1(d), the corrected Pearson residuals method correctly identified observations 13 and 16 as outliers but swamped a good observation (observation 15) as an outlier. Figure 1(f) shows that the proposed GSCPR correctly identified observations 13 and 16 as the outliers in the data set.

5. Simulation Results

In this section, simulation studies based on $2^k$ factorial design data and explanatory-variable data were conducted to verify the conclusion from the numerical data set that the proposed method is indeed capable of correctly identifying multiple outliers. The Pearson residuals, corrected Pearson residuals, and GSCPR measures are reported. Only the model with the gamma distribution and log link function is considered, and two different scenarios were examined. In the first scenario (Scenario 1), a $2^k$ factorial design with $k = 2$ and replications giving 8, 16, 32, and 64 runs was applied. For each experimental design condition, a model was set up with known parameters and a known true mean, as suggested by Lewis et al. [15]. The true linear predictor was created as $\eta_i = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_m x_m$ and was taken as $\eta_i = 0.5 + x_1 - x_2$ with $\phi = 4$ for the $2^2$ factorial design [7]. The true mean was defined as $\mu_i = g^{-1}(\eta_i)$, where $g$ is the link function and $i = 1, 2, \ldots, n$. The actual observations were acquired by attaching an error drawn at random from a specified distribution to the linear predictor; that is, $y_i = g^{-1}(\eta_i) + \varepsilon_i$, $i = 1, 2, \ldots, n$, where the residuals $\varepsilon_i$ were drawn from a gamma distribution with shape equal to 0.1 and scale equal to one. To generate the 8-run design matrix, two replicates of the $2^2$ factorial design were used; four replicates gave the 16-run design matrix; and so on. The outlying observations were generated from the model $y_i = g^{-1}(\eta_i) + y_{\mathrm{shift}} + \varepsilon_i$, where $i = 1, 2, \ldots, n$
and $y_{\mathrm{shift}}$ represents the distance by which the outliers were placed away from the good observations. In this study, $y_{\mathrm{shift}}$ was taken as 10. The residual outliers were created in the first observations of the data set, their count being the total number of outlying observations. The 5,000 replications of the simulation study are summarized in Table 3, which presents the percentages of correct detection of outliers, masking rates, and swamping rates for the $2^2$ factorial design. In the second scenario (Scenario 2), a simulation study was designed for a set of explanatory-variable data with two explanatory variables. The true linear predictor was taken as $\eta_i = 0.5 + x_1 - x_2$ with $\phi = 4$, following Cordeiro and Simas [7]. The true mean and the actual observations were obtained as described in Scenario 1, and the explanatory variables were generated as Uniform(0, 1). The sample sizes $n$ were taken as 20, 40, 60, 100, and 200, with contamination levels $\alpha = 0.05$, 0.10, 0.15, and 0.20 and outliers of varying weights. The first $100\alpha\%$ of the observations were constructed as outliers in the data set. To induce outlying values of varying weights, the first outlier point was left unchanged at the value 10, whereas each successive value was increased by two. The results of the 5,000 replications are summarized in Table 4.

For Scenario 1, Table 3 clearly shows that the Pearson residuals method is excellent at detecting a single outlier only when the experimental run size is large, but its performance becomes extremely poor for multiple-outlier detection. The proposed GSCPR method, by contrast, consistently displayed a higher rate of outlier detection with almost negligible masking and swamping rates, regardless of the experimental run size and the number of outliers present. The performance of the corrected Pearson residuals method was also encouraging, but it did not outperform the proposed method; moreover, its rate of correct detection became lower as the number of outliers increased.

Table 4 portrays the superiority of the proposed method for Scenario 2. The Pearson residuals method is excellent only when the contamination level is low, but it exhibits extremely poor performance when the contamination level exceeds 5% of cases. In general, it produces a lower percentage of correct detection, high masking, and low swamping. The GSCPR method, which outperformed the other methods, showed a higher rate of outlier detection and almost negligible masking rates regardless of the sample size $n$ and the contamination level $\alpha$. The method does have a tendency to swamp inliers as outliers when the contamination level is low, such as at 5% and 10% of cases. The corrected Pearson residuals method performed better than the Pearson residuals method, since its rate of outlier detection was higher when the contamination level was low. Although the swamping effects of the corrected Pearson residuals method were low, its masking effects increased as the contamination level increased. Hence, the outcomes of the study indicate that the GSCPR method gives the best performance, followed by the corrected Pearson residuals method and then the Pearson residuals method.

6. Conclusion

This paper proposed a diagnostic method for the identification of multiple outliers in GLM, where traditionally used outlier detection methods are ineffective because they suffer from masking or swamping. Thus, an investigation was conducted to determine the capability of the proposed
Table 3: Percentages of correct detection of outlier, masking rates, and swamping rates based on 5,000 simulations for Scenario 1. Within each block, the three columns are Pearson residuals / corrected Pearson residuals / GSCPR.

Outliers  Run |  % Correct detection   |       % Masking        |      % Swamping
   1       8  |   0.00   99.24   98.94 | 100.00    0.76    1.06 |  0.00  80.02  3.42
   1      16  |   0.18   99.98   99.84 |  99.82    0.02    0.16 |  0.00   1.54  3.22
   1      32  |  85.64  100.00  100.00 |  14.36    0.00    0.00 |  0.32  11.00  8.72
   1      64  | 100.00  100.00  100.00 |   0.00    0.00    0.00 | 25.52  26.78  5.92
   2       8  |   0.00   97.86   91.28 | 100.00    2.14    8.72 |  0.00  12.22  4.52
   2      16  |   0.00   26.64   99.90 | 100.00   73.36    0.10 |  0.00   1.84  2.28
   2      32  |   0.10   18.74  100.00 |  99.90   81.26    0.00 |  0.00   8.94  6.24
   2      64  |  26.02   27.50  100.00 |  73.98   72.50    0.00 |  0.92   1.26  5.26
   3       8  |   0.00    0.20   91.02 | 100.00   99.80    8.98 |  0.00   0.02  6.74
   3      16  |   0.00    0.36   97.88 | 100.00   99.64    2.12 |  0.00   0.00  4.96
   3      32  |   0.00   86.00   99.62 | 100.00   14.00    0.38 |  0.00   0.00  2.70
   3      64  |   0.00    5.70  100.00 | 100.00   94.28    0.00 |  0.00   0.42  5.02
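As a concrete illustration of the Scenario 1 data-generating process summarized in Table 3, the following sketch (our own illustrative code; the seed and variable names are assumptions) builds a replicated $2^2$ design, attaches gamma errors, and shifts the designated outlying responses by $y_{\mathrm{shift}} = 10$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Replicated 2^2 factorial design: two replicates give the 8-run matrix.
base = np.array([[-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0], [1.0, 1.0]])
X = np.vstack([base, base])

# True linear predictor eta = 0.5 + x1 - x2 with a log link, so mu = exp(eta).
eta = 0.5 + X[:, 0] - X[:, 1]
mu = np.exp(eta)

# y_i = g^{-1}(eta_i) + eps_i with eps_i ~ Gamma(shape = 0.1, scale = 1).
eps = rng.gamma(shape=0.1, scale=1.0, size=len(mu))
y = mu + eps

# Outlying observations: shift the first n_out responses by y_shift = 10.
n_out, y_shift = 1, 10.0
y[:n_out] += y_shift
```

Repeating this draw 5,000 times and applying the three detection rules is, in outline, how the rates reported in Table 3 are obtained.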
Table 4: Percentages of correct detection of outlier, masking rates, and swamping rates based on 5,000 simulations for Scenario 2. Within each block, the three columns are Pearson residuals / corrected Pearson residuals / GSCPR.

α%    n   Outliers |  % Correct detection   |       % Masking        |     % Swamping
 5    20      1    | 100.00  100.00  100.00 |   0.00    0.00    0.00 | 0.00  0.12  39.16
 5    40      2    |  88.56  100.00  100.00 |  11.44    0.00    0.00 | 0.00  0.08  40.66
 5    60      3    |  94.68  100.00  100.00 |   5.32    0.00    0.00 | 0.04  0.30  26.44
 5   100      5    | 100.00  100.00  100.00 |   0.00    0.00    0.00 | 0.16  0.46  32.69
 5   200     10    | 100.00  100.00  100.00 |   0.00    0.00    0.00 | 0.12  0.18  39.20
10    20      2    |   0.02  100.00  100.00 |  99.98    0.00    0.00 | 0.00  0.08  32.66
10    40      4    |   0.00   99.80  100.00 | 100.00    0.20    0.00 | 0.00  0.02  15.69
10    60      6    |   0.00    0.10  100.00 | 100.00   99.90    0.00 | 0.00  0.00  32.36
10   100     10    |   0.00   95.98  100.00 | 100.00    4.02    0.00 | 0.00  0.00  38.74
10   200     20    |   0.00    0.00  100.00 | 100.00  100.00    0.00 | 0.00  0.00  27.62
15    20      3    |   0.00    1.92   40.98 | 100.00   98.08   59.02 | 0.00  0.04   5.08
15    40      6    |   0.00   20.98   86.62 | 100.00   79.02   13.38 | 0.00  0.02  24.04
15    60      9    |   0.00    0.00  100.00 | 100.00  100.00    0.00 | 0.00  0.00   0.18
15   100     15    |   0.00    0.00  100.00 | 100.00  100.00    0.00 | 0.00  0.00   4.28
15   200     30    |   0.00    0.00  100.00 | 100.00  100.00    1.32 | 0.00  0.00  14.40
20    20      4    |   0.00    2.26   98.68 | 100.00   97.74    0.00 | 0.00  0.00  10.26
20    40      8    |   0.00    4.66  100.00 | 100.00   95.34    0.00 | 0.00  0.00   4.74
20    60     12    |   0.00    0.00  100.00 | 100.00  100.00    0.00 | 0.00  0.00   5.46
20   100     20    |   0.00    0.00  100.00 | 100.00  100.00    0.00 | 0.00  0.00   1.60
20   200     40    |   0.00    0.00  100.00 | 100.00  100.00    0.00 | 0.00  0.00   1.20
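The Scenario 2 contamination scheme behind Table 4 can likewise be sketched in a few lines (again our own illustrative code under the stated assumptions): uniform covariates, gamma errors, and varying-weight shifts in which the first outlier stays at 10 and each successive shift grows by two.

```python
import numpy as np

rng = np.random.default_rng(2)

n, alpha = 40, 0.10
n_out = int(round(alpha * n))   # first 100*alpha% of observations become outliers

# Two explanatory variables generated as Uniform(0, 1).
x1 = rng.uniform(size=n)
x2 = rng.uniform(size=n)

# eta = 0.5 + x1 - x2 with a log link; gamma errors as in Scenario 1.
mu = np.exp(0.5 + x1 - x2)
y = mu + rng.gamma(shape=0.1, scale=1.0, size=n)

# Varying-weight contamination: shifts of 10, 12, 14, ... on the first n_out points.
shifts = 10.0 + 2.0 * np.arange(n_out)
y[:n_out] += shifts
```

Because the later outliers sit progressively farther from the bulk, any residual masking shows up in the detection rates of the smaller shifts first, which is visible in the corrected Pearson residuals columns of Table 4.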
[Figure 1 omitted: six index plots over observation index 1-16.]

Figure 1: Index plot of (a) Pearson residuals for one outlier, (b) Pearson residuals for two outliers, (c) corrected Pearson residuals for one outlier, (d) corrected Pearson residuals for two outliers, (e) GSCPR for one outlier, and (f) GSCPR for two outliers.