

Publisher: Taylor & Francis
Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered
office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK
Scandinavian Actuarial Journal

Publication details, including instructions for authors and
subscription information:
http://www.tandfonline.com/loi/sact20
Modelling critical illness claim diagnosis rates I: methodology
Erengul Ozkok (a), George Streftaris (b), Howard R. Waters (b) & A. David Wilkie (b)
(a) Department of Actuarial Sciences, Hacettepe University, Ankara, Turkey
(b) Department of Actuarial Mathematics and Statistics, Heriot-Watt University, Edinburgh, UK

Published online: 12 Dec 2012.

To cite this article: Erengul Ozkok , George Streftaris , Howard R. Waters & A. David Wilkie (2014)
Modelling critical illness claim diagnosis rates I: methodology, Scandinavian Actuarial Journal,
2014:5, 439-457, DOI: 10.1080/03461238.2012.728537
To link to this article: http://dx.doi.org/10.1080/03461238.2012.728537

Scandinavian Actuarial Journal, 2014
Vol. 2014, No. 5, 439–457, http://dx.doi.org/10.1080/03461238.2012.728537
Original Article
Modelling critical illness claim diagnosis rates I: methodology

ERENGUL OZKOK (a)*, GEORGE STREFTARIS (b), HOWARD R. WATERS (b) and
A. DAVID WILKIE (b)
(a) Department of Actuarial Sciences, Hacettepe University, Ankara, Turkey;
(b) Department of Actuarial Mathematics and Statistics, Heriot-Watt University, Edinburgh, UK
(Accepted September 2012)
In a series of two papers, this paper and the one by Ozkok et al. (Modelling critical illness claim
diagnosis rates II: results), we develop statistical models to be used as a framework for estimating, and
graduating, Critical Illness (CI) insurance diagnosis rates. We use UK data for 1999–2005 supplied by
the Continuous Mortality Investigation (CMI) to illustrate their use. In this paper, we set out the basic
methodology. In particular, we set out some models, we describe the data available to us and we discuss
the statistical distribution of estimators proposed for CI diagnosis inception rates. A feature of CI
insurance is the delay, on average about 6 months but in some cases much longer, between the
diagnosis of an illness and the settlement of the subsequent claim. Modelling this delay, the so-called
Claim Delay Distribution, is a necessary first step in the estimation of the claim diagnosis rates and
this is discussed in the present paper. In the subsequent paper, we derive and discuss diagnosis rates for
CI claims from all causes and also from specific causes.
Keywords: critical illness insurance; diagnosis rates; statistical models; Burr generalised linear-type
model; Claim Delay Distribution; Continuous Mortality Investigation
1. Introduction
Critical Illness (CI) insurance is a type of long-term insurance, typically secured by
regular premiums throughout the term of the policy, that provides a lump sum on the
diagnosis of one of a specified list of critical illnesses within the policy conditions. In the
UK, there are two types of CI policy: Full Accelerated (FA), which covers both CI and
death and Stand Alone (SA), which covers only CI. The former is far more popular than
the latter. CI coverage includes, but is not limited to, cancer, heart attack, stroke, coronary
artery by-pass graft (CABG), kidney failure (KF), major organ transplant (MOT) and
multiple sclerosis (MS). Most policies also include total and permanent disability (TPD)
for completeness, essentially to cover disability arising from other causes not covered
explicitly in the policy.
CI insurance has been very popular in the UK since it was introduced in the 1980s.
Around 700,000 new policies were sold in 1998 (Dinani et al. (2000)) and more than
1 million new policies were issued in 2002 (CMI WP 50 (2011)), many of them linked to
mortgage repayments. Further background information on CI insurance can be found in
CMI WP 50 (2011).
*Corresponding author. E-mail: eozkok@hacettepe.edu.tr
© 2012 Taylor & Francis
This is Paper I in a series of two papers; the other paper, Ozkok et al. (2012a), is
referred to as Paper II. Our objective in these papers is to set out, and to illustrate, a
methodology for the estimation and graduation of CI insurance claim diagnosis rates.
These rates are clearly needed for the calculation of premium rates and policy values and
also for monitoring CI experience. The CMI has published all causes (CMI WP 50 (2011))
and cause specific (CMI WP 52 (2011)) CI claim diagnosis rates using its own
methodology. Our methodology differs from that of the CMI mainly because we start
by specifying a statistical model. From this starting point we can:
(1) determine the statistical properties of our estimators, and,
(2) use modern statistical methodology to smooth our estimates, for example, by
specifying (Cox-type) regression models incorporating a range of covariates.
In Section 2, we introduce in very general terms the statistical models whose
parameterisation is discussed in Paper II. The data used to parameterise these models
are described in Section 3. This data set, supplied by the CMI, relates to policies in force,
and claims settled, in the UK in the seven calendar years 1999–2005. A feature of CI
insurance is the delay between diagnosis of an illness and the settlement of the subsequent
claim. Our estimation procedure requires the distribution of this delay to be modelled;
this is discussed in Section 4. In Section 5, we set out equations for point estimates of the
claim diagnosis rates, discuss the distribution of these estimates and how we can smooth
them using generalised linear (GL)-type models.
In Paper II, we illustrate our methodology by using it to derive numerical results for all
cause claim diagnosis rates and for cause specific claim diagnosis rates.
More details of most of the research reported in these papers can be found in Ozkok
(2011).
2. Models
Our most detailed model for CI insurance is a cause specific model which is represented in
Figure 1.
Points to note about the model represented in Figure 1 are:
(1) Healthy indicates that the individual has not yet been diagnosed with a CI or died.
(2) An individual exits the Healthy state on death or on the diagnosis of a CI, as specified
in the policy conditions.
(3) The model is specified in terms of transition intensities, labelled $\lambda^{j}_{x,\theta}$ and $\lambda^{D}_{x,\theta}$.
Transition intensities are analogous to the force of mortality and there are good
reasons for specifying the model in this way. See, for example, Waters (1984).
Figure 1. A cause specific model for critical illness insurance.
(4) A transition from Healthy to Dead means death before the diagnosis of a CI, so that,
numerically, we might expect $\lambda^{D}_{x,\theta}$ to be different from, possibly lower than, the total
force of mortality for a corresponding set of individuals.
(5) The transition intensities, $\lambda^{j}_{x,\theta}$ and $\lambda^{D}_{x,\theta}$, depend on the cause, on the current age of
the individual, x, and also on a set of other covariates, labelled $\theta$. These covariates are
the important characteristics of the individual and/or the policy which affect the
likelihood of the diagnosis of a CI or death; for example, Sex, Benefit amount, Office.
The set of covariates cannot include any characteristics which are not recorded in our
data and a major part of the statistical modelling is to determine which of those
characteristics recorded in our data are important and hence should be included in $\theta$.
(6) This model can be used for both FA and SA policies. Each transition intensity would
be estimated separately and data from FA policies only would be used to estimate $\lambda^{D}_{x,\theta}$.
Figure 2 represents a simpler, all causes, model for CI insurance. We could use the model
in Figure 2 to model FA and SA policies separately. In this case, a transition from Healthy
to Insured event means:
(1) diagnosis with a CI or death before diagnosis with a CI (FA policies), or,
(2) diagnosis with a CI (SA policies).
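In this case, $\lambda_{x,\theta}$ in Figure 2 corresponds in terms of the model in Figure 1 to:
$$\sum_{j=1}^{n} \lambda^{j}_{x,\theta} + \lambda^{D}_{x,\theta} \quad \text{for FA policies, and}$$
$$\sum_{j=1}^{n} \lambda^{j}_{x,\theta} \quad \text{for SA policies.}$$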
Figure 2. An all causes model for critical illness insurance.
Figure 3. An all CI causes and death model for critical illness insurance (Ozkok et al. (2012a)).
Alternatively, we could use the model in Figure 2 to model FA and SA policies together.
In this case, Insured event has different meanings for FA and SA policies: for FA policies
it would include death before diagnosis with a CI whereas for SA policies it would not.
We might reasonably expect that Benefit type, FA or SA, would be an important covariate
in this model and that, other things being equal, the total claim rate for FA policies
would be higher than for SA policies.
A more satisfactory model, in terms of Benefit type, FA or SA, is illustrated in Figure 3.
The transition intensities $\lambda^{CI}_{x,\theta}$ and $\lambda^{D}_{x,\theta}$ in Figure 3 correspond to $\sum_{j=1}^{n} \lambda^{j}_{x,\theta}$ and $\lambda^{D}_{x,\theta}$ in
Figure 1. For FA policies, a transition to either of the two exit states would result in a
claim. For SA policies only a transition to Diagnosed with a CI would result in a claim;
death before diagnosis with a CI would terminate the policy.
We discuss the parameterisation of the models represented in Figures 1–3 in Paper II.
3. Data
3.1. Covariates
We were provided by the CMI with a set of CI data relating to UK policies in the seven
calendar years from 1999 to 2005. The data consisted of records of policies in force at the
start and at the end of each of the seven years and details of claims settled within the seven
years. The covariates included in each data record were as follows:
(1) Sex.
(2) Smoker status: non-smoker or smoker.
(3) Benefit type: FA or SA.
(4) Office (coded anonymously).
(5) Policy type: joint or single life.
(6) Benefit amount in pounds.
(7) Date of birth.
(8) Date of commencement of the policy.
The original data set contained details of 27,244 claims. Data from some offices could not
be used because of problems associated with missing claims information. Data from these
offices, both in-force and claims, were removed from our analyses, leaving us with data
from a total of 13 offices consisting of 19,127 claims and approximately 18,000,000 policy
years of exposure.
An additional covariate included in our data was Sales channel, which took one of five
possible values: Bancassurer, Direct sales, IFA, Other or Unknown. There was a very close
association between Sales channel and Office: 6 of the 13 offices used only one sales
channel, a further three offices used just two, and one office classified all its data as sales
channel Unknown. We decided that it was unnecessary to include both Sales channel and
Office as possible covariates so we excluded the former from our analyses.
For Joint Life policies, both lives are included in the in-force data, but only one claim
can occur.
The presence of duplicate policies in the data would not affect point estimates of the
claim diagnosis rates, but would affect the standard deviations (SD) of these estimates.
This would distort the goodness of fit statistics, making the fit appear to be worse. No
attempt was made to remove duplicate policies from either our in-force or our claims files.
An investigation by the CMI of their 1999–2004 data indicated that this was not likely to
be a serious problem [see CMI WP 33 (2008, paragraphs 4.10–4.12)].
3.2. Cause of claim

For the claims records, we were also provided with information about the type of claim,
death or CI, and, in the latter case, the cause of claim. There were 55 different causes of
claim, including death, in our data set, of which many related to different sites for cancer.
Among the specified sites for cancer, the largest group was female breast cancer with 1838
claims. Since we have a significant amount of data for this cancer, it is tempting to analyse
it as a separate cause. However, cancer claims include 'site not specified', which has the
largest number of claims, 4363 out of a total 19,127. Since 'site not specified' includes
female breast cancer claims, as well as other cancers, we cannot reliably analyse female
breast cancer as a separate cause [see CMI WP 43 (2010, p. 37)].
We grouped the claims into 10 separate causes, including death, with cancer treated as a
single cause and the numerically minor causes, such as Motor Neurone Disease and
Angioplasty grouped together as Other causes. Table 1 shows the split of our claims data
by various factors, including cause.
The Association of British Insurers (ABI) has issued a series of reports on CI
insurance, starting in 1999, with the most recent (to date) being ABI (2011), designed to
clarify and standardise the definitions of illnesses covered by CI policies in the UK.
These definitions are accepted for most of the illnesses as a standard guide by the
insurance companies.
3.3. Missing dates

The natural sequence of events for a CI claim is:
Diagnosis: the illness is diagnosed, or death occurs, then,
Notification: a claim is notified to the insurer, then,
Admission: the insurer admits the claim, and finally,
Settlement: the insurer settles the claim.

Table 1. Number of claims and percentages by various factors.
Factor              Level                             Number of claims (%)
Benefit type        Full accelerated                  16,875 (88.2%)
                    Stand alone                       2252 (11.8%)
Joint/single life   Joint life                        9743 (50.9%)
                    Single life                       9384 (49.1%)
Gender              Female                            8173 (42.7%)
                    Male                              10,954 (57.3%)
Smoker status       Non-smoker                        14,129 (73.9%)
                    Smoker                            4998 (26.1%)
Cause of claim      Coronary artery bypass graft      393 (2.1%)
                    Cancer                            9381 (49.0%)
                    Death                             3371 (17.6%)
                    Heart attack                      2220 (11.6%)
                    Kidney failure                    110 (0.6%)
                    Major organ transplant            36 (0.2%)
                    Multiple sclerosis                825 (4.3%)
                    Other                             1265 (6.6%)
                    Stroke                            1027 (5.4%)
                    Total and permanent disability    499 (2.6%)
Type of claim       Critical illness                  15,756 (82.4%)
                    Death                             3371 (17.6%)

For each claim, the CMI asks contributing offices to provide the date for each of these
four events. However, some claim records have one or more of these dates missing. Table 2
shows the mean delay (in days) between selected pairs of these dates and the percentage of
the 19,127 claims which have both dates recorded. Just over 9% of the claims records had
no date of diagnosis, but all of these claims had a date of settlement. We need the date of
diagnosis because diagnosis with a CI, rather than the settlement of the subsequent claim,
is the insured event (note that for all three models in Section 2, diagnosis with a CI
triggers the transition from Healthy) and the date of diagnosis determines important
characteristics of the claim, for example, the age of the insured life and the duration of the
policy when the diagnosis occurred. By modelling the time delay between the dates of
diagnosis and settlement of a claim, the so-called Claim Delay Distribution (CDD), we
can estimate missing dates of diagnosis by subtracting from the date of settlement the
median of the CDD. The sensitivity of our final results to the use of the median in this
context is discussed in Paper II.

Table 2. Average observed delays between dates of diagnosis, notification, admission and settlement (in days).
Interval                     Mean delay   Number of observations   Percentage of observations having both dates
Diagnosis to notification    93           15,585                   81
Notification to admission    80           9190                     48
Admission to settlement      18           9752                     51
Diagnosis to settlement      185          15,860                   83

All claim records had a year of settlement, but, in some cases, the exact date of
settlement was missing. For all these cases, the date of diagnosis was given and a date of
settlement was estimated using the median of the CDD.
The modelling of the CDD is discussed in Section 4 below [see also CMI WP 14
(2005)].
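The date-filling rule described above can be sketched in a few lines of Python. This is an illustration only, assuming (as in Section 4) that the CDD is a three-parameter Burr distribution with shape parameters alpha and tau and scale s, so that its distribution function is 1 - (1 + (u/s)^tau)^(-alpha); the function names and the parameter values in the usage note are hypothetical, not fitted values.

```python
import math
from datetime import date, timedelta


def burr_median(alpha, tau, s):
    """Median delay in days: solves 1 - (1 + (m/s)**tau)**(-alpha) = 0.5 for m."""
    return s * (2.0 ** (1.0 / alpha) - 1.0) ** (1.0 / tau)


def burr_cdf(u, alpha, tau, s):
    """Distribution function of the three-parameter Burr distribution."""
    return 1.0 - (1.0 + (u / s) ** tau) ** (-alpha)


def estimate_diagnosis_date(settlement, alpha, tau, s):
    """Missing date of diagnosis = date of settlement minus the CDD median."""
    return settlement - timedelta(days=round(burr_median(alpha, tau, s)))


def estimate_settlement_date(diagnosis, alpha, tau, s):
    """Missing date of settlement = date of diagnosis plus the CDD median."""
    return diagnosis + timedelta(days=round(burr_median(alpha, tau, s)))
```

For instance, `estimate_diagnosis_date(date(2004, 6, 30), alpha, tau, s)` would be called with the fitted shape parameters and the scale implied by the covariates of the claim in question.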
4. Modelling the CDD

4.1. Earlier work allowing for missing dates
In an earlier paper, Ozkok et al. (2012b), the authors described how to model the CDD
using the same data set being used in the present paper. The details are not repeated here.
Key points about this modelling exercise were:
(1) Various GL-type models were fitted to the data, with different error distributions,
most notably the lognormal and the three parameter Burr. The latter gave the best fit.
(2) We fitted three parameter Burr distributions, parameterised as follows:
$$f(u; \alpha, \tau, s) = \frac{\alpha \tau (u/s)^{\tau}}{u \left[1 + (u/s)^{\tau}\right]^{\alpha+1}} \qquad (1)$$
where f(.) is the probability density function of the CDD, $\alpha$ and $\tau$ are (positive) shape
parameters and s is a (positive) scale parameter. With this parametrisation, the kth
moment of the delay between diagnosis and settlement is:
$$\frac{s^{k}\, \Gamma(\alpha - k/\tau)\, \Gamma(1 + k/\tau)}{\Gamma(\alpha)} \qquad (2)$$
for $\alpha > k/\tau$, and $\infty$ otherwise.
Note that the CMI has also used Burr distributions to fit CDDs to CI data sets
(CMI WP 33 (2008)), but not in a GLM setting.
(3) The mean of our CDD is a loglinear function of a selected set of covariates denoted
by the vector $\theta$, so that:
$$E[X] = \exp(\beta \theta^{T}) \qquad (3)$$
where X denotes the delay and $\beta$ is a set of regression coefficients. The equation for
the mean given by Equation (3) is achieved by modelling the parameter s as follows:
$$s = \frac{\Gamma(\alpha) \exp(\beta \theta^{T})}{\Gamma(1 + 1/\tau)\, \Gamma(\alpha - 1/\tau)}.$$
The variance of the distribution is given by:
$$V[X] = s^{2}\, \frac{\Gamma(\alpha)\, \Gamma(\alpha - 2/\tau)\, \Gamma(1 + 2/\tau) - \Gamma(\alpha - 1/\tau)^{2}\, \Gamma(1 + 1/\tau)^{2}}{\Gamma(\alpha)^{2}} \qquad (4)$$
and the loglikelihood is given by:
$$n \log \alpha + n \log \tau + \alpha \tau \sum \log s + (\tau - 1) \sum \log d_i - (\alpha + 1) \sum \log\left(s^{\tau} + d_i^{\tau}\right) \qquad (5)$$
where $d_i$ is the delay for the ith claim and n is the number of claims.
(4) The parameters $\alpha$ and $\tau$ and the regression coefficients $\beta$ were modelled using
Bayesian techniques using the full data set consisting of 19,127 claims. Missing event
dates were treated as additional parameters and estimated using their posterior
predictive distributions. Truncation was used where appropriate, so that, for example,
a missing date of diagnosis could not be before the date of commencement of the
policy or after the date of notification of the claim. The Bayesian analysis results in a
posterior distribution for each parameter, and, in particular, for each missing date of
diagnosis. A point estimate of the missing date could be, for example, the median of
the posterior distribution.
(5) Gibbs variable selection was used to determine which covariates should be retained in
the models.
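As an illustration of Equations (1) to (5), the following Python sketch evaluates the Burr density, its moments, the scale s implied by a target mean, and the loglikelihood. It is a minimal sketch rather than the Bayesian fitting procedure used by the authors: the function names are hypothetical, and the loglikelihood takes one scale value per claim on the assumption that s varies with each claim's covariates.

```python
import math


def burr_pdf(u, alpha, tau, s):
    """Density of the three-parameter Burr distribution, Equation (1)."""
    z = (u / s) ** tau
    return alpha * tau * z / (u * (1.0 + z) ** (alpha + 1.0))


def burr_moment(k, alpha, tau, s):
    """k-th moment, Equation (2); infinite unless alpha > k / tau."""
    if alpha <= k / tau:
        return math.inf
    return (s ** k) * math.gamma(alpha - k / tau) * math.gamma(1.0 + k / tau) / math.gamma(alpha)


def scale_from_mean(mean, alpha, tau):
    """Scale s that reproduces a given mean delay (requires alpha > 1 / tau)."""
    return math.gamma(alpha) * mean / (math.gamma(1.0 + 1.0 / tau) * math.gamma(alpha - 1.0 / tau))


def burr_loglik(delays, scales, alpha, tau):
    """Loglikelihood, Equation (5), with one scale value per claim."""
    n = len(delays)
    total = n * math.log(alpha) + n * math.log(tau)
    for d, s in zip(delays, scales):
        total += alpha * tau * math.log(s) + (tau - 1.0) * math.log(d)
        total -= (alpha + 1.0) * math.log(s ** tau + d ** tau)
    return total
```

With a mean delay of, say, 174 days and shape parameters 0.618 and 2.570, `scale_from_mean` returns the s for which `burr_moment(1, ...)` gives back 174, while `burr_moment(2, ...)` is infinite because alpha is below 2/tau.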
4.2. Allowing for business growth
The analysis in Ozkok et al. (2012b) did not take into account one significant factor:
business growth. For almost all the offices contributing to our data and for almost all
years, the number of CI policies in force increased year on year. If this is not taken into
account, it can introduce bias into the modelling of the CDD. Recall that our claims data
consist of claims settled in the years 1999–2005. For claims settled in any of these 7 years,
those with relatively short delays relate to claims from policies in force in more recent
years; those with relatively longer delays relate to policies in force in earlier years. The
growth in the numbers of policies in force means that claims with shorter delays are likely
to be relatively over-represented in our data. We allow for business growth in our
modelling of the CDD as follows:
(1) For each office and each year of diagnosis we assign a growth rate, denoted GR. This
depends only on office and year of diagnosis and not on any other characteristic of
the claim, for example type of policy (FA or SA) or age of the policyholder.
(2) For the most recent year for which the office contributed data, 2005 for all but 2 of the
13 offices, GR is set at 1. For each earlier year of diagnosis, GR is set equal to the ratio
of the average number of policies in force in the following year to the average number
in force in the year in question. In this context, average number in force is the average
of the numbers of policies in force at the start and at the end of the year. For years of
diagnosis prior to the earliest for which the office contributed data, the growth rate is
assumed to be the same as in the earliest year for which data exists.
(3) For each office and each year of diagnosis we assign a growth factor, denoted GF. The
growth factor is the product of the growth rates for that year of diagnosis and all
subsequent years of diagnosis up to the final year for which the office contributed
data.
(4) For the purposes of parameter estimation, the parameter s in the three parameter
Burr distribution is replaced in the loglikelihood function (5) by $s_w$, where:
$$s_w = s / \sqrt{GF}.$$
The effect of this is to decrease the variance for this claim by a factor GF, as can be
seen from Equation (4), so giving more weight to data from years where relatively few
policies were in force. Using weights inversely proportional to the variance is common
in weighted least squares estimation (see, for example, Greene (1990)).
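Steps (1) to (4) can be sketched as follows for a single office. This is a hedged illustration: the dictionaries of in-force counts, the function names and the assumption that the office contributed data for consecutive calendar years are all hypothetical, not taken from the CMI data.

```python
import math


def growth_rates_and_factors(in_force_start, in_force_end):
    """Return (GR, GF) dictionaries keyed by year for one office.

    in_force_start / in_force_end map each contributed year to the number of
    policies in force at the start / end of that year (consecutive years assumed).
    """
    years = sorted(in_force_start)
    # Average number in force: mean of the start-of-year and end-of-year counts.
    avg = {y: 0.5 * (in_force_start[y] + in_force_end[y]) for y in years}
    # GR for the most recent year is 1; otherwise next year's average over this year's.
    gr = {y: avg[y + 1] / avg[y] for y in years[:-1]}
    gr[years[-1]] = 1.0
    # GF is the product of GR for the year itself and all subsequent years.
    gf, running = {}, 1.0
    for y in reversed(years):
        running *= gr[y]
        gf[y] = running
    return gr, gf


def growth_factor(year, gr, gf):
    """GF for any year of diagnosis, reusing the earliest GR for earlier years."""
    first = min(gf)
    if year >= first:
        return gf[year]
    return gf[first] * gr[first] ** (first - year)


def weighted_scale(s, gf_value):
    """Scale parameter used in the loglikelihood: s_w = s / sqrt(GF)."""
    return s / math.sqrt(gf_value)
```

Because the variance in Equation (4) is proportional to the square of the scale, dividing s by the square root of GF divides the variance by GF.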
The procedure described above requires the date of diagnosis to be known so that GF
can be estimated. For claims where the date of diagnosis was not known, an iterative
procedure was used. A CDD was parameterised without allowing for business growth and
a preliminary estimate of the year of diagnosis was calculated from the date of settlement
minus the median of the CDD. A value for GF was calculated using this preliminary
estimate, a revised CDD was parameterised and a revised estimate of the year of diagnosis
was calculated. The process ended when two consecutive estimates of the year of diagnosis
were the same; this never took more than three iterations.
4.3. Details of the covariates
Details of the covariates used in the modelling of the CDD are given in Table 3. These
covariates are labelled x and $\theta_1$–$\theta_9$. The values of x and $\theta_1$–$\theta_7$ for each claims record
have been standardised by subtracting the mean and dividing by the standard deviation,
calculated from the claims data. This makes sense for covariates where the non-standardised value can be very large, for example, benefit amount, and has been done
for consistency for other covariates. For example, Sex has been coded 0 for females, 1 for
males and then standardised for each record by subtracting the mean, 0.573, and
dividing by the standard deviation, 0.495, so that a claims record for a female has a
value $\theta_1 = (0 - 0.573)/0.495 = -1.158$.
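The standardisation can be checked with a short Python sketch. The helper name is hypothetical; the mean and standard deviation for Sex are those quoted in the text.

```python
def standardise(value, mean, sd):
    """Standardised covariate: subtract the mean, divide by the standard deviation."""
    return (value - mean) / sd


SEX_MEAN, SEX_SD = 0.573, 0.495  # from the claims data, Sex coded F = 0, M = 1

theta1_female = standardise(0, SEX_MEAN, SEX_SD)
theta1_male = standardise(1, SEX_MEAN, SEX_SD)
print(round(theta1_female, 3), round(theta1_male, 3))
```

The same helper applies unchanged to continuous covariates such as benefit amount or policy duration.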
Note that Settlement year is a covariate but Year of diagnosis is not. It would not be
appropriate to have both as covariates. We used the former because we have full
information about Settlement year whereas we have had to estimate the latter in some
cases. This causes minor complications in the estimation of diagnosis rates (see Section 5.4).
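Equation (3) can then be written in more detail as follows:
$$E[X] = \exp(\beta \theta^{T}) = \exp\Big(\beta_0 + \beta_1 x + \sum_{j=1}^{7} \beta_{j+1} \theta_j + \beta_{8,\mathrm{Office}_i} + \beta_{9,\mathrm{Cause}_i}\Big) \qquad (6)$$
where $\beta_0$ is an intercept and individual $\beta$s are taken to be zero if the corresponding
covariate is not included in the model.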
Table 3. Definitions of the covariates used in the modelling of the CDDs.
Covariate    Description               Number of levels   Additional information     Mean      SD
x            Age                                          Age last birthday          44.424    9.478
$\theta_1$   Sex                       2                  F = 0, M = 1               0.573     0.495
$\theta_2$   Benefit type              2                  FA = 0, SA = 1             0.118     0.322
$\theta_3$   Smoker status             2                  N = 0, S = 1               0.261     0.439
$\theta_4$   Policy type               2                  JL = 0, SL = 1             0.491     0.499
$\theta_5$   Settlement year           7                  1999 → 1, 2000 → 2, ...    4.917     1.786
$\theta_6$   Benefit amount (£)                           Continuous                 55,397    56,988
$\theta_7$   Policy duration (days)                       Continuous                 1167      946
$\theta_8$   Office                    13
$\theta_9$   Cause of claim            10                 Levels listed below
Levels of Cause of claim: 1. CABG; 2. Cancer; 3. Death; 4. Heart attack; 5. Kidney failure;
6. Major organ transplant; 7. Multiple sclerosis; 8. Other; 9. Stroke; 10. Total and permanent disability.
4.4. The best fitting CDD

To estimate the missing dates of diagnosis, the best fitting CDD was selected. The
covariates to be included in this model were selected by Gibbs variable selection; those
excluded were x (Age), $\theta_1$ (Sex), $\theta_3$ (Smoker status) and $\theta_5$ (Settlement year). The means
and the standard deviations of the regression coefficients and the two Burr parameters are
given in Table 4.
Some points to note about the regression coefficients in Table 4 are:
(1) Benefit amount has a negative coefficient, so that the larger the amount, the shorter
the expected delay.
(2) The expected delay depends on Office and can differ by up to a factor of 2.4, since
exp(0.581 + 0.315) ≈ 2.4.
(3) The expected delay depends on Cause and can differ by up to a factor of 2.1, since
exp(0.215 + 0.542) ≈ 2.1, with death giving the shortest delays, and TPD the longest.
(4) The mean and the variance of the posterior distributions for the parameters $\alpha$ and $\tau$
are shown in Table 4. From these values, we can see that whatever (reasonable) point
estimates we choose for these parameters, $\alpha$ is less than $2/\tau$, and so the standard
deviation of the posterior distribution for the delay is infinite.
(5) 95% credible intervals for all the parameters are given approximately by the mean
plus and minus twice the standard deviation.
The best fitting CDD is taken to mean the three-parameter Burr distribution, as
specified in Section 4.1, whose coefficients/parameters are equal to the means shown in
Table 4.
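Points (4) and (5) can be combined in a small numerical check. The sketch below is illustrative only: it takes the posterior means and standard deviations of the two Burr shape parameters from Table 4, forms the approximate 95% credible intervals, and tests the moment condition (alpha greater than k/tau) at the corners of those intervals. The function names are hypothetical.

```python
ALPHA_MEAN, ALPHA_SD = 0.618, 0.015  # Table 4
TAU_MEAN, TAU_SD = 2.570, 0.034      # Table 4


def credible_interval(mean, sd):
    """Approximate 95% credible interval: mean plus and minus twice the SD."""
    return mean - 2.0 * sd, mean + 2.0 * sd


def moment_is_finite(k, alpha, tau):
    """The k-th moment of the Burr distribution is finite only if alpha > k / tau."""
    return alpha > k / tau


alpha_lo, alpha_hi = credible_interval(ALPHA_MEAN, ALPHA_SD)
tau_lo, tau_hi = credible_interval(TAU_MEAN, TAU_SD)

# Most favourable corner for a finite variance: largest alpha and largest tau.
variance_finite_anywhere = moment_is_finite(2, alpha_hi, tau_hi)
# Least favourable corner for a finite mean: smallest alpha and smallest tau.
mean_finite_everywhere = moment_is_finite(1, alpha_lo, tau_lo)
print(variance_finite_anywhere, mean_finite_everywhere)
```

Under these values the mean delay is finite throughout the intervals while the variance is not finite anywhere in them, which is the sense in which the standard deviation of the delay is infinite for any reasonable point estimates.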
Table 5 shows 11 scenarios. Scenario 1 has typical values for its covariates; an asterisk
marks a change in a covariate from Scenario 1. The mean of the posterior distribution for
the delay between diagnosis and settlement for each of these 11 scenarios is shown in
Table 6, together with the standard deviation and some percentage points of the estimate
of the mean. Note that these are not the standard deviation and percentage points of the
posterior distribution itself; the standard deviation is infinite in every case, as pointed
out in comment (4) earlier in this section. The means in Table 6 can be obtained from
Equation (6), noting that for this model $\theta = (\theta_2, \theta_4, \theta_6, \theta_7, \theta_8, \theta_9)$, and using the information
in Table 3 and the parameters in Table 4. For example, the mean delay for scenario 1 is
calculated as follows:
$$E[X] = \exp\big(5.469 - 0.023\,(0 - 0.118)/0.322 + 0.034\,(0 - 0.491)/0.499 - 0.032\,(50{,}000 - 55{,}397)/56{,}988 - 0.098\,(1460 - 1167)/946 - 0.158 - 0.101\big) = 174 \text{ days}$$

Table 4. Coefficients for the best fitting CDD model.
Covariate         Parameter                          Mean      SD
Intercept         $\beta_0$                          5.469     0.025
Benefit type      $\beta_3$                          -0.023    0.006
Policy type       $\beta_5$                          0.034     0.006
Benefit amount    $\beta_7$                          -0.032    0.007
Policy duration   $\beta_8$                          -0.098    0.007
Office            $\beta_{9,\mathrm{Office}_1}$      0.303     0.022
                  $\beta_{9,\mathrm{Office}_2}$      0.215     0.020
                  $\beta_{9,\mathrm{Office}_3}$      0.205     0.061
                  $\beta_{9,\mathrm{Office}_4}$      0.249     0.050
                  $\beta_{9,\mathrm{Office}_5}$      0.090     0.037
                  $\beta_{9,\mathrm{Office}_6}$      -0.050    0.085
                  $\beta_{9,\mathrm{Office}_7}$      0.129     0.118
                  $\beta_{9,\mathrm{Office}_8}$      0.106     0.021
                  $\beta_{9,\mathrm{Office}_9}$      0.315     0.025
                  $\beta_{9,\mathrm{Office}_{10}}$   0.201     0.033
                  $\beta_{9,\mathrm{Office}_{11}}$   -0.158    0.017
                  $\beta_{9,\mathrm{Office}_{12}}$   0.209     0.023
                  $\beta_{9,\mathrm{Office}_{13}}$   0.581     0.047
Cause of claim    $\beta_{10,\mathrm{Cause}_1}$      0.145     0.040
                  $\beta_{10,\mathrm{Cause}_2}$      -0.101    0.018
                  $\beta_{10,\mathrm{Cause}_3}$      -0.542    0.026
                  $\beta_{10,\mathrm{Cause}_4}$      0.029     0.023
                  $\beta_{10,\mathrm{Cause}_5}$      0.129     0.079
                  $\beta_{10,\mathrm{Cause}_6}$      0.194     0.116
                  $\beta_{10,\mathrm{Cause}_7}$      0.152     0.033
                  $\beta_{10,\mathrm{Cause}_8}$      0.006     0.028
                  $\beta_{10,\mathrm{Cause}_9}$      0.215     0.027
                  $\beta_{10,\mathrm{Cause}_{10}}$   0.121     0.056
                  $\alpha$                           0.618     0.015
                  $\tau$                             2.570     0.034

Table 5. Scenarios for prediction of the CDD under the best fitting model.
Scenario   Benefit type   Joint/single life   Benefit amount   Policy duration   Office code   Cause of claim
1          FA             J                   50,000           1460              11            Cancer
2          SA*            J                   50,000           1460              11            Cancer
3          FA             S*                  50,000           1460              11            Cancer
4          FA             J                   10,000*          1460              11            Cancer
5          FA             J                   250,000*         1460              11            Cancer
6          FA             J                   50,000           365*              11            Cancer
7          FA             J                   50,000           3650*             11            Cancer
8          FA             J                   50,000           1460              6*            Cancer
9          FA             J                   50,000           1460              10*           Cancer
10         FA             J                   50,000           1460              11            Death*
11         FA             J                   50,000           1460              11            TPD*

Table 6. The mean of the posterior distribution of the CDD under the different scenarios given in Table 5 using
the best fitting model, and the standard deviation and some percentage points of the estimate of the mean.
Scenario   Mean   SD     2.5%   50%   97.5%
1          174    4.0    167    174   182
2          162    4.8    153    162   172
3          186    4.3    178    186   195
4          178    4.1    170    178   186
5          156    5.1    146    155   166
6          195    4.4    187    194   204
7          139    4.2    131    139   147
8          194    17.8   162    194   231
9          249    10.0   230    249   270
10         112    3.2    106    112   119
11         217    12.7   193    217   243

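The Scenario 1 calculation in Section 4.4 can be reproduced with the sketch below, which applies Equation (6) to standardised covariates. It is an illustration under stated assumptions: the coefficient magnitudes are the posterior means in Table 4 and the means and standard deviations come from Table 3, but the signs attached to the coefficients are those implied by the comparisons between scenarios in Table 6 (for example, SA, larger benefit amounts and longer durations all shorten the mean delay), and only the office and cause levels used in Table 5 are included. The dictionary and function names are hypothetical.

```python
import math

# Means and standard deviations used for standardisation (Table 3).
MEANS = {"benefit_type": 0.118, "policy_type": 0.491,
         "benefit_amount": 55397.0, "policy_duration": 1167.0}
SDS = {"benefit_type": 0.322, "policy_type": 0.499,
       "benefit_amount": 56988.0, "policy_duration": 946.0}

# Posterior means (Table 4); signs are assumptions inferred from Table 6.
BETA0 = 5.469
BETAS = {"benefit_type": -0.023, "policy_type": 0.034,
         "benefit_amount": -0.032, "policy_duration": -0.098}
OFFICE = {6: -0.050, 10: 0.201, 11: -0.158}
CAUSE = {"Cancer": -0.101, "Death": -0.542, "TPD": 0.121}


def mean_delay(benefit_type, policy_type, benefit_amount, policy_duration, office, cause):
    """Mean delay in days from Equation (6); FA = 0, SA = 1 and JL = 0, SL = 1."""
    raw = {"benefit_type": benefit_type, "policy_type": policy_type,
           "benefit_amount": benefit_amount, "policy_duration": policy_duration}
    linear = BETA0 + OFFICE[office] + CAUSE[cause]
    for name, value in raw.items():
        linear += BETAS[name] * (value - MEANS[name]) / SDS[name]
    return math.exp(linear)


scenario_1 = mean_delay(0, 0, 50000, 1460, 11, "Cancer")
print(round(scenario_1))
```

Changing one argument at a time, as in Table 5, gives the other scenario means; for example `mean_delay(0, 0, 50000, 1460, 11, "Death")` corresponds to Scenario 10.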
4.5. CDDs incorporating all covariates

For the purpose of estimating CI diagnosis rates, it is convenient to have a CDD which
incorporates all possible covariates which could be included in our model. Since we
estimate CI diagnosis rates for an all causes model and for a cause specific model
(Paper II), we need two further CDDs: one which incorporates all the covariates x and
$\theta_1$–$\theta_8$, but does not include cause, and a separate CDD which incorporates all the
covariates x and $\theta_1$–$\theta_9$, which includes cause. The coefficients for these two CDDs are
set out in Tables 7 and 8.
Points to note about these two CDDs are:
(1) They were fitted in the same way as the best fitting CDD in Section 4.4, with the one
difference that Gibbs variable selection was not used to determine which covariates
are important and so should be retained. In particular, the fitting used Bayesian
methodology, allowing for business growth and including claim records where data
were missing.
(2) For each covariate incorporated, the methodology produces a posterior distribution
for the regression coefficient. Tables 7 and 8 show the mean and the standard
deviation of the posterior distribution for each coefficient.
(3) For each of these two CDDs, $\alpha < 2/\tau$ using the means, or any reasonable estimates, of
these parameters, so that the standard deviation of the posterior distribution for the
delay is always infinite.
(4) 95% credible intervals for all the parameters in Tables 7 and 8 are given
approximately by the mean plus and minus twice the standard deviation.
(5) These two CDDs are used in Paper II when we discuss the estimation of CI diagnosis
rates. See Section 5.4 for an outline of the methodology.
The CDD with all covariates excluding (resp. including) cause is taken to mean the three-parameter
Burr distribution, as specified in Section 4.1, whose coefficients/parameters are
equal to the means shown in Table 7 (resp. Table 8).
13
453
451
Table 7. Coefficients for the CDD with all covariates except cause.
Covariate
Parameter
Mean
SD
Covariate
Parameter
Mean
SD
b0
b1
b2
b3
b4
b5
b6
b7
b8
5.288
0.006
0.022
0.010
0.015
0.033
0.008
0.026
0.083
0.020
0.006
0.005
0.005
0.005
0.005
0.006
0.006
0.007
Office
b9;Office1
b9;Office2
b9;Office3
b9;Office4
b9;Office5
b9;Office6
b9;Office7
b9;Office8
b9;Office9
b9;Office10
b9;Office11
b9;Office12
b9;Office13
a
t
0.279
0.203
0.184
0.279
0.112
0.025
0.086
0.122
0.302
0.201
0.170
0.226
0.581
0.543
2.958
0.020
0.018
0.056
0.046
0.035
0.068
0.120
0.019
0.023
0.030
0.017
0.021
0.030
0.011
0.036
Intercept
Age
Sex
Benefit type
Smoker status
Policy type
Settlement year
Benefit amount
Policy duration
Table 8. Coefficients for the CDD with all covariates, including cause.

Covariate         Parameter                        Mean    SD
----------------  -------------------------------  ------  -----
Intercept         $\beta_0$                        5.206   0.022
Age               $\beta_1$                        0.014   0.006
Sex               $\beta_2$                        0.010   0.005
Benefit type      $\beta_3$                        0.026   0.005
Smoker status     $\beta_4$                        0.011   0.005
Policy type       $\beta_5$                        0.030   0.005
Settlement year   $\beta_6$                        0.116   0.006
Benefit amount    $\beta_7$                        0.036   0.006
Policy duration   $\beta_8$                        0.103   0.006
Office            $\beta_{9;\mathrm{Office1}}$     0.217   0.019
                  $\beta_{9;\mathrm{Office2}}$     0.095   0.017
                  $\beta_{9;\mathrm{Office3}}$     0.209   0.053
                  $\beta_{9;\mathrm{Office4}}$     0.177   0.043
                  $\beta_{9;\mathrm{Office5}}$     0.190   0.033
                  $\beta_{9;\mathrm{Office6}}$     0.391   0.062
                  $\beta_{9;\mathrm{Office7}}$     0.344   0.112
                  $\beta_{9;\mathrm{Office8}}$     0.004   0.019
                  $\beta_{9;\mathrm{Office9}}$     0.193   0.022
                  $\beta_{9;\mathrm{Office10}}$    0.178   0.029
                  $\beta_{9;\mathrm{Office11}}$    0.120   0.016
                  $\beta_{9;\mathrm{Office12}}$    0.197   0.020
                  $\beta_{9;\mathrm{Office13}}$    0.587   0.028
Cause of claim    $\beta_{10;\mathrm{Cause1}}$     0.137   0.036
                  $\beta_{10;\mathrm{Cause2}}$     0.120   0.018
                  $\beta_{10;\mathrm{Cause3}}$     0.498   0.019
                  $\beta_{10;\mathrm{Cause4}}$     0.026   0.021
                  $\beta_{10;\mathrm{Cause5}}$     0.106   0.067
                  $\beta_{10;\mathrm{Cause6}}$     0.149   0.109
                  $\beta_{10;\mathrm{Cause7}}$     0.137   0.029
                  $\beta_{10;\mathrm{Cause8}}$     0.003   0.024
                  $\beta_{10;\mathrm{Cause9}}$     0.203   0.025
                  $\beta_{10;\mathrm{Cause10}}$    0.182   0.047
                  $\alpha$                         0.660   0.015
                  $\tau$                           2.850   0.034

5. Estimating and smoothing CI diagnosis rates

5.1. Preliminaries
In this section, we outline the procedure for the estimation and smoothing of the CI
diagnosis rates for the models in Figures 1-3. We use the generic notation $\lambda_{x;\theta}$ for this
diagnosis rate, even though this may be, for example, a cause specific rate, as in Figure 1.
We assume the following general functional form for the diagnosis rate:

$$\lambda_{x;\theta} = g_r(x) + \exp\{f_s(x) + \beta\theta^T\}, \qquad r, s = 0, 1, \ldots \tag{7}$$

where $g_r(x)$ and $f_s(x)$ are polynomials in age $x$ (last birthday) of degree $r$ and $s$,
respectively, so that:

$$\lambda_{x;\theta} = \sum_{i=1}^{r} \kappa_i x^{i-1} + \exp\left\{\sum_{j=1}^{s} \delta_j x^{j-1} + \beta\theta^T\right\} \tag{8}$$

where $\kappa_i$ and $\delta_j$, $i = 1, \ldots, r$, $j = 1, \ldots, s$, are constants,
$\theta$ is a vector of covariates, and,
$\beta$ is a vector of regression coefficients.
Points to note about this general functional form are:
(1) Without the covariates in $\theta$ this is a Gompertz-Makeham GM(r,s) function of age.
(2) Without the term $g_r(x)$ this gives the linear predictor in a generalised linear model
incorporating the covariates in $\theta$ and age, $x$, albeit with powers of age included up to $x^{s-1}$.
(3) Although it is not explicit in the notation, we can, and do, allow for interaction terms
involving two covariates, including age.
(4) For any given set of covariates, $\theta$, and regression coefficients, $\beta$, $\lambda_{x;\theta}$ is necessarily a
smooth function of age.
(5) Many different functional forms could have been chosen for $\lambda_{x;\theta}$. The particular
functional form in Equation (7) was chosen because:
(i) it is very flexible,
(ii) it allows one component of the diagnosis rate, $g_r(x)$, to depend only on age, and,
(iii) if, as happened in almost all cases, the optimal value of $r$ is 0, it reduces to a
generalised linear model, making it possible to use standard statistical software
to estimate the parameters.
To determine which covariates should be included in the model and to estimate the
parameters $r$, $s$, $\kappa_i$, $\delta_j$ and $\beta$, we need an estimator for $\lambda_{x;\theta}$, calculated from our data, with
known statistical properties.
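The functional form in Equations (7) and (8) can be illustrated with a short sketch. The code below is only an illustration under stated assumptions: the function name `diagnosis_rate` and all numerical coefficient values are hypothetical and are not estimates from the paper. The lists `kappa` and `delta` hold $\kappa_1, \ldots, \kappa_r$ and $\delta_1, \ldots, \delta_s$, so an empty `kappa` corresponds to $r = 0$.

```python
import math


def diagnosis_rate(x, kappa, delta, beta, theta):
    """Evaluate sum_i kappa_i x^(i-1) + exp(sum_j delta_j x^(j-1) + beta . theta)."""
    # g_r(x): polynomial part outside the exponential (zero when kappa is empty, r = 0)
    g = sum(k * x ** i for i, k in enumerate(kappa))
    # f_s(x): polynomial in age inside the exponential
    f = sum(d * x ** j for j, d in enumerate(delta))
    # beta theta^T: contribution of the covariates
    linear = sum(b * t for b, t in zip(beta, theta))
    return g + math.exp(f + linear)


# Hypothetical example with r = 0, s = 2 and a single binary covariate.
example_delta = [-9.0, 0.08]   # illustrative values only
example_beta = [0.4]           # illustrative value only
for age in (30, 40, 50):
    base = diagnosis_rate(age, [], example_delta, example_beta, [0])
    alt = diagnosis_rate(age, [], example_delta, example_beta, [1])
    print(age, base, alt)
```

With `kappa` empty, the logarithm of the returned rate is linear in the coefficients, which is the generalised linear model case described in point (5)(iii).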
5.2. The covariates used in the modelling of the intensity rates
The full set of covariates to be considered in the modelling of the diagnosis rates is shown
in Table 9.
Points to note about the covariates in Table 9 are:
(1) The list of covariates is the same as in Table 3, with the exception of Cause of claim,
which is no longer needed as a covariate, and $\theta_5$ which was Settlement year but is now
Year of exposure for the in-force or Year of diagnosis for the claims. However, the
treatment of the covariates is, in some cases, different.
(2) The maximum values of $r$ and $s$ required in Equation (7) were 1 and 3, respectively.
The values of Age, Age$^2$ and Year were standardised by subtracting the mean and
dividing by the standard deviation. The values of these moments, calculated from the
in-force data, are shown in Table 9.
Table 9. Definitions of the covariates used in the modelling of the intensity rates.

Covariate                     Number of levels              Additional information
----------------------------  ----------------------------  ----------------------------------------
Age last birthday             Integer values                Age: mean 39.75, SD 11.21
                                                            Age$^2$: mean 1705, SD 930
$\theta_1$  Sex               2 (F & M)                     F is the base category
$\theta_2$  Benefit type      2 (FA & SA)                   FA is the base category
$\theta_3$  Smoker status     2 (N & S)                     N is the base category
$\theta_4$  Policy type       2 (Joint/Single life)         J is the base category
$\theta_5$  Year              Numerical (1999, ..., 2005)   Calendar year of exposure/diagnosis
                                                            Year: mean 2002.36, SD 1.86
$\theta_6$  Benefit amount                                  1: Benefit amount < 25,000
                                                            2: 25,000 < Benefit amount < 50,000
                                                            3: 50,000 < Benefit amount < 75,000
                                                            4: Benefit amount > 75,000
$\theta_7$  Policy duration                                 Duration between the commencement of the
                                                            policy and the beginning of the year of
                                                            exposure or diagnosis
                                                            Duration 0: Policy Duration < 1 year
                                                            Duration 1: 1 year < Policy Duration <= 2 years
                                                            Duration 2: 2 years < Policy Duration <= 3 years
                                                            Duration 5: Policy Duration > 5 years
$\theta_8$  Office            13
(3) Two covariates, Benefit amount and Policy duration, were treated as continuous in the
modelling of the CDD but are now categorised as shown in Table 9. The reason for
this in both cases is computational convenience.
(4) The regression coefficients for the covariates $\theta_6$ to $\theta_8$ were chosen so that they summed to 0.
(5) The regression coefficients for the covariates $\theta_1$ to $\theta_4$ were chosen so that the base
category, as indicated in Table 9, has coefficient zero and the alternative category has,
if appropriate, a non-zero coefficient.
5.3. Calculation of the exposure

For each office we have in-force data for the start and end of all or some of the calendar
years 1999, 2000, . . ., 2005. For each calendar year for which we have in-force data, we
have details of claims settled in that year. Many offices contributed data for all seven
calendar years. Those that did not, contributed data for a contiguous set of years, so that
no offices contributed for two or more periods with breaks between them.
For any given calendar year for which an office contributed data, we can count the
number of policies in force at the start and end of the year classified by age $x$ last birthday
and by a set of covariates, $\theta$. Using linear interpolation, we can then estimate $E(x,u;\theta)$, the
number of policies in force at time $u$, $0 \le u \le 1$, after the start of the year, classified by $x$
and $\theta$.
We make the simplifying assumption that a policy is removed from the in-force data as
soon as a CI is diagnosed, or death occurs. In practice, there is at least a short period
between diagnosis and the policy's removal because of the delay between diagnosis and
notification, and there may be a significant period. With our assumption, we regard
$E(x,u;\theta)$ as the number of policies exposed to the risk of the diagnosis of a CI, or death, at
time $u$ from the start of a given calendar year, for a given office, classified by $x$ and $\theta$. In
conventional actuarial terminology, this is a central exposure. Note that this exposure does
not depend on whether we are estimating cause specific diagnosis rates or all causes rates.
If we knew the number of critical illnesses (cause specific, all causes, including or
excluding deaths, as appropriate) diagnosed in this year, for this office, classified by $x$ and
$\theta$, say $D(x;\theta)$, then, using standard methodology, see, for example, Macdonald (1996), we
could write:

$$D(x;\theta) \sim \mathrm{Poisson}\left(\lambda_{x;\theta} \int_{u=0}^{1} E(x,u;\theta)\,du\right)$$

so that our estimator for the diagnosis rate, $\hat{\lambda}_{x;\theta}$, would be given by:

$$\hat{\lambda}_{x;\theta} = D(x;\theta) \Big/ \int_{u=0}^{1} E(x,u;\theta)\,du \tag{9}$$

which has a standard deviation which could be estimated by:

$$\sqrt{D(x;\theta)} \Big/ \int_{u=0}^{1} E(x,u;\theta)\,du.$$
The difficulty with this approach is that we do not know the number of critical illnesses
diagnosed in this year; what we know is the number of critical illnesses settled in this year,
and in the subsequent years within the observation period for which this office
contributed data.
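As an illustration of the exposure calculation and of the estimator in Equation (9), the sketch below treats a single cell (one office, calendar year, age and set of covariates). It is a sketch under stated assumptions: the function names and the in-force and claim counts are hypothetical. It relies on the fact that, under linear interpolation between the start-of-year and end-of-year counts, the integral of the exposure over the year equals the average of the two counts.

```python
import math


def exposure_at(u, in_force_start, in_force_end):
    """Linearly interpolated number of policies in force at time u, 0 <= u <= 1."""
    return (1.0 - u) * in_force_start + u * in_force_end


def central_exposure(in_force_start, in_force_end):
    """Integral of the interpolated exposure over the year (average of the two counts)."""
    return 0.5 * (in_force_start + in_force_end)


def crude_rate(diagnosed, in_force_start, in_force_end):
    """Estimator of the form in Equation (9) and its estimated standard deviation."""
    exposure = central_exposure(in_force_start, in_force_end)
    return diagnosed / exposure, math.sqrt(diagnosed) / exposure


# Hypothetical cell: 1000 policies at the start of the year, 1200 at the end,
# and 11 critical illnesses diagnosed during the year.
rate, sd = crude_rate(11, 1000, 1200)
print("central exposure:", central_exposure(1000, 1200))
print("rate:", rate, "standard deviation:", sd)
```

This calculation needs the number of diagnoses in the year, which is exactly the quantity that is not observed in the data.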
5.4. An estimator for $\lambda_{x;\theta}$

We can get around the estimation problem outlined above as follows. Consider a specific
office, calendar year for the exposure and diagnosis, age last birthday, $x$, and set of
covariates, $\theta$.
Let:
$E(x,u;\theta)$ denote the exposure at time $u$ years after the start of the specific calendar
year,
$t$ denote the time in years from the start of the specific calendar year until the end of
the last year, within the observation period, for which the specific office submitted data,
$F(s;x,\theta)$ denote the cumulative distribution function for the CDD incorporating all the
covariates in $\theta$ (in practice this is one of the two CDDs in Section 4.5, depending on
whether we are estimating an all causes or a cause specific diagnosis rate), and,
$N(x;\theta)$ denote the number of critical illnesses (all causes or cause specific as required)
diagnosed in the specific calendar year, for this office, at age $x$ last birthday and with
covariates $\theta$, and settled within one of the years (in the observation period) for which this
office submitted data.
Note that $N(x;\theta)$ differs from $D(x;\theta)$ since some critical illness claims included in the
latter will not be settled until after the period in which the office contributes data. The
probability that a CI diagnosed at time $u$ will be settled by the end of the last year of
contribution is $F(t-u;x,\theta)$. Hence, we can write:

$$N(x;\theta) \sim \mathrm{Poisson}\left(\lambda_{x;\theta} \int_{u=0}^{1} E(x,u;\theta)\,F(t-u;x,\theta)\,du\right)$$

so that our estimator for the diagnosis rate, $\hat{\lambda}_{x;\theta}$, is given by:

$$\hat{\lambda}_{x;\theta} = N(x;\theta) \Big/ \int_{u=0}^{1} E(x,u;\theta)\,F(t-u;x,\theta)\,du \tag{10}$$

which has a standard deviation which can be estimated by:

$$\sqrt{N(x;\theta)} \Big/ \int_{u=0}^{1} E(x,u;\theta)\,F(t-u;x,\theta)\,du. \tag{11}$$
Comparing Equations (9) and (10), we can see that the numerators are different, as
explained above, and that the denominator of the latter has been reduced by the inclusion
of the term $F(t-u;x,\theta)$ to allow for the probability that a CI diagnosed in the specific year
will be settled within the observation period.
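The adjusted exposure in Equations (10) and (11) can be approximated numerically. The sketch below is a minimal illustration under stated assumptions: the exposure is linearly interpolated between hypothetical start-of-year and end-of-year counts, the integral is approximated by the midpoint rule, and the delay distribution is passed in as a function. The function `stand_in_cdf` is a hypothetical exponential delay distribution used only to make the example runnable; it is not the Burr CDD of Section 4.

```python
import math


def adjusted_exposure(in_force_start, in_force_end, t, cdf, steps=1000):
    """Midpoint-rule approximation to the integral over u in [0, 1] of E(u) * F(t - u)."""
    h = 1.0 / steps
    total = 0.0
    for k in range(steps):
        u = (k + 0.5) * h
        e_u = (1.0 - u) * in_force_start + u * in_force_end  # interpolated exposure
        total += e_u * cdf(t - u) * h
    return total


def adjusted_rate(settled, in_force_start, in_force_end, t, cdf):
    """Estimator of the form in Equation (10) and standard deviation as in Equation (11)."""
    denominator = adjusted_exposure(in_force_start, in_force_end, t, cdf)
    return settled / denominator, math.sqrt(settled) / denominator


def stand_in_cdf(delay_years):
    """Hypothetical delay CDF (exponential, mean 0.5 years); NOT the paper's Burr CDD."""
    return 1.0 - math.exp(-2.0 * delay_years) if delay_years > 0 else 0.0


# Hypothetical cell: last year of contribution (t = 1), 11 claims diagnosed and settled.
rate, sd = adjusted_rate(11, 1000, 1200, 1.0, stand_in_cdf)
print("adjusted exposure:", adjusted_exposure(1000, 1200, 1.0, stand_in_cdf))
print("rate:", rate, "standard deviation:", sd)
```

When $t$ is large, $F(t-u;x,\theta)$ is close to 1 throughout the year and the adjusted exposure approaches the central exposure of Equation (9); for the last year of contribution the reduction is largest.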
Points to note about this estimation methodology are:
(1) As a starting point, the exposure, $E(x,u;\theta)$, and the claims count, $N(x;\theta)$, are classified
by every combination of all possible covariates, as listed in Table 9. It is
computationally convenient, but not essential, that the CDD also includes each of
these covariates. If the claims count relates to a specific cause, then it is convenient for
the model for the CDD to incorporate cause of claim. If it is found that a covariate is
statistically unimportant for the modelling of the diagnosis rates, then the claims
count, $N(x;\theta)$, and the adjusted exposure, $\int_{u=0}^{1} E(x,u;\theta)\,F(t-u;x,\theta)\,du$, can be
aggregated over the values for that covariate.
(2) The estimator in Equation (10) is based on critical illnesses diagnosed in a particular
year. This year is specified in the covariate $\theta_5$ for the exposure and the claims count
(see Table 9). However, the CDD used in the estimator has Year of settlement rather
than Year of diagnosis as a covariate (see Table 3). This slight mismatch is unfortunate
but is not likely to be of any numerical significance since:
(i) Year of settlement was not an important covariate for the best fitting CDD, and,
(ii) many claims are settled in, or very soon after the end of, their Year of diagnosis.
(3) The two CDDs in Section 4.5 incorporate Benefit amount ($\theta_6$) and Policy duration ($\theta_7$)
as continuous covariates, whereas for the estimation of the diagnosis rates these
covariates have been categorised as shown in Table 9. The value of the CDD in the
calculation of the estimator in Equation (10) uses a mid-point value for these two
covariates, as shown in Table 10, although the mid-point for the upper end is fixed
somewhat arbitrarily. The categories for Benefit amount correspond approximately to
the quartiles from the data.
5.5. Parameter estimation

The parameters $r$, $s$, $\kappa_i$, $\delta_j$ and $\beta$ were estimated under the assumed Poisson model using
either maximum likelihood (when the term $g_r(x)$ was present) or GLM methodology
otherwise. The covariates to be included in the model were chosen by minimising the
Bayes Information Criterion (BIC), given by:

$$\mathrm{BIC} = -2 \log L(\hat{\kappa}, \hat{\delta}, \hat{\beta}) + p \log(n)$$

where $L(\cdot)$ is the likelihood function, $\hat{\kappa}$, $\hat{\delta}$ and $\hat{\beta}$ are the (vectors of) estimates of the model
parameters, $p$ is the total number of estimated parameters, and, $n$ is the number of data
points.
In principle, we could try to minimise the BIC as a function of the complete set of
parameters. In practice, this would cause computational difficulties and so a pragmatic
approach was employed. We used the following procedure to determine the best model(s):
(1) First we set $r = 0$ and $s = 1$. We then choose the value of $\delta_1$ and the set of covariates, $\theta$,
together with their parameter values, $\beta$, which minimises the BIC. In choosing the
optimal set of covariates, we allow for an interaction only if there is a prima facie case
for including it. In practice, the only interaction investigated (and, in some cases,
included) was Age × Smoker.
(2) Keeping $r = 0$, we then increase $s$ by 1 and choose the values for $\delta_1$ and $\delta_2$, $\theta$, and the
corresponding parameter values, $\beta$, which minimise the BIC.
(3) We repeat step (2) until the BIC increases. The value of $s$ and the corresponding
values for $\delta_1, \ldots, \delta_s$, set of covariates, $\theta$, and parameters, $\beta$, which minimise the BIC,
at least locally, are then our selected values.
(4) For the selected values of $s$ and $\theta$ we increase the value of $r$ by 1 and check whether,
by optimising over the $\kappa$s, $\delta$s and $\beta$s, the BIC decreases or not. If it decreases, we
repeat step (4). If it increases, we choose the value of $r$ which (locally) minimises the
BIC. In almost all cases, the optimal value of $r$ was zero. The only exception was the
diagnosis rate for death for the models in Figures 1 and 3, where the optimal value for
$r$ was 1.
The calculations were carried out using the statistical package R.
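The stopping rule in steps (2) and (3) can be sketched as follows. This is an illustration only, written in Python rather than R, and it does not fit any model: the function `toy_fit` and its log-likelihood values are hypothetical stand-ins for the result of fitting the Poisson model with a given value of $s$. The same loop could be applied to $r$ in step (4).

```python
import math


def bic(log_likelihood, n_params, n_points):
    """BIC = -2 log L + p log(n)."""
    return -2.0 * log_likelihood + n_params * math.log(n_points)


def choose_order(fit, n_points, start=1, max_order=10):
    """Increase the order by 1 until the BIC stops decreasing.

    `fit(order)` must return (maximised log-likelihood, number of parameters).
    Returns the locally optimal order and its BIC.
    """
    best_order = start
    best_bic = bic(*fit(start), n_points)
    for order in range(start + 1, max_order + 1):
        candidate = bic(*fit(order), n_points)
        if candidate >= best_bic:
            break  # BIC no longer decreases: keep the previous order
        best_order, best_bic = order, candidate
    return best_order, best_bic


# Hypothetical maximised log-likelihoods for s = 1, 2, 3, 4 (illustrative only).
TOY_LOG_LIKELIHOODS = {1: -520.0, 2: -505.0, 3: -503.5, 4: -503.4}


def toy_fit(s):
    """Stand-in for a model fit: s age coefficients plus 3 covariate coefficients."""
    return TOY_LOG_LIKELIHOODS[s], s + 3


print(choose_order(toy_fit, n_points=500))
```

Because the search stops at the first increase, the selected order is a local minimum of the BIC, in line with the "at least locally" qualification in step (3).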
Table 10. Values of benefit amount and policy duration used in the CDDs for the estimation of
CI diagnosis rates.

Benefit amount                    Policy duration
Category           Mid-point      Category        Mid-point
-----------------  ---------      --------------  ---------
1: <=25,000        12,500         0: <1 year      183 days
2: 25,000-50,000   37,500         1: 1-2 years    548 days
3: 50,000-75,000   62,500         2: 2-3 years    913 days
4: >=75,000        100,000        3: 3-4 years    1278 days
                                  4: 4-5 years    1643 days
                                  5: >=5 years    2585 days
The results of our modelling are set out and discussed in Paper II. More details of the
procedures and results can be found in Ozkok (2011).
Acknowledgements
The authors are grateful to the Continuous Mortality Investigation for supplying the data
and for advice and support throughout the course of this research, and also to Hacettepe
University for their financial support for one of the authors, Erengul Ozkok, while this
research was being carried out.
References
Association of British Insurers (2011). Statement of best practice for critical illness. London: ABI.
CMI WP 14 (2005). Continuous Mortality Investigation Committee Working Paper 14 - Methodology underlying
the 1999-2002 CMI critical illness experience investigation. Institute of Actuaries and Faculty of Actuaries.
CMI WP 33 (2008). Continuous Mortality Investigation Committee Working Paper 33 - A new methodology for
analysing CMI critical illness experience. Institute of Actuaries and Faculty of Actuaries.
CMI WP 43 (2010). Continuous Mortality Investigation Committee Working Paper 43 - CMI critical illness
diagnosis rates for accelerated business, 1999-2004. Institute of Actuaries and Faculty of Actuaries.
CMI WP 50 (2011). Continuous Mortality Investigation Committee Working Paper 50 - CMI critical illness
diagnosis rates for accelerated business, 2003-2006. Institute and Faculty of Actuaries.
CMI WP 52 (2011). Continuous Mortality Investigation Committee Working Paper 52 - Cause-specific CMI
critical illness diagnosis rates for accelerated business, 2003-2006. Institute and Faculty of Actuaries.
Dinani, A., Grimshaw, D., Robjohns, N., Somerville, S., Spry, A. & Staffurth, J. (2000). A critical review: report of
the critical illness healthcare study group. Presented to the Staple Inn Actuarial Society.
Greene, W. H. (1990). Econometric analysis. New York: Macmillan.
Macdonald, A. S. (1996). An actuarial survey of statistical models for decrement and transition data. I: multiple
state, binomial and Poisson models. British Actuarial Journal 2, 129-155.
Ozkok, E. (2011). A stochastic model for critical illness insurance. PhD thesis. Heriot-Watt University, 213 p.
Ozkok, E., Streftaris, G., Waters, H. R. & Wilkie, A. D. (2012a). Modelling critical illness claim diagnosis rates II:
results. Scandinavian Actuarial Journal, DOI:10.1080/03461238.2012.728538.
Ozkok, E., Streftaris, G., Waters, H. R. & Wilkie, A. D. (2012b). Bayesian modelling of the time delay between
diagnosis and settlement for critical illness insurance using a Burr generalised-linear-type model. Insurance:
Mathematics and Economics 50, 266-279.
Waters, H. R. (1984). An approach to the study of multiple state models. Journal of the Institute of Actuaries 111,
363-374.

