David A. Cook
B. Med. Sc., M.B.B.S., F.A.N.Z.C.A., F.J.F.I.C.M.
To promote the highest quality healthcare, it is necessary to monitor the outcomes of patient care. The tools to
monitor the outcomes of patient care are not optimal. This thesis develops risk adjusted control chart methods
for monitoring in-hospital mortality outcomes in Intensive Care Unit (ICU) patients. It is a medical
application of statistics and machine learning to measure outcomes for quality management. Three directions are taken.
The first is the assessment of ICU models that predict the probability of death. The desirable
attributes of model performance are discrimination, the ability to separate survivors and non-survivors, and
calibration, a measure of the extent to which model risk prediction represents that patient's actual risk of
dying. An independent assessment of the APACHE III model used in the Princess Alexandra Hospital ICU
demonstrated good discrimination and calibration, and so the model was validated in this context for use as a risk adjustment tool.
The second is a study of statistical process control charts for patient mortality rate. Risk adjusted control chart
techniques are subsequently developed to incorporate the validated APACHE III probability of death estimate
to control for casemix and severity of illness. The design and performance of these control charts are studied.
The third direction is the development of an alternative model for risk adjustment. The results of preparatory
experiments using machine learning techniques, artificial neural networks (ANNs) and support vector
machines (SVMs), were comparable to those previously obtained with logistic regression. SVM models are
further investigated to model 30-day in-hospital mortality, using raw patient data from the equivalent of one
year of patient admissions. Model development is successfully guided by the desirable attributes of model performance: discrimination and calibration.
The conclusions of this study are: 1) risk adjusted control charting offers an adjunct to current methods of
ICU outcome assessment when monitoring the quality of care; 2) SVMs and ANNs are practical approaches
to model the probability of in-hospital mortality for ICU patients; 3) model development can be guided by the desirable attributes of discrimination and calibration.
The work presented in this thesis is, to the best of my knowledge and belief, original and my own work,
except as acknowledged in the text. The material has not been submitted, either in whole or in part, for a degree at this or any other university.
David A. Cook
Acknowledgements
My supervisors:
Professor Thomas Downs
Professor Annette Dobson
A/Professor Chris Joyce
Professor Tony Morton for all his encouragement, inspiration and advice
Petra Graham for collaboration to establish 30-day mortality status and the simplified disease coding
Gillian Ray-Barruel, and my sister, Janet Cook for assisting with editorial comment
Particular acknowledgement is due to my wife, Andrea and our children Sarah, Nicola and Matthew
Table of Contents
Abstract
Table of Contents
List of Figures
List of Tables
Publications Arising from Work Reported in this Thesis
Glossary and Abbreviations
1.1 Introduction
1.2 Project outline
1.3 The study approach and objectives
1.4 Perspective
2.1 Introduction
2.2 Validity of ICU mortality prediction models
2.3 Assessment of validity and generalisability of models
2.4 Indices of performance
2.4.1 Discrimination
Covariance graphs
Statistical comparison of the risk of death of survivors and non-
survivors
Classification matrices
Receiver operating characteristic (ROC) curve analysis
2.4.2 Calibration
Evaluation of overall model predictions
Calibration curves
Hosmer-Lemeshow statistics
Spiegelhalter's Z score
Model based analysis of performance
2.5 Recommendations for a practical approach to validation of ICU models that estimate
the probability of in-hospital death
2.6 Survey of models that estimate the probability of in-hospital death of ICU patients
2.7 Summary
4.1 Introduction
4.2 Considerations in monitoring in-hospital mortality of ICU patients
4.3 Control chart analysis of ICU mortality
4.4 Application of control charts to PAH ICU data
4.4.1 p chart
Design of the p chart
4.4.2 Cumulative Sum chart (CUSUM)
CUSUM statistic
Testing the significance of the CUSUM statistic
4.4.3 Estimates of the current mean: Exponentially weighted moving average
(EWMA) chart
4.5 Search for a cause of decreased in-hospital mortality
4.6 Summary
5.1 Introduction
5.2 Use of risk adjusted methods in monitoring hospital mortality outcomes
5.2.1 Issues with monitoring risk adjusted mortality
5.2.2 Application of risk adjusted monitoring to ICU mortality
5.3 Risk adjusted p chart for mortality rate
5.3.1 Risk adjusted p chart with control limits calculated by normal
approximation
5.3.2 Control chart with standardized (Z) scores
5.3.3 Risk adjusted p chart: Control limits by iterative calculation of the
cumulative probability function of the mortality rate, Rj.
5.4 Risk adjusted CUSUM
5.4.1 Background to the risk adjusted CUSUM
5.4.2 Adaptation of Poloniecki approach
5.4.3 Testing for change in odds of risk adjusted mortality with Steiner's RA
CUSUM.
5.5 Risk adjusted EWMA
5.5.1 Estimate of the distribution of EWMAj using the central limit theorem
5.5.2 An iterative approach to the estimation of the distribution of EWMAj and
calculation of control limits.
5.6 Summary and Conclusion
7.1 Overview
7.2 Method
7.3 Data
7.3.1 Three examples of pre-processing variables: worst temperature, worst mean
blood pressure, worst bilirubin
7.4 Software
7.5 SVM parameter choice guided by model attributes of discrimination and calibration
7.6 Choice of SVM kernel and parameters
7.6.1 Estimation of SVM parameter choice for RBF kernel SVM
7.6.2 Estimation of SVM parameter choice for polynomial kernel SVM
7.6.3 Summary of the approximation of SVM kernel and parameter selection
7.6.4 Investigation of values around estimated SVM parameters
7.7 Discussion
7.8 Conclusions
8.1 Summary
8.2 Original contributions
8.3 Future research
8.4 Conclusion
Appendices
Appendix 2 Analysis of mortality rate observations PAH ICU 1995 - 1997, and
estimation of in-control parameters
A2.1 Data and methods
A2.2 Results
A2.3 Conclusion
Appendix 5 Analysis of performance, choice of parameters and control limits for the
EWMA chart.
A5.1 Effect of a on in-control ARL
A5.2 Effect of 2 on ARL
A5.3 Effect of changed mortality rate on ARL
A5.4 Choice of parameters, and analysis of performance of EWMA of single
cases
Bibliography
List of Figures
1.1 Outline of the project structure
1.2 The relationship between patient factors, the process of care and random variation to patient
outcome
2.1 Example of calibration curve for APACHE III hospital mortality model
3.1 Calibration curve for the APACHE III ICU mortality models with no adjustment for hospital
characteristics
3.2 Calibration curve for the APACHE III ICU mortality models, with adjustment for hospital
characteristics (similar hospital model)
3.3 Calibration curve for the APACHE III hospital mortality models with no adjustment for
hospital characteristics
3.4 Calibration curve for the APACHE III hospital mortality models with adjustment for hospital
characteristics (similar hospital model)
3.5 Calibration curve for the APACHE III hospital mortality models adjusted for USA standard
ICU
A3.1 Operating characteristic of p chart with range of control limits for a change in mortality rate
A3.2 Semi-log plot of ARL to signal of p chart with a range of control limits for a change in
mortality rate
A3.3(a) and (b) ARL of p chart with effect of variable sample size
List of Tables
3.1 Comparison of demographics, operative status and APACHE III score between PAH ICU
admissions and APACHE III developmental sample
3.2 Predicted APACHE III hospital mortality compared to observed hospital mortality
3.3 Predicted APACHE III ICU mortality compared to observed mortality
3.4 Calibration of APACHE III ICU mortality and hospital mortality
4.1 Comparison of PAH ICU casemix for the years 1995 - 1999
A 1.1 Demographic features of the primary admissions to PAH ICU 1995 - 1999
A1.2 Admissions grouped by month of admission to the PAH ICU 1995 - 1999
A1.3 Admissions grouped into ordered blocks of 50 admissions 1995 - 1999
A3.1 Probability of signal for a range of control limit settings and changed mortality rates
A3.2 Semi-log plot of ARL in samples of 87 cases for a range of control limit settings and changed
mortality rates
Glossary and Abbreviations
Acute Physiology and Chronic Health Evaluation II (APACHE II): A score measuring severity of
illness calculated from patient physiology and laboratory data available from the first 24 hours of ICU
admission. The APACHE II score is used with diagnosis to give the APACHE II model that estimates
the probability of in-hospital death for an ICU patient. This model is publicly available and was
published in 1985 by Knaus and co-workers, and is the forerunner to the APACHE III model (see
below).
Acute Physiology and Chronic Health Evaluation III (APACHE III): A system based on patient data
and logistic regression models to predict ICU patient mortality and length of stay, developed by Knaus
and co-workers in 1991 and marketed by APACHE Medical Systems. The system includes a
database structure which guides collection of ICU activity data, and patient demographic, laboratory,
co-morbidity and outcome data. A measure of patient severity of illness, the APACHE III score is
calculated from the patient data. The APACHE III score, diagnosis, lead-time to admission and other
variables are used to estimate the risk of in-ICU and in-hospital mortality. The risk of death model is
further modified with information about the hospital to give an APACHE III risk of death estimate adjusted for hospital characteristics.
Artificial Neural Network (ANN): A machine learning method which has the architecture of a network
of interconnected simple processors (neurons), and which learns patterns in the training data, often by iterative adjustment of the connection weights.
Average Run Length (ARL): The average number of observations before the control chart statistic lies outside the control limits and the chart signals.
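As an illustration of the ARL concept (my own sketch, not from the thesis): for a Shewhart chart of standard normal observations with three-sigma limits, a single in-control observation signals with probability about 0.0027, so the in-control ARL is about 1/0.0027 ≈ 370. A short simulation, assuming independent normal observations:

```python
import random

def run_length(limit=3.0):
    """Observations until a standard normal value falls outside +/- limit."""
    n = 0
    while True:
        n += 1
        if abs(random.gauss(0.0, 1.0)) > limit:
            return n

random.seed(1)
runs = [run_length() for _ in range(2000)]
arl = sum(runs) / len(runs)
print(round(arl))  # in the vicinity of the theoretical value of about 370
```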
Calibration: With reference to the ICU mortality prediction models, calibration is how well the
estimated probability of death provided by a model reflects the patient's probability of death.
Confidence Interval (CI): This interval provides a range of values around an estimate within which the "true" value lies with a specified level of confidence (e.g. 95%).
Correct Classification Rate (CCR): The proportion of cases that are correctly classified as deaths or
survivors. It is the sum of the true positive rate and the true negative rate.
Discrimination: With reference to the ICU mortality prediction models, discrimination is the ability of
a model to provide risk of death estimates that separate survivors from the non-survivors.
Evidence Based Medicine: The integration of best research evidence with clinical expertise and patient
values (http://www.cebm.utoronto.ca).
Glasgow Coma Score (GCS): A score used to measure level of consciousness, assessed by the best
verbal response, eye response and upper limb motor response to a noxious stimulus.
Intensive Care Unit (ICU): A hospital ward where patients with life-threatening conditions are treated and closely monitored.
Mortality Probability Model (MPM II): A family of models that predict risk of in-hospital death using data
available at the time of ICU admission (MPM0 II), at 24 hours post admission (MPM24 II) and at 72
hours post admission (MPM72 II).
Odds: The odds of an event is the ratio of the probability that the event will occur, divided by the probability that it will not occur.
Odds Ratio (OR): The ratio between two sets of odds. In this application, the OR is the ratio of the odds
of death estimated by a risk of death model, and alternative odds of death where the risk of death has
changed.
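To make this concrete, here is a minimal sketch (my own illustration, not from the thesis) of applying a fixed odds ratio to a model's estimated probability of death, the operation underlying tests for a change in risk adjusted mortality:

```python
def shift_probability(p, odds_ratio):
    """Probability whose odds equal odds_ratio times the odds of p."""
    new_odds = odds_ratio * p / (1.0 - p)
    return new_odds / (1.0 + new_odds)

# Doubling the odds of a 10% predicted risk of death:
print(round(shift_probability(0.10, 2.0), 3))  # 0.182
```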
Princess Alexandra Hospital (PAH): A large, tertiary referral public hospital of approximately 800 beds.
Prospective Validation: Statistical validation of a model after its development using planned,
prospective collection of data. This is in contrast to using a subdivision of existing data into training and test samples.
Risk Adjustment (RA): A method of controlling for the casemix and severity of illness of a sample of patients.
Receiver Operating Characteristic Curve (ROC curve): A plot of sensitivity against (1 - specificity).
The area under the ROC curve provides a summary statistic of discrimination.
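The area under the ROC curve has a useful interpretation: it is the probability that a randomly chosen non-survivor is assigned a higher risk of death than a randomly chosen survivor. A small self-contained sketch (illustrative risks only, not thesis data):

```python
def roc_auc(risks, died):
    """AUC as the fraction of (non-survivor, survivor) pairs ranked correctly,
    counting ties as half."""
    dead = [r for r, d in zip(risks, died) if d == 1]
    alive = [r for r, d in zip(risks, died) if d == 0]
    wins = sum((a > b) + 0.5 * (a == b) for a in dead for b in alive)
    return wins / (len(dead) * len(alive))

risks = [0.9, 0.7, 0.6, 0.4, 0.2, 0.1]   # hypothetical risk-of-death estimates
died  = [1,   1,   0,   1,   0,   0]     # 1 = in-hospital death
print(round(roc_auc(risks, died), 3))    # 0.889
```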
Re-sampling: Repeated trials of model building using different random samples from the data available
for model building and cross-validation. At each re-sampling different data sets are used for training
and assessment.
Simplified Acute Physiology Score II (SAPS II): A model that predicts in-hospital mortality of ICU
patients.
Signal: In the control charts used in this thesis, a signal occurs when an observation or statistic exceeds
a control limit or decision threshold, indicating the likelihood that the process is not in-control.
Standardised Mortality ratio (SMR): The ratio of observed deaths to predicted deaths.
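A sketch of the SMR calculation (illustrative numbers only, my own example): the expected number of deaths is the sum of the per-patient predicted probabilities of death, so a ratio above 1 indicates more deaths than the model predicts.

```python
def smr(observed_deaths, predicted_probs):
    """Standardised mortality ratio: observed deaths over model-expected deaths."""
    expected_deaths = sum(predicted_probs)
    return observed_deaths / expected_deaths

# Hypothetical six-patient cohort with 3 observed deaths; expected deaths = 3.0
probs = [0.8, 0.6, 0.5, 0.4, 0.2, 0.5]
print(smr(3, probs))  # 1.0: mortality exactly as predicted
```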
Support Vector Machine (SVM): A machine learning system that uses a hypothesis space of linear
functions in a high dimensional feature space. It is trained with a learning algorithm from optimisation theory.
Test Set: The test set is a sub-set of the data available for model building and assessment which is used
to assess the performance of the model. It is not used to train or optimise the model.
Training Set: For model building, the training set is the data that is used to train the model.
Verification Set: For training an ANN, the verification set of the data available for model building is used
to monitor the generalisation performance of the ANN as it is trained. Training is ceased when the performance on the verification set begins to deteriorate.
The project is a medical application of statistics and machine learning to quality management. The aim
of the first chapter is to set the context, justification and aims of this work.
Evidence-based medicine is the beginning of high quality care. However, applying the best evidence to
medical practice without monitoring the outcomes is, perhaps, like playing music without listening or
painting without looking. The score, the subject or objective of our work, becomes irrelevant unless we attend to the outcome.
The purpose of this project is to present a basis for monitoring outcomes in intensive care units (ICU). I
will develop risk adjusted (RA) control charting to monitor in-hospital mortality rates. In this work, the
term "risk adjustment" refers to a method which controls for casemix and severity of illness of a patient
sample, using a statistical estimate of the probability of patient mortality. A machine learning model
developed with local Princess Alexandra Hospital (PAH) ICU data will be studied as an alternative to
the APACHE III ICU patient model to predict the probability of patient death.
An editorial call-to-arms by Benneyan and Borgman in 2003 summarised the burgeoning interest in
this area.
"Regardless of the exact method employed, the application of statistical process control ... or
related longitudinal analysis methods can significantly improve the ability to monitor clinical
... contributions to their use in health care. Fostering greater and more widespread use of these
methods, however, remains a significant challenge. Hopefully, studies ... will lead to more ..."
In 1996, the senior staff at the PAH ICU were provided with a unique opportunity. Recruitment and
increases in senior staff levels followed a period of inadequate senior staff numbers, and provided an
opportunity to introduce changes in practice. Adequate staffing meant that a broad review to establish a
uniform, accepted set of approaches and intensive care practices was possible. What began as a team
We believed, without proof, that evidence-based review of practice and standardisation of uniform
approaches would improve patient outcomes, increase efficiency and increase staff satisfaction. So, a
systematic and ongoing multi-disciplinary review of all ICU practices began with a description of the
current procedures. It progressed to a review of the evidence and adoption of a consensus of agreed practice.
Evidence-based changes in practice may not necessarily transfer from the literature to the ICU ward.
There were potential problems of applicability and relevance. Often the research evidence was
collected from patient groups who were different from our ICU patients, or perhaps represented only a
small subset of the whole patient population. The evidence of benefit for our patients was frequently
not very strong, perhaps from a controversial meta-analysis, a post hoc subgroup analysis, or an
underpowered study with an "almost significant" trend. Sometimes the evidence for continuing or
changing practice was only opinion, experience, anecdote or based on a local audit or un-referenced
local guidelines. Occasionally the best evidence was from in vitro or physiological studies.
A method was needed to demonstrate the effectiveness or otherwise of the changes. It seemed plausible that the evolution of
better patient outcomes could be recorded. De Leval and co-workers suspected an unacceptable rate
of mortality and complications in neonatal congenital heart surgery, and demonstrated that monitoring
with control charts would have detected poor performance. If this analysis could be done
retrospectively on mortality data, perhaps a similar outcome analysis could be used to guide changes at the PAH ICU.
The APACHE Medical Systems APACHE III Version 1, introduced to the PAH in August 1994,
was the database platform used for data collection. The software had user defined analysis capability
based on local data with descriptive statistics and standardised mortality ratio (SMR) comparisons
calculated from observed and expected mortality rates. The APACHE III model was developed from a
North American database and had not been validated beyond the model development sample. At the
start of 1996, 1600 patients had been entered into the PAH ICU admission database.
In an ICU with a commitment to change, the purpose of a monitoring procedure was to recognise poor
performance in a timely way, and to recognise the gradual improvements in outcome that may occur in
a complex, evolutionary system. Random variations were inevitable, so a formal statistical approach to outcome monitoring was needed.
After the practice and management changes commenced, the preliminary reports showed a trend
toward an increase in ICU patient mortality and SMR. This apparent increase nearly stopped the
process of change, though we elected to continue with the ongoing review and the commitment to
standardised best practice. The reports of SMRs were reviewed monthly, quarterly or semi-annually.
In 1997, I introduced charts to the ICU, plotting observed against predicted outcome following the
"Variable Life Adjusted Display" chart in cardiac surgery. Qualitative analysis of RA outcomes now
supplemented the SMR results. However, three issues prevented further development of this analysis at
the PAH ICU. There had been no independent validation of the APACHE III model. The early charts
were plagued by incomplete data. The formal statistical methods to analyse RA data in ICU were not
available. The solution to these problems, and others, is the topic of this thesis.
1.2 Project Outline
This thesis provides an approach to the validation and application of RA tools to monitoring outcomes
in an ICU. There is particular reference to the techniques of control charting and analysis, adapted from
the process control techniques of industry. There are three key areas:
After this introductory chapter, the second chapter reviews the methods to assess the performance of
models that estimate the probability of death in ICU patients. The third chapter uses the best approach
to validation to assess the performance of the APACHE III models at the PAH, Brisbane. The fourth
chapter applies a selection of established control chart approaches to the ICU's raw mortality statistics.
The fifth chapter extends this application, introducing and developing RA control chart methods for use
in monitoring ICU mortality, using the APACHE III system estimates as the RA tool. The sixth and
seventh chapters review the application of machine learning techniques to modelling ICU outcome, and
develop artificial neural networks and support vector machine models to predict ICU mortality. The
final chapter presents a summary and suggestions for areas of further research.
Figure 1.1: Outline of the Project Structure
[Diagram: Chapter 2 leads into Chapters 3, 4 and 6. Chapter 3: Prospective validation of APACHE III to the PAH ICU. Chapter 4: Application of process control charting to ICU mortality statistics, leading to Chapter 5: Development of risk adjusted process control techniques to monitor quality of care in the ICU. Chapter 6: Machine learning to model 30-day in-hospital mortality of ICU patients, leading to Chapter 7: Development of support vector machines for estimating probability of death, as an alternative risk adjustment tool. Chapters 5 and 7 lead to Chapter 8: Summary and Conclusions.]
Figure 1.1 displays the logical relationship between the parts of this work.
Some of the working, analysis, tables, summaries and reference formulae and derivations are included
in the appendices.
1.3 The Study Approach and Objectives
The framework to conduct this research is based on a model that extends Iezzoni's "Algebra of Risk".
She proposed that health outcomes depend on a combination of patient factors, treatment effectiveness
and random events. I have expanded the model to include the process of care as part of the relationship.
This model proposes that variation in the quality of the process of care will affect the relationship
between patient factors and the outcomes. The framework for the present study is based on this expanded model.
At presentation, patients have characteristics that predispose them to their outcome. Some of these
characteristics can be measured or recorded including diagnosis, age, co-morbid conditions and
physiological state (severity of illness). ICU mortality prediction models use information that is
available within the first 24 hours of admission as the patient's initial response to therapy unfolds. For
example, for the same traumatic event needing ICU care, a fit young patient with no co-morbidity will
be less likely to die than an elderly, infirm patient with numerous serious, co-morbid conditions.
Imprecision, the influence of unmeasured factors including quality of care, and random variation in
measurement, therapy and data collection, will introduce uncertainties in the predictions. A good model
will not only discriminate between the likelihood of death and survival, but will provide a risk estimate that accurately reflects each patient's probability of death.
This thesis begins by reviewing methods to evaluate and validate an ICU mortality prediction model.
The performance of the APACHE III system is then critically evaluated. If the validity of the APACHE
III model in the PAH ICU is established, then the APACHE III model can be used as a RA tool and RA
outcome monitoring for the ICU can be developed. Subsequently, alternative models to APACHE III are developed.
In the absence of other changes, if the quality of patient care in an ICU deteriorates, then the RA
mortality rate could be expected to rise. A previously accurate model would then under-estimate the
true risk of patient mortality. Similarly, if the quality of care increases, then the RA mortality rate will fall.
A recent publication by Spiegelhalter et al. suggests that a RA analysis would have shown a
deviation from expected performance in the paediatric cardiac surgical data of the Bristol Infirmary
(1984 - 1995) and in the deaths of elderly patients under the care of Harold Shipman (1979 - 1997).
RA monitoring and detection of poor performance may have reduced the number of fatalities in these cases.
It is possible to improve the performance of the ICU prediction models by collecting information about
the progress of the patient beyond the initial day of ICU care, by following the progress of organ
failure, complications and the patient's response to therapy. However, improving the accuracy of the
prediction using variables that reflect therapeutic complications and the patient's progress beyond the
initial period of observation, may limit the ability to detect changes in RA mortality rate due to
variation in the quality of the care. For this reason, only the information from the initial day of ICU care is used for risk adjustment.
When variation between actual and predicted outcomes is detected it could be due to random variation,
or could signal a change in the modelled relationship between the patient factors and survival.
Statistical analysis can assess the probability of the differences being due to random variation. When a
model no longer fits the data, a search for the cause, and an appraisal of the monitoring procedure,
Lim described the issues that face the use of statistical process control charts in clinical practice.
Casemix or RA must be incorporated to make raw data interpretable. The techniques must be able to
detect small true shifts in the mortality rate because small changes in mortality are important. In the
medical setting there are relatively few patients compared to the large number of items in industrial
applications for which control charts were originally developed. Techniques requiring the accumulation of large samples before a decision can be made are therefore less suitable.
My work is a real application of process monitoring and demonstrates that these challenges are soluble,
and that RA charting is a useful adjunct to quality measurement and the monitoring of mortality
outcomes in the ICU. To begin, I apply standard process control methods to analysis of the raw
mortality data of all eligible patients admitted to the ICU. The APACHE III model is incorporated into
the chart design as a RA tool and so RA outcome monitoring is developed. Analysis and understanding
of the performance of these techniques in terms of false alarms and detection of a change in mortality is
the basis for design. RA monitoring developed in the following chapters will provide information about the quality of care.
Ultimately, it may be possible to track both poor performance and the evolutionary, incremental improvements in care.
When an externally developed model of ICU mortality fails local independent validation, or a current
model no longer fits the data, the model needs to be revised to take account of patient factors.
Preliminary modelling on the PAH ICU dataset using machine learning techniques (artificial neural
networks and support vector machines: SVMs) demonstrated comparable performance to previous
logistic regression. SVMs have not previously been studied in application to estimating the probability of ICU
mortality, so further SVM model development is conducted to estimate the probability of ICU patient
death. The challenge of model development on a large clinical dataset is addressed, and the solution provides adequate model performance on a practical dataset size.
In Chapter 6, the general regression neural network is used to correct the poor calibration of the multi-
layer perceptron artificial neural networks, and SVMs. In Chapter 7, SVM parameter choice, guided by
model discrimination (maximizing area under the ROC curve) and calibration (minimizing H-L C
statistic) on test data is used to develop SVMs that model raw patient data to approximate the
probability of death.
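The selection logic just described can be sketched in outline. The scores below are hypothetical, and the cutoff of 15.5 is the conventional chi-squared critical value (8 degrees of freedom, 5% level) often quoted for the H-L C statistic; this is my illustration, not the thesis procedure verbatim:

```python
# Hypothetical test-set scores for candidate SVM settings:
# (C, gamma) -> (area under ROC curve, Hosmer-Lemeshow C statistic)
scores = {
    (1.0, 0.01):  (0.82, 25.3),
    (1.0, 0.1):   (0.86, 14.2),
    (10.0, 0.1):  (0.88, 31.0),  # highest AUC, but poorly calibrated
    (10.0, 0.01): (0.85, 9.7),
}

def choose_parameters(scores, hl_cutoff=15.5):
    """Keep acceptably calibrated settings, then maximise discrimination."""
    calibrated = {k: v for k, v in scores.items() if v[1] < hl_cutoff}
    return max(calibrated, key=lambda k: calibrated[k][0])

print(choose_parameters(scores))  # (1.0, 0.1)
```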
1.4 Perspective
The potential for RA outcome monitoring goes beyond the application I describe in the following
chapters. I have chosen to work on RA charting in an ICU, because the care of critically ill patients is
my area of professional expertise. However, the solutions and methods that I have developed in the
PAH ICU context are widely applicable. Recent publications from the United Kingdom's Medical
Research Council Unit in Cambridge (Spiegelhalter et al.) and the United States Department of
Veterans Affairs (Render et al.) indicate the wide interest in this topic, and show what a large
experienced team could achieve in the development and application of RA analysis to improve clinical
care.
My work uses a retrospective analysis to track changes in RA outcome, and proposes novel techniques
for the measurement of quality in the ICU. This is important, as it documents the magnitude of
improvement in patient survival. The change is of the order of 20 additional ICU survivors from the equivalent of one year of patient admissions.
In industry, the introduction of statistical process control after 1945 caused a revolution as the quality-
productivity dilemma (that increasing production results in inferior quality) was disproved. Near real
time monitoring of the quality of care in hospitals using RA charting methods may offer similar benefits.
Chapter 2
2.1 Summary
Models that estimate the probability of death of intensive care unit (ICU) patients can be used to
stratify patients according to the severity of their condition and to control for casemix and severity of
illness. The process of risk adjustment (RA) is needed in quality monitoring, research, administration
and management, and as an aid to clinical decision making. Models such as the Mortality
Prediction Model (MPM) family, SAPS II, APACHE II, APACHE III and the organ system
failure models provide estimates of the probability of in-hospital death of ICU patients.
This chapter considers the validity of models that estimate the risk of death of ICU patients. It also
considers, in detail, methods to assess the performance of these models. The key attributes of a model
that accurately estimates the probability of death are discrimination (the accuracy of the ranking in
order of probability of death) and calibration (the extent to which the model's prediction of probability
of death reflects the true risk of death). These attributes should be assessed in existing models that
predict the probability of patient mortality, and in any other model that is developed for the purposes of risk adjustment.
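The calibration attribute can be checked numerically with a Hosmer-Lemeshow-style statistic, which orders patients by predicted risk, groups them (typically into deciles) and compares observed with expected deaths in each group. A minimal sketch on simulated, well-calibrated data (my illustration; the thesis applies the H-L C statistic to real patient data):

```python
import random

def hosmer_lemeshow_c(probs, died, groups=10):
    """H-L C statistic: sum over risk-ordered groups of
    (observed - expected)^2 / (n * pbar * (1 - pbar))."""
    order = sorted(range(len(probs)), key=lambda i: probs[i])
    size = len(probs) // groups
    c = 0.0
    for g in range(groups):
        idx = order[g * size:] if g == groups - 1 else order[g * size:(g + 1) * size]
        observed = sum(died[i] for i in idx)
        expected = sum(probs[i] for i in idx)
        pbar = expected / len(idx)
        c += (observed - expected) ** 2 / (len(idx) * pbar * (1.0 - pbar))
    return c

# Simulated cohort whose deaths follow the predicted probabilities exactly,
# so the statistic should be unremarkable (of the order of its degrees of freedom).
random.seed(0)
probs = [random.uniform(0.05, 0.95) for _ in range(500)]
died = [1 if random.random() < p else 0 for p in probs]
print(round(hosmer_lemeshow_c(probs, died), 1))
```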
The literature contains a range of approaches which are reviewed, and a survey of the methodologies
used in the assessment of ICU mortality models is presented in Table 2.1 at the end of this chapter. A
straightforward method is described to assess existing models and to assist with development of new
models.
In Chapter 3, the performance of the APACHE III system at the Princess Alexandra Hospital
(PAH) ICU is assessed in detail to evaluate its potential as a RA tool. The characteristics of
discrimination and calibration studied in this chapter will guide the development of novel models using
machine learning techniques, and will be used to assess their performance as tools in control chart analysis of ICU
outcomes.
For the purposes of this study, RA is considered as a method for controlling for the variation in patient casemix and severity of illness.
The available models were conceived through discussions with experts. The most contributory sub-set
of those variables initially collected was used for modelling. The patient related variables were drawn
from the domains of the patient's acute physiological disturbance, physiological reserve, diagnosis or
procedure. For modelling ICU death the importance of these factors has been consistently demonstrated,
and variables such as physiology, investigations, age, co-morbidities, diagnosis and lead-time are
used in all models of this type. The consistency of their use supports the construct validity of
this approach to models for predicting ICU mortality and makes clinical sense.
External validity refers to the accuracy of the predictions as estimates of the probability of in-hospital
death. It is this statistical, objective evaluation of the accuracy of the model that is the focus of
performance assessment. The generalisability of the model's performance is its predictive capacity on
independent data not used for model training, or its ability to provide accurate predictions in a new
sample of patients.
Survival to hospital discharge, as a measure of ICU outcome, raises separate validity issues. It is easily
collected and the definition is unambiguous. Despite this, there are important issues with hospital
transfer and discharge procedures, notification of death and identification of cause of death which can
interfere with the interpretation and validity of a seemingly concrete endpoint. Thirty-day in-hospital
mortality, or 6 month or longer term survival may offer better defined endpoints. Elsewhere, I have
shown that modelling patient in-hospital mortality at 30-days post ICU admission offers the advantage
of constant definition and analysis of complete patient data and outcomes ^^. In subsequent chapters, I
will develop machine learning RA tools using 30-day in-hospital mortality as the endpoint.
Assessment of model performance should be based on data collected prospectively, according to a clear protocol. The study should involve a complete series, or a representative sample, of a population of ICU patients with all outcomes accounted for. The model predictions must not be used to influence clinical practice and clinical decision-making.
There are many types of study used to assess the validity of models for ICU mortality. Examples of studies that assess ICU mortality models are given in Table 2.1 at the end of this chapter. Ridley evaluates models according to methodological rigour. Justice and co-workers described the performance of models using the concepts of reproducibility (accuracy of the model on a new sample of patients at the institution where the model was developed) and transportability (accuracy on different patient groups, at different institutions or at different times). There is a similar concept in the literature on the accuracy of diagnostic tests, referred to as the transferability of test results.
Independent validation can be done with a single model, or as a comparison between two or more models. The findings will be relevant to the performance of each model in the context studied, but will also add to the overall experience of the generalisability of the model assessed. The paper by Pappachan et al. and an example from my work provide independent assessments of APACHE III at a single institution. Rowan et al. report a multi-centre evaluation of APACHE II predictions in the United Kingdom and Ireland. Multiple model comparisons on the same dataset are presented by Castella et al. (APACHE II, SAPS II and MPM II) and Beck et al. (APACHE II and APACHE III).
Non-independent validation occurs where the modelling and validation processes are not completely separate. A non-independent validation study may include performance of the model on the training and test data as part of the model development process. Examples are the original development reports of APACHE II, MPM II, SAPS II and APACHE III. Sometimes the validation data is collected to follow on from the development data. A study may demonstrate that the model results are reproducible on data collected following the development set of data on which the model was first estimated, and in the context where the model was developed. Examples of this prospective evaluation at the development site include the organ system failure model of Marshall et al. and artificial neural network and logistic regression models.

The two key attributes of model performance are discrimination and calibration. Discrimination is the ability of the model to separate survivors from non-survivors. Calibration is how well a model's risk predictions represent the patients' actual risk of dying.
At an early stage in the patients' clinical care, overlap between the characteristics of survivors and non-survivors means that predictions will always be estimates of the probability of death. A model cannot display both perfect calibration and discrimination where there is any uncertainty in the classification of patients into those who live and those who die. A model with perfect discrimination would correctly rank patients so that there is no overlap in predicted probability of death between survivors and non-survivors. Perfect calibration would occur if all the survivors have an estimated risk of death of zero and all deaths have an estimated risk of 1. Perfect calibration and perfect discrimination are not possible in practice, and random and unmeasured factors mean that models can only estimate a probability of death.
Other criteria for assessment of the agreement between model predictions and observed outcomes have been proposed. "Trustworthiness" and "reliability", "calibration in the large" and "calibration in the small" all seek to capture characteristics of models. Few, if any, have been applied to the assessment of models that predict the probability of in-hospital death for ICU patients.
2.4.1 Discrimination
As ICU models are not used for decisions of preferentially withholding or withdrawing therapy, discrimination at decision thresholds has limited application in clinical practice. It is, however, a useful statistical concept. A number of approaches exist, with classification matrices and receiver operating characteristic (ROC) curves being the most common methods used.
Covariance Graphs
Covariance graphs show the frequency of the estimated probabilities of death for survivors and non-survivors. They provide a visual, qualitative assessment of the separation of the estimates given to the two outcomes, and are commonly used to evaluate diagnostic tests which classify patients according to the presence or absence of a disease. Though used in the anaesthesia and intensive care literature to illustrate the principles of discrimination, the method has not been used for models of ICU mortality. The covariance graph is mathematically related to, and can be easily transformed into, the ROC curve.
Some authors have statistically examined the separation between the estimated probabilities of death of survivors and non-survivors. A non-parametric approach (Wilcoxon rank sum test/Mann-Whitney U test) must be used because of the non-normality of the distributions of estimated probabilities, particularly with small numbers. This method is a test of the hypothesis that there is no difference between the median estimated risks of death of survivors and non-survivors, i.e. that the model is no better than chance, and is equivalent to comparing the area under the ROC curve to 0.5 (see later).
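The equivalence between the rank-based statistic and the ROC area can be sketched as follows. This is an illustrative Python sketch with hypothetical predicted risks, not code or data from this thesis: the area under the ROC curve equals the Mann-Whitney U statistic divided by the number of survivor-death pairs.

```python
# Sketch: AUC as the probability that a randomly chosen death receives a
# higher predicted risk than a randomly chosen survivor. Ties count 0.5,
# as in the Mann-Whitney U statistic. Data below are purely illustrative.
from itertools import product

def auc_from_ranks(p_survivors, p_deaths):
    """AUC = U / (n_deaths * n_survivors), computed over all pairs."""
    pairs = list(product(p_deaths, p_survivors))
    u = sum(1.0 if d > s else 0.5 if d == s else 0.0 for d, s in pairs)
    return u / len(pairs)

surv = [0.05, 0.10, 0.20, 0.30, 0.40]   # predicted risks, survivors
died = [0.25, 0.50, 0.70, 0.90]         # predicted risks, deaths
print(auc_from_ranks(surv, died))       # 18 of 20 pairs correctly ordered
```

An AUC of 0.5 corresponds to no separation between the two groups, which is exactly the null hypothesis of the rank sum test.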
Classification Matrices
Classification matrices tabulate true positive (TP), true negative (TN), false positive (FP) and false negative (FN) counts, sensitivity, specificity and correct classification rate at various thresholds of risk. They provide a common approach (Table 2.1), but there are no standard thresholds and there is no single summary measure of performance. In specific circumstances, where the assessment of model performance includes the costs of incorrect classification, knowing the characteristics of model performance at thresholds is essential. Such a study by Glance et al. examined the cost-effectiveness of using a risk estimation model to withdraw therapy.
TP, TN, FP, FN, sensitivity and specificity are not context-free properties of a model. Changes in the definition of the outcome, the quality of data collection and changes in the distribution of proportions of survivors and non-survivors will cause the measured discrimination of the model to fall. These factors can cause deterioration of discrimination in the same way as described by Irwig et al. for diagnostic tests. The correct classification rate (test accuracy or test efficiency) and the positive and negative predictive values are likewise of limited use for model assessment because they depend on the mortality rate of the sample.
Confidence intervals (CI) quantify precision, and they can generally be calculated for the entries of classification matrices. The Standards for Reporting of Diagnostic Accuracy (STARD) initiative calls for "statistical methods to quantify uncertainty" in assessing medical diagnostic tests, and the same standard can be expected for the assessment of models that estimate the probability of in-hospital death.
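A classification matrix at a threshold, with uncertainty quantified as STARD suggests, can be sketched as below. This is an illustrative Python sketch using a normal approximation to the binomial for the confidence intervals; the threshold and data are hypothetical, not drawn from the thesis datasets.

```python
# Sketch: classification counts and sensitivity/specificity with
# normal-approximation 95% confidence intervals at a chosen risk threshold.
import math

def classify(probs, outcomes, threshold):
    tp = sum(1 for p, y in zip(probs, outcomes) if p >= threshold and y == 1)
    fn = sum(1 for p, y in zip(probs, outcomes) if p < threshold and y == 1)
    tn = sum(1 for p, y in zip(probs, outcomes) if p < threshold and y == 0)
    fp = sum(1 for p, y in zip(probs, outcomes) if p >= threshold and y == 0)
    sens, spec = tp / (tp + fn), tn / (tn + fp)
    ccr = (tp + tn) / len(outcomes)            # correct classification rate

    def ci(p, n):  # normal approximation to the binomial proportion
        se = math.sqrt(p * (1 - p) / n)
        return (p - 1.96 * se, p + 1.96 * se)

    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn,
            "sens": sens, "sens_ci": ci(sens, tp + fn),
            "spec": spec, "spec_ci": ci(spec, tn + fp), "ccr": ccr}
```

Repeating the call at thresholds such as 0.1, 0.5 and 0.9 reproduces the kind of matrix reported in Table 2.1.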
Receiver Operating Characteristic Curves

The ROC curve provides a representation of all possible pairs of sensitivity and specificity at every threshold in the range of predictions. The curve is a plot of sensitivity against (1 - specificity), and the area under the curve (with standard error or confidence intervals) provides a summary statistic of discrimination. The area under the ROC curve provides a measure that is independent of criteria for decision thresholds. Inspection of the curve and comparison with other curves provides further qualitative analysis.
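The construction of the curve, and the trapezoidal area calculation mentioned later in this section, can be sketched as follows. This is an illustrative Python sketch on hypothetical data: every observed prediction is used in turn as the threshold, giving the (1 - specificity, sensitivity) points, and the area is integrated by trapezoids.

```python
# Sketch: build the empirical ROC curve by sweeping thresholds, then
# integrate the area under it with the trapezoid rule. Data illustrative.
def roc_points(probs, outcomes):
    thresholds = sorted(set(probs), reverse=True)
    pos = sum(outcomes)                 # number of deaths
    neg = len(outcomes) - pos           # number of survivors
    pts = [(0.0, 0.0)]                  # curve starts at the origin
    for t in thresholds:
        tp = sum(1 for p, y in zip(probs, outcomes) if p >= t and y == 1)
        fp = sum(1 for p, y in zip(probs, outcomes) if p >= t and y == 0)
        pts.append((fp / neg, tp / pos))   # (1 - specificity, sensitivity)
    return pts                          # lowest threshold reaches (1, 1)

def trapezoid_auc(pts):
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))
```

With discrete predictions the curve is the "staircase" described below; the trapezoid area agrees with the rank-based calculation.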
The area under the ROC curve is the probability that a randomly selected death will have a higher predicted risk than a randomly selected survivor. In the context of models to predict ICU mortality, the area under the ROC curve provides a measure of the model's ability to rank the patients in order of probability of death. It provides a test of whether the model is better than chance at separating survivors and non-survivors. An area of 0.5 means that the model's discrimination between deaths and survivors is no better than chance.

There are drawbacks to the use of the ROC curve. The threshold of risk of death cannot be read directly off the axis, and performance at thresholds or in intervals of risk of death cannot be directly compared.
When performances of prediction are compared, it is advisable to visually inspect the plots, or to calculate confidence intervals. Murphy-Filkins et al. quoted a rule of thumb that ICU model performance is acceptable if the ROC area is > 0.7, good if > 0.8 and excellent if > 0.9. Rosenberg holds the opinion that areas under the ROC curve of "... 0.8 or better are expected for mortality predictions in current models, and scores of 0.7 or less are considered to be unacceptable." The references on which he bases this provide only opinion. My conclusion, drawn from a review of the published experience with classification performance of ICU models (Table 2.1), is that Rosenberg's position is correct. Contemporary ICU models that estimate probability of death should have an area under the ROC curve in the range of 0.80 to 0.90, with less than 0.70 being unacceptable. Steen gives an opinion that 0.90 may approach the upper limit of generalised discrimination performance for models based on biological measurements.
For small datasets, a non-parametric method to estimate the ROC area based on the Wilcoxon rank sum test (Mann-Whitney U test), or the calculation based on a series of trapezoids, is appropriate. With large datasets, the "staircase" effect of discrete values is less important, and parametric curve fitting and non-parametric approaches provide minimal differences in the calculated values of the area under the ROC curve and the standard error. The standard error can be used to compare the areas under ROC curves and to estimate confidence intervals. Where the performances of two models are compared on the same dataset, an alternative non-parametric method using paired data is appropriate.
At any decision threshold, the likelihood ratio is the ratio of the predicted probability of death in a patient who subsequently dies to the predicted probability of death in a patient who subsequently survives. It is the slope of the ROC curve, being the change in sensitivity divided by the change in (1 - specificity) over a given range of values. It provides important information about the performance of a model. The likelihood ratio links the pre-test and post-test probabilities of mortality.
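The link between pre-test and post-test probability can be sketched with one common form, the positive likelihood ratio at a threshold. This is an illustrative Python sketch with hypothetical numbers, not a calculation from the thesis data.

```python
# Sketch: positive likelihood ratio (sensitivity / (1 - specificity)), i.e.
# the ROC slope at a threshold, and the odds-based pre-test -> post-test
# probability update. All inputs are illustrative.
def positive_lr(sens, spec):
    return sens / (1 - spec)

def post_test_prob(pre_test_prob, lr):
    pre_odds = pre_test_prob / (1 - pre_test_prob)
    post_odds = pre_odds * lr          # likelihood ratio multiplies the odds
    return post_odds / (1 + post_odds)

# e.g. a pre-test mortality of 0.2 with LR+ of 4 gives a post-test
# probability of 0.5
lr = positive_lr(sens=0.8, spec=0.8)
print(post_test_prob(0.2, lr))
```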
2.4.2 Calibration.
Calibration is an attribute of model performance that reflects the extent to which a risk prediction represents that patient's actual risk of dying. An overall summary statistic, like the standardised mortality ratio (SMR), provides information about how the overall mortality rate agrees with the mortality prediction for the sample. Other statistics and scores have been suggested, in non-ICU applications, by Brier, Yates, Hilden and Flora. Goodness-of-fit approaches analyse model fit in risk strata, grouping patients according to estimated risk (Hosmer and Lemeshow, Spiegelhalter).
To assess the calibration of a model, a global assessment of calibration and an analysis of fit in risk intervals are required. Global statistics assess the overall fit of the sample of patients. They compare the observed number of deaths to the predicted number of deaths. Large departures will indicate failure of the model to predict the probability of patient death in that context.
1. Standardised mortality ratio

The SMR is a commonly used statistic. It is the ratio of observed deaths to predicted deaths, or observed mortality rate to predicted mortality rate (Equation 2.1). Patients are indexed by $i$; $\pi_i$ is the estimate of the probability of death provided by the model, and $y_i$ is 1 for a patient who dies and 0 for a patient who survives:

$$\mathrm{SMR} = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} \pi_i} \qquad (2.1)$$

The hypothesis that there is no difference between the observed mortality rate and the predicted mortality rate can be formally tested by chi-squared or binomial methods. Confidence intervals (CI) estimate the precision of the SMR based on various assumptions about the relevant sampling distribution. A number of approximations of the binomial distribution have been reviewed. A useful estimate of the standard error of the observed number of deaths $\sum_{i=1}^{n} y_i$ arises from the variances of all the individual outcomes:

$$SE = \sqrt{\sum_{i=1}^{n} \pi_i (1 - \pi_i)}$$

giving an approximate 95% confidence interval for the SMR of

$$\mathrm{SMR} \pm 1.96 \cdot \frac{\sqrt{\sum_{i=1}^{n} \pi_i (1 - \pi_i)}}{\sum_{i=1}^{n} \pi_i}$$
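The SMR and its approximate confidence interval can be sketched as below. This is an illustrative Python sketch assuming the binomial-variance standard error described above; the data are hypothetical.

```python
# Sketch: SMR = observed deaths / expected deaths, with an approximate 95%
# CI from the binomial variance of the predicted probabilities.
import math

def smr_with_ci(outcomes, probs):
    observed = sum(outcomes)                    # y_i are 0/1 outcomes
    expected = sum(probs)                       # sum of predicted risks pi_i
    smr = observed / expected
    se = math.sqrt(sum(p * (1 - p) for p in probs))  # SE of observed count
    half = 1.96 * se / expected
    return smr, (smr - half, smr + half)
```

An SMR whose confidence interval excludes 1 suggests that observed mortality departs from the model's prediction in that context.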
2. Flora's Z score

Flora's Z score is similar to the method of using the SMR with confidence intervals and has the same advantages and limitations. It compares the observed number of deaths and the predicted number of deaths using the difference rather than the ratio of these numbers; the statistic is then standardised:

$$Z = \frac{\sum_{i=1}^{n} y_i - \sum_{i=1}^{n} \pi_i}{\sqrt{\sum_{i=1}^{n} \pi_i (1 - \pi_i)}}$$

The Flora score has been used in a Greek study to compare APACHE II and SAPS II in a single institution.
Where there is a large dataset, a chi-squared statistic has been proposed by Miller and Hui, though it has not been used to assess ICU models that predict in-hospital mortality. The Hosmer-Lemeshow (H-L) statistics, described below, can be regarded as a special case of this chi-squared statistic, where continuous predictions are grouped into risk intervals.

The mean squared error (MSE) of the predictions, also known as the Brier score, can be used to assess calibration:

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (\pi_i - y_i)^2$$
This method is particularly useful where very large datasets exist. The MSE can be decomposed into components that reflect characteristics of outcome prevalence (and variance of the outcome), bias, noise, model complexity and over-fitting. The MSE has several drawbacks. Whether the model is overestimating or underestimating the probability of death is not apparent from the MSE. Also, the MSE is dependent on the mortality rate, which is a characteristic of the context as much as of the model.
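The score itself is a one-line computation; the following illustrative Python sketch (hypothetical data, not thesis data) shows it directly.

```python
# Sketch: Brier score / MSE of probability predictions against 0/1 outcomes.
# A perfectly confident, correct model scores 0; an uninformative model that
# always predicts 0.5 scores 0.25.
def brier_score(outcomes, probs):
    n = len(outcomes)
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / n
```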
Calibration Curves
Calibration curves allow qualitative evaluation of model fit across risk intervals and are widely used to evaluate ICU mortality models. The agreement between predicted and observed mortality in intervals defined by predicted risk can be displayed graphically with a curve comparing model predictions with observed frequencies. Patients are grouped into contiguous intervals of predicted risk. For each risk interval, the mortality rate is plotted against the mean estimated probability of death. A perfectly calibrated model will have a calibration curve with a slope of 1 and an intercept at the origin. For assessment of ICU models, groups can usually be defined by intervals of 0.1 or 0.05 in the estimated probabilities of death, depending on sample size. Figure 2.1 is an example of a calibration curve of the APACHE III model for the PAH ICU database 1995 - 1997. The full description of this analysis is given in Chapter 3.

This graph gives a visual representation of the agreement, or calibration, at each level of risk. Small data samples can produce irregular or noisy curves or empty risk intervals, though smoothing functions are available. Small numbers are particularly likely in the strata of higher risk of death.
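The grouping step behind such a curve can be sketched as below. This is an illustrative Python sketch assuming contiguous intervals of width 0.1, as described above; plotting and the per-interval confidence intervals are omitted for brevity.

```python
# Sketch: group patients into contiguous predicted-risk intervals and
# return (mean predicted risk, observed mortality rate, n) per non-empty
# interval - the points of a calibration curve.
def calibration_points(outcomes, probs, width=0.1):
    bins = int(round(1.0 / width))
    points = []
    for k in range(bins):
        lo, hi = k * width, (k + 1) * width
        idx = [i for i, p in enumerate(probs)
               if lo <= p < hi or (k == bins - 1 and p == 1.0)]
        if idx:
            mean_pred = sum(probs[i] for i in idx) / len(idx)
            observed = sum(outcomes[i] for i in idx) / len(idx)
            points.append((mean_pred, observed, len(idx)))
    return points
```

Plotting observed against mean predicted for each point, a well-calibrated model lies near the 45-degree line.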
The CI for the observed mortality rate in each interval can be estimated based on a normal approximation to the binomial. The relationship between the calibration curve and its degree of estimated variation allows qualitative appraisal of model performance, and is useful for making comparisons between models. In each interval, CIs give an indication of the precision of the mortality estimate and account for sub-group size and random variation. However, the sub-groups in each interval are not independent. As an alternative to CIs, some authors 30,34,36,48,49,76,77 include a histogram of patient numbers by interval. Either approach allows the reader to infer the likely precision of the observed mortality rates.

However, the calibration curve approach does not provide a way to test hypotheses about the adequacy of fit across all intervals. The lack of independence of the values, the issues of small numbers and precision, and the problems of false positives and multiple testing limit the quantitative inferences that can be drawn.
Hosmer-Lemeshow statistics
The Hosmer-Lemeshow (H-L) statistics (C and H) were proposed for the assessment of logistic regression models. Their use has been adapted and extended to include prospective, independent model
validation.
H-L tests compare the observed against the predicted numbers of deaths and survivors in intervals of risk. Most applications that assess the calibration of ICU models have used 10 risk intervals. For the C statistic, patients are ranked according to predicted risk of death and divided into 10 near-equal groups. The H statistic uses the sample divided into 10 contiguous risk intervals of equal width but unequal number. The C and H statistics are chi-squared-like statistics calculated from a 4 x 10 table of observed and estimated mortality and survival. The value of the test statistic is compared with the chi-squared distribution.

The number of degrees of freedom when the model is being assessed on the developmental dataset is the number of risk strata minus 2. By convention in the ICU literature, there are usually 10 deciles of risk, and so 8 degrees of freedom. The degrees of freedom for the chi-squared distribution when prospective independent validation is performed are equal to the number of risk intervals. Again, by convention there are 10 intervals, though 8 or 9 intervals are reported when there are small samples. With the H statistic, the risk intervals may contain few or no cases, requiring combination of intervals.
As with all of the methods for model assessment studied in the current context, the H-L statistics are vulnerable to changes in the patient casemix and the distribution of severity of illness, as well as model fit. The power of the analysis will depend on the sample size. Small samples tend to lack power to recognise poor fit. Conversely, large samples are more likely to suggest poor fit. Rowan et al. state that "A significant departure from the null hypothesis does not necessarily imply a bad fit, just that imperfections are of such a size that they can be detected in a large sample size". For comparisons on the same dataset, it is the magnitude of the chi-squared statistic that indicates the better model fit. In practice, it is necessary to decide on non-statistical grounds what level of fit is clinically acceptable. Notwithstanding the effect of sample size and the inconsistencies in their use, the H-L statistics are widely used, and provide useful context-specific information about calibration.
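The C statistic can be sketched as follows. This is an illustrative Python sketch, not the thesis implementation: patients are ranked by predicted risk, split into near-equal groups, and the chi-squared contributions of observed versus expected deaths and survivors are summed. Grouping and degrees-of-freedom conventions vary, as noted above.

```python
# Sketch of the H-L C statistic: rank by predicted risk, form near-equal
# groups, and sum (O - E)^2 / E over the death and survivor cells of each
# group. Compare the result against a chi-squared distribution with the
# appropriate degrees of freedom.
def hosmer_lemeshow_c(outcomes, probs, groups=10):
    order = sorted(range(len(probs)), key=lambda i: probs[i])
    n = len(order)
    stat = 0.0
    for g in range(groups):
        idx = order[g * n // groups:(g + 1) * n // groups]
        if not idx:
            continue
        obs_d = sum(outcomes[i] for i in idx)       # observed deaths
        exp_d = sum(probs[i] for i in idx)          # expected deaths
        obs_s = len(idx) - obs_d                    # observed survivors
        exp_s = len(idx) - exp_d                    # expected survivors
        if exp_d > 0:
            stat += (obs_d - exp_d) ** 2 / exp_d
        if exp_s > 0:
            stat += (obs_s - exp_s) ** 2 / exp_s
    return stat
```

Summing over both cells of each group is algebraically the same as the familiar per-group form $(O_g - E_g)^2 / (n_g \bar{\pi}_g (1 - \bar{\pi}_g))$.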
Spiegelhalter's Z score
A similar approach to the H-L H statistic has been proposed by Spiegelhalter. His method calculates standardised Z scores for each of 10 risk intervals based on the observed mortality rate and the standard error of the risk estimates in the interval. It is assumed that Z will be approximately normally distributed, and scores greater than 1.96 or less than -1.96 imply that the model is poorly calibrated in that interval.
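The per-interval scores, and the overall sum-of-squares statistic built from them, can be sketched as below. This is an illustrative Python sketch assuming equal-width intervals and the binomial standard error; it is not the thesis implementation.

```python
# Sketch: per-interval Z scores (observed vs expected deaths, standardised
# by the binomial SE of the predicted risks) and their sum of squares,
# which is compared against a chi-squared distribution overall.
import math

def spiegelhalter_scores(outcomes, probs, width=0.1):
    bins = int(round(1.0 / width))
    zs = []
    for k in range(bins):
        lo, hi = k * width, (k + 1) * width
        idx = [i for i, p in enumerate(probs)
               if lo <= p < hi or (k == bins - 1 and p == 1.0)]
        if not idx:
            continue
        obs = sum(outcomes[i] for i in idx)
        exp = sum(probs[i] for i in idx)
        se = math.sqrt(sum(probs[i] * (1 - probs[i]) for i in idx))
        if se > 0:
            zs.append((obs - exp) / se)   # |z| > 1.96 flags the interval
    return zs, sum(z * z for z in zs)
```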
Overall, the Spiegelhalter score provides an assessment of the fit of the model predictions across all the intervals. It is calculated as the sum of the squares of the Z scores for the intervals and is then compared with the chi-squared distribution. By standardising in each risk interval and accounting for patient numbers and the distribution of risk of death, comparison between different contexts may be possible provided the samples are large. A further advantage of the Spiegelhalter score is that it can be calculated from the tables presented for H-L statistics. The Spiegelhalter method of calculating the standard error is a conservative test. No assessment of performance of ICU mortality models has used the Spiegelhalter method to date.
Model-based analyses of performance are more often used during fine-tuning of a model on a developmental dataset, or for recalibration, than for validation studies. Real ICU data may violate some of the assumptions on which both the modelling and subsequent analysis are carried out. However, Ash and Schwartz observe that "...these algorithms for transforming data are judged primarily by how closely their predictions match reality, rather than the extent to which underlying assumptions are met."
Approaches based on a logistic regression model allow analysis of the relationship between the estimated risk of death and observed outcomes. These assessment methods naturally allow new or refined ICU outcome models to be developed by adjustment of the parameters of an existing model.

An ICU model that accurately estimates the probability of in-hospital mortality can potentially be used as a risk adjustment tool to analyse mortality outcomes in an ICU. The model's performance must, however, be thoroughly evaluated to determine whether the estimates of probability of death are accurate.
The recent publication of guidelines for standard reporting of the accuracy of diagnostic tests by the Standards for Reporting of Diagnostic Accuracy (STARD) initiative provides a useful example of an explicit and rigorous checklist to serve as a guide to methodology and documentation. Reports on the performance of ICU models that estimate the probability of in-hospital death share similar characteristics with reports on the accuracy of a diagnostic test. Therefore, the systematic approach used by STARD (Checklist for Reporting Diagnostic Accuracy) provides a framework for my recommendations.
The Title, Abstract and Keywords should identify the report as an assessment of the performance of a model that estimates the probability of in-hospital death. The Introduction should clearly state the aim of the report. It may be to develop and introduce a new model, to validate an existing model, to compare several models, or to adjust an existing model.
The Methods section should describe the context of the analysis and the dates of data collection. A single or multiple ICUs, and the type of hospital and ICU, should be described. The study population and the methods of patient eligibility and exclusion should be described. This will allow the reader to assess the independence of the validation process, conditions affecting model performance, and the applicability of the findings. An account of the rules of data collection and a description of the model are essential. The mortality and survival endpoints must be defined. The methods by which patient data are divided into sets for model development and testing must be given. A description of the statistical methods used to develop the model is also required.
The accuracy of ICU outcome models should be assessed in terms of discrimination and calibration. For discrimination, the area under the ROC curve should be calculated, with standard error or confidence intervals for precision. For two models on the same sample, pair-wise comparison of the areas under the ROC curves should be performed.

Calibration should be assessed by an overall assessment statistic and an assessment of fit across risk intervals. The most commonly used global indication is the SMR with confidence intervals. A graphical approach with a calibration curve using confidence intervals should be presented. A numerical evaluation of goodness-of-fit using the H-L statistics or the Spiegelhalter approach should be used. If more than one model is being evaluated on the same dataset, then a statistical comparison should be made.
The Results section should state when the data collection was performed. A description of the sample should include characteristics of age, gender and major diagnostic categories, severity of illness measurements and mortality rate. If this is an independent evaluation of a model, it is useful to compare the patient characteristics of the validation dataset with those of the sample on which the model was developed. The recommended approach described above is applied to the analysis of APACHE III in an Australian ICU in Chapter 3.

The performance of all models on the developmental datasets is consistently better than on subsequent independent assessment. As expected, the issues of reproducibility and transportability of the models require that all models which estimate the probability of in-hospital death of ICU patients are validated in the context of their intended use.
2.7 Conclusion
The key attributes of models that estimate the risk of death of patients in the ICU are discrimination and calibration. The area under the ROC curve is the best measure of discrimination. For models to predict contemporary ICU mortality, an area under the ROC curve in the range of 0.80 to 0.90 is expected. Model calibration is an attribute that is more difficult to capture. Though calibration curves provide a qualitative representation, statistical evaluation of calibration is done using H-L goodness-of-fit statistics. In practice, it is necessary to decide on non-statistical grounds what level of fit, and what maximum value of the statistic, is acceptable. Goals for model performance using these measures will guide the development of models in subsequent chapters.
Table 2.1 Reports of assessments of models that estimate the probability of mortality of ICU patients. Reports are presented in chronological order of publication. * Denotes absence of information (eg calibration not reported).

Columns: Authors; Models; Comments; Area under the ROC curve (95% confidence intervals or standard error); Discrimination: classification matrix thresholds; Overall calibration assessment; Calibration: graph; Calibration: H-L statistic. Entries legible in the source include:

1. APACHE II developmental paper. USA, 13 ICUs, 5815 admissions. ROC curve displayed, area not stated. PPV, NPV, sens, spec and CCR at 0.7, 0.8 and 0.9. * *
2. Jacobs et al. 1987. APACHE II modified with best GCS. Saudi Arabia, single ICU, Nov 1984 - Nov 1985, 210 admissions. Sens, spec, NPV, PPV and CCR at 0.6, 0.7 and 0.8. O/E can be calculated from the text. * *
3. APACHE II modified with best GCS. Saudi Arabia, single ICU, Nov 1984 - April 1987, 583 admissions. O/E = 1.2. * *
4. APACHE II. Italy, single ICU, 3 years (dates not stated), 521 patients. APACHE II 0.84 (0.07). CCR at 0.7. O/E = 0.78. * *
5. APACHE II. Mayo-affiliated hospitals, 1285 patients. * *
6. Knaus et al. 1991. APACHE III developmental paper. USA, 40 hospitals, 17440 patients, May 1988 - Nov 1989. APACHE III 0.90. PPV, NPV, sens, spec and CCR at 0.1, 0.5 and 0.9. Calibration curve in 0.1 intervals as split halves; all patients in 0.05 intervals; no CIs.
7. Berger et al. 1992. APACHE II. Switzerland, single ICU, Jan 1986 - Mar 1988, 2061 patients. ROC curves for subgroups, areas not given. O/E can be calculated from the text. * *
8. APACHE II. Japan, 6 hospitals, Dec 1987 - June 1989, 1292 patients. APACHE II 0.78. O/E can be calculated from the text. * *
9. MPM0 II and MPM24 II developmental paper. Europe and North America: 12 countries, Apr 1989 - July 1990 and Sept 1990 - May 1991; 19124 patients (MPM0 II), 15925 patients (MPM24 II), missing data excluded. MPM0 II: developmental set 0.837, validation set 0.824; MPM24 II: developmental set 0.844, validation set 0.836. H-L C, df = 8 in development sample.
10. APACHE II (reinterpreted using best GCS). Hong Kong, single ICU, May 1988 - Nov 1990. APACHE II 0.89. PPV, NPV, sens, spec and CCR at 0.5, 0.7 and 0.9. O/E = 0.97. Calibration curve with 0.1 intervals; plot looks as if midpoints of interval used; no CIs.
11. Rowan et al. 1993. APACHE II. UK and Ireland, 26 ICUs, Oct - Dec 1987 and Jan - April 1989, 8796 admissions (only 80% of all admissions included). APACHE II 0.83. TPR, FPR and CCR at 0.1, 0.5 and 0.9. O/E (95% CIs) 1.02 (0.98 - 1.06). Calibration curve with 0.05 intervals, CIs. H-L H, df = 8.
12. MPM48 II and MPM72 II developmental paper. USA, 6 ICUs, Sep 1990 - May 1991; 3023 patients (MPM48 II), 2233 patients (MPM72 II), missing data excluded. MPM48 II: development set 0.81, validation set 0.80; MPM72 II: development set 0.79, validation set 0.75. H-L C, df = 10 in development sample.
13. Rowan et al. 1994. APACHE II and MPM0 II (adjusted from MPM0 I). UK and Ireland, 26 ICUs, Oct - Dec 1987 and Jan - April 1989; 8724 admissions (80% of all admissions included). MPM0 II 0.74; APACHE II 0.83. TPR, FPR and CCR at 0.1, 0.5, 0.75 and 0.9. O/E can be calculated from the text. Calibration curve with 0.05 intervals, CIs. H-L H and C, df = 8.
14. Moreno et al. SAPS II. EURICUS: 89 ICUs in 12 countries, Oct 1994 - Jan 1995, 10027 patients. Sens, spec, PPV, NPV and correct classification rate at several thresholds. SMR with 95% CIs. Calibration curve, no CIs.
15. Nouira et al. Tunisia, 3 ICUs, Jan 1994 - July 1996, 1325 patients. Correct classification rate at thresholds of 0.2, 0.5 and 0.8. O/E not provided but can be calculated from the text. Calibration curve with intervals of 0.1; no CIs.
16. Tan 1998. APACHE II and SAPS II. Hong Kong, 1064 ICU discharges, Oct 1995 - June 1997. APACHE II SMR 0.77 (95% CI +/- 0.073); SAPS II SMR 0.82 (95% CI +/- 0.071). Calibration curves with intervals of 0.1; no CIs.
17. APACHE III. USA, 161 hospitals, 285 ICUs, May 1993 - Dec 1996, 37668 patients. Calibration curve with 0.1 intervals; no CIs.
18. UK study, 17 ICUs, 12793 patients. TP, TN and CCR. SMR (95% CIs). Calibration curve with 0.05 intervals.
19. Sirio et al. 1999. USA, 38 ICUs, March 1991 onwards. Table of SMRs by institution; calibration curve approach as for the H-L statistic.
20. APACHE II and artificial neural network. UK and Ireland, 26 ICUs, Oct 1987 - April 1989, 8796 patients. APACHE II 0.83; ANN 0.84. TP, TN and CCR at thresholds of 0.1 to 0.9 in 0.1 intervals. O/E not provided but can be calculated from the text. Calibration curve with strata of risk in 0.1 intervals. H-L, df = 8.
21. APACHE III. Australia, single institution, Jan - Aug 1991, 519 patients. O/E = 1.25. Calibration curve with 0.1 intervals, CIs; plot looks as if midpoints of interval used.
22. APACHE II, APACHE III and SAPS II. Germany, single ICU, Oct 1991 - Oct 1994, 2795 patients. APACHE II 0.83, APACHE III 0.85, SAPS II 0.85. TP, TN and CCR at thresholds of 0.1, 0.5 and 0.9. O/E not provided but can be calculated from the text. Calibration curve with 0.1 intervals, no CIs. H-L H, df = 8.
23. APACHE II and SAPS II. Greece, single ICU, Nov 1992 - Dec 1997. APACHE II 0.84, SAPS II 0.87. Sens, spec, PPV, NPV and CCR at 0.3, 0.5 and 0.7. Flora's Z score; O/E not provided but can be calculated from the text; APACHE II SMR 1.14, SAPS II SMR 1.62. Calibration curve with 0.05 intervals, no CIs. H-L H and C, df = number of risk strata: 8 and 9 in this study.
24. APACHE II. USA, single ICU, 6808 patients over 8 years (dates not stated). APACHE II 0.74. H-L C 233, df not stated.

Abbreviations: sens = sensitivity; spec = specificity; PPV = positive predictive value; NPV = negative predictive value; CCR = correct classification rate; TPR/FPR = true/false positive rate; O/E = observed/expected mortality ratio; GCS = Glasgow Coma Score; CI = confidence interval.
ce
>
e
8
CO
4)m 0
CI.
4> C
U 8
to to
3
g
n
Oi M s
J-J
t/1
ft; ft- q <J pq S 2 O-DH
Zft-KcocococotoH
o
00 00
NO 00
00
pq
d
ac d
o
^ . 000- C~ r o
ft- 00 < j 0 0 ft; 0 0
< d CO d S d
CM 3
CM O
0 S
3
c/l 3
3
0
>
pq
CO
CO
20 ca
4)
^.
ee - 3 ON
"* i2 . a:
ospi
0 3
USA
995
so S 3
VO
^ H
43 ^ JS <L> o
"I3 s
8 3 S <^ o
* O
w
X =1 >^
2 II g o S ^ a>
O = 4J l^
.ii en
en en
>
.ti
S-o
2 2 s
8 c
O 8
NJJ S ft-
J3 CO > I
ft- O ^ "-I
o. O 4>
a< < & o ta S
' ce
-ts 3 8 &
< w 2
^ <
O O '.^
O "3
B o
eg a X
.ss
s
o
CM
en
K
.0
pq
ac
u o ..
< Pi (ii 00
O3 U .:iJ CL: U J. prs ^ =
37
Chapter 3
3.1 Objective
The objective of this chapter is to evaluate the performance of the APACHE III (Acute Physiology and
Chronic Health Evaluation) ICU (intensive care unit) and hospital mortality models at an Australian adult ICU.
The independent validation of the APACHE III models at the Princess Alexandra Hospital (PAH)
intensive care unit (ICU) has been published ^' during the course of this project. It was, at the time of
publication, the largest single institution evaluation of the APACHE III model.
3.2 Introduction
At the time of the study, PAH ICU provided medical and surgical critical care services to an 858 bed
adult metropolitan hospital which was the regional centre for trauma, major surgery, medical
subspecialties, and psychiatry. In August 1994, the APACHE (Acute Physiology and Chronic Health Evaluation) III system was introduced to the ICU.
The APACHE III estimates of mortality risk are part of a proprietary database and decision support
system provided by APACHE Medical Systems, Inc. The APACHE III score is a measure of severity
of illness of ICU patients and is calculated from the patient's age, the presence of existing medical
conditions and the worst physiological and laboratory investigations in the first 24 hours. The
APACHE score attempts to measure the patient's physical reserve, and the degree of physiological
disturbance through the most abnormal physiology recorded during the first day. The APACHE III
model uses the admission diagnosis, the source of admission and the APACHE III score to estimate the
patient's risk of death in the ICU, and the risk of death during the hospitalisation. The risk equations and coefficients are proprietary.
The performance of the APACHE III model predictions has been evaluated on the developmental
database ^. Independent APACHE III validation series are available from Brazil (multi-centre, 1734
patients '*"*), the United Kingdom (UK: single institution, 1144 patients ^^ and multi-centre, 12 793
patients'") and Germany (single institution, 2661 patients '*'). In each study, hospital mortality was
higher than predicted indicating poor model calibration. In contrast, in two large, prospective, multi-
centre North American series (37 668 patients ^^ and 116 340 patients ^'*) the APACHE III model
demonstrated good overall performance. Generally, the APACHE III mortality prediction model has
not performed well in clinical evaluation outside the USA where it was developed.
The purpose of this chapter is to assess the performance of the APACHE III models for hospital and
ICU mortality, unadjusted and with proprietary model adjustments, at the PAH ICU.
Patients under 16 years of age, cardiac surgical and burns patients, and patients admitted to the ICU for less than 4
hours or for exclusion of myocardial infarction were excluded from the APACHE III predictions.
Patient data were collected according to the rules of APACHE III 2'^''^'. Data were manually collected,
or transferred from the pathology laboratory information system. The database manager verified all
data. After the first six months of the data collection, 4% of patient records were extracted to check
inter-reporter reliability. Outcomes were survival status on discharge from the ICU or the PAH hospital
campus. Patients transferred to rehabilitation facilities (spinal, geriatric, head injury and general
rehabilitation units) or the psychiatric unit within the PAH complex were deemed inpatients, until final discharge from the hospital campus. The demographic characteristics of the study sample were compared with the APACHE III developmental database using t-test or chi-squared test, adopting a significance level of p < 0.01 to correct for
multiple testing.
The ICU mortality models were assessed on all eligible admissions to ICU, including readmissions.
The hospital mortality model assessment excluded all ICU readmissions during an episode of
hospitalisation. For each admission, mortality estimates were provided based on proprietary weights
and the APACHE III equation. For in-ICU mortality, the APACHE III ICU mortality model and a
model with proprietary adjustments for hospital characteristics (similar hospital ICU mortality model)
were studied. Three models of in-hospital mortality were evaluated. The APACHE III hospital
mortality model and models with proprietary adjustments for hospital characteristics (similar hospital
mortality model) and referenced to the North American database (USA database hospital mortality model) were studied.
The proprietary adjustments to the APACHE III models include the additional data variables of pre-
ICU treatment period and information about the institution size, teaching status and region. In the case
of the PAH, the similar hospital model references the predictions to teaching hospitals of similar size in
the Mid-West region of USA. The USA database hospital mortality model reflects a "typical" USA
ICU modelled from the APACHE III database (personal communication: C. Alzola, APACHE Medical Systems, Inc.).
The aggregate predicted mortality rate for each model was the sum of estimated probabilities of death
divided by the number of admissions. The standardised mortality ratio (SMR) was the ratio of observed
mortality to the aggregate predicted mortality. Confidence intervals were estimated using a normal approximation.
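The SMR calculation can be sketched as follows (a minimal illustration; the function name and the Poisson-based normal approximation for the interval are my own assumptions, not necessarily the thesis's exact method):

```python
import math

def smr_with_ci(outcomes, predicted_risks, z=1.96):
    """Standardised mortality ratio with an approximate 95% CI.

    outcomes: 1 if the patient died in hospital, 0 if the patient survived.
    predicted_risks: the model's probability-of-death estimates.
    Expected deaths are the sum of predicted risks; the CI treats the
    observed death count as the only source of variation (illustrative
    assumption), so SE(O/E) ~ sqrt(O)/E.
    """
    observed = sum(outcomes)
    expected = sum(predicted_risks)
    smr = observed / expected
    se = math.sqrt(observed) / expected
    return smr, (smr - z * se, smr + z * se)
```

A perfectly calibrated model applied to its own population gives an SMR near 1, with the interval covering 1.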
For assessment of model fit or calibration, the agreement between predicted and observed mortality
rate in risk intervals was assessed. Using 10 equal, contiguous risk intervals, calibration curves present
observed against predicted outcomes with 95% confidence intervals estimated by a normal
approximation to the binomial distribution **. The Hosmer-Lemeshow (H-L) statistics, C and H, ^^'^
indicate the agreement between the observed and predicted mortality across risk intervals. For C,
admissions are ranked according to predicted risk of death and divided into 10 nearly equal groups. H
uses the sample divided into 10 contiguous intervals of risk of equal width, but unequal number. For
external validation studies, the degrees of freedom of the chi-squared distribution is the number of
intervals of risk ^. Rejection of the null hypothesis that there is no difference between the predicted and observed mortality indicates poor calibration.
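The C statistic can be sketched as follows (a minimal illustration; handling of tied risks and group boundaries is simplified relative to published implementations):

```python
def hosmer_lemeshow_c(outcomes, risks, groups=10):
    """Hosmer-Lemeshow C statistic: rank admissions by predicted risk,
    split them into near-equal groups, and sum
    (O - E)^2 / (n_g * p_bar * (1 - p_bar)) over the groups, where
    p_bar is the mean predicted risk within a group."""
    pairs = sorted(zip(risks, outcomes), key=lambda pr: pr[0])
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        n_g = len(chunk)
        observed = sum(y for _, y in chunk)
        expected = sum(p for p, _ in chunk)
        p_bar = expected / n_g
        denom = n_g * p_bar * (1 - p_bar)
        if denom > 0:  # skip degenerate groups where p_bar is 0 or 1
            stat += (observed - expected) ** 2 / denom
    return stat
```

The H statistic differs only in how the groups are formed (fixed-width risk intervals rather than near-equal group sizes).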
Discrimination was assessed by calculating the area under the receiver operating characteristic (ROC)
curves, with estimates of the standard error and confidence intervals ^^. The area under the ROC curve
estimates the probability that a randomly selected patient who died would have been given a higher risk estimate than a randomly selected patient who survived.
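This probabilistic interpretation gives a direct, if inefficient, way to compute the area (a sketch only; practical implementations use the trapezoidal rule or a rank statistic):

```python
def roc_area(outcomes, risks):
    """Area under the ROC curve via its probabilistic interpretation:
    the proportion of (non-survivor, survivor) pairs in which the
    non-survivor was assigned the higher predicted risk; tied risk
    estimates count as half."""
    died = [p for p, y in zip(risks, outcomes) if y == 1]
    lived = [p for p, y in zip(risks, outcomes) if y == 0]
    wins = 0.0
    for d in died:
        for s in lived:
            if d > s:
                wins += 1.0
            elif d == s:
                wins += 0.5
    return wins / (len(died) * len(lived))
```

A model that ranks every non-survivor above every survivor returns 1.0; a model no better than chance returns about 0.5.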
3.4 Results
There were 3455 admissions to the PAH ICU between 1 January 1995 and 31 December 1997.
Exclusions were 45 admissions under 16 years of age, 8 admissions staying less than 4 hours and 4
burns admissions. The 3398 remaining admissions represented 3159 patient hospitalisations of 3038
individual patients. All patient outcomes were accounted for during the study period.
There were 338 deaths in ICU (9.9%) and 507 deaths in hospital (16.0%). The median length of stay in
ICU was 2 days (range of 1 - 75 days), with 65.4% of patients admitted for 2 or 3 days. Median
duration of hospitalisation was 16 days (range 1 - 930 days, 25% quartile: 8 days, 75% quartile 28
days).
Table 3.1 shows the demographic characteristics of patients, with reference to the APACHE III
developmental database. Compared to the APACHE III development sample, the PAH sample was
younger, had a greater male preponderance, a different case mix of non-operative / operative (elective
and emergency) and a different mix of sources of referral. Severity of illness reflected by the day 1
APACHE III score and the Acute Physiology Score component appear similar.
[Table 3.1: demographic characteristics of the PAH sample and the APACHE III developmental database, including operative status and admission source. *** χ²(16) = 515; p < 0.001]
The 231 admission diagnoses were grouped into 77 disease groups. The commonest operative disease
groups were gastrointestinal cancer (9.0%), elective aortic surgery (8.5%), operative trauma (7.2%),
head and neck cancer surgery (3.5%), miscellaneous gastrointestinal surgery (3.5%) and liver
transplantation (2.8%). The commonest non-operative groups were non-operative trauma (10.3%), drug
overdose (7.5%), cardiac arrest (2.6%) and asthma (2.5%). The ten most frequent groups accounted for
There were 2812 admissions (82.8%) with no APACHE III co-morbidities. 459 (13.5%) had one co-morbidity, 120 (3.5%) two, and 7 (0.2%) three or more co-morbidities. The prevalence of one or more
co-morbidities in the present study sample (17.2%) differs from that of the APACHE III developmental sample.
The majority of admissions had low estimates of probabilities of death according to APACHE III.
Seventy-nine percent of patients had a predicted ICU mortality of 0.1 or less and 91% of 0.3 or less.
Sixty-eight percent of patients had a predicted hospital mortality of 0.1 or less, and 85% of 0.3 or less.
The observed hospital mortality (16.0%) was significantly higher than the mortality rate predicted by
APACHE III, 13.6% (χ²(1) = 7.4; p = 0.01, Table 3.2). The observed hospital mortality was not
different from the APACHE III predictions when adjustments for hospital
characteristics (similar hospital model: 14.9%) or the USA database referenced model (15.6%) were
used. The observed ICU mortality (9.9%) was not significantly different from the predictions of the
APACHE III ICU mortality model (8.9%) or the APACHE III similar hospital ICU model (10.5%,
Table 3.3).
The APACHE III ICU models show good calibration. Calibration curves for the ICU mortality model
and the similar hospital ICU mortality model (Figure 3.1 and 3.2) are close to the line of perfect model
calibration. The H-L statistics (Table 3.4) with corresponding p values > 0.05, confirm adequate
calibration, with the similar hospital ICU model providing the best fit on this sample.
Figure 3.1: Calibration Curve for APACHE III ICU Mortality Model with
No Adjustment for Hospital Characteristics
APACHE III model predicts in-ICU risk of death
Observed (+/- 95% CI) v predicted mortality in 10 intervals of risk

Figure 3.2: Calibration Curve for APACHE III ICU Mortality Model with
Adjustment for Hospital Characteristics
(Similar Hospital Model)
APACHE III model predicts in-ICU risk of death
Observed (+/- 95% CI) v predicted mortality in 10 intervals of risk
(x-axis: APACHE III Predicted ICU Mortality, 0 - 0.9)
The unadjusted APACHE III hospital mortality model has poor calibration demonstrated on the
calibration curve (Figure 3.3) and on statistical analysis (Table 3.4). The similar hospital model (Figure
3.4) and the model referenced to the USA database (Figure 3.5) display adequate calibration. Though
both calibration curves show that observed mortality differs from expected in the range of 40 - 50%,
the H-L statistics (Table 3.4) suggest non-significant differences between the estimated and actual
mortality rates.
(Figures 3.3 - 3.5: calibration curves for the hospital mortality models; x-axis: APACHE III Predicted Hospital Mortality, 0 - 0.9)
The area under the ROC curves for both ICU mortality models was 0.92. The areas under the ROC
curves for the APACHE III hospital mortality model and the model referenced for the USA database
were 0.90. For the similar hospital model, the area was 0.91. This demonstrates good or excellent discrimination for all models.
3.5 Discussion
This analysis demonstrates that the APACHE III mortality models with adjustment for hospital
characteristics, the ICU mortality model and the hospital mortality model referenced to the USA
database have good discrimination and calibration in an Australian adult ICU population at the PAH
ICU during the study period. This was the first series from a general ICU outside the USA that endorsed
APACHE III model performance. It also supports the findings of previous reports from the UK ^"'^^
Brazil'"", Australia ^^ and Germany '*' where the original, unadjusted hospital mortality model
performed poorly.
The practical approach adopted in this analysis is based on the review and conclusions of Chapter 2.
Discrimination of all APACHE III models was good (area under the ROC curve > 0.8) or excellent
(area under the ROC curve > 0.9) ^*. The area under the ROC curve was similar to that of APACHE III
in-hospital mortality predictions on the developmental data set ^, the USA prospective multi-centre
validation series ^^'^ and the UK multi-centre series ^*'. However, the discrimination of the APACHE III model is vulnerable to differences in case mix, clinical practice and data collection conditions, given
the lesser performance in the multi-centre Brazilian series '*"', and single institution studies from the UK and Germany.
In this sample, the H-L statistics, the calibration curves and the global agreement between observed
and predicted outcomes concur that among the models studied, only the unadjusted APACHE III
hospital mortality model displayed inadequate fit. Comparison with other published calibration curves
for APACHE III ^30,36,77,100 shows that the calibration at the PAH resembles the curves of the North American series.
Despite differences in case mix and referral patterns between the PAH sample and the APACHE III
developmental sample, the performance of the APACHE III models adjusted for hospital
characteristics was good. Other analyses of APACHE III have only assessed the performance of the
unadjusted APACHE III hospital mortality models. A UK validation study ^^, with a patient sample
having a higher average APACHE III score, more co-morbidities and different referral sources, found
excellent discrimination. However, there was a 25% higher than predicted mortality rate and excess
mortality in all risk ranges indicating poor calibration. Differences in casemix were proposed as the
likely reason for higher than predicted hospital mortality. A German study '*^ describes a 22% higher
than predicted mortality, with data collection anomalies, casemix, leadtime bias, model inaccuracy or a combination of these suggested as possible explanations.
In contrast to other work, the present study showed that the APACHE III model can be applied to a
patient population with a different case mix and referral pattern, outside of the USA, and produce
similar performance to that observed on the APACHE III development and validation series.
The validation of a model that estimates the risk of death on an independent data set implies model
validity and the reliability of variables and data collection methods ^^. Potential for bias and inaccuracy
'"^ and threats to model performance can arise from local anomalies of clinical practice, casemix or
data collection. The apparent variability of performance of ICU outcome or risk adjustment models
mandates that these models must be closely examined at each site where they are used before their predictions are relied upon.
In the PAH series during the study period, the APACHE III mortality estimates, particularly with
proprietary adjustment for hospital characteristics, provided both good discrimination and good
calibration. This supports the validity and robustness of APACHE III variables, data collection and the model itself.
The local performance of APACHE III will therefore allow its use as a locally validated risk adjustment model.
Chapter 4
Control Charts for Analysis of Mortality Outcomes in
Intensive Care.
4.1 Introduction
The purpose of the next two chapters is to develop methods to continuously monitor outcomes of
intensive care unit (ICU) patients using an adjustment for risk of death. In Chapter 4, control charts are
applied to analyse ICU deaths. The choice of charting parameters is made by modelling the occurrence
of false alarms and detection of changes in the mortality rate. Control charts incorporating risk adjustment are then developed in Chapter 5.
4.1.1 Overview
Statistical analyses can be used to detect a change in the pattern of the outcomes of a process. A change
in observed mortality rates for patients may be caused by various factors including a change in the
process of patient care. Therefore, monitoring outcomes in a clinical setting may provide information
In this chapter, several approaches to track ICU mortality will be presented. Process control chart
analysis will be applied to mortality at the PAH ICU during the period 1 January 1995 to 31 December
1999. The APACHE III model for estimating the in-hospital death of ICU patients, with proprietary
adjustments for hospital characteristics, was shown in Chapter 3 to have good discrimination and
calibration on the PAH ICU patient population. Therefore it will be used as a risk adjustment (RA) model for mortality monitoring.
The data used in the following chapters cover an additional two year period beyond those used for the
validation study of Chapter 3 which was published in 2000 ^'. A summary of the larger data set is
provided in Appendix 1. An update on the performance of the APACHE III model on this data set is also provided there.
This project was commenced using the 1995 - 1997 patient dataset. This initial 3 year period is used as
a period of historical observation. The chart monitoring will be applied to the subsequent data for 1998 - 1999.
Primarily the methods must track the mortality rate and analyse the mortality observations in the
context of a standard mortality rate. For the charts presented in this chapter, the expected mortality rate
was derived from historical observations from the 1995 - 1997 patient data.
ICU mortality monitoring must include all patients that are admitted to the ICU, so that a global picture
of ICU mortality is obtained which will reflect the care given to all patients. This will also maximize
the number of patients, the power of the analysis and may minimize delay to recognize changes.
The data collection and monitoring must not distort the care provided to patients. Protocols have a
place to guide the quality and efficiency of patient care. However, protocols that exist solely to
improve data analysis must not constrain patient care or interfere with changes to the system of care. It
is important that the usual care of patients is being assessed, rather than the effects of an experimental-type intervention.
The results of analysis must be available in a timely manner without unnecessary delay. This can be
aided by using a sequential analysis technique or grouping patients into the smallest practical samples
that provide adequate power. Large sample groups potentially miss temporal relationships, such as clusters of deaths.
The mortality outcome used for the development of the control chart methods is the APACHE III
endpoint of patient survival to hospital discharge. ICU patient hospital stay can be very long, up to
months or even years as an inpatient. A complete dataset can only be analysed after all patients are
dead or discharged from hospital, which can take months, or years. Early analysis tends to be biased
toward the deaths, and I have found in practice that use of in-hospital mortality as the endpoint limits
the timeliness of analysis. This observation is consistent with opinion in the literature, which has called
for ICU patient survival to be reported at fixed times post ICU admission ^^'^^. The considerable
limitation in using in-hospital mortality as the outcome is addressed in subsequent chapters of this
thesis. When new models are developed for RA charting in this thesis, the outcome of 30-day in-
hospital mortality is used. Patient outcomes can therefore be analysed 30 days after admission.
Process control charts, such as the p chart, the CUSUM chart and the EWMA chart, are suitable methods for
detecting changes in proportions of binary outcomes of a process. In the ICU, the process under
scrutiny is the complex milieu of patients and patient care. The outcome to be monitored is in-hospital
mortality.
Variations to the mortality rates may be due to common cause or special cause variations '"^. Common
cause variations are related to the nature of the process and include chance variations. A process is
operating "in-control" if the variability is due only to random variation '"*. Control charts provide tools
for monitoring and analysis by tracking whether the outcome of the process conforms to the expected
in-control variability. The variance of a process in-control will account for these chance effects, and is
A process is "out-of-control", when the output does not have a stable distribution "'^, and the variation
is attributable to causes other than random variation '"*. These are special causes or assignable causes
of variation. These can be due to temporary or new factors that were not part of the in-control process.
Examples in an industrial context usually include defective raw materials, improperly adjusted
machines or operator errors '*. In ICU, special cause variations causing a fall in mortality occur with
an increase in low risk elective surgery, transfer of low risk patients from a nearby ICU or potentially a
systematic increase in quality of care. In contrast, a decrease in mortality rate due to a chance run of surviving patients is a common cause variation.
The control parameters for a control chart are estimated from a stable, historical period of observation
while the process is in-control. In this application, the initial period 1 January 1995 - 31 December
1999 of 36 monthly observations, or 31 blocks of 100 patients was used. Appendix 2 presents an
analysis which establishes that it was reasonable to assume that the ICU process and the mortality rate
Where a change in the distribution of the outcome observations exceeds a level that is expected by
chance, a signal will occur. Such a signal should prompt several actions in an ICU. The first is an
evaluation of the likely importance of the signal by considering the sensitivity and false alarm
characteristics of the chart which are calculated or simulated during the chart design process. Other
alternative, complementary patient data will also be examined to differentiate between common or
special causes of variation. Secondly, if a systematic change has occurred, an examination of the
process in an appropriate and timely manner will be conducted, including a search for the assignable
causes. A systematic and acceptable increase in mortality rate due to more high risk patients will lead
to a revision of the expected mortality rates. An increase in low risk elective surgery will require a
revision down of the expected mortality rate. On the other hand, an increase in mortality rate attributed
to, say, premature patient discharge or inadequate medical staff numbers would demand intervention
and improvement of the care offered to patients, without revision of the expected mortality targets.
Equally important would be a true fall in mortality that was not attributable to severity, casemix or
other expected influences. A search for the causes of improvement could provide hypotheses for further
clinical investigation.
The key to differentiating between random variations and a systematic change in the underlying process is an understanding of the chart's run length behaviour.
Design of clinical trials considers the power of the analysis and the Type I and Type II errors.
Analogous charting concepts are studied by analysing run length to signal. With control charts it is
important to understand how long a process might run before a signal is expected when the process is
in-control (false alarm), and when conditions change (true positive). The expected pattern of false
alarms and true positive signals can be quantified as the average run length to signal (ARL) under in-control and changed conditions.
A clinician can design the monitoring scheme prospectively. The information required is: the in-control
conditions, the changed conditions to be detected, the tolerable ARLs for in-control and changed
conditions, and the patient numbers. The performance of a specific chart method is predictable, and the scheme can be designed to meet these requirements.
There are limitations to monitoring ICU mortality that arise from the assumptions of the control chart
model. The independence of successive patient outcomes is
difficult to assess. Correlations between successive outcomes may be related to cycles of activity,
staffing, trauma, operating theatre lists and casemix. Staff learn on the job, so annual or term related
cycles of staff gaining experience may affect the process. Public and school holidays affect casemix
and staff availability. Annual budgetary cycles may have effects on patient case mix, particularly the
volume of elective surgery. There is seasonal variation in many diseases such as heart disease '"*,
asthma and communicable diseases like respiratory tract infections, meningococcal disease or flaccid
ascending paralysis.
Sequential effects may also exist. Nosocomial infection transmission increases the risk of exposure and
cross infection and may cause clusters or outbreaks of infection. High staff activity levels followed by
periods of exhaustion and low staffing levels may influence the quality of care. The recent experience
of clinicians could affect future clinical decisions. Business considerations may impact on activity
levels, which in turn could affect admission and discharge criteria, resource availability and withdrawal
of therapy.
There is therefore the possibility of lack of independence of patient outcomes and the possibility that
the in-control mortality observations do not conform to the predicted distribution. In either of these cases, the calculated control limits and run lengths will be inaccurate.
For p charts and the CUSUM, it is assumed that the individual patient outcomes are random variables
and a normal distribution can be used to calculate control limits. Departure from normality will prevent
control limits from being correctly estimated. Appendix 2 presents a thorough analysis of the
observations of monthly mortality rates and the mortality rates of blocks of 50 and 100 consecutive
cases (1 January 1995 - 31 December 1997). The monthly and 100 case block mortality observations
appear to have a normal distribution and there is no evidence of non-random clustering or mixing. The
estimate of the standard deviation of the binomial distribution agrees with the observed standard
deviation of mortality rate observations. There is some evidence for a 3 month (or 300 case) cycle in
mortality rates.
The following discussion concentrates on the use of control charts to monitor the mortality rate of ICU
4.4.1 p Chart
The Shewhart p chart is a control chart for the proportion of an attribute in a sample. Control limits are
based on a normal approximation of the binomial distribution, with parameters of sample size and target proportion.
To plot a p chart for patient mortality rate, the statistic \(\bar{p}_i\), the monthly mortality rate, is calculated as

\[ \bar{p}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} Y_{ij} \]

where the month or sample is indexed by \(i\) and the \(n_i\) patient outcomes are indexed by \(j\). The patient
outcome \(Y_{ij}\) is 1 if the patient dies in hospital and 0 if the patient survives to hospital discharge.
The target \(\bar{p}\) is the mean mortality rate during an in-control period for the process. The value

\[ \sqrt{\frac{\bar{p}(1-\bar{p})}{n}} \]

estimates the standard deviation of the sample mortality rate. For large samples, control limits are calculated by a normal approximation to the binomial distribution:

\[ CL = \bar{p} \pm a \sqrt{\frac{\bar{p}(1-\bar{p})}{n}} \]

where \(a\) is the width of the control limits in standard deviations (for example, 2 or 3).
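The limit calculation can be sketched as follows (illustrative only; the sample size of 100 mirrors the 100-case blocks used in this chapter, and the target of 0.16 the chart's target mortality rate):

```python
import math

def p_chart_limits(p_bar, n, a=3.0):
    """Control limits CL = p_bar +/- a * sqrt(p_bar * (1 - p_bar) / n),
    the normal approximation to the binomial distribution.  The limits
    are truncated to [0, 1] because a proportion cannot lie outside
    that range."""
    sigma = math.sqrt(p_bar * (1 - p_bar) / n)
    return max(0.0, p_bar - a * sigma), min(1.0, p_bar + a * sigma)

lower3, upper3 = p_chart_limits(0.16, 100, a=3.0)  # 3-sigma limits
lower2, upper2 = p_chart_limits(0.16, 100, a=2.0)  # 2-sigma limits
```

With these parameters the 3-sigma limits lie near 0.05 and 0.27, and the 2-sigma limits near 0.09 and 0.23.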
The p chart analysis presented in this section plots mortality rate by month. Appendix 3 presents an
analysis of the effects on ARL to signal for a choice of control limit parameters in the setting of
changing mortality rates. Based on this analysis, the charts are designed to signal plausible and
clinically important changes in mortality rate in a timely fashion with an acceptable incidence of false
alarms.
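The run length behaviour of such a design can be explored by simulation (a sketch, not the method of Appendix 3; the function name and the run, length and seed parameters are my own illustrative choices):

```python
import math
import random

def simulated_arl(true_p, target_p, n, a=3.0, runs=200, max_len=400, seed=1):
    """Monte Carlo estimate of the average run length (ARL) to signal
    for a p chart.  Each sample is n Bernoulli outcomes with true
    mortality true_p, plotted against limits built from the in-control
    rate target_p.  With true_p == target_p this estimates the
    in-control ARL (time to a false alarm); with true_p shifted, the
    ARL to a true-positive signal.  Runs are censored at max_len."""
    sigma = math.sqrt(target_p * (1 - target_p) / n)
    lower, upper = target_p - a * sigma, target_p + a * sigma
    rng = random.Random(seed)
    lengths = []
    for _ in range(runs):
        for t in range(1, max_len + 1):
            rate = sum(rng.random() < true_p for _ in range(n)) / n
            if rate < lower or rate > upper:
                lengths.append(t)
                break
        else:
            lengths.append(max_len)  # censored: no signal within max_len
    return sum(lengths) / len(lengths)
```

A large shift (for example, a true mortality rate of 0.30 against a target of 0.16 with samples of 100) signals within a few samples, while an in-control process runs far longer before a false alarm.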
Figure 4.1 presents the data from Appendix 1 Table A1.2 in a p chart of monthly mortality for the PAH
ICU in 1995 - 1997, with both 2σ and 3σ control limits. No monthly mortality observation lies outside
the 3σ limits and only a single observation (August 1996, 0.26) lies outside the 2σ limits. This further supports the assumption that the process was in-control during this period.
The mortality rates of the months during the period 1998 - 1999 are plotted on Figure 4.2, using the
observations of 1995 - 7 to determine the value of the control parameter, \(\bar{p}\). The first 6 observations
all fall below the target mean, and 20 of the 24 observations are below the mean. Three fall beyond the
lower 2σ control limits, including the first observation of January 1998. Mortality was lower than in
the period 1995 - 7, and the chart demonstrates that the process was "out of control" following January
1998.
Figure 4.2: p Chart PAH ICU Hospital Mortality Rate 1998 - 1999
by Month
Target Mortality 0.16
(Monthly mortality plotted with upper and lower 2σ and 3σ control limits and the target mortality rate)
The first issue raised is that of the level of proof required for data analysis for quality monitoring. The
observations for January 1998 and November 1999 were below the 2σ limit. There were no
observations falling outside the 3σ control limits. In medical applications, the 2σ limit may be
preferred. The second issue is that decision rules based on runs of observations may be more sensitive than the p charts described thus far to detect a process out-of-control. For medical applications, Benneyan "" has proposed a series of rules which allow recognition of an out-of-control process:
1: Eight consecutive observations on the same side of the target mean.
2: Any 12 of 14 consecutive observations on the same side of the target mean.
3: Three consecutive observations that lie outside 2/3 of the distance to the control limits.
4: Five consecutive observations beyond 1/3 of the distance from the target mortality rate to the
control limits. Based on 2σ control limits, a signal would have been given in February 1999. The
Western Electric equivalent rule is that 4 out of 5 points lie beyond 1σ from the target mean.
5: Thirteen consecutive observations within 1/3 of the distance from the target mortality rate to
the control limits on either side. This rule is designed to identify reduced process variability.
6: Eight consecutive observations displaying a sustained run up or run down.
Benneyan's rules can be used to identify a lack of statistical control to supplement p chart analysis.
These are similar to the famous Western Electric decision rules '"^ which are used to increase the
sensitivity of p charts to small sustained shifts in the process, and are similar to the Runs Tests of
Appendix 2.
Decision rules like those of Benneyan have not been incorporated in this analysis. As more rules are
used, the simplicity of chart interpretation is lost and the decision process becomes more complicated.
These rules increase not only the sensitivity in recognising true changes, but also the occurrence of
false alarms. As the rules are not independent, and each rule relies on several observations, the analysis
of the distribution of run length is difficult. Interpretation of the CUSUM or EWMA charts which
are described later in this chapter is more straightforward and easier to analyse than a complex set of
rules.
The third issue is the nature of the action prompted by a signal. A signal that the process is out-of-
control, even if it indicates lower than expected mortality, should initiate a review of the process,
searching for factors that may have caused the mean mortality rate to shift. In the work described here,
the data were analysed retrospectively, providing no opportunity to evaluate the process
contemporaneously. Examination of the p chart in Figure 4.3 suggests that the fall may have occurred
before the start of the 1998 - 1999 monitoring period.
Figure 4.3: p Chart PAH ICU Hospital Mortality Rate 1995 - 1999 by Month (Target Mortality 0.16). The chart plots monthly mortality with upper and lower 2σ and 3σ control limits and the target mortality rate.
When monitoring is resumed after a signal has occurred, the same chart parameters should be used if
the in-control distribution is the only acceptable outcome specification, if the cause of the change is
temporary, or where the process has been examined and adjustments have been made to return the
output to the in-control specifications. If these conditions do not apply, monitoring specifications should be
reviewed. In this example, it would have been appropriate to have reduced the estimate of the mean
mortality rate.
The CUSUM is a graphical technique that accumulates the differences between observations and an
expected, ideal or target value. The target may be a historical value, an industrial process specification,
or a clinical benchmark. In the ICU example, the in-control mortality rate of 0.16 for 1995 - 1997 was
used.
The CUSUM is useful to detect small changes in the output of a process. In contrast to the p charts,
CUSUM methods can detect sustained changes in the process mean of less than 1σ, if the parameters
are set to do so. For detecting substantially larger changes, the performance of p charts is comparable
to that of the CUSUM.
The formula for the calculation of the CUSUM statistic C_t is:

C_t = Σ_{i=1}^{t} (p_i − p̄)

where p_i is the mortality rate of the sample, indexed by i, and p̄ is the target or in-control mortality
rate.
If the process is in-control, the mean of C is 0, it is approximately normally distributed, and a sequential plot fluctuates about zero with no sustained trend.
A sustained change in the observed mean will produce a CUSUM chart with a slope equal to the
magnitude of the change in mean. A constant slope indicates that the difference between the target
mean and the observed outcomes remains the same over time. Where the gradient is changing, the
relationship between the outcome and the target mean is changing, suggesting a dynamic situation. A
cyclical variation may reflect the effects of, say, seasonal or predictable temporal change.
Figure 4.4 shows a CUSUM of the in-hospital mortality for 2200 cases from January 1998. The
expected mortality rate is 0.16 and the observed mortality rate is analysed in blocks of 100 admissions.
The sustained negative slope indicates that the observed mortality is below 0.16. The slope is estimated
over 21 blocks as - 0.66/21 = - 0.03. This suggests that the mortality rate over the monitoring period
was consistently 0.03 below the in-control mortality rate of 0.16. The new mortality rate is therefore
estimated to be 0.13.
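The calculation described above can be sketched in a few lines. This is an illustrative sketch rather than the analysis code used in the thesis: the block mortality rates are simulated, and the `cusum` helper is my own naming.

```python
# Illustrative sketch: CUSUM of block mortality rates against a target,
# with a crude slope estimate of the sustained shift. Data are
# simulated, not PAH ICU data.
import numpy as np

def cusum(block_rates, target):
    """Cumulative sum of deviations of each block's mortality rate
    from the target (in-control) rate."""
    return np.cumsum(np.asarray(block_rates) - target)

target = 0.16
rng = np.random.default_rng(1)
# 21 simulated blocks of 100 admissions with a true mortality rate of 0.13
blocks = rng.binomial(100, 0.13, size=21) / 100

c = cusum(blocks, target)
# A sustained negative slope indicates mortality below target; the
# average slope estimates the size of the shift (close to 0.13 - 0.16).
slope = c[-1] / len(c)
print(round(slope, 3))
```

As in the text, the final CUSUM value divided by the number of blocks recovers the average difference between observed and target mortality.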
For stable mortality rates, blocks with the same number of patients will have the same mean, variance
and control limits. Therefore, if the process is in-control, the CUSUM statistic can be calculated from the
differences between the observed and the in-control mortality rates. However, when monthly mortality
rates are analysed, the variation of the monthly admission numbers creates difficulties in calculation of
control limits for the CUSUM. This problem can be dealt with by standardising the observed mortality
rate (or the count of deaths, as this is equivalent) and using a CUSUM of the Z score.
The Z score is calculated using the approach suggested by Hawkins and Olwell:

Z_i = (p_i − p̄) / √(p̄(1 − p̄)/n_i)

S_t = Σ_{i=1}^{t} Z_i

where Z_i is the standardised mortality rate for the month indexed by i, n_i is the number of
admissions in that month, and S_t is the CUSUM statistic.
Figure 4.5 displays a Z score CUSUM of data for the PAH ICU 1995 - 1999 analysed in months based
on the in-control mortality rate of 0.16. A change in the slope of the Z score CUSUM of in-hospital
mortality rates occurs between July 1996 and January 1997, when the mortality rate falls.
The Z score CUSUM accumulates the Z score values; the target mean is zero with a standard
deviation of 1. The advantage of standardising all observations prior to incorporation on a CUSUM
chart is that it allows other observations, say clinical indicators or other measures of process, to be
plotted as CUSUMs on the same chart. The disadvantage is that the original units are no longer
displayed.
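The standardisation step can be sketched as follows. This is an illustrative sketch with made-up monthly figures, not PAH data; the function names are my own.

```python
# Sketch of the Z score CUSUM for monthly mortality rates with varying
# admission numbers. Each month is standardised against the in-control
# rate p0 so months of different size share a common scale.
import math

def z_score(p_obs, p0, n):
    """Standardise an observed monthly mortality rate against the
    in-control rate p0 for a month with n admissions."""
    return (p_obs - p0) / math.sqrt(p0 * (1 - p0) / n)

def z_cusum(rates, ns, p0):
    """CUSUM S_t of the monthly Z scores."""
    s, out = 0.0, []
    for p_obs, n in zip(rates, ns):
        s += z_score(p_obs, p0, n)
        out.append(s)
    return out

# Three months with different admission counts; when in control the
# Z scores have mean 0 and standard deviation 1.
print(z_cusum([0.16, 0.12, 0.10], [110, 90, 120], 0.16))
```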
Statistical tests that the current process mean is different from the in-control mean are based on the
assumptions that the observations are independent and that they are drawn from an in-control sample of
known distribution.
Appendix 4 provides the background for the statistical approach to CUSUM chart analysis. It contains
the formulae used in this section and an analysis of the ARL that leads to the choice of the chart
design parameters for this ICU application. The design of the chart and the statistical analysis of the
data require the in-control distribution of mortality rates to be known, and the shift of the process mean
that is of interest to be specified. In the example, the in-control distribution is based on a mean mortality
rate of 0.16. However, the choice of the shift in process mean to be detected by the CUSUM must be
plausible and based on clinically important values. The choice of chart parameters and the decision
thresholds depends on the ARL of the CUSUM under in-control and changed conditions.
The CUSUM test provides a method for ongoing analysis as the outcome data accumulate. In this ICU
example, two CUSUM charts are run concurrently. The upper CUSUM, C+, tests for an increased
mortality rate, with the chart signalling that the process is out-of-control when C+ > h+. The lower
CUSUM, C−, tests for a decreased mortality rate, with the chart signalling that the process is out-of-
control when C− < h−. The CUSUM test chart examples are plotted with the upper and lower statistics
C+ and C− (or S+ and S− for the Z score CUSUM) against observation number. The control limits h+ and
h− are plotted as lines parallel to the x axis. When the CUSUM statistic exceeds the control threshold,
the statistic is reset to zero and the monitoring process is continued with the same parameters.
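The two-sided test with reset can be sketched as below. This is a hedged illustration of the tabular CUSUM scheme described above, using the reference values K+ and K− and thresholds h+ and h− from the text but with invented block mortality rates.

```python
# Sketch of the two-sided CUSUM test: an upper CUSUM C+ with reference
# value k_plus and threshold h_plus, and a lower CUSUM C- with k_minus
# and h_minus. On a signal the statistic is reset to zero and
# monitoring continues with the same parameters.
def two_sided_cusum(rates, k_plus, h_plus, k_minus, h_minus):
    c_plus = c_minus = 0.0
    signals = []
    for i, p in enumerate(rates):
        c_plus = max(0.0, c_plus + p - k_plus)
        c_minus = min(0.0, c_minus + p - k_minus)
        if c_plus > h_plus:
            signals.append((i, "high"))
            c_plus = 0.0
        if c_minus < h_minus:
            signals.append((i, "low"))
            c_minus = 0.0
    return signals

# Parameters from the text: target 0.16, K+ = 0.185, K- = 0.135,
# h = +/- 0.073. The block mortality rates below are illustrative.
rates = [0.10, 0.11, 0.09, 0.12, 0.10]
print(two_sided_cusum(rates, 0.185, 0.073, 0.135, -0.073))  # → [(2, 'low')]
```

The lower CUSUM drifts downwards because each block's mortality is below K− = 0.135, crossing h− at the third block, after which the statistic resets to zero.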
The following charts illustrate the upper and lower CUSUM charts. Figure 4.6a is a CUSUM chart of
blocks of 100 patients. The target mean is 0.16, and the upper level of mortality to be detected is 0.21,
corresponding to K+ = 0.185, and the control limit is h+ = 0.073. The lower level of mortality is 0.11,
corresponding to K− = 0.135, and the lower control limit is h− = −0.073. The ARL under in-control
conditions is approximately 20 observations, which is equivalent to a false alarm on average every 2000
patients or 1.7 years. There are three occasions when the CUSUM chart signalled that observed
mortality was significantly lower than the in-control mortality rate. Given the design characteristics of
the chart, these signals are unlikely to represent chance variation.
If the signal represents a true and sustained change in the process, then the charting parameters should
be altered. Figure 4.6b shows the CUSUM of blocks of 100 admissions recommenced after block 3,
when the first signal of Figure 4.6a was recorded. The revised chart is designed to a target mortality
rate of 0.13, and to optimally detect changed mortality rates of 0.10 and 0.16 (K+ = 0.145, K− = 0.115, h
= ±0.088, in-control ARL of about 20 again). There are no signals in this chart, which suggests that the
mortality rate observed during the period of analysis was not substantially different from 0.13.
A number of approaches to estimating the current mean output are available. The overall mean of a
series of values does not provide a good estimate of the current output, due to contributions of
sometimes distant historical values. A method that provides better current estimates is the
exponentially weighted moving average (EWMA). More detail for this statistic is provided as it is
applied in this section and adapted for risk adjustment in Chapter 5.
The EWMA is an estimator that has been extensively investigated. It assigns a higher weight to recent
observations than to more distant or historical values. It is useful for detecting small persistent shifts
in the mean outcome and for estimating the timing of those shifts. The EWMA is slower to detect large
shifts in the process mean than the p chart. In contrast to the p chart and the CUSUM, the EWMA is
relatively insensitive to departures from the normal distribution. As well as monitoring mortality for
groups of patients, the EWMA can be applied to monitoring individual patient outcomes. The statistic
is calculated recursively as:

EWMA_i = λ y_i + (1 − λ) EWMA_{i−1}

where y_i is the value of the i-th observation. This value may be the mortality rate of samples of
patients, p_i, or the outcome of a single patient, Y_i. In the examples used, both sample blocks of 100
consecutive patients and single patient outcomes are presented. λ is the weight, between 0 and 1.
Larger values give more weight to recent observations, with the limiting value of λ = 1 producing a
plot of y_i against i, similar to a p chart. EWMA_i is the value of the statistic indexed by i.
The calculation of the EWMA statistic is an iterative process. For the first value, EWMA_0, an
estimate of the in-control mortality rate is used. In this example, the estimate is p̄. EWMA_i is then a
weighted sum of the initial estimate and all observations to date:

EWMA_i = (1 − λ)^i p̄ + λ Σ_{k=1}^{i} (1 − λ)^{i−k} y_k

where k is a whole number ≤ i. We can calculate control limits for the EWMA using the in-control
mortality rate and the standard deviation. A formula for the control limits is:

CL_i = p̄ ± a σ √[ (λ/(2 − λ)) (1 − (1 − λ)^{2i}) ]

where σ is the standard deviation of the observations of the in-control sample, and a is the width of the
control limits in multiples of σ. For blocks of patients, σ = √(p̄(1 − p̄)/n), so that:

CL_i = p̄ ± a √[ (p̄(1 − p̄)/n) (λ/(2 − λ)) (1 − (1 − λ)^{2i}) ]

where n is the number of patients in each block of admissions. For an EWMA for single cases, n = 1.
As i becomes large, the term (1 − (1 − λ)^{2i}) approaches 1, and asymptotic control limits
CL = p̄ ± a σ √(λ/(2 − λ)) can be used.
The design of the EWMA chart requires consideration of the effects of a, λ and n on ARL under in-
control and changed conditions. Appendix 5 provides an analysis of the effect of these parameters.
Parameter choice determines the ability to detect real shifts in the process mean and the likelihood of false
positive signals.
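The recursion and the time-varying control limits above can be sketched together. This is an illustrative implementation under the stated formulas, with simulated block mortality rates rather than PAH data; the function name and variable names are mine.

```python
# Sketch of an EWMA chart for block mortality rates with time-varying
# control limits CL_i = p0 +/- a*sigma*sqrt(lam/(2-lam)*(1-(1-lam)^(2i))),
# where sigma = sqrt(p0*(1-p0)/n) for blocks of n patients.
import math

def ewma_chart(rates, p0, lam, a, n):
    """Return (ewma, lower, upper) lists for blocks of n patients.
    p0: in-control mortality rate; lam: weight; a: limit width in sd."""
    sigma = math.sqrt(p0 * (1 - p0) / n)   # sd of one block's mortality rate
    ewma, lower, upper = [], [], []
    e = p0                                  # EWMA_0 = in-control estimate
    for i, y in enumerate(rates, start=1):
        e = lam * y + (1 - lam) * e
        half = a * sigma * math.sqrt(lam / (2 - lam) * (1 - (1 - lam) ** (2 * i)))
        ewma.append(e); lower.append(p0 - half); upper.append(p0 + half)
    return ewma, lower, upper

# Illustrative blocks with mortality persistently below the target 0.16.
e, lo, hi = ewma_chart([0.12, 0.11, 0.13, 0.10, 0.12], 0.16, 0.3, 2, 100)
signals = [i for i, v in enumerate(e) if not lo[i] <= v <= hi[i]]
print(signals)  # → [3, 4]
```

The limits widen over the first few blocks as the (1 − (1 − λ)^{2i}) term grows towards its asymptotic value, which is why the early low observations do not signal immediately.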
Figure 4.7a presents an EWMA chart of the mortality rates for blocks of 100 patients during 1998 - 1999,
with p̄ = 0.16, a = 2 and λ = 0.3. It is clear that the mean mortality rate is not 0.16. The mean
mortality rate appears to have changed to lie in the range 0.12 - 0.14.
The chart plots the EWMA of mortality rates in blocks of 100 cases, with upper (+2σ) and lower (−2σ) control limits.
Figure 4.7b presents the same data, but with charting parameters revised and estimated based on the
EWMA estimate of the mean mortality. The new target mean was 0.141 after the second block of
observations. There are no further signals. The chart cannot be used to demonstrate a statistically
significant difference between the EWMA estimate of mortality rate and 0.141; that would require use
of an appropriate statistical test. Nevertheless, the chart suggests that the mortality rate lies close to the
revised target of 0.141.
The chart plots the EWMA against blocks of 100 cases, with upper (+2σ) and lower (−2σ) control limits.
The next series of charts plot the EWMA for individual patient outcomes. Figure 4.8a displays the EWMA
chart of outcomes of admissions in 1998 - 1999. The in-control mortality estimate is 0.16 and
λ was chosen to be 0.001. Again it is clear that the mean during this period is not 0.16.
The chart plots the EWMA against case number, with upper (+2σ) and lower (−2σ) control limits.
Figure 4.8b presents the same data, using a starting in-control estimate of 0.16. When the EWMA
crosses the lower control limit at case 69, a revised estimate of mortality of 0.154 allows
recommencement of monitoring. At case 387, the lower control limit is crossed again and the in-control
mortality is re-estimated to be 0.143. At case 2022, the lower control limit is crossed again, the in-
control mortality rate is estimated to be 0.128, and there are no further signals.
The chart plots the EWMA against case number, with upper (+2σ) and lower (−2σ) control limits and sequentially revised targets (final target 0.128).
The conclusion from using these charts is that the EWMA is useful to estimate the current mean and to
demonstrate when the mean varies from a target value. When a signal occurs, the EWMA provides a
ready estimate for a new target mean mortality rate, and the monitoring can be continued. The choice
of the parameters λ and a depends on the ARL performance, and the EWMA is robust to the effect of
non-normality. Appendix 5 presents an analysis of the EWMA charts which leads to the choice of
charting parameters for this application.
Each of the control charts used in this chapter demonstrated that the in-hospital mortality during 1998 -
1999 was less than the in-control estimate from the period 1995 - 1997. Figure 4.5 suggests that the
change in mortality rate was likely to have occurred between July 1996 and January 1997.
An analysis of the ICU casemix and illness severity was undertaken to determine a cause of the change
in mean mortality rate. Patient age and severity of illness (measured by the APACHE III score) were
analysed with non-parametric one way analysis of variance (Kruskal-Wallis). Casemix was assessed
with the patients grouped into Emergency Surgery, Elective Surgery and Non-operative Cases.
Table 4.1: ICU patient characteristics by year, 1995 - 1999
% male*
Age, mean in years#: 52.2, 52.2, 53.4, 53.0, 54.2
APACHE III score, mean†
Emergency surgery, number (%): 151 (14.2), 139 (13.0), 145 (14.1), 137 (13.6), 121 (10.9)
* χ²(4) = 6.8, p = 0.56; # Kruskal-Wallis χ²(4) = 9.2, p = 0.n; † Kruskal-Wallis χ²(4) = 9.2, p = 0.06
There was no change in the gender of patients admitted to the ICU. There was a non-statistically
significant increase in the average patient age from 52.2 years in 1995 and 1996 to 53.4 years in 1997,
which would not explain a fall in mortality rate. There was no systematic trend in the APACHE III
score measuring severity of illness, so a change in illness severity does not explain the fall in mortality rate.
Figure 4.9: Percentages of Elective Surgery, Non-Operative and Emergency Surgery cases by year.
There were changes in patient casemix during the period. Figure 4.9 presents the percentages for the
patient groups from Table 4.1. There was an increase in the percentage of elective surgery and a
decrease in the percentage of non-operative cases during 1998 - 1999 compared to the period 1995 -
1997. Patients with Elective Surgery have a low in-hospital mortality rate compared to Non-Operative
patients or Emergency Surgery patients. Therefore, changes in casemix could explain the fall in the
mortality rate.
4.6 Summary
The use of p charts, CUSUM and EWMA charts has been demonstrated on raw mortality data for the
PAH ICU. A thorough analysis of the ARL of the charts under in-control conditions and under clinically
relevant changed mortality rates provided the basis for the choice of charting parameters.
All charts demonstrate that the mortality rate was lower in the period 1998 - 1999 than in the previous
period, 1995 - 1997. The increase in the percentage of elective surgery and the fall in the percentage of
non-operative cases may account for the fall in mortality rate that was observed.
This suggests that some form of casemix and patient severity of illness adjustment could improve the
assessment provided by control charts. The incorporation of a RA tool into the control charts is the subject of
Chapter 5.
Chapter 5:
Monitoring Intensive Care Outcome using Risk
Adjusted Control Charts.
5.1: Introduction
The purpose of this chapter is to develop and adapt control charts to incorporate an adjustment for
casemix and patient severity using an APACHE III model of in-hospital mortality as the risk
adjustment (RA) tool. The development of RA control charting brings together the topics studied in the
previous chapters.
A control chart compares observed outcomes with expected outcomes from a process over time. The
control charts used in Chapter 4 assumed a homogeneous risk of death for all patients equal to the in-
control mortality rate of 0.16 at Princess Alexandra Hospital (PAH) Intensive Care Unit (ICU). A RA
control chart will incorporate an estimate of the risk of death of every patient. The expected mortality
rate depends on the probabilities of death of the patients included in the analysis. The RA control chart
uses a model that fits the patient data during an in-control period to provide an estimate of the risk of
death. The APACHE III model with proprietary adjustments for hospital characteristics predicts the
probability of in-hospital death of ICU patients at the PAH reliably and the fit was validated on the
1995 - 1997 patient dataset. This model will be used to develop the RA charts described in this
chapter.
Control limits or decision thresholds provide a statistical test of the agreement between the expected
and observed outcomes. Out-of-control signals indicate that the RA model is unlikely to describe
adequately the relationship between the patient variables and the patient outcomes. This implies a
change from the in-control state. Signals that indicate a change in model fit may be due to a number of
causes.
The CUSUM has been used for RA control chart monitoring for cardiac surgery, myocardial
infarction patients and general surgical cases, and RA Shewhart charts have been described by
Alemi and co-workers. The methods described in these papers will be modified for the ICU
context, and the RA exponentially weighted moving average (RA EWMA) chart will be introduced.
The average run length (ARL) for these charts to detect changes in RA mortality rates will be analysed.
Parameter choice will be made for this ICU application based on the performance of the charts under
in-control and clinically relevant changed conditions.
An important consideration in the development of RA control chart methods has been to select a
method to describe the distribution of mortality rates. This important basis for RA chart development is
explored in Appendix 6. Three methods to characterise the distribution of observed mortality rates are
discussed.
Two of these methods are applied to the charts and data in this chapter. The central limit theorem leads
to approximation of the mortality rate distribution using a normal distribution. This is a good
approximation for most RA chart applications. An exact method, using an iterative approach, is also
used to calculate the cumulative probability function of mortality rates of samples with patients of
differing risks of death.
In previous applications of RA strategies to health care, the emphasis has been on comparison between
institutions, rather than monitoring changes over time. For example, the emphasis in cardiac surgical
studies has been on using RA approaches for comparisons. In contrast, the RA methods that will
be described in this thesis are designed to monitor mortality rates, and detect changes, within a single
institution.
The American APACHE III hospital mortality model, adjusted for hospital characteristics, performed
well as an estimate of in-hospital mortality in the PAH ICU between January 1995 and December 1999.
It will be used as a RA tool to illustrate monitoring of ICU mortality. The characteristics of the data
series and the performance of the APACHE III model for the in-control period of 1995 - 1997 are
summarised in Chapter 3. Although a validated model of the APACHE system is used in these
analyses, any model with adequate performance could be used for RA. Other models have been
developed using this dataset, and further development of an alternative machine learning RA model is
presented in Chapters 6 and 7.
RA control charts are proposed to detect a change in the process of care, using model fit as an
indication of change. In Figure 1.2, Iezzoni's "Algebra of Risk" described the manner in which patient
factors and diagnosis factors led to patient outcomes, influenced by the process of care and random
variation. The relationship between the patient and disease factors, and the patient outcomes, could be
modelled. Subsequently, differences between observed and predicted patient mortality could be
attributed to changes in the process of care or to random variation.
There are a number of issues that arise with risk adjusted outcome monitoring.
1. Performance of RA model
The key to monitoring RA mortality rates is to have an adequate model to estimate the patient risk of
death. The model performance must meet the criteria discussed in Chapter 2. It must be reproducible
and be able to be generalised to unseen data separate from the context in which the model was
developed or validated. The performance must be assessed in terms of discrimination, for example,
using the area under the ROC curve, and calibration, assessed by the Hosmer-Lemeshow statistics.
Ultimately, any numerical recommendation will be an opinion based on the practicalities of data
collection and model performance.
The first consideration is the level of performance that can be realistically expected from an ICU
mortality prediction model. A review of the performance of models to predict ICU outcomes was
summarised in Chapter 2, Table 2.1. This gives a guide to the discrimination and calibration that can be
expected from this type of model. For example, the area under the ROC curve should be in the range
0.80 - 0.90. It is reasonable to expect that the Hosmer-Lemeshow C statistic based on 10 groups
should have a value less than 15.5 (χ²(8), p > 0.05), but preferably C should be much smaller. The
power of the Hosmer-Lemeshow method to detect departures from the null hypothesis of good model
fit depends on the sample size.
The second consideration is the sizes of the modelling and validation datasets. This requires a balance
between the need for large datasets to develop and validate a model, and the finite time and resources
available to collect patient data. A recent study by Clermont and co-workers modelled ICU patient
outcomes with logistic regression and artificial neural networks, using a developmental dataset of 800
patients and a validation set of 447 cases. With developmental datasets of 800 patients or larger,
satisfactory models were developed, but with smaller datasets the models were unreliable. A similar
patient series of 1200 patients would take approximately 1 year to collect at the PAH ICU. This would
provide 800 cases (67% of the data) for model development and 400 cases (33% of the data) for model
validation. The study of Clermont et al. provides a useful benchmark, and on this basis, I propose that a
practical compromise for model development dataset size be 800 patients, with a validation dataset of
400 patients.
The APACHE III model with adjustment for hospital characteristics performed better than the
minimum recommendations given above. On the 3159 patients at the PAH ICU during 1995 - 1997,
the area under the ROC curve was 0.91, and the Hosmer-Lemeshow C statistic was 14.5.
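The two performance measures used above can be computed as follows. This is an illustrative sketch, not the thesis code: the data are simulated, and the helper functions (`auc`, `hosmer_lemeshow_c`) are my own minimal implementations of the standard definitions.

```python
# Sketch of the two model-performance checks named in the text:
# area under the ROC curve (discrimination) and the Hosmer-Lemeshow
# C statistic (calibration). Data below are simulated, not PAH data.
import numpy as np

def auc(y, p):
    """Probability that a random non-survivor is ranked above a random
    survivor (Mann-Whitney form of the area under the ROC curve)."""
    pos, neg = p[y == 1], p[y == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def hosmer_lemeshow_c(y, p, groups=10):
    """C statistic over groups ordered by predicted risk:
    sum of (observed - expected)^2 / (n * pibar * (1 - pibar))."""
    order = np.argsort(p)
    chi2 = 0.0
    for idx in np.array_split(order, groups):
        obs, exp, n = y[idx].sum(), p[idx].sum(), len(idx)
        chi2 += (obs - exp) ** 2 / (exp * (1 - exp / n))
    return chi2

rng = np.random.default_rng(0)
p = rng.uniform(0.01, 0.9, size=1000)          # simulated predicted risks
y = (rng.uniform(size=1000) < p).astype(int)   # outcomes drawn from p
print(round(auc(y, p), 2), round(hosmer_lemeshow_c(y, p), 1))
```

Because the simulated outcomes are drawn from the predicted risks, the model is perfectly calibrated by construction and the C statistic should be small relative to its χ² reference.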
2. Model generalisation
Several simulation studies show that models that estimate the probability of ICU patient death
can have poor performance and provide unreliable predictions of risk of death when the patient samples
differ from that under which the model was developed.
A statistical model potentially describes only the relationships that were present in the data on which it
was developed, and does not represent an immutable truth. Models may not be applicable to all ICU
patient samples and institutions to which they might be applied. Model fit cannot be assumed and must
be verified.
3. Changes in model fit
The fit of a RA model may change over time. This could be attributable to evolving quality of care.
A simulation study by Zhu et al. demonstrated that poor quality of care, simulated by increased
mortality, causes model fit to deteriorate. Alternatively, changes to model fit could be due to changes in
discharge, admission or data collection practice, or changes in policy, equipment or treatment goals that
are unrelated to the quality of care.
4. Observer effects
In practice, changes in performance could be due to the effect of the observer, or the commencement of
surveillance on the process. During the 1920s, productivity at the Western Electric plant in
Hawthorne, Illinois, improved whenever observation or any intervention was attempted. The
Hawthorne effect could impact on the process of care, or it could exert a more subtle influence.
Admission and discharge practices, data collection, patient selection and data interpretation could be
influenced if the system is under scrutiny. For analysis of an historical dataset, this would not be an
issue, but it is a consideration when these monitoring approaches are applied in a real clinical setting.
It is unrealistic to use a single measurement to capture the quality of care of patients. Patient survival
provides only one aspect of evaluating ICU outcomes. A programme of quality management should
evaluate many domains of ICU performance including mortality, resource use, process measures,
access measures and complication rates. RA control charts form part of the measurement of patient
mortality. Control chart approaches can be readily applied to the other domains of ICU performance
measurement.
5. Choice of outcome
There are other meaningful endpoints than in-hospital death. Quality of life after treatment in ICU is
important to patients, but is usually not measured. It has been proposed that assessments of the quality
of ICU outcomes should measure the quality of life rather than the rate of death. Mortality is a crude
indicator of the quality of ICU care and may offer limited insight into subtle changes in the process of
care. The ability of RA outcomes other than death to measure quality of care is untested and will be
difficult to measure and model. Quality of life after ICU has been studied and reviewed, though it
has not been used as a measure of quality of care. Unfortunately, there are no models that estimate the
quality of life after ICU treatment.
The choice of the definition of the mortality outcome has important implications for the timeliness and
accuracy of mortality analysis. ICU patients may have hospital stays of months or occasionally years. If
the endpoint of in-hospital mortality is being analysed, then it may be months (or even years) after an
admission date before all the patient records are finalised, with the patients discharged from hospital or
dead. Early analysis is biased toward a higher mortality rate. The APACHE III system uses in-hospital
mortality, and for this retrospective application using a complete dataset this is not an issue. However,
in practice, a 30-day in-hospital mortality endpoint allows analysis to be conducted 30 days after ICU
admission. This important alternative endpoint is used in model development in Chapters 6 and 7.
Data quality and collection issues are important considerations for the development and validation of
RA models and subsequent RA chart analysis. The general considerations of data quality have been
discussed in Chapter 2.
Some RA mortality models for settings other than ICU are based on the use of patient data designed for
billing or administrative work, not for RA. Minimum datasets from administrative, demographic and
patient billing records are relatively inexpensive, and there are masses of data available. These data are
of controversial quality because of ambiguities of definition and problems with coding.
Administrative databases do not capture all the features that determine a patient's outcome, and do not
include clinical variables reflecting severity of illness, co-morbidities and abnormal laboratory
evaluations. Failure to capture these explanatory features will compromise the potential performance of
models.
The best datasets should be gathered with clinical risk estimate modelling or RA purposes in mind,
though this is time consuming and expensive. A prospective data collection and collation process is
needed to capture accurately the clinical characteristics of the patients. The data for ICU patients at the
PAH ICU 1995 - 1999 is a complete, prospectively collected clinical dataset of this kind.
RA techniques applied to monitoring outcomes in a single institution over time offer an adjunct to
current methods of quality management. If the RA model meets performance criteria, charts comparing
observed mortality rates against the RA expected values or other RA mortality statistics are possible.
The methods in the following sections are RA adaptations of standard monitoring charts: the p chart of
mortality rates (RA p chart), the RA CUSUM and the RA EWMA chart. Before describing these
adaptations, it is necessary to describe and address some issues that arise in calculation of control limits
for the charting procedures, and to consider how valid the statistical assumptions are for the distribution of
mortality rates. Appendix 6 describes three approaches to characterising the distribution of observed
mortality rates.
In an industrial application, where the inputs are standardised and the process is in-control, the risk of
failure for each item sampled is assumed to be the same. Changes in the statistical distribution of the
output then indicate a change in the process.
In contrast, in medical applications, each patient has a unique set of contributors to risk of death. A
patient brings physiological reserve (captured by age and co-morbidities), physiological disturbance
(captured by the abnormalities in a range of physiological, clinical and laboratory measurements) and a
diagnosis or disease category. There will be additional factors that are not measured in the first 24
hours in the ICU. Some of these occur before admission (pre-admission treatment, details of the exact
surgical event), some are present but not recorded (rare conditions, new technology or uncommon
82
measurements) and many of the determinants of outcome occur after the first day (failure or
complications of therapy).
Alemi and Oliver argue that the estimate of risk of death should reflect the risk of death of a patient
under ideal conditions, rather than what is reasonable to expect from a health care process. This ideal
may be impossible to estimate, but if the argument were accepted, underperformance would be
universal.
The exposure of patients to the process of care will not be constant and will vary from hours to
months. The outcome of a patient who spends 8 hours in ICU after an elective surgical procedure is
more reliant on the success of the surgery and the patient's underlying physiological reserve than ICU
care. In contrast, the survival of a patient who spends 3 months in ICU with pneumonia and multi-
organ failure depends heavily on the quality of the care offered in the ICU.
With these issues considered, the development of a RA charting approach will be presented.
The patient mortality data can be analysed in blocks of a constant number of patients, say, 100
admissions or in specified time periods (e.g. months). For fixed block length, the time periods over
which patients are accrued will vary, and will not neatly fit into calendar months or quarters. Fixed
time periods, e.g. monthly blocks, are more convenient from a unit management perspective, but the
case numbers vary. Either approach can be used. For this application, a fixed block length of 100
patients will be used. At the PAH ICU during the study period, 100 patients were admitted over about 4
weeks.
The RA p chart plots the observed mortality rate on a chart with control limits calculated using the
risks of mortality of patients in the sample. RA X̄ charts and p charts have been proposed by
Alemi and co-workers to calculate control limits. The ICU application has large
sample sizes, so I have adapted the RA p chart to ICU RA mortality monitoring by using a normal
approximation.
The following notation and formulae are used for the RAp charts. $Y_{ij}$ is the outcome for patient $j$ in sample $i$. If the patient dies, $Y_{ij} = 1$ and the probability of death is $\pi_{ij}$. If the patient survives, $Y_{ij} = 0$.

$$E(Y_{ij}) = \pi_{ij}$$

$$\mathrm{var}(Y_{ij}) = \pi_{ij}(1-\pi_{ij})$$

$\pi_{ij}$ may be estimated by $\hat{\pi}_{ij}$, using a statistical model such as the APACHE III risk of death estimate.

$$\widehat{\mathrm{var}}(Y_{ij}) = \hat{\pi}_{ij}(1-\hat{\pi}_{ij})$$

The observed mortality rate of sample $i$, containing $n_i$ patients, is

$$R_i = \frac{\sum_{j=1}^{n_i} Y_{ij}}{n_i}, \qquad E(R_i) = \frac{\sum_{j=1}^{n_i} \hat{\pi}_{ij}}{n_i}$$

$$\mathrm{var}(R_i) = \frac{\sum_{j=1}^{n_i}\mathrm{var}(Y_{ij})}{n_i^2} = \frac{\sum_{j=1}^{n_i}\hat{\pi}_{ij}(1-\hat{\pi}_{ij})}{n_i^2}$$
The RAp chart compares $R_i$ to $E(R_i)$, with control limits calculated around $E(R_i)$. The control limits are defined as multiples of the standard deviation (σ). Let $a$ be the number of σ used; then

$$CL_i = E(R_i) \pm a\sqrt{\mathrm{var}(R_i)} = \frac{\sum_{j=1}^{n_i}\hat{\pi}_{ij}}{n_i} \pm \frac{a}{n_i}\sqrt{\sum_{j=1}^{n_i}\hat{\pi}_{ij}(1-\hat{\pi}_{ij})}$$
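As a concrete sketch of the calculation above (my own illustration, not code from the thesis; the function name and the example risks are hypothetical):

```python
import math

def rap_limits(risks, a=2.0):
    """Centre line and +/- a-sigma control limits for one RAp sample.

    risks: estimated probabilities of death for the n patients in the sample.
    Returns (expected_rate, lower_limit, upper_limit).
    """
    n = len(risks)
    expected = sum(risks) / n                       # E(R_i)
    var = sum(p * (1 - p) for p in risks) / n**2    # var(R_i)
    sd = math.sqrt(var)
    return expected, expected - a * sd, expected + a * sd

# Hypothetical sample of 100 patients: mostly low risk, a few high risk.
risks = [0.05] * 80 + [0.30] * 15 + [0.70] * 5
centre, lcl, ucl = rap_limits(risks)
observed_rate = 0.18                                # 18 deaths in 100 patients
signal = observed_rate > ucl or observed_rate < lcl
```

Note that the limits depend on the individual patient risks, not just the sample mean: a sample of very high- and very low-risk patients has wider limits than a homogeneous sample with the same mean risk.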
Figure 5.1 is a RAp chart of the hospital mortality rate of all ICU patients, in blocks of 100 admissions. Control limits are set at ±2σ, and observed mortality rates that fall outside the limits are marked with an asterisk (*). The average APACHE III predicted mortality rate and the observed mortality rate are plotted for each block.

[Figure 5.1: RAp chart of hospital mortality rate; x-axis: Sample Number (blocks of 100 patients).]
In blocks 7, 13 and 19, the observed mortality was above the upper control limit, and in block 51 the mortality rate was below the lower control limit. The chart can be interpreted as showing that the hospital mortality rate of ICU patients was very likely to have been significantly above the level expected from the patient characteristics in the earlier part of the analysis period. The hospital mortality rate may have fallen relative to the predicted rate in the later part of the period.
An expression for the probability of a single observation of the RAp chart falling outside the control limits can be developed. This is the power for a single observation and is adapted from the formula of Flora. The estimated probability of $Y_{ij} = 1$ is $\hat{\pi}_{ij}$, but under changed conditions, which we wish to detect, the probability is $\pi'_{ij}$. Then

$$\text{Power} = \Phi\!\left(\frac{\sum_{j=1}^{n_i}(\hat{\pi}_{ij}-\pi'_{ij}) - a\sqrt{\sum_{j=1}^{n_i}\hat{\pi}_{ij}(1-\hat{\pi}_{ij})}}{\sqrt{\sum_{j=1}^{n_i}\pi'_{ij}(1-\pi'_{ij})}}\right) + 1 - \Phi\!\left(\frac{\sum_{j=1}^{n_i}(\hat{\pi}_{ij}-\pi'_{ij}) + a\sqrt{\sum_{j=1}^{n_i}\hat{\pi}_{ij}(1-\hat{\pi}_{ij})}}{\sqrt{\sum_{j=1}^{n_i}\pi'_{ij}(1-\pi'_{ij})}}\right)$$

Equation 5.1

where $\Phi$ is the standard normal cumulative distribution function.
It is convenient to consider the change in risk of death in terms of an altered odds ratio (OR), where it is the odds of death rather than the absolute risk of death that changes:

$$OR = \frac{\pi'_{ij}/(1-\pi'_{ij})}{\hat{\pi}_{ij}/(1-\hat{\pi}_{ij})}$$

Rearranging gives the altered probability of death:

$$\pi'_{ij} = \frac{OR\,\hat{\pi}_{ij}}{1-\hat{\pi}_{ij}+OR\,\hat{\pi}_{ij}}$$

Equation 5.2
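Equations 5.1 and 5.2 can be combined in a short sketch (an illustration under the notation above; the function names and the example casemix are mine):

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def altered_risk(p, odds_ratio):
    """Equation 5.2: probability of death after the odds of death change."""
    return odds_ratio * p / (1 - p + odds_ratio * p)

def single_sample_power(risks, odds_ratio, a=2.0):
    """Equation 5.1: probability one sample falls outside +/- a-sigma limits."""
    alt = [altered_risk(p, odds_ratio) for p in risks]
    shift = sum(p - q for p, q in zip(risks, alt))      # sum(pi_hat - pi')
    sd0 = math.sqrt(sum(p * (1 - p) for p in risks))    # null sd (times n_i)
    sd1 = math.sqrt(sum(q * (1 - q) for q in alt))      # altered sd (times n_i)
    return (norm_cdf((shift - a * sd0) / sd1)
            + 1 - norm_cdf((shift + a * sd0) / sd1))

risks = [0.1] * 100                         # hypothetical uniform-risk sample
p_null = single_sample_power(risks, 1.0)    # false alarm probability per block
p_double = single_sample_power(risks, 2.0)  # power against a doubled OR
```

With OR = 1 the formula reduces to the two-tailed false alarm probability of the ±aσ limits, which is a useful check of any implementation.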
An analysis of performance of the RAp chart is shown in Figure 5.2. Each curve is the probability that $R_i$, the mortality rate of a sample of 100 cases with known $\hat{\pi}_{ij}$, will fall beyond the 2σ control limits. $\pi'_{ij}$ from Equation 5.2 is substituted into Equation 5.1. The curves plotted in Figure 5.2 are the power analyses of consecutive samples of 100 patients from the PAH database across the range of altered OR 0.2 – 4.
This power analysis is equivalent to the operating characteristic curve analysis used to analyse the performance of the non-RA p chart in Appendix 3. For any observation at any OR, the power can be read from the corresponding curve.

[Figure 5.2: Probability of a single observation falling beyond the 2σ control limits; x-axis: Changed Odds Ratio (OR), 0.2 – 4.]
These analyses show that the probability of an observed mortality rate exceeding the control limits will depend on the altered OR and the estimated risks of death of the 100 patients in each sample. For example, with an OR falling to 0.5, there is between a 0.2 and 0.55 chance of detecting a difference on the single block observation. The probability of detecting a doubling of OR is between 0.48 and 0.65.
The performance of a control chart is generally assessed in terms of ARL in-control (OR = 1) and out-of-control (OR ≠ 1). Figure 5.3 shows the results of a simulation experiment demonstrating the effect of
changing OR on the ARL of a RAp chart. To simulate the patient casemix for this analysis, cases were
randomly drawn in blocks with replacement, from the 5278 admission records. Block sizes of 100
cases (~1 month), 600 (~6 months) and 1200 (~1 year) were chosen to simulate rational sample sizes. Simulated outcomes were allocated as a series of Bernoulli trials, based on the new OR and the out-of-control risk of death $\pi'_{ij}$. The ARL is the average number of samples $i$ until a simulated mortality rate fell outside the 2σ control limits. For Figure 5.3, as the sample sizes differ, the ARL is expressed in the average number of cases to signal. Ten thousand simulations were performed at each value for OR.

[Figure 5.3: Semi-logarithmic plot of ARL (cases to signal) against OR for block sizes of 100, 600 and 1200.]
Figure 5.3 is a semi-logarithmic plot that shows the effect on ARL of different sample sizes and different OR. As a general observation, smaller block sizes require fewer cases to detect changes in OR, and have shorter ARL under in-control conditions. Whilst blocks of 1200 patients offer only one false alarm every 22 years on average, and the analysis readily detects changed OR < 0.7 and > 1.6, accrual of cases would take a year at the PAH ICU. This limits the practical use of p charts with 1200 case blocks. Blocks of 100 cases have an ARL of 2.7 (270 cases) at OR = 0.5, an ARL of 22.1 (2210 cases) at OR = 1, and 1.8 (176 cases) at OR = 2. A false alarm is more likely to occur with the block size of 100 patients, but the probability is acceptable in this situation. The 100 patient block p chart readily detects increases in OR.
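The block simulation described above can be sketched as follows (a Monte Carlo illustration with a hypothetical casemix and a small replication count, not the thesis code):

```python
import math
import random

def altered_risk(p, odds_ratio):
    """Probability of death after the odds of death are multiplied by OR."""
    return odds_ratio * p / (1 - p + odds_ratio * p)

def run_length(risks, odds_ratio, block=100, a=2.0, rng=random):
    """Number of blocks until a simulated mortality rate signals."""
    blocks = 0
    while True:
        blocks += 1
        # Resample the casemix with replacement, as in the thesis experiment.
        sample = [rng.choice(risks) for _ in range(block)]
        deaths = sum(rng.random() < altered_risk(p, odds_ratio) for p in sample)
        centre = sum(sample) / block
        sd = math.sqrt(sum(p * (1 - p) for p in sample)) / block
        rate = deaths / block
        if rate > centre + a * sd or rate < centre - a * sd:
            return blocks

random.seed(1)
risks = [0.05] * 80 + [0.30] * 15 + [0.70] * 5   # hypothetical casemix
arl = sum(run_length(risks, 2.0) for _ in range(200)) / 200
```

Multiplying the ARL in blocks by the block size gives the average number of cases to signal used in Figure 5.3.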
The same data grouped in blocks of 100 cases can be analysed and plotted using a standardised Z score, where:

$$Z_i = \frac{R_i - E(R_i)}{\sqrt{\mathrm{var}(R_i)}}$$

This statistic is the same as the score described by Flora, plotted in a control chart format. Figure 5.4 displays the Z score control chart. As this is the same data, the same blocks have Z scores > 2 or < −2 as those marked with an asterisk (*) in Figure 5.1. A trend of falling RA mortality is seen in this chart.
[Figure 5.4: Z score control chart; x-axis: Admission Number.]
A simple post hoc runs analysis provides strong evidence that the RA hospital mortality exceeded that predicted by the model in the initial part of the chart. There were 14 out of the first 15 of the RA observations above the predicted mortality in the first part of the analysis period in both the RAp and Z score charts.

The clear presentation and lack of clutter is an advantage of this Z score presentation, although there are no units.
Note that the Z score of this example uses a standardised statistic based on the estimated distribution parameters of $R_i$ given the $\hat{\pi}_{ij}$. In contrast, the Z score CUSUM of Chapter 4 uses a standardised statistic accumulated across successive observations. By definition, an analysis of the performance of the RA Z score chart, assuming the Z scores are normally distributed, will give the same results as the RAp chart of the previous section.
Parametric methods include the t distribution used by Alemi and co-workers, or the normal distribution that I have used here.
Further approaches, including exact methods based on the sample $\hat{\pi}_{ij}$, are described in Appendix 6. An iterative method can be used to estimate the empirical probability distribution of $R_i$, given the series of $\hat{\pi}_{ij}$. The observed mortality rate of a sample will correspond to a quantile of the cumulative probability distribution estimated for $R_i$. An adaptation of the RAp chart plots these quantile values.

Figure 5.5 shows an example of an empirical cumulative probability function of $R_i$ for a single sample of 100 patients. The iterative method described in Appendix 6 was used to calculate the distribution.
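The iterative calculation of the exact distribution of $R_i$ is a convolution of independent Bernoulli outcomes (a Poisson-binomial distribution). A minimal sketch of the idea, with my own variable names (not the Appendix 6 code itself):

```python
def death_count_pmf(risks):
    """Exact pmf of the number of deaths among patients with given risks.

    Built iteratively: adding one patient convolves the current pmf
    with a Bernoulli(p). Returns pmf[k] = P(exactly k deaths).
    """
    pmf = [1.0]
    for p in risks:
        nxt = [0.0] * (len(pmf) + 1)
        for k, prob in enumerate(pmf):
            nxt[k] += prob * (1 - p)      # patient survives
            nxt[k + 1] += prob * p        # patient dies
        pmf = nxt
    return pmf

def mortality_quantile(risks, observed_rate):
    """Smallest percentile of R_i corresponding to the observed rate."""
    n = len(risks)
    deaths = round(observed_rate * n)
    pmf = death_count_pmf(risks)
    return 100.0 * sum(pmf[:deaths])      # 100 * P(R_i < observed rate)

risks = [0.05] * 80 + [0.30] * 15 + [0.70] * 5   # hypothetical sample
q = mortality_quantile(risks, 0.18)              # percentile of an 18% rate
```

Because $R_i$ is discrete, each observed rate corresponds to a span of percentiles; the function above returns the lower end of that span, matching the plotting convention used for Figure 5.6.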
For example, an observed mortality rate of 0.15 corresponds to the range of the 22nd to 34th percentiles of the empirical cumulative probability function of $R_i$ for this sample, given the estimated probabilities of patient death. Less likely mortality rates define smaller spans of the distribution. An observed mortality rate of 0.1 corresponds to the range of the 0.2th to 0.9th percentiles. An observed mortality rate of 0.2 corresponds to the range of the 99.7th to 99.9th percentiles.
Figure 5.6 is a RAp chart of the hospital mortality data presented as percentiles of the empirical cumulative probability function of $R_i$. The single value displayed is the smallest percentile in the range defined by the observed mortality rate. The 97.5th and 2.5th percentiles of the cumulative probability function of $R_i$ are marked on the chart for reference. These percentiles are comparable to 2σ control limits.
[Figure 5.6: Percentile RAp chart with 97.5th and 2.5th percentile reference lines; x-axis: Block Number (n = 100).]
Figure 5.6 shows that the observed mortality rates were high compared to the expected distribution of $R_i$ in blocks 2, 5, 6, 11, 12 and 18, which all fell above the 97.5th percentile of the distribution. Blocks 48 and 51 fell below the 2.5th percentiles of the distributions of $R_i$. The overall message from this chart is similar to those presented for the previous RAp charts. 8 of the first 12 values are above the 95th percentile of predicted mortality rate, and 3 of the last 12 mortality observations are on or below the 5th percentile of predicted mortality. The RA analysis demonstrates that the mortality rate was initially above, and later below, the rate predicted by the model.
5.4 RA CUSUM
5.4.1 Background to the RA CUSUM
In the cardiac surgical literature, RA charts that plot accumulating expected deaths minus observed deaths have been called the "Variable Life Adjusted Display" (VLAD) and the "Cumulative RA Mortality Chart" (CRAM chart). Both applications used recalibrations of the Parsonnet score to estimate the risk of death.
There are differences between cardiac surgery and general ICU practice. Firstly, the APACHE III mortality model has superior performance compared to the cardiac surgical RA models. This may make existing RA models more practical in ICU. For example, the APACHE III ROC curve area is consistently above 0.88. It exceeds the discrimination of commonly used surgical risk of death estimation tools like the POSSUM score (ROC area = 0.75) and the Parsonnet score (area under the ROC 0.68 – 0.74). The APACHE III system collects more clinical and laboratory data, and uses clinical information up to the first 24 hours of admission, in addition to the demographic and diagnostic information. Secondly, mortality in the general ICU population is higher, being in the range of 0.1 – 0.2. This makes it reasonable to accept normal approximations of a binomial distribution, in contrast to approximations based on the Poisson distribution. Thirdly, cardiac surgery involves a small number of common surgical procedures, with the majority of adult cardiac surgery being elective coronary bypass and valve surgery. In contrast, in ICU there are 250 diagnostic groups in the APACHE III system, giving a diversity of casemix and relatively few patients in each category. Even the most common categories of ICU admission groups may comprise only 10% of patients.
With these considerations in mind, the methods of RA CUSUMs developed in the cardiac surgical literature will be adapted to ICU mortality monitoring.
Lovegrove et al. presented a simple qualitative method of reviewing cardiac surgical performance, using the Parsonnet method of mortality estimation as a RA tool. It is a plot of the cumulative expected deaths minus observed deaths. This method can be applied to ICU mortality data using the APACHE III RA model adjusted for hospital characteristics. The difference between cumulative expected and
observed mortality is plotted on the vertical axis and the patient admission number on the horizontal
axis.
$$C_n = \sum_{j=1}^{n}\left(\hat{\pi}_j - Y_j\right) \qquad \text{Equation 5.3}$$

where $j$ indexes the $n$ patients, $Y_j = 1$ for a death and $Y_j = 0$ for a survivor, and $\hat{\pi}_j$ is the RA estimate of the probability of death.
Figure 5.7 shows a plot of the RA CUSUM for individual patient observations from 1 January 1995 for
the series of 5278 patients. The chart shows an apparent excess mortality (downward slope) in the first
1800 admissions. After admission 2000, there is an upward slope, suggesting an "excess survival"
performance.
[Figure 5.7: Cumulative expected minus observed deaths against Admission Number.]
Attempts have been made to calculate the statistical significance of the change in model fit. Poloniecki et al. developed a plot of expected deaths minus observed deaths, incorporating control limits. The estimation of cardiac surgical risk and the control limits were provided by a recalibrated Parsonnet score. The estimate of surgical performance was updated as the monitoring scheme progressed, comparing observed outcomes to predicted mortality rates. This would correct a poor original estimate, but would be insensitive to any gradual change in experience. The control limits were designed with a Type I error of 0.01, but were not formal statistical tests, with no allowance made for multiple testing. It is difficult to know exactly what the control limits mean, though crossing a control limit was taken as evidence of a change in performance.
With the preceding reservations considered, I have applied a similar analysis to the PAH ICU mortality data. The approach uses a moving frame of admissions (100 cases) to estimate the control limits around the cumulative expected minus observed mortality. Due to the low mortality rates in cardiac surgery, Poloniecki et al. used a Poisson distribution to develop control limits. For the ICU data, a normal approximation is used.
The control limits at any admission number, n, are determined by the value of the CUSUM at the beginning of the block (case n − 99), and the predicted risks of death of the 100 patients in the block. This is done as a moving frame as individual patients are added to the series. No attempt has been made to correct for multiple testing.

The adapted Poloniecki approach plots the cumulative statistic $C_n$ as defined in Equation 5.3. Using a moving window of 100 cases, the upper and lower control limits after the first 100 cases, at case n, are:

$$CL_n = C_{n-100} \pm a\sqrt{\sum_{j=n-99}^{n}\hat{\pi}_j(1-\hat{\pi}_j)}$$
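A sketch of Equation 5.3 with the moving-frame control limits (my own illustration; `window` and `a` correspond to the 100-case frame and the σ multiplier described above):

```python
import math

def cram_chart(risks, outcomes, window=100, a=3.0):
    """Cumulative expected-minus-observed deaths with moving-frame limits.

    Returns parallel lists (cusum, lower, upper); limits are None for the
    first `window` - 1 cases, where no full block is yet available.
    """
    cusum, lower, upper = [], [], []
    c = 0.0
    for n, (p, y) in enumerate(zip(risks, outcomes), start=1):
        c += p - y                               # Equation 5.3
        cusum.append(c)
        if n < window:
            lower.append(None)
            upper.append(None)
        else:
            block = risks[n - window:n]          # the last `window` risks
            start = cusum[n - window - 1] if n > window else 0.0
            half = a * math.sqrt(sum(q * (1 - q) for q in block))
            lower.append(start - half)
            upper.append(start + half)
    return cusum, lower, upper
```

For example, 200 patients each with a predicted risk of 0.1 who all survive drive the CUSUM upward by 0.1 per case, crossing the upper 3σ limit within the first block.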
Figure 5.8 displays an adapted Poloniecki plot of the PAH ICU data from 1995 – 1999 using 3σ control limits.
[Figure 5.8: Adapted Poloniecki plot of the PAH ICU data; x-axis: Admission Number.]
This type of approach is not a formal test, and the meaning of the control limits is not certain. Note, however, that the RA CUSUM statistic meets the control limits in the same areas where the mortality rate observed in the RAp charts of the previous section (Figure 5.1) and the Z score control chart (Figure 5.4) crossed their limits.
A simulation experiment was performed to examine the performance of Poloniecki's CRAM chart in terms of ARL to signal. The ARL was estimated from 5000 simulations at each value in a range of altered odds ratio (OR) 0.5 – 2 in increments of 0.1, with control limits at ±2σ and ±3σ. To simulate the patient casemix for this analysis, samples of 100 cases were randomly drawn with replacement from the 5278 admission records. Simulated outcomes were allocated as a series of Bernoulli trials, based on the new OR and the risk of death estimate. To reproduce the effect of additional cases and the moving window, as each new case and outcome was added, the first case in the sample was removed. The ARL was the average number of cases added to the moving frame after the change in OR until the statistic crossed a control limit.
Figure 5.9 is a plot of ARL against OR. The ARL under unchanged conditions is 421 (4 months of PAH ICU admissions). This means that a false alarm under unchanged conditions would be expected, on average, every 4 months. The chart will detect a doubling (ARL = 24, one week) or halving of OR (ARL = 56, 15 days) quickly. If the control limits are set to ±3σ, the ARLs are considerably longer.
Attention is drawn to the asymmetry of the ARLs, with the maximal ARL for the charts not occurring at OR = 1. This occurs because of the frequency distribution of the $\hat{\pi}_j$ from the PAH series. It is a feature of all ICU populations, including the PAH ICU, that the majority of the patients have a low risk of death, and the distribution of $\hat{\pi}_j$ is skewed. If this simulation is repeated with a contrived series of $\hat{\pi}_j$, with a frequency distribution that is symmetrical around a mean value of 0.5, the ARLs of the chart are maximal at OR = 1. A similar phenomenon is seen with the ARL analysis of the RA EWMA chart of the next section (Figure 5.13). There are inaccuracies in the application of the central limit theorem to approximate the probability distribution of the patient mortality rates, due to the skewed frequency distribution of $\hat{\pi}_j$. These inaccuracies are revealed when the Bernoulli trials are conducted in the simulation experiments to determine the ARLs. The results of the simulations in Figure 5.9 show that the longest ARLs are for values just a little below OR = 1.
[Figure 5.9: ARL against Changed Odds Ratio for the adapted CRAM chart.]
5.4.3 Testing for Change in Odds of RA Mortality with Steiner's RA CUSUM

Steiner et al. have proposed a RA CUSUM monitoring procedure testing the hypothesis of a change in RA odds of death. It has been applied to neonatal arterial switch operations and to adult cardiac surgery. The CUSUM procedure can be used to detect, for example, a doubling or halving of surgical risk. In my own work I have adapted this application of the RA CUSUM to monitoring ICU mortality.
The method provides a test of the hypothesis that the RA odds of death are unchanged, against an alternative hypothesis of a changed OR. An upper RA CUSUM with control limits tests the null hypothesis of unchanged odds of death; the alternative hypothesis is that the OR has increased. A lower RA CUSUM with control limits tests the null hypothesis of unchanged odds of death; the alternative hypothesis is that the OR has decreased. In this analysis, the RA CUSUM procedure will be designed to detect a doubling or halving of the RA odds of death.
A score ($W_j$) is given to each patient. It is derived from the log-likelihood ratio of the current risk of death, compared with the risk of death if the overall level of ICU performance has changed. Under the null hypothesis, the likelihood for patient $j$ is given by $\pi_j^{Y_j}(1-\pi_j)^{1-Y_j}$ and the odds of death are $\dfrac{\pi_j}{1-\pi_j}$, while under the alternative hypothesis the odds of death are $\dfrac{OR_A\,\pi_j}{1-\pi_j}$.

Since there are only two possible outcomes (death or survival), the two possible log-likelihood ratio scores are:

$$W_j = \log\!\left[\frac{OR_A}{1-\pi_j+OR_A\,\pi_j}\right] \text{ if the patient dies, or}$$

$$W_j = \log\!\left[\frac{1}{1-\pi_j+OR_A\,\pi_j}\right] \text{ if the patient survives.}$$
For the upper control chart, an upper CUSUM statistic, $S_j^+$, is plotted against $j$, where $j$ is the patient number and $S_0^+ = 0$:

$$S_j^+ = \max\!\left(S_{j-1}^+ + W_j,\ 0\right)$$

The RA CUSUM formally tests the null hypothesis, $H_0$: $OR_0 = 1$, against the alternative hypothesis, $H_A$: $OR_A > 1$. Successive non-negative scores will lead to accumulation of $S_j^+$ until its value exceeds the control limit $h^+$, giving an alert or alarm.
For testing for a change in the RA odds of death where the mortality is falling, the procedure is similar to the test for an increased OR. For the convenience of plotting both charts on the same figure, the statistic $S_j^-$ is accumulated as a negative value (or zero) and $h^-$ is a negative value. Thus,

$$S_j^- = \min\!\left(S_{j-1}^- - W_j,\ 0\right)$$

The $OR_A$ is less than 1, and the control limit is $h^-$, the value below which $S_j^-$ must fall to give an alert or alarm.
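The scores and the two CUSUM recursions, including a reset to zero after each signal (as used later in this chapter), can be sketched as (an illustration; function names are mine):

```python
import math

def steiner_weight(p, died, odds_ratio):
    """Log-likelihood ratio score W_j for one patient outcome."""
    denom = 1 - p + odds_ratio * p
    return math.log(odds_ratio / denom) if died else math.log(1 / denom)

def ra_cusum(risks, outcomes, or_upper=2.0, or_lower=0.5, h=4.5):
    """Upper and lower Steiner RA CUSUMs; returns a list of (case, side) alerts."""
    s_up, s_dn, alerts = 0.0, 0.0, []
    for j, (p, y) in enumerate(zip(risks, outcomes), start=1):
        s_up = max(s_up + steiner_weight(p, y, or_upper), 0.0)
        s_dn = min(s_dn - steiner_weight(p, y, or_lower), 0.0)
        if s_up > h:
            alerts.append((j, "upper"))
            s_up = 0.0                 # reset and recommence monitoring
        if s_dn < -h:
            alerts.append((j, "lower"))
            s_dn = 0.0
    return alerts
```

As a sanity check, a run of unexpected deaths among low-risk patients accumulates positive upper scores and signals after a handful of cases, while the lower CUSUM stays at zero.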
The choices of $h^+$ and $h^-$ are made by modelling the ARL where the RA outcome of patients has changed, or remains unchanged, given the set of estimates $\hat{\pi}_j$ and the choices of clinically relevant $OR_A$. Figure 5.10 shows the ARL using upper and lower RA CUSUM tests based on the log-likelihood, with the $OR_A$ set at 0.5 and 2, for a range of $h^+$ and $h^-$. In this example, $h^+$ and $h^-$ have the same absolute magnitude, though for other applications these decision thresholds will vary, and may not be equal. The experiment calculated the ARL for each RA CUSUM from 10 000 simulations. Each case was drawn at random and with replacement from the 5278 patients in the series. Each simulated patient outcome was a Bernoulli trial based on $\pi'_j$, the altered probability of death.

The values in Figure 5.10 differ from those published in Cook et al., which reported an ARL estimate of 5400 based on the patient population from the 1995 – 1997 dataset. The inclusion of the additional 1998 – 1999 cases used in this updated analysis has increased the ARL for monitoring for OR = 2, with $h^+$ = 4.5, to 5638.
[Figure 5.10: ARL against control limit h for the upper and lower RA CUSUMs.]
The relationship between ARL and changes in the OR is shown in Figure 5.10. For this experiment, 3 RA CUSUM monitoring schemes were studied: an upper RA CUSUM testing for $OR_A$ = 2, a lower RA CUSUM testing for $OR_A$ = 0.5, and a combined upper and lower RA CUSUM with $OR_A^{upper}$ = 2 and $OR_A^{lower}$ = 0.5. Initial experiments showed that for the upper RA CUSUM, $h^+$ = 4.5 gave an in-control ARL of 5640 admissions. For the lower RA CUSUM, $h^-$ = −4.5 gave an in-control ARL of 6701 admissions.
The effect of changing risks of death by changing the OR was examined by performing 10 000 RA CUSUMs of each type at each OR value in the range of 0.1 – 4 in increments of 0.1. To simulate patient outcomes, cases were randomly drawn from the series and outcomes were allocated as Bernoulli trials where the risk of death was $\pi'_j$. The results of this simulation are shown in Figure 5.11.
[Figure 5.11: ARL (patients) against Odds Ratio for the upper, lower and combined RA CUSUM schemes.]
With unchanged OR = 1, the upper RA CUSUM has an ARL of 5640, and the lower 6701. The ARL is 3043 for the combined scheme. As the OR increases above 1, the estimate of the ARL is dominated by the ARL of the upper RA CUSUM scheme. Similarly, with OR below 1, the ARL is rapidly dominated by the effect of the lower RA CUSUM scheme. If only increases in RA mortality were of interest, the specificity of the signal could be increased by using only the upper RA CUSUM. With a
RA CUSUM only monitoring for an increased OR, the false alarm ARL would exceed 5640 cases, or 5
years.
The RA CUSUM rapidly detects decreases in OR. The ARL for an OR of 0.7 is 646, and for an OR of 0.5 it is 244. The RA CUSUM also rapidly detects increases in OR, with an ARL of 842 at OR = 1.3.
The parameters chosen for the RA CUSUM (upper and lower RA CUSUM with $OR_A^{upper}$ = 2 and $OR_A^{lower}$ = 0.5, control limits $h^+$ = 4.5 and $h^-$ = −4.5) provide rapid detection of clinically important changes in OR, with an acceptably long ARL for in-control or minimally changed OR. If only an increase in OR was of clinical interest, the in-control ARL could be dramatically increased by monitoring with the upper RA CUSUM alone.

These parameters are used to plot upper and lower RA CUSUM charts for the PAH ICU data 1995 – 1999.

[Figure: Combined upper and lower RA CUSUM charts with $h^+$ = 4.5 and $h^-$ = −4.5.]
Two signals of increased mortality are marked with an asterisk (*). The first occurs when $S^+ > h^+$ at case 533. The upper RA CUSUM $S^+$ is reset to zero and the monitoring is recommenced, with a further signal at case 1117. The RA CUSUM is reset to zero, and there are no other signals of increased OR. These signals are good evidence that the OR > 1 in the first 1117 cases, and that the APACHE III model was underestimating the risk of death during this part of the series.

Two further signals are seen on this combined chart. In the latter part of the series, there are signals marked by a hash (#) where $S^- < h^-$. These two signals at cases 4939 and 5181 are good evidence that the APACHE III estimates are overestimating the risk of patient death.
The RA CUSUM chart analysis provides similar findings to the RAp chart, and the method is adapted from the cardiac surgical charting literature.

The EWMA statistic is an exponentially weighted moving average of the mortality rate of individual patient observations. This is an iterative process, where the EWMA statistic is calculated from the weight, λ, the previous EWMA value or the starting value, and the patient outcome ($Y_j$) or the mortality rate of the block of patients ($R_i$). To construct a RA EWMA chart, it is proposed that the same calculation of the EWMA statistic is performed, but the unique series of estimates of patient risk of death is used to calculate the expected EWMA statistic and the control limits.
I propose and describe two methods of calculating the control limits. The first is an approach that uses the central limit theorem to estimate the distribution of $EWMA_j$. The second uses an iterative approach to characterise the expected distribution of $EWMA_j$. For both approaches, it is assumed that the RA model provides an accurate estimate of the true risk of death of the patient, and that the number of observations is sufficient.
For this application, individual patient outcomes ($Y_j$) will be analysed. The method can easily be extended to blocks of patients ($R_i$).
The EWMA statistic of the observed deaths is calculated from the series of observations as described in Chapter 4:

$$EWMA_j = \lambda Y_j + (1-\lambda)\,EWMA_{j-1}$$

or, for blocks of patients,

$$EWMA_i = \lambda R_i + (1-\lambda)\,EWMA_{i-1}$$

The value of the statistic $EWMA_j$ is compared to an estimate of the expected value of the EWMA:

$$E(EWMA_j) = \lambda\hat{\pi}_j + (1-\lambda)\,E(EWMA_{j-1}), \qquad E(EWMA_0) = EWMA_0$$

The control limits are calculated from an estimate of the variance of $EWMA_j$.
This can be simplified if we assume that the starting estimate of the mortality rate, $EWMA_0$, has a variance of zero. Under the assumption that the model provides an accurate estimate of the patient's true risk of death, the estimate of the variance of $Y_j$ will be $\hat{\pi}_j(1-\hat{\pi}_j)$. Substituting for $\mathrm{var}(Y_j)$:

$$\mathrm{var}(EWMA_j) = \lambda^2\sum_{k=1}^{j}(1-\lambda)^{2(j-k)}\,\hat{\pi}_k(1-\hat{\pi}_k)$$

Two special cases of this formula are apparent. The RAp chart is a special case of the RA EWMA where λ = 1, and the standard EWMA, without RA, is the case where all the risks of death are the same.
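The RA EWMA recursion, its expected value, and the variance calculated iteratively alongside it can be sketched together (a minimal illustration; the default starting value is an assumption of mine):

```python
import math

def ra_ewma(risks, outcomes, lam=0.001, start=None, a=2.0):
    """RA EWMA of observed outcomes, with risk-based control limits.

    Assumes the starting value EWMA_0 has zero variance. Returns
    parallel lists: observed EWMA, expected EWMA, lower and upper limits.
    """
    if start is None:
        start = sum(risks) / len(risks)     # overall predicted rate as EWMA_0
    ewma, expected, var = start, start, 0.0
    obs, exp_, lo, hi = [], [], [], []
    for p, y in zip(risks, outcomes):
        ewma = lam * y + (1 - lam) * ewma                 # observed EWMA_j
        expected = lam * p + (1 - lam) * expected         # expected EWMA_j
        var = lam**2 * p * (1 - p) + (1 - lam)**2 * var   # iterative variance
        sd = math.sqrt(var)
        obs.append(ewma)
        exp_.append(expected)
        lo.append(expected - a * sd)
        hi.append(expected + a * sd)
    return obs, exp_, lo, hi
```

Setting λ = 1 reproduces the single-observation RAp limits, which is a convenient check of the recursion.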
For ease of computation, the estimated variance of $EWMA_j$ can be calculated iteratively in parallel with the EWMA statistic itself:

$$\mathrm{var}(EWMA_j) = \lambda^2\,\hat{\pi}_j(1-\hat{\pi}_j) + (1-\lambda)^2\,\mathrm{var}(EWMA_{j-1})$$

To determine a suitable choice of λ for the RA EWMA in this application, a series of simulations was conducted. The effect of the choice of λ on ARL under conditions of changed OR was investigated using values of λ of 0.001, 0.002, 0.005 and 0.01, and OR in the range of 0.5 – 2.0 in increments of 0.1. Each simulated patient was drawn at random from the patient series, and the simulated outcome decided by a Bernoulli experiment with the probability of patient death being $\pi'_j$. The ARL of 5000 RA EWMA simulations was calculated at each combination of λ and OR.
Figure 5.13 shows the effect on ARL of the choice of the weight λ over a range of OR. The ARL for OR close to 1 was much greater for small λ, though these differences are less apparent at an OR > 1.2.
[Figure 5.13: ARL against Odds Ratio for λ = 0.001, 0.002, 0.005 and 0.01.]
There is again an effect due to the distribution of cases in the PAH dataset and the use of the central limit theorem. The ARLs of the large values of λ are greater than those of the small values of λ for small OR. If the ARLs are modelled with a contrived patient sample that has a distribution of $\hat{\pi}_j$ that is symmetrical around a mean value of 0.5, this effect is not seen. The simulations using the Bernoulli trials bring to attention the inaccuracies of the central limit theorem, which was used to calculate the control limits. Exact methods are required to overcome the inaccuracies of these assumptions.
Figure 5.14 (λ = 0.001) shows a RA EWMA chart of the 5278 cases in the series with control limits set at 2σ. There is a sustained period between admission 329 and admission 2183 where the EWMA exceeds the upper control limit.
The same parameter settings are used for the RA EWMA chart in Figure 5.15, where the RA EWMA chart is restarted when the EWMA estimate of the mortality rate crosses one of the control limits. This chart shows signals of increased mortality above the upper 2σ control limits at cases 329, 532, 610, 1033, 1112 and 1188, denoted by the asterisks on the chart. It is good evidence, again, that the mortality rate observed during the first 1200 admissions in the series was above that predicted by the APACHE III model. There were 3 instances where the EWMA crossed the lower control limits at cases 4180,
5044 and 5182. This provides good evidence that the observed mortality rate was below that predicted by the APACHE III model.

[Figures 5.14 and 5.15: RA EWMA charts; x-axis: Admission Number.]
The central limit theorem provides only an approximate description of the distribution of $EWMA_j$. Therefore, in this section I will explore the use of an iterative, exact method to describe the distribution of $EWMA_j$.
The probability density function of $R_i$ used for the RAp chart was a multinomial with only $n_i + 1$ possible values (see Appendix 6), and an iterative calculation is computationally rapid. In contrast, when an iterative method is applied to describe the distribution of $EWMA_j$, there is exponential growth of the number of possible values of $EWMA_j$, each with its own probability, which can become computationally demanding.
An estimate of the probabilities of all possible values of the $EWMA_j$ statistic can be made under the familiar assumption that the RA tool will accurately predict the true patient's risk of death, and that the patient outcomes are independent.

When $j = 0$, the only possible value is $EWMA_0$, which has a probability of 1, i.e. $\Pr(EWMA_0) = 1$. For each subsequent patient $j$, each possible value $v$ of $EWMA_{j-1}$ gives rise to two values: $\lambda + (1-\lambda)v$ with probability $\hat{\pi}_j$ (death), and $(1-\lambda)v$ with probability $1-\hat{\pi}_j$ (survival).
This is repeated to produce the distribution of values and probabilities after each patient is added to the series. There is a maximum of $2^j$ possible values that the $EWMA_j$ can take. This rapidly becomes an intractable computation, and three solutions are available.
The first is to resume the asymptotic approximation applying the central limit theorem, as in the previous section, when the size of the computing problem becomes too great. However, when considering ARLs for the RA EWMA in the range of hundreds to thousands in-control, this means that the approximation will be used in almost all instances. It would make no sense to use the exact method for those few calculations of control limits before reverting to the approximation.
The second is to adopt a Monte Carlo simulation, using a large number of Bernoulli trials. This is a
practical solution.
The third solution is to limit the accuracy of the exact method, and limit the number of discrete values that the $EWMA_j$ can take. In this application, I have chosen this method, and have limited the number of values of $EWMA_j$ to 10^ values between the largest plausible $EWMA_j$ and the smallest. After each iteration, the values are rounded to the closest of these values. This approach is rapid and is accurate to the 8th decimal place, which is far beyond the accuracy of the models that predict the risk of in-hospital death.
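The third solution can be sketched as follows (an illustration of the rounding-grid idea only; the grid size, λ, starting value and risks are hypothetical, not the values used in the thesis):

```python
def ewma_distribution(risks, lam=0.001, start=0.12, grid=65536):
    """Discretised exact distribution of EWMA_j (the third solution above).

    Each patient doubles the number of possible EWMA values, so after every
    step the values are rounded onto a fixed grid between 0 and 1 to keep
    the computation tractable.
    """
    step = 1.0 / (grid - 1)

    def snap(v):                       # value -> nearest grid index
        return round(v / step)

    dist = {snap(start): 1.0}          # P(EWMA_0 = start) = 1
    for p in risks:
        nxt = {}
        for idx, prob in dist.items():
            v = idx * step
            die = snap(lam + (1 - lam) * v)    # outcome Y_j = 1
            live = snap((1 - lam) * v)         # outcome Y_j = 0
            nxt[die] = nxt.get(die, 0.0) + prob * p
            nxt[live] = nxt.get(live, 0.0) + prob * (1 - p)
        dist = nxt
    return {i * step: pr for i, pr in dist.items()}

# Hypothetical series: 50 patients, each with a predicted risk of 0.10.
dist = ewma_distribution([0.1] * 50, lam=0.01)
total = sum(dist.values())                       # probability mass, ~1
mean = sum(v * p for v, p in dist.items())       # ~ expected EWMA
```

Probability mass is conserved exactly by the rounding; only the locations of the values are perturbed, by at most half a grid step per iteration.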
This method is used to estimate the cumulative probability distribution of $EWMA_j$. For each patient admission recorded in the series, the observed $EWMA_j$ corresponds to a quantile of the estimated cumulative probability distribution of $EWMA_j$.

Figure 5.16 shows the series of 5278 cases with the observed $EWMA_j$, the expected $EWMA_j$, the upper control limit taken as the value of $EWMA_j$ which defines the 97.5th percentile, and the lower control limit that defines the 2.5th percentile. I took λ = 0.001, and accumulated individual cases. This is an approximation of the ±2σ control limits of Figure 5.14, and the same pattern is apparent.
[Figure 5.16: RA EWMA chart with exact percentile control limits; y-axis: EWMA (0.10 – 0.16); x-axis: Case Number.]
Figure 5.17 provides an additional, visually simple representation of this series as a plot of the smallest percentile of the estimated cumulative probability distribution of $EWMA_j$ that is defined by the observed $EWMA_j$. A value of 0.025 is approximately equal to an observation falling on the lower 2σ control limits of the previous charts. Similarly, 0.975 is approximately equal to an observation falling on the upper 2σ control limits. A perfect agreement between the observed $EWMA_j$ and the expected value corresponds to the 50th percentile.

The same pattern that was seen in the previous analysis is seen in Figure 5.17. The observed $EWMA_j$ falls on the upper end of the distribution of possible values of $EWMA_j$. Between cases 553 and 2051, all the observed $EWMA_j$ are beyond the plausible values for $EWMA_j$. This is strong evidence that the APACHE III model was under-predicting the risk of death during this part of the series.
[Figure 5.17: Cumulative probability percentile of the observed EWMA (50% reference line); x-axis: Case Number.]
This work is the first use of the APACHE III score as a validated RA tool to monitor ICU outcome. Each of the methods applied and developed in this section has a method for design and analysis described in the Appendices.
This work is original in its application. It is an adaptation of previously described methods, or presents new work. The RAp chart was developed from the method of Alemi et al., and incorporates a method for analysing the power of each sample, adapted from Flora. The Z score p chart relies on the work of Flora and of Sherlaw-Johnson and colleagues. The use of the iterative approximation of the distribution of $R_i$ is new.
The RA CUSUM is based on the work of Lovegrove et al., and the moving frame approach is adapted from the work on cardiac surgery charting by Poloniecki et al. The use of the RA CUSUM of Steiner and co-workers is an original application in ICU, though the method, apart from the RA model, is unchanged.
Both of the RA EWMA charts, the parametric approximation and the discrete approximation of the distribution of $EWMA_j$, are new developments.
This chapter is about current RA charting techniques that can be applied to such contexts as the ICU. The charting methods have an important place in the range of quality measures that are available, and can be implemented with existing risk adjustment tools.
In the next two chapters ICU data will be modelled again using machine learning approaches (neural
networks and support vector machines) on datasets of a small and practical size for application in a
single ICU.
Chapter 6:
In the previous chapters, I have reviewed the techniques for assessment of models that estimate the
probability of intensive care unit (ICU) patient death. With an existing, validated model, I have
developed risk adjusted control charts, applying such models for monitoring mortality in ICU patients.
This chapter will introduce machine learning approaches for development of new models to estimate the probability of patient death. Three aims are addressed.
The first is to describe the machine learning methods of artificial neural networks (ANNs) and support vector machines (SVMs).
The second is to review applications of ANNs and SVMs in the intensive care unit (ICU) with
The third is to present preliminary experiments with ANNs and SVMs on the Princess Alexandra
Hospital (PAH) ICU dataset, as a prelude to development of a practical RA tool using raw patient data.
One experimental task is the classification of patients according to 30-day in-hospital death or survival.
The other is a regression problem to estimate the probability of 30-day in-hospital death. The variables
used are pre-processed patient data: APACHE III acute physiology score, APACHE III chronic health
score, patient age and modified APACHE III diagnostic coding. The results are compared to previous results obtained with logistic regression.
ANNs and SVMs are machine learning approaches to modelling the probability of ICU patient death. It is practical to model ICU patient outcomes in this way because of the availability of powerful desktop computers. The recent review '"^^ of "Artificial Intelligence in the ICU" inadequately covered the topic in only six paragraphs. Four of these were introductory definitions and descriptions of the ANN. There
Machine learning to model ICU outcome is a supervised learning task. Supervised learning in this application uses training datasets where the patient and diagnostic variables and the resultant patient outcomes are known. Generally, a machine learning algorithm adjusts or optimises aspects of a model to minimise an error function that compares the outcomes in the training dataset to model predictions.
ANNs provide a diverse range of models for classification and regression problems. ANNs have architectures of networks of interconnected simple processors. The training process involves learning patterns in the training data, often by modifying the weights that link the units. The following discussion will be limited to the supervised learning case. A more detailed introduction to this topic
The most commonly used example of an ANN is the multilayer perceptron (MLP), which is shown in Figure 6.1.
[Figure 6.1: a multilayer perceptron, showing feature values $x_1 \dots x_n$ of the input vector presented to input units, weights connecting them to hidden units in the hidden layer, and output units.]
This example of a 2 layer MLP has an input layer of input units (left) equal in number to the dimension of the input vector. It has hidden layer units (centre) and output layer units (right). The data input is an n-dimensional vector, X. In the example, feature values $x_1$ to $x_n$ are presented at the input nodes. The arrows represent connecting weights. Each input node connects to each node in the layer below. The vector of weights at each node is W. In the hidden layer in Figure 6.1, W will have n elements. The input to each hidden unit is the sum of weighted inputs. This is the sum of elements of the dot product of the vectors W and X, i.e. $w_1x_1 + w_2x_2 + \dots + w_nx_n$, and is abbreviated as $\sum WX$. The input is
Figure 6.2 shows three input values and weight vectors processed at the hidden unit. The output signals of the processing units are then connected by weights to the next layer. The next layer in this example is the output layer, though further hidden layers are possible. At the output unit, the sum of the weighted signals is processed.
Supervised learning proceeds by iterative adjustments to the inter-connecting weights of the ANN, to minimise the sum of the squared errors between observed and predicted output values. Initially, weights are set to random values. Subsequent adjustment of the interconnecting weights in the network is by gradient descent down the error surface, seeking a minimum error value. Back propagation is one such algorithm, whereby the weights in the network are adjusted according to the derivative of the error with respect to the weights.
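The gradient descent with back propagation just described can be sketched numerically. The following is a minimal illustration and not the software used in this thesis; the toy data (a logical OR problem), the network size and the learning rate are assumptions for demonstration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data (logical OR), purely illustrative -- not ICU data.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0.0], [1.0], [1.0], [1.0]])

# Random starting weights for one hidden layer of 4 units.
W1, b1 = rng.normal(0, 1, (2, 4)), np.zeros(4)
W2, b2 = rng.normal(0, 1, (4, 1)), np.zeros(1)

lr = 0.5
for epoch in range(10000):
    # Forward pass: weighted sums, then sigmoid activations.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass: derivative of the summed squared error,
    # propagated through the sigmoid derivatives (chain rule).
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # Gradient descent step down the error surface.
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(axis=0)

mse = float(np.mean((out - y) ** 2))
```

Training many such networks from different random starting weights, as recommended below, guards against termination in a poor local minimum.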
MLP ANNs have a number of potential issues '*^ which are common to many numerical optimisation procedures. For example, the gradient descent algorithm can become trapped in local error minima, rather than proceeding to the optimal solution. The MLP can produce different training results depending on the starting weights, by terminating at different local error minima. As well, MLPs are prone to over-fitting. It is usual practice to monitor performance on an external verification data set during the training. The training process is terminated when the generalisation performance begins to deteriorate. In addition, a test dataset is used to assess the model performance (see below). Whilst there are strategies to limit the problems caused by these issues, it is important to train many MLPs and
In the ANN experiments in this chapter, the available data were divided into three sets for each trial. A training set was used to train the ANN. A verification set was used to monitor the generalisation performance so that training could be stopped. A test set was used to assess the model performance on unseen data.
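A three-way split of this kind can be sketched as follows; the split fractions are illustrative assumptions, not the proportions used in the experiments.

```python
import numpy as np

def three_way_split(n_cases, frac_train=0.5, frac_verify=0.25, seed=42):
    """Randomly partition case indices into training, verification and
    test sets. The fractions here are illustrative only."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_cases)           # random order of case indices
    n_train = int(frac_train * n_cases)
    n_verify = int(frac_verify * n_cases)
    return (idx[:n_train],
            idx[n_train:n_train + n_verify],
            idx[n_train + n_verify:])

# 5278 admissions in the PAH dataset described later in this chapter.
train, verify, test = three_way_split(5278)
```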
The radial basis function (RBF: Figure 6.3) network is another example of the multilayer ANNs used in this study.
[Figure 6.3: an RBF network, with inputs $x_1 \dots x_n$ connected to a single radial hidden layer, whose outputs form a weighted sum at the output unit.]
As with the MLP, the input vector X is presented at the n input nodes and each input is connected to the hidden layer. However, there is only a single hidden layer of processing units. The inputs to the hidden layer are transformed using a function with radial symmetry, such as a Gaussian function. The output from each unit in the radial layer depends on the distance of the feature values from the centre of the function for that unit '**. The sum of the weighted signals of the radial layer is processed with a linear activation function at the output units. The weights between the radial layer and the output layer are adjusted by the learning algorithm. The RBF ANN is rapidly trained, and is useful for modelling
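The radial-layer computation can be sketched as a forward pass; the centres, widths and output weights below are arbitrary illustrative values, not a trained model.

```python
import numpy as np

def rbf_forward(X, centres, widths, w_out, b_out):
    """Forward pass of a radial basis function network: each hidden unit
    responds to the distance of the input from its centre via a Gaussian,
    and the output unit forms a weighted linear sum."""
    # Squared Euclidean distance of every input to every centre.
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
    phi = np.exp(-d2 / (2.0 * widths ** 2))   # Gaussian radial activations
    return phi @ w_out + b_out                # linear output unit

# Illustrative two-unit network on 2-D inputs.
X = np.array([[0.0, 0.0], [1.0, 1.0]])
centres = np.array([[0.0, 0.0], [1.0, 1.0]])
widths = np.ones(2)
out = rbf_forward(X, centres, widths, np.array([1.0, 1.0]), 0.0)
```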
The generalised regression neural network (GRNN) shown in Figure 6.4 is similar to the RBF ANN. It is used to recalibrate the mortality predictions in the regression experiments in this chapter. It is a statistical method of function approximation and in this application provides a Bayesian, kernel-based estimate of the risk of death. The GRNN was developed by Specht '^^ based on Parzen '^''^^. Figure 6.4 (based on Figure 1 in Specht '^*') shows the architecture of the GRNN.
Figure 6.4 shows there are input units for each feature dimension of the input vector. Again, each input unit connects to all of the radial units in the first hidden layer. These pattern units are each dedicated to a training example, or represent clusters of similar training data. The pattern units have a Gaussian function which transforms the distance of each feature value from the pattern represented by each unit, similar to the RBF processing units. A smoothing parameter determines the shape of the radial function, and the overlap between functions. The second hidden layer has summation units which process the weighted signals from all the pattern units. The output node provides estimates of the mean outcome.
The GRNN uses one training epoch to set the weights, and can be trained very quickly. Although there are architectural similarities to the RBF ANN, the GRNN has no iterative algorithm of weight adjustment.
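A minimal sketch of the GRNN estimate, in the kernel-regression (Nadaraya-Watson) form that Specht's formulation reduces to when every training case acts as a pattern unit. The data and the smoothing parameter are illustrative assumptions.

```python
import numpy as np

def grnn_predict(X_train, y_train, X_query, sigma):
    """GRNN prediction: Gaussian pattern-unit activations for each
    training case, combined by summation units into a weighted mean of
    the training outcomes. sigma is the smoothing parameter."""
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
    k = np.exp(-d2 / (2.0 * sigma ** 2))      # pattern-unit activations
    # Summation units: weighted sum of outcomes over sum of activations.
    return (k @ y_train) / k.sum(axis=1)

# Illustrative one-dimensional data with binary outcomes (e.g. death = 1).
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0.0, 0.0, 1.0, 1.0])
p_mid = grnn_predict(X_train, y_train, np.array([[1.5]]), sigma=0.5)
p_low = grnn_predict(X_train, y_train, np.array([[0.5]]), sigma=0.5)
```

A single pass over the training data fixes the estimate, which is why the GRNN "trains" in one epoch.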
ANNs are widely applied in the medical area. Non-ICU applications include diagnosis '^^'^', prognosis '^"'*^, physiological and laboratory data interpretation '55.i58.i67-i69 and pharmacology '^*''.
There are a number of reports describing ANNs used to model ICU patient data and outcomes. The following review of applications summarises the use of ANNs in the ICU from the perspective of mortality prediction.
Two studies have examined cardiac surgical sub-sets of ICU patients. Lippmann and Shahian ''' used an MLP ANN, logistic regression and Bayesian analysis to model the survival outcomes of 80 606 cases.
Fifty-nine demographic, physiological, laboratory, diagnosis and cardiac assessment features were
collected, of which 36 were used in the model. The calibration of the logistic regression model was the
best, but the model had an area under the ROC curve of only 0.76. Orr "^ reported an ANN to estimate
the risk of death in cardiac surgical patients using only 7 variables selected from a patient database.
This model had good calibration, but lacked discrimination, with an area under the ROC curve of only 0.74.
Buchmann et al. ' " compared a logistic regression model, MLP, GRNN and a probabilistic neural network to classify ICU patients on the basis of chronicity (length of stay in ICU > 7 days), rather than to predict mortality. They found that the discrimination and calibration of the ANNs were superior to logistic regression. Another study of prediction of resource use, as measured by hospital length of stay, was published by Mobley et al. "*. An ANN was trained to estimate the hospital length of stay of 557 patients, using observations, laboratory results, diagnostic tests and index events. The ANN was able to predict length of stay within 24 hrs in 72% of patients. However, this was only marginally better than using the mean length of stay.
Doig et al. "^ compared the predictions of an ANN to a logistic regression model for the classification of a small series of 422 patients. Each patient had already survived to 72 hours in ICU. Variables were selected from the APACHE II system and re-modelled. Remarkable discrimination was achieved on the training set (area under the ROC curve = 0.99). The discrimination was less good on the validation set (area under the ROC curve 0.82), suggesting overtraining of the ANN, with overfitting to the training data.
Dybowski et al. '^^ compared an ANN trained with a genetic algorithm to a logistic regression model for predicting the outcomes of a small subset of ICU patients (258 patients) with systemic inflammatory response syndrome. A classification tree and logistic regression were used to select variables from physiological and demographic variables, and index events that occurred during the hospital stay. The ANN had better discrimination than logistic regression (area under the ROC curve 0.86 v 0.75). No
Frize et al. ' " trained MLPs to classify non-operative (608) and operative (883) ICU patients according to the predicted duration of mechanical ventilation. These models were only assessed on the developmental dataset predictions. By pruning the number of input features from 51 to 6, classification performance improved and the network complexity was reduced. This study demonstrated a practical approach to limiting network complexity, and provides useful documentation of a successful approach to processing and transformation of patient data. However, the ANNs in this study were designed to estimate resource use rather than to predict mortality outcome. A weakness of the study was that the models were not assessed on a separate test dataset, so no conclusions can be drawn about the reproducibility of the models.
Two studies compared ANNs to the APACHE II system. Wong and Young '" compared MLPs to the
APACHE II system on 8796 patient admissions collected for an APACHE II database. Both the MLP
ANN and the APACHE II system had similar discrimination (area under the ROC curve 0.82 - 0.84)
and calibration. Nimgaonkar et al. "^ also compared the performance of ANNs to the APACHE II
system to predict mortality in an Indian ICU. A series of 2962 cases were modelled using the input
variables for the APACHE II system. They analysed the contribution of each of the features to the
models. Discrimination by the ANN was superior to the APACHE II (area under the ROC curve 0.88 v
0.77). The ANN displayed better calibration than the APACHE II model.
One of the most relevant studies is by Clermont et al. '^, who used logistic regression and ANN to
model hospital mortality outcome on 1647 ICU patients. The demographic and physiology variables
were collected under the rules of the APACHE III system. It is important to examine this study in some
detail as the modelling context is very similar to the task of modelling patient outcomes at the PAH
ICU. In Clermont's study, the component variables of the APACHE III model and the APACHE III score were used to model the probability of patient death. The areas under the ROC curves were in the range of 0.80 (logistic regression) to 0.836 (ANN with coded APACHE III observations). All the models
had reasonable calibration when 800 or more cases were used to develop the model.
The ANN and the logistic regression models were able to successfully predict ICU patient death. A further important conclusion was that both the logistic regression and ANN model performance deteriorated when the model development set size was reduced below 800 cases. In a real application, at the PAH ICU, this requires 9 months of patient data collection for model building. For smaller ICUs,
Equally important was the authors' practical choice of 447 cases for the test dataset for model assessment. The size of this assessment set is a trade-off between the practicality of collecting patient data and the important statistical issues of the power and precision of the model assessment. For the PAH ICU and other busy ICUs, 447 patients requires about 3 - 4 months of data collection. For smaller units, a year may be required just to collect enough patient data for model assessment. Balanced against this is the issue of smaller assessment datasets giving low statistical power to the Hosmer-Lemeshow C (H-L C) test to detect imperfect calibration, and the loss of statistical precision in estimating the area under the ROC curve. Clermont et al. made their choice based on a timed period
of data collection. The size of the dataset they used strikes a reasonable compromise between the time required for data collection and the statistical quality of the assessment.
There are limitations to Clermont's study. As Paetz '^^ commented in a letter to the Editor in a subsequent edition of Critical Care Medicine, the study did not involve re-sampling to demonstrate the robustness or reproducibility of the approach. It is possible that the non-random split of cases to the training and validation sets was a major determinant of the model's reported performance. Replicates of the modelling on alternative random data selections, and re-sampling to provide alternative assessment sets, are necessary to demonstrate consistency of the modelling approach. I would add to these criticisms that the authors used consecutive patients to build the models. The last 447 consecutive cases in the dataset were used to assess all the models. Even when smaller training sets were explored, non-random, consecutive sampling was used. Introduction of bias into the model, the effects of influential outliers, or fortuitous sampling cannot be excluded with their methodology.
There are further limitations to their techniques. The patient's diagnosis or diagnostic coding was not included in the variables for the model. Also, the authors relied on the APACHE III algorithm for the weights that they used to pre-process the variables for both the MLP ANN and the logistic regression model. These APACHE III weights are added together to give the APACHE III score, which the authors also selected as a variable in some models. The lack of a diagnosis variable and reliance on the APACHE III system may have limited the quality of the models that could be built.
The authors did not record how many ANNs were trained to yield the optimal performance ANN, so it is not clear whether a limited or an exhaustive survey of possible ANNs was conducted.
In summary, many applications of ANNs to prediction of ICU mortality have been published. Overall, the performance of ANNs on ICU mortality prediction tasks appears as good as or better than logistic regression, on the datasets on which the models have been developed. However, no meaningful conclusions can be drawn about the generalisation of these ANN models outside the context where each was developed.
The SVM is based on work by Vapnik '^^ using a linear, machine learning algorithm to model a projection of the data into multi-dimensional feature space. Whilst the example data may not be linearly separable in input space (its raw form), the attributes mapped by a kernel function into feature space may be separable. A useful definition of SVMs is "…learning systems that use a hypothesis space of linear functions in high dimensional feature space, trained with a learning algorithm from optimisation theory, to implement a learning bias derived from statistical learning theory…" *.
There are several introductory references to the theory of the SVM ^'^'"'*^. The following description draws heavily on these references, and is not meant as a mathematically detailed account of the theory. Enough detail is provided to explain the basis for the experiments conducted in this and the next chapter.
For classification problems, SVMs are trained to define the position of an optimal separating hyperplane between classes. For regression problems, linear learning algorithms model a non-linear function (a regression "tube") in feature space. The term "support vector" describes how only the most critical training examples are needed to define the solution.
Linear machine learning routines can be applied on a fixed, non-linear mapping of the data vectors in feature space because the numerical operations required to perform the minimisation and learning procedure can be evaluated efficiently. This is a tractable problem, as the complexity of the computation depends on the kernel evaluations rather than on the dimensionality of the feature space.
A highly curved multi-dimensional model hypothesis might also increase the risks of over-fitting the data. Overfitting is a particular issue where there is noise in the data used for regression modelling, and where there is overlap of classes in classification problems. Several factors can be controlled during the learning process, to limit the complexity of the model and promote generalisation of the SVM model. The learning algorithm minimises an error function within constraints of the model. Lagrange multipliers are included in the function to be optimised, presenting a quadratic optimisation problem.
SVM Classification.
The SVM algorithm defines a hyperplane to minimise the number of classification errors on the training dataset, and maximise the margin between the two classes (see Figure 6.5).
Figure 6.5 shows a simple two class classification problem of stars and circles (based on Figure 1.1 in Scholkopf et al. (1999) '*^). The data input pattern is the vector x of elements. The class labels are $y = +1$ for circles and $y = -1$ for stars, and examples in this dataset are linearly separable in input space. The dashed lines define a margin, enclosing all possible separating hyperplanes, where $(\mathbf{w}\cdot\mathbf{x}) + b = 0$. $\mathbf{w}$ is a weight vector normal to the hyperplane, and $b$ is a constant. There are no training examples inside the margin.
Figure 6.5 Simple classification of stars and circles, showing the margin either side of the optimal hyperplane.
[Figure 6.5: the two classes separated by the optimal hyperplane $(\mathbf{w}\cdot\mathbf{x}) + b = 0$ (bold), with margin boundaries $(\mathbf{w}\cdot\mathbf{x}) + b = +1$ and $(\mathbf{w}\cdot\mathbf{x}) + b = -1$ (dashed), lying at distance $-b/\|\mathbf{w}\|$ from the origin.]
A decision function $f(\mathbf{x}) = \operatorname{sign}[(\mathbf{w}\cdot\mathbf{x}) + b]$ allows the maximal separating hyperplane (in bold) to be defined, by minimising $\frac{1}{2}\|\mathbf{w}\|^2$ subject to
$$y_i[(\mathbf{w}\cdot\mathbf{x}_i) + b] \ge 1, \qquad i = 1, \dots, l \tag{6.1}$$
Lagrange multipliers ($\alpha_i$, $i = 1 \dots l$) are introduced to give an expression, $L$, that is readily differentiated. The optimal solution is found by minimising $L$ with respect to $\mathbf{w}$ and $b$, and by maximising $L$ with respect to the dual variables $\alpha_i$, to find the "saddle point". The conditions of Equation 6.1 allow $\mathbf{w}$ and $b$ to be eliminated from the problem.
The solution can be expanded in terms of only the training vectors which have non-zero Lagrange multipliers, $\alpha_i \neq 0$. These are the support vectors, which will lie on the dashed lines in Figure 6.5. The other training examples are not required to define the optimal separating hyperplane.
Where the data are not linearly separable in input space, the problem can be solved by mapping the data into a multidimensional feature space, where the decision function may be a linear function. For the learning task, (Equation 6.2) can be rewritten using the dot product of a mapping function, $\Phi(\cdot)$, of $\mathbf{x}_i$ and $\mathbf{x}_j$ (Equation 6.3). A kernel function, $K(\cdot,\cdot)$, that can implement the dot product in Equation 6.3, is used:
$$L = \sum_{i=1}^{l}\alpha_i - \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}\alpha_i\alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)$$
For the non-separable case, a "soft margin hyperplane" is sought, by introducing positive slack variables, $\xi_i$, to relax the constraints on the optimal hyperplane in Equation 6.1:
$$y_i[(\mathbf{w}\cdot\mathbf{x}_i) + b] \ge 1 - \xi_i$$
A regularisation constant, $C$, is used to assign a cost to training errors that are introduced by the slack variables. This parameter $C$ is the upper bound on the Lagrange multipliers, and will limit the influence of any single training vector. It will limit the complexity of the model, and will influence the balance between training error and model complexity.
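The soft margin objective, $\tfrac{1}{2}\|\mathbf{w}\|^2$ plus $C$ times the training errors, can also be minimised directly in the primal by sub-gradient descent on the equivalent hinge loss. This is a didactic sketch on simulated data, not the quadratic programming solution used by SVM packages such as SVMlight; the data, learning rate and iteration count are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two linearly separable 2-D classes with labels -1 / +1 (simulated data).
X = np.vstack([rng.normal(-2, 0.5, (50, 2)), rng.normal(2, 0.5, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])

# Minimise 0.5*||w||^2 + (C/n) * sum_i max(0, 1 - y_i*(w.x_i + b)).
C, lr, n = 1.0, 0.05, len(y)
w, b = np.zeros(2), 0.0
for step in range(3000):
    margins = y * (X @ w + b)
    viol = margins < 1                     # points violating the margin
    grad_w = w - (C / n) * (y[viol][:, None] * X[viol]).sum(axis=0)
    grad_b = -(C / n) * y[viol].sum()
    w, b = w - lr * grad_w, b - lr * grad_b

accuracy = float(np.mean(np.sign(X @ w + b) == y))
# Training points on or inside the margin play the role of support vectors.
n_margin = int(np.sum(y * (X @ w + b) <= 1 + 1e-6))
```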
SVM Regression
For regression tasks, the SVM algorithm is modified. $y_i$ is a real number representing the outcome, and $\mathbf{x}_i$ is an input vector of $n$ elements, for patient $i$. For classification, it was useful to use the concept of the margin in which the optimal separating hyperplane lies. For regression, it is useful to consider a tube in feature space, in which the optimal regression function lies. The linear regression function, $f(\mathbf{x}) = (\mathbf{w}\cdot\mathbf{x}) + b$, is estimated by a tube of radius $\varepsilon$ around the regression function, using the $\varepsilon$-insensitive loss:
$$\frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{l}\left|y_i - f(\mathbf{x}_i)\right|_{\varepsilon}$$
Figure 6.6 shows the concept of $\varepsilon$, the precision with which the regression function is modelled.
There are two types of slack variables, to account for positive and negative deviations from the tube ($\xi_i, \xi_i^* \ge 0$):
$$[(\mathbf{w}\cdot\Phi(\mathbf{x}_i)) + b] - y_i \le \varepsilon + \xi_i$$
$$y_i - [(\mathbf{w}\cdot\Phi(\mathbf{x}_i)) + b] \le \varepsilon + \xi_i^*$$
To estimate the regression function with precision $\varepsilon$, the optimisation process is then to minimise
$$\frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{l}(\xi_i + \xi_i^*)$$
Generalisation to non-linear regression requires the use of a kernel function, and again optimising the loss function is done using Lagrange multipliers, $\alpha_i$ and $\alpha_i^*$. The variables $\mathbf{w}$ and $b$ are eliminated, subject to
$$\sum_{i=1}^{l}(\alpha_i - \alpha_i^*) = 0, \qquad 0 \le \alpha_i, \alpha_i^* \le C$$
The support vectors are defined by those training data points where one of the Lagrange multipliers is not zero. Analogous to the classification example, the support vectors that define the regression tube lie on or outside its boundary.
It is not necessary that particular values are chosen, only whether $b = 0$ (unbiased hyperplane or regression tube) or $b \neq 0$ (biased hyperplane or regression tube). For the experiments described in this chapter, a biased hyperplane was used.
The choice of the SVM kernel and the SVM parameters will determine the quality of the predictions of the trained SVM model. The SVM kernel function is chosen a priori. The choice of SVM parameters must be experimentally confirmed by cross validation '*^. From my experience in this application, this experimentation, in the absence of knowledge of the likely values, can require hundreds of hours of computing time to explore and select the best SVM. However, once suitable parameters are chosen,
A proposal is offered by Mattera and Haykin '*^ for the choice of C for the RBF SVM. A value for C that is approximately equal to B, where the range of predictions lies between 0 and B, will balance the approximation error and the complexity of the model. For the RBF SVM, they advise that the value of $\varepsilon$ which leads to about 50% of the training data points as support vectors is a robust balance between the approximation error and the complexity and computing time. This is given by the authors without experimental support, and these choices require experimental confirmation in any application.
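Those heuristics can be illustrated numerically: with targets ranging over $[0, B]$, take $C \approx B$; and an $\varepsilon$ that leaves about half the residuals outside the tube is simply the median absolute residual. The data below are simulated for illustration only.

```python
import numpy as np

rng = np.random.default_rng(7)

# Targets in [0, B] with B = 1 (simulated, illustrative only).
y = rng.uniform(0.0, 1.0, 500)
C = float(y.max() - y.min())               # C approximately equal to B

# Stand-in for the residuals of a fitted regression model.
residuals = rng.normal(0.0, 0.1, 500)
eps = float(np.median(np.abs(residuals)))  # tube radius epsilon
# Points outside the epsilon-tube would become support vectors.
frac_sv = float(np.mean(np.abs(residuals) > eps))
```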
In medicine, SVMs may have application in classification problems such as prediction of mortality, use
as a diagnostic support tool for interpreting patterns from physiology observations, investigation results
and laboratory tests, and regression problems such as the estimation of resource use, length of stay and
risk of death.
There is only a single application of SVMs in ICU, appearing in two publications. Morik et al. '*^''*^ present the use of SVMs in assisting the development of rules for changes in haemodynamic therapy in the ICU, by emulating physicians' choices. The studies by Morik and co-workers use automated physiological data collection and time series analysis to pre-process the raw patient data. Therefore, the methods are of limited relevance to the task of modelling 30 day in-hospital mortality at the PAH ICU. There are no studies that investigate the use of SVMs in predicting mortality, modelling outcome, or classification and regression problems in the prediction of 30 day in-hospital mortality in PAH ICU patients.
The first objective was to demonstrate the use of SVMs and ANNs to accurately classify patients into 30 day in-hospital survivors and non-survivors. The data used for modelling were collected in the first 24 hours in ICU. Both SVM and ANN are compared to other, more familiar modelling techniques such as logistic regression '** and classification trees '^', as well as the APACHE III model ^''"'^.
The second objective was to demonstrate the use of SVMs and ANNs to estimate the probability of patient death using data collected from the first 24 hours in ICU.
This series of experiments is a prelude to building a local machine learning model for monitoring unit mortality.
The model performance was assessed on unseen data (test set), which was either that portion of the patient sample not used for model development (training set) or a randomly selected sample of these patients.
To demonstrate that the techniques are reproducible on the dataset, multiple trials of modelling with random selections of development set data were undertaken with the ANN and SVM experiments. For model assessment, the two criteria were discrimination, measured by area under the ROC curve, and calibration, assessed by the Hosmer-Lemeshow (H-L) C statistic. I have extensively reviewed these model attributes in Chapter 2. Support for their use to assess models developed by machine learning can be found in the review by Ripley "". In the following experiments, these attributes are proposed as summary indices of generalisation performance, for comparison of models and to define minimum performance standards.
For classification tasks, classification accuracy is assessed as the number of cases correctly classified divided by the total number of cases in the sample (correct classification rate: CCR). Discrimination is assessed by area under the ROC curve. For classification models, all of the available test data were used.
For the regression task, the criteria for quality of the probability estimate are the model discrimination and calibration. In order for the regression model to be a valid estimate of the risk of in-hospital death, it must effectively rank patients in order of risk of death, as well as demonstrate reliable probability estimates, i.e. calibration. Based on my review of ICU model discrimination, it is reasonable to expect models developed on the first 24 hours of ICU data to exhibit an area under the ROC curve > 0.8. Calibration is assessed by the H-L C statistic. A large statistic value would indicate a significant departure from perfect calibration, though the power of this test is dependent on the sample size. Using a test sample size of 400 cases is a balance between the power of the statistical test and the practicalities of delays in data collection. 400 patients as a model validation set is an arbitrary choice, though it is in accord with the sample size of 447 patients used by Clermont et al. '^. The H-L C statistic is distributed as a chi-squared with 8 degrees of freedom, and the critical value (p < 0.05) = 15.5. Based on this criterion, the aim for new ICU mortality models that estimate risk of death will be to attain an H-L C statistic below this critical value.
The area under the ROC curve and the H-L C statistic were calculated for each of 100 re-sampling trials for each model constructed, with data not used in the model development set. The discrimination for each regression model was the average of the area under the ROC curve over the re-sampling trials. The calibration was assessed with the H-L C statistic, as the average of the statistic calculated from the re-sampling trials.
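The two assessment criteria can be computed as sketched below, on simulated risks and outcomes (not ICU data). The AUC uses the Mann-Whitney rank-sum identity, and the H-L C statistic uses deciles of predicted risk, to be compared with the chi-squared (8 df) critical value of 15.5 quoted above.

```python
import numpy as np

def roc_auc(y, p):
    """Area under the ROC curve via the Mann-Whitney rank-sum identity."""
    order = np.argsort(p)
    ranks = np.empty(len(p))
    ranks[order] = np.arange(1, len(p) + 1)
    n1 = y.sum(); n0 = len(y) - n1
    return (ranks[y == 1].sum() - n1 * (n1 + 1) / 2) / (n0 * n1)

def hosmer_lemeshow_C(y, p, groups=10):
    """H-L C statistic over equal-sized groups ranked by predicted risk
    (compared with chi-squared on groups - 2 = 8 degrees of freedom)."""
    order = np.argsort(p)
    chi2 = 0.0
    for g in np.array_split(order, groups):
        obs, exp, n = y[g].sum(), p[g].sum(), len(g)
        pbar = exp / n
        chi2 += (obs - exp) ** 2 / (n * pbar * (1 - pbar))
    return chi2

# One illustrative assessment sample of 400 cases with calibrated risks.
rng = np.random.default_rng(3)
p = rng.uniform(0.01, 0.99, 400)
y = (rng.uniform(size=400) < p).astype(float)  # outcomes drawn from the risks

auc = float(roc_auc(y, p))
C_stat = float(hosmer_lemeshow_C(y, p))        # compare with 15.5 (p < 0.05)
```

In the re-sampling scheme described above, both statistics would be averaged over repeated draws of the assessment sample.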
6.2.2 Software
For the experiments described in this chapter, Statistica 6.0 '^^ was used for data pre-processing.
The SVM program used was SVMlight Version 5.00 (3 July 2002), written by T. Joachims. It makes the optimisation problems presented by SVMs with large datasets tractable in several ways ''^. Subsets of the data are chosen and worked with in turn. The size of the optimisation problem is reduced by "shrinkage", or elimination of datapoints that were unlikely to become support vectors, and so unlikely to participate in the model solution. The termination of the optimisation process is at a predetermined acceptable accuracy: the termination error. The shrinkage parameter for SVMlight defined the number of iterations for which a variable remained optimal before shrinkage, and was set to 100 (default value). The criterion for termination was an allowable error of 0.001 (default value). Following
The SVM models require the choice of a kernel function and of the SVM parameters C and $\varepsilon$ (for regression problems). All SVM experiments used a biased hyperplane, and solutions were not
$$C_{\text{default}} = \left(\frac{1}{l}\sum_{i=1}^{l}\|\mathbf{x}_i\|^2\right)^{-1}$$
where the $\mathbf{x}_i$ are the input vectors and $\|\mathbf{x}_i\|$ is the 2-norm of $\mathbf{x}_i$.
For these preliminary regression problems, the default value for $\varepsilon$ was 0.1.
An SVMlight 5.0 interface using Matlab 6.5 '^^ was adapted from a freeware interface for SVMlight: http://www.cis.tugraz.at/igi/aschwaig/software.html '^'*.
The output and performance of the models was analysed with Matlab 6.5 '^^ and Statistica 6.0 ''^.
There were 5278 admissions to the PAH ICU between 1 January 1995 and 31 December 1999. The data were randomly split for each individual ANN or SVM that was trained, and the test set was sampled for each assessment trial. To prevent over-fitting during ANN training, 50% of the dataset not used for training was used as the verification set.
Outcome definition
The outcome of interest is in-hospital mortality at 30 days after ICU admission. Deaths are defined as patients who died in hospital within 30 days. Those patients who survived to discharge within 30 days, or who were still in-patients in the hospital at 30 days, are defined as survivors. This is different from the mortality outcome that has been previously used in modelling ICU patient mortality, where survival to hospital discharge is the endpoint.
Statistics of mortality at a fixed time have advantages over mortality status at hospital discharge statistics '**''^6-'^\. Thirty-day outcome is useful for contemporaneous audit of ICU mortality, as all patient outcomes can be accounted for at 30 days, and a complete data set can be analysed. There is no requirement to wait until all patients have been discharged from the hospital, nor need to analyse incomplete data sets where the deaths are over-represented. This "closure of the books" is necessary so that quality audit and RA control charting may be performed with reproducible methodology in a timely manner. The use of a fixed endpoint such as 30 day mortality may facilitate the comparison of outcomes between institutions by reducing the variability that may result from differing discharge policies.
Figure 6.7 presents a schematic representation of the relationship between the ICU patients who survived, and the deaths according to timing and site of death. The in-hospital 30 day mortality (C) is a subset of the total in-hospital deaths (B), and also a subset of the 30 day mortality among patients who die in-hospital, at home or at another institution (C+D). Most of the in-hospital deaths occur in the first 30 days, but there will be a small group of patients who were discharged from hospital within 30 days, who subsequently died (D). These patients may have been discharged home or to a palliative care facility with an expectation that they would die, or transferred to another hospital, or discharged home and re-admitted to die later. These patients will not register as deaths. Such patients comprise only a very small number of all the patients admitted to the ICU, and resource limitations precluded follow-up of this group.
For the classification task, the outcomes were coded as -1 for death and +1 for survival, to provide class separation. For the regression task, the coding was +1 for death and 0 for survival, and the models estimate the probability of death.
Figure 6.7 All ICU patients, and subsets of patients who died, to show the relationship between 30 day mortality and in-hospital mortality.
demonstrated to work elsewhere. The use of data drawn from an APACHE III database has precedent in Clermont et al. I have performed modelling in collaboration with Petra Graham with the variables used in this preliminary study, using linear regression and classification trees. This work supported the choice and pre-processing of variables used in the preliminary machine learning experiments.
The features used are a severity of illness measure (acute physiology score from APACHE III), and measures of physiological reserve (age and co-morbidity score). The disease process is coded with an ordered category to capture the effect of the patient's disease, accident or surgery.
The contribution to the chi-squared statistic made by each variable can be measured when that variable is added to a model containing the other variables. This procedure was used to gauge the importance of each of the variables. On the APACHE III developmental database, the relative contributions to the model were acute physiology 73.1%, the disease or diagnosis 13.6%, age 7.3% and chronic health items 2.9%. Another recent study by Johnson et al., modelling ICU survival, observed similar contributions to the model performance: physiology 67.7%, diagnosis 17.7%, co-morbidities 8.4% and age 4.0%. On the PAH ICU dataset, using logistic regression, Petra Graham and I describe the relative contributions as APACHE III score 77% (or acute physiology score 71%) and diagnostic group 13%.
With knowledge about the importance of variables in ICU populations generally, and in this sample from PAH ICU particularly, four input variables were chosen. The APACHE III acute physiology score was chosen to reflect the amount of physiological disturbance in the first 24 hours in ICU. The diagnostic category was used to capture the influence of the patient's diagnosis, procedure, or surgery. The patient's physiological reserve was represented by the age and the chronic health score component of the APACHE III score.
For each variable, the area under the ROC curve was used to assess the variable's ability to discriminate between survivors and non-survivors.
1. Acute Physiology Score
The acute physiology component of the APACHE III score was calculated according to the data collection rules and definitions of APACHE III. It is a sum of scored components of the worst recorded observations and laboratory values from the first day of ICU admission. Neurological abnormalities, temperature, blood pressure, heart rate, respiratory rate, mechanical ventilation, and urine output are scored according to the extent of deviation from a midpoint "normal" physiological value. The blood chemistry measures creatinine, white cell count, haematocrit, albumin, bilirubin, glucose, sodium and urea are similarly scored according to deviation from a normal value. The sum of these scores gives the acute physiology component of the APACHE III score.
The acute physiology score (APS) has very good discrimination on the PAH ICU dataset. The area under the ROC curve of the APS alone on the PAH ICU dataset is 0.837. Any model that is built from variables that include the APS should have an area under the ROC curve of greater than 0.837.
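The single-variable areas under the ROC curve quoted in this section can be computed directly from the pairwise ranking interpretation of the ROC area (the probability that a randomly chosen non-survivor scores higher than a randomly chosen survivor). A minimal sketch, not the software actually used in this thesis:

```python
import numpy as np

def auc_single_variable(score, died):
    """Area under the ROC curve for a single risk score.

    Uses the Mann-Whitney interpretation: the probability that a
    randomly chosen non-survivor has a higher score than a randomly
    chosen survivor (ties count half).
    """
    score = np.asarray(score, dtype=float)
    died = np.asarray(died, dtype=bool)
    pos = score[died]   # scores of non-survivors
    neg = score[~died]  # scores of survivors
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# A perfectly separating score gives an area of 1.0.
print(auc_single_variable([1, 2, 8, 9], [0, 0, 1, 1]))  # -> 1.0
```

Applied to the APS values and 30 day outcomes of the PAH ICU dataset, this calculation is what yields a figure such as 0.837.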
2. Disease Category
In collaboration with Petra Graham, I have proposed a simple and powerful approach to recoding the complex APACHE III disease groups and weights. A brief description is provided here to assist the reader.
The disease group classification is a three level simplification of the APACHE III diagnostic categories. Each disease group is categorised as High, Low or Neutral risk according to whether the mortality for the disease group was higher, lower or not significantly different to the average mortality of the APACHE III model development dataset (Table 3, Knaus et al.). Table 6.1 shows the coding of the original APACHE III diagnoses and the simplified scale. Any diagnosis that was not present in the original APACHE III disease group list is classified as "Neutral", and marked with an asterisk in the table.
The risk was coded with high = 1, neutral = 0, and low = -1. The area under the ROC curve for
Table 6.1: APACHE III diagnoses and the simplified disease group risk coding.

Nonoperative
Cardiovascular/vascular Cardiogenic shock high
Cardiac arrest high
Cardiomyopathy* neutral
Aortic aneurysm neutral
Congestive heart failure high
Peripheral vascular disease neutral
Rhythm disturbance low
Acute myocardial infarction low
Hypertension low
Allergy* neutral
Other cardiovascular diseases neutral
Respiratory Parasitic pneumonia high
Aspiration pneumonia high
Respiratory neoplasm high
Respiratory arrest high
Pulmonary oedema (non-cardiogenic) high
Bacterial/viral pneumonia high
Chronic obstructive pulmonary disease high
Pulmonary embolism neutral
Mechanical airway obstruction neutral
Asthma low
Other respiratory diseases neutral
Hepatic failure high
Gastrointestinal (GI) GI perforation/obstruction high
GI bleeding due to varices high
GI inflammatory disease neutral
GI bleeding due to neoplasm* neutral
GI bleeding due to ulcer/laceration low
GI bleeding due to diverticulosis low
Other GI diseases high
Neurologic Coma (unknown cause)* neutral
Intracerebral haemorrhage high
Subarachnoid haemorrhage low
Stroke high
Neurologic infection neutral
Neurologic neoplasm neutral
Neuromuscular disease neutral
Seizure neutral
Other neurologic diseases neutral
Sepsis Sepsis (other than urinary tract) high
Sepsis of urinary tract origin high
Trauma Head trauma (with/without multiple trauma) low
Multiple trauma (excluding head trauma) low

Operative
Vascular/cardiovascular Dissecting/ruptured aorta high
Peripheral vascular disease (no bypass graft) neutral
Valvular heart surgery low
Elective abdominal aneurysm repair low
Peripheral artery bypass graft low
Carotid endarterectomy low
Other cardiovascular diseases low
Respiratory Respiratory infection neutral
Lung neoplasm low
Respiratory neoplasm low
Other respiratory diseases low
Gastrointestinal (GI) GI abscess* neutral
GI perforation/rupture high
GI inflammatory disease neutral
GI pancreatitis* neutral
GI peritonitis* neutral
GI obstruction neutral
GI bleeding neutral
GI vascular* neutral
Liver transplant neutral
GI neoplasm low
GI cholecystitis/cholangitis low
Other GI diseases low
Neurologic Intracerebral haemorrhage high
Subdural/epidural haematoma high
Subarachnoid haemorrhage low
Laminectomy/other spinal cord surgery low
Craniotomy for neoplasm low
Other neurologic diseases low
Trauma Head trauma (with/without multiple trauma) high
Multiple trauma (excluding head trauma) low
Renal Renal neoplasm low
Renal transplant* neutral
Other renal diseases low
Gynaecologic Hysterectomy low
Orthopaedic Hip or extremity fracture neutral

* indicates a diagnosis that did not clearly map onto an APACHE III disease group at the time of modelling. These diagnoses had no risk data and were given a "neutral" disease group risk category coding.
3. Chronic Health Score
A score was calculated by adding points allocated according to the presence of co-morbid conditions, according to the data collection rules and definitions of APACHE III. Where these conditions are present, the points allocated are AIDS (23), hepatic failure (16), lymphoma (13), metastatic cancer (11), leukaemia/multiple myeloma (10), immune suppression (10) and cirrhosis (4). The area under the ROC curve for the Chronic Health Score alone on the PAH ICU dataset is 0.624.
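The chronic health score is a simple additive tariff over the conditions present. A minimal sketch using the point values quoted above (the condition keys are illustrative names, not APACHE III field names):

```python
# APACHE III chronic health points, as quoted in the text.
CHRONIC_POINTS = {
    "aids": 23,
    "hepatic_failure": 16,
    "lymphoma": 13,
    "metastatic_cancer": 11,
    "leukaemia_multiple_myeloma": 10,
    "immune_suppression": 10,
    "cirrhosis": 4,
}

def chronic_health_score(conditions):
    """Sum the points for the co-morbid conditions present."""
    return sum(CHRONIC_POINTS[c] for c in conditions)

print(chronic_health_score(["cirrhosis", "lymphoma"]))  # -> 17
```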
4. Age
Age was used as a continuous variable calculated as date of admission minus the patient's date of birth.
The area under the ROC curve of the patient's age alone on the PAH ICU dataset is 0.62.
5. Index
An integer was used to index the order of admission. This variable was not subsequently used in the modelling.
The data were not manipulated or pre-processed further prior to modelling with the ANN and SVM.
6.2.4 Classification
The objective of the classification experiments was to assess how well ANNs and SVMs can classify patients, using the data collected in the first 24 hours, into survivors or deaths in-hospital at 30 days.
SVM Classification
Initial classification experiments were conducted with the default value of the SVMlight parameter C.
The RBF kernel that was used to implement the dot product of the mapping function in Equation 6.3 was

K(x_i, x_j) = exp(-γ ||x_i - x_j||²)

where x_i and x_j are input patterns, and ||x_i - x_j|| is the 2-norm of the difference between these patterns.
SVMs with varying RBF kernel parameter γ were investigated (Figure 6.8). For each SVM, the sample was randomly split into a set of 20% (1056 cases) for model training, and the remaining 80% (4222 cases) for model testing. Values are the mean correct classification rate (CCR) of 10 SVMs trained at values of γ between 0 and 20. Figure 6.8 shows the CCR for RBF kernel SVMs in the range of γ between 0 and 1. The best CCR is 90.2% at γ between 0.007 and 0.01, whilst the best in the training set was 89% with γ less than 0.005. At larger γ, the CCR approaches the mortality rate, as all cases are classified as deaths.
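The RBF kernel of Equation 6.3 can be evaluated directly as a kernel matrix. A minimal numerical sketch (illustrative only; the thesis experiments used the SVMlight implementation):

```python
import numpy as np

def rbf_kernel(X, Y, gamma):
    """K(x_i, y_j) = exp(-gamma * ||x_i - y_j||^2).

    Squared Euclidean distances are computed via the expansion
    ||x - y||^2 = x.x - 2 x.y + y.y, clipped at zero to guard
    against tiny negative values from floating-point round-off.
    """
    sq = (X**2).sum(1)[:, None] - 2 * X @ Y.T + (Y**2).sum(1)[None, :]
    return np.exp(-gamma * np.clip(sq, 0, None))

X = np.array([[0.0, 0.0], [1.0, 0.0]])
K = rbf_kernel(X, X, gamma=1.0)
# Diagonal entries are exp(0) = 1; the off-diagonal entry is exp(-1).
print(np.round(K, 4))
```

Small γ makes the kernel nearly constant (very smooth decision surfaces); large γ makes each training point an island, which is consistent with the degenerate behaviour seen at large γ in Figure 6.8.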
[Figure 6.8: Correct classification rate for RBF kernel SVMs, γ between 0 and 1.]
The discrimination was assessed with the area under the ROC curve. Figure 6.9 displays the discrimination performance at values of 0 < γ < 0.1 for the RBF kernel SVMs. The ability of the SVMs to rank the training set improves with increasing γ. On the test set, the areas under the ROC curves of the RBF kernel SVM models are all < 0.80, and all less than that of the acute physiology score variable.
Figure 6.9: Area under the ROC curve for RBF SVM, γ = 0 to 0.1. Values are the mean ROC curve area of 10 models, 1056 cases in the training set.
The polynomial kernel function that was used to implement the dot product of the mapping function in Equation 6.3 was

K(x_i, x_j) = (x_i · x_j)^d

where d, the degree of the polynomial, was studied for values 1 - 4.
The average CCR and areas under the ROC curves at a range of training set sizes were explored. Figure 6.10 displays the relationship between the training set sizes and CCR for the orders of polynomial kernels. Performance for all models was dominated by the preponderance of survivors in the sample. The CCR is around 87.5% - 88.5% for all polynomial kernels. The polynomial kernel with d = 4 is not shown, as the optimisation algorithm failed to converge at the parameter settings for this data sample. At training set sizes < 40% (2111 cases), 2nd and 3rd order polynomial kernel SVMs had a slightly superior CCR on the test sets than the linear SVM models.
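The homogeneous polynomial kernel above reduces to powers of the ordinary dot product. A minimal sketch (again illustrative, not the SVMlight implementation):

```python
import numpy as np

def poly_kernel(X, Y, d):
    """Homogeneous polynomial kernel K(x_i, x_j) = (x_i . x_j)^d.

    d = 1 recovers the linear kernel; d = 2, 3, 4 give the
    higher-order kernels studied in this section.
    """
    return (X @ Y.T) ** d

x = np.array([[1.0, 2.0]])
y = np.array([[3.0, 1.0]])
# The dot product x.y is 5, so the kernel values are 5, 25, 125.
for d in (1, 2, 3):
    print(d, poly_kernel(x, y, d)[0, 0])
```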
[Figure 6.10: Correct classification rate versus % of sample as training set, for polynomial kernel SVMs.]
Figure 6.11 shows the discrimination at a range of polynomial kernels d = 1 - 3. The areas under the ROC curves are averaged over 20 trials at each point. Some of the models trained on small datasets (5%, 264 cases) had equivalent performance, with areas under the ROC curves on the test set up to 0.86. However, the striking feature of the small training sets was their unreliability, compared to the more robust and dependable modelling on the larger training sets of > 20% (1056 cases) of the sample. At > 10% (528 cases) the area under the ROC curve was consistently greater than the area under the ROC curve of the acute physiology score variable. In contrast to the RBF SVM, the polynomial kernel SVM had better discrimination.
[Figure 6.11: Area under the ROC curve versus % of sample as training set, for polynomial kernel SVMs d = 1 - 3.]
At all training sample sizes, 2nd and 3rd order polynomial functions had a lower area under the ROC curve than the linear (d = 1) kernel. The polynomial kernel SVMs studied in this experiment had better classification and discrimination than the RBF kernel SVMs.
ANN Classification
The ANN training was conducted on Statistica Neural Networks using radial basis function (RBF) and multilayer perceptron (MLP) designs. Initial exploration showed the RBF ANN consistently underperformed in classification and discrimination compared to the MLP, and is not considered further.
The "Problem Solver" module of the Statistica Neural Networks programme was used to assign weights, adjust the learning rate and momentum, and prune the nets. The output variable was coded as -1 for death and +1 for alive for classification and ranking problems. The input features were the same as in the SVM classification experiment. The training algorithm provided for pruning of both inputs and hidden units.
MLPs were designed to have only one hidden layer, with the number of hidden units and the ANN architecture otherwise determined by the training procedure.
The training set sizes studied were 5% (264 cases), 10% (528 cases), 20% (1055 cases) and 50% (2639 cases). A verification dataset was used to prevent over-training, and a test set was used to evaluate the generalisation performance of the ANN. The verification and test sets were half each of the remaining data.
The function to be minimised was the sum of the squared differences between the observed outcomes and the model predictions on each output unit. For MLP training, a two-stage training process was used: back propagation (50 epochs, learning rate 0.1, momentum 0.3) followed by conjugate gradient descent. Training was continued until the verification set performance started to deteriorate or convergence ceased.
For each trial, numerous ANNs were successively trained and the best models were retained. At the commencement of each trial, an ANN was trained until its generalisation performance on the verification set began to decline, or convergence ceased. Another ANN was then trained until its performance on the verification set began to deteriorate, or convergence ceased. If its performance was superior to the first ANN, it was retained; if not, it was discarded. This process was continued until 20 increases in performance had been achieved. Ten such trials were conducted on random selections of training data. The average performance of the best ANN from each trial provided the reported results.
Figure 6.12 shows the mean area under the ROC curve for MLP ANN model performance on the test set.
[Figure 6.12: Mean area under the ROC curve for MLP ANN test-set performance versus % of sample as training data.]
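The model-retention procedure described above (train successive networks, keep only improvements, stop after 20 improvements) can be sketched abstractly. This is a hypothetical stand-in for what the Statistica "Problem Solver" module did internally, with `train_once` standing in for one early-stopped training run:

```python
def best_of_successive(train_once, n_improvements=20):
    """Sketch of the retention loop described in the text.

    Networks are trained one after another; a network is kept only
    when it outperforms the best so far on the verification set, and
    the loop stops once n_improvements increases in performance have
    been achieved.  train_once() is a hypothetical stand-in that
    trains one ANN with early stopping and returns its
    verification-set performance.
    """
    best = float("-inf")
    improvements = 0
    while improvements < n_improvements:
        perf = train_once()
        if perf > best:
            best = perf
            improvements += 1
    return best

# Toy stand-in: each "training run" scores better than the last,
# so exactly 20 runs occur and the best score is 19.
scores = iter(range(100))
print(best_of_successive(lambda: next(scores)))  # -> 19
```

In the thesis experiments this whole loop constituted one trial; ten such trials on random training selections were averaged.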
The discrimination of the MLP ANN was on average 0.865 - 0.875 across the dataset sizes. The ANNs were able to improve upon the discrimination performance of the APS variable of the APACHE III score. The ANNs' performance was equivalent to, or exceeded, the discrimination of the best SVM polynomial kernel. Unlike the SVM, the ANN performance did not drop below an area under the ROC curve of 0.865 even with the training set at only 5% of the sample (264 cases). This is in contrast to the Clermont et al. study of ANNs in ICU, where the model performance declined at training set sizes of less than 800 cases.
hospital mortality on the PAH dataset. SVMs, ANNs and logistic regression approaches including
APACHE III have all used similar input variables for modelling.
Table 6.2: Discrimination (area under the ROC curve) for various models

Model — Area under the ROC curve
SVM polynomial (1st order) — 0.84
SVM RBF — 0.79
MLP ANN — 0.87
Acute physiology component of APACHE III score — 0.84
APACHE III mortality prediction* (30 day outcome) — 0.89
Logistic regression* — 0.87

* from Graham and Cook
The SVMs and ANNs successfully classified the patients into survivors and non-survivors. The best discrimination was seen with the proprietary APACHE III predictions, logistic regression and the MLP ANN of this experiment. The APACHE III model was developed on a large dataset of 17,440 patients over 10 years ago in North America, and its discrimination was still very good on the PAH ICU dataset. The logistic regression model used all of the patients in the dataset. In comparison, the MLP ANN was able to discriminate accurately between survivors and non-survivors using only 264 patients for training.
The MLP ANN discriminated as well as the other methods. In contrast to the ANN modelled by Clermont et al., the discrimination did not deteriorate at dataset sizes less than 800 patients. In fact, the area under the ROC curve was still 0.865 on the smallest dataset studied (264 patients). The equivalent models in the Clermont study achieved areas under the ROC curve of 0.817 (400 cases) and 0.752 (200 cases).
One of the shortcomings of this experiment was that optimisation of the SVM parameters was not pursued. It is possible that the SVM performance may exceed that of the ANNs or logistic regression if optimisation of the SVM parameter C had been explored. This issue will not be pursued in this section, as this preliminary work is a prelude to building a regression model with raw patient data. However, this work establishes that even before extensive parameter tuning, the classification and discrimination performance of the SVM is close to that of the ANN, logistic regression and the APACHE III system.
6.2.5 Regression
The aim of the preliminary regression modelling was to develop an ANN or SVM to estimate the risk of 30 day mortality in ICU patients, using the 4 features described earlier: APS, Disease Category, Age and Chronic Health Score.
The performances of the models were assessed for discrimination using the area under the ROC curve. Calibration was assessed by the Hosmer-Lemeshow C statistic on assessment sets of 400 cases.
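For reference, the Hosmer-Lemeshow C statistic compares observed and expected deaths within risk groups formed by sorting on predicted probability. A minimal sketch assuming equal-size groups (the exact grouping rules used in the thesis may differ):

```python
import numpy as np

def hosmer_lemeshow_C(p, y, groups=10):
    """Hosmer-Lemeshow C statistic.

    Cases are sorted by predicted probability p, split into
    equal-size risk groups, and the squared gap between observed
    (y) and expected deaths is accumulated, scaled by the group's
    binomial variance.
    """
    order = np.argsort(p)
    p, y = np.asarray(p, float)[order], np.asarray(y, float)[order]
    C = 0.0
    for chunk_p, chunk_y in zip(np.array_split(p, groups),
                                np.array_split(y, groups)):
        n = len(chunk_p)
        expected = chunk_p.sum()
        observed = chunk_y.sum()
        pbar = expected / n
        C += (observed - expected) ** 2 / (n * pbar * (1 - pbar))
    return C

# Perfectly calibrated toy data: observed deaths match each group's
# predicted risk exactly, so the statistic is (numerically) zero.
p = np.concatenate([np.full(100, 0.1), np.full(100, 0.5)])
y = np.concatenate([np.r_[np.ones(10), np.zeros(90)],
                    np.r_[np.ones(50), np.zeros(50)]])
print(round(hosmer_lemeshow_C(p, y, groups=2), 6))  # -> 0.0
```

On 400-case assessment sets with 10 groups, the 15.5 criterion used later corresponds to the upper tail of the approximate chi-squared reference distribution for the statistic.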
SVM Regression
There was no a priori knowledge of the likely SVM parameters or the precision required for this task, so experiments explored SVM kernel functions across a range of ε (regression tube width). The SVM parameter C was left at the SVMlight default value.
Preliminary experiments with an RBF kernel used in regression SVM modelling with these variables were disappointing. Figure 6.13 shows the discrimination of the RBF kernels: the average area under the ROC curve of 50 trials at each data point is presented, for γ in the range 0 - 2 and ε in the range 0 - 2. The surface colours denote the area under the ROC curve values. There are no dark blue shaded areas, so no average area under the ROC curve was greater than or equal to 0.83. This means that for all parameter combinations tested, the average discrimination of the RBF kernel regression SVM was below 0.83.
At ε values below 0.5, the RBF kernel SVM had the best discrimination, with the area under the ROC curve being 0.78 - 0.81 for the range of values of γ. At ε values greater than 0.5, the discrimination of the models rapidly deteriorated to provide no better than random discrimination (area under the ROC curve ~ 0.5).
Figure 6.13: Surface plot of the average area under the ROC curve for RBF kernel regression SVMs.
The surface plot tones correspond to the average area under the ROC curve:
Dark blue: > 0.83
Light blue, green, yellow: 0.80 - 0.83
Red: < 0.80
No further optimisation of regression SVM parameters was done with the RBF kernel using these data
and variables.
Regression SVMs with polynomial kernels had more promising discrimination. Figure 6.14 shows the areas under the ROC curves for a range of ε for polynomial kernels d = 1 - 3. Each point is the average of 20 SVM performances at that value. The best average area under the ROC curve (0.873) was with the linear kernel (d = 1).
[Figure 6.14: Area under the ROC curve versus ε for polynomial kernel regression SVMs, d = 1 - 3.]
The calibrations of the RBF and polynomial kernel regression SVM models were poor. The SVM's output provided a point estimate of a binary outcome and did not provide an estimate of the probability of death. Therefore, the output of the linear kernel SVM, which had the best discrimination, was recalibrated. In this experiment, a GRNN was used to recalibrate the SVM outputs, to more reliably estimate the probability of death.
The best calibration at the default setting for the SVM parameter C was found with the linear kernel SVM (d = 1 and ε = 0.005). One hundred linear kernel SVMs were trained and recalibrated with the GRNN. Each of these trained SVM GRNN models was then assessed on 100 test set samples of 400 randomly chosen cases, and the average area under the ROC curve and H-L C statistic of the 100 test sets were recorded.
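A GRNN is essentially Nadaraya-Watson kernel regression: the recalibrated output is a Gaussian-weighted average of training outcomes. A minimal sketch of the recalibration step, where the single input is the raw SVM output and the target is the 0/1 death outcome (the smoothing width `sigma` is an illustrative choice, not a value from the thesis):

```python
import numpy as np

def grnn_predict(x_train, y_train, x_query, sigma=0.1):
    """General regression neural network (Nadaraya-Watson form).

    Each prediction is a Gaussian-kernel-weighted average of the
    training outcomes, so with 0/1 death outcomes the output
    approximates the probability of death near that score.
    """
    d2 = (x_query[:, None] - x_train[None, :]) ** 2
    w = np.exp(-d2 / (2 * sigma ** 2))
    return (w @ y_train) / w.sum(axis=1)

# Toy illustration: raw "SVM scores" near -1 correspond to deaths.
scores = np.array([-1.0, -0.9, -0.8, 0.8, 0.9, 1.0])
died   = np.array([ 1.0,  1.0,  1.0, 0.0, 0.0, 0.0])
out = grnn_predict(scores, died, np.array([-0.95, 0.95]))
print(np.round(out, 3))  # -> [1. 0.]
```

The GRNN leaves the ranking of the scores essentially unchanged while mapping them onto the probability scale, which is why it improves calibration without harming discrimination.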
Figure 6.15 shows the distribution of the mean areas under the ROC curves of the 100 SVM GRNN models built. The average area under the ROC curve was 0.863, and all models had an area under the ROC curve greater than 0.80.
Figure 6.15: Discrimination of SVM-GRNN models: histogram of mean ROC curve area. Each observation is the mean of 100 samples of test sets of 400 cases, for each of the SVM-GRNN models.
Figure 6.16 is a frequency histogram of the average H-L C statistic values of 100 assessment sets for each of the 100 SVM-GRNN models. Twenty-nine of the 100 trained SVM GRNN models had an average H-L C > 15.5 on 400 case test sets. The histogram of the average H-L C statistics has a skewed distribution, which is expected as the H-L C statistic follows a chi-squared distribution. Note the 7 models grouped on the right of the histogram, which all had mean H-L C statistics > 50.
Figure 6.16: Calibration of SVM-GRNN models: histogram of mean H-L C statistic. Each observation is the mean of 100 samples of test sets of 400 cases, for each of the SVM-GRNN models.
Seventy-one percent of the SVM GRNN models met both the discrimination (area under the ROC curve > 0.80) and calibration (H-L C < 15.5 on 400 cases) criteria. Therefore, the SVM GRNN is a practical approach to modelling risk of 30 day in-hospital mortality on the PAH ICU data.
The distributions of the discrimination and calibration statistics in Figures 6.15 and 6.16 demonstrate the need to conduct multiple trials and re-sampling. The spread of areas under the ROC curve and H-L C statistics is due to sampling of the data for model building and model assessment. For each trial, a different training set was chosen. For assessment of each model, 100 different test sets were chosen from the remaining data. Experiments conducted on a single sampling of data demonstrate only that it is possible to build a suitable model on that dataset. This experiment has demonstrated that the modelling is reproducible across repeated samples.
While some investigation was done of the SVM kernels and kernel parameters, the SVM parameter C was not optimised, and the SVMlight default value was chosen. It is possible that more extensive investigation of the SVM parameters may provide better performance. This will be explored in the next chapter.
ANN Regression
The ANN training was conducted on Statistica Neural Networks using radial basis function (RBF) and multilayer perceptron (MLP) designs. Initial experiments indicated that the RBF ANN consistently produced lesser discrimination than the MLP, so the following experiment is reported using only the MLP. The methods of feature selection, training and optimisation were the same as for the previous classification experiment.
For each trial, numerous ANNs were successively trained and the best models were retained, in the same way as in the classification experiments. At the commencement of each trial, an ANN was trained until its generalisation performance began to decline. Training of that ANN ceased and another ANN was similarly trained. If its best performance was superior to the first, it was retained; if not, it was discarded. As before, this process was continued until 20 increases in performance had been achieved.
Twenty such trials were conducted on random selections of training data. The performances of the best ANNs from each trial were analysed. The training datasets were 20% of the sample (1056 cases).
Initial analysis of the ANN outputs indicated that the outputs did not estimate the probability of death. Whilst the areas under the ROC curve demonstrated good discrimination, the output of the ANN was poorly calibrated. Therefore, the MLP ANN output values were recalibrated with a GRNN.
To evaluate the generalisation performance of the 20 MLP GRNN ANNs, the areas under the ROC
curves and the H-L C statistics were calculated by analysis of 100 test sets of 400 cases. These datasets
were drawn at random, and with replacement from the available test data. The average of the areas
under the ROC curve and the H-L C statistics were used to describe the discrimination and calibration
of each of the 20 MLP GRNN models. These results are presented in Table 6.3.
Table 6.3: Discrimination (area under the ROC curve) and calibration (H-L C statistic) of the 20 MLP GRNN models, averaged over 100 trials drawing 400 cases with replacement from the unseen data.
Seven of the MLP GRNN ANNs (Table 6.3) did not satisfy the model performance criteria that I have proposed (area under the ROC curve > 0.80 and H-L C < 15.5 on 400 cases). All models had an area under the ROC curve > 0.85. However, models 3, 6, 8, 11, 12, 17 and 18 had H-L C statistics greater than 15.5. This experiment demonstrates that MLP GRNN models trained on a dataset of 1056 cases achieved adequate performance on 13 of 20 occasions. Therefore, the MLP GRNN offers one way to train a model to estimate the probability of 30 day in-hospital mortality from these data.
These preliminary experiments demonstrate that ANNs and SVMs can be used to build models that estimate the risk of 30 day in-hospital mortality.
The SVM GRNN and ANN GRNN models were trained on 20% of the PAH ICU dataset, using 1056 cases, or approximately 10 months of admissions to the PAH ICU. This set size was chosen for the preliminary machine learning regression experiments for two reasons. The sample size of 1056 cases was successful in the classification experiments, and was likely to be adequate for the regression experiments. Clermont et al. had successfully used a training set size of 800 cases, but found that model performance with ANN and logistic regression deteriorated below this size.
The MLP ANN and the SVM adequately discriminated between the survivors and non-survivors on the PAH ICU dataset. However, the MLPs and SVMs did not provide estimates of the probability of death, but rather a point estimate of the predicted outcome. Therefore, the regression modelling experiment used a GRNN to recalibrate the predictions of the MLPs and the SVMs. The GRNN added an additional step to the modelling process, but allowed the output values to approximate the probability of death. This is verified by the discrimination and calibration performance of the recalibrated models.
By itself, the GRNN was unable to effectively rank the patients in order of risk of death, with areas under the ROC curve < 0.7. However, once the patients were modelled by the SVM or the ANN, the average area under the ROC curve was greater than 0.83. The GRNN was then used to adjust the SVM and ANN outputs to an accurate estimate of risk of death, verified by the calibration of the models.
In a practical sense, the second phase of model calibration with the GRNN for the MLP ANN and the SVM was a useful experiment, but is unwieldy. If this approach is to be introduced, then the steps would have to be automated. Modelling as conducted in this experimental context could not be introduced into general use, due to its complexity and predilection to error. One alternative is to apply the recommendations of Weston et al. to construct SVM kernels specifically for probability density estimation. Using a logit transformation (the log of the odds ratio) of the ANN or SVM outputs would be an alternative to the GRNN as a method to recalibrate the regression outputs to provide estimates of the probability of death.
approximate the probability of in-hospital patient death without a re-calibration step. It is planned to
tailor the design and parameter choice to the performance requirements of a risk adjustment model. An
the desirable performance attributes of discrimination and calibration. I will pursue this in the next
chapter, using discrimination (maximising the area under the ROC curve) and calibration (minimisation
of H-L C statistic). If this is successfiil, then a regression SVM will be built with a single modelling
This study uses re-sampling of data for training and testing to evaluate models. Previous comparable studies have relied on a single split of training and test sets. In a letter to the Editor, Paetz highlighted this weakness in the study of Clermont and co-workers. Without re-sampling, the process is vulnerable to variations with inclusion (or exclusion) of influential outliers that may bias the model building, training or assessment. Clermont et al. had shown that the machine learning modelling was possible, but not that their findings were necessarily reproducible. The histograms of Figures 6.15 and 6.16 illustrate the variability of model performance found when multiple samples of 1056 cases were drawn at random. The modelling with SVM and ANN was done with 20 MLP GRNN trials and 100 SVM GRNNs, each assessed with 100 random test sets. This approach with multiple trials and re-sampling confirms that the experimental method will produce models that are likely to meet the discrimination and calibration goals on any repeated sample of 1056 cases from the PAH ICU dataset.
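The overall resampling scheme (fresh random training set per trial, many bootstrap-style test sets per model) can be sketched abstractly; `fit` and `assess` are hypothetical stand-ins for model training and for computing a performance statistic:

```python
import random

def resampled_evaluation(data, fit, assess, n_trials=20,
                         n_assess_sets=100, train_frac=0.2,
                         assess_size=400, seed=0):
    """Sketch of the evaluation scheme used in this chapter.

    For each trial: draw a fresh random training set, fit a model,
    then assess it on many test sets of assess_size cases drawn with
    replacement from the unseen data.  Returns one mean score per
    trial, whose spread mirrors Figures 6.15 and 6.16.
    """
    rng = random.Random(seed)
    results = []
    for _ in range(n_trials):
        shuffled = data[:]
        rng.shuffle(shuffled)
        cut = int(train_frac * len(shuffled))
        train, rest = shuffled[:cut], shuffled[cut:]
        model = fit(train)
        scores = [assess(model, rng.choices(rest, k=assess_size))
                  for _ in range(n_assess_sets)]
        results.append(sum(scores) / len(scores))
    return results

# Toy stand-ins: the "model" is the training-set mean, assessed by
# the absolute gap between training and test-set means.
data = list(range(5000))
res = resampled_evaluation(data, fit=lambda t: sum(t) / len(t),
                           assess=lambda m, s: abs(m - sum(s) / len(s)))
print(len(res))  # -> 20
```

Examining the distribution of `res` across trials, rather than a single number from one split, is what distinguishes this design from a single train/test split.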
Whilst Clermont et al. showed that ANNs could successfully model patient probability of death, the experiments of this chapter demonstrate machine learning approaches that will usually be successful. The performances of the SVM GRNN and the MLP GRNN models are as good as or better than the findings of Clermont et al. This is the only comparable study using machine learning with similar variables on a similar ICU dataset. Clermont et al. found the areas under the ROC curves of a logistic regression and MLP ANN model were each 0.839, using samples of 800 and 1200 cases. This compares well with the experimental findings in this chapter using 1056 cases. The classification experiment trained MLPs (0.87) and SVMs (linear kernel: 0.84), and the regression experiment trained MLP-GRNN and SVM-GRNN models with comparable discrimination.
The calibration of the ANN and SVM based regression models was similar in this experiment. Under the conditions of this experiment, 65% of MLP-GRNN models (13/20) and 71% of SVM-GRNN models (71/100) had both acceptable discrimination and acceptable calibration. The calibration of these models, assessed by the H-L C statistic, was comparable between the two approaches.
Any differences in the areas under the ROC curves could be explained by random variation, and it is possible that SVM, ANN and logistic regression will provide equivalent results. If, however, any difference is real, several explanations related to data and variable selection may be proposed. It is possible that the absence of a diagnostic field in the variables used by Clermont et al. may have contributed to a slightly lower performance. Petra Graham and I have shown a diagnosis variable to be associated with up to 13% of the model's explanatory power on the PAH ICU dataset, and up to 18% has been demonstrated in other ICU patient models. Also, the failure by Clermont et al. to randomise to development or testing sets, or to use a re-sampling procedure, creates the potential for bias in the development and assessment of models. An unlucky split of cases may have resulted in the lesser measured areas under the ROC curves in their study. Other possible reasons can be proposed, such as data accuracy and precision, but not enough details are provided in the paper to comment. The data in the study by Clermont et al. and at the PAH ICU were collected and compiled according to the same rules of the APACHE III database system, with the same training and quality checks.
6.3 Conclusion
In the field of machine learning, ANNs, but not SVMs, have previously been applied to ICU outcome modelling. These preliminary experiments demonstrated that ANNs and SVMs provided model performance comparable to logistic regression on classification and regression tasks. In particular, the preliminary results indicate that SVMs may perform as well as logistic regression or ANNs. As this is an area that has not been extensively studied, a SVM model for risk adjustment will be pursued to estimate the probability of in-hospital death.
However, the models proposed in this section have considerable room for improvement. This issue will be addressed in three ways in the next chapter.
Firstly, models will be built with the patient database data, rather than using the APACHE III score. The variables used as input features for the ANNs and the SVMs were chosen on the basis of other successful ICU outcome modelling. These were heavily pre-processed and, as a preliminary area of study, this improved the chances of successful model development. Both the APS and the Disease Category variables offer effective discrimination between survivors and non-survivors, even before incorporation into a model. However, there is the possibility that this pre-processing may have limited the performance of the models. Therefore, in Chapter 7, patient outcome will be modelled with all the variables that are collected on the database, not just the 4 pre-processed variables used in Chapter 6. The raw physiological and laboratory data, the diagnostic code, co-morbidity information and additional variables will be used.
If a model can be built that uses only the component data variables, then the model development can be
done independently of the APACHE III score and the APACHE III system that provides proprietary
estimates of the probability of death. This will potentially save money spent on the software license
fee. More importantly, it will provide the flexibility to remodel the patient data when a risk adjustment
model used for risk adjusted control charting no longer fits the patient data.
Secondly, 1056 cases were used for the model development in this chapter. Clermont and co-workers
reported that 800 cases were sufficient for model development using logistic regression and ANNs.
Therefore, in the next chapter, a training dataset of 800 cases will be used. This is of practical
importance as 800 cases is approximately 8 months of ICU admissions. With an additional 400 cases
for model assessment, the model development and assessment can be conducted on about 1 year's
patient data.
Thirdly, neither the MLP ANNs nor the regression SVMs provided estimates of the probability of patient
death without a calibration step with the GRNN. By using the value of the machine learning model
output as the input variable for the GRNN, the ranking performance of the SVMs and the ANN was
preserved and a probability estimate was produced. These methods satisfied the discrimination and
calibration criteria of the mortality prediction model that I have proposed: area under the ROC curve >
0.80 and H-L C statistic < 15.5 on 400 cases. However, a SVM regression model may be able to
provide reliable estimates of the probability of death without a GRNN calibration step. SVM model
parameter choice will be determined by the kernel and parameter combinations that on average meet
these criteria.
Fourthly, the estimation of regression SVM parameters will be revised. In this experiment, the
polynomial and RBF kernels were explored for a range of values of ε, without investigating values for
the SVM parameter C. Parameter C is an important determinant of the SVM generalisation error. For
the SVM to reliably estimate the probability of patient death, the values of kernel parameters and the
parameters C and ε will be systematically explored, seeking the parameter combinations that satisfy the
discrimination and calibration criteria.
Therefore, in the next chapter, modelling of the ICU patient risk of death will be undertaken with a
regression SVM using the raw patient data variables, with transformation where necessary. After an
initial exploration of the search space, an extensive optimisation process will seek the model with the
best discrimination and calibration.
Chapter 7
This chapter presents an experiment in which support vector machines (SVMs) are developed to estimate the
risk of in-hospital death of patients admitted to the Princess Alexandra Hospital (PAH) intensive care unit (ICU).
SVMs have displayed comparable performance to artificial neural networks (ANNs) and logistic regression
on the ICU dataset. As SVMs have not been extensively studied in this application,
further SVM model development will be pursued. This experiment will employ a strategy to train a
regression SVM based on achieving discrimination and calibration targets, and thus guide kernel and
parameter selection. It will be shown that regression SVMs can be trained to accurately and reliably
estimate the probability of a patient's 30-day in-hospital death using the number of cases equivalent to 1
year of activity. This provides a practical approach to modelling risk of death for monitoring with risk
adjusted control charts.
7.1 Overview
The purpose of the experiment was to build a SVM to accurately predict the probability of 30 day in-
hospital mortality of ICU patients. For the SVM to provide a practical alternative to the commercial
APACHE III system, the model had to be built from the raw patient data.
The data variables were raw demographic, physiological, diagnostic and investigational data available
within the first 24 hours of ICU admission. The modelling was to be done on a 1200 case sub-sample, the
equivalent of a year of patient admissions to the PAH ICU, during 1995 - 1999.
Estimating the probability of death can be framed as a regression problem. A proposed model had to have
adequate discrimination and calibration to meet the standard of an accurate estimate of the probability of 30
day in-hospital death of an ICU patient at the PAH. In contrast to the experiment in the previous chapter,
the aim was to train a regression SVM with outputs that did not require re-calibration by a method such as
the GRNN.
7.2 Method
7.2.1 Data
From the original set of 5278 consecutive admissions in the PAH ICU patient dataset, 1/1/1995 -
31/12/1999, 16 cases were excluded due to missing data on Source of Admission or Time to ICU
admission.
Thirty five input variables, not limited to the APACHE III physiology, laboratory or diagnosis variables,
were used from the PAH ICU patient database. The list and description of the variables is shown in Table
7.1.
The demographic and admission data were collected at time of admission. The diagnosis, co-morbidity,
laboratory values, physiological observations and measurements were collected at the end of the first day in
ICU. No observations or measurements were included if the observations pre-dated the admission to the
physical ICU area. In the event that a diagnosis was revised after 24 hours in the ICU, the original
diagnosis was retained. Physiological and laboratory values were chosen according to the most extreme
values seen in the first 24 hours in ICU. Where physiological or laboratory values were not collected, the
database allocated a physiological value within the normal range. As the aim of the modelling task was to
estimate mortality based on patient attributes only, the APACHE III score, which had been successfully
validated earlier in this thesis, was not used as an input variable.
Samples of 1200 cases, comprising a training set of 800 cases and a test set of 400 cases, were randomly
selected from the dataset.
Initial experimentation on the dataset indicated that raw values could not be successfully modelled, and that
best results would be obtained if some of the variables underwent a transformation prior to inclusion in the
model.
Only one example of the use of SVMs to model ICU patient data was found in the literature. Morik et al.
used the SVM to emulate physicians' behaviour to assist in rule generation and guide the haemodynamic
management of patients in a German ICU. The data acquisition system was different to that used
at the PAH ICU, and the German patient data were initially modelled using time series analysis. Morik et
al. collected observations every 1 minute to detect and manage haemodynamic shock, in contrast to the
APACHE III database, which collects the worst value during the first 24 hours.
Morik and co-workers recoded categorical variables into a number of binary attributes. This is appropriate
for the PAH application and was therefore used for the "surgical category" variable and "patient admission
source" variable. Surgical category became the binary indicator variables for Elective Surgery, Non-
Surgical and Emergency Surgery. Source of admission was coded using indicator variables for Emergency
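The indicator recoding described above can be sketched in a few lines of Python; the category names are taken from the surgical-category example in the text, and the helper name is illustrative rather than anything used in the thesis:

```python
def one_hot(value, categories):
    """Recode a categorical value into binary indicator variables, one per category."""
    if value not in categories:
        raise ValueError("unknown category: %r" % (value,))
    return [1 if value == c else 0 for c in categories]

# The three surgical-category indicators described in the text.
SURGICAL = ["Elective Surgery", "Non-Surgical", "Emergency Surgery"]
```

Each case then contributes one column per category, with exactly one column set to 1, so a linear or kernel model can weight each admission type independently.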
However, the data transformation used by Morik and co-workers for the real valued, continuous or discrete
variables could not be used. Their method was to normalise the values of each variable using

norm(X) = (X − mean(X)) / √var(X)
This was not appropriate, as the values in the PAH database were collected according to the rules of the
APACHE III system database. In common with the APACHE II, SAPS II and MPM24 II ICU
models, it uses the worst or most extreme values (high or low) during the first 24 hours in ICU. This method
has been widely used, as it is the deviation from normal values that is associated with an increased risk of
death, rather than the absolute measurements. Therefore, the frequency histograms of the values of some
variables were comprised of two distributions of "worst" values. One component of the worst values was
observations that were the "worst low values", whilst the other was the "worst high values". Two examples
are shown in Section 7.2.2. Worst Temperature (Figure 7.1) has a frequency distribution that is dominated
by the worst high temperatures, reflecting that in critically ill patients, hyperthermia with sepsis or systemic
inflammation is more common than hypothermia. Worst Mean Arterial Pressure (Figure 7.3) clearly
shows a bimodal distribution of worst high blood pressure and worst low blood pressure.
Other machine learning methods have accurately modelled ICU patient in-hospital mortality, and some of
their successful data transformation strategies are described next. Generally, the raw physiology and
laboratory observations were processed to values that reflected the distance of the observation from a
physiologically normal state, which should be associated with the lowest risk of mortality. Clermont et al.
gave numerical values to the variables according to the original APACHE III scoring scheme. I have not
used this method, as it relies too closely on the APACHE III algorithm, and requires knowledge of the
scoring for each variable. Doig et al. coded all variables, except creatinine and the Glasgow Coma Scale,
as the difference above or below the median value of the variable. Potentially, the median value may not
reflect the variable value with the lowest risk of death, and it will depend heavily on the casemix and
patient severity.
The study by Frize et al. reported an ANN-based clinical decision support system for ICU patients, and
provides the most detail about successful pre-processing. In their study, all non-binary variables were
standardised, and scaled so that the zero values of the variables were associated with the lowest risk of
death. An ICU clinician chose the value that, in his expert opinion, gave the lowest risk of death for each
variable. The data observations were then subtracted from this physiologically normal value, and scaled by
the standard deviation.
With the previous published experience in mind, the PAH ICU raw non-binary data were processed to
achieve two objectives. The first aim was to standardise each variable, so that there was a minimum or
maximum of mortality risk at a standardised value of zero. The second aim was to process the variables so
that all were of approximately the same scale and range of values.
To standardise a variable, a physiologically normal value was subtracted from the variable value. One
option for settling on a physiological normal value for this experiment was to use expert judgement, as used
in the paper by Frize et al.. As an experienced ICU clinician, my clinical judgement tells me that this
approach is too subjective. The second option was to use the normal values recommended by APACHE III.
The values considered normal were derived from ICU patient data collected in 1988 and 1989 in North
America. However, there have been advances in medical care, and it was plausible that 5 - 10 years after
the APACHE III model was developed, the lowest risk of death would be found at different
physiological values.
Therefore, a third option, to re-explore the data variables, was undertaken. The values for each variable
associated with minimum risk of death were identified using a smoothed, locally weighted second order
polynomial fitted to scatter plots of 30 day in-hospital death (dependent variable) against each non-binary
variable. This method is similar to that used in the development of APACHE III, and a recent large risk
adjustment exercise reported by Render et al. from the Veterans Affairs ICUs in the United States.
However, in this PAH ICU application, the smoothed curves were only used to identify the value of
minimal (or maximal) risk of death, and not to assign continuously re-weighted values to the variables.
A distance weighted least squares procedure (as part of Statistica 5) was used to fit a curve to a plot of
input variable values and the patient outcomes. A quadratic regression for each value of the variable is used
to estimate the corresponding patient outcome, such that the influence of data points on the regression
decreases with distance, along the X axis, from the particular variable value. This method demonstrated
which values of each variable were associated with the lowest or highest risks of patient death. Each
variable was standardised by subtracting the value associated with minimum (or maximum) risk from the
raw data value. Scaling was done by dividing the result by the standard deviation of the standardised
variable values.
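The steps above can be sketched in pure Python. This is a minimal illustration, not the Statistica routine used in the thesis: Gaussian distance weights and the bandwidth are assumptions (Statistica's exact DWLS weighting is not documented here), and the local quadratic is fitted at each candidate value to find the point of minimum smoothed risk before centring and scaling:

```python
import math

def _solve3(a, b):
    """Solve a 3x3 linear system a.x = b by Gaussian elimination with partial pivoting."""
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, 3):
            f = m[r][col] / m[col][col]
            for c in range(col, 4):
                m[r][c] -= f * m[col][c]
    x = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        x[r] = (m[r][3] - sum(m[r][c] * x[c] for c in range(r + 1, 3))) / m[r][r]
    return x

def dwls_risk(xs, ys, x0, bandwidth):
    """Locally weighted quadratic estimate of outcome at x0: the influence of each
    data point falls off with its distance from x0 along the x axis."""
    t = [x - x0 for x in xs]                      # centre the basis at x0
    w = [math.exp(-(ti / bandwidth) ** 2) for ti in t]   # assumed Gaussian weights
    a = [[sum(wi * ti ** (i + j) for wi, ti in zip(w, t)) for j in range(3)]
         for i in range(3)]
    b = [sum(wi * yi * ti ** i for wi, ti, yi in zip(w, t, ys)) for i in range(3)]
    return _solve3(a, b)[0]                       # fitted value at t = 0

def standardise(xs, outcomes, bandwidth=2.0):
    """Subtract the variable value of minimum smoothed risk, then scale by the
    standard deviation of the centred values."""
    v_min = min(xs, key=lambda x0: dwls_risk(xs, outcomes, x0, bandwidth))
    centred = [x - v_min for x in xs]
    sd = math.sqrt(sum(c * c for c in centred) / len(centred))
    return [c / sd for c in centred], v_min
```

For Worst Temperature, for example, this kind of scan would locate the temperature of minimum smoothed mortality and re-express every observation as a scaled deviation from it.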
The raw data of some variables had frequency histograms that appeared unimodal, with grossly skewed
frequency distributions, or had values over a large range. In these cases, the logarithm to base 10 was used
to transform non-zero raw data values to facilitate scaling and standardisation. Entries with raw values of
zero were given the lowest transformed value for that variable.
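A short sketch of this transformation, with the zero-handling rule from the text (function name is illustrative):

```python
import math

def log10_transform(values):
    """Log10-transform a skewed, non-negative variable; raw zeros are assigned
    the lowest transformed value seen for that variable."""
    logged = [math.log10(v) for v in values if v > 0]
    floor = min(logged)
    return [math.log10(v) if v > 0 else floor for v in values]
```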
This pre-processing did not aim to create a "standard normal variate" as used by Morik et al. The
histograms of values of some variables were often far from having a Gaussian distribution. In the PAH
dataset, the mean values were only occasionally associated with the lowest risk of death.
Worst Temperature
Worst Temperature illustrates a unimodal distribution with a normal physiological temperature associated
with the lowest risk of death. Figure 7.1 shows the frequency histogram.
Figure 7.1 Frequency histogram of Worst Temperature (24 - 44 °C)
The minimum risk of death was associated with a temperature of 37.2 Celsius, which is a normal body
temperature. The Worst Temperature value minus 37.2 was scaled by dividing by the standard deviation
(1.33). Figure 7.2 is a fit using distance weighted least squares that shows the relationship between
Worst Temperature and 30 day in-hospital mortality.
Figure 7.2 Distance weighted least squares fit of 30 day in-hospital mortality against Worst Temperature (28 - 44 °C)
For Worst Temperature, the distribution appears to be unimodal, and approximately normally distributed.
The minimum risk of death is not associated with the mean Worst Temperature value of 36.1 °C, or the
APACHE III default physiologically normal value of 38 °C; rather, it is found at 37.2 °C (Figure 7.2).
The mean Worst Temperature is different from the Worst Temperature associated with the lowest risk of
death. This illustrates why the normalisation method of Morik et al., using the mean value taken from a
group of critically ill patients, should not be used with these data.
Worst Mean Blood Pressure has a bi-modal distribution, with a normal physiological blood pressure
associated with the lowest risk of death. Figure 7.3 shows the frequency histogram, and Figure 7.4 shows the
plot of the mean blood pressure values and the 30 day in-hospital mortality. The lowest risk of death was
associated with a mean blood pressure of 90 mmHg. Therefore, Worst Mean Blood Pressure was pre-
processed by scaling the raw value minus 90 mmHg divided by the standard deviation (27.0). The mean of
the Worst Mean Arterial Blood Pressure was 79.5 mmHg, with a bimodal distribution of the "low" and the
"high" Worst Mean Arterial Blood Pressure values, demonstrating that the normalisation method of Morik
et al. is, again, not appropriate. The APACHE III recommended physiologically normal value was 90
mmHg, and agreed with the blood pressure value associated with the lowest risk of death.
Figure 7.3 Frequency histogram of Worst Mean Arterial Blood Pressure
Figure 7.4 Distance weighted least squares fit of 30 day in-hospital mortality against Worst Mean Arterial Blood Pressure (20 - 220 mmHg)
Variables which had a minimum risk of death were: Temperature, Mean Blood Pressure, Heart Rate, White
Cell Count, Haematocrit, Sodium, Glucose, Albumin, pCO2 and Blood Urea Nitrogen. The variable which
For some of the variables, the risk of death was an increasing or decreasing function of the variable.
Examples where the curve fitting 30 day in-hospital mortality was an increasing function of the variable
included Worst Bilirubin.
Examples of the curve fitting in-hospital death as a decreasing function of the variable included Arterial
Blood pH and the measures of patient level of consciousness and arousal (the Glasgow Coma Score for
Worst Bilirubin
Worst Bilirubin illustrates a unimodal, skewed distribution over a large range. Figure 7.5 shows the
histogram of Worst Bilirubin, and Figure 7.6 shows increasing in-hospital death with increasing value of the
variable.
Figure 7.5 Frequency histogram of Worst Bilirubin
Increasing blood bilirubin concentrations were associated with increasing 30 day in-hospital death.
Figure 7.6 Distance weighted least squares fit of 30 day in-hospital death against log10 Worst Bilirubin
Binary variables, Chronic Health Score and the Diagnostic Category were not transformed. No cases were
removed as outliers.
Table 7.1 shows the list of variables, descriptions, transformations and pre-processing employed to yield
the standardised and scaled values for model building. Each variable is numbered according to the fields in
the dataset. The name of the variable is based on the APACHE III database name, and the description of the
variable provides information about how the variable was collected and coded, where necessary. The type
of variable is binary, discrete or continuous. Where the pre-processing is described as "standardised", its
treatment follows the methods detailed in Sections 7.2.1 and 7.2.2. Logarithmic transformations are noted
when used. The notes provide information about the values associated with the lowest risk of death, and the
nature of the relationship between each variable and the patient outcomes. All the features described below
were used as inputs for model building.
Table 7.1 Variables, descriptions, transformations and pre-processing used to yield the standardised and scaled values for model building
7.3 Software
For the experiments described in this chapter, data were pre-processed in Statistica 6.0. The SVM
programme was SVMlight Version 5.00 (3 July 2002), written by T. Joachims. It can be downloaded and
used free of charge for scientific purposes.
Any new model to predict the risk of 30 day in-hospital mortality must exhibit acceptable discrimination
and calibration if it is to be used as a risk adjustment tool. The issues of model assessment were discussed
in Chapter 2. Therefore, in this experiment, models were developed, assessed and compared by
performance on a test dataset according to the average area under the ROC curve (discrimination) and the
average H-L C statistic (calibration).
As before, adequate discrimination was defined as an ROC curve area of > 0.8 and acceptable calibration
was defined as an H-L C statistic of 15.5 or less on test data of 400 cases.
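As a sketch of how these two criteria can be computed (pure Python; a pairwise-concordance area under the ROC curve and a decile-based H-L C statistic are standard formulations, not necessarily the exact routines used in the thesis):

```python
def roc_area(outcomes, risks):
    """Area under the ROC curve via pairwise concordance; tied risks count one half."""
    pos = [r for y, r in zip(outcomes, risks) if y == 1]
    neg = [r for y, r in zip(outcomes, risks) if y == 0]
    score = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return score / (len(pos) * len(neg))

def hl_c(outcomes, risks, groups=10):
    """Hosmer-Lemeshow C statistic: sort cases into risk deciles and compare
    observed deaths with expected deaths within each group."""
    order = sorted(range(len(risks)), key=lambda i: risks[i])
    c, size = 0.0, len(order) / groups
    for g in range(groups):
        idx = order[int(round(g * size)):int(round((g + 1) * size))]
        n = len(idx)
        obs = sum(outcomes[i] for i in idx)
        exp = sum(risks[i] for i in idx)
        pbar = exp / n
        c += (obs - exp) ** 2 / (n * pbar * (1.0 - pbar))
    return c
```

A model meeting the criteria above would show `roc_area(...) > 0.8` and `hl_c(...) <= 15.5` on the 400-case test set.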
To train a SVM model, the choice of a kernel function and the SVM parameters ε and C for that training
application is required. A heuristic method was developed that searches the two dimensional parameter
space in two steps.
The first step is an initial investigation to gain an understanding of the properties of the search space, and
the second is a more detailed study of the area near to the optimum. In the first step, both the RBF kernel
SVM and the polynomial kernel SVM were investigated, and the calibration and discrimination of models
were assessed.
The RBF kernel function that was used to implement the dot product of the mapping function in Equation
6.3 was

K(x_i, x_j) = exp(−γ ‖x_i − x_j‖²)

where x_i and x_j are input patterns, and ‖x_i − x_j‖ is the 2-norm of the difference between these
input patterns.
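For concreteness, the two kernel families trialled in this chapter can be written directly (a sketch; function names are illustrative):

```python
import math

def rbf_kernel(x, z, gamma):
    """RBF kernel K(x, z) = exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def poly_kernel(x, z, d):
    """Polynomial kernel K(x, z) = (x . z)^d, as trialled for d = 1 - 3."""
    return sum(a * b for a, b in zip(x, z)) ** d
```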
In the absence of any a priori knowledge of the best RBF SVM parameter values, the parameter ranges
were explored systematically.
The first to be studied was the RBF kernel parameter γ, in the range 0 - 2. The default parameter values of
SVMlight were used, with ε set at 0.01 and SVM parameter C at the default value:

C_default = 1 / mean(‖x_i‖²)

where x_i is the input vector, and ‖x_i‖ = √(x_i · x_i).
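A small sketch of this default, assuming SVMlight's documented rule that the default C is the reciprocal of the average squared norm of the training inputs:

```python
import math

def default_c(inputs):
    """SVMlight's default C: 1 / mean(x_i . x_i) over the training inputs (assumed rule)."""
    sq_norms = [sum(v * v for v in x) for x in inputs]
    return len(sq_norms) / sum(sq_norms)

def norm(x):
    """Euclidean norm ||x|| = sqrt(x . x)."""
    return math.sqrt(sum(v * v for v in x))
```

Because the pre-processed variables are all on roughly the same scale, this default places C in a sensible range before the explicit search over C begins.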
The most promising RBF kernel was taken forward, and then ε was explored. C was still set at the default
value.
When a value for ε was set, the best model over a range of values for C was determined. Multiple sampling
and training runs were done, with 50 trials at each point, using 800 randomly selected cases for training and
400 for testing.
Discrimination, measured by the area under the ROC curve on the test set, and
calibration, measured by the H-L C statistic on the test set, were used to determine which parameters
were taken forward.
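The resampling protocol can be sketched as follows; `train_fn` stands in for an external SVM training run (the thesis used SVMlight) and `metric_fn` for a test-set score such as the ROC curve area, so both are hypothetical placeholders:

```python
import random

def average_over_trials(dataset, train_fn, metric_fn, n_trials=50,
                        n_train=800, n_test=400, seed=0):
    """Average a test-set metric over repeated random train/test splits.
    train_fn(train_cases) returns a fitted model; metric_fn(model, test_cases)
    scores it on the held-out cases."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_trials):
        sample = rng.sample(dataset, n_train + n_test)
        train, test = sample[:n_train], sample[n_train:]
        model = train_fn(train)
        scores.append(metric_fn(model, test))
    return sum(scores) / len(scores)
```

Averaging over 50 random splits damps the sampling variation that a single unlucky split would otherwise introduce into the parameter comparison.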
Figure 7.7 shows the relationship between RBF γ in the range 0 - 2 and the average test set area under the
ROC curve. The best area under the ROC curve was 0.84 when RBF γ = 0.03. As with the surface plot of
the area under the ROC curves in Figure 7.7, irregularity and variation in the plot contour is due to the
random sampling of training and test cases.
Figure 7.7 Area under the ROC curve for SVM RBF kernel γ in the range 0 - 1
average of 50 trials at each point, 800 cases training set, 400 cases test set
ε = 0.01, C = default
Figure 7.8 shows the response of model calibration to changes in RBF γ, with ε = 0.01 and SVM parameter
C set at the default value. The best average H-L C statistic (24.2) was also achieved at RBF γ = 0.03. Note
that at larger γ values > 0.8, the H-L C statistic again begins to decrease, but in this range the SVMs have
poor discrimination on the test set (ROC < 0.7), which precludes their use.
Figure 7.8 H-L C statistic for SVM RBF kernel γ (ε = 0.01, C = default)
This reinforces the importance of examining both discrimination and calibration. This graph demonstrates
that the model calibration response has more than one minimum of the H-L C statistic with adequate
calibration. However, one of these occurs with models that have unacceptably low discrimination.
The RBF parameter γ was fixed at 0.03, and the areas under the ROC curve for ε between 0 and 2, with C
set at the default value, were then explored. Figure 7.9 shows a part of that trial, with values of ε in the range
0.1 - 0.5, demonstrating that the response of the area under the ROC curve is relatively flat between 0.1 and
0.4, dropping off sharply at around ε = 0.48. For ε > 0.5, model predictions were no better than random,
and some models gave the same estimate to all data points.
Figure 7.9 Area under the ROC curve for SVM RBF kernel for ε in the range 0 - 0.5 (γ = 0.03, C = default)
Figure 7.10 shows the H-L C statistic across a range of ε, with γ = 0.03 and SVM parameter C at the default
value. The best calibration was at ε = 0.057, where the H-L C statistic was 17.0. Inspection of the model
outputs at ε < 0.03 revealed a tendency toward separation of the output values, which clustered around 0 or
around 1. With ε < 0.03, Figure 7.9 shows that discrimination was retained, but Figure 7.10 demonstrates
that model outputs did not accurately reflect the risk of death.
The SVM parameter ε was set at 0.057, the RBF γ set at 0.03, and a range of C was trialled. Figure 7.11
shows the response of the area under the ROC curve to changes in C in the range 0.1 - 2. The plot response
was fairly flat. There is some variation in the area under the ROC curve, which is due to the random choice of
training and test cases.
Figure 7.11 ROC curve area for SVM RBF kernel with range of C 0.1 - 2
average of 50 trials at each point, 800 cases training set, 400 cases test set
γ = 0.03, ε = 0.057
Figure 7.12 plots the average H-L C statistic against a range of SVM parameter C, with RBF kernel, γ =
0.03 and ε = 0.057. The best H-L C values (< 20) appear at SVM parameter C in the range 0.5 - 1.
These values of C gave an average area under the ROC curve ranging between 0.826 - 0.836, and average
H-L C values were 16.7 - 20, with the exception of C = 0.9, where the average H-L C statistic was 715.
Inspection of individual model test set performances in the range of parameter C 0.8 - 1.5 revealed that
many, but not all, SVMs performed quite well. However, some combinations of the test data, the model and
the validation data had H-L C statistics that were high and increased the average value.
Figure 7.12 H-L C statistic for SVM RBF kernel with range of C 0.1 - 2
average of 50 trials at each point, 800 cases training set, 400 cases test set
γ = 0.03, ε = 0.057
The polynomial kernel function that was used to implement the dot product of the mapping functions in
Equation 6.3 was

K(x_i, x_j) = (x_i · x_j)^d

where d, the degree of the polynomial, was varied.
Trials of the polynomial kernels were conducted on polynomial functions with d = 1 - 4. As with the
experiment in the previous chapter, the SVMlight algorithm failed to converge to a solution for the 4th order
polynomial kernel.
Figure 7.13 shows the average areas under the ROC curves for the polynomial functions with ε in the range 0 -
0.6 and C at the default value. The polynomial kernels (d = 1 and d = 2) displayed adequate discrimination
in the range ε < 0.4. The best SVM discrimination was with the polynomial kernel d = 1 (ROC curve area
0.88) at ε = 0.003.
Figure 7.13 Area under the ROC curve for SVM polynomial kernels d = 1 - 3 for ε 0 - 0.6
average of 50 trials at each point, 800 cases training set, 400 cases test set
C = default
Figure 7.14 shows a plot of the average H-L C statistic over a range of ε 0 - 0.5, with the SVM parameter C
at the default value. The calibration was best at polynomial kernel d = 2 with ε about 0.1 - 0.2, giving the
lowest H-L C statistics.
Figure 7.14 H-L C statistic for SVM polynomial kernels d = 1 - 3 over a range of ε (C = default)
To optimise C, the SVM polynomial (d = 2) kernel and ε = 0.2 were set. The range C 0 - 2 was studied.
The best performance was at C = 0. A portion of the chart is shown in Figure 7.15. The best area under the
ROC curve was 0.841 at C = 0, with decreasing area under the ROC curve as the SVM parameter C was
increased.
Figure 7.16 shows the calibration of the 2nd order polynomial kernel SVM over a range of the SVM parameter C.
[Figure 7.16: H-L C statistic (log scale) against SVM parameter C 0.0005 - 0.002 for the 2nd order polynomial kernel SVM.]
No satisfactory combination of parameters for the SVM polynomial kernel was found to be close to both the proposed discrimination and calibration standards.
The best discrimination found with the RBF kernel was at γ = 0.03, ε = 0.057, C = 0.6, which displayed an
average area under the ROC curve of 0.83 and an average H-L C of 16.7. In comparison, the best
polynomial kernel was a 2nd order polynomial kernel, ε = 0.2, C = 0, which had an area under the ROC
curve of 0.84 and an average H-L C statistic of 51. The discrimination of both the RBF and the polynomial
kernel SVMs was adequate. The average calibration of the RBF kernel SVM was better, and its parameters were therefore carried forward for further investigation.
These results are in contrast to those of the previous chapter, where the linear polynomial kernel had the
best discrimination on the regression task (0.86 - 0.87), while the discrimination of the RBF kernel SVM
was less than 0.80 at all parameter values examined. The better performance with the RBF kernel could be
because of the differences between the variables used for modelling in the two experiments. In the
regression experiment of Chapter 6, four heavily pre-processed variables were used. In the regression
experiment in this chapter, the non-binary patient observations, measurements and demographic
information were pre-processed so that a central value of zero was associated with a minimum or maximum
risk of death for each non-binary variable. This transformation may be better suited to the RBF kernel.
Alternatively, the investigation of changes in the parameter C and finer tuning of model parameters may also have contributed to the better performance.
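The centring transformation described above can be sketched as follows. The reference and spread values here are assumed for illustration only and are not the constants used in the thesis.

```python
def centre_and_scale(value, reference, spread):
    # Map a raw measurement so that the lowest- (or highest-) risk value
    # becomes the central value of zero, as described in the text.
    return (value - reference) / spread

# Hypothetical example: a heart rate of 120 bpm against an ASSUMED
# lowest-risk reference of 75 bpm and an assumed spread of 40 bpm.
hr_feature = centre_and_scale(120, reference=75, spread=40)
```

A value at the reference maps to zero, and increasingly abnormal values move away from zero in either direction.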
The approximation of SVM parameters provided a guide to the values of SVM parameters to be
investigated more intensively. The RBF kernel SVM was the most promising, and further investigation of the
best parameters and performance was undertaken close to the approximate values found in the previous
section. The parameter ranges chosen for more detailed study were: γ: 0.01 - 0.1, ε: 0 - 0.2 and C: 0.4 -
4.0.
Within these ranges, the RBF kernel parameter γ and the SVM parameters ε and C were varied, and the
average H-L C statistic and the area under the ROC curve were calculated for 50 trials at each value. Each
trial used a training set of 800 randomly chosen patients and the model performance was assessed on a
validation set of 400 randomly chosen patients. Three dimensional plots of the response of the average H-L
C statistics and areas under the ROC curves were produced (Figure 7.17 and Figure 7.18).
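The resampling design of this experiment can be sketched as follows. The `evaluate` function is a placeholder for the SVMlight training and assessment step; it simply simulates a performance pair so that the loop is runnable, and is not the thesis's actual model-fitting code.

```python
import random
import statistics

def evaluate(train, test, gamma, eps, C):
    # Placeholder for training an RBF SVM on `train` and scoring it on `test`
    # (the thesis used SVMlight here); returns a simulated
    # (ROC curve area, H-L C statistic) pair so the loop below is runnable.
    rng = random.Random(hash((gamma, eps, C, tuple(train[:5]))))
    return rng.uniform(0.78, 0.84), rng.uniform(10, 60)

def average_performance(cases, gamma, eps, C, trials=50, n_train=800, n_test=400):
    # Repeated random subsampling as in Figures 7.17 and 7.18: draw a fresh
    # 800-case training set and 400-case test set for each trial, then average
    # the two performance measures over all trials.
    rocs, hls = [], []
    rng = random.Random(0)
    for _ in range(trials):
        sample = rng.sample(cases, n_train + n_test)
        train, test = sample[:n_train], sample[n_train:]
        roc, hl = evaluate(train, test, gamma, eps, C)
        rocs.append(roc)
        hls.append(hl)
    return statistics.mean(rocs), statistics.mean(hls)
```

Repeating this at each (γ, ε, C) grid point yields the two response surfaces plotted in Figures 7.17 and 7.18.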
Figure 7.17: Surface Plot of Average H-L C Statistic for RBF SVM, γ 0.01 - 0.10
Figure 7.18: Surface Plot of Average Area under the ROC Curve for RBF SVM, γ 0.01 - 0.10
The surface plot tones correspond to the average area under the ROC curve.
Dark blue: > 0.82
Light blue, green, yellow: 0.80 - 0.82
Red: < 0.80
Figure 7.17 is a plot of the average H-L C statistic for 50 RBF kernel SVMs over a range of the SVM
parameters ε and C. The areas of reasonable average calibration (H-L C < 20) are shown in regions of
lighter blue, yellow and green. The best calibration is coloured dark blue (H-L C < 15.5) and this is the area
where the RBF SVM meets the proposed calibration standards on the test set.
The surface plots are dramatic. Where γ = 0.03, there is a trough of parameter settings (RBF γ = 0.03 with
C 0.6 - 1.5 and ε 0.04 - 0.08) where the H-L C was 15.5 - 20, and only one value (RBF γ = 0.03 with C
0.6 and ε 0.05; H-L C = 15.4) where the H-L C was < 15.5. This was the approximate area suggested by
the initial exploration of SVM parameters. This optimum parameter choice, and the near optimum areas, are at
the foot of a slope of falling H-L C (improving calibration) as the value of ε is reduced. This area is near a steep deterioration in calibration as ε is reduced further.
Where γ = 0.04, there is a broader and slightly deeper trough in the surface plot of the H-L C statistic. The
better performances are found at RBF γ = 0.04 with C 0.7 - 3 and ε 0.02 - 0.04, where most of the H-L C
values are in the range of 15.5 - 20. The best model calibrations are found at ε = 0.03 with
parameter C = 1.2 (average H-L C 15.2), 1.3 (14.8), 1.7 (15.2), 1.9 (15.2), 2.1 (13.3)
and 3.5 (15.2). Again, this trough in the surface is at the bottom of a slope of falling H-L statistics as
ε is reduced. This zone is again the foot of a steep deterioration in calibration if ε is reduced below 0.02.
Where γ = 0.05, there is also an area of reasonable calibration, with the H-L C statistic 15.5 - 20, at RBF γ =
0.05 with C 0.8 - 4 and ε 0.01 - 0.04. There are, however, no values in this plot where the H-L C statistic
is less than 15.5. This area is a trough at the lower end of a slope of decreasing H-L C values with
decreasing ε.
To decide on the optimum choice of parameters for the RBF SVM, comparison must be made with the
surface plot of the average area under the ROC curve. Figure 7.18 shows the areas under the ROC curves
for RBF kernel SVMs. Each plot is at a single value of the RBF kernel parameter γ 0.01 - 0.10 and each
value is the mean of the area under the ROC curve of 50 models over a range of the SVM parameters ε and C.
The colours of the surface plots correspond to the average of the areas under the ROC curves, with dark
blue representing the most desirable average area under the ROC curve of > 0.82. Light blue, green and
yellow are adequate areas under the ROC curve of 0.80 - 0.82. Red colours the areas under the ROC curve of less than 0.80.
The plots in Figure 7.18 have irregular surface variations with local minima, due to the effects of random
sampling with different model training and testing datasets. Overall, these 3 dimensional plots show that
the discrimination is generally better for SVMs with γ between 0.01 and 0.04 than for γ > 0.04, and that
discrimination is generally better for smaller values of parameter C for any choice of γ and ε. This is in
contrast to the important relationships in the calibration surface plots in Figure 7.17, where the changes in ε and γ had the dominant effect.
In Figure 7.18, the response is relatively flat for the surface plots at each RBF γ. In general, smaller
values of C and smaller values of ε give the best discrimination at any RBF kernel γ value. An SVM with an
RBF kernel with γ in the range 0.01 - 0.04 will provide adequate discrimination, with area under the ROC curve >
0.80, across the range of ε and C. The best areas under the ROC curve, greater than 0.83, were found in the
following zones.
The largest average area under the ROC curve was 0.833, found at RBF γ = 0.02 with C = 0.6, ε = 0
and at C = 0.5, ε = 0.
Therefore, from the comparison of the 3 dimensional plots, the parameter settings that are most likely to
give adequate models on this dataset are RBF γ = 0.03 with C = 0.6 and ε = 0.05, and RBF γ = 0.04 with
C 1.0 - 1.9 and ε = 0.03. At these parameter settings, I expect that for samples of 800 cases, an RBF SVM will
be trained that, when tested on a random sample of 400 cases, will have an area under the ROC curve of 0.82
- 0.83 and an H-L C statistic of < 15.5. A model with this performance would be suitable for use as a risk adjustment tool.
The calibration response was particularly sensitive to changes in ε and γ. In contrast, most parameter
choices gave acceptable discrimination, but generally, smaller values of C gave slightly better discrimination.
The initial heuristic was able to provide a good estimate of the optimal parameter choices for the RBF
SVM and allowed the more intensive investigation of SVM parameters to be limited to an area with
acceptable discrimination, initially in the parameter ranges γ = 0.01 - 0.10, ε 0 - 0.2 and C 0 - 2. The range
of C that was investigated was extended, as the area of calibration that was almost acceptable (H-L C 15.5
- 20) extended beyond C = 2. At C = 4 the discrimination was beginning to drop off below an area under the ROC curve of 0.80.
The experiment in Figures 7.17 and 7.18 required 154,000 RBF SVMs to be trained and analysed. Each
model, including sample selection, model building and analysis, took about 5 seconds to complete, and so
the experiment took 9 days to run on a single 2.7 GHz processor. If the intensive study had been conducted
with the same detail over the original ranges of the parameter estimates (γ 0.01 - 2, ε 0 - 2 and C 0 - 4), the computing time required would have been many times longer.
The shape of the surface plots of the area under the ROC curve and the H-L C statistic in response to
changes in the SVM parameters allowed the estimation of approximate SVM parameters to be reasonably
accurate. This estimation involved a single parameter being studied with the other parameters fixed. This
approach works in this application because of the shape of the error surfaces and because two model performance attributes were used together.
Using only a single model performance attribute, for example the H-L C statistic, would potentially lead to
errors. The presence of a convex response, seen in Figure 7.8 where the range of RBF γ was studied, may
have led to the choice of an inappropriately large value for γ if the area under the ROC curve had not also
been used. At low RBF γ values the areas under the ROC curve were > 0.82. At large RBF γ, the
discrimination was no better than random, yet the H-L C statistic suggested adequate calibration of the
model.
7.6 Discussion
These experiments demonstrated that the probability of ICU patient death in-hospital within 30 days of
admission can be modelled with a SVM, using only the patient data available within the first 24 hours of
ICU admission. The approach to optimisation of the SVM, using the discrimination and calibration to guide
the parameter choice, was successful in identifying parameters to build models of adequate performance.
The approximation of SVM parameters using an initial search heuristic provided an estimate of the best
RBF parameters. This strategy reduced the computing time that would have been required. It is not certain
that this approach will work in all applications of regression SVM. However, in this task, to estimate the
probability of ICU patient death, the simple exploration of parameter space efficiently localised an
approximate area of optimal parameter choice. This was possible because of the shape of the surfaces of the model performance measures.
The results of the SVM models built in this experiment are equivalent to previous studies with ANN and
logistic regression to estimate the probability of patient death. The calibration and discrimination of the
RBF SVM on the test data is similar to the results of models described by Clermont et al., using logistic
regression and ANNs on a similar set of ICU patient data. Modelling data from 800 cases, Clermont
described areas under the ROC curves for the logistic regression (0.829) and ANN (0.810), which are
comparable to that found with the RBF SVM (0.82 - 0.83). The SVM assessment was the average of 50
trials of randomly selected sample sets, in contrast to the single, non-random sample of Clermont et al.
Therefore, the RBF SVM discrimination performance is at least equivalent to their study. As 50 trials were
conducted on random samples, the choice of RBF SVM parameters is very likely to consistently produce
SVM models that have adequate performance on the PAH ICU dataset.
However, the APACHE III estimates of probability of death still offer superior discrimination, with area
under the ROC curve of 0.89 on the PAH ICU dataset. The APACHE III model was developed on 17 440
patient admissions. It is possible that with such a large dataset, similar levels of discrimination could be
achieved with SVM modelling. However, very large datasets are impractical for local, single institution
model development.
The calibration of the best RBF SVM was good, with an average H-L C of less than 15.5. Caution must be
taken with any comparisons using the H-L statistic, particularly as the mortality endpoints differ and the
size of the validation set of Clermont et al. is 447 cases against 400 cases in the present SVM experiment.
However, the SVMs probably have superior calibration to the logistic regression model of Clermont et al.
(H-L C = 45.9) developed with data from 800 patients. The best SVM is, on average, at least equivalent to the models of Clermont et al.
These results indicate that using RBF SVM to model patient data offers a practical and reproducible
alternative for modelling the probability of ICU patient death within 30 days of admission.
However, the SVM model using the component patient data in this experiment performed less well than
models using pre-processed data such as the APACHE III scores of the previous chapter. The area under
the ROC curve of the APACHE III score alone on the PAH ICU dataset is 0.837, yet the largest ROC curve
area found with the RBF SVM was 0.833. In the previous chapter, using the variables of APACHE III
score, Age, Chronic Health Score and the Diagnostic Code, the average area under the ROC curve for both
the polynomial SVM and the multi-layer perceptron ANN was 0.86. Similarly, using the same variables on
the same dataset, the logistic regression models developed by Graham and Cook also displayed ROC
curve areas on the validation set of 0.86. The models of Clermont et al. from a similar ICU incorporating
the APACHE III score had ROC curve areas up to 0.84. It may be, therefore, that future improved SVM
models should include the APACHE III score, even though this means that such models are still reliant on the APACHE III system.
There must be an upper limit to the performance of a model that estimates the probability of death using
information from the first 24 hours in ICU. Much potentially important explanatory information is not
available. Examples relevant to this dataset would be details of the success or otherwise of surgery.
Important episodes that occurred in the lead time prior to ICU admission are not captured, nor are the
events, complications and progress after the first day. There will always be uncertainty in models using
patient data from the first day. It is this uncertainty, representing random events and effects but also
embracing the quality of patient care, that is the reason for the monitoring of risk adjusted mortality rates.
The choice of kernel and parameters is a problem to be solved for each set of data. In the SVM, the
kernel function and the parameters ε and C are chosen or varied to provide optimal regression
performance. For regression, the only reliable method that exists for all kernels and datasets is the use of a
cross validation set, as used in these chapters. In this application, optimisation was guided by multiple trials
of samples of training and testing sets. By using re-sampling and many training examples, SVM parameter
values were identified that are most likely to work well in any sub-sample of the patient data. The average
measurement over 50 samples at each data point demonstrated the reproducibility of the model performances
at those parameter settings.
The choice of ε, by setting the regression tube width, affects the approximation error, the number of support
vectors, the training time and the complexity of the solution. A training example will be a support vector
only if its approximation error is larger than ε. A large ε will give few support vectors and rapid training, but
poor accuracy, as the regression tube is very wide. A small ε defines a narrow regression tube, with a large
number of support vectors and a complex model with a long training time. Mattera and Haykin suggested
a robust approximation method for ε: choosing a value so that approximately 50% of the examples in the
training set are support vectors. This is a compromise between complexity and maintaining a small
approximation error. An alternative suggestion by Weston et al., to set ε = 0 and rely on adjustments to
C to control the generalisation performance, takes no account of training time or complexity considerations.
For the PAH ICU patient data, the noise and the desired accuracy were not known. The values of ε were
chosen experimentally to provide the best discrimination and calibration. The number of support vectors in
the trained models, where the training set is 800 randomly selected cases, was on average 500, or 62.5%.
The use of the ROC curve area and H-L C gave a similar number of support vectors to the approximation method of Mattera and Haykin.
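The relationship between ε and the number of support vectors, and a Mattera and Haykin style choice of ε, can be sketched as follows. The residuals here are illustrative values, not output from the thesis models.

```python
def support_vector_fraction(residuals, eps):
    # In epsilon-insensitive regression a training example is a support vector
    # only if its approximation error exceeds eps, so larger eps -> fewer SVs.
    return sum(1 for r in residuals if abs(r) > eps) / len(residuals)

def mattera_haykin_eps(residuals):
    # Choose eps so that roughly 50% of the training examples become support
    # vectors (the compromise suggested by Mattera and Haykin); the median
    # absolute residual is one simple way to achieve this.
    ordered = sorted(abs(r) for r in residuals)
    return ordered[len(ordered) // 2]
```

Widening the tube (larger ε) shrinks the support vector fraction, while the median-residual choice lands it near one half.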
The choice of C is also determined by experimentation. A guideline proposed for its choice in the RBF
kernel SVM by Mattera and Haykin is that C should be of a similar size to the expected outputs of the
SVM, which are estimates of probability and thus lie between 0 and 1. This is only a guide, and further
work on optimisation is necessary. In this application, with the RBF γ set to 0.03 and ε set at 0.05 - 0.06, the
best values for C were experimentally determined to be between 0.5 and 1.5. The value of 0.6 was chosen
by the parameter estimation, and is consistent with the estimate suggested by Mattera and Haykin.
In this light, the proposals of Mattera and Haykin for estimation of C and ε can be used in conjunction with
the parameter estimation method used in this chapter, as a means of localising the parameter solution and reducing the computation required.
7.7 Conclusion
The probability of ICU patient death in-hospital within 30 days of admission can be modelled with a SVM,
using only the patient data available within the first 24 hours of ICU admission. The approach to
optimisation of the SVM using the discrimination and calibration statistics to guide the parameter choice is
successful in identifying parameters to build models of adequate performance to use as risk adjustment
tools.
These models can be trained and cross-validated on the number of cases seen in one year at the PAH ICU.
Though the performance is less than that of the APACHE III models during 1995 - 1997, the SVM models
have the advantage of using a 30 day in-hospital mortality endpoint, having the flexibility to be remodelled
when the model no longer fits, and not incurring an annual licence fee.
Chapter 8
8.1 Summary
The aim of this work was to develop risk adjusted control chart methods for monitoring in-hospital
mortality outcomes in Intensive Care. It is a medical application of statistics and machine learning to the measurement of outcomes for quality management.
The methods of assessment of models that estimate the probability of death for a patient in the intensive
care unit (ICU) were reviewed. From this review, the important attributes of discrimination and calibration
were identified as the key measures of model performance. The performance of a model may deteriorate
when it is applied to patient data which are not part of the database from which the model was developed.
Therefore, a model such as APACHE III, which was developed on a large North American ICU population,
must be validated in the Australian ICU setting before any conclusions can be drawn about the reliability of its predictions.
The area under the receiver operating characteristic (ROC) curve is the most useful approach to assessment
of model discrimination. From a review of the assessment of models for ICU patient mortality, a reasonable
expectation of the discrimination of ICU models is an area under the ROC curve greater than 0.80.
Calibration is more difficult to assess. To evaluate calibration, a calibration curve showing observed and
expected mortality in ranges of risk, and a statistical evaluation of calibration such as the Hosmer-Lemeshow (H-L) C statistic, are required.
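The two performance measures can be sketched in Python as follows. This is a minimal illustration, using the tie-corrected rank interpretation of the ROC curve area and equal-sized risk groups for the H-L C statistic, rather than the exact grouping used in the thesis software.

```python
def roc_area(preds, outcomes):
    # Area under the ROC curve: the probability that a randomly chosen
    # non-survivor (y=1) receives a higher predicted risk than a randomly
    # chosen survivor (y=0), counting ties as one half.
    pos = [p for p, y in zip(preds, outcomes) if y == 1]
    neg = [p for p, y in zip(preds, outcomes) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def hosmer_lemeshow_c(preds, outcomes, groups=10):
    # H-L C statistic: sort by predicted risk, split into equal-sized groups
    # (deciles of risk), and compare observed with expected deaths.
    # Degenerate groups (mean risk of 0 or 1) are not handled in this sketch.
    pairs = sorted(zip(preds, outcomes))
    n_total, c = len(pairs), 0.0
    for k in range(groups):
        chunk = pairs[k * n_total // groups:(k + 1) * n_total // groups]
        n = len(chunk)
        observed = sum(y for _, y in chunk)
        expected = sum(p for p, _ in chunk)
        p_bar = expected / n
        c += (observed - expected) ** 2 / (n * p_bar * (1 - p_bar))
    return c
```

An area near 1.0 indicates near-perfect separation of survivors and non-survivors; a small H-L C indicates close agreement between predicted and observed mortality.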
These considerations were the basis for the evaluation of the APACHE III models at the Princess
Alexandra Hospital ICU for the period 1995 - 1997. The models provided by the APACHE III system to
predict the risk of death of an ICU patient in-ICU and in-hospital performed very well. For all models, the
discrimination was excellent, with the area under the ROC curve being 0.90 - 0.92. The calibrations of the
in-ICU mortality models, and the in-hospital mortality model with proprietary adjustments for hospital
characteristics were very good. The APACHE III model with adjustments for hospital characteristics was therefore selected as the risk adjustment tool.
The initial control chart application was to in-hospital ICU patient mortality data without risk adjustment
(RA). The p charts, cumulative sum (CUSUM) and exponentially weighted moving average (EWMA)
charts were used to analyse the in-hospital mortality rate. The years 1995 - 1997 were used as an in-control
period to establish the control mortality rate, and charting was carried on through 1998 - 1999.
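The EWMA chart for a mortality rate can be sketched as follows. The smoothing constant and standard deviation used here are illustrative, not the values chosen in the thesis.

```python
import math

def ewma(rates, lam, target):
    # EWMA statistic z_t = lam * x_t + (1 - lam) * z_{t-1}, started at the
    # in-control mortality rate (the target).
    z, path = target, []
    for x in rates:
        z = lam * x + (1 - lam) * z
        path.append(z)
    return path

def ewma_limits(target, sigma, lam, width=3.0):
    # Asymptotic control limits: target +/- width * sigma * sqrt(lam / (2 - lam)).
    half = width * sigma * math.sqrt(lam / (2.0 - lam))
    return target - half, target + half
```

A sustained shift of the observed mortality rate away from the in-control rate pulls the EWMA statistic toward the new level until it crosses a control limit.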
There was a significant fall in the mortality rate from the in-control rate of 0.16 to approximately 0.13. A
post hoc analysis suggested that a change in the overall severity of patient illness probably did not occur.
However, a change in casemix was demonstrated. There was an increase in elective surgery. This type of
patient generally has a low risk of in-hospital death, and the change contributed to the decrease in in-hospital mortality.
The second control chart application developed the techniques of control chart analysis of in-hospital ICU
patient mortality data with RA. The APACHE III model with proprietary adjustments for hospital
characteristics provided accurate estimates of in-hospital death, and was incorporated into the p chart,
CUSUM and EWMA charts. The use of a RA model, such as APACHE III, improved the information
provided by the control charts. After adjusting for the severity of illness and casemix of patients, these
analyses demonstrated that patient survival had improved between 1995 and 1999. During the first 1 - 2
years, there were signals that the risk of patient death was higher than that predicted by the APACHE III
model. During the last year of the analysis, the observed patient mortality was lower than that predicted by
the APACHE III model. Possible explanations are that the RA monitoring has documented improvements in the quality of care, or that the model no longer fits the current patient population.
If the observed mortality rate is consistently different to that predicted by the RA model, it means that the
RA model no longer provides an accurate assessment of the patient risk of death. In this application, the
APACHE III model was consistently overestimating the probability of patient death by the end of 1999.
Therefore, I developed alternative tools to the APACHE III model to estimate the probability of patient
death at the PAH ICU. Machine learning techniques, such as support vector machines (SVMs) and
artificial neural networks (ANNs), were developed on the PAH ICU database, and thus provide a new RA
tool.
Several shortcomings of the APACHE III system were addressed in the new model development. An
endpoint of 30 day mortality was chosen, instead of the in-hospital mortality used by APACHE III. Mortality
data could be analysed just 30 days after the episodes of patient care commenced, rather than waiting for all patients to be discharged from hospital.
The APACHE III algorithm functions as a "black box" without model updates. In contrast, the machine
learning models are intended to be updated when the model fit deteriorates. To this end, models were
successfully and reliably developed on a training set of 800 cases and a test set of 400 cases. The data for
model training and cross validation can be collected over approximately one year. This means that it is
practical for a single ICU to develop and maintain RA models to continuously monitor their clinical
outcomes.
The APACHE III software and model commands an annual licence fee, whereas a locally developed RA model does not.
In Chapter 6, the preliminary ANN and SVM models were developed using the APACHE III based
variables of acute physiological score, modified disease category and chronic health score, and the patient
age. The models demonstrated good discrimination, but very poor calibration. Subsequently, a multilayer
perceptron ANN and a linear kernel SVM were both recalibrated with a general regression neural network.
The performances of these models were equivalent to previous ANN and logistic regression models on a
similar ICU application, and to logistic regression on the PAH ICU database.
A further experiment was then conducted to model 30 day in-hospital mortality. The component variables
that describe patient physiology, demographics, laboratory results and diagnosis were used, rather than the
APACHE III variables. Standardisation and scaling of the raw patient data were performed using successful
data transformation strategies, after considering other machine learning applications in the ICU. A radial
basis function (RBF) SVM was used to estimate the probability of 30-day in-hospital mortality. The SVM
models were developed, cross-validated and compared according to the average area under the ROC curve and the average H-L C statistic.
A simple and efficient heuristic method to search for optimal SVM parameters was used. In this way, the
optimal parameter choice was localised in parameter space, and this region was intensively explored for the best performing
SVM models. The average performance of the RBF SVM on random samples of the dataset was adequate
with area under the ROC curve of greater than 0.83 and an H-L C statistic of less than 15.5 on 400 patient
test sets.
This thesis brings together several streams of research. It arose from clinical medicine, patient care and the
need to measure the quality of clinical care. To accomplish this, several novel contributions were made.
The most important contribution of this work is the application and development of RA control charts to
monitor the ICU mortality rate. This paradigm of incorporating adjustments for casemix and severity of illness
has not been applied to the ICU before. It is an idea that has wide application to monitor the quality of the
care that we offer in many areas of the health care service. The prerequisites are the ability to accurately
measure patient outcomes, and to collect patient data to permit modelling estimates of the probability of
these outcomes.
i. The assessment of the APACHE III models at the PAH was the first assessment of the Australian
experience with APACHE III, and used the largest single institution series of patients outside of North America.
ii. The use of control charts to monitor in-hospital mortality of ICU patients is a logical extension of
industrial statistics, but has not been carried out elsewhere. This is the first use of the APACHE III
score as a validated RA tool to continuously monitor ICU outcome. The design and analysis of
each of the techniques I developed is described in the Appendices. The technique allows the charts to be designed and applied in other clinical settings.
iii. The RA control chart work is original in its application and involves modifications of previous methods.
The RA p chart was developed from the method of Alemi et al. and incorporates a method for
analysing the power of each sample, adapted from Flora. The Z score p chart relies on previously published work.
The use of the iterative method to characterise the distribution used for constructing p charts is
presented in Chapter 5.
The RA CUSUM is based on the work of Lovegrove et al. and the moving frame approach is
adapted from the work on charting cardiac surgery by Poloniecki et al. The use of the RA
CUSUM of Steiner and co-workers is an original application in an ICU setting, though the
method, apart from the use of APACHE III, has not been modified.
Both the RA EWMA charts, with the parametric approximation and the discrete approximations, are original contributions.
iv. The 30 day in-hospital mortality endpoint that was used in the machine learning is an original
contribution, based on the specific requirements of RA control chart analysis of patient mortality.
v. This is the first application of SVM to estimate the probability of ICU patient death.
vi. A model that estimates the probability of patient death has attributes of discrimination and
calibration. These two performance measures were used to guide model development and SVM
parameter selection.
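The RA CUSUM of Steiner and co-workers referred to above can be sketched as follows. The control limit h and the chart design step are omitted; the odds ratio and risks in the usage example are illustrative.

```python
import math

def ra_cusum(risks, outcomes, odds_ratio):
    # Risk adjusted CUSUM in the style of Steiner and co-workers: for each
    # patient, add the log likelihood ratio comparing mortality odds inflated
    # by `odds_ratio` against the model's prediction, resetting at zero.
    s, path = 0.0, []
    for p, died in zip(risks, outcomes):
        denom = 1.0 - p + odds_ratio * p
        w = math.log(odds_ratio / denom) if died else math.log(1.0 / denom)
        s = max(0.0, s + w)
        path.append(s)
    return path
```

A run of deaths among low-risk patients drives the statistic upward; in practice a signal is declared when the path crosses a control limit h chosen from the desired average run length.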
During the course of this project, several areas were identified where further research is required.
Further study of the assessment of the performance of models that predict the probability of death of ICU
patients is required. At present, there is a good understanding that model reliability is reduced when the
model is applied to ICU contexts different from those where it was developed. Issues of data
collection and rule interpretation, patient variables, clinical practice, admission and discharge practices,
case mix, lead-time, mortality rate and type of hospital have all been shown to affect the ability of models
to accurately estimate the probability of death of ICU patients. More work is required to understand the influence of these factors on model performance.
The assessment of model calibration proved a difficult part of this project. The H-L C statistic was
determined to be the best technique available at the time at which this study was undertaken. Further work
may lead to an alternative technique to measure the calibration and reliability of probability estimates.
The RA charts described in this thesis are initial steps in what could be a much larger undertaking.
Potentially fruitful areas of research exist in additional analysis of the behaviour of the control charts under
a larger range of plausible and important clinical situations. This thesis was limited to analysing the effects
of increases and decreases in mortality rates, and increases and decreases in the odds ratio. From these
scenarios, control chart parameters and control limits were chosen, and charts were constructed. Many
other combinations of changed casemix, early discharge, and simulations of aspects of poor or improved care could be investigated.
The use of the central limit theorem to permit estimation of the expected distribution of mortality rates and
the EWMA statistics involves approximations. The errors are probably small compared to the imprecision
of the currently available RA tools for ICU. However, further work can be done using alternative methods
to characterise the distributions. Iterative methods, exact to the limits of the model predictions, are
presented in this work. Further research can be done to more efficiently calculate the distributions.
Examples are Monte Carlo simulations, or possibly analytical expressions for determining the probability
density function of the mortality rates, based on the individual patients' predicted risks of death.
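A Monte Carlo approach of the kind suggested here can be sketched as follows; the predicted risks in the usage example are illustrative.

```python
import random

def simulate_rate_distribution(risks, n_sims=5000, seed=1):
    # Monte Carlo approximation to the distribution of the observed mortality
    # rate implied by each patient's predicted risk of death: each simulation
    # draws one Bernoulli outcome per patient (a Poisson binomial overall).
    rng = random.Random(seed)
    rates = []
    for _ in range(n_sims):
        deaths = sum(1 for p in risks if rng.random() < p)
        rates.append(deaths / len(risks))
    return rates
```

Percentiles of the simulated rates could then serve as risk adjusted control limits for the observed mortality rate.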
The latter part of this thesis described machine learning applications to estimate the probability of death of
ICU patients. SVMs have, to my knowledge, only been employed once previously in the ICU setting,
and only then to predict physicians' clinical actions, rather than to estimate the probability of death. There is
considerable opportunity to investigate other applications in the ICU and to further optimise the models for this task.
The choice of kernels and SVM tuning parameters provides an area for ongoing work. The approach
presented here is practical, as it uses the desirable performance characteristics of the potential RA tool to
guide optimisation of the SVM parameters. A simple search heuristic to identify the likely area to search
more intensively may not work in all applications. It is not a replacement for thorough parameter evaluation
and testing of model performance by cross validation. Genetic algorithms, for example, may offer a fruitful approach to parameter optimisation.
In this application, the polynomial and RBF kernels were investigated, but there are many other kernels
which can be applied and optimised on the data. SVMlight provides the opportunity for future investigation
of linear kernels, tanh sigmoid kernels, a range of polynomial functions and a user defined kernel option.
The recommendation of specific kernels for probability density estimation by Weston and co-workers could also be explored.
The variable selection and processing used for the SVM model can be explored further. Though the models
gave acceptable performance, it is possible that more intensive evaluation may provide improvements.
All of the raw data components in the APACHE III algorithm were used for the SVM models in Chapter 7.
Any additional variables from the database that may have had some explanatory power were included. It is
possible that further additional variables could be included for future models. Variables describing organ
failure, additional laboratory results such as potassium, platelet count, lactate, and injury severity score,
therapeutic interventions, gender, marital status and ethnicity may improve prediction of outcomes.
Some of these variables could be improved by expert pre-processing. For example, there is a potential for
inclusion of alveolar-arterial oxygen gradient using the blood gas measurements to better estimate the
ability of the lungs to exchange oxygen. The diagnostic code used in this study could be revised now that
the APACHE III diagnostic weights are all publicly available.
It is not certain whether using all of the patient variables that were available on the database maximised the
performance of the SVM models. Alternatively, the large number of variables could have increased the
complexity of the model and perhaps led to deterioration in model performance. Ideally, feature selection
should order the features by effectiveness, to provide the best combination of features and remove those of
little value.
The data pre-processing was carried out to provide features with a similar range of values. The pre-processing
was based on methods used in other machine learning applications. Where possible, each
feature had an identifiable minimum or maximum in relation to the risk of death associated with that
feature. The SVM was ineffective when used with raw data, and it is not clear which of the pre-processing
manipulations were beneficial, or indeed whether some of the pre-processing may have limited the model's
ultimate performance. Further experiments are necessary to explore the necessity and suitability of the
pre-processing.
In summary, the conclusions of this work are that RA control charting offers an important adjunct to
current methods of assessment of ICU outcome to monitor the quality of care. SVMs provide a practical
approach to model the probability of in-hospital mortality of ICU patients for RA based on patient data
obtained in the first 24 hours. Their development can be effectively guided by optimisation of the attributes
of discrimination and calibration. Models can be reliably built and assessed on 1200 cases, and so provide a
practical option for individual ICUs.
Appendix 1
The dataset for analysis was expanded from that presented in Chapter 3 (1 January 1995 to 31 December 1997)
with additional patients from 1 January 1998 to 31 December 1999. This extended dataset (1 January 1995 -
31 December 1999) is used in Chapters 4 to 7 for control chart analysis and modelling.
Patient eligibility and exclusions and the statistical analysis are the same as Chapter 3. There were 5681
eligible episodes of ICU admission analysed. There were 5278 primary admissions and 403 (7.1 %)
readmissions. The demographic features of the patients are summarised in Table A1.1. The overall
mortality was 515 in-ICU and 779 in-hospital deaths from 5278 patient hospitalisations.
Table A1.1 (extract):
Male: 62.4%
ICU length of stay: mean 2.9 days (sd 5.1), median 1.0 days
Hospital length of stay: mean 27.9 days (sd 46.6), median 15.0 days
The following tables A1.2 and A1.3 present a summary of mortality rate data that were used for the
charting and analysis in Chapter 4. Table A1.2 groups admissions by month and Table A1.3 groups patients
into blocks of 50 consecutive cases.
Appendix 2
Analysis of Mortality Rate Observations PAH ICU: 1995 - 1997, and Estimation of In-Control Parameters
The purpose of Appendix 2 is to present an analysis of the PAH ICU patient data from 1 January 1995 - 31
December 1997, including evaluating whether the process was in-control, and estimating the process
parameters. The analysis was conducted according to the principles detailed by Kennett and Zacks. This
analysis of the patient data examines whether the observed mortality rates are approximately normally
distributed, and whether the process is stationary (i.e. distribution of mortality rates does not change over
time).
A three-year period of observation from 1 January 1995 to 31 December 1997 is covered with this analysis.
This initial collection of data was available at the commencement of this project. The data were analysed in
three groupings: monthly groups, blocks of 50 consecutive cases, and in blocks of 100 consecutive cases.
The term "mortality rate" is used for the proportions of deaths among blocks of consecutive cases, even
though the observation periods vary. An overview of the data is presented in Chapter 2, and Appendix 1
tabulates the monthly observations (Table A1.2) and blocks of 50 cases (Table A1.3) up to the end of 1999.
The analysis was conducted on 3 years of admissions to the PAH ICU. Only the first admission to ICU
during any hospitalisation was considered to avoid double counting. The outcome of interest was survival
to hospital discharge.
There were 3159 eligible admissions during this period. Data were grouped into months of varying sample
size (36 months), samples of 50 consecutive cases (63 blocks) and samples of 100 consecutive cases (31
blocks). The overall mean mortality rate, the mean mortality rate of each grouping, and the range and standard
deviation of the observed mortality rates were calculated. A normal probability plot and the Shapiro-Wilk
W statistic were used to investigate whether the mortality rates were approximately normally distributed.
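As a sketch, this normality assessment can be reproduced with SciPy's implementation of the Shapiro-Wilk test. The rates below are synthetic illustrative values drawn with the quoted mean and standard deviation, not the PAH data:

```python
import numpy as np
from scipy import stats

# Illustrative synthetic monthly rates (NOT the PAH series): mean 0.16, sd 0.037
rng = np.random.default_rng(0)
rates = rng.normal(loc=0.16, scale=0.037, size=36)

w, p = stats.shapiro(rates)  # Shapiro-Wilk test of normality
print(f"W = {w:.3f}, p = {p:.3f}")  # p above 0.05 gives no evidence against normality
```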
A comparison between the observed standard deviation of the mortality rates and the standard deviation
estimated from the binomial distribution was made for the patient samples of 50 and 100 cases. The
binomial estimate is

σ_i = √( p̄(1 − p̄) / n_i )

where p̄ is the mean mortality rate, and n_i is the number of patients in sample i, either 50 or 100.
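A minimal sketch of this binomial comparison, using only the formula above (the values come from the formula, not from the thesis database):

```python
import math

p_bar = 0.16  # overall mean mortality rate quoted in the text

def binomial_sd(p, n):
    """Standard deviation of a sample proportion under the binomial model."""
    return math.sqrt(p * (1 - p) / n)

for n in (50, 100):
    print(f"n = {n:3d}: sd = {binomial_sd(p_bar, n):.4f}")
# n = 50 gives about 0.052 and n = 100 about 0.037, close to the observed
# standard deviations of the block mortality rates reported in this appendix.
```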
Runs Tests
A search for non-random patterns in the data was conducted using methods recommended by Kennett and
Zacks; the derivation of these tests and the notation comes from this reference.
The first of the Runs tests is a test of the null hypothesis that the monthly mortality rates fall randomly
either side of the overall mean mortality rate. Each sample mortality rate observation is allocated to a group
according to whether it is above or below the overall mortality rate. "A" identifies values equal to or above
the mean, and "B" identifies values below the mean. A run is defined as a consecutive series of one or more
mortality rates that have the same classification. The statistic, R, is the observed number of runs.
If R is too small, there is a high chance of "clustering" due to non-random distribution of mortality rates.
Too many runs imply mixing or homogenous distribution of the mortalities around the mean. A test of the
null hypothesis of random distribution either side of the mean has rejection regions R ≤ R_α or R ≥ R_(1−α).
With large samples such as used in this study (36 months), a normal approximation is used, with
R_α = μ_R − Z_(1−α)·σ_R and R_(1−α) = μ_R + Z_(1−α)·σ_R, where

μ_R = 1 + 2·m_A·m_B / n

σ_R² = 2·m_A·m_B·(2·m_A·m_B − n) / ( n²(n − 1) )

α is the level of significance for rejection of the null hypothesis (in this application α = 0.025). Z_(1−α) is the
value of the standard normal distribution corresponding to α. m_A is the number of observations above (
count of "A") and m_B is the number of observations below (count of "B") the mean mortality rate, and
n = m_A + m_B.
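The test above can be sketched in Python. This is a generic implementation, not the code used in the thesis; Z_(1−α) = 1.96 corresponds to α = 0.025 in each tail:

```python
import math

def runs_test(values, mean):
    """Runs test of randomness about the mean (normal approximation).

    Returns (R, lower, upper): the observed number of runs and the
    rejection limits R_alpha and R_(1-alpha).
    """
    labels = ['A' if v >= mean else 'B' for v in values]
    R = 1 + sum(1 for i in range(1, len(labels)) if labels[i] != labels[i - 1])
    m_a, m_b = labels.count('A'), labels.count('B')
    n = m_a + m_b
    mu_r = 1 + 2 * m_a * m_b / n
    sigma_r = math.sqrt(2 * m_a * m_b * (2 * m_a * m_b - n) / (n ** 2 * (n - 1)))
    z = 1.96  # standard normal value for alpha = 0.025 in each tail
    return R, mu_r - z * sigma_r, mu_r + z * sigma_r
```

For example, an alternating series such as 1, 0, 1, 0 about a mean of 0.5 gives R = 4 with limits of roughly 1.4 and 4.6.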
Runs up or down
It is possible that cyclical effects could be present without being detected on the previous assessment of
randomness. For this analysis, each mortality rate is compared to the previous mortality rate. Where it is the
same or larger, there is a trend up, identified by "U". Where the mortality rate is less than the previous rate,
there is a run down, identified by "D". A run is defined as a consecutive series of one or more mortality
rates that have the same classification, and the statistic R* is the number of runs counted.
The null hypothesis of random distribution of runs up or down is rejected if R* ≤ R*_α or R* ≥ R*_(1−α).
The normal approximation can be used, with R*_α = μ_R* − Z_(1−α)·σ_R* and R*_(1−α) = μ_R* + Z_(1−α)·σ_R*, where

μ_R* = (2n − 1) / 3

σ_R*² = (16n − 29) / 90
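A matching sketch for the runs up/down test (again generic code, not the thesis's own; for n = 63 observations it reproduces the limits 35.2 and 48.1 quoted later in this appendix):

```python
import math

def runs_up_down(values):
    """Runs up/down test (normal approximation): mu = (2n-1)/3, var = (16n-29)/90."""
    # 'U' when a value equals or exceeds its predecessor, 'D' when it is smaller
    moves = ['U' if b >= a else 'D' for a, b in zip(values, values[1:])]
    r_star = 1 + sum(1 for i in range(1, len(moves)) if moves[i] != moves[i - 1])
    n = len(values)
    mu = (2 * n - 1) / 3
    sigma = math.sqrt((16 * n - 29) / 90)
    z = 1.96  # alpha = 0.025 in each tail
    return r_star, mu - z * sigma, mu + z * sigma
```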
For the monthly data, the variable sample size, and resultant variations in standard deviation, limit the value
of the analysis of runs up and down, so this analysis was only conducted on the mortality rates of the
fixed-size blocks of 50 and 100 consecutive cases.
Auto-correlation
Estimates of the sample auto-correlation for a range of lags of up to 1 year were calculated to exclude a
seasonal cycle. The analysis used the sample auto-correlation function:

r_k = Σ_(i=1)^(N−k) (p_i − p̄)(p_(i+k) − p̄) / Σ_(i=1)^N (p_i − p̄)² ,  k = 0, 1, 2, …

where N is the number of samples and mortality rate observations in the series; p_i is the mortality rate of
the sample indexed by i; p̄ is the mean of the monthly mortality rates; k is the number of time periods (the lag).
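The sample auto-correlation function can be sketched directly from the definition (the series here is illustrative, not the PAH data):

```python
def autocorr(p, k):
    """Sample auto-correlation r_k of the series p at lag k."""
    n = len(p)
    p_bar = sum(p) / n
    num = sum((p[i] - p_bar) * (p[i + k] - p_bar) for i in range(n - k))
    den = sum((x - p_bar) ** 2 for x in p)
    return num / den

# Illustrative series (not the PAH data); r_0 is 1 by construction
series = [0.12, 0.18, 0.15, 0.20, 0.14, 0.17]
print(autocorr(series, 0))  # -> 1.0
print(autocorr(series, 3))
```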
A2.2 Results
Overview
Between 1 January 1995 and 31 December 1997, there were 3159 patients eligible for analysis.
Monthly observations
There were 36 monthly observations, with a mean monthly mortality rate of 0.161 and standard deviation
of 0.037.
The range of monthly mortality rates was 0.09 - 0.26. The normal probability plot in Figure A2.1
supports the assumption that the monthly mortality observations were approximately normally distributed.
The Shapiro-Wilk W statistic (W = 0.98, p = 0.752) provides no evidence against the null hypothesis that
the monthly mortality rates are normally distributed.

Figure A2.1: Normal probability plot of monthly mortality rates (expected normal value against observed rate).
Mortality rates above or below the monthly mean mortality rate gave a sequence with 19 runs.
There was insufficient evidence to support non-random patterns with clusters or abnormal mixing of
mortality rates.
Figure A2.2 displays the estimates of the auto-correlation coefficients for lags of 1 - 12 months. At a lag of
3 months, the coefficient is 0.38, which exceeds twice the standard error of the estimate. This suggests that
there is likely to be a positive correlation between monthly mortality rates 3 months apart. This is
plausible, in the light of seasonal cycles of activity, and the possible effects of school holidays, and the
hospital year (January to January) and the financial year (July to June) on casemix. However, there do not
appear to be any other significant auto-correlations.
Figure A2.2: Auto-correlation coefficients of monthly mortality rates with ±2 standard error limits, for lags of 1 - 12 months.
There were 63 blocks, with a mean mortality rate of 0.160 and a standard deviation of 0.051. The range of
mortality rates was 0.04 - 0.28. The normal probability plot in Figure A2.3 suggests that the mortality rates
are approximately normally distributed. The Shapiro-Wilk W statistic (W = 0.95, p = 0.02) suggests a
significant departure from the normal distribution. The standard deviation estimated from the binomial
distribution was 0.052.
Figure A2.3: Normal probability plot of mortality rates for blocks of 50 consecutive cases.
For the sequence of the 63 observations above or below the mean mortality rate, R_α < R < R_(1−α), and
there is no evidence to support non-random clusters or abnormal mixing of mortality rates.
Runs of values up and down allowed 62 comparisons from the 63 observations, with 40 runs.
U DD UU D UU D U D U D UU D U
R* = 40, n = 63, μ_R* = 41.7, and σ_R* = 3.3, from which R*_α = 35.2 and R*_(1−α) = 48.1, so R*_α < R* < R*_(1−α), and there is no evidence of non-random runs up or down.
Auto-correlation coefficients for mortality rates of blocks of 50 admissions are in Figure A2.4 for lags of 1
to 21, covering the possibility of up to 1 year correlations. There was no significant auto-correlation.
Figure A2.4: Auto-correlation coefficients of mortality rates for blocks of 50 cases with ±2 standard error limits, for lags of 1 - 21 blocks.
There were 31 observations, with a mean mortality rate of 0.160 and a standard deviation of 0.0376. The
range of mortality rates was 0.10 - 0.27. Neither the normal probability plot in Figure A2.5, nor the
Shapiro-Wilk W statistic (W = 0.94, p = 0.08) provide evidence against the hypothesis that the mortality
rates have a normal distribution. The standard deviation estimated from the binomial distribution was
0.0377.
Figure A2.5: Normal probability plot of mortality rates for blocks of 100 consecutive cases (observed rates 0.05 - 0.3).
The sequence of the 31 mortality rates above or below the mean mortality rate demonstrated 19 runs.
R = 19, m_A = 14, m_B = 17, μ_R = 16.4, and σ_R = 2.7, from which R_α = 11.1 and R_(1−α) = 21.7, so
R_α < R < R_(1−α), and there is no evidence for non-random clusters or abnormal mixing of mortality
rates.
A Runs Test of values up and down allowed 30 comparisons from the 31 observations.
U D U DD U DD U D UU DD U D UU D UU D U DD
R* = 18, n = 31, μ_R* = 20.3, and σ_R* = 2.3, from which R*_α = 15.8 and R*_(1−α) = 24.8, so R*_α < R* < R*_(1−α), and there is no evidence of non-random runs up or down.
Figure A2.6 presents the estimates of the auto-correlation coefficients for lags of 1 to 10, covering the
possibility of up to 1 year correlations. At a lag of 3 blocks of 100 cases, the coefficient is 0.47, which
exceeds twice the standard error. This suggests that there is likely to be a positive correlation between
observations that are 3 blocks of 100 cases apart. This is similar to the possible auto-correlation between
monthly mortality rates that were three months apart. However, there does not appear to be any other
significant auto-correlation.
Figure A2.6: Auto-correlation coefficients of mortality rates for blocks of 100 cases with ±2 standard error limits, for lags of 1 - 10 blocks.
Discussion of analysis
The observed mortality rates were randomly distributed either side of the mean mortality, and there were no
abnormal runs up or down. The analysis of the mortality rates for 1995 - 1997 confirms the monthly
mortality rates are approximately normally distributed, and that the distribution is probably stationary over
this period. The process can be considered to be in-control for this period. The presence of a significant
auto-correlation at a lag of 3 months may be a factor that increases the false alarm rate of any control
charting approach. Alternatively, it may be a chance finding. The auto-correlation analysis examined 43
possible relationships and two coefficients (monthly mortality at a lag of 3 months, and the 100 case blocks
at a lag of 3, which are largely equivalent analyses) were found to lie outside the two standard error limits.
With the grouping into blocks of 50 cases, the Shapiro-Wilk statistic suggested that there was a significant
departure from the normal distribution of block mortality rates. For this reason, the data will be grouped
into larger patient samples by month or 100 patient blocks for the control chart applications.
The standard deviation of the observed mortality rates for the blocks of patients estimated from the
binomial distribution agrees with the calculated standard deviation observed on this series.
A2.3 Conclusion
The observed patient mortality rates from the period 1 January 1995 to 31 December 1999 will be
considered in-control. From this analysis, the in-control mortality rate is estimated as 0.16. Auto-correlation
of mortality rate observations with a lag of 3 months or 300 cases may exist.
Appendix 3
Analysis of Performance of p-Chart
The purpose of Appendix 3 is to provide the background necessary to design appropriate p charts to
monitor ICU patient mortality. This appendix illustrates the performance of the p chart by examining the
average run length (ARL) to signal for a range of control limit parameters in the context of changing
mortality rates. The approach to analysis is adapted from that suggested by Kennett and Zacks.
The following analyses are done under assumptions that the changed mortality rate observations are
normally distributed with a mean mortality rate of p and a standard error of √(p(1 − p)/n). This estimate
closely agrees with the observed standard deviation of the mortality rates in Appendix 2.
The analysis is conducted with sample size set at n = 87, as this is the average number of patients in each
month. p₀ is the in-control mortality rate and p₁ is the changed mortality rate. The control limits are
defined as p₀ ± a·σ, where a is the multiple of the standard deviation of the mortality rate (σ).
The operating characteristic is the probability that an observed mortality rate will fall within the control
limits. It is calculated by the difference between the two cumulative distributions defined by the control
limits. The operating characteristic is useful, as it allows analysis of the probability that a single sample
mortality rate observation on the p chart will detect a change in mean mortality, or will falsely signal when
the mortality rate is unchanged.
When the mortality rate shifts to a new value p₁ and the sample sizes are large, then Kennett and Zacks
(1998) provide an expression for the operating characteristic:

OC(p₁) = Φ( (UCL − p₁)·√n / √(p₁(1 − p₁)) ) − Φ( (LCL − p₁)·√n / √(p₁(1 − p₁)) )

which, with UCL = p₀ + a·√(p₀(1 − p₀)/n) and LCL = p₀ − a·√(p₀(1 − p₀)/n), becomes

OC(p₁) = Φ( [√n·(p₀ − p₁) + a·√(p₀(1 − p₀))] / √(p₁(1 − p₁)) ) − Φ( [√n·(p₀ − p₁) − a·√(p₀(1 − p₀))] / √(p₁(1 − p₁)) )

Equation A3.1
Figure A3.1 shows the OC curves where control limits are calculated using a = 0.5, 1, 1.5, 2, and 3. Table
A3.1 contains values read off Figure A3.1 of the probability of signal under selected values for control
limits and changed mortality rates. Note that a doubling of mortality rate from 0.16 to 0.32 will be missed
with 0.20 of the observations if 3σ control limits are used. It may be more appropriate to choose 2σ
control limits, where the doubling of mortality will fail to signal with only 0.05 of the observations. With
the 2σ control limits, with unchanged mortality rate, the occurrence of false positive signals will be 0.05.
With the control limits set at 0.5σ or 1σ, the charts are very sensitive to changes in the mortality rate, but
the probability of false alarm is too high for clinical use. When a = 0.5, the probability of false alarm with
unchanged mortality rate is 1 − 0.38 = 0.62. When a = 1, the probability of false alarm with unchanged
mortality rate is 1 − 0.68 = 0.32, which is still unacceptably high for clinical use.
Figure A3.1: Operating characteristic curves of the p chart for control limits a = 0.5, 1, 1.5, 2, and 3, over changed mortality rates 0.05 - 0.45.
A further perspective can be obtained by converting OC(p₁) into estimates of the ARL to signal:

ARL = 1 / [1 − OC(p₁)]
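Equation A3.1 and this ARL conversion can be sketched as follows. This is a generic implementation using the in-control values p₀ = 0.16 and n = 87 from the text; the function names are illustrative:

```python
from math import sqrt
from statistics import NormalDist

phi = NormalDist().cdf  # standard normal CDF

def oc(p1, p0=0.16, n=87, a=3.0):
    """Operating characteristic of the p chart (Equation A3.1)."""
    shift = sqrt(n) * (p0 - p1)
    half_width = a * sqrt(p0 * (1 - p0))
    s1 = sqrt(p1 * (1 - p1))
    return phi((shift + half_width) / s1) - phi((shift - half_width) / s1)

def arl(p1, **kw):
    """Average run length (in samples) to signal: ARL = 1 / (1 - OC)."""
    return 1.0 / (1.0 - oc(p1, **kw))

print(round(arl(0.16, a=3.0), 1))  # in-control, 3-sigma limits -> 370.4
print(round(arl(0.16, a=2.0), 1))  # in-control, 2-sigma limits -> 22.0
print(round(oc(0.32, a=3.0), 2))   # a doubled rate is missed on about 0.20 of samples
```

These values reproduce the in-control ARLs of roughly 370 and 22 samples for 3σ and 2σ limits quoted below.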
This is important as analysis of the CUSUM and EWMA charts is done with ARL. Figure A3.2 shows the
ARL to signal in samples of 87 cases where control limits are calculated using a = 0.5, 1, 1.5, 2, and 3.
Figure A3.2: ARL to signal (in samples of n = 87) for control limits a = 0.5, 1, 1.5, 2, and 3.
With 3σ control limits, an average of 370.38 observations (32 225 patients or once every 27 years) will be
expected before a false alarm. However, in the event of mortality rate doubling to 0.32, the ARL is 1.25
samples (109 patients). With 2σ control limits, the ARL for false alarm is 21.98 (1912.3 patients) and the
ARL for doubling mortality is 1.05 (91.4 patients). The advantage of the 2σ control limits is the better
detection of moderate increases in mortality rate that are more clinically plausible. For example, an increase
of mortality rate from 0.16 to 0.20 will be detected by the 2σ control limit chart with an ARL of 5.35 (465
patients or between five and six months) compared to the 3σ control limit chart, which has an ARL of 28.79
(2504 patients or over two years). Though the specificity of the 3σ control limit chart is very high, it will
be slow to detect clinically important changes in mortality rate.
While designing the p charts, grouping samples of patients by month of admission was convenient.
Under this analysis, there is a linear relationship between ARL and block size when the mortality rate is
unchanged. From Equation A3.1, when p₁ = p₀, OC(p₁) ≈ 0.95 if a = 2. The estimated ARL will be
20 samples and the ARL (in cases) will be 20n. Figure A3.3 (a & b) shows the effect of changing the sample size on
ARL to signal when p₁ ≠ p₀, using an in-control mortality rate of 0.16, and 2σ control limits. In this
analysis, the ARLs are presented as average number of cases to avoid confusion as the sample size is
varied.
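The linear in-control relationship can be checked with a short sketch (generic code; note the text rounds the in-control OC to 0.95, giving 20n, while the exact normal-approximation value is closer to 22n):

```python
from math import sqrt
from statistics import NormalDist

phi = NormalDist().cdf

def cases_to_signal(p1, n, p0=0.16, a=2.0):
    """Average number of cases to signal of the p chart for block size n."""
    shift = sqrt(n) * (p0 - p1)
    half = a * sqrt(p0 * (1 - p0))
    s1 = sqrt(p1 * (1 - p1))
    oc = phi((shift + half) / s1) - phi((shift - half) / s1)
    return n / (1.0 - oc)  # ARL in samples, times n cases per sample

for n in (10, 50, 100, 200):
    print(n, round(cases_to_signal(0.16, n)))  # in-control: proportional to n
```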
Figure A3.3a displays the effect of changing block size, when the out of control mortality rates are below
0.16.
Figure A3.3a: Average Number of Cases to Signal of p Chart with Effect of Variable Sample Size over Range of Changed Mortality Rates (control limits 2σ, p₀ = 0.16, p₁ < p₀, n = 10 - 200).
When the changed mortality rate is 0.14, the ARL in cases continues to rise with increasing block size. With
lower p₁, the ARL in cases does not increase much after block sizes of 50 - 100 cases. Therefore there is
little sensitivity advantage in using block sizes greater than 50 - 100 cases to detect a decrease in mortality
rate. With a block size of 50 cases, the specificity of the p chart, measured by in-control ARL, is 1000 cases.
Figure A3.3b: Average Number of Cases to Signal of p Chart with Effect of Variable Sample Size over Range of Changed Mortality Rates (control limits 2σ, p₀ = 0.16, p₁ > p₀, n = 10 - 200).
Figure A3.3b presents the out-of-control mortality rates above 0.16. Again we can confirm that there is little
advantage in sensitivity gained by using block sizes larger than 50 - 100 cases, unless the new mortality
rate is close to the in-control rate. The choice of 2σ control limits and block sizes of at least 50 represents
appropriate performance for the p chart in this application.
Appendix 4
Statistical Analysis of the CUSUM Chart
The purpose of this appendix is to summarise the background to statistical analysis using the CUSUM. The
assumptions underlying the analysis, the notation and calculations, and an analysis of chart performance
under a range of parameter values and changed mortality rates are included. This appendix is not an
exhaustive review, and is provided as a reference to the material relevant to the text of the thesis. The
results of the analyses in this appendix provide the basis for the design of the CUSUM charts in Chapter 4.
The books by Ryan, Montgomery, Hawkins and Olwell, and Kennett and Zacks provide comprehensive
treatments of these methods.
A4.1 Assumptions
The assumptions on which the methods are based are that the observations are independent and that
samples are randomly drawn from a population of known distribution. The analysis in Appendix 2, of the
PAH ICU data during the in-control period (1 January 1995 - 31 December 1997) suggests that these
assumptions are plausible. For the following example, the sample size of 100 cases will be used. The
distribution of the mortality rate of the blocks of 100 patients is approximately normal.
To test for shifts in the process mean of an anticipated magnitude, three equivalent statistical approaches
can be used: V-mask, decision interval, and Page's two sided CUSUM. The following discussion adopts the
decision interval approach and its notation.
μ₀ is the mean of the process in-control, estimated by the mean mortality rate p₀. After a change in
the process the new mean will be μ₁. Where a shift to a higher mean mortality rate is of interest,
then μ₁⁺ > μ₀. Where a shift to a lower value is of interest, then μ₁⁻ < μ₀.

σ is the standard deviation of the process in-control. It is estimated by √(p₀(1 − p₀)/n).
From Appendix 2, this estimate gave the same values as the standard deviation of the mortality rates 1995 -
1997.

K⁺ is dependent on μ₀ and the choice of a (clinically) important increased process mean:

K⁺ = (μ₀ + μ₁⁺) / 2

Similarly, K⁻ = (μ₀ + μ₁⁻) / 2 depends on the clinically important new lower mean mortality rate.
K⁺ and K⁻ have the units of the outcome measurement. In the examples of Chapter 4 the units are deaths
per 100 cases. The CUSUM performance in detecting persistent shifts is optimal under these conditions,
but the CUSUM is robust and will signal when shifts of greater or lesser magnitude are present.
h is the control limit to which the CUSUM statistic is compared. It is chosen
according to the desired performance characteristics of the chart, i.e. average run lengths (ARL) for in-control
and changed mortality conditions under the parameter choices. h⁺ is the upper decision interval and
h⁻ is the lower decision interval.
The upper and the lower CUSUMs are separate statistical analyses testing for increases and decreases
respectively, in the process mean. These statistics are often run concurrently. If only a change in the
process mean in one direction is sought, then either an upper or lower CUSUM could be run in isolation.
Where a single CUSUM is run, it has a lower rate of false alarm than running both upper and a lower
CUSUM together. This analysis will consider the performance of both upper and lower CUSUMs together.
The upper CUSUM is

C₀⁺ = 0
C_j⁺ = max(0, C_(j−1)⁺ + p_j − K⁺)

and an increase in the mean is signalled when C_j⁺ ≥ h⁺. Concurrently, a lower CUSUM can be run, and a
decrease in the mean is signalled when C_j⁻ ≤ h⁻, given

C₀⁻ = 0
C_j⁻ = min(0, C_(j−1)⁻ + p_j − K⁻)
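The recursions can be sketched as follows. This is a generic implementation, not the thesis's own code; the parameter values in the example are those chosen later in this appendix, and the lower decision interval is written as −h for a symmetric h⁺/⁻:

```python
def cusum(rates, k_plus, k_minus, h):
    """Two-sided CUSUM of block mortality rates.

    Returns the 1-based index of the first signalling block, or None.
    The lower limit h- is taken as -h (symmetric decision intervals).
    """
    c_plus, c_minus = 0.0, 0.0
    for j, p in enumerate(rates, start=1):
        c_plus = max(0.0, c_plus + p - k_plus)
        c_minus = min(0.0, c_minus + p - k_minus)
        if c_plus >= h or c_minus <= -h:
            return j
    return None

# Parameters used later in this appendix: p0 = 0.16, mu1+ = 0.21, mu1- = 0.11,
# giving K+ = 0.185, K- = 0.135 and h = 0.073.
print(cusum([0.16, 0.17, 0.15, 0.30, 0.30, 0.30], 0.185, 0.135, 0.073))  # -> 4
```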
The performance of the CUSUM control charts can be described in terms of ARL to signal under in-control
conditions, and under conditions of a changed mean mortality rate. The in-control ARL is a
measure of the occurrence of false alarms. The ARL when the process mean has changed is an indication of
the efficiency with which the chart detects the changed mortality rate. The ARL to detect a changed
mortality rate can be estimated using a starting value of CUSUM = 0, or from a CUSUM with a steady state
value and running under in-control conditions. The ARL from steady state will be shorter than the ARL
from an in-control state. However, both methods are thought to give the same ranking of the efficiency of
competing chart designs.
For this analysis, I will calculate the ARL for CUSUMs that have the changed or out-of-control mortality
rate from the commencement of monitoring.
To characterise the performance of the CUSUM charts used to analyse the ICU data, a series of simulations
were run. All simulations were programmed in MATLAB 6.5, and the results graphed in Microsoft
Excel. Alternative approaches using integral equations, Markov chain discrete approximations to the
integral equations and other methods to reduce the computing intensity of simulations are described in
these references.
A simple programme was written to simulate the process in-control. The ARL of a CUSUM chart of the
mortality rates of blocks of 100 admissions was modelled. The in-control mortality rate of 0.16 and the
standard deviation of 0.037 formed the basis for simulations. Simulated block
mortality rates were randomly drawn from a normal distribution with these parameters, and any negative
values were given a mortality rate of zero. 10 000 simulated runs were used to estimate the ARL at each
value in the ranges studied. In each simulation an upper and a lower CUSUM were modelled.
Figure A4.1 shows the relationship between ARL and the range of the decision interval parameter h. An
h⁺/⁻ of 0.073 gives an in-control ARL for a single CUSUM test of 37 (or 3700 cases, approximately 3
years), and 20 (or 2000 cases, or 1.7 years) for the upper and lower CUSUM together.
The results of a further simulation to examine ARL over the range of h⁺/⁻ of 0.07 to 0.08 for CUSUMs
with both upper and lower, and single, monitoring schemes are shown in Figures A4.2 and A4.3. The
simulation was conducted as before, except that alternative mortality rates of interest (μ₁⁺ and μ₁⁻) were
used. The choice of μ₁⁺ and μ₁⁻ has a great effect on the ARL while the process remains in control. For
example, if the shifts in mean mortality rates that are to be detected are a lower rate of 0.08 and an upper
rate of 0.24, the ARL in-control (h⁺/⁻ = 0.073) is 77 (7700 patients or about 7 years) for a two sided
CUSUM, and 140 (14 000 patients or about 13 years) for a one sided CUSUM. This is a large increase in
the ARL in-control compared to when the mortality rates of 0.11 and 0.21 are to be detected
(h⁺/⁻ = 0.073), where the ARL for a single CUSUM is 37 and for an upper and lower CUSUM is 20.
Figure A4.2: In-Control ARL for a Range of h⁺/⁻ and Alternative Mortality Rates of Interest (in-control mortality rate = 0.16, upper and lower CUSUM; alternative rates of interest from < 0.08 or > 0.24 to < 0.14 or > 0.18).

Figure A4.3: In-Control ARL for a Range of h⁺/⁻ and Alternative Mortality Rates of Interest (in-control mortality rate = 0.16, single (upper or lower) CUSUM).
To examine the effect on ARL of changes in the observed mortality rate, a series of simulations were
conducted. The in-control mean was 0.16, the changed mortality rates to be detected were μ₁⁺ = 0.21 and
μ₁⁻ = 0.11, and the control limits were h⁺/⁻ = 0.073. For the simulations that follow, it is assumed that the
process has the changed mean from the commencement of monitoring. Simulated mortality rates were
randomly drawn from a normal distribution with mean equal to the out-of-control mortality rate, and
standard deviation calculated from the out-of-control mortality rate and the number of patients. Any
negative values were given a mortality rate of zero. 10 000 simulated runs were used to estimate the ARL
at each value in the ranges studied. In each simulation an upper and a lower CUSUM were modelled.
Figure A4.4 shows the ARL to signal of the CUSUM for a range of out-of-control mortality rates.
Figure A4.4: ARL to signal of the CUSUM (upper and lower together, upper alone, and lower alone) for out-of-control mortality rates 0.05 - 0.4.
The ARL of the upper and lower CUSUM together shows an ARL of 18.7 when the mortality rate remains
at 0.16, which differs slightly from the estimate of 20 from the simulation of Figure A4.2. The ARL
decreases as the mortality rate is changed from 0.16. The ARLs for 0.15 (16.5) and 0.17 (13.9) are not
much different from the false alarm ARL. This is appropriate, as such small changes are not of particular
clinical importance. However, at more extreme out-of-control mortality rates, the CUSUM will signal quite
rapidly. The ARL for a mortality rate of 0.21 is 3.3 and that of 0.11 is 3.4. At the PAH ICU this would have
represented detection within roughly 330 patients, or about four months.
Figure A4.4 also shows the ARL when only a single upper or lower CUSUM is used. This monitoring
option may be chosen when only one direction of changed mortality rate is of clinical interest. For example,
it may be important only to detect an increased mortality rate. The ARL for detection of increased mortality
rates that are close to the in-control mortality rate is much larger. At a mortality rate of 0.18, the ARL of
the upper and lower CUSUM and the upper CUSUM only is the same. So the sensitivity of the combined
upper and lower CUSUMs is about the same as the upper CUSUM alone to detect increases in mortality rates.
The advantage of using only a single CUSUM when only one direction of change is of interest is that the
ARL in-control is much longer (35.5 v 18.7), and the analyses are more specific. Also, the ARL when
the mortality rate has fallen is dramatically increased. It is very unlikely to get signals from the upper
CUSUM, as it is not designed to detect decreases in the mortality rate. The increased ARL for the upper
CUSUM is seen at out-of-control mortality rates of 0.15 (16.5 v 18.9), 0.14 (10.9 v 310) and 0.13 (7.0 v
1377). Similarly, the lower CUSUM is not designed to detect increased mortality rates.
A4.7 Summary
The design of the CUSUM chart requires that the in-control mortality rate is known and that out-of-control
mortality rates of clinical importance are chosen. Analysis of the ARL under in-control and under changed
mortality conditions guides the choice of chart parameters.
In the PAH ICU example, the in-control mortality rate was 0.16, and the mortality rates that were to be
detected were 0.11 and 0.21. With h⁺/⁻ = 0.073, an ARL of 1870 patients for in-control occurrence of
false alarm was considered acceptable. This choice of parameters allowed rapid detection of clinically
important changes in mortality rate.
Appendix 5
Analysis of Performance and Choice of Parameters and
Control Limits for Exponentially Weighted Moving
Average (EWMA) Chart
The purpose of this appendix is to analyse the ARL of the EWMA chart for a range of chart characteristics.
From this analysis, a choice of appropriate parameters will be made, and used in Chapter 4 for monitoring
ICU mortality.
EWMA_i = λ y_i + (1 − λ) EWMA_{i−1}
where y_i is the value of the i-th observation. This value may be the mortality rate of a sample of
patients, p_i, or the outcome of a single patient, Y_j. In the examples used, both sample blocks of 100
consecutive patients and single patient outcomes are presented. λ is the weight, between 0 and 1.
σ_EWMA_i = √( (p(1 − p)/n) × (λ/(2 − λ)) × [1 − (1 − λ)^{2i}] )
where n is the number of cases in the sample; n = 1 when the outcomes of single patients, Y_j, are being
analysed. A series of simulation experiments were performed to characterise the run length distribution
and the ARL under different conditions. a is the parameter defining the width of the control limits as multiples
of σ_EWMA_i.
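The recursion and the control limits described above can be sketched directly. This minimal sketch assumes the in-control rate p0 = 0.16 and blocks of n = 100 patients used in the examples; the bracketed start-up term widens the limits towards their asymptotic value as i grows.

```python
import math

def ewma_update(prev, y, lam=0.3):
    """One EWMA step: EWMA_i = lam * y_i + (1 - lam) * EWMA_{i-1}."""
    return lam * y + (1 - lam) * prev

def ewma_limits(i, p0=0.16, n=100, lam=0.3, a=2.0):
    """Control limits after the i-th observation (i = 1, 2, ...).

    sigma is the standard deviation of a block mortality rate,
    sqrt(p0 * (1 - p0) / n); the bracketed term gives the start-up
    narrowing of the limits, approaching a*sigma*sqrt(lam/(2-lam)).
    """
    sigma = math.sqrt(p0 * (1 - p0) / n)
    half_width = a * sigma * math.sqrt(lam / (2 - lam) * (1 - (1 - lam) ** (2 * i)))
    return p0 - half_width, p0 + half_width
```

With λ = 0.3, a = 2 and n = 100 the asymptotic half-width is about 0.031, so the steady-state limits sit near 0.129 and 0.191.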
Simulations were conducted to display the relationship between a, which defines the width of the control limits,
and the in-control ARL.
To estimate the ARL, 10 000 EWMA simulations were performed at values of a between 0.5 and 3. For
each simulation, the starting in-control estimate was 0.16 and λ = 0.3. A random variable having a
normal distribution with a mean of 0.16 and a standard deviation of √(0.16 × (1 − 0.16)/100) = 0.0367 was
used to simulate the mortality rates of blocks of 100 patients. Each simulated mortality observation was
incorporated into the EWMA chart. When the EWMA_i statistic fell beyond one of the control limits, the
chart was deemed to have signalled, and the run length for that simulation was i. The ARL for each
value of a was the mean of the 10 000 run lengths.
Figure A5.1 shows the effect of varying the width of the control limits on the in-control ARL. As expected,
the narrower control limits have short ARLs, with the ARL at a = 1 only 3.6 blocks. At a = 2, the ARL is
approximately 30 blocks.
Figure A5.1: In-control ARL against the width of the control limits (± a).
Simulations were conducted to display the relationship between λ, the EWMA weight, and the ARL under
in-control conditions.
To estimate the ARL, 10 000 EWMA simulations were performed at each value of λ between 0.01 and 1.
For each chart simulation, the starting in-control estimate was 0.16 and a = 2. A random variable having a
normal distribution with a mean of 0.16 and a standard deviation of 0.0367 was used to simulate the
mortality rates of blocks of 100 patients. Each simulated mortality observation was incorporated into the
EWMA chart. When the EWMA_i statistic fell beyond one of the control limits, the chart was deemed to
have signalled, and the run length for that simulation was i. The ARL for each value of λ was the mean of
the 10 000 run lengths.
Figure A5.2 shows the rapid rise in ARL as λ, the weight, is reduced. At λ = 0.02, the in-control ARL is
180.1. At λ = 0.2 the ARL is 37.2, and at λ = 0.3, the ARL is 30.0. There is a somewhat flat response for λ
> 0.4.
The relationship between the ARL to signal and changed mortality rates was explored.
10 000 EWMA simulations were performed at each of the simulated changed mortality rates between 0.05
and 0.30, in increments of 0.001. For each chart simulation, the starting in-control estimate was 0.16, a = 2
and λ = 0.3. A random variable having a normal distribution with a mean of the changed mortality rate and
a standard deviation calculated from the new mortality rate was used to simulate the mortality rates of
blocks of 100 patients. Each simulated mortality observation was incorporated into the EWMA chart.
When the EWMA_i statistic fell beyond one of the control limits, the chart was deemed to have signalled,
and the run length for that simulation was i. The ARL at each mortality rate was the mean of the 10 000 run lengths.
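The simulation procedure just described can be sketched as follows. The normal approximation to the block mortality rate, the start-up-adjusted limits and the signalling rule are as in the text; the simulation sizes here are reduced for speed, so the estimates are rougher than the appendix's.

```python
import math
import random

def ewma_block_arl(p_true, p0=0.16, n=100, lam=0.3, a=2.0,
                   n_sims=2000, max_blocks=10_000, seed=2):
    """Monte Carlo ARL (in blocks of n patients) of the EWMA chart.

    Block mortality rates are simulated with the normal approximation
    used in the appendix: mean p_true, sd sqrt(p_true*(1-p_true)/n).
    """
    rng = random.Random(seed)
    sigma0 = math.sqrt(p0 * (1 - p0) / n)   # in-control sd of a block rate
    sd = math.sqrt(p_true * (1 - p_true) / n)
    total = 0
    for _ in range(n_sims):
        z = p0                               # chart starts at the in-control estimate
        for i in range(1, max_blocks + 1):
            z = lam * rng.gauss(p_true, sd) + (1 - lam) * z
            hw = a * sigma0 * math.sqrt(lam / (2 - lam) * (1 - (1 - lam) ** (2 * i)))
            if abs(z - p0) > hw:             # signal: record the run length
                total += i
                break
        else:
            total += max_blocks
    return total / n_sims
```

Running it at p_true = 0.16 reproduces an in-control ARL of roughly 30 blocks, and at raised or lowered rates the ARL falls sharply, in line with Figure A5.3.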
The ARLs for a range of mortality rates are shown in Figure A5.3. The in-control ARL is 30.4 blocks. The
ARL drops rapidly as the mortality rate moves away from 0.16. At a mortality rate of 0.15 the ARL is 21.1, and
at 0.17 the ARL is 17.5. When the mortality rate has fallen to 0.12 or risen to 0.20, the ARL is 3.6.
For monitoring mortality rates in this application in the ICU, where the in-control mortality rate is 0.16, the
values of λ = 0.3 and a = 2 provide an in-control ARL of about 30.5 blocks (3050 patients or 2.5 years) and rapid
detection of clinically important changes in the mortality rate.
The EWMA chart can be used to monitor the mean mortality rate for any rational group size, including
individual patient observations, Y_j. A smaller value of λ will smooth the effect of accumulating individual
patient outcomes.
Simulations were conducted to display the relationship between λ, the EWMA weight, and the ARL at a
range of changed mortality rates.
To estimate the ARL, 1000 EWMA simulations were performed at values of λ between 0.001 and 0.3 in
increments of 0.001. A range of changed mortality rates between 0.10 and 0.22 in increments of 0.02 were
simulated. For each chart simulation, the starting in-control estimate was 0.16 and a = 2. A Bernoulli trial
with a probability equal to the changed mortality rate was used to simulate the outcomes of the patients.
Each simulated outcome was incorporated into the EWMA chart. When the EWMA_i statistic fell beyond
one of the control limits, the chart was deemed to have signalled, and the run length for that simulation was
i. The ARL at each value of λ was the mean of the 1000 run lengths.
Figure A5.4 is a log-log plot of ARL against λ which displays the results of this simulation. There is an
almost linear relationship between log ARL and log λ. There is a rapid rise in the ARL for all values
of λ below 0.05 at all simulated changed mortality rates. For λ > 0.2, there is little difference between the
ARL for the in-control mortality rate of 0.16 and the out-of-control rates.
For the EWMA chart with n = 1, a λ in the range of 0.001 - 0.02 gives a balance between rapid detection of
changed mortality rates and a long ARL for the in-control process.
Based on these analyses, for the EWMA charts which analyse the mortality rate using individual patient
observations, λ = 0.001 is chosen. This gives an in-control ARL of 2155 cases, and ARLs of 117 cases for a
mortality rate of 0.10, 232 cases for a mortality rate of 0.12, 487 cases for a mortality rate of 0.18 and 163
cases for a mortality rate of 0.20.
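A single-patient (n = 1) version of the chart can be sketched as below. For simplicity this sketch uses the asymptotic control limits rather than the start-up-adjusted limits, so it is only an approximation to the chart analysed above; p0, λ and a are the example values from this appendix.

```python
import math

def ewma_patient_chart(outcomes, p0=0.16, lam=0.001, a=2.0):
    """Single-patient EWMA chart (n = 1) over a sequence of 0/1 outcomes.

    Returns the 1-based index of the first signalling patient, or None.
    This sketch uses the steady-state half-width
    a * sqrt(p0*(1-p0)) * sqrt(lam/(2-lam)) for all i, rather than the
    start-up-adjusted limits.
    """
    half_width = a * math.sqrt(p0 * (1 - p0)) * math.sqrt(lam / (2 - lam))
    z = p0                       # chart starts at the in-control estimate
    for i, y in enumerate(outcomes, start=1):
        z = lam * y + (1 - lam) * z
        if abs(z - p0) > half_width:
            return i
    return None
```

With λ = 0.001 the chart accumulates evidence very slowly: a run in which every patient dies takes about twenty patients to signal, while a stream of survivors at rates near 0.16 runs for a long time without a signal.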
Appendix 6
Characterisation of the Distribution of Observed
Mortality Rates
The purpose of this appendix is to discuss approaches to characterise the distribution of mortality rates in a
sample of patients.
A6.1 Notation
The notation used in this appendix is the same as used in the body of the thesis, and is summarised as
follows:
Y_ij is the outcome for patient j in sample i. If the patient dies, Y_ij = 1 and the probability of death is π_ij. If
the patient survives, Y_ij = 0 and the probability of survival is 1 − π_ij.
E(Y_ij) = 1 × π_ij + 0 × (1 − π_ij) = π_ij
var(Y_ij) = (1 − π_ij)² π_ij + (0 − π_ij)² (1 − π_ij) = π_ij (1 − π_ij)
π_ij may be estimated by p_ij, using a statistical model such as the APACHE III estimate of the probability
of death. The variance may then be estimated as
var(Y_ij) = p_ij (1 − p_ij)
The observed mortality rate for sample i is R_i = Σ_{j=1}^{n_i} Y_ij / n_i, and the predicted mortality rate, E(R_i), is
E(R_i) = Σ_{j=1}^{n_i} π_ij / n_i
R_i is the average of the random variables Y_ij, so using the central limit theorem, the distribution of R_i can
be approximated by a normal distribution with
E(R_i) = Σ_{j=1}^{n_i} E(Y_ij) / n_i = Σ_{j=1}^{n_i} π_ij / n_i
and
var(R_i) = Σ_{j=1}^{n_i} π_ij (1 − π_ij) / n_i²
Where n_i is small, Alemi and co-workers use the t distribution rather than the standard normal distribution.
For the application to the ICU dataset, the sample sizes are large, being more than 87 cases. The normal
approximation is simple to work with and, for the purposes of RA charting, its accuracy and precision will
be sufficient.
However, this model of the distribution of the sample mortality rate is a continuous, unbounded approximation,
whereas the distribution of R_i given the series of π_ij values is a discrete distribution of mortality rates.
For non-RA analysis, it is assumed that the patients were independently and randomly selected. The
individual patient's risk of death is not known, so it is assumed to be the same for all patients, and is
denoted by π_i. E(R_i) was the average predicted mortality rate of the sample.
The control limits were calculated using the estimate of the variance of R_i:
var(R_i) = π_i (1 − π_i) / n_i
However, this expression for the variance over-estimates var(R_i) if the patients do not all actually have the
same risk of dying. In the realistic case where the patients do not all have the same risk of dying, π_i is still
the average of the individual patient risks:
π_i = Σ_{j=1}^{n_i} π_ij / n_i
Substituting this average into the non-RA expression and expanding gives
π_i (1 − π_i) / n_i = [ Σ_{j=1}^{n_i} π_ij (1 − π_ij) + Σ_{j=1}^{n_i} (π_ij − π_i)² ] / n_i²
Σ_{j=1}^{n_i} π_ij (1 − π_ij) / n_i² is the estimate of the variance of the mortality rate where all the patients' probabilities of
death are not equal. Assuming that this expression is accurate, the non-RA estimate of the variance
exceeds it by Σ_{j=1}^{n_i} (π_ij − π_i)² / n_i².
Therefore, where the average patient's risk of death is low and the sample sizes are large, this estimate is
potentially a reasonable approximation. The advantage of this approximation is its ease of calculation.
Its disadvantage is that it overestimates the variance of R_i. Figure A6.1, at the end of this appendix,
shows the error in this estimate compared to more accurate methods, using 100 patients randomly drawn
from the PAH ICU dataset. This approach, though simple, will result in wide control limits that will provide
a conservative analysis.
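The relationship between the pooled and individual-risk variance estimates can be checked numerically. The risks below are illustrative values, not PAH ICU data; the final line verifies that the pooled estimate exceeds the exact variance by exactly the spread term.

```python
# Illustrative patient risks (not PAH ICU data): a sample where risks differ.
risks = [0.05, 0.10, 0.10, 0.20, 0.35, 0.60]
n = len(risks)
mean_risk = sum(risks) / n

# Variance of the sample mortality rate when each patient keeps their own risk.
var_exact = sum(p * (1 - p) for p in risks) / n ** 2

# Non-RA estimate: every patient assigned the average risk.
var_pooled = mean_risk * (1 - mean_risk) / n

# The overestimate equals the spread of the individual risks.
spread = sum((p - mean_risk) ** 2 for p in risks) / n ** 2

assert var_pooled > var_exact
assert abs(var_pooled - (var_exact + spread)) < 1e-12
```

The more the individual risks vary, the larger the spread term, and so the more conservative the non-RA control limits become.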
The probability distribution of R_i can be calculated iteratively, if the values of all π_ij are known, or are
each approximated by p_ij. Consider that each patient outcome, Y_ij (death = 1, survival = 0), is an
independent Bernoulli random variable with probability π_ij. With n_i patients in a sample,
0 ≤ Σ_{j=1}^{n_i} Y_ij ≤ n_i. The probability distribution of R_i = Σ_{j=1}^{n_i} Y_ij / n_i can be described in terms of π_ij.
When the first patient is included, there are two possible values:
Pr(R_{i,1} = 1) = π_1
Pr(R_{i,1} = 0) = 1 − π_1
After the second patient there are three possible observed mortality rates:
Pr(R_{i,2} = 1) = π_1 π_2
Pr(R_{i,2} = 0.5) = π_1 (1 − π_2) + (1 − π_1) π_2
Pr(R_{i,2} = 0) = (1 − π_1)(1 − π_2)
The process can be continued with each patient risk of death estimate.
This iterative approach is a simple method to compute the probability distribution or the cumulative
distribution function of R_i.
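The iterative calculation described above is a convolution that builds up, one patient at a time, the Poisson-binomial distribution of the number of deaths. A minimal sketch:

```python
def death_count_distribution(risks):
    """Exact distribution of the number of deaths among patients with
    individual death probabilities `risks` (a Poisson-binomial
    distribution), built up one patient at a time as in the appendix.

    Returns a list where element d is Pr(d deaths).
    """
    dist = [1.0]                         # before any patients: Pr(0 deaths) = 1
    for p in risks:
        new = [0.0] * (len(dist) + 1)
        for d, prob in enumerate(dist):
            new[d] += prob * (1 - p)     # this patient survives
            new[d + 1] += prob * p       # this patient dies
        dist = new
    return dist
```

For illustrative risks [0.1, 0.2] this returns [0.72, 0.26, 0.02], matching the two-patient expressions above; dividing the death count d by the number of patients converts this into the distribution of the observed mortality rate R_i.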
The likelihood function, or the joint probability of all the terms Y_ij, is
Π_{j=1}^{n_i} π_ij^{Y_ij} (1 − π_ij)^{1 − Y_ij}
and this provides a simple method of computing the probability of any R_i, given the set of values of π_ij. The
probability of observing a number of deaths, d, out of the sample of n_i cases is the sum of the
likelihood terms that correspond to each of the n_i! / (d! (n_i − d)!) ways that d deaths can occur, that is,
over all outcome sequences (Y_i1, ..., Y_in_i) such that Σ_{j=1}^{n_i} Y_ij = d.
Figure A6.1 compares the estimates of the cumulative probability function of R_i using the three methods
described. The calculations have been performed for a single randomly chosen subset of 100 patients from
the dataset.
Figure A6.1 shows that the model using the sample mean mortality rate for all patients provides an
overestimate of the variance of R_i. The staircase appearance of the iterative or likelihood function method
represents the true distribution of R_i. The continuous distribution of the central limit theorem
approximation follows the true distribution closely.
However, in Chapter 5, with the RA CUSUM and RA EWMA charts, the continuous approximation is
inaccurate, and further exact methods for the RA EWMA are developed.
Bibliography
1 Knaus WA, Draper EA, Wagner DP and Zimmerman JE. APACHE II: a severity of disease
classification system. Crit Care Med 1985; 13:818 - 829
2 Knaus WA, Wagner DP, Draper EA, Zimmerman JE, Bergner M, Bastos PG, Sirio CA, Murphy
DJ, Lotring T, Damiano A and Harrell FE. The APACHE III prognostic system. Risk prediction of
hospital mortality for critically ill hospitalized adults. Chest 1991; 100:1619 -1636
4 Hastie T, Tibshirani R and Friedman J. The Elements of Statistical Learning: Data Mining,
Inference and Prediction. 1st ed. New York: Springer, 2001
5 Justice A, Covinski K and Berlin J. Assessing the generalizability of prognostic information. Arch
Intern Med 1999; 130:515 - 524
7 Le Gall J, Lemeshow S and Saulnier F. A new simplified acute physiology score (SAPS II) based
on a European/North American multi-centre study. JAMA 1993; 270:2957 - 2963
9 Benneyan J and Borgman A. Risk adjusted sequential probability ratio tests and longitudinal
surveillance. Int J of Quality in Health Care 2003; 15:5-6
10 de Leval MR, Francois K, Bull C, Brawn W and Spiegelhalter D. Analysis of a cluster of surgical
failures. Application to a series of neonatal arterial switch operations. J Thorac Cardiovasc Surg
1994; 107:914-924
12 Iezzoni L. Dimensions of risk. In: Iezzoni L, ed. Risk Adjustment for Measuring Health Care
Outcomes. Chicago, Ill.: Health Administration Press, 1997
13 Spiegelhalter D, Grigg O, Kinsman R and Treasure T. Risk adjusted sequential probability ratio
tests: applications to Bristol, Shipman and adult cardiac surgery. Int J for Quality in Health Care
2003; 15:7-13
14 Rosenberg AL. Recent innovations in ICU risk-prediction models. Curr Opin Crit Care 2002;
8:321 -330
15 Lim T. Statistical process control tools for monitoring clinical performance. Int J for Quality in
Health Care 2003; 15:3-4
16 Render M, Kim M, Welsh D, Timmons S, Johnston J, Hui S, Connors A, Wagner D, Daley J and
Hofer T. Automated ICU risk adjustment: Results from a National Veterans Affairs study. Crit
Care Med 2003; 31:1638-1646
17 Kennett R and Zacks S. Chapter 1: The role of statistical methods in modern industry. Modern
Industrial Statistics: Design and Control of Quality and Reliability. Belmont, CA: Duxbury Press,
1998; 2 - 13
18 Clermont G and Angus DC. Severity scoring systems in the modern ICU. Ann Acad Med
Singapore 1998; 27:397 - 403
19 Gunning K and Rowan K. ABC of intensive care outcome data and scoring systems. BMJ 1999;
319:241 - 244
21 Le Gall J, Lemeshow S and Saulnier F. Correction: A new simplified acute physiology score
(SAPS II) based on a European/North American multicentre study. JAMA 1994; 271:1321
22 Zimmerman JE, Knaus WA, Wagner DP, Sun X, Hakim RB and Nystrom PO. A comparison of
risks and outcomes for patients with organ system failure: 1982-1990. Crit Care Med 1996;
24:1633-1641
23 Zimmerman JE, Knaus WA, Sun X and Wagner DP. Severity stratification and outcome
prediction for multi-system organ failure and dysfunction. World J Surg 1996; 20:401 - 405
24 Knaus WA, Wagner DP, Zimmerman JA and Draper EA. Variations in mortality and length of
stay in ICU. Ann Int Med 1993; 118:753 - 761
25 Johnston JA, Wagner DP, Timmons S, Welsh D, Tsevat J and Render ML. Impact of different
measures of comorbid disease on predicted mortality of ICU patients. Med Care 2002; 40:929 -
940
26 Graham P and Cook D. Risk prediction using 30 day outcome: A practical endpoint for quality
audit. Chest: accepted for publication, October 2003
27 Young D and Ridley S. Mortality as an outcome measure in intensive care. In: Ridley S, ed.
Outcomes in Critical Care. Oxford: Butterworth - Heinemann, 2002; 25 - 46
28 Ridley S. Severity of illness scoring systems and performance appraisal. Anaesthesia 1998;
53:1185-1194
29 Irwig L, Bossuyt P, Glasziou P, Gatsonis C and Lijmer J. Designing studies to ensure that
estimates of test accuracy are transferable. BMJ 2002; 324:669 - 671
30 Pappachan JV, Millar B, Bennett D and Smith GB. Comparison of outcome from intensive care
admission after adjustment for case mix by the APACHE III prognostic system. Chest 1999;
115:802-810
31 Cook DA. Performance of APACHE III models in an Australian ICU. Chest 2000; 118:1732 -
1738
32 Rowan K, Kerr J, Major E, McPherson K, Short A and Vessey M. Intensive Care Society's
APACHE II study in Britain and Ireland - I Variations in casemix of adult admissions to general
ICUs and impact on outcome. BMJ 1993; 307:972 - 977
33 Rowan K, Kerr J, Major E, McPherson K, Short A and Vessey M. Intensive Care Society's
APACHE II study in Britain and Ireland - II Outcome comparisons of ICUs after adjustments for
casemix by the American APACHE II method. BMJ 1993; 307:977 - 981
34 Rowan KM, Kerr JH, Major E, McPherson K, Short A and Vessey MP. Intensive Care Society's
APACHE II study in Britain and Ireland: a prospective, multi-center, cohort study comparing two
methods for predicting outcome for adult intensive care patients. Crit Care Med 1994; 22:1392 -
1401
35 Castella X, Artigas A, Bion J and Kari A. A comparison of severity of illness scoring systems for
ICU patients: Results of a multicenter, multinational study. Crit Care Med 1995; 23:1327 -1335
36 Beck DH, Taylor BL, Millar B and Smith GB. Prediction of outcome from intensive care: a
prospective cohort study comparing APACHE II and III prognostic systems in a United Kingdom
intensive care unit. Crit Care Med 1997; 25:9-15
37 Marshall JC, Cook DJ, Christou NV, Bernard GR, Sprung CL and Sibbald WJ. Multiple Organ
Dysfunction Score: A reliable descriptor of a complex clinical outcome. Crit Care Med 1995;
23:1638-1652
38 Diamond G. What price perfection? Calibration and discrimination of clinical predictive models. J
Clin Epidemiol 1992; 45:85 - 89
40 Yates FJ. Decompositions of the mean probability score. Organisational Behaviour and Human
Performance 1982; 30:132 -156
41 Hilden J. The area under the ROC curve and its competitors. Med Decis Making 1991; 11:95 -101
42 Zweig MH and Campbell G. ROC plots: A fundamental evaluation tool in clinical medicine. Clin
Chem 1993; 39:561 - 577
43 Poses RM, Cebul RD and Centor RM. Evaluating physicians' probabilistic judgements. Med Decis
Making 1988; 8:233-240
44 Jacobs S, Chang R, Lee B and Lee B. Audit of intensive care: a 30 month experience using the
APACHE II severity of disease classification system. Int Care Med 1988; 14:567 - 574
45 Giangiuliani G, Mancini A and Gui D. Validation of a severity of illness score (APACHE II) in a
surgical ICU. Int Care Med 1989; 15:519-522
46 Oh TE, Hutchinson R, Short S, Buckley T, Lin E and Leung D. Verification of the APACHE
scoring system in a Hong Kong ICU. Crit Care Med 1993; 21:698 - 705
47 Wong DT, Crofts SL, Gomez M, McGuire GP and Byrick RJ. Evaluation of predictive ability of
APACHE II system and hospital outcome in Canadian ICU patients. Crit Care Med 1995; 23:1177
-1183
49 Markgraf R, Deutschinoff G, Pientka L and Scholten T. Comparison of APACHE II and III and
SAPS II: A prospective cohort study evaluating these methods to predict outcome in a German
interdisciplinary ICU. Crit Care Med 2000; 28:26 - 33
50 Ruttiman UE. Statistical approaches to the development and validation of predictive instruments.
Crit Care Clin 1994; 10:19-35
51 Altman D and Bland J. Diagnostic tests 3: The ROC plot. BMJ 1994; 309:188
52 Glance LG, Osler T and Shinozaki T. ICU prognostic scoring systems to predict death: a cost
effectiveness analysis. Crit Care Med 1998; 26:1842 -1849
53 Henderson AR. Assessing test accuracy and its clinical consequences: a primer for ROC curve
analysis. Ann Clin Biochem 1993; 30:521 - 539
55 Swets JA. Measuring the accuracy of diagnostic systems. Science 1988; 240:1285 - 1293
56 Murphy-Filkins R, Teres D, Lemeshow S and Hosmer DW. Effect of changing patient mix on the
performance of an ICU severity-of-illness model: How to distinguish a general from a specialty
intensive care unit. Crit Care Med 1996; 24:1968 -1973
57 Glance LG, Osler TM and Papadakos P. Effect of mortality rate on the performance of the
APACHE II: a simulation study. Crit Care Med 2000; 28:3424 - 3428
58 Steen PM. Approaches to predictive modelling. Ann Thor Surg 1994; 58:1836 -1840
59 Hanley JA and McNeil BJ. The meaning and use of a ROC curve. Radiology 1982; 143:29 - 36
60 Centor RM and Schwartz JS. An evaluation of methods of estimating the area under the ROC
curve. Med Decis Making 1985; 5:149 -156
61 Hanley JA and McNeil BJ. A method of comparing the areas under the ROC curves derived from
the same cases. Radiology 1983; 148:839 - 843
62 Spiegelhalter DJ. Statistical methodology for evaluating gastrointestinal symptoms. Clin Gastro
1985;14:489-515
63 Yates JF and Curley SP. Conditional distribution analyses of probabilistic forecasts. J Forecast
1985; 4:61 -73
64 Flora J. A method for comparing survival of burns patients to a standard survival curve. J Trauma
1978; 18:701-705
65 Lemeshow S and Hosmer D. A review of goodness of fit statistics for use in the development of
logistic regression models. Am J Epidem 1982; 115:92 - 106
66 Hosmer DW and Lemeshow S. Assessing the fit of the model. Applied logistic regression. New
York: John Wiley and Sons, 1989; 135 -175
67 Vollset S. Confidence intervals for a binomial proportion. Stat Med 1993; 12:809 - 824
68 Armitage P and Berry P. Inferences from proportions. Statistical Methods in Medical Research.
London: Blackwell Scientific Publications, 1994; 118-124
69 Rapoport J, Teres D, Lemeshow S and Gehlbach S. A method for assessing the clinical
performance and cost effectiveness of ICUs: a multi-centre inception cohort study. Crit Care Med
1994;22:1385-1391
71 Sherlaw-Johnson C and Gallivan S. Approximating prediction intervals for use in variable life
adjusted displays. London: Clinical Operational Research Unit, Dept. Mathematics, University
College, 2000
73 Miller M and Hui S. Validation techniques for logistic regression models. Stat Med 1991; 10:1213
-1226
74 Murphy AH. A new vector partition of the probability score. J Appl Meteorology 1973; 12:595 -
600
75 Dolan JG, Bordley DR and Mushlin AI. An evaluation of clinicians' subjective prior probability
estimates. Med Dec Making 1986; 6:216 - 223
77 Zimmerman JE, Wagner DP, Draper EA, Wright L, Alzola C and Knaus WA. Evaluation of
APACHE III predictions of hospital mortality in an independent database. Crit Care Med 1998;
26:1317-1326
78 Beck DH, Smith GB and Taylor BL. The impact of low-risk ICU admissions on mortality
probabilities by SAPS II, APACHE II and APACHE III. Anaesthesia 2002; 57:21 - 26
79 Ash AS and Schwartz M. Evaluating the performance of risk adjustment methods: dichotomous
methods. In: Iezzoni LI, ed. Risk Adjustment for Measuring Health Outcomes. Ann Arbor: Health
Admin Press, 1994; 313-346
80 Lee KL, Pryor DB, Harrell FE, Califf RM, Behar VS, Floyd WL, Morris AJ, Waugh RA, Whalen
RE and Rosati RA. Predicting outcome in coronary disease. Am J Med 1986; 80:553 - 560
81 Miller ME, Langefeld CD, Tierney WM, Hui SL and McDonald CJ. Validation of probabilistic
predictions. Med Decis Making 1993; 13:49 - 58
82 Le Gall R, Klar J, Lemeshow S, Saulnier F, Alberti C, Artigas A and Teres D. The Logistic Organ
Dysfunction System: A new way to assess organ dysfunction in the ICU. JAMA 1996; 276:802 -
810
84 Sirio CA, Shepardson LB, Rotondi AJ, Cooper GS, Angus DC, Harper DL and Rosenthal GE.
Community-wide assessment of intensive care outcomes using a physiologically based prognostic
measure. Chest 1999; 115:793 - 801
85 Glance L, Osler T and Dick A. Identifying quality outliers in a large, multi-institutional database
by using customised versions of the SAPS II and MPM II. Crit Care Med 2002; 30:1995 - 2002
86 Jacobs S, Chang R and Lee B. One year's experience with the APACHE II severity of disease
classification system in a general ICU. Anaesthesia 1987; 42:738 - 744
87 Zimmerman JE, Knaus WA, Judson JA, Havill JH, Trubuhovich RV, Draper EA and Wagner DP.
Patient selection for intensive care: A comparison of New Zealand and United States hospitals.
Crit Care Med 1988; 16:318 - 326
91 Le Gall J, Lemeshow S, Leleu G, Klar J, Huillard J, Rue M, Teres D and Artigas A. Customised
probability models for early severe sepsis in adult intensive care. JAMA 1995; 273:644 - 650
92 Moreno R, Miranda DR, Fidler V and Van Schilfgaarde R. Evaluation of two outcome prediction
models on an independent database. Crit Care Med 1998; 26:50 - 61
93 Tan IKS. APACHE II and SAPS II are poorly calibrated in a Hong Kong ICU. Ann Acad Med
Singapore 1998; 27:318 - 322
94 Wong LS and Young JD. A comparison of ICU mortality prediction using the APACHE II scoring
system and artificial neural networks. Anaesthesia 1999; 54:1048 -1054
95 Buist M, Gould T, Hagley S and Webb R. An analysis of excess mortality not predicted to occur
by APACHE III in an Australian Level III ICU. Anaes Int Care 2000; 28:171 - 177
96 Janssens U, Graf C, Graf J, Radke P, Konigs B, Koch K, Lepper W, vom Dahl J and Hanrath P.
Evaluation of the SOFA score: a single center experience of a medical ICU unit in 303
consecutive patients with predominantly cardiovascular disorders. Int Care Med 2000; 26:1037 -
1045
97 Livingston BM, MacKirdy FN, Howie JC, Jones R and Norrie JD. Assessment of the performance
of five intensive care scoring models within a large Scottish database. Crit Care Med 2000;
28:1820-1827
99 Glance L, Osler T and Dick A. Rating the quality of intensive care units: Is it a function of the
ICU scoring system? Crit Care Med 2002; 30:1976 -1982
100 Bastos PG, Sun X, Wagner DP, Knaus WA and Zimmerman JE. Application of the APACHE III
prognostic system in Brazilian intensive care units: a prospective multicenter study. Int Care Med
1996;22:564-570
102 Cowen JS and Kelly MA. Errors and bias in using predictive scoring systems. Crit Care Clin
1994; 10:53-72
103 Cook D, Joyce C, Bamett R, Birgan S, Playford H, Cockings J and Hurford R. Prospective
independent validation of APACHE III models in an Australian tertiary referral ICU. Anaesth Int
Care 2002; 30:308-315
104 Levett J and Carey R. Measuring for improvement: Toyota to thoracic surgery. Ann Thorac Surg
1999; 68:353 - 358
105 Kennett R and Zacks S. Chapter 10: Basic tools and principles of statistical process control.
Modern Industrial Statistics: Design and Control of Quality and Reliability. Belmont, CA:
Duxbury Press, 1998; 322 - 359
106 Montgomery D. Chapter 4: Methods and philosophy of statistical process control. Introduction to
Statistical Quality Control. New York: John Wiley and Sons, 1996
107 Ryan T. Statistical Methods for Quality Improvement. New York: John Wiley and Sons, 1989
108 Seto T, Mittleman M, Davis R, Taira D and Kawachi I. Seasonal variation in coronary artery
disease mortality in Hawaii: an observational study. BMJ 1998; 316:1946
109 Montgomery D. Chapter 6: The control chart for fraction non-conforming. Introduction to
Statistical Quality Control. New York: John Wiley and Sons, 1996
110 Benneyan J. Statistical quality control methods in infection control and hospital epidemiology.
Part II: Chart use, statistical properties and research issues. Infection Control and Hospital
Epidemiology 1998; 19:265 - 283
111 Hawkins D and Olwell D. Theoretical foundations of the CUSUM. Cumulative Sum Charts and
Charting for Quality Improvement. New York: Springer Verlag, 1998
112 Hawkins D and Olwell D. Introduction. Cumulative Sum Charts and Charting for Quality
Improvement. New York: Springer Verlag, 1998; 1 - 29
113 Kennett R and Zacks S. Chapter 11: Advanced methods of statistical process control. Modern
Industrial Statistics: Design and Control of Quality and Reliability. Belmont, CA: Duxbury Press,
1998; 360 - 407
114 Montgomery D. Chapter 7: Cusum and EWMA control charts. Introduction to Statistical Quality
Control. New York: John Wiley and Sons, 1996; 313 - 347
115 Poloniecki J, Valencia O and Littlejohns P. Cumulative risk adjusted mortality chart for detecting
changes in death rate: observational study of heart surgery. BMJ 1998; 316:1697 -1700
117 Tekkis PP, McCulloch P, Steger AC, Benjamin IS and Poloniecki JD. Mortality control charts for
comparing performance of surgical units: validation study using hospital mortality data. BMJ
2003; 326:786 - 791
118 Alemi F and Sullivan T. Tutorial on risk adjusted X-Bar charts: Applications to measurement of
diabetes control. Quality Management in Health Care 2001; 9:57 - 65
119 Alemi F and Oliver D. Tutorial on risk adjusted p charts. Quality Management in Health Care
2001; 10:1-9
120 Khuri S, Daley J, Henderson W, Hur K, Gibbs J, Barbour G, Demakis J, Irvin G, Stremple J,
Grover F, McDonald G, Passaro E, Fabri P, Spencer J, Hammermeister K and Aust J. Risk
adjustment of the postoperative mortality rate for the comparative assessment of the quality of
surgical care: Results of the National Veterans Affairs surgical risk study: Part 1. J Amer Coll
Surg 1997; 185:315-327
121 Daley J, Khuri S, Henderson W, Hur K, Gibbs J, Barbour G, Demakis J, Irvin G, Stremple J,
Grover F, McDonald G, Passaro E, Fabri P, Spencer J, Hammermeister K and Aust J. Risk
adjustment of the postoperative mortality rate for the comparative assessment of the quality of
surgical care: Results of the National Veterans Affairs surgical risk study: Part 2. J Amer Coll
Surg 1997; 185:328-340
122 Graham PL, Kuhnert PM, Cook DA and Mengersen K. Improving the quality of patient care using
reliability measures: A classification free approach. Stat Med; Submitted for publication March
2003
123 Clermont G, Angus DC, DiRusso SM, Griffin M and Linde-Zwirble WT. Predicting hospital
mortality for patients in the intensive care unit: a comparison of artificial neural networks with
logistic regression models. Crit Care Med 2001; 29:291 - 296
124 Glance LG, Osler T and Shinozaki T. Effect of varying the casemix on the SMR and W statistic.
Chest 2000; 117:1112-1116
125 Iezzoni L. Statistically derived predictive models: caveat emptor (Editorial). J Gen Intern Med
1999;14:388-389
126 Zhu HP, Lemeshow S, Hosmer DW, Klar J, Avrunin J and Teres D. Factors affecting the
performance of the models in the MPM II system and strategies of customization: A simulation
study. Crit Care Med 1996; 24:57 - 63
127 Maxwell S and Delaney H. The logic of experimental design: Threats to the validity of inferences
from experiments. Designing experiments and analyzing data: A model comparison perspective.
Belmont, Ca: Wadsworth Publishing Company, 1990; 25 - 32
128 Berenholtz S, Dorman T, Ngo K and Pronovost P. Qualitative review of ICU quality indicators. J
Crit Care 2002; 17:1-15
129 Sivak ED and Rogers MAM. Assessing quality of care using in-hospital mortality: Does it yield
informed choices? (Editorial). Chest 1999; 115:613 - 614
130 Sheldon T. Promoting health care quality: what role performance indicators? Quality in Health
Care 1998; 7:S45-50
131 Rosen AK, Ash AS, McNiff KJ and Moskowitz MA. The importance of severity of illness
adjustment in predicting adverse outcomes in the Medicare population. J Clin Epidemiol 1995;
48:631-643
132 Kutsogianis DJ and Noseworthy T. Quality of life after ICU. In: Ridley S, ed. Outcomes in
Critical Care. Oxford: Butterworth Heinemann, 2002; 139 - 168
134 Parsonnet V, Dean D and Bernstein A. A method of uniform stratification of risk for evaluating
the results of surgery in acquired adult heart disease. Circulation 1989; 79(S1):S3 - S12
135 Jones D, Copeland G and de Cossart L. Comparison of POSSUM with APACHE II for prediction
of outcome from a surgical high dependency unit. Br J Surg 1992; 79:1293 -1296
136 Orr R, Maini B, Sottile F, Dumas E and O'Mara P. A comparison of four severity adjusted models
to predict mortality after coronary artery bypass graft surgery. Arch Surg 1995; 130:301 - 306
137 Weightman W, Gibbs N, Sheminant M, Thackray N and Newman M. Risk prediction in coronary
artery surgery: a comparison of four risk scores. Medical Journal of Australia 1997; 166:408 - 411
138 Steiner S, Cook R and Farewell V. Risk adjusted monitoring of surgical outcomes. Medical
Decision Making 2001; 21:163-169
139 Gallivan S, Lovegrove J and Sherlaw-Johnson C. Letter. BMJ 1998; 317:1453
140 Steiner SH, Cook RJ and Farewell VT. Monitoring paired binary surgical outcomes using
cumulative sum charts. Statistics in Medicine 1999; 18:69 - 86
141 Cook D, Steiner S, Cook R, Farewell V and Morton A. Monitoring the evolutionary process of
quality: Risk adjusted charting to track outcomes in intensive care. Crit Care Med 2003
142 Steiner S, Cook R, Farewell V and Treasure T. Monitoring surgical performance using risk-
adjusted cumulative sum charts. Biostatistics 2000; 1:441 - 452
143 Hanson WC and Marshall BE. Artificial intelligence applications in the ICU. Crit Care Med 2001;
29:427 - 435
144 Maren A, Harston C and Pap R, eds. Handbook of Neural Computing. San Diego: Harcourt Brace
Jovanovich, 1990
146 Bishop C, ed. Neural Networks and Machine Learning. Berlin: Springer Verlag, 1998
147 Reed RD and Marks RJ. Neural Smithing: Supervised learning in feedforward artificial neural
networks. 1st ed. Cambridge, Massachusetts: MIT Press, 1999
148 Anderson J. Chapter 13: Nearest neighbour classifiers. An Introduction to Neural Networks.
Cambridge, Mass.: MIT Press, 1995
149 Maren A. NN structure: Form follows function. In: Maren A, Harston C and Pap R, eds.
Handbook of Neural Computing. San Diego: Harcourt Brace Jovanovich, 1990
150 Specht DF. A general regression neural network. IEEE Transactions on Neural Networks 1991;
2:568 - 576
151 Parzen E. Mathematical considerations in the estimation of spectra. Technometrics 1961; 3:167 -
190
152 Parzen E. On estimation of a probability density function and mode. Annals of Mathematical
Statistics 1962; 33:1065 - 1076
153 Floyd CE, Lo JY, Yun AJ, Sullivan DC and Kornguth PJ. Prediction of breast cancer malignancy
using an artificial neural network. Cancer 1994; 74:2944 - 2948
154 Ortiz J, Ghefter CGM, Silva CES and Sabbatini RME. One year mortality prognosis in heart
failure: A neural network approach based on echocardiographic data. JACC 1995; 26:1586 -1593
155 Selker HP, Griffin JL, Patil S, Long WL and D'Agostino RB. A comparison of performance of
mathematical predictive methods for medical diagnosis: Identifying acute cardiac ischaemia
among emergency department patients. J Investigative Med 1995; 43:468 - 476
156 Doyle HR, Parmanto B, Munro WP, Marino IR, Aldrighetti L, Doria C, McMichael J and Fung JJ.
Building clinical classifiers using incomplete observations - A neural network ensemble for
hepatoma detection in patients with cirrhosis. Meth Inform Med 1995; 34:253 - 258
157 Setiono R. Extracting rules from pruned ANN for breast cancer diagnosis. AI in Med 1996; 8:37 -
51
158 Itchhaporia D, Snow PB, Almassy RJ and Oetgen WJ. ANN: Current status in cardiovascular
medicine. JACC 1996; 28:515 - 521
159 Eisenstein EL and Alemi F. A comparison of 3 techniques for rapid model development: An
application in patient risk stratification. Proc Med Informat Ass 1996:443 - 447
160 Lette J, Colletti BW, Cerino M, McNamara D, Eybalin MC, Levasseur A and Nattel S. Artificial
intelligence vs logistic regression statistical modelling to predict cardiac complications after non -
cardiac surgery. Clin Cardiol 1994; 17:609 - 614
161 Doyle HR, Dvorchik I, Mitchell S, Marino IR, Ebert FH, McMichael J and Fung JJ. Predicting
outcome after liver transplantation. Ann Surg 1994; 219:408 - 415
162 Hamamoto I, Okada S, Hashimoto T, Wakabayashi H, Maeba T and Maeta H. Prediction of the
early prognosis of the hepatectomised patient with hepatocellular carcinoma with a neural
network. Comput Biol Med 1995; 25:49 - 59
163 Dombi GW, Nandi P, Saxe JM, Ledgerwood AM and Lucas CE. Prediction of rib fracture
outcome by an artificial neural network. J Trauma, Infection and Critical Care. 1995; 39:915 - 921
165 Izenberg SD, Williams MD and Luterman A. Prediction of trauma mortality using a neural
network. American Surgeon 1997; 63:275 - 281
166 Jefferson MF, Pendleton N, Lucas SB and Horan MA. Comparison of a genetic algorithm neural
network with logistic regression for predicting outcome after surgery for patients with non-small
cell lung carcinoma. Cancer 1997; 79:1338 -1342
167 Reibnegger G, Weiss G, Werner-Felmayer G, Judmaier G and Wachter H. Neural network as a
tool for utilizing laboratory information: Comparison with linear discriminant analysis and with
classification and regression trees. Proc. Natl. Acad. Sci. USA 1991; 88
168 Forsstrom JJ and Dalton KJ. Artificial neural network for decision support in clinical medicine.
Ann Med 1995; 27:509-517
169 Jorgensen JS, Pedersen JB and Pedersen SM. Use of neural network to diagnose acute myocardial
infarction: Methodology. Clin Chem 1996; 42:604 - 612
170 Brier ME and Aronoff GR. Application of artificial neural network to clinical pharmacology. Int J
Clin Pharm Therapeutics 1996; 34:510-514
171 Lippmann RP and Shahian DM. Coronary artery bypass risk prediction using neural networks.
Ann Thor Surg 1997; 63:1635 -1643
172 Orr RK. Use of a probabilistic neural network to estimate risk of mortality after cardiac surgery.
Med Dec Making 1997; 17:178 - 185
173 Buchman TG, Kubos KL, Seidler AJ and Siegforth MJ. A comparison of statistical and
connectionist models for the prediction of chronicity in a surgical ICU. Crit Care Med. 1994;
22:750 - 762
174 Mobley BA, Leasure R and Davidson L. Artificial neural network predictions of lengths of stay on
a post coronary care unit. Heart and Lung 1995; 24:251 - 256
175 Doig GS, Inman KJ, Sibbald WJ, Martin CM and Robertson JM. Modelling mortality in the ICU:
comparing the performance of a back propagation, associative learning neural network with
multivariate logistic regression. Proc. Ann. Symp. Computer Application in Med Care. 1994;
17:361-365
176 Dybowski R, Weller P, Chang R and Gant V. Prediction of outcome in critically ill patients using
artificial neural networks synthesized by genetic algorithm. Lancet 1996; 347:1146 -1150
177 Frize M, Ennett CM, Stevenson M and Trigg HCE. Clinical decision support systems for ICU:
Using artificial neural networks. Med Eng Physics 2001; 23:217 - 225
178 Nimgaonkar A, Sudarshan S and Karnad DR. Prediction of mortality in an Indian ICU:
Comparison between APACHE II and artificial neural network. (Hansraj Prize paper).
Proceedings of the Annual Scientific Meeting, Indian Society of Critical Care Medicine. 2001:43 -
46
179 Paetz J. Some remarks on choosing a method for outcome prediction (letter). Crit Care Med 2002;
30:724
181 Burges C. A tutorial on SVM for pattern recognition. Data Mining and Knowledge Discovery
1998;2:121-167
182 Campbell C. An Introduction to Kernel Methods. In: Howlett R and Jain L, eds. Radial Basis
Function Networks: Design and Application. Berlin: Springer Verlag, 2000
183 Cristianini N and Shawe-Taylor J, eds. An Introduction to Support Vector Machines and Other
Kernel-Based Learning Methods. 1st ed. Cambridge: Cambridge University Press, 2000
184 Scholkopf B, Burges C and Smola A. Introduction to support vector learning. In: Scholkopf B,
Burges C and Smola A, eds. Advances in Kernel Methods: Support Vector Learning. Cambridge:
MIT Press, 1999
185 Mattera D and Haykin S. SVM for dynamic reconstruction of a chaotic system. In: Scholkopf B,
Burges C and Smola A, eds. Advances in Kernel Methods: SVM Learning. Cambridge,
Massachusetts: MIT Press, 2000
186 Morik K, Brockhausen P and Joachims T. Combining statistical learning with a knowledge based
approach - A case study in intensive care monitoring.: http://www-ai.informatik.uni-
dortmund.de/DOKUMENTE/morik etal 99a.pdf. 1999
187 Morik K, Imhoff M, Brockhausen P, Joachims T and Gather U. Knowledge discovery and
knowledge validation in intensive care. AI in Med 2000; 19:225 - 249
188 Graham P and Cook D. Risk prediction using 30 day outcome: A practical endpoint for quality
audit. Submitted to Chest, January 2003
189 Graham PL, Kuhnert PM, Cook DA and Mengersen K. Improving the quality of patient care using
reliability measures: A classification tree approach. Stat Med 2003; Submitted for publication
March 2003
190 Ripley B. Statistical theories of model fitting. In: Bishop C, ed. Neural Networks and Machine
Learning. Berlin: Springer/NATO Scientific Affairs Division, 1998
192 Joachims T. Making large scale SVM learning practical. In: Scholkopf B, Burges C and Smola A,
eds. Advances in Kernel Methods: Support Vector Machine Learning. Cambridge: MIT Press,
2000
196 Jencks SF, Williams DK and Kay TL. Assessing hospital associated deaths from discharge data:
The role of length of stay and co-morbidities. JAMA 1988; 260:2240 - 2246
197 Glance LG and Szalados J. Benchmarking in critical care (editorial). Chest 2002; 121:326 - 328
198 Weston J, Gammerman A, Stitson M, Vapnik V, Vovk V and Watkins C. Support vector density
estimation. In: Scholkopf B, Burges C and Smola A, eds. Advances in Kernel Methods: Support
Vector Machine Learning. Cambridge: MIT Press, 2000
199 Sherlaw-Johnson C and Gallivan S. Approximating prediction intervals for use in variable life
adjusted displays. London: Clinical Operational Research Unit, Dept. Mathematics, University
College, 2000
200 Neal R. Assessing the relevance determination methods using DELVE. In: Bishop C, ed. Neural
Networks and Machine Learning: Springer-Verlag, 1998; 97-129
201 Kittler J, Pudil P and Somol P. Advances in statistical feature selection. In: Singh S, Murshed N
and Kropatsch W, eds. ICAPR 2001. Berlin: Springer-Verlag, 2001
202 Guyon I and Elisseeff A. An introduction to variable and feature selection. J Machine Leaming
Research 2003; 3:1157-1182
203 Montgomery D. Chapter 8: Other statistical process control techniques. Introduction to Statistical
Quality Control. New York: John Wiley and Sons, 1996
204 Montgomery D. Introduction to Statistical Quality Control. New York: John Wiley and Sons,
1996
205 Hawkins D and Olwell D. Cumulative Sum Charts and Charting for Quality Improvement. New
York: Springer, 1998
206 Kenett R and Zacks S. Modern Industrial Statistics: Design and Control of Quality and
Reliability. 1st ed. Belmont, CA: Duxbury Press, 1998