
The Development of Risk Adjusted

Control Charts and Machine Learning

Models to Monitor the Mortality Rate

of Intensive Care Unit Patients

David A. Cook
B.Med.Sc., M.B.B.S., F.A.N.Z.C.A., F.J.F.I.C.M.

Submitted for the Degree of Doctor of Philosophy

School of Information Technology and Electrical Engineering,


University of Queensland

Date of Submission: December 2003


Abstract

To promote the highest quality healthcare, it is necessary to monitor the outcomes of patient care. The tools to

monitor the outcomes of patient care are not optimal. This thesis develops risk adjusted control chart methods

for monitoring in-hospital mortality outcomes in Intensive Care Unit (ICU) patients. It is a medical

application of statistics and machine learning to measure outcomes for quality management. Three directions

of investigation are followed to achieve this.

The first is the assessment methods of ICU models that predict the probability of death. The desirable

attributes of model performance are discrimination, the ability to separate survivors and non-survivors, and

calibration, a measure of the extent to which a model's risk prediction represents the patient's actual risk of

dying. An independent assessment of the APACHE III model used in the Princess Alexandra Hospital ICU

demonstrated good discrimination and calibration, and so the model was validated in this context for

prediction of in-hospital mortality.

The second is a study of statistical process control charts for patient mortality rate. Risk adjusted control chart

techniques are subsequently developed to incorporate the validated APACHE III probability of death estimate

to control for casemix and severity of illness. The design and performance of these control charts are studied.

The third direction is the development of an alternative model for risk adjustment. The results of preparatory

experiments using machine learning techniques, artificial neural networks (ANNs) and support vector

machines (SVMs), were comparable to those previously obtained with logistic regression. SVM models are

further investigated to model 30-day in-hospital mortality, using raw patient data from the equivalent of one

year of patient admissions. Model development is successfully guided by the desirable attributes of model

performance: discrimination and calibration.

The conclusions of this study are: 1) risk adjusted control charting offers an adjunct to current methods of

ICU outcome assessment when monitoring the quality of care; 2) SVMs and ANNs are practical approaches

to model the probability of in-hospital mortality for ICU patients; 3) model development can be guided by

optimization of the model attributes of discrimination and calibration.


Declaration of Originality

The work presented in this thesis is, to the best of my knowledge and belief, original and my own work,

except as acknowledged in the text. The material has not been submitted either in whole or in part for a

degree at this or any other university.


David A. Cook
Acknowledgements

My supervisors:
Professor Thomas Downs
Professor Annette Dobson
A/Professor Chris Joyce

Professor Tony Morton for all his encouragement, inspiration and advice

Petra Graham for collaboration to establish 30-day mortality status and the simplified disease coding

Co-workers at the Princess Alexandra Hospital:


Rod Hurford who manages the patient database and clinical information system
Dr Robert Barnett, Dr Jerome Cockings, Dr Peter Kruger, Sean Birgan and all the nursing
staff and the doctors in-training who are involved in patient care

Gillian Ray-Barruel, and my sister, Janet Cook for assisting with editorial comment

Particular acknowledgement is due to my wife, Andrea, and our children, Sarah, Nicola and Matthew
Table of Contents
Abstract
Table of Contents
List of Figures
List of Tables
Publications Arising from Work Reported in this Thesis
Glossary and Abbreviations

Chapter 1: Motivation for Study and an Introduction to Risk Adjusted


Control Charting in the Intensive Care Unit (ICU)

1.1 Introduction
1.2 Project outline
1.3 The study approach and objectives
1.4 Perspective

Chapter 2: Review of Methods of Assessment of the Performance of


Models to Estimate Risk and Patient Outcome in ICU Patients

2.1 Introduction
2.2 Validity of ICU mortality prediction models
2.3 Assessment of validity and generalisability of models
2.4 Indices of performance
2.4.1 Discrimination
Covariance graphs
Statistical comparison of the risk of death of survivors and non-survivors
Classification matrices
Receiver operating characteristic (ROC) curve analysis
2.4.2 Calibration
Evaluation of overall model predictions
Calibration curves
Hosmer-Lemeshow statistics
Spiegelhalter's Z score
Model based analysis of performance
2.5 Recommendations for a practical approach to validation of ICU models that estimate
the probability of in-hospital death
2.6 Survey of models that estimate the probability of in-hospital death of ICU patients
2.7 Summary

Chapter 3: Performance of APACHE III models in an Australian ICU

3.1 Objective and outline


3.2 Introduction
3.3 Materials and methods
3.4 Results
3.5 Discussion
Chapter 4: Control Charts for Analysis of Outcomes in ICU

4.1 Introduction
4.2 Considerations in monitoring in-hospital mortality of ICU patients
4.3 Control chart analysis of ICU mortality
4.4 Application of control charts to PAH ICU data
4.4.1 p chart
Design of the p chart
4.4.2 Cumulative Sum chart (CUSUM)
CUSUM statistic
Testing the significance of the CUSUM statistic
4.4.3 Estimates of the current mean: Exponentially weighted moving average
(EWMA) chart
4.5 Search for a cause of decreased in-hospital mortality
4.6 Summary

Chapter 5: Risk Adjusted Control Charts for Analysis of Outcomes in ICU

5.1 Introduction
5.2 Use of risk adjusted methods in monitoring hospital mortality outcomes
5.2.1 Issues with monitoring risk adjusted mortality
5.2.2 Application of risk adjusted monitoring to ICU mortality
5.3 Risk adjusted p chart for mortality rate
5.3.1 Risk adjusted p chart with control limits calculated by normal
approximation
5.3.2 Control chart with standardized (Z) scores
5.3.3 Risk adjusted p chart: Control limits by iterative calculation of the
cumulative probability function of the mortality rate, Ri.
5.4 Risk adjusted CUSUM
5.4.1 Background to the risk adjusted CUSUM
5.4.2 Adaptation of Poloniecki approach
5.4.3 Testing for change in odds of risk adjusted mortality with the Steiner's RA
CUSUM.
5.5 Risk adjusted EWMA
5.5.1 Estimate of the distribution of the EWMA statistic using the central limit theorem
5.5.2 An iterative approach to the estimation of the distribution of the EWMA statistic
and calculation of control limits.
5.6 Summary and Conclusion

Chapter 6 Introduction to Machine Learning: Support Vector Machine and


Artificial Neural Networks Modelling ICU Patient Outcome

6.1 Application of machine learning to classification and regression problems to predict


ICU mortality
6.1.1 Artificial neural networks (ANN)
6.1.2 Review of the use of ANN in the ICU
6.1.3 Support vector machines (SVM)
SVM classification
SVM regression
Selection of kernel and SVM parameters
Quadratic programming implementation
6.1.4 Review of the use of SVM in the ICU
6.2 Classification and regression modelling to predict ICU patient in-hospital mortality
using ANNs and SVMs
6.2.1 Criteria for choice of models
6.2.2 Software
6.2.3 Data summary
Outcome definition
Variable selection and preprocessing
6.2.4 Classification
SVM classification
ANN classification
Discussion of classification results
6.2.5 Regression
SVM regression
ANN regression
Discussion of regression results
6.3 Conclusion

Chapter 7 Modelling ICU Patient Outcome with Regression Support Vector


Machine

7.1 Overview
7.2 Method
7.3 Data
7.3.1 Three examples of pre-processing variables: worst temperature, worst mean
blood pressure, worst bilirubin
7.4 Software
7.5 SVM parameter choice guided by model attributes of discrimination and calibration
7.6 Choice of SVM kernel and parameters
7.6.1 Estimation of SVM parameter choice for RBF kernel SVM
7.6.2 Estimation of SVM parameter choice for polynomial kernel SVM
7.6.3 Summary of the approximation of SVM kernel and parameter selection
7.6.4 Investigation of values around estimated SVM parameters
7.7 Discussion
7.8 Conclusions

Chapter 8 Summary, Conclusions and Future Work

8.1 Summary
8.2 Original contributions
8.3 Future research
8.4 Conclusion
Appendices

Appendix 1 Data summary of admissions to PAH ICU 1995 - 1999


A1.1 Data collection and sample description

Appendix 2 Analysis of mortality rate observations PAH ICU 1995 - 1997, and
estimation of in-control parameters
A2.1 Data and methods
A2.2 Results
A2.3 Conclusion

Appendix 3 Analysis of performance of p chart.


A3.1 Assumptions about the nature of the change
A3.2 Operating characteristic of the p chart
A3.3 Effect of sample size on p chart performance
A3.4 Summary of analysis

Appendix 4 Statistical analysis of the CUSUM chart


A4.1 Assumptions
A4.2 Statistical tests
A4.3 Definitions and notation
A4.4 Analysis of performance of CUSUM charts: choice of design parameters.
A4.5 In control ARL
A4.6 ARL where mortality rate has changed
A4.7 Summary

Appendix 5 Analysis of performance, choice of parameters and control limits for the
EWMA chart.
A5.1 Effect of a on in-control ARL
A5.2 Effect of λ on ARL
A5.3 Effect of changed mortality rate on ARL
A5.4 Choice of parameters, and analysis of performance of EWMA of single
cases

Appendix 6 Characteristics of the distribution of observed mortality rates


A6.1 Notation and approximate distribution
A6.2 Exact method: Iterative approach to define the distribution of mortality rates

Bibliography
List of Figures
1.1 Outline of the project structure

1.2 The relationship between patient factors, the process of care, random variation and patient
outcomes

2.1 Example of calibration curve for APACHE III hospital mortality model

3.1 Calibration curve for the APACHE III ICU mortality models with no adjustment for hospital
characteristics
3.2 Calibration curve for the APACHE III ICU mortality models, with adjustment for hospital
characteristics (similar hospital model)
3.3 Calibration curve for the APACHE III hospital mortality models with no adjustment for
hospital characteristics
3.4 Calibration curve for the APACHE III hospital mortality models with adjustment for hospital
characteristics (similar hospital model)
3.5 Calibration curve for the APACHE III hospital mortality models adjusted for USA standard
ICU

4.1 p chart PAH ICU mortality rate 1995 - 1997 by month.


4.2 p chart PAH ICU mortality rate 1998 - 1999 by month.
4.3 p chart PAH ICU mortality rate 1995 - 1999 by month.
4.4 CUSUM accumulating observed minus target mortality rate PAH ICU 1998 - 1999
4.5 Z score CUSUM accumulating observed minus target mortality rate PAH ICU 1998 - 1999
4.6(a) C+ and C- in blocks of 100 admissions, PAH ICU 1998 - 1999: target mortality rate = 0.16
4.6(b) C+ and C- in blocks of 100 admissions, PAH ICU 1998 - 1999: target mortality rate = 0.13
4.7(a) EWMA chart PAH ICU 1998 - 1999: target mortality rate = 0.16
4.7(b) EWMA chart PAH ICU 1998 - 1999: revised mortality rate to 0.141
4.8(a) EWMA chart PAH ICU 1998 - 1999: target mortality rate = 0.16, single case
4.8(b) EWMA chart PAH ICU 1998 - 1999: target mortality rate = 0.16, single case, reset
4.9 Percentage of elective surgery, emergency surgery and non-operative cases

5.1 RA p chart of hospital mortality, PAH ICU 1995 - 1999


5.2 Power analysis of RA p chart
5.3 Effect of changing odds of death on ARL of RA p chart
5.4 Z score control chart
5.5 RA p chart, as quantile of cumulative probability function of mortality rate
5.6 Cumulative plot of expected deaths minus observed deaths, PAH ICU 1995 - 1999
5.7 Cumulative RA mortality chart with control limits
5.8 Effect of a on ARL of Poloniecki CUSUM
5.9 Effect of choice of A on ARL of RA CUSUM
5.10 Effect of changed odds ratio on ARL of RA CUSUM
5.11 Effect of choice of h on ARL of RA CUSUM
5.12 Effect of λ on ARL of RA EWMA over a range of odds ratios
5.13 RA EWMA chart, PAH ICU 1995 - 1999
5.14 RA EWMA chart with reset after signal, PAH ICU 1995 - 1999
5.15 RA EWMA with iterative characterization of distribution of EWMA
5.16 Plot of the value of cumulative probability of RA EWMA

6.1 Architecture of 2 layer multi-layer perceptron


6.2 Output of a hidden unit as a function of a sum of weighted inputs
6.3 Architecture of radial basis function network
6.4 Architecture of a generalized regression neural network
6.5 Simple classification of stars and circles, showing the margin either side of the optimal hyperplane
6.6 ε: regression tube precision (or width)
6.7 All ICU patients and subsets of patients who died
6.8 Correct classification rate for RBF SVM
6.9 Area under the ROC curve for RBF SVM
6.10 Effect of training set size on correct classification rate for polynomial kernel SVM
6.11 Effect of training set size on area under the ROC curve for polynomial kernel SVM
6.12 Discrimination of MLP ANN with a range of test set sizes
6.13 Surface plot of the average area under the ROC Curve for RBF Kernel SVM
6.14 Area under the ROC curve polynomial kernel SVM, d = 1,2,3 and range of S
6.15 Discrimination of SVM GRNN models
6.16 Calibration of SVM GRNN models

7.1 Histogram of worst temperature values


7.2 Scatter plot of in-hospital mortality and temperature
7.3 Histogram of worst mean arterial pressure
7.4 Scatter plot of 30 day in-hospital mortality and worst mean blood pressure
7.5 Histogram of worst bilirubin
7.6 Scatter plot of 30 day in-hospital mortality and log( 10) bilirubin
7.7 Area under the ROC curve for SVM RBF kernel with a range of γ
7.8 H-L C statistic for SVM RBF kernel with a range of γ
7.9 Area under the ROC curve for SVM RBF kernel with a range of ε
7.10 H-L C statistic for SVM RBF kernel with a range of ε
7.11 Area under the ROC curve for SVM RBF kernel with a range of C
7.12 H-L C statistic for SVM RBF kernel with a range of C
7.13 Area under the ROC curve for polynomial kernel (d = 1-3) with a range of S
7.14 H-L C statistic for polynomial kernel (d = 1-3) with a range of S
7.15 Area under the ROC curve for polynomial kernel (d = 1-3) with a range of C
7.16 H-L C statistic for polynomial kernel (d = 1-3) with a range of C
7.17 Surface plot of average area under the ROC curve for RBF SVM
7.18 Surface plot of average H-L C statistic for RBF SVM

A2.1 Normal probability plot of monthly mortality rates


A2.2 Auto-correlation coefficients for monthly mortality rates
A2.3 Normal probability plot of mortality rates in blocks of 50 admissions
A2.4 Auto-correlation coefficients for mortality rates in blocks of 50 admissions
A2.5 Normal probability plot of mortality rates in blocks of 100 admissions
A2.6 Auto-correlation coefficients for mortality rates in blocks of 100 admissions

A3.1 Operating characteristic of p chart with range of control limits for a change in mortality rate
A3.2 Semi-log plot of ARL to signal of p chart with a range of control limits for a change in
mortality rate
A3.3(a) and (b) ARL of p chart with effect of variable sample size

A4.1 In control ARL for a range of h


A4.2 In control ARL for a range of h and monitoring for a selection of mortality rates upper and
lower CUSUM
A4.3 In control ARL for a range of h and monitoring for a selection of mortality rates upper or
lower CUSUM
A4.4 ARL to signal with altered mortality rate

A5.1 Effect of control limit width on ARL on EWMA


A5.2 Effect of λ on ARL of EWMA
A5.3 Effect of changing mortality rate on ARL of EWMA
A5.4 Effect of λ on ARL of EWMA at a range of mortality rates

A6.1 Models of the cumulative probability distribution of Ri


List of Tables
2.1 Summary of the methods of assessment of performance of ICU mortality models

3.1 Comparison of demographics, operative status and APACHE III score between PAH ICU
admissions and APACHE III developmental sample
3.2 Predicted APACHE III hospital mortality compared to observed hospital mortality
3.3 Predicted APACHE III ICU mortality compared to observed mortality
3.4 Calibration of APACHE III ICU mortality and hospital mortality

4.1 Comparison of PAH ICU casemix for the years 1995 - 1999

6.1 Mapping of APACHE III disease groups to disease category


6.2 Discrimination and area under the ROC curve for various models
6.3 Discrimination and calibration results of ANN MLP GRNN

7.1 Variable descriptions and transformations

A1.1 Demographic features of the primary admissions to PAH ICU 1995 - 1999
A1.2 Admissions grouped by month of admission for PAH ICU 1995 - 1999
A1.3 Admissions grouped into ordered blocks of 50 admissions 1995 - 1999

A3.1 Probability of signal for a range of control limit settings and changed mortality rates
A3.2 Semi-log plot of ARL in samples of 87 cases for a range of control limit settings and changed
mortality rates
Glossary and Abbreviations

Acute Physiology and Chronic Health Evaluation II (APACHE II): A score measuring severity of

illness calculated from patient physiology and laboratory data available from the first 24 hours of ICU

admission. The APACHE II score is used with diagnosis to give the APACHE II model that estimates

the probability of in-hospital death for an ICU patient. This model is publicly available and was

published in 1985 by Knaus and co-workers, and is the forerunner to the APACHE III model (see

below).

Acute Physiology and Chronic Health Evaluation III (APACHE III): A system based on patient data

and logistic regression models to predict ICU patient mortality and length of stay, developed by Knaus

and co-workers in 1991 and marketed by APACHE Medical Systems. The system includes a

database structure which guides collection of ICU activity data, and patient demographic, laboratory,

co-morbidity and outcome data. A measure of patient severity of illness, the APACHE III score is

calculated from the patient data. The APACHE III score, diagnosis, lead-time to admission and other

variables are used to estimate the risk of in-ICU and in-hospital mortality. The risk of death model is

further modified with information about the hospital to give an APACHE III risk of death estimate

adjusted for hospital characteristics ("similar hospital").

Artificial Neural Network (ANN): A machine learning method which has the architecture of a network

of interconnected simple processors (neurons) which learns patterns in the training data often by

adjusting the weights that connect the units.

Average Run Length (ARL): The average number of observations before the control chart statistic lies

beyond specified control limits.
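The ARL can be made concrete with a short simulation. The sketch below is not from this thesis: it estimates the ARL of a simple (non risk adjusted) p chart with three-sigma limits, where the target rate, sample size and shifted rate are arbitrary assumed values chosen only for illustration.

```python
import random

def run_length(p, n, p0, z=3.0, max_obs=100000):
    """Number of samples of size n until the observed mortality rate
    falls outside p0 +/- z * sqrt(p0 * (1 - p0) / n), i.e. until the
    chart signals. p is the true underlying mortality rate."""
    sd = (p0 * (1 - p0) / n) ** 0.5
    lo, hi = p0 - z * sd, p0 + z * sd
    for t in range(1, max_obs + 1):
        rate = sum(random.random() < p for _ in range(n)) / n
        if rate < lo or rate > hi:
            return t
    return max_obs

random.seed(1)
reps = 200
# ARL with the process in control (true rate equals the target)...
arl_in = sum(run_length(0.16, 100, 0.16) for _ in range(reps)) / reps
# ...versus the ARL after the mortality rate has shifted upward.
arl_out = sum(run_length(0.28, 100, 0.16) for _ in range(reps)) / reps
print(arl_out < arl_in)  # a shifted rate signals much sooner
```

A long in-control ARL (few false alarms) and a short out-of-control ARL (rapid detection) are the competing design goals of the charts studied in this thesis.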

Calibration: With reference to the ICU mortality prediction models, calibration is how well the

estimated probability of death provided by a model reflects the patient's probability of death.
Confidence Interval (CI): This interval provides a range of values around an estimate where the "true"

value is located with a given level of certainty.

Correct Classification Rate (CCR): The proportion of cases that are correctly classified as deaths or

survivors. It equals the number of correctly classified deaths plus correctly classified survivors, divided
by the total number of cases.
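A minimal illustration of the CCR calculation, using hypothetical outcome labels rather than thesis data:

```python
def correct_classification_rate(observed, predicted):
    """Proportion of cases whose predicted class (1 = died, 0 = survived)
    matches the observed outcome."""
    return sum(o == p for o, p in zip(observed, predicted)) / len(observed)

# Hypothetical outcomes and model classifications for eight patients.
observed  = [1, 0, 0, 1, 0, 0, 1, 0]
predicted = [1, 0, 1, 1, 0, 0, 0, 0]
print(correct_classification_rate(observed, predicted))  # 0.75
```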

Discrimination: With reference to the ICU mortality prediction models, discrimination is the ability of

a model to provide risk of death estimates that separate survivors from the non-survivors.

Evidence Based Medicine: The integration of best research evidence with clinical expertise and patient

values (http://www.cebm.utoronto.ca).

Generalisation Performance: A model's prediction capacity on independent data, or its ability to

provide accurate predictions for a new sample of patients.

Glasgow Coma Score (GCS): A score used to measure level of consciousness, assessed by the best

verbal response, eye response and upper limb motor response to a noxious stimulus.

Intensive Care Unit (ICU): A hospital ward where patients with life-threatening conditions are

managed, usually requiring support for organ failure, or intensive monitoring.

Logit Transformation: Logit transformation is a transformation of the probability of an event, equal to

the natural log of the odds.

Mortality Probability Model II (MPM II): A model that predicts risk of in-hospital death using data

available at the time of ICU admission (MPM0 II), at 24 hours post admission (MPM24 II) and at 72

hours post ICU admission (MPM72 II).

Multi-Layer Perceptron (MLP): A type of artificial neural network. It is described fully in Chapter 6.
Odds: The odds of an event is the ratio of the probability that the event will occur to the

probability that it will not occur.

Odds Ratio (OR): The ratio between two sets of odds. In this application, the OR is the ratio of the odds

of death estimated by a risk of death model to alternative odds of death where the risk of death has

changed.
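The odds, odds ratio and logit transformation defined above can be sketched in a few lines; the baseline probability and odds ratio below are assumed values for illustration, not figures from this thesis:

```python
import math

def odds(p):
    """Odds of an event with probability p: p / (1 - p)."""
    return p / (1 - p)

def logit(p):
    """Logit transformation: the natural log of the odds."""
    return math.log(odds(p))

# A doubling of the odds of death (odds ratio = 2) applied to an
# assumed baseline risk of 0.16.
p_baseline = 0.16
shifted_odds = 2.0 * odds(p_baseline)
p_shifted = shifted_odds / (1 + shifted_odds)  # back to a probability
print(round(p_shifted, 3))  # 0.276
```

Note that doubling the odds does not double the probability: the back-transformed risk rises from 0.16 to about 0.28.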

Princess Alexandra Hospital (PAH): A large, tertiary referral public hospital of approximately 800

adult beds in Brisbane, Queensland, Australia.

Prospective Validation: Statistical validation of a model after its development using planned,

prospective collection of data. This is in contrast to using a subdivision of existing data into samples

used for model development and for model testing.

Risk Adjustment (RA): A method of controlling for the casemix and severity of illness of a sample of

patients, using a statistical model to estimate the probability of patient death.

Receiver Operating Characteristic Curve (ROC curve): A plot of sensitivity against (1 - specificity).

The area under the ROC curve provides a summary statistic of discrimination.
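The area under the ROC curve equals the probability that a randomly chosen non-survivor is assigned a higher predicted risk than a randomly chosen survivor, which permits a simple rank-based calculation. The risk values below are hypothetical:

```python
def auc(risks_died, risks_survived):
    """Area under the ROC curve, computed as the proportion of
    (non-survivor, survivor) pairs in which the non-survivor received
    the higher predicted risk; ties count one half."""
    pairs = len(risks_died) * len(risks_survived)
    wins = sum((d > s) + 0.5 * (d == s)
               for d in risks_died for s in risks_survived)
    return wins / pairs

# Hypothetical predicted risks of death for the two outcome groups.
died = [0.9, 0.7, 0.4]
survived = [0.2, 0.4, 0.1, 0.3]
print(round(auc(died, survived), 3))  # 0.958
```

An area of 0.5 indicates no discrimination and 1.0 indicates perfect separation of survivors from non-survivors.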

Re-sampling: Repeated trials of model building using different random samples from the data available

for model building and cross-validation. At each re-sampling different data sets are used for training

and assessment.

Simplified Acute Physiology Score II (SAPS II): A model that predicts in-hospital mortality of ICU

patients.

Signal: In the control charts used in this thesis, a signal occurs when an observation or statistic exceeds

a control limit or decision threshold, indicating the likelihood that the process is not in-control.

Standardised Mortality Ratio (SMR): The ratio of observed deaths to predicted deaths.
Support Vector Machine (SVM): A machine learning system that uses a hypothesis space of linear

functions in high dimensional feature space. It is trained with a learning algorithm from optimisation

theory, to implement a learning bias derived from statistical learning theory.

Test Set: The test set is a sub-set of the data available for model building and assessment which is used

to assess the performance of the model. It is not used to train or optimise the model.

Training Set: For model building, the training set is the data that is used to train the model.

Verification Set: For training ANN, the verification set of the data available for model building is used

to monitor the generalisation performance of the ANN as it is trained. Training is ceased when the

performance of the ANN on the verification set begins to deteriorate.


Chapter 1

Motivation for Study and an Introduction to Risk

Adjusted Control Charting in the Intensive Care Unit.

The project is a medical application of statistics and machine learning to quality management. The aim

of the first chapter is to set the context, justification and aims of this work.

Evidence-based medicine is the beginning of high quality care. However, applying the best evidence to

medical practice without monitoring the outcomes is, perhaps, like playing music without listening or

painting without looking. The score, the subject or objective of our work, becomes irrelevant unless we

interpret with critical observation and regard for the result.

The purpose of this project is to present a basis for monitoring outcomes in intensive care units (ICUs). I

will develop risk adjusted (RA) control charting to monitor in-hospital mortality rates. In this work, the

term "risk adjustment" refers to a method which controls for casemix and severity of illness of a patient

sample, using a statistical estimate of the probability of patient mortality. A machine learning model

developed with local Princess Alexandra Hospital (PAH) ICU data will be studied as an alternative to

the APACHE III ICU patient model to predict the probability of patient death.

An editorial call-to-arms by Benneyan and Borgman in 2003 summarised the burgeoning interest in

this area.

"Regardless of the exact method employed, the application of statistical process control... or

related longitudinal analysis methods can significantly improve the ability to monitor clinical

process and outcomes. Incorporation and adaptation of RA ... represent important

contributions to their use in health care. Fostering greater and more widespread use of these

methods, however, remains a significant challenge. Hopefully, studies ... will lead to more

awareness of their value for contributing to a safer healthcare system."


1.1 Introduction

In 1996, the senior staff at the PAH ICU were provided with a unique opportunity. Recruitment and

increases in senior staff levels followed a period of inadequate senior staff numbers, and provided an

opportunity to introduce changes in practice. Adequate staffing meant that a broad review to establish a

uniform, accepted set of approaches and intensive care practices was possible. What began as a team

building exercise resulted in codified, uniform standards of practice.

We believed, without proof, that evidence-based review of practice and standardisation of uniform

approaches would improve patient outcomes, increase efficiency and increase staff satisfaction. So, a

systematic and ongoing multi-disciplinary review of all ICU practices began with a description of the

current procedures. It progressed to a review of the evidence and adoption of a consensus of agreed

methods and guidelines.

Evidence-based changes in practice may not necessarily transfer from the literature to the ICU ward.

There were potential problems of applicability and relevance. Often the research evidence was

collected from patient groups who were different from our ICU patients, or perhaps represented only a

small subset of the whole patient population. The evidence of benefit for our patients was frequently

not very strong, perhaps from a controversial meta-analysis, a post hoc subgroup analysis, or an

underpowered study with an "almost significant" trend. Sometimes the evidence for continuing or

changing practice was only opinion, experience, anecdote or based on a local audit or un-referenced

local guidelines. Occasionally the best evidence was from in vitro or physiological studies.

A programme of review and of improvement began without established methods in place to

demonstrate the effectiveness or otherwise of the changes. It seemed plausible that the evolution of

better patient outcomes could be recorded. De Leval and co-workers suspected an unacceptable rate

of mortality and complications in neonatal congenital heart surgery, and demonstrated that monitoring

with control charts would have detected poor performance. If this analysis could be done
retrospectively on mortality data, perhaps a similar outcome analysis could be used to guide changes at

the PAH ICU.

The APACHE Medical Systems APACHE III Version 1, introduced to the PAH in August 1994,

was the database platform used for data collection. The software had user defined analysis capability

based on local data with descriptive statistics and standardised mortality ratio (SMR) comparisons

calculated from observed and expected mortality rates. The APACHE III model was developed from a

North American database and had not been validated beyond the model development sample. At the

start of 1996, 1600 patients had been entered into the PAH ICU admission database.

In an ICU with a commitment to change, the purpose of a monitoring procedure was to recognise poor

performance in a timely way, and to recognise the gradual improvements in outcome that may occur in

a complex, evolutionary system. Random variations were inevitable, so a formal statistical approach to

quantify the likelihood of chance events was required.

After the practice and management changes commenced, the preliminary reports showed a trend

toward an increase in ICU patient mortality and SMR. This apparent increase nearly stopped the

process of change, though we elected to continue with the ongoing review and the commitment to

standardised best practice. The reports of SMRs were reviewed monthly, quarterly or semi-annually.

This approach, however, was inadequate, incomplete and not timely.

In 1997, I introduced charts to the ICU, plotting observed against predicted outcomes following the

"Variable Life Adjusted Display" chart in cardiac surgery. Qualitative analysis of RA outcomes now

supplemented the SMR results. However, three issues prevented further development of this analysis at

the PAH ICU. There had been no independent validation of the APACHE III model. The early charts

were plagued by incomplete data. The formal statistical methods to analyse RA data in ICU were not

available. The solution to these problems, and others, is the topic of this thesis.
1.2 Project Outline

This thesis provides an approach to the validation and application of RA tools to monitoring outcomes

in an ICU. There is particular reference to the techniques of control charting and analysis, adapted from

the process control techniques of industry. There are three key areas:

1. Assessment and validation of models to predict mortality in an ICU

2. Statistical aspects of RA control charts

3. Development of a machine learning RA model for ICU mortality

After this introductory chapter, the second chapter reviews the methods to assess the performance of

models that estimate the probability of death in ICU patients. The third chapter uses the best approach

to validation to assess the performance of the APACHE III models at the PAH, Brisbane. The fourth

chapter applies a selection of established control chart approaches to the ICU's raw mortality statistics.

The fifth chapter extends this application, introducing and developing RA control chart methods for use

in monitoring ICU mortality, using the APACHE III system estimates as the RA tool. The sixth and

seventh chapters review the application of machine learning techniques to modelling ICU outcome, and

develop artificial neural networks and support vector machine models to predict ICU mortality. The

final chapter presents a summary and suggestions for areas of further research.
Figure 1.1: Outline of the Project Structure

Chapter 1: Introduction, prologue and rationale
Chapter 2: Approaches to validation of models that estimate patient probability of death
Chapter 3: Prospective validation of APACHE III at the PAH ICU
Chapter 4: Application of process control charting to ICU mortality statistics
Chapter 5: Development of risk adjusted process control techniques to monitor quality of care in the ICU
Chapter 6: Machine learning to model 30-day in-hospital mortality of ICU patients
Chapter 7: Development of support vector machines for estimating probability of death, as an
alternative risk adjustment model to APACHE III
Chapter 8: Summary and Conclusions

Figure 1.1 displays the logical relationship between the parts of this work.

Some of the working, analysis, tables, summaries and reference formulae and derivations are included

in the appendices.
1.3 The Study Approach and Objectives

The framework to conduct this research is based on a model that extends Iezzoni's "Algebra of Risk".

She proposed that health outcomes depend on a combination of patient factors, treatment effectiveness

and random events. I have expanded the model to include the process of care as part of the relationship.

This model proposes that variation in the quality of the process of care will affect the relationship

between patient factors and the outcomes. Theframeworkfor the present study, is based on this

proposed model, and is presented below in Figure 1.2.

Figure 1.2: The relationship between patient factors, the process of


care, random variation and patient outcomes.

[Diagram: patient factors lead, through the process of care, to patient outcomes, with random variation acting at each stage.]

At presentation, patients have characteristics that predispose them to their outcome. Some of these

characteristics can be measured or recorded including diagnosis, age, co-morbid conditions and

physiological state (severity of illness). ICU mortality prediction models '^'*'^ use information that is

available within the first 24 hours of admission as the patient's initial response to therapy unfolds. For

example, for the same traumatic event needing ICU care, a fit young patient with no co-morbidity will

be less likely to die than an elderly, infirm patient with numerous serious, co-morbid conditions.

Imprecision, the influence of unmeasured factors including quality of care, and random variation in

measurement, therapy and data collection, will introduce uncertainties in the predictions. A good model
will not only discriminate between the likelihood of death and survival, but will provide a risk estimate

that reflects the patient's probability of death.

This thesis begins by reviewing methods to evaluate and validate an ICU mortality prediction model.

The performance of the APACHE III system is then critically evaluated. If the validity of the APACHE

III model in the PAH ICU is established, then the APACHE III model can be used as a RA tool and RA

outcome monitoring for the ICU can be developed. Subsequently, alternative models to APACHE III

can be developed according to the desirable performance characteristics.

In the absence of other changes, if the quality of patient care in an ICU deteriorates, then the RA

mortality rate could be expected to rise. A previously accurate model would then under-estimate the

true risk of patient mortality. Similarly, if the quality of care increases, then the RA mortality rate will

fall and so the model will overestimate the probability of death.

A recent publication by Spiegelhalter et al. '^ suggests that a RA analysis would have shown a

deviation from expected performance in the paediatric cardiac surgical data of the Bristol Infirmary

(1984 - 1995) and in the deaths of elderly patients under the care of Harold Shipman (1979 - 1997).

RA monitoring and detection of poor performance may have reduced the number of fatalities in these

two famous cases.

It is possible to improve the performance of the ICU prediction models by collecting information about

the progress of the patient beyond the initial day of ICU care '* by following the progress of organ

failure, complications and the patient's response to therapy. However, improving the accuracy of the

prediction using variables that reflect therapeutic complications and the patient's progress beyond the

initial period of observation, may limit the ability to detect changes in RA mortality rate due to

variation in the quality of the care. For this reason, only the information from the initial day of ICU

care is used in this thesis.

When variation between actual and predicted outcomes is detected it could be due to random variation,

or could signal a change in the modelled relationship between the patient factors and survival.
Statistical analysis can assess the probability of the differences being due to random variation. When a

model no longer fits the data, a search for the cause, and an appraisal of the monitoring procedure,

including the model itself, should be undertaken.

Lim '^ described the issues that face the use of statistical process control charts in clinical practice.

Casemix or RA must be incorporated to make raw data interpretable. The techniques must be able to

detect small true shifts in the mortality rate because small changes in mortality are important. In the

medical setting there are relatively few patients compared to the large number of items in industrial

applications for which control charts were originally developed. Techniques requiring the accumulation

of large samples are not applicable.

My work is a real application of process monitoring and demonstrates that these challenges are soluble,

and that RA charting is a useful adjunct to quality measurement and the monitoring of mortality

outcomes in the ICU. To begin, I apply standard process control methods to analysis of the raw

mortality data of all eligible patients admitted to the ICU. The APACHE III model is incorporated into

the chart design as a RA tool and so RA outcome monitoring is developed. Analysis and understanding

of the performance of these techniques in terms of false alarms and detection of a change in mortality is

the basis for design. RA monitoring developed in the following chapters will provide information about

the changes in mortality rate and quality of care.

Ultimately, it may be possible to track both poor performance, and the evolutionary and incremental

progress of improvement of quality of care.

When an externally developed model of ICU mortality fails local independent validation, or a current

model no longer fits the data, the model needs to be revised to take account of patient factors.

Preliminary modelling on the PAH ICU dataset using machine learning techniques (artificial neural

networks and support vector machines: SVMs) demonstrated comparable performance to previous

logistic regression. SVMs have not been studied in application to estimating probability of ICU

mortality, so further SVM model development is conducted to estimate the probability of ICU patient

30-day in-hospital mortality using transformed, raw patient data.


The balance between minimum acceptable model performance and the size of the developmental

dataset is addressed and the solution provides adequate model performance on a practical dataset size.

In Chapter 6, the general regression neural network is used to correct the poor calibration of the multi-layer perceptron artificial neural networks and SVMs. In Chapter 7, SVM parameter choice, guided by

model discrimination (maximizing area under the ROC curve) and calibration (minimizing H-L C

statistic) on test data is used to develop SVMs that model raw patient data to approximate the

probability of death.

1.4 Perspective
The potential for RA outcome monitoring goes beyond the application I describe in the following

chapters. I have chosen to work on RA charting in an ICU, because the care of critically ill patients is

my area of professional expertise. However, the solutions and methods that I have developed in the

PAH ICU context are widely applicable. Recent publications from the United Kingdom's Medical

Research Council Unit in Cambridge (Spiegelhalter et al^^) and the United States Department of

Veterans Affairs (Render et al. '^) indicate the wide interest in this topic, and show what a large

experienced team could achieve in the development and application of RA analysis to improve clinical

care.

My work uses a retrospective analysis to track changes in RA outcome, and proposes novel techniques

for the measurement of quality in the ICU. This is important, as it documents the magnitude of

improvement in patient survival. The change is of the order of 20 additional ICU survivors from the

PAH ICU each year over the period of analysis.

In industry, the introduction of statistical process control after 1945 caused a revolution as the quality-productivity dilemma (that increasing production results in inferior quality) was disproved '^. Near real-time monitoring of the quality of care in hospitals using RA charting methods may offer similar

improvements in the delivery of health care.



Chapter 2

Review of Methods of Assessment of the Performance

of Models to Estimate Risk and Patient Outcome in

Intensive Care Unit Patients

2.1 Summary
Models that estimate the probability of death of intensive care unit (ICU) patients can be used to

stratify patients according to the severity of their condition and to control for casemix and severity of

illness. The process of risk adjustment (RA) is needed in quality monitoring, research, administration,

management, and as an aid to clinical decision making '*''*". Models such as the Mortality

Prediction Model (MPM) family ^'^\ SAPS II ^^', APACHE I I ' , APACHE III ^ and the organ system

failure models ^^'^^ provide estimates of the probability of in-hospital death of ICU patients.

This chapter considers the validity of models that estimate the risk of death of ICU patients. It also

considers, in detail, methods to assess the performance of these models. The key attributes of a model

that accurately estimates the probability of death are discrimination (the accuracy of the ranking in

order of probability of death) and calibration (the extent to which the model's prediction of probability

of death reflects the true risk of death). These attributes should be assessed in existing models that

predict the probability of patient mortality, and in any other model that is developed for the purposes of

estimating these probabilities.

The literature contains a range of approaches which are reviewed, and a survey of the methodologies

used in the assessment of ICU mortality models is presented in Table 2.1 at the end of this chapter. A

straightforward method is described to assess existing models and to assist with development of new

models.

In Chapter 3, the performance of the APACHE III ^'^ system at the Princess Alexandra Hospital

(PAH) ICU is assessed in detail to evaluate its potential as a RA tool. The characteristics of

discrimination and calibration studied in this chapter will guide the development of novel models using

machine learning techniques and the assessment of their performance as tools in control chart analysis of ICU

outcomes.

For the purposes of this study, RA is considered as a method for controlling for the variation in patient

risk of mortality, or adjusting for casemix and illness severity.

2.2 Validity of ICU Mortality Prediction Models


Before the statistical validity of mortality models' prediction is reviewed, the underlying assumptions

on which these models depend must be considered.

The available models were conceived through discussions with experts. The most contributory sub-set

of those variables initially collected was used for modelling. The patient related variables were drawn

from the domains of the patient's acute physiological disturbance, physiological reserve, diagnosis or

procedure. For modelling ICU death the importance of these factors has been consistently demonstrated

16,24-26, and variables such as physiology, investigations, age, co-morbidities, diagnosis and lead-time are

used in all models of this type. The consistency of their use supports the construct validity of

this approach to models for predicting ICU mortality and makes clinical sense.

External validity refers to the accuracy of the predictions as estimates of the probability of in-hospital

death. It is this statistical, objective evaluation of the accuracy of the model that is the focus of

performance assessment. The generalisability of the model's performance is its predication capacity on

independent data not used for model training '*, or its ability to provide accurate predictions in a new

sample of patients ^.

Survival to hospital discharge, as a measure of ICU outcome, raises separate validity issues. It is easily

collected and the definition is unambiguous. Despite this, there are important issues with hospital

transfer and discharge procedures, notification of death and identification of cause of death which can

interfere with the interpretation and validity of a seemingly concrete endpoint ^ . Thirty-day in-hospital

mortality, or 6 month or longer term survival may offer better defined endpoints. Elsewhere, I have

shown that modelling patient in-hospital mortality at 30-days post ICU admission offers the advantage

of constant definition and analysis of complete patient data and outcomes ^^. In subsequent chapters, I

will develop machine learning RA tools using 30-day in-hospital mortality as the endpoint.

2.3 Assessment of Validity and Generalisability of


Models
Ideally, a validation study of a model should be independent of the initial modelling process and it

should be based on data collected prospectively, according to a clear protocol. The study should

involve a complete series, or a representative sample of a population of ICU patients with all outcomes

accounted for. The model predictions must not be used to influence clinical practice and clinical

decisions during the evaluation phase.

There are many types of study used to assess the validity of models for ICU mortality. Examples of

studies that assess ICU mortality models are given in Table 2.1. at the end of this chapter. Ridley ^^

evaluates models according to methodological rigour. Justice and co-workers ^ described the

performance of models using the concepts of reproducibility (accuracy of the model on a new sample

of patients at the institution where the model was developed) and transportability (accuracy on

different patient groups, different institutions or at different times). There is a similar concept in the

literature on the accuracy of diagnostic tests ^^, referred to as the transferability of test results.

Transportability of a model is demonstrated by an independent model validation study. Such studies

can be done with a single model, or as a comparison between two or more models. The findings will be

relevant to performance of each model in the context studied, but will also add to the overall experience

about the generalisability of the model assessed ^. The paper by Pappachan et al. ^^ and an example

from my work ^' provide independent assessments of APACHE III at a single institution. Rowan et al.

^"^ report multi-centre evaluation of APACHE II predictions in the United Kingdom and Ireland.

Multiple model comparisons on the same dataset are presented by Castella et al. ^^ (APACHE II, SAPS

II and MPM II) and Beck et al. ^* (APACHE II and APACHE III).

Non-independent validation occurs where the modelling and validation processes are not completely

separate. A non-independent validation study may include performance of the model on the training

and test data as part of the model development process. Examples are the original development reports

of APACHE I I ' , MPM II ^ SAPS II ^ and APACHE III'. Sometimes the validation data is collected to

follow on from the development data. A study may demonstrate that the model results are reproducible

on data collected following the development set of data on which the model was first estimated, and in

the context where the model was developed. Examples of this prospective evaluation at the development site are the

organ system failure model of Marshall et al. ^^ and the artificial neural network and logistic regression

models of Clermont et al. '^.

2.4 Indices of Performance Assessment:


Discrimination and Calibration.
The two desirable attributes of a probabilistic model are discrimination and calibration. Discrimination

is the ability of the model to separate survivors from the non-survivors. Calibration is how well the

predicted risk correctly reflects the true risk of death.

At an early stage in the patients' clinical care, overlap between the characteristics of survivors and non-

survivors means that predictions will always be estimates of the probability of death. A model cannot

display both perfect calibration and discrimination ^* where there is any uncertainty in classification of

patients into those who live and those who die. A model with perfect discrimination would correctly

rank patients so there is no overlap in predicted probability of death between survivors and non-

survivors. Perfect calibration would occur if all the survivors have an estimated risk of death of zero

and all deaths have an estimated risk of 1. Perfect calibration and perfect discrimination are not

possible in practice, and random and unmeasured factors mean that models can only estimate a

probability of death.

Other criteria for assessment of the agreement between model predictions and observed outcomes have

been proposed. "Trustworthiness" and "reliability" ^', "calibration in the large" and "calibration in the

small" * all seek to capture characteristics of models. Few, if any, have been applied to assessment of

models that predict the probability of in-hospital death for ICU patients.

2.4.1 Discrimination

As ICU models are not used for decisions of preferentially withholding or withdrawing therapy,

discrimination (prediction of survivors v non-survivors) is an issue which seldom arises in clinical

practice ^'. It is, however, a useful statistical concept. A number of approaches exist, with classification

matrices and receiver operating characteristic (ROC) curves being the most common methods used.

Covariance Graphs
Covariance graphs show the frequency of the estimated probabilities of death for survivors and non-

survivors ''^. They provide visual qualitative assessment of the separation of the estimates given to the

two outcomes and are commonly used to evaluate diagnostic tests which classify patients according to

the presence or absence of a disease ^^. Though used in the anaesthesia and the intensive care literature

to illustrate the principles of discrimination ^*, the method has not been used for models of ICU

mortality. The covariance graph is mathematically related to and can be easily transformed into the

classification matrix or ROC curve.



Statistical Comparison of Risk of Death of Survivors and Non-Survivors

Some authors ^^'^^'^-^^ have statistically examined the separation between the estimated probabilities of

death of survivors and non-survivors. A non-parametric approach (Wilcoxon rank sum test / Mann-Whitney U test) must be used because of the non-normality of the distributions of estimated

probabilities, particularly with small numbers. This method is a test of the hypothesis that there is no

difference between the median estimated risks of death of survivors and non-survivors i.e. the model is

no better than chance and is equivalent to comparing the area under the ROC curve to 0.5 (see later).

Classification Matrices

Classification matrices tabulate true positive (TP), true negative (TN), false positive (FP), false

negative (FN), sensitivity, specificity and correct classification rate at various thresholds of risk. They

provide a common approach (Table 2.1), but there are no standard thresholds and there is no single

evaluation result ^*'.

In specific circumstances, where the assessment of model performance includes the costs of incorrect

classification, then knowing the characteristics of model performance at thresholds is essential ^\ Such

a study by Glance et al. ^^ examined the cost-effectiveness of using a risk estimation model to withdraw

care at various thresholds of risk of death.

TP, TN, FP, FN, sensitivity and specificity are not context-free properties of a model. Changes in the

definition of the outcome, the quality of data collection and changes in the distribution of proportions

of survivors and non-survivors will cause the measured discrimination of the model to fall. These

factors can cause deterioration of discrimination in the same way as described by Irwig et al. ^' for

medical diagnostic tests.



The correct classification rate (test accuracy or test efficiency) and the positive predictive value and

negative predictive value are likewise of limited use for model assessment because they depend on the

overall mortality rate as well as the decision thresholds chosen.

Confidence intervals (CI) quantify the precision ^^ and generally they can be calculated for

classification matrices. The Standards for Reporting Diagnostic Accuracy (STARD) ^'^ calls for

"statistical methods to quantify uncertainty" in assessing medical diagnostic tests, and the same

standard can be expected for the assessment of models that estimate the probability of in-hospital death.
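To make these quantities concrete, a minimal sketch of a classification matrix at a single decision threshold follows. This is illustrative code, not part of the thesis software, and the outcome and risk vectors are invented for the example.

```python
def classification_at_threshold(outcomes, predictions, threshold):
    """Classification matrix at one decision threshold: a predicted
    risk of death >= threshold is counted as a predicted death."""
    tp = fn = fp = tn = 0
    for y, p in zip(outcomes, predictions):
        if y == 1:          # patient died
            if p >= threshold:
                tp += 1
            else:
                fn += 1
        else:               # patient survived
            if p >= threshold:
                fp += 1
            else:
                tn += 1
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn,
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp)}

# Illustrative data: 1 = died, 0 = survived
y = [1, 1, 0, 0]
pi = [0.9, 0.3, 0.6, 0.1]
m = classification_at_threshold(y, pi, 0.5)
```

Moving the threshold trades sensitivity against specificity, which is why no single matrix summarises a model's discrimination.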

ROC Curve Analysis

The ROC curve provides a representation of all possible pairs of sensitivity and specificity at every

threshold in the range of predictions. The curve is a plot of sensitivity against (1 - specificity) and the

area under the curve (+/- standard error or confidence intervals) provides a summary statistic of

discrimination. The area under the ROC curve provides a measure that is independent of criteria for

decision thresholds ^'^^. Inspection of the curve and comparison with other curves provides further

qualitative analysis.

The area under the ROC curve is the probability that a randomly selected death will have a higher

predicted risk than a randomly selected survivor. In the context of models to predict ICU mortality, the

area under the ROC curve provides a measure of the model's ability to rank the patients in order of

probability of death. It provides a test of whether the model is better than chance at separating survivors

and non-survivors. An area of 0.5 means that the model's discrimination between deaths and survivors

is no better than chance.

There are drawbacks to the use of the ROC curve. The threshold of risk of death cannot be read directly

off the axis and performance at thresholds or in intervals of risk of death cannot be directly compared.

When performances of prediction are compared, it is advisable to visually inspect plots, or to calculate

decision matrices at thresholds of interest ' .



Murphy-Filkins et al. ^^ quoted a rule of thumb that ICU model performance is acceptable if the ROC

area is > 0.7, good if > 0.8 and excellent if > 0.9. Rosenberg ''* holds the opinion that areas under the

ROC curve of "... 0.8 or better are expected for mortality predictions in current models, and scores of

0.7 or less are considered to be unacceptable." The references ''*''^ on which he bases this provide only

opinion. My conclusion, drawn from a review of the published experience with classification

performance of ICU models (Table 2.1) is that Rosenberg's position is correct. Contemporary ICU

models that estimate probability of death should have an area under the ROC curve in the range of 0.80

- 0.90, with less than 0.70 being unacceptable. Steen '* gives an opinion that 0.90 may approach the

upper limit for generalised, discrimination performance for models based on biological measurements.

For small datasets, a non-parametric method to estimate the ROC area based on the Wilcoxon rank

sum test (Mann-Whitney U test) '' or the calculation based on a series of trapezoids is appropriate.

With large datasets, the "staircase" effect of discrete values is less important and parametric curve

fitting or non-parametric approaches provide minimal difference in calculated values of the area under

the ROC curve and the standard error *". The standard error can be used to compare the areas under the

ROC curves and to estimate confidence intervals '''^. Where the performances of two models are

compared on the same dataset, an alternative, non-parametric method *' using paired

comparisons is more powerful.
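The equivalence between the area under the ROC curve and the Mann-Whitney U statistic yields a simple non-parametric estimator: count the (death, survivor) pairs in which the death received the higher predicted risk. The sketch below is illustrative only, with invented data.

```python
def roc_auc(outcomes, predictions):
    """Non-parametric area under the ROC curve: the proportion of
    (death, survivor) pairs where the death has the higher predicted
    risk, with ties counted as one half (Mann-Whitney U / mn)."""
    deaths = [p for y, p in zip(outcomes, predictions) if y == 1]
    survivors = [p for y, p in zip(outcomes, predictions) if y == 0]
    wins = sum(1.0 if d > s else 0.5 if d == s else 0.0
               for d in deaths for s in survivors)
    return wins / (len(deaths) * len(survivors))

# One misranked pair out of four gives an area of 0.75
y = [1, 1, 0, 0]
pi = [0.9, 0.4, 0.5, 0.1]
area = roc_auc(y, pi)   # 0.75
```

For large samples this pairwise count agrees closely with the trapezoidal calculation described above.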

At any decision threshold, the likelihood ratio is the ratio of the predicted probability of death in a

patient who subsequently dies to the predicted probability of death in a patient who subsequently

survives '^. It is the slope of the ROC curve, being the change in sensitivity divided by the change in

(1 - specificity) over a given range of values *''*'. It provides important information about the

performance of a model. The likelihood ratio is the relationship between the pre-test and post-test

probabilities of mortality.

2.4.2 Calibration.

Calibration is an attribute of model performance that reflects the extent to which a risk prediction

represents that patient's actual risk of dying. An overall summary statistic, like the standardised

mortality ratio (SMR) provides information about how the overall mortality rate agrees with the

mortality prediction for the sample. Other statistics and scores have been suggested, in non-ICU

applications, by Brier ^, Yates ", Hilden ^'''" and Flora ^. Goodness-of-fit approaches analyse model

fit in risk strata, grouping patients according to estimated risk (Hosmer and Lemeshow ',

Spiegelhalter *').

To assess the calibration of a model, a global assessment of calibration and an analysis of fit in risk

intervals should be considered.

Evaluation of Overall Model Prediction


There are statistics for assessing how well the model estimates compare with the actual mortality rate

of the sample of patients. They compare the observed number of deaths to the predicted number of

deaths. Large departures will indicate failure of the model to predict the probability of patient death in

that context.

1. Standardised Mortality Ratio

The SMR is a commonly used statistic. It is the ratio of observed deaths to predicted deaths, or

observed mortality rate to predicted mortality rate (Equation 2.1). In the example, there are n patients

indexed by i. \pi_i is the estimate of the probability of death provided by the model. For a patient who

dies, the outcome Y_i = 1, or Y_i = 0 if the patient survives.

SMR = \frac{\sum_{i=1}^{n} Y_i}{\sum_{i=1}^{n} \pi_i}    (Equation 2.1)

The hypothesis that there is no difference between the observed mortality rate and the predicted

mortality rate can be formally tested by chi-squared or binomial methods. Confidence intervals (CI)

estimate the precision of the SMR based on various assumptions about the relevant sampling

distribution. A number of approximations of the binomial distribution have been reviewed ' and the

choice is a balance between simplicity and accuracy.

The useful estimate of the standard error for the term \sum_{i=1}^{n} Y_i arises from the variances of all the

individual predictions ^'^^^'.

SE = \sqrt{\sum_{i=1}^{n} \pi_i (1 - \pi_i)}

The estimate of the 95% CI of the SMR is then:

SMR \pm 1.96 \cdot \frac{\sqrt{\sum_{i=1}^{n} \pi_i (1 - \pi_i)}}{\sum_{i=1}^{n} \pi_i}

2. Flora's Z score

Flora's Z score ^ is similar to the method of using SMR with confidence intervals and has the same

advantages and limitations. It compares the observed number of deaths and the predicted number of

deaths using the difference rather than the ratio of these numbers; then the statistic is standardised by

the estimate of the standard error.

Flora's Z = \frac{\sum_{i=1}^{n} Y_i - \sum_{i=1}^{n} \pi_i}{\sqrt{\sum_{i=1}^{n} \pi_i (1 - \pi_i)}}

The Flora score has been used in a Greek study to compare APACHE II and SAPS II in a single

institution ^'.
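Flora's Z uses the same standard error as the SMR confidence interval, but standardises the difference rather than the ratio. A minimal sketch with invented data:

```python
import math

def flora_z(outcomes, predictions):
    """Flora's Z: (observed deaths - predicted deaths) standardised by
    the estimated standard error sqrt(sum pi*(1-pi))."""
    difference = sum(outcomes) - sum(predictions)
    se = math.sqrt(sum(p * (1 - p) for p in predictions))
    return difference / se

# Illustrative data: 3 observed v 3.2 predicted deaths
y = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
pi = [0.8, 0.1, 0.2, 0.6, 0.05, 0.1, 0.3, 0.7, 0.15, 0.2]
z = flora_z(y, pi)   # about -0.16: no evidence of miscalibration
```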

Where there is a large dataset, a chi-squared statistic has been proposed by Miller and Hui ^^ though it

has not been used to assess ICU models that predict in-hospital mortality. The Hosmer-Lemeshow (H-L) statistics, described below, can be regarded as a special case of this \chi^2 statistic, where continuous

probability estimates are grouped for analysis.



3. Mean Squared Error

The mean squared error (MSE) of the predictions, also known as the Brier score ^, can be used to

assess model fit.

MSE = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \pi_i)^2

This method is particularly useful where very large datasets exist '*^. The MSE can be decomposed into

components that reflect characteristics of outcome prevalence (and variance of the outcome), bias,

noise, model complexity and over-fitting "'^'''''^. The MSE has several drawbacks. Whether the

model is overestimating or underestimating the probability of death is not apparent from the MSE.

Also, the MSE is dependent on the mortality rate, which is a characteristic of the context as much as

of the model performance.
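The Brier score is a one-line computation; the sketch below (invented data) also illustrates the dependence on mortality rate noted above, since a constant prediction equal to the prevalence r scores exactly r(1 - r).

```python
def brier_score(outcomes, predictions):
    """Mean squared error of probabilistic predictions (Brier score):
    lower is better; 0 only for perfect, certain predictions."""
    n = len(outcomes)
    return sum((y - p) ** 2 for y, p in zip(outcomes, predictions)) / n

# A confident, correct pair of predictions scores close to zero
good = brier_score([1, 0], [0.9, 0.1])          # 0.01

# A constant prediction at the prevalence (0.25) scores 0.25 * 0.75,
# showing the score's dependence on the overall mortality rate
base = brier_score([1, 0, 0, 0], [0.25] * 4)    # 0.1875
```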

Calibration Curves

Calibration curves allow qualitative evaluation of the model fit across risk intervals and are widely

used to evaluate ICU mortality models. The agreement between predicted and observed mortality in

intervals defined by predicted risk can be displayed graphically with a curve comparing model

predictions with observed frequencies *^. Patients are grouped into contiguous intervals of predicted

risk. For each risk interval, the mortality rate is plotted against the mean estimated probability of death.

A perfectly calibrated model will have a calibration curve that has a slope of 1 and an intercept at the

origin. For assessment of ICU models, groups can usually be defined by intervals of 0.1 or 0.05 in the

estimated probabilities of death, depending on sample size. Figure 2.1 is an example of a calibration

curve of the APACHE III model for the PAH ICU database 1995 - 1997. The full description of this

evaluation is provided in Chapter 3.



Figure 2.1: Calibration Curve for APACHE III Hospital Mortality


Model
PAH ICU 1995 -1997, Observed (+/- 95% confidence intervals) v predicted mortality in 10 deciles of risk
APACHE III model predicts in-hospital risk of death with adjustment for hospital characteristics
[Plot: observed mortality (with 95% CI) against APACHE III predicted hospital mortality in deciles of risk, 0 to 1.]

This graph gives a visual representation of the agreement or calibration at each of the levels of risk.

Small data samples can produce irregular or noisy curves or empty risk intervals, though smoothing

functions are available *^. Small numbers are particularly likely in the strata of higher risk of death.

The CI for the observed mortality rate in each interval can be estimated based on a normal

approximation to the binomial. The relationship between the calibration curve and its degree of

estimated variation allows qualitative appraisal of model performance, and is useful for making

comparisons between models. In each interval, CI give an indication of the precision of the mortality

estimate and account for sub-group size and random variations. However, the sub-groups in each

interval are not independent. As an alternative to CI, some authors 30,34,36,48,49,76,77 include a histogram of

patient numbers by interval. Either approach allows the reader to infer the likely precision of the

estimates in each interval, based on the number of cases in the intervals.

However, the calibration curve approach does not provide a way to test hypotheses about the adequacy

of fit across all intervals. The lack of independence of the values, the issues of small numbers and

precision, and the problems of false positives and multiple testing limit the quantitative inferences that

can be drawn from calibration curves.
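The points of a calibration curve are simply grouped means. The grouping step can be sketched as follows; the data are invented for illustration, not taken from the PAH evaluation.

```python
def calibration_points(outcomes, predictions, n_bins=10):
    """Group patients into contiguous risk intervals of equal width and
    return (interval lower bound, n, observed mortality, mean predicted
    risk) for each non-empty interval: the points of a calibration curve."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(outcomes, predictions):
        idx = min(int(p * n_bins), n_bins - 1)   # p == 1.0 joins the top bin
        bins[idx].append((y, p))
    points = []
    for idx, group in enumerate(bins):
        if group:
            n = len(group)
            observed = sum(y for y, _ in group) / n
            mean_pred = sum(p for _, p in group) / n
            points.append((idx / n_bins, n, observed, mean_pred))
    return points

# Two low-risk survivors and two high-risk deaths: two non-empty intervals
y = [0, 0, 1, 1]
pi = [0.05, 0.05, 0.85, 0.85]
pts = calibration_points(y, pi)
```

A well calibrated model produces points lying near the diagonal where observed mortality equals mean predicted risk, as in the example.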



Hosmer-Lemeshow statistics

The Hosmer-Lemeshow (H-L) ^^'^ statistics (C and H) were proposed for the assessment of logistic

regression models. Their use has been adapted and extended to include prospective, independent model

validation.

H-L tests compare the observed against the predicted numbers of deaths and survivors in intervals of

risk. Most applications that assess the calibration of ICU models have used 10 risk intervals. For the C

statistic, patients are ranked according to predicted risk of death and divided into 10 near equal groups.

The H statistic uses the sample divided into 10 contiguous risk intervals of equal width but unequal

number. The C and H statistics are chi-squared like statistics calculated from a 4 x 10 table of observed

and estimated mortality and survival. The value of the test statistic is compared with the chi-squared

distribution with an appropriate number of degrees of freedom.

The number of degrees of freedom when the model is being assessed on the developmental dataset is the number of risk strata minus 2. By convention in the ICU literature, there are usually 10 deciles of risk, and so 8 degrees of freedom. The degrees of freedom for the chi-squared distribution when prospective independent validation is performed are equal to the number of risk intervals. Again, by convention there are 10 intervals, though 8 or 9 intervals are reported when there are small samples. With the H statistic, the risk intervals may contain few or no cases, requiring combination of intervals, or use of an alternative method.
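The C statistic computation described above can be sketched as follows (illustrative Python; the data are synthetic, with equal predicted risks chosen only to make the arithmetic transparent):

```python
def hosmer_lemeshow_c(probs, outcomes, groups=10):
    """Hosmer-Lemeshow C statistic: rank patients by predicted risk of
    death, split them into near-equal groups, and sum the chi-squared
    contributions of observed vs expected deaths and survivors."""
    paired = sorted(zip(probs, outcomes))          # rank by predicted risk
    n = len(paired)
    stat = 0.0
    for g in range(groups):
        chunk = paired[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        exp_d = sum(p for p, _ in chunk)           # expected deaths
        obs_d = sum(y for _, y in chunk)           # observed deaths (y = 1)
        m = len(chunk)
        exp_s, obs_s = m - exp_d, m - obs_d        # survivors
        if 0 < exp_d < m:
            stat += (obs_d - exp_d) ** 2 / exp_d + (obs_s - exp_s) ** 2 / exp_s
    return stat
```

The resulting statistic is then referred to a chi-squared distribution with the appropriate degrees of freedom, as discussed in the text.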

As with all of the methods for model assessment studied in the current context, the H-L statistics are vulnerable to changes in the patient casemix and the distribution of severity of illness, as well as model fit. The power of the analysis will depend on the sample size. Small samples tend to lack power to recognise poor fit. Conversely, large samples are more likely to suggest poor fit. Rowan et al. state that "A significant departure from the null hypothesis does not necessarily imply a bad fit, just that imperfections are of such a size that they can be detected in a large sample size". For comparisons on the same dataset, it is the magnitude of the chi-squared statistic that indicates the better model fit. In practice, it is necessary to decide on non-statistical grounds what level of fit is clinically acceptable.

Notwithstanding the effect of sample size and the inconsistencies in their use, the H-L statistics are

widely used, and provide useful context specific information about calibration.

Spiegelhalter's Z score

A similar approach to the H-L H statistic has been proposed by Spiegelhalter. His method calculates

standardised Z scores for each of 10 risk intervals based on the observed mortality rate, and the

standard error of the risk estimates in the interval. It is assumed that Z will be approximately normally

distributed and scores greater than 1.96 or less than -1.96 imply that the model is poorly calibrated in

that interval.

Overall, the Spiegelhalter score provides an assessment of fit of the model predictions across all the

intervals. It is calculated as the sum of the squares of the Z scores for the intervals and is then compared

to a chi-squared distribution with 7 degrees of freedom.

By standardising in each risk interval and accounting for patient numbers and distribution of risk of

death, comparison between different contexts may be possible provided the samples are large. A further

advantage of the Spiegelhalter score is that it can be calculated from the tables presented for H-L

statistics. The Spiegelhalter method of calculating the standard error is a conservative test. A major

shortcoming arises with intervals with small numbers of patients.
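One plausible reading of the standardisation just described can be sketched as follows (illustrative Python; the standard error formula, the binomial variance p(1 - p) summed over the interval's predictions, is an assumption, and the interval data are synthetic):

```python
import math

def interval_z(probs, outcomes):
    """Standardised Z score for one risk interval: (observed - expected)
    deaths, divided by the standard error implied by the predicted risks."""
    diff = sum(outcomes) - sum(probs)
    se = math.sqrt(sum(p * (1 - p) for p in probs))
    return diff / se

def overall_score(intervals):
    """Overall fit: sum of squared interval Z scores, referred to a
    chi-squared distribution as described in the text."""
    return sum(interval_z(p, y) ** 2 for p, y in intervals)
```

An interval whose observed deaths match the sum of its predicted risks scores Z = 0; intervals with very few patients produce unstable Z scores, which is the shortcoming noted above.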

No assessment of performance of ICU mortality models has used the Spiegelhalter method to date.

Model Based Analysis of Performance

Model based analyses of performance are more often used during fine-tuning of a model on a

developmental dataset or for recalibration than for validation studies. Real ICU data may violate some

of the assumptions on which both the modelling and subsequent analysis are carried out. However, Ash and Schwartz observe that "...these algorithms for transforming data are judged primarily by

how closely their predictions match reality, rather than the extent to which underlying assumptions are

met."

Approaches based on a logistic regression model allow analysis of the relationship between the estimated risk of death and observed outcomes. These assessment methods naturally allow new or refined ICU outcome models to be developed by adjustment of the parameters of an existing logistic regression model.

2.5 Recommendations for a Practical Approach to


Validation of ICU Models that Estimate the Probability
of In-Hospital Death

An ICU model that accurately estimates the probability of in-hospital mortality can potentially be used

as a risk adjustment tool to analyse mortality outcomes in an ICU. The model's performance must

however be thoroughly evaluated to determine whether the estimates of probability of death are

accurate.

The recent publication of guidelines for standard reporting of the accuracy of diagnostic tests by the Standards for Reporting of Diagnostic Accuracy (STARD) initiative provides a useful example of an explicit and rigorous checklist to serve as a guide to methodology and documentation. Reports on the performance of ICU models that estimate the probability of in-hospital death share similar characteristics to reports on the accuracy of a diagnostic test. Therefore, the systematic approach used by STARD (Checklist for Reporting Diagnostic Accuracy) provides a framework for my recommendations.

The Title, Abstract and Keywords should identify the report as that of the assessment of the

performance of an ICU mortality prediction model.

The Introduction should clearly state the aim of the report. It may be to develop and introduce a new

model, to validate an existing model, to compare several models, or to adjust an existing model.

The Methods section should describe the context of the analysis and dates of data collection. A single

or multiple ICUs, and the type of hospital and ICU should be described. The study population and the

method of patient eligibility and exclusion should be described. This will allow the reader to assess

independence of the validation process, conditions affecting model performance, and the applicability

of the model to a situation of interest.

An account of rules of data collection and a description of the model are essential. The mortality and

survival endpoints must be defined. The methods by which patient data are divided into sets for model

development and testing must be given. A description of the statistical methods used to develop the

model and to assess its performance should be given.

The accuracy of ICU outcome models should be assessed in terms of discrimination and calibration.

For discrimination, the area under the ROC curve should be calculated, with standard error or

confidence intervals for precision. For two models on the same sample, pair-wise comparison of

models should be done.

Calibration should be assessed by an overall assessment statistic and an assessment of fit across risk

intervals. The most commonly used global indication is the SMR with confidence intervals. A

graphical approach with a calibration curve using confidence intervals should be presented. A

numerical evaluation of goodness-of-fit using the H-L statistics or the Spiegelhalter approach should be

used. If more than one model is being evaluated on the same dataset, then a statistical comparison,

using H-L statistics or Spiegelhalter Z scores is suggested.

The Results section should state when the data collection was performed. A description of the sample

should include characteristics of age, gender and major diagnostic categories, severity of illness

measurements and mortality rate. If this is an independent evaluation of a model, it is useful to compare

the patient characteristics of the validation data set with those for the sample on which the model was

developed. All missing or incomplete records must be accounted for.

The recommended approach described above is applied to analysis of APACHE III in an Australian

adult ICU in the next chapter.



2.6 Survey of Models that Estimate the Probability of


in-Hospital Death of ICU Patients
Table 2.1 presents a tabular survey of studies that report the performance of ICU models to estimate the

probability of in-hospital death.

The performance of all models on the developmental datasets is consistently better than on subsequent

independent assessment. As expected, the issues of reproducibility and transportability of the models

require that all models which estimate the probability of in-hospital death of ICU patients are validated

at the site where the models will actually be used.

2.7 Conclusion
The key attributes of models that estimate the risk of death of patients in the ICU are discrimination

and calibration. The area under the ROC curve is the best measure of discrimination. For models to

predict contemporary ICU mortality, an area under the ROC curve in the range of 0.80 - 0.90 is

expected. Model calibration is an attribute that is more difficult to capture. Though calibration curves

provide a qualitative representation, statistical evaluation of calibration is done using H-L goodness-of-fit statistics. In practice, it is necessary to decide on non-statistical grounds what level of fit, and what maximum statistic value, is acceptable. Goals for model performance using these measures will guide development of new models in Chapters 6 and 7.


[Table 2.1: Survey of studies that report the performance of ICU models estimating the probability of in-hospital death. Reports are presented in chronological order of publication; * denotes absence of information (eg calibration not reported). Columns: Authors; Models; Comments; Area under the ROC curve (95% confidence intervals or standard error); Discrimination: classification matrix thresholds; Overall calibration assessment; Calibration: graph; Calibration: H-L statistic. Studies surveyed include evaluations of APACHE II (developmental paper; Jacobs et al. 1987; Zimmerman et al. 1988; Berger et al. 1992; Rowan et al. 1993 and 1994), APACHE III (Knaus et al. 1991), SAPS II (Le Gall et al. 1993) and the MPM II family of models, among others. The rotated, multi-page table body (pages 28 - 37) is not reproduced.]

Chapter 3

Performance of APACHE III Models in an Australian

Intensive Care Unit (1995 - 1997)

3.1 Objective

The objective of this chapter is to evaluate the performance of the APACHE III (Acute Physiology and

Chronic Health Evaluation) ICU (intensive care unit) and hospital mortality models at an Australian

tertiary adult ICU.

The independent validation of the APACHE III models at the Princess Alexandra Hospital (PAH)

intensive care unit (ICU) has been published during the course of this project. It was, at the time of

publication, the largest single institution evaluation of the APACHE III model.

3.2 Introduction
At the time of the study, PAH ICU provided medical and surgical critical care services to an 858 bed

adult metropolitan hospital which was the regional centre for trauma, major surgery, medical

subspecialties, and psychiatry. In August 1994, the APACHE (Acute Physiology and Chronic Health

Evaluation) III Management System was introduced to the 12 bed ICU.

The APACHE III estimates of mortality risk are part of a proprietary database and decision support

system provided by APACHE Medical Systems, Inc. The APACHE III score is a measure of severity

of illness of ICU patients and is calculated from the patient's age, the presence of existing medical

conditions and the worst physiological and laboratory investigations in the first 24 hours. The

APACHE score attempts to measure the patient's physical reserve, and the degree of physiological

disturbance through the most abnormal physiology recorded during the first day. The APACHE III

model uses the admission diagnosis, the source of admission and the APACHE III score to estimate the

patient's risk of death in the ICU, and risk of death during the hospitalisation. The risk equations and weights were developed by Knaus et al.

The performance of the APACHE III model predictions has been evaluated on the developmental database. Independent APACHE III validation series are available from Brazil (multi-centre, 1734 patients), the United Kingdom (UK: single institution, 1144 patients, and multi-centre, 12 793 patients) and Germany (single institution, 2661 patients). In each study hospital mortality was higher than predicted, indicating poor model calibration. In contrast, in two large, prospective, multi-centre North American series (37 668 patients and 116 340 patients) the APACHE III model demonstrated good overall performance. Generally, the APACHE III mortality prediction model has not performed well in clinical evaluation outside the USA where it was developed.

The purpose of this chapter is to assess the performance of the APACHE III models for hospital and

ICU mortality, unadjusted and with proprietary model adjustments, at the PAH ICU.

3.3 Materials and Methods


Admissions to the PAH ICU were studied from 1 January 1995 to 31 December 1997. Patients under

16 years of age, cardiac surgical and burns patients, and patients admitted to the ICU for less than 4

hours or for exclusion of myocardial infarction were excluded from the APACHE III predictions.

Patient data were collected according to the rules of APACHE III. Data were manually collected,

or transferred from the pathology laboratory information system. The database manager verified all

data. After the first six months of the data collection, 4% of patient records were extracted to check

inter-reporter reliability. Outcomes were survival status on discharge from the ICU or the PAH hospital

campus. Patients transferred to rehabilitation facilities (spinal, geriatric, head injury and general

rehabilitation units) or the psychiatric unit within the PAH complex were deemed inpatients, until

discharged from the campus.


Comparisons were made between the APACHE III developmental patient sample and the current study sample using the t-test or chi-squared test, adopting a significance level of p < 0.01 to correct for multiple testing.

The ICU mortality models were assessed on all eligible admissions to ICU, including readmissions.

The hospital mortality model assessment excluded all ICU readmissions during an episode of

hospitalisation. For each admission, mortality estimates were provided based on proprietary weights

and the APACHE III equation. For in-ICU mortality, the APACHE III ICU mortality model and a

model with proprietary adjustments for hospital characteristics (similar hospital ICU mortality model)

were studied. Three models of in-hospital mortality were evaluated. The APACHE III hospital

mortality model and models with proprietary adjustments for hospital characteristics (similar hospital

mortality model) and referenced to the North American database (USA database hospital mortality

model) were studied.

The proprietary adjustments to the APACHE III models include the additional data variables of pre-

ICU treatment period and information about the institution size, teaching status and region. In the case

of the PAH, the similar hospital model references the predictions to teaching hospitals of similar size in

the Mid-West region of USA. The USA database hospital mortality model reflects a "typical" USA

ICU modelled from the APACHE III database (personal communication: C. Alzola, APACHE

Medical Systems, Inc. 1999).

The aggregate predicted mortality rate for each model was the sum of estimated probabilities of death

divided by the number of admissions. The standardised mortality ratio (SMR) was the ratio of observed

mortality to the aggregate predicted mortality. Confidence intervals were estimated using a normal

approximation to the binomial distribution.
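As a concrete sketch of the aggregate calculations above (illustrative Python; the exact CI construction in the cited reference may differ — here a normal-approximation binomial CI on the observed death count is divided by the expected deaths):

```python
import math

def smr_with_ci(probs, outcomes, z=1.96):
    """Standardised mortality ratio: observed deaths divided by the sum
    of predicted probabilities of death, with an approximate 95% CI from
    a normal approximation to the binomial on the observed count."""
    n = len(probs)
    expected = sum(probs)        # aggregate predicted deaths
    observed = sum(outcomes)     # observed deaths
    p_obs = observed / n
    half = z * math.sqrt(p_obs * (1 - p_obs) / n) * n   # CI half-width on the count
    return observed / expected, (observed - half) / expected, (observed + half) / expected

# Hypothetical sample: 100 admissions, each predicted risk 0.1, 16 deaths.
smr, lo, hi = smr_with_ci([0.1] * 100, [1] * 16 + [0] * 84)
```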

For assessment of model fit or calibration, the agreement between predicted and observed mortality

rate in risk intervals was assessed. Using 10 equal, contiguous risk intervals, calibration curves present

observed against predicted outcomes with 95% confidence intervals estimated by a normal

approximation to the binomial distribution. The Hosmer-Lemeshow (H-L) statistics, C and H, indicate the agreement between the observed and predicted mortality across risk intervals. For C,

admissions are ranked according to predicted risk of death and divided into 10 nearly equal groups. H

uses the sample divided into 10 contiguous intervals of risk of equal width, but unequal number. For

external validation studies, the degrees of freedom of the chi-squared distribution is the number of

intervals of risk. Rejection of the null hypothesis, that there is no difference between the observed and predicted frequencies across the deciles of risk, is at p < 0.05.

Discrimination was assessed by calculating the area under the receiver operating characteristic (ROC)

curves, with estimates of the standard error and confidence intervals. The area under the ROC curve

estimates the probability that a randomly selected patient who died would have been given a higher

estimate of risk of death than a randomly chosen survivor.
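The probabilistic interpretation of the ROC area lends itself to a direct sketch (illustrative Python; the thesis's actual calculations, including standard error estimates, are not reproduced here):

```python
def auc_pairwise(probs, outcomes):
    """Area under the ROC curve via its probabilistic interpretation:
    the fraction of (non-survivor, survivor) pairs in which the
    non-survivor received the higher predicted risk; ties score 0.5."""
    deaths = [p for p, y in zip(probs, outcomes) if y == 1]
    survivors = [p for p, y in zip(probs, outcomes) if y == 0]
    wins = sum((d > s) + 0.5 * (d == s) for d in deaths for s in survivors)
    return wins / (len(deaths) * len(survivors))
```

For large samples this all-pairs form is usually replaced by a rank-based (Mann-Whitney) computation, which gives the same value more efficiently.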

3.4 Results
There were 3455 admissions to the PAH ICU between 1 January 1995 and 31 December 1997.

Exclusions were 45 admissions under 16 years of age, 8 admissions staying less than 4 hours and 4

burns admissions. The 3398 remaining admissions represented 3159 patient hospitalisations of 3038

individual patients. All patient outcomes were accounted for during the study period.

There were 338 deaths in ICU (9.9%) and 507 deaths in hospital (16.0%). The median length of stay in

ICU was 2 days (range of 1 - 75 days), with 65.4% of patients admitted for 2 or 3 days. Median

duration of hospitalisation was 16 days (range 1 - 930 days; 25th percentile: 8 days, 75th percentile: 28 days).

Table 3.1 shows the demographic characteristics of patients, with reference to the APACHE III

developmental database. Compared to the APACHE III development sample, the PAH sample was

younger, had a greater male preponderance, a different case mix of non-operative / operative (elective

and emergency) and a different mix of sources of referral. Severity of illness reflected by the day 1

APACHE III score and the Acute Physiology Score component appear similar.

Table 3.1: Comparison of Demographics, Operative Status and


APACHE III Score, between Princess Alexandra Hospital Intensive
Care Unit Admissions and APACHE III Developmental Sample.
PAH ICU APACHE III Developmental Sample*

Sample Size (ICU admissions) 3159 17440

Age: yrs (sd) ^ 52.6(19.2) 59(19)

Male:%" 61.5 55.2

Operative Status^^

Non-operative: % 46.8 57.7

Elective: % 38.4 33.3

Emergency: % 14.9 9.0

Admission Source

Operating/Recovery: % 53.2 42.3

Emergency Room: % 20.0 35.5

Hospital ward: % 13.8 16.4

Other Hospital/ICU: % 13.0 5.8

Day 1 APACHE III score: mean 47.6 49.2

Day 1 Acute Physiology Score: mean 39.2 39

ICU Mortality (%) 9.9 NA

Hospital Mortality (%) 16.0 16.8

* data from the APACHE III developmental database; ^ two-tailed t-test, p < 0.001; ** χ²(1) = 46, p < 0.001; ^^ χ²(2) = 180, p < 0.001; *** χ²(6) = 515, p < 0.001

The 231 admission diagnoses were grouped into 77 disease groups. The commonest operative disease

groups were gastrointestinal cancer (9.0%), elective aortic surgery (8.5%), operative trauma (7.2%),

head and neck cancer surgery (3.5%), miscellaneous gastrointestinal surgery (3.5%) and liver

transplantation (2.8%). The commonest non-operative groups were non-operative trauma (10.3%), drug overdose (7.5%), cardiac arrest (2.6%) and asthma (2.5%). The ten most frequent groups accounted for 57.3% of all admissions.



There were 2812 admissions (82.8%) with no APACHE III co-morbidities; 459 (13.5%) had one co-morbidity, 120 (3.5%) two, and 7 (0.2%) three or more co-morbidities. The prevalence of one or more co-morbidities in the present study sample (17.2%) differs from that of the APACHE III developmental group of 6.6% (χ²(10) = 1735, p < 0.0001).

The majority of admissions had low estimates of probabilities of death according to APACHE III.

Seventy-nine percent of patients had a predicted ICU mortality of 0.1 or less and 91% of 0.3 or less.

Sixty-eight percent of patients had a predicted hospital mortality of 0.1 or less, and 85% of 0.3 or less.

The observed hospital mortality (16.0%) was significantly higher than the mortality rate predicted by APACHE III, 13.6% (χ²(1) = 7.4, p = 0.01; Table 3.2). The observed hospital mortality was not different from the APACHE III predictions when the model adjusted for hospital characteristics (similar hospital model: 14.9%) or the USA database referenced model (15.6%) was

used. The observed ICU mortality (9.9%) was not significantly different from the predictions of the

APACHE III ICU mortality model (8.9%) or the APACHE III similar hospital ICU model (10.5%,

Table 3.3).

Table 3.2: Predicted APACHE III Hospital Mortality Compared to


Observed Hospital Mortality of 16.0%, with Standardised Mortality
Ratio and 95% Confidence Intervals.
Predicted Mortality Rate    SMR (95% CI)    χ²(1)    p

Hospital Mortality 13.6% 1.17(1.08-1.29) 7.4 0.01*


Model
Similar Hospital 14.9% 1.07(1.00-1.17) 1.8 0.19
Mortality Model
USA Database 15.6% 1.02(0.96-1.12) 0.3 0.63
Hospital Mortality
Model

Table 3.3 Predicted APACHE III Intensive Care Unit Mortality


Compared to Observed Mortality of 9.9%, with Standardised
Mortality Ratio and 95% Confidence Intervals.
Predicted Mortality Rate    SMR (95% CI)    χ²(1)    p
ICU Mortality Model 8.9% 1.12(1.01-1.26) 2.5 0.12

Similar Hospital ICU 10.5% 0.95 (0.87-1.06) 0.4 0.52

Mortality Model

The APACHE III ICU models show good calibration. Calibration curves for the ICU mortality model

and the similar hospital ICU mortality model (Figures 3.1 and 3.2) are close to the line of perfect model calibration. The H-L statistics (Table 3.4), with corresponding p values > 0.05, confirm adequate calibration, with the similar hospital ICU model providing the best fit on this sample.

Table 3.4: Calibration of Predictions of APACHE III Intensive Care


Unit Mortality and Hospital Mortality:
Hosmer-Lemeshow goodness of fit statistics (C and H) with probabilities using chi-squared distribution
with 10 degrees of freedom.

APACHE III Model C P H P


ICU Mortality Model 12.7 0.24 14.8 0.13
Similar Hospital ICU Mortality Model 5.6 0.85 8.7 0.56
Hospital Mortality Model 32.8 < 0.001 35.6 <0.001
Similar Hospital Mortality Model 14.5 0.15 16.9 0.08
USA Database Mortality Model 12.2 0.27 15.9 0.10

Figure 3.1: Calibration Curve for APACHE III ICU Mortality Model with
No Adjustment for Hospital Characteristics
APACHE III model predicts in-ICU risk of death
Observed (+/- 95% CI) v predicted mortality in 10 intervals of risk

[x-axis: APACHE III Predicted ICU Mortality]

[Figure 3.2: Calibration Curve for APACHE III ICU Mortality Model with Adjustment for Hospital Characteristics (Similar Hospital Model). The APACHE III model predicts in-ICU risk of death; observed (+/- 95% CI) versus predicted mortality in 10 intervals of risk. X axis: APACHE III Predicted ICU Mortality.]

The unadjusted APACHE III hospital mortality model has poor calibration, demonstrated on the calibration curve (Figure 3.3) and on statistical analysis (Table 3.4). The similar hospital model (Figure 3.4) and the model referenced to the USA database (Figure 3.5) display adequate calibration. Though both calibration curves show that observed mortality differs from expected in the 40 - 50% risk range, the H-L statistics (Table 3.4) suggest non-significant differences between the estimated and actual mortality rates.

[Figure 3.3: Calibration Curve for APACHE III Hospital Mortality Model with No Adjustment for Hospital Characteristics. The APACHE III model predicts in-hospital risk of death; observed (+/- 95% CI) versus predicted mortality in 10 intervals of risk. X axis: APACHE III Predicted Hospital Mortality.]

[Figure 3.4: Calibration Curve for APACHE III Hospital Mortality Model with Adjustment for Hospital Characteristics (Similar Hospital Model). The APACHE III model predicts in-hospital risk of death; observed (+/- 95% CI) versus predicted mortality in 10 intervals of risk. X axis: APACHE III Predicted Hospital Mortality.]

[Figure 3.5: Calibration Curve for APACHE III Hospital Mortality Model Adjusted for USA Standard ICU. The APACHE III model predicts in-hospital risk of death; observed (+/- 95% CI) versus predicted mortality in 10 intervals of risk. X axis: APACHE III Predicted Hospital Mortality.]

The area under the ROC curves for both ICU mortality models was 0.92. The areas under the ROC

curves for the APACHE III hospital mortality model and the model referenced for the USA database

were 0.90. For the similar hospital model, the area was 0.91. This demonstrates good or excellent

discrimination for each of the models.
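The area under the ROC curve reported above equals the probability that a randomly chosen non-survivor is assigned a higher predicted risk than a randomly chosen survivor. A minimal sketch of this rank-based (Mann-Whitney) computation follows; it is an illustrative implementation with invented names, not the software used for the thesis, and the O(n²) pairwise loop is only suitable for small samples.

```python
def roc_auc(outcomes, risks):
    """Area under the ROC curve via the Mann-Whitney U statistic."""
    pos = [r for r, y in zip(risks, outcomes) if y == 1]   # non-survivors
    neg = [r for r, y in zip(risks, outcomes) if y == 0]   # survivors
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0       # non-survivor ranked higher: concordant
            elif p == q:
                wins += 0.5       # tie counts as half
    return wins / (len(pos) * len(neg))
```

A model that ranks every non-survivor above every survivor scores 1.0; a model no better than chance scores 0.5.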

3.5 Discussion

This analysis demonstrates that the APACHE III mortality models with adjustment for hospital characteristics, the ICU mortality model and the hospital mortality model referenced to the USA database have good discrimination and calibration in an Australian adult ICU population at the PAH ICU during the study period. This was the first series from a general ICU outside the USA that endorsed APACHE III model performance. It also supports the findings of previous reports from the UK, Brazil, Australia and Germany, where the original, unadjusted hospital mortality model performed poorly.

The practical approach adopted in this analysis is based on the review and conclusions of Chapter 2.

Discrimination of all APACHE III models was good (area under the ROC curve > 0.8) or excellent (area under the ROC curve > 0.9). The area under the ROC curve was similar to that of APACHE III in-hospital mortality predictions on the developmental data set, the USA prospective multi-centre validation series and the UK multi-centre series. However, the discrimination of the APACHE III model is vulnerable to differences in case mix, clinical practice and data collection conditions, given the lesser performance in the multi-centre Brazilian series and single-institution studies from England and Germany.

In this sample, the H-L statistics, the calibration curves and the global agreement between observed and predicted outcomes concur that among the models studied, only the unadjusted APACHE III hospital mortality model displayed inadequate fit. Comparison with other published calibration curves for APACHE III [30, 36, 77, 100] shows that the calibration at the PAH resembles the curves of the North American ICUs.

Despite differences in case mix and referral patterns between the PAH sample and the APACHE III developmental sample, the performance of the APACHE III models adjusted for hospital characteristics was good. Other analyses of APACHE III have only assessed the performance of the unadjusted APACHE III hospital mortality models. A UK validation study, with a patient sample having a higher average APACHE III score, more co-morbidities and different referral sources, found excellent discrimination. However, there was a 25% higher than predicted mortality rate and excess mortality in all risk ranges, indicating poor calibration. Differences in casemix were proposed as the likely reason for higher than predicted hospital mortality. A German study describes a 22% higher than predicted mortality, with data collection anomalies, casemix, lead-time bias, model inaccuracy or quality of care cited as possible contributors.

In contrast to other work, the present study showed that the APACHE III model can be applied to a patient population with a different case mix and referral pattern, outside of the USA, and produce similar performance to that observed on the APACHE III development and validation series.

The validation of a model that estimates the risk of death on an independent data set implies model validity and the reliability of variables and data collection methods. Potential for bias and inaccuracy, and threats to model performance, can arise from local anomalies of clinical practice, casemix or data collection. The apparent variability of performance of ICU outcome or risk adjustment models mandates that these models must be closely examined at each site where they are used before conclusions are drawn about comparative or historical performance.

In the PAH series during the study period, the APACHE III mortality estimates, particularly with proprietary adjustment for hospital characteristics, provided both good discrimination and good calibration. This supports the validity and robustness of APACHE III variables, data collection and the model for mortality predictions.

The local performance of APACHE III will therefore allow its use as a locally validated risk adjustment tool to analyse mortality outcomes at the PAH ICU.



Chapter 4
Control Charts for Analysis of Mortality Outcomes in
Intensive Care.

4.1 Introduction
The purpose of the next two chapters is to develop methods to continuously monitor outcomes of

intensive care unit (ICU) patients using an adjustment for risk of death. In Chapter 4, control charts are

applied to analyse ICU deaths. The choice of charting parameters is made by modelling the occurrence

of false alarms and detection of changes in the mortality rate. Control charts incorporating risk

adjustment (RA) are introduced in Chapter 5.

4.1.1 Overview

Statistical analyses can be used to detect a change in the pattern of the outcomes of a process. A change

in observed mortality rates for patients may be caused by various factors including a change in the

process of patient care. Therefore, monitoring outcomes in a clinical setting may provide information

to detect and influence improvements in the quality of patient care.

In this chapter, several approaches to track ICU mortality will be presented. Process control chart

analysis will be applied to mortality at the PAH ICU during the period 1 January 1995 to 31 December

1999. The APACHE III model for estimating the in-hospital death of ICU patients, with proprietary

adjustments for hospital characteristics, was shown in Chapter 3 to have good discrimination and

calibration on the PAH ICU patient population. Therefore it will be used as a RA model for

development of the RA control charts that are presented in Chapter 5.

The data used in the following chapters cover an additional two-year period beyond those used for the validation study of Chapter 3, which was published in 2000. A summary of the larger data set is provided in Appendix 1. An update on the performance of the APACHE III model on this data set has been published.



This project was commenced using the 1995 - 1997 patient dataset. This initial 3 year period is used as a period of historical observation. The chart monitoring will be applied to the subsequent data for 1998 - 1999.

4.2 Considerations in Monitoring In-Hospital Mortality of ICU Patients.

An approach for monitoring ICU mortality should have several characteristics.

Primarily the methods must track the mortality rate and analyse the mortality observations in the

context of a standard mortality rate. For the charts presented in this chapter, the expected mortality rate

was derived from historical observations from the 1995 - 1997 patient data.

ICU mortality monitoring must include all patients that are admitted to the ICU, so that a global picture of ICU mortality is obtained which will reflect the care given to all patients. This will also maximize the number of patients and the power of the analysis, and may minimize the delay in recognizing changes.

The data collection and monitoring must not distort the care provided to patients. Protocols have a

place to guide the quality and efficiency of patient care. However, protocols that exist solely to

improve data analysis must not constrain patient care or interfere with changes to the system of care. It

is important that the usual care of patients is being assessed, rather than the effects of an experimental-type intervention.

The results of analysis must be available in a timely manner without unnecessary delay. This can be aided by using a sequential analysis technique or grouping patients into the smallest practical samples that provide adequate power. Large sample groups potentially miss temporal relationships, such as seasonal cycles or other periodic influences.

The mortality outcome used for the development of the control chart methods is the APACHE III endpoint of patient survival to hospital discharge. ICU patient hospital stay can be very long, up to months or even years as an inpatient. A complete dataset can only be analysed after all patients are dead or discharged from hospital, which can take months or years. Early analysis tends to be biased toward the deaths, and I have found in practice that use of in-hospital mortality as the endpoint limits the timeliness of analysis. This observation is consistent with opinion in the literature, which has called for ICU patient survival to be reported at fixed times post ICU admission. The considerable limitation in using in-hospital mortality as the outcome is addressed in subsequent chapters of this thesis. When new models are developed for RA charting in this thesis, the outcome of 30-day in-hospital mortality is used. Patient outcomes can therefore be analysed 30 days after admission.

4.3 Control Chart Analysis of ICU Mortality

Process control charts, such as the p chart, CUSUM chart and EWMA chart, are suitable methods for detecting changes in proportions of binary outcomes of a process. In the ICU, the process under scrutiny is the complex milieu of patients and patient care. The outcome to be monitored is in-hospital mortality.

Variations to the mortality rates may be due to common cause or special cause variations. Common cause variations are related to the nature of the process and include chance variations. A process is operating "in-control" if the variability is due only to random variation. Control charts provide tools for monitoring and analysis by tracking whether the outcome of the process conforms to the expected in-control variability. The variance of a process in-control will account for these chance effects, and is the basis for the calculation and presentation of control limits.

A process is "out-of-control" when the output does not have a stable distribution, and the variation is attributable to causes other than random variation. These are special causes or assignable causes of variation. These can be due to temporary or new factors that were not part of the in-control process. Examples in an industrial context usually include defective raw materials, improperly adjusted machines or operator errors. In the ICU, special cause variations causing a fall in mortality occur with an increase in low risk elective surgery, transfer of low risk patients from a nearby ICU or potentially a systematic increase in quality of care. In contrast, a decrease in mortality rate due to a chance run of unexpected survivors would represent a common cause variation.



The control parameters for a control chart are estimated from a stable, historical period of observation while the process is in-control. In this application, the initial period 1 January 1995 - 31 December 1997 of 36 monthly observations, or 31 blocks of 100 patients, was used. Appendix 2 presents an analysis which establishes that it was reasonable to assume that the ICU process and the mortality rate were in-control during that period.

Where a change in the distribution of the outcome observations exceeds a level that is expected by chance, a signal will occur. Such a signal should prompt several actions in an ICU. The first is an evaluation of the likely importance of the signal by considering the sensitivity and false alarm characteristics of the chart, which are calculated or simulated during the chart design process. Other alternative, complementary patient data will also be examined to differentiate between common or special causes of variation. Secondly, if a systematic change has occurred, an examination of the process in an appropriate and timely manner will be conducted, including a search for the assignable causes. A systematic and acceptable increase in mortality rate due to more high risk patients will lead to a revision of the expected mortality rates. An increase in low risk elective surgery will require a revision down of the expected mortality rate. On the other hand, an increase in mortality rate attributed to, say, premature patient discharge or inadequate medical staff numbers would demand intervention and improvement of the care offered to patients, without revision of the expected mortality targets. Equally important would be a true fall in mortality that was not attributable to severity, casemix or other expected influences. A search for the causes of improvement could provide hypotheses for further clinical investigation.

The key to differentiating between random variations and a systematic change in the underlying

process lies in the design and performance of the control charts.

Design of clinical trials considers the power of the analysis and the Type I and Type II errors. Analogous charting concepts are studied by analysing run length to signal. With control charts it is important to understand how long a process might run before a signal is expected when the process is in-control (false alarm), and when conditions change (true positive). The expected pattern of false alarms and true positive signals can be quantified as the average run length to signal (ARL) under in-control and under changed conditions.

A clinician can design the monitoring scheme prospectively. The information required is: the in-control conditions, the changed conditions to be detected, the tolerable ARLs for in-control and changed conditions, and the patient numbers. The performance of a specific chart method is predictable, and the choice and design of charts can be tailored to specific applications.

There are limitations to monitoring ICU mortality that arise from the assumptions of the control chart

model.

Independence of observations (patient outcomes) is an important assumption of control charting, but is difficult to assess. Correlations between successive outcomes may be related to cycles of activity, staffing, trauma, operating theatre lists and casemix. Staff learn on the job, so annual or term related cycles of staff gaining experience may affect the process. Public and school holidays affect casemix and staff availability. Annual budgetary cycles may have effects on patient case mix, particularly the volume of elective surgery. There is seasonal variation in many diseases such as heart disease, asthma and communicable diseases like respiratory tract infections, meningococcal disease or flaccid ascending paralysis.

Sequential effects may also exist. Nosocomial infection transmission increases the risk of exposure and cross infection, and may cause clusters or outbreaks of infection. High staff activity levels followed by periods of exhaustion and low staffing levels may influence the quality of care. The recent experience of clinicians could affect future clinical decisions. Business considerations may impact on activity levels, which in turn could affect admission and discharge criteria, resource availability and withdrawal of therapy.

There is therefore the possibility of lack of independence of patient outcomes and the possibility that

the in-control mortality observations do not conform to the predicted distribution. In either of these

cases, the control chart analysis would be undermined.



For p charts and the CUSUM, it is assumed that the individual patient outcomes are random variables and a normal distribution can be used to calculate control limits. Departure from normality will prevent control limits from being correctly estimated. Appendix 2 presents a thorough analysis of the observations of monthly mortality rates and the mortality rates of blocks of 50 and 100 consecutive cases (1 January 1995 - 31 December 1997). The monthly and 100 case block mortality observations appear to have a normal distribution and there is no evidence of non-random clustering or mixing. The estimate of the standard deviation of the binomial distribution agrees with the observed standard deviation of mortality rate observations. There is some evidence for a 3 month (or 300 case) cycle in mortality rates.

The following discussion concentrates on the use of control charts to monitor the mortality rate of ICU patients, unadjusted for any factors such as casemix.

4.4 Application of Control Charts to PAH ICU Data

4.4.1 p Chart

The Shewhart p chart is a control chart for the proportion of an attribute in a sample. Control limits are based on a normal approximation of the binomial distribution, with parameters of sample size and control rate of the attribute.

To plot a p chart for patient mortality rate, the monthly mortality rate

$$\hat{p}_i = \frac{1}{n_i}\sum_{j=1}^{n_i} Y_{ij}$$

is plotted each month, where the month or sample is indexed by $i$ and the $n_i$ patient outcomes are indexed by $j$. The patient outcome $Y_{ij}$ is 1 if the patient dies in hospital and 0 if the patient survives to hospital discharge.

The target $\bar{p}$ is the mean mortality rate during an in-control period for the process. The value

$$\sqrt{\frac{\bar{p}(1-\bar{p})}{n_i}}$$

is an estimate of the standard deviation ($\sigma$) of the in-control process.

For large samples, control limits are calculated by a normal approximation to the binomial distribution:

$$CL = \bar{p} \pm a\sqrt{\frac{\bar{p}(1-\bar{p})}{n_i}}$$

where $a$ is the number of $\sigma$ widths of the control limits.
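The control limit formula above can be sketched in code. This is an illustrative function (names invented, not the thesis's software); the only addition is that limits are clipped to [0, 1], since a proportion cannot fall outside that range.

```python
import math

def p_chart_limits(p_bar, n, a=3.0):
    """Control limits for a p chart: p_bar +/- a * sqrt(p_bar(1-p_bar)/n)."""
    sigma = math.sqrt(p_bar * (1.0 - p_bar) / n)  # binomial SD of the proportion
    lower = max(0.0, p_bar - a * sigma)           # clip: proportions >= 0
    upper = min(1.0, p_bar + a * sigma)           # clip: proportions <= 1
    return lower, upper
```

For the PAH parameters (target mortality 0.16, samples of 100 patients), the 2σ limits are approximately 0.087 and 0.233.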

Design of the p Chart

The p chart analysis presented in this section plots mortality rate by month. Appendix 3 presents an analysis of the effects on ARL to signal for a choice of control limit parameters in the setting of changing mortality rates. Based on this analysis, the charts are designed to signal plausible and clinically important changes in mortality rate in a timely fashion with an acceptable incidence of false alarms.

Figure 4.1 presents the data from Appendix 1 Table A1.2 in a p chart of monthly mortality for the PAH ICU in 1995 - 1997, with both 2σ and 3σ control limits. No monthly mortality observation lies outside the 3σ limits and only a single observation (August 1996, 0.26) lies outside the 2σ limits. This further suggests that the process was in control during this period.

[Figure 4.1: p Chart of PAH ICU Hospital Mortality Rate 1995 - 1997 by Month. Average = 0.16. Series: monthly mortality rate, average mortality rate, upper and lower 3σ control limits, upper and lower 2σ control limits. X axis: month, January 1995 - January 1998.]

The mortality rates of the months during the period 1998 - 1999 are plotted on Figure 4.2, using the observations of 1995 - 1997 to determine the value of the control parameter, p̄. The first 6 observations all fall below the target mean, and 20 of the 24 observations are below the mean. Three fall beyond the lower 2σ control limits, including the first observation of January 1998. Mortality was lower than in the period 1995 - 1997, and the chart demonstrates that the process was "out of control" following January 1998.

[Figure 4.2: p Chart of PAH ICU Hospital Mortality Rate 1998 - 1999 by Month. Target mortality 0.16. Series: monthly mortality, upper and lower 3σ and 2σ control limits, target mortality rate. X axis: month, January 1998 - January 2000.]

This p chart introduces three issues.

The first issue raised is that of the level of proof required for data analysis for quality monitoring. The observations for January 1998 and November 1999 were below the 2σ limit. There were no observations falling outside the 3σ control limits. In medical applications, the 2σ limit may be preferable to the 3σ limits adopted in industrial applications.

Secondly, a modified p chart incorporating additional rules to detect non-random patterns of observations may be more sensitive than the p charts described thus far to detect a process out-of-control. For medical applications, Benneyan has proposed a series of rules which allow recognition of an out-of-control signal. These are:

1: Eight consecutive observations on the same side of the target mean. Using this rule, the signal of reduced mortality would have been given in May 1999.

2: Any 12 of 14 consecutive observations on the same side of the target mean. The signal would have been given in August 1999.

3: Three consecutive observations that lie outside 2/3 of the distance to the control limits. Based on 2σ limits, there would be no signal from this rule.

4: Five consecutive observations beyond 1/3 of the distance from the target mortality rate to the control limits. Based on 2σ control limits, a signal would have been given in February 1999. The Western Electric equivalent rule, that 4 out of 5 points lie beyond 1σ from the target mean, would have given an out-of-control signal in May 1998.

5: Thirteen consecutive observations within 1/3 of the distance from the target mortality rate to the control limits on either side. This rule is designed to identify reduced process variability, and there would have been no signals under this rule.

6: Eight consecutive observations displaying a sustained run up or run down. There would have been no signals under this rule.

7: Cyclical or periodic behaviour.

Benneyan's rules can be used to identify a lack of statistical control to supplement p chart analysis. These are similar to the well-known Western Electric decision rules, which are used to increase the sensitivity of p charts to small sustained shifts in the process, and are similar to the Runs Tests of Appendix 2.

Decision rules like those of Benneyan have not been incorporated in this analysis. As more rules are used, the simplicity of chart interpretation is lost and the decision process becomes more complicated. These rules increase not only the sensitivity in recognising true changes, but also the occurrence of false alarms. As the rules are not independent, and each rule relies on several observations, the analysis of the distribution of run length is difficult. Interpretation of the CUSUM or EWMA charts, which are described later in this chapter, is more straightforward and easier to analyse than a complex set of rules.
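As an illustration, the first of Benneyan's rules (a run of observations on the same side of the target mean) is straightforward to check programmatically. The sketch below uses hypothetical names; the analysis in this chapter was not automated in this way.

```python
def rule_same_side(rates, target, run=8):
    """Benneyan rule 1: signal after `run` consecutive observations on
    the same side of the target mean.

    Returns the 0-based index at which the rule first signals, or None.
    """
    streak, prev = 0, 0
    for i, r in enumerate(rates):
        side = (r > target) - (r < target)   # +1 above, -1 below, 0 on target
        # extend the streak only if this point is on the same non-zero side
        streak = streak + 1 if (side != 0 and side == prev) else (1 if side != 0 else 0)
        prev = side
        if streak >= run:
            return i
    return None
```

Eight consecutive monthly rates below a target of 0.16 would trigger the rule at the eighth observation, consistent with the May 1999 signal described above.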

The third issue is the nature of the action prompted by a signal. A signal that the process is out-of-control, even if it indicates lower than expected mortality, should initiate a review of the process, searching for factors that may have caused the mean mortality rate to shift. In the work described here, the data were analysed retrospectively, providing no opportunity to evaluate the process contemporaneously. Examination of the p chart in Figure 4.3 suggests that the fall may have occurred as early as September 1996.

[Figure 4.3: p Chart of PAH ICU Hospital Mortality Rate 1995 - 1999 by Month. Target mortality 0.16. Series: monthly mortality, upper and lower 3σ and 2σ control limits, target mortality rate. X axis: month, January 1995 - January 2000.]

When monitoring is resumed after a signal has occurred, the same chart parameters should be used if the in-control distribution is the only acceptable outcome specification, the cause of the change is temporary, or where the process has been examined and adjustments made to return the output to the in-control specifications. If these conditions do not apply, monitoring specifications should be reviewed. In this example, it would have been appropriate to have reduced the estimate of the mean mortality rate.

4.4.2 Cumulative Sum (CUSUM)

The CUSUM is a graphical technique that accumulates the differences between observations and an expected, ideal or target value. The target may be a historical value, an industrial process specification, or a clinical benchmark. In the ICU example, the in-control mortality rate of 0.16 for 1995 - 1997 was used.

The CUSUM is useful to detect small changes in the output of a process. In contrast to the p charts, CUSUM methods can detect sustained changes in the process mean of less than 1σ, if the parameters are set to do so. For detecting substantially larger changes, the performance of p charts and the CUSUM is equivalent.

CUSUM statistic

The formula for the calculation of the CUSUM statistic $C_t$ is:

$$C_t = \sum_{i=1}^{t} (p_i - \bar{p})$$

where $p_i$ is the mortality rate of the sample, indexed by $i$, and $\bar{p}$ is the target or in-control mortality rate.

If the process is in-control, the mean of $C_t$ is 0, it is normally distributed, and a sequential plot creates a chart that tracks the horizontal axis.
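The running statistic above can be sketched in a few lines (illustrative names only):

```python
def cusum(rates, target):
    """Running CUSUM C_t = sum over i <= t of (p_i - target),
    for a sequence of block mortality rates."""
    total, path = 0.0, []
    for p in rates:
        total += p - target   # accumulate the deviation from target
        path.append(total)
    return path
```

For example, five blocks at a rate of 0.13 against a target of 0.16 accumulate to -0.15, and the slope -0.15/5 = -0.03 recovers the size of the shift, which is how the slope is interpreted in the charts that follow.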

A sustained change in the observed mean will produce a CUSUM chart with a slope equal to the

magnitude of the change in mean. A constant slope indicates that the difference between the target

mean and the observed outcomes remains the same over time. Where the gradient is changing, the

relationship between the outcome and the target mean is changing, suggesting a dynamic situation. A

cyclical variation may reflect the effects of, say, seasonal or predictable temporal change and may

appear as cycles in the CUSUM plot.



Figure 4.4 shows a CUSUM of the in-hospital mortality for 2200 cases from January 1998. The

expected mortality rate is 0.16 and the observed mortality rate is analysed in blocks of 100 admissions.

The sustained negative slope indicates that the observed mortality is below 0.16. The slope is estimated

over 21 blocks as - 0.66/21 = - 0.03. This suggests that the mortality rate over the monitoring period

was consistently 0.03 below the in-control mortality rate of 0.16. The new mortality rate is therefore

estimated to be 0.13.

[Figure 4.4: CUSUM Accumulating Observed - Target Mortality, PAH ICU In-Hospital Mortality 1998 - 1999. Target = 0.16, blocks of 100 admissions. X axis: admission number; Y axis: CUSUM statistic.]
For stable mortality rates, blocks with the same number of patients will have the same mean, variance and control limits. Therefore, if the process is in-control, the CUSUM statistic can be calculated from the differences between the observed and the in-control mortality rates. However, when monthly mortality rates are analysed, the variation in monthly admission numbers creates difficulties in the calculation of control limits for the CUSUM. This problem can be dealt with by standardising the observed mortality rate (or count of deaths, as this is equivalent) and using a CUSUM of the Z score.

The Z score is calculated using the approach suggested by Hawkins and Olwell:

$$Z_i = \frac{p_i - \bar{p}}{\sqrt{\bar{p}(1-\bar{p})/n_i}}$$

and the Z score CUSUM statistic is:

$$S_t = \sum_{i=1}^{t} Z_i$$

where $Z_i$ is the standardised mortality rate for the month indexed by $i$, and $S_t$ is the CUSUM statistic accumulating the standardised mortality rates.
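The standardisation handles months of unequal size: each month's deviation is divided by its own binomial standard deviation before accumulation. A minimal sketch (invented names, not the thesis's software):

```python
import math

def z_score_cusum(deaths, sizes, target):
    """Z score CUSUM for monthly samples of unequal size.

    deaths[i] deaths among sizes[i] admissions in month i;
    target is the in-control mortality rate.
    """
    s, path = 0.0, []
    for d, n in zip(deaths, sizes):
        p_i = d / n
        # standardise by the binomial SD for this month's sample size
        z = (p_i - target) / math.sqrt(target * (1.0 - target) / n)
        s += z
        path.append(s)
    return path
```

A month exactly on target contributes zero; 20 deaths in 100 admissions against a target of 0.16 contributes a Z score of about 1.09.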

Figure 4.5 displays a Z score CUSUM of data for the PAH ICU 1995 - 1999 analysed in months based

on the in-control mortality rate of 0.16. A change in the slope of the Z score CUSUM of in-hospital

mortality rates occurs between July 1996 and January 1997 where the mortality rate falls.

[Figure 4.5: Z Score CUSUM Accumulating Observed - Target Mortality, PAH ICU 1995 - 1999. Target = 0.16, monthly observations. X axis: month, 1995 - 1999.]

The Z score CUSUM accumulates the Z score values, as the target mean is zero with a standard deviation of 1. The advantage of standardising all observations prior to incorporation on a CUSUM chart is that this allows other observations, say clinical indicators or other measures of process quality (infection rates, readmissions, inability to admit patients or procedure complications), to be plotted as CUSUMs on the same chart. The disadvantage is that the original units are no longer apparent, and ease of intuitive interpretation may be lost.

Testing the significance of the CUSUM statistic

Statistical tests that the current process mean is different from the in-control mean are based on the

assumptions that the observations are independent and that they are drawn from an in-control sample of

known distribution.

Appendix 4 provides the background for the statistical approach to CUSUM chart analysis. It contains the formulae used in this section and an analysis of the ARL that led to the choice of the chart design parameters for this ICU application. The design of the chart and the statistical analysis of the data require that the in-control distribution of mortality rates be known, and that the shift of the process mean of interest be specified. In the example, the in-control distribution is based on a mean mortality rate of 0.16. However, the choice of the shift in process mean to be detected by the CUSUM must be plausible and based on clinically important values. The choice of chart parameters and the decision thresholds depends on the ARL of the CUSUM under in-control and changed conditions.

The CUSUM test provides a method for ongoing analysis as the outcome data accumulate. In this ICU example, two CUSUM charts are run concurrently. The upper CUSUM, C⁺, tests for an increased mortality rate, with the chart signalling that the process is out-of-control when C⁺ > h⁺. The lower CUSUM, C⁻, tests for a decreased mortality rate, with the chart signalling that the process is out-of-control when C⁻ < h⁻. The CUSUM test chart examples are plotted with the upper and lower statistics C⁺ and C⁻ (or S⁺ and S⁻ for the Z score CUSUM) against observation number. The control limits h⁺ and h⁻ are plotted as lines parallel to the X axis. When the CUSUM statistic exceeds the control threshold, the statistic is reset to zero and the monitoring process is continued with the same parameters.
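The two-sided scheme with reset-to-zero described here can be sketched as follows. This is a hypothetical tabular CUSUM implementation consistent with the parameter style quoted in this section (reference values K⁺ and K⁻, decision limits h⁺ and h⁻), not the software used for the thesis.

```python
def two_sided_cusum(rates, k_upper, k_lower, h_upper, h_lower):
    """Tabular two-sided CUSUM with reset-to-zero after each signal.

    Signals an increase when C+ exceeds h_upper and a decrease when C-
    falls below h_lower; returns a list of (index, direction) signals.
    """
    c_up, c_dn, signals = 0.0, 0.0, []
    for i, p in enumerate(rates):
        c_up = max(0.0, c_up + (p - k_upper))   # accumulates excess over K+
        c_dn = min(0.0, c_dn + (p - k_lower))   # accumulates shortfall below K-
        if c_up > h_upper:
            signals.append((i, "up"))
            c_up = 0.0                          # reset and continue monitoring
        if c_dn < h_lower:
            signals.append((i, "down"))
            c_dn = 0.0
    return signals
```

With the parameters used for Figure 4.6a (K⁺ = 0.185, K⁻ = 0.135, h⁺/⁻ = ±0.073), a run of 100-patient blocks at a mortality rate of 0.10 would accumulate -0.035 per block on the lower CUSUM and signal a decrease at the third block.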

The following charts illustrate the upper and lower CUSUM charts. Figure 4.6a is a CUSUM chart of blocks of 100 patients. The target mean is 0.16, and the upper level of mortality to be detected is 0.21, corresponding to K⁺ = 0.185, with control limit h⁺ = 0.073. The lower level of mortality is 0.11, corresponding to K⁻ = 0.135, with lower control limit h⁻ = -0.073. The ARL under in-control conditions is approximately 20 observations, which is equivalent to a false alarm on average every 2000 patients, or 1.7 years. There are three occasions when the CUSUM chart signalled that observed mortality was significantly lower than the in-control mortality rate. Given the design characteristics of the chart, these very likely represent a reduction of in-hospital mortality.

Figure 4.6a: C+ and C- in Blocks of 100 Admissions

(Chart: PAH ICU In-Hospital Mortality 1998 - 1999; K+ = 0.185, K- = 0.135, h = +/- 0.073; x axis: Admission Block Number)

If the signal represents a true and sustained change in the process, then the charting parameters should

be altered. Figure 4.6b shows the CUSUM of blocks of 100 admissions recommenced after block 3,

when the first signal of Figure 4.6a was recorded. The revised chart is designed to a target mortality

rate of 0.13, and to optimally detect changed mortality rates of 0.10 and 0.16 (K+ = 0.145, K- = 0.115,

h = +/- 0.088, in-control ARL of ~20 again). There are no signals in this chart, which suggests that the

mortality rate observed during the period of analysis was not substantially different from 0.13.

Figure 4.6b: C+ and C- in Blocks of 100 Admissions

(Chart: PAH ICU 1998 - 1999; target mortality = 0.13, K+ = 0.145, K- = 0.115, h = +/- 0.088; CUSUM started at block 4; x axis: Admission Block Number)

4.4.3 Estimates of the Current Mean: EWMA Chart

A number of approaches to estimating the current mean output are available. The overall mean of a

series of values does not provide a good estimate of the current output, due to the contributions of

sometimes distant historical values. A method that provides better current estimates is the

exponentially weighted moving average (EWMA). More detail for this statistic is provided as it

leads to original work developing the RA EWMA statistic in Chapter 5.

The EWMA is an estimator that has been extensively investigated and it will be applied in this section.

It assigns a higher weight to recent observations than to more distant or historical values. It is useful for

detecting small persistent shifts in the mean outcome and for estimating the timing of those shifts.

The EWMA is slower to detect large shifts in the process mean than the p chart. In contrast to the p

chart and the CUSUM, the EWMA is relatively insensitive to departures from the normal distribution.

As well as monitoring mortality for groups of patients, the EWMA can be applied to monitoring

mortality rates by analysing a series of individual patient outcomes.



The formula for the EWMA is

EWMA_i = lambda * y_i + (1 - lambda) * EWMA_{i-1}

where y_i is the value of the i-th observation. This value may be the mortality rate of a sample of

patients, p_i, or the outcome of a single patient, Y_i. In the examples used, both sample blocks of 100

consecutive patients and single patient outcomes are presented. lambda is a weight between 0 and 1.

Larger values give more weight to recent observations, with the limiting value of lambda = 1 producing a

plot of y_i against i, similar to a p chart. EWMA_i is the value of the statistic indexed by i.

The calculation of the EWMA statistic is an iterative process. For the first value, EWMA_0, an

estimate of the in-control mortality rate is used. In this example, the estimate is p-bar. EWMA_i is a

weighted average of the starting estimate p-bar and the subsequent observations y_1 to y_i.

An alternative expression for EWMA_i is

EWMA_i = (1 - lambda)^i * p-bar + lambda * sum_{k=1}^{i} (1 - lambda)^(i-k) * y_k

where k is a whole number <= i. We can calculate control limits for the EWMA using the in-control

mortality rate and the standard deviation. A formula for the control limits is:

CL_i = p-bar +/- a * sigma * sqrt( (lambda / (2 - lambda)) * [1 - (1 - lambda)^(2i)] )

where sigma is the standard deviation of the observations of the in-control sample, and a is the width of the

control limits in multiples of sigma. The estimate of sigma is:

sigma-hat = sqrt( p-bar * (1 - p-bar) / n_i )

and the control limits are

CL_i = p-bar +/- a * sqrt( (p-bar * (1 - p-bar) / n_i) * (lambda / (2 - lambda)) * [1 - (1 - lambda)^(2i)] )

where n_i is the number of patients in each block of admissions. For an EWMA for single cases, n_i = 1.

When the number of observations is large, a simplified formula,

CL = p-bar +/- a * sqrt( (p-bar * (1 - p-bar) / n_i) * (lambda / (2 - lambda)) )

can be used.
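As an illustrative sketch (my own, not code from the thesis; the function names are assumptions), the EWMA recursion, its closed form and the time-varying control limits above can be computed as follows:

```python
import math

def ewma_series(ys, lam, start):
    """Iterative EWMA: EWMA_i = lam*y_i + (1 - lam)*EWMA_{i-1}, with EWMA_0 = start."""
    values, ewma = [], start
    for y in ys:
        ewma = lam * y + (1 - lam) * ewma
        values.append(ewma)
    return values

def ewma_closed_form(ys, lam, start, i):
    """Closed form: (1-lam)^i * start + lam * sum_{k=1..i} (1-lam)^(i-k) * y_k."""
    return (1 - lam) ** i * start + lam * sum(
        (1 - lam) ** (i - k) * ys[k - 1] for k in range(1, i + 1)
    )

def ewma_control_limits(p_bar, n, lam, a, i):
    """CL_i = p_bar +/- a*sqrt(p_bar*(1-p_bar)/n * lam/(2-lam) * (1-(1-lam)^(2i)))."""
    var = (p_bar * (1 - p_bar) / n) * (lam / (2 - lam)) * (1 - (1 - lam) ** (2 * i))
    half_width = a * math.sqrt(var)
    return p_bar - half_width, p_bar + half_width
```

For blocks of 100 cases with p-bar = 0.16, lambda = 0.3 and a = 2, the limits converge quickly to the simplified large-i values of roughly 0.16 +/- 0.031.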

The design of the EWMA chart requires consideration of the effects of a, lambda and n_i on ARL under in-

control and changed conditions. Appendix 5 provides an analysis of the effect of these parameters.

Parameter choice determines the ability to detect real shifts in process mean and the likelihood of false

positive signals.

Figure 4.7a presents the EWMA chart of the mortality rates for blocks of 100 patients during 1998 - 1999,

with p-bar = 0.16, a = 2 and lambda = 0.3. It is clear that the mean mortality rate is not 0.16. The mean

mortality rate appears to have changed to lie in the range 0.12 - 0.14.

Figure 4.7a: EWMA Chart

(Chart: PAH ICU In-Hospital Mortality 1998 - 9; blocks of 100 cases, in-control mean = 0.16, lambda = 0.3, a = +/- 2; EWMA with upper and lower 2 sigma control limits, plotted against block of 100 cases)

Figure 4.7b presents the same data, but with charting parameters revised and re-estimated based on the

EWMA estimate of the mean mortality. The new target mean was 0.141 after the second block of

observations. There are no further signals. The chart cannot be used to demonstrate a statistically

significant difference between the EWMA estimate of mortality rate and 0.141; that would require use

of an appropriate statistical test. Nevertheless, the chart suggests that the mortality rate was in the range

of 0.12 - 0.14 during the latter part of 1999.

Figure 4.7b: EWMA Chart

(Chart: PAH ICU In-Hospital Mortality 1998 - 9; blocks of 100 cases, lambda = 0.3, a = +/- 2; initial estimate of in-control mean = 0.16, revised to 0.141 after the 2nd block; EWMA with upper and lower 2 sigma control limits, plotted against blocks of 100 cases)

The next series of charts plot the EWMA for individual patient outcomes. Figure 4.8a displays the EWMA

chart of outcomes of admissions during 1998 - 1999. The in-control mortality estimate is 0.16 and

lambda was chosen to be 0.001. Again it is clear that the mean during this period is not 0.16.

Figure 4.8a: EWMA Chart

(Chart: PAH ICU In-Hospital Mortality 1998 - 9; single cases, in-control mortality = 0.16, lambda = 0.001, a = +/- 2; EWMA with upper and lower 2 sigma control limits, plotted against case number)

Figure 4.8b presents the same data, using a starting in-control estimate of 0.16. When the EWMA

crosses the lower control limit at case 69, a revised estimate of mortality of 0.154 allows

recommencement of monitoring. At case 387, the lower control limit is crossed again and the in-control

mortality is re-estimated to be 0.143. At case 2022, the lower control limit is crossed again, the in-

control mortality rate is estimated to be 0.128, and there are no further signals.

Figure 4.8b: EWMA Chart

(Chart: PAH ICU In-Hospital Mortality 1998 - 9; single cases, lambda = 0.001, a = +/- 2; in-control mortality re-estimated at each signal, final target = 0.128; EWMA with upper and lower 2 sigma control limits, plotted against case number)

The conclusion from using these charts is that the EWMA is useful to estimate the current mean and to

demonstrate when the mean varies from a target value. When a signal occurs, the EWMA provides a

ready estimate for a new target mean mortality rate, and the monitoring can be continued. The choice

of parameters lambda and a depends on the ARL performance. As the EWMA is robust to the effect of non-

normally distributed data, it can be used to chart individual patient outcomes.

Appendix 5 presents an analysis of the EWMA charts which leads to the choice of charting parameters

for this application.



4.5 Search for Cause of Decreased In-Hospital Mortality

Each of the control charts used in this chapter demonstrated that the in-hospital mortality during 1998 -

1999 was less than the in-control estimate from the period 1995 - 1997. Figure 4.5 suggests that the

change in mortality rate was likely to have occurred between July 1996 and January 1997.

An analysis of the ICU casemix and illness severity was undertaken to determine a cause of the change

in mean mortality rate. Patient age and severity of illness (measured by the APACHE III score) were

analysed with non-parametric one way analysis of variance (Kruskal-Wallis). Casemix was assessed

with the patients grouped into Emergency Surgery, Elective Surgery and Non-operative Cases.

Table 4.1: Comparison of Princess Alexandra Hospital ICU Casemix for the Years 1995 - 1999

                                 1995     1996     1997     1998     1999
Gender * (% male)                63.9     60.4     60.5     64.5     64.7
Age # (mean, years)              52.2     52.2     53.4     53.0     54.2
APACHE III Score + (mean)        46.1     48.5     47.6     46.8     48.2
Emergency surgery, n (%)         151      139      145      137      121
                                 (14.2)   (13.0)   (14.1)   (13.6)   (10.9)
Elective surgery, n (%)          407      390      397      445      498
                                 (38.3)   (36.6)   (38.5)   (44.1)   (44.9)
Non-operative, n (%)             505      538      490      426      489
                                 (47.5)   (50.4)   (47.5)   (42.2)   (44.1)

* chi-squared = 6.8, p = 0.56; # Kruskal-Wallis chi-squared(4) = 9.2, p = 0.06; + Kruskal-Wallis chi-squared(4) = 9.2, p = 0.06

There was no change in the gender of patients admitted to the ICU. There was a non-statistically

significant increase in the average patient age from 52.2 years in 1995 and 1996 to 53.4 years in 1997,

which would not explain a fall in mortality rate. There was no systematic trend in the APACHE III

score measuring severity of illness, so changes in illness severity do not explain the fall in mortality rate.

Figure 4.9: Proportion of Elective Surgery, Emergency Surgery and Non-Operative Cases, PAH ICU 1995 - 1999

(Chart: proportions of Elective Surgery, Non-Operative and Emergency Surgery cases plotted by year)

There were changes in patient casemix during the period. Figure 4.9 presents the percentages for the

patient groups from Table 4.1. There was an increase in the percentage of elective surgery and a

decrease in the percentage of non-operative cases during 1998 - 1999 compared to the period 1995 -

1997. Patients having elective surgery have a low in-hospital mortality rate compared to non-operative

patients or emergency surgery patients. Therefore, changes in casemix could explain the fall in the

mortality rate.

4.6 Summary

The use of p charts, CUSUM charts and EWMA charts has been demonstrated on raw mortality data for the

PAH ICU. A thorough analysis of the ARL of the charts under in-control conditions and under clinically

relevant changed mortality rates provided the basis for the choice of charting parameters.

All charts demonstrate that the mortality rate was lower in the period 1998 - 1999 than in the previous

period 1995 - 1997. The increase in the percentage of elective surgery and the fall in the percentage of

non-operative cases may account for the fall in mortality rates that was observed.

This suggests that some form of casemix and patient severity of illness adjustment could improve the

assessment provided by control charts. The incorporation of a RA tool into the control charts is the subject of

Chapter 5.

Chapter 5:
Monitoring Intensive Care Outcome using Risk
Adjusted Control Charts.

5.1: Introduction
The purpose of this chapter is to develop and adapt control charts to incorporate an adjustment for

casemix and patient severity using an APACHE III model of in-hospital mortality as the risk

adjustment (RA) tool. The development of RA control charting brings together the topics studied in the

previous chapters.

A control chart compares observed outcomes with expected outcomes from a process over time. The

control charts used in Chapter 4 assumed a homogenous risk of death for all patients equal to the in-

control mortality rate of 0.16 at Princess Alexandra Hospital (PAH) Intensive Care Unit (ICU). A RA

control chart will incorporate an estimate of the risk of death of every patient. The expected mortality

rate depends on the probabilities of death of the patients included in the analysis. The RA control chart

uses a model that fits the patient data during an in-control period to provide an estimate of the risk of

death. The APACHE III model with proprietary adjustments for hospital characteristics predicts the

probability of in-hospital death of ICU patients at the PAH reliably and the fit was validated on the

1995 - 1997 patient dataset. This model will be used to develop the RA charts described in this

chapter.

Control limits or decision thresholds provide a statistical test of the agreement between the expected

and observed outcomes. Out-of-control signals indicate that the RA model is unlikely to describe

adequately the relationship between the patient variables and the patient outcomes. This implies a

change from the in-control state. Signals that indicate change in model fit may be due to a number of

causes, one of which is a change in the quality of patient care.

The CUSUM has been used for RA control chart monitoring of cardiac surgery, myocardial

infarction patients and general surgical cases, and RA Shewhart charts have been described by

Alemi and co-workers. The methods described in these papers will be modified for the ICU

context, and the RA exponentially weighted moving average (RA EWMA) chart will be introduced.

The average run length (ARL) for these charts to detect changes in RA mortality rates will be analysed.

Parameter choice will be made for this ICU application based on the performance of the charts over

ranges of parameters and clinical scenarios.

An important consideration in the development of RA control chart methods has been to select a

method to describe the distribution of mortality rates. This important basis for RA chart development is

explored in Appendix 6. Three methods to characterise the distribution of observed mortality rates are

discussed.

Two of these methods are applied to the charts and data in this chapter. The central limit theorem leads

to approximation of the mortality rate distribution using a normal distribution. This is a good

approximation for most RA chart applications. An exact method, using an iterative approach, is also

used to calculate the cumulative probability function of mortality rates of samples with patients of

known probability of death.

In previous applications of RA strategies to health care, the emphasis has been on comparison between

institutions, rather than on monitoring changes over time. For example, the emphasis in cardiac surgical

studies has been on using RA approaches for comparisons. In contrast, the RA methods that will

be described in this thesis are designed to monitor mortality rates, and detect changes within a single

institution over time.

The American APACHE III hospital mortality model, adjusted for hospital characteristics, performed

well as an estimate of in-hospital mortality in the PAH ICU between January 1995 and December 1999.

It will be used as a RA tool to illustrate monitoring of ICU mortality. The characteristics of the data

series and the performance of the APACHE III model for the in-control period of 1995 - 1997 are

summarised in Chapter 3. Although a validated model of the APACHE system is used in these

analyses, any model with adequate performance could be used for RA. Other models have been

developed using this dataset, and further development of an alternative machine learning RA

model will be explored in Chapters 6 and 7.

5.2: Use of RA Methods in Monitoring Hospital Mortality Outcomes.

RA control charts are proposed to detect a change in the process of care, using model fit as an

indication of change. In Figure 1.2, Iezzoni's "Algebra of Risk" described the manner in which patient

factors and diagnosis factors lead to patient outcomes, influenced by the process of care and random

variation. The relationship between the patient and disease factors, and the patient outcomes, can be

modelled. Subsequently, differences between observed and predicted patient mortality can be

ascribed to random variation or to deterioration in model fit, potentially attributable to changes in

quality of care. RA charts will provide a means of assessing model fit.

5.2.1: Issues with Monitoring RA Mortality.

There are a number of issues that arise with risk adjusted outcome monitoring.

1. Performance of RA model

The key to monitoring RA mortality rates is to have an adequate model to estimate the patient risk of

death. The model performance must meet the criteria discussed in Chapter 2. It must be reproducible

and be able to be generalised to unseen data separate from the context in which the model was

developed or validated. The performance must be assessed in terms of discrimination, for example,

using the area under the ROC curve, and calibration, assessed by the Hosmer-Lemeshow statistics.

Ultimately, any numerical recommendation will be an opinion based on the practicalities of data

collection, model building and validation, and clinical use.

My recommendation is based on two important considerations.

The first consideration is the level of performance that can be realistically expected from an ICU

mortality prediction model. A review of the performance of models to predict ICU outcomes was

summarised in Chapter 2, Table 2.1. This gives a guide to the discrimination and calibration that can be

expected from this type of model. For example, the area under the ROC curve should be in the range

0.80 - 0.90. It is reasonable to expect that the Hosmer-Lemeshow C statistic based on 10 groups

should have a value less than 15.5 (chi-squared(8), p > 0.05), but preferably C should be much smaller. The

power of the Hosmer-Lemeshow method to detect departures from the null hypothesis of good model

fit will depend on the size of the validation data set.

The second consideration is the sizes of the modelling and validation datasets. This requires a balance

between the need for large datasets to develop and validate a model, and the finite time and resources

available to collect patient data. A recent study by Clermont and co-workers modelled ICU patient

outcomes with logistic regression and artificial neural networks, using a developmental dataset of 800

patients and a validation set of 447 cases. With developmental datasets of 800 patients or larger,

satisfactory models were developed, but with smaller datasets the models were unreliable. A similar

patient series of 1200 patients would take approximately 1 year to collect at the PAH ICU. This would

provide 800 cases (67% of the data) for model development and 400 cases (33% of the data) for model

validation. The study of Clermont et al. provides a useful benchmark, and on this basis, I propose that a

practical compromise for model development dataset size be 800 patients, with the validation dataset

size of 400 cases.

The APACHE III model with adjustment for hospital characteristics performed better than these

minimum recommendations given above. On the 3159 patients at the PAH ICU during 1995 - 1997,

the area under the ROC curve was 0.91, and the Hosmer-Lemeshow C statistic was 14.5.

2. Model generalisation

Several simulation studies show that models that estimate the probability of ICU patient death

can have poor performance and provide unreliable predictions of the risk of death when the patient samples

are different from those under which the model was developed.

A statistical model potentially describes only the relationships that were present in the data on which it

was developed, and does not represent an immutable truth. Models may not be applicable to all ICU

patient samples and institutions to which they might be applied. Model fit cannot be assumed, and

evaluation of the performance of each model at each site is required.

3. Changes over time

The fit of a RA model may change over time. This could be attributable to evolving quality of care.

A simulation study by Zhu et al. demonstrated that poor quality of care, simulated by increased

mortality, causes model fit to deteriorate. Alternatively, changes to model fit could be due to changes in

discharge, admission or data collection practice, or changes in policy, equipment or treatment goals that

impact on the variables and outcomes collected.

4. Observer effects

In practice, changes in performance could be due to the effect of the observer, or the commencement of

surveillance on the process. During the 1920s, productivity at the Western Electric plant in

Hawthorne, Illinois, improved whenever observation or any intervention was attempted. The

Hawthorne effect could impact on the process of care, or it could exert a more subtle influence.

Admission and discharge practices, data collection, patient selection and data interpretation could be

influenced if the system is under scrutiny. For analysis of an historical dataset, this would not be an

issue, but it is a consideration when these monitoring approaches are applied in a real clinical setting.

5. Limitations of a single measurement

It is unrealistic to use a single measurement to capture the quality of care of patients. Patient survival

provides only one aspect of evaluating ICU outcomes. A programme of quality management should

evaluate many domains of ICU performance, including mortality, resource use, process measures,

access measures and complication rates. RA control charts form part of the measurement of patient

mortality. Control chart approaches can be readily applied to the other domains of measurement of

quality of care in ICU.

6. Choice of outcome

There are other meaningful endpoints than in-hospital death. Quality of life after treatment in ICU is

important to patients, but is usually not measured. It has been proposed that assessments of the quality

of ICU outcomes should measure the quality of life rather than the rate of death. Mortality is a crude

indicator of quality of ICU care and may offer limited insight into subtle changes in the process.

The ability of RA outcomes other than death to measure quality of care is untested, and such outcomes will be

difficult to measure and model. Quality of life after ICU has been studied and reviewed, though it

has not been used as a measure of quality of care. Unfortunately, there are no models that estimate the

probability of alternative outcomes and provide a basis for RA.

The choice of the definition of the mortality outcome has important implications for the timeliness and

accuracy of mortality analysis. ICU patients may have hospital stays of months or occasionally years. If

the endpoint of in-hospital mortality is being analysed, then it may be months (or even years) after an

admission date before all the patient records are finalised, with the patients discharged from hospital or

dead. Early analysis is biased toward a higher mortality rate. The APACHE III system uses in-hospital

mortality, and for this retrospective application using a complete dataset this is not an issue. However,

in practice, the 30 day in-hospital mortality endpoint allows analysis to be conducted 30 days after ICU

admission. This important alternative endpoint is used in model development in Chapters 6 and 7.

7. Data source and quality

Data quality and collection issues are important considerations for the development and validation of

RA models and subsequent RA chart analysis. The general considerations of data quality have been

discussed in Chapter 2.

Some RA mortality models for settings other than ICU are based on the use of patient data designed for

billing or administrative work, not for RA. Minimum datasets from administrative, demographic and

patient billing records are relatively inexpensive and there are masses of data available. These data are

of controversial quality because of ambiguities of definition and problems with coding.

Administrative databases do not capture all the features that determine a patient's outcome, and do not

include clinical variables reflecting severity of illness, co-morbidities and abnormal laboratory

evaluations. Failure to capture these explanatory features will compromise the potential performance of

models.

The best datasets should be gathered with clinical risk estimate modelling or RA purposes in mind,

though this is time consuming and expensive. A prospective data collection and collation process is

necessary to capture the administrative, demographic, diagnostic, physiological and co-morbid

characteristics of the patients. The data for ICU patients at the PAH ICU 1995 - 1999 form a complete

dataset collected under the rules of the APACHE III model.

5.2.2: Application of RA to the Monitoring of ICU Mortality.

RA techniques applied to monitoring outcomes in a single institution over time offer an adjunct to

current methods of quality management. If the RA model meets performance criteria, charts comparing

observed mortality rates against the RA expected values or other RA mortality statistics are possible.

The methods in the following sections are RA adaptations of standard monitoring charts: the p chart of

mortality rates (RA p chart), the RA CUSUM and the RA EWMA chart. Before describing these

adaptations, it is necessary to describe and address some issues that arise in the calculation of control limits

for the charting procedures, and how valid the statistical assumptions are for the distribution of

mortality rates. Appendix 6 describes three approaches to characterising the distribution of the

observed mortality rate of a sample of patients.

In an industrial application, where the inputs are standardised, and the process is in-control the risk of

failure for each item sampled is assumed to be the same. Changes in the statistical distribution of

outcomes imply that there is a change in the process.

In contrast, in medical applications, each patient has a unique set of contributors to risk of death. A

patient brings physiological reserve (captured by age and co-morbidities), physiological disturbance

(captured by the abnormalities in a range of physiological, clinical and laboratory measurements) and a

diagnosis or disease category. There will be additional factors that are not measured in the first 24

hours in the ICU. Some of these occur before admission (pre-admission treatment, details of the exact

surgical event), some are present but not recorded (rare conditions, new technology or uncommon

measurements) and many of the determinants of outcome occur after the first day (failure or

complications of therapy).

Alemi and Oliver argue that the estimate of risk of death should reflect the risk of death of a patient

under ideal conditions, rather than what is reasonable to expect from a health care process. This ideal

may be impossible to estimate, but if the argument were accepted, underperformance would be

universal.

The exposure of patients to the process of care will not be constant and will vary from hours to

months. The outcome of a patient who spends 8 hours in ICU after an elective surgical procedure is

more reliant on the success of the surgery and the patient's underlying physiological reserve than ICU

care. In contrast, the survival of a patient who spends 3 months in ICU with pneumonia and multi-

organ failure depends heavily on the quality of the care offered in the ICU.

With these issues considered, the development of a RA charting approach will be presented.

5.3: RA p chart of mortality rate.


The RA p chart plots the observed mortality rate compared to the expected mortality rate given by the RA model

predictions for the patients in each sample.

The patient mortality data can be analysed in blocks of a constant number of patients, say 100

admissions, or in specified time periods (e.g. months). For a fixed block length, the time periods over

which patients are accrued will vary, and will not neatly fit into calendar months or quarters. Fixed

time periods, e.g. monthly blocks, are more convenient from a unit management perspective, but the

case numbers vary. Either approach can be used. For this application, a fixed block length of 100

patients will be used. At the PAH ICU during the study period, 100 patients were admitted over about 4

weeks.

5.3.1 RA p Chart: Control limits calculated using a normal approximation

The RA p chart plots the observed mortality rate on a chart with control limits calculated using the

risks of mortality of the patients in the sample. RA X-bar charts and p charts have been proposed by

Alemi and co-workers, using the binomial distribution to calculate control limits. The ICU application has large

sample sizes, so I have adapted the RA p chart to ICU RA mortality monitoring by using a normal

approximation to calculate the control limits.

The following notation and formulae are used for the RA p charts. Y_ij is the outcome for patient j in

sample i. If the patient dies, Y_ij = 1 and the probability of death is pi_ij. If the patient survives,

Y_ij = 0 and the probability of survival is 1 - pi_ij.

E(Y_ij) = pi_ij

var(Y_ij) = pi_ij * (1 - pi_ij)

pi_ij may be estimated by pi-hat_ij, using a statistical model such as the APACHE III risk of death estimate:

var-hat(Y_ij) = pi-hat_ij * (1 - pi-hat_ij)

The observed mortality rate R_i for sample i is

R_i = ( sum_j Y_ij ) / n_i

and the predicted mortality rate, E(R_i), is

E(R_i) = ( sum_j pi-hat_ij ) / n_i

The variance of the observed mortality rate is

var(R_i) = ( sum_j var(Y_ij) ) / n_i^2 = ( sum_j pi_ij * (1 - pi_ij) ) / n_i^2

The RA p chart compares R_i to E(R_i), with control limits calculated around E(R_i). The control limits

are defined as multiples of the standard deviation (sigma). Let a be the number of sigma used; then

CL_i = E(R_i) +/- a * sqrt( var(R_i) ) = ( sum_j pi-hat_ij ) / n_i +/- (a / n_i) * sqrt( sum_j pi-hat_ij * (1 - pi-hat_ij) )

In this illustration, a = 2 and n_i = 100 patients.
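The notation above translates directly into code. The following sketch (mine, not from the thesis; the function name and return convention are assumptions) computes one RA p chart observation and its +/- a sigma limits from the model risks of death:

```python
import math

def ra_p_chart_point(risks, deaths, a=2.0):
    """One RA p chart observation.

    `risks` holds the model risks of death (pi-hat) for the n patients
    in the sample; `deaths` is the number who died. Returns the observed
    rate, expected rate, lower and upper limits at +/- a standard
    deviations, and whether the observation is within the limits."""
    n = len(risks)
    expected = sum(risks) / n
    sd = math.sqrt(sum(p * (1 - p) for p in risks)) / n
    observed = deaths / n
    lower, upper = expected - a * sd, expected + a * sd
    return observed, expected, lower, upper, lower <= observed <= upper
```

For a block of 100 patients each with pi-hat = 0.16, the 2 sigma limits are approximately 0.087 and 0.233, so an observed block mortality of 0.30 would signal.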

Figure 5.1 is a RA p chart of the hospital mortality rate of all ICU patients, in blocks of 100

admissions. Control limits are set at +/- 2 sigma, and observed mortality rates that fall outside the limits are

marked with an asterisk (*). The average APACHE III predicted mortality rate and the observed

mortality rate per block are plotted.

Figure 5.1: RA p Chart of Hospital Mortality

(Chart: PAH ICU 1995 - 1999, n = 100, APACHE III RA; * marks observed mortality values beyond the 2 sigma control limits; x axis: sample number, blocks of 100 patients)

In blocks 7, 13 and 19, the observed mortality was above the upper control limit, and in block 51, the

mortality rate was below the lower control limit. The chart can be interpreted as showing that the hospital

mortality rate of ICU patients was very likely to have been significantly above the level expected from

the patient characteristics in the earlier part of the analysis period. The hospital mortality may have

been lower than predicted in the later part of the period.

An expression for the probability of a single observation of the RA p chart falling outside the control

limits can be developed. This is the power for a single observation and is adapted from the formula of

Flora. The estimated probability of Y_ij = 1 is pi-hat_ij, but under changed conditions, which we wish to

detect, the probability of Y_ij = 1 becomes pi*_ij.

Power = Phi( [ sum_j (pi-hat_ij - pi*_ij) - a * sqrt( sum_j pi-hat_ij * (1 - pi-hat_ij) ) ] / sqrt( sum_j pi*_ij * (1 - pi*_ij) ) )
        + 1 - Phi( [ sum_j (pi-hat_ij - pi*_ij) + a * sqrt( sum_j pi-hat_ij * (1 - pi-hat_ij) ) ] / sqrt( sum_j pi*_ij * (1 - pi*_ij) ) )

Equation 5.1

where Phi( ) is the cumulative standard normal distribution.

It is convenient to consider the change in risk of death in terms of an altered odds ratio (OR), where it is

the odds of death rather than the absolute risk of death that changes:

OR = [ pi*_ij / (1 - pi*_ij) ] / [ pi-hat_ij / (1 - pi-hat_ij) ]

This expression is rearranged for ease of use so that

pi*_ij = OR * pi-hat_ij / (1 - pi-hat_ij + OR * pi-hat_ij)

Equation 5.2
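Equations 5.1 and 5.2 can be combined into a single power calculation. The sketch below is my illustration (the function and variable names are assumptions), using the normal approximation exactly as in Equation 5.1:

```python
import math
from statistics import NormalDist

def ra_p_chart_power(risks, odds_ratio, a=2.0):
    """Single-observation power of the RA p chart (Equation 5.1):
    probability the sample mortality rate falls outside the +/- a sigma
    control limits when every patient's odds of death are multiplied
    by `odds_ratio` (Equation 5.2)."""
    # Equation 5.2: shifted risks pi* under the changed odds ratio.
    shifted = [odds_ratio * p / (1 - p + odds_ratio * p) for p in risks]
    sd_hat = math.sqrt(sum(p * (1 - p) for p in risks))      # in-control SD of deaths
    sd_star = math.sqrt(sum(p * (1 - p) for p in shifted))   # SD under changed conditions
    drift = sum(h - s for h, s in zip(risks, shifted))
    phi = NormalDist().cdf
    return phi((drift - a * sd_hat) / sd_star) + 1 - phi((drift + a * sd_hat) / sd_star)
```

At OR = 1 the expression reduces to the false alarm probability 2 * Phi(-a), about 0.046 for a = 2, and the power rises as the OR moves away from 1.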

An analysis of the performance of the RA p chart is shown in Figure 5.2. Each curve is the probability that

R_i, the mortality rate of a sample of 100 cases with known pi-hat_ij, will fall beyond the 2 sigma control

limits. pi*_ij from Equation 5.2 is substituted into Equation 5.1. The curves plotted in Figure 5.2 are the

power analyses of consecutive samples of 100 patients from the PAH database across the range of

altered OR 0.2 - 4.

This power analysis is equivalent to the operating characteristic curve analysis used to analyse the

performance of the non-RA p chart, in Appendix 3. For any observation at any OR,

power_i = 1 - operating characteristic_i

Figure 5.2: Power Analysis of RA p Chart


Samples of 100 patients, a = +/-2, APACHE III RA
(x-axis: Changed Odds Ratio: OR; y-axis: probability of the sample mortality rate falling outside the control limits)

These analyses show that the probability of an observed mortality rate exceeding the control limits will

depend on the altered OR and the estimated risks of death of the 100 patients in each sample. For

example, with an OR falling to 0.5, there is between a 0.2 and 0.55 chance of detecting a difference on

the single block observation. The probability of detecting a doubling of OR is between 0.48 and 0.65

with the single sample observations.

The performance of a control chart is generally assessed in terms of ARL in-control (OR = 1) and out-

of-control (OR ≠ 1). Figure 5.3 shows the results of a simulation experiment demonstrating the effect of

changing OR on the ARL of a RAp chart. To simulate the patient casemix for this analysis, cases were

randomly drawn in blocks with replacement, from the 5278 admission records. Block sizes of 100

cases (~ a month), 600 (~ 6 months) and 1200 (~ year) were chosen to simulate rational sample sizes.

Simulated outcomes were allocated as a series of Bernoulli trials, based on the new OR and the out-of-

control risk of death π′_j. The ARL is the average number of samples i until a simulated mortality rate

fell outside the 2 σ control limits. For Figure 5.3, as the sample sizes differ, the ARL is expressed in

average number of cases to signal. Ten thousand simulations were performed at each value for OR in

the range 0.5 - 2, in increments of 0.1.
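The simulation just described can be sketched as follows (Python; a simplified stand-in for the actual experiment, with illustrative names and a small default number of simulations):

```python
import math
import random

def rap_chart_arl(p_hats, odds_ratio, block_size=100, a=2.0,
                  n_sims=200, max_blocks=100000, seed=1):
    """Monte Carlo estimate of the RA p chart ARL (in blocks of block_size)
    when the true odds of death change by odds_ratio."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_sims):
        for i in range(1, max_blocks + 1):
            # draw a block with replacement from the estimated risks
            block = [rng.choice(p_hats) for _ in range(block_size)]
            shifted = [odds_ratio * p / (1 - p + odds_ratio * p) for p in block]
            deaths = sum(rng.random() < p for p in shifted)  # Bernoulli outcomes
            centre = sum(block)                              # expected deaths
            sd = math.sqrt(sum(p * (1 - p) for p in block))
            if abs(deaths - centre) > a * sd:                # outside the limits
                total += i
                break
        else:
            total += max_blocks
    return total / n_sims
```

Scaling the returned value by the block size gives the average number of cases to signal used in Figure 5.3.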

Figure 5.3: Effect of Changing Odds of Death on ARL of RA p


Chart (Semi-logarithmic plot)
Samples of 100, 600 and 1200 cases, OR range 0.5-2.0 in increments of 0.1,
10 000 simulations at each value, APACHE III RA. a = +/-2

(Curves: blocks of 1200 cases, 600 cases and 100 cases. x-axis: Changed Odds Ratio: OR; y-axis: average number of cases to signal, logarithmic scale)

Figure 5.3 is a semi-logarithmic plot that shows the effect on ARL of different sample sizes and

different OR. As a general observation, smaller block sizes require fewer cases to detect changes in

OR, and shorter ARL under in-control conditions. Whilst blocks of 1200 patients offer only one false

alarm every 22 years on average and the analysis readily detects changed OR < 0.7 and > 1.6, accrual

of cases would take a year at the PAH ICU. This limits the practical use of p charts with 1200 case

blocks. Blocks of 100 cases have ARL of 2.7 (270 cases) at OR = 0.5, ARL of 22.1 (2210 cases) at

OR = 1 and 1.8 (176 cases) at OR = 2. A false alarm is more likely to occur with the block size of 100

patients, but the probability is acceptable in this situation. The 100 patient block p chart readily detects

increases in OR.

5.3.2 Control Chart of Standardised (Z) Scores

The same data grouped in blocks of 100 cases can be analysed and plotted using a standardised Z score

where:

$$ Z_i = \frac{\sum_{j=1}^{n_i} Y_j - \sum_{j=1}^{n_i}\hat{\pi}_j}{\sqrt{\sum_{j=1}^{n_i}\hat{\pi}_j(1-\hat{\pi}_j)}} $$

This statistic is the same as the score described by Flora, plotted in a control chart format. Figure 5.4

displays the Z score control chart. As this is the same data, the same blocks have Z scores > 2 or < -2 as

those marked with an asterisk (*) in Figure 5.1. A trend of falling RA mortality is seen in this chart.

Figure 5.4: Z Score Control Chart


PAH ICU In-Hospital Mortality 1995 -1999, n = 100, APACHE III RA Tool

(x-axis: Admission Number; y-axis: Z score)

A simple post hoc runs analysis provides strong evidence that the RA hospital mortality exceeded that

predicted by the model in the initial part of the chart. There were 14 out of the first 15 of the RA

observations above the predicted mortality in the first part of the analysis period in both the RA p

charts of Figure 5.1 and Figure 5.4.

The clear presentation and lack of clutter are advantages of this Z score presentation, although there

are no units.

Note that the Z score of this example uses a standardised statistic based on the estimated distribution

parameters of R_i given π̂_j. In contrast the Z score CUSUM of Chapter 4 uses a standardised statistic

based on a uniform risk of death estimate.

By definition, an analysis of the performance of the RA Z score chart, assuming the Z scores are

normally distributed, will give the same results as the RA p chart of the previous section.
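As an illustration, the block Z score can be computed in a few lines (Python; outcomes coded 1 for a death and 0 for a survivor, names illustrative):

```python
import math

def flora_z(outcomes, p_hats):
    # standardised Z score for one block: (observed - expected deaths)
    # divided by the estimated standard deviation of the death count
    observed = sum(outcomes)            # Y_j = 1 for a death, 0 for a survivor
    expected = sum(p_hats)
    sd = math.sqrt(sum(p * (1 - p) for p in p_hats))
    return (observed - expected) / sd
```

A block of 100 patients with 20 deaths against an expected 15 (all risks 0.15) gives Z of about 1.4, inside the 2 σ limits.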

5.3.3 RA p Chart: Control limits by iterative calculation of the empirical cumulative probability function of the mortality rate, R_i

The probability distribution function of R_i can be calculated by several methods. Approximate

methods include the t distribution used by Alemi and co-workers or the normal distribution that I

have used in Section 5.3.1 and 5.3.2.

Further approaches including exact methods based on the sample π̂_j are described in Appendix 6. An

iterative method can be used to estimate the empirical probability distribution of R_i, given the series of

π̂_j. The observed mortality rate of a sample will correspond to a quantile of the cumulative

probability distribution estimated for R_i. An adaptation of the RA p chart plots these quantile values.

Figure 5.5 shows an example of an empirical cumulative probability function of R_i for a single sample

of 100 patients. The iterative method described in Appendix 6 was used to calculate the distribution.
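Appendix 6 is not reproduced here, but one way to realise such an iterative calculation is a convolution over the death count, built one patient at a time under the same independence assumptions (a Python sketch; the thesis's exact algorithm may differ in detail):

```python
def death_count_distribution(p_hats):
    # exact distribution of the number of deaths among patients with
    # independent risks p_hats, built up one patient at a time
    dist = [1.0]                        # Pr(0 deaths among 0 patients) = 1
    for p in p_hats:
        new = [0.0] * (len(dist) + 1)
        for k, pr in enumerate(dist):
            new[k] += pr * (1 - p)      # this patient survives
            new[k + 1] += pr * p        # this patient dies
        dist = new
    return dist

def lower_percentile(observed_deaths, p_hats):
    # smallest percentile of the range spanned by the observed death count
    dist = death_count_distribution(p_hats)
    return 100.0 * sum(dist[:observed_deaths])
```

For a block of n patients this yields the n + 1 possible mortality rates and their probabilities, from which any quantile can be read off.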

Figure 5.5: Empirical Cumulative Probability Distribution of R_i


An example of 100 cases where each risk of death is estimated by APACHE III

(x-axis: Observed Mortality Rate; y-axis: cumulative probability)

For example, an observed mortality rate of 0.15 corresponds to the range of the 22nd to 34th percentiles of

the empirical cumulative probability function of R_i, for this sample, given the estimated probabilities

of patient death. Less likely mortality rates define smaller spans of the distribution. An observation of

mortality rate of 0.1 corresponds to the range of the 0.2 to the 0.9 percentiles. An observation of

mortality rate of 0.2 corresponds to the range of the 99.7 to the 99.9 percentiles.

Figure 5.6 is a RA p chart of the hospital mortality data presented as percentiles of the empirical

cumulative probability function of R_i. The single value displayed is the smallest percentile in the range

defined by the observed mortality rate. The 97.5th and 2.5th percentiles of the cumulative probability

function of R_i are marked on the chart for reference. These percentiles are comparable to 2 σ control

limits of the RA p chart in Figure 5.1.



Figure 5.6: RA p Chart:


Observations expressed as quantile of cumulative probability function of
expected mortality rate
PAH ICU 1995 - 99 in-hospital mortality, for samples n_i = 100, APACHE III RA
(reference lines at the 97.5th and 2.5th percentiles; x-axis: Block Number, n = 100; y-axis: percentile of the cumulative probability function)

Figure 5.6 shows that the observed mortality rates were high compared to the expected distribution of R_i

in blocks 2, 5, 6, 11, 12 and 18, which all fell above the 97.5th percentile of the distribution. Blocks 48

and 51 fell below the 2.5th percentiles of the distributions of R_i. The overall message from this chart is

similar to those presented for the previous RA p charts. 8 of the first 12 values are above the 95th

percentile of predicted mortality rate and 3 of the last 12 mortality observations are on or below the 5th

percentile of predicted mortality. The RA analysis of the mortality rate demonstrates that the mortality

rate was falling during the period of analysis.

5.4 RA CUSUM
5.4.1 Background to the RA CUSUM

In the cardiac surgical literature, RA charts that plot accumulating expected deaths minus observed

deaths have been called the "Variable Life Adjusted Display" (VLAD) and the "Cumulative RA

Mortality Chart" (CRAM chart). Both applications used recalibrations of the Parsonnet score to

estimate the risk of death of cardiac surgical cases.

There are differences between cardiac surgery and general ICU practice. Firstly, the APACHE III

mortality model has superior performance compared to the cardiac surgical RA models. This may make

existing RA models more practical in ICU. For example the APACHE III ROC curve area is

consistently above 0.88. It exceeds the discrimination of commonly used surgical risk of death estimate

tools like the POSSUM score (ROC area = 0.75) and the Parsonnet score (area under the ROC 0.68 -

0.74). The APACHE III system collects more clinical and laboratory data, and uses clinical

information up to the first 24 hours of admission, in addition to the demographic and diagnostic

information. Secondly, mortality in the general ICU population is higher, being in the range of 0.1 -

0.2. This makes it reasonable to accept normal approximations of a binomial distribution in contrast to

approximations based on the Poisson distribution. Thirdly, cardiac surgery involves a small number

of common surgical procedures with the majority of adult cardiac surgery being elective coronary

bypass and valve surgery. In contrast, in ICU there are 250 diagnostic groups in the APACHE III

system giving a diversity of case mix and relatively few patients in each category. Even the most

common categories of ICU admission groups may comprise only 10% of patients.

With these considerations in mind, the methods of RA CUSUMs developed in the cardiac surgical

monitoring literature are examined.

Lovegrove et al. presented a simple qualitative method of reviewing cardiac surgical performance,

using the Parsonnet method of mortality estimation as a RA tool. It is a plot of the cumulative expected

deaths minus observed deaths. This method can be applied to ICU mortality data using the APACHE

III RA model adjusted for hospital characteristics. The difference between cumulative expected and

observed mortality is plotted on the vertical axis and the patient admission number on the horizontal

axis.

The equation for calculating the statistic for charting is:

$$ C_n = \sum_{j=1}^{n} (\hat{\pi}_j - Y_j) \qquad \text{Equation 5.3} $$

where j indexes the n patients, Y_j = 1 for a death and Y_j = 0 for a survivor, and π̂_j is the RA estimate

of the patient's probability of death.


Figure 5.7 shows a plot of the RA CUSUM for individual patient observations from 1 January 1995 for

the series of 5278 patients. The chart shows an apparent excess mortality (downward slope) in the first

1800 admissions. After admission 2000, there is an upward slope, suggesting an "excess survival"

performance.

Figure 5.7: Cumulative Plot of Expected Deaths minus Observed


Deaths
PAH ICU Hospital mortality 1995 - 1999, Individual cases, APACHE III RA

(x-axis: Admission Number; y-axis: cumulative expected minus observed deaths)

Attempts have been made to calculate the statistical significance of the change in model fit. Poloniecki

et al. developed a plot of expected deaths minus observed deaths, incorporating control limits. The

estimation of cardiac surgical risk and the control limits were provided by a recalibrated Parsonnet

score. The estimate of surgical performance was updated as the monitoring scheme progressed,

comparing observed outcomes to predicted mortality rates. This would correct a poor original estimate,

but would be insensitive to any gradual change in experience. The control limits were designed with

Type I error of 0.01, but were not formal statistical tests, with no allowance made for multiple testing.

It is difficult to know exactly what the control limits mean, though to cross the upper control limit

is bad and fairly unlikely to occur by chance.



5.4.2 Adaptation of Poloniecki's approach

With the preceding reservations considered, I have applied a similar analysis to the PAH ICU mortality

data. The approach uses a moving frame of admissions (100 cases) to estimate the control limits around

the cumulative expected minus observed mortality. Due to the low mortality rates in cardiac surgery,

Poloniecki et al. used a Poisson distribution to develop control limits. For the ICU data, a normal

approximation is used.

The control limits at any admission number, n, are determined by the value of the CUSUM at the

beginning of the block (case n - 99), and the predicted risks of death of the 100 patients in the block.

This is done as a moving frame as individual patients are added to the series. No attempt has been

made to alter the calibration of the APACHE III model.

The adapted Poloniecki approach plots the cumulative statistic C as defined in Equation 5.3. Using a

moving window of 100 cases, the upper and lower control limits after the first 100 cases, at case n, are:

$$ CL_n = C_{n-100} \pm a\sqrt{\sum_{j=n-99}^{n}\hat{\pi}_j(1-\hat{\pi}_j)} $$

Figure 5.8 displays an adapted Poloniecki plot of the PAH ICU data from 1995 - 1999 using 3 σ

control limits.

Figure 5.8: Cumulative Risk Adjusted Mortality Chart with


Control Limits
PAH ICU 1995 - 1999, moving window of 100 cases, a = +/-3

(x-axis: Admission Number; y-axis: cumulative expected minus observed deaths)

This type of approach is not a formal test, and the meaning of the control limits is not certain. Note

however that the RA CUSUM statistic meets the control limits in the same areas where the mortality

rate observed in the RA p charts of the previous section (Figure 5.1) and the Z score control chart

(Figure 5.4) suggested higher mortality rates than expected.

A simulation experiment was performed to examine the performance of the Poloniecki's CRAM chart

in terms of ARL to signal. The ARL was estimated from 5 000 simulations at each value in a range of

altered odds ratio (OR) 0.5 - 2 in increments of 0.1, with control limits at +/- 2 σ and +/- 3 σ. To

simulate the patient casemix for this analysis, samples of 100 cases were randomly drawn with

replacement from the 5278 admission records. Simulated outcomes were allocated as a series of

Bernoulli trials, based on the new OR and the risk of death estimate. To reproduce the effect of

additional cases and the moving window, as each new case and outcome was added, the first case in the

sample was removed. The ARL was the average number of cases added to the moving frame after the

100th case, until the CUSUM statistic crossed the control limits.



Figure 5.9 is a plot of ARL against OR. The ARL under unchanged conditions is 421 (4 months of

PAH ICU admissions). This means that a false alarm under unchanged conditions would be expected,

on average every 4 months. The chart will detect a doubling (ARL = 24, one week) or halving of OR

(ARL = 56, 15 days) quickly. If the control limits are set to +/- 3 σ, the ARLs are considerably longer.

Attention is drawn to the asymmetry of the ARLs, with the maximal ARL for the charts not present at

the OR = 1. This occurs because of the frequency distribution of the π̂_j from the PAH series. It is a

feature of all ICU populations, including PAH ICU, that the majority of the patients have a low risk

of death and the distribution of π̂_j is skewed. If this simulation is repeated with a contrived series of

π̂_j, with a frequency distribution that is symmetrical around a mean value of 0.5, the ARLs of the

chart are maximum at OR = 1. A similar phenomenon is seen with ARL analysis of the RA EWMA

chart of the next section (Figure 5.13). There are inaccuracies in the application of the central limit

theorem to approximate the probability distribution of the patient mortality rates due to the skewed

frequency distribution of π̂_j. These inaccuracies are revealed when the Bernoulli trials are conducted

in the simulation experiments to determine the ARLs. The results of the simulations in Figure 5.9

show that the longest ARLs are for values just a little below OR = 1.

Figure 5.9: Effect of a on ARL of Poloniecki CUSUM


window =100 cases, OR range 0.5-2 in 0.1 increments, a = +/-2
5000 simulations at each point, APACHE III RA

(Curves: +/- 3 σ control limits and +/- 2 σ control limits. x-axis: Changed Odds Ratio; y-axis: ARL, logarithmic scale)

5.4.3 Testing for Change in Odds ofRA Mortality with the Steiner's RA
CUSUM.

Steiner et al. have proposed a RA CUSUM monitoring procedure testing the hypothesis of change

of RA odds of death. It has been applied to neonatal arterial switch operations and to adult cardiac

surgery. The CUSUM procedure can be used to detect, for example, a doubling or halving of

surgical risk. In my own work I have adapted this application of the RA CUSUM to monitoring

outcomes in ICU using the APACHE III model.

The method provides a test of the hypothesis that the RA odds of death are unchanged, against an

alternative hypothesis of a changed OR. An upper RA CUSUM with confrol limits tests the null

hypothesis of unchanged odds of death; the alternative hypothesis is that the OR has increased. A lower

RA CUSUM with confrol limits tests the null hypothesis of unchanged odds of death; the alternative

hypothesis is that the OR has decreased. In this analysis, the RA CUSUM procedure will be designed

to best detect a doubling (or halving) of the OR.

A score (W_j) is given to each patient. It is derived from the log-likelihood ratio of the current risk of

death, compared with the risk of death if the overall level of ICU performance has changed. Under the

null hypothesis, the likelihood for patient j is given by $\pi_j^{Y_j}(1-\pi_j)^{1-Y_j}$ and the odds of death are

$\pi_j/(1-\pi_j)$, while under the alternative hypothesis the odds of death are $OR_A\,\pi_j/(1-\pi_j)$.

Since there are only two possible outcomes (death or survival), the two possible log-likelihood ratio

scores for patient j are given by:

$$ W_j = \log\!\left(\frac{OR_A}{1-\pi_j+OR_A\,\pi_j}\right) \text{ if the patient dies, or} $$

$$ W_j = \log\!\left(\frac{1}{1-\pi_j+OR_A\,\pi_j}\right) \text{ if the patient survives.} $$

For the upper control chart, an upper CUSUM statistic, S_j^+, is plotted against j, where j is the patient

number and S_0^+ = 0:

$$ S_j^+ = \max(S_{j-1}^+ + W_j,\; 0) $$

The RA CUSUM formally tests the null hypothesis, H_0: OR_0 = 1, against the alternative hypothesis, H_A:

OR_A > 1. Successive non-negative values will lead to accumulation of S_j^+ until its value exceeds the

control limit, h^+.

For testing for a change in the RA odds of death where the mortality is falling, the procedure is similar

to the test for an increased OR. For the convenience of plotting both charts on the same figure, the

statistic S_j^- is accumulated as a negative value (or zero) and h^- is a negative value. Thus,

$$ S_j^- = \min(S_{j-1}^- - W_j,\; 0) $$

The OR_A is less than 1 and the control limit is h^-, the value below which S_j^- must fall to give an alert

or alarm.
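The scoring and accumulation can be sketched as follows (Python; illustrative names, with each side reset to zero after a signal, as is done when the charts in this section are restarted):

```python
import math

def steiner_w(p_hat, died, odds_ratio):
    # log-likelihood ratio score for one patient, testing OR = odds_ratio
    denom = 1.0 - p_hat + odds_ratio * p_hat
    return math.log(odds_ratio / denom) if died else math.log(1.0 / denom)

def steiner_cusum(p_hats, outcomes, or_upper=2.0, or_lower=0.5,
                  h_upper=4.5, h_lower=-4.5):
    # combined upper and lower RA CUSUM; each side resets after a signal
    s_up = s_lo = 0.0
    signals = []
    for j, (p_hat, y) in enumerate(zip(p_hats, outcomes), start=1):
        s_up = max(s_up + steiner_w(p_hat, y, or_upper), 0.0)
        s_lo = min(s_lo - steiner_w(p_hat, y, or_lower), 0.0)
        if s_up > h_upper:
            signals.append((j, "increase"))
            s_up = 0.0
        if s_lo < h_lower:
            signals.append((j, "decrease"))
            s_lo = 0.0
    return signals
```

Deaths among low-risk patients contribute large positive scores to the upper side, while long runs of survivors slowly drive the lower side towards h^-.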

The APACHE III prediction of risk of death, π̂_j, is used as an estimate of π_j.

The choice of h^+ and h^- is made by modelling the ARL where the RA outcome of patients has

changed, or remains unchanged, given the set of estimates π̂_j, and the choices of clinically relevant

OR_A. Figure 5.10 shows the ARL using upper and lower RA CUSUM tests based on the log

likelihood together, with the OR_A set at 0.5 and 2, for a range of h^+ and h^-. In this example, h^+ and

h^- have the same absolute magnitude, though for other applications, these decision thresholds will

vary, and may not be equal. The experiment calculated the ARL for RA CUSUM from 10 000

simulations. Each case was drawn at random and with replacement from the 5278 patients in the series.

Each simulated patient outcome was a Bernoulli trial based on π′_j, the altered probability of death.

This analysis is the basis for the choice of chart parameters.

The values in Figure 5.10 differ from those published in Cook et al., reporting an ARL estimate of

5400 based on the patient population from the 1995 - 1997 dataset. The inclusion of the additional

1998 - 1999 cases used in this updated analysis has increased the ARL for monitoring for OR 2, with

h^+ = 4.5, to 5638.

Figure 5.10: Effect of Choice of h on ARL of RA CUSUM


1 sided monitoring for OR = 2, 2 sided monitoring for OR = 0.5, 2
APACHE III RA, 10 000 simulations at each value of h+/-
(x-axis: h; y-axis: ARL)

The relationship between ARL and changes in the OR is shown in Figure 5.10. For this experiment, 3

RA CUSUM monitoring schemes were studied: an upper RA CUSUM testing for OR_A = 2, a lower

RA CUSUM testing for OR_A = 0.5, and a combined upper and lower RA CUSUM with OR_A^upper = 2

and OR_A^lower = 0.5. Initial experiments showed that for the upper RA CUSUM, h^+ = 4.5 gave an in-

control ARL of 5640 admissions. For the lower RA CUSUM, h^- = -4.5 gave an ARL in-control of

6701 cases, or about 6 years.

The effect of changing risks of death by changing the OR was examined by performing 10 000 RA

CUSUMs of each type at each OR value in the range of 0.5 - 2 in increments of 0.1. To simulate

patient outcomes, cases were randomly drawn from the series and outcomes were allocated as

Bernoulli trials where the risk of death was π′_j. The results of this simulation are shown in Figure

5.11.

Figure 5.11: Effect of Changed OR on ARL of RA


CUSUM
upper OR_A = 2, lower OR_A = 0.5, combined upper and lower RA CUSUM OR_A = 0.5 and 2, h+/- = 4.5, APACHE III RA
10 000 simulations at each point
(Curves: ARL Upper RA CUSUM testing for OR = 2; ARL Lower RA CUSUM testing for OR = 0.5; ARL Upper and lower RA CUSUM. x-axis: Odds Ratio; y-axis: ARL, patients)

With unchanged OR = 1, the upper RA CUSUM has an ARL = 5640, and the lower is 6701. The ARL

is 3043 for the combined scheme. As the OR increases above 1, the estimate of the ARL is dominated

by the ARL of the upper RA CUSUM scheme. Similarly, with OR below 1, the ARL is rapidly

dominated by the effect of the lower RA CUSUM scheme. If only increases in RA mortality were of

interest, the specificity of the signal could be increased by using only the upper RA CUSUM. With a

RA CUSUM only monitoring for an increased OR, the false alarm ARL would exceed 5640 cases, or 5

years.

The RA CUSUM rapidly detects decreases in OR. The ARL for OR of 0.7 is 646 and for OR of 0.5 is

244. The RA CUSUM also rapidly detects increases in OR with ARL of 842 with OR = 1.3, ARL of

416 with OR = 1.5 and ARL of 169 with OR = 2.0.

The parameters chosen for the RA CUSUM (upper and lower RA CUSUM with OR_A^upper = 2 and

OR_A^lower = 0.5, control limits h^{+/-} = 4.5) provide rapid detection of clinically important changes in

OR with an acceptably long ARL for in-control or minimally changed OR. If only an increase in OR

was of clinical interest, the in-control ARL could be dramatically increased by only monitoring with

the upper RA CUSUM.

These parameters are used to plot upper and lower RA CUSUM charts for the PAH ICU data 1995 -

1999 in Figure 5.12.

Figure 5.12: Steiner RA CUSUM


PAH ICU In-Hospital Mortality 1995 - 1999
h+/- = 4.5, optimal detection of change OR(lower) = 0.5 and OR(upper) = 2, APACHE III RA,
single cases, * identifies S+ > h+, and # identifies S- < h-
(reference lines at h+ = 4.5 and h- = -4.5; x-axis: Patient Number; y-axis: CUSUM statistic)

Two signals of increased mortality are marked with an asterisk (*). The first occurs when S+ > h+ at

case 533. The upper RA CUSUM S+ is reset to zero and the monitoring is recommenced, with a

further signal at case 1117. The RA CUSUM is reset to zero, and there are no other signals of increased

OR. These signals are good evidence that the OR > 1 in the first 1117 cases. The APACHE III model

underestimated the probability of patient death in this period.

Two further signals are seen on this combined chart. In the latter part of the series, there are signals

marked by a hash (#) where S- < h-. These two signals at cases 4939 and 5181 are good evidence

that the APACHE III estimates are overestimating the risk of patient death.

The RA CUSUM chart analysis provides similar findings to the RA p chart, and the method adapted

from Poloniecki et al.

5.5: RA EWMA Chart


The EWMA chart considered and analysed in the previous chapter plotted an exponentially weighted

average of the mortality rate of individual patient observations. This is an iterative process, where the

EWMA statistic is calculated from the weight, λ, the previous EWMA value or the starting value, and

the patient outcome (Y_j) or mortality rate of the block of patients (R_i). To construct a RA EWMA

chart, it is proposed that the same calculation of the EWMA statistic is performed, but the unique series

of estimates of patient risk of death are used to calculate the expected EWMA statistic and the control

limits.

I propose and describe two methods of calculating the control limits. The first is an approach that uses

the central limit theorem to estimate the distribution of EWMA_j^Y. The second uses an iterative

approach to characterise the expected distribution of EWMA_j^Y. For both approaches, it is assumed that

the RA model provides an accurate estimate of the true risk of death of the patient and that the number

and order of the patients are known.



5.5.1: Estimate of the distribution of EWMA_j^Y using the central limit theorem

For this application, individual patient outcomes (Y_j) will be analysed. The method can easily be

adapted to analyse mortality rates, R_i.

The EWMA statistic of the observed deaths is calculated from the series of observations as described in

Chapter 4:

$$ EWMA_j^Y = \lambda Y_j + (1-\lambda)\,EWMA_{j-1}^Y $$

or

$$ EWMA_j^Y = (1-\lambda)^j EWMA_0 + \lambda\sum_{k=1}^{j}(1-\lambda)^{j-k}\,Y_k $$

The value of the statistic EWMA_j^Y is compared to an estimate of the expected value of the EWMA

statistic, using the series of π̂_j:

$$ E(EWMA_j^Y) = EWMA_j^{\hat{\pi}} $$

which is calculated from

$$ EWMA_j^{\hat{\pi}} = \lambda\hat{\pi}_j + (1-\lambda)\,EWMA_{j-1}^{\hat{\pi}} $$

or

$$ EWMA_j^{\hat{\pi}} = (1-\lambda)^j EWMA_0 + \lambda\sum_{k=1}^{j}(1-\lambda)^{j-k}\,\hat{\pi}_k $$

The control limits are calculated from an estimate of the variance of EWMA_j^Y:

$$ \mathrm{var}(EWMA_j^Y) = (1-\lambda)^{2j}\,\mathrm{var}(EWMA_0) + \lambda^2\sum_{k=1}^{j}(1-\lambda)^{2(j-k)}\,\mathrm{var}(Y_k) $$

This can be simplified if we assume that the starting estimate of the mortality rate EWMA_0 has a

variance of zero:

$$ \mathrm{var}(EWMA_j^Y) = \lambda^2\sum_{k=1}^{j}(1-\lambda)^{2(j-k)}\,\mathrm{var}(Y_k) $$

The estimated standard error of the estimate of EWMA_j^Y is:

$$ \mathrm{se}(EWMA_j^Y) = \lambda\sqrt{\sum_{k=1}^{j}(1-\lambda)^{2(j-k)}\,\mathrm{var}(Y_k)} $$

The control limits can be estimated as:

$$ CL_j = EWMA_j^{\hat{\pi}} \pm a\lambda\sqrt{\sum_{k=1}^{j}(1-\lambda)^{2(j-k)}\,\mathrm{var}(Y_k)} $$

Under the assumption that the model provides an accurate estimate of the patient's true risk of death,

the estimate of the variance of Y_j will be $\hat{\pi}_j(1-\hat{\pi}_j)$. Substituting for var(Y_k),

$$ CL_j = EWMA_j^{\hat{\pi}} \pm a\lambda\sqrt{\sum_{k=1}^{j}(1-\lambda)^{2(j-k)}\,\hat{\pi}_k(1-\hat{\pi}_k)} $$

Two special cases of this formula are apparent. The RA p chart is a special case of the RA EWMA

where λ = 1, and the standard EWMA, without RA, is the case where all the risks of death are the same.

For ease of computation, the estimated variance of EWMA_j^Y can be calculated iteratively in parallel to

the calculation of EWMA_j^Y and EWMA_j^{π̂} as

$$ \mathrm{var}(EWMA_j^Y) = \lambda^2\hat{\pi}_j(1-\hat{\pi}_j) + (1-\lambda)^2\,\mathrm{var}(EWMA_{j-1}^Y) $$
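The three recursions (the observed EWMA, its expectation, and the variance) can be run in a single loop. A minimal sketch (Python; illustrative names, with the starting value assumed to have zero variance):

```python
import math

def ra_ewma(p_hats, outcomes, lam=0.001, a=2.0, start=0.16):
    # returns, for each patient, the observed EWMA and its control limits
    ewma_y = ewma_p = start             # EWMA of outcomes and of risks
    var = 0.0                           # var(EWMA_0) assumed to be zero
    rows = []
    for p_hat, y in zip(p_hats, outcomes):
        ewma_y = lam * y + (1 - lam) * ewma_y
        ewma_p = lam * p_hat + (1 - lam) * ewma_p
        var = lam ** 2 * p_hat * (1 - p_hat) + (1 - lam) ** 2 * var
        half = a * math.sqrt(var)
        rows.append((ewma_y, ewma_p - half, ewma_p + half))
    return rows
```

With λ = 1 the loop reproduces the RA p statistic for single observations, the special case noted above.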

To determine a suitable choice of λ for the RA EWMA in this application, a series of simulations was

conducted. The effect of choice of λ on ARL under conditions of changed OR was investigated using

values of λ of 0.001, 0.002, 0.005 and 0.01, and OR in the range of 0.5 - 2.0 in increments of 0.1. Each

simulated patient was drawn at random from the patient series, and the simulated outcome decided by a

Bernoulli experiment with the probability of patient death being π′_j. The ARL of 5 000 RA EWMA

simulations at each set of values was calculated.

Figure 5.13 shows the effect on ARL of choice of the weight λ over a range of OR. The ARL for OR

close to 1 was much greater for the small λ, though these differences are less apparent at an OR > 1.2.

Figure 5.13: Effect of A on ARL of RA EWMA over a


range of OR
APACHE III RA, λ = 0.01, 0.005, 0.002, 0.001,
OR range 0.5 - 2 in increments of 0.1, 5000 simulations at each point
(x-axis: Odds Ratio; y-axis: ARL, logarithmic scale)

There is again an effect due to the distribution of cases in the PAH dataset and the use of the central

limit theorem. The ARLs of the large values of λ are greater than those of the small values of λ for

small OR. If the ARLs are modelled with a contrived patient sample that has a distribution of π̂_j that

is symmetrical around a mean value of 0.5, this effect could not be seen. The simulations using the

Bernoulli trials bring the inaccuracies of the central limit theorem, which was used to calculate the

control limits, to attention. Exact methods are required to overcome the inaccuracies of these

assumptions.

Figure 5.14 (λ = 0.001) shows a RA EWMA chart of the 5278 cases in the series with control limits set

to +/- 2 σ. There is a sustained period between admission 329 and admission 2183 where the EWMA

estimate of the mortality rate exceeds the upper 2 σ control limits.

Figure 5.14: RA EWMA Chart


PAH ICU In-Hospital Mortality 1995 -1999, individual cases, APACHE III RA
λ = 0.001, a = 2, starting estimate = 0.16
(Lines: EWMA of risk of death; EWMA of Mortality Outcome; Upper Control Limit a = 2; Lower Control Limit a = -2. x-axis: Admission Number)

The same parameter settings are used for the RA EWMA chart in Figure 5.15, where the RA EWMA

chart is restarted when the EWMA estimate of the mortality rate crosses one of the control limits. This

chart shows signals of increased mortality above the upper 2 σ control limits at cases 329, 532, 610,

1033, 1112 and 1188, denoted by the asterisk on the chart. It is good evidence, again, that the mortality

rate observed during the first 1200 admissions in the series was above that predicted by the APACHE

III model. There were 3 instances where the EWMA crossed the lower control limits at cases 4180,

5044 and 5182. This provides good evidence that the observed mortality rate was below that predicted

by the APACHE III model.

Figure 5.15: RA EWMA Chart with Reset After Signal


PAH ICU In-Hospital Mortality 1995 - 1999, individual cases, APACHE III RA
λ = 0.001, a = 2, starting estimate = 0.16
RA EWMA * above the upper control limit, # below the lower control limit
(Lines: EWMA of risk of death; EWMA of Mortality Outcome; Upper Control Limit a = 2; Lower Control Limit a = -2. x-axis: Admission Number)

5.5.2: An iterative approach to the estimation of the distribution of EWMA_j^Y and calculation of control limits.

There is a degree of inaccuracy in the previous asymptotic approximation of the distribution of

EWMA_j^Y. Therefore, in this section I will explore the use of an iterative, exact method to describe the

discrete distribution of EWMA_j^Y.

The probability density function of R_i used for the RA p chart was a multinomial with only n_i + 1

possible values (see Appendix 6) and an iterative calculation is computationally rapid. In contrast,

when an iterative method is applied to describe the distribution of EWMA_j^Y, there is an exponential

growth of the number of possible values of EWMA_j^Y, each with its own probability, which can be

estimated from the ordered series of π̂_j.



An estimate of the probabilities of all possible values of the EWMA_j^Y statistic can be made under the

familiar assumption that the RA tool will accurately predict the patient's true risk of death, and that the

order of admission of each patient is known.

When j = 0, the only possible value is EWMA_0, which has a probability of 1, i.e. Pr(EWMA_0) = 1.

When j = 1, the two possible values of EWMA_1^Y are

(1 - λ)EWMA_0 + λ if the patient dies, and

(1 - λ)EWMA_0 if the patient survives.

So Pr(EWMA_1^Y = (1 - λ)EWMA_0 + λ) = π̂_1   Equation 5.4

Pr(EWMA_1^Y = (1 - λ)EWMA_0) = (1 - π̂_1)   Equation 5.5

At the second patient there are 4 possible outcomes, and probabilities.

From Equation 5.4, if both of the first two patients die,

Pr(EWMA_2^Y = (1 - λ)[(1 - λ)EWMA_0 + λ] + λ) = π̂_1 π̂_2

If the first dies and the second patient survives:

Pr(EWMA_2^Y = (1 - λ)[(1 - λ)EWMA_0 + λ]) = π̂_1 (1 - π̂_2)

Similarly, from Equation 5.5, if both patients survive,

Pr(EWMA_2^Y = (1 - λ)^2 EWMA_0) = (1 - π̂_1)(1 - π̂_2)

If the first patient survives and the second dies,

Pr(EWMA_2^Y = (1 - λ)^2 EWMA_0 + λ) = (1 - π̂_1) π̂_2

This is repeated to produce the distribution of values and probabilities after each patient is added to the

series. There is a maximum of 2^j possible values that the EWMA_j^Y can take. This rapidly becomes an

intractable computing problem. It can be dealt with in several ways.

The first is to resume the asymptotic approximation applying the central limit theorem as in the

previous section when the size of the computing problem becomes too great. However, when

considering ARLs for the RA EWMA in the range of hundreds to thousands in-control, this means that

the approximation will be used in almost all instances. It would make no sense to use the exact method

for those few calculations of control limits before reverting to the approximation.

The second is to adopt a Monte Carlo simulation, using a large number of Bernoulli trials. This is a

practical solution.

The third solution is to limit the accuracy of the exact method by limiting the number of discrete values that EWMA_j can take. In this application, I have chosen this method, and have limited the number of values of EWMA_j to 10^8 values between the largest plausible EWMA_j and the smallest. After each iteration, the values are rounded to the closest of the 10^8 values. This approach is rapid and is accurate to the 8th decimal place, which is far beyond the accuracy of the models that predict the risk of in-hospital death.
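The iteration described above can be sketched as follows. This is an illustrative reconstruction rather than the original implementation: the function name, the reduced grid size and the example risks are my own choices.

```python
def ewma_distribution(risks, lam=0.001, start=0.16, grid=10**6):
    """Iteratively build the discrete distribution of EWMA_j, assuming each
    patient's predicted risk pi_i is the true risk of death. After each
    patient, values are rounded to a fixed grid (a reduced analogue of the
    10^8-point grid described in the text) to keep the support tractable."""
    dist = {round(start * grid): 1.0}  # map: rounded EWMA value -> probability
    for pi in risks:
        new = {}
        for v, p in dist.items():
            x = v / grid
            die = round(((1 - lam) * x + lam) * grid)   # outcome = death
            live = round((1 - lam) * x * grid)          # outcome = survival
            new[die] = new.get(die, 0.0) + p * pi
            new[live] = new.get(live, 0.0) + p * (1 - pi)
        dist = new
    return {v / grid: p for v, p in dist.items()}
```

With two patients this reproduces the four outcome probabilities π_1 π_2, π_1 (1 − π_2), (1 − π_1)(1 − π_2) and (1 − π_1) π_2 given above.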

This method is used to estimate the cumulative probability distribution of EWMA_j. For each patient admission recorded in the series, the observed EWMA_j corresponds to a quantile of the estimated cumulative probability distribution.

Figure 5.16 shows the series of 5278 cases with the EWMA of the observed mortality outcomes, the EWMA of the predicted risk of death, and the upper control limit taken as the value of EWMA_j which defines the 97.5th percentile, and the lower control limit which defines the 2.5th percentile. I took λ = 0.001, and accumulated individual cases. This is an approximation of the ±2σ control limits of Figure 5.13, and the same pattern is apparent.

Figure 5.16 RA EWMA with Iterative Characterisation of Distribution of EWMA_j

PAH ICU In-Hospital Mortality 1995 - 1999, individual cases, APACHE III RA; λ = 0.001, σ = 2, starting estimate = 0.16.

[Chart: EWMA of risk of death, EWMA of mortality outcome, and the upper control limit (σ = 2), plotted against case number 1 - 5278.]

Figure 5.17 provides an additional, visually simple representation of this series, as a plot of the percentile of the estimated cumulative probability distribution of EWMA_j that is defined by the observed value of EWMA_j. A value of 0.025 is approximately equivalent to an observation falling on the lower 2σ control limit of the previous charts. Similarly, 0.975 is approximately equal to an observation falling on the upper 2σ control limit. Perfect agreement between the observed EWMA_j and the centre of its estimated distribution corresponds to a value of 0.50.
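This percentile can be computed directly from the discrete distribution. A minimal sketch, assuming the distribution is held as a mapping from possible EWMA values to probabilities (the function name is mine):

```python
def percentile_of(observed, dist):
    """Cumulative probability at or below the observed EWMA value,
    i.e. the quantile of the observed statistic within its estimated
    distribution. 0.025 and 0.975 approximate the 2-sigma limits."""
    return sum(p for v, p in dist.items() if v <= observed)
```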

The same pattern that was seen in the previous analysis is seen in Figure 5.17. The observed EWMA_j falls on the upper end of the distribution of possible values. Between cases 553 and 2051, all the observed EWMA_j values are beyond the plausible range. This is strong evidence that the APACHE III model was under-predicting the risk of death during this part of the series.

Figure 5.17 Percentile of Cumulative Probability of RA EWMA_j, with Iterative Characterisation of Distribution

PAH ICU 1995 - 1999, individual cases, APACHE III RA; λ = 0.001, starting estimate = 0.16.

[Chart: cumulative probability percentile (0 - 100%) plotted against case number 1 - 5278.]

5.6 Summary and Conclusion.


This chapter has demonstrated a new application of RA control chart analysis to an ICU setting. My work is the first use of the APACHE III score as a validated RA tool to monitor ICU outcome. Each of the methods applied and developed in this section has a method for design and analysis described in the Appendices.

This work is original in its application. It is an adaptation of previously described methods, or presents new work. The RAp chart was developed from the method of Alemi et al., and incorporates a method for analysing the power of each sample, adapted from Flora. The Z score p chart relies on the work of Flora and of Sherlaw-Johnson and colleagues. The use of the iterative approximation of the distribution to derive control charts is an original contribution.

The RA CUSUM is based on the work of Lovegrove et al., and the moving frame approach is adapted from the work on cardiac surgery charting by Poloniecki et al. The use of the RA CUSUM of Steiner and co-workers is an original application in ICU, though the method, apart from the use of the APACHE III, has not been modified.



Both of the RA EWMA charts, the parametric approximation and the discrete approximation of the distribution of EWMA_j, are original contributions.

This chapter is about current RA charting techniques that can be applied to such contexts as the ICU.

The charting methods have an important place in the range of quality measures that are available, and

potentially have wide application.

In the next two chapters ICU data will be modelled again using machine learning approaches (neural

networks and support vector machines) on datasets of a small and practical size for application in a

single ICU.

Chapter 6:

Supervised Machine Learning Techniques with

Application to Intensive Care Mortality Models

In the previous chapters, I have reviewed the techniques for assessment of models that estimate the

probability of intensive care unit (ICU) patient death. With an existing, validated model, I have

developed risk adjusted control charts, applying such models for monitoring mortality in ICU patients.

This chapter will introduce machine learning approaches for development of new models to estimate

the risk of death of ICU patients.

This chapter has three aims:

The first is to describe the machine learning methods of artificial neural networks (ANNs) and support

vector machines (SVMs).

The second is to review applications of ANNs and SVMs in the intensive care unit (ICU) with

particular reference to the estimation of probability of death of ICU patients.

The third is to present preliminary experiments with ANNs and SVMs on the Princess Alexandra

Hospital (PAH) ICU dataset, as a prelude to development of a practical RA tool using raw patient data.

One experimental task is the classification of patients according to 30-day in-hospital death or survival.

The other is a regression problem to estimate the probability of 30-day in-hospital death. The variables

used are pre-processed patient data: APACHE III acute physiology score, APACHE III chronic health

score, patient age and modified APACHE III diagnostic coding. The results are compared to previous

logistic regression results on the PAH ICU dataset.



6.1: Application of Machine Learning to Classification


and Regression Problems to Predict ICU Mortality.

ANNs and SVMs are machine learning approaches to modelling the probability of ICU patient death. It is practical to model ICU patient outcomes in this way because of the availability of powerful desktop computers. A recent review of "Artificial Intelligence in the ICU" covered the topic inadequately, in only six paragraphs. Four of these were introductory definitions and descriptions of the ANN. There is much more to the topic than that review suggested.

Machine learning to model ICU outcome is a supervised learning task. Supervised learning in this application uses training datasets where the patient and diagnostic variables and the resultant patient outcomes are known. Generally, a machine learning algorithm adjusts or optimises aspects of a model to minimise an error function that compares the outcomes in the training dataset to model predictions.

6.1.1 Artificial Neural Networks

ANNs provide a diverse range of models for classification and regression problems. ANNs have architectures of networks of interconnected simple processors. The training process involves learning patterns in the training data, often by modifying the weights that link the units. The following discussion will be limited to the supervised learning case. A more detailed introduction to this topic is available in various references.

The most commonly used example of an ANN is the multilayer perceptron (MLP), which is shown in Figure 6.1.

Figure 6.1: Architecture of 2 Layer Multi-Layer Perceptron

[Diagram: feature values x_1 ... x_n of the input vector enter the input units (left), connect through weights to the hidden units in the hidden layer (centre, o), and then to the output units (right, □).]
This example of a 2-layer MLP has an input layer of input units (left) equal in number to the dimension of the input vector. It has hidden layer units (centre, o) and output layer units (right, □). The data input is an n-dimensional vector, X. In the example, feature values x_1 to x_n are presented at the input nodes. The arrows represent connecting weights. Each input node connects to each node in the layer below. The vector of weights at each node is w. In the hidden layer in Figure 6.1, w will have n elements. The input to each hidden unit is the sum of weighted inputs. This is the dot product of the vectors w and x, i.e. w_1 x_1 + w_2 x_2 + ... + w_n x_n, abbreviated as Σ w·x. The input is processed by a bounded, increasing activation function to produce a non-linear output. A commonly used activation function is the sigmoid function f(Σ w·x) = 1 / (1 + e^(−Σ w·x)).

Figure 6.2 shows three input values and weight vectors processed at the hidden unit. The output signals of the processing units are then connected by weights to the next layer. The next layer in this example is the output layer, though further hidden layers are possible. At the output unit, the sum of the weighted signals is processed into the output signal.

Figure 6.2: Output of a hidden unit is a function of a sum of weighted inputs.

[Diagram: three input values, multiplied by their weights, summed and passed through the activation function at the hidden unit.]
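The forward pass just described can be sketched numerically. This is a generic illustration of a 2-layer MLP with sigmoid activations, not the networks trained in this thesis; the weights and inputs below are arbitrary.

```python
import math

def sigmoid(a):
    # bounded, increasing activation: f(a) = 1 / (1 + e^(-a))
    return 1.0 / (1.0 + math.exp(-a))

def mlp_forward(x, hidden_weights, output_weights):
    """2-layer MLP: each hidden unit passes the weighted sum of the inputs
    (the dot product w.x) through the sigmoid; the output unit does the
    same with the hidden-layer signals."""
    hidden = [sigmoid(sum(w_i * x_i for w_i, x_i in zip(w, x)))
              for w in hidden_weights]
    return sigmoid(sum(w_i * h_i for w_i, h_i in zip(output_weights, hidden)))
```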

Supervised learning proceeds by iterative adjustments to the interconnecting weights of the ANN, to minimise the sum of the squared errors between observed and predicted output values. Initially, weights are set to random values. Subsequent adjustment of the interconnecting weights in the network is by gradient descent down the error surface, seeking a minimum error value. Back propagation is one such algorithm, whereby the weights in the network are adjusted according to the derivative of the error with respect to the weights.
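For a single sigmoid unit with squared error E = (y − f(w·x))^2 / 2, this derivative has a closed form. A minimal sketch of one gradient-descent weight update follows; the learning rate and data are illustrative, not values used in this thesis.

```python
import math

def gradient_step(w, x, y, eta=0.5):
    """One gradient-descent update for a lone sigmoid unit.
    dE/dw_i = -(y - out) * out * (1 - out) * x_i, so each weight moves
    a small step down the error surface."""
    out = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
    delta = (y - out) * out * (1.0 - out)
    return [wi + eta * delta * xi for wi, xi in zip(w, x)]
```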

MLP ANNs have a number of potential issues which are common to many numerical optimisation procedures. For example, the gradient descent algorithm can become trapped in local error minima, rather than proceeding to the optimal solution. The MLP can produce different training results depending on the starting weights, by terminating at different local error minima. As well, MLPs are prone to over-fitting. It is usual practice to monitor performance on an external verification dataset during training. The training process is terminated when the generalisation performance begins to deteriorate. (In addition, a test dataset is used to assess the model performance; see below.) Whilst there are strategies to limit the problems caused by these issues, it is important to train many MLPs and to choose the best ANN for the application at hand.

In the ANN experiments in this chapter, the available data were divided into three sets for each trial. A training set was used to train the ANN. A verification set was used to monitor the generalisation performance so that training could be stopped. A test set was used to assess the model performance on data not used for model development.

The radial basis function (RBF; Figure 6.3) network is another example of the multilayer ANNs used in this study.

Figure 6.3: Architecture of Radial Basis Function Network

[Diagram: feature values x_1 ... x_n of the input vector are transferred from the input units to the radial units in a single hidden layer; a weighted sum of the radial-unit outputs feeds the output units.]

As with the MLP, the input vector X is presented at the n input nodes, and each input is connected to the hidden layer. However, there is only a single hidden layer of processing units. The inputs to the hidden layer are transformed using a function with radial symmetry, such as a Gaussian function. The output from each unit in the radial layer depends on the distance of the feature values from the centre of the function for that unit. The sum of the weighted signals of the radial layer is processed with a linear activation function at the output units. The weights between the radial layer and the output layer are adjusted by the learning algorithm. The RBF ANN is rapidly trained, and is useful for modelling continuous mappings of input features onto outputs.
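A sketch of the RBF forward computation: Gaussian radial units respond to the distance of the input from their centres, and a linear output unit sums the weighted responses. The centres, width and weights below are arbitrary illustrations, not fitted values.

```python
import math

def rbf_forward(x, centres, width, out_weights):
    """Each radial unit transforms the squared distance of the feature
    vector from its centre through a Gaussian; the output is the
    weighted sum, passed through a linear activation."""
    phi = [math.exp(-sum((xi - ci) ** 2 for xi, ci in zip(x, c))
                    / (2.0 * width ** 2))
           for c in centres]
    return sum(w * p for w, p in zip(out_weights, phi))
```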

The generalised regression neural network (GRNN), shown in Figure 6.4, is similar to the RBF ANN. It is used to recalibrate the mortality predictions in the regression experiments in this chapter. It is a statistical method of function approximation, and in this application provides a Bayesian, kernel-based estimate of the risk of death. The GRNN was developed by Specht, based on Parzen. Figure 6.4 (based on Figure 1 in Specht) shows the architecture of the GRNN.

Figure 6.4: Architecture of GRNN

[Diagram: feature values of the input vector enter the input units, connect through weights to the radial pattern units in the first hidden layer, then to the summation units, and finally to the output unit.]

Figure 6.4 shows there are input units for each feature dimension of the input vector. Again, each input unit connects to all of the radial units in the first hidden layer. These pattern units are each dedicated to a training example, or represent clusters of similar training data. The pattern units have a Gaussian function which transforms the distance of each feature value from the pattern represented by each unit, similar to the RBF processing units. A smoothing parameter determines the shape of the radial function, and the overlap between functions. The second hidden layer has summation units which process the weighted signals from all the pattern units. The output node provides estimates of the mean risk of death of a patient with the input pattern x_1 ... x_n.

The GRNN uses one training epoch to set the weights, and can be trained very quickly. Although there are architectural similarities to the RBF ANN, the GRNN has no iterative algorithm of weight adjustment after all the training examples are incorporated.
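The GRNN output is a kernel-weighted mean of the training outcomes (the Nadaraya-Watson form underlying Specht's formulation). A minimal sketch, with the smoothing parameter sigma and the data purely illustrative:

```python
import math

def grnn_predict(x, train_x, train_y, sigma=1.0):
    """Pattern units: Gaussian of the squared distance from each training
    example. Summation units: weighted sum of outcomes and of weights.
    Output: their ratio, an estimate of the conditional mean outcome."""
    weights = [math.exp(-sum((xi - ti) ** 2 for xi, ti in zip(x, t))
                        / (2.0 * sigma ** 2))
               for t in train_x]
    return sum(w * y for w, y in zip(weights, train_y)) / sum(weights)
```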

6.1.2 Review of the Use of ANN in the Intensive Care Unit.

ANNs are widely applied in medicine. Non-ICU applications include diagnosis, prognosis, physiological and laboratory data interpretation, and pharmacology. There are a number of reports describing ANNs used to model ICU patient data and outcomes. The following review summarises the use of ANNs in the ICU from the perspective of predicting mortality or resource use.

Two studies have examined cardiac surgical subsets of ICU patients. Lippmann and Shahian used an MLP ANN, logistic regression and Bayesian analysis to model the survival outcomes of 80 606 cases. Fifty-nine demographic, physiological, laboratory, diagnosis and cardiac assessment features were collected, of which 36 were used in the model. The calibration of the logistic regression model was the best, but the model had an area under the ROC curve of only 0.76. Orr reported an ANN to estimate the risk of death in cardiac surgical patients using only 7 variables selected from a patient database. This model had good calibration, but lacked discrimination, with an area under the ROC curve of only 0.74.

Buchmann et al. compared a logistic regression model, MLP, GRNN and a probabilistic neural network to classify ICU patients on the basis of chronicity (length of stay in ICU > 7 days), rather than to predict mortality. He found that the discrimination and calibration of the ANNs were superior to logistic regression. Another study of prediction of resource use, as measured by hospital length of stay, was published by Mobley et al. An ANN was trained to estimate the hospital length of stay of 557 coronary care patients using 74 variables, including demographic characteristics, physiological observations, laboratory results, diagnostic tests and index events. The ANN was able to predict length of stay within 24 hrs in 72% of patients. However, this was only marginally better than using the mean length of stay of each diagnostic class in the dataset.

Doig et al. compared the predictions of an ANN to a logistic regression model for the classification of a small series of 422 patients. Each patient had already survived to 72 hours in ICU. Variables were selected from the APACHE II system and re-modelled. Remarkable discrimination was achieved on the training set (area under the ROC curve 0.99). The discrimination was poorer on the validation set (area under the ROC curve 0.82), suggesting overtraining of the ANN, with overfitting to the training data at the expense of generalisation performance.

Dybowski et al. compared an ANN trained with a genetic algorithm to a logistic regression model for predicting the outcomes of a small subset of ICU patients (258 patients) with systemic inflammatory response syndrome. A classification tree and logistic regression were used to select variables from physiological and demographic variables, and index events that occurred during the hospital stay. The ANN had better discrimination than logistic regression (area under the ROC curve 0.86 v 0.75). No assessment was made of the calibration of the models.

Frize et al. trained MLPs to classify non-operative (608) and operative (883) ICU patients according to the predicted duration of mechanical ventilation. These models were only assessed on predictions for the developmental dataset. By pruning the number of input features from 51 to 6, classification performance improved and the network complexity was reduced. This study demonstrated a practical approach to limiting network complexity and provides useful documentation of a successful approach to processing and transformation of patient data. However, the ANNs in this study were designed to estimate resource use rather than to predict mortality outcome. A weakness of the study was that the models were not assessed on a separate test dataset, so no conclusions can be drawn about the reproducibility of the modelling, or about the models' generalisation performance.

Two studies compared ANNs to the APACHE II system. Wong and Young compared MLPs to the APACHE II system on 8796 patient admissions collected for an APACHE II database. Both the MLP ANN and the APACHE II system had similar discrimination (area under the ROC curve 0.82 - 0.84) and calibration. Nimgaonkar et al. also compared the performance of ANNs to the APACHE II system to predict mortality in an Indian ICU. A series of 2962 cases were modelled using the input variables for the APACHE II system. They analysed the contribution of each of the features to the models. Discrimination by the ANN was superior to the APACHE II (area under the ROC curve 0.88 v 0.77). The ANN displayed better calibration than the APACHE II model.

One of the most relevant studies is by Clermont et al., who used logistic regression and an ANN to model hospital mortality outcome in 1647 ICU patients. The demographic and physiology variables were collected under the rules of the APACHE III system. It is important to examine this study in some detail, as the modelling context is very similar to the task of modelling patient outcomes at the PAH ICU. In Clermont's study, the component variables of the APACHE III model and the APACHE III score were used to model the probability of patient death. The areas under the ROC curves were in the range of 0.80 (logistic regression) to 0.836 (ANN with coded APACHE III observations). All the models had reasonable calibration when 800 or more cases were used to develop the model.

The ANN and the logistic regression models were able to successfully predict ICU patient death. A further important conclusion was that both the logistic regression and ANN model performance deteriorated when the model development set size was reduced below 800 cases. In a real application, at the PAH ICU, this requires 9 months of patient data collection for model building. For smaller ICUs, it may represent 2 - 3 years of data collection.

Equally important was the authors' practical choice of 447 cases for the test dataset for model assessment. The size of this assessment set is a trade-off between the practicality of collecting patient data and the important statistical issues of the power and precision of the model assessment. For the PAH ICU and other busy ICUs, 447 patients requires about 3 - 4 months of data collection. For smaller units, a year may be required just to collect enough patient data for model assessment. Balanced against this is the issue of smaller assessment datasets giving low statistical power to the Hosmer-Lemeshow C (H-L C) test to detect imperfect calibration, and the loss of statistical precision in estimating the area under the ROC curve. Clermont et al. made their choice based on a timed period of data collection. The size of the dataset they used strikes a reasonable compromise between the time for data collection and statistical issues.

There are limitations to Clermont's study. As Paetz commented in a letter to the Editor in a subsequent edition of Critical Care Medicine, the study did not involve re-sampling to demonstrate the robustness or reproducibility of the approach. It is possible that the non-random split of cases into the training and validation sets was a major determinant of the models' reported performance. Replicates of the modelling on alternative random data selections, and re-sampling to provide alternative assessment sets, are necessary to demonstrate consistency of the modelling approach. I would add to these criticisms that the authors used consecutive patients to build the models. The last 447 consecutive cases in the dataset were used to assess all the models. Even when smaller training sets were explored, non-random, consecutive sampling was used. Introduction of bias into the model, the effects of influential outliers, or fortuitous sampling cannot be excluded with their methodology.

There are further limitations to their techniques. The patient's diagnosis or diagnostic coding was not included in the variables for the model. Also, the authors relied on the APACHE III algorithm for the weights that they used to pre-process the variables for both the MLP ANN and the logistic regression model. These APACHE III weights are added together to give the APACHE III score, which the authors also selected as a variable in some models. The lack of a diagnosis variable and the reliance on the APACHE III system may have limited the quality of the models that could be built.

The authors did not record how many ANNs were trained to yield the optimal-performance ANN, so it is not clear whether a limited or an exhaustive survey of possible ANNs was conducted.

In summary, many applications of ANNs to prediction of ICU mortality have been published. Overall, the performance of ANNs on ICU mortality prediction tasks appears as good as or better than logistic regression, on the datasets on which the models have been developed. However, no meaningful conclusions can be made about the generalisation of these ANN models outside the context where each was developed.



6.1.3 Support Vector Machines

The SVM is based on work by Vapnik, using a linear machine learning algorithm to model a projection of the data into multi-dimensional feature space. Whilst the example data may not be linearly separable in input space (its raw form), the attributes mapped by a kernel function into feature space may be separable. A useful definition of SVMs is "...learning systems that use a hypothesis space of linear functions in a high dimensional feature space, trained with a learning algorithm from optimisation theory, to implement a learning bias derived from statistical learning theory...".

There are several introductory references to the theory of the SVM. The following description draws heavily on these references, and is not meant as a mathematically detailed account of the theory. Enough detail is provided to explain the basis for the experiments conducted in this and the next chapter.

For classification problems, SVMs are trained to define the position of an optimal separating hyperplane between classes. For regression problems, linear learning algorithms model a non-linear function (a regression "tube") in feature space. The term "support vector" describes how only the most informative patterns in the data are used in defining the model.

Linear machine learning routines can be applied to a fixed, non-linear mapping of the data vectors in feature space because the numerical operations required to perform the minimisation and learning procedure can be evaluated efficiently. This is a tractable problem, as the complexity of the calculations does not increase with the dimension of feature space.

A highly curved multi-dimensional model hypothesis might also increase the risk of over-fitting the data. Overfitting is a particular issue where there is noise in the data used for regression modelling, and where there is overlap of classes in classification problems. Several factors can be controlled during the learning process to limit the complexity of the model and promote generalisation of the SVM model. The learning algorithm minimises an error function within constraints of the model. Lagrange multipliers are included in the function to be optimised, presenting a quadratic optimisation problem with a convex error surface without local minima.



SVM Classification.

The SVM algorithm defines a hyperplane to minimise the number of classification errors on the training dataset, and maximise the margin between the two classes (see Figure 6.5).

Figure 6.5 shows a simple two-class classification problem of stars and circles (based on Figure 1.1 in Scholkopf et al. (1999)). The data input pattern is the vector x. The class labels are Y = +1 for circles and Y = −1 for stars, and the examples in this dataset are linearly separable in input space. The dashed lines define a margin, enclosing all possible separating hyperplanes, where (w·x) + b = 0. w is a weight vector normal to the hyperplane, and b is a constant. There are no points that lie between the dashed lines.

Figure 6.5: Simple classification of stars and circles, showing the margin either side of the optimal hyperplane.

[Diagram: circles lie on the side where (w·x) + b = +1, stars on the side where (w·x) + b = −1; the optimal hyperplane (w·x) + b = 0 is drawn in bold between the dashed margin lines, at a distance −b/|w| from the origin.]

A decision function f(x) = sign[(w·x) + b] allows the maximal separating hyperplane (in bold) to be defined, by minimising ||w||²/2 subject to

Y_i[(w·x_i) + b] ≥ 1, where i = 1...l.    Equation 6.1

Lagrange multipliers (α_i, i = 1...l) are introduced to give an expression, L, that is readily differentiated. The optimal solution is found by minimising L with respect to w and b, and by maximising L with respect to the dual variables α_i, to find the "saddle point". The conditions of Equation 6.1 allow w and b to be eliminated, and the solution then is to maximise

L = Σ_{i=1..l} α_i − ½ Σ_{i=1..l} Σ_{j=1..l} α_i α_j Y_i Y_j (x_i·x_j)    Equation 6.2

The solution can be expanded in terms of only the training vectors which have non-zero Lagrange multipliers, α_i ≠ 0. These are the support vectors, which will lie on the dashed lines in Figure 6.5. The other training examples are not required to define the optimal separating hyperplane.

Where the data are not linearly separable in input space, the problem can be solved by mapping the data into a multidimensional feature space where the decision function may be a linear function. For the learning task, Equation 6.2 can be rewritten using the dot product of a mapping function, Φ(), of x_i and x_j:

L = Σ_{i=1..l} α_i − ½ Σ_{i=1..l} Σ_{j=1..l} α_i α_j Y_i Y_j Φ(x_i)·Φ(x_j)    Equation 6.3

A kernel function, K(), that can implement the dot product in Equation 6.3, is used:

L = Σ_{i=1..l} α_i − ½ Σ_{i=1..l} Σ_{j=1..l} α_i α_j Y_i Y_j K(x_i, x_j)

For the non-separable case, a "soft margin hyperplane" is sought, by introducing positive slack variables, ξ_i, to relax the constraints on the optimal hyperplane in Equation 6.1:

Y_i[(w·x_i) + b] ≥ 1 − ξ_i

A regularisation constant, C, is used to assign a cost to training errors that are introduced by the slack variables. This parameter C is the upper bound on the Lagrange multipliers, and will limit the influence of any single training vector. It will limit the complexity of the model and will influence the balance between overfitting and generalisation.
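The resulting classifier evaluates the kernel expansion over the support vectors only, i.e. f(x) = sign(Σ α_i Y_i K(x_i, x) + b). A sketch with an RBF kernel; the support vectors, multipliers and gamma below are hand-picked for illustration, not the output of a solver.

```python
import math

def rbf_kernel(a, b, gamma=1.0):
    # Gaussian (RBF) kernel on the squared Euclidean distance
    return math.exp(-gamma * sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def svm_decision(x, support_vecs, labels, alphas, b, gamma=1.0):
    """Only training vectors with non-zero Lagrange multipliers (the
    support vectors) contribute to the decision function."""
    s = sum(a * y * rbf_kernel(v, x, gamma)
            for a, y, v in zip(alphas, labels, support_vecs))
    return 1 if s + b >= 0 else -1
```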

SVM Regression

For regression tasks, the SVM algorithm is modified. Y_i is a real number representing the outcome, and x_i is an input vector of n elements, for patient i. For classification, it was useful to use the concept of the margin in which the optimal separating hyperplane lies. For regression, it is useful to consider a tube in feature space, in which the optimal regression function lies. The linear regression function, f(x) = (w·x) + b, is estimated by a tube of radius ε around the regression function, using the ε-insensitive loss function (Figure 6.6).

To estimate the linear regression it is necessary to minimise

||w||²/2 + C Σ_{i=1..l} |Y_i − f(x_i)|_ε

Figure 6.6 shows the concept of ε, the precision with which the regression function is modelled. It is based on Figure 1.4 of Scholkopf et al.

Figure 6.6: ε, the regression tube precision (or width).

[Diagram: the regression function f(x) = (w·x) + b, with a tube of radius ε drawn around it.]

There are two types of slack variables, to account for positive and negative deviations from the tube (ξ_i, ξ*_i ≥ 0):

[w·Φ(x_i) + b] − Y_i ≤ ε + ξ_i

Y_i − [w·Φ(x_i) + b] ≤ ε + ξ*_i

To estimate the regression function, with precision ε, the optimisation process is then to minimise

||w||²/2 + C Σ_{i=1..l} (ξ_i + ξ*_i)

Generalisation to non-linear regression requires the use of a kernel function, and again optimisation of the loss function is done using Lagrange multipliers, α_i and α*_i. The variables w and b are eliminated, and the Lagrange multipliers are maximised subject to the constraints

Σ_{i=1..l} (α_i − α*_i) = 0

0 ≤ α_i, α*_i ≤ C

The support vectors are defined by those training data points where one of the Lagrange multipliers is not zero. Analogous to the classification example, the support vectors that define the regression function will lie on the surface of the regression tube.
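The ε-insensitive loss that defines the tube penalises only deviations larger than ε; a minimal sketch, with illustrative values:

```python
def eps_insensitive_loss(y, f, eps):
    """|y - f|_eps: zero inside the regression tube of radius eps,
    linear in the deviation beyond it (the slack xi or xi*)."""
    return max(0.0, abs(y - f) - eps)
```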

It is not necessary that a value for b is chosen in advance, only whether b = 0 (unbiased hyperplane or regression tube) or b ≠ 0 (biased hyperplane or regression tube). For the experiments described in this chapter, I have used a biased hyperplane, b ≠ 0.

Selection of Kernel and SVM Parameters

The choice of the SVM kernel and the SVM parameters will determine the quality of the predictions of the trained SVM model. The SVM kernel function is chosen a priori. The choice of SVM parameters must be experimentally confirmed by cross validation. From my experience in this application, this experimentation, in the absence of knowledge of the likely values, can require hundreds of hours of computing time to explore and select the best SVM. However, once suitable parameters are chosen, SVM training and assessment takes seconds to minutes.

A proposal is offered by Mattera and Haykin for the choice of C for the RBF SVM. A value for C that is approximately equal to B, where the range of predictions lies between 0 and B, will balance the approximation error and the complexity of the model. For the RBF SVM, they advise that the value of ε which leads to about 50% of the training data points becoming support vectors is a robust balance between the approximation error and the complexity and computing time. This is given by the authors without experimental support, and these choices require experimental confirmation in any application.

6.1.4 Use of SVMs in the ICU

In medicine, SVMs may have application in classification problems such as prediction of mortality, use

as a diagnostic support tool for interpreting patterns from physiology observations, investigation results

and laboratory tests, and regression problems such as the estimation of resource use, length of stay and

risk of death.

There is only a single application of SVMs in the ICU, appearing in two publications. Morik et al. present the use of SVMs to assist the development of rules for changes in haemodynamic therapy in the ICU by emulating physicians' choices. The studies by Morik and co-workers use automated physiological data collection and time series analysis to pre-process the raw patient data. Therefore, the methods are of limited relevance to the task of modelling 30 day in-hospital mortality at the PAH ICU.

There are no studies that investigate the use of SVMs in predicting mortality, or in modelling outcome or severity of illness in the ICU.

6.2 Classification and Regression Modelling to Predict ICU Patient In-Hospital Mortality Using ANNs and SVMs
The experiments described in this chapter are an initial exploration of the use of SVMs and ANNs for

classification and regression problems in the prediction of 30 day in-hospital mortality in PAH ICU

patients.

The first objective was to demonstrate the use of SVMs and ANNs to accurately classify patients into 30 day in-hospital survivors and non-survivors. The data used for modelling were collected in the first 24 hours in ICU. Both the SVM and the ANN are compared to other more familiar modelling techniques, such as logistic regression and classification trees, as well as the APACHE III model.

The second objective was to demonstrate the use of SVMs and ANNs to estimate the probability of patient death using data collected from the first 24 hours in ICU.

This series of experiments is a prelude to building a local machine learning model for monitoring unit

mortality.

6.2.1 Criteria for Choice of Models

The model performance was assessed on unseen data (test set), which was either that portion of the patient sample not used for model development (training set) or a randomly selected sample of these patients.

To demonstrate that the techniques are reproducible on the dataset, multiple trials of modelling with random selections of development set data were undertaken with the ANN and SVM experiments. For model assessment, the two criteria were discrimination, measured by area under the ROC curve, and calibration, assessed by the Hosmer-Lemeshow (H-L) C statistic. I have extensively reviewed these model attributes in Chapter 2. Support for their use to assess models developed by machine learning can be found in the review by Ripley. In the following experiments, these attributes are proposed as summary indices of generalisation performance for comparison of models and to define minimum model performance criteria, prior to use in RA outcome monitoring.

For classification tasks, classification accuracy is assessed as the number of cases correctly classified

divided by the total number of cases in the sample: (correct classification rate: CCR). Discrimination is

assessed by area under the ROC curve. For classification models, all of the test data available was used

to assess the CCR and area under the ROC curve.
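These two criteria can be computed directly. A minimal illustration (not the thesis code), with the ROC curve area computed via its Mann-Whitney rank formulation:

```python
# Correct classification rate (CCR) and area under the ROC curve.
# The AUC is the probability that a randomly chosen positive case
# scores higher than a randomly chosen negative case, ties counting half.

def ccr(labels, predicted):
    """Correctly classified cases divided by total cases."""
    return sum(l == p for l, p in zip(labels, predicted)) / len(labels)

def roc_auc(labels, scores):
    """P(score of a positive > score of a negative), ties count half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: 1 = death, 0 = survival.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
preds = [1 if s >= 0.5 else 0 for s in scores]
```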

For the regression task, the criteria for quality of the probability estimate are the model discrimination

and calibration. In order for the regression model to be a valid estimate of the risk of in-hospital death,

it must effectively rank patients in order of risk of death as well as demonstrate reliable probability estimates, i.e. calibration. Based on my review of ICU model discrimination, it is reasonable to expect models developed on the first 24 hours of ICU data to exhibit an area under the ROC curve > 0.8.

Calibration is assessed by the H-L C statistic. A large statistic value would indicate a significant departure from perfect calibration, though the power of this test is dependent on the sample size. Using a test sample size of 400 cases is a balance between the power of the statistical test and the practicalities of delays in data collection. 400 patients as a model validation set is an arbitrary choice, though it is in accord with the sample size of 447 patients used by Clermont et al. The H-L C statistic is distributed as a chi-squared with 8 degrees of freedom, and the critical value (p < 0.05) = 15.5. Based on this criterion, the aim for new ICU mortality models that estimate risk of death will be to attain an H-L C < 15.5, on a random selection of 400 cases.
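As an illustration of this criterion, a minimal sketch of the H-L C statistic over ten equal-sized groups formed by deciles of predicted risk; the function name and the simulated data are mine, not the thesis code:

```python
# Hedged sketch of the Hosmer-Lemeshow C statistic: sort cases into 10
# equal-sized groups by predicted risk and compare observed vs expected
# deaths in each group. With 8 degrees of freedom, the p < 0.05
# critical value is about 15.5.
import random

def hosmer_lemeshow_c(probs, outcomes, groups=10):
    order = sorted(range(len(probs)), key=lambda i: probs[i])
    size = len(order) // groups
    stat = 0.0
    for g in range(groups):
        idx = order[g * size:(g + 1) * size] if g < groups - 1 else order[g * size:]
        obs = sum(outcomes[i] for i in idx)       # observed deaths
        exp = sum(probs[i] for i in idx)          # expected deaths
        n = len(idx)
        pbar = exp / n
        if 0 < pbar < 1:                          # skip degenerate groups
            stat += (obs - exp) ** 2 / (n * pbar * (1 - pbar))
    return stat

# Simulated, perfectly calibrated predictions on 400 cases: the
# statistic should then behave like a chi-squared(8) variable.
random.seed(1)
probs = [random.random() for _ in range(400)]
outcomes = [1 if random.random() < p else 0 for p in probs]
c_stat = hosmer_lemeshow_c(probs, outcomes)
```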

The area under the ROC curve and the H-L C statistic were calculated for each of 100 re-sampling

trials for each model constructed, with data not used in the model development set. The discrimination

for each regression model was the average of the area under the ROC curve of the re-sampling trials.

The calibration was assessed with the H-L C statistic as the average of the statistic calculated from the

100 re-sampling trials.

6.2.2 Software

For the experiments described in this chapter, Statistica 6.0 was used for data pre-processing. The ANN programmes were run using Statistica Neural Networks.

The SVM program used was SVMlight Version 5.00 (3 July 2002), written by T. Joachims. It is downloadable with documentation at http://svmlight.joachims.org/. SVMlight makes the quadratic optimisation problems presented by SVMs with large datasets tractable in several ways. Subsets of the data are chosen and worked with in turn. The size of the optimisation problem is reduced by "shrinkage", the elimination of data points that are unlikely to become support vectors, and so unlikely to participate in the model solution. The optimisation process terminates at a predetermined acceptable accuracy: the termination error. The shrinkage parameter for SVMlight defines the number of iterations for which a variable must remain optimal before shrinkage, and was set to 100 (default value). The criterion for termination was an allowable error of 0.001 (default value). Following exploratory examination of changes in the shrinkage and termination criteria, no changes to these programme parameters were made.

The SVM models require the choice of a kernel function and of the SVM parameters C and ε (for regression problems). All SVM experiments used a biased hyperplane, and solutions were not constrained to pass through the origin.



In the initial experimentation, the value of C was the programme default,

C_default = 1 / mean(||x_i||²)

where the x_i are the input vectors and ||x_i|| is the 2-norm of x_i.

For these preliminary regression problems, the default value for ε was 0.1.

An SVMlight 5.0 interface using Matlab 6.5 was adapted from a freeware interface for SVMlight 4.0 by A. Schwaighofer, which is downloadable with documentation at http://www.cis.tugraz.at/igi/aschwaig/software.html.

The output and performance of the models were analysed with Matlab 6.5 and Statistica 6.0.

6.2.3 Data Summary

There were 5278 admissions to the PAH ICU between 1 January 1995 and 31 December 1999. The data were randomly split for each individual ANN or SVM that was trained, and the test set was sampled for each assessment trial. To prevent over-fitting during ANN training, 50% of the dataset not used for model training was randomly sampled as a verification set.

Outcome definition

The outcome of interest is in-hospital mortality at 30 days after ICU admission. Deaths are defined as

patients who died in hospital within 30 days. Those patients who survived to discharge within 30 days

or who were still in-patients in the hospital at 30 days are defined as survivors. This is different from

the mortality outcome that has been previously used in modelling ICU patient mortality, where survival

means survival to hospital discharge, irrespective of the duration of hospitalisation.

Statistics of mortality at a fixed time have advantages over mortality status at hospital discharge. Thirty-day outcome is useful for contemporaneous audit of ICU mortality, as all patient outcomes can be accounted for at 30 days and a complete data set can be analysed. There is no requirement to wait until all patients have been discharged from the hospital, nor any need to analyse incomplete data sets in which deaths are over-represented. This "closure of the books" is necessary so that quality audit and RA control charting may be performed with reproducible methodology in a timely manner. The use of a fixed endpoint such as 30 day mortality may facilitate the comparison of outcomes between institutions by reducing variability that may result from differing discharge policies.

Figure 6.7 presents a schematic representation of the relationship between the ICU patients who

survived, and the deaths according to timing and site of death. The in-hospital 30 day mortality (C) is a

subset of the total in-hospital deaths (B) and also a subset of the 30 day mortality and the patients who

die in-hospital, at home or at another institution (C+D). Most of the in-hospital deaths occur in the first

30 days, but there will be a small group of patients who were discharged from hospital within 30 days and who subsequently died (D). These patients may have been discharged home or to a palliative care facility with an expectation that they would die, or transferred to another hospital, or discharged home and re-admitted to die later. These patients will not register as deaths. Such patients comprise only a very small number of all the patients admitted to the ICU, and resource limitations precluded follow-up of all ICU patients discharged from hospital.

For the classification task the outcomes were coded as -1 for death, and +1 for survival to provide class

separation. For the regression task, the coding was +1 for death and 0 for survival, and the models

aimed to provide a continuous estimate of the probability of the outcome.



Figure 6.7: All ICU patients, and subsets of patients who died, to show the relationship between 30 day mortality and in-hospital mortality.

A: All patients who were admitted to the PAH ICU
B: All patients who died at any time in hospital following admission to the PAH ICU
C: Patients who died in hospital within 30 days of admission to the PAH ICU
D: Patients discharged from hospital who then subsequently died within 30 days of admission to the PAH ICU

Variable Selection and Pre-processing


For preliminary experiments, there is a considerable advantage in modelling features that have been demonstrated to work elsewhere. The use of data drawn from an APACHE III database has precedent in Clermont et al. I have performed modelling in collaboration with Petra Graham with the variables used in this preliminary study, using logistic regression and classification trees. This work supported the choice and pre-processing of variables used in the preliminary machine learning models developed in this chapter.

The features used are a severity of illness measure (the acute physiology score from APACHE III) and measures of physiological reserve (age and co-morbidity score). The disease process is coded with an ordered category to capture the effect of the patient's disease, accident or surgery.

The contribution to the chi-squared statistic made by each variable can be measured when that variable is added to a model containing the other variables. This procedure was used to gauge the importance of each of the variables. On the APACHE III developmental database, the relative contributions to the model were acute physiology 73.1%, the disease or diagnosis 13.6%, age 7.3% and chronic health items 2.9%. Another recent study by Johnson et al. modelling ICU survival observed similar contributions to the model performance: physiology 67.7%, diagnosis 17.7%, co-morbidities 8.4% and age 4.0%. On the PAH ICU dataset, using logistic regression, Petra Graham and I describe the relative contributions as APACHE III score 77% (or acute physiology score 71%), diagnostic group code 12-13%, and age 1-8%.

With knowledge about the importance of variables in ICU populations generally, and in this sample

from PAH ICU particularly, four input variables were chosen. The APACHE III acute physiology

score was chosen to reflect the amount of physiological disturbance in the first 24 hours in ICU. The

diagnostic category was used to capture the influence of the patient's diagnosis, procedure, or surgery.

The patient's physiological reserve was represented by the age and chronic health score component of

the APACHE III score.

For each variable, the area under the ROC curve was used to assess the variable's ability to

discriminate between the deaths and survivors, prior to inclusion in models.

1. Acute Physiology Score

The acute physiology component of the APACHE III score was calculated according to the data collection rules and definitions of APACHE III. It is a sum of scored components of the worst recorded observations and laboratory values from the first day of ICU admission. Neurological abnormalities, temperature, blood pressure, heart rate, respiratory rate, mechanical ventilation, and urine output are scored according to the extent of deviation from a midpoint "normal" physiological value. The blood chemistry measures creatinine, white cell count, haematocrit, albumin, bilirubin, glucose, sodium and urea are similarly scored according to deviance from a normal value. The sum of these scores gives the acute physiology component of the APACHE III score.

The acute physiology score (APS) has very good discrimination on the PAH ICU dataset. The area under the ROC curve of the APS alone on the PAH ICU dataset is 0.837. Any model that is built from variables that include the APS should have an area under the ROC curve of greater than 0.837.

2. Disease Category

In collaboration with Petra Graham, I have proposed a simple and powerful approach to recoding the complex APACHE III disease groups and weights. A brief description is provided here to assist the explanation of modelling and feature selection.

The disease group classification is a three level simplification of the APACHE III diagnostic categories. Each disease group is categorised as High, Low or Neutral risk according to whether the mortality for the disease group was higher, lower or not significantly different to the average mortality of the APACHE III model development dataset (Table 3, Knaus et al.). Table 6.1 shows the coding of the original APACHE III diagnoses and the simplified scale. Any diagnosis that was not present in the original APACHE III disease group list is classified as "Neutral", and is marked with an asterisk in the table.

The risk was coded with high = 1, neutral = 0, and low = -1. The area under the ROC curve for the diagnostic code data alone on the PAH ICU dataset was 0.726.



Table 6.1: Mapping of APACHE III Disease Groups to Disease Category

Disease Group / Category (high/neutral/low risk of death)

Nonoperative
Cardiovascular/vascular: Cardiogenic shock high
Cardiac arrest high
Cardiomyopathy* neutral
Aortic aneurysm neutral
Congestive heart failure high
Peripheral vascular disease neutral
Rhythm disturbance low
Acute myocardial infarction low
Hypertension low
Allergy* neutral
Other cardiovascular diseases neutral
Respiratory: Parasitic pneumonia high
Aspiration pneumonia high
Respiratory neoplasm high
Respiratory arrest high
Pulmonary oedema (non-cardiogenic) high
Bacterial/viral pneumonia high
Chronic obstructive pulmonary disease high
Pulmonary embolism neutral
Mechanical airway obstruction neutral
Asthma low
Other respiratory diseases neutral
Hepatic failure high
Gastrointestinal (GI): GI perforation/obstruction high
GI bleeding due to varices high
GI inflammatory disease neutral
GI bleeding due to neoplasm* neutral
GI bleeding due to ulcer/laceration low
GI bleeding due to diverticulosis low
Other GI diseases high
Neurologic: Coma (unknown cause)* neutral
Intracerebral haemorrhage high
Subarachnoid haemorrhage low
Stroke high
Neurologic infection neutral
Neurologic neoplasm neutral
Neuromuscular disease neutral
Seizure neutral
Other neurologic diseases neutral
Sepsis: Sepsis (other than urinary tract) high
Sepsis of urinary tract origin high
Trauma: Head trauma (with/without multiple trauma) low
Multiple trauma (excluding head trauma) low
Metabolic: Metabolic coma high
Diabetic ketoacidosis low
Drug overdose low
Other metabolic diseases neutral
Haematologic: Coagulopathy/neutropenia/thrombocytopenia high
Other haematologic diseases low
Renal diseases neutral
Genitourinary: Pre-eclampsia neutral
Other genitourinary diseases* (may be non-operative or operative) neutral
Other: Other medical diseases low

Operative
Vascular/cardiovascular: Dissecting/ruptured aorta high
Peripheral vascular disease (no bypass graft) neutral
Valvular heart surgery low
Elective abdominal aneurysm repair low
Peripheral artery bypass graft low
Carotid endarterectomy low
Other cardiovascular diseases low
Respiratory: Respiratory infection neutral
Lung neoplasm low
Respiratory neoplasm low
Other respiratory diseases low
Gastrointestinal (GI): GI abscess* neutral
GI perforation/rupture high
GI inflammatory disease neutral
GI pancreatitis* neutral
GI peritonitis* neutral
GI obstruction neutral
GI bleeding neutral
GI vascular* neutral
Liver transplant neutral
GI neoplasm low
GI cholecystitis/cholangitis low
Other GI diseases low
Neurologic: Intracerebral haemorrhage high
Subdural/epidural haematoma high
Subarachnoid haemorrhage low
Laminectomy/other spinal cord surgery low
Craniotomy for neoplasm low
Other neurologic diseases low
Trauma: Head trauma (with/without multiple trauma) high
Multiple trauma (excluding head trauma) low
Renal: Renal neoplasm low
Renal transplant* neutral
Other renal diseases low
Gynaecologic: Hysterectomy low
Orthopaedic: Hip or extremity fracture neutral

* indicates a diagnosis that did not clearly map onto an APACHE III disease group at the time of modelling. These diagnoses had no risk data and were given a "neutral" disease group risk category coding.
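The three-level coding can be illustrated as a simple lookup. The entries below are a handful of examples drawn from Table 6.1, with unlisted diagnoses defaulting to neutral as described above; the names are mine, not the thesis code:

```python
# Illustrative coding of the three-level disease category.
# DISEASE_RISK holds a few example entries from Table 6.1; the full
# mapping is in the table. Diagnoses absent from the APACHE III disease
# group list default to "neutral".

DISEASE_RISK = {
    "cardiogenic shock": "high",
    "cardiac arrest": "high",
    "asthma": "low",
    "drug overdose": "low",
    "seizure": "neutral",
}
RISK_CODE = {"high": 1, "neutral": 0, "low": -1}

def disease_code(diagnosis):
    """Map a diagnosis to its risk code: high = 1, neutral = 0, low = -1."""
    return RISK_CODE[DISEASE_RISK.get(diagnosis.lower(), "neutral")]
```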

3. Chronic Health Score

A score was calculated by adding points allocated according to the presence of co-morbid conditions, following the data collection rules and definitions of APACHE III. Where these conditions are present, the points allocated are AIDS (23), hepatic failure (16), lymphoma (13), metastatic cancer (11), leukemia/multiple myeloma (10), immune suppression (10) and cirrhosis (4). The area under the ROC curve for the chronic health score alone on the PAH ICU dataset is 0.624.
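A minimal sketch of this scoring, using the point values quoted above (the function name is mine):

```python
# Chronic health score: points for co-morbid conditions are summed.
# Point values are taken from the text above.

CHRONIC_POINTS = {
    "aids": 23,
    "hepatic failure": 16,
    "lymphoma": 13,
    "metastatic cancer": 11,
    "leukemia/multiple myeloma": 10,
    "immune suppression": 10,
    "cirrhosis": 4,
}

def chronic_health_score(conditions):
    """Sum the points for each co-morbid condition present."""
    return sum(CHRONIC_POINTS.get(c.lower(), 0) for c in conditions)
```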

4. Age

Age was used as a continuous variable calculated as date of admission minus the patient's date of birth.

The area under the ROC curve of the patient's age alone on the PAH ICU dataset is 0.62.

Index

An integer was used to index the order of admission. This variable was not subsequently used in the

modelling, other than as a unique identifier.

The data were not manipulated or pre-processed further prior to modelling with the ANN and SVM.

6.2.4 Classification

The objective of the classification experiments was to assess how well ANNs and SVMs can classify patients, using the data collected in the first 24 hours, into 30 day in-hospital survivors and deaths.

SVM Classification

Initial classification experiments were conducted with the default value of the SVMlight parameter C,

C_default = 1 / mean(||x_i||²)

where x_i is the input vector and ||x_i|| = √(x_i · x_i) is its 2-norm.
Classification was investigated using the radial basis function kernel SVM and the polynomial kernel SVM.

SVM: Radial basis function kernel

The RBF kernel that was used to implement the dot product of the mapping function in Equation 6.3 was

K(x_i, x_j) = exp(−γ ||x_i − x_j||²)

where x_i and x_j are input patterns, and ||x_i − x_j|| is the 2-norm of the difference between these input vectors. The RBF parameter γ was varied.
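Written out directly, the RBF kernel above, together with the polynomial kernel used later in this section, is as follows (a plain-Python sketch, not the SVMlight implementation):

```python
# The two kernel functions used in these experiments, written out
# directly for clarity.
import math

def rbf_kernel(x, z, gamma):
    """K(x, z) = exp(-gamma * ||x - z||^2)."""
    sq = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq)

def poly_kernel(x, z, d):
    """K(x, z) = (x . z)^d, the polynomial kernel of degree d."""
    return sum(a * b for a, b in zip(x, z)) ** d
```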

SVMs with varying RBF kernel parameter γ were investigated (Figure 6.8). For each SVM, the sample was randomly split into a set of 20% (1056 cases) for model training, and the remaining 80% (4222 cases) for model testing. Values are the mean correct classification rate (CCR) of 10 SVMs trained at values of γ between 0 and 20. Figure 6.8 shows the CCR for RBF kernel SVMs in the range of γ between 0 and 1. The best CCR is 90.2% at γ between 0.007 and 0.01, whilst the best in the training set was 89% with γ less than 0.005. At larger γ, the CCR approaches the mortality rate, as all cases are classified as deaths.

Figure 6.8: CCR for RBF SVM, γ: 0-1
Values are the mean CCR of 10 models, 1056 cases in the training set.
(Curves plotted for training set and test set; y-axis: CCR, x-axis: γ.)

The discrimination was assessed with the area under the ROC curve. Figure 6.9 displays the discrimination performance at values of 0 < γ < 0.1 for the RBF kernel SVMs. The ability of the SVMs to rank the training set improves with increasing γ. On the test set, the areas under the ROC curves of the RBF kernel SVM models are all < 0.80, and all less than that of the acute physiology score alone (area under the ROC curve = 0.837).

Figure 6.9: Area under the ROC curve for RBF SVM, γ: 0-0.1
Values are the mean ROC curve area of 10 models, 1056 cases in the training set.
(Curves plotted for training set and test set; y-axis: area under the ROC curve, x-axis: γ.)

SVM: Linear and polynomial kernel

The polynomial kernel function that was used to implement the dot product of the mapping function in Equation 6.3 was

K(x_i, x_j) = (x_i · x_j)^d

where d, the degree of the polynomial, was studied for values 1-4.

The average CCR and areas under the ROC curves at a range of training set sizes were explored. Figure 6.10 displays the relationship between the training set sizes and CCR for the orders of polynomial kernels. Performance for all models was dominated by the preponderance of survivors in the sample. The CCR is around 87.5%-88.5% for all polynomial kernels. The polynomial kernel with d = 4 is not shown, as the optimisation algorithm failed to converge at the parameter settings for the data sample. At training set sizes up to 40% (2111 cases), 2nd and 3rd order polynomial kernel SVMs had a slightly superior CCR on the test sets than the linear SVM models.

Figure 6.10: Effect of training set size on CCR for polynomial SVM
d = 1-3; values are the average of 20 trials.
(y-axis: CCR, x-axis: % of sample as training set, 10-60%.)

Figure 6.11 shows the discrimination for a range of polynomial kernels, d = 1-3. The areas under the ROC curves are averaged over 20 trials at each point. Some of the models trained on small datasets (5%, 264 cases) had equivalent performance, with areas under the ROC curves on the test set up to 0.86. However, the striking feature of the small training sets was their unreliability, compared to the more robust and dependable modelling on the larger training sets of 20% or more (1056 cases) of the sample. At 10% or more (528 cases), the area under the ROC curve was consistently greater than the area under the ROC curve of the acute physiology score variable. In contrast to the RBF SVM, the polynomial kernel SVM had better discrimination than the APS variable.

Figure 6.11: Effect of training set size on ROC curve area, polynomial kernel SVMs
d = 1-3; values are the average of 20 trials.
(y-axis: area under the ROC curve, x-axis: % of sample as training set, 20-60%.)

At all training sample sizes, 2nd and 3rd order polynomial functions had a lower area under the ROC curve than a linear function.

The polynomial kernel SVMs studied in this experiment had better classification and discrimination on the test sets than the RBF kernel SVMs.



ANN Classification

The ANN training was conducted on Statistica Neural Networks using radial basis function (RBF) and multilayer perceptron (MLP) designs. Initial exploration showed the RBF ANN consistently underperformed in classification and discrimination compared to the MLP, and it is not considered further in this experiment.

The "Problem Solver" module of the Statistica Neural Networks programme was used to assign weights, adjust the learning rate and momentum, and prune the nets. The output variable was coded as -1 for death and +1 for alive for the classification and ranking problems. The input features were the same as in the SVM classification experiment. The training algorithm provided for pruning of both inputs and hidden units based on weight magnitudes and sensitivity analysis.

MLPs were designed to have only one hidden layer, with the number of hidden units and the ANN architecture determined by the "Problem Solver" algorithm.

The training set sizes studied were 5% (264 cases), 10% (528 cases), 20% (1055 cases) and 50% (2639 cases). A verification dataset was used to prevent over-training, and a test set was used to evaluate the generalisation performance of the ANN. The verification and test sets each comprised half of the cases from the dataset not used for training.

The function to be minimised was the sum of the squared differences between the observed outcomes and the model predictions on each output unit. For MLP training, a two-stage training process was used, with back propagation (50 epochs, learning rate 0.1, momentum 0.3) followed by conjugate gradient descent. Training was continued until the verification set performance started to deteriorate or convergence ceased.

For each trial, numerous ANNs were successively trained and the best models were retained. At the commencement of each trial, an ANN was trained until its generalisation performance on the verification set began to decline, or convergence ceased. Another ANN was then trained until its performance on the verification set began to deteriorate, or convergence ceased. If its performance was superior to the first ANN, it was retained; if not, it was discarded. This process was continued until 20 increases in performance had been achieved. Ten such trials were conducted on random selections of training data. The average performance of the best ANN from each trial provided the value for each data point.
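The retention procedure can be sketched as follows. This is a runnable skeleton only, not the Statistica "Problem Solver": `train_candidate` is a hypothetical stand-in that fits a one-parameter model from a random start, but the keep-if-better logic and the stop-after-20-improvements rule follow the description above (with an added cap on total trials so the loop always terminates):

```python
# Best-of-restarts retention: train candidate models, keep a candidate
# only if it beats the current best on the verification set, and stop
# after a fixed number of improvements (or a trial cap, added here as a
# safeguard not mentioned in the text).
import random

def train_candidate(data, rng):
    """Hypothetical stand-in for one ANN training run: a few gradient
    steps on a one-parameter linear model from a random start."""
    w = rng.uniform(-2.0, 2.0)
    for _ in range(25):
        grad = sum(2.0 * (w * x - y) * x for x, y in data) / len(data)
        w -= 0.05 * grad
    return w

def verification_error(w, data):
    """Mean squared error on the verification data."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

def best_of_restarts(train_data, verify_data, improvements=20,
                     max_trials=200, seed=0):
    rng = random.Random(seed)
    best_w, best_err, gains = None, float("inf"), 0
    for _ in range(max_trials):
        w = train_candidate(train_data, rng)
        err = verification_error(w, verify_data)
        if err < best_err:                 # retain only if it improves
            best_w, best_err, gains = w, err, gains + 1
        if gains >= improvements:          # stop after 20 improvements
            break
    return best_w, best_err

# Toy data with true slope 0.7, used for both training and verification.
data = [(x / 10.0, 0.7 * x / 10.0) for x in range(10)]
best_w, best_err = best_of_restarts(data, data)
```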

Figure 6.12 shows the mean area under the ROC curve for MLP ANN model performance on the test set.

Figure 6.12: Discrimination of MLP ANN with a range of training set sizes
Values are the mean area under the ROC curve of the model predictions on the test set; each point is the average of 10 trials.
(y-axis: area under the ROC curve, x-axis: % of sample as training data, 10-50%.)

The discrimination of the MLP ANN was on average 0.865-0.875 across the dataset sizes. The ANNs were able to improve upon the discrimination performance of the APS variable of the APACHE III score. The ANN performance was equivalent to, or exceeded, the discrimination of the best SVM polynomial kernel. Unlike the SVM, the ANN performance did not drop below an area under the ROC curve of 0.865, even with the training set at only 5% of the sample (264 cases). This is in contrast to the Clermont et al. study of ANNs in the ICU, where model performance declined at training set sizes of less than 800 cases.

Discussion of Classification Results


Table 6.2 shows the discrimination of classification models that have been applied to predict 30 day in-hospital mortality on the PAH dataset. SVMs, ANNs and logistic regression approaches, including APACHE III, have all used similar input variables for modelling.

Table 6.2: Discrimination (Area under the ROC Curve) for Various Models

Model                                              Area under the ROC curve
SVM polynomial (1st order)                         0.84
SVM RBF                                            0.79
MLP ANN                                            0.87
Acute physiology component of APACHE III score     0.84
APACHE III mortality prediction # (30 day outcome) 0.89
Logistic regression #                              0.87

# from Graham and Cook

The SVMs and ANNs successfully classified the patients into survivors and non-survivors. The best discrimination was seen with the proprietary APACHE III predictions, logistic regression, and the MLP ANN of this experiment. The APACHE III model was developed on a large dataset of 17 440 patients over 10 years ago in North America, and its discrimination was still very good on the PAH ICU dataset. The logistic regression model used all of the patients in the dataset. In comparison, the MLP ANN was able to discriminate accurately between survivors and non-survivors using only 264 patients, with an average area under the ROC curve of 0.865.

The MLP ANN discriminated as well as the other methods. In contrast to the ANN modelled by Clermont et al., the discrimination did not deteriorate at dataset sizes of less than 800 patients. In fact, the area under the ROC curve was still 0.865 on the smallest dataset studied (264 patients). The equivalent models in the Clermont study achieved areas under the ROC curve of 0.817 (400 cases) and 0.752 (200 cases).

One of the shortcomings of this experiment was that optimisation of the SVM parameters was not pursued. It is possible that the SVM performance may exceed that of the ANNs or logistic regression if optimisation of the SVM parameter C had been explored. This issue will not be pursued in this section, as this preliminary work is a prelude to building a regression model with raw patient data. However, this work establishes that, even before extensive parameter tuning, the classification and discrimination performance of the SVM is close to that of the ANN, logistic regression and the APACHE III system.

6.2.5 Regression
The aim of the preliminary regression modelling was to develop an ANN or SVM to estimate the risk of 30 day mortality in ICU patients, using the four features described earlier: APS, disease category, age and chronic health score.

The performance of the models was assessed for discrimination using the area under the ROC curve. Calibration was assessed by the Hosmer-Lemeshow C statistic on assessment sets of 400 cases.

SVM Regression

There was no a priori knowledge of the likely SVM parameters or of the precision required for this task, so experiments explored SVM kernel functions across a range of ε (the regression tube width). The SVM parameter C was set at the default value.

Preliminary experiments with the RBF kernel in regression SVM modelling with these variables were disappointing. Figure 6.13 shows the discrimination of the RBF kernels: the average area under the ROC curve of 50 trials at each data point is presented, for γ in the range 0-2 and ε in the range 0-2. The surface colours denote the area under the ROC curve values. There are no dark blue shaded areas, so no average area under the ROC curve was greater than or equal to 0.83. This means that for all parameter combinations tested, the discrimination of the RBF kernel SVM was worse than that of the acute physiology score variable (0.837).

At ε values below 0.5, the RBF kernel SVM had its best discrimination, with the area under the ROC curve being 0.78-0.81 across the range of values of γ. At values of ε greater than 0.5, the discrimination of the models rapidly deteriorated to provide no better than random discrimination (area under the ROC curve ≈ 0.5).

Figure 6.13: Surface plot of the average area under the ROC curve for RBF kernel SVM
The surface plot tones correspond to the average area under the ROC curve:
Dark blue: > 0.83
Light blue, green, yellow: 0.80-0.83
Red: < 0.80

No further optimisation of regression SVM parameters was done with the RBF kernel using these data

and variables.

Regression SVM with polynomial kernels had more promising discrimination. Figure 6.14 shows the

areas under the ROC curves for a range of ε for polynomial kernels of degree d = 1 - 3. Each point is

the average of 20 SVM performances at that ε value. The best average area under the ROC curve

(0.873) was with the linear kernel with ε = 0.005.
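The kernel and ε sweep described above can be sketched as follows. This is an illustration using scikit-learn's SVR on synthetic data, not the SVMlight software or the PAH ICU data used in the experiment; the grid values, the synthetic feature matrix and the rank-based AUC helper are assumptions made for the sketch.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
# Synthetic stand-in for the four predictors (the real patient data are not public).
X = rng.normal(size=(600, 4))
latent = X @ np.array([1.2, 0.6, 0.4, 0.2]) + rng.normal(scale=0.8, size=600)
y = (latent > 0).astype(float)          # binary 30 day mortality stand-in
Xtr, ytr, Xte, yte = X[:400], y[:400], X[400:], y[400:]

def auc(y_true, score):
    # Rank-based area under the ROC curve.
    order = np.argsort(score)
    r = np.empty(len(score))
    r[order] = np.arange(1, len(score) + 1)
    n1 = y_true.sum()
    return (r[y_true == 1].sum() - n1 * (n1 + 1) / 2) / (n1 * (len(y_true) - n1))

# Sweep epsilon (tube width) and polynomial degree, with C left at a default.
results = {}
for eps in (0.005, 0.05, 0.2):
    for degree in (1, 2, 3):
        model = SVR(kernel="poly", degree=degree, epsilon=eps, C=1.0).fit(Xtr, ytr)
        results[(eps, degree)] = auc(yte, model.predict(Xte))
```

One plausible reading of the collapse in discrimination at ε above 0.5 is that a tube of that width can cover both outcome values (0 and 1) with a near-constant function, so the ε-insensitive loss no longer penalises an uninformative fit.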

Figure 6.14: Area under the ROC Curve for Polynomial Kernel SVM using d = 1, 2, 3 for a Range of ε

Training set size = 1056 cases, 20 trials at each data point. X axis: ε (0 - 0.2); Y axis: area under the ROC curve.

The calibrations of the RBF and polynomial kernel regression SVM models were poor. The SVM's

output provided a point estimate of a binary outcome and did not provide an estimate of the probability

of death. Therefore, the output of the linear kernel SVM, which had the best discrimination, was

recalibrated. In this experiment, a GRNN was used to recalibrate the SVM outputs, to more reliably

estimate the probability of 30 day in-hospital death.

The best calibration at the default setting for the SVM parameter C was found with the linear kernel

SVM (d = 1 and ε = 0.005). One hundred linear kernel SVMs were trained and recalibrated with the

GRNN. Each of these trained SVM GRNN models was then assessed on 100 test set samples of 400

randomly chosen cases, and the average area under the ROC curve and H-L C statistic of the 100 test

set evaluations were used to assess the SVM GRNNs.
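The repeated assessment (100 test samples of 400 cases each, drawn with replacement) can be sketched as below; this is a NumPy illustration on hypothetical outcome and prediction arrays, not the code actually used in the experiment.

```python
import numpy as np

def bootstrap_auc(y, p, n_sets=100, set_size=400, seed=0):
    """Mean and SD of the area under the ROC curve over repeated test
    samples drawn with replacement, mirroring the 100 x 400-case scheme."""
    y, p = np.asarray(y, float), np.asarray(p, float)
    rng = np.random.default_rng(seed)
    aucs = []
    for _ in range(n_sets):
        idx = rng.integers(0, len(y), size=set_size)
        ys, ps = y[idx], p[idx]
        n1 = ys.sum()
        if n1 in (0, len(ys)):
            continue  # degenerate sample: all survivors or all deaths
        order = np.argsort(ps)
        r = np.empty(len(ps))
        r[order] = np.arange(1, len(ps) + 1)
        aucs.append((r[ys == 1].sum() - n1 * (n1 + 1) / 2) / (n1 * (len(ys) - n1)))
    return float(np.mean(aucs)), float(np.std(aucs))
```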



Figure 6.15 shows the distribution of the mean areas under the ROC curves of the 100 SVM GRNN

models built. The average area under the ROC curve was 0.863 and all models had an area under the

ROC curve above 0.83.

Figure 6.15: Discrimination of SVM-GRNN Models: Histogram of Mean ROC Curve Area

Each observation is the mean of 100 samples of test sets of 400 cases for each of the SVM-GRNN models. X axis: area under the ROC curve (0.82 - 0.90); Y axis: number of models.

Figure 6.16 is a frequency histogram of the average H-L C statistic values of 100 assessment sets for

each of the 100 SVM-GRNN models. Twenty-nine SVM GRNN models of the 100 trained had an

average H-L C > 15.5 on 400 case test sets. The histogram of the average H-L C statistics has a skewed

distribution, which is expected as the H-L C statistic follows a chi-squared distribution. Note the 7

models grouped on the right of the histogram, which all had mean H-L C statistics > 50.

Figure 6.16: Calibration of SVM-GRNN Models: Histogram of Mean H-L C Statistic

Each observation is the mean of 100 samples of test sets of 400 cases for each of the SVM-GRNN models. X axis: mean H-L C (12 - 48); Y axis: number of models.

Seventy-one percent of the SVM GRNN models met both the discrimination (area under the ROC

curve > 0.80) and calibration (H-L C < 15.5 on 400 cases) criteria. Therefore, the SVM GRNN is a

practical approach to modelling risk of 30 day in-hospital mortality on the PAH ICU data.

The distributions of the discrimination and calibration statistics in Figures 6.15 and 6.16 demonstrate

the need to conduct multiple trials and re-sampling. The spread of areas under the ROC curve and H-L

C statistics are due to sampling of the data for model building and model assessment. For each trial, a

different training set was chosen. For assessment of each model, 100 different test sets were chosen

from the remaining data. Experiments conducted on a single sampling of data only demonstrate that it

is possible to build a suitable model on the dataset. This experiment has demonstrated that the

modelling process is very likely to provide a suitable model on this dataset.

While some investigation was done of the SVM kernels and kernel parameters, the SVM parameter C

was not optimised, and the SVMlight default value was chosen. It is possible that more extensive

investigation of SVM parameters may provide better performance. This will be explored in the next

chapter.

ANN Regression
The ANN training was conducted on Statistica Neural Networks using Radial Basis Function (RBF)

and Multilayer Perceptron (MLP) designs. Initial experiments indicated that the RBF ANN consistently

produced lower discrimination than the MLP, so the following experiment is reported using only the

MLP. The methods of feature selection, training and optimisation were the same as for the previous

classification experiment.

For each trial, numerous ANNs were successively trained and the best models were retained, in the

same way that the classification experiments were performed. At the commencement of each trial, an

ANN was trained until its generalisation performance began to decline. Training of that ANN ceased

and another ANN was similarly trained. If its best performance was superior to the first, then it was

retained; if not, it was discarded. As before, this process was continued until a series of ANNs with 20

increases in performance had been achieved.
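The successive-restart scheme just described can be sketched generically. Here `train_one` is a hypothetical callable standing in for a single Statistica training run (train until generalisation declines, return the network and its best validation score); only the keep-the-best control loop is shown.

```python
import random

def train_with_restarts(train_one, target_improvements=20, seed=0):
    """Train candidate networks one at a time; keep a candidate only if
    it beats the best so far, and stop once the retained model has been
    improved the target number of times."""
    rng = random.Random(seed)
    best_model, best_score, improvements = None, float("-inf"), 0
    while improvements < target_improvements:
        model, score = train_one(rng)   # one complete training run
        if score > best_score:
            best_model, best_score = model, score
            improvements += 1
    return best_model, best_score
```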

Twenty such trials were conducted on random selections of training data. The performances of the best

ANNs from each trial were analysed. The training datasets were 20% of the sample (1056 cases).

Initial analysis of the ANN outputs indicated that the outputs did not estimate the probability of death.

Whilst the areas under the ROC curve demonstrated good discrimination, the output of the ANN was

poorly calibrated. Therefore, the MLP ANN output values were recalibrated with a GRNN.
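A GRNN is in effect a Nadaraya-Watson kernel smoother, so the recalibration step (raw network output in, probability of death out) can be sketched as below. The smoothing width σ is an assumed value for illustration, not the one used in the experiment.

```python
import numpy as np

def grnn_recalibrate(scores_train, y_train, scores_new, sigma=0.05):
    """Map a model's raw output score to a probability of death as a
    Gaussian-kernel weighted average of the observed training outcomes."""
    s_tr = np.asarray(scores_train, float)
    y_tr = np.asarray(y_train, float)
    s_new = np.asarray(scores_new, float)
    d = s_new[:, None] - s_tr[None, :]
    w = np.exp(-(d ** 2) / (2 * sigma ** 2))
    # A weighted mean of binary outcomes lies in [0, 1] by construction.
    return (w @ y_tr) / w.sum(axis=1).clip(1e-12)
```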

To evaluate the generalisation performance of the 20 MLP GRNN ANNs, the areas under the ROC

curves and the H-L C statistics were calculated by analysis of 100 test sets of 400 cases. These datasets

were drawn at random, and with replacement from the available test data. The average of the areas

under the ROC curve and the H-L C statistics were used to describe the discrimination and calibration

of each of the 20 MLP GRNN models. These results are presented in Table 6.3.

Table 6.3: Discrimination (Area under the ROC Curve) and Calibration

Results of ANN MLP-GRNN Regression.


For each of 20 experimental training samples of 1056 cases, each value is the mean of 100 re-sampling

trials drawing 400 cases with replacement from the unseen data.

Trial Number    Area under the ROC Curve    H-L C statistic
1 0.866 14.0
2 0.870 12.7
3 0.855 25.6
4 0.867 10.8
5 0.870 12.3
6 0.874 23.3
7 0.862 11.0
8 0.851 17.1
9 0.860 13.0
10 0.859 10.2
11 0.857 17.6
12 0.875 27.7
13 0.862 9.2
14 0.847 10.0
15 0.866 8.6
16 0.866 14.2
17 0.872 21.9
18 0.861 17.7
19 0.867 13.9
20 0.858 10.1

Seven of the MLP GRNN ANNs (Table 6.3) did not satisfy the model performance criteria that I have

proposed (area under the ROC curve > 0.80 and H-L C < 15.5 on 400 cases). All models had an area

under the ROC curve > 0.85. However, models 3, 6, 8, 11, 12, 17 and 18 had H-L C statistics that were

greater than 15.5. This experiment demonstrates that for MLP GRNN models trained on a dataset of

1056 cases, in 13 of 20 occasions the models achieved adequate performance. Therefore, the MLP

GRNN offers one way to train a model to estimate the probability of 30 day in-hospital mortality from

the PAH ICU dataset.

Discussion of Regression Results

These preliminary experiments demonstrate that ANNs and SVMs can be used to build models that

accurately estimate the probability of in-hospital death of ICU patients.



The SVM GRNN and ANN GRNN models were trained on 20% of the PAH ICU dataset, using 1056

cases, or approximately 10 months of admissions to the PAH ICU. This set size was chosen for the

preliminary machine learning regression experiments for two reasons. The sample size of 1056 cases

was successful in the classification experiments, and was likely to be adequate for the regression

experiments. Clermont et al. had successfully used a training set size of 800 cases, but found that

model performances with ANN and logistic regression deteriorated below this size.

The MLP ANN and the SVM have adequately discriminated between the survivors and non-survivors

on the PAH ICU dataset. However, the MLPs and SVMs did not provide estimates of the probability of

death, but rather a point estimate of the predicted outcome. Therefore, the regression modelling experiment

used a GRNN to recalibrate the predictions of the MLP and the SVMs. The GRNN added an additional

step to the modelling process, but allowed the output values to approximate the probability of death.

This is verified by the discrimination and calibration performance of the recalibrated models.

By itself, the GRNN was unable to effectively rank the patients in order of risk of death, with areas

under the ROC curve < 0.7. However, once the patients were modelled by the SVM or the ANN the

average area under the ROC curve was greater than 0.83. The GRNN was then used to adjust the SVM and

ANN outputs to an accurate estimate of risk of death, verified by the calibration of the models. The

area under the ROC curve and discrimination was preserved.

In a practical sense, the second phase of model calibration with the GRNN for the MLP ANN and the

SVM was a useful experiment but is unwieldy. If this approach is to be introduced, then the steps

would have to be automated. Modelling, as conducted in this experimental context, could not be

introduced into general use due to its complexity and predilection to error. One alternative is to apply

the recommendations of Weston et al. to construct SVM kernels specifically for probability density

estimation. Using a logit transformation (the log of the odds ratio) of the ANN or SVM outputs would

be an alternative to the GRNN as a method to recalibrate the regression outputs to provide estimates of

the risk of death between 0 and 1.
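The logit alternative amounts to Platt-style scaling: fit the log-odds of death as a linear function a + b·score of the raw model output. A minimal NumPy sketch follows, with illustrative learning-rate and iteration settings.

```python
import numpy as np

def logit_recalibrate(scores_train, y_train, scores_new, iters=500, lr=0.5):
    """Fit logit(p) = a + b * score by gradient ascent on the logistic
    log-likelihood, then return recalibrated probabilities in (0, 1)."""
    s = np.asarray(scores_train, float)
    y = np.asarray(y_train, float)
    a, b = 0.0, 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(a + b * s)))
        a += lr * float(np.mean(y - p))        # gradient w.r.t. intercept
        b += lr * float(np.mean((y - p) * s))  # gradient w.r.t. slope
    return 1.0 / (1.0 + np.exp(-(a + b * np.asarray(scores_new, float))))
```

Because the sigmoid is monotone, this transformation preserves the model's ranking of patients (and hence its area under the ROC curve) exactly, whereas a kernel smoother such as the GRNN need not be monotone.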



The alternative that will be investigated in the next chapter is to build a regression SVM that will

approximate the probability of in-hospital patient death without a re-calibration step. It is planned to

tailor the design and parameter choice to the performance requirements of a risk adjustment model. An

approximation of probability of in-hospital death can be made if optimisation of models is driven by

the desirable performance attributes of discrimination and calibration. I will pursue this in the next

chapter, using discrimination (maximising the area under the ROC curve) and calibration (minimisation

of H-L C statistic). If this is successful, then a regression SVM will be built with a single modelling

step to approximate the probability of patient mortality.

This study uses re-sampling of data for training and testing to evaluate models. Previous comparable

studies have relied on a single split of training and test sets. In a letter to the Editor, Paetz

highlighted this weakness in the study of Clermont and co-workers. Without re-sampling, the process is

vulnerable to variations with inclusion (or exclusion) of influential outliers that may bias the model

building, training or assessment. Clermont et al. had shown that the machine learning modelling was

possible, but not that their findings were necessarily reproducible. The histograms of Figures 6.15 and

6.16 illustrate the variability of model performance that is found when multiple samples of 1056 cases

were drawn at random. The modelling with SVM and ANN was done with 20 MLP GRNN trials and

100 SVM GRNNs, each assessed with 100 random test sets. This approach with multiple trials and

re-sampling confirms that the experimental method will produce models that are likely to meet the

discrimination and calibration goals on any repeated sample of 1056 cases from the PAH ICU dataset.

Whilst Clermont et al. showed that ANN could successfully model patient probability of death, the

experiments of this chapter demonstrate machine learning approaches that will usually be successful.

The performances of the SVM GRNN and the MLP GRNN models are as good as or better than the

findings of Clermont et al. This is the only comparable study using machine learning with similar

variables on a similar ICU dataset. Clermont et al. found the areas under the ROC curves of a logistic

regression and MLP ANN model were each 0.839, using samples of 800 and 1200 cases. This

compares to the experimental findings in this chapter using 1056 cases. The classification experiment

trained MLPs (0.87) and SVMs (linear kernel: 0.84), and the regression experiment trained

MLP-GRNNs (0.86) and SVM-GRNNs (0.86).



The calibration of the ANN and the SVM based regression models was similar in this experiment. For

both approaches, under the conditions of this experiment, there were 65% of MLP-GRNN (13/20

models) and 71% (71/100) of SVM-GRNN models that had both discrimination and calibration that was

acceptable. The calibration of these models assessed by the H-L C statistic was comparable to the

findings of Clermont et al.

Any differences in the areas under the ROC curves could be explained by random variation, and it is

possible that SVM, ANN and logistic regression will provide equivalent results. If, however, any

difference is real, several explanations related to data and variable selection may be proposed. It is

possible that the absence of a diagnostic field in the variables used by Clermont et al. may have

contributed to a slightly lower performance. Petra Graham and I have shown a diagnosis variable to

be associated with up to 13% of the model's explanatory power on the PAH ICU dataset, and up to

18% has been demonstrated on other ICU patient models. Also, the failure by Clermont et al. to

randomise to development or testing sets, or to use a re-sampling procedure creates the potential for

bias in the development and assessment of models. An unlucky split of cases may have resulted in the

lesser measured areas under the ROC curves in their study. Other possible reasons can be proposed,

such as data accuracy and precision, but not enough details are provided in the paper to comment. The

data in the study by Clermont et al. and at the PAH ICU were collected and compiled according to the

same rules of the APACHE III database system, with the same training and quality checks.

6.3 Conclusion

In the field of machine learning, ANNs, but not SVMs, have been previously applied to ICU outcome

modelling. These preliminary experiments demonstrated that ANNs and SVMs provided comparable

model performance on classification and regression tasks to logistic regression. In particular, the

preliminary results indicate that SVMs may perform as well as logistic regression or ANNs. As it is an

area that has not been extensively studied, a SVM model for risk adjustment will be pursued to estimate

the risk of death of ICU patients.



However, the models proposed in this section have considerable room for improvement. This issue will

be addressed in the next chapter.

Firstly, models will be built with the patient database data, rather than use the APACHE III score. The

variables used as input features for the ANNs and the SVMs were chosen on the basis of other

successful ICU outcome modelling. These were heavily pre-processed, and as a preliminary area of

study, improved the chances of successful model development. Both the APS and the Disease Category

variables offer effective discrimination between survivors and non-survivors, even before incorporation

into a model. However, there is the possibility that this pre-processing may have limited the

performance of the models. Therefore, in Chapter 7, patient outcome will be modelled with all the

variables that are collected on the database, not just the 4 pre-processed variables used in Chapter 6.

The raw physiological and laboratory data, the diagnostic code, co-morbidity information and additional

demographic and admission information will be used.

If a model can be built that uses only the component data variables, then the model development can be

done independent of the APACHE III score and the APACHE III system that provides proprietary

estimates of the probability of death. This will potentially save money spent on the software license

fee. More importantly, it will provide the flexibility to remodel the patient data when a risk adjustment

model used for risk adjusted control charting no longer fits the patient data.

Secondly, 1056 cases were used for the model development in this chapter. Clermont and co-workers

reported that 800 cases were sufficient for model development using logistic regression and ANNs.

Therefore, in the next chapter, a training dataset of 800 cases will be used. This is of practical

importance as 800 cases is approximately 8 months of ICU admissions. With an additional 400 cases

for model assessment, the model development and assessment can be conducted on about 1 year's

patient data.

Thirdly, neither MLP ANNs nor regression SVMs provided estimates of the probability of patient

death without a calibration step with the GRNN. By using the value of the machine learning model

output as the input variable for the GRNN, the ranking performance of the SVMs and the ANNs was

preserved and a probability estimate was produced. These methods satisfied the discrimination and

calibration criteria of the mortality prediction model that I have proposed: area under the ROC curve >

0.80 and H-L C statistic < 15.5 on 400 cases. However, a SVM regression model may be able to

provide reliable estimates of the probability of death without a GRNN calibration step. SVM model

parameter choice will be determined by the kernel and parameter combinations that on average meet

discrimination and calibration goals.

Fourthly, the estimation of regression SVM parameters will be revised. In this experiment, the

polynomial and RBF kernels were explored for a range of values of ε, without investigating values for

the SVM parameter C. Parameter C is an important determinant of the SVM generalisation error. For

the SVM to reliably estimate the probability of patient death, the values of kernel parameters and the

parameters C and ε will be systematically explored, seeking the parameter combinations that satisfy the

model performance criteria.

Therefore in the next chapter, modelling of the ICU patient risk of death will be undertaken with a

regression SVM using the raw patient data variables with transformation where necessary. After an

initial exploration of the search space, an extensive optimisation process will seek the model with the

best discrimination and calibration for use as a risk adjustment model.



Chapter 7

Development and Assessment of SVMs to Estimate the

Probability of ICU Patients' 30-Day In-Hospital Mortality

using Patient Data

This chapter presents an experiment in which support vector machines (SVM) are developed to estimate the

risk of in-hospital death of patients admitted to the Princess Alexandra Hospital (PAH) intensive care unit

(ICU) using data about their admission, physiology and diagnoses.

SVMs have displayed comparable performance to artificial neural networks (ANNs) and logistic regression

on the intensive care unit (ICU) dataset. As SVMs have not been extensively studied in this application,

further SVM model development will be pursued. This experiment will employ a strategy to train a

regression SVM based on achieving discrimination and calibration targets and thus guide kernel and

parameter selection. It will be shown that regression SVMs can be trained to accurately and reliably

estimate the probability of a patient's 30-day in-hospital death using the number of cases equivalent to 1

year of activity. This provides a practical approach to modelling risk of death for risk adjusted

(RA) outcome monitoring.

7.1 Overview

The purpose of the experiment was to build a SVM to accurately predict the probability of 30 day in-

hospital mortality of the ICU patients. For the SVM to provide a practical alternative to the commercial

APACHE model, it has to function under the following constraints.



The data variables were raw demographic, physiological, diagnostic and investigational data available

within the first 24 hours of ICU admission. The modelling was to be done on a 1200 case sub-sample or the

equivalent of a year of patient admissions to the PAH ICU, during 1995 - 1999.

Estimating the probability of death can be framed as a regression problem. A proposed model had to have

adequate discrimination and calibration to meet the standard of an accurate estimate of the probability of 30

day in-hospital death of an ICU patient at the PAH. In contrast to the experiment in the previous chapter,

the aim was to train regression SVMs with outputs that did not require re-calibration by a method such as

the general regression neural network (GRNN).

7.2 Method
7.2.1 Data
From the original set of 5278 consecutive admissions in the PAH ICU patient dataset 1/1/1995 -

31/12/1999, 16 cases were excluded due to missing data on Source of Admission or Time to ICU

Admission. ICU readmissions during a single hospitalisation were not included.

Thirty-five input variables, not limited to the APACHE III physiology, laboratory or diagnosis variables,

were used from the PAH ICU patient database. The list and description of the variables is shown in Table

6.1.

The demographic and admission data were collected at time of admission. The diagnosis, co-morbidity,

laboratory values, physiological observations and measurements were collected at the end of the first day in

ICU. No observations or measurements were included if the observations pre-dated the admission to the

physical ICU area. In the event that a diagnosis was revised after 24 hours in the ICU, the original

diagnosis was retained. Physiological and laboratory values were chosen according to the most extreme

values seen in the first 24 hours in ICU. Where physiological or laboratory values were not collected, the

database allocated a physiological value within the normal range. As the aim of the modelling task was to

estimate mortality based on patient attributes only, the APACHE III score which had been successfully

used in the last chapter, was not used in this experiment.

The mortality outcome to be predicted was death in-hospital within 30 days.

Samples of 1200 cases, comprising a training set of 800 cases and a test set of 400 cases, were randomly

chosen, without replacement from the 5262 cases.

Initial experimentation on the dataset indicated that raw values could not be successfully modelled and that

best results would be obtained if some of the variables underwent a transformation prior to inclusion in the

model.

Only one example of the use of SVMs to model ICU patient data was found in the literature. Morik et al.

used the SVM to emulate physicians' behaviour to assist in rule generation and guide the

haemodynamic management of patients in a German ICU. The data acquisition system was different to that

used at the PAH ICU, and the German patient data were initially modelled using time series analysis.

Morik et al. collected observations every 1 minute to detect and manage haemodynamic shock, in contrast

to the APACHE III database, which collects the worst value during the first 24 hours.

Morik and co-workers recoded categorical variables into a number of binary attributes. This is appropriate

for the PAH application and was therefore used for the "surgical category" variable and "patient admission

source" variable. Surgical category became the binary indicator variables for Elective Surgery, Non-

Surgical and Emergency Surgery. Source of admission was coded using indicator variables for Emergency

Room, Operating Room, Floor or Other Hospital.
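The indicator recoding can be sketched as follows; the category labels are those named in the text, and the helper function is illustrative only.

```python
SURGICAL_CATEGORIES = ("Elective Surgery", "Non-Surgical", "Emergency Surgery")
ADMISSION_SOURCES = ("Emergency Room", "Operating Room", "Floor", "Other Hospital")

def one_hot(value, categories):
    """Recode one categorical field into binary indicator variables."""
    return [1 if value == c else 0 for c in categories]
```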

However, the data transformation used by Morik and co-workers for the real valued, continuous or discrete

variables could not be used. Their method was to normalise the values of each variable using

    norm(X) = (X − X_mean) / √var(X)

This was not appropriate as the values in the PAH database were collected according to the rules of the

APACHE III system database. In common with the APACHE II, SAPS II and MPM24 II ICU models, it

uses the worst or most extreme values (high or low) during the first 24 hours in ICU. This method

has been widely used as it is the deviation from normal values that is associated with an increased risk of

death, rather than the absolute measurements. Therefore, the frequency histograms of the values of some

variables were comprised of two distributions of "worst" values. One component of the worst value was

observations that were the "worst low values", whilst the other was the "worst high values". Two examples

are shown in Section 7.2.2. Worst Temperature (Figure 7.1) has a frequency distribution that is dominated

by the worst high temperatures, reflecting that in critically ill patients, hyperthermia with sepsis or systemic

inflammation is more common than is hypothermia. Worst Mean Arterial Pressure (Figure 7.3) clearly

shows a bimodal distribution of worst high blood pressure and worst low blood pressure.

Other machine learning methods have accurately modelled ICU patient in-hospital mortality, and some of

their successful data transformation strategies are described next. Generally, the raw physiology and

laboratory observations were processed to values that reflected the distance of the observation from a

physiologically normal state, which should be associated with the lowest risk of mortality. Clermont et al.

gave numerical values to the variables according to the original APACHE III scoring scheme. I have not

used this method, as it relies too closely on the APACHE III algorithm, and requires knowledge of the

scoring for each variable. Doig et al. coded all variables, except creatinine and the Glasgow Coma Scale,

as the difference above or below the median value of the variable. Potentially, the median value may not

reflect the variable value with the lowest risk of death, and it will depend heavily on the casemix and

patient severity.

The study by Frize et al. reported an ANN-based clinical decision support system for ICU patients, and

provides the most detail about successful pre-processing. In their study, all non-binary variables were

standardised, and scaled so that the zero values of the variables were associated with the lowest risk of

death. An ICU clinician chose the value that, in his expert opinion, gave the lowest risk of death for each

variable. The data observations were then subtracted from this physiologically normal value, and scaled by

dividing by 3 times the standard deviation.

With the previous published experience in mind, the PAH ICU raw non-binary data were processed to

achieve two objectives. The first aim was to standardise each variable, so that there was a minimum or

maximum of mortality risk at a standardised value of zero. The second aim was to process the variables so

that all were of approximately the same scale and range of values.

To standardise a variable, a physiologically normal value was subtracted from the variable value. One

option for settling on a physiologically normal value for this experiment was to use expert judgement, as

used in the paper by Frize et al. As an experienced ICU clinician, my clinical judgement tells me that this

approach is too subjective. The second option was to use the normal values recommended by APACHE III.

The values considered normal were derived from ICU patient data collected in 1988 and 1989 in North

America. However, there have been advances in medical care, and it was plausible that 5 - 10 years after

the APACHE III model was developed, the lowest risk of death would be found at different

physiological values.

Therefore, a third option, to re-explore the data variables was undertaken. The values for each variable

associated with minimum risk of death were identified using a smoothed, locally weighted second order

polynomial fitted to scatter plots of 30 day in-hospital death (dependent variable) against each non-binary

variable. This method is similar to that used in the development of APACHE III, and a recent large risk

adjustment exercise reported by Render et al. from the Veterans Affairs ICUs in the United States.

However, in this PAH ICU application, the smoothed curves were only used to identify the value of

minimal (or maximal) risk of death, and not to assign continuously re-weighted values to the variables.

A distance weighted least squares procedure (as part of Statistica 5) was used to fit a curve to a plot of

input variable values and the patient outcomes. A quadratic regression for each value of the variable is used

to estimate the corresponding patient outcome, such that the influence of data points on the regression

decreases with distance, along the X axis, from the particular variable value. This method demonstrated

which values of each variable were associated with the lowest or highest risks of patient death. Each

variable was standardised by subtracting the value associated with minimum (or maximum) risk from the

raw data value. Scaling was done by dividing the result by the standard deviation of the standardised

variable values.
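The curve-fitting step can be approximated by an ordinary locally weighted quadratic regression. The sketch below is a NumPy stand-in for the Statistica procedure, with an assumed Gaussian distance weighting and bandwidth; it returns the grid value at which the smoothed outcome is lowest.

```python
import numpy as np

def minimum_risk_value(x, y, bandwidth=1.0, grid=200):
    """Fit a quadratic at each grid point, down-weighting cases by their
    distance along the X axis, and return the value of the variable at
    which the smoothed outcome (e.g. 30 day mortality) is minimal."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xs = np.linspace(x.min(), x.max(), grid)
    fitted = []
    for x0 in xs:
        sw = np.exp(-(((x - x0) / bandwidth) ** 2) / 2)   # Gaussian distance weights
        X = np.column_stack([np.ones_like(x), x - x0, (x - x0) ** 2])
        beta = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
        fitted.append(beta[0])   # smoothed outcome at x0
    return float(xs[int(np.argmin(fitted))])
```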

The raw data of some variables had frequency histograms that appeared unimodal, with grossly skewed

frequency distributions, or had values over a large range. In these cases, the logarithm to base 10 was used

to transform non-zero raw data values to facilitate scaling and standardisation. Entries with raw values of

zero were given the lowest transformed value from that variable.
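Taken together, the pre-processing of a non-binary variable (an optional log10 transform, centring on the minimum-risk value, then scaling by the standard deviation) can be sketched as below. This is a simplification: the special handling of raw zeros in log-transformed variables is omitted.

```python
import numpy as np

def standardise(values, min_risk_value, log10_first=False):
    """Centre a variable on the value associated with minimum (or
    maximum) mortality risk, then scale by the SD of the centred values."""
    x = np.asarray(values, dtype=float)
    v0 = float(min_risk_value)
    if log10_first:                 # for grossly skewed, wide-ranging variables
        x = np.log10(x)
        v0 = np.log10(v0)
    centred = x - v0
    return centred / centred.std()
```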

This pre-processing did not aim to create a "standard normal variate" as used by Morik et al. The

histograms of values of some variables were often far from having a Gaussian distribution. In the PAH

dataset, the mean values were only occasionally associated with the lowest risk of death.

7.2.2 Three examples of pre-processing variables:

Worst Temperature, Worst Mean Blood Pressure and Worst Bilirubin

Three examples of pre-processing variables are worked to show the approach.

Worst Temperature.
Worst Temperature illustrates a unimodal distribution, with a normal physiological temperature associated

with the lowest risk of death.



Figure 7.1 Histogram of Worst Temperature Values

PAH ICU 1995 - 1999. X axis: Worst Temperature (24 - 44 °C); Y axis: number of observations.

The minimum risk of death was associated with a temperature of 37.2 °C, which is a normal body

temperature. The Worst Temperature value minus 37.2 was scaled by dividing by the standard deviation

(1.33). Figure 7.2 is a fit using distance weighted least squares that shows the relationship between the

worst temperature and the outcome of 30 day in-hospital death.
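With the figures quoted above, the standardisation of Worst Temperature reduces to a one-line calculation:

```python
MIN_RISK_TEMP = 37.2   # degrees C, value with the lowest observed mortality risk
TEMP_SD = 1.33         # SD of the centred Worst Temperature values

def standardised_temp(worst_temp):
    """Standardised Worst Temperature: zero at the minimum-risk value."""
    return (worst_temp - MIN_RISK_TEMP) / TEMP_SD
```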



Figure 7.2 Scatterplot of In-Hospital 30 Day Mortality and Worst Temperature

X axis: Worst Temperature (28 - 44 °C); Y axis: outcome (0 = survived, 1 = died), with a distance weighted least squares fit.

For Worst Temperature the distribution appears to be unimodal, and approximately normally distributed.

The minimum risk of death is not associated with the mean Worst Temperature value of 36.1 °C, or the

APACHE III default physiologically normal value of 38 °C; rather, it is found at 37.2 °C (Figure 7.2).

The mean Worst Temperature is different from the Worst Temperature associated with the lowest risk of

death. This illustrates why the normalisation method of Morik et al., using the mean value taken from a

group of critically ill patients, should not be used with these data.

Worst Mean Arterial Pressure

Worst Mean Blood Pressure has a bi-modal distribution with a normal physiological blood pressure

associated with lowest risk of death. Figure 7.3 shows the frequency histogram, and Figure 7.4 shows the

plot of the mean blood pressure values and the 30 day in-hospital mortality. The lowest risk of death was

associated with a mean blood pressure of 90 mmHg. Therefore, Worst Mean Blood Pressure was pre-processed by scaling the raw value minus 90 mmHg divided by the standard deviation (27.0). The mean of the Worst Mean Arterial Blood Pressure was 79.5 mmHg, with a bimodal distribution of the "low" and the "high" Worst Mean Arterial Blood Pressure values, demonstrating that the normalisation method of Morik et al. is, again, not appropriate. The APACHE III recommended physiologically normal value was 90 mmHg, and agreed with the blood pressure value associated with the lowest risk of death.

Figure 7.3 Histogram of Worst Mean Arterial Blood Pressure, PAH ICU 1995-99
(x-axis: Worst Mean Blood Pressure (mmHg); y-axis: number of observations; plot not reproduced in this extraction)

Figure 7.4 Scatterplot of 30 Day In-Hospital Mortality and Worst Mean Blood Pressure
(x-axis: Worst Mean Blood Pressure (mmHg); y-axis: 30 day in-hospital mortality outcome; plot not reproduced in this extraction)

Variables which had a value associated with a minimum risk of death were: Temperature, Mean Blood Pressure, Heart Rate, White Cell Count, Haematocrit, Sodium, Glucose, Albumin, pCO2 and Blood Urea Nitrogen. The variable which had a value with a maximum risk was Creatinine.

For some of the variables, the risk of death was an increasing or decreasing function of the variable. Examples where the curve fitting 30 day in-hospital mortality was an increasing function of the variable were Age, Bilirubin, Ventilation Rate and FiO2.

Examples where the curve fitting in-hospital death was a decreasing function of the variable included Arterial Blood pH and the measures of patient level of consciousness and arousal (the Glasgow Coma Score for eye, verbal and motor response to stimulus).



Worst Bilirubin

Worst Bilirubin illustrates a unimodal, skewed distribution over a large range. Figure 7.5 shows the

histogram of Worst Bilirubin and Figure 7.6 shows increasing in-hospital death with increasing value of the

logarithm (base 10) of Bilirubin.

Figure 7.5 Histogram of Worst Bilirubin
(x-axis: Worst Bilirubin (mg/dl); y-axis: number of observations; plot not reproduced in this extraction)

Increasing blood bilirubin concentrations were associated with increasing 30 day in-hospital death.

Figure 7.6 Scatterplot of Log(10) Bilirubin and 30 Day In-Hospital Mortality
(x-axis: log10 Bilirubin; y-axis: 30 day in-hospital mortality outcome; plot not reproduced in this extraction)

Binary variables, Chronic Health Score and the Diagnostic Category were not transformed. No cases were

removed as outliers.

Table 7.1 shows the list of variables, descriptions, transformations and pre-processing employed to yield

the standardised and scaled values for model building. Each variable is numbered according to the fields in

the dataset. The name of the variable is based on the APACHE III database name, and the description of the

variable provides information about how the variable was collected and coded, where necessary. The type

of variable is binary, discrete or continuous. Where the pre-processing is described as "standardised", its treatment follows the methods detailed in Sections 7.2.1 and 7.2.2. Logarithmic transformations are noted

when used. The notes provide information about the values associated with the lowest risk of death, and the

nature of the relationship between each variable and the patient outcomes. All the features described below

were used in model building. There was no further processing of features.


[Table 7.1: Variables, descriptions, types, transformations and pre-processing used to yield the standardised and scaled values for model building. The table content is not legibly recoverable from the scanned original.]

7.3 Software

For the experiments described in this chapter, data were pre-processed in Statistica 6.0^195. The SVM programme was SVMlight Version 5.00 (3 July 2002), written by T. Joachims. It can be downloaded, and documentation is available, at http://svmlight.joachims.org. I have adapted the interface programming in Matlab 6.5 from an interface for SVMlight 4.0 by A. Schwaighofer at http://www.cis.tugraz.at/igi/aschwaig/software.html. The model output and performance were analysed in Matlab 6.5.

7.4 SVM Parameter Choice Guided by Model Attributes:


Discrimination and Calibration.

Any new model to predict the risk of 30 day in-hospital mortality must exhibit acceptable discrimination

and calibration if it is to be used as a risk adjustment tool. The issues of model assessment were discussed

in Chapter 2. Therefore, in this experiment, models were developed, assessed and compared by

performance on a test dataset according to the average area under the ROC curve (discrimination), and the

average H-L C statistic (calibration).

As before, adequate discrimination was defined as an ROC curve area of > 0.8 and acceptable calibration

was defined as an H-L C statistic of 15.5 or less on test data of 400 cases.
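The two assessment criteria can be computed as sketched below. This is a generic illustration, not the thesis's own code: the area under the ROC curve via the Mann-Whitney rank-sum identity, and the Hosmer-Lemeshow C statistic on risk-ordered groups.

```python
import numpy as np

def roc_area(y, p):
    """Area under the ROC curve via the Mann-Whitney rank-sum identity
    (ties in p are broken arbitrarily here)."""
    y, p = np.asarray(y), np.asarray(p)
    ranks = np.empty(len(p))
    ranks[np.argsort(p)] = np.arange(1, len(p) + 1)
    n1, n0 = (y == 1).sum(), (y == 0).sum()
    return (ranks[y == 1].sum() - n1 * (n1 + 1) / 2) / (n1 * n0)

def hl_c(y, p, groups=10):
    """Hosmer-Lemeshow C statistic: cases sorted by predicted risk and
    split into (by default) deciles of risk. Assumes every group has a
    mean predicted risk strictly between 0 and 1."""
    y, p = np.asarray(y, float), np.asarray(p, float)
    stat = 0.0
    for idx in np.array_split(np.argsort(p), groups):
        n, observed, expected = len(idx), y[idx].sum(), p[idx].sum()
        mean_risk = expected / n
        stat += (observed - expected) ** 2 / (n * mean_risk * (1 - mean_risk))
    return stat
```

A candidate model would then be accepted only if roc_area is at least 0.8 and hl_c is 15.5 or less on the 400-case test set.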

7.5 Choice of SVM Kernel and Parameters.

To train a SVM model, the choice of a kernel function and of the SVM parameters ε and C for that training application is required. A heuristic method was developed that searches two-dimensional parameter space and is shown to provide a solution close to the optimal parameter choice.


176

The first step is an initial investigation to gain an understanding of the properties of the search space, and the second is a more detailed study of the area near the optimum. In the first step, both the RBF kernel SVM and the polynomial kernel SVM were investigated, and the calibration and discrimination of the models were assessed.

7.5.1 Estimation of Parameters for RBF SVM

The RBF kernel function that was used to implement the dot product of the mapping function in Equation 6.3 was

K(x_i, x_j) = exp(−γ ||x_i − x_j||²)

where x_i and x_j are input patterns, and ||x_i − x_j|| is the 2-norm of the difference between these vectors. The RBF parameter γ was varied.

In the absence of any a priori knowledge of the best RBF SVM parameter values, the parameters were initially explored over large ranges: γ: 0 - 2, ε: 0 - 2 and C: 0 - 4.

The first to be studied was the RBF kernel parameter γ, in the range 0 - 2. The default parameter values of SVMlight were used, with ε set at 0.01 and SVM parameter C at the default value:

C_default = 1 / mean(||x_i||²), where x_i is the input vector and ||x_i|| = √(x_i · x_i)
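As a concrete illustration, the RBF kernel and SVMlight's documented default for C (the reciprocal of the mean squared norm of the training vectors) can be sketched as below; the function names are illustrative only.

```python
import numpy as np

def rbf_kernel(xi, xj, gamma):
    """K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)."""
    d = np.asarray(xi, float) - np.asarray(xj, float)
    return float(np.exp(-gamma * np.dot(d, d)))

def default_c(X):
    """SVMlight's documented default C = 1 / mean(||x_i||^2)
    over the training vectors."""
    X = np.asarray(X, float)
    return 1.0 / np.mean(np.sum(X * X, axis=1))
```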

The most promising RBF kernel was taken forward and then ε was explored. C was still set at the default value.

When a value for ε was set, the best model over a range of values for C was determined. Multiple sampling and training runs were done, with 50 trials at each point using 800 randomly selected cases for training and 400 randomly selected cases for model testing.
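The repeated sampling procedure can be sketched generically as below; train_fn and score_fn are placeholders standing in for the SVM training and assessment steps, not actual thesis code.

```python
import numpy as np

def repeated_subsampling(X, y, train_fn, score_fn,
                         n_trials=50, n_train=800, n_test=400, seed=0):
    """Average test-set score over n_trials random splits: n_train cases
    to train on, a disjoint n_test cases to assess on, as in the text."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_trials):
        idx = rng.permutation(len(y))
        train, test = idx[:n_train], idx[n_train:n_train + n_test]
        model = train_fn(X[train], y[train])
        scores.append(score_fn(model, X[test], y[test]))
    return float(np.mean(scores))
```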



Discrimination, measured by the model with the largest area under the ROC curve on the test set, and calibration, measured by the lowest H-L C statistic on the test set, were used to determine which parameters were chosen at each stage.

Figure 7.7 shows the relationship between RBF γ in the range 0 - 2 and the average test set area under the ROC curve. The best area under the ROC curve was 0.84 when RBF γ = 0.03. As with the surface plots of the areas under the ROC curves, the irregularity and variation in the plot contour are due to the effects of sampling of the model training and test datasets.

Figure 7.7 Area Under the ROC Curve for SVM RBF Kernel γ in the range 0 - 1
average of 50 trials at each point, 800 cases training set, 400 cases test set
ε = 0.01, C = default
(x-axis: RBF parameter γ; plot not reproduced in this extraction)

Figure 7.8 shows the response of model calibration to changes in RBF γ with ε = 0.01 and SVM parameter C set at the default value. The best average H-L C statistic (24.2) was also achieved at RBF γ = 0.03. Note that at larger γ values (> 0.8), the H-L C statistic again begins to decrease, but in this range the SVMs have poor discrimination on the test set (ROC < 0.7), which precludes their use.

Figure 7.8 H-L C Statistic for SVM RBF Kernel γ in the range 0 - 1
average of 50 trials at each point
800 cases training set, 400 cases test set
ε = 0.01, C = default
(x-axis: RBF parameter γ; y-axis: H-L C statistic; plot not reproduced in this extraction)

This reinforces the importance of examining both discrimination and calibration. This graph demonstrates that the model calibration response has more than one minimum of the H-L C statistic, and thus more than one region of apparently adequate calibration. However, one of these occurs with models that have unacceptably low discrimination.

The RBF parameter γ was fixed at 0.03, and the area under the ROC curve for ε between 0 and 2, with C set at the default value, was then explored. Figure 7.9 shows a part of that trial with values of ε in the range 0.1 - 0.5, demonstrating that the response of the area under the ROC curve is relatively flat between 0.1 and 0.4, dropping off sharply at around ε = 0.48. For ε > 0.5, model predictions were no better than random, and some models gave the same estimate to all data points.

Figure 7.9 Area under the ROC Curve for SVM RBF Kernel with ε in the range of 0 - 0.5
average of 50 trials at each point, 800 cases training set, 400 cases test set
γ = 0.03, C = default
(x-axis: ε; y-axis: area under the ROC curve; plot not reproduced in this extraction)

Figure 7.10 shows the H-L C statistic across a range of ε with γ = 0.03 and SVM parameter C = default value. The best calibration was at ε = 0.057, where the H-L C statistic was 17.0. Inspection of the model outputs at ε < 0.03 revealed a tendency toward separation of the output values, which clustered around 0 or around 1. With ε < 0.03, Figure 7.9 shows that discrimination was retained, but Figure 7.10 demonstrates that model outputs did not accurately reflect the risk of death.

Figure 7.10 H-L C Statistic for SVM RBF Kernel with ε in the range 0 - 0.6
average of 50 trials at each point, 800 cases training set, 400 cases test set
γ = 0.03, C = default
(plot not reproduced in this extraction)

The SVM parameter ε was set at 0.057, the RBF γ set at 0.03, and a range of C was trialled. Figure 7.11 shows the response of the area under the ROC curve to changes in C in the range 0.1 - 2. The plot response was fairly flat. There is some variation in area under the ROC curve, which is due to random choice of model training and assessment datasets.

Figure 7.11 ROC Curve Area for SVM RBF Kernel with range of C 0.1 - 2
average of 50 trials at each point, 800 cases training set, 400 cases test set
γ = 0.03, ε = 0.057
(x-axis: SVM parameter C; y-axis: area under the ROC curve; plot not reproduced in this extraction)

Figure 7.12 plots the average H-L C statistic against a range of SVM parameter C with RBF kernel, γ = 0.03 and ε = 0.057. The best H-L C values (< 20) appear with SVM parameter C in the range 0.5 - 1. These values of C gave an average area under the ROC curve ranging between 0.826 - 0.836, and average H-L C values were 16.7 - 20, with the exception of C = 0.9, where the average H-L C statistic was 715. Inspection of individual model test set performances in the range of parameter C 0.8 - 1.5 revealed that many, but not all, SVM performed quite well. However, some combinations of the test data, the model and the validation data had H-L C statistics that were high and increased the average value.

Figure 7.12 H-L C Statistic for SVM RBF Kernel with range of C 0.1 - 2
average of 50 trials at each point, 800 cases training set, 400 cases test set
γ = 0.03, ε = 0.057
(x-axis: SVM parameter C; y-axis: H-L C statistic, log scale; plot not reproduced in this extraction)

7.5.2 Estimation of Parameters with Polynomial Kernel SVM

The polynomial kernel function that was used to implement the dot product of the mapping functions in Equation 6.3 was

K(x_i, x_j) = (x_i · x_j)^d, where d, the degree of the polynomial, was varied.

Trials of the polynomial kernels were conducted on polynomial functions with d = 1 - 4. As with the experiment in the previous chapter, the SVMlight algorithm failed to converge to a solution for the 4th order polynomial kernel SVM, and it was not studied further.

Figure 7.13 shows the average areas under the ROC curves for the polynomial functions with ε in the range 0 - 0.6, with C at the default value. The polynomial kernels (d = 1 and d = 2) displayed adequate discrimination in the range ε < 0.4. The best SVM discrimination was with the polynomial kernel d = 1 (ROC curve area 0.88) at ε = 0.003.

Figure 7.13 Area under the ROC Curve for SVM Polynomial Kernels d = 1 - 3 for ε 0 - 0.6
average of 50 trials at each point, 800 cases training set, 400 cases test set
C = default
(series: d = 1, d = 2, d = 3; plot not reproduced in this extraction)

Figure 7.14 shows a plot of the average H-L C statistic over a range of ε 0 - 0.5, with the SVM parameter C at the default value. The calibration was best at polynomial kernel d = 2 with ε about 0.1 - 0.2, giving H-L C statistics of 41.6 - 50.6.

Figure 7.14: H-L C Statistic for SVM Polynomial Kernels d = 1 - 3 for ε in the Range 0 - 0.5
average of 50 trials at each point, 800 cases training set, 400 cases test set
C = default
(series: d = 1, d = 2, d = 3; y-axis: H-L C statistic, log scale; plot not reproduced in this extraction)

To optimise C, the SVM polynomial (d = 2) kernel and ε = 0.2 were set. The range C 0 - 2 was studied. The best performance was at C = 0. A portion of the chart is shown in Figure 7.15. The best area under the ROC curve was 0.841 at C = 0, with decreasing area under the ROC curve as the SVM parameter C was increased.

Figure 7.15 Area Under the ROC Curve for SVM Polynomial Kernel for C: 0 - 0.002
average of 50 trials at each point
800 cases training set, 400 cases test set
d = 2, ε = 0.2
(x-axis: SVM parameter C; y-axis: area under the ROC curve; plot not reproduced in this extraction)

Figure 7.16 shows the calibration of the 2nd order polynomial kernel SVM over a range of SVM parameter C. The best H-L C statistic value for this range was 51, at C = 0.

Figure 7.16 H-L C Statistic for SVM Polynomial Kernel for C 0 - 0.002
average of 50 trials at each point, 800 cases training set, 400 cases test set
d = 2, ε = 0.2
(x-axis: SVM parameter C; y-axis: H-L C statistic, log scale; plot not reproduced in this extraction)

No satisfactory combination of parameters for the SVM polynomial kernel was found to be close to both the desired discrimination and calibration targets.

7.5.3 Summary of the Approximation of SVM Kernel and Parameter Selection.

The best discrimination found with the RBF kernel was at γ = 0.03, ε = 0.057, C = 0.6, which displayed an average area under the ROC curve of 0.83 and an average H-L C of 16.7. In comparison, the best polynomial kernel was a 2nd order polynomial kernel, ε = 0.2, C = 0, which had an area under the ROC curve of 0.84 and an average H-L C statistic of 51. The discrimination of both the RBF and the polynomial kernel SVMs was adequate. The average calibration of the RBF kernel SVM was better, and the parameters around this approximation were then investigated more intensively.

These results are in contrast to those of the previous chapter, where the linear polynomial kernel had the best discrimination on the regression task (0.86 - 0.87), while the discrimination of the RBF kernel SVM was less than 0.80 at all parameter values examined. The better performance with the RBF kernel could be because of the differences between the variables used for modelling in the two experiments. In the regression experiment of Chapter 6, four heavily pre-processed variables were used. In the regression experiment in this chapter, the non-binary patient observations, measurements and demographic information were pre-processed so that a central value of zero was associated with a minimum or maximum risk of death for each non-binary variable. This transformation may be better suited to the RBF kernel. Alternatively, the investigation of changes in the parameter C and finer tuning of model parameters may have revealed the potential performance of the RBF kernel SVM.

7.5.4 Investigation of Values around Estimated SVM Parameters

The approximation of SVM parameters provided a guide to the values of SVM parameters to be investigated more intensely. The RBF kernel SVM was the most promising, and further investigation of the best parameters and performance was undertaken close to the approximate values found in the previous section. The parameter ranges chosen for more detailed study were: γ: 0.01 - 0.1, ε: 0 - 0.2 and C: 0.4 - 4.0.

Within these ranges, the RBF kernel parameter γ and the SVM parameters ε and C were varied, and the average H-L C statistic and the area under the ROC curve were calculated for 50 trials at each value. Each trial used a training set of 800 randomly chosen patients, and the model performance was assessed on a validation set of 400 randomly chosen patients. Three dimensional plots of the response of the average H-L C statistics and areas under the ROC curves were produced (Figure 7.17 and Figure 7.18).

Figure 7.17: Surface Plot of Average H-L C Statistic for RBF SVM γ 0.01 - 0.10

The surface plot tones correspond to the average H-L C statistic:
Dark blue: H-L C statistic < 15.5
Light blue, green, yellow: 15.5 - 20
Red: > 20

(Panels for γ = 0.02, 0.03, 0.04, 0.05, 0.06, 0.07; axes: ε and C; plots not reproduced in this extraction)

Figure 7.18: Surface Plot of Average Area under the ROC Curve for RBF SVM γ 0.01 - 0.10

The surface plot tones correspond to the average area under the ROC curve:
Dark blue: > 0.82
Light blue, green, yellow: 0.80 - 0.82
Red: < 0.80

(Panels for γ = 0.02, 0.03, 0.04; axes: ε and C; plots not reproduced in this extraction)

Figure 7.17 is a plot of the average H-L C statistic for 50 RBF kernel SVMs over a range of the SVM parameters ε and C. The areas of reasonable average calibration (H-L C < 20) are shown in regions of lighter blue, yellow and green. The best calibration is coloured dark blue (H-L C < 15.5), and this is the area where the RBF SVM meet the proposed calibration standards on the test set.

The surface plots are dramatic. Where γ = 0.03, there is a trough of parameter settings (RBF γ = 0.03 with C 0.6 - 1.5 and ε 0.04 - 0.08) where the H-L C was 15.5 - 20, and only one value (RBF γ = 0.03 with C 0.6 and ε 0.05; H-L C = 15.4) where the H-L C was < 15.5. This was the approximate area suggested by the initial exploration of SVM parameters. This optimum parameter choice, and the near-optimum areas, are at the foot of a slope of falling H-L C (improving calibration) as the value of ε is reduced. This area is near a dramatically steep deterioration of calibration if ε is reduced below 0.03.

Where γ = 0.04, there is a broader and slightly deeper trough in the surface plot of the H-L C statistic. The better performances are found at RBF γ = 0.04 with C 0.7 - 3 and ε 0.02 - 0.04, where most of the H-L C values are in the range of 15.5 - 20. The best model calibrations are found at ε = 0.03 with parameter C = 1.2 (H-L C 15.2), 1.3 (H-L C 14.8), 1.7 (H-L C 15.2), 1.9 (H-L C 15.2), 2.1 (H-L C 13.3) and 3.5 (H-L C 15.2). Again this trough in the surface is at the bottom of a slope of falling H-L statistics as ε is reduced. This zone is again at the foot of a steep deterioration in calibration if ε is reduced below 0.02.

Where γ = 0.05, there is also an area of reasonable calibration, with the H-L C statistic 15.5 - 20, at RBF γ = 0.05 with C 0.8 - 4 and ε 0.01 - 0.04. There are, however, no values in this plot where the H-L C statistic is less than 15.5. This area is a trough at the lower end of a slope of decreasing H-L C values with decreasing ε.

To decide on the optimum choice of parameters for the RBF SVM, comparison must be made with the surface plot of the average area under the ROC curve. Figure 7.18 shows the areas under the ROC curves for RBF kernel SVMs. Each plot is at a single value of the RBF kernel parameter γ 0.01 - 0.10, and each value is the mean of the area under the ROC curve of 50 models over a range of the SVM parameters ε and C. The colours of the surface plots correspond to the average of the areas under the ROC curves, with dark blue representing the most desirable average area under the ROC curve of > 0.82. Light blue, green and yellow are adequate areas under the ROC curve of 0.80 - 0.82. Red colours the areas under the ROC curve of < 0.80, which indicates unacceptable discrimination.

The plots in Figure 7.18 have irregular surface variations with local minima, due to the effects of random sampling with different model training and testing datasets. Overall, these 3 dimensional plots show that the discrimination is generally better for SVM with γ between 0.01 and 0.04 than for γ > 0.04, and that discrimination is generally better for smaller values of parameter C for any choice of γ and ε. This is in contrast to the important relationships in the calibration surface plots in Figure 7.17, where the changes in ε had the most dramatic effect on the values of the H-L statistic.

In Figure 7.18, the response is relatively flat for the surface plots for each of the RBF γ. In general, smaller values of C and smaller values of ε give the best discrimination at any RBF kernel γ value. A SVM with an RBF kernel with γ in the range 0.01 - 0.04 will provide adequate discrimination with area under the ROC curve > 0.80 across the range of ε and C. The best areas under the ROC curve, greater than 0.83, were found in the following zones:

RBF γ = 0.02 with C 0.4 - 1.0 and ε 0 - 0.2, and

RBF γ = 0.03 with C 0.4 - 0.8 and ε 0 - 0.16, and

RBF γ = 0.04 with C 0.4 - 0.6 and ε 0 - 0.16.

The largest average area under the ROC curve was 0.833, found at RBF γ = 0.02 with C = 0.6, ε = 0 and at C = 0.5, ε = 0.

Therefore, from the comparison of the 3 dimensional plots, the parameter settings that are most likely to give adequate models on this dataset are RBF γ = 0.03 with C = 0.6 and ε = 0.05, and RBF γ = 0.04 with C 1.0 - 1.9 and ε = 0.03. At these parameter settings, I expect that for samples of 800 cases, a RBF SVM will be trained that, when tested on a random sample of 400 cases, will have an area under the ROC curve of 0.82 - 0.83 and a H-L C statistic of < 15.5. A model with this performance would be suitable for use as a risk adjustment tool for control charting.

The calibration response was particularly sensitive to changes in ε and γ. In contrast, most parameter choices gave acceptable discrimination, but generally, smaller values of C gave slightly better discrimination than larger values.

The initial heuristic was able to provide a good estimate of the optimal parameter choices for the RBF SVM and allowed limitation of the more intensive investigation of SVM parameters to an area with acceptable discrimination, initially in the parameter ranges γ 0.01 - 0.10, ε 0 - 0.2 and C 0 - 2. The range of C that was investigated was extended, as the area of calibration that was almost acceptable (H-L C 15.5 - 20) extended beyond C = 2. At C = 4 the discrimination was beginning to drop off, below an area under the ROC curve of 0.80.

The experiment in Figures 7.17 and 7.18 required 154,000 RBF SVM to be trained and analysed. Each model, including sample selection, model building and analysis, took about 5 seconds to complete, and so the experiment took 9 days to run on a single 2.7 GHz processor. If the intensive study had been conducted with the same detail over the original ranges of the parameter estimates (γ 0.01 - 2, ε 0 - 2 and C 0 - 4), the experiment would have been intractable.

The shape of the surface plots of the area under the ROC curve and the H-L C statistic in response to

changes in the SVM parameters allowed the estimation of approximate SVM parameters to be reasonably

accurate. This estimation involved a single parameter being studied with the other parameters fixed. This

approach works in this application because of the shape of the error surfaces and because two model

characteristics were used to guide the parameter selection.
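The point that both model attributes must be checked together can be made explicit in code; this is an illustrative sketch, with the threshold values taken from the acceptance criteria proposed earlier in the chapter.

```python
def acceptable(roc_area, hl_c, roc_min=0.80, hl_max=15.5):
    """A parameter setting is retained only if discrimination AND
    calibration are both adequate on the test set. A large RBF gamma,
    for example, can give a good-looking H-L C statistic while the
    discrimination is near random, and must be rejected."""
    return roc_area >= roc_min and hl_c <= hl_max
```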

Using only a single model performance attribute, for example the H-L C statistic, would potentially lead to errors. The presence of the convex response seen in Figure 7.8, where the range of RBF γ was studied, may have led to the choice of an inappropriately large value for γ if the area under the ROC curve had not also been used. At low RBF γ values the areas under the ROC curve were > 0.82. At large RBF γ, the discrimination was no better than random, yet the H-L C statistic suggested adequate calibration of the model.

7.6 Discussion

These experiments demonstrated that the probability of ICU patient death in-hospital within 30 days of admission can be modelled with a SVM, using only the patient data available within the first 24 hours of ICU admission. The approach to optimisation of the SVM using the discrimination and calibration to guide the parameter choice was successful in identifying parameters to build models of adequate performance.

The approximation of SVM parameters using an initial search heuristic provided an estimate of the best RBF parameters. This strategy reduced the computing time that would have been required. It is not certain that this approach will work in all applications of regression SVM. However, in this task, to estimate the probability of ICU patient death, the simple exploration of parameter space efficiently localised an approximate area of optimal parameter choice. This was possible because of the shape of the surfaces of the calibration and discrimination curves seen in Figures 7.17 and 7.18.

The results of the SVM models built in this experiment are equivalent to previous studies with ANN and

logistic regression to estimate the probability of patient death. The calibration and discrimination of the

RBF SVM on the test data is similar to the results of models described by Clermont et al.^^^, using logistic

regression and ANNs on a similar set of ICU patient data. Modelling data from 800 cases, Clermont

described areas under the ROC curves for the logistic regression (0.829) and ANN (0.810) which is

comparable to that found with the RBF SVM (0.82 - 0.83). The SVM assessment was the average of 50

trials of randomly selected sample sets in contrast to the single, non-random sample of Clermont et al.

Therefore, the RBF SVM discrimination performance is at least equivalent to their study. As 50 trials were

conducted on random samples, the choice of RBF SVM parameters is very likely to consistently produce

SVM models that have adequate performance on the PAH ICU dataset.
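The resampled discrimination assessment can be sketched in Python. The AUC here is computed with the Mann-Whitney statistic, which is equivalent to the area under the ROC curve; `fit_predict` is a hypothetical stand-in for training and scoring an SVM:

```python
import random

def roc_auc(y_true, scores):
    """Area under the ROC curve via the Mann-Whitney U statistic."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_auc(dataset, fit_predict, n_trials=50, test_size=400, seed=0):
    """Average AUC over repeated random train/test splits, as used to
    assess the SVM models; dataset is a list of (outcome, features)
    pairs and fit_predict is a hypothetical train-and-score routine."""
    rng = random.Random(seed)
    aucs = []
    for _ in range(n_trials):
        data = dataset[:]
        rng.shuffle(data)
        test, train = data[:test_size], data[test_size:]
        scores = fit_predict(train, test)
        aucs.append(roc_auc([y for y, _ in test], scores))
    return sum(aucs) / len(aucs)
```

Averaging over 50 random splits, as in the text, gives a more stable estimate of discrimination than a single held-out sample.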

However, the APACHE III estimates of probability of death still offer superior discrimination, with area

under the ROC curve of 0.89 on the PAH ICU dataset. The APACHE III model was developed on 17 440

patient admissions ^. It is possible that with such a large dataset, similar levels of discrimination could be

achieved with SVM modelling. However, very large datasets are impractical for local, single institution

model development.

The calibration of the best RBF SVM was good, with an average H-L C of less than 15.5. Caution must be

taken with any comparisons using the H-L statistic, particularly as the mortality endpoints differ and the

size of the validation set of Clermont et al. is 447 cases against 400 cases in the present SVM experiment.

However, the SVM probably has superior calibration to the logistic regression model of Clermont et al. (H-L C = 45.9) developed with data from 800 patients. The best SVM is, on average, equivalent to the

calibration of their best ANN (H-L C = 17.3).
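A minimal sketch of the H-L C statistic on deciles of predicted risk (the standard construction; the exact grouping used in the cited studies may differ):

```python
def hosmer_lemeshow_c(y, p, groups=10):
    """H-L C statistic: sort patients by predicted risk, split them into
    `groups` equal-sized risk bands (deciles by default), then sum the
    squared observed-minus-expected deaths scaled by binomial variance."""
    pairs = sorted(zip(p, y))                   # order by predicted risk
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        band = pairs[g * n // groups:(g + 1) * n // groups]
        if not band:
            continue
        obs = sum(yi for _, yi in band)         # observed deaths
        exp = sum(pi for pi, _ in band)         # expected deaths
        m = len(band)
        pbar = exp / m
        if 0.0 < pbar < 1.0:
            stat += (obs - exp) ** 2 / (m * pbar * (1.0 - pbar))
    return stat
```

When observed deaths match the model's expected deaths within every risk band, the statistic is near zero; large values signal poor calibration.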



These results indicate that using RBF SVM to model patient data offers a practical and reproducible

alternative for modelling the probability of ICU patient death within 30 days of admission.

However, the SVM model using the component patient data in this experiment performed less well than

models using the pre-processed data such as APACHE III scores of the previous chapter. The area under

the ROC curve of the APACHE III score alone on the PAH ICU dataset is 0.837, yet the largest ROC curve

area found with the RBF SVM was 0.833. In the previous chapter, using the variables of APACHE III score, Age, Chronic Health Score and the Diagnostic Code, the average area under the ROC curve for both the polynomial SVM and the multi-layer perceptron ANN was 0.86. Similarly, using the same variables on

the same dataset, the logistic regression models developed by Graham and Cook '*"* also displayed ROC

curve areas on the validation set of 0.86. The models of Clermont et al. from a similar ICU incorporating

the APACHE III score had ROC curve areas up to 0.84. It may be, therefore, that future improved SVM

models should include the APACHE III score, even though this means that such models are still reliant on

the APACHE III system.

There must be an upper limit to the performance of a model that estimates probability of death using

information from the first 24 hours in ICU. Much potentially important explanatory information is not

available. Examples relevant to this dataset would be detail of the success or otherwise of surgery.

Important episodes that occurred in the lead time prior to ICU admission are not captured, neither are the

events, complications and progress after the first day. There will always be uncertainty in models using

patient data from the first day. It is this uncertainty, which represents random events and effects but also embraces the quality of patient care, that is the reason for monitoring risk adjusted mortality rates.

The choice of kernel and parameters is a problem to be solved for each set of data '*^. In the SVM, the kernel function and the parameters ε and C are chosen or varied to provide optimal regression

performance. For regression, the only reliable method that exists for all kernels and datasets is the use of a

cross validation set, as used in these chapters. In this application, optimisation was guided by multiple trials

of samples of training and testing sets. By using re-sampling and many training examples, SVM parameter

values were identified that are most likely to work well in any sub-sample of the patient data. The average

measurement of 50 samples at each data point demonstrated the reproducibility of the model performances

at parameter settings.

The choice of ε, which sets the regression tube width, affects the approximation error, the number of support vectors, the training time and the complexity of the solution "'^. A training example will be a support vector only if its approximation error is larger than ε. A large ε will give few support vectors and rapid training, but poor accuracy, as the regression tube is very wide. A small ε defines a narrow regression tube, with a large number of support vectors and a complex model with a long training time. Mattera and Haykin suggested a robust approximation method for ε: choose a value so that approximately 50% of the examples in the training set are support vectors. This is a compromise between complexity and maintaining a small approximation error. An alternative suggestion by Weston et al.' '* to set ε = 0 and rely on adjustments to C to control the generalisation performance takes no account of training time or complexity considerations.

For the PAH ICU patient data, the noise and the desired accuracy were not known. The values of ε were chosen experimentally to provide the best discrimination and calibration. The number of support vectors in the trained models, where the training set is 800 randomly selected cases, was on average 500, or 62.5%. The use of the ROC curve area and H-L C gave a similar number of support vectors to the approximation method of Mattera and Haykin '* .
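The 50% support-vector heuristic can be sketched as follows, assuming residuals from a trained regression are available; the quantile shortcut is my own illustration of the idea:

```python
def epsilon_for_sv_fraction(residuals, target_fraction=0.5):
    """Choose the regression tube width epsilon so that roughly
    `target_fraction` of training examples lie outside the tube and so
    become support vectors. The quantile shortcut is illustrative only."""
    r = sorted(abs(e) for e in residuals)
    # Examples with |residual| >= epsilon are support vectors, so take
    # epsilon at the (1 - target_fraction) quantile of |residual|.
    k = int(len(r) * (1.0 - target_fraction))
    return r[min(k, len(r) - 1)]

# Hypothetical residuals from a fitted model:
residuals = [-0.9, -0.7, -0.5, -0.3, 0.1, 0.2, 0.4, 0.6, 0.8, 1.0]
eps = epsilon_for_sv_fraction(residuals)          # -> 0.6
n_sv = sum(abs(e) >= eps for e in residuals)      # 5 of 10 examples
```

A wider tube (larger ε) would leave fewer residuals outside it, giving fewer support vectors and a simpler but less accurate model, as described above.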

The choice of C is also determined by experimentation "~v. A guideline proposed for choice in the RBF kernel SVM by Mattera and Haykin "* is that C should be of a similar size to the expected outputs of the SVM, which are estimates of probability and thus lie between 0 and 1. This is only a guide, and further work on optimisation is necessary. In this application, with the RBF γ set to 0.03 and ε set at 0.5 - 0.6, the best values for C were experimentally determined to be between 0.5 and 1.5. The value of 0.6 was chosen by the parameter estimation, and is consistent with the estimate suggested by Mattera and Haykin.

In this light, the proposals of Mattera and Haykin for estimation of C and ε can be used in conjunction with

the parameter estimation method used in this chapter, as a means of localising the parameter solution, and

minimising the computing time required.

7.7 Conclusion

The probability of ICU patient death in-hospital within 30 days of admission can be modelled with a SVM,

using only the patient data available within the first 24 hours of ICU admission. The approach to

optimisation of the SVM using the discrimination and calibration statistics to guide the parameter choice is

successful in identifying parameters to build models of adequate performance to use as risk adjustment

tools.

These models can be trained and cross-validated on the number of cases seen in one year at the PAH ICU.

Though the performance is less than that of the APACHE III models during 1995 - 1997, the SVM models

have the advantage of using a 30 day in-hospital mortality endpoint, having the flexibility to be remodelled

when the model no longer fits and not incurring an annual licence fee.

Chapter 8

Summary, Conclusions and Future Work

8.1 Summary
The aim of this work was to develop risk adjusted control chart methods for monitoring in-hospital mortality outcomes in Intensive Care. It is a medical application of statistics and machine learning to the

measurement of outcomes for quality management.

The methods of assessment of models that estimate the probability of death for a patient in the intensive

care unit (ICU) were reviewed. From this review, the important attributes of discrimination and calibration

were identified as the key measures of model performance. The performance of a model may deteriorate

when it is applied to patient data which are not part of the database from which the model was developed.

Therefore, a model such as APACHE III, which was developed on a large North American ICU population, must be validated in the Australian ICU setting before any conclusions can be drawn about the reliability

of its mortality estimates in that particular population.

The area under the receiver operating characteristic (ROC) curve is the most useful approach to assessment

of model discrimination. From a review of the assessment of models for ICU patient mortality, a reasonable

expectation of the discrimination of ICU models is an area under the ROC curve greater than 0.80.

Calibration is more difficult to assess. To evaluate calibration, a calibration curve showing observed and

expected mortality in ranges of risk, and a statistical evaluation of calibration such as the Hosmer-

Lemeshow (H-L) statistic can be used.

These considerations were the basis for the evaluation of the APACHE III models at the Princess

Alexandra Hospital ICU for the period 1995 - 1997. The models provided by the APACHE III system to

predict the risk of death of an ICU patient in-ICU and in-hospital performed very well. For all models, the

discrimination was excellent, with the area under the ROC curve being 0.90 - 0.92. The calibrations of the

in-ICU mortality models, and the in-hospital mortality model with proprietary adjustments for hospital

characteristics were very good. The APACHE III model with adjustments for hospital characteristics

provided the most accurate estimates of the probability of in-hospital death.

The initial control chart application was to in-hospital ICU patient mortality data without risk adjustment (RA). The p charts, cumulative sum (CUSUM) and exponentially weighted moving average (EWMA)

charts were used to analyse the in-hospital mortality rate. The years 1995 - 1997 were used as an in-control

period to establish the control mortality rate, and charting was carried on through 1998 - 1999.
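For the unadjusted charts, the binomial control limits of a p chart can be sketched as follows; the three-sigma width is a conventional choice used for illustration, not necessarily the limits designed in this thesis:

```python
import math

def p_chart_limits(p_bar, n, z=3.0):
    """Control limits for a p chart of mortality proportions, using the
    binomial standard error at the in-control rate p_bar."""
    se = math.sqrt(p_bar * (1.0 - p_bar) / n)
    lcl = max(0.0, p_bar - z * se)
    ucl = min(1.0, p_bar + z * se)
    return lcl, ucl

# In-control rate 0.16 (1995 - 1997), charted in blocks of 50 patients:
lcl, ucl = p_chart_limits(0.16, 50)
```

A monthly or 50-patient sample whose observed mortality rate falls outside these limits would be flagged for investigation.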

There was a significant fall in the mortality rate from the in-control rate of 0.16 to approximately 0.13. A

post hoc analysis suggested that a change in the overall severity of patient illness probably did not occur.

However, a change in casemix was demonstrated. There was an increase in elective surgery. This type of patient generally has a low risk of in-hospital death, and the change contributed to the decrease in in-hospital mortality rate of ICU patients.

The second control chart application developed the techniques of control chart analysis of in-hospital ICU patient mortality data with RA. The APACHE III model with proprietary adjustments for hospital characteristics provided accurate estimates of in-hospital death, and was incorporated into the p chart,

CUSUM and EWMA charts. The use of a RA model, such as APACHE III, improved the information

provided by the control charts. After adjusting for the severity of illness and casemix of patients, these

analyses demonstrated that patient survival had improved between 1995 and 1999. During the first 1 - 2

years, there were signals that the risk of patient death was higher than that predicted by the APACHE III

model. During the last year of the analysis, the observed patient mortality was lower than that predicted by

the APACHE III model. Possible explanations are that the RA monitoring has documented improvements

in patient outcomes or quality of care.
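The core of the risk adjusted comparison can be sketched as a simple observed-minus-expected cumulative sum in Python. This is an illustrative reduction of the idea only, not the actual RA p, CUSUM or EWMA designs developed in this thesis, and the inputs are hypothetical:

```python
def oe_cusum(outcomes, predicted_risks):
    """Risk adjusted observed-minus-expected cumulative sum: each patient
    contributes (death indicator - model-predicted risk of death), so a
    persistent upward drift flags more deaths than the RA model predicts,
    and a downward drift flags fewer. A sketch of the idea, not the
    Steiner likelihood-ratio CUSUM."""
    total, path = 0.0, []
    for died, risk in zip(outcomes, predicted_risks):
        total += died - risk
        path.append(total)
    return path

# Two deaths in four patients, against predicted risks summing to 1.5:
path = oe_cusum([1, 0, 0, 1], [0.5, 0.25, 0.25, 0.5])
```

Because each patient's contribution is centred on that patient's own predicted risk, casemix and severity of illness are accounted for before any signal is raised.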



If the observed mortality rate is consistently different to that predicted by the RA model, it means that the

RA model no longer provides an accurate assessment of patient risk of death. In this application, the

APACHE III model was consistently overestimating the probability of patient death by the end of 1999.

Therefore, I developed alternative tools to the APACHE III model to estimate the probability of patient deaths at the PAH ICU. Machine learning techniques, such as support vector machines (SVMs) and artificial neural networks (ANNs), were developed on the PAH ICU database, and thus provide a new RA

tool.

Several shortcomings of the APACHE III system were addressed in the new model development. An

endpoint of 30 day mortality was chosen, instead of in-hospital mortality used by APACHE III. Mortality

data could be analysed just 30 days after the episodes of patient care commenced, rather than waiting

months for all patient hospitalisations to be finalised.

The APACHE III algorithm functions as a "black box" without model updates. In contrast, the machine

learning models are intended to be updated when the model fit deteriorates. To this end, models were successfully and reliably developed on a training set of 800 cases and a test set of 400 cases. The data for model training and cross validation can be collected over approximately one year. This means that it is

practical for a single ICU to develop and maintain RA models to continuously monitor their clinical

outcomes.

The APACHE III software and model command an annual licence fee, whereas a locally developed RA

model does not have that additional cost.

In Chapter 6, the preliminary ANN and SVM models were developed using the APACHE III based

variables of acute physiological score, modified disease category and chronic health score, and the patient

age. The models demonstrated good discrimination, but very poor calibration. Subsequently, a multilayer perceptron ANN and a linear kernel SVM were both recalibrated with a general regression neural network.

The performances of these models were equivalent to previous ANN and logistic regression models on a

similar ICU application '^^ and to logistic regression on the PAH ICU database.

A further experiment was then conducted to model 30 day in-hospital mortality. The component variables

that describe patient physiology, demographics, laboratory results and diagnosis were used, rather than the

APACHE III variables. Standardisation and scaling of the raw patient data was performed using successful

data transformation strategies after considering other machine learning applications in the ICU. A radial

basis function (RBF) SVM was used to estimate the probability of 30-day in-hospital mortality. The SVM

models were developed, cross-validated and compared according to the average area under the ROC curve

(discrimination), and the average H-L C statistic (calibration).

A simple and efficient heuristic method to search for optimal SVM parameters was used. In this way, the optimal parameter choice was localised in parameter space, and this region was intensively explored for the best performing

SVM models. The average performance of the RBF SVM on random samples of the dataset was adequate

with area under the ROC curve of greater than 0.83 and an H-L C statistic of less than 15.5 on 400 patient

test sets.

8.2 Original Contributions

This thesis brings together several streams of research. It arose from clinical medicine, patient care and the

need to measure the quality of clinical care. To accomplish this, several novel contributions were made.

The most important contribution of this work is the application and development of RA control charts to

monitor ICU mortality rate. This paradigm of incorporating adjustments for casemix and severity of illness

has not been applied to the ICU before. It is an idea that has wide application to monitor the quality of the

care that we offer in many areas of the health care service. The prerequisites are the ability to accurately

measure patient outcomes, and to collect patient data to permit modelling estimates of the probability of

these outcomes.

Development of these RA charts involved the following novel contributions.

i. The assessment of the APACHE III models at the PAH was the first assessment of Australian

experience with APACHE III, and used the largest single institution series of patients outside of

the United States of America.

ii. The use of control charts to monitor in-hospital mortality of ICU patients is a logical extension of

industrial statistics, but has not been carried out elsewhere. This is the first use of the APACHE III

score as a validated RA tool to continuously monitor ICU outcome. The design and analysis of

each of the techniques I developed is described in the Appendices. This allows the methods to be adapted and modified according to different clinical requirements.

iii. The RA control chart work is original in its application and involves modifications of previous

methods. In addition, the RA EWMA chart is entirely new.

The RA p chart was developed from the method of Alemi et al.^^'^ and incorporates a method for

analysing the power of each sample, adapted from Flora ^''. The Z score p chart relies on the work of Flora ^ and Sherlaw-Johnson et al. '"'.

The use of the iterative method to characterise the distribution used for constructing p charts is

presented in Chapter 5.

The RA CUSUM is based on the work of Lovegrove et al. " and the moving frame approach is

adapted from the work on charting cardiac surgery by Poloniecki et al. "^. The use of the RA

CUSUM of Steiner and co-workers'^^'''*^''''^ is an original application in an ICU setting, though the

method, apart from the use of APACHE III, has not been modified.

Both the RA EWMA charts with the parametric approximation and the discrete approximations

are new work.

iv. The 30 day in-hospital mortality endpoint that was used in the machine learning is an original contribution, based on the specific requirements of RA control chart analysis of patient mortality.

v. This is the first application of SVM to estimate the probability of ICU patient death.

vi. A model that estimates the probability of patient death has attributes of discrimination and

calibration. These two performance measures were used to guide model development and SVM

parameter selection.

8.3 Future Research

During the course of this project, several areas were identified where further research is required.

Further study of the assessment of the performance of models that predict the probability of death of ICU

patients is required. At present, there is a good understanding that model reliability is reduced when the

model is applied to other ICU contexts different from those where it was developed. Issues of data

collection and rule interpretation, patient variables, clinical practice, admission and discharge practices,

case mix, lead-time, mortality rate and type of hospital have all been shown to affect the ability of models

to accurately estimate the probability of death of ICU patients. More work is required to understand the

effects of these and other factors.

The assessment of model calibration proved a difficult part of this project. The H-L C statistic was

determined to be the best technique available at the time at which this study was undertaken. Further work

may lead to an alternative technique to measure the calibration and reliability of probability estimates.

The RA charts described in this thesis are initial steps in what could be a much larger undertaking.

Potentially fruitful areas of research exist in additional analysis of the behaviour of the control charts under a larger range of plausible and important clinical situations. This thesis was limited to analysing the effects

of increases and decreases in mortality rates, and in increases and decreases in the odds ratio. From these

scenarios, control chart parameters and control limits were chosen, and charts were constructed. Many

other combinations of changed casemix, early discharge, and simulations of aspects of poor or improved

care should be explored.

The use of the central limit theorem to permit estimation of the expected distribution of mortality rates and

the EWMA statistics involves approximations. The errors are probably small compared to the imprecision

of the currently available RA tools for ICU. However, further work can be done using alternative methods

to characterise the distributions. Iterative methods, exact to the limits of the model predictions, are

presented in this work. Further research can be done to more efficiently calculate the distributions.

Examples are Monte Carlo simulations, or possibly analytical expressions for determining the probability

density function of the mortality rates, based on the individual patient's predicted risks of death.
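The suggested Monte Carlo approach can be sketched directly from the individual predicted risks; the block size and risks below are hypothetical:

```python
import random

def simulate_mortality_rates(risks, n_sims=10000, seed=1):
    """Monte Carlo estimate of the distribution of the observed mortality
    rate for a block of patients with individual predicted risks of
    death: each simulation draws one Bernoulli outcome per patient."""
    rng = random.Random(seed)
    rates = []
    for _ in range(n_sims):
        deaths = sum(rng.random() < p for p in risks)
        rates.append(deaths / len(risks))
    return rates

# Distribution of the rate for a block of 50 patients, each at risk 0.2:
rates = simulate_mortality_rates([0.2] * 50, n_sims=2000)
```

Control limits could then be read off as quantiles of the simulated rates, avoiding the normal approximation entirely.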

The latter part of this thesis described machine learning applications to estimate the probability of death of the ICU patients. SVMs have, to my knowledge, only been employed once previously in the ICU setting, and only then to predict physicians' clinical actions, rather than to estimate the probability of death. There is

considerable opportunity to investigate other applications in ICU and to further optimise the models to

predict patient death.

The choice of kernels and SVM tuning parameters provides an area for ongoing work. The approach

presented here is practical, as it uses the desirable performance characteristics of the potential RA tool to

guide optimisation of the SVM parameters. A simple search heuristic to identify the likely area to search

more intensively may not work in all applications. It is not a replacement for thorough parameter evaluation

and testing of model performance by cross validation. Genetic algorithms, for example, may offer a fruitful alternative method to efficiently localise optimal parameter settings.



In this application, the polynomial and RBF kernels were investigated, but there are many other kernels which can be applied and optimised on the data. SVMlight provides the opportunity for future investigation of linear kernels, tanh sigmoid kernels, a range of polynomial functions and a user defined kernel option ^198.

The recommendation of specific kernels for probability density estimation by Weston and co-workers points to a useful direction for future investigation.

The variable selection and processing used for the SVM model can be explored further. Though the models

gave acceptable performance, it is possible that more intensive evaluation may provide improvements.

All of the raw data components in the APACHE III algorithm were used for the SVM models in Chapter 7.

Any additional variables from the database that may have had some explanatory power were included. It is

possible that further additional variables could be included for future models. Variables describing organ

failure, additional laboratory results such as potassium, platelet count, lactate, and injury severity score,

therapeutic interventions, gender, marital status and ethnicity may improve prediction of outcomes.

Some of these variables could be improved by expert pre-processing. For example, there is a potential for

inclusion of alveolar-arterial oxygen gradient using the blood gas measurements to better estimate the

ability of the lungs to exchange oxygen. The diagnostic code used in this study could be revised now that

the APACHE III diagnostic weights are all publicly available "".

It is not certain whether using all of the patient variables that were available on the database maximised the

performance of the SVM models. Alternatively, the large number of variables could have increased the

complexity of the model and perhaps led to deterioration in model performance. Ideally, feature selection

should order the features by effectiveness, to provide the best combination of features and remove those of

negligible relevance. Such techniques as automatic relevance determination or alternative statistical

methods of selection ^'"^"^ could be pursued.



The data pre-processing was carried out to provide features with a similar range of values. The pre-processing was based on methods used in other machine learning applications. Where possible, each

feature had an identifiable minimum or maximum in relation to the risk of death associated with that

feature. The SVM was ineffective when used with raw data, and it is not clear which of the pre-processing

manipulations were beneficial, or indeed whether some of the pre-processing may have limited the model's

ultimate performance. Further experiments are necessary to explore the necessity and suitability of the pre-processing and transformations used in this experiment.
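The range standardisation described here can be sketched as a clipped min-max scaling; the clinical bounds in the example are hypothetical:

```python
def min_max_scale(column, lo=None, hi=None):
    """Scale one raw patient variable to [0, 1], the kind of range
    standardisation applied before SVM training. The optional lo/hi are
    clinically chosen bounds (hypothetical here) that cap extreme
    laboratory values instead of letting outliers set the range."""
    lo = min(column) if lo is None else lo
    hi = max(column) if hi is None else hi
    span = hi - lo if hi != lo else 1.0
    return [min(1.0, max(0.0, (x - lo) / span)) for x in column]

# Cap a laboratory value at an assumed clinical ceiling of 100:
scaled = min_max_scale([50, 100, 200], lo=0, hi=100)
```

Clipping at clinically motivated bounds keeps all features on a comparable scale, which matters for kernels such as the RBF that depend on distances between feature vectors.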

In summary, the conclusions of this work are that RA control charting offers an important adjunct to

current methods of assessment of ICU outcome to monitor the quality of care. SVMs provide a practical

approach to model the probability of in-hospital mortality of ICU patients for RA based on patient data

obtained in the first 24 hours. Their development can be effectively guided by optimisation of the attributes

of discrimination and calibration. Models can be reliably built and assessed on 1200 cases, and so provide a

RA model for a single ICU.



Appendix 1

Data Description of Admissions to Princess Alexandra


Hospital Intensive Care Unit: 1995 - 2000

The dataset for analysis was expanded from that presented in Chapter 3 (1 January 1995 to 31 December 1997)

with additional patients from 1 January 1998 - 31 December 1999. This extended dataset (1 January 1995 -

31 December 1999) is used in Chapters 4 to 7 for control chart analysis and modelling.

A1.1 Data Collection and Sample Summary

Patient eligibility and exclusions and the statistical analysis are the same as in Chapter 3. There were 5681

eligible episodes of ICU admission analysed. There were 5278 primary admissions and 403 (7.1 %)

readmissions. The demographic features of the patients are summarised in Table A1.1. The overall

mortality was 515 in-ICU and 779 in-hospital deaths from 5278 patient hospitalisations.

Table A1.1: Demographic features of the primary admissions to PAH ICU

1/1995 - 12/1999
Age in years: mean (sd) 53.0 (19.0)
Male: % 62.4
Surgical patients: number (% of admissions) 3154 (55.5)
Elective surgical patients: number (% of admissions) 2365 (41.6)
Emergency surgical patients: number (% of admissions) 789 (13.9)
Non-operative patients: number (% of admissions) 2527 (44.5)
APACHE III score: mean (sd, range) 47.5 (25.7, 0 - 187)
ICU mortality: % (number of deaths) 9.1 (515)
Hospital mortality: % (number of deaths) 14.8 (779)
ICU length of stay: mean in days (sd), median 2.9 (5.1), median = 1.0
Hospital length of stay: mean in days (sd), median 27.9 (46.6), median = 15.0

Abbreviations: sd, standard deviation

The following tables A1.2 and A1.3 present a summary of mortality rate data that were used for the charting and analysis in Chapter 4. Table A1.2 groups admissions by month and Table A1.3 groups patients in consecutive blocks of 50 patients.



Table A1.2: Admissions Grouped by Month of Admission to Princess Alexandra Hospital Intensive Care Unit 1995 - 1999

Month   Observed Mortality Rate   Number of Admissions
Jan-95  0.11  99
Feb-95  0.16  87
Mar-95  0.13  101
Apr-95  0.15  88
May-95  0.19  86
Jun-95  0.17  87
Jul-95  0.20  100
Aug-95  0.15  89
Sep-95  0.16  74
Oct-95  0.21  82
Nov-95  0.16  94
Dec-95  0.12  76
Jan-96  0.19  74
Feb-96  0.21  77
Mar-96  0.16  101
Apr-96  0.18  89
May-96  0.17  87
Jun-96  0.21  86
Jul-96  0.14  93
Aug-96  0.26  96
Sep-96  0.22  81
Oct-96  0.12  104
Nov-96  0.17  95
Dec-96  0.19  84
Jan-97  0.11  88
Feb-97  0.12  90
Mar-97  0.15  103
Apr-97  0.09  90
May-97  0.13  82
Jun-97  0.16  87
Jul-97  0.12  86
Aug-97  0.15  78
Sep-97  0.20  74
Oct-97  0.15  79
Nov-97  0.19  86
Dec-97  0.12  89
Jan-98  0.06  79
Feb-98  0.10  86
Mar-98  0.11  88
Apr-98  0.15  101
May-98  0.11  82
Jun-98  0.14  84
Jul-98  0.17  76
Aug-98  0.20  71
Sep-98  0.16  90
Oct-98  0.16  85
Nov-98  0.08  77
Dec-98  0.12  89
Jan-99  0.08  75
Feb-99  0.12  92
Mar-99  0.13  98
Apr-99  0.15  87
May-99  0.13  97
Jun-99  0.13  118
Jul-99  0.15  105
Aug-99  0.11  92
Sep-99  0.14  93
Oct-99  0.10  86
Nov-99  0.06  83
Dec-99  0.18  82

Table A1.3: Admissions Grouped into Ordered Blocks of 50 Admissions: 1995 - 1999

Block of Patients   Observed Deaths
50  6
100  4
150  8
200  8
250  6
300  7
350  9
400  7
450  10
500  8
550  8
600  7
650  14
700  8
750  5
800  10
850  12
900  8
950  9
1000  7
1050  5
1100  9
1150  10
1200  10
1250  11
1300  6
1350  8
1400  8
1450  8
1500  13
1550  7
1600  8
1650  8
1700  10
1750  14
1800  13
1850  10
1900  6
1950  6
2000  9
2050  7
2100  14
2150  6
2200  4
2250  9
2300  2
2350  8
2400  7
2450  5
2500  5
2550  8
2600  4
2650  8
2700  8
2750  5
2800  9
2850  6
2900  12
2950  8
3000
3050  10
3100  5
3150  6
3200  8
3250
3300  7
3350  6
3400  4
3450  10
3500  6
3550  6
3600  5
3650  6
3700  6
3750  12
3800  11
3850  5
3900  10
3950  7
4000  9
4050  4
4100  5
4150  7
4200  3
4250  6
4300  8
4350  5
4400  5
4450  5
4500  9
4550  7
4600  8
4650  6
4700  8
4750  4
4800  10
4850  9
4900  3
4950  9
5000  5
5050  5
5100  2
5150  3
5200  9
5250  8

Appendix 2
Analysis of Mortality Rate Observations PAH ICU: 1995 -
1997, and Estimation of In-Control Parameters.

The purpose of Appendix 2 is to present an analysis of the PAH ICU patient data from 1 January 1995 - 31

December 1997, including evaluating whether the process was in-control, and estimating the process parameters. The analysis was conducted according to the principles detailed by Kenett and Zacks "'^. This

analysis of the patient data examines whether the observed mortality rates are approximately normally

distributed, and whether the process is stationary (i.e. distribution of mortality rates does not change over

time).

A three-year period of observation from 1 January 1995 to 31 December 1997 is covered with this analysis.

This initial collection of data was available at the commencement of this project. The data were analysed in

three groupings: monthly groups, blocks of 50 consecutive cases, and blocks of 100 consecutive cases.

The term "mortality rate" is used for the proportions of deaths among blocks of consecutive cases, even

though the observation periods vary. An overview of the data is presented in Chapter 2, and Appendix 1

tabulates the monthly observations (Table A1.2) and blocks of 50 cases (Table A1.3) up to the end of 1999.

A2.1 Data and Methods

The analysis was conducted on 3 years of admissions to the PAH ICU. Only the first admission to ICU

during any hospitalisation was considered to avoid double counting. The outcome of interest was survival

to hospital discharge.

There were 3159 eligible admissions during this period. Data were grouped into months of varying sample

size (36 months), samples of 50 consecutive cases (63 blocks) and samples of 100 consecutive cases (31

blocks). The overall mean mortality rate, the mean mortality rate of each grouping, and the range and standard deviation of the observed mortality rates were calculated. A normal probability plot and the Shapiro-Wilk

W statistic were used to investigate whether the mortality rates were approximately normally distributed.

A comparison between the observed standard deviation of the mortality rates, and the standard deviation

estimated from the binomial distribution was made for the patient samples of 50 and 100 cases. The

standard deviation is:

σ_i = √( p̄(1 − p̄) / n_i )

where p̄ is the mean mortality rate, and n_i is the number of patients in sample i, either 50 or 100.
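This binomial estimate can be checked directly. The sketch below is in Python rather than the MATLAB used elsewhere in the thesis, and takes the mean rate p̄ = 0.161 from the results reported later in this appendix:

```python
import math

def binomial_sd(p_bar, n):
    """Standard deviation of a sample proportion under the binomial model."""
    return math.sqrt(p_bar * (1 - p_bar) / n)

# Mean mortality rate over 1995-1997, as reported in the results below
p_bar = 0.161

sd_50 = binomial_sd(p_bar, 50)    # blocks of 50 cases, approx 0.052
sd_100 = binomial_sd(p_bar, 100)  # blocks of 100 cases, approx 0.037
```

These values agree closely with the observed standard deviations of the block mortality rates quoted in the results.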

Runs Tests

A search for non-random patterns in the data was conducted using methods recommended by Kennett and Zacks; the derivation of these tests and the notation come from this reference.

Runs above or below the mean

The first of the Runs tests is a test of the null hypothesis that the monthly mortality rates fall randomly

either side of the overall mean mortality rate. Each sample mortality rate observation is allocated to a group

according to whether it is above or below the overall mortality rate. "A" identifies values equal to or above the mean, and "B" identifies values below the mean. A run is defined as a consecutive series of one or more mortality rates that have the same classification. The statistic, R, is the observed number of runs.

If R is too small, there is a high chance of "clustering" due to non-random distribution of mortality rates. Too many runs imply mixing or homogeneous distribution of the mortalities around the mean. A test of the null hypothesis of random distribution either side of the mean has rejection regions R < R_α or R > R_{1−α}.

With large samples such as used in this study (36 months), a normal approximation is used and

R_α = μ_R − z_{1−α} σ_R and R_{1−α} = μ_R + z_{1−α} σ_R

where

μ_R = 1 + 2 m_A m_B / n

σ_R² = 2 m_A m_B (2 m_A m_B − n) / ( n² (n − 1) )

α is the level of significance for rejection of the null hypothesis (in this application α = 0.025). z_{1−α} is the value of the standard normal distribution corresponding to α. m_A is the number of observations above (count of "A") and m_B is the number of observations below (count of "B") the mean mortality rate, and n = m_A + m_B.
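As a sketch of this test (z = 1.96 corresponds to α = 0.025; Python rather than the original software), the run count and rejection region can be computed directly from an "A"/"B" sequence:

```python
import math
from itertools import groupby

def runs_above_below(sequence, z=1.96):
    """Runs test for values above ('A') or below ('B') the mean.
    Returns the run count R, its mean and sd under randomness,
    and the two-sided rejection limits."""
    seq = [c for c in sequence if c in "AB"]
    n = len(seq)
    m_a = seq.count("A")
    m_b = seq.count("B")
    r = sum(1 for _ in groupby(seq))  # number of runs
    mu = 1 + 2 * m_a * m_b / n
    var = 2 * m_a * m_b * (2 * m_a * m_b - n) / (n ** 2 * (n - 1))
    sd = math.sqrt(var)
    return r, mu, sd, mu - z * sd, mu + z * sd

# Monthly sequence of 36 observations quoted in the results below
seq = "BBBB AAA B AA BB AA B AAA B AA B AA BBBBB A BB A B A B"
r, mu, sd, lo, hi = runs_above_below(seq)
```

For the monthly data this gives R = 19 with μ_R ≈ 18.9 and rejection limits of roughly 13.2 and 24.7, matching the values reported in the results.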

Runs up or down

It is possible that cyclical effects could be present without being detected by the previous assessment of randomness. For this analysis, each mortality rate is compared to the previous mortality rate. Where it is the same or larger, there is a trend up, identified by "U". Where the mortality rate is less than the previous rate, there is a run down, identified by "D". A run is defined as a consecutive series of one or more mortality rates that have the same classification, and the statistic R* is the number of runs counted.

The null hypothesis of random distribution of runs up or down is rejected if R* < R*_α or R* > R*_{1−α}.

The normal approximation can be used and R*_α = μ_R* − z_{1−α} σ_R* and R*_{1−α} = μ_R* + z_{1−α} σ_R*,

where:

μ_R* = (2n − 1) / 3

σ_R*² = (16n − 29) / 90
For the monthly data, the variable sample size, and resultant variations in standard deviation, limit the value of the analysis of runs up and down, so this analysis was only conducted on mortality rates of the samples with 50 and 100 patients.
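A sketch of the corresponding calculation for runs up and down (again with z = 1.96; assumed Python illustration) for a series of n observations:

```python
import math

def runs_up_down_limits(n, z=1.96):
    """Expected number of runs up/down in a series of n observations,
    and the two-sided rejection limits under the null hypothesis
    of randomness."""
    mu = (2 * n - 1) / 3
    sd = math.sqrt((16 * n - 29) / 90)
    return mu, mu - z * sd, mu + z * sd

# The 63 blocks of 50 consecutive cases analysed below
mu, lo, hi = runs_up_down_limits(63)
```

For the 63 blocks of 50 cases this gives μ_R* ≈ 41.7 with limits near 35.2 and 48.1, consistent with the values reported in the results.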



Auto-correlation

Estimates of the sample auto-correlation for a range of lags of up to 1 year were calculated to exclude a seasonal cycle. The analysis used the sample auto-correlation function:

r_k = Σ_{i=1}^{N−k} (p_i − p̄)(p_{i+k} − p̄) / Σ_{i=1}^{N} (p_i − p̄)² ,  k = 0, 1, 2, 3, ..., N

where N is the number of samples and mortality rate observations in the series; p_i is the mortality rate of the sample indexed by i; p̄ is the mean of the monthly mortality rates; and k is the number of time periods apart chosen for the estimation of correlation (lag). Where the estimate r_k lies outside twice its standard error, the presence of a non-zero auto-correlation at the lag, k, cannot be excluded.
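The sample auto-correlation function can be sketched as follows (a minimal Python implementation, without the standard error bands shown in the figures):

```python
def autocorr(series, k):
    """Sample auto-correlation of a series at lag k."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    num = sum(dev[i] * dev[i + k] for i in range(n - k))
    den = sum(d * d for d in dev)
    return num / den

# Worked check on a short artificial series: r_1 of [1, 2, 3, 4] is 0.25
r1 = autocorr([1, 2, 3, 4], 1)
```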

A2.2 Results

Overview

Between 1 January 1995 and 31 December 1997, there were 3159 patients eligible for analysis. There were

507 deaths with an overall mortality rate of 0.161.

Monthly observations

There were 36 monthly observations, with a mean monthly mortality rate of 0.161 and standard deviation

of 0.037.

The range of monthly mortality rates was 0.09 - 0.26. The normal probability plot in Figure A2.1 supports the assumption that the monthly mortality observations were approximately normally distributed.

The Shapiro-Wilk W statistic (W = 0.98, p = 0.752) provides no evidence against the null hypothesis that

the monthly mortality rates have a normal distribution.

Figure A2.1: Normal Probability Plot of Monthly Mortality Rates


PAH ICU In Hospital Mortality 1995 - 1997, mean = 0.161, sd = 0.038

[Figure: expected normal value versus observed monthly mortality rate.]

Mortality rates above or below the monthly mean mortality rate gave the following sequence

BBBB AAA B AA BB AA B AAA B AA B AA BBBBB A BB A B A B

with 19 runs.

R = 19, m_A = 19, m_B = 17, μ_R = 18.9, and σ_R = 2.9. R_α = 13.2 and R_{1−α} = 24.7, therefore R_α < R < R_{1−α}.

There was insufficient evidence to support non-random patterns with clusters or abnormal mixing of

mortality rates.

Figure A2.2 displays the estimates of the auto-correlation coefficients for lags of 1 - 12 months. At a lag of

3 months, the coefficient is 0.38, which exceeds twice the standard error of the estimate. This suggests that there is likely to be a positive correlation between monthly mortality rates 3 months apart. This is

plausible, in the light of seasonal cycles of activity, and the possible effects of school holidays, and the

hospital year (January to January) and the financial year (July to June) on casemix. However, there do not

appear to be any other periodic effects.

Figure A2.2: Auto-Correlation Coefficients for Monthly Mortality


Rates
PAH ICU Jan 1995 - Dec 1997

[Figure: auto-correlation coefficients for lags of 1 - 12 months, with upper and lower 2 SE limits.]

Observations on blocks of 50 consecutive cases

There were 63 blocks, with a mean mortality rate of 0.160 and a standard deviation of 0.051. The range of

mortality rates was 0.04 - 0.28. The normal probability plot in Figure A2.3 suggests that the mortality rates are approximately normally distributed. The Shapiro-Wilk W statistic (W = 0.95, p = 0.02) suggests a significant departure from the normal distribution. The standard deviation estimated from the binomial distribution was 0.052.



Figure A2.3: Normal Probability Plot of Mortality Rates in Blocks


of 50 Admissions
PAH ICU In Hospital Mortality 1995 -1997
3150 cases, 63 blocks of observations

[Figure: expected normal value versus observed mortality of blocks of 50 cases.]

The sequence of the 63 observations above or below the mean mortality rate was:

BB AA BB A B AAA B AA B AAAA BB AAAA B AAAA B AAAAAA BB A B A BB A B A


BBB A B AA B A B AAAA BB
with 33 runs. R = 33, m_A = 25, m_B = 38, μ_R = 31.2, and σ_R = 3.8, from which R_α = 23.8 and R_{1−α} = 38.5.

Therefore, R_α < R < R_{1−α}, and there is no evidence to support non-random clusters or abnormal mixing of

mortality rates.

Runs of values up and down allowed 62 comparisons from the 63 observations, with 40 runs.

D U U D U U D U D U D U D D U U D U D D UUUU D UUUU D UUUU DDD UU D U DD U D

U DD UU D UU D U D U D UU D U

R* = 40, n = 63, μ_R* = 41.7, and σ_R* = 3.3, from which R*_α = 35.2 and R*_{1−α} = 48.1, so R*_α < R* < R*_{1−α}, and there is no evidence of non-random cycles up or down.



Auto-correlation coefficients for mortality rates of blocks of 50 admissions are in Figure A2.4 for lags of 1

to 21, covering the possibility of up to 1 year correlations. There was no significant auto-correlation.

Figure A2.4: Auto-Correlation Coefficients for Mortality Rates of


Blocks of 50 Cases
PAH ICU 1995-1997
[Figure: auto-correlation coefficients for lags of 1 - 21 blocks of 50 cases, with upper and lower 2 SE limits.]

Observations on blocks of 100 consecutive cases

There were 31 observations, with a mean mortality rate of 0.160 and a standard deviation of 0.0376. The

range of mortality rates was 0.10 - 0.27. Neither the normal probability plot in Figure A2.5, nor the

Shapiro-Wilk W statistic (W = 0.94, p = 0.08) provides evidence against the hypothesis that the mortality

rates have a normal distribution. The standard deviation estimated from the binomial distribution was

0.0377.

Figure A2.5: Normal Probability Plot of Mortality Rates in Blocks


of 100 Admissions
PAH ICU In Hospital Mortality 1995 - 1997
3100 cases, 31 blocks of observations

[Figure: expected normal value versus observed mortality of blocks of 100 cases.]

The sequence of the 31 mortality rates above or below the mean mortality rate:

B A B A A B A B A A B AAAA B AAA B A BBBBB A B AA B

demonstrated 19 runs. R = 19, m_A = 14, m_B = 17, μ_R = 16.4, and σ_R = 2.7, from which R_α = 11.1 and R_{1−α} = 21.7, so R_α < R < R_{1−α}, and there is no evidence for non-random clusters or abnormal mixing of mortality

rates.

A Runs Test of values up and down allowed 30 comparisons from the 31 observations.

U D U DD U DD U D UU DD U D UU D UU D U DD

R* = 18, n = 31, μ_R* = 20.3, and σ_R* = 2.3, from which R*_α = 15.8 and R*_{1−α} = 24.8, so R*_α < R* < R*_{1−α}, and there is no evidence of non-random cycles up or down.



Figure A2.6 presents the estimates of the auto-correlation coefficients for lags of 1 to 10, covering the

possibility of up to 1 year correlations. At a lag of 3 blocks of 100 cases, the coefficient is 0.47, which exceeds twice the standard error. This suggests that there is likely to be a positive correlation between observations that are 3 blocks of 100 cases apart. This is similar to the possible auto-correlation between mortality rates of monthly observations that were three months apart. However, there do not appear to be any other periodic effects.

Figure A2.6: Auto-Correlation Coefficients for Mortality Rates of


Blocks of 100 Cases
PAH ICU 1995-1997
[Figure: auto-correlation coefficients for lags of 1 - 10 blocks of 100 cases, with upper and lower 2 SE limits.]

Discussion of analysis

The observed mortality rates were randomly distributed either side of the mean mortality, and there were no abnormal runs up or down. The analysis of the mortality rates for 1995 - 1997 confirms that the monthly mortality rates are approximately normally distributed, and that the distribution is probably stationary over this period. The process can be considered to be in-control for this period. The presence of a significant auto-correlation at a lag of 3 months may be a factor that increases the false alarm rate of any control charting approach. Alternatively, it may be a chance finding. The auto-correlation analysis examined 43

possible relationships and two coefficients (monthly mortality at a lag of 3 months, and the 100 case blocks

at a lag of 3, which are largely equivalent analyses) were found to lie outside the two standard error limits.

With the grouping into blocks of 50 cases, the Shapiro-Wilk statistic suggested that there was a significant departure from the Normal distribution of block mortality rates. For this reason, the data will be grouped into larger patient samples, by month or 100 patient blocks, for the control chart applications.

The standard deviation of the observed mortality rates for the blocks of patients, estimated from the binomial distribution, agrees with the standard deviation calculated from this series.

A2.3 Conclusion

The observed patient mortality rates from the period 1 January 1995 to 31 December 1997 will be considered in-control. From this analysis, the in-control mortality rate is estimated as 0.16. Auto-correlation of mortality rate observations with a lag of 3 months or 300 cases may exist.

Appendix 3
Analysis of Performance of p-Chart

The purpose of Appendix 3 is to provide the background necessary to design appropriate p charts to monitor ICU patient mortality. This appendix illustrates the performance of the p chart by examining the average run length (ARL) to signal for a range of control limit parameters in the context of changing mortality rates. The approach to analysis is adapted from that suggested by Kennett and Zacks.

A3.1 Assumptions about the nature of the change.

The following analyses are done under the assumption that the changed mortality rate observations are normally distributed with a mean mortality rate of p and a standard error of √( p(1 − p) / n ). This estimate closely agrees with the observed standard deviation of the mortality rates in Appendix 2.

The analysis is conducted with the sample size set at n = 87, as this is the average number of patients in each month. p₀ is the in-control mortality rate and p₁ is the changed mortality rate. The control limits are defined as p₀ ± aσ, where a is the multiple of the standard deviation of the mortality rate (σ).

Further analysis explores the effect of sample size on chart performance.

A3.2 Operating Characteristic of the p chart

The operating characteristic is the probability that an observed mortality rate will fall within the control limits. It is calculated as the difference between the two cumulative distributions defined by the control limits. The operating characteristic is useful, as it allows analysis of the probability that a single sample mortality rate observation on the p chart detects a change in mean mortality, or falsely signals in the absence of changed conditions.



If the control limits, LCL and UCL, are set at

LCL, UCL = p₀ ∓ a √( p₀(1 − p₀) / n )

when the mortality rate shifts to a new value p₁ and the sample sizes are large, then Kennett and Zacks (1998) provide an expression for the operating characteristic:

OC(p₁) = Φ( (UCL − p₁)√n / √( p₁(1 − p₁) ) ) − Φ( (LCL − p₁)√n / √( p₁(1 − p₁) ) )

where Φ( ) denotes the cumulative standard normal distribution.

Substituting the values for the control limits,

OC(p₁) = Φ( ( √n(p₀ − p₁) + a√( p₀(1 − p₀) ) ) / √( p₁(1 − p₁) ) ) − Φ( ( √n(p₀ − p₁) − a√( p₀(1 − p₀) ) ) / √( p₁(1 − p₁) ) )

Equation A3.1

Figure A3.1 shows the OC curves where control limits are calculated using a = 0.5, 1, 1.5, 2, and 3. Table A3.1 contains values read off Figure A3.1 of the probability of signal under selected values for control limits and changed mortality rates. Note that a doubling of the mortality rate from 0.16 to 0.32 will be missed in 0.20 of the observations if 3σ control limits are used. It may be more appropriate to choose 2σ control limits, where the doubling of mortality will fail to signal in only 0.05 of the observations. With the 2σ control limits, with unchanged mortality rate, the occurrence of false positive signals will be 0.05.

With the control limits set at 0.5σ or 1σ, the charts are very sensitive to changes in the mortality rate, but the probability of false alarm is too high for clinical use. When a = 0.5, the probability of false alarm with unchanged mortality rate is 1 − 0.38 = 0.62. When a = 1, the probability of false alarm with unchanged mortality rate is 1 − 0.68 = 0.32, which is still unacceptably high for clinical use.

Figure A3.1: Operating Characteristic of p Chart with Range of Control Limits for a Change in Mortality Rate
Control limits ±0.5, 1, 1.5, 2, 3σ, p₀ = 0.16, n = 87 admissions

[Figure: probability that an observation falls within the control limits, versus new monthly mortality rate p₁.]

Table A3.1: Probability of Signal for a Range of Control Limit Settings (a) and Changed Mortality Rates (p₁), read from Figure A3.1.

New mortality rate p₁ | ±0.5σ | ±1σ | ±1.5σ | ±2σ | ±3σ
0.05 | ~1.0 | ~1.0 | 0.99 | 0.91 | 0.37
0.10 | 0.90 | 0.74 | 0.51 | 0.28 | 0.04
0.16 | 0.62 | 0.32 | 0.14 | 0.05 | <0.01
0.20 | 0.76 | 0.54 | 0.34 | 0.19 | 0.03
0.25 | 0.94 | 0.87 | 0.75 | 0.60 | 0.27
0.32 | ~1.0 | 0.99 | 0.98 | 0.95 | 0.80

A further perspective can be obtained by converting the OC(p₁) into estimates of ARL to signal. The relationship between ARL and OC(p₁) is

ARL = 1 / [ 1 − OC(p₁) ]

This is important, as analysis of the CUSUM and EWMA charts is done with ARL. Figure A3.2 shows the ARL to signal in samples of 87 cases where control limits are calculated using a = 0.5, 1, 1.5, 2, and 3, and Table A3.2 contains values read off Figure A3.2.
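Equation A3.1 and this ARL relationship can be evaluated numerically. The sketch below (a Python illustration with Φ built from the error function; p₀ = 0.16 and n = 87 as above) reproduces, for example, the in-control ARL of about 370 with 3σ limits, and the OC of about 0.95 when the rate is unchanged and a = 2:

```python
import math

def phi(x):
    """Cumulative standard normal distribution."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def oc(p1, p0=0.16, n=87, a=2.0):
    """Operating characteristic of the p chart (Equation A3.1)."""
    shift = math.sqrt(n) * (p0 - p1)
    width = a * math.sqrt(p0 * (1 - p0))
    s1 = math.sqrt(p1 * (1 - p1))
    return phi((shift + width) / s1) - phi((shift - width) / s1)

def arl(p1, p0=0.16, n=87, a=2.0):
    """Average run length (in samples of n cases) to signal."""
    return 1 / (1 - oc(p1, p0, n, a))
```

For example, oc(0.16, a=2.0) is approximately 0.954 and arl(0.16, a=3.0) is approximately 370.4, in line with Tables A3.1 and A3.2.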

Figure A3.2: Semi-log Plot of Average Run Length (ARL) to Signal of p Chart with Range of Control Limits for a Change in Mortality Rate
Control limits a = 0.5, 1, 1.5, 2, 3, p₀ = 0.16, n = 87

[Figure: ARL (log scale, in samples of n = 87) versus new monthly mortality rate p₁.]



Table A3.2: Average Run Length (ARL) in Samples of 87 Cases, for a Range of Control Limit Settings (a) and Changed Mortality Rates (p₁), read from Figure A3.2.

New mortality rate p₁ | ±0.5σ | ±1σ | ±1.5σ | ±2σ | ±3σ
0.05 | ~1.00 | 1.00 | 1.01 | 1.10 | 2.72
0.10 | 1.11 | 1.35 | 1.95 | 3.55 | 27.87
0.16 | 1.62 | 3.15 | 7.48 | 21.98 | 370.38
0.20 | 1.31 | 1.86 | 2.98 | 5.35 | 28.79
0.25 | 1.06 | 1.16 | 1.34 | 1.67 | 3.65
0.32 | ~1.0 | ~1.0 | 1.02 | 1.05 | 1.25

With 3σ control limits, an average of 370.38 observations (32 225 patients, or once every 27 years) will be expected before a false alarm. However, in the event of the mortality rate doubling to 0.32, the ARL is 1.25 samples (109 patients). With 2σ control limits, the ARL for false alarm is 21.98 (1912.3 patients) and the ARL for doubling mortality is 1.05 (91.4 patients). The advantage of the 2σ control limits is the better detection of moderate increases in mortality rate that are more clinically plausible. For example, an increase of mortality rate from 0.16 to 0.20 will be detected by the 2σ control limit chart with an ARL of 5.35 (465 patients, or between five and six months), compared to the 3σ control limit chart, which has an ARL of 28.79 (2504 patients, or over two years). Though the specificity of the 3σ control limit chart is very high, it will be unlikely to detect clinically plausible changes in the mortality rate.



A3.3 Effect of sample size on p chart performance

While designing the p charts, samples of patients grouped by month of admission were convenient. However, it is important to explore alternative sample sizes.

Under this analysis, there is a linear relationship between ARL and block size when the mortality rate is unchanged. From Equation A3.1, when p₁ = p₀, OC(p₀) = 0.95 if a = 2. The estimated ARL will be 20, and the ARL (in cases) will be 20n. Figure A3.3 (a & b) shows the effect of changing the sample size on the ARL to signal when p₁ ≠ p₀, using an in-control mortality rate of 0.16 and 2σ control limits. In this analysis, the ARLs are presented as the average number of cases, to avoid confusion as the sample size is varied.

Figure A3.3a displays the effect of changing block size when the out-of-control mortality rates are below 0.16.

Figure A3.3a: Average Number of Cases to Signal of p Chart with Effect of Variable Sample Size over Range of Changed Mortality Rates
Control limits 2σ, p₀ = 0.16, p₁ < p₀, n = 10 ... 200

[Figure: average number of cases to signal versus number of cases in block, n, for changed mortality rates below 0.16 (e.g. p₁ = 0.14).]

When the changed mortality rate is 0.14, the ARL in cases continues to rise with increasing block size. With lower p₁, the ARL in cases does not increase much after block sizes of 50 - 100 cases. Therefore there is little sensitivity advantage in using block sizes greater than 50 - 100 cases to detect a decrease in mortality rate. With a block size of 50 cases, the specificity of the p chart, measured by in-control ARL, is 1000 cases, and for a block size of 100, it is 2000 cases.

Figure A3.3b: Average Number of Cases to Signal of p Chart with Effect of Variable Sample Size over Range of Changed Mortality Rates
Control limits 2σ, p₀ = 0.16, p₁ > p₀, n = 10 ... 200

[Figure: average number of cases to signal versus number of cases in block, n, for changed mortality rates above 0.16 (e.g. p₁ = 0.18).]

Figure A3.3b presents the out-of-control mortality rates above 0.16. Again we can confirm that there is little advantage in sensitivity gained by using block sizes larger than 50 - 100 cases, unless the new mortality rate is close to 0.16.

A3.4 Summary of Analysis

The choice of 2σ control limits and a block size of ≥ 50 represents appropriate performance for the p chart of in-hospital ICU mortality.



Appendix 4
Statistical Analysis of the CUSUM Chart

The purpose of this appendix is to summarise the background to statistical analysis using the CUSUM. The assumptions underlying the analysis, the notation and calculations, and an analysis of chart performance under a range of parameter values and changed mortality rates are included. This appendix is not an exhaustive review, and is provided as a reference to the material relevant to the text of the thesis. The results of the analyses in this appendix provide the basis for the design of the CUSUM charts in Chapter 4.

The books by Ryan, Montgomery, Hawkins and Olwell, and Kennett and Zacks provide excellent descriptions of aspects of the topic.

A4.1 Assumptions

The assumptions on which the methods are based are that the observations are independent and that samples are randomly drawn from a population of known distribution. The analysis in Appendix 2 of the PAH ICU data during the in-control period (1 January 1995 - 31 December 1997) suggests that these assumptions are plausible. For the following example, a sample size of 100 cases will be used. The distribution of the mortality rate of the blocks of 100 patients is approximately normal.

A4.2 Statistical Tests

To test for shifts in the process mean of an anticipated magnitude, three equivalent statistical approaches can be used: the V mask, the decision interval, and Page's two-sided CUSUM. The following discussion adopts the notation and conventions used by Hawkins and Olwell.



A4.3 Definitions and Notation

p_i is the mortality rate observed for the sample indexed by i.

μ₀ is the mean of the process in-control, estimated by the mean mortality rate p̄₀. After a change in the process, the new mean will be μ₁. Where a shift to a higher mean mortality rate is of interest, then μ₁⁺ > μ₀. Where a shift to a lower value is of interest, then μ₁⁻ < μ₀.

n is the number of patients in a sample or block.

σ is the standard deviation of the process in-control. It is estimated by √( p̄₀(1 − p̄₀) / n ). From Appendix 2, this estimate gave the same values as the standard deviation of the mortality rates 1995 - 1997.

K⁺ is a parameter dependent on μ₀ and the choice of a (clinically) important increased process mean:

K⁺ = ( μ₀ + μ₁⁺ ) / 2

Similarly, K⁻ depends on the clinically important new lower mean mortality rate:

K⁻ = ( μ₀ + μ₁⁻ ) / 2

K⁺ and K⁻ have the units of the outcome measurement. In the examples of Chapter 4 the units are deaths per 100 cases. The CUSUM performance in detecting persistent shifts is optimal under these conditions, but the CUSUM is robust and will signal when shifts of greater or lesser magnitude are present. A signal does not imply that the shift is to μ₁ exactly.

h is the control limit to which the CUSUM statistic is compared. It is chosen according to the desired performance characteristics of the chart, i.e. average run lengths (ARL) for in-control and changed mortality conditions under the parameter choices. h⁺ is the upper decision interval and h⁻ is the lower decision interval.

The upper and the lower CUSUMs are separate statistical analyses testing for increases and decreases, respectively, in the process mean. These statistics are often run concurrently. If only a change in the process mean in one direction is sought, then either an upper or a lower CUSUM could be run in isolation. Where a single CUSUM is run, it has a lower rate of false alarm than running both an upper and a lower CUSUM together. This analysis will consider the performance of both upper and lower CUSUMs together, and a single CUSUM alone.

The upper CUSUM statistic is calculated recursively:

C₀⁺ = 0

C_i⁺ = max( 0, C_{i−1}⁺ + p_i − K⁺ )

A signal occurs when C_i⁺ > h⁺.

Concurrently, a lower CUSUM can be run, and a decrease in the mean would be signalled when

C_i⁻ < h⁻

given

C₀⁻ = 0

C_i⁻ = min( 0, C_{i−1}⁻ + p_i − K⁻ )
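These recursions can be sketched as follows (a Python illustration, not the original MATLAB programme; the parameters K⁺ = 0.185, K⁻ = 0.135 and h = 0.073 are the values chosen for the PAH ICU example later in this appendix, with the lower decision interval taken as −h):

```python
def cusum(rates, k_plus=0.185, k_minus=0.135, h=0.073):
    """Upper and lower CUSUM statistics for a sequence of sample
    mortality rates. Returns the two paths and the 1-based index
    of the first signal, or None if no signal occurs."""
    c_plus, c_minus = 0.0, 0.0
    upper, lower, signal = [], [], None
    for i, p in enumerate(rates, start=1):
        c_plus = max(0.0, c_plus + p - k_plus)
        c_minus = min(0.0, c_minus + p - k_minus)
        upper.append(c_plus)
        lower.append(c_minus)
        if signal is None and (c_plus > h or c_minus < -h):
            signal = i
    return upper, lower, signal

# A sustained shift of the block mortality rate to 0.25 accumulates
# increments of 0.065 on the upper CUSUM and signals on the second block
upper, lower, signal = cusum([0.25] * 5)
```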

A4.4 Analysis of Performance of CUSUM Charts: Choice of Design


Parameters.

The performance of the CUSUM control charts can be described in terms of ARL to signal under in-control conditions, and under conditions of a changed mean mortality rate. The in-control ARL is a measure of the occurrence of false alarms. The ARL when the process mean has changed is an indication of the efficiency with which the chart detects the changed mortality rate. The ARL to detect a changed mortality rate can be estimated using a starting value of CUSUM = 0, or from a CUSUM with a steady-state value running under in-control conditions. The ARL from steady state will be shorter than the ARL from an in-control state. However, both methods are thought to give the same ranking of the efficiency of charting designs.

For this analysis, I will calculate the ARL for CUSUMs that have the changed or out-of-control mortality rates from the first observation.

To characterise the performance of the CUSUM charts used to analyse the ICU data, a series of simulations were run. All simulations were programmed in MATLAB 6.5, and the results graphed in Microsoft Excel. Alternative approaches using integral equations, Markov chain discrete approximations to the integral equations, and other methods to reduce the computing intensity of simulations are described by Hawkins and Olwell.

A4.5 In-control ARL

A simple programme was written to simulate the process in-control. The ARL of a CUSUM chart of the mortality rates of blocks of 100 admissions was modelled. The in-control mortality rate was 0.16 and the shifts of mean mortality rate were to 0.11 and 0.21, giving K⁺ = 0.185 and K⁻ = 0.135. The observed mortality rate of 0.16 and the standard deviation of 0.037 formed the basis for simulations. Simulated block mortality rates were randomly drawn from a normal distribution with these parameters, and any negative values were given a mortality rate of zero. 10 000 simulated runs were used to estimate the ARL at each value in the ranges studied. In each simulation an upper and a lower CUSUM were modelled.

Figure A4.1 shows the relationship between ARL and the range of the decision interval parameter, h. An h+/- of 0.073 gives an in-control ARL for a single CUSUM test of 37 (or 3700 cases, approximately 3 years), and of 20 (or 2000 cases, or 1.7 years) for the upper and lower CUSUM together.

Figure A4.1: In - Control ARL for a Range of h+/-


Single CUSUM, and combined upper and lower CUSUM monitoring schemes
In-control mortality rate = 0.16, optimal detection for change to 0.11 and 0.21

The results of a further simulation, examining the ARL over the range of h+/- of 0.07 to 0.08 for CUSUMs with both upper and lower, and single, monitoring schemes, are shown in Figures A4.2 and A4.3. The simulation was conducted as before, except that alternative mortality rates of interest (μ₁⁺ and μ₁⁻) were used. The choice of μ₁⁺ and μ₁⁻ has a great effect on the ARL while the process remains in control. For example, if the shifts in mean mortality rate that are to be detected are a lower rate of 0.08 and an upper rate of 0.24, the ARL in-control (h+/- = 0.073) is 77 (7700 patients, or about 7 years) for a two-sided CUSUM, and 140 (14 000 patients, or about 13 years) for a one-sided CUSUM. This is a large increase in the ARL in-control compared to when the mortality rates of 0.11 and 0.21 are to be detected (h+/- = 0.073), where the ARL for a single CUSUM is 37 and for an upper and lower CUSUM is 20.

Figure A4.2: In-Control ARL for a Range of h+/- and Alternative Mortality Rates
In-control mortality rate = 0.16, upper and lower CUSUM

[Figure: in-control ARL versus h+/- (0.07 - 0.08), for mortality rates of interest <0.08 or >0.24, <0.09 or >0.23, <0.10 or >0.22, <0.11 or >0.21, <0.12 or >0.20, <0.13 or >0.19, and <0.14 or >0.18.]

Figure A4.3: In-Control ARL for a Range of h+/- and Alternative Mortality Rates of Interest
In-control mortality rate = 0.16, single (upper or lower) CUSUM

[Figure: in-control ARL versus h+/- (0.07 - 0.08), for the same mortality rates of interest as Figure A4.2.]

A4.6 ARL where the mortality rate has changed.

To examine the effect on ARL of changes in the observed mortality rate, a series of simulations were conducted. The in-control mean was 0.16, the changed mortality rates to be detected were μ₁⁺ = 0.21 and μ₁⁻ = 0.11, and the control limits were h+/- = 0.073. For the simulations that follow, it is assumed that the process has the changed mean from the commencement of monitoring. Simulated mortality rates were randomly drawn from a normal distribution with mean equal to the out-of-control mortality rate, and standard deviation calculated from the out-of-control mortality rate and the number of patients. Any negative values were given a mortality rate of zero. 10 000 simulated runs were used to estimate the ARL at each value in the ranges studied. In each simulation an upper and a lower CUSUM were modelled.

Figure A4.4 shows the ARL to signal of the CUSUM for a range of out-of-control mortality rates.

Figure A4.4: ARL to Signal with Altered Mortality
ARL for upper and lower CUSUMs together, and upper or lower CUSUMs alone
h+/- = 0.073, optimum setting for change of mortality rate to 0.11 or 0.21

[Figure: ARL versus out-of-control mortality rate (0.05 - 0.40), for the upper and lower CUSUMs together, the upper CUSUM alone, and the lower CUSUM alone.]

The ARL of the upper and lower CUSUM together shows an ARL of 18.7 when the mortality rate remains at 0.16, which differs slightly from the estimate of 20 from the simulation of Figure A4.2. The ARL decreases as the mortality rate is changed from 0.16. The ARLs for 0.15 (16.5) and 0.17 (13.9) are not much different from the false alarm ARL. This is appropriate, as such small changes are not of particular clinical importance. However, at more extreme out-of-control mortality rates, the CUSUM will signal quite rapidly. The ARL for a mortality rate of 0.21 is 3.3, and that of 0.11 is 3.4. At the PAH ICU this would have been about 3 months.

Figure A4.4 also shows the ARL when only a single upper or lower CUSUM is used. This monitoring option may be chosen when only one direction of changed mortality rate is of clinical interest. For example, it may be important only to detect an increased mortality rate. The ARL for detection of increased mortality rates that are close to the in-control mortality rate is much larger. At a mortality rate of 0.18, the ARL of the upper and lower CUSUM and the upper CUSUM only is the same. So the sensitivity of the combined upper and lower CUSUMs is about the same as that of the upper CUSUM alone for detecting increases in mortality rates.

The advantage of using only a single CUSUM, when only one direction of change is of interest, is that the in-control ARL is much longer (35.5 v 18.7), and the analyses are more specific. Also, the ARL when the mortality rate has fallen is dramatically increased. It is very unlikely to get signals from the upper CUSUM, as it is not designed to detect decreases in the mortality rate. The increased ARL for the upper CUSUM is seen at out-of-control mortality rates of 0.15 (16.5 v 18.9), 0.14 (10.9 v 310) and 0.13 (7.0 v 1377). Similarly, the lower CUSUM is not designed to detect increased mortality rates.

A4.7 Summary

The design of the CUSUM chart requires that the in-control mortality rate is known and that out-of-control mortality rates of clinical importance are chosen. Analysis of the ARL under in-control and under changed conditions directs the choice of the control limit parameters.

In the PAH ICU example, the in-control mortality rate was 0.16, and the mortality rates that were to be detected were 0.11 and 0.21. With h+/- = 0.073, an ARL of 1870 patients for the in-control occurrence of false alarm was considered acceptable. This choice of parameters allowed rapid detection of clinically important changes to the mortality rate.



Appendix 5
Analysis of Performance and Choice of Parameters and
Control Limits for Exponentially Weighted Moving
Average (EWMA) Chart

The purpose of this appendix is to analyse the ARL of the EWMA chart for a range of chart characteristics.

From this analysis, a choice of appropriate parameters will be made, and used in Chapter 4 for monitoring

the ICU mortality rate.

The formula for the EWMA statistic is

EWMA_i = y_i λ + EWMA_{i-1} (1 − λ)

where y_i is the value of the i-th observation. This value may be the mortality rate of a sample of patients, p_i, or the outcome of a single patient, Y_i. In the examples used, both sample blocks of 100 consecutive patients and single patient outcomes are presented. λ is the weight, between 0 and 1. EWMA_i is the value of the statistic indexed by i.

The control limits are calculated by:

CL_i = p̄ ± a σ √( (λ / (2 − λ)) [1 − (1 − λ)^{2i}] ),  where σ = √( p̄ (1 − p̄) / n )

p̄ is the in-control mortality rate and n is the number of cases in the sample; n = 1 when the outcomes of single patients, Y_i, are being analysed. A series of simulation experiments were performed to characterise the run length distribution and the ARL under different conditions. a is the parameter defining the width of the control limits as multiples of σ.
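The recursion and control limits above can be put directly into code. The following is a minimal sketch (the function name and defaults are illustrative, not from the thesis), using the standard time-varying EWMA-of-proportions limit formula:

```python
import math

def ewma_chart(observations, p0=0.16, lam=0.3, a=2.0, n=100):
    """Return (EWMA value, lower limit, upper limit) for each observation.
    p0: in-control mortality rate; lam: EWMA weight; a: limit width in
    multiples of sigma; n: number of patients per sample block."""
    sigma = math.sqrt(p0 * (1 - p0) / n)
    ewma = p0  # the statistic starts at the in-control rate
    out = []
    for i, y in enumerate(observations, start=1):
        ewma = lam * y + (1 - lam) * ewma
        hw = a * sigma * math.sqrt(lam / (2 - lam) * (1 - (1 - lam) ** (2 * i)))
        out.append((ewma, p0 - hw, p0 + hw))
    return out
```

Note that the limits widen with i toward their asymptotic value, so early observations are judged against narrower limits.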

A5.1 Effect of a on in-control ARL

Simulations were conducted to display the relationship between a, which defines the width of the control limits as multiples of σ, and the ARL under in-control conditions.

To estimate the ARL, 10 000 EWMA simulations were performed at values of a between 0.5 and 3. For each simulation, the starting in-control estimate was 0.16 and λ = 0.3. A random variable having a normal distribution with a mean of 0.16 and a standard deviation of √(0.16(1 − 0.16)/100) = 0.0367 was used to simulate the mortality rates of blocks of 100 patients. Each simulated mortality observation was incorporated into the EWMA chart. When the EWMA_i statistic fell beyond one of the control limits, the chart was deemed to have signalled, and the run length for that simulation was i. The mean ARL for each value of a was the mean of the 10 000 run lengths.
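The Monte Carlo procedure just described can be sketched as follows. This is a simplified, hypothetical version (names, seed and run counts are illustrative) assuming the normal approximation for block mortality rates:

```python
import math
import random

def simulate_arl(a, p0=0.16, lam=0.3, n=100, n_runs=1000, max_len=100000, seed=0):
    """Estimate the in-control ARL of an EWMA chart by Monte Carlo.
    Block mortality rates are drawn from the normal approximation
    N(p0, p0*(1 - p0)/n), as in the simulations described in the text."""
    rng = random.Random(seed)
    sigma = math.sqrt(p0 * (1 - p0) / n)
    total = 0
    for _ in range(n_runs):
        ewma = p0
        run_length = max_len  # censored if no signal within max_len
        for i in range(1, max_len + 1):
            y = rng.gauss(p0, sigma)
            ewma = lam * y + (1 - lam) * ewma
            hw = a * sigma * math.sqrt(lam / (2 - lam) * (1 - (1 - lam) ** (2 * i)))
            if abs(ewma - p0) > hw:
                run_length = i
                break
        total += run_length
    return total / n_runs
```

With narrow limits (small a) the estimated ARL should be short, and it should grow rapidly as a increases, mirroring the pattern reported below.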

Figure A5.1 shows the effect of varying the width of the control limits on the in-control ARL. As expected, the narrower control limits have short ARLs, with the ARL at a = 1 only 3.6 blocks. At a = 2, the ARL is 30.6 blocks. At a = 3, the ARL rises to 455.6 blocks.


Figure A5.1: Effect of Control Limit Width (a) on ARL of EWMA
(λ = 0.3, n = 100, in-control mortality rate = 0.16; control limits = ±aσ; x-axis: width of control limits (±a); y-axis: ARL)

A5.2 Effect of λ on ARL

Simulations were conducted to display the relationship between λ, the EWMA weight, and the ARL under in-control conditions.

To estimate the ARL, 10 000 EWMA simulations were performed at each value of λ between 0.01 and 1. For each chart simulation, the starting in-control estimate was 0.16 and a = 2. A random variable having a

normal distribution with a mean of 0.16 and a standard deviation of 0.0367 was used to simulate the

mortality rates of blocks of 100 patients. Each simulated mortality observation was incorporated into the

EWMA chart. When the EWMA_i statistic fell beyond one of the control limits, the chart was deemed to have signalled, and the run length for that simulation was i. The ARL for each value of λ was the mean of

the 10 000 run lengths.



Figure A5.2 shows the rapid fall in ARL as λ, the weight, is reduced. At λ = 0.02, the in-control ARL is 180.1. At λ = 0.2 the ARL is 37.2, and at λ = 0.3 the ARL is 30.0. The response is somewhat flat for λ > 0.4.

Figure A5.2: Effect of λ on ARL of EWMA
(control limits = ±2σ, n = 100, in-control mortality rate = 0.16)

A5.3 Effect of changed mortality rate on ARL

The relationship between the changed mortality rate and the ARL to signal was explored.

10 000 EWMA simulations were performed at each of the simulated changed mortality rates between 0.05 and 0.30, in increments of 0.001. For each chart simulation, the starting in-control estimate was 0.16, a = 2 and λ = 0.3. A random variable having a normal distribution with a mean of the changed mortality rate and a standard deviation calculated from the new mortality rate was used to simulate the mortality rates of blocks of 100 patients. Each simulated mortality observation was incorporated into the EWMA chart. When the EWMA_i statistic fell beyond one of the control limits, the chart was deemed to have signalled, and the run length for that simulation was i. The mean ARL was the mean of the 10 000 run lengths.

The ARL for a range of mortality rates is shown in Figure A5.3. The in-control ARL is 30.4 blocks. The ARL drops rapidly as the mortality rate moves from 0.16. At a mortality rate of 0.15 the ARL is 21.1 and at 0.17 the ARL is 17.5. When the mortality rate has fallen to 0.12 or risen to 0.20, the ARL is 3.6.

Figure A5.3: Effect of Changing Mortality Rate on ARL of EWMA
(control limits = ±2σ, λ = 0.3, n = 100, in-control mortality rate = 0.16; x-axis: mortality rate; y-axis: ARL)

For monitoring mortality rates in this application in the ICU, where the in-control mortality rate is 0.16, the values of λ = 0.3 and a = 2 provide an in-control ARL of about 30.5 blocks (3050 patients, or 2.5 years) and rapid detection of a changed mortality rate.

A5.4 Choice of parameters, and analysis of performance of EWMA of single cases

The EWMA chart can be used to monitor the mean mortality rate for any rational group size, including

individual patient observations, Y_i. A smaller value of λ will smooth the effect of accumulating individual

patient outcomes.

Simulations were conducted to display the relationship between λ, the EWMA weight, and the ARL at a range of changed mortality rates.

To estimate the ARL, 1000 EWMA simulations were performed at values of λ between 0.001 and 0.3 in increments of 0.001. A range of changed mortality rates between 0.10 and 0.22 in increments of 0.02 was simulated. For each chart simulation, the starting in-control estimate was 0.16 and a = 2. A Bernoulli trial with a probability equal to the changed mortality rate was used to simulate the outcomes of the patients. Each simulated outcome was incorporated into the EWMA chart. When the EWMA_i statistic fell beyond one of the control limits, the chart was deemed to have signalled, and the run length for that simulation was i. The mean ARL was the mean of the 1000 run lengths.
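The single-case simulation can be sketched in the same way as the block version, replacing the normal draw with a Bernoulli trial. A hypothetical sketch (names, seed and reduced run counts are illustrative):

```python
import math
import random

def arl_single_case(lam, p_true, p0=0.16, a=2.0, n_runs=500, max_len=200000, seed=1):
    """Estimate the ARL of an EWMA chart applied to individual patient
    outcomes (n = 1). Each outcome is a Bernoulli trial with probability
    p_true; the chart is designed around the in-control rate p0."""
    rng = random.Random(seed)
    sigma = math.sqrt(p0 * (1 - p0))  # n = 1 for single outcomes
    total = 0
    for _ in range(n_runs):
        ewma = p0
        run_length = max_len  # censored if no signal within max_len
        for i in range(1, max_len + 1):
            y = 1.0 if rng.random() < p_true else 0.0
            ewma = lam * y + (1 - lam) * ewma
            hw = a * sigma * math.sqrt(lam / (2 - lam) * (1 - (1 - lam) ** (2 * i)))
            if abs(ewma - p0) > hw:
                run_length = i
                break
        total += run_length
    return total / n_runs
```

A clearly out-of-control rate should yield a much shorter estimated ARL than the in-control rate, reproducing the general shape of Figure A5.4.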

Figure A5.4 is a log-log plot of ARL against λ which displays the results of this simulation. There is an almost linear relationship between log ARL and log λ. There is a rapid rise in the ARL for all values of λ below 0.05 at all simulated changed mortality rates. For λ > 0.2, there is little difference between the ARL for the in-control mortality rate of 0.16 and the out-of-control rates.

Figure A5.4: Effect of λ on ARL of EWMA at a Range of Mortality Rates
(log scale plots of ARL at mortality rates 0.10 - 0.22; control limits = ±2σ, n = 1; x-axis: λ (log scale); y-axis: ARL (log scale))

A value of λ for the EWMA chart with n = 1 in the range 0.001 - 0.02 gives a balance between rapid detection of changed mortality rates and a long ARL for the in-control process.

Based on these analyses, for the EWMA charts which analyse the mortality rate using individual patient observations, λ = 0.001 is chosen. This gives an in-control ARL of 2155 cases, and an ARL of 117 cases for a mortality rate of 0.10, 232 cases for a mortality rate of 0.12, 487 cases for a mortality rate of 0.18 and 163 cases for a mortality rate of 0.20.



Appendix 6
Characterisation of the Distribution of Observed
Mortality Rates

The purpose of this appendix is to discuss approaches to characterise the distribution of mortality rates in a

sample of ICU patients.

A6.1 Notation

The notation used in this appendix is the same as used in the body of the thesis, and is summarised as

follows:

Y_ij is the outcome for patient j in sample i. If the patient dies, Y_ij = 1 and the probability of death is π_ij. If the patient survives, Y_ij = 0 and the probability of survival is 1 − π_ij.

E[Y_ij] = 1 × π_ij + 0 × (1 − π_ij) = π_ij

var(Y_ij) = (1 − π_ij)² π_ij + (0 − π_ij)² (1 − π_ij) = π_ij (1 − π_ij)

π_ij may be estimated by π̂_ij, using a statistical model such as the APACHE III estimate of the probability of death. The variance is then estimated by

var(Y_ij) = π̂_ij (1 − π̂_ij)

The observed mortality rate for sample i is R_i = (1/n_i) Σ_{j=1}^{n_i} Y_ij, and the predicted mortality rate, E(R_i), is

E(R_i) = (1/n_i) Σ_{j=1}^{n_i} π_ij

Three methods to characterise the distribution of R_i will be considered.

A6.2 Approximation using each Estimate of Patient Risk of Death

R_i is the average of the random variables Y_ij, so using the central limit theorem, the distribution of R_i can be approximated by a normal distribution if the number of cases in the sample, n_i, is large.

The variance of R_i is

var(R_i) = Σ_{j=1}^{n_i} var(Y_ij) / n_i² = Σ_{j=1}^{n_i} π_ij (1 − π_ij) / n_i²

and it is estimated by

var(R_i) = Σ_{j=1}^{n_i} π̂_ij (1 − π̂_ij) / n_i²

Where n_i is small, Alemi and co-workers use the t distribution rather than the standard normal distribution to approximate the distribution of these Z-scores.

For the application to the ICU dataset, the sample sizes are large, being more than 87 cases. The normal

approximation is simple to work with and for the purposes of RA charting, its accuracy and precision will

be adequate and will probably exceed the accuracy of the RA tool.

However, this model of the distribution of the sample mortality rate is a continuous, unbounded approximation, whereas the distribution of R_i given the series of π_ij values is a discrete distribution of mortality rate values bounded by 0 and 1.



A6.3 Approximation using Mean Patient Risk of Death

For non-RA analysis, it is assumed that the patients were independently and randomly selected. The individual patient's risk of death is not known, so it is assumed to be the same for all patients, and is denoted by π̄_i. E(R_i) was the average predicted mortality rate of the sample.

The control limits were calculated using the estimate of the variance of R_i:

var(R_i) = π̄_i (1 − π̄_i) / n_i

However, this expression for the variance over-estimates var(R_i) if the patients do not actually all have the same risk of dying. In the realistic case where the patients do not all have the same risk of dying, π̄_i is still their average risk:

π̄_i = (1/n_i) Σ_{j=1}^{n_i} π_ij

For the non-RA p chart,

var(R_i) = π̄_i (1 − π̄_i) / n_i = Σ_{j=1}^{n_i} π_ij / n_i² − ( Σ_{j=1}^{n_i} π_ij )² / n_i³

By both adding and subtracting the term Σ_{j=1}^{n_i} π_ij² / n_i²:

var(R_i) = [ Σ_{j=1}^{n_i} π_ij − Σ_{j=1}^{n_i} π_ij² ] / n_i² + [ Σ_{j=1}^{n_i} π_ij² / n_i² − ( Σ_{j=1}^{n_i} π_ij )² / n_i³ ]

= Σ_{j=1}^{n_i} π_ij (1 − π_ij) / n_i² + [ Σ_{j=1}^{n_i} π_ij² / n_i² − ( Σ_{j=1}^{n_i} π_ij )² / n_i³ ]

The first term, Σ_{j=1}^{n_i} π_ij (1 − π_ij) / n_i², is the estimate of the variance of the mortality rate where the patients' probabilities of death are not all equal. Assuming that this expression is accurate, the non-RA estimate of the variance will overestimate var(R_i) by the value of the term:

Σ_{j=1}^{n_i} π_ij² / n_i² − ( Σ_{j=1}^{n_i} π_ij )² / n_i³ = (1/n_i²) Σ_{j=1}^{n_i} ( π_ij − π̄_i )²

which is non-negative.

Therefore, where the average patient's risk of death is low and the sample sizes are large, this estimate is

potentially a reasonable approximation. The advantage of this approximation is its ease of calculation.

Its disadvantage is that it overestimates the variance of R_i. Figure A6.1, at the end of this appendix, shows the error in this estimate compared to more accurate methods using 100 patients randomly drawn from the PAH ICU dataset. This approach, though simple, will result in wide control limits that provide a conservative analysis.
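The overestimate term above can be verified numerically. A small sketch, using hypothetical risk-of-death values rather than the PAH data:

```python
# Check that the pooled-risk (non-RA) variance estimate exceeds the
# patient-specific estimate by exactly sum((pi_ij - mean)^2) / n^2.
# The risk values here are hypothetical, not the PAH ICU data.
pis = [0.05, 0.10, 0.20, 0.40, 0.05]
n = len(pis)
mean_pi = sum(pis) / n

var_pooled = mean_pi * (1 - mean_pi) / n               # non-RA estimate
var_specific = sum(p * (1 - p) for p in pis) / n ** 2  # patient-specific
overestimate = sum((p - mean_pi) ** 2 for p in pis) / n ** 2

assert abs(var_pooled - (var_specific + overestimate)) < 1e-12
assert var_pooled > var_specific  # pooled estimate is wider
```

The two assertions confirm the algebraic identity and that the non-RA estimate is conservative whenever the individual risks differ.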

A6.4 Exact Method: Iterative approach to define the distribution of mortality rates

The probability distribution of R_i can be calculated iteratively, if the values of all π_ij are known, or are each approximated by π̂_ij. Consider that each patient outcome, Y_ij (death = 1, survival = 0), is an independent Bernoulli random variable with probability π_ij. With n_i patients in a sample, 0 ≤ Σ_{j=1}^{n_i} Y_ij ≤ n_i. The probability distribution of R_i = (1/n_i) Σ_{j=1}^{n_i} Y_ij can be described in terms of the π_ij.

When the first patient is included, there are two possible values:

Pr(R_{i,1} = 1) = π_1 and
Pr(R_{i,1} = 0) = 1 − π_1

After the second patient there are three possible observed mortality rates:

Pr(R_{i,2} = 1) = π_1 π_2
Pr(R_{i,2} = 0.5) = π_1 (1 − π_2) + (1 − π_1) π_2
Pr(R_{i,2} = 0) = (1 − π_1)(1 − π_2)

The process can be continued with each patient risk of death estimate.

This iterative approach is a simple method to compute the probability distribution or the cumulative probability function for R_i for a particular set of π_ij.
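The iterative scheme above amounts to building up the distribution of the number of deaths one patient at a time. A minimal sketch (the function name is illustrative):

```python
def death_count_distribution(pis):
    """Iteratively build Pr(d deaths) for patients with individual
    probabilities of death pis, one patient at a time, as described
    above. dist[d] is Pr(d deaths); R_i = d / len(pis)."""
    dist = [1.0]  # before any patients, Pr(0 deaths) = 1
    for p in pis:
        new = [0.0] * (len(dist) + 1)
        for d, prob in enumerate(dist):
            new[d] += prob * (1 - p)   # this patient survives
            new[d + 1] += prob * p     # this patient dies
        dist = new
    return dist
```

For two patients this reproduces the three probabilities listed above, and for a sample of n_i patients it yields the full discrete distribution of R_i in O(n_i²) operations.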



The likelihood function, or the joint probability of all the terms Y_ij, is

Π_{j=1}^{n_i} π_ij^{Y_ij} (1 − π_ij)^{(1 − Y_ij)}

and this provides a simple method of computing the probability of any R_i, given the set of values of π_ij. The probability of observing a number of deaths, d, in the sample of n_i cases is the sum of the likelihood terms corresponding to each of the n_i! / (d! (n_i − d)!) ways that d deaths can occur:

Pr(d deaths) = Σ Π_{j=1}^{n_i} π_ij^{Y_ij} (1 − π_ij)^{(1 − Y_ij)}, the sum taken over all outcome sets such that Σ_{j=1}^{n_i} Y_ij = d.

Figure A6.1 compares the estimates of the cumulative probability function of R_i using the three methods described. The calculations have been performed for a single randomly chosen subset of 100 patients from the dataset.

Figure A6.1: Models of the Cumulative Probability Distribution of R_i
(an example of 100 cases where each risk of death is estimated by APACHE III; x-axis: observed mortality rate)

Figure A6.1 shows that the model using the sample mean mortality rate for all patients provides an overestimate of the variance of R_i. The staircase appearance of the iterative or likelihood function method represents the true distribution of R_i. The continuous distribution of the central limit theorem approximation overall provides a good estimate for RA p charts.

However, in Chapter 5, with the RA CUSUM and RA EWMA charts, the continuous approximation is inaccurate, and further exact methods for the RA EWMA are developed.

Bibliography

1 Knaus WA, Draper EA, Wagner DP and Zimmerman JE. APACHE II: a severity of disease
classification system. Crit Care Med 1985; 13:818 - 829

2 Knaus WA, Wagner DP, Draper EA, Zimmerman JE, Bergner M, Bastos PG, Sirio CA, Murphy
DJ, Lotring T, Damiano A and Harrell FE. The APACHE III prognostic system. Risk prediction of
hospital mortality for critically ill hospitalized adults. Chest 1991; 100:1619 -1636

3 APACHE. APACHE III Management System Software. Washington: APACHE Medical Systems, 1990 - 1999

4 Hastie T, Tibshirani R and Friedman J. The Elements of Statistical Learning: Data Mining, Inference and Prediction. 1st ed. New York: Springer, 2001

5 Justice A, Covinski K and Berlin J. Assessing the generalizability of prognostic information. Arch Intern Med 1999; 130:515 - 524

6 Lemeshow S, Teres D, Klar J, Avrunin J, Gehlbach S and Rapoport J. Mortality probability


models (MPM II) based on an international cohort of ICU patients. JAMA 1993; 270:2478 - 2486

7 Le Gall J, Lemeshow S and Saulnier F. A new simplified acute physiology score (SAPS II) based on a European/North American multi-centre study. JAMA 1993; 270:2957 - 2963

8 Cristianini N and Shawe-Taylor J. The learning methodology: The SVM. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge: Cambridge University Press, 2000

9 Benneyan J and Borgman A. Risk adjusted sequential probability ratio tests and longitudinal surveillance. Int J of Quality in Health Care 2003; 15:5 - 6

10 de Leval MR, Francois K, Bull C, Brawn W and Spiegelhalter D. Analysis of a cluster of surgical
failures. Application to a series of neonatal arterial switch operations. J Thorac Cardiovasc Surg
1994; 107:914-924

11 Lovegrove J, Valencia O, Treasure T, Sheriaw-Johnson C and Gallivan S. Monitoring the results


of cardiac surgery by variable Hfe-adjusted display. Lancet 1997; 350:1128 - 1130

12 Iezzoni L. Dimensions of risk. In: Iezzoni L, ed. Risk Adjustment for Measuring Health Care Outcomes. Chicago, Ill.: Health Administration Press, 1997

13 Spiegelhalter D, Grigg O, Kinsman R and Treasure T. Risk adjusted sequential probability ratio tests: applications to Bristol, Shipman and adult cardiac surgery. Int J for Quality in Health Care 2003; 15:7 - 13

14 Rosenberg AL. Recent innovations in ICU risk-prediction models. Curr Opin Crit Care 2002;
8:321 -330

15 Lim T. Statistical process control tools for monitoring clinical performance. Int J for Quality in Health Care 2003; 15:3 - 4

16 Render M, Kim M, Welsh D, Timmons S, Johnston J, Hui S, Connors A, Wagner D, Daley J and
Hofer T. Automated ICU risk adjustment: Results from a National Veterans Affairs study. Crit
Care Med 2003; 31:1638-1646

17 Kennett R and Zacks S. Chapter 1: The role of statistical methods in modern industry. Modern Industrial Statistics: Design and Control of Quality and Reliability. Belmont, CA: Duxbury Press, 1998; 2 - 13

18 Clermont G and Angus DC. Severity scoring systems in the modern ICU. Ann Acad Med Singapore 1998; 27:397 - 403

19 Gunning K and Rowan K. ABC of intensive care outcome data and scoring systems. BMJ 1999; 319:241 - 244

20 Lemeshow S, Klar J, Teres D, Avrunin J, Gehlbach S, Rapaport J and Rue M. Mortality


probability models for patients in the ICU for 48 or 72 hours: A prospective multicenter study.
Crit Care Med 1994; 22:1351 -1358

21 Le Gall J, Lemeshow S and Saulnier F. Correction: A new simplified acute physiology score (SAPS II) based on a European/North American multicentre study. JAMA 1994; 271:1321

22 Zimmerman JE, Knaus WA, Wagner DP, Sun X, Hakim RB and Nystrom PO. A comparison of risks and outcomes for patients with organ system failure: 1982-1990. Crit Care Med 1996; 24:1633 - 1641

23 Zimmerman JE, Knaus WA, Sun X and Wagner DP. Severity stratification and outcome prediction for multi-system organ failure and dysfunction. World J Surg 1996; 20:401 - 405

24 Knaus WA, Wagner DP, Zimmerman JA and Draper EA. Variations in mortality and length of
stay in ICU. Ann Int Med 1993; 118:753 - 761

25 Johnston JA, Wagner DP, Timmons S, Welsh D, Tsevat J and Render ML. Impact of different
measures of comorbid disease on predicted mortality of ICU patients. Med Care 2002; 40:929 -
940

26 Graham P and Cook D. Risk prediction using 30 day outcome: A practical endpoint for quality
audit. Chest: accepted for publication, October 2003

27 Young D and Ridley S. Mortality as an outcome measure in intensive care. In: Ridley S, ed.
Outcomes in Critical Care. Oxford: Butterworth - Heinemann, 2002; 25 - 46

28 Ridley S. Severity of illness scoring systems and performance appraisal. Anaesthesia 1998;
53:1185-1194

29 Irwig L, Bossuyt P, Glasziou P, Gatsonis C and Lijmer J. Designing studies to ensure that estimates of test accuracy are transferable. BMJ 2002; 324:669 - 671

30 Pappachan JV, Millar B, Bennett D and Smith GB. Comparison of outcome from intensive care admission after adjustment for case mix by the APACHE III prognostic system. Chest 1999; 115:802 - 810

31 Cook DA. Performance of APACHE III models in an Australian ICU. Chest 2000; 118:1732 -
1738

32 Rowan K, Kerr J, Major E, McPherson K, Short A and Vessey M. Intensive Care Society's
APACHE II study in Britain and Ireland -1 Variations in casemix of adult admissions to general
ICUs and impact on outcome. BMJ 1993; 307:972 - 977

33 Rowan K, Kerr J, Major E, McPherson K, Short A and Vessey M. Intensive Care Society's
APACHE II study in Britain and Ireland - II Outcome comparisons of ICUs after adjustments for
casemix by the American APACHE II method. BMJ 1993; 307:977 - 981

34 Rowan KM, Kerr JH, Major E, McPherson K, Short A and Vessey MP. Intensive Care Society's
APACHE II study in Britain and Ireland: a prospective, multi-center, cohort study comparing two
methods for predicting outcome for adult intensive care patients. Crit Care Med 1994; 22:1392 -
1401

35 Castella X, Artigas A, Bion J and Kari A. A comparison of severity of illness scoring systems for
ICU patients: Results of a multicenter, multinational study. Crit Care Med 1995; 23:1327 -1335

36 Beck DH, Taylor BL, Millar B and Smith GB. Prediction of outcome from intensive care: a
prospective cohort study comparing APACHE 11 and III prognostic systems in a United Kingdom
intensive care unit. Crit Care Med 1997; 25:9-15

37 Marshall JC, Cook DJ, Christou NV, Bernard GR, Sprung CL and Sibbald WJ. Multiple Organ
Dysfunction Score: A reliable descriptor of a complex clinical outcome. Crit Care Med 1995;
23:1638-1652

38 Diamond G. What price perfection? Calibration and discrimination of clinical predictive models. J
Clin Epidemiol 1992; 45:85 - 89

39 Hilden J, Habbema JD and Bjerregaard B. The measurement of performance in probabilistic diagnosis: The trustworthiness of the exact values of the diagnostic probabilities. Meth Inform Med 1978; 17:227 - 237

40 Yates FJ. Decompositions of the mean probability score. Organisational Behaviour and Human
Performance 1982; 30:132 -156

41 Hilden J. The area under the ROC curve and its competitors. Med Decis Making 1991; 11:95 -101

42 Zweig HW and Campbell G. ROC plots: A fundamental evaluation tool in clinical medicine. Clin Chem 1993; 39:561 - 577

43 Poses RM, Cebul RD and Center RM. Evaluating physicians' probabilistic judgements. Med Decis Making 1988; 8:233 - 240

44 Jacobs S, Chang R, Lee B and Lee B. Audit of intensive care: a 30 month experience using the
APACHE II severity of disease classification system. Int Care Med 1988; 14:567 - 574

45 Giangiuliani G, Mancini A and Gui D. Validation of a severity of illness score (APACHE II) in a surgical ICU. Int Care Med 1989; 15:519 - 522

46 Oh TE, Hutchinson R, Short S, Buckley T, Lin E and Leung D. Verification of the APACHE scoring system in a Hong Kong ICU. Crit Care Med 1993; 21:698 - 695

47 Wong DT, Crofts SL, Gomez M, McGuire GP and Byrick RJ. Evaluation of predictive ability of
APACHE II system and hospital outcome in Canadian ICU patients. Crit Care Med 1995; 23:1177
-1183

48 Nouira S, Belghith M, Elatrous S, Jaafoura M, Ellouzi M, Boujdaria R, Gahbiche M, Bouchoucha S and Abroug F. Predictive value of severity scoring systems: Comparison of four models in Tunisian adult ICUs. Crit Care Med 1998; 26:852 - 858

49 Markgraf R, Deutschinoff G, Pientka L and Scholten T. Comparison of APACHE II and III and
SAPS II: A prospective cohort study evaluating these methods to predict outcome in a German
interdisciplinary ICU. Crit Care Med 2000; 28:26 - 33

50 Ruttiman UE. Statistical approaches to the development and validation of predictive instruments.
Crit Care Clin 1994; 10:19-35

51 Altman D and Bland J. Diagnostic tests 3: The ROC plot. BMJ 1994; 309:188

52 Glance LG, Osier T and Shinozaki T. ICU prognostic scoring systems to predict death: a cost
effectiveness analysis. Crit Care Med 1998; 26:1842 -1849

53 Henderson AR. Assessing test accuracy and its clinical consequences: a primer for ROC curve
analysis. Ann Clin Biochem 1993; 30:521 - 539

54 Bossuyt P, Reitsma J, Bruns D, Gatsonis C, Glasziou P, Irwig L, Lijmer J, Moher D, Rennie D


and Wde Vet H. Towards complete and accurate reporting of studies of diagnostic accuracy: the
STARD initiative. BMJ 2003; 326:41 - 44

55 Swets JA. Measuring the accuracy of diagnostic systems. Science 1988; 240:1285 - 1293

56 Murphy-Filkins R, Teres D, Lemeshow S and Hosmer DW. Effect of changing patient mix on the
performance of an ICU severity-of-illness model: How to distinguish a general from a specialty
intensive care unit. Crit Care Med 1996; 24:1968 -1973

57 Glance LG, Osier TM and Papadakos P. Effect of mortality rate on the performance of the
APACHE II: a simulation study. Crit Care Med 2000; 28:3424 - 3428

58 Steen PM. Approaches to predictive modelling. Ann Thor Surg 1994; 58:1836 -1840

59 Hanley JA and McNeil BJ. The meaning and use of a ROC curve. Radiology 1982; 143:29 - 36

60 Center RM and Schwartz JS. An evaluation of methods of estimating the area under the ROC
curve. Med Decis Making 1985; 5:149 -156

61 Hanley JA and McNeil BJ. A method of comparing the areas under the ROC curves derived from
the same cases. Radiology 1983; 148:839 - 843

62 Spiegelhalter DJ. Statistical methodology for evaluating gastrointestinal symptoms. Clin Gastro 1985; 14:489 - 515

63 Yates JF and Curley SP. Conditional distribution analyses of probablistic forecasts. J Forecast
1985; 4:61 -73

64 Flora J. A method for comparing survival of burns patients to a standard survival curve. J Trauma 1978; 18:701 - 705

65 Lemeshow S and Hosmer D. A review of goodness of fit statistics for use in the development of
logistic regression models. Am J Epidem 1982; 115:92 - 106

66 Hosmer DW and Lemeshow S. Assessing the fit of the model. Applied logistic regression. New
York: John Wiley and Sons, 1989; 135 -175

67 Vollset S. Confidence intervals for a binomial proportion. Stat Med 1993; 12:809 - 824

68 Armitage P and Berry P. Inferences from proportions. Statistical Methods in Medical Research.
London: Blackwell Scientific Publications, 1994; 118-124

69 Rapoport J, Teres D, Lemeshow S and Gehlbach S. A method for assessing the clinical performance and cost effectiveness of ICUs: a multi-centre inception cohort study. Crit Care Med 1994; 22:1385 - 1391

70 Sherlaw-Johnson C, Lovegrove J, Treasure T and Gallivan S. Likely variations in perioperative mortality associated with cardiac surgery: when does high mortality reflect bad practice? Heart 2000; 84:79 - 82

71 Sherlaw-Johnson C and Gallivan S. Approximating prediction intervals for use in variable life adjusted displays. London: Clinical Operational Research Unit, Dept. Mathematics, University College, 2000

72 Katsaragakis S, Papadimitropoulos K, Antonakis P, Stergiopoulos S, Konstadoulakis MM and Androulakis G. Comparison of APACHE II and SAPS II scoring systems in a single Greek ICU. Crit Care Med 2000; 28:426 - 432

73 Miller M and Hui S. Validation techniques for logistic regression models. Stat Med 1991; 10:1213
-1226

74 Murphy AH. A new vector partition of the probability score. J Appl Meteorology 1973; 12:595 -
600

75 Dolan JG, Bordley DR and Mushlin AI. An evaluation of clinicians' subjective prior probability estimates. Med Dec Making 1986; 6:216 - 223

76 Moreno R and Morais P. Outcome prediction in intensive care: results of a prospective, multicentre, Portuguese study. Int Care Med 1997; 23:177 - 186

77 Zimmerman JE, Wagner DP, Draper EA, Wright L, Alzola C and Knaus WA. Evaluation of
APACHE III predictions of hospital mortality in an independent database. Crit Care Med 1998;
26:1317-1326

78 Beck DH, Smith GB and Taylor BL. The impact of low-risk ICU admissions on mortality
probabilities by SAPS II, APACHE II and APACHE III. Anaesthesia 2002; 57:21 - 26

79 Ash AS and Schwartz M. Evaluating the performance of risk adjustment methods: dichotomous
methods. In: lezzoni LI, ed. Risk Adjustment for Measuring Health Outcomes. Ann Arbor: Health
Admin Press, 1994; 313-346

80 Lee KL, Pryor DB, Harrell FE, Califf RM, Behar VS, Floyd WL, Morris AJ, Waugh RA, Whalen
RE and Rosati RA. Predicting outcome in coronary disease. Am J Med 1986; 80:553 - 560

81 Miller ME, Langefeld CD, Tierney WM, Hui SL and McDonald CJ. Validation of probabilistic predictions. Med Decis Making 1993; 13:49 - 58

82 Le Gall R, Klar J, Lemeshow S, Saulnier F, Alberti C, Artigas A and Teres D. The Logistic Organ Dysfunction System: A new way to assess organ dysfunction in the ICU. JAMA 1996; 276:802 - 810

83 Apolone G, Bertolini G, D'Amico R, lapichino G, Cattaneo A, De Salvo G and Melotti R. The


performance of SAPS II in a cohort of patients admitted to Italian ICUs: Results from GiViTi. Int
Care Med 1996; 33:1368 -1378

84 Sirio CA, Shepardson LB, Rotondi AJ, Cooper GS, Angus DC, Harper DL and Rosenthal GE.
Community-wide assessment of intensive care outcomes using a physiologically based prognostic
measure. Chest 1999; 115:793 - 801

85 Glance L, Osier T and Dick A. Identifying quality outliers in a large, multi - institutional database
by using customised versions of the SAPS II and MPM II. Crit Care Med 2002; 30:1995 - 2002

86 Jacobs S, Chang R and Lee B. One year's experience with the APACHE II severity of disease classification system in a general ICU. Anaesthesia 1987; 42:738 - 744

87 Zimmerman JE, Knaus WA, Judson JA, Havill JH, Trubuhovich RV, Draper EA and Wagner DP.
Patient selection for intensive care: A comparison of New Zealand and United States hospitals.
Crit Care Med 1988; 16:318 - 326

88 Marsh M, Krishan I, Naessens J, Strickland R, Gracet D, Campion M, Nobrega F, Southorn P, McMichan J and Kelly M. Assessment of prediction of mortality by using the APACHE II scoring system in ICUs. Mayo Clin Proc 1990; 65:1549 - 1557

89 Berger M, Marazzi A, Freeman J and Chiolero R. Evaluation of the consistency of APACHE II


scoring in a surgical ICU. Crit Care Med 1992; 20:1681 -1687

90 Sirio C, Tajima K, Tase C, Knaus W, Wagner D, Hirasawa H, Sakanishi N, Katsuya H and


Taenaka N. An initial comparison of intensive care in Japan and in the United States. Crit Care
Med 1992; 20:1207-1215

91 Le Gall J, Lemeshow S, Leleu G, Klar J, Huillard J, Rue M, Teres D and Artigas A. Customised
probability models for early severe sepsis in adult intensive care. JAMA 1995; 273:644 - 650

92 Moreno R, Miranda DR, Fidler V and Van Schilfgaarde R. Evaluation of two outcome prediction
models on an independent database. Crit Care Med 1998; 26:50 - 61

93 Tan IKS. APACHE II and SAPS II are poorly calibrated in a Hong Kong ICU. Ann Acad Med Singapore 1998; 27:318 - 322

94 Wong LS and Young JD. A comparison of ICU mortality prediction using the APACHE II scoring
system and artificial neural networks. Anaesthesia 1999; 54:1048 -1054

95 Buist M, Gould T, Hagley S and Webb R. An analysis of excess mortality not predicted to occur by APACHE III in an Australian Level III ICU. Anaes Int Care 2000; 28:171 - 177

96 Janssens U, Graf C, Graf J, Radke P, Konigs B, Koch K, Lepper W, vom Dahl J and Hanrath P.
Evaluation of the SOFA score: a single center experience of a medical ICU unit in 303
consecutive patients with predominantly cardiovascular disorders. Int Care Med 2000; 26:1037 -
1045

97 Livingston BM, MacKirdy FN, Howie JC, Jones R and Norrie JD. Assessment of the performance
of five intensive care scoring models within a large Scottish database. Crit Care Med 2000;
28:1820-1827

98 Arabi Y, Haddad S, Goraj R, Al-Shimemeri A and Al-Malik S. Assessment of performance of 4 mortality prediction systems in a Saudi Arabian ICU. Crit Care Med 2002; 6:166 - 174

99 Glance L, Osler T and Dick A. Rating the quality of intensive care units: Is it a function of the ICU scoring system? Crit Care Med 2002; 30:1976 - 1982

100 Bastos PG, Sun X, Wagner DP, Knaus WA and Zimmerman JE. Application of the APACHE III
prognostic system in Brazilian intensive care units: a prospective multicenter study. Int Care Med
1996;22:564-570

101 APACHE C. APACHE Resources: http://www.apache-web.com/public/pub main.html. 2003

102 Cowen JS and Kelly MA. Errors and bias in using predictive scoring systems. Crit Care Clin
1994; 10:53-72

103 Cook D, Joyce C, Barnett R, Birgan S, Playford H, Cockings J and Hurford R. Prospective independent validation of APACHE III models in an Australian tertiary referral ICU. Anaesth Int Care 2002; 30:308 - 315

104 Levett J and Carey R. Measuring for improvement: Toyota to thoracic surgery. Ann Thorac Surg
1999; 68:353 - 358

105 Kennett R and Zacks S. Chapter 10: Basic tools and principles of statistical process control. Modern Industrial Statistics: Design and Control of Quality and Reliability. Belmont, CA: Duxbury Press, 1998; 322 - 359

106 Montgomery D. Chapter 4: Methods and philosophy of statistical process control. Introduction to Statistical Quality Control. New York: John Wiley and Sons, 1996

107 Ryan T. Statistical Methods for Quality Improvement. New York: John Wiley and Sons, 1989

108 Seto T, Mittleman M, Davis R, Taira D and Kawachi. Seasonal variation in coronary artery
disease mortality in Hawaii: an observational study. BMJ 1998; 316:1946

109 Montgomery D. Chapter 6: The control chart for fraction non-conforming. Introduction to
Statistical Quality Control. New York: John Wiley and Sons, 1996

110 Benneyan J. Statistical quality control methods in infection control and hospital epidemiology.
Part II: Chart use, statistical properties and research issues. Infection Control and Hospital
Epidemiology 1998; 19:265-283

111 Hawkins D and Olwell D. Theoretical foundations of the CUSUM. Cumulative Sum Charts and
Charting for Quality Improvement. New York: Springer Verlag, 1998

112 Hawkins D and Olwell D. Introduction. Cumulative Sum Charts and Charting for Quality
Improvement. New York: Springer Verlag, 1998; 1-29

113 Kennett R and Zacks S. Chapter 11: Advanced methods of statistical process control. Modern
Industrial Statistics: Design and Control of Quality and Reliability. Belmont, CA: Duxbury Press,
1998; 360-407

114 Montgomery D. Chapter 7: Cusum and EWMA control charts. Introduction to Statistical Quality
Control. New York: John Wiley and Sons, 1996; 313-347

115 Poloniecki J, Valencia O and Littlejohns P. Cumulative risk adjusted mortality chart for detecting
changes in death rate: observational study of heart surgery. BMJ 1998; 316:1697 -1700

116 Lawrance R, Dorsch M, Sapsford R, Macintosh A, Greenwood D, Jackson B, Morrell C, Robinson
M and Hall A. Use of cumulative mortality data in patients with acute myocardial infarction for
early detection of variation in clinical practice: Observational study. BMJ 2001; 323:324-327

117 Tekkis PP, McCulloch P, Steger AC, Benjamin IS and Poloniecki JD. Mortality control charts for
comparing performance of surgical units: validation study using hospital mortality data. BMJ
2003; 326:786-791

118 Alemi F and Sullivan T. Tutorial on risk adjusted X-bar charts: Applications to measurement of
diabetes control. Quality Management in Health Care 2001; 9:57-65

119 Alemi F and Oliver D. Tutorial on risk adjusted p charts. Quality Management in Health Care
2001; 10:1-9

120 Khuri S, Daley J, Henderson W, Hur K, Gibbs J, Barbour G, Demakis J, Irvin G, Stremple J,
Grover F, McDonald G, Passaro E, Fabri P, Spencer J, Hammermeister K and Aust J. Risk
adjustment of the postoperative mortality rate for the comparative assessment of the quality of
surgical care: Results of the National Veterans Affairs surgical risk study: Part 1. J Amer Coll
Surg 1997; 185:315-327

121 Daley J, Khuri S, Henderson W, Hur K, Gibbs J, Barbour G, Demakis J, Irvin G, Stremple J,
Grover F, McDonald G, Passaro E, Fabri P, Spencer J, Hammermeister K and Aust J. Risk
adjustment of the postoperative mortality rate for the comparative assessment of the quality of
surgical care: Results of the National Veterans Affairs surgical risk study: Part 2. J Amer Coll
Surg 1997; 185:328-340

122 Graham PL, Kuhnert PM, Cook DA and Mengersen K. Improving the quality of patient care using
reliability measures: A classification free approach. Stat Med; Submitted for publication March
2003

123 Clermont G, Angus DC, DiRusso SM, Griffin M and Linde-Zwirble WT. Predicting hospital
mortality for patients in the intensive care unit: a comparison of artificial neural networks with
logistic regression models. Crit Care Med 2001; 29:291 - 296

124 Glance LG, Osler T and Shinozaki T. Effect of varying the casemix on the SMR and W statistic.
Chest 2000; 117:1112-1116

125 Iezzoni L. Statistically derived predictive models: caveat emptor (Editorial). J Gen Intern Med
1999;14:388-389

126 Zhu HP, Lemeshow S, Hosmer DW, Klar J, Avrunin J and Teres D. Factors affecting the
performance of the models in the MPM II system and strategies of customization: A simulation
study. Crit Care Med 1996; 24:57-63

127 Maxwell S and Delaney H. The logic of experimental design: Threats to the validity of inferences
from experiments. Designing experiments and analyzing data: A model comparison perspective.
Belmont, CA: Wadsworth Publishing Company, 1990; 25-32

128 Berenholtz S, Dorman T, Ngo K and Pronovost P. Qualitative review of ICU quality indicators. J
Crit Care 2002; 17:1-15

129 Sivak ED and Rogers MAM. Assessing quality of care using in-hospital mortality: Does it yield
informed choices? (Editorial). Chest 1999; 115:613 - 614

130 Sheldon T. Promoting health care quality: what role performance indicators? Quality in Health
Care 1998; 7:S45-50

131 Rosen AK, Ash AS, McNiff KJ and Moskowitz MA. The importance of severity of illness
adjustment in predicting adverse outcomes in the Medicare population. J Clin Epidemiol 1995;
48:631-643

132 Kutsogianis DJ and Noseworthy T. Quality of life after ICU. In: Ridley S, ed. Outcomes in
Critical Care. Oxford: Butterworth Heinemann, 2002; 139 - 168

133 Poloniecki J. Letter. BMJ 1998:1453

134 Parsonnet V, Dean D and Bernstein A. A method of uniform stratification of risk for evaluating
the results of surgery in acquired adult heart disease. Circulation 1989; 79(S1):S3-S12

135 Jones D, Copeland G and de Cossart L. Comparison of POSSUM with APACHE II for prediction
of outcome from a surgical high dependency unit. Br J Surg 1992; 79:1293 -1296

136 Orr R, Maini B, Sottile F, Dumas E and O'Mara P. A comparison of four severity adjusted models
to predict mortality after coronary artery bypass graft surgery. Arch Surg 1995; 130:301 - 306

137 Weightman W, Gibbs N, Sheminant M, Thackray N and Newman M. Risk prediction in coronary
artery surgery: a comparison of four risk scores. Medical Journal of Australia 1997; 166:408-411

138 Steiner S, Cook R and Farewell V. Risk adjusted monitoring of surgical outcomes. Medical
Decision Making 2001; 21:163-169

139 Gallivan S, Lovegrove J and Sherlaw-Johnson C. Letter. BMJ 1998; 317:1453

140 Steiner SH, Cook RJ and Farewell VT. Monitoring paired binary surgical outcomes using
cumulative sum charts. Statistics in Medicine 1999; 18:69-86

141 Cook D, Steiner S, Cook R, Farewell V and Morton A. Monitoring the evolutionary process of
quality: Risk adjusted charting to track outcomes in intensive care. Crit Care Med 2003

142 Steiner S, Cook R, Farewell V and Treasure T. Monitoring surgical performance using risk-
adjusted cumulative sum charts. Biostatistics 2000; 1:441 - 452

143 Hanson WC and Marshall BE. Artificial intelligence applications in the ICU. Crit Care Med 2001;
29:427 - 435

144 Maren A, Harston C and Pap R, eds. Handbook of Neural Computing. San Diego: Harcourt Brace
Jovanovich, 1990

145 Statistica Neural Networks. Tulsa, Oklahoma: Statsoft, 1998

146 Bishop C, ed. Neural Networks and Machine Learning. Berlin: Springer Verlag, 1998

147 Reed RD and Marks RJ. Neural Smithing: Supervised learning in feedforward artificial neural
networks. 1st ed. Cambridge, Massachusetts: MIT Press, 1999

148 Anderson J. Chapter 13: Nearest neighbour classifiers. An Introduction to Neural Networks.
Cambridge, Mass.: MIT Press, 1995

149 Maren A. NN structure: Form follows function. In: Maren A, Harston C and Pap R, eds.
Handbook of Neural Computing. San Diego: Harcourt Brace Jovanovich, 1990

150 Specht DF. A general regression neural network. IEEE Transactions on Neural Networks 1991;
2:568 - 576

151 Parzen E. Mathematical considerations in the estimation of spectra. Technometrics 1961; 3:167 -
190

152 Parzen E. On estimation of a probability density function and mode. Annals of Mathematical
Statistics 1962; 33:1065 -1076

153 Floyd CE, Lo JY, Yun AJ, Sullivan DC and Kornguth PJ. Prediction of breast cancer malignancy
using an artificial neural network. Cancer 1994; 74:2944 - 2948

154 Ortiz J, Ghefter CGM, Silva CES and Sabbatini RME. One year mortality prognosis in heart
failure: A neural network approach based on echocardiographic data. JACC 1995; 26:1586 -1593

155 Selker HP, Griffin JL, Patil S, Long WL and D'Agostino RB. A comparison of performance of
mathematical predictive methods for medical diagnosis: Identifying acute cardiac ischaemia
among emergency department patients. J Investigative Med 1995; 43:468 - 476

156 Doyle HR, Parmanto B, Munro WP, Marino IR, Aldrighetti L, Doria C, McMichael J and Fung JJ.
Building clinical classifiers using incomplete observations - A neural network ensemble for
hepatoma detection in patients with cirrhosis. Meth Inform Med 1995; 34:253 - 258

157 Setiono R. Extracting rules from pruned ANN for breast cancer diagnosis. AI in Med 1996; 8:37 -
51

158 Itchhaporia D, Snow PB, Almassy RJ and Oetgen WJ. ANN: Current status in cardiovascular
medicine. JACC 1996; 28:515 - 521

159 Eisenstein EL and Alemi F. A comparison of 3 techniques for rapid model development: An
application in patient risk stratification. Proc Med Informat Ass 1996:443-447

160 Lette J, Colletti BW, Cerino M, McNamara D, Eybalin MC, Levasseur A and Nattel S. Artificial
intelligence vs logistic regression statistical modelling to predict cardiac complications after non-
cardiac surgery. Clin Cardiol 1994; 17:609-614

161 Doyle HR, Dvorchik I, Mitchell S, Marino IR, Ebert FH, McMichael J and Fung JJ. Predicting
outcome after liver transplantation. Ann Surg 1994; 219:408-415

162 Hamamoto I, Okada S, Hashimoto T, Wakabayashi H, Maeba T and Maeta H. Prediction of the
early prognosis of the hepatectomised patient with hepatocellular carcinoma with a neural
network. Comput Biol Med 1995; 25:49-59

163 Dombi GW, Nandi P, Saxe JM, Ledgerwood AM and Lucas CE. Prediction of rib fracture
outcome by an artificial neural network. J Trauma, Infection and Critical Care. 1995; 39:915-921

164 Dvorchik I, Subotin M, Marsh W, McMichael J and Fung JJ. Performance of multi-layer
feedforward neural network to predict liver transplantation outcome. Meth Inform Med 1996;
35:12-18

165 Izenberg SD, Williams MD and Luterman A. Prediction of trauma mortality using a neural
network. American Surgeon 1997; 63:275 - 281

166 Jefferson MF, Pendleton N, Lucas SB and Horan MA. Comparison of a genetic algorithm neural
network with logistic regression for predicting outcome after surgery for patients with non-small
cell lung carcinoma. Cancer 1997; 79:1338 -1342

167 Reibnegger G, Weiss G, Werner-Felmayer G, Judmaier G and Wachter H. Neural network as a
tool for utilizing laboratory information: Comparison with linear discriminant analysis and with
classification and regression trees. Proc. Natl. Acad. Sci. USA 1991; 88

168 Forsstrom JJ and Dalton KJ. Artificial neural network for decision support in clinical medicine.
Ann Med 1995; 27:509-517

169 Jorgensen JS, Pedersen JB and Pedersen SM. Use of neural network to diagnose acute myocardial
infarction: Methodology. Clin Chem 1996; 42:604 - 612

170 Brier ME and Aronoff GR. Application of artificial neural network to clinical pharmacology. Int J
Clin Pharm Therapeutics 1996; 34:510-514

171 Lippmann RP and Shahian DM. Coronary artery bypass risk prediction using neural networks.
Ann Thor Surg 1997; 63:1635 -1643

172 Orr RK. Use of a probabilistic neural network to estimate risk of mortality after cardiac surgery.
Med Dec Making 1997; 17:178 - 185

173 Buchman TG, Kubos KL, Seidler AJ and Siegforth MJ. A comparison of statistical and
connectionist models for the prediction of chronicity in a surgical ICU. Crit Care Med. 1994;
22:750 - 762

174 Mobley BA, Leasure R and Davidson L. Artificial neural network predictions of lengths of stay on
a post coronary care unit. Heart and Lung 1995; 24:251 - 256

175 Doig GS, Inman KJ, Sibbald WJ, Martin CM and Robertson JM. Modelling mortality in the ICU:
comparing the performance of a back propagation, associative learning neural network with
multivariate logistic regression. Proc. Ann. Symp. Computer Application in Med Care. 1994;
17:361-365

176 Dybowski R, Weller P, Chang R and Gant V. Prediction of outcome in critically ill patients using
artificial neural networks synthesized by genetic algorithm. Lancet 1996; 347:1146 -1150

177 Frize M, Ennett CM, Stevenson M and Trigg HCE. Clinical decision support systems for ICU:
Using artificial neural networks. Med Eng Physics 2001; 23:217-225

178 Nimgaonkar A, Sudarshan S and Karnad DR. Prediction of mortality in an Indian ICU:
Comparison between APACHE II and artificial neural network. (Hansraj Prize paper).
Proceedings of the Annual Scientific Meeting, Indian Society of Critical Care Medicine. 2001:43 -
46

179 Paetz J. Some remarks on choosing a method for outcome prediction (letter). Crit Care Med 2002;
30:724

180 Vapnik V. Statistical Learning Theory. New York: Wiley, 1998



181 Burges C. A tutorial on SVM for pattern recognition. Data Mining and Knowledge Discovery
1998;2:121-167

182 Campbell C. An Introduction to Kernel Methods. In: Howlett R and Jain L, eds. Radial Basis
Function Networks: Design and Application. Berlin: Springer Verlag, 2000

183 Cristianini N and Shawe-Taylor J, eds. An Introduction to Support Vector Machines and Other
Kernel-Based Learning Methods. 1st ed. Cambridge: Cambridge University Press, 2000

184 Scholkopf B, Burges C and Smola A. Introduction to support vector learning. In: Scholkopf B,
Burges C and Smola A, eds. Advances in Kernel Methods: Support Vector Learning. Cambridge:
MIT Press, 1999

185 Mattera D and Haykin S. SVM for dynamic reconstruction of a chaotic system. In: Scholkopf B,
Burges C and Smola A, eds. Advances in Kernel Methods: SVM Learning. Cambridge,
Massachusetts: MIT Press, 2000

186 Morik K, Brockhausen P and Joachims T. Combining statistical learning with a knowledge based
approach - A case study in intensive care monitoring: http://www-ai.informatik.uni-
dortmund.de/DOKUMENTE/morik_etal_99a.pdf. 1999

187 Morik K, Imhoff M, Brockhausen P, Joachims T and Gather U. Knowledge discovery and
knowledge validation in intensive care. AI in Med 2000; 19:225 - 249

188 Graham P and Cook D. Risk prediction using 30 day outcome: A practical endpoint for quality
audit. Submitted to Chest, January 2003

189 Graham PL, Kuhnert PM, Cook DA and Mengersen K. Improving the quality of patient care using
reliability measures: A classification free approach. Stat Med 2003; Submitted for publication
March 2003

190 Ripley B. Statistical theories of model fitting. In: Bishop C, ed. Neural Networks and Machine
Learning. Berlin: Springer/NATO Scientific Affairs Division, 1998

191 Joachims T. SVMlight. Dortmund: University of Dortmund, Informatik, AI-Unit, "Collaborative
Research Center on Complexity Reduction in Multivariate Data", 2002

192 Joachims T. Making large scale SVM learning practical. In: Scholkopf B, Burges C and Smola A,
eds. Advances in Kernel Methods: Support Vector Machine Learning. Cambridge: MIT Press,
2000

193 Mathworks. MATLAB. Boston: The Math Works Inc., 2002

194 Schwaighofer A. Matlab Wrapper for SVMlight: anton.schwaighofer@gmx.net, 2002

195 Statistica. Tulsa: Statsoft, 1984 - 2000

196 Jencks SF, Williams DK and Kay TL. Assessing hospital associated deaths from discharge data:
The role of length of stay and co-morbidities. JAMA 1988; 260:2240 - 2246

197 Glance LG and Szalados J. Benchmarking in critical care (editorial). Chest 2002; 121:326 - 328

198 Weston J, Gammerman A, Stitson M, Vapnik V, Vovk V and Watkins C. Support vector density
estimation. In: Scholkopf B, Burges C and Smola A, eds. Advances in Kernel Methods: Support
Vector Machine Learning. Cambridge: MIT Press, 2000

199 Sherlaw-Johnson C and Gallivan S. Approximating prediction intervals for use in variable life
adjusted displays. London: Clinical Operational Research Unit, Dept. Mathematics, University
College, 2000

200 Neal R. Assessing the relevance determination methods using DELVE. In: Bishop C, ed. Neural
Networks and Machine Learning: Springer-Verlag, 1998; 97-129

201 Kittler J, Pudil P and Somol P. Advances in statistical feature selection. In: Singh S, Murshed N
and Kropatsch W, eds. ICAPR 2001. Berlin: Springer-Verlag, 2001

202 Guyon I and Elisseeff A. An introduction to variable and feature selection. J Machine Learning
Research 2003; 3:1157-1182

203 Montgomery D. Chapter 8: Other statistical process control techniques. Introduction to Statistical
Quality Control. New York: John Wiley and Sons, 1996

204 Montgomery D. Introduction to Statistical Quality Control. New York: John Wiley and Sons,
1996

205 Hawkins D and Olwell D. Cumulative Sum Charts and Charting for Quality Improvement. New
York: Springer, 1998

206 Kennett R and Zacks S. Modern Industrial Statistics: Design and Control of Quality and
Reliability. 1st ed. Belmont, CA: Duxbury Press, 1998
