
Journal of Hydrology (2008) 352, 379–387



A critique of present procedures used to compare performance of rainfall-runoff models


Robin T. Clarke
Instituto de Pesquisas Hidráulicas, Avenida Bento Gonçalves 9500, Caixa Postal 15029, CEP 91501-970, Porto Alegre, RS, Brazil
Received 22 June 2007; received in revised form 21 January 2008; accepted 31 January 2008

KEYWORDS: Rainfall-runoff models; Model comparison; Experimental design

Summary The purpose of this paper is to stimulate debate on how to compare models of watershed behavior. It is argued that procedures commonly used at present to compare the performance of rainfall-runoff models are unsatisfactory for several reasons, but principally because they provide no measure of the uncertainty (as measured by an estimate of residual variation between experimental units) in differences between measures of model performance. It is also concluded that the present procedures, in which each model is usually calibrated and also validated by persons most familiar with its use, do not provide a sound basis for recommending any particular model for use by the hydrological community at large. Whilst the principles of good experimental design (replication, randomization) are widely practised in other fields of applied science, they are not yet widely practised in hydrology (and other geophysical sciences where models are essential tools). It is argued that these principles can and should be applied where experiments are designed to compare the performance of hydrological models. Designs are proposed in which different periods of record from within each watershed are used to calibrate (or validate) each of the models being compared; subject to the random allocation of models to test periods, the designs provide a valid measure of the uncertainty in measures of model performance. The basic design can be extended to explore whether models perform differently on different types of watershed; it is also shown how it can be adapted, through the use of balanced incomplete blocks, to avoid the need to calibrate (and validate) every model on every watershed.
© 2008 Elsevier B.V. All rights reserved.

Introduction
Over the past 25 years, very many hydrological models have been proposed in the literature to describe mechanisms by which precipitation, or the lack of it, influences runoff.

E-mail address: clarke@iph.ufrgs.br

0022-1694/$ - see front matter © 2008 Elsevier B.V. All rights reserved. doi:10.1016/j.jhydrol.2008.01.026

Two recent books (Singh and Frevert, 2002, 2006) comprehensively illustrate the immense list of applications for such models; in the latter book, the authors selected 24 models for inclusion in their volume, although many more could have been included. The range of application of watershed models is also increasing; some limited success is reported in their use with quantitative precipitation forecasts (QPFs) derived from weather prediction models for operational flood-forecasting (Collischonn et al., in press, and references cited therein), and if concerns about global climate change continue to grow and impacts on water resources are felt, the use of watershed models is likely to become ever more widespread. Procedures for the operational testing of models where land use or climate is changing were defined in the classical work of Klemeš (1986), and the number and variety of models have increased enormously ever since. Faced with this plethora of watershed models, practitioners need to be aware of their characteristics, strengths and weaknesses, and attempts have been made to identify models which were, in some senses, better than others.

The earliest model inter-comparisons were set up by the World Meteorological Organization (WMO); the first (WMO, 1975) was a comparison of conceptual models, the second (WMO, 1986) included a comparison of models with snowmelt components, and the third (WMO, 1992) compared models for forecasting streamflow in real time. More recently, a Special Issue of the Journal of Hydrology (Schaake and Duan, 2006) reported results of the model parameter estimation experiment (MOPEX), in which data from 12 basins in the southeastern USA were distributed and used to compare the performance of eight hydrological models (Duan et al., 2006). Model intercomparisons are reported from the Mediterranean region (Diomede et al., 2006), with reference to the uncertainty in predicting discharge, and from Brazil (Guilhon et al., in press) for the purpose of forecasting inflows into hydropower-generating reservoirs. Bowling et al. (www.ce.washington.edu/pub/HYDRO/lxb/presentations/) have described an intercomparison study involving 21 models, with research groups participating from 11 countries, in a multi-year, spatial intercomparison of hydrology and land surface transfer schemes in northern regions. Its purpose is to evaluate the performance of land surface parameterizations in high latitudes, in a context that allows evaluation of their ability to capture key processes spatially.

Model intercomparison studies continue to gain momentum. The distributed model intercomparison project (DMIP), organized by the US National Weather Service (US-NWS) (Smith et al., 2004), was formulated as a broad comparison of many distributed models amongst themselves and with a lumped model used for operational river forecasting in the US. Twelve groups participated, including groups from Canada, China, Denmark, New Zealand, and the US. Numerous data sets, including 7 years of concurrent radar-rainfall and streamflow data, were provided to participants. DMIP has now developed into a second phase, DMIP 2, which is designed to answer a number of research questions arising from DMIP 1 (Smith et al., 2007), including the following: can distributed hydrologic models provide increased simulation accuracy compared to lumped models? If so, under what conditions? Are improvements constrained by forcing data (i.e., precipitation and meteorological data) quality?

What simulation improvements can be realized through the use of a more recent period of radar precipitation data than was used in DMIP 1? What is the performance of (distributed) models if they are calibrated with observed precipitation data but use forecasts of precipitation? Can distributed models reasonably predict processes such as runoff generation and soil moisture re-distribution at interior locations? In what ways do routing schemes contribute to the simulation success of distributed models? What is the nature of spatial variability of rainfall and basin physiographic features, and the effects of their variability on runoff generation processes? What physical characteristics (basin shape, feature variability) and/or rainfall variability warrant the use of distributed hydrologic models for improved basin outlet simulations? What is the potential for distributed models set up for basin outlet simulations to generate meaningful hydrographs at interior locations for flash flood forecasting?

It is therefore clear that much research effort is being applied by many groups to answer such questions. If this effort is to lead to clear and unambiguous conclusions about the relative merits of different models, it is essential that the procedures designed to compare them satisfy the requirements necessary for scientific rigour. When the performance of rainfall-runoff models is compared, for whatever purpose, the comparison consists essentially of an experiment in the classical sense, in which treatments (models) are applied to experimental units (data sets) and a measure is taken of the model performance on each data set. The way in which models are allocated to data sets, and in which performance measures are taken, constitutes the experimental design; the interpretation of the results constitutes the analysis of the experiment. The purpose of this paper is to draw attention to the need for good experimental design and analysis, and to suggest ways in which currently-used experimental designs could be improved.

Much of what is set out in the present paper is already implicit in the paper by Klemeš (1986), who defined operational procedures for the systematic testing of hydrological simulation models based upon split-sample testing, differential split-sample testing (for assessing effects of climatic or land-use change) and proxy-basin testing (for assessing the transposability of a model within a region or from one region to another). The present paper extends the procedures recommended by Klemeš by suggesting how the results of different model validations can be compared statistically. To quote from his 1986 paper (p. 20): "the model [is] fitted to both [substitute basins] and the results of the two validation runs compared. Only if the results are similar can the model be judged adequate." Also (p. 21): "[The model] is judged adequate if errors in both validation runs ... are acceptable and not significantly different." The experimental designs described in the present paper, when properly executed, allow the errors mentioned in this quote to be properly quantified, and the uncertainty in differences between model validations (earlier quote) to be assessed objectively. Finally, although the emphasis of this paper is on aspects of experimental design for comparing rainfall-runoff models, it is argued that principles of good experimental design apply more widely, to wherever models of natural systems are compared.



A review of current practice for model comparison


A typical procedure adopted where it is required to compare the performance of a number T of rainfall-runoff models, denoted by M1, M2, ..., MT, is broadly as follows:

(i) Criteria of model performance are selected. These may be, for example, the Nash and Sutcliffe (1970) measures of goodness of model fit (R²) during both calibration and verification stages of model use; or they may be the values of R² calculated after transforming observed and fitted flows to logarithms or square roots (Chiew and McMahon, 1994); or some other measure of model performance (Ye et al., 1997). Wagener et al. (2002) give a list of eight measures (including the Nash–Sutcliffe R²) used in a toolkit for the development and application of parsimonious hydrological models; many others could also be proposed. Smith et al. (2007) describe an extensive list of metrics to be used in DMIP 2 for comparing model performance when modeling soil moisture, in addition to criteria based on the comparison of observed and predicted streamflows. (A minimal sketch of how such a criterion is computed is given after Table 1 below.)

(ii) Records of rainfall and runoff (together possibly with other hydrological and meteorological records) are assembled from each of B watersheds, denoted by W1, W2, ..., WB, which are to supply data for the model comparison. Commonly two sub-periods are selected from each record, one to be used for model calibration and the other for model validation. The Brazilian intercomparison study (Guilhon et al., in press) used daily runoff calculated as an average of estimated discharges at 07:00 h and 17:00 h; DMIP 2 uses both hourly and daily runoff discharges (Smith et al., 2007). Depending on modeling objective and data availability, precipitation data may be daily raingauge-network totals, hourly totals, radar estimates of rainfall, or forecasts of rainfall given by atmospheric models such as Eta (Smith et al., 2007). The basins used by MOPEX (Duan et al., 2006) included daily precipitation, daily maximum and minimum temperatures, daily streamflow data and potential evaporation, together with land surface characteristics; Duan et al. (2006) emphasize that the quality of precipitation data is critically important for model comparison purposes.

(iii) Each of the T models is calibrated using the records from each of the B basins. The criteria of model performance are calculated; each model is then run with data from the period set aside for model verification, and the performance criteria are again calculated.

(iv) The results are typically assembled in a number of paired tables with B rows and T columns, one pair for each criterion of model performance, with one table of each pair showing the criteria obtained during model calibration, and the other showing the criteria obtained during model validation. There may also be a third table showing model performances when no calibration is used: see Table 2 of Reed et al. (2004), where the "no calibration" two-way table is shown side-by-side with the calibration table.

An example of such a table (for the validation period only) is shown in Table 1 below, in which the Nash and Sutcliffe (1970) coefficient R² is used to measure model performance. The tables are examined to compare model performances and to identify, if possible, a "best" model according to the criteria adopted. Alternatively, one of the T models, which can be denoted by M1, is taken as a Control with which the remaining models are compared, so as to ascertain which, if any, performs better than the standard. In the DMIP study (Smith et al., 2004) a lumped model was taken as the Control against which distributed models were evaluated. In practice, any such table may be incomplete where, for whatever reason, a model was not tested on data sets from one or more basins; this complicates the analyses discussed later in this paper, but does not invalidate them. Thus Reed et al. (2004) reported that 198 out of 360 possible combinations of models with data-sets were obtained from the 12 participants of DMIP.

As mentioned in the introduction above, the procedures outlined in (i)–(iv) above specify the design of what is really an experiment set up to compare the model performances. In other fields of science, notably agriculture and medicine, comparative experiments are essential for the selection of new crop varieties and new medicines which may lead to better agricultural production and better treatment of disease, so that sophisticated designs for comparative experiments have been developed and are in everyday use, built upon the foundations laid originally by Fisher (1925, 1947). Good practice in such experiments ensures that the results are clear-cut and that unwanted factors which might blur the clarity of the comparisons are as far as possible eliminated. The calculated values of criteria used to make comparisons will remain subject to uncontrolled variability, but good experimental design provides a measure of this uncontrolled variability, which is used as a yardstick against which comparisons of interest can be assessed. One of the commonest experimental designs in very many fields of applied science is the randomized-block design (RB), leading to the two-way analysis of variance described in many statistical texts, and used for the comparison of treatments which are randomly allocated to experimental units.

Table 1  Scheme showing the structure of the table obtained when T models are validated on periods of record from B watersheds, with the Nash and Sutcliffe (1970) coefficient used as a measure of model performance

                Model 1    Model 2    ...    Model T
Watershed 1     R²_11      R²_12      ...    R²_1T
Watershed 2     R²_21      R²_22      ...    R²_2T
...             ...        ...        ...    ...
Watershed B     R²_B1      R²_B2      ...    R²_BT
Means           R̄²_1       R̄²_2       ...    R̄²_T
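To make the entries of Table 1 concrete, the following minimal Python sketch computes the Nash–Sutcliffe coefficient for each watershed-model pair and the column means. The synthetic flow series, the dictionary layout and all names are illustrative assumptions, not part of any published intercomparison protocol.

```python
import numpy as np

def nash_sutcliffe(observed, simulated):
    """Nash-Sutcliffe coefficient: 1 - sum((q - qhat)^2) / sum((q - qbar)^2)."""
    observed = np.asarray(observed, dtype=float)
    simulated = np.asarray(simulated, dtype=float)
    return 1.0 - np.sum((observed - simulated) ** 2) / np.sum((observed - observed.mean()) ** 2)

# Hypothetical validation-period flows for B watersheds and T models:
# flows[w] holds the observed series, sims[(w, m)] the series simulated by model m.
rng = np.random.default_rng(42)
watersheds = ["W1", "W2", "W3"]
models = ["M1", "M2"]
flows = {w: rng.gamma(2.0, 5.0, size=365) for w in watersheds}
sims = {(w, m): flows[w] + rng.normal(0.0, 2.0, size=365) for w in watersheds for m in models}

# B x T table of R^2 values, as in Table 1, plus the column mean for each model.
table = np.array([[nash_sutcliffe(flows[w], sims[(w, m)]) for m in models] for w in watersheds])
print(table.round(3))
print("model means:", table.mean(axis=0).round(3))
```

With real data, the `sims` entries would simply hold each model's simulated validation-period flows for the corresponding watershed.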

One or more measures of treatment performance are recorded on each experimental unit, and hypotheses are tested using the two-way analysis of variance (ANOVA) of each measure (e.g. Cochran and Cox, 1957). In a RB experiment in which T treatments are allocated at random to the T experimental units within each of B blocks, the ANOVA for the RB design separates the total variability amongst the BT values of the performance measure into three components: one measuring the variability between the blocks, another measuring the variability arising from the imposition of the treatments, and the third providing a measure of the uncontrolled or residual variability (i.e. the variability left over after the variability due to blocks, and to treatments, has been accounted for). Good estimation by ANOVA of this third component, the residual variation, is of the highest importance, because it is the yardstick against which differences between treatments are measured. The larger the variability due to treatments relative to the residual variability, the greater is the confidence that differences between the treatments are real, and not the result of random uncontrolled variation amongst the experimental units. When the residual variation can be calculated, it becomes possible to deal more rigorously with the question of significance mentioned in the two quotes from Klemeš (1986) cited above.

In the present context, the "treatments" of the statistical literature are the hydrological models that are to be compared. If the B watersheds are considered to be the "blocks" of the RB experiment, then superficially the two-way table (watersheds × models) of a measure of hydrological model performance, such as R², can be viewed as analogous to the two-way table (blocks × treatments) that is the basis of the ANOVA described in the preceding paragraph. Superficially, a mean value of R² could be calculated for each model over the B watersheds, standard errors for these mean values could be calculated from the residual variation given by the ANOVA, and models performing significantly better than others, or than a standard model, could be identified. In the case of published results from model intercomparison studies, it appears that interpretation has been by means of visual appraisal of the two-way tables, rather than by any formal statistical analysis. One reason for this may be that hydrological modelers may be less familiar with the methodology of experimental design and analysis than with methodologies of time series analysis and numerical optimization.

However, to interpret the model intercomparison procedure listed in steps (i)–(iv) as a form of RB experiment is misleading. The design given in (i)–(iv) provides no measure of residual variation, so that it fails to provide any statistical measure of which models out-perform or under-perform in test. Although it might be argued that visual inspection of performance criteria is likely to be sufficient for comparing rainfall-runoff models and that no measure of residual variation is required, the above design remains unsatisfactory because it is not the models themselves that are compared, but the models as used by those who apply them to datasets as part of the intercomparison study (termed the "model testers" in this paper). In the terminology of classical experimental design (e.g. Cochran and Cox, 1957), differences between models are confounded with differences between model testers.
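The block/treatment/residual decomposition described above can be written out directly. The sketch below, a minimal illustration rather than a substitute for a statistics package, splits the total sum of squares of a B × T table of scores (one value per watershed-model cell, as in Table 1) into watershed, model and residual components, and forms the F ratio for models against the residual mean square; the example numbers are invented.

```python
import numpy as np

def randomized_block_anova(y):
    """Two-way ANOVA for a B x T table y (rows = blocks/watersheds, columns = treatments/models),
    one observation per cell, additive model with no interaction term."""
    y = np.asarray(y, dtype=float)
    B, T = y.shape
    grand = y.mean()
    ss_total = ((y - grand) ** 2).sum()
    ss_blocks = T * ((y.mean(axis=1) - grand) ** 2).sum()
    ss_treat = B * ((y.mean(axis=0) - grand) ** 2).sum()
    ss_resid = ss_total - ss_blocks - ss_treat
    df_treat, df_resid = T - 1, (B - 1) * (T - 1)
    ms_treat = ss_treat / df_treat
    ms_resid = ss_resid / df_resid
    return ms_treat / ms_resid, ms_resid, df_resid  # F ratio, residual MS, residual df

# Illustrative 4 x 3 table of R^2 values (4 watersheds, 3 models).
scores = np.array([[0.81, 0.78, 0.75],
                   [0.66, 0.70, 0.62],
                   [0.90, 0.88, 0.85],
                   [0.72, 0.74, 0.69]])
F, ms_resid, df_resid = randomized_block_anova(scores)
print(F, ms_resid, df_resid)  # F would be referred to an F(T-1, (B-1)(T-1)) distribution
```

As the paper argues, this analysis is only legitimate when the allocation of models to the experimental units within each block has actually been randomized.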


Adverse features of the procedure defined in (i)–(iv)


Features of the design given in (i)–(iv) above include the following:

(a) The order in which each model Mi calibrates the B watersheds, and validates them, is (usually) not specified, so that each model's testers are free to choose the order. Thus they may choose to calibrate all B watersheds one after the other, and then validate them, or they may calibrate and then validate each watershed in turn, again in no particular order.

(b) Each model Mi is calibrated and validated by its testers, who are often the most experienced users of the model and may therefore be expected to know it best.

With regard to (a), there are four points that must be considered. First, it is necessary to consider what the experimental units in the design are. To fix ideas, consider just the B × T table of Nash–Sutcliffe coefficients R² for the validation periods. For any one particular model, these B values of R² are obtained from the haphazard time order in which the model happens to have been applied to data from each of the B watersheds. Therefore, the experimental unit in the design is the time-period during which the model is validated. Since there are B watersheds, each particular model is validated during the B time-periods (working periods) taken by each tester (assuming that every model is validated using the same data from every basin: see the discussion below, where this assumption is relaxed). These B time periods form a sequence of experimental units for each model, such that it would be possible to allocate the B sets of validation data in random time-order to the model. Random allocation to experimental units is essential for the validity of the randomized-block experiment and its associated two-way ANOVA (John, 1971; Montgomery, 1991), although to the author's knowledge the order in which any given model is validated using data from the B watersheds is very seldom random, but is chosen by the model testers as mentioned above. Neither the MOPEX descriptions (Duan et al., 2006) nor the DMIP 2 specifications (Smith et al., 2007) define any order in which basin data sets are to be used: decisions on how this should be done are left to each model tester. The absence of randomization in the design given in (i)–(iv) above is one issue which destroys the analogy between the B × T table and the superficially-similar table used in the ANOVA of randomized-block designs.

The second point regarding (a) is as follows. The randomized-block analysis would be valid if, firstly, the B watersheds could be regarded as the blocks in the RB design, with the T models as its treatments. This would require that the treatments (i.e. models) were allocated at random to experimental units within each block. For the above design, however, there is no natural way in which experimental units within watersheds (as distinguished from those within models where, as we have seen, the units are successive time-intervals) can be defined; treatments (models) cannot therefore be randomly allocated to experimental units, as the RB analysis requires. On this issue, therefore, the randomized-block analogy again fails.

The third point regarding (a) concerns the relation between the T models and the B experimental units (time-intervals) which form the sequence in which each model is validated using data from the B basins. In a randomized-block experiment, the purpose of using blocks is to ensure that differences between experimental units within each block are kept small, so that treatments are compared within the blocks as precisely as possible; it does not matter if differences between blocks are large because (provided that block and treatment effects are additive) such differences will not influence the within-block comparison of treatments (for example, in an experiment to compare diets on the growth of rats, a block may consist of rats from the same litter, which are genetically more similar than rats from different litters; most of the unwanted genetic variation between rats is therefore removed as differences between the blocks, i.e. litters, in the ANOVA). In the case of the design specified by (i)–(iv) above, however, the differences between the T blocks are in fact the differences between the models; in the terminology of the statistical design of experiments, the design consists of a split-unit (or split-plot) experiment in which the main units are the T models, and the sub-units are the B basins. Since there is no replication of the T models, there is therefore no possibility of computing a residual variance that is appropriate for comparisons between them. Tables 2 and 3 illustrate the problem; Table 2 shows the structure of the design, and Table 3 shows the corresponding ANOVA, with no residual variation for comparing the T models.

The fourth point to be made about (a) concerns the possibility of carry-over effects. For each model, the design gives B values for the criterion of model performance, resulting from the time-sequence in which validation data from the B drainage basins are used. It is therefore necessary to consider the possibility that the model testers, who are running their model on the validation data sets, perform better or worse as the time-sequence develops: better performance possibly resulting from a learning process, or poorer performance resulting from fatigue. Statistical designs exist which allow for carry-over effects (Cochran and Cox, 1957); but failure to allow for possible carry-over effects from one time-period to the next is less important than the absence of any measure of residual variation (see preceding paragraph) for assessing the reality, or otherwise, of differences between the models being compared.

383

Consider now point (b), that each of the models under comparison is calibrated and validated by the same testers. The consequence of this is that the comparisons between models are not comparisons between the models themselves, but are really comparisons between the models as calibrated and validated by their testers: in the terminology of statistical experimental design, differences between the models are confounded with differences between proponents. A better experiment would require the testers of each model to use not only the models that they know best, but the models of other testers also; this would provide the replication of models which, as we have shown (the third point in the discussion of (a) above), is lacking. The counter-argument to this is, of course, that comparisons between hydrological models would be enormously complicated if model testers were also required to learn how to use alternative models. The counters to this counter-argument are twofold. First, if the purpose of the comparison is to identify a model or models that can be recommended for use by the average hydrological practitioner, then it is the model itself that must be recommended to him/her, and not the model as used by experienced testers who are presumably familiar with its details and its points of weakness. The second counter is that if hydrology is to be considered a science and not an arcane art-form, it is essential that the performance of a hydrological model should be reproducible by others. This is not to suggest that those who propose rainfall-runoff models for inclusion in an intercomparison study are charlatans: it is simply a condition of good science that researchers should be capable of confirming results reported by others in the literature. Examples from other branches of applied science where such confirmation has not been found (room-temperature nuclear fusion; cloning of human stem cells) are not far to seek.

Table 2  Idealized representation of a common but unsatisfactory experimental design for comparing T models M1, M2, M3, ..., MT on data-sets from B watersheds W1, W2, W3, ..., WB

Test period    M1     M2     M3     ...    MT
1              W3     WB     W2     ...    W2
2              WB     W2     W4     ...    W4
3              W4     W1     WB     ...    W5
...            ...    ...    ...    ...    ...
B              W1     W3     W1     ...    ...

Watershed data-sets (denoted by W1, W2, ..., WB) are usually tested haphazardly (i.e., not strictly in random order) in the successive test periods 1, 2, ..., B, which are not usually or necessarily of equal duration. The models M1, M2, M3, ..., MT are tested on main experimental units (blocks) which are unreplicated.

Table 3  ANOVA of the design in Table 2 (the ANOVA is strictly valid only if, within each model, data-sets from the B watersheds are tested in randomized order)

Source of variation                         Degrees of freedom    Mean square
Between main units:
  Between models (M)                        T - 1                 MS_T
  Residual variation between main units     None                  ( )
Between sub-units:
  Between watersheds (W)                    B - 1                 MS_W
  Residual variation between sub-units      (T - 1)(B - 1)        MS_R

The design has no residual variation between main units, by which differences between models could be compared. The ratio MS_W/MS_R gives a measure of the statistical significance of differences between watersheds, which is of very minor interest.

It could be argued that true replication of models in studies to compare their performances would lead to an impossibly time-consuming experiment, and it is true that, if all B watersheds are used in each model run, the amount of time taken would be very considerable. However, it is exactly for such circumstances that procedures in experimental design have been developed in which the number of experimental units (time, in the present context) must be severely restricted. It should be possible, for example, to utilize classical designs in balanced incomplete blocks (Cochran and Cox, 1957) in order to secure comparisons between model performances which are both statistically sound and at the same time not excessively demanding in terms of experimental material. This point is discussed further below.


An improved (but still not fully satisfactory) design


Recall that, to keep the discussion specific, we deal with the analysis of a single measure of model performance (say the Nash–Sutcliffe R²) given by T models tested on B drainage basins during the validation process. In the design specified in (i)–(iv) above, each of the T models is validated using the same period of record from any given watershed (and the same is true of the model calibration phase). The consequence of this is that each watershed provides just one experimental unit. Consider the alternative in which T distinct periods of record are taken from each watershed, for the purpose of model validation (setting aside for the moment the question of whether the basin records are sufficiently long to allow it: this point is discussed further below). Table 4 shows the alternative design. These T distinct periods can now be considered as within-watershed experimental units, to which the T models can be allocated at random for validation; each model will yield its value of R², as shown in Table 1, and (provided that the T models have been allocated at random to the T periods of record taken from each watershed) the ANOVA shown in Table 5 is now valid: the residual variance r² can be used to calculate standard errors of the mean measures of model performance R̄²_j, j = 1, ..., T, and, where appropriate, to test the significance of differences between them.

Table 4  Idealized representation of a more satisfactory experimental design for comparing T models M1, M2, M3, ..., MT on separate data-sets from each of B watersheds W1, W2, W3, ..., WB

Data set    W1     W2     W3     W4     ...    WB
1           M3     MT     M2     ...    ...    M2
2           MT     M2     M4     ...    ...    M4
3           M4     M1     MT     ...    ...    M5
...         ...    ...    ...    ...    ...    ...
T           M1     M3     M1     ...    ...    M1

Each of the B watersheds contributes T data sets on which the T models M1, M2, M3, ..., MT are tested in random order. The watersheds now constitute the main experimental units; models M1, M2, M3, ..., MT are tested on sub-units, and are now fully replicated.

Table 5  ANOVA of the design in Table 4 (the ANOVA is strictly valid only if the T data-sets within each watershed are allocated at random for testing with the T models)

Source of variation                         Degrees of freedom    Mean square
Between main units:
  Between watersheds (W)                    B - 1                 MS_W
  Residual variation between main units     None                  ( )
Between sub-units:
  Between models (M)                        T - 1                 MS_T
  Residual variation between sub-units      (T - 1)(B - 1)        MS_R

The design still has no residual variation between main units (which are now the watersheds), but comparisons between watersheds are of minor interest. The ratio MS_T/MS_R gives a measure of the statistical significance of differences between models, which is the objective of model comparison. The residual variation between sub-units can be used to calculate errors in differences between models.
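Generating a Table 4-style layout is straightforward once the T validation periods within each watershed have been delimited: the models are simply permuted at random, independently within each watershed. The short sketch below is one way such an allocation might be produced; the model and watershed labels, and the number of periods, are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
models = ["M1", "M2", "M3", "M4"]            # T models
watersheds = ["W1", "W2", "W3", "W4", "W5"]  # B watersheds, each split into T distinct periods

# allocation[w][k] = model validated on the k-th period of watershed w;
# each model appears exactly once within each watershed.
allocation = {w: list(rng.permutation(models)) for w in watersheds}
for w in watersheds:
    print(w, allocation[w])
```

Recording the permutation used for each watershed is part of the design: it is this random allocation that justifies the ANOVA of Table 5.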

The modified design remains unsatisfactory, however, if it is required to answer the question "do the models perform differently on drainage basins of different types?". To answer this question, it would be necessary to modify the design still further by validating each model on more than one period of record from each watershed. If the models are each validated on two different and non-overlapping periods from each watershed, the ANOVA (subject to the use of randomization) becomes as shown in Table 6, and the ratio MS(B × M)/r² can be used to test the hypothesis that model performance is independent of the type of watershed.

Table 6  Analysis of variance (ANOVA) for testing whether models perform differently on watersheds of different type

Source of variation                   Degrees of freedom    Mean square
Between watersheds                    B - 1
Between models                        T - 1
Interaction, watersheds × models      (B - 1)(T - 1)        MS(B × M)
Residual variation                    BT                    r²
Total                                 2BT - 1

Each model is validated on two periods of record from each watershed.
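A minimal sketch of the Table 6 calculation is given below, assuming the performance scores are held in an array of shape (watersheds, models, replicates) with two replicate validation periods per cell; the array layout and the synthetic data are assumptions made for illustration. The ratio MS(B × M)/r² returned by the function would be referred to an F distribution with (B - 1)(T - 1) and BT degrees of freedom.

```python
import numpy as np

def interaction_f_ratio(y):
    """y has shape (B, T, n): n replicate validation periods per watershed-model cell.
    Returns the F ratio MS(interaction) / MS(residual) for the watershed x model interaction."""
    y = np.asarray(y, dtype=float)
    B, T, n = y.shape
    grand = y.mean()
    cell = y.mean(axis=2)                    # B x T cell means
    row = y.mean(axis=(1, 2))                # watershed means
    col = y.mean(axis=(0, 2))                # model means
    ss_inter = n * ((cell - row[:, None] - col[None, :] + grand) ** 2).sum()
    ss_resid = ((y - cell[:, :, None]) ** 2).sum()
    ms_inter = ss_inter / ((B - 1) * (T - 1))
    ms_resid = ss_resid / (B * T * (n - 1))  # with n = 2 this is the BT residual df of Table 6
    return ms_inter / ms_resid

# Illustrative data: 5 watersheds, 3 models, 2 validation periods each.
rng = np.random.default_rng(1)
y = 0.7 + 0.05 * rng.standard_normal((5, 3, 2))
print(interaction_f_ratio(y))
```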

The issue of data availability


It could be argued that records available for model intercomparisons are rarely long enough to allow each model to be separately validated (or calibrated) on more than one period of record from each watershed. Indeed it is possible to argue more generally that the length of records which are at the same time available, reliable and representative limits the use of efficient statistical designs, such as that in Table 4, for model intercomparison studies.


There is substance to this argument, but it should not be accepted too readily, not least because the use of inefficient designs leads to questionable conclusions: intercomparison studies are time-consuming and costly, and should therefore yield conclusive results. In the literature there is a good deal of variation in the lengths of record that have been used to calibrate and validate models (just as there are wide variations in the size of plots used in agronomic experiments); the Brazilian model intercomparison study used 5 years (1996–2001) for calibration and eight (1988–1995) for validation, and all the models were tested on these periods of data. In the MOPEX study, the original intention was that a 10-year period (1980–1990) would be used for calibration and a 19-year period for validation, although in practice this was not adhered to. Huang and Liang (2006) used a 5-year calibration period for some of the 12 MOPEX basins, and then used the entire 39-year period 1960–1998 for validation of their VIC-3L model. Gan and Burges (2006) used 20-year data-sets for calibration, and separate 19-year data-sets from the 12 MOPEX basins for validation, thus following the split-record test recommended by Klemeš (1986), in which each half of a record is used in turn for model calibration and validation. Young (2006) used the last 10 years of record from each of 179 UK catchments for calibrating a rainfall-runoff model. In the case of DMIP, model participants were asked to use much shorter periods, both for calibration (May 1, 1993 to May 31, 1999; a little over 6 years) and validation (June 1, 1999 to July 31, 2000; slightly more than 1 year) on the eight basins (Reed et al., 2004).

For purposes of discussion of the data issue, let us assume 10-year periods of record as appropriate for model calibration, with perhaps rather longer periods of, say, 20 years for validation. A 40-year record therefore provides scope for four 10-year calibration periods, and two (possibly three) unbroken 20-year validation periods. Supposing that the records available for use are all of about 40 years' duration, this suggests, superficially, that the number of models that can be tested is limited to a very small number. Consider, however, the experimental design shown in Table 7a; this is a design which could be used for calibrating seven models on seven watersheds, using three 13-year calibration periods from each (recalling that the 12 MOPEX basins had 39-year records). Not all models are calibrated on every watershed, but the criterion of performance, at calibration or validation, of every model is compared with that of every other model within one watershed.
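Cutting a long record into the non-overlapping blocks that designs such as those in Tables 7a-c require is a matter of simple slicing. The sketch below divides a synthetic 39-year daily series into three 13-year calibration blocks by index, ignoring calendar details, leap years and missing data for brevity; the series itself is invented for illustration.

```python
import numpy as np

years_total, years_per_block = 39, 13
days_per_year = 365                      # leap years ignored in this illustration
q = np.random.default_rng(3).gamma(2.0, 5.0, size=years_total * days_per_year)

block_len = years_per_block * days_per_year
calibration_blocks = [q[i * block_len:(i + 1) * block_len]
                      for i in range(years_total // years_per_block)]
print([len(b) for b in calibration_blocks])  # three blocks of 4745 daily values
```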


Table 7a  A balanced incomplete block (BIB) design for comparing calibrations of seven models M1, M2, M3, ..., M7 on seven watersheds, each with at least 39 years of record (as used in MOPEX), using three data-sets each of 13 years' duration for model calibration

               Data-set 1    Data-set 2    Data-set 3
Watershed 1    M1            M2            M4
Watershed 2    M2            M3            M5
Watershed 3    M3            M4            M6
Watershed 4    M4            M5            M7
Watershed 5    M1            M5            M6
Watershed 6    M2            M6            M7
Watershed 7    M1            M3            M7

Table 7b  As for Table 7a, but allowing for possible non-representativity in the 13-year calibration periods from each watershed

                             W1    W2    W3    W4    W5    W6    W7
Data-set 1 (first period)    M1    M2    M3    M4    M5    M6    M7
Data-set 2 (second period)   M2    M3    M4    M5    M6    M7    M1
Data-set 3 (third period)    M4    M5    M6    M7    M1    M2    M3

Calibration of each model is compared with the calibration of every other model once within each watershed, and once within each period.

Table 7c  Design for validating seven models M1, M2, M3, ..., M7 using two periods (say of 19 and 20 years) from each of 21 watersheds W1, W2, W3, ..., W21

Reps I and II      W1    W2    W3    W4    W5    W6    W7
First period       M1    M2    M3    M4    M1    M5    M3
Second period      M2    M6    M4    M7    M5    M6    M7

Reps III and IV    W8    W9    W10   W11   W12   W13   W14
First period       M1    M2    M3    M4    M5    M1    M2
Second period      M3    M4    M5    M6    M7    M6    M7

Reps V and VI      W15   W16   W17   W18   W19   W20   W21
First period       M1    M2    M3    M4    M2    M6    M1
Second period      M4    M3    M6    M5    M5    M7    M7

Validation of each model is compared once with the validation of every other model. The design consists of three pairs of replications; in Reps I and II, for example, each model is validated twice, and similarly for Reps III and IV, and Reps V and VI.

If representativity is an issue, so that it is suspected that not all three 13-year periods within each watershed are equally representative, the design can be modified as shown in Table 7b, in which each model is calibrated an equal number of times (once, in this case) in each of the 13-year periods. The ANOVA for such a design removes differences between calibration periods from comparisons between models. If validation periods must be longer (19 or 20 years, as in the MOPEX study), this too can be accommodated by design, but at the cost of needing data-sets from more watersheds. Table 7c shows a design, again for comparing the validation of seven models, using validation periods of 19 or 20 years (as used in MOPEX) from 21 watersheds, rather more than the 12 used in MOPEX but far fewer than the number used (for calibration) by Young (2006). Furthermore, if 40-year records were available from 21 watersheds, the designs of Table 7a or Table 7b could be repeated without greatly increasing the work-load of individual model testers; even with three repetitions of the design in Table 7a, each tester would be required to run his model nine times, instead of the seven times required if the model had been calibrated on each of the seven watersheds. Thus it can be concluded that, where many models are to be compared, the use of balanced incomplete block designs (Cochran and Cox, 1957, Chapter 9) would eliminate the need for each model to be validated on every basin, but would still provide a valid estimate of the residual variance r² needed for calculating the precision of estimates of model performance.

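The defining property of a balanced incomplete block design such as Table 7a, namely that every pair of models is calibrated together in the same number of watersheds (here exactly one), is easy to check programmatically. The sketch below encodes the Table 7a layout as reconstructed above and counts pairwise concurrences; it is offered purely as a verification aid.

```python
from itertools import combinations
from collections import Counter

# Blocks of Table 7a: the set of three models calibrated on each watershed.
blocks = {
    "Watershed 1": {"M1", "M2", "M4"},
    "Watershed 2": {"M2", "M3", "M5"},
    "Watershed 3": {"M3", "M4", "M6"},
    "Watershed 4": {"M4", "M5", "M7"},
    "Watershed 5": {"M1", "M5", "M6"},
    "Watershed 6": {"M2", "M6", "M7"},
    "Watershed 7": {"M1", "M3", "M7"},
}

# Count how many watersheds each pair of models shares.
pair_counts = Counter(pair for b in blocks.values()
                      for pair in combinations(sorted(b), 2))
print(len(pair_counts), set(pair_counts.values()))  # 21 pairs, each occurring exactly once
```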


Criteria of model performance


It is common, and necessary, for a range of criteria of model performance to be used for model intercomparison. Wagener et al. (2002) listed eight such criteria, ranging from the Nash–Sutcliffe efficiency (R²) to the Heteroscedastic Maximum Likelihood Estimator used by Sorooshian and Dracup (1980) and Yapo et al. (1996). Smith et al. (2004) used 12 criteria, ranging from summary statistics (mean, standard deviation, correlation coefficient, R²) to percentages based on errors in volume, peak error, peak time error, and improvements in flood runoff, peak flow and peak time of hydrograph events. When used to compare the performance of different models, a problem with all such criteria is that they fail to take account of differing numbers of parameters in the models being compared. As an example, if the correlation between observed and fitted streamflows is 0.9 for a model with six parameters and is also 0.9 for a model with 60 parameters, it cannot be concluded that the two models perform equally well; the former model has greater explanatory power than the latter. Where the difference between the numbers of parameters p in different models is small and the number N of values in the observed streamflow sequence is large, there is less need to take account of the number of parameters; but with some models now reported in the literature (those based on artificial neural networks, for example, where the number of estimated weights can be very large indeed) the need is much greater. In the case of the Nash–Sutcliffe R², Clarke (2008) has suggested replacing the form almost universally used,

\[ R^2 = 1 - \frac{\sum_t (q_t - \hat{q}_t)^2}{\sum_t (q_t - \bar{q})^2}, \]

where \(q_t\) denotes the observed flow at time t, \(\hat{q}_t\) the corresponding fitted flow and \(\bar{q}\) the mean of the observed flows, by

\[ R^2 = 1 - \frac{\sum_t (q_t - \hat{q}_t)^2 / (N - p)}{\sum_t (q_t - \bar{q})^2 / (N - 1)}, \]

as an empirical correction, corresponding to the adjusted R² used in linear regression, just as the Nash–Sutcliffe coefficient is analogous to the unadjusted R² used when a model is linear in its parameters. The divisor N - p could also be used when calculating some other criteria of model performance (but not, for example, the correlation between observed and fitted streamflows, which is also sometimes used as a criterion of model performance). A potential issue arises in the case of distributed models, where the number of parameters can become very large as the model cell-size decreases, although in practice it appears that many such parameters are constrained to be equal. However, the essential point remains that any measure of model performance should take account of its explanatory power, which is related to the number of its parameters that must be estimated from data.
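The suggested correction is a one-line change to the usual Nash–Sutcliffe computation. The sketch below, a minimal illustration with the number of parameters p treated as known, computes both the uncorrected coefficient and the adjusted form given above; the synthetic series are for demonstration only.

```python
import numpy as np

def nse(observed, simulated):
    """Uncorrected Nash-Sutcliffe coefficient."""
    observed, simulated = np.asarray(observed, float), np.asarray(simulated, float)
    return 1.0 - np.sum((observed - simulated) ** 2) / np.sum((observed - observed.mean()) ** 2)

def nse_adjusted(observed, simulated, n_params):
    """Nash-Sutcliffe coefficient penalized for the number of fitted parameters p,
    analogous to the adjusted R^2 of linear regression."""
    observed, simulated = np.asarray(observed, float), np.asarray(simulated, float)
    N, p = observed.size, n_params
    num = np.sum((observed - simulated) ** 2) / (N - p)
    den = np.sum((observed - observed.mean()) ** 2) / (N - 1)
    return 1.0 - num / den

rng = np.random.default_rng(0)
q = rng.gamma(2.0, 5.0, size=730)            # two years of synthetic daily flows
q_hat = q + rng.normal(0.0, 2.0, size=730)   # "fitted" flows
print(nse(q, q_hat), nse_adjusted(q, q_hat, n_params=60))  # the penalty grows with p
```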

Conclusion

The Introduction to this paper gave a list of questions which have arisen from DMIP 1 and which are to be answered by DMIP 2. If clear and unambiguous answers to these questions are to be found, much attention must be given (and of course is being given) to planning the procedures by which models are to be compared. The message of this paper is that there is sometimes benefit in stepping outside the box and looking at commonly-adopted procedures from a new standpoint. And since any intercomparison between hydrological models is essentially a comparative experiment in the sense of classical experimental design, it is appropriate to consider whether, and how, the principles of that discipline may be applied, or adapted, to compare the performance of hydrological models tested on data from a number of watersheds.

It is concluded from the arguments given above that procedures commonly used at present to compare the performance of rainfall-runoff models need to be examined much more carefully for several reasons, but principally because they provide no measure of the uncertainty (as measured by an estimate of residual variation between experimental units) in differences between measures of model performance. It is also concluded that the present procedures, in which each model is calibrated and validated by persons who are most familiar with its use, do not provide a sound basis for recommending any particular model for use by the hydrological community at large. Whilst the principles of good experimental design (replication, randomization) are widely practised in other fields of applied science, they are not yet widely practised in hydrology (and other geophysical sciences where models are essential tools) for the sound comparison of rainfall-runoff models.

A design is discussed in which each model is calibrated (and validated) on a separate period of record from each watershed. Subject to the random allocation of models to test periods, the design provides a valid measure of the uncertainty in measures of model performance. Elaborations of the proposed design allow the possibility to be explored that models perform differently on different types of watershed; the design can also be adapted to avoid the need to validate (and calibrate) every model on every watershed.

Acknowledgements

The author is very grateful for detailed comments and criticisms from two anonymous Reviewers, which contributed very substantially to this paper.

References

Clarke, R.T., 2008. Issues of experimental design for comparing the performance of hydrologic models. Water Resour. Res., doi:10.1029/2007WR005927, in press.
Cochran, W.G., Cox, G.M., 1957. Experimental Designs, second ed. John Wiley & Sons, New York.
Collischonn, W., Tucci, C.E.M., Clarke, R.T., Chou, S.C., Guilhon, L.G., Cataldi, M., Allasia, D., in press. Medium-range reservoir inflow predictions based on quantitative precipitation forecasts. J. Hydrol.
Chiew, F., McMahon, T., 1994. Application of the daily rainfall-runoff model MODHYDROLOG to 28 Australian catchments. J. Hydrol. 153, 383–416.
Diomede, T., Amengual, A., Marsigli, C., Martin, A., Romero, R., Papetti, P., Paccagnella, T., 2006. A meteo-hydrological model intercomparison as tool to quantify forecast uncertainty at medium-sized basin scale. Geophys. Res. Abstr. 8, 08164.



Duan, Q., Schaake, J., Andréassian, V., Franks, S., Goteti, G., Gupta, H.V., Gusev, Y.M., Habets, F., Hall, A., Hay, L., Hogue, T., Huang, M., Leavesley, G., Liang, X., Nasonova, O.N., Noilhan, J., Oudin, L., Sorooshian, S., Wagener, T., Wood, E.F., 2006. Model parameter estimation experiment (MOPEX): an overview of science strategy and major results from the second and third workshops. J. Hydrol. 320, 3–17.
Fisher, R.A., 1925. Statistical Methods for Research Workers. Oliver and Boyd, Edinburgh.
Fisher, R.A., 1947. The Design of Experiments, fourth ed. Oliver and Boyd, Edinburgh.
Gan, T.Y., Burges, S.J., 2006. Assessment of soil-based and calibrated parameters of the Sacramento model and parameter transferability. J. Hydrol. 320, 117–131.
Guilhon, L.G.F., Rocha, V.F., Moreira, J.C., in press. Project on the competition to develop models for forecasting inflows to hydroelectric installations of the national interlinked system (in Portuguese). Special Issue of Revista ABRH.
Huang, M., Liang, X., 2006. On the assessment of the impact of reducing parameters and identification of parameter uncertainties for a hydrologic model with applications to ungauged basins. J. Hydrol. 320, 37–61.
John, P.W.M., 1971. Statistical Design and Analysis of Experiments, third ed. Macmillan, New York.
Klemeš, V., 1986. Operational testing of hydrological simulation models. Hydrol. Sci. J. 31 (1), 13–24.
Montgomery, D.C., 1991. Design and Analysis of Experiments, third ed. John Wiley & Sons, New York.
Nash, J.E., Sutcliffe, J.V., 1970. River flow forecasting through conceptual models. Part I: A discussion of principles. J. Hydrol. 10 (3), 282–290.
Reed, S., Koren, V., Smith, M., Zhan, Z., Moreda, F., Seo, D.-J., DMIP Participants, 2004. Overall distributed model intercomparison results. J. Hydrol. 298, 27–60.
Schaake, J., Duan, Q., 2006. The model parameter estimation experiment (MOPEX). J. Hydrol. 320, 1–2.


Smith, M.B., Seo, D.-J., Koren, V.I., Reed, S.M., Zhang, Z., Duan, Q., Moreda, F., Cong, S., 2004. The distributed model intercomparison project (DMIP): motivation and experimental design. J. Hydrol. 298, 4–26.
Smith, M., Koren, V., Reed, S., Zhang, Z., Seo, D.-J., Moreda, F., Cui, Z., 2007. The distributed model intercomparison project: phase 2 Science Plan.
Singh, V.P., Frevert, D.K., 2006. Watershed Models. Taylor and Francis, FL.
Singh, V.P., Frevert, D.K., 2002. Mathematical Models of Large Watershed Hydrology. Water Resources Publications, LLC, Colorado.
Sorooshian, S., Dracup, J.A., 1980. Stochastic parameter estimation procedures for hydrologic rainfall-runoff models: correlated and heteroscedastic error cases. Water Resour. Res. 16 (2), 430–442.
Wagener, T., Lees, M.J., Wheater, H.S., 2002. A toolkit for the development and application of parsimonious hydrological models. In: Singh, V.P., Frevert, D.K. (Eds.), Mathematical Models of Large Watershed Hydrology, pp. 91–140 (Chapter 4).
WMO, 1975. Intercomparison of conceptual models used in operational hydrological forecasting. Operational Hydrology Paper No. 429, World Meteorological Organization, Geneva, Switzerland.
WMO, 1986. Intercomparison of models of snowmelt runoff. Operational Hydrology Paper No. 646, World Meteorological Organization, Geneva, Switzerland.
WMO, 1992. Simulated real-time intercomparison of hydrological models. Operational Hydrology Paper No. 779, World Meteorological Organization, Geneva, Switzerland.
Ye, W., Bates, B.C., Viney, N.R., Sivapalan, M., Jakeman, A.J., 1997. Performance of conceptual rainfall-runoff models in low-yielding ephemeral catchments. Water Resour. Res. 33 (1), 153–166.
Yapo, P.O., Gupta, H.V., Sorooshian, S., 1996. Automatic calibration of conceptual rainfall-runoff models: sensitivity to calibration data. J. Hydrol. 181, 23–48.
Young, A.R., 2006. Streamflow simulation within UK ungauged catchments using a daily rainfall-runoff model. J. Hydrol. 320, 155–172.
