The receiver-operating characteristic (roc) analysis 49
Journal of Cognitive and Behavioral Psychotherapies, Vol. 9, No. 1, March 2009, 49-66.
THE RECEIVER-OPERATING CHARACTERISTIC (ROC) ANALYSIS: FUNDAMENTALS AND APPLICATIONS IN CLINICAL PSYCHOLOGY
Sebastian PINTEA * & Ramona MOLDOVAN
Babes-Bolyai University, Cluj-Napoca, Romania
Abstract The Receiver-Operating Characteristic (ROC) analysis has long been used in Signal Detection Theory to depict the tradeoff between hit rates and false alarm rates of classifiers. In recent years, ROC analysis has become widely used in the medical community for visualizing and analyzing the performance of diagnostic tests. Our article points out some fundamental aspects of ROC analysis, underlining the importance of using ROC analysis in evaluating the diagnostic validity of tests commonly used in clinical psychology. The main statistical programs available for this type of analysis, with their advantages and deficiencies, are also discussed. In order to illustrate how ROC analysis works in clinical research, we also describe an application of ROC analysis in evaluating scales generally related to depression.
Keywords: receiver operating characteristics (ROC), ROC analysis, area under the curve (AUC), diagnostic performance, sensitivity, specificity, clinical psychology, depression
THE ROC ANALYSIS: FUNDAMENTALS
Receiver Operating Characteristic (ROC) analysis is a procedure used in assessing the diagnostic properties of tests, namely the way various measures discriminate between different categories of subjects. In order to do this, a cut-off point needs to be established; based on the cut-off point, we can determine whether a person with a certain score belongs to one category or another (e.g. normal/non-clinical or clinical group). ROC analysis may also be used when comparing the diagnostic performance of two or more tests (Westin, 2001). ROC analysis was used for the first time in the military field, for the analysis of radar images, during the Second World War (Westin, 2001). In
* Correspondence concerning this article should be addressed to: E-mail: sebastianpintea@psychology.ro
Articles Section
medicine, the procedure has been used since the 1960s, and there is an extensive literature on the use of ROC graphs for diagnostic testing (Fawcett, 2006). In chemistry, ROC curve analysis is used to solve dichotomous decision problems such as: the presence or absence of a protein marker, whether the structure of a molecule is X or Y, whether a reaction obeys first-order or second-order kinetics, whether a reaction should be terminated or continued, etc. (Brown & Davis, 2006). In clinical psychology, ROC analysis is being used with increased frequency, particularly in examining the utility or performance of diagnostic or screening tools for: future difficulties in reading comprehension (Shapiro, Solari & Petscher, 2008), alcohol and drug abuse (Kills Small et al., 2007), neuropsychological impairment (Horwitz et al., 2008; O'Brien et al., 2007), depression (Benazzi, 2008; Serrano-Duenas & Serrano, 2008; Stafford, Berk & Jackson, 2007; Ballesteros et al., 2007; Walsh et al., 2006), obsessive-compulsive disorder (Ivarsson & Larsson, 2008), bipolar disorder (Parker et al., 2008), suicide (Jokinen, Nordstrom & Nordstrom, 2008), dementia (Chiu et al., 2008; Giaquinto & Parnetti, 2006), dropout risk from different treatments such as cognitive-behavior therapy for insomnia (Ong, Kuo & Manber, 2008) and so on. Two of the more important literature reviews on ROC curve analysis have been conducted by McFall & Treat (1999) and Streiner & Cairney (2007). McFall and Treat pursue a broader objective: as they themselves state, their work is about the functions of clinical assessment, the standards by which methods can be evaluated, and the most promising approaches to achieving the broad goals of clinical assessment (p. 216). Compared to their article and also to that of Streiner and Cairney, our paper is intended to be more applied.
We use actual data collected from a Romanian sample to illustrate a number of the ROC concepts reviewed in this paper; we also discuss an extended set of uses, from establishing cut-off points to comparing different tests on their overall and partial ROC areas and at specific points of the curves.
Measures related to ROC curves
The dichotomous decision process is based on a threshold value V (cut-off point) which classifies the scores of a continuous variable Y (also called classifier) into two categories: positive vs. negative. If Y≥V, the subject will be classified as positive; if Y<V, the subject will be classified as negative (Brown & Davis, 2006). Now let us assume that we have a valid procedure of discriminating between the presence and the absence of a disorder (also called valid diagnosis), and we differentiate two groups of individuals: with and without a certain disorder. By also administering the test that assesses the value of the Y variable in each subject, we obtain two distributions of Y scores, one for each group. In actual research (in real life) a perfect separation between the two groups
is quite rare. Most of the time, the distributions of test results will overlap, as shown in Figure 1.
Figure 1. Four possible outcomes when intersecting a valid diagnosis with a classifier
The intersection of the valid diagnosis with the classifier generates four possible outcomes. If the valid diagnosis is positive and it is correctly classified as positive, the outcome is counted as a true positive (TP); if it is incorrectly classified as negative, it is counted as a false negative (FN). If the valid diagnosis is negative and it is correctly classified as negative, the outcome is counted as a true negative (TN); if it is incorrectly classified as positive, it is counted as a false positive (FP) (Brown & Davis, 2006; Fawcett, 2006). The outcomes identified in Figure 1 can be arranged in a contingency table that we will call the decision matrix, also known as the confusion matrix (Brown & Davis, 2006). Such an approach is presented in Table 1.
Table 1. The decision matrix
                 DIAGNOSIS
TEST        Positive   Negative   Total
Positive    TP         FP         T+
Negative    FN         TN         T-
Total       D+         D-         N
In the decision matrix, besides the four outcomes described above, we have included the following values: the total number of subjects who have the disorder (D+), the total number of subjects who do not have the disorder (D-), the total number of subjects with a positive test result (T+), the total number of subjects with a negative test result (T-) and the total number of subjects analysed (n, labelled N in Table 1) (Brown & Davis, 2006; Fawcett, 2006). It is important to note that in research situations we have a decision matrix for each possible cut-off point. If variable Y is a discrete variable, with k possible values, we will have k-1 decision matrices. From the outcomes described in the decision matrix, we can calculate the following measures (metrics):
(1) Sensitivity = TP / D+
(2) Specificity = TN / D-
(3) Positive likelihood ratio = Sensitivity / (1 - Specificity)
(4) Negative likelihood ratio = (1 - Sensitivity) / Specificity
(5) Positive predictive value = TP / T+
(6) Negative predictive value = TN / T-
(7) Accuracy = (TP + TN) / n
Sensitivity, also called the true positive rate (when expressed as a percentage), is defined as the probability that a test result will be positive when the disorder is present. Specificity, also called the true negative rate (when expressed as a percentage), represents the probability that a test result will be negative when the disorder is not present. These two indicators are essential for ROC curve analysis. The positive likelihood ratio is the ratio between the probability of a positive test result given the presence of the disorder and the probability of a positive test result given the absence of the disorder. Similarly, the negative likelihood ratio is defined as the ratio between the probability of a negative test result given the presence of the disorder and the probability of a negative test result given the absence of the disorder. Positive predictive value, also called precision, is defined as the probability that the disorder is present when the result of the test is positive, while the negative predictive value is defined as the probability that the disorder is not present when the result of the test is negative. The last indicator presented here is the diagnostic accuracy of a test, also referred to as its clinical performance: the proportion of subjects that the test correctly classifies into clinically relevant subgroups. Diagnostic accuracy refers to the quality of the information provided by the classification device (for more details see Jokinen, Nordstrom & Nordstrom, 2008).
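As an illustration, measures (1) to (7) can be computed directly from the four cells of a decision matrix. The following is a minimal sketch in Python; the class name, the field names and the counts are hypothetical and serve only to make the formulas concrete.

```python
from dataclasses import dataclass


@dataclass
class DecisionMatrix:
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives
    tn: int  # true negatives

    def metrics(self) -> dict:
        d_pos = self.tp + self.fn  # D+: subjects with the disorder
        d_neg = self.fp + self.tn  # D-: subjects without the disorder
        t_pos = self.tp + self.fp  # T+: positive test results
        t_neg = self.fn + self.tn  # T-: negative test results
        n = d_pos + d_neg          # total number of subjects
        sens = self.tp / d_pos
        spec = self.tn / d_neg
        return {
            "sensitivity": sens,
            "specificity": spec,
            # The ratios are undefined when the denominator is zero;
            # this sketch reports infinity in that case.
            "lr_positive": sens / (1 - spec) if spec < 1 else float("inf"),
            "lr_negative": (1 - sens) / spec if spec > 0 else float("inf"),
            "ppv": self.tp / t_pos,
            "npv": self.tn / t_neg,
            "accuracy": (self.tp + self.tn) / n,
        }


# Hypothetical counts, chosen only for illustration.
example = DecisionMatrix(tp=45, fp=10, fn=5, tn=40)
for name, value in example.metrics().items():
    print(f"{name}: {value:.3f}")
```

With these illustrative counts, 45 of the 50 subjects with the disorder test positive and 40 of the 50 subjects without it test negative, so the sketch reports a sensitivity of 0.90 and a specificity of 0.80.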
Other measures that are worth mentioning here, even though they are not actually used in the following analysis, are prevalence and pre-test and post-test odds. The prevalence (D+/n) refers to the proportion of cases exhibiting the disorder; the pre-test odds (prevalence/(1-prevalence)) refer to the odds that the patient suffers from the target disorder before the test is carried out, while the post-test odds (pre-test odds × positive likelihood ratio) reflect the odds that the patient suffers from the target disorder after a positive test result. An important observation is that when the prevalence in the sample is different from the prevalence in the population, measures such as accuracy and predictive values (both positive and negative) must be calculated taking into account the prevalence in the population (for more details see Brown & Davis, 2006).
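The chain from prevalence to post-test odds can be sketched as follows. The function name and the numerical values are hypothetical. The final conversion from odds back to a probability is not part of the definitions above; it is added here only because a probability is often easier to read than odds.

```python
def post_test_probability(prevalence: float, likelihood_ratio: float) -> float:
    """Probability of the disorder after a test result with the given likelihood ratio."""
    pre_test_odds = prevalence / (1 - prevalence)
    post_test_odds = pre_test_odds * likelihood_ratio
    # Standard conversion from odds to probability (an addition of this sketch).
    return post_test_odds / (1 + post_test_odds)


# Hypothetical values: prevalence of 10% and a positive likelihood ratio of 4.5.
probability = post_test_probability(prevalence=0.10, likelihood_ratio=4.5)
print(f"post-test probability: {probability:.3f}")
```

In this illustrative case the pre-test odds are 1 to 9, the post-test odds are 1 to 2, and the probability of the disorder after a positive result is therefore one third.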
The ROC space
ROC graphs are two-dimensional representations of sensitivity (also called the true positive rate, on the Y axis) against 1-specificity (also called the false positive rate, on the X axis), corresponding to each possible cut-off point (classifying value). In other words, they represent the tradeoffs between benefits (true positives) and costs (false positives) (Fawcett, 2006). In order to be able to interpret the so-called ROC space, we need to have a reference point in this space. The ROC space is illustrated in Figure 2.
Figure 2. The ROC space
The lower left point of the graph (0,0) is the value that contains no error (no false positives) but also, does not detect any true positives. The opposite
point, in the upper right side of the graph (1,1), identifies all true positives, but with a 100% false positive rate. The upper left point (0,1) is the perfect classification, where all the true positives are identified without any error (no false positives, or 0 costs). The lower right point (1,0) is the worst classification, where all subjects classified as positive are in fact false positives, with no true positives being identified (Fawcett, 2006). In order to establish the optimal cut-off point, we have to look at the most northwestern point in the ROC space (highest TP rate and lowest FP rate). The diagonal line where the TP rate is equal to the FP rate (y=x) represents the performance of a random test. For example, a classifier that guesses the positive class at random half of the time is expected to correctly identify half of the positives and half of the negatives, which places it at the point (0.5, 0.5). Consequently, all cut-off points that are above the random diagonal perform better than random guessing, and all cut-off points that are below this diagonal perform worse than random guessing (Fawcett, 2006).
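To make the construction of the ROC space concrete, the sketch below computes one point (false positive rate, true positive rate) for every observed cut-off. It assumes that a score at or above the cut-off is classified as positive; the function name, the scores and the diagnoses are hypothetical.

```python
def roc_points(scores, labels):
    """Return (cut_off, false_positive_rate, true_positive_rate) triples.

    labels: 1 = disorder present, 0 = disorder absent.
    A subject is classified as positive when score >= cut_off.
    """
    d_pos = sum(labels)
    d_neg = len(labels) - d_pos
    result = []
    for cut_off in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cut_off and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cut_off and y == 0)
        result.append((cut_off, fp / d_neg, tp / d_pos))
    return result


# Hypothetical test scores and valid diagnoses for eight subjects.
scores = [12, 15, 18, 20, 22, 25, 27, 30]
labels = [0, 0, 1, 0, 1, 0, 1, 1]

points = roc_points(scores, labels)
for cut_off, fpr, tpr in points:
    position = "above" if tpr > fpr else "on or below"
    print(f"cut-off {cut_off}: FPR={fpr:.2f}, TPR={tpr:.2f} ({position} the diagonal)")
```

The lowest cut-off classifies every subject as positive and therefore lands on the (1,1) corner; raising the cut-off moves the point toward (0,0).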
ROC analysis utility
When is ROC analysis useful? The literature dedicated to this procedure indicates three main uses: 1. determining the ability of a test to discriminate between groups 2. choosing the optimal cut-off point of a test, and 3. comparing the performance of two or more tests. All of these uses rely on several statistics that can be derived from the ROC (Westin, 2001; Fawcett, 2006; Brown & Davis, 2006).
Determining the ability of a test to discriminate between groups
Before any statistical analysis, the ROC curve needs to be inspected visually. The curve of a good test will be well above the diagonal of the ROC graph and will tend toward its north-western corner. As for the statistical indicators of the ROC curve, the primary statistic derived from the ROC is the area under the curve (AUC). The total area under the ROC curve is a measure of the overall performance of a diagnostic test: the larger the area, the better the performance (Westin, 2001). The area under the curve corresponding to a test may be compared with the random performance of a test, which is represented by the diagonal of the graph, where x=y. This is in fact an inference problem, testing the null hypothesis which states that the test performs randomly in establishing the two diagnostic categories. Under the null hypothesis, the area under the curve is 0.50, namely the area under the diagonal. That is, if we subtract 0.50 from the area of a test, we will obtain the area of the test that lies above the diagonal of the random test. In testing the difference of AUC between a diagnostic test and a random test, the values of interest are Z (the test statistic) and the probability of this value under the null hypothesis (p). There are a number of ways of calculating the AUC due to the finite number of points on the ROC curve (possible cut-off points). Thus, we most often
estimate the area designated by these points; the derived solutions are diverse and do not always lead to the same result. Some of the ways used for the calculation of AUC are: the trapezoidal rule, the approximation of the curve by fitting the data to a binormal model with maximum-likelihood estimates, and the use of the Mann-Whitney U statistic (for more details see Westin, 2001). The interpretation of the AUC of a test is the following: the AUC is the probability that the test yields a higher value for a randomly chosen individual suffering from the disorder than for a randomly chosen individual not suffering from the disorder (Streiner & Cairney, 2007). Going back to the utility of the AUC in determining the ability of a test to discriminate between groups, Streiner and Cairney (2007) show that the accuracy of tests with an AUC between 0.50 and 0.70 is low; an AUC between 0.70 and 0.90 indicates moderate accuracy, while an AUC over 0.90 indicates high accuracy.
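Two of these estimates can be sketched in a few lines. The first function implements the probabilistic interpretation directly (the proportion of patient/non-patient pairs in which the patient scores higher, counting ties as one half, which is the quantity behind the Mann-Whitney U statistic); the second applies the trapezoidal rule to the empirical ROC points. Function names and data are hypothetical, and the binormal model is not shown.

```python
def auc_rank(pos_scores, neg_scores):
    """Proportion of (positive, negative) pairs in which the positive subject
    scores higher; ties count as one half."""
    wins = 0.0
    for p in pos_scores:
        for q in neg_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def auc_trapezoid(pos_scores, neg_scores):
    """Area under the empirical ROC curve, computed with the trapezoidal rule."""
    cut_offs = sorted(set(pos_scores) | set(neg_scores), reverse=True)
    xs, ys = [0.0], [0.0]  # the curve starts at (0, 0)
    for c in cut_offs:
        xs.append(sum(s >= c for s in neg_scores) / len(neg_scores))  # FPR
        ys.append(sum(s >= c for s in pos_scores) / len(pos_scores))  # TPR
    area = 0.0
    for i in range(1, len(xs)):
        area += (xs[i] - xs[i - 1]) * (ys[i] + ys[i - 1]) / 2
    return area


# Hypothetical scores for subjects with and without the disorder.
with_disorder = [18, 22, 27, 30]
without_disorder = [12, 15, 20, 25]

print(f"rank-based AUC:  {auc_rank(with_disorder, without_disorder):.4f}")
print(f"trapezoidal AUC: {auc_trapezoid(with_disorder, without_disorder):.4f}")
```

On the empirical curve these two nonparametric estimates coincide; differences between methods arise mainly when a smooth model such as the binormal one is fitted instead.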
Choosing the optimal cut-off point
The ROC analysis can also be used in determining the optimal cut-off point of a test. As mentioned above, the optimal cut-off point is the most northwestern point in the ROC space. It is the cut-off point where the proportion of subjects that were accurately classified is maximal (a cut-off point with both a high sensitivity and a high specificity). In other words, as a rule, the optimal cut-off point is the one which maximizes TP+TN (or minimizes FP+FN). However, this principle is based on the assumption that the cost of making a false positive mistake is equal to the cost of making a false negative mistake. In real life, these costs are rarely equivalent. For example, when diagnosing ADHD in a child, the cost of a false positive mistake (e.g. unjustified drug administration) may not be equivalent to the cost of a false negative mistake (delayed intervention) (Streiner & Cairney, 2007). Other studies use a specificity of at least 95% as criterion, even with a decreased sensitivity (Westin, 2001) or, for screening, a sensitivity of at least 80% even with a decreased specificity (Sharifi et al., 2008). Taking into account this imbalance of costs between FP and FN mistakes, Westin (2001) concludes that the optimal cut-off point depends on what the test will be used for.
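The effect of unequal costs on the choice of a cut-off can be sketched as follows. The first function picks the point closest to the upper left corner (0,1); the second minimizes an expected misclassification cost. The cost function, the prevalence and all numerical values are assumptions made for this illustration and are not taken from the studies cited above.

```python
import math

# (cut_off, false_positive_rate, true_positive_rate); hypothetical values.
ROC_POINTS = [
    (12, 1.00, 1.00),
    (18, 0.50, 1.00),
    (22, 0.20, 0.80),
    (27, 0.00, 0.50),
]


def closest_to_corner(points):
    """Cut-off whose ROC point is nearest to the perfect classification (0, 1)."""
    return min(points, key=lambda p: math.hypot(p[1], 1 - p[2]))


def min_expected_cost(points, prevalence, cost_fp, cost_fn):
    """Cut-off minimizing cost_fp * P(false positive) + cost_fn * P(false negative)."""
    def cost(point):
        _, fpr, tpr = point
        return cost_fp * fpr * (1 - prevalence) + cost_fn * (1 - tpr) * prevalence
    return min(points, key=cost)


print("nearest to (0,1):     ", closest_to_corner(ROC_POINTS)[0])
print("equal costs:          ", min_expected_cost(ROC_POINTS, 0.5, 1, 1)[0])
print("FN five times costlier:", min_expected_cost(ROC_POINTS, 0.5, 1, 5)[0])
```

With equal costs both criteria select the same cut-off in this example; once a false negative is assumed to be five times as costly as a false positive, the lower and more sensitive cut-off is preferred, in line with the observation that the optimal cut-off depends on what the test is used for.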
Comparing the performance of two or more tests
ROC curves are also used when comparing the average performance of two or more tests. When the two compared curves do not intersect one another, an analysis can be made simply by comparing the ROC curves corresponding to the two tests (Streiner & Cairney, 2007). In such cases it is essential to also consider the correlation between the two areas (the correlation between the two sets of scores) in order to reduce the standard error and to increase the power of comparison (statistical power of comparison test). It must be noted that a
difference which is not significant does not indicate the equivalence of the two tests. If the ROC curves intersect, we use the partial AUC in order to compare them. Instead of comparing the AUC over the entire range, we concentrate on the area of the curves between specific values of sensitivity or specificity, or simply on a specific value (e.g. specificity equal to 0.4, or sensitivity equal to 0.6). The test with a higher curve at a specific value is more useful at that value. Similarly, we can compare two tests at their optimum cut-off points (in this case we can compare their specificities and sensitivities independently or we can compare them globally, considering the accuracy that incorporates both indicators). In the figure below, Test B has a better average performance than Test A; the difference in performance is confirmed by the difference between the areas. However, if we compare the performance of these tests at a specificity value of 0.4 (1-specificity=0.6), we note that test A has a higher curve, which means a better performance (see point X in Figure 3).
Figure 3. Comparing the performance of two tests
If we compare the performance of test A and test B between specific values of specificity (1-specificity between 0.6 and 1.00), we notice that test A has a better performance, even if its average performance (the entire AUC) is poorer than the performance of test B.
To conclude, using ROC curve analysis for two or more tests, we can compare their average performance, their performance over certain intervals of specificity or sensitivity, their performance at discrete values of specificity or sensitivity, or their performance at the optimum cut-off point of each test.
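A comparison based on partial areas can be sketched as follows, assuming piecewise-linear ROC curves described by a few points with strictly increasing false positive rates. The two curves are hypothetical and were chosen so that they cross, as in the situation described above; they are not the curves of Figure 3.

```python
def interpolate(fpr, tpr, x):
    """Linearly interpolate the true positive rate at false positive rate x."""
    for i in range(1, len(fpr)):
        if fpr[i - 1] <= x <= fpr[i]:
            weight = (x - fpr[i - 1]) / (fpr[i] - fpr[i - 1])
            return tpr[i - 1] + weight * (tpr[i] - tpr[i - 1])
    raise ValueError("x lies outside the curve")


def partial_auc(fpr, tpr, lo, hi):
    """Trapezoidal area under the curve for lo <= 1-specificity <= hi."""
    xs = [lo] + [x for x in fpr if lo < x < hi] + [hi]
    ys = [interpolate(fpr, tpr, x) for x in xs]
    return sum((xs[i] - xs[i - 1]) * (ys[i] + ys[i - 1]) / 2
               for i in range(1, len(xs)))


# Hypothetical crossing curves: (1 - specificity) and sensitivity values.
FPR = [0.0, 0.2, 0.6, 1.0]
TPR_A = [0.0, 0.30, 0.95, 1.0]
TPR_B = [0.0, 0.70, 0.85, 1.0]

for name, tpr in (("A", TPR_A), ("B", TPR_B)):
    total = partial_auc(FPR, tpr, 0.0, 1.0)
    upper = partial_auc(FPR, tpr, 0.6, 1.0)
    print(f"test {name}: total AUC={total:.2f}, partial AUC on [0.6, 1.0]={upper:.2f}")
```

In this illustration test B has the larger total area, while test A has the larger partial area for 1-specificity between 0.6 and 1.00, so the ranking of the two tests depends on the region of interest.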
Computer programs used for ROC analysis
In recent years, several computer programs have been used to perform ROC analysis. Among these, the best known are: AccuROC, Analyse-it, CMDT, GraphROC, MedCalc, mROC, ROCKIT, SPSS, STATA and SAS. In a comparative analysis of eight computer programs, Stephan et al. (2003) highlighted the advantages and drawbacks of each program. The authors considered the following criteria in their analysis: ease of use, mathematical correctness, final output, and compatibility with other graphics programs. The results of their analysis show that only Analyse-it, AccuROC, and MedCalc exhibit good performance, while only GraphROC can compare curves at a certain sensitivity or specificity cut-off point. The authors' conclusion was that adequate ROC analysis and ROC plotting cannot be performed with a single program; they therefore recommend the use of Analyse-it, AccuROC, and MedCalc, despite certain limitations. As far as the well-known SPSS is concerned, the authors notice that ROC analysis within this package is not yet fully developed. For example, SPSS does not allow comparison of ROC curves. Although it shows a wide range of other statistics, a valid ROC analysis cannot be performed with this software (for more details, see Stephan et al., 2003). It needs to be mentioned that their conclusions are valid for 2003 and that multiple developments of the statistical packages regarding ROC analysis have taken place since then. In addition, the STATA and SAS programs were not included in this review. A valid up-to-date review is needed.
ROC ANALYSIS: AN APPLICATION IN CLINICAL PSYCHOLOGY
Brief description of the study
Mental health is one of the most important health issues worldwide. Research suggests that overall rates of mental illnesses are rising and that there is a trend toward earlier onset of various mental illnesses (e.g., depression); additionally, the levels of disability, mortality and human or financial resources associated with mental illness are currently unprecedented (Avenevoli et al., 2008). Major depressive disorder (MDD) is one of the leading causes of emotional distress and disability both in adults and youth (OMS, 2000). MDD is a serious condition characterized by one or more major depressive episodes.
Symptoms of depression may differ at various ages and stages of development, and may vary for different ethnic groups. Generally, however, according to the Diagnostic and Statistical Manual of Mental Disorders (DSM-IV-TR, American Psychiatric Association, 2000) depressed children and adolescents show a significant mood change (depressed, sad, irritable, and indecisive), they lose interest in activities that used to please them, have problems concentrating, and may lack energy or motivation; they may neglect their appearance and hygiene and they criticize themselves and feel that others criticize them. They may also indicate that they are feeling worthless or hopeless about the future, and thoughts of suicide may be present. A variety of methods are used to assess mental health and illness, and these methods have considerable impact on diagnosis and treatment. It has often been suggested that reliance on clinical judgment alone, rather than on statistical and standardized measures, leads to less accurate diagnosis (Dawes et al., 1989; Lowe et al., 2004). Given that MDD is often overlooked, misdiagnosed, and therefore mistreated, it is worth exploring if there are additional tools which can quickly and accurately assess specific symptoms and etiopathogenetic mechanisms in order to improve diagnosis and treatment in these patients.
Objective: The objectives of the current study are: (1) to test if there is a significant difference between the two scales regarding their mean diagnostic performance; (2) to establish the optimal cut-off point of each measure for the sample investigated; (3) to evaluate the diagnostic performance for each investigated measure; and (4) to compare their sensitivity and specificity at their established cut-off points.
Sample: Data were pooled from participants in two separate studies. Inclusion criteria required that patients (children and adolescents aged from 12 to 18) meet criteria for a current diagnosis of MDD as described in the Diagnostic and Statistical Manual of Mental Disorders (4th edition, text revision; DSM-IV-TR, American Psychiatric Association, 2000). Diagnoses were determined by means of the Structured Clinical Interview for DSM-IV (KID SCID; Hien et al., 1994, 1998, 2004). Exclusion criteria included a number of concurrent psychiatric disorders, current substance abuse, mental retardation, organic brain syndrome; we also excluded participants who were in some concurrent form of psychotherapy, who were receiving psychotropic medication or who needed to be hospitalized because of imminent suicidal risk. Patients were recruited by local advertisements and by referrals from clinics within the Pediatric Psychiatry Clinic in Cluj-Napoca from March 2007 until June 2008. The sample consisted of 50 patients. Fifty adolescent volunteers from several high schools in Cluj were also included.
Measures: All subjects filled in self-report measures that evaluate the symptoms and the etiopathogenetic mechanisms involved in depression.
The Beck Depression Inventory (BDI; Beck et al., 1979) is a 21-item self-report inventory measuring current symptoms of depression (e.g., sadness, fatigue, social withdrawal, irritability, hopelessness etc.). A value of 0 to 3 is assigned to each answer and the total score is computed by summing up the individual values. Higher total scores indicate more severe depressive symptoms. The Romanian version of the BDI has very good psychometric properties and has proven sensitive in screenings and clinical change assessments.
The Automatic Thoughts Questionnaire (ATQ; Hollon & Kendall, 1980) is a 15-item questionnaire assessing the frequency of negative thoughts experienced by depressed individuals. All items consist of different self-related automatic thoughts (e.g. I am worthless; The future is dark; I feel helpless), frequently identified in patients with MDD. Such sustained, inaccurate and intrusive negative automatic thoughts about the self, the world and the future are hypothesized to cause depression, rather than be generated by depression. Each of the 15 items is rated on a scale from 1 to 5. The total score is computed by summing up the individual item values. The higher the score, the higher the frequency of negative automatic thoughts. Scores of the original scale and the Romanian version indicate high internal consistency and concurrent validity, and also differentiate depressed from non-depressed groups (Netemeyer, 2002).
Statistical analysis: ROC curve analysis was performed, for both scales, using the MedCalc and SPSS software. We first tested whether the AUC of each scale is significantly different from the AUC corresponding to a random test. We then determined their specificity and sensitivity in order to establish their cut-off points and we checked if there are significant differences between their AUCs. Finally, we compared their sensitivity and specificity at their previously established cut-off points.
Results
In analyzing our results, we first checked if the AUC of the two scales is significantly different from the area under the diagonal of a random test. Table 2 presents the results of this analysis for the ATQ scale.
Table 2. The AUC for ATQ scale
Area under the ROC curve (AUC)       0.909
Standard error                       0.0307
95% Confidence interval              0.835 to 0.957
z statistic                          13.338
Significance level P (Area=0.5)      0.0001
As Table 2 shows, the Z test performed with MedCalc indicates a significant difference from the random area, with a probability of error smaller
than 1% (Z=13.33, p<.01). The same table shows an AUC for the ATQ scale of .90, which, according to Streiner and Cairney (2007), indicates a high discriminative capacity of the ATQ scale. In Table 3 we present the results of the same analysis for the BDI scale.
Table 3. The AUC for BDI scale
Area under the ROC curve (AUC)       0.996
Standard error                       0.00655
95% Confidence interval              0.955 to 0.996
z statistic                          75.682
Significance level P (Area=0.5)      0.0001
Table 3 shows that the AUC of the BDI is also significantly different from the random area, with a probability of error smaller than 1% (Z=75.68, p<.01). Our data also show an AUC of the BDI of .99, which, according to Streiner and Cairney (2007), indicates a high accuracy of the scale. Our results so far indicate that the performance of both ATQ and BDI significantly differs from the performance of a random test and that both have a high accuracy in identifying participants with depression. In order to establish the optimal cut-off point of each scale, we analyzed both sensitivity and specificity at each possible cut-off point, for both scales. Table 4 presents the analysis performed for the ATQ. As the results in Table 4 indicate, the best performance of the ATQ in discriminating between depressed and non-depressed participants is reached at the cut-off point of 34 (sensitivity=94%, specificity=70%). The same analysis was performed for the BDI. Results are presented in Table 5.
Table 4. Criterion values and coordinates of the ROC curve for ATQ
Table 5 indicates that the best discriminating performance of the BDI is reached at the cut-off point of 21 (sensitivity=100%, specificity=96%). In order to test if there is a significant difference between the two scales regarding their mean diagnostic performance, we also used MedCalc and compared their AUCs (Table 6).
[Plot: ROC curves of the BDI and the ATQ with the reference line; X axis: 1 - Specificity, Y axis: Sensitivity]
Table 6. Pairwise comparison of ATQ and BDI AUCs
ATQ ~ BDI
Difference between areas             0.0868
Standard error                       0.0299
95% Confidence interval              0.0283 to 0.145
z statistic                          2.907
Significance level                   P = 0.004
As the results in Table 6 indicate, there is a significant difference in the overall diagnostic performance between ATQ and BDI (z=2.90, p<.01) in favor of the BDI scale (AUC for BDI=.99 vs AUC for ATQ=.90). A visual inspection of both curves in Figure 4 suggests that there is a diagnostic performance difference in favor of the BDI at all sensitivity or specificity levels, and at all intervals of these parameters.
Figure 4. The ROC curves of ATQ and BDI
Finally, if we look at Tables 4 and 5, at the accuracy of both scales at their optimum cut-off point, we can conclude that the BDI has a better overall accuracy, and also better sensitivity (100 vs. 94) and specificity (96 vs. 70). To conclude, our analysis underlines the superiority of the BDI compared to the ATQ at all levels of comparison: the overall ability to discriminate between the two categories (depression vs. non-depression), the ability to discriminate between the two categories at specific levels of sensitivity or specificity, over intervals of those parameters, and at their optimum cut-off point.
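As a rough check, the z statistics reported in Tables 2, 3 and 6 can be approximately reproduced from the tabled estimates and standard errors. The sketch below assumes that each z value is the estimate minus its value under the null hypothesis, divided by the standard error; this formula is our assumption about the computation and is not taken from the program output.

```python
def z_statistic(estimate, null_value, standard_error):
    """Assumed form of the test: (estimate - null value) / standard error."""
    return (estimate - null_value) / standard_error


# Estimates and standard errors as reported in Tables 2, 3 and 6.
z_atq = z_statistic(0.909, 0.5, 0.0307)    # ATQ AUC against the random area
z_bdi = z_statistic(0.996, 0.5, 0.00655)   # BDI AUC against the random area
z_diff = z_statistic(0.0868, 0.0, 0.0299)  # difference between the two areas

print(f"ATQ vs. random:  z = {z_atq:.2f} (reported: 13.338)")
print(f"BDI vs. random:  z = {z_bdi:.2f} (reported: 75.682)")
print(f"ATQ vs. BDI:     z = {z_diff:.2f} (reported: 2.907)")
```

The recomputed values are close to the reported ones; the small discrepancies are consistent with the rounding of the tabled inputs.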
DISCUSSION AND CONCLUSIONS
Our work advocates the use of ROC analysis on a larger scale in clinical psychology. As we have shown in the section dedicated to ROC fundamentals, this procedure allows a rigorous analysis of the diagnostic performance of tests (instruments) frequently used in clinical psychology. Based on these results, we can establish, for example, if the clinical instruments have a good overall performance and select an optimal cut-off point which best discriminates between categories of people with and without a certain disorder. ROC analysis also allows comparing two or more tests regarding their overall diagnostic performance, or their performance at certain discrete values or intervals of sensitivity or specificity. We must emphasize that, before performing ROC analysis, valid measures are needed to identify the two categories investigated. In our case, depression was diagnosed using the Structured Clinical Interview for DSM-IV (KID SCID; Hien et al., 1994, 1998, 2004); the BDI and ATQ were used after a valid method of diagnosing depression was employed. The example of ROC analysis used in this article illustrates how this procedure works in a real clinical research situation, and consequently, our results must be interpreted in this particular context. However, our findings replicate previous studies on depression. Consistent with the literature, BDI and ATQ scores discriminate well between adolescents suffering from depression and adolescents not suffering from depression. The superiority of the BDI in discriminating depressed from non-depressed participants was expected and is fairly justifiable. High BDI scores indicate the presence of depressive symptoms while high ATQ scores indicate the presence of mechanisms that are likely to lead to depression.
Therefore, the better sensitivity and specificity of the BDI in discriminating depressed from non-depressed participants is rather intuitive: having frequent automatic thoughts does not necessarily make us depressed, while having many depressive symptoms will most probably place us in the depressed category. Among the limitations of our study, several are worth mentioning: the patients included may not be representative of the general depressed population or of depressed patients entering treatment; similarly, data pooled from adolescent volunteers may not be representative of the non-depressed population in general. However, despite these limitations, our application can also be regarded as innovative considering that, to our knowledge, this is the first study to analyze BDI and ATQ accuracy (both sensitivity and specificity) in discriminating depressed from non-depressed patients using ROC analysis. To conclude, as we have previously shown, the principles behind ROC analysis are easy to understand and the existence of software performing the procedure makes it attractive and easy to use, with relevant benefits for clinical research.
REFERENCES
American Psychiatric Association. (2000). Diagnostic and Statistical Manual of Mental Disorders (4th edition, text revision). Washington, DC: APA.
Avenevoli, S., Knight, E., Kessler, R. C., & Merikangas, K. R. (2008). Epidemiology of depression in children and adolescents. In J. R. Z. Abela & B. L. Hankin (Eds.), Handbook of depression in children and adolescents. New York: The Guilford Press.
Ballesteros, J., Bobes, J., Bulbena, A., Luque, A., Dal-Re, R., Ibarra, N., & Guemes, I. (2007). Sensitivity to change, discriminative performance, and cutoff criteria to define remission for embedded short scales of the Hamilton depression rating scale (HAMD). Journal of Affective Disorders, 102, 93-99.
Beck, A. T., Rush, A. J., Shaw, B. F., & Emery, G. (1979). Cognitive therapy of depression. New York: Guilford Press.
Benazzi, F. (2008). Defining mixed depression. Progress in Neuro-Psychopharmacology and Biological Psychiatry, 32, 932-939.
Brown, C. D., & Davis, H. T. (2006). Receiver operating characteristics curves and related decision measures: A tutorial. Chemometrics and Intelligent Laboratory Systems, 80, 24-38.
Chiu, Y. C., Li, C. L., Lin, K. N., Chiu, Y. F., & Liu, H. C. (2008). Sensitivity and specificity of the clock drawing test, incorporating Rouleau scoring system, as a screening instrument for questionable and mild dementia: Scale development. International Journal of Nursing Studies, 45, 75-84.
Dawes, R. M., Faust, D., & Meehl, P. E. (1989). Clinical versus actuarial judgment. Science, 243, 1668-1674.
Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27, 861-874.
Giaquinto, S., & Parnetti, L. (2006). Early detection of dementia in clinical practice. Mechanisms of Ageing and Development, 127, 123-128.
Hien, D., Matzner, F., First, M. B., Spitzer, R. L., Williams, J. B. W., & Gibbon, M. (1994, 1998, 2004). Interviu clinic structurat pentru tulburările clinice ale sugarului, copilului și adolescentului [Structured clinical interview for the clinical disorders of infants, children and adolescents]. Romanian adaptation: David, D. (coordinator). Cluj-Napoca: Editura RTS.
Horwitz, J. E., Lynch, J. K., McCaffrey, R. J., & Fisher, J. M. (2008). Screening for neuropsychological impairment using Reitan and Wolfson's preliminary neuropsychological test battery. Archives of Clinical Neuropsychology, 23, 393-398.
Ivarsson, T., & Larsson, B. (2008). The Obsessive-Compulsive Symptom (OCS) scale of the Child Behavior Checklist: A comparison between Swedish children with Obsessive-Compulsive Disorder from a specialized unit, regular outpatients and a school sample. Journal of Anxiety Disorders, in press.
Jokinen, J., Nordstrom, A. L., & Nordstrom, P. (2008). ROC analysis of dexamethasone suppression test threshold in suicide prediction after attempted suicide. Journal of Affective Disorders, 106, 145-152.
Kendler, K. S. (1995). Genetic epidemiology in psychiatry. Taking both genes and environment seriously. Archives of General Psychiatry, 52, 895-899.
Kills Small, N. J., Simons, J. S., & Stricherz, M. (2007). Assessing criterion validity of the Simple Screening Instrument for Alcohol and Other Drug Abuse (SSI-AOD) in a college population. Addictive Behaviors, 32, 2425-2431.
Lowe, B., Spitzer, R. L., Grafe, K., Kroenke, K., Quenter, A., & Zipfel, S. (2004). Comparative validity of three screening questionnaires for DSM-IV depressive disorders and physicians' diagnoses. Journal of Affective Disorders, 78, 131-140.
McFall, R. M., & Treat, T. A. (1999). Quantifying the information value of clinical assessment with Signal Detection Theory. Annual Review of Psychology, 50, 215-241.
Netemeyer, R. G., Williamson, D. A., Burton, S., Biswas, D., Jindal, S., Landreth, S., Mills, G., & Primeaux, S. (2002). Psychometric properties of shortened versions of the Automatic Thoughts Questionnaire. Educational and Psychological Measurement, 62, 111-129.
O'Brien, A., Gaudino-Goering, E., Shawaryn, M., Komaroff, E., Moore, N. B., & DeLuca, J. (2007). Relationship of the Multiple Sclerosis Neuropsychological Questionnaire (MSNQ) to functional, emotional, and neuropsychological outcomes. Archives of Clinical Neuropsychology, 22, 933-948.
Ong, J. C., Kuo, T. F., & Manber, R. (2008). Who is at risk for dropout from group cognitive-behavior therapy for insomnia? Journal of Psychosomatic Research, 64, 419-425.
Parker, G., Fletcher, K., Barrett, M., Synnott, H., Breakspear, M., Hyett, M., & Hadzi-Pavlovic, D. (2008). Screening for bipolar disorder: The utility and comparative properties of the MSS and MDQ measures. Journal of Affective Disorders, 109, 83-89.
Serrano-Duenas, M., & Serrano, M. S. (2008). Concurrent validation of the 21-item and 6-item Hamilton Depression Rating Scale versus the DSM-IV diagnostic criteria to assess depression in patients with Parkinson's disease: An exploratory analysis. Parkinsonism and Related Disorders, 14, 233-238.
Shapiro, E. S., Solari, E., & Petscher, Y. (2008). Use of a measure of reading comprehension to enhance prediction on the state high stakes assessment. Learning and Individual Differences, in press, uncorrected proof, available online 26 March 2008.
Sharifi, F., Mousavinasab, N., Mazloomzadeh, S., Jaberi, Y., Saeini, M., Dinmohammadi, M., & Angomshoaa, A. (2008). Cutoff point of waist circumference for the diagnosis of metabolic syndrome in an Iranian population. Obesity Research & Clinical Practice, in press, corrected proof, available online 23 May 2008.
Stafford, L., Berk, M., & Jackson, H. J. (2007). Validity of the Hospital Anxiety and Depression Scale and Patient Health Questionnaire-9 to screen for depression in patients with coronary artery disease. General Hospital Psychiatry, 29, 417-424.
Stephan, C., Wesseling, S., Schink, T., & Jung, K. (2003). Comparison of Eight Computer Programs for Receiver-Operating Characteristic Analysis. Clinical Chemistry, 49, 433-439.
Streiner, D. L., & Cairney, J. (2007). What's under the ROC? An introduction to Receiver Operating Characteristics Curves. The Canadian Journal of Psychiatry, 52, 121-128.
Walsh, T. L., Homa, K., Hanscom, B., Lourie, J., Sepulveda, M. G., & Abdu, W. (2006). Screening for depressive symptoms in patients with chronic spinal pain using the SF-36 Health Survey. The Spine Journal, 6, 316-320.
Westin, L. (2001). Receiver operating characteristic (ROC) analysis: Evaluating discriminance effects among decision support systems. Retrieved 05.16.2008 from http://www.cs.umu.se/research/reports/2001/018/part1.pdf