
2007; 29: e122–e132

WEB PAPER

An analysis of peer, self, and tutor assessment in problem-based learning tutorials

TRACEY PAPINCZAK, LOUISE YOUNG, MICHELE GROVES & MICHELE HAYNES

School of Medicine, University of Queensland, Herston Road, Herston, 4006, Queensland, Australia

Abstract

Objective: The purpose of this study was to explore self-, peer-, and tutor assessment of performance in tutorials among first-year medical students in a problem-based learning curriculum.

Methods: One hundred and twenty-five students enrolled in the first year of the Bachelor of Medicine and Bachelor of Surgery Program at the University of Queensland were recruited to participate in a study of metacognition and peer- and self-assessment. Both quantitative and qualitative data were collected from the assessment of PBL performance within the tutorial setting, which included elements such as responsibility and respect, communication, and critical analysis through presentation of a case summary. Self-, peer-, and tutor assessment took place concurrently.

Results: Scores obtained from tutor assessment correlated poorly with self-assessment ratings (r = 0.31–0.41), with students consistently under-marking their own performance to a substantial degree. Students with greater self-efficacy scored their PBL performance more highly. Peer-assessment was a slightly more accurate measure, with peer-averaged scores correlating moderately with tutor ratings initially (r = 0.40) and improving over time (r = 0.60). Students consistently over-marked their peers, particularly those with sceptical attitudes to the peer-assessment process. Peer over-marking led to less divergence from the tutor scoring than under-marking of one's own work.

Conclusion: According to the results of this study, first-year medical students in a problem-based learning curriculum were better able to judge the performance of their peers than their own performance. This study has shown that self-assessment of process is not an accurate measure, in line with the majority of research in this domain. Nevertheless, it has an important role to play in supporting the development of skills in reflection and self-awareness.

Introduction

 

Practice points

. Self-assessment results in substantial under-marking compared to tutor assessment.

. Scores obtained from peer-assessment are significantly more generous than those arising from tutor assessment.

. Self-assessment is a less accurate means of assessing student performance than peer-assessment.

Education of medical students should prepare them to deal with problems in the future, equipping them with skills necessary to become active, self-directed learners, rather than passive recipients of information (Dolmans & Schmidt 1996). Acknowledgment of this need was responsible, in part, for the development of problem-based learning (PBL) (Barrows & Tamblyn 1980). Within the discipline of medical education, PBL is a curriculum innovation that involves students in learning activities using loosely structured medical problems to drive learning (Norman & Schmidt 1992). The pedagogical appeal of PBL is its perceived capacity to encourage, through these learning processes, enhanced clinical reasoning skills, and the development of both an adaptable knowledge base and skills in self-directed learning necessary to become lifelong learners (Kelson & Distlehorst 2000). Four crucial conditions for a deep approach to learning are encompassed within the PBL approach: a well-structured knowledge base, active learning, collaborative learner interaction, and a context designed to promote internal motivation through the provision of pragmatic goals (Margetson 1994). Assessment of student progress in such a student-centred curriculum, however, has remained challenging (Eva 2001).

 

Assessment protocols within PBL curricula have sometimes sought to include self-, peer-, and tutor evaluation to assess a range of skills, such as self-directed learning, group cooperation, and communication (Swanson et al. 1997). Tutors and peers have a unique opportunity to judge each other's work in PBL tutorials, and students should develop the ability to reflect on their own strengths and weaknesses, as these are central elements of self-directed learning (Eva et al. 2004). Several published quantitative studies of peer-assessment within PBL curricula reveal correlations between staff/tutor and peer ratings ranging from very low (Sluijmans et al. 2001; Reiter et al. 2002) to moderate (Sullivan et al. 1999; Segers & Dochy 2001). Of the limited research undertaken with medical students in PBL, moderate correlation between peer and tutor ratings was demonstrated by Sullivan et al. (1999), while low correlation was reported by Reiter et al. (2002). Findings arising from studies of medical students in non-PBL curricula show generally moderate correlations (Burnett & Cavaye 1980; Van Rosendaal & Jennett 1992; Rudy et al. 2001; Minion et al. 2002). Several factors have the potential to impact negatively on the accuracy of peer evaluations, including friendship marking and decibel marking, which favours dominant group members (Pond & ul-Haq 1997). These may result in the peer over-marking often observed in quantitative studies of peer-assessment (for instance, Rudy et al. 2001). It is also possible that, in high-stakes settings such as medical schools, inflated estimates of peer performance would be the norm (Norcini 2003).

Another format for evaluating student performance in PBL tutorials is self-assessment. Self- and peer-assessment are often combined or considered together: peer-assessment, for instance, builds on evaluation skills that may be transferred to self-assessment tasks, and enables learners to compare their self-evaluations with the assessments of others. Despite meta-analyses of self-assessment in higher education deeming students 'well able to self-assess accurately' (Sluijmans et al. 1999, p. 300), within medical PBL programs reported correlations between self and tutor evaluations are uniformly low (Rezler 1989; Gordon 1991; Das et al. 1998; Sullivan et al. 1999; Reiter et al. 2002). Nor has a significant relationship been found between self-assessment scores and examination results (Tousignant & Des Marchais 2002; Eva et al. 2004). This is also true of non-PBL medical curricula, where poor association has been shown between scores obtained from self-assessment and tutor assessment (or examination results) (Arnold et al. 1985; Woolliscroft et al. 1993; Rudy et al. 2001; Fitzgerald et al. 2003). When self-assessment scores are compared with peer-assessment scores, low correlations have been shown in all but one study of PBL curricula or of medical courses (Sullivan et al. 1999; Reiter et al. 2002; Miller 2003). The exception, a study by Burnett and Cavaye (1980), reported a very high correlation between self- and peer-assessment scores among fifth-year medical students in a traditional curriculum. However, Eva et al. (2004) reported disappointingly low correlation between students' self-assessment and performance on a test of medical knowledge, with no evidence of improvement after one year of medical education. The tendency exists for students to overestimate their competence (Mattheos et al. 2004), especially lower-performing students (Woolliscroft et al. 1993; Lejk & Wyvill 2001; Edwards et al. 2003). Young or highly capable students are more likely to under-mark their work (Stefani 1992; Rudy et al. 2001; Edwards et al. 2003; Fitzgerald et al. 2003). The influence of gender on both self-assessment and peer-assessment accuracy appears to be minimal (Falchikov & Magin 1997; Mattheos et al. 2004), although the experience of self- and peer-assessment may be more stressful for females (Pope 2005).

Several approaches have been suggested to improve the accuracy of scores generated from peer- or self-assessment. One well-supported idea is the use of co-assessment, which involves students with staff in the assessment process (Sluijmans et al. 1999). While research has not supported the reliability of tutor assessment in PBL tutorials for measures of student knowledge (Neville 1998; Cunnington 2001; Whitfield & Xie 2002), the potential exists for tutor assessment to be combined with or compared to peer- or self-assessment to improve the accuracy and comprehensiveness of the evaluations generated (Dochy et al. 1999; Eva 2001). Tutors are in a reasonable position to judge group processes (Dodds et al. 2001).

The aim of this study was to explore peer- and self-assessment within PBL tutorials in a medical course using qualitative and quantitative approaches. Qualitative data were collected to gather students' perceptions of these alternate forms of assessment (see Papinczak et al. 2007). Quantitative data were analysed to assess the 'accuracy' of students as assessors, with tutor scores as comparison. The impact of specific demographic factors and students' self-efficacy was analysed to gain greater understanding of influences on scoring. It was anticipated that confident (efficacious) students would award themselves higher marks, although this may be mediated by fears of self-aggrandisement in a public arena (see Chaves et al. 2006). Self-efficacy is defined as students' perceptions of their ability to successfully carry out a task (Bandura 1986). When facing a difficult learning task, a student with high self-efficacy beliefs is more likely to participate more actively, work harder, remain more problem-focussed, and persist for a longer time than a student with low self-efficacy, who is more likely to view the situation as insurmountable, get frustrated and give up (Pajares 1996; Nichols & Steffi 1999). Students with high levels of self-efficacy are more willing to take on challenging tasks (Zimmerman 2000), whereas students with low self-efficacy may fail to achieve even when goals are within easy reach (Bandura 1993). The effect of self-efficacy on scores obtained through self-assessment has not previously been evaluated within the PBL tutorial setting. The PBL environment, with its emphasis on self-directed and collaborative learning, provides a unique context in which to explore alternative forms of assessment. As they work together in PBL tutorials, students may develop interdependent relationships facilitating learning and motivation (Willis et al. 2002).
This study sought to incorporate qualitative and quantitative dimensions in order to gain a fuller understanding of peer- and self-assessment within collaborative small group environments.

Description of the study

Quantitative and qualitative data were gathered as part of a larger study of metacognitive processes undertaken with first-year students enrolled in the Bachelor of Medicine and Bachelor of Surgery (MBBS) Program at The University of Queensland, Australia. Only the results of the self- and peer-assessment segment of the study are reported here, including quantitative findings and qualitative results which may be explanatory or insightful. Ethical approval was obtained from the University of Queensland's Behavioural & Social Sciences Ethical Review Committee.


Setting

The MBBS Program introduced a four-year, graduate-entry PBL curriculum in 1997. First-year students, in small groups of nine or ten, undertake five hours of PBL tutorial time each week for 33 weeks of the year. Working in collaboration with group members, students analyse a problem of practice, formulate hypotheses, and undertake self-directed learning to try to understand and explain all aspects of the patient’s ‘problem’.

Subjects

The study was conducted with 125 first-year medical students and 20 tutors over a period of six months during 2004. Every student in thirteen tutorial groups took part in a program of educational activities within their PBL tutorials, including peer- and self-assessment. Subsequent statistical analysis showed that the self-selected study subjects (40.2% of the student population) were representative of the entire cohort on measures of age, gender, and primary undergraduate degree.

Instruments

Qualitative and quantitative data were generated using two instruments: the peer assessment instrument (as shown in Figure 1) and the test of self-efficacy, which all participants completed at the commencement of the study.

The peer assessment instrument

In order to enhance student ownership of assessment criteria (as recommended by Boud (1995) and Orsmond et al. (2000)), members of several PBL tutorial groups in the previous cohort were invited to participate in the development of an instrument for peer- and self-assessment of students' performance in PBL tutorials. Students were first presented with a list of criteria derived from relevant literature (including Das et al. (1998) and Willis et al. (2002)), from which a set of items was selected for inclusion in the first draft of the instrument. The negotiated instrument with 19 items, labelled the peer assessment instrument, was trialled with another student group and rated as easy to use and understand by all participants. Student dissatisfaction with two items resulted in their removal from the final version of the instrument. The resulting scale measures several features of successful adult education, such as participation, punctuality, respect for others, effective communication, and critical analysis (as shown in Figure 1). However, the inclusion of items specifically targeting self-directed learning and self-awareness, core features of PBL, allows it to be differentiated from instruments which may be appropriate for open-ended, but less student-centred, approaches such as case-based instruction (Hay & Katsikitis 2001). The phrasing of items on the peer assessment instrument was varied slightly to make it more relevant to self-evaluation where applicable; for instance, 'I' instead of 'the student'.

Qualitative data were collected from responses to an open question (inviting comments) on the final page of the questionnaire. These were analysed and coded to themes to provide insight into student perceptions. In order to gain a measure of face validity, three experienced PBL facilitators were asked to indicate whether each of the 17 items on the instrument was relevant to PBL performance and able to be adequately assessed using the item in question. Unanimous face validity was obtained for all items in four of the sub-scores: responsibility and respect, information processing, communication, and critical analysis. Some dissent about the validity of the self-awareness sub-score was evident. Construct validity describes the degree to which the items used in the instrument define the constructs (Pressley & McCormick 1995). The five constructs or domains of performance are reported extensively in the medical and nursing education literature. Each of the three PBL tutors and ten PBL students was asked to categorise the 17 items into the five specified domains. In all cases, the items were distributed in accordance with the domains as defined on the instrument. Values for Cronbach's alpha ranged from 0.76 to 0.84, indicating good internal consistency among the five sub-scores. Acceptable reliability was found, with Pearson correlation coefficients for peer-averaged and tutor assessment ranging from 0.40 to 0.60. Notably, the self-awareness items were problematic, with a significant number of students consistently entering 'not applicable' for those two items. Unfortunately, time constraints prevented further renegotiation of the peer assessment instrument with the subsequent cohort prior to the commencement of the study.

Test of self-efficacy

The instrument to measure students' self-efficacy was composed specifically for this project, as existing instruments were not designed for use in problem-based learning courses. The test of self-efficacy comprises eleven closed questions relating to regulation of, and confidence in, learning, with scores rated on a Likert scale of one to five. The first six items (Part A) deal with students' perceived capability to use various self-regulation strategies, such as organizing their studies, and concentrating and participating in small-group tutorials. These were loosely based on Bandura's (1989) multidimensional scales of perceived self-efficacy, reported in Zimmerman et al. (1992). This original scale was designed to measure high school students' perceived capability to use various self-regulating strategies, such as concentrating on school subjects, organising schoolwork, and participating in school discussions. Of the eleven items on the original scale, the most applicable six were chosen and rewritten to more appropriately reflect the learning and studying activities carried out by students in this medical course, in order to create a brief instrument measuring self-efficacy to regulate learning. The six items deal with the following self-regulation strategies: completing allocated learning objectives for the group; studying when there are distractions; planning and organising study; course motivation; and concentration and active participation in tutorials. A further five items measuring self-efficacy for academic achievement were included in the final set of questions

Peer assessment instrument

Please answer the items below, indicating the strength of your agreement or disagreement with the statements about this student's performance in this week of PBL tutorials, by circling the number on the scale {1 = totally disagree; 5 = totally agree}.

The student:

A. Responsibility and respect
1. Completed all assigned tasks to the appropriate level
2. Completed all assigned tasks on time
3. Participated actively in the tutorial
4. Showed behaviour and input which facilitated my learning
5. Was punctual to this PBL tutorial
6. Listened to and showed respect for the opinions of others

B. Information processing
7. Brought in new information to share with the group
8. Provided information that was relevant and helpful
9. Seemed to use a variety of resources to obtain the information

C. Communication
10. Was able to communicate ideas clearly
11. Made comments and responses that did not confuse me

D. Critical analysis
12. Gave input which was focussed and relevant to the case
13. Made conclusions that can be substantiated by the evidence presented in the case
14. Gave a thorough summary of the case
15. Gave a summary of the case which showed evidence of reflection and evaluation

E. Self-awareness
16. Appeared able to assess his/her own strengths and weaknesses within PBL
17. Accepted and responded to criticism gracefully

Figure 1. The final 17-item draft of the peer assessment instrument (adapted from Das et al. 1998). Each item is rated on the five-point scale above.

(Part B) of the instrument. These were framed using items taken from the ten-item measure of self-efficacy first reported by Schwarzer and Jerusalem (1995), with five questions selected and modified to better measure the specific responsibilities of examination performance, tutorial participation, self-awareness, clinical reasoning, and academic achievement under consideration in this study. Statistical testing to determine internal reliability yielded Cronbach's alpha values of 0.68 for the first six items measuring self-efficacy for self-regulation, and 0.73 for the remaining five items dealing with self-efficacy for academic achievement. Reliability was not improved by the omission of a single item from either self-efficacy measure.
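The internal-consistency statistic reported throughout (Cronbach's alpha) can be reproduced for any item-score matrix. The following is a minimal Python sketch with invented Likert responses; the study itself used SPSS, so this is illustrative only:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents, k_items) score matrix:
    alpha = k/(k-1) * (1 - sum of item variances / variance of total score)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)      # sample variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of the summed scale
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical responses: rows = students, columns = items, scored 1-5.
scores = np.array([
    [4, 5, 4, 4, 5, 4],
    [3, 3, 4, 3, 3, 3],
    [5, 5, 5, 4, 5, 5],
    [2, 3, 2, 3, 2, 3],
    [4, 4, 5, 4, 4, 4],
])
print(round(cronbach_alpha(scores), 2))  # -> 0.96 for this toy data
```

Values near 0.7 or above, like those reported for both self-efficacy sub-scales, are conventionally taken to indicate acceptable internal consistency.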

Peer-, self- and tutor assessment

The study was undertaken with the 2004 cohort of medical students, using the assessment instrument previously developed with student input. Student feedback was collected during, and at the conclusion of, the study. This feedback was applied to modify aspects of the larger study framework, only some of which included the peer- and self-assessment process itself. All participants completed the test of self-efficacy before commencing the program of peer- and self-assessment. For a period of twenty-four weeks, all members of thirteen PBL tutorial groups took part in an activity designed to enhance learning. Two key components were introduced into PBL tutorials, both of which were readily integrated into the existing tutorial format with minimal additional time required from tutors or students. These components were:

(1) Reflection on learning: Each week one student from each tutorial group was asked to compose a summary of the week's medical problem, incorporating the clinical reasoning and collaborative learning occurring in their PBL tutorial group. The student was encouraged to present the summary to the group as a concept map or in mechanistic case-diagramming format (Guerrero 2001) to give a visual representation of both the content and the clinical reasoning entailed in solving the problem (a 'knowledge object' (Entwistle & Marton 1994)).

(2) Peer- and self-assessment: The student presentation, in association with his/her fulfilment of PBL roles and responsibilities for that week, was assessed using the peer assessment instrument. Peer-, self-, and tutor assessment took place concurrently. Scores from these worksheets were compared to explore statistical relationships. Qualitative data were analysed.

Figure 2. Diagram showing the sequence of marking episodes and assessment events for each student in each of 13 tutorial groups. (In each marking episode, the student presents the case summary for the week; the tutor and 8–9 peers score on the peer assessment instrument while the student self-scores on the self-assessment version; skewed peer scores are removed and an average score calculated. Marking episode 2 follows 8–10 weeks later.)

Constant monitoring of student perceptions of, and attitudes to, these educational activities helped to track the effects of the dual activities and to maximize student learning. This is in keeping with an action research process, which uses overlapping cycles of planning, acting and observing, and reflecting (Kemmis & Wilkinson 1998) to maintain a responsive and flexible study design.

In the first week of the main study (week four of the academic year), students in the thirteen tutorial groups were given two documents: a copy of the peer assessment instrument, to enable students to become familiar with the evaluation criteria, and an exemplar outlining 'good' and 'poor' outcomes for the criteria. Tutors assigned to each group then led their group in a practice session, with a tutor-led presentation of a summary of the previous week's medical case, in order to establish familiarity with the instrument and process. The summary itself was written by the researcher and presented, with explanation, to each tutor in the week prior to the trial. Tutors received written information, a short information session, and frequent communication and feedback to help maintain fidelity of treatment.

In the ensuing 23 weeks, tutorial groups implemented the summarization and assessment activities at the start of each week as part of the 'wrap-up' of the previous week's PBL case. Each student was expected to be the focus of two marking episodes, each preceded by a summary presentation (as shown in Figure 2). Tutors were encouraged to give concise feedback (based on written peer comments on the reverse of the assessment sheet) to students as soon as possible after the completion of the peer- and self-assessment procedure. Student feedback about the exercise was regularly invited as part of the action research process.

During the course of implementation, two tutorial groups withdrew from the study (16% of participants). Their justification for withdrawal was based on perceptions of their experiences, including scepticism about the value of peer-assessment and concerns about friendship marking. Statistical analysis showed that students withdrawing did not differ significantly from those remaining in the study in terms of age, gender, primary degree, or self-efficacy.

Data analysis

Analysis of the data was implemented using the statistical software SPSS Version 13.0. Scores on all five sub-scores of the peer assessment instrument were summed to give an overall score, with a maximum of 85. For each marking episode, data for each student consisted of a self-assessment score, a tutor assessment score, and up to nine peer-assessment scores. Scores obtained from the test of self-efficacy were summed to create two sub-scores: self-efficacy for self-regulation (with a maximum score of 30) and self-efficacy for academic achievement (with a maximum score of 25). These two sub-scores were analysed as separate entities.
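The score construction described above is a straightforward summation over sub-scores. A sketch, with the item groupings mirroring Figure 1 (function names and example data are hypothetical; the study used SPSS):

```python
# Item numbers per sub-score, following the groupings in Figure 1.
SUBSCALES = {
    "responsibility_respect": [1, 2, 3, 4, 5, 6],
    "information_processing": [7, 8, 9],
    "communication": [10, 11],
    "critical_analysis": [12, 13, 14, 15],
    "self_awareness": [16, 17],
}

def overall_score(item_scores: dict) -> int:
    """Sum the five sub-scores into an overall mark out of 85 (17 items x 5)."""
    for item, score in item_scores.items():
        if not 1 <= score <= 5:
            raise ValueError(f"item {item}: Likert scores run 1-5, got {score}")
    sub_scores = {name: sum(item_scores[i] for i in items)
                  for name, items in SUBSCALES.items()}
    return sum(sub_scores.values())

# A student rated 4 on every item scores 4 * 17 = 68 overall.
print(overall_score({i: 4 for i in range(1, 18)}))  # -> 68
```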

Table 1. Descriptive statistics (self-, peer-, and tutor assessment) for each of the two marking episodes.

Marking episode  Score                     N    Mean   SD     Min–max
1                Self-assessment           108  68.80  8.32   44–85
1                Averaged peer-assessment  115  79.08  4.21   66–85
1                Tutor assessment          89   76.15  7.58   43–85
2                Self-assessment           82   67.70  10.70  38–85
2                Averaged peer-assessment  87   79.04  3.75   68–84
2                Tutor assessment          70   74.99  8.96   46–85

Descriptive statistics for tutor, self-, and peer-assessment are presented in Table 1 for each of the two marking episodes. Data were missing for individuals failing to submit completed assessment instruments. Frequency histograms revealed non-normal distributions of scores resulting from peer-assessment, with some groups awarding full marks for a large proportion of assessments. As qualitative data made it apparent that some students deliberately scored 100% for peer performances, irrespective of quality, it was resolved to apply an algorithm to reduce the prevalence of deliberately skewed scores or scores resulting from friendship marking and students' cavalier attitudes. In instances where the tutor score for a given group was 72 out of 85 (a result of 85%) or less, all peer scores of 100% were omitted from the statistical analysis for that tutorial group. In this way, the most highly skewed results were excluded from the data set (representing 4.6% data loss), yet the data remained a reflection of a peer-assessment process which operates in a climate of student generosity towards others (see, for instance, Rudy et al. 2001).

Averaged peer-assessment scores were calculated by computing the mean of each student's completed peer-assessment instruments (once skewed results were removed). The reliability among peer-averaged scores for all intervention tutorial groups was well within acceptable limits, with Cronbach's alpha scores ranging from 0.66 to 0.77. The results of Kolmogorov-Smirnov testing confirmed that scores for self-, peer-averaged, and tutor assessment generated from the two marking episodes followed a normal distribution. The parametric tests chosen are fairly robust and should remain relatively unaffected by the observed clustering of assessment marks at the upper end of the range.

Despite reservations about its use in this capacity (see Ward et al. 2002), tutor scoring was utilised as the most appropriate benchmark for comparative purposes in assessing the reliability of peer- and self-assessment. In order to provide evidence for claims of 'accuracy', Bland-Altman plots (see Bland & Altman 1986) were used to graphically represent levels of agreement between two sets of scores. The difference between paired scores was plotted against the average of the scores for (1) self- versus tutor scores at time 2 and (2) peer-averaged versus tutor scores at time 2. Three lines, representing the mean difference and the upper and lower limits of agreement, were drawn. The limits of agreement were set as the mean difference ± 5% of the maximum score. The limits suggested by Bland and Altman (1986), which are the mean difference ± 2 standard deviations, were considered too wide to give meaningful results in this study and were not used. These plots are included as Figures 3 and 4.

Figure 3. Bland-Altman plot of self-assessment versus tutor scores at time 2 (self scores lower): mean difference = -7.59, limits of agreement -11.84 to -3.34.

Figure 4. Bland-Altman plot of peer-averaged versus tutor scores at time 2 (peer-averaged scores higher): mean difference = 3.75, limits of agreement -0.50 to 8.00.
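The skew-removal rule and the limits of agreement described above can be sketched as follows. The thresholds (72/85 tutor score, 100% peer scores, limits of mean difference ± 5% of the maximum score) come from the text; the function names and example data are invented for illustration:

```python
import statistics

MAX_SCORE = 85  # maximum on the peer assessment instrument

def trimmed_peer_average(peer_scores, tutor_score):
    """Drop full-mark (100%) peer scores when the tutor scored the same
    presentation at 72/85 (85%) or less, then average the remainder --
    a sketch of the skew-removal algorithm described in the text."""
    if tutor_score <= 72:
        peer_scores = [s for s in peer_scores if s < MAX_SCORE]
    return statistics.mean(peer_scores)

def agreement_limits(pairs):
    """Bland-Altman style summary: mean difference between paired scores,
    with limits of agreement at mean difference +/- 5% of the maximum score
    (the study's choice, in place of the usual +/- 2 SD)."""
    diffs = [a - b for a, b in pairs]
    mean_diff = statistics.mean(diffs)
    half_width = 0.05 * MAX_SCORE  # 4.25 marks
    return mean_diff - half_width, mean_diff, mean_diff + half_width

# Hypothetical group: tutor scored 70, so the two ceiling scores are dropped.
print(trimmed_peer_average([85, 85, 78, 80, 76], tutor_score=70))  # -> 78
lower, mean_diff, upper = agreement_limits([(68, 76), (70, 75), (65, 74)])
```

A pair falling outside (lower, upper) counts as disagreement beyond the tolerated 5% band, which is how the 'considerable proportion of plotted scores well outside the levels of agreement' reading is reached.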


Results

Self-assessment

Demographic variables. Multilevel regression analysis was used to explore the relationship between demographic variables and self-assessment scores. Of four factors incorporated into the model (age, gender, primary undergraduate degree, and repeat student status), only primary degree was statistically significant in explaining the variance in self-assessment scores. Students with an arts, commerce, music, education or law degree on admission to the MBBS Program were significantly more likely to have higher initial self-assessment scores than others, while those with pure sciences or therapies degrees marked themselves significantly lower (t = 2.89; p = 0.05). This distinction was less noticeable in the second marking episode.

Self-efficacy. Initial self-efficacy for self-regulation was moderate to high, with a mean of 23.85 (out of 30) and a standard deviation of 3.18, while initial self-efficacy for academic achievement also showed relatively elevated levels (mean = 19.51 out of a possible 25; standard deviation = 2.51). In order to explore the relationship between self-assessment scores and self-efficacy, a multiple linear regression analysis was undertaken. Only initial self-efficacy for self-regulation was statistically significant in explaining the variance in self-assessment scores (t = 3.85, p = 0.001).

Comparison of means. Direct comparison of the self-assessment mean with the tutor score revealed consistent under-marking of students' own work, as shown in Table 1. Paired t-tests were undertaken to determine whether statistically significant directional differences existed for each marking episode. In each marking episode, the students scored themselves significantly lower than their tutor (t = 5.27 to 8.10; p < 0.001). Analysis of qualitative data indicated students were concerned about lack of objectivity. One student commented on their struggle to remain impartial: 'I find it difficult to undertake self assessment—mainly because I feel that my perception of my performance may be inaccurate due to bias or distorted perceptions'.

Inter-rater agreement. To explore levels of agreement between scores for self-assessment and other tutorial-based scores, two analyses were undertaken: (a) correlation, to test for the strength and direction of linear relationships between scores; and (b) Bland–Altman plots, to graphically represent scores obtained from self- and tutor assessment.

(a) Pearson correlation coefficients were computed for self and peer-averaged, and self and tutor assessment scores derived from both marking episodes (see Table 2). Despite reaching statistical significance, the correlation between self and peer-averaged scores was low-to-moderate (r = 0.30–0.32). A slightly stronger correlation was observed for self and tutor scores (r = 0.31–0.41). Considerable variability existed between tutorial groups. When groups were analysed for score correlation separately, across both marking episodes, six tutorial groups showed very high levels of marking agreement, with self and tutor score correlation coefficients ranging from 0.74 to 0.92. Other groups showed low correlations.

(b) Bland–Altman plots charted the difference between tutor and self-assessment scores against the average of these scores. Plots showed poor accuracy of self- versus tutor assessment at both times 1 and 2, with a considerable proportion of plotted scores well outside the limits of agreement (mean ± 5% of the maximum score). Figure 3 shows a Bland–Altman plot for self- versus tutor assessment at time 2. The mean difference of –7.59 (time 2) highlights both the considerable under-marking of self compared to tutor scores and the lack of accuracy. The standard deviation was quite large (11.70 at time 2), indicating a wide spread of scores about the mean.
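A Bland–Altman analysis of this kind reduces to two quantities: the bias (mean difference between the two raters) and the limits of agreement around it. The sketch below uses the conventional bias ± 1.96 SD limits on hypothetical scores; the study itself used a band of ±5% of the maximum score, and the data are illustrative only:

```python
from statistics import mean, stdev

def bland_altman(scores_a, scores_b):
    """Bias (mean difference) and conventional 95% limits of agreement
    (bias +/- 1.96 SD of the differences) for two raters scoring the
    same presentations."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    bias = mean(diffs)
    sd = stdev(diffs)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

# Hypothetical self vs tutor scores (not study data):
self_marks = [52, 60, 47, 66, 58, 63]
tutor_marks = [61, 66, 58, 70, 67, 74]
bias, lower, upper = bland_altman(self_marks, tutor_marks)  # bias < 0: under-marking
```

A negative bias with wide limits corresponds to the pattern reported here: systematic under-marking of self combined with a wide spread of scores about the mean difference.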

Peer-assessment

Demographic variables. Analysis of variance demonstrated no significant differences between the marks awarded to peers based on the presenting student's gender, age, or primary degree. There was evidence, however, of a trend towards higher scores being awarded to older male students in the groups, although this failed to reach statistical significance (F = 3.12; df = 12; p = 0.095).

Table 2. Correlation between pairs of scores obtained from self-, peer-, and tutor assessment. Peer-averaged scores have been used. Cronbach's alpha for all peer-averaged scores across 13 tutorial groups = 0.77 in the first marking episode and 0.66 in the second marking episode.

Marking episode   Paired scores                    Number of paired responses   Pearson correlation coefficient   p value (2-tailed)
1                 Self and tutor scores            85                           0.41                              <0.001
1                 Self and peer-averaged scores    108                          0.32                              <0.001
1                 Tutor and peer-averaged scores   89                           0.40                              <0.001
2                 Self and tutor scores            66                           0.31                              0.012
2                 Self and peer-averaged scores    82                           0.30                              0.007
2                 Tutor and peer-averaged scores   70                           0.60                              <0.001
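The Cronbach's alpha values reported with Table 2 treat each peer rater as an 'item' scoring the same set of presentations, with alpha = k/(k−1) × (1 − Σ item variances / variance of totals). A minimal sketch with hypothetical ratings (not study data; three raters rather than a full tutorial group):

```python
from statistics import pvariance

def cronbach_alpha(ratings_by_rater):
    """Internal-consistency alpha, treating each peer rater as an 'item'.
    ratings_by_rater: one list per rater, all scoring the same presentations."""
    k = len(ratings_by_rater)
    totals = [sum(column) for column in zip(*ratings_by_rater)]  # per-presentation totals
    item_variance_sum = sum(pvariance(rater) for rater in ratings_by_rater)
    return k / (k - 1) * (1 - item_variance_sum / pvariance(totals))

# Hypothetical: 3 peer raters each scoring 5 presentations (not study data)
raters = [[70, 62, 75, 58, 66],
          [72, 60, 78, 55, 68],
          [69, 64, 74, 60, 65]]
alpha = cronbach_alpha(raters)  # high when raters rank presentations similarly
```

Raters who order the presentations similarly drive the total-score variance up relative to the per-rater variances, which is what pushes alpha towards 1.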

e128

An analysis of peer, self, and tutor assessment

Comparison of means. Direct comparison of the peer-averaged mean with the tutor score revealed consistent over-marking by peers (see Table 1). Paired t-tests were undertaken to determine whether statistically significant directional differences existed for each marking episode. In each marking episode, the mean of the peer scores for each student presentation was significantly higher than the score awarded by their tutor (t = 3.71 to 4.14; p < 0.001).

Inter-rater agreement. To explore levels of agreement between scores for peer-averaged assessment and other tutorial-based scores, two analyses were undertaken: (a) correlation, to test for the strength and direction of linear relationships between scores; and (b) Bland–Altman plots, to graphically represent scores obtained from peer-averaged and tutor assessment.

(a) Table 2 presents Pearson correlation coefficients for tutor and peer-averaged scores generated from both marking episodes. At best, moderate correlations were demonstrated initially for tutor and peer-averaged scores (r = 0.40), with some improvement over time (r = 0.60). These data support the acceptable reliability of the assessment instrument, subject to the limitations of the use of tutor assessment as the benchmark.
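The coefficients in Table 2 are product-moment correlations between two sets of scores over the same presentations. A self-contained sketch with hypothetical tutor and peer-averaged marks (illustrative only, not study data):

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two paired score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    covariance = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return covariance / sqrt(sum((a - mx) ** 2 for a in x) *
                             sum((b - my) ** 2 for b in y))

# Hypothetical tutor vs peer-averaged scores (not study data):
tutor = [60, 72, 55, 68, 63, 70]
peer = [66, 75, 62, 70, 69, 74]
r = pearson_r(tutor, peer)  # peers over-mark, but rank presentations like the tutor
```

Note that a constant offset (peers uniformly over-marking) leaves r untouched, which is why over-marking peers can still correlate highly with the tutor while under-marking selves do not necessarily do so.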

When tutorial groups were analysed for score correlation separately, seven of the thirteen groups were capable of very high levels of marking agreement, with correlation coefficients ranging from 0.76 to 0.96. Qualitative data indicated that the majority of these groups were very supportive of, and committed to, the peer-assessment process.

Comments such as: ‘

appropriately, as I think we will need to be able to assess our peers’ performance, as well as our own, throughout our careers’ were given by some enthusiastic respondents. Scores obtained from other tutorial groups were in substantially less agreement. Most of these group members expressed negative views about peer-assessment related specifically to potential for bias. The effect of omission of highly skewed results (as discussed earlier) on correlation was briefly explored. Data editing was found to improve the peer- tutor correlation from 0.32 to 0.40 in the first marking episode.

(b) Bland Altman plots charted the difference between tutor and peer-averaged scores against the average of these scores. A moderate level of agreement between peer- averaged and tutor assessment at both time 1 and 2 were shown, with a considerable proportion of plotted scores within the levels of agreement (mean 5% of the maximum score). The mean differences of 3.75 (time 2) highlight both the over-marking of peers compared to tutor scores and improved accuracy of peer-averaged scores compared to scores derived from self-assessment (see Figure 4).

good to learn how to do this

Variability between tutorial groups. Differences between the averaged peer-assessment scores of all 13 groups were explored using multivariate analysis of variance. Statistically significant variability was evident for the first marking episode only (F = 2.09; df = 12; p = 0.028). The effect size, as measured by partial eta squared, was 0.26. Figure 5 illustrates this variability, through box plots, for the 13 tutorial groups on the 0–85 scale of the peer-assessment instrument. Analysis of variance demonstrated five tutorial groups had reliably recorded significantly lower peer-averaged scores, while three groups had consistently scored group members more favourably.

Figure 5. Box plots of averaged peer-assessment scores for thirteen tutorial groups with results obtained from the first marking episode.

Comparison between the three sources of assessment scores (self-, peer-, and tutor) revealed a small number of groups consistently reporting high student-generated scores in the absence of high tutor scores. Based on tutor assessment as the benchmark, members of these tutorial groups must be awarding peers overly generous marks. Qualitative data suggest this may relate to friendship marking. A strong reaction to peer-assessment was the widespread perception that this process could be corrupted by bias due to friendship marking, fear of 'tit-for-tat' scoring, or lack of honesty. The following comments sum up the attitude among many study participants: '(It is) hard to criticise friends' and 'Relationships between students can colour opinions'. Some students expressed casual and/or sceptical attitudes, scoring 100% for each student in their group regardless of the quality of the work to be judged. Frequent comments such as 'Not taken too seriously' and 'Not too much thought goes into the marking' reflect a cynical attitude to the peer-assessment process.
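Partial eta squared, the effect size reported for the group comparison, is the between-group sum of squares divided by the sum of the between- and within-group sums of squares. A sketch for a one-way layout with hypothetical group scores (not study data; only three groups shown for brevity):

```python
def partial_eta_squared(groups):
    """One-way ANOVA effect size: SS_between / (SS_between + SS_within).
    groups: list of score lists, one per tutorial group."""
    all_scores = [s for g in groups for s in g]
    grand_mean = sum(all_scores) / len(all_scores)
    group_means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    ss_within = sum((s - m) ** 2
                    for g, m in zip(groups, group_means) for s in g)
    return ss_between / (ss_between + ss_within)

# Hypothetical peer-averaged scores for three tutorial groups (not study data):
groups = [[60, 62, 58, 61], [70, 72, 69, 71], [65, 63, 66, 64]]
eta_p2 = partial_eta_squared(groups)  # between 0 and 1; larger = bigger group effect
```

Values near 0 mean the groups score alike; the study's 0.26 indicates a substantial share of score variance attributable to group membership.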

Discussion and conclusion

Self-assessment appears to be a less accurate means of evaluating student performance in PBL tutorials than peer-assessment. The heterogeneity of the group was evident in the many outliers in the data, underscoring the low accuracy of self-assessment among this sample of students. Given the variability observed between the scores obtained from the 13 tutorial groups, correlations between self-assessment scores and those generated from tutor and peer-assessment were disappointingly low. Some groups showed evidence of greater accuracy, and some students within groups


were also more accurate than others in their self-assessment. These findings are in accord with the general consensus in medical education that self-assessment of tutorial processes in PBL is an inexact measure (Rezler 1989; Gordon 1991; Das et al. 1998; Sullivan et al. 1999; Reiter et al. 2002). Students clearly under-marked themselves, particularly those with pure science or therapies degrees. Other authors have shown that young or highly capable students are more likely to under-mark their work (Stefani 1992; Rudy et al. 2001; Edwards et al. 2003; Fitzgerald et al. 2003). Analysis of qualitative data suggests that students struggled to find a balance between confidence in their performance, self-awareness, and humility. Objectivity was also a major cause of concern.

Self-efficacy was correlated with self-assessment. Students awarding themselves higher marks were more likely to have stronger self-efficacy for self-regulation. Self-assessment scores would be influenced by many factors, but confidence in one's ability to do well would be expected to provide incentive to award oneself higher marks on PBL performance. Positive collaborative learning behaviour has been shown to be related to learning self-efficacy (McLoughlin & Luca 2004).

Peer-assessment offers a greater likelihood of providing accurate alternate forms of assessment within the PBL tutorial environment. Correlations between tutor and peer-averaged scores were barely moderate at first, then improved with continued practice in peer-assessment. Some groups achieved very high correlation (up to 0.96) between tutor scores and peer-averaged scores. The use of peer-averaged rather than individual scores may help to account for the apparently improved accuracy compared to self-assessment. Nevertheless, some students, and some groups of students, were able to judge the performance of their peers in PBL tutorials with precision.
The removal of highly skewed results prior to the generation of peer-averaged scores improved the correlation between peer-averaged and tutor scores. Qualitative data indicated that some students were treating the peer-assessment process with casual and/or sceptical attitudes. By removing scores known to be deliberately distorted, it was anticipated that the peer-averaged scores would more appropriately reflect the genuine abilities of responsible students to assess their peers. Data analysis indicated that this was the case, with a small improvement in the correlation of tutor and peer-averaged scores in the first marking episode once highly skewed scores were omitted.

The implementation of peer-assessment in any setting is likely to lead to initial scepticism and doubt about its value and validity. However, through repeated exposure to, and practice in, peer-assessment, such perceptions should be moderated (Sluijmans et al. 1999; Ballantyne et al. 2002), and highly skewed results would be expected to decrease in frequency. The improved correlation between peer-averaged and tutor assessment at time 2 (compared to time 1) lends support to this supposition. Results from other studies of peer-assessment of processes in PBL (or small group) tutorials show variable correlations between staff and peer scores. Correlations range from very low (Reiter et al. 2002) to high (Magin 2001).


In keeping with the findings of Rudy et al. (2001), students were over-generous in their marking of peers. The consistent under-marking of self combined with over-marking by peers helps to account for the low correlation found for self and peer-assessment. Qualitative results showed the potential for inflated estimates of peer performance resulting from friendship marking. Friendship marking has been reported by other researchers as biasing peer-assessment responses (Pond & ul-Haq 1997).

This study has the capacity to make a contribution to knowledge in the area of peer- and self-assessment in PBL tutorials. The study incorporated strengths in four main areas. These were: (1) focus on the learning process in PBL tutorials; (2) duration of the program; (3) congruence with PBL philosophy; and (4) triangulation in data collection.

(1) The focus on learning process is an important strength of the study design as it enabled insights into learning processes undertaken by students. Bereiter and Scardamalia (2000) call for greater research into PBL processes using reflective action research.

(2) The moderate duration of the program (exceeding six months) leads to greater confidence in study findings. Loss of participants over time (16% of participants) was not excessive given the time pressures experienced by students.

(3) Congruence with PBL philosophy allowed the study to complement the existing structure of PBL tutorials. By supporting the practice of self-assessment within PBL, the study upheld a SDL emphasis. Collaboration was enhanced through the use of collaborative assessment formats. The reflective component of the intervention built upon the review phase of the PBL learning cycle.

(4) Methodological triangulation was achieved by combining quantitative and qualitative approaches to study design. While qualitative inquiry helped to confirm theory emerging from student perspectives, quantitative inquiry enabled a set of statistical relationships to be uncovered.

Results should be interpreted within the context of potential limitations, including non-probability sampling, a relatively small sample size, subjective scoring of test items, and the different ways students and tutors interpret and apply assessment criteria. With regard to assessment of PBL processes such as communication and respect, no real benchmark exists (Ward et al. 2002). This casts some doubt on the validity of expert assessment in this domain.

Peer- and self-assessment within the tutorial setting has an important role to play through its reinforcement of the educational goals and instructional principles of problem-based learning (Nendaz & Tekian 1999; Segers & Dochy 2001). There is evidence that tutorial-based assessment may also reduce the overwhelming reliance on formal grading of students, which encourages competition rather than collaboration (Eva 2001).

First-year medical students in this study demonstrated poor ability to self-assess their performance in PBL, which included the creation and presentation of a case summary. Normally a private process, self-assessments conducted publicly require students to balance unrealistic goals and perceptions, assessment anxiety, and 'social norms about self-aggrandizement' (Chaves et al. 2006, p. 30). This makes it unlikely that self-assessment accuracy in medical education is achievable (Eva & Regehr 2005). Nevertheless, practice in self-assessment should be integrated into existing programs of medical education. As Eva and Regehr (2005) emphasise, self-assessment is a means of identifying one's strengths and weaknesses to guide goal setting and enhance self-efficacy. This study has shown that self-assessment is not an accurate measure, in line with the majority of research in this domain. Nevertheless, it has an important role to play in supporting the development of skills in reflection and self-awareness. Self-assessment needs to be viewed from a 'self-improvement perspective' (Eva & Regehr 2005, p. S52). Further qualitative research needs to be conducted to better understand students' apparent inability to self-assess accurately within collaborative small group learning environments.

Peer-assessment provides a valuable opportunity for tutorial-based assessment. The act of evaluating the performance of professional peers has long been central to the referral process in medicine (Norcini 2003). Skills gained through peer-assessment activities may transfer to self-assessment tasks and enable learners to compare their self-assessment with the assessments of others (Searby & Ewers 1997; Dochy et al. 1999). Feedback from peers has the potential to assist learners to develop more accurate impressions of themselves and their abilities (Eva & Regehr 2005). A fundamental part of the PBL process is the capacity of students to embrace their responsibilities as active members of a group of learners. These may include collaborative and self-assessment practices which have the potential to enhance reflection and self-awareness.

Notes on contributors

TRACEY PAPINCZAK is completing her PhD in medical education within the School of Medicine, The University of Queensland.

DR LOUISE YOUNG is a senior lecturer in the School of Medicine at The University of Queensland and is currently Deputy Director of the University’s Centre for Medical Education.

ASSOCIATE PROFESSOR MICHELE GROVES is Deputy Head of School and Director of Medical Studies in the School of Medicine, Griffith University, Queensland.

DR MICHELE HAYNES works at The University of Queensland’s Social Research Centre as Statistical Advisor and lectures in the School of Social Science.

References

Arnold L, Willoughby TL, Calkins EV. 1985. Self-evaluation in undergraduate medical education: a longitudinal perspective. J Med Edu 60:21–28.
Ballantyne R, Hughes K, Mylonas A. 2002. Developing procedures for implementing peer assessment in large classes using an action research process. Assess Eval Higher Edu 27:427–441.
Bandura A. 1986. Social Foundations of Thought and Action (Englewood Cliffs, NJ, Prentice Hall).
Bandura A. 1993. Perceived self-efficacy in cognitive development and functioning. Edu Psychologist 28:117–148.
Barrows HS, Tamblyn RM. 1980. Problem-Based Learning: an Approach to Medical Education (New York, Springer).
Bereiter C, Scardamalia M. 2000. Process and product in problem-based learning research. In: DH Evenson & CE Hmelo (Eds), Problem-Based Learning: a Research Perspective on Learning Interactions (Mahwah, NJ, Lawrence Erlbaum Associates).
Bland MJ, Altman DG. 1986. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet i:307–311.
Boud D. 1995. Enhancing Learning Through Self Assessment (London, Kogan Page).
Burnett W, Cavaye G. 1980. Peer assessment by fifth year students of surgery. Assess Higher Edu 5:273–278.
Chaves JF, Baker CM, Chaves JA, Fisher ML. 2006. Self, peer and tutor assessments of MSN competencies using the PBL-Evaluator. J Nurs Edu 45:25–31.
Cunnington J. 2001. Evolution of student evaluation in the McMaster MD programme. Pedagogue 10:1–9.
Das M, Mpofu D, Dunn E, Lanphear JH. 1998. Self and tutor evaluations in problem-based learning tutorials: is there a relationship? Med Edu 32:411–418.
Dochy F, Segers M, Sluijmans D. 1999. The use of self-, peer-, and co-assessment in higher education. Studies in Higher Edu 24:331–350.
Dodds AE, Orsmond RH, Elliott SL. 2001. Assessment in problem-based learning: The role of the tutor. Annal Acad Med Singapore 30:366–370.
Dolmans DH, Schmidt HG. 1996. The advantages of problem-based curricula. Postgraduate Med J 72:535–538.
Edwards RK, Kellner KR, Sistrom CL, Magyari EJ. 2003. Medical student self-assessment of performance on an obstetrics and gynaecology clerkship. Am J Obstetrics and Gynaecol 188:1078–1082.
Entwistle NJ, Marton F. 1994. Knowledge objects: Understandings constituted through intensive academic study. Br J Edu Psychol 64:161–178.
Eva KW. 2001. Assessing tutorial-based assessment. Adv Health Sci Edu 6:243–257.
Eva KW, Regehr G. 2005. Self-assessment in the health professions: A reformulation and research agenda. Acad Med 80:S46–S54.
Eva KW, Cunnington JPW, Reiter HI, Keane DR, Norman GR. 2004. How can I know what I don't know? Poor self assessment in a well-defined domain. Adv Health Sci Edu 9:211–224.
Falchikov N, Magin D. 1997. Detecting gender bias in peer marking of students' group process work. Assess Eval Higher Edu 22:385–396.
Fitzgerald JT, White CB, Gruppen LD. 2003. A longitudinal study of self-assessment accuracy. Med Edu 37:645–649.
Gordon MJ. 1991. A review of the validity and accuracy of self-assessments in health professions training. Acad Med 66:762–769.
Guerrero APS. 2001. Mechanistic case diagramming: A tool for problem-based learning. Acad Med 76:385–389.
Hay PJ, Katsikitis M. 2001. The 'expert' in problem-based and case-based learning: Necessary or not? Med Edu 35:22–28.
Kelson ACM, Distlehorst LH. 2000. Groups in problem-based learning (PBL): Essential elements in theory and practice. In: DH Evenson & CE Hmelo (Eds), Problem-Based Learning: a Research Perspective on Learning Interactions (Mahwah, NJ, Lawrence Erlbaum Associates).
Kemmis S, Wilkinson M. 1998. Participatory action research and the study of practice. In: B Atweh, S Kemmis & P Weeks (Eds), Action Research in Practice: Partnerships for Social Justice in Education, pp. 21–36 (London, Routledge).
Lejk M, Wyvill M. 2001. The effect of the inclusion of self-assessment with peer-assessment of contributions to a group project: a quantitative study of secret and agreed assessments. Assess Eval Higher Edu 26:551–561.
Magin DJ. 2001. A novel technique for comparing the reliability of multiple peer assessments with that of a single teacher assessment of group process work. Assess Eval Higher Edu 26:139–152.
Margetson D. 1994. Current educational reform and the significance of problem-based learning. Stud Higher Edu 19:5–19.
Mattheos N, Nattestad A, Falk-Nilsson E, Attstrom R. 2004. The interactive examination: assessing students' self-assessment ability. Med Edu 38:378–389.
McLoughlin C, Luca J. 2004. An investigation of the motivational aspects of peer and self assessment tasks to enhance teamwork outcomes. Paper presented at the Proceedings of the 21st ASCILITE Conference, Perth, 5–8 December.
Miller PJ. 2003. The effect of scoring criteria specificity on peer and self assessment. Assess Eval Higher Edu 28:383–394.
Minion DJ, Donnelly MB, Quick RC, Pulito A, Schwartz R. 2002. Are multiple objective measures of student performance necessary? Am J Surg 183:663–665.
Nendaz MR, Tekian A. 1999. Assessment in problem-based learning medical schools: a literature review. Teach Learn Med 11:232–243.
Neville AJ. 1998. The tutor in small-group problem-based learning: Teacher? Facilitator? Evaluator? Pedagogue 8:1–9.
Nichols JD, Steffi BE. 1999. An evaluation of success in an alternative learning programme: motivational impact versus completion rate. Edu Rev 51:207–219.
Norcini JJ. 2003. The metric of medical education. Peer assessment of competence. Med Edu 37:539–543.
Norman G, Schmidt HG. 1992. The psychological basis of problem-based learning: a review of the evidence. Acad Med 67:557–565.
Orsmond P, Merry S, Reiling K. 2000. The use of student derived marking criteria in peer- and self-assessment. Assess Eval Higher Edu 25:23–38.
Pajares F. 1996. Role of self-efficacy beliefs in the mathematical problem-solving of gifted students. Contemporary Edu Psychol 21:325–344.
Papinczak T, Young L, Groves M. 2007. Peer-assessment in problem-based learning: A qualitative study. Adv Health Sci Edu 12:169–186.
Pond K, ul-Haq R. 1997. Learning to assess students using peer review. Stud Edu Eval 23:331–348.
Pope NK. 2005. The impact of stress in self- and peer-assessment. Assess Eval Higher Edu 30:51–63.
Pressley M, McCormick CB. 1995. Advanced Educational Psychology for Educators, Researchers and Policymakers (New York, Harper Collins College Publishers).
Reiter HI, Eva KW, Hatala RM, Norman GR. 2002. Self and peer assessment in tutorials: Application of a relative-ranking model. Acad Med 77:1134–1139.
Rezler AG. 1989. Self-assessment in problem-based groups. Med Teach 11:151–156.
Rudy DW, Fejfar MC, Griffith CH, Wilson JF. 2001. Self and peer assessment in a first-year communication and interviewing course. Eval Health Profess 24:436–445.
Schwarzer R, Jerusalem M. 1995. Generalized Self-efficacy Scale (Windsor, UK, Nfer-Nelson).
Searby M, Ewers T. 1997. An evaluation of the use of peer assessment in higher education: A case study in the school of music. Assess Eval Higher Edu 22:371–383.
Segers M, Dochy F. 2001. New assessment forms in problem-based learning: The value-added of the students' perspective. Stud Higher Edu 26:327–343.
Sluijmans D, Dochy F, Moerkerke G. 1999. Creating a learning environment by using self-, peer-, and co-assessment. Learn Environ Res 1:293–319.
Sluijmans DMA, Moerkerke G, van Merrienboer JJG, Dochy FJRC. 2001. Peer assessment in problem-based learning. Stud Edu Eval 27:153–173.
Stefani LAJ. 1992. Comparison of collaborative self, peer and tutor assessment in a biochemistry practical. Biochem Edu 20:148–151.
Sullivan ME, Hitchcock MA, Dunnington GL. 1999. Peer and self assessment during problem-based tutorials. Am J Surg 177:266–269.
Swanson DB, Case SM, van der Vleuten CPM. 1997. Strategies for student assessment. In: D Boud & G Feletti (Eds), The Challenge of Problem-Based Learning, pp. 269–282 (London, Kogan Page).
Tousignant M, DesMarchais JE. 2002. Accuracy of student self-assessment ability compared to their own performance in a problem-based learning medical program: a correlation study. Adv Health Sci Edu 7:19–27.
Van Rosendaal GMA, Jennett PA. 1992. Resistance to peer evaluation in an internal medicine residency. Acad Med 67:63.
Ward M, Gruppen L, Regehr G. 2002. Measuring self-assessment: current state of the art. Adv Health Sci Edu 7:63–80.
Whitfield CF, Xie SX. 2002. Correlation of problem-based learning facilitators' scores with student performance on written exams. Adv Health Sci Edu Theory Pract 7:41–51.
Willis SC, Jones A, Bundy C, Burdett K, Whitehouse CR, O'Neill PA. 2002. Small-group work and assessment in a PBL curriculum: a qualitative and quantitative evaluation of student perceptions of the process of working in small groups and its assessment. Med Teacher 24:495–501.
Woolliscroft JO, Tenhaken J, Smith J, Calhoun JG. 1993. Medical students' clinical self-assessments: comparisons with external measures of performance and the students' self-assessments of overall performance and effort. Acad Med 68:285–294.
Zimmerman BJ. 2000. Self-efficacy: an essential motive to learn. Contemp Edu Psychol 25:82–91.
Zimmerman BJ, Bandura A, Martinez-Pons M. 1992. Self-motivation for academic attainment: the role of self-efficacy beliefs and personal goal setting. Am Edu Res J 29:663–676.