DIGITAL LIBRARY
THE INTERRATER RELIABILITY OF A SUMMATIVE OBJECTIVE STRUCTURED CLINICAL EXAMINATION IN MEASURING THE HISTORY TAKING SKILL: LESSON LEARNED FROM ATCS CENTER
University of Medicine and Pharmacy (VIETNAM)
About this paper:
Appears in: INTED2021 Proceedings
Publication year: 2021
Pages: 5412-5416
ISBN: 978-84-09-27666-0
ISSN: 2340-1079
doi: 10.21125/inted.2021.1101
Conference name: 15th International Technology, Education and Development Conference
Dates: 8-9 March, 2021
Location: Online Conference
Abstract:
Introduction:
The objective structured clinical examination (OSCE) is a performance-based examination that can be used to assess the history-taking skill of medical students. Although several features of the OSCE contribute to its reliability, errors introduced into the observed scores diminish it. The reliability of an OSCE depends on how the OSCE is used and has never been reported in Vietnamese medical education programs. The objective of this study was to determine the inter-rater reliability of a summative OSCE in measuring the history-taking skill of second-year medical students (MS-2).

Methods:
The examiners met before the administration of the summative OSCE to standardize procedures and to validate the checklists and global rating scales. A seven-station summative OSCE was administered to all 388 MS-2 to assess their performance, including history-taking skill. Examiners were instructed not to communicate during the exam, to maintain the independence of their scoring, and assessed students' performance via an audio-visual system. Two weeks later, another core-group educator used the same checklist and the recorded videos to re-score the performance of 59 borderline examinees at the history-taking station. To determine inter-rater reliability in the scoring of these 59 examinees, intraclass correlation coefficient (ICC) estimates and their 95% confidence intervals were calculated with the SPSS statistical package version 20, based on a single-measurement, consistency-agreement, two-way random-effects model. Cohen's kappa statistic was also used to test inter-rater reliability of pass/fail decisions based on the mean checklist score of borderline examinees.
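The paper computed these statistics in SPSS. As a minimal sketch, the same two estimators can be reproduced from first principles; the function and variable names below are illustrative, the score matrix is hypothetical, and the ICC is shown in its absolute-agreement form ICC(2,1) (the consistency form drops the rater-variance term from the denominator):

```python
import numpy as np

def icc2_1(x):
    """ICC(2,1): two-way random-effects, single-measurement, absolute agreement.

    x is an (n_subjects, k_raters) matrix of scores."""
    x = np.asarray(x, dtype=float)
    n, k = x.shape
    grand = x.mean()
    # Two-way ANOVA decomposition of the total sum of squares
    ss_total = ((x - grand) ** 2).sum()
    ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()   # between subjects
    ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()   # between raters
    ss_err = ss_total - ss_rows - ss_cols                 # residual
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

def cohens_kappa(a, b):
    """Cohen's kappa for two raters' categorical (e.g. pass/fail) decisions."""
    a, b = np.asarray(a), np.asarray(b)
    po = (a == b).mean()                                   # observed agreement
    pe = sum((a == c).mean() * (b == c).mean()             # chance agreement
             for c in np.union1d(a, b))
    return (po - pe) / (1 - pe)

# Hypothetical data: 5 borderline examinees scored by 2 raters
scores = np.array([[7.0, 5.5], [6.0, 6.5], [8.0, 6.0], [5.5, 7.0], [6.5, 5.0]])
print(icc2_1(scores))
print(cohens_kappa([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```

The 95% confidence interval reported in the paper comes from the F-distributions of these same mean squares, which SPSS provides directly.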

Results:
The ICC was 0.21 with a 95% confidence interval of -0.05 to 0.44, showing a poor level of agreement among faculty in assessing the history-taking skill of MS-2. A kappa of 0.15 likewise indicated no agreement in making pass/fail decisions. Only 4 of 17 items (23.53%) on the rating scale had a moderate to good level of agreement, the highest value being 0.77 (0.64 - 0.86).

Discussion:
The low ICC and kappa values indicated little agreement between raters in scoring the history-taking skill of MS-2 and in making pass/fail decisions. A negative ICC estimate would mean that the process, whatever it is, that brings rater pairs together makes them less similar than any two examiners chosen at random from the whole population. The low proportion of checklist items with acceptable agreement implied unclear instructions and an impractical rating scale. A higher-than-anticipated failure rate may be explained by an inappropriate standard having been set for the station; by technical problems, such as ambiguity in the instructions or difficulty with the performance of the standardized patient (SP); by the station not being an appropriate assessment of the expected learning outcomes; or by deficiencies in the teaching and the training program.

Conclusion:
In the process of developing and implementing an OSCE to assess clinical competence, ours proved a valuable but not reliable tool for assessing history-taking skill because of inadequate preparation. Our future OSCEs should allow sufficient time for preparation and strictly follow the steps necessary to deliver an OSCE successfully, especially the proposals for each station. Further investigations should be conducted to assess other forms of reliability, such as stability reliability, alternate-form reliability, and internal consistency reliability.
Keywords:
ATCS Center, UMP HCMC, OSCE, History-Taking Skill, Inter-rater Reliability, Borderline Group Method.