A Multi-institutional Study of the Feasibility and Reliability of the Implementation of Constructed Response Exam Questions

Teach Learn Med. 2023 Oct-Dec;35(5):609-622. doi: 10.1080/10401334.2022.2111571. Epub 2022 Aug 20.

Abstract

Problem: Some medical schools have incorporated constructed response short answer questions (CR-SAQs) into their assessment toolkits. Although CR-SAQs carry benefits for medical students and educators, the faculty perception that creating and scoring CR-SAQs requires an infeasible amount of time, together with concerns about scoring reliability, may impede the use of this assessment type in medical education.

Intervention: Three US medical schools collaborated to write and score CR-SAQs based on a single vignette. Study participants included faculty question writers (N = 5) and three groups of scorers: faculty content experts (N = 7), faculty non-content experts (N = 6), and fourth-year medical students (N = 7). Structured interviews were conducted with question writers, and an online survey was administered to scorers to gather information about their processes for creating and scoring CR-SAQs. A content analysis was performed on the qualitative data using Bowen's model of feasibility as a framework. To examine inter-rater reliability between the content experts and the other scorers, a random selection of fifty student responses from each site was scored by each site's faculty content experts, faculty non-content experts, and student scorers. A holistic rubric (6-point Likert scale) was used by two schools and an analytic rubric (3- to 4-point checklist) was used by one school. Cohen's weighted kappa (κw) was used to evaluate inter-rater reliability.
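
The abstract does not state the weighting scheme used for κw. As an illustrative sketch only, the snippet below shows one common way to compute Cohen's weighted kappa between two scorers with scikit-learn, assuming linear weights and a 6-point holistic rubric; the scorer roles and ratings shown are hypothetical, not study data.

    # Minimal sketch: Cohen's weighted kappa between two scorers.
    # Assumptions (not from the study): linear weights, a 6-point holistic
    # rubric coded 1-6, and hypothetical example ratings.
    from sklearn.metrics import cohen_kappa_score

    # Hypothetical ratings assigned by a faculty content expert and a
    # student scorer to the same set of constructed-response answers.
    content_expert = [5, 4, 6, 3, 2, 5, 4, 6, 1, 3]
    student_scorer = [5, 3, 6, 3, 2, 4, 4, 5, 1, 3]

    kappa_w = cohen_kappa_score(content_expert, student_scorer, weights="linear")
    print(f"Weighted kappa (linear): {kappa_w:.2f}")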

Context: This research study was implemented at three US medical schools that are nationally dispersed and have been administering CR-SAQ summative exams as part of their programs of assessment for at least five years. The study exam question was included in an end-of-course summative exam during the first year of medical school.

Impact: Five question writers (100%) participated in the interviews and twelve scorers (60% response rate) completed the survey. Qualitative comments revealed three aspects of feasibility: practicality (time, institutional culture, teamwork), implementation (steps in the question writing and scoring process), and adaptation (feedback, rubric adjustment, continuous quality improvement). The scorers described their experience in terms of the need for outside resources, concern about their lack of expertise, and the value gained through scoring. Inter-rater reliability between the faculty content expert and student scorers was fair/moderate (κw = .34-.53, holistic rubrics) or substantial (κw = .67-.76, analytic rubric), but much lower between faculty content and non-content experts (κw = .18-.29, holistic rubrics; κw = .59-.66, analytic rubric).

Lessons learned: Our findings show that, from the faculty perspective, it is feasible to include CR-SAQs in summative exams, and we provide practical information for medical educators creating and scoring CR-SAQs. We also learned that CR-SAQs can be reliably scored by faculty without content expertise or by senior medical students using an analytic rubric, or by senior medical students using a holistic rubric, which provides options for alleviating the faculty burden associated with grading CR-SAQs.

Keywords: constructed response; feasibility; inter-rater reliability; short answer questions; undergraduate medical education.

Publication types

  • Multicenter Study

MeSH terms

  • Educational Measurement*
  • Feasibility Studies
  • Humans
  • Learning
  • Reproducibility of Results
  • Students, Medical*