ChatGPT4 Outperforms Endoscopists for Determination of Postcolonoscopy Rescreening and Surveillance Recommendations

Patrick W Chang; Maziar M Amini; Rio O Davis; Denis D Nguyen; Jennifer L Dodge; Helen Lee; Sarah Sheibani; Jennifer Phan; James L Buxbaum; Ara B Sahakian

doi:10.1016/j.cgh.2024.04.022

Clin Gastroenterol Hepatol. 2024 May 9:S1542-3565(24)00429-4. doi: 10.1016/j.cgh.2024.04.022. Online ahead of print.

Affiliations

¹ Division of Gastrointestinal and Liver Diseases, University of Southern California, Los Angeles, California.
² Division of Gastrointestinal and Liver Diseases, University of Southern California, Los Angeles, California; Department of Population and Public Health Sciences, University of Southern California, Los Angeles, California.
³ Division of Gastrointestinal and Liver Diseases, University of Southern California, Los Angeles, California. Electronic address: arasahak@med.usc.edu.

PMID: 38729387

Abstract

Background & aims: Large language models including Chat Generative Pretrained Transformers version 4 (ChatGPT4) improve access to artificial intelligence, but their impact on the clinical practice of gastroenterology is undefined. This study compared the accuracy, concordance, and reliability of ChatGPT4 colonoscopy recommendations for colorectal cancer rescreening and surveillance with contemporary guidelines and real-world gastroenterology practice.

Methods: History of present illness, colonoscopy data, and pathology reports from patients undergoing procedures at 2 large academic centers were entered into ChatGPT4, which was then queried for the next recommended colonoscopy follow-up interval. Using the McNemar test and inter-rater reliability, we compared the recommendations made by ChatGPT4 with the actual surveillance interval provided in the endoscopist's procedure report (gastroenterology practice) and with the appropriate US Multisociety Task Force (USMSTF) guidance. The USMSTF reference recommendation was generated for each case by an expert panel using the clinical information and guideline documents as reference.
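The paired comparison described in the Methods can be illustrated with a minimal sketch. It assumes each case is reduced to two yes/no flags (whether ChatGPT4 matched the USMSTF Panel, and whether the endoscopist's report matched it) and uses the exact binomial form of the McNemar test. The abstract does not state which variant of the test was used, so that choice is an assumption, and the function names and toy data below are hypothetical, not study data.

```python
from math import comb

def discordant_counts(a_ok, b_ok):
    """Count the pairs in which exactly one of the two raters matched the reference."""
    only_a = sum(1 for a, b in zip(a_ok, b_ok) if a and not b)
    only_b = sum(1 for a, b in zip(a_ok, b_ok) if b and not a)
    return only_a, only_b

def mcnemar_exact(only_a, only_b):
    """Two-sided exact McNemar p-value (binomial test on the discordant pairs)."""
    n = only_a + only_b
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(only_a, only_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical toy data: True means the recommendation matched the expert panel.
chatgpt_ok = [True, True, True, False, True, True, False, True]
practice_ok = [True, False, True, False, False, True, True, False]

only_chatgpt, only_practice = discordant_counts(chatgpt_ok, practice_ok)
p_value = mcnemar_exact(only_chatgpt, only_practice)
print(only_chatgpt, only_practice, p_value)
```

Only the discordant pairs enter the test: cases where both sources agree or both disagree with the panel carry no information about which source is more accurate.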

Results: Text input of de-identified data into ChatGPT4 from 505 consecutive patients undergoing colonoscopy between January 1 and April 30, 2023, elicited a successful follow-up recommendation in 99.2% of the queries. ChatGPT4 recommendations were in closer agreement with the USMSTF Panel (85.7%) than gastroenterology practice recommendations with the USMSTF Panel (75.4%) (P < .001). Of the 14.3% discordant recommendations between ChatGPT4 and the USMSTF Panel, recommendations were for later screening in 26 (5.1%) and for earlier screening in 44 (8.7%) cases. The inter-rater reliability was good for ChatGPT4 vs USMSTF Panel (Fleiss κ, 0.786; 95% CI, 0.734-0.838; P < .001).
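The Fleiss κ reported above can be computed from per-case category labels. The following is a minimal sketch of the standard Fleiss formula, assuming every case receives the same number of ratings; the function name, the interval labels, and the toy ratings are hypothetical and are not taken from the study.

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of cases, each a list of category labels.

    Every case must carry the same number of ratings (at least 2).
    """
    n_cases = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(r) != n_raters for r in ratings):
        raise ValueError("each case needs the same number (>= 2) of ratings")

    totals = Counter()       # how often each category was assigned overall
    agreement_sum = 0.0      # running sum of per-case agreement P_i
    for case in ratings:
        counts = Counter(case)
        totals.update(counts)
        pairs = sum(c * c for c in counts.values()) - n_raters
        agreement_sum += pairs / (n_raters * (n_raters - 1))

    p_observed = agreement_sum / n_cases
    p_expected = sum((t / (n_cases * n_raters)) ** 2 for t in totals.values())
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical toy data: [ChatGPT4 label, expert panel label] for each case.
toy = [["3y", "3y"], ["10y", "10y"], ["5y", "3y"], ["10y", "10y"]]
print(round(fleiss_kappa(toy), 3))
```

κ corrects the raw agreement rate for agreement expected by chance, which is why it can be lower than the percentage agreement reported alongside it.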

Conclusions: Initial real-world results suggest that ChatGPT4 can define routine colonoscopy screening intervals accurately based on verbatim input of clinical data. Large language models have potential for clinical applications, but further training is needed for broad use.

Keywords: Artificial Intelligence; ChatGPT4; Colorectal Neoplasms; Large Language Model.