Integrating large language models in systematic reviews: a framework and case study using ROBINS-I for risk of bias assessment

Bashar Hasan; Samer Saadi; Noora S Rajjoub; Moustafa Hegazi; Mohammad Al-Kordi; Farah Fleti; Magdoleen Farah; Irbaz B Riaz; Imon Banerjee; Zhen Wang; Mohammad Hassan Murad

doi:10.1136/bmjebm-2023-112597

BMJ Evid Based Med. 2024 Feb 21:bmjebm-2023-112597. doi: 10.1136/bmjebm-2023-112597. Online ahead of print.

Authors

Bashar Hasan¹,², Samer Saadi³,², Noora S Rajjoub³, Moustafa Hegazi³,², Mohammad Al-Kordi³,², Farah Fleti³,², Magdoleen Farah³,², Irbaz B Riaz⁴, Imon Banerjee⁵,⁶, Zhen Wang³,⁷, Mohammad Hassan Murad³,²

Affiliations

¹ Kern Center for the Science of Healthcare Delivery, Mayo Clinic, Rochester, Minnesota, USA. Hasan.Bashar@mayo.edu
² Public Health, Infectious Diseases and Occupational Medicine, Mayo Clinic, Rochester, Minnesota, USA.
³ Kern Center for the Science of Healthcare Delivery, Mayo Clinic, Rochester, Minnesota, USA.
⁴ Division of Hematology-Oncology Department of Medicine, Mayo Clinic, Rochester, Minnesota, USA.
⁵ Department of Radiology, Mayo Clinic Arizona, Scottsdale, Arizona, USA.
⁶ School of Computing and Augmented Intelligence, Arizona State University, Tempe, Arizona, USA.
⁷ Health Care Policy and Research, Mayo Clinic Minnesota, Rochester, Minnesota, USA.

PMID: 38383136

Abstract

Large language models (LLMs) may facilitate and expedite systematic reviews, but how best to integrate them into the review process remains unclear. This study evaluates the agreement between GPT-4 and human reviewers in assessing risk of bias with the Risk Of Bias In Non-randomised Studies of Interventions (ROBINS-I) tool and proposes a framework for integrating LLMs into systematic reviews. In the case study, raw per cent agreement was highest for the ROBINS-I domain 'Classification of Intervention'. The Kendall agreement coefficient was highest for the domains 'Participant Selection', 'Missing Data' and 'Measurement of Outcomes', suggesting moderate agreement in these domains. Raw agreement on the overall risk of bias across domains was 61% (Kendall coefficient=0.35). The proposed framework for integrating LLMs into systematic reviews consists of four domains: rationale for LLM use, protocol (task definition, model selection, prompt engineering, data entry methods, human role and success metrics), execution (iterative revisions to the protocol) and reporting. We identify five basic task types relevant to systematic reviews: selection, extraction, judgement, analysis and narration. Given the level of agreement with a human reviewer observed in the case study, pairing artificial intelligence with an independent human reviewer remains necessary.
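To illustrate the two agreement measures mentioned above, the following minimal Python sketch computes raw agreement and a Kendall rank coefficient between one human reviewer and an LLM for a single ROBINS-I domain. It is an illustration under stated assumptions, not the authors' analysis: the abstract does not say which Kendall statistic was used, so tau-b (which corrects for ties) is chosen here as one plausible option; the ratings are invented; and the names `LEVELS`, `raw_agreement` and `kendall_tau_b` are hypothetical.

```python
from itertools import combinations
from math import sqrt

# Assumed ordinal coding of ROBINS-I judgements (illustrative, not from the paper).
LEVELS = {"low": 0, "moderate": 1, "serious": 2, "critical": 3}


def raw_agreement(a, b):
    """Proportion of studies on which both raters gave the identical judgement."""
    if len(a) != len(b) or not a:
        raise ValueError("rating lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def kendall_tau_b(a, b):
    """Kendall's tau-b for two lists of ordinal judgements (handles ties)."""
    if len(a) != len(b):
        raise ValueError("rating lists must be of equal length")
    x = [LEVELS[v] for v in a]
    y = [LEVELS[v] for v in b]
    concordant = discordant = ties_x = ties_y = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied for both raters: contributes to neither term
        if dx == 0:
            ties_x += 1  # tied only for the first rater
        elif dy == 0:
            ties_y += 1  # tied only for the second rater
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + ties_x) * (untied + ties_y))
    # Undefined when one rater gave the same judgement to every study.
    return (concordant - discordant) / denom if denom else float("nan")


# Invented example ratings for five studies in one domain.
human = ["low", "moderate", "serious", "moderate", "critical"]
llm = ["low", "serious", "serious", "moderate", "serious"]

print(f"raw agreement: {raw_agreement(human, llm):.0%}")
print(f"Kendall tau-b: {kendall_tau_b(human, llm):.2f}")
```

Raw agreement counts only exact matches, whereas a rank-based coefficient also credits near misses that preserve the ordering of studies, which is one reason the two measures can point to different domains as the best-agreeing ones.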

Keywords: Evidence-Based Practice; Methods; Systematic Reviews as Topic.