Artificial Intelligence for Urology Research: The Holy Grail of Data Science or Pandora's Box of Misinformation?

J Endourol. 2024 Mar 28. doi: 10.1089/end.2023.0703. Online ahead of print.

Abstract

Introduction: Artificial intelligence tools such as the large language models (LLMs) Bard and ChatGPT have generated significant research interest. Using these LLMs to study the epidemiology of a target population could benefit urologists. We investigated whether Bard and ChatGPT could perform a large-scale calculation of the incidence and prevalence of kidney stone disease.

Materials and Methods: We obtained reference values from two published studies that used the National Health and Nutrition Examination Survey (NHANES) database to calculate the prevalence and incidence of kidney stone disease. We then tested the capability of Bard and ChatGPT to perform similar calculations using two different methods. First, we instructed the LLMs to access the datasets and perform the calculations independently. Second, we instructed the interfaces to generate customized computer code that could perform the calculations on the downloaded datasets.

Results: While ChatGPT denied having the ability to access and perform calculations on the NHANES database, Bard intermittently claimed that it could. Bard's results were sometimes accurate and at other times inaccurate and inconsistent. For example, Bard's "calculations" for the incidence of kidney stones from 2015 to 2018 were 2.1% (95% CI: 1.5-2.7), 1.75% (95% CI: 1.6-1.9), and 0.8% (95% CI: 0.7-0.9), whereas the published figure was 2.1% (95% CI: 1.5-2.7). Bard provided discrete mathematical details of its calculations; however, when prompted further, it admitted to having obtained the numbers from online sources, including our chosen reference papers, rather than from a de novo calculation. Both LLMs were able to produce Python code for use on the downloaded NHANES datasets, but this code would not readily execute.

Conclusions: ChatGPT and Bard are currently incapable of performing epidemiological calculations and lack transparency and accountability. Caution should be exercised, particularly with Bard, as its claims of capability were convincingly misleading and its results were inconsistent.
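
For context on the kind of calculation the Methods describe, the sketch below shows a minimal survey-weighted prevalence estimate from a downloaded NHANES cycle in Python. This is not the authors' or the LLMs' code; the file names, variable names (SEQN, KIQ026, WTMEC2YR), and choice of weight are assumptions drawn from public NHANES documentation, and proper confidence intervals would additionally require the survey design variables (strata and primary sampling units), which are omitted here.

```python
# Minimal sketch (assumed names, not the study's code): estimate the weighted
# prevalence of kidney stone history from one downloaded NHANES cycle.
import pandas as pd

# NHANES files are distributed as SAS transport (.XPT) files.
demo = pd.read_sas("DEMO_J.XPT", format="xport")   # demographics + survey weights (assumed file name)
kiq = pd.read_sas("KIQ_U_J.XPT", format="xport")   # kidney conditions questionnaire (assumed file name)

# Merge on the respondent sequence number.
df = demo.merge(kiq, on="SEQN", how="inner")

# KIQ026 ("Ever had kidney stones?"): 1 = Yes, 2 = No; drop refused/don't know codes.
df = df[df["KIQ026"].isin([1, 2])]
df["stone_ever"] = (df["KIQ026"] == 1).astype(float)

# Weighted prevalence using the 2-year MEC exam weight (choice of weight is an assumption).
w = df["WTMEC2YR"]
prevalence = (df["stone_ever"] * w).sum() / w.sum()
print(f"Weighted prevalence of kidney stone history: {prevalence:.1%}")
```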