Performance of a Large Language Model in the Generation of Clinical Guidelines for Antibiotic Prophylaxis in Spine Surgery

Bashar ZAIDAT; Nancy SHRESTHA; Ashley M ROSENBERG; Wasil AHMED; Rami RAJJOUB; Timothy HOANG; Mateo Restrepo MEJIA; Akiro H DUEY; Justin E TANG; Jun S KIM; Samuel K CHO

Author: Bashar ZAIDAT¹; Nancy SHRESTHA; Ashley M. ROSENBERG; Wasil AHMED; Rami RAJJOUB; Timothy HOANG; Mateo Restrepo MEJIA; Akiro H. DUEY; Justin E. TANG; Jun S. KIM; Samuel K. CHO
Author Information

1. Department of Orthopedic Surgery, Icahn School of Medicine at Mount Sinai, New York, NY, USA
Publication Type: Original Article
From: Neurospine 2024;21(1):128-146
Country: Republic of Korea
Language: English
Abstract: Objective: Large language models, such as Chat Generative Pre-trained Transformer (ChatGPT), have great potential for streamlining medical processes and assisting physicians in clinical decision-making. This study aimed to assess the potential of 2 ChatGPT models (GPT-3.5 and GPT-4.0) to support clinical decision-making by comparing their responses on antibiotic prophylaxis in spine surgery with accepted clinical guidelines.
Methods: The ChatGPT models were prompted with questions from the North American Spine Society (NASS) Evidence-based Clinical Guidelines for Multidisciplinary Spine Care for Antibiotic Prophylaxis in Spine Surgery (2013). Their responses were then compared with the guideline recommendations and assessed for accuracy.
Results: Of the 16 NASS guideline questions concerning antibiotic prophylaxis, 10 responses (62.5%) from GPT-3.5 and 13 (81%) from GPT-4.0 were accurate. Twenty-five percent of GPT-3.5 answers were deemed overly confident, while 62.5% of GPT-4.0 answers directly cited the NASS guideline as evidence.
Conclusion: ChatGPT demonstrated an impressive ability to accurately answer clinical questions. The GPT-3.5 model’s performance was limited by its tendency to give overly confident responses and its inability to identify the most significant elements in its responses. The GPT-4.0 model’s responses were more accurate and frequently cited the NASS guideline as direct evidence. While GPT-4.0 is still far from perfect, it showed a markedly better ability than GPT-3.5 to extract the most relevant research available. Thus, while ChatGPT has shown far-reaching potential, scrutiny should still be exercised regarding its clinical use at this time.
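
Note: the accuracy percentages reported in the Results can be checked with a few lines of arithmetic. The sketch below is only an illustration and is not taken from the study; the function name `percent` and the dictionary layout are assumptions made here, and only the counts (10 and 13 of 16) come from the abstract.

```python
from fractions import Fraction

TOTAL_QUESTIONS = 16  # NASS guideline questions, as stated in the abstract


def percent(count, total=TOTAL_QUESTIONS):
    """Return count/total as a percentage (hypothetical helper, not from the study)."""
    # Fraction keeps the division exact before converting to float.
    return float(Fraction(count, total) * 100)


# Accurate responses per model, as reported in the abstract.
accurate = {"GPT-3.5": 10, "GPT-4.0": 13}

for model, count in accurate.items():
    print(f"{model}: {count}/{TOTAL_QUESTIONS} = {percent(count):.2f}%")
```

The GPT-4.0 figure works out to 81.25%, which the abstract reports rounded as 81%.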