Comparative study of different large language models and medical professionals of different levels responding to ophthalmology questions

Hui HUANG; Jinyu HU; Xiaoyu WANG; Shuyuan YE; Shinan WU; Cheng CHEN; Liangqi HE; Yanmei ZENG; Hong WEI; Yi SHAO

Vernacular Title: 不同大型语言模型与不同水平医学专业人士回答眼科问题的对比研究
Author: Hui HUANG ¹ ; Jinyu HU ¹ ; Xiaoyu WANG ¹ ; Shuyuan YE ¹ ; Shinan WU ¹ ; Cheng CHEN ¹ ; Liangqi HE ¹ ; Yanmei ZENG ¹ ; Hong WEI ¹ ; Yi SHAO ¹
Author Information

1. Department of Ophthalmology, the First Affiliated Hospital of Nanchang University, Nanchang 330006, Jiangxi Province, China; Eye & ENT Hospital of Fudan University, Shanghai 200126, China
Publication Type: Journal Article
Keywords: large language models (LLM); natural language processing; ophthalmology question
From: International Eye Science 2024;24(3):458-462
Country: China
Language: Chinese
Abstract: AIM: To evaluate the performance of three large language models (LLMs), GPT-3.5, GPT-4, and PaLM2, in answering ophthalmology questions, and to compare it with that of medical professionals at three levels: medical undergraduates, master's students in medicine, and attending physicians. METHODS: A total of 100 ophthalmology multiple-choice questions, covering basic ophthalmic knowledge, clinical knowledge, ophthalmic examination and diagnostic methods, and treatment of ocular diseases, were administered to the three LLMs and to medical professionals at the three levels (9 undergraduates, 6 postgraduates, and 3 attending physicians). LLM performance was evaluated in terms of mean score, response consistency, and response confidence, and was compared with that of the human participants. RESULTS: Each LLM exceeded the mean score of the medical undergraduates (GPT-4: 56, GPT-3.5: 42, PaLM2: 47, undergraduates: 40). GPT-3.5 and PaLM2 scored slightly lower than the master's students (51), while GPT-4 performed comparably to the attending physicians (62). Furthermore, GPT-4 showed significantly higher response consistency and self-confidence than GPT-3.5 and PaLM2. CONCLUSION: LLMs, as represented by GPT-4, perform well in ophthalmology and can serve as aids to clinical decision-making for clinicians and as teaching aids in medical education.