Accuracy and quality of answer reasoning of Chinese large language model in Chinese middle level professional qualification examination of radiology

Jingyu ZHONG; Yue XING; Yangfan HU; Qinghua MIN; Caisong ZHU; Dandan SHI; Xiaoyu FAN; Jingshen CHU; Huan ZHANG; Weiwu YAO

Vernacular Title: 中文大语言模型在放射医学中级专业技术考试中的正确率和答案解析质量
Author: Jingyu ZHONG¹; Yue XING; Yangfan HU; Qinghua MIN; Caisong ZHU; Dandan SHI; Xiaoyu FAN; Jingshen CHU; Huan ZHANG; Weiwu YAO
Author Information

1. Department of Imaging, Tongren Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai 200336
Publication Type: Journal Article
Keywords: Large language model; Standardized residency training; Assessment; Radiology
From: Chinese Journal of Medical Education Research 2025;24(2):145-149
Country: China
Language: Chinese
Abstract:
Objective: To compare the accuracy of a Chinese large language model (LLM) with that of radiologists in the Chinese middle level professional qualification examination of radiology, and to evaluate the quality of the answer reasoning provided by the Chinese LLM.
Methods: In this study, 100 high-quality questions were selected by stratified random sampling to form a test set. ERNIE Bot was asked, through dialogues on its website, to provide the correct answer and the answer reasoning for each question. The same questions were also answered by 15 radiologists with different levels of experience. The accuracy of the Chinese LLM was compared with that of the radiologists. Two radiologists evaluated the quality of the answer reasoning using a 5-point semi-quantitative scale.
Results: The accuracy of ERNIE Bot was 60.00%, which was lower than the radiologists' median (interquartile range) accuracy of 67.00% (64.00%, 73.00%); the difference was statistically significant (W=2.47, P=0.013). The reasoning provided by ERNIE Bot had a word count of (196.44±99.25) words. The word counts of the reasoning for correct and incorrect answers were (211.03±107.53) words and (174.55±81.84) words, respectively, with no significant difference between them (t=1.82, P=0.072). Among the correct answers, the quality of reasoning was scored as follows: 1 point for 3 questions, 2 points for 9 questions, 3 points for 12 questions, and 4 points for 36 questions. No reasoning received a score of 5.
Conclusions: The Chinese LLM demonstrates a certain level of medical knowledge and clinical reasoning ability and can assist clinical teachers in educational activities. However, it is not yet able to tutor residents independently and lacks the ability to carry out invitational and heuristic teaching.
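
The summary statistics reported in the abstract can be cross-checked against one another. The sketch below is not the authors' analysis code; it is a minimal consistency check using only the figures quoted above. It assumes that 60.00% accuracy on 100 questions means 60 correct and 40 incorrect answers, and that the reported t value comes from a pooled-variance (Student) two-sample t-test, which the abstract does not state. All variable names are illustrative.

```python
import math

# Figures quoted in the abstract.
n_questions = 100
n_correct = 60                     # assumption: 60.00% of 100 questions
n_incorrect = n_questions - n_correct

# Reasoning-quality scores for the correct answers (score -> number of questions).
score_counts = {1: 3, 2: 9, 3: 12, 4: 36, 5: 0}
n_scored = sum(score_counts.values())

# Word counts of the reasoning, reported as mean and standard deviation.
mean_correct, sd_correct = 211.03, 107.53
mean_incorrect, sd_incorrect = 174.55, 81.84

# Overall mean word count as the size-weighted average of the two groups.
overall_mean = (n_correct * mean_correct + n_incorrect * mean_incorrect) / n_questions

# Pooled-variance two-sample t statistic (assumed test variant).
pooled_var = (
    (n_correct - 1) * sd_correct ** 2 + (n_incorrect - 1) * sd_incorrect ** 2
) / (n_questions - 2)
std_err = math.sqrt(pooled_var * (1 / n_correct + 1 / n_incorrect))
t_stat = (mean_correct - mean_incorrect) / std_err

print(f"scored answers: {n_scored}")
print(f"overall mean word count: {overall_mean:.2f}")
print(f"t statistic: {t_stat:.2f}")
```

Under these assumptions the three checks agree with the abstract: the score counts add up to the 60 correct answers, the weighted mean rounds to 196.44 words, and the t statistic rounds to 1.82.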