Performance of domestic and international large language models in question banks of clinical laboratory medicine

Yuechang LIU; Ziru CHEN; Ming YANG; Chen FU; Tao ZENG

Vernacular Title: 国内外大语言模型在临床检验题库中的表现
Author: Yuechang LIU¹; Ziru CHEN; Ming YANG; Chen FU; Tao ZENG
Author Information

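1. Department of Clinical Laboratory, The Sixth Affiliated Hospital, Sun Yat-sen University; Zhongliu Biomedical Innovation Institute, Huangpu District, Guangzhou; Guangzhou 510655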
Keywords: clinical laboratory medicine; large language models; AI large model; artificial intelligence
From: Chinese Journal of Clinical Laboratory Science 2023;41(12):941-944
Country: China
Language: Chinese
Abstract: Objective: To explore the performance of domestic and international large language models (LLMs) on question banks of clinical laboratory knowledge. Methods: Six domestic and international LLMs were assessed on a question bank of 330 intermediate-level questions on clinical medical laboratory technology. Differences in accuracy and consistency among the LLMs were evaluated using chi-square tests, Fisher's exact tests and logistic regression. Results: The accuracy rates of the four English-language LLMs, ChatGPT, BingAI, Claude and GPT-4, were 0.56 (95% CI: 0.527-0.601), 0.61 (95% CI: 0.572-0.644), 0.64 (95% CI: 0.607-0.678) and 0.80 (95% CI: 0.767-0.833), respectively, while Xinghuo and Tiangong achieved accuracy rates of 0.52 (95% CI: 0.479-0.561) and 0.45 (95% CI: 0.408-0.482), respectively. Using ChatGPT as the reference model, the odds ratios (OR) of a correct answer for BingAI, Claude and GPT-4 were 1.272 (95% CI: 1.020-1.588), 1.397 (95% CI: 1.119-1.743) and 3.270 (95% CI: 1.904-2.729), respectively. The differences in performance were statistically significant (P < 0.05) for all three models. In terms of consistency, Tiangong and BingAI showed poor consistency, while GPT-4 performed better. Conclusion: Among the six LLMs, GPT-4 demonstrated the highest overall accuracy and consistency in each question category.
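
As an illustration of the kind of statistics reported in the abstract, the following minimal sketch computes an accuracy rate with a Wald 95% confidence interval, and an odds ratio with its 95% confidence interval from a 2x2 table of correct and incorrect answers. It is not the authors' analysis: the abstract names logistic regression, whereas this sketch uses the simpler closed-form odds ratio, which matches a logistic regression with a single binary predictor. The function names and the counts (80/100 and 60/100) are hypothetical and are not taken from the study.

```python
import math


def accuracy_with_ci(correct, total, z=1.96):
    """Accuracy rate with a Wald (normal-approximation) confidence interval."""
    p = correct / total
    se = math.sqrt(p * (1 - p) / total)
    return p, max(0.0, p - z * se), min(1.0, p + z * se)


def odds_ratio_with_ci(correct_model, total_model, correct_ref, total_ref, z=1.96):
    """Odds ratio of a correct answer (model vs. reference) with a Wald CI.

    Assumes every cell of the 2x2 table is non-zero.
    """
    a, b = correct_model, total_model - correct_model  # model: correct, wrong
    c, d = correct_ref, total_ref - correct_ref        # reference: correct, wrong
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


if __name__ == "__main__":
    # Hypothetical counts for illustration only (not data from the study).
    print(accuracy_with_ci(80, 100))
    print(odds_ratio_with_ci(80, 100, 60, 100))
```

An odds ratio above 1 whose interval excludes 1 indicates that the model answers correctly more often than the reference at the chosen confidence level.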