Performance of domestic and international large language models in question banks of clinical laboratory medicine
Yuechang LIU ; Ziru CHEN ; Ming YANG ; Chen FU ; Tao ZENG
Chinese Journal of Clinical Laboratory Science 2023;41(12):941-944
Objective To explore the performance of domestic and international large language models (LLMs) on question banks of clinical laboratory medicine knowledge. Methods The performance of six domestic and international LLMs was assessed on a question bank of 330 questions for the intermediate level of clinical medical laboratory technology. Differences in accuracy and consistency among the LLMs were evaluated using chi-square tests, Fisher's exact tests and logistic regression. Results The accuracy rates and 95% confidence intervals (95% CI) of the four English LLMs were as follows: ChatGPT 0.56 (95% CI: 0.527-0.601), BingAI 0.61 (95% CI: 0.572-0.644), Claude 0.64 (95% CI: 0.607-0.678) and GPT-4 0.80 (95% CI: 0.767-0.833), while the two domestic models, Xinghuo and Tiangong, achieved accuracy rates of 0.52 (95% CI: 0.479-0.561) and 0.45 (95% CI: 0.408-0.482), respectively. With ChatGPT as the reference model, the odds ratios (OR) of a correct answer for BingAI, Claude and GPT-4 were 1.272 (95% CI: 1.020-1.588), 1.397 (95% CI: 1.119-1.743) and 3.270 (95% CI: 1.904-2.729), respectively; the differences in performance were statistically significant (P<0.05) for all three models. In terms of consistency, Tiangong and BingAI showed poor consistency, while GPT-4 performed better. Conclusion Among the six LLMs, GPT-4 demonstrated the highest overall accuracy and consistency in each question category.
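The odds ratios above compare each model's odds of answering correctly against ChatGPT as the reference. As a minimal sketch of how such an OR and its Wald 95% CI are derived from a 2x2 table of correct/incorrect counts: the counts below are illustrative placeholders, not the study's actual data, and the function name is our own.

```python
import math

def odds_ratio_ci(ref_correct, ref_wrong, cmp_correct, cmp_wrong, z=1.96):
    """Odds ratio of the comparison model vs. the reference model,
    with a Wald-type 95% confidence interval on the log-odds scale.

    All four arguments are counts from a 2x2 table; this mirrors the
    single-predictor logistic-regression OR for a binary model indicator.
    """
    or_ = (cmp_correct * ref_wrong) / (cmp_wrong * ref_correct)
    # Standard error of log(OR): sqrt of summed reciprocal cell counts
    se_log = math.sqrt(1 / ref_correct + 1 / ref_wrong
                       + 1 / cmp_correct + 1 / cmp_wrong)
    lo = math.exp(math.log(or_) - z * se_log)
    hi = math.exp(math.log(or_) + z * se_log)
    return or_, lo, hi

# Hypothetical example: reference model 185/330 correct,
# comparison model 264/330 correct
or_, lo, hi = odds_ratio_ci(185, 145, 264, 66)
print(f"OR = {or_:.3f} (95% CI: {lo:.3f}-{hi:.3f})")
```

A logistic regression with a categorical model indicator (reference level = ChatGPT) yields the same per-model ORs; the 2x2 form above just makes the arithmetic explicit.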