Performance of domestic and international large language models in question banks of clinical laboratory medicine
- DOI: 10.13602/j.cnki.jcls.2023.12.14
- VernacularTitle:国内外大语言模型在临床检验题库中的表现
- Author: Yuechang LIU (1); Ziru CHEN; Ming YANG; Chen FU; Tao ZENG
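- Author Information:
1. Department of Clinical Laboratory, The Sixth Affiliated Hospital, Sun Yat-sen University; Zhongliu Biomedical Innovation Research Institute, Huangpu District, Guangzhou; Guangzhou 510655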
- Keywords:
clinical laboratory medicine;
large language models;
AI large model;
artificial intelligence
- From:
Chinese Journal of Clinical Laboratory Science
2023;41(12):941-944
- Country: China
- Language:Chinese
- Abstract:
Objective: To explore the performance of domestic and international large language models (LLMs) on a question bank of clinical laboratory knowledge. Methods: Six domestic and international LLMs were assessed on a set of 330 intermediate-level clinical medical laboratory technology questions. Differences in accuracy and consistency among the LLMs were evaluated using chi-square tests, Fisher's exact tests and logistic regression. Results: The accuracy rates of the four English LLMs, ChatGPT, BingAI, Claude and GPT-4, were 0.56 (95% CI: 0.527-0.601), 0.61 (95% CI: 0.572-0.644), 0.64 (95% CI: 0.607-0.678) and 0.80 (95% CI: 0.767-0.833), respectively, while the domestic LLMs Xinghuo and Tiangong achieved accuracy rates of 0.52 (95% CI: 0.479-0.561) and 0.45 (95% CI: 0.408-0.482), respectively. With ChatGPT as the reference model, the odds ratios (OR) of a correct answer for BingAI, Claude and GPT-4 were 1.272 (95% CI: 1.020-1.588), 1.397 (95% CI: 1.119-1.743) and 3.270 (95% CI: 1.904-2.729), respectively; the difference from ChatGPT was statistically significant (P<0.05) for all three models. In terms of consistency, Tiangong and BingAI showed poor consistency, while GPT-4 performed better. Conclusion: Among the six LLMs, GPT-4 demonstrated the highest overall accuracy and consistency in each question category.
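
For readers who want to see how an accuracy rate with a 95% CI and an odds ratio against a reference model are typically obtained, the following is a minimal sketch only. The abstract does not state the per-model answer counts or which interval method the authors used, so the counts below are hypothetical, and the Wald interval for the proportion and the Woolf (log-scale) interval for the odds ratio are assumptions. For a single binary predictor (model vs. reference), the odds ratio from a 2x2 table equals the exponentiated coefficient of a logistic regression.

```python
import math

Z95 = 1.959964  # two-sided 95% quantile of the standard normal distribution


def proportion_ci(correct, total, z=Z95):
    """Accuracy rate with a Wald (normal-approximation) confidence interval."""
    p = correct / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


def odds_ratio_ci(correct_model, total_model, correct_ref, total_ref, z=Z95):
    """Odds ratio of a correct answer vs. a reference model, with a Woolf interval."""
    a, b = correct_model, total_model - correct_model  # model: correct, wrong
    c, d = correct_ref, total_ref - correct_ref        # reference: correct, wrong
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # standard error of log(OR)
    low = math.exp(math.log(odds_ratio) - z * se_log)
    high = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, low, high


if __name__ == "__main__":
    # Hypothetical counts for illustration only; these are NOT the study's data.
    ref_correct, ref_total = 56, 100
    new_correct, new_total = 80, 100

    p, p_low, p_high = proportion_ci(new_correct, new_total)
    print(f"accuracy = {p:.2f} (95% CI: {p_low:.3f}-{p_high:.3f})")

    o, o_low, o_high = odds_ratio_ci(new_correct, new_total, ref_correct, ref_total)
    print(f"OR vs. reference = {o:.3f} (95% CI: {o_low:.3f}-{o_high:.3f})")
```

An interval for the odds ratio that excludes 1 corresponds to a difference from the reference model that is significant at the 0.05 level under this approximation.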