Standardized Evaluation of Large Language Models in Traditional Chinese Medicine
10.14148/j.issn.1672-0482.2024.1383
- VernacularTitle: 大语言模型在中医领域的标准化评估 (Standardized evaluation of large language models in the field of traditional Chinese medicine)
- Author: Lu CAO 1; Lin XU 1; Yujie ZHANG 1; Linshuai ZHANG 1; Yaqin FU 1; Tao JIANG 1
Author Information
1. School of Intelligent Medicine, Chengdu University of Traditional Chinese Medicine, Chengdu 611100, Sichuan, China
- Publication Type:Journal Article
- Keywords:
large language models;
Chinese medical models;
evaluation benchmark;
ChatGPT;
traditional Chinese medicine
- From:
Journal of Nanjing University of Traditional Chinese Medicine
2024;40(12):1383-1392
- Country:China
- Language:Chinese
- Abstract:
OBJECTIVE To address the current lack of evaluation benchmarks for large language models (LLMs) in traditional Chinese medicine (TCM), a TCM benchmark dataset was designed and constructed to comprehensively and objectively evaluate the mastery of TCM knowledge and the reasoning performance of LLMs, providing a scientific and reliable basis for optimizing LLM performance in the field of TCM.
METHODS The benchmark comprises 29,506 questions across 13 subjects, with data collected from standardized TCM examinations and textbooks. Three general-purpose models (GPT-3.5, ChatGLM3, Baichuan) and five Chinese medical models (PULSE, BenTsao, HuatuoGPT2, BianQue2, ShenNong) were evaluated on answer prediction and answer reasoning tasks. The evaluation results were quantitatively assessed using metrics including accuracy, F1 score, BLEU, and Rouge.
RESULTS On the answer prediction task, Baichuan achieved the highest accuracy on single-choice questions (36.07%), while ChatGLM3 achieved the highest accuracy (18.96%) and F1 score (76.31%) on multiple-choice questions. On the answer reasoning task, Baichuan scored highest on BLEU-1 (24.71), while ChatGLM3 achieved the highest Rouge-1 score (44.64).
CONCLUSION In this study, general-purpose LLMs performed slightly better than Chinese medical LLMs. Meanwhile, no model exceeded 60% accuracy on choice questions, reflecting the significant challenges and room for improvement that LLMs still face in the field of TCM.
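The abstract names four metrics (accuracy, F1, BLEU, Rouge) applied to choice questions and free-text reasoning. A minimal sketch of how such metrics are commonly computed is shown below; the function names are illustrative, and the BLEU-1/Rouge-1 forms are simplified unigram-overlap versions (no brevity penalty, word-level tokenization), not necessarily the exact implementation used in the paper.

```python
from collections import Counter

def choice_accuracy(preds, golds):
    # Exact-match accuracy: for multiple-choice items the predicted
    # option set must equal the gold option set exactly.
    return sum(set(p) == set(g) for p, g in zip(preds, golds)) / len(golds)

def choice_f1(preds, golds):
    # Micro-averaged F1 over individual options (e.g. A/B/C/D/E),
    # giving partial credit on multiple-choice questions.
    tp = fp = fn = 0
    for p, g in zip(preds, golds):
        p, g = set(p), set(g)
        tp += len(p & g)
        fp += len(p - g)
        fn += len(g - p)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def bleu1(candidate, reference):
    # Simplified BLEU-1: clipped unigram precision of the candidate
    # against the reference (brevity penalty omitted).
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum(min(c, ref[w]) for w, c in cand.items())
    return overlap / max(sum(cand.values()), 1)

def rouge1(candidate, reference):
    # Simplified Rouge-1: unigram recall of the reference
    # covered by the candidate.
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum(min(c, cand[w]) for w, c in ref.items())
    return overlap / max(sum(ref.values()), 1)
```

For example, a prediction of {B, C} against a gold answer of {B} scores zero under exact-match accuracy but earns partial credit under option-level F1, which is why the two metrics can diverge sharply (as in the reported 18.96% accuracy vs. 76.31% F1 for ChatGLM3).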