Evaluation of the performance of large language models in indication-based drug reimbursement review in hospitals

Ming GAO; Meichen HE; Licheng ZHANG; Zhaoming LIN; Yi LIU; Jiahua LENG

VernacularTitle:大语言模型在医院药品医保适应证审核中的效能评估
Author: Ming GAO¹; Meichen HE¹; Licheng ZHANG¹; Zhaoming LIN¹; Yi LIU¹; Jiahua LENG¹
Author Information

1. Medical Insurance Service Office, Peking University Cancer Hospital, Beijing 100142, China
Publication Type:Journal Article
Keywords: Insurance, health, reimbursement; Large language models; Indication review; Natural language processing; AI-assisted decision making
From: Chinese Journal of Hospital Administration 2025;41(1):63-66
Country:China
Language:Chinese
Abstract: Objective: To evaluate the performance of three mainstream large language models (LLMs) in reviewing drug reimbursement indications in hospitals, and to explore their potential to improve audit quality and efficiency, thereby safeguarding the medical insurance fund. Methods: A total of 3 247 outpatient prescription records were retrospectively collected from a specialized oncology hospital between January 2, 2022, and June 30, 2023. Manual assessment of the consistency between clinical diagnoses and drug reimbursement indications served as the gold standard. Three LLMs, Baidu's ERNIE Bot, Alibaba's Tongyi Qianwen, and OpenAI's ChatGPT-4o, were evaluated on the same task. Performance metrics were accuracy, precision, sensitivity, specificity, F1 score, and area under the curve (AUC). Results: ERNIE Bot returned 3 242 valid results in 314 min; Tongyi Qianwen returned 3 162 valid results in 384 min; and ChatGPT-4o returned 3 218 valid results in 150 min. ChatGPT-4o performed best, with an accuracy of 88.41%, precision of 60.48%, sensitivity of 78.75%, specificity of 90.24%, F1 score of 0.68, and AUC of 0.88. Conclusions: LLMs perform stably in determining whether prescriptions align with reimbursement indications, with ChatGPT-4o approaching human-level accuracy and exhibiting more conservative specificity. These findings suggest that LLMs have practical value as auxiliary tools in drug indication review, contributing to improved audit efficiency and more refined management of medical insurance funds.
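Note: The threshold-based metrics named in the abstract follow from a 2x2 confusion matrix of model verdicts against the manual gold standard. The Python sketch below is a minimal illustration, not the authors' code. The abstract does not state which class is treated as positive, so the function assumes one class is. The names Confusion, metrics, and f1_from are hypothetical, and the counts in the toy example are invented for illustration only. AUC is omitted because it requires per-record scores, which the abstract does not report.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Counts of model verdicts against the manual gold standard."""
    tp: int  # model positive, gold positive
    fp: int  # model positive, gold negative
    tn: int  # model negative, gold negative
    fn: int  # model negative, gold positive


def f1_from(precision: float, sensitivity: float) -> float:
    """F1 is the harmonic mean of precision and sensitivity (recall)."""
    return 2 * precision * sensitivity / (precision + sensitivity)


def metrics(c: Confusion) -> dict:
    """Compute the threshold-based metrics listed in the abstract."""
    precision = c.tp / (c.tp + c.fp)
    sensitivity = c.tp / (c.tp + c.fn)
    return {
        "accuracy": (c.tp + c.tn) / (c.tp + c.fp + c.tn + c.fn),
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": c.tn / (c.tn + c.fp),
        "f1": f1_from(precision, sensitivity),
    }


# Toy counts, invented for illustration (not data from the study).
toy = metrics(Confusion(tp=8, fp=2, tn=85, fn=5))

# Consistency check using the precision and sensitivity reported for ChatGPT-4o.
reported_f1 = f1_from(0.6048, 0.7875)
print(round(reported_f1, 2))  # 0.68
```

Applied to the reported precision (60.48%) and sensitivity (78.75%), the harmonic mean is about 0.684, which agrees with the stated F1 score of 0.68.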