Application of large language models in health education for patients with diabetic retinopathy
10.3760/cma.j.cn115989-20240723-00207
- VernacularTitle:大语言模型在糖尿病视网膜病变患者健康教育中的应用
- Author:
Fei GAO
1
;
Xue GAO
;
Yan SHAO
;
Xinjun REN
;
Boshi LIU
;
Mingfei JIAO
;
Xiaorong LI
;
Juping LIU
Author Information
1. Tianjin Medical University Eye Hospital, Tianjin Medical University School of Optometry and Ophthalmology, Tianjin Medical University Eye Institute, Tianjin Branch of the National Clinical Research Center for Ocular and Otorhinolaryngologic Diseases, Tianjin Key Laboratory of Retinal Functions and Diseases, Tianjin 300384, China
- Keywords:
Diabetic retinopathy;
Health education;
Deep learning;
Large language models;
Evaluation
- From:
Chinese Journal of Experimental Ophthalmology
2024;42(12):1111-1118
- Country: China
- Language: Chinese
- Abstract:
Objective:To evaluate the accuracy, completeness, and reproducibility of domestic open-source large language models (LLMs) in diabetic retinopathy (DR) patient education, and to explore their potential as intelligent virtual assistants for DR patient education.
Methods:A total of 41 questions and answers related to the diagnosis and treatment of DR were compiled in five categories: risk factors, screening and examination, symptoms and staging, diagnosis, and treatment and prognosis. Each question was posed to the LLM twice, each time as a "new dialogue", and all answers were recorded. Three senior fundus physicians independently rated the answers on a 6-point Likert scale for accuracy and 3-point Likert scales for completeness and repeatability, and for each answer the evaluators were asked to choose between the LLM answer and the manual (human-written) answer. Five questions were randomly selected to screen three open-source LLMs, ERNIE Bot 3.5, Qwen, and Kimi chat, and the LLM with the best overall performance was evaluated further on the full question bank.
Results:Among the three LLMs, Kimi chat performed best overall: for the 5 screening questions, the proportions of answers scoring 6 for accuracy, 3 for completeness, and 3 for repeatability were 90%, 90%, and 100%, respectively. Across all questions, the word count of manual replies was 106 (70, 202), significantly lower than the 505 (386, 600) of Kimi chat replies ( Z=-7.866, P<0.001). The word count of Kimi chat replies was not significantly correlated with the accuracy score ( rs=-0.044, P=0.492), but was positively correlated with the completeness score ( rs=0.239, P<0.001). The intraclass correlation coefficients for accuracy and completeness scores among the three evaluators were all above 0.700, with the highest agreement for repeatability at 0.853, followed by completeness of the first response at 0.771. The proportion of responses scoring ≥5 points for accuracy was 87.0% (214/246), the proportion scoring ≥2 points for completeness was 98.0% (241/246), and the proportion with repeatability above 70% was 78.5% (193/246). Kimi chat excelled at basic questions about the disease, such as its definition, staging, screening frequency, and common risk factors, but performed poorly on questions involving treatment choices that require a physician's professional judgment. Evaluators chose the Kimi chat response as superior for 69.5% (171/246) of answers; reasons for not choosing it included a lack of targeted answers, inclusion of too much irrelevant information, and failure to respond to questions requiring a high degree of medical expertise.
Conclusions:Kimi chat answers DR-related diagnosis and treatment questions in a detailed and well-organized manner, with a high degree of accuracy, completeness, and reproducibility.
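The abstract correlates reply word count with accuracy and completeness scores using Spearman's rank correlation (rs). As an illustration only, a minimal pure-Python sketch of rs with average ranks for ties (the study's actual statistical software and data are not specified here) might look like:

```python
from math import sqrt
from statistics import mean

def average_ranks(values):
    """Assign 1-based ranks, averaging ranks over tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # find the run of tied values starting at position i
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rs(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Hypothetical example: reply word counts vs. completeness scores
word_counts = [386, 505, 600, 450, 520]
completeness = [2, 3, 3, 2, 3]
rs = spearman_rs(word_counts, completeness)
```

A positive rs here, as in the study (rs=0.239 for completeness), would indicate that longer replies tend to receive higher completeness scores.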