AI-driven Medical Care: Evaluation of Large Language Models in Generating Personalized Stroke Education Materials

Surim YOON; Woo-Keun SEO; Kyungseo KIM; Seongvin JU; Hyun Kyung KIM; Hyung Jun KIM; Jong-Won CHUNG; Oh Young BANG; Gyeong-Moon KIM; Eun Young LEE; Youngrak CHOI; Soyoung YOO


Authors: Surim YOON¹; Woo-Keun SEO; Kyungseo KIM; Seongvin JU; Hyun Kyung KIM; Hyung Jun KIM; Jong-Won CHUNG; Oh Young BANG; Gyeong-Moon KIM; Eun Young LEE; Youngrak CHOI; Soyoung YOO
Author Information

1. Department of Neurology, Samsung Medical Center, Sungkyunkwan University School of Medicine, Seoul, Korea
Publication Type: Original Article
From: Healthcare Informatics Research 2026;32(2):179-189
Country: Republic of Korea
Language: English
Abstract: Objectives: Large language models (LLMs) demonstrate remarkable potential in healthcare communication. However, whether they can process complex, high-volume medical information, such as stroke-related content, remains insufficiently validated. This study aimed to evaluate the natural language processing capabilities of LLMs in handling such content and to develop an evaluation instrument.
Methods: A survey compared educational materials generated by two LLMs (ChatGPT 4.0 and Claude 3) with neurologist-authored content on stroke. The materials were based on two clinical scenarios representing distinct stroke etiologies: cardioembolism and large-artery atherosclerosis. They were evaluated in terms of accuracy, legality, ethics, comprehensiveness, and information delivery. Scores for comprehensiveness and information delivery were compared according to participants’ agreement with the use of LLMs in healthcare.
Results: ChatGPT received the highest scores across all domains, except for legality in Scenario 2. In Scenario 1, the ranking for accuracy and for summarization of clinical information was, from highest to lowest, ChatGPT, Claude, and the neurologist (η² = 0.140, p < 0.001 and η² = 0.175, p < 0.001, respectively). The same hierarchy was observed in Scenario 2 for accuracy (η² = 0.077, p < 0.001) and summarization (η² = 0.194, p < 0.001). Participants who agreed with the use of LLMs in healthcare assigned higher scores for the comprehensiveness (Scenario 1, p = 0.005; Scenario 2, p = 0.007) and information delivery (Scenario 1, p = 0.003; Scenario 2, p = 0.026) of ChatGPT-generated materials than participants who did not agree.
Conclusions: LLMs demonstrated adequate capability to convey complex content, such as stroke-related information, in an accessible and understandable manner for non-experts.
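
Note on the effect sizes: the abstract reports η² values but does not state which statistical test produced them. As a rough illustration only, the sketch below computes the classical one-way η² (between-group sum of squares divided by total sum of squares) for three sources of material. The function name `eta_squared` and the ratings are hypothetical and are not taken from the study, and the authors may have used a different variant (for example, partial η² from a repeated-measures design).

```python
from statistics import mean

def eta_squared(groups):
    """Classical eta squared for a one-way layout: SS_between / SS_total."""
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)
    # Variation of each group mean around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of every individual score around the grand mean.
    ss_total = sum((x - grand_mean) ** 2 for x in all_scores)
    return ss_between / ss_total

# Made-up 5-point ratings for illustration; NOT data from the study.
ratings = {
    "ChatGPT": [5, 4, 5, 4],
    "Claude": [4, 4, 3, 4],
    "Neurologist": [3, 4, 3, 3],
}
effect = eta_squared(list(ratings.values()))
print(round(effect, 3))
```

Under this definition, η² is the share of total score variance attributable to the source of the material, so a reported value of 0.140 would correspond to about 14% of the variance in ratings.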