Accuracy of large language models for answering pediatric preventive dentistry questions
10.12016/j.issn.2096-1456.202440370
- Author:
GUAN Boyan 1; XU Minghe 1; ZHANG Huiqi 1; MA Shulei 1; ZHANG Shanshan 2; ZHAO Junfeng 3
Author Information
1. Peking University School of Stomatology
2. Peking University Hospital of Stomatology
3. School of Computer Science, Peking University; Key Laboratory of High Confidence Software Technologies
- Publication Type:Journal Article
- Keywords:
large language model / pediatric stomatology / preventive dentistry / stomatology / ChatGPT / artificial intelligence / Chatbot / medicine
- From:
Journal of Prevention and Treatment for Stomatological Diseases
2025;33(4):313-319
- Country:China
- Language:Chinese
- Abstract:
Objective:To evaluate and compare the accuracy of responses to pediatric preventive dentistry questions between a domestic large language model, ChatGLM-6B, and an international large language model, ChatGPT-3.5, in order to inform further research and development of domestic language models in the field of oral medicine.
Methods:A total of 100 common pediatric preventive dentistry questions of varying difficulty [basic (n = 35), intermediate (n = 35), and advanced (n = 30)] were provided by pediatric preventive dentistry experts. Two doctors independently submitted these questions to ChatGPT-3.5 and ChatGLM-6B and collected the answers. A cohort of 16 dentists assessed the responses generated by ChatGLM-6B and ChatGPT-3.5 using a predefined 3-point Likert scale, and the mean of the 16 ratings was taken as the answer score. An answer scoring above 2.8 was classified as accurate; below 1.4, as inaccurate; and between 1.4 and 2.8, as partially accurate. Accuracy rates and evaluation outcomes were compared between the two models, and the consistency of the ratings was analyzed.
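The classification rule above reduces to a simple threshold on the mean of the 16 Likert scores. A minimal Python sketch of that rule (the function name and the example ratings are illustrative assumptions, not taken from the study's materials):

```python
# Minimal sketch of the study's scoring rule; example ratings are hypothetical.
from statistics import mean

def classify_answer(ratings):
    """Classify one model answer from the raters' 3-point Likert scores:
    mean > 2.8 -> accurate; mean < 1.4 -> inaccurate; otherwise partial."""
    score = mean(ratings)
    if score > 2.8:
        return score, "accurate"
    if score < 1.4:
        return score, "inaccurate"
    return score, "partially accurate"

# Hypothetical scores from the 16 dentists for a single answer.
example = [3, 3, 2, 3, 3, 2, 3, 3, 3, 2, 3, 3, 3, 2, 3, 3]
print(classify_answer(example))  # (2.75, 'partially accurate')
```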
Results:The accuracy rates of ChatGPT-3.5 and ChatGLM-6B on the 100 pediatric preventive dentistry questions were comparable: ChatGPT-3.5 gave 68% accurate, 30% partially accurate, and 2% inaccurate responses, while ChatGLM-6B gave 67% accurate, 31% partially accurate, and 2% inaccurate responses, with no statistically significant difference (P>0.05). The two models also showed equivalent accuracy across questions of each difficulty level (basic, intermediate, advanced), with no statistical differences (P>0.05). The overall average scores of ChatGPT-3.5 and ChatGLM-6B across all questions were both 2.65 (P>0.05). By difficulty level, ChatGPT-3.5 averaged 2.66 on basic questions versus 2.70 for ChatGLM-6B; 2.63 versus 2.64 on intermediate questions; and 2.68 versus 2.61 on advanced questions, with no statistically significant difference in any category (P>0.05). The consistency of the experts' grading ranged from fair to moderate.
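The abstract reports P>0.05 for the between-model comparisons and fair-to-moderate rater agreement, but names neither the significance test nor the agreement statistic. A hedged sketch of two plausible choices, a chi-square test of independence on the reported outcome counts and Fleiss' kappa on hypothetical rater-count data:

```python
# Sketch of two plausible analyses behind the reported results; the abstract
# does not name the actual test or agreement statistic, so both choices here
# are assumptions, and the kappa input data are hypothetical.
import numpy as np
from scipy.stats import chi2_contingency
from statsmodels.stats.inter_rater import fleiss_kappa

# Outcome counts from the abstract (n = 100 questions per model).
# Rows: ChatGPT-3.5, ChatGLM-6B; columns: accurate / partial / inaccurate.
outcomes = [[68, 30, 2],
            [67, 31, 2]]
chi2, p, dof, _ = chi2_contingency(outcomes)
print(f"chi-square = {chi2:.3f}, dof = {dof}, p = {p:.3f}")  # p well above 0.05

# Hypothetical rater-count table for the agreement analysis: each row is one
# question, columns count how many of the 16 dentists gave scores 1, 2, or 3.
ratings = np.array([[0, 4, 12],
                    [1, 6, 9],
                    [0, 2, 14],
                    [2, 8, 6]])
print(f"Fleiss' kappa = {fleiss_kappa(ratings):.3f}")
```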
Conclusion:This study demonstrates the potential of both ChatGLM-6B and ChatGPT-3.5 for answering pediatric preventive dentistry questions. ChatGLM-6B performed comparably to ChatGPT-3.5 in this field, but the accuracy of both models fell short of expectations, and neither is yet suitable for clinical use. Future efforts should focus on improving the accuracy and consistency of large language models in providing medical information, and on developing specialized medical models for the field of oral medicine.
- Full text:2025040916043987660大语言模型在儿童口腔预防医学领域问答的准确性比较.pdf