Accuracy of large language models for answering pediatric preventive dentistry questions
GUAN Boyan; XU Minghe; ZHANG Huiqi; MA Shulei; ZHANG Shanshan; ZHAO Junfeng
Journal of Prevention and Treatment for Stomatological Diseases 2025;33(4):313-319
Objective:
To evaluate and compare the accuracy of responses to pediatric preventive dentistry-related questions between the domestic large language model, ChatGLM-6B, and the international large language model, ChatGPT-3.5, in order to provide insights for further research and development of domestic language models in the field of oral medicine.
Methods:
A total of 100 common pediatric preventive dentistry questions of varying difficulty [basic (n = 35), intermediate (n = 35), and advanced (n = 30)] were provided by pediatric preventive dentistry experts. Two doctors independently submitted these questions to ChatGPT-3.5 and ChatGLM-6B and collected the answers. A cohort of 16 dentists assessed the responses generated by ChatGLM-6B and ChatGPT-3.5 using a predefined 3-point Likert scale. The average of the 16 dentists' ratings was taken as the answer score. An answer scoring higher than 2.8 was classified as accurate; one scoring lower than 1.4 was classified as inaccurate; and one scoring between 1.4 and 2.8 was classified as partially accurate. The accuracy rates and evaluation outcomes of the two models were compared, and the consistency of the ratings was analyzed.
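A minimal sketch of the scoring rubric described above, assuming a straightforward mean-and-threshold implementation (the abstract does not publish code, so names and structure here are illustrative only):

```python
# Illustrative only: classify one model answer from the 16 dentists' 3-point Likert ratings,
# using the thresholds stated in the Methods (mean > 2.8 accurate, mean < 1.4 inaccurate).
from statistics import mean

def classify_answer(ratings: list[int]) -> str:
    score = mean(ratings)          # answer score = average of the 16 ratings
    if score > 2.8:
        return "accurate"
    if score < 1.4:
        return "inaccurate"
    return "partially accurate"

# Example: 12 dentists rate an answer 3 and four rate it 2 -> mean 2.75 -> partially accurate
print(classify_answer([3] * 12 + [2] * 4))
```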
Results:
The accuracy of ChatGPT-3.5 and ChatGLM-6B on the 100 pediatric preventive dentistry questions was comparable: ChatGPT-3.5 produced 68% accurate, 30% partially accurate, and 2% inaccurate responses, while ChatGLM-6B produced 67% accurate, 31% partially accurate, and 2% inaccurate responses, with no statistically significant difference (P > 0.05). Both models showed equivalent accuracy across questions of each difficulty level (basic, intermediate, advanced), with no statistically significant differences (P > 0.05). The overall average scores of ChatGPT-3.5 and ChatGLM-6B across all questions were both 2.65, with no statistically significant difference (P > 0.05). By difficulty level, ChatGPT-3.5 averaged 2.66 on basic questions versus 2.70 for ChatGLM-6B, 2.63 versus 2.64 on intermediate questions, and 2.68 versus 2.61 on advanced questions, with no statistically significant differences in any category (P > 0.05). The consistency of the experts' ratings ranged from fair to moderate.
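The abstract does not name the statistical test used for the group comparison; the sketch below assumes a chi-square test on the accurate / partially accurate / inaccurate counts (out of 100 questions per model) simply to illustrate how the reported P > 0.05 result could be reproduced:

```python
# Hypothetical reconstruction of the accuracy-rate comparison (test choice is an assumption).
from scipy.stats import chi2_contingency

counts = [
    [68, 30, 2],  # ChatGPT-3.5: accurate, partially accurate, inaccurate
    [67, 31, 2],  # ChatGLM-6B
]
chi2, p, dof, expected = chi2_contingency(counts)
print(f"chi2 = {chi2:.3f}, p = {p:.3f}")  # p > 0.05, consistent with the reported result
```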
Conclusion:
This study demonstrates the potential of both ChatGLM-6B and ChatGPT-3.5 for answering pediatric preventive dentistry questions. ChatGLM-6B performed similarly to ChatGPT-3.5 in this field, but the accuracy of both models fell short of expectations, and neither is currently suitable for clinical use. Future efforts should focus on improving the accuracy and consistency of large language models in providing medical information and on developing specialized medical models for the field of oral medicine.