Evaluation of the Performance of Advanced Large Language Models in Laboratory Medicine Using Residency Examinations
10.3343/alm.2025.0200
- Author: Kiwook JUNG; Hyun Jin KIM; Sunghwan SHIN; Wookeun LEE; Jun Hyung LEE; Hee Sue PARK; Qute CHOI
- Publication Type:Original Article
- From:Annals of Laboratory Medicine 2026;46(3):327-337
- Country:Republic of Korea
- Language:English
- Abstract:
Background:Recent advancements in large language models (LLMs) have accelerated their integration into clinical domains, including laboratory medicine. The performance of LLMs in answering board-level laboratory medicine questions has not been comprehensively evaluated. Given the importance of diagnostic accuracy in this field, rigorous and objective evaluations of LLM capabilities are essential.
Methods:We assessed 12 LLMs from OpenAI, Anthropic, and Google using 320 Korean Residency Examination questions (2021–2024) spanning six laboratory medicine subspecialties. Standardized prompts were provided via their application programming interfaces under deterministic settings (temperature = 0). Questions were administered thrice to assess response reproducibility. Outputs were compared with validated answers and analyzed for accuracy, reasoning quality, and error typology.
Results:Google’s Gemini 2.0 Pro achieved the highest accuracy (80.0%), followed by OpenAI’s GPT-4.5 (77.2%) and Anthropic’s Claude 3.7 Sonnet (74.1%). Accuracy decreased as question difficulty increased (78.0% for easy vs. 45.1% for challenging questions). Performance varied by subspecialty. All models underperformed on transfusion medicine questions (mean accuracy: 38.8%), primarily because of limitations in domain-specific and regional knowledge representations. Incorrect answers primarily resulted from reasoning errors. Reproducibility exceeded 95% for most models; however, some residual non-determinism appeared even with greedy decoding (temperature = 0).
Conclusions:LLMs demonstrated substantial potential for integration into laboratory medicine, particularly in clinical chemistry and immunology. Performance inconsistencies (particularly for high-difficulty questions) and knowledge gaps (notably for transfusion medicine) highlight the need for further development, potentially including domain-specific fine-tuning and retrieval-augmented generation, and for robust expert oversight before clinical application.
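Illustrative note: the abstract reports per-model accuracy and reproducibility over three administrations of each question but does not state how either figure was computed. The sketch below shows one plausible way to score such a design. It assumes that accuracy is the fraction of answers matching the validated key in a run, and that reproducibility is the fraction of questions receiving the same answer in all three runs. The function names and the toy data are hypothetical and are not taken from the study.

```python
def accuracy(answers, key):
    """Fraction of questions in one run whose answer matches the key."""
    return sum(a == k for a, k in zip(answers, key)) / len(key)


def reproducibility(runs):
    """Fraction of questions answered identically in every run.

    Assumption: this is one possible definition; the abstract does not
    specify the one used by the authors.
    """
    per_question = list(zip(*runs))  # regroup answers by question
    return sum(len(set(q)) == 1 for q in per_question) / len(per_question)


# Toy data (hypothetical): five questions, three repeated runs of one model.
key = ["A", "C", "B", "D", "E"]
runs = [
    ["A", "C", "B", "A", "E"],
    ["A", "C", "B", "A", "E"],
    ["A", "C", "D", "A", "E"],  # question 3 changes despite temperature = 0
]

per_run_accuracy = [accuracy(r, key) for r in runs]
mean_accuracy = sum(per_run_accuracy) / len(per_run_accuracy)
repro = reproducibility(runs)

print(f"per-run accuracy: {per_run_accuracy}")
print(f"mean accuracy:    {mean_accuracy:.3f}")
print(f"reproducibility:  {repro:.3f}")
```

In this toy case the fourth question is answered consistently but incorrectly, which shows why reproducibility and accuracy need to be reported separately: a model can be stable across runs and still wrong.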