Comparison on Qwen2.5 and GPT-4o models for generating structured thyroid ultrasound reports

Saimei QIN; Qiong WEN; Yilian DUAN; Feixiang XIANG


VernacularTitle:对比通义千问2.5与GPT-4o模型生成的甲状腺超声结构化报告
Author: Saimei QIN ¹ ; Qiong WEN ¹ ; Yilian DUAN ¹ ; Feixiang XIANG ¹
Author Information

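1. Department of Ultrasound Medicine, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology; Hubei Province Clinical Research Center for Medical Imaging; Hubei Province Key Laboratory of Molecular Imaging, Wuhan 430022, Hubei, China
Publication Type: Journal Article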
Keywords: thyroid diseases; ultrasonography; large language model; structured reports
From: Chinese Journal of Medical Imaging Technology 2025;41(3):409-413
Country: China
Language: Chinese
Abstract:
Objective: To compare the performance of Qwen2.5 (model A) and GPT-4o (model B) in converting free-text thyroid ultrasound reports into structured reports.
Methods: Preoperative thyroid ultrasound data of 100 patients who subsequently underwent thyroidectomy (236 thyroid nodules) were retrospectively collected. An attending ultrasound physician wrote free-text reports in accordance with the American College of Radiology Thyroid Imaging Reporting and Data System (ACR TI-RADS) guidelines, and each report was input into both model A and model B 3 times to generate structured thyroid ultrasound reports. The quality of the structured reports output by the 2 models was compared in terms of structured writing, ACR TI-RADS categories and management recommendations, and the consistency of the 3 outputs from each model was analyzed.
Results: Among the 300 structured reports output by each model, the overall satisfaction rate for structured writing was 94.00% (282/300) for model A and 94.67% (284/300) for model B, with no significant difference (χ²=0.045, P=0.832). Of the 236 thyroid nodules, 36, 47, 33, 39 and 81 were in ACR TI-RADS categories 1, 2, 3, 4 and 5, respectively. Across the 3 runs, the overall accuracy of nodule categorization was 88.28% (625/708) for model A and 89.27% (632/708) for model B, with no significant difference (χ²=0.582, P=0.505). Both models showed moderate consistency across the 3 runs for structured writing (ICC=0.531, 0.673) and nodule categories (ICC=0.714, 0.747). The coincidence rate of management recommendations for thyroid nodules was 74.86% (530/708) for model A and 67.51% (478/708) for model B, the former being higher than the latter (χ²=4.567, P=0.033), and the consistency of management recommendations across the 3 runs was good for both models (ICC=0.836, 0.769).
Conclusion: Qwen2.5 and GPT-4o performed comparably in structured writing and ACR TI-RADS categorization when converting free-text thyroid ultrasound reports into structured reports, while Qwen2.5 was more effective in providing management recommendations.
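
The rates quoted in the abstract can be checked against the reported counts. The following is a minimal sketch that only recomputes the percentages and the denominators (100 patients × 3 runs, 236 nodules × 3 runs). The dictionary layout and function names are illustrative assumptions, not part of the study, and the sketch does not attempt to reproduce the χ² or ICC statistics, whose exact computation the abstract does not describe.

```python
# Counts exactly as reported in the abstract: (satisfactory or correct, total).
# The structure and key names below are illustrative assumptions.
REPORTED_COUNTS = {
    "structured_writing": {"Qwen2.5": (282, 300), "GPT-4o": (284, 300)},
    "ti_rads_category": {"Qwen2.5": (625, 708), "GPT-4o": (632, 708)},
    "management_recommendation": {"Qwen2.5": (530, 708), "GPT-4o": (478, 708)},
}

PATIENTS = 100   # patients, one free-text report each
NODULES = 236    # thyroid nodules across all patients
RUNS = 3         # each report was submitted to each model 3 times


def rate_percent(hits: int, total: int) -> float:
    """Return hits/total as a percentage rounded to two decimals."""
    return round(100.0 * hits / total, 2)


def summarize(counts=REPORTED_COUNTS) -> dict:
    """Map each metric and model to its percentage rate."""
    return {
        metric: {model: rate_percent(hits, total)
                 for model, (hits, total) in by_model.items()}
        for metric, by_model in counts.items()
    }


if __name__ == "__main__":
    for metric, by_model in summarize().items():
        for model, pct in by_model.items():
            print(f"{metric:28s} {model:8s} {pct:6.2f}%")
```

The denominators follow from the study design: 300 reports per model is 100 × 3, and 708 nodule-level judgements per model is 236 × 3.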