Intelligent head and neck CT angiography report quality detection using large language models
10.3760/cma.j.cn112149-20241209-00722
- VernacularTitle:大语言模型智能化检测头颈部CT血管成像影像报告质量的对比研究
- Author:
Liping TIAN
1
;
Xiaolu FEI
;
Dan SONG
;
Yao LU
;
Jie LU
Author Information
1. Department of Radiology and Nuclear Medicine, Xuanwu Hospital, Capital Medical University; Beijing Key Laboratory of Magnetic Resonance Imaging and Brain Informatics, Beijing 100053, China
- Publication Type:Journal Article
- Keywords:
Tomography, X-ray computed;
Large language models;
Imaging reports;
Quality control;
Natural language processing
- From:
Chinese Journal of Radiology
2025;59(10):1118-1125
- Country:China
- Language:Chinese
-
Abstract:
Objective:To detect common errors in head and neck CT angiography (CTA) imaging reports using four large language models (LLMs), namely GPT-4, DeepSeek, ERNIE Bot and SparkDesk, and to assess the feasibility of using existing LLMs to support quality control of radiology reports in Chinese.Methods:This was a cross-sectional study. A total of 1 000 head and neck CTA imaging reports were randomly selected by simple random sampling from Xuanwu Hospital, Capital Medical University in 2023, including 500 primary reports and 500 finalized reports. Two radiologists collaboratively identified six types of errors in the reports: description errors, writing errors, left-right confusion errors, diagnostic omissions, logical sequence errors, and other errors. The overall quality of each report was assessed on a 5-point Likert scale. Subsequently, GPT-4, DeepSeek, ERNIE Bot and SparkDesk were employed to detect the same six types of errors in the imaging reports and to provide an overall score. The results of manual review were taken as the gold standard for calculating F1 scores to evaluate model performance. Intra-class correlation coefficients (ICC) were used to assess the consistency between the manual scores and the overall scores from the four LLMs.Results:In the primary imaging reports, the proportions of manually detected errors were as follows: description errors 2.6% (13/500), writing errors 0.6% (3/500), left-right confusion errors 0, diagnostic omissions 6.4% (32/500), logical sequence errors 5.2% (26/500), and other errors 0. In the finalized imaging reports, the proportions of errors across the six categories were 0.2% (1/500), 0, 0, 0, 0, and 0.2% (1/500), respectively. For error detection in the primary imaging reports, the F1 scores of GPT-4 for the six error types were 0.992, 0.997, 0.997, 0.967, 0.980, and 0.992, respectively.
DeepSeek achieved F1 scores of 0.980, 0.955, 0.981, 0.920, 0.995, and 0.960; ERNIE Bot scored 0.982, 0.990, 1.000, 0.956, 0.976, and 0.999; and SparkDesk achieved 0.985, 0.995, 1.000, 0.961, 0.982, and 1.000. In the detection of errors in finalized imaging reports, GPT-4's F1 scores were 0.994, 0.995, 0.998, 0.973, 0.989, and 0.993; DeepSeek scored 0.968, 0.965, 0.985, 0.971, 0.991, and 0.983; ERNIE Bot achieved 0.996, 0.992, 1.000, 0.983, 0.999, and 0.997; and SparkDesk achieved 0.999, 0.999, 1.000, 1.000, 1.000, and 0.999. The overall scores of GPT-4, DeepSeek, and SparkDesk showed moderate consistency with human ratings, with ICC values of 0.514, 0.560, and 0.515, respectively (all P<0.001); in contrast, the overall score of ERNIE Bot showed poor consistency with human ratings, with an ICC of 0.221 (P<0.001).Conclusion:LLMs demonstrate high accuracy in detecting errors in head and neck CTA imaging reports, and their overall scoring of report quality shows moderate consistency with manual assessment, indicating that LLM-based automated quality control of radiology reports is feasible to a certain extent.
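The evaluation described in the abstract (per-error-type F1 against the radiologists' gold standard, and ICC between human and model overall scores) can be sketched as below. This is a minimal illustration, not the study's actual code: the function names and toy data are assumptions, and ICC(2,1) (two-way random effects, absolute agreement, single rater) is assumed since the abstract does not specify which ICC form was used.

```python
def f1_score(gold, pred):
    """F1 for binary error labels (1 = error present), per error type.
    Returns 0.0 when there are no true positives (sketch convention)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.
    ratings: one row per report, one column per rater
    (e.g. [human_score, model_score])."""
    n = len(ratings)              # number of reports
    k = len(ratings[0])           # number of raters
    grand = sum(sum(r) for r in ratings) / (n * k)
    row_means = [sum(r) / k for r in ratings]
    col_means = [sum(r[j] for r in ratings) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for r in ratings for x in r)
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)                   # between-reports mean square
    msc = ss_cols / (k - 1)                   # between-raters mean square
    mse = ss_err / ((n - 1) * (k - 1))        # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
```

Given per-report binary labels for each of the six error types, `f1_score` is computed once per type per model; `icc_2_1` is applied to paired human/model 5-point overall scores, with ICC roughly 0.5-0.75 conventionally read as moderate consistency.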