1. Diagnostic performance of multimodal large language models in radiological quiz cases: the effects of prompt engineering and input conditions
Taewon HAN ; Woo Kyoung JEONG ; Jaeseung SHIN
Ultrasonography 2025;44(3):220-231
		                        		
		                        			 Purpose:
		                        			This study aimed to evaluate the diagnostic accuracy of three multimodal large language models (LLMs) in radiological image interpretation and to assess the impact of prompt engineering strategies and input conditions. 
		                        		
		                        			Methods:
		                        			This study analyzed 67 radiological quiz cases from the Korean Society of Ultrasound in Medicine. Three multimodal LLMs (Claude 3.5 Sonnet, GPT-4o, and Gemini-1.5-Pro-002) were evaluated using six types of prompts (basic [without system prompt], original [specific instructions], chain-of-thought, reflection, multiagent, and artificial intelligence [AI]–generated). Performance was assessed across various factors, including tumor versus non-tumor status, case rarity, difficulty, and knowledge cutoff dates. A subgroup analysis compared diagnostic accuracy between imaging-only inputs and combined imaging-descriptive text inputs. 
		                        		
		                        			Results:
		                        			With imaging-only inputs, Claude 3.5 Sonnet achieved the highest overall accuracy (46.3%, 186/402), followed by GPT-4o (43.5%, 175/402) and Gemini-1.5-Pro-002 (39.8%, 160/402). AI-generated prompts yielded superior combined accuracy across all three models, with significant improvements over the basic (7.96%, P=0.009), chain-of-thought (6.47%, P=0.029), and multiagent prompts (5.97%, P=0.043). The integration of descriptive text significantly enhanced diagnostic accuracy for Claude 3.5 Sonnet (46.3% to 66.2%, P<0.001), GPT-4o (43.5% to 57.5%, P<0.001), and Gemini-1.5-Pro-002 (39.8% to 60.4%, P<0.001). Model performance was significantly influenced by case rarity for GPT-4o (rare: 6.7% vs. nonrare: 53.9%, P=0.001) and by knowledge cutoff dates for Claude 3.5 Sonnet (post-cutoff: 23.5% vs. pre-cutoff: 64.0%, P=0.005). 
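
The reported accuracies can be checked from the counts given above. The sketch below is a minimal arithmetic check, not the authors' analysis code. It assumes the denominator of 402 is 67 cases × 6 prompt types, which the abstract does not state outright. It also assumes that the prompt-level gains (7.96, 6.47, and 5.97 points) correspond to 16, 13, and 12 additional correct answers out of 201 (67 cases × 3 models); those counts are an inference from the percentages and are not reported in the abstract.

```python
# Counts reported in the abstract for imaging-only inputs.
CASES = 67
PROMPT_TYPES = 6
MODELS = 3

# Assumption: each model answered every case once per prompt type.
TRIALS_PER_MODEL = CASES * PROMPT_TYPES   # 402, matches the stated denominator
TRIALS_PER_PROMPT = CASES * MODELS        # 201, inferred, not stated

correct = {
    "Claude 3.5 Sonnet": 186,
    "GPT-4o": 175,
    "Gemini-1.5-Pro-002": 160,
}

def pct(hits: int, total: int, digits: int = 1) -> float:
    """Return hits/total as a percentage rounded to `digits` places."""
    return round(100 * hits / total, digits)

accuracy = {model: pct(hits, TRIALS_PER_MODEL) for model, hits in correct.items()}

# Hypothetical extra-correct counts that would reproduce the reported
# gains of the AI-generated prompt over three other prompt types.
inferred_extra_correct = {"basic": 16, "chain-of-thought": 13, "multiagent": 12}
gain_points = {
    prompt: pct(extra, TRIALS_PER_PROMPT, digits=2)
    for prompt, extra in inferred_extra_correct.items()
}

if __name__ == "__main__":
    for model, value in accuracy.items():
        print(f"{model}: {value}%")
    for prompt, value in gain_points.items():
        print(f"AI-generated vs {prompt}: +{value} points")
```

Each gain is a difference in percentage points between two prompt types, pooled over the three models, rather than a relative improvement.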
		                        		
		                        			Conclusion:
		                        			Claude 3.5 Sonnet achieved the highest diagnostic accuracy in radiological quiz cases, followed by GPT-4o and Gemini-1.5-Pro-002. The use of AI-generated prompts and the integration of descriptive text inputs enhanced model performance. 
		                        		
		                        		
		                        		
		                        	
6. Comparison of micro-flow imaging and contrast-enhanced ultrasonography in assessing segmental congestion after right living donor liver transplantation
Taewon HAN ; Woo Kyoung JEONG ; Jaeseung SHIN ; Dong Ik CHA ; Kyowon GU ; Jinsoo RHU ; Jong Man KIM ; Gyu-Seong CHOI
Ultrasonography 2024;43(6):469-477
		                        		
		                        			 Purpose:
		                        			This study aimed to determine whether micro-flow imaging (MFI) offers diagnostic performance comparable to that of contrast-enhanced ultrasonography (CEUS) in detecting segmental congestion among patients undergoing living donor liver transplantation (LDLT). 
		                        		
		                        			Methods:
		                        			Data from 63 patients who underwent LDLT between May and December 2022 were retrospectively analyzed. MFI and CEUS data collected on the first postoperative day were quantified. Segmental congestion was assessed based on imaging findings and laboratory data, including liver enzymes and total bilirubin levels. The reference standard was a postoperative contrast-enhanced computed tomography scan performed within 2 weeks of surgery. Additionally, a subgroup analysis examined patients who underwent reconstruction of the middle hepatic vein territory. 
		                        		
		                        			Results:
		                        			The sensitivity and specificity of MFI were 73.9% and 67.5%, respectively. In comparison, CEUS demonstrated a sensitivity of 78.3% and a specificity of 75.0%. These findings suggest comparable diagnostic performance, with no significant differences in sensitivity (P=0.655) or specificity (P=0.257) between the two modalities. Additionally, early postoperative laboratory values did not show significant differences between patients with and without congestion. The subgroup analysis also indicated similar diagnostic performance between MFI and CEUS. 
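
For readers who want the definitions behind these figures, the sketch below computes sensitivity and specificity from a 2×2 table. The abstract does not report the underlying counts. The numbers used here are hypothetical: they are one set of counts (23 patients with congestion and 40 without, totaling 63) that happens to reproduce the reported percentages, and they should not be read as the study's data.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of congested cases that the test flags as congested."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of non-congested cases that the test reports as negative."""
    return true_neg / (true_neg + false_pos)

# Hypothetical counts, chosen only because they match the reported
# percentages; the abstract itself gives no 2x2 table.
hypothetical_counts = {
    "MFI":  {"tp": 17, "fn": 6, "tn": 27, "fp": 13},
    "CEUS": {"tp": 18, "fn": 5, "tn": 30, "fp": 10},
}

summary = {
    name: {
        "sensitivity_pct": round(100 * sensitivity(c["tp"], c["fn"]), 1),
        "specificity_pct": round(100 * specificity(c["tn"], c["fp"]), 1),
    }
    for name, c in hypothetical_counts.items()
}

if __name__ == "__main__":
    for name, values in summary.items():
        print(name, values)
```

Under these assumed counts the two modalities differ by one true positive and three true negatives, which fits the non-significant P values reported.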
		                        		
		                        			Conclusion:
		                        			MFI without contrast enhancement yielded results comparable to those of CEUS in detecting segmental congestion after LDLT. Therefore, MFI may be considered a viable alternative to CEUS. 
		                        		
		                        		
		                        		
		                        	