1. Diagnostic performance of multimodal large language models in radiological quiz cases: the effects of prompt engineering and input conditions
Taewon HAN ; Woo Kyoung JEONG ; Jaeseung SHIN
Ultrasonography 2025;44(3):220-231
Purpose:
This study aimed to evaluate the diagnostic accuracy of three multimodal large language models (LLMs) in radiological image interpretation and to assess the impact of prompt engineering strategies and input conditions.
Methods:
This study analyzed 67 radiological quiz cases from the Korean Society of Ultrasound in Medicine. Three multimodal LLMs (Claude 3.5 Sonnet, GPT-4o, and Gemini-1.5-Pro-002) were evaluated using six types of prompts (basic [without system prompt], original [specific instructions], chain-of-thought, reflection, multiagent, and artificial intelligence [AI]–generated). Performance was assessed across various factors, including tumor versus non-tumor status, case rarity, difficulty, and knowledge cutoff dates. A subgroup analysis compared diagnostic accuracy between imaging-only inputs and combined imaging-descriptive text inputs.
Results:
With imaging-only inputs, Claude 3.5 Sonnet achieved the highest overall accuracy (46.3%, 186/402), followed by GPT-4o (43.5%, 175/402) and Gemini-1.5-Pro-002 (39.8%, 160/402). AI-generated prompts yielded superior combined accuracy across all three models, with significant improvements over the basic (7.96%, P=0.009), chain-of-thought (6.47%, P=0.029), and multiagent prompts (5.97%, P=0.043). The integration of descriptive text significantly enhanced diagnostic accuracy for Claude 3.5 Sonnet (46.3% to 66.2%, P<0.001), GPT-4o (43.5% to 57.5%, P<0.001), and Gemini-1.5-Pro-002 (39.8% to 60.4%, P<0.001). Model performance was significantly influenced by case rarity for GPT-4o (rare: 6.7% vs. nonrare: 53.9%, P=0.001) and by knowledge cutoff dates for Claude 3.5 Sonnet (post-cutoff: 23.5% vs. pre-cutoff: 64.0%, P=0.005).
Conclusion:
Claude 3.5 Sonnet achieved the highest diagnostic accuracy in radiological quiz cases, followed by GPT-4o and Gemini-1.5-Pro-002. The use of AI-generated prompts and the integration of descriptive text inputs enhanced model performance.
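The imaging-only accuracy figures in this abstract can be checked with a few lines of arithmetic. The sketch below assumes the denominator of 402 is the 67 cases multiplied by the six prompt types (one response per case per prompt for each model). The abstract does not state this explicitly, although the numbers are consistent with it. The variable and function names are illustrative only.

```python
# Counts as reported in the abstract (imaging-only inputs).
N_CASES = 67
N_PROMPT_TYPES = 6

# Assumption: each model answered every case once per prompt type,
# which would give the 402 responses used as the denominator.
N_RESPONSES = N_CASES * N_PROMPT_TYPES

# Correct responses per model, as reported.
correct = {
    "Claude 3.5 Sonnet": 186,
    "GPT-4o": 175,
    "Gemini-1.5-Pro-002": 160,
}


def accuracy_pct(n_correct: int, n_total: int) -> float:
    """Return accuracy as a percentage rounded to one decimal place."""
    return round(100 * n_correct / n_total, 1)


accuracies = {model: accuracy_pct(n, N_RESPONSES) for model, n in correct.items()}

for model, acc in accuracies.items():
    print(f"{model}: {acc}% ({correct[model]}/{N_RESPONSES})")
```

Under that assumption the computed values reproduce the reported 46.3%, 43.5%, and 39.8%.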
2. Early Administration of Nelonemdaz May Improve the Stroke Outcomes in Patients With Acute Stroke
Jin Soo LEE ; Ji Sung LEE ; Seong Hwan AHN ; Hyun Goo KANG ; Tae-Jin SONG ; Dong-Ick SHIN ; Hee-Joon BAE ; Chang Hun KIM ; Sung Hyuk HEO ; Jae-Kwan CHA ; Yeong Bae LEE ; Eung Gyu KIM ; Man Seok PARK ; Hee-Kwon PARK ; Jinkwon KIM ; Sungwook YU ; Heejung MO ; Sung Il SOHN ; Jee Hyun KWON ; Jae Guk KIM ; Young Seo KIM ; Jay Chol CHOI ; Yang-Ha HWANG ; Keun Hwa JUNG ; Soo-Kyoung KIM ; Woo Keun SEO ; Jung Hwa SEO ; Joonsang YOO ; Jun Young CHANG ; Mooseok PARK ; Kyu Sun YUM ; Chun San AN ; Byoung Joo GWAG ; Dennis W. CHOI ; Ji Man HONG ; Sun U. KWON
Journal of Stroke 2025;27(2):279-283
5. Study on the Necessity and Methodology for Enhancing Outpatient and Clinical Education in the Department of Radiology
Soo Buem CHO ; Jiwoon SEO ; Young Hwan KIM ; You Me KIM ; Dong Gyu NA ; Jieun ROH ; Kyung-Hyun DO ; Jung Hwan BAEK ; Hye Shin AHN ; Min Woo LEE ; Seunghyun LEE ; Seung Eun JUNG ; Woo Kyoung JEONG ; Hye Doo JEONG ; Bum Sang CHO ; Hwan Jun JAE ; Seon Hyeong CHOI ; Saebeom HUR ; Su Jin HONG ; Sung Il HWANG ; Auh Whan PARK ; Ji-hoon KIM
Journal of the Korean Society of Radiology 2025;86(1):199-200
6. Profiling of Anti-Signal-Recognition Particle Antibodies and Clinical Characteristics in South Korean Patients With Immune-Mediated Necrotizing Myopathy
Soo-Hyun KIM ; Yunjung CHOI ; Eun Kyoung OH ; Ichizo NISHINO ; Shigeaki SUZUKI ; Bum Chun SUH ; Ha Young SHIN ; Seung Woo KIM ; Byeol-A YOON ; Seong-il OH ; Yoo Hwan KIM ; Hyunjin KIM ; Young-Min LIM ; Seol-Hee BAEK ; Je-Young SHIN ; Hung Youl SEOK ; Seung-Ah LEE ; Young-Chul CHOI ; Hyung Jun PARK
Journal of Clinical Neurology 2025;21(1):31-39
Background and Purpose:
This study evaluated the diagnostic utility of an anti-signal-recognition particle 54 (anti-SRP54) antibody-based enzyme-linked immunosorbent assay (ELISA) as well as the clinical, serological, and pathological characteristics of patients with SRP immune-mediated necrotizing myopathy (IMNM).
Methods:
We evaluated 87 patients with idiopathic inflammatory myopathy and 107 healthy participants between January 2002 and December 2023. The sensitivity and specificity of the ELISA for anti-SRP54 antibodies were assessed, and the clinical profiles of patients with anti-SRP54 antibodies were determined.
Results:
The ELISA for anti-SRP54 antibodies had a sensitivity and specificity of 88% and 99%, respectively, along with a test–retest reliability of 0.92 (p<0.001). The 32 patients diagnosed with anti-SRP IMNM using a line-blot immunoassay included 28 (88%) who tested positive for anti-SRP54 antibodies using the ELISA, comprising 12 (43%) males and 16 (57%) females whose median ages at symptom onset and diagnosis were 43.0 years and 43.5 years, respectively. Symptoms included proximal muscle weakness in all 28 (100%) patients, neck weakness in 9 (32%), myalgia in 15 (54%), dysphagia in 5 (18%), dyspnea in 4 (14%), dysarthria in 2 (7%), interstitial lung disease in 2 (7%), and myocarditis in 2 (7%). The median serum creatine kinase (CK) level was 7,261 U/L (interquartile range: 5,086–10,007 U/L), and the median anti-SRP54 antibody level was 2.0 U/mL (interquartile range: 1.0–5.6 U/mL). The serum CK level was significantly higher in patients with coexisting anti-Ro-52 antibodies.
Conclusions:
This study has confirmed the reliability of the ELISA for anti-SRP54 antibodies and provided insights into the clinical, serological, and pathological characteristics of South Korean patients with anti-SRP IMNM.
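The percentages in the Results of this abstract follow from the reported counts. The sketch below assumes the line-blot immunoassay is the reference standard for the 88% sensitivity, which matches the reported 28 of 32 but is not spelled out in the abstract. The specificity is not recomputed because the abstract gives no count for the healthy participants. All names are illustrative.

```python
# Counts as reported in the abstract.
N_LINE_BLOT_POSITIVE = 32  # anti-SRP IMNM diagnosed by line-blot immunoassay
N_ELISA_POSITIVE = 28      # of those, positive for anti-SRP54 on ELISA


def pct(count: int, total: int) -> int:
    """Return a whole-number percentage, rounding halves up."""
    return int(100 * count / total + 0.5)


# Assumption: line-blot result taken as the reference standard.
sensitivity_pct = pct(N_ELISA_POSITIVE, N_LINE_BLOT_POSITIVE)

# Clinical features among the 28 ELISA-positive patients, as reported.
feature_counts = {
    "male": 12,
    "female": 16,
    "proximal muscle weakness": 28,
    "neck weakness": 9,
    "myalgia": 15,
    "dysphagia": 5,
    "dyspnea": 4,
    "dysarthria": 2,
    "interstitial lung disease": 2,
    "myocarditis": 2,
}
feature_pct = {name: pct(n, N_ELISA_POSITIVE) for name, n in feature_counts.items()}

print(f"ELISA sensitivity vs. line-blot: {sensitivity_pct}%")
for name, value in feature_pct.items():
    print(f"{name}: {feature_counts[name]}/{N_ELISA_POSITIVE} ({value}%)")
```

Each computed value agrees with the percentage printed in the abstract.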
