1. Analyzing Large Language Models’ Responses to Common Lumbar Spine Fusion Surgery Questions: A Comparison Between ChatGPT and Bard
Siegmund Philipp LANG ; Ezra Tilahun YOSEPH ; Aneysis D. GONZALEZ-SUAREZ ; Robert KIM ; Parastou FATEMI ; Katherine WAGNER ; Nicolai MALDANER ; Martin N. STIENEN ; Corinna Clio ZYGOURAKIS
Neurospine 2024;21(2):633-641
Objective:
In the digital age, patients turn to online sources for lumbar spine fusion information, necessitating a careful study of large language models (LLMs) like Chat Generative Pre-trained Transformer (ChatGPT) for patient education.
Methods:
Our study aims to assess the response quality of OpenAI’s ChatGPT 3.5 and Google’s Bard to patient questions on lumbar spine fusion surgery. We identified 10 critical questions from 158 frequently asked questions found via Google search and presented them to both chatbots. Five blinded spine surgeons rated the responses on a 4-point scale from ‘unsatisfactory’ to ‘excellent.’ The clarity and professionalism of the answers were also evaluated using a 5-point Likert scale.
Results:
In our evaluation of 10 questions across ChatGPT 3.5 and Bard, 97% of responses were rated as excellent or satisfactory. Specifically, ChatGPT had 62% excellent and 32% minimally clarifying responses, with only 6% needing moderate or substantial clarification. Bard’s responses were 66% excellent and 24% minimally clarifying, with 10% requiring more clarification. No significant difference was found in the overall rating distribution between the 2 models. Both struggled with 3 specific questions regarding surgical risks, success rates, and selection of surgical approaches (Q3, Q4, and Q5). Interrater reliability was low for both models (ChatGPT: κ = 0.041, p = 0.622; Bard: κ = −0.040, p = 0.601). While both scored well on understanding and empathy, Bard received marginally lower ratings in empathy and professionalism.
Conclusion:
ChatGPT 3.5 and Bard effectively answered lumbar spine fusion FAQs, but further training and research are needed to solidify LLMs’ role in medical education and healthcare communication.
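For readers unfamiliar with the interrater reliability statistic cited in the results above, the sketch below shows how agreement among multiple raters can be quantified. The abstract does not state which kappa variant was used; Fleiss’ kappa is a common choice when more than two raters score each item, so this minimal Python example assumes it, and the ratings matrix is invented purely for illustration (it does not reproduce the study’s data).

```python
import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    """Fleiss' kappa for a (n_items x n_categories) matrix of rating counts.

    counts[i, j] = number of raters who assigned item i to category j.
    Every row must sum to the same number of raters.
    """
    counts = np.asarray(counts, dtype=float)
    n_items, _ = counts.shape
    n_raters = counts.sum(axis=1)[0]

    # Overall proportion of assignments falling in each category
    p_j = counts.sum(axis=0) / (n_items * n_raters)
    # Observed pairwise agreement for each item
    p_i = (np.sum(counts ** 2, axis=1) - n_raters) / (n_raters * (n_raters - 1))

    p_bar = p_i.mean()       # mean observed agreement
    p_e = np.sum(p_j ** 2)   # agreement expected by chance
    return (p_bar - p_e) / (1 - p_e)

# Hypothetical data: 10 questions, 5 raters, 4-point scale
# ('unsatisfactory' ... 'excellent'); counts are made up.
ratings = np.array([
    [0, 0, 2, 3],
    [0, 1, 1, 3],
    [0, 0, 1, 4],
    [1, 2, 1, 1],
    [0, 3, 1, 1],
    [1, 1, 2, 1],
    [0, 0, 3, 2],
    [0, 1, 0, 4],
    [0, 0, 2, 3],
    [0, 2, 2, 1],
])
print(f"Fleiss' kappa: {fleiss_kappa(ratings):.3f}")
```

A kappa near zero, as reported for both chatbots, indicates that rater agreement was barely better than chance even though the average ratings were high.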
2. The Ever-Evolving Regulatory Landscape Concerning Development and Clinical Application of Machine Intelligence: Practical Consequences for Spine Artificial Intelligence Research
Massimo BOTTINI ; Seung-Jun RYU ; Adrian Elmi TERANDER ; Stefanos VOGLIS ; Nicolai MALDANER ; David BELLUT ; Luca REGLI ; Carlo SERRA ; Victor E. STAARTJES
Neurospine 2025;22(1):134-143
This paper analyzes the regulatory frameworks for artificial intelligence/machine learning (AI/ML)-enabled medical devices in the European Union (EU), the United States (US), and the Republic of Korea, with a focus on applications in spine surgery. The aim is to provide guidance for developers and researchers navigating regulatory pathways. A review of current literature, regulatory documents, and legislative frameworks was conducted. Key differences in regulatory bodies, risk classification, submission requirements, and approval pathways for AI/ML medical devices were examined in the EU, US, and Republic of Korea. The EU AI Act (2024) establishes a risk-based framework, requiring regulatory review proportionate to device risk, with high-risk devices subject to stricter oversight. The US applies a more flexible approach, allowing multiple submission pathways and incorporating a focus on continuous learning. The Republic of Korea emphasizes streamlined approval pathways, with growing use of real-world data to support validation. Developers must ensure regulatory alignment early in the development process, focusing on key aspects like dataset quality, transparency, and continuous monitoring. Across all regions, technical documentation, quality management systems, and bias mitigation are essential for approval. Developers are encouraged to adopt adaptable strategies to comply with evolving regulatory standards, ensuring models remain transparent, fair, and reliable. The EU’s comprehensive AI Act enforces stricter oversight, while the US and Korea offer more flexible pathways. Developers of spine surgery AI/ML devices must tailor development strategies to align with regional regulations, emphasizing transparent development, quality assurance, and postmarket monitoring to ensure approval success.