Interpretable machine learning-based models in predicting prognoses in stroke patients
10.3760/cma.j.cn115354-20240613-00344
- VernacularTitle:可解释的机器学习模型预测缺血性脑卒中患者预后研究
- Author: Xinhong LI¹; Hui MAI; Tieyi FU; Jianya CHEN
- Author Information: 1. Department of Neurology, Zhanjiang Central Hospital, Guangdong Medical University, Zhanjiang 524000
- Keywords:
Acute ischemic stroke;
Prognosis;
Machine learning model;
Extreme gradient boosting model;
Shapley Additive exPlanation
- From:
Chinese Journal of Neuromedicine
2024;23(8):817-827
- Country: China
- Language:Chinese
- Abstract:
Objective: To explore the value of interpretable machine learning models in predicting the prognoses of patients with acute ischemic stroke.
Methods: A total of 296 patients with acute ischemic stroke who received intravenous thrombolysis at Zhanjiang Central Hospital, Guangdong Medical University, from March 2020 to October 2023 were selected. Prognosis was assessed at the 3-month follow-up using the modified Rankin Scale (scores of 0-2: good prognosis; scores of 3-6: poor prognosis). Clinical data were collected and analyzed retrospectively, and independent influencing factors for prognosis were identified by multivariate logistic regression. The patients were randomly divided into a training dataset (n=178) and a test dataset (n=118) in a 3:2 ratio. The independent influencing factors were used as feature variables to train 10 machine learning models: logistic regression, random forest, support vector machine, naive Bayes, linear discriminant analysis, mixture discriminant analysis, flexible discriminant analysis, gradient boosting machine, extreme gradient boosting (XGBoost), and category boosting. The prediction performance of these 10 models was evaluated using the calibration curve, precision-recall curve, precision-recall gain curve, and receiver operating characteristic (ROC) curve. Shapley Additive exPlanation (SHAP) was used to add interpretation and visualization to the machine learning models, covering both global and local interpretation.
Results: Of the 296 patients, 72 had a poor prognosis. Age (OR=1.039, 95% CI: 1.008-1.072, P=0.015), National Institutes of Health Stroke Scale score (OR=1.213, 95% CI: 1.000-1.337, P<0.001), Glasgow Coma Scale score (OR=0.470, 95% CI: 0.289-0.765, P=0.002), Stroke Prognostic Instrument-Ⅱ score (OR=1.257, 95% CI: 1.043-1.516, P=0.016), C-reactive protein (OR=1.709, 95% CI: 1.398-2.087, P<0.001), and platelet count (OR=0.988, 95% CI: 0.978-0.998, P=0.016) were independent influencing factors for prognosis. Among the 10 machine learning algorithms, the calibration curve (C-index: 0.896), precision-recall curve (area under the curve [AUC]: 0.791), precision-recall gain curve (AUC: 0.363), and ROC curve (AUC: 0.856) in both the training and test sets confirmed that the XGBoost model had the highest performance in predicting prognosis. The SHAP visualization diagram indicated that the order of feature importance was C-reactive protein, National Institutes of Health Stroke Scale, platelet count, Glasgow Coma Scale, Stroke Prognostic Instrument-Ⅱ, and age. The SHAP scatter plot visualized the direction of contribution of these 6 feature variables, with a bimodal distribution. The SHAP dependence plot showed the relationship between the values of the 6 feature variables and their SHAP values, with C-reactive protein showing the most pronounced trend. The SHAP plots also provided local interpretation for individual samples, making the XGBoost model more transparent and interpretable.
Conclusion: The XGBoost model incorporating age, National Institutes of Health Stroke Scale, Glasgow Coma Scale, Stroke Prognostic Instrument-Ⅱ, C-reactive protein, and platelet count can differentiate poor prognosis from good prognosis in patients with acute ischemic stroke with high accuracy. On this basis, model interpretation and visualization with SHAP help clarify the contribution and direction of each feature variable to the prediction results.
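The local interpretation described above rests on Shapley values, which split one patient's predicted risk into per-feature contributions. The following is a minimal sketch of that idea, not the authors' model or pipeline: the logistic model, its coefficients, the patient values, and the baseline values are all hypothetical, and features outside a coalition are set to a single reference point, whereas SHAP implementations typically average over a background dataset.

```python
from itertools import combinations
from math import exp, factorial

# Hypothetical toy model for illustration only; these are NOT fitted
# coefficients from the study.
FEATURES = ["crp", "nihss", "platelets"]
WEIGHTS = {"crp": 0.5, "nihss": 0.2, "platelets": -0.01}
INTERCEPT = -3.0


def predict(x):
    """Probability of poor prognosis under the toy logistic model."""
    z = INTERCEPT + sum(WEIGHTS[f] * x[f] for f in FEATURES)
    return 1.0 / (1.0 + exp(-z))


def shapley_values(x, baseline):
    """Exact Shapley values of predict() for instance x.

    A coalition's value is the prediction when its features take the
    patient's values and all other features take the baseline's values.
    """
    n = len(FEATURES)

    def value(coalition):
        mixed = {f: (x[f] if f in coalition else baseline[f]) for f in FEATURES}
        return predict(mixed)

    phi = {}
    for f in FEATURES:
        others = [g for g in FEATURES if g != f]
        total = 0.0
        for k in range(len(others) + 1):
            for subset in combinations(others, k):
                # Shapley weight for a coalition of size k out of n features.
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                with_f = value(set(subset) | {f})
                without_f = value(set(subset))
                total += weight * (with_f - without_f)
        phi[f] = total
    return phi


# Hypothetical patient and reference values (assumed units, not from the study).
patient = {"crp": 8.0, "nihss": 12.0, "platelets": 180.0}
baseline = {"crp": 3.0, "nihss": 6.0, "platelets": 230.0}

phi = shapley_values(patient, baseline)
for name, contribution in sorted(phi.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name}: {contribution:+.3f}")
```

The contributions sum to the difference between the patient's prediction and the baseline prediction, which is the additivity property that lets a SHAP plot read as a decomposition of a single prediction. Exact enumeration grows exponentially with the number of features, so it is practical only for small feature sets such as the six used here.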