Construction and validation of machine learning predictive models for the risk of metabolic associated fatty liver disease
- VernacularTitle:代谢相关脂肪性肝病发病风险的机器学习预测模型的构建及验证
- Author: Linjie QIU 1; Haiyan REN 1; Yan REN 1; Meijie LI 1; Chacha ZOU 1; Zijing WU 1; Jin ZHANG 1
- Publication Type:Journal Article
- Keywords: Metabolic Associated Fatty Liver Disease; Machine Learning; Models, Statistical
- From: Journal of Clinical Hepatology 2026;42(4):848-855
- Country: China
- Language:Chinese
- Abstract: Objective: To investigate the value of machine learning-based predictive models in predicting the risk of metabolic associated fatty liver disease (MAFLD), and to identify its key risk factors. Methods: A retrospective analysis was performed on 50 variables, covering body composition, past history, and laboratory tests, from 2,168 healthy individuals who underwent physical examination in the Department of Health Assessment, Xiyuan Hospital, China Academy of Chinese Medical Sciences, from January 2021 to December 2024. According to whether they were diagnosed with MAFLD, the individuals were divided into a MAFLD group (n=265) and a non-MAFLD group (n=1,903). The Mann-Whitney U test was used to compare continuous data between the two groups, and the chi-square test was used to compare categorical data. The data were randomly split into a training set and a validation set at a ratio of 70% to 30%. Predictive factors were screened from the training set using univariate analysis, LASSO regression, and multivariate logistic regression analysis. Predictive models were then constructed using seven machine learning methods: logistic regression, decision tree, random forest (RF), eXtreme gradient boosting, light gradient boosting machine, support vector machine, and artificial neural network. Model performance was evaluated by plotting receiver operating characteristic curves for the validation set and calculating the area under the curve (AUC), sensitivity, specificity, and Youden index for each model. Furthermore, the SHapley Additive exPlanations (SHAP) method was used to analyze the contribution of each variable in the optimal model. Results: The prevalence of MAFLD among the 2,168 subjects was 12.22% (265/2,168). Smoking, diastolic blood pressure, phase angle, visceral fat area, muscle fat ratio, waist-to-hip ratio, aspartate aminotransferase, non-HDL-C/HDL-C ratio, triglyceride-glucose index, and gallstones were independent risk factors for MAFLD (all P<0.05). In the validation set, the support vector machine, eXtreme gradient boosting, decision tree, light gradient boosting machine, artificial neural network, RF, and logistic regression models had AUCs of 0.738, 0.754, 0.757, 0.786, 0.795, 0.796, and 0.815, respectively, among which the RF model had the best discriminatory ability (AUC=0.796, 95% confidence interval: 0.754-0.839), with a sensitivity of 81.01%, a specificity of 63.16%, and a Youden index of 44.17%. The SHAP analysis showed that visceral fat area, waist-to-hip ratio, and diastolic blood pressure were the top three predictive factors in terms of importance. Conclusion: The RF model, constructed on the basis of body composition and clinical indicators, performs well in predicting the risk of MAFLD, and its interpretability can help identify high-risk individuals at an early stage in clinical practice.
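
As an arithmetic check on two figures reported in the abstract, the minimal Python sketch below recomputes the Youden index of the RF model (sensitivity + specificity - 1) and the MAFLD prevalence from the group sizes. The function names `youden_index` and `prevalence` are illustrative and do not come from the study; only the input numbers are taken from the abstract.

```python
# Illustrative helpers (hypothetical names); input values are copied from the abstract.

def youden_index(sensitivity: float, specificity: float) -> float:
    """Youden's J statistic: sensitivity + specificity - 1 (inputs as fractions)."""
    return sensitivity + specificity - 1.0


def prevalence(cases: int, total: int) -> float:
    """Proportion of diagnosed cases among all subjects."""
    return cases / total


# RF model on the validation set: sensitivity 81.01%, specificity 63.16%.
rf_j = youden_index(0.8101, 0.6316)

# 265 MAFLD cases among 2,168 subjects.
mafld_prevalence = prevalence(265, 2168)

print(f"Youden index (RF): {rf_j:.2%}")
print(f"MAFLD prevalence:  {mafld_prevalence:.2%}")
```

Both results agree with the reported values of 44.17% and 12.22% after rounding to two decimal places.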
