Establishment of a Traditional Chinese Medicine Syndrome Diagnostic Model Based on Stacking Ensemble Learning: Take Lung Cancer as an Example

Xiaochuan GUO; Zhenzhen FENG; Wenrui LIU; Jiansheng LI

VernacularTitle: 基于Stacking集成算法的中医证候诊断模型建立
Author: Xiaochuan GUO ¹ ; Zhenzhen FENG ¹ ; Wenrui LIU ¹ ; Jiansheng LI ¹
Author Information

1. Co-construction Collaborative Innovation Center for Chinese Medicine and Respiratory Diseases by Henan and Education Ministry of P.R. China, Henan University of Chinese Medicine, Zhengzhou, 450046
Publication Type: Journal Article
Keywords: diagnostic model of traditional Chinese medicine syndrome; lung cancer; syndrome; machine learning; Stacking ensemble learning
From: Journal of Traditional Chinese Medicine 2024;65(17):1775-1783
Country: China
Language: Chinese
Abstract:
Objective: To explore a method for optimizing the performance of traditional Chinese medicine (TCM) syndrome diagnostic models using Stacking ensemble learning.
Methods: Taking the construction of a TCM syndrome diagnostic model for lung cancer as an example, clinical symptoms and signs from 2598 lung cancer cases in 9 hospitals were used as independent variables (i.e., feature variables) and TCM syndrome information as the dependent variable. The clinical data were divided into a training set and a test set at an 8:2 ratio by the random number table method using Python 3.7. Stable features of the TCM syndromes of lung cancer were screened using the chi-square test, Spearman's correlation test, and Least Absolute Shrinkage and Selection Operator (LASSO) logistic regression analysis. Nine machine learning algorithms were trained to obtain nine basic models: support vector machine (SVM), k-nearest neighbors (KNN), Random Forest (RF), Extremely Randomized Trees (ExtraTrees), Extreme Gradient Boosting (XGBoost), Light Gradient Boosting Machine (LightGBM), Adaptive Boosting (AdaBoost), Gradient Boosting (GB), and multi-layer perceptron (MLP). The four better-performing basic models were selected and fused using Stacking ensemble learning. The fused output was then trained a second time with the same nine algorithms, and the resulting fusion models were evaluated by accuracy, micro-average ROC curves, area under the curve (AUC), and confusion matrices to select the optimal diagnostic model.
Results: After data processing, 79 stable features and 13 TCM syndromes were obtained. In basic model training, the RF, ExtraTrees, MLP, and SVM basic models showed better overall performance, so the syndrome distributions predicted by these four models were used as the secondary training data, and nine fusion models (SVM, KNN, RF, ExtraTrees, XGBoost, LightGBM, GB, AdaBoost, and MLP) were obtained based on Stacking ensemble learning. Among them, the XGBoost fusion model performed best, with accuracies of 0.850 on the training set and 0.838 on the test set, an overfitting difference of 0.012, and an area under the micro-average ROC curve of 0.996. On the test set, all fusion models improved on the basic models in both accuracy and area under the micro-average ROC curve.
Conclusion: Taking the TCM syndrome information of lung cancer as an example, the XGBoost fusion model built through Stacking ensemble learning has significant advantages in improving the diagnostic performance for TCM syndromes of lung cancer. This shows the advantage of Stacking ensemble learning in integrating multiple models and effectively improving the diagnostic efficiency of TCM diagnostic models, which provides a methodological reference for similar studies.
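
The two-stage procedure described in the abstract (four base models whose predicted syndrome distributions become the secondary training data) can be illustrated with a minimal scikit-learn sketch. This is not the authors' code. The data are synthetic, the hyperparameters are arbitrary, the use of out-of-fold predictions for the training-set meta-features is an assumption (the abstract does not say how they were generated), and scikit-learn's `GradientBoostingClassifier` stands in for the XGBoost meta-learner so that the example needs only one library.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import label_binarize
from sklearn.svm import SVC

# Synthetic stand-in for the clinical data (hypothetical sizes):
# 600 cases, 20 features, 4 classes instead of 79 features and 13 syndromes.
X, y = make_classification(n_samples=600, n_features=20, n_informative=10,
                           n_classes=4, random_state=0)

# 8:2 split into training and test sets, as in the abstract.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# The four base learners named in the Results (settings are assumptions).
base_models = [
    RandomForestClassifier(n_estimators=100, random_state=0),
    ExtraTreesClassifier(n_estimators=100, random_state=0),
    MLPClassifier(max_iter=500, random_state=0),
    SVC(probability=True, random_state=0),
]

# Stage 1: predicted class distributions become the secondary training data.
# Out-of-fold predictions keep the meta-learner from seeing probabilities
# that a base model produced for cases it was fitted on (an assumption here).
train_meta = np.hstack([
    cross_val_predict(m, X_train, y_train, cv=5, method="predict_proba")
    for m in base_models
])
# For the test set, each base model is refitted on the full training set.
test_meta = np.hstack([
    m.fit(X_train, y_train).predict_proba(X_test) for m in base_models
])

# Stage 2: the meta-learner (stand-in for XGBoost) trains on the meta-features.
meta_model = GradientBoostingClassifier(random_state=0)
meta_model.fit(train_meta, y_train)

# Metrics named in the abstract: accuracy, overfitting difference,
# and area under the micro-average ROC curve.
train_acc = accuracy_score(y_train, meta_model.predict(train_meta))
test_acc = accuracy_score(y_test, meta_model.predict(test_meta))
overfit_gap = train_acc - test_acc
y_test_bin = label_binarize(y_test, classes=meta_model.classes_)
micro_auc = roc_auc_score(y_test_bin, meta_model.predict_proba(test_meta),
                          average="micro")
print(f"train={train_acc:.3f} test={test_acc:.3f} "
      f"gap={overfit_gap:.3f} micro-AUC={micro_auc:.3f}")
```

The overfitting difference here is simply training accuracy minus test accuracy, which matches the reported 0.850 - 0.838 = 0.012. scikit-learn's `StackingClassifier` wraps the same two stages in a single estimator and may be more convenient when explicit access to the meta-features is not needed.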