Development of Machine Learning-Driven Diagnostic and Prognostic Models for Non-Small Cell Lung Cancer-Associated Malignant Pleural Effusion

Vernacular Title: 基于机器学习构建非小细胞肺癌恶性胸腔积液的诊断和预后模型
Author: Ping QI ¹ ; Jinhua LI ¹ ; Jinsheng ZHAO ² ; Caihong FU ³ ; Longxia ZHANG ⁴ ; Hui QIAO ⁴
Author Information

1. First Clinical Medical College of Lanzhou University, Lanzhou 730000, China.
2. Medical Department of the First Hospital of Lanzhou University, Lanzhou 730000, China.
3. Department of Respiratory Oncology, Gansu Provincial Tumor Hospital, Lanzhou 730000, China.
4. Department of Oncology, The First Hospital of Lanzhou University, Lanzhou 730000, China.
Publication Type: Clinical Research
Keywords: Non-small cell lung cancer; Malignant pleural effusion; Machine learning; Diagnostic model; Prognostic model
From: Cancer Research on Prevention and Treatment 2025;52(12):988-996
Country: China
Language: Chinese
Abstract: Objective To construct diagnostic and prognostic models for malignant pleural effusion (MPE) in patients with non-M1b stage (AJCC 7th edition) non-small cell lung cancer (NSCLC) using machine learning.
Methods A retrospective analysis was conducted on patients diagnosed with NSCLC in the Surveillance, Epidemiology, and End Results (SEER) database from 2010 to 2015, excluding those with M1b stage disease. Two datasets were assembled: dataset 1 (patients with non-M1b stage NSCLC, n=47,392) was used to construct the MPE diagnostic model, and dataset 2 (patients with M1a stage NSCLC and MPE, n=2,422) was used to construct the prognostic model. Least absolute shrinkage and selection operator (LASSO) regression was used to screen feature variables, with the data split into training and validation sets at a ratio of 7:3. Models were built using eight machine learning algorithms and evaluated by accuracy, precision, recall, F1 score, area under the receiver operating characteristic curve (ROC-AUC), decision curve, calibration curve, and precision-recall (PR) curve, with ROC-AUC as the primary evaluation metric.
Results The incidence of MPE in patients with non-M1b stage NSCLC was 5.12%, and the 1-year survival rate of patients with MPE was 32.5%. LASSO regression identified nine diagnosis-related variables and 12 prognosis-related variables. The AUC values of the models built with all eight machine learning algorithms exceeded 0.70. The random forest model performed best for diagnosis (training set AUC=0.908, validation set AUC=0.897), and the XGBoost model performed best for prognosis (training set AUC=0.905, validation set AUC=0.875). The other evaluation indicators also showed good and balanced results. SHAP feature importance analysis showed that tumor size, lymph node metastasis, and histological type were important factors influencing the occurrence of MPE, and that chemotherapy was the most important prognostic factor.
Conclusion The random forest diagnostic model constructed in this study can effectively predict the risk of MPE in patients with non-M1b stage NSCLC, and the XGBoost prognostic model can predict the prognosis of M1a-stage NSCLC patients with concurrent MPE.
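
The threshold-based metrics and the ROC-AUC named in the Methods can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the 0.5 decision threshold, and the eight toy labels and scores are all assumptions chosen for illustration and have no connection to the SEER data or the reported results.

```python
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 for binary labels coded 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def roc_auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """ROC-AUC as the probability that a randomly chosen positive case
    receives a higher score than a randomly chosen negative case
    (ties count as one half)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data (NOT from the study): 1 = event, 0 = no event.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.2, 0.7, 0.4, 0.5, 0.1, 0.8, 0.3]  # predicted probabilities
y_pred = [1 if s >= 0.5 else 0 for s in y_score]     # assumed 0.5 threshold

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, y_score)
print(metrics, auc)
```

Accuracy, precision, recall, and F1 depend on the chosen threshold, whereas ROC-AUC is computed from the ranking of the scores alone, which is one reason it is often preferred as a primary metric when the outcome is uncommon, as with the 5.12% MPE incidence reported here.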