Applied research of the impact of air pollution on absenteeism in students with respiratory issues through machine learning analysis
10.16835/j.cnki.1000-9817.2024169
- VernacularTitle:大气污染对学生因呼吸系统症状 缺课影响的机器学习算法应用研究
- Author:
CAO Chengbin, YANG Wenyi, YU Xiaojin, WANG Yan, YANG Jie
1
Author Information
1. School of Public Health, Southeast University, Nanjing (210009), Jiangsu Province, China
- Publication Type:Journal Article
- Keywords:
Air pollution;Respiratory system;Absenteeism;Models, statistical;Students
- From:
Chinese Journal of School Health
2024;45(6):770-774
- Country:China
- Language:Chinese
- Abstract:
Objective:To explore the performance of machine learning prediction models in short-term forecasting of student absenteeism due to respiratory symptoms associated with air pollution, so as to provide a methodological reference for school disease early warning systems.
Methods:Short-term time series data on student absenteeism due to respiratory symptoms in Jiangsu Province from September 2019 to October 2022 were combined with average concentrations of atmospheric pollutants. A univariate distributed lag nonlinear model was used to select the optimal lag variable for each pollutant. An extreme gradient boosting (XGBoost) model was developed to predict the frequency of absenteeism due to respiratory symptoms and was compared with a seasonal autoregressive integrated moving average with exogenous factors (SARIMAX) model.
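The lag-selection step feeds the prediction model with pollutant values shifted by a chosen number of days. The sketch below shows one way such a lagged feature table could be built. It is illustrative only: the function name `build_lagged_rows`, the toy values, and the example lags are assumptions, not the authors' code or data, and the model fitting itself is not shown.

```python
from typing import Dict, List, Sequence


def build_lagged_rows(
    absences: Sequence[float],
    pollutants: Dict[str, Sequence[float]],
    lags: Dict[str, int],
) -> List[dict]:
    """Pair each day's absence count with pollutant values from `lag` days earlier.

    Days that do not have a full lag history are dropped.
    """
    max_lag = max(lags.values())
    rows = []
    for t in range(max_lag, len(absences)):
        row = {"day": t, "absences": absences[t]}
        for name, lag in lags.items():
            # lag 0 means the same-day concentration
            row[f"{name}_lag{lag}"] = pollutants[name][t - lag]
        rows.append(row)
    return rows


# Toy daily series, invented for illustration (not data from the study)
absences = [10, 12, 9, 15, 20, 18, 14]
pollutants = {
    "pm25": [30, 35, 50, 42, 38, 33, 31],
    "no2": [25, 28, 31, 27, 26, 24, 29],
}
lags = {"pm25": 4, "no2": 0}  # hypothetical lags chosen for the example

rows = build_lagged_rows(absences, pollutants, lags)
for row in rows:
    print(row)
```

Each resulting row could then serve as one training example for a regression model, with `absences` as the target and the lagged columns as predictors.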
Results:Between 2019 and 2022, an average of 9,709 students per day in Jiangsu Province were absent due to respiratory symptoms. The daily average air quality index (AQI) was 76.96, and the mean mass concentrations of PM2.5, PM10, NO2, and O3 were 35.75, 61.13, 28.89, and 104.81 μg/m³, respectively. Granger causality tests indicated that AQI, PM2.5, PM10, NO2, and O3 were significant predictors of the frequency of absenteeism due to respiratory symptoms (F=1.46, 1.79, 1.67, 3.41, 2.18, P<0.01). The single-day lag effects of PM2.5, PM10, NO2, and O3 reached their peak relative risk (RR) values at lag 4, lag 0, lag 0, and lag 4, respectively. With these optimal lag variables included, the XGBoost model outperformed the SARIMAX model: the mean absolute error (MAE) fell from 2.251 to 0.475, the mean absolute percentage error (MAPE) from 0.429 to 0.080, and the root mean square error (RMSE) from 2.582 to 0.713. At the P75 (75th percentile) alert threshold, sensitivity improved from 0.086 to 0.694 and specificity from 0.979 to 0.988, and the Youden index increased from 0.065 to 0.682.
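For readers less familiar with these evaluation measures, the following minimal sketch computes MAE, MAPE, RMSE, and the alert metrics on toy data. It rests on assumptions the abstract does not confirm: a day counts as an alert when its value exceeds the 75th percentile of the observed series, and the same threshold is applied to the predictions. The helper names and toy values are hypothetical. The last lines check that the reported Youden indices equal sensitivity + specificity - 1.

```python
import math
from typing import Sequence


def mae(y: Sequence[float], p: Sequence[float]) -> float:
    return sum(abs(a - b) for a, b in zip(y, p)) / len(y)


def mape(y: Sequence[float], p: Sequence[float]) -> float:
    # Returned as a fraction, not a percentage; requires nonzero observations
    return sum(abs((a - b) / a) for a, b in zip(y, p)) / len(y)


def rmse(y: Sequence[float], p: Sequence[float]) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, p)) / len(y))


def percentile(values: Sequence[float], q: float) -> float:
    """Percentile with linear interpolation between order statistics."""
    s = sorted(values)
    pos = (len(s) - 1) * q
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def alert_metrics(y: Sequence[float], p: Sequence[float], threshold: float):
    """Sensitivity, specificity, and Youden index for alerts above `threshold`.

    Assumes the data contain at least one alert day and one non-alert day.
    """
    tp = sum(a > threshold and b > threshold for a, b in zip(y, p))
    fn = sum(a > threshold and b <= threshold for a, b in zip(y, p))
    tn = sum(a <= threshold and b <= threshold for a, b in zip(y, p))
    fp = sum(a <= threshold and b > threshold for a, b in zip(y, p))
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity, sensitivity + specificity - 1


# Toy observed and predicted values, invented for illustration
observed = [4, 6, 5, 9, 12, 7, 5, 11]
predicted = [5, 6, 4, 8, 11, 9, 5, 7]

threshold = percentile(observed, 0.75)  # assumed P75 alert threshold
sens, spec, youden = alert_metrics(observed, predicted, threshold)
print(mae(observed, predicted), mape(observed, predicted), rmse(observed, predicted))
print(threshold, sens, spec, youden)

# Consistency check on the figures reported in the abstract
youden_sarimax = round(0.086 + 0.979 - 1, 3)  # 0.065
youden_xgboost = round(0.694 + 0.988 - 1, 3)  # 0.682
```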
Conclusions:The XGBoost model shows robust predictive performance and effective early warning capability for short-term time series of student absenteeism due to respiratory symptoms associated with air pollution. Schools could adopt this model to detect and control disease outbreaks early, thereby strengthening school health management.