1. Simulation study on missing data imputation methods for longitudinal data in cohort studies
Yemian LI ; Peng ZHAO ; Yuhui YANG ; Jingxian WANG ; Hong YAN ; Fangyao CHEN
Chinese Journal of Epidemiology 2021;42(10):1889-1894
Objective: Missing data are an unavoidable problem in cohort studies. This paper uses a simulation study to compare the imputation performance of eight common missing-data imputation methods on longitudinal data, in order to provide a reference for handling missing data in longitudinal studies. Methods: The simulation was implemented in R, and longitudinal data with missing values were generated by the Monte Carlo method. The imputation methods were compared on the average absolute deviation, the average relative deviation, and the Type I error of a subsequent regression analysis, so as to evaluate both their imputation performance and their influence on subsequent multivariate analysis. Results: Mean imputation, k-nearest neighbor (KNN), regression imputation, and random forest had similar and stable imputation performance. Hot deck was inferior to these four methods. K-means clustering and the expectation-maximization (EM) algorithm were among the worst and were unstable. Mean imputation, the EM algorithm, random forest, KNN, and regression imputation controlled the Type I error, whereas multiple imputation, hot deck, and K-means clustering did not control it effectively. Conclusions: For missing data in longitudinal studies under the missing at random mechanism, mean imputation, KNN, regression imputation, and random forest are the preferable imputation methods. When the missing ratio is not too large, multiple imputation and hot deck can also perform well, but K-means clustering and the EM algorithm are not recommended.
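The kind of simulation described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' R code. Everything specific in it is an assumption made for illustration: the data-generating model (random intercept plus linear trend), the missingness rule (follow-up values are more likely to be missing when the always-observed baseline is high, which is one form of missing at random), the function names, and the exact definitions of the two deviation measures, which the abstract does not spell out. Only mean imputation is shown.

```python
import random
import statistics


def simulate_longitudinal(n_subjects, n_waves, rng):
    """Hypothetical model: random intercept + linear time trend + noise."""
    data = []
    for _ in range(n_subjects):
        intercept = rng.gauss(50.0, 5.0)
        data.append([intercept + 2.0 * t + rng.gauss(0.0, 1.0)
                     for t in range(n_waves)])
    return data


def make_missing_mar(data, rng, base_rate=0.2):
    """Blank out follow-up values; missingness depends only on the observed baseline."""
    baseline_median = statistics.median(row[0] for row in data)
    observed = []
    for row in data:
        # Assumed rule: a high baseline makes later waves more likely to be missing.
        p = base_rate * (1.5 if row[0] > baseline_median else 0.5)
        observed.append([row[0]] +
                        [None if rng.random() < p else v for v in row[1:]])
    return observed


def mean_impute(observed):
    """Replace each missing value with the observed mean of the same wave."""
    n_waves = len(observed[0])
    wave_means = [
        statistics.fmean(row[t] for row in observed if row[t] is not None)
        for t in range(n_waves)
    ]
    return [[wave_means[t] if v is None else v for t, v in enumerate(row)]
            for row in observed]


def deviations(truth, observed, imputed):
    """Mean absolute and mean relative deviation over the imputed cells only."""
    abs_dev, rel_dev = [], []
    for t_row, o_row, i_row in zip(truth, observed, imputed):
        for t, o, i in zip(t_row, o_row, i_row):
            if o is None:
                abs_dev.append(abs(i - t))
                rel_dev.append(abs(i - t) / abs(t))
    return statistics.fmean(abs_dev), statistics.fmean(rel_dev)


def run_simulation(n_reps=20, n_subjects=200, n_waves=5, seed=1):
    """Monte Carlo loop: average both deviation measures over replications."""
    rng = random.Random(seed)
    mads, mrds = [], []
    for _ in range(n_reps):
        truth = simulate_longitudinal(n_subjects, n_waves, rng)
        observed = make_missing_mar(truth, rng)
        imputed = mean_impute(observed)
        mad, mrd = deviations(truth, observed, imputed)
        mads.append(mad)
        mrds.append(mrd)
    return statistics.fmean(mads), statistics.fmean(mrds)


if __name__ == "__main__":
    mad, mrd = run_simulation()
    print(f"mean absolute deviation: {mad:.3f}")
    print(f"mean relative deviation: {mrd:.3f}")
```

Comparing methods would mean swapping `mean_impute` for another imputation function and rerunning the same loop with the same seed, so that every method sees identical simulated data sets.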
2. Simulation study on variable selection method for high-dimensional biomedical data
Jingxian WANG ; Peng ZHAO ; Yemian LI ; Yuhui YANG ; Fangyao CHEN
Journal of Xi'an Jiaotong University(Medical Sciences) 2021;42(4):628-632
【Objective】 To compare the performance of five commonly used variable selection methods in screening variables from high-dimensional biomedical data, to explore how sample size and the association among candidate variables affect the screening results, and to provide evidence for developing variable selection strategies in high-dimensional biomedical data analysis. 【Methods】 The variable selection algorithms were implemented in R. The Monte Carlo method was used to simulate high-dimensional biomedical data under different conditions. The performance of each variable selection method was evaluated and compared on the true positive rate and the true negative rate of screening. 【Results】 For the specified high-dimensional data, the variable selection performance of all methods improved as the sample size increased, and the association among candidate variables affected the screening results. The simulation results indicated that the elastic net algorithm had the best screening performance, the LASSO algorithm ranked second, and the ridge algorithm did not work at all. 【Conclusion】 The elastic net algorithm is an ideal method for variable screening in high-dimensional data.
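The evaluation described in this abstract, scoring a selection method by its true positive rate and true negative rate on simulated data, can be sketched in Python as follows. It is not the authors' R implementation. The equicorrelated data generator, the coefficient size, the penalty values, the rule "selected means nonzero coefficient", and all function names are assumptions made for illustration. The solver is a plain coordinate-descent elastic net, with `alpha=1.0` giving the LASSO penalty and `alpha=0.0` giving the ridge penalty.

```python
import random


def simulate(n, p, n_true, rho, rng):
    """Hypothetical design: p predictors with pairwise correlation rho;
    only the first n_true predictors affect y."""
    X, y = [], []
    for _ in range(n):
        shared = rng.gauss(0.0, 1.0)
        row = [rho ** 0.5 * shared + (1.0 - rho) ** 0.5 * rng.gauss(0.0, 1.0)
               for _ in range(p)]
        X.append(row)
        y.append(sum(2.0 * row[j] for j in range(n_true)) + rng.gauss(0.0, 1.0))
    return X, y


def soft_threshold(z, g):
    """Shrink z toward zero by g; values within [-g, g] become exactly zero."""
    if z > g:
        return z - g
    if z < -g:
        return z + g
    return 0.0


def elastic_net(X, y, lam, alpha, n_iter=100):
    """Coordinate descent for
    (1/2n)||y - Xb||^2 + lam * (alpha * |b|_1 + (1 - alpha)/2 * |b|_2^2)."""
    n, p = len(X), len(X[0])
    cols = [[X[i][j] for i in range(n)] for j in range(p)]
    col_sq = [sum(v * v for v in col) / n for col in cols]
    beta = [0.0] * p
    resid = list(y)  # residual y - Xb, kept up to date after every update
    for _ in range(n_iter):
        for j in range(p):
            col, old = cols[j], beta[j]
            rho_j = sum(c * r for c, r in zip(col, resid)) / n + col_sq[j] * old
            new = soft_threshold(rho_j, lam * alpha) / (col_sq[j] + lam * (1.0 - alpha))
            if new != old:
                delta = new - old
                resid = [r - c * delta for c, r in zip(col, resid)]
                beta[j] = new
    return beta


def selection_rates(beta, n_true):
    """TPR and TNR, treating a nonzero coefficient as 'selected'."""
    p = len(beta)
    tp = sum(1 for j in range(n_true) if beta[j] != 0.0)
    tn = sum(1 for j in range(n_true, p) if beta[j] == 0.0)
    return tp / n_true, tn / (p - n_true)


if __name__ == "__main__":
    rng = random.Random(0)
    X, y = simulate(n=80, p=40, n_true=5, rho=0.3, rng=rng)
    for name, alpha in [("lasso", 1.0), ("elastic net", 0.5), ("ridge", 0.0)]:
        tpr, tnr = selection_rates(elastic_net(X, y, lam=0.2, alpha=alpha), 5)
        print(f"{name}: TPR={tpr:.2f}, TNR={tnr:.2f}")
```

One point the sketch makes visible: with `alpha=0.0` the soft-threshold step never sets a coefficient exactly to zero, so under the nonzero-coefficient rule a pure ridge penalty excludes nothing. That may be related to the reported result for the ridge algorithm, although the abstract does not state how selection was defined for ridge.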