Statistical methods for extremely unbalanced data in genome-wide association study (1)
10.3760/cma.j.cn112338-20240506-00235
- VernacularTitle:全基因组关联研究中极端不平衡数据的统计分析方法(一)
- Author:
Ning XIE (1); Wenjian BI; Zhongwen ZHANG; Fang SHAO; Yongyue WEI; Yang ZHAO; Ruyang ZHANG; Feng CHEN
Author Information
1. Department of Biostatistics, School of Public Health, Nanjing Medical University, Nanjing 211166
- Keywords:
Extremely unbalanced data;
Genome-wide association study;
Asymptotic distribution;
Rare variants
- From:
Chinese Journal of Epidemiology
2024;45(11):1582-1589
- Country:China
- Language:Chinese
- Abstract:
Extremely unbalanced data refers here to datasets in which the values of an independent or dependent variable are severely disproportionate, for example an extremely unbalanced case-control ratio, a very low disease incidence, heavily censored time-to-event data, or low-frequency and rare variants. In such scenarios, test statistics from classical methods such as the logistic regression model and the Cox proportional hazards regression model may deviate from their theoretical asymptotic distributions, inflating or deflating the type I error rate. As large-scale population cohort resources become increasingly available and widely explored in genome-wide association studies (GWAS), there is a growing demand for effective and accurate statistical approaches that handle extremely unbalanced data in both independent and non-independent samples. This study first introduces the classical statistical methods of statistical genetics and then uses simulation experiments to summarize how these methods fail on extremely unbalanced data, in order to draw researchers' attention to extremely unbalanced data in GWAS.
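
The kind of simulation the abstract describes can be illustrated with a minimal sketch. It is not the authors' simulation design: the function name `empirical_type1_error`, all parameter names, and the default values are assumptions chosen for illustration. The sketch assumes an intercept-only logistic null model, for which the score statistic for a genotype has a closed form, and it estimates how often the normal-approximation test rejects at level `alpha` when genotype and phenotype are truly independent.

```python
import numpy as np
from statistics import NormalDist


def empirical_type1_error(n=10_000, case_fraction=0.01, maf=0.005,
                          alpha=0.001, n_sim=5_000, seed=0):
    """Empirical type I error of the normal-approximation score test.

    Hypothetical helper for illustration only (not from the article).
    Null model: logistic regression with an intercept and no covariates.
    """
    rng = np.random.default_rng(seed)
    n_cases = max(1, int(round(n * case_fraction)))
    y = np.zeros(n)
    y[:n_cases] = 1.0                 # fixed case-control labels
    y_bar = y.mean()
    resid = y - y_bar                 # residuals under the null model
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value

    rejections = 0
    valid = 0
    for _ in range(n_sim):
        # Genotype (0/1/2 minor alleles) drawn independently of y,
        # so every rejection is a false positive.
        g = rng.binomial(2, maf, size=n).astype(float)
        var = y_bar * (1 - y_bar) * np.sum((g - g.mean()) ** 2)
        if var <= 0:                  # monomorphic variant: test undefined
            continue
        z = np.sum(g * resid) / np.sqrt(var)
        valid += 1
        rejections += int(abs(z) > z_crit)
    return rejections / valid if valid else float("nan")


if __name__ == "__main__":
    balanced = empirical_type1_error(case_fraction=0.5, maf=0.2)
    unbalanced = empirical_type1_error(case_fraction=0.01, maf=0.005)
    print(f"nominal alpha: 0.001")
    print(f"balanced design, common variant:  {balanced:.4f}")
    print(f"unbalanced design, rare variant:  {unbalanced:.4f}")
```

Comparing the two printed rates with the nominal `alpha` gives a rough view of how far the asymptotic approximation drifts in the unbalanced, rare-variant setting. With only a few thousand replicates the estimates are noisy at small `alpha`, so a larger `n_sim` would be needed for stable conclusions.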