Abstract

In genome-wide association studies (GWAS), the acquired sequence data are often imbalanced: control samples are abundant while case samples are limited. This sample imbalance is particularly serious when investigating rare diseases, or common diseases in rare populations. Conventional GWAS methods may suffer from severe statistical bias toward the majority group, leading to a loss of power in uncovering true susceptibility loci. We introduce a boosting correction method, termed Bosco, to deal with this imbalance problem. Bosco is motivated by boosting theory in machine learning and is implemented in a coarse-to-fine learning framework: the coarse step assigns importance scores to all samples in the majority group, and the fine step calculates P-values by a weighted logistic regression. On simulated data sets, we demonstrate that the proposed method can dramatically improve discovery power, even on extremely imbalanced data sets, while keeping false positives well controlled. We also apply Bosco to a genome-scale gastric cancer data set to conduct a genome-wide analysis. Our method replicates previously reported findings (from the likelihood ratio test) with high statistical significance and shows the ability to identify new susceptibility SNPs.