Abstract

We propose a novel adaptive penalized logistic regression modeling strategy based on the Wilcoxon rank sum test (WRST) to effectively uncover driver genes in classification. To incorporate the significance of each gene into the classifier, we first measure gene significance with a WRST-based gene ranking method, and then impose an adaptive L1-type penalty on each gene, with the size of the penalty depending on the measured importance of that gene. Incorporating gene significance into adaptive logistic regression allows us to impose a large penalty on low-ranking genes, so that noise genes are readily removed from the model and driver genes are effectively identified. Monte Carlo experiments and a real-world example are conducted to investigate the effectiveness of the proposed approach. In the Sanger data analysis, we introduce a strategy for identifying expression modules that indicate gene regulatory mechanisms via principal component analysis (PCA), and perform logistic regression modeling based not on single genes but on gene expression modules. The Monte Carlo experiments and the real-world example show that the proposed adaptive penalized logistic regression outperforms existing L1-type regularization in both feature selection and classification. The discriminately imposed WRST-based penalty effectively selects crucial genes, and thus our method improves classification accuracy without interference from noise genes. Furthermore, the Sanger data analysis shows that the method for gene expression modules, based on principal components and their loading scores, provides biologically interpretable results.
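
The two-stage idea described above (rank genes with the WRST, then penalize each gene according to its rank) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the exact form of the adaptive weights, so the inverse-|z| weight used here is an assumption, as are the proximal gradient solver, the toy data, and the function names `rank_sum_z`, `wrst_weights`, and `fit_adaptive_l1_logistic`, all of which are hypothetical.

```python
import math


def rank_sum_z(a, b):
    """Wilcoxon rank-sum z-statistic for sample a versus sample b
    (normal approximation, average ranks for ties, no tie correction)."""
    pooled = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    n = len(pooled)
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[k] = avg
        i = j + 1
    n1, n2 = len(a), len(b)
    w = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (w - mean) / sd


def wrst_weights(X, y, eps=1e-3):
    """Assumed weight form: 1 / (|z_j| + eps), so that low-ranking genes
    (small |z|) receive a large penalty weight."""
    weights = []
    for j in range(len(X[0])):
        g0 = [row[j] for row, yi in zip(X, y) if yi == 0]
        g1 = [row[j] for row, yi in zip(X, y) if yi == 1]
        weights.append(1.0 / (abs(rank_sum_z(g0, g1)) + eps))
    return weights


def sigmoid(t):
    # Numerically stable logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def soft_threshold(v, t):
    return math.copysign(max(abs(v) - t, 0.0), v)


def fit_adaptive_l1_logistic(X, y, weights, lam=0.05, step=0.1, iters=500):
    """Proximal gradient descent on the mean negative log-likelihood
    plus lam * sum_j weights[j] * |beta_j| (intercept unpenalized)."""
    n, p = len(X), len(X[0])
    intercept, beta = 0.0, [0.0] * p
    for _ in range(iters):
        resid = [
            sigmoid(intercept + sum(b * v for b, v in zip(beta, row))) - yi
            for row, yi in zip(X, y)
        ]
        grad = [sum(r * row[j] for r, row in zip(resid, X)) / n for j in range(p)]
        intercept -= step * sum(resid) / n
        beta = [
            soft_threshold(b - step * g, step * lam * w)
            for b, g, w in zip(beta, grad, weights)
        ]
    return intercept, beta


# Toy data (made up for illustration): gene 0 differs between classes,
# gene 1 has the same distribution in both classes.
y = [0, 0, 0, 0, 1, 1, 1, 1]
informative = [-1.0, -0.8, -1.2, 0.2, 1.0, 0.8, 1.2, -0.2]
noise = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
X = [list(row) for row in zip(informative, noise)]

weights = wrst_weights(X, y)
intercept, beta = fit_adaptive_l1_logistic(X, y, weights)
print(weights, beta)
```

In this toy setting the noise gene has a rank-sum statistic of zero, so its penalty weight is very large and its coefficient stays at exactly zero, while the informative gene keeps a positive coefficient. Any decreasing function of the WRST importance could be substituted for the assumed weight.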