摘要

Evaluating the significance of goodness-of-fits tests for multinomial data in general, and estimating the p value of the log-likelihood ratio statistic G(2) in particular, have been studied extensively in the statistical literature. The methods for computing the latter can be broadly divided into two categories: asymptotic approximations and exact methods. In many applications in bioinformatics one needs a fast algorithm to estimate vanishingly small p values which led to a renewed interest in exact methods.
We introduce anew algorithm (bagFFT) for computing the p value of G(2) which is based on the algorithm by Baglivo, Olivier, and Pagano. The original algorithm tends to suffer from large numerical errors, rendering it inappropriate for our purposes. We give theoretical and empirical evidence that by applying appropriate exponential shifts we obtain an algorithm that is: (a) as stable numerically as Hirji's network algorithm, and (b) asymptotically faster than both Baglivo et al.'s and Hirji's algorithm: 0 (QKN log N) versus 0 (QKN(2)), where Q is the size of the lattice, K is the number of categories, and N is the size of the sample. We conclude by showing how bagFFf can be combined with our earlier work to obtain a fast and accurate algorithm to compute the significance of biological sequence motifs.

  • 出版日期2006-12

全文