A New Method of Approximating the Probability of Matching Common Words in Multiple Random Sequences

Haiman George<sup>*</sup>; Preda Cristian

doi:10.1007/s11009-010-9192-9

Abstract

In this paper we consider R independent sequences of length T formed by independent, not necessarily uniformly distributed letters drawn from a finite alphabet. We first develop a new and efficient method of calculating the expectation E(N(R)) = E(N(R)(m, T)) of the number N(R)(m, T) of distinct words of length m that are common to R such sequences. We then consider the case of four uniformly distributed letters. We determine a b(R) = b(R)(m, T) > 0 such that the interval [E(N(R)) - b(R), E(N(R))] contains the probability p(R) = P(N(R) >= 1) that there exists a word of length m common to the R sequences. We show that b(R) is approximately 0.07 E(N(R)) if R = 3 and b(R) <= 0.05 E(N(R)) if R >= 4. Thus, for unusual common words, i.e. those for which p(R) is small, E(N(R)) provides a very accurate approximation of this probability. We then compare numerically the intervals [E(N(R)) - b(R), E(N(R))] with earlier approximations of p(R) provided by Karlin and Ost (Ann Probab 16:535-563, 1988) and Naus and Sheng (Bull Math Biol 59(3):483-495, 1997).
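To make the quantities in the abstract concrete, the following is a minimal Monte Carlo sketch, not the calculation method developed in the paper. It assumes a uniform four-letter alphabet ("ACGT" is an illustrative choice) and estimates both E(N(R)) and p(R) = P(N(R) >= 1) by simulation. The function names `common_words` and `simulate` are hypothetical and do not come from the paper.

```python
import random


def common_words(seqs, m):
    """Return the set of distinct length-m words that occur in every sequence."""
    word_sets = [{s[i:i + m] for i in range(len(s) - m + 1)} for s in seqs]
    return set.intersection(*word_sets)


def simulate(R, T, m, trials, alphabet="ACGT", seed=0):
    """Estimate E(N) and P(N >= 1) for R random sequences of length T.

    Assumption: letters are independent and uniform over `alphabet`.
    Returns the pair (mean number of common words, fraction of trials
    with at least one common word).
    """
    rng = random.Random(seed)
    total = 0
    hits = 0
    for _ in range(trials):
        seqs = ["".join(rng.choice(alphabet) for _ in range(T)) for _ in range(R)]
        n = len(common_words(seqs, m))
        total += n
        hits += 1 if n >= 1 else 0
    return total / trials, hits / trials
```

Because N(R) is a non-negative integer, the estimated probability can never exceed the estimated expectation, which is the upper end of the interval in the abstract. For example, `simulate(R=3, T=100, m=8, trials=2000)` returns one such pair of estimates; the simulation error grows in relative terms as p(R) gets small, which is the regime where the paper's analytical bounds are most useful.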

Publication date: 2010-12

Last updated: 2017-04-26 13:20
