Abstract

Descriptors based on information content (IC) are introduced to characterize nucleotide sequences. The descriptors extend Shannon IC and are denoted ICr, where r = 1, 2, ..., n is the length of the DNA strings whose probability distribution is used. Sequence IC (SICr) and complementary IC (CSICr) are also introduced. IC saturates, reaching a maximum after a few orders, and the order (string length) at which a given sequence attains its maximum IC depends on the length of the DNA sequence. The effectiveness of the new descriptors in comparing the similarity of DNA sequences was evaluated by performing phylogenetic analyses on the first exons of 14 beta-globin genes and the complete coding sequences of 20 beta-globin genes. Dendrograms obtained using the descriptors were comparable to the classification of organisms according to the evolutionary tree. ICr, SICr and CSICr can be calculated with little computation time, even for very long DNA sequences.
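
As a rough illustration of the idea behind ICr, the sketch below computes the Shannon entropy of the distribution of overlapping length-r substrings in a sequence. The abstract does not give the exact formulas, so the function name `ic_r`, the use of overlapping windows, and the choice of base-2 logarithms are assumptions made here for illustration; SICr and CSICr are not attempted because their definitions are not stated.

```python
from collections import Counter
from math import log2


def ic_r(seq: str, r: int) -> float:
    """Shannon entropy (in bits) of the length-r substring distribution.

    Assumption: substrings are taken as overlapping windows, and their
    relative frequencies are used as the probability distribution.
    This is an illustrative reading of ICr, not the paper's definition.
    """
    n_windows = len(seq) - r + 1
    if r < 1 or n_windows < 1:
        raise ValueError("r must satisfy 1 <= r <= len(seq)")

    # Count every overlapping window of length r.
    counts = Counter(seq[i:i + r] for i in range(n_windows))

    # H = sum over substrings of p * log2(1 / p), with p = count / n_windows.
    return sum((c / n_windows) * log2(n_windows / c) for c in counts.values())


if __name__ == "__main__":
    example = "ATGGTGCACCTGACTCCTGAGGAGAAGTCT"  # arbitrary demo string
    for order in range(1, 6):
        print(order, round(ic_r(example, order), 3))
```

For instance, `ic_r("ACGT", 1)` gives 2.0 bits, the maximum for a four-letter alphabet. Under these assumptions a sequence of length N has only N - r + 1 windows, so the value can never exceed log2(N - r + 1); this is one plausible reason why IC stops growing at an order that depends on sequence length.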

  • Publication date: 2010-8-10