摘要

Motifs are short recurring patterns that are of much interest as they help us understand the mechanism behind regulating gene expression. This paper presents an efficient deterministic algorithm that exhaustively discovers all identical string motifs of all sizes that appears in all or most of the input strings. The input is a set of n different strings each of length N. The problem is to discover all exact instances of any size occurring at least theta times per string in q or more input strings. The quorums theta and q are user set positive integers. The algorithm starts with substrings of size 2, and keeps augmenting them one symbol a time We encode the substrings, such that it is not possible to have two different substrings with the same encoding. After the sorting of the encoded substring, we can tell where, and which of the substrings have multiple copies. Next, we eliminate all the substrings that appear less than the quorums, and repeat the process. Theoretically the algorithm is competitive and scales extremely well with the size of the alphabet. For strings over the alphabet Sigma, it is O(nN + vertical bar Sigma vertical bar) in space, and O(nNL - nL(2)) in time, where L is the length of the longest identical string motif satisfying the constraint. For experimental purposes we applied the algorithm on real biological sequences (DNA and protein), and on randomly generated sequences using alphabet of sizes up to 1000 symbols. These experiments confirm that in practice the algorithm has indeed a linear complexity, both in space and time, making it practical for the important and difficult case of large alphabet inputs.

  • 出版日期2015-3-1