摘要

Motivation: Exact-match overlap graphs have been broadly used in the context of DNA assembly and the shortest super string problem where the number of strings n ranges from thousands to billions. The length l of the strings is from 25 to 1000, depending on the DNA sequencing technologies. However, many DNA assemblers using overlap graphs suffer from the need for too much time and space in constructing the graphs. It is nearly impossible for these DNA assemblers to handle the huge amount of data produced by the next-generation sequencing technologies where the number n of strings could be several billions. If the overlap graph is explicitly stored, it would require Omega(n(2)) memory, which could be prohibitive in practice when n is greater than a hundred million. In this article, we propose a novel data structure using which the overlap graph can be compactly stored. This data structure requires only linear time to construct and and linear memory to store.
Results: For a given set of input strings (also called reads), we can informally define an exact-match overlap graph as follows. Each read is represented as a node in the graph and there is an edge between two nodes if the corresponding reads overlap sufficiently. A formal description follows. The maximal exact-match overlap of two strings x and y, denoted by ov(max)(x, y), is the longest string which is a suffix of x and a prefix of y. The exact-match overlap graph of n given strings of length l is an edge-weighted graph in which each vertex is associated with a string and there is an edge (x, y) of weight omega=l-vertical bar ov(max)(x, y)vertical bar if and only if omega <lambda, where vertical bar ov(max)(x, y)vertical bar is the length of ov(max)(x, y) and lambda is a given threshold. In this article, we show that the exact-match overlap graphs can be represented by a compact data structure that can be stored using at most (2 lambda-1)(2lognr+log lambda r)n bits with a guarantee that the basic operation of accessing an edge takes O(log lambda) time. We also propose two algorithms for constructing the data structure for the exact-match overlap graph. The first algorithm runs in O(lambda lnlogn) worse-case time and requires O(lambda) extra memory. The second one runs in O(lambda ln) time and requires O(n) extra memory. Our experimental results on a huge amount of simulated data from sequence assembly show that the data structure can be constructed efficiently in time and memory.

  • 出版日期2011-7-15