Abstract

Genome assembly has been an active area of research since the structure of DNA was discovered, and it gained further momentum after the Human Genome Project was launched. A large number of genomes have been assembled, and many more are in the pipeline. A number of full-scale assemblers and special-purpose modules have been reported. Because the volume of data involved in genome assembly is extraordinarily large and demands substantial computational power and processing time, many assemblers use parallel computing to reconstruct the DNA sequence faster and more efficiently. Genome assembly is a multi-step process whose components may be partly or fully parallelized. Several assemblers and individual modules that perform tasks such as pairwise alignment, multiple sequence alignment, and repeat finding have been analyzed and documented before. This paper instead provides a holistic view of the assembly process in the realm of parallel and distributed computing. It streamlines the individual tasks related to, but not limited to, whole-genome shotgun sequencing into a sequence of loosely coupled stages, in which each stage consumes the output of the preceding stage and passes its results to the next. Many of these tasks are essential to current and next-generation sequence assemblers. The paper walks through the entire streamlined process while describing, analyzing, and commenting on the algorithms and techniques that have been designed and implemented for each stage. Where applicable, it suggests improvements that may form the basis of new research.

  • Publication date: 2015-1