Abstract

In population genetics, it is common to represent the haplotype diversity at a genomic region across multiple populations using well-constructed visual representations. This typically requires the chromosomes from these populations to be grouped according to some definition of haplotypic similarity. Here, we introduce a novel algorithm for clustering haplotypes with the specific aim of addressing haplotype diversity within or between populations. The algorithm allows for missing data in the haplotypes and appropriately downweights single nucleotide polymorphisms with a greater extent of missingness. By identifying the canonical haplotypes in a genomic region, defined as the haplotype forms to which most chromosomes are similar, the algorithm maps each chromosome either to a unique canonical haplotype or to a mosaic of the identified canonical haplotypes. This mapping can subsequently be used to produce graphical visualizations of the haplotype clustering for understanding the extent of haplotype diversity in the region. The clustering application has been implemented in R for distribution as haplosim, and we also provide a visualization script, hapvisual, for graphical display of the clustering results. The outcome of such an analysis can be informative in understanding the extent of haplotype diversity between populations, in addressing the reproducibility of established association signals across multiple populations, and in the investigation of positive selection in the human genome. Genet. Epidemiol. 34:34-41, 2010.
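The abstract does not give the similarity measure or the weighting scheme, so the following is only a minimal Python sketch of the general idea, not the haplosim implementation. It assumes biallelic SNPs coded 0/1 with None for missing calls, assumes each SNP is weighted by its fraction of observed calls, and assumes a fixed similarity threshold for assignment. All function names and the threshold are hypothetical.

```python
from typing import List, Optional, Sequence

# A haplotype is a sequence of alleles coded 0/1; None marks a missing call.
Haplotype = Sequence[Optional[int]]


def snp_weights(haps: Sequence[Haplotype]) -> List[float]:
    """Assumed weighting: each SNP gets the fraction of chromosomes with an
    observed call, so SNPs with more missing data count for less."""
    n = len(haps)
    n_snps = len(haps[0])
    return [
        sum(1 for h in haps if h[j] is not None) / n
        for j in range(n_snps)
    ]


def weighted_similarity(a: Haplotype, b: Haplotype,
                        weights: Sequence[float]) -> float:
    """Weighted proportion of matching alleles over SNPs observed in both
    haplotypes; SNPs missing in either haplotype are skipped."""
    matched = 0.0
    total = 0.0
    for x, y, w in zip(a, b, weights):
        if x is None or y is None:
            continue
        total += w
        if x == y:
            matched += w
    return matched / total if total > 0 else 0.0


def assign_to_canonical(hap: Haplotype,
                        canonicals: Sequence[Haplotype],
                        weights: Sequence[float],
                        threshold: float = 0.9) -> Optional[int]:
    """Return the index of the most similar canonical haplotype, or None when
    no canonical form reaches the (hypothetical) threshold. A None result
    marks a chromosome that would need to be treated as a mosaic."""
    best_idx, best_sim = None, -1.0
    for i, c in enumerate(canonicals):
        s = weighted_similarity(hap, c, weights)
        if s > best_sim:
            best_idx, best_sim = i, s
    return best_idx if best_sim >= threshold else None


if __name__ == "__main__":
    haps = [[0, 1, None], [0, 1, 1], [1, None, None], [0, 0, 1]]
    w = snp_weights(haps)
    print(w)
    print(assign_to_canonical(haps[0], [[0, 1, 1], [1, 0, 0]], w))
```

In this sketch, the identification of the canonical haplotypes themselves and the decomposition of a chromosome into a mosaic are left out, because the abstract does not specify how either step is performed.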

  • Publication date: 2010-1