A novel alignment-free vector method to cluster protein sequences

Authors: He, Lily; Li, Yongkun; He, Rong Lucy; Yau, Stephen S. T.*
Source: Journal of Theoretical Biology, 2017, 427: 41-52.
DOI: 10.1016/j.jtbi.2017.06.002

Abstract

Classification of proteins is a crucial topic in biology. The number of protein sequences stored in databases has increased sharply over the past decade. Traditionally, protein sequences are compared through multiple sequence alignment. However, alignment-based methods may be unsuitable for clustering protein sequences when gene rearrangements occur, as in viral genomes. The computation is also very time-consuming for large datasets with long genomes. In this paper, based on three important biochemical properties of amino acids (the hydropathy index, the polar requirement, and the chemical composition of the side chain), we propose a 24-dimensional feature vector describing the composition of amino acids in protein sequences. Our method not only utilizes the chemical properties of amino acids but also takes into account their counts and positions. The results on beta-globin, mammal, and three virus datasets show that this new tool is fast and accurate for classifying proteins and inferring the phylogeny of organisms. Published by Elsevier Ltd.
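
The abstract does not spell out how the 24 dimensions are built, so the following is only a minimal sketch of the general idea (group residues by a chemical property, then record each group's count and positional statistics), not the authors' construction. It uses a single property, the Kyte-Doolittle hydropathy index, and yields 9 numbers rather than 24. The thresholds of +1.0 and -1.0, the group names, the normalization of the positional spread, and the function names group_features and euclidean are all assumptions made for illustration.

```python
import math
from collections import defaultdict

# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Hypothetical group labels; the paper's actual partition may differ.
GROUPS = ("hydrophobic", "neutral", "hydrophilic")


def hydropathy_group(aa):
    """Assign a residue to a group using illustrative thresholds (assumed)."""
    h = HYDROPATHY[aa]
    if h > 1.0:
        return "hydrophobic"
    if h < -1.0:
        return "hydrophilic"
    return "neutral"


def group_features(seq):
    """Return [count, mean position, scaled positional variance] per group.

    Positions are 1-based. The variance is divided by (count * length),
    which is an assumed normalization chosen to keep values comparable
    across sequences of different lengths.
    """
    n = len(seq)
    positions = defaultdict(list)
    for i, aa in enumerate(seq.upper(), start=1):
        if aa in HYDROPATHY:  # skip ambiguous or non-standard symbols
            positions[hydropathy_group(aa)].append(i)

    vec = []
    for g in GROUPS:
        pos = positions[g]
        count = len(pos)
        if count == 0:
            vec.extend([0.0, 0.0, 0.0])
            continue
        mean = sum(pos) / count
        spread = sum((p - mean) ** 2 for p in pos) / (count * n)
        vec.extend([float(count), mean, spread])
    return vec


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


if __name__ == "__main__":
    a = group_features("MVHLTPEEKSAVTALWGKV")
    b = group_features("MVLSPADKTNVKAAWGKVG")
    print(len(a), euclidean(a, b))
```

Because every sequence maps to a fixed-length vector, pairwise distances can be computed without alignment, and the resulting distance matrix can be passed to any standard clustering or tree-building routine. The two short strings in the usage example are arbitrary test inputs, not data from the paper.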