Abstract

Protein remote homology detection refers to detecting structural homology between proteins that share very little sequence similarity. Such detection is primarily conducted using three kinds of methods: pairwise sequence comparisons, generative models for protein families, and discriminative classifiers. In this study, we adopted a discriminative classification method that uses N-grams to extract features and a random forest algorithm to classify the data set. Experiments on the SCOP 1.53 data set showed that our approach improved the receiver operating characteristic (ROC) score by 6% compared with well-known methods. To determine a score threshold that could be used to divide the data set, we also used a heuristic method, with which the precision on positive examples and the recall reached 0.5647 and 0.8647, respectively. Few other studies have investigated the recall and precision of such examples.
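As a rough illustration of the two ingredients named above, the following minimal sketch counts overlapping N-grams over the 20 standard amino acids and computes precision and recall for the positive class at a given score threshold. The function names, the default n=2, and the assumption that higher scores indicate positive examples are illustrative choices and are not taken from the study. The random forest step is omitted; the scores are assumed to come from any classifier that outputs a per-sequence score.

```python
from collections import Counter
from itertools import product

# The 20 standard amino acids; the alphabet used in the study is an assumption.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def ngram_features(sequence, n=2):
    """Return a fixed-length vector of overlapping n-gram counts."""
    vocab = ["".join(p) for p in product(AMINO_ACIDS, repeat=n)]
    counts = Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))
    return [counts[gram] for gram in vocab]


def precision_recall(scores, labels, threshold):
    """Precision and recall of the positive class (label 1) when
    every example with score >= threshold is predicted positive."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Sweeping the threshold over the observed scores and inspecting the resulting precision and recall pairs is one simple way to pick a cut-off; the heuristic used in the study may differ.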