Abstract

In this work, we systematically analyzed the distribution of pairs of neighboring amino acids (dipeptides) in the sequences of thermophilic and mesophilic proteins. We observed that the occurrences of EE, KK, RR, PP, KI, VV, VE, KE and VK were significantly higher in thermophilic proteins, while the occurrences of QQ, AA, EQ, LL, QA, QL, NN, KQ, QG, RQ, QT and AQ were significantly lower. We examined the mechanism of thermostability and suggest that dipeptide composition contains more information than amino acid composition. Based on dipeptide composition, we developed a statistical method for discriminating between thermophilic and mesophilic proteins. The accuracy of our method on the training dataset was 86.3%, and its accuracy on two independent testing datasets was 85.5% and 89.7%, respectively. The influence of some specific dipeptides on prediction accuracy is also discussed.
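As an illustration of the feature underlying the method, the dipeptide composition of a sequence can be sketched as follows. This is a minimal sketch under stated assumptions, not the procedure used in this work: we assume overlapping adjacent pairs, the 20 standard amino acids, and normalization by the number of valid pairs. The function name `dipeptide_composition` is hypothetical.

```python
from itertools import product

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 possible ordered dipeptides (EE, EK, KE, ...).
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return the relative frequency of each of the 400 dipeptides.

    Assumption: pairs are overlapping neighbors (positions i and i+1),
    and pairs containing a non-standard residue are skipped.
    """
    seq = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        # Sequence too short or no valid pairs: all frequencies are zero.
        return {dp: 0.0 for dp in DIPEPTIDES}
    return {dp: count / total for dp, count in counts.items()}


if __name__ == "__main__":
    comp = dipeptide_composition("EEKKVE")
    # Print only the dipeptides that actually occur in the toy sequence.
    for dp, freq in comp.items():
        if freq > 0:
            print(dp, round(freq, 3))
```

The resulting 400-dimensional vector could then serve as input to a statistical discriminator; the choice of discriminator is not specified here.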