Abstract

An integrated approach is proposed for predicting the chromatographic retention times of oligonucleotides with quantitative structure-retention relationship (QSRR) models. First, the primary base sequence of each oligonucleotide is translated into a vector of scores of generalized base properties (SGBP), which summarize physicochemical, quantum chemical, topological, and spatial structural properties, among others. The sequence data are then transformed into a uniform matrix by auto cross covariance (ACC). Because ACC captures the interactions between bases separated by a given distance in an oligonucleotide sequence, it takes neighboring effects into account. Next, a genetic algorithm selects the variables related to the chromatographic retention behavior of the oligonucleotides. Finally, a support vector machine is used to build the QSRR models that predict retention behavior. The whole dataset is divided into pairs of training and test sets in different proportions. QSRR models built from more than 26 training samples are found to have adequate external predictive power and to accurately represent the relationship between sequence and structural features and retention times. The results indicate that SGBP-ACC is a useful structural representation method for QSRR of oligonucleotides, offering rich structural information, easy manipulation, and high characterization competence. The method can further be applied to predict the chromatographic retention behavior of oligonucleotides.
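The ACC step described above can be illustrated with a minimal sketch. It assumes one common form of the transform, in which the feature for descriptors j and k at a given lag is the sum of z[i][j] * z[i + lag][k] over all positions i, divided by (n - lag), where n is the sequence length. The abstract does not give the exact formula or the SGBP values, so the normalization is an assumption and the per-base scores below are hypothetical placeholders, not the published descriptors. The names `PLACEHOLDER_SCORES`, `encode`, and `acc_transform` are illustrative only.

```python
# Hypothetical two-component scores per base. These are placeholders for
# illustration only and are NOT the SGBP values used in the paper.
PLACEHOLDER_SCORES = {
    "A": [1.0, 0.0],
    "C": [0.0, 1.0],
    "G": [-1.0, 0.0],
    "T": [0.0, -1.0],
}


def encode(sequence, scores=PLACEHOLDER_SCORES):
    """Translate a base sequence into a list of per-base score vectors."""
    return [scores[base] for base in sequence.upper()]


def acc_transform(sequence, max_lag, scores=PLACEHOLDER_SCORES):
    """Return a fixed-length ACC feature vector for one sequence.

    Assumed form: for each lag and each descriptor pair (j, k), average the
    product z[i][j] * z[i + lag][k] over the n - lag available positions.
    The output length is max_lag * d * d, independent of sequence length.
    """
    z = encode(sequence, scores)
    n = len(z)
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the sequence length")
    d = len(z[0])
    features = []
    for lag in range(1, max_lag + 1):
        for j in range(d):
            for k in range(d):
                total = sum(z[i][j] * z[i + lag][k] for i in range(n - lag))
                features.append(total / (n - lag))
    return features


if __name__ == "__main__":
    # Sequences of different lengths map to vectors of the same length,
    # which is what makes the resulting data matrix uniform.
    for seq in ("ACGTAC", "ACGTACGTTTGA"):
        print(seq, len(acc_transform(seq, max_lag=2)))
```

Terms with j equal to k correspond to auto covariance and terms with j different from k to cross covariance. In a pipeline like the one described, each row of the resulting matrix would then be passed on to variable selection and regression.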

Full text