Abstract

Linkage studies have mapped many disease genes to one or more specific chromosomal regions. Identifying the disease genes within these regions is one of the most important tasks in bioinformatics research. Among the approaches reported recently, methods based on sequence characteristics have the widest range of application. However, their accuracy is usually low, because they consider only the overall differences between disease genes and non-disease genes rather than the characteristics specific to individual diseases. To address this problem, the statistical characteristics of the protein sequences of disease genes and non-disease genes were analyzed. The analysis showed that genes responsible for the same disease often share a distinctive pattern of amino acid usage: the amino acid usage of a gene is similar to that of other genes responsible for the same disease but markedly different from that of other genes. An algorithm based on these amino acid usage characteristics was developed, and cross-validation was performed on a set of 208 genes involved in 55 diseases with significant amino acid usage characteristics. In this test, 15.4% of the target genes ranked first, and 44.2% ranked in the top 5%. For diseases with significant amino acid usage characteristics, this approach showed promising performance compared to other methods.
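The abstract does not specify the algorithm, so the following is only a minimal sketch of the general idea: candidate genes are ranked by how closely their amino acid usage matches that of genes already known to cause the disease. The function names, the use of a mean usage profile, the Euclidean distance, and the toy sequences are all assumptions made for illustration, not the authors' method.

```python
from collections import Counter
import math

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def usage_profile(sequence):
    """Return the relative frequency of each standard amino acid in a protein sequence."""
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


def mean_profile(profiles):
    """Average several usage profiles position by position."""
    n = len(profiles)
    return [sum(p[i] for p in profiles) / n for i in range(len(AMINO_ACIDS))]


def distance(p, q):
    """Euclidean distance between two usage profiles (an assumed similarity measure)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def rank_candidates(known_disease_seqs, candidate_seqs):
    """Rank candidates from most to least similar to the known disease genes.

    known_disease_seqs: list of protein sequences of known genes for one disease.
    candidate_seqs: dict mapping a candidate gene name to its protein sequence.
    Returns a list of (name, distance) pairs, smallest distance first.
    """
    reference = mean_profile([usage_profile(s) for s in known_disease_seqs])
    scored = [
        (name, distance(usage_profile(seq), reference))
        for name, seq in candidate_seqs.items()
    ]
    return sorted(scored, key=lambda item: item[1])


# Hypothetical toy data, not taken from the study.
known = ["AAAAKKKK", "AAAKKKKK"]
candidates = {"gene_1": "AAKKAAKK", "gene_2": "WWWWYYYY"}
ranking = rank_candidates(known, candidates)
```

In a cross-validation setting like the one described, each known disease gene would be held out in turn, placed among the candidate genes of its linkage region, and its position in such a ranking recorded.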

  • Publication date: 2012-5
