Sparse nonnegative matrix factorization for protein sequence motif discovery

Kim Wooyoung*; Chen Bernard; Kim Jingu; Pan Yi; Park Haesun

doi:10.1016/j.eswa.2011.04.133

Abstract

Discovering motifs from protein sequences is a critical and challenging task in bioinformatics. The task involves clustering relatively similar protein segments from a huge collection of protein sequences and culling high-quality motifs from the resulting set of clusters. A granular computing strategy combined with the K-means clustering algorithm was previously proposed for the task, but this strategy requires a manual selection of biologically meaningful clusters, which are then used as the initial condition. This manipulated clustering method is undisciplined as well as computationally expensive. In this paper, we use sparse nonnegative matrix factorization (SNMF) to cluster a large protein data set. We show how to combine this method with the Fuzzy C-means algorithm and incorporate bio-statistics information to increase the number of clusters whose structural similarity is high. Our experimental results show that an SNMF approach provides better protein groupings in terms of similarity in secondary structure while maintaining similarity in protein primary sequence.
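To make the clustering idea concrete, the following is a minimal sketch of sparse NMF used as a clustering step. It is not the authors' implementation: it swaps in simple multiplicative updates with an L1 penalty on the coefficient matrix, and the function names (`sparse_nmf`, `cluster_labels`), the parameter `beta`, and the layout of the data matrix (one column per protein segment) are all assumptions made for illustration.

```python
import numpy as np

def sparse_nmf(A, k, beta=0.1, n_iter=200, seed=0, eps=1e-9):
    """Approximate A (features x samples) as W @ H with W, H >= 0.

    Illustrative multiplicative updates for
        0.5 * ||A - W H||_F^2 + beta * sum(H).
    The L1 term on H pushes each sample to load on few components.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    for _ in range(n_iter):
        # Update H; beta in the denominator shrinks small coefficients.
        H *= (W.T @ A) / (W.T @ W @ H + beta + eps)
        # Update W with the standard Frobenius-norm rule.
        W *= (A @ H.T) / (W @ H @ H.T + eps)
        # Rescale so each column of W has unit norm; W @ H is unchanged.
        norms = np.linalg.norm(W, axis=0) + eps
        W /= norms
        H *= norms[:, None]
    return W, H

def cluster_labels(H):
    """Assign each sample (column of H) to its largest component."""
    return np.argmax(H, axis=0)
```

With a nonnegative matrix `A` whose columns encode protein segments (an assumed encoding), `W, H = sparse_nmf(A, k)` followed by `cluster_labels(H)` yields one hard cluster index per segment; the columns of `H` could instead be passed to a soft clustering method such as Fuzzy C-means.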

Publication date: 2011-9-15
