Mining frequent subtree on paging XML data stream

Lei Xiangxin<sup>*</sup>; Yang Zhiying; Huang Shaoyin; Hu Yunfa

Abstract

With the widespread use of XML data streams, discovering knowledge from them has become important. Compared with other kinds of frequent pattern mining, mining frequent subtrees over large-scale XML documents and unboundedly growing XML data streams faces particular difficulties: a data stream cannot be parsed in memory as a whole, and mining a partitioned XML data stream must take the semi-structured nature of XML data into account. To address these difficulties, Tmlist is proposed for mining frequent subtrees over paged XML data streams. Tmlist pages the XML data stream, manages cross-page nodes and frequent candidate subtrees that grow across pages, and mines frequent subtrees page by page. Frequent candidate subtrees grow by inserting frequent candidate nodes into their rightmost paths according to the level of their roots, which avoids repeated recursive growth of subtrees rooted at low-level nodes. A subtree is represented by the topological sequence of its rightmost path, which avoids prefix matching when a subtree is extended and so cuts the cost of storing and matching prefix nodes. Frequent candidate subtrees are selected according to the page minimum support; the support of frequent subtrees is decayed, and branches are pruned, according to the decay factor. As a result, Tmlist reduces the memory cost of mining frequent subtrees within a bounded error and improves memory utilization and mining efficiency.
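The abstract does not give Tmlist's formulas, so the following is only a minimal sketch of the general idea of page-by-page support counting with decay and pruning, not the authors' implementation. It assumes that each page has already been reduced to a list of hashable pattern keys standing in for encoded subtrees; the function name `mine_pages`, its parameters (`decay`, `page_min_support`, `prune_below`), and the multiplicative decay rule are all hypothetical choices made for illustration.

```python
from collections import Counter


def mine_pages(pages, decay=0.5, page_min_support=2, prune_below=1.0):
    """Toy page-by-page miner over pre-extracted pattern keys.

    Each page is a list of hashable keys (stand-ins for encoded subtrees).
    The decay rule and all parameter names are assumptions, not Tmlist's.
    """
    support = {}  # pattern key -> decayed support
    for page in pages:
        # Decay the support accumulated on earlier pages.
        for key in support:
            support[key] *= decay
        # Count occurrences within the current page only.
        counts = Counter(page)
        for key, n in counts.items():
            # A new candidate is admitted only if it meets the page minimum
            # support; patterns already tracked always accumulate.
            if key in support or n >= page_min_support:
                support[key] = support.get(key, 0.0) + n
        # Prune patterns whose decayed support fell below the threshold.
        support = {k: v for k, v in support.items() if v >= prune_below}
    return support


if __name__ == "__main__":
    stream = [["a", "a", "b"], ["a", "c", "c"], ["c", "c"], ["c"]]
    print(mine_pages(stream))
```

Only the current page and the table of surviving patterns are held in memory at any time, which is the property that makes a paged approach attractive for unbounded streams. The price is approximate counts: a pattern that never reaches `page_min_support` within a single page is never admitted, however often it occurs overall.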

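Publication date: 2012
Affiliations: Fudan University; Peking University; East China University of Science and Technology; Shanghai Maritime University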

Record last updated: 2018-08-07 12:11
