A novel web page categorization algorithm based on block propagation using query-log information

Authors: Dai, Wenyuan*; Yu, Yong; Zhang, Cong Le; Han, Jie; Xue, Gui Rong
Source: ADVANCES IN WEB-AGE INFORMATION MANAGEMENT, PROCEEDINGS, SPRINGER-VERLAG BERLIN, HEIDELBERGER PLATZ 3, D-14197 BERLIN, GERMANY, 435-446, 2006.

Abstract

Most existing web page classification algorithms, whether content-based, link-based, or based on query-log analysis, treat pages as the smallest units. However, web pages usually contain noisy or biased information that can degrade classification performance. In this paper, we propose a Block Propagation Categorization (BPC) algorithm that mines web structure deeply and treats blocks as the basic semantic units. Moreover, using query-log information, BPC propagates only suitable information (blocks) among web pages to emphasize their topics. We also optimize the BPC algorithm to significantly speed up the block propagation process without losing any precision. Our experiments on ODP and the MSN search engine log show that BPC achieves a substantial improvement over traditional approaches.