Abstract

Existing Chinese webpage duplicate-removal approaches are constrained by page scale and algorithm efficiency. To address these problems, we propose an efficient distributed parallel duplicate elimination approach based on a Linux cluster. The approach not only overcomes the memory limitations and heavy computation caused by the huge data scale, but also examines the time-sequence problems that arise in distributed parallel computing and gives an effective solution. Experimental results on a dataset of 10 million webpages show that the proposed approach removes duplicates from massive web pages accurately and effectively.
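The abstract does not describe the paper's actual signature or partitioning scheme, so the following is only a minimal sketch of one common way to parallelize duplicate removal across cluster nodes: each page is routed to a worker by a content signature, and each worker then removes duplicates locally within its partition. The MD5-based signature, the function names, and the toy data below are illustrative assumptions, not the authors' method.

import hashlib
from collections import defaultdict

# Hypothetical helper: compute a content signature for a page.
# The paper's actual fingerprinting scheme is not given in the abstract.
def page_signature(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def partition_by_signature(pages, num_workers):
    """Route each page to a worker by hashing its signature,
    so all potential duplicates land on the same node."""
    partitions = defaultdict(list)
    for page_id, text in pages:
        sig = page_signature(text)
        worker = int(sig, 16) % num_workers
        partitions[worker].append((page_id, sig))
    return partitions

def deduplicate_partition(records):
    """Local duplicate removal on one worker: keep the first page
    seen for each signature, report the rest as duplicates."""
    seen = {}
    duplicates = []
    for page_id, sig in records:
        if sig in seen:
            duplicates.append((page_id, seen[sig]))
        else:
            seen[sig] = page_id
    return duplicates

if __name__ == "__main__":
    pages = [
        ("p1", "北京新闻 今日要闻"),
        ("p2", "北京新闻 今日要闻"),   # exact duplicate of p1
        ("p3", "上海新闻 财经快讯"),
    ]
    for worker, records in partition_by_signature(pages, num_workers=4).items():
        for dup, original in deduplicate_partition(records):
            print(f"worker {worker}: {dup} duplicates {original}")

Partitioning by signature keeps each node's working set small enough to fit in memory, which is the kind of memory-limitation problem the abstract refers to; the exact near-duplicate detection logic used in the paper may differ.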

  • Publication date: 2011

Full text