A novel approach for Web page modeling in personal information extraction

Authors: Wei Yuliang; Zhou Qi; Lv Fang; Han Xixian; Xin Guodong; Wang Bailing*
Source: World Wide Web-internet and Web Information Systems, 2019, 22(2): 603-620.
DOI: 10.1007/s11280-018-0631-9

Abstract

The goal of personal information extraction (PIE) is to extract content associated with a name from Web pages. Available Web page models, which are also widely used in content extraction and automatic wrapper algorithms, include the text model, the document object model, and the vision-based page segmentation model. Because existing models focus on Web structure rather than semantic relevance, they are difficult to apply directly to PIE. To deal with this problem, we introduce the sequence block model (SBM), with which it is easy to determine the relevance of each page block to the name being retrieved. We then define PIE on the basis of the SBM. Exploiting the sequential correlation in the SBM, we design a 4-layer seq2seq deep learning network for PIE. Experimental results show that the new model extracts twice as much data as content extraction algorithms, and that the recall of the network is 7% higher than that of the traditional model with a classification algorithm.
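
As a rough illustration of the idea of treating a page as a sequence of blocks and deciding, block by block, whether each one is relevant to the name, the following Python sketch may help. It is not the authors' method: the abstract does not specify how blocks are built, and the paper labels them with a 4-layer seq2seq network, whereas this sketch substitutes a plain keyword match as a placeholder. The names `Block`, `page_to_blocks`, and `label_blocks`, and the choice to split on blank lines, are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    index: int  # position of the block in the page sequence
    text: str   # visible text of the block


def page_to_blocks(page_text: str) -> List[Block]:
    """Split plain page text into an ordered sequence of non-empty blocks.

    Assumption: blocks are separated by blank lines. The paper's actual
    block construction is not described in the abstract.
    """
    chunks = [chunk.strip() for chunk in page_text.split("\n\n")]
    non_empty = [chunk for chunk in chunks if chunk]
    return [Block(i, chunk) for i, chunk in enumerate(non_empty)]


def label_blocks(blocks: List[Block], name: str) -> List[int]:
    """Return one 0/1 relevance label per block, in sequence order.

    Placeholder labeller: a block is marked relevant (1) when it contains
    the name, ignoring case. The paper uses a learned seq2seq network here.
    """
    needle = name.lower()
    return [1 if needle in block.text.lower() else 0 for block in blocks]


page = (
    "Home | About | Contact\n\n"
    "Wang Lei is a professor of computer science.\n\n"
    "Copyright 2019"
)
blocks = page_to_blocks(page)
labels = label_blocks(blocks, "Wang Lei")
```

The output is one label per block in page order, which is the shape of problem a sequence-to-sequence labeller would be trained on. A keyword match cannot see context, so it would miss a relevant block that never repeats the name, which is presumably part of the motivation for a model that uses the correlation between neighbouring blocks.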