Data Extraction for Deep Web Using WordNet

Author: Hong Jer Lang*
Source: IEEE Transactions on Systems, Man, and Cybernetics - Part C: Applications and Reviews, 2011, 41(6): 854-868.
DOI: 10.1109/TSMCC.2010.2089678

Abstract

Our survey shows that the techniques used to extract data from the deep web need to be improved to achieve the efficiency and accuracy of automatic wrappers. Further investigation indicates that a lightweight ontological technique built on an existing lexical database for English (WordNet) can check the similarity of data records and detect the correct data region with higher precision by using the semantic properties of these data records. The advantages of this method are that it can extract three types of data records, namely single-section data records, multiple-section data records, and loosely structured data records, and that it provides options for aligning iterative and disjunctive data items. Experimental results show that our technique is robust and performs better than existing state-of-the-art wrappers. Tests also show that our wrapper can extract data records from multilingual web pages and that it is domain independent.
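
To illustrate the general idea of checking the similarity of data records through a lexical hierarchy, here is a minimal sketch. It is not the wrapper described in the paper: the `HYPERNYMS` table is a tiny hypothetical stand-in for WordNet, and the function names and the path-based scoring are assumptions chosen only for illustration.

```python
# Toy hypernym ("is-a") links standing in for WordNet.
# Hypothetical data, not taken from the paper or from WordNet itself.
HYPERNYMS = {
    "novel": "book",
    "textbook": "book",
    "book": "publication",
    "magazine": "publication",
    "publication": "entity",
    "price": "attribute",
    "cost": "attribute",
    "attribute": "entity",
}


def path_to_root(word):
    """Return the chain of hypernyms from a word up to its root."""
    path = [word]
    while path[-1] in HYPERNYMS:
        path.append(HYPERNYMS[path[-1]])
    return path


def word_similarity(a, b):
    """Path-based similarity: 1 / (1 + edges via the nearest shared ancestor)."""
    if a == b:
        return 1.0
    depth_in_b = {w: i for i, w in enumerate(path_to_root(b))}
    for i, w in enumerate(path_to_root(a)):
        if w in depth_in_b:
            return 1.0 / (1 + i + depth_in_b[w])
    return 0.0  # no shared ancestor, e.g. words missing from the table


def record_similarity(record_a, record_b):
    """Average, over words in record_a, of the best match found in record_b."""
    if not record_a or not record_b:
        return 0.0
    best = [max(word_similarity(a, b) for b in record_b) for a in record_a]
    return sum(best) / len(best)
```

Under these assumptions, two candidate records whose words sit close together in the hierarchy (for example `["novel", "price"]` and `["textbook", "cost"]`) score higher than unrelated ones, which is the kind of signal a wrapper could use when deciding whether neighbouring page fragments belong to the same data region. Note that this score is not symmetric in its two arguments.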

  • Publication date: 2011-11