AN EFFICIENT WEB-BASED WRAPPER AND ANNOTATOR FOR TABULAR DATA

Amin Mohammad Shafkat*; Jamil Hasan

doi:10.1142/S0218194010004657

Abstract

In the last few years, several works in the literature have addressed the problem of extracting data from web pages. The problem matters because, once extracted, the data can be handled much like instances of a traditional database, which in turn facilitates web data integration and various other domain-specific applications. In this paper, we propose a novel table extraction technique that works on web pages generated dynamically from a back-end database. The proposed system automatically and efficiently discovers table structure by mining relevant patterns from web pages, and generates a regular expression for the extraction process. Moreover, the system assigns intuitive names to the columns of the extracted table by leveraging the Wikipedia knowledge base for table annotation. To improve the accuracy of the assignment, we exploit the structural homogeneity of the column values and their co-location information to weed out less likely candidates. This approach requires no human intervention, and experimental results have shown its accuracy to be promising. Moreover, the wrapper generation algorithm runs in linear time.

Publication date: 2010-03
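
The abstract describes discovering a repeated table structure and generating a regular expression from it. The following is a minimal illustrative sketch of that general idea, not the authors' algorithm: it assumes rows are delimited by `<tr>` tags, picks the most frequent tag signature among rows, and turns that signature into an extraction regex. The names `tokenize`, `signature`, and `build_row_regex` are hypothetical, and the Wikipedia-based column annotation step is not covered.

```python
import re
from collections import Counter

# A token is either a whole tag or a run of text between tags.
TOKEN_RE = re.compile(r"<[^>]+>|[^<]+")

def tokenize(fragment):
    """Split markup into tag and text tokens, dropping whitespace-only text."""
    return [t for t in TOKEN_RE.findall(fragment) if t.strip()]

def signature(fragment):
    """Replace text tokens with a placeholder so rows with different data compare equal."""
    return tuple(t if t.startswith("<") else "#TEXT" for t in tokenize(fragment))

def build_row_regex(html):
    """Assumption: rows are <tr>...</tr>. Use the most frequent row signature as the pattern."""
    rows = re.findall(r"<tr\b[^>]*>.*?</tr>", html, flags=re.S | re.I)
    common_sig, _ = Counter(signature(r) for r in rows).most_common(1)[0]
    # Text placeholders become capture groups; tags are matched literally.
    parts = [r"([^<]+?)" if tok == "#TEXT" else re.escape(tok) for tok in common_sig]
    return re.compile(r"\s*".join(parts))

html = """
<table>
<tr><th>Name</th><th>Age</th></tr>
<tr><td>Alice</td><td>30</td></tr>
<tr><td>Bob</td><td>25</td></tr>
<tr><td> Carol </td><td>41</td></tr>
</table>
"""

row_re = build_row_regex(html)
rows = row_re.findall(html)
print(rows)  # [('Alice', '30'), ('Bob', '25'), ('Carol', '41')]
```

The header row has a different signature from the data rows, so it is outvoted and not matched by the generated pattern. This sketch makes no claim about running time; the linear-time result stated in the abstract refers to the paper's own wrapper generation algorithm.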

