Bi-languages Mining Algorithm for Extraction Useful Web Contents (BiLEx)

Authors: AL Ghuribi Sumaia Mohammed*; Alshomrani Saleh
Source: Arabian Journal for Science and Engineering, 2015, 40(2): 501-518.
DOI: 10.1007/s13369-014-1530-8

Abstract

Extracting useful Web content is a major step in data mining. The extraction process matters to many technologies and serves as a preprocessing stage for systems such as crawlers and indexers. The extracted content is also needed by end users, especially blind and visually impaired users. Extraction aims to recover useful and meaningful data from Webpages that are surrounded by clutter such as advertisements and navigation menus. Many extraction algorithms are designed for English and perform less efficiently and less accurately on Arabic. This paper presents BiLEx, a bi-language mining algorithm for extracting Web content. It extracts useful Web content from Arabic and English Webpages at approximately the same level of efficiency and accuracy. To test the performance and efficiency of the proposed algorithm, an experiment was run on 600 Webpages chosen randomly from 30 different Websites. The results show that BiLEx achieves high precision, recall, and F1-measure for both Arabic and English Webpages.

  • Publication date: 2015-02