Abstract

Web object extraction has been widely applied in object-oriented search engines to improve search services in specific domains. However, methods for extracting multi-category Web objects, which may span several domains and hundreds of categories, are lacking. When some categories are described in structured Web pages and others in unstructured Web pages, it is difficult to find a single method for extracting record-level Web objects. Moreover, when hundreds of categories span several domains, it is hard to predefine attribute schemas for extracting attribute-level Web objects.
To address this problem, we propose a method for multi-category Web object extraction. First, the method transforms a Web page into an HTML tag tree in which the size of each node is determined by the amount of text it contains. A node's text-support degree is calculated from the node size and used to find and extract unstructured nodes. Similarly, the size similarity of sibling nodes is computed and used to find and extract structured parent nodes. The extracted node with the largest size is then selected as the Web object record. Second, the method uses raw Wikipedia data to construct a relation warehouse of multi-category Web objects and extracts a core relation schema for 400 categories through relation weight calculation and iteration. Finally, it assigns the Web object record to the corresponding category by schema matching, and extracts the core Web object and its related objects from the record using a voting strategy and the core relation schema of that category.
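As a rough illustration of the first step only, the sketch below builds a tag tree whose node size is the number of text characters in the subtree. The abstract does not give the formulas for the text-support degree or the sibling size similarity, so the two functions here are assumed stand-ins: `text_support` is taken to be the share of a node's size held directly by the node, and `sibling_similarity` the ratio of the smallest to the largest child size. All class and function names are hypothetical, and the parser assumes well-formed HTML.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are not pushed onto the tree.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """One node of the HTML tag tree (hypothetical helper class)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.own_text = 0  # characters of text held directly by this node

    @property
    def size(self):
        # Node size = amount of text in the whole subtree.
        return self.own_text + sum(child.size for child in self.children)


class TreeBuilder(HTMLParser):
    """Builds a tag tree; assumes well-formed, properly nested HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.own_text += len(data.strip())


def build_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def text_support(node):
    # Assumed definition (not from the paper): fraction of the node's size
    # that is text held directly by the node itself.
    return node.own_text / node.size if node.size else 0.0


def sibling_similarity(node):
    # Assumed definition (not from the paper): smallest child size divided
    # by largest child size; 1.0 means all children carry equal text.
    sizes = [child.size for child in node.children if child.size > 0]
    if len(sizes) < 2:
        return 0.0
    return min(sizes) / max(sizes)
```

Under these assumed definitions, a node with a high `text_support` would be a candidate unstructured node, while a parent with a high `sibling_similarity` would be a candidate structured parent node; the thresholds and the exact measures would have to come from the full paper.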
In experiments, we tested 1000 Web pages from 20 categories in 3 domains (computer, art, and medicine) and showed that the method effectively extracts multi-category Web objects from both structured and unstructured Web pages with acceptable performance. Core Web object extraction achieves a precision of 0.724 and a recall of 0.600, and related Web object extraction achieves a precision of 0.932 and a recall of 0.886.