A top-down information theoretic word clustering algorithm for phrase recognition

Wu Yu Chieh<sup>*</sup>

doi:10.1016/j.ins.2014.02.033

Abstract

Semi-supervised machine learning methods combine features of supervised and unsupervised learning by integrating labeled and unlabeled training data. In most structural problems, such as natural language processing and image processing, developing labeled data for a specific domain requires a considerable amount of human resources. In this paper, we present a cluster-based method to fuse labeled training data with unlabeled raw data. We design a top-down divisive clustering algorithm that ensures maximal information gain in the use of unlabeled data by clustering similar words. To implement this idea, we design a top-down iterative K-means clustering algorithm to merge word clusters. The derived term groups are then encoded as new features for the supervised learners in order to improve the coverage of lexical information. Without additional training data or external materials, this approach yields state-of-the-art performance on the shallow parsing and base-chunking benchmark datasets (94.50 and 93.12 in F(β) rate, respectively).
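
The general idea of clustering words top-down and then exposing each word's cluster as an extra feature can be illustrated with a minimal sketch. This is not the paper's algorithm: the information-gain criterion is not reproduced, a plain bisecting 2-means split is swapped in, and the toy context vectors, the function names (`two_means`, `top_down_clusters`, `token_features`) and the `C_UNK` label are all hypothetical.

```python
from typing import Dict, List

Vector = List[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors: List[Vector]) -> Vector:
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def two_means(words: List[str], vecs: Dict[str, Vector], iters: int = 10):
    """Split `words` in two with 2-means, using a deterministic start:
    the first word and the word farthest from it seed the centroids."""
    first = vecs[words[0]]
    far = max(words, key=lambda w: sq_dist(vecs[w], first))
    centroids = [first, vecs[far]]
    left: List[str] = []
    right: List[str] = []
    for _ in range(iters):
        left, right = [], []
        for w in words:
            d0 = sq_dist(vecs[w], centroids[0])
            d1 = sq_dist(vecs[w], centroids[1])
            (left if d0 <= d1 else right).append(w)
        if not left or not right:
            break  # degenerate split, e.g. all vectors identical
        centroids = [mean([vecs[w] for w in left]),
                     mean([vecs[w] for w in right])]
    return left, right


def top_down_clusters(vecs: Dict[str, Vector], k: int) -> List[List[str]]:
    """Start from one cluster holding every word and keep splitting
    the largest cluster until k clusters exist (or no split is possible)."""
    clusters = [sorted(vecs)]
    while len(clusters) < k:
        clusters.sort(key=len, reverse=True)
        biggest = clusters[0]
        if len(biggest) < 2:
            break
        left, right = two_means(biggest, vecs)
        if not left or not right:
            break
        clusters = clusters[1:] + [left, right]
    return clusters


# Hypothetical context-count vectors, standing in for statistics
# that would be gathered from unlabeled raw text.
toy_vectors = {
    "the": [5.0, 0.0, 0.0], "a": [4.0, 0.0, 1.0],
    "cat": [0.0, 5.0, 0.0], "dog": [0.0, 4.0, 1.0],
    "runs": [0.0, 0.0, 5.0], "walks": [1.0, 0.0, 4.0],
}

clusters = top_down_clusters(toy_vectors, k=3)
word_to_cluster = {w: f"C{i}" for i, ws in enumerate(clusters) for w in ws}


def token_features(token: str) -> Dict[str, str]:
    """Features for a supervised tagger: the word plus its cluster label."""
    return {"word": token, "cluster": word_to_cluster.get(token, "C_UNK")}


print([token_features(t) for t in ["the", "dog", "walks", "bird"]])
```

In this sketch a rare word such as "dog" shares a cluster feature with "cat", so a supervised chunker trained on one can generalize to the other, which is the sense in which cluster features widen lexical coverage.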

Publication date: 2014-08-10

