Abstract

Traditional categorization algorithms suffer from a shortage of labeled training data, while large amounts of unlabeled data are readily available. We investigate the co-training algorithm and its assumption that the feature set can be split into two compatible and independent views. In practice, however, this assumption is usually violated to some degree, and sometimes no natural feature split exists. We therefore adopt the TEF_WA technique, which uses term evaluation functions to split the feature set and construct multiple views, from which we can choose a pair of views that are compatible and independent to a certain degree. Based on the TEF_WA technique, we develop a semi-supervised categorization algorithm, Co_CLM. Experimental results show that Co_CLM can significantly decrease classification error by exploiting unlabeled data, especially when labeled data is sparse. The results also indicate that Co_CLM performs better the more independent the chosen view pair is.

Full text