Abstract

The use of learning algorithms for text classification assumes the availability of a large number of documents that human experts have organized and labeled correctly for use in the training phase. Unless the text documents in question have been in existence for some time, using an expert system is inevitable, because manually organizing and labeling thousands of groups of text documents is labor-intensive and intellectually challenging. Moreover, in some new domains, the knowledge needed to organize and label the different classes might be unavailable. Unsupervised learning schemes that automatically cluster the data in the training phase are therefore needed. Furthermore, even when such knowledge exists, variation is high when the subject under classification depends on personal opinion and is open to different interpretations. This paper describes a methodology that performs the automatic clustering using either Self-Organizing Maps (SOM) or, alternatively, the Correlation Coefficient (CorrCoef). The resulting clusters are then used as labels to train a Support Vector Machine (SVM). Experiments applying the methodology to some standard text datasets are presented to verify the accuracy of the proposed scheme. We also present results that evaluate the effect of dimensionality reduction and of changes in the clustering scheme on the accuracy of the SVM. The results show that the proposed combination achieves better accuracy than training the learning machine using expert knowledge.
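
To make the cluster-then-train idea concrete, the following is a minimal sketch in plain Python and should not be read as the paper's implementation. It assumes toy term-frequency vectors, a hypothetical greedy rule that assigns a document to the first cluster whose seed has a Pearson correlation of at least 0.5 with it, and a simple linear SVM trained by subgradient descent on the hinge loss. The threshold, the greedy rule, the toy data, and every function name are illustrative assumptions; the abstract does not specify how CorrCoef clustering or SVM training is actually carried out.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def cluster_by_correlation(docs, threshold=0.5):
    """Assumed greedy rule: join the first cluster whose seed document
    correlates with the new document at or above the threshold."""
    seeds, labels = [], []
    for d in docs:
        for k, seed in enumerate(seeds):
            if pearson(d, seed) >= threshold:
                labels.append(k)
                break
        else:  # no seed was similar enough, so start a new cluster
            seeds.append(d)
            labels.append(len(seeds) - 1)
    return labels

def train_linear_svm(X, y, lam=0.01, lr=0.01, epochs=200):
    """Subgradient descent on the L2-regularised hinge loss; y in {-1, +1}."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # inside the margin: hinge term contributes
                w = [wj - lr * (lam * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:           # outside the margin: only regularisation
                w = [wj - lr * lam * wj for wj in w]
    return w, b

def predict(w, b, x):
    """Sign of the linear decision function."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Toy term-frequency vectors (made up for illustration): two latent topics.
docs = [
    [5, 4, 0, 1], [4, 5, 1, 0], [6, 3, 0, 0],
    [0, 1, 5, 4], [1, 0, 4, 5], [0, 0, 3, 6],
]

# Step 1: unsupervised clustering yields pseudo-labels instead of expert labels.
labels = cluster_by_correlation(docs)

# Step 2: the cluster ids serve as training labels (cluster 0 -> +1, else -1).
targets = [1 if k == 0 else -1 for k in labels]
w, b = train_linear_svm(docs, targets)
```

After training, `predict(w, b, x)` assigns an unseen term-frequency vector `x` to one of the two discovered clusters. A multi-class setting, as with real text datasets, would need one such classifier per cluster or an established SVM library.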

  • Publication date: 2016-10-26