Active learning for clinical text classification: is it better than random sampling?

Authors: Figueroa Rosa L; Zeng-Treitler Qing*; Ngo Long H; Goryachev Sergey; Wiechmann Eduardo P
Source: Journal of the American Medical Informatics Association, 2012, 19(5): 809-816.
DOI: 10.1136/amiajnl-2011-000648

Abstract

Objective This study explores active learning algorithms as a way to reduce the requirement for large training sets in medical text classification tasks.

Design Three existing active learning algorithms (distance-based (DIST), diversity-based (DIV), and a combination of both (CMB)) were used to classify text from five datasets. The performance of these algorithms was compared to that of passive learning on the same datasets. We then conducted a novel investigation of the interaction between dataset characteristics and the performance results.

Measurements Classification accuracy and area under the receiver operating characteristic (ROC) curve were generated for each algorithm at different sample sizes. The performance of active learning was compared with that of passive learning using a weighted mean of paired differences. To determine why performance varies across datasets, we measured the diversity and uncertainty of each dataset using relative entropy and correlated the results with the performance differences.

Results The DIST and CMB algorithms performed better than passive learning. With the statistical significance level set at 0.05, DIST outperformed passive learning on all five datasets, while CMB was better than passive learning on four datasets. We found strong correlations between dataset diversity and DIV performance, as well as between dataset uncertainty and DIST performance.

Conclusion For medical text classification, appropriate active learning algorithms can yield performance comparable to that of passive learning with considerably smaller training sets. In particular, our results suggest that DIV performs better on data with higher diversity and DIST on data with lower uncertainty.
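The abstract names two selection heuristics that can be read as follows (an illustrative sketch, not the authors' implementation): distance-based (DIST) selection queries the unlabeled samples closest to the classifier's decision boundary, while diversity-based (DIV) selection greedily queries samples farthest from those already labeled; CMB would combine both criteria. The function names and the max-min diversity formulation below are assumptions for illustration:

```python
import numpy as np

def select_dist(decision_scores, k):
    """Distance-based (DIST) selection: pick the k unlabeled samples
    whose classifier decision scores are closest to zero, i.e. the
    samples nearest the decision boundary (most uncertain)."""
    return list(np.argsort(np.abs(decision_scores))[:k])

def select_div(X_unlabeled, X_labeled, k):
    """Diversity-based (DIV) selection (one common max-min variant,
    assumed here): greedily pick the unlabeled sample whose minimum
    Euclidean distance to the already-labeled set is largest."""
    pool = list(range(len(X_unlabeled)))
    labeled = [np.asarray(x, dtype=float) for x in X_labeled]
    chosen = []
    for _ in range(k):
        # minimum distance from each pool candidate to the labeled set
        dmin = [min(np.linalg.norm(X_unlabeled[i] - s) for s in labeled)
                for i in pool]
        best = pool[int(np.argmax(dmin))]
        chosen.append(best)
        labeled.append(np.asarray(X_unlabeled[best], dtype=float))
        pool.remove(best)
    return chosen

# Example: DIST picks indices 3 and 1 (smallest |score|);
# DIV picks index 1 (farthest from the labeled point at the origin).
scores = np.array([0.9, -0.1, 0.5, -0.05])
print(select_dist(scores, 2))                      # -> [3, 1]
X_pool = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 0.0]])
print(select_div(X_pool, [[0.0, 0.0]], 1))         # -> [1]
```

In an active learning loop, either selector would be called once per iteration on the remaining unlabeled pool, the chosen samples labeled and added to the training set, and the classifier retrained; random sampling (passive learning) is the baseline the paper compares against.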