摘要

We investigate semisupervised learning (SL) and pool-based active learning (AL) of a classifier for domains with label-scarce (LS) and unknown categories, i.e., defined categories for which there are initially no labeled examples. This scenario manifests, e.g., when a category is rare, or expensive to label. There are several learning issues when there are unknown categories: 1) it is a priori unknown which subset of (possibly many) measured features are needed to discriminate unknown from common classes and 2) label scarcity suggests that overtraining is a concern. Our classifier exploits the inductive bias that an unknown class consists of the subset of the unlabeled pool's samples that are atypical (relative to the common classes) with respect to certain key (albeit a priori unknown) features and feature interactions. Accordingly, we treat negative log-p-values on raw features as nonnegatively weighted derived feature inputs to our class posterior, with zero weights identifying irrelevant features. Through a hierarchical class posterior, our model accommodates multiple common classes, multiple LS classes, and unknown classes. For learning, we propose a novel semisupervised objective customized for the LS/unknown category scenarios. While several works minimize class decision uncertainty on unlabeled samples, we instead preserve this uncertainty [ maximum entropy (maxEnt)] to avoid overtraining. Our experiments on a variety of UCI Machine learning (ML) domains show: 1) the use of p-value features coupled with weight constraints leads to sparse solutions and gives significant improvement over the use of raw features and 2) for LS SL and AL, unlabeled samples are helpful, and should be used to preserve decision uncertainty (maxEnt), rather than to minimize it, especially during the early stages of AL. Our AL system, leveraging a novel sample-selection scheme, discovers unknown classes and discriminates LS classes from common ones, with sparing use of oracle labeling.

  • 出版日期2017-4