Abstract
Information extraction from tweets has in recent years posed a challenge for researchers in knowledge discovery and data mining. Unlike formal text such as news articles and other long-form content, tweets are short, noisy, and dynamic in content, which makes traditional natural language processing algorithms difficult to apply. Active learning is well suited to many natural language processing problems, especially when unlabeled data is abundant but labeled data is limited. The method proposed here aims to minimize annotation cost while maximizing model performance. It recognizes named entities in Twitter tweet streams using active learning with several query strategies: tweets are selected for labeling by a human annotator through query-by-committee, uncertainty-based sampling, and diversity-based sampling. In experimental evaluations on tweet data, the proposed method achieved better results than random sampling.
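As an illustration only, two of the query strategies named above might be scored as in the following minimal sketch. It is not the authors' implementation: the function names, the use of least confidence as the uncertainty measure, the use of vote entropy as the committee disagreement measure, and the simplification of one label distribution per tweet are all assumptions. Diversity-based sampling is not covered.

```python
import math

# Hypothetical helpers; none of these names come from the paper.

def least_confidence(probs):
    """Uncertainty score: 1 minus the probability of the most likely label."""
    return 1.0 - max(probs)

def vote_entropy(votes):
    """Committee disagreement: entropy of the labels voted by the committee members."""
    total = len(votes)
    counts = {}
    for label in votes:
        counts[label] = counts.get(label, 0) + 1
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def select_queries(scores, k):
    """Return the indices of the k highest-scoring items (ties go to the lower index)."""
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return ranked[:k]

# Assumed toy pool: one predicted label distribution per unlabeled tweet.
pool_probs = [[0.9, 0.05, 0.05], [0.4, 0.35, 0.25], [0.6, 0.3, 0.1]]
uncertainty = [least_confidence(p) for p in pool_probs]
to_annotate = select_queries(uncertainty, 2)  # tweets to send to the human annotator
```

In this toy pool the second and third tweets would be sent for annotation, since the model is least confident about them. A random-sampling baseline would instead pick indices without looking at the scores.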
- Publication date: 2017