Distribution-Aware Crowdsourced Entity Collection

Authors: Fan, Ju*; Wei, Zhewei; Zhang, Dongxiang; Yang, Jingru; Du, Xiaoyong
Source: IEEE Transactions on Knowledge and Data Engineering, 2019, 31(7): 1312-1326.
DOI: 10.1109/TKDE.2016.2611509

Abstract

Crowdsourced entity collection solicits people (a.k.a. workers) to complete missing data in a database, and has found many applications in knowledge base completion and enterprise data collection. Although previous studies have attempted to address the "open world" challenge of crowdsourced entity collection, they pay little attention to the "distribution" of the collected entities. In many real applications, however, users have distribution requirements on the collected entities, e.g., an even spatial distribution when collecting points-of-interest. In this paper, we study a new research problem, distribution-aware crowdsourced entity collection (CROWDDEC): given an expected distribution w.r.t. an attribute (e.g., region or year), the goal is to collect a set of entities via crowdsourcing while minimizing the difference between the distribution of the collected entities and the expected distribution. Due to the openness of crowdsourcing, the CROWDDEC problem calls for effective crowdsourcing quality control. We propose an adaptive worker selection approach to address this problem. The approach estimates the underlying entity distribution of each worker on-the-fly based on the entities collected so far. Then, it adaptively selects the set of workers that minimizes the difference from the expected distribution. Once workers submit their answers, it updates the estimates of the workers' underlying distributions for subsequent rounds of worker selection. We prove the hardness of the problem, and develop effective estimation techniques as well as efficient worker selection algorithms to support this approach. We deployed the proposed approach on Amazon Mechanical Turk, and experimental results on two real datasets show that it outperforms the compared approaches in both effectiveness and efficiency.
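
The estimate-then-select loop described in the abstract can be illustrated with a small Python sketch. It is not the paper's algorithm: the Laplace-smoothed estimator, the L1 distance, the equal-contribution mixture, and the greedy selection rule are all assumptions made for illustration, and the names `estimate_distribution`, `l1_distance`, `mixture`, and `select_workers` are hypothetical.

```python
from collections import Counter


def estimate_distribution(observed, categories, alpha=1.0):
    """Laplace-smoothed estimate of one worker's distribution over categories.

    Assumption: a simple smoothed frequency estimate stands in for the
    paper's estimation technique, which the abstract does not specify.
    """
    counts = Counter(observed)
    total = len(observed) + alpha * len(categories)
    return {c: (counts[c] + alpha) / total for c in categories}


def l1_distance(p, q):
    """L1 distance between two distributions (an assumed difference measure)."""
    return sum(abs(p[c] - q[c]) for c in p)


def mixture(dists, categories):
    """Expected distribution if every selected worker contributes equally."""
    n = len(dists)
    return {c: sum(d[c] for d in dists) / n for c in categories}


def select_workers(estimates, target, k):
    """Greedy heuristic: repeatedly add the worker whose inclusion brings
    the mixture of estimated distributions closest to the target."""
    categories = list(target)
    chosen = []
    remaining = set(estimates)
    while remaining and len(chosen) < k:
        best = min(
            sorted(remaining),  # sorted for deterministic tie-breaking
            key=lambda w: l1_distance(
                mixture([estimates[x] for x in chosen + [w]], categories),
                target,
            ),
        )
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Toy data: entities already submitted by three hypothetical workers.
categories = ["north", "south"]
submissions = {
    "a": ["north"] * 4,
    "b": ["south"] * 4,
    "c": ["north", "north", "north", "south"],
}
estimates = {
    w: estimate_distribution(obs, categories) for w, obs in submissions.items()
}
target = {"north": 0.5, "south": 0.5}
selected = select_workers(estimates, target, k=2)
```

On this toy input the greedy rule first picks worker `c`, whose individual estimate is closest to the target, and then `b`, although the pair `a` and `b` would match the target exactly. Greedy choices can therefore be suboptimal, which is in line with the abstract's remark that the selection problem is hard. In an adaptive setting, `estimate_distribution` would be re-run after each round of submissions before selecting again.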