A novel approach for entity resolution in scientific documents using context graphs

Authors: Huang, Changqin*; Zhu, Jia; Huang, Xiaodi; Yang, Min; Fung, Gabriel; Hu, Qintai
Source: Information Sciences, 2018, 432: 431-441.
DOI: 10.1016/j.ins.2017.12.024

Abstract

Entity resolution refers to disambiguating and resolving entities in structured and unstructured data. Developing effective resolution algorithms is important for processing scientific documents, particularly biomedical literature. Specifically, resolving name ambiguity among biomedical entities is a primary task in the knowledge extraction process. In this paper, we present a novel approach to disambiguating gene/protein names by using context graphs. A set of document abstracts is used to build the context graphs by uncovering the indirect co-occurrence relationships among words. Feature vectors of the graphs are then constructed according to the information gain (IG) on the word set. To evaluate the IG values, we propose a new metric that integrates word frequency (WF), dispersion degree (DD) and concentration degree (CD). Finally, entity resolution is performed by applying a support vector machine (SVM). Compared to existing approaches, the proposed method is capable of discovering latent information from the context of entity names, rather than relying only on statistical information such as the number of occurrences of words. Results from comprehensive experiments on two benchmark datasets indicate that, compared to several existing solutions, the proposed method is promising for resolving ambiguous entities.
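
To make the feature-selection step more concrete, the following is a minimal sketch of how words could be ranked by information gain before being used as features. It uses the standard textbook definition of IG over word presence and absence, not the WF/DD/CD metric proposed in the paper, whose formulas are not given in this abstract. The function names, the toy documents and the labels `sense_a` and `sense_b` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(docs, labels, word):
    """Standard IG of a word's presence/absence with respect to the labels.

    Assumption: this is the textbook definition, not the paper's
    WF/DD/CD-based metric.
    """
    with_word = [y for d, y in zip(docs, labels) if word in d]
    without_word = [y for d, y in zip(docs, labels) if word not in d]
    total = len(labels)
    conditional = 0.0
    for subset in (with_word, without_word):
        if subset:  # skip empty partitions to avoid dividing by zero
            conditional += len(subset) / total * entropy(subset)
    return entropy(labels) - conditional


# Hypothetical toy data: each document is the set of words in one abstract,
# and each label is the sense of an ambiguous gene/protein name in it.
docs = [
    {"kinase", "signal", "cell"},
    {"kinase", "pathway"},
    {"tumor", "cell"},
    {"tumor", "growth"},
]
labels = ["sense_a", "sense_a", "sense_b", "sense_b"]

# Rank the vocabulary by IG; the top-ranked words would form the feature set.
vocabulary = sorted(set().union(*docs))
ranked = sorted(
    vocabulary,
    key=lambda w: information_gain(docs, labels, w),
    reverse=True,
)
```

In this toy example, words that appear in only one sense's documents (such as `kinase`) receive the highest score, while a word spread evenly across both senses (such as `cell`) receives zero. The selected words could then serve as the dimensions of the feature vectors passed to a classifier such as an SVM.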