Abstract

Author name disambiguation plays a very important role in individual-based bibliometric analysis, but it has long suffered from a lack of information. Some studies have therefore successfully leveraged external web sources to obtain additional evidence. The main problem with this strategy is generally the high cost of extracting data from web pages, owing to their diverse designs. To address this challenge, we employed ResearchGate (RG), a social network platform for scholars that presents their publication lists in a structured way. Even though the platform may be imperfect, it can be valuable when used alongside traditional approaches for the purpose of confirmation. To this end, in our first (retrieval) stage we applied a graph-based machine learning approach, connected components (CC), to form clusters. In stage 2, the data crawled from RG for the same authors were combined with the CC results. We observed that 76.40% of the clusters formed by CC were confirmed by the RG data, and that these clusters accounted for 68.33% of all citations. Next, to examine the details, we drew a subset from the dataset by retaining only those clusters with at least 10 members. For this subset we additionally employed the Google Custom Search Engine (CSE) API to access authors' web pages as a complementary tool to RG. We observed an F score of 0.95 when the CC results were confirmed by RG and CSE, and almost the same success when only the CC approach was applied. In addition, the publications identified and confirmed through the external sources were cited to a greater extent than those not found in the related external sources. Even though the results are promising, there are still issues with the use of external sources. Many authors present only a few selected papers on the web, which prevents our procedure from obtaining their entire publication lists. Missing publications affect bibliometric analysis adversely, since all citation data are required. That is, if only the data confirmed via external sources are used, bibliometric indicators will be overestimated. On the other hand, our suggested methodology can potentially decrease the manual work required for individual-based bibliometric analysis. The procedure may also yield more reliable results by confirming cluster members derived from unsupervised grouping methods. This approach might be especially beneficial for large datasets, where extensive manual work would otherwise be required.
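The two-stage idea described above (cluster with connected components, then confirm clusters against an external publication list) can be illustrated with a minimal sketch. This is not the authors' implementation: the paper identifiers, the links between papers, the function names `connected_components` and `confirmed_share`, and the confirmation rule (a cluster counts as confirmed if at least one member appears in the external list) are all assumptions made for illustration.

```python
from collections import defaultdict


def connected_components(nodes, edges):
    """Group nodes into clusters using union-find over the given edges."""
    parent = {n: n for n in nodes}

    def find(x):
        # Follow parent pointers to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    clusters = defaultdict(set)
    for n in nodes:
        clusters[find(n)].add(n)
    # Sort members and clusters so the output order is deterministic.
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: c[0])


def confirmed_share(clusters, external_ids):
    """Fraction of clusters with at least one member in the external list.

    The 'at least one member' rule is an assumed confirmation criterion.
    """
    if not clusters:
        return 0.0
    hits = sum(1 for c in clusters if any(p in external_ids for p in c))
    return hits / len(clusters)


# Hypothetical toy data: five papers, linked where evidence suggests
# they share an author (for example, a common co-author).
papers = ["p1", "p2", "p3", "p4", "p5"]
links = [("p1", "p2"), ("p2", "p3"), ("p4", "p5")]

clusters = connected_components(papers, links)

# Hypothetical external list (e.g. papers found on an author's RG profile).
rg_list = {"p2"}
share = confirmed_share(clusters, rg_list)
print(clusters, share)
```

In this toy case only the first cluster contains a paper from the external list, so half of the clusters count as confirmed. A stricter rule (for example, requiring a minimum fraction of members to match) would slot into `confirmed_share` without changing the clustering step.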

  • Publication date: 2017-6