LAGOS-AND: A Large, Gold Standard Dataset for Scholarly Author Name Disambiguation.

2021 
In this paper, we present a method for automatically generating a large-scale labeled dataset for author name disambiguation (AND) in the academic world by leveraging two authoritative sources, ORCID and DOI. Using this method, we built LAGOS-AND, a large, gold standard dataset for AND that differs substantially from existing ones. It contains 7.5M citations authored by 797K unique authors and shows close similarities to the entire Microsoft Academic Graph (MAG) across six gold standard validations. In building the dataset, we investigated the long-standing name synonym problem and, for the first time, quantified the degree of variation in last names. Evidence from PubMed, MAG, and Semantic Scholar all suggests that ~7.5% of authorships use a last name that differs from the credible last name recorded in the ORCID system, even when variants introduced by special characters are ignored. Furthermore, we provide a classification-based AND benchmark on the new dataset and release our model for disambiguation in general scenarios. If this work proves helpful for future studies, we believe it will challenge (1) the widely accepted block-based disambiguation framework used in production environments and (2) the state-of-the-art methods and models for AND. The code, dataset, and pre-trained model are publicly available.
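
As a rough illustration of the last-name comparison described above, the following minimal sketch estimates how often a byline last name differs from the ORCID last name once special-character variants are ignored. It is not the authors' released code: the function names, the input format (pairs of ORCID and byline last names), and the normalization rule (strip diacritics, drop non-letters, fold case) are all assumptions made for this example.

```python
import unicodedata
from typing import Iterable, Tuple


def normalize_last_name(name: str) -> str:
    """Hypothetical normalization: strip diacritics, drop non-letters, fold case."""
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    # Combining marks, hyphens, apostrophes, and spaces are not alphabetic,
    # so this filter removes them and keeps only the base letters.
    letters = "".join(ch for ch in decomposed if ch.isalpha())
    return letters.casefold()


def last_name_varies(orcid_last: str, byline_last: str) -> bool:
    """True if the two names still differ after normalization."""
    return normalize_last_name(orcid_last) != normalize_last_name(byline_last)


def variation_rate(pairs: Iterable[Tuple[str, str]]) -> float:
    """Share of (ORCID last name, byline last name) pairs that vary."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(last_name_varies(a, b) for a, b in pairs) / len(pairs)
```

Letters that have no Unicode decomposition (for example "ø") are left unchanged by this rule, so a real pipeline would likely need an additional transliteration step.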