Learning to Differentiate Between Main-articles and Sub-articles in Wikipedia

2019 
Current Wikipedia editing practice typically describes a named entity with one main article supplemented by multiple sub-articles covering various aspects and subtopics of the entity. This separation of articles aims to improve the curation of content-rich Wikipedia entities. However, a wide range of Wikipedia-based technologies critically rely on the article-as-concept assumption, which requires a one-to-one mapping between entities (or concepts) and the articles that describe them. Current editing practice thus introduces confusion and ambiguity into knowledge representation and causes problems for a wide range of downstream technologies. In this paper, we present an approach that resolves these problems by differentiating the main article from the sub-articles that are not at the core of an entity's representation. We propose a hybrid neural article model that learns from two facets of a Wikipedia article: (i) two neural document encoders capture latent semantic features from the article title and text contents, and (ii) a set of explicit features measures and characterizes the symbolic and structural aspects of each article. In this study, we use crowdsourcing to create a large annotated dataset, which supports feature extraction and the evaluation of a variety of encoding techniques and learning structures. The resulting optimized model identifies main articles with near-perfect precision and recall, and outperforms various baselines on the contributed dataset.
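To make the two-facet design concrete, the following is a minimal PyTorch sketch of a hybrid classifier in the spirit of the abstract: two document encoders (one for the title, one for the body text) whose outputs are concatenated with a projection of explicit structural features before a final main-/sub-article classification layer. The choice of GRU encoders, the dimensions, and the number of explicit features are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class HybridArticleModel(nn.Module):
    """Sketch of a hybrid main-/sub-article classifier: two neural document
    encoders (title and body) combined with explicit structural features."""

    def __init__(self, vocab_size, embed_dim=100, hidden_dim=64, n_explicit_feats=20):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # One encoder per text facet (title, body); GRU is an illustrative choice.
        self.title_encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.body_encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        # Explicit symbolic/structural features (e.g., title-pattern or link-based
        # statistics) pass through a small dense projection.
        self.feat_proj = nn.Linear(n_explicit_feats, hidden_dim)
        # Classifier over the concatenated representation: main vs. sub-article.
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim * 3, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 2),
        )

    def forward(self, title_ids, body_ids, explicit_feats):
        _, title_h = self.title_encoder(self.embedding(title_ids))
        _, body_h = self.body_encoder(self.embedding(body_ids))
        feat_h = torch.relu(self.feat_proj(explicit_feats))
        combined = torch.cat([title_h[-1], body_h[-1], feat_h], dim=-1)
        return self.classifier(combined)


# Toy forward pass with random inputs.
model = HybridArticleModel(vocab_size=5000)
title = torch.randint(1, 5000, (4, 12))   # batch of 4 titles, 12 tokens each
body = torch.randint(1, 5000, (4, 200))   # batch of 4 bodies, 200 tokens each
feats = torch.rand(4, 20)                 # 20 explicit features per article
logits = model(title, body, feats)        # shape: (4, 2)
```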