Web Table Understanding by Collective Inference

2018 
Web tables have become popular and important in many real applications, such as search engines and knowledge base enrichment. Given these benefits, understanding web tables is an urgent need. An important task in web table understanding is column-type detection, which finds the most likely types (categories) to describe the columns of a web table. Some existing studies use knowledge bases to determine the column types. However, this problem poses three challenges: (i) web tables are often too dirty to be understood; (ii) knowledge bases are not comprehensive enough to cover all columns; (iii) both knowledge bases and web tables are extremely large. As a result, traditional approaches suffer from low quality and poor scalability. They also cannot automatically extract the best type from the top-k types. To address these limitations, we propose a collective inference approach (CIA) based on Topic Sensitive PageRank, which considers not only the types of detected columns but also the collective information of web tables to automatically produce more accurate top-k types, especially the top-1 type, for both incorrectly detected columns and undetectable columns whose cells do not exist in the knowledge base. We also propose three methods to improve inference performance, and we implement the techniques of CIA in MapReduce. Experimental results on real-world datasets show that CIA achieves much higher quality in top-1 type detection and in entity enrichment, and significantly outperforms state-of-the-art approaches.
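
For readers unfamiliar with Topic Sensitive PageRank, the following is a minimal sketch of the general technique, not the CIA algorithm itself. It assumes a hypothetical toy graph in which column nodes and candidate type nodes link to each other; the node names, the graph construction, and the function name `topic_sensitive_pagerank` are illustrative assumptions and do not come from the paper. The only difference from ordinary PageRank is that the random jump lands on a chosen set of seed nodes instead of on all nodes uniformly.

```python
def topic_sensitive_pagerank(edges, seeds, damping=0.85, iterations=200, tol=1e-12):
    """Power-iteration PageRank whose teleport step is restricted to `seeds`.

    edges: iterable of (source, target) pairs (directed).
    seeds: non-empty collection of nodes that receive all teleport mass.
    Returns a dict mapping each node to its score; scores sum to 1.
    """
    seeds = set(seeds)
    nodes = sorted({n for edge in edges for n in edge} | seeds)
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    # Teleport distribution: uniform over the seeds, zero elsewhere.
    teleport = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(teleport)

    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for n in nodes:
            if out_links[n]:
                share = damping * rank[n] / len(out_links[n])
                for m in out_links[n]:
                    new_rank[m] += share
            else:
                # Dangling node: send its mass back through the teleport vector.
                for m in nodes:
                    new_rank[m] += damping * rank[n] * teleport[m]
        delta = sum(abs(new_rank[n] - rank[n]) for n in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Hypothetical toy graph: two columns and two candidate types (assumed names).
EDGES = [
    ("col_a", "type:City"), ("type:City", "col_a"),
    ("col_b", "type:City"), ("type:City", "col_b"),
    ("col_b", "type:Country"), ("type:Country", "col_b"),
]

if __name__ == "__main__":
    scores = topic_sensitive_pagerank(EDGES, seeds={"col_a"})
    for node, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(f"{node}: {score:.4f}")
```

In this toy setup, seeding on `col_a` biases the scores toward types connected to that column, which loosely mirrors the idea of letting already-detected columns inform the types of others. How the paper actually builds its graph and chooses seeds is not stated in the abstract.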