Discovering Enterprise Concepts Using Spreadsheet Tables

2017 
Existing work on knowledge discovery focuses on using natural language techniques to extract entities and relationships from textual documents. However, today relational tables are abundant in quantities, and are often well-structured with coherent data values. So far these rich relational tables have been largely overlooked for the purpose of knowledge discovery. In this work, we study the problem of building concept hierarchies using a large corpus of enterprise spreadsheet tables. Our method first groups distinct values from tables into a large hierarchical tre based on co-occurrence statistics. We then "summarize" the large tree by selecting important tree nodes that are likely good concepts based on how well they "describe" the original corpus. The result is a small concept hierarchy that is easy for humans to understand and curate. Our end-to-end algorithms are designed to run on Map-Reduce and to scale to large corpus. Experiments using real enterprise spreadsheet corpus show that proposed approach can generate concepts with high quality.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    47
    References
    5
    Citations
    NaN
    KQI
    []