Spreadsheet Property Detection With Rule-assisted Active Learning

2017 
Spreadsheets are a critical and widely used data management tool. Converting spreadsheet data into relational tables would benefit a number of fields, including public policy, public health, and economics. Research to date has focused on designing domain-specific languages to describe transformation processes or on automatically converting a specific type of spreadsheet. To handle a larger variety of spreadsheets, we have to identify various spreadsheet properties, which correspond to a series of transformation programs that together contribute to a general framework for converting spreadsheets to relational tables. In this paper, we focus on the problem of spreadsheet property detection. We propose a hybrid approach to building a variety of spreadsheet property detectors that reduces the amount of human labeling effort required. Our approach integrates an active learning framework with crude, easy-to-write, user-provided rules, which generate additional high-quality labeled data, especially in the initial training stage. Using a bagging-like technique, our approach can also tolerate lower-quality user-provided rules. Our experiments show that, compared to a standard active learning approach, we reduce the training data needed to reach the performance plateau by 34-44% when a human provides relatively high-quality rules, and by a comparable amount with low-quality rules. A study on a large-scale web-crawled spreadsheet dataset demonstrates that it is crucial to detect a variety of spreadsheet properties in order to transform a large portion of the spreadsheets into a relational form.
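The general idea of combining active learning with user-provided rules can be illustrated with a minimal sketch. It is not the paper's implementation: the nearest-centroid classifier, the single numeric feature, the rule thresholds, and the policy of accepting a rule's label only when the current model agrees with it are all assumptions chosen for brevity, and the bagging-like step for tolerating low-quality rules is omitted.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def centroid(points: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def dist(a: Vector, b: Vector) -> float:
    """Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


class NearestCentroid:
    """Toy stand-in classifier (an assumption, not the paper's model)."""

    def fit(self, X: List[Vector], y: List[bool]) -> "NearestCentroid":
        self.centroids = {
            label: centroid([x for x, l in zip(X, y) if l == label])
            for label in set(y)
        }
        return self

    def predict_with_margin(self, x: Vector) -> Tuple[bool, float]:
        """Return the predicted label and a margin (small = uncertain)."""
        ranked = sorted((dist(x, c), label) for label, c in self.centroids.items())
        best_d, best_label = ranked[0]
        margin = ranked[1][0] - best_d if len(ranked) > 1 else float("inf")
        return best_label, margin


def rule_assisted_active_learning(
    pool: List[Vector],
    seed: Dict[int, bool],
    oracle: Callable[[int], bool],
    rule: Callable[[Vector], Optional[bool]],
    budget: int,
) -> Tuple[Dict[int, bool], int]:
    """Label `pool` using at most `budget` human queries plus rule labels."""
    labeled = dict(seed)
    human_queries = 0
    while human_queries < budget and len(labeled) < len(pool):
        idx = sorted(labeled)
        model = NearestCentroid().fit(
            [pool[i] for i in idx], [labeled[i] for i in idx]
        )
        unlabeled = [i for i in range(len(pool)) if i not in labeled]
        # Uncertainty sampling: ask the human about the least certain item.
        query = min(unlabeled, key=lambda i: model.predict_with_margin(pool[i])[1])
        labeled[query] = oracle(query)
        human_queries += 1
        # Rule assistance (assumed policy): accept a rule's label without
        # human effort only when the current model agrees with it.
        for i in unlabeled:
            if i == query:
                continue
            guess = rule(pool[i])
            if guess is not None and guess == model.predict_with_margin(pool[i])[0]:
                labeled[i] = guess
    return labeled, human_queries


# Hypothetical data: one feature per spreadsheet; the property holds above 0.5.
pool = [[0.05], [0.1], [0.2], [0.45], [0.55], [0.8], [0.9], [0.95]]
truth = [x[0] > 0.5 for x in pool]


def crude_rule(x: Vector) -> Optional[bool]:
    """Easy-to-write rule that abstains on borderline cases."""
    if x[0] > 0.7:
        return True
    if x[0] < 0.3:
        return False
    return None


labels, queries = rule_assisted_active_learning(
    pool, seed={0: False, 7: True}, oracle=lambda i: truth[i],
    rule=crude_rule, budget=5,
)
```

In this toy run the rule labels the clear-cut items, so the human is asked only about the two borderline ones. A real detector would use richer spreadsheet features and a stronger classifier.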