Spreadsheet Property Detection With Rule-assisted Active Learning

Zhe Chen,Sasha Dadiomov,Richard Wesley,Gang Xiao,Daniel Cory,Michael J. Cafarella,Jock Douglas MacKinlay

Spreadsheet Property Detection With Rule-assisted Active Learning

2017

Spreadsheets are a critical and widely-used data management tool. Converting spreadsheet data into relational tables would bring benefits to a number of fields, including public policy, public health, and economics. Research to date has focused on designing domain-specific languages to describe transformation processes or automatically converting a specific type of spreadsheets. To handle a larger variety of spreadsheets, we have to identify various spreadsheet properties, which correspond to a series of transformation programs that contribute towards a general framework that converts spreadsheets to relational tables. In this paper, we focus on the problem of spreadsheet property detection. We propose a hybrid approach of building a variety of spreadsheet property detectors to reduce the amount of required human labeling effort. Our approach integrates an active learning framework with crude, easy-to-write, user-provided rules to save human labeling effort by generating additional high-quality labeled data especially in the initial training stage. Using a bagging-like technique, Our approach can also tolerate lower-quality user-provided rules. Our experiments show that when compared to a standard active learning approach, we reduced the training data needed to reach the performance plateau by 34-44% when a human provides relatively high-quality rules, and by a comparable amount with low-quality rules. A study on a large-scale web-crawled spreadsheet dataset demonstrates that it is crucial to detect a variety of spreadsheet properties in order to transform a large portion of the spreadsheets into a relational form.

Keywords:

Correction
Source
Cite
Save
Machine Reading By IdeaReader

References

Citations