Query-centric regression

2021 
Abstract: Regression Models (RMs), and Machine Learning (ML) models in general, aim to offer high prediction accuracy, even for unforeseen queries/datasets. This depends on their fundamental ability to generalize. However, a model overfitted to the current DB state may be best suited to offer excellent accuracy. This overfit-generalize divide bears many practical implications for data analysts. The paper will reveal, shed light on, and quantify this divide using a large number of real-world datasets and a large number of RMs. It will show that different RMs occupy different positions in this divide, so that different RMs are better suited to answer queries on different parts of the same dataset (as queries typically target specific data subspaces defined using selection operators on attributes). It will study in detail 8 real-life data sets, as well as data from the TPC-DS benchmark, and experiment with various dimensionalities therein. It will employ new, appropriate metrics that reveal the performance differences of RMs, and will substantiate the problem across a wide variety of popular RMs, ranging from simple linear models to advanced, state-of-the-art ensembles (which enjoy excellent generalization performance). It will put forth and study a new, query-centric model that addresses this problem, improving per-query accuracy while also offering excellent overall accuracy. Finally, it will study the effects of scale on the problem and its solutions.