Spark SQL Query Optimization Based on Runtime Statistics Collection

2021 
Spark SQL improves execution efficiency by describing data analysis tasks and optimizing them according to query optimization theory. However, its query optimizer has two deficiencies: the operator must explicitly collect statistics by issuing statistics collection commands, and the collected statistics are inaccurate, which leads to poor optimization results. To solve these problems, this paper proposes an algorithm that collects statistics at runtime and optimizes the query adaptively. The algorithm uses Bloom Filter Pruning (BFP) to prune data that does not meet the join conditions before a join operation is executed, and combines it with an AMS Sketch to estimate the intermediate cardinality of the join more accurately. Finally, a graph-based join plan generation algorithm is proposed that schedules the statistics collection tasks dynamically based on the query statement and then adjusts the query execution plan adaptively according to the collected statistics. Experiments show that, without considering the join order, the BFP algorithm can prune the input of a join by up to 12%. The join plan generation algorithm produces the optimal plan in 14 out of 18 queries without pre-collected statistics and reduces execution time by up to 31%, while the time spent on statistics collection is no more than 5% of the total execution time.
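The pruning idea described above can be illustrated with a minimal sketch. It is not the paper's implementation and does not use Spark: it is plain Python, and the names `BloomFilter`, `bloom_prune`, `hash_join`, the table contents, and the filter parameters (`num_bits`, `num_hashes`) are all assumptions chosen for illustration. A Bloom filter is built over the join keys of one input and used to discard rows of the other input that cannot match before the join runs.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter; the bit array is stored in a Python int."""

    def __init__(self, num_bits=1024, num_hashes=3):
        # Illustrative parameters, not tuned values from the paper.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key!r}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def bloom_prune(rows, key_of, bloom):
    """Keep only rows whose join key may appear on the other side."""
    return [row for row in rows if bloom.might_contain(key_of(row))]


def hash_join(left, right, left_key, right_key):
    """Plain in-memory equi-join, used here only to check the result."""
    index = {}
    for r in right:
        index.setdefault(right_key(r), []).append(r)
    return [(l, r) for l in left for r in index.get(left_key(l), [])]


# Hypothetical inputs: (order_id, customer_id) and (customer_id, name).
orders = [(i, i % 50) for i in range(1000)]
customers = [(c, f"name{c}") for c in range(0, 50, 5)]

# Build the filter on the smaller side's join keys.
bloom = BloomFilter()
for cust in customers:
    bloom.add(cust[0])

# Prune the larger side before joining.
pruned_orders = bloom_prune(orders, lambda o: o[1], bloom)

joined_full = hash_join(orders, customers, lambda o: o[1], lambda c: c[0])
joined_pruned = hash_join(pruned_orders, customers, lambda o: o[1], lambda c: c[0])
```

Because a Bloom filter never reports a present key as absent, `joined_pruned` equals `joined_full`; only non-matching rows (plus occasional false positives) differ between `orders` and `pruned_orders`. The number of rows that survive pruning is also a runtime statistic of the kind the abstract describes feeding back into plan selection.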