Translation of Array-Based Loops to Spark SQL

2020 
Many programs written to analyze data are expressed in terms of array operations in an imperative programming language with loops. However, for data analysts who need to analyze vast volumes of data, large-scale data-intensive processing is becoming a necessity. Hence, they want to convert their programs, originally written to run on a single computer, to work on current Big Data systems, such as Map-Reduce and Spark, so that they can process larger amounts of data. We present a novel framework, called SQLgen, that automatically translates imperative programs with loops and array operations to distributed data-parallel programs. Unlike related work, SQLgen translates these programs to SQL, which can be translated to more efficient code since it can be optimized using a relational database optimizer. SQLgen has been implemented on Spark SQL. We compare the performance of SQLgen with DIABLO, hand-written RDD-based, and Spark SQL programs on real-world problems. SQLgen is up to 78× faster than DIABLO and up to 25× faster than hand-written RDD-based programs, giving performance close to that of hand-written programs in Spark SQL.
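To illustrate the kind of translation the abstract describes, the sketch below contrasts an imperative, array-style loop (the input SQLgen targets) with an equivalent declarative SQL query. This is a hypothetical example, not code from the paper; it runs the SQL on in-memory SQLite purely to demonstrate the equivalence, whereas SQLgen would emit Spark SQL so the relational optimizer can parallelize and optimize the aggregation.

```python
import sqlite3

# Imperative, loop-based program: compute a per-key average by
# accumulating sums and counts in dictionaries (array-style code).
data = [("a", 1.0), ("b", 2.0), ("a", 3.0), ("b", 4.0)]
sums, counts = {}, {}
for k, v in data:
    sums[k] = sums.get(k, 0.0) + v
    counts[k] = counts.get(k, 0) + 1
loop_result = {k: sums[k] / counts[k] for k in sums}

# Equivalent declarative SQL: the whole loop collapses into one
# GROUP BY aggregation, which a database optimizer can plan and
# a system like Spark SQL can execute in parallel.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (k TEXT, v REAL)")
con.executemany("INSERT INTO t VALUES (?, ?)", data)
sql_result = dict(con.execute("SELECT k, AVG(v) FROM t GROUP BY k"))

assert loop_result == sql_result  # both yield {'a': 2.0, 'b': 3.0}
```

The point of translating to SQL rather than directly to RDD operations is exactly this collapse: once the loop is recognized as a group-by aggregation, the query optimizer chooses the execution strategy instead of the programmer.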