Efficient Spark analysis on incremental datasets

2018 
Distributed analysis platforms such as Spark provide unprecedented capacity for big-data processing, especially for Extract-Transform-Load (ETL) workloads, and have won wide recognition in both academia and industry. Spark's performance, however, derives from immutable datasets, sacrificing the flexibility of mutable datasets that undergo small, frequent changes. There is therefore an unmet need for efficiently integrating distributed, dynamically changing datasets across two or more autonomous computer systems or servers. Traditional data-synchronization solutions detect changes and apply them to the target dataset, which is impossible in Spark because the resilient distributed dataset (RDD) is immutable by design. In this paper, we design a paradigm for integrating datasets from several data repositories by extending the RDD mechanism to enable incremental processing. We maintain an appended log for each RDD component, avoiding the severe performance degradation caused by re-fetching the full dataset. Furthermore, we propose a cost model that weighs the cost of merging these changes into an existing RDD against that of building a new RDD from scratch, giving an explicit balance between incremental processing and re-fetching. Our experimental results demonstrate the necessity of incremental processing and the effectiveness of our cost model.
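As a rough illustration of the idea (a sketch, not the authors' implementation), the incremental path can be expressed in plain Spark by unioning a cached base RDD with an RDD built from the appended change log, while a simple cost check decides between that merge and a full re-fetch. All paths, the 10% threshold, and the record layout below are hypothetical assumptions for the example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object IncrementalRddSketch {
  // Hypothetical record type: key/value pairs pulled from a remote repository.
  type Record = (String, String)

  // Minimal parser for "key,value" lines (assumed input format).
  private def parse(line: String): Record = {
    val Array(k, v) = line.split(",", 2)
    (k, v)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("incremental-rdd-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Base snapshot, fetched once and cached; the RDD itself stays immutable.
    val base: RDD[Record] = sc.textFile("hdfs:///repo/snapshot")        // hypothetical path
      .map(parse)
      .persist(StorageLevel.MEMORY_AND_DISK)

    // Changes accumulated since the snapshot, read from the appended log.
    val delta: RDD[Record] = sc.textFile("hdfs:///repo/appended-log")   // hypothetical path
      .map(parse)

    // Toy stand-in for the paper's cost model: merge when the delta is small
    // relative to the base, otherwise rebuild the dataset from scratch.
    val current: RDD[Record] =
      if (delta.count() < 0.1 * base.count())
        // Incremental path: collapse duplicate keys after the union. A real
        // merge would carry versions or timestamps so the newest value wins.
        base.union(delta).reduceByKey((_, newer) => newer)
      else
        sc.textFile("hdfs:///repo/snapshot-latest").map(parse)          // re-fetch path

    println(s"records after update: ${current.count()}")
    spark.stop()
  }
}
```

The key point of the sketch is that no existing RDD is mutated: the merge produces a new RDD derived from the cached base plus the log, which is what makes incremental processing compatible with Spark's immutability.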