Efficient Spark analysis on incremental datasets

2018 
Distributed analysis platforms such as Spark provide unprecedented capacity for big-data processing, especially for Extract-Transform-Load (ETL) workloads, and have won wide recognition in both academia and industry. Spark's performance, however, derives from immutable datasets, sacrificing the flexibility of mutable datasets that undergo small, frequent changes. There is therefore an unmet need for efficiently integrating distributed, dynamically changing datasets across two or more autonomous computer systems or servers. Traditional data-synchronization solutions detect changes and apply them to the target dataset, which is impossible in Spark because the resilient distributed dataset (RDD) is immutable by design. In this paper, we design a paradigm for integrating datasets from several data repositories by extending the RDD mechanism to enable incremental processing. We maintain an appended log for each RDD component, avoiding the severe performance degradation caused by re-fetching the full dataset. Furthermore, we propose a cost model that weighs the cost of merging these changes into an existing RDD against that of building a new RDD from scratch, giving an explicit balance between incremental processing and re-fetching. Our experimental results demonstrate the necessity of incremental processing and the effectiveness of our cost model.
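As a rough illustration of the idea (a sketch, not the authors' implementation), the incremental path can be expressed in plain Spark by unioning a cached base RDD with an RDD built from the appended change log, while a simple cost check decides between that merge and a full re-fetch. All paths, the 10% threshold, and the record layout below are hypothetical assumptions for the example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object IncrementalRddSketch {
  // Hypothetical record type: key/value pairs pulled from a remote repository.
  type Record = (String, String)

  // Minimal parser for "key,value" lines (assumed input format).
  private def parse(line: String): Record = {
    val Array(k, v) = line.split(",", 2)
    (k, v)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("incremental-rdd-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Base snapshot, fetched once and cached; the RDD itself stays immutable.
    val base: RDD[Record] = sc.textFile("hdfs:///repo/snapshot")        // hypothetical path
      .map(parse)
      .persist(StorageLevel.MEMORY_AND_DISK)

    // Changes accumulated since the snapshot, read from the appended log.
    val delta: RDD[Record] = sc.textFile("hdfs:///repo/appended-log")   // hypothetical path
      .map(parse)

    // Toy stand-in for the paper's cost model: merge when the delta is small
    // relative to the base, otherwise rebuild the dataset from scratch.
    val current: RDD[Record] =
      if (delta.count() < 0.1 * base.count())
        // Incremental path: collapse duplicate keys after the union. A real
        // merge would carry versions or timestamps so the newest value wins.
        base.union(delta).reduceByKey((_, newer) => newer)
      else
        sc.textFile("hdfs:///repo/snapshot-latest").map(parse)          // re-fetch path

    println(s"records after update: ${current.count()}")
    spark.stop()
  }
}
```

The key point of the sketch is that no existing RDD is mutated: the merge produces a new RDD derived from the cached base plus the log, which is what makes incremental processing compatible with Spark's immutability.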