Big data analytics on HPC architectures: Performance and cost

2016 
Data-driven science, accompanied by an explosion of petabytes of data, has created a need for dedicated analytics computing resources. Dedicated analytics clusters require large capital outlays due to their expensive hardware requirements. Additionally, if such resources are located far from the data they analyze, they incur substantial data transfer, which has both cost and latency implications. In this paper, we benchmark a variety of high-performance computing (HPC) architectures on classic data science algorithms and conduct a cost analysis of these architectures. We also compare algorithms across analytic frameworks and explore hidden costs in the form of queuing mechanisms. We observe that node architectures with large memory and high memory bandwidth are better suited for big data analytics on HPC hardware. We also conclude that cloud computing is more cost effective for small or experimental data workloads, but HPC is more cost effective at scale. Additionally, we quantify the hidden costs of queuing and how they relate to data science workloads. Finally, we observe that software developed for the cloud, such as Spark, performs significantly worse than pbdR when run in HPC environments.