DRASH: A Data Replication-Aware Scheduler in Geo-Distributed Data Centers

2016 
Driven by the trends of big data and cloud computing, there is a growing demand for processing and analyzing data that are generated and stored across geo-distributed data centers. However, because network bandwidth between data centers is limited and the volume of data spread across different locations keeps growing, it has become increasingly inefficient to aggregate data and perform computations at a single data center. Data-intensive cluster computing systems such as Hadoop commonly distribute computations based on data locality, so that data can be processed where they are stored, which reduces network overhead and improves performance. However, limited work has been done to adapt and evaluate this technique for geo-distributed data centers. In this paper, we propose DRASH (Data Replication-Aware Scheduler), a job scheduling algorithm that enforces data locality to prevent data transfer and exploits data replication to improve overall system performance. Our evaluation, using simulations with realistic workload traces, shows that DRASH can outperform existing approaches by 16% to 60% in average job completion time and achieves greater improvements under higher data replication factors.
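The abstract does not spell out the scheduling policy, so the following is only a minimal sketch of the general idea it describes: each task runs at a data center that already holds a replica of its input (so no data crosses the wide-area network), and among those replica holders the least-loaded one is chosen. The names `DataCenter`, `schedule`, the slot-based load model, and the tie-breaking rule are all hypothetical assumptions for illustration, not the DRASH algorithm itself.

```python
from dataclasses import dataclass, field


@dataclass
class DataCenter:
    """Hypothetical model of a site: a name, a slot count, and queued tasks."""
    name: str
    slots: int
    queue: list = field(default_factory=list)

    def load(self) -> float:
        # Assumed load metric: queued tasks per available slot.
        return len(self.queue) / self.slots


def schedule(tasks, replicas, centers):
    """Assign each task to the least-loaded site holding a replica of its input.

    tasks:    iterable of (task_id, block_id) pairs
    replicas: dict mapping block_id -> set of data center names holding a copy
    centers:  dict mapping data center name -> DataCenter
    Returns a dict mapping task_id -> chosen data center name.
    """
    placement = {}
    for task_id, block_id in tasks:
        # Data locality: only sites that already store the block are candidates.
        candidates = [centers[name] for name in replicas[block_id]]
        # Replica awareness: pick the least-loaded candidate (ties broken by name).
        target = min(candidates, key=lambda dc: (dc.load(), dc.name))
        target.queue.append(task_id)
        placement[task_id] = target.name
    return placement
```

In this toy model, a higher replication factor gives each task more candidate sites and therefore more room to balance load. That is consistent with the abstract's observation that improvements grow with the replication factor, though the paper's actual policy may differ.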