Multi-Stage Distributed Computing for Big Data: Evaluating Connective Topologies

2020 
With the increase in computation- and data-intensive needs, along with the real-time requirements of the big data era, a distributed framework that can handle parallel processing at various levels becomes a necessity. In this paper, we propose a multi-stage framework that leverages distributed computational agents to handle large-scale data in parallel. The computational agents are distributed both in breadth (parallelism) and in depth (successive stages) to pursue the scalability, parallelism, and heterogeneity of modern computational big data workflows. Additionally, the proposed framework allows computation to be interleaved among the stages to speed up processing. This high-throughput computing (HTC) framework is resource efficient, scalable in both depth and breadth, and supports both distributed agents and centralized analytics. Herein, three topologies for connecting computational agents across sequential stages are investigated: fully connected, partition, and mesh connectivity. To evaluate the performance of the proposed framework, a large-scale application performing deep feature extraction and clustering on remote sensing imagery has been adopted as a use case. Various experiments show that the internal performance of the framework, measured by queue size, varies with the type of topology, while the total time of the experiments is affected mostly by the number of agents in each stage, as expected. It has also been observed that the configuration of the partition topology affects the total time. The system is shown to maintain its resource-efficient HTC performance under all connection topologies, wherein measurement metrics such as average CPU usage and average memory usage reveal nearly the same performance for all topologies except the partition topology.
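The three topologies can be pictured as different edge sets between the agents of two consecutive stages. The following Python sketch, using hypothetical names and a simplified edge-list representation (not the authors' implementation), illustrates one plausible way the connectivity patterns differ.

```python
"""Illustrative sketch of the three stage-connection topologies as
producer -> consumer edge sets between two consecutive stages.
All function and variable names are assumptions for illustration."""

def fully_connected(producers, consumers):
    # Every agent in stage i can forward work to every agent in stage i+1.
    return [(p, c) for p in producers for c in consumers]

def partition(producers, consumers, groups):
    # Agents are split into `groups` disjoint partitions; each producer
    # partition feeds only its matching consumer partition.
    edges = []
    for g in range(groups):
        ps = producers[g::groups]  # round-robin assignment to partitions
        cs = consumers[g::groups]
        edges += [(p, c) for p in ps for c in cs]
    return edges

def mesh(producers, consumers, degree=2):
    # Each producer connects to a fixed window of `degree` consumers,
    # wrapping around so every consumer remains reachable.
    n = len(consumers)
    return [(p, consumers[(i + k) % n])
            for i, p in enumerate(producers)
            for k in range(degree)]

if __name__ == "__main__":
    stage1 = [f"agent1_{i}" for i in range(4)]
    stage2 = [f"agent2_{i}" for i in range(4)]
    print(len(fully_connected(stage1, stage2)))        # 16 edges
    print(len(partition(stage1, stage2, groups=2)))    # 8 edges
    print(len(mesh(stage1, stage2, degree=2)))         # 8 edges
```

Under this sketch, the fully connected topology maximizes work-stealing flexibility at the cost of more inter-agent links, while partition and mesh connectivity reduce the edge count, which is consistent with the abstract's observation that the partition configuration influences total time.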