Laika: Efficient In-Place Scheduling for 3D Mesh Graph Computations

2018 
Scientific computing problems are frequently solved using data-graph computations -- algorithms that perform local updates on application-specific data associated with vertices of a graph, over many time steps. The data-graph in such computations is commonly a mesh graph, where vertices have positions in 3D space, and edges connect physically nearby vertices. A scheduler controls the parallel execution of the algorithm. Two classes of parallel schedulers exist: double-buffering and in-place. Double-buffering schedulers do not incur synchronization overheads due to an absence of read-write conflicts, but require two copies of the vertices, as well as a higher iteration count due to a slower convergence rate. Computations for which this difference in convergence rate is significant (e.g., multigrid method) are frequently performed using an in-place scheduler, which incurs synchronization overheads to avoid read-write conflicts on the single copy of vertex data. We present Laika, a deterministic in-place scheduler we created using a principled three-step design strategy for high-performance schedulers. Laika reorders the input graph using a Hilbert space-filling curve to improve cache locality and minimizes parallel coordination overhead by explicitly curbing excess execution parallelism. Consequently, Laika has significantly lower scheduling overhead than alternative in-place schedulers and is even faster per iteration than the parallel double-buffered implementation on a reordered input graph. We derive an improved bound on the expected number of cache misses incurred during a traversal of a graph reordered using a space-filling curve. We also prove that on a mesh graph G = (V, E), Laika performs O(|V| + |E|) total work and achieves linear expected speedup with P = O(|V| / log^2 |V|) workers. On 48 cores, Laika yields 38.4x parallel speedup and empirically fares well against comparably well-engineered alternatives: it runs 6.97--12.60 times faster in geometric mean over a suite of input graphs than other parallel schedulers and 222.57 times faster than the baseline serial implementation.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    36
    References
    3
    Citations
    NaN
    KQI
    []