Fully-Decentralized Coded Computing for Reliable Large-Scale Computing

Haewon Jeong

Fully-Decentralized Coded Computing for Reliable Large-Scale Computing

2020

Haewon Jeong

In this thesis, I ask the question “how do we compute reliably using thousands of distributed, unreliable nodes?” We propose a system-level solution where we add redundant data across distributed nodes using the technique called “coded computing.” Our main contribution is developing strategies for a masterless, fullydecentralizedsetting for important computation primitives in machine learning (ML) and scientific computing applications while minimizing the overhead of coding. For distributed matrix multiplication, we make a fundamental advance by proposingcoded computing strategies that outperform prior works by an unbounded factor, including recently-developed coded computing strategies as well as traditional Algorithm-Based Fault Tolerance (ABFT) strategies. We also propose coded computing schemes for other primitives such as fast Fourier transform (FFT) and matrix QR factorization. Completing computation reliably and in time under diverse unpredictabilities (e.g., stragglers, node failures, bit flips) is becoming a more important problem. The amount of data we collect is growing exponentially and recent developments in ML have enabled utilizing and processing such large quantities of data. This has not only led to an increase in the scale of computing but also the wide popularity of largescalecomputing across our society. I will discuss how masterless coded computing can be a more efficient fault-tolerance technique under growing unpredictability in computing systems, providing both theoretical and experimental evidence.

Keywords:

Correction
Source
Cite
Save
Machine Reading By IdeaReader

References

Citations