Fully-Decentralized Coded Computing for Reliable Large-Scale Computing

2020 
In this thesis, I ask the question “how do we compute reliably using thousands of distributed, unreliable nodes?” We propose a system-level solution where we add redundant data across distributed nodes using the technique called “coded computing.” Our main contribution is developing strategies for a masterless, fullydecentralizedsetting for important computation primitives in machine learning (ML) and scientific computing applications while minimizing the overhead of coding. For distributed matrix multiplication, we make a fundamental advance by proposingcoded computing strategies that outperform prior works by an unbounded factor, including recently-developed coded computing strategies as well as traditional Algorithm-Based Fault Tolerance (ABFT) strategies. We also propose coded computing schemes for other primitives such as fast Fourier transform (FFT) and matrix QR factorization. Completing computation reliably and in time under diverse unpredictabilities (e.g., stragglers, node failures, bit flips) is becoming a more important problem. The amount of data we collect is growing exponentially and recent developments in ML have enabled utilizing and processing such large quantities of data. This has not only led to an increase in the scale of computing but also the wide popularity of largescalecomputing across our society. I will discuss how masterless coded computing can be a more efficient fault-tolerance technique under growing unpredictability in computing systems, providing both theoretical and experimental evidence.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    83
    References
    0
    Citations
    NaN
    KQI
    []