Capturing and Enhancing In Situ System Observability for Failure Detection

Authors:
Peng Huang, Johns Hopkins University
Chuanxiong Guo, Bytedance Inc.
Jacob R. Lorch, Microsoft Research
Lidong Zhou, Microsoft Research
Yingnong Dang, Microsoft Research

Introduction:

Real-world distributed systems suffer unavailability due to various types of failure. But, despite enormous effort, many failures, especially gray failures, still escape detection. Panorama incorporates techniques for making observations even when indirection exists between components.

Abstract:

Real-world distributed systems suffer unavailability due to various types of failure. But, despite enormous effort, many failures, especially gray failures, still escape detection. In this paper, we argue that the missing piece in failure detection is detecting what the requesters of a failing component see. This insight leads us to the design and implementation of Panorama, a system designed to enhance system observability by taking advantage of the interactions between a system's components. By providing a systematic channel and analysis tool, Panorama turns a component into a logical observer so that it not only handles errors, but also reports them. Furthermore, Panorama incorporates techniques for making such observations even when indirection exists between components. Panorama can easily integrate with popular distributed systems and detect all 15 real-world gray failures that we reproduced in less than 7 s, whereas existing approaches detect only one of them in under 300 s.
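
To make the "logical observer" idea concrete, here is a minimal sketch of a requester that both handles an error and reports what it saw about the component it called. Everything in it is a hypothetical illustration and not Panorama's actual API: the names `Observation`, `ObservationStore`, and `observed_call`, the in-memory store, and the simple counting rule in `verdict` are all assumptions made for this example.

```python
import time
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Observation:
    observer: str   # component that made the request
    subject: str    # component that was asked to serve it
    healthy: bool   # what the observer saw
    timestamp: float = field(default_factory=time.time)


class ObservationStore:
    """Hypothetical in-memory store (an assumption, not Panorama's API)."""

    def __init__(self):
        # subject -> {observer -> latest Observation}
        self._by_subject = defaultdict(dict)

    def report(self, obs):
        # Keep only the most recent observation from each observer.
        self._by_subject[obs.subject][obs.observer] = obs

    def verdict(self, subject):
        # Illustrative rule (an assumption): a subject is unhealthy when
        # at least half of its observers' latest reports are negative.
        latest = list(self._by_subject.get(subject, {}).values())
        if not latest:
            return "unknown"
        bad = sum(1 for o in latest if not o.healthy)
        return "unhealthy" if 2 * bad >= len(latest) else "healthy"


def observed_call(store, observer, subject, fn, *args):
    """Run fn on behalf of observer and report what it saw of subject."""
    try:
        result = fn(*args)
    except Exception:
        store.report(Observation(observer, subject, healthy=False))
        raise  # the caller's own error handling still runs
    store.report(Observation(observer, subject, healthy=True))
    return result


if __name__ == "__main__":
    def stalled_write(data):
        raise TimeoutError("write stalled")

    store = ObservationStore()
    try:
        observed_call(store, "frontend", "storage-1", stalled_write, b"x")
    except TimeoutError:
        pass  # the requester handles the error as it always did
    print(store.verdict("storage-1"))
```

The point of the sketch is that the error evidence comes from the requester's own execution path rather than from a separate probe, which is the kind of observation the abstract argues is missing from failure detection.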
