RobinHood: Tail Latency Aware Caching—Dynamic Reallocation From Cache-Rich To Cache-Poor

Authors:
Daniel S. Berger, Carnegie Mellon University
Benjamin Berg, Carnegie Mellon University
Timothy Zhu, Pennsylvania State University
Siddhartha Sen, Microsoft Research
Mor Harchol-Balter, Carnegie Mellon University

Abstract:

Tail latency is of great importance in user-facing web services. However, maintaining low tail latency is challenging, because a single request to a web application server results in multiple queries to complex, diverse backend services (databases, recommender systems, ad systems, etc.). A request is not complete until all of its queries have completed. We analyze a Microsoft production system and find that backend query latencies vary by more than two orders of magnitude across backends and over time, resulting in high request tail latencies. We propose a novel solution for maintaining low request tail latency: repurpose existing caches to mitigate the effects of backend latency variability, rather than just caching popular data. Our solution, RobinHood, dynamically reallocates cache resources from the cache-rich (backends which don't affect request tail latency) to the cache-poor (backends which affect request tail latency). We evaluate RobinHood with production traces on a 50-server cluster with 20 different backend systems. Surprisingly, we find that RobinHood can directly address tail latency even if working sets are much larger than the cache size. In the presence of load spikes, RobinHood meets a 150ms P99 goal 99.7% of the time, whereas the next best policy meets this goal only 70% of the time.
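The core idea of reallocating cache from the cache-rich to the cache-poor can be illustrated with a minimal sketch. This is not the authors' implementation; it assumes a simplified form of the paper's blame metric, in which each request near the P99 latency attributes blame to its slowest backend query, and a small fixed slice of cache (`STEP`, an assumed parameter) is shifted each interval from the least-blamed backend to the most-blamed one:

```python
# Illustrative sketch of RobinHood-style cache reallocation.
# Assumption: "blame" for a slow request goes to the backend whose
# query was slowest; STEP is a hypothetical per-interval step size.
from collections import Counter

STEP = 0.01  # fraction of total cache moved per interval (assumed)

def blame_counts(slow_requests):
    """slow_requests: one dict per request whose latency fell near
    the P99, mapping backend name -> that backend's query latency."""
    blame = Counter()
    for queries in slow_requests:
        slowest = max(queries, key=queries.get)
        blame[slowest] += 1
    return blame

def reallocate(cache_shares, slow_requests):
    """Move a slice of cache from the backend with the least blame
    (cache-rich) to the backend with the most (cache-poor)."""
    blame = blame_counts(slow_requests)
    poor = max(cache_shares, key=lambda b: blame.get(b, 0))
    rich = min(cache_shares, key=lambda b: blame.get(b, 0))
    delta = min(STEP, cache_shares[rich])
    cache_shares[rich] -= delta
    cache_shares[poor] += delta
    return cache_shares
```

For example, if the database backend dominates the slowest queries in the current interval, it gains cache at the expense of a backend that never appears on the critical path, regardless of which backend's working set is more popular overall:

```python
shares = {"db": 0.4, "ads": 0.3, "rec": 0.3}
slow = [{"db": 120, "ads": 30, "rec": 10},
        {"db": 90, "ads": 140, "rec": 20},
        {"db": 200, "ads": 50, "rec": 30}]
reallocate(shares, slow)  # "db" blamed most, "rec" not at all
```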
