The wisdom of virtual crowds: mining datacenter telemetry to collaboratively debug performance

2013 
Explaining the (mis)behavior of virtual machines (VMs) in large-scale cloud environments is challenging, both because of scale and because of the difficulty of making sense of torrents of datacenter telemetry emanating from multiple levels of the stack. In this paper we leverage VM similarity to explain the behavior or performance of a VM using its cohort as a reference (or by contrasting it against groups of VMs outside of its cohort). The key insight is that VMs running the same application (components or workloads), or VMs colocated within the same (logical) tier of a complex application, exhibit similar telemetry patterns. The power of similarity relationships stems from the additional context that similarity provides. The quantitative or qualitative "distance" between a VM and its expected cohort can be used to explain or diagnose any discrepancy. Similarly, the distance between a VM and one in another cohort can be used to explain why the VMs are dissimilar. As an example we apply our data-mining techniques to debugging ViewPlanner performance. ViewPlanner is a tool used to emulate and evaluate large-scale deployments of virtual desktops. Using a ViewPlanner deployment of 175 VMs, we collect ~300 metrics per VM, sampled at 20-second intervals over multiple 1-hour epochs, from the PerformanceManager [4] on ESX, automatically filter them (using entropy measures [2]), and cluster them using K-means [1]. We use the median value of each metric within an epoch to summarize the VM's behavior during that epoch. We introduce spread/diffusion metrics to explain the difference between VMs. Spread metrics are those for which the expected value of an order statistic (in our case the median) of a metric $m_i$ differs between two clusters, i.e., the expected value depends on the cluster: $E[m_i \mid \text{clusterA}] \neq E[m_i \mid \text{clusterB}]$.
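The summarization and spread-metric steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes clustering has already assigned VMs to two clusters, uses equal-width binning for the entropy filter, and compares the mean of per-VM epoch medians between clusters. The function names, the `bins` parameter, and the `min_gap` threshold are all hypothetical choices made for illustration.

```python
import math
from collections import Counter
from statistics import mean, median


def epoch_median(samples):
    """Summarize one metric's samples within an epoch by their median."""
    return median(samples)


def shannon_entropy(values, bins=10):
    """Entropy in bits of values after equal-width binning (an assumed
    discretization); a constant metric has entropy 0.0 and carries no signal."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0
    width = (hi - lo) / bins
    # Clamp the maximum value into the last bin.
    counts = Counter(min(int((v - lo) / width), bins - 1) for v in values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def spread_metrics(cluster_a, cluster_b, min_gap=0.0):
    """Return {metric: gap} for metrics whose mean epoch median differs
    between two clusters by more than min_gap (a hypothetical threshold).

    cluster_a, cluster_b: lists of dicts, one per VM, mapping a metric
    name to that VM's epoch median.
    """
    gaps = {}
    for name in cluster_a[0]:
        gap = mean(vm[name] for vm in cluster_a) - mean(vm[name] for vm in cluster_b)
        if abs(gap) > min_gap:
            gaps[name] = gap
    return gaps
```

A sample mean over the VMs in a cluster stands in for the conditional expectation here; a real analysis would also need a significance test before declaring that two clusters differ on a metric.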
Within a cluster of VMs, differences in the distribution of a particular metric, $m_i$, may be explained by conditioning $m_i$ on other metrics, $\{c_0, \ldots, c_n\}$, where $E[m_i] \neq E[m_i \mid c_0, \ldots, c_n]$. We automatically find potentially interesting $m_i$'s using Silverman's test [3] for multi-modality, and we use Mutual Information [2] to find the associated $c_i$'s.
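As a rough illustration of the last step, the sketch below ranks candidate conditioning metrics by their mutual information with a target metric. It assumes every metric has already been discretized into labels (for example, the mode each VM falls into), and it leaves out Silverman's test entirely. The function names and the input layout are hypothetical, not taken from the paper.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information in bits between two equal-length sequences of
    discrete labels, estimated from empirical frequencies."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # p(x,y) * log2(p(x,y) / (p(x) * p(y))), written in terms of raw counts.
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def rank_conditioning_metrics(target, candidates):
    """Sort candidate metrics by mutual information with the target,
    highest first.

    target: discrete labels of the metric of interest, one per VM.
    candidates: dict mapping a metric name to its labels, one per VM.
    """
    scores = {name: mutual_information(target, vals) for name, vals in candidates.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Candidates that score near zero tell us little about the target metric, while high-scoring ones are worth inspecting as possible explanations for the multiple modes.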