|Subhendu Khatuya||Indian Institute of Technology Kharagpur, India|
|Niloy Ganguly||Indian Institute of Technology Kharagpur, India|
|Jayanta Basak||NetApp India Pvt. Ltd., India|
|Madhumita Bharde||Expert Technologist, India|
|Bivas Mitra||Indian Institute of Technology Kharagpur, India|
A sudden slowdown or shutdown of an enterprise application affects a large population of users. System administrators and analysts spend a considerable amount of time dealing with functional and performance bugs. These problems are particularly hard to detect and diagnose in most computer systems, since a huge amount of system-generated supportability data (counters, logs, etc.) needs to be analyzed. Most often, there is no clear or obvious root cause. Timely identification of a significant change in application behavior is very important to prevent negative impact on the service. In this paper, we present ADELE, an empirical, data-driven methodology for early detection of anomalies in data storage systems. The key feature of our solution is careful selection of features from system logs and development of effective machine learning techniques for anomaly prediction. ADELE learns from a system's own history to establish the baseline of normal behavior and gives accurate indications of the time period when something is amiss for that system. Validation on more than 4800 actual support cases shows a true positive rate of ~83% and a false positive rate of ~12% in identifying periods when the machine is not performing normally. We also establish the existence of problem "signatures", which help map customer problems to previously seen issues in the field. ADELE's capability to predict early paves the way for online failure prediction for customer systems.
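As a rough illustration of the idea of learning a per-system baseline from history and flagging periods that deviate from it, the following minimal sketch may help. It is not ADELE's actual method: the z-score scoring, the threshold of 3.0, the toy feature values, and all function names (`fit_baseline`, `anomaly_score`, `flag_periods`, `rates`) are assumptions made purely for illustration. The last function shows how true and false positive rates of the kind reported above are computed from flagged periods and ground-truth labels.

```python
from statistics import mean, stdev

def fit_baseline(history):
    """Per-feature (mean, std) from a system's own past periods.
    history: list of feature vectors, one per period (hypothetical layout)."""
    columns = list(zip(*history))
    return [(mean(c), stdev(c)) for c in columns]

def anomaly_score(baseline, observation):
    """Mean absolute z-score across features (an assumed scoring rule)."""
    zs = [abs(x - mu) / sigma if sigma > 0 else 0.0
          for (mu, sigma), x in zip(baseline, observation)]
    return sum(zs) / len(zs)

def flag_periods(baseline, observations, threshold=3.0):
    """Flag each period whose score exceeds the (assumed) threshold."""
    return [anomaly_score(baseline, o) > threshold for o in observations]

def rates(predicted, actual):
    """Return (true positive rate, false positive rate)."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    pos = sum(actual)
    neg = len(actual) - pos
    return (tp / pos if pos else 0.0, fp / neg if neg else 0.0)

# Toy data: two log-derived features per period (made-up values).
history = [[10, 100], [12, 110], [11, 105], [9, 95], [10, 100]]
baseline = fit_baseline(history)
observations = [[11, 104], [30, 200], [10, 99], [25, 180]]
flags = flag_periods(baseline, observations)
actual = [False, True, False, False]  # hypothetical ground-truth labels
tpr, fpr = rates(flags, actual)
```

In practice the threshold trades the true positive rate against the false positive rate, so it would need to be tuned on labeled support cases rather than fixed in advance.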