Smart Server Crash Prediction in Cloud Service Data Center

2020 
In recent years, cloud services have been adopted by a growing number of end customers, and a large number of applications from various businesses have been migrated to the cloud. Availability is one of the key considerations for end customers when adopting a cloud service, so CSPs (Cloud Service Providers) are pursuing ever higher SLA (Service-Level Agreement) standards to meet this need. This matters especially for VM (Virtual Machine) based cloud services, where the resources of one physical server are virtualized and shared among multiple tenants, so a server crash has a large impact on tenants' businesses. One solution is to establish an effective and accurate method to predict server crashes in advance, so that workloads can be migrated to a healthy server before service is affected. Delivering accurate predictions is extremely challenging, since server crashes result from many kinds of failures, most of which occur randomly and suddenly. This paper proposes a smart server crash prediction method for triggering early warning and migration in cloud service data centers. The proposed server crash prediction is built on hardware, firmware, and software system information, ranging from low-level hardware indicators and kernel status to upper-level system logs in the OS (Operating System). Machine learning algorithms are adopted for log analysis and failure prediction. Among the algorithms evaluated, Random Forests is chosen because it provides the best precision. The final method is deployed and evaluated in Baidu's data center, where it achieves 93.33% and 87.33% precision in providing minutes-level and hours-level ahead-of-time warnings of server crashes, respectively.
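The precision figures quoted in the abstract measure what fraction of issued crash warnings were followed by an actual crash. A minimal sketch of that metric follows, assuming one binary warning label and one binary crash label per server; the function name and the toy data are illustrative assumptions and are not taken from the paper.

```python
def precision(predicted, actual):
    """Fraction of issued warnings that correspond to a real crash."""
    pairs = list(zip(predicted, actual))
    true_pos = sum(1 for p, a in pairs if p and a)        # warned, crashed
    false_pos = sum(1 for p, a in pairs if p and not a)   # warned, no crash
    issued = true_pos + false_pos
    if issued == 0:
        raise ValueError("no warnings were issued; precision is undefined")
    return true_pos / issued


# Hypothetical toy labels for six servers (not data from the paper).
warnings_issued = [True, True, False, True, False, True]
crashes_seen = [True, False, False, True, True, True]

toy_precision = precision(warnings_issued, crashes_seen)
print(f"{toy_precision:.2%}")  # prints 75.00%
```

Note that precision ignores missed crashes (the fifth server here), so a complete evaluation would typically also report recall.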