Identifying Execution Anomalies for Data Intensive Workflows Using Lightweight ML Techniques

2020 
Today's computational science applications increasingly depend on complex, data-intensive operations over distributed datasets that originate from a variety of scientific instruments and repositories. To manage this complexity, science workflows are created to automate the execution of these computational and data transfer tasks, which significantly improves scientific productivity. As the scale of workflows rapidly increases, detecting anomalous behaviors in workflow executions has become critical to ensure timely and accurate science products. In this paper, we present a set of lightweight machine learning-based techniques, including both supervised and unsupervised algorithms, to identify anomalous workflow behaviors. We perform anomaly analysis on both workflow-level and task-level datasets collected from real workflow executions on a distributed cloud testbed. Results show that workflow-level analysis employing k-means clustering can accurately group anomalous (i.e., failure-prone and poorly performing) workflows into statistically similar classes with reasonable clustering quality, achieving scores above 0.7 for both Normalized Mutual Information and Completeness. These results affirm the selection of the workflow-level features for workflow anomaly analysis. For task-level analysis, the Decision Tree classifier achieves over 80% accuracy, while the other tested classifiers achieve over 50% accuracy in most cases. We believe that these promising results can be a foundation for future research on anomaly detection and failure prediction for scientific workflows running in production environments.
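To make the two clustering-quality scores concrete, the following is a minimal sketch of how Normalized Mutual Information and Completeness can be computed from ground-truth anomaly labels and cluster assignments. It is not the paper's implementation: the label names and cluster assignments are hypothetical, and the arithmetic-mean normalization for NMI is an assumption, since the abstract does not state which variant was used.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (in nats) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(truth, pred):
    """Mutual information between ground-truth labels and cluster ids."""
    n = len(truth)
    joint = Counter(zip(truth, pred))
    t_counts, p_counts = Counter(truth), Counter(pred)
    return sum(
        (c / n) * log(c * n / (t_counts[t] * p_counts[p]))
        for (t, p), c in joint.items()
    )


def nmi(truth, pred):
    """NMI normalized by the arithmetic mean of the two entropies (assumed variant)."""
    h_t, h_p = entropy(truth), entropy(pred)
    if h_t == 0 and h_p == 0:
        return 1.0  # both labelings are trivial and therefore identical
    return mutual_information(truth, pred) / ((h_t + h_p) / 2)


def completeness(truth, pred):
    """1 - H(cluster | class): high when each class lands in a single cluster."""
    h_p = entropy(pred)
    if h_p == 0:
        return 1.0  # a single cluster trivially keeps every class together
    return mutual_information(truth, pred) / h_p


# Hypothetical workflow-level labels and k-means cluster assignments.
truth = ["normal"] * 3 + ["cpu_stress"] * 3 + ["packet_loss"] * 3
pred = [0, 0, 0, 1, 1, 1, 2, 2, 1]  # one packet_loss workflow misassigned

print(f"NMI = {nmi(truth, pred):.3f}, completeness = {completeness(truth, pred):.3f}")
```

Both scores lie in [0, 1], with 1 indicating that the clusters reproduce the labeled classes exactly, which is the scale on which the reported 0.7 threshold should be read.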