Finding Crash-Consistency Bugs With Bounded Black-Box Crash Testing

Authors:
Jayashree Mohan University of Texas at Austin
Ashlie Martinez University of Texas at Austin
Soujanya Ponnapalli University of Texas at Austin
Pandian Raju University of Texas at Austin
Vijay Chidambaram University of Texas at Austin and VMware Research

Introduction:

The authors present a new approach to testing file-system crash consistency: bounded black-box crash testing (B3). B3 tests the file system in a black-box manner using workloads of file-system operations. Since the space of possible workloads is infinite, B3 bounds this space based on parameters such as the number of file-system operations or which operations to include, and exhaustively generates workloads within this bounded space. Each workload is tested on the target file system by simulating power-loss crashes while the workload is being executed, and checking if the file system recovers to a correct state after each crash. The authors build two tools, CrashMonkey and Ace, to demonstrate the effectiveness of this approach. Their tools revealed 10 new crash-consistency bugs in widely-used, mature Linux file systems, seven of which have existed in the kernel since 2014.

Abstract:

We present a new approach to testing file-system crash consistency: bounded black-box crash testing (B3). B3 tests the file system in a black-box manner using workloads of file-system operations. Since the space of possible workloads is infinite, B3 bounds this space based on parameters such as the number of file-system operations or which operations to include, and exhaustively generates workloads within this bounded space. Each workload is tested on the target file system by simulating power-loss crashes while the workload is being executed, and checking if the file system recovers to a correct state after each crash. B3 builds upon insights derived from our study of crash-consistency bugs reported in Linux file systems in the last five years. We observed that most reported bugs can be reproduced using small workloads of three or fewer file-system operations on a newly-created file system, and that all reported bugs result from crashes after fsync() related system calls. We build two tools, CrashMonkey and Ace, to demonstrate the effectiveness of this approach. Our tools are able to find 24 out of the 26 crash-consistency bugs reported in the last five years. Our tools also revealed 10 new crash-consistency bugs in widely-used, mature Linux file systems, seven of which existed in the kernel since 2014. The new bugs result in severe consequences like broken rename atomicity and loss of persisted files.
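The bounding idea described above can be illustrated with a small sketch. This is not Ace's actual generation algorithm; it is a minimal illustration that assumes a hypothetical set of three operations and two file names, and it places a persistence call (here named fsync) after every operation so that each one marks a point where a crash could be simulated.

```python
from itertools import product

# Hypothetical operation and file names, chosen only for illustration.
# They are assumptions, not the operation set used by the paper's tools.
OPS = ("creat", "write", "unlink")
FILES = ("A/foo", "B/bar")
PERSIST = "fsync"


def generate_workloads(max_ops):
    """Yield every workload containing 1..max_ops core operations.

    Each core operation is followed by a persistence call on the same
    file, so every workload in the bounded space is enumerated once.
    """
    calls = [(op, f) for op in OPS for f in FILES]
    for n in range(1, max_ops + 1):
        for seq in product(calls, repeat=n):
            workload = []
            for op, f in seq:
                workload.append((op, f))
                workload.append((PERSIST, f))
            yield workload


def crash_points(workload):
    """Return the indices after which a crash would be simulated.

    Following the abstract's observation, crashes are only considered
    right after persistence calls.
    """
    return [i for i, (op, _) in enumerate(workload) if op == PERSIST]


if __name__ == "__main__":
    workloads = list(generate_workloads(2))
    print(len(workloads), "workloads within a bound of 2 operations")
    print(workloads[0], crash_points(workloads[0]))
```

With 6 possible calls (3 operations times 2 files), a bound of two operations gives 6 + 36 = 42 workloads. The count grows exponentially with the bound, which is why restricting both the number of operations and the set of operations keeps exhaustive generation tractable.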

You may want to know: