Threats of a replication crisis in empirical computer science
Many areas of computer science research (e.g., performance analysis, software engineering, artificial intelligence, and human-computer interaction) validate research claims by using statistical significance as the standard of evidence. A loss of confidence in statistically significant findings is plaguing other empirical disciplines, yet there has been relatively little debate about this issue and its associated 'replication crisis' in computer science. We review factors that have contributed to the crisis in other disciplines, with a focus on problems stemming from an over-reliance on, and misuse of, null hypothesis significance testing. Computer science research can be greatly improved by following the steps taken by other disciplines, such as using more sophisticated evidentiary criteria and showing greater openness and transparency through experimental preregistration and data/artifact repositories.