How Do You Test a Test?: A Multifaceted Examination of Significance Tests
2022
We examine three statistical significance tests -- a recently proposed ANOVA model and two baseline tests -- using a suite of measures to determine which is better suited for offline evaluation. We apply our analysis both to the runs of a whole TREC track and to the runs submitted by six participant groups. The former reveals test behavior in the heterogeneous settings of a large-scale offline evaluation initiative; the latter, to the best of our knowledge almost overlooked in past work, reveals what happens in the much more restricted case of variants of a single system, i.e., the typical context in which companies and research groups operate. We find the ANOVA test strikingly consistent in large-scale settings, but worryingly inconsistent in some participant experiments. Of greater concern, the participant-only experiments show that one of our baseline tests (a test widely used in research) can produce a substantial number of inconsistent results. We discuss the implications of this inconsistency for possible publication bias.
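As a minimal illustration of the kind of comparison the abstract describes, the sketch below runs a paired t-test over per-topic effectiveness scores of two system variants. The paper's specific baseline tests are not named in this abstract, so the choice of a paired t-test, the score values, and the function name `paired_t` are illustrative assumptions only.

```python
import statistics

# Hypothetical per-topic effectiveness scores (e.g., AP) for two variants
# of one system over the same 10 topics; all values are illustrative.
baseline = [0.30, 0.25, 0.40, 0.35, 0.28, 0.33, 0.45, 0.22, 0.38, 0.31]
deltas   = [0.02, 0.03, 0.01, 0.04, 0.02, 0.03, 0.05, 0.01, 0.02, 0.03]
variant  = [b + d for b, d in zip(baseline, deltas)]

def paired_t(scores_a, scores_b):
    """Paired t statistic over per-topic score differences."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    # Standard error of the mean difference.
    se = statistics.stdev(diffs) / n ** 0.5
    return statistics.mean(diffs) / se

t = paired_t(variant, baseline)
# Two-sided critical value for alpha = 0.05 with df = 9 is about 2.262.
print(f"t = {t:.3f}, significant at 0.05: {abs(t) > 2.262}")
```

A "participant-only" analysis in the abstract's sense would apply such a test repeatedly to pairs of closely related runs from a single group, which is exactly where the inconsistencies discussed above can surface.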