How Do You Test a Test?: A Multifaceted Examination of Significance Tests

2022 
We examine three statistical significance tests -- a recently proposed ANOVA model and two baseline tests -- using a suite of measures to determine which is better suited for offline evaluation. We apply our analysis both to the runs of a whole TREC track and to the runs submitted by six participant groups. The former reveals test behavior in the heterogeneous setting of a large-scale offline evaluation initiative; the latter, almost overlooked in past work (to the best of our knowledge), reveals what happens in the much more restricted case of variants of a single system, i.e., the typical context in which companies and research groups operate. We find the ANOVA test strikingly consistent in large-scale settings, but worryingly inconsistent in some participant experiments. Of greater concern, the participant-only experiments show that one of our baseline tests (a test widely used in research) can produce a substantial number of inconsistent results. We discuss the implications of this inconsistency for possible publication bias.
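For readers unfamiliar with the tests being compared, the sketch below illustrates the general form of such comparisons: a paired t-test between two runs and a two-way ANOVA with system and topic as factors, applied to a system-by-topic score matrix. The data, system names, and the specific ANOVA formulation here are illustrative assumptions, not the paper's actual model or experimental setup.

```python
# Minimal sketch of the kind of significance tests discussed: a paired t-test
# (a common baseline test) and a two-way ANOVA over a system-by-topic score
# matrix. All scores below are synthetic and purely illustrative.
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm
from statsmodels.formula.api import ols

rng = np.random.default_rng(0)
n_topics = 50
topics = [f"t{i}" for i in range(n_topics)]

# Hypothetical per-topic effectiveness scores (e.g., AP) for three runs.
scores = {
    "sysA": rng.beta(2, 5, n_topics),
    "sysB": rng.beta(2, 5, n_topics) + 0.02,
    "sysC": rng.beta(2, 5, n_topics),
}

# Baseline test: paired t-test between two runs over the shared topics.
t_stat, p_val = stats.ttest_rel(scores["sysA"], scores["sysB"])
print(f"paired t-test sysA vs sysB: t={t_stat:.3f}, p={p_val:.4f}")

# Two-way ANOVA with system and topic as factors (no interaction term),
# fitted on the long-format score table.
long = pd.DataFrame(
    [(s, t, scores[s][i]) for s in scores for i, t in enumerate(topics)],
    columns=["system", "topic", "score"],
)
model = ols("score ~ C(system) + C(topic)", data=long).fit()
print(sm.stats.anova_lm(model, typ=2))
```

In practice, the consistency questions studied in the paper concern how often such tests agree with themselves across repeated samples of topics or runs, which is what distinguishes the large-scale TREC setting from the participant-only setting.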