GENIE: A Leaderboard for Human-in-the-Loop Evaluation of Text Generation

Daniel Khashabi, Gabriel Stanovsky, Jonathan Bragg, Nicholas Lourie, Jungo Kasai, Yejin Choi, Noah A. Smith, Daniel S. Weld

2021

Leaderboards have eased model development for many NLP datasets by standardizing their evaluation and delegating it to an independent external repository. Their adoption, however, is so far limited to tasks that can be reliably evaluated in an automatic manner. This work introduces GENIE, an extensible human evaluation leaderboard, which brings the ease of leaderboards to text generation tasks. GENIE automatically posts leaderboard submissions to crowdsourcing platforms asking human annotators to evaluate them on various axes (e.g., correctness, conciseness, fluency) and compares their answers to various automatic metrics. We introduce several datasets in English to GENIE, representing four core challenges in text generation: machine translation, summarization, commonsense reasoning, and machine comprehension. We provide formal granular evaluation metrics and identify areas for future research. We make GENIE publicly available and hope that it will spur progress in language generation models as well as their automatic and manual evaluation.
