D-score: Holistic Dialogue Evaluation without Reference

2021 
In artistic gymnastics, the difficulty score, or D-score, is used to judge performance. Starting from zero, an athlete earns points from different aspects such as composition requirements, difficulty, and connections between moves. The final score is a composite of the quality of various performance indicators. Similarly, when evaluating dialogue responses, human judges generally follow a number of criteria, among which language fluency, context coherence, logical consistency, and semantic appropriateness are at the top of the agenda. In this paper, we propose an automatic dialogue evaluation framework called D-score that resembles the way gymnastics is evaluated. Following the four human judging criteria above, we devise a range of evaluation tasks and model them under a multi-task learning framework. The proposed framework, without relying on any human-written reference, learns to appreciate the overall quality of human-human conversations through a representation shared by all tasks, without over-fitting to any individual task's domain. We evaluate D-score by performing comprehensive correlation analyses with human judgement on three dialogue evaluation datasets, two of which are from past DSTC series, and benchmark it against state-of-the-art baselines. D-score not only outperforms the best baseline by a large margin in terms of system-level Spearman correlation but also represents an important step towards explainable dialogue scoring.
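To make the multi-task setup concrete, here is a minimal sketch of a shared-encoder scorer with one head per judging criterion. The encoder choice (`bert-base-uncased`), the `[CLS]`-pooling, the sigmoid heads, and the uniform averaging of head scores into a final D-score are all illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sketch of a reference-free, multi-task dialogue scorer.
# Assumptions (not from the paper): a BERT encoder, one linear head per
# criterion, and a simple average of head scores as the final D-score.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

CRITERIA = ["fluency", "coherence", "consistency", "appropriateness"]

class DScoreSketch(nn.Module):
    def __init__(self, encoder_name="bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        # One scoring head per human judging criterion; all heads read
        # the same shared representation, so training any one task
        # also shapes the features available to the others.
        self.heads = nn.ModuleDict(
            {c: nn.Linear(hidden, 1) for c in CRITERIA}
        )

    def forward(self, input_ids, attention_mask):
        shared = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state[:, 0]  # [CLS] token as shared representation
        scores = {
            c: torch.sigmoid(head(shared)).squeeze(-1)
            for c, head in self.heads.items()
        }
        # Aggregate per-criterion scores into one holistic score;
        # uniform averaging here is an illustrative choice.
        scores["d_score"] = torch.stack(
            [scores[c] for c in CRITERIA]
        ).mean(dim=0)
        return scores

# Usage: score a (context, response) pair with no reference response.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = DScoreSketch().eval()
batch = tokenizer(
    "How was your weekend?", "Great, I went hiking with friends.",
    return_tensors="pt",
)
with torch.no_grad():
    out = model(batch["input_ids"], batch["attention_mask"])
print({k: float(v) for k, v in out.items()})
```

Keeping the per-criterion heads exposed, rather than predicting only a single scalar, is what makes this style of design a step towards explainable scoring: each dimension of quality can be inspected alongside the aggregate.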