Grabbing the Long Tail: A data normalization method for diverse and informative dialogue generation

2021 
Abstract: Recent neural models have made significant progress in dialogue generation. Most of these models are built on language models that generate responses word by word conditioned on the preceding context. Because of this mechanism, and because the widely used cross-entropy loss continually pushes the distribution of generated text toward that of the training data, trained generation models inevitably favor the most frequent words in the training set, leading to low diversity and poor informativeness. By investigating several mainstream dialogue generation models, we find that the probable cause is the long-tail phenomenon inherent in natural language. To address these issues, we explore and analyze a large corpus from Wikipedia and propose an efficient frequency-based data normalization method, Log Normalization. We further explore two additional methods, Mutual Normalization and Log-Mutual Normalization, to eliminate the effect of mutual information. To validate the effectiveness of the proposed methods, we conduct extensive experiments on three datasets covering different domains: social media, film subtitles, and online customer service. Compared with vanilla Transformers, generation models augmented with our methods achieve significant improvements in both the diversity and the informativeness of generated responses. Specifically, unigram and bigram diversity improve by 8.5%–14.1% and 19.7%–25.8% across the three datasets, respectively, and informativeness (measured as the number of nouns and verbs) increases by 13.1%–31.0% and 30.4%–59.0%, respectively. Moreover, being model-agnostic, our methods can be adapted to new generation models efficiently and effectively.
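The abstract does not spell out the exact formulation of Log Normalization, but the underlying idea of counteracting long-tail word frequencies can be illustrated with a minimal sketch. The code below assumes, purely for illustration, that the normalization is realized as per-token weights inversely related to the log of corpus frequency and applied inside a cross-entropy loss; the function names and the weighting formula are hypothetical and the paper's actual method may differ (e.g., it may normalize the data rather than the loss).

```python
import math
from collections import Counter

def log_norm_weights(corpus_tokens, smoothing=1.0):
    """Per-token weights that down-weight frequent words.

    Illustrative assumption: Log Normalization is taken to mean a weight
    inversely proportional to the log of a word's corpus frequency.
    """
    counts = Counter(corpus_tokens)
    return {w: 1.0 / math.log(smoothing + c) for w, c in counts.items()}

def weighted_cross_entropy(token_log_probs, tokens, weights):
    """Cross-entropy over a response, with each token scaled by its weight."""
    total, norm = 0.0, 0.0
    for lp, tok in zip(token_log_probs, tokens):
        w = weights.get(tok, 1.0)  # unseen tokens keep weight 1
        total += -w * lp
        norm += w
    return total / max(norm, 1e-8)

# Frequent head-of-distribution words (e.g., "the") contribute less to the
# loss, so the model is pushed less strongly toward generating them.
corpus = ["the", "the", "the", "cat", "sat", "on", "the", "mat", "astronomy"]
weights = log_norm_weights(corpus)
print(weights["the"] < weights["astronomy"])  # True: rare words weigh more
```

A weighting of this kind is model-agnostic in the sense the abstract describes: it touches only the training signal, not the generator architecture, so it can be attached to any language-model-based dialogue system.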