Comparison of Sequence-to-Sequence and Retrieval Approaches on the Code Summarization and Code Generation Tasks

2021 
In this study, we evaluate and compare state-of-the-art models on the code generation and code summarization tasks (English-to-code and code-to-English). We compare the performance of neural seq2seq BiLSTM [Yin et al. 2018] and attentional-GRU [LeClair et al. 2019] architectures, along with that of a semantic code search model reproduced from [Sachdev et al. 2018]. We compare these three models' BLEU scores (1) on their original study datasets and (2) on additional benchmark datasets [Yin et al. 2018, Sennrich et al. 2018, LeClair et al. 2019], each time for translation and back-translation (i.e., English-to-code and code-to-English). We observe that, surprisingly, semantic code search performs best overall, surpassing the seq2seq models on 5 of the 8 task-dataset combinations. We find that the seq2seq BiLSTM always outperforms the attentional-GRU, including on the relatively large (2M pairs) Javadoc-based dataset from the original attentional-GRU study, where it sets a new high score, surpassing four previously published studies. However, we also observe that model scores remain low on several datasets. Some test-set questions are harder to answer because the training set lacks relevant examples. We introduce a new procedure for estimating the degree of novelty and difficulty of any given test-set question: we use the BLEU score of the highest-scoring training-set entry as a reference point for model scores on that question, a procedure we call BLEU Optimal Search, or BOS. The BOS score (i) provides an information-retrieval ceiling for model scores on each test-set question, (ii) can help shed light on the seq2seq models' capacity to generalize to novel, unseen questions on any dataset, and (iii) helps identify dataset artifacts, by inspecting the rare model answers that score above it. We observe that the seq2seq models do not reliably surpass the BOS, except in the presence of dataset artifacts (such as when the first words of the question contain the answer), and call for further empirical investigation.
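
As a rough illustration of the BOS procedure described above, the sketch below scans a training set for the answer whose BLEU against a test-set question's reference answer is highest, and returns that score as the retrieval ceiling. This is a minimal sketch, not the authors' implementation: it assumes NLTK sentence-level BLEU with smoothing and plain whitespace tokenization, whereas the paper's exact BLEU variant, tokenization, and search procedure may differ.

# Minimal sketch (not the authors' code) of the BOS ceiling for one test-set question.
# Assumes NLTK sentence-level BLEU with smoothing; tokenization is a whitespace split.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def bos_score(test_reference_tokens, training_answer_token_lists):
    """Return the BLEU of the best-matching training-set answer, i.e. the
    information-retrieval ceiling used as a reference point for model scores."""
    smooth = SmoothingFunction().method1
    best = 0.0
    for candidate in training_answer_token_lists:
        score = sentence_bleu([test_reference_tokens], candidate,
                              smoothing_function=smooth)
        best = max(best, score)
    return best

# Toy usage: one test-set reference answer compared against a tiny training set.
test_ref = "returns the maximum value in the list".split()
training = [
    "returns the minimum value in the list".split(),
    "opens a file and reads its contents".split(),
]
print(bos_score(test_ref, training))

A seq2seq model whose BLEU on a test-set question exceeds this ceiling is, by construction, doing better than any single retrieved training answer, which is why the paper treats such cases as either genuine generalization or a sign of dataset artifacts.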