Large-Scale Semi-Supervised Training in Deep Learning Acoustic Model for ASR

2019 
This study investigated large-scale semi-supervised training (SST) to improve acoustic models for automatic speech recognition. Conventional self-training, the recently proposed committee-based SST using heterogeneous neural networks, and lattice-based SST were examined and compared. Large-scale SST was studied for deep neural network acoustic modeling with respect to automatic transcription quality, the importance of data filtering, training data quantity, and other attributes of a large quantity of multi-genre unsupervised live data. We found that SST behavior on large-scale ASR tasks differs substantially from that observed in small-scale SST: 1) big data can tolerate a certain degree of mislabeling in the automatic transcriptions, so further performance gains are possible with more unsupervised fresh data even when those transcriptions contain some errors; 2) the audio attributes, transcription quality, and importance of the fresh data matter more than sheer data quantity for large-scale SST; and 3) performance gains vary widely across recognition tasks, since the benefits depend strongly on the selected attributes of the unsupervised data and on the data scale of the baseline ASR system. Furthermore, we proposed a novel utterance filtering approach based on active learning to improve data selection in large-scale SST. The experimental results showed that SST with the proposed data filtering yields a 2-11% relative word error rate reduction on five multi-genre recognition tasks, even when the baseline acoustic model was already well trained on a 10000-hour supervised dataset.
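To make the data-selection idea concrete, below is a minimal sketch (not the paper's implementation) of confidence-band utterance filtering for semi-supervised training: utterances whose automatic transcriptions are likely wrong are dropped, and, in the active-learning spirit, utterances the seed model already finds trivially easy can be down-weighted as uninformative. The `Utterance` fields, threshold values, and function names are illustrative assumptions.

```python
# Minimal sketch of confidence-based utterance filtering for SST.
# All names and threshold values are illustrative assumptions,
# not the authors' actual filtering criteria.

from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    utt_id: str
    hypothesis: str          # automatic transcription from the seed ASR system
    avg_confidence: float    # e.g., mean per-word posterior from the decoding lattice


def filter_utterances(utts: List[Utterance],
                      low: float = 0.60,
                      high: float = 0.98) -> List[Utterance]:
    """Keep utterances in a mid-confidence band.

    Very low confidence suggests the automatic transcription is likely wrong;
    very high confidence suggests the utterance is already easy for the seed
    model and adds little new information (the active-learning intuition).
    """
    return [u for u in utts if low <= u.avg_confidence <= high]


if __name__ == "__main__":
    pool = [
        Utterance("utt1", "turn on the lights", 0.99),    # easy, little to learn
        Utterance("utt2", "play some jazz music", 0.85),  # informative, kept
        Utterance("utt3", "call mom", 0.40),               # likely mislabeled
    ]
    for u in filter_utterances(pool):
        print(u.utt_id, u.avg_confidence)
```

In practice the selected utterances would then be mixed with the supervised set for acoustic model training; the band thresholds would be tuned on held-out data rather than fixed as above.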