Error back propagation for sequence training of Context-Dependent Deep Networks for conversational speech transcription

2013 
We investigate back-propagation-based sequence training of Context-Dependent Deep-Neural-Network HMMs, or CD-DNN-HMMs, for conversational speech transcription. In theory, sequence training integrates with back-propagation in a straightforward manner. In practice, however, we find that heuristics are needed to obtain reasonable results, and these point to a problem with lattice sparseness: the model must be adjusted to the updated numerator lattices by additional iterations of frame-based cross-entropy (CE) training, and, to avoid distortions from "runaway" models, we can either add artificial silence arcs to the denominator lattices or smooth the sequence objective with the frame-based one (F-smoothing). With the 309h Switchboard training set, the MMI objective achieves a relative word-error-rate reduction of 11-15% over CE for matched test sets, and 10-17% for mismatched ones; this includes gains of 4-7% from realigned CE iterations. The BMMI and sMBR objectives gain less. With 2000h of data, gains are 2-9% after realigned CE iterations. Using GPGPUs, MMI training is about 70% slower than CE training.
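As a rough illustration of the F-smoothing idea mentioned above, the sequence-level objective can be interpolated with the frame-level CE objective so that the frame criterion regularizes the sequence criterion. The sketch below is a minimal formulation assuming this interpolation form; the symbols F_SEQ, F_CE, and the weight H are illustrative notation, not necessarily the paper's own.

% F-smoothing as a weighted combination of the sequence objective
% (e.g. MMI) and the frame-based cross-entropy objective.
% H in [0, 1] controls how strongly the frame criterion smooths
% the sequence criterion (assumed notation for this sketch).
\begin{equation}
  \mathcal{F}_{\mathrm{F\text{-}smooth}}
    = (1 - H)\,\mathcal{F}_{\mathrm{SEQ}}
    + H\,\mathcal{F}_{\mathrm{CE}},
  \qquad 0 \le H \le 1
\end{equation}

Under this reading, H = 0 recovers pure sequence training and H = 1 recovers frame-based CE training; intermediate values keep the per-frame targets in play, which is what counteracts the "runaway" behavior described in the abstract.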