Soft-BAC: Soft Bidirectional Alignment Cost for End-to-End Automatic Speech Recognition

2021 
Connectionist temporal classification (CTC) has been successful both as an end-to-end ASR model and as an auxiliary task for attention-based sequence-to-sequence (S2S) systems. However, CTC's special topology, in which a redundant blank symbol may optionally be inserted between any two modeling units, biases the model toward predicting blank symbols. The resulting alignments are worse than expected, and frames are often aligned with redundant symbols. In this paper, we design a new, simple topology and propose a novel smooth alignment optimization method named soft bidirectional alignment cost (soft-BAC) as an alternative to CTC. Our scheme inserts identifiers only between consecutive repeated labels and solves the alignment problem between the two time series of a speech-transcription pair by minimizing all costs of the left-to-right and right-to-left alignment processes. Experiments on the LibriSpeech corpus show that the proposed soft-BAC method achieves significant improvements in word error rate and alignment quality over the CTC-based baseline model.
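
The difference between the two label topologies can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the identifier symbol or the exact state layout, so the symbol name `<id>` and the helper names `ctc_extend` and `repeat_only_extend` are assumptions made for illustration. The CTC side follows the standard extension, which places an optional blank around every label.

```python
from typing import List

BLANK = "<b>"  # CTC blank symbol
SEP = "<id>"   # hypothetical name for the identifier inserted between repeats


def ctc_extend(labels: List[str]) -> List[str]:
    """Standard CTC extension: a blank before, between, and after all labels."""
    extended = [BLANK]
    for label in labels:
        extended.extend([label, BLANK])
    return extended


def repeat_only_extend(labels: List[str]) -> List[str]:
    """Assumed reading of the abstract: insert an identifier only between
    two consecutive identical labels, and nowhere else."""
    extended: List[str] = []
    for label in labels:
        if extended and extended[-1] == label:
            extended.append(SEP)
        extended.append(label)
    return extended


labels = list("hello")
ctc_states = ctc_extend(labels)
repeat_states = repeat_only_extend(labels)
print(len(ctc_states), ctc_states)
print(len(repeat_states), repeat_states)
```

For a transcription of L labels, the CTC extension always has 2L + 1 symbols, whereas the repeat-only scheme adds one symbol per pair of adjacent repeated labels, so frames have far fewer redundant symbols to align with.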