Multi-target hybrid CTC-Attentional Decoder for joint phoneme-grapheme recognition

2020 
In traditional Automatic Speech Recognition (ASR) systems, such as HMM-based architectures, words are predicted using either phonemes or graphemes as sub-word units. In this paper, we explore joint phoneme-grapheme decoding using an Encoder-Decoder network with a hybrid Connectionist Temporal Classification (CTC) and Attention mechanism. The Encoder network is shared between two Attentional Decoders, which individually learn to predict phonemes and graphemes from a common Encoder representation. This Encoder and multi-decoder network is trained in a multi-task setting to minimize the prediction error for both phoneme and grapheme sequences. We also attach the phoneme decoder to an intermediate layer of the Encoder and demonstrate the performance benefits of such an architecture. Through experiments on different architectural choices, using the TIMIT and Librispeech 100-hour datasets, we show that this approach improves performance over baseline independent phoneme and grapheme recognition systems.
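The multi-task objective described in the abstract can be read as a weighted combination of per-decoder hybrid losses. A minimal sketch follows; the interpolation weights `lam` (CTC vs. attention) and `alpha` (phoneme vs. grapheme decoder) are assumptions for illustration and are not specified in the abstract:

```python
def hybrid_ctc_attention_loss(l_ctc, l_att, lam=0.3):
    """Hybrid loss for one decoder: interpolate the CTC loss and the
    attention-based cross-entropy loss. `lam` is a hypothetical weight."""
    return lam * l_ctc + (1.0 - lam) * l_att

def multitask_loss(l_phoneme, l_grapheme, alpha=0.5):
    """Joint objective over the two decoders sharing one Encoder.
    `alpha` balances the phoneme and grapheme tasks (assumed value)."""
    return alpha * l_phoneme + (1.0 - alpha) * l_grapheme

# Example: per-decoder hybrid losses, then the joint multi-task loss.
l_p = hybrid_ctc_attention_loss(l_ctc=2.0, l_att=1.0)   # ~1.3
l_g = hybrid_ctc_attention_loss(l_ctc=4.0, l_att=2.0)   # ~2.6
total = multitask_loss(l_p, l_g)                        # ~1.95
```

Gradients from both decoders flow back into the shared Encoder, so lowering either term shapes the common representation used by the other task.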