Using Noisy Word-Level Labels to Train a Phoneme Recognizer based on Neural Networks by Expectation Maximization

2019 
The Connectionist Temporal Classification (CTC) technique can be used to train a neural-network based speech recognizer. When the technique is used to train a phoneme recognizer, the training data must be annotated with phoneme-level labels. This is not feasible for large speech databases. One approach to making use of such speech data is to convert the word-level transcriptions into phoneme-level labels, followed by CTC training. The problem with this approach is that the converted phoneme-level labels may mismatch the audio content of the speech data. This paper uses a probabilistic model to describe the probability of observing the noisy phoneme-level labels given an utterance. The model consists of a neural network that predicts the probability of any phoneme sequence, and a so-called mismatch model that describes the probability of one phoneme sequence being corrupted into another. Based on the Expectation-Maximization (EM) framework, we propose a training algorithm that simultaneously learns the parameters of the neural network and the mismatch model. The effectiveness of our method is verified by comparing its recognition performance with that of a conventional training method on the TIMIT corpus.
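The label-noise problem the abstract describes arises in the word-to-phoneme conversion step. A minimal sketch of that step, using a toy pronunciation lexicon (the entries and function name below are illustrative assumptions, not taken from the paper or the TIMIT corpus):

```python
# Hypothetical sketch: converting a word-level transcription into a
# phoneme-level label sequence via lexicon lookup. The lexicon entries
# are illustrative, not from the paper or TIMIT.
LEXICON = {
    "the": ["dh", "ah"],
    "cat": ["k", "ae", "t"],
    "sat": ["s", "ae", "t"],
}

def words_to_phonemes(transcription):
    """Map a word-level transcription to phoneme-level labels.

    The result may mismatch the actual audio (pronunciation variants,
    coarticulation, lexicon errors) -- this is exactly the label noise
    that the paper's mismatch model is meant to account for.
    """
    phonemes = []
    for word in transcription.lower().split():
        # Assumes every word appears in the lexicon; a real system
        # would need a grapheme-to-phoneme fallback for OOV words.
        phonemes.extend(LEXICON[word])
    return phonemes

print(words_to_phonemes("the cat sat"))
# → ['dh', 'ah', 'k', 'ae', 't', 's', 'ae', 't']
```

Because a word can have several valid pronunciations and only one lexicon entry is chosen, the resulting phoneme sequence is treated as a noisy observation rather than ground truth.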