Using Noisy Word-Level Labels to Train a Phoneme Recognizer based on Neural Networks by Expectation Maximization

2019 
The Connectionist Temporal Classification (CTC) technique can be used to train a neural-network based speech recognizer. When the technique is used to train a phoneme recognizer, the training data must be annotated with phoneme-level labels. This is not feasible for large speech databases. One approach to making use of such speech data is to convert the word-level transcriptions into phoneme-level labels, followed by CTC training. The problem with this approach is that the converted phoneme-level labels may mismatch the audio content of the speech data. This paper uses a probabilistic model to describe the probability of observing the noisy phoneme-level labels given an utterance. The model consists of a neural network that predicts the probability of any phoneme sequence, and a so-called mismatch model that describes the probability of one phoneme sequence being corrupted into another. Based on the Expectation-Maximization (EM) framework, we propose a training algorithm that simultaneously learns the parameters of the neural network and the mismatch model. The effectiveness of our method is verified by comparing its recognition performance with that of a conventional training method on the TIMIT corpus.
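The label-noise problem the abstract describes arises in the word-to-phoneme conversion step. A minimal sketch of that step, using a toy pronunciation lexicon (the entries and function name below are illustrative assumptions, not taken from the paper or the TIMIT corpus):

```python
# Hypothetical sketch: converting a word-level transcription into a
# phoneme-level label sequence via lexicon lookup. The lexicon entries
# are illustrative, not from the paper or TIMIT.
LEXICON = {
    "the": ["dh", "ah"],
    "cat": ["k", "ae", "t"],
    "sat": ["s", "ae", "t"],
}

def words_to_phonemes(transcription):
    """Map a word-level transcription to phoneme-level labels.

    The result may mismatch the actual audio (pronunciation variants,
    coarticulation, lexicon errors) -- this is exactly the label noise
    that the paper's mismatch model is meant to account for.
    """
    phonemes = []
    for word in transcription.lower().split():
        # Assumes every word appears in the lexicon; a real system
        # would need a grapheme-to-phoneme fallback for OOV words.
        phonemes.extend(LEXICON[word])
    return phonemes

print(words_to_phonemes("the cat sat"))
# → ['dh', 'ah', 'k', 'ae', 't', 's', 'ae', 't']
```

Because a word can have several valid pronunciations and only one lexicon entry is chosen, the resulting phoneme sequence is treated as a noisy observation rather than ground truth.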