Deep neural network training for whispered speech recognition using small databases and generative model sampling
State-of-the-art speech recognition solutions currently employ hidden Markov models (HMMs) to capture the time variability in a speech signal and deep neural networks (DNNs) to model the HMM state distributions. It has been shown that DNN–HMM hybrid systems outperform traditional HMM and Gaussian mixture model (GMM) hybrids in many applications. This improvement is mainly attributed to the ability of DNNs to model more complex data structures. However, the availability of sufficient training samples is a key requirement for training a high-accuracy DNN as a discriminative model. This barrier makes DNNs unsuitable for many applications with limited amounts of data. In this study, we introduce a method to produce an abundant amount of pseudo-samples that requires only a small amount of transcribed data from the target domain. In this method, a universal background model (UBM) is trained to capture a parametric estimate of the data distribution. Next, random sampling is used to generate a large number of pseudo-samples from the UBM. A frame-shuffling step is then applied to smooth the temporal cepstral trajectories in the generated pseudo-sample sequences so that they better resemble the temporal characteristics of a natural speech signal. Finally, the pseudo-sample sequences are combined with the original training data to train the DNN–HMM acoustic model of a speech recognizer. The proposed method is evaluated on small sets of neutral and whispered speech drawn from the UT-Vocal Effort II corpus. It is shown that the phoneme error rates (PERs) of a DNN–HMM based speech recognizer are considerably reduced when the generated pseudo-samples are incorporated in the training process, with relative PER improvements of 79.0% and 45.6% for the neutral–neutral and whisper–whisper training/test scenarios, respectively.
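The pipeline described above (fit a UBM, sample pseudo-frames from it, smooth the sampled sequence, then pool it with the real data) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the feature dimensionality, GMM component count, and random data are placeholders, and a simple moving-average filter stands in for the paper's frame-shuffling smoothing step.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Placeholder for real cepstral features: 500 "MFCC-like" 13-dim frames.
rng = np.random.default_rng(0)
real_frames = rng.normal(size=(500, 13))

# Step 1: fit a GMM as a universal background model (UBM) of the frame
# distribution (component count here is purely illustrative).
ubm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0)
ubm.fit(real_frames)

# Step 2: draw a large number of pseudo-frames from the UBM.
pseudo_frames, _ = ubm.sample(n_samples=2000)

# Step 3: smooth each cepstral trajectory over time with a moving
# average — a stand-in for the frame-shuffling step that restores
# speech-like temporal continuity to the independently sampled frames.
win = 5
kernel = np.ones(win) / win
smoothed = np.apply_along_axis(
    lambda c: np.convolve(c, kernel, mode="same"), 0, pseudo_frames
)

# Step 4: pool the smoothed pseudo-frames with the original training
# data; this augmented set would feed DNN-HMM acoustic model training.
augmented = np.vstack([real_frames, smoothed])
print(augmented.shape)  # (2500, 13)
```

Sampling frames independently from a GMM destroys the strong frame-to-frame correlation of speech, which is why some temporal smoothing of the generated sequences is needed before they can plausibly serve as training material for an acoustic model.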