Exploration of Properly Combined Audiovisual Representation with the Entropy Measure in Audiovisual Speech Recognition

2018 
Deep belief networks (DBNs) have shown impressive improvements over Gaussian mixture models when employed within Hidden Markov Model (HMM)-based automatic speech recognition systems. In this study, the benefits of DBNs for audiovisual speech recognition systems are investigated. First, DBN-HMMs are explored separately in speech recognition and lip-reading tasks. Next, the challenge of appropriately integrating the audio and visual information is studied; to this end, the use of fused features in an audiovisual (AV) DBN-HMM speech recognition task is examined. For the integration, layers are selected so that they jointly provide both generalities and details, complementing one another. A modified technique based on the entropy of the different layers of the DBNs is proposed to measure the amount of information each layer carries. The best audio layer representation is found to have the highest entropy, contributing the greatest level of detail to the fusion scheme; in contrast, the best visual layer representation has the lowest entropy and best provides sufficient generalities. Experiments on the English digit recognition task of the CUAVE database show that the AV DBN-HMM with the proposed feature fusion method reduces the phone error rate by as much as 4% and 1.5%, and the word error rate by about 3.49% and 1.89%, over the baseline conventional HMM and the audio-only DBN-HMM, respectively.
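The abstract's selection rule (highest-entropy audio layer, lowest-entropy visual layer) can be illustrated with a minimal sketch. This is not the authors' implementation; the histogram-based entropy estimate, the function names, and the bin count are all assumptions for illustration only.

```python
import numpy as np

def layer_entropy(activations, n_bins=32):
    """Estimate the Shannon entropy (in bits) of a layer's activation
    distribution by pooling all unit outputs into a histogram.
    This histogram estimator is an illustrative assumption, not the
    paper's exact entropy measure."""
    hist, _ = np.histogram(activations.ravel(), bins=n_bins)
    p = hist / hist.sum()
    p = p[p > 0]  # drop empty bins to avoid log(0)
    return float(-np.sum(p * np.log2(p)))

def select_fusion_layers(audio_layers, visual_layers):
    """Pick the audio layer with the highest entropy (most detail) and
    the visual layer with the lowest entropy (most generality), per the
    selection rule described in the abstract."""
    audio_best = max(range(len(audio_layers)),
                     key=lambda i: layer_entropy(audio_layers[i]))
    visual_best = min(range(len(visual_layers)),
                      key=lambda i: layer_entropy(visual_layers[i]))
    return audio_best, visual_best
```

Each element of `audio_layers` / `visual_layers` would be a (samples x units) array of hidden-layer activations collected on held-out data; the returned indices identify which layer representations to concatenate for the fused AV feature.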