BLTRCNN Based 3D Articulatory Movement Prediction: Learning Articulatory Synchronicity From Both Text and Audio Inputs

2018 
Predicting articulatory movements from audio or text has diverse applications, such as speech visualization. Various approaches have been proposed to solve the acoustic-to-articulatory mapping problem, but their precision remains limited when only acoustic features are available. Recently, deep neural networks (DNNs) have achieved tremendous success in fields such as speech recognition and image processing. To improve prediction accuracy, we propose a new network architecture for articulatory movement prediction with both text and audio inputs, called the bottleneck long-term recurrent convolutional neural network (BLTRCNN). To the best of our knowledge, this is the first DNN-based method to predict articulatory movements by fusing text and audio inputs. The BLTRCNN consists of two networks. The first is a bottleneck network, which generates compact bottleneck features from text information for each frame independently. The second, which combines a convolutional neural network (CNN), long short-term memory (LSTM), and a skip connection, is called the long-term recurrent convolutional neural network (LTRCNN). The LTRCNN predicts articulatory movements from the integrated bottleneck, acoustic, and text features. Experiments show that the proposed BLTRCNN achieves a state-of-the-art root-mean-square error (RMSE) of 0.528 mm and a correlation coefficient of 0.961. Moreover, we demonstrate how text information complements acoustic features in this prediction task.
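As a rough illustration of the two-stage design described above, the PyTorch sketch below shows one way a bottleneck network and an LTRCNN (CNN + LSTM + skip connection) could be wired together. All layer sizes, the 1D convolution settings, the bidirectional LSTM, and the output dimensionality are assumptions for illustration only; the abstract does not specify the paper's actual hyperparameters or layer layout.

```python
# Hypothetical sketch of a BLTRCNN-style model; layer sizes and dimensions
# are placeholders, not the paper's actual configuration.
import torch
import torch.nn as nn

class BottleneckNet(nn.Module):
    """Maps per-frame text features to compact bottleneck features."""
    def __init__(self, text_dim=100, bottleneck_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim, 256), nn.ReLU(),
            nn.Linear(256, bottleneck_dim), nn.ReLU(),
        )

    def forward(self, text_feats):           # (batch, frames, text_dim)
        return self.net(text_feats)          # (batch, frames, bottleneck_dim)

class LTRCNN(nn.Module):
    """CNN + LSTM with a skip connection, predicting articulator trajectories."""
    def __init__(self, in_dim, hidden=256, out_dim=18):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, hidden, kernel_size=5, padding=2)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)
        self.skip = nn.Linear(in_dim, 2 * hidden)   # skip path around CNN + LSTM
        self.out = nn.Linear(2 * hidden, out_dim)

    def forward(self, x):                    # (batch, frames, in_dim)
        h = torch.relu(self.conv(x.transpose(1, 2))).transpose(1, 2)
        h, _ = self.lstm(h)
        h = h + self.skip(x)                 # add skip connection from fused inputs
        return self.out(h)                   # (batch, frames, out_dim)

class BLTRCNN(nn.Module):
    """Fuses bottleneck, acoustic, and text features, then applies LTRCNN."""
    def __init__(self, text_dim=100, audio_dim=40, bottleneck_dim=32, out_dim=18):
        super().__init__()
        self.bottleneck = BottleneckNet(text_dim, bottleneck_dim)
        self.ltrcnn = LTRCNN(text_dim + audio_dim + bottleneck_dim, out_dim=out_dim)

    def forward(self, text_feats, audio_feats):
        bn = self.bottleneck(text_feats)
        fused = torch.cat([bn, audio_feats, text_feats], dim=-1)
        return self.ltrcnn(fused)
```

Here `out_dim=18` stands in for, e.g., six articulators with 3D coordinates each; the actual articulatory feature set depends on the electromagnetic articulography data used.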