Mining Audio, Text and Visual Information for Talking Face Generation

2019 
Providing methods to support audio-visual interaction with growing volumes of video data is an increasingly important challenge for data mining. To this end, there has been some success in speech-driven lip motion generation and talking face generation. Among these tasks, talking face generation aims to generate realistic talking heads synchronized with the audio or text input. This task requires mining the relationship between the audio signal/text and lip-synced video frames while ensuring temporal continuity between frames. Due to issues such as polysemy, ambiguity, and the fuzziness of sentences, creating visual images with lip synchronization remains challenging. To overcome these problems, we present a data-mining framework that learns the synchronous pattern between different channels from large recorded audio/text and visual datasets, and apply it to generate realistic talking face animations. Specifically, we decompose the task into two steps: mouth landmark prediction and video synthesis. First, a multimodal learning method is proposed to generate accurate mouth landmarks from multimedia inputs (both text and audio). Second, a network named Face2Vid is proposed to generate video frames conditioned on the predicted mouth landmarks. In Face2Vid, optical flow is employed to model the temporal dependency between frames, while a self-attention mechanism is introduced to model the spatial dependency across image regions. Extensive experiments demonstrate that our method can generate realistic videos including the background, and that it achieves accurate synchronization of lip movements and smooth transitions of facial movements.
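To make the abstract's mention of self-attention over image regions concrete, the sketch below shows a minimal SAGAN-style spatial self-attention module of the kind commonly used in image generators. The class name, channel sizes, and PyTorch framing are illustrative assumptions, not the paper's exact Face2Vid implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatialSelfAttention(nn.Module):
    """SAGAN-style self-attention over image feature maps.

    Every spatial position attends to every other position, letting the
    generator relate distant image regions (e.g. mouth, jaw, and cheeks).
    """

    def __init__(self, channels: int):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.key = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)
        # Learnable gate, initialised to 0 so training starts from the plain
        # convolutional features and blends attention in gradually.
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)   # (B, HW, C//8)
        k = self.key(x).flatten(2)                     # (B, C//8, HW)
        v = self.value(x).flatten(2)                   # (B, C, HW)
        attn = F.softmax(torch.bmm(q, k), dim=-1)      # (B, HW, HW) attention map
        out = torch.bmm(v, attn.transpose(1, 2))       # (B, C, HW) attended features
        out = out.view(b, c, h, w)
        return self.gamma * out + x                    # residual connection


if __name__ == "__main__":
    # Toy feature map: batch of 2, 64 channels, 32x32 spatial grid.
    feats = torch.randn(2, 64, 32, 32)
    attn = SpatialSelfAttention(channels=64)
    print(attn(feats).shape)  # torch.Size([2, 64, 32, 32])
```

In this formulation each output position is a weighted sum of features at all positions, which is one way to capture the spatial dependency across image regions that the abstract attributes to Face2Vid; the temporal dependency between frames would be handled separately, e.g. by warping the previous frame with optical flow.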