TACR-Net: Editing on Deep Video and Voice Portraits

2021 
Editing the mouth of a portrait in a target video with an arbitrary speech clip is a novel yet challenging task. Although impressive results have been achieved, existing methods still suffer from three limitations: 1) because acoustic features are not completely decoupled from speaker identity, there is no global mapping from speech to facial features (e.g., landmarks, expression blendshapes); 2) audio-driven talking-face sequences generated by a simple cascade structure usually lack temporal consistency and spatial correlation, which leads to inconsistent changes in details; 3) forgery is always performed at the video level, without considering forgery of the voice, especially synchronization between the converted voice and the mouth. To address these problems, we propose a novel deep learning framework, the Temporal-Refinement Autoregressive-Cascade Rendering Network (TACR-Net), for audio-driven dynamic talking-face editing. TACR-Net encodes facial expression blendshapes from the given acoustic features without separate training for each video. TACR-Net also incorporates a novel autoregressive cascade generator for video re-rendering. Finally, we transfer in-the-wild speech to the target portrait and obtain a photo-realistic and audio-realistic video.
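The paper's implementation is not reproduced here, but the abstract implies a two-stage pipeline: an audio-to-blendshape encoder followed by an autoregressive cascade renderer that conditions each frame on the previously generated one. The sketch below illustrates that structure only; every module name, feature dimension, and the cascade depth are assumptions for illustration, not the authors' architecture.

```python
# Hypothetical sketch of the pipeline described in the TACR-Net abstract.
# Dimensions (80-d audio features, 52 blendshapes, 64x64 frames, 3 cascade
# stages) are illustrative assumptions, not values from the paper.
import torch
import torch.nn as nn


class AudioToBlendshape(nn.Module):
    """Maps acoustic features to facial expression blendshape coefficients."""

    def __init__(self, audio_dim: int = 80, blendshape_dim: int = 52, hidden: int = 256):
        super().__init__()
        self.rnn = nn.GRU(audio_dim, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, blendshape_dim)

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        h, _ = self.rnn(audio_feats)          # (B, T, hidden)
        return self.head(h)                   # (B, T, blendshape_dim)


class CascadeStage(nn.Module):
    """One refinement stage of the cascade renderer."""

    def __init__(self, in_ch: int, out_ch: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, out_ch, 3, padding=1), nn.Tanh(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class AutoregressiveCascadeRenderer(nn.Module):
    """Renders each frame conditioned on the previous output frame
    (the autoregressive link) and refines it through a stage cascade."""

    def __init__(self, blendshape_dim: int = 52, num_stages: int = 3, size: int = 64):
        super().__init__()
        self.size = size
        # Broadcast blendshapes into a spatial conditioning map.
        self.cond = nn.Linear(blendshape_dim, size * size)
        # Stage 0 sees [condition map, previous frame]; later stages also
        # see the current coarse estimate being refined.
        self.stages = nn.ModuleList(
            [CascadeStage(1 + 3)] + [CascadeStage(1 + 3 + 3) for _ in range(num_stages - 1)]
        )

    def forward(self, blendshapes: torch.Tensor) -> torch.Tensor:
        b, t, _ = blendshapes.shape
        prev = torch.zeros(b, 3, self.size, self.size, device=blendshapes.device)
        frames = []
        for i in range(t):
            cond = self.cond(blendshapes[:, i]).view(b, 1, self.size, self.size)
            out = self.stages[0](torch.cat([cond, prev], dim=1))
            for stage in self.stages[1:]:
                out = stage(torch.cat([cond, prev, out], dim=1))
            frames.append(out)
            prev = out  # autoregressive: next frame conditions on this output
        return torch.stack(frames, dim=1)     # (B, T, 3, H, W)


if __name__ == "__main__":
    audio = torch.randn(2, 16, 80)            # a batch of 16-frame mel features
    blendshapes = AudioToBlendshape()(audio)
    video = AutoregressiveCascadeRenderer()(blendshapes)
    print(video.shape)                        # torch.Size([2, 16, 3, 64, 64])
```

Feeding each rendered frame back as conditioning for the next is one plausible reading of "autoregressive cascade", and it directly targets the temporal-consistency defect the abstract attributes to a simple cascade structure.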