Speaker-Independent Lipreading With Limited Data

2020 
Recent research has demonstrated that, given a huge annotated training dataset, some sophisticated automatic lipreading methods can perform even better than a professional human lip reader. However, when the training set is limited, i.e. contains only a small number of speakers, most existing lipreading approaches cannot provide accurate recognition results for unseen speakers due to inter-speaker variability. To improve lipreading performance in the speaker-independent scenario, a new deep neural network (DNN) is proposed in this paper. The proposed network is composed of two parts: the Transformer-based Visual Speech Recognition Network (TVSR-Net) and the Speaker Confusion Block (SC-Block). The TVSR-Net is designed to extract lip features and recognize the speech, while the SC-Block aims to achieve speaker normalization by eliminating the influence of individual talking styles and habits. A Multi-Task Learning (MTL) scheme is designed for network optimization. Experimental results on the GRID dataset demonstrate the effectiveness of the proposed network for speaker-independent recognition with limited training data.
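The abstract names the two branches and the MTL objective but gives no layer details, so the sketch below is a hypothetical PyTorch illustration of that design, not the authors' implementation. A small Transformer encoder stands in for TVSR-Net; the SC-Block is assumed to be an adversarial speaker classifier behind a gradient-reversal layer (a common way to "confuse" speaker identity and force speaker-invariant features); and the multi-task loss combines a CTC recognition term with a weighted speaker-confusion term. All shapes, layer sizes, the CTC choice, and the weight `alpha` are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; negates gradients on the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None


class TVSRNet(nn.Module):
    """Assumed Transformer branch: mouth-crop video -> per-frame token logits."""
    def __init__(self, feat_dim=256, vocab_size=28):
        super().__init__()
        # 3D-conv frontend over grayscale mouth crops (B, 1, T, H, W) -- assumed
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2),
                      padding=(2, 3, 3)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # pool space, keep the time axis
        )
        self.proj = nn.Linear(64, feat_dim)
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.classifier = nn.Linear(feat_dim, vocab_size)

    def forward(self, video):
        f = self.frontend(video)          # (B, 64, T, 1, 1)
        f = f.flatten(2).transpose(1, 2)  # (B, T, 64)
        h = self.encoder(self.proj(f))    # (B, T, feat_dim) lip features
        return h, self.classifier(h)      # features + per-frame logits


class SCBlock(nn.Module):
    """Assumed speaker-confusion head: speaker classifier behind gradient reversal."""
    def __init__(self, feat_dim=256, n_speakers=4, lam=1.0):
        super().__init__()
        self.lam = lam
        self.head = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, n_speakers))

    def forward(self, h):
        pooled = h.mean(dim=1)  # utterance-level feature
        # Reversed gradients push the encoder toward speaker-invariant features.
        return self.head(GradReverse.apply(pooled, self.lam))


def mtl_loss(token_logits, targets, target_lens, spk_logits, spk_ids, alpha=0.5):
    """MTL objective: CTC recognition loss + weighted speaker-confusion loss.
    CTC and the weight alpha are placeholder choices, not taken from the paper."""
    log_probs = token_logits.log_softmax(-1).transpose(0, 1)  # (T, B, vocab)
    in_lens = torch.full((log_probs.size(1),), log_probs.size(0), dtype=torch.long)
    rec = F.ctc_loss(log_probs, targets, in_lens, target_lens, blank=0)
    conf = F.cross_entropy(spk_logits, spk_ids)
    return rec + alpha * conf


# Toy usage: two 75-frame GRID-style clips (shapes and labels are illustrative).
video = torch.randn(2, 1, 75, 64, 128)
net, sc = TVSRNet(), SCBlock()
feats, logits = net(video)
loss = mtl_loss(logits, torch.randint(1, 28, (2, 6)),
                torch.tensor([6, 6]), sc(feats), torch.tensor([0, 1]))
```

Because the confusion term is adversarial, the recognition branch and the speaker classifier pull the shared lip features in opposite directions, which is one plausible reading of how the SC-Block "eliminates" speaker-specific talking styles during MTL optimization.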