Discriminatively trained joint speaker and environment representations for adaptation of deep neural network acoustic models

2016 
A recent trend in normalizing factors extraneous to the speech recognition task has been to explicitly introduce features related to the unwanted variability into the training of Deep Neural Networks (DNNs). Typically, this is done either by perturbing the training set with models of these extraneous factors, such as vocal tract length and environmental noise, or by augmenting the conventional spectral features with auxiliary information such as i-vectors or noise spectra. Another emerging approach is to derive low-dimensional representations of the factors from the hidden layers of a DNN and use them to normalize the acoustic model. Almost all of these approaches focus on either speaker or environment normalization. In this paper we propose a novel approach for estimating a compact joint representation of speaker and environment by training a DNN with a bottleneck layer to classify i-vector features into speaker and environment labels via Multi-Task Learning (MTL). A further novelty is to learn this compact representation while learning to map the i-vector of a noisy utterance to its corresponding clean speaker i-vector and noise-only i-vector. Experiments were conducted on an artificially noise-corrupted version of the WSJ corpus. The proposed compact joint speaker-environment representations show promising gains.
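
The described architecture can be sketched as below. This is a minimal illustration in PyTorch, not the authors' implementation: the layer sizes, label counts, and equal weighting of the two task losses are assumptions. The paper's second variant (mapping a noisy i-vector to its clean speaker i-vector and noise-only i-vector) would replace the softmax classification heads with linear regression heads trained with a mean-squared-error loss.

```python
# Minimal sketch of a multi-task bottleneck DNN over i-vectors.
# Dimensions and label counts below are illustrative assumptions.
import torch
import torch.nn as nn

class JointBottleneckDNN(nn.Module):
    """Maps an input i-vector to a compact joint speaker-environment
    representation (the bottleneck activation), with two classification
    heads trained jointly by multi-task learning."""
    def __init__(self, ivec_dim=400, hidden_dim=1024,
                 bottleneck_dim=50, n_speakers=100, n_environments=5):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(ivec_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, bottleneck_dim), nn.ReLU(),  # bottleneck
        )
        self.speaker_head = nn.Linear(bottleneck_dim, n_speakers)
        self.env_head = nn.Linear(bottleneck_dim, n_environments)

    def forward(self, ivec):
        z = self.trunk(ivec)  # compact joint speaker-environment code
        return self.speaker_head(z), self.env_head(z), z

model = JointBottleneckDNN()
ce = nn.CrossEntropyLoss()

ivecs = torch.randn(8, 400)            # batch of noisy-utterance i-vectors
spk = torch.randint(0, 100, (8,))      # speaker labels
env = torch.randint(0, 5, (8,))        # environment labels

spk_logits, env_logits, z = model(ivecs)
loss = ce(spk_logits, spk) + ce(env_logits, env)  # equal-weight MTL loss
loss.backward()
```

Given the title, the bottleneck activation `z` would presumably be appended to the spectral features as an auxiliary input when adapting the acoustic-model DNN, in the same spirit as conventional i-vector-based adaptation.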