Mapping and Masking Targets Comparison using Different Deep Learning based Speech Enhancement Architectures

2020 
Mapping and Masking targets are both widely used in recent Deep Neural Network (DNN) based supervised speech enhancement. Masking targets are proved to have a positive impact on the intelligibility of the output speech, while mapping targets are found, in other studies, to generate speech with better quality. However, most of the studies are based on comparing the two approaches using the Multilayer Perceptron (MLP) architecture only. With the emergence of new architectures that outperform the MLP, a more generalized comparison is needed between mapping and masking approaches. In this paper, a complete comparison will be conducted between mapping and masking targets using four different DNN based speech enhancement architectures, to work out how the performance of the networks changes with the chosen training target. The results show that there is no perfect training target with respect to all the different speech quality evaluation metrics, and that there is a tradeoff between the denoising process and the intelligibility of the output speech. Furthermore, the generalization ability of the networks was evaluated, and it is concluded that the design of the architecture restricts the choice of the training target, because masking targets result in significant performance degradation for deep convolutional autoencoder architecture.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    51
    References
    2
    Citations
    NaN
    KQI
    []