Performance analysis of various training targets for improving speech quality and intelligibility

2021 
Abstract Denoising single-channel speech (recorded using one microphone) remains an open problem in many speech-related applications. Recently, supervised deep learning methods have been used to denoise the speech signal. This work uses a Deep Neural Network (DNN) to learn the Time–Frequency (T-F) mask of the clean speech from noisy speech features. In general, the Ideal Binary Mask (IBM) is used as a binary mask training target to improve speech intelligibility, and the Ideal Ratio Mask (IRM) is used as a non-binary mask training target to improve speech quality. Still, these may not necessarily be the best T-F masks for improving speech quality and intelligibility, and the appropriate training target for supervised deep learning methods remains unclear. In this work, a novel non-binary soft T-F mask named Optimum Soft Mask (OSM) is proposed, analyzed and compared with the different T-F mask types used in single-channel speech denoising methods. In addition, the target T-F mask is compared with existing state-of-the-art approaches to show a clear performance advantage of supervised deep learning models. The performance of the binary and non-binary DNN training targets is evaluated under different Signal-to-Noise Ratios (SNRs) and noise conditions to improve speech quality and intelligibility. The experimental results reveal that the binary mask IBM shows a significant improvement in speech intelligibility, while the non-binary mask IRM shows a substantial improvement in speech quality. At the same time, the proposed novel soft T-F mask shows notable improvement in both quality and intelligibility under various test conditions.
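To make the two baseline training targets concrete, the following is a minimal sketch of how an IBM and an IRM are commonly computed from clean-speech and noise power spectrograms. The abstract gives no formulas, so the definitions here are the ones usually found in the literature and are assumptions: the IBM thresholds the local SNR of each T-F unit against a local criterion `lc_db` (0 dB assumed), and the IRM is the ratio of speech power to total power raised to an exponent `beta` (0.5 assumed). The function names and the small constant `EPS` are hypothetical. The proposed OSM is not defined in the abstract and is therefore not sketched.

```python
import math

EPS = 1e-12  # assumed small constant to avoid log(0) and division by zero


def ideal_binary_mask(speech_power, noise_power, lc_db=0.0):
    """Binary mask: 1 where the local SNR (in dB) exceeds lc_db, else 0.

    Inputs are 2D lists (frames x frequency bins) of power values.
    """
    mask = []
    for s_row, n_row in zip(speech_power, noise_power):
        row = []
        for s, n in zip(s_row, n_row):
            snr_db = 10.0 * math.log10((s + EPS) / (n + EPS))
            row.append(1 if snr_db > lc_db else 0)
        mask.append(row)
    return mask


def ideal_ratio_mask(speech_power, noise_power, beta=0.5):
    """Soft mask in [0, 1]: (speech / (speech + noise)) ** beta per T-F unit."""
    mask = []
    for s_row, n_row in zip(speech_power, noise_power):
        row = [(s / (s + n + EPS)) ** beta for s, n in zip(s_row, n_row)]
        mask.append(row)
    return mask


# Toy 2x2 power spectrograms (made-up values, for illustration only).
speech = [[4.0, 1.0], [0.0, 9.0]]
noise = [[1.0, 4.0], [1.0, 0.0]]

ibm = ideal_binary_mask(speech, noise)
irm = ideal_ratio_mask(speech, noise)
```

In practice the power spectrograms would come from a short-time Fourier transform (or a similar T-F decomposition) of the clean speech and the noise, and the resulting mask would serve as the target the DNN learns to predict from noisy speech features.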