Block-based high performance CNN architectures for frame-level overlapping speech detection

2020 
Speech technology systems such as Automatic Speech Recognition (ASR), speaker diarization, speaker recognition, and speech synthesis have advanced significantly with the emergence of deep learning techniques. However, none of these voice-enabled systems performs well in natural environments, specifically in situations where one or more potential interfering talkers are involved. Therefore, overlapping speech detection has become an important front-end triage step for speech technology applications. This is crucial for large-scale datasets where manual labeling is not possible. A block-based CNN architecture is proposed to model overlapping speech in audio streams with frames as short as 25 ms. The proposed architecture is robust to both (i) shifts in the distribution of network activations due to changes in network parameters during training, and (ii) local variations in the input features caused by feature extraction, environmental noise, or room interference. We also investigate the effect of alternate input features, including spectral magnitude, MFCC, MFB, and pyknogram, on both computational time and classification performance. Evaluation is performed on simulated overlapping speech signals based on the GRID corpus. The experimental results highlight the capability of the proposed system in detecting overlapping speech frames with 90.5% accuracy, 93.5% precision, 92.7% recall, and 92.8% F-score on same-gender overlapped speech. For opposite-gender cases, the network scores exceed 95% in all the classification metrics.
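To make the frame-level evaluation concrete, the following is a minimal sketch, not the authors' code, of how a signal might be cut into 25 ms frames and how accuracy, precision, recall, and F-score could be computed from per-frame binary labels (1 = overlapping speech, 0 = otherwise). The function names `frame_signal` and `frame_metrics`, the 16 kHz sample rate, and the use of non-overlapping frames are assumptions made for illustration only; the abstract does not specify them.

```python
def frame_signal(samples, sample_rate, frame_ms=25):
    """Split samples into non-overlapping frames of frame_ms milliseconds.

    Assumption: no frame overlap (hop == frame length); trailing samples
    that do not fill a whole frame are dropped.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


def frame_metrics(y_true, y_pred):
    """Compute frame-level accuracy, precision, recall, and F-score.

    y_true and y_pred are equal-length sequences of 0/1 labels,
    where 1 marks a frame containing overlapping speech.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    fscore = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "fscore": fscore,
    }


if __name__ == "__main__":
    # Hypothetical 16 kHz signal of 1000 samples: two full 25 ms frames.
    frames = frame_signal(list(range(1000)), sample_rate=16000)
    print(len(frames), len(frames[0]))

    # Toy reference and predicted labels for five frames.
    print(frame_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```

At an assumed 16 kHz sample rate, a 25 ms frame holds 400 samples, so each classification decision is made on a very short context; this is why frame-level metrics, rather than utterance-level ones, are the natural way to score such a detector.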