Block-based high performance CNN architectures for frame-level overlapping speech detection

Midia Yousefi,John H. L. Hansen

Block-based high performance CNN architectures for frame-level overlapping speech detection

2020

Speech technology systems such as Automatic Speech Recognition (ASR), speaker diarization, speaker recognition, and speech synthesis have advanced significantly by the emergence of deep learning techniques. However, none of these voice-enabled systems perform well in natural environmental circumstances, specifically in situations where one or more potential interfering talkers are involved. Therefore, overlapping speech detection has become an important front-end triage step for speech technology applications. This is crucial for large-scale datsets where manual labeling in not possible. A block-based CNN architecture is proposed to address modeling overlapping speech in audio streams with frames as short as 25 ms. The proposed architecture is robust to both: (i) shifts in distribution of network activations due to the change in network parameters during training, (ii) local variations from the input features caused by feature extraction, environmental noise, or room interference. We also investigate the effect of alternate input features including spectral magnitude, MFCC, MFB, and pyknogram on both computational time and classification performance. Evaluation is performed on simulated overlapping speech signals based on the GRID corpus. The experimental results highlight the capability of the proposed system in detecting overlapping speech frames with 90.5% accuracy, 93.5% precision, 92.7% recall, and 92.8% Fscore on same gender overlapped speech. For opposite gender cases, the network scores exceed 95% in all the classification metrics.

Keywords:

Correction
Source
Cite
Save
Machine Reading By IdeaReader

References

Citations