A Restriction Training Recipe for Speech Separation on Sparsely Mixed Speech

2021 
Speech separation techniques have changed rapidly in the last few years. Traditional recurrent neural networks (RNNs) have steadily been replaced by other architectures such as convolutional neural networks (CNNs). Although these models have greatly improved speed and accuracy, they inevitably sacrifice some long-term dependency. As a result, the separated signals are prone to wrong assignment. This problem is especially common when the mixed speech is sparse, as in natural conversation. In this paper, a two-stage training recipe with a restriction term based on the scale-invariant signal-to-noise ratio (SI-SNR) is put forward to prevent the wrong-assignment problem on sparsely mixed speech. Experiments are conducted on mixtures of the Japanese Newspaper Article Sentences (JNAS) corpus. The results show that the proposed approach works efficiently on sparse data (overlapping rate around 50%), and separation performance improves accordingly. To test the applicability of speech separation in practical settings such as meeting transcription, the separation results are also evaluated by speech recognition: the character error rate is reduced by 10% compared to the baseline.
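The restriction term in the abstract builds on the SI-SNR metric, which measures how well an estimated signal matches a reference after removing any scale difference. A minimal NumPy sketch of that metric is shown below; the function name, the `eps` stabilizer, and the test signals are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant signal-to-noise ratio (SI-SNR) in dB.

    Illustrative sketch: projects the estimate onto the reference to
    obtain a scale-invariant target, then compares target and residual
    energies. `eps` guards against division by zero.
    """
    # Zero-mean both signals so the measure ignores DC offsets.
    est = est - est.mean()
    ref = ref - ref.mean()
    # Scale-invariant target: projection of the estimate onto the reference.
    s_target = np.dot(est, ref) * ref / (np.dot(ref, ref) + eps)
    # Everything not explained by the reference counts as noise.
    e_noise = est - s_target
    return 10.0 * np.log10((np.dot(s_target, s_target) + eps)
                           / (np.dot(e_noise, e_noise) + eps))
```

Because both the target and the residual scale linearly with the estimate, rescaling the estimate leaves the ratio unchanged, which is what makes the measure scale-invariant.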