Enhanced feature network for monaural singing voice separation

2019 
Abstract. Deep Recurrent Neural Network (DRNN) based monaural singing voice separation (MSVS) methods have recently obtained impressive separation results. Most DRNN-based methods directly take the magnitude spectra of the mixture signal as the input feature, which is high-dimensional and contains redundant information. DRNN-based models, however, cannot extract effective low-dimensional, de-redundant representations from the magnitude spectra. In this paper, we propose an Enhanced Feature Network (EFN) to extract effective representations of the magnitude spectra, i.e., the enhanced-feature, for MSVS. The enhanced-feature is generated in two consecutive stages: (i) modeling the local and contextual information explicitly with a Convolutional Neural Network (CNN); (ii) extracting the high-level sequential feature with a Highway Network and a bi-directional Recurrent Neural Network (RNN). In the first stage, the EFN generates an enhanced-sequence consisting of the high-resolution magnitude spectra and their low-dimensional representations, where the low-dimensional part avoids the high cost of spectra decomposition and the high-resolution part mitigates information loss. In the second stage, the enhanced-sequence is used to extract the enhanced-feature, which is more suitable for MSVS. Experiments on the MIR-1K dataset show that the enhanced-feature yields better separation than the magnitude spectra or their low-dimensional representations alone. The proposed method obtains a 0.16–0.31 dB GNSDR gain and a 0.48–0.71 dB GSAR gain compared with previously proposed DRNN-based methods. Moreover, the separation module of the EFN, which adopts only one hidden GRU RNN layer, noticeably increases the training speed.
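The abstract names two building blocks without giving their equations: an enhanced-sequence that joins each high-resolution spectral frame with a low-dimensional representation, and a Highway Network in the second stage. The following is a minimal sketch of both ideas, not the authors' implementation. It assumes the join is a per-frame concatenation, uses the standard highway formulation y = T(x) * H(x) + (1 - T(x)) * x with a ReLU transform, and all function names, sizes, and weights are hypothetical toy values.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(weights, bias, x):
    """Return W x + b for a row-major weight matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def enhanced_frame(spectrum, low_dim):
    """Assumed form of one enhanced-sequence frame: plain concatenation."""
    return list(spectrum) + list(low_dim)


def highway_layer(x, w_h, b_h, w_t, b_t):
    """Standard highway layer: y = T(x) * H(x) + (1 - T(x)) * x."""
    h = [max(0.0, v) for v in affine(w_h, b_h, x)]  # transform H(x); ReLU is an assumption
    t = [sigmoid(v) for v in affine(w_t, b_t, x)]   # transform gate T(x) in (0, 1)
    return [ti * hi + (1.0 - ti) * xi for ti, hi, xi in zip(t, h, x)]


# Toy frame: 4 spectral bins plus a 2-dimensional low-dimensional part.
frame = enhanced_frame([0.9, 0.1, 0.4, 0.7], [0.5, 0.2])
n = len(frame)

# Hypothetical weights: H doubles its input, the gate depends only on its bias.
w_h = [[2.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
w_t = [[0.0] * n for _ in range(n)]
zeros = [0.0] * n

carried = highway_layer(frame, w_h, zeros, w_t, [-20.0] * n)     # gate closed: input carried through
transformed = highway_layer(frame, w_h, zeros, w_t, [20.0] * n)  # gate open: H(x) passes
```

With the gate bias strongly negative the layer passes the frame through almost unchanged, and with it strongly positive the layer outputs the transformed values; a trained gate sits between these extremes for each dimension.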