A Pitch-aware Speaker Extraction Serial Network

2020 
Although deep learning performs well in monaural speaker extraction, extracting the target speaker remains challenging when the speakers are of the same gender, i.e., in male-male and female-female mixtures. Pitch tracking, on the other hand, has been shown to be effective for same-gender speech separation. In this study, we propose a pitch-aware speaker extraction serial network (PSESNet) to improve extraction performance. We design a serial system and compare it with multi-task learning, using the target speaker’s pitch information to optimize the loss function rather than as an input to the extraction network. The extraction part uses SpeakerBeam-FE (SBF) with magnitude and temporal spectrum approximation loss (MTSAL) and speaker embedding concatenation. Once the target speaker’s spectrogram has been extracted, it is passed on to predict the pitch information, which provides further optimization. Experimental results show that the serial system performs better than multi-task learning and that the proposed method improves performance in both same-gender and opposite-gender conditions. On average, PSESNet achieves 4.7% and 3.8% relative improvements in signal-to-distortion ratio (SDR) over the SBF-MTSAL-Concat baseline on the WSJ0 dataset under the closed and open conditions.
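The serial idea described above, in which pitch is predicted from the extracted spectrogram and contributes only to the loss, can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration and is not taken from the paper: a plain mean-squared error stands in for MTSAL, the pitch target is treated as a quantized class per frame, and the names `magnitude_loss`, `pitch_loss`, `serial_loss`, `pitch_head`, and the weight `alpha` are hypothetical.

```python
import numpy as np

def magnitude_loss(est_mag, ref_mag):
    """Mean squared error between spectrograms (a stand-in for MTSAL, assumed)."""
    return float(np.mean((est_mag - ref_mag) ** 2))

def pitch_loss(pitch_logits, pitch_labels):
    """Frame-wise cross-entropy over quantized pitch classes (assumed formulation)."""
    # Numerically stable log-softmax over the class axis.
    shifted = pitch_logits - pitch_logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    frames = np.arange(pitch_labels.shape[0])
    return float(-np.mean(log_probs[frames, pitch_labels]))

def serial_loss(est_mag, ref_mag, pitch_head, pitch_labels, alpha=0.1):
    """Extraction loss plus a weighted pitch loss computed on the extracted output."""
    # Serial connection: the pitch head sees the extracted spectrogram,
    # so pitch is never an input to the extraction network itself.
    pitch_logits = pitch_head(est_mag)
    return magnitude_loss(est_mag, ref_mag) + alpha * pitch_loss(pitch_logits, pitch_labels)

# Toy usage: 4 frames, 8 frequency bins, 5 hypothetical pitch classes.
rng = np.random.default_rng(0)
ref_mag = rng.random((4, 8))
est_mag = ref_mag + 0.05 * rng.standard_normal((4, 8))
weights = rng.standard_normal((8, 5))        # hypothetical linear pitch head
labels = np.array([0, 2, 2, 4])              # hypothetical per-frame pitch classes
total = serial_loss(est_mag, ref_mag, lambda m: m @ weights, labels)
print(f"combined loss: {total:.4f}")
```

In this sketch the pitch term only shapes the training objective; at inference time the pitch head could be dropped and the extraction output used on its own.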