Multiple Temporal Scales Based Speaker Embeddings Learning for Text-dependent Speaker Recognition

2019 
Extracting highly speaker-sensitive embeddings from deep neural networks remains a challenge in the field of speaker recognition. This paper proposes a novel network that learns speaker embeddings from multiple temporal scales, an idea inspired by recent biological research showing that the human auditory system fuses information across multiple timescales to encode sound. A two-pathway neural network is presented, in which one pathway focuses on short-time (or local) traits and the other on long-range (or global) traits. The two kinds of traits are fused into one feature vector, from which utterance-level speaker embeddings are extracted. Experimental results show that traits at different timescales complement each other, and that their fusion, referred to as the t-vector, outperforms the i-vector and other deep embeddings. Moreover, with end-to-end training, t-vectors achieve excellent performance even with a simple scoring approach such as cosine distance.
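The abstract does not give the exact architecture, but the two-pathway idea can be illustrated with a minimal sketch: one branch uses small convolution kernels for short-time (local) traits, the other uses dilated convolutions for a long-range (global) receptive field, and the two are fused and pooled into an utterance-level embedding scored by cosine distance. All layer sizes, names, and pooling choices below are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of a two-pathway, multi-timescale embedding network.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoPathwayEmbedder(nn.Module):
    def __init__(self, feat_dim=40, channels=128, embed_dim=256):
        super().__init__()
        # Local pathway: small kernels -> short temporal receptive field.
        self.local = nn.Sequential(
            nn.Conv1d(feat_dim, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # Global pathway: dilated convolutions -> long temporal receptive field.
        self.global_path = nn.Sequential(
            nn.Conv1d(feat_dim, channels, kernel_size=5, dilation=2, padding=4),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=5, dilation=4, padding=8),
            nn.ReLU(),
        )
        # Fuse both pathways into one utterance-level embedding ("t-vector").
        self.fuse = nn.Linear(2 * channels, embed_dim)

    def forward(self, x):                      # x: (batch, feat_dim, frames)
        local = self.local(x)                  # (batch, channels, frames)
        global_ = self.global_path(x)          # (batch, channels, frames)
        fused = torch.cat([local, global_], dim=1)
        pooled = fused.mean(dim=2)             # temporal average pooling
        return self.fuse(pooled)               # utterance-level embedding

# Scoring a trial pair with cosine distance, as in the paper's evaluation.
model = TwoPathwayEmbedder()
enroll = model(torch.randn(1, 40, 200))        # 200-frame enrollment utterance
test = model(torch.randn(1, 40, 180))          # 180-frame test utterance
score = F.cosine_similarity(enroll, test).item()
```

With end-to-end training, embeddings like these can be compared directly by cosine similarity, avoiding a separate back-end scorer such as PLDA.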