Multiple Temporal Scales Based Speaker Embeddings Learning for Text-dependent Speaker Recognition

2019 
Extracting highly speaker-sensitive embeddings from deep neural networks remains a challenge in the field of speaker recognition. This paper proposes a novel network that learns speaker embeddings from multiple temporal scales, an idea inspired by recent biological research showing that the human auditory system fuses information across multiple timescales to encode sound. A two-pathway neural network is presented, in which one pathway focuses on short-time (or local) traits and the other on long-range (or global) traits. The two kinds of traits are fused into one feature vector, from which utterance-level speaker embeddings are extracted. Experimental results show that traits at different timescales complement each other, and that their fusion, referred to as the t-vector, outperforms the i-vector and other deep embeddings. Moreover, with end-to-end training, t-vectors achieve excellent performance even with a simple scoring approach such as cosine distance.
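The abstract does not give the exact architecture, but the two-pathway idea can be illustrated with a minimal sketch: one branch uses small convolution kernels for short-time (local) traits, the other uses dilated convolutions for a long-range (global) receptive field, and the two are fused and pooled into an utterance-level embedding scored by cosine distance. All layer sizes, names, and pooling choices below are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of a two-pathway, multi-timescale embedding network.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoPathwayEmbedder(nn.Module):
    def __init__(self, feat_dim=40, channels=128, embed_dim=256):
        super().__init__()
        # Local pathway: small kernels -> short temporal receptive field.
        self.local = nn.Sequential(
            nn.Conv1d(feat_dim, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # Global pathway: dilated convolutions -> long temporal receptive field.
        self.global_path = nn.Sequential(
            nn.Conv1d(feat_dim, channels, kernel_size=5, dilation=2, padding=4),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=5, dilation=4, padding=8),
            nn.ReLU(),
        )
        # Fuse both pathways into one utterance-level embedding ("t-vector").
        self.fuse = nn.Linear(2 * channels, embed_dim)

    def forward(self, x):                      # x: (batch, feat_dim, frames)
        local = self.local(x)                  # (batch, channels, frames)
        global_ = self.global_path(x)          # (batch, channels, frames)
        fused = torch.cat([local, global_], dim=1)
        pooled = fused.mean(dim=2)             # temporal average pooling
        return self.fuse(pooled)               # utterance-level embedding

# Scoring a trial pair with cosine distance, as in the paper's evaluation.
model = TwoPathwayEmbedder()
enroll = model(torch.randn(1, 40, 200))        # 200-frame enrollment utterance
test = model(torch.randn(1, 40, 180))          # 180-frame test utterance
score = F.cosine_similarity(enroll, test).item()
```

With end-to-end training, embeddings like these can be compared directly by cosine similarity, avoiding a separate back-end scorer such as PLDA.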