Deep neural networks based speaker modeling at different levels of phonetic granularity

2017 
Recently, a hybrid deep neural network (DNN)/i-vector framework has proven effective for speaker verification, where a DNN trained to predict tied-triphone states (senones) produces frame alignments for sufficient-statistics extraction. In this work, to better understand the impact of phonetic precision on speaker verification, we evaluate three levels of phonetic granularity for frame alignment: tied-triphone state, monophone state, and monophone. The distribution of the features associated with a given phonetic unit is further modeled with multiple Gaussians rather than a single Gaussian. We also propose a fast and efficient way to generate phonetic units of different granularity by tying the DNN's outputs according to a clustering of DNN-derived senone embeddings. Experiments are carried out on the NIST SRE 2008 female tasks. Results show that speaker models built from DNNs with less precise phonetic units and more Gaussians per phonetic unit generalize better across speaker verification tasks.
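The two core operations described above can be sketched in a few lines of NumPy: merging per-frame senone posteriors into coarser phonetic units via a senone-to-cluster map (the tying step), and accumulating the zeroth- and first-order Baum-Welch sufficient statistics used by the i-vector extractor. This is a minimal illustration, not the paper's implementation; the function names, the posterior matrix, and the cluster assignment are all hypothetical stand-ins (e.g., the cluster labels would come from clustering senone embeddings, which is not shown here).

```python
import numpy as np

def tie_posteriors(senone_post, cluster_of):
    """Merge DNN senone posteriors into coarser phonetic units.

    senone_post: (T, S) per-frame posteriors over S senones.
    cluster_of:  (S,) integer label mapping each senone to one of U units
                 (e.g., obtained by clustering senone embeddings).
    Returns:     (T, U) posteriors over the tied units.
    """
    n_units = int(cluster_of.max()) + 1
    assign = np.eye(n_units)[cluster_of]     # (S, U) 0/1 assignment matrix
    return senone_post @ assign              # sum posteriors within each unit

def sufficient_stats(feats, post):
    """Zeroth- and first-order Baum-Welch statistics per unit.

    feats: (T, D) acoustic features; post: (T, U) unit posteriors.
    Returns N (U,) occupation counts and F (U, D) weighted feature sums.
    """
    N = post.sum(axis=0)                     # zeroth-order statistics
    F = post.T @ feats                       # first-order statistics
    return N, F

# Toy example: 5 frames, 8 senones tied into 3 units, 2-dim features.
rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 8))
post = np.exp(logits)
post /= post.sum(axis=1, keepdims=True)      # valid per-frame posteriors
cluster_of = np.array([0, 0, 1, 1, 1, 2, 2, 0])
feats = rng.normal(size=(5, 2))

tied = tie_posteriors(post, cluster_of)
N, F = sufficient_stats(feats, tied)
```

Because each senone maps to exactly one tied unit, the merged posteriors still sum to one per frame, and the occupation counts sum to the number of frames; modeling each tied unit with multiple Gaussians would simply split each column of the tied posterior matrix further by Gaussian responsibilities.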