On the Use of Acoustic Unit Discovery for Language Recognition

2016 
In this paper, we explore the use of large-scale acoustic unit discovery for language recognition. The deep neural network-based approaches that have achieved recent success in this task require transcribed speech and pronunciation dictionaries, which may be limited in availability and expensive to obtain. We aim to replace the need for such supervision via the unsupervised discovery of acoustic units. In this work, we present a parallelized version of a Bayesian nonparametric model from previous work and use it to learn acoustic units from a few hundred hours of multilingual data. These unit (or senone) sequences are then used as targets to train a deep neural network-based i-vector language recognition system. We find that a score-level fusion of our unsupervised system with an acoustic baseline can significantly shrink the gap between the baseline and a supervised benchmark system built using transcribed English. Subsequent experiments also show that an improved acoustic representation of the data can yield substantial performance gains and that language specificity is important for discovering meaningful acoustic units. We validate the generalizability of our proposed approach by presenting state-of-the-art results that exhibit similar trends on the NIST Language Recognition Evaluations from 2011 and 2015.
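The abstract mentions a score-level fusion of the unsupervised system with an acoustic baseline. A minimal sketch of what such a fusion could look like is shown below, assuming per-trial log-likelihood scores from each system and an interpolation weight tuned on a development set; the fusion weights, score scaling, and calibration used in the paper are not specified here, so this is an illustration rather than the authors' actual recipe.

```python
import numpy as np


def fuse_scores(scores_a, scores_b, weight=0.5):
    """Weighted score-level fusion of two language recognition systems.

    scores_a, scores_b: arrays of shape (n_trials, n_languages) with
    per-trial log-likelihood scores from each system.
    weight: assumed interpolation factor; in practice this would be
    tuned (or learned via logistic-regression calibration) on a
    held-out development set.
    """
    scores_a = np.asarray(scores_a, dtype=float)
    scores_b = np.asarray(scores_b, dtype=float)
    return weight * scores_a + (1.0 - weight) * scores_b


# Hypothetical usage: fuse scores from the unsupervised AUD-based
# system with the acoustic baseline before computing detection metrics.
fused = fuse_scores(np.random.randn(100, 24),
                    np.random.randn(100, 24),
                    weight=0.6)
```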