Vocal Tract Length Normalization using a Gaussian mixture model framework for query-by-example spoken term detection

2019 
Abstract A speech spectrum is known to change with variations in the length of a speaker's vocal tract, because speech formants are inversely related to the vocal tract length (VTL). The process of compensating for spectral variation due to vocal tract length is known as Vocal Tract Length Normalization (VTLN), an important speaker normalization technique for speech recognition and related tasks. In this paper, we use the Gaussian Posteriorgram (GP) of VTL-warped spectral features for a Query-by-Example Spoken Term Detection (QbE-STD) task. The paper presents a Gaussian Mixture Model (GMM) framework for estimating the VTLN warping factor; in particular, this framework does not require phoneme-level transcription. We observe a correlation between the VTLN warping factor estimates obtained via a supervised HMM-based approach and the proposed unsupervised GMM-based approach. In addition, phoneme recognition and speaker de-identification tasks are conducted using the GMM-based VTLN warping factor estimates. For QbE-STD, we consider three spectral features, namely, Mel Frequency Cepstral Coefficients (MFCC), Perceptual Linear Prediction (PLP), and MFCC-TMP (which uses the Teager Energy Operator (TEO) to implicitly exploit magnitude and phase information in the MFCC framework). Linear frequency scaling variations for the VTLN warping factor are incorporated into these three cepstral representations for the QbE-STD task. The VTL-warped Gaussian posteriorgram improves the Maximum Term Weighted Value by 0.021 (i.e., 2.1 %) and 0.015 (i.e., 1.5 %) for the MFCC and PLP feature sets, respectively, on the evaluation set of the MediaEval SWS 2013 corpus. This improved performance is primarily due to VTLN warping factor estimation using the unsupervised GMM framework.
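The unsupervised estimation step described above can be sketched as follows: each candidate warping factor produces a warped feature matrix, and the factor whose features score highest under a pre-trained GMM is selected; the per-frame component posteriors of that GMM form the Gaussian posteriorgram. This is a minimal sketch assuming diagonal-covariance Gaussians and precomputed warped features (the frequency warping itself is not shown); all function names are illustrative, not from the paper.

```python
import numpy as np

def _log_components(X, means, variances, weights):
    """Per-frame, per-component log p(x | k) + log w_k for a diagonal GMM.
    X: (T, D) frames; means, variances: (K, D); weights: (K,)."""
    diff = X[:, None, :] - means[None, :, :]                       # (T, K, D)
    return (-0.5 * np.sum(diff**2 / variances + np.log(2 * np.pi * variances),
                          axis=2)
            + np.log(weights))                                     # (T, K)

def gmm_log_likelihood(X, means, variances, weights):
    """Total log-likelihood of the frames X under the GMM (log-sum-exp)."""
    lc = _log_components(X, means, variances, weights)
    m = lc.max(axis=1, keepdims=True)
    return float(np.sum(m.squeeze(1) + np.log(np.sum(np.exp(lc - m), axis=1))))

def gaussian_posteriorgram(X, means, variances, weights):
    """Per-frame posterior over GMM components, i.e. the Gaussian posteriorgram."""
    lc = _log_components(X, means, variances, weights)
    p = np.exp(lc - lc.max(axis=1, keepdims=True))
    return p / p.sum(axis=1, keepdims=True)

def select_warp_factor(frames_by_alpha, means, variances, weights):
    """Pick the warping factor alpha whose warped features score highest
    under the GMM -- the maximum-likelihood criterion, with no transcription."""
    scores = {a: gmm_log_likelihood(X, means, variances, weights)
              for a, X in frames_by_alpha.items()}
    return max(scores, key=scores.get)
```

In practice the candidate factors form a small grid (e.g. 0.88 to 1.12), and the warped features are MFCC/PLP vectors recomputed with a linearly scaled filterbank, as in the paper.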
Finally, the effectiveness of the proposed VTL-warped GP is demonstrated for rescoring with various detection sources, such as the depth of the detection valley, the Self-Similarity Matrix, Pseudo Relevance Feedback, and weighted mean features.
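For context on the detection stage being rescored: QbE-STD systems of this kind typically match the query and utterance posteriorgrams by Dynamic Time Warping (DTW) over a frame-level distance matrix. A minimal sketch, assuming the common choice of the negative log inner product between posterior vectors as the frame distance (function names are illustrative, not the paper's implementation):

```python
import numpy as np

def posteriorgram_distance(q, u, eps=1e-10):
    """Frame-level distance matrix between a query posteriorgram q (Tq, K)
    and an utterance posteriorgram u (Tu, K): -log of the inner product."""
    return -np.log(np.clip(q @ u.T, eps, None))        # (Tq, Tu)

def dtw_score(q, u):
    """Plain DTW alignment cost between two posteriorgrams, normalized by
    query length (lower = better match)."""
    D = posteriorgram_distance(q, u)
    Tq, Tu = D.shape
    acc = np.full((Tq + 1, Tu + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, Tq + 1):
        for j in range(1, Tu + 1):
            acc[i, j] = D[i - 1, j - 1] + min(acc[i - 1, j - 1],
                                              acc[i - 1, j],
                                              acc[i, j - 1])
    return acc[Tq, Tu] / Tq
```

The detection sources named in the abstract (depth of the detection valley, self-similarity matrix, pseudo relevance feedback, weighted mean features) then operate on the scores and alignments produced by such a matcher to rescore candidate detections.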