Two Stage Zero-resource Approaches for QbE-STD

In this paper, we explore the information in acoustic representations for the Query-by-Example Spoken Term Detection (QbE-STD) task. Several approaches have been employed to detect spoken instances of a query in audio databases. Zero-resource approaches attempt to detect acoustically similar content without the use of a phone recognizer. Here, we present a two-stage frame-level matching scheme for QbE-STD. In the first stage, we use Gaussian posteriorgrams and subsequence dynamic time warping (subDTW) to detect candidate segments within the audio database. In the second stage, we exploit several acoustic features along with Dynamic Time Warping (DTW) detection cues, such as the cosine similarity of term frequency vectors and the valley depth of the detections obtained in subDTW. Score-level fusion of the search systems gives performance comparable to phonetic posteriorgrams on the SWS 2013 database. We obtained a 0.045 (i.e., 4.5 %) improvement in Maximum Term Weighted Value (MTWV) with the score-level fusion of all the evidence, as compared to subDTW on Mel Frequency Cepstral Coefficient (MFCC)-based Gaussian posteriorgrams.
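The first-stage matching described above can be illustrated with a minimal subsequence-DTW sketch. This is not the authors' implementation; the function name `sub_dtw`, the 1 − cosine local distance, and the simple three-direction step pattern are illustrative assumptions. The key property of subDTW, as opposed to standard DTW, is that the query may align against any contiguous span of the utterance: the first query frame starts for free at any utterance frame, and the best path may end at any utterance frame.

```python
# Illustrative sketch of subsequence DTW over posteriorgrams (hypothetical
# helper, not the paper's code). Rows are frames, columns are posterior dims.
import numpy as np

def sub_dtw(query, utterance):
    """Return (best_cost, end_frame) of the cheapest warping path aligning
    `query` (Q x D) against any contiguous span of `utterance` (U x D).
    Local distance: 1 - cosine similarity between posterior vectors."""
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    u = utterance / np.linalg.norm(utterance, axis=1, keepdims=True)
    dist = 1.0 - q @ u.T                # (Q, U) local distance matrix
    acc = np.full_like(dist, np.inf)
    acc[0] = dist[0]                    # free start: any utterance frame
    for i in range(1, dist.shape[0]):
        for j in range(dist.shape[1]):
            best_prev = acc[i - 1, j]   # vertical step (repeat utterance frame)
            if j > 0:
                # horizontal and diagonal steps
                best_prev = min(best_prev, acc[i, j - 1], acc[i - 1, j - 1])
            acc[i, j] = dist[i, j] + best_prev
    end = int(np.argmin(acc[-1]))       # free end: cheapest last-row cell
    return float(acc[-1, end]), end
```

In a full system, the accumulated-cost row would be scanned for local minima to yield multiple detections, and the depth of each minimum relative to its neighborhood (the "valley depth" used as a second-stage cue) would be recorded alongside the alignment cost.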