Approximate search of audio queries by using DTW with phone time boundary and data augmentation

Dynamic time warping (DTW) is widely used in language-independent query-by-example (QbE) spoken term detection (STD) tasks because of its high performance. However, DTW-based template matching has two limitations: 1) approximate matching of audio queries is not straightforward; 2) DTW is sensitive to mismatched signal conditions between the query and the speech search data. To allow approximate search, we propose a partial template matching strategy that uses phone time boundary information generated by a phone recognizer. To obtain a more invariant representation of audio signals, we use bottleneck features (BNF) as the input to DTW. The BNF network is trained on augmented data, generated by adding reverberation and additive noise to the clean training data. Experimental results on the QUESST 2015 task show the effectiveness of the proposed methods for QbE-STD when both the queries and the search data are distorted by reverberation and noise.
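
The partial template matching idea can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the local distance, the cost normalisation, or which partial templates are formed, so the choices below are assumptions. These are cosine distance, normalisation by query length, and dropping the first or last phone. The function names (`subsequence_dtw`, `partial_queries`, `approximate_score`) and the `min_phones` parameter are hypothetical. The query is a list of feature vectors (for example BNF frames), and `boundaries` holds the frame indices delimiting its phones, as a phone recognizer might produce.

```python
import math


def frame_distance(a, b):
    """Cosine distance between two feature vectors (assumed local distance)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 1.0
    return 1.0 - dot / (na * nb)


def subsequence_dtw(query, search):
    """Lowest DTW cost of aligning the whole query to any contiguous
    region of the search utterance, divided by the query length."""
    n, m = len(query), len(search)
    if n == 0 or m == 0:
        return math.inf
    # First row: the match may start at any search frame at no extra cost.
    prev = [frame_distance(query[0], s) for s in search]
    for i in range(1, n):
        cur = [math.inf] * m
        for j in range(m):
            best = prev[j]  # query advances, search frame repeats
            if j > 0:
                # diagonal step, or search advances while query frame repeats
                best = min(best, prev[j - 1], cur[j - 1])
            cur[j] = frame_distance(query[i], search[j]) + best
        prev = cur
    # Last row: the match may end at any search frame.
    return min(prev) / n


def partial_queries(query, boundaries, min_phones=3):
    """Full query plus variants with the first or last phone removed.

    boundaries: frame indices [0, b1, ..., len(query)] delimiting phones.
    min_phones: hypothetical guard so very short queries are not truncated.
    """
    variants = [query]
    n_phones = len(boundaries) - 1
    if n_phones > min_phones:
        variants.append(query[boundaries[0]:boundaries[-2]])  # drop last phone
        variants.append(query[boundaries[1]:boundaries[-1]])  # drop first phone
    return variants


def approximate_score(query, boundaries, search):
    """Best (lowest) cost over the full query and its partial templates."""
    return min(subsequence_dtw(q, search)
               for q in partial_queries(query, boundaries))
```

With this sketch, a search utterance that contains all but the final phone of a query scores poorly under the full-query match but well under `approximate_score`, because the variant without the last phone aligns cleanly. A real system would also need to normalise scores across variants of different lengths, which is not addressed here.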