Feature learning for efficient ASR-free keyword spotting in low-resource languages

2022 
Abstract: We consider feature learning for a computationally efficient method of keyword spotting that can be applied in severely under-resourced settings. The objective is to support humanitarian relief programmes by the United Nations (UN) in parts of Africa in which almost no language resources are available. To allow a keyword spotting system to be rapidly developed in such a language, we rely on a small and easily compiled set of isolated keywords. Using the isolated keywords as templates, we apply dynamic time warping (DTW) to a much larger corpus of in-domain but untranscribed speech. The resulting DTW alignment scores are used to train a convolutional neural network (CNN) which is orders of magnitude more computationally efficient than DTW and therefore suitable for real-time application. We optimise this ASR-free neural network keyword spotting procedure by identifying acoustic features that provide robust performance in this almost zero-resource setting. First, we consider the benefits of incorporating information from well-resourced but unrelated languages by using a multilingual bottleneck feature (BNF) extractor. Next, we consider using features extracted from an autoencoder (AE) trained on in-domain but untranscribed data. Finally, we consider features obtained from a correspondence autoencoder (CAE) which is initialised with the AE and subsequently fine-tuned on the small set of in-domain labelled data. Experiments in South African English and Luganda, a low-resource language, demonstrate that, on their own, both the BNF and CAE features can achieve a 5% relative performance improvement over baseline MFCCs. However, by using BNFs as input to the CAE, even better performance is achieved, resulting in a more than 27% relative improvement over MFCCs in ROC area-under-the-curve (AUC) and more than twice as many top-10 retrievals.
We also show that, using these features, the CNN-DTW keyword spotter performs almost as well as the DTW keyword spotter while comfortably outperforming a baseline CNN trained only on the keyword templates. We conclude that a CNN-DTW keyword spotter using BNF-derived CAE features represents a computationally efficient approach with very competitive performance that is suited to rapid deployment in a severely under-resourced scenario.
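As an illustration of the DTW step described in the abstract, the following is a minimal sketch of how a keyword template could be scored against an untranscribed utterance to produce a target for the CNN. The cosine frame distance, the length normalisation, the sliding-window search, and the names `dtw_cost`, `best_window_cost` and `hop` are assumptions made for this example; the abstract does not specify these details.

```python
import numpy as np


def dtw_cost(template, segment):
    """Length-normalised DTW alignment cost between two feature sequences.

    template: array of shape (n, d); segment: array of shape (m, d).
    Cosine distance as the frame-level cost is an assumption of this sketch.
    """
    # Frame-level cosine distances between all frame pairs, shape (n, m).
    t = template / np.linalg.norm(template, axis=1, keepdims=True)
    s = segment / np.linalg.norm(segment, axis=1, keepdims=True)
    dist = 1.0 - t @ s.T

    n, m = dist.shape
    # acc[i, j] holds the cheapest cumulative cost of aligning the first
    # i template frames with the first j segment frames.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1]
            )
    # Normalise so that costs are comparable across sequence lengths.
    return acc[n, m] / (n + m)


def best_window_cost(template, utterance, hop=3):
    """Slide the template over an utterance and return the lowest DTW cost.

    The window length (equal to the template length) and the hop size
    are hypothetical choices for illustration.
    """
    n = len(template)
    starts = range(0, max(1, len(utterance) - n + 1), hop)
    return min(dtw_cost(template, utterance[s:s + n]) for s in starts)
```

In a pipeline of the kind the abstract outlines, a score such as `best_window_cost(template, utterance)` would be computed once per keyword and utterance offline, and the CNN would then be trained to predict these scores directly from the utterance features, avoiding the DTW search at run time.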