Code-switched automatic speech recognition in five South African languages

Astik Biswas, Emre Yılmaz, Ewald van der Westhuizen, Febe de Wet, Thomas Niesler

2022

Abstract: Most automatic speech recognition (ASR) systems are optimised for one specific language, and their performance consequently deteriorates drastically when confronted with multilingual or code-switched speech. We describe our efforts to improve an ASR system that can process code-switched South African speech containing English and four indigenous languages: isiZulu, isiXhosa, Sesotho and Setswana. We begin with a newly developed language-balanced corpus of code-switched speech compiled from South African soap operas, which are rich in spontaneous code-switching. The small size of the corpus makes this scenario under-resourced, and hence we explore several ways of addressing this sparsity of data. We consider augmenting the acoustic training sets with in-domain data at the expense of making them unbalanced and dominated by English. We further explore the inclusion of monolingual out-of-domain data in the constituent languages. For language modelling, we investigate the inclusion of out-of-domain text data sources and also the inclusion of synthetically generated code-switch bigrams. In our experiments, we consider two system architectures. The first consists of four bilingual speech recognisers, each allowing code-switching between English and one of the indigenous languages. The second is a single pentalingual speech recogniser able to process switching between all five languages. We find that the additional inclusion of each acoustic and text data source leads to some improvements. While in-domain data is substantially more effective, performance gains are also achieved using out-of-domain data, which is often much easier to obtain. We also find that improvements are achieved in all five languages, even when the training set becomes unbalanced and heavily skewed in favour of English. Finally, we find that TDNN-F architectures for the acoustic model consistently outperform TDNN–BLSTM models in our data-sparse scenario.
