Using Taigi Dramas with Mandarin Chinese Subtitles to Improve Taigi Speech Recognition

2020 
An obvious problem with automatic speech recognition (ASR) for Taigi is that the available training data is far from sufficient to build a practical ASR system. Collecting speech data with reliable transcripts for training the acoustic model (AM) is feasible but expensive. Moreover, text data for language model (LM) training is extremely scarce and difficult to collect because Taigi is primarily a spoken language, not a commonly used written one. Interestingly, the subtitles of Taigi dramas in Taiwan have long been written in Chinese characters for Mandarin. Since a large number of Taigi drama episodes with Mandarin Chinese subtitles are available on YouTube, we propose a method to augment the training data for both the AM and the LM of Taigi ASR. The idea is to use an initial Taigi ASR system to convert each Mandarin Chinese subtitle into the most likely Taigi word sequence by referring to the corresponding speech. Experimental results show that our ASR system can be remarkably improved by such training data augmentation.
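The conversion step can be illustrated with a minimal sketch. It is not the system described here: it assumes that each Mandarin subtitle word has a small set of candidate Taigi words (a hypothetical `candidates` lexicon), that the initial ASR system supplies a log-score for each candidate at each position (`acoustic`), and that a bigram LM supplies transition log-scores (`bigram`). All names and numbers below are toy values chosen for illustration only.

```python
def best_taigi_sequence(subtitle, candidates, acoustic, bigram, floor=-10.0):
    """Viterbi search for the highest-scoring Taigi word sequence.

    subtitle   : list of Mandarin words from one subtitle line
    candidates : dict mapping a Mandarin word to candidate Taigi words (assumed lexicon)
    acoustic   : dict mapping (position, taigi_word) to an acoustic log-score (assumed)
    bigram     : dict mapping (previous_word, taigi_word) to an LM log-score (assumed)
    floor      : log-score used for any pair missing from the tables
    """
    # best[w] = (score, path) of the best partial path ending in Taigi word w
    best = {"<s>": (0.0, [])}
    for i, mandarin_word in enumerate(subtitle):
        new_best = {}
        for cand in candidates[mandarin_word]:
            am = acoustic.get((i, cand), floor)
            # Extend every surviving path with this candidate and keep the best one
            new_best[cand] = max(
                (prev_score + bigram.get((prev, cand), floor) + am, prev_path + [cand])
                for prev, (prev_score, prev_path) in best.items()
            )
        best = new_best
    return max(best.values())[1]


# Toy data: illustrative values only, not taken from any real model or corpus.
subtitle = ["我", "不", "知道"]
candidates = {"我": ["guá"], "不": ["m̄", "bô"], "知道": ["tsai-iánn", "tsai"]}
acoustic = {
    (0, candidates["我"][0]): -1.0,
    (1, candidates["不"][0]): -1.2,
    (1, candidates["不"][1]): -2.5,
    (2, candidates["知道"][0]): -0.8,
    (2, candidates["知道"][1]): -2.0,
}
bigram = {
    ("<s>", candidates["我"][0]): -0.5,
    (candidates["我"][0], candidates["不"][0]): -1.0,
    (candidates["我"][0], candidates["不"][1]): -1.2,
    (candidates["不"][0], candidates["知道"][0]): -0.7,
    (candidates["不"][1], candidates["知道"][0]): -3.0,
    (candidates["不"][0], candidates["知道"][1]): -1.5,
}

result = best_taigi_sequence(subtitle, candidates, acoustic, bigram)
print(" ".join(result))
```

In this sketch the subtitle constrains the search space to a handful of candidates per word, and the speech evidence, stood in for by the `acoustic` scores, picks among them. The selected sequence could then serve as a transcript for AM training and as text for LM training, which is the augmentation the abstract describes.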