Spoken Multiple-Choice Question Answering Using Multi-turn Audio-extracter BERT

2020 
In the spoken multiple-choice question answering (SMCQA) task, given a passage, a question, and multiple choices, all in the form of speech, a machine must pick the correct choice to answer the question. A common strategy is to employ an automatic speech recognition (ASR) system to translate the speech content into automatically transcribed text, thereby reducing the SMCQA task to a classic MCQA task. Under this strategy, bidirectional encoder representations from transformers (BERT) can achieve a certain level of performance despite ASR errors. However, previous studies have shown that acoustic-level statistics can compensate for text inaccuracies caused by ASR systems, thereby improving the performance of an SMCQA system. Accordingly, we concentrate on designing a BERT-based SMCQA framework that not only inherits the advantages of the contextualized language representations learned by BERT but also integrates acoustic-level information with text-level information in a systematic and theoretically grounded way. Considering the temporal characteristics of speech, we first formulate multi-turn audio-extracter hierarchical convolutional neural networks (MA-HCNNs), which encode acoustic-level features under various temporal scopes. Based on MA-HCNNs, we propose a multi-turn audio-extracter BERT-based (MA-BERT) framework for the SMCQA task. A series of experiments demonstrates remarkable improvements in accuracy over selected baselines and state-of-the-art systems on a published Chinese SMCQA dataset.
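To make the described pipeline concrete, below is a minimal PyTorch sketch of the two ideas in the abstract: parallel 1-D convolutions with different kernel widths approximate encoding acoustic features under various temporal scopes (the MA-HCNN idea), and concatenating the pooled acoustic vector with BERT's [CLS] representation stands in for the text-acoustic fusion of MA-BERT. All class names, dimensions, and the concatenation-based fusion scheme are illustrative assumptions, not the paper's published design.

```python
# A hypothetical sketch of multi-scope acoustic encoding fused with BERT.
# Module names, feature dimensions, and the fusion strategy are assumptions
# for illustration; the paper's exact MA-HCNN / MA-BERT design may differ.
import torch
import torch.nn as nn
from transformers import BertModel


class MultiScopeAudioExtractor(nn.Module):
    """Encodes frame-level acoustic features with parallel 1-D convolutions
    of different kernel widths (temporal scopes), then pools over time."""

    def __init__(self, feat_dim=40, hidden_dim=256, scopes=(3, 5, 9)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv1d(feat_dim, hidden_dim, kernel_size=k, padding=k // 2),
                nn.ReLU(),
                nn.Conv1d(hidden_dim, hidden_dim, kernel_size=k, padding=k // 2),
                nn.ReLU(),
            )
            for k in scopes
        )
        self.proj = nn.Linear(hidden_dim * len(scopes), hidden_dim)

    def forward(self, audio):                    # audio: (B, T, feat_dim)
        x = audio.transpose(1, 2)                # -> (B, feat_dim, T)
        pooled = [b(x).mean(dim=2) for b in self.branches]  # each (B, hidden)
        return self.proj(torch.cat(pooled, dim=1))          # (B, hidden)


class AudioEnrichedBertScorer(nn.Module):
    """Scores one (passage, question, choice) triple by concatenating the
    BERT [CLS] vector with the pooled acoustic vector."""

    def __init__(self, bert_name="bert-base-chinese", audio_dim=256):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        self.audio = MultiScopeAudioExtractor(hidden_dim=audio_dim)
        self.scorer = nn.Linear(self.bert.config.hidden_size + audio_dim, 1)

    def forward(self, input_ids, attention_mask, audio_feats):
        cls = self.bert(input_ids=input_ids,
                        attention_mask=attention_mask).pooler_output
        a = self.audio(audio_feats)
        return self.scorer(torch.cat([cls, a], dim=1))  # (B, 1) choice score
```

In such a setup, one score would be computed per candidate choice (each paired with the ASR transcript and its acoustic features), and a softmax over the choices would select the answer.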