Speech and Language Processing for Multimodal Human-Computer Interaction

2004 
In this paper, we describe our recent work at Microsoft Research, in the project codenamed Dr. Who, aimed at the development of enabling technologies for speech-centric multimodal human-computer interaction. In particular, we present in detail MiPad, the first Dr. Who application, which specifically addresses the mobile user interaction scenario. MiPad is a wireless mobile PDA prototype that enables users to accomplish many common tasks using a multimodal spoken language interface and wireless-data technologies. It fully integrates continuous speech recognition and spoken language understanding, and provides a novel solution to the prevailing problem of pecking with tiny styluses or typing on minuscule keyboards in today's PDAs and smart phones. Despite the current incomplete implementation, the user study reported in this paper shows that speech and pen have the potential to significantly improve the user experience. In this system-oriented paper we describe the main components of MiPad, with a focus on robust speech processing and spoken language understanding. The detailed MiPad components discussed include: distributed speech recognition considerations in the design of the speech processing algorithms; a stereo-based speech feature enhancement algorithm used for noise-robust front-end speech processing; Aurora2 evaluation results for this front-end processing; speech feature compression (source coding) and error protection (channel coding) for distributed speech recognition in MiPad; HMM-based acoustic modeling for continuous speech recognition decoding; a unified language model integrating context-free grammar and N-gram models for speech decoding; schema-based knowledge representation for MiPad's personal information management tasks; a unified statistical framework that integrates speech recognition, spoken language understanding, and dialogue management; the robust natural language parser used in MiPad to process the speech recognizer's output; machine-aided grammar learning and development used for spoken language understanding in the MiPad tasks; Tap & Talk multimodal interaction and user interface design; back-channel communication and MiPad's error repair strategy; and finally, user study results that demonstrate the superior throughput achieved by the Tap & Talk multimodal interaction over the existing pen-only PDA interface. These user study results highlight the crucial role played by speech in enhancing the overall user experience in MiPad-like human-computer interaction devices.
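To make the stereo-based feature enhancement component mentioned above more concrete, the sketch below illustrates the general idea in Python: a Gaussian mixture model is fit on noisy cepstral features, per-component correction vectors are learned from time-aligned stereo (noisy, clean) frame pairs, and enhancement adds the posterior-weighted correction to each noisy frame. This is only a minimal illustration of the SPLICE-style approach, assuming synthetic data and scikit-learn's GaussianMixture; it is not the authors' exact algorithm, feature set, or implementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def train_stereo_enhancer(noisy, clean, n_components=8, seed=0):
    """Fit a GMM on noisy features and learn per-component correction
    vectors from time-aligned stereo (noisy, clean) frames.
    noisy, clean: arrays of shape (n_frames, n_dims)."""
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="diag", random_state=seed)
    gmm.fit(noisy)
    post = gmm.predict_proba(noisy)                 # (n_frames, K) responsibilities
    diff = clean - noisy                            # (n_frames, D) clean-minus-noisy
    # Correction per component: responsibility-weighted mean of (clean - noisy).
    weights = post.sum(axis=0, keepdims=True).T     # (K, 1)
    corrections = (post.T @ diff) / np.maximum(weights, 1e-10)   # (K, D)
    return gmm, corrections


def enhance(noisy, gmm, corrections):
    """Estimate clean features by adding the posterior-weighted
    correction to each noisy frame."""
    post = gmm.predict_proba(noisy)                 # (n_frames, K)
    return noisy + post @ corrections               # (n_frames, D)


# Illustrative usage with synthetic stereo data (not real speech).
rng = np.random.default_rng(0)
clean = rng.normal(size=(2000, 13))                 # e.g. 13-dim cepstral frames
noisy = clean + rng.normal(0.5, 0.3, size=clean.shape)   # simulated distortion
gmm, corr = train_stereo_enhancer(noisy, clean)
enhanced = enhance(noisy, gmm, corr)
```

In a distributed-recognition setting of the kind described in the paper, such enhancement would run in the front end before feature compression, so that the server-side recognizer receives features closer to clean-speech statistics.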