Semi-Supervised Acoustic Model Retraining for Medical ASR

2018 
Training models for speech recognition usually requires accurate word-level transcription of the available speech data. In the domain of medical dictations, it is common to have “semi-literal” transcripts instead: large numbers of speech files, each paired with a formatted episode report whose content only partially overlaps with the spoken content of the dictation. We present a semi-supervised method for generating acoustic training data by decoding dictations with an existing recognizer, confirming which sections are correct against the associated report, and repurposing those audio sections to train a new acoustic model. The effectiveness of this method is demonstrated in two applications: first, adapting a model to new speakers, resulting in a 19.7% relative reduction in word errors for those speakers; and second, supplementing an already diverse and robust acoustic model with a large quantity of additional data from known voices, leading to a 5.0% relative error reduction on a large test set of over one thousand speakers.
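
To make the confirm-by-report step concrete, here is a minimal Python sketch, assuming the decoder emits word-level timestamps. The use of difflib.SequenceMatcher for aligning the hypothesis to the report, the norm() text normalizer, and the min_run threshold are illustrative assumptions, not details from the paper.

```python
import re
from difflib import SequenceMatcher
from typing import List, Tuple

Word = Tuple[str, float, float]  # (token, start_sec, end_sec) from the decoder

def norm(token: str) -> str:
    """Strip punctuation and case so the formatted report text is
    comparable with the raw recognizer output."""
    return re.sub(r"[^\w']", "", token).lower()

def confirmed_spans(hyp: List[Word], report: str,
                    min_run: int = 5) -> List[Tuple[float, float]]:
    """Align the decoder hypothesis against the semi-literal report and
    return audio time spans covered by long runs of matching words.
    min_run is a hypothetical threshold, not a value from the paper."""
    hyp_tokens = [norm(w) for w, _, _ in hyp]
    rep_tokens = [norm(w) for w in report.split()]
    matcher = SequenceMatcher(a=hyp_tokens, b=rep_tokens, autojunk=False)
    spans = []
    for block in matcher.get_matching_blocks():
        if block.size >= min_run:
            start = hyp[block.a][1]                 # start of first matched word
            end = hyp[block.a + block.size - 1][2]  # end of last matched word
            spans.append((start, end))
    return spans

if __name__ == "__main__":
    # Toy dictation: the decoder hypothesis matches the report except
    # for a trailing disfluency and navigation command.
    hyp = [("patient", 0.0, 0.4), ("denies", 0.4, 0.8), ("chest", 0.8, 1.1),
           ("pain", 1.1, 1.4), ("and", 1.4, 1.5), ("fever", 1.5, 1.9),
           ("um", 1.9, 2.1), ("next", 2.1, 2.4)]
    report = "Patient denies chest pain and fever."
    print(confirmed_spans(hyp, report))  # -> [(0.0, 1.9)]
```

In practice, the matched spans would then be cut from the audio and paired with the confirmed words to form the new acoustic training set, as the abstract describes.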