USING DATA-DRIVEN SUBWORD UNITS IN LANGUAGE MODEL OF HIGHLY INFLECTIVE SLOVENIAN LANGUAGE

2009 
This paper presents the results of a study on modeling the highly inflective Slovenian language, focusing on a language model for a large-vocabulary speech recognition system. A new data-driven method is proposed for inducing inflectional morphology into language modeling. The research addresses data sparsity, which results from the complex morphology of the language, by using subword units: words are segmented into two subword units, stems and endings, with no prior knowledge of the language. The subword units are designed to fit into the framework of probabilistic language models. Rather than seeking a morphologically correct decomposition of words, the method searches for the decomposition that minimizes the entropy of the training corpus, where the entropy is approximated using N-gram models. Despite some seemingly over-simplified assumptions, the subword models improve the applicability of language models trained on a sparse corpus. The experiments used the VECER newswire text corpus for training; the test set was taken from the SNABI speech database, on which the final models were also evaluated in speech recognition experiments. Two different subword-based models are proposed and examined experimentally. The experiments demonstrate that subword-based models which considerably reduce the OOV rate improve speech recognition WER compared with standard word-based models, even though they increase test-set perplexity. Subword-based models with improved perplexity, but a much smaller reduction in OOV rate, do not improve speech recognition results.
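The core idea — choosing stem/ending split points so as to minimize the entropy of the segmented training corpus — can be sketched as follows. This is a minimal illustration, not the paper's actual algorithm: it uses a unigram entropy approximation rather than the N-gram models described above, a tiny toy vocabulary, and a simple greedy search over per-word split points (all names and the greedy strategy are assumptions for illustration).

```python
import math
from collections import Counter

def unigram_entropy(tokens):
    # H = -sum_w p(w) * log2 p(w) over the unigram distribution of subword tokens.
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def segment_corpus(words, seg):
    # Apply a word -> (stem, ending) map; the ending may be empty.
    out = []
    for w in words:
        stem, ending = seg[w]
        out.append(stem)
        if ending:
            out.append("+" + ending)  # mark endings so they stay distinct from stems
    return out

def greedy_segmentation(words):
    # Start with whole words (empty endings), then repeatedly pick, for each
    # word type, the split point that lowers corpus entropy the most.
    seg = {w: (w, "") for w in set(words)}
    improved = True
    while improved:
        improved = False
        for w in sorted(seg):
            best = unigram_entropy(segment_corpus(words, seg))
            best_split = seg[w]
            for i in range(1, len(w)):
                trial = dict(seg)
                trial[w] = (w[:i], w[i:])
                h = unigram_entropy(segment_corpus(words, trial))
                if h < best - 1e-12:  # accept only strict improvements
                    best, best_split = h, (w[:i], w[i:])
            if best_split != seg[w]:
                seg[w] = best_split
                improved = True
    return seg
```

On a toy corpus of inflected forms such as `["walk", "walked", "walking", "talk", "talked", "talking"]`, splitting off shared endings merges token counts and lowers the unigram entropy relative to the whole-word baseline, which is the effect the data-driven criterion exploits; note that, as the abstract states, the resulting splits need not be morphologically correct.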