Smoothed n-gram based models for tweet language identification: A case study of the Brazilian and European Portuguese national varieties

2017 
Abstract

Identifying the language of a text is an important step for several natural language processing applications. State-of-the-art language identification (LID) systems perform very well when discriminating between unrelated languages on standard datasets. However, LID becomes a bottleneck when discriminating between similar languages or language varieties. LID has also proven very challenging on short texts such as tweets. In this paper, we propose the use of smoothed n-gram language models to classify tweets as either Brazilian or European Portuguese. Word and character n-gram language models were combined and evaluated with five different classifiers, and the smoothed n-gram language models were compared against the Term Frequency–Inverse Document Frequency (TF-IDF) weighting scheme. This paper also proposes an ensemble model, in which the class label outputs were combined using majority voting and algebraic combiners. The best configuration reached an accuracy of 92.71% using an ensemble model that combines Lidstone (0.1) character 6-gram, Good–Turing word unigram, and Witten–Bell word bigram models, together with the Log-Likelihood Ratio estimation method.
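To make the core idea concrete, the sketch below shows a minimal Lidstone-smoothed character n-gram model that assigns a tweet to the variety whose model gives it the highest log-likelihood. This is a simplification of the paper's setup (it scores unconditional n-gram frequencies rather than full conditional language models, uses trigrams instead of 6-grams, and omits the ensemble), and the toy training snippets are invented examples, not the paper's Twitter data.

```python
from collections import Counter
from math import log

def char_ngrams(text, n):
    """Extract overlapping character n-grams from a text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

class LidstoneNgramLM:
    """Character n-gram model with Lidstone (add-gamma) smoothing."""
    def __init__(self, n=3, gamma=0.1):
        self.n = n
        self.gamma = gamma
        self.counts = Counter()
        self.total = 0
        self.vocab = set()

    def train(self, texts):
        for text in texts:
            grams = char_ngrams(text, self.n)
            self.counts.update(grams)
            self.total += len(grams)
            self.vocab.update(grams)

    def log_prob(self, text):
        """Smoothed log-likelihood of a text under this model."""
        v = len(self.vocab) + 1  # +1 reserves probability mass for unseen n-grams
        score = 0.0
        for g in char_ngrams(text, self.n):
            p = (self.counts[g] + self.gamma) / (self.total + self.gamma * v)
            score += log(p)
        return score

def classify(text, models):
    """Assign the label whose model gives the highest log-likelihood."""
    return max(models, key=lambda label: models[label].log_prob(text))

# Toy illustration with invented snippets (not the paper's dataset)
pt_br = ["voce esta legal demais", "a gente vai no onibus"]
pt_pt = ["tu estas fixe pa", "vamos de autocarro rapaz"]
models = {"pt-BR": LidstoneNgramLM(3, 0.1), "pt-PT": LidstoneNgramLM(3, 0.1)}
models["pt-BR"].train(pt_br)
models["pt-PT"].train(pt_pt)
print(classify("voce vai no onibus", models))  # → pt-BR
```

A full reproduction of the paper's best configuration would train several such models (character 6-gram with Lidstone, word unigram with Good–Turing, word bigram with Witten–Bell) and combine their predicted labels by majority voting.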