Language Identification of Intra-Word Code-Switching for Arabic–English

2021 
Abstract
Multilingual speakers tend to mix languages in text and speech, a phenomenon linguists call “code-switching” (CS). Speakers also combine morphemes from different languages within a single word (intra-word CS). User-generated text on social media is informal and contains a considerable amount of CS data of various types, which needs to be analyzed for several linguistic tasks. Language identification (LID) is one of the important tasks for intra-word CS data: it involves segmenting mixed words and tagging each segment with its language ID. This work creates the first annotated Arabic–English (AR–EN) corpus for the intra-word CS LID task, along with a web-based application for data annotation. We implemented two baseline models for AR–EN text, using Naive Bayes and a character-level BiLSTM. Our main model was built with segmental recurrent neural networks (SegRNN), and we investigated the use of different word embeddings with it. The best LID system for tagging the entire dataset was SegRNN alone, which achieved an F1-score of 94.84% and recognized mixed words with an F1-score of 81.15%. SegRNN with FastText embeddings achieved the highest F1-score for tagging mixed words, 81.45%.
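To make the shape of the intra-word LID output concrete, here is a minimal sketch of "segment, then tag each segment". It is not the paper's method (the paper uses SegRNN): it splits purely on Unicode script, which assumes the Arabic part is written in Arabic script and the English part in Latin letters. The function names, the tag labels `ar`/`en`/`other`, and the example word are hypothetical and not taken from the corpus.

```python
from itertools import groupby
from typing import List, Tuple


def script_tag(ch: str) -> str:
    """Toy tag by script: 'ar' for the basic Arabic block, 'en' for ASCII letters."""
    if "\u0600" <= ch <= "\u06FF":
        return "ar"
    if ch.isascii() and ch.isalpha():
        return "en"
    return "other"


def segment_and_tag(word: str) -> List[Tuple[str, str]]:
    """Split a word into maximal same-script runs and tag each run."""
    return [("".join(run), tag) for tag, run in groupby(word, key=script_tag)]


# Hypothetical mixed word: Arabic definite article (alif + lam) attached to "laptop".
mixed = "\u0627\u0644laptop"
print(segment_and_tag(mixed))
```

A script-based split like this cannot handle mixed words written entirely in one script, which is one reason a learned segmental model is used for the actual task.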