Lexical Normalization for Code-switched Data and its Effect on POS-tagging.

Rob van der Goot,Özlem Çetinoğlu

Lexical Normalization for Code-switched Data and its Effect on POS-tagging.

2020

Social media provides an unfiltered stream of user-generated input, leading to creative language use and many interesting linguistic phenomena, which were previously not available so abundantly. However, this language is harder to process automatically. One particularly challenging phenomenon is the use of multiple languages within one utterance, also called Code-Switching (CS). Whereas monolingual social media data already provides many problems for natural language processing, CS adds another challenging dimension. One solution that is commonly used to improve processing of social media data is to translate input texts to standard language first. This normalization has shown to improve performance of many natural language processing tasks. In this paper, we focus on normalization in the context of code-switching. We introduce a variety of models to perform normalization on CS data, and analyse the impact of word-level language identification on normalization. We show that the performance of the proposed normalization models is generally high, but language labels are only slightly informative. We also carry out POS tagging as extrinsic evaluation and show that automatic normalization of the input leads to 3.2% absolute performance increase, whereas gold normalization leads to an increase of 6.8%.

Keywords:

Correction
Source
Cite
Save
Machine Reading By IdeaReader

References

Citations