Standardizing Sinhala Code-Mixed Text using Dictionary based Approach

2020 
Code-mixing is one of the biggest challenges in processing social media text. This paper presents a thorough review of the state of the art in code-mixed text processing and identifies the main challenges in processing Sinhala code-mixed text. The review covers how researchers have approached tasks such as normalization of code-mixed data and word-level language identification of code-mixed text. The study identifies challenges in Sinhala code-mixed text such as phonetic transliteration, borrowing of words, spelling errors, embedded languages, the use of numeric characters in words, and discourse marker switching. Given these challenges, Singlish text needs to be standardized to Sinhala letters, since the same word has many representations. A dictionary is therefore proposed in which Sinhala letters are mapped to Singlish text, which can be used for standardization. Finally, the paper discusses future work planned on using the proposed dictionary for Sinhala code-mixed data analysis.
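The dictionary-based standardization described in the abstract can be illustrated with a minimal sketch. The entries, the function names `build_lookup` and `standardize`, and the whitespace tokenization below are assumptions made for illustration only; they are not taken from the paper, whose actual dictionary and matching procedure are not given here.

```python
# Illustrative sketch only: the entries and helper names are hypothetical,
# not the dictionary or method from the paper.

# Each Sinhala word maps to several Singlish (romanized) spellings of it.
SINHALA_TO_SINGLISH = {
    "මම": ["mama", "mma"],
    "ඔයා": ["oya", "oyaa"],
    "කොහොමද": ["kohomada", "kohomda"],
}


def build_lookup(dictionary):
    """Invert the dictionary so each Singlish variant points to its Sinhala form."""
    lookup = {}
    for sinhala, variants in dictionary.items():
        for variant in variants:
            lookup[variant.lower()] = sinhala
    return lookup


def standardize(text, lookup):
    """Replace known Singlish tokens with Sinhala; leave other tokens unchanged."""
    tokens = text.split()  # assumption: simple whitespace tokenization
    return " ".join(lookup.get(token.lower(), token) for token in tokens)


LOOKUP = build_lookup(SINHALA_TO_SINGLISH)
print(standardize("oyaa kohomda today", LOOKUP))
```

Tokens that are not in the dictionary, such as embedded English words, pass through unchanged in this sketch, so a later language identification step could still handle them.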