Experiments in Sentence Language Identification with Groups of Similar Languages

Ben King, Dragomir R. Radev, Steven P. Abney

2014

Language identification is a simple problem that becomes much more difficult when its usual assumptions are broken. In this paper we consider the task of classifying short segments of text in closely related languages for the Discriminating Similar Languages shared task, which is divided into six subtasks: (A) Bosnian, Croatian, and Serbian; (B) Indonesian and Malay; (C) Czech and Slovak; (D) Brazilian and European Portuguese; (E) Argentinian and Peninsular Spanish; and (F) American and British English. We consider a number of different methods to boost classification performance, such as feature selection and data filtering, but we ultimately find that

Keywords:

Natural language processing
Artificial intelligence
Malay
Croatian
Language identification
Indonesian
Linguistics
British English
Czech
Computer science
Slovak
Sentence
European Portuguese
Serbian
