Tagging a corpus of Malay texts, and coping with 'syntactic drift'.

2003 
The structure of Malay presents the corpus linguist with an extremely interesting problem. At high syntactic levels, the language is familiar enough, and one can talk of direct objects in transitive constructions, and even of agentless passives. The dominant sentence order is SVO. Parsing at this level is therefore relatively straightforward. The problem is at lower levels, where Malay patterns quite differently from Indo-European languages. If the linguist tries to process Malay using categories and techniques designed for Indo-European, then the language comes across as at best confusing and at worst in a state of chaos. Malay is neither confusing nor in chaos, but it does need to be analysed using techniques which are sensitive to its own patterns. Conventional tagging is based on the assumption that grammatical class is static, allowing for some ‘ambiguities’ such as ‘telephone’ as a noun or as a verb. In Malay, grammatical class is dynamic: adjectives can occur as verbs or adverbs and some verbs as adjectives; some verbs can pattern as nouns and others can take on the role of function words. In order to cope with this, we have to make a rigorous distinction between lexical class and the syntactic slots which words fill. Words are given a single class label in the lexicon, and the parser then has to identify cases in which words have drifted away from their default slot. Tagging and parsing can be carried out for English as separate, independent processes; for Malay they have to be treated as at least complementary.
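The division of labour between lexicon and parser described above can be illustrated with a minimal sketch. It is not the authors' implementation: the class labels, the slot names, the default-slot table, and the three-word lexicon are all assumptions chosen for illustration, and the slot assignments are supplied by hand in place of a real parser. The example sentence is "dia berlari cepat" ("he/she runs fast"), in which the adjective "cepat" fills an adverbial slot.

```python
from typing import NamedTuple

# Hypothetical lexicon: every word carries exactly one lexical class label.
LEXICON = {
    "dia": "PRON",
    "berlari": "VERB",
    "cepat": "ADJ",
}

# Assumed default syntactic slot for each lexical class (illustrative only).
DEFAULT_SLOT = {
    "PRON": "argument",
    "NOUN": "argument",
    "VERB": "predicate",
    "ADJ": "noun_modifier",
}


class Drift(NamedTuple):
    word: str
    lexical_class: str
    default_slot: str
    actual_slot: str


def find_drift(parsed):
    """Return the words whose slot in the parse differs from the default
    slot of their lexical class. `parsed` is a list of (word, slot) pairs,
    as a hypothetical parser might produce them."""
    drifted = []
    for word, slot in parsed:
        lexical_class = LEXICON.get(word)
        if lexical_class is None:
            continue  # unknown word: nothing to compare against
        default = DEFAULT_SLOT[lexical_class]
        if slot != default:
            drifted.append(Drift(word, lexical_class, default, slot))
    return drifted


# Slot assignments written by hand, standing in for parser output.
parsed = [
    ("dia", "argument"),
    ("berlari", "predicate"),
    ("cepat", "verb_modifier"),
]
drift = find_drift(parsed)
```

In this sketch the lexicon never changes: "cepat" stays an adjective, and the drift is recorded as a fact about the parse rather than as a second tag for the word, which is why tagging and parsing have to work together.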