Practitioner’s view: A comparison and a survey of lemmatization and morphological tagging in German and Latin

2019 
This paper addresses the challenge of POS tagging and lemmatization in morphologically rich languages, using German and Latin as examples. We start by defining an NLP evaluation roadmap to model the combination of tools and resources guiding our experiments. We focus on the question of what a practitioner can expect when using state-of-the-art solutions. Moreover, we contrast these with old(er) methods and implementations for coarse-grained POS tagging as well as fine-grained (morphological) POS tagging (which also includes tagging of case, number, mood, etc.). We examine to what degree recent efforts in tagger development pay off in improved accuracies, and at what cost in terms of training and processing time. We also conduct in-domain vs. out-of-domain evaluation. Out-of-domain evaluations are particularly insightful because the distribution of the data a user tags will typically differ from the distribution on which the tagger was trained. Furthermore, we compare pipeline tagging with a tagging approach that acknowledges dependencies between inflectional categories. Finally, two lemmatization techniques are evaluated.
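The distinction between coarse-grained and fine-grained tagging accuracy can be illustrated with a minimal sketch. It is not the paper's evaluation code. The tag format (`POS|feature=value|...`), the helper names `accuracy` and `coarse`, and the toy German-style tags are all assumptions made for illustration only.

```python
from typing import Sequence


def accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Token-level accuracy: share of positions where the predicted tag equals the gold tag."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty sequence")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def coarse(tag: str) -> str:
    """Reduce a fine-grained tag to its coarse POS part.

    Assumes a hypothetical 'POS|feat=val|...' layout; real tagsets differ.
    """
    return tag.split("|", 1)[0]


# Hypothetical gold and predicted fine-grained tags for five tokens.
gold = [
    "ART|case=nom|num=sg",
    "NN|case=nom|num=sg",
    "VVFIN|num=sg|mood=ind",
    "ART|case=acc|num=sg",
    "NN|case=acc|num=sg",
]
pred = [
    "ART|case=nom|num=sg",
    "NN|case=acc|num=sg",  # correct POS, wrong case
    "VVFIN|num=sg|mood=ind",
    "ART|case=acc|num=sg",
    "NN|case=acc|num=sg",
]

# Fine-grained accuracy requires the full tag to match.
fine_acc = accuracy(gold, pred)
# Coarse-grained accuracy compares only the POS part.
coarse_acc = accuracy([coarse(t) for t in gold], [coarse(t) for t in pred])

print(f"fine-grained accuracy:   {fine_acc:.2f}")
print(f"coarse-grained accuracy: {coarse_acc:.2f}")
```

In this toy case a single case error lowers the fine-grained score while leaving the coarse-grained score untouched. Running the same two measurements on a test set from the training domain and on one from a different domain would give an in-domain vs. out-of-domain comparison of the kind the abstract describes.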