Restoration of Arabic diacritics using dynamic programming

2013 
Arabic script can be written with or without diacritics. In everyday use (e.g. Arabic newspapers), text is normally written without them. When diacritics are present, the script provides enough information to determine the correct pronunciation and meaning of each word. Assigning the correct diacritics to Arabic words is a complex task that involves morphological, syntactic, and semantic processing. The goal of this research is to develop an automatic system that assigns diacritics to Arabic words. The presented technique is purely statistical and depends only on an Arabic corpus annotated with diacritics. In this paper, we present an algorithm that restores Arabic diacritics using dynamic programming. Candidate diacritized word sequences are scored with a statistical n-gram language model, and a dynamic programming algorithm then searches for the most likely sequence. When case endings (i.e. the diacritic mark on the last letter of a word) are ignored, preliminary results on a public-domain corpus show that the algorithm can produce good results.
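The search described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a bigram model with add-one smoothing (the abstract only says "n-gram"), a Viterbi-style dynamic program over the candidate diacritized forms of each word, and hypothetical helper names (`strip_diacritics`, `train`, `log_prob`, `restore`). The three-word toy corpus is invented purely for illustration.

```python
import math
from collections import Counter, defaultdict

# Arabic diacritic marks U+064B (fathatan) through U+0652 (sukun).
DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)}

def strip_diacritics(word):
    """Return the word with all diacritic marks removed."""
    return "".join(ch for ch in word if ch not in DIACRITICS)

def train(sentences):
    """Collect candidate forms and unigram/bigram counts from diacritized text."""
    candidates = defaultdict(set)          # bare form -> diacritized forms seen
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        prev = "<s>"                       # sentence-start symbol
        unigrams[prev] += 1
        for w in sent:
            candidates[strip_diacritics(w)].add(w)
            bigrams[(prev, w)] += 1
            unigrams[w] += 1
            prev = w
    return candidates, unigrams, bigrams

def log_prob(prev, word, unigrams, bigrams):
    """Add-one smoothed bigram log-probability (an assumed smoothing choice)."""
    vocab = len(unigrams)
    return math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab))

def restore(bare_words, candidates, unigrams, bigrams):
    """Viterbi-style search for the most likely diacritized sequence."""
    # best[c] = (score, path) of the best sequence ending in candidate c
    best = {"<s>": (0.0, [])}
    for bare in bare_words:
        # A word never seen in training is left undiacritized.
        options = sorted(candidates.get(bare, {bare}))
        new_best = {}
        for cand in options:
            score, path = max(
                ((s + log_prob(p, cand, unigrams, bigrams), pth)
                 for p, (s, pth) in best.items()),
                key=lambda t: t[0],
            )
            new_best[cand] = (score, path + [cand])
        best = new_best
    return max(best.values(), key=lambda t: t[0])[1]

# Toy data, written with escapes so the diacritics are explicit.
KATABA = "\u0643\u064e\u062a\u064e\u0628\u064e"   # kataba, "he wrote"
KUTUB = "\u0643\u064f\u062a\u064f\u0628"          # kutub, "books" (no case ending)
DARASA = "\u062f\u064e\u0631\u064e\u0633\u064e"   # darasa, "he studied"
ALWALAD = "\u0627\u0644\u0648\u064e\u0644\u064e\u062f"  # al-walad, "the boy"

corpus = [[KATABA, ALWALAD], [DARASA, KUTUB]]
model = train(corpus)
bare_ktb = strip_diacritics(KATABA)               # ambiguous: kataba or kutub
restored = restore([strip_diacritics(DARASA), bare_ktb], *model)
```

In this toy setup the bare form shared by `KATABA` and `KUTUB` is resolved by its left context. Each step keeps only the best path per candidate, so the search cost grows linearly with sentence length rather than with the number of possible sequences.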