Recognition of Degraded Bangla Documents Using Hybrid Deep Neural Network Model

2021 
Digitization of degraded documents by Optical Character Recognition (OCR) is an active research area in document analysis. Digitization makes it possible to edit documents electronically, to perform content-based searching, and to store them for easier document management. Considering their popularity and heritage value, this work takes degraded printed Bangla documents as the source material to be digitized. The well-known ISIDDI database of degraded Bangla documents is used in the present research. This database contains 535 images of printed Bengali pages. The pages vary in font, size, and format, show different levels of degradation, and were collected from various sources. Character samples were extracted from these page images, and 336 character classes were identified for classification. In this work we develop a CNN-XGBoost hybrid model for better classification. The CNN extracts features from the character images automatically, and XGBoost performs the classification and recognition. The resulting classification accuracy is 91.86%, which outperforms the accuracies of the classifiers applied so far to the ISIDDI dataset.
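The division of labour in the hybrid model, convolutional feature extraction followed by a boosted-tree classifier, can be illustrated with a small standard-library sketch. Everything in it is an assumption made for illustration and not the authors' implementation: the two fixed kernels stand in for learned CNN filters, the toy bar glyphs with random pixel flips stand in for degraded characters (not ISIDDI data), and the boosted decision stumps are a simplified stand-in for the XGBoost library, borrowing only its second-order leaf weight w = -G / (H + lambda).

```python
import math
import random

# Two fixed 3x3 kernels standing in for learned CNN filters (assumption:
# a real CNN learns many such filters by backpropagation).
KERNELS = [
    [[-1, 2, -1], [-1, 2, -1], [-1, 2, -1]],   # responds to vertical strokes
    [[-1, -1, -1], [2, 2, 2], [-1, -1, -1]],   # responds to horizontal strokes
]

def conv_features(img):
    """Valid 3x3 convolution, ReLU, then global max pooling per kernel."""
    n = len(img)
    feats = []
    for k in KERNELS:
        best = 0.0  # ReLU floor: negative responses are clipped to zero
        for r in range(n - 2):
            for c in range(n - 2):
                s = sum(k[i][j] * img[r + i][c + j]
                        for i in range(3) for j in range(3))
                best = max(best, s)
        feats.append(best)
    return feats

def make_sample(label, rng, size=8, noise=0.05):
    """Hypothetical toy glyph: vertical bar (0) or horizontal bar (1).
    Random pixel flips imitate degradation."""
    img = [[0.0] * size for _ in range(size)]
    pos = rng.randrange(1, size - 1)
    for t in range(size):
        if label == 0:
            img[t][pos] = 1.0
        else:
            img[pos][t] = 1.0
    for r in range(size):
        for c in range(size):
            if rng.random() < noise:
                img[r][c] = 1.0 - img[r][c]
    return img

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_stump(X, g, h, lam=1.0):
    """Choose the split with the largest second-order gain.
    Leaf weights follow w = -G / (H + lam)."""
    G, H = sum(g), sum(h)
    best = None
    for f in range(len(X[0])):
        for thr in sorted(set(x[f] for x in X)):
            GL = sum(gi for x, gi in zip(X, g) if x[f] <= thr)
            HL = sum(hi for x, hi in zip(X, h) if x[f] <= thr)
            GR, HR = G - GL, H - HL
            gain = GL * GL / (HL + lam) + GR * GR / (HR + lam)
            if best is None or gain > best[0]:
                best = (gain, f, thr, -GL / (HL + lam), -GR / (HR + lam))
    return best[1:]

def fit_boosted(X, y, rounds=20, lr=0.5):
    """Boost decision stumps on the logistic loss."""
    stumps, scores = [], [0.0] * len(X)
    for _ in range(rounds):
        p = [sigmoid(s) for s in scores]
        g = [pi - yi for pi, yi in zip(p, y)]   # first-order gradients
        h = [pi * (1.0 - pi) for pi in p]       # second-order gradients
        f, thr, wl, wr = fit_stump(X, g, h)
        stumps.append((f, thr, lr * wl, lr * wr))
        scores = [s + (lr * wl if x[f] <= thr else lr * wr)
                  for s, x in zip(scores, X)]
    return stumps

def predict(stumps, x):
    score = sum(wl if x[f] <= thr else wr for f, thr, wl, wr in stumps)
    return 1 if score > 0 else 0

def run_demo(seed=0, n_train=200, n_test=100):
    """Train on convolutional features and return held-out accuracy."""
    rng = random.Random(seed)
    def dataset(n):
        ys = [rng.randrange(2) for _ in range(n)]
        return [conv_features(make_sample(y, rng)) for y in ys], ys
    X_train, y_train = dataset(n_train)
    X_test, y_test = dataset(n_test)
    model = fit_boosted(X_train, y_train)
    hits = sum(predict(model, x) == y for x, y in zip(X_test, y_test))
    return hits / n_test

if __name__ == "__main__":
    print("toy held-out accuracy:", run_demo())
```

The point of the sketch is the interface between the two stages: the classifier never sees pixels, only the pooled filter responses. In a full pipeline of the kind the abstract describes, the feature vector would presumably come from a trained CNN and the classifier would handle all 336 classes.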