Segmentation of Text-Lines and Words from JPEG Compressed Printed Text Documents Using DCT Coefficients

2020 
Segmenting a document image into text-lines and words has applications in many areas of Document Image Analysis (DIA), such as OCR, word spotting, and document retrieval. However, performing segmentation directly on compressed document images remains a largely unexplored and challenging research area. Since JPEG is the most widely accepted compression algorithm, this paper attempts to segment a JPEG compressed printed text document image into text-lines and words without fully decompressing the image. During JPEG compression, the non-overlapping 8×8 DCT blocks can encode the text contents of two adjacent text-lines or words together, leaving no visible clue for segmentation. This paper proposes two-stage algorithms for segmenting text-lines and words. In the first stage, approximate text-line and word boundaries are located by analyzing the DC coefficients. In the second stage, the AC coefficients of selected DCT blocks are used to extract the exact line and word boundaries. Experimental results on a JPEG compressed document data set (with variable spacing between lines and words, and different font sizes and styles) show good computational performance.
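The first stage can be illustrated with a minimal sketch. It is not the paper's algorithm: the function names, the `background_dc` and `tol` parameters, and the rule of grouping consecutive block-rows are all assumptions made for illustration. It also assumes the DC values have already been entropy-decoded and dequantized into a grid with one value per 8×8 block.

```python
from typing import List, Optional, Tuple

BLOCK = 8  # JPEG block size in pixels


def dc_coefficient(block: List[List[int]]) -> float:
    """DC term of the orthonormal 8x8 DCT-II after JPEG's level shift of 128."""
    total = sum(pixel - 128 for row in block for pixel in row)
    return total / BLOCK  # equals 8 * mean of the level-shifted block


def approximate_line_bands(
    dc_grid: List[List[float]], background_dc: float, tol: float = 1.0
) -> List[Tuple[int, int]]:
    """Group consecutive block-rows that hold any non-background DC value.

    Hypothetical helper: returns inclusive (first_block_row, last_block_row)
    pairs that only approximate text-line positions.
    """
    bands: List[Tuple[int, int]] = []
    start: Optional[int] = None
    for r, row in enumerate(dc_grid):
        has_text = any(abs(dc - background_dc) > tol for dc in row)
        if has_text and start is None:
            start = r  # a band of text block-rows begins here
        elif not has_text and start is not None:
            bands.append((start, r - 1))  # band ended on the previous row
            start = None
    if start is not None:
        bands.append((start, len(dc_grid) - 1))
    return bands


white = dc_coefficient([[255] * BLOCK for _ in range(BLOCK)])
grid = [
    [white, white, white],
    [white, 500.0, white],
    [white, 400.0, 300.0],
    [white, white, white],
    [900.0, white, white],
]
bands = approximate_line_bands(grid, background_dc=white)
```

Each band is accurate only to a multiple of 8 pixels, and two closely spaced text-lines that share a block-row would merge into one band. The exact boundaries would have to come from the AC coefficients of the blocks at the band edges, which is the role the abstract gives to the second stage.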