Abstract

This study proposes a novel system that extracts text lines from complex document images and restores the corresponding text-removed background. The target images mix and overlap text, graphics and pictures, and may contain text lines with different illumination levels, sizes and font styles. The proposed system first decomposes the document image into distinct object planes to separate homogeneous objects, including textual regions of interest, non-text objects (such as graphics and pictures) and background textures. A knowledge-based text extraction and identification method then accurately detects and extracts text lines with different characteristics from each object plane. Afterwards, a computationally efficient text removal and inpainting process, based on an adaptive inpainting neighborhood adjustment scheme, is applied to the extracted text-line regions to produce a clear, text-free restored background image. Experimental and comparative results demonstrate that the proposed system accurately extracts text-line regions of interest with diverse illumination levels, sizes and font styles from various complex compound document images, and that it efficiently produces clear and well-preserved non-text background images with satisfactory visual quality for further applications.
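The pipeline outlined above (plane decomposition, text-region selection, inpainting with an adjustable neighborhood) can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and not the method of the paper: the function names `split_into_planes` and `inpaint` are hypothetical, gray-level quantization stands in for the object-plane decomposition, the darkest plane is simply taken to be text in place of the knowledge-based identification step, and a mean filter whose window widens until enough known pixels are found stands in for the adaptive neighborhood adjustment scheme.

```python
from typing import List

Image = List[List[int]]   # grayscale values in 0..255
Mask = List[List[bool]]   # True marks a pixel belonging to a plane or to text


def split_into_planes(img: Image, levels: int = 4) -> List[Mask]:
    """Quantize gray levels into `levels` bins and return one binary plane per bin.

    Assumption: a crude stand-in for the object-plane decomposition.
    """
    step = 256 // levels
    planes = [[[False] * len(row) for row in img] for _ in range(levels)]
    for y, row in enumerate(img):
        for x, value in enumerate(row):
            planes[min(value // step, levels - 1)][y][x] = True
    return planes


def inpaint(img: Image, mask: Mask, min_known: int = 3, max_radius: int = 8) -> Image:
    """Fill masked pixels with the mean of nearby unmasked pixels.

    Assumption: the window radius grows per pixel until at least `min_known`
    unmasked pixels are inside it (or `max_radius` is reached). This is a
    simplified illustration of an adaptive neighborhood, not the paper's scheme.
    """
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(h):
        for x in range(w):
            if not mask[y][x]:
                continue
            known: List[int] = []
            for r in range(1, max_radius + 1):
                known = [
                    img[j][i]
                    for j in range(max(0, y - r), min(h, y + r + 1))
                    for i in range(max(0, x - r), min(w, x + r + 1))
                    if not mask[j][i]
                ]
                if len(known) >= min_known:
                    break
            if known:  # leave the pixel unchanged if nothing usable was found
                out[y][x] = round(sum(known) / len(known))
    return out


# Toy input: a light background with a short dark "text" stroke.
page = [[200] * 5 for _ in range(5)]
for x in (1, 2, 3):
    page[2][x] = 20

text_mask = split_into_planes(page)[0]   # assumption: darkest plane holds the text
restored = inpaint(page, text_mask)
```

On this toy input the stroke pixels fall into the darkest plane and are replaced by the surrounding background value. A real document would need connected-component analysis and the identification rules described in the paper to decide which plane regions are text.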

  • Publication date: 2012-1