Abstract

There is strong interest in digitizing handwritten historical documents in order to preserve the cultural heritage of nations. In general, these manuscript images pose segmentation difficulties that non-historical documents do not. The problems stem from features such as paper aging, faded ink, back-to-front ink superposition, and variable line skew, among others. This paper presents a methodology for detecting and extracting the text lines in images of complex handwritten historical documents. The proposed line segmentation algorithm computes a binary transition map of the document and then extracts and refines the corresponding line regions through skeletonization. To improve the accuracy of line segmentation, a new graph-based splitting method is introduced to separate touching lines. Once the text lines have been segmented, we propose an algorithm based on mathematical morphology operators and position heuristics to extract the component words of each text line. The robustness and accuracy of our approach were tested on digitized pages from two complex historical document datasets: the correspondence of Nabuco and the family papers of Graham Bell. We also compared our algorithms favorably with other general line and word segmentation algorithms presented at the ICDAR 2007 Handwriting Segmentation Contest.
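
As a rough illustration of the word extraction idea (morphology plus position heuristics), the following minimal sketch dilates the column-wise ink profile of one already segmented, binarized text line so that small gaps between letters are bridged while larger gaps between words remain. This is not the authors' algorithm: the function names `dilate_1d` and `word_spans`, the `radius` parameter, and the list-of-rows image representation are all assumptions made for the example.

```python
from typing import List, Tuple


def dilate_1d(profile: List[bool], radius: int) -> List[bool]:
    """Binary dilation of a 1-D profile with a flat element of width 2*radius + 1."""
    n = len(profile)
    return [any(profile[max(0, i - radius):min(n, i + radius + 1)]) for i in range(n)]


def word_spans(line: List[List[int]], radius: int) -> List[Tuple[int, int]]:
    """Return (start, end) column ranges, end exclusive, of candidate words.

    `line` is a binary text-line image given as a list of rows (1 = ink).
    `radius` is a hypothetical tuning parameter: gaps of up to 2*radius
    empty columns are treated as intra-word gaps and bridged.
    """
    # A column counts as inked if any row has a foreground pixel in it.
    ink = [any(col) for col in zip(*line)]
    # Dilation merges nearby inked columns into a single run.
    merged = dilate_1d(ink, radius)

    # Collect maximal runs of "on" columns in the dilated profile.
    runs = []
    start = None
    for x, on in enumerate(merged):
        if on and start is None:
            start = x
        elif not on and start is not None:
            runs.append((start, x))
            start = None
    if start is not None:
        runs.append((start, len(merged)))

    # Trim each run back to the true ink extent, undoing the dilation padding.
    spans = []
    for s, e in runs:
        cols = [x for x in range(s, e) if ink[x]]
        spans.append((cols[0], cols[-1] + 1))
    return spans
```

For example, on a one-row line `[1, 1, 0, 1, 0, 0, 0, 0, 1, 1]` with `radius=1`, the single-column gap is bridged and the four-column gap is kept, giving two candidate words. In practice the radius would have to be tuned to the handwriting, for instance from the distribution of gap widths on the line.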

  • Publication date: 2011