Arabic document layout analysis

Authors: Hesham Amany M*; Rashwan Mohsen A A; Al Barhamtoshy Hassanin M; Abdou Sherif M; Badr Amr A; Farag Ibrahim
Source: Pattern Analysis and Applications, 2017, 20(4): 1275-1287.
DOI: 10.1007/s10044-017-0595-x

Abstract

Document layout analysis is a key step in converting document images into text. Arabic script is cursive and written in different styles, which poses challenges for the analysis of Arabic text documents. In this paper, we introduce an approach to Arabic document layout analysis. In this approach, the document is segmented into a set of zones using morphological operations. The segmented zones are classified as text or non-text using a support vector machine classifier. The features used in zone classification are a combination of texture-based and connected-component-based features. The size of the texture-based feature vector is reduced using a genetic algorithm. Classified text zones are clustered into lines using the adaptive sample set clustering algorithm. Each line is then segmented into words by clustering inter- and intra-word spaces. The proposed system was evaluated against two other systems that represent the best available tools for Arabic document analysis, and the evaluation results show that the proposed system works well on multi-font and multi-size documents with a variety of layouts, even on some historical documents.
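
The word-segmentation step can be illustrated with a minimal sketch. The abstract does not say which clustering method is applied to the spaces, so the two-cluster 1-D k-means below is an assumption, and the names `split_gaps` and `group_words` are hypothetical. Each connected component on a line is assumed to be given as a horizontal span `(x_start, x_end)` in pixels; gaps in the wider cluster are treated as word boundaries.

```python
def split_gaps(gaps, iters=50):
    """Cluster 1-D gap widths into two groups with a simple 2-means.

    Returns a threshold: gaps strictly above it are treated as
    inter-word spaces, the rest as intra-word spaces.
    (Assumed stand-in; the paper's clustering method is not given here.)
    """
    if not gaps:
        raise ValueError("need at least one gap")
    lo, hi = float(min(gaps)), float(max(gaps))
    if lo == hi:
        return hi  # all gaps equal: nothing exceeds the threshold
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        small = [g for g in gaps if g <= mid]
        large = [g for g in gaps if g > mid]
        new_lo = sum(small) / len(small)
        new_hi = sum(large) / len(large)
        if new_lo == lo and new_hi == hi:
            break  # centroids stopped moving
        lo, hi = new_lo, new_hi
    return (lo + hi) / 2.0


def group_words(spans):
    """Group horizontal component spans of one text line into words."""
    spans = sorted(spans)
    if not spans:
        return []
    # Gap between each component and the next; overlaps count as 0.
    gaps = [max(0, b[0] - a[1]) for a, b in zip(spans, spans[1:])]
    if not gaps:
        return [spans]
    threshold = split_gaps(gaps)
    words = [[spans[0]]]
    for span, gap in zip(spans[1:], gaps):
        if gap > threshold:
            words.append([span])    # wide gap: start a new word
        else:
            words[-1].append(span)  # narrow gap: same word
    return words


line = [(0, 10), (12, 20), (40, 50), (52, 60), (61, 70)]
words = group_words(line)
```

For the sample line, the gaps are 2, 20, 2, and 1 pixels, so the single wide gap splits the components into two words. The grouping works on horizontal positions only and does not depend on reading direction.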

  • Publication date: 2017-11