A Novel Approach to Printed Arabic Optical Character Recognition

Mansoor Alghamdi

School of Computer Science & Engineering

Research areas

School of Computer Science and Electronic Engineering

Abstract

Optical character recognition (OCR) is essential in various real-world applications, such as digitizing learning resources to assist visually impaired people and transforming printed resources into electronic media. Considering the Arabic language, the need to extend digital Arabic content on the Internet has motivated more research on Arabic text recognition. However, Arabic OCR still poses significant challenges, owing to the special characteristics of Arabic script. This research aims to develop an effective printed Arabic OCR system.
Performance evaluation is an essential part of OCR system development. However, research on Arabic OCR lacks proper performance evaluation metrics, readily available evaluation tools and effective evaluations of current OCR systems. Thus, this work proposes a standard protocol with an automated evaluation tool, which has a new set of metrics, for measuring the effectiveness of Arabic OCR systems. In addition, the effectiveness of state-of-the-art printed Arabic text recognition systems has been experimentally evaluated.
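As background for the kind of measurement such an evaluation tool automates, the following is a minimal sketch of a conventional character-accuracy score based on Levenshtein edit distance. It is not the new metric set proposed in the thesis, which the abstract does not specify; the function names `edit_distance` and `character_accuracy` are illustrative assumptions.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted over Unicode code points."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r from the reference
                            curr[j - 1] + 1,      # insert h into the reference
                            prev[j - 1] + cost))  # substitute r with h (or match)
        prev = curr
    return prev[-1]


def character_accuracy(ref: str, hyp: str) -> float:
    """1 - (edit distance / reference length), clamped at 0 (hypothetical helper)."""
    if not ref:
        return 1.0 if not hyp else 0.0
    return max(0.0, 1.0 - edit_distance(ref, hyp) / len(ref))


if __name__ == "__main__":
    ground_truth = "كتاب"   # four code points
    ocr_output = "كتب"      # one character missing
    print(character_accuracy(ground_truth, ocr_output))
```

Counting over code points is a simplification: Arabic text with diacritics or presentation forms would normally be normalised before comparison.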
In this work, we describe the implementation of a printed Arabic OCR system. The implementation of this system is divided into five stages: pre-processing, feature extraction, character segmentation, classification and post-processing. Unlike other typical Arabic OCR systems, the developed system performs the feature extraction stage prior to the character segmentation stage.
In the pre-processing stage, a novel thinning algorithm is developed in order to produce skeletons for Arabic text images. An experiment is conducted to evaluate the performance of the new algorithm against other well-established thinning algorithms with respect to several new performance metrics. In all performance tests, the new algorithm produces the best results. In the second stage, a new chain-code representation technique using an agent-based model for extracting features from non-dotted Arabic text images has been proposed. The feature extraction method achieved an accuracy of 98.1%. Based on the extracted features, a character segmentation technique for segmenting connected Arabic words into characters was developed. The character segmentation technique produced a recall of 84.2% and a precision of 77.3%. In the classification stage, the Prediction by Partial Matching (PPM) compression-based method is applied as a classifier to recognise Arabic text. Experimental evaluation on a public dataset reveals that the system has an accuracy of 77.3% for paragraph-based text images. In the final stage, a post-processing technique based on a PPM model is applied for correcting the OCR output. By applying the post-processing method, the recognition accuracy improves to 86.9%. The experimental results show that the system substantially improves upon the state-of-the-art when compared with four well-known Arabic OCR systems using the automated evaluation tool.
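To illustrate how recall and precision figures for character segmentation can be computed, the sketch below scores predicted cut positions against ground-truth cut positions. The abstract does not state the matching rule used in the thesis, so the greedy nearest-match strategy, the `tolerance` parameter (in pixels) and the name `segmentation_scores` are all assumptions made for illustration.

```python
def segmentation_scores(predicted, truth, tolerance=2):
    """Return (precision, recall) for predicted segmentation points.

    A predicted cut counts as correct when it lies within `tolerance`
    pixels of a ground-truth cut that has not already been matched.
    """
    remaining = sorted(truth)
    matched = 0
    for p in sorted(predicted):
        # Nearest ground-truth cut that is still unmatched, if any.
        best = min(remaining, key=lambda t: abs(t - p), default=None)
        if best is not None and abs(best - p) <= tolerance:
            matched += 1
            remaining.remove(best)  # each true cut can be matched only once
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(truth) if truth else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical column positions of cuts within one word image.
    predicted_cuts = [10, 21, 35, 50]
    true_cuts = [10, 20, 40]
    print(segmentation_scores(predicted_cuts, true_cuts))
```

Under this scheme, over-segmentation (extra cuts) lowers precision while missed cuts lower recall, which matches the usual reading of a recall that is higher than precision.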

Details

Original language	English
Awarding Institution	Bangor University
Supervisors/Advisors	William Teahan (Supervisor)
Award date	25 Sept 2019

Research outputs (1)

Published
Visualisation Data Modelling Graphics (VDMG) at Bangor
Research output: Contribution to conference › Paper › peer-review
