Adaptive models of Arabic text

  • Khaled Alhawiti

    Research areas

  • PhD, School of Computer Sciences

Abstract

The main aim of this thesis is to build adaptive language models of Arabic
text that achieve better compression performance than existing models.
Prediction by partial matching (PPM) language models have been the
best-performing adaptive language models over the past three decades in
terms of compression performance. To achieve such performance for Arabic
text, the rich morphology of the Arabic language must be taken into
consideration.
In this thesis, two new Arabic language resources are introduced for
understanding the nature of the Arabic language and standardizing
experiments on Arabic text. The first is a new corpus, the Bangor Arabic
Compression Corpus (BACC), intended to standardize compression experiments
and to serve as a benchmark corpus for future compression experiments on
Arabic text. The second is a new corpus, the Bangor Balanced Corpus of
Contemporary Arabic (BBCCA), whose purpose is to mirror the balanced
corpora available for the English language (Brown and LOB), but for Arabic.
Two new adaptive models, BS-PPM and CS-PPM, based on the Prediction by
Partial Matching (PPM) compression scheme, are then introduced to improve
the compression performance of the standard PPM model by using
preprocessing techniques. The first model works by replacing the most
frequent bigraphs with unique characters, and the second works by
separating the encoding of the text being processed into two streams,
named the vocabulary stream and the symbols stream. Both models achieve
excellent compression results with significant improvements over standard
PPM.
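The bigraph-substitution idea can be illustrated with a minimal Python sketch. It is not the BS-PPM implementation from the thesis: the function names, the number of bigraphs, and the use of Unicode private-use code points as the "unique characters" are all assumptions made for illustration, and the PPM coder that would consume the substituted text is not shown.

```python
from collections import Counter


def build_bigraph_table(text, n_bigraphs, first_code=0xE000):
    """Map the n most frequent bigraphs to unique replacement characters.

    Assumption: code points from U+E000 upward (a private-use area) do not
    occur in the input, so they are safe to use as substitutes.
    """
    # Count every overlapping pair of adjacent characters.
    counts = Counter(text[i:i + 2] for i in range(len(text) - 1))
    table = {}
    for k, (bigraph, _) in enumerate(counts.most_common(n_bigraphs)):
        table[bigraph] = chr(first_code + k)
    return table


def substitute(text, table):
    """Greedy left-to-right replacement of listed bigraphs."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in table:
            out.append(table[pair])
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


def restore(encoded, table):
    """Invert the substitution, so the preprocessing step is lossless."""
    inverse = {symbol: bigraph for bigraph, symbol in table.items()}
    return "".join(inverse.get(ch, ch) for ch in encoded)
```

In this sketch the substituted text is shorter than the original and would then be passed to a standard compressor; the table has to be available to the decoder so that `restore` can undo the substitution.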
A further novel approach adapted especially to the characteristics of
Arabic text, comprising lossless dotted and lossy non-dotted variants of
PPM, is then introduced to improve compression performance over standard
PPM by exploiting the fact that Arabic was historically written without
dots. This method also achieves excellent compression results.
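As a rough illustration of the lossy non-dotted idea, the sketch below strips dots by mapping each dotted Arabic letter to a dotless base shape. The mapping is an assumption made for illustration and is not taken from the thesis; in particular it ignores positional letter forms, and the lossless variant, which would also have to record the removed dot information, is not shown.

```python
# Hypothetical dotted -> dotless mapping (illustrative, not the thesis's table).
UNDOT = {
    "\u0628": "\u066E",  # beh  -> dotless beh
    "\u062A": "\u066E",  # teh  -> dotless beh
    "\u062B": "\u066E",  # theh -> dotless beh
    "\u062C": "\u062D",  # jeem -> hah
    "\u062E": "\u062D",  # khah -> hah
    "\u0630": "\u062F",  # thal -> dal
    "\u0632": "\u0631",  # zain -> reh
    "\u0634": "\u0633",  # sheen -> seen
    "\u0636": "\u0635",  # dad  -> sad
    "\u0638": "\u0637",  # zah  -> tah
    "\u063A": "\u0639",  # ghain -> ain
    "\u0641": "\u06A1",  # feh  -> dotless feh
    "\u0642": "\u066F",  # qaf  -> dotless qaf
    "\u0646": "\u06BA",  # noon -> noon ghunna (dotless noon shape)
    "\u064A": "\u0649",  # yeh  -> alef maksura (dotless yeh shape)
    "\u0629": "\u0647",  # teh marbuta -> heh
}

_UNDOT_TABLE = str.maketrans(UNDOT)


def undot(text):
    """Return the text with dotted letters replaced by dotless shapes (lossy)."""
    return text.translate(_UNDOT_TABLE)


def alphabet_size(text):
    """Number of distinct symbols a character-level model has to predict."""
    return len(set(text))
```

Because several dotted letters collapse onto one shape, `alphabet_size(undot(text))` is never larger than `alphabet_size(text)`, which is one intuition for why a character-level model might compress the non-dotted text more tightly.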
We have also investigated applications of PPM models to the problems of
authorship attribution, word segmentation, and correction of OCR output
for Arabic text; these applications demonstrate excellent results using
PPM.

Details

Original language: English
Awarding Institution
Supervisors/Advisors
Award date: Jan 2014