Grammar-based preprocessing for PPM compression and classification

Electronic versions

Documents

  • Nojood Obaidallah M Aljehane

    Research areas

  • PhD, Computer Science - dissertation

Abstract

The aim of this study is to investigate the efficiency of novel methods using context-free grammars and Prediction by Partial Matching (PPM) in order to build and evaluate the quality of compression models for text files such as English, Arabic, Persian,Welsh and Chinese. A further aim is then to apply these models to the problem of the classification of text to see how well they perform at this application. We apply grammar-based pre-processing prior to using the PPM compression algorithm. The methods achieve significantly better compression for different natural language texts compared with other well-known compression methods. Our method first generates a grammar based on the most common two-character sequences (bigraphs) or three-character sequences (trigraphs) in the text being compressed and then substitutes these sequences using the respective non-terminal symbols defined by the grammar in a pre-processing phase prior to compression. We describe further improvements using a two-pass scheme where grammar-based preprocessing is applied again in a second pass through the text. We then apply the algorithms to the files in the Calgary Corpus and also achieve significantly improved results in compression when compared with other compression algorithms, including a
grammar-based approach known as the Sequitur algorithm. Despite the advances of the PPM method in predicting upcoming symbols or words in
the English language, more research is required to devise better compression methods for other languages, such as Arabic due to, for example, the rich morphological nature of Arabic text, in which a single word can take many different forms. In this dissertation, we propose a new method that achieves the best compression rates not only for Arabic language text but also for other languages that use Arabic script in their writing systems, such as Persian. Our word-based method (GRW-PPM) constructs a contextfree grammar for the text; this grammar is then encoded using PPM to achieve excellent compression rates.
Finally, we investigate the classification of genre in English and Arabic text by using our new character-based text compression scheme (GRB-PPM). Experimental results on a parallel Arabic and English corpus show that our new method is very effective compared with traditional compression-based classification methods. We have also confirmed that good compression leads to good classification.

Details

Original languageEnglish
Awarding Institution
Supervisors/Advisors
Award date2018

Research outputs (1)

View all