This work was motivated by the shortage of Arabic language resources in corpus linguistics compared with widely studied languages such as English, Chinese and Spanish. Research on dialectal Arabic is still limited because resources are relatively scarce and creating and processing such corpora is time-consuming.
This thesis introduces the Bangor Twitter Arabic Corpus (BTAC), which was created specifically from Arabic Twitter text. The corpus contains over 122K tweets. The tweets were manually annotated as belonging to one of five main dialects (Egyptian, Gulf, Iraqi, Maghrebi, and Levantine), in addition to Modern Standard Arabic and Classical Arabic. The resource also identifies written code-switching within a single tweet, which occurs between Modern Standard Arabic and Arabic dialects.
This thesis evaluates various methods for the categorisation of Arabic Twitter text. Three main categorisation tasks are considered: authorship attribution, gender categorisation, and dialect identification. The experiments are performed using the Prediction by Partial Matching (PPM) character-based text compression approach. For comparison, well-known algorithms such as Multinomial Naïve Bayes (MNB), K-Nearest Neighbours (KNN), and Support Vector Machine (SVM) were also evaluated, using both character-based and feature-based approaches.
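The idea behind compression-based categorisation can be sketched as follows: train one character model per class, then assign a text to the class whose model encodes it in the fewest bits. The sketch below is not PPM itself. It substitutes a fixed-order character model with add-one smoothing, and the names `CharModel`, `classify`, `vocab_size` and the toy training strings are all hypothetical, chosen only for illustration.

```python
import math
from collections import Counter, defaultdict


class CharModel:
    """Fixed-order character model with add-one smoothing.

    A simplified stand-in for PPM (no escape mechanism, no context blending).
    vocab_size is an assumed bound on the number of distinct characters.
    """

    def __init__(self, order=2, vocab_size=256):
        self.order = order
        self.vocab_size = vocab_size
        self.context_counts = defaultdict(Counter)

    def train(self, text):
        # Pad with spaces so the first characters also have a context.
        padded = " " * self.order + text
        for i in range(self.order, len(padded)):
            context = padded[i - self.order:i]
            self.context_counts[context][padded[i]] += 1

    def bits(self, text):
        """Return the code length, in bits, that the model assigns to text."""
        padded = " " * self.order + text
        total = 0.0
        for i in range(self.order, len(padded)):
            context = padded[i - self.order:i]
            counts = self.context_counts.get(context, {})
            seen = sum(counts.values())
            prob = (counts.get(padded[i], 0) + 1) / (seen + self.vocab_size)
            total -= math.log2(prob)
        return total


def classify(text, models):
    """Pick the label whose model compresses the text best (fewest bits)."""
    return min(models, key=lambda label: models[label].bits(text))


# Toy, hypothetical training data standing in for per-class tweet collections.
models = {"A": CharModel(), "B": CharModel()}
models["A"].train("aaaa bbbb aaaa bbbb")
models["B"].train("xyxy zzzz xyxy zzzz")
```

With real data, each model would be trained on the concatenated tweets of one author, gender, or dialect, and `classify` would be applied to each held-out tweet or group of tweets.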
The results show that PPM outperforms traditional feature-based classifiers in most cases in terms of accuracy, precision, recall and F-measure. When multiple tweets per author were classified together, the results reached an accuracy of 88% for gender categorisation, 96% for authorship attribution, and 87% for dialect identification. For single-tweet categorisation, the results reached an accuracy of 76% for gender categorisation, 77% for authorship attribution, and 74% for dialect identification. Further optimisation using concatenated author models as the secondary class type improved the classification accuracy for both the gender and dialect experiments, achieving an accuracy of 97% for gender categorisation and 98% for dialect identification.
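For reference, the four evaluation measures named above can be computed from gold and predicted labels as in the minimal sketch below. Macro-averaging over classes is an assumption made here for illustration, since the averaging scheme is not stated in this text. The function name `macro_scores` and the labels `EGY` and `GLF` are hypothetical.

```python
def macro_scores(gold, predicted):
    """Compute accuracy and macro-averaged precision, recall and F-measure."""
    pairs = list(zip(gold, predicted))
    labels = sorted(set(gold) | set(predicted))
    accuracy = sum(g == p for g, p in pairs) / len(pairs)

    precisions, recalls, f_measures = [], [], []
    for label in labels:
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f_measure = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f_measures.append(f_measure)

    n = len(labels)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f_measure": sum(f_measures) / n,
    }


# Toy example with hypothetical dialect labels.
scores = macro_scores(
    gold=["EGY", "EGY", "GLF", "GLF"],
    predicted=["EGY", "GLF", "GLF", "GLF"],
)
```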
We also investigated code-switching, which often occurs in text acquired from social media. This study examined code-switching between two varieties of the same language (Modern Standard Arabic and Arabic dialects). The purpose of the experiment was to detect the switch at the character level. An accuracy of 81.2% for detecting code-switching was obtained using 5-fold cross-validation on the full BTAC dataset.
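As an illustration of the 5-fold cross-validation protocol mentioned above, the sketch below partitions item indices into five folds, using each fold once for testing and the remaining four for training. Whether the folds were shuffled or stratified is not stated in this text, so the simple interleaved split and the name `k_fold_indices` are assumptions made for illustration.

```python
def k_fold_indices(n_items, k=5):
    """Yield (train, test) index lists for k-fold cross-validation.

    Items are assigned to folds in an interleaved fashion (no shuffling).
    """
    folds = [list(range(start, n_items, k)) for start in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [
            index
            for fold_number, fold in enumerate(folds)
            if fold_number != held_out
            for index in fold
        ]
        yield train, test


# Example: split ten items into five train/test partitions.
splits = list(k_fold_indices(10, k=5))
```

The reported accuracy would then be the mean of the per-fold accuracies.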