An empirical study of stream-based techniques for text categorization

  • Daniel Thomas

Abstract

Due to the popularity of blogs and of social networking sites such as Twitter and Facebook, the amount of electronic text continues to grow. These vast amounts of documents need to be categorized, so it is no surprise that text categorization is a popular field. The traditional approach to text categorization is feature-based, normally with features derived from words. Stream-based methods have been shown to perform well in some experiments, but there has been no thorough study of their performance across a number of major corpora, and their results have not been thoroughly compared against the current state-of-the-art feature-based techniques. This is an important problem, as the techniques cannot be fully recognized until such a study has been performed.
The concept of protocols, and how each affects categorization results, has also not been studied thoroughly across a number of methods and several corpora. It is important to discover which stream-based approaches perform best in which situations and how the choice of protocol affects their performance, if at all. The hope is to show that certain approaches and protocols should be used for certain corpora or document lengths. These findings could then drive the choice of methods and protocols in future experiments.
An existing problem within the field of text categorization is that it is often difficult to recreate the exact experimental conditions of previous studies. One reason for this is that the training and testing splits often differ. It was important that this study did not add to this problem, that its experiments could be accurately recreated, and that comparisons with other work would be fair.
A toolkit has been developed that allows all of the methods and protocols to be compared in a consistent manner. The toolkit models the streams using suffix trees, and all of the stream-based methods have been implemented. In addition to implementing existing techniques, the thesis details a number of new stream-based methods. One of these, R-Ranges, has been shown to outperform all other methods for two of the corpora, including PPM (Prediction by Partial Matching) variants, which are state-of-the-art techniques that are mathematically well supported. The experiments have also shown that the protocol (whether static or dynamic training models are used, and whether or not training documents of the same category are concatenated) does indeed affect the accuracy of each method. The concatenated dynamic protocol was found to outperform all others and performs consistently well across all methods, for all corpora. This study has shown conclusively that the choice of categorization method must not be the only consideration: the selection of protocol is just as important.
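The concatenation half of the protocol can be illustrated with a minimal sketch. It is not the thesis toolkit and does not use suffix trees, PPM or R-Ranges: it assumes Python's general-purpose zlib compressor as a stand-in stream model, and the function names (`extra_bytes`, `classify_concatenated`, `classify_separate`) are hypothetical. A document is assigned to the category whose training text lets it be encoded in the fewest additional bytes. The static versus dynamic distinction is not modelled here.

```python
import zlib


def extra_bytes(training_text: str, document: str) -> int:
    """Approximate cost of encoding `document` once `training_text` has been seen.

    Assumption: zlib stands in for the stream models used in the thesis.
    """
    base = len(zlib.compress(training_text.encode("utf-8"), 9))
    both = len(zlib.compress((training_text + " " + document).encode("utf-8"), 9))
    return both - base


def classify_concatenated(training: dict[str, list[str]], document: str) -> str:
    # Concatenated protocol: one model per category, built from all of that
    # category's training documents joined into a single stream.
    return min(training, key=lambda c: extra_bytes(" ".join(training[c]), document))


def classify_separate(training: dict[str, list[str]], document: str) -> str:
    # Non-concatenated protocol: one model per training document; a category
    # is scored by its best-matching document.
    return min(
        training,
        key=lambda c: min(extra_bytes(d, document) for d in training[c]),
    )


training = {
    "sport": [
        "the team scored a late goal and won the football match",
        "the striker scored twice in the second half of the match",
    ],
    "cooking": [
        "simmer the onions in butter then add the garlic and stock",
        "bake the bread in a hot oven until the crust is golden",
    ],
}
document = "the team scored a late goal in the second half of the match"

print(classify_concatenated(training, document))
print(classify_separate(training, document))
```

In this toy example the test document shares phrases with both sport training documents, so the concatenated model can reuse both at once, whereas each separate model can reuse only one of them. This shows how the two protocols can score the same document differently. It is not evidence for the thesis's results.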

Details

Original language: English
Awarding Institution
  • Bangor University
Supervisors/Advisors
Award date: 2011