Adaptive Models of Chinese Text
Abstract
This thesis is concerned with how to build adaptive language models of Chinese text, which can be used in different Chinese natural language processing (NLP) applications.
The Prediction by Partial Matching (PPM) language model has been widely used in many NLP areas. Applying this model to Chinese raises many problems that originate from the language's large alphabet. This thesis first introduces the PPM-ch model, which improves on the traditional PPM model by using preprocessing techniques, followed by a frequency sorting technique and a variation of PPM that performs no exclusions.
PPMO, a novel variant of the PPM model, is then proposed. Traditional PPM models output an escape symbol when a novel symbol occurs in a context model. PPMO instead separates the coding process into two streams, named the orders stream and the symbols stream. This algorithm is the first PPM variant that does not use the escape mechanism, and it achieves the best compression results.
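The escape mechanism mentioned above can be illustrated with a minimal Python sketch. It is not the thesis's PPMO coder, whose details are not given in this abstract. It only shows how a traditional PPM model backs off from longer to shorter contexts when a symbol is novel, and it records the order at which each symbol is finally predicted. The names `escape_trace` and `ppmc_escape_probability`, the maximum order of 2, and the choice of the Method C escape estimate are all assumptions made for illustration.

```python
from collections import defaultdict

MAX_ORDER = 2  # assumed maximum context length for this sketch


def ppmc_escape_probability(symbol_counts):
    """Escape probability under PPM Method C (an assumed estimator).

    With d distinct symbols and n total occurrences seen in a context,
    Method C assigns the escape symbol probability d / (n + d).
    """
    d = len(symbol_counts)
    n = sum(symbol_counts.values())
    return d / (n + d) if n else 1.0


def escape_trace(text, max_order=MAX_ORDER):
    """Return (order, symbol) pairs for each symbol in text.

    The order is the length of the longest context in which the symbol
    has been seen before. An order of -1 means the symbol was novel in
    every context, including the empty one.
    """
    # counts[context][symbol] = times symbol has followed context so far
    counts = defaultdict(lambda: defaultdict(int))
    trace = []
    for i, sym in enumerate(text):
        coded_order = -1
        # Try the longest available context first, then back off.
        for order in range(min(max_order, i), -1, -1):
            ctx = text[i - order:i]
            if counts[ctx][sym] > 0:
                coded_order = order
                break
            # A traditional PPM coder would emit an escape symbol here.
        trace.append((coded_order, sym))
        # Update every context order with the symbol just seen.
        for order in range(min(max_order, i), -1, -1):
            counts[text[i - order:i]][sym] += 1
    return trace


if __name__ == "__main__":
    for order, sym in escape_trace("中文中文"):
        print(order, sym)
```

In this sketch the first element of each pair shows how far the model had to back off, and the second is the symbol itself. Keeping the two apart is only a loose analogy for the orders stream and symbols stream named in the abstract.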
We have also investigated Chinese word segmentation using our PPM-ch and PPMO models. Although our PPMO models have not been carefully crafted for segmentation, they still achieve satisfactory results.
Details
Original language | English |
---|---|
Awarding Institution |
|
Supervisors/Advisors |
|
Thesis sponsors |
|
Award date | 2007 |