Adaptive compression-based models of Chinese text
Research output: Contribution to conference › Paper
2014. pp. 874-881. Paper presented at the International Conference on Audio, Language and Image Processing (ICALIP), 7-9 July 2014, Shanghai, China.
RIS
TY - CONF
T1 - Adaptive compression-based models of Chinese text
AU - Teahan, W.J.
AU - Wu, P.
AU - Liu, W.
PY - 2014/7/7
Y1 - 2014/7/7
N2 - Large-alphabet languages such as Chinese present different problems for language modelling than small-alphabet languages such as English. In this paper, we describe adaptive models of Chinese text based on the Prediction by Partial Matching (PPM) text compression scheme, which learns the language as the text is processed sequentially. We describe several character-based, word-based and part-of-speech (POS) based variants of PPM that achieve significant improvements in compression rate over existing models. Interestingly, the results for Chinese text contrast with those achieved for English text: character-based models outperform the word-based and POS-based models rather than the other way round. We then explore how well these models perform at the task of Chinese word segmentation.
AB - Large-alphabet languages such as Chinese present different problems for language modelling than small-alphabet languages such as English. In this paper, we describe adaptive models of Chinese text based on the Prediction by Partial Matching (PPM) text compression scheme, which learns the language as the text is processed sequentially. We describe several character-based, word-based and part-of-speech (POS) based variants of PPM that achieve significant improvements in compression rate over existing models. Interestingly, the results for Chinese text contrast with those achieved for English text: character-based models outperform the word-based and POS-based models rather than the other way round. We then explore how well these models perform at the task of Chinese word segmentation.
U2 - 10.1109/ICALIP.2014.7009920
DO - 10.1109/ICALIP.2014.7009920
M3 - Paper
SP - 874
EP - 881
T2 - International Conference on Audio, Language and Image Processing (ICALIP), 7 - 9 July 2014, Shanghai, China
ER -