Adaptive compression-based models of Chinese text

W.J. Teahan; P. Wu; W. Liu

doi:10.1109/ICALIP.2014.7009920

Research output: Contribution to conference › Paper

Electronic versions

Digital Object Identifier (DOI)

https://doi.org/10.1109/ICALIP.2014.7009920
Final published version

Large-alphabet languages such as Chinese present different problems for language modelling than small-alphabet languages such as English. In this paper, we describe adaptive models of Chinese text based on the Prediction by Partial Matching (PPM) text compression scheme, which learn the language as the text is processed sequentially. We describe several character-based, word-based and part-of-speech (POS) based variants of PPM that achieve significant improvements in compression rate over existing models. Interestingly, the results for Chinese text contrast with those achieved for English text: character-based models outperform the word-based and POS-based models, rather than the other way round. We then explore how well these models perform at the task of Chinese word segmentation.
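
To illustrate what an adaptive, character-based compression model involves, the sketch below estimates the code length of a text with a simplified PPM-style predictor that updates its counts after every character. It is an illustration only, not the models described in the paper: the function name `ppm_bits`, the escape estimate (counts in the style of PPM method C), the omission of exclusions, the default `max_order=2` and the assumed 65,536-symbol fallback alphabet are all choices made here for the example.

```python
import math
from collections import defaultdict


def ppm_bits(text, max_order=2, alphabet_size=65536):
    """Total code length in bits for `text` under a simplified adaptive
    PPM-style character model (method-C-like escapes, no exclusions).

    Assumptions made for this sketch: `max_order` and `alphabet_size`
    are illustrative defaults, not values taken from the paper.
    """
    # counts[context][symbol] = times `symbol` has followed `context` so far
    counts = defaultdict(lambda: defaultdict(int))
    total_bits = 0.0

    for i, sym in enumerate(text):
        prob = 1.0
        coded = False
        # Try the longest available context first, then back off.
        for order in range(min(max_order, i), -1, -1):
            seen = counts.get(text[i - order:i])
            if not seen:
                continue  # context never seen: back off at no cost
            n = sum(seen.values())  # total observations in this context
            d = len(seen)           # distinct symbols in this context
            if sym in seen:
                prob *= seen[sym] / (n + d)
                coded = True
                break
            prob *= d / (n + d)     # escape to the next shorter context
        if not coded:
            # Novel symbol: fall back to a uniform distribution.
            prob *= 1.0 / alphabet_size
        total_bits += -math.log2(prob)

        # Adaptive step: update every context order with the new symbol.
        for order in range(min(max_order, i) + 1):
            counts[text[i - order:i]][sym] += 1

    return total_bits


if __name__ == "__main__":
    sample = "中文文本压缩模型中文文本压缩模型"
    print(ppm_bits(sample) / len(sample), "bits per character")
```

Dividing the result by the text length gives a bits-per-character figure, the usual way compression rate is reported. Because the model starts empty and learns as it goes, repeated material becomes cheaper to encode the further into the text it appears.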

Original language	English
Pages	874-881
Digital Object Identifiers (DOIs)	https://doi.org/10.1109/ICALIP.2014.7009920
Status	Published - 7 Jul 2014
Event	International Conference on Audio, Language and Image Processing (ICALIP), 7 - 9 July 2014, Shanghai, China - Duration: 3 Jan 0001 → …

Conference

Conference	International Conference on Audio, Language and Image Processing (ICALIP), 7 - 9 July 2014, Shanghai, China
Period	3/01/01 → …
