Adaptive compression-based models of Chinese text

    Research output: Contribution to conference (Paper)

    Abstract

    Large-alphabet languages such as Chinese pose different problems for language modelling than small-alphabet languages such as English. In this paper, we describe adaptive models of Chinese text based on the Prediction by Partial Matching (PPM) text compression scheme, which learns the language as the text is processed sequentially. We describe several character-based, word-based and part-of-speech (POS) based variants of PPM that achieve significant improvements in compression rate over existing models. Interestingly, the results for Chinese text contrast with those achieved for English text: character-based models outperform the word-based and POS-based models rather than the other way round. We then explore how well these models perform at the task of Chinese word segmentation.
    Original language: English
    Pages: 874-881
    Publication status: Published - 7 Jul 2014
    Event: International Conference on Audio, Language and Image Processing (ICALIP), 7 - 9 July 2014, Shanghai, China
