Experiments with a PPM Compression-Based Method for English-Chinese Bilingual Sentence Alignment

Research output: Chapter in Book/Report/Conference proceeding › Chapter

Standard

Experiments with a PPM Compression-Based Method for English-Chinese Bilingual Sentence Alignment. / Liu, W.; Chang, Z.; Teahan, W.J.
Statistical Language and Speech Processing Volume 8791 of the series Lecture Notes in Computer Science. Springer, 2014. p. 70-81.

Harvard

Liu, W, Chang, Z & Teahan, WJ 2014, Experiments with a PPM Compression-Based Method for English-Chinese Bilingual Sentence Alignment. in Statistical Language and Speech Processing Volume 8791 of the series Lecture Notes in Computer Science. Springer, pp. 70-81. https://doi.org/10.1007/978-3-319-11397-5_5

APA

Liu, W., Chang, Z., & Teahan, W. J. (2014). Experiments with a PPM Compression-Based Method for English-Chinese Bilingual Sentence Alignment. In Statistical Language and Speech Processing Volume 8791 of the series Lecture Notes in Computer Science (pp. 70-81). Springer. https://doi.org/10.1007/978-3-319-11397-5_5

CBE

Liu W, Chang Z, Teahan WJ. 2014. Experiments with a PPM Compression-Based Method for English-Chinese Bilingual Sentence Alignment. In Statistical Language and Speech Processing Volume 8791 of the series Lecture Notes in Computer Science. Springer. pp. 70-81. https://doi.org/10.1007/978-3-319-11397-5_5

MLA

Liu, W., Z. Chang, and W.J. Teahan. "Experiments with a PPM Compression-Based Method for English-Chinese Bilingual Sentence Alignment." Statistical Language and Speech Processing Volume 8791 of the series Lecture Notes in Computer Science. Springer. 2014, 70-81. https://doi.org/10.1007/978-3-319-11397-5_5

Vancouver

Liu W, Chang Z, Teahan WJ. Experiments with a PPM Compression-Based Method for English-Chinese Bilingual Sentence Alignment. In Statistical Language and Speech Processing Volume 8791 of the series Lecture Notes in Computer Science. Springer. 2014. p. 70-81. doi: 10.1007/978-3-319-11397-5_5

Author

Liu, W. ; Chang, Z. ; Teahan, W.J. / Experiments with a PPM Compression-Based Method for English-Chinese Bilingual Sentence Alignment. Statistical Language and Speech Processing Volume 8791 of the series Lecture Notes in Computer Science. Springer, 2014. pp. 70-81

RIS

TY - CHAP

T1 - Experiments with a PPM Compression-Based Method for English-Chinese Bilingual Sentence Alignment

AU - Liu, W.

AU - Chang, Z.

AU - Teahan, W.J.

PY - 2014/9/3

Y1 - 2014/9/3

AB - Alignment of parallel corpora is a crucial step prior to training statistical language models for machine translation. This paper investigates compression-based methods for aligning sentences in an English-Chinese parallel corpus. Four metrics for matching sentences required for measuring the alignment at the sentence level are compared: the standard sentence length ratio (SLR), and three new metrics, absolute sentence length difference (SLD), compression code length ratio (CR), and absolute compression code length difference (CD). Initial experiments with CR show that using the Prediction by Partial Matching (PPM) compression scheme, a method that also performs well at many language modeling tasks, significantly outperforms the other standard compression algorithms Gzip and Bzip2. The paper then shows that for sentence alignment of a parallel corpus with ground truth judgments, the compression code length ratio using PPM always performs better than sentence length ratio and the difference measurements also work better than the ratio measurements.

U2 - 10.1007/978-3-319-11397-5_5

DO - 10.1007/978-3-319-11397-5_5

M3 - Chapter

SN - 9783319113968

SP - 70

EP - 81

BT - Statistical Language and Speech Processing Volume 8791 of the series Lecture Notes in Computer Science

PB - Springer

ER -
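
The four sentence-matching metrics named in the abstract (SLR, SLD, CR, CD) can be illustrated with a minimal sketch. Several things here are assumptions rather than details from the chapter: sentence length is measured in UTF-8 bytes, ratios are taken as source over target, and zlib (with bz2 as an alternative) stands in for the compressor because PPM is not available in the Python standard library. The function names `code_length` and `alignment_metrics` are hypothetical.

```python
import bz2
import zlib


def code_length(text: str, compressor=zlib.compress) -> int:
    """Compressed size in bytes of the UTF-8 encoding of `text`.

    Assumption: zlib/bz2 are stand-ins; the chapter's best results use PPM.
    """
    return len(compressor(text.encode("utf-8")))


def alignment_metrics(src: str, tgt: str, compressor=zlib.compress) -> dict:
    """Compute the four metrics for one candidate sentence pair.

    Assumption: lengths are byte counts and ratios are source / target;
    the abstract does not specify either choice.
    """
    if not src or not tgt:
        raise ValueError("both sentences must be non-empty")

    src_len = len(src.encode("utf-8"))
    tgt_len = len(tgt.encode("utf-8"))
    src_code = code_length(src, compressor)
    tgt_code = code_length(tgt, compressor)

    return {
        "SLR": src_len / tgt_len,        # sentence length ratio
        "SLD": abs(src_len - tgt_len),   # absolute sentence length difference
        "CR": src_code / tgt_code,       # compression code length ratio
        "CD": abs(src_code - tgt_code),  # absolute code length difference
    }


if __name__ == "__main__":
    english = "The cat sat on the mat."
    chinese = "猫坐在垫子上。"
    print(alignment_metrics(english, chinese))
    print(alignment_metrics(english, chinese, compressor=bz2.compress))
```

On single short sentences, general-purpose compressors add a fixed header overhead that dominates the code length, so values from this sketch are only illustrative of how the metrics are formed, not of the results reported in the chapter.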