Evaluating Parallel Corpora and Translation Quality for Chinese and English


  • Wei Liu


Parallel bilingual corpora are a basic resource for statistical machine translation, and accurate alignment of their textual elements (e.g. documents, paragraphs, sentences) is a crucial step in building such systems. Rather than using sentence length, word co-occurrence, cognates, dictionaries or parts of speech, this thesis uses compression code lengths based on the Prediction by Partial Matching (PPM) compression algorithm to measure whether two sentences are aligned in parallel Chinese-English corpora. PPM is effective at estimating the entropy of a text, which makes its code lengths a useful measure of whether the information conveyed by two texts is similar. Evaluating the quality of sentence alignment is one way to measure the quality of a corpus. Evaluating a parallel bilingual corpus is also an important process, and usually the last step in its creation. However, most statistics reported for parallel bilingual corpora are based on counts of characters, words, tokens, sentences or files. As there is a lack of advanced parallel bilingual corpus evaluation methods, this thesis adopts a new PPM-based method for parallel bilingual corpus evaluation. The method has been used to evaluate the quality of three existing parallel bilingual corpora: the DC Corpus, the Hong Kong Yearbook Corpus and the UN Corpus. The compression-based method has also been applied to the automatic creation of new parallel corpora. The quality of sentence alignment in automatically created parallel bilingual corpora is always lower than in manually checked corpora. This thesis processed the UN Corpus using the PPM-based metric and sought the best code length threshold value for automatically determining whether a sentence alignment in the corpus is satisfactory or unsatisfactory in terms of translation quality.
The thesis also collected bilingual textual elements from the web and improved their quality by applying a threshold code length ratio of 1.5. The approach has also been adapted for translation system evaluation by comparing the compression code lengths of back translations at the sentence level. Compared with Bilingual Evaluation Understudy (BLEU) scores, the back translation-based evaluation method presented differences between original sentences and their back translations more accurately at the sentence level when used to evaluate some common Chinese-English translation systems.
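
To make the code length ratio idea concrete, the following is a minimal sketch and not the implementation used in the thesis. It assumes a simplified character-level PPM-style model (escape method C, no exclusions, fixed after training), and it assumes the ratio is taken as the larger code length divided by the smaller one, with pairs at or below the threshold of 1.5 treated as satisfactory. The names `SimplePPM`, `code_length_ratio` and `is_satisfactory` are hypothetical and introduced only for illustration.

```python
import math
from collections import defaultdict


class SimplePPM:
    """Simplified PPM-style character model (an illustrative stand-in,
    not the exact PPM variant used in the thesis)."""

    def __init__(self, order=2, alphabet_size=65536):
        self.order = order
        # Assumed size of the uniform fallback alphabet for unseen characters.
        self.alphabet_size = alphabet_size
        # counts[context][char] -> frequency, for contexts of length 0..order
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, text):
        """Collect character counts for every context length up to `order`."""
        for i, ch in enumerate(text):
            for k in range(self.order + 1):
                if i - k >= 0:
                    self.counts[text[i - k:i]][ch] += 1

    def _prob(self, context, ch):
        """Probability of `ch`, escaping from longer to shorter contexts."""
        prob = 1.0
        for k in range(min(self.order, len(context)), -1, -1):
            seen = self.counts.get(context[len(context) - k:])
            if not seen:
                continue  # context never observed: try a shorter one
            total = sum(seen.values())
            distinct = len(seen)
            if ch in seen:
                return prob * seen[ch] / (total + distinct)
            prob *= distinct / (total + distinct)  # pay for an escape
        return prob / self.alphabet_size  # uniform fallback

    def code_length(self, text):
        """Code length of `text` in bits under the trained model."""
        bits = 0.0
        for i, ch in enumerate(text):
            context = text[max(0, i - self.order):i]
            bits -= math.log2(self._prob(context, ch))
        return bits


def code_length_ratio(bits_a, bits_b):
    """Larger code length over smaller one (assumed definition)."""
    lo, hi = sorted((bits_a, bits_b))
    return hi / lo if lo > 0 else math.inf


def is_satisfactory(zh_sentence, en_sentence, zh_model, en_model, threshold=1.5):
    """Treat a sentence pair as well aligned if its ratio is within the threshold."""
    ratio = code_length_ratio(zh_model.code_length(zh_sentence),
                              en_model.code_length(en_sentence))
    return ratio <= threshold
```

In this sketch, one model would be trained on Chinese text and another on English text, and each candidate sentence pair would be kept or discarded according to `is_satisfactory`. The same comparison could in principle be applied to an original sentence and its back translation, using a single model for the shared language.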


Original language: English
Awarding institution
Supervisor / Supervisors / Advisor
Award date: Jan 2016