A Compression-Based Toolkit for Modelling and Processing Natural Language Text

Research output: Contribution to journal › Article › peer-review

Standard

A Compression-Based Toolkit for Modelling and Processing Natural Language Text. / Teahan, William.
In: Information, Vol. 9, No. 294, 22.11.2018, p. 1-29.

Vancouver

Teahan W. A Compression-Based Toolkit for Modelling and Processing Natural Language Text. Information. 2018 Nov 22;9(294):1-29. doi: 10.3390/info9120294

RIS

TY - JOUR

T1 - A Compression-Based Toolkit for Modelling and Processing Natural Language Text

AU - Teahan, William

PY - 2018/11/22

Y1 - 2018/11/22

N2 - A novel compression-based toolkit for modelling and processing natural language text is described. The design of the toolkit adopts an encoding perspective: applications are considered to be problems in searching for the best encoding of different transformations of the source text into the target text. This paper describes a two-phase ‘noiseless channel model’ architecture that underpins the toolkit and models text processing as lossless communication down a noise-free channel. The transformation and encoding performed in the first phase must be both lossless and reversible. The role of the second phase, verification and decoding, is to verify the correctness of the communication of the target text that is produced by the application. This paper argues that this encoding approach has several advantages over the decoding approach of the standard noisy channel model. The concepts abstracted by the toolkit’s design are explained together with details of the library calls. Pseudo-code is also given for a number of algorithms for the applications that the toolkit implements, including encoding, decoding, classification, training (model building), parallel sentence alignment, word segmentation and language segmentation. Some experimental results, implementation details, memory usage and execution speeds are also discussed for these applications.

U2 - 10.3390/info9120294

DO - 10.3390/info9120294

M3 - Article

VL - 9

SP - 1

EP - 29

JO - Information

JF - Information

SN - 2078-2489

IS - 294

M1 - 9, 294

ER -