Leveraging Lexical Resources and Constraint Grammar for Rule-Based Part-of-Speech Tagging in Welsh

Steven Neale; Kevin Donnelly; Gareth Watkins; Dawn Knight

Research output: Contribution to conference › Paper › peer-review

Leveraging Lexical Resources and Constraint Grammar for Rule-Based Part-of-Speech Tagging in Welsh. / Neale, Steven; Donnelly, Kevin; Watkins, Gareth et al.
2018. 3946-3954 Paper presented at LREC 2018, Miyazaki, Japan.

RIS

TY - CONF

T1 - Leveraging Lexical Resources and Constraint Grammar for Rule-Based Part-of-Speech Tagging in Welsh

AU - Neale, Steven

AU - Donnelly, Kevin

AU - Watkins, Gareth

AU - Knight, Dawn

PY - 2018/5/7

Y1 - 2018/5/7

N2 - As the quantity of annotated language data and the quality of machine learning algorithms have increased over time, statistical part-of-speech (POS) taggers trained over large datasets have become as robust or better than their rule-based counterparts. However, for lesser-resourced languages such as Welsh there is simply not enough accurately annotated data to train a statistical POS tagger. Furthermore, many of the more popular rule-based taggers still require that their rules be inferred from annotated data, which while not as extensive as that required for training a statistical tagger must still be sizeable. In this paper we describe CyTag, a rule-based POS tagger for Welsh based on the VISL Constraint Grammar parser. Leveraging lexical information from Eurfa (an extensive open-source dictionary for Welsh), we extract lists of possible POS tags for each word token in a running text and then apply various constraints – based on various features of surrounding word tokens – to prune the number of possible tags until the most appropriate tag for a given token can be selected. We explain how this approach is particularly useful in dealing with some of the specific intricacies of Welsh – such as morphological changes and word mutations – and present an evaluation of the performance of the tagger using a manually checked test corpus of 611 Welsh sentences.

M3 - Paper

SP - 3946

EP - 3954

T2 - LREC 2018

Y2 - 12 May 2018 through 12 May 2018

ER -
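
The approach summarised in the abstract (look up every possible tag for each token, then let contextual constraints prune the candidates) can be illustrated with a minimal Python sketch. This is not CyTag's code and does not use the VISL Constraint Grammar parser: the lexicon, the tag names, the de-mutation table and the single constraint are all hypothetical stand-ins chosen for illustration, and a real system would draw its readings from Eurfa and its rules from a full Constraint Grammar.

```python
# Toy lexicon standing in for a dictionary such as Eurfa. The entries and
# tag names are illustrative assumptions, not CyTag's data or tagset.
LEXICON = {
    "y": {"DET"},
    "cath": {"NOUN"},
    "yn": {"PREP", "PART"},
    "cysgu": {"VERB"},
}

# Simplified reversal of a few soft mutations (mutated initial -> radical).
UNMUTATE = {"g": "c", "b": "p", "d": "t"}


def candidate_tags(token):
    """Return every possible tag for a token, trying a de-mutated form if needed."""
    word = token.lower()
    if word in LEXICON:
        return set(LEXICON[word])
    radical = UNMUTATE.get(word[:1])
    if radical and radical + word[1:] in LEXICON:
        return set(LEXICON[radical + word[1:]])
    return {"UNK"}


def remove_prep_before_verb(i, readings):
    """Hypothetical constraint: PREP is ruled out when the next token is unambiguously a verb."""
    if i + 1 < len(readings) and readings[i + 1] == {"VERB"}:
        return {"PREP"}
    return set()


CONSTRAINTS = [remove_prep_before_verb]


def tag(tokens):
    """Assign candidate tags to each token, then prune them with the constraints."""
    readings = [candidate_tags(t) for t in tokens]
    for i in range(len(readings)):
        for constraint in CONSTRAINTS:
            to_remove = constraint(i, readings)
            # Keep at least one reading per token, as Constraint Grammar does.
            if readings[i] - to_remove:
                readings[i] -= to_remove
    return [(t, sorted(r)) for t, r in zip(tokens, readings)]


print(tag(["y", "gath", "yn", "cysgu"]))
```

In this toy run the mutated form "gath" is resolved through its radical "cath", and the ambiguous "yn" loses its PREP reading because the following token can only be a verb. Keeping the constraints as a list of small functions mirrors, loosely, how a rule-based grammar is built up from many independent rules.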
