Leveraging Lexical Resources and Constraint Grammar for Rule-Based Part-of-Speech Tagging in Welsh
Research output: Contribution to conference › Paper › peer-review
2018. 3946-3954 Paper presented at LREC 2018, Miyazaki, Japan.
RIS
TY - CONF
T1 - Leveraging Lexical Resources and Constraint Grammar for Rule-Based Part-of-Speech Tagging in Welsh
AU - Neale, Steven
AU - Donnelly, Kevin
AU - Watkins, Gareth
AU - Knight, Dawn
PY - 2018/5/7
Y1 - 2018/5/7
AB - As the quantity of annotated language data and the quality of machine learning algorithms have increased over time, statistical part-of-speech (POS) taggers trained over large datasets have become as robust as, or better than, their rule-based counterparts. However, for lesser-resourced languages such as Welsh there is simply not enough accurately annotated data to train a statistical POS tagger. Furthermore, many of the more popular rule-based taggers still require that their rules be inferred from annotated data, which, while not as extensive as that required for training a statistical tagger, must still be sizeable. In this paper we describe CyTag, a rule-based POS tagger for Welsh based on the VISL Constraint Grammar parser. Leveraging lexical information from Eurfa (an extensive open-source dictionary for Welsh), we extract lists of possible POS tags for each word token in a running text and then apply various constraints – based on various features of surrounding word tokens – to prune the number of possible tags until the most appropriate tag for a given token can be selected. We explain how this approach is particularly useful in dealing with some of the specific intricacies of Welsh – such as morphological changes and word mutations – and present an evaluation of the performance of the tagger using a manually checked test corpus of 611 Welsh sentences.
M3 - Paper
SP - 3946
EP - 3954
T2 - LREC 2018
Y2 - 7 May 2018 through 12 May 2018
ER -
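
The abstract describes a two-step approach: look up every possible POS tag for each token in a lexicon, then apply contextual constraints to prune the candidates. The sketch below illustrates that general idea in Python. It is not CyTag's implementation and does not use the VISL Constraint Grammar parser: the lexicon entries, the tag names (DET, NOUN, PREP, PART, VERBNOUN, UNK), the partial mutation table, and the single rule are all assumptions invented for this example, not material taken from Eurfa or from the paper.

```python
from typing import Callable, Dict, List, Set

# Toy lexicon: hypothetical entries for illustration, not taken from Eurfa.
LEXICON: Dict[str, Set[str]] = {
    "y": {"DET"},
    "cath": {"NOUN"},
    "yn": {"PREP", "PART"},
    "cysgu": {"VERBNOUN"},
}

# Partial, illustrative table for undoing soft mutation of an initial consonant
# (for example "gath" back to its radical form "cath").
UNDO_SOFT_MUTATION: Dict[str, str] = {"g": "c", "b": "p", "d": "t"}

Cohort = Set[str]
Rule = Callable[[List[Cohort], int], Set[str]]


def candidate_tags(token: str) -> Cohort:
    """Collect every tag the lexicon allows for a token, also trying its unmutated form."""
    word = token.lower()
    tags = set(LEXICON.get(word, set()))
    radical_initial = UNDO_SOFT_MUTATION.get(word[:1])
    if radical_initial is not None:
        tags |= LEXICON.get(radical_initial + word[1:], set())
    return tags or {"UNK"}


def remove_prep_before_verbnoun(cohorts: List[Cohort], i: int) -> Set[str]:
    """Hypothetical constraint: drop PREP when the next token can only be a verbnoun."""
    if i + 1 < len(cohorts) and cohorts[i + 1] == {"VERBNOUN"}:
        return {"PREP"}
    return set()


def tag(tokens: List[str], rules: List[Rule]) -> List[Cohort]:
    """Look up candidate tags, then let each rule prune them using the context."""
    cohorts = [candidate_tags(t) for t in tokens]
    for i in range(len(cohorts)):
        for rule in rules:
            remaining = cohorts[i] - rule(cohorts, i)
            if remaining:  # a rule may never remove the last remaining tag
                cohorts[i] = remaining
    return cohorts


if __name__ == "__main__":
    for token, tags in zip(
        ["y", "gath", "yn", "cysgu"],
        tag(["y", "gath", "yn", "cysgu"], [remove_prep_before_verbnoun]),
    ):
        print(token, sorted(tags))
```

In this sketch a rule returns the tags it wants removed, and the tagger refuses to empty a token's candidate set, mirroring the common Constraint Grammar convention that the last reading is never deleted. A real system would express such constraints in a Constraint Grammar rule file rather than as Python functions.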