Leveraging Lexical Resources and Constraint Grammar for Rule-Based Part-of-Speech Tagging in Welsh

Research output: Contribution to conference > Paper > peer-review

Standard

Leveraging Lexical Resources and Constraint Grammar for Rule-Based Part-of-Speech Tagging in Welsh. / Neale, Steven; Donnelly, Kevin; Watkins, Gareth et al.
2018. 3946-3954 Paper presented at LREC 2018, Miyazaki, Japan.

Harvard

Neale, S, Donnelly, K, Watkins, G & Knight, D 2018, 'Leveraging Lexical Resources and Constraint Grammar for Rule-Based Part-of-Speech Tagging in Welsh', Paper presented at LREC 2018, Miyazaki, Japan, 12/05/18 - 12/05/18 pp. 3946-3954. <https://aclanthology.org/L18-1623.pdf>

APA

Neale, S., Donnelly, K., Watkins, G., & Knight, D. (2018). Leveraging Lexical Resources and Constraint Grammar for Rule-Based Part-of-Speech Tagging in Welsh. 3946-3954. Paper presented at LREC 2018, Miyazaki, Japan. https://aclanthology.org/L18-1623.pdf

CBE

Neale S, Donnelly K, Watkins G, Knight D. 2018. Leveraging Lexical Resources and Constraint Grammar for Rule-Based Part-of-Speech Tagging in Welsh. Paper presented at LREC 2018, Miyazaki, Japan.

MLA

Neale, Steven et al. Leveraging Lexical Resources and Constraint Grammar for Rule-Based Part-of-Speech Tagging in Welsh. LREC 2018, 12 May 2018, Miyazaki, Japan, Paper, 2018. 9 p.

Vancouver

Neale S, Donnelly K, Watkins G, Knight D. Leveraging Lexical Resources and Constraint Grammar for Rule-Based Part-of-Speech Tagging in Welsh. 2018. Paper presented at LREC 2018, Miyazaki, Japan.

Author

Neale, Steven; Donnelly, Kevin; Watkins, Gareth et al. / Leveraging Lexical Resources and Constraint Grammar for Rule-Based Part-of-Speech Tagging in Welsh. Paper presented at LREC 2018, Miyazaki, Japan. 9 p.

RIS

TY - CONF

T1 - Leveraging Lexical Resources and Constraint Grammar for Rule-Based Part-of-Speech Tagging in Welsh

AU - Neale, Steven

AU - Donnelly, Kevin

AU - Watkins, Gareth

AU - Knight, Dawn

PY - 2018/5/7

Y1 - 2018/5/7

N2 - As the quantity of annotated language data and the quality of machine learning algorithms have increased over time, statistical part-of-speech (POS) taggers trained over large datasets have become as robust or better than their rule-based counterparts. However, for lesser-resourced languages such as Welsh there is simply not enough accurately annotated data to train a statistical POS tagger. Furthermore, many of the more popular rule-based taggers still require that their rules be inferred from annotated data, which while not as extensive as that required for training a statistical tagger must still be sizeable. In this paper we describe CyTag, a rule-based POS tagger for Welsh based on the VISL Constraint Grammar parser. Leveraging lexical information from Eurfa (an extensive open-source dictionary for Welsh), we extract lists of possible POS tags for each word token in a running text and then apply various constraints – based on various features of surrounding word tokens – to prune the number of possible tags until the most appropriate tag for a given token can be selected. We explain how this approach is particularly useful in dealing with some of the specific intricacies of Welsh - such as morphological changes and word mutations - and present an evaluation of the performance of the tagger using a manually checked test corpus of 611 Welsh sentences.

AB - As the quantity of annotated language data and the quality of machine learning algorithms have increased over time, statistical part-of-speech (POS) taggers trained over large datasets have become as robust or better than their rule-based counterparts. However, for lesser-resourced languages such as Welsh there is simply not enough accurately annotated data to train a statistical POS tagger. Furthermore, many of the more popular rule-based taggers still require that their rules be inferred from annotated data, which while not as extensive as that required for training a statistical tagger must still be sizeable. In this paper we describe CyTag, a rule-based POS tagger for Welsh based on the VISL Constraint Grammar parser. Leveraging lexical information from Eurfa (an extensive open-source dictionary for Welsh), we extract lists of possible POS tags for each word token in a running text and then apply various constraints – based on various features of surrounding word tokens – to prune the number of possible tags until the most appropriate tag for a given token can be selected. We explain how this approach is particularly useful in dealing with some of the specific intricacies of Welsh - such as morphological changes and word mutations - and present an evaluation of the performance of the tagger using a manually checked test corpus of 611 Welsh sentences.

M3 - Paper

SP - 3946

EP - 3954

T2 - LREC 2018

Y2 - 12 May 2018 through 12 May 2018

ER -