Leveraging Lexical Resources and Constraint Grammar for Rule-Based Part-of-Speech Tagging in Welsh

Steven Neale; Kevin Donnelly; Gareth Watkins; Dawn Knight

Research output: Contribution to conference › Paper › peer-review

Electronic versions

Links

https://aclanthology.org/L18-1623.pdf
Final published version
Licence: CC BY

Steven Neale
University of Wales, Cardiff
Kevin Donnelly
Gareth Watkins
University of Wales, Cardiff
Dawn Knight
University of Wales, Cardiff

As the quantity of annotated language data and the quality of machine learning algorithms have increased over time, statistical
part-of-speech (POS) taggers trained over large datasets have become as robust as, or better than, their rule-based counterparts. However,
for lesser-resourced languages such as Welsh there is simply not enough accurately annotated data to train a statistical POS tagger.
Furthermore, many of the more popular rule-based taggers still require that their rules be inferred from annotated data, which, while not
as extensive as that required for training a statistical tagger, must still be sizeable. In this paper we describe CyTag, a rule-based POS
tagger for Welsh based on the VISL Constraint Grammar parser. Leveraging lexical information from Eurfa (an extensive open-source
dictionary for Welsh), we extract lists of possible POS tags for each word token in a running text and then apply constraints,
based on features of the surrounding word tokens, to prune the number of possible tags until the most appropriate tag for a given
token can be selected. We explain how this approach is particularly useful in dealing with some of the specific intricacies of Welsh,
such as morphological changes and word mutations, and present an evaluation of the performance of the tagger using a manually
checked test corpus of 611 Welsh sentences.
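
The lookup-then-prune idea described in the abstract can be illustrated with a minimal Python sketch. It is not CyTag's implementation and does not use the VISL Constraint Grammar rule syntax or Eurfa's data format: the toy lexicon, the tag names, the partial soft-mutation table, and the single pruning rule are all assumptions made for illustration.

```python
# Toy lexicon: word form -> set of candidate tags.
# Entries and tag names are hypothetical, not taken from Eurfa or CyTag.
LEXICON = {
    "mae": {"VERB"},
    "y": {"ART"},
    "cath": {"NOUN"},
    "yn": {"PREP", "PART"},   # ambiguous: preposition or particle
    "canu": {"VERBNOUN"},
}

# Partial reversal of Welsh soft mutation, covering only c->g, p->b, t->d.
UNMUTATE = {"g": "c", "b": "p", "d": "t"}


def candidates(token):
    """Return the candidate tags for a token, trying an unmutated form if needed."""
    word = token.lower()
    if word in LEXICON:
        return set(LEXICON[word])
    base = UNMUTATE.get(word[:1])
    if base and base + word[1:] in LEXICON:
        return set(LEXICON[base + word[1:]])
    return {"UNKNOWN"}


def remove_prep_before_verbnoun(i, cohorts):
    """Hypothetical constraint: drop the PREP reading of an ambiguous token
    when the next token is unambiguously a verb-noun."""
    has_next = i + 1 < len(cohorts)
    if has_next and cohorts[i + 1] == {"VERBNOUN"} and len(cohorts[i]) > 1:
        cohorts[i].discard("PREP")


RULES = [remove_prep_before_verbnoun]


def tag(tokens):
    """Look up candidate tags for every token, then prune them with each rule."""
    cohorts = [candidates(t) for t in tokens]
    for rule in RULES:
        for i in range(len(cohorts)):
            rule(i, cohorts)
    return list(zip(tokens, cohorts))


if __name__ == "__main__":
    for token, tags in tag(["mae", "y", "gath", "yn", "canu"]):
        print(token, sorted(tags))
```

In this sketch the mutated form "gath" is resolved to the lexicon entry "cath", and the ambiguous "yn" keeps only its particle reading because a verb-noun follows. The rule never removes a token's last remaining tag, mirroring the usual Constraint Grammar convention.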

Original language	English
Pages	3946-3954
Number of pages	9
Publication status	Published - 7 May 2018
Externally published	Yes
Event	LREC 2018 - Miyazaki, Japan Duration: 12 May 2018 → 12 May 2018

Conference

Conference	LREC 2018
Country/Territory	Japan
City	Miyazaki
Period	12/05/18 → 12/05/18
