Leveraging Lexical Resources and Constraint Grammar for Rule-Based Part-of-Speech Tagging in Welsh

Research output: Contribution to conference › Paper › peer-review

  • Steven Neale
    University of Wales, Cardiff
  • Kevin Donnelly
  • Gareth Watkins
    University of Wales, Cardiff
  • Dawn Knight
    University of Wales, Cardiff
As the quantity of annotated language data and the quality of machine learning algorithms have increased over time, statistical
part-of-speech (POS) taggers trained over large datasets have become as robust as, or better than, their rule-based counterparts. However,
for lesser-resourced languages such as Welsh there is simply not enough accurately annotated data to train a statistical POS tagger.
Furthermore, many of the more popular rule-based taggers still require that their rules be inferred from annotated data, which, while not
as extensive as that required for training a statistical tagger, must still be sizeable. In this paper we describe CyTag, a rule-based POS
tagger for Welsh based on the VISL Constraint Grammar parser. Leveraging lexical information from Eurfa (an extensive open-source
dictionary for Welsh), we extract lists of possible POS tags for each word token in a running text and then apply constraints
based on features of the surrounding word tokens to prune the number of possible tags until the most appropriate tag for a given
token can be selected. We explain how this approach is particularly useful in dealing with some of the specific intricacies of Welsh,
such as morphological changes and word mutations, and present an evaluation of the performance of the tagger using a manually
checked test corpus of 611 Welsh sentences.
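
The lookup-then-prune idea described in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not CyTag's implementation: the lexicon entries, the tag names (VERB, DET, NOUN, PREP, PART, UNK), the partial soft-mutation map, and the single pruning rule are all hypothetical and are not taken from Eurfa or from CyTag's Constraint Grammar rules.

```python
from typing import Callable

# Toy lexicon mapping radical (unmutated) forms to candidate tags.
# Entries and tag names are illustrative assumptions, not Eurfa data.
LEXICON: dict[str, set[str]] = {
    "mae": {"VERB"},
    "'r": {"DET"},
    "cath": {"NOUN"},
    "yn": {"PREP", "PART"},
    "cysgu": {"VERB"},
}

# Partial reverse map for soft mutation of the initial letter
# (c -> g, p -> b, t -> d); deliberately incomplete.
UNDO_SOFT_MUTATION = {"g": "c", "b": "p", "d": "t"}

# A rule inspects all readings and returns the tags to discard at position i.
Rule = Callable[[list[set[str]], int], set[str]]


def candidate_tags(token: str) -> set[str]:
    """Look up a token; if absent, retry with a possible radical form."""
    form = token.lower()
    if form in LEXICON:
        return set(LEXICON[form])
    radical_initial = UNDO_SOFT_MUTATION.get(form[:1])
    if radical_initial is not None:
        radical = radical_initial + form[1:]
        if radical in LEXICON:
            return set(LEXICON[radical])
    return {"UNK"}


def remove_prep_before_verb(readings: list[set[str]], i: int) -> set[str]:
    """Hypothetical rule: drop PREP when the next token can only be VERB."""
    has_next = i + 1 < len(readings)
    if "PREP" in readings[i] and has_next and readings[i + 1] == {"VERB"}:
        return {"PREP"}
    return set()


def tag(tokens: list[str], rules: list[Rule]) -> list[set[str]]:
    """Assign candidate tags, then apply rules until nothing changes."""
    readings = [candidate_tags(t) for t in tokens]
    changed = True
    while changed:
        changed = False
        for i in range(len(readings)):
            for rule in rules:
                pruned = readings[i] - rule(readings, i)
                # Never discard the last remaining reading of a token.
                if pruned and pruned != readings[i]:
                    readings[i] = pruned
                    changed = True
    return readings


if __name__ == "__main__":
    sentence = ["mae", "'r", "gath", "yn", "cysgu"]
    for token, tags in zip(sentence, tag(sentence, [remove_prep_before_verb])):
        print(token, sorted(tags))
```

In this sketch the mutated form "gath" is resolved to the toy entry "cath" before lookup, and the ambiguous "yn" keeps only the PART reading because the following token has a single VERB reading. A token whose ambiguity no rule resolves simply keeps all of its candidate tags.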
Original language: English
Pages: 3946-3954
Number of pages: 9
Publication status: Published - 7 May 2018
Externally published: Yes
Event: LREC 2018 - Miyazaki, Japan
Duration: 12 May 2018 to 12 May 2018

Conference

Conference: LREC 2018
Country/Territory: Japan
City: Miyazaki
Period: 12/05/18 to 12/05/18