Leveraging Lexical Resources and Constraint Grammar for Rule-Based Part-of-Speech Tagging in Welsh
Research output: Contribution to conference › Paper › peer-review
Electronic versions
Links
- https://aclanthology.org/L18-1623.pdf
Final published version
Licence: CC BY Show licence
As the quantity of annotated language data and the quality of machine learning algorithms have increased over time, statistical
part-of-speech (POS) taggers trained over large datasets have become as robust or better than their rule-based counterparts. However,
for lesser-resourced languages such as Welsh there is simply not enough accurately annotated data to train a statistical POS tagger.
Furthermore, many of the more popular rule-based taggers still require that their rules be inferred from annotated data, which while not
as extensive as that required for training a statistical tagger must still be sizeable. In this paper we describe CyTag, a rule-based POS
tagger for Welsh based on the VISL Constraint Grammar parser. Leveraging lexical information from Eurfa (an extensive open-source
dictionary for Welsh), we extract lists of possible POS tags for each word token in a running text and then apply various constraints –
based on various features of surrounding word tokens – to prune the number of possible tags until the most appropriate tag for a given
token can be selected. We explain how this approach is particularly useful in dealing with some of the specific intricacies of Welsh
- such as morphological changes and word mutations - and present an evaluation of the performance of the tagger using a manually
checked test corpus of 611 Welsh sentences.
part-of-speech (POS) taggers trained over large datasets have become as robust or better than their rule-based counterparts. However,
for lesser-resourced languages such as Welsh there is simply not enough accurately annotated data to train a statistical POS tagger.
Furthermore, many of the more popular rule-based taggers still require that their rules be inferred from annotated data, which while not
as extensive as that required for training a statistical tagger must still be sizeable. In this paper we describe CyTag, a rule-based POS
tagger for Welsh based on the VISL Constraint Grammar parser. Leveraging lexical information from Eurfa (an extensive open-source
dictionary for Welsh), we extract lists of possible POS tags for each word token in a running text and then apply various constraints –
based on various features of surrounding word tokens – to prune the number of possible tags until the most appropriate tag for a given
token can be selected. We explain how this approach is particularly useful in dealing with some of the specific intricacies of Welsh
- such as morphological changes and word mutations - and present an evaluation of the performance of the tagger using a manually
checked test corpus of 611 Welsh sentences.
Original language | English |
---|---|
Pages | 3946-3954 |
Number of pages | 9 |
Publication status | Published - 7 May 2018 |
Externally published | Yes |
Event | LREC 2018 - Miyazaki, Japan Duration: 12 May 2018 → 12 May 2018 |
Conference
Conference | LREC 2018 |
---|---|
Country/Territory | Japan |
City | Miyazaki |
Period | 12/05/18 → 12/05/18 |