Leveraging Lexical Resources and Constraint Grammar for Rule-Based Part-of-Speech Tagging in Welsh
Research output: Contribution to conference › Paper › peer-review
Electronic versions
Links
- https://aclanthology.org/L18-1623.pdf
Final published version
Licence: CC BY
As the quantity of annotated language data and the quality of machine learning algorithms have increased over time, statistical
part-of-speech (POS) taggers trained over large datasets have become as robust as, or better than, their rule-based counterparts. However,
for lesser-resourced languages such as Welsh there is simply not enough accurately annotated data to train a statistical POS tagger.
Furthermore, many of the more popular rule-based taggers still require that their rules be inferred from annotated data, which, while not
as extensive as that required for training a statistical tagger, must still be sizeable. In this paper we describe CyTag, a rule-based POS
tagger for Welsh based on the VISL Constraint Grammar parser. Leveraging lexical information from Eurfa (an extensive open-source
dictionary for Welsh), we extract lists of possible POS tags for each word token in a running text and then apply constraints –
based on features of the surrounding word tokens – to prune the set of possible tags until the most appropriate tag for a given
token can be selected. We explain how this approach is particularly useful in dealing with some of the specific intricacies of Welsh,
such as morphological changes and word mutations, and present an evaluation of the performance of the tagger using a manually
checked test corpus of 611 Welsh sentences.
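
The lookup-then-prune idea described in the abstract can be illustrated with a minimal Python sketch. This is not CyTag's implementation and does not use the VISL Constraint Grammar rule syntax: the lexicon entries, the tag names (`PREP`, `PART`, and so on), the partial soft-mutation table, and the two rules are all assumptions chosen for illustration, and do not come from Eurfa or from the CyTag tagset.

```python
# Toy lexicon: surface form -> candidate POS tags.
# Entries and tag names are illustrative assumptions, not Eurfa data.
LEXICON = {
    "mae": {"VERB"},
    "hi": {"PRON"},
    "yn": {"PREP", "PART"},   # ambiguous: preposition or particle
    "y": {"DET"},
    "canu": {"VERBNOUN"},
    "cath": {"NOUN"},
}

# Deliberately partial table of soft-mutation reversals
# (mutated initial -> radical initial); a real table would be larger.
SOFT_MUTATION_REVERSALS = [("g", "c"), ("b", "p"), ("d", "t")]


def candidate_tags(token):
    """Return candidate tags for a token, retrying with unmutated initials."""
    word = token.lower()
    if word in LEXICON:
        return set(LEXICON[word])
    for mutated, radical in SOFT_MUTATION_REVERSALS:
        if word.startswith(mutated):
            base = radical + word[len(mutated):]
            if base in LEXICON:
                return set(LEXICON[base])
    return {"UNKNOWN"}


def remove_if_next_has(tag_to_remove, next_tag):
    """Build a rule that drops `tag_to_remove` when the next token may be `next_tag`."""
    def rule(readings, i):
        has_next = i + 1 < len(readings) and next_tag in readings[i + 1]
        # Never remove the last remaining reading of a token.
        if has_next and len(readings[i]) > 1:
            readings[i].discard(tag_to_remove)
    return rule


# Hypothetical constraints, applied in order to every token position.
RULES = [
    remove_if_next_has("PREP", "VERBNOUN"),
    remove_if_next_has("PART", "DET"),
]


def tag(tokens):
    """Look up candidate tags for each token, then prune them with RULES."""
    readings = [candidate_tags(t) for t in tokens]
    for rule in RULES:
        for i in range(len(readings)):
            rule(readings, i)
    return [(tok, sorted(tags)) for tok, tags in zip(tokens, readings)]


if __name__ == "__main__":
    print(tag(["Mae", "hi", "yn", "canu"]))
    print(tag(["yn", "y", "gath"]))
```

In this sketch the ambiguous token `yn` keeps only `PART` before a verb-noun and only `PREP` before a determiner, while the mutated form `gath` is resolved through its radical form `cath`. Guarding against removal of the last reading ensures that pruning never leaves a token with no tag at all.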
Original language | English |
---|---|
Pages | 3946-3954 |
Number of pages | 9 |
Status | Published - 7 May 2018 |
Externally published | Yes |
Event | LREC 2018 - Miyazaki, Japan Duration: 12 May 2018 → 12 May 2018 |
Conference
Conference | LREC 2018 |
---|---|
Country/Territory | Japan |
City | Miyazaki |
Period | 12/05/18 → 12/05/18 |