Leveraging Lexical Resources and Constraint Grammar for Rule-Based Part-of-Speech Tagging in Welsh

Steven Neale; Kevin Donnelly; Gareth Watkins; Dawn Knight

Research output: Contribution to conference › Paper › peer-review

Electronic versions

Links

https://aclanthology.org/L18-1623.pdf
Final published version
Licence: CC BY

Steven Neale
University of Wales, Cardiff
Kevin Donnelly
Gareth Watkins
University of Wales, Cardiff
Dawn Knight
University of Wales, Cardiff

As the quantity of annotated language data and the quality of machine learning algorithms have increased over time, statistical
part-of-speech (POS) taggers trained over large datasets have become as robust as, or more robust than, their rule-based counterparts. However,
for lesser-resourced languages such as Welsh there is simply not enough accurately annotated data to train a statistical POS tagger.
Furthermore, many of the more popular rule-based taggers still require that their rules be inferred from annotated data, which, while not
as extensive as that required for training a statistical tagger, must still be sizeable. In this paper we describe CyTag, a rule-based POS
tagger for Welsh based on the VISL Constraint Grammar parser. Leveraging lexical information from Eurfa (an extensive open-source
dictionary for Welsh), we extract lists of possible POS tags for each word token in a running text and then apply constraints,
based on features of the surrounding word tokens, to prune the number of possible tags until the most appropriate tag for a given
token can be selected. We explain how this approach is particularly useful in dealing with some of the specific intricacies of Welsh,
such as morphological changes and word mutations, and present an evaluation of the performance of the tagger using a manually
checked test corpus of 611 Welsh sentences.
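
The lookup-then-prune idea described in the abstract can be illustrated with a minimal Python sketch. It is not CyTag's implementation and does not use VISL Constraint Grammar syntax. The lexicon entries, the simplified tag names, the single pruning rule, and the helper names (`candidate_tags`, `remove_prep_before_verb`, `tag`) are all assumptions made for illustration. The only linguistic facts relied on are three soft-mutation patterns (c → g, p → b, t → d), which are reversed when a token is not found in the lexicon.

```python
# Toy lexicon: surface form -> candidate POS tags.
# Entries and tag names are illustrative, not taken from Eurfa or CyTag.
LEXICON = {
    "mae": {"VERB"},
    "'r": {"DET"},
    "cath": {"NOUN"},
    "yn": {"PREP", "PART"},
    "cysgu": {"VERB"},
}

# A few soft-mutation reversals (mutated initial -> radical initial).
SOFT_MUTATION_REVERSALS = {"g": "c", "b": "p", "d": "t"}


def candidate_tags(token):
    """Return every tag the toy lexicon allows for a token."""
    word = token.lower()
    if word in LEXICON:
        return set(LEXICON[word])
    # Not found: try undoing a soft mutation on the first letter.
    radical_initial = SOFT_MUTATION_REVERSALS.get(word[:1])
    if radical_initial is not None:
        radical = radical_initial + word[1:]
        if radical in LEXICON:
            return set(LEXICON[radical])
    return {"UNK"}


def remove_prep_before_verb(readings, i):
    """Hypothetical rule: drop PREP when the next token can only be a VERB."""
    if i + 1 < len(readings) and readings[i + 1] == {"VERB"}:
        return readings[i] - {"PREP"}
    return readings[i]


RULES = [remove_prep_before_verb]


def tag(tokens):
    """Look up candidate tags, then apply rules until nothing changes."""
    readings = [candidate_tags(t) for t in tokens]
    changed = True
    while changed:
        changed = False
        for i in range(len(readings)):
            for rule in RULES:
                pruned = rule(readings, i)
                # Never remove a token's last remaining reading.
                if pruned and pruned != readings[i]:
                    readings[i] = pruned
                    changed = True
    return list(zip(tokens, readings))


if __name__ == "__main__":
    for token, tags in tag(["Mae", "'r", "gath", "yn", "cysgu"]):
        print(token, sorted(tags))
```

In this sketch the mutated form "gath" is resolved to the lexicon entry "cath", and the ambiguous "yn" loses its preposition reading because the following token has only a verb reading. A real grammar would need many such rules, written against a much richer tagset.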

Original language	English
Pages	3946-3954
Number of pages	9
Status	Published - 7 May 2018
Published externally	Yes
Event	LREC 2018 - Miyazaki, Japan Duration: 12 May 2018 → 12 May 2018

Conference

Conference	LREC 2018
Country/Territory	Japan
City	Miyazaki
Period	12/05/18 → 12/05/18
