Gathering Data for Speech Technology in the Welsh Language: A Case Study

Delyth Prys; Dewi Jones

Research output: Contribution to conference › Paper › peer-review

Standard

Gathering Data for Speech Technology in the Welsh Language: A Case Study. / Prys, Delyth ; Jones, Dewi.
2018. Paper presented at LREC 2018, Miyazaki, Japan.

RIS

TY - CONF

T1 - Gathering Data for Speech Technology in the Welsh Language

T2 - LREC 2018

AU - Prys, Delyth

AU - Jones, Dewi

PY - 2018/5/12

Y1 - 2018/5/12

N2 - Less-resourced languages face additional challenges in the creation of tools and resources for speech recognition applications. These include lack of funding, sparsity of data, and a shortage of experts with relevant skills. On the other hand, there are opportunities to be had from tapping into committed communities of language activists, and from developing innovative solutions to common problems that may be applied elsewhere. This paper describes a recent series of short-term projects for the Welsh language that used crowdsourcing methodologies, together with data from Wicipedia (the Welsh Wikipedia) and existing Welsh corpora, to advance the field. The projects also borrowed and adapted open source tools that were already freely available, such as MaryTTS and Mozilla CommonVoice. In addition, this paper provides some pointers towards further needs and solutions for speech technology in less-resourced languages, aiming at a coherent, long-term approach that may be applicable in many environments.

M3 - Paper

Y2 - 12 May 2018 through 12 May 2018

ER -
