Crowdsourcing the Paldaruo Speech Corpus of Welsh for Speech Technology

Sarah Cooper; Dewi Bryn Jones; Delyth Prys

doi:10.3390/info10080247


Research output: Contribution to journal › Article › peer-review

Electronic versions

Documents

information-10-00247-v2
Final published version, 1.46 MB, PDF document
Licence: CC BY

Digital Object Identifier (DOI)

https://doi.org/10.3390/info10080247
Final published version
Licence: CC BY

Collecting speech data for a low-resource language is challenging when funding and resources are limited. This paper describes the process of designing, creating and using the Paldaruo Speech Corpus for developing speech technology for Welsh. Specifically, this paper focuses on the crowdsourcing of data using an app on smartphones and mobile devices, allowing speakers from across Wales to contribute. We discuss the development of the reading prompts (isolated words and full sentences) and the metadata collected from contributors. We also provide background on the design of the Paldaruo App as well as the main uses for the corpus and its availability and licensing. The corpus was designed for the development of speech recognition for Welsh and has been used to create a number of other resources. These methods can be extended to other languages, and suggestions for other low-resource languages are discussed.


Original language	English
Pages (from-to)	247
Number of pages	12
Journal	Information
Volume	10
Issue number	8
Digital Object Identifiers (DOIs)	https://doi.org/10.3390/info10080247
Status	Published - 25 Jul 2019
