Crowdsourcing the Paldaruo Speech Corpus of Welsh for Speech Technology

Sarah Cooper; Dewi Bryn Jones; Delyth Prys

doi:10.3390/info10080247

Research output: Contribution to journal › Article › peer-review

Standard

Crowdsourcing the Paldaruo Speech Corpus of Welsh for Speech Technology. / Cooper, Sarah ; Jones, Dewi Bryn ; Prys, Delyth.
In: Information, Vol. 10, No. 8, 25.07.2019, p. 247.

RIS

TY - JOUR

T1 - Crowdsourcing the Paldaruo Speech Corpus of Welsh for Speech Technology

AU - Cooper, Sarah

AU - Jones, Dewi Bryn

AU - Prys, Delyth

PY - 2019/7/25

Y1 - 2019/7/25

N2 - Collecting speech data for a low-resource language is challenging when funding and resources are limited. This paper describes the process of designing, creating and using the Paldaruo Speech Corpus for developing speech technology for Welsh. Specifically, this paper focuses on the crowdsourcing of data using an app on smartphones and mobile devices, allowing speakers from across Wales to contribute. We discuss the development of reading prompts: isolated words and full sentences, as well as the metadata collected from contributors. We also provide background on the design of the Paldaruo App as well as the main uses for the corpus and its availability and licensing. The corpus was designed for the development of speech recognition for Welsh and has been used to create a number of other resources. These methods can be extended to other languages, and suggestions for other low-resource languages are discussed.

AB - Collecting speech data for a low-resource language is challenging when funding and resources are limited. This paper describes the process of designing, creating and using the Paldaruo Speech Corpus for developing speech technology for Welsh. Specifically, this paper focuses on the crowdsourcing of data using an app on smartphones and mobile devices, allowing speakers from across Wales to contribute. We discuss the development of reading prompts: isolated words and full sentences, as well as the metadata collected from contributors. We also provide background on the design of the Paldaruo App as well as the main uses for the corpus and its availability and licensing. The corpus was designed for the development of speech recognition for Welsh and has been used to create a number of other resources. These methods can be extended to other languages, and suggestions for other low-resource languages are discussed.

KW - low-resource languages

KW - linguistic diversity

KW - speech recognition

KW - speech technology

KW - corpus

U2 - 10.3390/info10080247

DO - 10.3390/info10080247

M3 - Article

VL - 10

SP - 247

JO - Information

JF - Information

SN - 2078-2489

IS - 8

ER -
