Crowdsourcing the Paldaruo Speech Corpus of Welsh for Speech Technology

Sarah Cooper; Dewi Bryn Jones; Delyth Prys

doi:10.3390/info10080247

Research output: Contribution to journal › Article › peer-review

Crowdsourcing the Paldaruo Speech Corpus of Welsh for Speech Technology. / Cooper, Sarah; Jones, Dewi Bryn; Prys, Delyth.
In: Information, Vol. 10, No. 8, 25.07.2019, p. 247.

RIS

TY - JOUR

T1 - Crowdsourcing the Paldaruo Speech Corpus of Welsh for Speech Technology

AU - Cooper, Sarah

AU - Jones, Dewi Bryn

AU - Prys, Delyth

PY - 2019/7/25

Y1 - 2019/7/25

AB - Collecting speech data for a low-resource language is challenging when funding and resources are limited. This paper describes the process of designing, creating and using the Paldaruo Speech Corpus for developing speech technology for Welsh. Specifically, this paper focuses on the crowdsourcing of data using an app on smartphones and mobile devices, allowing speakers from across Wales to contribute. We discuss the development of reading prompts: isolated words and full sentences, as well as the metadata collected from contributors. We also provide background on the design of the Paldaruo App as well as the main uses for the corpus and its availability and licensing. The corpus was designed for the development of speech recognition for Welsh and has been used to create a number of other resources. These methods can be extended to other languages, and suggestions for other low-resource languages are discussed.

KW - low-resource languages

KW - linguistic diversity

KW - speech recognition

KW - speech technology

KW - corpus

U2 - 10.3390/info10080247

DO - 10.3390/info10080247

M3 - Article

VL - 10

SP - 247

JO - Information

JF - Information

SN - 2078-2489

IS - 8

ER -
