Crowdsourcing the Paldaruo Speech Corpus of Welsh for Speech Technology
Research output: Contribution to journal › Article › peer-review
In: Information, Vol. 10, No. 8, 25.07.2019, p. 247.
RIS
TY - JOUR
T1 - Crowdsourcing the Paldaruo Speech Corpus of Welsh for Speech Technology
AU - Cooper, Sarah
AU - Jones, Dewi Bryn
AU - Prys, Delyth
PY - 2019/7/25
Y1 - 2019/7/25
AB - Collecting speech data for a low-resource language is challenging when funding and resources are limited. This paper describes the process of designing, creating and using the Paldaruo Speech Corpus for developing speech technology for Welsh. Specifically, this paper focuses on the crowdsourcing of data using an app on smartphones and mobile devices, allowing speakers from across Wales to contribute. We discuss the development of reading prompts: isolated words and full sentences, as well as the metadata collected from contributors. We also provide background on the design of the Paldaruo App as well as the main uses for the corpus and its availability and licensing. The corpus was designed for the development of speech recognition for Welsh and has been used to create a number of other resources. These methods can be extended to other languages, and suggestions for other low-resource languages are discussed.
KW - low-resource languages
KW - linguistic diversity
KW - speech recognition
KW - speech technology
KW - corpus
U2 - 10.3390/info10080247
DO - 10.3390/info10080247
M3 - Article
VL - 10
SP - 247
JO - Information
JF - Information
SN - 2078-2489
IS - 8
ER -